如何检测search引擎机器人与PHP?

如何检测使用PHP的search引擎机器人?

这是蜘蛛名称的search引擎目录

然后你使用$_SERVER['HTTP_USER_AGENT']; 检查代理人是否说蜘蛛。

 if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot")) { // what to do } 

我使用下面的代码似乎工作正常:

 function _bot_detected() { return ( isset($_SERVER['HTTP_USER_AGENT']) && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT']) ); } 

16-06-2017更新https://support.google.com/webmasters/answer/1061943?hl=zh_CN

增加媒体伙伴

检查$_SERVER['HTTP_USER_AGENT']中列出的一些string:

http://www.useragentstring.com/pages/All/

或者更具体地说,对于爬虫:

http://www.useragentstring.com/pages/Crawlerlist/

如果您想logging大多数常见search引擎抓取工具的访问次数,则可以使用

 $interestingCrawlers = array( 'google', 'yahoo' ); $pattern = '/(' . implode('|', $interestingCrawlers) .')/'; $matches = array(); $numMatches = preg_match($pattern, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i'); if($numMatches > 0) // Found a match { // $matches[1] contains an array of all text matches to either 'google' or 'yahoo' } 

你可以结帐,如果它是一个search引擎与这个function:

 <?php function crawlerDetect($USER_AGENT) { $crawlers = array( 'Google' => 'Google', 'MSN' => 'msnbot', 'Rambler' => 'Rambler', 'Yahoo' => 'Yahoo', 'AbachoBOT' => 'AbachoBOT', 'accoona' => 'Accoona', 'AcoiRobot' => 'AcoiRobot', 'ASPSeek' => 'ASPSeek', 'CrocCrawler' => 'CrocCrawler', 'Dumbot' => 'Dumbot', 'FAST-WebCrawler' => 'FAST-WebCrawler', 'GeonaBot' => 'GeonaBot', 'Gigabot' => 'Gigabot', 'Lycos spider' => 'Lycos', 'MSRBOT' => 'MSRBOT', 'Altavista robot' => 'Scooter', 'AltaVista robot' => 'Altavista', 'ID-Search Bot' => 'IDBot', 'eStyle Bot' => 'eStyle', 'Scrubby robot' => 'Scrubby', 'Facebook' => 'facebookexternalhit', ); // to get crawlers string used in function uncomment it // it is better to save it in string than use implode every time // global $crawlers $crawlers_agents = implode('|',$crawlers); if (strpos($crawlers_agents, $USER_AGENT) === false) return false; else { return TRUE; } } ?> 

那么你可以像这样使用它:

 <?php $USER_AGENT = $_SERVER['HTTP_USER_AGENT']; if(crawlerDetect($USER_AGENT)) return "no need to lang redirection";?> 

因为任何客户端都可以将用户代理设置为他们想要的,寻找“Googlebot”,“bingbot”等只是一半的工作。

第二部分是validation客户的IP。 在过去,这需要维护IP列表。 您在网上find的所有列表都过时了。 顶级search引擎正式支持通过DNS进行validation,详见Google https://support.google.com/webmasters/answer/80553和Bing http://www.bing.com/webmaster/help/how-to-verify -bingbot-3905dc26

首先执行客户端IP的反向DNS查询。 对于谷歌来说,这会在googlebot.com下面显示一个主机名,对于Bing来说,它会在search.msn.com下面显示。 然后,因为有人可以在他的IP上设置这样一个反向DNS,您需要validation该主机名上的正向DNS查询。 如果得到的IP与网站的访问者相同,则确定它是来自该search引擎的爬虫。

我已经用Java编写了一个库来执行这些检查。 随意将其移植到PHP。 它在GitHub上: https : //github.com/optimaize/webcrawler-verifier

您可以分析用户代理( $_SERVER['HTTP_USER_AGENT'] )或将客户端的IP地址( $_SERVER['REMOTE_ADDR'] )与search引擎机器人的IP地址列表进行比较 。

我正在使用它来检测机器人:

 if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) { // is bot } 

另外我使用白名单阻止不需要的机器人:

 if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) { // allowed bot } 

一个不需要的bot(=假阳性用户)然后可以解决一个validation码来解锁自己24小时。 由于没有人解决这个validation码,我知道它不会产生误报。 所以机器人检测似乎完美。

注意:我的白名单基于Facebook robots.txt 。

  <?php // IPCLOACK HOOK if (CLOAKING_LEVEL != 4) { $lastupdated = date("Ymd", filemtime(FILE_BOTS)); if ($lastupdated != date("Ymd")) { $lists = array( 'http://labs.getyacg.com/spiders/google.txt', 'http://labs.getyacg.com/spiders/inktomi.txt', 'http://labs.getyacg.com/spiders/lycos.txt', 'http://labs.getyacg.com/spiders/msn.txt', 'http://labs.getyacg.com/spiders/altavista.txt', 'http://labs.getyacg.com/spiders/askjeeves.txt', 'http://labs.getyacg.com/spiders/wisenut.txt', ); foreach($lists as $list) { $opt .= fetch($list); } $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt); $fp = fopen(FILE_BOTS,"w"); fwrite($fp,$opt); fclose($fp); } $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : ''; $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : ''; $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''; $host = strtolower(gethostbyaddr($ip)); $file = implode(" ", file(FILE_BOTS)); $exp = explode(".", $ip); $class = $exp[0].'.'.$exp[1].'.'.$exp[2].'.'; $threshold = CLOAKING_LEVEL; $cloak = 0; if (stristr($host, "googlebot") && stristr($host, "inktomi") && stristr($host, "msn")) { $cloak++; } if (stristr($file, $class)) { $cloak++; } if (stristr($file, $agent)) { $cloak++; } if (strlen($ref) > 0) { $cloak = 0; } if ($cloak >= $threshold) { $cloakdirective = 1; } else { $cloakdirective = 0; } } ?> 

这将是为蜘蛛斗篷的理想方式。 它来自一个名为[YACG] – http://getyacg.com的开源脚本;

需要一点工作,但绝对要走。

使用Device Detector开源库,它提供了一个isBot()函数: https : //github.com/piwik/device-detector

我使用这个代码,很不错。 您将很容易知道访问您的网站的用户代理。 这段代码正在打开一个文件,并将user_agent写入文件。 您可以通过访问yourdomain.com/useragent.txt查看每一天这个文件,并了解新的user_agents,并将它们放入if条件的条件中。

 $user_agent = strtolower($_SERVER['HTTP_USER_AGENT']); if(!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)){ // if not meet the conditions then // do what you need // here open a file and write the user_agent down the file. You can check each day this file useragent.txt and know about new user_agents and put them in your condition of if clause if($user_agent!=""){ $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!"); fwrite($myfile, $user_agent); $user_agent = "\n"; fwrite($myfile, $user_agent); fclose($myfile); } } 

这是useragent.txt的内容

 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots) mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots) mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots) mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots) mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots) mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1 mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36 mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/) mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0 mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0 mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0 mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0 mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0 mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0 mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0 mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0 mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36 mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36 mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html) zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html) mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174 mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174 sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07) mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174