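How to implement a web scraper in PHP?

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?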

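There is a book on this subject, "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL". See a review of it here.

PHP-Architect covered the topic in a well-written article by Matthew Turland in its December 2007 issue.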

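Scraping generally involves 3 steps:

First, you GET or POST your request to a specified URL.
Next, you receive the HTML that is returned as the response.
Finally, you parse out of that HTML the text you want to scrape.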

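To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages using either GET or POST. Once you have the HTML back, you just use regular expressions to accomplish step 3 by parsing out the text you want to scrape.

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regular expressions is RegexBuddy. I suggest you try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regular expressions you write, in your language of choice (including PHP).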

Usage:

    $curl = new Curl();
    $html = $curl->get("http://www.google.com");

    // now, do your regex work against $html

PHP class:

    <?php
    // Minimal cURL wrapper for fetching pages with GET or POST.
    class Curl
    {
        public $cookieJar = "";
        public $curl; // current cURL handle, created by get() or postForm()

        public function __construct($cookieJarFile = 'cookies.txt')
        {
            $this->cookieJar = $cookieJarFile;
        }

        // Apply browser-like headers and cookie handling to the current handle.
        function setup()
        {
            $header = array();
            $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
            $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
            $header[] = "Cache-Control: max-age=0";
            $header[] = "Connection: keep-alive";
            $header[] = "Keep-Alive: 300";
            $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
            $header[] = "Accept-Language: en-us,en;q=0.5";
            $header[] = "Pragma: "; // browsers keep this blank.

            curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
            curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
            curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
            curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
            curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
            curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
        }

        // Fetch $url with GET and return the response body.
        function get($url)
        {
            $this->curl = curl_init($url);
            $this->setup();

            return $this->request();
        }

        // Return the first capture group of every match of $reg in $str.
        function getAll($reg, $str)
        {
            preg_match_all($reg, $str, $matches);

            return $matches[1];
        }

        // Submit $fields to $url with POST and return the response body.
        function postForm($url, $fields, $referer = '')
        {
            $this->curl = curl_init($url);
            $this->setup();
            curl_setopt($this->curl, CURLOPT_URL, $url);
            curl_setopt($this->curl, CURLOPT_POST, 1);
            curl_setopt($this->curl, CURLOPT_REFERER, $referer);
            curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);

            return $this->request();
        }

        // Pass 'lasturl' for the final URL after redirects, or any CURLINFO_* constant.
        function getInfo($info)
        {
            $info = ($info == 'lasturl')
                ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL)
                : curl_getinfo($this->curl, $info);

            return $info;
        }

        // Execute the prepared request; returns false on failure.
        function request()
        {
            return curl_exec($this->curl);
        }
    }
    ?>
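As a rough sketch of step 3, assuming the HTML has already been fetched (for example with the class's `get()` method), `preg_match_all` with a single capture group pulls out every match, much like `getAll()` does. The helper name `extractLinks` and the sample markup are made up for illustration, and a pattern like this only copes with simple, double-quoted `href` attributes.

```php
<?php
// Hypothetical helper: returns the href value of every <a> tag in $html.
function extractLinks(string $html): array
{
    // Capture group 1 holds the URL between the double quotes.
    $pattern = '/<a\s[^>]*href="([^"]*)"/i';
    preg_match_all($pattern, $html, $matches);

    return $matches[1];
}

// Sample markup standing in for a fetched page.
$html = '<p><a href="/about">About</a> and <a class="x" href="http://example.com/">Home</a></p>';
$links = extractLinks($html);
print_r($links);
```

For anything beyond simple patterns, an HTML parser tends to be less brittle than a regular expression.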

I'd like to recommend this class I came across recently: Simple HTML DOM Parser.

I recommend Goutte, a simple PHP web scraper.

Example usage:

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

    use Goutte\Client;

    $client = new Client();

Make requests with the request() method:

    $crawler = $client->request('GET', 'http://www.symfony-project.org/');

The request method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on links:

    $link = $crawler->selectLink('Plugins')->link();
    $crawler = $client->click($link);

Submit forms:

    $form = $crawler->selectButton('sign in')->form();
    $crawler = $client->submit($form, array(
        'signin[username]' => 'fabien',
        'signin[password]' => 'xxxxxx',
    ));

Extract data:

    $nodes = $crawler->filter('.error_list');
    if ($nodes->count()) {
        die(sprintf("Authentication error: %s\n", $nodes->text()));
    }

    printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());

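ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby, or PHP. I was able to get a simple attempt up and running in a few minutes.

Here's an OK tutorial (link removed, see below) on web scraping using cURL and file_get_contents. Consider reading the next few parts as well.

(The hyperlink was removed because of malware warnings.)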

http://www.oooff.com/php-scripts/basic-php-scraped-data-parsing/basic-php-data-parsing.php

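I'm actually looking to scrape the Bible, since they don't provide an API for accessing the verses for a web app I want to create.

It sounds like you may be trying to "hotlink" rather than scrape, i.e. update your site in real time based on their content?

This tutorial is quite good: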

http://www.merchantos.com/makebeta/php/scraping-links-with-php/

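You may also want to take a look at Prowser.

If you need something that is easy to maintain, rather than fast to execute, it could help to use a scriptable browser, such as SimpleTest's.

Here's another one: a simple PHP scraper without regex.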
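One way to scrape without regular expressions, which is not necessarily the approach the linked article takes, is PHP's bundled DOM extension. The sketch below assumes the HTML is already in a string. The helper name `linksFromHtml` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: collects the text and href of each link using DOMDocument and XPath.
function linksFromHtml(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Only anchors that actually carry an href attribute are selected.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'text' => trim($anchor->textContent),
            'href' => $anchor->getAttribute('href'),
        ];
    }

    return $links;
}

// Sample markup standing in for a fetched page.
$html = '<html><body><ul>'
    . '<li><a href="/one">First</a></li>'
    . '<li><a href="/two"> Second </a></li>'
    . '<li><a>No link</a></li>'
    . '</ul></body></html>';
print_r(linksFromHtml($html));
```

XPath queries tend to survive small markup changes (extra attributes, different quoting) that would break a hand-written pattern.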

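Scraping can be pretty complex, depending on what you want to do. Have a read of this tutorial series on the basics of writing a scraper in PHP and see if you can get to grips with it.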

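You can use similar methods to automate form sign-ups, logins, even fake clicking on ads! The main limitation of using cURL, though, is that it doesn't execute JavaScript, so if you are trying to scrape a site that uses AJAX for pagination, for example, it can become a little tricky... but again, there are ways around that!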
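One possible way around the JavaScript limitation, assuming the site loads its pages from a JSON endpoint that you can spot in the browser's network tab, is to request that endpoint directly and follow its paging field. Everything specific here is an assumption for illustration: the response shape (`items`, `next_page`), the helper names, and the canned responses that stand in for real HTTP calls.

```php
<?php
// Hypothetical response shape: {"items": [...], "next_page": 2 or null}.
function parsePage(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException('Response was not valid JSON');
    }

    return [
        'items' => $data['items'] ?? [],
        'next'  => $data['next_page'] ?? null,
    ];
}

// Walks the pages; $fetch is any callable returning the raw body for a page number.
function collectAllItems(callable $fetch, int $maxPages = 50): array
{
    $items = [];
    $page = 1;
    // $maxPages guards against an endpoint that never reports a last page.
    for ($i = 0; $i < $maxPages && $page !== null; $i++) {
        $result = parsePage($fetch($page));
        $items = array_merge($items, $result['items']);
        $page = $result['next'];
    }

    return $items;
}

// Canned responses standing in for the site's AJAX endpoint.
$fake = [
    1 => '{"items": ["a", "b"], "next_page": 2}',
    2 => '{"items": ["c"], "next_page": null}',
];
$all = collectAllItems(function (int $page) use ($fake): string {
    return $fake[$page];
});
print_r($all);
```

In real use, the callable would make the HTTP request (with cURL, for instance) instead of reading from an array.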

file_get_contents() can take a remote URL and give you the page source. You can then use regular expressions (with the Perl-compatible preg_* functions) to grab what you need.
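A minimal sketch of that approach follows. Fetching a remote URL this way assumes `allow_url_fopen` is enabled. The helper name `fetchTitle`, the timeout, and the user agent string are illustrative choices, and a local temporary file stands in for a remote page so the sketch runs offline.

```php
<?php
// Hypothetical helper: fetches $location and returns the contents of its <title>, or null.
function fetchTitle(string $location): ?string
{
    // The 'http' options only take effect for http(s) URLs; the values are illustrative.
    $context = stream_context_create([
        'http' => [
            'timeout'    => 10,
            'user_agent' => 'ExampleScraper/0.1',
        ],
    ]);

    // @ suppresses the warning; failure is reported through the false return value.
    $html = @file_get_contents($location, false, $context);
    if ($html === false) {
        return null;
    }

    // The "s" flag lets "." span newlines; "i" ignores tag case.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }

    return null;
}

// A local file stands in for a remote page.
$sample = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($sample, "<html><head><title>\n Hello &amp; welcome </title></head></html>");
echo fetchTitle($sample), "\n";
unlink($sample);
```

Passing an `http://` or `https://` URL instead of the file path exercises the same code against a live site.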

Out of curiosity, what are you trying to scrape?

I'd use either libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

The scraper class from my framework:

    <?php
    /*
    Example:

    $site = $this->load->cls('scraper', 'http://www.anysite.com');

    $excss = $site->getExternalCSS();
    $incss = $site->getInternalCSS();
    $ids = $site->getIds();
    $classes = $site->getClasses();
    $spans = $site->getSpans();

    print '<pre>';
    print_r($excss);
    print_r($incss);
    print_r($ids);
    print_r($classes);
    print_r($spans);
    */

    class scraper
    {
        // Despite its name, this holds the fetched page source, not the URL.
        private $url = '';

        public function __construct($url)
        {
            $this->url = file_get_contents("$url");
        }

        // Returns array(values of capture group 2, number of matches).
        private function collect($pattern)
        {
            preg_match_all($pattern, $this->url, $patterns);

            return array($patterns[2], count($patterns[2]));
        }

        // Contents of inline style="..." attributes.
        public function getInternalCSS()
        {
            return $this->collect('/(style=")(.*?)(")/is');
        }

        // href values that end in .css.
        public function getExternalCSS()
        {
            return $this->collect('/(href=")(\w.*\.css)"/i');
        }

        // Values of id="..." attributes.
        public function getIds()
        {
            return $this->collect('/(id="(\w*)")/is');
        }

        // Values of class="..." attributes (single word only).
        public function getClasses()
        {
            return $this->collect('/(class="(\w*)")/is');
        }

        // Contents of attribute-less <span> elements on a single line.
        public function getSpans()
        {
            return $this->collect('/(<span>)(.*)(<\/span>)/');
        }
    }
    ?>

There is a nice ebook on web scraping with PHP here:

https://leanpub.com/web-scraping

The curl library allows you to download web pages. You should look into regular expressions for doing the scraping.
