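How do I scrape website content in PHP from a site that requires a cookie login?

My problem is that the site does not just require a basic cookie: it asks for a session cookie and randomly generated IDs. I think this means I need to use a web browser emulator with a cookie jar?

I have tried Snoopy, Goutte and a couple of other web browser emulators, but so far I have not been able to find a tutorial on how to receive cookies. I am getting a little desperate!

Can anyone give me an example of how to accept cookies in Snoopy or Goutte?

Thanks in advance!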

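Object-oriented answer

We implement as much as possible of the plain cURL answer in one class called Browser, which should supply the normal navigation functions.

Then we should be able to put the site-specific code, in a very simple form, into a new derived class that we call, say, FooBrowser, which performs the scraping of site Foo.

The class deriving from Browser must supply some site-specific functions, such as a path() function that tells where to store site-specific information, for example: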

    // Example of the site-specific path() function (it belongs in the derived class).
    function path($basename) {
        return '/var/tmp/www.foo.bar/' . $basename;
    }

    abstract class Browser
    {
        private $options = [];
        private $state   = [];
        protected $cookies;

        abstract protected function path($basename);

        public function __construct($site, $options = [])
        {
            $this->cookies = $this->path('cookies');
            $this->options = array_merge(
                [
                    'site'      => $site,
                    'userAgent' => 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 - LeoScraper',
                    'waitTime'  => 250000,
                ],
                $options
            );
            $this->state = [
                'referer' => '/',
                'url'     => '',
                'curl'    => '',
            ];
            $this->__wakeup();
        }

        /**
         * Reactivates after sleep (e.g. in session) or creation.
         */
        public function __wakeup()
        {
            $this->state['curl'] = curl_init();
            $this->config([
                CURLOPT_USERAGENT       => $this->options['userAgent'],
                CURLOPT_ENCODING        => '',
                CURLOPT_NOBODY          => false,           // ...retrieving the body...
                CURLOPT_BINARYTRANSFER  => true,            // ...as binary (no effect since PHP 5.1.3)...
                CURLOPT_RETURNTRANSFER  => true,            // ...into $ret...
                CURLOPT_FOLLOWLOCATION  => true,            // ...following redirections...
                CURLOPT_MAXREDIRS       => 5,               // ...reasonably...
                CURLOPT_COOKIEFILE      => $this->cookies,  // Read cookies from this file...
                CURLOPT_COOKIEJAR       => $this->cookies,  // ...and save them back to it
                CURLOPT_CONNECTTIMEOUT  => 30,              // Seconds
                CURLOPT_TIMEOUT         => 300,             // Seconds
                CURLOPT_LOW_SPEED_LIMIT => 16384,           // 16 KB/s
                CURLOPT_LOW_SPEED_TIME  => 15,              // Seconds
            ]);
        }

        /**
         * Imports an options array.
         *
         * @param array $opts
         * @throws \Exception
         */
        private function config(array $opts = [])
        {
            foreach ($opts as $key => $value) {
                if (true !== curl_setopt($this->state['curl'], $key, $value)) {
                    throw new \Exception('Could not set cURL option');
                }
            }
        }

        /**
         * Requests $url, using the previously visited URL as referer.
         */
        private function perform($url)
        {
            $this->state['referer'] = $this->state['url'];
            $this->state['url']     = $url;
            $this->config([
                CURLOPT_URL     => $this->options['site'] . $this->state['url'],
                CURLOPT_REFERER => $this->options['site'] . $this->state['referer'],
            ]);
            $response = curl_exec($this->state['curl']);
            // Should we ever want to randomize waitTime, do so here.
            usleep($this->options['waitTime']);
            return $response;
        }

        /**
         * Returns a configuration option, optionally setting a new value.
         *
         * @param string $key   configuration key name
         * @param string $value value to set
         * @return mixed the value held before the call
         */
        protected function option($key, $value = '__DEFAULT__')
        {
            $curr = $this->options[$key];
            if ('__DEFAULT__' !== $value) {
                $this->options[$key] = $value;
            }
            return $curr;
        }

        /**
         * Performs a POST.
         *
         * @param string $url
         * @param array  $fields
         * @return mixed
         */
        public function post($url, array $fields)
        {
            $this->config([
                CURLOPT_POST       => true,
                CURLOPT_POSTFIELDS => http_build_query($fields),
            ]);
            return $this->perform($url);
        }

        /**
         * Performs a GET.
         *
         * @param string $url
         * @param array  $fields
         * @return mixed
         */
        public function get($url, array $fields = [])
        {
            $this->config([ CURLOPT_POST => false ]);
            if (empty($fields)) {
                $query = '';
            } else {
                $query = '?' . http_build_query($fields);
            }
            return $this->perform($url . $query);
        }
    }

Now to scrape FooSite:

    /* WWW_FOO_COM requires username and password to construct */
    class WWW_FOO_COM_Browser extends Browser
    {
        private $loggedIn = false;

        public function __construct($username, $password)
        {
            parent::__construct('http://www.foo.bar.baz', [
                'username'  => $username,
                'password'  => $password,
                'waitTime'  => 250000,
                'userAgent' => 'FooScraper',
                'cache'     => true,
            ]);
            // Open the session
            $this->get('/');
            // Navigate to the login page
            $this->get('/login.do');
        }

        /**
         * Where to store site-specific files (the directory must exist and be writable).
         */
        protected function path($basename)
        {
            return '/var/tmp/www.foo.bar/' . $basename;
        }

        /**
         * Perform login.
         */
        public function login()
        {
            $response = $this->post(
                '/ajax/loginPerform',
                [
                    'j_un' => $this->option('username'),
                    'j_pw' => $this->option('password'),
                ]
            );
            // TODO: verify that response is OK.
            // if (!strstr($response, "Welcome " . $this->option('username'))) {
            //     throw new \Exception("Bad username or password");
            // }
            $this->loggedIn = true;
            return true;
        }

        public function scrape($entry)
        {
            // We could implement caching to avoid scraping the same entry
            // too often. Save $data into path("entry-" . md5($entry))
            // and verify the filemtime of said file: is it newer than time()
            // minus, say, 86400 seconds? If yes, return file_get_contents() and
            // leave the remote site alone.
            $data = $this->get(
                '/foobars/baz.do',
                [
                    'ticker' => $entry,
                ]
            );
            return $data;
        }
    }
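The caching idea described in the comment inside scrape() could be sketched roughly as follows. This is only an illustration: cachedFetch() is a hypothetical helper, not part of the Browser class, and the 86400-second lifetime is the example figure from that comment.

```php
<?php
/**
 * Returns the content of $file if it is younger than $maxAge seconds;
 * otherwise calls $fetch(), stores its result in $file and returns it.
 * Hypothetical helper, shown only to illustrate the caching idea.
 */
function cachedFetch(string $file, int $maxAge, callable $fetch): string
{
    // filemtime() results are cached per request; make sure we see fresh data.
    clearstatcache(true, $file);
    if (is_file($file) && filemtime($file) > time() - $maxAge) {
        $cached = file_get_contents($file);
        if (false !== $cached) {
            return $cached; // Fresh enough: leave the remote site alone.
        }
    }
    $data = $fetch();
    if (!is_string($data)) {
        // curl_exec() returns false on failure; do not cache that.
        throw new \RuntimeException('Fetch failed; nothing to cache');
    }
    file_put_contents($file, $data, LOCK_EX);
    return $data;
}
```

Inside scrape(), the call to $this->get() would then be wrapped in a closure and passed as $fetch, with $this->path('entry-' . md5($entry)) as the cache file.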

Now the actual scraping code would be:

    $scraper = new WWW_FOO_COM_Browser('lserni', 'mypassword');
    if (!$scraper->login()) {
        throw new \Exception("bad user or pass");
    }
    foreach ($entries as $entry) {
        $html = $scraper->scrape($entry);
        // Parse HTML
    }

Mandatory notice: use a suitable parser to get data out of raw HTML.

You can do this in cURL without needing external "emulators".

The code below retrieves a page into a PHP variable to be parsed.

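Scenario

There is a page (let's call it HOME) that opens the session. Server side, if it is in PHP, it is the one (any one, actually) that calls session_start() for the first time. In other languages you need a specific page that does all the session setup. From the client side, it is the page that supplies the session ID cookie. In PHP, all sessioned pages do; in other languages the landing page will do it, and all the others will check whether the cookie is there and, if it is not, will drop you to HOME instead of creating a session.

There is a page (LOGIN) that generates the login form and adds a critical piece of information to the session: "this user is logged in". In the code below, this is the page asking for the session ID.

And finally there are N pages where the goodies to be scraped reside.

So we want to hit HOME, then LOGIN, then the GOODIES one after another. In PHP (and in other languages, actually), HOME and LOGIN might well be the same page. Or all pages might share the same address, for example in single-page applications.

The code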

    $url      = "the url generating the session ID";
    $next_url = "the url asking for session";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // We do not authenticate, only access the page to get a session going.
    // Change to false if it is not enough (you'll see that cookiefile
    // remains empty).
    curl_setopt($ch, CURLOPT_NOBODY, true);
    // You may want to change User-Agent here, too
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile");
    curl_setopt($ch, CURLOPT_COOKIEJAR,  "cookiefile");

    // Just in case
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $ret = curl_exec($ch);

    // This page we retrieve, and scrape, with the GET method
    foreach (array(
        CURLOPT_POST            => false,       // We GET...
        CURLOPT_NOBODY          => false,       // ...the body...
        CURLOPT_URL             => $next_url,   // ...of $next_url...
        CURLOPT_BINARYTRANSFER  => true,        // ...as binary (no effect since PHP 5.1.3)...
        CURLOPT_RETURNTRANSFER  => true,        // ...into $ret...
        CURLOPT_FOLLOWLOCATION  => true,        // ...following redirections...
        CURLOPT_MAXREDIRS       => 5,           // ...reasonably...
        CURLOPT_REFERER         => $url,        // ...as if we came from $url...
        //CURLOPT_COOKIEFILE    => 'cookiefile', // Save these cookies
        //CURLOPT_COOKIEJAR     => 'cookiefile', // (already set above)
        CURLOPT_CONNECTTIMEOUT  => 30,          // Seconds
        CURLOPT_TIMEOUT         => 300,         // Seconds
        CURLOPT_LOW_SPEED_LIMIT => 16384,       // 16 KB/s
        CURLOPT_LOW_SPEED_TIME  => 15,          // Seconds
    ) as $option => $value) {
        if (!curl_setopt($ch, $option, $value)) {
            die("could not set $option to " . serialize($value));
        }
    }

    $ret = curl_exec($ch);
    // Done; cleanup.
    curl_close($ch);

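Implementation

First of all, we have to get the login page.

We use a special User-Agent to introduce ourselves, both to be recognizable (we don't want to antagonize the webmaster) and to fool the server into sending us a specific, browser-tailored version of the site. Ideally, we use the same User-Agent as the browser we are going to use to debug the page, plus a suffix that makes it clear to whoever checks that they are looking at an automated tool (see the comment by Halfer).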

    $ua = 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 (ROBOT)';
    $cookiefile = "cookiefile";
    $url1 = "the login url generating the session ID";

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url1);
    curl_setopt($ch, CURLOPT_USERAGENT,      $ua);
    curl_setopt($ch, CURLOPT_COOKIEFILE,     $cookiefile);
    curl_setopt($ch, CURLOPT_COOKIEJAR,      $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_NOBODY,         false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);

    $ret = curl_exec($ch);

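This retrieves the page asking for user and password. By inspecting the page, we find the needed fields (including the hidden ones) and can populate them. The FORM tag tells us whether we need to go on with POST or GET.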
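The inspection can also be done programmatically. The following is a minimal sketch, assuming the login form is the first form on the page and is present in the static HTML; extractForm() is a hypothetical helper, and fields added later by Javascript will not show up this way.

```php
<?php
/**
 * Extracts action, method and named <input> fields from the first <form>
 * found in $html. Returns null when there is no form.
 * Hypothetical helper, for illustration only.
 */
function extractForm(string $html): ?array
{
    if ('' === trim($html)) {
        return null;
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $doc->getElementsByTagName('form')->item(0);
    if (null === $form) {
        return null;
    }
    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ('' !== $name) {
            // Hidden fields are inputs too, so they are picked up here.
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return [
        'action' => $form->getAttribute('action'),
        // A form without a method attribute is submitted with GET.
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'fields' => $fields,
    ];
}
```

Calling extractForm($ret) on the page retrieved above would give the field names to fill in and the method to use.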

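We might want to inspect the form code to adjust the following operations, so we ask cURL to return the page content as-is into $ret, and to actually return the page body. Sometimes CURLOPT_NOBODY set to true is still enough to trigger session creation and cookie submission, and if so, it is faster. But CURLOPT_NOBODY ("no body") works by issuing a HEAD request instead of a GET, and sometimes the HEAD request doesn't work because the server will only react to a full GET.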
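To check whether a request really produced a session cookie, the cookie file can be inspected. The sketch below assumes the Netscape cookie file format that cURL writes (seven tab-separated columns, with HttpOnly cookies marked by a "#HttpOnly_" prefix); readCookieJar() is a hypothetical helper. Note that cURL writes the jar only when the handle is closed or released, so the file may look empty while the handle is still open.

```php
<?php
/**
 * Reads a Netscape-format cookie file, as written through CURLOPT_COOKIEJAR,
 * and returns the cookies as a name => value array.
 * Hypothetical helper, for illustration only.
 */
function readCookieJar(string $file): array
{
    $cookies = [];
    if (!is_readable($file)) {
        return $cookies;
    }
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        if (0 === strpos($line, '#HttpOnly_')) {
            // HttpOnly cookies carry this prefix in front of the domain.
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ('#' === $line[0]) {
            continue; // Plain comment line.
        }
        // Columns: domain, subdomain flag, path, secure flag, expiry, name, value.
        $parts = explode("\t", $line);
        if (7 === count($parts)) {
            $cookies[$parts[5]] = $parts[6];
        }
    }
    return $cookies;
}
```

An empty result after the first request suggests that the HEAD request was not enough and CURLOPT_NOBODY should be set to false.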

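Instead of retrieving the body this way, it is also possible to log in with a real Firefox and sniff the form content being posted with Firebug (or with Chrome and the Chrome developer tools); some sites will try to populate or modify hidden fields with Javascript, so that the form actually submitted is not the one you see in the HTML code.

A webmaster who does not want his site scraped might send a hidden field with the timestamp. A human being (not aided by a too-clever browser; there are ways to tell browsers not to be clever, at worst by changing the names of the user and pass fields every time) takes at least three seconds to fill in a form. A cURL script takes zero. Of course, a delay can be simulated. It's all shadowboxing...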
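Simulating such a delay could look like the sketch below. humanPause() is a hypothetical helper, and the default range of 3 to 8 seconds is an assumption loosely based on the three-second figure above, not a value known to satisfy any particular site.

```php
<?php
/**
 * Sleeps for a random interval between $minSeconds and $maxSeconds and
 * returns the number of microseconds slept.
 * Hypothetical helper; the default range is an arbitrary assumption.
 */
function humanPause(float $minSeconds = 3.0, float $maxSeconds = 8.0): int
{
    if ($minSeconds < 0 || $maxSeconds < $minSeconds) {
        throw new \InvalidArgumentException('Invalid delay range');
    }
    $micros = random_int(
        (int) round($minSeconds * 1000000),
        (int) round($maxSeconds * 1000000)
    );
    usleep($micros);
    return $micros;
}
```

It would be called between fetching the form and posting it.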

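We may also want to be on the lookout for the form's appearance. A webmaster could, for example, build a form asking for name, email and password, and then, through CSS, move the "email" field to where you would expect to find the name, and vice versa. So the real form being submitted will have a "@" in the field called username and none in the field called email. The server, which expects this, merely swaps the two fields back. A "scraper" built by hand (or a spambot) would do what seems natural and send an email address in the email field. By doing so, it betrays itself. By working through the form once with a real CSS- and JS-aware browser, sending meaningful data, and sniffing what actually gets sent, we might be able to overcome this particular obstacle. Might, because there are ways of making life difficult. As I said, shadowboxing.

Back to the case at hand: here the form contains three fields and has no Javascript overlay. We have cPASS, cUSR, and checkLOGIN with a value of "Check Login".

So we prepare the form with the proper fields. Note that the form is to be sent as application/x-www-form-urlencoded, which in PHP cURL means two things:

we are to use CURLOPT_POST
the option CURLOPT_POSTFIELDS must be a string (an array would tell cURL to submit as multipart/form-data, which might work... or might not).

The form fields are, as the name says, urlencoded; there is a function for that.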
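Besides encoding each value with urlencode(), the whole string can be built in one call with http_build_query(), which is what the object-oriented answer does. A short sketch using the field values of this example:

```php
<?php
$fields = [
    'checkLOGIN' => 'Check Login',
    'cUSR'       => 'jb007',
    'cPASS'      => 'astonmartin',
];

// http_build_query() urlencodes every name and value and joins the pairs.
// The separator is passed explicitly so the result does not depend on the
// arg_separator.output ini setting.
$string = http_build_query($fields, '', '&');

// $string is now "checkLOGIN=Check+Login&cUSR=jb007&cPASS=astonmartin"
```

The result is a string, so passing it to CURLOPT_POSTFIELDS keeps the submission as application/x-www-form-urlencoded.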

We read the action attribute of the form; that is the URL we are to use to submit our authentication (which we must have).

So, everything being ready...

    $fields = array(
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    );
    $coded = array();
    foreach ($fields as $field => $value) {
        $coded[] = $field . '=' . urlencode($value);
    }
    $string = implode('&', $coded);

    curl_setopt($ch, CURLOPT_URL,        $url1); // same URL as before, the login url generating the session ID
    curl_setopt($ch, CURLOPT_POST,       true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $string);
    $ret = curl_exec($ch);

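We now expect a "Hello, James - how about a nice game of chess?" page. But more than that, we expect that the session associated with the cookie saved in $cookiefile has been supplied with the critical information: "user is authenticated".

So all following page requests made using $ch and the same cookie jar will be granted access, allowing us to "scrape" pages quite easily. Just remember to set the request mode back to GET: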

    curl_setopt($ch, CURLOPT_POST, false);

    // Start spidering
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $HTML = curl_exec($ch);
        if (false === $HTML) {
            // Something went wrong, check curl_error() and curl_errno().
        }
    }
    curl_close($ch);

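In the loop, you have access to $HTML, the HTML code of every single page.

Great the temptation of using regexps is. Resist it you must. To better cope with ever-changing HTML, and with cases where the layout stays the same but the content changes (e.g. you discover that you have Nice, Tourrette-Levens, Castagniers, but never Asprémont or Gattières, and isn't that cürious?), the best option is to use DOM:

Grabbing the href attribute of an A element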
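As an illustration of the DOM approach, here is a minimal sketch that collects the href attribute of every A element with DOMDocument and DOMXPath; extractLinks() is a hypothetical helper name, not taken from the linked answer.

```php
<?php
/**
 * Returns the href attribute of every <a> element in $html that has one.
 * Hypothetical helper, for illustration only.
 */
function extractLinks(string $html): array
{
    if ('' === trim($html)) {
        return [];
    }
    $doc = new DOMDocument();
    // Scraped HTML is often malformed; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Anchors without an href (e.g. named anchors) are skipped by the query.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}
```

Inside the spidering loop, extractLinks($HTML) would return the links of each page, with HTML entities already decoded.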

