Crawling with an authenticated session in Scrapy

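In my previous question I wasn't very specific about my problem (scraping with an authenticated session in Scrapy), hoping I could work out the solution from a more general answer. I should have used the word crawling instead.

So, here is my code so far: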

    class MySpider(CrawlSpider):
        name = 'myspider'
        allowed_domains = ['domain.com']
        start_urls = ['http://www.domain.com/login/']

        rules = (
            Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
        )

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            if not "Hi Herman" in response.body:
                return self.login(response)
            else:
                return self.parse_item(response)

        def login(self, response):
            return [FormRequest.from_response(response,
                        formdata={'name': 'herman', 'password': 'password'},
                        callback=self.parse)]

        def parse_item(self, response):
            i['url'] = response.url

            # ... do more things

            return i

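As you can see, the first page I visit is the login page. If I'm not authenticated yet (checked in the parse function), I call my custom login function, which posts to the login form. Then, once I am authenticated, I want to continue crawling.

The problem is that the parse function I overrode in order to log in no longer makes the calls needed to scrape any further pages (I'm assuming). I'm also not sure how to go about saving the Items that I create.

Has anyone done something like this before (authenticate, then crawl, using a CrawlSpider)? Any help would be appreciated.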

Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There is a warning about this in the CrawlSpider documentation.

This is because, with a CrawlSpider, parse (the default callback of any request) sends the response on to be processed by the Rules.
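
To make the point concrete, here is a minimal toy model of that dispatch in plain Python. It is not Scrapy's implementation: `ToyCrawlSpider`, `_apply_rules`, and the dict-shaped response are all made up for illustration, and a "rule" here is just a substring paired with a callback name.

```python
class ToyCrawlSpider:
    """Toy stand-in for CrawlSpider; not Scrapy's real implementation."""

    # Each rule is (substring to look for in a link, callback name).
    rules = ()

    def parse(self, response):
        # Default callback: hand the response to the rule machinery.
        return self._apply_rules(response)

    def _apply_rules(self, response):
        followed = []
        for link in response["links"]:
            for pattern, callback_name in self.rules:
                if pattern in link:
                    followed.append((link, callback_name))
        return followed


class GoodSpider(ToyCrawlSpider):
    rules = ((".html", "parse_item"),)


class BrokenSpider(ToyCrawlSpider):
    rules = ((".html", "parse_item"),)

    def parse(self, response):
        # Overriding parse replaces the rule dispatch entirely,
        # so no links are ever matched against the rules.
        return []


response = {"links": ["/a-1.html", "/about"]}
good_result = GoodSpider().parse(response)
broken_result = BrokenSpider().parse(response)
print(good_result)
print(broken_result)
```

In this sketch the spider that leaves parse alone follows the matching link, while the one that overrides it follows nothing, which mirrors the symptom described in the question.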

Logging in before crawling:

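To run some kind of initialisation before a spider starts crawling, you can use an InitSpider and override the init_request function. This function is called while the spider is initialising, before it starts crawling. (InitSpider subclasses BaseSpider, not CrawlSpider, so on its own it does not process rules.)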

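For the spider to begin crawling, you need to call self.initialized.

You can read the code responsible for this in the InitSpider source (it has helpful docstrings).

An example: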

    from scrapy.contrib.spiders.init import InitSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import Rule


    class MySpider(InitSpider):
        name = 'myspider'
        allowed_domains = ['domain.com']
        login_page = 'http://www.domain.com/login'
        start_urls = ['http://www.domain.com/useful_page/',
                      'http://www.domain.com/another_useful_page/']

        rules = (
            Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
                 callback='parse_item', follow=True),
        )

        def init_request(self):
            """This function is called before crawling starts."""
            return Request(url=self.login_page, callback=self.login)

        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                        formdata={'name': 'herman', 'password': 'password'},
                        callback=self.check_login_response)

        def check_login_response(self, response):
            """Check the response returned by a login request to see if we are
            successfully logged in.
            """
            if "Hi Herman" in response.body:
                self.log("Successfully logged in. Let's start crawling!")
                # Now the crawling can begin. initialized() hands back the
                # pending start requests, so its result must be returned.
                return self.initialized()
            else:
                self.log("Bad times :(")
                # Something went wrong, we couldn't log in, so nothing happens.

        def parse_item(self, response):
            # Scrape data from page
            pass

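Saving items:

Items your Spider returns are passed along to the Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html

If you have any problems or questions regarding Items, don't hesitate to open a new question and I'll do my best to help.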
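
As a rough illustration of what such a pipeline can look like, here is a minimal sketch. The class name `JsonLinesPipeline` and the in-memory stream are assumptions made for the example; the only part taken from Scrapy's pipeline interface is the `process_item(self, item, spider)` method, which is called once per item and should return the item.

```python
import io
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, stream=None):
        # In-memory buffer by default; a real project would likely open a file.
        self.stream = stream if stream is not None else io.StringIO()

    def process_item(self, item, spider):
        # Called once per scraped item; returning the item lets any
        # later pipeline stage see it as well.
        self.stream.write(json.dumps(dict(item), sort_keys=True) + "\n")
        return item


# Stand-alone usage, without a running crawler (spider is unused here).
pipeline = JsonLinesPipeline()
returned = pipeline.process_item(
    {"url": "http://www.domain.com/page-1.html"}, spider=None
)
written = pipeline.stream.getvalue()
print(written, end="")
```

In a real project the class would be registered through the ITEM_PIPELINES setting so that Scrapy instantiates and calls it for you.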

For the above solution to work, I had to make CrawlSpider inherit from InitSpider instead of BaseSpider, by changing the following in the Scrapy source code. In the file scrapy/contrib/spiders/crawl.py:

Add: from scrapy.contrib.spiders.init import InitSpider
Change class CrawlSpider(BaseSpider) to class CrawlSpider(InitSpider)

Otherwise the spider wouldn't call the init_request method.

Is there any other, easier way?

If what you need is HTTP authentication, use the provided middleware hooks.

In settings.py:

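    # The setting is named DOWNLOADER_MIDDLEWARES and maps each
    # middleware path to its order.
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    }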

And in your spider class add the attributes:

    http_user = "user"
    http_pass = "pass"
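
For context, HTTP authentication here means credentials sent in a request header, which is different from posting a login form. Assuming the middleware uses HTTP Basic auth with those two attributes, the header it attaches would look roughly like what this sketch produces. `basic_auth_header` below is a hypothetical helper written for the example, not a call into Scrapy.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header_value = basic_auth_header("user", "pass")
print("Authorization:", header_value)  # Authorization: Basic dXNlcjpwYXNz
```

If the site logs you in through an HTML form and a session cookie, as in the question, this approach will not help and the FormRequest route is still needed.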

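Just adding to Acorn's answer above. Using his method, my script was not parsing the start_urls after the login. It was exiting after a successful login in check_login_response. I could see that I had the generator, though. I needed to use

    return self.initialized()

Then the parsing function was called.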
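
A toy model may help show why the return matters. This is not Scrapy's code: `ToyInitSpider`, the `"LOGIN"` placeholder request, and the `run` function are invented for illustration, under the assumption that initialized() hands back the held start requests and that the engine only schedules what a callback returns.

```python
class ToyInitSpider:
    """Toy model of the hand-off; not Scrapy's actual InitSpider code."""

    start_urls = ["http://www.domain.com/useful_page/"]

    def start_requests(self):
        # Hold back the real start URLs and begin with the login step.
        self._pending = list(self.start_urls)
        return ["LOGIN"]

    def initialized(self):
        # Release the held-back start URLs to whoever called us.
        pending, self._pending = self._pending, []
        return pending


class ReturningSpider(ToyInitSpider):
    def check_login_response(self, response):
        return self.initialized()


class ForgetfulSpider(ToyInitSpider):
    def check_login_response(self, response):
        self.initialized()  # result is discarded, nothing gets scheduled


def run(spider):
    """Tiny 'engine': schedule whatever callbacks return, record crawled URLs."""
    crawled = []
    queue = list(spider.start_requests())
    while queue:
        request = queue.pop(0)
        if request == "LOGIN":
            queue.extend(spider.check_login_response("Hi Herman") or [])
        else:
            crawled.append(request)
    return crawled


returning_crawl = run(ReturningSpider())
forgetful_crawl = run(ForgetfulSpider())
print(returning_crawl)
print(forgetful_crawl)
```

In the sketch, the spider that returns the result crawls its start URL, while the one that drops it finishes right after login, which matches the behaviour described above.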
