Tag: scrapy

Scrapy raises ImportError: cannot import name xmlrpc_client: After installing Scrapy via pip, with Python 2.7.10, running scrapy fails:
    Traceback (most recent call last):
      File "/usr/local/bin/scrapy", line 7, in <module>
        from scrapy.cmdline import execute
      File "/Library/Python/2.7/site-packages/scrapy/__init__.py", line 48, in <module>
        from scrapy.spiders import Spider
      File "/Library/Python/2.7/site-packages/scrapy/spiders/__init__.py", line 10, in <module>
        from scrapy.http import Request
      File "/Library/Python/2.7/site-packages/scrapy/http/__init__.py", line 12, in <module>
        from scrapy.http.request.rpc import XmlRpcRequest
      File "/Library/Python/2.7/site-packages/scrapy/http/request/rpc.py", line 7, in <module>
    […]

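How to filter duplicate requests based on URL in Scrapy: I am writing a crawler for a website using Scrapy with CrawlSpider. Scrapy provides a built-in duplicate-request filter that filters duplicate requests based on their URLs. I can also filter requests using the rules member of CrawlSpider. What I want to do is filter out a request such as http://www.abc.com/p/xyz.html?id=1234&refer=5678 if I have already visited http://www.abc.com/p/xyz.html?id=1234&refer=4567. Note: refer is a parameter that does not affect the response I get, so I don't care if its value changes. If I keep a set that accumulates all the ids, I can ignore the duplicate in my callback function parse_item. But that means I still fetch the page when I don't need to. So how can I tell Scrapy not to send a particular request based on its URL?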
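The idea in this question, treating two URLs as the same request when they differ only in the refer parameter, can be sketched with the standard library alone. This is a minimal sketch, not Scrapy's own duplicate filter: canonical_key, SeenUrls, and IGNORED_PARAMS are hypothetical names introduced for illustration, and the assumption that refer is the only ignorable parameter comes from the question. Where to hook such a check into Scrapy (for instance a custom duplicate filter) is left open here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "refer" is the only query parameter that does not affect the response.
IGNORED_PARAMS = {"refer"}


def canonical_key(url):
    """Return the URL with ignored query parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    # Drop the fragment as well, since it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


class SeenUrls:
    """Hypothetical helper: remembers canonical keys of URLs already requested."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, url):
        key = canonical_key(url)
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


seen = SeenUrls()
first = seen.is_duplicate("http://www.abc.com/p/xyz.html?id=1234&refer=4567")
second = seen.is_duplicate("http://www.abc.com/p/xyz.html?id=1234&refer=5678")
third = seen.is_duplicate("http://www.abc.com/p/xyz.html?id=9999&refer=5678")
print(first, second, third)
```

Because the check runs on the URL before any request is made, the second URL would be dropped without fetching the page, which is the behaviour the question asks for.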

Using Scrapy with an authenticated (logged-in) user session: The Scrapy documentation has the following example to illustrate how to use an authenticated session in Scrapy:
    class LoginSpider(BaseSpider):
        name = 'example.com'
        start_urls = ['http://www.example.com/users/login.php']
        def parse(self, response):
            return [FormRequest.from_response(response,
                        formdata={'username': 'john', 'password': 'secret'},
                        callback=self.after_login)]
        def after_login(self, response):
            # check login succeed before going on
            if "authentication failed" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            # continue scraping with authenticated session…
I have this working, and it's fine. But my question is: what do you have to do to continue scraping with authenticated session, as the comment on the last line puts it?

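Scrapy unit testing: I'd like to implement some unit tests in a Scrapy project (screen scraper / web crawler). Since a project is run through the "scrapy crawl" command, I can run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working. Update: I've been discussing this on Scrapy-Users, and I gather I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work. I can build a unit-test test class and, in a test, create a response object and try to call my spider's parse method with that response object. However, it ends up generating this traceback. Any insight as to why?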

Pip cannot install packages correctly: permission denied error: I'm trying to install lxml so that I can install scrapy on my Mac (v 10.9.4).
    ╭─ishaantaylor@Ishaans-MacBook-Pro.local ~
    ╰─➤ pip install lxml
    Downloading/unpacking lxml
      Downloading lxml-3.4.0.tar.gz (3.5MB): 3.5MB downloaded
      Running setup.py (path:/private/var/folders/8l/t7tcq67d34v7qq_4hp3s1dm80000gn/T/pip_build_ishaantaylor/lxml/setup.py) egg_info for package lxml
        /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url'
          warnings.warn(msg)
        Building lxml version 3.4.0.
        Building without Cython.
        Using build configuration of libxslt 1.1.28
        warning: no previously-included files found matching '*.py'
    Installing collected packages: lxml
      Running setup.py […]

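selenium with scrapy for dynamic pages: I'm trying to scrape product information from a webpage using scrapy. The page I want to scrape works like this: it starts with a product_list page showing 10 products; clicking the "next" button loads the next 10 products (the URL doesn't change between the two pages); I use LinkExtractor to follow each product link into its product page and get all the information I need. I tried to replicate the next-button-ajax-call but couldn't get it working, so I'm giving selenium a try. I can run selenium's webdriver in a separate script, but I don't know how to integrate it with scrapy. Where should I put the selenium part in my scrapy spider? My spider is pretty standard, like the following:
    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'),
        ]
        def parse_product(self, response):
            self.log("parsing product %s" %response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
            # actual data follows
Any idea is appreciated. Thank you!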

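How to use different pipelines for different spiders in a single Scrapy project: I have a scrapy project that contains multiple spiders. Is there any way I can define which pipelines to use for which spider? Not all the pipelines I have defined are applicable to every spider. Thanks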
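One way to approach this question relies on the fact that a Scrapy item pipeline is a plain class whose process_item(item, spider) method receives the spider, so the pipeline itself can decide whether to act. The following is a minimal sketch under stated assumptions, not an official Scrapy mechanism: the pipelines attribute on the spider, the pipeline_name attribute, and the classes SpiderAwarePipeline, PriceCleanupPipeline, and FakeSpider are all hypothetical names chosen for illustration. FakeSpider only stands in for a real spider so the example runs without Scrapy installed.

```python
class SpiderAwarePipeline:
    """Base class: process an item only if the spider opted in by name."""

    # Hypothetical convention: each spider lists pipeline names
    # in a `pipelines` attribute; this is not a built-in Scrapy feature.
    pipeline_name = None

    def process_item(self, item, spider):
        if self.pipeline_name in getattr(spider, "pipelines", ()):
            return self.handle(item, spider)
        # Pass the item through untouched for spiders that did not opt in.
        return item

    def handle(self, item, spider):
        raise NotImplementedError


class PriceCleanupPipeline(SpiderAwarePipeline):
    """Example pipeline: converts a "$9.50" style price string to a float."""

    pipeline_name = "price_cleanup"

    def handle(self, item, spider):
        cleaned = dict(item)
        cleaned["price"] = float(str(item["price"]).lstrip("$"))
        return cleaned


class FakeSpider:
    """Stand-in for a Scrapy spider, used only for this demonstration."""

    def __init__(self, name, pipelines=()):
        self.name = name
        self.pipelines = pipelines


pipeline = PriceCleanupPipeline()
opted_in = pipeline.process_item({"price": "$9.50"}, FakeSpider("shop", ("price_cleanup",)))
skipped = pipeline.process_item({"price": "$9.50"}, FakeSpider("news"))
print(opted_in, skipped)
```

With this design every pipeline would still be enabled project-wide, and each spider declares which ones apply to it; pipelines a spider did not list simply return the item unchanged.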

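How to use PyCharm to debug Scrapy projects: I am working with Scrapy 0.20 on Python 2.7. I found that PyCharm has a good Python debugger, and I want to test my Scrapy spiders with it. Does anyone know how to do that? What I have tried: I tried to run the spider as a script, and as a result I built this script. Then I tried to add my Scrapy project to PyCharm like this: File->Setting->Project structure->Add content root. But I don't know what else I have to do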

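Difference between BeautifulSoup and Scrapy crawler?: I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup, but not so much with Scrapy crawler.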

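Cannot install Lxml on Mac OS X 10.9: I want to install Lxml so that I can then install Scrapy. After I updated my Mac today it won't let me reinstall lxml, and I get the following error:
    In file included from src/lxml/lxml.etree.c:314:
    /private/tmp/pip_build_root/lxml/src/lxml/includes/etree_defs.h:9:10: fatal error: 'libxml/xmlversion.h' file not found
    #include "libxml/xmlversion.h"
             ^
    1 error generated.
    error: command 'cc' failed with exit status 1
I have tried using brew to install libxml2 and libxslt; both installed fine, but I still cannot install lxml. The last time I installed it I needed to enable the developer tools in Xcode, but since the update to Xcode 5 it no longer gives me that option. Does anyone know what I need to do?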