How to run Scrapy from within a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two sources that explain how to do this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the example code:

    # This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
    #
    # The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
    #
    # [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.

    #!/usr/bin/python
    import os
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be at the top before other imports

    from scrapy import log, signals, project
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy.conf import settings
    from scrapy.crawler import CrawlerProcess
    from multiprocessing import Process, Queue

    class CrawlerScript():

        def __init__(self):
            self.crawler = CrawlerProcess(settings)
            if not hasattr(project, 'crawler'):
                self.crawler.install()
            self.crawler.configure()
            self.items = []
            dispatcher.connect(self._item_passed, signals.item_passed)

        def _item_passed(self, item):
            self.items.append(item)

        def _crawl(self, queue, spider_name):
            spider = self.crawler.spiders.create(spider_name)
            if spider:
                self.crawler.queue.append_spider(spider)
            self.crawler.start()
            self.crawler.stop()
            queue.put(self.items)

        def crawl(self, spider):
            queue = Queue()
            p = Process(target=self._crawl, args=(queue, spider,))
            p.start()
            p.join()
            return queue.get(True)

    # Usage
    if __name__ == "__main__":
        log.start()

        """
        This example runs spider1 and then spider2 three times.
        """
        items = list()
        crawler = CrawlerScript()
        items.append(crawler.crawl('spider1'))
        for i in range(3):
            items.append(crawler.crawl('spider2'))
        print items

    # Snippet imported from snippets.scrapy.org (which no longer works)
    # author: joehillen
    # date  : Oct 24, 2010

Thanks.

All the other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires the following:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
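If your spider lives inside a Scrapy project, the same docs page also describes a variant that loads the project settings and refers to the spider by name. A minimal sketch under that assumption (it presumes the script runs inside a project containing a spider named followall that accepts a domain argument, as in the examples further down):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Assumes the script is run from inside a Scrapy project,
    # so get_project_settings() can find the project settings module
    process = CrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the project's spiders
    process.crawl('followall', domain='scrapinghub.com')
    process.start()  # the script will block here until the crawling is finished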

Though I haven't tried it myself, I think the answer can be found in the Scrapy documentation. To quote it directly:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log
    from testspiders.spiders.followall import FollowAllSpider

    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here

From what I can tell, this is a newer development in the library that renders some of the older approaches found online (such as the one in the question) obsolete.

In Scrapy 0.19.x you should do this:

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from testspiders.spiders.followall import FollowAllSpider
    from scrapy.utils.project import get_project_settings

    spider = FollowAllSpider(domain='scrapinghub.com')
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here until the spider_closed signal was sent

Note these lines:

    settings = get_project_settings()
    crawler = Crawler(settings)

Without them, your spider won't use your settings and won't save the items. It took me a while to figure out why the example in the documentation wasn't saving my items. I sent a pull request to fix the doc example.

Another option is to call the command directly from your script:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl followall".split())  # followall is the spider's name

Copied this answer from my first answer here: https://stackoverflow.com/a/19060485/1402286

When multiple crawlers need to run inside one Python script, the reactor shutdown has to be handled with care, because the reactor can only be stopped once and cannot be restarted.
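One way to deal with this, sketched here after the "run multiple spiders in the same process" pattern from the Scrapy 1.x docs (Spider1 and Spider2 below are hypothetical placeholders), is to use CrawlerRunner, chain the crawls, and stop the reactor exactly once at the end:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class Spider1(scrapy.Spider):
        # hypothetical placeholder spider
        name = 'spider1'
        start_urls = ['http://example.com']

        def parse(self, response):
            yield {'url': response.url}

    class Spider2(scrapy.Spider):
        # hypothetical placeholder spider
        name = 'spider2'
        start_urls = ['http://example.com']

        def parse(self, response):
            yield {'url': response.url}

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # Run the spiders one after another inside a single reactor
        yield runner.crawl(Spider1)
        yield runner.crawl(Spider2)
        reactor.stop()  # stop the reactor only once, after the last crawl

    crawl()
    reactor.run()  # the script will block here until all crawling jobs are finished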

That said, while working on my own project I found that using

 os.system("scrapy crawl yourspider") 

is the easiest. It spares me from having to deal with all sorts of signals, especially when I have multiple spiders.
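For illustration, a minimal sketch of that approach with several spiders run one after another (the spider names are hypothetical); each scrapy crawl invocation gets its own process, so the calling script never touches the reactor or any signals:

    import os

    # Each call spawns a fresh scrapy process, so the Twisted reactor
    # never needs to be restarted inside this script.
    for name in ['spider1', 'spider2']:  # hypothetical spider names
        os.system('scrapy crawl %s' % name)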

If performance is a concern, you can use multiprocessing to run your spiders in parallel, something like this:

    import os
    from multiprocessing import Pool

    def _crawl(spider_name=None):
        if spider_name:
            os.system('scrapy crawl %s' % spider_name)
        return None

    def run_crawler():
        spider_names = ['spider1', 'spider2', 'spider2']
        pool = Pool(processes=len(spider_names))
        pool.map(_crawl, spider_names)

    # -*- coding: utf-8 -*-
    import sys
    from scrapy.cmdline import execute

    def gen_argv(s):
        sys.argv = s.split()

    if __name__ == '__main__':
        gen_argv('scrapy crawl abc_spider')
        execute()

Put this code at a path from which you can run scrapy crawl abc_spider on the command line. (Tested with Scrapy==0.24.6)