How do I use multiple requests and pass items between them in Scrapy (Python)?

I have an item object that I need to pass along through many pages, in order to store all of the data in a single item.

My item looks like this:

    class DmozItem(Item):
        title = Field()
        description1 = Field()
        description2 = Field()
        description3 = Field()

Now, those three descriptions live on three separate pages, and I want to do something like the following.

This currently works for parseDescription1:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1)
        request.meta['item'] = item
        return request

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

But what I want is something like:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1)
        request.meta['item'] = item

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription2)
        request.meta['item'] = item

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription2)
        request.meta['item'] = item

        return request

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return item

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return item

No problem. Instead of this:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1)
        request.meta['item'] = item

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription2)
        request.meta['item'] = item

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription2)
        request.meta['item'] = item

        return request

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return item

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return item

do this:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1)
        request.meta['item'] = item
        yield request

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription2,
                          meta={'item': item})
        yield request

        yield Request("http://www.example.com/lin1.cpp",
                      callback=self.parseDescription3,
                      meta={'item': item})

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return item

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return item

In order to guarantee the order of the requests/callbacks, and to end up with exactly one item at the end, you need to chain your requests like this:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp",
                          callback=self.parseDescription1)
        request.meta['item'] = Item()
        return [request]

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return [Request("http://www.example.com/lin2.cpp",
                        callback=self.parseDescription2,
                        meta={'item': item})]

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return [Request("http://www.example.com/lin3.cpp",
                        callback=self.parseDescription3,
                        meta={'item': item})]

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return [item]

Each callback returns an iterable of items or requests; the requests get scheduled, and the items get run through your item pipeline.
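To make that dispatch concrete, here is a framework-free sketch (plain Python, no Scrapy; `FakeRequest`, the driver loop, and the URLs are all invented for illustration) of how a crawler treats what a callback yields: requests go back onto the schedule, and everything else goes to the pipeline.

```python
from collections import deque

class FakeRequest:
    """Illustrative stand-in for scrapy.Request (not the real class)."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def crawl(start_request):
    """Toy driver loop: schedule requests, collect yielded items."""
    pipeline = []
    queue = deque([start_request])
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, FakeRequest):
                queue.append(result)      # a request: schedule it
            else:
                pipeline.append(result)   # an item: send it to the pipeline
    return pipeline

def parse_page(url):
    # A callback may yield a mix of follow-up requests and items.
    yield FakeRequest("http://example.com/detail", parse_detail)
    yield {"page": url}

def parse_detail(url):
    yield {"detail": url}
```

Running `crawl(FakeRequest("http://example.com/", parse_page))` produces both items, showing why returning an item from every callback yields multiple partially filled items.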

If you return an item from each callback, you end up with four items in various states of completeness; but if you return the next request instead, you can guarantee the order of the requests, and you will have exactly one item at the end of execution.

The accepted answer returns three items in total [with desc(i) set for i = 1, 2, 3].

If you want to return a single item, Dave McLain's approach works, but it requires parseDescription1, parseDescription2, and parseDescription3 all to succeed and run without errors in order for the item to be returned.

For my use case, some of the sub-requests may randomly return HTTP 403/404 errors, so I was losing items even though I could have scraped them partially.


Workaround

So I now use the following workaround: instead of only passing the item around in the request.meta dict, pass a call stack that knows which request to call next. It calls the next item on the stack (as long as it isn't empty), and returns the item once the stack is empty.
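The dispatch idea can be sketched without Scrapy at all (plain Python; the names `callnext`, `parse_description1/2/3`, and the `meta` dict are illustrative stand-ins, and the real version would yield `Request` objects instead of calling parsers directly): each parse step does its work and then hands control back to one dispatcher, which pops the next target off the stack.

```python
def callnext(meta):
    """Pop and run the next target on the stack, or finish the item."""
    if meta['callstack']:
        target = meta['callstack'].pop(0)
        # In Scrapy this would be `yield Request(..., callback=target['callback'],
        # errback=self.callnext)`; here we call the parser directly.
        return target['callback'](meta)
    return meta['item']  # stack exhausted: the item is complete

def parse_description1(meta):
    meta['item']['desc1'] = "test"
    # Build the call stack of remaining steps, then dispatch.
    meta['callstack'] = [
        {'callback': parse_description2},
        {'callback': parse_description3},
    ]
    return callnext(meta)

def parse_description2(meta):
    meta['item']['desc2'] = "test2"
    return callnext(meta)

def parse_description3(meta):
    meta['item']['desc3'] = "test3"
    return callnext(meta)

item = parse_description1({'item': {}, 'callstack': []})
```

The single item only surfaces from the last step on the stack, which is exactly how the chain guarantees one complete item per crawl.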

The errback Request parameter is used to call the dispatcher method whenever an error occurs, so the chain simply continues with the next item on the stack.

    def callnext(self, response):
        ''' Call next target for the item loader, or yield it if completed. '''

        # Get the meta object from the request, as the response
        # does not contain it.
        meta = response.request.meta

        # Items remaining in the stack? Execute them
        if len(meta['callstack']) > 0:
            target = meta['callstack'].pop(0)
            yield Request(target['url'], meta=meta,
                          callback=target['callback'], errback=self.callnext)
        else:
            yield meta['loader'].load_item()

    def parseDescription1(self, response):
        # Recover item (loader)
        l = response.meta['loader']

        # Use just as before
        l.add_css(...)

        # Build the call stack
        callstack = [
            {'url': "http://www.example.com/lin2.cpp",
             'callback': self.parseDescription2},
            {'url': "http://www.example.com/lin3.cpp",
             'callback': self.parseDescription3},
        ]
        # Store the stack in the request meta so callnext can see it
        response.meta['callstack'] = callstack

        return self.callnext(response)

    def parseDescription2(self, response):
        # Recover item (loader)
        l = response.meta['loader']

        # Use just as before
        l.add_css(...)

        return self.callnext(response)

    def parseDescription3(self, response):
        # ...
        return self.callnext(response)

Caveats

This solution is still sequential, and it will still fail if you have any exception inside a callback.
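One way to soften that caveat (a hedged sketch of my own, not part of the original solution; the function names and the simulated failure are invented) is to guard each parsing step so that an exception in one step no longer aborts the whole chain:

```python
def run_steps(item, steps):
    """Run each step in order; if one raises, record the error and continue.
    Illustrative stand-in for wrapping each Scrapy callback body in try/except
    before handing control back to the dispatcher."""
    errors = []
    for step in steps:
        try:
            step(item)
        except Exception as exc:  # a failed step no longer kills the chain
            errors.append((step.__name__, str(exc)))
    return item, errors

def fill_desc1(item):
    item['desc1'] = "test"

def fill_desc2(item):
    raise ValueError("HTTP 403")  # simulate a randomly failing sub-request

def fill_desc3(item):
    item['desc3'] = "test3"

item, errors = run_steps({}, [fill_desc1, fill_desc2, fill_desc3])
```

With this guard the partially scraped item still comes out at the end, together with a record of which steps failed.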

For more information, check out the blog post I wrote about this solution.