Does anyone know of a good Python-based web crawler that I could use?

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open-source crawlers, but I'd prefer something written in Python. I realize I could probably just use one of the tools on that Wikipedia page and wrap it in Python, and I may well do that; if anyone has any advice about any of those tools, I'd love to hear it. I've used Heritrix via its web interface and found it quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

  • Mechanize is my favorite; it has great high-level browsing capabilities (super-simple form filling and submission; a short sketch follows this list).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite well.
  • Scrapy looks like an extremely promising project; it's new.
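
Since easy form handling is the main selling point of Mechanize mentioned above, here is a minimal sketch of that workflow. It is not from the original answer: the URL, the form field names, and the assumption that the first form on the page is the login form are all placeholders.

    # Minimal Mechanize form-filling sketch (Python 2).
    # The URL and field names below are hypothetical placeholders.
    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)      # skip robots.txt handling for this sketch
    br.open("http://www.example.com/login")

    br.select_form(nr=0)             # assume the first form is the login form
    br["username"] = "myuser"        # hypothetical field names
    br["password"] = "mypassword"
    response = br.submit()

    print response.read()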

Use Scrapy.

It is a web crawler framework based on Twisted. Still under heavy development, but it already works. It has a lot of goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the image files as well
  • Support for extending Scrapy by plugging in your own functionality via middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt handling, statistics, crawl depth restriction, and so on
  • An interactive scraping shell console, very useful for development and debugging
  • A web management console for monitoring and controlling your bot
  • A telnet console for low-level access to the Scrapy process

Example code to extract information about all the torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

    class Torrent(ScrapedItem):
        pass

    class MininovaSpider(CrawlSpider):
        domain_name = 'mininova.org'
        start_urls = ['http://www.mininova.org/today']
        rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            x = HtmlXPathSelector(response)
            torrent = Torrent()
            torrent.url = response.url
            torrent.name = x.x("//h1/text()").extract()
            torrent.description = x.x("//div[@id='description']").extract()
            torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
            return [torrent]
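
The snippet above targets a very old Scrapy API (ScrapedItem, HtmlXPathSelector, domain_name). For comparison, a rough equivalent against the current scrapy.Spider API might look like the sketch below; this is not from the original answer, and the XPath expressions are simply carried over and assumed to still match the page.

    # Rough modern-Scrapy equivalent of the example above (not from the
    # original answer); XPaths are assumed unchanged from the old snippet.
    import scrapy

    class MininovaSpider(scrapy.Spider):
        name = 'mininova'
        start_urls = ['http://www.mininova.org/today']

        def parse(self, response):
            # Follow the per-torrent links, then scrape each torrent page.
            for href in response.xpath("//a[contains(@href, '/tor/')]/@href").getall():
                yield response.follow(href, self.parse_torrent)

        def parse_torrent(self, response):
            yield {
                'url': response.url,
                'name': response.xpath("//h1/text()").get(),
                'description': response.xpath("//div[@id='description']").get(),
                'size': response.xpath("//div[@id='info-left']/p[2]/text()[2]").get(),
            }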

Check out HarvestMan, a multi-threaded web crawler written in Python; also take a look at the spider.py module.
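
HarvestMan itself is not shown here; the following is only a rough sketch of the multi-threaded fetch pattern such crawlers are built around, using the standard threading and Queue modules (Python 2, to match the other answers; the URLs are placeholders).

    # Sketch of a multi-threaded fetcher: a worker pool pulls URLs off a
    # queue and stores the downloaded pages. Not HarvestMan's actual code.
    import urllib2
    import threading
    from Queue import Queue

    def worker(q, results):
        while True:
            url = q.get()
            try:
                results[url] = urllib2.urlopen(url, timeout=10).read()
            except Exception:
                results[url] = None
            q.task_done()

    def fetch_all(urls, num_threads=5):
        q = Queue()
        results = {}
        for _ in range(num_threads):
            t = threading.Thread(target=worker, args=(q, results))
            t.daemon = True
            t.start()
        for url in urls:
            q.put(url)
        q.join()  # block until every queued URL has been processed
        return results

    if __name__ == '__main__':
        pages = fetch_all(['http://www.example.com/', 'http://www.example.org/'])
        for url, html in pages.items():
            print url, (len(html) if html else 'failed')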

And here you can find code samples for building a simple web crawler.

I have used Ruya and found it quite good.

I hacked the script above to include a login page, since I needed it to access a Drupal site. Not pretty, but it may help someone out there.

    #!/usr/bin/python

    import urllib
    import urllib2
    from cookielib import CookieJar
    import sys
    import re
    from HTMLParser import HTMLParser

    class miniHTMLParser(HTMLParser):

        viewedQueue = []
        instQueue = []
        headers = {}
        opener = ""

        def get_next_link(self):
            # Return the next queued link, or '' when the queue is empty.
            if self.instQueue == []:
                return ''
            else:
                return self.instQueue.pop(0)

        def gethtmlfile(self, site, page):
            # Fetch a page from the site through the logged-in opener.
            try:
                url = 'http://' + site + page
                response = self.opener.open(url)
                return response.read()
            except Exception, err:
                print " Error retrieving: " + page
                sys.stderr.write('ERROR: %s\n' % str(err))
                return ""

        def loginSite(self, site_url):
            # Log in to the Drupal site and keep the session cookie.
            try:
                cj = CookieJar()
                self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
                url = 'http://' + site_url
                params = {'name': 'customer_admin',
                          'pass': 'customer_admin123',
                          'opt': 'Log in',
                          'form_build_id': 'form-3560fb42948a06b01d063de48aa216ab',
                          'form_id': 'user_login_block'}
                user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
                self.headers = {'User-Agent': user_agent}
                data = urllib.urlencode(params)
                response = self.opener.open(url, data)
                print "Logged in"
                return response.read()
            except Exception, err:
                print " Error logging in"
                sys.stderr.write('ERROR: %s\n' % str(err))
                return 1

        def handle_starttag(self, tag, attrs):
            # Queue every relative link we have not seen before.
            if tag == 'a':
                newstr = str(attrs[0][1])
                print newstr
                if re.search('http', newstr) == None:
                    if re.search('mailto', newstr) == None:
                        if re.search('#', newstr) == None:
                            if (newstr in self.viewedQueue) == False:
                                print " adding", newstr
                                self.instQueue.append(newstr)
                                self.viewedQueue.append(newstr)
                        else:
                            print " ignoring", newstr
                    else:
                        print " ignoring", newstr
                else:
                    print " ignoring", newstr

    def main():

        if len(sys.argv) != 3:
            print "usage is ./minispider.py site link"
            sys.exit(2)

        mySpider = miniHTMLParser()

        site = sys.argv[1]
        link = sys.argv[2]

        url_login_link = site + "/node?destination=node"
        print "\nLogging in", url_login_link
        x = mySpider.loginSite(url_login_link)

        while link != '':

            print "\nChecking link ", link

            # Get the file from the site and link
            retfile = mySpider.gethtmlfile(site, link)

            # Feed the file into the HTML parser
            mySpider.feed(retfile)

            # Search the retfile here

            # Get the next link in level traversal order
            link = mySpider.get_next_link()

        mySpider.close()

        print "\ndone\n"

    if __name__ == "__main__":
        main()

Trust me, nothing beats curl. The following code can crawl 10,000 URLs in parallel in less than 300 seconds on Amazon EC2.

Caution: don't hit the same domain at such a high speed.

    #! /usr/bin/env python
    # -*- coding: iso-8859-1 -*-
    # vi:ts=4:et
    # $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $
    #
    # Usage: python retriever-multi.py <file with URLs to fetch> [<# of
    #          concurrent connections>]
    #

    import sys
    import pycurl

    # We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
    # the libcurl tutorial for more info.
    try:
        import signal
        from signal import SIGPIPE, SIG_IGN
        signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    except ImportError:
        pass

    # Get args
    num_conn = 10
    try:
        if sys.argv[1] == "-":
            urls = sys.stdin.readlines()
        else:
            urls = open(sys.argv[1]).readlines()
        if len(sys.argv) >= 3:
            num_conn = int(sys.argv[2])
    except:
        print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
        raise SystemExit

    # Make a queue with (url, filename) tuples
    queue = []
    for url in urls:
        url = url.strip()
        if not url or url[0] == "#":
            continue
        filename = "doc_%03d.dat" % (len(queue) + 1)
        queue.append((url, filename))

    # Check args
    assert queue, "no URLs given"
    num_urls = len(queue)
    num_conn = min(num_conn, num_urls)
    assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
    print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
    print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"

    # Pre-allocate a list of curl objects
    m = pycurl.CurlMulti()
    m.handles = []
    for i in range(num_conn):
        c = pycurl.Curl()
        c.fp = None
        c.setopt(pycurl.FOLLOWLOCATION, 1)
        c.setopt(pycurl.MAXREDIRS, 5)
        c.setopt(pycurl.CONNECTTIMEOUT, 30)
        c.setopt(pycurl.TIMEOUT, 300)
        c.setopt(pycurl.NOSIGNAL, 1)
        m.handles.append(c)

    # Main loop
    freelist = m.handles[:]
    num_processed = 0
    while num_processed < num_urls:
        # If there is an url to process and a free curl object, add to multi stack
        while queue and freelist:
            url, filename = queue.pop(0)
            c = freelist.pop()
            c.fp = open(filename, "wb")
            c.setopt(pycurl.URL, url)
            c.setopt(pycurl.WRITEDATA, c.fp)
            m.add_handle(c)
            # store some info
            c.filename = filename
            c.url = url
        # Run the internal curl state machine for the multi stack
        while 1:
            ret, num_handles = m.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        # Check for curl objects which have terminated, and add them to the freelist
        while 1:
            num_q, ok_list, err_list = m.info_read()
            for c in ok_list:
                c.fp.close()
                c.fp = None
                m.remove_handle(c)
                print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
                freelist.append(c)
            for c, errno, errmsg in err_list:
                c.fp.close()
                c.fp = None
                m.remove_handle(c)
                print "Failed: ", c.filename, c.url, errno, errmsg
                freelist.append(c)
            num_processed = num_processed + len(ok_list) + len(err_list)
            if num_q == 0:
                break
        # Currently no more I/O is pending, could do something in the meantime
        # (display a progress bar, etc.).
        # We just call select() to sleep until some more data is available.
        m.select(1.0)

    # Cleanup
    for c in m.handles:
        if c.fp is not None:
            c.fp.close()
            c.fp = None
        c.close()
    m.close()
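
For reference, given the usage string in the header, the script would be run along the lines of python retriever-multi.py urls.txt 100, where urls.txt holds one URL per line (the filename and the connection count here are just examples). Passing - instead of a filename makes it read the URLs from stdin.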

Another simple spider that uses BeautifulSoup and urllib2. Nothing too sophisticated: it just reads all the hrefs, builds a list, and works through it.

pyspider.py
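
The original pyspider.py is not reproduced here; the following is only a minimal sketch of the approach described above (BeautifulSoup + urllib2, collect every href into a list and work through it). The start URL and the page limit are placeholders.

    # Minimal breadth-first spider sketch (Python 2, BeautifulSoup 3.x).
    # Not the original pyspider.py; the start URL below is a placeholder.
    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup

    def crawl(start_url, max_pages=50):
        to_visit = [start_url]
        seen = set()
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib2.urlopen(url).read()
            except Exception, err:
                print "failed:", url, err
                continue
            print "visited:", url
            soup = BeautifulSoup(html)
            for a in soup.findAll('a', href=True):
                # Resolve relative links against the current page.
                link = urlparse.urljoin(url, a['href'])
                if link.startswith('http') and link not in seen:
                    to_visit.append(link)

    if __name__ == '__main__':
        crawl('http://www.example.com/')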