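How do I make urllib2 requests through Tor in Python?

I'm trying to crawl websites with a crawler written in Python. I want to integrate Tor with Python so that I can crawl the sites anonymously through Tor.

I tried the following, but it doesn't seem to work. I checked my IP from Python, and it is still the same as it was before I started using Tor.

    import urllib2
    proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)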

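You are trying to connect to a SOCKS port, and Tor rejects any non-SOCKS traffic. You can connect through a middleman, Privoxy, which listens on port 8118.

Example:

    proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    print opener.open('http://www.google.com').read()

Also note what is passed to ProxyHandler: the key is the URL scheme ("http"), and the value is the bare ip:port with no http prefix.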
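The point about the ProxyHandler keys can be checked without sending any traffic. This is a minimal sketch, assuming Python 3 (where urllib2 became urllib.request) and Privoxy on 127.0.0.1:8118 as in the example above; the helper name make_proxy_opener is made up for illustration.

```python
import urllib.request


def make_proxy_opener(proxy_address="127.0.0.1:8118"):
    """Return (opener, handler) that sends http:// requests via an HTTP proxy."""
    # The dictionary key is the URL scheme to be proxied ("http"),
    # not the transport, so a "tcp" key never matches an http:// URL.
    handler = urllib.request.ProxyHandler({"http": proxy_address})
    return urllib.request.build_opener(handler), handler


opener, handler = make_proxy_opener()
print(handler.proxies)

# The mapping from the question registers nothing for http:// URLs,
# which is why the IP did not change.
ignored = urllib.request.ProxyHandler({"tcp": "http://127.0.0.1:9050"})
print(hasattr(handler, "http_open"), hasattr(ignored, "http_open"))
```

Nothing is fetched here; calling opener.open(...) would still need Privoxy running and forwarding to Tor.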

    pip install PySocks

Then:

    import socket
    import socks
    import urllib2

    ipcheck_url = 'http://checkip.amazonaws.com/'

    # Actual IP.
    print(urllib2.urlopen(ipcheck_url).read())

    # Tor IP.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
    socket.socket = socks.socksocket
    print(urllib2.urlopen(ipcheck_url).read())

Just using urllib2.ProxyHandler as in https://stackoverflow.com/a/2015649/895245 fails with:

    Tor is not an HTTP Proxy

Mentioned at: How can I use a SOCKS 4/5 proxy with urllib2?

Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.

Using Privoxy as an HTTP proxy in front of Tor works for me. Here is a crawler template:

    import urllib2
    import httplib
    from optparse import OptionParser
    from time import sleep

    from BeautifulSoup import BeautifulSoup


    class Scraper(object):
        def __init__(self, options, args):
            if options.proxy is None:
                options.proxy = "http://localhost:8118/"
            self._open = self._get_opener(options.proxy)

        def _get_opener(self, proxy):
            proxy_handler = urllib2.ProxyHandler({'http': proxy})
            opener = urllib2.build_opener(proxy_handler)
            return opener.open

        def get_soup(self, url):
            # Keep retrying (with a 1 second pause) until the page parses.
            soup = None
            while soup is None:
                try:
                    request = urllib2.Request(url)
                    request.add_header('User-Agent', 'foo bar useragent')
                    soup = BeautifulSoup(self._open(request))
                except (httplib.IncompleteRead, httplib.BadStatusLine,
                        urllib2.HTTPError, ValueError, urllib2.URLError), err:
                    sleep(1)
            return soup


    class PageType(Scraper):
        _URL_TEMPL = "http://foobar.com/baz/%s"

        def items_from_page(self, url):
            nextpage = None
            soup = self.get_soup(url)
            items = []
            for item in soup.findAll("foo"):
                items.append(item["bar"])
                nextpage = item["href"]
            return nextpage, items

        def get_items(self):
            nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
            while nextpage is not None:
                nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
                items.extend(newitems)
            return items


    # Assumed option parsing: the template left options and args undefined.
    parser = OptionParser()
    parser.add_option("--proxy", default=None)
    options, args = parser.parse_args()

    pt = PageType(options, args)
    print pt.get_items()
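The get_soup loop in the template retries forever, which can hang a crawl if a Tor circuit or the site stays down. One possible variation is a bounded retry. This is only a sketch, assuming Python 3; fetch_with_retries and flaky_fetch are hypothetical names, and the fetch callable stands in for whatever actually opens the URL.

```python
import http.client
import time


def fetch_with_retries(fetch, url, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except (OSError, ValueError, http.client.HTTPException) as error:
            # In Python 3, urllib.error.URLError and HTTPError derive from OSError.
            last_error = error
            sleep(delay)
    raise last_error


# Stand-in fetcher that fails twice before succeeding (no network involved).
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise OSError("temporary failure")
    return "<html>ok</html>"


page = fetch_with_retries(flaky_fetch, "http://foobar.com/baz/start.html",
                          sleep=lambda seconds: None)
print(page, len(calls))
```

Passing sleep as a parameter is just a convenience so the pause can be skipped in tests.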

Here is code for downloading a file through the Tor proxy in Python (update the url):

    import urllib2

    url = "data/media/17/Donald_Duck2.gif"

    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

The code below works on Python 3.4.

(You need to keep Tor Browser open to use this code.)

This script connects to Tor via SOCKS5, gets the IP from checkip.dyn.com, changes identity and resends the request to get a new IP (it loops 10 times).

You need to install the appropriate libraries to make this work. (Enjoy, and don't abuse it.)

    import socks
    import socket
    import time
    from stem.control import Controller
    from stem import Signal
    import requests
    from bs4 import BeautifulSoup

    err = 0
    counter = 0
    url = "checkip.dyn.com"
    with Controller.from_port(port = 9151) as controller:
        try:
            controller.authenticate()
            socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
            socket.socket = socks.socksocket
            while counter < 10:
                r = requests.get("http://checkip.dyn.com")
                soup = BeautifulSoup(r.content)
                print(soup.find("body").text)
                counter = counter + 1
                # wait till next identity will be available
                controller.signal(Signal.NEWNYM)
                time.sleep(controller.get_newnym_wait())
        except requests.HTTPError:
            print("Could not reach URL")
            err = err + 1
    print("Used " + str(counter) + " IPs and got " + str(err) + " errors")
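The script above assumes Tor Browser is listening on 9150 (SOCKS) and 9151 (control), while other answers in this thread use 9050, the SOCKS port of a standalone tor daemon. A quick way to see which of them is actually open is sketched below, using only the standard library; is_port_open is a hypothetical helper name. It should run before socket.socket is replaced with socks.socksocket, since afterwards the check itself would go through the proxy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the answers in this thread.
candidates = [
    ("Tor Browser SOCKS", 9150),
    ("Tor Browser control", 9151),
    ("tor daemon SOCKS", 9050),
]
for name, port in candidates:
    state = "open" if is_port_open("127.0.0.1", port) else "closed"
    print("%-20s %5d %s" % (name, port, state))
```

An open port only shows that something is listening there; it does not prove the listener is Tor.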

The following solutions work for Python 3. Adapted from CiroSantilli's answer:

With urllib (the name of urllib2 in Python 3):

    import socks
    import socket
    from urllib.request import urlopen

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = urlopen(url)
    print(response.read())

With requests:

    import socks
    import socket
    import requests

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = requests.get(url)
    print(response.text)

With Selenium + PhantomJS:

    from selenium import webdriver

    url = 'http://icanhazip.com/'

    service_args = [
        '--proxy=localhost:9150',
        '--proxy-type=socks5',
    ]
    phantomjs_path = '/your/path/to/phantomjs'

    driver = webdriver.PhantomJS(
        executable_path=phantomjs_path,
        service_args=service_args)

    driver.get(url)
    print(driver.page_source)
    driver.close()

Note: If you plan to use Tor often, consider making a donation to support their awesome work!

Update: the latest requests library (v2.10.0 and above) supports SOCKS proxies, with requests[socks] as an additional requirement.

Installation

    pip install requests requests[socks]

Basic usage

    import requests
    session = requests.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    print requests.get("http://httpbin.org/ip").text
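One detail that may matter for anonymity: with requests, the socks5 scheme resolves host names locally, while socks5h asks the proxy (here Tor) to resolve them. The sketch below only builds the proxies dictionary, so it runs without requests or Tor installed; tor_proxies is a hypothetical helper name, the defaults are the address and port used above, and it is written for Python 3.

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=True):
    """Build a requests-style proxies dict for a Tor SOCKS port."""
    # socks5h: host names are resolved by the proxy; socks5: resolved locally.
    scheme = "socks5h" if remote_dns else "socks5"
    address = "%s://%s:%d" % (scheme, host, port)
    return {"http": address, "https": address}


print(tor_proxies())
print(tor_proxies(port=9150, remote_dns=False))

# Possible use (needs requests[socks] installed and Tor running):
#   import requests
#   print(requests.get("http://httpbin.org/ip", proxies=tor_proxies()).text)
```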

Old answer: even though this is an old post, I'm answering because no one seems to have mentioned the requesocks library.

It is basically a port of the requests library. Please note that it is an old fork (last updated 2013-03-25) and may not have the same functionality as the latest requests library.

Installation

    pip install requesocks

Basic usage

    # Assuming that Tor is up & running
    import requesocks
    session = requesocks.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    import requests
    print requests.get("http://httpbin.org/ip").text

Maybe you're having some network connectivity issues? The above script worked for me (I substituted a different URL, http://stackoverflow.com/), and I got the expected page:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
    <html>
    <head>
    <title>Stack Overflow</title>
    <link rel="stylesheet" href="/content/all.css?v=3856">

(and so on.)

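Tor is a SOCKS proxy. Connecting to it directly with the example you cite fails with "urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy". As others have mentioned, you can get around this with Privoxy.

Alternatively, you can also use PycURL or SocksiPy. For examples of using both with Tor, see...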

https://stem.torproject.org/tutorials/to_russia_with_love.html
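To make the "Tor is not an HTTP Proxy" error more concrete, the sketch below contrasts the first bytes an HTTP proxy client sends with what a SOCKS5 client sends. The byte layout follows the SOCKS5 specification (RFC 1928); the function names are hypothetical, example.com is a placeholder host, and nothing is sent over the network.

```python
import struct


def socks5_greeting():
    """Version 5, one auth method offered, method 0x00 (no authentication)."""
    return b"\x05\x01\x00"


def socks5_connect_request(host, port):
    """CONNECT request addressed by domain name (address type 0x03)."""
    name = host.encode("ascii")
    if len(name) > 255:
        raise ValueError("host name too long for SOCKS5")
    return (b"\x05\x01\x00\x03" + bytes([len(name)]) + name
            + struct.pack(">H", port))


# What urllib's ProxyHandler would send to the proxy instead: plain-text HTTP.
http_proxy_request = b"GET http://example.com/ HTTP/1.1\r\nHost: example.com\r\n\r\n"

print(socks5_greeting())
print(socks5_connect_request("example.com", 80))
print(http_proxy_request[:3])
```

A SOCKS port expects the binary greeting first, so an HTTP request line arriving there is rejected. Libraries such as PySocks perform this handshake for you.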

You can use torify.

Run your program with:

    ~$ torify python your_program.py