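How do I download a file over HTTP using Python?

I have a small utility that downloads an MP3 from a website on a schedule and then builds/updates a podcast XML file, which I have added to iTunes.

The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3. I would prefer to have the entire utility written in Python.

I struggled to find a way to actually download the file in Python, which is why I resorted to wget.

So, how do I download the file using Python?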

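In Python 2, use urllib2, which comes with the standard library.

    import urllib2
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()

This is the most basic way to use the library, minus any error handling. You can also do more complex things, such as changing headers. See the urllib2 documentation for details.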
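The snippet above leaves out error handling. As a minimal sketch of what adding it might look like in Python 3 (where urllib2 became urllib.request and urllib.error), the hypothetical helper `fetch` below catches the two exception types urlopen most commonly raises. The function name, the timeout value, and the choice to return None on failure are assumptions made for illustration, not part of the original answer.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper: the name, the default timeout and the
    return-None-on-failure policy are illustrative assumptions.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError subclasses URLError, so it has to be caught first.
        print("HTTP error {}: {}".format(err.code, err.reason))
    except urllib.error.URLError as err:
        # No usable answer at all: DNS failure, refused connection, bad path.
        print("Failed to reach the server: {}".format(err.reason))
    return None


# Usage (performs a network request):
#     html = fetch("http://www.example.com/")
```

Returning None keeps the call site short; a larger program would probably let the exception propagate or log it instead of printing.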

Alternatively, use urlretrieve:

    import urllib
    urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

(For Python 3+, use 'import urllib.request' and urllib.request.urlretrieve.)

Another one, with a "progress bar":

    import urllib2

    url = "http://download.thinkbroadband.com/10MB.zip"

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

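In 2012, use the Python requests library:

    >>> import requests
    >>>
    >>> url = "http://download.thinkbroadband.com/10MB.zip"
    >>> r = requests.get(url)
    >>> print len(r.content)
    10485760

You can run pip install requests to get it.

Requests has many advantages over the alternatives because its API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are quite unintuitive and painful in that case.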

2015-12-30

People have expressed admiration for the progress bar. It's cool, sure. There are now several off-the-shelf solutions, including tqdm:

    from tqdm import tqdm
    import requests

    url = "http://download.thinkbroadband.com/10MB.zip"
    response = requests.get(url, stream=True)

    with open("10MB", "wb") as handle:
        for data in tqdm(response.iter_content()):
            handle.write(data)

This is essentially the implementation described 30 months ago.

    import urllib2

    mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
    with open('test.mp3','wb') as output:
        output.write(mp3file.read())

The wb in open('test.mp3','wb') opens a file in binary mode (and erases any existing file), so you can use it to save any data, not just text.
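The snippet above reads the whole response into memory with mp3file.read() before writing it out. As a hedged Python 3 sketch of the same idea that copies the body in chunks instead, the hypothetical helper `download` below uses shutil.copyfileobj. The function name and the 64 KiB chunk size are assumptions chosen for illustration.

```python
import shutil
import urllib.request


def download(url, dest, chunk_size=64 * 1024):
    """Copy the body at `url` into the file `dest` in fixed-size chunks.

    Hypothetical helper: the name and the 64 KiB chunk size are
    illustrative assumptions.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # "wb" truncates any existing file and writes raw bytes, so binary
        # data such as MP3 audio is stored unchanged.
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Usage (performs a network request):
#     download("http://www.example.com/songs/mp3.mp3", "test.mp3")
```

Because only one chunk is held at a time, memory use stays roughly constant regardless of the file size.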

Here is how to do it in Python 3 using the standard library:

urllib.request.urlopen

    import urllib.request
    response = urllib.request.urlopen('http://www.example.com/')
    html = response.read()

urllib.request.urlretrieve

    import urllib.request
    urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')

An improved version of PabloG's code for Python 2/3:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from __future__ import (division, absolute_import, print_function, unicode_literals)

    import sys, os, tempfile, logging

    if sys.version_info >= (3,):
        import urllib.request as urllib2
        import urllib.parse as urlparse
    else:
        import urllib2
        import urlparse

    def download_file(url, dest=None):
        """
        Download and save a file specified by url to dest directory,
        """
        u = urllib2.urlopen(url)

        scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
        filename = os.path.basename(path)
        if not filename:
            filename = 'downloaded.file'
        if dest:
            filename = os.path.join(dest, filename)

        with open(filename, 'wb') as f:
            meta = u.info()
            meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
            meta_length = meta_func("Content-Length")
            file_size = None
            if meta_length:
                file_size = int(meta_length[0])
            print("Downloading: {0} Bytes: {1}".format(url, file_size))

            file_size_dl = 0
            block_sz = 8192
            while True:
                buffer = u.read(block_sz)
                if not buffer:
                    break

                file_size_dl += len(buffer)
                f.write(buffer)

                status = "{0:16}".format(file_size_dl)
                if file_size:
                    status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
                status += chr(13)
                print(status, end="")
            print()

        return filename

    if __name__ == "__main__":  # Only run if this file is called directly
        print("Testing with 10MB download")
        url = "http://download.thinkbroadband.com/10MB.zip"
        filename = download_file(url)
        print(filename)

I wrote the wget library in pure Python for exactly this purpose. As of version 2.0, it is essentially urlretrieve with extra features.

Use the wget module:

    import wget
    wget.download('url')

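I agree with Corey: urllib2 is more complete than urllib and is probably the module to use if you want to do more complex things. To make the answers more complete, though, urllib is a simpler module if you only need the basics:

    import urllib
    response = urllib.urlopen('http://www.example.com/sound.mp3')
    mp3 = response.read()

This will work fine. Or, if you don't want to deal with the "response" object, you can call read() directly:

    import urllib
    mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()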

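The following are the most commonly used calls for downloading files in Python:

urllib.urlretrieve('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)

Note: urlopen and urlretrieve have been found to perform relatively poorly when downloading large files (size > 500 MB). requests.get stores the file in memory until the download is complete.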
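The note about requests.get holding the whole file in memory applies to the default, non-streaming call. As a sketch of the streaming alternative, assuming the third-party requests package is installed, the body can be written chunk by chunk with iter_content. The helper names `write_chunks` and `stream_download`, the 8192-byte chunk size, and the 30-second timeout are illustrative assumptions.

```python
def write_chunks(chunks, dest):
    """Write an iterable of byte chunks to `dest`; return the bytes written."""
    written = 0
    with open(dest, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                written += len(chunk)
    return written


def stream_download(url, dest, chunk_size=8192):
    """Download `url` to `dest` without holding the whole body in memory.

    Hypothetical helper: name, chunk size and timeout are assumptions.
    """
    import requests  # third-party: pip install requests

    # stream=True defers fetching the body until iter_content() is consumed.
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=chunk_size), dest)


# Usage (performs a network request):
#     stream_download("http://download.thinkbroadband.com/10MB.zip", "10MB.zip")
```

Passing an explicit chunk_size matters here: iter_content defaults to one byte per chunk, which is very slow for large files.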

You can also get progress feedback with urlretrieve:

    import sys
    import urllib

    def report(blocknr, blocksize, size):
        current = blocknr*blocksize
        sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))

    def downloadFile(url):
        print "\n",url
        fname = url.split('/')[-1]
        print fname
        urllib.urlretrieve(url, fname, report)
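The hook above divides by `size`, which urlretrieve passes as -1 when the server sends no Content-Length, and the final block can push the figure past 100%. A Python 3 sketch that guards both cases might look like the following. The names `percent_done` and `download_file` are hypothetical, and printing a byte count when the size is unknown is just one possible choice.

```python
import sys
import urllib.request


def percent_done(blocknr, blocksize, size):
    """Return progress as a percentage, or None when the size is unknown."""
    if size <= 0:
        # urlretrieve passes -1 when there is no Content-Length header.
        return None
    # The last block is usually partial, so clamp the result to 100.
    return min(100.0, 100.0 * blocknr * blocksize / size)


def report(blocknr, blocksize, size):
    """Reporthook for urlretrieve: rewrite one status line in place."""
    pct = percent_done(blocknr, blocksize, size)
    if pct is None:
        sys.stdout.write("\r{} bytes".format(blocknr * blocksize))
    else:
        sys.stdout.write("\r{:.2f}%".format(pct))
    sys.stdout.flush()


def download_file(url):
    """Save `url` under its last path component and return that name."""
    fname = url.split("/")[-1]
    urllib.request.urlretrieve(url, fname, report)
    return fname
```

Keeping the arithmetic in `percent_done` separates it from the printing, so it can be checked without downloading anything.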

If you have wget installed, you can use parallel_sync.

    pip install parallel_sync

    from parallel_sync import wget

    urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
    wget.download('/tmp', urls)

    # or a single file:
    wget.download('/tmp', urls[0], filenames='x.zip', extract=True)

Docs: https://pythonhosted.org/parallel_sync/pages/examples.html

This is pretty powerful. It can download files in parallel, retry on failure, and even download files on a remote machine.

A simple way that is compatible with both Python 2 and Python 3:

    from six.moves import urllib
    urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")

The source code can be:

    import urllib

    sock = urllib.urlopen("http://diveintopython.org/")
    htmlSource = sock.read()
    sock.close()
    print htmlSource

This may be a little late, but I saw pabloG's code and couldn't help adding os.system('cls') to make it look awesome! Check it out:

    import urllib2,os

    url = "http://download.thinkbroadband.com/10MB.zip"

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)
    os.system('cls')
    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

If you are running in an environment other than Windows, you will have to use something other than 'cls'. On Mac OS X and Linux it should be 'clear'.
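One way to avoid hard-coding the command is to pick it from os.name at run time. The sketch below assumes that 'cls' and 'clear' are the only two cases that matter, as the text above suggests; `clear_command` and `clear_screen` are hypothetical helper names.

```python
import os


def clear_command():
    """Return the shell command that clears the terminal on this platform."""
    # os.name is "nt" on Windows and "posix" on Linux and macOS.
    return "cls" if os.name == "nt" else "clear"


def clear_screen():
    """Clear the terminal by running the platform's clear command."""
    os.system(clear_command())
```

Calling clear_screen() in place of os.system('cls') would then let the same script run on either family of systems.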

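urlretrieve and requests.get are simple, but reality is not. I have fetched data from a couple of sites, including text and images, and the two above probably solve most tasks. For a more universal solution, though, I suggest using urlopen. Since it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing site-par

    import urllib.request

    # url, headers, filename and buffer_size are expected to be defined already
    url_request = urllib.request.Request(url, headers=headers)
    url_connect = urllib.request.urlopen(url_request)
    len_content = url_connect.length

    # remember to open file in bytes mode
    with open(filename, 'wb') as f:
        while True:
            buffer = url_connect.read(buffer_size)
            if not buffer:
                break

            # an integer value of size of written data
            data_wrote = f.write(buffer)

    # you could probably use with-open-as manner
    url_connect.close()

This answer provides a solution to HTTP 403 Forbidden when downloading a file over HTTP with Python. I have only tried the requests and urllib modules. Other modules may offer something better, but this is what I used to solve most of the problems.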
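The code above passes a `headers` mapping to Request without showing what it contains. One common reason for a 403 is that the server rejects urllib's default User-Agent. Assuming that is the cause (the original does not say), a sketch of building such a request might look like this. The helper name `build_request` and the User-Agent string are placeholders, not values from the original answer.

```python
import urllib.request

# Placeholder value: any string the target server accepts would do.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-downloader)"}


def build_request(url, headers=None):
    """Return a Request for `url` carrying the default and any extra headers."""
    merged = dict(DEFAULT_HEADERS)
    if headers:
        merged.update(headers)  # caller-supplied headers override the defaults
    return urllib.request.Request(url, headers=merged)


# Usage (performs a network request):
#     with urllib.request.urlopen(build_request(url)) as response:
#         data = response.read()
```

If the server still answers 403 with a browser-like User-Agent, the cause is likely something else, such as missing cookies or a required Referer header.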

I wrote the following, which works in vanilla Python 2 or Python 3.

    import sys

    try:
        import urllib.request
        python3 = True
    except ImportError:
        import urllib2
        python3 = False


    def progress_callback_simple(downloaded, total):
        sys.stdout.write(
            "\r" +
            (len(str(total)) - len(str(downloaded))) * " " + str(downloaded) + "/%d" % total +
            " [%3.2f%%]" % (100.0 * float(downloaded) / float(total))
        )
        sys.stdout.flush()


    def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
        def _download_helper(response, out_file, file_size):
            if progress_callback is not None:
                progress_callback(0, file_size)
            if block_size is None:
                buffer = response.read()
                out_file.write(buffer)

                if progress_callback is not None:
                    progress_callback(file_size, file_size)
            else:
                file_size_dl = 0
                while True:
                    buffer = response.read(block_size)
                    if not buffer:
                        break

                    file_size_dl += len(buffer)
                    out_file.write(buffer)

                    if progress_callback is not None:
                        progress_callback(file_size_dl, file_size)

        with open(dstfilepath, "wb") as out_file:
            if python3:
                with urllib.request.urlopen(srcurl) as response:
                    file_size = int(response.getheader("Content-Length"))
                    _download_helper(response, out_file, file_size)
            else:
                response = urllib2.urlopen(srcurl)
                meta = response.info()
                file_size = int(meta.getheaders("Content-Length")[0])
                _download_helper(response, out_file, file_size)


    import traceback

    try:
        download(
            "data/programming/projects/glLib/glLib Reloaded 0.5.9/0.5.9.zip",
            "output.zip",
            progress_callback_simple
        )
    except:
        traceback.print_exc()
        input()

Notes:

Supports a "progress bar" callback.
The download is a 4 MB test .zip from my website.

If speed matters to you, I ran a small performance test on the urllib and wget modules. For wget I tried once with the status bar and once without. I used three different 500 MB files for the test (different files, to rule out any caching going on under the hood). Tested on a Debian machine with python2.

First, these are the results (they are similar across runs):

    $ python wget_test.py
    urlretrive_test : starting
    urlretrive_test : 6.56
    ==============
    wget_no_bar_test : starting
    wget_no_bar_test : 7.20
    ==============
    wget_with_bar_test : starting
    100% [......................................................................] 541335552 / 541335552
    wget_with_bar_test : 50.49
    ==============

I ran the test using a "profile" decorator. This is the full code:

    import wget
    import urllib
    import time
    from functools import wraps

    def profile(func):
        @wraps(func)
        def inner(*args):
            print func.__name__, ": starting"
            start = time.time()
            ret = func(*args)
            end = time.time()
            print func.__name__, ": {:.2f}".format(end - start)
            return ret
        return inner

    url1 = 'http://host.com/500a.iso'
    url2 = 'http://host.com/500b.iso'
    url3 = 'http://host.com/500c.iso'

    def do_nothing(*args):
        pass

    @profile
    def urlretrive_test(url):
        return urllib.urlretrieve(url)

    @profile
    def wget_no_bar_test(url):
        return wget.download(url, out='/tmp/', bar=do_nothing)

    @profile
    def wget_with_bar_test(url):
        return wget.download(url, out='/tmp/')

    urlretrive_test(url1)
    print '=============='
    time.sleep(1)

    wget_no_bar_test(url2)
    print '=============='
    time.sleep(1)

    wget_with_bar_test(url3)
    print '=============='
    time.sleep(1)

urllib seems to be the fastest.
