Download image files from an HTML page source using Python?

I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page.

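Here is some code that downloads all the images from the supplied URL and saves them in the specified output folder. You can modify it to suit your own needs.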

""" dumpimages.py Downloads all the images on the supplied URL, and saves them to the specified output file ("/test/" by default) Usage: python dumpimages.py http://example.com/ [output] """ from BeautifulSoup import BeautifulSoup as bs import urlparse from urllib2 import urlopen from urllib import urlretrieve import os import sys def main(url, out_folder="/test/"): """Downloads all the images at 'url' to /test/""" soup = bs(urlopen(url)) parsed = list(urlparse.urlparse(url)) for image in soup.findAll("img"): print "Image: %(src)s" % image filename = image["src"].split("/")[-1] parsed[2] = image["src"] outpath = os.path.join(out_folder, filename) if image["src"].lower().startswith("http"): urlretrieve(image["src"], outpath) else: urlretrieve(urlparse.urlunparse(parsed), outpath) def _usage(): print "usage: python dumpimages.py http://example.com [outpath]" if __name__ == "__main__": url = sys.argv[-1] out_folder = "/test/" if not url.lower().startswith("http"): out_folder = sys.argv[-1] url = sys.argv[-2] if not url.lower().startswith("http"): _usage() sys.exit(-1) main(url, out_folder)

Edit: You can now specify the output folder.

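Ryan's solution is good, but it fails if the image source URLs are absolute, or are anything else that does not give a good result when simply appended to the main page URL. urljoin recognizes absolute vs. relative URLs, so replace the loop in the middle with the following code: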

    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        image_url = urlparse.urljoin(url, image['src'])
        filename = image["src"].split("/")[-1]
        outpath = os.path.join(out_folder, filename)
        urlretrieve(image_url, outpath)
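For readers on Python 3, where the `urlparse` module became `urllib.parse`, the behaviour of `urljoin` can be checked without touching the network. This is only a minimal sketch: the helper names `resolve_image_url` and `filename_from_url` and the example URLs are made up for illustration and are not part of the answer above.

```python
from urllib.parse import urljoin, urlsplit
import posixpath


def resolve_image_url(page_url, src):
    """Return an absolute URL for an <img src> value found on page_url."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page URL.
    return urljoin(page_url, src)


def filename_from_url(image_url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(image_url).path)


# Hypothetical page and image sources, chosen only to show the three cases.
page = "http://example.com/gallery/index.html"
relative = resolve_image_url(page, "img/cat.png")
rooted = resolve_image_url(page, "/static/dog.jpg")
absolute = resolve_image_url(page, "http://cdn.example.org/bird.gif?size=2")

print(relative)
print(rooted)
print(absolute)
print(filename_from_url(absolute))
```

Taking the file name from the parsed path rather than from `src.split("/")[-1]` is a design choice of this sketch: it keeps a query string such as `?size=2` out of the saved file name.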

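You have to download the page and parse the HTML document, find your images with a regular expression, and download them. You can use urllib2 for downloading and Beautiful Soup for parsing the HTML file.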

Here is a function that downloads a single image:

    def download_photo(self, img_url, filename):
        file_path = "%s%s" % (DOWNLOADED_IMAGE_PATH, filename)
        downloaded_image = file(file_path, "wb")
        image_on_web = urllib.urlopen(img_url)
        while True:
            buf = image_on_web.read(65536)
            if len(buf) == 0:
                break
            downloaded_image.write(buf)
        downloaded_image.close()
        image_on_web.close()
        return file_path
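The function above is a Python 2 method (`file()` and `urllib.urlopen`) and relies on a `DOWNLOADED_IMAGE_PATH` constant defined elsewhere. A rough Python 3 equivalent might look like the sketch below. The split into `copy_in_chunks` and a standalone `download_photo` that takes the full target path is an assumption made here so that the copy loop can be tried on in-memory data.

```python
import io
from urllib.request import urlopen


def copy_in_chunks(source, destination, chunk_size=65536):
    """Copy a binary file-like object in fixed-size chunks.

    Returns the number of bytes copied.
    """
    total = 0
    while True:
        buf = source.read(chunk_size)
        if not buf:  # an empty read means end of stream
            break
        destination.write(buf)
        total += len(buf)
    return total


def download_photo(img_url, file_path):
    """Download img_url to file_path; both are closed even on error."""
    with urlopen(img_url) as image_on_web, open(file_path, "wb") as out:
        copy_in_chunks(image_on_web, out)
    return file_path


# Offline demonstration of the copy loop using in-memory streams.
payload = b"x" * 150000
source = io.BytesIO(payload)
destination = io.BytesIO()
copied = copy_in_chunks(source, destination)
print(copied)
```

Reading in chunks keeps memory use bounded for large images, and the `with` statement replaces the explicit `close()` calls.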

Use htmllib to extract all the img tags (override do_img), then use urllib2 to download all the images.
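Note that `htmllib` exists only in Python 2. On Python 3 the closest standard-library counterpart is `html.parser.HTMLParser`, where you override `handle_starttag` instead of `do_img`. The following is a small sketch of that idea; the class name `ImgSrcCollector` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical page fragment with three img tags, one of them without src.
sample = (
    '<html><body><img src="a.png"><p>text</p>'
    '<IMG SRC="/b.jpg" alt="b"/><img alt="no source"></body></html>'
)
parser = ImgSrcCollector()
parser.feed(sample)
print(parser.sources)
```

The collected values are still raw `src` strings, so relative ones need to be resolved against the page URL before downloading.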

If the request requires authorization, refer to this:

    r_img = requests.get(img_url, auth=(username, password))
    f = open('000000.jpg', 'wb')
    f.write(r_img.content)
    f.close()
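The snippet above writes whatever the server returns, including an error page if the credentials are rejected. A slightly more defensive variant is sketched below, assuming the `requests` library is installed. The function name `save_protected_image` is hypothetical, and passing in a session object (rather than calling `requests.get` directly) is a choice made here so that the credentials are set once and reused.

```python
def save_protected_image(session, img_url, out_path, chunk_size=65536):
    """Fetch img_url through an authenticated session and save it to out_path.

    `session` is assumed to behave like requests.Session with .auth already set.
    """
    response = session.get(img_url, stream=True, timeout=30)
    # Raise an exception on 401/403/404 etc. instead of saving an error page.
    response.raise_for_status()
    with open(out_path, "wb") as f:
        # Stream the body in chunks rather than loading it all into memory.
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    return out_path


# Intended usage (not executed here; needs requests and real credentials):
#
#     import requests
#     session = requests.Session()
#     session.auth = (username, password)
#     save_protected_image(session, img_url, "000000.jpg")
```

Setting `session.auth` to a `(username, password)` tuple gives the same HTTP Basic authentication as the `auth=` argument in the original call.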
