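Retrieve links from a web page using Python and BeautifulSoup

How can I retrieve the links of a web page and copy the URL addresses of those links using Python?

Here's a short snippet using the SoupStrainer class in BeautifulSoup: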

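    import httplib2
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    # Parse only the <a> tags instead of building the whole document tree.
    for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
        # has_key() is the BeautifulSoup 3 spelling; BeautifulSoup 4 calls it has_attr().
        if link.has_key('href'):
            print link['href']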

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

http://www.crummy.com/software/BeautifulSoup/documentation.html

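Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) when you know in advance what you are parsing.

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else where anything that isn't pure Python is not allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and XPath would look like this: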

    import urllib
    import lxml.html

    connection = urllib.urlopen('http://www.nytimes.com')
    dom = lxml.html.fromstring(connection.read())

    # Select the URL in href for all <a> tags (links).
    for link in dom.xpath('//a/@href'):
        print link

For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:

    from bs4 import BeautifulSoup
    import urllib2

    resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
    soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

    for link in soup.find_all('a', href=True):
        print link['href']

Or the Python 3 version:

    from bs4 import BeautifulSoup
    import urllib.request

    resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

    for link in soup.find_all('a', href=True):
        print(link['href'])

And a version using the requests library, which as written works in both Python 2 and Python 3:

    from bs4 import BeautifulSoup
    from bs4.dammit import EncodingDetector
    import requests

    resp = requests.get("http://www.gpsbasecamp.com/national-parks")
    # Trust resp.encoding only when the server actually sent a charset.
    http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    # An encoding declared inside the HTML itself takes precedence.
    html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
    encoding = html_encoding or http_encoding
    soup = BeautifulSoup(resp.content, from_encoding=encoding)

    for link in soup.find_all('a', href=True):
        print(link['href'])

The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
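To make the effect of href=True concrete, here is a minimal sketch that runs against an inline HTML string rather than a live page. The markup and the variable names are made up for illustration, and it assumes BeautifulSoup 4 is installed with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle anchor has a name but no href.
html = """
<a href="/one">one</a>
<a name="anchor-only">no href here</a>
<a href="https://example.com/two">two</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True matches only tags that carry an href attribute, whatever its value.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
all_anchors = soup.find_all("a")

print(hrefs)             # ['/one', 'https://example.com/two']
print(len(all_anchors))  # 3
```

Without href=True, indexing a["href"] on the second anchor would raise a KeyError, which is why the filter is worth having.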

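BeautifulSoup 3 stopped development in March 2012; new projects really should always use BeautifulSoup 4.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can tell BeautifulSoup which character set was found in the HTTP response headers to assist in decoding, but that value can be wrong and can conflict with the <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding(): it makes sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even when no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.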
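The precedence described here (an encoding declared in the HTML first, then an explicit header charset, otherwise nothing) can be sketched with the standard library alone. This is only an illustration: choose_encoding is a hypothetical helper, and its regular expressions are a deliberately simplified stand-in for what EncodingDetector.find_declared_encoding() does.

```python
import re

def choose_encoding(content_type, body):
    """Return the encoding to hand to the parser, or None to let it guess.

    content_type: value of the Content-Type header (may be None)
    body: raw response bytes
    """
    # 1. A charset declared near the top of the HTML wins.
    meta = re.search(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_-]+)',
                     body[:4096], re.IGNORECASE)
    if meta:
        return meta.group(1).decode("ascii").lower()
    # 2. Otherwise trust the header only if it names a charset explicitly.
    header = re.search(r'charset=["\']?([A-Za-z0-9_-]+)',
                       content_type or "", re.IGNORECASE)
    if header:
        return header.group(1).lower()
    # 3. No hint at all: do not fall back to a Latin-1 default.
    return None

page = b'<html><head><meta charset="utf-8"></head><body>hi</body></html>'
print(choose_encoding("text/html; charset=ISO-8859-1", page))      # utf-8
print(choose_encoding("text/html; charset=ISO-8859-1", b"<p>hi"))  # iso-8859-1
print(choose_encoding("text/html", b"<p>hi"))                      # None
```

Returning None in the last case mirrors the advice above: with no charset in the header, the parser is left to detect the encoding itself.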

    import urllib2
    import BeautifulSoup

    request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
    response = urllib2.urlopen(request)
    soup = BeautifulSoup.BeautifulSoup(response)

    # href=True skips anchors without an href, which would raise KeyError below.
    for a in soup.findAll('a', href=True):
        if 'national-park' in a['href']:
            print 'found a url with national-park in the link'

The following code retrieves all the links available in a web page using urllib2 and BeautifulSoup 4:

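    import urllib2
    from bs4 import BeautifulSoup

    # Download the page and keep the raw HTML.
    html = urllib2.urlopen("http://www.espncricinfo.com/").read()
    soup = BeautifulSoup(html)

    for line in soup.find_all('a'):
        # .get() returns None instead of raising when an anchor has no href.
        print(line.get('href'))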

BeautifulSoup 4 can now use lxml as its underlying parser. Requests, lxml, and list comprehensions make a killer combo.

    import requests
    import lxml.html

    dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

    [x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

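In the list comprehension, the condition if '//' in x and 'nytimes.com' not in x is a simple way to scrub the site's own "internal" navigation URLs and the like from the list.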
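The substring test is quick, but it also drops any external URL that merely mentions the site name somewhere in its path. A slightly stricter variant, sketched here with the standard library only, compares the host part of each URL instead. The function name external_links and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urlparse

def external_links(hrefs, site_domain):
    """Keep only hrefs whose host is neither site_domain nor one of its subdomains."""
    result = []
    for href in hrefs:
        host = urlparse(href).netloc.lower()
        # Relative links and mailto: links have no host and are skipped.
        if host and not (host == site_domain or host.endswith("." + site_domain)):
            result.append(href)
    return result

# Hypothetical hrefs as an XPath query like //a/@href might return them.
hrefs = [
    "/section/world",
    "https://www.nytimes.com/about",
    "//cdn.example.org/app.js",
    "https://example.com/story",
    "mailto:someone@example.com",
]
filtered = external_links(hrefs, "nytimes.com")
print(filtered)  # ['//cdn.example.org/app.js', 'https://example.com/story']
```

Note that netloc still includes a port or user info when the URL has one; this sketch does not strip them.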

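To find all the links, this example uses the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.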

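    import urllib2
    import re

    url = "http://www.somewhere.com"  # placeholder; substitute the page you want to scan

    # Connect to a URL.
    website = urllib2.urlopen(url)

    # Read the HTML code.
    html = website.read()

    # Use re.findall to get all the links. The scheme group is non-capturing,
    # so the result is a list of URL strings rather than a list of tuples.
    links = re.findall('"((?:http|ftp)s?://.*?)"', html)

    print links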
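The difference between re.search() and re.findall() is easiest to see on a small inline string. The sketch below (Python 3, with made-up sample HTML) also shows one detail worth knowing: when a pattern contains more than one capturing group, such as "((http|ftp)s?://.*?)", re.findall() returns a list of tuples rather than a list of strings.

```python
import re

# Hypothetical sample markup with two absolute links.
html = '<a href="http://example.com/a">A</a> <a href="https://example.org/b">B</a>'

# re.search() stops at the first match.
first = re.search(r'"((?:http|ftp)s?://.*?)"', html)
print(first.group(1))  # http://example.com/a

# re.findall() with a single capturing group returns a list of strings.
one_group = re.findall(r'"((?:http|ftp)s?://.*?)"', html)
print(one_group)  # ['http://example.com/a', 'https://example.org/b']

# With two capturing groups, each match becomes a tuple (whole URL, scheme prefix).
two_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(two_groups)
# [('http://example.com/a', 'http'), ('https://example.org/b', 'http')]
```

Turning the inner group into a non-capturing one with (?:...) is the usual way to get plain strings back.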

Just to get the links, without BeautifulSoup or regular expressions:

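    import urllib2

    url = "http://www.somewhere.com"
    page = urllib2.urlopen(url)

    # Split on closing anchor tags so each chunk holds at most one link.
    data = page.read().split("</a>")
    tag = "<a href=\""
    endtag = "\">"

    for item in data:
        if "<a href" in item:
            try:
                ind = item.index(tag)
                item = item[ind + len(tag):]
                end = item.index(endtag)
            except ValueError:
                # str.index() raises ValueError when a marker is missing; skip the chunk.
                pass
            else:
                print item[:end]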

For more complex operations, BeautifulSoup is of course still preferred.
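If the goal is only to avoid third-party packages, the standard library's html.parser module is a middle ground between string splitting and a full BeautifulSoup tree. The following Python 3 sketch is one possible way to do it; the LinkCollector class name and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

# Hypothetical markup: mixed case, extra attributes, and one anchor without href.
html = ('<p><a class="nav" href="/a">A</a> '
        '<A HREF="http://example.com/b">B</A> '
        '<a name="x">no href</a></p>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['/a', 'http://example.com/b']
```

Unlike the split-based approach, this copes with attributes placed before href and with upper-case tags.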

This script does what you're looking for, and it also resolves relative links to absolute links.

    import urllib
    import lxml.html
    import urlparse

    def get_dom(url):
        connection = urllib.urlopen(url)
        return lxml.html.fromstring(connection.read())

    def get_links(url):
        # Pass the list itself, not a generator: guess_root() would otherwise
        # consume the first links before resolve_links() could yield them.
        return resolve_links(get_dom(url).xpath('//a/@href'))

    def guess_root(links):
        # Use the first absolute link to guess the site's scheme and host.
        for link in links:
            if link.startswith('http'):
                parsed_link = urlparse.urlparse(link)
                scheme = parsed_link.scheme + '://'
                netloc = parsed_link.netloc
                return scheme + netloc

    def resolve_links(links):
        root = guess_root(links)
        for link in links:
            if not link.startswith('http'):
                link = urlparse.urljoin(root, link)
            yield link

    for link in get_links('http://www.google.com'):
        print link

Why not use regular expressions:

    import urllib2
    import re

    url = "http://www.somewhere.com"
    page = urllib2.urlopen(url)
    page = page.read()

    # Each match is a tuple: (href value, HTML between <a ...> and </a>).
    links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
    for link in links:
        print('href: %s, HTML text: %s' % (link[0], link[1]))

BeautifulSoup's own parser can be slow. It may be more practical to use lxml, which can parse directly from a URL (with some limitations mentioned below).

    import lxml.html

    doc = lxml.html.parse(url)

    links = doc.xpath('//a[@href]')

    for link in links:
        print link.attrib['href']

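The code above returns the links as they are, and in most cases they will be relative links or paths that are absolute from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern such as *.mp3. It won't handle single and double dots in relative paths, but so far I haven't needed that. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.

NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't follow redirects, so for this reason the version below uses urllib2 + lxml.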

    #!/usr/bin/env python
    import sys
    import urllib2
    import urlparse
    import lxml.html
    import fnmatch

    try:
        import urltools as urltools
    except ImportError:
        sys.stderr.write('To normalize URLs run: `pip install urltools --user`\n')
        urltools = None


    def get_host(url):
        p = urlparse.urlparse(url)
        return "{}://{}".format(p.scheme, p.netloc)


    if __name__ == '__main__':
        url = sys.argv[1]
        host = get_host(url)
        # The optional second argument is a glob pattern; default to matching everything.
        glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

        doc = lxml.html.parse(urllib2.urlopen(url))
        links = doc.xpath('//a[@href]')

        for link in links:
            href = link.attrib['href']

            if fnmatch.fnmatch(href, glob_patt):

                # The original tuple was missing a comma before 'ftp://'.
                if not href.startswith(('http://', 'https://', 'ftp://')):

                    if href.startswith('/'):
                        href = host + href
                    else:
                        parent_url = url.rsplit('/', 1)[0]
                        href = urlparse.urljoin(parent_url, href)

                if urltools:
                    href = urltools.normalize(href)

                print href

Usage is as follows:

    getlinks.py http://stackoverflow.com/a/37758066/191246
    getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
    getlinks.py http://fakedomain.mu/somepage.html "*.mp3"

    import urllib2
    from bs4 import BeautifulSoup

    a = urllib2.urlopen('http://dir.yahoo.com')
    code = a.read()
    soup = BeautifulSoup(code)
    links = soup.findAll("a")

    # To get the href part alone
    print links[0].attrs['href']

Here's an example that builds on @ars's accepted answer and uses the BeautifulSoup 4, requests, and wget modules to handle the downloads.

    import requests
    import wget
    import os

    from bs4 import BeautifulSoup, SoupStrainer

    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
    file_type = '.tar.gz'

    response = requests.get(url)

    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                full_path = url + link['href']
                wget.download(full_path)

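I found that the answer by @Blairg23 works after the following correction (it covers the scenario in which the original failed to work correctly):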

    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                # The urlparse module needs to be imported.
                full_path = urlparse.urljoin(url, link['href'])
                wget.download(full_path)

For Python 3:

urllib.parse.urljoin has to be used instead to obtain the full URL.
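A short Python 3 sketch of why urljoin is safer than plain string concatenation when building the full URL. The base URL and file names are hypothetical and no download is performed.

```python
from urllib.parse import urljoin

# Hypothetical directory listing URL (note the trailing slash).
base = "http://example.com/data/files/"

# A bare file name: concatenation and urljoin agree.
plain = urljoin(base, "a.tar.gz")
print(plain)        # http://example.com/data/files/a.tar.gz

# A root-relative href: concatenation produces a wrong path, urljoin does not.
root_relative = urljoin(base, "/root.tar.gz")
print(base + "/root.tar.gz")  # http://example.com/data/files//root.tar.gz
print(root_relative)          # http://example.com/root.tar.gz

# A parent-relative href is resolved against the base directory.
parent_relative = urljoin(base, "../up.tar.gz")
print(parent_relative)        # http://example.com/data/up.tar.gz

# An href that is already absolute is returned unchanged.
absolute = urljoin(base, "http://other.example.org/x.tar.gz")
print(absolute)               # http://other.example.org/x.tar.gz
```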
