Beautiful Soup 4 find_all does not find links that Beautiful Soup 3 finds

I've noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here is a reproducible example of the problem:

    import requests
    import bs4
    import BeautifulSoup

    r = requests.get('http://wordpress.org/download/release-archive/')
    s4 = bs4.BeautifulSoup(r.text)
    s3 = BeautifulSoup.BeautifulSoup(r.text)

    print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
    print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

    With BeautifulSoup 4 : 557
    With BeautifulSoup 3 : 1701

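As you can see, the difference is far from small: BeautifulSoup 4 reports roughly a third of the links that BeautifulSoup 3 finds.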

Here are the exact versions of the modules, in case anyone is wondering:

    In [20]: bs4.__version__
    Out[20]: '4.2.1'

    In [21]: BeautifulSoup.__version__
    Out[21]: '3.2.1'

You have lxml installed, which means BeautifulSoup 4 will use that parser in preference to the standard-library html.parser option.

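You can upgrade lxml to 3.2.1 (with that version, your test page returns 1701 results for me). lxml itself uses libxml2 and libxslt, which may also be to blame here, so you may have to upgrade those instead or as well. See the lxml requirements page; libxml2 2.7.8 or newer is currently recommended.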

Or explicitly specify the other parser when parsing the soup:

    s4 = bs4.BeautifulSoup(r.text, 'html.parser')
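To see whether the parser is really what changes the count, it can help to parse the same markup with each parser you have installed and compare the results. The following is only a minimal sketch under a few assumptions: it is written for Python 3 with bs4 installed, the helper name `count_links_by_parser` and the sample HTML are made up for illustration, and parsers that are not installed are simply skipped.

```python
import bs4


def count_links_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <a> tags found}, skipping missing parsers."""
    counts = {}
    for name in parsers:
        try:
            soup = bs4.BeautifulSoup(html, name)
        except bs4.FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


if __name__ == "__main__":
    # Hypothetical sample markup; substitute r.text from the question.
    sample = '<p><a href="/one">one</a> <a href="/two">two</a></p>'
    for name, count in count_links_by_parser(sample).items():
        print("{}: {} links".format(name, count))
```

If the counts differ on your real page, the parser (or the libxml2 build behind lxml) is the likely culprit rather than BeautifulSoup 4 itself.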
