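Using BeautifulSoup to find an HTML tag that contains certain text

I am trying to get the elements in an HTML document that contain text matching the following pattern: #\S{11}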

<h2> this is cool #12345678901 </h2>

So, previously I would match these by using:

 soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

 [u'blahblah #223409823523', u'thisisinteresting #293845023984']

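I am able to get all of the text that matches (see the line above). But I want the parent element of the matching text, so I can use it as a starting point for traversing the document tree. In this case, I want all of the h2 elements to be returned, not the text matches.

Ideas?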

    from BeautifulSoup import BeautifulSoup
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    for elem in soup(text=re.compile(r' #\S{11}')):
        print elem.parent

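Prints (the h1 is included as well, because the search is not restricted to h2 and its text also matches the pattern):

    <h2>this is cool #12345678901</h2>
    <h1>foo #126666678901</h1>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>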
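The answer above targets BeautifulSoup 3 on Python 2. As a rough sketch of the same idea on Python 3, assuming the `bs4` package (version 4.4.0 or later, where the `string=` argument is available) and the built-in `html.parser`, one could collect the parents of the matching text nodes and then keep only the `h2` tags. The variable names `parents` and `h2_tags` are illustrative and not from the original post.

```python
import re

from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# "html.parser" ships with Python, so no extra parser library is needed.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# string= yields the matching text nodes; .parent is the enclosing tag.
parents = [node.parent for node in soup.find_all(string=pattern)]

# Keep only the <h2> parents, since the <h1> text matches the pattern too.
h2_tags = [tag for tag in parents if tag.name == "h2"]

for tag in h2_tags:
    print(tag)
```

Each item in `h2_tags` is a regular tag object, so it can serve as the starting point for further traversal of the tree.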

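BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.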

    from BeautifulSoup import BeautifulSoup
    from pprint import pprint
    import re

    html_text = """
    <h2>this is cool #12345678901</h2>
    <h2>this is nothing</h2>
    <h2>this is interesting #126666678901</h2>
    <h2>this is blah #124445678901</h2>
    """

    soup = BeautifulSoup(html_text)

    # Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
    pattern = re.compile(r'cool')

    pprint(soup.find(text=pattern).__dict__)
    #>> {'next': u'\n',
    #>>  'nextSibling': None,
    #>>  'parent': <h2>this is cool #12345678901</h2>,
    #>>  'previous': <h2>this is cool #12345678901</h2>,
    #>>  'previousSibling': None}

    print soup.find('h2')
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern)
    #>> this is cool #12345678901
    print soup.find('h2', text=pattern).parent
    #>> <h2>this is cool #12345678901</h2>
    print soup.find('h2', text=pattern) == soup.find('h2')
    #>> False
    print soup.find('h2', text=pattern) == soup.find('h2').text
    #>> True
    print soup.find('h2', text=pattern).parent == soup.find('h2')
    #>> True
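The session above reflects BeautifulSoup 3. For comparison, here is a small sketch of how the two kinds of result look under `bs4` on Python 3, assuming version 4.4.0 or later and the built-in `html.parser`. To my understanding, a text-only search still returns a `NavigableString`, while combining a tag name with `string=` returns the tag itself, provided the tag has a single text child. The names `text_node` and `tag` are illustrative.

```python
import re

from bs4 import BeautifulSoup, NavigableString, Tag

html_text = "<h2>this is cool #12345678901</h2><h2>this is nothing</h2>"
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"cool")

# Without a tag name, the result is the matching text node.
text_node = soup.find(string=pattern)
print(type(text_node).__name__, repr(text_node.parent))

# With a tag name, bs4 returns the tag whose .string matches the pattern.
tag = soup.find("h2", string=pattern)
print(type(tag).__name__, repr(tag))
```

Under these assumptions, `soup.find_all("h2", string=...)` gives the `h2` elements directly, so the `.parent` step is only needed when searching by text alone.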
