BeautifulSoup returns an empty list when searching by compound class names

When I search for a compound class name using a regular expression, BeautifulSoup returns an empty list.

Example:

    import re
    from bs4 import BeautifulSoup

    bs = """
    <a class="name-single name692" href="www.example.com">Example Text</a>
    """

    bsObj = BeautifulSoup(bs, "html.parser")

    # this returns the element
    found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single.*)$"))

    # this returns an empty list
    found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single name\d*)$"))

I need the class selection to be very precise. Any ideas?

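Unfortunately, when you try to match a regular expression against a class attribute value that contains multiple classes, BeautifulSoup applies the regular expression to each class separately. Here are the relevant threads about the problem: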

  • Python regular expression for Beautiful Soup
  • Multiple CSS class search is unhandy

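This is because class is a very special multi-valued attribute. Every time you parse HTML, one of BeautifulSoup's tree builders (which one depends on the parser you choose) internally splits the class string value into a list of classes (quoting the HTMLTreeBuilder docstring):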

    # The HTML standard defines these attributes as containing a
    # space-separated list of values, not a single value. That is,
    # class="foo bar" means that the 'class' attribute has two values,
    # 'foo' and 'bar', not the single value 'foo bar'. When we
    # encounter one of these attributes, we will parse its value into
    # a list of values if possible. Upon output, the list will be
    # converted back into a string.
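To see why the second pattern from the question finds nothing, here is a minimal sketch that uses only the standard library. It does not call BeautifulSoup at all: the whitespace split is an assumption that mimics the per-class matching described above, and the variable names are made up for illustration.

```python
import re

# The attribute value exactly as written in the HTML
class_value = "name-single name692"

# Assumption: mimic the tree builder by splitting the value on whitespace
classes = class_value.split()

# The pattern from the question that returned an empty list
pattern = re.compile(r"^(name-single name\d*)$")

# Applied to each class separately, the pattern never matches,
# because no single class contains the space it requires
per_class = [bool(pattern.search(c)) for c in classes]

# Applied to the whole attribute value, it matches
whole_value = bool(pattern.search(class_value))

print(classes)      # ['name-single', 'name692']
print(per_class)    # [False, False]
print(whole_value)  # True
```

The first pattern from the question, `^(name-single.*)$`, works only because it happens to match the single class `name-single` on its own.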

There are multiple workarounds, but here is a hack: we ask BeautifulSoup not to handle class as a multi-valued attribute by using our own simple custom tree builder:

    import re
    from bs4 import BeautifulSoup
    from bs4.builder._htmlparser import HTMLParserTreeBuilder


    class MyBuilder(HTMLParserTreeBuilder):
        def __init__(self):
            super(MyBuilder, self).__init__()
            # BeautifulSoup, please don't treat "class" specially
            self.cdata_list_attributes["*"].remove("class")


    bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
    bsObj = BeautifulSoup(bs, "html.parser", builder=MyBuilder())
    found_elements = bsObj.find_all("a", class_=re.compile(r"^name\-single name\d+$"))

    print(found_elements)

In this case, the regular expression is applied to the class attribute value as a whole.
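If your installed version is recent enough, the same effect may be achievable without a custom builder. The sketch below assumes Beautiful Soup 4.8.0 or later, where the constructor accepts a `multi_valued_attributes` argument; passing `None` is documented to turn off multi-valued handling for every attribute, not only class, so treat it as a broader switch than the hack above.

```python
import re
from bs4 import BeautifulSoup

html = '<a class="name-single name692" href="www.example.com">Example Text</a>'

# Assumption: bs4 >= 4.8.0. multi_valued_attributes=None keeps every
# attribute value, including class, as a single string.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

# class is now one string rather than a list of classes
class_attr = soup.a["class"]

# so the pattern is tested against the whole attribute value
found = soup.find_all("a", class_=re.compile(r"^name-single name\d+$"))

print(class_attr)
print(found)
```

The trade-off is that ordinary searches such as `class_="name-single"` then compare against the full string too, so this suits documents where you always match on the complete class value.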


Alternatively, you can simply parse the HTML with the xml features (if applicable):

 soup = BeautifulSoup(data, "xml") 

You can also use a CSS selector to match all a elements that either have the name-single class or have a class attribute starting with "name":

 soup.select("a.name-single,a[class^=name]") 

Then you can apply the regular expression manually if needed:

    pattern = re.compile(r"^name-single name\d+$")
    for elm in bsObj.select("a.name-single,a[class^=name]"):
        match = pattern.match(" ".join(elm["class"]))
        if match:
            print(elm)

For this use case I would simply use a custom filter, like so:

    import re
    from bs4 import BeautifulSoup

    # compile once instead of on every call to the filter
    pattern = re.compile(r"^name\-single name\d+$")


    def myclassfilter(tag):
        # join the class list back into one string; tags without a class give ""
        return pattern.search(" ".join(tag.get("class", []))) is not None


    bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
    bsObj = BeautifulSoup(bs, "html.parser")
    found_elements = bsObj.find_all(myclassfilter)

    print(found_elements)