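BeautifulSoup returns an empty list when searching by compound class names

BeautifulSoup returns an empty list when I search for a compound class name using a regular expression.

Example: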

    import re
    from bs4 import BeautifulSoup

    bs = """
    <a class="name-single name692" href="www.example.com">Example Text</a>
    """

    bsObj = BeautifulSoup(bs, "html.parser")

    # this returns the element
    found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single.*)$"))

    # this returns an empty list
    found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single name\d*)$"))

I need the class selection to be very precise. Any ideas?

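Unfortunately, when you try to match a regular expression against a class attribute value that contains multiple classes, BeautifulSoup applies the regular expression to each class separately. Here are the related threads about the problem:

Python regular expression for Beautiful Soup
Multiple CSS class search is unhandy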

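This is because class is a very special multi-valued attribute. Every time you parse HTML, one of BeautifulSoup's tree builders (which one depends on the parser choice) internally splits a class string value into a list of classes (quoting the HTMLTreeBuilder docstring):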

    # The HTML standard defines these attributes as containing a
    # space-separated list of values, not a single value. That is,
    # class="foo bar" means that the 'class' attribute has two values,
    # 'foo' and 'bar', not the single value 'foo bar'. When we
    # encounter one of these attributes, we will parse its value into
    # a list of values if possible. Upon output, the list will be
    # converted back into a string.
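To make the docstring concrete, here is a minimal sketch of what the parsed tree looks like with default settings. The markup mirrors the question (with the stray quote removed), and the variable names are only for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the question.
html = '<a class="name-single name692" href="www.example.com">Example Text</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
class_values = link["class"]  # multi-valued attribute: parsed into a list of classes
href_value = link["href"]     # ordinary attribute: kept as a plain string

print(class_values)
print(href_value)
```

Because the class value is stored as a list, a search on class_ works on the individual class names rather than on the original attribute string.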

There are multiple workarounds, but here is a hack-ish one: we ask BeautifulSoup not to handle class as a multi-valued attribute by using our own simple custom tree builder:

    import re
    from bs4 import BeautifulSoup
    from bs4.builder._htmlparser import HTMLParserTreeBuilder


    class MyBuilder(HTMLParserTreeBuilder):
        def __init__(self):
            super().__init__()
            # BeautifulSoup, please don't treat "class" specially.
            # Work on a copy so the defaults shared by other builders stay untouched.
            attrs = {tag: set(names) for tag, names in self.cdata_list_attributes.items()}
            attrs["*"].discard("class")
            self.cdata_list_attributes = attrs


    bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
    bsObj = BeautifulSoup(bs, "html.parser", builder=MyBuilder())

    # the pattern now sees the whole attribute value
    found_elements = bsObj.find_all("a", class_=re.compile(r"^name-single name\d+$"))
    print(found_elements)

In this case, the regular expression is applied to the class attribute value as a whole.
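If you would rather not subclass a builder, the Beautiful Soup documentation describes a multi_valued_attributes constructor argument that has the same effect. The sketch below assumes a bs4 release that supports it (4.8.0 or later, as far as I know); the sample markup and variable names are illustrative:

```python
import re
from bs4 import BeautifulSoup

html = '<a class="name-single name692" href="www.example.com">Example Text</a>'

# Assumption: the installed bs4 accepts multi_valued_attributes (4.8.0+).
# Passing None turns off list-splitting for every attribute, not only class.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

raw_class = soup.a["class"]  # now a single string instead of a list
found = soup.find_all("a", class_=re.compile(r"^name-single name\d+$"))

print(raw_class)
print(found)
```

The trade-off is that other multi-valued attributes such as rel also stay unsplit for the whole document.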

Alternatively, you can parse the HTML with the "xml" features only (if applicable; this parser requires lxml to be installed):

    soup = BeautifulSoup(data, "xml")

You can also use a CSS selector to pre-select the elements that either have the name-single class or whose class attribute starts with "name":

    soup.select("a.name-single,a[class^=name]")
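A quick sketch of what that selector picks up, using made-up markup. The comma makes it a union of two selectors, so the result is a superset of the elements you actually want:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first link is the real target.
html = """
<a class="name-single name692">exact target</a>
<a class="other name-single">has the name-single class</a>
<a class="name5">class attribute starts with "name"</a>
<a class="unrelated">no match</a>
"""
soup = BeautifulSoup(html, "html.parser")

candidates = soup.select("a.name-single,a[class^=name]")
candidate_classes = [" ".join(a["class"]) for a in candidates]
print(candidate_classes)
```

Three of the four links come back, which is why an additional check on the full class string is still needed.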

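Then you can apply the regular expression manually where needed:

    pattern = re.compile(r"^name-single name\d+$")
    for elm in bsObj.select("a.name-single,a[class^=name]"):
        # rebuild the original attribute string from the list of classes
        match = pattern.match(" ".join(elm["class"]))
        if match:
            print(elm)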

For this use case I would simply use a custom filter, like this:

    import re
    from bs4 import BeautifulSoup

    # compile once, outside the filter
    CLASS_PATTERN = re.compile(r"^name-single name\d+$")


    def myclassfilter(tag):
        # tags without a class attribute yield an empty string and never match
        return CLASS_PATTERN.search(" ".join(tag.get("class", []))) is not None


    bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
    bsObj = BeautifulSoup(bs, "html.parser")

    found_elements = bsObj.find_all(myclassfilter)
    print(found_elements)
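If several patterns are needed, the filter idea can be wrapped in a small factory. This is only a sketch: class_attr_matches is a hypothetical helper name, not part of bs4, and the markup is made up for the demonstration:

```python
import re
from bs4 import BeautifulSoup


def class_attr_matches(pattern):
    """Hypothetical helper: build a find_all() filter that tests the whole class string."""
    compiled = re.compile(pattern)

    def _filter(tag):
        classes = tag.get("class")
        if classes is None:
            return False
        # Default parsing yields a list, so join it back; a plain string is used as is.
        value = classes if isinstance(classes, str) else " ".join(classes)
        return compiled.fullmatch(value) is not None

    return _filter


html = """
<a class="name-single name692">wanted</a>
<a class="name-single">not wanted</a>
<p>no class at all</p>
"""
soup = BeautifulSoup(html, "html.parser")
found = soup.find_all(class_attr_matches(r"name-single name\d+"))
print(found)
```

Using fullmatch removes the need for ^ and $ anchors in every pattern, and the None check keeps the filter safe on tags that have no class attribute.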
