Converting XML/HTML entities to a Unicode string in Python

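I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string containing HTML entities and returns a unicode object?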

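For example:

I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. As a 16-bit value this is 01ce in hexadecimal. I want to convert the HTML entity into the value u'\u01ce'.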

Python has the htmlentitydefs module, but it doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (the author of elementtree) has such a function on his website. It works with decimal, hex and named entities:

    import re, htmlentitydefs

    ##
    # Removes HTML or XML character references and entities from a text string.
    #
    # @param text The HTML (or XML) source text.
    # @return The plain text, as a Unicode string, if necessary.

    def unescape(text):
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)

The standard library's own HTMLParser has an undocumented function unescape(), which does exactly what you think it does:

    import HTMLParser
    h = HTMLParser.HTMLParser()
    h.unescape('&copy; 2010') # u'\xa9 2010'
    h.unescape('&#169; 2010') # u'\xa9 2010'

Use the builtin unichr; BeautifulSoup is not necessary:

    >>> entity = '&#x01ce'
    >>> unichr(int(entity[3:],16))
    u'\u01ce'
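For readers on Python 3, the same idea can be sketched with chr, which takes the place of unichr. The helper name decode_hex_reference is made up for this illustration, and it assumes the input is a single hexadecimal character reference, with or without the trailing semicolon:

```python
def decode_hex_reference(entity):
    """Decode one hexadecimal reference such as '&#x01ce;' (hypothetical helper)."""
    if not entity.lower().startswith("&#x"):
        raise ValueError("not a hexadecimal character reference: %r" % entity)
    # Drop the '&#x' prefix and the optional trailing ';' before parsing.
    digits = entity[3:].rstrip(";")
    return chr(int(digits, 16))


print(decode_hex_reference("&#x01ce"))
print(decode_hex_reference("&#x01ce;"))
```

Slicing by hand like this only works for one reference at a time; for whole documents one of the unescape functions in this thread is probably the better fit.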

Alternatively, if you have lxml:

    >>> import lxml.html
    >>> lxml.html.fromstring('&#x01ce').text
    u'\u01ce'

If you are using Python 3.4 or newer, you can simply use html.unescape:

    import html
    s = html.unescape(s)
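As a quick check, html.unescape appears to cover all three forms mentioned in the question: named, decimal and hexadecimal. The sample strings below are only illustrative:

```python
import html

# Named, decimal and hexadecimal references (illustrative inputs).
samples = ["&copy; 2010", "&#169; 2010", "&#x01ce;", "&#462;"]
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(repr(raw), "->", repr(text))
```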

You can find an answer here: Getting international characters from a web page?

EDIT: It seems that BeautifulSoup doesn't convert entities written in hexadecimal form. That can be fixed:

    import copy, re
    from BeautifulSoup import BeautifulSoup

    hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    # replace hexadecimal character reference by decimal one
    hexentityMassage += [(re.compile('&#x([^;]+);'),
                         lambda m: '&#%d;' % int(m.group(1), 16))]

    def convert(html):
        return BeautifulSoup(html,
            convertEntities=BeautifulSoup.HTML_ENTITIES,
            markupMassage=hexentityMassage).contents[0].string

    html = '<html>&#x01ce;&#462;</html>'
    print repr(convert(html))
    # u'\u01ce\u01ce'

EDIT:

The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.

Here is a function that should help you convert entities back to UTF-8 characters correctly.

    def unescape(text):
        """Removes HTML or XML character references
        and entities from a text string.
        @param text The HTML (or XML) source text.
        @return The plain text, as a Unicode string, if necessary.
        from Fredrik Lundh
        2008-01-03: input only unicode characters string.
        http://effbot.org/zone/re-sub.htm#unescape-html
        """
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return unichr(int(text[3:-1], 16))
                    else:
                        return unichr(int(text[2:-1]))
                except ValueError:
                    print "Value Error"
                    pass
            else:
                # named entity
                # reescape the reserved characters.
                try:
                    if text[1:-1] == "amp":
                        text = "&amp;amp;"
                    elif text[1:-1] == "gt":
                        text = "&amp;gt;"
                    elif text[1:-1] == "lt":
                        text = "&amp;lt;"
                    else:
                        print text[1:-1]
                        text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
                except KeyError:
                    print "keyerror"
                    pass
            return text # leave as is
        return re.sub("&#?\w+;", fixup, text)

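Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you leave it out, BeautifulSoup can barf, because the adjacent character can be interpreted as part of the HTML code (i.e. &#39Blackout).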

This worked better for me:

    import re
    from BeautifulSoup import BeautifulSoup

    html_string='<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'

    hexentityMassage = [(re.compile('&#x([^;]+);'),
                         lambda m: '&#%d;' % int(m.group(1), 16))]

    soup = BeautifulSoup(html_string,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage)

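int(m.group(1), 16) converts the number (given in base 16) to an integer.
m.group(0) returns the entire match; m.group(1) returns the regular expression's capturing group.
Using markupMassage is basically the same as:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)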
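The substitution can also be tried on its own, without BeautifulSoup. The sketch below is for Python 3 and uses a made-up function name, hex_to_decimal_references. Note one deliberate difference from the pattern above: it matches only hexadecimal digits instead of [^;]+, an assumption made here so that int() cannot raise on stray text.

```python
import re

# Stricter than '&#x([^;]+);': only hex digits may appear in the reference.
HEX_REFERENCE = re.compile(r"&#x([0-9a-fA-F]+);")


def hex_to_decimal_references(markup):
    """Rewrite '&#x27;' style references as '&#39;' (hypothetical helper)."""
    # m.group(1) holds the hex digits; int(..., 16) parses them as base 16.
    return HEX_REFERENCE.sub(lambda m: "&#%d;" % int(m.group(1), 16), markup)


converted = hex_to_decimal_references("&#x27;Blackout in a can")
print(converted)
```

Keeping the ';' in the replacement string is what stops the following character from being read as part of the reference.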

Another solution is the builtin library xml.sax.saxutils (for both HTML and XML). However, it will only convert &gt;, &amp; and &lt;.

    from xml.sax.saxutils import unescape

    escaped_text = unescape(text_to_escape)
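To make that limitation concrete, here is a small Python 3 sketch. It also uses the optional second argument of saxutils.unescape, a dictionary of extra replacements; the sample text and the single &copy; mapping are assumptions chosen for illustration:

```python
from xml.sax.saxutils import unescape

text = "&lt;b&gt; &amp; &copy;"

# By default only &lt;, &gt; and &amp; are converted; &copy; is left as-is.
default_result = unescape(text)

# Extra replacements can be passed as a dict of entity string -> replacement.
extended_result = unescape(text, {"&copy;": "\u00a9"})

print(default_result)
print(extended_result)
```

Listing every named entity by hand this way gets tedious, so this route mainly suits input that is known to contain only a few entities.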

Here is the Python 3 version of dF's answer:

    import re
    import html.entities

    def unescape(text):
        """
        Removes HTML or XML character references and entities from a text string.

        :param text:    The HTML (or XML) source text.
        :return:        The plain text, as a Unicode string, if necessary.
        """
        def fixup(m):
            text = m.group(0)
            if text[:2] == "&#":
                # character reference
                try:
                    if text[:3] == "&#x":
                        return chr(int(text[3:-1], 16))
                    else:
                        return chr(int(text[2:-1]))
                except ValueError:
                    pass
            else:
                # named entity
                try:
                    text = chr(html.entities.name2codepoint[text[1:-1]])
                except KeyError:
                    pass
            return text # leave as is
        # raw string, so that \w is not treated as a string escape
        return re.sub(r"&#?\w+;", fixup, text)

The main changes concern htmlentitydefs, which is now html.entities, and unichr, which is now chr. See this Python 3 porting guide.
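If the same file has to run on both Python 2 and Python 3, the two renames can be hidden behind a small import shim. This is only a sketch of one possible approach, not something taken from the answers in this thread:

```python
try:
    # Python 3: the module is html.entities and chr handles all code points.
    from html.entities import name2codepoint
    unichr = chr
except ImportError:
    # Python 2: the module is htmlentitydefs and unichr is already a builtin.
    from htmlentitydefs import name2codepoint

# Look up a named entity and turn its code point into a character.
copyright_sign = unichr(name2codepoint["copy"])
print(repr(copyright_sign))
```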
