Convert HTML entities to Unicode and vice versa

Possible duplicates:

  • Convert XML/HTML entities into a Unicode string in Python
  • HTML entity codes to text

How do you convert HTML entities to Unicode and vice versa in Python?

As for the "vice versa" (which I needed myself, which led me to this question, which did not help, and then to another site that had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace') 

will return a plain string in which every non-ASCII character has been converted to a numeric XML (HTML) character reference.
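For readers on Python 3, here is a minimal sketch of the same call. The sample string is made up for illustration. In Python 3, `encode` returns `bytes`, so the result is decoded back to `str`. Note that this step alone does not touch `&`, `<` or `>`.

```python
# Python 3 sketch; the sample text is an illustrative assumption.
text = "café €5 & <b>"

# xmlcharrefreplace swaps each non-ASCII character for &#NNN;
# encode() yields bytes, so decode to get a str again.
encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(encoded)  # caf&#233; &#8364;5 & <b>
```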

You need to have BeautifulSoup installed. The code below is written for Python 2 and BeautifulSoup 3.

    from BeautifulSoup import BeautifulStoneSoup
    import cgi

    def HTMLEntitiesToUnicode(text):
        """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
        text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
        return text

    def unicodeToHTMLEntities(text):
        """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
        text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
        return text

    text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

    uni = HTMLEntitiesToUnicode(text)
    htmlent = unicodeToHTMLEntities(uni)

    print uni
    print htmlent
    # &, ®, <, >, ¢, £, ¥, €, §, ©
    # &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Update for Python 2.7 and BeautifulSoup4

Unescape - Unicode HTML to unicode with HTMLParser (Python 2.7 standard library):

    >>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
    >>> from HTMLParser import HTMLParser
    >>> htmlparser = HTMLParser()
    >>> unescaped = htmlparser.unescape(escaped)
    >>> unescaped
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print unescaped
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape - Unicode HTML to unicode with bs4 (BeautifulSoup4):

    >>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html)
    >>> soup.text
    u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
    >>> print soup.text
    Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape - Unicode to Unicode HTML with bs4 (BeautifulSoup4):

    >>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
    >>> from bs4.dammit import EntitySubstitution
    >>> escaper = EntitySubstitution()
    >>> escaped = escaper.substitute_html(unescaped)
    >>> escaped
    u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'

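As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, however, the function does not escape single quotes ("'"). Because of these issues the function has been deprecated since version 3.2, and it was removed in Python 3.8.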

The suggested replacement for cgi.escape(s) is html.escape(s). (New in version 3.2.)
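To make the quoting difference concrete, here is a small Python 3 sketch. The sample string is made up for illustration. Unlike cgi.escape, html.escape has `quote=True` as its default and also escapes single quotes.

```python
import html

# Illustrative sample containing &, <, >, double and single quotes.
s = "Tom & \"Jerry\" <'cat'>"

# Default (quote=True): both kinds of quotes are escaped.
escaped_all = html.escape(s)
# quote=False: only &, < and > are escaped.
escaped_min = html.escape(s, quote=False)

print(escaped_all)  # Tom &amp; &quot;Jerry&quot; &lt;&#x27;cat&#x27;&gt;
print(escaped_min)  # Tom &amp; "Jerry" &lt;'cat'&gt;
```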

html.unescape(s) was also introduced in version 3.4.

So in Python 3.4 you can:

  • Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
  • Use html.unescape(text) to convert HTML entities back to plain text.
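The two calls above can be sketched as a round trip. The helper names `to_entities` and `from_entities` and the sample string are assumptions made for this illustration. They are not part of the standard library.

```python
import html

def to_entities(text):
    # Hypothetical helper: escape &, <, >, quotes, then turn every
    # non-ASCII character into a numeric character reference.
    return html.escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_entities(text):
    # Hypothetical helper: resolve named and numeric references.
    return html.unescape(text)

original = "Curé & «Grâce» <€>"
escaped = to_entities(original)
restored = from_entities(escaped)

print(escaped)   # Cur&#233; &amp; &#171;Gr&#226;ce&#187; &lt;&#8364;&gt;
print(restored)  # Curé & «Grâce» <€>
```

Escaping first and encoding second matters here: done the other way round, html.escape would turn the `&` of each `&#NNN;` reference into `&amp;`.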

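I used the following function to write Unicode text taken from an xls file into an HTML file while preserving the special characters found in the xls file: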

    import html

    def html_wr(f, dat):
        '''Write dat to file f as HTML.
        The file is assumed to be opened in binary mode.
        If dat is empty, it is replaced with a non-breaking space.
        Non-ASCII characters are translated to XML character references.
        '''
        if not dat:
            dat = '&nbsp;'
        try:
            f.write(dat.encode('ascii'))
        except UnicodeEncodeError:
            # Only strings containing non-ASCII characters reach this branch.
            f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))
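A quick way to try the function without a real file is an in-memory binary buffer. This is only a sketch: the function is repeated so the snippet runs on its own, and the sample cell values are made up. It also shows that pure-ASCII input is written as-is, so only strings containing non-ASCII characters go through html.escape.

```python
import html
import io

def html_wr(f, dat):
    # Same logic as the function above, repeated so this sketch is standalone.
    if not dat:
        dat = '&nbsp;'
    try:
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

# io.BytesIO stands in for a file opened in binary mode.
buf = io.BytesIO()
for cell in ["plain", "", "Curé & co"]:  # illustrative values
    html_wr(buf, cell)
    buf.write(b"\n")

print(buf.getvalue())  # b'plain\n&nbsp;\nCur&#233; &amp; co\n'
```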

Hope this is useful to somebody.