Python: Convert Unicode to ASCII without errors

My code just scrapes a web page, then converts it to Unicode.

    html = urllib.urlopen(link).read()
    html.encode("utf8","ignore")
    self.response.out.write(html)

But I get a UnicodeDecodeError:

    Traceback (most recent call last):
      File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
        handler.get(*groups)
      File "/Users/greg/clounce/main.py", line 55, in get
        html.encode("utf8","ignore")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

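So I assume that means the HTML contains some wrongly formed attempt at Unicode somewhere. Can I just drop whatever bytes are causing the problem instead of getting an error?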

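Can we get the actual value used for link?

In addition, we usually run into this problem when we call .encode() on a byte string that is already encoded. So you might try decoding it first, as in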

    html = urllib.urlopen(link).read()
    unicode_str = html.decode(<source encoding>)
    encoded_str = unicode_str.encode("utf8")

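As an example:

    html = '\xa0'
    encoded_str = html.encode("utf8")

fails with

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

    html = '\xa0'
    decoded_str = html.decode("windows-1252")
    encoded_str = decoded_str.encode("utf8")

succeeds without error. Do note that "windows-1252" is only something I used as an example. I got it from chardet, which had 0.5 confidence that it is right! (Well, given a string that is one character long, what do you expect?) You should change it to the actual encoding of the byte string returned from .urlopen().read().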
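For readers on Python 3, where bytes and text are separate types, the same decode-then-encode step can be sketched roughly as below. The sample bytes and the windows-1252 choice are illustrative assumptions, not the real encoding of the page in the question.

```python
# Sketch only: in Python 3, read() returns bytes, and bytes must be decoded
# to str before they can be re-encoded in another encoding.
raw = b"caf\xe9\xa0au lait"          # hypothetical bytes, assumed to be windows-1252

text = raw.decode("windows-1252")    # bytes -> str (Unicode text)
utf8_bytes = text.encode("utf-8")    # str -> bytes in the target encoding

print(repr(text))
print(utf8_bytes)
```

Calling raw.encode(...) directly would not even reach the codec in Python 3, because bytes objects have no encode() method; the implicit ASCII decode that causes the error above is a Python 2 behaviour.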

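Another problem I see is that the .encode() string method returns the modified string and does not modify the source in place. So calling self.response.out.write(html) is of little use, because html is not the encoded string returned by html.encode (if that is what you were originally aiming for).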

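As Ignacio suggested, check the source web page for the actual encoding of the string returned from read(). It is either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().

Do note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta charset declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before.)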
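A minimal Python 3 sketch of that advice, assuming you already have the raw bytes and the Content-Type header value as a string. The helper names charset_from_content_type and decode_body, and the utf-8 fallback, are hypothetical choices made for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(raw, header_value, fallback="utf-8"):
    """Decode raw bytes with the declared charset, replacing undecodable bytes.

    'fallback' is an assumption used when no charset is declared, and
    errors='replace' guards against declarations that do not match the content.
    """
    charset = charset_from_content_type(header_value, fallback)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # the declared charset name is unknown to Python
        return raw.decode(fallback, errors="replace")


print(repr(decode_body(b"\xa0", "text/html; charset=windows-1252")))
```

Using errors="replace" here means a wrong declaration produces visible U+FFFD markers rather than an exception, which makes mismatches easier to spot than silently dropping bytes.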

    >>> u'aあä'.encode('ascii', 'ignore')
    'a'

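Edit:

Decode the returned string, using the charset given either in the appropriate meta tag of the response or in the Content-Type header, then encode.

The encode() method accepts other values besides "ignore" for its errors argument, for example 'replace', 'xmlcharrefreplace', 'backslashreplace'. See https://docs.python.org/3/library/stdtypes.html#str.encode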
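To compare those error handlers side by side, here is a small sketch written for Python 3 (so the results are bytes objects rather than Python 2 str):

```python
text = "aあä"

ignored = text.encode("ascii", "ignore")              # drop unencodable characters
replaced = text.encode("ascii", "replace")            # substitute '?' for each one
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # numeric character references
escaped = text.encode("ascii", "backslashreplace")    # Python-style escape sequences

for name, value in [("ignore", ignored), ("replace", replaced),
                    ("xmlcharrefreplace", xml_refs), ("backslashreplace", escaped)]:
    print(name, value)
```

Only "ignore" and "replace" lose information; the other two keep enough detail to recover the original characters later.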

As an extension to Ignacio Vazquez-Abrams' answer

    >>> u'aあä'.encode('ascii', 'ignore')
    'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
    'aa'

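You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character is not converted to an ASCII APOSTROPHE when encoding.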

    >>> print u'\u2019'
    ’
    >>> unicodedata.name(u'\u2019')
    'RIGHT SINGLE QUOTATION MARK'
    >>> u'\u2019'.encode('ascii', 'ignore')
    ''
    # Note we get an empty string back
    >>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
    "'"

There are more efficient ways to accomplish this, though. For more details, see the question: Where is Python's "best ASCII for this Unicode" database?
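One way to avoid chaining many .replace() calls is a single translation table applied before normalization. This is only a sketch in Python 3: the PUNCTUATION mapping and the to_ascii name are hypothetical, and the table is deliberately tiny.

```python
import unicodedata

# Hypothetical, deliberately small mapping from code points to ASCII
# stand-ins; extend it with whatever punctuation your data contains.
PUNCTUATION = {
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
}


def to_ascii(text):
    """Map known punctuation, strip accents via NFKD, drop the rest."""
    text = text.translate(PUNCTUATION)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("it\u2019s a caf\u00e9"))
```

str.translate handles every mapped character in one pass, so the cost does not grow with the number of entries the way repeated .replace() calls would.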

Use unidecode: it converts even weird characters to ASCII, and even converts Chinese to phonetic ASCII.

    $ pip install unidecode

then:

    >>> from unidecode import unidecode
    >>> unidecode(u'北京')
    'Bei Jing'
    >>> unidecode(u'Škoda')
    'Skoda'

I use this helper function in all of my projects. If it can't convert the Unicode, it ignores it. It ties into a Django library, but with a little research you could bypass that.

    from django.utils import encoding

    def convert_unicode_to_string(x):
        """
        >>> convert_unicode_to_string(u'ni\xf1era')
        'niera'
        """
        return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any Unicode errors after using this.

For broken consoles like cmd.exe, and for HTML output, you can always use:

 my_unicode_string.encode('ascii','xmlcharrefreplace')

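This preserves all the non-ASCII characters while making them printable both in pure ASCII and in HTML.

Warning: if you use this in production code to avoid errors, then most likely there is something wrong with your code. The only valid use cases are printing to a non-Unicode console or easy conversion to HTML entities in an HTML context.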
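To illustrate why 'xmlcharrefreplace' counts as preserving the characters, here is a short Python 3 sketch that round-trips a string through ASCII. The sample text is an arbitrary assumption.

```python
import html

original = "naïve café 北京"

# Every non-ASCII character becomes a numeric character reference.
ascii_bytes = original.encode("ascii", "xmlcharrefreplace")

# An HTML-aware consumer can turn the references back into characters.
restored = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes)
print(restored == original)
```

A browser performs the equivalent of html.unescape when it renders the page, which is why this handler suits HTML output but not general-purpose data storage.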

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (it works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').

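You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""

The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front ... look for "charset".

You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.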
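If chardet is not available, a cruder stand-in is to try a short list of candidate encodings in order. The sketch below is Python 3, it is a heuristic rather than real detection, and both the candidate list and the decode_with_candidates name are assumptions.

```python
def decode_with_candidates(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never fails,
    # although the result may not be the text the author intended.
    return raw.decode("latin-1"), "latin-1"


print(decode_with_candidates(b"\xa0"))
```

Strict UTF-8 goes first because it rejects most text that was written in a single-byte encoding, whereas a single-byte codec accepts almost any byte sequence and would mask the mistake.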

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question:

    html = urllib.urlopen(link).read()
    # html refers to a str object. To get unicode, you need to find out
    # how it is encoded, and decode it.

    html.encode("utf8","ignore")
    # problem 1: will fail because html is a str object;
    # encode works on unicode objects so Python tries to decode it using
    # 'ascii' and fails
    # problem 2: even if it worked, the result will be ignored; it doesn't
    # update html in situ, it returns a function result.
    # problem 3: "ignore" with UTF-n: any valid unicode object
    # should be encodable in UTF-n; error implies end of the world,
    # don't try to ignore it. Don't just whack in "ignore" willy-nilly,
    # put it in only with a comment explaining your very cogent reasons for doing so.
    # "ignore" with most other encodings: error implies that you are mistaken
    # in your choice of encoding -- same advice as for UTF-n :-)
    # "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
    # Irrespective of error or not, you are probably mistaken
    # (needing eg cp1252 or even cp850 instead) ;-)

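If you have a string called line, you can use its .encode([encoding], [errors='strict']) method to convert between encodings. The method returns the converted string; it does not change line itself.

    line = 'my big string'
    line.encode('ascii', 'ignore')

For more information about handling ASCII and Unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html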

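I think the answer is already here, but only in bits and pieces, which makes it hard to quickly fix a problem such as

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let's take an example. Suppose I have a file containing data in the following form (a mix of ASCII and non-ASCII characters):

1/10/17, 21:36 – Land: Welcome

and we want to drop everything else and keep only the ASCII characters.

This code will do it: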

    import unicodedata

    fp = open(<FILENAME>)
    for line in fp:
        rline = line.strip()
        rline = unicode(rline, "utf-8")
        rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
        if len(rline) != 0:
            print rline

and type(rline) will give you

    >>> type(rline)
    <type 'str'>
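A rough Python 3 equivalent of the loop above, for comparison. Here io.StringIO stands in for a file opened with an explicit encoding, the sample text is made up, and the ascii_lines name is hypothetical.

```python
import io
import unicodedata


def ascii_lines(lines):
    """Yield each non-empty line reduced to its ASCII characters."""
    for line in lines:
        stripped = line.strip()
        reduced = unicodedata.normalize("NFKD", stripped)
        reduced = reduced.encode("ascii", "ignore").decode("ascii")
        if reduced:
            yield reduced


# io.StringIO stands in for a real file opened in text mode with
# encoding="utf-8"; the second line contains no ASCII and is skipped.
sample = io.StringIO("1/10/17, 21:36 \u2013 Land: Welcome\n\u3042\n")
for line in ascii_lines(sample):
    print(line)
```

In Python 3 the file object already yields str when opened in text mode, so the unicode(rline, "utf-8") step disappears and the final decode("ascii") turns the filtered bytes back into text.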

    unicodestring = '\xa0'
    decoded_str = unicodestring.decode("windows-1252")
    encoded_str = decoded_str.encode('ascii', 'ignore')

works for me

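It looks like you are using Python 2.x. Python 2.x defaults to the ASCII codec whenever it has to convert implicitly between str and unicode, hence the exception.

Try pasting the line below right after the shebang; it declares the encoding of the source file itself as UTF-8: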

 # -*- coding: utf-8 -*-
