Determining the encoding of text in Python

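I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? A related question, "How can I detect the encoding/codepage of a text file", deals with C#.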

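Correctly detecting the encoding every time is impossible.

(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.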

The chardet library uses that research to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.

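You can also use UnicodeDammit. It will try the following methods, in order:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252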
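The fallback order above can be sketched with the standard library alone. This is not Beautiful Soup's implementation: the function name `decode_with_fallbacks` and the `declared` parameter are hypothetical, and the chardet step is left out so the sketch has no third-party dependency.

```python
import codecs

# BOMs ordered so that the 4-byte UTF-32 marks are tested before the
# 2-byte UTF-16 marks they begin with.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_fallbacks(data, declared=None):
    """Return (text, encoding_used) for a bytes object.

    Hypothetical helper: tries a declared encoding first, then a BOM,
    then UTF-8, then Windows-1252.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    for bom, name in _BOMS:
        if data.startswith(bom):
            candidates.append(name)
            break
    candidates += ["utf-8", "windows-1252"]

    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue

    # Python's windows-1252 codec leaves a few byte values undefined,
    # so the last resort substitutes U+FFFD for them.
    return data.decode("windows-1252", errors="replace"), "windows-1252"
```

A caller would pass the raw bytes of the file, plus any encoding named in an XML declaration or META tag if one is known.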

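Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of Python bindings available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file like this: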

    import magic

    # Read raw bytes; text mode would already assume an encoding
    blob = open('unknown-file', 'rb').read()
    m = magic.open(magic.MAGIC_MIME_ENCODING)
    m.load()
    encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding, like this:

    import magic

    # Read raw bytes; text mode would already assume an encoding
    blob = open('unknown-file', 'rb').read()
    m = magic.Magic(mime_encoding=True)
    encoding = m.from_buffer(blob)

Some encoding strategies; uncomment to taste:

    #!/bin/bash
    #
    tmpfile=$1
    echo '-- info about file file ........'
    file -i "$tmpfile"
    enca -g "$tmpfile"
    echo 'recoding ........'
    #iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
    #enca -x utf-8 "$tmpfile"
    #enca -g "$tmpfile"
    recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop, but you might need to check the file size first:

    import codecs

    encodings = ['utf-8', 'windows-1250', 'windows-1252']  # ...etc
    for e in encodings:
        try:
            fh = codecs.open('file.txt', 'r', encoding=e)
            fh.readlines()
            fh.seek(0)
        except UnicodeDecodeError:
            print('got unicode error with %s , trying different encoding' % e)
        else:
            print('opening the file with encoding:  %s ' % e)
            break
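A variant of the same idea, as a minimal sketch: read the bytes once and try each decoder in memory rather than reopening the file for every candidate. The function name `first_decodable` and the default candidate list are assumptions for illustration. A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, so strict encodings such as UTF-8 should come first; a single-byte codec like latin-1 accepts any input.

```python
def first_decodable(data, encodings=("utf-8", "windows-1250", "windows-1252")):
    """Return the first encoding that decodes `data` without error, else None.

    Hypothetical helper; `data` is the raw bytes of the file.
    """
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # Bytes are not valid in this encoding: try the next candidate.
            continue
        return name
    return None
```

For a large file, passing only the first few hundred kilobytes keeps memory use bounded, at the cost of missing invalid bytes later in the file.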

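Here is an example that takes a chardet encoding prediction at face value, reading only the first n_lines of the file in case it is large.

chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with it), which is returned alongside the prediction by chardet.detect(), so you could work that in somehow if you like.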

    def predict_encoding(file_path, n_lines=20):
        '''Predict a file's encoding using chardet'''
        import chardet

        # Open the file as binary data
        with open(file_path, 'rb') as f:
            # Join binary lines for specified number of lines
            rawdata = b''.join([f.readline() for _ in range(n_lines)])

        return chardet.detect(rawdata)['encoding']
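One way to work the confidence value in, as a sketch: take the dictionary returned by chardet.detect() and accept the prediction only above a threshold. The helper name `accept_prediction` and the 0.8 cutoff are assumptions for illustration, not something chardet recommends.

```python
def accept_prediction(result, min_confidence=0.8):
    """Return the predicted encoding only if the confidence is high enough.

    `result` is expected to look like chardet.detect() output, e.g.
    {'encoding': 'utf-8', 'confidence': 0.99}. The threshold is arbitrary.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return None
    return encoding
```

Returning None instead of a low-confidence guess lets the caller fall back to another strategy, such as trying a fixed list of encodings.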

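It is, in principle, impossible to determine the encoding of a text file in the general case. So no, there is no standard Python library to do that for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.

If you know some of the content of the file, you can try decoding it with several encodings and see which one fails to produce that content. In general there is no way, since a text file is a text file and those are stupid ;)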
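As an illustration of using format-specific knowledge, here is a minimal sketch that reads the encoding named in an XML declaration. It is a regular-expression sniff rather than a real XML parser, the helper name `xml_declared_encoding` is hypothetical, and it only works when the declaration itself is stored in an ASCII-compatible encoding (so not for UTF-16 files).

```python
import re

# Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    head = data[:200]
    # Skip a UTF-8 byte order mark if one precedes the declaration.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None
```

The declared name is only a claim made by the document, so decoding with it can still fail.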
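The known-content approach can be sketched as follows: decode with each candidate and keep only those encodings under which an expected string actually appears. The function name `encodings_matching` and the default candidate list are assumptions for illustration.

```python
def encodings_matching(data, expected,
                       candidates=("utf-8", "windows-1250", "windows-1252")):
    """Return the candidate encodings under which `expected` occurs in `data`.

    Hypothetical helper; `data` is raw bytes, `expected` is a str that the
    file is known to contain.
    """
    matches = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            # Not even valid in this encoding.
            continue
        if expected in text:
            matches.append(name)
    return matches
```

The expected string needs to contain non-ASCII characters to be useful, since ASCII text decodes identically under most candidates and would match all of them.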

    # Function: OpenRead(file)

    # A text file can be encoded using:
    #   (1) The default operating system code page, or
    #   (2) utf8 with a BOM header
    #
    #  If a text file is encoded with utf8, and does not have a BOM header,
    #  the user can manually add a BOM header to the text file
    #  using a text editor such as notepad++, and rerun the python script,
    #  otherwise the file is read as a codepage file with the
    #  invalid codepage characters removed

    import sys
    if sys.version_info[0] != 3:
        print('Aborted: Python 3.x required')
        sys.exit(1)

    def bomType(file):
        """
        returns file encoding string for open() function

        EXAMPLE:
            bom = bomType(file)
            open(file, encoding=bom, errors='ignore')
        """
        with open(file, 'rb') as f:
            b = f.read(4)

        if b[0:3] == b'\xef\xbb\xbf':
            return "utf-8-sig"   # like utf8, but strips the BOM when reading

        # The utf-32 BOMs are checked before the utf-16 ones because the
        # little-endian utf-32 BOM starts with the little-endian utf-16 BOM
        if b[0:4] == b'\xff\xfe\x00\x00' or b[0:4] == b'\x00\x00\xfe\xff':
            return "utf32"

        # Python automatically detects endianness if utf-16 bom is present
        # write endianness generally determined by endianness of CPU
        if b[0:2] == b'\xfe\xff' or b[0:2] == b'\xff\xfe':
            return "utf16"

        # If BOM is not provided, then assume it's the codepage
        #     used by your operating system
        return "cp1252"
        # For the United States it's: cp1252

    def OpenRead(file):
        bom = bomType(file)
        return open(file, 'r', encoding=bom, errors='ignore')

    #######################
    # Testing it
    #######################
    fout = open("myfile1.txt", "w", encoding="cp1252")
    fout.write("* hi there (cp1252)")
    fout.close()

    fout = open("myfile2.txt", "w", encoding="utf8")
    fout.write("\u2022 hi there (utf8)")
    fout.close()

    # this case is still treated like codepage cp1252
    #   (User responsible for making sure that all utf8 files
    #   have a BOM header)
    fout = open("badboy.txt", "wb")
    fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
    fout.close()

    # Read Example file with Bom Detection
    fin = OpenRead("myfile1.txt")
    L = fin.readline()
    print(L)
    fin.close()

    # Read Example file with Bom Detection
    fin = OpenRead("myfile2.txt")
    L = fin.readline()
    print(L)  # requires QtConsole to view, Cmd.exe is cp1252
    fin.close()

    # Read CP1252 with a few undefined chars without barfing
    fin = OpenRead("badboy.txt")
    L = fin.readline()
    print(L)
    fin.close()

    # Check that bad characters are still in badboy codepage file
    fin = open("badboy.txt", "rb")
    print(fin.read(20))
    fin.close()

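Depending on your platform, I just opt to use the Linux shell file command. This works for me since I am using it in a script that runs exclusively on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.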

    import subprocess

    def is_utf8(path='test.txt'):
        file_cmd = ['file', path]
        p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
        # Popen yields bytes on Python 3, so decode before splitting
        cmd_output = p.stdout.readlines()
        p.wait()
        # x will begin with the file type output as is observed using 'file' command
        x = cmd_output[0].decode().split(": ")[1]
        return x.startswith('UTF-8')
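Where the file command is not available, the same yes/no question can be answered in pure Python. The sketch below is a substitute technique, not what file does internally: it streams the file through the standard library's incremental UTF-8 decoder. The function name `is_valid_utf8` and the chunk size are assumptions. Unlike the file-based check, plain ASCII files count as valid here, because ASCII is a subset of UTF-8.

```python
import codecs


def is_valid_utf8(path, chunk_size=1 << 20):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError:
                return False
    try:
        # Flush: fails if the file ends in the middle of a multi-byte sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

The incremental decoder handles multi-byte sequences that straddle a chunk boundary, which a naive per-chunk bytes.decode() call would wrongly reject.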
