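Java: How to determine the correct charset encoding of a stream

With reference to the following thread: Java App: Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of an InputStream/file?

I have tried using the following: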

    File in = new File(args[0]);
    InputStreamReader r = new InputStreamReader(new FileInputStream(in));
    System.out.println(r.getEncoding());

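But on a file which I know to be encoded with ISO8859_1, the above code yields ASCII, which is not correct and does not allow me to render the content of the file back to the console correctly.

I have used this library, similar to jchardet, for detecting encoding in Java: http://code.google.com/p/juniversalchardet/

You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.

The getEncoding() method returns the encoding that was set up for the stream (read the JavaDoc). It will not guess the encoding for you.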
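To make the point about getEncoding() concrete, here is a minimal sketch. The class and method names (`GetEncodingDemo`, `reportedEncoding`) are made up for illustration; only `InputStreamReader` and `getEncoding()` come from the JDK. The same bytes are wrapped twice with different charsets, and the reader simply reports back whatever it was given.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetEncodingDemo {

    // Returns the encoding name the reader reports for the given bytes
    // when it is constructed with the charset we chose.
    public static String reportedEncoding(byte[] bytes, Charset chosen) {
        // A ByteArrayInputStream holds no OS resources, so it is not closed here.
        return new InputStreamReader(new ByteArrayInputStream(bytes), chosen).getEncoding();
    }

    public static void main(String[] args) {
        // "é" encoded as ISO-8859-1 is the single byte 0xE9.
        byte[] latin1Bytes = "\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // The result depends only on the charset passed in, never on the bytes.
        // getEncoding() reports the historical names: ISO8859_1 and UTF8.
        System.out.println(reportedEncoding(latin1Bytes, StandardCharsets.ISO_8859_1));
        System.out.println(reportedEncoding(latin1Bytes, StandardCharsets.UTF_8));
    }
}
```

When no charset is passed to the constructor, as in the question, the reader falls back to the platform default, which is why the reported name had nothing to do with the file.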

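Some streams tell you which encoding was used to create them: XML, HTML. But an arbitrary byte stream does not.

Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for each character. In English the character e appears very often, but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream has a lot of them.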
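The 0x00 observation can be turned into a rough heuristic. The sketch below is only an illustration: the class name, the method names and the 0.2 threshold are all assumptions, not part of any library. It also only helps for UTF-16 text that is mostly in the Latin range; UTF-16 text in, say, Chinese contains few zero bytes.

```java
import java.nio.charset.StandardCharsets;

public class ZeroByteHeuristic {

    // Fraction of bytes in the sample that are 0x00 (0.0 for an empty sample).
    public static double zeroByteRatio(byte[] sample) {
        if (sample.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte b : sample) {
            if (b == 0) {
                zeros++;
            }
        }
        return (double) zeros / sample.length;
    }

    // Hypothetical rule of thumb: many zero bytes hint at UTF-16.
    // The 0.2 threshold is an arbitrary assumption chosen for this sketch.
    public static boolean looksLikeUtf16(byte[] sample) {
        return zeroByteRatio(sample) > 0.2;
    }

    public static void main(String[] args) {
        byte[] latin1 = "Hello, world".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf16 = "Hello, world".getBytes(StandardCharsets.UTF_16LE);

        System.out.println(looksLikeUtf16(latin1)); // no zero bytes in the sample
        System.out.println(looksLikeUtf16(utf16));  // every second byte is zero
    }
}
```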

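Or: you could ask the user. I have seen applications which present a snippet of the file in different encodings and ask you to select the "correct" one.

Check this out: http://site.icu-project.org/ (icu4j). They have libraries for detecting the charset of an input stream. It could be as simple as this: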

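    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.io.BufferedInputStream;
    import java.nio.charset.UnsupportedCharsetException;

    // `input` is an InputStream; `reader` and `charset` are declared elsewhere.
    BufferedInputStream bis = new BufferedInputStream(input);
    CharsetDetector cd = new CharsetDetector();
    cd.setText(bis);
    CharsetMatch cm = cd.detect();

    if (cm != null) {
        reader = cm.getReader();
        charset = cm.getName();
    } else {
        // The constructor requires a charset name; none was detected here.
        throw new UnsupportedCharsetException("unknown");
    }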

Here are my favorites:

TikaEncodingDetector

Dependency:

    <dependency>
        <groupId>org.codehaus.guessencoding</groupId>
        <artifactId>guessencoding</artifactId>
        <version>1.4</version>
        <type>jar</type>
    </dependency>

Sample:

    public static Charset guessCharset2(File file) throws IOException {
        return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
    }

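You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results: for example, do you know beforehand whether the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.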
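A minimal sketch of that validation idea, using only the JDK. The class and method names (`CharsetValidator`, `decodesCleanly`) are made up for this example. A `false` result rules the charset out; a `true` result only means the bytes are not invalid in that charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetValidator {

    // Returns false if decoding hits malformed input or an unmappable character.
    public static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1 ends with the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));      // 0xE9 alone is not valid UTF-8
        System.out.println(decodesCleanly(latin1, StandardCharsets.US_ASCII));   // 0xE9 is outside ASCII
        System.out.println(decodesCleanly(latin1, StandardCharsets.ISO_8859_1)); // every byte is valid Latin-1
    }
}
```

Note that ISO-8859-1 accepts every possible byte sequence, so this check can never rule it out.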

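The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scan the text.

I found a nice third-party library which can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively, but it seems to work.

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

For ISO8859_1 files, there is no easy way to distinguish them from ASCII. For Unicode files, however, one can generally detect the encoding from the first few bytes of the file.

UTF-8 and UTF-16 files may include a Byte Order Mark (BOM) at the very beginning of the file; it is common in UTF-16 files and optional in UTF-8. The BOM is the character U+FEFF, a zero-width non-breaking space.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example: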

    $ file sample2.sql
    sample2.sql: Unicode text, UTF-16, big-endian

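For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding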
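For reference, BOM checking itself is short. The following is a sketch, not the code from the linked answer; the class and method names are made up. It covers only the UTF-8 and UTF-16 BOMs and returns null when none is present, in which case the encoding still has to be guessed some other way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    // Returns the charset indicated by a leading BOM, or null if there is none.
    // Limitation of this sketch: a UTF-32LE BOM (FF FE 00 00) would be
    // reported as UTF-16LE, because UTF-32 is not handled here.
    public static Charset charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the first bytes of data against the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] plain = {0x41, 0x42, 0x43};

        System.out.println(charsetFromBom(utf16be)); // BOM FE FF
        System.out.println(charsetFromBom(plain));   // no BOM
    }
}
```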

If you use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

    String charset = "ISO-8859-1"; // Default charset; put whatever you want

    // Read the whole file into a byte array (`file` is a java.io.File).
    // Files.readAllBytes closes the file and never returns a partial read.
    byte[] data = Files.readAllBytes(file.toPath());

    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);

    CharsetMatch cm = detector.detect();

    if (cm != null) {
        int confidence = cm.getConfidence();
        System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
        // Here you have the encoding name and the confidence.
        // In my case, if the confidence is > 50 I return the detected encoding;
        // otherwise I return the default value.
        if (confidence > 50) {
            charset = cm.getName();
        }
    }

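Remember to add all the try-catch blocks it needs.

I hope this works for you.

As far as I know, there is no general library in this context that is suitable for all types of problems. So, for each problem, you should test the existing libraries and select the one that best satisfies your problem's constraints, but often none of them is appropriate. In these cases you can write your own encoding detector! As I have written...

I have written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool; please read the README section first. Also, you can find some basic concepts of this problem in my paper and in its references.

Below are some helpful observations from my work: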

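Charset detection is not a foolproof process, because it is essentially based on statistical data; what actually happens is guessing, not detecting
icu4j, from IBM, is the main tool in this context, imho
Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy did not differ meaningfully from icu4j in my tests (at most 1%, as I remember)
icu4j is much more general than jchardet; icu4j is just a bit biased toward IBM-family encodings, while jchardet is strongly biased toward utf-8
Due to the widespread use of UTF-8 in the HTML world, jchardet is a better choice than icu4j overall, but it is not the best choice!
icu4j is great for East Asian encodings such as EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet perform very poorly on HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251, aka cp1251, is widely used for Cyrillic-based languages such as Russian, and Windows-1256, aka cp1256, is widely used for Arabic
Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input
Some encodings are essentially the same, with only partial differences, so in some cases the guessed or detected encoding may be wrong and yet right at the same time! This is the case for Windows-1252 and ISO-8859-1. (See the last paragraph of Section 5.2 of my paper)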
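The last observation can be checked directly. Windows-1252 and ISO-8859-1 map bytes differently only in the range 0x80 to 0x9F, so text that avoids that range decodes identically under both. The sketch below uses only the JDK; the class and method names are made up, and it assumes the running JVM provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1VsCp1252 {

    // True if the bytes decode to the same string under both charsets.
    // Assumes the JVM ships the "windows-1252" charset.
    public static boolean decodeIdentically(byte[] data) {
        String asLatin1 = new String(data, StandardCharsets.ISO_8859_1);
        String asCp1252 = new String(data, Charset.forName("windows-1252"));
        return asLatin1.equals(asCp1252);
    }

    public static void main(String[] args) {
        // 0xE9 is "é" in both charsets, so either guess gives the same text.
        byte[] cafe = {'c', 'a', 'f', (byte) 0xE9};
        // 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.
        byte[] euro = {(byte) 0x80};

        System.out.println(decodeIdentically(cafe));
        System.out.println(decodeIdentically(euro));
    }
}
```

For input like the first sample, a detector that answers either name has produced a usable result, which is what "wrong and yet right" means in practice.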

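Which library to use?

As of this writing, three libraries have emerged:

GuessEncoding
ICU4J
juniversalchardet

I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

How to tell which one has detected the right charset (or as close as possible)?

It is impossible to certify the charset detected by any of the above libraries. However, it is possible to ask them in turn and score the returned responses.

How to score the returned response?

Each response can be assigned one point. The more points a charset collects, the more confidence there is in it. This is a simple scoring method; you can devise others.

Is there any sample code?

Here is a full snippet implementing the strategy described in the previous lines.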

    public static String guessEncoding(InputStream input) throws IOException {
        // Load input data
        long count = 0;
        int n = 0, EOF = -1;
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream output = new ByteArrayOutputStream();

        while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
            output.write(buffer, 0, n);
            count += n;
        }

        if (count > Integer.MAX_VALUE) {
            throw new RuntimeException("Inputstream too large.");
        }

        byte[] data = output.toByteArray();

        // Detect encoding
        Map<String, int[]> encodingsScores = new HashMap<>();

        // * GuessEncoding
        updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

        // * ICU4j
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(data);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        if (cm != null) {
            updateEncodingsScores(encodingsScores, cm.getName());
        }

        // * juniversalchardet
        UniversalDetector universalDetector = new UniversalDetector(null);
        universalDetector.handleData(data, 0, data.length);
        universalDetector.dataEnd();
        String encodingName = universalDetector.getDetectedCharset();
        if (encodingName != null) {
            updateEncodingsScores(encodingsScores, encodingName);
        }

        // Find winning encoding
        Map.Entry<String, int[]> maxEntry = null;
        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
                maxEntry = e;
            }
        }

        String winningEncoding = maxEntry.getKey();
        //dumpEncodingsScores(encodingsScores);
        return winningEncoding;
    }

    private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
        String encodingName = encoding.toLowerCase();
        int[] encodingScore = encodingsScores.get(encodingName);

        if (encodingScore == null) {
            encodingsScores.put(encodingName, new int[] { 1 });
        } else {
            encodingScore[0]++;
        }
    }

    private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
        System.out.println(toString(encodingsScores));
    }

    private static String toString(Map<String, int[]> encodingsScores) {
        String GLUE = ", ";
        StringBuilder sb = new StringBuilder();

        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
        }
        int len = sb.length();
        sb.delete(len - GLUE.length(), len);

        return "{ " + sb.toString() + " }";
    }

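Improvements: the guessEncoding method reads the input stream entirely. For large input streams this can be a concern, because all of these libraries then process the whole input, which makes charset detection very time-consuming.

It is possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.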
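One way to cap the amount of data is to read only a fixed-size prefix and hand that byte array to the detectors. This is a sketch under assumptions: the class and method names are made up, and the 4096-byte limit in the example is an arbitrary choice. Since detection accuracy depends on input size, a very small sample will make the guess less reliable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class SampleReader {

    // Reads at most maxBytes from the stream and returns exactly the bytes read.
    // I/O failures are rethrown unchecked to keep the sketch short.
    public static byte[] readPrefix(InputStream in, int maxBytes) {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        try {
            while (total < maxBytes) {
                int n = in.read(buffer, total, maxBytes - total);
                if (n == -1) {
                    break; // end of stream reached before the limit
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) {
        InputStream large = new ByteArrayInputStream(new byte[100_000]);
        byte[] sample = readPrefix(large, 4096);
        System.out.println(sample.length); // never more than the limit
    }
}
```

The resulting array can replace `data` in the guessEncoding snippet above, so each library only sees the sample.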

An alternative to TikaEncodingDetector is to use Tika's AutoDetectReader.

 Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();

You can pick the appropriate charset in the constructor:

 new InputStreamReader(new FileInputStream(in), "ISO8859_1");
