How do I read a text file with mixed encodings in Scala or Java?

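I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no single encoding in which the whole file is valid, and I still need to parse it. Apart from using Java libraries like Weka, I work mainly in Scala. I cannot even read the file using scala.io.Source. For example,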

    Source.fromFile(filename)("UTF-8").foreach(print)

throws:

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:153)
        at java.io.BufferedReader.read(BufferedReader.java:174)
        at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
        at scala.io.Codec.wrap(Codec.scala:64)
        at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
        at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
        at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
        at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
        at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
        at scala.io.Source.hasNext(Source.scala:238)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.io.Source.foreach(Source.scala:181)

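I would be perfectly happy to throw away all of the invalid characters, or to replace them with some dummy character. I am going to have lots of text like this to process in various ways, and I may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that makes all of the low-level Java libraries ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.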

Solution:

    import java.nio.charset.CodingErrorAction
    import scala.io.Codec
    import scala.io.Source

    // An implicit Codec in scope is picked up by Source.fromFile.
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    Source.fromFile(filename).foreach(print)

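Thanks to +Esailija for pointing me in the right direction. This led me to "How to detect illegal UTF-8 byte sequences to replace them in a Java InputStream?", which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for an entire package by putting the implicit codec definition in the package object.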

This is how I managed to do it in Java:

    FileInputStream input;
    String result = null;
    try {
        input = new FileInputStream(new File("invalid.txt"));
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        InputStreamReader reader = new InputStreamReader(input, decoder);
        BufferedReader bufferedReader = new BufferedReader(reader);
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line);
            line = bufferedReader.readLine();
        }
        bufferedReader.close();
        result = sb.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println(result);

The invalid file was created with these bytes:

    0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

That is hellö wörld in UTF-8 with 4 invalid bytes mixed in.

With .REPLACE you see the standard Unicode replacement character (U+FFFD) being used:

 //"h�ellö� wö�rld�"

With .IGNORE, you see that the invalid bytes are ignored:

 //"hellö wörld"

Without specifying .onMalformedInput, you get

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
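To compare the three behaviours without creating a file, the same bytes can be decoded in memory. This is only a minimal sketch: the class name `MalformedInputDemo` and its `decode` helper are made up for illustration, and it uses `CharsetDecoder.decode(ByteBuffer)` rather than the `InputStreamReader` from the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MalformedInputDemo {
    // The sample bytes from above: "hellö wörld" with four invalid bytes mixed in.
    static final byte[] SAMPLE = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20,
        0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Decodes the bytes as UTF-8, applying the given action to malformed input.
    static String decode(byte[] bytes, CodingErrorAction action) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // REPLACE: every invalid byte becomes U+FFFD.
        System.out.println(decode(SAMPLE, CodingErrorAction.REPLACE));
        // IGNORE: the invalid bytes are silently dropped.
        System.out.println(decode(SAMPLE, CodingErrorAction.IGNORE));
        // REPORT (the default for a new decoder): decoding fails with an exception.
        try {
            decode(SAMPLE, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("REPORT threw " + e);
        }
    }
}
```

How U+FFFD is displayed depends on the console encoding, so comparing the returned strings is more reliable than reading the printed output.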

A solution for Scala's Source (based on @Esailija's answer):

    def toSource(inputStream: InputStream): scala.io.BufferedSource = {
      import java.nio.charset.Charset
      import java.nio.charset.CodingErrorAction
      val decoder = Charset.forName("UTF-8").newDecoder()
      decoder.onMalformedInput(CodingErrorAction.IGNORE)
      scala.io.Source.fromInputStream(inputStream)(decoder)
    }

Scala's Codec has a decoder field which returns a java.nio.charset.CharsetDecoder:

    val decoder = Codec.UTF8.decoder.onMalformedInput(CodingErrorAction.IGNORE)
    Source.fromFile(filename)(decoder).getLines().toList

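The problem with ignoring invalid bytes is then deciding when they are valid again. Note that UTF-8 encodes characters with a variable number of bytes, so if a byte is invalid, you need to work out which byte to start reading from in order to get a valid stream of characters again.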

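In short, I don't think you will find a library that can "correct" the data as it reads. I think a much more productive approach is to try to clean that data up first.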
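As a possible starting point for such a clean-up pass, Java's `CharsetDecoder` reports the length of each malformed sequence, which gives one way to decide where to resume reading. The sketch below, with the hypothetical class `MalformedOffsets`, lists the byte offsets at which strict UTF-8 decoding fails. It locates the bad bytes only; it does not guess which encoding they were written in.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class MalformedOffsets {
    // Returns the byte offsets at which strict UTF-8 decoding hits an invalid sequence.
    static List<Integer> find(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never produces more chars than input bytes, so this buffer is large enough.
        CharBuffer out = CharBuffer.allocate(Math.max(1, bytes.length));
        List<Integer> offsets = new ArrayList<>();
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position is left at the start of the bad sequence.
                offsets.add(in.position());
                // Skip exactly the bytes the decoder rejected, then resume.
                in.position(in.position() + result.length());
            } else if (result.isOverflow()) {
                out.clear(); // the decoded text is not needed here, only the offsets
            } else {
                break; // underflow: all input has been consumed
            }
        }
        return offsets;
    }
}
```

For the sample bytes shown earlier in this thread, `MalformedOffsets.find` returns the offsets 1, 7, 12 and 16.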

I switch to a different codec if one fails.

To implement this pattern, I took inspiration from this other Stack Overflow question.

I use a default list of codecs and go through them recursively. If they all fail, I print out the scary bits:

    private val defaultCodecs = List(
      io.Codec("UTF-8"),
      io.Codec("ISO-8859-1")
    )

    def listLines(file: java.io.File, codecs: Iterable[io.Codec] = defaultCodecs): Iterable[String] = {
      val codec = codecs.head
      val fileHandle = scala.io.Source.fromFile(file)(codec)
      try {
        val txtArray = fileHandle.getLines().toList
        txtArray
      } catch {
        case ex: Exception => {
          if (codecs.tail.isEmpty) {
            println("Exception: " + ex)
            println("Skipping file: " + file.getPath)
            List()
          } else {
            listLines(file, codecs.tail)
          }
        }
      } finally {
        fileHandle.close()
      }
    }

I am just learning Scala, so the code may not be optimal.
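For readers working in plain Java, the same fallback idea could look roughly like the sketch below. The class `FallbackDecoder` and its method are hypothetical names, and the sketch works on a byte array (for example from `Files.readAllBytes`) instead of a `Source`. Note that ISO-8859-1 accepts every byte value, so placing it last means decoding never fails outright.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class FallbackDecoder {
    // Charsets are tried in order; ISO-8859-1 maps every byte, so it acts as a catch-all.
    static final List<Charset> DEFAULT_CHARSETS =
            Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

    // Returns the text from the first charset that decodes every byte, or empty if none does.
    static Optional<String> decode(byte[] bytes, List<Charset> charsets) {
        for (Charset charset : charsets) {
            try {
                return Optional.of(charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString());
            } catch (CharacterCodingException e) {
                // This charset rejected the input; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

As with the Scala version, this picks one charset for the whole input, so a file that mixes encodings ends up decoded entirely as ISO-8859-1.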

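A simple solution would be to interpret your data stream as ASCII and ignore all non-text characters. However, you would then lose even the validly encoded UTF-8 characters. I don't know whether that is acceptable for you.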
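A rough sketch of that ASCII-only approach, using a hypothetical class `AsciiOnly`: the US-ASCII decoder treats every byte of 0x80 or above as malformed, and IGNORE drops it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class AsciiOnly {
    // Keeps only the 7-bit ASCII bytes of the input and drops everything else.
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.US_ASCII.newDecoder()
                    .onMalformedInput(CodingErrorAction.IGNORE)
                    .onUnmappableCharacter(CodingErrorAction.IGNORE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected: with IGNORE the decoder skips bad input instead of reporting it.
            throw new IllegalStateException(e);
        }
    }
}
```

Applied to the sample bytes from the earlier answer, this yields `hell wrld`: both ö characters are lost along with the invalid bytes.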

EDIT: If you know in advance which columns are valid UTF-8, you could write your own CSV parser that can be configured with which strategy to use on which column.

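Use ISO-8859-1 as the encoding; this just gives you the byte values packed into a string, one char per byte. That is enough to parse CSV for most encodings. (If you have mixed 8-bit and 16-bit blocks, then you're in trouble; you can still read the lines in ISO-8859-1, but you may not be able to split the line into fields.)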

Once you have the individual fields as separate strings, you can try

 new String(oldstring.getBytes("ISO-8859-1"), "UTF-8")

to generate the string with the proper encoding (use the appropriate encoding name for each field, if you know it).

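Edit: if you want to detect errors, you will have to use java.nio.charset.CharsetDecoder. Mapping to UTF-8 with new String as above just gives you the replacement character U+FFFD in your string when there is an error.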

    val decoder = java.nio.charset.Charset.forName("UTF-8").newDecoder
    // By default will throw a MalformedInputException if encoding fails
    decoder.decode(
      java.nio.ByteBuffer.wrap(oldstring.getBytes("ISO-8859-1"))
    ).toString
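Putting those steps together in Java might look like the sketch below. It assumes the line has already been read as ISO-8859-1, it splits naively on commas with no support for quoted fields, and it leaves a field as ISO-8859-1 whenever strict UTF-8 decoding fails. The class `FieldDecoder` and that fallback choice are illustrative assumptions, not part of the answer above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FieldDecoder {
    // Re-decodes a field that was read as ISO-8859-1: strict UTF-8 if valid, else unchanged.
    static String decodeField(String latin1Field) {
        // ISO-8859-1 round-trips every byte, so this recovers the original bytes.
        byte[] raw = latin1Field.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A new decoder reports malformed input by default, so bad UTF-8 throws here.
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return latin1Field; // assumption: treat non-UTF-8 fields as ISO-8859-1
        }
    }

    // Splits on commas (naive: no quoted fields) and decodes each field on its own.
    static List<String> decodeLine(String latin1Line) {
        List<String> fields = new ArrayList<>();
        for (String field : latin1Line.split(",", -1)) {
            fields.add(decodeField(field));
        }
        return fields;
    }
}
```

Splitting before decoding is safe for UTF-8 fields because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so the comma byte 0x2C never appears inside one.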
