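Reading UTF-8 – BOM marker

I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM). My problem is this: I read the file and output a string, but unfortunately the BOM marker is output too. Why does this happen?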

    fr = new FileReader(file);
    br = new BufferedReader(fr);
    String tmp = null;
    while ((tmp = br.readLine()) != null) {
        String text;
        text = new String(tmp.getBytes(), "UTF-8");
        content += text + System.getProperty("line.separator");
    }

The output after the first line:

 ?<style>

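In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database. It will not be fixed for now, because a fix would break existing tools such as JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.

Take a look at this solution: Handle UTF8 file with BOM

The simplest workaround is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear there for any other reason.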
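To illustrate what "consuming the BOM manually" can look like with only the JDK, here is a minimal sketch. The class name `Utf8BomSkipper` and the method `openUtf8` are made up for this example, and it assumes Java 9+ for `readNBytes`. It peeks at the first three bytes and pushes them back unless they are exactly EF BB BF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or of Commons IO.
class Utf8BomSkipper {

    /** Returns a UTF-8 reader over the stream, discarding a leading EF BB BF if present. */
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // Java 9+
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the reader sees them.
            pushback.unread(head, 0, n);
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 's', '>'};
        try (Reader r = openUtf8(new ByteArrayInputStream(data))) {
            System.out.println((char) r.read()); // prints <
        }
    }
}
```

This only handles the UTF-8 BOM; detecting the UTF-16 and UTF-32 marks as well is what BOMInputStream is for.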

 tmp = tmp.replace("\uFEFF", "");
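A slightly narrower variant of the same idea, as a sketch: strip U+FEFF only when it is the very first character, which is where a BOM would be, instead of removing every occurrence. The class and method names here are hypothetical.

```java
// Hypothetical helper for illustration only.
class BomStripper {

    /** Removes a single leading U+FEFF, leaving the rest of the string untouched. */
    static String stripLeadingBom(String s) {
        if (!s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingBom("\uFEFF<style>")); // prints <style>
    }
}
```

It would be applied to the first line returned by `readLine()` only; later lines cannot start with a BOM.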

See also this Guava bug report.

Use the Apache Commons library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

    String defaultEncoding = "UTF-8";
    InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
    try {
        BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
        ByteOrderMark bom = bOMInputStream.getBOM();
        String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
        InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
        // use reader
    } finally {
        inputStream.close();
    }

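Here is how I use the Apache BOMInputStream, with a try-with-resources block. The "false" argument is the include flag: it tells the object to leave the listed BOMs out of the stream it hands on (we use "BOM-less" text files for safety reasons, haha):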

    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    new BOMInputStream(new FileInputStream(file), false,
                            ByteOrderMark.UTF_8,
                            ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                            ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
        // use br here
    } catch (Exception e) {
        // handle or log the exception
    }

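As mentioned here, this is usually a problem with files created on Windows.

One possible solution is to run the file through a tool such as dos2unix first.

Use Apache Commons IO.

For example, let's look at my code (used for reading a text file containing both Latin and Cyrillic characters):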

    String defaultEncoding = "UTF-16";
    InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));
    BOMInputStream bomInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bomInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
    int data = reader.read();
    while (data != -1) {
        char theChar = (char) data;
        data = reader.read();
        ari.add(Character.toString(theChar));
    }
    reader.close();

As a result, we have an ArrayList named "ari" that holds every character from the file "1.txt" except the BOM.

Then I came up with this Reader subclass:

    /*
     * Copyright (C) 2016 donizyo
     */
    package net.donizyo.io;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.commons.io.ByteOrderMark;
    import org.apache.commons.io.input.BOMInputStream;

    public class BOMReader extends BufferedReader {
        public static final String DEFAULT_ENCODING = "UTF-8";

        public BOMReader(File file) throws IOException {
            this(file, DEFAULT_ENCODING);
        }

        private BOMReader(File file, String encoding) throws IOException {
            this(new FileInputStream(file), encoding);
        }

        private BOMReader(FileInputStream input, String encoding) throws IOException {
            this(new BOMInputStream(input), encoding);
        }

        private BOMReader(BOMInputStream input, String encoding) throws IOException {
            super(new InputStreamReader(input, getCharset(input, encoding)));
        }

        // Use the charset named by the BOM if there is one, otherwise the fallback.
        private static String getCharset(BOMInputStream bomInput, String encoding) throws IOException {
            ByteOrderMark bom;
            bom = bomInput.getBOM();
            return bom == null ? encoding : bom.getCharsetName();
        }
    }

The easiest way I found to bypass the BOM:

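    BufferedReader br = new BufferedReader(new InputStreamReader(fis));
    while ((currentLine = br.readLine()) != null) {
        // If the file has a UTF-8 BOM, remove it. "ï»¿" is how the BOM bytes
        // EF BB BF appear when decoded as ISO-8859-1 or windows-1252, so this
        // only matches when the reader's charset is such a single-byte charset.
        currentLine = currentLine.replace("ï»¿", "");
        // ... process currentLine
    }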
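Why the `"ï»¿"` literal in the snippet above matches at all can be checked with a small experiment. This is only an illustrative sketch (the class name is made up): the same three BOM bytes are decoded once as ISO-8859-1 and once as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Illustrative demo class, not from any library.
class BomDecodingDemo {
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Each byte becomes one character: U+00EF, U+00BB, U+00BF. */
    static String decodeLatin1() {
        return new String(UTF8_BOM, StandardCharsets.ISO_8859_1);
    }

    /** The three bytes form one character: U+FEFF. Java keeps it in the result. */
    static String decodeUtf8() {
        return new String(UTF8_BOM, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodeLatin1().length());                      // prints 3
        System.out.println(decodeUtf8().length());                        // prints 1
        System.out.println(Integer.toHexString(decodeUtf8().charAt(0)));  // prints feff
    }
}
```

So which replacement works depends on the charset the reader used: `"ï»¿"` for a single-byte Western charset, `"\uFEFF"` for UTF-8.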

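I'm not sure what you think you are achieving with tmp.getBytes() and "UTF-8" and so on.

I'm pretty sure Java doesn't support the BOM, although I can't find the documentation that says so right now.

It's worth mentioning that a BOM in UTF-8 is pointless, because the standard fixes the byte order regardless of hardware. So if you can prevent BOMs from being generated in the first place, that may help.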
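One way to avoid the getBytes() round trip altogether is to name the charset where the bytes are decoded, so each line is already a correctly decoded String. The following is a sketch under that assumption, not the asker's code; `ExplicitCharsetRead` and `readUtf8` are hypothetical names, and the temporary file exists only for the demo.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class.
class ExplicitCharsetRead {

    /** Reads the whole file as UTF-8 and drops a BOM at the start of the first line. */
    static String readUtf8(Path file) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // the decoded BOM is a single char
                }
                first = false;
                content.append(line).append(System.lineSeparator());
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        // Encoding U+FEFF as UTF-8 writes the bytes EF BB BF at the start of the file.
        Files.write(tmp, "\uFEFF<style>\nbody {}\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readUtf8(tmp));
        Files.delete(tmp);
    }
}
```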
