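How to compare 2 files fast using .NET?

Typical approaches recommend reading the binary via FileStream and comparing it byte by byte.

Would a checksum comparison such as CRC be faster?
Are there any .NET libraries that can generate a checksum for a file?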

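A checksum comparison will most likely be slower than a byte-by-byte comparison.

In order to generate a checksum, you need to load every byte of the file and process it. You then have to do the same for the second file. The processing will almost certainly be slower than the comparison check.

As for generating a checksum: you can do this easily with the cryptography classes, for instance by generating an MD5 checksum in C#.

However, a checksum may be faster and make more sense if you can precompute the checksum of the "test" or "base" case. If you have an existing file and you are checking whether a new file is the same as the existing one, precomputing the checksum of the "existing" file means the disk I/O only has to be done once, on the new file. This would likely be faster than a byte-by-byte comparison.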
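To make the precomputed-checksum idea concrete, here is a minimal sketch. The class and method names (`ChecksumSketch`, `ComputeMd5Hex`, `MatchesStoredChecksum`) are made up for illustration; only `MD5.Create()` and `ComputeHash(Stream)` come from the framework. It assumes the checksum of the existing file was stored somewhere as a hex string.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class ChecksumSketch
{
    // Hypothetical helper: returns the MD5 digest of a file as an
    // upper-case hex string without separators.
    public static string ComputeMd5Hex(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // Compares a new file against a checksum computed earlier for the
    // "existing" file, so only the new file is read from disk.
    public static bool MatchesStoredChecksum(string newFilePath, string storedMd5Hex)
    {
        return string.Equals(ComputeMd5Hex(newFilePath), storedMd5Hex,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

A matching checksum only says the files are very likely identical; when certainty matters, a byte-by-byte comparison is still needed.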

The slowest possible method is to compare two files byte by byte. The fastest I've been able to come up with is a similar comparison, but instead of one byte at a time, you use an array of bytes sized to Int64 and then compare the resulting numbers.

Here's what I came up with:

    const int BYTES_TO_READ = sizeof(Int64);

    static bool FilesAreEqual(FileInfo first, FileInfo second)
    {
        // Files of different length cannot be equal.
        if (first.Length != second.Length)
            return false;

        // The same path is trivially the same file.
        if (first.FullName == second.FullName)
            return true;

        int iterations = (int)Math.Ceiling((double)first.Length / BYTES_TO_READ);

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            byte[] one = new byte[BYTES_TO_READ];
            byte[] two = new byte[BYTES_TO_READ];

            for (int i = 0; i < iterations; i++)
            {
                fs1.Read(one, 0, BYTES_TO_READ);
                fs2.Read(two, 0, BYTES_TO_READ);

                // Compare eight bytes at a time as a single Int64.
                if (BitConverter.ToInt64(one, 0) != BitConverter.ToInt64(two, 0))
                    return false;
            }
        }

        return true;
    }

In my testing, I saw this outperform a straightforward ReadByte() scenario by almost 3:1. Averaged over 1000 runs, I got this method at 1063 ms and the method below (straightforward byte-by-byte comparison) at 3031 ms. Hashing always came back sub-second, at an average of around 865 ms. This testing was done with a ~100 MB video file.

Here are the ReadByte and hashing methods I used, for comparison purposes:

    static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
    {
        if (first.Length != second.Length)
            return false;

        if (first.FullName == second.FullName)
            return true;

        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            for (long i = 0; i < first.Length; i++)
            {
                if (fs1.ReadByte() != fs2.ReadByte())
                    return false;
            }
        }

        return true;
    }

    static bool FilesAreEqual_Hash(FileInfo first, FileInfo second)
    {
        byte[] firstHash, secondHash;

        // Dispose the hash object and both streams once the hashes are computed.
        using (MD5 md5 = MD5.Create())
        using (FileStream fs1 = first.OpenRead())
        using (FileStream fs2 = second.OpenRead())
        {
            firstHash = md5.ComputeHash(fs1);
            secondHash = md5.ComputeHash(fs2);
        }

        for (int i = 0; i < firstHash.Length; i++)
        {
            if (firstHash[i] != secondHash[i])
                return false;
        }

        return true;
    }

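In addition to Reed Copsey's answer:

The worst case is where the two files are identical. In this case it is best to compare the files byte by byte.
If the two files are not identical, you can speed things up by detecting sooner that they are not identical.

For example, if the two files have different lengths, you know they cannot be identical, and you don't even have to compare their actual content.

The only thing that might make a checksum comparison slightly faster than a byte-by-byte comparison is that you read one file at a time, which reduces the seek time of the disk head. That slight gain may, however, very well be eaten up by the added time of calculating the hash.

Also, a checksum comparison only has any chance of being faster if the files are identical. If they are not, a byte-by-byte comparison ends at the first difference, making it a lot faster.

You should also consider that a hash code comparison only tells you that the files are very likely identical. To be 100% certain you need to do a byte-by-byte comparison.

If the hash code is 32 bits, for example, you are about 99.99999998% certain that the files are identical if the hash codes match. That is close to 100%, but if you truly need 100% certainty, that's not it.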
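The 99.99999998% figure can be reproduced with a short calculation, under the simplifying assumption that the hash spreads inputs uniformly over all $2^{n}$ possible values, so that two different files collide with probability $2^{-n}$:

```latex
\[
P(\text{collision}) = 2^{-n}, \qquad
n = 32:\quad 2^{-32} \approx 2.33 \times 10^{-10}
\]
\[
1 - 2^{-32} \approx 0.99999999977 \approx 99.99999998\,\%
\]
```

With a wider hash the collision probability shrinks further (for a 128-bit hash such as MD5 it would be $2^{-128}$ under the same assumption), but it never reaches zero.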

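If you do decide you truly need a full byte-by-byte comparison (see the other answers for a discussion of hashing), then the one-line solution is:

    bool bFilesAreEqual = File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

Unlike some of the other posted answers, this is correct for any kind of file: binary, text, media, executable, and so on. But because it is a full binary comparison, files that differ only in "unimportant" ways (such as BOM, line endings, character encoding, media metadata, whitespace, padding, source code comments, etc.) will always be considered not equal.

This code loads both files into memory entirely, so it should not be used for comparing huge files. Apart from that, loading the files fully is not really a penalty. In fact, this could be an ideal .NET solution for file sizes expected to be under 85K, because small allocations in .NET are very cheap and we delegate file performance and optimization as much as possible to the CLR/BCL.

Furthermore, for such workaday scenarios, concerns about the performance of byte-by-byte comparison via LINQ enumerators (as shown here) are moot, since hitting the disk for file I/O at all will dwarf, by several orders of magnitude, the benefits of the various memory-comparison alternatives. For example, even though SequenceEqual does give us the "optimization" of abandoning on the first mismatch, this hardly matters once the files' contents have already been fetched, each of which is fully needed to confirm a match.

On the other hand, the code above does not include an eager abort for files of different sizes, which can provide a tangible (possibly measurable) performance difference. It is tangible because the file length is available in the WIN32_FILE_ATTRIBUTE_DATA structure (which must be fetched first for any file access anyway), whereas going on to access the file's contents requires an entirely different fetch, which can potentially be avoided. If you are concerned about this, the solution becomes two lines:

    // slight optimization over the code shown above
    bool bFilesAreEqual = new FileInfo(path1).Length == new FileInfo(path2).Length &&
        File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));

You could also extend this to avoid the secondary fetches if the (equal) Length values are both found to be zero (not shown) and/or to avoid building each FileInfo twice (also not shown).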
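One way the "not shown" extension might look is sketched below. This is an interpretation rather than the answer author's code: the class and method names are made up, and it simply reads each length once and returns early when both files are empty.

```csharp
using System.IO;
using System.Linq;

static class WholeFileCompareSketch
{
    public static bool FilesAreEqual(string path1, string path2)
    {
        // Read each length once; FileInfo.Length throws if the file is missing.
        long length1 = new FileInfo(path1).Length;
        long length2 = new FileInfo(path2).Length;

        if (length1 != length2) return false; // eager abort, no content read
        if (length1 == 0) return true;        // two empty files, nothing to fetch

        // Same lengths: fall back to the full in-memory comparison.
        return File.ReadAllBytes(path1).SequenceEqual(File.ReadAllBytes(path2));
    }
}
```

The caveat about huge files still applies, since both files are loaded into memory when the lengths match.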

It gets even faster if you don't read in small 8-byte chunks but put a loop around it and read a larger chunk. I reduced the average comparison time to 1/4.

    public static bool FilesContentsAreEqual(FileInfo fileInfo1, FileInfo fileInfo2)
    {
        bool result;

        if (fileInfo1.Length != fileInfo2.Length)
        {
            result = false;
        }
        else
        {
            using (var file1 = fileInfo1.OpenRead())
            {
                using (var file2 = fileInfo2.OpenRead())
                {
                    result = StreamsContentsAreEqual(file1, file2);
                }
            }
        }

        return result;
    }

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 1024 * sizeof(Int64);
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            // Note: this relies on both streams returning the same number of
            // bytes per Read call, which Stream does not guarantee in general.
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            // Compare the bytes just read, eight at a time.
            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }

Edit: this method will not work for comparing binary files!

In .NET 4.0, the File class has the following two new methods:

    public static IEnumerable<string> ReadLines(string path)
    public static IEnumerable<string> ReadLines(string path, Encoding encoding)

Which means you could use:

    bool same = File.ReadLines(path1).SequenceEqual(File.ReadLines(path2));

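Honestly, I think you need to prune your search tree down as much as possible.

Things to check before going byte by byte:

Are the sizes the same?
Is the last byte in file A different from the last byte in file B?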
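The two checks above can be sketched as a cheap pre-filter. The names are hypothetical, and a `false` result only means the files have not been ruled out, so a full comparison still has to follow.

```csharp
using System.IO;

static class QuickRejectSketch
{
    // Returns true when the files are definitely different; false means
    // "not ruled out yet", so a full comparison is still required.
    public static bool DefinitelyDifferent(string pathA, string pathB)
    {
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            if (a.Length != b.Length) return true; // sizes differ
            if (a.Length == 0) return false;       // both empty, no last byte

            // Jump to the final byte of each file and compare.
            a.Seek(-1, SeekOrigin.End);
            b.Seek(-1, SeekOrigin.End);
            return a.ReadByte() != b.ReadByte();
        }
    }
}
```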

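Also, reading large blocks at a time will be more efficient, since drives read sequential bytes more quickly. Going byte by byte not only causes far more system calls, it also makes the read head of a traditional hard drive seek back and forth more often if both files are on the same drive.

Read chunk A and chunk B into byte buffers and compare them (do NOT use Array.Equals, see the comments). Tune the size of the blocks until you reach what you feel is a good trade-off between memory and performance. You could also multi-thread the comparison, but don't multi-thread the disk reads.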
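A minimal sketch of the chunked approach follows. It assumes a runtime where `Span<T>` is available (.NET Core 2.1 or later, or the System.Memory package); the 64 KB default block size and all names are illustrative choices, not recommendations from the answer. `SequenceEqual` on spans compares contents, unlike `Array.Equals`, which compares references.

```csharp
using System;
using System.IO;

static class ChunkCompareSketch
{
    // Reads until the buffer is full or the stream ends; returns bytes read.
    // Needed because a single Stream.Read call may return fewer bytes than asked.
    static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        return total;
    }

    public static bool StreamsAreEqual(Stream a, Stream b, int blockSize = 64 * 1024)
    {
        var bufferA = new byte[blockSize];
        var bufferB = new byte[blockSize];

        while (true)
        {
            int countA = ReadFully(a, bufferA);
            int countB = ReadFully(b, bufferB);

            if (countA != countB) return false; // one stream ended early
            if (countA == 0) return true;       // both ended, no difference found

            // Compare only the bytes actually read in this block.
            if (!bufferA.AsSpan(0, countA).SequenceEqual(bufferB.AsSpan(0, countB)))
                return false;
        }
    }
}
```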

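My experiments show that it definitely helps to call Stream.ReadByte() fewer times, but using BitConverter to package bytes does not make much difference compared with comparing the bytes in a byte array directly.

So it is possible to replace that "Math.Ceiling and iterations" loop in the comment above with the simplest one:

    for (int i = 0; i < count1; i++)
    {
        if (buffer1[i] != buffer2[i])
            return false;
    }

I guess this has to do with the fact that BitConverter.ToInt64 needs to do a bit of work (check the arguments and then perform the bit shifting) before the comparison, and that ends up being the same amount of work as comparing 8 bytes in two arrays.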

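Another improvement for large files of identical length might be to not read the files sequentially, but rather to compare more or less random blocks.

You can use multiple threads, starting at different positions in the file and comparing either forward or backward.

This way you can detect changes in the middle or at the end of the file faster than you would get there with a sequential approach.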

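If you only need to compare two files, I guess the fastest way would be (in C, I don't know if it's applicable to .NET):

open both files f1, f2
get the respective file lengths l1, l2
if l1 != l2 the files are different; stop
mmap() both files
use memcmp() on the mmap()ed files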
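.NET does offer memory-mapped files through `System.IO.MemoryMappedFiles`, so the steps above can be approximated as in the sketch below. This is not a true `memcmp()` on the mapped memory: to stay in safe code it copies blocks out of each view before comparing them, and it assumes `Span<T>` is available. The class name and block size are illustrative.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedCompareSketch
{
    public static bool FilesAreEqual(string path1, string path2)
    {
        long length = new FileInfo(path1).Length;
        if (length != new FileInfo(path2).Length) return false; // lengths differ; stop
        if (length == 0) return true; // an empty file cannot be mapped

        using (var map1 = MemoryMappedFile.CreateFromFile(path1, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var map2 = MemoryMappedFile.CreateFromFile(path2, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view1 = map1.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        using (var view2 = map2.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            const int blockSize = 64 * 1024;
            var block1 = new byte[blockSize];
            var block2 = new byte[blockSize];

            for (long offset = 0; offset < length; offset += blockSize)
            {
                // Copy the next block out of each mapped view, then compare.
                int count = (int)Math.Min(blockSize, length - offset);
                view1.ReadArray(offset, block1, 0, count);
                view2.ReadArray(offset, block2, 0, count);

                if (!block1.AsSpan(0, count).SequenceEqual(block2.AsSpan(0, count)))
                    return false;
            }
            return true;
        }
    }
}
```

Whether this beats plain buffered `FileStream` reads depends on the platform and file sizes, so it is worth measuring before adopting it.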

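OTOH, if you need to find out whether there are duplicate files in a set of N files, then the fastest way is undoubtedly to use a hash, to avoid N-way bit-by-bit comparisons.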
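For the N-files case, a rough sketch of hash-based grouping is shown below. The choice of SHA-256, the pre-grouping by file length, and all names are assumptions made for illustration; the answer only says to use a hash.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinderSketch
{
    // Hypothetical helper: hashes a file's contents and returns the digest as Base64.
    static string HashFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return Convert.ToBase64String(sha.ComputeHash(stream));
        }
    }

    // Returns groups of paths whose length and hash both match.
    public static List<List<string>> FindDuplicates(IEnumerable<string> paths)
    {
        return paths
            .GroupBy(p => new FileInfo(p).Length)
            .Where(sameSize => sameSize.Count() > 1)   // a unique size cannot have a duplicate
            .SelectMany(sameSize => sameSize.GroupBy(p => HashFile(p)))
            .Where(sameHash => sameHash.Count() > 1)   // each file is hashed only once
            .Select(sameHash => sameHash.ToList())
            .ToList();
    }
}
```

Each file is read once instead of being compared against every other file. Files in the same group are very likely identical; a byte-by-byte check within a group can confirm it when certainty is required.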

Something (hopefully) reasonably efficient:

    public class FileCompare
    {
        public static bool FilesEqual(string fileName1, string fileName2)
        {
            return FilesEqual(new FileInfo(fileName1), new FileInfo(fileName2));
        }

        /// <summary>
        /// Compares two files block by block.
        /// </summary>
        /// <param name="file1"></param>
        /// <param name="file2"></param>
        /// <param name="bufferSize">8kb seemed like a good default</param>
        /// <returns></returns>
        public static bool FilesEqual(FileInfo file1, FileInfo file2, int bufferSize = 8192)
        {
            if (!file1.Exists || !file2.Exists || file1.Length != file2.Length)
                return false;

            var buffer1 = new byte[bufferSize];
            var buffer2 = new byte[bufferSize];

            using (var stream1 = file1.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                using (var stream2 = file2.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
                {
                    while (true)
                    {
                        var bytesRead1 = stream1.Read(buffer1, 0, bufferSize);
                        var bytesRead2 = stream2.Read(buffer2, 0, bufferSize);

                        if (bytesRead1 != bytesRead2) return false;
                        if (bytesRead1 == 0) return true;
                        if (!ArraysEqual(buffer1, buffer2, bytesRead1)) return false;
                    }
                }
            }
        }

        /// <summary>
        /// Compares the leading bytes of two arrays.
        /// </summary>
        /// <param name="array1"></param>
        /// <param name="array2"></param>
        /// <param name="bytesToCompare"> 0 means compare entire arrays</param>
        /// <returns></returns>
        public static bool ArraysEqual(byte[] array1, byte[] array2, int bytesToCompare = 0)
        {
            if (array1.Length != array2.Length) return false;

            var length = (bytesToCompare == 0) ? array1.Length : bytesToCompare;
            var tailIdx = length - length % sizeof(Int64);

            //check in 8 byte chunks
            for (var i = 0; i < tailIdx; i += sizeof(Int64))
            {
                if (BitConverter.ToInt64(array1, i) != BitConverter.ToInt64(array2, i)) return false;
            }

            //check the remainder of the array, always shorter than 8 bytes
            for (var i = tailIdx; i < length; i++)
            {
                if (array1[i] != array2[i]) return false;
            }

            return true;
        }
    }

Here are some utility functions that let you determine whether two files (or two streams) contain identical data.

I have provided a "fast" version that is multi-threaded: it uses Tasks to compare byte arrays (each buffer filled with what has been read from each file) on different threads.

As expected, it is much faster (around 3x faster), but it consumes more CPU (because it is multi-threaded) and more memory (because it needs two byte array buffers per comparison thread).

    public static bool AreFilesIdenticalFast(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdenticalFast);
    }

    public static bool AreFilesIdentical(string path1, string path2)
    {
        return AreFilesIdentical(path1, path2, AreStreamsIdentical);
    }

    public static bool AreFilesIdentical(string path1, string path2, Func<Stream, Stream, bool> areStreamsIdentical)
    {
        if (path1 == null) throw new ArgumentNullException(nameof(path1));
        if (path2 == null) throw new ArgumentNullException(nameof(path2));
        if (areStreamsIdentical == null) throw new ArgumentNullException(nameof(areStreamsIdentical));

        if (!File.Exists(path1) || !File.Exists(path2))
            return false;

        using (var thisFile = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            using (var valueFile = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
            {
                if (valueFile.Length != thisFile.Length)
                    return false;

                if (!areStreamsIdentical(thisFile, valueFile))
                    return false;
            }
        }
        return true;
    }

    public static bool AreStreamsIdenticalFast(Stream stream1, Stream stream2)
    {
        if (stream1 == null) throw new ArgumentNullException(nameof(stream1));
        if (stream2 == null) throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)

        var tasks = new List<Task<bool>>();
        do
        {
            // consumes more memory (two buffers for each tasks)
            var buffer1 = new byte[bufsize];
            var buffer2 = new byte[bufsize];

            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
            {
                int read3 = stream2.Read(buffer2, 0, 1);
                if (read3 != 0) // not eof
                    return false;

                break;
            }

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            // consumes more cpu
            var task = Task.Run(() => { return IsSame(buffer1, buffer2); });
            tasks.Add(task);
        }
        while (true);

        Task.WaitAll(tasks.ToArray());
        return !tasks.Any(t => !t.Result);
    }

    public static bool AreStreamsIdentical(Stream stream1, Stream stream2)
    {
        if (stream1 == null) throw new ArgumentNullException(nameof(stream1));
        if (stream2 == null) throw new ArgumentNullException(nameof(stream2));

        const int bufsize = 80000; // 80000 is below LOH (85000)

        var buffer1 = new byte[bufsize];
        var buffer2 = new byte[bufsize];

        do
        {
            int read1 = stream1.Read(buffer1, 0, buffer1.Length);
            if (read1 == 0)
                return stream2.Read(buffer2, 0, 1) == 0; // check not eof

            // both stream read could return different counts
            int read2 = 0;
            do
            {
                int read3 = stream2.Read(buffer2, read2, read1 - read2);
                if (read3 == 0)
                    return false;

                read2 += read3;
            }
            while (read2 < read1);

            if (!IsSame(buffer1, buffer2))
                return false;
        }
        while (true);
    }

    public static bool IsSame(byte[] bytes1, byte[] bytes2)
    {
        if (bytes1 == null) throw new ArgumentNullException(nameof(bytes1));
        if (bytes2 == null) throw new ArgumentNullException(nameof(bytes2));

        if (bytes1.Length != bytes2.Length)
            return false;

        for (int i = 0; i < bytes1.Length; i++)
        {
            if (bytes1[i] != bytes2[i])
                return false;
        }
        return true;
    }

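If the files are not too big, you can use:

    public static byte[] ComputeFileHash(string fileName)
    {
        // Dispose the hash object as well as the stream.
        using (var md5 = System.Security.Cryptography.MD5.Create())
        using (var stream = File.OpenRead(fileName))
            return md5.ComputeHash(stream);
    }

Comparing hashes is only worthwhile if the hashes can be stored.

(Edited the code to something much cleaner.)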

I think there are applications where a "hash" is faster than comparing byte by byte, for example if you need to compare a file against several others, or if you have a thumbnail of a photo that can change. It depends on where and how it is used.

    private bool CompareFilesByte(string file1, string file2)
    {
        using (var fs1 = new FileStream(file1, FileMode.Open))
        using (var fs2 = new FileStream(file2, FileMode.Open))
        {
            if (fs1.Length != fs2.Length) return false;

            int b1, b2;
            do
            {
                b1 = fs1.ReadByte();
                b2 = fs2.ReadByte();
                // Only a mismatch means "different"; reaching the end of
                // both files (b1 == b2 == -1) just ends the loop.
                if (b1 != b2) return false;
            } while (b1 >= 0);
        }
        return true;
    }

    private string HashFile(string file)
    {
        using (var fs = new FileStream(file, FileMode.Open))
        using (var hash = SHA512.Create())
        {
            // Hash the whole stream. (file.Length would be the length of
            // the path string, not of the file.)
            return Convert.ToBase64String(hash.ComputeHash(fs));
        }
    }

    private bool CompareFilesWithHash(string file1, string file2)
    {
        var str1 = HashFile(file1);
        var str2 = HashFile(file2);
        return str1 == str2;
    }

Here you can measure which one is fastest.

    var sw = new Stopwatch();

    sw.Start();
    var compare1 = CompareFilesWithHash(receiveLogPath, logPath);
    sw.Stop();
    Debug.WriteLine(string.Format("Compare using Hash {0}", sw.ElapsedTicks));

    sw.Reset();
    sw.Start();
    var compare2 = CompareFilesByte(receiveLogPath, logPath);
    sw.Stop();
    Debug.WriteLine(string.Format("Compare byte-byte {0}", sw.ElapsedTicks));

Optionally, we can save the hash in a database.

Hope this helps.

I've found that this works well: compare the lengths first, without reading any data, and then compare the byte sequences that were read.

    private static bool IsFileIdentical(string a, string b)
    {
        if (new FileInfo(a).Length != new FileInfo(b).Length) return false;
        return (File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b)));
    }
