HashSet与列表性能

很显然，genericsHashSet<T>类的search性能高于genericsList<T>类的search性能。只需比较基于散列的密钥和List<T>类中的线性方法。

然而，计算散列键本身可能需要一些CPU周期，因此对于less量项目，线性search可以是HashSet<T>的真正替代。

我的问题：盈亏平衡点在哪里？

为了简化scheme（公平地），我们假设List<T>类使用元素的Equals()方法来标识一个项目。

很多人都说，一旦你达到了速度实际上是HashSet<T>将总是击败List<T> ，但这取决于你在做什么的大小。

假设你有一个List<T> ，它只有平均5个项目。在大量循环中，如果每个循环都添加或删除单个项目，最好使用List<T> 。

我在我的机器上做了一个testing，而且，从List<T>获得优势，它必须非常小。对于短string的列表，优势在5号之后消失，对于20号之后的对象。

 1 item LIST strs time: 617ms 1 item HASHSET strs time: 1332ms 2 item LIST strs time: 781ms 2 item HASHSET strs time: 1354ms 3 item LIST strs time: 950ms 3 item HASHSET strs time: 1405ms 4 item LIST strs time: 1126ms 4 item HASHSET strs time: 1441ms 5 item LIST strs time: 1370ms 5 item HASHSET strs time: 1452ms 6 item LIST strs time: 1481ms 6 item HASHSET strs time: 1418ms 7 item LIST strs time: 1581ms 7 item HASHSET strs time: 1464ms 8 item LIST strs time: 1726ms 8 item HASHSET strs time: 1398ms 9 item LIST strs time: 1901ms 9 item HASHSET strs time: 1433ms 1 item LIST objs time: 614ms 1 item HASHSET objs time: 1993ms 4 item LIST objs time: 837ms 4 item HASHSET objs time: 1914ms 7 item LIST objs time: 1070ms 7 item HASHSET objs time: 1900ms 10 item LIST objs time: 1267ms 10 item HASHSET objs time: 1904ms 13 item LIST objs time: 1494ms 13 item HASHSET objs time: 1893ms 16 item LIST objs time: 1695ms 16 item HASHSET objs time: 1879ms 19 item LIST objs time: 1902ms 19 item HASHSET objs time: 1950ms 22 item LIST objs time: 2136ms 22 item HASHSET objs time: 1893ms 25 item LIST objs time: 2357ms 25 item HASHSET objs time: 1826ms 28 item LIST objs time: 2555ms 28 item HASHSET objs time: 1865ms 31 item LIST objs time: 2755ms 31 item HASHSET objs time: 1963ms 34 item LIST objs time: 3025ms 34 item HASHSET objs time: 1874ms 37 item LIST objs time: 3195ms 37 item HASHSET objs time: 1958ms 40 item LIST objs time: 3401ms 40 item HASHSET objs time: 1855ms 43 item LIST objs time: 3618ms 43 item HASHSET objs time: 1869ms 46 item LIST objs time: 3883ms 46 item HASHSET objs time: 2046ms 49 item LIST objs time: 4218ms 49 item HASHSET objs time: 1873ms

这里是以graphics显示的数据：

在这里输入图像描述

代码如下：

 static void Main(string[] args) { int times = 10000000; for (int listSize = 1; listSize < 10; listSize++) { List<string> list = new List<string>(); HashSet<string> hashset = new HashSet<string>(); for (int i = 0; i < listSize; i++) { list.Add("string" + i.ToString()); hashset.Add("string" + i.ToString()); } Stopwatch timer = new Stopwatch(); timer.Start(); for (int i = 0; i < times; i++) { list.Remove("string0"); list.Add("string0"); } timer.Stop(); Console.WriteLine(listSize.ToString() + " item LIST strs time: " + timer.ElapsedMilliseconds.ToString() + "ms"); timer = new Stopwatch(); timer.Start(); for (int i = 0; i < times; i++) { hashset.Remove("string0"); hashset.Add("string0"); } timer.Stop(); Console.WriteLine(listSize.ToString() + " item HASHSET strs time: " + timer.ElapsedMilliseconds.ToString() + "ms"); Console.WriteLine(); } for (int listSize = 1; listSize < 50; listSize+=3) { List<object> list = new List<object>(); HashSet<object> hashset = new HashSet<object>(); for (int i = 0; i < listSize; i++) { list.Add(new object()); hashset.Add(new object()); } object objToAddRem = list[0]; Stopwatch timer = new Stopwatch(); timer.Start(); for (int i = 0; i < times; i++) { list.Remove(objToAddRem); list.Add(objToAddRem); } timer.Stop(); Console.WriteLine(listSize.ToString() + " item LIST objs time: " + timer.ElapsedMilliseconds.ToString() + "ms"); timer = new Stopwatch(); timer.Start(); for (int i = 0; i < times; i++) { hashset.Remove(objToAddRem); hashset.Add(objToAddRem); } timer.Stop(); Console.WriteLine(listSize.ToString() + " item HASHSET objs time: " + timer.ElapsedMilliseconds.ToString() + "ms"); Console.WriteLine(); } Console.ReadLine(); }

你看错了。是的，列表的线性search会击败less量项目的HashSet。但是性能差异通常对于那些小的collections无关紧要。一般来说，你需要担心的是大型的collections品，这就是你在Big-O方面的想法。然而，如果你已经测量了HashSet性能的真正瓶颈，那么你可以尝试创build一个混合的List / HashSet，但是你会通过进行大量的经验性能testing来做到这一点，而不是在SO上提出问题。

是否使用HashSet <>或者列表<>取决于你如何访问你的集合 。如果您需要保证项目的顺序，请使用列表。如果你不这样做，请使用HashSet。让微软担心他们的哈希algorithm和对象的实现。

HashSet将访问项目而不必枚举集合（ O（1）或其附近的复杂性），并且由于List保证顺序，与HashSet不同，有些项目将不得不枚举（O（n）的复杂性）。

比较两个性能不同的结构是没有意义的。使用传达意图的结构。即使你说你的List<T>不会有重复，并且迭代次序并不重要，使得它与HashSet<T> ，使用List<T>仍然是一个糟糕的select，因为它的容错性相对较小。

也就是说，我会检查一下其他方面的performance，

 +------------+--------+-------------+-----------+----------+----------+-----------+ | Collection | Random | Containment | Insertion | Addition | Removal | Memory | | | access | | | | | | +------------+--------+-------------+-----------+----------+----------+-----------+ | List<T> | O(1) | O(n) | O(n) | O(1)* | O(n) | Lesser | | HashSet<T> | O(n) | O(1) | n/a | O(1) | O(1) | Greater** | +------------+--------+-------------+-----------+----------+----------+-----------+

* Even though addition is O(1) in both cases, it will be relatively slower in HashSet<T> since it involves cost of precomputing hash code before storing it.

** The superior scalability of HashSet<T> has a memory cost. Every entry is stored as a new object along with its hash code. This article might give you an idea.

只是以为我会用不同的场景的基准来说明以前的答案：

一些（12-20）小string（长度在5到10个字符之间）
许多（〜10K）小弦
几个长string（长度在200到1000个字符之间）
许多（〜5K）长的string
几个整数
许多（〜10K）整数

对于每种情况，查找出现的值：

在列表的开头（“开始”，索引0）
在名单的开头（“早”，索引1）
在列表中间（“中间”，索引计数/ 2）
在列表末尾（“迟到”，索引计数2）
在列表的末尾（“结束”，索引计数-1）

在每个场景之前，我生成随机大小的随机string列表，然后将每个列表提供给一个哈希集合。每个场景跑了一万次，本质上是：

（testing伪代码）

 stopwatch.start for X times exists = list.Contains(lookup); stopwatch.stop stopwatch.start for X times exists = hashset.Contains(lookup); stopwatch.stop

示例输出

testingWindows 7,12GB内存，64位，至强2.8GHz

 ---------- Testing few small strings ------------ Sample items: (16 total) vgnwaloqf diwfpxbv tdcdc grfch icsjwk ... Benchmarks: 1: hashset: late -- 100.00 % -- [Elapsed: 0.0018398 sec] 2: hashset: middle -- 104.19 % -- [Elapsed: 0.0019169 sec] 3: hashset: end -- 108.21 % -- [Elapsed: 0.0019908 sec] 4: list: early -- 144.62 % -- [Elapsed: 0.0026607 sec] 5: hashset: start -- 174.32 % -- [Elapsed: 0.0032071 sec] 6: list: middle -- 187.72 % -- [Elapsed: 0.0034536 sec] 7: list: late -- 192.66 % -- [Elapsed: 0.0035446 sec] 8: list: end -- 215.42 % -- [Elapsed: 0.0039633 sec] 9: hashset: early -- 217.95 % -- [Elapsed: 0.0040098 sec] 10: list: start -- 576.55 % -- [Elapsed: 0.0106073 sec] ---------- Testing many small strings ------------ Sample items: (10346 total) dmnowa yshtrxorj vthjk okrxegip vwpoltck ... Benchmarks: 1: hashset: end -- 100.00 % -- [Elapsed: 0.0017443 sec] 2: hashset: late -- 102.91 % -- [Elapsed: 0.0017951 sec] 3: hashset: middle -- 106.23 % -- [Elapsed: 0.0018529 sec] 4: list: early -- 107.49 % -- [Elapsed: 0.0018749 sec] 5: list: start -- 126.23 % -- [Elapsed: 0.0022018 sec] 6: hashset: early -- 134.11 % -- [Elapsed: 0.0023393 sec] 7: hashset: start -- 372.09 % -- [Elapsed: 0.0064903 sec] 8: list: middle -- 48,593.79 % -- [Elapsed: 0.8476214 sec] 9: list: end -- 99,020.73 % -- [Elapsed: 1.7272186 sec] 10: list: late -- 99,089.36 % -- [Elapsed: 1.7284155 sec] ---------- Testing few long strings ------------ Sample items: (19 total) hidfymjyjtffcjmlcaoivbylakmqgoiowbgxpyhnrreodxyleehkhsofjqenyrrtlphbcnvdrbqdvji... ... Benchmarks: 1: list: early -- 100.00 % -- [Elapsed: 0.0018266 sec] 2: list: start -- 115.76 % -- [Elapsed: 0.0021144 sec] 3: list: middle -- 143.44 % -- [Elapsed: 0.0026201 sec] 4: list: late -- 190.05 % -- [Elapsed: 0.0034715 sec] 5: list: end -- 193.78 % -- [Elapsed: 0.0035395 sec] 6: hashset: early -- 215.00 % -- [Elapsed: 0.0039271 sec] 7: hashset: end -- 248.47 % -- [Elapsed: 0.0045386 sec] 8: hashset: start -- 298.04 % -- [Elapsed: 0.005444 sec] 9: hashset: middle -- 325.63 % -- [Elapsed: 0.005948 sec] 10: hashset: late -- 431.62 % -- [Elapsed: 0.0078839 sec] ---------- Testing many long strings ------------ Sample items: (5000 total) yrpjccgxjbketcpmnvyqvghhlnjblhgimybdygumtijtrwaromwrajlsjhxoselbucqualmhbmwnvnpnm ... Benchmarks: 1: list: early -- 100.00 % -- [Elapsed: 0.0016211 sec] 2: list: start -- 132.73 % -- [Elapsed: 0.0021517 sec] 3: hashset: start -- 231.26 % -- [Elapsed: 0.003749 sec] 4: hashset: end -- 368.74 % -- [Elapsed: 0.0059776 sec] 5: hashset: middle -- 385.50 % -- [Elapsed: 0.0062493 sec] 6: hashset: late -- 406.23 % -- [Elapsed: 0.0065854 sec] 7: hashset: early -- 421.34 % -- [Elapsed: 0.0068304 sec] 8: list: middle -- 18,619.12 % -- [Elapsed: 0.3018345 sec] 9: list: end -- 40,942.82 % -- [Elapsed: 0.663724 sec] 10: list: late -- 41,188.19 % -- [Elapsed: 0.6677017 sec] ---------- Testing few ints ------------ Sample items: (16 total) 7266092 60668895 159021363 216428460 28007724 ... Benchmarks: 1: hashset: early -- 100.00 % -- [Elapsed: 0.0016211 sec] 2: hashset: end -- 100.45 % -- [Elapsed: 0.0016284 sec] 3: list: early -- 101.83 % -- [Elapsed: 0.0016507 sec] 4: hashset: late -- 108.95 % -- [Elapsed: 0.0017662 sec] 5: hashset: middle -- 112.29 % -- [Elapsed: 0.0018204 sec] 6: hashset: start -- 120.33 % -- [Elapsed: 0.0019506 sec] 7: list: late -- 134.45 % -- [Elapsed: 0.0021795 sec] 8: list: start -- 136.43 % -- [Elapsed: 0.0022117 sec] 9: list: end -- 169.77 % -- [Elapsed: 0.0027522 sec] 10: list: middle -- 237.94 % -- [Elapsed: 0.0038573 sec] ---------- Testing many ints ------------ Sample items: (10357 total) 370826556 569127161 101235820 792075135 270823009 ... Benchmarks: 1: list: early -- 100.00 % -- [Elapsed: 0.0015132 sec] 2: hashset: end -- 101.79 % -- [Elapsed: 0.0015403 sec] 3: hashset: early -- 102.08 % -- [Elapsed: 0.0015446 sec] 4: hashset: middle -- 103.21 % -- [Elapsed: 0.0015618 sec] 5: hashset: late -- 104.26 % -- [Elapsed: 0.0015776 sec] 6: list: start -- 126.78 % -- [Elapsed: 0.0019184 sec] 7: hashset: start -- 130.91 % -- [Elapsed: 0.0019809 sec] 8: list: middle -- 16,497.89 % -- [Elapsed: 0.2496461 sec] 9: list: end -- 32,715.52 % -- [Elapsed: 0.4950512 sec] 10: list: late -- 33,698.87 % -- [Elapsed: 0.5099313 sec]

盈亏平衡将取决于计算散列的成本。哈希计算可以是微不足道的，或不… … :-)总是有System.Collections.Specialized.HybridDictionary类，以帮助您不必担心盈亏平衡点。

答案一如既往是“ 这要看情况 ”。我从你正在谈论C＃的标签中假设。

你最好的select是确定

一组数据
使用要求

并写一些testing用例。

这也取决于你如何对列表进行sorting（如果是完全sorting），需要进行什么样的比较，“比较”操作需要多长时间才能列表中的特定对象，甚至是你打算如何使用采集。

一般来说，最好的select不是基于你正在使用的数据的大小，而是你打算如何访问它。你有每一块数据与一个特定的string，或其他数据？基于哈希的集合可能是最好的。您正在存储的数据的顺序是重要的，还是需要同时访问所有数据？定期清单可能会更好。

额外：

当然，我上面的评论假设“性能”是指数据访问。还有什么要考虑的：当你说“表演”时你在找什么？ performance个人价值是否抬头？ pipe理大型（10000,10万以上）价值集合吗？用数据填充数据结构的性能是什么？删除数据？访问个人数据位？ replace值？迭代值？内存使用情况？数据复制速度？例如，如果通过string值访问数据，但主要性能要求是最小的内存使用量，则可能存在相互冲突的devise问题。

您可以使用HybridDictionary自动检测中断点，并接受空值，使其基本上与HashSet相同。

这取决于。如果确切的答案真的很重要，做一些分析和发现。如果你确定你永远不会有一定数量的元素在集合中，去一个列表。如果数字是无限的，使用HashSet。

取决于你的哈希值。如果你的键是整数，你可能不需要太多的项目，在HashSet更快。如果你在一个string上键入它，那么它会变慢，并取决于inputstring。

当然，你可以很容易地掀起一个基准？

你不考虑的一个因素是GetHashcode（）函数的健壮性。有了完美的散列函数，HashSet显然会有更好的search性能。但是随着散列函数的减less，HashSetsearch时间也会减less。

取决于很多因素…列表实现，CPU体系结构，JVM，循环语义，equals方法的复杂性等…当列表变得足够大以便有效地进行基准testing（1000多个元素）时，基于哈希的二进制查找超过了线性search，差异只能从那里扩大。

希望这可以帮助！

HashSet与列表性能

示例输出

ASP.NET MVC：通过DataAnnotation进行自定义validation

任何错误的不签署一个.NET程序集？

什么时候会比DateTimeOffset更喜欢DateTime？

什么是实现线程安全字典的最佳方式？

如何防止“aspxerrorpath”作为查询string传递给ASP.NET自定义错误页面

如何以编程方式点击WPF中的button？

C＃（.NET）devise瑕疵

asynchronous等待Task <T>完成超时

我可以使用什么来获得高质量的C＃/ .NET代码覆盖率？

如果创build目录不存在创build文件？