在Java中有效地计算两个集合的交集?

在Java中find两个非稀疏集合交集的最有效方法是什么? 这是一个操作,我将大量调用大量的集合,所以优化是非常重要的。 我不能修改原来的设置。

我看过Apache Commons CollectionUtils.intersection,看起来很慢。 我目前的做法是取两套中较小的一套,克隆它,然后在两套中较大的一套上调用.retainAll。

public static int getIntersection(Set<Long> set1, Set<Long> set2) { boolean set1IsLarger = set1.size() > set2.size(); Set<Long> cloneSet = new HashSet<Long>(set1IsLarger ? set2 : set1); cloneSet.retainAll(set1IsLarger ? set1 : set2); return cloneSet.size(); } 

用张贴的方法运行一些testing,并与构build一个新的HashSet。 也就是说,让A是更小的集合, B是更大的集合,然后,对于A每个项目,如果它也存在于B中,然后将其添加到C(一个新的HashSet) – 为了计算,中间C组可以跳过。

就像贴出来的方法一样,这应该是O(|A|)的代价,因为迭代是O(|A|)并且探测到B是O(1) 。 我不知道如何比较克隆和删除的方法。

快乐编码 – 并发布一些结果;-)


实际上,在进一步的思考中,我认为这比后面的方法稍微好一些: O(|A|)O(|A| + |B|) 。 我不知道这是否会在现实中有所作为(或改进),我只希望它是相关的,当|A| <<< |B| |A| <<< |B|


好吧,我真的很无聊。 至less在JDK 7(Windows 7 x64)中, 看起来后面介绍的方法比上面的方法要慢 – 通过一个好的(虽然看起来大多是常量)因素。 我的眼球猜测说,这是比上面的build议慢四倍 ,只是使用一个计数器,并创build一个新的HashSet的两倍慢 。 在不同的初始尺寸下,这似乎是“大致一致的”。

请记住,正如Voo指出的那样,上面的数字和这个微观基准假设正在使用HashSet!而且一如既往,微型基准面临着危险。

这里是丑陋的结果(以毫秒为单位):

 运行1x1的testing
 IntersectTest $ PostMethod @ 6cc2060e花了13.9808544计数= 1000000
 IntersectTest $ MyMethod1 @ 7d38847d需要2.9893732计数= 1000000
 IntersectTest $ MyMethod2 @ 9826ac5花了7.775945 count = 1000000
运行1x10的testing
 IntersectTest $ PostMethod @ 67fc9fee花了12.4647712计数= 734000
 IntersectTest $ MyMethod1 @ 7a67f797花了3.1567252计数= 734000
 IntersectTest $ MyMethod2 @ 3fb01949花了6.483941 count = 734000
运行1x100的testing
 IntersectTest $ PostMethod @ 16675039花了11.3069326计数= 706000
 IntersectTest $ MyMethod1 @ 58c3d9ac花了2.3482693 count = 706000
 IntersectTest $ MyMethod2 @ 2207d8bb花了4.8687103 count = 706000
运行1x1000的testing
 IntersectTest $ PostMethod @ 33d626a4花了10.28656 count = 729000
 IntersectTest $ MyMethod1 @ 3082f392花了2.3478658计数= 729000
 IntersectTest $ MyMethod2 @ 65450f1f花了4.109205计数= 729000
运行10x2的testing
 IntersectTest $ PostMethod @ 55c4d594花了10.4137618 count = 736000
 IntersectTest $ MyMethod1 @ 6da21389花了2.374206计数= 736000
 IntersectTest $ MyMethod2 @ 2bb0bf9a花了4.9802039 count = 736000
运行testing10x10
 IntersectTest $ PostMethod @ 7930ebb花了25.811083计数= 4370000
 IntersectTest $ MyMethod1 @ 47ac1adf花了6.9409306 count = 4370000
 IntersectTest $ MyMethod2 @ 74184b3b花了14.2603248 count = 4370000
运行10x100的testing
 IntersectTest $ PostMethod @ 7f423820花了25.0577691 count = 4251000
 IntersectTest $ MyMethod1 @ 5472fe25需要6.1376042计数= 4251000
 IntersectTest $ MyMethod2 @ 498b5a73花了13.9880385 count = 4251000
运行10x1000的testing
 IntersectTest $ PostMethod @ 3033b503花了25.0312716计数= 4138000
 IntersectTest $ MyMethod1 @ 12b0f0ae需要6.0932898 count = 4138000
 IntersectTest $ MyMethod2 @ 1e893918花了13.8332505计数= 4138000
运行100x1的testing
 IntersectTest $ PostMethod @ 6366de01花了9.4531628计数= 700000
 IntersectTest $ MyMethod1 @ 767946a2花了2.4284762计数= 700000
 IntersectTest $ MyMethod2 @ 140c7272需要4.7580235计数= 700000
运行100x10的testing
 IntersectTest $ PostMethod @ 3351e824花了24.9788668 count = 4192000
 IntersectTest $ MyMethod1 @ 465fadce花了6.1462852 count = 4192000
 IntersectTest $ MyMethod2 @ 338bd37a需要13.1742654计数= 4192000
运行100x100的testing
 IntersectTest $ PostMethod @ 297630d5花了193.0121077 count = 41047000
 IntersectTest $ MyMethod1 @ e800537花了45.2652397 count = 41047000
 IntersectTest $ MyMethod2 @ 76d66550花了120.8494766计数= 41047000
运行100x1000的testing
 IntersectTest $ PostMethod @ 33576738花了199.6269531计数= 40966000
 IntersectTest $ MyMethod1 @ 2f39a7dd花了45.5255814 count = 40966000
 IntersectTest $ MyMethod2 @ 723bb663花了122.1704975 count = 40966000
运行1x1的testing
 IntersectTest $ PostMethod @ 35e3bdb5花了9.5598373 count = 1000000
 IntersectTest $ MyMethod1 @ 7abbd1b6花了2.6359174计数= 1000000
 IntersectTest $ MyMethod2 @ 40c542ad花了6.1091794计数= 1000000
运行1x10的testing
 IntersectTest $ PostMethod @ 3c33a0c5花了9.4648528 count = 733000
 IntersectTest $ MyMethod1 @ 61800463花了2.302116计数= 733000
 IntersectTest $ MyMethod2 @ 1ba03197需要5.4803628计数= 733000
运行1x100的testing
 IntersectTest $ PostMethod @ 71b8da5花了9.4971057 count = 719000
 IntersectTest $ MyMethod1 @ 21f04f48花了2.2983538计数= 719000
 IntersectTest $ MyMethod2 @ 27e51160花了5.3926902计数= 719000
运行1x1000的testing
 IntersectTest $ PostMethod @ 2fe7106a花了9.4702331 count = 692000
 IntersectTest $ MyMethod1 @ 6ae6b7b7花了2.3013066 count = 692000
 IntersectTest $ MyMethod2 @ 51278635花了5.4488882 count = 692000
运行10x2的testing
 IntersectTest $ PostMethod @ 223b2d85花了9.5660879 count = 743000
 IntersectTest $ MyMethod1 @ 5b298851花了2.3481445 count = 743000
 IntersectTest $ MyMethod2 @ 3b4ac99用了4.8268489 count = 743000
运行testing10x10
 IntersectTest $ PostMethod @ 51c700a0花了23.0709476 count = 4326000
 IntersectTest $ MyMethod1 @ 5ffa3251花了5.5460785 count = 4326000
 IntersectTest $ MyMethod2 @ 22fd9511需要13.4853948 count = 4326000
运行10x100的testing
 IntersectTest $ PostMethod @ 46b49793花了25.1295491计数= 4256000
 IntersectTest $ MyMethod1 @ 7a4b5828花了5.8520418计数= 4256000
 IntersectTest $ MyMethod2 @ 6888e8d1花了14.0856942 count = 4256000
运行10x1000的testing
 IntersectTest $ PostMethod @ 5339af0d花了25.1752685 count = 4158000
 IntersectTest $ MyMethod1 @ 7013a92a花了5.7978328计数= 4158000
 IntersectTest $ MyMethod2 @ 1ac73de2花了13.8914112计数= 4158000
运行100x1的testing
 IntersectTest $ PostMethod @ 3d1344c8花了9.5123442 count = 717000
 IntersectTest $ MyMethod1 @ 3c08c5cb花了2.34665 count = 717000
 IntersectTest $ MyMethod2 @ 63f1b137花了4.907277计数= 717000
运行100x10的testing
 IntersectTest $ PostMethod @ 71695341花了24.9830339计数= 4180000
 IntersectTest $ MyMethod1 @ 39d90a92花了5.8467864 count = 4180000
 IntersectTest $ MyMethod2 @ 584514e9花了13.2197964计数= 4180000
运行100x100的testing
 IntersectTest $ PostMethod @ 21b8dc91花了195.1796213计数= 41060000
 IntersectTest $ MyMethod1 @ 6f98c4e2花了44.5775162 count = 41060000
 IntersectTest $ MyMethod2 @ 16a60aab需要121.1754402计数= 41060000
运行100x1000的testing
 IntersectTest $ PostMethod @ 20b37d62花了200.973133 count = 40940000
 IntersectTest $ MyMethod1 @ 67ecbdb3花了45.4832226 count = 40940000
 IntersectTest $ MyMethod2 @ 679a6812花了121.791293计数= 40940000
运行1x1的testing
 IntersectTest $ PostMethod @ 237aa07b花了9.2210288计数= 1000000
 IntersectTest $ MyMethod1 @ 47bdfd6f花了2.3394042计数= 1000000
 IntersectTest $ MyMethod2 @ a49a735花了6.1688936 count = 1000000
运行1x10的testing
 IntersectTest $ PostMethod @ 2b25ffba花了9.4103967 count = 736000
 IntersectTest $ MyMethod1 @ 4bb82277花了2.2976994计数= 736000
 IntersectTest $ MyMethod2 @ 25ded977花了5.3310813 count = 736000
运行1x100的testing
 IntersectTest $ PostMethod @ 7154a6d5花了9.3818786 count = 704000
 IntersectTest $ MyMethod1 @ 6c952413需要2.3014931计数= 704000
 IntersectTest $ MyMethod2 @ 33739316需要5.3307998计数= 704000
运行1x1000的testing
 IntersectTest $ PostMethod @ 58334198花了9.3831841 count = 736000
 IntersectTest $ MyMethod1 @ d178f65花了2.3071236计数= 736000
 IntersectTest $ MyMethod2 @ 5c7369a花了5.4062184计数= 736000
运行10x2的testing
 IntersectTest $ PostMethod @ 7c4bdf7c花了9.4040537计数= 735000
 IntersectTest $ MyMethod1 @ 593d85a4花了2.3584088计数= 735000
 IntersectTest $ MyMethod2 @ 5610ffc1花了4.8318229计数= 735000
运行testing10x10
 IntersectTest $ PostMethod @ 46bd9fb8花了23.004925 count = 4331000
 IntersectTest $ MyMethod1 @ 4b410d50需要5.5678172 count = 4331000
 IntersectTest $ MyMethod2 @ 1bd125c9花了14.6517184计数= 4331000
运行10x100的testing
 IntersectTest $ PostMethod @ 75d6ecff花了25.0114913 count = 4223000
 IntersectTest $ MyMethod1 @ 716195c9花了5.798676 count = 4223000
 IntersectTest $ MyMethod2 @ 3db0f946花了13.8064737计数= 4223000
运行10x1000的testing
 IntersectTest $ PostMethod @ 761d8e2a花了25.1910652 count = 4292000
 IntersectTest $ MyMethod1 @ e60a3fb需要5.8621189 count = 4292000
 IntersectTest $ MyMethod2 @ 6aadbb1c花了13.8150282 count = 4292000
运行100x1的testing
 IntersectTest $ PostMethod @ 48a50a6e花了9.4141906 count = 736000
 IntersectTest $ MyMethod1 @ 4b4fe104需要2.3507252 count = 736000
 IntersectTest $ MyMethod2 @ 693df43c需要4.7506854计数= 736000
运行100x10的testing
 IntersectTest $ PostMethod @ 4f7d29df花了24.9574096 count = 4219000
 IntersectTest $ MyMethod1 @ 2248183e花费了5.8628954 count = 4219000
 IntersectTest $ MyMethod2 @ 2b2fa007花了12.9836817 count = 4219000
运行100x100的testing
 IntersectTest $ PostMethod @ 545d7b6a花了193.2436192 count = 40987000
 IntersectTest $ MyMethod1 @ 4551976b花了44.634367计数= 40987000
 IntersectTest $ MyMethod2 @ 6fac155a花了119.2478037 count = 40987000
运行100x1000的testing
 IntersectTest $ PostMethod @ 7b6c238d花了200.4385174 count = 40817000
 IntersectTest $ MyMethod1 @ 78923d48花了45.6225227计数= 40817000
 IntersectTest $ MyMethod2 @ 48f57fcf花了121.0602757 count = 40817000
运行1x1的testing
 IntersectTest $ PostMethod @ 102c79f4需要9.0931408计数= 1000000
 IntersectTest $ MyMethod1 @ 57fa8a77需要2.3309466计数= 1000000
 IntersectTest $ MyMethod2 @ 198b7c1花了5.7627226计数= 1000000
运行1x10的testing
 IntersectTest $ PostMethod @ 8c646d0需要9.3208571 count = 726000
 IntersectTest $ MyMethod1 @ 11530630花了2.3123797 count = 726000
 IntersectTest $ MyMethod2 @ 61bb4232需要5.405318计数= 726000
运行1x100的testing
 IntersectTest $ PostMethod @ 1a137105需要9.387384计数= 710000
 IntersectTest $ MyMethod1 @ 72610ca2花了2.2938749计数= 710000
 IntersectTest $ MyMethod2 @ 41849a58花了5.3865938计数= 710000
运行1x1000的testing
 IntersectTest $ PostMethod @ 100001c8花了9.4289031 count = 696000
 IntersectTest $ MyMethod1 @ 7074f9ac花了2.2977923计数= 696000
 IntersectTest $ MyMethod2 @ fb3c4e2花了5.3724119 count = 696000
运行10x2的testing
 IntersectTest $ PostMethod @ 52c638d6花了9.4074124计数= 775000
 IntersectTest $ MyMethod1 @ 53bd940e花了2.3544881计数= 775000
 IntersectTest $ MyMethod2 @ 43434e15花了4.9228549 count = 775000
运行testing10x10
 IntersectTest $ PostMethod @ 73b7628d花了23.2110252 count = 4374000
 IntersectTest $ MyMethod1 @ ca75255需要5.5877838计数= 4374000
 IntersectTest $ MyMethod2 @ 3d0e50f0需要13.5902641 count = 4374000
运行10x100的testing
 IntersectTest $ PostMethod @ 6d6bbba9花了25.1999918计数= 4227000
 IntersectTest $ MyMethod1 @ 3bed8c5e花了5.7879144 count = 4227000
 IntersectTest $ MyMethod2 @ 689a8e0e花了13.9617882 count = 4227000
运行10x1000的testing
 IntersectTest $ PostMethod @ 3da3b0a2花了25.1627329计数= 4222000
 IntersectTest $ MyMethod1 @ 45a17b4b花了5.8319523计数= 4222000
 IntersectTest $ MyMethod2 @ 6ca59ca3花了13.8885479 count = 4222000
运行100x1的testing
 IntersectTest $ PostMethod @ 360202a5花了9.5115367 count = 705000
 IntersectTest $ MyMethod1 @ 3dfbba56花了2.3470254计数= 705000
 IntersectTest $ MyMethod2 @ 598683e4花了4.8955489 count = 705000
运行100x10的testing
 IntersectTest $ PostMethod @ 21426d0d花了25.8234298计数= 4231000
 IntersectTest $ MyMethod1 @ 1005818a花费了5.8832067 count = 4231000
 IntersectTest $ MyMethod2 @ 597b933d花了13.3676148计数= 4231000
运行100x100的testing
 IntersectTest $ PostMethod @ 6d59b81a花了193.676662计数= 41015000
 IntersectTest $ MyMethod1 @ 1d45eb0c花了44.6519088 count = 41015000
 IntersectTest $ MyMethod2 @ 594a6fd7花了119.1646115 count = 41015000
运行100x1000的testing
 IntersectTest $ PostMethod @ 6d57d9ac花了200.1651432 count = 40803000
 IntersectTest $ MyMethod1 @ 2293e349花了45.5311168计数= 40803000
 IntersectTest $ MyMethod2 @ 1b2edf5b花了120.1697135 count = 40803000 

这里是丑陋的(可能有缺陷的)微观基准:

 import java.util.*; public class IntersectTest { static Random rng = new Random(); static abstract class RunIt { public long count; public long nsTime; abstract int Run (Set<Long> s1, Set<Long> s2); } // As presented in the post static class PostMethod extends RunIt { public int Run(Set<Long> set1, Set<Long> set2) { boolean set1IsLarger = set1.size() > set2.size(); Set<Long> cloneSet = new HashSet<Long>(set1IsLarger ? set2 : set1); cloneSet.retainAll(set1IsLarger ? set1 : set2); return cloneSet.size(); } } // No intermediate HashSet static class MyMethod1 extends RunIt { public int Run (Set<Long> set1, Set<Long> set2) { Set<Long> a; Set<Long> b; if (set1.size() <= set2.size()) { a = set1; b = set2; } else { a = set2; b = set1; } int count = 0; for (Long e : a) { if (b.contains(e)) { count++; } } return count; } } // With intermediate HashSet static class MyMethod2 extends RunIt { public int Run (Set<Long> set1, Set<Long> set2) { Set<Long> a; Set<Long> b; Set<Long> res = new HashSet<Long>(); if (set1.size() <= set2.size()) { a = set1; b = set2; } else { a = set2; b = set1; } for (Long e : a) { if (b.contains(e)) { res.add(e); } } return res.size(); } } static Set<Long> makeSet (int count, float load) { Set<Long> s = new HashSet<Long>(); for (int i = 0; i < count; i++) { s.add((long)rng.nextInt(Math.max(1, (int)(count * load)))); } return s; } // really crummy ubench stuff public static void main(String[] args) { int[][] bounds = { {1, 1}, {1, 10}, {1, 100}, {1, 1000}, {10, 2}, {10, 10}, {10, 100}, {10, 1000}, {100, 1}, {100, 10}, {100, 100}, {100, 1000}, }; int totalReps = 4; int cycleReps = 1000; int subReps = 1000; float load = 0.8f; for (int tc = 0; tc < totalReps; tc++) { for (int[] bound : bounds) { int set1size = bound[0]; int set2size = bound[1]; System.out.println("Running tests for " + set1size + "x" + set2size); ArrayList<RunIt> allRuns = new ArrayList<RunIt>( Arrays.asList( new PostMethod(), new MyMethod1(), new MyMethod2())); for (int r = 0; r < cycleReps; r++) { ArrayList<RunIt> runs = new ArrayList<RunIt>(allRuns); Set<Long> set1 = makeSet(set1size, load); Set<Long> set2 = makeSet(set2size, load); while (runs.size() > 0) { int runIdx = rng.nextInt(runs.size()); RunIt run = runs.remove(runIdx); long start = System.nanoTime(); int count = 0; for (int s = 0; s < subReps; s++) { count += run.Run(set1, set2); } long time = System.nanoTime() - start; run.nsTime += time; run.count += count; } } for (RunIt run : allRuns) { double sec = run.nsTime / (10e6); System.out.println(run + " took " + sec + " count=" + run.count); } } } } } 

只需使用Google Guava的Sets#intersection(Set, Set)方法。

您可以通过使用Set方法retainAll()来避免所有手动工作。

从文档:

s1.retainAll(s2) – 将s1转换为s1和s2的交集。 (两个集合的交集是仅包含两个集合共有的元素的集合。)

这些集合中的成员是否可以轻松映射到一个相对较小的整数范围? 如果是这样,请考虑使用BitSets。 然后,交叉路口是一次一个位和32个潜在的成员。

如果两个集合都可以被sorting,像TreeSet运行这两个迭代器可能是更快的方法来计算共享对象的数量。

如果你经常做这个操作,可能会带来很多,如果你可以包装集,以便您可以caching交叉操作的结果保持dirty标志跟踪caching结果的有效性,如果需要再次计算。

使用Java 8stream:

 set1.stream().filter(s -> set2.contains(s)).collect(Collectors.toList()); 

这是一个好方法。 您应该从当前的解决scheme中获得O(n)性能。

如果只是为了统计集合中有多less个元素而计算交集,我build议您只需要直接计算交集,而不是构build集合,然后调用size()

我的计数function:

 /** * Computes the size of intersection of two sets * @param small first set. preferably smaller than the second argument * @param large second set; * @param <T> the type * @return size of intersection of sets */ public <T> int countIntersection(Set<T> small, Set<T> large){ //assuming first argument to be smaller than the later; //however double checking to be sure if (small.size() > large.size()) { //swap the references; Set<T> tmp = small; small = large; large = tmp; } int result = 0; for (T item : small) { if (large.contains(item)){ //item found in both the sets result++; } } return result; } 

仅供参考,如果集合的任何集合都使用相同的比较关系进行sorting,则可以在时间N * M上迭代它们的交集,其中N是最小集合的大小,M是集合的数量。

作为练习留给读者 。 这是一个例子 。