Python: simple list merging based on intersections

Consider there are some lists of integers as:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The question is to merge lists that have at least one element in common. So the result for only the given part would be as follows:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on large data (where the elements are just numbers)? Is a tree structure something to think about? I do the job now by converting the lists to sets and iterating over intersections, but it is slow! Furthermore, I have a feeling it is so elementary! In addition, the implementation misses something (unknown), because some lists remain unmerged at some point. Having said that, if you propose a self-implementation, please be generous and provide a simple sample code [apparently Python is my favourite :)] or pseudo-code.

Update 1: Here is the code I am using:

#--------------------------------------
lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
#--------------------------------------

The function is (buggy!!):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i + 1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else:
                j += 1  # --- corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

The result is:

#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

Update 2: In my experience, the code given by Niklas Baumstark below turned out to be a bit faster for simple cases. I have not yet tested the method given by "Hooked", since it is a completely different approach (by the way, it looks interesting). The testing procedure for all of these can hardly, if at all, guarantee the results. The real data set I am going to use is very large and complex, so it is impossible to trace any error just by repetition. That is, I need to be 100% satisfied with the reliability of the method before pushing it as a module into a large piece of code. For now, Niklas's method is faster and the answer for simple sets is of course correct.
However, how can I be sure that it works well for a really large data set, since I won't be able to trace the errors visually?

Update 3: Note that reliability of the method is much more important than speed for this problem. I hope to be able to finally translate the Python code to Fortran for maximum performance.

Update 4:
There are many interesting points in this post, with generously given answers and constructive comments. I would recommend reading all of it thoroughly. Please accept my appreciation for the development of the question, the amazing answers and the constructive comments and discussion.

My attempt:

def merge(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)

Benchmark:

import random

# adapt parameters to your own usage scenario
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5

if False:  # change to true to generate the test data file (takes a while)
    with open("/tmp/test.txt", "w") as f:
        lists = []
        classes = [range(class_size*i, class_size*(i+1)) for i in range(class_count)]
        for c in classes:
            # distribute each class across ~300 lists
            for i in xrange(list_count_per_class):
                lst = []
                if random.random() < large_list_probability:
                    size = random.choice(large_list_sizes)
                else:
                    size = random.choice(small_list_sizes)
                nums = set(c)
                for j in xrange(size):
                    x = random.choice(list(nums))
                    lst.append(x)
                    nums.remove(x)
                random.shuffle(lst)
                lists.append(lst)
        random.shuffle(lists)
        for lst in lists:
            f.write(" ".join(str(x) for x in lst) + "\n")

setup = """
# Niklas'
def merge_niklas(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

# Rik's
def merge_rik(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)
        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set
    return results

# katrielalex's
def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

import networkx

def merge_katrielalex(lsts):
    g = networkx.Graph()
    for lst in lsts:
        for edge in pairs(lst):
            g.add_edge(*edge)
    return networkx.connected_components(g)

# agf's (optimized)
from collections import deque

def merge_agf_optimized(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

# agf's (simple)
def merge_agf_simple(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

# alexis'
def merge_alexis(data):
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()
    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin
                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin
                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest
    # Filter out the empty bins
    have = [m for m in data if m]
    return have

def locatebin(bins, n):
    while bins[n] != n:
        n = bins[n]
    return n

lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
    lst = [int(x) for x in line.split()]
    size += len(lst)
    if len(lst) > max:
        max = len(lst)
    num += 1
    lsts.append(lst)
"""

setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)

import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)

Obviously, these timings depend on the specific parameters of the benchmark, like the number of classes, the number of lists, the list sizes, etc. Adapt those parameters to your needs to get more helpful results.

Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strengths and weaknesses, depending on the kind of input they get:

=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================

niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044

===================
# less number of classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================

niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144

===================
# less number of classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================

niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878

I have tried to recap everything that has been said and done about this topic in this question.

I have tried to test every solution (all the code here).

Testing

This is the TestCase from the testing module:

class MergeTestCase(unittest.TestCase):

    def setUp(self):
        with open('./lists/test_list.txt') as f:
            self.lsts = json.loads(f.read())
        self.merged = self.merge_func(deepcopy(self.lsts))

    def test_disjoint(self):
        """Check disjoint-ness of merged results"""
        from itertools import combinations
        for a, b in combinations(self.merged, 2):
            self.assertTrue(a.isdisjoint(b))

    def test_coverage(self):    # Credit to katrielalex
        """Check coverage original data"""
        merged_flat = set()
        for s in self.merged:
            merged_flat |= s

        original_flat = set()
        for lst in self.lsts:
            original_flat |= set(lst)

        self.assertTrue(merged_flat == original_flat)

    def test_subset(self):      # Credit to WolframH
        """Check that every original data is a subset"""
        for lst in self.lsts:
            self.assertTrue(any(set(lst) <= e for e in self.merged))

This test assumes a list of sets as the result, so I couldn't test a couple of solutions that work with lists.

I could not test the following:

katrielalex
steabert

Among the ones that I could test, two failed:

 -- Going to test: agf (optimized) --
Check disjoint-ness of merged results ... FAIL

 -- Going to test: robert king --
Check disjoint-ness of merged results ... FAIL

Timing

Performance is strongly related to the test data employed.

So far, three answers have tried to time their own and the other solutions. Since they used different test data, they got different results.

  1. Niklas' benchmark is very flexible: with it one can run different tests by changing some parameters.

    I have used the same three sets of parameters he used in his own answer, and I put them in three different files:

    filename = './lists/timing_1.txt'
    class_count = 50,
    class_size = 1000,
    list_count_per_class = 100,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,

    filename = './lists/timing_2.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.5,

    filename = './lists/timing_3.txt'
    class_count = 15,
    class_size = 1000,
    list_count_per_class = 300,
    large_list_sizes = (100, 1000),
    small_list_sizes = (0, 100),
    large_list_probability = 0.1,

    These are the results that I got:

    From file: timing_1.txt

    Timing with: >> Niklas << Benchmark
    Info: 5000 lists, average size 305, max size 999

    Timing Results:
    10.434  -- alexis
    11.476  -- agf
    11.555  -- Niklas B.
    13.622  -- Rik. Poggi
    14.016  -- agf (optimized)
    14.057  -- ChessMaster
    20.208  -- katrielalex
    21.697  -- steabert
    25.101  -- robert king
    76.870  -- Sven Marnach
    133.399  -- hochl

    From file: timing_2.txt

    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 305, max size 999

    Timing Results:
    8.247  -- Niklas B.
    8.286  -- agf
    8.637  -- Rik. Poggi
    8.967  -- alexis
    9.090  -- ChessMaster
    9.091  -- agf (optimized)
    18.186  -- katrielalex
    19.543  -- steabert
    22.852  -- robert king
    70.486  -- Sven Marnach
    104.405  -- hochl

    From file: timing_3.txt

    Timing with: >> Niklas << Benchmark
    Info: 4500 lists, average size 98, max size 999

    Timing Results:
    2.746  -- agf
    2.850  -- Niklas B.
    2.887  -- Rik. Poggi
    2.972  -- alexis
    3.077  -- ChessMaster
    3.174  -- agf (optimized)
    5.811  -- katrielalex
    7.208  -- robert king
    9.193  -- steabert
    23.536  -- Sven Marnach
    37.436  -- hochl
  2. With Sven's test data, I got the following results:

    Timing with: >> Sven << Benchmark
    Info: 200 lists, average size 10, max size 10

    Timing Results:
    2.053  -- alexis
    2.199  -- ChessMaster
    2.410  -- agf (optimized)
    3.394  -- agf
    3.398  -- Rik. Poggi
    3.640  -- robert king
    3.719  -- steabert
    3.776  -- Niklas B.
    3.888  -- hochl
    4.610  -- Sven Marnach
    5.018  -- katrielalex
  3. Finally, with agf's benchmark I got:

    Timing with: >> Agf << Benchmark
    Info: 2000 lists, average size 246, max size 500

    Timing Results:
    3.446  -- Rik. Poggi
    3.500  -- ChessMaster
    3.520  -- agf (optimized)
    3.527  -- Niklas B.
    3.527  -- agf
    3.902  -- hochl
    5.080  -- alexis
    15.997  -- steabert
    16.422  -- katrielalex
    18.317  -- robert king
    1257.152  -- Sven Marnach

As I said at the beginning, all the code is available at this git repository. All the merging functions are in a file called core.py; every function there whose name ends with _merge will be auto-loaded during the tests, so it should not be hard to add/test/improve your own solution.

Let me also know if something is wrong; it has been a lot of coding and I could use a couple of fresh eyes :)

Using matrix manipulations

Let me preface this answer with the following comment:

This is the WRONG way to do it. It is prone to numerical instability and is much slower than the other approaches presented; use at your own risk.

That being said, I couldn't resist solving the problem from a dynamical point of view (and I hope you'll get a fresh perspective on the problem). In theory this should work all the time, but eigenvalue calculations can often fail. The idea is to think of your lists as a flow from rows to columns. If two rows share a common value, there is a connecting flow between them. If we were to think of these flows as water, we would see that the flows cluster into little pools when there is a connecting path between them. For simplicity, I'm going to use a smaller set, though it works with your data set as well:

from numpy import where, newaxis
from scipy import linalg, array, zeros

X = [[0, 1, 3], [2], [3, 1]]

We need to convert the data into a flow graph. If row i flows into value j, we put it in the matrix. Here we have 3 rows and 4 unique values:

A = zeros((4, len(X)), dtype=float)
for i, row in enumerate(X):
    for val in row:
        A[val, i] = 1

In general, you'll need to change the 4 to capture the number of unique values you have. If the set is a list of integers starting from 0 as here, you can simply make it the largest number. We now perform an eigenvalue decomposition; an SVD to be exact, since our matrix is not square.

 S = linalg.svd(A) 

We want to keep only the 3x3 portion of this answer, since it will represent the flow of the pools. In fact, we only want the absolute values of this matrix; we only care whether there is a flow in this cluster space.

 M = abs(S[2]) 

We can think of this matrix M as a Markov matrix and make it explicit by row normalizing. Once we have this, we compute the (left) eigenvalue decomposition of this matrix.

M /= M.sum(axis=1)[:, newaxis]
U, V = linalg.eig(M, left=True, right=False)
V = abs(V)

Now, a disconnected (non-ergodic) Markov matrix has the nice property that, for each non-connected cluster, there is an eigenvalue of unity. The eigenvectors associated with these unity values are the ones we want:

idx = where(U > .999)[0]
C = V.T[idx] > 0

I have to use .999 due to the aforementioned numerical instability. At this point, we're done! Each independent cluster can now pull the corresponding rows out:

for cluster in C:
    print where(A[:, cluster].sum(axis=1))[0]

Which gives, as intended:

[0 1 3]
[2]

Change X to your first example and you'll get: [0 1 3 4 5 10 11 16] [2 8]
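For concreteness, this is roughly what that substitution looks like (a sketch of my own that continues the snippet above; the only changes are the data and the first dimension of A, which must cover the unique values, here 0 through 16):

# Question's sample data; the values run from 0 to 16, so A needs 17 rows.
X = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]

A = zeros((17, len(X)), dtype=float)   # 17 = largest value + 1
for i, row in enumerate(X):
    for val in row:
        A[val, i] = 1
# ...then rerun the SVD / Markov steps above unchanged.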


Addendum

Why might this be useful? I don't know where your underlying data comes from, but what happens when the connections are not absolute? Say row 1 has entry 3 80% of the time; how would you generalize the problem? The flow method above would work just fine, and would be completely parametrized by that .999 value: the further it is from unity, the looser the association.


Visual representation

Since a picture is worth 1K words, here are the plots of the matrices A and V for my example and for your lst, respectively. Note how V splits into two clusters (it is a block-diagonal matrix with two blocks after permutation), since for each example there were only two unique lists!

[Images: my example / your sample data]


Faster implementation

In hindsight, I realized that you can skip the SVD step and compute only a single decomposition:

from numpy import dot

M = dot(A.T, A)
M /= M.sum(axis=1)[:, newaxis]
U, V = linalg.eig(M, left=True, right=False)

The advantage of this method (besides speed) is that M is now symmetric, so the computation can be faster and more accurate (no imaginary values to worry about).

EDIT: OK, the other question has been closed, posting here.

Nice question! It is much simpler if you think of it as a connected-components problem in a graph. The following code uses the excellent networkx graph library and the pairs function from this question.

def pairs(lst):
    i = iter(lst)
    first = prev = item = i.next()
    for item in i:
        yield prev, item
        prev = item
    yield item, first

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx
g = networkx.Graph()
for sub_list in lists:
    for edge in pairs(sub_list):
        g.add_edge(*edge)

networkx.connected_components(g)
[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

Explanation

We create a new (empty) graph g. For each sub-list in lists, consider its elements as nodes of the graph and add an edge between them. (Since we only care about connectedness, we don't need to add all the edges, only adjacent ones!) Note that add_edge takes two objects, treats them as nodes (adding them if they aren't there already), and adds an edge between them.

Then we just find the connected components of the graph, a solved problem!, and output them as our intersecting sets.

Here is my answer. I haven't checked it against today's batch of answers.

The intersection-based algorithms are O(N^2) since they check each new set against all the existing ones, so I used an approach that indexes each number and runs close to O(N) (if we accept that dictionary lookups are O(1)). Then I ran the benchmarks and felt like a complete idiot because it ran slower, but on closer inspection it turned out that the test data ends up with only a handful of distinct result sets, so the quadratic algorithms don't have much work to do. Test it with more than 10-15 distinct bins and my algorithm is much faster. Try test data with more than 50 distinct bins and it is enormously faster.

(Edit: There was also a problem with the way the benchmark is run, but I was wrong in my diagnosis. I altered my code to work with the way the repeated tests are run.)

def mergelists5(data):
    """Check each number in our arrays only once, merging when we find
    a number we have seen before.
    """
    bins = range(len(data))  # Initialize each bin[n] == n
    nums = dict()

    data = [set(m) for m in data]  # Convert to sets
    for r, row in enumerate(data):
        for num in row:
            if num not in nums:
                # New number: tag it with a pointer to this row's bin
                nums[num] = r
                continue
            else:
                dest = locatebin(bins, nums[num])
                if dest == r:
                    continue  # already in the same bin

                if dest > r:
                    dest, r = r, dest  # always merge into the smallest bin

                data[dest].update(data[r])
                data[r] = None
                # Update our indices to reflect the move
                bins[r] = dest
                r = dest

    # Filter out the empty bins
    have = [m for m in data if m]
    print len(have), "groups in result"
    return have


def locatebin(bins, n):
    """
    Find the bin where list n has ended up: Follow bin references until
    we find a bin that has not moved.
    """
    while bins[n] != n:
        n = bins[n]
    return n

This new function only does the minimum necessary number of disjointness tests, something the other similar solutions fail to do. It also uses a deque to avoid as many linear-time operations as possible, such as list slicing and deletion from early in a list.

from collections import deque

def merge(lists):
    sets = deque(set(lst) for lst in lists if lst)
    results = []
    disjoint = 0
    current = sets.pop()
    while True:
        merged = False
        newsets = deque()
        for _ in xrange(disjoint, len(sets)):
            this = sets.pop()
            if not current.isdisjoint(this):
                current.update(this)
                merged = True
                disjoint = 0
            else:
                newsets.append(this)
                disjoint += 1
        if sets:
            newsets.extendleft(sets)
        if not merged:
            results.append(current)
            try:
                current = newsets.pop()
            except IndexError:
                break
            disjoint = 0
        sets = newsets
    return results

The less overlap between the sets in a given data set, the better this will do compared to the other functions.

Here is an example. If you have 4 sets, you need to compare:

 1,2
 1,3
 1,4
 2,3
 2,4
 3,4

If 1 overlaps with 3, then 2 needs to be re-tested to see if it now overlaps with 1, in order to safely skip testing 2 against 3.

There are two ways to deal with this. The first is to restart the testing of set 1 against the other sets after each overlap and merge. The second is to continue the testing, comparing 1 with 4, and then go back and re-test. The latter results in fewer disjointness tests, since more merges happen in a single pass, so on the re-test pass there are fewer sets left to test against.

The problem is keeping track of which sets have to be re-tested. In the above example, 1 needs to be re-tested against 2 but not against 4, since 1 was already in its current state before 4 was tested the first time.

The disjoint counter allows this to be tracked.


My answer doesn't help with the main problem of finding an improved algorithm for recoding into FORTRAN; it is just what appears to me to be the simplest and most elegant way to implement the algorithm in Python.

According to my testing (or the testing in the accepted answer), it is slightly (up to 10%) faster than the next fastest solution.

def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

There is no need for the un-pythonic counters (i, range) or the complicated mutation (del, pop, insert) used in the other implementations. It uses only simple iteration, merges overlapping sets in the simplest manner, and builds a single new list on each pass through the data.

My (faster and simpler) version of the test code:

import random

tenk = range(10000)
lsts = [random.sample(tenk, random.randint(0, 500)) for _ in range(2000)]

setup = """
def merge0(lists):
    newsets, sets = [set(lst) for lst in lists if lst], []
    while len(sets) != len(newsets):
        sets, newsets = newsets, []
        for aset in sets:
            for eachset in newsets:
                if not aset.isdisjoint(eachset):
                    eachset.update(aset)
                    break
            else:
                newsets.append(aset)
    return newsets

def merge1(lsts):
    sets = [set(lst) for lst in lsts if lst]
    merged = 1
    while merged:
        merged = 0
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for x in rest:
                if x.isdisjoint(common):
                    sets.append(x)
                else:
                    merged = 1
                    common |= x
            results.append(common)
        sets = results
    return sets

lsts = """ + repr(lsts)

import timeit
print timeit.timeit("merge0(lsts)", setup=setup, number=10)
print timeit.timeit("merge1(lsts)", setup=setup, number=10)

This would be my updated approach:

def merge(data):
    sets = (set(e) for e in data if e)
    results = [next(sets)]
    for e_set in sets:
        to_update = []
        for i, res in enumerate(results):
            if not e_set.isdisjoint(res):
                to_update.insert(0, i)

        if not to_update:
            results.append(e_set)
        else:
            last = results[to_update.pop(-1)]
            for i in to_update:
                last |= results[i]
                del results[i]
            last |= e_set

    return results

Note: During the merging empty lists will be removed.

Update: Reliability.

You need two tests for 100% reliability of success:

  • Check that all the resulting sets are mutually disjoint:

merged = [{0, 1, 3, 4, 5, 10, 11, 16}, {8, 2}, {8}]

from itertools import combinations
for a, b in combinations(merged, 2):
    if not a.isdisjoint(b):
        raise Exception(a, b)  # just an example
  • Check that the merged sets cover the original data (as suggested by katrielalex); a minimal sketch of such a check follows below.

I think this will take some time, but maybe it'll be worth it if you want to be 100% sure.
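A minimal version of that coverage check could look like this (my own sketch, reusing the merge function above on the question's sample lsts; any equivalent flattening of both sides would do):

from itertools import chain

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
merged = merge(lsts)

# every element of the input must appear in the merged output, and vice versa
assert set(chain(*lsts)) == set(chain(*merged))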

Here's a function (Python 3.1) to check if the result of a merge function is OK. It checks:

  • Are the result sets disjoint? (number of elements of union == sum of numbers of elements)
  • Are the elements of the result sets the same as of the input lists?
  • Is every input list a subset of a result set?
  • Is every result set minimal, ie is it impossible to split it into two smaller sets?
  • It does not check if there are empty result sets – I don't know if you want them or not…

from itertools import chain

def check(lsts, result):
    lsts = [set(s) for s in lsts]
    all_items = set(chain(*lsts))
    all_result_items = set(chain(*result))
    num_result_items = sum(len(s) for s in result)
    if num_result_items != len(all_result_items):
        print("Error: result sets overlap!")
        print(num_result_items, len(all_result_items))
        print(sorted(map(len, result)), sorted(map(len, lsts)))
    if all_items != all_result_items:
        print("Error: result doesn't match input lists!")
    if not all(any(set(s).issubset(t) for t in result) for s in lsts):
        print("Error: not all input lists are contained in a result set!")

    seen = set()
    todo = list(filter(bool, lsts))
    done = False
    while not done:
        deletes = []
        for i, s in enumerate(todo):
            # intersection with seen, or with unseen result set, is OK
            if not s.isdisjoint(seen) or any(t.isdisjoint(seen) for t in result if not s.isdisjoint(t)):
                seen.update(s)
                deletes.append(i)
        for i in reversed(deletes):
            del todo[i]
        done = not deletes
    if todo:
        print("Error: A result set should be split into two or more parts!")
        print(todo)
lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

import networkx as nx
g = nx.Graph()

for sub_list in lists:
    for i in range(1, len(sub_list)):
        g.add_edge(sub_list[0], sub_list[i])

print nx.connected_components(g)
#[[1, 2, 3, 5, 6], [8, 9, 10], [11, 12, 13]]

Performance:

5000 lists, 5 classes, average size 74, max size 1000
15.2264976415

Performance of merge1:

print timeit.timeit("merge1(lsts)", setup=setup, number=10)

5000 lists, 5 classes, average size 74, max size 1000
1.26998780571

So it is 11x slower than the fastest solution, but the code is much simpler and more readable!

This is slower than the solution offered by Niklas (I got 3.9s on the test.txt instead of 0.5s for his solution), but yields the same result and might be easier to implement in eg Fortran, since it doesn't use sets, only sorting of the total amount of elements and then a single run through all of them.

It returns a list with the ids of the merged lists, so it also keeps track of empty lists; they stay unmerged.

def merge(lsts):
    # this is an index list that stores the joined id for each list
    joined = range(len(lsts))
    # create an ordered list with indices
    indexed_list = sorted((el, index) for index, lst in enumerate(lsts) for el in lst)
    # loop through the ordered list, and if two elements are the same and
    # the lists are not yet joined, alter the list with the joined id
    el_0, idx_0 = None, None
    for el, idx in indexed_list:
        if el == el_0 and joined[idx] != joined[idx_0]:
            old = joined[idx]
            rep = joined[idx_0]
            joined = [rep if id == old else id for id in joined]
        el_0, idx_0 = el, idx
    return joined
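As a quick illustration (my addition, not part of the original answer), on the question's sample data the returned id list maps lists 0, 1 and 3 to the same group, while list 2 keeps its own id:

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print merge(lsts)   # expected: [0, 0, 2, 0] - lists 0, 1 and 3 joined, list 2 untouched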

Firstly I'm not exactly sure if the benchmarks are fair:

Adding the following code to the start of my function:

from collections import Counter
from itertools import chain

c = Counter(chain(*lists))
print c[1]
# 88

This means that out of all the values in all the lists, there are only 88 distinct values. Usually in the real world duplicates are rare, and you would expect a lot more distinct values. (Of course I don't know where your data comes from, so I can't make assumptions.)

Because duplicates are more common, the sets are less likely to be disjoint. This means the set.isdisjoint() method will be much faster, because it only takes a few tests to find that the sets aren't disjoint.

Having said all that, I do believe the methods presented that use isdisjoint are the fastest anyway; I'm just saying that, instead of being 20x faster, maybe they should only be 10x faster than the other methods under a different benchmark.

Anyway, I thought I would try a slightly different technique to solve this. However, the merge sorting was too slow; this method is about 20x slower than the two fastest methods on the benchmark:

I thought I would order everything

import heapq
from itertools import chain

def merge6(lists):
    for l in lists:
        l.sort()
    one_list = heapq.merge(*[zip(l, [i]*len(l)) for i, l in enumerate(lists)])  # iterating through one_list takes 25 seconds!!
    previous = one_list.next()

    d = {i: i for i in range(len(lists))}
    for current in one_list:
        if current[0] == previous[0]:
            d[current[1]] = d[previous[1]]
        previous = current

    groups = [[] for i in range(len(lists))]
    for k in d:
        groups[d[k]].append(lists[k])  # add each list to its group

    return [set(chain(*g)) for g in groups if g]  # since each subgroup in each g is sorted, it would be faster to merge these subgroups removing duplicates along the way.

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]
print merge6(lists)
"[set([1, 2, 3, 5, 6]), set([8, 9, 10]), set([11, 12, 13])]"

import timeit
print timeit.timeit("merge1(lsts)", setup=setup, number=10)
print timeit.timeit("merge4(lsts)", setup=setup, number=10)
print timeit.timeit("merge6(lsts)", setup=setup, number=10)

5000 lists, 5 classes, average size 74, max size 1000
1.26732238315
5000 lists, 5 classes, average size 74, max size 1000
1.16062907437
5000 lists, 5 classes, average size 74, max size 1000
30.7257182826

Just for fun…

def merge(mylists):
    results, sets = [], [set(lst) for lst in mylists if lst]
    upd, isd, pop = set.update, set.isdisjoint, sets.pop
    while sets:
        if not [upd(sets[0], pop(i)) for i in xrange(len(sets)-1, 0, -1) if not isd(sets[0], sets[i])]:
            results.append(pop(0))
    return results

and my rewrite of the best answer

def merge(lsts):
    sets = map(set, lsts)
    results = []
    while sets:
        first, rest = sets[0], sets[1:]
        merged = False
        sets = []
        for s in rest:
            if s and s.isdisjoint(first):
                sets.append(s)
            else:
                first |= s
                merged = True
        if merged:
            sets.append(first)
        else:
            results.append(first)
    return results

Here's an implementation using a disjoint-set data structure (specifically a disjoint forest), thanks to comingstorm's hint at merging sets which have even one element in common. I'm using path compression for a slight (~5%) speed improvement; it's not entirely necessary (and it prevents find from being tail recursive, which could slow things down). Note that I'm using a dict to represent the disjoint forest; given that the data are ints, an array would also work, although it might not be much faster.

def merge(data):
    parents = {}
    def find(i):
        j = parents.get(i, i)
        if j == i:
            return i
        k = find(j)
        if k != j:
            parents[i] = k
        return k
    for l in filter(None, data):
        parents.update(dict.fromkeys(map(find, l), find(l[0])))
    merged = {}
    for k, v in parents.items():
        merged.setdefault(find(v), []).append(k)
    return merged.values()

This approach is comparable to the other best algorithms on Rik's benchmarks.
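For example, applied to the question's sample data it produces the expected grouping (a usage sketch of my own; the order of the groups, and of the elements within them, is not guaranteed):

lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
print merge(lsts)   # e.g. [[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]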

My solution works well on small lists and is quite readable without dependencies.

def merge_list(starting_list):
    final_list = []
    for i, v in enumerate(starting_list[:-1]):
        if set(v) & set(starting_list[i+1]):
            starting_list[i+1].extend(list(set(v) - set(starting_list[i+1])))
        else:
            final_list.append(v)
    final_list.append(starting_list[-1])
    return final_list

Benchmarking it:

lists = [[1,2,3],[3,5,6],[8,9,10],[11,12,13]]

%timeit merge_list(lists)

100000 loops, best of 3: 4.9 µs per loop

This can be solved in O(n) by using the union-find algorithm. Given the first two rows of your data, edges to use in the union-find are the following pairs: (0,1),(1,3),(1,0),(0,3),(3,4),(4,5),(5,10)
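Since this answer gives no code, here is a minimal union-find sketch of what such a solution could look like (my own illustration, not the answer author's implementation; the names merge_union_find, uf_find and uf_union are invented for the example):

def merge_union_find(lists):
    parent = {}

    def uf_find(x):
        # find the root of x's group, with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def uf_union(a, b):
        # join the groups containing a and b
        parent[uf_find(a)] = uf_find(b)

    # add an edge from each list's first element to every other element
    for lst in lists:
        if lst:
            uf_find(lst[0])          # make sure singleton lists still register a group
            for el in lst[1:]:
                uf_union(lst[0], el)

    # collect the elements by the root of their group
    groups = {}
    for x in parent:
        groups.setdefault(uf_find(x), set()).add(x)
    return list(groups.values())

print merge_union_find([[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]])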