确定一个序列是否在Python中是另一个序列的最好方法

这是“更多”任意types的“string contains substring”问题的泛化。

给定一个序列(如列表或元组),确定是否有其他序列是最好的方法是什么? 作为奖励,它应该返回子序列开始的元素的索引:

用法示例(序列中的序列):

>>> seq_in_seq([5,6], [4,'a',3,5,6]) 3 >>> seq_in_seq([5,7], [4,'a',3,5,6]) -1 # or None, or whatever 

到目前为止,我只是依靠暴力,看起来很慢,很丑,笨拙。

其次是Knuth-Morris-Prattalgorithm。 顺便说一下,你的问题(和KMP解决scheme)在Python Cookbook第二版中完全是配方5.13。 你可以在http://code.activestate.com/recipes/117214/find相关的代码;

它找出给定序列中所有正确的子序列,并应该用作迭代器:

 >>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s 3 >>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s (nothing) 

这是一个powershell方法O(n*m) (类似于@ mcella的答案 )。 对于input序列,它可能比纯Python O(n+m)的Knuth-Morris-Prattalgorithm实现更快(请参阅@Gregg Lind答案 )。

 #!/usr/bin/env python def index(subseq, seq): """Return an index of `subseq`uence in the `seq`uence. Or `-1` if `subseq` is not a subsequence of the `seq`. The time complexity of the algorithm is O(n*m), where n, m = len(seq), len(subseq) >>> index([1,2], range(5)) 1 >>> index(range(1, 6), range(5)) -1 >>> index(range(5), range(5)) 0 >>> index([1,2], [0, 1, 0, 1, 2]) 3 """ i, n, m = -1, len(seq), len(subseq) try: while True: i = seq.index(subseq[0], i + 1, n - m + 1) if subseq == seq[i:i + m]: return i except ValueError: return -1 if __name__ == '__main__': import doctest; doctest.testmod() 

我不知道在这种情况下有多大?

Knuth-Morris-Prattstring匹配

 >>> def seq_in_seq(subseq, seq): ... while subseq[0] in seq: ... index = seq.index(subseq[0]) ... if subseq == seq[index:index + len(subseq)]: ... return index ... else: ... seq = seq[index + 1:] ... else: ... return -1 ... >>> seq_in_seq([5,6], [4,'a',3,5,6]) 3 >>> seq_in_seq([5,7], [4,'a',3,5,6]) -1 

对不起,我不是一个algorithm专家,这只是我脑海里此刻想到的最快速的事情,至less我认为它看起来不错(对我来说),而且我很乐意编码它。 😉

最有可能的是你的蛮力方法正在做同样的事情。

一个简单的方法:转换为string,并依靠string匹配。

使用string列表的示例:

  >>> f = ["foo", "bar", "baz"] >>> g = ["foo", "bar"] >>> ff = str(f).strip("[]") >>> gg = str(g).strip("[]") >>> gg in ff True 

使用string元组的示例:

 >>> x = ("foo", "bar", "baz") >>> y = ("bar", "baz") >>> xx = str(x).strip("()") >>> yy = str(y).strip("()") >>> yy in xx True 

使用数字列表的示例:

 >>> f = [1 , 2, 3, 4, 5, 6, 7] >>> g = [4, 5, 6] >>> ff = str(f).strip("[]") >>> gg = str(g).strip("[]") >>> gg in ff True 

蛮力可能适用于小图案。

对于较大的,请看Aho-Corasickalgorithm 。

这是另一个KMP执行:

 from itertools import tee def seq_in_seq(seq1,seq2): ''' Return the index where seq1 appears in seq2, or -1 if seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm based heavily on code by Neale Pickett <neale@woozle.org> found at: woozle.org/~neale/src/python/kmp.py >>> seq_in_seq(range(3),range(5)) 0 >>> seq_in_seq(range(3)[-1:],range(5)) 2 >>>seq_in_seq(range(6),range(5)) -1 ''' def compute_prefix_function(p): m = len(p) pi = [0] * m k = 0 for q in xrange(1, m): while k > 0 and p[k] != p[q]: k = pi[k - 1] if p[k] == p[q]: k = k + 1 pi[q] = k return pi t,p = list(tee(seq2)[0]), list(tee(seq1)[0]) m,n = len(p),len(t) pi = compute_prefix_function(p) q = 0 for i in range(n): while q > 0 and p[q] != t[i]: q = pi[q - 1] if p[q] == t[i]: q = q + 1 if q == m: return i - m + 1 return -1 

我晚了一点,但这里有一些简单的使用string:

 >>> def seq_in_seq(sub, full): ... f = ''.join([repr(d) for d in full]).replace("'", "") ... s = ''.join([repr(d) for d in sub]).replace("'", "") ... #return f.find(s) #<-- not reliable for finding indices in all cases ... return s in f ... >>> seq_in_seq([5,6], [4,'a',3,5,6]) True >>> seq_in_seq([5,7], [4,'a',3,5,6]) False >>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6]) True 

正如Ilya V. Schurov指出的,在这种情况下, find方法不会返回多string或多位数字的正确索引。

另一种方法,使用集合:

 set([5,6])== set([5,6])&set([4,'a',3,5,6]) True