Is there a generator version of `string.split()` in Python?

`string.split()` returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

It is highly likely that `re.finditer` uses fairly minimal memory overhead.

    import re

    def split_iter(string):
        return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

    >>> list( split_iter("A programmer's RegEx test.") )
    ['A', "programmer's", 'RegEx', 'test']

Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (about 1 GB), then iterated through the iterable with a for loop (not a list comprehension, which would have generated extra memory). This did not produce a noticeable growth of memory (that is, if there was any growth in memory, it was far, far less than the 1 GB string).
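
A rough way to reproduce that kind of check might look like the sketch below; it is illustrative only, uses tracemalloc (Python 3.4+, so not what was available in 3.2.1), and the string size is approximate.

    import re
    import tracemalloc

    def split_iter(string):
        return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

    big = "word " * (200 * 1000 * 1000)   # roughly 1 GB of text
    tracemalloc.start()                   # start tracing after the big string already exists

    count = 0
    for token in split_iter(big):         # a plain for loop, not a list comprehension
        count += 1

    current, peak = tracemalloc.get_traced_memory()
    print(count, peak)                    # peak should stay far below the size of `big`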

The most efficient way I can think of to write one is using the offset parameter of the `str.find()` method. This avoids lots of memory use, and relying on the overhead of a regexp when it's not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

    import re

    def isplit(source, sep=None, regex=False):
        """
        generator version of str.split()

        :param source:
            source string (unicode or bytes)

        :param sep:
            separator to split on.

        :param regex:
            if True, will treat sep as regular expression.

        :returns:
            generator yielding elements of string.
        """
        if sep is None:
            # mimic default python behavior
            source = source.strip()
            sep = "\\s+"
            if isinstance(source, bytes):
                sep = sep.encode("ascii")
            regex = True
        if regex:
            # version using re.finditer()
            if not hasattr(sep, "finditer"):
                sep = re.compile(sep)
            start = 0
            for m in sep.finditer(source):
                idx = m.start()
                assert idx >= start
                yield source[start:idx]
                start = m.end()
            yield source[start:]
        else:
            # version using str.find(), less overhead than re.finditer()
            sepsize = len(sep)
            start = 0
            while True:
                idx = source.find(sep, start)
                if idx == -1:
                    yield source[start:]
                    return
                yield source[start:idx]
                start = idx + sepsize

This can be used like you want…

    >>> print list(isplit("abcb","b"))
    ['a','c','']

While there is a little cost to seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

    import re

    def itersplit(s, sep=None):
        exp = re.compile(r'\s+' if sep is None else re.escape(sep))
        pos = 0
        while True:
            m = exp.search(s, pos)
            if not m:
                if pos < len(s) or sep is not None:
                    yield s[pos:]
                break
            if pos < m.start() or sep is not None:
                yield s[pos:m.start()]
            pos = m.end()

    sample1 = "Good evening, world!"
    sample2 = " Good evening, world! "
    sample3 = "brackets][all][][over][here"
    sample4 = "][brackets][all][][over][here]["

    assert list(itersplit(sample1)) == sample1.split()
    assert list(itersplit(sample2)) == sample2.split()
    assert list(itersplit(sample3, '][')) == sample3.split('][')
    assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: corrected handling of surrounding whitespace when no separator chars are given.

I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.

If you wanted to write one, it would be pretty easy though:

    import string

    def gsplit(s, sep=string.whitespace):
        word = []
        for c in s:
            if c in sep:
                if word:
                    yield "".join(word)
                    word = []
            else:
                word.append(c)
        if word:
            yield "".join(word)
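
For illustration, a quick check of gsplit might look like this (a sketch assuming the default whitespace separators):

    >>> list(gsplit("Good evening, world!"))
    ['Good', 'evening,', 'world!']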

Here is my implementation, which is much faster and more complete than the other answers here. It has 4 separate sub-functions for the different cases.

I'll just copy the docstring of the main str_split function:


    str_split(s, *delims, empty=None)

Split the string `s` by the rest of the arguments, possibly omitting empty parts (the `empty` keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. `empty` is then `True` by default.

    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if `empty` is set to `True`, the empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect is the same as `str.split()`, except this function is a generator.

    str_split('aaa\\t bb c \\n')
        -> 'aaa', 'bb', 'c'

    import string

    def _str_split_chars(s, delims):
        "Split the string `s` by characters contained in `delims`, including the empty parts between two consecutive delimiters"
        start = 0
        for i, c in enumerate(s):
            if c in delims:
                yield s[start:i]
                start = i+1
        yield s[start:]

    def _str_split_chars_ne(s, delims):
        "Split the string `s` by longest possible sequences of characters contained in `delims`"
        start = 0
        in_s = False
        for i, c in enumerate(s):
            if c in delims:
                if in_s:
                    yield s[start:i]
                    in_s = False
            else:
                if not in_s:
                    in_s = True
                    start = i
        if in_s:
            yield s[start:]

    def _str_split_word(s, delim):
        "Split the string `s` by the string `delim`"
        dlen = len(delim)
        start = 0
        try:
            while True:
                i = s.index(delim, start)
                yield s[start:i]
                start = i+dlen
        except ValueError:
            pass
        yield s[start:]

    def _str_split_word_ne(s, delim):
        "Split the string `s` by the string `delim`, not including empty parts between two consecutive delimiters"
        dlen = len(delim)
        start = 0
        try:
            while True:
                i = s.index(delim, start)
                if start!=i:
                    yield s[start:i]
                start = i+dlen
        except ValueError:
            pass
        if start<len(s):
            yield s[start:]

    def str_split(s, *delims, empty=None):
        """
        Split the string `s` by the rest of the arguments, possibly omitting
        empty parts (`empty` keyword argument is responsible for that).
        This is a generator function.

        When only one delimiter is supplied, the string is simply split by it.
        `empty` is then `True` by default.
            str_split('[]aaa[][]bb[c', '[]')
                -> '', 'aaa', '', 'bb[c'
            str_split('[]aaa[][]bb[c', '[]', empty=False)
                -> 'aaa', 'bb[c'

        When multiple delimiters are supplied, the string is split by longest
        possible sequences of those delimiters by default, or, if `empty` is set
        to `True`, empty strings between the delimiters are also included. Note
        that the delimiters in this case may only be single characters.
            str_split('aaa, bb : c;', ' ', ',', ':', ';')
                -> 'aaa', 'bb', 'c'
            str_split('aaa, bb : c;', *' ,:;', empty=True)
                -> 'aaa', '', 'bb', '', '', 'c', ''

        When no delimiters are supplied, `string.whitespace` is used, so the
        effect is the same as `str.split()`, except this function is a generator.
            str_split('aaa\\t bb c \\n')
                -> 'aaa', 'bb', 'c'
        """
        if len(delims)==1:
            f = _str_split_word if empty is None or empty else _str_split_word_ne
            return f(s, delims[0])
        if len(delims)==0:
            delims = string.whitespace
        delims = set(delims) if len(delims)>=4 else ''.join(delims)
        if any(len(d)>1 for d in delims):
            raise ValueError("Only 1-character multiple delimiters are supported")
        f = _str_split_chars if empty else _str_split_chars_ne
        return f(s, delims)

This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both the 2 and 3 versions. The first lines of the function should be changed to:

    def str_split(s, *delims, **kwargs):
        """...docstring..."""
        empty = kwargs.get('empty')
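
Either way, a quick interactive check against the docstring examples might look like this (a sketch, not part of the docstring itself):

    >>> list(str_split('[]aaa[][]bb[c', '[]'))
    ['', 'aaa', '', 'bb[c']
    >>> list(str_split('aaa, bb : c;', *' ,:;'))
    ['aaa', 'bb', 'c']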

No, but it should be easy enough to write one using `itertools.takewhile()`.

EDIT:

Very simple, half-broken implementation:

    import itertools
    import string

    def isplitwords(s):
        i = iter(s)
        while True:
            r = []
            for c in itertools.takewhile(lambda x: not x in string.whitespace, i):
                r.append(c)
            else:
                if r:
                    yield ''.join(r)
                    continue
                else:
                    raise StopIteration()
    def split_generator(f, s):
        """
        f is a string, s is the substring we split on.
        This produces a generator rather than a possibly
        memory intensive list.
        """
        i = 0
        j = 0
        while j < len(f):
            if i >= len(f):
                yield f[j:]
                j = i
            elif f[i] != s:
                i = i + 1
            else:
                yield f[j:i]
                j = i + 1
                i = i + 1

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace-delimited by default, and you can specify a delimiter).

    import re

    def isplit(string, delimiter=None):
        """Like string.split but returns an iterator (lazy).

        Multiple character delimiters are not handled.
        """
        if delimiter is None:
            # Whitespace delimited by default
            delim = r"\s"
        elif len(delimiter) != 1:
            raise ValueError("Can only handle single character delimiters",
                             delimiter)
        else:
            # Escape, in case it's "\", "*" etc.
            delim = re.escape(delimiter)

        return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both Python 3 and Python 2):

    # Wrapper to make it a list
    def helper(*args, **kwargs):
        return list(isplit(*args, **kwargs))

    # Normal delimiters
    assert helper("1,2,3", ",") == ["1", "2", "3"]
    assert helper("1;2;3,", ";") == ["1", "2", "3,"]
    assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

    # Whitespace
    assert helper("1 2 3") == ["1", "2", "3"]
    assert helper("1\t2\t3") == ["1", "2", "3"]
    assert helper("1\t2 \t3") == ["1", "2", "3"]
    assert helper("1\n2\n3") == ["1", "2", "3"]

    # Surrounding whitespace dropped
    assert helper(" 1 2 3 ") == ["1", "2", "3"]

    # Regex special characters
    assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
    assert helper(r"1*2*3", "*") == ["1", "2", "3"]

    # No multi-char delimiters allowed
    try:
        helper(r"1,.2,.3", ",.")
        assert False
    except ValueError:
        pass

Python's regex module says that it does "the right thing" for Unicode whitespace, but I haven't actually tested it.
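
A minimal spot-check might look like the following sketch; it assumes Python 3's default Unicode matching for `\s`:

    >>> list(isplit("foo\u00a0bar"))   # U+00A0 is a non-breaking space, matched by \s in Python 3
    ['foo', 'bar']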

Also available as a gist.

If you would also like to be able to read an iterator (as well as return one), try this:

    import itertools as it

    def iter_split(string, sep=None):
        sep = sep or ' '
        groups = it.groupby(string, lambda s: s != sep)
        return (''.join(g) for k, g in groups if k)

Usage:

    >>> list(iter_split(iter("Good evening, world!")))
    ['Good', 'evening,', 'world!']

I did some performance testing of the various methods proposed (I won't repeat them here). Some results:

• str.split (default) = 0.3461570239996945
• manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
• re.finditer (ninjagecko's answer) = 0.698872097000276
• str.find (one of Eli Collins's answers) = 0.7230395330007013
• itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
• str.split(..., maxsplit=1) recursion = N/A†

†The recursion answers (string.split with maxsplit=1) fail to complete in a reasonable time. Given string.split's speed, they may work better on shorter strings, but then I can't see the use case for short strings where memory isn't an issue anyway.
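
For context, such a recursive approach might look roughly like the sketch below (a hypothetical reconstruction, not the exact code that was benchmarked):

    def isplit_recursive(s, sep=None):
        parts = s.split(sep, 1)          # split off only the first field
        if not parts:                    # whitespace-splitting a blank string yields nothing
            return
        yield parts[0]
        if len(parts) == 2:
            # recurse on the remainder; the repeated copying of the tail (O(n^2))
            # and Python's recursion limit make this impractical for long strings
            yield from isplit_recursive(parts[1], sep)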

Tested using timeit on:

 the_text = "100 " * 9999 + "100" def test_function( method ): def fn( ): total = 0 for x in method( the_text ): total += int( x ) return total return fn 

This raises another question as to why string.split is so much faster despite its memory usage.

You can build one yourself, using str.split with a limit:

    def isplit(s, sep=None):
        while s:
            parts = s.split(sep, 1)
            if len(parts) == 2:
                s = parts[1]
            else:
                s = ''
            yield parts[0]

This way, you don't need to replicate strip()'s functionality and behaviour (e.g. when sep=None), and it relies on the probably-fast native implementation. I assume that string.split will stop scanning the string for separators once it has enough "parts".
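
A quick sanity check of this generator against str.split might look like this sketch:

    >>> list(isplit("  a b  c "))      # surrounding whitespace dropped, like str.split()
    ['a', 'b', 'c']
    >>> list(isplit("a,b,,c", ","))    # explicit separator keeps empty parts, like str.split(',')
    ['a', 'b', '', 'c']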

As Glenn Maynard pointed out, this is bad for large strings (O(n^2)). I have confirmed this via timeit tests.

For me, at least, it was necessary to be able to use files as generators.

This is a version I made in preparation for dealing with huge files whose text blocks are separated by blank lines (it would need thorough testing of corner cases if you intend to use it in a production system):

    from __future__ import print_function

    def isplit(iterable, sep=None):
        r = ''
        for c in iterable:
            r += c
            if sep is None:
                if not c.strip():
                    r = r[:-1]
                    if r:
                        yield r
                        r = ''
            elif r.endswith(sep):
                r = r[:-len(sep)]
                yield r
                r = ''
        if r:
            yield r

    def read_blocks(filename):
        """read a file as a sequence of blocks separated by empty line"""
        with open(filename) as ifh:
            for block in isplit(ifh, '\n\n'):
                yield block.splitlines()

    if __name__ == "__main__":
        for lineno, block in enumerate(read_blocks("logfile.txt"), 1):
            print(lineno, ':')
            print('\n'.join(block))
            print('-'*40)

        print('Testing skip with None.')
        for word in isplit('\tTony \t Jarkko \n Veijalainen\n'):
            print(word)