Split a string by spaces in Python, preserving quoted substrings

I have a string like this:

this is "a test"

I'm trying to write something in Python that splits it on spaces while ignoring spaces inside quotes. The result I'm looking for is:

 ['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" In my application, that will never happen.

You want split, from the shlex module.

    >>> import shlex
    >>> shlex.split('this is "a test"')
    ['this', 'is', 'a test']

This should do exactly what you want.

Have a look at the shlex module, particularly shlex.split.

    >>> import shlex
    >>> shlex.split('This is "a test"')
    ['This', 'is', 'a test']
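Two details of shlex.split are worth knowing before relying on it. The sketch below reuses the example string from the question; the unterminated-quote input is my own illustration.

```python
import shlex

line = 'this is "a test"'

# Default (POSIX) mode: quotes group words and are then removed.
posix_tokens = shlex.split(line)
print(posix_tokens)

# Non-POSIX mode: quotes still group words, but stay in the token.
raw_tokens = shlex.split(line, posix=False)
print(raw_tokens)

# An unterminated quote raises ValueError instead of being guessed at.
try:
    shlex.split('this is "a test')
    unbalanced_ok = True
except ValueError as exc:
    unbalanced_ok = False
    print("could not split:", exc)
```

So if the input can contain stray quotes, wrap the call in a try/except or fall back to a plain split().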

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or a thing surrounded by quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use a regex, why not just say exactly what you mean?

    import re

    test = 'this is "a test"'  # or "this is 'a test'"
    # pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
    # From comments, use this:
    pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

    [\\\"'] = double-quote or single-quote
    .* = anything
    ( |X) = space or X
    .strip() = remove space and empty-string separators

That said, shlex probably provides more features.
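To see why the non-greedy `.*?` form is the one to use, here is a small comparison on a string with two quoted parts. The sample string is my own; note also that this split keeps the quote characters in the pieces.

```python
import re

text = 'say "one two" then "three four"'

# Greedy: .* runs from the first quote to the very last quote.
greedy = re.findall(r'".*"', text)

# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.findall(r'".*?"', text)

# The split from the answer above, written with a raw string.
pieces = [p for p in re.split(r"""( |".*?"|'.*?')""", text) if p.strip()]

print(greedy)
print(lazy)
print(pieces)
```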

Depending on your use case, you may also want to check out the csv module:

    import csv

    lines = ['this is "a string"', 'and more "stuff"']
    for row in csv.reader(lines, delimiter=" "):
        print(row)

Output:

    ['this', 'is', 'a string']
    ['and', 'more', 'stuff']
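One caveat with the csv approach: every single space is a delimiter, so runs of spaces produce empty fields. A minimal sketch of one way around that, assuming you never need a genuinely empty field (the helper name `split_fields` is made up for this example):

```python
import csv

def split_fields(line):
    """Split one line on spaces with csv, dropping empty fields."""
    row = next(csv.reader([line], delimiter=" "))
    # Consecutive spaces yield '' entries; filter them out.
    return [field for field in row if field]

print(split_fields('this is "a string"'))
print(split_fields('and  more   "stuff here"'))
```

The filter also throws away an explicitly quoted empty string (""), so skip it if those matter to you.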

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split on spaces, then turn each \x00 back into a space in every part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

    import re

    s = 'this is "a test" some text "another test"'

    def splitter(s):
        def replacer(m):
            return m.group(0).replace(" ", "\x00")
        parts = re.sub('".+?"', replacer, s).split()
        parts = [p.replace("\x00", " ") for p in parts]
        return parts

    def splitter2(s):
        return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

    print(splitter2(s))

I used shlex.split to process 70,000,000 lines of squid log, and it was too slow. So I switched to re.

Try this if you run into performance problems with shlex.

    import re

    def line_split(line):
        return re.findall(r'[^"\s]\S*|".+?"', line)
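Note that line_split keeps the surrounding quotes in its tokens. If you want them removed, as in the question, a variant along these lines might do. This is my own sketch, not the poster's code, and it assumes quotes only open at the start of a token and are never escaped:

```python
import re

# Group 1: contents of a double-quoted run; group 2: a bare word.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def split_keep_quoted(line):
    """Return tokens with the surrounding double quotes removed."""
    tokens = []
    for match in TOKEN.finditer(line):
        quoted, bare = match.groups()
        tokens.append(quoted if quoted is not None else bare)
    return tokens

print(split_keep_quoted('this is "a test"'))
```

Unlike shlex, this does not join adjacent pieces such as foo"bar baz" into one token, so check it against your real log lines first.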

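To get around the unicode issues in some Python 2 versions, I suggest:

    from shlex import split as _split

    # Python 2 only: encode to UTF-8 bytes for shlex, then decode each token.
    split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]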

To preserve the quotes, use this function:

    def getArgs(s):
        args = []
        cur = ''
        inQuotes = 0
        for char in s.strip():
            if char == ' ' and not inQuotes:
                args.append(cur)
                cur = ''
            elif char == '"' and not inQuotes:
                inQuotes = 1
                cur += char
            elif char == '"' and inQuotes:
                inQuotes = 0
                cur += char
            else:
                cur += char
        args.append(cur)
        return args

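Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on Kate's approach, but it correctly splits strings whose substrings contain escaped quotes, and it also removes the opening and closing quotes of the substrings: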

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

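This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If you don't want the escapes left in the strings of the returned list, you can use this slightly altered version (Python 2 only, since it relies on the string_escape codec):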

 [i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

The unicode issues with shlex discussed above (top answer) seem to be resolved (indirectly) in 2.7.2+, as per http://bugs.python.org/issue6988#msg146200

(Separate answer because I can't comment.)
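For what it's worth, on Python 3 the encode/decode workaround is not needed, because str is already Unicode and shlex.split takes it directly. A minimal check, with a made-up sample string written using escapes so it survives any file encoding:

```python
import shlex

# Python 3 only: no encode/decode round trip is required.
# The sample text is an arbitrary illustration.
text = 'caf\u00e9 "cr\u00e8me br\u00fbl\u00e9e" na\u00efve'
tokens = shlex.split(text)
print(tokens)
```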

I suggest:

Test string:

 s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

To capture "" and '' as well:

    import re
    re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+', s)

Result:

 ['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

To ignore empty "" and '':

    import re
    re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+', s)

Result:

 ['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

If you don't care about quoted substrings, a simple split() will do:

 >>> 'a short sized string with spaces '.split()

Performance:

    >>> import timeit
    >>> s = " ('a short sized string with spaces '*100).split() "
    >>> t = timeit.Timer(stmt=s)
    >>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
    171.39 usec/pass

Or the string module (Python 2 only; string.split was removed in Python 3):

    >>> from string import split as stringsplit
    >>> stringsplit('a short sized string with spaces '*100)

Performance: the string module seems to perform slightly better than the string method

    >>> s = "stringsplit('a short sized string with spaces '*100)"
    >>> t = timeit.Timer(s, "from string import split as stringsplit")
    >>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
    154.88 usec/pass

Or you can use the RE engine

    >>> from re import split as resplit
    >>> regex = r'\s+'
    >>> medstring = 'a short sized string with spaces '*100
    >>> resplit(regex, medstring)

Performance:

    >>> s = "resplit(regex, medstring)"
    >>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
    >>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
    540.21 usec/pass
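The timings above come from Python 2 and will differ on your machine. A rough sketch of how the same comparison could be rerun on Python 3 follows; including shlex.split and using 200 passes are my own choices, not part of the original measurements:

```python
import re
import shlex
import timeit

text = 'a short sized string with spaces ' * 100
whitespace = re.compile(r'\s+')

# Each candidate is a zero-argument callable that timeit can invoke.
candidates = {
    'str.split': lambda: text.split(),
    're.split': lambda: whitespace.split(text),
    'shlex.split': lambda: shlex.split(text),
}

number = 200  # passes per candidate; raise it for steadier figures
timings = {}
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=number)
    timings[name] = 1e6 * seconds / number
    print("%-12s %10.2f usec/pass" % (name, timings[name]))
```

Only str.split and re.split are like for like here; shlex.split does extra work for quote handling, so read its number as the cost of that feature.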

For very long strings, you should not load the entire string into memory; instead, split it line by line or use an iterative loop.
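One way to do that is a generator that tokenizes one line at a time. This is only a sketch: it assumes a quoted substring never spans a line break, and the name `iter_tokens` and the sample input are made up.

```python
import io
import shlex

def iter_tokens(stream):
    """Yield tokens line by line instead of reading the whole input."""
    for line in stream:
        for token in shlex.split(line):
            yield token

# io.StringIO stands in for an open file here.
sample = io.StringIO('this is "a test"\nand "another one" here\n')
all_tokens = list(iter_tokens(sample))
print(all_tokens)
```

With a real file you would pass the open file object as stream and consume the tokens in a loop rather than building a list.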

Try this:

    def adamsplit(s):
        result = []
        inquotes = False
        # Splitting on '"' alternates between unquoted and quoted chunks.
        for substring in s.split('"'):
            if not inquotes:
                result.extend(substring.split())
            else:
                result.append(substring)
            inquotes = not inquotes
        return result

Some test strings:

    'This is "a test"' -> ['This', 'is', 'a test']
    '"This is \'a test\'"' -> ["This is 'a test'"]
