Split a string with multiple delimiters?

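I think what I want to do is a fairly common task, but I haven't found any reference for it on the web. I have text with punctuation, and I want a list of the words.

"Hey, you - what are you doing here!?"

should become

    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

But Python's str.split() only works with one argument, so after splitting on whitespace all my words still have the punctuation attached. Any ideas?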

This is a case where regular expressions are justified:

    import re

    DATA = "Hey, you - what are you doing here!?"
    print(re.findall(r"[\w']+", DATA))
    # Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
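The question's expected output is lowercase, while the snippet above keeps the original case. A minimal Python 3 sketch, assuming it is acceptable to lowercase the text before matching (the variable names are illustrative):

```python
import re

data = "Hey, you - what are you doing here!?"

# Lowercase first, then collect runs of word characters and apostrophes.
words = re.findall(r"[\w']+", data.lower())
print(words)  # ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```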

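re.split()

re.split(pattern, string[, maxsplit=0])

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list. (Incompatibility note: in the original Python 1.5 release, maxsplit was ignored. This has been fixed in later releases.)

    >>> re.split(r'\W+', 'Words, words, words.')
    ['Words', 'words', 'words', '']
    >>> re.split(r'(\W+)', 'Words, words, words.')
    ['Words', ', ', 'words', ', ', 'words', '.', '']
    >>> re.split(r'\W+', 'Words, words, words.', 1)
    ['Words', 'words, words.']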
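As the first example shows, re.split() returns an empty string when the input ends (or starts) with a separator. Here is a small Python 3 sketch of one way to drop those entries; the list comprehension is an illustrative choice, not part of the quoted documentation:

```python
import re

text = "Words, words, words."

parts = re.split(r"\W+", text)
# parts ends with '' because the string ends with a separator ('.').
non_empty = [p for p in parts if p]
print(non_empty)  # ['Words', 'words', 'words']
```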

Another quick way to do this without a regular expression is to replace the characters first, as below:

    >>> 'a;bcd,ef g'.replace(';',' ').replace(',',' ').split()
    ['a', 'bcd', 'ef', 'g']

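So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for (splitting on multiple possible separators; instead, many answers remove anything that is not a word, which is different). So here is an answer to the question in the title, which relies on Python's standard and efficient re module:

    >>> import re  # Will be splitting on: , <space> - ! ? :
    >>> # list(...) is needed on Python 3, where filter() returns an iterator
    >>> list(filter(None, re.split(r"[, \-!?:]+", "Hey, you - what are you doing here!?")))
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

where:

the \- in the regular expression is there to prevent the special interpretation of - as a character range indicator, and
filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).

This re.split() does precisely "split with multiple separators", as asked for in the question title.

This solution also does not suffer from problems with non-ASCII characters in words (see the first comment on ghostdog74's answer).

The re module is much more efficient than doing Python loops and tests "by hand".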
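If the delimiters are only known at run time, the hand-written character class can be replaced by a pattern built from a list. The sketch below is one way to do that in Python 3. The helper name split_on and its parameters are hypothetical, and re.escape is used so that characters such as - or ? need no manual escaping.

```python
import re

def split_on(text, delimiters):
    """Split text on any of the given delimiter strings, dropping empty pieces."""
    # Longest delimiters first so that a longer delimiter wins over its own prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "(?:" + "|".join(re.escape(d) for d in ordered) + ")+"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on("Hey, you - what are you doing here!?", [",", " ", "-", "!", "?", ":"]))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

Because the pattern is an alternation rather than a character class, multi-character delimiters such as "--" work as well.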

Another way, without regular expressions:

    import string

    punc = string.punctuation
    thestring = "Hey, you - what are you doing here!?"
    s = list(thestring)
    # Drop every punctuation character, then split on whitespace
    ''.join([o for o in s if o not in punc]).split()
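One thing to keep in mind with this approach: it deletes punctuation rather than treating it as a separator, so words joined only by punctuation are merged. A short Python 3 sketch of the difference, using a made-up example string:

```python
import string

text = "well-known,fact"

# Deleting punctuation merges the pieces into one word.
deleted = ''.join(ch for ch in text if ch not in string.punctuation).split()

# Replacing punctuation with a space keeps the pieces separate.
replaced = ''.join(' ' if ch in string.punctuation else ch for ch in text).split()

print(deleted)   # ['wellknownfact']
print(replaced)  # ['well', 'known', 'fact']
```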

Pro tip: use string.translate for the fastest string operations Python has.

Some proof...

First, the slow way (sorry pprzemek):

    >>> import timeit
    >>> S = 'Hey, you - what are you doing here!?'
    >>> def my_split(s, seps):
    ...     res = [s]
    ...     for sep in seps:
    ...         s, res = res, []
    ...         for seq in s:
    ...             res += seq.split(sep)
    ...     return res
    ...
    >>> timeit.Timer('my_split(S, punctuation)', 'from __main__ import S,my_split; from string import punctuation').timeit()
    54.65477919578552

Next, we use re.findall() (as given by the suggested answer). Much faster:

    >>> timeit.Timer('findall(r"\w+", S)', 'from __main__ import S; from re import findall').timeit()
    4.194725036621094

Finally, we use translate:

    >>> # Python 2 only: translate and maketrans are functions in the string module
    >>> from string import translate,maketrans,punctuation
    >>> T = maketrans(punctuation, ' '*len(punctuation))
    >>> timeit.Timer('translate(S, T).split()', 'from __main__ import S,T,translate').timeit()
    1.2835021018981934

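Explanation:

string.translate is implemented in C and does the whole substitution in a single pass, without building intermediate strings along the way (it still returns a new string, since Python strings are immutable). So it's about as fast as you can get for string substitution.

It's a bit awkward, though, because it needs a translation table in order to do this. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces: a one-for-one substitution. Again, no intermediate data is produced. So this is fast!

Next, we use good old split(). By default, split() operates on all whitespace characters, grouping them together for the split. The result is the list of words that you want. In the timings above, this approach is roughly 3 times faster than re.findall() (about 1.28 s versus 4.19 s).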
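The timing code above targets Python 2, where translate and maketrans are functions in the string module. In Python 3 those functions are gone, and the equivalents are str.maketrans() and str.translate(). A minimal sketch of the same idea, with no timing claims attached:

```python
import string

S = "Hey, you - what are you doing here!?"

# Map every punctuation character to a space (one-for-one substitution).
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Translate, then let split() collapse the runs of whitespace.
words = S.translate(table).split()
print(words)  # ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```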

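A bit of a late answer :), but I had a similar dilemma and didn't want to use the 're' module.

    def my_split(s, seps):
        res = [s]
        for sep in seps:
            # Split every piece produced so far on the next separator
            s, res = res, []
            for seq in s:
                res += seq.split(sep)
        return res

    # Note the two spaces after '1111': adjacent separators produce an empty string
    print(my_split('1111  2222 3333;4444,5555;6666', [' ', ';', ',']))
    ['1111', '', '2222', '3333', '4444', '5555', '6666']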

    join = lambda x: sum(x, [])  # aka flatten1([[1],[2,3],[4]]) -> [1,2,3,4]
    # ...alternatively...
    join = lambda lists: [x for l in lists for x in l]

Then this becomes a three-liner:

    fragments = [text]
    for token in tokens:
        fragments = join(f.split(token) for f in fragments)

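Explanation

This is what is known in Haskell as the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example, say you map Python's range(n) -> [0,1,...,n-1] over a list inside the List monad. Each result is itself a list, and it is appended into the overall list, so you would get something like map(range, [3,4,1]) -> [0,1,2,0,1,2,3,0]. This is known as map-append (concatMap in Haskell). The idea here is that you have an operation you are applying (splitting on a token), and whenever you apply it, you join the results into the list.

You can abstract this into a function and have tokens=string.punctuation by default.

Advantages of this approach:

This approach (unlike the simpler regex-based approaches) can work with arbitrary-length tokens (which regular expressions can also do, with more advanced syntax).
You are not restricted to mere tokens; you could have arbitrary logic in place of each token. For example, one of the "tokens" could be a function that splits according to how nested the parentheses are.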
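The abstraction mentioned above could look like the following Python 3 sketch. The function name multi_split is hypothetical, and itertools.chain.from_iterable stands in for the join helper. Two things here are assumptions that go beyond the answer: the default token set also includes a space, and empty fragments are dropped at the end, which the three-liner does not do.

```python
import string
from itertools import chain

def multi_split(text, tokens=string.punctuation + " "):
    """Split text on every token in turn, flattening after each pass."""
    fragments = [text]
    for token in tokens:
        # Split every fragment on this token and flatten the list of lists.
        fragments = list(chain.from_iterable(f.split(token) for f in fragments))
    # Adjacent separators leave empty strings behind; drop them.
    return [f for f in fragments if f]

print(multi_split("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
print(multi_split("a<>b<>c", tokens=["<>"]))
# ['a', 'b', 'c']
```

Passing a list instead of a string as tokens is what allows multi-character tokens such as "<>".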

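First, I want to agree with others that the regex-based or str.translate(...)-based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add the ideas I considered with that criterion in mind.

My main goal was to generalize ideas from some of the other answers into one solution that works for strings containing more than just regex words (i.e., blacklisting an explicit subset of punctuation characters versus whitelisting word characters).

Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.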

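Option 1 – re.sub

I was surprised to see that no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.

    import re

    my_str = "Hey, you - what are you doing here!?"

    words = re.split(r'\s+', re.sub(r'[,\-!?]', ' ', my_str).strip())

In this solution I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regular expressions outside could be beneficial; for my use case the difference wasn't significant, so I prefer simplicity and readability.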
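If the same split is applied to many strings, the two patterns can be compiled once and reused. A Python 3 sketch, assuming the same four punctuation characters as above; the names PUNCT, SPACES and to_words are illustrative:

```python
import re

# Compile both patterns once, at import time.
PUNCT = re.compile(r"[,\-!?]")
SPACES = re.compile(r"\s+")

def to_words(text):
    # Replace the listed punctuation with spaces, trim, then split on whitespace runs.
    return SPACES.split(PUNCT.sub(" ", text).strip())

print(to_words("Hey, you - what are you doing here!?"))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```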

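Option 2 – str.replace

This takes a few more lines, but it has the benefit of being extensible without having to check whether a certain character needs escaping in a regular expression.

    my_str = "Hey, you - what are you doing here!?"

    replacements = (',', '-', '!', '?')
    for r in replacements:
        my_str = my_str.replace(r, ' ')

    words = my_str.split()

It would have been nice to be able to map str.replace over the string instead, but I don't think that can be done with immutable strings. Mapping against a list of characters would work, but running every replacement against every character sounds excessive. (Edit: see the next option for a functional example.)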

Option 3 – functools.reduce

(In Python 2, reduce is available in the global namespace without importing it from functools.)

    import functools

    my_str = "Hey, you - what are you doing here!?"

    replacements = (',', '-', '!', '?')
    my_str = functools.reduce(lambda s, sep: s.replace(sep, ' '), replacements, my_str)
    words = my_str.split()

Try this:

    import re

    phrase = "Hey, you - what are you doing here!?"
    matches = re.findall(r'\w+', phrase)
    print(matches)

This will print ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

Use replace twice:

    a = '11223FROM33344INTO33222FROM3344'
    a.replace('FROM', ',,,').replace('INTO', ',,,').split(',,,')

which results in:

 ['11223', '33344', '33222', '3344']
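The same result can be obtained without a placeholder string such as ',,,' (which would break if the data itself contained it) by splitting on a regex alternation. A Python 3 sketch:

```python
import re

a = '11223FROM33344INTO33222FROM3344'

# 'FROM|INTO' matches either multi-character delimiter.
parts = re.split(r'FROM|INTO', a)
print(parts)  # ['11223', '33344', '33222', '3344']
```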

I'm re-acquainting myself with Python and needed the same thing. The findall solution may be better, but I came up with this:

 tokens = [x.strip() for x in data.split(',')]

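I like re, but here is my solution without it:

    from itertools import groupby

    sep = ' ,-!?'
    s = "Hey, you - what are you doing here!?"
    print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])

sep.__contains__ is the method used by the 'in' operator. Basically it is the same as

    lambda ch: ch in sep

but it is more convenient here.

groupby gets our string and function. It splits the string into groups using that function: whenever the value of the function changes, a new group is generated. So sep.__contains__ is exactly what we need.

groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is a group. Using 'if not k' we filter out the groups made of separators (because the result of sep.__contains__ is True on separators). That's all: now we have a sequence of groups where each one is a word (a group is actually an iterable, so we use join to convert it to a string).

This solution is quite general, because it uses a function to separate the string (you can split by any condition you need). Also, it doesn't create intermediate strings or lists (you can remove the join and the expression becomes lazy, since each group is an iterator).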
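Because the key is just a function, any predicate works. The Python 3 sketch below groups on str.isalnum instead of a fixed separator string and wraps the expression in a generator so that words are produced lazily. The function name iter_words and its parameter are hypothetical.

```python
from itertools import groupby

def iter_words(text, is_word_char=str.isalnum):
    """Yield maximal runs of characters for which is_word_char is true."""
    for keep, group in groupby(text, key=is_word_char):
        if keep:
            # Each group is an iterator over characters; join it into a word.
            yield ''.join(group)

print(list(iter_words("Hey, you - what are you doing here!?")))
# ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```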

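Another way to achieve this is to use the Natural Language Toolkit (nltk).

    import nltk

    data = "Hey, you - what are you doing here!?"
    word_tokens = nltk.tokenize.regexp_tokenize(data, r'\w+')
    print(word_tokens)

This prints: ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

The biggest drawback of this method is that you need to install the nltk package.

The benefit is that once you have your tokens, you can do a lot of fun things with the rest of the nltk package.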

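I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful:

    str1 = 'adj:sg:nom:m1.m2.m3:pos'
    splitat = ':.'
    # Turn every separator into a space, then split on whitespace
    ''.join([s if s not in splitat else ' ' for s in str1]).split()

If you don't want to split at spaces, put some other character in place of the ' ' and pass that same character to split().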

Here is my go at a split with multiple delimiters:

    def msplit(text, delims):
        w = ''
        for z in text:
            if z not in delims:
                w += z
            else:
                # Hit a delimiter: emit the word collected so far, if any
                if len(w) > 0:
                    yield w
                w = ''
        if len(w) > 0:
            yield w

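I like the replace() way best. The following procedure changes all separators defined in the string splitlist to the first separator in splitlist and then splits the text on that one separator. It also handles the case where splitlist happens to be an empty string. It returns a list of words, with no empty strings in it.

    def split_string(text, splitlist):
        for sep in splitlist:
            text = text.replace(sep, splitlist[0])
        # list(...) keeps the result a list on Python 3, where filter() is lazy
        return list(filter(None, text.split(splitlist[0]))) if splitlist else [text]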

    def get_words(s):
        l = []
        w = ''
        for c in s.lower():
            if c in '-!?,. ':
                # Separator: store the current word, if any, and start a new one
                if w != '':
                    l.append(w)
                w = ''
            else:
                w = w + c
        if w != '':
            l.append(w)
        return l

Here is the usage:

    >>> s = "Hey, you - what are you doing here!?"
    >>> print(get_words(s))
    ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']

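First of all, I don't think your intention is to actually use punctuation as delimiters in the split functions. Your description suggests that you simply want to eliminate punctuation from the resulting strings.

I come across this pretty frequently, and my usual solution doesn't require re.

One-liner lambda function w/ list comprehension:

(requires import string):

    split_without_punc = lambda text: [word.strip(string.punctuation) for word in text.split() if word.strip(string.punctuation) != '']

    # Call function
    split_without_punc("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']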

Function (traditional)

As a traditional function, this is still only two lines with a list comprehension (in addition to import string):

    def split_without_punctuation2(text):
        # Split by whitespace
        words = text.split()
        # Strip punctuation from each word
        return [word.strip(string.punctuation) for word in words if word.strip(string.punctuation) != '']

    split_without_punctuation2("Hey, you -- what are you doing?!")
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

It will also naturally leave contractions and hyphenated words intact. You can always use text.replace("-", " ") to turn hyphens into spaces before the split.
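To illustrate the point about contractions and hyphenated words, here is a small Python 3 sketch with a made-up example sentence. str.strip only removes characters from the two ends of each word, so punctuation inside a word survives.

```python
import string

text = "Don't panic -- it's a well-known trick!"

# Strip punctuation from both ends of every whitespace-separated token.
words = [w.strip(string.punctuation) for w in text.split()]
# Drop tokens that were pure punctuation, such as "--".
words = [w for w in words if w]

print(words)  # ["Don't", 'panic', "it's", 'a', 'well-known', 'trick']
```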

General Function w/o Lambda or List Comprehension

For a more general solution (where you can specify the characters to eliminate), and without a list comprehension, you get:

    def split_without(text: str, ignore: str) -> list:
        # Split by whitespace
        split_string = text.split()

        # Strip any characters in the ignore string, and ignore empty strings
        words = []
        for word in split_string:
            word = word.strip(ignore)
            if word != '':
                words.append(word)

        return words

    # Situation-specific call to general function
    import string
    final_text = split_without("Hey, you - what are you doing?!", string.punctuation)
    # returns ['Hey', 'you', 'what', 'are', 'you', 'doing']

Of course, you can always generalize the lambda function to any specified string of characters as well.

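First of all, use re.compile() before performing any regex operation in a loop: compiling the pattern once avoids looking it up again on every call.

So for your problem, first compile the pattern and then perform the operation on it.

    import re

    DATA = "Hey, you - what are you doing here!?"
    reg_tok = re.compile(r"[\w']+")
    print(reg_tok.findall(DATA))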
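A sketch of what using the compiled pattern in a loop might look like (Python 3; the sample lines are invented). Note that the re module also caches recently used patterns internally, so the gain from compiling by hand is usually modest.

```python
import re

# Compile once, outside the loop.
reg_tok = re.compile(r"[\w']+")

lines = [
    "Hey, you - what are you doing here!?",
    "Don't split this: it's fine.",
]

# Reuse the same compiled pattern for every line.
tokens_per_line = [reg_tok.findall(line) for line in lines]
for tokens in tokens_per_line:
    print(tokens)
```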

Here is the answer with some explanation.

    st = "Hey, you - what are you doing here!?"

    # replace all the non alpha-numeric characters with a space and then join
    new_string = ''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])
    # output of new_string
    'Hey  you   what are you doing here  '

    # str.split() removes all the empty strings if no separator is provided
    new_list = new_string.split()

    # output of new_list
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

    # we can join it to get a complete string without any non alpha-numeric character
    ' '.join(new_list)
    # output
    'Hey you what are you doing here'

Or, in one line, we can do it like this:

    (''.join([x.replace(x, ' ') if not x.isalnum() else x for x in st])).split()

    # output
    ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

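Updated answer

I think the following is the best answer to suit your needs:

\W+ may be suitable for this case, but may not be suitable for other cases.

    # Inside a character class, | is a literal pipe, so the delimiters are listed without it.
    # list(...) is needed on Python 3, where filter() returns an iterator.
    list(filter(None, re.compile(r'[ ,\-!?]').split("Hey, you - what are you doing here!?")))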

Here's my take on it...

    def split_string(source, splitlist):
        splits = frozenset(splitlist)
        l = []
        s1 = ""
        for c in source:
            if c in splits:
                # Delimiter: store the word collected so far, if any
                if s1:
                    l.append(s1)
                    s1 = ""
            else:
                s1 = s1 + c
        if s1:
            l.append(s1)
        return l

    >>> out = split_string("First Name,Last Name,Street Address,City,State,Zip Code", ",")
    >>> print(out)
    ['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']

Create a function that takes two strings as input (the source string to be split and the splitlist string of delimiters) and outputs a list of split words:

    def split_string(source, splitlist):
        output = []  # output list of cleaned words
        atsplit = True
        for char in source:
            if char in splitlist:
                atsplit = True
            else:
                if atsplit:
                    output.append(char)  # append new word after split
                    atsplit = False
                else:
                    output[-1] = output[-1] + char  # continue copying characters until next split
        return output

You need the findall() method of Python's regex module (re):

http://www.regular-expressions.info/python.html

Example

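Use a list comprehension for this stuff... it seems easier:

    data = "Hey, you - what are you doing here!?"
    separators = {',', ' ', '-', '!', '?'}
    # Turn each separator into a space and split; merely filtering the
    # characters would return single characters, not words
    tokens = ''.join(' ' if c in separators else c for c in data).split()

I find this easier to comprehend (read: maintain) than using regular expressions, simply because I am not that good at them... which is the case for most of us :). Also, if you know which separators you might be using, you can keep them in a set. With a very large set this might be slower... but then, the 're' module is slow as well.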
