Split a string into words and punctuation

I'm trying to split a string into words and punctuation, adding the punctuation marks to the list produced by the split.

For instance:

    >>> c = "help, me"
    >>> print(c.split())
    ['help,', 'me']

What I really want the list to look like is:

    ['help', ',', 'me']

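So I want the string split on whitespace, with the punctuation separated from the words.

I've tried parsing the string first and then running the split:

    >>> separatedPunctuation = ""
    >>> for character in c:
    ...     if character in ".,;!?":
    ...         # Put a space in front of each punctuation mark.
    ...         outputCharacter = " %s" % character
    ...     else:
    ...         outputCharacter = character
    ...     separatedPunctuation += outputCharacter
    ...
    >>> print(separatedPunctuation)
    help , me
    >>> print(separatedPunctuation.split())
    ['help', ',', 'me']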

This produces the result I want, but it is very slow on large files.

Is there a way to do this more efficiently?

This is more or less the way to do it:

    >>> import re
    >>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
    ['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.

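Caveats:

The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
This will not work with (single) quotes in the string.
Put any additional punctuation marks you want to use in the right half of the regular expression.
Anything not explicitly mentioned in the regular expression is silently dropped.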
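The caveats above can be seen in a small sketch. The function name `tokenize` and its `punctuation` parameter are hypothetical, introduced only for illustration; the pattern is the one from the answer, with the punctuation class made configurable.

```python
import re

def tokenize(text, punctuation=".,!?;"):
    """Return words and the listed punctuation marks as separate tokens.

    Hypothetical helper: `punctuation` is an assumed parameter that fills
    the right half of the pattern; any other character is silently dropped.
    """
    pattern = r"[\w']+|[" + re.escape(punctuation) + "]"
    return re.findall(pattern, text)

print(tokenize("Hello, I'm a string!"))
# '_' stays inside the word; ':' is not listed, so it disappears.
print(tokenize("snake_case: done"))
# Once ':' is listed, it is kept as its own token.
print(tokenize("snake_case: done", ".,!?;:"))
```

If the silent dropping is a problem, the second alternative can be widened, at the cost of deciding which symbols count as tokens.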

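Here is a Unicode-aware version:

    re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.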
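As a quick check of that behaviour, here is a sketch assuming Python 3, where `str` patterns already follow Unicode rules for `\w`, so the `re.UNICODE` flag is redundant there but harmless. The names `TOKEN_RE` and `tokenize_unicode` are made up for this example.

```python
import re

# Compiled once; in Python 3 a str pattern uses Unicode rules for \w by default.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize_unicode(text):
    """Split text into runs of word characters and single punctuation marks."""
    return TOKEN_RE.findall(text)

print(tokenize_unicode("My résumé, I'm told, is fine."))
```

The apostrophe in "I'm" comes out as its own token here, as described above.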

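In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.

Edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else who gets stumped by this "feature".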
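For readers on newer interpreters: since Python 3.7, `re.split` does split on patterns that can match an empty string, so the `\b` idea works there. A minimal sketch under that assumption (the function name `split_on_boundaries` is hypothetical):

```python
import re

def split_on_boundaries(text):
    """Split at word boundaries, then drop whitespace and empty pieces.

    Assumes Python 3.7+, where re.split honours empty matches such as
    a word boundary.
    """
    pieces = re.split(r"\b", text)
    return [piece.strip() for piece in pieces if piece.strip()]

print(split_on_boundaries("Hello, world!"))
print(split_on_boundaries("help, me"))
```

Runs of adjacent punctuation stay together with this approach, since there is no word boundary between two non-word characters.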

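Here's my entry.

I have my doubts about how well this will hold up in terms of efficiency, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

    >>> import re
    >>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
    >>> # The capturing group keeps the separators; strip them and drop empty items.
    >>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
    >>> l
    ['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

One obvious optimization would be to compile the regular expression beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.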
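The precompilation suggested above might look like the following sketch. The names `SPLIT_RE` and `tokenize_lines` and the sample lines are assumptions for illustration. The `re` module also caches recently used patterns internally, so the gain may be modest.

```python
import re

# Compile once, outside the loop, and reuse the pattern for every line.
SPLIT_RE = re.compile(r"(\W+)")

def tokenize_lines(lines):
    """Yield one token list per input line, using the precompiled pattern."""
    for line in lines:
        parts = (part.strip() for part in SPLIT_RE.split(line))
        # Drop the empty strings left over from whitespace-only separators.
        yield [part for part in parts if part]

sample = ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]
for tokens in tokenize_lines(sample):
    print(tokens)
```

Because `tokenize_lines` is a generator, it can be fed an open file object directly and will process it one line at a time.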

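Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=, which is known to be faster.

    import string

    d = "Hello, I'm a string!"

    result = []
    word = ''

    for char in d:
        if char not in string.whitespace:
            if char not in string.ascii_letters + "'":
                # A symbol ends the current word and becomes its own token.
                if word:
                    result.append(word)
                result.append(char)
                word = ''
            else:
                word = ''.join([word, char])
        else:
            # Whitespace ends the current word.
            if word:
                result.append(word)
                word = ''

    # Flush a word that runs to the end of the string.
    if word:
        result.append(word)

    print(result)
    # ['Hello', ',', "I'm", 'a', 'string', '!']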
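A variation on the same idea, offered only as a sketch: instead of re-joining the word on every character, collect the characters in a list and join once per word. The function name `split_words_and_punctuation` is hypothetical, and whether this is measurably faster was not benchmarked here.

```python
import string

def split_words_and_punctuation(text):
    """Split text into words (ASCII letters and apostrophes) and single symbols."""
    word_chars = set(string.ascii_letters + "'")
    result = []
    current = []  # characters of the word being built

    for char in text:
        if char in word_chars:
            current.append(char)
            continue
        # Any other character ends the current word.
        if current:
            result.append("".join(current))
            current = []
        # Whitespace is discarded; everything else becomes its own token.
        if not char.isspace():
            result.append(char)

    # Flush a word that runs to the end of the input.
    if current:
        result.append("".join(current))
    return result

print(split_words_and_punctuation("Hello, I'm a string!"))
```

As in the loop above, digits and non-ASCII letters are treated as symbols, so this is only suitable for plain English text.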

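I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in the tutorial.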

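I came up with a way to tokenize all words and \W+ patterns using \b, which doesn't need hardcoding:

    >>> import re
    >>> sentence = 'Hello, world!'
    >>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
    >>> tokens
    ['Hello', ',', 'world', '!']

Here .*?\S.*? is a pattern matching anything that is not a space, and $ is added so that the last token in the string is matched even if it is a punctuation symbol.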

Note the following, though: this will group punctuation that consists of more than one symbol:

    >>> print([t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')])
    ['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

    >>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
    ...     print(re.findall(r'(?:\w+|\W)', token))
    ...
    ['You']
    ['can']
    ['"', ',']
    ['she']
    ['said']
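The two steps above can be folded into one function that returns a flat list. This is only a sketch: the names `CHUNK_RE`, `PIECE_RE` and `tokenize_flat` are invented here, while the two patterns are the ones from the answer.

```python
import re

CHUNK_RE = re.compile(r"\b.*?\S.*?(?:\b|$)")  # boundary-delimited chunks
PIECE_RE = re.compile(r"\w+|\W")              # a word, or one non-word character

def tokenize_flat(sentence):
    """Find boundary-delimited chunks, then split grouped punctuation apart."""
    tokens = []
    for chunk in CHUNK_RE.findall(sentence):
        tokens.extend(PIECE_RE.findall(chunk.strip()))
    return tokens

print(tokenize_flat('"You can", she said'))
print(tokenize_flat('Hello, world!'))
```

As in the outputs shown earlier, a quote at the very start of the string is not captured, because there is no word boundary in front of it.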

Try this:

    string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"
    my_list = []
    x = len(string_big)
    position_of_space = 0
    while position_of_space < x:
        # Scan forward to the next space (or to the end of the string).
        for i in range(position_of_space, x):
            if string_big[i] == ' ':
                break
        # Each slice includes the trailing space, if there is one.
        print(string_big[position_of_space:i + 1])
        my_list.append(string_big[position_of_space:i + 1])
        position_of_space = i + 1

    print(my_list)

Have you tried using a regular expression?

http://docs.python.org/library/re.html#re-syntax

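By the way, why do you need the "," as a second item? You already know that it comes after each piece of text, i.e. [0] "," [1] ",".

So if you want to add the ",", you can just do it after each iteration when you use the array.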
