Best way to strip punctuation from a string in Python

It seems like there should be a simpler way than:

    import string
    s = "string. With. Punctuation?"  # Sample string
    out = s.translate(string.maketrans("", ""), string.punctuation)

Is there?

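From an efficiency perspective, you're not going to beat this (Python 2):

    s.translate(None, string.punctuation)

It performs raw string operations in C with a lookup table; not much will beat that short of writing your own C code.

If speed isn't a concern, another option is:

    exclude = set(string.punctuation)
    s = ''.join(ch for ch in s if ch not in exclude)

This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as the timings below show. For this type of problem, doing the work at as low a level as possible pays off.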

Timing code:

    # Python 2 (print statement, string.maketrans)
    import re, string, timeit

    s = "string. With. Punctuation"
    exclude = set(string.punctuation)
    table = string.maketrans("", "")
    regex = re.compile('[%s]' % re.escape(string.punctuation))

    def test_set(s):
        return ''.join(ch for ch in s if ch not in exclude)

    def test_re(s):  # From Vinko's solution, with fix.
        return regex.sub('', s)

    def test_trans(s):
        return s.translate(table, string.punctuation)

    def test_repl(s):  # From S.Lott's solution
        for c in string.punctuation:
            s = s.replace(c, "")
        return s

    print "sets      :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
    print "regex     :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
    print "translate :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
    print "replace   :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

    sets      : 19.8566138744
    regex     : 6.86155414581
    translate : 2.12455511093
    replace   : 28.4436721802

Regular expressions are simple enough, if you know them.

    import re
    s = "string. With. Punctuation?"
    s = re.sub(r'[^\w\s]', '', s)

    myString.translate(None, string.punctuation)

I usually use something like this:

    >>> s = "string. With. Punctuation?"  # Sample string
    >>> import string
    >>> for c in string.punctuation:
    ...     s = s.replace(c, "")
    ...
    >>> s
    'string With Punctuation'

Not necessarily simpler, but a different way, if you are more familiar with the re family.

    import re, string
    s = "string. With. Punctuation?"  # Sample string
    out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

string.punctuation is ASCII only! A more correct (but also slower) way is to use the unicodedata module:

    # -*- coding: utf-8 -*-
    # Python 2 (unicode literal, print statement)
    from unicodedata import category
    s = u'String — with - «punctation »...'
    s = ''.join(ch for ch in s if category(ch)[0] != 'P')
    print 'stripped', s
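For Python 3, the same idea can be sketched as follows. The function name `strip_unicode_punctuation` is a hypothetical helper, not part of the answer, and the sample string uses escapes for the non-ASCII marks. Note that only categories starting with `P` are dropped, so symbols such as `$` (category `Sc`) survive.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

# \u2014 is an em dash, \u00ab and \u00bb are guillemets.
sample = "String \u2014 with - \u00abpunctuation \u00bb..."
print("stripped", strip_unicode_punctuation(sample))
```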

For convenience, I've summed up how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for a detailed description.

Python 2

    import string
    s = "string. With. Punctuation?"
    table = string.maketrans("", "")
    new_s = s.translate(table, string.punctuation)  # Output: string without punctuation

Python 3

    import string
    s = "string. With. Punctuation?"
    table = str.maketrans({key: None for key in string.punctuation})
    new_s = s.translate(table)  # Output: string without punctuation

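For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; code points (integers) are looked up in that mapping, and anything mapped to None is removed.

To remove (some?) punctuation then, use:

    import string
    remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
    s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping: it takes the sequence of keys and sets every value to None.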
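As a small illustration in Python 3, the mapping built with `dict.fromkeys()` is the same dictionary that the three-argument form of `str.maketrans` returns, so either can be passed to `translate`. The variable name `same_map` is just for this sketch.

```python
import string

# Map each ASCII punctuation code point to None; str.translate deletes those.
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))

# The third argument of str.maketrans lists characters to delete,
# which produces an equivalent {code point: None} dictionary.
same_map = str.maketrans("", "", string.punctuation)

s = "string. With. Punctuation?"
print(s.translate(remove_punct_map))
print(remove_punct_map == same_map)
```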

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):

    import unicodedata
    import sys
    remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                     if unicodedata.category(chr(i)).startswith('P'))
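A usage sketch for that table, assuming Python 3: building it scans every code point, so it is worth doing once and reusing. The wrapper name `remove_all_punctuation` is hypothetical, and the range here is extended by one so the final code point is included (it is unassigned, so this does not change the result).

```python
import sys
import unicodedata

# Build the deletion table once; scanning all code points takes a moment.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

def remove_all_punctuation(text):
    # Every code point mapped to None is dropped by str.translate.
    return text.translate(remove_punct_map)

# \u30fb, \u3001 and \u3002 are CJK punctuation marks.
print(remove_all_punctuation("Some\u30fbReally\u3001Weird\u3002"))
```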

This might not be the best solution, but this is how I did it.

    import string
    f = lambda x: ''.join([i for i in x if i not in string.punctuation])

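It's been 6 years since this question was asked, but I figured I'd write a function. It is not very efficient, but it is simple, and you can add or remove any punctuation you want:

    def stripPunc(wordList):
        """Strips punctuation from list of words"""
        puncList = [".", ";", ":", "!", "?", "/", "\\", ",", "#", "@", "$", "&", ")", "(", "\""]
        for punc in puncList:
            # One pass over the list per punctuation mark is enough.
            wordList = [word.replace(punc, '') for word in wordList]
        return wordList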

Here's a one-liner for Python 3.5:

    import string
    "l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a: None for a in string.punctuation}))

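string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

    # Python 2 syntax (ur'' literal); requires the third-party regex module
    import regex
    s = u"string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
    remove = regex.compile(ur'[\p{C}\p{M}\p{P}\p{S}\p{Z}]+', regex.UNICODE)
    remove.sub(u" ", s).strip()

Personally, I believe this is the best way to remove punctuation from a string in Python because:

  • It removes all Unicode punctuation.
  • It's easily modifiable: for example, drop \p{S} from the pattern if you want to remove punctuation but keep symbols like $.
  • You can get really specific about what to keep and what to remove; for example, \p{Pd} will only remove dashes.
  • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.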
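If installing the third-party regex module is not an option, the same category-based idea can be approximated with the standard library's `unicodedata`. This is a swapped-in technique, not the answer's code: `REMOVE_CATEGORIES` and `clean` are hypothetical names, and the sketch only looks at the first letter of each character's general category (so it cannot express a subcategory such as `Pd` without further changes).

```python
import unicodedata

# First letters of the general categories to drop, mirroring
# \p{C}, \p{M}, \p{P}, \p{S} and \p{Z} in the pattern above.
REMOVE_CATEGORIES = frozenset("CMPSZ")

def clean(text, remove=REMOVE_CATEGORIES):
    # Replace every unwanted character with a space...
    spaced = "".join(
        " " if unicodedata.category(ch)[0] in remove else ch
        for ch in text
    )
    # ...then collapse each run of whitespace into one space and trim the ends.
    return " ".join(spaced.split())

print(clean("string. With.\tSome\u30fbReally Weird\u3001Non?ASCII\u3002"))
# Keep symbols such as $ by leaving "S" out of the set.
print(clean("costs $5, ok?", remove=frozenset("CMPZ")))
```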

    >>> import re
    >>> s = "string. With. Punctuation?"
    >>> s = re.sub(r'[^\w\s]', '', s)
    >>> re.split(r'\s+', s)
    ['string', 'With', 'Punctuation']

Here's a solution without regex.

    # Python 2 (string.maketrans, print statement)
    import string
    input_text = "!where??and!!or$$then:)"
    punctuation_replacer = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
    print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

    Output>> where and or then

  • Replaces punctuation with spaces
  • Replaces multiple spaces between words with a single space
  • Removes trailing spaces, if any, with strip()
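The three steps above can be sketched for Python 3 as follows, assuming `str.maketrans` in place of `string.maketrans`. The name `punctuation_to_space` is only for this example. Because `split()` with no arguments already discards leading and trailing whitespace, a separate `strip()` is not needed here.

```python
import string

# Step 1: map every ASCII punctuation character to a space.
punctuation_to_space = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)

input_text = "!where??and!!or$$then:)"

# Steps 2 and 3: split() collapses runs of whitespace and drops the ends;
# joining with a single space rebuilds the cleaned string.
cleaned = " ".join(input_text.translate(punctuation_to_space).split())
print(cleaned)
```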

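Do a search and replace using the regex functions. If you have to perform the operation repeatedly, you can keep a compiled copy of the regex pattern (your punctuation) around, which will speed things up a bit.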
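A minimal sketch of keeping a compiled pattern around, assuming ASCII punctuation from `string.punctuation` is what should be removed. `PUNCTUATION_RE` and `remove_punctuation` are hypothetical names. The `re` module also caches recently used patterns internally, so the gain from compiling by hand may be modest.

```python
import re
import string

# Compile once at import time and reuse the pattern for every call.
PUNCTUATION_RE = re.compile("[%s]" % re.escape(string.punctuation))

def remove_punctuation(text):
    return PUNCTUATION_RE.sub("", text)

for line in ["string. With. Punctuation?", "Hello, world!"]:
    print(remove_punctuation(line))
```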

A one-liner might be helpful in not-very-strict cases:

    ''.join([c for c in s if c.isalnum() or c.isspace()])

    # Python 2 (raw_input, print statement)
    # FIRST METHOD
    # Storing all punctuation in a variable
    punctuation = '!?,.:;"\')(_-'
    newstring = ''  # Creating empty string
    word = raw_input("Enter string: ")
    for i in word:
        if i not in punctuation:
            newstring += i
    print "The string without punctuation is", newstring

    # SECOND METHOD
    word = raw_input("Enter string: ")
    punctuation = '!?,.:;"\')(_-'
    newstring = word.translate(None, punctuation)
    print "The string without punctuation is", newstring

    # Output for both methods
    Enter string: hello! welcome -to_python(programming.language)??,
    The string without punctuation is hello welcome topythonprogramminglanguage

Here is how to change a file's contents to upper or lower case.

    print('@@@@This is lower case@@@@')
    with open('students.txt', 'r') as myFile:
        str1 = myFile.read()
    print(str1.lower())

    print('*****This is upper case****')
    with open('students.txt', 'r') as myFile:
        str1 = myFile.read()
    print(str1.upper())

    import re
    s = "string. With. Punctuation?"  # Sample string
    out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

    with open('one.txt', 'r') as myFile:
        str1 = myFile.read()
    print(str1)
    punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
    for i in punctuation:
        str1 = str1.replace(i, " ")
    myList = []
    myList.extend(str1.split(" "))
    print(str1)
    for i in myList:
        print(i, end='\n')
        print("____________")

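I haven't seen this answer yet. Just use a regex; it removes all characters other than word characters (\w), digit characters (\d), and whitespace characters (\s). Note that \d is redundant here because \w already matches digits, and that \w also keeps underscores:

    import re
    s = "string. With. Punctuation?"  # Sample string
    out = re.sub(r'[^\w\d\s]+', '', s)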
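To make the behaviour of `\w` concrete, here is a short Python 3 sketch. The sample string is made up for illustration. With `str` input, `\w` matches Unicode letters and digits plus the underscore, so accented letters and underscores are kept unless the underscore is removed explicitly.

```python
import re

s = "snake_case, caf\u00e9 & 42!"

# Underscore and the accented letter count as word characters and survive.
kept_underscore = re.sub(r"[^\w\s]", "", s)

# Adding "|_" to the pattern removes underscores as well.
no_underscore = re.sub(r"[^\w\s]|_", "", s)

print(kept_underscore)
print(no_underscore)
```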

Remove stop words from a text file using Python

    print('====THIS IS HOW TO REMOVE STOP WORDS====')
    with open('one.txt', 'r') as myFile:
        str1 = myFile.read()
    stop_words = ("not", "is", "it", "By", "between", "This", "By", "A", "when", "And",
                  "up", "Then", "was", "by", "It", "If", "can", "an", "he", "This", "or",
                  "And", "a", "i", "it", "am", "at", "on", "in", "of", "to", "is", "so",
                  "too", "my", "the", "and", "but", "are", "very", "here", "even", "from",
                  "them", "then", "than", "this", "that", "though", "be", "But", "these")
    myList = []
    myList.extend(str1.split(" "))
    for i in myList:
        if i not in stop_words:
            print("____________")
            print(i, end='\n')

I like to use a function like this:

    import string

    def scrub(abc):
        # Trim punctuation from the end, then from the start.
        while abc and abc[-1] in string.punctuation:
            abc = abc[:-1]
        while abc and abc[0] in string.punctuation:
            abc = abc[1:]
        return abc
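When only leading and trailing punctuation needs to go, the built-in `str.strip` can do the same job, since it accepts a string of characters to remove from both ends. A minimal sketch, reusing the name `scrub` from above with a different parameter name:

```python
import string

def scrub(text):
    # Strip any ASCII punctuation from both ends; interior marks are untouched.
    return text.strip(string.punctuation)

print(scrub("...hello, world!?"))
```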