在Python 3中加速数以百万计的正则expression式replace

我正在使用Python 3.5.2

我有两个名单

约750,000个“句子”（长串）
我想从我的75万个句子中删除约2万个“单词”的列表

所以，我必须循环使用750,000个句子，并且执行大约20000次replace， 但是只有我的单词实际上是“单词”，而不是更大string的一部分。

我正在通过预编译我的单词，使它们被\b元字符包围

 compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

然后我循环通过我的“句子”

 import re for sentence in sentences: for word in compiled_words: sentence = re.sub(word, "", sentence) # put sentence into a growing list

这个嵌套的循环每秒处理大约50个句子 ，这很好，但是处理所有的句子还需要几个小时。

有没有一种方法来使用str.replace方法（我相信是更快），但仍然要求replace只发生在字边界 ？
另外，有没有办法加快re.sub方法？如果我的单词的长度大于我的句子的长度，我已经通过跳过re.sub略微提高了速度，但是这并没有太大的改进。

谢谢你的任何build议。

你可以尝试的一件事是编译一个单一的模式，如"\b(word1|word2|word3)\b" 。

因为依靠C代码来进行实际的匹配，节省可能是巨大的。

正如@pvg在评论中指出的那样，它也可以从单次匹配中获益。

如果你的话不是正则expression式，埃里克的答案是更快。

TLDR

如果您想要最快的解决scheme，请使用此方法（使用设置查找）。对于类似于OP的数据集，它比接受的答案快大约2000倍。

如果你坚持使用正则expression式进行查找，使用这个基于特里的版本，它仍然比正则expression式联盟快1000倍。

理论

如果你的句子不是庞大的string，那么每秒处理超过50个字符可能是可行的。

如果将所有被禁止的单词保存到一个集合中，那么检查该集合中是否包含另一个单词会非常快。

把逻辑打包成一个函数，把这个函数作为re.sub参数，就完成了！

码

 import re with open('/usr/share/dict/american-english') as wordbook: banned_words = set(word.strip().lower() for word in wordbook) def delete_banned_words(matchobj): word = matchobj.group(0) if word.lower() in banned_words: return "" else: return word sentences = ["I'm eric. Welcome here!", "Another boring sentence.", "GiraffeElephantBoat", "sfgsdg sdwerha aswertwe"] * 250000 for sentence in sentences: sentence = re.sub('\w+', delete_banned_words, sentence) # print(sentence) # do something with sentence

转换的句子是：

 ' . ! . GiraffeElephantBoat sfgsdg sdwerha aswertwe

注意：

search是不区分大小写的（感谢lower() ）
用""replace单词可能会留下两个空格（如在您的代码中）
使用python3， \w+也可以匹配重音字符（例如"ångström" ）。
任何非单词字符（制表符，空格，换行符，标记…）都将保持不变。

性能

有一百万个句子， banned_words有近10万字，剧本运行时间不到7s。

相比之下，Liteye的答案需要一千个句子的160秒。

n是单词总数， m是被禁止单词的数量，OP和Liteye的代码是O(n*m) 。

相比之下，我的代码应该运行在O(n+m) 。考虑到有更多的句子比被禁止的单词，algorithm变成O(n) 。

正则expression式联合testing

使用'\b(word1|word2|...|wordN)\b'模式进行正则expression式search的复杂性是多less？是O(N)还是O(1) ？

要理解正则expression式引擎的工作方式是相当困难的，所以我们来写一个简单的testing。

这段代码将10**i随机的英文单词提取到列表中。它创build相应的正则expression式联合，并用不同的单词进行testing：

一个显然不是一个字（它以#开头）
一个是列表中的第一个单词
一个是列表中的最后一个词
一个看起来像一个字，但不是

 import re import timeit import random with open('/usr/share/dict/american-english') as wordbook: english_words = [word.strip().lower() for word in wordbook] random.shuffle(english_words) print("First 10 words :") print(english_words[:10]) test_words = [ ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"), ("First word", english_words[0]), ("Last word", english_words[-1]), ("Almost a word", "couldbeaword") ] def find(word): def fun(): return union.match(word) return fun for exp in range(1, 6): print("\nUnion of %d words" % 10**exp) union = re.compile(r"\b(%s)\b" % '|'.join(english_words[:10**exp])) for description, test_word in test_words: time = timeit.timeit(find(test_word), number=1000) * 1000 print(" %-17s : %.1fms" % (description, time))

它输出：

 First 10 words : ["geritol's", "sunstroke's", 'fib', 'fergus', 'charms', 'canning', 'supervisor', 'fallaciously', "heritage's", 'pastime'] Union of 10 words Surely not a word : 0.7ms First word : 0.8ms Last word : 0.7ms Almost a word : 0.7ms Union of 100 words Surely not a word : 0.7ms First word : 1.1ms Last word : 1.2ms Almost a word : 1.2ms Union of 1000 words Surely not a word : 0.7ms First word : 0.8ms Last word : 9.6ms Almost a word : 10.1ms Union of 10000 words Surely not a word : 1.4ms First word : 1.8ms Last word : 96.3ms Almost a word : 116.6ms Union of 100000 words Surely not a word : 0.7ms First word : 0.8ms Last word : 1227.1ms Almost a word : 1404.1ms

所以它看起来像search一个'\b(word1|word2|...|wordN)\b'模式的单个词有：

O(1)最好的情况
O(n/2)平均情况，仍然是O(n)
O(n)最坏的情况

这些结果与简单的循环search一致。

正则expression式联合的更快的替代方法是从树中创build正则expression式模式。

TLDR

如果您想要最快的基于正则expression式的解决scheme，请使用此方法。对于类似于OP的数据集，它比接受的答案快大约1000倍。

如果你不关心正则expression式，使用这个基于集合的版本，比正则expression式联盟快2000倍。

与Trie优化的正则expression式

一个简单的正则expression式联合方法变得很慢，因为正则expression式引擎在优化模式方面做得不好。

用所有被禁止的单词创build一个Trie并写出相应的正则expression式是可能的。由此产生的trie或正则expression式不是真正的人类可读的，但它们允许非常快速的查找和匹配。

例

 ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

正则表达式联盟

该列表被转换为一个trie：

 {'f': {'o': {'o': {'x': {'a': {'r': {'': 1}}}, 'b': {'a': {'r': {'': 1}, 'h': {'': 1}}}, 'z': {'a': {'': 1, 'p': {'': 1}}}}}}}

然后到这个正则expression式模式：

 r"\bfoo(?:ba[hr]|xar|zap?)\b"

正则表达式

巨大的优势是，为了testingzoo匹配，正则expression式引擎只需要比较第一个字符（它不匹配），而不是尝试5个字。对于5个单词来说，这是一个预处理过度的过量，但是对于成千上万个单词来说，它显示出有希望的结果

请注意，使用(?:)非捕获组是因为：

foobar|baz会匹配foobar或baz ，但不是foobaz
foo(bar|baz)会将不需要的信息保存到捕获组中。

码

这里有一个稍微修改的要点，我们可以使用它作为trie.py库：

 import re class Trie(): """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern. The corresponding Regex should match much faster than a simple Regex union.""" def __init__(self): self.data = {} def add(self, word): ref = self.data for char in word: ref[char] = char in ref and ref[char] or {} ref = ref[char] ref[''] = 1 def dump(self): return self.data def quote(self, char): return re.escape(char) def _pattern(self, pData): data = pData if "" in data and len(data.keys()) == 1: return None alt = [] cc = [] q = 0 for char in sorted(data.keys()): if isinstance(data[char], dict): try: recurse = self._pattern(data[char]) alt.append(self.quote(char) + recurse) except: cc.append(self.quote(char)) else: q = 1 cconly = not len(alt) > 0 if len(cc) > 0: if len(cc) == 1: alt.append(cc[0]) else: alt.append('[' + ''.join(cc) + ']') if len(alt) == 1: result = alt[0] else: result = "(?:" + "|".join(alt) + ")" if q: if cconly: result += "?" else: result = "(?:%s)?" % result return result def pattern(self): return self._pattern(self.dump())

testing

这是一个小testing（和这个一样）：

 # Encoding: utf-8 import re import timeit import random from trie import Trie with open('/usr/share/dict/american-english') as wordbook: banned_words = [word.strip().lower() for word in wordbook] random.shuffle(banned_words) test_words = [ ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"), ("First word", banned_words[0]), ("Last word", banned_words[-1]), ("Almost a word", "couldbeaword") ] def trie_regex_from_words(words): trie = Trie() for word in words: trie.add(word) return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE) def find(word): def fun(): return union.match(word) return fun for exp in range(1, 6): print("\nTrieRegex of %d words" % 10**exp) union = trie_regex_from_words(banned_words[:10**exp]) for description, test_word in test_words: time = timeit.timeit(find(test_word), number=1000) * 1000 print(" %s : %.1fms" % (description, time))

它输出：

 TrieRegex of 10 words Surely not a word : 0.3ms First word : 0.4ms Last word : 0.5ms Almost a word : 0.5ms TrieRegex of 100 words Surely not a word : 0.3ms First word : 0.5ms Last word : 0.9ms Almost a word : 0.6ms TrieRegex of 1000 words Surely not a word : 0.3ms First word : 0.7ms Last word : 0.9ms Almost a word : 1.1ms TrieRegex of 10000 words Surely not a word : 0.1ms First word : 1.0ms Last word : 1.2ms Almost a word : 1.2ms TrieRegex of 100000 words Surely not a word : 0.3ms First word : 1.2ms Last word : 0.9ms Almost a word : 1.6ms

有关信息，正则expression式像这样开始：

（：一（：（：\'S | A（：\？？？？S |陈| liyah（：\？'S）| R（？：？dvark（：（？：？：\的| S ））| ON））| b'（：\？ 'S | A（：C（：我们（：（：\？？？' S | ES））| [IK]）|英尺|孤独的（？：（？：\ 'S | S）？）| NDON（:( ?:版| ING |换货（？：\？' S）| S））| S（：？？？？E（:( ?:精神疾病：| [DS））| H（:( ?: E [DS] |荷兰国际集团））|荷兰国际集团）| T（（\'S'）：？？？？？？E（:( ?:包换（？：\'S）| [DS]））| ING | toir（？：？（：\？的| S））））| b（：如（：？？？ID）| E（？：SS（：（：\ 'Y S | | ES）？（：（：\？？？？）' |）| OT（S S））：（：\'S | T（：？？？\的）| S））| reviat？（：E [DS] | I（：？？？纳克|上（：（：\？的| S））））| Y（？：？\” ？S）| \ E（：（：？\的| S）？））| d（：ICAT（：E [DS] | I（：？？？？？纳克|上（：（：\的| S））））| OM（？：恩？（：（：\？'S | S））|伊纳勒）| U（？：？克拉（:( ?:版| I（？： NG |上（：（：\？ 'S | S）））|或？（？：（：\？' S | S））| S））| L（：？？？？\'S）））| E（：（？：？：\ 'S |上午| L（：（：\？' S | ARD |子（？：\'S）））| R（？：？？？DEEN（：\ 'S）| nathy？（？：\' S）| RA（？：？NT |重刑（：（？：？：\'S | S））？））| T（:( ?:吨（？： E（：R（：（：？\'？S | S））| d）| ING |或（：（：？\的| S）））| S））| yance（？？？？：\ '？S）| d））| HOR（:( ?: R（：E（为：n（：CE（：？？？？？\'？S）| T）| d）|荷兰国际集团）|或多个））| I？（：d（：E [DS] |荷兰国际集团|一月（：？？？\'S'））|盖尔| L（：？烯|它（？：IES | Y（?: \ '？S））））|百灵（：ECT（：LY）| UR（：？？？？通货膨胀（：（：？\'？S | S））| E [DS] |荷兰国际集团）） | L？（？：？一个（：略去（：（：？？？\的| S））| ZE）| E（:( ?: ST | R））| OOM | ution（:(？？：\'S | S））| Y）| M \的| N（？：ΔE（：GAT（：E [DS] | I（：？纳克|上（？：？\'？S）））| R（：\” ？S））| ormal（:( ?:它（：IES | Y（？：？：\？ 'S））| LY）））| O（？：？？ARD |德（？：？（：\' ？S | S））|李（：SH（:( ?: E [DS] |荷兰国际集团））|和灰（：？？？（？：？\的| IST（：（：？\的|或多个）））））|米娜（：？？？BL [EY] | T（：E [DS] | I（：？？？？？纳克|上（：（？：\的| S）））））| R（：？igin（：人（：（：？\的| S））|？E（：（：？\的| S）））| T（:(？？？？：编辑| I（？：？纳克|上（：（：\ 'S | IST（：（？：？：\'？？？S | S））| S））|阳离子）|？S）））| U：T）| | VE（ND（:( ?:编| ING | S）？）？（：（：？\的|板）？？？？））| R（：一个（：cadabra（？：\'？S）| d（？：？E [DS] |荷兰国际集团）|火腿（：\？的）|米（？：？（？：\的| S））| SI（？：对（：（：\？ 'S | S））|已经？（？：？（：\' S | LY |岬（？：？：\'S）| S））？））|东| IDG （：E（:( ?:换货（：（：？？？？\的| S））| [DS]）？）| ING |换货？（：（：？\的| S））？）| O（？：广告| GAT（：E [DS] | I（：？？？？？纳克|上（：（？：\的| S））））？）| UPT（:( ?: E：LY | |岬（ST | R？）（：\？'）））| S（S）：？ALOM | C（：？？？？ESS（：（：\的| E [DS] |荷兰国际集团））| ISSA（？：（：\的| [ES]））| OND（:( ?:编| ING | S）））| EN（：？？？？？？？CE（:( ？：\'？S | S））| T（:( ?: E（：E（：（：\？？？？的| ISM（：？\的）| S））|？d） | ING | LY | S）））| INTH（？：（：？？？\'？S | E（：\的）））| O（？：？L（：UT（：E（？？？？：（？：\的| LY | ST？））| I？（：上（：\？的）| SM？：））| v（（\'S'）：？E [DS] ？|荷兰国际集团））| R（：b（:( ?: E（为：n（：CY（：？？？？？\'？S）| T（：（：？\的| S））？）| d）| ING | S））|？PT 一世…

这真的是不可读的，但是对于100000个被禁止的单词列表，这个Trie正则expression式比简单的正则expression式联合速度快1000倍！

以下是使用trie-python-graphviz和graphviz twopi导出的完整特里结构图：

在这里输入图像说明

你可能想要尝试的一件事是预处理句子来编码单词边界。基本上通过分割词边界将每个句子变成一个单词列表。

这应该是更快，因为处理一个句子，你只需要通过每一个单词，并检查它是否匹配。

目前，正则expression式search每次都要重新遍历整个string，寻找单词边界，然后在下一遍之前“丢弃”这个工作的结果。

那么，这里有一个快速和简单的解决scheme，testing集。

获胜策略：

re.sub（“\ w +”，repl，句子）search单词。

“repl”可以是可调用的。我使用了一个执行字典查找的函数，并且该字典包含要search和replace的字词。

这是最简单和最快的解决scheme（请参阅下面的示例代码中的函数replace4）。

次好的

这个想法是使用re.split将句子拆分成单词，同时保留分隔符来重新构造句子。然后，replace是用一个简单的字典查找来完成的。

（请参阅下面的示例代码中的函数replace3）。

时间例如function：

 replace1: 0.62 sentences/s replace2: 7.43 sentences/s replace3: 48498.03 sentences/s replace4: 61374.97 sentences/s (...and 240.000/s with PyPy)

…和代码：

 #! /bin/env python3 # -*- coding: utf-8 import time, random, re def replace1( sentences ): for n, sentence in enumerate( sentences ): for search, repl in patterns: sentence = re.sub( "\\b"+search+"\\b", repl, sentence ) def replace2( sentences ): for n, sentence in enumerate( sentences ): for search, repl in patterns_comp: sentence = re.sub( search, repl, sentence ) def replace3( sentences ): pd = patterns_dict.get for n, sentence in enumerate( sentences ): #~ print( n, sentence ) # Split the sentence on non-word characters. # Note: () in split patterns ensure the non-word characters ARE kept # and returned in the result list, so we don't mangle the sentence. # If ALL separators are spaces, use string.split instead or something. # Example: #~ >>> re.split(r"([^\w]+)", "ab céé? . d2eéf") #~ ['ab', ' ', 'céé', '? . ', 'd2eéf'] words = re.split(r"([^\w]+)", sentence) # and... done. sentence = "".join( pd(w,w) for w in words ) #~ print( n, sentence ) def replace4( sentences ): pd = patterns_dict.get def repl(m): w = m.group() return pd(w,w) for n, sentence in enumerate( sentences ): sentence = re.sub(r"\w+", repl, sentence) # Build test set test_words = [ ("word%d" % _) for _ in range(50000) ] test_sentences = [ " ".join( random.sample( test_words, 10 )) for _ in range(1000) ] # Create search and replace patterns patterns = [ (("word%d" % _), ("repl%d" % _)) for _ in range(20000) ] patterns_dict = dict( patterns ) patterns_comp = [ (re.compile("\\b"+search+"\\b"), repl) for search, repl in patterns ] def test( func, num ): t = time.time() func( test_sentences[:num] ) print( "%30s: %.02f sentences/s" % (func.__name__, num/(time.time()-t))) print( "Sentences", len(test_sentences) ) print( "Words ", len(test_words) ) test( replace1, 1 ) test( replace2, 10 ) test( replace3, 1000 ) test( replace4, 1000 )

也许Python不是这里的正确工具。这是一个Unix工具链

 sed G file | tr ' ' '\n' | grep -vf blacklist | awk -v RS= -v OFS=' ' '{$1=$1}1'

假设您的黑名单文件是预处理与单词边界添加。步骤是：将文件转换为双倍行距，将每个句子拆分为每行一个字，批量删除文件中的黑名单字，并合并行。

这应该至less运行一个数量级。

为了从单词（每行一个单词）预处理黑名单文件，

 sed 's/.*/\\b&\\b/' words > blacklist

这个怎么样：

 #!/usr/bin/env python3 from __future__ import unicode_literals, print_function import re import time import io def replace_sentences_1(sentences, banned_words): # faster on CPython, but does not use \b as the word separator # so result is slightly different than replace_sentences_2() def filter_sentence(sentence): words = WORD_SPLITTER.split(sentence) words_iter = iter(words) for word in words_iter: norm_word = word.lower() if norm_word not in banned_words: yield word yield next(words_iter) # yield the word separator WORD_SPLITTER = re.compile(r'(\W+)') banned_words = set(banned_words) for sentence in sentences: yield ''.join(filter_sentence(sentence)) def replace_sentences_2(sentences, banned_words): # slower on CPython, uses \b as separator def filter_sentence(sentence): boundaries = WORD_BOUNDARY.finditer(sentence) current_boundary = 0 while True: last_word_boundary, current_boundary = current_boundary, next(boundaries).start() yield sentence[last_word_boundary:current_boundary] # yield the separators last_word_boundary, current_boundary = current_boundary, next(boundaries).start() word = sentence[last_word_boundary:current_boundary] norm_word = word.lower() if norm_word not in banned_words: yield word WORD_BOUNDARY = re.compile(r'\b') banned_words = set(banned_words) for sentence in sentences: yield ''.join(filter_sentence(sentence)) corpus = io.open('corpus2.txt').read() banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()] sentences = corpus.split('. ') output = io.open('output.txt', 'wb') print('number of sentences:', len(sentences)) start = time.time() for sentence in replace_sentences_1(sentences, banned_words): output.write(sentence.encode('utf-8')) output.write(b' .') print('time:', time.time() - start)

这些解决scheme分裂在单词边界上，并查找一组中的每个单词。他们应该比re.sub字替代（Liteyes的解决scheme），因为这些解决scheme是O(n)其中n是input的大小，由于amortized O(1)集合查找分期付款，而使用正则expression式交替会导致正则expression式引擎必须检查每个字符上的单词匹配，而不仅仅是单词边界。我的解决scheme需要特别注意保留原始文本中使用的空格（即不压缩空格并保留制表符，换行符和其他空格字符），但如果您决定不关心它，应该是相当简单的从输出中删除它们。

我在corpus.txt上进行了testing，这是从Gutenberg项目下载的多个电子书的串联，而banned_words.txt是从Ubuntu的单词表（/ usr / share / dict / american-english）中随机选取的20000个单词。处理862462个句子（和PyPy的一半）需要大约30秒的时间。我已经把句子定义为任何以“。”分开的东西。

 $ # replace_sentences_1() $ python3 filter_words.py number of sentences: 862462 time: 24.46173644065857 $ pypy filter_words.py number of sentences: 862462 time: 15.9370770454 $ # replace_sentences_2() $ python3 filter_words.py number of sentences: 862462 time: 40.2742919921875 $ pypy filter_words.py number of sentences: 862462 time: 13.1190629005

PyPy从第二种方法中获益更多，而CPython在第一种方法上performance更好。上面的代码可以在Python 2和Python 3上运行。

实用的方法

下面介绍的解决scheme使用大量内存将所有文本存储在同一个string中，并降低复杂性级别。 如果RAM是一个问题，在使用之前要三思。

通过join / split技巧，您可以避免循环，从而加速algorithm。

连接一个不包含在句子中的特殊分隔符的句子：

 merged_sentences = ' * '.join(sentences)

使用|编译一个单一的正则expression式，用于从句子中删除所有需要的单词 “或”正则expression式：

 regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

使用编译后的正则expression式来标记单词，并将其由特殊分隔符分隔回分隔的句子：

 clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')

性能

"".join复杂度是O（n）。这是非常直观的，但无论如何有一个从源头缩短引用：

 for (i = 0; i < seqlen; i++) { [...] sz += PyUnicode_GET_LENGTH(item);

因此，通过join/split您可以使用O（单词）+ 2 * O（句子），这与初始方法仍然是线性复杂度，相对于2 * O（N ² ）。

顺便说一句，不要使用multithreading。 GIL会阻止每一个操作，因为你的任务是严格的CPU绑定，所以GIL没有机会被释放，但是每个线程都会同时发送tick，这会导致额外的努力，甚至导致操作无限。

将所有句子连接成一个文档。使用Aho-Corasickalgorithm（这里是一个）的任何实现来find你所有的“坏”字。遍历文件，replace每个不好的单词，更新后面find的单词的偏移等。

在Python 3中加速数以百万计的正则expression式replace

TLDR

理论

码

性能

正则expression式联合testing

TLDR

与Trie优化的正则expression式

例

码

testing

实用的方法

性能

Javareplace文本文件中的行

List中的元素用scalareplace

正则expression式匹配/replaceJavaScript注释（多行和内联）

在Visual Studio中多行查找和replace

记事本++ – 如何replace空行

Notepad ++如何插入一列数据？

在python大量stringreplace？

Ruby – 用另一个stringreplace第一次出现的子string

重命名pandas列

用Pythonreplace列表中的值