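Is there an efficient way to perform hundreds of text substitutions in Ruby?

I'm trying to use a list of hundreds of common misspellings to clean up some input before searching it for duplicates.

It's a time-critical process, so I'm hoping there is a faster approach than having hundreds of regular expressions (or one with hundreds of branches).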

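Alternatively, if your input data is already separated into words, simply build a hash table of {error => correction}.

Hash table lookup is fast, so if you can bend your input data into that format, it will almost certainly be fast enough.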
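A minimal sketch of the hash approach, assuming the input can be split on whitespace. The `CORRECTIONS` table and the `correct_words` helper are hypothetical names made up for illustration.

```ruby
# Hypothetical corrections table: misspelling => correction.
CORRECTIONS = {
  'abotu'      => 'about',
  'abandonned' => 'abandoned',
  'abilty'     => 'ability'
}.freeze

# Replace each whitespace-separated word that appears in the table;
# words that are not in the table pass through unchanged.
def correct_words(text, table = CORRECTIONS)
  text.split.map { |word| table.fetch(word.downcase, word) }.join(' ')
end

puts correct_words('They talked abotu the abandonned house')
# prints: They talked about the abandoned house
```

Note that punctuation attached to a word (for example "abotu,") would miss the lookup in this sketch, so real input probably needs a tokenizing step first.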

I'm happy to say I just found "RegexpTrie", which is a usable replacement for Perl's Regexp::Assemble and removes the need for it.

Install it, and give it a try:

    require 'regexp_trie'

    foo = %w(miss misses missouri mississippi)
    RegexpTrie.union(foo)
    # => /miss(?:(?:es|ouri|issippi))?/

    RegexpTrie.union(foo, option: Regexp::IGNORECASE)
    # => /miss(?:(?:es|ouri|issippi))?/i

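Here's a comparison of the outputs. The first, commented, output on each line of the array is from Regexp::Assemble, and the trailing output is from RegexpTrie: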

    require 'regexp_trie'

    [
      'how now brown cow',                            # /(?:[chn]ow|brown)/
      'the rain in spain stays mainly on the plain',  # /(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)/
      'jackdaws love my giant sphinx of quartz',      # /(?:jackdaws|quartz|sphinx|giant|love|my|of)/
      'fu foo bar foobar',                            # /(?:f(?:oo(?:bar)?|u)|bar)/
      'ms miss misses missouri mississippi'           # /m(?:iss(?:(?:issipp|our)i|es)?|s)/
    ].each do |s|
      puts "%-43s # /%s/" % [s, RegexpTrie.union(s.split).source]
    end

    # >> how now brown cow                           # /(?:how|now|brown|cow)/
    # >> the rain in spain stays mainly on the plain # /(?:the|rain|in|s(?:pain|tays)|mainly|on|plain)/
    # >> jackdaws love my giant sphinx of quartz     # /(?:jackdaws|love|my|giant|sphinx|of|quartz)/
    # >> fu foo bar foobar                           # /(?:f(?:oo(?:bar)?|u)|bar)/
    # >> ms miss misses missouri mississippi         # /m(?:iss(?:(?:es|ouri|issippi))?|s)/

Regarding how to use the Wikipedia link and the misspelled words:

    require 'nokogiri'
    require 'open-uri'
    require 'regexp_trie'

    URL = 'https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines'

    # URI.open is used because the bare Kernel#open no longer accepts URLs in Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(URL))

    # Each line looks like "misspelling->correction1, correction2"
    corrections = doc.at('div#mw-content-text pre').text.lines[1..-1].map { |s|
      a, b = s.chomp.split('->', 2)
      [a, b.split(/,\s+/)]
    }.to_h
    # {"abandonned"=>["abandoned"],
    #  "aberation"=>["aberration"],
    #  "abilityes"=>["abilities"],
    #  "abilties"=>["abilities"],
    #  "abilty"=>["ability"],
    #  "abondon"=>["abandon"],
    #  "abbout"=>["about"],
    #  "abotu"=>["about"],
    #  "abouta"=>["about a"],
    #  ...
    # }

    misspelled_words_regex = /\b(?:#{RegexpTrie.union(corrections.keys, option: Regexp::IGNORECASE).source})\b/i
    # => /\b(?:(?:a(?:b(?:andonned|eration|il(?:ityes|t(?:ies|y))|o(?:ndon(?:(?:ed|ing|s))?|tu|ut(?:it|the|a)...

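At this point you could use gsub(misspelled_words_regex, corrections). However, some of the values in corrections are arrays with more than one entry, because multiple words or phrases could have been used to replace the misspelled word. You'll have to do something to decide which of the choices to use.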
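One possible way to resolve the ambiguity is to pass a block to gsub and pick a candidate there. The sketch below simply takes the first candidate, which is an assumption rather than a recommendation. It uses the stdlib `Regexp.union` as a stand-in for the RegexpTrie pattern so that it runs without the gem, and the `accension` entry and the `fix_spelling` helper are hypothetical.

```ruby
# Sample in the shape built above: misspelling => array of candidates.
# The 'accension' entry is made up to show a multi-candidate case.
corrections = {
  'abotu'     => ['about'],
  'abouta'    => ['about a'],
  'accension' => ['accession', 'ascension']
}

# Stand-in for the RegexpTrie pattern; Regexp.union is part of core Ruby.
misspelled_words_regex = /\b(?:#{Regexp.union(corrections.keys).source})\b/i

# Replace every match with its first candidate. The matched text is
# downcased before the lookup because the pattern ignores case but the
# hash keys are lowercase.
def fix_spelling(text, regex, corrections)
  text.gsub(regex) do |word|
    candidates = corrections[word.downcase]
    candidates ? candidates.first : word
  end
end

puts fix_spelling('Talk Abotu the accension', misspelled_words_regex, corrections)
# prints: Talk about the accession
```

The block form also sidesteps a subtlety of the hash form of gsub: a match such as "Abotu" would not be found among lowercase keys and would be replaced with an empty string. This sketch does not try to preserve the original capitalization.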

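Ruby is missing a really useful module found in Perl, called Regexp::Assemble. Python has hachoir-regex, which appears to do the same sort of thing.

Regexp::Assemble creates a very efficient regular expression based on lists of words and simple expressions. It's really remarkable... or... diabolical?

Check out the example for the module; it's extremely simple to use in its basic form: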

    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new;
    $ra->add( 'ab+c' );
    $ra->add( 'ab+-' );
    $ra->add( 'a\w\d+' );
    $ra->add( 'a\d+' );
    print $ra->re; # prints a(?:\w?\d+|b+[-c])

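Notice how it combines the patterns. It would do the same with regular words, only it would be even more efficient because common substrings get merged: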

    use Regexp::Assemble;

    my $lorem = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.';

    my $ra = Regexp::Assemble->new('flags' => 'i');
    $lorem =~ s/[^a-zA-Z ]+//g;
    $ra->add(split(' ', lc($lorem)));
    print $ra->anchor_word(1)->as_string, "\n";

Which outputs:

    \b(?:a(?:dipisicing|liqua|met)|(?:consectetu|tempo)r|do(?:lor(?:emagna)?)?|e(?:(?:li)?t|iusmod)|i(?:ncididunt|psum)|l(?:abore|orem)|s(?:ed|it)|ut)\b

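This code ignores case and honors word boundaries.

I'd recommend writing a little Perl app that takes a list of words and uses that module to output the stringified version of the regex pattern. You should be able to import that pattern into Ruby, which would let you find misspelled words very quickly. You could even have it write the pattern to a YAML file, then load that file into your Ruby code. Periodically parse the misspelled-word pages, run the output through the Perl code, and your Ruby code will have an up-to-date pattern.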
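Loading such a pattern on the Ruby side might look like the sketch below. It assumes the Perl script wrote the bare pattern string (no delimiters or flags) on a single line. The temp file, the tiny pattern in it, and the `load_pattern` helper are all hypothetical stand-ins for illustration.

```ruby
require 'tempfile'

# Stand-in for the file the Perl script would emit. The pattern here is a
# tiny made-up example, not real Regexp::Assemble output.
pattern_file = Tempfile.new('misspellings')
pattern_file.write("\\b(?:ab(?:otu|ilty))\\b\n")
pattern_file.close

# Read the pattern text and compile it case-insensitively, since the
# flags are not part of the stringified pattern.
def load_pattern(path)
  Regexp.new(File.read(path).chomp, Regexp::IGNORECASE)
end

regex = load_pattern(pattern_file.path)
puts regex.match?('Talk ABOTU it')  # true
puts regex.match?('about')          # false
```

Simple constructs such as \b and non-capturing groups carry over from Perl unchanged, but it is probably worth checking the generated pattern against a few known words after each regeneration.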

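You can run the pattern against a whole block of text just to see whether it contains any misspelled words. If it does, break the text into sentences or words and check those against the regex again. Don't start by testing individual words, because most words will be spelled correctly. It's almost like a binary search over your text: test the whole thing, and if there's a hit, split it into smaller blocks to narrow the search until you find the individual misspellings. How you split the blocks depends on the amount of incoming text. A regex can test an entire block of text and return nil or an index, just as it does for a single word, so you gain a lot of speed by working on large chunks.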
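The coarse-to-fine check described above can be sketched roughly as follows. The `MISSPELLED` pattern is a small hypothetical stand-in for the assembled regex, and `misspelled_words` is a made-up helper name.

```ruby
# Hypothetical pattern standing in for the large assembled regex.
MISSPELLED = /\b(?:abotu|abilty|abandonned)\b/i

# Test the whole text in one pass first; only when that hits is the text
# split into words and each word checked individually.
def misspelled_words(text, regex = MISSPELLED)
  return [] unless regex.match?(text) # common case: nothing to fix
  text.scan(/\w+/).select { |word| regex.match?(word) }
end

p misspelled_words('Nothing wrong here')            # => []
p misspelled_words('Talk abotu his abilty to run')  # => ["abotu", "abilty"]
```

For very large inputs an intermediate level (paragraphs or sentences) could be added between the two steps; this sketch goes straight from the whole text to single words.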

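Then, once you know you have a misspelled word, you can do a hash lookup for the correct spelling. It would be a big hash, but sifting good spellings from bad is what takes the longest. The lookup itself is extremely fast.

Here's some example code: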

get_words.rb

    #!/usr/bin/env ruby

    require 'open-uri'
    require 'nokogiri'
    require 'yaml'

    words = {}

    ['0-9', *('A'..'Z').to_a].each do |l|
      begin
        print "Reading #{l}... "
        # URI.open is used because the bare Kernel#open no longer accepts URLs in Ruby 3.0+
        html = URI.open("http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/#{l}").read
        puts 'ok'
      rescue StandardError => e
        puts "got \"#{e}\""
        next
      end

      doc = Nokogiri::HTML(html)
      doc.search('div#bodyContent > ul > li').each do |n|
        # Capture "misspelling (correction"; skip list items that do not match
        next unless n.content =~ /^(\w+) \s+ \(([^)]+)/x
        words[$1] = $2
      end
    end

    File.open('wordlist.yaml', 'w') do |wordfile|
      wordfile.puts words.to_yaml
    end

regex_assemble.pl

    #!/usr/bin/env perl

    use Regexp::Assemble;
    use YAML;

    use warnings;
    use strict;

    my $ra = Regexp::Assemble->new('flags' => 'i');

    my %words = %{YAML::LoadFile('wordlist.yaml')};
    $ra->add(map{ lc($_) } keys(%words));

    print $ra->chomp(1)->anchor_word(1)->as_string, "\n";

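Run the first script, then run the second and redirect its output to a file to capture the emitted regex.

Here are more examples of word lists and the generated output: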

    'how now brown cow' => /\b(?:[chn]ow|brown)\b/
    'the rain in spain stays mainly on the plain' => /\b(?:(?:(?:(?:pl|r)a)?i|o)n|s(?:pain|tays)|mainly|the)\b/
    'jackdaws love my giant sphinx of quartz' => /\b(?:jackdaws|quartz|sphinx|giant|love|my|of)\b/
    'fu foo bar foobar' => /\b(?:f(?:oo(?:bar)?|u)|bar)\b/
    'ms miss misses missouri mississippi' => /\bm(?:iss(?:(?:issipp|our)i|es)?|s)\b/

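Ruby's Regexp.union is nowhere near as sophisticated as Regexp::Assemble. The captured list of misspelled words contains 4225 words, totaling 41,817 characters. Running Perl's Regexp::Assemble against that list generated a regex of 30,954 characters. I'd say that's efficient.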
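For comparison, this short sketch shows what the built-in Regexp.union produces for one of the word lists used above: a flat alternation with no shared prefixes factored out. The variable names are illustrative only.

```ruby
words = %w[miss misses missouri mississippi]

# Regexp.union escapes each string and joins them with "|".
flat = Regexp.union(words)
puts flat.source
# prints: miss|misses|missouri|mississippi

# Alternatives are tried left to right, so without anchoring the short
# prefix 'miss' wins even when a longer word is present.
puts 'mississippi'[flat]
# prints: miss

# Adding word boundaries forces the engine to try the later alternatives.
anchored = /\b(?:#{flat.source})\b/
puts 'mississippi'[anchored]
# prints: mississippi
```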

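Try it the other way around. Rather than correcting misspellings and then checking the result for duplicates, reduce everything to a sound-alike format (such as Metaphone or Soundex) and check for duplicates in that format.

Now, I don't know which approach is likely to be faster. On one hand, you have hundreds of regular expressions, nearly all of which will fail to match almost immediately and return. On the other, you have 30-odd potential regex replacements, one or two of which will definitely match for every word.

Metaphone is pretty fast (there really isn't much to the algorithm), so I can only suggest that you try it and measure whether it's fast enough for your use.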
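To make the idea concrete, here is a rough sketch that groups inputs by a phonetic key. It uses a hand-rolled Soundex rather than Metaphone, purely because Soundex fits in a few lines. The `soundex` and `phonetic_key` helpers and the sample inputs are assumptions for illustration, not a library API.

```ruby
# Standard Soundex consonant codes; vowels, y, h and w have no code.
SOUNDEX_CODES = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3',
  'l' => '4',
  'm' => '5', 'n' => '5',
  'r' => '6'
}.freeze

# Hand-rolled Soundex: first letter plus up to three digits, zero padded.
def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, '')
  return '' if letters.empty?

  result = letters[0].upcase
  prev = SOUNDEX_CODES[letters[0]]
  letters[1..-1].each_char do |ch|
    next if ch == 'h' || ch == 'w'  # h and w do not separate equal codes
    code = SOUNDEX_CODES[ch]        # nil for vowels and y
    result << code if code && code != prev
    prev = code
  end
  result[0, 4].ljust(4, '0')
end

# Reduce a whole input to a string of per-word phonetic codes.
def phonetic_key(text)
  text.scan(/[A-Za-z]+/).map { |w| soundex(w) }.join(' ')
end

inputs = ['talk about it', 'talk abotu it', 'something else']
duplicates = inputs.group_by { |s| phonetic_key(s) }.values.select { |g| g.size > 1 }
p duplicates
# => [["talk about it", "talk abotu it"]]
```

Phonetic keys are lossy, so distinct words can collide; whether that is acceptable depends on how strict the duplicate check needs to be.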
