Fastest way to find lines of a file in another larger file in Bash

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where field f1 of file1.txt matches field f2 of file2.txt (or anywhere on the line, if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

    foo1
    foo2
    ...
    bar1
    bar2
    ...

file2.txt (about 2 billion lines, not sorted):

    date1|foo1|number1
    date2|foo2|number2
    date3|foo3|number3
    ...
    date1|bar1|number1
    date2|bar2|number2
    date3|bar3|number3

Expected output:

    date1|foo1|number1
    date2|foo2|number2
    date1|bar1|number1
    date2|bar2|number2
    ...

Here is what I have tried, and it seems to take several hours to run:

 fgrep -F -f file1.txt file2.txt > file.matched 

I was wondering if there is a better and faster way of doing this with common Unix commands or with a small script.

A Perl solution. [See the update note below.]

Use a hash for the small file. As you read the big file line by line, extract the field with a regex (capture the first pattern between the | delimiters) or with split (to get the second word), and print the line if that field exists in the hash. The two probably differ a little in speed (time them). The defined check isn't needed with the regex, while for split use // (defined-or), which short-circuits.

    use warnings;
    use strict;

    # If 'prog smallfile bigfile' is the preferred use
    die "Usage: $0 smallfile bigfile" if @ARGV != 2;
    my ($smallfile, $bigfile) = @ARGV;

    open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
    my %word = map { chomp; $_ => 1 } <$fh>;

    open $fh, '<', $bigfile or die "Can't open $bigfile: $!";
    while (<$fh>) {
        exists $word{ (/\|([^|]+)/)[0] } && print;

        # Or
        #exists $word{ (split /\|/)[1] // '' } && print;
    }
    close $fh;

Avoiding the if branch and using a short-circuit is faster, but only by very little. On billions of lines these tweaks add up, but not to too much. It may (or may not) be marginally faster to read the small file line by line instead of in list context as above, but this should not be noticeable.

Update   Writing to STDOUT saves two operations and I repeatedly time it to be a little faster than writing to a file. Such usage is also consistent with most UNIX tools, so I changed the code to write to STDOUT. Next, the exists test is not needed and dropping it spares an operation. However, I consistently get slightly better runtimes with it, and it also conveys the purpose better, so I am leaving it in. Thanks to ikegami for the comments.

Note   The commented-out version is about 50% faster than the other, by my benchmark below. Both are given because they are different: one finds the first match, the other uses the second field. I keep it this way as a more generic choice, since the question is ambiguous on that point.


Some comparisons (benchmark)   [Updated for writing to STDOUT, see the "Update" above]

There is an extensive analysis in the answer by Håkon Hægland, timing one run of most of the solutions. Here is another take, benchmarking the two solutions above, the OP's own answer, and the posted fgrep one, which is expected to be fast and is used in the question and in many answers.

I build test data in the following way. A handful of lines of roughly the length shown are made with random words, for both files, so that they match on the second field. Then I pad this "seed" for the data samples with lines that won't match, so as to mimic the ratio between matches and sizes quoted by the OP: for 14K lines in the small file there are 1.3M lines in the big file, yielding 126K matches. These samples are then written repeatedly to build the full data files, shuffled each time using List::Util.

All runs compared below produce 106,120 matches with the above file sizes (diff-ed as a check), so the matching frequency is close enough. They are benchmarked by invoking the complete programs with my $res = timethese(60 ...). The result of cmpthese($res) is:

            Rate  regex   cfor  split  fgrep
    regex 1.05/s     --   -23%   -35%   -44%
    cfor  1.36/s    30%     --   -16%   -28%
    split 1.62/s    54%    19%     --   -14%
    fgrep 1.89/s    80%    39%    17%     --

The fact that the optimized C program fgrep comes out on top isn't surprising. The lag of "regex" behind "split" is likely due to the overhead of starting the engine many times over for small matches. This may vary across Perl versions, given the evolving regex engine optimizations. I include @codeforester's answer ("cfor") since it was claimed to be the fastest, and its 20% lag behind the very similar "split" is likely due to scattered small inefficiencies (see the comments below this answer).

None of this is drastically different, though there are certainly differences across hardware and software and in the details of the data. I ran this on different Perls and machines, and the notable difference is that in some cases fgrep was indeed an order of magnitude faster.

The OP's experience of very slow fgrep is surprising. Given their quoted run times, orders of magnitude slower than the above, I'd guess there's an old system to "blame."

Even though this is completely I/O based, there are concurrency benefits from spreading it over multiple cores and I'd expect a good speedup, up to a factor of a few.
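For what it's worth, a minimal sketch of one way to spread the work over cores with standard tools (GNU split assumed; match.pl is just a placeholder name for the script above, which prints matches to STDOUT):

    split -n l/4 file2.txt chunk.                  # line-aligned chunks, one per core (4 assumed)
    for c in chunk.??; do
        perl match.pl file1.txt "$c" > "$c.out" &  # run the hash-lookup script on each chunk
    done
    wait
    cat chunk.??.out > file.matched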

I have tried to do a comparison between some of the methods presented here.

First, I created a Perl script to generate the input files file1.txt and file2.txt. In order to compare some of the solutions, I made sure that the words from file1.txt could only appear in the second field of file2.txt. Also, to be able to use the join solution presented by @GeorgeVasiliou, I sorted file1.txt and file2.txt. Currently I generated the input files based on only 75 random words (taken from https://www.randomlists.com/random-words). Only 5 of these 75 words were used in file1.txt; the remaining 70 words were used to fill up the fields in file2.txt. It might be necessary to increase the number of words substantially to get realistic results (according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 (1 million) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.

gen_input_files.pl

    #! /usr/bin/env perl
    use feature qw(say);
    use strict;
    use warnings;
    use Data::Printer;
    use Getopt::Long;

    GetOptions ("num_lines=i" => \my $nlines )
      or die("Error in command line arguments\n");

    # Generated random words from site: https://www.randomlists.com/random-words
    my $word_filename        = 'words.txt'; # 75 random words
    my $num_match_words      = 5;
    my $num_file2_lines      = $nlines || 1_000_000;
    my $file2_words_per_line = 3;
    my $file2_match_field_no = 2;
    my $file1_filename       = 'file1.txt';
    my $file2_filename       = 'file2.txt';
    my $file1_regex_fn       = 'regexp1.txt';

    say "generating $num_file2_lines lines..";
    my ( $words1, $words2 ) = get_words( $word_filename, $num_match_words );

    write_file1( $file1_filename, $words2 );
    write_file2(
        $file2_filename, $words1, $words2, $num_file2_lines,
        $file2_words_per_line, $file2_match_field_no
    );
    write_BOC_regexp_file( $file1_regex_fn, $words2 );

    sub write_BOC_regexp_file {
        my ( $fn, $words ) = @_;

        open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
        print $fh '\\|' . (join "|", @$words) . '\\|';
        close $fh;
    }

    sub write_file2 {
        my ( $fn, $words1, $words2, $nlines, $words_per_line, $field_no ) = @_;

        my $nwords1 = scalar @$words1;
        my $nwords2 = scalar @$words2;
        my @lines;
        for (1..$nlines) {
            my @words_line;
            my $key;
            for (1..$words_per_line) {
                my $word;
                if ( $_ != $field_no ) {
                    my $index = int (rand $nwords1);
                    $word = @{ $words1 }[$index];
                }
                else {
                    my $index = int (rand($nwords1 + $nwords2) );
                    if ( $index < $nwords2 ) {
                        $word = @{ $words2 }[$index];
                    }
                    else {
                        $word = @{ $words1 }[$index - $nwords2];
                    }
                    $key = $word;
                }
                push @words_line, $word;
            }
            push @lines, [$key, (join "|", @words_line)];
        }
        @lines = map { $_->[1] } sort { $a->[0] cmp $b->[0] } @lines;
        open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
        print $fh (join "\n", @lines);
        close $fh;
    }

    sub write_file1 {
        my ( $fn, $words ) = @_;

        open( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
        print $fh (join "\n", sort @$words);
        close $fh;
    }

    sub get_words {
        my ( $fn, $N ) = @_;

        open( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
        my @words = map {chomp $_; $_} <$fh>;
        close $fh;

        my @words1 = @words[$N..$#words];
        my @words2 = @words[0..($N - 1)];
        return ( \@words1, \@words2 );
    }

Next, I created a subfolder solutions with all the test cases:

    $ tree solutions/
    solutions/
    ├── BOC1
    │   ├── out.txt
    │   └── run.sh
    ├── BOC2
    │   ├── out.txt
    │   └── run.sh
    ├── codeforester
    │   ├── out.txt
    │   ├── run.pl
    │   └── run.sh
    [...]

Here the file out.txt is the output from the greps for each solution. The script run.sh runs the solution for the given test case.

Notes on the different solutions

  • BOC1: first solution presented by @BOC

     grep -E -f regexp1.txt file2.txt 
  • BOC2: second solution suggested by @BOC:

     LC_ALL=C grep -E -f regexp1.txt file2.txt 
  • codeforester: the accepted Perl solution by @codeforester (see source)

  • codeforester_orig: the original solution presented by @codeforester:

     fgrep -f file1.txt file2.txt 
  • dawg: Python solution using a dictionary and line splitting, proposed by @dawg (see source)

  • gregory1: solution using GNU Parallel suggested by @gregory

     parallel -k --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt 

    See the note below regarding how $block_size was chosen.

  • hakon1: Perl solution provided by @HåkonHægland (see source). This solution requires compiling a C extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt change. Note: the time used to compile the C extension on the initial run is not included in the run times presented below.

  • ikegami: solution using an assembled regexp and grep -P, as given by @ikegami. Note: the assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:

     regexp=$(< "regexp_ikegami.txt") grep -P "$regexp" file2.txt 
  • inian1: first solution by @Inian, using match()

     awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt 
  • inian2: second solution by @Inian, using index()

     awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (index($0,i)) {print; break} }' file1.txt FS='|' file2.txt 
  • inian3: third solution by @Inian, checking only the $2 field:

     awk 'FNR==NR{ hash[$1]; next } $2 in hash' file1.txt FS='|' file2.txt 
  • inian4: fourth solution by @Inian (basically the same as codeforester_orig but with LC_ALL):

     LC_ALL=C fgrep -f file1.txt file2.txt 
  • inian5: fifth solution by @Inian (same as inian1 but with LC_ALL):

     LC_ALL=C awk 'FNR==NR{ hash[$1]; next } { for (i in hash) if (match($0,i)) {print; break} }' file1.txt FS='|' file2.txt 
  • inian6: same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for the suggestion.

  • jjoao: compiled flex-generated C code as proposed by @JJoao (see source). Note: the executable must be recompiled each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.

  • oliv: Python script provided by @oliv (see source)

  • Vasiliou: using join as suggested by @GeorgeVasiliou:

     join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 file1.txt file2.txt 
  • Vasiliou2: same as Vasiliou but with LC_ALL=C.

  • zdim: using the Perl script provided by @zdim (see source). Note: this uses the regexp search version (not the split-line solution).

  • zdim2: same as zdim, except that it uses the split function instead of a regexp search for the field in file2.txt.

Notes

  1. I experimented a bit with GNU parallel (see the gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (file2.txt) into 4 equal-sized chunks and run a single job on each of the 4 processors. More testing may be needed here. So for the first test case, where file2.txt is 20M, I set $block_size to 5M (see the gregory1 solution above), whereas for the more realistic case presented below, where file2.txt is 268M, a $block_size of 67M was used.

  2. The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching. This means that the words from file1.txt did not have to match exactly in field #2 of file2.txt; a match anywhere on the line was accepted. Since this behavior made it harder to compare them with the other methods, some modified methods were also introduced. The first two, called BOC1B and BOC2B, used a modified regexp1.txt file. The line in the original regexp1.txt was of the form \|foo1|foo2|...|fooN\|, which matches the words at any field boundary. The modified file regexp1b.txt anchors the match to field #2 exclusively, using the form ^[^|]*\|foo1|foo2|...|fooN\| instead.

    Then the rest of the modified methods, codeforester_origB, inian1B, inian4B, inian5B, and gregory1B, used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line of the form:

      ^[^|]*\|word1\|
      ^[^|]*\|word2\|
      ^[^|]*\|word3\|
      [...]

    Additionally, fgrep -f was replaced by grep -E -f for these methods (see the sketch right after these notes for one way such a file1b.txt could be generated).
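For illustration only, here is one possible way (GNU sed assumed) to derive the anchored per-word regex file from the original word list; the actual files used in the benchmark were produced by the generator scripts, and regexp1b.txt can be obtained analogously by prepending ^[^|]* to the single pattern in regexp1.txt:

    sed 's/.*/^[^|]*\\|&\\|/' file1.txt > file1b.txt   # one anchored regex per word
    head -3 file1b.txt                                 # e.g.  ^[^|]*\|word1\|  ...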

Running the tests

Here is the script used for running all the tests. It uses the Bash time command to record the time spent on each script. Note that the time command returns three different times, called real, user, and sys. First I used user + sys, but I realized this was incorrect when using the GNU parallel command, so the time reported below is now the real part returned by time. See this question for more information about the different times returned by time.
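As a minimal illustration of what the harness below does for each case, the real part of a single run can be captured from Bash's time keyword like this (run.sh and time.txt are placeholders):

    { time -p bash run.sh > out.txt; } 2> time.txt        # -p prints "real N.NN" on its own line
    real_seconds=$(awk '/^real/ { print $2 }' time.txt)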

The first test was run with a file1.txt containing 5 lines and a file2.txt containing 1000000 lines. Here are the first 52 lines of the run_all.pl script; the rest of the script is available here.

run_all.pl

    #! /usr/bin/env perl

    use feature qw(say);
    use strict;
    use warnings;
    use Cwd;
    use Getopt::Long;
    use Data::Printer;
    use FGB::Common;
    use List::Util qw(max shuffle);
    use Number::Bytes::Human qw(format_bytes);
    use Sys::Info;

    GetOptions (
        "verbose"       => \my $verbose,
        "check"         => \my $check,
        "single-case=s" => \my $case,
        "expected=i"    => \my $expected_no_lines,
    ) or die("Error in command line arguments\n");

    my $test_dir    = 'solutions';
    my $output_file = 'out.txt';
    my $wc_expected = $expected_no_lines; # expected number of output lines
    my $tests       = get_test_names( $test_dir, $case );

    my $file2_size  = get_file2_size();
    my $num_cpus    = Sys::Info->new()->device( CPU => () )->count;

    chdir $test_dir;
    my $cmd = 'run.sh';
    my @times;
    for my $case (@$tests) {
        my $savedir = getcwd();
        chdir $case;
        say "Running '$case'..";
        my $arg = get_cmd_args( $case, $file2_size, $num_cpus );
        my $output = `bash -c "{ time -p $cmd $arg; } 2>&1"`;
        my ($user, $sys, $real ) = get_run_times( $output );
        print_timings( $user, $sys, $real ) if $verbose;
        check_output_is_ok( $output_file, $wc_expected, $verbose, $check );
        print "\n" if $verbose;
        push @times, $real;
        #push @times, $user + $sys; # this is wrong when using Gnu parallel
        chdir $savedir;
    }
    say "Done.\n";
    print_summary( $tests, \@times );

Results

Here is the output from running the tests:

    $ run_all.pl --verbose
    Running 'inian3'..
    ..finished in 0.45 seconds ( user: 0.44, sys: 0.00 )
    ..no of output lines: 66711

    Running 'inian2'..
    ..finished in 0.73 seconds ( user: 0.73, sys: 0.00 )
    ..no of output lines: 66711

    Running 'Vasiliou'..
    ..finished in 0.09 seconds ( user: 0.08, sys: 0.00 )
    ..no of output lines: 66711

    Running 'codeforester_orig'..
    ..finished in 0.05 seconds ( user: 0.05, sys: 0.00 )
    ..no of output lines: 66711

    Running 'codeforester'..
    ..finished in 0.45 seconds ( user: 0.44, sys: 0.01 )
    ..no of output lines: 66711

    [...]

Summary

[The results obtained by @Vasiliou are shown in the middle column.]

                                   |Vasiliou
    My Benchmark                   |Results  | Details
    -------------------------------|---------|----------------------
    inian4             : 0.04s     |0.22s    | LC_ALL fgrep -f [loose]
    codeforester_orig  : 0.05s     |         | fgrep -f [loose]
    Vasiliou2          : 0.06s     |0.16s    | [LC_ALL join [requires sorted files]]
    BOC1               : 0.06s     |         | grep -E [loose]
    BOC2               : 0.07s     |15s      | LC_ALL grep -E [loose]
    BOC2B              : 0.07s     |         | LC_ALL grep -E [strict]
    inian4B            : 0.08s     |         | LC_ALL grep -E -f [strict]
    Vasiliou           : 0.08s     |0.23s    | [join [requires sorted files]]
    gregory1B          : 0.08s     |         | [parallel + grep -E -f [strict]]
    ikegami            : 0.1s      |         | grep -P
    gregory1           : 0.11s     |0.5s     | [parallel + fgrep -f [loose]]
    hakon1             : 0.14s     |         | [perl + c]
    BOC1B              : 0.14s     |         | grep -E [strict]
    jjoao              : 0.21s     |         | [compiled flex generated c code]
    inian6             : 0.26s     |0.7s     | [LC_ALL awk + split+dict]
    codeforester_origB : 0.28s     |         | grep -E -f [strict]
    dawg               : 0.35s     |         | [python + split+dict]
    inian3             : 0.44s     |1.1s     | [awk + split+dict]
    zdim2              : 0.4s      |         | [perl + split+dict]
    codeforester       : 0.45s     |         | [perl + split+dict]
    oliv               : 0.5s      |         | [python + compiled regex + re.search()]
    zdim               : 0.61s     |         | [perl + regexp+dict]
    inian2             : 0.73s     |1.7s     | [awk + index($0,i)]
    inian5             : 18.12s    |         | [LC_ALL awk + match($0,i) [loose]]
    inian1             : 19.46s    |         | [awk + match($0,i) [loose]]
    inian5B            : 42.27s    |         | [LC_ALL awk + match($0,i) [strict]]
    inian1B            : 85.67s    |         | [awk + match($0,i) [strict]]

    Vasiliou Results : 2 X CPU Intel 2 Duo T6570 @ 2.10GHz - 2Gb RAM - Debian Testing 64bit - kernel 4.9.0.1 - no cpu freq scaling.

A more realistic test case

I then created a more realistic case with a file1.txt containing 100 words and a file2.txt containing 10 million lines (268Mb file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt, then extracted 100 of these words into file1.txt and constructed file2.txt in the same way as described above for the first test case. Note that the dictionary file is UTF-8 encoded, and I stripped all non-ASCII characters from words.txt.
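For reference, the word lists could be prepared roughly along these lines (a sketch, not necessarily the exact commands used; iconv -c drops characters that cannot be converted to ASCII):

    shuf -n 1000 /usr/share/dict/american-english | iconv -c -f UTF-8 -t ASCII > words.txt
    shuf -n 100 words.txt | sort > file1.txt    # the join solutions need a sorted file1.txt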

I then reran the tests, leaving out the three slowest methods from the previous case, i.e. inian1, inian2, and inian5 were excluded. Here are the new results:

    gregory1           : 0.86s  | [parallel + fgrep -f [loose]]
    Vasiliou2          : 0.94s  | [LC_ALL join [requires sorted files]]
    inian4B            : 1.12s  | LC_ALL grep -E -f [strict]
    BOC2B              : 1.13s  | LC_ALL grep -E [strict]
    BOC2               : 1.15s  | LC_ALL grep -E [loose]
    BOC1               : 1.18s  | grep -E [loose]
    ikegami            : 1.33s  | grep -P
    Vasiliou           : 1.37s  | [join [requires sorted files]]
    hakon1             : 1.44s  | [perl + c]
    inian4             : 2.18s  | LC_ALL fgrep -f [loose]
    codeforester_orig  : 2.2s   | fgrep -f [loose]
    inian6             : 2.82s  | [LC_ALL awk + split+dict]
    jjoao              : 3.09s  | [compiled flex generated c code]
    dawg               : 3.54s  | [python + split+dict]
    zdim2              : 4.21s  | [perl + split+dict]
    codeforester       : 4.67s  | [perl + split+dict]
    inian3             : 5.52s  | [awk + split+dict]
    zdim               : 6.55s  | [perl + regexp+dict]
    gregory1B          : 45.36s | [parallel + grep -E -f [strict]]
    oliv               : 60.35s | [python + compiled regex + re.search()]
    BOC1B              : 74.71s | grep -E [strict]
    codeforester_origB : 75.52s | grep -E -f [strict]

Note

The grep-based solutions were looking for a match anywhere on the whole line, so in this case they contained some false matches: the methods codeforester_orig, BOC1, BOC2, gregory1, inian4, and oliv extracted 1,087,609 lines out of the 10,000,000 lines, whereas the other methods extracted the correct 997,993 lines from file2.txt.
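A quick, hedged way to inspect those extra matches is to compare a loose run against a strict one (out_loose.txt and out_strict.txt are placeholder names for the respective out.txt files):

    comm -23 <(sort out_loose.txt) <(sort out_strict.txt) | wc -l   # lines only in the loose output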

Notes

  • I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz).

  • The whole benchmark study is available here.

Have you tried Awk to speed things up a bit:

 awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt 

(or) using the Awk index() function, as suggested by Benjamin W. below:

 awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt 

(or) a more direct regex match, as suggested by Ed Morton in the comments:

 awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt 

is all you need. I'm guessing this will be faster, though I'm not entirely sure for a file with millions of entries. The problem here is the possibility of a match anywhere along the line. Had the match been restricted to any particular column (e.g. $2 alone), a faster approach could be:

 awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt 

Also, you could speed things up by playing with the locale set on your system. Paraphrasing this wonderful answer by Stéphane Chazelas on the subject, you can speed things up considerably by setting the locale LC_ALL=C for the command being run locally.

On any GNU-based system, the defaults for the locale are:

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

With the single variable LC_ALL, you can set all the LC_ variables at once to a specified locale:

    $ LC_ALL=C locale
    LANG=en_US.UTF-8
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=C

So what does this impact?

Simply put, when using locale C the tools default to the server's base Unix/Linux language, ASCII. Basically, when you grep something, by default your locale is internationalized and set to UTF-8, which can represent every character in the Unicode character set (currently more than 110,000 unique characters) in order to display any of the world's writing systems, whereas with ASCII each character is encoded in a single byte and its character set comprises no more than 128 unique characters.

So when using grep on a file encoded in UTF-8, each character potentially needs to be matched against any of a hundred thousand unique characters, but only against 128 in ASCII. So use your fgrep as:

 LC_ALL=C fgrep -F -f file1.txt file2.txt 

Additionally, the same can be adapted to Awk, since it uses a regex match with the match($0,i) call; setting the C locale can speed up the string match there as well:

 LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt 

Assumptions: 1. You want to run this search on your local workstation only. 2. You have multiple cores/CPUs to take advantage of a parallel search.

 parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt 

Some further tweaks depending on the context: A. Disable NLS with LANG=C (this is mentioned already in another answer). B. Set a maximum number of matches with the -m flag.

Note: I guess file2 is ~4GB and the 10M block size is OK, but you may need to optimize the block size to get the fastest run.
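A sketch of deriving the block size from the file size and core count instead of hardcoding 10M (GNU coreutils assumed; not benchmarked here):

    block_size=$(( $(stat -c%s file2.txt) / $(nproc) ))   # one block per core
    parallel --pipepart -a file2.txt --block "$block_size" fgrep -F -f file1.txt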

This Perl script (a) generates a regex pattern:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Regexp::Assemble qw( );

    chomp( my @ids = <> );
    my $ra = Regexp::Assemble->new();
    $ra->add(quotemeta($_)) for @ids;
    print("^[^|]*\\|(?:" . (re::regexp_pattern($ra->re()))[0] . ")\\|");

Here's how to use it:

    $ LC_ALL=C grep -P "$( a file1.txt )" file2.txt
    date1|foo1|number1
    date2|foo2|number2
    date1|bar1|number1
    date2|bar2|number2

Note that the script uses Regexp::Assemble, so you may need to install it.

    sudo su
    cpan Regexp::Assemble

Notes:

  • Unlike the solutions named BOC1, BOC2, codeforester_orig, gregory1, inian2, inian4 and oliv, my solution correctly handles the following case (a quick demonstration follows these notes):

      file1.txt
      foo1

      file2.txt
      date1|foo12|number5
  • Mine should be better than the similar solution by @BOC because the pattern is optimized to reduce backtracking. (Mine also works if there are more than three fields in file2.txt, whereas the linked solution can fail.)

  • I don't know how it compares to the split+dictionary solutions.
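A quick demonstration of the first point (the file names here are placeholders; a is the pattern-generating script above):

    printf 'foo1\n' > small.txt
    printf 'date1|foo12|number5\n' > big.txt
    fgrep -f small.txt big.txt                     # loose match: prints the line (a false positive)
    LC_ALL=C grep -P "$( a small.txt )" big.txt    # anchored pattern: prints nothing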

A small piece of Perl code solved the problem. This is the approach taken:

  • store the lines of file1.txt in a hash
  • read file2.txt line by line, parse and extract the second field
  • check if the extracted field is in the hash; if so, print the line

Here is the code:

    #!/usr/bin/perl -w

    use strict;
    if (scalar(@ARGV) != 2) {
      printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
      exit(2);
    }

    my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
    my ($small_fp, $big_fp, %small_hash, $field);

    open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
    open($big_fp, "<", $big_file)     || die "Can't open $big_file: "   . $!;

    # store contents of small file in a hash
    while (<$small_fp>) {
      chomp;
      $small_hash{$_} = undef;
    }
    close($small_fp);

    # loop through big file and find matches
    while (<$big_fp>) {
      # no need for chomp
      $field = (split(/\|/, $_))[1];
      if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
      }
    }

    close($big_fp);
    exit(0);

I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:

    real    0m11.694s
    user    0m11.507s
    sys     0m0.174s

I ran @Inian's awk code:

 awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt 

It was way slower than the Perl solution, since it loops 14K times for each line of file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:

    awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
     input record number 675280, file file2.txt
     source line number 1

    real    55m5.539s
    user    54m53.080s
    sys     0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:

    time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

    real    0m39.966s
    user    0m37.916s
    sys     0m0.743s

    time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

    real    0m41.057s
    user    0m38.475s
    sys     0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

    real    895m14.862s
    user    806m59.219s
    sys     1m12.147s

I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.


What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.

I didn't check join approach – I didn't want the additional overhead of sorting the files. Also, given fgrep 's poor performance, I don't believe join would have done better than the Perl code.

Thanks everyone for your attention and responses.

Here is a Python solution using sets — roughly equivalent to a Perl key only hash or awk array in concept.

 #!/usr/bin/python import sys with open(sys.argv[1]) as f: tgt={e.rstrip() for e in f} with open(sys.argv[2]) as f: for line in f: cells=line.split("|") if cells[1] in tgt: print line.rstrip() 

When I run this on files of similar size, it runs in about 8 seconds.

Same speed as:

 $ awk 'FNR==NR{arr[$1]; next} $2 in arr{print $0}' FS="|" /tmp/f1 /tmp/f2 

Both the Python and awk solutions here do a full string match only, not a partial regex-style match.

Since the awk solution is fast and POSIX compliant, that is the better answer.

Here is a Perl solution that uses Inline::C to speed up the search for matching fields in the large file:

    use strict;
    use warnings;
    use Inline C => './search.c';

    my $smallfile = 'file1.txt';
    my $bigfile   = 'file2.txt';

    open my $fh, '<', $smallfile or die "Can't open $smallfile: $!";
    my %word = map { chomp; $_ => 1 } <$fh>;
    search( $bigfile, \%word );

The search() subroutine is implemented in pure C using perlapi to look up keys in the small-file dictionary %word:

search.c :

 #include <stdio.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #define BLOCK_SIZE 8192 /* how much to read from file each time */ static char read_buf[BLOCK_SIZE + 1]; /* reads a block from file, returns -1 on error, 0 on EOF, else returns chars read, pointer to buf, and pointer to end of buf */ size_t read_block( int fd, char **ret_buf, char **end_buf ) { int ret; char *buf = read_buf; size_t len = BLOCK_SIZE; while (len != 0 && (ret = read(fd, buf, len)) != 0) { if (ret == -1) { if (errno == EINTR) continue; perror( "read" ); return ret; } len -= ret; buf += ret; } *end_buf = buf; *ret_buf = read_buf; return (size_t) (*end_buf - *ret_buf); } /* updates the line buffer with the char pointed to by cur, also updates cur */ int update_line_buffer( char **cur, char **line, size_t *llen, size_t max_line_len ) { if ( *llen > max_line_len ) { fprintf( stderr, "Too long line. Maximimum allowed line length is %ld\n", max_line_len ); return 0; } **line = **cur; (*line)++; (*llen)++; (*cur)++; return 1; } /* search for first pipe on a line (or next line if this is empty), assume line ptr points to beginning of line buffer. return 1 on success Return 0 if pipe could not be found for some reason, or if line buffer length was exceeded */ int search_field_start( int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len ) { char *line_start = *line; while (1) { if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return 0; } if ( **cur == '|' ) break; /* Currently we just ignore malformed lines ( lines that do not have a pipe, and empty lines in the input */ if ( **cur == '\n' ) { *line = line_start; *llen = 0; (*cur)++; } else { if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0; } } return 1; } /* assume cur points at starting pipe of field return -1 on read error, return 0 if field len was too large for buffer or line buffer length exceed, else return 1 and field, and length of field */ int copy_field( int fd, char **cur, char **end_buf, char *field, size_t *flen, char **line, size_t *llen, size_t max_field_len, size_t max_line_len ) { *flen = 0; while( 1 ) { if (! update_line_buffer( cur, line, llen, max_line_len ) ) return 0; if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return -1; } if ( **cur == '|' ) break; if ( *flen > max_field_len ) { printf( "Field width too large. Maximum allowed field width: %ld\n", max_field_len ); return 0; } *field++ = **cur; (*flen)++; } /* It is really not necessary to null-terminate the field since we return length of field and also field could contain internal null characters as well */ //*field = '\0'; return 1; } /* search to beginning of next line, return 0 on error, else return 1 */ int search_eol( int fd, char **cur, char **end_buf, char **line, size_t *llen, size_t max_line_len) { while (1) { if ( *cur >= *end_buf ) { size_t res = read_block( fd, cur, end_buf ); if (res <= 0) return 0; } if ( !update_line_buffer( cur, line, llen, max_line_len ) ) return 0; if ( *(*cur-1) == '\n' ) { break; } } //**line = '\0'; // not necessary return 1; } #define MAX_FIELD_LEN 80 /* max number of characters allowed in a field */ #define MAX_LINE_LEN 80 /* max number of characters allowed on a line */ /* Get next field ( ie field #2 on a line). Fields are separated by pipes '|' in the input file. Also get the line of the field. Return 0 on error, on success: Move internal pointer to beginning of next line return 1 and the field. 
*/ size_t get_field_and_line_fast( int fd, char *field, size_t *flen, char *line, size_t *llen ) { static char *cur = NULL; static char *end_buf = NULL; size_t res; if (cur == NULL) { res = read_block( fd, &cur, &end_buf ); if ( res <= 0 ) return 0; } *llen = 0; if ( !search_field_start( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN )) return 0; if ( (res = copy_field( fd, &cur, &end_buf, field, flen, &line, llen, MAX_FIELD_LEN, MAX_LINE_LEN ) ) <= 0) return 0; if ( !search_eol( fd, &cur, &end_buf, &line, llen, MAX_LINE_LEN ) ) return 0; return 1; } void search( char *filename, SV *href) { if( !SvROK( href ) || ( SvTYPE( SvRV( href ) ) != SVt_PVHV ) ) { croak( "Not a hash reference" ); } int fd = open (filename, O_RDONLY); if (fd == -1) { croak( "Could not open file '%s'", filename ); } char field[MAX_FIELD_LEN+1]; char line[MAX_LINE_LEN+1]; size_t flen, llen; HV *hash = (HV *)SvRV( href ); while ( get_field_and_line_fast( fd, field, &flen, line, &llen ) ) { if( hv_exists( hash, field, flen ) ) fwrite( line, sizeof(char), llen, stdout); } if (close(fd) == -1) croak( "Close failed" ); } 

Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution (see method zdim2 in my other answer ) presented here.

A possible way is to use Python:

    $ cat test.py
    import sys,re

    with open(sys.argv[1], "r") as f1:
        patterns = f1.read().splitlines() # read pattern from file1 without the trailing newline

    m = re.compile("|".join(patterns))    # create the regex

    with open(sys.argv[2], "r") as f2:
        for line in f2:
            if m.search(line) :
                print line,               # print line from file2 if this one matches the regex

and use it like this:

 python test.py file1.txt file2.txt 

Can you give a try to join ? Files must be sorted though…

    $ cat d.txt
    bar1
    bar2
    foo1
    foo2

    $ cat e.txt
    date1|bar1|number1
    date2|bar2|number2
    date3|bar3|number3
    date1|foo1|number1
    date2|foo2|number2
    date3|foo3|number3

    $ join --nocheck-order -11 -22 -t'|' -o 2.1 2.2 2.3 d.txt e.txt
    date1|bar1|number1
    date2|bar2|number2
    date1|foo1|number1
    date2|foo2|number2

Small update:
By using LC_ALL=C in front of join, things really do speed up, as can be seen in the benchmark of Håkon Hægland.

PS1: I have my doubts if join can be faster than grep -f …
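Regarding the sorting requirement, a hedged sketch of preparing unsorted inputs for join (file1 sorted on its only field, file2 on field 2, both under the C locale so the sort orders agree):

    LC_ALL=C sort -o file1.sorted file1.txt
    LC_ALL=C sort -t'|' -k2,2 -o file2.sorted file2.txt
    LC_ALL=C join -t'|' -1 1 -2 2 -o 2.1,2.2,2.3 file1.sorted file2.sorted > file.matched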

You could also use Perl for this job:

Please note that this will hog memory, and your machine/server had better have some.

Sample Data:

    %_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
    ==> file1.txt <==
    foo1
    foo2
    ...
    bar1
    bar2
    ...

    ==> file2.txt <==
    date1|foo1|number1
    date2|foo2|number2
    date3|foo3|number3
    ...
    date1|bar1|number1
    date2|bar2|number2
    date3|bar3|number3
    %_STATION@gaurav * /root/ga/study/pl>

Script Output: The script will produce the final output in a file named output_comp.

    %_STATION@gaurav * /root/ga/pl> ./comp.pl file1.txt file2.txt ; cat output_comp
    date1|bar1|number1
    date2|bar2|number2
    date2|foo2|number2
    date1|foo1|number1
    %_STATION@gaurav * /root/ga/pl>

Script:

    %_STATION@gaurav * /root/ga/pl> cat comp.pl
    #!/usr/bin/perl

    use strict ;
    use warnings ;
    use Data::Dumper ;

    my ($file1,$file2) = @ARGV ;
    my $output = "output_comp" ;
    my %hash ;    # This will store main comparison data.
    my %tmp ;     # This will store already selected results, to be skipped.
    (scalar @ARGV != 2 ? (print "Need 2 files!\n") : ()) ? exit 1 : () ;

    # Read all files at once and use their name as the key.
    for (@ARGV) {
      open FH, "<$_" or die "Cannot open $_\n" ;
      while (my $line = <FH>) {chomp $line ;$hash{$_}{$line} = "$line"}
      close FH ;
    }

    # Now we churn through the data and compare to generate
    # the sorted output in the output file.
    open FH, ">>$output" or die "Cannot open outfile!\n" ;
    foreach my $k1 (keys %{$hash{$file1}}){
      foreach my $k2 (keys %{$hash{$file2}}){
        if ($k1 =~ m/^.+?$k2.+?$/) {
          if (!defined $tmp{"$hash{$file2}{$k2}"}) {
            print FH "$hash{$file2}{$k2}\n" ;
            $tmp{"$hash{$file2}{$k2}"} = 1 ;
          }
        }
      }
    }
    close FH  ;
    %_STATION@gaurav * /root/ga/pl>

Thanks.

IMHO, grep is a good tool, highly optimised for a huge file2.txt, but maybe not for so many patterns to search for. I suggest combining all the strings of file1.txt into a single huge regexp like \|bar1|bar2|foo1|foo2\|:

    echo  '\|'$(paste -s -d '|' file1.txt)'\|' > regexp1.txt

    grep -E -f regexp1.txt file2.txt > file.matched

And of course LANG=C may help. Please give feedback or send your files so I can test myself.

I would use SQLite3 🙂 Maybe an in-memory database or whatever. Import the files and use an SQL query.
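A minimal sketch of how that could look with the sqlite3 CLI (table and database names are illustrative; not tested at the 2-billion-line scale):

    sqlite3 lookup.db > file.matched <<'SQL'
    CREATE TABLE small(f1 TEXT);
    CREATE TABLE big(f1 TEXT, f2 TEXT, f3 TEXT);
    .separator "|"
    .import file1.txt small
    .import file2.txt big
    CREATE INDEX big_f2 ON big(f2);
    SELECT b.f1, b.f2, b.f3 FROM big b JOIN small s ON s.f1 = b.f2;
    SQL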

Using flex :

1: build the flex processor:

 $ awk 'NR==1{ printf "%%%%\n\n.*\\|(%s",$0 } { printf "|%s",$0 } END { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%\n" }' file1.txt > a.fl 

2: compile it

 $ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl 

3: and run

 $ a.out < file2.txt > out 

Compiling (cc …) is a slow process; this approach will only pay off for cases of a stable file1.txt.

(On my machine) The time taken to run a "100 in 10_000_000" search test with this approach is 3 times faster than with LC_ALL=C fgrep...

Though this thread is over, all grep-alike methods between two files are gathered in this post, so why not add this awk alternative, similar to (or even an improvement on) the bounty-winning Inian's awk solution:

 awk 'NR==FNR{a[$0]=1;next}a[$2]' patterns.txt FS="|" datafile.txt >matches.txt # For matches restricted on Field2 of datafile 

This is equivalent to Inian's awk $2 in hash solution, but it could be even faster because we don't ask awk to check whether the whole hash array contains $2 of file2 – we just check whether a[$2] has a value or not.

While reading the first (patterns) file, apart from creating the hash array, we also assign a value.

If $2 of the datafile has been seen before in the patterns file, then a[$2] has a value and is therefore printed, because it is not null.

If a[$2] of the datafile returns no value (null), this is translated to false => no printing.

Extension to match any of the three fields of datafile:

 awk 'NR==FNR{a[$0]=1;next}(a[$1] || a[$2] || a[$3])' patterns.txt FS="|" datafile.txt >matches.txt  # Printed if any of the three fields of datafile match a pattern.

In both cases, applying LC_ALL=C in front of awk seems to speed things up.
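For instance, the field-2 variant above with the C locale would look like this (the same command, just prefixed; shown for completeness rather than as a separately benchmarked result):

    LC_ALL=C awk 'NR==FNR{a[$0]=1;next}a[$2]' patterns.txt FS="|" datafile.txt > matches.txt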

PS1: Of course this solution also has the pitfalls of all awk solutions. It is not pattern matching; it is direct/fixed matching between the two files, like most of the solutions in here.

PS2: In my poor-machine benchmark using the small benchmark files of Håkon Hægland, I get about 20% better performance compared to awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt

Setting the language etc. helps a little, perhaps.

Otherwise I cannot think of a magic solution to escape your basic issue: the data is not structured, so you will have a search that comes down to the number of lines in file1 multiplied by the number of lines in file2.

Putting the billion lines in a database and indexing it in a smart way is the only speedup I can think of. That index would have to be very smart, though…

A simple solution is: have enough memory to fit everything into. Otherwise there is not much more you can do about this…