Text processing - Python vs Perl performance

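Here are my Perl and Python scripts for some simple text processing over about 21 log files, each roughly 300 KB to 1 MB (maximum), repeated 5 times (125 files in total, because the *log* pattern is repeated 5 times).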

Python code (modified to use compiled re and re.I)

    #!/usr/bin/python
    import re
    import fileinput

    # Compile both patterns once, case-insensitively, to mirror Perl's /i.
    exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
    location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

    for line in fileinput.input():
        fn = fileinput.filename()
        currline = line.rstrip()

        # Remember the timestamp of the most recent "already exists" line.
        mprev = exists_re.search(currline)
        if mprev:
            xlogtime = mprev.group(1)

        mcurr = location_re.search(currline)
        if mcurr:
            # Python 2 print statement
            print fn, xlogtime, mcurr.group(1)

Perl code

    #!/usr/bin/perl
    while (<>) {
        chomp;
        if (m/^(.*?) INFO.*Such a record already exists/i) {
            $xlogtime = $1;
        }
        if (m/^AwbLocation (.*?) insert into/i) {
            print "$ARGV $xlogtime $1\n";
        }
    }

On my PC, both scripts generate exactly the same result file of 10,790 lines. Here are the timings with Cygwin's Perl and Python:

    User@UserHP /cygdrive/d/tmp/Clipboard
    # time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* > summarypy.log

    real    0m8.185s
    user    0m8.018s
    sys     0m0.092s

    User@UserHP /cygdrive/d/tmp/Clipboard
    # time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* > summarypl.log

    real    0m1.481s
    user    0m1.294s
    sys     0m0.124s

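Originally, this simple text processing took 10.2 seconds in Python and only 1.9 seconds in Perl.

(UPDATE) With the compiled re version, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.

Is there any way to improve the speed of Python, or is it obvious to you experts that Perl will simply be the fast one for simple text processing?

By the way, this is not the only test I did for simple text processing. However I write the source code, Perl always wins by a large margin. Not once did Python perform better for simple m/regex/ match-and-print work.

Thanks for your input.

Please do not suggest using C, C++, Assembly, other flavours of Python, etc.

I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using modules). Boy, I wish I could use Python for all my tasks because of its readability, but giving up this much speed? I don't think so.

So please suggest how the code can be improved to get results comparable with Perl.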

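UPDATE: 2012-10-18

As other users suggested, Perl has its place and Python has its place.

So, for this question, one can safely conclude that for a simple regex match on each line of hundreds or thousands of text files, writing the results to a file (or printing them to the screen), Perl will always win on performance. It is as simple as that.

Please note that when I say Perl wins on performance, only standard Perl and standard Python are being compared: no resorting to obscure modules (obscure to a normal user like me), and no calling C, C++ or assembly libraries from Python or Perl. We don't have time to learn all those extra steps and installations for a simple text matching job.

So, Perl rocks for text processing and regex.

Python has its own places to rock.

Update 2013-05-29: There is an excellent article that makes a similar comparison. Perl again wins for simple text matching; read the article for more details.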

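This is exactly the sort of thing Perl was designed to do, so it doesn't surprise me that it's faster.

One easy optimization in your Python code is to precompile those regexes, so they don't have to be looked up or recompiled on every call.

    exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
    location_re = re.compile(r'^AwbLocation (.*?) insert into')

Then, in your loop:

    mprev = exists_re.search(currline)

and

    mcurr = location_re.search(currline)

This by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling re functions in a loop without compiling first is bad practice.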
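To see how much precompiling matters on a given machine, a rough timing sketch like the one below may help. It assumes Python 3; the sample line and the function names are made up for illustration and are not taken from the real logs. Note that the re module keeps an internal cache of recently compiled patterns, so the module-level call mostly pays for a cache lookup on each call rather than a full recompile, which is one reason the gain from this change alone tends to be modest.

```python
import re
import timeit

PATTERN = r'^AwbLocation (.*?) insert into'
LOCATION_RE = re.compile(PATTERN, re.I)

# Hypothetical sample line, invented for this sketch.
LINE = 'AwbLocation ABC123 insert into some_table'


def with_module_function():
    # Goes through re's internal pattern cache on every call.
    return re.search(PATTERN, LINE, re.I)


def with_compiled_pattern():
    # Uses the already compiled pattern object directly.
    return LOCATION_RE.search(LINE)


if __name__ == '__main__':
    for func in (with_module_function, with_compiled_pattern):
        seconds = timeit.timeit(func, number=100_000)
        print(f'{func.__name__}: {seconds:.3f}s')
```

The absolute numbers will vary from system to system; only the relative difference between the two lines of output is of interest.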

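Hypothesis: Perl spends less time backtracking on lines that don't match, thanks to optimisations that Python lacks.

What do you get by replacing

    ^(.*?) INFO.*Such a record already exists

with

    ^((?:(?! INFO).)*?) INFO.*Such a record already exists

or

    ^(?>(.*?) INFO).*Such a record already exists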
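Before timing an alternative pattern, it is worth checking that it still captures the same text. The sketch below, assuming Python 3, compares the original pattern with the negative-lookahead variant on a few invented sample lines; the names and the sample data are hypothetical. The atomic-group variant `(?>...)` is left out because Python's re module only accepts that syntax from version 3.11 onwards.

```python
import re

ORIGINAL = re.compile(
    r'^(.*?) INFO.*Such a record already exists', re.I)
# Same pattern, but the captured prefix may not run past the first " INFO".
TEMPERED = re.compile(
    r'^((?:(?! INFO).)*?) INFO.*Such a record already exists', re.I)

# Hypothetical log lines, invented for this sketch.
SAMPLE_LINES = [
    '2012-10-17 10:00:01 INFO Dao - Such a record already exists',
    '2012-10-17 10:00:02 DEBUG nothing interesting here',
    'AwbLocation XYZ insert into some_table',
]


def captures(pattern, lines):
    """Return group(1) for each line, or None where the pattern fails."""
    results = []
    for line in lines:
        match = pattern.search(line)
        results.append(match.group(1) if match else None)
    return results


if __name__ == '__main__':
    print(captures(ORIGINAL, SAMPLE_LINES))
    print(captures(TEMPERED, SAMPLE_LINES))
```

If both calls print the same list, the variant can be swapped into the script and timed against the real logs. Whether it is actually faster is exactly what the hypothesis above asks you to measure.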

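In Python, function calls are relatively expensive. Yet your loop makes a call on every single line just to get the file name, even though the result only changes when the input moves on to the next file:

    fn = fileinput.filename()

Take this call out of the per-line path, for example by making it only in the branch that prints, and you should see some improvement in your Python timing. (Simply moving the line above the for loop will not work: fileinput.filename() returns None before the first line has been read, and the name has to follow the current file.) Probably not enough to beat Perl, though.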
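One way to avoid the per-line call is sketched below, assuming Python 3. The function name `scan` is hypothetical, and initialising `xlogtime` to an empty string is an assumption added here so that a location line seen before any "already exists" line does not raise an error. The file name is requested only for lines that actually produce output.

```python
import fileinput
import re

EXISTS_RE = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
LOCATION_RE = re.compile(r'^AwbLocation (.*?) insert into', re.I)


def scan(paths):
    """Yield (filename, log time, location) for each location line."""
    xlogtime = ''  # assumption: empty until the first "already exists" line
    with fileinput.FileInput(files=paths) as lines:
        for line in lines:
            mprev = EXISTS_RE.search(line)
            if mprev:
                xlogtime = mprev.group(1)
            mcurr = LOCATION_RE.search(line)
            if mcurr:
                # Only lines that are reported pay for the filename() call.
                yield lines.filename(), xlogtime, mcurr.group(1)


if __name__ == '__main__':
    import sys
    if len(sys.argv) > 1:
        for fname, logtime, location in scan(sys.argv[1:]):
            print(fname, logtime, location)
```

As in the original script, the remembered log time carries over from one file to the next. How much time this saves depends on how many lines match, so it is best measured on the real logs.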

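@pepr Thanks. Your code runs in 6.1 seconds (an improvement of about 2 seconds), compared to Perl's 1.8 seconds.

But, boy, this code is very complicated to come up with for a normal user (me) who just carries quick examples from books over into practical use.

The Python code does not gain much in readability compared to the Perl code; it just has a lot of loops. And the final blow is that it still doesn't come close to Perl's performance. Anyway, more suggestions are welcome.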

    User@UserHP /cygdrive/d/tmp/Clipboard
    # time /tmp/scripts/python/afs/process_file_pepr.py *log* *log* *log* *log* *log* > summarypy_pepr.log

    real    0m6.089s
    user    0m5.772s
    sys     0m0.155s

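In general, all artificial benchmarks are evil. However, everything else being equal (the algorithmic approach), you can make improvements on a relative basis. It should be noted that I don't use Perl, so I can't argue in its favour. That said, with Python you can try using Pyrex or Cython to improve performance. Or, if you are adventurous, you can try converting the Python code to C++ via ShedSkin (which works for most of the core language and for some, but not all, of the core modules).

Nevertheless, you can follow some of the tips posted here: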

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

I expect Perl to be faster. Just out of curiosity, can you try the following?

    #!/usr/bin/python
    import re
    import glob
    import sys
    import os

    exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
    location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

    # Expand each command-line mask and read the files directly,
    # without going through fileinput.
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                f = open(fname)
                for line in f:
                    mex = exists_re.search(line)
                    if mex:
                        xlogtime = mex.group(1)

                    mloc = location_re.search(line)
                    if mloc:
                        print fname, xlogtime, mloc.group(1)
                f.close()

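Updated in reaction to "it is too complex".

Of course it looks more complex than the Perl version. Perl was built around regular expressions, so you can hardly find an interpreted language that is faster at them. The Perl syntax...

    while (<>) { ... }

...also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out: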

    #!/usr/bin/python
    import re
    import glob
    import sys
    import os

    def input_files():
        '''The generator loops through the files defined by masks from cmd.'''
        for mask in sys.argv[1:]:
            for fname in glob.glob(mask):
                if os.path.isfile(fname):
                    yield fname

    exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
    location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

    for fname in input_files():
        with open(fname) as f:      # now the f.close() is done automatically
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)

                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)

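Here the def input_files() could be placed elsewhere (say, in another module), or it can be reused. It is even possible to mimic Perl's while (<>) {...} quite easily, although not with the same syntax: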

    #!/usr/bin/python
    import re
    import glob
    import sys
    import os

    def input_lines():
        '''The generator loops through the lines of the files defined by masks from cmd.'''
        for mask in sys.argv[1:]:
            for fname in glob.glob(mask):
                if os.path.isfile(fname):
                    with open(fname) as f:  # now the f.close() is done automatically
                        for line in f:
                            yield fname, line

    exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
    location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

    for fname, line in input_lines():
        mex = exists_re.search(line)
        if mex:
            xlogtime = mex.group(1)

        mloc = location_re.search(line)
        if mloc:
            print fname, xlogtime, mloc.group(1)

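Then the last for loop may look as simple (in principle) as Perl's while (<>) {...}. Readability enhancements of this kind are more difficult in Perl.

Anyway, it will not make the Python program faster. Perl will still be faster here. Perl is a file/text cruncher. But, in my opinion, Python is the better programming language for more general purposes.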
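For readers on Python 3, the same generator idea might be written roughly as follows. This is a sketch rather than the code that was timed above: the `report` function, its `masks` and `out` parameters, and the empty-string default for `xlogtime` are assumptions introduced here so the pieces can be reused and tested separately.

```python
import glob
import os
import re
import sys

EXISTS_RE = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
LOCATION_RE = re.compile(r'^AwbLocation (.*?) insert into', re.I)


def input_lines(masks):
    """Yield (filename, line) for every line of every file matching masks."""
    for mask in masks:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f:  # closed automatically
                    for line in f:
                        yield fname, line


def report(masks, out=sys.stdout):
    """Write 'filename logtime location' for each location line."""
    xlogtime = ''  # assumption: empty until the first "already exists" line
    for fname, line in input_lines(masks):
        mex = EXISTS_RE.search(line)
        if mex:
            xlogtime = mex.group(1)
        mloc = LOCATION_RE.search(line)
        if mloc:
            print(fname, xlogtime, mloc.group(1), file=out)


if __name__ == '__main__':
    report(sys.argv[1:])
```

The processing loop in `report` stays close in shape to the Perl loop, while the file handling lives in `input_lines`, where it could be moved to another module.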
