正则expression式比python慢

嗨，我想了解为什么下面的代码使用正则expression式拆分string拆分

#include<regex> #include<vector> #include<string> std::vector<std::string> split(const std::string &s){ static const std::regex rsplit(" +"); auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1); auto rend = std::sregex_token_iterator(); auto res = std::vector<std::string>(rit, rend); return res; } int main(){ for(auto i=0; i< 10000; ++i) split("abc", " "); return 0; }

比以下的python代码慢

 import re for i in range(10000): re.split(' +', 'ab c')

这里的

 > python test.py 0.05s user 0.01s system 94% cpu 0.070 total ./test 0.26s user 0.00s system 99% cpu 0.296 total

我在osx上使用clang ++。

与-O3编译将其降低到0.09s user 0.00s system 99% cpu 0.109 total

注意

另请参阅此答案： https : //stackoverflow.com/a/21708215这是编辑2底部在这里的基础。

我已经将循环增加到1000000以获得更好的时间测量。

这是我的Python时机：

 real 0m2.038s user 0m2.009s sys 0m0.024s

这里有一个相当于你的代码，只是更漂亮一些：

 #include <regex> #include <vector> #include <string> std::vector<std::string> split(const std::string &s, const std::regex &r) { return { std::sregex_token_iterator(s.begin(), s.end(), r, -1), std::sregex_token_iterator() }; } int main() { const std::regex r(" +"); for(auto i=0; i < 1000000; ++i) split("abc", r); return 0; }

定时：

 real 0m5.786s user 0m5.779s sys 0m0.005s

这是一个优化，以避免向量和string对象的构build/分配：

 #include <regex> #include <vector> #include <string> void split(const std::string &s, const std::regex &r, std::vector<std::string> &v) { auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1); auto rend = std::sregex_token_iterator(); v.clear(); while(rit != rend) { v.push_back(*rit); ++rit; } } int main() { const std::regex r(" +"); std::vector<std::string> v; for(auto i=0; i < 1000000; ++i) split("abc", r, v); return 0; }

定时：

 real 0m3.034s user 0m3.029s sys 0m0.004s

这接近100％的性能提升。

向量是在循环之前创build的，并且可以在第一次迭代中增加其内存。之后， clear()没有内存释放，vector保持内存并在原地构造string。

另一个性能增加将是完全避免构造/销毁std::string ，因此，它的对象的分配/释放。

这是一个尝试性的方向：

 #include <regex> #include <vector> #include <string> void split(const char *s, const std::regex &r, std::vector<std::string> &v) { auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); while(rit != rend) { v.push_back(*rit); ++rit; } }

定时：

 real 0m2.509s user 0m2.503s sys 0m0.004s

最终的改进是将const char *的std::vector作为返回值，其中每个char指针都指向原始s cstring本身内的子string 。问题是，你不能这样做，因为它们中的每一个都不会被空终止（为此，在后面的示例中查看C ++ 1y string_ref用法）。

这个最后的改进也可以通过这个来实现：

 #include <regex> #include <vector> #include <string> void split(const std::string &s, const std::regex &r, std::vector<std::string> &v) { auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); while(rit != rend) { v.push_back(*rit); ++rit; } } int main() { const std::regex r(" +"); std::vector<std::string> v; for(auto i=0; i < 1000000; ++i) split("abc", r, v); // the constant string("abc") should be optimized // by the compiler. I got the same performance as // if it was an object outside the loop return 0; }

我已经用-O3创build了3.3（从中继）的样本。也许其他的正则expression式库能够更好地执行，但是在任何情况下，分配/释放往往是一个性能问题。

Boost.Regex

这是Cstring参数示例的boost::regex时序：

 real 0m1.284s user 0m1.278s sys 0m0.005s

同样的代码， boost::regex和std::regex接口在这个例子中是相同的，只是需要改变命名空间和include。

最好的祝愿它随着时间的推移而变得更好，C ++ stdlib正则expression式实现还处于起步阶段。

编辑

为了完成，我已经尝试了这个（上面提到的“终极改进”的build议），并没有改善等效std::vector<std::string> &v版本在任何情况下的性能：

 #include <regex> #include <vector> #include <string> template<typename Iterator> class intrusive_substring { private: Iterator begin_, end_; public: intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {} Iterator begin() {return begin_;} Iterator end() {return end_;} }; using intrusive_char_substring = intrusive_substring<const char *>; void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v) { auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); // This can potentially be optimized away by the compiler because // the intrusive_char_substring destructor does nothing, so // resetting the internal size is the only thing to be done. // Formerly allocated memory is maintained. while(rit != rend) { v.emplace_back(rit->first, rit->second); ++rit; } } int main() { const std::regex r(" +"); std::vector<intrusive_char_substring> v; for(auto i=0; i < 1000000; ++i) split("abc", r, v); return 0; }

这与array_ref和string_ref提议有关。以下是使用它的示例代码：

 #include <regex> #include <vector> #include <string> #include <string_ref> void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v) { auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1); auto rend = std::cregex_token_iterator(); v.clear(); while(rit != rend) { v.emplace_back(rit->first, rit->length()); ++rit; } } int main() { const std::regex r(" +"); std::vector<std::string_ref> v; for(auto i=0; i < 1000000; ++i) split("abc", r, v); return 0; }

在向量返回的情况下，返回一个string_ref向量而不是string副本也会更便宜。

编辑2

这个新的解决scheme能够通过回报获得输出。我使用了Marshall Clow的string_view （ string_ref得到了重命名）libc ++实现在https://github.com/mclow/string_viewfind。;

 #include <string> #include <string_view> #include <boost/regex.hpp> #include <boost/range/iterator_range.hpp> #include <boost/iterator/transform_iterator.hpp> using namespace std; using namespace std::experimental; using namespace boost; string_view stringfier(const cregex_token_iterator::value_type &match) { return {match.first, static_cast<size_t>(match.length())}; } using string_view_iterator = transform_iterator<decltype(&stringfier), cregex_token_iterator>; iterator_range<string_view_iterator> split(string_view s, const regex &r) { return { string_view_iterator( cregex_token_iterator(s.begin(), s.end(), r, -1), stringfier ), string_view_iterator() }; } int main() { const regex r(" +"); for (size_t i = 0; i < 1000000; ++i) { split("abc", r); } }

定时：

 real 0m0.385s user 0m0.385s sys 0m0.000s

请注意，这与以前的结果相比要快得多。当然，它并不是在循环中填充一个vector （也不可能提前匹配任何东西），但是你可以得到一个范围，你可以使用基于范围的范围来进行范围，甚至可以用它来填充vector 。

由于iterator_range通过原始string （或以null结尾的string ）创buildstring_view ，所以它变得非常轻量级，从不产生不必要的string分配。

只是为了比较使用这个split实现，但实际上填充vector我们可以这样做：

 int main() { const regex r(" +"); vector<string_view> v; v.reserve(10); for (size_t i = 0; i < 1000000; ++i) { copy(split("abc", r), back_inserter(v)); v.clear(); } }

这使用升压范围复制algorithm来填充每个迭代中的vector，时间是：

 real 0m1.002s user 0m0.997s sys 0m0.004s

可以看出，与优化的string_view输出参数版本相比，没有太大的区别。

还要注意一个std::split的提议，可以这样工作。

为了优化，一般来说，你想避免两件事情：

烧掉CPU周期的不必要的东西
等待一些事情发生（内存读取，磁盘读取，networking读取，…）

这两者可能是对立的，因为有时候它比在内存中caching所有内容更快的计算速度，所以这是一个平衡的游戏。

我们来分析一下你的代码：

 std::vector<std::string> split(const std::string &s){ static const std::regex rsplit(" +"); // only computed once // search for first occurrence of rsplit auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1); auto rend = std::sregex_token_iterator(); // simultaneously: // - parses "s" from the second to the past the last occurrence // - allocates one `std::string` for each match... at least! (there may be a copy) // - allocates space in the `std::vector`, possibly multiple times auto res = std::vector<std::string>(rit, rend); return res; }

我们可以做得更好吗？那么，如果我们可以重新使用现有的存储，而不是分配和释放内存，我们应该看到一个重大的改进[1]：

 // Overwrites 'result' with the matches, returns the number of matches // (note: 'result' is never shrunk, but may be grown as necessary) size_t split(std::string const& s, std::vector<std::string>& result){ static const std::regex rsplit(" +"); // only computed once auto rit = std::cregex_token_iterator(s.begin(), s.end(), rsplit, -1); auto rend = std::cregex_token_iterator(); size_t pos = 0; // As long as possible, reuse the existing strings (in place) for (size_t max = result.size(); rit != rend && pos != max; ++rit, ++pos) { result[pos].assign(rit->first, rit->second); } // When more matches than existing strings, extend capacity for (; rit != rend; ++rit, ++pos) { result.emplace_back(rit->first, rit->second); } return pos; } // split

在你执行的testing中，在迭代次数不变的情况下，这个版本不太可能被打败：它只会在第一次运行时分配内存（ rsplit和result ），然后继续重用已有的内存。

[1]：免责声明，我只是certificate这个代码是正确的，我没有testing它（唐纳德·克努特会说）。

这个版本怎么样？这是没有正则expression式，但它很快解决了分裂…

 #include <vector> #include <string> #include <algorithm> size_t split2(const std::string& s, std::vector<std::string>& result) { size_t count = 0; result.clear(); std::string::const_iterator p1 = s.cbegin(); std::string::const_iterator p2 = p1; bool run = true; do { p2 = std::find(p1, s.cend(), ' '); result.push_back(std::string(p1, p2)); ++count; if (p2 != s.cend()) { p1 = std::find_if(p2, s.cend(), [](char c) -> bool { return c != ' '; }); } else run = false; } while (run); return count; } int main() { std::vector<std::string> v; std::string s = "abc"; for (auto i = 0; i < 100000; ++i) split2(s, v); return 0; }

$ time splittest.exe

实际0m0.132s用户0m0.000s系统0m0.109s

我会说C ++ 11正则expression式比perl慢，可能比python慢。

要正确测量性能，最好使用一些不重要的expression式来进行testing，否则你只能测量正则expression式本身。

例如，比较C ++ 11和perl

C ++ 11testing代码

  #include <iostream> #include <string> #include <regex> #include <chrono> int main () { int veces = 10000; int count = 0; std::regex expres ("([^-]*)-([^-]*)-(\\d\\d\\d:\\d\\d)---(.*)"); std::string text ("some-random-text-453:10--- etc etc blah"); std::smatch macha; auto start = std::chrono::system_clock::now(); for (int ii = 0; ii < veces; ii ++) count += std::regex_search (text, macha, expres); auto milli = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count(); std::cout << count << "/" << veces << " matches " << milli << " ms --> " << (milli / (float) veces) << " ms per regex_search" << std::endl; return 0; }

在我的电脑上用gcc编译4.9.3我得到了输出

  10000/10000 matches 1704 ms --> 0.1704 ms per regex_search

perltesting代码

  use Time::HiRes qw/ time sleep /; my $veces = 1000000; my $conta = 0; my $phrase = "some-random-text-453:10--- etc etc blah"; my $start = time; for (my $ii = 0; $ii < $veces; $ii++) { if ($phrase =~ m/([^-]*)-([^-]*)-(\d\d\d:\d\d)---(.*)/) { $conta = $conta + 1; } } my $elapsed = (time - $start) * 1000.0; print $conta . "/" . $veces . " matches " . $elapsed . " ms --> " . ($elapsed / $veces) . " ms per regex\n";

在我的电脑中再次使用perl v5.8.8

  1000000/1000000 matches 2929.55303192139 ms --> 0.00292955303192139 ms per regex

所以在这个testing中，C ++ 11 / perl的比例是

  0.1704 / 0.002929 = 58.17 times slower than perl

在真实的情况下，我得到的比率大约慢100到200倍。因此，例如，parsing一个有百万行的大文件需要大约一秒的时间，而使用正则expression式的C ++ 11程序可能需要更多的时间（！）。

正则expression式比python慢

注意

Boost.Regex

编辑

编辑2

尾部返回types语法样式应该成为新的C ++ 11程序的默认值？

C ++ 11引入了标准化的内存模型。这是什么意思？那么如何影响C ++编程呢？

debugging模板实例

如何使用Boost来实现C ++ 14风格的自动返回types？

为什么没有std :: stou？

c大括号内的{* this}

Eric Niebler如何实现std :: is_function？

C ++ 11相当于boost shared_mutex

如何使我的自定义types使用“基于范围的循环”？

alignment在C ++ 11中的性能真的很重要吗？

正则expression式比python慢

注意

Boost.Regex

编辑

编辑2

尾部返回types语法样式应该成为新的C ++ 11程序的默认值？

C ++ 11引入了标准化的内存模型。 这是什么意思？ 那么如何影响C ++编程呢？

debugging模板实例

如何使用Boost来实现C ++ 14风格的自动返回types？

为什么没有std :: stou？

c大括号内的{* this}

Eric Niebler如何实现std :: is_function？

C ++ 11相当于boost shared_mutex

如何使我的自定义types使用“基于范围的循环”？

alignment在C ++ 11中的性能真的很重要吗？

C ++ 11引入了标准化的内存模型。这是什么意思？那么如何影响C ++编程呢？