Is C++11 std::function slower than virtual calls?

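I am creating a mechanism that lets users form arbitrarily complex functions from basic building blocks using the decorator pattern. This works fine functionality-wise, but I don't like that it involves a lot of virtual calls, particularly when the nesting depth becomes large. That worries me because the complex function may be called often (>100,000 times).

To avoid this problem, I tried to turn the decorator scheme into a std::function once it was finished (see to_function() in the SSCCE). All internal function calls are wired up during construction of the std::function. I figured this would be faster to evaluate than the original decorator scheme, because no virtual lookups need to be performed in the std::function version.

Alas, benchmarks prove me wrong: the decorator scheme is in fact faster than the std::function I built from it. So now I am left wondering why. Maybe my test setup is faulty, since I only use two trivial basic functions, which means the vtable lookups may be cached?

The code I used is included below; unfortunately it is rather long.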

SSCCE

    // sscce.cpp
    #include <iostream>
    #include <vector>
    #include <memory>
    #include <functional>
    #include <random>

    /**
     * Base class for Pipeline scheme (implemented via decorators)
     */
    class Pipeline {
    protected:
        std::unique_ptr<Pipeline> wrappee;
        Pipeline(std::unique_ptr<Pipeline> wrap)
        :wrappee(std::move(wrap)){}
        Pipeline():wrappee(nullptr){}

    public:
        typedef std::function<double(double)> FnSig;
        double operator()(double input) const{
            if(wrappee.get()) input=wrappee->operator()(input);
            return process(input);
        }

        virtual double process(double input) const=0;
        virtual ~Pipeline(){}

        // Returns a std::function which contains the entire Pipeline stack.
        virtual FnSig to_function() const=0;
    };

    /**
     * CRTP for to_function().
     */
    template <class Derived>
    class Pipeline_CRTP : public Pipeline{
    protected:
        Pipeline_CRTP(const Pipeline_CRTP<Derived> &o):Pipeline(o){}
        Pipeline_CRTP(std::unique_ptr<Pipeline> wrappee)
        :Pipeline(std::move(wrappee)){}
        Pipeline_CRTP():Pipeline(){}

    public:
        typedef typename Pipeline::FnSig FnSig;

        FnSig to_function() const override{
            if(Pipeline::wrappee.get()!=nullptr){
                FnSig wrapfun = Pipeline::wrappee->to_function();
                FnSig processfun = std::bind(&Derived::process,
                    static_cast<const Derived*>(this),
                    std::placeholders::_1);
                FnSig fun = [=](double input){
                    return processfun(wrapfun(input));
                };
                return std::move(fun);
            }else{
                FnSig processfun = std::bind(&Derived::process,
                    static_cast<const Derived*>(this),
                    std::placeholders::_1);
                FnSig fun = [=](double input){
                    return processfun(input);
                };
                return std::move(fun);
            }
        }

        virtual ~Pipeline_CRTP(){}
    };

    /**
     * First concrete derived class: simple scaling.
     */
    class Scale: public Pipeline_CRTP<Scale>{
    private:
        double scale_;

    public:
        Scale(std::unique_ptr<Pipeline> wrap, double scale) // todo move
        :Pipeline_CRTP<Scale>(std::move(wrap)),scale_(scale){}
        Scale(double scale):Pipeline_CRTP<Scale>(),scale_(scale){}

        double process(double input) const override{
            return input*scale_;
        }
    };

    /**
     * Second concrete derived class: offset.
     */
    class Offset: public Pipeline_CRTP<Offset>{
    private:
        double offset_;

    public:
        Offset(std::unique_ptr<Pipeline> wrap, double offset) // todo move
        :Pipeline_CRTP<Offset>(std::move(wrap)),offset_(offset){}
        Offset(double offset):Pipeline_CRTP<Offset>(),offset_(offset){}

        double process(double input) const override{
            return input+offset_;
        }
    };

    int main(){
        // used to make a random function / arguments
        // to prevent gcc from being overly clever
        std::default_random_engine generator;
        auto randint = std::bind(std::uniform_int_distribution<int>(0,1),std::ref(generator));
        auto randdouble = std::bind(std::normal_distribution<double>(0.0,1.0),std::ref(generator));

        // make a complex Pipeline
        std::unique_ptr<Pipeline> pipe(new Scale(randdouble()));
        for(unsigned i=0;i<100;++i){
            if(randint())
                pipe=std::move(std::unique_ptr<Pipeline>(new Scale(std::move(pipe),randdouble())));
            else
                pipe=std::move(std::unique_ptr<Pipeline>(new Offset(std::move(pipe),randdouble())));
        }

        // make a std::function from pipe
        Pipeline::FnSig fun(pipe->to_function());

        double bla=0.0;
        for(unsigned i=0; i<100000; ++i){
    #ifdef USE_FUNCTION
            // takes 110 ms on average
            bla+=fun(bla);
    #else
            // takes 60 ms on average
            bla+=pipe->operator()(bla);
    #endif
        }
        std::cout << bla << std::endl;
    }

Benchmark

Using pipe:

    g++ -std=gnu++11 sscce.cpp -march=native -O3
    sudo nice -3 /usr/bin/time ./a.out
    -> 60 ms

Using fun:

    g++ -DUSE_FUNCTION -std=gnu++11 sscce.cpp -march=native -O3
    sudo nice -3 /usr/bin/time ./a.out
    -> 110 ms
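One way to rule out a skewed measurement is to time only the call loop, since /usr/bin/time also counts process startup and pipeline construction. The following is a minimal sketch of that idea; the helper `time_calls` and the struct `TimingResult` are hypothetical names introduced here, not part of the code above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical result type: the accumulated value keeps the loop from being
// optimized away; the duration covers only the calls themselves.
struct TimingResult {
    double checksum;
    long long microseconds;
};

// Calls f in the same feedback pattern as the loop in main()
// (acc += f(acc)) and reports the elapsed wall-clock time.
template <class Callable>
TimingResult time_calls(const Callable& f, unsigned iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    double acc = 0.0;
    for (unsigned i = 0; i < iterations; ++i) {
        acc += f(acc);
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    TimingResult result;
    result.checksum = acc;
    result.microseconds = static_cast<long long>(elapsed.count());
    return result;
}
```

With the SSCCE, `time_calls(fun, 100000)` and `time_calls(*pipe, 100000)` would then time both variants under the same conditions.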

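As Sebastian Redl's answer says, your "alternative" to virtual functions adds several layers of indirection through dynamically bound functions (either virtual calls or function pointers, depending on the std::function implementation), and then it still calls the virtual Pipeline::process(double) function anyway!

This modification makes it significantly faster by removing one layer of std::function indirection and by preventing the call to Derived::process from being virtual: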

    FnSig to_function() const override {
        FnSig fun;
        auto derived_this = static_cast<const Derived*>(this);
        if (Pipeline::wrappee) {
            FnSig wrapfun = Pipeline::wrappee->to_function();
            fun = [=](double input){
                return derived_this->Derived::process(wrapfun(input));
            };
        } else {
            fun = [=](double input){
                return derived_this->Derived::process(input);
            };
        }
        return fun;
    }

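There is still more work being done here than in the virtual function version, though.

You have std::functions binding lambdas that call std::functions that bind lambdas that call std::functions that ...

Look at your to_function. It creates a lambda that calls two std::functions, and returns that lambda bound into another std::function. The compiler won't resolve any of these calls statically.

So in the end, you have just as many indirect calls as the virtual function solution, and that is only if you get rid of the bound processfun and call process directly in the lambda. Otherwise you have twice as many.

If you want a speedup, you will have to create the entire pipeline in a way that can be resolved statically, and that means a lot more templates before you can finally erase the type into a single std::function.

std::function is notoriously slow. Type erasure and the resulting allocation play a part in this, and with gcc the invocations are also inlined and optimized badly. For this reason there is a plethora of C++ "delegates" with which people try to solve this problem. I ported one to Code Review: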
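To illustrate what "statically resolved" could look like, here is a minimal sketch. It assumes the pipeline shape is known at compile time, which is not the case for the randomly built pipeline in the question. The names `ScaleStage`, `OffsetStage`, `Composed`, `compose` and `make_example_pipeline` are hypothetical and do not appear in the original code.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stage types: plain functors, no virtual functions.
struct ScaleStage {
    double scale;
    double operator()(double x) const { return x * scale; }
};

struct OffsetStage {
    double offset;
    double operator()(double x) const { return x + offset; }
};

// Composes two callables into one whose full type is visible to the
// compiler, so the nested calls are ordinary (inlinable) calls.
template <class Inner, class Outer>
struct Composed {
    Inner inner;
    Outer outer;
    double operator()(double x) const { return outer(inner(x)); }
};

template <class Inner, class Outer>
Composed<Inner, Outer> compose(Inner inner, Outer outer) {
    return Composed<Inner, Outer>{std::move(inner), std::move(outer)};
}

// Builds (x * 2 + 1) * 3 statically, then erases the type exactly once.
inline std::function<double(double)> make_example_pipeline() {
    auto staged = compose(compose(ScaleStage{2.0}, OffsetStage{1.0}),
                          ScaleStage{3.0});
    return std::function<double(double)>(staged);
}
```

Calling the result costs one indirect call through the outer std::function; everything inside `Composed` is a direct call. The trade-off is that the chain of stages has to be fixed in the source rather than assembled at run time.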

https://codereview.stackexchange.com/questions/14730/impossibly-fast-delegate-in-c11

But you can find plenty of others with Google, or write your own.

EDIT:

These days, look here for a fast delegate.

The libstdc++ implementation of std::function works roughly like this:

    template<typename Signature>
    struct Function
    {
        Ptr functor;
        Ptr functor_manager;

        template<class Functor>
        Function(const Functor& f)
        {
            functor_manager = &FunctorManager<Functor>::manage;
            functor = new Functor(f);
        }

        Function(const Function& that)
        {
            functor_manager = that.functor_manager;
            functor = functor_manager(CLONE, that.functor);
        }

        R operator()(args) // Signature
        {
            return functor_manager(INVOKE, functor, args);
        }

        ~Function()
        {
            functor_manager(DESTROY, functor);
        }
    };

    template<class Functor>
    struct FunctorManager
    {
        static manage(int operation, Functor& f)
        {
            switch (operation)
            {
            case CLONE: call Functor copy constructor;
            case INVOKE: call Functor::operator();
            case DESTROY: call Functor destructor;
            }
        }
    };

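So although std::function doesn't know the exact type of the Functor object, it dispatches the important operations through the functor_manager function pointer, which points to a static function of a template instance that does know the Functor type.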
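The manager-function dispatch described above can be turned into a small compilable sketch. This is a simplified illustration, not the actual libstdc++ code: `TinyFunction` and `tiny_function_demo` are hypothetical names, the signature is fixed to double(double), and invocation goes through its own function pointer so the sketch stays type-safe.

```cpp
#include <cassert>

// Simplified illustration of type erasure via function pointers.
class TinyFunction {
    enum Op { Clone, Destroy };
    typedef double (*Invoker)(const void*, double);
    typedef void* (*Manager)(Op, const void*);

    void* functor_;   // heap copy of the erased functor
    Invoker invoke_;  // knows how to call the functor
    Manager manage_;  // knows how to clone and destroy it

    template <class F>
    static double invoke_impl(const void* f, double x) {
        return (*static_cast<const F*>(f))(x);
    }

    template <class F>
    static void* manage_impl(Op op, const void* f) {
        const F* typed = static_cast<const F*>(f);
        if (op == Clone) return new F(*typed);
        delete typed;  // op == Destroy
        return nullptr;
    }

public:
    template <class F>
    TinyFunction(const F& f)
        : functor_(new F(f)),
          invoke_(&invoke_impl<F>),
          manage_(&manage_impl<F>) {}

    TinyFunction(const TinyFunction& other)
        : functor_(other.manage_(Clone, other.functor_)),
          invoke_(other.invoke_),
          manage_(other.manage_) {}

    TinyFunction& operator=(const TinyFunction&) = delete;

    ~TinyFunction() { manage_(Destroy, functor_); }

    // Every call is an indirect call through a function pointer.
    double operator()(double x) const { return invoke_(functor_, x); }
};

inline double tiny_function_demo() {
    double offset = 1.5;
    TinyFunction add([offset](double x) { return x + offset; });
    TinyFunction copy(add);  // clones the stored lambda through manage_
    return copy(2.0);
}
```

The point of the sketch is that `TinyFunction` itself never names the lambda's type; only the instantiated `invoke_impl<F>` and `manage_impl<F>` do.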

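Each std::function instance will allocate its own copy of the functor object on the heap (unless the functor is no larger than a pointer, such as a function pointer, in which case it just holds the pointer as a subobject).

The important takeaway is that copying a std::function is expensive if the underlying functor object has an expensive copy constructor and/or takes up a lot of space (for example, to hold bound parameters).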
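A small experiment can make the copying cost visible. The sketch below uses a hypothetical functor, `CountingFunctor`, with a deliberately large payload and a copy constructor that counts its invocations; the names are illustrative only.

```cpp
#include <cassert>
#include <functional>

// Hypothetical functor that records how often it is copied.
struct CountingFunctor {
    static int copies;
    double payload[32];  // makes the functor too large to store inline

    CountingFunctor() : payload() {}

    CountingFunctor(const CountingFunctor& other) {
        for (int i = 0; i < 32; ++i) payload[i] = other.payload[i];
        ++copies;
    }

    double operator()(double x) const { return x + payload[0]; }
};

int CountingFunctor::copies = 0;

// Returns how many functor copies a single std::function copy triggers.
inline int copies_made_by_copying_function() {
    std::function<double(double)> original = CountingFunctor();
    const int before = CountingFunctor::copies;
    std::function<double(double)> duplicate = original;
    (void)duplicate;  // only the copy itself matters here
    return CountingFunctor::copies - before;
}
```

Copying the std::function has to copy the stored functor at least once, so nested std::functions that capture other std::functions by value multiply that cost.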
