为什么Haskell代码与-O运行速度较慢？

这块Haskell代码运行速度比-O慢得多，但是-O应该是非危险的。谁能告诉我发生了什么事？如果重要的话，这是一个试图解决这个问题，它使用二进制search和持久段树：

 import Control.Monad import Data.Array data Node = Leaf Int -- value | Branch Int Node Node -- sum, left child, right child type NodeArray = Array Int Node -- create an empty node with range [l, r) create :: Int -> Int -> Node create lr | l + 1 == r = Leaf 0 | otherwise = Branch 0 (create lm) (create mr) where m = (l + r) `div` 2 -- Get the sum in range [0, r). The range of the node is [nl, nr) sumof :: Node -> Int -> Int -> Int -> Int sumof (Leaf val) r nl nr | nr <= r = val | otherwise = 0 sumof (Branch sum lc rc) r nl nr | nr <= r = sum | r > nl = (sumof lc r nl m) + (sumof rc rm nr) | otherwise = 0 where m = (nl + nr) `div` 2 -- Increase the value at x by 1. The range of the node is [nl, nr) increase :: Node -> Int -> Int -> Int -> Node increase (Leaf val) x nl nr = Leaf (val + 1) increase (Branch sum lc rc) x nl nr | x < m = Branch (sum + 1) (increase lc x nl m) rc | otherwise = Branch (sum + 1) lc (increase rc xm nr) where m = (nl + nr) `div` 2 -- signature said it all tonodes :: Int -> [Int] -> [Node] tonodes n = reverse . tonodes' . reverse where tonodes' :: [Int] -> [Node] tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t tonodes' _ = [create 0 n] -- find the minimum m in [l, r] such that (predicate m) is True binarysearch :: (Int -> Bool) -> Int -> Int -> Int binarysearch predicate lr | l == r = r | predicate m = binarysearch predicate lm | otherwise = binarysearch predicate (m+1) r where m = (l + r) `div` 2 -- main, literally main :: IO () main = do [n, m] <- fmap (map read . words) getLine nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine replicateM_ m $ query n nodes where query :: Int -> NodeArray -> IO () query n nodes = do [p, k] <- fmap (map read . words) getLine print $ binarysearch (ok nodes npk) 0 n where ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool ok nodes npks = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

（这与代码审查完全一样，但这个问题解决了另一个问题。）

这是我在C ++中的input生成器：

 #include <cstdio> #include <cstdlib> using namespace std; int main (int argc, char * argv[]) { srand(1827); int n = 100000; if(argc > 1) sscanf(argv[1], "%d", &n); printf("%d %d\n", n, n); for(int i = 0; i < n; i++) printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' '); for(int i = 0; i < n; i++) { int p = rand() % n; int k = rand() % n + 1; printf("%d %d\n", p, k); } }

如果您没有可用的C ++编译器，则这是./gen.exe 1000的结果。

这是我电脑上的执行结果：

 $ ghc --version The Glorious Glasgow Haskell Compilation System, version 7.8.3 $ ghc -fforce-recomp 1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real 0m0.088s user 0m0.015s sys 0m0.015s $ ghc -fforce-recomp -O 1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ time ./gen.exe 1000 | ./1827.exe > /dev/null real 0m2.969s user 0m0.000s sys 0m0.045s

这是堆configuration文件总结：

 $ ghc -fforce-recomp -rtsopts ./1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null 70,207,096 bytes allocated in the heap 2,112,416 bytes copied during GC 613,368 bytes maximum residency (3 sample(s)) 28,816 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 132 colls, 0 par 0.00s 0.00s 0.0000s 0.0004s Gen 1 3 colls, 0 par 0.00s 0.00s 0.0006s 0.0010s INIT time 0.00s ( 0.00s elapsed) MUT time 0.03s ( 0.03s elapsed) GC time 0.00s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.03s ( 0.04s elapsed) %GC time 0.0% (14.7% elapsed) Alloc rate 2,250,213,011 bytes per MUT second Productivity 100.0% of total user, 83.1% of total elapsed $ ghc -fforce-recomp -O -rtsopts ./1827.hs [1 of 1] Compiling Main ( 1827.hs, 1827.o ) Linking 1827.exe ... $ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null 6,009,233,608 bytes allocated in the heap 622,682,200 bytes copied during GC 443,240 bytes maximum residency (505 sample(s)) 48,256 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 10945 colls, 0 par 0.72s 0.63s 0.0001s 0.0004s Gen 1 505 colls, 0 par 0.16s 0.13s 0.0003s 0.0005s INIT time 0.00s ( 0.00s elapsed) MUT time 2.00s ( 2.13s elapsed) GC time 0.87s ( 0.76s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.89s ( 2.90s elapsed) %GC time 30.3% (26.4% elapsed) Alloc rate 3,009,412,603 bytes per MUT second Productivity 69.7% of total user, 69.4% of total elapsed

我想现在是时候这个问题得到一个正确的答案。

`-O`代码发生了什么变化

让我放大你的主要function，并稍微改写它：

 main :: IO () main = do [n, m] <- fmap (map read . words) getLine line <- getLine let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line replicateM_ m $ query n nodes

显然，这里的意图是NodeArray被创build一次，然后被用在每个query的m调用中。

不幸的是，GHC把这个代码转换成实际上，

 main = do [n, m] <- fmap (map read . words) getLine line <- getLine replicateM_ m $ do let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line query n nodes

你可以马上看到这里的问题。

什么是国家黑客，为什么会破坏我的节目performance

原因是国家黑客，它（粗略地）说：“当某种types的IO a ，假设它只被调用一次。” 官方文件没有多less细节：

-fno-state-hack

closures“状态黑客”，任何带有State＃token的lambda都被认为是单项，因此在内部内联事物被认为是可以的。这可以提高IO和ST monad代码的性能，但是会降低共享的风险。

粗略地说，这个想法如下：如果你定义一个IOtypes和一个where子句的函数，例如

 foo x = do putStrLn y putStrLn y where y = ...x...

IO atypes的东西可以被看作是RealWord -> (a, RealWorld)types的RealWord -> (a, RealWorld) 。从这个angular度来看，上述（大致）

 foo x = let y = ...x... in \world1 -> let (world2, ()) = putStrLn y world1 let (world3, ()) = putStrLn y world2 in (world3, ())

对foo的调用将（通常）看起来像这个foo argument world 。但是foo的定义只有一个参数，而另一个只是在本地的lambdaexpression式中被使用！这将是一个非常缓慢的呼吁。如果代码看起来像这样会更快：

 foo x world1 = let y = ...x... in let (world2, ()) = putStrLn y world1 let (world3, ()) = putStrLn y world2 in (world3, ())

这被称为eta-expansion，并且以各种理由完成（例如，通过分析函数的定义，通过检查它是如何被调用的，以及在这种情况下是types定向启发式的）。

不幸的是，对于foo的调用实际上是let fooArgument = foo argument的forms，也就是说有一个参数，但没有world通过（尚），这是let fooArgument = foo argument 。在原始代码中，如果fooArgument被多次使用，那么y仍然只能被计算一次，并被共享。在修改过的代码中，每次都会重新计算y这正是发生在您nodes上的事情。

事情能被解决吗？

有可能。请参阅＃9388试图这样做。修正它的问题是，在很多情况下，即使编译器无法确定地知道转换发生的情况，它仍然会降低性能。而且有可能出现技术上不好的情况，即共享丢失，但仍然是有益的，因为加速呼叫的速度超过了重新计算的额外成本。所以目前还不清楚从哪里去。

为什么Haskell代码与-O运行速度较慢？

`-O`代码发生了什么变化

什么是国家黑客，为什么会破坏我的节目performance

事情能被解决吗？

为什么更喜欢在C + +无符号签名？

GMSPolyline非常大的记忆秒杀

SELECT COUNT（）vs mysql_num_rows（）;

遗传algorithm和PYTHON的多目标优化：库/工具的使用？

三元运算符是否比“if”条件更快

无网格K均值（或其他优化）

为什么绘制电话昂贵？

编译器是否允许优化堆内存分配？

什么时候优化过早？

图像压缩工具通过命令行

为什么Haskell代码与-O运行速度较慢？

-O代码发生了什么变化

什么是国家黑客，为什么会破坏我的节目performance

事情能被解决吗？

为什么更喜欢在C + +无符号签名？

GMSPolyline非常大的记忆秒杀

SELECT COUNT（）vs mysql_num_rows（）;

遗传algorithm和PYTHON的多目标优化：库/工具的使用？

三元运算符是否比“if”条件更快

无网格K均值（或其他优化）

为什么绘制电话昂贵？

编译器是否允许优化堆内存分配？

什么时候优化过早？

图像压缩工具通过命令行

`-O`代码发生了什么变化