Fastest way to grow a numpy numeric array

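Requirements:

I need to grow an array arbitrarily large from data.
I can guess the size (roughly 100-200), with no guarantee that the array will fit every time.
Once it has grown to its final size, I need to perform numeric computations on it, so I would prefer to end up with a 2-D numpy array.
Speed is critical. As an example, for one of 300 files, the update() method is called 45 million times (taking about 150s) and the finalize() method is called 500k times (taking 106s in total), for a total of roughly 250s.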

Here is my code:

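    import numpy as np

    class A:
        def __init__(self):
            self.data = []          # plain Python list of rows

        def update(self, row):
            self.data.append(row)

        def finalize(self):
            # convert the accumulated rows to a 2-D array for the numeric work
            dx = np.array(self.data)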

Other things I tried include the following code, but it is waaaaay slower.

    class A:
        def __init__(self):
            self.data = np.array([])

        def update(self, row):
            # np.append returns a new array, so the result must be kept
            self.data = np.append(self.data, row)

        def finalize(self):
            dx = np.reshape(self.data, (self.data.shape[0] // 5, 5))

Here is a schematic of how this is called:

    for i in range(500000):
        ax = A()
        for j in range(200):
            ax.update([1, 2, 3, 4, 5])
        ax.finalize()
        # some processing on ax

I tried a few different things, with timings.

    import numpy as np

The method you mention as slow: (32.094 seconds)

    class A:
        def __init__(self):
            self.data = np.array([])

        def update(self, row):
            # np.append copies the whole array on every call
            self.data = np.append(self.data, row)

        def finalize(self):
            return np.reshape(self.data, (self.data.shape[0] // 5, 5))

Regular old Python list: (0.308 seconds)

    class B:
        def __init__(self):
            self.data = []

        def update(self, row):
            # store the values flat; finalize() restores the 2-D shape
            for r in row:
                self.data.append(r)

        def finalize(self):
            return np.reshape(self.data, (len(self.data) // 5, 5))

Trying to implement an arraylist in numpy: (0.362 seconds)

    class C:
        def __init__(self):
            self.data = np.zeros((100,))
            self.capacity = 100
            self.size = 0

        def update(self, row):
            for r in row:
                self.add(r)

        def add(self, x):
            if self.size == self.capacity:
                # buffer is full: quadruple the capacity and copy once
                self.capacity *= 4
                newdata = np.zeros((self.capacity,))
                newdata[:self.size] = self.data
                self.data = newdata

            self.data[self.size] = x
            self.size += 1

        def finalize(self):
            data = self.data[:self.size]
            return np.reshape(data, (len(data) // 5, 5))

And this is how I timed it:

    x = C()
    for i in range(100000):
        x.update([i])

So it looks like regular old Python lists are pretty good ;)

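np.append() copies all the data in the array on every call, whereas a list over-allocates, growing its capacity by a factor (about 1.125). Lists are fast, but they use more memory than an array. If memory matters, you can use the array module from the Python standard library.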
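As a rough illustration of the array-module suggestion, here is a minimal sketch. The class name `ArrayModuleBuffer` and the `ncols` parameter are hypothetical, it assumes every value fits a C double, and it has not been benchmarked against the classes above.

```python
from array import array

import numpy as np


class ArrayModuleBuffer:
    """Hypothetical accumulator backed by the standard-library array module."""

    def __init__(self, ncols=5):
        self.ncols = ncols
        self.data = array("d")  # typecode 'd' stores C doubles (float64)

    def update(self, row):
        # extend() appends every element of the iterable in place
        self.data.extend(row)

    def finalize(self):
        # frombuffer() wraps the existing memory without copying; copy()
        # detaches the result so the buffer can still be extended later.
        flat = np.frombuffer(self.data, dtype=np.float64)
        return flat.reshape(-1, self.ncols).copy()
```

The values are stored as packed doubles rather than as Python float objects, which is where the memory saving over a list would come from.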

Here is a discussion of this topic:

How to create a dynamic array

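Using the class declarations from Owen's post, here is a revised timing that also includes the effect of finalize().

In short, I find that class C provides an implementation that is over 60x faster than the method in the original post. (Apologies for the wall of text.)

The file I used: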

    #!/usr/bin/python
    import cProfile
    import numpy as np

    # ... class declarations here ...

    def test_class(f):
        x = f()
        for i in range(100000):
            x.update([i])
        for i in range(1000):
            x.finalize()

    # profile each class in turn
    for x in 'ABC':
        cProfile.run('test_class(%s)' % x)

Now, the resulting timings:

A:

    903005 function calls in 16.049 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000   16.049   16.049 <string>:1(<module>)
    100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
    100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
    100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
    100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
      1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
         1    0.146    0.146   16.049   16.049 test.py:50(test_class)
         1    0.000    0.000    0.000    0.000 test.py:6(__init__)
    100000    1.475    0.000   15.899    0.000 test.py:9(update)
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
    200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
    100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}

B:

    208004 function calls in 16.885 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.001    0.001   16.885   16.885 <string>:1(<module>)
      1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
      1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
      1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
         1    0.000    0.000    0.000    0.000 test.py:16(__init__)
    100000    0.068    0.000    0.080    0.000 test.py:19(update)
      1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
         1    0.284    0.284   16.883   16.883 test.py:50(test_class)
      1000    0.005    0.000    0.005    0.000 {getattr}
      1000    0.001    0.000    0.001    0.000 {len}
    100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
      1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}

C:

    204010 function calls in 0.244 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.244    0.244 <string>:1(<module>)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
         1    0.000    0.000    0.000    0.000 test.py:27(__init__)
    100000    0.082    0.000    0.170    0.000 test.py:32(update)
    100000    0.087    0.000    0.088    0.000 test.py:36(add)
      1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
         1    0.068    0.068    0.243    0.243 test.py:50(test_class)
      1000    0.000    0.000    0.000    0.000 {len}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
         6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

Class A is destroyed by the updates, and class B is destroyed by the finalizes. Class C is robust in the face of both.
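Since the question ultimately wants a 2-D array, the same preallocate-and-grow idea can be applied to whole rows. The following is only a sketch under that assumption: `GrowableRows`, its default capacity, and the doubling factor are hypothetical choices, not part of the timed code above.

```python
import numpy as np


class GrowableRows:
    """Hypothetical 2-D variant of class C: preallocate rows, double when full."""

    def __init__(self, ncols=5, capacity=128):
        self.data = np.empty((capacity, ncols))
        self.size = 0  # number of rows actually filled

    def update(self, row):
        if self.size == self.data.shape[0]:
            # Full: allocate twice as many rows and copy the filled part once.
            grown = np.empty((2 * self.data.shape[0], self.data.shape[1]))
            grown[: self.size] = self.data
            self.data = grown
        self.data[self.size] = row  # assigns one whole row per call
        self.size += 1

    def finalize(self):
        # A view of the filled rows; no reshape or copy is needed.
        return self.data[: self.size]
```

With an initial capacity above the expected 100-200 rows, most instances would never need to grow at all, and finalize() stays cheap because it only slices.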

There is a big performance difference depending on the function you use for finalization. Consider the following code:

    from time import time

    import numpy as np

    N = 100000
    nruns = 5

    # build a list of N one-dimensional arrays
    a = []
    for i in range(N):
        a.append(np.zeros(1000))

    print("start")

    b = []
    for i in range(nruns):
        s = time()
        c = np.vstack(a)
        b.append(time() - s)
    print("Timing version vstack ", np.mean(b))

    b = []
    for i in range(nruns):
        s = time()
        c1 = np.reshape(a, (N, 1000))
        b.append(time() - s)
    print("Timing version reshape ", np.mean(b))

    b = []
    for i in range(nruns):
        s = time()
        c2 = np.concatenate(a, axis=0).reshape(-1, 1000)
        b.append(time() - s)
    print("Timing version concatenate ", np.mean(b))

    # all three approaches must produce the same array
    print(c.shape, c2.shape)
    assert (c == c2).all()
    assert (c == c1).all()

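Using concatenate seems to be about twice as fast as the first version (vstack) and more than 10 times faster than the second version (reshape).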

    Timing version vstack  1.5774928093
    Timing version reshape  9.67419199944
    Timing version concatenate  0.669512557983
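Applied to the accumulator pattern from the question, the concatenate approach might look like the sketch below. `ListOfRows` is a hypothetical name, and converting each row with np.asarray on the way in is an assumption made so that finalize() only has to join arrays; whether that per-row conversion pays off at 45 million calls was not measured here.

```python
import numpy as np


class ListOfRows:
    """Hypothetical list-backed accumulator that finalizes with concatenate."""

    def __init__(self, ncols=5):
        self.ncols = ncols
        self.rows = []

    def update(self, row):
        # Convert once on the way in so finalize() only joins arrays.
        self.rows.append(np.asarray(row, dtype=float))

    def finalize(self):
        if not self.rows:
            # np.concatenate rejects an empty list, so handle it explicitly.
            return np.empty((0, self.ncols))
        return np.concatenate(self.rows, axis=0).reshape(-1, self.ncols)
```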

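If you want to improve performance with list operations, have a look at the blist library. It is an optimized implementation of the Python list and other structures.

I have not benchmarked it yet, but the results on their page look promising.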
