Find the most frequent number in a NumPy vector

Suppose I have the following list in Python:

a = [1,2,3,1,2,1,1,1,3,2,2,1]

How do I find the most frequent number in this list?

If your list contains only non-negative integers, you should take a look at numpy.bincount:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

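and then probably use np.argmax:

    import numpy as np

    a = np.array([1,2,3,1,2,1,1,1,3,2,2,1])
    counts = np.bincount(a)       # counts[i] is the number of times i occurs
    print(np.argmax(counts))      # index of the largest count, i.e. the mode

For a more complicated list (one that may contain negative numbers or non-integer values), you can use np.histogram in a similar way. Alternatively, if you just want to work in plain Python without NumPy, collections.Counter is a good way of handling this sort of data: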

    from collections import Counter

    a = [1,2,3,1,2,1,1,1,3,2,2,1]
    b = Counter(a)
    print(b.most_common(1))   # list holding one (value, count) pair
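The np.histogram route mentioned above is not spelled out in the answer, so here is a minimal sketch of one way it could look. It assumes the data are integers (possibly negative) and builds one bin per integer; the helper name `most_frequent_int` is hypothetical. For genuinely non-integer data the result depends on how you choose the bins, so treat this as an illustration rather than a general recipe.

```python
import numpy as np

def most_frequent_int(values):
    """Mode of an integer sequence that may contain negative numbers."""
    arr = np.asarray(values)
    lo, hi = arr.min(), arr.max()
    # One bin per integer: edges at lo-0.5, lo+0.5, ..., hi+0.5
    edges = np.arange(lo, hi + 2) - 0.5
    counts, _ = np.histogram(arr, bins=edges)
    # Bin k corresponds to the integer lo + k
    return int(lo + np.argmax(counts))

print(most_frequent_int([-3, -1, -3, 2, 2, -3]))
```

Like np.bincount, this allocates one slot per integer between the minimum and the maximum, so it is only sensible when that range is small.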

You may use

    import numpy as np

    (values, counts) = np.unique(a, return_counts=True)
    ind = np.argmax(counts)
    print(values[ind])  # prints the most frequent element

If some element is as frequent as another one, this code returns only the first of them.
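If you need every tied value rather than just the first one, a small variation on the np.unique approach can return them all. This is a sketch, not part of the original answer; the function name `all_modes` is made up for illustration.

```python
import numpy as np

def all_modes(a):
    """Return every value that attains the maximum count, in sorted order."""
    values, counts = np.unique(a, return_counts=True)
    # Boolean mask selects all values whose count equals the maximum
    return values[counts == counts.max()]

print(all_modes([1, 2, 2, 3, 3]))
```

When the result holds more than one element, the data have no unique mode and you can decide how to handle that case yourself.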

If you're willing to use SciPy:

    >>> from scipy.stats import mode
    >>> mode([1,2,3,1,2,1,1,1,3,2,2,1])
    (array([ 1.]), array([ 6.]))
    >>> most_frequent = mode([1,2,3,1,2,1,1,1,3,2,2,1])[0][0]
    >>> most_frequent
    1.0

Performance (measured in IPython) of some of the solutions found here:

    >>> # small array
    >>> a = [12,3,65,33,12,3,123,888000]
    >>>
    >>> import collections
    >>> collections.Counter(a).most_common()[0][0]
    3
    >>> %timeit collections.Counter(a).most_common()[0][0]
    100000 loops, best of 3: 11.3 µs per loop
    >>>
    >>> import numpy
    >>> numpy.bincount(a).argmax()
    3
    >>> %timeit numpy.bincount(a).argmax()
    100 loops, best of 3: 2.84 ms per loop
    >>>
    >>> import scipy.stats
    >>> scipy.stats.mode(a)[0][0]
    3.0
    >>> %timeit scipy.stats.mode(a)[0][0]
    10000 loops, best of 3: 172 µs per loop
    >>>
    >>> from collections import defaultdict
    >>> def jjc(l):
    ...     d = defaultdict(int)
    ...     for i in l:
    ...         d[i] += 1
    ...     return sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
    ...
    >>> jjc(a)[0]
    3
    >>> %timeit jjc(a)[0]
    100000 loops, best of 3: 5.58 µs per loop
    >>>
    >>> max(map(lambda val: (a.count(val), val), set(a)))[1]
    12
    >>> %timeit max(map(lambda val: (a.count(val), val), set(a)))[1]
    100000 loops, best of 3: 4.11 µs per loop
    >>>

For this small list, the best is 'max' with 'set'.

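While most of the answers above are useful, in case you: 1) need to support values that are not positive integers (e.g. floats or negative integers ;-)), 2) are not on Python 2.7 (which collections.Counter requires), and 3) prefer not to add SciPy (or even NumPy) as a dependency of your code, then a pure Python 2.6 solution that is O(n log n) (i.e. efficient) is just this: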

    from collections import defaultdict

    a = [1,2,3,1,2,1,1,1,3,2,2,1]

    d = defaultdict(int)
    for i in a:
        d[i] += 1    # tally each value
    # Sort (value, count) pairs by count, largest first; on Python 3 use d.items()
    most_frequent = sorted(d.iteritems(), key=lambda x: x[1], reverse=True)[0]
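As a side note, the sort is what makes the snippet above O(n log n). A possible variant, sketched here under the assumption that you only need the single top entry, replaces the sort with one pass of `max()`. The function name `most_frequent_pair` is hypothetical, and `items()` is used so the code also runs on Python 3.

```python
from collections import defaultdict

def most_frequent_pair(seq):
    """Return (value, count) for the most frequent item in seq."""
    counts = defaultdict(int)
    for item in seq:
        counts[item] += 1
    # max() scans the tallies once, so no sort is needed
    return max(counts.items(), key=lambda pair: pair[1])

print(most_frequent_pair([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1]))
```

An empty input makes `max()` raise ValueError, so guard against that if your data can be empty.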

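Also, if you want to get the most frequent value (positive or negative) without loading any modules, you can use the following code:

    lVals = [1,2,3,1,2,1,1,1,3,2,2,1]
    # Build (count, value) pairs for each distinct value and take the largest pair
    print(max(map(lambda val: (lVals.count(val), val), set(lVals))))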
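The same idea can be written a little more compactly by passing `list.count` as the key to `max()`. This is only a sketch of an equivalent formulation (the name `most_frequent_by_count` is made up), and it returns just the value rather than a (count, value) pair.

```python
def most_frequent_by_count(values):
    """Return the most frequent item of a list without importing anything."""
    # values.count is called once per distinct item, so the cost grows
    # with len(values) times the number of distinct values
    return max(set(values), key=values.count)

print(most_frequent_by_count([1, 2, 3, 1, 2, 1, 1, 1, 3, 2, 2, 1]))
```

Because each `count` call rescans the whole list, this suits short lists or lists with few distinct values.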

I like JoshAdel's solution.

But there is just one catch.

The np.bincount() solution only works on numbers.

If you have strings, the collections.Counter solution will work for you.
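For illustration, a minimal sketch of the Counter approach on strings; the word list is made-up sample data.

```python
from collections import Counter

words = ["spam", "eggs", "spam", "ham", "eggs", "spam"]

# most_common(1) returns a list holding one (value, count) pair
value, count = Counter(words).most_common(1)[0]
print(value, count)
```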

Here is a general solution that can be applied along an axis, regardless of the values, using pure NumPy. I have also found that it is much faster than scipy.stats.mode when there are many unique values.

    import numpy

    def mode(ndarray, axis=0):
        # Check inputs
        ndarray = numpy.asarray(ndarray)
        ndim = ndarray.ndim
        if ndarray.size == 1:
            return (ndarray[0], 1)
        elif ndarray.size == 0:
            raise Exception('Cannot compute mode on empty array')
        try:
            axis = range(ndarray.ndim)[axis]
        except (IndexError, TypeError):
            raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

        # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
        version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
        if ndim == 1 and version >= (1, 9):
            modals, counts = numpy.unique(ndarray, return_counts=True)
            index = numpy.argmax(counts)
            return modals[index], counts[index]

        # Sort array
        sort = numpy.sort(ndarray, axis=axis)
        # Create array to transpose along the axis and get padding shape
        transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
        shape = list(sort.shape)
        shape[axis] = 1
        # Create a boolean array along strides of unique values
        strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                     numpy.diff(sort, axis=axis) == 0,
                                     numpy.zeros(shape=shape, dtype='bool')],
                                    axis=axis).transpose(transpose).ravel()
        # Count the stride lengths
        counts = numpy.cumsum(strides)
        counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
        counts[strides] = 0
        # Get shape of padded counts and slice to return to the original shape
        shape = numpy.array(sort.shape)
        shape[axis] += 1
        shape = shape[transpose]
        slices = [slice(None)] * ndim
        slices[axis] = slice(1, None)
        # Reshape and compute final counts (index with a tuple, not a list)
        counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

        # Find maximum counts and return modals/counts
        slices = [slice(None, i) for i in sort.shape]
        del slices[axis]
        index = list(numpy.ogrid[tuple(slices)])
        index.insert(axis, numpy.argmax(counts, axis=axis))
        index = tuple(index)
        return sort[index], counts[index]

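Expanding on this method for finding the mode of the data: you may need the index into the actual array to see how far the value is from the center of the distribution.

    # idx[k] is the index in a of the first occurrence of the k-th unique value
    (_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
    index = idx[np.argmax(counts)]   # position in a of the mode's first occurrence
    mode = a[index]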

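Remember to discard the mode when more than one value reaches the maximum count. Note that np.argmax(counts) returns a single index, so len() cannot be applied to it; test for a tie with np.sum(counts == counts.max()) > 1 instead.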

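I was recently working on a project that used collections.Counter (which tortured me).

In my opinion, the performance of collections.Counter is very poor. It is just a class wrapping dict().

What's worse, if you profile its methods with cProfile, you should see a lot of '__missing__' and '__instancecheck__' calls eating up the time.

Be careful with its most_common(), because every call invokes a sort, which is very slow. If you use most_common(x), it invokes a heap-based selection instead, which is also slow.

By the way, NumPy's bincount has a problem too: np.bincount([1,2,4000000]) returns an array with 4000001 elements, one for every integer from 0 up to the largest value.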
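One possible way around the bincount memory issue, sketched here as an assumption rather than something proposed in the thread, is to map the values to compact codes 0..k-1 with `np.unique(..., return_inverse=True)` and run `np.bincount` on those codes. The function name `mode_sparse_ints` is hypothetical.

```python
import numpy as np

def mode_sparse_ints(values):
    """Mode of a 1-D sequence whose values may be large or widely spaced."""
    arr = np.asarray(values)
    # inverse holds, for each element, its position in the sorted unique values
    uniq, inverse = np.unique(arr, return_inverse=True)
    # The count array now has one slot per distinct value, not per integer
    counts = np.bincount(inverse.ravel())
    return uniq[np.argmax(counts)]

print(mode_sparse_ints([1, 2, 4000000, 2]))
```

The count array here has length 3 instead of 4000001, at the price of the sort that np.unique performs internally.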
