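Find unique rows in numpy.array

I need to find the unique rows in a numpy.array.

For example:

    >>> a  # I have
    array([[1, 1, 1, 0, 0, 0],
           [0, 1, 1, 1, 0, 0],
           [0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])
    >>> new_a  # I want to get to
    array([[1, 1, 1, 0, 0, 0],
           [0, 1, 1, 1, 0, 0],
           [1, 1, 1, 1, 1, 0]])

I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe there is a way to set the data type to void so that I can just use numpy.unique, but I couldn't figure out how to make it work.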

As of NumPy 1.13, you can simply choose the axis along which unique values are selected in any N-dimensional array. To get the unique rows:

    unique_rows = np.unique(original_array, axis=0)
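
For the array in the question, a minimal sketch (assuming NumPy 1.13 or later) could look like the following. np.unique returns the rows in sorted order, so the return_index output is used here to restore the order of first appearance shown in the question. The names sorted_rows and first_idx are only illustrative.

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Unique rows in sorted order, plus the index of each row's first occurrence.
sorted_rows, first_idx = np.unique(a, axis=0, return_index=True)

# Re-select by ascending index to keep the order of first appearance.
new_a = a[np.sort(first_idx)]
print(new_a)
```

If the sorted order is acceptable, sorted_rows can be used directly and return_index is not needed.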

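Yet another possible solution:

    # list(...) is needed because recent NumPy versions want a sequence, not a set
    np.vstack(list({tuple(row) for row in a}))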

Another option, instead of structured arrays, is to use a view of a void type that joins the whole row into a single item:

    a = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0]])

    b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(b, return_index=True)

    unique_a = a[idx]

    >>> unique_a
    array([[0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])

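EDIT: Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

EDIT: The above can be sped up slightly, perhaps at the cost of clarity, by doing:

    unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])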
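
The void-view idea can be wrapped in a small helper. This is only a sketch: the function name unique_rows_void is hypothetical, and it assumes a 2-D input. Because the comparison is done on raw bytes, float rows that are numerically equal but stored differently (for instance 0.0 and -0.0) would presumably be treated as distinct.

```python
import numpy as np

def unique_rows_void(a):
    """Unique rows of a 2-D array, kept in order of first appearance."""
    a = np.ascontiguousarray(a)                   # the view below needs contiguous rows
    row_bytes = a.dtype.itemsize * a.shape[1]     # number of bytes in one row
    b = a.view(np.dtype((np.void, row_bytes)))    # shape (m, 1): one opaque item per row
    _, idx = np.unique(b, return_index=True)      # index of the first occurrence of each row
    return a[np.sort(idx)]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
print(unique_rows_void(a))
```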

Also, at least on my system, performance-wise it is on par with, or even better than, the lexsort method:

    a = np.random.randint(2, size=(10000, 6))

    %timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
    100 loops, best of 3: 3.17 ms per loop

    %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
    100 loops, best of 3: 5.93 ms per loop

    a = np.random.randint(2, size=(10000, 100))

    %timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
    10 loops, best of 3: 29.9 ms per loop

    %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
    10 loops, best of 3: 116 ms per loop

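If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

As a quick example:

    import numpy as np

    data = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])

    ncols = data.shape[1]
    dtype = data.dtype.descr * ncols   # one field per column
    struct = data.view(dtype)

    uniq = np.unique(struct)
    uniq = uniq.view(data.dtype).reshape(-1, ncols)
    print(uniq)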

To understand what's going on, have a look at the intermediate results.

Once we view things as a structured array, each element in the array is a row of your original array. (Basically, it's a data structure similar to a list of tuples.)

    In [71]: struct
    Out[71]:
    array([[(1, 1, 1, 0, 0, 0)],
           [(0, 1, 1, 1, 0, 0)],
           [(0, 1, 1, 1, 0, 0)],
           [(1, 1, 1, 0, 0, 0)],
           [(1, 1, 1, 1, 1, 0)]],
          dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

    In [72]: struct[0]
    Out[72]:
    array([(1, 1, 1, 0, 0, 0)],
          dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we get a structured array back:

    In [73]: np.unique(struct)
    Out[73]:
    array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
          dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

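We then need to view that as a "normal" array (_ stores the result of the last calculation in ipython, which is why you're seeing _.view...):

    In [74]: _.view(data.dtype)
    Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

And then reshape it back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):

    In [75]: _.reshape(-1, ncols)
    Out[75]:
    array([[0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])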

Obviously, if you wanted to be more concise, you could write it as:

    import numpy as np

    def unique_rows(data):
        uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
        return uniq.view(data.dtype).reshape(-1, data.shape[1])

    data = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
    print(unique_rows(data))

Which results in:

    [[0 1 1 1 0 0]
     [1 1 1 0 0 0]
     [1 1 1 1 1 0]]

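When I run it on np.random.random(100).reshape(10,10), np.unique returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

    array = ...  # your numpy array of lists
    new_array = [tuple(row) for row in array]
    # caution: without axis=0, np.unique converts equal-length tuples back
    # into a 2D array and flattens it, so it still returns single elements
    uniques = np.unique(new_array)

That is the only way I see of changing the types to do what you want, and I am not sure whether iterating over the list to build the tuples is compatible with your "not looping through" requirement.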

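np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous one. This can be done manually without flattening:

    ind = np.lexsort(a.T)
    a[ind[np.concatenate(([True], np.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]]

This method does not use tuples, and should be faster and simpler than the other methods given here.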

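NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this creates a variety of intermediate copies. The following method makes fewer, by making one sorted copy and then using views of it:

    b = a[np.lexsort(a.T)]
    b[np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))]

This is faster and uses less memory.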
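
Put together as a function, the sorted-copy variant might look like the sketch below. The name unique_rows_lexsort and the empty-input guard are additions for illustration. Note that np.lexsort uses the last key (here the last column) as the primary sort key, so the rows come back in a different order than np.unique(a, axis=0) would give.

```python
import numpy as np

def unique_rows_lexsort(a):
    """Unique rows of a 2-D array via one lexsort and a neighbour comparison."""
    a = np.asarray(a)
    if a.shape[0] == 0:
        return a                                   # nothing to deduplicate
    b = a[np.lexsort(a.T)]                         # sorted copy; last column is the primary key
    # A row is kept if it is the first row or differs from the row before it.
    keep = np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))
    return b[keep]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
result = unique_rows_lexsort(a)
print(result)
```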

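Also, if you want to find unique rows in an ndarray regardless of how many dimensions the array has, the following will work:

    b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
    b[np.concatenate(([True], np.any(b[1:] != b[:-1], axis=tuple(range(1, a.ndim)))))]

An interesting remaining issue would be sorting/finding unique entries along an arbitrary axis of an arbitrary-dimensional array, which would be more difficult.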
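
With NumPy 1.13 or later, the axis argument of np.unique appears to cover much of this arbitrary-axis case. As a small illustration on a made-up array m, unique columns can be requested with axis=1:

```python
import numpy as np

m = np.array([[1, 1, 2],
              [3, 3, 4]])

# Treat each column as one item; duplicate columns are dropped.
unique_cols = np.unique(m, axis=1)
print(unique_cols)
```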

Edit:

To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't much of a difference, though this version is a bit faster:

    In [87]: %timeit unique(a.view(dtype)).view('<i8')
    10000 loops, best of 3: 48.4 us per loop

    In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
    10000 loops, best of 3: 37.6 us per loop

    In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
    10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much, much faster:

    In [96]: a = np.random.randint(0,2,size=(10000,6))

    In [97]: %timeit unique(a.view(dtype)).view('<i8')
    10 loops, best of 3: 24.4 ms per loop

    In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
    10 loops, best of 3: 28.2 ms per loop

    In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
    100 loops, best of 3: 3.25 ms per loop

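Here is another variation on @Greg's pythonic answer:

    # list(...) is needed because recent NumPy versions want a sequence, not a set
    np.vstack(list(set(map(tuple, a))))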

I don't like any of these answers because none of them handles floating-point arrays in a linear algebra or vector space sense, where two rows being "equal" means "within some ε". The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, applies the threshold element-wise and as a decimal precision, which works for some cases but isn't as mathematically general as a true vector distance.

Here's my version:

    import numpy as np
    from scipy.spatial.distance import squareform, pdist

    def uniqueRows(arr, thresh=0.0, metric='euclidean'):
        "Returns subset of rows that are unique, in terms of Euclidean distance"
        distances = squareform(pdist(arr, metric=metric))
        idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
        return arr[[x[0] for x in idxset]]

    # With this, unique columns are super-easy:
    def uniqueColumns(arr, *args, **kwargs):
        return uniqueRows(arr.T, *args, **kwargs)

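The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. It then compares each distance to thresh to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.

As hinted, the distance metric needn't be Euclidean: pdist can compute various distances, including cityblock (Manhattan norm) and cosine (the angle between vectors).

If thresh=0 (the default), rows have to be bit-exact to be considered duplicates. Other good values for thresh use scaled machine precision, e.g. thresh=np.spacing(1)*1e3.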
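
If SciPy is not available, a rough NumPy-only approximation of the same idea is sketched below. This is not the function above: it is a simpler greedy variant (the name unique_rows_tol is hypothetical) that keeps a row only when it is farther than thresh, in Euclidean distance, from every row already kept. The two approaches can differ when clusters overlap.

```python
import numpy as np

def unique_rows_tol(arr, thresh=0.0):
    """Greedy tolerance-based deduplication using Euclidean distance."""
    arr = np.asarray(arr, dtype=float)
    keep = []                                      # indices of the representative rows
    for i, row in enumerate(arr):
        # Keep the row if nothing kept so far lies within thresh of it.
        if not keep or np.min(np.linalg.norm(arr[keep] - row, axis=1)) > thresh:
            keep.append(i)
    return arr[keep]

pts = np.array([[0.0, 0.0],
                [0.0, 1e-9],   # within 1e-6 of the first row
                [1.0, 1.0]])
print(unique_rows_tol(pts, thresh=1e-6))
```

The loop compares each row against the kept rows only, so the cost grows with the number of distinct rows rather than with all pairs.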

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:

    import numpy_indexed as npi
    new_a = npi.unique(a)  # unique elements over axis=0 (rows) by default

Why not use drop_duplicates from pandas:

    >>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
    1 loops, best of 3: 3.08 s per loop

    >>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
    1 loops, best of 3: 51 s per loop

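I've compared the suggested alternatives for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy's native unique with the axis argument. There is only a noticeable difference for small arrays, though, so

    numpy.unique(a, axis=0)

is probably the best solution if you have NumPy 1.13 or higher.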

[Plot: runtime of the compared methods against len(a); image not preserved]

Code to reproduce the plot:

    import numpy
    import perfplot


    def unique_void_view(a):
        return numpy.unique(
            a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
            ).view(a.dtype).reshape(-1, a.shape[1])


    def lexsort(a):
        ind = numpy.lexsort(a.T)
        return a[ind[
            numpy.concatenate((
                [True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)
                ))
            ]]


    def vstack(a):
        # list(...) because recent NumPy versions want a sequence, not a set
        return numpy.vstack(list({tuple(row) for row in a}))


    def unique_axis(a):
        return numpy.unique(a, axis=0)


    perfplot.show(
        setup=lambda n: numpy.random.randint(2, size=(n, 20)),
        kernels=[unique_void_view, lexsort, vstack, unique_axis],
        n_range=[2**k for k in range(15)],
        logx=True,
        logy=True,
        xlabel='len(a)',
        equality_check=None
        )

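np.unique works on a list of tuples when axis=0 is passed (without it, the input is flattened and the result is array([1, 2, 3, 4])):

    >>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)], axis=0)
    array([[1, 1],
           [2, 2],
           [3, 3],
           [4, 4]])

With a list of lists I got TypeError: unhashable type: 'list'.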

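Based on the answers on this page, I have written a function that replicates the behaviour of MATLAB's unique(input,'rows') function, with the additional feature of accepting a tolerance for checking uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report any discrepancies or errors you find.

    def unique_rows(data, prec=5):
        import numpy as np
        # truncate to prec decimal places; "+ 0.0" turns -0.0 into 0.0
        d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
        b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
        _, ia = np.unique(b, return_index=True)
        _, ic = np.unique(b, return_inverse=True)
        return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic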

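Beyond @Jaime's excellent answer, another way to collapse a row is to use a.strides[0] (assuming a is C-contiguous), which is equal to a.dtype.itemsize*a.shape[1]. Furthermore, void(n) is a shortcut for dtype((void,n)). We finally arrive at this shortest version:

    # with: from numpy import unique, void
    a[unique(a.view(void(a.strides[0])),1)[1]]

which gives:

    [[0 1 1 1 0 0]
     [1 1 1 0 0 0]
     [1 1 1 1 1 0]]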

For general use with nested arrays of 3D or higher dimension, try this:

    import numpy as np

    def unique_nested_arrays(ar):
        origin_shape = ar.shape
        origin_dtype = ar.dtype
        # flatten everything after the first axis into one "row"
        ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
        ar = np.ascontiguousarray(ar)
        unique_ar = np.unique(ar.view([('', origin_dtype)]*np.prod(origin_shape[1:])))
        return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0], ) + origin_shape[1:])

This works for your 2D dataset:

    a = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [1, 1, 1, 0, 0, 0],
                  [1, 1, 1, 1, 1, 0]])
    unique_nested_arrays(a)

gives:

    array([[0, 1, 1, 1, 0, 0],
           [1, 1, 1, 0, 0, 0],
           [1, 1, 1, 1, 1, 0]])

And also for 3D arrays like:

    b = np.array([[[1, 1, 1], [0, 1, 1]],
                  [[0, 1, 1], [1, 1, 1]],
                  [[1, 1, 1], [0, 1, 1]],
                  [[1, 1, 1], [1, 1, 1]]])
    unique_nested_arrays(b)

gives:

    array([[[0, 1, 1], [1, 1, 1]],
           [[1, 1, 1], [0, 1, 1]],
           [[1, 1, 1], [1, 1, 1]]])
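
For comparison, on NumPy 1.13 or later the same 3D example can presumably be handled by np.unique with axis=0, which treats each sub-array along the first axis as one item. A brief sketch (the variable name unique_blocks is illustrative):

```python
import numpy as np

b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])

# Each 2x3 block along axis 0 is compared as a whole.
unique_blocks = np.unique(b, axis=0)
print(unique_blocks)
```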

None of these answers worked for me. I assume that is because my unique rows contained strings rather than numbers. However, this answer from another thread did work:

Source: https://stackoverflow.com/a/38461043/5402386

You can use the list methods .count() and .index():

    coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
    coor_tuple = [tuple(x) for x in coor]
    unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
    unique_count = [coor_tuple.count(x) for x in unique_coor]
    unique_index = [coor_tuple.index(x) for x in unique_coor]
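
Since .count() and .index() each rescan the whole list, this becomes quadratic for large inputs. A possible alternative, sketched here with made-up string data, is collections.Counter, which counts hashable tuples in one pass and (on Python 3.7+) keeps them in order of first appearance:

```python
from collections import Counter

import numpy as np

# Hypothetical string-valued rows, standing in for the case described above.
coor = np.array([["a", "x"], ["b", "y"], ["a", "z"], ["b", "y"]])

counts = Counter(map(tuple, coor))        # maps each row tuple to its number of occurrences
unique_coor = np.array(list(counts))      # unique rows, in order of first appearance
unique_count = list(counts.values())      # occurrences of each unique row
print(unique_coor, unique_count)
```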

We can actually turn an m x n numeric numpy array into an array of m strings, one per row. Try the following function; it provides counts, inverse indices and so on, just like numpy.unique:

    import numpy as np

    def uniqueRow(a):
        # Turns an m x n numpy array into m string keys (one per row),
        # so that np.unique can be used on them.
        # Input: an m x n numpy array (a)
        # Output: unique m' x n numpy array, inverse_indx, and counts
        keys = np.array(['-'.join(map(str, row)) for row in a])
        _, idx, inv_, c = np.unique(keys, return_index=True, return_inverse=True, return_counts=True)
        return a[idx], inv_, c

Example:

    A = np.array([[ 3.17 ,  9.502,  3.291],
                  [ 9.984,  2.773,  6.852],
                  [ 1.172,  8.885,  4.258],
                  [ 9.73 ,  7.518,  3.227],
                  [ 8.113,  9.563,  9.117],
                  [ 9.984,  2.773,  6.852],
                  [ 9.73 ,  7.518,  3.227]])

    B, inv_, c = uniqueRow(A)

    Results:
    B:
    [[ 1.172  8.885  4.258]
     [ 3.17   9.502  3.291]
     [ 8.113  9.563  9.117]
     [ 9.73   7.518  3.227]
     [ 9.984  2.773  6.852]]
    inv_:
    [1 4 0 3 2 4 3]
    c:
    [1 1 1 2 2]

Convert the whole numpy matrix to a list, drop the duplicates from that list, and finally turn the unique list back into a numpy matrix:

    matrix_as_list = data.tolist()
    # matrix_as_list:
    # [[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]

    uniq_list = list()
    uniq_list.append(matrix_as_list[0])
    for item in matrix_as_list:
        if item not in uniq_list:
            uniq_list.append(item)

    unique_matrix = np.array(uniq_list)
    # unique_matrix:
    # array([[1, 1, 1, 0, 0, 0],
    #        [0, 1, 1, 1, 0, 0],
    #        [1, 1, 1, 1, 1, 0]])

The most straightforward solution is to turn each row into a single item by converting it to a string. Each row can then be compared as a whole for uniqueness using numpy. This solution is generalizable: you only need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.

    import numpy as np

    original = np.array([[1, 1, 1, 0, 0, 0],
                         [0, 1, 1, 1, 0, 0],
                         [0, 1, 1, 1, 0, 0],
                         [1, 1, 1, 0, 0, 0],
                         [1, 1, 1, 1, 1, 0]])

    uniques, index = np.unique([str(i) for i in original], return_index=True)
    cleaned = original[index]
    print(cleaned)

Gives:

    [[0 1 1 1 0 0]
     [1 1 1 0 0 0]
     [1 1 1 1 1 0]]
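
One caveat with string keys: by default NumPy abbreviates the string form of arrays with more than 1000 elements using "...", so very wide rows that differ only in the middle would get the same key. A sketch of one way around this is to key on the raw bytes of each row instead. The helper name unique_rows_by_bytes is hypothetical.

```python
import numpy as np

def unique_rows_by_bytes(arr):
    """Unique rows in order of first appearance, keyed on each row's raw bytes."""
    arr = np.ascontiguousarray(arr)
    first_seen = {}                                # row bytes -> index of first occurrence
    for i, row in enumerate(arr):
        first_seen.setdefault(row.tobytes(), i)
    return arr[list(first_seen.values())]

# Two wide rows that differ only at position 1000.
wide = np.zeros((2, 2000), dtype=int)
wide[1, 1000] = 1

print(str(wide[0]) == str(wide[1]))        # the abbreviated strings hide the difference
print(unique_rows_by_bytes(wide).shape)    # both rows are kept
```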

Mail me my Nobel prize.

    import numpy as np

    original = np.array([[1, 1, 1, 0, 0, 0],
                         [0, 1, 1, 1, 0, 0],
                         [0, 1, 1, 1, 0, 0],
                         [1, 1, 1, 0, 0, 0],
                         [1, 1, 1, 1, 1, 0]])

    # create a view that treats each row as one tuple-like item and return the unique indices
    _, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                                return_index=True)

    # get unique set
    print(original[unique_index])
