Removing duplicate columns and rows from a NumPy 2D array

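I'm using 2D arrays to store longitude + latitude pairs. At one point I have to merge two of these 2D arrays and then remove any duplicate entries. I've been searching for a function similar to numpy.unique, but I've had no luck. Every implementation I've come up with looks very "unoptimized". For example, I tried converting the array to a list of tuples, removing duplicates with a set, and then converting back to an array: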

coordskeys = np.array(list(set([tuple(x) for x in coordskeys])))

Is there an existing solution, so that I don't reinvent the wheel?

To make it clear, I'm looking for:

    >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
    >>> unique_rows(a)
    array([[1, 1],
           [2, 3],
           [5, 4]])

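By the way, I wanted to just use a list of tuples, but the lists were so big that they consumed my 4 GB of RAM plus 4 GB of swap (NumPy arrays are more memory efficient).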

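Here's one idea. It will take a little bit of work, but it could be quite fast. I'll give you the 1D case and let you figure out how to extend it to 2D. The following function finds the unique elements of a 1D array: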

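    import numpy as np

    def unique(a):
        a = np.sort(a)     # equal elements end up next to each other
        b = np.diff(a)     # 0 wherever an element repeats its predecessor
        b = np.r_[1, b]    # always keep the first element
        return a[b != 0]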

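Now, to extend it to 2D you need to change two things. First, you will need to figure out how to do the sort yourself; the important thing about the sort is that two identical entries end up next to each other. Second, you'll need to do something like (b != 0).any(axis), because you want to compare whole rows/columns, and a row is new as soon as any of its elements differs from the previous row. Let me know if that's enough to get you started.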
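To see why sorting followed by a diff works, here is a minimal sketch of the intermediate steps. The arrays `values` and `sorted_rows` are made-up illustrative data, not taken from the question.

```python
import numpy as np

values = np.array([3, 1, 2, 3, 1])    # illustrative data only

sorted_values = np.sort(values)       # equal entries are now adjacent: [1 1 2 3 3]
steps = np.diff(sorted_values)        # 0 wherever an entry repeats its predecessor
keep = np.r_[True, steps != 0]        # always keep the first entry
unique_values = sorted_values[keep]
print(unique_values)

# The same test for rows that are already sorted:
# a row starts a new group if ANY of its columns changed.
sorted_rows = np.array([[1, 1], [1, 1], [2, 3], [2, 4]])
row_changed = (np.diff(sorted_rows, axis=0) != 0).any(axis=1)
print(row_changed)
```

Note that `row_changed` has one entry fewer than `sorted_rows`, which is why the first row always has to be kept separately.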

Update: With some help from Doug, I think this should work for the 2D case.

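    import numpy as np

    def unique(a):
        order = np.lexsort(a.T)              # sort rows so identical rows are adjacent
        a = a[order]
        diff = np.diff(a, axis=0)            # row-to-row differences
        ui = np.ones(len(a), 'bool')         # the first row is always kept
        ui[1:] = (diff != 0).any(axis=1)     # keep rows that differ from the previous one
        return a[ui]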

This should do the trick:

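    import numpy as np

    def unique_rows(a):
        a = np.ascontiguousarray(a)
        # view each row as a single structured element so np.unique compares whole rows
        unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
        return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))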

Example:

    >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
    >>> unique_rows(a)
    array([[1, 1],
           [2, 3],
           [5, 4]])
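On NumPy 1.13 or newer, `np.unique` accepts an `axis` argument, so a structured-view workaround should no longer be necessary. This is a different technique from the answer above, shown as a minimal sketch that also covers merging two coordinate arrays first. The arrays `first` and `second` are made up for illustration.

```python
import numpy as np

# Two illustrative (n, 2) coordinate arrays with overlapping rows.
first = np.array([[1, 1], [2, 3], [1, 1]])
second = np.array([[5, 4], [2, 3]])

merged = np.vstack((first, second))        # stack the two arrays row-wise
deduplicated = np.unique(merged, axis=0)   # drop duplicate rows; result is sorted
print(deduplicated)
```

Passing `axis=1` removes duplicate columns instead. The result comes back sorted, so the original row order is not preserved.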

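My approach is to turn the 2D array into a 1D complex array, where the real part is the first column and the imaginary part is the second column, and then use np.unique. This only works with 2 columns, though.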

    import numpy as np

    def unique2d(a):
        x, y = a.T
        b = x + y * 1.0j                            # encode each row as one complex number
        idx = np.unique(b, return_index=True)[1]    # index of the first occurrence of each value
        return a[idx]

Example:

    >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
    >>> unique2d(a)
    array([[1, 1],
           [2, 3],
           [5, 4]])
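Since the question is about longitude/latitude pairs, here is a small sketch of the same complex-number trick applied to floating-point coordinates. The `coords` values are hypothetical, and the function is repeated only so the snippet runs on its own.

```python
import numpy as np

def unique2d(a):
    # Encode each (x, y) row as the complex number x + y*1j.
    x, y = a.T
    encoded = x + y * 1.0j
    # return_index gives the position of the first occurrence of each value.
    first_index = np.unique(encoded, return_index=True)[1]
    return a[first_index]

# Hypothetical longitude/latitude pairs with one duplicate row.
coords = np.array([[12.5, 41.9], [2.35, 48.85], [12.5, 41.9]])
result = unique2d(coords)
print(result)
```

One caveat: the encoding goes through float64, so integer columns with values above 2**53 cannot be represented exactly, and distinct rows could then be treated as duplicates.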

    >>> import numpy as NP
    >>> # a 2D NumPy array with some duplicate rows
    >>> A
    array([[1, 1, 1, 5, 7],
           [5, 4, 5, 4, 7],
           [7, 9, 4, 7, 8],
           [5, 4, 5, 4, 7],
           [1, 1, 1, 5, 7],
           [5, 4, 5, 4, 7],
           [7, 9, 4, 7, 8],
           [5, 4, 5, 4, 7],
           [7, 9, 4, 7, 8]])
    >>> # first, sort the 2D NumPy array row-wise so dups will be contiguous
    >>> # (whole rows are reordered, so each row is preserved)
    >>> a, b, c, d, e = A.T    # the columns are the keys to pass to lexsort
    >>> ndx = NP.lexsort((a, b, c, d, e))
    >>> ndx
    array([1, 3, 5, 7, 0, 4, 2, 6, 8])
    >>> A = A[ndx,]
    >>> # now diff by row
    >>> A1 = NP.diff(A, axis=0)
    >>> A1
    array([[ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0],
           [-4, -3, -4,  1,  0],
           [ 0,  0,  0,  0,  0],
           [ 6,  8,  3,  2,  1],
           [ 0,  0,  0,  0,  0],
           [ 0,  0,  0,  0,  0]])
    >>> # boolean mask: True where a row differs from the row before it
    >>> ndx = NP.any(A1, axis=1)
    >>> ndx
    array([False, False, False,  True, False,  True, False, False])
    >>> # retrieve the first row of each new group
    >>> # (A[0] also starts a group and has to be kept separately)
    >>> A[1:, :][ndx,]
    array([[1, 1, 1, 5, 7],
           [7, 9, 4, 7, 8]])

The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by user545424 in a nice, tested interface, along with many related features:

    import numpy_indexed as npi
    npi.unique(coordskeys)

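Since you're referring to numpy.unique, you don't care about preserving the original order, right? Converting to a set, which removes duplicates, and then back to a list is the usual idiom: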

    >>> x = [(1, 1), (2, 3), (1, 1), (5, 4), (2, 3)]
    >>> y = list(set(x))
    >>> y
    [(5, 4), (2, 3), (1, 1)]
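If the original order does matter, a set is not enough, because its iteration order is arbitrary. Here is a hedged sketch of two order-preserving alternatives, one for plain lists and one for NumPy arrays (the latter assumes NumPy 1.13 or newer for the `axis` argument). The `points` data is made up for illustration.

```python
import numpy as np

# Illustrative data: duplicates appear out of sorted order.
points = [(5, 4), (1, 1), (5, 4), (2, 3), (1, 1)]

# dict keys keep insertion order (guaranteed since Python 3.7), unlike set.
ordered = list(dict.fromkeys(points))
print(ordered)

# NumPy variant: take the first-occurrence index of each unique row,
# then sort those indices to restore the original order.
a = np.array(points)
_, first_index = np.unique(a, axis=0, return_index=True)
ordered_rows = a[np.sort(first_index)]
print(ordered_rows)
```

Both variants keep the first occurrence of each pair and drop later repeats.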
