Fast Haversine近似（Python / Pandas）

pandas数据框中的每一行都包含2点的lat / lng坐标。使用下面的Python代码，对于许多（百万）行计算这两个点之间的距离需要很长时间！

考虑到这两点距离不足50英里，准确度不是很重要，有可能使计算速度更快吗？

from math import radians, cos, sin, asin, sqrt def haversine(lon1, lat1, lon2, lat2): """ Calculate the great circle distance between two points on the earth (specified in decimal degrees) """ # convert decimal degrees to radians lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) # haversine formula dlon = lon2 - lon1 dlat = lat2 - lat1 a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 c = 2 * asin(sqrt(a)) km = 6367 * c return km for index, row in df.iterrows(): df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])

这是一个相同function的vector化numpy版本：

 import numpy as np def haversine_np(lon1, lat1, lon2, lat2): """ Calculate the great circle distance between two points on the earth (specified in decimal degrees) All args must be of equal length. """ lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2]) dlon = lon2 - lon1 dlat = lat2 - lat1 a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2 c = 2 * np.arcsin(np.sqrt(a)) km = 6367 * c return km

input是所有值的数组，它应该能够立即执行数百万个点。要求是input是ndarrays，但你的pandas表的列将工作。

例如，随机生成的值：

 >>> import numpy as np >>> import pandas >>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000) >>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2}) >>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

在python中循环访问数据非常缓慢。 Numpy提供了对整个数据arrays进行操作的函数，可以避免循环并显着提高性能。

这是一个向量化的例子。

纯粹是为了说明一个例子，我从@ballsdotballs的答案中取得了numpy版本，并且还通过ctypes调用了一个伴侣C的实现。由于numpy是一个高度优化的工具，所以我的C代码几乎没有机会高效，但应该有点接近。这里最大的好处是，通过运行一个Ctypes的例子，它可以帮助您看到如何将自己的个人C函数连接到Python，而无需太多的开销。当你只想通过在C源文件中而不是在Python中编写小文件来优化一小块更大的计算时，这是特别好的。简单地使用numpy可以解决大部分时间的问题，但是对于那些你并不真正需要所有numpy并且你不想在整个代码中添加耦合来要求使用numpy数据types，这非常方便知道如何下载到内置的ctypes库并自己动手。

首先让我们创build我们的C源文件，名为haversine.c ：

 #include <stdlib.h> #include <stdio.h> #include <math.h> int haversine(size_t n, double *lon1, double *lat1, double *lon2, double *lat2, double *kms){ if ( lon1 == NULL || lon2 == NULL || lat1 == NULL || lat2 == NULL || kms == NULL){ return -1; } double km, dlon, dlat; double iter_lon1, iter_lon2, iter_lat1, iter_lat2; double km_conversion = 2.0 * 6367.0; double degrees2radians = 3.14159/180.0; int i; for(i=0; i < n; i++){ iter_lon1 = lon1[i] * degrees2radians; iter_lat1 = lat1[i] * degrees2radians; iter_lon2 = lon2[i] * degrees2radians; iter_lat2 = lat2[i] * degrees2radians; dlon = iter_lon2 - iter_lon1; dlat = iter_lat2 - iter_lat1; km = pow(sin(dlat/2.0), 2.0) + cos(iter_lat1) * cos(iter_lat2) * pow(sin(dlon/2.0), 2.0); kms[i] = km_conversion * asin(sqrt(km)); } return 0; } // main function for testing int main(void) { double lat1[2] = {16.8, 27.4}; double lon1[2] = {8.44, 1.23}; double lat2[2] = {33.5, 20.07}; double lon2[2] = {14.88, 3.05}; double kms[2] = {0.0, 0.0}; size_t arr_size = 2; int res; res = haversine(arr_size, lon1, lat1, lon2, lat2, kms); printf("%d\n", res); int i; for (i=0; i < arr_size; i++){ printf("%3.3f, ", kms[i]); } printf("\n"); }

请注意，我们正试图保持与C公约。通过引用明确地传递数据参数，使用size_t作为大小variables，并期待我们的函数能够通过改变其中一个传入的input来工作，以便它将在退出时包含期望的数据。该函数实际上返回一个整数，这是一个成功/失败的标志，可以被该函数的其他C级使用者使用。

我们将需要find一种方法来处理Python中所有这些C特有的问题。

接下来，让我们把函数的numpy版本以及一些导入和一些testing数据放到一个叫做haversine.py的文件中：

 import time import ctypes import numpy as np from math import radians, cos, sin, asin, sqrt def haversine(lon1, lat1, lon2, lat2): """ Calculate the great circle distance between two points on the earth (specified in decimal degrees) """ # convert decimal degrees to radians lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2]) # haversine formula dlon = lon2 - lon1 dlat = lat2 - lat1 a = (np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2) c = 2 * np.arcsin(np.sqrt(a)) km = 6367 * c return km if __name__ == "__main__": lat1 = 50.0 * np.random.rand(1000000) lon1 = 50.0 * np.random.rand(1000000) lat2 = 50.0 * np.random.rand(1000000) lon2 = 50.0 * np.random.rand(1000000) t0 = time.time() r1 = haversine(lon1, lat1, lon2, lat2) t1 = time.time() print t1-t0, r1

我select了从0到50之间随机select的lats和lons（以度为单位），但是这个解释并不重要。

接下来我们需要做的是编译我们的C模块，使其可以被Pythondynamic加载。我使用的是Linux系统（您可以在Google上很容易地find其他系统的示例），所以我的目标是将haversine.c编译为共享对象，如下所示：

 gcc -shared -o haversine.so -fPIC haversine.c -lm

我们也可以编译成一个可执行文件并运行它来查看C程序的mainfunction是什么：

 > gcc haversine.c -o haversine -lm > ./haversine 0 1964.322, 835.278,

现在我们已经编译了共享对象haversine.so ，我们可以使用ctypes将它加载到Python中，我们需要提供文件的path来执行此操作：

 lib_path = "/path/to/haversine.so" # Obviously use your real path here. haversine_lib = ctypes.CDLL(lib_path)

现在， haversine_lib.haversine作用与Python函数几乎相同，只不过我们可能需要做一些手动types编组来确保正确解释input和输出。

numpy实际上为这个提供了一些不错的工具，我将在这里使用的是numpy.ctypeslib 。我们将构build一个指针types ，它将允许我们将numpy.ndarrays传递给这些ctypes numpy.ndarrays函数，就像它们是指针一样。代码如下：

 arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, ndim=1, flags='CONTIGUOUS') haversine_lib.haversine.restype = ctypes.c_int haversine_lib.haversine.argtypes = [ctypes.c_size_t, arr_1d_double, arr_1d_double, arr_1d_double, arr_1d_double, arr_1d_double]

请注意，我们告诉haversine_lib.haversine函数代理根据我们想要的types来解释它的参数。

现在， 要从Python中进行testing，只剩下一个大小variables，以及一个将被突变的数组（就像在C代码中一样）来包含结果数据，那么我们可以调用它：

 size = len(lat1) output = np.empty(size, dtype=np.double) print "=====" print output t2 = time.time() res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output) t3 = time.time() print t3 - t2, res print type(output), output

把它放在haversine.py的__main__块中，整个文件现在看起来像这样：

 import time import ctypes import numpy as np from math import radians, cos, sin, asin, sqrt def haversine(lon1, lat1, lon2, lat2): """ Calculate the great circle distance between two points on the earth (specified in decimal degrees) """ # convert decimal degrees to radians lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2]) # haversine formula dlon = lon2 - lon1 dlat = lat2 - lat1 a = (np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2) c = 2 * np.arcsin(np.sqrt(a)) km = 6367 * c return km if __name__ == "__main__": lat1 = 50.0 * np.random.rand(1000000) lon1 = 50.0 * np.random.rand(1000000) lat2 = 50.0 * np.random.rand(1000000) lon2 = 50.0 * np.random.rand(1000000) t0 = time.time() r1 = haversine(lon1, lat1, lon2, lat2) t1 = time.time() print t1-t0, r1 lib_path = "/home/ely/programming/python/numpy_ctypes/haversine.so" haversine_lib = ctypes.CDLL(lib_path) arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, ndim=1, flags='CONTIGUOUS') haversine_lib.haversine.restype = ctypes.c_int haversine_lib.haversine.argtypes = [ctypes.c_size_t, arr_1d_double, arr_1d_double, arr_1d_double, arr_1d_double, arr_1d_double] size = len(lat1) output = np.empty(size, dtype=np.double) print "=====" print output t2 = time.time() res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output) t3 = time.time() print t3 - t2, res print type(output), output

要运行它，分别运行Python和ctypes版本并打印一些结果，我们可以这样做

 python haversine.py

其中显示：

 0.111340045929 [ 231.53695005 3042.84915093 169.5158946 ..., 1359.2656769 2686.87895954 3728.54788207] ===== [ 6.92017600e-310 2.97780954e-316 2.97780954e-316 ..., 3.20676686e-001 1.31978329e-001 5.15819721e-001] 0.148446083069 0 <type 'numpy.ndarray'> [ 231.53675618 3042.84723579 169.51575588 ..., 1359.26453029 2686.87709456 3728.54493339]

正如预期的那样， numpy版本稍微快一点（对于长度为100万的向量，为0.11秒），但是我们的快速和脏的ctypes版本不是懒惰的：在相同的数据上是可观的0.148秒。

我们来比较一下Python中一个简单的for循环解决scheme：

 from math import radians, cos, sin, asin, sqrt def slow_haversine(lon1, lat1, lon2, lat2): n = len(lon1) kms = np.empty(n, dtype=np.double) for i in range(n): lon1_v, lat1_v, lon2_v, lat2_v = map( radians, [lon1[i], lat1[i], lon2[i], lat2[i]] ) dlon = lon2_v - lon1_v dlat = lat2_v - lat1_v a = (sin(dlat/2)**2 + cos(lat1_v) * cos(lat2_v) * sin(dlon/2)**2) c = 2 * asin(sqrt(a)) kms[i] = 6367 * c return kms

当我把它和其他的Python文件放在同一个Python文件中，并在相同的元素数据上计时时，我一直在我的机器上看到大约2.65秒的时间。

所以，通过快速切换到ctypes我们将速度提高了18倍。对于许多可以从获取裸露的连续数据中受益的计算，您经常会看到比这更高的收益。

为了超级清晰，我完全不赞成使用numpy作为一个更好的select。这正是numpy所要解决的问题，因此，只要（a）将numpy数据types合并到应用程序中是有意义的，并且（b）存在一种将代码映射到相当于numpy ，效率不是很高。

但是如果你喜欢用C语言编写某些东西，但是在Python中调用它，或者对numpy的依赖是不切实际的情况下（在embedded式系统中不能安装numpy下，例）。

如果使用scikit学习是允许的，我会给以下机会：

 from sklearn.neighbors import DistanceMetric dist = DistanceMetric.get_metric('haversine') # example data lat1, lon1 = 36.4256345, -5.1510261 lat2, lon2 = 40.4165, -3.7026 lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2]) X = [[lat1, lon1], [lat2, lon2]] kms = 6367 print(kms * dist.pairwise(X))

其中一些答案是“四舍五入”地球的半径。如果您对其他距离计算器（如geopy ）进行检查，这些function将closures。

如果您想以英里为单位的答案，您可以转换出下面的转换常数R=3959.87433 。

如果你想要公里，使用R= 6372.8 。

 lon1 = -103.548851 lat1 = 32.0004311 lon2 = -103.6041946 lat2 = 33.374939 def haversine(lat1, lon1, lat2, lon2): R = 3959.87433 # this is in miles. For Earth radius in kilometers use 6372.8 km dLat = radians(lat2 - lat1) dLon = radians(lon2 - lon1) lat1 = radians(lat1) lat2 = radians(lat2) a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2 c = 2*asin(sqrt(a)) return R * c print(haversine(lat1, lon1, lat2, lon2))

Fast Haversine近似（Python / Pandas）

如何智能地降解或平滑GIS数据（简化多边形）？

GeoJSON和TopoJSON的区别

有没有一种有效的algorithm来生成二维凹面船体？

书写纬度和经度元组的首选顺序

find两个纬度/长点之间距离的最快方法

GIS：PostGIS / PostgreSQL vs. MySql vs. SQL Server？

在Google地图上覆盖ESRI shapefile的最佳方法是什么？

NoSQL和空间数据

根据纬度/经度确定时区，而不使用像Geonames.org这样的networking服务

将shapefile（.shp）转换为xml / json