Find the closest value in a vector with binary search

As a silly toy example, suppose

    x = 4.5
    w = c(1,2,4,6,7)

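I wonder whether there is a simple R function that finds the index of the closest match to x in w. So if foo is that function, foo(w,x) would return 3. The function match is the right idea, but it seems to work only for exact matches.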

Solutions such as which.min(abs(w - x)), which(abs(w - x) == min(abs(w - x))), etc. are all O(n) rather than O(log n) (I am assuming that w is already sorted).

You can use data.table to do a binary search:

    library(data.table)
    dt = data.table(w, val = w) # you'll see why val is needed in a sec
    setattr(dt, "sorted", "w")  # let data.table know that w is sorted

Note that if the column w is not already sorted, you will have to use setkey(dt, w) instead of setattr(.).

    # binary search and "roll" to the nearest neighbour
    dt[J(x), roll = "nearest"]
    #      w val
    # 1: 4.5   4

In the final expression, the val column holds the value you are looking for.

    # or to get the index as Josh points out
    # (and then you don't need the val column):
    dt[J(x), .I, roll = "nearest", by = .EACHI]
    #      w .I
    # 1: 4.5  3

    R> findInterval(4.5, c(1,2,4,5,6))
    [1] 3

This does it with "price is right" matching (the closest value without going over).
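Since findInterval returns the last position whose value does not exceed x, getting the true nearest match takes one more comparison with the next element. Here is a minimal sketch, assuming w is sorted in ascending order; nearest_index is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: index of the element of sorted w nearest to each x.
# Ties go to the lower index.
nearest_index <- function(x, w) {
  lo <- findInterval(x, w)             # last index with w[lo] <= x (0 if none)
  lo <- pmin(pmax(lo, 1L), length(w))  # clamp into the valid index range
  hi <- pmin(lo + 1L, length(w))       # right-hand neighbour, if there is one
  ifelse(abs(w[hi] - x) < abs(x - w[lo]), hi, lo)
}

nearest_index(4.5, c(1, 2, 4, 6, 7))  # 3
nearest_index(5.5, c(1, 2, 4, 6, 7))  # 4
```

findInterval does the binary search, so the lookup stays logarithmic in length(w) for each x, and the function is vectorised over x.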

To do this on character vectors, Martin Morgan suggested this function on R-help:

    bsearch7 <- function(val, tab, L=1L, H=length(tab))
    {
        b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
        i0 <- seq_along(val)
        repeat {
            updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
            tabM <- tab[M]
            val0 <- val[i0]
            i <- tabM < val0
            updt[i] <- M[i] + 1L
            i <- tabM > val0
            updt[i] <- M[i] - 1L
            b[i0 + i * length(val)] <- updt
            i0 <- which(b[i0, "H"] >= b[i0, "L"])
            if (!length(i0)) break;
        }
        b[,"L"] - 1L
    }

    x = 4.5
    w = c(1,2,4,6,7)
    closestLoc = which.min(abs(w - x))  # index of the closest element
    closestVal = w[closestLoc]          # the closest value itself

If your vector is long, try a two-step approach:

    x = 4.5
    w = c(1,2,4,6,7)
    sdev = sapply(w, function(v, x) abs(v - x), x = x)  # step 1: distances
    closestLoc = which.min(sdev)                        # step 2: smallest one

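For freakishly long vectors (millions of rows!), you can parallelise. Warning: this will actually be slower for data that is not very, very large.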

    require(doMC)
    registerDoMC()
    # .combine = c collects the results into a vector instead of a list
    closestLoc = which.min(foreach(i = w, .combine = c) %dopar% { abs(i - x) })

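This example is only meant to give you a basic idea of how to leverage parallel processing when you have huge data. Note that I do not recommend using it for simple, fast functions like abs().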

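You can always implement a custom binary search algorithm to find the closest value. Alternatively, you can leverage the standard libc implementation, bsearch(). You can use other binary search implementations as well, but that does not change the fact that you have to implement the comparison function carefully to find the closest element in an array. The issue with standard binary search implementations is that they are meant for exact comparison. That means your improvised comparison function has to do some extra work to figure out whether an element is close enough. To achieve this, the comparison function needs to be aware of the other elements in the array, in particular: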

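The position of the current element (the one being compared with the key).
The distance from the key, and how it compares with that of the neighbours (the previous or next element).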

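To provide this extra knowledge to the comparison function, the key needs to be packaged with additional information (not just the key value). Once the comparison function is aware of these aspects, it can figure out whether the element itself is the closest. When it knows that it is the closest, it returns "match".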

The following C code finds the closest value.

    #include <stdio.h>
    #include <stdlib.h>

    struct key {
        int key_val;
        int *array_head;
        int array_size;
    };

    int compar(const void *k, const void *e) {
        struct key *key = (struct key*)k;
        int *elem = (int*)e;
        int *arr_first = key->array_head;
        int *arr_last = key->array_head + key->array_size -1;
        int kv = key->key_val;
        int dist_left;
        int dist_right;

        if (kv == *elem) {
            /* easy case: if both same, got to be closest */
            return 0;
        } else if (key->array_size == 1) {
            /* easy case: only element got to be closest */
            return 0;
        } else if (elem == arr_first) {
            /* element is the first in array */
            if (kv < *elem) {
                /* if keyval is less the first element then
                 * first elem is closest.
                 */
                return 0;
            } else {
                /* check distance between first and 2nd elem.
                 * if distance with first elem is smaller, it is closest.
                 */
                dist_left = kv - *elem;
                dist_right = *(elem+1) - kv;
                return (dist_left <= dist_right) ? 0:1;
            }
        } else if (elem == arr_last) {
            /* element is the last in array */
            if (kv > *elem) {
                /* if keyval is larger than the last element then
                 * last elem is closest.
                 */
                return 0;
            } else {
                /* check distance between last and last-but-one.
                 * if distance with last elem is smaller, it is closest.
                 */
                dist_left = kv - *(elem-1);
                dist_right = *elem - kv;
                return (dist_right <= dist_left) ? 0:-1;
            }
        }

        /* condition for remaining cases (other cases are handled already):
         * - elem is neither first or last in the array
         * - array has atleast three elements.
         */

        if (kv < *elem) {
            /* keyval is smaller than elem */

            if (kv <= *(elem -1)) {
                /* keyval is smaller than previous (of "elem") too.
                 * hence, elem cannot be closest.
                 */
                return -1;
            } else {
                /* check distance between elem and elem-prev.
                 * if distance with elem is smaller, it is closest.
                 */
                dist_left = kv - *(elem -1);
                dist_right = *elem - kv;
                return (dist_right <= dist_left) ? 0:-1;
            }
        }

        /* remaining case: (keyval > *elem) */

        if (kv >= *(elem+1)) {
            /* keyval is larger than next (of "elem") too.
             * hence, elem cannot be closest.
             */
            return 1;
        }

        /* check distance between elem and elem-next.
         * if distance with elem is smaller, it is closest.
         */
        dist_right = *(elem+1) - kv;
        dist_left = kv - *elem;
        return (dist_left <= dist_right) ? 0:1;
    }

    int main(int argc, char **argv) {
        int arr[] = {10, 20, 30, 40, 50, 60, 70};
        int *found;
        struct key k;

        if (argc < 2) {
            return 1;
        }

        k.key_val = atoi(argv[1]);
        k.array_head = arr;
        k.array_size = sizeof(arr)/sizeof(int);

        found = (int*)bsearch(&k, arr, sizeof(arr)/sizeof(int),
                              sizeof(int), compar);

        if(found) {
            printf("found closest: %d\n", *found);
        } else {
            printf("closest not found. absurd! \n");
        }

        return 0;
    }

Needless to say, bsearch() in the above example should never fail (unless the array size is zero).

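If you implement your own custom binary search, you essentially have to embed the same comparison logic in the main body of the binary search code (instead of having this logic in the comparison function, as in the above example).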
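As a rough illustration of that idea, here is an R sketch with the neighbour checks moved into the search loop. It is not the answer author's code: closest_bsearch is a hypothetical name, and the sketch assumes arr is a non-empty numeric vector sorted in ascending order.

```r
# Hypothetical sketch: binary search that stops as soon as the probed element
# is at least as close to key as the neighbour on the key's side.
closest_bsearch <- function(key, arr) {
  n <- length(arr)
  stopifnot(n >= 1L)
  lo <- 1L
  hi <- n
  while (lo <= hi) {
    mid <- lo + (hi - lo) %/% 2L
    e <- arr[mid]
    if (key == e) return(mid)           # exact match is always the closest
    if (key < e) {
      if (mid == 1L) return(mid)        # nothing to the left of the first element
      prev <- arr[mid - 1L]
      if (key > prev && (e - key) <= (key - prev)) return(mid)
      hi <- mid - 1L                    # the closest element lies to the left
    } else {
      if (mid == n) return(mid)         # nothing to the right of the last element
      nxt <- arr[mid + 1L]
      if (key < nxt && (key - e) <= (nxt - key)) return(mid)
      lo <- mid + 1L                    # the closest element lies to the right
    }
  }
  stop("not reachable when arr is sorted and non-empty")
}

closest_bsearch(4.5, c(1, 2, 4, 6, 7))              # 3
closest_bsearch(34, c(10, 20, 30, 40, 50, 60, 70))  # 3, i.e. the value 30
```

When the key is exactly halfway between two elements, either one counts as a match, so the index returned depends on which of the two the search probes first.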

    NearestValueSearch = function(x, w){
      ## A simple binary search algo
      ## Assume the w vector is sorted so we can use binary search
      left = 1
      right = length(w)
      while(right - left > 1){
        middle = floor((left + right) / 2)
        if(x < w[middle]){
          right = middle
        }
        else{
          left = middle
        }
      }
      if(abs(x - w[right]) < abs(x - w[left])){
        return(right)
      }
      else{
        return(left)
      }
    }

    x = 4.5
    w = c(1,2,4,6,7)
    NearestValueSearch(x, w) # return 3
