Finding the closest value in a vector with binary search

As a silly toy example, suppose

    x = 4.5
    w = c(1,2,4,6,7)

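I would like to know whether there is a simple R function that finds the index of the closest match to x in w. So if foo were that function, foo(w,x) would return 3. The function match is the right idea, but it seems to apply only to exact matches.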

Solutions such as which.min(abs(w - x)) or which(abs(w - x) == min(abs(w - x))) are all O(n) rather than O(log n) (I am assuming w is already sorted).

You can use data.table to do a binary search:

    library(data.table)
    dt = data.table(w, val = w)  # you'll see why val is needed in a sec
    setattr(dt, "sorted", "w")   # let data.table know that w is sorted

Note that if the column w is not already sorted, you will have to use setkey(dt, w) instead of setattr(.).

    # binary search and "roll" to the nearest neighbour
    dt[J(x), roll = "nearest"]
    #      w val
    # 1: 4.5   4

In the final expression, the val column holds the value you are looking for.

    # or to get the index as Josh points out
    # (and then you don't need the val column):
    dt[J(x), .I, roll = "nearest", by = .EACHI]
    #      w .I
    # 1: 4.5  3

    R> findInterval(4.5, c(1,2,4,5,6))
    [1] 3

That will do it with "Price is Right" matching (closest without going over).
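Because findInterval() only gives the closest element at or below x, it takes one more comparison against the right-hand neighbour to get the true nearest match. A minimal sketch, assuming w is sorted in increasing order and non-empty; nearest_index is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: index of the element of sorted w closest to each x.
# Assumes w is sorted in increasing order and has at least one element.
nearest_index <- function(w, x) {
  n  <- length(w)
  lo <- findInterval(x, w)   # binary search: largest index with w[lo] <= x (0 if none)
  lo <- pmax(lo, 1L)         # x below w[1]: clamp to the first element
  hi <- pmin(lo + 1L, n)     # right-hand neighbour, clamped to the last element
  # pick the right-hand neighbour only if it is strictly closer (ties go left)
  ifelse(abs(w[hi] - x) < abs(x - w[lo]), hi, lo)
}

w <- c(1, 2, 4, 6, 7)
nearest_index(w, 4.5)   # 3
```

The lookup stays O(log n) per query since the only search is the one inside findInterval(), and the function is vectorised over x.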

To do this on character vectors, Martin Morgan suggested this function on R-help:

    bsearch7 <- function(val, tab, L=1L, H=length(tab)) {
        b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
        i0 <- seq_along(val)
        repeat {
            updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
            tabM <- tab[M]
            val0 <- val[i0]
            i <- tabM < val0
            updt[i] <- M[i] + 1L
            i <- tabM > val0
            updt[i] <- M[i] - 1L
            b[i0 + i * length(val)] <- updt
            i0 <- which(b[i0, "H"] >= b[i0, "L"])
            if (!length(i0)) break;
        }
        b[,"L"] - 1L
    }

    x = 4.5
    w = c(1,2,4,6,7)
    closestLoc = which.min(abs(w - x))   # index of the closest element
    closestVal = w[which.min(abs(w - x))]
    # On my phone- please pardon typos

If your vector is very long, try a two-step approach:

    x = 4.5
    w = c(1,2,4,6,7)
    sdev = sapply(w, function(v, x) abs(v - x), x = x)  # step 1: distances
    closestLoc = which.min(sdev)                        # step 2: smallest one

For insanely long vectors (millions of rows!), you can go parallel. Warning: for data that is not very, very large, this will actually be slower.

    require(doMC)
    registerDoMC()
    # .combine = c flattens the per-element results into a numeric vector
    closestLoc = which.min(foreach(i = w, .combine = c) %dopar% { abs(i - x) })

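This example is only meant to give you the basic idea of using parallel processing when you have a huge amount of data. Note that I do not recommend using it for simple, fast functions such as abs().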

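You can always implement a custom binary search algorithm to find the closest value. Alternatively, you can leverage the standard implementation of libc bsearch(). You can use other binary search implementations as well, but that does not change the fact that you have to implement the comparison function carefully to find the closest element in the array. The issue with the standard binary search implementation is that it is meant for exact comparison. That means your improvised comparison function needs to do some extra work to figure out whether an element is close enough. To achieve this, the comparison function needs to be aware of the other elements in the array, in particular the following: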

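  • The position of the current element (the one being compared with the key).
  • Its distance from the key, and how that compares with its neighbours (the previous or next element).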

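To provide this extra knowledge to the comparison function, the key needs to be packaged with additional information (not just the key value). Once the comparison function is aware of these aspects, it can figure out whether the element itself is the closest. When it knows that it is the closest, it returns "match".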

The following C code finds the closest value.

    #include <stdio.h>
    #include <stdlib.h>

    struct key {
        int key_val;
        int *array_head;
        int array_size;
    };

    int compar(const void *k, const void *e)
    {
        struct key *key = (struct key*)k;
        int *elem = (int*)e;
        int *arr_first = key->array_head;
        int *arr_last = key->array_head + key->array_size -1;
        int kv = key->key_val;
        int dist_left;
        int dist_right;

        if (kv == *elem) {
            /* easy case: if both same, got to be closest */
            return 0;
        } else if (key->array_size == 1) {
            /* easy case: only element got to be closest */
            return 0;
        } else if (elem == arr_first) {
            /* element is the first in array */
            if (kv < *elem) {
                /* if keyval is less the first element then
                 * first elem is closest.
                 */
                return 0;
            } else {
                /* check distance between first and 2nd elem.
                 * if distance with first elem is smaller, it is closest.
                 */
                dist_left = kv - *elem;
                dist_right = *(elem+1) - kv;
                return (dist_left <= dist_right) ? 0:1;
            }
        } else if (elem == arr_last) {
            /* element is the last in array */
            if (kv > *elem) {
                /* if keyval is larger than the last element then
                 * last elem is closest.
                 */
                return 0;
            } else {
                /* check distance between last and last-but-one.
                 * if distance with last elem is smaller, it is closest.
                 */
                dist_left = kv - *(elem-1);
                dist_right = *elem - kv;
                return (dist_right <= dist_left) ? 0:-1;
            }
        }

        /* condition for remaining cases (other cases are handled already):
         * - elem is neither first or last in the array
         * - array has atleast three elements.
         */
        if (kv < *elem) {
            /* keyval is smaller than elem */
            if (kv <= *(elem -1)) {
                /* keyval is smaller than previous (of "elem") too.
                 * hence, elem cannot be closest.
                 */
                return -1;
            } else {
                /* check distance between elem and elem-prev.
                 * if distance with elem is smaller, it is closest.
                 */
                dist_left = kv - *(elem -1);
                dist_right = *elem - kv;
                return (dist_right <= dist_left) ? 0:-1;
            }
        }

        /* remaining case: (keyval > *elem) */
        if (kv >= *(elem+1)) {
            /* keyval is larger than next (of "elem") too.
             * hence, elem cannot be closest.
             */
            return 1;
        }

        /* check distance between elem and elem-next.
         * if distance with elem is smaller, it is closest.
         */
        dist_right = *(elem+1) - kv;
        dist_left = kv - *elem;
        return (dist_left <= dist_right) ? 0:1;
    }

    int main(int argc, char **argv)
    {
        int arr[] = {10, 20, 30, 40, 50, 60, 70};
        int *found;
        struct key k;

        if (argc < 2) {
            return 1;
        }

        k.key_val = atoi(argv[1]);
        k.array_head = arr;
        k.array_size = sizeof(arr)/sizeof(int);

        found = (int*)bsearch(&k, arr, sizeof(arr)/sizeof(int),
                              sizeof(int), compar);
        if(found) {
            printf("found closest: %d\n", *found);
        } else {
            printf("closest not found. absurd! \n");
        }
        return 0;
    }

Needless to say, bsearch() in the example above should never fail (unless the array size is zero).

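If you implement your own custom binary search, you essentially have to embed the same comparison logic in the main body of the binary search code (instead of having this logic in the comparison function, as in the example above).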

    NearestValueSearch = function(x, w){
      ## A simple binary search algo
      ## Assume the w vector is sorted so we can use binary search
      left = 1
      right = length(w)
      while(right - left > 1){
        middle = floor((left + right) / 2)
        if(x < w[middle]){
          right = middle
        }
        else{
          left = middle
        }
      }
      if(abs(x - w[right]) < abs(x - w[left])){
        return(right)
      }
      else{
        return(left)
      }
    }

    x = 4.5
    w = c(1,2,4,6,7)
    NearestValueSearch(x, w) # return 3