Random sample of character vectors, with no elements prefixing one another

Consider a character vector pool, whose elements are (zero-padded) binary numbers with up to max_len digits.

max_len <- 4
pool <- unlist(lapply(seq_len(max_len),
  function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
pool
##  [1] "0"    "1"    "00"   "10"   "01"   "11"   "000"  "100"  "010"  "110"
## [11] "001"  "101"  "011"  "111"  "0000" "1000" "0100" "1100" "0010" "1010"
## [21] "0110" "1110" "0001" "1001" "0101" "1101" "0011" "1011" "0111" "1111"

I'd like to sample n of these elements, subject to the constraint that none of the sampled elements is a prefix of any other sampled element (i.e., if we sample 1101, we disallow 11010, while if we sample 1, we disallow elements beginning with 1, such as 100, etc.).
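To make the constraint concrete, here is a minimal checker (an illustrative helper I'm adding for this post; the name is_prefix_free is my own, not part of my actual attempt below):

# TRUE if no element of `x` is a prefix of another element of `x`
is_prefix_free <- function(x) {
  for (i in seq_along(x)) {
    # exclude x[i] itself, since every string trivially prefixes itself
    if (any(startsWith(x[-i], x[i]))) return(FALSE)
  }
  TRUE
}
is_prefix_free(c("1101", "100", "0"))  # TRUE
is_prefix_free(c("1", "100"))          # FALSE: "1" prefixes "100"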

Here is my attempt using a while loop, but of course this is slow when n is large (or close to 2^max_len).

set.seed(1)
n <- 10
chosen <- sample(pool, n)
while(any(rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1)) {
  prefixes <- rowSums(outer(paste0('^', chosen), chosen, Vectorize(grepl))) > 1
  pool <- pool[rowSums(Vectorize(grepl, 'pattern')(
    paste0('^', chosen[!prefixes]), pool)) == 0]
  chosen <- c(chosen[!prefixes], sample(pool, sum(prefixes)))
}
chosen
##  [1] "0100" "0101" "0001" "0011" "1000" "111"  "0000" "0110" "1100" "0111"

This can be improved slightly by initially removing from pool those elements whose inclusion would leave insufficient elements to achieve a total sample of size n. For example, when max_len == 4 and n > 9, we can immediately remove 0 and 1 from pool, since by including either of them the maximum possible sample is 9 (either 0 plus the eight 4-character elements beginning with 1, or 1 plus the eight 4-character elements beginning with 0).

Based on this logic, we could omit elements from pool before taking the initial sample, along these lines:

pool <- pool[
  nchar(pool) > tail(which(n > (2^max_len - rev(2^(0:max_len))[-1] + 1)), 1)]

Can anyone think of a better approach? I feel like I'm overlooking something much simpler.


Edit

To clarify my intent, I'll portray the pool as a set of branches, where junctions and branch tips are the nodes (pool elements). Suppose the yellow node in the figure below (i.e., 010) is drawn. Now, the entire red "branch" consisting of nodes 0, 01 and 010 is removed from the pool. This is what I mean by forbidding the sampling of nodes that "prefix" nodes already in our sample (as well as those already prefixed by nodes in our sample).

[figure: tree with the yellow node 010 drawn and the red branch 0 → 01 → 010 removed]

If the sampled node sits partway along a branch, like 01 in the figure below, then all of the red nodes (0, 01, 010 and 011) are disallowed, since 0 prefixes 01, and 01 prefixes both 010 and 011.

[figure: tree with node 01 drawn and the red nodes 0, 01, 010 and 011 disallowed]

I don't mean to sample a 1 or 0 at each junction (i.e., flip a coin at each fork along a branch) – any node can be in the sample, as long as: (1) no parent (or grandparent, etc.) or child (grandchild, etc.) of the node is also sampled; and (2) when the node is sampled, enough nodes will remain to achieve the desired sample of size n.

In the second figure above, if 010 were the first selection, then assuming n <= 4, all nodes at the black positions are still (currently) valid. For example, if n==4 and we next sample node 1 (so our selections now include 010 and 1), we would subsequently disallow node 00 (due to rule 2 above), but could still select 000 and 001, completing our 4-element sample. On the other hand, if n==5, node 1 would be disallowed at this stage.

Introduction

This is a numeric variant of the string algorithm we implemented in the other answer. It is faster and requires neither creating nor sorting the pool.

Algorithm outline

We can use integers to represent your binary strings, which greatly simplifies the problem of pool generation and sequential elimination of values. For example, with max_len==3, we can represent the number 1-- (where - denotes padding) as 4 in decimal. Furthermore, if we select this number, we can determine that the numbers needing elimination are 4 through 4 + 2 ^ x - 1. Here x is the number of padding elements (2 in this case), so the numbers to eliminate are between 4 and 4 + 2 ^ 2 - 1 (i.e., between 4 and 7, or in binary, between 100 and 111).
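As a minimal sketch of that arithmetic (my own illustration, assuming max_len==3 as above):

v <- strtoi("100", base = 2)  # "1--" zero-filled is 100, i.e. decimal 4
x <- 2                        # number of padded positions
v:(v + 2^x - 1)               # 4 5 6 7, i.e. binary 100 through 111 are eliminated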

To match your problem exactly, we need one wrinkle, because you treat numbers that may be identical in binary as distinct for some parts of your algorithm. For example, 1--, 10- and 100 are all the same number, but need to be treated differently in your scheme. In a max_len==3 world, we have 8 possible numbers, but 14 possible representations:

0 - 000: 0--, 00-
1 - 001:
2 - 010: 01-
3 - 011:
4 - 100: 1--, 10-
5 - 101:
6 - 110: 11-
7 - 111:

So 0 and 4 have three possible encodings, 2 and 6 have two, and the rest only one. We need to generate an integer pool that represents the higher probability of selecting the numbers with multiple representations, as well as a mechanism for tracking how many blanks a number contains. We can do this by appending a few bits to the end of each number to indicate the weights we want. So our numbers become (we use two bits here):

jbaum | int | bin | bin.enc | int.enc
  0-- |   0 | 000 |   00000 |       0
  00- |   0 | 000 |   00001 |       1
  000 |   0 | 000 |   00010 |       2
  001 |   1 | 001 |   00110 |       6
  01- |   2 | 010 |   01001 |       9
  010 |   2 | 010 |   01010 |      10
  011 |   3 | 011 |   01110 |      14
  1-- |   4 | 100 |   10000 |      16
  10- |   4 | 100 |   10001 |      17
  100 |   4 | 100 |   10010 |      18
  101 |   5 | 101 |   10110 |      22
  11- |   6 | 110 |   11001 |      25
  110 |   6 | 110 |   11010 |      26
  111 |   7 | 111 |   11110 |      30

Some useful properties:

  • enc.bits represents how many bits we need for the encoding (two in this case)
  • int.enc %% 2^enc.bits tells us how many digits are explicitly specified (offset by one, see below)
  • int.enc %/% 2^enc.bits returns int
  • int * 2^enc.bits + explicitly.specified returns int.enc

Note that since at least one digit is always specified, explicitly.specified ranges between 0 and max_len - 1 in our implementation. We now have an encoding that fully represents your data structure using only integers. We can sample from these integers and reproduce your desired outcome, with the correct weights, etc. One limitation of this approach is that we use 32-bit integers in R, and we must reserve some bits for the encoding, so we are limited to pools of around max_len==25. You could go bigger with integers specified by double-precision floating point, but we did not do that here.
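A minimal sketch of the round trip, assuming enc.bits == 2 as in the table above (explicitly.specified is the number of specified digits minus one):

enc.bits <- 2
int <- 4; explicitly.specified <- 1       # "10-": value 100, two digits given
int.enc <- int * 2^enc.bits + explicitly.specified
int.enc                                   # 17
int.enc %/% 2^enc.bits                    # 4, recovers `int`
int.enc %% 2^enc.bits                     # 1, recovers `explicitly.specified`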

Avoiding duplicate selections

There are two rough approaches to ensuring we don't select the same value twice:

  1. Keep track of the values available for selection, and sample randomly from those
  2. Sample randomly from all possible values, then check whether the value was previously selected, and if so, sample again

While the first option seems cleanest, it is actually very expensive computationally. It requires either a vector scan of all possible values for every pick to pre-disqualify selected values, or creating a shrinking vector containing the non-disqualified values. The shrinking option is only more efficient than the vector scan if the shrinking is done by reference via C code, but even then it requires repeatedly translating large portions of the vector, and it requires C.

Here we use approach #2. This allows us to randomly shuffle the universe of possible values once, then pick each value in sequence, check whether it has been disqualified, and if so, pick another, and so on. This works because checking whether a value has been picked is cheap thanks to our value encoding; we can infer a value's position in a sorted table based on the value alone. So we record the status of each value in a sorted table, and can update or look up that status via direct index access (no scan required).
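As a rough sketch of this approach (not the gist implementation; mark_disqualified stands in for the encoding-based status update described above):

sample_approach2 <- function(universe, n, mark_disqualified) {
  status <- rep(TRUE, length(universe))  # TRUE = still allowed
  shuffled <- sample(length(universe))   # one up-front shuffle
  picked <- integer(0)
  for (i in shuffled) {
    if (!status[i]) next                 # O(1) status lookup, no vector scan
    picked <- c(picked, i)
    status <- mark_disqualified(status, i)
    if (length(picked) == n) break
  }
  universe[picked]
}
# e.g., with a no-op disqualifier this reduces to plain sampling:
# sample_approach2(letters, 3, function(status, i) status)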

Example

An R implementation of this algorithm is available as a gist. This particular implementation only performs complete draws. Here are 10 draws of 8 elements each from a max_len==4 pool:

# each column represents a draw from a `max_len==4` pool
set.seed(6); replicate(10, sample0110b(4, 8))
     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10]
[1,] "1000" "1"    "0011" "0010" "100"  "0011" "0"    "011"  "0100" "1011"
[2,] "111"  "0000" "1101" "0000" "0110" "0100" "1000" "00"   "0101" "1001"
[3,] "0011" "0110" "1001" "0100" "0000" "0101" "1101" "1111" "10"   "1100"
[4,] "0100" "0010" "0000" "0101" "1101" "101"  "1011" "1101" "0110" "1101"
[5,] "101"  "0100" "1100" "1100" "0101" "1001" "1001" "1000" "1111" "1111"
[6,] "110"  "0111" "1011" "111"  "1011" "110"  "1111" "0100" "0011" "000"
[7,] "0101" "0101" "111"  "011"  "1010" "1000" "1100" "101"  "0001" "0101"
[8,] "011"  "0001" "01"   "1010" "0011" "1110" "1110" "1001" "110"  "1000"

We originally also had two implementations relying on approach #1 to avoid duplicates, one in base R and one with some C, but even the C version is not as fast as the new base R version when n is large. Those functions do implement the ability to make incomplete draws, so we provide them here for reference:

  • Original base R
  • Original with some C

Comparative benchmarks

Here is a set of benchmarks comparing several of the functions shown in this Q/A. Times are in milliseconds. The brodie.b version is the one described in this answer. brodie is the original implementation, brodie.C is the original with some C. All of these enforce the requirement of complete samples. brodie.str is the string-based version in the other answer.

   size    n  jbaum josilber  frank tensibai brodie.b brodie brodie.C brodie.str
1     4   10     11        1      3        1        1      1        1          0
2     4   50      -        -      -        1        -      -        -          1
3     4  100      -        -      -        1        -      -        -          0
4     4  256      -        -      -        1        -      -        -          1
5     4 1000      -        -      -        1        -      -        -          1
6     8   10      1      290      6        3        2      2        1          1
7     8   50    388        -      8        8        3      4        3          4
8     8  100  2,506        -     13       18        6      7        5          5
9     8  256      -        -     22       27       13     14       12          6
10    8 1000      -        -      -       27        -      -        -          7
11   16   10      -        -    615      688       31     61       19        424
12   16   50      -        -  2,123    2,497       28    276       19      1,764
13   16  100      -        -  4,202    4,807       30    451       23      3,166
14   16  256      -        - 11,822   11,942       40  1,077       43      8,717
15   16 1000      -        - 38,132   44,591       83  3,345      130     27,768

This scales relatively well to larger pools:

system.time(sample0110b(18, 100000))
   user  system elapsed
  8.441   0.079   8.527

Benchmark notes:

  • frank, and brodie (excluding brodie.str), do not require any pre-generated pool, which will affect the comparisons (see below)
  • josilber is the LP version
  • jbaum is the OP's example
  • tensibai's is modified slightly to exit rather than fail if the pool is empty
  • I don't have python set up, so I can't compare / account for the buffering
  • - represents options that were infeasible or too slow to time reasonably

The timings do not include generating the pools (required for the jbaum, josilber and brodie.str runs for sizes 8 and 16), or sorting them, which brodie.str requires in addition to the draw. Whether you want to include these depends on how many times you run the function for a particular pool. Also, there are almost certainly better ways to generate/sort the pools.

These are the median times from three microbenchmark runs. The code is available as a gist, though note that you must load the sample0110b, sample0110, sample01101 and sample01 functions beforehand.

I found this question so interesting that I gave it a try in R; this can probably be improved:

Updated, edited version, thanks to @Franck's suggestion:

library(microbenchmark)
library(lineprof)

max_len <- 16
pool <- unlist(lapply(seq_len(max_len),
  function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
n <- 100

library(stringr)

tree_sample <- function(samples, pool) {
  results <- vector("integer", samples)
  # Will be used on a regular basis, compute it in advance
  PoolLen <- str_length(pool)
  # Make a mask vector based on the length of each entry of the pool
  masks <- strtoi(str_pad(str_pad("1", PoolLen, "right", "1"),
                          max_len, "right", "0"), base=2)

  # Make an integer vector from the "0" right-padded original: for max_len=4
  # and pool entry "1" we get "1000" => 8.
  # This will allow us to find this entry as parent of 10 and 11, which become
  # "1000" and "1100", i.e. integers 8 and 12 respectively; once bitwise
  # "anded" with the respective mask "1000" the first bit is strictly the
  # same, so it's a parent.
  integerPool <- strtoi(str_pad(pool, max_len, "right", "0"), base=2)

  # Create a vector to filter the available values to sample
  ok <- rep(TRUE, length(pool))

  # Precompute the result of the bitwise and between our integer pool and the masks
  MaskedPool <- bitwAnd(integerPool, masks)

  while(samples) {
    samp <- sample(pool[ok], 1)  # Get a sample
    results[samples] <- samp     # Store it as result
    ok[pool == samp] <- FALSE    # Remove it from available entries

    # Get the integer value of the "0" right-padded sample
    vsamp <- strtoi(str_pad(samp, max_len, "right", "0"), base=2)
    mlen <- str_length(samp)     # Get sample length

    # Creation of a unitary mask to remove children of the sample
    mask <- strtoi(paste0(rep(1:0, c(mlen, max_len - mlen)), collapse=""), base=2)

    # Get the result of bitwise And between the integerPool and the sample mask
    FilterVec <- bitwAnd(integerPool, mask)
    # Get the bitwise and result of the sample and its mask
    Childm <- bitwAnd(vsamp, mask)

    # Remove from available entries the children of the sample
    ok[FilterVec == Childm] <- FALSE
    # Compare the sample with all the masks to remove matching parents
    ok[MaskedPool == bitwAnd(vsamp, masks)] <- FALSE

    samples <- samples - 1
  }
  print(results)
}
microbenchmark(tree_sample(n, pool), times=10L)

The main idea is to use bitmask comparisons to determine whether one sample is a parent of another (common bit portion), and if so, to suppress that element from the pool.
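A minimal illustration of that mask test (my own, assuming max_len == 4): entries are right-padded with "0" and converted to integers, and the mask has 1s over the digits the candidate parent actually specifies.

parent <- strtoi("1000", base = 2)  # "1"  padded to 4 digits -> 8
child  <- strtoi("1100", base = 2)  # "11" padded to 4 digits -> 12
other  <- strtoi("0100", base = 2)  # "01" padded to 4 digits -> 4
mask   <- strtoi("1000", base = 2)  # "1" specifies only its first digit
bitwAnd(child, mask) == bitwAnd(parent, mask)  # TRUE:  "1" prefixes "11"
bitwAnd(other, mask) == bitwAnd(parent, mask)  # FALSE: "1" doesn't prefix "01"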

It now takes 1.4 seconds on my machine to draw 100 samples from a pool with max_len of 16.

You can sort the pool to help decide which elements to disqualify. For example, looking at a three-character sorted pool:

  [1] "0" "00" "000" "001" "01" "010" "011" "1" "10" "100" "101" "11" [13] "110" "111" 

I can tell that I can disqualify any items following my selection that have more characters than my item, up until the first item with the same number of characters or fewer. For example, if I select "01", I can immediately see that the next two items ("010", "011") need to be removed, but not the one after that, because "1" has fewer characters. Removing "0" afterwards is easy. Here is an implementation:

library(fastmatch)  # could use `match`, but we repeatedly search against same hash

# `pool` must be sorted!
sample01 <- function(pool, n) {
  picked <- logical(length(pool))
  chrs <- nchar(pool)
  pick.list <- character(n)
  pool.seq <- seq_along(pool)

  for(i in seq(n)) {
    # Make sure pool not exhausted
    left <- which(!picked)
    left.len <- length(left)
    if(!length(left)) break

    # Sample from pool
    seq.left <- seq.int(left)
    pool.left <- pool[left]
    chrs.left <- chrs[left]
    pick <- sample(length(pool.left), 1L)

    # Find all the elements with more characters that are disqualified
    # and store their indices in `valid` (bad name...)
    valid.tmp <- chrs.left > chrs.left[[pick]] & seq.left > pick
    first.invalid <- which(!valid.tmp & seq.left > pick)
    valid <- if(length(first.invalid)) {
      pick:(first.invalid[[1L]] - 1L)
    } else pick:left.len

    # Translate back to original pool indices since we're working on a
    # subset in `pool.left`
    pool.seq.left <- pool.seq[left]
    pool.idx <- pool.seq.left[valid]
    val <- pool[[pool.idx[[1L]]]]

    # Record the picked value, and all the disqualifications
    pick.list[[i]] <- val
    picked[pool.idx] <- TRUE

    # Disqualify shorter matches
    to.rem <- vapply(
      seq.int(nchar(val) - 1), substr, character(1L), x=val, start=1L
    )
    to.rem.idx <- fmatch(to.rem, pool, nomatch=0)
    picked[to.rem.idx] <- TRUE
  }
  pick.list
}

And a function to create a sorted pool (exactly like your code, but returning it sorted):

make_pool <- function(size)
  sort(
    unlist(
      lapply(
        seq_len(size),
        function(x) do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))
  ) ) )

Then, using a max_len==3 pool (small enough to visually check things behave as expected):

pool3 <- make_pool(3)
set.seed(1)
sample01(pool3, 8)
# [1] "001" "1"   "010" "011" "000" ""    ""    ""
sample01(pool3, 8)
# [1] "110" "111" "011" "10"  "00"  ""    ""    ""
sample01(pool3, 8)
# [1] "000" "01"  "11"  "10"  "001" ""    ""    ""
sample01(pool3, 8)
# [1] "011" "101" "111" "001" "110" "100" "000" "010"

Notice that in the last example we get all the 3-digit binary combinations (2^3) because by chance we kept sampling 3-digit ones. Also, with a size-3 pool there are many samplings that prevent a full 8 draws; you could address this with your suggestion of eliminating the combinations that prevent complete draws from the pool.

This is fast. Look at the max_len==9 example, which took 2 seconds with the alternate solution:

pool9 <- make_pool(9)
microbenchmark(sample01(pool9, 4))
# Unit: microseconds
#                expr     min      lq  median      uq     max neval
#  sample01(pool9, 4) 493.107 565.015 571.624 593.791 983.663   100

About half a millisecond. You can also reasonably try fairly large pools:

pool16 <- make_pool(16)  # 131K entries
system.time(sample01(pool16, 100))
#    user  system elapsed
#   3.407   0.146   3.552

That isn't very fast, but then we're talking about a pool with 130K items. There is also potential for additional optimization.

Note that the sorting step gets relatively slow for large pools, but I'm not counting it because you only need to do it once, and you could probably come up with a reasonable algorithm to produce the pool pre-sorted.
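For instance, one such sketch (my own, not benchmarked): since the sorted order is just a pre-order walk of the prefix tree, the pool can be emitted directly in sorted order.

make_pool_sorted <- function(max_len, prefix = "") {
  if (nchar(prefix) == max_len) return(character(0))
  unlist(lapply(c("0", "1"), function(d) {
    s <- paste0(prefix, d)
    c(s, make_pool_sorted(max_len, s))   # emit node, then its subtree
  }))
}
# should match make_pool(3) (up to locale-dependent sort order)
make_pool_sorted(3)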

There is also the possibility of a much faster integer-to-binary based approach that I explored in a now-deleted answer, but that requires a bit more work to find exactly what you're looking for.

This is Python, not R, but jbaums said that would be okay.

So here is my contribution; see the comments in the source for an explanation of the crucial parts.
I'm still working on an analytical solution to determine the number of possible combinations c of a sample of size s taken from a tree of depth t, so that I can improve the function combs. Maybe someone has it? This is really the bottleneck now.

Sampling 100 nodes from a tree of depth 16 takes about 8 ms on my laptop. Not the first time, but it gets faster the more you sample, because combBuffer gets filled.

import random


class Tree(object):
    """
    :param level: The distance of this node from the root.
    :type level: int
    :param parent: This tree's parent node
    :type parent: Tree
    :param isleft: Determines if this is a left or a right child node.
                   Can be omitted if this is the root node.
    :type isleft: bool

    A binary tree representing possible strings which match r'[01]{1,n}'.
    Its purpose is to be able to sample n of its nodes where none of the
    sampled nodes' ids is a prefix for another one.
    It is possible to change Tree.maxdepth and then reuse the root. All
    children are created ON DEMAND, which means everything is lazily
    evaluated. If the Tree gets too big anyway, you can call 'prune' on
    any node to delete its children.

    >>> t = Tree()
    >>> t.sample(8, toString=True, depth=3)
    ['111', '110', '101', '100', '011', '010', '001', '000']
    >>> Tree.maxdepth = 2
    >>> t.sample(4, toString=True)
    ['11', '10', '01', '00']
    """

    maxdepth = 10
    _combBuffer = {}

    def __init__(self, level=0, parent=None, isleft=None):
        self.parent = parent
        self.level = level
        self.isleft = isleft
        self._left = None
        self._right = None

    @classmethod
    def setMaxdepth(cls, depth):
        """
        :param depth: The new depth
        :type depth: int

        Sets the maxdepth of the Trees. This basically is the depth of
        the root node.
        """
        if cls.maxdepth == depth:
            return
        cls.maxdepth = depth

    @property
    def left(self):
        """This tree's left child, 'None' if this is a leaf node"""
        if self.depth == 0:
            return None
        if self._left is None:
            self._left = Tree(self.level+1, self, True)
        return self._left

    @property
    def right(self):
        """This tree's right child, 'None' if this is a leaf node"""
        if self.depth == 0:
            return None
        if self._right is None:
            self._right = Tree(self.level+1, self, False)
        return self._right

    @property
    def depth(self):
        """
        This tree's depth. (maxdepth-level)
        """
        return self.maxdepth-self.level

    @property
    def id(self):
        """
        This tree's id, string of '0's and '1's equal to the path from
        the root to this subtree. Where '1' means going left and '0'
        means going right.
        """
        # level 0 is the root node, it has no id
        if self.level == 0:
            return ''
        # This takes at most Tree.maxdepth recursions. Therefore it is
        # safe to do it this way. We could also save each node's id once
        # it is created to avoid recreating it every time, however this
        # won't save much time but use quite some space.
        return self.parent.id + ('1' if self.isleft else '0')

    @property
    def leaves(self):
        """
        The amount of leaf nodes this tree has. (2**depth)
        """
        return 2**self.depth

    def __str__(self):
        return self.id

    def __len__(self):
        return 2*self.leaves-1

    def prune(self):
        """
        Recursively prune this tree's children.
        """
        if self._left is not None:
            self._left.prune()
            self._left.parent = None
            self._left = None
        if self._right is not None:
            self._right.prune()
            self._right.parent = None
            self._right = None

    def combs(self, n):
        """
        :param n: The amount of samples to be taken from this tree
        :type n: int
        :returns: The amount of possible combinations to choose n
                  samples from this tree

        Determines recursively the amount of combinations of n nodes to
        be sampled from this tree.
        Subsequent calls with same n on trees with same depth will
        return the result from the previous computation rather than
        computing it again.

        >>> t = Tree()
        >>> Tree.maxdepth = 4
        >>> t.combs(16)
        1
        >>> Tree.maxdepth = 3
        >>> t.combs(6)
        58
        """
        # important for the amount of combinations is only n and the
        # depth of this tree
        key = (self.depth, n)

        # We use the dict to save computation time. Calling the function
        # with equal values on equal nodes just returns the already
        # computed value if possible.
        if key not in Tree._combBuffer:
            leaves = self.leaves

            if n < 0:
                N = 0
            elif n == 0 or self.depth == 0 or n == leaves:
                N = 1
            elif n == 1:
                return (2*leaves-1)
            else:
                if n > leaves/2:
                    # if n > leaves/2, at least n-leaves/2 have to stay
                    # on either side, otherwise the other one would have
                    # to sample more nodes than possible.
                    nMin = n-leaves/2
                else:
                    nMin = 0

                # The rest n-2*nMin is the amount of samples that are
                # free to fall on either side
                free = n-2*nMin

                N = 0
                # sum up the combinations of all possible splits
                for addLeft in range(0, free+1):
                    nLeft = nMin + addLeft
                    nRight = n - nLeft
                    N += self.left.combs(nLeft)*self.right.combs(nRight)

            Tree._combBuffer[key] = N
            return N
        return Tree._combBuffer[key]

    def sample(self, n, toString=False, depth=None):
        """
        :param n: How many samples to take from this tree
        :type n: int
        :param toString: If 'True' result will directly be turned into a
                         list of strings
        :type toString: bool
        :param depth: If not None, will overwrite Tree.maxdepth
        :type depth: int or None
        :returns: List of n nodes sampled from this tree
        :throws ValueError: when n is invalid

        Takes n random samples from this tree where none of the sample's
        ids is a prefix for another one's.

        For an example see Tree's docstring.
        """
        if depth is not None:
            Tree.setMaxdepth(depth)

        if toString:
            return [str(e) for e in self.sample(n)]

        if n < 0:
            raise ValueError('Negative sample size is not possible!')

        if n == 0:
            return []

        leaves = self.leaves
        if n > leaves:
            raise ValueError(('Cannot sample {} nodes, with only {} '
                              'leaves!').format(n, leaves))

        # Only one sample to choose, that is nice! We are free to take
        # any node from this tree, including this very node itself.
        if n == 1 and self.level > 0:
            # This tree has 2*leaves-1 nodes, therefore the probability
            # that we keep the root node has to be
            # 1/(2*leaves-1) = P_root. Lets create a random number from
            # the interval [0, 2*leaves-1).
            # It will be 0 with probability 1/(2*leaves-1)
            P_root = random.randint(0, len(self)-1)
            if P_root == 0:
                return [self]
            else:
                # The probability to land here is 1-P_root
                # A child tree's size is (leaves-1) and since it obeys
                # the same rule as above, the probability for each of
                # its nodes to 'survive' is 1/(leaves-1) = P_child.
                # However all nodes must have equal probability,
                # therefore to make sure that their probability is also
                # P_root we multiply them by 1/2*(1-P_root). The latter
                # is already done, the former will be achieved by the
                # next condition.
                # If we do everything right, this should hold:
                # 1/2 * (1-P_root) * P_child = P_root
                # Lets see...
                # 1/2 * (1-1/(2*leaves-1)) * (1/leaves-1)
                # (1-1/(2*leaves-1)) * (1/(2*(leaves-1)))
                # (1-1/(2*leaves-1)) * (1/(2*leaves-2))
                # (1/(2*leaves-2)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-1)/((2*leaves-2) * (2*leaves-1)) - 1/((2*leaves-2) * (2*leaves-1))
                # (2*leaves-2)/((2*leaves-2) * (2*leaves-1))
                # 1/(2*leaves-1)
                # There we go!
                if random.random() < 0.5:
                    return self.right.sample(1)
                else:
                    return self.left.sample(1)

        # Now comes the tricky part... n > 1, therefore we are NOT going
        # to sample this node. Its probability to be chosen is 0!
        # It HAS to be 0 since we are definitely sampling from one of its
        # children, which means that this node will be blocked by those
        # samples. The difficult part now is to prove that the sampling
        # the way we do it is really random.

        if n > leaves/2:
            # if n > leaves/2, at least n-leaves/2 have to stay on either
            # side, otherwise the other one would have to sample more
            # nodes than possible.
            nMin = n-leaves/2
        else:
            nMin = 0
        # The rest n-2*nMin is the amount of samples that are free to
        # fall on either side
        free = n-2*nMin

        # Let's have a look at an example, suppose we were to distribute
        # 5 samples among two children which have 4 leaves each.
        # Each child has to get at least 1 sample, so the free samples
        # are 3. There are 4 different ways to split the samples among
        # the children (left, right):
        # (1, 4), (2, 3), (3, 2), (4, 1)
        # The amount of unique sample combinations per child are
        # (7, 1), (11, 6), (6, 11), (1, 7)
        # The amount of total unique samples per possible split are
        #  7    , 66     , 66     , 7
        # In case of the first and last split, all samples have a
        # probability of 1/7, this was already proven above.
        # Lets suppose we are good to go and the per sample probabilities
        # for the other two cases are (1/11, 1/6) and (1/6, 1/11), this
        # way the overall per sample probabilities for the splits would
        # be:
        # 1/7  , 1/66   , 1/66   , 1/7
        # If we used uniform random to determine the split, all splits
        # would be equally probable and therefore be multiplied with the
        # same value (1/4). But this would mean that NOT every sample is
        # equally probable!
        # We need to know in advance how many sample combinations there
        # will be for a given split in order to find out the probability
        # to choose it. In fact, due to the restrictions, this becomes
        # very nasty to determine. So instead of solving it analytically,
        # I do it numerically with the method 'combs'. It gives me the
        # amount of possible sample combinations for a certain amount of
        # samples and a given tree depth. It will return 146 for this
        # node and 7 for the outer and 66 for the inner splits.
        # What we now do is, we take a number from [0, 146).
        # If it is smaller than 7, we sample from the first split,
        # if it is smaller than 7+66, we sample from the second split,
        # ...
        # This way we get the probabilities we need.
        r = random.randint(0, self.combs(n)-1)
        p = 0
        for addLeft in xrange(0, free+1):
            nLeft = nMin + addLeft
            nRight = n - nLeft
            p += (self.left.combs(nLeft) * self.right.combs(nRight))
            if r < p:
                return self.left.sample(nLeft) + self.right.sample(nRight)
        assert False, ('Something really strange happened, p did not sum '
                       'up to combs or r was too big')


def main():
    """
    Do a microbenchmark.
    """
    import timeit
    i = 1
    main.t = Tree()
    template = ' {:>2}  {:>5}  {:>4}  {:<5}'
    print(template.format('i', 'depth', 'n', 'time (ms)'))
    N = 100
    for depth in [4, 8, 15, 16, 17, 18]:
        for n in [10, 50, 100, 150]:
            if n > 2**depth:
                time = '--'
            else:
                time = timeit.timeit(
                    'main.t.sample({}, depth={})'.format(n, depth), setup=
                    'from __main__ import main', number=N)*1000./N
            print(template.format(i, depth, n, time))
            i += 1


if __name__ == "__main__":
    main()

The benchmark output:

  i  depth    n  time (ms)
  1      4   10  0.182511806488
  2      4   50  --
  3      4  100  --
  4      4  150  --
  5      8   10  0.397620201111
  6      8   50  1.66054964066
  7      8  100  2.90236949921
  8      8  150  3.48146915436
  9     15   10  0.804011821747
 10     15   50  3.7428188324
 11     15  100  7.34910964966
 12     15  150  10.8230614662
 13     16   10  0.804491043091
 14     16   50  3.66818904877
 15     16  100  7.09567070007
 16     16  150  10.404779911
 17     17   10  0.865840911865
 18     17   50  3.9999294281
 19     17  100  7.70257949829
 20     17  150  11.3758206367
 21     18   10  0.915451049805
 22     18   50  4.22935962677
 23     18  100  8.22361946106
 24     18  150  12.2081303596

10 samples of size 10 from a tree with depth 10:

['1111010111', '1110111010', '1010111010', '011110010', '0111100001', '011101110', '01110010', '01001111', '0001000100', '000001010']
['110', '0110101110', '0110001100', '0011110', '0001111011', '0001100010', '0001100001', '0001100000', '0000011010', '0000001111']
['11010000', '1011111101', '1010001101', '1001110001', '1001100110', '10001110', '011111110', '011001100', '0101110000', '001110101']
['11111101', '110111', '110110111', '1101010101', '1101001011', '1001001100', '100100010', '0100001010', '0100000111', '0010010110']
['111101000', '1110111101', '1101101', '1101000000', '1011110001', '0111111101', '01101011', '011010011', '01100010', '0101100110']
['1111110001', '11000110', '1100010100', '101010000', '1010010001', '100011001', '100000110', '0100001111', '001101100', '0001101101']
['111110010', '1110100', '1101000011', '101101', '101000101', '1000001010', '0111100', '0101010011', '0101000110', '000100111']
['111100111', '1110001110', '1100111111', '1100110010', '11000110', '1011111111', '0111111', '0110000100', '0100011', '0010110111']
['1101011010', '1011111', '1011100100', '1010000010', '10010', '1000010100', '0111011111', '01010101', '001101', '000101100']
['111111110', '111101001', '1110111011', '111011011', '1001011101', '1000010100', '0111010101', '010100110', '0100001101', '0010000000']

Mapping ids to strings. You can map numbers to your 0/1 vectors, as @BrodieG mentioned:

# some key objects
n_pool = sum(2^(1:max_len))            # total number of indices
cuts = cumsum(2^(1:max_len-1))         # new group starts
inds_by_g = mapply(seq, cuts, cuts*2)  # indices grouped by length

# the mapping to strings (one among many possibilities)
library(data.table)
get_01str <- function(id, max_len){
  cuts = cumsum(2^(1:max_len-1))
  g = findInterval(id, cuts)
  gid = id - cuts[g] + 1

  data.table(g, gid)[, s := do.call(paste, c(list(sep=""), lapply(
    seq(g[1]),
    function(x) (gid-1) %/% 2^(x-1) %% 2
  )))
  , by=g]$s
}

Finding ids to drop. We'll sequentially drop ids from the sampling pool:

# the mapping from one index to indices of nixed strings
get_nixstrs <- function(g, gid, max_len){
  cuts = cumsum(2^(1:max_len-1))

  gids_child = {
    x = gid %% 2^sequence(g-1)
    ifelse(x, x, 2^sequence(g-1))
  }
  ids_child = gids_child + cuts[sequence(g-1)] - 1

  ids_parent = if (g==max_len) gid + cuts[g] - 1 else {
    gids_par = vector(mode="list", max_len)
    gids_par[[g]] = gid
    for (gg in seq(g, max_len-1))
      gids_par[[gg+1]] = c(gids_par[[gg]], gids_par[[gg]] + 2^gg)

    unlist(mapply(`+`, gids_par, cuts-1))
  }

  c(ids_child, ids_parent)
}

The indices are grouped by g, the number of characters, nchar(get_01str(id)). Because the indices are sorted by g, g = findInterval(id, cuts) is a faster route.

An index in group g, 1 < g < max_len, has one "child" index of size g-1 and two parent indices of size g+1. For each child node, we take its child node until we hit g==1; and for each parent node, we take their pair of parent nodes until we hit g==max_len.

The structure of the tree is simplest in terms of the identifier within the group, gid. gid maps to the two parents, gid and gid+2^g; and reversing this mapping one finds the child.
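A minimal sketch of that bookkeeping for a single node (my own illustration, using the g and gid conventions above):

g <- 2; gid <- 2                   # "10", the 2nd string of length 2 here
parents <- c(gid, gid + 2^g)       # its two parents, in group g + 1
child <- { x <- gid %% 2^(g-1); if (x == 0) 2^(g-1) else x }  # in group g - 1
parents  # 2 and 6, i.e. "100" and "101"
child    # 2, i.e. "1"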

Sampling

drawem <- function(n, max_len){
  cuts = cumsum(2^(1:max_len-1))
  inds_by_g = mapply(seq, cuts, cuts*2)

  oklens = (1:max_len)[ n <= 2^max_len*(1-2^(-(1:max_len)))+1 ]
  okinds = unlist(inds_by_g[oklens])

  mysamp = rep(0, n)
  for (i in 1:n){
    id = if (length(okinds)==1) okinds else sample(okinds, 1)
    g = findInterval(id, cuts)
    gid = id - cuts[g] + 1
    nixed = get_nixstrs(g, gid, max_len)

    # print(id); print(okinds); print(nixed)

    mysamp[i] = id
    okinds = setdiff(okinds, nixed)
    if (!length(okinds)) break
  }

  res <- rep("", n)
  res[seq.int(i)] <- get_01str(mysamp[seq.int(i)], max_len)
  res
}

The oklens part integrates the OP's idea for omitting strings that are guaranteed to make the sampling impossible. However, even doing this, we may follow a sampling path that leaves us with no more options. Taking the OP's example of max_len=4 and n=10 , we know we must drop 0 and 1 from consideration, but what happens if our first four draws are 00 , 01 , 11 and 10 ? Oh well, I guess we are out of luck. This is why you should actually define sampling probabilities. (The OP has another idea, for determining, at each step, which nodes will lead to an impossible state, but that seems a tall order.)

Illustration

# how the indices line up
n_pool = sum(2^(1:max_len))
pdt <- data.table(id=1:n_pool)
pdt[, g := findInterval(id, cuts)]
pdt[, gid := 1:.N, by=g]
pdt[, s := get_01str(id, max_len)]

# example run
set.seed(4); drawem(5, 5)
# [1] "01100" "1"     "0001"  "0101"  "00101"
set.seed(4); drawem(8, 4)
# [1] "1100" "0"    "111"  "101"  "1101" "100"  ""     ""

Benchmarks (older than those in @BrodieG's answer)

require(rbenchmark)

max_len = 8
n = 8

benchmark(
  jos_lp = {
    pool <- unlist(lapply(seq_len(max_len), function(x)
      do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
    sample.lp(pool, n)
  },
  bro_string = {pool <- make_pool(max_len); sample01(pool, n)},
  fra_num = drawem(n, max_len),
  replications = 5)[1:5]
#         test replications elapsed relative user.self
# 2 bro_string            5    0.05      2.5      0.05
# 3    fra_num            5    0.02      1.0      0.02
# 1     jos_lp            5    1.56     78.0      1.55

n = 12
max_len = 12
benchmark(
  bro_string = {pool <- make_pool(max_len); sample01(pool, n)},
  fra_num = drawem(n, max_len),
  replications = 5)[1:5]
#         test replications elapsed relative user.self
# 1 bro_string            5    0.54     6.75      0.51
# 2    fra_num            5    0.08     1.00      0.08

Other answers. There are two other answers:

jos_enum = {
  pool <- unlist(lapply(seq_len(max_len), function(x)
    do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
  get.template(pool, n)
}
bro_num = sample011(max_len, n)

I left out @josilber's enumeration method because it took far too long; and @BrodieG's numeric/index method because it didn't work at the time, but now does. See @BrodieG's updated answer for more benchmarking.

Speed vs correctness. While @josilber's answers are far slower (and for the enumeration method, apparently much more memory-intensive), they are guaranteed to draw a sample of size n on the first try. With @BrodieG's string method or this answer, you'll have to resample again and again in the hope of drawing a full n . With large max_len , that shouldn't be such a problem, I guess.

This answer scales better than bro_string because it doesn't require the construction of pool up front.

If you don't want to generate the set of all possible tuples and then randomly sample (which as you note may be infeasible for large input sizes), another option is to draw a single sample with integer programming. Basically you could assign each element in pool a random value and then select the feasible tuple with the largest sum of values. This should give each tuple an equal probability of being selected because they are all the same size and their values were selected randomly. The constraints of the model would ensure that none of the disallowed pairs of tuples are selected and that the correct number of elements are selected.

Here's a solution with the lpSolve package:

library(lpSolve)

sample.lp <- function(pool, max_len) {
  pool <- sort(pool)
  pml <- max(nchar(pool))
  runs <- c(rev(cumsum(2^(seq(pml-1)))), 0)
  banned.from <- rep(seq(pool), runs[nchar(pool)])
  banned.to <- banned.from + unlist(lapply(runs[nchar(pool)], seq_len))
  banned.constr <- matrix(0, nrow=length(banned.from), ncol=length(pool))
  banned.constr[cbind(seq(banned.from), banned.from)] <- 1
  banned.constr[cbind(seq(banned.to), banned.to)] <- 1
  mod <- lp(direction="max",
            objective.in=runif(length(pool)),
            const.mat=rbind(banned.constr, rep(1, length(pool))),
            const.dir=c(rep("<=", length(banned.from)), "=="),
            const.rhs=c(rep(1, length(banned.from)), max_len),
            all.bin=TRUE)
  pool[which(mod$solution == 1)]
}

set.seed(144)
pool <- unlist(lapply(seq_len(4), function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
sample.lp(pool, 4)
# [1] "0011" "010"  "1000" "1100"
sample.lp(pool, 8)
# [1] "0000" "0100" "0110" "1001" "1010" "1100" "1101" "1110"

This seems to scale to fairly large pools. For instance, it takes about a quarter of a second to get a sample of length 20 from a pool of size 510:

pool <- unlist(lapply(seq_len(8), function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
length(pool)
# [1] 510
system.time(sample.lp(pool, 20))
#    user  system elapsed
#   0.232   0.008   0.239

If you needed really, really huge problem sizes solved, then you could move from the open-source solvers that ship with lpSolve to a commercial solver like gurobi or cplex (not free in general, but free for academic use).

One approach would be to simply generate all possible tuples of the appropriate size with an iterative approach:

  1. Build all tuples of size 1 (all elements in pool )
  2. Take a cross product with elements in pool
  3. Remove any tuple using the same element of pool more than once
  4. Remove any exact duplicate of another tuple
  5. Remove any tuple with a pair that can't be used together
  6. Rinse and repeat until you get the appropriate tuple size

This is runnable for the sizes given ( pool of length 30, max_len 4):

get.template <- function(pool, max_len) {
  banned <- which(outer(paste0('^', pool), pool, Vectorize(grepl)), arr.ind=T)
  banned <- banned[banned[,1] != banned[,2],]
  banned <- paste(banned[,1], banned[,2])
  vals <- matrix(seq(length(pool)))
  for (k in 2:max_len) {
    vals <- cbind(vals[rep(1:nrow(vals), each=length(pool)),],
                  rep(1:length(pool), nrow(vals)))
    # Can't sample same value more than once
    vals <- vals[apply(vals, 1, function(x) length(unique(x)) == length(x)),]
    # Sort rows to ensure unique only
    vals <- t(apply(vals, 1, sort))
    vals <- unique(vals)
    # Can't have banned pair
    combos <- combn(ncol(vals), 2)
    for (k in seq(ncol(combos))) {
      c1 <- combos[1,k]
      c2 <- combos[2,k]
      vals <- vals[!paste(vals[,c1], vals[,c2]) %in% banned,]
    }
  }
  return(matrix(pool[vals], nrow=nrow(vals)))
}

max_len <- 4
pool <- unlist(lapply(seq_len(max_len), function(x)
  do.call(paste0, expand.grid(rep(list(c('0', '1')), x)))))
system.time(template <- get.template(pool, 4))
#    user  system elapsed
#   4.549   0.050   4.614

Now you can sample as many times as desired from the rows of template (which will be very fast), and this will be the same as randomly sampling from the space defined.
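For example, each draw is then just a uniformly random row (a usage sketch of my own):

set.seed(1)
template[sample(nrow(template), 1), ]  # one tuple, uniform over all feasible tuples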

Introduction

I found this question so interesting that I felt compelled to think it over, and ultimately provide my own answer. Since the algorithm I arrived at does not immediately follow from the problem description, I will start by explaining how I arrived at this solution, and then provide a sample implementation in C++ (I have never written R).

Development of the Solution

At First Glance

Reading the problem description was initially confusing, but as soon as I saw the edit with the pictures of the trees, I immediately understood the problem and my intuition suggested that a binary tree was also a solution: build a tree (a collection of trees of size 1), and break the tree into a collection of smaller trees after eliminating the branches and ancestors upon making a selection.

Although this initially seemed good, the selection process and maintenance of the collection would be a pain. Still, the tree seemed like it should play a big part in any solution.

Revision 1

Don't break up the tree. Instead have a Boolean data payload at each node indicating if it had already been eliminated. This leaves only one tree that maintains form.

Notice however, that this is not just any binary tree, it is actually a complete binary tree of depth max_len-1 .

Revision 2

A complete binary tree can be represented very nicely as an array. The typical array representation uses the breadth-first ordering of the tree, and has the following properties:

Let x be the array index; x = 0 is the root of the entire tree.
left_child(x)  = 2x + 1
right_child(x) = 2x + 2
parent(x)      = floor((x-1)/2)

In the illustration below, each node is labelled with its array index: [figure: complete binary tree with breadth-first array indices]

As an array, this takes up less memory (no more pointers), the memory used is contiguous (good for caching), and it can go entirely on the stack rather than the heap (assuming your language gives you a choice). Of course some conditions apply here, in particular the size of the array. I will come back to that later.

Just as in Revision 1, the data stored in the array will be boolean: true for available, false for unavailable. Since the root node is not actually a valid choice, index 0 should be initialized to false. The problem of how to make a selection still remains:

As indices are eliminated, it is trivial to keep track of how many have been eliminated, and thus how many remain. Choose a random number in that range, then walk the array until that many indices set to true have been seen (the current index included). The index arrived at is then the selection to make. Select until n indices are selected, or there is nothing left to select.
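A minimal sketch of that selection walk, written in R for consistency with the rest of this Q&A (the actual implementation below is C++):

select_kth_available <- function(available, k) {
  which(available)[k]  # the scan mirrors the left-to-right walk over the array
}
avail <- c(FALSE, TRUE, TRUE, FALSE, TRUE)  # index 1 (the root) starts FALSE
set.seed(1)
select_kth_available(avail, sample(sum(avail), 1))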

This is a complete algorithm that will work; however, there is room for improvement in the selection process, and there is also the practical issue of size that hasn't been addressed: the array size would be O(2^max_len). As max_len gets larger, first the caching benefit disappears, then the data starts being paged to disk, and at some point it becomes impossible to store at all.

Revision 3

I decided to tackle the easier problem first: improving the selection process.

Scanning the array left-to-right is wasteful. It may be more efficient to track the ranges which have been eliminated rather than checking and finding several falses in a row. However our tree representation is not ideal for this, since few of the nodes which would be eliminated each round would be contiguous in the array.

By rearranging how the array maps to the tree, it is possible however to better exploit this. In particular, let us use pre-order depth-first search rather than the breadth-first search. In order to do so, the tree needs to be fixed in size, which is the case for this problem. It's also less obvious how the indices of child and parent nodes are connected mathematically.

[figure: complete binary tree with pre-order depth-first array indices]

By using this arrangement, every choice that is not a leaf is guaranteed to eliminate a contiguous range: its subtree.

Revision 4

By tracking the eliminated ranges, there is no longer a need for the true/false data, and thus no need for the array or tree at all. At each random draw, the eliminated ranges can be used to quickly find the node to select. All ancestors and the entire subtree are eliminated, and can be represented as ranges which can easily be merged with the others.

The final task is to turn the selected nodes into the string representation the OP wanted. That is quite easy as this binary tree still maintains a strict order: traversing from the root, all elements >= the right child are on the right, and the others are on the left. Thus searching the tree will provide both the list of ancestors and the binary string by appending '0' when traversing left; or a '1' when traversing right.

Sample Implementation

#include <stdint.h>
#include <algorithm>
#include <cmath>
#include <list>
#include <deque>
#include <ctime>
#include <cstdlib>
#include <iostream>
#include <random>   // added: needed for std::mt19937_64 et al.
#include <string>   // added: needed for std::string
#include <cassert>  // added: needed for assert

/*
 * A range of values of the form (a, b), where a <= b, and is inclusive.
 * Ex: (1,1) is the range from 1 to 1 (ie: just 1)
 */
class Range {
private:
    friend bool operator< (const Range& lhs, const Range& rhs);
    friend std::ostream& operator<<(std::ostream& os, const Range& obj);

    int64_t m_start;
    int64_t m_end;

public:
    Range(int64_t start, int64_t end) : m_start(start), m_end(end) {}
    int64_t getStart() const { return m_start; }
    int64_t getEnd() const { return m_end; }
    int64_t size() const { return m_end - m_start + 1; }
    bool canMerge(const Range& other) const {
        return !((other.m_start > m_end + 1) || (m_start > other.m_end + 1));
    }
    int64_t merge(const Range& other) {
        int64_t change = 0;
        if (m_start > other.m_start) {
            change += m_start - other.m_start;
            m_start = other.m_start;
        }
        if (other.m_end > m_end) {
            change += other.m_end - m_end;
            m_end = other.m_end;
        }
        return change;
    }
};

inline bool operator< (const Range& lhs, const Range& rhs)
{
    return lhs.m_start < rhs.m_start;
}

std::ostream& operator<<(std::ostream& os, const Range& obj)
{
    os << '(' << obj.m_start << ',' << obj.m_end << ')';
    return os;
}

/*
 * Struct to allow returning of multiple values
 */
struct NodeInfo {
    int64_t subTreeSize;
    int64_t depth;
    std::list<int64_t> ancestors;
    std::string representation;
};

/*
 * Collection of functions representing a complete binary tree
 * as an array created using pre-order depth-first search,
 * with 0 as the root.
 * Depth of the root is defined as 0.
 */
class Tree {
private:
    int64_t m_depth;

public:
    Tree(int64_t depth) : m_depth(depth) {}

    int64_t size() const {
        return (int64_t(1) << (m_depth+1))-1;
    }

    int64_t getDepthOf(int64_t node) const {
        if (node == 0) { return 0; }
        int64_t searchDepth = m_depth;
        int64_t currentDepth = 1;
        while (true) {
            int64_t rightChild = int64_t(1) << searchDepth;
            if (node == 1 || node == rightChild) {
                break;
            } else if (node > rightChild) {
                node -= rightChild;
            } else {
                node -= 1;
            }
            currentDepth += 1;
            searchDepth -= 1;
        }
        return currentDepth;
    }

    int64_t getSubtreeSizeOf(int64_t node, int64_t nodeDepth = -1) const {
        if (node == 0) { return size(); }
        if (nodeDepth == -1) { nodeDepth = getDepthOf(node); }
        return (int64_t(1) << (m_depth + 1 - nodeDepth)) - 1;
    }

    int64_t getLeftChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) { nodeDepth = getDepthOf(node); }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1;
    }

    int64_t getRightChildOf(int64_t node, int64_t nodeDepth = -1) const {
        if (nodeDepth == -1) { nodeDepth = getDepthOf(node); }
        if (nodeDepth == m_depth) { return -1; }
        return node + 1 + ((getSubtreeSizeOf(node, nodeDepth) - 1) / 2);
    }

    NodeInfo getNodeInfo(int64_t node) const {
        NodeInfo info;
        int64_t depth = 0;
        int64_t currentNode = 0;
        while (currentNode != node) {
            if (currentNode != 0) {
                info.ancestors.push_back(currentNode);
            }
            int64_t rightChild = getRightChildOf(currentNode, depth);
            if (rightChild == -1) {
                break;
            } else if (node >= rightChild) {
                info.representation += '1';
                currentNode = rightChild;
            } else {
                info.representation += '0';
                currentNode = getLeftChildOf(currentNode, depth);
            }
            depth++;
        }
        info.depth = depth;
        info.subTreeSize = getSubtreeSizeOf(node, depth);
        return info;
    }
};

// random selection amongst remaining allowed nodes
int64_t selectNode(const std::deque<Range>& eliminationList, int64_t poolSize,
                   std::mt19937_64& randomGenerator)
{
    std::uniform_int_distribution<> randomDistribution(1, poolSize);
    int64_t selection = randomDistribution(randomGenerator);
    for (auto const& range : eliminationList) {
        if (selection >= range.getStart()) { selection += range.size(); }
        else { break; }
    }
    return selection;
}

// determine how many nodes have been eliminated
int64_t countEliminated(const std::deque<Range>& eliminationList)
{
    int64_t count = 0;
    for (auto const& range : eliminationList) {
        count += range.size();
    }
    return count;
}

// merge all the elimination ranges into listA, and return the number of new eliminations
int64_t mergeEliminations(std::deque<Range>& listA, std::deque<Range>& listB)
{
    if(listB.empty()) { return 0; }
    if(listA.empty()) {
        listA.swap(listB);
        return countEliminated(listA);
    }

    int64_t newEliminations = 0;
    int64_t x = 0;
    auto listA_iter = listA.begin();
    auto listB_iter = listB.begin();
    while (listB_iter != listB.end()) {
        if (listA_iter == listA.end()) {
            listA_iter = listA.insert(listA_iter, *listB_iter);
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (listA_iter->canMerge(*listB_iter)) {
            x = listA_iter->merge(*listB_iter);
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if (*listB_iter < *listA_iter) {
            listA_iter = listA.insert(listA_iter, *listB_iter) + 1;
            x = listB_iter->size();
            assert(x >= 0);
            newEliminations += x;
            ++listB_iter;
        } else if ((listA_iter+1) != listA.end()
                   && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }

    while (listA_iter != listA.end()) {
        if ((listA_iter+1) != listA.end()
            && listA_iter->canMerge(*(listA_iter+1))) {
            listA_iter->merge(*(listA_iter+1));
            listA_iter = listA.erase(listA_iter+1);
        } else {
            ++listA_iter;
        }
    }

    return newEliminations;
}

int main (int argc, char** argv)
{
    std::random_device rd;
    std::mt19937_64 randomGenerator(rd());

    int64_t max_len = std::stoll(argv[1]);
    int64_t num_samples = std::stoll(argv[2]);

    int64_t samplesRemaining = num_samples;
    Tree tree(max_len);
    int64_t poolSize = tree.size() - 1;
    std::deque<Range> eliminationList;
    std::deque<Range> eliminated;
    std::list<std::string> foundList;

    while (samplesRemaining > 0 && poolSize > 0) {
        // find a valid node
        int64_t selectedNode = selectNode(eliminationList, poolSize, randomGenerator);
        NodeInfo info = tree.getNodeInfo(selectedNode);
        foundList.push_back(info.representation);
        samplesRemaining--;

        // determine which nodes this choice eliminates
        eliminated.clear();
        for( auto const& ancestor : info.ancestors) {
            Range r(ancestor, ancestor);
            if(eliminated.empty() || !eliminated.back().canMerge(r)) {
                eliminated.push_back(r);
            } else {
                eliminated.back().merge(r);
            }
        }
        Range r(selectedNode, selectedNode + info.subTreeSize - 1);
        if(eliminated.empty() || !eliminated.back().canMerge(r)) {
            eliminated.push_back(r);
        } else {
            eliminated.back().merge(r);
        }

        // add the eliminated nodes to the existing list
        poolSize -= mergeEliminations(eliminationList, eliminated);
    }

    // Print some stats
    // std::cout << "tree: " << tree.size() << " samplesRemaining: "
    //           << samplesRemaining << " poolSize: "
    //           << poolSize << " samples: " << foundList.size()
    //           << " eliminated: "
    //           << countEliminated(eliminationList) << std::endl;

    // Print list of binary strings
    // std::cout << "list:";
    // for (auto const& s : foundList) {
    //     std::cout << " " << s;
    // }
    // std::cout << std::endl;
}

Additional Thoughts

This algorithm will scale extremely well for max_len. Scaling with n is not very good, but based on my own profiling it seems to do better than the other solutions.

This algorithm could be modified with little effort to allow for strings containing more than just '0' and '1'. More possible letters in the words increases the fan-out of the tree, and would eliminate a wider range with each selection – still with all nodes in each subtree remaining contiguous.