random.choice的加权版本

我需要编写random.choice的加权版本(列表中的每个元素有不同的select概率)。 这就是我想到的:

def weightedChoice(choices): """Like random.choice, but each element can have a different chance of being selected. choices can be any iterable containing iterables with two items each. Technically, they can have more than two items, the rest will just be ignored. The first item is the thing being chosen, the second item is its weight. The weights can be any numeric values, what matters is the relative differences between them. """ space = {} current = 0 for choice, weight in choices: if weight > 0: space[current] = choice current += weight rand = random.uniform(0, current) for key in sorted(space.keys() + [current]): if rand < key: return choice choice = space[key] return None 

这个function对我来说似乎过于复杂,而且很丑陋。 我希望这里的每个人都可以提出一些改进的方法或者替代方法。 效率对代码清洁度和可读性来说并不重要。

 def weighted_choice(choices): total = sum(w for c, w in choices) r = random.uniform(0, total) upto = 0 for c, w in choices: if upto + w >= r: return c upto += w assert False, "Shouldn't get here" 

自1.7.0版本以来,NumPy具有支持概率分布的choicefunction。

 from numpy.random import choice draw = choice(list_of_candidates, number_of_items_to_pick, p=probability_distribution) 

请注意, probability_distribution是与list_of_candidates的顺序相同的顺序。 您还可以使用关键字replace=False更改行为,以便不会replace绘制的项目。

  1. 将权重排列成累积分布。
  2. 使用random.random()来select一个随机的浮点数0.0 <= x < total
  3. 使用bisect.bisectsearch发行 ,如http://docs.python.org/dev/library/bisect.html#other-examples中的示例所示。;
 from random import random from bisect import bisect def weighted_choice(choices): values, weights = zip(*choices) total = 0 cum_weights = [] for w in weights: total += w cum_weights.append(total) x = random() * total i = bisect(cum_weights, x) return values[i] >>> weighted_choice([("WHITE",90), ("RED",8), ("GREEN",2)]) 'WHITE' 

如果你需要做出多于一个的select,把它分成两个函数,一个是build立累积权重,另一个是平分到一个随机点。

由于Python3.6有一个从random模块的方法choices

 Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) Type 'copyright', 'credits' or 'license' for more information IPython 6.0.0 -- An enhanced Interactive Python. Type '?' for help. In [1]: import random In [2]: population = [['a','b'], ['b','a'], ['c','b']] In [3]: list_of_prob = [0.2, 0.2, 0.6] In [4]: population = random.choices(population, weights=list_of_prob, k=10) In [5]: population Out[5]: [['c', 'b'], ['c', 'b'], ['b', 'a'], ['c', 'b'], ['c', 'b'], ['b', 'a'], ['c', 'b'], ['b', 'a'], ['c', 'b'], ['c', 'b']] 

而且人们也提到有支持权重的numpy.random.choice它不支持2d数组 ,等等。

所以,基本上你可以得到任何你喜欢的内置random.choices如果你有3.6.x的Python

如果你不介意使用numpy,你可以使用numpy.random.choice 。

例如:

 import numpy items = [["item1", 0.2], ["item2", 0.3], ["item3", 0.45], ["item4", 0.05] elems = [i[0] for i in items] probs = [i[1] for i in items] trials = 1000 results = [0] * len(items) for i in range(trials): res = numpy.random.choice(items, p=probs) #This is where the item is selected! results[items.index(res)] += 1 results = [r / float(trials) for r in results] print "item\texpected\tactual" for i in range(len(probs)): print "%s\t%0.4f\t%0.4f" % (items[i], probs[i], results[i]) 

如果您知道需要提前做出多lessselect,则可以不进行如下循环:

 numpy.random.choice(items, trials, p=probs) 

原油,但可能已经足够:

 import random weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[])) 

它工作吗?

 # define choices and relative weights choices = [("WHITE",90), ("RED",8), ("GREEN",2)] # initialize tally dict tally = dict.fromkeys(choices, 0) # tally up 1000 weighted choices for i in xrange(1000): tally[weighted_choice(choices)] += 1 print tally.items() 

打印:

 [('WHITE', 904), ('GREEN', 22), ('RED', 74)] 

假设所有的权重都是整数。 他们不需要加起来100,我只是这样做,使testing结果更容易解释。 (如果权重是浮点数,则重复全部乘以10,直到所有权重> = 1。)

 weights = [.6, .2, .001, .199] while any(w < 1.0 for w in weights): weights = [w*10 for w in weights] weights = map(int, weights) 

如果你有一个加权的字典,而不是一个列表,你可以写这个

 items = { "a": 10, "b": 5, "c": 1 } random.choice([k for k in items for dummy in range(items[k])]) 

请注意, [k for k in items for dummy in range(items[k])]中的[k for k in items for dummy in range(items[k])]产生该列表['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b', 'b', 'b']

以下是Python 3.6标准库中包含的版本:

 import itertools as _itertools import bisect as _bisect class Random36(random.Random): "Show the code included in the Python 3.6 version of the Random class" def choices(self, population, weights=None, *, cum_weights=None, k=1): """Return ak sized list of population elements chosen with replacement. If the relative weights or cumulative weights are not specified, the selections are made with equal probability. """ random = self.random if cum_weights is None: if weights is None: _int = int total = len(population) return [population[_int(random() * total)] for i in range(k)] cum_weights = list(_itertools.accumulate(weights)) elif weights is not None: raise TypeError('Cannot specify both weights and cumulative weights') if len(cum_weights) != len(population): raise ValueError('The number of weights does not match the population') bisect = _bisect.bisect total = cum_weights[-1] return [population[bisect(cum_weights, random() * total)] for i in range(k)] 

资料来源: https : //hg.python.org/cpython/file/tip/Lib/random.py#l340

从Python random.choices ,可以使用random.choices从可选的权重给定的总体中返回指定大小的元素list

random.choices(population, weights=None, *, cum_weights=None, k=1)

  • 人口 :包含独特观察的list 。 (如果为空,则引发IndexError

  • 权重 :更精确的select所需的相对权重。

  • cum_weights :进行select所需的累计权重。

  • k :要输出的list大小( len )。 (默认len()=1


几个警告:

1)利用加权抽样进行replace,以便后来replace抽取的项目。 权重序列中的值本身并不重要,但它们的相对比例却是如此。

np.random.choice不同, np.random.choice只能将概率作为权重,而且必须确保个体概率的总和达到1个标准,这里没有这样的规定。 只要它们属于数字types(除Decimaltypes之外的int/float/fraction ),它们仍然会执行。

 >>> import random # weights being integers >>> random.choices(["white", "green", "red"], [12, 12, 4], k=10) ['green', 'red', 'green', 'white', 'white', 'white', 'green', 'white', 'red', 'white'] # weights being floats >>> random.choices(["white", "green", "red"], [.12, .12, .04], k=10) ['white', 'white', 'green', 'green', 'red', 'red', 'white', 'green', 'white', 'green'] # weights being fractions >>> random.choices(["white", "green", "red"], [12/100, 12/100, 4/100], k=10) ['green', 'green', 'white', 'red', 'green', 'red', 'white', 'green', 'green', 'green'] 

2)如果没有指定权重cum_weights ,则select等概率。 如果提供了权重序列,它必须与总体序列的长度相同。

同时指定权重cum_weights会引发TypeError

 >>> random.choices(["white", "green", "red"], k=10) ['white', 'white', 'green', 'red', 'red', 'red', 'white', 'white', 'white', 'green'] 

3) cum_weights通常是itertools.accumulate函数的结果,在这种情况下非常方便。

从链接的文档:

在进行select之前,内部将相对权重转换为累计权重,因此提供累计权重可以节省工作量。

因此,对于我们设想的情况,要么提供weights=[12, 12, 4] cum_weights=[12, 24, 28] weights=[12, 12, 4]cum_weights=[12, 24, 28]产生相同的结果,后者似乎更快/更有效。

我会要求总和的select是1,但是这个工作无论如何

 def weightedChoice(choices): # Safety check, you can remove it for c,w in choices: assert w >= 0 tmp = random.uniform(0, sum(c for c,w in choices)) for choice,weight in choices: if tmp < weight: return choice else: tmp -= weight raise ValueError('Negative values in input') 

一般解决scheme:

 import random def weighted_choice(choices, weights): total = sum(weights) treshold = random.uniform(0, total) for k, weight in enumerate(weights): total -= weight if total < treshold: return choices[k] 
 import numpy as np w=np.array([ 0.4, 0.8, 1.6, 0.8, 0.4]) np.random.choice(w, p=w/sum(w)) 

我可能来不及提供有用的东西,但是这里有一个简单的,简短的,非常有效的代码片段:

 def choose_index(probabilies): cmf = probabilies[0] choice = random.random() for k in xrange(len(probabilies)): if choice <= cmf: return k else: cmf += probabilies[k+1] 

不需要对你的概率进行sorting或者用你的cmf创build一个向量,一旦find它的select,它就会终止。 内存:O(1),时间:O(N),平均运行时间〜N / 2。

如果你有重量,只需添加一行:

 def choose_index(weights): probabilities = weights / sum(weights) cmf = probabilies[0] choice = random.random() for k in xrange(len(probabilies)): if choice <= cmf: return k else: cmf += probabilies[k+1] 

如果你的加权select列表是相对静态的,并且你想要频繁采样,你可以做一个O(N)预处理步骤,然后在O(1)中使用这个相关答案中的函数进行select。

 # run only when `choices` changes. preprocessed_data = prep(weight for _,weight in choices) # O(1) selection value = choices[sample(preprocessed_data)][0] 

这取决于您想要分配样品的次数。

假设你想对K次分布进行采样。 那么,当n是分布中的项目数时,每次使用np.random.choice()的时间复杂度是O(K(n + log(n)))

在我的情况下,我需要对相同的分布进行多次采样,其次是10 ^ 3的数量级,其中n是10 ^ 6的数量级。 我使用下面的代码,它预先计算累积分布并在O(log(n))中对其进行采样。 总的时间复杂度是O(n+K*log(n))

 import numpy as np n,k = 10**6,10**3 # Create dummy distribution a = np.array([i+1 for i in range(n)]) p = np.array([1.0/n]*n) cfd = p.cumsum() for _ in range(k): x = np.random.uniform() idx = cfd.searchsorted(x, side='right') sampled_element = a[idx] 

我查看了指向另一个线程,并提出了我的编码风格的这种变化,这返回索引的目的是理货,但很简单,返回string(评论返回替代):

 import random import bisect try: range = xrange except: pass def weighted_choice(choices): total, cumulative = 0, [] for c,w in choices: total += w cumulative.append((total, c)) r = random.uniform(0, total) # return index return bisect.bisect(cumulative, (r,)) # return item string #return choices[bisect.bisect(cumulative, (r,))][0] # define choices and relative weights choices = [("WHITE",90), ("RED",8), ("GREEN",2)] tally = [0 for item in choices] n = 100000 # tally up n weighted choices for i in range(n): tally[weighted_choice(choices)] += 1 print([t/sum(tally)*100 for t in tally]) 

这是使用numpy的另一个版本的weighted_choice。 通过权向量,它将返回一个0的数组,其中包含1表示select了哪个bin。 该代码默认为只进行一次抽奖,但是您可以传入抽奖数量,并且每个抽奖的计数将被返回。

如果权向量不等于1,那么它将会被归一化。

 import numpy as np def weighted_choice(weights, n=1): if np.sum(weights)!=1: weights = weights/np.sum(weights) draws = np.random.random_sample(size=n) weights = np.cumsum(weights) weights = np.insert(weights,0,0.0) counts = np.histogram(draws, bins=weights) return(counts[0])