# What is "entropy and information gain"?


    name       gender
    -----------------        Now we want to predict
    Ashley        f          the gender of "Amro" (my name)
    Brian         m
    Caroline      f
    David         m

    #   name    ends-vowel  num-vowels   length   gender
    #   ------------------------------------------------
        Ashley      1           3           6        f
        Brian       0           2           5        m
        Caroline    1           4           8        f
        David       0           2           5        m
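
As a rough illustration, the feature columns above could be derived from a name as follows. This is only a sketch: the helper name `name_features` is made up here, and it assumes that `y` counts as a vowel, which is what the table implies for "Ashley" (3 vowels, ends in a vowel).

```python
# Assumption: 'y' is treated as a vowel, as the "Ashley" row suggests.
VOWELS = set("aeiouy")


def name_features(name):
    """Return the three features used in the table for a non-empty name."""
    lowered = name.lower()
    return {
        "ends-vowel": int(lowered[-1] in VOWELS),
        "num-vowels": sum(1 for ch in lowered if ch in VOWELS),
        "length": len(lowered),
    }


for name in ["Ashley", "Brian", "Caroline", "David", "Amro"]:
    print(name, name_features(name))
```

The last name in the loop is the unlabeled one whose gender is to be predicted.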

    length<7
    |   num-vowels<3: male
    |   num-vowels>=3
    |   |   ends-vowel=1: female
    |   |   ends-vowel=0: male
    length>=7
    |   length=5: male

    Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))

(The logarithm in the formula is usually taken to base 2.)

            ends-vowel
              [9m,5f]          <--- the [..,..] notation represents the class
            /          \            distribution of instances that reached a node
          =1          =0
        -------     -------
        [3m,4f]     [6m,1f]

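Before the split, over all 14 instances, we have:

    Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403

For the left branch, `ends-vowel=1`, we have:

    Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852

For the right branch, `ends-vowel=0`, we have:

    Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917

Combining the two branches, each weighted by its share of the instances (7 of 14):

    Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885

The information gain of the split is the difference:

    Information_Gain = Entropy_before - Entropy_after = 0.1518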
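
The arithmetic above is easy to check in code. The following is a minimal sketch, not a library API: `entropy` and `information_gain` are names chosen here for illustration, and class distributions are passed as plain lists of counts such as `[9, 5]`.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Empty classes are skipped, following the convention 0 * log(0) = 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    after = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - after


parent = [9, 5]               # [9m,5f] at the ends-vowel node
left, right = [3, 4], [6, 1]  # ends-vowel=1 and ends-vowel=0

print(round(entropy(parent), 4))
print(round(entropy(left), 4))
print(round(entropy(right), 4))
print(round(information_gain(parent, [left, right]), 4))
```

Rounded to four decimals, the printed values should agree with the hand-computed figures for `Entropy_before`, `Entropy_left`, `Entropy_right` and `Information_Gain`.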

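# How do we measure information?

• If the probability of the event is 1 (fully predictable), the function gives 0.
• If the probability of the event is close to 0, the function should give a very high number.
• If an event with probability 0.5 happens, it gives one bit of information.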

    I(X) = -log_2(p)
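
A minimal sketch of this function, assuming `p` is a probability in (0, 1]. The name `self_information` is chosen here for illustration, and the code uses `log2(1/p)`, which equals `-log2(p)`.

```python
import math


def self_information(p):
    """Information, in bits, carried by an event of probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return math.log2(1 / p)  # same value as -log2(p)


for p in (1.0, 0.5, 0.001):
    print(p, self_information(p))
```

The three probabilities in the loop correspond to the three requirements listed above: a certain event, a fair coin flip, and a very unlikely event.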

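# Entropy

Entropy is the expected value of this information measure:

    H(Y) = E[I(Y)]

## Example

Y = 1: event X occurs, with probability p

Y = 0: event X does not occur, with probability 1-p

    H(Y) = E[I(Y)] = p I(Y==1) + (1-p) I(Y==0) = - p log p - (1-p) log (1-p)

All logs are base 2.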
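
The two-outcome case can be sketched directly as an expectation over the outcomes. This is an illustrative sketch only: `binary_entropy` is a name chosen here, and it assumes the usual convention that an outcome with probability 0 contributes nothing to the sum.

```python
import math


def binary_entropy(p):
    """H(Y) in bits, where Y=1 has probability p and Y=0 has probability 1-p."""
    h = 0.0
    for q in (p, 1 - p):
        if q > 0:  # convention: 0 * log(0) is treated as 0
            h += q * math.log2(1 / q)  # q * I(outcome), with I = -log2(q)
    return h


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The entropy is zero when the outcome is certain (p of 0 or 1) and largest, one bit, when both outcomes are equally likely.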

```
-log p_i
```

```
I = -Σ p_i log(p_i)
```

```
Red bulb burnt out: p_red = 0, p_green = 1, I = -(0 + 0) = 0
```

    //Loop over image array elements and count occurrences of each possible
    //pixel to pixel difference value. Store these values in prob_array
    for j = 0, ysize-1 do $
        for i = 0, xsize-2 do begin
            diff = array(i+1,j) - array(i,j)
            if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
                prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
            endif
        endfor

    //Convert values in prob_array to probabilities and compute entropy
    n = total(prob_array)

    entrop = 0
    for i = 0, array_size-1 do begin
        prob_array(i) = prob_array(i)/n

        //Base 2 log of x is Ln(x)/Ln(2). Take Ln of array element
        //here and divide final sum by Ln(2)
        if prob_array(i) ne 0 then begin
            entrop = entrop - prob_array(i)*alog(prob_array(i))
        endif
    endfor

    entrop = entrop/alog(2)
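
Roughly the same computation can be sketched in Python. This is an approximation under stated assumptions rather than a line-for-line port: the image is taken to be a list of rows of integer pixel values, `difference_entropy` is a hypothetical name, and a `Counter` replaces the fixed-size `prob_array`, so differences outside the original range check are not discarded.

```python
import math
from collections import Counter


def difference_entropy(image):
    """Entropy in bits of the differences between horizontally adjacent pixels.

    `image` is assumed to be a list of rows, each a list of integers.
    """
    # Count how often each pixel-to-pixel difference value occurs.
    counts = Counter(
        row[i + 1] - row[i] for row in image for i in range(len(row) - 1)
    )
    n = sum(counts.values())
    if n == 0:
        return 0.0  # no adjacent pairs, so nothing to measure
    # Convert counts to probabilities and sum p * log2(1/p), i.e. -p * log2(p).
    return sum((c / n) * math.log2(n / c) for c in counts.values())


flat = [[5, 5, 5], [5, 5, 5]]   # every difference is 0
zigzag = [[0, 1, 0, 1, 0]]      # differences alternate between +1 and -1
print(difference_entropy(flat))
print(difference_entropy(zigzag))
```

A flat image has a single difference value and therefore zero entropy, while the alternating row has two equally likely difference values, which is the one-bit case discussed earlier.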
