How to get the most informative features for scikit-learn classifiers?

Classifiers in machine learning packages such as liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

    viagra = None          ok : spam     =      4.5 : 1.0
    hello = True           ok : spam     =      4.5 : 1.0
    hello = None         spam : ok       =      3.3 : 1.0
    viagra = True        spam : ok       =      3.3 : 1.0
    casino = True        spam : ok       =      2.0 : 1.0
    casino = None          ok : spam     =      1.5 : 1.0

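My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does anybody know a workaround to get at those values?

Thanks a lot!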

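The classifiers themselves do not record feature names, they just see numeric arrays. However, if you extracted your features using a Vectorizer / CountVectorizer / TfidfVectorizer / DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):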

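    import numpy as np

    def print_top10(vectorizer, clf, class_labels):
        """Prints features with the highest coefficient values, per class"""
        # scikit-learn >= 1.0; older releases used get_feature_names()
        feature_names = vectorizer.get_feature_names_out()
        for i, class_label in enumerate(class_labels):
            # argsort is ascending, so the last 10 indices have the largest coefficients
            top10 = np.argsort(clf.coef_[i])[-10:]
            print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))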

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.

With the help of larsmans' code I came up with this code for the binary case:

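    def show_most_informative_features(vectorizer, clf, n=20):
        # scikit-learn >= 1.0; older releases used get_feature_names()
        feature_names = vectorizer.get_feature_names_out()
        # Sort (coefficient, feature name) pairs from most negative to most positive
        coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
        # Pair the n most negative with the n most positive features
        top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
        for (coef_1, fn_1), (coef_2, fn_2) in top:
            print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))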

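To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much each feature contributes to the model: scikit-learn computes it from the total impurity reduction that the feature brings, averaged over the trees, rather than from explained variance in the strict sense. The values are normalized, so their sum is at most 1 (and normally exactly 1).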

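I find this attribute very useful when doing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

EDIT: This works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.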
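
As a small illustration, here is a sketch that fits a forest on the iris dataset bundled with scikit-learn and ranks the features by importance. The dataset and the hyperparameters are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(iris.data, iris.target)

# One non-negative value per input column, normalized by scikit-learn
importances = clf.feature_importances_
print("sum of importances:", importances.sum())

# Pair each importance with its column name, most important first
ranking = sorted(zip(importances, iris.feature_names), reverse=True)
for importance, name in ranking:
    print("%.3f  %s" % (importance, name))
```

Swapping in GradientBoostingClassifier should work the same way, since it exposes the same attribute.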

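We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, binary and multiclass cases, allows highlighting text according to feature values, integrates with IPython, etc.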

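RandomForestClassifier does not have a coef_ attribute yet, but it will in the 0.17 release, I think. However, see the RandomForestClassifierWithCoef class in "Recursive feature elimination on Random Forest using scikit-learn". This may give you some ideas for working around the limitation above.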
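
The linked answer is not reproduced here, so the following is only a guess at what such a wrapper might look like: a subclass that copies feature_importances_ into coef_ after fitting, so that tools expecting coef_ can use a forest. The class body, the iris data, and the RFE settings are all assumptions made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


class RandomForestClassifierWithCoef(RandomForestClassifier):
    """Random forest that also exposes its importances as coef_ (sketch)."""

    def fit(self, *args, **kwargs):
        super().fit(*args, **kwargs)
        # Assumption: reuse feature_importances_ wherever coef_ is expected
        self.coef_ = self.feature_importances_
        return self


X, y = load_iris(return_X_y=True)
estimator = RandomForestClassifierWithCoef(n_estimators=50, random_state=0)

# Recursive feature elimination, keeping the two highest-ranked features
selector = RFE(estimator, n_features_to_select=2)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the selected columns
```

Recent scikit-learn versions of RFE can also read feature_importances_ directly, so the subclass may only be needed with older releases or with other tools that insist on coef_.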

You can also do something like this to plot the features in order of importance:

    import matplotlib.pyplot as plt
    import numpy as np

    # Assumes clf is a fitted forest and train[features] is the training DataFrame
    importances = clf.feature_importances_
    # Spread of each feature's importance across the individual trees
    std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
    indices = np.argsort(importances)[::-1]  # most important first
    n_features = train[features].shape[1]

    # Plot the feature importances of the forest
    plt.figure()
    plt.title("Feature importances")
    plt.bar(range(n_features), importances[indices],
            color="r", yerr=std[indices], align="center")
    plt.xticks(range(n_features), indices)
    plt.xlim([-1, n_features])
    plt.show()
