How do I split/partition a dataset into training and testing datasets, e.g. for cross-validation?

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to Matlab's cvpartition function.

If you want to split the dataset once into two halves, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

    import numpy
    # x is your dataset
    x = numpy.random.rand(100, 5)
    numpy.random.shuffle(x)
    training, test = x[:80,:], x[80:,:]

or

    import numpy
    # x is your dataset
    x = numpy.random.rand(100, 5)
    indices = numpy.random.permutation(x.shape[0])
    training_idx, test_idx = indices[:80], indices[80:]
    training, test = x[training_idx,:], x[test_idx,:]

There are many ways to repeatedly partition the same dataset for cross-validation. One strategy is to resample from the dataset, with repetition:

    import numpy
    # x is your dataset
    x = numpy.random.rand(100, 5)
    training_idx = numpy.random.randint(x.shape[0], size=80)
    test_idx = numpy.random.randint(x.shape[0], size=20)
    training, test = x[training_idx,:], x[test_idx,:]

Finally, sklearn contains several cross-validation methods (k-fold, leave-n-out, stratified k-fold, ...). For documentation you might want to look at the examples or the latest git repository, but the code is solid.
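For instance, a minimal k-fold sketch using the current sklearn.model_selection API (the import path has moved between sklearn versions, so adjust it if yours differs):

    import numpy as np
    from sklearn.model_selection import KFold

    x = np.random.rand(100, 5)  # x is your dataset
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(x):
        training, test = x[train_idx, :], x[test_idx, :]
        # each of the 5 folds trains on 80 rows and tests on the remaining 20

Each iteration yields a different disjoint 80/20 partition, so every row is used for testing exactly once across the folds.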

There is another option that just entails using scikit-learn. As scikit's wiki describes, you can use the following instructions:

    import numpy as np
    from sklearn.model_selection import train_test_split

    data, labels = np.arange(10).reshape((5, 2)), range(5)
    data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels for the data you are trying to split in sync between training and testing.

Just a note: in case you want training, testing, AND validation sets, you can do this:

    from sklearn.cross_validation import train_test_split

    X = get_my_X()
    y = get_my_y()
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to the training set and 15% each to the test and validation sets. Hope this helps.
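To see where those numbers come from: test_size=0.3 holds out 30%, and splitting that holdout half-and-half leaves 15% each for test and validation. A quick sanity check (the data here is just a placeholder, and the modern sklearn.model_selection import is assumed):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.random.rand(100, 4), np.arange(100)
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
    print(len(x_train), len(x_test), len(x_val))  # 70 15 15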

Since the sklearn.cross_validation module was deprecated, you can use:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(10).reshape((5, 2)), range(5)
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

You may also consider stratified division into training and testing sets. A stratified division also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making a random stratified split into a training set
        and a testing set with proportions train_proportion and (1 - train_proportion)
        of the initial sample. y is any iterable indicating the class of each
        observation in the sample. The initial proportions of classes inside the
        training and testing sets are preserved (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))
            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True
        return train_inds, test_inds

    y = np.array([1, 1, 2, 2, 3, 3])
    train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
    print(y[train_inds])
    print(y[test_inds])

This code outputs:

    [1 2 3]
    [1 2 3]
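If you are on a recent sklearn, the same stratified behaviour is also available directly through the stratify parameter of train_test_split, which may save you the hand-rolled function above. A minimal sketch:

    import numpy as np
    from sklearn.model_selection import train_test_split

    y = np.array([1, 1, 2, 2, 3, 3])
    X = np.arange(12).reshape((6, 2))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    print(y_train, y_test)  # each side contains one sample of every class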

I wrote a function for my own project to do this (it doesn't use numpy, though):

    def partition(seq, chunks):
        """Splits the sequence into equal-sized chunks and returns them as a list."""
        result = []
        for i in range(chunks):
            chunk = []
            for element in seq[i:len(seq):chunks]:
                chunk.append(element)
            result.append(chunk)
        return result

If you want the chunks to be randomized, just shuffle the list before passing it in, as in the sketch below.
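For example (a small usage sketch; random.shuffle shuffles the list in place):

    import random

    seq = list(range(10))
    random.shuffle(seq)   # randomize before chunking
    folds = partition(seq, 5)
    print(folds)          # 5 chunks of 2 elements each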

Here is some code to split the data into n = 5 folds in a stratified manner:

    # X = data array
    # y = class labels
    from sklearn.cross_validation import StratifiedKFold

    skf = StratifiedKFold(y, n_folds=5)
    for train_index, test_index in skf:
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
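Note that this uses the deprecated sklearn.cross_validation API; in sklearn.model_selection the labels are passed to split() instead of the constructor, and n_folds became n_splits. A rough modern equivalent (the arrays here are just placeholders):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.random.rand(10, 3)                      # X = data array
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # y = class labels
    skf = StratifiedKFold(n_splits=5)
    for train_index, test_index in skf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]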

Thanks for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

    # Either of the following two lines draws 80% of the row indices without replacement:
    training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
    training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
    test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)
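A quick way to confirm that the two sets really are disjoint and together cover every row (a small sanity-check sketch with placeholder data):

    import numpy as np

    X = np.random.rand(100, 5)  # placeholder dataset
    training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
    test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)
    assert np.intersect1d(training_idx, test_idx).size == 0  # no instance in both sets
    assert len(training_idx) + len(test_idx) == X.shape[0]   # every row used exactly once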