How do I create test and train samples from one DataFrame with pandas?

I have a fairly large dataset in the form of a DataFrame and I was wondering how I would be able to split the DataFrame into two random samples (80% and 20%) for training and testing.

Thanks!

I would just use numpy's randn:

    In [11]: df = pd.DataFrame(np.random.randn(100, 2))

    In [12]: msk = np.random.rand(len(df)) < 0.8

    In [13]: train = df[msk]

    In [14]: test = df[~msk]

And just to see that this has worked:

    In [15]: len(test)
    Out[15]: 21

    In [16]: len(train)
    Out[16]: 79

scikit-learn's train_test_split is a good one.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2)
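train_test_split shuffles the rows by default, and it can also split features and labels in one call. A minimal sketch of that usage, assuming a hypothetical label column named 'target':

    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=['target']),  # feature columns
        df['target'],                 # label column ('target' is a placeholder name)
        test_size=0.2,
    )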

Pandas' random sample will also work:

    train = df.sample(frac=0.8, random_state=200)
    test = df.drop(train.index)

I would use scikit-learn's own train_test_split, and generate it from the index:

    from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated

    y = df.pop('output')
    X = df

    X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
    X.iloc[X_train]  # returns the training DataFrame (assumes a default integer index)

There are many valid answers here. Adding one more to the bunch:

    # gets a random 80% of the entire set
    X_train = X.sample(frac=0.8, random_state=1)

    # gets the left-out portion of the dataset
    X_test = X.loc[~X.index.isin(X_train.index)]

You may also consider a stratified split into training and testing sets. A stratified split also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making random stratified split into training set and testing sets
        with proportions train_proportion and (1 - train_proportion) of initial sample.
        y is any iterable indicating classes of each observation in the sample.
        Initial proportions of classes inside training and
        testing sets are preserved (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))
            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True
        return train_inds, test_inds

df[train_inds] and df[test_inds] then give you the training and testing sets of your original DataFrame df.
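For example, a quick usage sketch, assuming a hypothetical class column named 'label':

    train_inds, test_inds = get_train_test_inds(df['label'], train_proportion=0.7)
    train_df = df[train_inds]
    test_df = df[test_inds]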

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the datasets exactly (i.e., it would sometimes be 79, sometimes 81, etc.).

    import random as rnd

    def make_sets(data_df, test_portion):
        tot_ix = range(len(data_df))
        # sorted() instead of the undefined sort(); sample test indices without replacement
        test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
        train_ix = list(set(tot_ix) ^ set(test_ix))
        # .iloc instead of the deprecated .ix
        test_df = data_df.iloc[test_ix]
        train_df = data_df.iloc[train_ix]
        return train_df, test_df

    train_df, test_df = make_sets(data_df, 0.2)
    test_df.head()

Just select a range of rows from df like this:

    row_count = df.shape[0]
    split_point = int(row_count * 1/5)
    test_data, train_data = df[:split_point], df[split_point:]
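Note that this takes the first 20% of rows as they are. If the rows are not already in random order, one option is to shuffle the DataFrame first, for example:

    df = df.sample(frac=1).reset_index(drop=True)  # shuffle all rows, then reset the index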

If your wish is to have one DataFrame in and two DataFrames out (not numpy arrays), this should do the trick:

    def split_data(df, train_perc=0.8):
        # use a boolean mask instead of writing a helper 'train' column into df
        msk = np.random.rand(len(df)) < train_perc
        train = df[msk]
        test = df[~msk]
        return {'train': train, 'test': test}

I think you also need to get a copy, not a slice, of the DataFrame if you want to add columns later.

    msk = np.random.rand(len(df)) < 0.8
    train, test = df[msk].copy(deep=True), df[~msk].copy(deep=True)

You can convert the DataFrame to NumPy arrays (df.as_matrix() is deprecated in recent pandas; use df.to_numpy() instead) and pass those in.

    Y = df.pop('target').to_numpy()  # 'target' stands in for your label column name
    X = df.to_numpy()                # as_matrix() was removed in pandas 1.0

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

    model.fit(x_train, y_train)      # 'model' is whatever estimator you are using
    model.score(x_test, y_test)      # sklearn estimators have score(), not test()

How about this? df is my DataFrame.

    import math

    total_size = len(df)
    train_size = math.floor(0.66 * total_size)  # 2/3 of my dataset

    # training dataset
    train = df.head(train_size)

    # test dataset
    test = df.tail(len(df) - train_size)

If you need to split your data with respect to a label column in your dataset, you can use this:

    def split_to_train_test(df, label_column, train_frac=0.8):
        train_df, test_df = pd.DataFrame(), pd.DataFrame()
        labels = df[label_column].unique()
        for lbl in labels:
            lbl_df = df[df[label_column] == lbl]
            lbl_train_df = lbl_df.sample(frac=train_frac)
            lbl_test_df = lbl_df.drop(lbl_train_df.index)
            print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d'
                  % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
            # DataFrame.append is deprecated; use pd.concat instead
            train_df = pd.concat([train_df, lbl_train_df])
            test_df = pd.concat([test_df, lbl_test_df])
        return train_df, test_df

And use it:

    train, test = split_to_train_test(data, 'class', 0.7)

You can also pass random_state to it if you want to control the split's randomness, or use some global random seed.
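For instance, a small sketch of both options (the second is an assumed variation of the function above, not part of the original answer):

    # Option 1: global seed; df.sample falls back to numpy's global random state
    np.random.seed(42)
    train, test = split_to_train_test(data, 'class', 0.7)

    # Option 2: add a random_state parameter to split_to_train_test and forward it,
    # i.e. lbl_df.sample(frac=train_frac, random_state=random_state)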