Stratified random sampling from a data frame in R

I have a data frame in this format:

    head(subset)
    # ants 0 1 1 0 1
    # age  1 2 2 1 3
    # lc   1 1 0 1 0

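I need to create a new data frame from random samples drawn according to age and lc. For example, I want 30 samples from age: 1 and lc: 1, 30 samples from age: 1 and lc: 0, and so on. I looked at random sampling methods such as: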

    newdata <- function(subset, age, 30)

But this is not the code I want.

Thanks in advance!

I would suggest using either stratified from my "splitstackshape" package, or sample_n from the "dplyr" package:

    ## Sample data
    library(data.table)  # provides data.table()
    set.seed(1)
    n <- 1e4
    d <- data.table(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))
    # table(d$age, d$lc)

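For stratified, you basically specify the dataset, the stratifying columns, and either an integer giving the number of rows you want from each group or a decimal giving the fraction you want returned (for example, .1 means 10% of each group).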

    library(splitstackshape)
    set.seed(1)
    out <- stratified(d, c("age", "lc"), 30)
    head(out)
    #    age lc ants
    # 1:   1  0    1
    # 2:   1  0    0
    # 3:   1  0    1
    # 4:   1  0    1
    # 5:   1  0    0
    # 6:   1  0    1

    table(out$age, out$lc)
    #
    #      0  1
    #   1 30 30
    #   2 30 30
    #   3 30 30
    #   4 30 30
    #   5 30 30

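For sample_n, you first create a grouped table (using group_by) and then specify the number of observations you want from each group. If you want proportional sampling instead, use sample_frac.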

    library(dplyr)
    set.seed(1)
    out2 <- d %>%
      group_by(age, lc) %>%
      sample_n(30)
    # table(out2$age, out2$lc)
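To make the proportional case concrete, here is a minimal sketch that draws 10% of each age/lc group with sample_frac. It assumes dplyr is installed and rebuilds the sample data as a plain data.frame so that it runs on its own; the name out_frac is made up for illustration. In recent dplyr releases sample_n and sample_frac are superseded by slice_sample(n = ...) and slice_sample(prop = ...), which should behave much the same on grouped data.

```r
library(dplyr)

set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

# Draw 10% of the rows within each age/lc group
# (out_frac is an illustrative name, not from the original answer)
out_frac <- d %>%
  group_by(age, lc) %>%
  sample_frac(0.1) %>%
  ungroup()

# Compare group sizes in the sample with those in the full data
table(out_frac$age, out_frac$lc)
table(d$age, d$lc)
```

Each cell of the first table should come out at roughly one tenth of the matching cell in the second, up to rounding.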

Here's some data:

    set.seed(1)
    n <- 1e4
    d <- data.frame(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))

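You want a split-apply-combine strategy: split your data.frame (d in this example) by the stratifying variables, sample rows/observations from each subset, and then combine the pieces back together with rbind. Here's how it works: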

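    # Split into one data.frame per age/lc combination
    sp <- split(d, list(d$age, d$lc))
    # Draw 30 rows without replacement from each subset
    samples <- lapply(sp, function(x) x[sample(1:nrow(x), 30, FALSE), ])
    # Stack the sampled subsets back into one data.frame
    out <- do.call(rbind, samples)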

The result:

    > str(out)
    'data.frame':   300 obs. of  3 variables:
     $ age : int  1 1 1 1 1 1 1 1 1 1 ...
     $ lc  : int  0 0 0 0 0 0 0 0 0 0 ...
     $ ants: int  1 1 0 1 1 1 1 1 1 1 ...
    > head(out)
             age lc ants
    1.0.2242   1  0    1
    1.0.4417   1  0    1
    1.0.389    1  0    0
    1.0.4578   1  0    1
    1.0.8170   1  0    1
    1.0.5606   1  0    1
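The split-apply-combine approach above stops with an error if any age/lc combination has fewer than 30 rows, because sample cannot draw more items than exist without replacement. One possible way around that, sketched in base R only, is to cap the draw at the stratum size. The helper stratified_sample and the toy data frame are hypothetical names introduced for this illustration; they do not come from any package.

```r
# Hypothetical helper (not from any package): draw up to `size` rows
# from every combination of the columns named in `cols`.
stratified_sample <- function(df, cols, size) {
  # drop = TRUE skips combinations that never occur in the data
  groups <- split(df, df[cols], drop = TRUE)
  picked <- lapply(groups, function(g) {
    k <- min(size, nrow(g))  # cap the draw at the stratum size
    g[sample.int(nrow(g), k), , drop = FALSE]
  })
  out <- do.call(rbind, picked)
  rownames(out) <- NULL      # discard the "group.row" style row names
  out
}

# Toy data: strata (1,0) and (1,1) have 25 rows each, (2,0) has only 5,
# and (2,1) does not occur at all
set.seed(1)
toy <- data.frame(age  = c(rep(1, 50), rep(2, 5)),
                  lc   = c(rep(0:1, 25), rep(0, 5)),
                  ants = rbinom(55, 1, .7))

res <- stratified_sample(toy, c("age", "lc"), 10)
table(res$age, res$lc)
```

Whether a small stratum should be taken whole, sampled with replacement, or reported as an error depends on the analysis; taking it whole is just the choice made in this sketch.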

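Take a look at the strata function from the sampling package. The function performs stratified simple random sampling and returns the sample as its result. Two extra columns are added: the inclusion probability (Prob) and the stratum indicator (Stratum). See the example.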

    require(data.table)
    require(sampling)

    set.seed(1)
    n <- 1e4
    d <- data.table(age = sample(1:5, n, TRUE),
                    lc = rbinom(n, 1, .5),
                    ants = rbinom(n, 1, .7))

    # Sort
    setkey(d, age, lc)

    # Population size by strata
    d[, .N, keyby = list(age, lc)]
    #     age lc    N
    #  1:   1  0 1010
    #  2:   1  1 1002
    #  3:   2  0  993
    #  4:   2  1 1026
    #  5:   3  0 1021
    #  6:   3  1  982
    #  7:   4  0  958
    #  8:   4  1  940
    #  9:   5  0 1012
    # 10:   5  1 1056

    # Select sample
    set.seed(2)
    s <- data.table(strata(d, c("age", "lc"), rep(30, 10), "srswor"))

    # Sample size by strata
    s[, .N, keyby = list(age, lc)]
    #     age lc  N
    #  1:   1  0 30
    #  2:   1  1 30
    #  3:   2  0 30
    #  4:   2  1 30
    #  5:   3  0 30
    #  6:   3  1 30
    #  7:   4  0 30
    #  8:   4  1 30
    #  9:   5  0 30
    # 10:   5  1 30
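As a rough illustration of what the inclusion probability means here: under simple random sampling without replacement, a row in stratum h is selected with probability n_h / N_h, where n_h is the number of rows drawn and N_h is the stratum size. The base R sketch below computes that ratio directly, without calling the sampling package. The names strata_sizes, n_h and Weight are made up for this example, and the stratum counts it produces may differ from those printed above depending on the R version's random number generator.

```r
set.seed(1)
n <- 1e4
d <- data.frame(age = sample(1:5, n, TRUE),
                lc = rbinom(n, 1, .5),
                ants = rbinom(n, 1, .7))

n_h <- 30  # rows drawn per stratum, as in the example above

# Population size N_h for every age/lc stratum
strata_sizes <- as.data.frame(table(age = d$age, lc = d$lc),
                              responseName = "N")

# Under simple random sampling without replacement, every row in
# stratum h is included with probability n_h / N_h
strata_sizes$Prob <- n_h / strata_sizes$N

# Design weight: how many population rows each sampled row stands for
strata_sizes$Weight <- 1 / strata_sizes$Prob

strata_sizes
```

Because each stratum contributes the same 30 rows regardless of its size, rows from larger strata carry larger weights; the weights matter if the stratified sample is later used to estimate population-level quantities.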