将R因子自动扩展为每个因子水平的1/0指标variables的集合

我有一个R数据框,其中包含一个我想“展开”的因子,因此对于每个因子级别,新数据框中都有一个关联列,其中包含1/0指示符。 例如,假设我有:

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4)) 

我想要:

 df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4)) 

因为对于某些需要完全数字化数据框架的分析(例如主成分分析),我认为这个function可能是内置的。写一个函数来做这个不应该太难,但是我可以预见一些与列名有关的挑战,如果已经存在,我宁愿使用它。

使用model.matrixfunction:

 model.matrix( ~ Species - 1, data=iris ) 

如果您的数据框仅由因素组成(或者您正在处理所有因素的variables子集),则还可以使用ade4软件包中的acm.disjonctif函数:

 R> library(ade4) R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red")) R> acm.disjonctif(df) eggs.bar eggs.foo ham.blue ham.green ham.red 1 0 1 0 0 1 2 0 1 1 0 0 3 1 0 0 1 0 4 1 0 0 0 1 

不完全是你正在描述的情况,但它也可能是有用的…

使用reshape2软件包的快速方法:

 require(reshape2) > dcast(df.original, ham ~ eggs, length) Using ham as value column: use value_var to override. ham bar foo 1 1 0 1 2 2 0 1 3 3 1 0 4 4 1 0 

请注意,这将生成所需的列名称。

可能虚拟variables是类似于你想要的。 那么,model.matrix是有用的:

 > with(df.original, data.frame(model.matrix(~eggs+0), ham)) eggsbar eggsfoo ham 1 0 1 1 2 0 1 2 3 1 0 3 4 1 0 4 

来自nnet包的迟到class.ind

 library(nnet) with(df.original, data.frame(class.ind(eggs), ham)) bar foo ham 1 0 1 1 2 0 1 2 3 1 0 3 4 1 0 4 

刚刚遇到这个老的线程,并认为我会添加一个函数,利用ade4来获取由因素和/或数字数据组成的dataframe,并返回一个dataframe作为虚拟代码的因素。

 dummy <- function(df) { NUM <- function(dataframe)dataframe[,sapply(dataframe,is.numeric)] FAC <- function(dataframe)dataframe[,sapply(dataframe,is.factor)] require(ade4) if (is.null(ncol(NUM(df)))) { DF <- data.frame(NUM(df), acm.disjonctif(FAC(df))) names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))] } else { DF <- data.frame(NUM(df), acm.disjonctif(FAC(df))) } return(DF) } 

我们来试试吧。

 df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"), x=rnorm(4)) dummy(df) df2 <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red")) dummy(df2) 

我需要一个function来“爆炸”一些更灵活的因素,并根据ade4软件包中的acm.disjonctif函数创build一个函数。 这使您可以selectacm.disjonctif中的分解值0和1。 它只是爆炸“几乎”水平的因素。 数字列被保留。

 # Function to explode factors that are considered to be categorical, # ie, they do not have too many levels. # - data: The data.frame in which categorical variables will be exploded. # - values: The exploded values for the value being unequal and equal to a level. # - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors. # Inspired by the acm.disjonctif function in the ade4 package. explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) { exploders <- colnames(data)[sapply(data, function(col){ is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col) })] if (length(exploders) > 0) { exploded <- lapply(exploders, function(exp){ col <- data[, exp] n <- length(col) dummies <- matrix(values[1], n, length(levels(col))) dummies[(1:n) + n * (unclass(col) - 1)] <- values[2] colnames(dummies) <- paste(exp, levels(col), sep = '_') dummies }) # Only keep numeric data. data <- data[sapply(data, is.numeric)] # Add exploded values. data <- cbind(data, exploded) } return(data) } 

这是一个更清楚的方法来做到这一点。 我使用model.matrix来创build虚拟布尔variables,然后将其合并回原始数据框。

 df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4)) df.original # eggs ham # 1 foo 1 # 2 foo 2 # 3 bar 3 # 4 bar 4 # Create the dummy boolean variables using the model.matrix() function. > mm <- model.matrix(~eggs-1, df.original) > mm # eggsbar eggsfoo # 1 0 1 # 2 0 1 # 3 1 0 # 4 1 0 # attr(,"assign") # [1] 1 1 # attr(,"contrasts") # attr(,"contrasts")$eggs # [1] "contr.treatment" # Remove the "eggs" prefix from the column names as the OP desired. colnames(mm) <- gsub("eggs","",colnames(mm)) mm # bar foo # 1 0 1 # 2 0 1 # 3 1 0 # 4 1 0 # attr(,"assign") # [1] 1 1 # attr(,"contrasts") # attr(,"contrasts")$eggs # [1] "contr.treatment" # Combine the matrix back with the original dataframe. result <- cbind(df.original, mm) result # eggs ham bar foo # 1 foo 1 0 1 # 2 foo 2 0 1 # 3 bar 3 1 0 # 4 bar 4 1 0 # At this point, you can select out the columns that you want.