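Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level

I have an R data frame containing a factor that I would like to "expand", so that for each factor level there is an associated column in a new data frame holding a 1/0 indicator. For example, suppose I have: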

df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))

I want:

 df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))

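Because some analyses (principal component analysis, for example) require a fully numeric data frame, I thought this feature might be built in. Writing a function to do it shouldn't be too hard, but I can foresee some challenges with column names, and if something already exists I would rather use that.

Use the model.matrix function: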

 model.matrix( ~ Species - 1, data=iris )
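The `- 1` in that formula is what gives every level its own column. A small sketch comparing the two forms on the built-in iris data (the variable names `with_intercept` and `no_intercept` are only for illustration):

```r
# With an intercept, the first level is absorbed into the baseline.
with_intercept <- model.matrix(~ Species, data = iris)

# Dropping the intercept with "- 1" keeps one indicator column per level.
no_intercept <- model.matrix(~ Species - 1, data = iris)

colnames(with_intercept)
colnames(no_intercept)

# Each row of the no-intercept matrix contains exactly one 1.
all(rowSums(no_intercept) == 1)
```

Note that the column names are the factor name pasted in front of the level name, which matters if the names are used later.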

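If your data frame consists only of factors (or you are working on a subset of variables that are all factors), you can also use the acm.disjonctif function from the ade4 package: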

 R> library(ade4)
 R> df <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
 R> acm.disjonctif(df)
   eggs.bar eggs.foo ham.blue ham.green ham.red
 1        0        1        0         0       1
 2        0        1        1         0       0
 3        1        0        0         1       0
 4        1        0        0         0       1

Not exactly the case you are describing, but it may be useful as well...
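If ade4 is not available, a similar all-levels expansion of several factors can be sketched in base R. With more than one factor, model.matrix normally drops a baseline level from every factor after the first, so this sketch switches contrasts off through the `contrasts.arg` argument. The names `is_fac`, `full_contrasts` and `indicators` are illustrative only, and the columns are named without the dot separator that acm.disjonctif uses.

```r
df <- data.frame(
  eggs = factor(c("foo", "foo", "bar", "bar")),
  ham  = factor(c("red", "blue", "green", "red"))
)

# Turn contrasts off for every factor so that no level is dropped as a baseline.
is_fac <- vapply(df, is.factor, logical(1))
full_contrasts <- lapply(df[is_fac], contrasts, contrasts = FALSE)

indicators <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)
colnames(indicators)

# One column per level: 2 levels of eggs plus 3 levels of ham.
ncol(indicators)
```

The resulting matrix is deliberately rank deficient, which is fine for indicator coding but usually not what a regression fit wants.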

A quick way using the reshape2 package:

 > require(reshape2)
 > dcast(df.original, ham ~ eggs, length)
 Using ham as value column: use value_var to override.
   ham bar foo
 1   1   0   1
 2   2   0   1
 3   3   1   0
 4   4   1   0

Note that this produces exactly the column names you asked for.

Dummy variables are probably close to what you want. In that case, model.matrix is useful:

 > with(df.original, data.frame(model.matrix(~eggs+0), ham))
   eggsbar eggsfoo ham
 1       0       1   1
 2       0       1   2
 3       1       0   3
 4       1       0   4

A late entry: class.ind from the nnet package.

 library(nnet)
 with(df.original, data.frame(class.ind(eggs), ham))
   bar foo ham
 1   0   1   1
 2   0   1   2
 3   1   0   3
 4   1   0   4
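For anyone who would rather not load a package for this, roughly the same result can be sketched in base R. `indicator_columns` below is a hypothetical helper written for this illustration, not something nnet provides, and it makes no special provision for NA values.

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))

# Hypothetical helper (not part of nnet): one 0/1 column per level of x.
indicator_columns <- function(x) {
  f <- as.factor(x)
  out <- vapply(levels(f), function(lv) as.integer(f == lv), integer(length(f)))
  # vapply returns a plain vector when x has length 1, so force a matrix shape.
  matrix(out, nrow = length(f), dimnames = list(NULL, levels(f)))
}

result <- data.frame(indicator_columns(df.original$eggs), ham = df.original$ham)
result
```

As with class.ind, the columns are named after the bare levels, so no prefix has to be stripped afterwards.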

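I just came across this old thread and thought I would add a function that uses ade4 to take a data frame made up of factors and/or numeric data and return a data frame with the factors dummy-coded.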

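 dummy <- function(df) {
   # Helpers that pick out the numeric columns and the factor columns.
   NUM <- function(dataframe) dataframe[, sapply(dataframe, is.numeric)]
   # drop = FALSE keeps a data frame even when there is only one factor column.
   FAC <- function(dataframe) dataframe[, sapply(dataframe, is.factor), drop = FALSE]
   require(ade4)
   if (is.null(ncol(NUM(df)))) {
     # A single numeric column comes back as a bare vector, so restore its name.
     DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
     names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
   } else {
     DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
   }
   return(DF)
 }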

Let's try it out.

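 # stringsAsFactors = TRUE is needed on R >= 4.0, where data.frame() no longer
 # converts strings to factors by default.
 df <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                  ham = c("red","blue","green","red"),
                  x = rnorm(4),
                  stringsAsFactors = TRUE)
 dummy(df)

 df2 <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                   ham = c("red","blue","green","red"),
                   stringsAsFactors = TRUE)
 dummy(df2)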

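I needed a somewhat more flexible function to "explode" factors, so I wrote one based on the acm.disjonctif function from the ade4 package. It lets you choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have "few" levels. Numeric columns are preserved.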

 # Function to explode factors that are considered to be categorical,
 # i.e., they do not have too many levels.
 # - data: The data.frame in which categorical variables will be exploded.
 # - values: The exploded values for the value being unequal and equal to a level.
 # - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
 # Inspired by the acm.disjonctif function in the ade4 package.
 explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
   exploders <- colnames(data)[sapply(data, function(col){
     is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
   })]
   if (length(exploders) > 0) {
     exploded <- lapply(exploders, function(exp){
       col <- data[, exp]
       n <- length(col)
       dummies <- matrix(values[1], n, length(levels(col)))
       dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
       colnames(dummies) <- paste(exp, levels(col), sep = '_')
       dummies
     })
     # Only keep numeric data.
     data <- data[sapply(data, is.numeric)]
     # Add exploded values.
     data <- cbind(data, exploded)
   }
   return(data)
 }
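The line that fills in the matrix relies on linear (column-major) indexing, which is easy to misread. A minimal sketch of that one step in isolation, using plain 0/1 values and illustrative variable names:

```r
col <- factor(c("foo", "foo", "bar", "bar"))
n <- length(col)
codes <- as.integer(col)   # integer level codes: bar = 1, foo = 2

dummies <- matrix(0, n, nlevels(col), dimnames = list(NULL, levels(col)))

# Matrices are stored column by column, so element (i, j) sits at
# linear position i + n * (j - 1). Row i gets its mark in column codes[i].
dummies[seq_len(n) + n * (codes - 1)] <- 1
dummies
```

Replacing the 0 and 1 here with `values[1]` and `values[2]` gives the behaviour of the function above.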

Here is a clearer way to do this. I use model.matrix to create the dummy boolean variables and then merge them back into the original data frame.

 df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
 df.original
 #   eggs ham
 # 1  foo   1
 # 2  foo   2
 # 3  bar   3
 # 4  bar   4

 # Create the dummy boolean variables using the model.matrix() function.
 mm <- model.matrix(~eggs-1, df.original)
 mm
 #   eggsbar eggsfoo
 # 1       0       1
 # 2       0       1
 # 3       1       0
 # 4       1       0
 # attr(,"assign")
 # [1] 1 1
 # attr(,"contrasts")
 # attr(,"contrasts")$eggs
 # [1] "contr.treatment"

 # Remove the "eggs" prefix from the column names as the OP desired.
 colnames(mm) <- gsub("eggs","",colnames(mm))
 mm
 #   bar foo
 # 1   0   1
 # 2   0   1
 # 3   1   0
 # 4   1   0
 # attr(,"assign")
 # [1] 1 1
 # attr(,"contrasts")
 # attr(,"contrasts")$eggs
 # [1] "contr.treatment"

 # Combine the matrix back with the original dataframe.
 result <- cbind(df.original, mm)
 result
 #   eggs ham bar foo
 # 1  foo   1   0   1
 # 2  foo   2   0   1
 # 3  bar   3   1   0
 # 4  bar   4   1   0

 # At this point, you can select out the columns that you want.
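The final selection step might look like the sketch below, which rebuilds the same objects so that it runs on its own. The name `final` is illustrative. It also swaps `gsub("eggs", ...)` for an anchored `sub("^eggs", ...)`, an assumption on my part that only the leading prefix should be stripped, in case a level name itself contains "eggs".

```r
df.original <- data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1, 2, 3, 4))
df.desired  <- data.frame(foo = c(1, 1, 0, 0), bar = c(0, 0, 1, 1), ham = c(1, 2, 3, 4))

mm <- model.matrix(~ eggs - 1, df.original)
# Anchor the pattern so only the leading "eggs" prefix is removed.
colnames(mm) <- sub("^eggs", "", colnames(mm))

result <- cbind(df.original, mm)

# Keep the indicator columns and ham, in the order the question asked for.
final <- result[, c("foo", "bar", "ham")]
final

# Compare against the target from the question.
all.equal(final, df.desired, check.attributes = FALSE)
```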
