Data.table元编程

我认为元编程在这里是合适的。

我希望能够使用data.table,就像在一个web应用程序中使用MySQL一样。 也就是说,Web用户使用一些Web前端(例如Shiny服务器)来select数据库,select要过滤的列,select要分组的列,select要聚合的列和聚合函数。 我想使用R和data.table作为查询,聚合等的后端。假设前端存在,并且R将这些variables作为string进行validation等。

我编写了以下函数来构builddata.tableexpression式,并使用R的parse / eval元编程function来运行它。 这是一个合理的方式来做到这一点?

我包括所有相关的代码来testing这个。 源代码(读取安全性后)并运行test_agg_meta()来testing它。 这只是一个开始。 我可以添加更多的function。

但是我的主要问题是我是否过分地考虑这个问题。 当所有的input都是未定的时候,是否有一个更直接的方式来使用data.table,而不使用parsing/评估元编程?

我也知道“有”的说法和其他无糖function的方法,但不知道是否可以照顾所有情况。

require(data.table) fake_data<-function(num=12){ #make some fake data x=1:num lets=letters[1:num] data=data.table( u=rep(c("A","B","C"),floor(num/3)), v=x %%2, w=lets, x=x, y=x^2, z=1-x) return(data) } data_table_meta<-function( #aggregate a data.table meta-programmatically data_in=fake_data(), filter_cols=NULL, filter_min=NULL, filter_max=NULL, groupby_cols=NULL, agg_cols=setdiff(names(data_in),groupby_cols), agg_funcs=NULL, verbose=F, validate=T, jsep="_" ){ all_cols=names(data_in) if (validate) { stopifnot(length(filter_cols) == length(filter_min)) stopifnot(length(filter_cols) == length(filter_max)) stopifnot(filter_cols %in% all_cols) stopifnot(groupby_cols %in% all_cols) stopifnot(length(intersect(agg_cols,groupby_cols)) == 0) stopifnot((length(agg_cols) == length(agg_funcs)) | (length(agg_funcs)==1) | (length(agg_funcs)==0)) } #build the command #defaults i_filter="" j_select="" n_agg_funcs=length(agg_funcs) n_agg_cols=length(agg_cols) n_groupby_cols=length(groupby_cols) if (n_agg_funcs == 0) { #NULL print("NULL") j_select=paste(agg_cols,collapse=",") j_select=paste("list(",j_select,")") } else { agg_names=paste(agg_funcs,agg_cols,sep=jsep) jsels=paste(agg_names,"=",agg_funcs,"(",agg_cols,")",sep="") if (n_groupby_cols>0) jsels=c(jsels,"N_Rows_Aggregated=.N") j_select=paste(jsels,collapse=",") j_select=paste("list(",j_select,")") } groupby="" if (n_groupby_cols>0) { groupby=paste(groupby_cols,collapse=",") groupby=paste("by=list(",groupby,")",sep="") } n_filter_cols=length(filter_cols) if (n_filter_cols > 0) { i_filters=rep("",n_filter_cols) for (i in 1:n_filter_cols) { i_filters[i]=paste(" (",filter_cols[i]," >= ",filter_min[i]," & ",filter_cols[i]," <= ",filter_max[i],") ",sep="") } i_filter=paste(i_filters,collapse="&") } command=paste("data_in[",i_filter,",",j_select,",",groupby,"]",sep="") if (verbose == 2) { print("all_cols:") print(all_cols) print("filter_cols:") print(filter_cols) print("agg_cols:") print(agg_cols) print("filter_min:") print(filter_min) print("filter_max:") print(filter_max) print("groupby_cols:") print(groupby_cols) print("agg_cols:") print(agg_cols) print("agg_funcs:") print(agg_funcs) print("i_filter") print(i_filter) print("j_select") print(j_select) print("groupby") print(groupby) print("command") print(command) } print(paste("evaluating command:",command)) eval(parse(text=command)) } my_agg<-function(data=fake_data()){ data_out=data[ i=x<=5, j=list( mean_x=mean(x), mean_y=mean(y), sum_z=sum(z), N_Rows_Aggregated=.N ), by=list(u,v)] return(data_out) } my_agg_meta<-function(data=fake_data()){ #should give same results as my_agg data_out=data_table_meta(data, filter_cols=c("x"), filter_min=c(-10000), filter_max=c(5), groupby_cols=c("u","v"), agg_cols=c("x","y","z"), agg_funcs=c("mean","mean","sum"), verbose=T, validate=T, jsep="_") return(data_out) } test_agg_meta<-function(){ stopifnot(all(my_agg()==my_agg_meta())) print("Congrats, you passed the test") } 

当你的function看起来很有趣的时候,我相信你在问是否还有其他方法可以去做。
就我个人而言,我喜欢使用这样的东西:

 ## SAMPLE DATA DT1 <- data.table(id=sample(LETTERS[1:4], 20, TRUE), Col1=1:20, Col2=rnorm(20)) DT2 <- data.table(id=sample(LETTERS[3:8], 20, TRUE), Col1=sample(100:500, 20), Col2=rnorm(20)) DT3 <- data.table(id=sample(LETTERS[19:20], 20, TRUE), Col1=sample(100:500, 20), Col2=rnorm(20)) 

通过参考表格名称访问表格:

这很简单,很像R任何对象

 # use strings to select the table tablesSelected <- "DT3" # use get to access them get(tablesSelected) # and we can perform operations: get(tablesSelected)[, list(C1mean=mean(Col1), C2mean=mean(Col2))] 

通过参考select列

要通过引用其名称来select列,请使用.SDcols参数。 给定一个列名称向量:

 columnsSelected <- c("Col1", "Col2") 

将该vector指定给.SDcols参数:

 ## Here we are simply accessing those columns DT3[, .SD, .SDcols = columnsSelected] 

我们也可以对string向量中的每一列应用一个函数:

 ## apply a function to each column DT3[, lapply(.SD, mean), .SDcols = columnsSelected] 

请注意,如果我们的目标只是输出列,我们可以closures:

 # This works for displaying DT3[, columnsSelected, with=FALSE] 

但是,如果使用with=FALSE ,则不能以通常的方式直接在列上操作

 ## This does NOT work: DT3[, someFunc(columnsSelected), with=FALSE] ## This DOES work: DT3[, someFunc(.SD), .SDcols=columnsSelected] ## This also works, but is less ideal, ie assigning to new columns is more cumbersome DT3[, columnsSelected, with=FALSE][, someFunc(.SD)] 

我们也可以使用get ,但是有点棘手。 我现在把它留在这里作为参考,但是.SDcols是要走的路

 ## we need to use `get`, but inside `j` ## AND IN A WRAPPER FUNCTION <~~~~~ THIS IS VITAL DT3[, lapply(columnsSelected, function(.col) get(.col))] ## We can execute functions on the columns: DT3[, lapply(columnsSelected, function(.col) mean( get(.col) ))] ## And of course, we can use more involved-functions, much like any *ply call: # using .SDcols DT3[, lapply(.SD, function(.col) c(mean(.col) + 2*sd(.col), mean(.col) - 2*sd(.col))), .SDcols = columnsSelected] # using `get` and assigning the value to a var. # Note that this method has memory drawbacks, so using .SDcols is preferred DT3[, lapply(columnsSelected, function(.col) {TheCol <- get(.col); c(mean(TheCol) + 2*sd(TheCol), mean(TheCol) - 2*sd(TheCol))})] 

作为参考,如果你尝试以下,你会发现他们不会产生我们以后的结果。

  ## this DOES NOT work DT3[, columnsSelected] ## netiher does this DT3[, eval(columnsSelected)] ## still does not work: DT3[, lapply(columnsSelected, get)] 

如果你想改变列的名字:

 # Using the `.SDcols` method: change names using `setnames` (lowercase "n") DT3[, setnames(.SD, c("new.Name1", "new.Name2")), .SDcols =columnsSelected] # Using the `get` method: ## The names of the new columns will be the names of the `columnsSelected` vector ## Thus, if we want to preserve the names, use the following: names(columnsSelected) <- columnsSelected DT3[, lapply(columnsSelected, function(.col) get(.col))] ## we can also use this trick to give the columns new names names(columnsSelected) <- c("new.Name1", "new.Name2") DT3[, lapply(columnsSelected, function(.col) get(.col))] 

显然,使用.SDcols更容易,更优雅。

怎么样?

 # `by` is straight forward, you can use a vector of strings in the `by` argument. # lets add another column to show how to use two columns in `by` DT3[, secondID := sample(letters[1:2], 20, TRUE)] # here is our string vector: byCols <- c("id", "secondID") # and here is our call DT3[, lapply(columnsSelected, function(.col) mean(get(.col))), by=byCols] 

把它放在一起

我们可以通过引用其名称来访问data.table,然后通过名称来select它的列:

 get(tablesSelected)[, .SD, .SDcols=columnsSelected] ## OR WITH MULTIPLE TABLES tablesSelected <- c("DT1", "DT3") lapply(tablesSelected, function(.T) get(.T)[, .SD, .SDcols=columnsSelected]) # we may want to name the vector for neatness, since # the resulting list inherits the names. names(tablesSelected) <- tablesSelected 

这是最好的部分:

由于data.table这么多data.table是通过引用传递的,因此很容易获得一个表的列表,要添加的列的单独列表以及要操作的另一列的列表,并将所有列放在一起以添加执行类似的操作 – 但有不同的投入 – 在你所有的桌子上。 与data.frame类似,不需要重新分配最终结果。

 newColumnsToAdd <- c("UpperBound", "LowerBound") FunctionToExecute <- function(vec) c(mean(vec) - 2*sd(vec), mean(vec) + 2*sd(vec)) # note the list of column names per table! columnsUsingPerTable <- list("DT1" = "Col1", DT2 = "Col2", DT3 = "Col1") tablesSelected <- names(columnsUsingPerTable) byCols <- c("id") # TADA: dummyVar <- # I use `dummyVar` because I do not want to display the output lapply(tablesSelected, function(.T) get(.T)[, c(newColumnsToAdd) := lapply(.SD, FunctionToExecute), .SDcols=columnsUsingPerTable[[.T]], by=byCols ] ) # Take a look at the tables now: DT1 DT2 DT3