Recently I came across a post where a user wanted to average over triplicates for each time point. I wish R had an easy way to group and sum (or average) rows or columns that do not share a common name. For example, triplicate columns are usually named R1, R2 and R3, and there is no direct way to group them, especially when the table holds several sets of triplicates. Here are three solutions for such a scenario.
Create a data frame with two sets of triplicates for four genes:
df1 = data.frame(genes = paste("gene", seq(1, 4), sep = "_"),
X_T0_R1 = seq(1, 4),
X_T0_R2 = seq(1, 4),
X_T0_R3 = seq(1, 4),
X_T1_R1 = seq(3, 6),
X_T1_R2 = seq(3, 6),
X_T1_R3 = seq(3, 6)
)
df1
genes  X_T0_R1 X_T0_R2 X_T0_R3 X_T1_R1 X_T1_R2 X_T1_R3
gene_1       1       1       1       3       3       3
gene_2       2       2       2       4       4       4
gene_3       3       3       3       5       5       5
gene_4       4       4       4       6       6       6
The user wants to average (take the mean of) the triplicates for each gene at each time point (here X_T0 and X_T1). The triplicates are labelled R1, R2 and R3.
Here are three ways to do this in R:
Method 1: Using tidyverse
> suppressPackageStartupMessages(library(tidyr))
> suppressPackageStartupMessages(library(stringr))
> # Reshape to long format: one row per gene and replicate column
> gdf1 = gather(df1, "group", "Expression", -genes)
> # Drop the replicate suffix, so that "X_T0_R1" becomes "X_T0"
> gdf1$tgroup = apply(str_split_fixed(gdf1$group, "_", 3)[, c(1, 2)], 1, paste, collapse = "_")
> suppressPackageStartupMessages(library(dplyr))
> # Average per gene and time point, then reshape back to wide format
> final_df = gdf1 %>% group_by(genes, tgroup) %>% summarize(expression_mean = mean(Expression)) %>% spread(., tgroup, expression_mean)
> final_df
genes X_T0 X_T1
gene_1 1 3
gene_2 2 4
gene_3 3 5
gene_4 4 6
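gather() and spread() still work, but current tidyr marks them as superseded by pivot_longer() and pivot_wider(). Below is a minimal sketch of the same idea with the newer functions, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later. The name final_df_pivot and the use of sub() to strip the replicate suffix are my own choices for illustration, not part of the original solution.

```r
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(dplyr))

# Same toy data as df1 above
df1 <- data.frame(
  genes = paste("gene", seq(1, 4), sep = "_"),
  X_T0_R1 = seq(1, 4), X_T0_R2 = seq(1, 4), X_T0_R3 = seq(1, 4),
  X_T1_R1 = seq(3, 6), X_T1_R2 = seq(3, 6), X_T1_R3 = seq(3, 6)
)

final_df_pivot <- df1 %>%
  # Long format: one row per gene and replicate column
  pivot_longer(-genes, names_to = "group", values_to = "Expression") %>%
  # Strip the trailing replicate label, e.g. "X_T0_R1" -> "X_T0"
  mutate(tgroup = sub("_R[0-9]+$", "", group)) %>%
  group_by(genes, tgroup) %>%
  summarize(expression_mean = mean(Expression), .groups = "drop") %>%
  # Back to wide format: one column per time point
  pivot_wider(names_from = tgroup, values_from = expression_mean)

final_df_pivot
```

The result should have the same three columns (genes, X_T0, X_T1) as the gather/spread version.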
Method 2: Using apply and arrays. Note that the column headers are not kept, and that the replicate columns of each time point must sit next to each other.
> # Reshape the numeric columns into a genes x 3 replicates x time points array,
> # then take the row means within each time-point slice
> final_df = data.frame(apply(array(as.matrix(df1[, -1]), c(nrow(df1), 3, (ncol(df1) - 1) / 3)), 3, rowMeans))
> final_df = cbind(df1$genes, final_df)
> final_df
df1$genes X1 X2
gene_1     1  3
gene_2     2  4
gene_3     3  5
gene_4     4  6
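Since Method 2 loses the headers, here is a small sketch of one way to put them back. It assumes every time point has exactly three adjacent replicate columns whose names end in _R followed by a number. The names n_rep, vals, arr, means and final_df_named are introduced here for illustration only.

```r
# Same toy data as df1 above
df1 <- data.frame(
  genes = paste("gene", seq(1, 4), sep = "_"),
  X_T0_R1 = seq(1, 4), X_T0_R2 = seq(1, 4), X_T0_R3 = seq(1, 4),
  X_T1_R1 = seq(3, 6), X_T1_R2 = seq(3, 6), X_T1_R3 = seq(3, 6)
)

n_rep  <- 3                      # replicates per time point (assumed constant)
vals   <- as.matrix(df1[, -1])   # numeric part only, genes column dropped
n_time <- ncol(vals) / n_rep

# genes x replicates x time points; relies on replicate columns being adjacent
arr   <- array(vals, c(nrow(vals), n_rep, n_time))
means <- apply(arr, 3, rowMeans)

# Take the header of the first replicate in each block and drop its suffix
first_of_block  <- seq(1, ncol(vals), by = n_rep)
colnames(means) <- sub("_R[0-9]+$", "", colnames(vals)[first_of_block])

final_df_named <- data.frame(genes = df1$genes, means)
final_df_named
```

Naming the genes column inside data.frame() also avoids the awkward df1$genes header that cbind() produces.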
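Method 3: Using a loop and apply
> # Start with a one-column placeholder; its first column is overwritten below
> final_df = data.frame(matrix(nrow = nrow(df1)))
> # Loop over the time-point prefixes: column names without the _R suffix, minus the genes column
> for (i in unique(gsub("_R[1-9]", "", names(df1)))[-1]) {
    final_df[, i] = apply(df1[, grepl(i, names(df1))], 1, mean)
  }
> final_df[, 1] = df1[, 1]
> names(final_df)[1] = names(df1)[1]
> final_df
On df1 this returns the same X_T0 and X_T1 columns as Methods 1 and 2. The output below comes from a different input table (genes AA, PQ, XY and UBQ; time points X_T0 to X_T3), not from df1:
Gene      X_T0      X_T1      X_T2      X_T3
AA   1.3260611 1.4999564 1.1185287 1.0250821
PQ   0.2227680 0.2931172 0.5482280 0.8411926
XY   0.2934894 0.3244539 0.8463782 1.1081067
UBQ  1.2741957 1.1195535 0.9363191 0.9577433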
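The loop in Method 3 matches columns with grepl(), so a prefix such as X_T1 would also pick up columns like X_T10_R1 if there were ten or more time points. A loop-free sketch that groups columns by their exact prefix is shown below. It uses base R's split.default() to split the columns of the data frame; the names time_point, groups, means and final_df_split are my own and are not from the original post.

```r
# Same toy data as df1 above
df1 <- data.frame(
  genes = paste("gene", seq(1, 4), sep = "_"),
  X_T0_R1 = seq(1, 4), X_T0_R2 = seq(1, 4), X_T0_R3 = seq(1, 4),
  X_T1_R1 = seq(3, 6), X_T1_R2 = seq(3, 6), X_T1_R3 = seq(3, 6)
)

# Time-point label for each numeric column: "X_T0", "X_T0", "X_T0", "X_T1", ...
time_point <- sub("_R[0-9]+$", "", names(df1)[-1])

# split.default() treats the data frame as a list of columns,
# giving one small data frame per time point
groups <- split.default(df1[, -1], time_point)

# Row means within each group; sapply() binds them into a matrix
means <- sapply(groups, rowMeans)

final_df_split <- data.frame(genes = df1$genes, means)
final_df_split
```

Note that split.default() orders the groups alphabetically by label, so with many time points the columns may need reordering afterwards.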
What if the rows are duplicated and you need to average the triplicates? The code is available in my GitHub repo. Please find the script in the repo rscripts.
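For readers who want a quick idea of what that could look like, here is a sketch under one reading of the question: the same gene name appears in more than one row, and those rows should be averaged after the triplicates are. This is not the script from the repo. The data frame df_dup and all other names in the block are hypothetical.

```r
# Hypothetical input in which gene_1 appears twice
df_dup <- data.frame(
  genes   = c("gene_1", "gene_1", "gene_2"),
  X_T0_R1 = c(1, 3, 5), X_T0_R2 = c(1, 3, 5), X_T0_R3 = c(1, 3, 5),
  X_T1_R1 = c(2, 4, 6), X_T1_R2 = c(2, 4, 6), X_T1_R3 = c(2, 4, 6)
)

# Step 1: average the triplicates within each row, per time point
time_point <- sub("_R[0-9]+$", "", names(df_dup)[-1])
rep_means  <- sapply(split.default(df_dup[, -1], time_point), rowMeans)

# Step 2: average the rows that share a gene name
final_dup <- aggregate(as.data.frame(rep_means),
                       by = list(genes = df_dup$genes),
                       FUN = mean)
final_dup
```

Here the two gene_1 rows (time-point means 1 and 3 for X_T0, 2 and 4 for X_T1) collapse into a single row with values 2 and 3.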