Biologist's bioinformatics notes

Recently a user asked for help with a data frame that has 3 columns.

==========================

header    marker    cm
101    4    0-7.195
103    8    38.582-49.653
103    5    43.096-46.534
103    1    49.653-49.653
103    1    51.676-51.676
104    2    22.454-37.061
104    4    23.351-37.061
105    2    83.619-84.178
106    1    36.307-36.307
106    1    40.62-40.62
===========================

The user wants to group by the "header" column, sum the "marker" values within each group, and collapse the "cm" ranges of each group into a single range that runs from the smallest left-hand value to the largest right-hand value. The last requirement is a little confusing as described. For example, group 103 in the header column has 4 ranges in the "cm" column. From those 4 ranges, take the minimum of the lower bounds and the maximum of the upper bounds.
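
To make the requirement concrete, here is a small worked example for group 103 only, in plain Python. It is just a sketch of the arithmetic; the variable names (rows_103, lows, highs) are mine and the rows are copied from the table above.

```python
# Rows for header 103, copied from the table above: (marker, cm range)
rows_103 = [
    (8, "38.582-49.653"),
    (5, "43.096-46.534"),
    (1, "49.653-49.653"),
    (1, "51.676-51.676"),
]

lows, highs = [], []
for _, cm in rows_103:
    low, high = cm.split("-")   # left and right end of one range
    lows.append(float(low))
    highs.append(float(high))

marker_sum = sum(marker for marker, _ in rows_103)
cm_range = f"{min(lows)}-{max(highs)}"   # smallest left end, largest right end

print(marker_sum, cm_range)  # 15 38.582-51.676
```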

I thought this would be easiest in shell, since I know how to use datamash; the solution took me barely 2 minutes. Remember that this problem involves both column-wise statistics and grouping of the data. However, one of the commenters said I should not be using tools like datamash and should use a programming language instead. So I wrote the solution in 3 different ways, all with the same logic. The Python version took the longest (about 2-3 hours, in fact), R took 10 minutes and the shell solution took 2 minutes.

Let us look at each solution and output:

1. Shell:

===========================

$ tail -n+2 file.txt | sed -e 's/-/\t/g' | datamash -s -g 1 sum 2 min 3 max 4 | awk -v OFS="\t" 'BEGIN {print "header","marker","cm"} {print $1,$2,$3"-"$4}'

header    marker    cm
101    4    0-7.195
103    15    38.582-51.676
104    6    22.454-37.061
105    2    83.619-84.178
106    2    36.307-40.62
==============================

What I did here was remove the header line (tail -n+2), then replace "-" with a tab so that the "cm" column is split into two columns. With datamash, I grouped by the header column (column 1), then took the sum of marker per group (column 2), the minimum of the first column from the cm split (column 3) and the maximum of the second column from the cm split (column 4). Finally, awk prints the header line first and then the columns I wanted, joining the min and max back together with "-".
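
For readers without datamash, the same steps can be mirrored in plain Python with no extra libraries. This is only a sketch: the function name summarise is hypothetical, the sample table is pasted inline instead of being read from file.txt, and the min/max are compared as numbers (key=float) while the original text of each value is kept.

```python
from collections import defaultdict

# Inline copy of the sample table (tab-separated), standing in for file.txt
text = """header\tmarker\tcm
101\t4\t0-7.195
103\t8\t38.582-49.653
103\t5\t43.096-46.534
103\t1\t49.653-49.653
103\t1\t51.676-51.676
104\t2\t22.454-37.061
104\t4\t23.351-37.061
105\t2\t83.619-84.178
106\t1\t36.307-36.307
106\t1\t40.62-40.62
"""

def summarise(text):
    """Hypothetical helper: group by header, sum marker, merge cm ranges."""
    lines = text.strip().splitlines()[1:]          # like tail -n+2: drop header
    groups = defaultdict(lambda: [0, [], []])      # header -> [sum, lows, highs]
    for line in lines:
        header, marker, cm = line.split("\t")
        low, high = cm.split("-")                  # like sed: split cm in two
        group = groups[header]
        group[0] += int(marker)
        group[1].append(low)
        group[2].append(high)
    result = []
    for header in sorted(groups):                  # like datamash -s: sorted groups
        total, lows, highs = groups[header]
        cm_range = min(lows, key=float) + "-" + max(highs, key=float)
        result.append((header, total, cm_range))
    return result

print("header", "marker", "cm", sep="\t")
for row in summarise(text):
    print(*row, sep="\t")
```

The printed rows should match the shell output above.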

2. R:

========================

library(stringr)
library(dplyr)
df1 = read.csv("file.txt", stringsAsFactors = F, strip.white = T, sep = "\t")
# str_split_fixed returns text; convert to numeric so min/max compare numbers
cm_parts = str_split_fixed(df1$cm, "-", 2)
df1$min = as.numeric(cm_parts[, 1])
df1$max = as.numeric(cm_parts[, 2])
data.frame(df1 %>%
  group_by(header) %>%
  summarise(sum = sum(marker), range = paste(min(min), max(max), sep = "-")))
=========================

Using the stringr library, I split the "cm" column into two and appended the pieces to the data frame (df1). Then, using dplyr, I grouped by the header column and summarised each column, creating a new column named "range" by pasting together the min of the min column and the max of the max column (both of which came from splitting the "cm" column).

3. Python:

=============================

$ python test.py
   header marker          range
0     101       4        0-7.195
1     103      15 38.582-51.676
2     104       6 22.454-37.061
3     105       2 83.619-84.178
4     106       2   36.307-40.62

$ cat test.py

import pandas as pd

# Read the tab-separated file
df1 = pd.read_csv("file.txt", sep="\t")
# Split the cm column into two (string) columns
df1[['Min', 'Max']] = df1.cm.str.split('-', expand=True)
# Group by header; Min/Max are compared as numbers (key=float) but keep their original text
df2 = df1.groupby('header').agg({'marker': 'sum', 'Min': lambda s: min(s, key=float), 'Max': lambda s: max(s, key=float)}).reset_index()
# Create the range column
df2["range"] = df2["Min"] + "-" + df2["Max"]
# Keep only the requested columns
df3 = df2[["header", "marker", "range"]]
# Print result
print(df3)
====================

Read the file as tab-separated data and split the cm column on "-". Then compute the column-wise statistics per group, join the min and max back into a range column, and print the requested columns.
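
One detail to keep in mind when splitting cm this way: the two pieces are text, and min/max on text compare character by character, not by numeric value. On the sample table both comparisons happen to give the same answer, but that is not guaranteed in general. A tiny illustration, using made-up values that are not from the table:

```python
values = ["9.5", "10.2"]   # made-up strings, not from the sample table

as_text = min(values)                # compared as text: "1" sorts before "9"
as_number = min(values, key=float)   # compared by numeric value

print(as_text)    # 10.2
print(as_number)  # 9.5
```

Converting the split columns to numbers first, or comparing with key=float as here, avoids the surprise.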

Jun 26, 2018 - Group by and column wise statistics in Shell, R and Python 3