Biologist's bioinformatics notes

Pandas is a data analysis library available in python. There are several operations that are same between R and Pandas esp data frame operations. However, Pandas inherits object based operations from python which makes operations easy on data frames. In this note, we will merge 3 data frames and extract common ones. One can use sql libraries available in python or use pandas it self to get the result.

Following is the issue:
User has 3 data frames (df) as follows:

df1:
Seq1.id    Seq2.id
seq1.01    seq5.02
seq2.01    Seq6.02
seq3.01    Seq7.02
seq4.01    Seq8.02

df2:
Seq.1.id
seq1.01
seq2.01
seq4.01

df3:
Seq.2.id
seq5.02
Seq6.02
Seq7.02

Now 2 and 3 are single column data frames. They both share single column each, with data frame1. User wants to common values between three data frames.

I wanted to join the dataframes by SQL. However it is tricky because SQL separates the columns from data frames by ".". If column name has . in it, SQL join will have two ".". Let us see how pandas handles this. We will use pandasql library for SQL join and "merge" function in pandas.

code: (with pandasql)
===============
> import pandas as pd
> from pandasql import sqldf as psql

> df1=pd.read_csv("df1", sep="\t")
> df2=pd.read_csv("df2", sep="\t")
> df3=pd.read_csv("df3", sep="\t")

> print(df1)
> print(df2)
> print(df3)

Seq1.id Seq2.id
0 seq1.01 seq5.02
1 seq2.01 Seq6.02
2 seq3.01 Seq7.02
3 seq4.01 Seq8.02

Seq.1.id
0 seq1.01
1 seq2.01
2 seq4.01

Seq.2.id
0 seq5.02
1 Seq6.02
2 Seq7.02

> psql('select [Seq1.id],[Seq2.id] from df1 inner join df2 on df1.[Seq1.id]=df2.[Seq.1.id] inner join df3 on df1.[Seq2.id]=df3.[Seq.2.id]')
      Seq1.id       Seq2.id
0     seq1.01     seq5.02
1     seq2.01     Seq6.02
===================================

Now, let us do the same with pd.merge function. Load all the data frames like above and execute following command:
===============
> df1.merge(df2, left_on='Seq1.id', right_on='Seq.1.id', sort=True, how='inner').merge(df3,left_on='Seq2.id', right_on='Seq.2.id',sort=True, how="inner").loc[:,"Seq1.id":"Seq2.id"]

    Seq1.id     Seq2.id
0     seq2.01     Seq6.02
1     seq1.01     seq5.02
================

Recent Posts

Links

May 19, 2018 - Pandas multijoin dataframes with dots in column names