Pandas is a data analysis library available in python. There are several operations that are same between R and Pandas esp data frame operations. However, Pandas inherits object based operations from python which makes operations easy on data frames. In this note, we will merge 3 data frames and extract common ones. One can use sql libraries available in python or use pandas it self to get the result.
Following is the issue:
User has 3 data frames (df) as follows:
df1:
Seq1.id Seq2.id
seq1.01 seq5.02
seq2.01 Seq6.02
seq3.01 Seq7.02
seq4.01 Seq8.02
df2:
Seq.1.id
seq1.01
seq2.01
seq4.01
df3:
Seq.2.id
seq5.02
Seq6.02
Seq7.02
Now 2 and 3 are single column data frames. They both share single column each, with data frame1. User wants to common values between three data frames.
I wanted to join the dataframes by SQL. However it is tricky because SQL separates the columns from data frames by ".". If column name has . in it, SQL join will have two ".". Let us see how pandas handles this. We will use pandasql library for SQL join and "merge" function in pandas.
code: (with pandasql)
===============
> import pandas as pd
> from pandasql import sqldf as psql
> df1=pd.read_csv("df1", sep="\t")
> df2=pd.read_csv("df2", sep="\t")
> df3=pd.read_csv("df3", sep="\t")
> print(df1)
> print(df2)
> print(df3)
Seq1.id Seq2.id
0 seq1.01 seq5.02
1 seq2.01 Seq6.02
2 seq3.01 Seq7.02
3 seq4.01 Seq8.02
Seq.1.id
0 seq1.01
1 seq2.01
2 seq4.01
Seq.2.id
0 seq5.02
1 Seq6.02
2 Seq7.02
> psql('select [Seq1.id],[Seq2.id] from df1 inner join df2 on df1.[Seq1.id]=df2.[Seq.1.id] inner join df3 on df1.[Seq2.id]=df3.[Seq.2.id]')
Seq1.id Seq2.id
0 seq1.01 seq5.02
1 seq2.01 Seq6.02
===================================
Now, let us do the same with pd.merge function. Load all the data frames like above and execute following command:
===============
> df1.merge(df2, left_on='Seq1.id', right_on='Seq.1.id', sort=True, how='inner').merge(df3,left_on='Seq2.id', right_on='Seq.2.id',sort=True, how="inner").loc[:,"Seq1.id":"Seq2.id"]
Seq1.id Seq2.id
0 seq2.01 Seq6.02
1 seq1.01 seq5.02
================
Following is the issue:
User has 3 data frames (df) as follows:
df1:
Seq1.id Seq2.id
seq1.01 seq5.02
seq2.01 Seq6.02
seq3.01 Seq7.02
seq4.01 Seq8.02
df2:
Seq.1.id
seq1.01
seq2.01
seq4.01
df3:
Seq.2.id
seq5.02
Seq6.02
Seq7.02
Now 2 and 3 are single column data frames. They both share single column each, with data frame1. User wants to common values between three data frames.
I wanted to join the dataframes by SQL. However it is tricky because SQL separates the columns from data frames by ".". If column name has . in it, SQL join will have two ".". Let us see how pandas handles this. We will use pandasql library for SQL join and "merge" function in pandas.
code: (with pandasql)
===============
> import pandas as pd
> from pandasql import sqldf as psql
> df1=pd.read_csv("df1", sep="\t")
> df2=pd.read_csv("df2", sep="\t")
> df3=pd.read_csv("df3", sep="\t")
> print(df1)
> print(df2)
> print(df3)
Seq1.id Seq2.id
0 seq1.01 seq5.02
1 seq2.01 Seq6.02
2 seq3.01 Seq7.02
3 seq4.01 Seq8.02
Seq.1.id
0 seq1.01
1 seq2.01
2 seq4.01
Seq.2.id
0 seq5.02
1 Seq6.02
2 Seq7.02
> psql('select [Seq1.id],[Seq2.id] from df1 inner join df2 on df1.[Seq1.id]=df2.[Seq.1.id] inner join df3 on df1.[Seq2.id]=df3.[Seq.2.id]')
Seq1.id Seq2.id
0 seq1.01 seq5.02
1 seq2.01 Seq6.02
===================================
Now, let us do the same with pd.merge function. Load all the data frames like above and execute following command:
===============
> df1.merge(df2, left_on='Seq1.id', right_on='Seq.1.id', sort=True, how='inner').merge(df3,left_on='Seq2.id', right_on='Seq.2.id',sort=True, how="inner").loc[:,"Seq1.id":"Seq2.id"]
Seq1.id Seq2.id
0 seq2.01 Seq6.02
1 seq1.01 seq5.02
================