In functional analysis of significant genes, one often needs to do pathway analysis. KEGG has been the de facto resource for visualizing metabolic pathways since its inception. Let us take an example with a few genes and the corresponding pathway IDs for each gene. The easier task is to get pathway names directly from gene symbols. Sometimes, however, we are given KEGG pathway IDs instead of KEGG pathway names for a set of genes. This solution applies to that scenario.
x1 mmu04520
x2 mmu04145,mmu04514,mmu04650,mmu04670,mmu04810,mmu05140,mmu05144,mmu05146,mmu05150,mmu05323,mmu05416
x3 mmu04622
x4 mmu00561,mmu00564,mmu01100,mmu04070
In this example, x1 to x4 stand for genes (not real ones), each followed by its mouse pathway IDs. We now need to look up the pathway names and add them in a column next to each gene. We can do this in the bash shell, with the help of the wonderful TogoWS web service, and in R.
Let us do it in bash shell:
Before we do it, let us look at the logic:
1) Break down the KEGG IDs into one ID per line against each gene
For example:
x1 mmu04520
x2 mmu04145
x2 mmu04514
x2 mmu04650
2) Store these IDs in a variable and use that variable to query the web service
3) Parse the output into a standard format containing KEGG IDs and pathway names
4) Match the KEGG IDs and pathway names back to the genes
Now let us run the code:
$ awk -F"\t" '{n=split($2,a,","); for(i=1;i<=n;i++) print $1"\t"a[i]}' file.txt > file.out.txt
This awk code splits the second column (field) on commas and stores the pieces in array a. It then prints the column 1 value (the gene here) next to each value (the KEGG ID here) from column 2. The loop runs from 1 to n because awk does not guarantee any particular order for "for (i in a)". Do not forget to store the output. In this example, the output file is file.out.txt.
Output:
x1 mmu04520
x2 mmu04145
x2 mmu04514
x2 mmu04650
x2 mmu04670
x2 mmu04810
x2 mmu05140
x2 mmu05144
x2 mmu05146
x2 mmu05150
x2 mmu05323
x2 mmu05416
x3 mmu04622
x4 mmu00561
x4 mmu00564
x4 mmu01100
x4 mmu04070
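The one-liner above assumes clean input. As a minimal sketch, assuming the ID column may contain spaces after commas or a trailing comma, the split can be made a little more defensive. The function name split_ids is hypothetical and used here only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original post).
# Reads "gene<TAB>id,id,..." lines and prints one "gene<TAB>id" pair per line.
split_ids() {
  awk -F'\t' '
    {
      n = split($2, ids, ",")
      for (i = 1; i <= n; i++) {
        gsub(/^[ \t]+|[ \t]+$/, "", ids[i])      # trim blanks around the ID
        if (ids[i] != "") print $1 "\t" ids[i]   # skip empty pieces
      }
    }' "$@"
}

# Demo on inline data with a space after a comma and a trailing comma.
printf 'x1\tmmu04520\nx2\tmmu04145, mmu04514,\n' | split_ids
```

With a file, the call would be: split_ids file.txt > file.out.txt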
Next command (shell is bash shell):
$ a=$(cut -f2 file.out.txt | tr '\n' ',' ) && wget -qO- http://togows.org/entry/kegg-pathway/$a/pathways | sed 's/ /\t/' > pathways.txt
Output (from pathways.txt):
$ cat pathways.txt
mmu04520 Adherens junction
mmu04145 Phagosome
mmu04514 Cell adhesion molecules (CAMs)
mmu04650 Natural killer cell mediated cytotoxicity
mmu04670 Leukocyte transendothelial migration
mmu04810 Regulation of actin cytoskeleton
mmu05140 Leishmaniasis
mmu05144 Malaria
mmu05146 Amoebiasis
mmu05150 Staphylococcus aureus infection
mmu05323 Rheumatoid arthritis
mmu05416 Viral myocarditis
mmu04622 RIG-I-like receptor signaling pathway
mmu00561 Glycerolipid metabolism
mmu00564 Glycerophospholipid metabolism
mmu01100 Metabolic pathways
mmu04070 Phosphatidylinositol signaling system
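The query string in the command above is built with tr, which leaves a trailing comma and repeats any ID shared by several genes. It worked for this example, but as a hedged sketch the list can be de-duplicated and joined without the trailing comma. The function name build_id_list is hypothetical, and the wget line is shown only as a comment because it needs network access.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not from the original post).
# Prints the unique IDs from column 2 as one comma-separated string,
# keeping first-seen order and leaving no trailing comma.
build_id_list() {
  cut -f2 "$@" | awk 'NF && !seen[$0]++' | paste -sd, -
}

# Demo on inline data in which two genes share one pathway ID.
ids=$(printf 'x1\tmmu04520\nx2\tmmu04145\nx3\tmmu04520\n' | build_id_list)
echo "$ids"

# The string could then take the place of $a in the command above, e.g.:
# wget -qO- "http://togows.org/entry/kegg-pathway/${ids}/pathways"
```

With the real data, the call would be: ids=$(build_id_list file.out.txt)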
Now we need to bring genes, pathway IDs and pathway names together. To do this, let us join the input and the output on their common field (the pathway ID, mmuXXXXX, in this case).
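To make the idea of combining on a common field concrete, here is a minimal sketch in plain awk. It loads the ID-to-name table into memory and then looks up each gene's IDs. The function name annotate_genes and the "NA" placeholder for unknown IDs are assumptions made for illustration. Unlike a sorted, de-duplicated summary, this keeps each gene's IDs and names in matching order.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not from the original post).
# $1: ID<TAB>name table (like pathways.txt)
# $2: gene<TAB>ID pairs (like file.out.txt)
# Prints gene<TAB>IDs<TAB>names, one line per gene in first-seen order.
annotate_genes() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { name[$1] = $2; next }        # first file: remember ID -> name
    !($1 in ids) { order[++n] = $1 }         # second file: remember gene order
    {
      sep = ($1 in ids) ? "," : ""
      ids[$1]   = ids[$1] sep $2
      names[$1] = names[$1] sep (($2 in name) ? name[$2] : "NA")
    }
    END { for (i = 1; i <= n; i++) print order[i], ids[order[i]], names[order[i]] }
  ' "$1" "$2"
}

# Demo on two small temporary files.
tmp=$(mktemp -d)
printf 'mmu04520\tAdherens junction\nmmu04145\tPhagosome\n' > "$tmp/pathways.txt"
printf 'x1\tmmu04520\nx2\tmmu04145\nx2\tmmu04520\n' > "$tmp/pairs.txt"
annotate_genes "$tmp/pathways.txt" "$tmp/pairs.txt"
rm -r "$tmp"
```

With the files from this post, the call would be: annotate_genes pathways.txt file.out.txt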
Next command:
$ join -t $'\t' -1 2 -2 1 <(sort -t $'\t' -k2,2 file.out.txt) <(sort -t $'\t' -k1,1 pathways.txt) | datamash -sg2 unique 1,3 > final.out
Output from final.out:
Let us do this in R:
The input is the same:
$ cat file.txt
x1 mmu04520
x2 mmu04145,mmu04514,mmu04650,mmu04670, mmu04810, mmu05140, mmu05144, mmu05146, mmu05150, mmu05323, mmu05416
x3 mmu04622
x4 mmu00561,mmu00564,mmu01100,mmu04070
Rcode:
===================================================
## Load libraries KEGGREST, tidyr, stringr and dplyr (dplyr provides inner_join)
library(KEGGREST)
library(tidyr)
library(stringr)
library(dplyr)
## Load file.txt into R
test=read.csv("file.txt", header = F, stringsAsFactors = F, sep="\t")
## Expand values in column 2 so that each ID is in a separate row against its gene
test1=separate_rows(test,V2)
## Get pathways from KEGG server
pathways=data.frame(pathways=sapply(test1$V2, function(x) keggGet(x)[[1]]$PATHWAY_MAP))
## Split row names at the full stop, as the code above duplicates the KEGG IDs. Use the first
## column of the result as the "pathids" column.
pathways$pathids=str_split_fixed(row.names(pathways),"\\.",2)[,1]
## Join the original gene data with the new data. This gives a single row per entry.
final_pw=inner_join(test1, pathways, by=c("V2"="pathids"))
## Collapse pathway names and IDs to corresponding gene names.
final_pw_agr=aggregate(final_pw[,2:3], by=list(final_pw$V1), paste, collapse=", ")
=======================================================
output: