Biologist's bioinformatics notes

In functional analysis of significant genes, one needs to do pathway analysis. KEGG has been the de facto tool for visualizing metabolic pathways since its inception. Let us take an example with a few genes and the corresponding pathway IDs for each gene. The easy case is getting pathway names directly from gene symbols. Sometimes, however, we get KEGG pathway IDs instead of KEGG pathway names for a set of genes. This solution applies to that scenario.

x1   mmu04520
x2   mmu04145,mmu04514,mmu04650,mmu04670,mmu04810,mmu05140,mmu05144,mmu05146,mmu05150,mmu05323,mmu05416
x3   mmu04622
x4   mmu00561,mmu00564,mmu01100,mmu04070

In this example, every x followed by a number is a gene (not a real one), shown with its corresponding mouse pathway IDs. Now we need to get the pathway names and append them in columns against each gene. We can do it in the bash shell with the help of the wonderful TogoWS web service, and in R.

Let us do it in bash shell:
Before we do it, let us look at the logic:

1) Break down the KEGG IDs into one ID per line against each gene
e.g.:
x1   mmu04520
x2   mmu04145
x2   mmu04514
x2   mmu04650
2) Store these IDs in a variable and use this variable to query the web service
3) Now parse the output into a standard format. The output will have KEGG IDs and pathway names.
4) Now match the KEGG IDs and pathway names with the genes.

Now let us run the code:
$ awk -F"\t" '{n=split($2,a,","); for(i=1;i<=n;i++) print $1"\t"a[i]}' file.txt > file.out.txt
This awk code splits the second column (field) on commas and stores the pieces in array a. Then it prints the corresponding column 1 value (the gene here) for each value (the KEGG ID here) in column 2. Do not forget to store the output. In this example, the output file is file.out.txt.

Output:
x1    mmu04520
x2    mmu04145
x2    mmu04514
x2    mmu04650
x2    mmu04670
x2    mmu04810
x2    mmu05140
x2    mmu05144
x2    mmu05146
x2    mmu05150
x2    mmu05323
x2    mmu05416
x3    mmu04622
x4    mmu00561
x4    mmu00564
x4    mmu01100
x4    mmu04070
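
The split step above assumes a clean, tab-separated file. Here is a minimal sketch of a more tolerant version, assuming the ID column may carry stray spaces around the commas or a trailing comma. The sample rows and the variable names (input, long) are made up for illustration.

```shell
# Hypothetical sample input: tab-separated, with a stray space and a trailing comma
input=$(mktemp)
printf 'x1\tmmu04520\nx2\tmmu04145, mmu04514,mmu04650,\n' > "$input"

# Split column 2 on commas (and any blanks around them), skip empty pieces,
# and walk the array by index so the IDs keep their left-to-right order
long=$(awk -F'\t' '{
    n = split($2, ids, /[ ]*,[ ]*/)
    for (i = 1; i <= n; i++)
        if (ids[i] != "") print $1 "\t" ids[i]
}' "$input")

printf '%s\n' "$long"
rm -f "$input"
```

Looping from 1 to n, rather than with for (i in ids), matters because awk does not promise any particular order for the latter.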

Next command (shell is bash shell):
$ a=$(cut -f2 file.out.txt | paste -sd, -) && wget -qO- "http://togows.org/entry/kegg-pathway/$a/pathways" | sed 's/ /\t/' > pathways.txt

Output (from pathways.txt):

$ cat pathways.txt
mmu04520    Adherens junction
mmu04145    Phagosome
mmu04514    Cell adhesion molecules (CAMs)
mmu04650    Natural killer cell mediated cytotoxicity
mmu04670    Leukocyte transendothelial migration
mmu04810    Regulation of actin cytoskeleton
mmu05140    Leishmaniasis
mmu05144    Malaria
mmu05146    Amoebiasis
mmu05150    Staphylococcus aureus infection
mmu05323    Rheumatoid arthritis
mmu05416    Viral myocarditis
mmu04622    RIG-I-like receptor signaling pathway
mmu00561    Glycerolipid metabolism
mmu00564    Glycerophospholipid metabolism
mmu01100    Metabolic pathways
mmu04070    Phosphatidylinositol signaling system
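
If the same pathway ID turns up under several genes, the ID list can be deduplicated before it is sent to the web service. This is a small sketch of only the list-building part, on made-up rows. The file and variable names (long, ids) are assumptions, and the request itself is left as a comment because it needs network access.

```shell
# Hypothetical one-ID-per-line table in which mmu04520 appears under two genes
long=$(mktemp)
printf 'x1\tmmu04520\nx2\tmmu04145\nx3\tmmu04520\n' > "$long"

# Take column 2, drop duplicate IDs, and join them with commas
# (paste -s joins the lines without leaving a trailing comma)
ids=$(cut -f2 "$long" | sort -u | paste -sd, -)

printf '%s\n' "$ids"
rm -f "$long"

# The list could then be used in the same request as in the post:
# wget -qO- "http://togows.org/entry/kegg-pathway/$ids/pathways"
```

For the sample rows this prints mmu04145,mmu04520.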

Now we need to bring genes, pathway IDs and pathway names together. For this, let us combine the input and the output using the common field (the pathway ID in this case, mmuXXXXX).

Next command:
$ join -t $'\t' -1 2 -2 1 <(sort -t $'\t' -k2,2 file.out.txt) <(sort -t $'\t' -k1,1 pathways.txt) | datamash -sg2 unique 1,3 > final.out
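
As far as I know, datamash's unique operation returns the unique values of each column as a sorted list, so the IDs and the names in one row may not line up position by position. As an alternative, here is a sketch that needs only awk and keeps each ID next to its name in the original order. The sample rows, the variable names and the "; " separator for names are assumptions, not part of the original workflow.

```shell
# Hypothetical stand-ins for file.out.txt (gene, ID) and pathways.txt (ID, name)
long=$(mktemp); names=$(mktemp)
printf 'x1\tmmu04520\nx2\tmmu04145\nx2\tmmu04514\n' > "$long"
printf 'mmu04520\tAdherens junction\nmmu04145\tPhagosome\nmmu04514\tCell adhesion molecules (CAMs)\n' > "$names"

final=$(awk -F'\t' -v OFS='\t' '
    # First file: build the ID -> pathway name lookup
    NR == FNR { name[$1] = $2; next }
    # Second file: append each ID and its name to the row of its gene
    {
        if ($1 in ids) {
            ids[$1] = ids[$1] "," $2
            pw[$1]  = pw[$1] "; " name[$2]
        } else {
            order[++n] = $1          # remember genes in first-seen order
            ids[$1] = $2
            pw[$1]  = name[$2]
        }
    }
    END { for (i = 1; i <= n; i++) print order[i], ids[order[i]], pw[order[i]] }
' "$names" "$long")

printf '%s\n' "$final"
rm -f "$long" "$names"
```

Names are joined with "; " rather than a comma because some KEGG pathway names contain commas themselves.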

Output from final.out:

Let us do this in R:

The input is the same:

$ cat file.txt
x1   mmu04520
x2   mmu04145,mmu04514,mmu04650,mmu04670,mmu04810,mmu05140,mmu05144,mmu05146,mmu05150,mmu05323,mmu05416
x3   mmu04622
x4   mmu00561,mmu00564,mmu01100,mmu04070

R code:

===================================================

## Load libraries KEGGREST, tidyr, stringr and dplyr (dplyr provides inner_join)

library(KEGGREST)
library(tidyr)
library(stringr)
library(dplyr)

## Load file.txt into R
test=read.csv("file.txt", header = F, stringsAsFactors = F, sep="\t")

## Expand values in column 2 so that each ID is in a separate row against its gene

test1=separate_rows(test,V2)

## Get pathways from KEGG server

pathways=data.frame(pathways=sapply(test1$V2, function(x) keggGet(x)[[1]]$PATHWAY_MAP))

## Split row names by full stop as above code duplicates KEGG IDs. Use the first
## column of the output as "pathids" column.

pathways$pathids=str_split_fixed(row.names(pathways),"\\.",2)[,1]

## Join the original data (gene symbols) with the new data. This gives a single row
## per entry.

final_pw=inner_join(test1, pathways, by=c("V2"="pathids"))

## Collapse pathway names and IDs to corresponding gene names.

final_pw_agr=aggregate(final_pw[,2:3], by=list(final_pw$V1), paste, collapse=", ")

=======================================================

output:

Nov 7, 2017 - Extract KEGG pathways using KEGG IDs using bash and R