Eutils are provided by NCBI and are very useful for accessing NCBI data programmatically. In this note, let us convert PMIDs into PMCIDs in two ways: with the eutils command-line tools (for shell programming) and with Biopython (for Python programming). There is one small issue that I have not been able to resolve: the Biopython results do not match those from NCBI's command-line utilities.
First, let us look at the PMID examples.
==================================
$ cat pmid.txt
29900339
29897644
21990379
19304878
==================================
Now let us convert these 4 PMIDs to PMCIDs using eutils and bash shell:
==================================
$ while read line; do efetch -db pubmed -id $line -format xml | xtract -pattern ArticleIdList -element ArticleId ; done < pmid.txt | cut -f1,4
==================================
output:
===============================================
29900339 PMC5997901
29897644
21990379 PMC3266030
19304878 PMC2682512
===============================================
The first and second columns are the PMID and PMCID, respectively.
Now let us use Biopython in Python to extract the same information.
======================================
from Bio import Entrez
import pandas as pd

# Read the PMIDs, one per line
with open("pmid.txt") as f:
    pmids = f.read()

Entrez.email = "someone@example.org"

# Query ELink once per PMID for links from PubMed to PMC
elink_res = []
for i in pmids.splitlines():
    elink_res.append(Entrez.read(Entrez.elink(dbfrom="pubmed", id=i, linkname="pubmed_pmc")))

# Collect [PMID, PMCID] pairs; "NA" when ELink returns no PMC link
elink_res_format = []
for res in elink_res:
    for j in res:
        if len(j["LinkSetDb"]) == 0:
            elink_res_format.append([j["IdList"][0], "NA"])
        elif len(j["LinkSetDb"]) == 1:
            elink_res_format.append([j["IdList"][0], j["LinkSetDb"][0]["Link"][0]["Id"]])

print(pd.DataFrame(elink_res_format, columns=["PMID", "PMCID"]))
=======================================
Please make sure the indentation follows Python's rules when you copy the code. The output is:
===============================================
PMID PMCID
0 29900339 NA
1 29897644 NA
2 21990379 3266030
3 19304878 2682512
===============================================
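The two outputs cannot be compared line by line as they stand, because ELink reports the PMC ID as a bare number while the shell pipeline prints it with the "PMC" prefix. Below is a minimal sketch that puts both into the same form and lists the PMIDs on which they disagree. The function names normalize_pmcid and find_mismatches are made up for this note, and the two dictionaries are typed in by hand from the outputs above.

```python
def normalize_pmcid(value):
    """Return a PMCID with the "PMC" prefix, or None when there is no ID."""
    value = (value or "").strip()
    if value in ("", "NA"):
        return None
    return value if value.upper().startswith("PMC") else "PMC" + value


def find_mismatches(first, second):
    """List (PMID, first ID, second ID) for each PMID in `first`
    whose normalized PMCID differs between the two mappings."""
    mismatches = []
    for pmid in first:
        a = normalize_pmcid(first.get(pmid))
        b = normalize_pmcid(second.get(pmid))
        if a != b:
            mismatches.append((pmid, a, b))
    return mismatches


# Values copied by hand from the two outputs shown in this note
shell_result = {
    "29900339": "PMC5997901",
    "29897644": "",
    "21990379": "PMC3266030",
    "19304878": "PMC2682512",
}
biopython_result = {
    "29900339": "NA",
    "29897644": "NA",
    "21990379": "3266030",
    "19304878": "2682512",
}

print(find_mismatches(shell_result, biopython_result))
```

With the values above, only 29900339 is reported; the empty shell field and the "NA" from Biopython for 29897644 are treated as the same "no PMCID" result.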
The issue is that for the first PMID (29900339), Biopython does not return the corresponding PMCID, whereas NCBI's command-line utilities do. I am not sure whether this is an error in my code, a problem in Biopython, or a difference in what NCBI returns.
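One difference between the two approaches is that the shell pipeline calls efetch and reads the ArticleIdList of each PubMed record, while the Python code calls elink. I have not verified that this explains the discrepancy. One way to narrow it down is to mirror the shell pipeline in Python. The sketch below does that under some assumptions: extract_pmcids and fetch_pubmed_xml are names made up for this note, the XML is parsed with the standard library, and SAMPLE_XML is a hand-written, heavily trimmed stand-in for efetch output, not a real response. The PMCID is selected by its IdType attribute rather than by column position as in the cut -f1,4 step.

```python
import xml.etree.ElementTree as ET


def extract_pmcids(xml_text):
    """Return (PMID, PMCID) pairs from PubMed efetch XML; "NA" if an ID is missing."""
    root = ET.fromstring(xml_text)
    pairs = []
    for article in root.iter("PubmedArticle"):
        # Use only the record's own ID list, not lists nested inside references
        id_list = article.find("PubmedData/ArticleIdList")
        ids = {}
        if id_list is not None:
            for article_id in id_list.findall("ArticleId"):
                ids[article_id.get("IdType")] = (article_id.text or "").strip()
        pairs.append((ids.get("pubmed", "NA"), ids.get("pmc", "NA")))
    return pairs


def fetch_pubmed_xml(pmids, email):
    """Fetch PubMed records as XML. Needs Biopython and network access."""
    from Bio import Entrez  # imported here so the parser works without Biopython

    Entrez.email = email
    handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
    try:
        return handle.read()
    finally:
        handle.close()


# Hand-written, trimmed stand-in for efetch output (not a real response)
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">29900339</ArticleId>
        <ArticleId IdType="pmc">PMC5997901</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">29897644</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

print(extract_pmcids(SAMPLE_XML))
```

To run it against NCBI, something like extract_pmcids(fetch_pubmed_xml(pmids, "someone@example.org")) should work, where pmids is the list of lines read from pmid.txt. If this returns a PMCID for 29900339 while the elink code does not, the difference lies in what the two NCBI endpoints return rather than in Biopython itself.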