Biologist's bioinformatics notes

Biomart (the biomaRt package) is an annotation package in R that can be used to annotate data from microarray studies. The user can download biomart from this page: http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html

Using biomart is very simple. The biomart package gives access to several sources of annotation (e.g. ENSEMBL, VEGA, unimart), and each source has an associated version number, which is useful for replicating the annotations. Within each source, there are several databases for different organisms. Each database contains annotations for user-provided information (e.g. Affymetrix probe sets, gene symbols). The user can list the sources of annotation, choose a database and supply the information that needs to be annotated.

Think of biomart as a street full of shops. In this analogy, each shop is a mart, or source of annotation. Within each shop, there are several sections (i.e. several databases). Each section (database) contains the information that the user requires.

To annotate data, the user has to choose the source/mart (shop) and the section (database). The data can be supplied in multiple formats, depending on the database/mart. Biomart annotates the data with simple logic: the user provides information that is common to the database and the user's data, and biomart automagically fetches the corresponding data. It works more or less like joining tables in a database. To join two tables, the user has to join them on an id that is common to both. Once they are joined, it is easy to fetch data from both tables.
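The table-joining analogy can be sketched in base R with merge(). This is only an illustration of the idea, not what biomart does internally. The two data frames and the gene symbols in them are made-up placeholders.

```r
# Hypothetical user data: probe IDs with a measured value.
user_data <- data.frame(
  probe = c("210254_at", "220772_at"),
  value = c(1.8, -0.6),
  stringsAsFactors = FALSE
)

# Hypothetical annotation table; the symbols are placeholders, not real annotations.
annotation <- data.frame(
  probe  = c("210254_at", "220772_at", "1007_s_at"),
  symbol = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

# Join the two tables on the shared "probe" column.
# Only probes present in both tables are kept.
joined <- merge(user_data, annotation, by = "probe")
print(joined)
```

The shared "probe" column plays the role that a filter plays in a biomart query.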

A biomart query has three parts:

1) What the user wants to fetch
2) Which category of biological identifier the program should use for the lookup. This must be common to the user's data and the source (explained below).
3) The list of entities (genes, probes etc.) for which the user wants annotation

Regarding point 1, the user wants to fetch one or more annotations for the entities mentioned in point 3. The available annotations are called "attributes" in the biomart package. They can change from mart to mart and, within a mart, from database to database.

Regarding point 2, when fetching annotations, the user has to define the category or categories (e.g. affy_hg_u133_plus_2, unigene) by which biomart looks up the user-supplied list. These categories are called "filters", and the user can list the filters available within a database and mart.

Regarding point 3, the user needs to supply the entities that are to be annotated. They are called "values" in biomart. The entity category must be listed in the "filters" of the dataset/mart. For example, to annotate Affymetrix HG U133 Plus 2 probe sets (e.g. "210254_at", "220772_at"), the filters should contain a category called "affy_hg_u133_plus_2".

A biomart query is a call to an annotation function with 3 arguments, as follows:

get biomart annotation (wanted annotations, for which category annotations are wanted, probe/gene list under the category)
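As a rough sketch, the three parts map onto the attributes, filters and values arguments of the getBM function used later in this post. The attribute names chosen here (chromosome_name, start_position, end_position) are only an illustrative selection, and the commented-out call assumes a dataset object named hs_ensembl has already been created.

```r
# The three parts of a query, written as named arguments.
query <- list(
  # 1) what to fetch
  attributes = c("affy_hg_u133_plus_2", "chromosome_name",
                 "start_position", "end_position"),
  # 2) the category used for the lookup
  filters    = "affy_hg_u133_plus_2",
  # 3) the entities to annotate
  values     = c("210254_at", "220772_at")
)
str(query)

# With a dataset object (assumed here to be called hs_ensembl), the call would be:
# library("biomaRt")
# result <- do.call(getBM, c(query, list(mart = hs_ensembl)))
```

Including the filter category among the attributes keeps the probe ID next to each returned annotation.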

1) Load biomart in R

 $ library("biomaRt")

If biomart is not installed, the user should install it before proceeding further. Installation instructions are provided on this page: http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html

2) List all the marts / sources of information

$ listMarts()

This should list around 60 sources (as of biomart package version 2.20.0). The output is long, and the user should see three columns:

column 1, without header: serial number; column 2, with header "biomart": the sources/marts; column 3, with header "version": the version of each source.

3) Select a mart

$ useMart("ensembl")

The user can choose any mart. Here we choose ensembl as the mart/source of annotation.

4) List the databases within the ensembl mart.

$ ensembl = useMart("ensembl")

This stores the mart selected by useMart as a variable. To list the databases within ensembl, use the following command:

$ listDatasets(ensembl)

This should list all the databases in the ensembl annotation source.

There are around 66 (as of biomart package version 2.20.0). The output is long, and the user should see 4 columns:
column 1, without header: serial number; column 2, with header "dataset": the datasets; column 3, with header "Description": a short description of each dataset; column 4, with header "version": the version of each dataset.
In general, each dataset represents an organism in the ensembl mart.

A working annotation example for Affymetrix Human Genome U133 Plus 2.0 Array probes:

1) Click on the link to view example probes: https://drive.google.com/file/d/0B0MpwluEDxNuY3h6NnBPczR1Wjg/edit?usp=sharing. These probes are from Affymetrix Human Genome U133 Plus 2.0 Array.

The user should see the list of probes on screen.

2) Copy the text and paste it into a text editor of choice. Save it as "affy_probes.txt"
3) Start an R session and navigate to the directory where the affy_probes.txt file is saved
4) Import the probes by executing the following command:

$ affy <- read.table ("affy_probes.txt", header=TRUE)

This should create an object named "affy" in the current R session. Print affy to check that the import is correct by executing the following command:

$ affy

The user should see the imported probes printed on screen (if the import is correct).

5) Now load the Biomart (biomaRt package) library.
6) List the available marts

$ listMarts()

7) Select a mart and store it as an object for later use
$ ensembl=useMart("ensembl")

8) List datasets available within ensembl mart
$ listDatasets(ensembl)

9) Use human gene dataset to annotate affymetrix probes.
$ hs_ensembl=useDataset("hsapiens_gene_ensembl", mart=ensembl)

10) Choose the required annotations from the attributes available in the dataset "hs_ensembl".

For this example, I chose Chromosome Name, Gene End (bp), GO Term Name, Gene Start (bp), WikiGene Name and Affy HG U133-PLUS-2 probeset. (Though we are supplying the probes, we need to fetch them as well, so that each annotation can be compared against its probe.)

$ attr=listAttributes(hs_ensembl)[c(94,6,7,8,85,28,29),]
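Row numbers in the attribute list may shift between releases of a mart, so selecting attributes by name is a possible alternative to selecting them by position. The sketch below uses a small stand-in data frame in place of the real listAttributes output (which is much longer). The object names all_attr, wanted and attr_by_name are made up for this illustration.

```r
# Stand-in for the data frame returned by listAttributes(); only a few rows are shown.
all_attr <- data.frame(
  name = c("ensembl_gene_id", "chromosome_name", "start_position",
           "end_position", "affy_hg_u133_plus_2"),
  description = c("Ensembl Gene ID", "Chromosome Name", "Gene Start (bp)",
                  "Gene End (bp)", "Affy HG U133-PLUS-2 probeset"),
  stringsAsFactors = FALSE
)

# Attribute names we want, in the order we want them.
wanted <- c("affy_hg_u133_plus_2", "chromosome_name",
            "start_position", "end_position")

# match() returns the row of each wanted name, independent of row order.
attr_by_name <- all_attr[match(wanted, all_attr$name), ]
print(attr_by_name)
```

If a wanted name is missing from the table, match() returns NA for it, which makes the problem visible when the selection is printed.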

11) Print the selected annotations to cross-check that the required annotations are selected.

12) Annotate the probesets (and store the annotations to an object: "bm_hs_ensembl")

$ bm_hs_ensembl=getBM(attributes=c(attr[,1]), filters='affy_hg_u133_plus_2', values=affy, mart=hs_ensembl)
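The values argument of getBM is documented as taking a vector of IDs, so one option is to pull the probe column out of the imported data frame as a plain character vector first. The sketch below imitates the import step with a temporary file. The header name "probe" and the object name probe_ids are assumptions made for this illustration.

```r
# Write a small stand-in for affy_probes.txt; the header name "probe" is an assumption.
probe_file <- tempfile(fileext = ".txt")
writeLines(c("probe", "210254_at", "220772_at"), probe_file)

# Read the IDs as text rather than as a factor.
affy <- read.table(probe_file, header = TRUE, stringsAsFactors = FALSE)

# Pull the first column out as a plain character vector.
probe_ids <- affy[[1]]
print(probe_ids)

# probe_ids could then be passed to getBM as: values = probe_ids
```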

13) Store the annotated probes as a tab-separated file for future use

$ write.table (bm_hs_ensembl, "bm_hs_ensembl.txt", sep="\t")
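A possible variation on the export step is to drop the row names and quotes, which gives a plain tab-separated file whose header lines up with the columns. The sketch below uses a stand-in data frame with placeholder values in place of the real getBM result. The names bm_result, out_file and check are made up for this illustration.

```r
# Stand-in for the data frame returned by getBM(); the chromosome values are placeholders.
bm_result <- data.frame(
  affy_hg_u133_plus_2 = c("210254_at", "220772_at"),
  chromosome_name     = c("chr_placeholder_1", "chr_placeholder_2"),
  stringsAsFactors    = FALSE
)

out_file <- file.path(tempdir(), "bm_example.txt")

# row.names = FALSE drops the 1, 2, 3... row labels; quote = FALSE leaves text unquoted.
write.table(bm_result, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back to confirm that the header and columns line up.
check <- read.table(out_file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE, colClasses = "character")
print(check)
```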

This is how probes of interest can be annotated in biomart. The R configuration I used for this work is:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics grDevices utils     datasets methods   base

other attached packages:
[1] biomaRt_2.20.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.26.0 Biobase_2.24.0       BiocGenerics_0.10.0 DBI_0.2-7
[5] GenomeInfoDb_1.0.2   IRanges_1.22.9       parallel_3.1.1       RCurl_1.95-4.1
[9] RSQLite_0.11.4       stats4_3.1.1         tools_3.1.1          XML_3.98-1.1

Jul 17, 2014 - using Biomart in R