An intro to biomaRt

By Synnøve Yndestad in R Bioinformatics sequencing

October 19, 2021

Or “How to annotate Hugo Symbol and Entrez ID to ensembl ID, fetch the location of the genes and associated gene ontologies and dbSNPs using biomaRt.”
biomaRt is a Bioconductor package that provides an R interface to the HGNC BioMart server. It makes it possible to access and query a large amount of data and resources from ensembl. We can use that for gene annotation and database mining across multiple species.

Genes and transcripts are named in a variety of ways, which can be quite confusing and a source of errors.
The ensembl identifiers is based on the NCBI RefSeq set and is stable. They are consistent and unambiguous across releases. Gene names on the other hand may change based on new knowledge or preferences, but the stable identifiers will remain the same. Working with ensembl format id´s is therefore more reproducible and consistent over time, but we will need to translate those to more human recognizable format like hugo symbol (HGNC) gene names at the end stage of the data processing. Many bioinformatic tools also requires entrez id, i.e various subtyping tools. This means that translating between ensembl id, hugo symbol and entrez id is something that we need to do quite frequently, and depending on your workflow.

Code in brief

Just show me the code, no explanation needed:

library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

Values <- c("ENSG00000142208", "ENSG00000132781", "ENSG00000139618", "ENSG00000012048", "ENSG00000121879", "ENSG00000073050","ENSG00000181555", "ENSG00000141510")

MyGenes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"), 
                 filters = "ensembl_gene_id",
                 values = Values,
                 mart = ensembl)
MyGenes

##   ensembl_gene_id hgnc_symbol entrezgene_id
## 1 ENSG00000012048       BRCA1           672
## 2 ENSG00000073050       XRCC1          7515
## 3 ENSG00000121879      PIK3CA          5290
## 4 ENSG00000132781       MUTYH          4595
## 5 ENSG00000139618       BRCA2           675
## 6 ENSG00000141510        TP53          7157
## 7 ENSG00000142208        AKT1           207
## 8 ENSG00000181555       SETD2         29072

Getting started; Select database and dataset

Install from bioconductor for R Version 4.1:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomaRt")

Load package and list the available BioMart databases.

library(biomaRt)
library(tidyverse)

# List the available BioMart databases 
listMarts()

##                biomart                version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 104
## 2   ENSEMBL_MART_MOUSE      Mouse strains 104
## 3     ENSEMBL_MART_SNP  Ensembl Variation 104
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 104

We want the ensembl BioMart, useMart to select and save it as mart.

# Pick the ensembl database and save it
mart <- useMart("ensembl")
mart

## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   No dataset selected.

Each database contains several datasets.
Have a look:

head(listDatasets(mart))

##                        dataset                           description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2     acalliptera_gene_ensembl      Eastern happy genes (fAstCal1.2)
## 3   acarolinensis_gene_ensembl       Green anole genes (AnoCar2.0v2)
## 4    acchrysaetos_gene_ensembl       Golden eagle genes (bAquChr1.2)
## 5    acitrinellus_gene_ensembl        Midas cichlid genes (Midas_v5)
## 6    amelanoleuca_gene_ensembl       Giant panda genes (ASM200744v2)
##       version
## 1 ASM259213v1
## 2  fAstCal1.2
## 3 AnoCar2.0v2
## 4  bAquChr1.2
## 5    Midas_v5
## 6 ASM200744v2

Search for the Human dataset

listDatasets(mart) %>% filter(grepl("Human", description))

##                 dataset              description    version
## 1 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

This tells us that the current version of the Human genes is GRCh38.p13.
Make a note of the version used and the date you downloaded the dataset. If an older version for some reason is required, it is possible to make a query especially for the version you need.

Now, select the Homo Sapiens dataset.

# Select your dataset
ensembl <- useDataset("hsapiens_gene_ensembl", mart)
ensembl

## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   Using the hsapiens_gene_ensembl dataset

Build the query

To build your query, we use values, attributes, filters and which mart.

Filters = The kind of value you have, and want to search against (i.e “ensembl_gene_id”)
Attributes = The stuff you want to get, i.e hugo symbols, entrez id´s etc.
Values = The input values to annotate (Specific id`s, like “ENSG00000132781)

Get the available filters and attributes with listFilters() and listAttributes() and store them in your environment for easier inspection.

# Fetch available filters
filters <- listFilters(ensembl)
head(filters)

##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

# Fetch available attributes
attributes <- listAttributes(ensembl)
head(attributes)

##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

In the attributes, we find the all the attributes we can use in the query; “ensembl_gene_id,” “hgnc_symbol” and “entrezgene_id”

To get the entrez id and hgnc symbol for all ensembl id´s, we don´t need any values. We can just call getBM() with the attributes we want:

All.ensembl.ids <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"), mart = ensembl)
head(All.ensembl.ids)

##   ensembl_gene_id hgnc_symbol entrezgene_id
## 1 ENSG00000210049       MT-TF            NA
## 2 ENSG00000211459     MT-RNR1            NA
## 3 ENSG00000210077       MT-TV            NA
## 4 ENSG00000210082     MT-RNR2            NA
## 5 ENSG00000209082      MT-TL1            NA
## 6 ENSG00000198888      MT-ND1          4535

# Print the number of rows
nrow(All.ensembl.ids)

## [1] 67485

This may be saved as a key for future reference and consistent workflows. Take note of the chromosome version used and the date it was downloaded for future references.

write_csv(All.ensembl.ids, "Ensembl_hgnc_entrez_key_GRCh38p13_191021.csv")

All the genes are more than we need for most purposes. In stead, we build a query containing only our genes of interest. Now, take the ensembl id´s that you want to annotate into a vector to use in the getBM() query as values, and set “ensembl_gene_id” as filters to specify that your values are of the type “ensembl_gene_id.”

GenesOfInterest <- c("ENSG00000142208", "ENSG00000132781", "ENSG00000139618", "ENSG00000012048", "ENSG00000121879", "ENSG00000073050","ENSG00000181555", "ENSG00000141510")

MyGenes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"), 
                 filters = "ensembl_gene_id",
                 values = GenesOfInterest,
                 mart = ensembl)
DT::datatable(MyGenes)

Similarly, we can fetch where our genes of interests are located by chromosome, start position, end position and band. We use the Values vector as before, but now we don´t include the ensembl_gene_id or the entrez_id in the output (the attributes to list) since we don´t need that in this output:

Location <- getBM(attributes = c("hgnc_symbol", "chromosome_name", 'start_position', 'end_position', 'band'), 
                 filters = "ensembl_gene_id",
                 values = GenesOfInterest,
                 mart = ensembl)
DT::datatable(Location)

Then we can answer things like, which genes are the shortest/longest

DT::datatable(Location %>% mutate(Length = end_position-start_position) %>% 
                           arrange(Length))

Search for GO terms

Now we want to see if any of our genes of interest are associated with a particular DNA repair pathway. Then we can fetch the genes associated with the GO terms for Base Excision Repair, Double Strand Break, MisMatch Repair and Single Strand Break repair.

BER: GO:0006284
DSB: GO:0006302
MMR: GO:0006298
SSB: GO:0000012

BER.Genes <- getBM(attributes= c("hgnc_symbol"),
                   filters=c("go"),
                   values = "GO:0006284",
                   mart=ensembl)
DSB.Genes <- getBM(attributes= c("hgnc_symbol"),
                   filters=c("go"),
                   values = "GO:0006302",
                   mart=ensembl)
MMR.Genes <- getBM(attributes= c("hgnc_symbol"),
                   filters=c("go"),
                   values = "GO:0006298",
                   mart=ensembl)
SSB.genes <- getBM(attributes= c("hgnc_symbol"),
                   filters=c("go"),
                   values = "GO:0000012",
                   mart=ensembl)

Merge the results:

BER.Genes$BER <- "BER"
DSB.Genes$DSB <- "DSB"
MMR.Genes$MMR <- "MMR"
SSB.genes$SSB <- "SSB"

RepairGenes <- BER.Genes %>% full_join(DSB.Genes) %>% 
                             full_join(MMR.Genes) %>% 
                             full_join(SSB.genes)

## Joining, by = "hgnc_symbol"
## Joining, by = "hgnc_symbol"
## Joining, by = "hgnc_symbol"

# Add annotation to MyResults
MyGenes <- MyGenes %>% left_join(RepairGenes)

## Joining, by = "hgnc_symbol"

DT::datatable(MyGenes)

Fetch dbSNP from biomaRt

We can fetch variants from dbSNP, but then we need to query a different mart, the “snp” mart.

snpmart = useEnsembl(biomart = "snp", dataset="hsapiens_snp")

table(attributes$page)

## 
## feature_page     homologs    sequences          snp  snp_somatic    structure 
##          203         2782           60           44           35           40

To find what attributes we can use in the query, we can search in the snp attributes for the ensembl mart we created in the beginning.

attributes %>% filter(page == "snp")

##                             name                             description page
## 1                ensembl_gene_id                          Gene stable ID  snp
## 2        ensembl_gene_id_version                  Gene stable ID version  snp
## 3                        version                          Version (gene)  snp
## 4          ensembl_transcript_id                    Transcript stable ID  snp
## 5  ensembl_transcript_id_version            Transcript stable ID version  snp
## 6             transcript_version                    Version (transcript)  snp
## 7             ensembl_peptide_id                       Protein stable ID  snp
## 8     ensembl_peptide_id_version               Protein stable ID version  snp
## 9                peptide_version                       Version (protein)  snp
## 10               chromosome_name                Chromosome/scaffold name  snp
## 11                start_position                         Gene start (bp)  snp
## 12                  end_position                           Gene end (bp)  snp
## 13                        strand                                  Strand  snp
## 14                          band                          Karyotype band  snp
## 15            external_gene_name                               Gene name  snp
## 16          external_gene_source                     Source of gene name  snp
## 17              transcript_count                        Transcript count  snp
## 18    percentage_gene_gc_content                       Gene % GC content  snp
## 19                   description                        Gene description  snp
## 20                variation_name                            Variant name  snp
## 21    germ_line_variation_source                          Variant source  snp
## 22            source_description              Variant source description  snp
## 23                        allele                         Variant alleles  snp
## 24                     validated             Variant supporting evidence  snp
## 25                     mapweight                               Mapweight  snp
## 26                  minor_allele                            Minor allele  snp
## 27             minor_allele_freq                  Minor allele frequency  snp
## 28            minor_allele_count                      Minor allele count  snp
## 29         clinical_significance                   Clinical significance  snp
## 30           transcript_location                Transcript location (bp)  snp
## 31         snp_chromosome_strand               Variant chromosome Strand  snp
## 32              peptide_location                   Protein location (aa)  snp
## 33              chromosome_start chromosome/scaffold position start (bp)  snp
## 34                chromosome_end   Chromosome/scaffold position end (bp)  snp
## 35      polyphen_prediction_2076                     PolyPhen prediction  snp
## 36           polyphen_score_2076                          PolyPhen score  snp
## 37          sift_prediction_2076                         SIFT prediction  snp
## 38               sift_score_2076                              SIFT score  snp
## 39   distance_to_transcript_2076                  Distance to transcript  snp
## 40                cds_start_2076                               CDS start  snp
## 41                  cds_end_2076                                 CDS end  snp
## 42                 peptide_shift                          Protein allele  snp
## 43             synonymous_status                     Variant consequence  snp
## 44            allele_string_2076             Consequence specific allele  snp

So, to get all the known snp´s for MUTYH, we can use the chromosome location in the query above to select the snp´s to extract. We specify in the filters and values the location to list SNPs from. So we specify the location og MUTYH by chromosome name, start position and end position.

From the Location table above:
MUTYH, 1, 45329163, 45340893

Get all known variants in MUTYH and print the variants that is listed as pathogenic in clinical_significance.

# Load the snp mart:
snpmart = useEnsembl(biomart = "snp", dataset="hsapiens_snp")


MUTYH_snp <-getBM(attributes = c('refsnp_id','allele','chrom_start',"clinical_significance",
                                 "minor_allele_freq","validated"), 
                     filters = c('chr_name','start','end'), 
                      values = list(1,  45329163,   45340893), 
                        mart = snpmart)

DT::datatable(MUTYH_snp %>% filter(grepl("pathogenic", clinical_significance)))

Other things you can do with biomaRt is to retrieve gene sequences, coding regions, 5’UTR regions, transcripts, protein sequence etc using the getSequence() function.

For more details on how to query in biomaRt see the excellent vignette or reference manual.

Posted on:: October 19, 2021

Length:: 9 minute read, 1901 words

Categories:: R Bioinformatics sequencing

Tags:: biomaRt gene annotation gene ontology dbSNP ClinVar ensembl entrez hgnc database query

See Also: