An intro to biomaRt
By Synnøve Yndestad in R Bioinformatics sequencing
October 19, 2021
Or “How to annotate Hugo Symbol and Entrez ID to ensembl ID, fetch the location of the genes and associated gene ontologies and dbSNPs using biomaRt.”
biomaRt is a Bioconductor package that provides an R interface to the HGNC BioMart server. It makes it possible to access and query a large amount of data and resources from ensembl. We can use that for gene annotation and database mining across multiple species.
Genes and transcripts are named in a variety of ways, which can be quite confusing and a source of errors.
The ensembl identifiers is based on the NCBI RefSeq set and is stable. They are consistent and unambiguous across releases. Gene names on the other hand may change based on new knowledge or preferences, but the stable identifiers will remain the same.
Working with ensembl format id´s is therefore more reproducible and consistent over time, but we will need to translate those to more human recognizable format like hugo symbol (HGNC) gene names at the end stage of the data processing. Many bioinformatic tools also requires entrez id, i.e various subtyping tools. This means that translating between ensembl id, hugo symbol and entrez id is something that we need to do quite frequently, and depending on your workflow.
Code in brief
Just show me the code, no explanation needed:
library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
Values <- c("ENSG00000142208", "ENSG00000132781", "ENSG00000139618", "ENSG00000012048", "ENSG00000121879", "ENSG00000073050","ENSG00000181555", "ENSG00000141510")
MyGenes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
filters = "ensembl_gene_id",
values = Values,
mart = ensembl)
MyGenes
## ensembl_gene_id hgnc_symbol entrezgene_id
## 1 ENSG00000012048 BRCA1 672
## 2 ENSG00000073050 XRCC1 7515
## 3 ENSG00000121879 PIK3CA 5290
## 4 ENSG00000132781 MUTYH 4595
## 5 ENSG00000139618 BRCA2 675
## 6 ENSG00000141510 TP53 7157
## 7 ENSG00000142208 AKT1 207
## 8 ENSG00000181555 SETD2 29072
Getting started; Select database and dataset
Install from bioconductor for R Version 4.1:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
Load package and list the available BioMart databases.
library(biomaRt)
library(tidyverse)
# List the available BioMart databases
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 104
## 2 ENSEMBL_MART_MOUSE Mouse strains 104
## 3 ENSEMBL_MART_SNP Ensembl Variation 104
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 104
We want the ensembl BioMart, useMart to select and save it as mart.
# Pick the ensembl database and save it
mart <- useMart("ensembl")
mart
## Object of class 'Mart':
## Using the ENSEMBL_MART_ENSEMBL BioMart database
## No dataset selected.
Each database contains several datasets.
Have a look:
head(listDatasets(mart))
## dataset description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2)
## 3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2)
## 4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2)
## 5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5)
## 6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2)
## version
## 1 ASM259213v1
## 2 fAstCal1.2
## 3 AnoCar2.0v2
## 4 bAquChr1.2
## 5 Midas_v5
## 6 ASM200744v2
Search for the Human dataset
listDatasets(mart) %>% filter(grepl("Human", description))
## dataset description version
## 1 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13
This tells us that the current version of the Human genes is GRCh38.p13.
Make a note of the version used and the date you downloaded the dataset. If an older version for some reason is required, it is possible to make a query especially for the version you need.
Now, select the Homo Sapiens dataset.
# Select your dataset
ensembl <- useDataset("hsapiens_gene_ensembl", mart)
ensembl
## Object of class 'Mart':
## Using the ENSEMBL_MART_ENSEMBL BioMart database
## Using the hsapiens_gene_ensembl dataset
Build the query
To build your query, we use values, attributes, filters and which mart.
Filters = The kind of value you have, and want to search against (i.e “ensembl_gene_id”)
Attributes = The stuff you want to get, i.e hugo symbols, entrez id´s etc.
Values = The input values to annotate (Specific id`s, like “ENSG00000132781)
Get the available filters and attributes with listFilters() and listAttributes() and store them in your environment for easier inspection.
# Fetch available filters
filters <- listFilters(ensembl)
head(filters)
## name description
## 1 chromosome_name Chromosome/scaffold name
## 2 start Start
## 3 end End
## 4 band_start Band Start
## 5 band_end Band End
## 6 marker_start Marker Start
# Fetch available attributes
attributes <- listAttributes(ensembl)
head(attributes)
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_gene_id_version Gene stable ID version feature_page
## 3 ensembl_transcript_id Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5 ensembl_peptide_id Protein stable ID feature_page
## 6 ensembl_peptide_id_version Protein stable ID version feature_page
In the attributes, we find the all the attributes we can use in the query; “ensembl_gene_id,” “hgnc_symbol” and “entrezgene_id”
To get the entrez id and hgnc symbol for all ensembl id´s, we don´t need any values. We can just call getBM() with the attributes we want:
All.ensembl.ids <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"), mart = ensembl)
head(All.ensembl.ids)
## ensembl_gene_id hgnc_symbol entrezgene_id
## 1 ENSG00000210049 MT-TF NA
## 2 ENSG00000211459 MT-RNR1 NA
## 3 ENSG00000210077 MT-TV NA
## 4 ENSG00000210082 MT-RNR2 NA
## 5 ENSG00000209082 MT-TL1 NA
## 6 ENSG00000198888 MT-ND1 4535
# Print the number of rows
nrow(All.ensembl.ids)
## [1] 67485
This may be saved as a key for future reference and consistent workflows. Take note of the chromosome version used and the date it was downloaded for future references.
write_csv(All.ensembl.ids, "Ensembl_hgnc_entrez_key_GRCh38p13_191021.csv")
All the genes are more than we need for most purposes. In stead, we build a query containing only our genes of interest. Now, take the ensembl id´s that you want to annotate into a vector to use in the getBM() query as values, and set “ensembl_gene_id” as filters to specify that your values are of the type “ensembl_gene_id.”
GenesOfInterest <- c("ENSG00000142208", "ENSG00000132781", "ENSG00000139618", "ENSG00000012048", "ENSG00000121879", "ENSG00000073050","ENSG00000181555", "ENSG00000141510")
MyGenes <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
filters = "ensembl_gene_id",
values = GenesOfInterest,
mart = ensembl)
DT::datatable(MyGenes)
Similarly, we can fetch where our genes of interests are located by chromosome, start position, end position and band. We use the Values vector as before, but now we don´t include the ensembl_gene_id or the entrez_id in the output (the attributes to list) since we don´t need that in this output:
Location <- getBM(attributes = c("hgnc_symbol", "chromosome_name", 'start_position', 'end_position', 'band'),
filters = "ensembl_gene_id",
values = GenesOfInterest,
mart = ensembl)
DT::datatable(Location)
Then we can answer things like, which genes are the shortest/longest
DT::datatable(Location %>% mutate(Length = end_position-start_position) %>%
arrange(Length))
Search for GO terms
Now we want to see if any of our genes of interest are associated with a particular DNA repair pathway. Then we can fetch the genes associated with the GO terms for Base Excision Repair, Double Strand Break, MisMatch Repair and Single Strand Break repair.
BER: GO:0006284
DSB: GO:0006302
MMR: GO:0006298
SSB: GO:0000012
BER.Genes <- getBM(attributes= c("hgnc_symbol"),
filters=c("go"),
values = "GO:0006284",
mart=ensembl)
DSB.Genes <- getBM(attributes= c("hgnc_symbol"),
filters=c("go"),
values = "GO:0006302",
mart=ensembl)
MMR.Genes <- getBM(attributes= c("hgnc_symbol"),
filters=c("go"),
values = "GO:0006298",
mart=ensembl)
SSB.genes <- getBM(attributes= c("hgnc_symbol"),
filters=c("go"),
values = "GO:0000012",
mart=ensembl)
Merge the results:
BER.Genes$BER <- "BER"
DSB.Genes$DSB <- "DSB"
MMR.Genes$MMR <- "MMR"
SSB.genes$SSB <- "SSB"
RepairGenes <- BER.Genes %>% full_join(DSB.Genes) %>%
full_join(MMR.Genes) %>%
full_join(SSB.genes)
## Joining, by = "hgnc_symbol"
## Joining, by = "hgnc_symbol"
## Joining, by = "hgnc_symbol"
# Add annotation to MyResults
MyGenes <- MyGenes %>% left_join(RepairGenes)
## Joining, by = "hgnc_symbol"
DT::datatable(MyGenes)
Fetch dbSNP from biomaRt
We can fetch variants from dbSNP, but then we need to query a different mart, the “snp” mart.
snpmart = useEnsembl(biomart = "snp", dataset="hsapiens_snp")
table(attributes$page)
##
## feature_page homologs sequences snp snp_somatic structure
## 203 2782 60 44 35 40
To find what attributes we can use in the query, we can search in the snp attributes for the ensembl mart we created in the beginning.
attributes %>% filter(page == "snp")
## name description page
## 1 ensembl_gene_id Gene stable ID snp
## 2 ensembl_gene_id_version Gene stable ID version snp
## 3 version Version (gene) snp
## 4 ensembl_transcript_id Transcript stable ID snp
## 5 ensembl_transcript_id_version Transcript stable ID version snp
## 6 transcript_version Version (transcript) snp
## 7 ensembl_peptide_id Protein stable ID snp
## 8 ensembl_peptide_id_version Protein stable ID version snp
## 9 peptide_version Version (protein) snp
## 10 chromosome_name Chromosome/scaffold name snp
## 11 start_position Gene start (bp) snp
## 12 end_position Gene end (bp) snp
## 13 strand Strand snp
## 14 band Karyotype band snp
## 15 external_gene_name Gene name snp
## 16 external_gene_source Source of gene name snp
## 17 transcript_count Transcript count snp
## 18 percentage_gene_gc_content Gene % GC content snp
## 19 description Gene description snp
## 20 variation_name Variant name snp
## 21 germ_line_variation_source Variant source snp
## 22 source_description Variant source description snp
## 23 allele Variant alleles snp
## 24 validated Variant supporting evidence snp
## 25 mapweight Mapweight snp
## 26 minor_allele Minor allele snp
## 27 minor_allele_freq Minor allele frequency snp
## 28 minor_allele_count Minor allele count snp
## 29 clinical_significance Clinical significance snp
## 30 transcript_location Transcript location (bp) snp
## 31 snp_chromosome_strand Variant chromosome Strand snp
## 32 peptide_location Protein location (aa) snp
## 33 chromosome_start chromosome/scaffold position start (bp) snp
## 34 chromosome_end Chromosome/scaffold position end (bp) snp
## 35 polyphen_prediction_2076 PolyPhen prediction snp
## 36 polyphen_score_2076 PolyPhen score snp
## 37 sift_prediction_2076 SIFT prediction snp
## 38 sift_score_2076 SIFT score snp
## 39 distance_to_transcript_2076 Distance to transcript snp
## 40 cds_start_2076 CDS start snp
## 41 cds_end_2076 CDS end snp
## 42 peptide_shift Protein allele snp
## 43 synonymous_status Variant consequence snp
## 44 allele_string_2076 Consequence specific allele snp
So, to get all the known snp´s for MUTYH, we can use the chromosome location in the query above to select the snp´s to extract. We specify in the filters and values the location to list SNPs from. So we specify the location og MUTYH by chromosome name, start position and end position.
From the Location table above:
MUTYH, 1, 45329163, 45340893
Get all known variants in MUTYH and print the variants that is listed as pathogenic in clinical_significance.
# Load the snp mart:
snpmart = useEnsembl(biomart = "snp", dataset="hsapiens_snp")
MUTYH_snp <-getBM(attributes = c('refsnp_id','allele','chrom_start',"clinical_significance",
"minor_allele_freq","validated"),
filters = c('chr_name','start','end'),
values = list(1, 45329163, 45340893),
mart = snpmart)
DT::datatable(MUTYH_snp %>% filter(grepl("pathogenic", clinical_significance)))
Other things you can do with biomaRt is to retrieve gene sequences, coding regions, 5’UTR regions, transcripts, protein sequence etc using the getSequence() function.
For more details on how to query in biomaRt see the excellent vignette or reference manual.
- Posted on:
- October 19, 2021
- Length:
- 9 minute read, 1901 words
- Categories:
- R Bioinformatics sequencing
- See Also: