Bioinformatics

NGS sequencing file formats

Sequencing data comes in a wide variety of formats and they contain very specific information. This is a collection of notes on different formats, and how to interact with some of them using command line tools like samtools, or R. Next Generation Sequencing (NGS) technology in brief: NGS and Sanger sequencing is similar in principle. DNA polymerase adds fluorescent nucleotides to a growing DNA template strand. DNA bases are identified when each base C, T, G, A, emits a fluorescent signal as they are added to a nucleic acid chain.

An intro to biomaRt

Or “How to annotate Hugo Symbol and Entrez ID to ensembl ID, fetch the location of the genes and associated gene ontologies and dbSNPs using biomaRt.” biomaRt is a Bioconductor package that provides an R interface to the HGNC BioMart server. It makes it possible to access and query a large amount of data and resources from ensembl.

How to make R and Python scripts, and make them executable

While practicing my R and Python skills, I have written a script in both R and Python that will perform the same task. Namely take a DNA sequence from a fasta file, count it’s length, and the number of G’s, C’s and N’s while taking upper and lowercase format into account. Then it prints a message with the calculation results. I also included a step that will count how many seconds it takes to do the the calculations, so we can compare which script runs fastest.