sequencing

Read and merge multiple files by folder

Often the data we need is spread across multiple files, and we need a way to read all the files and merge their content. The goal is to generate a tidy data frame: tidy data have variables in columns and observations in rows. Here I demonstrate how to gather data from multiple files into a tidy dataset from: 1) a folder with one file per measurement; 2) a folder with one file per sample, each containing multiple measurements; 3) a folder where regex is used to select specific files. 4) Finally, I show off some superpowers by applying this to create a function that merges a folder of VCF files into one long data frame.
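The folder-merging idea can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code from the post itself. The function name `merge_folder`, the `source` column, and the assumption that every file is a CSV with a header row are all hypothetical choices made for the example.

```python
import csv
import re
from pathlib import Path


def merge_folder(folder, pattern=r".*\.csv$"):
    """Stack the rows of every file in `folder` whose name matches `pattern`.

    Hypothetical helper: assumes each file is a CSV with a header row.
    Returns a list of dicts (one per row) in "long" format.
    """
    name_re = re.compile(pattern)
    rows = []
    for path in sorted(Path(folder).iterdir()):
        # The regex decides which files are read; everything else is skipped.
        if not path.is_file() or not name_re.fullmatch(path.name):
            continue
        with path.open(newline="") as handle:
            for record in csv.DictReader(handle):
                # Keep the file of origin as its own column so that
                # each row stays one observation after merging.
                rows.append({"source": path.stem, **record})
    return rows
```

Passing a narrower pattern, for example `r"sample_\d+\.csv"`, restricts the merge to files whose names fit that scheme, which corresponds to the regex-based selection in case 3.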

NGS sequencing file formats

Sequencing data come in a wide variety of formats, each containing very specific information. This is a collection of notes on different formats and how to interact with some of them using command-line tools like samtools, or R. Next Generation Sequencing (NGS) technology in brief: NGS and Sanger sequencing are similar in principle. DNA polymerase adds fluorescently labelled nucleotides to a growing DNA strand copied from the template. Each base (C, T, G, A) is identified by the fluorescent signal it emits as it is added to the nucleic acid chain.

An intro to biomaRt

Or “How to annotate HUGO symbols and Entrez IDs with Ensembl IDs, and fetch the locations of the genes and their associated gene ontologies and dbSNPs using biomaRt.” biomaRt is a Bioconductor package that provides an R interface to BioMart databases such as Ensembl. It makes it possible to access and query a large amount of data and resources from Ensembl.

How to make R and Python scripts, and make them executable

While practicing my R and Python skills, I have written a script in both R and Python that performs the same task: take a DNA sequence from a FASTA file, count its length and the number of G’s, C’s and N’s, taking both uppercase and lowercase letters into account. The script then prints a message with the results. I also included a step that measures how many seconds the calculations take, so we can compare which script runs faster.
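The task described above could look roughly like this in Python. It is a sketch under stated assumptions rather than the script from the post: the function names (`read_fasta_sequence`, `count_bases`, `report`) are hypothetical, and the file is assumed to hold a single sequence, so all non-header lines are joined together.

```python
import time


def read_fasta_sequence(path):
    """Return the sequence in a FASTA file as one string.

    Assumption: the file holds a single record; header lines
    (starting with '>') are skipped and the rest is concatenated.
    """
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                parts.append(line)
    return "".join(parts)


def count_bases(sequence):
    """Count length and G, C, N occurrences, ignoring letter case."""
    upper = sequence.upper()  # treat 'g' and 'G' as the same base
    return {
        "length": len(upper),
        "G": upper.count("G"),
        "C": upper.count("C"),
        "N": upper.count("N"),
    }


def report(path):
    """Print the counts and how long the calculation took."""
    start = time.perf_counter()
    counts = count_bases(read_fasta_sequence(path))
    elapsed = time.perf_counter() - start
    print(
        f"Length: {counts['length']}, G: {counts['G']}, "
        f"C: {counts['C']}, N: {counts['N']} "
        f"(calculated in {elapsed:.4f} seconds)"
    )
    return counts
```

Calling `report()` with the path to a FASTA file prints the summary line. To run the file directly from the command line, a common approach is to put a `#!/usr/bin/env python3` line at the top, call `report()` at the bottom of the file, and mark the file executable with `chmod +x`.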