VCF

The Pediatric Soft Tissue Sarcoma Paper, a COVID lock-down side quest and what is a genomic variant

Our paper, "Germline variants in patients diagnosed with pediatric soft tissue sarcoma", was published during the summer. This paper began with the story of a young male with a sarcoma in the prostate, which is a very rare condition. The treating physician wanted to find better treatment options. To look for clues, our lab ran a 360-gene panel DNA sequencing assay to see whether there were any targetable mutations that could direct the treatment of this young man.

Read and merge multiple files by folder

Often the data we need is spread out across multiple files, and we need a way to read all the files and merge the content. The goal is to generate a tidy data frame. Tidy data have variables in columns and observations in rows. Here I demonstrate how to gather data from multiple files into a tidy dataset from: 1- A folder with one file per measurement. 2- A folder with one file per sample, each holding multiple measurements. 3- A folder where regex is used to select specific files. 4- Finally, I show off some superpowers by using this to create a function that merges a folder of VCF files into one long data frame.
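The general pattern can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code from the post itself. The function name `merge_folder`, the `sample` column, and the example file names and values are all made up for the example, and it assumes every file is a comma-separated table with a header row.

```python
import csv
import re
import tempfile
from pathlib import Path


def merge_folder(folder, pattern=r".*\.csv$", id_column="sample"):
    """Read every file in `folder` whose name matches `pattern` and stack
    the rows into one long list of dicts (one dict per observation).

    Hypothetical helper: assumes comma-separated files with a header row.
    """
    name_filter = re.compile(pattern)
    rows = []
    # Sort the paths so the merged result has a stable order.
    for path in sorted(Path(folder).iterdir()):
        if not (path.is_file() and name_filter.search(path.name)):
            continue  # skip sub-folders and files the regex does not select
        with path.open(newline="") as handle:
            for record in csv.DictReader(handle):
                # Keep track of which file each observation came from.
                record[id_column] = path.stem
                rows.append(record)
    return rows


def demo():
    """Build a throwaway folder with made-up data and merge it."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "sampleA.csv").write_text("gene,value\nTP53,1.5\nNF1,0.2\n")
        (Path(tmp) / "sampleB.csv").write_text("gene,value\nTP53,2.0\n")
        (Path(tmp) / "notes.txt").write_text("not a data file\n")
        return merge_folder(tmp)
```

Calling `demo()` returns three rows, each carrying the file stem in the `sample` column, while `notes.txt` is left out because it does not match the regex. All values come back as strings, so numeric columns would still need converting before analysis.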

NGS sequencing file formats

Sequencing data comes in a wide variety of formats, each containing very specific information. This is a collection of notes on different formats, and how to interact with some of them using command line tools like samtools, or R. Next Generation Sequencing (NGS) technology in brief: NGS and Sanger sequencing are similar in principle. DNA polymerase adds fluorescently labeled nucleotides to a growing DNA strand copied from the template. Each base (C, T, G, A) is identified by the fluorescent signal it emits as it is added to the nucleic acid chain.