Dragen QC report template for multiple samples from fastqc metrics
By Synnøve Yndestad in R sequencing RNAseq
February 14, 2022
For RNAseq performed on the Illumina NovaSeq 6000, a single run may contain 70 different samples. Aggregating and plotting quality control (QC) metrics across the whole run makes it easy to spot samples with low sequencing quality.
While MultiQC is an excellent tool for aggregating and visualizing QC metrics, my RNAseq project is run using the Dragen pipeline. Since no Dragen module had yet been implemented in MultiQC when I was processing the samples, I wrote my own version in the form of an R Markdown report template. It takes a folder of *.fastq_metrics.csv files generated by Dragen and produces an HTML report with plots made interactive by plotly.
An example report can be viewed here.
The R Markdown template and the example report can be found on my GitHub here.
Instructions for use:
1- Add a folder containing the *.fastq_metrics.csv files to the working directory.
The folder name will be assigned as RunID.
2- Change any run-specific details in the Description section to document them for future reference, e.g. what kind of samples, from which study, which prep protocol generated the library, and which Dragen version was used in the processing.
3- Knit report
The knitted report contains the plots listed below.
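To give a feel for what happens when the report is knitted, here is a rough base R sketch of the loading step. It is not the template's actual code. The function name `read_fastqc_metrics`, the column names, the header-less four-column layout (section, mate, metric, value), and taking the sample name from the file name are all assumptions for illustration, so check them against your own Dragen output.

```r
# Hypothetical helper: read every *.fastq_metrics.csv file in a run folder
# into one long data frame. Column names are assumptions, not Dragen's own.
read_fastqc_metrics <- function(run_dir) {
  files <- list.files(run_dir, pattern = "\\.fastq_metrics\\.csv$",
                      full.names = TRUE)
  if (length(files) == 0) {
    stop("No *.fastq_metrics.csv files found in ", run_dir)
  }

  per_sample <- lapply(files, function(f) {
    # Assumed layout: no header row, four comma-separated fields per line
    df <- read.csv(f, header = FALSE, stringsAsFactors = FALSE,
                   col.names = c("section", "mate", "metric", "value"))
    # Assumption: the sample name is the file name without the suffix
    df$sample <- sub("\\.fastq_metrics\\.csv$", "", basename(f))
    df
  })

  metrics <- do.call(rbind, per_sample)
  # The folder name doubles as the RunID, as described in step 1
  metrics$run_id <- basename(normalizePath(run_dir))
  metrics
}

# Usage (hypothetical folder name):
# metrics <- read_fastqc_metrics("MyRunFolder")
# table(metrics$sample, metrics$section)
```

Keeping everything in one long table with a `sample` column makes it straightforward to filter by section and plot all samples of a run together.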
Plots produced by the report:
1- Read Mean quality; Per-Sequence Quality Scores
Total number of reads at each mean Phred-scale quality value, where each read's mean quality is rounded to the nearest integer.
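As a reminder of what these values mean, a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 is a 1 in 100 chance of a wrong base call and Q30 is 1 in 1000. The sketch below converts scores to error probabilities and summarises a read mean quality distribution. The label format "Q<score> Reads", the helper names, and the toy counts are assumptions for illustration only.

```r
# Convert Phred-scale quality scores to error probabilities
phred_to_error <- function(q) 10^(-q / 10)

# Hypothetical helper: pull the integer score out of labels like "Q30 Reads"
# (the label format is an assumption about the metrics file)
parse_read_quality <- function(labels) {
  as.integer(sub("^Q([0-9]+) Reads$", "\\1", labels))
}

# Fraction of reads whose mean quality is at least `threshold`
fraction_at_least <- function(labels, counts, threshold = 30) {
  q <- parse_read_quality(labels)
  sum(counts[q >= threshold]) / sum(counts)
}

# Toy values, not real data
labels <- c("Q20 Reads", "Q30 Reads", "Q38 Reads")
counts <- c(50, 150, 800)

phred_to_error(c(20, 30))              # 0.01 0.001
fraction_at_least(labels, counts, 30)  # 0.95
```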
2- Positional Base Mean Quality; Per-Base Quality Scores
Average Phred-scale quality value of bases with a specific nucleotide at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.
3- Positional Base Content
Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.
Shown as a Per-Position Sequence Content heatmap and a Per-Position N Content plot.
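Raw base counts are hard to compare between samples with different read depths, so a heatmap like this is easier to read when the counts are converted to fractions per position. A minimal sketch with toy counts (not real data), assuming the counts have already been arranged as a position by nucleotide matrix:

```r
# Toy counts: rows are read positions, columns are nucleotides
counts <- matrix(
  c(25, 25, 25, 24, 1,
    30, 20, 20, 30, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(position = c("1", "2"),
                  base = c("A", "C", "G", "T", "N"))
)

# Divide each row by its total so every position sums to 1
base_fraction <- prop.table(counts, margin = 1)

# The N column on its own gives the per-position N content
n_content <- base_fraction[, "N"]
```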
4- Read length
Total number of reads with each observed length.
5- Read GC Content; Per-Sequence GC Content
Total number of reads at each GC content percentage between 0% and 100%.
6- Read GC Content Quality; Average mean quality for reads by GC%
Average Phred-scale read mean quality for reads at each GC content percentage between 0% and 100%.
7- Sequence Positions; Cumulative Adapter Content
Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads.
8- Positional Quality
Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.
This plot is the same type of plot as the box-and-whisker plot generated by FastQC. The major difference is that here I plot all quantile values from the fastqc metrics output, not just the min, max and interquartile range. For samples with low Phred scores, the plot becomes darker. The brighter the plot, the better the overall score.
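To plot every quantile as a heatmap, the metric labels first have to be split into a position and a quantile. The sketch below shows one way this could be done in base R. It assumes labels of the form "ReadPos <position> <quantile>% Quantile QV", which may not match your Dragen version exactly, and the function names are hypothetical rather than taken from the template.

```r
# Hypothetical helper: split positional quality labels into position,
# quantile and quality value. Rows that do not match the assumed label
# format are dropped.
parse_positional_quality <- function(metric, value) {
  pattern <- "^ReadPos (\\S+) ([0-9]+)% Quantile QV$"
  ok <- grepl(pattern, metric)
  data.frame(
    position = sub(pattern, "\\1", metric[ok]),  # kept as text: may be a range
    quantile = as.integer(sub(pattern, "\\2", metric[ok])),
    qv = as.numeric(value[ok]),
    stringsAsFactors = FALSE
  )
}

# Arrange the values as a position x quantile matrix for a heatmap.
# Positions keep the order in which they first appear in the input.
quality_matrix <- function(pq) {
  pos <- factor(pq$position, levels = unique(pq$position))
  tapply(pq$qv, list(position = pos, quantile = pq$quantile), mean)
}

# Toy values, not real data
metric <- c("ReadPos 1 25% Quantile QV", "ReadPos 1 50% Quantile QV",
            "ReadPos 2 25% Quantile QV", "ReadPos 2 50% Quantile QV")
value <- c(30, 36, 28, 35)
qm <- quality_matrix(parse_positional_quality(metric, value))
```

The resulting matrix can be passed to a heatmap function, with low quality values mapped to dark colours so that poor samples stand out.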