Dragen QC report template for multiple samples from fastqc metrics

By Synnøve Yndestad in R sequencing RNAseq

February 14, 2022

For RNAseq performed on the Illumina NovaSeq6000, a single run may contain 70 different samples. Aggregating and plotting quality control metrics for the whole batch makes it easy to spot samples with low sequencing quality within a run.
While MultiQC is an excellent tool for aggregating and visualizing QC metrics, my RNAseq project is run using the Dragen pipeline. Since no Dragen module had been implemented when I was processing the samples, I wrote my own version in the form of an R Markdown report template. It takes a folder of *.fastq_metrics.csv files generated by Dragen and produces an HTML report with plots made interactive by plotly.

An example report can be viewed here.
The R Markdown template and the example report can be found on my GitHub here.

Instructions for use:
1- Add a folder containing the *fastq_metrics.csv files to the working directory.
The folder name will be assigned as RunID.
2- Change any run-specific details in the Description section to document them for future reference, e.g. what kind of samples, from which study, which prep protocol generated the library and which Dragen version was used in the processing.
3- Knit the report.
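
The folder-reading step in the instructions above could look roughly like the sketch below. This is a base R illustration, not the code from the template: the helper name `read_run_metrics`, the four column names, and the assumption that the files are headerless CSVs are all hypothetical, so check them against your own Dragen output.

```r
# Hypothetical helper: read every metrics file in a run folder into one data frame.
# Assumes headerless CSVs with four columns (section, mate, metric, value).
read_run_metrics <- function(run_dir, pattern = "fastq_metrics\\.csv$") {
  files <- list.files(run_dir, pattern = pattern, full.names = TRUE)
  if (length(files) == 0) stop("No metrics files found in ", run_dir)

  per_sample <- lapply(files, function(f) {
    df <- read.csv(f, header = FALSE, stringsAsFactors = FALSE,
                   col.names = c("section", "mate", "metric", "value"))
    # Sample name = file name without the metrics suffix
    df$sample <- sub("\\.?fastq_metrics\\.csv$", "", basename(f))
    df
  })

  run <- do.call(rbind, per_sample)
  # The folder name doubles as the RunID
  run$run_id <- basename(normalizePath(run_dir))
  run
}
```

Calling `read_run_metrics("MyRunFolder")` would then return one long table with a `sample` and a `run_id` column, which is a convenient shape for faceted or per-sample plots.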

Plots produced by the report:

1- Read Mean quality; Per-Sequence Quality Scores

Total number of reads at each average Phred-scale quality value, with each average rounded to the nearest integer.
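
As a minimal sketch of how this metric might be prepared for plotting: the rows below are made-up toy data, and the label format `"Q36 Reads"` is an assumption about the Dragen output rather than something taken from the template. Converting counts to per-sample fractions is my own choice here, so that samples with different read depth can share an axis.

```r
# Toy rows in the assumed Dragen layout; the counts are invented for illustration.
metrics <- data.frame(
  section = "READ MEAN QUALITY",
  sample  = c("S1", "S1", "S2", "S2"),
  metric  = c("Q20 Reads", "Q36 Reads", "Q20 Reads", "Q36 Reads"),
  value   = c(100, 900, 600, 400)
)

rmq <- metrics[metrics$section == "READ MEAN QUALITY", ]
# Pull the integer Phred score out of labels such as "Q36 Reads"
rmq$quality <- as.integer(sub("^Q([0-9]+) Reads$", "\\1", rmq$metric))
# Convert counts to a per-sample fraction so samples of different depth are comparable
rmq$fraction <- rmq$value / ave(rmq$value, rmq$sample, FUN = sum)
```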

2- Positional Base Mean Quality; Per-Base Quality Scores

Average Phred-scale quality value of bases with a specific nucleotide at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.

3- Positional Base Content

Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.

This is shown as a Per-Position Sequence Content heatmap and a Per-Position N Content plot.
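
To illustrate how base counts could be turned into per-position proportions (and from there into N content), here is a small base R sketch. The label format `"ReadPos 1 A Bases"` and the toy counts are assumptions, and how position ranges are written in real files should be checked before relying on the pattern.

```r
# Toy counts for a single read position; values are invented.
pbc <- data.frame(
  metric = c("ReadPos 1 A Bases", "ReadPos 1 C Bases", "ReadPos 1 G Bases",
             "ReadPos 1 T Bases", "ReadPos 1 N Bases"),
  value  = c(30, 20, 20, 25, 5)
)

# Assumed label layout: "ReadPos <position or range> <base> Bases"
pattern <- "^ReadPos ([^ ]+) ([ACGTN]) Bases$"
pbc$position <- sub(pattern, "\\1", pbc$metric)
pbc$base     <- sub(pattern, "\\2", pbc$metric)

# Fraction of each base at each position
pbc$fraction <- pbc$value / ave(pbc$value, pbc$position, FUN = sum)

# Per-position N content is just the N rows of the same table
n_content <- pbc[pbc$base == "N", c("position", "fraction")]
```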

4- Read length

Total number of reads with each observed length.

5- Read GC Content; Per-Sequence GC Content

Total number of reads with each GC content percentile between 0% and 100%.
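
Besides plotting the full distribution, a single summary value per sample can be handy for spotting outliers. The sketch below computes a read-weighted mean GC content; this summary is my own addition rather than part of the report, and the label format `"50% GC Reads"` and the counts are assumed for illustration.

```r
# Toy GC distribution for one sample; counts are invented.
gc <- data.frame(
  sample = c("S1", "S1", "S1"),
  metric = c("40% GC Reads", "50% GC Reads", "60% GC Reads"),
  value  = c(100, 200, 100)
)

# Extract the GC percentage from labels such as "50% GC Reads"
gc$gc_pct <- as.integer(sub("^([0-9]+)% GC Reads$", "\\1", gc$metric))

# Mean GC per sample, weighted by the number of reads in each GC bin
mean_gc <- sapply(split(gc, gc$sample),
                  function(d) weighted.mean(d$gc_pct, d$value))
```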

6- Read GC Content Quality; Average mean quality for reads by GC%

Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.

7- Sequence Positions; Cumulative Adapter Content

Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads.

8- Positional Quality

Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.

This plot represents the same type of plot as the box-and-whisker plot generated by FastQC. The major difference is that here I plot all values from the FastQC metrics output and not just the min, max and interquartile range. For samples with low Phred scores, the plot will become darker. The brighter the plot, the better the overall score.
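
As a rough sketch of how all quantiles can be kept rather than only a five-number summary, the rows can be reshaped into a quantile-by-position matrix. The label format `"ReadPos 1 50% Quantile QV"` and the values are assumptions for illustration, and the plotting itself is left out.

```r
# Toy quantile rows for two read positions; values are invented.
pq <- data.frame(
  metric = c("ReadPos 1 25% Quantile QV", "ReadPos 1 50% Quantile QV",
             "ReadPos 2 25% Quantile QV", "ReadPos 2 50% Quantile QV"),
  value  = c(30, 36, 28, 35)
)

# Assumed label layout: "ReadPos <position or range> <quantile>% Quantile QV"
pattern <- "^ReadPos ([^ ]+) ([0-9]+)% Quantile QV$"
# Keep positions in file order instead of alphabetical order ("1", "10", "2", ...)
pos <- sub(pattern, "\\1", pq$metric)
pq$position <- factor(pos, levels = unique(pos))
pq$quantile <- as.integer(sub(pattern, "\\2", pq$metric))

# Rows = quantiles, columns = read positions
qmat <- tapply(pq$value, list(pq$quantile, pq$position), FUN = mean)
```

A matrix like `qmat` can be passed to a heatmap or drawn as one line per quantile, which keeps the full distribution visible.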

Tags:
Illumina Dragen NGS tidyverse Rmarkdown Plotly MultiQC fastQC