R

Calculating the positive predictive value of a diagnostic test

Will a diagnostic test have the same predictive value regardless of how it is used? My intuitive answer is “It should”, but hey, this is why we do statistics before implementing large screening programs. The predictive value of a test is different when the test is used in a high-risk population compared to when it is used in a low-risk population. This means that the positive predictive value of a test differs if the test is used as a screening test versus when it is used as a confirmatory test, or used in two populations with different prevalence. Sensitivity, specificity and accuracy are important principles to consider when evaluating how and when diagnostic tests should be used, such as the mammography screening program, or when evaluating the differences in policies regarding COVID testing during the early and late stages of the pandemic.
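
To make the effect of prevalence concrete, here is a minimal sketch in base R. The function name `ppv()` and the sensitivity, specificity and prevalence values are made up for illustration; they do not describe any real test.

```r
# Positive predictive value (PPV) from sensitivity, specificity and prevalence:
# PPV = true positives / (true positives + false positives)
ppv <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

# Hypothetical test with 90 % sensitivity and 95 % specificity
ppv_low  <- ppv(0.90, 0.95, prevalence = 0.01)  # screening, low-risk population
ppv_high <- ppv(0.90, 0.95, prevalence = 0.30)  # confirmatory, high-risk population

round(c(low_risk = ppv_low, high_risk = ppv_high), 3)
# low_risk ~ 0.154, high_risk ~ 0.885
```

The same test gives a positive predictive value of roughly 15 % at 1 % prevalence and roughly 89 % at 30 % prevalence.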

Read and merge multiple files by folder

Often the data we need is spread across multiple files, and we need a way to read all the files and merge their content. The goal is to generate a tidy data frame. Tidy data have variables in columns and observations in rows. Here I demonstrate how to gather data from multiple files into a tidy dataset from: 1) a folder with one file per measurement; 2) a folder with one file per sample, each holding multiple measurements; 3) a folder where regex is used to select specific files. 4) Finally, I show off some superpowers by applying this in a function that merges a folder of VCF files into one long data frame.
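
The general pattern can be sketched in base R as below. This is only an illustration under my own assumptions: the folder, the file names and the helper `read_one()` are hypothetical, and the example writes its own small files to a temporary directory so that it runs on its own.

```r
# Create a temporary folder with three small example files (hypothetical data)
data_dir <- file.path(tempdir(), "measurements_example")
dir.create(data_dir, showWarnings = FALSE)
for (s in c("sample_A", "sample_B", "sample_C")) {
  write.csv(data.frame(gene = c("g1", "g2"), value = c(1.5, 2.5)),
            file.path(data_dir, paste0(s, ".csv")), row.names = FALSE)
}
# A file that the regex below should ignore
writeLines("notes", file.path(data_dir, "readme.txt"))

# Select only the sample CSV files with a regex
files <- list.files(data_dir, pattern = "^sample_.*\\.csv$", full.names = TRUE)

# Read one file and record which sample it came from
read_one <- function(path) {
  df <- read.csv(path)
  df$sample <- sub("\\.csv$", "", basename(path))
  df
}

# Read every file and stack the results into one long data frame
merged <- do.call(rbind, lapply(files, read_one))
head(merged)
```

Each row of `merged` is one observation, and the `sample` column keeps track of the file it was read from.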

Plotting bar charts in R, geom_bar vs geom_col

Plotting the Nightingale data made me realize that there is more to plotting a bar chart than first meets the eye. While a histogram visualizes the distribution of a numerical variable, a bar plot visualizes the relationship between a categorical variable and a numerical variable. ggplot has two functions for plotting bar charts, geom_bar and geom_col. In short, geom_bar() counts the categorical values for you, while geom_col() takes the summarized numerical value as input.
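
A small side-by-side sketch of the two geoms, assuming the ggplot2 package is installed. The data frames `raw` and `summarised` are hypothetical and only serve to show the difference in input.

```r
library(ggplot2)

# Hypothetical raw data: one row per observation
raw <- data.frame(group = c("A", "A", "A", "B", "B", "C"))

# geom_bar() counts the rows in each category itself
p_bar <- ggplot(raw, aes(x = group)) + geom_bar()

# Pre-summarised data: one row per category, count already computed
summarised <- as.data.frame(table(group = raw$group), responseName = "n")

# geom_col() uses the supplied values directly as bar heights
p_col <- ggplot(summarised, aes(x = group, y = n)) + geom_col()
```

Printing `p_bar` and `p_col` should draw the same bars; only the shape of the input data differs.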

Plot coxcomb diagrams like Florence Nightingale

Florence Nightingale (1820-1910) is best known as a pioneer in modern nursing, but she was also a pioneer in statistics and the use of statistical graphics in data analysis. In her work during the Crimean War, she tended to the wounded soldiers in the hospitals and helped to improve the conditions in which they were treated. She collected data on patients and their outcomes, and used a coxcomb diagram to visually display the causes of death in soldiers. Nightingale's coxcomb plot “Diagram of the Causes of Mortality in the Army in the East” illustrated that the main cause of death among the British troops in the Crimean War was preventable disease rather than injuries from fighting. The plot also shows that the death rate decreased when a Sanitary Commissioner arrived to aid in improving hygiene and sanitation. The coxcomb plot was later used by Nightingale to lobby for improved sanitation and hygiene in hospitals. This eventually led to a reduction in the death rate from disease in hospitals. She was a firm believer that statistical data presented as charts and diagrams is a powerful tool to make complex data more understandable. It helps people see relationships in the data and enables us to make informed decisions. I wanted to recreate Nightingale's historical plot using R, and at the same time give a tutorial on “How to” make a coxcomb/polar-area plot/rose diagram.

For loop for Multiple Trend in Proportions

When you have only one parameter to test, following my previous tutorial on the test for trend in proportions will be sufficient. However, if you have many independent variables to test across several dependent variables, it quickly becomes tedious to do them all one by one. Therefore, I wrote a for-loop that creates all the summary tables, performs the test for trend in proportions on each table, adds the test result to the count matrix and saves the output neatly in CSV/Excel format. Here I explain each step in the process.

Calculate Z-Score and plot heatmaps

The z-score measures how far a value deviates from the mean of a given population, expressed in standard deviations. Calculating z-scores is a handy way to standardize, or normalize, data. This kind of normalization is frequently used in gene expression studies to visualize heat maps of differentially expressed genes.
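
As a rough sketch of the calculation, the example below computes row-wise z-scores for a tiny made-up expression matrix, both by hand and with `scale()`, and draws a basic heat map with base R's `heatmap()`. The matrix values and the helper `z_score()` are hypothetical.

```r
# Hypothetical expression matrix: 3 genes (rows) x 4 samples (columns)
expr <- matrix(c(2, 4, 6, 8,
                 10, 10, 20, 20,
                 1, 2, 3, 10),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4)))

# z = (x - mean) / standard deviation, for one vector
z_score <- function(x) (x - mean(x)) / sd(x)

# Row-wise z-scores (per gene); apply() returns the result transposed
z_manual <- t(apply(expr, 1, z_score))

# scale() standardizes columns, so transpose before and after
z_scaled <- t(scale(t(expr)))

# Plot without clustering and without heatmap()'s own rescaling
heatmap(z_scaled, Rowv = NA, Colv = NA, scale = "none")
```

Both approaches give the same values; after standardizing, every gene has mean 0 and standard deviation 1 across samples.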

Plotting categorical values as a tiled chart

Plotting your variables as a tiled map can be a very efficient way to visualize the interactions between them. Here is a “How to” on plotting categorical values as a tiled chart with fixed squares.

Extract tables from pdf files with tabulizer

Far too often I find myself in a situation where I need to fetch lists of genes, expression data or similar from journal articles, only to realize that the data is buried somewhere deep within the supplementary material in the form of a giant pdf. (The horror!) Here is a how-to on scraping data from a linked pdf file (by url) using the tabulizer R package.

How to ‘Pivot Wider’ when you have only character values

Reshaping your data by pivoting from long to wide, or wide to long, is something you do frequently when wrangling data. The pivot_longer() and pivot_wider() functions from the tidyr package do this job excellently in most cases. I have, however, had some issues when reshaping data containing only characters. This is my solution to this issue.

NGS sequencing file formats

Sequencing data comes in a wide variety of formats, each containing very specific information. This is a collection of notes on different formats, and how to interact with some of them using command line tools like samtools, or R. Next Generation Sequencing (NGS) technology in brief: NGS and Sanger sequencing are similar in principle. DNA polymerase adds fluorescent nucleotides to a growing DNA template strand. Each base (C, T, G, A) is identified by the fluorescent signal it emits as it is added to the nucleic acid chain.

An intro to biomaRt

Or “How to annotate Hugo Symbol and Entrez ID to Ensembl ID, fetch the location of the genes and associated gene ontologies and dbSNPs using biomaRt.” biomaRt is a Bioconductor package that provides an R interface to BioMart databases, such as the one hosted by Ensembl. It makes it possible to access and query a large amount of data and resources from Ensembl.

How to make a waterfall plot with ggpubr

Results from clinical trials in oncology are often presented as a waterfall plot. The plot visualizes how tumor growth is affected by treatment for each subject after a given time. It can communicate the overall results of an entire study very effectively using only one figure. A waterfall plot is in essence a bar chart ordered according to size. Each bar represents one subject and describes how much, in %, a tumor has changed from baseline (start of treatment) to the defined end of treatment.
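
The core idea, a bar chart ordered by size, can be sketched with base R's `barplot()` instead of ggpubr. The subjects and percent changes below are invented for illustration.

```r
# Hypothetical data: percent change in tumor size from baseline per subject
response <- data.frame(
  subject    = paste0("P", 1:8),
  pct_change = c(35, -60, 10, -25, -80, 5, -45, 20)
)

# Order subjects from largest increase to largest decrease
response <- response[order(response$pct_change, decreasing = TRUE), ]

# One bar per subject; colour separates growth from shrinkage
barplot(response$pct_change,
        names.arg = response$subject,
        ylab = "Change from baseline (%)",
        col = ifelse(response$pct_change > 0, "firebrick", "steelblue"))
```

The ordering step is what turns an ordinary bar chart into the characteristic "waterfall" shape.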

How to make R and Python scripts, and make them executable

While practicing my R and Python skills, I have written a script in both R and Python that performs the same task: take a DNA sequence from a fasta file, count its length and the number of G’s, C’s and N’s, taking both upper and lowercase letters into account. Then it prints a message with the calculation results. I also included a step that counts how many seconds the calculations take, so we can compare which script runs fastest.
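
A minimal sketch of what the R version of such a script could look like. This is not the script from the post; the sequence and the variable names are made up, and the example writes its own small fasta file to a temporary location.

```r
# Hypothetical fasta content written to a temporary file
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">example_sequence", "ACGTNNacgt", "GGCCnn"), fasta_path)

start_time <- Sys.time()

fasta_lines <- readLines(fasta_path)
# Drop header lines, join the rest and convert to upper case
sequence <- toupper(paste(fasta_lines[!grepl("^>", fasta_lines)], collapse = ""))
bases <- strsplit(sequence, "")[[1]]

seq_length <- length(bases)
n_g <- sum(bases == "G")
n_c <- sum(bases == "C")
n_n <- sum(bases == "N")

elapsed <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
cat(sprintf("Length: %d, G: %d, C: %d, N: %d (%.4f seconds)\n",
            seq_length, n_g, n_c, n_n, elapsed))
```

Converting to upper case first means that `g` and `G` are counted together.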

Test for trend in proportions

The test for trend in proportions is also known as the Cochran-Armitage test. It performs a chi-squared test for trend in proportions and is used to test whether the proportion changes systematically across ordered groups, taking the size of each group into account. It takes count data from contingency tables where one variable is nominal with two levels (e.g. “Mutated”, “Wild-type”) and the other is ordinal with at least three naturally ranked levels.
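
In base R this test is available as `prop.trend.test()` in the stats package. A quick sketch with invented counts (the numbers of mutated cases and group sizes are hypothetical):

```r
# Hypothetical counts: mutated cases out of the total in three ordered groups
mutated <- c(5, 12, 20)
total   <- c(40, 40, 40)

# Chi-squared test for trend in proportions; by default the groups are scored 1, 2, 3
trend <- prop.trend.test(x = mutated, n = total)
trend
```

Here the proportion of mutated cases rises from 12.5 % to 50 % across the groups, so the test reports a small p-value.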

Chi-square in R

The Chi-square test is used to test for an association between two or more categorical variables. All variables must be ordinal or nominal and summarized as a frequency table. It is a non-parametric test, meaning that it is also suitable for data that is not normally distributed. Some of the assumptions for performing a Chi-square test are: each observation is independent of all the others (one observation per subject), and the categories must be mutually exclusive so that a subject fits into only one of the categories.
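
A short sketch using `chisq.test()` from base R on a made-up 2x2 frequency table. The row and column labels and the counts are hypothetical.

```r
# Hypothetical frequency table: treatment group vs. response
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(treatment = c("drug", "placebo"),
                              response  = c("yes", "no")))

res <- chisq.test(tab)

# Expected counts under independence; a common rule of thumb wants these to be at least 5
res$expected
res$p.value
```

Looking at `res$expected` before trusting the p-value is a cheap way to check whether the sample is large enough for the test.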

Fisher-Exact in R

Disregarding the problematic side of Fisher, the statistical methods he developed are still very useful. Read any clinical paper, and I guarantee you that a Fisher exact test has been performed. Fisher-Exact is a statistical test used for 2x2 contingency tables of categorical data. It is particularly useful for small sample sizes where other tests, like the Chi-square test, would be unsuitable. Fisher-Exact from a 2x2 table: First you need to enter your data, and I will use some real life examples.
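
As a quick illustration of entering a 2x2 table and running the test with base R's `fisher.test()`, here is a sketch with invented counts. These are not the real-life examples from the post; the labels and numbers are hypothetical.

```r
# Hypothetical 2x2 table: mutation status vs. treatment response
tab <- matrix(c(8, 2,
                1, 9),
              nrow = 2, byrow = TRUE,
              dimnames = list(status   = c("Mutated", "Wild-type"),
                              response = c("Responder", "Non-responder")))

res <- fisher.test(tab)
res$p.value   # two-sided p-value
res$estimate  # conditional maximum likelihood estimate of the odds ratio
```

With only 20 subjects in total, this is the kind of table where an exact test is the safer choice.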