Calculate Z-Score and plot heatmaps
By Synnøve Yndestad in R RNAseq Statistics Visualizations
June 29, 2022
Z-score
Z-score is a measure for how values deviates from the mean in a given population.
A value with z-score = 0 is the average value in the data. Negative values is below average, and positive is above.
The formlua for calculating z-score is:
z-score = (x-μ)/σ
x = one observation in the population
μ = the mean of the population
σ = the standard deviation of the population
Calculating this in R can be done by:
library(tidyverse)
# Enter some data and calculate z-score
data <- c(1,5,9,12,14,23)
z_scores <- (data-mean(data))/sd(data)
z_scores
## [1] -1.2620630 -0.7398300 -0.2175971 0.1740777 0.4351941 1.6102183
The final number 23 is 1.6 standard deviations away from the mean.
mean(data)
## [1] 10.66667
Visualize how far above or below average the values are:
par(mfrow=c(1,2))
plot(data, type="b", ylab = "data value")
plot(z_scores, type="b", col="black", ylab = "z score")
abline(h = 0, col = "brown", lty = 4)
Calculating z-score is a handy way to standardize, or normalize data.
This is frequently used in gene expression studies to visualize heat maps of differential expressed genes.
The DE analysis is performed on normalized counts before log2 and Z-score transformation.
In R, scale() can be used for z-score transformation of large matrices.
Compared with manual Z-score calculation, we get identical results when using scale().
# Manually calculated z-score
z_scores
## [1] -1.2620630 -0.7398300 -0.2175971 0.1740777 0.4351941 1.6102183
# z-score using scale()
scale(data) %>% as.vector()
## [1] -1.2620630 -0.7398300 -0.2175971 0.1740777 0.4351941 1.6102183
Scaling is part of a normalization approach to be able to compare variance between samples.
scale() is performed column-wise.
In a matrix of gene expression data, the matrix must be transposed before and after z-score transformation if you want to compare gene expression between samples.
This will scale the expression of each gene compared to the mean of the expression in the population.
Doing scale on the columns will compare the expression of the gene compared to the other genes within the sample.
For genes in columns and samples in rows, use:
scale(data)
For genes in rows and samples in columns, use:
t(scale(t(data)))
Z-Score normalization with gene expression data:
Here is a demonstration of z-score normalization using gene expression from the airway dataset. The airway data set is available on Bioconductor.
Get the airway dataset
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment
## dim: 64102 8
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Extract the count matrix
airway.matrix = assay(airway)
# matrix dimension before removing low count genes
dim(airway.matrix)
## [1] 64102 8
By removing low count genes, the mean variance can be estimated more reliable.
Therefore, genes with 0 counts should be removed for some applications.
# Remove genes with 0 counts
# Sett genes to keep, all genes that are not expressed to be removed
keep <- rowSums(airway.matrix) > 1
airway.matrix <- airway.matrix[keep,]
# Check dimensions
dim(airway.matrix)
## [1] 29391 8
Do log2 transform the count matrix to get normal distributed data before scaling and calculate z-score:
scale(log2(airway.matrix+1)) %>% head()
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 1.1260901 1.0208946 1.15821433 1.0901426 1.2602939
## ENSG00000000419 0.9916230 1.0714161 1.03630532 1.0488761 1.0238041
## ENSG00000000457 0.7814606 0.7483022 0.72915287 0.7529099 0.7121244
## ENSG00000000460 0.2582945 0.2647380 0.06166199 0.1873347 0.3058427
## ENSG00000000938 -1.2212054 -1.1974507 -0.87556058 -1.1439277 -1.0090925
## ENSG00000000971 1.6893080 1.7850331 1.85914197 1.9600613 1.8952581
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1.1574578 1.2089940 1.0829765
## ENSG00000000419 1.0631139 0.9851562 1.0406101
## ENSG00000000457 0.7558361 0.7730355 0.7564559
## ENSG00000000460 0.1806572 0.3666383 0.2817005
## ENSG00000000938 -1.2724024 -1.2215628 -1.1888027
## ENSG00000000971 1.9797589 1.9052498 2.0258340
Plotting heatmap
Plotting heatmap with and without log2 normalization, and also plotting when scaling with and without transposing the matrix is a nice visual demonstration of what happens to the data during z-score scaling.
Observe the difference in the heatmap when plotting the data with and without log2 normalization, and scaling with and without transposing the matrix.
# Get metadata for heatmap
metadata = as.data.frame(colData(airway))
# Extract genes with highest variance
topVarGenes <- order(-rowVars(assay(airway)))[0:50]
mat <- assay(airway)[ topVarGenes, ]
# The count matrix has samples in columns, and genes in rows
mat %>% head()
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000166923 126514 64824 156189 59216 247345
## ENSG00000115414 260262 213986 513766 273878 252667
## ENSG00000011465 219586 191767 422752 214062 293318
## ENSG00000198804 210138 213952 265361 169551 397791
## ENSG00000115461 74051 79611 328753 137165 241640
## ENSG00000087086 236805 138524 236361 95045 343728
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000166923 336864 98504 56549
## ENSG00000115414 399841 345651 370450
## ENSG00000011465 313112 378834 372489
## ENSG00000198804 401539 233051 248032
## ENSG00000115461 216484 145858 148389
## ENSG00000087086 229323 167067 114257
Plot the heatmap using pheatmap()
of un-normalized raw counts:
pheatmap::pheatmap(mat,
annotation = select(metadata, cell, dex),
main = "Raw counts")
Plot a heatmap with the data log2 transformed to get data that are normal distributed:
pheatmap::pheatmap(log2(mat+1),
annotation = select(metadata, cell, dex),
main = "Log2 transformed counts, no scaling")
Plot a heatmap with the data log2 transformed, but scaling is performed column wise:
pheatmap::pheatmap(scale(log2(mat+1)),
annotation = select(metadata, cell, dex),
main = "Log2 transformed with column wise scaling; \n Gene expression scaled for all genes within sample")
Plot a heatmap with the data log2 transformed, and properly scaled row-wise:
pheatmap::pheatmap(t(scale(t(log2(mat+1)))),
annotation = select(metadata, cell, dex),
main = "Log2 transformed with row wise scaling; \n Individual gene expression scaled between samples")
The last heatmap show the difference in gene expression between samples, while the second last show the difference of gene expression within each sample. So it is important to do the scaling in the desired direction.
- Posted on:
- June 29, 2022
- Length:
- 5 minute read, 927 words
- Categories:
- R RNAseq Statistics Visualizations
- Tags:
- z-score heatmap scale gene expression
- See Also:
- Read and merge multiple files by folder