Read and merge multiple files by folder

By Synnøve Yndestad in R RNAseq sequencing data vrangeling

November 30, 2022

Reading multiple files by folder

Often the data we need is spread out across multiple files, and we need a way to read all the files and merge the content.
The goal is to generate a tidy data frame.
Tidy data have variables in columns and observations in rows. Here I demonstrate how to gather data from multiple files into a tidy dataset from;
1- A folder with one file pr measurement.
2- A folder where you have one file pr sample with multiple measurements.
3- A folder using regex to select specific files.
4- Show off some superpowers by applying this in creating a function to merge a folder of VCF files into one long data frame.

Files used in this “How to” can be downloded from here:
https://github.com/Syndestad/Learning-curve

Load the necessary libraries:

library(tidyverse)
library(fs)
library(here)

The fs package provides a cross-platform, uniform interface to file system operations. It is very useful when working with file-paths. The here package is very useful for setting your file path relative to here::here(). Setting a relative path to here::here() will make your code transportable, and everything will work even of you move the script file to another location or computer.

Where are we?

here::here()

## [1] "/Users/synnoveyndestad/Syndestad.github.io"

1- A folder with one file pr measurement

We have a folder of files that has samples in rows, and observations in columns with the following structure:

Bcells_ave <- read_csv("OneFilePrMeasurement/Bcells.ave.csv")
head(Bcells_ave)

## # A tibble: 6 × 2
##   SampleId  Bcells
##      <dbl>   <dbl>
## 1        1 -0.0909
## 2        2 -1.00  
## 3        3 -1.00  
## 4        4  0.818 
## 5        5  1.73  
## 6        6 -1.00

In stead of reading in files one by one, we can use fs::dir_map().

dir_map(path, function)

dir_map(), applies a function to each entry in the path and returns the result in a list.
Set the file path by naming the folder in your working directory where your data is, and pasting it to here::here() Select appropriate function for file type, i.e use read_csv for csv files, readxl::read_excel() for excel files.
Then, if your data contains a key that is identical in each file such as “SampleId”, you can merge the list of files by calling full_join within Reduce.

# Set the name of the folder to read in your working directory.
MyFolder = "/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrMeasurement"

here::here()

## [1] "/Users/synnoveyndestad/Syndestad.github.io"

# Read files
ListOfFiles = fs::dir_map( paste0(here::here(), MyFolder), read_csv)
# Merge
MergedFiles = Reduce(full_join,ListOfFiles)

TADAAAA!
Yes, it really is that simple.

The list of files:
The merged data frame:

MergedFiles %>% knitr::kable()

SampleId	Bcells	Chemokine12	Dendritic	MastC	Tcells
1	-0.0908674	-0.6123724	-1.4661469	1.6035675	-1.38873
2	-0.9995412	1.2247449	1.1211712	-0.8017837	1.38873
3	-0.9995412	0.0000000	-0.6037076	-1.6035675	0.46291
4	0.8178064	0.6123724	0.2587318	-0.8017837	0.46291
5	1.7264802	1.2247449	0.2587318	0.0000000	-0.46291
6	-0.9995412	1.2247449	-0.6037076	-0.8017837	-1.38873
7	-0.9995412	-1.2247449	1.9836105	0.0000000	0.46291
8	0.8178064	-0.6123724	0.2587318	0.8017837	-0.46291
9	0.8178064	-1.2247449	-0.6037076	0.8017837	-0.46291
10	-0.0908674	-0.6123724	-0.6037076	0.8017837	1.38873

Note:
as.data.frame(ListOfFiles) works surprisingly well too, but does not merge by key.

as.data.frame(ListOfFiles)  %>% knitr::kable()

SampleId	Bcells	SampleId.1	Chemokine12	SampleId.2	Dendritic	SampleId.3	MastC	SampleId.4	Tcells
1	-0.0908674	1	-0.6123724	1	-1.4661469	1	1.6035675	1	-1.38873
2	-0.9995412	2	1.2247449	2	1.1211712	2	-0.8017837	2	1.38873
3	-0.9995412	3	0.0000000	3	-0.6037076	3	-1.6035675	3	0.46291
4	0.8178064	4	0.6123724	4	0.2587318	4	-0.8017837	4	0.46291
5	1.7264802	5	1.2247449	5	0.2587318	5	0.0000000	5	-0.46291
6	-0.9995412	6	1.2247449	6	-0.6037076	6	-0.8017837	6	-1.38873
7	-0.9995412	7	-1.2247449	7	1.9836105	7	0.0000000	7	0.46291
8	0.8178064	8	-0.6123724	8	0.2587318	8	0.8017837	8	-0.46291
9	0.8178064	9	-1.2247449	9	-0.6037076	9	0.8017837	9	-0.46291
10	-0.0908674	10	-0.6123724	10	-0.6037076	10	0.8017837	10	1.38873

2- One file pr sample with multiple measurements

When you have one file pr sample and need to keep track of the file name as sample name, we can use fs::dir_ls().
dir_ls() is equivalent to the ls command. It returns filenames as a named fs_path character vector.

Consider the following file structure:

read_csv("OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv")

## # A tibble: 1 × 2
##   purity ploidy
##    <dbl>  <dbl>
## 1 0.0579   4.30

To read and merge all files in the folder, list all paths in the folder with dir_ls().

MyFolder = "OneFilePrSample"

files = fs::dir_ls(MyFolder)
files

## OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample2_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample5_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample6_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv

Read as a large data frame using an appropriate function.
Use read_csv for csv files, read_xls for excel files etc.
Set name of file with th “.id” argument.

allFiles = files %>% map_df(read_csv, .id = "filename")
allFiles %>% knitr::kable()

filename	purity	ploidy
OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv	0.0579344	4.300485
OneFilePrSample/Sample2_FACET_TumorPurityPloidy.csv	NA	2.000000
OneFilePrSample/Sample5_FACET_TumorPurityPloidy.csv	0.3919855	2.121393
OneFilePrSample/Sample6_FACET_TumorPurityPloidy.csv	NA	2.000000
OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv	0.9386437	2.289907

And we have successfully merged the files to a tidy data frame.
The Sample names may need some editing.

Remove file path and file-extension from sample name, and the merged data frame is ready:

# Add Sample ID column
allFiles$SampleID = allFiles$filename
# reorder columns
allFiles = allFiles %>% select(filename, SampleID, everything())
# Remove folder name
allFiles$SampleID <- str_remove(allFiles$SampleID, paste0(MyFolder, "/"))
# Remove all after "_"
allFiles$SampleID <- gsub("_.*","",allFiles$SampleID)

head(allFiles) %>% knitr::kable()

filename	SampleID	purity	ploidy
OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv	Sample1	0.0579344	4.300485
OneFilePrSample/Sample2_FACET_TumorPurityPloidy.csv	Sample2	NA	2.000000
OneFilePrSample/Sample5_FACET_TumorPurityPloidy.csv	Sample5	0.3919855	2.121393
OneFilePrSample/Sample6_FACET_TumorPurityPloidy.csv	Sample6	NA	2.000000
OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv	Sample7	0.9386437	2.289907

The merged data frame with the added and cleaned sample names are ready!

3- Read multiple files based on regexp, create count matrix from RNAseq

Often, you have a mix of files in a folder. Here, we have RNAseq output by transcript (.quant.sf) and gene (.quant.genes.sf) all in the same folder. To make a count matrix for downstream analysis we only want to read the files including genes, the “quant.genes.sf” file extension and not the transcript with the “quant.sf” extension.
Then we can use regular expressions (regex) in the read call to select only the files that we want.

More on regular expressions here:
https://www.rexegg.com/regex-quickstart.html

The goal is to read only the “quant.genes.sf” files in the folder and make a count matrix.

List all paths in the folder that has a file name that ends with “.genes.sf” by using “*” in the regexp argument:

/Users/syndestad/Documents/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/

MyFolder = "/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/"
# List filepaths of files containig a spesific expression
files = fs::dir_ls(paste0(here::here(), MyFolder), regexp = "*.genes.sf")
files

## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No14.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No15.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No73.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No74.quant.genes.sf

Read as a large data frame using an appropriate function i.e read.table for tabular data, read_csv for csv files, read_xls for excel files.
Add name of file with the “.id” function.

allFiles = files %>% map_df(read.table, .id = "SampleID", header = TRUE)
head(allFiles) %>% knitr::kable()

	SampleID	Name	Length	EffectiveLength	TPM	NumReads
1…1	/Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf	ENSG00000121879.6	3620	2828.85	12.489	1770.99
2…2	/Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf	ENSG00000091831.24	6194	5736.83	0.598	172.00
3…3	/Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf	ENSG00000171862.11	2573	1773.48	50.906	4525.55
4…4	/Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf	ENSG00000139618.17	7505	6155.61	3.299	1018.00
5…5	/Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf	ENSG00000142208.18	2681	2506.42	25.740	3234.00
6…6	/Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf	ENSG00000141510.18	1437	1352.28	16.036	1087.00

Now we have all the files in a Very Long format.
Clean up the sample names by removing file-path and file-extension from sample name.

# Remove filepath
allFiles$SampleID <- str_remove(allFiles$SampleID, paste0(here::here(), MyFolder))
# Remove file-extension
allFiles$SampleID <- str_remove(allFiles$SampleID, ".quant.genes.sf")
head(allFiles, n= 15) %>% knitr::kable()

	SampleID	Name	Length	EffectiveLength	TPM	NumReads
1…1	No13	ENSG00000121879.6	3620	2828.85	12.489	1770.99
2…2	No13	ENSG00000091831.24	6194	5736.83	0.598	172.00
3…3	No13	ENSG00000171862.11	2573	1773.48	50.906	4525.55
4…4	No13	ENSG00000139618.17	7505	6155.61	3.299	1018.00
5…5	No13	ENSG00000142208.18	2681	2506.42	25.740	3234.00
6…6	No13	ENSG00000141510.18	1437	1352.28	16.036	1087.00
7…7	No13	ENSG00000012048.23	2875	2646.37	1.621	215.00
1…8	No14	ENSG00000121879.6	4203	3192.80	16.732	1529.00
2…9	No14	ENSG00000091831.24	5318	4864.04	3.513	489.00
3…10	No14	ENSG00000171862.11	3639	2538.89	64.437	4682.38
4…11	No14	ENSG00000139618.17	5306	4250.20	2.252	274.00
5…12	No14	ENSG00000142208.18	2647	2545.31	44.915	3272.00
6…13	No14	ENSG00000141510.18	2396	2324.18	16.521	1099.00
7…14	No14	ENSG00000012048.23	5903	5439.87	1.182	184.00
1…15	No15	ENSG00000121879.6	3147	2626.14	11.999	1260.00

For the count matrix, we don’t need the “Length” or “EffectiveLength” columns.
Select columns to keep by using select().
Here, we want to use NumReads in the count matrix.
Pivot wider to create a count matrix with sample names in columns, and ENSEMBL ID (Name) as rows.

CountDF <- allFiles %>% select(SampleID, Name, NumReads) %>% 
                             pivot_wider(names_from = SampleID, 
                                         values_from = NumReads) %>% 
                             as.data.frame()
# Set rownames
MyRownames = CountDF$Name
rownames(CountDF) = MyRownames
# Remove names column and print matrix 
MyMatrix = CountDF[, -1] %>% as.matrix()
MyMatrix %>% knitr::kable()

	No13	No14	No15	No73	No74
ENSG00000121879.6	1770.99	1529.00	1260.00	1167.00	1578.00
ENSG00000091831.24	172.00	489.00	77.00	359.00	2043.00
ENSG00000171862.11	4525.55	4682.38	2738.66	3992.53	4896.19
ENSG00000139618.17	1018.00	274.00	1591.00	586.00	755.00
ENSG00000142208.18	3234.00	3272.00	4380.00	4580.00	3475.00
ENSG00000141510.18	1087.00	1099.00	1046.00	2738.00	1550.00
ENSG00000012048.23	215.00	184.00	450.00	464.00	597.00

TADAAAA!
There is your count matrix ready to analyze!

4- Read and merge VCF files

Now that we have become a file-reading and merging wizard, lets try something more complex. Lets make a function that will read the FIX part of a VCF file, and use that when reading all the vcf files in a folder.

The Variant Call Format (VCF) stores the location and type of variant deviating from the reference genome.

The header starts with # and contains various metadata i.e what reference genome was used. The body has 8 mandatory columns, but can contain multiple others as well. We are only interested in the body, aka the FIX part in this exersice.

vcfR is an R package for working with vcf files, see details at:
https://knausb.github.io/vcfR_documentation/index.html

Load packages

library(tidyverse)
library(fs)
library(here)
library(vcfR)

We start with a folder containing a mix of files, vcf and the annotated csv file.

List all paths in the folder containing “*.vcf” to select only the .vcf files.

MyFolder = "/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files"

files = fs::dir_ls(paste0(here::here(), MyFolder),
                   regexp = "*.vcf")
files

## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/BT20_S8.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/HCC1143_S6.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/HCC1937_S3.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MB157_S2.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MB330_S3.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MB436_S7.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MCF-7_S6.vcf

Read as a large data frame using an appropriate function. But now, we have to MAKE the function to use.

We can read a vcf file with read.vcfR() and then use getFIX() to get the fixed/body info.
If we pipe them together, we get:

read.vcfR("seq_files/BT20_S8.vcf", verbose = FALSE ) %>% 
          getFIX(getINFO = TRUE) %>% as.data.frame() %>% 
                                     DT::datatable()

Make it into a function:

getVCFfix <- function (vcfFile) {
            read.vcfR(vcfFile, verbose = FALSE ) %>% 
            getFIX(getINFO = TRUE) %>% as.data.frame()
}

Read all vcf files in the selected folder as a large data frame with the generated function getVCFfix().
Add the file-path-name with the “.id” argument so we keep track of which sample the data originates from.

## Read the FIX part of all vcf files in the folder using the generated function **getVCFfix** 
allFiles = files %>% map_df(getVCFfix, .id = "SampleID")

# Clean up Sample ID, remove path names
allFiles$SampleID <- str_remove(allFiles$SampleID, paste0(here::here(), MyFolder, "/"))

# Remove everything after "_":
allFiles$SampleID <- sub("_[^_]+$", "", allFiles$SampleID)
allFiles %>% DT::datatable()

Tadaaa!
Now we have a Very Long data frame with all the FIX info from a folder of vcf files, with a column specifying SampleID.
Sweet!

# count all variants from each sample
table(allFiles$SampleID)

## 
##    BT20 HCC1143 HCC1937   MB157   MB330   MB436   MCF-7 
##     681     732     698     803     780     791     728

Posted on:: November 30, 2022

Length:: 9 minute read, 1865 words

Categories:: R RNAseq sequencing data vrangeling

Tags:: fs::dir_map() fs::dir_ls VCF function gene expression regex tidyverse vcfR

See Also:: The Pediatric Soft Tissue Sarcoma Paper, a COVID lock-down side quest and what is a genomic variant; Plotting bar charts in R, geom_bar vs geom_col; For loop for Multiple Trend in Proportions