Read and merge multiple files by folder

By Synnøve Yndestad in R, RNAseq, sequencing, data wrangling

November 30, 2022

Reading multiple files by folder

Often the data we need is spread out across multiple files, and we need a way to read all the files and merge the content.
The goal is to generate a tidy data frame.
Tidy data have variables in columns and observations in rows. Here I demonstrate how to gather data from multiple files into a tidy dataset from:
1- A folder with one file per measurement.
2- A folder where you have one file per sample with multiple measurements.
3- A folder using regex to select specific files.
4- Show off some superpowers by applying this in a function that merges a folder of VCF files into one long data frame.

Files used in this “How to” can be downloaded from here:
https://github.com/Syndestad/Learning-curve

Load the necessary libraries:

library(tidyverse)
library(fs)
library(here)

The fs package provides a cross-platform, uniform interface to file system operations. It is very useful when working with file paths. The here package builds file paths relative to the project root returned by here::here(). Setting paths relative to here::here() makes your code portable, so everything keeps working even if you move the script file to another location or computer.

Where are we?

here::here()
## [1] "/Users/synnoveyndestad/Syndestad.github.io"

1- A folder with one file per measurement

We have a folder of files, each with samples in rows and one measurement in a column, with the following structure:

Bcells_ave <- read_csv("OneFilePrMeasurement/Bcells.ave.csv")
head(Bcells_ave)
## # A tibble: 6 × 2
##   SampleId  Bcells
##      <dbl>   <dbl>
## 1        1 -0.0909
## 2        2 -1.00  
## 3        3 -1.00  
## 4        4  0.818 
## 5        5  1.73  
## 6        6 -1.00

Instead of reading in files one by one, we can use fs::dir_map().

dir_map(path, function)

dir_map() applies a function to each entry in the path and returns the results in a list.
Set the file path by naming the folder that holds your data and pasting it onto here::here(). Select the appropriate function for the file type, e.g. read_csv() for csv files or readxl::read_excel() for Excel files.
Then, if every file contains the same key column, such as “SampleId”, you can merge the list of files by calling full_join within Reduce.

# Set the name of the folder to read in your working directory.
MyFolder = "/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrMeasurement"

here::here()
## [1] "/Users/synnoveyndestad/Syndestad.github.io"
# Read files
ListOfFiles = fs::dir_map( paste0(here::here(), MyFolder), read_csv)
# Merge
MergedFiles = Reduce(full_join,ListOfFiles)

TADAAAA!
Yes, it really is that simple.
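
To try the pattern without downloading anything, here is a minimal, self-contained sketch. The folder name, file names and values are made up for illustration, and it uses purrr::reduce() with an explicit by = "SampleId" instead of Reduce(), which is one possible variation rather than the method used above.

```r
library(readr)
library(dplyr)
library(purrr)
library(fs)

# Toy input: a temporary folder with one CSV per measurement (values are made up).
demo_dir <- path(tempdir(), "OneFilePerMeasurementDemo")
dir_create(demo_dir)
write_csv(tibble(SampleId = 1:3, Bcells = c(-0.1, -1.0, 0.8)), path(demo_dir, "Bcells.csv"))
write_csv(tibble(SampleId = 1:3, Tcells = c(-1.4, 1.4, 0.5)), path(demo_dir, "Tcells.csv"))

# Read every file in the folder into a list of tibbles.
list_of_files <- dir_map(demo_dir, function(f) read_csv(f, show_col_types = FALSE))

# Join the list pairwise on the shared key. Naming the key avoids relying on
# full_join() guessing it, and fails loudly if a file lacks the column.
merged <- reduce(list_of_files, full_join, by = "SampleId")
merged
```

Naming the key is optional, but it makes the merge independent of whatever other column names the files happen to share.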

The merged data frame:

MergedFiles %>% knitr::kable()
SampleId Bcells Chemokine12 Dendritic MastC Tcells
1 -0.0908674 -0.6123724 -1.4661469 1.6035675 -1.38873
2 -0.9995412 1.2247449 1.1211712 -0.8017837 1.38873
3 -0.9995412 0.0000000 -0.6037076 -1.6035675 0.46291
4 0.8178064 0.6123724 0.2587318 -0.8017837 0.46291
5 1.7264802 1.2247449 0.2587318 0.0000000 -0.46291
6 -0.9995412 1.2247449 -0.6037076 -0.8017837 -1.38873
7 -0.9995412 -1.2247449 1.9836105 0.0000000 0.46291
8 0.8178064 -0.6123724 0.2587318 0.8017837 -0.46291
9 0.8178064 -1.2247449 -0.6037076 0.8017837 -0.46291
10 -0.0908674 -0.6123724 -0.6037076 0.8017837 1.38873

Note:
as.data.frame(ListOfFiles) works surprisingly well too, but does not merge by key.

as.data.frame(ListOfFiles)  %>% knitr::kable()
SampleId Bcells SampleId.1 Chemokine12 SampleId.2 Dendritic SampleId.3 MastC SampleId.4 Tcells
1 -0.0908674 1 -0.6123724 1 -1.4661469 1 1.6035675 1 -1.38873
2 -0.9995412 2 1.2247449 2 1.1211712 2 -0.8017837 2 1.38873
3 -0.9995412 3 0.0000000 3 -0.6037076 3 -1.6035675 3 0.46291
4 0.8178064 4 0.6123724 4 0.2587318 4 -0.8017837 4 0.46291
5 1.7264802 5 1.2247449 5 0.2587318 5 0.0000000 5 -0.46291
6 -0.9995412 6 1.2247449 6 -0.6037076 6 -0.8017837 6 -1.38873
7 -0.9995412 7 -1.2247449 7 1.9836105 7 0.0000000 7 0.46291
8 0.8178064 8 -0.6123724 8 0.2587318 8 0.8017837 8 -0.46291
9 0.8178064 9 -1.2247449 9 -0.6037076 9 0.8017837 9 -0.46291
10 -0.0908674 10 -0.6123724 10 -0.6037076 10 0.8017837 10 1.38873
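
Why merging by key matters can be shown with a small sketch. The two toy tables below are made up and deliberately list the same samples in a different row order: as.data.frame() places rows side by side by position, while full_join() matches them on SampleId.

```r
library(dplyr)

# Two toy tables with the same samples in a different row order (values are made up).
a <- tibble(SampleId = c(1, 2, 3), Bcells = c(0.1, 0.2, 0.3))
b <- tibble(SampleId = c(3, 1, 2), Tcells = c(30, 10, 20))

# Column-binding pairs rows by position, so sample 1 ends up next to sample 3.
bound <- as.data.frame(list(a, b))

# Joining pairs rows by the key, so each sample keeps its own values.
joined <- full_join(a, b, by = "SampleId")

bound
joined
```

So as.data.frame() is only safe when every file lists exactly the same samples in exactly the same order.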

2- One file per sample with multiple measurements

When you have one file per sample and need to keep track of the file name as the sample name, you can use fs::dir_ls().
dir_ls() is equivalent to the ls command. It returns filenames as a named fs_path character vector.

Consider the following file structure:

read_csv("OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv")
## # A tibble: 1 × 2
##   purity ploidy
##    <dbl>  <dbl>
## 1 0.0579   4.30

To read and merge all files in the folder, list all paths in the folder with dir_ls().

MyFolder = "OneFilePrSample"

files = fs::dir_ls(MyFolder)
files
## OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample2_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample5_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample6_FACET_TumorPurityPloidy.csv
## OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv

Read as a large data frame using an appropriate function.
Use read_csv for csv files, read_xls for Excel files, etc.
Name the column that stores the file name with the “.id” argument.

allFiles = files %>% map_df(read_csv, .id = "filename")
allFiles %>% knitr::kable()
filename purity ploidy
OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv 0.0579344 4.300485
OneFilePrSample/Sample2_FACET_TumorPurityPloidy.csv NA 2.000000
OneFilePrSample/Sample5_FACET_TumorPurityPloidy.csv 0.3919855 2.121393
OneFilePrSample/Sample6_FACET_TumorPurityPloidy.csv NA 2.000000
OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv 0.9386437 2.289907

And we have successfully merged the files to a tidy data frame.
The Sample names may need some editing.

Remove file path and file-extension from sample name, and the merged data frame is ready:

# Add Sample ID column
allFiles$SampleID = allFiles$filename
# reorder columns
allFiles = allFiles %>% select(filename, SampleID, everything())
# Remove folder name
allFiles$SampleID <- str_remove(allFiles$SampleID, paste0(MyFolder, "/"))
# Remove all after "_"
allFiles$SampleID <- gsub("_.*","",allFiles$SampleID)

head(allFiles) %>% knitr::kable()
filename SampleID purity ploidy
OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv Sample1 0.0579344 4.300485
OneFilePrSample/Sample2_FACET_TumorPurityPloidy.csv Sample2 NA 2.000000
OneFilePrSample/Sample5_FACET_TumorPurityPloidy.csv Sample5 0.3919855 2.121393
OneFilePrSample/Sample6_FACET_TumorPurityPloidy.csv Sample6 NA 2.000000
OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv Sample7 0.9386437 2.289907

The merged data frame with the added and cleaned sample names is ready!
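
The same clean-up can also be written without referring to the folder name at all. This is a small sketch, assuming the paths look like the ones listed above; fs::path_file() drops the folder part, so the code keeps working if the folder is renamed.

```r
library(fs)
library(stringr)

# Paths shaped like the ones returned by dir_ls() above.
files <- c("OneFilePrSample/Sample1_FACET_TumorPurityPloidy.csv",
           "OneFilePrSample/Sample7_FACET_TumorPurityPloidy.csv")

# path_file() drops the folder part; str_remove() then drops the first "_"
# and everything after it, leaving only the sample name.
sample_id <- str_remove(path_file(files), "_.*$")
sample_id
```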

3- Read multiple files based on regexp, create count matrix from RNAseq

Often, you have a mix of files in a folder. Here, we have RNAseq output by transcript (.quant.sf) and by gene (.quant.genes.sf) all in the same folder. To make a count matrix for downstream analysis we only want to read the gene-level files with the “quant.genes.sf” extension, and not the transcript-level files with the “quant.sf” extension.
We can use regular expressions (regex) in the dir_ls() call to select only the files that we want.

More on regular expressions here:
https://www.rexegg.com/regex-quickstart.html

The goal is to read only the “quant.genes.sf” files in the folder and make a count matrix.

List all paths in the folder whose file name ends with “.genes.sf” by passing a pattern to the regexp argument. In a regular expression, “\\.” (as written in an R string) matches a literal dot and “$” anchors the match to the end of the name. A leading “*” is a wildcard only in shell-style globs, which dir_ls() accepts through its glob argument instead:

MyFolder = "/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/"
# List file paths of files matching a specific expression
files = fs::dir_ls(paste0(here::here(), MyFolder), regexp = "\\.genes\\.sf$")
files
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No14.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No15.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No73.quant.genes.sf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No74.quant.genes.sf
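
dir_ls() can filter either with a shell-style wildcard (the glob argument) or with a regular expression (the regexp argument), and the two are easy to mix up. The sketch below uses a temporary folder with empty placeholder files, whose names are made up to mimic the ones above, to show both forms selecting the same gene-level files.

```r
library(fs)

# Toy folder mixing transcript-level and gene-level files (empty placeholder files).
demo_dir <- path(tempdir(), "quant_demo")
dir_create(demo_dir)
file_create(path(demo_dir, c("No13.quant.sf", "No13.quant.genes.sf",
                             "No14.quant.sf", "No14.quant.genes.sf")))

# glob: shell-style wildcard, where "*" means "any characters".
by_glob <- dir_ls(demo_dir, glob = "*.genes.sf")

# regexp: "\\." is a literal dot and "$" anchors the match at the end of the name.
by_regexp <- dir_ls(demo_dir, regexp = "\\.genes\\.sf$")

path_file(by_regexp)
```

Both calls return only the two .quant.genes.sf files; the plain .quant.sf files are left out.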

Read as a large data frame using an appropriate function, e.g. read.table for tabular data, read_csv for csv files, read_xls for Excel files.
Add the file name as a column with the “.id” argument.

allFiles = files %>% map_df(read.table, .id = "SampleID", header = TRUE)
head(allFiles) %>% knitr::kable()
SampleID Name Length EffectiveLength TPM NumReads
1…1 /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf ENSG00000121879.6 3620 2828.85 12.489 1770.99
2…2 /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf ENSG00000091831.24 6194 5736.83 0.598 172.00
3…3 /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf ENSG00000171862.11 2573 1773.48 50.906 4525.55
4…4 /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf ENSG00000139618.17 7505 6155.61 3.299 1018.00
5…5 /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf ENSG00000142208.18 2681 2506.42 25.740 3234.00
6…6 /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/OneFilePrSample2/No13.quant.genes.sf ENSG00000141510.18 1437 1352.28 16.036 1087.00

Now we have all the files in a Very Long format.
Clean up the sample names by removing file-path and file-extension from sample name.

# Remove file path (fixed() matches the text literally instead of as a regex)
allFiles$SampleID <- str_remove(allFiles$SampleID, fixed(paste0(here::here(), MyFolder)))
# Remove file extension
allFiles$SampleID <- str_remove(allFiles$SampleID, fixed(".quant.genes.sf"))
head(allFiles, n= 15) %>% knitr::kable()
SampleID Name Length EffectiveLength TPM NumReads
1…1 No13 ENSG00000121879.6 3620 2828.85 12.489 1770.99
2…2 No13 ENSG00000091831.24 6194 5736.83 0.598 172.00
3…3 No13 ENSG00000171862.11 2573 1773.48 50.906 4525.55
4…4 No13 ENSG00000139618.17 7505 6155.61 3.299 1018.00
5…5 No13 ENSG00000142208.18 2681 2506.42 25.740 3234.00
6…6 No13 ENSG00000141510.18 1437 1352.28 16.036 1087.00
7…7 No13 ENSG00000012048.23 2875 2646.37 1.621 215.00
1…8 No14 ENSG00000121879.6 4203 3192.80 16.732 1529.00
2…9 No14 ENSG00000091831.24 5318 4864.04 3.513 489.00
3…10 No14 ENSG00000171862.11 3639 2538.89 64.437 4682.38
4…11 No14 ENSG00000139618.17 5306 4250.20 2.252 274.00
5…12 No14 ENSG00000142208.18 2647 2545.31 44.915 3272.00
6…13 No14 ENSG00000141510.18 2396 2324.18 16.521 1099.00
7…14 No14 ENSG00000012048.23 5903 5439.87 1.182 184.00
1…15 No15 ENSG00000121879.6 3147 2626.14 11.999 1260.00

For the count matrix, we don’t need the “Length” or “EffectiveLength” columns.
Select columns to keep by using select().
Here, we want to use NumReads in the count matrix.
Pivot wider to create a count matrix with sample names in columns, and ENSEMBL ID (Name) as rows.

CountDF <- allFiles %>% select(SampleID, Name, NumReads) %>% 
                             pivot_wider(names_from = SampleID, 
                                         values_from = NumReads) %>% 
                             as.data.frame()
# Set rownames
MyRownames = CountDF$Name
rownames(CountDF) = MyRownames
# Remove names column and print matrix 
MyMatrix = CountDF[, -1] %>% as.matrix()
MyMatrix %>% knitr::kable()
No13 No14 No15 No73 No74
ENSG00000121879.6 1770.99 1529.00 1260.00 1167.00 1578.00
ENSG00000091831.24 172.00 489.00 77.00 359.00 2043.00
ENSG00000171862.11 4525.55 4682.38 2738.66 3992.53 4896.19
ENSG00000139618.17 1018.00 274.00 1591.00 586.00 755.00
ENSG00000142208.18 3234.00 3272.00 4380.00 4580.00 3475.00
ENSG00000141510.18 1087.00 1099.00 1046.00 2738.00 1550.00
ENSG00000012048.23 215.00 184.00 450.00 464.00 597.00

TADAAAA!
There is your count matrix ready to analyze!
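
The row-name step can also be done inside the pipe. The following is a sketch on a made-up long table with hypothetical gene names and counts, using tibble::column_to_rownames() as an alternative to assigning rownames() by hand.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long-format table shaped like allFiles above (counts are made up).
long <- tibble(
  SampleID = rep(c("No13", "No14"), each = 2),
  Name     = rep(c("geneA", "geneB"), times = 2),
  NumReads = c(10, 20, 30, 40)
)

count_matrix <- long %>%
  pivot_wider(names_from = SampleID, values_from = NumReads) %>%
  column_to_rownames("Name") %>%  # move gene IDs from a column to row names
  as.matrix()

count_matrix
```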

4- Read and merge VCF files

Now that we have become file-reading and merging wizards, let's try something more complex. Let's make a function that reads the FIX part of a VCF file, and use it to read all the vcf files in a folder.

The Variant Call Format (VCF) stores the location and type of variant deviating from the reference genome.

The header lines start with # and contain various metadata, e.g. which reference genome was used. The body has 8 mandatory columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO), and can contain additional genotype columns as well. In this exercise we are only interested in the mandatory columns, which vcfR calls the FIX part.
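
To make the file layout concrete, here is a sketch that writes a tiny, made-up VCF to a temporary file and reads the fixed columns with base R only. It is meant to illustrate the structure, not to replace vcfR, which handles the full format; the variants and the reference name are hypothetical.

```r
# A tiny, made-up VCF written to a temporary file.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##reference=hypothetical_reference.fa",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=100",
  "chr2\t67890\t.\tC\tT\t99\tPASS\tDP=80"
)
vcf_file <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_file)

# Drop the "##" meta lines; the single "#CHROM" line holds the column names.
all_lines <- readLines(vcf_file)
body <- all_lines[!startsWith(all_lines, "##")]

# Strip the leading "#" from the column-name line, then parse as a tab-separated table.
fix <- read.table(text = sub("^#", "", body), header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)
names(fix)
```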

vcfR is an R package for working with vcf files, see details at:
https://knausb.github.io/vcfR_documentation/index.html

Load packages

library(tidyverse)
library(fs)
library(here)
library(vcfR)

We start with a folder containing a mix of files, vcf and the annotated csv file.

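List all paths in the folder that end with “.vcf” to select only the .vcf files. The pattern is a regular expression, so the dot is escaped and “$” anchors the match to the end of the file name.

MyFolder = "/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files"

files = fs::dir_ls(paste0(here::here(), MyFolder),
                   regexp = "\\.vcf$")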
files
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/BT20_S8.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/HCC1143_S6.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/HCC1937_S3.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MB157_S2.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MB330_S3.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MB436_S7.vcf
## /Users/synnoveyndestad/Syndestad.github.io/content/blog/2022-11-30-read-and-merge-multiple-files-by-folder/seq_files/MCF-7_S6.vcf

Read as a large data frame using an appropriate function. But now, we have to MAKE the function to use.

We can read a vcf file with read.vcfR() and then use getFIX() to get the fixed/body info.
If we pipe them together, we get:

read.vcfR("seq_files/BT20_S8.vcf", verbose = FALSE ) %>% 
          getFIX(getINFO = TRUE) %>% as.data.frame() %>% 
                                     DT::datatable()

Make it into a function:

getVCFfix <- function (vcfFile) {
            read.vcfR(vcfFile, verbose = FALSE ) %>% 
            getFIX(getINFO = TRUE) %>% as.data.frame()
}

Read all vcf files in the selected folder as a large data frame with the generated function getVCFfix().
Add the file-path-name with the “.id” argument so we keep track of which sample the data originates from.

## Read the FIX part of all vcf files in the folder using the generated function **getVCFfix** 
allFiles = files %>% map_df(getVCFfix, .id = "SampleID")

# Clean up Sample ID, remove path names
allFiles$SampleID <- str_remove(allFiles$SampleID, paste0(here::here(), MyFolder, "/"))

# Remove the last "_" and everything after it (e.g. "_S8.vcf"):
allFiles$SampleID <- sub("_[^_]+$", "", allFiles$SampleID)
allFiles %>% DT::datatable()
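
The pattern "_[^_]+$" is worth a closer look: it matches the last underscore and everything after it, which is why a hyphenated name such as MCF-7 survives intact. Here is a sketch on two of the file names above, using basename() from base R to drop the folder part.

```r
# File names shaped like the ones listed above.
files <- c("seq_files/BT20_S8.vcf", "seq_files/MCF-7_S6.vcf")

# "_[^_]+$" = an underscore, followed by one or more non-underscore
# characters, up to the end of the string.
sample_id <- sub("_[^_]+$", "", basename(files))
sample_id
# [1] "BT20"  "MCF-7"
```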

Tadaaa!
Now we have a Very Long data frame with all the FIX info from a folder of vcf files, with a column specifying SampleID.
Sweet!

# count all variants from each sample
table(allFiles$SampleID)
## 
##    BT20 HCC1143 HCC1937   MB157   MB330   MB436   MCF-7 
##     681     732     698     803     780     791     728