Rename, edit and subset columns, rows and values in R

By Synnøve Yndestad in R

December 30, 2021

How does one add or edit single variables in a data frame?
This is a “how to” rename and edit column names, row names and single values in R.
Then how to subset rows by filtering.

First, lets make a data frame of gene names with Entrez ID, HGNC symbols, and ENSMBL ID that we can edit.

# Add vectors
Ens <- c("ENSG00000142208", "ENSG00000012048", "ENSG00000139618", "ENSG00000091831", "ENSG00000121879", "ENSG00000171862", "ENSG00000141510")
hg <- c("AKT1", "BRCA1", "BRCA2", "ESR1", "PIK3CA", "PTEN", "TP53")
entr <- c("207", "672", "675", "2099", "5290", "5728", "7157")
expr <- c(NA, NA, "Low", "Medium", "Medium", "Low", "High")

#  use cbind to bind the vectors as columns, and return as data frame
df <- as.data.frame(cbind(Ens, hg, entr, expr))
df

##               Ens     hg entr   expr
## 1 ENSG00000142208   AKT1  207   <NA>
## 2 ENSG00000012048  BRCA1  672   <NA>
## 3 ENSG00000139618  BRCA2  675    Low
## 4 ENSG00000091831   ESR1 2099 Medium
## 5 ENSG00000121879 PIK3CA 5290 Medium
## 6 ENSG00000171862   PTEN 5728    Low
## 7 ENSG00000141510   TP53 7157   High

Changing Column names

To change single column names, you can either change by column number, or by column name.

Change name of by number:

Changing the name of column 2:
colnames(df)[2] <- "newname2"

When working with larger data sets, you don´t always know the exact location of the column.
To change name by column names:

names(df)[names(df) == "old.name"] <- "new.name"

Step by step, this will do the following;

names(df)
Check all the names in the data frame = df

[names(df) == “old.name”]
Extract the name you want to check

<- “new.name”
Assign the new name.

# Change column name by number
colnames(df)[2] <- "new.name.1"

# Change column name by column name
names(df)[names(df) == "Ens"] <- 'new.name.2'

df

##        new.name.2 new.name.1 entr   expr
## 1 ENSG00000142208       AKT1  207   <NA>
## 2 ENSG00000012048      BRCA1  672   <NA>
## 3 ENSG00000139618      BRCA2  675    Low
## 4 ENSG00000091831       ESR1 2099 Medium
## 5 ENSG00000121879     PIK3CA 5290 Medium
## 6 ENSG00000171862       PTEN 5728    Low
## 7 ENSG00000141510       TP53 7157   High

To change all column names, we can use a vector.

MyColumnNames <- c("ENSMBL","HGNC_symbol", "ENTREZ", "Expression")

colnames(df) <- MyColumnNames

df

##            ENSMBL HGNC_symbol ENTREZ Expression
## 1 ENSG00000142208        AKT1    207       <NA>
## 2 ENSG00000012048       BRCA1    672       <NA>
## 3 ENSG00000139618       BRCA2    675        Low
## 4 ENSG00000091831        ESR1   2099     Medium
## 5 ENSG00000121879      PIK3CA   5290     Medium
## 6 ENSG00000171862        PTEN   5728        Low
## 7 ENSG00000141510        TP53   7157       High

Changing Row Names

Use row.names() and a vector to change all row names.

We can turn the Entrez-ID column into row names by:

row.names(df) <- df$ENTREZ

df

##               ENSMBL HGNC_symbol ENTREZ Expression
## 207  ENSG00000142208        AKT1    207       <NA>
## 672  ENSG00000012048       BRCA1    672       <NA>
## 675  ENSG00000139618       BRCA2    675        Low
## 2099 ENSG00000091831        ESR1   2099     Medium
## 5290 ENSG00000121879      PIK3CA   5290     Medium
## 5728 ENSG00000171862        PTEN   5728        Low
## 7157 ENSG00000141510        TP53   7157       High

Editing single values:

The syntax for editing single values in a data frame is:

data.frame[row_number, column_number] = new_value

data.frame["row_name", "column_name"] = new_value

Then, if we got new data or need to change a value, we can add it to the data frame.

The expression values (High, Medium, Low) for AKT1 is located in row number 1, column number 4. To edit the expression value for AKT1, select [1, 4]. You can also use the row- and column names. In order to change the expression value for BRCA1 from “NA” to Low, we select the row name and column name by [“672”, “Expression”].

#  Edit single value using
## [Row number, Column number]
df[1, 4]= "High"

## ["Row Name", "Column Name"]
df["672", "Expression"] = "Low"

Similarly, we can add a new row using the same syntax:

# Add new row by Row-name
df["1019", ] = c("ENSG00000135446", NA, NA, "Medium")
df

##               ENSMBL HGNC_symbol ENTREZ Expression
## 207  ENSG00000142208        AKT1    207       High
## 672  ENSG00000012048       BRCA1    672        Low
## 675  ENSG00000139618       BRCA2    675        Low
## 2099 ENSG00000091831        ESR1   2099     Medium
## 5290 ENSG00000121879      PIK3CA   5290     Medium
## 5728 ENSG00000171862        PTEN   5728        Low
## 7157 ENSG00000141510        TP53   7157       High
## 1019 ENSG00000135446        <NA>   <NA>     Medium

If the data.frame is big and we don´t know the row-name, we have other options.

Change single values by extracting the column and select the value that defines the row

data.frame$column_to_change[data.frame$Column_value_to_search_for=="Value_key"] <- "Value_to_change"

# Change hgnc_symbol from "NA" to "CDK4"
df$HGNC_symbol[df$ENSMBL=="ENSG00000135446"] <- "CDK4"

# Change entrezid for gene CDK4 from "NA" to "1019"
df$ENTREZ[df$ENSMBL=="ENSG00000135446"] <- "1019"
 

df

##               ENSMBL HGNC_symbol ENTREZ Expression
## 207  ENSG00000142208        AKT1    207       High
## 672  ENSG00000012048       BRCA1    672        Low
## 675  ENSG00000139618       BRCA2    675        Low
## 2099 ENSG00000091831        ESR1   2099     Medium
## 5290 ENSG00000121879      PIK3CA   5290     Medium
## 5728 ENSG00000171862        PTEN   5728        Low
## 7157 ENSG00000141510        TP53   7157       High
## 1019 ENSG00000135446        CDK4   1019     Medium

Adding rows

There are many ways to add new rows.

# Add an empty row to a data frame
df[nrow(df)+1,] <- NA

# Add new empty row by Row-name
df["NewRowName", ] <- NA

# Add new empty row by Row-number
df[11, ] <- NA

df

##                     ENSMBL HGNC_symbol ENTREZ Expression
## 207        ENSG00000142208        AKT1    207       High
## 672        ENSG00000012048       BRCA1    672        Low
## 675        ENSG00000139618       BRCA2    675        Low
## 2099       ENSG00000091831        ESR1   2099     Medium
## 5290       ENSG00000121879      PIK3CA   5290     Medium
## 5728       ENSG00000171862        PTEN   5728        Low
## 7157       ENSG00000141510        TP53   7157       High
## 1019       ENSG00000135446        CDK4   1019     Medium
## 9                     <NA>        <NA>   <NA>       <NA>
## NewRowName            <NA>        <NA>   <NA>       <NA>
## 11                    <NA>        <NA>   <NA>       <NA>

Subsetting data frames

Tidyverse is a collection of R packages that work well together and is very useful for transforming and visualizing data. The dplyr::filter() function is particularly useful when subsetting data frames.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Subsetting Rows

Subset the top 3 rows. Remember the syntax [row, column]

# Top 3 observations
Top3 <- df[1:3,]
Top3

##              ENSMBL HGNC_symbol ENTREZ Expression
## 207 ENSG00000142208        AKT1    207       High
## 672 ENSG00000012048       BRCA1    672        Low
## 675 ENSG00000139618       BRCA2    675        Low

Subset the genes with low expression

# Select rows with a value equal to ==
LowExpression = filter(df, Expression == "Low")
LowExpression

##               ENSMBL HGNC_symbol ENTREZ Expression
## 672  ENSG00000012048       BRCA1    672        Low
## 675  ENSG00000139618       BRCA2    675        Low
## 5728 ENSG00000171862        PTEN   5728        Low

Find any rows that contains a specific string, i.e “BRCA”

# Select rows that contains a specific string
filter(df, str_detect(HGNC_symbol, "BRCA"))

##              ENSMBL HGNC_symbol ENTREZ Expression
## 672 ENSG00000012048       BRCA1    672        Low
## 675 ENSG00000139618       BRCA2    675        Low

Removing/excluding Rows

# Removing one or more rows using a vector of row numbers. use i.e c(3,6,9) etc to select rows 3, 6 and 9.
df <- df[-c(9),]

# Removing a row using row names
df <- df[row.names(df) != "NewRowName", , drop = FALSE]

# Remove a row using dplyr::filter() and != which means not equal to
df <- filter(df, HGNC_symbol != "AKT1")

df

##               ENSMBL HGNC_symbol ENTREZ Expression
## 672  ENSG00000012048       BRCA1    672        Low
## 675  ENSG00000139618       BRCA2    675        Low
## 2099 ENSG00000091831        ESR1   2099     Medium
## 5290 ENSG00000121879      PIK3CA   5290     Medium
## 5728 ENSG00000171862        PTEN   5728        Low
## 7157 ENSG00000141510        TP53   7157       High
## 1019 ENSG00000135446        CDK4   1019     Medium

Subset rows with multiple conditionsn

Using base R:

subset(df, Expression=="Low" & HGNC_symbol == "BRCA1")

##              ENSMBL HGNC_symbol ENTREZ Expression
## 672 ENSG00000012048       BRCA1    672        Low

Using tidyverse grammar:

df %>% filter(Expression == "Low" & 
              HGNC_symbol == "BRCA1")

##              ENSMBL HGNC_symbol ENTREZ Expression
## 672 ENSG00000012048       BRCA1    672        Low

Removing rows with NA

Missing values may cause problems in downstream analysis, and knowing how to deal with them is important. Sometimes, we need to exclude all rows with missing values.

# Add a row with missing values
df["NewRowName", ] <- NA
df

##                     ENSMBL HGNC_symbol ENTREZ Expression
## 672        ENSG00000012048       BRCA1    672        Low
## 675        ENSG00000139618       BRCA2    675        Low
## 2099       ENSG00000091831        ESR1   2099     Medium
## 5290       ENSG00000121879      PIK3CA   5290     Medium
## 5728       ENSG00000171862        PTEN   5728        Low
## 7157       ENSG00000141510        TP53   7157       High
## 1019       ENSG00000135446        CDK4   1019     Medium
## NewRowName            <NA>        <NA>   <NA>       <NA>

Remove all rows with missing values using na.omit()

df <- na.omit(df)
df

##               ENSMBL HGNC_symbol ENTREZ Expression
## 672  ENSG00000012048       BRCA1    672        Low
## 675  ENSG00000139618       BRCA2    675        Low
## 2099 ENSG00000091831        ESR1   2099     Medium
## 5290 ENSG00000121879      PIK3CA   5290     Medium
## 5728 ENSG00000171862        PTEN   5728        Low
## 7157 ENSG00000141510        TP53   7157       High
## 1019 ENSG00000135446        CDK4   1019     Medium

Posted on:: December 30, 2021

Length:: 7 minute read, 1423 words

Categories:: R

Tags:: tidyverse subsetting edit

See Also:: Read and merge multiple files by folder; Plotting bar charts in R, geom_bar vs geom_col; For loop for Multiple Trend in Proportions