Visualising WGS data in R

Overview

In our previous session, we used Ubuntu to process WGS files and gained several outputs from biomarker screenings. In this guide, we will explore how to work with these files and convert them to a publication-ready figures and tables.

Working with R and RStudio

Unlike the previous session where we utilised Ubuntu Bash language to power our bioinformatics processes, this session uses R, a statistical programming language. The vocabularies and sentence structure are slightly different, although both shared some common features.

You can install R in your computer here.

RStudio is an integrated development environment (IDE), which is a program from where we can easily code and interact with R. The default RStudio appearance consists of four panels:

Where:

1. Panel to write our script and codes
1. Panel to view our R variables and environment
1. Panel to view our script output
1. Panel to view our Windows directory

You can install RStudio Desktop here.

Accessing your working directory

Go to my github page to download the “input” folder and the “r_tutorial.Rproj” file here. Make sure the “input” folder and the Rproj file is placed in the same path. Double-click on the “r_tutorial.Rproj” file. This will launch RStudio, where we can start working on our WGS output files.

Creating a new R script

Let’s start off by creating a new R script. We can do this by pressing on the green plus icon on the top left corner of the screen, just below “File”, or alternatively we can use the keyboard shortcut of CTRL+SHIFT+N. After you’ve done this step, you should see a new blank script appearing in your panel 1.

This blank script will be where we shall write our codes.

Working with R

R is essentially an advanced calculator, where you can perform various operations, from basic calculations (e.g., 1+1), to statistical modelling (e.g., regression modelling).

Besides this, you can also use R to view tables and sheets, modify them, produce figures, and many more. You can think of R as a code-based Excel sheets, or Google sheets.

We will be working under the tidyverse environment, which is a collection of functions we can run in R which is more user-friendly and intuitive compared to the conventional R. We will load the tidyverse by typing the following command in your R script, followed by a CTRL+ENTER.

if (!require("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(tidyverse)

Using R as a calculator

Let’s try some example. Type 1+1 in your script, then press CTRL+ENTER. You will see the result in Panel 3.

You can do multiplication by using the asterisk * symbol. For example 3*3.

Division is done with the slash / symbol. For example 9/3

Power is done with the ^ symbol. For example 3^2

Assigning variables in R

Variable is an integral part of R that we will encounter frequently in our tutorial. A variable will store whatever syntax that you have provided, which you can later call out. You can assign a syntax into a variable by using the following format: “<variable> <- <syntax>”. For example, we can assign the number “1” to the variable “a”:

a <- 1

You will see a new variable called a in Panel 2. Now, we can run our mathematical operations by invoking this variable. For example:

a+3

## [1] 4

You can reassign variable anytime, and it will overwrite the existing value. For example:

## [1] 1

a <- 3
a

## [1] 3

You can remove a variable from the environment by using the rm function:

rm(a)

Which will remove the a variable from your Environment (i.e., Panel 2).

Quiz: Assign the value 345 to variable ‘x’ and 123 to variable ‘y’. What is the output of ((x*y)+x)/y?

x <- 345
y <- 123

((x*y)+x)/y

## [1] 347.8049

Opening files in R

We can also use R to view files and sheets from Windows. Still remember the compiled MLST text file that we created using SRST2? I have copied the file into the folder “input”, which you can access from the lower-right window of RStudio.

The MLST output from SRST is named “srst2_out_mlst_compiled__compiledResults.txt”. If you click on it, it will open in a new tab within Panel 1.

To import the files into R, we can use the function read.delim. For example:

read.delim(file = 'input/srst2_out_mlst_compiled__compiledResults.txt')

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

We can then assign this file to a new variable in R by typing the following:

mlst <- read.delim(file = 'input/srst2_out_mlst_compiled__compiledResults.txt')

Now, you can call out the MLST table anytime by just typing mlst. Try it out:

mlst

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

This type of data table in R is also known as a data frame. You can think of a data frame as something you will find when you open an Excel file, which has rows and columns you can modify. The good news is, working with data frame in R gives you much more flexibility in terms of data transformation and visualisation than if you only use Excel. I will elaborate more on this below.

Transforming data in R

R is a powerful tool to transform and reshape our data. There are millions of tools that you can use, but I will highlight some commonly used functions below:

Converting between column and rownames

Sometimes you need to assign rownames to your data frame and vice versa, removing rownames into a column of your data frame. You can use the function rownames_to_column or column_to_rownames to achieve this.

mlst

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

mlst %>% column_to_rownames('Sample')

##        ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## commA 131  53   40   47  13  36   28   29          0           -  78.50900
## commB 131  53   40   47  13  36   28   29          0           - 122.90043
## commC 131  53   40   47  13  36   28   29          0           -  60.69700
##           maxMAF
## clinA 0.06896552
## clinB 0.11111111
## commA 0.09090909
## commB 0.06000000
## commC 0.07692308

mlst %>% column_to_rownames('Sample') %>% rownames_to_column('ID')

##      ID  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1 clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2 clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3 commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4 commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5 commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

Creating a new column in your data frame

You can create a new column in your data frame through the mutate function. You can even assign the values for the new column from the existing values that we have.

mlst

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

mlst %>% 
  mutate(newcolumn = 'values_for_new_column')

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF             newcolumn
## 1 0.06896552 values_for_new_column
## 2 0.11111111 values_for_new_column
## 3 0.09090909 values_for_new_column
## 4 0.06000000 values_for_new_column
## 5 0.07692308 values_for_new_column

mlst %>%
  mutate(newcolumn = 9)

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF newcolumn
## 1 0.06896552         9
## 2 0.11111111         9
## 3 0.09090909         9
## 4 0.06000000         9
## 5 0.07692308         9

mlst %>%
  mutate(recA_added = recA+100)

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF recA_added
## 1 0.06896552        129
## 2 0.11111111        129
## 3 0.09090909        129
## 4 0.06000000        129
## 5 0.07692308        129

mlst %>%
  mutate(sample_setting = str_extract(string=Sample, pattern = 'clin|comm'))

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF sample_setting
## 1 0.06896552           clin
## 2 0.11111111           clin
## 3 0.09090909           comm
## 4 0.06000000           comm
## 5 0.07692308           comm

mlst %>%
  mutate(Sample = ifelse(Sample == 'clinA', 'clinicalA', Sample))

##      Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty
## 1 clinicalA 131  53   40   47  13  36   28   29          0           -
## 2     clinB 131  53   40   47  13  36   28   29          0           -
## 3     commA 131  53   40   47  13  36   28   29          0           -
## 4     commB 131  53   40   47  13  36   28   29          0           -
## 5     commC 131  53   40   47  13  36   28   29          0           -
##       depth     maxMAF
## 1  62.90257 0.06896552
## 2  42.11329 0.11111111
## 3  78.50900 0.09090909
## 4 122.90043 0.06000000
## 5  60.69700 0.07692308

Renaming column names

You can rename your column names using the function rename and its derivatives.

mlst %>%
  rename(ID = Sample)

##      ID  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1 clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2 clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3 commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4 commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5 commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

mlst %>%
  rename_all(.funs = ~paste0('extra_', .))

##   extra_Sample extra_ST extra_adk extra_fumC extra_gyrB extra_icd extra_mdh
## 1        clinA      131        53         40         47        13        36
## 2        clinB      131        53         40         47        13        36
## 3        commA      131        53         40         47        13        36
## 4        commB      131        53         40         47        13        36
## 5        commC      131        53         40         47        13        36
##   extra_purA extra_recA extra_mismatches extra_uncertainty extra_depth
## 1         28         29                0                 -    62.90257
## 2         28         29                0                 -    42.11329
## 3         28         29                0                 -    78.50900
## 4         28         29                0                 -   122.90043
## 5         28         29                0                 -    60.69700
##   extra_maxMAF
## 1   0.06896552
## 2   0.11111111
## 3   0.09090909
## 4   0.06000000
## 5   0.07692308

mlst %>%
  rename_if(is.integer, ~paste0('integer_', .))

##   Sample integer_ST integer_adk integer_fumC integer_gyrB integer_icd
## 1  clinA        131          53           40           47          13
## 2  clinB        131          53           40           47          13
## 3  commA        131          53           40           47          13
## 4  commB        131          53           40           47          13
## 5  commC        131          53           40           47          13
##   integer_mdh integer_purA integer_recA integer_mismatches uncertainty
## 1          36           28           29                  0           -
## 2          36           28           29                  0           -
## 3          36           28           29                  0           -
## 4          36           28           29                  0           -
## 5          36           28           29                  0           -
##       depth     maxMAF
## 1  62.90257 0.06896552
## 2  42.11329 0.11111111
## 3  78.50900 0.09090909
## 4 122.90043 0.06000000
## 5  60.69700 0.07692308

Selecting specific columns

You can choose to only keep certain columns in R using the function select

mlst

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

mlst %>% 
  select(Sample, ST, adk)

##   Sample  ST adk
## 1  clinA 131  53
## 2  clinB 131  53
## 3  commA 131  53
## 4  commB 131  53
## 5  commC 131  53

Pivoting a data frame

You can pivot a data frame using the function pivot_table

mlst

##   Sample  ST adk fumC gyrB icd mdh purA recA mismatches uncertainty     depth
## 1  clinA 131  53   40   47  13  36   28   29          0           -  62.90257
## 2  clinB 131  53   40   47  13  36   28   29          0           -  42.11329
## 3  commA 131  53   40   47  13  36   28   29          0           -  78.50900
## 4  commB 131  53   40   47  13  36   28   29          0           - 122.90043
## 5  commC 131  53   40   47  13  36   28   29          0           -  60.69700
##       maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308

mlst %>% 
  select(Sample, ST, adk) %>%
  pivot_longer(!Sample, 
               names_to = 'name', 
               values_to = 'value')

## # A tibble: 10 × 3
##    Sample name  value
##    <chr>  <chr> <int>
##  1 clinA  ST      131
##  2 clinA  adk      53
##  3 clinB  ST      131
##  4 clinB  adk      53
##  5 commA  ST      131
##  6 commA  adk      53
##  7 commB  ST      131
##  8 commB  adk      53
##  9 commC  ST      131
## 10 commC  adk      53

Pipe, or %>%

If you haven’t already noticed, we used a lot of the %>% symbol in our codes. This symbol is referred to as the pipe, which is basically a bridge between your previous and the next codes.

Quiz: We can save table from R into a csv files, which we can open using Excel, using the function write.csv. Try saving the mlst table into a csv file named ‘mlst_table.csv’.

write.csv(x = mlst,
          file = 'mlst_table.csv',
          quote=F)

Quiz: Now that you know how to load files and assign them into variables, try loading the remaining SRST2 output into R. There are three files:

srst2_out_card_compiled__compiledResults.txt into the R variable card
srst2_out_plasmidfinder_compiled__compiledResults.txt into the R variable plasmidfinder
srst2_out_plasmidreplicon_compiled__compiledResults.txt into the R variable plasmidreplicon

card <- read.delim(file = 'input/srst2_out_card_compiled__compiledResults.txt')
plasmidfinder <- read.delim(file = 'input/srst2_out_plasmidfinder_compiled__compiledResults.txt')
plasmidreplicon <- read.delim(file = 'input/srst2_out_plasmidreplicon_compiled__compiledResults.txt')

Visualising WGS screening output

Now let’s combine all the above information to see an example of how we could transform a raw output into a publication-ready figure.

# card
card2 <- 
  card %>%
  column_to_rownames('Sample') %>%
  mutate(across(.fns = ~ifelse(.=='-', 0, 1))) %>%
  rename_all(~paste0('card_', .)) %>%
  rownames_to_column('ID')

fig.card <- 
  card2 %>%
  pivot_longer(!ID) %>%
  ggplot(aes(x=name, y=ID, fill=as.factor(value))) +
  geom_tile() +
  labs(x='Resistance Genes',
       y='Sample ID',
       title = 'Resistance genes carriage of ST131 isolates') +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  scale_fill_discrete(name='Presence', labels = c('Absent', 'Present')) 

fig.card

# plasmidfinder
plasmidfinder2 <- 
  plasmidfinder %>%
  column_to_rownames('Sample') %>%
  mutate(across(.fns = ~ifelse(.=='-', 0, 1))) %>%
  rename_all(~paste0('finder_', .)) %>%
  rownames_to_column('ID')

fig.finder <- 
  plasmidfinder2 %>%
  pivot_longer(!ID) %>%
  ggplot(aes(x=name, y=ID, fill=as.factor(value))) +
  geom_tile() +
  labs(x='Resistance Genes',
       y='Sample ID',
       title = 'PlasmidFinder carriage of ST131 isolates') +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  scale_fill_discrete(name='Presence', labels = c('Absent', 'Present')) 

fig.finder

# plasmidreplicon
plasmidreplicon2 <- 
  plasmidreplicon %>%
  column_to_rownames('Sample') %>%
  mutate(across(.fns = ~ifelse(.=='-', 0, 1))) %>%
  rename_all(~paste0('replicon_', .)) %>%
  rownames_to_column('ID')

fig.replicon <- 
  plasmidreplicon2 %>%
  pivot_longer(!ID) %>%
  ggplot(aes(x=name, y=ID, fill=as.factor(value))) +
  geom_tile() +
  labs(x='Resistance Genes',
       y='Sample ID',
       title = 'Plasmid replicon carriage of ST131 isolates') +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  scale_fill_discrete(name='Presence', labels = c('Absent', 'Present')) 

fig.replicon

We can combine the figures in several ways:

if (!require("ggpubr", quietly = TRUE))
    install.packages("ggpubr")

library(ggpubr)

fig.compiled <- 
  ggarrange(fig.card, fig.finder, fig.replicon,
            ncol=1,
            common.legend = T)

Quiz: We can save figures using the ggsave function. Save the above figure into a file named ‘fig_compiled.png’.

ggsave(filename = 'fig_compiled.png', 
       plot = fig.compiled,
       height=9, width=9)

# the longer but tidier way
fig.compiled2 <- 
  plyr::join_all(dfs = list(card2, plasmidreplicon2, plasmidfinder2), by='ID') %>%
  pivot_longer(!ID) %>%
  mutate(group = str_extract(string = name,
                             pattern = 'card|replicon|finder')) %>%
  ggplot(aes(y=name, x=ID, fill=as.factor(value))) +
  geom_tile() +
  facet_wrap(~group, 
             scales='free') +
  scale_fill_manual(name='Presence', 
                    labels = c('Absent', 'Present'),
                    values = c('grey', 'dodgerblue')) +
  labs(x='Features', y='ID') +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        strip.text.x = element_text(face='bold'))

Quiz: Save the above figures into a file named ‘fig_compiled2.png’

ggsave(filename='fig_compiled2.png',
       plot = fig.compiled2,
       height=7,
       width=9)

Concluding remarks

Now you have learned some basic operations in R, including importing output data from WGS pipeline and transforming them into figures for reporting and publication purposes. As you may have noticed, getting familiar with computer languages require lots and lots of trials and errors before you get the right result. However, I promise you that the time spent will be worth the effort in the long run.

Keep scripting!