In our previous session, we used Ubuntu to process WGS files and gained several outputs from biomarker screenings. In this guide, we will explore how to work with these files and convert them to a publication-ready figures and tables.
Unlike the previous session where we utilised Ubuntu Bash language to power our bioinformatics processes, this session uses R, a statistical programming language. The vocabularies and sentence structure are slightly different, although both shared some common features.
You can install R in your computer here.
RStudio is an integrated development environment (IDE), which is a program from where we can easily code and interact with R. The default RStudio appearance consists of four panels:
Where:
You can install RStudio Desktop here.
Go to my github page to download the “input” folder and the “r_tutorial.Rproj” file here. Make sure the “input” folder and the Rproj file is placed in the same path. Double-click on the “r_tutorial.Rproj” file. This will launch RStudio, where we can start working on our WGS output files.
Let’s start off by creating a new R script. We can do this by pressing on the green plus icon on the top left corner of the screen, just below “File”, or alternatively we can use the keyboard shortcut of CTRL+SHIFT+N. After you’ve done this step, you should see a new blank script appearing in your panel 1.
This blank script will be where we shall write our codes.
R is essentially an advanced calculator, where you can perform various operations, from basic calculations (e.g., 1+1), to statistical modelling (e.g., regression modelling).
Besides this, you can also use R to view tables and sheets, modify them, produce figures, and many more. You can think of R as a code-based Excel sheets, or Google sheets.
We will be working under the tidyverse environment, which is a collection of functions we can run in R which is more user-friendly and intuitive compared to the conventional R. We will load the tidyverse by typing the following command in your R script, followed by a CTRL+ENTER.
if (!require("tidyverse", quietly = TRUE))
install.packages("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidyverse)
Let’s try some example. Type 1+1
in your script, then
press CTRL+ENTER. You will see the result in Panel 3.
You can do multiplication by using the asterisk *
symbol. For example 3*3
.
Division is done with the slash /
symbol. For example
9/3
Power is done with the ^
symbol. For example
3^2
Variable is an integral part of R that we will encounter frequently in our tutorial. A variable will store whatever syntax that you have provided, which you can later call out. You can assign a syntax into a variable by using the following format: “<variable> <- <syntax>”. For example, we can assign the number “1” to the variable “a”:
a <- 1
You will see a new variable called a
in Panel 2. Now, we
can run our mathematical operations by invoking this variable. For
example:
a+3
## [1] 4
You can reassign variable anytime, and it will overwrite the existing value. For example:
a
## [1] 1
a <- 3
a
## [1] 3
You can remove a variable from the environment by using the
rm
function:
rm(a)
Which will remove the a
variable from your Environment
(i.e., Panel 2).
Quiz: Assign the value 345 to variable ‘x’ and 123 to variable ‘y’. What is the output of ((x*y)+x)/y?
x <- 345
y <- 123
((x*y)+x)/y
## [1] 347.8049
We can also use R to view files and sheets from Windows. Still remember the compiled MLST text file that we created using SRST2? I have copied the file into the folder “input”, which you can access from the lower-right window of RStudio.
The MLST output from SRST is named “srst2_out_mlst_compiled__compiledResults.txt”. If you click on it, it will open in a new tab within Panel 1.
To import the files into R, we can use the function
read.delim
. For example:
read.delim(file = 'input/srst2_out_mlst_compiled__compiledResults.txt')
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
We can then assign this file to a new variable in R by typing the following:
mlst <- read.delim(file = 'input/srst2_out_mlst_compiled__compiledResults.txt')
Now, you can call out the MLST table anytime by just typing
mlst
. Try it out:
mlst
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
This type of data table in R is also known as a data frame. You can think of a data frame as something you will find when you open an Excel file, which has rows and columns you can modify. The good news is, working with data frame in R gives you much more flexibility in terms of data transformation and visualisation than if you only use Excel. I will elaborate more on this below.
R is a powerful tool to transform and reshape our data. There are millions of tools that you can use, but I will highlight some commonly used functions below:
Sometimes you need to assign rownames to your data frame and vice
versa, removing rownames into a column of your data frame. You can use
the function rownames_to_column
or
column_to_rownames
to achieve this.
mlst
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
mlst %>% column_to_rownames('Sample')
## ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## commA 131 53 40 47 13 36 28 29 0 - 78.50900
## commB 131 53 40 47 13 36 28 29 0 - 122.90043
## commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## clinA 0.06896552
## clinB 0.11111111
## commA 0.09090909
## commB 0.06000000
## commC 0.07692308
mlst %>% column_to_rownames('Sample') %>% rownames_to_column('ID')
## ID ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
You can create a new column in your data frame through the
mutate
function. You can even assign the values for the new
column from the existing values that we have.
mlst
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
mlst %>%
mutate(newcolumn = 'values_for_new_column')
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF newcolumn
## 1 0.06896552 values_for_new_column
## 2 0.11111111 values_for_new_column
## 3 0.09090909 values_for_new_column
## 4 0.06000000 values_for_new_column
## 5 0.07692308 values_for_new_column
mlst %>%
mutate(newcolumn = 9)
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF newcolumn
## 1 0.06896552 9
## 2 0.11111111 9
## 3 0.09090909 9
## 4 0.06000000 9
## 5 0.07692308 9
mlst %>%
mutate(recA_added = recA+100)
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF recA_added
## 1 0.06896552 129
## 2 0.11111111 129
## 3 0.09090909 129
## 4 0.06000000 129
## 5 0.07692308 129
mlst %>%
mutate(sample_setting = str_extract(string=Sample, pattern = 'clin|comm'))
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF sample_setting
## 1 0.06896552 clin
## 2 0.11111111 clin
## 3 0.09090909 comm
## 4 0.06000000 comm
## 5 0.07692308 comm
mlst %>%
mutate(Sample = ifelse(Sample == 'clinA', 'clinicalA', Sample))
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty
## 1 clinicalA 131 53 40 47 13 36 28 29 0 -
## 2 clinB 131 53 40 47 13 36 28 29 0 -
## 3 commA 131 53 40 47 13 36 28 29 0 -
## 4 commB 131 53 40 47 13 36 28 29 0 -
## 5 commC 131 53 40 47 13 36 28 29 0 -
## depth maxMAF
## 1 62.90257 0.06896552
## 2 42.11329 0.11111111
## 3 78.50900 0.09090909
## 4 122.90043 0.06000000
## 5 60.69700 0.07692308
You can rename your column names using the function
rename
and its derivatives.
mlst %>%
rename(ID = Sample)
## ID ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
mlst %>%
rename_all(.funs = ~paste0('extra_', .))
## extra_Sample extra_ST extra_adk extra_fumC extra_gyrB extra_icd extra_mdh
## 1 clinA 131 53 40 47 13 36
## 2 clinB 131 53 40 47 13 36
## 3 commA 131 53 40 47 13 36
## 4 commB 131 53 40 47 13 36
## 5 commC 131 53 40 47 13 36
## extra_purA extra_recA extra_mismatches extra_uncertainty extra_depth
## 1 28 29 0 - 62.90257
## 2 28 29 0 - 42.11329
## 3 28 29 0 - 78.50900
## 4 28 29 0 - 122.90043
## 5 28 29 0 - 60.69700
## extra_maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
mlst %>%
rename_if(is.integer, ~paste0('integer_', .))
## Sample integer_ST integer_adk integer_fumC integer_gyrB integer_icd
## 1 clinA 131 53 40 47 13
## 2 clinB 131 53 40 47 13
## 3 commA 131 53 40 47 13
## 4 commB 131 53 40 47 13
## 5 commC 131 53 40 47 13
## integer_mdh integer_purA integer_recA integer_mismatches uncertainty
## 1 36 28 29 0 -
## 2 36 28 29 0 -
## 3 36 28 29 0 -
## 4 36 28 29 0 -
## 5 36 28 29 0 -
## depth maxMAF
## 1 62.90257 0.06896552
## 2 42.11329 0.11111111
## 3 78.50900 0.09090909
## 4 122.90043 0.06000000
## 5 60.69700 0.07692308
You can choose to only keep certain columns in R using the function
select
mlst
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
mlst %>%
select(Sample, ST, adk)
## Sample ST adk
## 1 clinA 131 53
## 2 clinB 131 53
## 3 commA 131 53
## 4 commB 131 53
## 5 commC 131 53
You can pivot a data frame using the function
pivot_table
mlst
## Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth
## 1 clinA 131 53 40 47 13 36 28 29 0 - 62.90257
## 2 clinB 131 53 40 47 13 36 28 29 0 - 42.11329
## 3 commA 131 53 40 47 13 36 28 29 0 - 78.50900
## 4 commB 131 53 40 47 13 36 28 29 0 - 122.90043
## 5 commC 131 53 40 47 13 36 28 29 0 - 60.69700
## maxMAF
## 1 0.06896552
## 2 0.11111111
## 3 0.09090909
## 4 0.06000000
## 5 0.07692308
mlst %>%
select(Sample, ST, adk) %>%
pivot_longer(!Sample,
names_to = 'name',
values_to = 'value')
## # A tibble: 10 × 3
## Sample name value
## <chr> <chr> <int>
## 1 clinA ST 131
## 2 clinA adk 53
## 3 clinB ST 131
## 4 clinB adk 53
## 5 commA ST 131
## 6 commA adk 53
## 7 commB ST 131
## 8 commB adk 53
## 9 commC ST 131
## 10 commC adk 53
If you haven’t already noticed, we used a lot of the
%>%
symbol in our codes. This symbol is referred to as
the pipe, which is basically a bridge between your previous and the next
codes.
Quiz: We can save table from R into a csv files,
which we can open using Excel, using the function
write.csv
. Try saving the mlst table into a csv file named
‘mlst_table.csv’.
write.csv(x = mlst,
file = 'mlst_table.csv',
quote=F)
Quiz: Now that you know how to load files and assign them into variables, try loading the remaining SRST2 output into R. There are three files:
card
plasmidfinder
plasmidreplicon
card <- read.delim(file = 'input/srst2_out_card_compiled__compiledResults.txt')
plasmidfinder <- read.delim(file = 'input/srst2_out_plasmidfinder_compiled__compiledResults.txt')
plasmidreplicon <- read.delim(file = 'input/srst2_out_plasmidreplicon_compiled__compiledResults.txt')
Now let’s combine all the above information to see an example of how we could transform a raw output into a publication-ready figure.
# card
card2 <-
card %>%
column_to_rownames('Sample') %>%
mutate(across(.fns = ~ifelse(.=='-', 0, 1))) %>%
rename_all(~paste0('card_', .)) %>%
rownames_to_column('ID')
fig.card <-
card2 %>%
pivot_longer(!ID) %>%
ggplot(aes(x=name, y=ID, fill=as.factor(value))) +
geom_tile() +
labs(x='Resistance Genes',
y='Sample ID',
title = 'Resistance genes carriage of ST131 isolates') +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
scale_fill_discrete(name='Presence', labels = c('Absent', 'Present'))
fig.card
# plasmidfinder
plasmidfinder2 <-
plasmidfinder %>%
column_to_rownames('Sample') %>%
mutate(across(.fns = ~ifelse(.=='-', 0, 1))) %>%
rename_all(~paste0('finder_', .)) %>%
rownames_to_column('ID')
fig.finder <-
plasmidfinder2 %>%
pivot_longer(!ID) %>%
ggplot(aes(x=name, y=ID, fill=as.factor(value))) +
geom_tile() +
labs(x='Resistance Genes',
y='Sample ID',
title = 'PlasmidFinder carriage of ST131 isolates') +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
scale_fill_discrete(name='Presence', labels = c('Absent', 'Present'))
fig.finder
# plasmidreplicon
plasmidreplicon2 <-
plasmidreplicon %>%
column_to_rownames('Sample') %>%
mutate(across(.fns = ~ifelse(.=='-', 0, 1))) %>%
rename_all(~paste0('replicon_', .)) %>%
rownames_to_column('ID')
fig.replicon <-
plasmidreplicon2 %>%
pivot_longer(!ID) %>%
ggplot(aes(x=name, y=ID, fill=as.factor(value))) +
geom_tile() +
labs(x='Resistance Genes',
y='Sample ID',
title = 'Plasmid replicon carriage of ST131 isolates') +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
scale_fill_discrete(name='Presence', labels = c('Absent', 'Present'))
fig.replicon
We can combine the figures in several ways:
if (!require("ggpubr", quietly = TRUE))
install.packages("ggpubr")
library(ggpubr)
fig.compiled <-
ggarrange(fig.card, fig.finder, fig.replicon,
ncol=1,
common.legend = T)
Quiz: We can save figures using the
ggsave
function. Save the above figure into a file named
‘fig_compiled.png’.
ggsave(filename = 'fig_compiled.png',
plot = fig.compiled,
height=9, width=9)
# the longer but tidier way
fig.compiled2 <-
plyr::join_all(dfs = list(card2, plasmidreplicon2, plasmidfinder2), by='ID') %>%
pivot_longer(!ID) %>%
mutate(group = str_extract(string = name,
pattern = 'card|replicon|finder')) %>%
ggplot(aes(y=name, x=ID, fill=as.factor(value))) +
geom_tile() +
facet_wrap(~group,
scales='free') +
scale_fill_manual(name='Presence',
labels = c('Absent', 'Present'),
values = c('grey', 'dodgerblue')) +
labs(x='Features', y='ID') +
theme(axis.text.x = element_text(angle=45, hjust=1),
strip.text.x = element_text(face='bold'))
Quiz: Save the above
figures into a file named ‘fig_compiled2.png’
ggsave(filename='fig_compiled2.png',
plot = fig.compiled2,
height=7,
width=9)
Now you have learned some basic operations in R, including importing output data from WGS pipeline and transforming them into figures for reporting and publication purposes. As you may have noticed, getting familiar with computer languages require lots and lots of trials and errors before you get the right result. However, I promise you that the time spent will be worth the effort in the long run.
Keep scripting!