One of the situations we might face in our daily dealings with data and files involves having to segregate files from a single directory into several subdirectories, be it for aesthetics, organisation, or to facilitate downstream analyses. In my case, I had previously downloaded ~2000 DNA sequence data files belonging to multiple original projects from the NCBI public repository into a single folder, but am now required to process the data from each project separately. It is easy to do this manually if we are only dealing with a few files. In my case, however, manually segregating ~2000 files into 40 subdirectories is simply not a productive use of time.
Being lazy, I came up with the script below, which bypasses the boring manual job and completes it in seconds.
I like to work within the tidyverse environment, which I think is one of the advantages of scripting in R. The tidyverse provides a collection of functions and a coding style that help make our scripts more intuitive. The following script installs the tidyverse package if you haven’t already done so, and then loads it:
# this will install the tidyverse package if not already available
if (!require("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# load the tidyverse package
library(tidyverse)
I will demonstrate this project by first creating nine dummy text files that I shall name {a-c}{1-3}.txt for convenience’s sake. We can achieve this in R by doing a simple loop with the file.create() function:
# a character vector of file names to create
file_to_create = c(paste0('a', 1:3, '.txt'), paste0('b', 1:3, '.txt'), paste0('c', 1:3, '.txt'))
# loop each filename into the file.create function
for (x in file_to_create) {
  file.create(x)
}
Your directory structure should contain the nine files you have just created.
list.files(pattern='txt')
## [1] "a1.txt" "a2.txt" "a3.txt" "b1.txt" "b2.txt" "b3.txt" "c1.txt" "c2.txt"
## [9] "c3.txt"
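As a side note, file.create() accepts a whole character vector of paths, so the loop is not strictly needed. The sketch below shows the vectorised form; it works in a throwaway subfolder of tempdir() (the names demo_dir and file_names are only for illustration) so that it does not touch the working directory used in the rest of this post.

```r
# Work in a throwaway directory so the real working directory is left alone
demo_dir <- file.path(tempdir(), "file_create_demo")
dir.create(demo_dir, showWarnings = FALSE)

# rep(..., each = 3) repeats each letter three times; 1:3 is recycled alongside it,
# giving a1.txt, a2.txt, a3.txt, b1.txt, ... c3.txt
file_names <- paste0(rep(c("a", "b", "c"), each = 3), 1:3, ".txt")

# file.create() takes the whole vector at once and returns one logical per file
created <- file.create(file.path(demo_dir, file_names))

all(created)          # TRUE when every file was created
list.files(demo_dir)  # the nine dummy files
```

The loop version is arguably easier to read for newcomers, which is why I kept it in the main walkthrough.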
The task now is to group and move these files into different folders based on the first letter of their names. We can achieve this by first creating a helper file, which we will use later on to create the folders and copy each file into its respective folder. An example is shown below:
## # A tibble: 9 × 2
## filename folder
## <chr> <chr>
## 1 a1.txt folder_a
## 2 a2.txt folder_a
## 3 a3.txt folder_a
## 4 b1.txt folder_b
## 5 b2.txt folder_b
## 6 b3.txt folder_b
## 7 c1.txt folder_c
## 8 c2.txt folder_c
## 9 c3.txt folder_c
From the above example, I want to move files whose names start with a into folder_a, b into folder_b, and c into folder_c.
Based on the dummy files that I’ve created, I can create the above
helper file using the steps below:
helper_file =
  # work with the file_to_create variable
  file_to_create %>%
  # convert to tibble format
  tibble %>%
  # rename the default column name of '.' into 'filename'
  rename('filename' = '.') %>%
  # create a new column of target folder name by combining the string 'folder_' and the first letter of the file name together
  mutate(folder = paste0('folder_', str_extract(string=filename, pattern = '^[[:letter:]]{1}')))
helper_file
## # A tibble: 9 × 2
## filename folder
## <chr> <chr>
## 1 a1.txt folder_a
## 2 a2.txt folder_a
## 3 a3.txt folder_a
## 4 b1.txt folder_b
## 5 b2.txt folder_b
## 6 b3.txt folder_b
## 7 c1.txt folder_c
## 8 c2.txt folder_c
## 9 c3.txt folder_c
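In a real project the file names would usually come from the directory itself rather than from a variable such as file_to_create. The following is a minimal sketch of that variation, assuming the files of interest end in .txt and sit in one folder; demo_dir and helper_from_disk are illustrative names, and the example again uses a throwaway folder under tempdir().

```r
library(tidyverse)

# Throwaway directory with a few dummy files, for illustration only
demo_dir <- file.path(tempdir(), "helper_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a1.txt", "a2.txt", "b1.txt", "c1.txt")))

helper_from_disk <-
  # list the .txt files found on disk and name the column directly
  tibble(filename = list.files(demo_dir, pattern = "\\.txt$")) %>%
  # target folder = 'folder_' plus the first character of the file name
  mutate(folder = paste0("folder_", str_sub(filename, start = 1, end = 1)))

helper_from_disk
```

Naming the column inside tibble() avoids the separate rename() step, at the cost of being slightly less step-by-step than the version above.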
Let’s now save this helper file into a CSV-formatted file.
write_csv(helper_file, file = 'helper_file.csv', col_names = F)
I excluded the column names from the file using the argument col_names=F for a reason you will understand in the next section.
If your end folder structure cannot be determined based on the file names alone, then you will have to manually create the helper file using human input.
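For that manual case, one option is to type the mapping out as a small table and save it in the same two-column, header-free layout. The sketch below is only an illustration: the file names and folder names are made up, and manual_helper and manual_path are hypothetical names. It writes to tempdir() so it does not overwrite the helper_file.csv used in this post.

```r
library(tidyverse)

# Hypothetical, hand-written mapping of each file to its target folder
manual_helper <- tribble(
  ~filename,          ~folder,
  "sample_01.fastq",  "project_A",
  "sample_02.fastq",  "project_A",
  "sample_03.fastq",  "project_B"
)

# Save in the same layout as before: filename,folder with no header line
manual_path <- file.path(tempdir(), "manual_helper.csv")
write_csv(manual_helper, file = manual_path, col_names = FALSE)

# Inspect the raw lines that a line-by-line reader would see
readLines(manual_path)
```

The same table could equally be typed in a spreadsheet and exported as CSV, as long as the first column is the file name and the second is the folder.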
We now have the dummy files ready, along with a helper file denoting the folder each file should be moved into. We can now create the folders and copy the files using the loop below:
for (x in readLines(con = 'helper_file.csv')) { # read file in the working directory line by line
  # split the line into two variables separated by a comma (as the input is a csv file) and extract as filename and foldername
  filename = str_split(string=x, pattern = ',', n=2)[[1]][1]
  foldername = str_split(string=x, pattern = ',', n=2)[[1]][2]
  if (file.exists(foldername)) {
    # copy file into directory if directory already exists
    file.copy(from = filename, to = paste0(foldername, '/', filename))
  } else {
    # create directory, then copy file into directory
    dir.create(path = foldername)
    file.copy(from = filename, to = paste0(foldername, '/', filename))
  }
}
Earlier, we set col_names=F when writing our helper file because the above loop reads the file line by line. A header line would be treated as just another filename and folder pair, so leaving the column names out simplifies the loop.
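If the helper table is still in memory, the round trip through a CSV file can be skipped. The sketch below is an alternative to the loop above rather than a replacement for it: it uses only base R, creates each target folder once, and relies on file.copy() being vectorised over equal-length from and to vectors. The names demo_dir, helper and copied are illustrative, and everything happens under tempdir().

```r
# Throwaway directory and a small in-memory helper table, for illustration only
demo_dir <- file.path(tempdir(), "copy_demo")
dir.create(demo_dir, showWarnings = FALSE)

helper <- data.frame(
  filename = c("a1.txt", "a2.txt", "b1.txt"),
  folder   = c("folder_a", "folder_a", "folder_b")
)
file.create(file.path(demo_dir, helper$filename))

# Create each target folder once; showWarnings = FALSE keeps it quiet if it already exists
for (f in unique(helper$folder)) {
  dir.create(file.path(demo_dir, f), showWarnings = FALSE)
}

# Copy every file to its folder in one call; the result is one logical per file
copied <- file.copy(
  from = file.path(demo_dir, helper$filename),
  to   = file.path(demo_dir, helper$folder, helper$filename)
)
copied
```

The returned logical vector is worth checking, since file.copy() reports a failed copy as FALSE rather than stopping with an error.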
We can check to see if we have completed our task by looking at the directory structure:
list.files(pattern='txt', recursive = T) # set recursive=T to recursively evaluate the sub-directories
## [1] "a1.txt" "a2.txt" "a3.txt" "b1.txt"
## [5] "b2.txt" "b3.txt" "c1.txt" "c2.txt"
## [9] "c3.txt" "folder_a/a1.txt" "folder_a/a2.txt" "folder_a/a3.txt"
## [13] "folder_b/b1.txt" "folder_b/b2.txt" "folder_b/b3.txt" "folder_c/c1.txt"
## [17] "folder_c/c2.txt" "folder_c/c3.txt"
It seems we have successfully copied the files into their respective folders. We can now safely delete the original files in the main directory to clean up our work environment:
file.remove(file_to_create) # remove files as defined in the file_to_create variable
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
list.files(pattern='txt', recursive = T)
## [1] "folder_a/a1.txt" "folder_a/a2.txt" "folder_a/a3.txt" "folder_b/b1.txt"
## [5] "folder_b/b2.txt" "folder_b/b3.txt" "folder_c/c1.txt" "folder_c/c2.txt"
## [9] "folder_c/c3.txt"
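With real data, a more cautious version of this clean-up step would confirm that every copy exists before removing anything. The following is a minimal sketch under that assumption, using base R only; comparing file sizes is a cheap sanity check of my own choosing, not a guarantee that the contents are identical. The names demo_dir, originals, copies and safe_to_delete are illustrative.

```r
# Throwaway directory with two originals and their copies, for illustration only
demo_dir <- file.path(tempdir(), "cleanup_demo")
dir.create(file.path(demo_dir, "folder_a"), recursive = TRUE, showWarnings = FALSE)

originals <- file.path(demo_dir, c("a1.txt", "a2.txt"))
copies    <- file.path(demo_dir, "folder_a", c("a1.txt", "a2.txt"))
file.create(originals)
file.copy(from = originals, to = copies)

# Only delete when every copy exists and matches its original in size
safe_to_delete <- all(file.exists(copies)) &&
  all(file.size(copies) == file.size(originals))

if (safe_to_delete) {
  removed <- file.remove(originals)
} else {
  stop("Some copies are missing or differ in size; originals were not removed.")
}

list.files(demo_dir, recursive = TRUE)
```

When the source and destination are on the same file system, file.rename() can also move a file in a single step, which avoids having two copies at any point.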