One of the first steps in NGS data analysis is to create directories for samples. Some tools require one directory per sample, so that each sample's results are saved under the same file name in its own directory. For example, the HISAT2 workflow (HISAT2-StringTie-Ballgown) makes the user keep a separate directory for each output. This is something of a hassle for a bohemian analyst, but one can't escape such developer-enforced architecture, for good or bad.

The purpose of this note is to create multiple directories for NGS data, one per sequence file. Since each directory can be created independently of the others, the task can be parallelized. Let us use a tool called GNU parallel, available in most Linux distros.

Below is a typical example of data on a Linux machine. The tree command lists the files in a given directory; in this example, we list all the files in the raw_data directory. Let us create one directory per sample (named after the file without the .fastq.gz extension) and then move each file into its respective directory.
$ tree raw_data
raw_data/
├── hcc1395_normal_rep1_r1.fastq.gz
├── hcc1395_normal_rep1_r2.fastq.gz
├── hcc1395_normal_rep2_r1.fastq.gz
├── hcc1395_normal_rep2_r2.fastq.gz
├── hcc1395_normal_rep3_r1.fastq.gz
├── hcc1395_normal_rep3_r2.fastq.gz
├── hcc1395_tumor_rep1_r1.fastq.gz
├── hcc1395_tumor_rep1_r2.fastq.gz
├── hcc1395_tumor_rep2_r1.fastq.gz
├── hcc1395_tumor_rep2_r2.fastq.gz
├── hcc1395_tumor_rep3_r1.fastq.gz
└── hcc1395_tumor_rep3_r2.fastq.gz

0 directories, 12 files

Command (run from inside the raw_data directory):

$ ls *.gz |parallel 'basename {} .fastq.gz | xargs mkdir'

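What is happening here:
ls *.gz - lists all the files with the .gz extension; in this case, all the sample files
parallel - the GNU parallel program, which runs the quoted command once per input line
basename - a command that strips the leading directory and, optionally, a suffix from a file name; the suffix removed here is .fastq.gz
{} - replaced by one line of the output from ls *.gz at a time
mkdir - the command that creates a directory
xargs - mkdir does not read names from stdin, so xargs turns the piped name into an argument for mkdir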

The command follows these logical steps:
  1. List all the sequence files with the .gz extension
  2. Strip the .fastq.gz extension using basename
  3. Create directories named after the sequence files with the extension stripped

Output:

$ ls
hcc1395_normal_rep1_r1         hcc1395_tumor_rep1_r1
hcc1395_normal_rep1_r1.fastq.gz  hcc1395_tumor_rep1_r1.fastq.gz
hcc1395_normal_rep1_r2         hcc1395_tumor_rep1_r2
hcc1395_normal_rep1_r2.fastq.gz  hcc1395_tumor_rep1_r2.fastq.gz
hcc1395_normal_rep2_r1         hcc1395_tumor_rep2_r1
hcc1395_normal_rep2_r1.fastq.gz  hcc1395_tumor_rep2_r1.fastq.gz
hcc1395_normal_rep2_r2         hcc1395_tumor_rep2_r2
hcc1395_normal_rep2_r2.fastq.gz  hcc1395_tumor_rep2_r2.fastq.gz
hcc1395_normal_rep3_r1         hcc1395_tumor_rep3_r1
hcc1395_normal_rep3_r1.fastq.gz  hcc1395_tumor_rep3_r1.fastq.gz
hcc1395_normal_rep3_r2         hcc1395_tumor_rep3_r2
hcc1395_normal_rep3_r2.fastq.gz  hcc1395_tumor_rep3_r2.fastq.gz
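
The three steps above (list, strip, create) can also be written as a plain serial loop, which may help when reading the one-liner. This is only a sketch under assumptions: it does not use GNU parallel, it works in a throwaway scratch directory, and the sampleA file names are hypothetical stand-ins for real reads.

```shell
set -eu

# Scratch directory with two empty stand-in files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir"
touch sampleA_r1.fastq.gz sampleA_r2.fastq.gz

# Step 1: the glob lists the files; step 2: basename strips .fastq.gz;
# step 3: mkdir creates a directory with the stripped name.
for f in *.fastq.gz; do
    name=$(basename "$f" .fastq.gz)
    mkdir -p "$name"
done

# Show the directories that were created.
ls -d */
```

The loop handles one file at a time, whereas parallel starts several jobs at once; for something as cheap as mkdir the difference is unlikely to matter much.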

Now assume that your sequences are in .fastq format and are not gzipped; then it is much easier. Use the following command:

$ ls *.fastq | parallel 'mkdir {.}'

{.} in parallel strips the last extension from the input. We can exploit this to strip two extensions. For example, let us repeat the exercise above on the gzipped files, this time without using basename. The command would be:
$ ls *.gz | parallel 'echo {.}' | parallel 'mkdir {.}'
In this command, the first parallel strips the last extension (.gz) and the second parallel strips the remaining extension (.fastq).
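
Two passes are needed because only the last extension is removed each time. Plain shell parameter expansion behaves in a similar way, so the sketch below uses it as an analogy; it is not part of GNU parallel, and the variable names are made up for illustration.

```shell
set -eu

f=hcc1395_normal_rep1_r1.fastq.gz

# ${var%.*} removes the shortest trailing match of ".*",
# i.e. only the last extension, much like {.} does.
once=${f%.*}
twice=${once%.*}

# Naming the full suffix removes both extensions in one step,
# which is what basename does with its second argument.
direct=${f%.fastq.gz}

echo "$once"     # hcc1395_normal_rep1_r1.fastq
echo "$twice"    # hcc1395_normal_rep1_r1
echo "$direct"   # hcc1395_normal_rep1_r1
```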

Now suppose we are not happy with the directories created and would like to delete all of them, keeping the files intact.

$ echo */ | xargs rm -rf

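echo */ prints all the directories in the current directory and rm -rf deletes them. xargs is used to pass the echo output to rm -rf as arguments. Note that rm -rf also deletes everything inside those directories, so use it only while they are still empty or disposable.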
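
If the directories are expected to be empty, a more cautious variant is to use rmdir, which refuses to remove a directory that still has content. The following is a minimal sketch run in a scratch directory with hypothetical sampleA names, not a replacement for the command above.

```shell
set -eu

# Scratch directory: one stand-in file and two empty directories.
workdir=$(mktemp -d)
cd "$workdir"
touch sampleA_r1.fastq.gz
mkdir sampleA_r1 sampleA_r2

# The */ glob matches directories only. rmdir removes them when they
# are empty and fails (instead of deleting data) when they are not.
rmdir -- */

# Only the sequence file should be left.
ls
```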

Having created the directories, we now need to move each file into its directory. We could write another script for that, but it is not necessary. Let us instead create the directories and move the files into their respective directories in one step:

Command:

Before:
$ tree .
.
├── hcc1395_normal_rep1_r1.fastq.gz
├── hcc1395_normal_rep1_r2.fastq.gz
├── hcc1395_normal_rep2_r1.fastq.gz
├── hcc1395_normal_rep2_r2.fastq.gz
├── hcc1395_normal_rep3_r1.fastq.gz
├── hcc1395_normal_rep3_r2.fastq.gz
├── hcc1395_tumor_rep1_r1.fastq.gz
├── hcc1395_tumor_rep1_r2.fastq.gz
├── hcc1395_tumor_rep2_r1.fastq.gz
├── hcc1395_tumor_rep2_r2.fastq.gz
├── hcc1395_tumor_rep3_r1.fastq.gz
└── hcc1395_tumor_rep3_r2.fastq.gz

0 directories, 12 files

After:
$ tree .
.
├── hcc1395_normal_rep1_r1
│   └── hcc1395_normal_rep1_r1.fastq.gz
├── hcc1395_normal_rep1_r2
│   └── hcc1395_normal_rep1_r2.fastq.gz
├── hcc1395_normal_rep2_r1
│   └── hcc1395_normal_rep2_r1.fastq.gz
├── hcc1395_normal_rep2_r2
│   └── hcc1395_normal_rep2_r2.fastq.gz
├── hcc1395_normal_rep3_r1
│   └── hcc1395_normal_rep3_r1.fastq.gz
├── hcc1395_normal_rep3_r2
│   └── hcc1395_normal_rep3_r2.fastq.gz
├── hcc1395_tumor_rep1_r1
│   └── hcc1395_tumor_rep1_r1.fastq.gz
├── hcc1395_tumor_rep1_r2
│   └── hcc1395_tumor_rep1_r2.fastq.gz
├── hcc1395_tumor_rep2_r1
│   └── hcc1395_tumor_rep2_r1.fastq.gz
├── hcc1395_tumor_rep2_r2
│   └── hcc1395_tumor_rep2_r2.fastq.gz
├── hcc1395_tumor_rep3_r1
│   └── hcc1395_tumor_rep3_r1.fastq.gz
└── hcc1395_tumor_rep3_r2
    └── hcc1395_tumor_rep3_r2.fastq.gz

12 directories, 12 files
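
One way to get from the Before layout to the After layout is sketched here as a plain shell loop. It is an assumption about how the step could be done, not necessarily the command used to produce the trees above; the scratch directory and the sampleA file names are hypothetical.

```shell
set -eu

# Scratch directory with two empty stand-in files.
workdir=$(mktemp -d)
cd "$workdir"
touch sampleA_r1.fastq.gz sampleA_r2.fastq.gz

# For each file: derive the directory name, create it, move the file in.
for f in *.fastq.gz; do
    d=${f%.fastq.gz}
    mkdir -p "$d"
    mv "$f" "$d"/
done

# With GNU parallel, the same body could be run once per file, e.g.:
#   parallel 'd=$(basename {} .fastq.gz); mkdir -p "$d" && mv {} "$d"/' ::: *.fastq.gz

# Each file should now sit inside a directory of the same base name.
find . -type f | sort
```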

Now let us say that we are not happy and want to bring all the files back from the subdirectories into the current directory, i.e. copy or move all the sequence files in the subdirectories to the current directory. Do the following:

$ ls | parallel 'cp {}/*.gz .'

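What is happening here:

ls - lists everything in the current directory, which at this point contains only the sample directories
cp - copy
{} - each directory, one per job
/ - path separator between the directory name and the file pattern
* - wildcard that matches any file name
.gz - restricts the match to files with the .gz extension
. - the current directory, used as the destination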
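
Since the shell can expand the directory part of the path as well, the same copy can be sketched without ls or parallel. This is an illustration under assumptions (scratch directory, hypothetical sampleA names); like the command above, it copies rather than moves, so the originals stay in their subdirectories.

```shell
set -eu

# Scratch directory with one stand-in file per sample directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir sampleA_r1 sampleA_r2
touch sampleA_r1/sampleA_r1.fastq.gz sampleA_r2/sampleA_r2.fastq.gz

# */*.fastq.gz expands to every .fastq.gz file one level down;
# the final "." is the destination (the current directory).
cp -- */*.fastq.gz .

# The directories and the copied files are now side by side.
ls
```

Replacing cp with mv would move the files instead, after which the emptied directories could be removed.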