GNU Parallel is one of the parallel job execution programs available for GNU/Linux, and it is easy to learn and use. Recently, I got a request to rename a bunch of files. This is easy to do with other tools, for example by generating the new file names with sed and then moving each old file to its new name. The task involves identifying a pattern, reusing the pattern, and adding strings around it. Instead of that, I decided to do it in parallel.
But first, look at the problem and the expected output. The user has the following files: file1_1.hiseq.fq.gz, file1_2.hiseq.fq.gz, file2_1.hiseq.fq.gz and file2_2.hiseq.fq.gz. The user wants to rename them so that file1_1.hiseq.fq.gz becomes file_1.R1.fq.gz, and likewise for the other files.
- Now let us create empty files with these names (no data is needed, as we only care about the file names).
$ touch file{1..2}_{1..2}.hiseq.fq.gz
$ ls
file1_1.hiseq.fq.gz file1_2.hiseq.fq.gz file2_1.hiseq.fq.gz file2_2.hiseq.fq.gz
- Now let us rename them. Instead of replacing the original files, let us copy the same files, but with new names.
$ parallel cp {} '{= s:([0-9]+)_([0-9]+)\.hiseq:_$1\.R$2: =}' ::: *.gz
- Let us look at the output:
$ ls
file1_1.hiseq.fq.gz file_1.R1.fq.gz file2_1.hiseq.fq.gz file_2.R1.fq.gz
file1_2.hiseq.fq.gz file_1.R2.fq.gz file2_2.hiseq.fq.gz file_2.R2.fq.gz
Now, what happened here?
In the old file names there is a pattern of the form digit_digit.hiseq (1_1.hiseq in the first file). The new name needs an underscore before the first digit, a . (dot) in place of the old underscore, an R before the second digit, and no .hiseq, so 1_1.hiseq becomes _1.R1. For this we captured the first and second digits as groups and reused them in the replacement as backreferences, $1 and $2. To be available as a backreference, a part of the pattern must be enclosed in parentheses (), and the groups are numbered from left to right. The substitution sits between {= and =}, which GNU Parallel evaluates as a Perl expression on each input argument; the colons are just the delimiters of the s command, used here instead of the usual slashes.
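The capture-group substitution can be previewed without touching any files. The sketch below is only an illustration, not the command used in this post: it swaps in sed -E for the Perl expression, so the backreferences are written \1 and \2 instead of $1 and $2, and new_name is a hypothetical helper name chosen for this example.

```shell
# new_name is a hypothetical helper (not part of GNU Parallel).
# It prints the new name for one old file name, using the same
# pattern as the parallel command above, but in sed -E syntax.
new_name() {
  printf '%s\n' "$1" | sed -E 's:([0-9]+)_([0-9]+)\.hiseq:_\1.R\2:'
}

# Preview the mapping for the four example names.
# Nothing is read from or written to disk here.
for old in file1_1.hiseq.fq.gz file1_2.hiseq.fq.gz \
           file2_1.hiseq.fq.gz file2_2.hiseq.fq.gz; do
  printf '%s -> %s\n' "$old" "$(new_name "$old")"
done
```

Each line of the preview shows an old name and the name it would be copied to, for example file1_1.hiseq.fq.gz -> file_1.R1.fq.gz. A name that does not contain the digit_digit.hiseq pattern is printed unchanged, which makes it easy to spot files the pattern would miss.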