One of the strengths of AWK is column operation which makes column based manipulation of text files are easy. One such example as follows below. Example text is as follows:

$ cat test.txt 
	1:924024
	1:924310
SAMD11 	1:930353
SAMD11 	1:930939
NOC2L  	1:944858
NOC2L  	1:946247
KLHL17 	1:960891
KLHL17 	1:961945

First two lines have no values in first column and each column is separated by tab. User request is to fill it up with incremental numbers, followed by a text :na. It is easy with awk as follows:

$ awk -F "\t" -v OFS="\t" '{if ($1=="") $1=NR":na"}1' test.txt 

1:na	1:924024
2:na	1:924310
SAMD11 	1:930353
SAMD11 	1:930939
NOC2L  	1:944858
NOC2L  	1:946247
KLHL17 	1:960891
KLHL17 	1:961945

using ternary operator in awk:


$ awk -F "\t" -v OFS="\t" '{ ($1=="")? ($1=NR":na"):$1}1' test.txt 

1:na	1:924024
2:na	1:924310
SAMD11 	1:930353
SAMD11 	1:930939
NOC2L  	1:944858
NOC2L  	1:946247
KLHL17 	1:960891
KLHL17 	1:961945

But what if user wants to number each and every line, but empty one being filled with na

$ awk -F "\t" -v OFS="\t" '{ ($1==" ")? ($1=NR":na"):($1=NR":"$1)}1' test.txt

1:na	1:924024
2:na	1:924310
3:SAMD11	1:930353
4:SAMD11	1:930939
5:NOC2L	1:944858
6:NOC2L	1:946247
7:KLHL17	1:960891
8:KLHL17	1:961945


Now this is easy. Right? But user wants a different way. All duplicated entries must have same numbering, while empty ones should have individual entries. It’s tricky. Without writing large code, I could not collapse rows, based on a column in awk. So I used datamash. I wish datamash had expand option for any given column just the way they have group based collapse.

here is the code any way. A little bit messy, but works.


$ awk -F "\t" -v OFS="\t" '$1==" " {$1="na_"NR}1' test.txt \
	| datamash -g1  collapse 2 \
	| awk -F "\t" -v OFS='\t' '{split($2,a,",");for(i in a) print NR,$1,a[i]}' \
	| awk -v OFS="\t" '{gsub(/_.*/,"",$2)}{print $1":"$2,$3}' 

1:na	1:924024
2:na	1:924310
3:SAMD11	1:930353
3:SAMD11	1:930939
4:NOC2L	1:944858
4:NOC2L	1:946247
5:KLHL17	1:960891
5:KLHL17	1:961945