UCSC has a wonderful tool called Table browser to export, intersect and sequence information of reference genome data in context of user supplied data where user data is optional. User can export data in several standard formats. One of them is gtf format. gtf format is limited version of gff3. User can learn how to export data in gtf from here hosted on UCSC.One of the problems that users face while analyzing RNAseq data (esp by Tophat/Cufflinks or New tuxedo protocol) is that both gene ID and Transcript ID are same. Once you download, gtf track, take a back up of it before you proceed any further.
Example gtf record would look like this:
chr22 hg38_refGene exon 50798655 50799637 0.000000 + . gene_id "NR_026981"; transcript_id "NR_026981";
Steps to follow:
- In hgtables, Choose NCBI refseq as track and in output format,
- Choose selected fields from primary and related tables and click on get output.It would take you to a page where you can choose the output.
- Select name and name2 fields and click on get output. This would output transcript name and gene name.
Example ouput would look like this:
NR_026981 RPL23AP82 - Now replace genid (in UCSC gtf) with gene symbol from freshly exported list.
$ awk 'NR==FNR{A[$1]=$2;next} $2 in A{$2=A[$2]}1' <file from step3> FS='[";]' OFS='"' <your gtf> | sed 's/""/";/g'
Example data and code:
$ cat test.gtf
chr22 hg38_refGene exon 50798655 50799637 0.000000 + . gene_id "NR_026981"; transcript_id "NR_026981";
$ cat test.hgtables
NR_026981 RPL23AP82
$ awk 'NR==FNR{A[$1]=$2;next} $2 in A{$2=A[$2]}1' test.hgtables FS='[";]' OFS='"' test.gtf | sed 's/""/";/g'
chr22 hg38_refGene exon 50798655 50799637 0.000000 + . gene_id "RPL23AP82"; transcript_id "NR_026981";
Output from step 3 will have extra two lines at the start (starting with #. Remove them)