Sort and reduce redundancy in FASTA files

We had a FASTA file with unwanted redundancy.

Reads were mapped to a reference, and each match was then extracted as an ID plus the sequence that this ID mapped to. As a result, the same ID appeared more than once in the file. Our goal was to end up with each ID only once, paired with one sequence.

awk '{print $1}' original.file > original_file.short

awk '{print $1}' = print only the first column (the first whitespace-separated field) of each line
> = write the output to a new file
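To see what this step does, here is a small sketch on made-up data. The file names with the demo_ prefix and the read IDs are hypothetical, and it assumes the header lines start with ">" and carry extra fields after the ID.

```shell
# Hypothetical sample: header lines carry extra fields after the ID.
printf '>read1 chr1:100 +\nACGT\n>read2 chr1:200 -\nGGCC\n' > demo_original.file

# Keep only the first whitespace-separated field of every line.
awk '{print $1}' demo_original.file > demo_original_file.short

# Show the shortened file.
cat demo_original_file.short
```

The header lines are cut down to the bare ID, while the sequence lines stay as they are because they contain only one field.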

paste - - < original_file.short > original_file_short.one-line  (source)

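This reformats the file into two columns: paste - - reads two lines at a time from standard input and joins them with a tab, so every second line (the sequence) becomes the second column next to its ID. This only works if each sequence sits on a single line.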
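A minimal sketch of the paste step on made-up data. The demo_ file names and read IDs are hypothetical, and it assumes an unwrapped FASTA file in which ID lines and sequence lines strictly alternate.

```shell
# Hypothetical input: ID and sequence on alternating lines, read1 listed twice.
printf '>read1\nACGT\n>read2\nGGCC\n>read1\nACGT\n' > demo_original_file.short

# Join every pair of lines into one tab-separated line (ID <TAB> sequence).
paste - - < demo_original_file.short > demo_original_file_short.one-line

# Show the two-column file.
cat demo_original_file_short.one-line
```

Each output line now holds one complete record, which is what makes line-based tools such as sort usable on it.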

sort -u -k1,1 original_file_short.one-line > final.file  (source)

-u keep only one line per unique key
-k1,1 use only the first column (the ID) as the key for sorting and uniqueness
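The sort step can be tried on made-up data like this. The demo_ file names and read IDs are hypothetical, and the last command, which turns the result back into a two-line FASTA layout, is an optional extra that is not part of the steps above.

```shell
# Hypothetical two-column input (ID <TAB> sequence) with read1 listed twice.
printf '>read2\tGGCC\n>read1\tACGT\n>read1\tACGT\n' > demo_original_file_short.one-line

# Sort on the first column only and keep one line per ID.
sort -u -k1,1 demo_original_file_short.one-line > demo_final.file

# Show the de-duplicated two-column file.
cat demo_final.file

# Optional (an addition, not in the original steps): tabs back to line breaks.
tr '\t' '\n' < demo_final.file > demo_final.fasta
```

Because uniqueness is judged on the ID column alone, an ID that occurs with two different sequences is still reduced to a single line, and it is safer not to rely on which of the two is kept.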

 

 
