Extract Part of a FASTA Sequence by Position


I have hundreds of protein sequences and have identified the conserved domain in each of them. I now have the location of every domain and want to extract the exact sequence at those locations. This is easy for a single sequence with one or more domain locations, but it is tedious to extract domain sequences from many proteins using domain coordinates. I found a simple Python script for extracting FASTA sequences based on position. I have also shared an online program originally written by Dr Pierre Lindenbaum HERE.

Example FASTA file with protein sequences

>AT1G01250 
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL

>AT1G03800 
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP
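
To look up a sequence by its ID, the FASTA file first has to be read into some ID-to-sequence mapping. The following is a minimal sketch of that step, not the code of domainseq.py itself. The function name read_fasta is hypothetical; the name fasta_dict is borrowed from the traceback quoted in the comments. It assumes the ID is the first whitespace-delimited token of the header line, so the trailing space after the headers in the example above does not become part of the key.

```python
def read_fasta(lines):
    """Return {sequence ID: sequence} from an iterable of FASTA lines."""
    fasta_dict = {}
    seq_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header, so
            # ">AT1G01250 " and ">AT1G01250 some description" both give "AT1G01250".
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            fasta_dict[seq_id] = []
        elif seq_id is not None:
            fasta_dict[seq_id].append(line)
    # Join wrapped sequence lines into one string per record.
    return {key: "".join(parts) for key, parts in fasta_dict.items()}


example = [">AT1G01250 ", "MSPQRMKLSS", "PPVTNNEPTA", "", ">AT1G03800 ", "MTTEKENVTT"]
records = read_fasta(example)
print(records["AT1G01250"])  # MSPQRMKLSSPPVTNNEPTA
```

If the IDs in the coordinate file do not match these keys exactly, a dictionary lookup raises KeyError, so it can help to print the keys when debugging.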

Example ID file with domain locations (ID, start and end, separated by tabs)
AT1G01250   45  102
AT1G03800   65  109
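
The ID file can be parsed in a way that tolerates either tabs or spaces between the columns. This is a sketch under that assumption and not necessarily how domainseq.py reads the file; the function name parse_regions is hypothetical. Calling split() with no argument splits on any run of whitespace, and a line with fewer than three fields is reported with its line number instead of failing with an IndexError.

```python
def parse_regions(lines):
    """Yield (seq_id, start, end) tuples from lines of an ID/coordinate file."""
    for number, raw in enumerate(lines, start=1):
        fields = raw.split()  # no argument: splits on any run of tabs or spaces
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected ID, start, end; got {raw!r}")
        yield fields[0], int(fields[1]), int(fields[2])


regions = list(parse_regions(["AT1G01250   45  102", "AT1G03800\t65\t109", ""]))
print(regions)  # [('AT1G01250', 45, 102), ('AT1G03800', 65, 109)]
```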


Script name: domainseq.py (Download)

Usage
python domainseq.py input.fasta ids.txt > result.fasta
Results
>AT1G01250:45-102
IREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPR
>AT1G03800:65-109
AEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP
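
The coordinate convention matters when slicing. The AT1G03800 result above appears to correspond to the raw Python slice [65:109] of the example sequence, that is, residues 66 to 109 in 1-based numbering. The sketch below makes the convention explicit instead of guessing. The names extract_region, format_record and the one_based flag are hypothetical and not part of domainseq.py. For 1-based coordinates with both ends included, only the start is shifted by one, because a Python slice already excludes its end index; subtracting 1 from both would drop the last residue.

```python
def extract_region(sequence, start, end, one_based=True):
    """Return the subsequence for the given coordinates.

    one_based=True  : start and end are 1-based and both included.
    one_based=False : start and end are used directly as Python slice bounds.
    """
    lo = start - 1 if one_based else start
    if lo < 0 or end > len(sequence) or lo >= end:
        raise ValueError(
            f"coordinates {start}-{end} do not fit a sequence of length {len(sequence)}"
        )
    return sequence[lo:end]


def format_record(seq_id, start, end, subsequence):
    """Format one output record with a header of the form >ID:start-end."""
    return f">{seq_id}:{start}-{end}\n{subsequence}"


toy = "ABCDEFGHIJ"
print(extract_region(toy, 3, 6))                   # CDEF
print(extract_region(toy, 3, 6, one_based=False))  # DEF
print(format_record("toy", 3, 6, extract_region(toy, 3, 6)))
```

Out-of-range coordinates raise an error here rather than being silently truncated, which is what plain slicing would do.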







  • 11 comments:

    1. File "domainseq.py", line 27, in <module>
      outname= line[0] + ':' + line[1] + '-' + line[2]
      IndexError: list index out of range

      Replies
      1. Hi, sorry you ran into a problem extracting a subsequence with this Python script. You may use this method to Extract Part of a FASTA Sequence by Position

      2. Use tabs instead of spaces to separate the name and the positions in the ids.txt file.

    2. How can this error be solved?
      Traceback (most recent call last):
      File "domainseq.py", line 31, in <module>
      print(fasta_dict[line[0]][s:e])
      KeyError: 'MyfirstID'


      Replies
      1. Hi BuckeyePuzzler,
        Sorry about the problem. The name and positions should be separated by a tab instead of spaces. You can download the script again. I have attached example files for both the IDs and the sequences. Hope this helps.

    3. This comment has been removed by the author.

    4. I think line 29 and line 30 should be:
      s= int(line[1])-1
      e= int(line[2])-1

      Replies
      1. I have checked this script. It is working. Thanks

