Polyproteins, ribo slippage, and mat_peptide in viruses?

Polyproteins, ribo slippage, and mat_peptide in viruses?

by Chris Larsen-2

All,

I am attempting to find some solutions to a DB loading problem we are
encountering in viruses. It is multifold:

Some viruses churn out a polyprotein rather than individual peptides;
further, they also slip the ribosome, so a source nucleotide is used
more than once in translation (the ribosome halts, backs up one
nucleotide, and continues in a new frame); and finally we have
post-translational processing into mature peptides. The main thing is
that the mature peptide is contained as a subset of the whole parent
polyprotein, but is not provided as a single file in GBK for each
mat_peptide CDS. We have to get that in order to run algorithms on the
relevant processed proteins. Therefore we cannot directly load into
GUS, but rather have to choose how to get the mat_peptide sequence.
Actually I think the viruses know that, and are just messing with us
out of spite, since we have iPods and they don't. Anyway, from anyone
who has encountered this I seek guidance.

We have as choices:

1. Get the locations of the mature peptide children in /Protein/
   carve the mat_peptide sequence out of the whole polyprotein translation
   check that the mat_peptide is in fact an identical subset of the
   translated protein
   load that

OR

2. Use the locations of starts and stops in /Nucleotide/
   translate that, using the slippage information
   get mature peptides that line up exactly to the parent polyprotein

If you know of BioPerl sequence handling support for this, I would  
love to hear more. Clearly this is a nonstandard thingamabob.
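
A minimal sketch of option 1 in Python, assuming the peptide location is already known in protein coordinates (1-based, inclusive). The function names are hypothetical and the sequence is just the first residues of the /translation quoted later in this thread, used as toy data; this is not BioPerl or Biopython API.

```python
def carve_peptide(polyprotein: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of the polyprotein."""
    if not 1 <= start <= end <= len(polyprotein):
        raise ValueError(f"range {start}..{end} is outside 1..{len(polyprotein)}")
    return polyprotein[start - 1:end]


def matches_parent(polyprotein: str, peptide: str, start: int) -> bool:
    """True if peptide is identical to the parent beginning at residue start."""
    return polyprotein.startswith(peptide, start - 1)


# Toy data only: a 12-residue stand-in for a full polyprotein translation.
parent = "MSSKHFKILVNE"
peptide = carve_peptide(parent, 4, 8)
print(peptide)                              # KHFKI
print(matches_parent(parent, peptide, 4))   # True
```

The second function is the "identical subset" check: it only passes when the candidate peptide matches the parent at the stated position, which is stricter than a plain substring test.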

Stupid viruses

Chris

--

Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Peter-329

On Tue, Oct 27, 2009 at 4:33 PM, Chris Larsen <clarsen@...> wrote:

> ...
>
> Stupid viruses
>
> Chris

Cool viruses :)

Do you have some specific examples from GenBank? I'm starting
to deal with virus annotation in my work, so this is of interest.

Peter

P.S. As you might guess from my email address, I'm actually more
interested in Biopython than BioPerl, but the same algorithmic
approach could be tested in either.

Re: EXAMPLE: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Chris Larsen-2

My bad...thanks everyone.

An example of a virus that does this is in order so we don't have to
do this in our heads. There's no slippage in this isolate per se;
however, the main problem of generating component *.faa files from the
polyprotein still exists.

Chris

> For instance, check this:
> http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
>
> You may recall some cruise where half the ship threw up. YAY
> norovirus. Let's take that as a useful test case for solutions.
>
> No mat_peptide sequence is given. How to get that?


Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Peter-329

On Tue, Oct 27, 2009 at 7:15 PM, Chris Larsen <clarsen@...> wrote:
>
> Hello Peter!
>
> For instance, check this:
> http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
> ...
>
> No mat_peptide sequence is given. We want that...

Looking at the GenBank file displayed, the mat_peptide features
(mature peptides) do not include a translation entry (like the parent
CDS feature does). However, they do have protein IDs - which are
actually links in the HTML version.

This leads me to suggest a third option as an alternative to the two
ideas you outlined. You could parse the GenBank file(s), and for each
mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
a FASTA file, or a GenPept file). If you only have a relatively small
number of viruses and proteins this is probably going to be pretty
easy. At least, I could do it in Biopython and I am sure the same is
true with the BioPerl GenBank parser and their EFetch interface.

However, for a large dataset, handling it all locally (your options
(1) and (2)) sounds best.
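
A minimal sketch of the EFetch lookup in plain Python (standard library only, not the Biopython or BioPerl wrappers). The URL is the public NCBI E-utilities EFetch endpoint; the function names are hypothetical, and NP_786945.1 is used as the example ID because it is listed for NC_001959 in the gene2accession output quoted elsewhere in this thread.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(protein_id: str, rettype: str = "fasta") -> str:
    """Build an EFetch URL for one protein record (rettype 'fasta' or 'gp')."""
    query = urlencode({"db": "protein", "id": protein_id,
                       "rettype": rettype, "retmode": "text"})
    return f"{EFETCH}?{query}"


def fetch_protein(protein_id: str, rettype: str = "fasta") -> str:
    """Download the record as text. Needs network access."""
    with urlopen(efetch_url(protein_id, rettype)) as handle:
        return handle.read().decode("utf-8")


print(efetch_url("NP_786945.1"))
# To actually download: print(fetch_protein("NP_786945.1"))
```

For more than a handful of IDs, NCBI's usage guidelines (request rate limits, identifying your tool and email) apply, which is one more reason the local approach scales better.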

Peter

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Chris Larsen-2

Peter,

This is a good strategy when the gi is given. However, I failed to
mention that we are finding the example I gave is unusual (15%?):
most virus 'mature peptides' we will apply this analysis to do not in
fact have a gi number or unique identifier associated with them. There
are thousands of dengue virus files to be processed to give mature
proteins.

Should have mentioned this... Hence the problem: we can't look it up
because only the parent polyprotein has a gi. There's nothing to look
up /by/ in most cases. So we still have to build the set of proteins
that are cleaved out of every polyprotein, by local and high-throughput
methods, from the available information (sadly, kind of a runaround;
it should be in the GenBank entry).

Chris


Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Peter-329

On Tue, Oct 27, 2009 at 8:07 PM, Chris Larsen <clarsen@...> wrote:

>
> Peter,
>
> This is a good strategy when the gi is given. However I failed to mention
> that we are finding the example I gave is unusual (15%?)---most virus
> 'mature peptides' we will apply this analysis to do not in fact have a gi
> number or unique identifier associated with them. There are thousands of
> dengue virus files to be processed to give mature proteins.
>
> Should have mentioned this...Hence the problem--we cant look it up because
> only the parent polyprotein has a gi. Theres nothing to look up /by/ in most
> cases. So we still have to build a set of proteins that are cleaved out of
> every polyprotein, by local and high throughput methods, by building it out
> of the available information (sadly, kind of a run around-- it should be in
> the genbank entry).
>
> Chris

Ah. That's a shame. I did just take a few minutes to try out the
EFetch idea (using Biopython) and it does work beautifully for
this "nice" example virus which the NCBI have annotated.

I also note that in the example given, all the mature peptides
have nice and simple locations (in terms of their coordinates
for the nucleotides), no ribosomal slippages etc. This means
grabbing the relevant bits of the genome and translating them is
also pretty easy (option 2 in your original email).

Have you got a more typical entry you can point us at?

If there is nothing publicly available, I wouldn't mind you
emailing me one or two to look at off list (and if you don't mind,
they might make good examples for Bio* project unit tests
or examples).

Peter

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Chris Fields-5


On Oct 27, 2009, at 3:17 PM, Peter wrote:

> Ah. That's a shame. I did just take a few minutes to try out the
> EFetch idea (using Biopython) and it does work beautifully for
> this "nice" example virus which the NCBI have annotated.

Interesting thing about that example: if you follow the hyperlinks for  
the mat_peptide feature key, they relate back to the full protein  
sequence with from/to, not to the protein_id for the feature.  Example:

# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts

# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959

This record doesn't appear to contain any mapping information along  
those lines, which makes me think this is an autogenerated record  
using the Gene record, which does have those mappings:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970

> I also note that in the example given, all the mature peptides
> have nice and simple locations (in terms of their co-ordindates
> for the nucleotides), no ribosomal slippages etc. This means
> grabbing the relevant bits of the genome and translating it is
> also pretty easy (option 2 in your original email).
>
> Have you got a more typical entry you can point us at?
>
> If there is nothing publicly available, I wouldn't mind you
> emailing me one or two to look at off list (and if don't mind,
> they might make good examples for Bio* project unit tests
> or examples).
>
> Peter

I think one could use the full-length protein and run TFASTX (which  
allows frameshifts) against the nucleotide sequence.  The output will  
have the frameshifts designated with '/' or '\', so it would then be a  
matter of splitting the sequence based on the midline, then mapping  
those protein fragments back to the original sequence coordinates.  Is  
this along the lines of what you mean?
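
A rough sketch of the splitting step in Python, assuming the translated sequence line has already been pulled out of the alignment as one string with '/' or '\' at each frameshift. It does not parse real TFASTX output, and the function name and toy string are hypothetical.

```python
import re
from typing import List, Tuple


def split_on_frameshifts(translated: str) -> List[Tuple[int, str]]:
    """Split at '/' or '\\' markers.

    Returns (offset, fragment) pairs, where offset is the 0-based position
    of the fragment in the marker-free protein sequence.
    """
    fragments = []
    offset = 0
    for piece in re.split(r"[/\\]", translated):
        if piece:
            fragments.append((offset, piece))
        offset += len(piece)
    return fragments


# Toy string with one marker of each kind.
for offset, fragment in split_on_frameshifts("MSSK/HFKI\\LVNE"):
    print(offset, fragment)
```

Each offset can then be converted to nucleotide coordinates once the start of the alignment on the genome is known; that mapping is left out here because it depends on the output format of the aligner.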

chris

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Peter-329

On Tue, Oct 27, 2009 at 8:46 PM, Chris Fields <cjfields@...> wrote:

>>
>> Ah. That's a shame. I did just take a few minutes to try out the
>> EFetch idea (using Biopython) and it does work beautifully for
>> this "nice" example virus which the NCBI have annotated.
>
> Interesting thing about that example: if you follow the hyperlinks for the
> mat_peptide feature key, they relate back to the full protein sequence with
> from/to, not to the protein_id for the feature.  Example:
>
> # link from the first mat_peptide
> http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts
>
> # protein_id
> http://www.ncbi.nlm.nih.gov/protein/28416959

Right - the protein ID link is just based on the GI number, 28416959.
This link (or EFetch) gives you the (short) mature peptide.

> This record doesn't appear to contain any mapping information along those
> lines, which makes me think this is an autogenerated record using the Gene
> record, which does have those mappings:
>
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970

Are you suggesting one option is (if the mat_peptide annotation
is lacking a protein or GI number) to go online to the Gene
database using the gene ID of the precursor (parent) protein
to find the IDs of the mature (child) peptides?

Peter

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by bill-158

These mature proteins do have gi.
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

>grep NC_001959 gene2accession
11983   1491970 REVIEWED        -       -       NP_786945.1     28416959        NC_001959.2     106060735       4       5373    +       -
11983   1491970 REVIEWED        -       -       NP_786946.1     28416960        NC_001959.2     106060735       4       5373    +       -
11983   1491970 REVIEWED        -       -       NP_786947.1     28416961        NC_001959.2     106060735       4       5373    +       -
11983   1491970 REVIEWED        -       -       NP_786948.1     28416962        NC_001959.2     106060735       4       5373    +       -
11983   1491970 REVIEWED        -       -       NP_786949.1     28416963        NC_001959.2     106060735       4       5373    +       -
11983   1491970 REVIEWED        -       -       NP_786950.1     28416964        NC_001959.2     106060735       4       5373    +       -
11983   1491971 PROVISIONAL     -       -       NP_056822.1     9630806         NC_001959.2     106060735       6949    7587    +       -
11983   1491972 PROVISIONAL     -       -       NP_056821.2     106060736       NC_001959.2     106060735       5357    6949    +       -
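
A small Python sketch of doing the same lookup in a script rather than with grep. The column positions are an assumption read off the listing above (GeneID in column 2, protein accession and gi in columns 6 and 7, genomic accession in column 8, start and end in columns 10 and 11); the layout of the current file on the NCBI FTP site may differ, so check its header line first. The names ProteinRow and proteins_for_genome are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class ProteinRow(NamedTuple):
    gene_id: str
    status: str
    protein_acc: str
    protein_gi: str
    start: int   # copied as-is from the file
    end: int


def proteins_for_genome(lines: Iterable[str], genomic_acc: str) -> List[ProteinRow]:
    """Collect protein rows whose genomic accession (without version) matches."""
    rows = []
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 11:
            continue
        if fields[7].split(".")[0] != genomic_acc:
            continue
        rows.append(ProteinRow(fields[1], fields[2], fields[5], fields[6],
                               int(fields[9]), int(fields[10])))
    return rows


# Three rows copied from the listing above.
SAMPLE = """\
11983 1491970 REVIEWED - - NP_786945.1 28416959 NC_001959.2 106060735 4 5373 + -
11983 1491970 REVIEWED - - NP_786946.1 28416960 NC_001959.2 106060735 4 5373 + -
11983 1491971 PROVISIONAL - - NP_056822.1 9630806 NC_001959.2 106060735 6949 7587 + -
""".splitlines()

for row in proteins_for_genome(SAMPLE, "NC_001959"):
    print(row.gene_id, row.protein_acc, row.protein_gi, row.start, row.end)
```

Note that in the listing every mature peptide under GeneID 1491970 carries the same 4..5373 span, so this file supplies IDs for the peptides but not their individual coordinates.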

Bill

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Peter-329

On Tue, Oct 27, 2009 at 9:47 PM,  <bill@...> wrote:
> These mature proteins do have gi.

Yes, they are in the original GenBank file too. The
problem is Chris said he picked a poor example - this
one is well annotated, but in general his GenBank files
lack these cross references. Which is why I asked for a
more realistic example if he can share one ;)

Peter

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Chris Larsen-2

Peter, Chris,

Thank you muchly for your expert and well presented dialog.  Yes here  
is an actual and typical problem in generating protein seq from viral  
polyproteins, in the absence of mat_peptide Seq and unique ID:

For record:

http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank

This is the coronavirus:

LOCUS       DQ848678               29277 bp    RNA     linear   VRL 12-SEP-2006
DEFINITION  Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION   DQ848678
VERSION     DQ848678.1  GI:112253723

It contains a polyprotein 1ab where the ribosome stalls at 12358,
backs up one nucleotide, changes frame, and then continues:

     CDS             join(311..12358,12358..20391)
                     /ribosomal_slippage
                     /codon_start=1
                     /product="polyprotein 1ab"

It has multiple, gigantic polyproteins whose child peptides are what
we would actually love to focus our comparative bioinformatic scrutiny
on, but none of them has a mappable ID or sequence below the
polyprotein level, as can be seen:

     mat_peptide     15118..16914      <===
                     /product="nsp13"  <===
                     /note="helicase"  <== these are all we have to go on

We are given no ID below the polyprotein level and no protein
sequence, just positions. They are nucleotide positions at that, but we
are given the complete polyprotein sequence and so have the components
to do this on paper, but no code. In summary, we would like to feed
GenBank files to a method and get out FASTA protein files which have
some IDs and sequences.
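
The "on paper" arithmetic can be sketched in Python as follows: map the mat_peptide nucleotide positions through the CDS join to residue numbers in the polyprotein /translation. This assumes a forward-strand CDS and a mat_peptide given as a simple start..end range; the function names are hypothetical. The slipped nucleotide (12358) occurs in both join segments, so the sketch keeps whichever offset is codon-aligned.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based inclusive nucleotide range, forward strand


def cds_offsets(segments: List[Segment], pos: int) -> List[int]:
    """Every 0-based offset that genome position pos has in the spliced CDS."""
    offsets, consumed = [], 0
    for seg_start, seg_end in segments:
        if seg_start <= pos <= seg_end:
            offsets.append(consumed + pos - seg_start)
        consumed += seg_end - seg_start + 1
    return offsets


def peptide_range(segments: List[Segment], nt_start: int, nt_end: int,
                  codon_start: int = 1) -> Tuple[int, int]:
    """Return (first, last) residue numbers, 1-based, in the CDS translation."""
    shift = codon_start - 1
    starts = [o - shift for o in cds_offsets(segments, nt_start)
              if o >= shift and (o - shift) % 3 == 0]   # first base of a codon
    ends = [o - shift for o in cds_offsets(segments, nt_end)
            if o >= shift and (o - shift) % 3 == 2]     # last base of a codon
    if not starts or not ends:
        raise ValueError("mat_peptide is not codon-aligned with the parent CDS")
    return starts[0] // 3 + 1, ends[0] // 3 + 1


cds = [(311, 12358), (12358, 20391)]     # join(311..12358,12358..20391)
print(peptide_range(cds, 15118, 16914))  # (4937, 5535)
```

For nsp13 this gives residues 4937..5535 of polyprotein 1ab, i.e. 599 residues, which agrees with the 1797 nucleotides between 15118 and 16914. The peptide is then the slice translation[4936:5535] of the CDS /translation string.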

You guys are forcing me (thank god) to think critically and clearly  
about it too, so let me extend the proposed module or method as best I  
can:

Input offending virus GenBank file
For each mat_peptide in a CDS:
    get nuc coords, e.g. 15118..16914
    translate it to aa
    if slippage, then stop translating at the end of frame 1 and hold for later
    now translate frame 2
    $full_translation = $translation_part_1 . $translation_part_2
    compare $full_translation to the CDS translation, e.g. /translation="MSSKHFKILVNE..."
    if identical subsequence, then admit as the real, valid mat_peptide sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
Concatenate products to a FASTA amino acid file for the whole virus
isolate or CDS set for your virus
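
The steps above can be sketched in plain Python as below. Assumptions: forward strand only, the standard genetic code, and locations already parsed into (start, end) pairs. One swap from the steps as listed: the join segments are spliced at the nucleotide level and translated once, which gives the same result as translating the two parts separately and concatenating when each part is a whole number of codons. The genome string is a made-up toy, and all function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based inclusive nucleotide range, forward strand

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {a + b + c: AMINO[16 * i + 4 * j + k]
          for i, a in enumerate(BASES)
          for j, b in enumerate(BASES)
          for k, c in enumerate(BASES)}


def splice(genome: str, segments: List[Segment]) -> str:
    """Concatenate the join() segments; a slipped base is simply read twice."""
    return "".join(genome[start - 1:end] for start, end in segments)


def translate(nt: str) -> str:
    """Translate whole codons; unknown codons become 'X'."""
    nt = nt.upper().replace("U", "T")
    whole = len(nt) - len(nt) % 3
    return "".join(CODONS.get(nt[i:i + 3], "X") for i in range(0, whole, 3))


def mature_peptide(genome: str, segments: List[Segment], parent: str) -> str:
    """Translate a mat_peptide and accept it only if it occurs in the parent."""
    peptide = translate(splice(genome, segments)).rstrip("*")
    if peptide not in parent:
        raise ValueError("peptide is not a subsequence of the CDS translation")
    return peptide


def fasta(parent_id: str, product: str, peptide: str, width: int = 60) -> str:
    """Format one record with an ID of the form parentID_product."""
    lines = [peptide[i:i + width] for i in range(0, len(peptide), width)]
    return f">{parent_id}_{product}\n" + "\n".join(lines) + "\n"


# Toy genome: CDS join(3..11,11..19), base 11 is read twice (-1 slippage).
genome = "GGATGAAATTTGGCCATAACC"
parent = translate(splice(genome, [(3, 11), (11, 19)])).rstrip("*")
peptide = mature_peptide(genome, [(6, 11), (11, 16)], parent)
print(parent, peptide)                          # MKFWP KFWP
print(fasta("112253723", "helicase", peptide), end="")
```

The substring test is the weakest part of this sketch: a short peptide could match the parent at the wrong place, so a production version would also check the position, for example against residue numbers derived from the nucleotide coordinates.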

PHEW.

Chris L

==========
>
> I think one could use the full-length protein and run TFASTX (which  
> allows frameshifts) against the nucleotide sequence.  The output  
> will have the frameshifts designated with '/' or '\', so it would  
> then be a matter of splitting the sequence based on the midline,  
> then mapping those protein fragments back to the original sequence  
> coordinates.  Is this along the lines of what you mean?
>
> chris

Let me look into this, thank you CF. I have not used that in the past.

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by bill-158

It appears that acc/gi is not required for mat_peptide. Your solution
should work fine.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objmgr/util/create_defline.cpp#L870

Bill

Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

by Chris Larsen-2

All,

This is a short followup on the prior thread of discussion regarding
computing mature peptide sequences for viruses. The topic has gone
underwater for the time being as we solve some problems with source
data. The Biopython effort and contributors on this board have given
good guidance, and we now have scripts that function (thanks mostly to
pcock); however, the source data on which everything relies is suspect:

   mat_peptide 15118..16914 <===
                /product="nsp13"
                /note="helicase"

I can tell you the virus community does not want to rely heavily on
those position numbers. Furthermore, we have found fewer complete source
genomes for viruses than bacteria, and more virus-to-virus variation in
the data fields annotated in the GBK file (Gene, CDS, ORF, Protein,
Polyprotein, mat_peptide, db_xref). In fact, the community will have
to come together significantly on how these molecules are defined in
public repositories before a mature scripting effort becomes
reliable, public and well received. Because of the variation in
viruses, it's not even clear at this point what a 'gene' is. I will
let you know how we proceed when more sequence data has been fully
analyzed, and we can think about making any Perl-based solution a new
viral protein module.

Thanks,

Chris
