Polyproteins, ribo slippage, and mat_peptide in viruses?

All,

I am attempting to find some solutions to a DB loading problem we are encountering in viruses. It is multifold:

Some viruses churn out a polyprotein rather than individual peptides; further, they also slip the ribosome, so a source nucleotide is used more than once in translation (the ribosome halts, backs up one nucleotide, and continues in a new frame); and finally we have post-translational processing into mature peptides. The main thing is that the mature peptide is contained as a subset of the whole parent polyprotein, but its sequence is not provided in the GBK file for each mat_peptide in the CDS. We have to get that in order to run algorithms on the relevant processed proteins. Therefore we cannot directly load into GUS, but rather have to choose how to get the mat_peptide sequence. Actually I think the viruses know that, and are just messing with us out of spite, since we have iPods and they don't. Anyway... from anyone who has encountered this I seek guidance.

We have as choices:

1. Get the locations of mature peptide children in /Protein/
   carve the mat_peptide sequence out of the whole polyprotein translation
   check that the mat_peptide is in fact an identical subset of the translated protein
   load that

OR

2. Use the locations of starts and stops in /Nucleotide/
   translate that, using the slippage information
   get mature peptides that line up exactly to the parent polyprotein

If you know of BioPerl sequence handling support for this, I would love to hear more. Clearly this is a nonstandard thingamabob.

Stupid viruses

Chris

--
Christopher Larsen, Ph.D.
Sr. Scientist / Grants Manager
Vecna Technologies
6404 Ivy Lane #500
Greenbelt, MD 20770
Phone: (240) 965-4525
Fax: (240) 547-6133
240-737-4525
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l
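Option 1 above can be sketched in a few lines of Python. This is only an illustration, assuming the mature peptide location is given in protein coordinates (1-based, inclusive); the function names `carve_mat_peptide` and `is_identical_subset` are made up for this sketch and do not come from BioPerl or Biopython.

```python
def carve_mat_peptide(polyprotein: str, start: int, end: int) -> str:
    """Return residues start..end (1-based, inclusive) of the polyprotein."""
    if not (1 <= start <= end <= len(polyprotein)):
        raise ValueError(f"location {start}..{end} is outside the polyprotein")
    # Convert the 1-based inclusive range to a Python slice.
    return polyprotein[start - 1:end]


def is_identical_subset(polyprotein: str, peptide: str) -> bool:
    """Check that a non-empty peptide occurs verbatim inside the polyprotein."""
    return bool(peptide) and peptide in polyprotein


if __name__ == "__main__":
    # Toy sequence: the first residues of the /translation quoted in this thread.
    parent = "MSSKHFKILVNE"
    child = carve_mat_peptide(parent, 3, 6)
    print(child, is_identical_subset(parent, child))
```

The subset check is trivially true for a peptide carved this way; it earns its keep when the peptide comes from an independent source, such as a separate translation of the nucleotide coordinates.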
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

On Tue, Oct 27, 2009 at 4:33 PM, Chris Larsen <clarsen@...> wrote:
> I am attempting to find some solutions to a DB loading problem we are
> encountering in viruses.
> [...]
> If you know of BioPerl sequence handling support for this, I would love to
> hear more. Clearly this is a nonstandard thingamabob.
>
> Stupid viruses
>
> Chris

Cool viruses :)

Do you have some specific examples from GenBank? I'm starting to deal with virus annotation in my work, so this is of interest.

Peter

P.S. As you might guess from my email address, I'm actually more interested in Biopython than BioPerl, but the same algorithmic approach could be tested in either.
Re: EXAMPLE: Polyproteins, ribo slippage, and mat_peptide in viruses?

My bad... thanks everyone.

An example of a virus that does this is in order, so we don't have to do this in our heads. There's no slippage in this isolate per se; however, the main problem of generating component *.faa files from the polyprotein still exists.

Chris

> For instance, check this:
> http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
>
> You may recall some cruise where half the ship threw up. YAY
> norovirus. Let's take that as a useful problem to address solutions to.
>
> No mat_peptide sequence is given. How to get that?
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

Peter,

This is a good strategy when the gi is given. However, I failed to mention that we are finding the example I gave is unusual (15%?) -- most virus 'mature peptides' we will apply this analysis to do not in fact have a gi number or unique identifier associated with them. There are thousands of dengue virus files to be processed to give mature proteins.

Should have mentioned this... Hence the problem -- we can't look it up because only the parent polyprotein has a gi. There's nothing to look up /by/ in most cases. So we still have to build a set of proteins that are cleaved out of every polyprotein, by local and high-throughput methods, by building it out of the available information (sadly, kind of a runaround -- it should be in the GenBank entry).

Chris

On Oct 27, 2009, at 3:54 PM, Peter wrote:
> On Tue, Oct 27, 2009 at 7:15 PM, Chris Larsen <clarsen@...> wrote:
>>
>> Hello Peter!
>>
>> For instance, check this:
>> http://www.ncbi.nlm.nih.gov/nuccore/NC_001959
>> ...
>>
>> No mat_peptide sequence is given. We want that...
>
> Looking at the GenBank file displayed, the mat_peptide features
> (mature peptides) do not include a translation entry (like the parent
> CDS feature does). However, they do have protein IDs - which are
> actually links in the HTML version.
>
> This leads me to suggest a third option as an alternative to the two
> ideas you outlined. You could parse the GenBank file(s), and for each
> mat_peptide feature look up the protein ID via Entrez EFetch (e.g. as
> a FASTA file, or a GenPept file). If you only have a relatively small
> number of viruses and proteins this is probably going to be pretty
> easy. At least, I could do it in Biopython and I am sure the same is
> true with the BioPerl GenBank parser and their EFetch interface.
>
> However, for a large dataset, handling it all locally (your options
> (1) and (2)) sounds best.
>
> Peter
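The EFetch idea quoted above could look roughly like the following. This is a hedged sketch, not the script Peter ran: it uses only the Python standard library rather than Biopython or BioPerl, the helper names `mat_peptide_protein_ids` and `efetch_url` are invented for illustration, and the parser assumes the usual GenBank flat-file layout (feature key starting in column 6, qualifiers on the following indented lines). It only builds the request URLs; actually fetching them is left to the caller.

```python
import re
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def mat_peptide_protein_ids(genbank_text: str) -> list:
    """Collect /protein_id values that belong to mat_peptide features."""
    ids = []
    in_mat_peptide = False
    for line in genbank_text.splitlines():
        # A new feature starts with five spaces followed by the feature key.
        if line.startswith("     ") and len(line) > 5 and line[5] != " ":
            in_mat_peptide = line[5:21].strip() == "mat_peptide"
        elif in_mat_peptide:
            match = re.search(r'/protein_id="([^"]+)"', line)
            if match:
                ids.append(match.group(1))
    return ids


def efetch_url(protein_id: str, rettype: str = "fasta") -> str:
    """Build an EFetch URL that returns one protein record as plain text."""
    query = urlencode(
        {"db": "protein", "id": protein_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH}?{query}"
```

For a handful of records, each URL can be passed to `urllib.request.urlopen`. For thousands of files, NCBI's usage guidelines on request rate apply, and mat_peptide features without a /protein_id simply yield nothing to look up.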
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

On Tue, Oct 27, 2009 at 8:07 PM, Chris Larsen <clarsen@...> wrote:
> This is a good strategy when the gi is given. However I failed to mention
> that we are finding the example I gave is unusual (15%?)---most virus
> 'mature peptides' we will apply this analysis to do not in fact have a gi
> number or unique identifier associated with them.
> [...]

Ah. That's a shame. I did just take a few minutes to try out the EFetch idea (using Biopython) and it does work beautifully for this "nice" example virus which the NCBI have annotated.

I also note that in the example given, all the mature peptides have nice and simple locations (in terms of their coordinates for the nucleotides), no ribosomal slippages etc. This means grabbing the relevant bits of the genome and translating them is also pretty easy (option 2 in your original email).

Have you got a more typical entry you can point us at?

If there is nothing publicly available, I wouldn't mind you emailing me one or two to look at off list (and if you don't mind, they might make good examples for Bio* project unit tests or examples).

Peter
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

On Oct 27, 2009, at 3:17 PM, Peter wrote:
> Ah. That's a shame. I did just take a few minutes to try out the
> EFetch idea (using Biopython) and it does work beautifully for
> this "nice" example virus which the NCBI have annotated.

Interesting thing about that example: if you follow the hyperlinks for the mat_peptide feature key, they relate back to the full protein sequence with from/to, not to the protein_id for the feature. Example:

# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts

# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959

This record doesn't appear to contain any mapping information along those lines, which makes me think this is an autogenerated record using the Gene record, which does have those mappings:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970

> I also note that in the example given, all the mature peptides
> have nice and simple locations (in terms of their coordinates
> for the nucleotides), no ribosomal slippages etc. This means
> grabbing the relevant bits of the genome and translating it is
> also pretty easy (option 2 in your original email).
>
> Have you got a more typical entry you can point us at?
> [...]
>
> Peter

I think one could use the full-length protein and run TFASTX (which allows frameshifts) against the nucleotide sequence. The output will have the frameshifts designated with '/' or '\', so it would then be a matter of splitting the sequence based on the midline, then mapping those protein fragments back to the original sequence coordinates. Is this along the lines of what you mean?

chris
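The splitting step described for the TFASTX output might be sketched as follows. This is an assumption-laden illustration: it takes as given that the translated line of the alignment is already available as a single string, that '/' and '\' sit between residues as frameshift marks, and that '-' marks alignment gaps. The function name `split_at_frameshifts` is hypothetical, and parsing the actual TFASTX report is not attempted here.

```python
import re


def split_at_frameshifts(translated: str) -> list:
    """Split a translated alignment string at '/' and '\\' frameshift marks.

    Returns (offset, fragment) pairs, where offset is the 0-based index of the
    fragment's first residue in the ungapped protein.
    """
    fragments = []
    offset = 0
    for piece in re.split(r"[/\\]", translated):
        residues = piece.replace("-", "")  # drop alignment gap characters
        if residues:
            fragments.append((offset, residues))
        offset += len(residues)
    return fragments
```

Each offset can then be multiplied by three and added to the nucleotide start of the corresponding frame segment to map a fragment back to genome coordinates, with the one-base backup at the slippage site handled separately.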
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

On Tue, Oct 27, 2009 at 8:46 PM, Chris Fields <cjfields@...> wrote:
> Interesting thing about that example: if you follow the hyperlinks for the
> mat_peptide feature key, they relate back to the full protein sequence with
> from/to, not to the protein_id for the feature. Example:
>
> # link from the first mat_peptide
> http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts
>
> # protein_id
> http://www.ncbi.nlm.nih.gov/protein/28416959

Right - the protein ID link is just based on the GI number, 28416959. This link (or EFetch) gives you the (short) mature peptide.

> This record doesn't appear to contain any mapping information along those
> lines, which makes me think this is an autogenerated record using the Gene
> record, which does have those mappings:
>
> http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970

Are you suggesting one option is (if the mat_peptide annotation is lacking a protein or GI number) to go online to the Gene database using the gene ID of the precursor (parent) protein to find the IDs of the mature (child) peptides?

Peter
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

These mature proteins do have gi.

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

>grep NC_001959 gene2accession
11983  1491970  REVIEWED     -  -  NP_786945.1  28416959   NC_001959.2  106060735  4     5373  +  -
11983  1491970  REVIEWED     -  -  NP_786946.1  28416960   NC_001959.2  106060735  4     5373  +  -
11983  1491970  REVIEWED     -  -  NP_786947.1  28416961   NC_001959.2  106060735  4     5373  +  -
11983  1491970  REVIEWED     -  -  NP_786948.1  28416962   NC_001959.2  106060735  4     5373  +  -
11983  1491970  REVIEWED     -  -  NP_786949.1  28416963   NC_001959.2  106060735  4     5373  +  -
11983  1491970  REVIEWED     -  -  NP_786950.1  28416964   NC_001959.2  106060735  4     5373  +  -
11983  1491971  PROVISIONAL  -  -  NP_056822.1  9630806    NC_001959.2  106060735  6949  7587  +  -
11983  1491972  PROVISIONAL  -  -  NP_056821.2  106060736  NC_001959.2  106060735  5357  6949  +  -

Bill

> Are you suggesting one option is (if the mat_peptide annotation
> is lacking a protein or GI number) to go online to the Gene
> database using the gene ID of the precursor (parent) protein
> to find the IDs of the mature (child) peptides?
>
> Peter
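The grep above can be reproduced in Python so that the protein accessions and GI numbers end up in a structure a loader can use. A minimal sketch, assuming the column order seen in the output shown (tax ID, gene ID, status, RNA accession, RNA GI, protein accession, protein GI, genomic accession, genomic GI, start, end, orientation); the function name `proteins_for_genome` is invented for this example, and later releases of the file may carry extra columns.

```python
def proteins_for_genome(lines, genomic_accession: str) -> list:
    """Return protein records annotated on one genomic accession.

    `lines` is any iterable of gene2accession lines; the accession is matched
    with or without its version suffix (e.g. "NC_001959" or "NC_001959.2").
    """
    wanted = genomic_accession.split(".")[0]
    hits = []
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.split()
        if len(fields) < 11:
            continue  # not a data row
        if fields[7].split(".")[0] != wanted or fields[5] == "-":
            continue  # other genome, or no protein on this row
        hits.append(
            {
                "gene_id": fields[1],
                "protein_accession": fields[5],
                "protein_gi": fields[6],
                "start": int(fields[9]),
                "end": int(fields[10]),
            }
        )
    return hits
```

In practice the lines could be streamed with `gzip.open("gene2accession.gz", "rt")`, so the file never has to be fully decompressed on disk.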
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

On Tue, Oct 27, 2009 at 9:47 PM, <bill@...> wrote:
> These mature proteins do have gi.

Yes, they are in the original GenBank file too. The problem is Chris said he picked a poor example - this one is well annotated, but in general his GenBank files lack these cross references. Which is why I asked for a more realistic example if he can share one ;)

Peter
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

Peter, Chris,

Thank you muchly for your expert and well presented dialog. Yes, here is an actual and typical problem in generating protein seq from viral polyproteins, in the absence of mat_peptide seq and unique ID:

For record:

http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank

this is the coronavirus:

LOCUS       DQ848678               29277 bp    RNA     linear   VRL 12-SEP-2006
DEFINITION  Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION   DQ848678
VERSION     DQ848678.1  GI:112253723

Containing a polyprotein 1ab where the ribosome stalls at 12358, backs up one nuc, changes frame, and then continues:

     CDS             join(311..12358,12358..20391)
                     /ribosomal_slippage
                     /codon_start=1
                     /product="polyprotein 1ab"

and which has multiple, gigantic polyproteins, the child peptides towards whom we would actually love to focus our comparative bioinformatic scrutiny, but none of which have mappable IDs or seqs below the polyprotein level, as can be seen:

     mat_peptide     15118..16914          <===
                     /product="nsp13"      <===
                     /note="helicase"      <== these are all we have to go on

and where we are given no ID below the polyprotein level, no protein sequence... just positions. They are nuc positions at that, but we are given the complete polyprotein seq and have the components to do this on paper, but no code. In summary we would like to dump in GenBank files to a method, and get out FASTA protein files which have some IDs and seqs.

You guys are forcing me (thank god) to think critically and clearly about it too, so let me extend the proposed module or method as best I can:

Input offending virus GenBank file
For each mat_peptide in a CDS
    get nuc coords, e.g. 15118..16914
    translate it to aa
        if slippage then stop translating at end of frame 1 and hold for later
        now translate frame 2
        $full_translation = $translation_part_1 . $translation_part_2
    compare $full_translation to CDS translation, e.g. /translation="MSSKHFKILVNE..."
    if identical subsequence then admit as the real, valid mat_peptide sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
concatenate products to a FASTA amino acid file for whole virus isolate or CDS set for your virus

PHEW.

Chris L

==========
> I think one could use the full-length protein and run TFASTX (which
> allows frameshifts) against the nucleotide sequence. The output
> will have the frameshifts designated with '/' or '\', so it would
> then be a matter of splitting the sequence based on the midline,
> then mapping those protein fragments back to the original sequence
> coordinates. Is this along the lines of what you mean?
>
> chris

Let me look into this, thank you CF; I have not used that in the past.
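The pseudocode above translates fairly directly into Python. The following is a sketch under stated assumptions, not the scripts eventually written for this project: locations are passed in as lists of (start, end) pairs already parsed from the feature table (1-based, inclusive, forward strand, codon_start=1), the standard genetic code applies, and every function name here is made up for illustration. Expanding `join(311..12358,12358..20391)` position by position makes the slippage base appear twice, which is what keeps the second frame aligned.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def expand(parts):
    """Expand [(start, end), ...] into an ordered list of genome positions."""
    return [pos for start, end in parts for pos in range(start, end + 1)]


def translate(nt):
    """Translate in frame 1; unknown codons become 'X', stops become '*'."""
    nt = nt.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def mat_peptide_protein(genome, cds_parts, pep_parts):
    """Translate one mat_peptide and verify it against the parent CDS.

    Raises ValueError unless the peptide's positions form a contiguous,
    in-frame run of the CDS positions and the two translations agree.
    """
    cds_pos = expand(cds_parts)
    pep_pos = expand(pep_parts)
    polyprotein = translate("".join(genome[p - 1] for p in cds_pos))
    peptide = translate("".join(genome[p - 1] for p in pep_pos))
    n = len(pep_pos)
    for i, pos in enumerate(cds_pos):
        # The slippage base occurs twice, so try every candidate start.
        if pos == pep_pos[0] and cds_pos[i:i + n] == pep_pos:
            in_frame = i % 3 == 0 and n % 3 == 0
            if in_frame and polyprotein[i // 3:(i + n) // 3] == peptide:
                return peptide
    raise ValueError("mat_peptide is not an in-frame subsequence of the CDS")


def fasta_record(parent_id, product_name, peptide):
    """Name the record parentID_product, as proposed in the pseudocode."""
    return f">{parent_id}_{product_name}\n{peptide}\n"
```

For the record discussed above, the call would be along the lines of `mat_peptide_protein(genome, [(311, 12358), (12358, 20391)], [(15118, 16914)])`, with `genome` holding the full nucleotide sequence. A mature peptide that itself spans the slippage site needs its own two-part location, and the error raised on a mismatch is a reasonable place to log entries whose annotated positions cannot be trusted.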
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

It appears that acc/gi is not required for mat_peptide. Your solution should work fine.

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objmgr/util/create_defline.cpp#L870

Bill

> and where are given no ID below the polyprotein level, no protein
> sequence...just positions.
> [...]
> In summary we would like to dump in genbank
> files to a method, and get out fasta protein files which have some IDs
> and seqs.
> [...]
Re: Polyproteins, ribo slippage, and mat_peptide in viruses?

All,

This is a short followup on the prior thread of discussion, regarding computing mature peptide sequences for viruses. The topic has gone underwater for the time being as we solve some problems with source data.

The Biopython effort and contributors on this board have given good guidance, and we now have scripts that function (thanks mostly to pcock). However, the source data on which everything relies is suspect:

     mat_peptide     15118..16914          <===
                     /product="nsp13"
                     /note="helicase"

I can tell you the virus community does not want to rely heavily on those position numbers. Furthermore, we have found fewer complete source genomes for viruses than bacteria, and more virus-to-virus variation in the data fields annotated in the GBK file (Gene, CDS, ORF, Protein, Polyprotein, mat_peptide, db_xref). In fact the community will have to come together significantly on how these molecules are defined in public repositories before a mature scripting effort becomes reliable, public and well received. Because of the variation in viruses, it's not even clear at this point what a 'gene' is.

I will let you know how we proceed when more sequence data has been fully analyzed, and we can think about making any Perl-based solution a new viral protein module.

Thanks,

Chris