Next-gen modules

Dear all,

after several years of absence I am slowly coming back to Bioperl, and hope to contribute again to its development.

One area that I was thinking of starting from, since we are actively involved with it, is to improve BioPerl's support for next-gen sequencing data, tools, etc. Since I am sure I have missed out on a lot of recent developments, do let me know if/what is useful.

One example that comes to mind is that the conversion of various formats to/from FASTQ does not seem to be supported. Some code can be found within Li Heng's script: http://maq.sourceforge.net/fq_all2std.pl but it would be good if it could make its way into SeqIO? And similarly, potentially, for other next-gen sequence formats?

Similarly, there seems to be little in bioperl-run to support tools that have been developed in this area, such as Maq, BowTie, TopHat, etc?

Do let me know if there is a past thread on this, or other people actively developing, etc. so that I can find out what priorities are.

thanks and best regards to all (old friends and new),

Elia

---
Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Gower Street
WC1E 6BT
London
UK

Office (UCL): +44 207 679 6493
Office (ICMS): +44 0207 8822374

Mobile: +44 7597 566 194
Mobile (Italy): +39 338 8448801

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l
|
|
Re: Next-gen modules

Elia--

I say a definite +1; in fact, this sounds like it should be a Hot Topic (see http://www.bioperl.org/wiki/Category:Hot_Topics for some others you might have missed in your hiatus...). I will create a page that can be a central point for wish lists, discussion, etc.

There has been much discussion of late about FASTQ:

http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030187.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029970.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-April/029765.html

cheers from a newbie,
Mark
|
|
Re: Next-gen modules

[ and here's a new wikipage: http://www.bioperl.org/wiki/Nextgen_in_Bioperl ]
|
|
Re: Next-gen modules

On Wed, Jun 17, 2009 at 12:29 PM, Elia Stupka <e.stupka@...> wrote:
> One example that comes to mind is that the conversion of various formats
> to/from FASTQ does not seem to be supported. Some code can be found within
> Li Heng's script: http://maq.sourceforge.net/fq_all2std.pl but it would be
> good if it could make its way into SeqIO? And similarly, potentially, for
> other next-gen sequence formats?

If you do add FASTQ support to BioPerl's SeqIO (and I think that is a good idea), please could you follow the format names used by Biopython - as this time we got there first ;)

I'm asking this as Biopython's SeqIO tries to use the same format names as BioPerl's SeqIO and EMBOSS, see http://biopython.org/wiki/SeqIO

Specifically,

* "fastq" in Biopython means the original Sanger standard FASTQ files encoding PHRED qualities using an ASCII offset of 33.
* "fastq-solexa" in Biopython means the early Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64.
* "fastq-illumina" in Biopython will mean recent Solexa/Illumina style FASTQ files (from pipeline version 1.3+) which encode PHRED qualities using an ASCII offset of 64. This is in the Biopython repository, but hasn't been released yet - so the name "fastq-illumina" isn't set in stone yet.

For good quality reads, PHRED and Solexa scores are approximately equal, so the "fastq-solexa" and "fastq-illumina" variants are almost equivalent.

> Similarly, there seems to be little in bioperl-run to support tools that
> have been developed in this area, such as Maq, BowTie, TopHat, etc?
>
> Do let me know if there is a past thread on this, or other people actively
> developing, etc. so that I can find out what priorities are.

Have you seen these recent threads?:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029970.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030187.html

Regards,

Peter (at Biopython)
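To make the three variants listed above concrete, here is a minimal sketch in plain Python (it is not BioPerl or Biopython code). The dictionary keys reuse the format names from the message; the function names `solexa_to_phred` and `decode_quality` are made up for this illustration. The conversion uses the standard definitions of the two scales, PHRED Q = -10 log10(p) and Solexa Q = -10 log10(p / (1 - p)).

```python
import math

# Format names as given in the thread; each maps to
# (ASCII offset, score scale used in the file).
FASTQ_VARIANTS = {
    "fastq": (33, "phred"),           # original Sanger standard
    "fastq-solexa": (64, "solexa"),   # early Solexa/Illumina
    "fastq-illumina": (64, "phred"),  # Illumina pipeline 1.3+
}


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to the equivalent PHRED score (a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_quality(quality_string, variant):
    """Decode an encoded quality string into a list of PHRED scores.

    Solexa input is converted to PHRED, so those values are floats;
    the two PHRED-based variants return plain integers.
    """
    offset, scale = FASTQ_VARIANTS[variant]
    raw = [ord(ch) - offset for ch in quality_string]
    if scale == "solexa":
        return [solexa_to_phred(q) for q in raw]
    return raw
```

At a Solexa score of 40 the converted PHRED value differs from 40 by less than 0.001, while at a Solexa score of 0 it is about 3.01, which is why the two offset-64 variants only diverge noticeably for poor quality bases.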
|
|
Re: Next-gen modules

Dear Mark,

thanks a lot for the pointers.

With regards to FASTQ parsing:

- my understanding from reading past threads is that we work on a single format, i.e. FASTQ, and interpret the quality "flavours" as just quality conversions, right?
- However, I assume we would still want to support a simple way for the user to say format => 'fastq-solexa', using the nomenclature adopted in BioPython suggested by Peter, right?
- I also saw Heikki's "long essay", but did not yet compare it to Heng Li's code at http://maq.sourceforge.net/fq_all2std.pl. I guess we would hope they would produce identical outputs; that will be a good check.

Finally, I saw Tristan's reply to Heikki's thread, so what is the status quo? Is it moving forward?

cheers

Elia
|
|
Re: Next-gen modules

Elia,

As Mark indicated, we recently discussed the lack of support for next-gen on list, at least re: fastq. I may be hit with the same thing in a few months time myself, and I recall Jason and a few others also mentioning the same. Heikki wrote some code for Illumina FASTQ for SeqIO and related modules but I don't believe it has been committed to trunk yet, so maybe he can answer.

From prior discussions IIRC the issues were:

1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0, Illumina 1.3) from one another (so maybe some optional validation), and

2) having a way for the Seq object to either 'know' what format is contained, or we use phred score and convert back and forth from that (I think the latter makes more sense).

Peter's suggestions also are reasonable, though does biopython have a separate module for each of these variations? Our version (I believe) mainly varied the conversion within Bio::SeqIO::fastq itself based on the fastq variant passed in as a separate named argument.

As for the wrappers, we would most certainly welcome them!

chris
|
|
Re: Next-gen modules

On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields@...> wrote:
> From prior discussions IIRC the issues were:
>
> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0, Illumina
> 1.3) from one another (so maybe some optional validation), and

Following the python rule of thumb for being explicit, Biopython makes the user specify which FASTQ variant is being used. I don't think you can do anything else. Any attempted validation would have to be heuristic based on the ASCII characters found, and would risk false positive warnings.

> 2) having a way for the Seq object to either 'know' what format is
> contained, or we use phred score and convert back and forth from that (I
> think the latter makes more sense).

I think it could make sense for BioPerl to convert Solexa scores to/from PHRED scores on the fly (especially now that Illumina is abandoning the Solexa score system). Python style tries to avoid implicit conversions, so Biopython doesn't automatically do a conversion from Solexa to PHRED scores on parsing (but will on writing if the requested output format requires this).

> Peter's suggestions also are reasonable, though does biopython have a
> separate module for each of these variations? Our version (I believe)
> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
> fastq variant passed in as a separate named argument.

Biopython's SeqIO gives the three FASTQ variants their own unique names. This format name is a required argument for parsing/writing (we don't try and guess the file format from the data contents). Internally we have three separate FASTQ parsers/writers although they do share code.

Other issues to keep in mind:

(3) There should be no warning parsing files where the optional repeated title is missing on the "+" lines (as discussed earlier on the BioPerl list).

(4) When writing FASTQ files should BioPerl omit the optional repeated title on the "+" line? Biopython omits this as I understand this to be common practice, and it can make a big difference to file sizes - especially on short read data from Solexa/Illumina.

(5) Also test reading and writing files with an optional description (as well as an identifier) on the "@" (and "+") lines. See the NCBI SRA for examples, e.g.

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

(6) Test reading and writing files where the encoded quality string starts with a "@" or a "+" character, e.g. http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html

Peter
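Points (3) to (6) can be sketched as a small reader and writer in plain Python. This is an illustration only, not the Biopython or BioPerl implementation, and the names `parse_fastq` and `format_fastq` are made up here. The idea is to stop reading quality text once it is as long as the sequence, so a quality line that begins with "@" or "+" is never mistaken for the start of a new record.

```python
import io


def parse_fastq(handle):
    """Yield (title, sequence, quality) tuples from a FASTQ file handle."""
    line = handle.readline()
    while line:
        line = line.rstrip("\n")
        if not line:  # tolerate blank lines between records
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("expected '@' at start of record, got %r" % line)
        title = line[1:]  # identifier plus optional description

        # Sequence lines run until the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("record %r is truncated: no '+' line" % title)

        # The repeated title on the '+' line is optional; no warning if absent.
        repeated = line.rstrip("\n")[1:]
        if repeated and repeated != title:
            raise ValueError("'+' line title does not match '@' line title")
        sequence = "".join(seq_parts)

        # Read quality by length, so a leading '@' or '+' is just data.
        qual_parts, qual_len = [], 0
        while qual_len < len(sequence):
            line = handle.readline()
            if not line:
                raise ValueError("record %r is truncated: quality too short" % title)
            part = line.strip()
            qual_parts.append(part)
            qual_len += len(part)
        quality = "".join(qual_parts)
        if len(quality) != len(sequence):
            raise ValueError("record %r: quality longer than sequence" % title)

        yield title, sequence, quality
        line = handle.readline()


def format_fastq(title, sequence, quality):
    """Format one record, omitting the repeated title on the '+' line."""
    return "@%s\n%s\n+\n%s\n" % (title, sequence, quality)


if __name__ == "__main__":
    sample = "@r1 demo\nACGT\n+\n@+II\n"
    for title, seq, qual in parse_fastq(io.StringIO(sample)):
        print(format_fastq(title, seq, qual), end="")
```

Raising an error on a mismatched "+" title is one possible policy chosen for the sketch; a warning would be an equally reasonable choice.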
|
|
Re: Next-gen modules

Hello,

Regarding next-gen sequences and bioperl, in my experience another issue is bioperl speed. For example, if you want to trim bad quality bases at the ends of 1E6 Solexa reads using Bio::SeqIO::fastq and some methods in Bio::Seq::Quality, well, you've got to be patient (but maybe I missed some shortcuts...).

A pure perl solution will be between 100 and 1000x faster... Would it be possible to have an ultra-light quality object with a few simple methods for next-gen reads?

I can contribute some tests if that sounds like an important point.

-Tristan
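The kind of object-free trimming described here can be sketched on plain strings. The example below is Python rather than Perl, purely for illustration; the function name, the default threshold of 20 and the default offset of 33 are assumptions (Solexa-style files would pass offset=64). It makes no claim about the speed-up figures quoted above.

```python
def trim_low_quality_ends(sequence, quality, min_score=20, offset=33):
    """Trim bases scoring below min_score from both ends of a read.

    Works directly on the encoded quality string: because the encoding
    is a fixed ASCII offset, comparing characters is the same as
    comparing scores, so no per-base objects or score lists are built.
    """
    threshold = chr(min_score + offset)
    start, end = 0, len(quality)
    while start < end and quality[start] < threshold:
        start += 1
    while end > start and quality[end - 1] < threshold:
        end -= 1
    return sequence[start:end], quality[start:end]
```

A read whose bases are all below the threshold comes back as a pair of empty strings, which the caller can then drop.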
|
|
Re: Next-gen modules

On 17 Jun 2009, at 12:29, Elia Stupka wrote:
> Similarly, there seems to be little in bioperl-run to support tools
> that have been developed in this area, such as Maq, BowTie, TopHat,
> etc?

FYI I have a Bio::Tools::Run::Velvet wrapper [1] that I plan to submit in the not too distant future. (First it needs some "blah blah" replaced with actual documentation and a test suite.)

Cheers,

John

[1] http://www.ebi.ac.uk/~zerbino/velvet/

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
|
|
Re: Next-gen modules

On Wed, Jun 17, 2009 at 1:54 PM, Elia Stupka <e.stupka@...> wrote:
> With regards to FASTQ parsing:
>
> -my understanding by reading past threads is to work on a single format,
> i.e. FASTQ and to interpet the quality "flavours" as just quality
> conversions, right?
> -However, I assume we would still want to support a simple way for the user
> to say format => 'fastq-solexa' using the nomenclature adopted in BioPython
> suggested by Peter, right?

I think you will need a way for the user to say they have a Solexa, or an Illumina 1.3+, or an original Sanger standard FASTQ file.

From reading the http://bioperl.org/wiki/HOWTO:SeqIO wiki page, I assumed BioPerl's SeqIO just had formats (e.g. the "chadoxml" format and the variant "flybase_chadoxml" format). Does BioPerl's SeqIO format system have any concept of flavour that I am not aware of?

> -I also saw Heikki's "long essay", but did not yet compare to Heng Li's code
> at http://maq.sourceforge.net/fq_all2std.pl, I guess we would hope they
> would produce identical outputs, will be a good check.

Heng Li's code at http://maq.sourceforge.net/fq_all2std.pl is a useful guide (although it doesn't yet cope with the new Illumina 1.3+ variant), but I don't trust it 100%. See e.g.

http://lists.open-bio.org/pipermail/biopython/2009-June/005208.html
http://lists.open-bio.org/pipermail/biopython/2009-June/005209.html

Peter
|
|
Re: Next-gen modules

On Jun 17, 2009, at 8:25 AM, Peter wrote:
>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>> Illumina 1.3) from one another (so maybe some optional validation), and
>
> Following the python rule of thumb for being explicit, Biopython makes
> the user specify which FASTQ variant is being used. I don't think you
> can do anything else. Any attempted validation would have to be
> heuristic based on the ASCII characters found, and would risk false
> positive warnings.

Right; I'm thinking along the same lines. If anything the most we would allow is some level of validation, so if there were a degree of uncertainty about the format one could set a validation flag to check bounds during the parse and warn if they are exceeded.

>> 2) having a way for the Seq object to either 'know' what format is
>> contained, or we use phred score and convert back and forth from
>> that (I think the latter makes more sense).
>
> I think it could make sense for BioPerl to convert Solexa scores to/from
> PHRED scores on the fly (especially now that Illumina is abandoning
> the Solexa score system). Python style tries to avoid implicit
> conversions, so Biopython doesn't automatically do a conversion from
> Solexa to PHRED scores on parsing (but will on writing if the requested
> output format requires this).
>
>> Peter's suggestions also are reasonable, though does biopython have a
>> separate module for each of these variations? Our version (I believe)
>> mainly varied the conversion within Bio::SeqIO::fastq itself based
>> on the fastq variant passed in as a separate named argument.
>
> Biopython's SeqIO gives the three FASTQ variants their own unique
> names. This format name is a required argument for parsing/writing
> (we don't try and guess the file format from the data contents).
> Internally we have three separate FASTQ parsers/writers although they
> do share code.

We could easily do the same if others agree. Actually, if we specified that shorthand for a variant on a format would be designated as -format => 'format-variant', I think we could easily hack SeqIO to deal with that by splitting on '-' and passing everything to the constructor as (-format => 'format', -variant => 'variant'). Very little repeated code in this case, just an additional named parameter indicating the format variant (and the SeqIO class can do the type checking on that within the constructor).

> Other issues to keep in mind:
>
> (3) There should be no warning parsing files where the optional repeated
> title is missing on the "+" lines (as discussed earlier on the
> BioPerl list).

Agreed, though we'll have to check the current fastq parser to see if that's currently the case. I thought that was fixed but maybe not?

> (4) When writing FASTQ files should BioPerl omit the optional repeated
> title on the "+" line? Biopython omits this as I understand this to be
> common practice, and it can make a big difference to file sizes -
> especially on short read data from Solexa/Illumina.

Agreed, particularly if it's commonly encountered.

> (5) Also test reading and writing files with an optional description
> (as well as an identifier) on the "@" (and "+") lines. See the NCBI SRA
> for examples, e.g.
>
> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Should be easy enough to implement with a simple regex.

> (6) Test reading and writing files where the encoded quality string
> starts with a "@" or a "+" character, e.g.
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>
> Peter

Mark, getting all that? ;>

chris
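The 'format-variant' shorthand and the optional bounds check proposed in this message can be sketched as follows. This is plain Python for illustration, not SeqIO code: `split_format`, `raw_scores`, the `validate` flag and the variant names in `SCORE_BOUNDS` are all hypothetical, and the score ranges are assumptions derived from the printable ASCII range (up to character 126) and a Solexa minimum of -5.

```python
import warnings


def split_format(spec):
    """Split 'format-variant' into (format, variant); variant may be None."""
    fmt, _, variant = spec.partition("-")
    return fmt.lower(), (variant.lower() or None)


# Assumed (offset, lowest score, highest score) per variant name.
SCORE_BOUNDS = {
    "sanger": (33, 0, 93),
    "solexa": (64, -5, 62),
    "illumina": (64, 0, 62),
}


def raw_scores(quality_string, variant, validate=False):
    """Decode raw scores; if validate is set, warn when bounds are exceeded."""
    offset, low, high = SCORE_BOUNDS[variant]
    scores = [ord(ch) - offset for ch in quality_string]
    if validate and any(s < low or s > high for s in scores):
        warnings.warn(
            "quality scores outside %d..%d for variant %r; "
            "the file may use a different FASTQ variant" % (low, high, variant)
        )
    return scores
```

With this split, `split_format("fastq-solexa")` gives `("fastq", "solexa")` and a bare `"fastq"` gives `("fastq", None)`, leaving the choice of default variant to the caller. The check only warns, since characters that are valid in one variant can also be valid in another.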
|
|
Re: Next-gen modules

On Jun 17, 2009, at 8:27 AM, Tristan Lefebure wrote:
> Hello,
> Regarding next-gen sequences and bioperl, following my
> experience, another issue is bioperl speed. For example, if
> you want to trim bad quality bases at ends of 1E6 Solexa
> reads using Bio::SeqIO::fastq and some methods in
> Bio::Seq::Quality, well, you've got to be patient (but may
> be I missed some shortcuts...).

The key issues affecting speed in bioperl are contained object instantiation and inheritance (and between those two, the latter much more so as it plays a role with contained objects as well as the container).

http://www.bioperl.org/wiki/Why_BioPerl_is_slow

Moose/Perl6 roles/traits are one way around that issue, but we are a ways off from getting that running. I think to get that working decently would be a from-ground-up endeavor (see my past posts on biomoose/bioperl6).

> A pure perl solution will be between 100 to 1000x faster...
> Would it be possible to have an ultra-light quality object
> with few simple methods for next-gen reads?
>
> I can contribute some tests if that sounds like an important
> point.
>
> -Tristan

The quality objects themselves I don't think are that heavy; I think the main impediment is inheritance. One could get around that a bit by using a direct_new method to create a blessed hash directly, then reimplement methods to lazily create any objects contained on the fly.

chris
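The "create directly, build contained data lazily" idea can be illustrated with a small Python analogue. This is not BioPerl's direct_new and says nothing about Perl performance; the class name `LightRead` and its attributes are invented for the sketch. The object stores only the raw strings at construction time and decodes the scores the first time they are requested.

```python
class LightRead:
    """A minimal read record with no base class and no contained objects."""

    __slots__ = ("title", "sequence", "quality_string", "offset", "_scores")

    def __init__(self, title, sequence, quality_string, offset=33):
        # Construction only stores references; nothing is decoded yet.
        self.title = title
        self.sequence = sequence
        self.quality_string = quality_string
        self.offset = offset
        self._scores = None

    @property
    def scores(self):
        """Decode the quality string on first access, then reuse the list."""
        if self._scores is None:
            self._scores = [ord(ch) - self.offset for ch in self.quality_string]
        return self._scores
```

Reads that are filtered out before their scores are ever inspected then never pay the decoding cost.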
|
|
Re: Next-gen modules

I'm on the case! (but maybe not in realtime, today!)
> >> (5) Also test reading and writing files with an optional description (as >> well >> as an identifier) on the "@" (and "+") lines. See the NCBI SRA for examples, >> e.g. >> >> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 >> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC >> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 >> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC > > Should be easy enough to implement with a simple regex. > >> (6) Test reading and writing files where the encoded quality string starts >> with a "@" or a "+" character, e.g. >> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html >> >> Peter > > Mark, getting all that? ;> > > chris > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
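
For readers following along, the -format => 'format-variant' shorthand proposed above could look roughly like the sketch below. It is written in Python only to keep it short; it is not BioPerl or Biopython code. The names split_format and KNOWN_VARIANTS are hypothetical, and the variant list simply mirrors the three FASTQ flavours mentioned in the thread.

```python
# Hypothetical table of formats that accept a variant suffix (an assumption
# for illustration; a real SeqIO class would own this list).
KNOWN_VARIANTS = {"fastq": {"sanger", "solexa", "illumina"}}


def split_format(spec):
    """Split 'format-variant' into (format, variant); variant may be None."""
    # partition() splits on the first '-' only, so 'fastq-illumina'
    # becomes ('fastq', '-', 'illumina') and 'fastq' becomes ('fastq', '', '').
    fmt, _, variant = spec.lower().partition("-")
    variant = variant or None
    allowed = KNOWN_VARIANTS.get(fmt)
    # The "type checking within the constructor" step: reject variants the
    # format does not declare.
    if variant is not None and (allowed is None or variant not in allowed):
        raise ValueError("unknown variant %r for format %r" % (variant, fmt))
    return fmt, variant
```

The two returned values correspond to the separate -format and -variant arguments that would be handed to the constructor; whether a missing variant should default to one flavour or be rejected is left open here, as it is in the thread.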
|
|
Re: Next-gen modules
I would suggest developing the "standard" version first, then moving
onto potential optimizations.

When we went through a similar argument in Ensembl about 8 years ago we
ended up dropping Bio::Root completely...

If one is truly after performance for these large next-gen projects,
it'd be down to pure piping, shell, and worrying about location and
copying of files, sticking to systems-level as much as possible, and
quite far from Bioperl altogether, so I think it's a whole different
level of optimization issues, probably outside the scope of Bioperl.

Elia

On 17 Jun 2009, at 18:09, Chris Fields wrote:

> On Jun 17, 2009, at 8:27 AM, Tristan Lefebure wrote:
>
>> Hello,
>> Regarding next-gen sequences and bioperl, following my
>> experience, another issue is bioperl speed. For example, if
>> you want to trim bad quality bases at ends of 1E6 Solexa
>> reads using Bio::SeqIO::fastq and some methods in
>> Bio::Seq::Quality, well, you've got to be patient (but may
>> be I missed some shortcuts...).
>
> The key issues affecting speed in bioperl are contained object
> instantiation and inheritance (and between those two, the latter
> much more so as it plays a role with contained objects as well as
> the container).
>
> http://www.bioperl.org/wiki/Why_BioPerl_is_slow
>
> Moose/Perl6 roles/traits are one way around that issue, but we are a
> ways off from getting that running. I think to get that working
> decently would be a from-ground-up endeavor (see my past posts on
> biomoose/bioperl6).
>
>> A pure perl solution will be between 100 to 1000x faster...
>> Would it be possible to have an ultra-light quality object
>> with few simple methods for next-gen reads?
>>
>> I can contribute some tests if that sounds like an important
>> point.
>>
>> -Tristan
>
> The quality objects themselves I don't think are that heavy; I think
> the main impediment is inheritance. One could get around that a bit
> by using a direct_new method to create a blessed hash directly, then
> reimplement methods to lazily create any objects contained on the fly.
>
> chris
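
The "create the container cheaply, build contained objects lazily" idea quoted above can be illustrated with a small sketch. This is Python rather than Perl and does not reproduce direct_new or any BioPerl class; LightRead and its attributes are hypothetical names, and the default offset of 33 assumes Sanger-style encoding.

```python
class LightRead:
    """A flat, no-inheritance read that decodes qualities only on demand."""

    # __slots__ keeps instances small and avoids a per-object dict.
    __slots__ = ("read_id", "seq", "_qual_raw", "_offset", "_scores")

    def __init__(self, read_id, seq, qual_raw, offset=33):
        self.read_id = read_id
        self.seq = seq
        self._qual_raw = qual_raw   # stored as the raw string from the file
        self._offset = offset       # ASCII offset (33 assumed by default)
        self._scores = None         # nothing is decoded at construction time

    @property
    def scores(self):
        # Decode once, on first access, and cache the result.
        if self._scores is None:
            self._scores = [ord(ch) - self._offset for ch in self._qual_raw]
        return self._scores
```

A script that only filters on identifiers or sequence length never pays for the quality decoding, which is the point of the lazy approach.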
|
|
Re: Next-gen modules
I think this is a top priority for a fall BioPerl release, maybe 1.6.2
(I am planning on a summer 1.6.1 release still). Made it into a bug
report for tracking:

http://bugzilla.open-bio.org/show_bug.cgi?id=2857

If no one works on this I may take it up after the 1.6.1 release.

chris

On Jun 17, 2009, at 12:13 PM, Mark A. Jensen wrote:

> I'm on the case! (but maybe not in realtime, today!)
|
|
Re: Next-gen modules
If we reach a consensus on how/who/what, I will be happy to contribute
some coding time in the coming days. Would it be a good starting point
to add the different formats as named in Biopython, and to test support
for reading/writing them? I could start playing with that.

regards,

Elia

On 17 Jun 2009, at 18:52, Chris Fields wrote:

> I think this is a top priority for a fall BioPerl release, maybe
> 1.6.2 (I am planning on a summer 1.6.1 release still). Made it into
> a bug report for tracking:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2857
>
> If no one works on this I may take it up after the 1.6.1 release.
>
> chris
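
As a rough illustration of what supporting the formats "as named in Biopython" involves, the sketch below maps Biopython's three format names to their ASCII offset and score type, and converts Solexa scores to PHRED with the standard formula Q_phred = 10 * log10(10^(Q_solexa/10) + 1). It is a Python sketch, not Biopython's or BioPerl's implementation; FASTQ_VARIANTS, decode_quality and to_sanger are hypothetical names, and rounding to the nearest integer is an assumption.

```python
import math

# Biopython's format names mapped to (ASCII offset, kind of score stored).
# Sanger: PHRED + 33; Solexa / Illumina 1.0: Solexa + 64; Illumina 1.3: PHRED + 64.
FASTQ_VARIANTS = {
    "fastq-sanger": (33, "phred"),
    "fastq-solexa": (64, "solexa"),
    "fastq-illumina": (64, "phred"),
}


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to the equivalent (fractional) PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_quality(qual, variant):
    """Return integer PHRED scores for a quality string of the given variant."""
    offset, kind = FASTQ_VARIANTS[variant]
    scores = [ord(ch) - offset for ch in qual]
    if kind == "solexa":
        # Rounding to the nearest integer is an assumption of this sketch.
        scores = [round(solexa_to_phred(q)) for q in scores]
    return scores


def to_sanger(qual, variant):
    """Re-encode a quality string as Sanger FASTQ (PHRED scores, offset 33)."""
    return "".join(chr(q + 33) for q in decode_quality(qual, variant))
```

Holding PHRED scores internally and converting only at the read/write boundary is the option favoured earlier in the thread; a round-trip test per variant would be a natural first test case.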
|
|
Re: Next-gen modules
Thanks both for the light.

That probably means that the place bioperl will take in the handling of
the next-gen sequencing raw data (i.e. reads) is very limited, no? (at
least until bioperl6). A single GA2 Solexa lane generates about 9
million reads, and I would really not call that a big project...

BTW, is there a simple way to see object instantiation and inheritance,
as well as the time consumed by each, when one calls next_seq() (or any
other method)?

-Tristan

On Wednesday 17 June 2009 13:49:38 Elia Stupka wrote:

> I would suggest developing the "standard" version first,
> then moving onto potential optimizations.
>
> When we went through a similar argument in Ensembl about
> 8 years ago we ended up dropping Bio::Root completely...
>
> If one is truly after performance for these large
> next-gen projects, it'd be down to pure piping, shell,
> and worrying about location and copying of files,
> sticking to systems-level as much as possible, and quite
> far from Bioperl altogether, so I think it's a whole
> different level of optimization issues, probably outside
> the scope of Bioperl.
>
> Elia
|
|
Re: Next-gen modules
Tristan Lefebure wrote:

> Hello,
> Regarding next-gen sequences and bioperl, following my
> experience, another issue is bioperl speed. For example, if
> you want to trim bad quality bases at ends of 1E6 Solexa
> reads using Bio::SeqIO::fastq and some methods in
> Bio::Seq::Quality, well, you've got to be patient (but may
> be I missed some shortcuts...).

This is my concern as well. Or, rather, is there actually a significant
set of users out there who are dealing with next-gen sequencing and
would consider using BioPerl for their work? I'm working with all the
1000-genomes data at the Sanger, and we at least are probably never
going to use BioPerl for the work.

> A pure perl solution will be between 100 to 1000x faster...
> Would it be possible to have an ultra-light quality object
> with few simple methods for next-gen reads?

The fastq parser itself already seems pretty fast. The way to get the
speedup is to not create any Bio::Seq* objects but just return the data
directly. At that point it's not taking much advantage of BioPerl. But
certainly it could be done...
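
A minimal sketch of "just return the data directly", applied to the end-trimming task mentioned in the quoted text: the reader yields plain tuples instead of sequence objects, and trimming works on the raw quality string. This is illustrative Python, not the Bio::SeqIO::fastq parser. It assumes strict four-line records (no wrapped sequence or quality lines), which is also what lets it cope with quality strings that begin with "@" or "+"; read_fastq, trim_ends, SAMPLE and the default offset of 33 are assumptions of the sketch.

```python
import io


def read_fastq(handle):
    """Yield (title, seq, qual) tuples from four-line FASTQ records."""
    while True:
        title = handle.readline()
        if not title:
            return                      # end of file
        if not title.strip():
            continue                    # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % title)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % title)
        yield title[1:].rstrip("\r\n"), seq, qual


def trim_ends(seq, qual, min_q=20, offset=33):
    """Strip bases below min_q from both ends; interior bases are kept."""
    ok = [ord(ch) - offset >= min_q for ch in qual]
    start = 0
    while start < len(ok) and not ok[start]:
        start += 1
    end = len(ok)
    while end > start and not ok[end - 1]:
        end -= 1
    return seq[start:end], qual[start:end]


# Two toy records; the first quality line deliberately starts with "@".
SAMPLE = "@r1 some description\nACGTAC\n+\n@5IIII\n@r2\nGGCCTT\n+r2\nIII5+5\n"

for title, seq, qual in read_fastq(io.StringIO(SAMPLE)):
    print(title, *trim_ends(seq, qual, min_q=30))
```

No object is created per read beyond the tuple itself, so the per-record cost is dominated by I/O and string slicing.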
|
|
Re: Next-gen modules
We are using bioperl for simple pre- and post-processing of data for
full Solexa runs, and although it might not be ideal, the scripting
with Bioperl is not a major killer. When I was referring to large, heavy
pipelines I was thinking of pipelines that deal with many Solexa runs
as one project (e.g. 1000 genomes) and really cannot afford any
bottleneck, because that directly affects their storage.

cheers

Elia

On 17 Jun 2009, at 19:09, Tristan Lefebure wrote:

> Thanks both for the light.
>
> That probably means that the place bioperl will take in the
> handling of the next-gen sequencing raw data (i.e. reads) is
> very limited, nope? (at least until bioperl6). A single GA2
> solexa lane generates about 9 million reads, and I would
> really not called that a big project...
>
> BTW, is there a simple way to see object instantiation and
> inheritance, as well as time consumption for each, when once
> calls next_seq() (or any other method)?
>
> -Tristan
|
|
Re: Next-gen modules
On Jun 17, 2009, at 1:09 PM, Tristan Lefebure wrote:

> Thanks both for the light.
>
> That probably means that the place bioperl will take in the
> handling of the next-gen sequencing raw data (i.e. reads) is
> very limited, nope? (at least until bioperl6). A single GA2
> solexa lane generates about 9 million reads, and I would
> really not called that a big project...

I don't think it's impossible. If you parse any very long list of
sequences in order it will be very slow, yes, but if they were indexed
or loaded into a DB, lookups would of course be orders of magnitude
faster. We already have perl-based indexing for fastq
(Bio::Index::Fastq), so maybe something could be built on top of that.
I haven't looked, but we could also wrap other C/C++-based parsers.
BioLib, for instance, has bindings to io_lib, so maybe that could be
(ab)used in some way.

> BTW, is there a simple way to see object instantiation and
> inheritance, as well as time consumption for each, when once
> calls next_seq() (or any other method)?
>
> -Tristan

As a simple benchmark, at one point all feature tag information was
converted into Bio::Annotations. I reverted that behavior to be simple
tag/value again and had a pretty decent bump in speed:

http://www.bioperl.org/wiki/Feature_Annotation_rollback#Simple_Benchmark

Also, I tried reimplementing some parsers as generic 'event'-based
driver/handler and they were slightly faster, the key roadblock being
instantiation again. If I didn't create Features/Annotations I saw a
significant speedup. That's not entirely unexpected, as SeqFeatures
also contain Locations (which in turn can contain subLocations) and
(until recently) tag-based Bio::Annotation by default. Annotations are
collected in an Annotation::Collection and can contain other objects, I
believe (Ontology terms, etc).

The overall lesson is, if you don't have very heavy objects being
created the overhead is actually quite small; it's only when you
greedily instantiate everything that you run into problems.

chris
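
To make the indexing point concrete, here is a toy byte-offset index over a FASTQ file: one pass records where each record starts, after which any read can be fetched with a single seek instead of a full scan. It is a Python sketch and does not show how Bio::Index::Fastq works internally; build_index and fetch are hypothetical names, and four-line records are assumed.

```python
import io


def build_index(handle):
    """Map each read identifier to the byte offset of its record.

    The handle must be opened in binary mode so tell()/seek() are exact.
    """
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break
        if not title.strip():
            continue                    # skip blank lines
        # The identifier is the first whitespace-delimited token after "@".
        read_id = title[1:].split(None, 1)[0].decode("ascii")
        index[read_id] = offset
        for _ in range(3):              # skip sequence, "+" and quality lines
            handle.readline()
    return index


def fetch(handle, index, read_id):
    """Return (title, seq, qual) for one read by seeking to its record."""
    handle.seek(index[read_id])
    lines = [handle.readline().rstrip(b"\r\n").decode("ascii") for _ in range(4)]
    title, seq, _plus, qual = lines
    return title[1:], seq, qual


handle = io.BytesIO(b"@r1 first\nACGT\n+\nIIII\n@r2\nGGCC\n+\n@II5\n")
index = build_index(handle)
print(fetch(handle, index, "r2"))   # ('r2', 'GGCC', '@II5')
```

For millions of reads the in-memory dictionary would normally be replaced by an on-disk store, but the access pattern stays the same.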