Next-gen modules

Next-gen modules

by Elia Stupka

Dear all,

after several years of absence I am slowly coming back to Bioperl, and  
hope to contribute again to its development.

One area that I was thinking of starting from, since we are actively
involved with it, is to improve BioPerl's support for next-gen
sequencing data, tools, etc. Since I am sure I have missed out on a
lot of recent developments, do let me know if/what is useful.

One example that comes to mind is that the conversion of various
formats to/from FASTQ does not seem to be supported. Some code can be
found within Li Heng's script: http://maq.sourceforge.net/fq_all2std.pl
but it would be good if it could make its way into
SeqIO? And similarly, potentially, for other next-gen sequence formats?

Similarly, there seems to be little in bioperl-run to support tools  
that have been developed in this area, such as Maq, BowTie, TopHat, etc?

Do let me know if there is a past thread on this, or other people  
actively developing, etc. so that I can find out what priorities are.

thanks and best regards to all (old friends and new),

Elia

---
Senior Lecturer, Bioinformatics
UCL Cancer Institute
Paul O' Gorman Building
University College London
Gower Street
WC1E 6BT
London
UK

Office (UCL): +44 207 679 6493
Office (ICMS): +44 0207 8822374

Mobile: +44 7597 566 194
Mobile (Italy): +39 338 8448801

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Mark A. Jensen

Elia--
I say a definite +1; in fact, this sounds like it should be a Hot Topic
(see http://www.bioperl.org/wiki/Category:Hot_Topics for some others
you might have missed in your hiatus...). I will create a page that
can be a central point for wish lists, discussion, etc.

There has been much discussion of late about FASTQ
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030187.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029970.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-April/029765.html

cheers from a newbie,
Mark


Re: Next-gen modules

by Mark A. Jensen

[ and here's a new wikipage: http://www.bioperl.org/wiki/Nextgen_in_Bioperl ]

Re: Next-gen modules

by Peter-329

On Wed, Jun 17, 2009 at 12:29 PM, Elia Stupka<e.stupka@...> wrote:

>
> Dear all,
>
> after several years of absence I am slowly coming back to Bioperl, and hope
> to contribute again to its development.
>
> One area that I was thinking of starting from, since we are actively
> involved with it, is to improve BioPerl's support for next-gen sequencing
> data, tools, etc. Since I am sure I have missed out on a lot of recent
> developments, do let me know if/what is useful.
>
> One example that comes to mind is that the conversion of various formats
> to/from FASTQ does not seem to be supported. Some code can be found within
> Li Heng's script: http://maq.sourceforge.net/fq_all2std.pl but it would be
> good if it could make its way into SeqIO? And similarly, potentially, for
> other next-gen sequence formats?

If you do add FASTQ support to BioPerl's SeqIO (and I think that is a
good idea), please could you follow the format names used by Biopython
- as this time we got there first ;)

I'm asking this as Biopython's SeqIO tries to use the same format
names as BioPerl's SeqIO and EMBOSS, see
http://biopython.org/wiki/SeqIO

Specifically,
* "fastq" in Biopython means the original Sanger standard FASTQ files
encoding PHRED qualities using an ASCII offset of 33.
* "fastq-solexa" in Biopython means the early Solexa/Illumina style
FASTQ files which encode Solexa qualities using an ASCII offset of 64.
* "fastq-illumina" in Biopython will mean recent Solexa/Illumina style
FASTQ files (from pipeline version 1.3+) which encode PHRED qualities
using an ASCII offset of 64. This is in the Biopython repository, but
hasn't been released yet - so the name "fastq-illumina" isn't set in
stone yet.

For good quality reads, PHRED and Solexa scores are approximately
equal, so the "fastq-solexa" and "fastq-illumina" variants are almost
equivalent.
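
As an illustration of the three encodings listed above, here is a minimal
Python sketch. The table and function names are hypothetical (they are not
Biopython or BioPerl API); only the offsets and score types come from the
list above, and the conversion is the standard Solexa-to-PHRED formula
Q_phred = 10 * log10(10^(Q_solexa/10) + 1).

```python
import math

# ASCII offset and score type for each variant name used in this thread.
# This table is illustrative only, not how Biopython stores it internally.
VARIANTS = {
    "fastq": (33, "phred"),            # original Sanger standard
    "fastq-solexa": (64, "solexa"),    # early Solexa/Illumina
    "fastq-illumina": (64, "phred"),   # Illumina pipeline 1.3+
}

def decode_quality(quality_string, variant):
    """Return the raw integer scores encoded in a FASTQ quality line."""
    offset, _score_type = VARIANTS[variant]
    return [ord(ch) - offset for ch in quality_string]

def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent (fractional) PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)

print(decode_quality("II9", "fastq"))         # [40, 40, 24]
print(decode_quality("hhX", "fastq-solexa"))  # [40, 40, 24]

# For good reads the two scales nearly coincide; they diverge for poor ones.
print(solexa_to_phred(40))   # very close to 40
print(solexa_to_phred(-5))   # roughly 1.2, far from -5
```

The last two lines show why the "fastq-solexa" and "fastq-illumina" variants
are almost, but not exactly, interchangeable: the gap only matters at the
low-quality end of the scale.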

> Similarly, there seems to be little in bioperl-run to support tools that
> have been developed in this area, such as Maq, BowTie, TopHat, etc?
>
> Do let me know if there is a past thread on this, or other people actively
> developing, etc. so that I can find out what priorities are.

Have you seen these recent threads?:
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029970.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030187.html

Regards,

Peter (at Biopython)

Re: Next-gen modules

by Elia Stupka

Dear Mark,

thanks a lot for the pointers.

With regards to FASTQ parsing:

-my understanding from reading past threads is that we should work on a
single format, i.e. FASTQ, and interpret the quality "flavours" as just
quality conversions, right?

-However, I assume we would still want to support a simple way for the  
user to say format => 'fastq-solexa' using the nomenclature adopted in  
BioPython suggested by Peter, right?

-I also saw Heikki's "long essay", but have not yet compared it to Heng
Li's code at http://maq.sourceforge.net/fq_all2std.pl. I guess we
would hope they produce identical outputs, which will be a good check.

Finally, I saw Tristan's reply to Heikki's thread, so what is the  
status quo? Is it moving forward?

cheers

Elia




Re: Next-gen modules

by Chris Fields-5

Elia,

As Mark indicated, we recently discussed the lack of support for
next-gen on list, at least re: fastq.  I may be hit with the same thing in
a few months time myself, and I recall Jason and a few others also
mentioning the same.  Heikki wrote some code for Illumina FASTQ for
SeqIO and related modules but I don't believe it has been committed to
trunk yet, so maybe he can answer.

From prior discussions IIRC the issues were:

1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,  
Illumina 1.3) from one another (so maybe some optional validation), and
2) having a way for the Seq object to either 'know' what format is  
contained, or we use phred score and convert back and forth from that  
(I think the latter makes more sense).

Peter's suggestions also are reasonable, though does biopython have a  
separate module for each of these variations?  Our version (I believe)  
mainly varied the conversion within Bio::SeqIO::fastq itself based on  
the fastq variant passed in as a separate named argument.

As for the wrappers, we would most certainly welcome them!

chris


Re: Next-gen modules

by Peter-329

On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields<cjfields@...> wrote:

>
> Elia,
>
> As Mark indicated, we recently discussed the lack of support for next-gen on
> list, at least re: fastq.  I may be hit with the same thing in a few months
> time myself, and I recall Jason and a few others also mentioning the same.
>  Heikki wrote some code for Illumina FASTQ for SeqIO and related modules but
> I don't believe it has been committed to trunk yet, so maybe he can answer.
>
> From prior discussions IIRC the issues were:
>
> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0, Illumina
> 1.3) from one another (so maybe some optional validation), and

Following the python rule of thumb for being explicit, Biopython makes
the user specify which FASTQ variant is being used. I don't think you
can do anything else. Any attempted validation would have to be
heuristic based on the ASCII characters found, and would risk false
positive warnings.

> 2) having a way for the Seq object to either 'know' what format is
> contained, or we use phred score and convert back and forth from that (I
> think the latter makes more sense).

I think it could make sense for BioPerl to convert Solexa scores to/from
PHRED scores on the fly (especially now that Illumina is abandoning
the Solexa score system). Python style tries to avoid implicit conversions,
so Biopython doesn't automatically do a conversion from Solexa to
PHRED scores on parsing (but will on writing if the requested output
format requires this).

> Peter's suggestions also are reasonable, though does biopython have a
> separate module for each of these variations?  Our version (I believe)
> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
> fastq variant passed in as a separate named argument.

Biopython's SeqIO gives the three FASTQ variants their own unique
names. This format name is a required argument for parsing/writing
(we don't try and guess the file format from the data contents). Internally
we have three separate FASTQ parsers/writers although they do share
code.

Other issues to keep in mind:

(3) There should be no warning parsing files where the optional repeated
title is missing on the "+" lines (as discussed earlier on the BioPerl list).

(4) When writing FASTQ files should BioPerl omit the optional repeated
title on the "+" line? Biopython omits this as I understand this to be
common practice, and it can make a big difference to file sizes - especially
on short read data from Solexa/Illumina.

(5) Also test reading and writing files with an optional description (as well
as an identifier) on the "@" (and "+") lines. See the NCBI SRA for examples,
e.g.

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC


(6) Test reading and writing files where the encoded quality string starts
with a "@" or a "+" character, e.g.
http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
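
Points (3), (5) and (6) can be exercised with a small parser. The sketch
below is Python for illustration only; parse_fastq is a hypothetical helper,
not the BioPerl or Biopython parser. It assumes every record is exactly four
lines (no wrapped sequence or quality lines), which is what makes a quality
line starting with "@" or "+" harmless. The second test record is made up.

```python
def parse_fastq(lines):
    """Yield (identifier, description, sequence, quality) tuples.

    Assumes strict four-line records, so lines are classified by position
    rather than by their first character.
    """
    it = iter(lines)
    for title in it:
        title = title.rstrip("\r\n")
        if not title:
            continue  # tolerate blank lines between records
        rest = [next(it, None) for _ in range(3)]
        if None in rest:
            raise ValueError("truncated FASTQ record: %r" % title)
        seq, plus, qual = (line.rstrip("\r\n") for line in rest)
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % title)
        # Point (3): the repeated title is optional; only check it if present.
        if len(plus) > 1 and plus[1:] != title[1:]:
            raise ValueError("'+' line does not match title: %r" % title)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: %r" % title)
        # Point (5): identifier first, then an optional description.
        identifier, _, description = title[1:].partition(" ")
        yield identifier, description, seq, qual

example = "\n".join([
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "I" * 30 + "9IG9IC",
    "@read2",   # hypothetical record: bare "+" line, quality starts with "@"
    "ACGT",
    "+",
    "@+II",
])

records = list(parse_fastq(example.splitlines()))
for identifier, description, seq, qual in records:
    print(identifier, repr(description), len(seq), qual[:4])
```

Files with wrapped sequence or quality lines would need a parser that counts
quality characters against the sequence length instead of counting lines.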

Peter


Re: Next-gen modules

by Tristan Lefebure-2

Hello,
Regarding next-gen sequences and bioperl, in my
experience another issue is bioperl speed. For example, if
you want to trim bad quality bases at the ends of 1E6 Solexa
reads using Bio::SeqIO::fastq and some methods in
Bio::Seq::Quality, well, you've got to be patient (but
maybe I missed some shortcuts...).

A pure Perl solution will be between 100 and 1000x faster...
Would it be possible to have an ultra-light quality object
with a few simple methods for next-gen reads?
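
To make the trimming task concrete, here is a minimal sketch of end-trimming
on plain strings, with no sequence objects involved. It is Python rather than
Perl, purely for illustration; the function name, the default threshold of 20
and the default offset of 33 are assumptions, not anything from BioPerl.

```python
def trim_low_quality_ends(seq, qual, min_score=20, offset=33):
    """Trim bases scoring below min_score from both ends of a read.

    seq and qual are plain strings of equal length; offset is the ASCII
    offset of the quality encoding (33 for Sanger, 64 for Solexa/Illumina).
    """
    scores = [ord(ch) - offset for ch in qual]
    start, end = 0, len(scores)
    # Advance past low-quality bases at the 5' end.
    while start < end and scores[start] < min_score:
        start += 1
    # Back up past low-quality bases at the 3' end.
    while end > start and scores[end - 1] < min_score:
        end -= 1
    return seq[start:end], qual[start:end]

# '#' encodes 2, '5' encodes 20 and 'I' encodes 40 at offset 33.
print(trim_low_quality_ends("ACGTACGT", "#5IIII5#"))
# ('CGTACG', '5IIII5')
```

An "ultra-light" read type could hold just these two strings plus an
identifier and expose a method like this one.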

I can contribute some tests if that sounds like an important
point.

-Tristan




Re: Next-gen modules

by John Marshall-12

On 17 Jun 2009, at 12:29, Elia Stupka wrote:
> Similarly, there seems to be little in bioperl-run to support tools  
> that have been developed in this area, such as Maq, BowTie, TopHat,  
> etc?

FYI I have a Bio::Tools::Run::Velvet wrapper [1] that I plan to submit  
in the not too distant future.  (First it needs some "blah blah"  
replaced with actual documentation and a test suite.)

Cheers,

     John

[1] http://www.ebi.ac.uk/~zerbino/velvet/


--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

Re: Next-gen modules

by Peter-329

On Wed, Jun 17, 2009 at 1:54 PM, Elia Stupka<e.stupka@...> wrote:

>
> Dear Mark,
>
> thanks a lot for the pointers.
>
> With regards to FASTQ parsing:
>
> -my understanding by reading past threads is to work on a single format,
> i.e. FASTQ and to interpret the quality "flavours" as just quality
> conversions, right?
> -However, I assume we would still want to support a simple way for the user
> to say format => 'fastq-solexa' using the nomenclature adopted in BioPython
> suggested by Peter, right?

I think you will need a way for the user to say they have a Solexa, or
an Illumina 1.3+, or an original Sanger standard FASTQ file.

From reading the http://bioperl.org/wiki/HOWTO:SeqIO wiki page, I
assumed BioPerl's SeqIO just had formats (e.g. the "chadoxml" format
and the variant "flybase_chadoxml" format). Does BioPerl's SeqIO format
system have any concept of flavour that I am not aware of?

> -I also saw Heikki's "long essay", but did not yet compare to Heng Li's code
> at http://maq.sourceforge.net/fq_all2std.pl, I guess we would hope they
> would produce identical outputs, will be a good check.

Heng Li's code at http://maq.sourceforge.net/fq_all2std.pl is a useful
guide (although it doesn't yet cope with the new Illumina 1.3+ variant),
but I don't trust it 100%. See e.g.
http://lists.open-bio.org/pipermail/biopython/2009-June/005208.html
http://lists.open-bio.org/pipermail/biopython/2009-June/005209.html

Peter

Re: Next-gen modules

by Chris Fields-5


On Jun 17, 2009, at 8:25 AM, Peter wrote:

> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields<cjfields@...> wrote:
>>
>> Elia,
>>
>> As Mark indicated, we recently discussed the lack of support for next-gen
>> on list, at least re: fastq.  I may be hit with the same thing in a few
>> months time myself, and I recall Jason and a few others also mentioning
>> the same.  Heikki wrote some code for Illumina FASTQ for SeqIO and related
>> modules but I don't believe it has been committed to trunk yet, so maybe
>> he can answer.
>>
>> From prior discussions IIRC the issues were:
>>
>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>> Illumina 1.3) from one another (so maybe some optional validation), and
>
> Following the python rule of thumb for being explicit, Biopython makes
> the user specify which FASTQ variant is being used. I don't think you
> can do anything else. Any attempted validation would have to be
> heuristic based on the ASCII characters found, and would risk false
> positive warnings.

Right; I'm thinking along the same lines.  If anything the most we  
would allow is some level of validation, so if there were a degree of  
uncertainty about the format one could set a validation flag to check  
bounds during the parse and warn if they are exceeded.
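
A bounds check of the kind described here could look like the following
Python sketch. The names are hypothetical. The lower bounds follow from the
offsets mentioned earlier in the thread (PHRED scores start at 0, Solexa
scores at -5); the upper bound of 126, the last printable ASCII character,
is an assumption of this sketch.

```python
import warnings

# (lowest, highest) ASCII code accepted for each variant.
# Upper bound 126 ('~') is assumed here, not taken from the thread.
BOUNDS = {
    "fastq": (33, 126),            # PHRED 0 at offset 33
    "fastq-solexa": (59, 126),     # Solexa -5 at offset 64
    "fastq-illumina": (64, 126),   # PHRED 0 at offset 64
}

def check_quality_bounds(quality_string, variant):
    """Warn and return False if any character is outside the variant's range."""
    low, high = BOUNDS[variant]
    bad = sorted({ch for ch in quality_string if not low <= ord(ch) <= high})
    if bad:
        warnings.warn(
            "quality characters %r are out of range for %s" % (bad, variant)
        )
    return not bad

print(check_quality_bounds("II9", "fastq"))           # True
print(check_quality_bounds("II9", "fastq-illumina"))  # False, with a warning
```

As Peter notes, a check like this can only reject a wrong guess. A string such
as "hhX" is within bounds for all three variants, so passing the check does
not identify the variant.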

>> 2) having a way for the Seq object to either 'know' what format is
>> contained, or we use phred score and convert back and forth from that
>> (I think the latter makes more sense).
>
> I think it could make sense for BioPerl to convert Solexa scores to/from
> PHRED scores on the fly (especially now that Illumina is abandoning
> the Solexa score system). Python style tries to avoid implicit conversions,
> so Biopython doesn't automatically do a conversion from Solexa to
> PHRED scores on parsing (but will on writing if the requested output
> format requires this).
>
>> Peter's suggestions also are reasonable, though does biopython have a
>> separate module for each of these variations?  Our version (I believe)
>> mainly varied the conversion within Bio::SeqIO::fastq itself based on
>> the fastq variant passed in as a separate named argument.
>
> Biopython's SeqIO gives the three FASTQ variants their own unique
> names. This format name is a required argument for parsing/writing
> (we don't try and guess the file format from the data contents).
> Internally we have three separate FASTQ parsers/writers although they
> do share code.

We could easily do the same if others agree.  Actually, if we  
specified that shorthand for a variant on a format would be designated  
as -format => 'format-variant', I think we could easily hack SeqIO to  
deal with that by splitting on '-' and passing everything to the  
constructor as (-format => 'format', -variant => 'variant').  Very  
little repeated code in this case, just an additional named parameter  
indicating the format variant (and the SeqIO class can do the type  
checking on that within the constructor).
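
The splitting step is simple enough to sketch. This is Python for
illustration, not a patch to Bio::SeqIO; split_format is a hypothetical
helper that turns 'format-variant' into the two named arguments described
above.

```python
def split_format(format_name):
    """Split 'format-variant' into (format, variant).

    Only the first '-' is significant; variant is None when absent.
    """
    fmt, _sep, variant = format_name.lower().partition("-")
    return fmt, (variant or None)

for name in ("fastq", "fastq-solexa", "fastq-illumina", "flybase_chadoxml"):
    fmt, variant = split_format(name)
    # These would become the -format and -variant constructor arguments.
    print({"format": fmt, "variant": variant})
```

Existing names that use an underscore, such as "flybase_chadoxml", pass
through unchanged because only "-" is treated as the separator.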

> Other issues to keep in mind:
>
> (3) There should be no warning parsing files where the optional repeated
> title is missing on the "+" lines (as discussed earlier on the BioPerl list).

Agreed, though we'll have to check the current fastq parser to see if  
that's currently the case.  I thought that was fixed but maybe not?

> (4) When writing FASTQ files should BioPerl omit the optional repeated
> title on the "+" line? Biopython omits this as I understand this to be
> common practice, and it can make a big difference to file sizes - especially
> on short read data from Solexa/Illumina.

Agreed, particularly if it's commonly encountered.

> (5) Also test reading and writing files with an optional description (as
> well as an identifier) on the "@" (and "+") lines. See the NCBI SRA for
> examples, e.g.
>
> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Should be easy enough to implement with a simple regex.

> (6) Test reading and writing files where the encoded quality string starts
> with a "@" or a "+" character, e.g.
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
>
> Peter

Mark, getting all that? ;>

chris




Re: Next-gen modules

by Chris Fields-5


On Jun 17, 2009, at 8:27 AM, Tristan Lefebure wrote:

> Hello,
> Regarding next-gen sequences and bioperl, in my
> experience another issue is bioperl speed. For example, if
> you want to trim bad quality bases at the ends of 1E6 Solexa
> reads using Bio::SeqIO::fastq and some methods in
> Bio::Seq::Quality, well, you've got to be patient (but
> maybe I missed some shortcuts...).

The key issues affecting speed in bioperl are contained object  
instantiation and inheritance (and between those two, the latter much  
more so as it plays a role with contained objects as well as the  
container).

http://www.bioperl.org/wiki/Why_BioPerl_is_slow

Moose/Perl6 roles/traits are one way around that issue, but we are a  
ways off from getting that running.  I think to get that working  
decently would be a from-ground-up endeavor (see my past posts on  
biomoose/bioperl6).

> A pure Perl solution will be between 100 and 1000x faster...
> Would it be possible to have an ultra-light quality object
> with few simple methods for next-gen reads?
>
> I can contribute some tests if that sounds like an important
> point.
>
> -Tristan

The quality objects themselves I don't think are that heavy; I think  
the main impediment is inheritance.  One could get around that a bit  
by using a direct_new method to create a blessed hash directly, then  
reimplement methods to lazily create any objects contained on the fly.
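
The lazy part of this idea can be shown with a small sketch. It is a Python
analogue only: there is no blessed hash here, and LightRead is a hypothetical
class, not a BioPerl object. The point is that construction stores raw
strings and the per-base scores are built only when first asked for.

```python
class LightRead:
    """A read that keeps raw strings and decodes scores only on demand."""

    __slots__ = ("identifier", "seq", "raw_quality", "offset", "_scores")

    def __init__(self, identifier, seq, raw_quality, offset=33):
        # Cheap construction: nothing is decoded or validated here.
        self.identifier = identifier
        self.seq = seq
        self.raw_quality = raw_quality
        self.offset = offset
        self._scores = None

    @property
    def scores(self):
        # Decode once, on first access, then reuse the cached list.
        if self._scores is None:
            self._scores = [ord(ch) - self.offset for ch in self.raw_quality]
        return self._scores

read = LightRead("read1", "ACG", "II9")
print(read._scores)  # None: nothing decoded yet
print(read.scores)   # [40, 40, 24]
```

Code that only filters on identifiers or sequence length never pays for the
score list at all.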

chris




Re: Next-gen modules

by Mark A. Jensen

I'm on the case! (but maybe not in realtime, today!)

----- Original Message -----
From: "Chris Fields" <cjfields@...>
To: "Peter" <biopython@...>
Cc: "BioPerl List" <bioperl-l@...>; "Elia Stupka"
<e.stupka@...>; "Heikki Lehvaslaiho" <heikki@...>
Sent: Wednesday, June 17, 2009 1:06 PM
Subject: Re: [Bioperl-l] Next-gen modules


>
> On Jun 17, 2009, at 8:25 AM, Peter wrote:
>
>> On Wed, Jun 17, 2009 at 1:57 PM, Chris Fields <cjfields@...> wrote:
>>>
>>> Elia,
>>>
>>> As Mark indicated, we recently discussed the lack of support for next-gen
>>> on list, at least re: fastq.  I may be hit with the same thing in a few
>>> months time myself, and I recall Jason and a few others also mentioning
>>> the same.  Heikki wrote some code for Illumina FASTQ for SeqIO and related
>>> modules but I don't believe it has been committed to trunk yet, so maybe
>>> he can answer.
>>>
>>> From prior discussions IIRC the issues were:
>>>
>>> 1) distinguishing the various FASTQ versions (Sanger, Illumina 1.0,
>>> Illumina 1.3) from one another (so maybe some optional validation), and
>>
>> Following the python rule of thumb for being explicit, Biopython makes
>> the user specify which FASTQ variant is being used. I don't think you
>> can do anything else. Any attempted validation would have to be
>> heuristic based on the ASCII characters found, and would risk false
>> positive warnings.
>
> Right; I'm thinking along the same lines.  If anything the most we would
> allow is some level of validation, so if there were a degree of uncertainty
> about the format one could set a validation flag to check bounds during the
> parse and warn if they are exceeded.
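
A bounds check of the kind described above might look like the following Python sketch. The variant labels and the function name `out_of_bounds` are illustrative assumptions; the ASCII ranges are the commonly documented ones for Sanger (offset 33, PHRED 0 to 93), Solexa / "Illumina 1.0" (offset 64, Solexa scores -5 to 62), and Illumina 1.3 (offset 64, PHRED 0 to 62). As Peter notes, this can only flag characters that are impossible for the declared variant; it cannot prove which variant a file really is.

```python
# Allowed ASCII range for each FASTQ variant (labels are illustrative).
FASTQ_BOUNDS = {
    "sanger": (33, 126),    # PHRED 0..93, offset 33
    "solexa": (59, 126),    # Solexa -5..62, offset 64
    "illumina": (64, 126),  # PHRED 0..62, offset 64
}


def out_of_bounds(qual_string, variant):
    """Return the characters of qual_string outside the variant's range."""
    low, high = FASTQ_BOUNDS[variant]
    return [c for c in qual_string if not low <= ord(c) <= high]
```

A parser with validation switched on could call this per record and emit a warning whenever the returned list is non-empty.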
>
>>> 2) having a way for the Seq object to either 'know' what format is
>>> contained, or we use phred score and convert back and forth from that (I
>>> think the latter makes more sense).
>>
>> I think it could make sense for BioPerl to convert Solexa scores to/from
>> PHRED scores on the fly (especially now that Illumina is abandoning
>> the Solexa score system). Python style tries to avoid implicit conversions,
>> so Biopython doesn't automatically do a conversion from Solexa to
>> PHRED scores on parsing (but will on writing if the requested output
>> format requires this).
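
For reference, the Solexa/PHRED conversion discussed above follows from the two score definitions (PHRED is -10 log10(p), Solexa is -10 log10(p / (1 - p))). The Python sketch below uses hypothetical function names; clamping at -5 for very low PHRED values is an assumption that mirrors the conventional lower limit of Solexa scores.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED score to a Solexa score, clamped at -5 (assumed floor)."""
    if q_phred < 0:
        raise ValueError("PHRED scores cannot be negative")
    if q_phred == 0:
        return -5.0  # the formula diverges at 0, so use the floor directly
    return max(-5.0, 10 * math.log10(10 ** (q_phred / 10) - 1))
```

The two scales agree closely at high quality and diverge only for low scores, which is why the choice of internal representation matters mainly for poor-quality bases.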
>>
>>> Peter's suggestions also are reasonable, though does biopython have a
>>> separate module for each of these variations?  Our version (I believe)
>>> mainly varied the conversion within Bio::SeqIO::fastq itself based on the
>>> fastq variant passed in as a separate named argument.
>>
>> Biopython's SeqIO gives the three FASTQ variants their own unique
>> names. This format name is a required argument for parsing/writing
>> (we don't try and guess the file format from the data contents). Internally
>> we have three separate FASTQ parsers/writers although they do share
>> code.
>
> We could easily do the same if others agree.  Actually, if we specified that
> shorthand for a variant on a format would be designated as -format =>
> 'format-variant', I think we could easily hack SeqIO to deal with that by
> splitting on '-' and passing everything to the constructor as (-format =>
> 'format', -variant => 'variant').  Very little repeated code in this case,
> just an additional named parameter indicating the format variant (and the
> SeqIO class can do the type checking on that within the constructor).
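
The 'format-variant' shorthand proposed above amounts to splitting the format string on its first '-'. A minimal Python sketch follows; the function name `split_format` and the lower-casing are assumptions, and this is not how any released SeqIO is confirmed to behave.

```python
def split_format(format_string):
    """Split 'format-variant' into (format, variant); variant may be None."""
    # partition() splits on the first '-' only and never raises.
    base, _, variant = format_string.partition("-")
    return base.lower(), (variant.lower() or None)
```

With this, 'fastq-illumina' would be dispatched to the fastq handler with 'illumina' passed along as the variant, while a plain 'genbank' yields no variant at all.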
>
>> Other issues to keep in mind:
>>
>> (3) There should be no warning parsing files where the optional repeated
>> title is missing on the "+" lines (as discussed earlier on the BioPerl
>> list).
>
> Agreed, though we'll have to check the current fastq parser to see if that's
> currently the case.  I thought that was fixed but maybe not?
>
>> (4) When writing FASTQ files should BioPerl omit the optional repeated
>> title on the "+" line? Biopython omits this as I understand this to be
>> common practice, and can make a big difference to file sizes, especially
>> on short read data from Solexa/Illumina.
>
> Agreed, particularly if it's commonly encountered.
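
Point (4) can be illustrated with a small writer sketch in Python. The function name `format_fastq` and its `repeat_title` switch are hypothetical; the default follows the practice Peter describes, leaving the "+" line bare.

```python
def format_fastq(read_id, seq, qual, description="", repeat_title=False):
    """Return one FASTQ record as a string; the '+' line is bare by default."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    title = f"{read_id} {description}".rstrip()
    plus_line = "+" + title if repeat_title else "+"
    return f"@{title}\n{seq}\n{plus_line}\n{qual}\n"
```

For 36-base reads the repeated title can be longer than the sequence itself, so omitting it removes a sizable fraction of every record.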
>
>> (5) Also test reading and writing files with an optional description (as
>> well as an identifier) on the "@" (and "+") lines. See the NCBI SRA for
>> examples, e.g.
>>
>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
>
> Should be easy enough to implement with a simple regex.
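
One way such a regex could look, shown in Python with hypothetical names (`TITLE_RE`, `parse_title`): the identifier is everything up to the first whitespace, and the rest of the line is the description. It should only be applied to lines already known by position to be title lines, since a quality line may also begin with "@" or "+".

```python
import re

# Leading '@' or '+', then the identifier, then an optional description.
TITLE_RE = re.compile(r"^[@+](\S*)(?:\s+(.*))?$")


def parse_title(line):
    """Return (identifier, description) from an '@' or '+' title line."""
    match = TITLE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a FASTQ title line: %r" % line)
    return match.group(1), match.group(2) or ""
```

Applied to the SRA example above, the identifier is `SRR001666.1` and the description is the remainder of the line; a bare "+" line gives two empty strings.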
>
>> (6) Test reading and writing files where the encoded quality string starts
>> with a "@" or a "+" character, e.g.
>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html
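
The case in point (6) is why a parser cannot treat every line starting with "@" as a new record. A common approach, sketched here in Python under the assumption that the quality string has exactly as many characters as the sequence, is to read quality lines until that length is reached. The function name `read_fastq` and the example data are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (title, seq, qual) tuples; quality is read by length, not marker."""
    line = handle.readline()
    while line:
        if not line.strip():              # skip blank lines between records
            line = handle.readline()
            continue
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:].rstrip()
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError("missing '+' line for %s" % title)
        seq = "".join(seq_parts)
        qual = ""
        while len(qual) < len(seq):       # '@' or '+' here is just quality data
            line = handle.readline()
            if not line:
                raise ValueError("truncated quality for %s" % title)
            qual += line.strip()
        if len(qual) != len(seq):
            raise ValueError("quality longer than sequence for %s" % title)
        yield title, seq, qual
        line = handle.readline()


# Both quality strings below start with a character that looks like a marker.
EXAMPLE = "@r1\nACGT\n+\n@III\n@r2 desc\nGG\n+r2 desc\n+I\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
```

Because the quality block is consumed by length, the "@III" and "+I" lines are read as quality data and never mistaken for record boundaries.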
>>
>> Peter
>
> Mark, getting all that? ;>
>
> chris

Re: Next-gen modules

by Elia Stupka

I would suggest developing the "standard" version first, then moving  
onto potential optimizations.

When we went through a similar argument in Ensembl about 8 years ago  
we ended up dropping Bio::Root completely...

If one is truly after performance for these large next-gen projects,  
it'd be down to pure piping, shell, and worrying about location and  
copying of files, sticking to systems-level as much as possible, and  
quite far from Bioperl altogether, so I think it's a whole different  
level of optimization issues, probably outside the scope of Bioperl.

Elia


Re: Next-gen modules

by Chris Fields-5

I think this is a top priority for a fall BioPerl release, maybe 1.6.2  
(I am planning on a summer 1.6.1 release still).  Made it into a bug  
report for tracking:

http://bugzilla.open-bio.org/show_bug.cgi?id=2857

If no one works on this I may take it up after the 1.6.1 release.

chris


Re: Next-gen modules

by Elia Stupka

If we reach a consensus on how/who/what, I will be happy to contribute  
some coding time in the coming days.

Would it be a good starting point to add the different
formats as named in BioPython, and test support for reading/writing
them? I could start playing with that.

regards,

Elia


Re: Next-gen modules

by Tristan Lefebure-2

Thanks both for the light.

That probably means that the place bioperl will take in the
handling of next-gen sequencing raw data (i.e. reads) is
very limited, no? (at least until bioperl6). A single GA2
Solexa lane generates about 9 million reads, and I would
really not call that a big project...

BTW, is there a simple way to see object instantiation and
inheritance, as well as time consumption for each, when one
calls next_seq() (or any other method)?

-Tristan


Re: Next-gen modules

by Sendu Bala-2

Tristan Lefebure wrote:
> Hello,
> Regarding next-gen sequences and bioperl, following my
> experience, another issue is bioperl speed. For example, if
> you want to trim bad quality bases at ends of 1E6 Solexa
> reads using Bio::SeqIO::fastq and some methods in
> Bio::Seq::Quality, well, you've got to be patient (but maybe
> I missed some shortcuts...).

This is my concern as well. Or, rather, is there actually a significant
set of users out there who are dealing with next-gen sequencing and
would consider using BioPerl for their work?

I'm working with all the 1000-genomes data at the Sanger, and we at
least are probably never going to use BioPerl for the work.


> A pure perl solution will be between 100 and 1000x faster...
> Would it be possible to have an ultra-light quality object
> with few simple methods for next-gen reads?

The fastq parser itself already seems pretty fast. The way to get the
speedup is to not create any Bio::Seq* objects but just return the data
directly. At that point it's not taking much advantage of BioPerl. But
certainly it could be done...
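
To make the "return the data directly" idea concrete, here is a Python sketch of the end-trimming task from the start of this sub-thread, working on plain strings with no sequence objects. The function name `trim_ends`, the threshold of 20, and the offset of 33 are assumptions; Solexa/Illumina data of this period would typically need an offset of 64.

```python
def trim_ends(seq, qual, min_q=20, offset=33):
    """Trim bases scoring below min_q from both ends of a read."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_q:       # walk in from the 5' end
        start += 1
    while end > start and scores[end - 1] < min_q:     # walk in from the 3' end
        end -= 1
    return seq[start:end], qual[start:end]
```

Each read costs one short list and two slices, which is the kind of overhead the thread contrasts with building a full object per record.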

Re: Next-gen modules

by Elia Stupka

We are using bioperl for simple pre- and post-processing of data for
full Solexa runs, and although it might not be ideal, the scripting
with Bioperl is not a major killer. When I was referring to large,
heavy pipelines I was thinking of pipelines that deal with many Solexa
runs as one project (e.g. 1000 genomes), which really cannot afford any
bottleneck, because that directly affects their storage.

cheers

Elia



Re: Next-gen modules

by Chris Fields-5

On Jun 17, 2009, at 1:09 PM, Tristan Lefebure wrote:

> Thanks both for the light.
>
> That probably means that the place bioperl will take in the
> handling of next-gen sequencing raw data (i.e. reads) is
> very limited, no? (at least until bioperl6). A single GA2
> Solexa lane generates about 9 million reads, and I would
> really not call that a big project...

I don't think it's impossible.  If you parse any very long list of
sequences in order it will be very slow, yes, but if they were indexed
or loaded into a DB, lookups would of course be orders of magnitude faster.

We already have perl-based indexing for fastq (Bio::Index::Fastq), so  
maybe something could be built on top of that. I haven't looked but we  
can also wrap other C/C++-based parsers as well. BioLib, for instance,  
has bindings to io_lib, so maybe that could be (ab)used in some way.
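
The indexing idea can be shown with a byte-offset index in Python. This is not how Bio::Index::Fastq is implemented (its internals are not described in this thread); the function names are hypothetical and the sketch assumes strict four-line records, so wrapped sequence or quality lines would break it.

```python
def build_index(path):
    """Map each read identifier to the byte offset of its '@' line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break
            if not header.strip():
                continue                      # tolerate blank lines
            for _ in range(3):                # skip sequence, '+', quality
                handle.readline()
            read_id = header[1:].split()[0].decode("ascii")
            index[read_id] = offset
    return index


def fetch(path, index, read_id):
    """Seek straight to one record and return (title, seq, qual)."""
    with open(path, "rb") as handle:
        handle.seek(index[read_id])
        lines = [handle.readline().decode("ascii").rstrip("\r\n")
                 for _ in range(4)]
    return lines[0][1:], lines[1], lines[3]
```

Building the index is still one full pass over the file, but every later lookup is a single seek plus four line reads, regardless of where the read sits in the file.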

> BTW, is there a simple way to see object instantiation and
> inheritance, as well as time consumption for each, when one
> calls next_seq() (or any other method)?
>
> -Tristan

As a simple benchmark, at one point all feature tag information was  
converted into Bio::Annotations.  I reverted that behavior to be  
simple tag/value again and had a pretty decent bump:

http://www.bioperl.org/wiki/Feature_Annotation_rollback#Simple_Benchmark

Also, I tried reimplementing some parsers as generic 'event'-based
driver/handler and they were slightly faster, the key roadblock being
instantiation again.  If I didn't create Features/Annotations I saw a
significant speedup.  That's not entirely unexpected, as SeqFeatures  
also contain Locations (in turn that can contain subLocations) and  
(until recently) tag-based Bio::Annotation by default.  Annotations  
are collected in an Annotation::Collection and can contain other  
objects I believe (Ontology terms, etc).

The overall lesson is, if you don't have very heavy objects being  
created the overhead is actually quite small; it's only when you  
greedily instantiate everything that you run into problems.

chris