Next-gen modules

View: New views
20 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 - 3 - 4 - 5 | Next >

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Peter,

I just committed a fix to FASTQ parsing last night to support read/
write for Sanger/Solexa/Illumina following the biopython convention;  
the only thing needed is more extensive testing for the quality  
scores.  There are a few other oddities with it I intend to address  
soon, but it appears to be working.

The Seq instance iterator actually calls a raw data iterator (hash  
refs of named arguments to the class constructor).  That should act as  
a decent filtering step if needed.

We have automated EMBOSS wrapping but I'm not sure how intuitive it  
is; we can probably reconfigure some of that.

chris

On Jul 1, 2009, at 2:44 AM, Peter Cock wrote:

> Hi all (BioPerl and Biopython),
>
> This is a continuation of a long thread on the BioPerl mailing
> list, which I have now CC'd to the Biopython mailing list. See:
> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html
>
> On this thread we have been discussing next gen sequencing
> tools and co-coordinating things like consistent file format
> naming between Biopython, BioPerl and EMBOSS. I've been
> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
> and he will look into setting up a cross project mailing list for
> this kind of discussion in future.
>
> In the mean time, my replies to Giles below cover both BioPerl
> and Biopython (and EMBOSS). Giles' original email is here:
> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html
>
> Peter
>
> On 6/30/09, Giles Weaver <giles.weaver@...> wrote:
>>
>> I'm developing a transcriptomics database for use with next-gen  
>> data, and
>> have found processing the raw data to be a big hurdle.
>>
>> I'm a bit late in responding to this thread, so most issues have  
>> already
>> been discussed. One thing that hasn't been mentioned is removal of  
>> adapters
>> from raw Illumina sequence. This is a PITA, and I'm not aware of  
>> any well
>> developed and documented open source software for removal of adapters
>> (and poor quality sequence) from Illumina reads.
>>
>> My current Illumina sequence processing pipeline is an unholy mix of
>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for  
>> converting
>> the Illumina fastq to Sanger fastq, bioperl to read the quality  
>> values,
>> pure perl to trim the poor quality sequence from each read, and  
>> bioperl
>> with emboss to remove the adapter sequence. I'm aware that the  
>> pipeline
>> contains bugs and would like to simplify it, but at least it does  
>> work...
>>
>> Ideally I'd like to replace as much of the pipeline as possible with
>> bioperl/bioperl-run, but this isn't currently possible due to both  
>> a lack
>> of features and poor performance. I'm sure the features will come  
>> with
>> time, but the performance is more of a concern to me. ..
>
> I gather you would rather work with (Bio)Perl, but since you are
> already using Biopython to do the FASTQ conversion, you could
> also use it for more of your pipe line. Our tutorial includes examples
> of simple FASTQ quality filtering, and trimming of primer sequences
> (something like this might be helpful for removing adaptors). See:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> Alternatively, with the new release of EMBOSS this July, you will
> also be able to do the Illumina FASTQ to Sanger standard FASTQ
> with EMBOSS, and I'm sure BioPerl will offer this soon too.
>
>> Regarding trimming bad quality bases (see comments from
>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed
>> pure/bioperl solution to be much faster than a primarily bioperl
>> based implementation. I found Bio::Seq->subseq(a,b) and
>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims
>> ~1300 sequences/second, including unzipping the raw data and
>> converting it to sanger fastq with biopython. Processing an entire
>> sequencing run with the whole pipeline takes in the region of 6-12h.
>
> There are several ways of doing quality trimming, and it would
> make an excellent cookbook example (both for BioPerl and
> Biopython).
>
> Could you go into a bit more detail about your trimming
> algorithm? e.g. Do you just trim any bases on the right below
> a certain threshold, perhaps with a minimum length to retain
> the trimmed read afterwards?
>
>> Hope this looooong post was of interest to someone!
>
> I was interested at least ;)
>
> Peter
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Jonathan_Epstein :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I too am interested in these topics.  In particular, I would like to
learn more about "sequencing adapter removal," i.e. what these adapters
look like, and what strategies you've employed for finding and removing
them.

Jonathan


Giles Weaver wrote:

> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters (and
> poor quality sequence) from Illumina reads.
>
> My current Illumina sequence processing pipeline is an unholy mix of
> biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting
> the Illumina fastq to Sanger fastq, bioperl to read the quality values, pure
> perl to trim the poor quality sequence from each read, and bioperl with
> emboss to remove the adapter sequence. I'm aware that the pipeline contains
> bugs and would like to simplify it, but at least it does work...
>
> Ideally I'd like to replace as much of the pipeline as possible with
> bioperl/bioperl-run, but this isn't currently possible due to both a lack of
> features and poor performance. I'm sure the features will come with time,
> but the performance is more of a concern to me. I wonder if Bio::Moose might
> be used to alleviate some of the performance issues? Might next-gen modules
> be an ideal guinea pig for Bio::Moose?
>
> For my purposes the tools that would love to see supported in
> bioperl/bioperl-run are:
>
>    - next-gen sequence quality parsing (to output phred scores)
>    - sequence quality based trimming
>    - sequencing adapter removal
>    - filtering based on sequence complexity (repeats, entropy etc)
>    - bioperl-run modules for bowtie etc.
>
> Obviously all of these need to be fast!
> I'd love to muck in, but I doubt I'll contribute much before
> Bio::Moose/bioperl6, as the (bio)perl object system gives me nightmares!
>
> Regarding trimming bad quality bases (see comments from Tristan Lefebure)
> from Solexa/Illumina reads, I did find a mixed pure/bioperl solution to be
> much faster than a primarily bioperl based implementation. I found
> Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. My
> current code trims ~1300 sequences/second, including unzipping the raw data
> and converting it to sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.
>
> Hope this looooong post was of interest to someone!
>
> Giles
>
> 2009/6/17 Tristan Lefebure <tristan.lefebure@...>
>
>  
>> Hello,
>> Regarding next-gen sequences and bioperl, following my
>> experience, another issue is bioperl speed. For example, if
>> you want to trim bad quality bases at ends of 1E6 Solexa
>> reads using Bio::SeqIO::fastq and some methods in
>> Bio::Seq::Quality, well, you've got to be patient (but may
>> be I missed some shortcuts...).
>>
>> A pure perl solution will be between 100 to 1000x faster...
>> Would it be possible to have an ultra-light quality object
>> with few simple methods for next-gen reads?
>>
>> I can contribute some tests if that sounds like an important
>> point.
>>
>> -Tristan
>>
>>    
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>  
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Giles Weaver :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Peter, the trimming algorithm I use employs a sliding window, as follows:

   - For each sequence position calculate the mean phred quality score for a
   window around that position.
   - Record whether the mean score is above or below a threshold as an array
   of zeros and ones.
   - Use a regular expression on the joined array to find the start and end
   of the good quality sequence(s).
   - Extract the quality sequence(s) and replace any bases below the quality
   threshold with N.
   - Trim any Ns from the ends.

A refinement would be to weight the scores from positions in the window, but
this could give a performance hit, and the method seems to work well enough
as is.

Chris, thanks for committing the fix, I'll give bioperl illumina fastq
parsing a workout soon. Peter, as much as I'd love to help out with
biopython, I'm under too much time pressure right now!

Jonathan, some of the Illumina sequencing adapters are listed at
http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland
http://seqanswers.com/forums/showthread.php?t=198
Adapter sequence typically appears towards the end of the read, though the
latter part of it is often misread as the sequencing quality drops off.
I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I
then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
real alignments and trim the sequence. This is not the ideal way of doing
things, but it's fast enough, and does seem to work. The adapter sequence
shouldn't be gapped, so I'm sure there is a lot of scope for optimising the
adapter removal.

I'll happily share some code once I've got it to the stage where I'm not
embarrassed by it!

Giles

2009/7/1 Chris Fields <cjfields@...>

> Peter,
>
> I just committed a fix to FASTQ parsing last night to support read/write
> for Sanger/Solexa/Illumina following the biopython convention; the only
> thing needed is more extensive testing for the quality scores.  There are a
> few other oddities with it I intend to address soon, but it appears to be
> working.
>
> The Seq instance iterator actually calls a raw data iterator (hash refs of
> named arguments to the class constructor).  That should act as a decent
> filtering step if needed.
>
> We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we
> can probably reconfigure some of that.
>
> chris
>
>
> On Jul 1, 2009, at 2:44 AM, Peter Cock wrote:
>
>  Hi all (BioPerl and Biopython),
>>
>> This is a continuation of a long thread on the BioPerl mailing
>> list, which I have now CC'd to the Biopython mailing list. See:
>> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html
>>
>> On this thread we have been discussing next gen sequencing
>> tools and co-coordinating things like consistent file format
>> naming between Biopython, BioPerl and EMBOSS. I've been
>> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009,
>> and he will look into setting up a cross project mailing list for
>> this kind of discussion in future.
>>
>> In the mean time, my replies to Giles below cover both BioPerl
>> and Biopython (and EMBOSS). Giles' original email is here:
>> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html
>>
>> Peter
>>
>> On 6/30/09, Giles Weaver <giles.weaver@...> wrote:
>>
>>>
>>> I'm developing a transcriptomics database for use with next-gen data, and
>>> have found processing the raw data to be a big hurdle.
>>>
>>> I'm a bit late in responding to this thread, so most issues have already
>>> been discussed. One thing that hasn't been mentioned is removal of
>>> adapters
>>> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
>>> developed and documented open source software for removal of adapters
>>> (and poor quality sequence) from Illumina reads.
>>>
>>> My current Illumina sequence processing pipeline is an unholy mix of
>>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for
>>> converting
>>> the Illumina fastq to Sanger fastq, bioperl to read the quality values,
>>> pure perl to trim the poor quality sequence from each read, and bioperl
>>> with emboss to remove the adapter sequence. I'm aware that the pipeline
>>> contains bugs and would like to simplify it, but at least it does work...
>>>
>>> Ideally I'd like to replace as much of the pipeline as possible with
>>> bioperl/bioperl-run, but this isn't currently possible due to both a lack
>>> of features and poor performance. I'm sure the features will come with
>>> time, but the performance is more of a concern to me. ..
>>>
>>
>> I gather you would rather work with (Bio)Perl, but since you are
>> already using Biopython to do the FASTQ conversion, you could
>> also use it for more of your pipe line. Our tutorial includes examples
>> of simple FASTQ quality filtering, and trimming of primer sequences
>> (something like this might be helpful for removing adaptors). See:
>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>
>> Alternatively, with the new release of EMBOSS this July, you will
>> also be able to do the Illumina FASTQ to Sanger standard FASTQ
>> with EMBOSS, and I'm sure BioPerl will offer this soon too.
>>
>>  Regarding trimming bad quality bases (see comments from
>>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed
>>> pure/bioperl solution to be much faster than a primarily bioperl
>>> based implementation. I found Bio::Seq->subseq(a,b) and
>>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims
>>> ~1300 sequences/second, including unzipping the raw data and
>>> converting it to sanger fastq with biopython. Processing an entire
>>> sequencing run with the whole pipeline takes in the region of 6-12h.
>>>
>>
>> There are several ways of doing quality trimming, and it would
>> make an excellent cookbook example (both for BioPerl and
>> Biopython).
>>
>> Could you go into a bit more detail about your trimming
>> algorithm? e.g. Do you just trim any bases on the right below
>> a certain threshold, perhaps with a minimum length to retain
>> the trimmed read afterwards?
>>
>>  Hope this looooong post was of interest to someone!
>>>
>>
>> I was interested at least ;)
>>
>> Peter
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@...
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:

...

> Peter, the trimming algorithm I use employs a sliding window, as  
> follows:
>
>   - For each sequence position calculate the mean phred quality  
> score for a
>   window around that position.
>   - Record whether the mean score is above or below a threshold as  
> an array
>   of zeros and ones.
>   - Use a regular expression on the joined array to find the start  
> and end
>   of the good quality sequence(s).
>   - Extract the quality sequence(s) and replace any bases below the  
> quality
>   threshold with N.
>   - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the  
> window, but
> this could give a performance hit, and the method seems to work well  
> enough
> as is.
>
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!

Just let me know if the qual values match up with what is expected.  
You can also iterate through the data with hashrefs using next_dataset  
(faster than objects).  This is from the fastq tests in core:

-----------------------------------------
$in_qual  = Bio::SeqIO->new(-file     =>  
test_input_file('fastq','test3_illumina.fastq'),
                             -variant  => 'illumina',
                             -format   => 'fastq');

$qual = $in_qual->next_dataset();

isa_ok($qual, 'HASH');
is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA');
is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
is($qual->{-id}, 'FC12044_91407_8_200_406_24');
is($qual->{-desc}, '');
is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');
is(join(',',@{$qual->{-qual}}[0..10]),  
'19,24,24,20,24,24,24,24,24,24,24');
-----------------------------------------

So one could check those values directly and then filter them through  
as needed directly into Bio::Seq::Quality if necessary (note some of  
the key values are constructor args):

my $qualobj = Bio::Seq::Quality->new(%$qual);

chris
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Peter Cock :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 7/1/09, Giles Weaver wrote:

> Peter, the trimming algorithm I use employs a sliding window, as follows:
>
>    - For each sequence position calculate the mean phred quality score for a
>    window around that position.
>    - Record whether the mean score is above or below a threshold as an array
>    of zeros and ones.
>    - Use a regular expression on the joined array to find the start and end
>    of the good quality sequence(s).
>    - Extract the quality sequence(s) and replace any bases below the quality
>    threshold with N.
>    - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the window, but
> this could give a performance hit, and the method seems to work well enough
> as is.

Thanks for the details - that is a bit more complex that what I had been
thinking. Do you have any favoured window size and quality threshold,
or does this really depend on the data itself?

Also, if you find a sequence read that goes "good - poor - good" for
example, do you extract the two good regions as two sub reads
(presumably with a minimum length)? This may be silly for Illumina
where the reads are very short, but might make sense for Roche 454.

> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!

Even use cases are useful - so thank you.

> Jonathan, some of the Illumina sequencing adapters are listed at
> http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland
> http://seqanswers.com/forums/showthread.php?t=198
> Adapter sequence typically appears towards the end of the read, though the
> latter part of it is often misread as the sequencing quality drops off.
> I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I
> then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
> real alignments and trim the sequence. This is not the ideal way of doing
> things, but it's fast enough, and does seem to work. The adapter sequence
> shouldn't be gapped, so I'm sure there is a lot of scope for optimising the
> adapter removal.
>
> I'll happily share some code once I've got it to the stage where I'm not
> embarrassed by it!
>
> Giles

Cheers,

Peter
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Giles Weaver :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Regarding the trimming algorithm, I've been using a window size of 5, a
minimum score of 20 and a minimum length of 15 with the Illumina data. In
the past I have used a similar algorithm with a larger window size and much
longer minimum length with sequence from ABI 3XXX machines. I imagine that
the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to
those for Illumina and Sanger sequencing respectively.
Window size doesn't appear to affect performance much, if at all.

For sequences with multiple good regions, I do extract all good regions.
Even with the Illumina data there are sometimes two good regions, but
usually the second is adapter or junk and gets filtered out later. I haven't
seen quality data from a 454 machine recently, and would be interested to
know if multiple good regions are commonplace in 454 data. Can anyone with
access to 454 data comment on this?

Giles

2009/7/2 Peter Cock <p.j.a.cock@...>

> On 7/1/09, Giles Weaver wrote:
> > Peter, the trimming algorithm I use employs a sliding window, as follows:
> >
> >    - For each sequence position calculate the mean phred quality score
> for a
> >    window around that position.
> >    - Record whether the mean score is above or below a threshold as an
> array
> >    of zeros and ones.
> >    - Use a regular expression on the joined array to find the start and
> end
> >    of the good quality sequence(s).
> >    - Extract the quality sequence(s) and replace any bases below the
> quality
> >    threshold with N.
> >    - Trim any Ns from the ends.
> >
> > A refinement would be to weight the scores from positions in the window,
> but
> > this could give a performance hit, and the method seems to work well
> enough
> > as is.
>
> Thanks for the details - that is a bit more complex that what I had been
> thinking. Do you have any favoured window size and quality threshold,
> or does this really depend on the data itself?
>
> Also, if you find a sequence read that goes "good - poor - good" for
> example, do you extract the two good regions as two sub reads
> (presumably with a minimum length)? This may be silly for Illumina
> where the reads are very short, but might make sense for Roche 454.
>
> > Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> > parsing a workout soon. Peter, as much as I'd love to help out with
> > biopython, I'm under too much time pressure right now!
>
> Even use cases are useful - so thank you.
>
> > Jonathan, some of the Illumina sequencing adapters are listed at
> >
> http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland
> > http://seqanswers.com/forums/showthread.php?t=198
> > Adapter sequence typically appears towards the end of the read, though
> the
> > latter part of it is often misread as the sequencing quality drops off.
> > I abuse needle (EMBOSS) into aligning the adapter sequence with each
> read. I
> > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify
> > real alignments and trim the sequence. This is not the ideal way of doing
> > things, but it's fast enough, and does seem to work. The adapter sequence
> > shouldn't be gapped, so I'm sure there is a lot of scope for optimising
> the
> > adapter removal.
> >
> > I'll happily share some code once I've got it to the stage where I'm not
> > embarrassed by it!
> >
> > Giles
>
> Cheers,
>
> Peter
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Giles Weaver :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Chris, I've just tested your Illumina/Solexa fastq parsing code and am
pleased to report that I haven't encountered any issues thus far.

To give an idea of the processing overhead of object instantiation, fastq
parsing performance on a lowly 3GHz Core 2 Duo (using one core) is as
follows:
Illumina fastq with next_dataset: ~1 million sequences/minute
Solexa fastq with next_dataset: ~500000 sequences/minute
Illumina fastq with next_seq: ~215000 sequences/minute
Solexa fastq with next_seq: ~175000 sequences/minute

My quality trimming script does about 300000 sequences/minute with
next_dataset, up from ~130000 sequences/minute with next_seq, so it shaves
hours off the run time, thanks!

Giles

2009/7/1 Chris Fields <cjfields@...>

> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:
>
> ...
>
>  Peter, the trimming algorithm I use employs a sliding window, as follows:
>>
>>  - For each sequence position calculate the mean phred quality score for a
>>  window around that position.
>>  - Record whether the mean score is above or below a threshold as an array
>>  of zeros and ones.
>>  - Use a regular expression on the joined array to find the start and end
>>  of the good quality sequence(s).
>>  - Extract the quality sequence(s) and replace any bases below the quality
>>  threshold with N.
>>  - Trim any Ns from the ends.
>>
>> A refinement would be to weight the scores from positions in the window,
>> but
>> this could give a performance hit, and the method seems to work well
>> enough
>> as is.
>>
>> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
>> parsing a workout soon. Peter, as much as I'd love to help out with
>> biopython, I'm under too much time pressure right now!
>>
>
> Just let me know if the qual values match up with what is expected.  You
> can also iterate through the data with hashrefs using next_dataset (faster
> than objects).  This is from the fastq tests in core:
>
> -----------------------------------------
> $in_qual  = Bio::SeqIO->new(-file     =>
> test_input_file('fastq','test3_illumina.fastq'),
>                            -variant  => 'illumina',
>                            -format   => 'fastq');
>
> $qual = $in_qual->next_dataset();
>
> isa_ok($qual, 'HASH');
> is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA');
> is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
> is($qual->{-id}, 'FC12044_91407_8_200_406_24');
> is($qual->{-desc}, '');
> is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');
> is(join(',',@{$qual->{-qual}}[0..10]), '19,24,24,20,24,24,24,24,24,24,24');
> -----------------------------------------
>
> So one could check those values directly and then filter them through as
> needed directly into Bio::Seq::Quality if necessary (note some of the key
> values are constructor args):
>
> my $qualobj = Bio::Seq::Quality->new(%$qual);
>
> chris
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

No problem.  Scary to see that creating an instance is 2-4x slower  
than a simple hash ref.  Not sure there is an easy way around that;  
maybe we need a direct_new?

The next step is to ensure this works cross-platform and get indexing  
(via Bio::Index::Fastq) optimized.  Would be nice to get output  
working with the hash refs as well.

chris

On Jul 3, 2009, at 10:35 AM, Giles Weaver wrote:

> Chris, I've just tested your Illumina/Solexa fastq parsing code and  
> am pleased to report that I haven't encountered any issues thus far.
>
> To give an idea of the processing overhead of object instantiation,  
> fastq parsing performance on a lowly 3GHz Core 2 Duo (using one  
> core) is as follows:
> Illumina fastq with next_dataset: ~1 million sequences/minute
> Solexa fastq with next_dataset: ~500000 sequences/minute
> Illumina fastq with next_seq: ~215000 sequences/minute
> Solexa fastq with next_seq: ~175000 sequences/minute
>
> My quality trimming script does about 300000 sequences/minute with  
> next_dataset, up from ~130000 sequences/minute with next_seq, so it  
> shaves hours off the run time, thanks!
>
> Giles
>
> 2009/7/1 Chris Fields <cjfields@...>
> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote:
>
> ...
>
>
> Peter, the trimming algorithm I use employs a sliding window, as  
> follows:
>
>  - For each sequence position calculate the mean phred quality score  
> for a
>  window around that position.
>  - Record whether the mean score is above or below a threshold as an  
> array
>  of zeros and ones.
>  - Use a regular expression on the joined array to find the start  
> and end
>  of the good quality sequence(s).
>  - Extract the quality sequence(s) and replace any bases below the  
> quality
>  threshold with N.
>  - Trim any Ns from the ends.
>
> A refinement would be to weight the scores from positions in the  
> window, but
> this could give a performance hit, and the method seems to work well  
> enough
> as is.
>
> Chris, thanks for committing the fix, I'll give bioperl illumina fastq
> parsing a workout soon. Peter, as much as I'd love to help out with
> biopython, I'm under too much time pressure right now!
>
> Just let me know if the qual values match up with what is expected.  
> You can also iterate through the data with hashrefs using  
> next_dataset (faster than objects).  This is from the fastq tests in  
> core:
>
> -----------------------------------------
> $in_qual  = Bio::SeqIO->new(-file     =>  
> test_input_file('fastq','test3_illumina.fastq'),
>                            -variant  => 'illumina',
>                            -format   => 'fastq');
>
> $qual = $in_qual->next_dataset();
>
> isa_ok($qual, 'HASH');
> is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA');
> is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ');
> is($qual->{-id}, 'FC12044_91407_8_200_406_24');
> is($qual->{-desc}, '');
> is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24');
> is(join(',',@{$qual->{-qual}}[0..10]),  
> '19,24,24,20,24,24,24,24,24,24,24');
> -----------------------------------------
>
> So one could check those values directly and then filter them  
> through as needed directly into Bio::Seq::Quality if necessary (note  
> some of the key values are constructor args):
>
> my $qualobj = Bio::Seq::Quality->new(%$qual);
>
> chris
>

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by pmr-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Giles Weaver wrote:
> I'm developing a transcriptomics database for use with next-gen data, and
> have found processing the raw data to be a big hurdle.
>
> I'm a bit late in responding to this thread, so most issues have already
> been discussed. One thing that hasn't been mentioned is removal of adapters
> from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> developed and documented open source software for removal of adapters (and
> poor quality sequence) from Illumina reads.

We would like to add this to EMBOSS. Can you describe the method you
would like to use (I see you currently use a combination of bioperl and
emboss for this).

> For my purposes the tools that would love to see supported in
> bioperl/bioperl-run are:
>
>    - next-gen sequence quality parsing (to output phred scores)
>    - sequence quality based trimming
>    - sequencing adapter removal
>    - filtering based on sequence complexity (repeats, entropy etc)
>    - bioperl-run modules for bowtie etc.

We would like to see these supported in all the Open-Bio Projects and
they are a priority for EMBOSS.

Can you suggest quality filters, trimming methods, adaptor removal
methods, sequence filters and any other applications we could provide in
EMBOSS.

We hope to keep in line with what the other projects do so that EMBOSS,
bioperl, biopython etc. can be used interchangeably in pipelines.

> Obviously all of these need to be fast! .... My
> current code trims ~1300 sequences/second, including unzipping the raw data
> and converting it to sanger fastq with biopython. Processing an entire
> sequencing run with the whole pipeline takes in the region of 6-12h.

OK, we will see what speed we can reach.

> Hope this looooong post was of interest to someone!

Very interesting!

regards,

Peter Rice
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Giles Weaver :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I've just added a sequence adapter removal implementation to the bioperl
scrapbook at http://www.bioperl.org/wiki/Removing_sequencing_adapters. I
think the basic method is sound, but the implementation is ugly.

Performance wise, it currently takes around 80 minutes to remove adapters
from a ~3.2 million read Illumina run. This includes quality trimming and
grouping the sequences to reduce processing time. The quality trimming
(described
earlier in this
thread<http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030411.html>)
takes about 15 minutes, so adapter removal is definitely the bottleneck. I'm
confident that some relatively simple developments in Bioperl and/or EMBOSS
will yield some big performance improvements - if you see my sample code in
the scrapbook you'll understand why!

I've also been experimenting with sequence entropy calculations for
filtering out junk sequence.
I used Mark Jensens code at
http://www.bioperl.org/wiki/Site_entropy_in_an_alignment for inspiration.

Here is my current entropy calculation code:

sub entropy
{
    my ($seq_str, $word_size) = @_;
    my %res_counts;
    for (my $i = 0; $i <= ((length $seq_str) - $word_size); $i ++)
    {
        my $word = substr $seq_str, $i, $word_size;
        if ($word !~ /N/) { $res_counts{$word} ++; }
    }
    #~ print STDERR join (" ", keys %res_counts), "\n";
    #~ print STDERR join (" ", values %res_counts), "\n";
    my @counts = values %res_counts;
    my $word_count = sum @counts;
    map {$_ /= $word_count} @counts;
    return sum map {-$_*log2($_)} @counts;
}

sub log2
{
    my $n = shift;
    return log($n)/log(2);
}

I don't know if this does "the right thing", and have yet to determine a
suitable word size and entropy threshold for sequence filtering, so feel
free to comment/test away.

Giles

2009/7/6 Peter Rice <pmr@...>

> Giles Weaver wrote:
> > I'm developing a transcriptomics database for use with next-gen data, and
> > have found processing the raw data to be a big hurdle.
> >
> > I'm a bit late in responding to this thread, so most issues have already
> > been discussed. One thing that hasn't been mentioned is removal of
> adapters
> > from raw Illumina sequence. This is a PITA, and I'm not aware of any well
> > developed and documented open source software for removal of adapters
> (and
> > poor quality sequence) from Illumina reads.
>
> We would like to add this to EMBOSS. Can you describe the method you
> would like to use (I see you currently use a combination of bioperl and
> emboss for this).
>
> > For my purposes the tools that would love to see supported in
> > bioperl/bioperl-run are:
> >
> >    - next-gen sequence quality parsing (to output phred scores)
> >    - sequence quality based trimming
> >    - sequencing adapter removal
> >    - filtering based on sequence complexity (repeats, entropy etc)
> >    - bioperl-run modules for bowtie etc.
>
> We would like to see these supported in all the Open-Bio Projects and
> they are a priority for EMBOSS.
>
> Can you suggest quality filters, trimming methods, adaptor removal
> methods, sequence filters and any other applications we could provide in
> EMBOSS.
>
> We hope to keep in line with what the other projects do so that EMBOSS,
> bioperl, biopython etc. can be used interchangeably in pipelines.
>
> > Obviously all of these need to be fast! .... My
> > current code trims ~1300 sequences/second, including unzipping the raw
> data
> > and converting it to sanger fastq with biopython. Processing an entire
> > sequencing run with the whole pipeline takes in the region of 6-12h.
>
> OK, we will see what speed we can reach.
>
> > Hope this looooong post was of interest to someone!
>
> Very interesting!
>
> regards,
>
> Peter Rice
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by pmr-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Giles Weaver wrote:
> I've just added a sequence adapter removal implementation to the bioperl
> scrapbook at http://www.bioperl.org/wiki/Removing_sequencing_adapters. I
> think the basic method is sound, but the implementation is ugly.

Ugly perhaps, but I'll look anyway :-)

I see you don't use needle because it creates gapped alignments, but
that can be fixed with a sufficiently high gap penalty (just to see if
it works - it won't be fast).

We also have word-based matching methods in EMBOSS but they would not
allow mismatches. I will play with alternatives and see what works best.
Some word-based seed should allow for a faster solution.

The provisional EMBOSS name for a quality filter and adaptor removal
application is "quaffle"

regards,

Peter Rice
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Jul 8, 2009, at 10:57 AM, Peter Rice wrote:

> Giles Weaver wrote:
>> I've just added a sequence adapter removal implementation to the  
>> bioperl
>> scrapbook at http://www.bioperl.org/wiki/ 
>> Removing_sequencing_adapters. I
>> think the basic method is sound, but the implementation is ugly.
>
> Ugly perhaps, but I'll look anyway :-)
>
> I see you don't use needle because it creates gapped alignments, but
> that can be fixed with a sufficiently high gap penalty (just to see if
> it works - it won't be fast).
>
> We also have word-based matching methods in EMBOSS but they would not
> allow mismatches. I will play with alternatives and see what works  
> best.
> Some word-based seed should allow for a faster solution.
>
> The provisional EMBOSS name for a quality filter and adaptor removal
> application is "quaffle"
>
> regards,
>
> Peter Rice

In the meantime, we can probably add this in to Bio::SeqUtils for  
general use as an exported method.

It would be nice to get some regression tests going for this to make  
sure it does what we expect, so maybe some test data and expected  
results?

chris
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Robert Buels :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Giles Weaver wrote:
> takes about 15 minutes, so adapter removal is definitely the bottleneck. I'm
> confident that some relatively simple developments in Bioperl and/or EMBOSS
> will yield some big performance improvements - if you see my sample code in

Apropos this kind of thing, have you guys already discussed using lazy
object creation for objects returned from bioperl parsers?  Not really
relevant in the short term, but it could be a useful avenue to pursue
for addressing some performance concerns people (like ebi) have.

In very vague terms, one would probably implement this by defining a
very light-weight role/class  called something like Bio::LazyInflator,
that would provide only an `inflate` method.  Parsers would parse into
lightweight structures (probably arrayrefs) that implement LazyInflator
and users could choose between grabbing data out of the uninflated
arrayref directly, or they could call inflate() on it to transform it
into a real object (like a Bio::Annotation or Bio::Seq or something).

The exact implementation of this would vary depending on whether Moose
is being used.

This could potentially also be compatible with having some of the tight
parsing loops be implemented in XS.

Rob
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Jul 8, 2009, at 5:45 PM, Robert Buels wrote:

> Giles Weaver wrote:
>> takes about 15 minutes, so adapter removal is definitely the  
>> bottleneck. I'm
>> confident that some relatively simple developments in Bioperl and/
>> or EMBOSS
>> will yield some big performance improvements - if you see my sample  
>> code in
>
> Apropos this kind of thing, have you guys already discussed using  
> lazy object creation for objects returned from bioperl parsers?  Not  
> really relevant in the short term, but it could be a useful avenue  
> to pursue for addressing some performance concerns people (like ebi)  
> have.

There are some lazy parsers for SearchIO, but each of those has  
specific classes geared towards the SearchIO format, an issue I worry  
about.  I'm not sure about going down the path of having a  
Bio::Search::Result::FooResult, Bio::Search::Hit::FooHit, and  
Bio::Search::HSP::FooHSP for each 'Foo' format. The same thing could  
occur with SeqIO, TreeIO, etc.  A possible maintenance nightmare.

What I would like to see are generic lazy implementations for some of  
the various class, primarily Seq, AnnotationCollection, FeatureHolder/
Collection, etc, and parsers pass in just the necessary data (lazy  
implies file points or stream points).  This may not be terribly hard  
to do if using iterators, but (as you may have seen) many of the  
current methods are greedily defined, so new interface methods would  
need to be drawn up (and older ones refactored to work with newer ones).

> In very vague terms, one would probably implement this by defining a  
> very light-weight role/class  called something like  
> Bio::LazyInflator, that would provide only an `inflate` method.  
> Parsers would parse into lightweight structures (probably arrayrefs)  
> that implement LazyInflator and users could choose between grabbing  
> data out of the uninflated arrayref directly, or they could call  
> inflate() on it to transform it into a real object (like a  
> Bio::Annotation or Bio::Seq or something).

I would go one step further and reimplement the various  
AnnotationCollection/featureHolder methods in terms of a completely  
lazy implementation (i.e. parses the file or stream into a lazy Seq).  
See SwissKnife for instance.

> The exact implementation of this would vary depending on whether  
> Moose is being used.

This may be an area where optimization via Moose may not matter as  
much.  It would be best to attempt some of this initially in bioperl,  
then port to Moose/Bio::Moose.

> This could potentially also be compatible with having some of the  
> tight parsing loops be implemented in XS.
>
> Rob

That's where it'll get a little trickier; you would probably need a  
decent grammar to get everything out the way you want it, or at least  
parse everything event-based, and other grammars would have to have  
similarly named rules/tokens so the same action could be tied to the  
data being parsed.  I had a first go at generic parsing in the  
gbdriver/embldriver/swissdriver modules, which just pass data chunks  
to the handler object (which could do anything it wants with the  
data).  The only thing not passed in yet are file points.  That needs  
to be fleshed out more when I have the tuits, but you are more than  
welcome to look.

Also, just to note (and something to think about): Perl6 has this  
'solved' to a large degree with grammar/action combinations, where you  
define a grammar for a particular format and attach an Action class to  
process everything:

my $action = MyActionClass.new();

while Bio::Grammar::Fasta.parse($filehandle, :action($action)) {
    # do interesting things with data from $action
}

In this case the Action class could create a Seq out of all the data,  
or possibly create something much more lightweight and lazily  
evaluated (for instance, use the file points instead of the actual  
text).  The grammar in this case would essentially be C- or PIR-based  
I believe.

Note the quotes above with 'solved'; with Rakudo you can almost do  
this now, however some of the Perl 6 specification needs to be fleshed  
out re: Grammars, and the grammar engine for Parrot (PGE) needs to be  
properly set up for iteration through a stream.  There is  enough  
interest that I think things could be worked out fairly quickly (e.g.  
months, not years).

chris
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Parent Message unknown Re: Next-gen modules

by Matthew Vaughn-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

A lot of what is being discussed is handled very elegantly by Assaf  
Gordon's FASTX toolkit <http://hannonlab.cshl.edu/fastx_toolkit/>. I  
spent a lot of time trying to roll my own solutions for basic Illumina  
processing and I've found his utilities to work much more reliably and  
very fast (almost real-time) than anything I could design in Perl.  
They are also the basis for Illumina handling in Galaxy, which is a  
second vote of confidence.

They've got clean CLI interfaces and should be very easy to wrap in  
Bio::SeqUtils  or Bio::Run packages.

Matt

--
Matthew W. Vaughn, Ph.D.
Research Assistant Professor
Cold Spring Harbor Laboratory
1 Bungtown Road
Williams #5
Cold Spring Harbor, NY 11724 USA

tel: (516) 367-8808
cell: (516) 353-7055
google-talk: matt.vaughn@...





_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Looks very promising.  Do you know if it's capable of reporting back  
indices, e.g. for building flat-file databases?

chris

On Jul 9, 2009, at 3:42 PM, Matthew Vaughn wrote:

> A lot of what is being discussed is handled very elegantly by Assaf  
> Gordon's FASTX toolkit <http://hannonlab.cshl.edu/fastx_toolkit/>. I  
> spent a lot of time trying to roll my own solutions for basic  
> Illumina processing and I've found his utilities to work much more  
> reliably and very fast (almost real-time) than anything I could  
> design in Perl. They are also the basis for Illumina handling in  
> Galaxy, which is a second vote of confidence.
>
> They've got clean CLI interfaces and should be very easy to wrap in  
> Bio::SeqUtils  or Bio::Run packages.
>
> Matt
>
> --
> Matthew W. Vaughn, Ph.D.
> Research Assistant Professor
> Cold Spring Harbor Laboratory
> 1 Bungtown Road
> Williams #5
> Cold Spring Harbor, NY 11724 USA
>
> tel: (516) 367-8808
> cell: (516) 353-7055
> google-talk: matt.vaughn@...
>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Peter Cock :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed, Jul 8, 2009 at 5:24 PM, Chris Fields<cjfields@...> wrote:
>
> It would be nice to get some regression tests going for this to make sure it
> does what we expect, so maybe some test data and expected results?
>

Regression tests for BioPerl's FASTQ support would of course
be sensible. In terms of sample data and expected results...

I've got some test files put together for Biopython, and I have
been cross checking Biopython's FASTQ support against
EMBOSS 6.1.0 which has turned up a few issues:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html

------------------------------------------------------------------------------

I'd like to get comparisons against BioPerl's new FASTQ support
going too. To do this I'd need to know which (branch?) of BioPerl I
should install, and I'd also like a trivial sample BioPerl script to do
piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
the Sanger Standard FASTQ).

i.e. Something like this four line Biopython script would be perfect:
http://biopython.org/wiki/Reading_from_unix_pipes

------------------------------------------------------------------------------

Peter Rice and I have also been talking about line wrapping when
writing FASTQ output, and if this is a good idea or not:
http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000593.html

Thanks!

Peter C. (@Biopython)
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Aaron Mackey :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Jul 23, 2009 at 7:31 AM, Peter Cock <p.j.a.cock@...>wrote:

> Peter Rice and I have also been talking about line wrapping when
> writing FASTQ output, and if this is a good idea or not:
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000593.html
>

>From the perspective of many user scripts that expect (irrationally or
otherwise) that FASTQ data not be line-wrapped, I'd argue against
line-wrapping.  The days of "human readable and editable" formats must be
coming to an end soon ...

Just another 2 cents on the matter,

-Aaron
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Jul 23, 2009, at 6:31 AM, Peter Cock wrote:

> On Wed, Jul 8, 2009 at 5:24 PM, Chris Fields<cjfields@...>  
> wrote:
>>
>> It would be nice to get some regression tests going for this to  
>> make sure it
>> does what we expect, so maybe some test data and expected results?
>>
>
> Regression tests for BioPerl's FASTQ support would of course
> be sensible. In terms of sample data and expected results...
>
> I've got some test files put together for Biopython, and I have
> been cross checking Biopython's FASTQ support against
> EMBOSS 6.1.0 which has turned up a few issues:
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html
>
> ------------------------------------------------------------------------------
>
> I'd like to get comparisons against BioPerl's new FASTQ support
> going too. To do this I'd need to know which (branch?) of BioPerl I
> should install, and I'd also like a trivial sample BioPerl script to  
> do
> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
> the Sanger Standard FASTQ).

You would have to install svn (bioperl-live) if you want the  
refactored fastq.  That commit was within the last month.

> i.e. Something like this four line Biopython script would be perfect:
> http://biopython.org/wiki/Reading_from_unix_pipes

We use named parameters so it's a little more verbose.

use Bio::SeqIO;
my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
while (my $seq = $in->next_seq) { $out->write_seq($seq) }

Don't be surprised if there are still bugs lurking about, just let me  
know and I'll fix 'em.

> ------------------------------------------------------------------------------
>
> Peter Rice and I have also been talking about line wrapping when
> writing FASTQ output, and if this is a good idea or not:
> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000593.html
>
> Thanks!
>
> Peter C. (@Biopython)

BTW, I think the bioperl parser does handle line-wrapped FASTQ now.

Anyway, I tend to agree with Aaron on that point.  Too many exceptions  
to the rule make it harder to write parsers for human-readable format.

chris

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Next-gen modules

by Peter-329 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Jul 23, 2009 at 11:58 PM, Chris Fields<cjfields@...> wrote:
>> i.e. Something like this four line Biopython script would be perfect:
>> http://biopython.org/wiki/Reading_from_unix_pipes
>
> We use named parameters so it's a little more verbose.
>
> use Bio::SeqIO;
> my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
> my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
> while (my $seq = $in->next_seq) { $out->write_seq($seq) }

Thanks. So that implicitly uses STDOUT for the output?

> Don't be surprised if there are still bugs lurking about, just let me know
> and I'll fix 'em.

Have you guys (BioPerl) have also gone for "fastq-sanger" instead of
just "fastq" for the Sanger Standard version of FASTQ (like EMBOSS)?
Does BioPerl use just "fastq" to mean anything?

If BioPerl and EMBOSS are using "fastq-sanger", I think Biopython will
have to support that as an alias too:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006416.html

Thanks,

Peter

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l
< Prev | 1 - 2 - 3 - 4 - 5 | Next >