|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 - 3 - 4 - 5 | Next > |
|
|
Re: Next-gen modulesHi All,
Apropos of this, I am about to release to CPAN a BioPerl interface to SAM and BAM files. The documentation is still in progress, but you can get CVS access here: % cvs -d :pserver:anonymous@...:/cvsroot/gmod co gbrowse-adaptors/Bio-SamTools Lincoln On Wed, Jun 17, 2009 at 7:29 AM, Elia Stupka <e.stupka@...> wrote: > Dear all, > > after several years of absence I am slowly coming back to Bioperl, and hope > to contribute again to its development. > > One area that I was thinking of starting from, since we are actively > involved with it, is to improve BIoperl's support fo next-gen sequencing > data, tools, etc. Since I am sure I have missed out on a lot of recent > developments, do let me know if/what is useful. > > One example that comes to mind is that the conversion of various formats > to/from FASTQ does not seem to be supported. Some code can be found within > Li Heng's script: http://maq.sourceforge.net/fq_all2std.pl but it would be > good if it could make its way into SeqIO? And similarly, potentially, for > other next-gen sequence formats? > > Similarly, there seems to be little in bioperl-run to support tools that > have been developed in this area, such as Maq, BowTie, TopHat, etc? > > Do let me know if there is a past thread on this, or other people actively > developing, etc. so that I can find out what priorities are. > > thanks and best regards to all (old friends and new), > > Elia > > --- > Senior Lecturer, Bioinformatics > UCL Cancer Institute > Paul O' Gorman Building > University College London > Gower Street > WC1E 6BT > London > UK > > Office (UCL): +44 207 679 6493 > Office (ICMS): +44 0207 8822374 > > Mobile: +44 7597 566 194 > Mobile (Italy): +39 338 8448801 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Lincoln D. Stein Director, Informatics and Biocomputing Platform Ontario Institute for Cancer Research 101 College St., Suite 800 Toronto, ON, Canada M5G0A3 416 673-8514 Assistant: Renata Musa <Renata.Musa@...> _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Wed, Jun 17, 2009 at 6:06 PM, Chris Fields wrote:
> Peter wrote: >> Other issues to keep in mind: >> >> (3) There should be no warning parsing files where the optional repeated >> title is missing on the "+" lines (as discussed earlier on the BioPerl >> list). > > Agreed, though we'll have to check the current fastq parser to see if that's > currently the case. I thought that was fixed but maybe not? > >> (4) When writing FASTQ files should BioPerl omit the optional repeated >> title on the "+" line? Biopython omits this as I understand this to be >> common practice, and can make a big different to file sizes - especially >> on short read data from Solexa/Illumina. > > Agreed, particularly if it's commonly encountered. > >> (5) Also test reading and writing files with an optional description (as >> well as an identifier) on the "@" (and "+") lines. See the NCBI SRA >> for examples, e.g. >> >> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 >> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC >> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 >> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC > > Should be easy enough to implement with a simple regex. > >> (6) Test reading and writing files where the encoded quality string starts >> with a "@" or a "+" character, e.g. >> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html >> >> Peter > > Mark, getting all that? ;> > > chris Another couple of points that I should have remembered earlier, related to converting between PHRED scores and Solexa scores. On the bright side, with Illumina abandoning the Solexa scores in pipeline 1.3+, these issues will go away with time: (7) If BioPerl will be converting Solexa scores to/from PHRED scores as integers automatically (as discussed earlier), make sure you round to the nearest whole number (don't just truncate with a call to int!). MAQ does this by adding 0.5 before calling int (while in Biopython I just use Python's round function). (8) When asked to write out an old Solexa style FASTQ file, what will you do if given a standard Sanger FASTQ file (or a new Illumina 1.3+ FASTQ file) containing a base with PHRED quality zero? This maps to a Solexa quality of minus infinity... Right now the development version of Biopython will throw an error in this situation, but mapping to the lowest observed Solexa score might be reasonable. Peter _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Jun 22, 2009, at 9:24 AM, Peter wrote:
> On Wed, Jun 17, 2009 at 6:06 PM, Chris Fields wrote: >> Peter wrote: >>> Other issues to keep in mind: >>> >>> (3) There should be no warning parsing files where the optional >>> repeated >>> title is missing on the "+" lines (as discussed earlier on the >>> BioPerl >>> list). >> >> Agreed, though we'll have to check the current fastq parser to see >> if that's >> currently the case. I thought that was fixed but maybe not? >> >>> (4) When writing FASTQ files should BioPerl omit the optional >>> repeated >>> title on the "+" line? Biopython omits this as I understand this >>> to be >>> common practice, and can make a big different to file sizes - >>> especially >>> on short read data from Solexa/Illumina. >> >> Agreed, particularly if it's commonly encountered. >> >>> (5) Also test reading and writing files with an optional >>> description (as >>> well as an identifier) on the "@" (and "+") lines. See the NCBI SRA >>> for examples, e.g. >>> >>> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 >>> GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC >>> +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 >>> IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC >> >> Should be easy enough to implement with a simple regex. >> >>> (6) Test reading and writing files where the encoded quality >>> string starts >>> with a "@" or a "+" character, e.g. >>> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/029911.html >>> >>> Peter >> >> Mark, getting all that? ;> >> >> chris > > Another couple of points that I should have remembered earlier, > related to converting between PHRED scores and Solexa scores. > On the bright side, with Illumina abandoning the Solexa scores > in pipeline 1.3+, these issues will go away with time: > > (7) If BioPerl will be converting Solexa scores to/from PHRED > scores as integers automatically (as discussed earlier), make > sure you round to the nearest whole number (don't just truncate > with a call to int!). MAQ does this by adding 0.5 before calling > int (while in Biopython I just use Python's round function). That can probably be done with sprintf if needed. It avoids a call to POSIX functions. > (8) When asked to write out an old Solexa style FASTQ file, > what will you do if given a standard Sanger FASTQ file (or a > new Illumina 1.3+ FASTQ file) containing a base with PHRED > quality zero? This maps to a Solexa quality of minus infinity... > Right now the development version of Biopython will throw an > error in this situation, but mapping to the lowest observed > Solexa score might be reasonable. > > Peter Maybe address with a warning followed by assigning to the lowest solexa score? chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesWe just added FASTQ parsing to EMBOSS and faced the same issues.
Parsing was easy - find the '@' line, read sequence until the '+' line is reached, then read (seqlen) quality characters ... and check the next line starts with '@' Quality scores are kept as phred values. Phred of 0 means unknown, which in Solexa is -5 (0.75 error rate = could be anything). We assume lower quality scores are from alignments rather than single reads. We gave up on trying to guess the quality score standard and require users to say whether they are sanger, solexa (1.0) or Illumina (1.3) format files. If we only want the sequence then we don't care so we allow "fastq" as a sequence format and ignore the quality scores in that case. We also allow the integer quality score format ... is anyone still using that (it looks horrible to me :-) Code is in the EMBOSS CVS, and will appear in release 6.1.0 on July 15th. Any further tips would be very useful. regards, Peter Rice _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr@...> wrote:
> > We just added FASTQ parsing to EMBOSS and faced the same issues. > I was going to chat to you about this at BOSC, and suggest this be added to EMBOSS - but you are well ahead of me ;) > Parsing was easy - find the '@' line, read sequence until the '+' line > is reached, then read (seqlen) quality characters ... and check the next > line starts with '@' That is basically what I did for Biopython. > Quality scores are kept as phred values. Phred of 0 means unknown, > which in Solexa is -5 (0.75 error rate = could be anything). A Phred quality of 0 means probability of error is 1, so yes, unknown. I don't quite follow your leap that this corresponds to a Solexa quality of -5. Could you clarify? > We assume lower quality scores are from alignments rather than single reads. Did you mean to say "higher quality scores" (i.e. lower probability of error), e.g a PHRED score of 80 which you can get from MAQ doing read mapping or something consensus based. > We gave up on trying to guess the quality score standard and require > users to say whether they are sanger, solexa (1.0) or Illumina (1.3) > format files. If we only want the sequence then we don't care so we allow > "fastq" as a sequence format and ignore the quality scores in that case. What format names have you used? Ideally we'd have the same names in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and "fastq-illumina"). > We also allow the integer quality score format ... is anyone still using > that (it looks horrible to me :-) Do you mean the QUAL file format holding PHRED scores? Roche provide tools to turn their SFF files into FASTA and QUAL files, so they are still used. > Code is in the EMBOSS CVS, and will appear in release 6.1.0 on July 15th. > > Any further tips would be very useful. Great. See you at BOSC 2009! Peter (Biopython) _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesPeter wrote:
> On Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr@...> wrote: >> We just added FASTQ parsing to EMBOSS and faced the same issues. >> > > I was going to chat to you about this at BOSC, and suggest this be > added to EMBOSS - but you are well ahead of me ;) Not that well ahead really ... someone asked for it in our BoF at BOSC/ISMB last year so we thought we'd better get it done before this one. it was implemented a couple of days ago :-) >> Parsing was easy - find the '@' line, read sequence until the '+' line >> is reached, then read (seqlen) quality characters ... and check the next >> line starts with '@' > > That is basically what I did for Biopython. > >> Quality scores are kept as phred values. Phred of 0 means unknown, >> which in Solexa is -5 (0.75 error rate = could be anything). > > A Phred quality of 0 means probability of error is 1, so yes, unknown. I don't > quite follow your leap that this corresponds to a Solexa quality of -5. Could > you clarify? Phred score is -10 log(p) where p is the probability of error. A phred of 0 implies 1.0 (certainty) of error, but 0.75 is a better estimate (3/4 chance that any base you pick is wrong). Solexa scores are -10 log(p/(1-p)) so p=0.75 comes out at -5. This is why Solexa scores can go down to -5 in their fastq format. >> We assume lower quality scores are from alignments rather than single reads. > > Did you mean to say "higher quality scores" (i.e. lower probability of error), > e.g a PHRED score of 80 which you can get from MAQ doing read mapping > or something consensus based. Actually I mean both. Error probabilities below 0.75 for a single base are silly, and error probabilities below 0.0001 make sense only when two or more high quality bases are aligned. >> We gave up on trying to guess the quality score standard and require >> users to say whether they are sanger, solexa (1.0) or Illumina (1.3) >> format files. If we only want the sequence then we don't care so we allow >> "fastq" as a sequence format and ignore the quality scores in that case. > > What format names have you used? Ideally we'd have the same names > in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and > "fastq-illumina"). We don't normally use '-' in our format names so we have fastqsanger, fastqsolexa, fastqillumina and fastqint. None of these have been tried on users as yet. The '-' names look nice though. We can consider introducing them. Do you have a full list of format names (sequence, feature, alignment, etc.) we can try to conform to? >> We also allow the integer quality score format ... is anyone still using >> that (it looks horrible to me :-) > > Do you mean the QUAL file format holding PHRED scores? Roche provide > tools to turn their SFF files into FASTA and QUAL files, so they are still used. Probably ... unless there is a Solexa version too. regards, Peter _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Jun 23, 2009, at 7:22 AM, Peter Rice wrote: > Peter wrote: > ... >>> Parsing was easy - find the '@' line, read sequence until the '+' >>> line >>> is reached, then read (seqlen) quality characters ... and check >>> the next >>> line starts with '@' >> >> That is basically what I did for Biopython. This is now what bioperl will do (at least when I commit changes today or tomorrow). > ... >>> We gave up on trying to guess the quality score standard and require >>> users to say whether they are sanger, solexa (1.0) or Illumina (1.3) >>> format files. If we only want the sequence then we don't care so >>> we allow >>> "fastq" as a sequence format and ignore the quality scores in that >>> case. >> >> What format names have you used? Ideally we'd have the same names >> in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and >> "fastq-illumina"). > > We don't normally use '-' in our format names so we have fastqsanger, > fastqsolexa, fastqillumina and fastqint. None of these have been tried > on users as yet. > > The '-' names look nice though. We can consider introducing them. Do > you > have a full list of format names (sequence, feature, alignment, > etc.) we > can try to conform to? We (bioperl) are using biopython's convention of format-variant, or at least that's how I'm coding it up. With SeqIO it's fairly easy to check for the format variant prior to loading the class and pass it in as a second named parameter. I have actually thought of adding in fastqint as an option (it would be fairly easy to do). chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesJust so we're on the same page data-wise, would there be a common set
of fastq data files to use for tests? I am using some from SRA (which is all converted to Sanger). Just need a few small ones for older solexa and newer illumina. chris On Jun 23, 2009, at 6:29 AM, Peter wrote: > On Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr@...> wrote: > >> Code is in the EMBOSS CVS, and will appear in release 6.1.0 on July >> 15th. >> >> Any further tips would be very useful. > > Great. See you at BOSC 2009! > > Peter > (Biopython) Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Tue, Jun 23, 2009 at 1:22 PM, Peter Rice<pmr@...> wrote:
> Peter wrote: >> On Tue, Jun 23, 2009 at 12:00 PM, Peter Rice<pmr@...> wrote: >>> We just added FASTQ parsing to EMBOSS and faced the same issues. >>> >> >> I was going to chat to you about this at BOSC, and suggest this be >> added to EMBOSS - but you are well ahead of me ;) > > Not that well ahead really ... someone asked for it in our BoF at > BOSC/ISMB last year so we thought we'd better get it done before this > one. it was implemented a couple of days ago :-) > Well, ahead of my asking! >>> Quality scores are kept as phred values. Phred of 0 means unknown, >>> which in Solexa is -5 (0.75 error rate = could be anything). >> >> A Phred quality of 0 means probability of error is 1, so yes, unknown. I don't >> quite follow your leap that this corresponds to a Solexa quality of -5. Could >> you clarify? > > Phred score is -10 log(p) where p is the probability of error. A phred > of 0 implies 1.0 (certainty) of error, but 0.75 is a better estimate > (3/4 chance that any base you pick is wrong). > > Solexa scores are -10 log(p/(1-p)) so p=0.75 comes out at -5. This is > why Solexa scores can go down to -5 in their fastq format. > >>> We assume lower quality scores are from alignments rather than >>> single reads. >> >> Did you mean to say "higher quality scores" (i.e. lower probability of error), >> e.g a PHRED score of 80 which you can get from MAQ doing read mapping >> or something consensus based. > > Actually I mean both. Error probabilities below 0.75 for a single base > are silly, and error probabilities below 0.0001 make sense only when two > or more high quality bases are aligned. I see what you mean - a probability of error of 0.75 matches that for a random base call, obvious when you put it like that. Of course, there is this nasty little thought at the back of my mind that sooner or later someone will use FASTQ files for proteins (e.g. from some mass-spec protein sequencing). A probability less than that (e.g. 0) is actually worse than random and could be considered as mean "we're pretty sure this isn't the stated letter". But that would be silly, as you say. >>> We gave up on trying to guess the quality score standard and require >>> users to say whether they are sanger, solexa (1.0) or Illumina (1.3) >>> format files. If we only want the sequence then we don't care so we allow >>> "fastq" as a sequence format and ignore the quality scores in that case. >> >> What format names have you used? Ideally we'd have the same names >> in EMBOSS, BioPerl and Biopython (i.e. "fastq", "fastq-solexa", and >> "fastq-illumina"). > > We don't normally use '-' in our format names so we have fastqsanger, > fastqsolexa, fastqillumina and fastqint. None of these have been tried > on users as yet. > > The '-' names look nice though. We can consider introducing them. Do you > have a full list of format names (sequence, feature, alignment, etc.) we > can try to conform to? See http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO Getting EMBOSS to conforming should be trivial - in general when picking a format name for Biopython's SeqIO or AlignIO (and we have avoided multiple aliases with one exception) we have tried to use anything shared by BioPerl and EMBOSS. The FASTQ variants are unusual in that Biopython got to invent some names. In future where would be a good place to discuss these kinds of cross-platform issues? (i.e. BioPerl, Biopython, EMBOSS, etc). >>> We also allow the integer quality score format ... is anyone still >>> using that (it looks horrible to me :-) >> >> Do you mean the QUAL file format holding PHRED scores? >> Roche provide tools to turn their SFF files into FASTA and >> QUAL files, so they are still used. > > Probably ... unless there is a Solexa version too. We may be talking at cross purposes here, this is QUAL format: http://www.bioperl.org/wiki/Qual_sequence_format Peter _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesPeter wrote:
> On Tue, Jun 23, 2009 at 1:22 PM, Peter Rice<pmr@...> wrote: >> The '-' names look nice though. We can consider introducing them. Do you >> have a full list of format names (sequence, feature, alignment, etc.) we >> can try to conform to? > > See http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO Thanks. I'll take a look at those. > Getting EMBOSS to conforming should be trivial - in general when > picking a format name for Biopython's SeqIO or AlignIO (and we > have avoided multiple aliases with one exception) we have tried to > use anything shared by BioPerl and EMBOSS. The FASTQ variants > are unusual in that Biopython got to invent some names. > > In future where would be a good place to discuss these kinds of > cross-platform issues? (i.e. BioPerl, Biopython, EMBOSS, etc). I was planning to suggest a get-together at BOSC in Stockholm so we can identify common cross-platform issues. I'm sure there are many ways we can conform with naming and interfaces and perhaps even share code. >>>> We also allow the integer quality score format ... is anyone still >>>> using that (it looks horrible to me :-) >>> Do you mean the QUAL file format holding PHRED scores? >>> Roche provide tools to turn their SFF files into FASTA and >>> QUAL files, so they are still used. >> Probably ... unless there is a Solexa version too. > > We may be talking at cross purposes here, this is QUAL format: > http://www.bioperl.org/wiki/Qual_sequence_format Yes that is different. We'll worry about separate QUAL files later (we already find separate GFF files a pain for features) and still with the "fastqint" format name. regards, Peter _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Wed, Jun 24, 2009 at 12:48 PM, Peter Rice<pmr@...> wrote:
> > I was planning to suggest a get-together at BOSC in Stockholm so we can > identify common cross-platform issues. I'm sure there are many ways we > can conform with naming and interfaces and perhaps even share code. > That would be a good idea - but while there are quite a few Biopython people at BOSC this year, I don't know if there will be many from BioPerl (there isn't a BioPerl update talk scheduled). >>>>> We also allow the integer quality score format ... is anyone still >>>>> using that (it looks horrible to me :-) >>>> Do you mean the QUAL file format holding PHRED scores? >>>> Roche provide tools to turn their SFF files into FASTA and >>>> QUAL files, so they are still used. >>> Probably ... unless there is a Solexa version too. >> >> We may be talking at cross purposes here, this is QUAL format: >> http://www.bioperl.org/wiki/Qual_sequence_format > > Yes that is different. We'll worry about separate QUAL files later (we > already find separate GFF files a pain for features) and still with the > "fastqint" format name. So when you say "fastqint" are you talking about something else? Could you show us an example record in this format? Peter [I need to remember to proof read my evening emails more carefully] _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Jun 24, 2009, at 9:56 AM, Peter wrote: > On Wed, Jun 24, 2009 at 12:48 PM, Peter Rice<pmr@...> wrote: >> >> I was planning to suggest a get-together at BOSC in Stockholm so we >> can >> identify common cross-platform issues. I'm sure there are many ways >> we >> can conform with naming and interfaces and perhaps even share code. >> > > That would be a good idea - but while there are quite a few Biopython > people at BOSC this year, I don't know if there will be many from > BioPerl > (there isn't a BioPerl update talk scheduled). Most of us are caught up with other work, though I will likely be able to dedicate more time to it in the ext few months. Also doesn't help that my travel stipend doesn't start until Aug. 1. >>>>>> We also allow the integer quality score format ... is anyone >>>>>> still >>>>>> using that (it looks horrible to me :-) >>>>> Do you mean the QUAL file format holding PHRED scores? >>>>> Roche provide tools to turn their SFF files into FASTA and >>>>> QUAL files, so they are still used. >>>> Probably ... unless there is a Solexa version too. >>> >>> We may be talking at cross purposes here, this is QUAL format: >>> http://www.bioperl.org/wiki/Qual_sequence_format >> >> Yes that is different. We'll worry about separate QUAL files later >> (we >> already find separate GFF files a pain for features) and still with >> the >> "fastqint" format name. > > So when you say "fastqint" are you talking about something else? > Could you show us an example record in this format? > > Peter > [I need to remember to proof read my evening emails more carefully] The same as fastq, except the ASCII quality is converted to actual score: @4_1_912_360 AAGGGGCTAGAGAAACACGTAATGAAGGGAGGACTC +4_1_912_360 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 40 40 40 40 40 40 40 40 40 26 40 40 14 39 40 40 @4_1_54_483 TAATAAATGTGCTTCCTTGATGCATGTGCTATGATT +4_1_54_483 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 16 40 40 40 28 40 40 40 40 40 40 16 40 40 5 40 40 chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Wed, Jun 24, 2009 at 5:22 PM, Chris Fields<cjfields@...> wrote:
>> So when you say "fastqint" are you talking about something else? >> Could you show us an example record in this format? >> >> Peter > > The same as fastq, except the ASCII quality is converted to actual score: > > @4_1_912_360 > AAGGGGCTAGAGAAACACGTAATGAAGGGAGGACTC > +4_1_912_360 > 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 40 40 40 40 40 > 40 40 40 40 26 40 40 14 39 40 40 > @4_1_54_483 > TAATAAATGTGCTTCCTTGATGCATGTGCTATGATT > +4_1_54_483 > 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 16 40 40 40 28 40 > 40 40 40 40 40 16 40 40 5 40 40 OK - and who uses this "Integer FASTQ" files? Peter _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Jun 24, 2009, at 11:27 AM, Peter wrote: > On Wed, Jun 24, 2009 at 5:22 PM, Chris Fields<cjfields@...> > wrote: >>> So when you say "fastqint" are you talking about something else? >>> Could you show us an example record in this format? >>> >>> Peter >> >> The same as fastq, except the ASCII quality is converted to actual >> score: >> >> @4_1_912_360 >> AAGGGGCTAGAGAAACACGTAATGAAGGGAGGACTC >> +4_1_912_360 >> 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 21 40 40 >> 40 40 40 >> 40 40 40 40 26 40 40 14 39 40 40 >> @4_1_54_483 >> TAATAAATGTGCTTCCTTGATGCATGTGCTATGATT >> +4_1_54_483 >> 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 16 40 40 >> 40 28 40 >> 40 40 40 40 40 16 40 40 5 40 40 > > OK - and who uses this "Integer FASTQ" files? > > Peter Not sure, but it is covered by MAQ via the conversion script (as FASTQ- int): http://maq.sourceforge.net/fq_all2std.pl chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesChris Fields wrote:
> Not sure, but it is covered by MAQ via the conversion script (as > FASTQ-int): Are the scores phred or Solexa? Peter Rice _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Jun 24, 2009, at 5:44 PM, Peter Rice wrote: > Chris Fields wrote: >> Not sure, but it is covered by MAQ via the conversion script (as >> FASTQ-int): > > Are the scores phred or Solexa? > > Peter Rice Not sure actually. The perl script I linked to looks like it converts using the same scale as solexa (illumina 1.0). chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesI'm developing a transcriptomics database for use with next-gen data, and
have found processing the raw data to be a big hurdle. I'm a bit late in responding to this thread, so most issues have already been discussed. One thing that hasn't been mentioned is removal of adapters from raw Illumina sequence. This is a PITA, and I'm not aware of any well developed and documented open source software for removal of adapters (and poor quality sequence) from Illumina reads. My current Illumina sequence processing pipeline is an unholy mix of biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting the Illumina fastq to Sanger fastq, bioperl to read the quality values, pure perl to trim the poor quality sequence from each read, and bioperl with emboss to remove the adapter sequence. I'm aware that the pipeline contains bugs and would like to simplify it, but at least it does work... Ideally I'd like to replace as much of the pipeline as possible with bioperl/bioperl-run, but this isn't currently possible due to both a lack of features and poor performance. I'm sure the features will come with time, but the performance is more of a concern to me. I wonder if Bio::Moose might be used to alleviate some of the performance issues? Might next-gen modules be an ideal guinea pig for Bio::Moose? For my purposes the tools that would love to see supported in bioperl/bioperl-run are: - next-gen sequence quality parsing (to output phred scores) - sequence quality based trimming - sequencing adapter removal - filtering based on sequence complexity (repeats, entropy etc) - bioperl-run modules for bowtie etc. Obviously all of these need to be fast! I'd love to muck in, but I doubt I'll contribute much before Bio::Moose/bioperl6, as the (bio)perl object system gives me nightmares! Regarding trimming bad quality bases (see comments from Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed pure/bioperl solution to be much faster than a primarily bioperl based implementation. I found Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. My current code trims ~1300 sequences/second, including unzipping the raw data and converting it to sanger fastq with biopython. Processing an entire sequencing run with the whole pipeline takes in the region of 6-12h. Hope this looooong post was of interest to someone! Giles 2009/6/17 Tristan Lefebure <tristan.lefebure@...> > Hello, > Regarding next-gen sequences and bioperl, following my > experience, another issue is bioperl speed. For example, if > you want to trim bad quality bases at ends of 1E6 Solexa > reads using Bio::SeqIO::fastq and some methods in > Bio::Seq::Quality, well, you've got to be patient (but may > be I missed some shortcuts...). > > A pure perl solution will be between 100 to 1000x faster... > Would it be possible to have an ultra-light quality object > with few simple methods for next-gen reads? > > I can contribute some tests if that sounds like an important > point. > > -Tristan > Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesOn Jun 30, 2009, at 6:28 AM, Giles Weaver wrote: > I'm developing a transcriptomics database for use with next-gen > data, and > have found processing the raw data to be a big hurdle. > > I'm a bit late in responding to this thread, so most issues have > already > been discussed. One thing that hasn't been mentioned is removal of > adapters > from raw Illumina sequence. This is a PITA, and I'm not aware of any > well > developed and documented open source software for removal of > adapters (and > poor quality sequence) from Illumina reads. > > My current Illumina sequence processing pipeline is an unholy mix of > biopython, bioperl, pure perl, emboss and bowtie. Biopython for > converting > the Illumina fastq to Sanger fastq, bioperl to read the quality > values, pure > perl to trim the poor quality sequence from each read, and bioperl > with > emboss to remove the adapter sequence. I'm aware that the pipeline > contains > bugs and would like to simplify it, but at least it does work... My local bioperl is working with FASTQ parsing of Sanger and Illumina (but not solexa yet). I'll commit what I have today, and we should be able to add in solexa soon. We'll also need to add in write_seq support. > Ideally I'd like to replace as much of the pipeline as possible with > bioperl/bioperl-run, but this isn't currently possible due to both a > lack of > features and poor performance. I'm sure the features will come with > time, > but the performance is more of a concern to me. I wonder if > Bio::Moose might > be used to alleviate some of the performance issues? Might next-gen > modules > be an ideal guinea pig for Bio::Moose? We should get FASTQ working in core first then optimize on speed (as Elia previously pointed out). We can do that within the actual SeqIO parser using a few simple tricks. For instance my local Bio::SeqIO::fastq has a reconfigured next_seq to call an iterator that returns raw processed data as a simple hash ref; users have access to that method, so if one wanted they could retrieve the raw data directly, or pass it through a filter that only creates seq instances one wants on the fly (that would be where your quality checks, adaptor modification, etc. fit in). In the end it might be to wrap a C/C++-based solution for speed. As mentioned previously a C-based parser exists from Sanger Centre that we could incorporate in some fashion, but I would like if it were able to report back file position for fast indexing. The code is fairly simple so it should be too hard to incorporate that in somehow. Just so there is no confusion, Bio::Moose is an attempt to both lay out plans for perl6 and deal with inheritance issues within bioperl now. It's still in very early development and may not see a release until Dec. at the very earliest, it will be an alpha release then, and likely won't have every major class represented at that point. It's also not intended to be backwards-compatible with bioperl core. It may help, but that's not an absolute certainty. As for bioperl6, it will be pre-alpha until perl6 spec reaches a stable draft and we have an active implementation. > For my purposes the tools that would love to see supported in > bioperl/bioperl-run are: > > - next-gen sequence quality parsing (to output phred scores) > - sequence quality based trimming > - sequencing adapter removal > - filtering based on sequence complexity (repeats, entropy etc) > - bioperl-run modules for bowtie etc. > > Obviously all of these need to be fast! > I'd love to muck in, but I doubt I'll contribute much before > Bio::Moose/bioperl6, as the (bio)perl object system gives me > nightmares! One can only read a file so fast (even with a highly optimized C/C++ based parser), but I don't think that will be the limiting factor as much as object instantiation. > Regarding trimming bad quality bases (see comments from Tristan > Lefebure) > from Solexa/Illumina reads, I did find a mixed pure/bioperl solution > to be > much faster than a primarily bioperl based implementation. I found > Bio::Seq->subseq(a,b) and Bio::Seq->subqual(a,b) to be far too slow. > My > current code trims ~1300 sequences/second, including unzipping the > raw data > and converting it to sanger fastq with biopython. Processing an entire > sequencing run with the whole pipeline takes in the region of 6-12h. Right, hence coming up with a 'pre-filter' for raw data (hash refs) prior to object instantiation to speed things up. This will be a bit easier with Bio::Moose as we can introspect attributes via the meta class, but this will be a while yet. > Hope this looooong post was of interest to someone! > > Giles It's always good to hear about such issues and what one expects. chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesAll,
I have committed the first run at adding Illumina/Solexa parsing for FASTQ along with tests. It's very possible the quality scores are off, particularly for Solexa (Illumina 1.0), so test away and let me know if anything pops up (should be a quick fix). Along with that is a small commit to Bio::SeqIO so that we can add format variants (see below for an example). write_seq/write_qual/write_fastq will likely not work as expected as I haven't touched them; they are to be tackled next. For faster parsing I have also added a next_dataset method that returns a hash reference to the parsed data instead of an object; this hash includes quality scores. This method is called by next_seq and the relevant data is passed in to the sequence factory directly; one could do something like the following to filter sequences as needed: use Modern::Perl; use Bio::SeqIO; use Bio::Seq::SeqFactory; my $file = shift; # same as (-format => 'fastq', -variant => 'illumina') my $in = Bio::SeqIO->new(-file => $file, -format => 'fastq-illumina'); my $factory = Bio::Seq::SeqFactory->new(-type => 'Bio::Seq::Quality'); while (my $data = $in->next_dataset) { next if seq_is_crap($data); my $seq = $factory->create(%$data); } sub seq_is_crap { # filter here } chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Next-gen modulesHi all (BioPerl and Biopython),
This is a continuation of a long thread on the BioPerl mailing list, which I have now CC'd to the Biopython mailing list. See: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html On this thread we have been discussing next gen sequencing tools and co-coordinating things like consistent file format naming between Biopython, BioPerl and EMBOSS. I've been chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, and he will look into setting up a cross project mailing list for this kind of discussion in future. In the mean time, my replies to Giles below cover both BioPerl and Biopython (and EMBOSS). Giles' original email is here: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html Peter On 6/30/09, Giles Weaver <giles.weaver@...> wrote: > > I'm developing a transcriptomics database for use with next-gen data, and > have found processing the raw data to be a big hurdle. > > I'm a bit late in responding to this thread, so most issues have already > been discussed. One thing that hasn't been mentioned is removal of adapters > from raw Illumina sequence. This is a PITA, and I'm not aware of any well > developed and documented open source software for removal of adapters > (and poor quality sequence) from Illumina reads. > > My current Illumina sequence processing pipeline is an unholy mix of > biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting > the Illumina fastq to Sanger fastq, bioperl to read the quality values, > pure perl to trim the poor quality sequence from each read, and bioperl > with emboss to remove the adapter sequence. I'm aware that the pipeline > contains bugs and would like to simplify it, but at least it does work... > > Ideally I'd like to replace as much of the pipeline as possible with > bioperl/bioperl-run, but this isn't currently possible due to both a lack > of features and poor performance. I'm sure the features will come with > time, but the performance is more of a concern to me. .. I gather you would rather work with (Bio)Perl, but since you are already using Biopython to do the FASTQ conversion, you could also use it for more of your pipe line. Our tutorial includes examples of simple FASTQ quality filtering, and trimming of primer sequences (something like this might be helpful for removing adaptors). See: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Alternatively, with the new release of EMBOSS this July, you will also be able to do the Illumina FASTQ to Sanger standard FASTQ with EMBOSS, and I'm sure BioPerl will offer this soon too. > Regarding trimming bad quality bases (see comments from > Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed > pure/bioperl solution to be much faster than a primarily bioperl > based implementation. I found Bio::Seq->subseq(a,b) and > Bio::Seq->subqual(a,b) to be far too slow. My current code trims > ~1300 sequences/second, including unzipping the raw data and > converting it to sanger fastq with biopython. Processing an entire > sequencing run with the whole pipeline takes in the region of 6-12h. There are several ways of doing quality trimming, and it would make an excellent cookbook example (both for BioPerl and Biopython). Could you go into a bit more detail about your trimming algorithm? e.g. Do you just trim any bases on the right below a certain threshold, perhaps with a minimum length to retain the trimmed read afterwards? > Hope this looooong post was of interest to someone! I was interested at least ;) Peter _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
| < Prev | 1 - 2 - 3 - 4 - 5 | Next > |
| Free embeddable forum powered by Nabble | Forum Help |