SeqIOTools deprecated, looking for alternatives // RichSeq.IOTools

by Oliver Stolpe

Hello *,

The cookbook examples use the SeqIOTools class to read files, but the
API marks it as deprecated. While looking for alternatives, I searched
the list and the internet and found that biojavax provides classes and
methods for reading files (RichSequence.IOTools).

For example, I am trying to read an EMBL file:

--begin:code--

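import java.io.BufferedReader;
import java.io.FileReader;
import org.biojavax.Namespace;
import org.biojavax.RichObjectFactory;
import org.biojavax.bio.seq.RichSequence;
import org.biojavax.bio.seq.RichSequenceIterator;

// Open the file and iterate over its EMBL entries as RichSequence objects.
BufferedReader br = new BufferedReader(new FileReader(filename));
Namespace ns = RichObjectFactory.getDefaultNamespace();
RichSequenceIterator seqs = RichSequence.IOTools.readEMBLDNA(br, ns);

while (seqs.hasNext()) {
    RichSequence seq = seqs.nextRichSequence();
    System.out.println(seq.getName() + ":" + seq.getAnnotation().asMap());
}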

--end:code--

But I always get this error message:

--begin:error--

org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
        at ReadGenbankFile.EMBL(ReadGenbankFile.java:42)
        at ReadGenbankFile.main(ReadGenbankFile.java:85)
Caused by: org.biojava.bio.seq.io.ParseException:

A Exception Has Occurred During Parsing.
Please submit the details that follow to biojava-l@... or post a bug report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.EMBLFormat
Accession=null
Id=not set
Comments=
Parse_block=ID   AJ243265_2; parent: AJ243265AC   AJ243265;FT   CDS             join(<1082..1272,2484..2638,4926..>5041)
                /codon_start=3
                /gene="PGM1"
                /product="phosphoglucomutase 1"
                /function="carbohydrate metabolism"
                /EC_number="5.4.2.2"
                /db_xref="GOA:Q9H1D2"
                /db_xref="HGNC:8905"
                /db_xref="HSSP:3PMG"
                /db_xref="InterPro:IPR016055"
                /db_xref="UniProtKB/TrEMBL:Q9H1D2"
                /protein_id="CAC19809.1"
                /translation="VGPYVKKILCEELGAPANSAVNCVPLEDFGGHHPDPNLTYAADLV
                ETMKSGEHDFGAAFDGDGDRNMILGKHGFFVNPSDSVAVIAANTFSIPYFQQTGVRGFA
                RSMPTSGALDRVASATKIALYETPTGWKFFGNLMDASKLSLCGEESFGT"SQ   Sequence   462 BP;
Stack trace follows ....


        at org.biojavax.bio.seq.io.EMBLFormat.readSection(EMBLFormat.java:775)
        at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:284)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 2 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -3
        at java.lang.String.substring(String.java:1949)
        at java.lang.String.substring(String.java:1916)
        at org.biojavax.bio.seq.io.EMBLFormat.readSection(EMBLFormat.java:761)
        ... 4 more

--end:error--

The file looks fine to me, and it works with the deprecated SeqIOTools:

--begin:embl-file--
ID   AJ243265_2; parent: AJ243265
AC   AJ243265;
FT   CDS             join(<1082..1272,2484..2638,4926..>5041)
FT                   /codon_start=3
FT                   /gene="PGM1"
FT                   /product="phosphoglucomutase 1"
FT                   /function="carbohydrate metabolism"
FT                   /EC_number="5.4.2.2"
FT                   /db_xref="GOA:Q9H1D2"
FT                   /db_xref="HGNC:8905"
FT                   /db_xref="HSSP:3PMG"
FT                   /db_xref="InterPro:IPR016055"
FT                   /db_xref="UniProtKB/TrEMBL:Q9H1D2"
FT                   /protein_id="CAC19809.1"
FT                   /translation="VGPYVKKILCEELGAPANSAVNCVPLEDFGGHHPDPNLTYAADLV
FT                   ETMKSGEHDFGAAFDGDGDRNMILGKHGFFVNPSDSVAVIAANTFSIPYFQQTGVRGFA
FT                   RSMPTSGALDRVASATKIALYETPTGWKFFGNLMDASKLSLCGEESFGT"
SQ   Sequence   462 BP;
     ttgtgggacc gtatgtaaag aagatcctct gtgaagaact cggtgcccct gcgaactcgg        60
     cagttaactg cgttcctctg gaggactttg gaggccacca ccctgacccc aacctcacct       120
     atgcagctga cctggtggag accatgaagt caggagagca tgattttggg gctgcctttg       180
     atggagatgg ggatcgaaac atgattctgg gcaagcatgg gttctttgtg aacccttcag       240
     actctgtggc tgtcattgct gccaacacct tcagcattcc gtatttccag cagactgggg       300
     tccgcggttt tgcacggagc atgcccacga gtggtgctct ggaccgggtg gctagtgcta       360
     caaagattgc tttgtatgag accccaactg gctggaagtt ttttgggaat ttgatggacg       420
     cgagcaaact gtccctttgt ggggaggaga gcttcgggac cg                          462
//
--end:embl-file--

The parser always crashes before reading the sequence (ttgt..., directly
after the BP;).

Any suggestions on how to get this to work?
Or are there other alternatives to the deprecated SeqIOTools class?

Thanks in advance,

with best regards,

Oliver
_______________________________________________
Biojava-l mailing list  -  Biojava-l@...
http://lists.open-bio.org/mailman/listinfo/biojava-l

Re: SeqIOTools deprecated, looking for alternatives // RichSeq.IOTools

by Richard Holland-6

Hello,

The file you are parsing is not a valid EMBL file. The EMBL format is specified here:

   http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_4

and this is what the file should look like for your accession:

   http://www.ebi.ac.uk/cgi-bin/emblfetch?style=html&id=AJ243265&Submit=Go

The most obvious problems in your file are the missing 'XX' section delimiters, which are required, and an invalid ID line. There may be other problems too, but I have only checked the first few lines, not the whole file.
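
As an illustration of the two problems named above, here is a minimal sketch of a pre-check that could be run on a file before passing it to RichSequence.IOTools.readEMBLDNA. It is not part of BioJava: the class EmblPreCheck and its check method are hypothetical names, and the seven-field ID line layout (accession; SV version; topology; molecule type; data class; division; length BP.) is an assumption based on the EMBL user manual, not on the BioJava source. Passing these rough checks does not prove a file is valid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a BioJava class.
final class EmblPreCheck {

    // Returns a list of problems found; an empty list only means these
    // three rough checks passed, not that the file is valid EMBL.
    static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();

        // Check 1: the entry must start with an ID line of the assumed layout
        // "ID   acc; SV n; topology; moltype; class; division; len BP."
        if (lines.isEmpty() || !lines.get(0).startsWith("ID   ")) {
            problems.add("first line is not an ID line");
        } else {
            int fields = lines.get(0).substring(5).split(";").length;
            if (fields != 7) {
                problems.add("ID line has " + fields + " fields, expected 7");
            }
        }

        // Checks 2 and 3: at least one XX spacer line and a closing // line.
        boolean hasSpacer = false;
        boolean hasTerminator = false;
        for (String line : lines) {
            if (line.trim().equals("XX")) hasSpacer = true;
            if (line.startsWith("//")) hasTerminator = true;
        }
        if (!hasSpacer) problems.add("no XX section delimiter lines found");
        if (!hasTerminator) problems.add("no // terminator line found");
        return problems;
    }

    public static void main(String[] args) {
        // First lines and last line of the file from the original post.
        List<String> posted = Arrays.asList(
                "ID   AJ243265_2; parent: AJ243265",
                "AC   AJ243265;",
                "SQ   Sequence   462 BP;",
                "//");
        for (String problem : check(posted)) {
            System.out.println(problem);
        }
    }
}
```

Run against the posted file, this reports the ID line and the missing XX lines, which matches the two problems listed above.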

The deprecated SeqIOTools did not care whether the file was valid. It basically just copied every line into an internal token/value map and made no attempt to parse or understand the data in each line. The new RichSequence-based parsers enforce the file format definitions and break down and interpret the contents of each line, so they reject any file that does not strictly conform to the specified format.
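
The lenient behaviour can be illustrated with a small sketch. This is not the SeqIOTools implementation: LenientTagValue and collect are hypothetical names, and the code only mimics the general idea of copying each line into a tag/value map keyed by the two-letter line code, without checking the content. Because nothing is validated, a malformed ID line is stored as readily as a correct one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not BioJava code.
final class LenientTagValue {

    // Groups line contents by their two-letter line code. No line is ever
    // rejected; untagged (indented) lines are filed under "(sequence)".
    static Map<String, List<String>> collect(List<String> lines) {
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        for (String line : lines) {
            // Skip blank lines and the // entry terminator.
            if (line.trim().isEmpty() || line.startsWith("//")) continue;
            String tag = line.length() >= 2 ? line.substring(0, 2) : line;
            if (tag.trim().isEmpty()) tag = "(sequence)";
            // The value starts after the tag and its padding (column 6 onward).
            String value = line.length() > 5 ? line.substring(5).trim() : "";
            byTag.computeIfAbsent(tag, k -> new ArrayList<>()).add(value);
        }
        return byTag;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = collect(Arrays.asList(
                "ID   AJ243265_2; parent: AJ243265",
                "AC   AJ243265;",
                "SQ   Sequence   462 BP;",
                "//"));
        // The invalid ID line is stored without complaint.
        System.out.println(map.get("ID"));
    }
}
```

A strict parser, by contrast, has to split each value into its expected fields, and that is the step at which a non-conforming line fails.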

cheers,
Richard

On 12 Nov 2009, at 13:18, Oliver Stolpe wrote:

> [...]

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland@...
http://www.eaglegenomics.com/

