How do I read a FASTA file containing protein sequences in lowercase?

How do I read a FASTA file containing protein sequences in lowercase?

by Carl Mäsak

I'm using RichSequenceIterator to read FASTA files containing
proteins. Somehow it doesn't work when the protein sequences are in
lowercase, which they sometimes are when downloaded from e.g. Uniprot.
My code fails to recognize the following file as containing a protein
sequence:

>OPSD_FELCA
mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyilln
lavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgv
aftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaq
qqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrn
cmlttlccgknplgddeasttgsktetsqvapa

What am I missing? Here's the code I'm using to read in sequences:

    private List<ISequence> sequencesFromInputStream(InputStream stream) {

        BufferedInputStream bufferedStream = new BufferedInputStream(stream);
        Namespace ns = RichObjectFactory.getDefaultNamespace();
        RichSequenceIterator seqit = null;

        try {
            seqit = RichSequence.IOTools.readStream(bufferedStream, ns);
        } catch (IOException e) {
            logger.error("Couldn't read sequences from file", e);
            return Collections.emptyList();
        }

        List<ISequence> sequences = new ArrayList<ISequence>();
        try {
            while ( seqit.hasNext() ) {
                RichSequence rseq = seqit.nextRichSequence(); // *error occurs here*
                if (rseq == null)
                    continue;
                String alphabet = rseq.getAlphabet().getName();
                sequences.add(
                      "DNA".equals(alphabet) ? new BiojavaDNA(rseq)
                    : "RNA".equals(alphabet) ? new BiojavaRNA(rseq)
                    :                          new BiojavaProtein(rseq) );
            }
        } catch (NoSuchElementException e) {
            logger.error("Read past last sequence", e);
        } catch (BioException e) {
            logger.error(e); // *ends up here*
        }

        return sequences;
    }

Grateful for any pointers you might have.

Regards,
// Carl Mäsak

_______________________________________________
Biojava-l mailing list  -  Biojava-l@...
http://lists.open-bio.org/mailman/listinfo/biojava-l

Re: How do I read a FASTA file containing protein sequences in lowercase?

by Richard Holland-6

Could you post the output from the exception stack that it generates?

thanks,
Richard

On 6 Nov 2009, at 16:25, Carl Mäsak wrote:

> I'm using RichSequenceIterator to read FASTA files containing
> proteins. Somehow it doesn't work when the protein sequences are in
> lowercase, which they sometimes are when downloaded from e.g. Uniprot.

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland@...
http://www.eaglegenomics.com/

Re: How do I read a FASTA file containing protein sequences in lowercase?

by Carl Mäsak

Richard (>), Carl (>>):

>> I'm using RichSequenceIterator to read FASTA files containing
>> proteins. Somehow it doesn't work when the protein sequences are in
>> lowercase, which they sometimes are when downloaded from e.g. Uniprot.
>
> Could you post the output from the exception stack that it generates?

org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:113)
        at net.bioclipse.biojava.business.BiojavaManager.sequencesFromInputStream(BiojavaManager.java:314)
        at net.bioclipse.biojava.business.BiojavaManager.sequencesFromFile(BiojavaManager.java:291)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at net.bioclipse.managers.business.AbstractManagerMethodDispatcher.doInvoke(AbstractManagerMethodDispatcher.java:243)
        at net.bioclipse.managers.business.JavaManagerMethodDispatcher.doInvokeInSameThread(JavaManagerMethodDispatcher.java:248)
        at net.bioclipse.managers.business.AbstractManagerMethodDispatcher.invoke(AbstractManagerMethodDispatcher.java:130)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at net.bioclipse.recording.WrapInProxyAdvice.invoke(WrapInProxyAdvice.java:22)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.springframework.osgi.service.importer.internal.aop.ServiceInvoker.doInvoke(ServiceInvoker.java:59)
        at org.springframework.osgi.service.importer.internal.aop.ServiceInvoker.invoke(ServiceInvoker.java:67)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.osgi.service.importer.internal.aop.ServiceTCCLInterceptor.invoke(ServiceTCCLInterceptor.java:34)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.osgi.service.importer.support.LocalBundleContextAdvice.invoke(LocalBundleContextAdvice.java:59)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:131)
        at org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:119)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
        at $Proxy18.invoke(Unknown Source)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.aop.framework.adapter.AfterReturningAdviceInterceptor.invoke(AfterReturningAdviceInterceptor.java:50)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
        at $Proxy20.sequencesFromFile(Unknown Source)
        at net.bioclipse.biojava.ui.editors.Aligner.setInput(Aligner.java:152)
        at net.bioclipse.biojava.ui.editors.Aligner.init(Aligner.java:138)
        at org.eclipse.ui.part.MultiPageEditorPart.addPage(MultiPageEditorPart.java:238)
        at org.eclipse.ui.part.MultiPageEditorPart.addPage(MultiPageEditorPart.java:212)
        at net.bioclipse.biojava.ui.editors.SequenceEditor.createPages(SequenceEditor.java:47)
        at org.eclipse.ui.part.MultiPageEditorPart.createPartControl(MultiPageEditorPart.java:357)
        at org.eclipse.ui.internal.EditorReference.createPartHelper(EditorReference.java:662)
        at org.eclipse.ui.internal.EditorReference.createPart(EditorReference.java:462)
        at org.eclipse.ui.internal.WorkbenchPartReference.getPart(WorkbenchPartReference.java:595)
        at org.eclipse.ui.internal.PartPane.setVisible(PartPane.java:313)
        at org.eclipse.ui.internal.presentations.PresentablePart.setVisible(PresentablePart.java:180)
        at org.eclipse.ui.internal.presentations.util.PresentablePartFolder.select(PresentablePartFolder.java:270)
        at org.eclipse.ui.internal.presentations.util.LeftToRightTabOrder.select(LeftToRightTabOrder.java:65)
        at org.eclipse.ui.internal.presentations.util.TabbedStackPresentation.selectPart(TabbedStackPresentation.java:473)
        at org.eclipse.ui.internal.PartStack.refreshPresentationSelection(PartStack.java:1256)
        at org.eclipse.ui.internal.PartStack.setSelection(PartStack.java:1209)
        at org.eclipse.ui.internal.PartStack.showPart(PartStack.java:1608)
        at org.eclipse.ui.internal.PartStack.add(PartStack.java:499)
        at org.eclipse.ui.internal.EditorStack.add(EditorStack.java:103)
        at org.eclipse.ui.internal.PartStack.add(PartStack.java:485)
        at org.eclipse.ui.internal.EditorStack.add(EditorStack.java:112)
        at org.eclipse.ui.internal.EditorSashContainer.addEditor(EditorSashContainer.java:63)
        at org.eclipse.ui.internal.EditorAreaHelper.addToLayout(EditorAreaHelper.java:225)
        at org.eclipse.ui.internal.EditorAreaHelper.addEditor(EditorAreaHelper.java:213)
        at org.eclipse.ui.internal.EditorManager.createEditorTab(EditorManager.java:778)
        at org.eclipse.ui.internal.EditorManager.openEditorFromDescriptor(EditorManager.java:677)
        at org.eclipse.ui.internal.EditorManager.openEditor(EditorManager.java:638)
        at org.eclipse.ui.internal.WorkbenchPage.busyOpenEditorBatched(WorkbenchPage.java:2854)
        at org.eclipse.ui.internal.WorkbenchPage.busyOpenEditor(WorkbenchPage.java:2762)
        at org.eclipse.ui.internal.WorkbenchPage.access$11(WorkbenchPage.java:2754)
        at org.eclipse.ui.internal.WorkbenchPage$10.run(WorkbenchPage.java:2705)
        at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
        at org.eclipse.ui.internal.WorkbenchPage.openEditor(WorkbenchPage.java:2701)
        at org.eclipse.ui.internal.WorkbenchPage.openEditor(WorkbenchPage.java:2685)
        at org.eclipse.ui.internal.WorkbenchPage.openEditor(WorkbenchPage.java:2676)
        at org.eclipse.ui.ide.IDE.openEditor(IDE.java:651)
        at org.eclipse.ui.ide.IDE.openEditor(IDE.java:610)
        at org.eclipse.ui.actions.OpenFileAction.openFile(OpenFileAction.java:99)
        at org.eclipse.ui.actions.OpenSystemEditorAction.run(OpenSystemEditorAction.java:99)
        at org.eclipse.ui.actions.RetargetAction.run(RetargetAction.java:221)
        at org.eclipse.ui.navigator.CommonNavigatorManager$3.open(CommonNavigatorManager.java:202)
        at org.eclipse.ui.OpenAndLinkWithEditorHelper$InternalListener.open(OpenAndLinkWithEditorHelper.java:48)
        at org.eclipse.jface.viewers.StructuredViewer$2.run(StructuredViewer.java:842)
        at org.eclipse.core.runtime.SafeRunner.run(SafeRunner.java:42)
        at org.eclipse.core.runtime.Platform.run(Platform.java:888)
        at org.eclipse.ui.internal.JFaceUtil$1.run(JFaceUtil.java:48)
        at org.eclipse.jface.util.SafeRunnable.run(SafeRunnable.java:175)
        at org.eclipse.jface.viewers.StructuredViewer.fireOpen(StructuredViewer.java:840)
        at org.eclipse.jface.viewers.StructuredViewer.handleOpen(StructuredViewer.java:1101)
        at org.eclipse.ui.navigator.CommonViewer.handleOpen(CommonViewer.java:467)
        at org.eclipse.jface.viewers.StructuredViewer$6.handleOpen(StructuredViewer.java:1205)
        at org.eclipse.jface.util.OpenStrategy.fireOpenEvent(OpenStrategy.java:264)
        at org.eclipse.jface.util.OpenStrategy.access$2(OpenStrategy.java:258)
        at org.eclipse.jface.util.OpenStrategy$1.handleEvent(OpenStrategy.java:298)
        at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
        at org.eclipse.swt.widgets.Display.sendEvent(Display.java:3543)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1250)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1273)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1258)
        at org.eclipse.swt.widgets.Widget.notifyListeners(Widget.java:1079)
        at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:3441)
        at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3100)
        at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:2405)
        at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2369)
        at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2221)
        at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:500)
        at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
        at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:493)
        at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
        at net.bioclipse.ui.Application.start(Application.java:36)
        at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:194)
        at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
        at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
        at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:368)
        at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:179)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:559)
        at org.eclipse.equinox.launcher.Main.basicRun(Main.java:514)
        at org.eclipse.equinox.launcher.Main.run(Main.java:1311)
        at org.eclipse.equinox.launcher.Main.main(Main.java:1287)
Caused by: org.biojava.bio.seq.io.ParseException:

A Exception Has Occurred During Parsing.
Please submit the details that follow to biojava-l@... or post
a bug report to http://bugzilla.open-bio.org/

Format_object=org.biojavax.bio.seq.io.FastaFormat
Accession=OPSD_FELCA
Id=null
Comments=problem parsing symbols
Parse_block=mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyillnlavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgvaftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaqqqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrncmlttlccgknplgddeasttgsktetsqvapa
Stack trace follows ....


        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:244)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:110)
        ... 114 more
Caused by: org.biojava.bio.symbol.IllegalSymbolException: This tokenization doesn't contain character: 'e'
        at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar(CharacterTokenization.java:175)
        at org.biojava.bio.seq.io.CharacterTokenization$TPStreamParser.characters(CharacterTokenization.java:246)
        at org.biojava.bio.symbol.SimpleSymbolList.<init>(SimpleSymbolList.java:178)
        at org.biojavax.bio.seq.io.FastaFormat.readRichSequence(FastaFormat.java:237)
        ... 115 more

// Carl

Re: How do I read a FASTA file containing protein sequences in lowercase?

by Richard Holland-6

Ah OK I see what's going on.

The convenience method you're using, RichSequence.IOTools.readStream(),
uses FastaFormat to try and guess the alphabet to use based on the
first line of the input sequence.

In FastaFormat, it does this by searching for matching non-DNA  
symbols. The search is case-sensitive:

        protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");
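
To see the case sensitivity in isolation, here is a minimal sketch that
uses only java.util.regex. It assumes the pattern is applied to a whole
line with matches(), which may not be exactly how FastaFormat uses it.
The class name, the helper method and the case-insensitive variant are
made up for illustration and are not BioJava code.

```java
import java.util.regex.Pattern;

final class CaseSensitivityDemo {
    // Same expression as quoted above: a line containing at least one of
    // F, L, I, P, Q or E, in upper case only.
    static final Pattern CASE_SENSITIVE = Pattern.compile(".*[FLIPQE].*");

    // Hypothetical patched variant: same expression, case-insensitive.
    static final Pattern CASE_INSENSITIVE =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    // Assumption: the whole line is tested with matches().
    static boolean matchesWholeLine(Pattern pattern, String line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        // First sequence line of the OPSD_FELCA record from this thread.
        String firstLine =
                "mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyilln";
        System.out.println(matchesWholeLine(CASE_SENSITIVE, firstLine));   // false
        System.out.println(matchesWholeLine(CASE_INSENSITIVE, firstLine)); // true
    }
}
```

With the case-sensitive pattern the lowercase line is not recognised as
protein, which fits the later complaint about the character 'e'.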

FastaFormat needs patching to make this pattern case-insensitive.
Still, if none of the above symbols appears until the second or a
subsequent line of the sequence, the guessing will not work: it will
assume the sequence is DNA and give you the same error as before.
When you know in advance which alphabet the sequence uses, it's best
to avoid the guessing algorithm and instead use methods such as
readFastaDNA (or, for proteins, readFastaProtein) that explicitly
specify the alphabet you want to read.
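
The first-line limitation can be illustrated with a small sketch. This
is not the FastaFormat implementation, only a stand-alone comparison of
guessing from the first sequence line against guessing from every line
of a record. The class, the method names and the two-line example
record are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class WholeRecordGuess {
    // Same residues as the FastaFormat pattern, made case-insensitive here.
    static final Pattern AMINO_ACIDS =
            Pattern.compile(".*[FLIPQE].*", Pattern.CASE_INSENSITIVE);

    // Guess from the first sequence line only.
    static boolean firstLineLooksLikeProtein(List<String> sequenceLines) {
        return !sequenceLines.isEmpty()
                && AMINO_ACIDS.matcher(sequenceLines.get(0)).matches();
    }

    // Guess from every sequence line of the record.
    static boolean anyLineLooksLikeProtein(List<String> sequenceLines) {
        for (String line : sequenceLines) {
            if (AMINO_ACIDS.matcher(line).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up record: no F, L, I, P, Q or E until the second line.
        List<String> lines = Arrays.asList("MKTGAVNS", "WFLEPQ");
        System.out.println(firstLineLooksLikeProtein(lines)); // false
        System.out.println(anyLineLooksLikeProtein(lines));   // true
    }
}
```

Even the whole-record check is still a guess, which is why naming the
alphabet explicitly is the safer route when it is known.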

However, there's still one thing you definitely can't do, and that's
parse different types of sequence from the same input without adding
code that detects which alphabet each individual sequence uses before
parsing it with the appropriate BioJava parser. Your code appears to
be expecting mixed input, but this won't work unless the sequences all
happen to use the same alphabet.
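
As a rough idea of what such detection code might look like, here is a
sketch that splits FASTA text into records and labels each one before
any BioJava parser is involved. Everything in it (FastaRecordSplitter,
Record, guessAlphabet and the residue sets used) is a hypothetical
helper, not part of BioJava, and the classifier is deliberately crude.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FastaRecordSplitter {

    /** One FASTA record: the header without '>' and the joined residues. */
    static final class Record {
        final String header;
        final String residues;

        Record(String header, String residues) {
            this.header = header;
            this.residues = residues;
        }

        /**
         * Crude, hypothetical classifier: returns "DNA", "RNA" or "PROTEIN".
         * It looks at the whole record and ignores case, but it will still
         * mislabel a peptide made up only of A, C, G, T/U and N.
         */
        String guessAlphabet() {
            String upper = residues.toUpperCase(Locale.ROOT);
            if (upper.matches("[ACGTN]*")) return "DNA";
            if (upper.matches("[ACGUN]*")) return "RNA";
            return "PROTEIN";
        }
    }

    /** Splits FASTA text into records; lines before the first '>' are ignored. */
    static List<Record> split(String fastaText) {
        List<Record> records = new ArrayList<Record>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String rawLine : fastaText.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.startsWith(">")) {
                // A new header closes the previous record, if there was one.
                if (header != null) {
                    records.add(new Record(header, residues.toString()));
                }
                header = line.substring(1).trim();
                residues.setLength(0);
            } else if (header != null) {
                residues.append(line);
            }
        }
        if (header != null) {
            records.add(new Record(header, residues.toString()));
        }
        return records;
    }

    public static void main(String[] args) {
        String mixed = ">dna1\nacgtacgt\nnnacgt\n>prot1\nmngtegpnf\nyvpfsnk\n";
        for (Record record : split(mixed)) {
            System.out.println(record.header + " -> " + record.guessAlphabet());
        }
        // prints "dna1 -> DNA" then "prot1 -> PROTEIN"
    }
}
```

Each labelled record could then be written back out and handed to the
reader for its own alphabet, such as the readFastaDNA method mentioned
above for the DNA ones.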

cheers,
Richard

On 6 Nov 2009, at 16:54, Carl Mäsak wrote:

> org.biojava.bio.BioException: Could not read sequence
> Caused by: org.biojava.bio.seq.io.ParseException:
>
> A Exception Has Occurred During Parsing.
> Please submit the details that follow to biojava-l@... or post
> a bug report to http://bugzilla.open-bio.org/
>
> Format_object=org.biojavax.bio.seq.io.FastaFormat
> Accession=OPSD_FELCA
> Id=null
> Comments=problem parsing symbols
> Parse_block
> =
> mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsmlaaymfllivlgfpinfltlyvtvqhkklrtplnyillnlavadlfmvfggftttlytslhgyfvfgptgcnlegffatlggeialwslvvlaieryvvvckpmsnfrfgenhaimgvaftwvmalacaapplvgwsryipegmqcscgidyytlkpevnnesfviymfvvhftipmiviffcygqlvftvkeaaaqqqesattqkaekevtrmviimviaflicwvpyasvafyifthqgsnfgpifmtlpaffaksssiynpviyimmnkqfrncmlttlccgknplgddeasttgsktetsqvapa
> Stack trace follows ....
>
>
> at org.biojavax.bio.seq.io.FastaFormat.readRichSequence
> (FastaFormat.java:244)
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence
> (RichStreamReader.java:110)
> ... 114 more
> Caused by: org.biojava.bio.symbol.IllegalSymbolException: This
> tokenization doesn't contain character: 'e'
> at org.biojava.bio.seq.io.CharacterTokenization.parseTokenChar
> (CharacterTokenization.java:175)
> at org.biojava.bio.seq.io.CharacterTokenization
> $TPStreamParser.characters(CharacterTokenization.java:246)
> at org.biojava.bio.symbol.SimpleSymbolList.<init>
> (SimpleSymbolList.java:178)
> at org.biojavax.bio.seq.io.FastaFormat.readRichSequence
> (FastaFormat.java:237)
> ... 115 more
>
> // Carl

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland@...
http://www.eaglegenomics.com/


_______________________________________________
Biojava-l mailing list  -  Biojava-l@...
http://lists.open-bio.org/mailman/listinfo/biojava-l

Re: How do I read a FASTA file containing protein sequences in lowercase?

by Carl Mäsak

Richard (>):

> Ah OK I see what's going on.
>
> The convenience method you're using, RichSequence.IOTools.readStream(), uses
> FastaFormat to try and guess the alphabet to use based on the first line of
> the input sequence.
>
> In FastaFormat, it does this by searching for matching non-DNA symbols. The
> search is case-sensitive:
>
>        protected static final Pattern aminoAcids = Pattern.compile(".*[FLIPQE].*");
>
> FastaFormat needs patching to make this pattern non-case-sensitive.
Patch attached.

I also took the opportunity to remove the occurrences of .* in the
Pattern above. Generally, one should use Matcher.find() when one
is interested in matching a part of a string. This is more efficient
than using Matcher.matches() and surrounding the desired regular
expression with .*, since the latter causes a lot of unnecessary
backtracking and slows the search down noticeably.
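
For readers who don't open the patch, here is a minimal sketch of the idea. It is not the attached patch itself: the class and method names are made up for illustration, and only the old pattern string is taken from FastaFormat as quoted above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of BioJava).
class AminoAcidGuess {
    // Old style, as quoted from FastaFormat: case-sensitive, wrapped in .*
    // and meant to be used with Matcher.matches().
    static final Pattern OLD = Pattern.compile(".*[FLIPQE].*");

    // Sketch of the patched idea: no .*, case-insensitive, used with find().
    static final Pattern NEW =
            Pattern.compile("[FLIPQE]", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeProteinOld(String line) {
        return OLD.matcher(line).matches();
    }

    static boolean looksLikeProteinNew(String line) {
        return NEW.matcher(line).find();
    }

    public static void main(String[] args) {
        // Start of the OPSD_FELCA sequence from the original question.
        String lower = "mngtegpnfyvpfsnktgvvrspfeypqyylaepwqfsml";
        String dna = "acgtacgtacgt";

        System.out.println(looksLikeProteinOld(lower)); // false: no uppercase match
        System.out.println(looksLikeProteinNew(lower)); // true
        System.out.println(looksLikeProteinNew(dna));   // false
    }
}
```

The only two changes that matter are the CASE_INSENSITIVE flag and the switch from matches() to find(); the character class itself is unchanged.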

This effect only shows up for very long strings, but long strings can
and do happen in bioinformatics. The measurements below show the .*
approach taking roughly two to three times as long on the longest inputs.

$ for length in 100 1000 10000 100000 1000000; do (time java WithDotStar $length) 2>&1 | grep real; done
real 0m0.371s
real 0m0.367s
real 0m0.577s
real 0m2.735s
real 0m25.275s

$ for length in 100 1000 10000 100000 1000000; do (time java WithoutDotStar $length) 2>&1 | grep real; done
real 0m0.309s
real 0m0.361s
real 0m0.468s
real 0m1.184s
real 0m9.703s
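
The attached test programs are not reproduced in this message. For readers without them, a comparable timing program could be sketched roughly as follows. This is a stand-in built on assumptions, not the contents of WithDotStar.java or WithoutDotStar.java: the class name, the all-'A' input string, and the in-process timing are guesses made for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical stand-in for the two attached programs (assumed structure).
class RegexTiming {
    // Build a string consisting of n copies of c.
    static String repeat(char c, int n) {
        char[] buf = new char[n];
        Arrays.fill(buf, c);
        return new String(buf);
    }

    // Whole-string match with the pattern wrapped in .*
    static boolean withDotStar(String s) {
        return Pattern.compile(".*[FLIPQE].*").matcher(s).matches();
    }

    // Partial match using find() and no .*
    static boolean withoutDotStar(String s) {
        return Pattern.compile("[FLIPQE]").matcher(s).find();
    }

    public static void main(String[] args) {
        int length = args.length > 0 ? Integer.parseInt(args[0]) : 100000;
        // Assumed input: a DNA-like string with no amino-acid-only letters,
        // so neither pattern can succeed early.
        String s = repeat('A', length);

        long t0 = System.nanoTime();
        boolean a = withDotStar(s);
        long t1 = System.nanoTime();
        boolean b = withoutDotStar(s);
        long t2 = System.nanoTime();

        System.out.println("with .*    : " + a + " in " + (t1 - t0) / 1000 + " us");
        System.out.println("without .* : " + b + " in " + (t2 - t1) / 1000 + " us");
    }
}
```

Timings from a sketch like this will not match the figures above, which include JVM startup and whatever input the real programs used.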

Kindly,
// Carl





Attachments:
aminoAcids.patch (2K)
WithDotStar.java (870 bytes)
WithoutDotStar.java (870 bytes)

Re: How do I read a FASTA file containing protein sequences in lowercase?

by Richard Holland-6

I've applied the patch to the trunk of biojava-live. Thanks!

Richard

On 9 Nov 2009, at 16:26, Carl Mäsak wrote:

> Richard (>):
>> Ah OK I see what's going on.
>>
>> The convenience method you're using, RichSequence.IOTools.readStream(), uses
>> FastaFormat to try and guess the alphabet to use based on the first line of
>> the input sequence.
>>
>> In FastaFormat, it does this by searching for matching non-DNA symbols. The
>> search is case-sensitive:
>>
>>        protected static final Pattern aminoAcids =
>> Pattern.compile(".*[FLIPQE].*");
>>
>> FastaFormat needs patching to make this pattern non-case-sensitive.
>
> Patch attached.
>
> I also took the opportunity to remove the occurrences of .* in the
> Pattern above. Generally, one should be using Matcher.find() when one
> is interested in matching a part of a string. This is more efficient
> than using Matcher.matches() and surrounding the desired regular
> expression with .*, since the latter will cause a lot of unnecessary
> backtracking and make the search quadratic.
>
> This effect only shows up for very long strings, but long strings can
> and do happen in bioinformatics. The below measurements show the
> quadratic behaviour of the former approach.
>
> $ for length in 100 1000 10000 100000 1000000; do (time java
> WithDotStar $length) 2>&1 | grep real; done
> real 0m0.371s
> real 0m0.367s
> real 0m0.577s
> real 0m2.735s
> real 0m25.275s
>
> $ for length in 100 1000 10000 100000 1000000; do (time java
> WithoutDotStar $length) 2>&1 | grep real; done
> real 0m0.309s
> real 0m0.361s
> real 0m0.468s
> real 0m1.184s
> real 0m9.703s
>
> Kindly,
> // Carl
> <aminoAcids.patch><WithDotStar.java><WithoutDotStar.java>

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland@...
http://www.eaglegenomics.com/

