Long /labels are wrapped, but can't be read

by Adam Sjøgren-2

  Hi.


I am wondering whether this is a buglet or just a case of "Don't do
that":

If I set a very long /label on a feature and output the sequence in EMBL
format, the qualifier value gets wrapped, but not quoted.

When BioPerl reads such a file, an exception is thrown.

I probably shouldn't be setting very long labels... But oughtn't BioPerl
throw an exception when a too long label is set, or automatically quote
the value when it is long enough to be wrapped, or know how to read a
wrapped yet unquoted value?

I will be happy to try and provide a patch for whichever solution is
preferred.

Here is an example script:

  #!/usr/bin/perl

  use strict;
  use warnings;

  use IO::String;

  use Bio::Seq;
  use Bio::SeqFeature::Generic;
  use Bio::SeqIO;

  print 'BioPerl ' . $Bio::Root::Version::VERSION . "\n";

  my $seq=Bio::Seq->new(-seq=>'ATG');
  my $feature=Bio::SeqFeature::Generic->new(-primary=>'misc_feature', -start=>1, -end=>3);
  $feature->add_tag_value(label=>'averylonglabelthisisindeedbutitoughttoworkanywaydontyouthink');
  $seq->add_SeqFeature($feature);

  my $out_string=out($seq);
  print $out_string;

  my $fh=IO::String->new($out_string);
  my $in=Bio::SeqIO->new(-fh=>$fh, -format=>'EMBL');
  my $in_seq=$in->next_seq;

  print "Done\n";

  sub out {
      my ($seq)=@_;

      my $string='';
      my $fh=IO::String->new($string);
      my $out=Bio::SeqIO->new(-fh=>$fh, -format=>'EMBL');
      $out->write_seq($seq);

      return $string;
  }

Which gives this output when run:

  BioPerl 1.0069
  ID   unknown; SV 1; linear; unassigned DNA; STD; UNC; 3 BP.
  XX
  AC   unknown;
  XX
  XX
  FH   Key             Location/Qualifiers
  FH
  FT   misc_feature    1..3
  FT                   /label=averylonglabelthisisindeedbutitoughttoworkanywaydont
  FT                   youthink
  XX
  SQ   Sequence 3 BP; 1 A; 0 C; 1 G; 1 T; 0 other;
       atg                                                                       3
  //

  ------------- EXCEPTION: Bio::Root::Exception -------------
  MSG: Can't see new qualifier in: youthink
  from:
  /label=averylonglabelthisisindeedbutitoughttoworkanywaydont
  youthink

  STACK: Error::throw
  STACK: Bio::Root::Root::throw Bio/Root/Root.pm:368
  STACK: Bio::SeqIO::embl::_read_FTHelper_EMBL Bio/SeqIO/embl.pm:1294
  STACK: Bio::SeqIO::embl::next_seq Bio/SeqIO/embl.pm:392
  STACK: /z/home/adsj/bugs/bioperl/embl/embl.pl:24
  -----------------------------------------------------------

If I change the value to include double quotes (simulating embl.pm
quoting the value), BioPerl can read the EMBL string it produces fine:

  -----------------------------------------------------------
  adsj@ala:~/work/bioperl/bioperl-live$ perl -I. ~/bugs/bioperl/embl/embl.pl
  BioPerl 1.0069
  ID   unknown; SV 1; linear; unassigned DNA; STD; UNC; 3 BP.
  XX
  AC   unknown;
  XX
  XX
  FH   Key             Location/Qualifiers
  FH
  FT   misc_feature    1..3
  FT                   /label=""averylonglabelthisisindeedbutitoughttoworkanywaydo
  FT                   ntyouthink""
  XX
  SQ   Sequence 3 BP; 1 A; 0 C; 1 G; 1 T; 0 other;
       atg                                                                       3
  //
  Done
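
The difference between the two runs comes down to how a reader decides where a qualifier value ends. The following is a minimal sketch in plain Perl, not the actual embl.pm code: the function name `parse_qualifiers` and its input convention (qualifier lines with the `FT` prefix and padding already stripped) are assumptions made for illustration. It treats an unquoted value as ending at the end of its line, and keeps appending lines to a quoted value until the quotes balance.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for a feature-table qualifier reader (hypothetical,
# NOT the embl.pm implementation). Takes the qualifier lines of one feature
# and returns a hash ref of qualifier name => value, or dies.
sub parse_qualifiers {
    my @lines = @_;
    my %qualifiers;
    while (@lines) {
        my $line = shift @lines;
        # Every new qualifier is expected to start with "/name".
        $line =~ s{^/(\w+)}{}
            or die "Can't see new qualifier in: $line\n";
        my $name  = $1;
        my $value = '';
        if ($line =~ s/^=//) {
            $value = $line;
            if ($value =~ /^"/) {
                # Quoted value: absorb following lines until the value ends
                # in a quote and the number of quotes is even.
                while ($value !~ /"$/ or ($value =~ tr/"//) % 2) {
                    last unless @lines;
                    # Simplification: lines are joined without a space.
                    $value .= shift @lines;
                }
                $value =~ s/^"|"$//g;   # trim the enclosing quotes
                $value =~ s/""/"/g;     # undouble embedded quotes
            }
            # Unquoted value: assumed to end at the end of this line.
        }
        $qualifiers{$name} = $value;
    }
    return \%qualifiers;
}

my $long = 'averylonglabelthisisindeedbutitoughttoworkanywaydontyouthink';
my ($head, $tail) = (substr($long, 0, 52), substr($long, 52));

# Wrapped and quoted: the second line is absorbed into the value.
my $quoted = parse_qualifiers(qq{/label="$head}, qq{$tail"});
print "quoted:   $quoted->{label}\n";

# Wrapped but unquoted: the second line is mistaken for a new qualifier.
my $unquoted = eval { parse_qualifiers("/label=$head", $tail) };
print "unquoted: ", (defined $unquoted ? "$unquoted->{label}\n" : "failed: $@");
```

Under these assumptions, the wrapped continuation line is only safe when an opening quote tells the reader that the value is still open.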


  Best regards,

     Adam

--
                                                          Adam Sjøgren
                                                    adsj@...

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Long /labels are wrapped, but can't be read

by Chris Fields-5

Adam,

Not sure, but this could be a case of 'both'. Labels that are quoted
and those that aren't are currently distinguished via a global hash lookup
(%FTQUAL_NO_QUOTE) due to the way the parser works; there is some
logic behind this, I just can't quite recall at the moment why it is
this way. You could set a hash key for the label in cases where it
isn't quoted; that should work. You can also test out the
Bio::SeqIO::embldriver version (-format => 'embldriver').

If the above doesn't work out, it's worth filing a bug for this
behavior, though I'm not sure how easy it will be to fix.

chris

On Sep 28, 2009, at 2:51 AM, Adam Sjøgren wrote:

> [...]



Re: Long /labels are wrapped, but can't be read

by Adam Sjøgren-2

On Tue, 29 Sep 2009 22:54:04 -0500, Chris wrote:

> Not sure, but this could be a case of 'both'. Labels that are quoted
> and aren't are currently distinguished via a global hash lookup
> (%FTQUAL_NO_QUOTE) due to the way the parser works; there is some
> logic behind this, just can't quite recall at the moment why it is
> this way.

Yes, I saw that there are a number of qualifiers that aren't quoted
automatically.

The very easy "fix" for me would be to simply remove "label" from
%FTQUAL_NO_QUOTE, but I'm not really sure what the reason for not
quoting all values is, so I was hesitant to just propose that.
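
To make the effect of that lookup concrete, here is a minimal sketch in plain Perl. It is not the embl.pm writer: the `%no_quote` hash is a hypothetical stand-in for %FTQUAL_NO_QUOTE with only two illustrative keys, `format_qualifier` is an invented name, and the hard wrap at 59 characters is an assumption inferred from the output shown earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the %FTQUAL_NO_QUOTE lookup; only two
# illustrative keys are listed here.
my %no_quote = (label => 1, codon_start => 1);

# Format one qualifier as feature-table lines. Values are quoted unless the
# qualifier name is in %no_quote, then hard-wrapped at $width characters.
# (Simplification: no attempt is made to wrap at word boundaries.)
sub format_qualifier {
    my ($name, $value, $width) = @_;
    $width ||= 59;
    unless ($no_quote{$name}) {
        (my $escaped = $value) =~ s/"/""/g;   # embedded quotes are doubled
        $value = qq{"$escaped"};
    }
    my $text   = "/$name=$value";
    my @chunks = $text =~ /(.{1,$width})/g;
    my $prefix = 'FT' . (' ' x 19);
    return map { $prefix . $_ } @chunks;
}

my $long = 'averylonglabelthisisindeedbutitoughttoworkanywaydontyouthink';

# With "label" in the lookup, the value is wrapped but not quoted.
print "$_\n" for format_qualifier(label => $long);

# With "label" removed from the lookup, the same value is quoted first.
delete $no_quote{label};
print "$_\n" for format_qualifier(label => $long);
```

In this sketch, removing the key changes only what the writer emits; whether that is acceptable for every consumer of /label is exactly the open question.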

> You could set a hash key for the label in cases where it isn't quoted,
> that should work. You can also test out the Bio::SeqIO::embldriver
> version (-format => 'embldriver').

Ah, embldriver reads the wrapped qualifier when it isn't quoted without
problem. Nice! I hadn't noticed embldriver.

I wonder which one is correct in this case?

And should I switch to using embldriver to read, or does it make sense
to try and concoct a patch that changes embl?


  Thanks for the feedback!

     Adam

--
                                                          Adam Sjøgren
                                                    adsj@...


Re: Long /labels are wrapped, but can't be read

by Chris Fields-5

On Sep 30, 2009, at 4:50 AM, Adam Sjøgren wrote:

> On Tue, 29 Sep 2009 22:54:04 -0500, Chris wrote:
>
>> Not sure, but this could be a case of 'both'. Labels that are quoted
>> and aren't are currently distinguished via a global hash lookup
>> (%FTQUAL_NO_QUOTE) due to the way the parser works; there is some
>> logic behind this, just can't quite recall at the moment why it is
>> this way.
>
> Yes, I saw that there is a number of qualifiers that aren't quoted
> automatically.
>
> The very easy "fix" for me would be to simply remove "label" from
> %FTQUAL_NO_QUOTE, but I'm not really sure what the reason for not
> quoting all values is, so I was hesitant to just propose that.

It's basically for more control over format IIRC.  It appears to only  
play a role in output (via write_seq).

>> You could set a hash key for the label in cases where it isn't quoted,
>> that should work. You can also test out the Bio::SeqIO::embldriver
>> version (-format => 'embldriver').
>
> Ah, embldriver reads the wrapped qualifier when it isn't quoted without
> problem. Nice! I hadn't noticed embldriver.
>
> I wonder which one is correct in this case?
>
> And should I switch to using embldriver to read, or does it make sense
> to try and concoct a patch that changes embl?

Bio::SeqIO::embldriver is an attempt to coalesce the parsers into a
generic driver/parser-handler framework. The various parsers (the
drivers) parse data into simple chunks, basically hash refs of
data. These are passed on to the handler object, which has
methods designed to handle the chunks passed in. Basically it's like
a souped-up XML parser, but the data is grouped together in a related,
meaningful way (like an entire seqfeature, for instance).

The main job of the driver is simply to parse the incoming data stream
into chunks of naturally related data and pass them on to the handler
object.

For the moment they're still experimental, but I put them out with the
release so they can be tested. The current problem is that there is no
specification for how a data chunk is defined and labeled, but I am
thinking of using something like JSON for that.
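
As a rough illustration of that split, here is a minimal sketch in plain Perl. The package names `Sketch::Driver` and `Sketch::Handler`, the method names, and the chunk keys (`type`, `key`, `location`, `lines`) are all hypothetical; they are not the embldriver API or its chunk format, which, as noted, has no specification yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Handler: receives chunks (plain hash refs) and decides what to do with them.
package Sketch::Handler;

sub new {
    my ($class) = @_;
    return bless { features => [] }, $class;
}

sub handle_chunk {
    my ($self, $chunk) = @_;
    push @{ $self->{features} }, $chunk if $chunk->{type} eq 'feature';
    return;
}

sub features { return @{ $_[0]{features} } }

# Driver: only splits the incoming lines into chunks of related data.
package Sketch::Driver;

sub new {
    my ($class, %args) = @_;
    return bless { handler => $args{handler} }, $class;
}

sub parse {
    my ($self, @lines) = @_;
    my $chunk;
    for my $line (@lines) {
        # Only feature-table lines are considered in this sketch.
        next unless $line =~ /^FT {3}(\S*)\s+(.*)$/;
        my ($key, $rest) = ($1, $2);
        if (length $key) {
            # A feature key starts a new chunk; hand over the previous one.
            $self->{handler}->handle_chunk($chunk) if $chunk;
            $chunk = { type => 'feature', key => $key,
                       location => $rest, lines => [] };
        }
        elsif ($chunk) {
            # Qualifier or continuation line of the current feature.
            push @{ $chunk->{lines} }, $rest;
        }
    }
    $self->{handler}->handle_chunk($chunk) if $chunk;
    return;
}

package main;

my $handler = Sketch::Handler->new;
my $driver  = Sketch::Driver->new(handler => $handler);
$driver->parse(
    'FH   Key             Location/Qualifiers',
    'FT   misc_feature    1..3',
    'FT                   /label=averylonglabelthisisindeedbutitoughttoworkanywaydont',
    'FT                   youthink',
);

for my $feature ($handler->features) {
    print "$feature->{key} at $feature->{location}: ",
          join('', @{ $feature->{lines} }), "\n";
}
```

Because the driver here hands over a whole feature at once, the handler sees the wrapped line together with the qualifier it belongs to and can decide how to join them.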

>  Thanks for the feedback!

np.

chris

