Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequence

View: New views
4 Messages — Rating Filter:   Alert me  

Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequence

by Dan Kortschak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

I am looking to query NCBI for sequence metadata (LOCUS/length,
DEFINITION/name etc) without obtaining the sequence associated with the
entry (pulling sequence data for chromosome when only the metadata is
needed is a waste).

I'm wondering what would be the most appropriate bioperl module to use -
Bio::DB::EntrezGene or Bio::DB::Query::GenBank seem like the best bet
and from the description the latter seems best, but I'm wondering if
this is best and what database would both provide this data and be
parsable.

thanks for any help.
Dan

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequence

by Smithies, Russell :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I guess it depends on what you've got to start with, how many queries, and which species.
For example, if you want metadata on all human genes, I'd probably do it "manually" from NCBI's website by searching the gene database for "human[orgn]", switching to "gene table" view, then save to file.
It gives you an easily parsed text file with contents as below:
------------------------------------------
1: TGFB1 transforming growth factor, beta 1 [ Homo sapiens ]
GeneID: 7040 updated 07-Oct-2009
RefSeq status: REVIEWED
total gene size: 23166 bp

mRNA   bp   exons   Protein   aa   exons
NM_000660.3   2346   7   NP_000651.3   390   7

Exon information:

NM_000660.3 length: 2346 bp, number of exons: 7
NP_000651.3 length: 390 aa, number of exons: 7

EXON      Coding EXON      INTRON
coords   length      coords   length      coords   length
1 - 1222   1222 bp  868 - 1222   355 bp  1223 - 5456   4234 bp
5457 - 5617   161 bp  5457 - 5617   161 bp  5618 - 9047   3430 bp
9048 - 9165   118 bp  9048 - 9165   118 bp  9166 - 11664   2499 bp
11665 - 11742   78 bp  11665 - 11742   78 bp  11743 - 11881   139 bp
11882 - 12029   148 bp  11882 - 12029   148 bp  12030 - 21630   9601 bp
21631 - 21784   154 bp  21631 - 21784   154 bp  21785 - 22701   917 bp
22702 - 23166   465 bp  22702 - 22860   159 bp

------------------------------------------

Or you could try using Bio::DB::Eutilities, specifying 'gene' as the database and 'table' as the retype.
I'm not sure what retypes are allowed under B:D:E but it should be in the docs.

Take a look at http://www.bioperl.org/wiki/Getting_Genomic_Sequences or http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook

Hope this helps,


Russell Smithies

Bioinformatics Applications Developer
T +64 3 489 9085
russell.smithies@...

Invermay  Research Centre
Puddle Alley,
Mosgiel,
New Zealand
T  +64 3 489 3809  
F  +64 3 489 9174 
www.agresearch.co.nz



> -----Original Message-----
> From: bioperl-l-bounces@... [mailto:bioperl-l-
> bounces@...] On Behalf Of Dan Kortschak
> Sent: Friday, 9 October 2009 7:54 p.m.
> To: bioperl-l@...
> Subject: [Bioperl-l] Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain
> sequence metadata without sequence
>
> Hi,
>
> I am looking to query NCBI for sequence metadata (LOCUS/length,
> DEFINITION/name etc) without obtaining the sequence associated with the
> entry (pulling sequence data for chromosome when only the metadata is
> needed is a waste).
>
> I'm wondering what would be the most appropriate bioperl module to use -
> Bio::DB::EntrezGene or Bio::DB::Query::GenBank seem like the best bet
> and from the description the latter seems best, but I'm wondering if
> this is best and what database would both provide this data and be
> parsable.
>
> thanks for any help.
> Dan
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequence

by Dan Kortschak :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Russell,

I ended up using a hodgepodge of Bio::DB::GenBank and
Bio::DB::EUtililties

#!/usr/bin/perl

use Bio::DB::EUtilities;
use Bio::DB::GenBank;

my @uids = qw(89161185 89161199 89161205 89161207 51511721 89161210 89161213 51511724 89161216 89161187 51511727 89161190 51511729 51511730 51511731 51511732 51511734 51511735 42406306 51511747 51511750 89161203 17981852 89161218 89161220);

my $gb = Bio::DB::GenBank->new(-complexity => 1,-seq_stop=>1);
my $seqio = $gb->get_Stream_by_gi(\@uids);
my $summary = Bio::DB::EUtilities->new(-eutil => 'esummary',
                           -db => 'nucleotide',
                           -id => \@uids);

print "chromosome,refSeq,name,length\n";

my $index=0;

while (my $seq = $seqio->next_seq() and my $ds = $summary->next_DocSum) {
        warn "Database queries don't reconcile",$seq->primary_id,"-",$ds->get_id,"\n" if $seq->primary_id != $ds->get_id;
        print $index++,",",$seq->id(),".",$seq->version,",";
        ($feat)=$seq->get_SeqFeatures;
        if( defined $seq->species && $feat->annotation->get_Annotations('chromosome')) {
                print $seq->species->binomial;
                print " chromosome ",$feat->annotation->get_Annotations('chromosome'),",";
        } elsif (defined $seq->species && $feat->annotation->get_Annotations('organelle')) {
                print $seq->species->binomial;
                print " ",$feat->annotation->get_Annotations('organelle'),",";
        } else {
                $_=$seq->desc;
                /^([[:alnum:] ]+)[[:graph:]]/;
                print "$1,";
        }
        print $ds->get_contents_by_name('Length') if $ds->get_contents_by_name('Length');
        print "\n";
}


If Bio::DB:EUtilities gives rich seq or equivalent I might change over
from Bio::DB::GenBank, but it pretty much works at the moment (the
fail-overs give me grief and I don't like the kludge of asking for a
single base, but they work to get some of the details that the DocSum
doesn't - sensible title for example).

cheers
Dan

On Mon, 2009-10-12 at 08:46 +1300, Smithies, Russell wrote:

> Or you could try using Bio::DB::Eutilities, specifying 'gene' as the database and 'table' as the retype.
> I'm not sure what retypes are allowed under B:D:E but it should be in the docs.


_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequence

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Oct 11, 2009, at 5:03 PM, Dan Kortschak wrote:

> Hi Russell,
>
> I ended up using a hodgepodge of Bio::DB::GenBank and
> Bio::DB::EUtililties
>
> #!/usr/bin/perl
>
> use Bio::DB::EUtilities;
> use Bio::DB::GenBank;
>
> my @uids = qw(89161185 89161199 89161205 89161207 51511721 89161210  
> 89161213 51511724 89161216 89161187 51511727 89161190 51511729  
> 51511730 51511731 51511732 51511734 51511735 42406306 51511747  
> 51511750 89161203 17981852 89161218 89161220);
>
> my $gb = Bio::DB::GenBank->new(-complexity => 1,-seq_stop=>1);
> my $seqio = $gb->get_Stream_by_gi(\@uids);
> my $summary = Bio::DB::EUtilities->new(-eutil => 'esummary',
>                           -db => 'nucleotide',
>                           -id => \@uids);
>
> print "chromosome,refSeq,name,length\n";
>
> my $index=0;
>
> while (my $seq = $seqio->next_seq() and my $ds = $summary-
> >next_DocSum) {
> warn "Database queries don't reconcile",$seq->primary_id,"-",$ds-
> >get_id,"\n" if $seq->primary_id != $ds->get_id;
> print $index++,",",$seq->id(),".",$seq->version,",";
> ($feat)=$seq->get_SeqFeatures;
> if( defined $seq->species && $feat->annotation->get_Annotations
> ('chromosome')) {
> print $seq->species->binomial;
> print " chromosome ",$feat->annotation->get_Annotations
> ('chromosome'),",";
> } elsif (defined $seq->species && $feat->annotation->get_Annotations
> ('organelle')) {
> print $seq->species->binomial;
> print " ",$feat->annotation->get_Annotations('organelle'),",";
> } else {
> $_=$seq->desc;
>        /^([[:alnum:] ]+)[[:graph:]]/;
> print "$1,";
> }
> print $ds->get_contents_by_name('Length') if $ds-
> >get_contents_by_name('Length');
> print "\n";
> }
>
>
> If Bio::DB:EUtilities gives rich seq or equivalent I might change over
> from Bio::DB::GenBank, but it pretty much works at the moment (the
> fail-overs give me grief and I don't like the kludge of asking for a
> single base, but they work to get some of the details that the DocSum
> doesn't - sensible title for example).
>
> cheers
> Dan

Bio::DB::EUtilities is a generic front-end to NCBI's eutils, so you  
can retrieve the gene info in whatever format via efetch and pass it  
on to the appropriate parser (Bio::SeqIO::entrezgene, which requires  
ASN1 at this time).  It's probably best to save the ASN1 data to a  
temp file prior to that (getting the raw content as a file handle  
isn't set up yet).

chris

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l