|
View:
New views
4 Messages
—
Rating Filter:
Alert me
|
|
|
Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequenceHi,
I am looking to query NCBI for sequence metadata (LOCUS/length, DEFINITION/name etc) without obtaining the sequence associated with the entry (pulling sequence data for chromosome when only the metadata is needed is a waste). I'm wondering what would be the most appropriate bioperl module to use - Bio::DB::EntrezGene or Bio::DB::Query::GenBank seem like the best bet and from the description the latter seems best, but I'm wondering if this is best and what database would both provide this data and be parsable. thanks for any help. Dan _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequenceI guess it depends on what you've got to start with, how many queries, and which species.
For example, if you want metadata on all human genes, I'd probably do it "manually" from NCBI's website by searching the gene database for "human[orgn]", switching to "gene table" view, then save to file. It gives you an easily parsed text file with contents as below: ------------------------------------------ 1: TGFB1 transforming growth factor, beta 1 [ Homo sapiens ] GeneID: 7040 updated 07-Oct-2009 RefSeq status: REVIEWED total gene size: 23166 bp mRNA bp exons Protein aa exons NM_000660.3 2346 7 NP_000651.3 390 7 Exon information: NM_000660.3 length: 2346 bp, number of exons: 7 NP_000651.3 length: 390 aa, number of exons: 7 EXON Coding EXON INTRON coords length coords length coords length 1 - 1222 1222 bp 868 - 1222 355 bp 1223 - 5456 4234 bp 5457 - 5617 161 bp 5457 - 5617 161 bp 5618 - 9047 3430 bp 9048 - 9165 118 bp 9048 - 9165 118 bp 9166 - 11664 2499 bp 11665 - 11742 78 bp 11665 - 11742 78 bp 11743 - 11881 139 bp 11882 - 12029 148 bp 11882 - 12029 148 bp 12030 - 21630 9601 bp 21631 - 21784 154 bp 21631 - 21784 154 bp 21785 - 22701 917 bp 22702 - 23166 465 bp 22702 - 22860 159 bp ------------------------------------------ Or you could try using Bio::DB::Eutilities, specifying 'gene' as the database and 'table' as the retype. I'm not sure what retypes are allowed under B:D:E but it should be in the docs. Take a look at http://www.bioperl.org/wiki/Getting_Genomic_Sequences or http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook Hope this helps, Russell Smithies Bioinformatics Applications Developer T +64 3 489 9085 E russell.smithies@... Invermay Research Centre Puddle Alley, Mosgiel, New Zealand T +64 3 489 3809 F +64 3 489 9174 www.agresearch.co.nz > -----Original Message----- > From: bioperl-l-bounces@... [mailto:bioperl-l- > bounces@...] On Behalf Of Dan Kortschak > Sent: Friday, 9 October 2009 7:54 p.m. > To: bioperl-l@... > Subject: [Bioperl-l] Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain > sequence metadata without sequence > > Hi, > > I am looking to query NCBI for sequence metadata (LOCUS/length, > DEFINITION/name etc) without obtaining the sequence associated with the > entry (pulling sequence data for chromosome when only the metadata is > needed is a waste). > > I'm wondering what would be the most appropriate bioperl module to use - > Bio::DB::EntrezGene or Bio::DB::Query::GenBank seem like the best bet > and from the description the latter seems best, but I'm wondering if > this is best and what database would both provide this data and be > parsable. > > thanks for any help. > Dan > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequenceHi Russell,
I ended up using a hodgepodge of Bio::DB::GenBank and Bio::DB::EUtililties #!/usr/bin/perl use Bio::DB::EUtilities; use Bio::DB::GenBank; my @uids = qw(89161185 89161199 89161205 89161207 51511721 89161210 89161213 51511724 89161216 89161187 51511727 89161190 51511729 51511730 51511731 51511732 51511734 51511735 42406306 51511747 51511750 89161203 17981852 89161218 89161220); my $gb = Bio::DB::GenBank->new(-complexity => 1,-seq_stop=>1); my $seqio = $gb->get_Stream_by_gi(\@uids); my $summary = Bio::DB::EUtilities->new(-eutil => 'esummary', -db => 'nucleotide', -id => \@uids); print "chromosome,refSeq,name,length\n"; my $index=0; while (my $seq = $seqio->next_seq() and my $ds = $summary->next_DocSum) { warn "Database queries don't reconcile",$seq->primary_id,"-",$ds->get_id,"\n" if $seq->primary_id != $ds->get_id; print $index++,",",$seq->id(),".",$seq->version,","; ($feat)=$seq->get_SeqFeatures; if( defined $seq->species && $feat->annotation->get_Annotations('chromosome')) { print $seq->species->binomial; print " chromosome ",$feat->annotation->get_Annotations('chromosome'),","; } elsif (defined $seq->species && $feat->annotation->get_Annotations('organelle')) { print $seq->species->binomial; print " ",$feat->annotation->get_Annotations('organelle'),","; } else { $_=$seq->desc; /^([[:alnum:] ]+)[[:graph:]]/; print "$1,"; } print $ds->get_contents_by_name('Length') if $ds->get_contents_by_name('Length'); print "\n"; } If Bio::DB:EUtilities gives rich seq or equivalent I might change over from Bio::DB::GenBank, but it pretty much works at the moment (the fail-overs give me grief and I don't like the kludge of asking for a single base, but they work to get some of the details that the DocSum doesn't - sensible title for example). cheers Dan On Mon, 2009-10-12 at 08:46 +1300, Smithies, Russell wrote: > Or you could try using Bio::DB::Eutilities, specifying 'gene' as the database and 'table' as the retype. > I'm not sure what retypes are allowed under B:D:E but it should be in the docs. _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Bio::DB::EntrezGene or Bio::DB::Query::GenBank to obtain sequence metadata without sequenceOn Oct 11, 2009, at 5:03 PM, Dan Kortschak wrote:
> Hi Russell, > > I ended up using a hodgepodge of Bio::DB::GenBank and > Bio::DB::EUtililties > > #!/usr/bin/perl > > use Bio::DB::EUtilities; > use Bio::DB::GenBank; > > my @uids = qw(89161185 89161199 89161205 89161207 51511721 89161210 > 89161213 51511724 89161216 89161187 51511727 89161190 51511729 > 51511730 51511731 51511732 51511734 51511735 42406306 51511747 > 51511750 89161203 17981852 89161218 89161220); > > my $gb = Bio::DB::GenBank->new(-complexity => 1,-seq_stop=>1); > my $seqio = $gb->get_Stream_by_gi(\@uids); > my $summary = Bio::DB::EUtilities->new(-eutil => 'esummary', > -db => 'nucleotide', > -id => \@uids); > > print "chromosome,refSeq,name,length\n"; > > my $index=0; > > while (my $seq = $seqio->next_seq() and my $ds = $summary- > >next_DocSum) { > warn "Database queries don't reconcile",$seq->primary_id,"-",$ds- > >get_id,"\n" if $seq->primary_id != $ds->get_id; > print $index++,",",$seq->id(),".",$seq->version,","; > ($feat)=$seq->get_SeqFeatures; > if( defined $seq->species && $feat->annotation->get_Annotations > ('chromosome')) { > print $seq->species->binomial; > print " chromosome ",$feat->annotation->get_Annotations > ('chromosome'),","; > } elsif (defined $seq->species && $feat->annotation->get_Annotations > ('organelle')) { > print $seq->species->binomial; > print " ",$feat->annotation->get_Annotations('organelle'),","; > } else { > $_=$seq->desc; > /^([[:alnum:] ]+)[[:graph:]]/; > print "$1,"; > } > print $ds->get_contents_by_name('Length') if $ds- > >get_contents_by_name('Length'); > print "\n"; > } > > > If Bio::DB:EUtilities gives rich seq or equivalent I might change over > from Bio::DB::GenBank, but it pretty much works at the moment (the > fail-overs give me grief and I don't like the kludge of asking for a > single base, but they work to get some of the details that the DocSum > doesn't - sensible title for example). > > cheers > Dan Bio::DB::EUtilities is a generic front-end to NCBI's eutils, so you can retrieve the gene info in whatever format via efetch and pass it on to the appropriate parser (Bio::SeqIO::entrezgene, which requires ASN1 at this time). It's probably best to save the ASN1 data to a temp file prior to that (getting the raw content as a file handle isn't set up yet). chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
| Free embeddable forum powered by Nabble | Forum Help |