Re: Parsing a FASTA file (Was: Bioperl-l Digest, Vol 74, Issue 25)

Hi Paola,
You want to try Bio::SearchIO, I think. It's not quite clear what you want to do, but here's an example of what you can do.

Get all high-scoring pairs (the mini-alignments) involving the database sequence called "2ojg:A":

    use Bio::SearchIO;

    my $io = Bio::SearchIO->new(-format=>'fasta', -file=>'yourfile.fasta');
    my $result = $io->next_result;
    my @desired_hsps;
    while ( my $hit = $result->next_hit ) {
        push @desired_hsps, grep { $_->subject->seq_id =~ /2ojg:A/ } $hit->hsps;
    }
    # now all your desired hsps are in the array @desired_hsps;
    # you can get Bio::SimpleAlign objects from them all, for example:
    my @aligns = map { $_->get_aln } @desired_hsps;
    #...and lots of other things...

Look at
http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_SearchIO
and
http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_the_methods
for a nice introduction to the Bio::SearchIO system by its authors. They use a blast output as an example, but everything applies to fasta output as well.

You didn't waste your time writing regexps, by the way. For a Perl student, that kind of work is like money in the bank.

cheers,
Mark

----- Original Message -----
From: "Paola Bisignano" <paola.bisignano@...>
To: <bioperl-l@...>
Sent: Tuesday, June 30, 2009 5:12 AM
Subject: Re: [Bioperl-l] Bioperl-l Digest, Vol 74, Issue 25

> Hi,
> I need a little help, to parse a file, but I tried to search some
> modules of bioperl, but there are a lot, and I don't know how to
> start, I find modules for all db, for different web site, but not for
> my favorite PDBsum....so I parsed a lot of thing on my own, even if I
> was new in learning perl....but now I'm waiting for help...because I
> need to parse a FASTA file, resulted from aligned sequences...I need
> to extract the aligned sequences, only for the pdb in my list....
>
> my fasta file is like:
>
> Query: /ebi/research/thornton/tmp/sas307986/seq.fasta
> 1>>>Sequence 3e7e:A - 333 aa
> Library: /ebi/research/thornton/www/databases/html/pdbsum/data/pdblib
> 17840403 residues in 79353 sequences
>
> opt E()
> < 20 286 0:===
> 22 1 0:= one = represents 135 library sequences
> 24 1 0:=
> 26 0 2:*
> 28 21 18:*
> 30 36 109:*
> 32 237 421:== *
> 34 956 1140:========*
> 36 1924 2342:=============== *
> 38 3591 3871:=========================== *
> 40 4904 5400:===================================== *
> 42 6750 6600:================================================*=
> 44 7145 7281:=====================================================*
> 46 8047 7416:======================================================*=====
> .........
> >>>2np8:A (159 aa)
> initn: 125 init1: 72 opt: 136 Z-score: 168.6 bits: 38.5 E(): 0.011
> Smith-Waterman score: 136; 26.0% identity (57.1% similar) in 154 aa
> overlap (59-204:13-153)
>
> 10 20 30 40 50 60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
> ::
> 2np8:A QWALEDFEIGRPLG
> 10
>
> 70 80 90 100 110
> Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
> .: :..:: : ....::.: :: :. . . :: .. .. ..: ....:.
> 2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
> 20 30 40 50 60 70
>
> 120 130 140 150 160 170
> Sequen LFQNGS--VLVGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEII
> :.... :. : ::. .. .. :. . .. .. . :. ..:
> 2np8:A YFHDATRVYLILEYAPLGTVYRELQKLSKFDEQR-----TATYITELANALSYCHSKRVI
> 80 90 100 110 120
>
> 180 190 200 210 220 230
> Sequen HGDIKPDNFILGNGFLEQSAG-LALIDLGQSIDMKLFPKGTIFTAKCETSGFQCVEMLSN
> : ::::.:..:: ::: : . :.: :.
> 2np8:A HRDIKPENLLLG------SAGELKIADFGWSVHAPSSR
> 130 140 150
>
> 240 250 260 270 280 290
> Sequen KPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLNIP
>
> 300 310 320 330
> Sequen DCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
> >>>2ojg:A (337 aa)
> initn: 85 init1: 53 opt: 140 Z-score: 168.1 bits: 39.5 E(): 0.012
> Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
> overlap (46-252:1-204)
>
> 10 20 30 40 50 60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
> :..: . . . .. :
> 2ojg:A FDVGPRYTNLSYI-G
> 10
>
> 70 80 90 100 110
> Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
> :::...: : .: .: . ..: .:.: : ....: ....: ...
> 2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
> 20 30 40 50 60
>
> 120 130 140 150 160 170
> Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
> .... . ..: :... .::: . . . . : ...: .. .:. ..
> 2ojg:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
> 70 80 90 100 110 120
>
> 180 190 200 210 220 230
> Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
> .: :.::.:..:.. . : . :.: . . . ..: : .. : ::
> 2ojg:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
> 130 140 150 160 170 180
>
> 240 250 260 270 280 290
> Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
> ..: .. .:: ..:. . ::
> 2ojg:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
> 190 200 210 220 230 240
>
> 300 310 320 330
> Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>
> 2ojg:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
> 250 260 270 280 290 300
>
> 2ojg:A DEPIAEAPFKFELDDLPKEKLKELIFEETARFQPG
> 310 320 330
> >>>2oji:A (344 aa)
> initn: 85 init1: 53 opt: 140 Z-score: 168.0 bits: 39.5 E(): 0.012
> Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
> overlap (46-252:5-208)
>
> 10 20 30 40 50 60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
> :..: . . . .. :
> 2oji:A RGQVFDVGPRYTNLSYI-G
> 10
>
> 70 80 90 100 110
> Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
> :::...: : .: .: . ..: .:.: : ....: ....: ...
> 2oji:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
> 20 30 40 50 60 70
>
> 120 130 140 150 160 170
> Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
> .... . ..: :... .::: . . . . : ...: .. .:. ..
> 2oji:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
> 80 90 100 110 120 130
>
> 180 190 200 210 220 230
> Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
> .: :.::.:..:.. . : . :.: . . . ..: : .. : ::
> 2oji:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
> 140 150 160 170 180
>
> 240 250 260 270 280 290
> Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
> ..: .. .:: ..:. . ::
> 2oji:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
> 190 200 210 220 230 240
>
> 300 310 320 330
> Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>
> 2oji:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
> 250 260 270 280 290 300
>
> 2oji:A DEPIAEAPFKFDMELDDLPKEKLKELIFEETARFQPGY
> 310 320 330 340
>
> .......
> I show a part of the file...if I want for example only those two
> alignments? are there modules to parse...because I've tried to parse
> with regex but....without results :-(....
> If anyone has suggestion for modules or anything else, I'll be very
> happy to learn
> thanks
> Paola
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
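
For the case described in the quoted message (keeping only the alignments for the PDB entries on a list), the regexp route can also be made to work, because each hit in the sample starts with a header line such as ">>>2ojg:A (337 aa)". What follows is only a minimal sketch in plain Perl, not BioPerl and not a replacement for Bio::SearchIO. The helper name extract_hits and the tiny inline sample are hypothetical, and the header pattern is an assumption based on the sample above (it accepts two or three leading ">" characters, since the quoting may have added one).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: collect the text of each hit
# record whose ID is in the wanted list.
# Assumption: a hit record starts with a header line like
# ">>>2ojg:A (337 aa)" (two or three ">" characters) and runs until the
# next such header.
sub extract_hits {
    my ($text, @wanted) = @_;
    my %want = map { $_ => 1 } @wanted;
    my (%records, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>>>?(\S+)\s+\(\d+ aa\)/) {
            # New hit header: keep collecting only if this ID is wanted.
            $current = $want{$1} ? $1 : undef;
        }
        next unless defined $current;
        $records{$current} .= "$line\n";
    }
    return \%records;
}

# Tiny made-up input in the same shape as the sample (alignment lines omitted).
my $sample = join "\n",
    '>>>2np8:A (159 aa)',
    ' initn: 125 init1: 72 opt: 136',
    '>>>2ojg:A (337 aa)',
    ' initn: 85 init1: 53 opt: 140',
    '>>>2oji:A (344 aa)',
    ' initn: 85 init1: 53 opt: 140';

my $hits = extract_hits($sample, '2ojg:A', '2oji:A');
print "$_\n$hits->{$_}\n" for sort keys %$hits;
```

This only splits the file into per-hit text blocks. Anything beyond that (scores, coordinates, alignment objects) is what the Bio::SearchIO approach in the reply gives you without hand-written parsing.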