|
View:
New views
10 Messages
—
Rating Filter:
Alert me
|
|
|
Finding possible primers regexHi there,
I'm trying to write a perl script to scan an aligned multiple entry fasta file and find possible primers. So far I've produced a string which contains bases which match all sequences and * where they don't match e.g. 1) TTAGCCTAA 2) TTAGCAGAA 3) TTACCCTAA would give TTA*C**AA. I want to parse this string and pull out all sequences which are 18-21 bp in length and have no more than 4 * in them. So far, I've got this: while($fragment_match =~ /([GTAC*]{18,21})/g){ print "$1\n"; } hoping to match all fragments 18-21 characters in length. However even that doesn't work as it has essentially chunked it into 21 char blocks, rather than what I hoped for of 0-18 0-19 0-20 0-21 1-19 1-20 1-21 1-22 etc. Can anyone let me know if this is already possible in BioPerl, or how one would go about it with regex. Sadly I'm fairly new to perl and getting to grips with BioPerl, so please treat me gently :). Many thanks, Ben |
|
|
Re: Finding possible primers regexOn Aug 2, 2008, at 3:05 PM, Benbo wrote:
> > Hi there, > I'm trying to write a perl script to scan an aligned multiple entry > fasta > file and find possible primers. So far I've produced a string which > contains > bases which match all sequences and * where they don't match e.g. > 1) TTAGCCTAA > 2) TTAGCAGAA > 3) TTACCCTAA > > would give TTA*C**AA. > > I want to parse this string and pull out all sequences which are > 18-21 bp in > length and have no more than 4 * in them. > > So far, I've got this: > > while($fragment_match =~ /([GTAC*]{18,21})/g){ > print "$1\n"; > } > > hoping to match all fragments 18-21 characters in length. However > even that > doesn't work as it has essentially chunked it into 21 char blocks, > rather > than what I hoped for of > 0-18 > 0-19 > 0-20 > 0-21 > 1-19 > 1-20 > 1-21 > 1-22 > > etc. > > Can anyone let me know if this is already possible in BioPerl, or > how one > would go about it with regex. Sadly I'm fairly new to perl and > getting to > grips with BioPerl, so please treat me gently :). > > Many thanks, > > Ben There is a trick to this which is discussed more extensively in 'Mastering Regular Expressions'. Essentially you have to embed code into the regex and trick the parser into backtracking using a negative lookahead. The match itself fails (i.e. no match is returned), but the embedded code is executed for each match attempt, The following script is a slight modification of one I used which checks the consensus string from the input alignment (in aligned FASTA format here), extracts the alignment slice using that match, then spit the alignment out to STDOUT in clustalw format. This should work for perl 5.8 and up, but it's only been tested on perl 5.10. You should be able to use this to fit what you want. my $in = Bio::AlignIO->new(-file => $file, -format => 'fasta'); my $out = Bio::AlignIO->new(-fh => \*STDOUT, -format => 'clustalw'); while (my $aln = $in->next_aln) { my $c = $aln->consensus_string(100); my @matches; $c =~ m/ ([GTAC?]{18,21}) (?{my $match = check_match($1); push @matches, [$match, pos(), length($match)] if defined $match;}) (?!) /xig; for my $match (@matches) { my ($hit, $st, $end) = ($match->[0], $match->[1] - $match->[2] + 1, $match->[1]); my $newaln = $aln->slice($st, $end); $out->write_aln($newaln); } } sub check_match { my $match = shift; return unless $match; my $ct = $match =~ tr/?/?/; return $match if $ct <= 4; } chris _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
how to remove indentical sequences from a datasetHi, there ,
I have a sequence dataset which contains about 200 sequences. there are some identical sequences in this. is there any bioperl modules which can remove those identical sequences? thanks a lot. yours, shaohua ----- Original Message ----- From: "Benbo" <btemperton@...> To: <Bioperl-l@...> Sent: Sunday, August 03, 2008 4:05 AM Subject: [Bioperl-l] Finding possible primers regex > > Hi there, > I'm trying to write a perl script to scan an aligned multiple entry fasta > file and find possible primers. So far I've produced a string which contains > bases which match all sequences and * where they don't match e.g. > 1) TTAGCCTAA > 2) TTAGCAGAA > 3) TTACCCTAA > > would give TTA*C**AA. > > I want to parse this string and pull out all sequences which are 18-21 bp in > length and have no more than 4 * in them. > > So far, I've got this: > > while($fragment_match =~ /([GTAC*]{18,21})/g){ > print "$1\n"; > } > > hoping to match all fragments 18-21 characters in length. However even that > doesn't work as it has essentially chunked it into 21 char blocks, rather > than what I hoped for of > 0-18 > 0-19 > 0-20 > 0-21 > 1-19 > 1-20 > 1-21 > 1-22 > > etc. > > Can anyone let me know if this is already possible in BioPerl, or how one > would go about it with regex. Sadly I'm fairly new to perl and getting to > grips with BioPerl, so please treat me gently :). > > Many thanks, > > Ben > > > > -- > View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: how to remove indentical sequences from a datasetHi,
There is a BioPerl Utility script doing this. See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities header. " scripts/utilities/bp_nrdb.PLS Make a non-redundant database based on sequence, not id. Requires Digest::MD5." Alternatively, you can make a hash using the sequences as keys. Regards, Bernd On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...> wrote: > Hi, there , > > I have a sequence dataset which contains about 200 sequences. there are some identical sequences in this. is there any bioperl modules which can remove those identical sequences? > > thanks a lot. > yours, > shaohua > ----- Original Message ----- > From: "Benbo" <btemperton@...> > To: <Bioperl-l@...> > Sent: Sunday, August 03, 2008 4:05 AM > Subject: [Bioperl-l] Finding possible primers regex > > >> >> Hi there, >> I'm trying to write a perl script to scan an aligned multiple entry fasta >> file and find possible primers. So far I've produced a string which contains >> bases which match all sequences and * where they don't match e.g. >> 1) TTAGCCTAA >> 2) TTAGCAGAA >> 3) TTACCCTAA >> >> would give TTA*C**AA. >> >> I want to parse this string and pull out all sequences which are 18-21 bp in >> length and have no more than 4 * in them. >> >> So far, I've got this: >> >> while($fragment_match =~ /([GTAC*]{18,21})/g){ >> print "$1\n"; >> } >> >> hoping to match all fragments 18-21 characters in length. However even that >> doesn't work as it has essentially chunked it into 21 char blocks, rather >> than what I hoped for of >> 0-18 >> 0-19 >> 0-20 >> 0-21 >> 1-19 >> 1-20 >> 1-21 >> 1-22 >> >> etc. >> >> Can anyone let me know if this is already possible in BioPerl, or how one >> would go about it with regex. Sadly I'm fairly new to perl and getting to >> grips with BioPerl, so please treat me gently :). >> >> Many thanks, >> >> Ben >> >> >> >> -- >> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l@... >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l > Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: how to remove indentical sequences from a datasetHi all,
Or you might try a non-bioperl solution that works pretty well, check: http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86 Best, Diego Bernd Web wrote: > Hi, > > There is a BioPerl Utility script doing this. > See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities header. > > " scripts/utilities/bp_nrdb.PLS > Make a non-redundant database based on sequence, not id. Requires > Digest::MD5." > > Alternatively, you can make a hash using the sequences as keys. > > > Regards, > Bernd > > On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...> wrote: >> Hi, there , >> >> I have a sequence dataset which contains about 200 sequences. there are some identical sequences in this. is there any bioperl modules which can remove those identical sequences? >> >> thanks a lot. >> yours, >> shaohua >> ----- Original Message ----- >> From: "Benbo" <btemperton@...> >> To: <Bioperl-l@...> >> Sent: Sunday, August 03, 2008 4:05 AM >> Subject: [Bioperl-l] Finding possible primers regex >> >> >>> Hi there, >>> I'm trying to write a perl script to scan an aligned multiple entry fasta >>> file and find possible primers. So far I've produced a string which contains >>> bases which match all sequences and * where they don't match e.g. >>> 1) TTAGCCTAA >>> 2) TTAGCAGAA >>> 3) TTACCCTAA >>> >>> would give TTA*C**AA. >>> >>> I want to parse this string and pull out all sequences which are 18-21 bp in >>> length and have no more than 4 * in them. >>> >>> So far, I've got this: >>> >>> while($fragment_match =~ /([GTAC*]{18,21})/g){ >>> print "$1\n"; >>> } >>> >>> hoping to match all fragments 18-21 characters in length. However even that >>> doesn't work as it has essentially chunked it into 21 char blocks, rather >>> than what I hoped for of >>> 0-18 >>> 0-19 >>> 0-20 >>> 0-21 >>> 1-19 >>> 1-20 >>> 1-21 >>> 1-22 >>> >>> etc. >>> >>> Can anyone let me know if this is already possible in BioPerl, or how one >>> would go about it with regex. Sadly I'm fairly new to perl and getting to >>> grips with BioPerl, so please treat me gently :). >>> >>> Many thanks, >>> >>> Ben >>> >>> >>> >>> -- >>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html >>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l@... >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l@... >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ___________________________________ Diego Mauricio Riaño Pachón Biologist - PhD student AG Mueller-Roeber Institute for Biochemistry and Biology University of Potsdam Address: Karl-Liebknecht-Str. 24-25 Haus 20 14476 Golm Germany Tel: +49 331 977 2809 Fax: +49 331 977 2512 web: http://www.geocities.com/dmrp.geo _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: how to remove indentical sequences from a datasetHere are two links which go into detail (the last is a specific
implementation): http://en.wikipedia.org/wiki/Sequence_clustering http://www.bioinformatics.org/cd-hit/ chris On Aug 5, 2008, at 5:28 AM, Diego Mauricio Riano Pachon wrote: > Hi all, > > Or you might try a non-bioperl solution that works pretty well, check: > > http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86 > > Best, > > Diego > > Bernd Web wrote: >> Hi, >> There is a BioPerl Utility script doing this. >> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities >> header. >> " scripts/utilities/bp_nrdb.PLS >> Make a non-redundant database based on sequence, not id. Requires >> Digest::MD5." >> Alternatively, you can make a hash using the sequences as keys. >> Regards, >> Bernd >> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...> >> wrote: >>> Hi, there , >>> >>> I have a sequence dataset which contains about 200 sequences. >>> there are some identical sequences in this. is there any bioperl >>> modules which can remove those identical sequences? >>> >>> thanks a lot. >>> yours, >>> shaohua >>> ----- Original Message ----- >>> From: "Benbo" <btemperton@...> >>> To: <Bioperl-l@...> >>> Sent: Sunday, August 03, 2008 4:05 AM >>> Subject: [Bioperl-l] Finding possible primers regex >>> >>> >>>> Hi there, >>>> I'm trying to write a perl script to scan an aligned multiple >>>> entry fasta >>>> file and find possible primers. So far I've produced a string >>>> which contains >>>> bases which match all sequences and * where they don't match e.g. >>>> 1) TTAGCCTAA >>>> 2) TTAGCAGAA >>>> 3) TTACCCTAA >>>> >>>> would give TTA*C**AA. >>>> >>>> I want to parse this string and pull out all sequences which are >>>> 18-21 bp in >>>> length and have no more than 4 * in them. >>>> >>>> So far, I've got this: >>>> >>>> while($fragment_match =~ /([GTAC*]{18,21})/g){ >>>> print "$1\n"; >>>> } >>>> >>>> hoping to match all fragments 18-21 characters in length. However >>>> even that >>>> doesn't work as it has essentially chunked it into 21 char >>>> blocks, rather >>>> than what I hoped for of >>>> 0-18 >>>> 0-19 >>>> 0-20 >>>> 0-21 >>>> 1-19 >>>> 1-20 >>>> 1-21 >>>> 1-22 >>>> >>>> etc. >>>> >>>> Can anyone let me know if this is already possible in BioPerl, or >>>> how one >>>> would go about it with regex. Sadly I'm fairly new to perl and >>>> getting to >>>> grips with BioPerl, so please treat me gently :). >>>> >>>> Many thanks, >>>> >>>> Ben >>>> >>>> >>>> >>>> -- >>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html >>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l@... >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l@... >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l@... >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > -- > ___________________________________ > Diego Mauricio Riaño Pachón > Biologist - PhD student > AG Mueller-Roeber > Institute for Biochemistry and Biology > University of Potsdam > > Address: Karl-Liebknecht-Str. 24-25 > Haus 20 > 14476 Golm > Germany > > Tel: +49 331 977 2809 > Fax: +49 331 977 2512 > > web: http://www.geocities.com/dmrp.geo > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: how to remove indentical sequences from a datasethi,
thanks a lot for the help! cheers, shaohua ----- Original Message ----- From: "Chris Fields" <cjfields@...> To: "Diego Mauricio Riano Pachon" <diriano@...> Cc: "Bernd Web" <bernd.web@...>; <Bioperl-l@...>; "Shaohua Fan" <lengjingmao@...> Sent: Tuesday, August 05, 2008 11:19 PM Subject: Re: [Bioperl-l] how to remove indentical sequences from a dataset Here are two links which go into detail (the last is a specific implementation): http://en.wikipedia.org/wiki/Sequence_clustering http://www.bioinformatics.org/cd-hit/ chris On Aug 5, 2008, at 5:28 AM, Diego Mauricio Riano Pachon wrote: > Hi all, > > Or you might try a non-bioperl solution that works pretty well, check: > > http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86 > > Best, > > Diego > > Bernd Web wrote: >> Hi, >> There is a BioPerl Utility script doing this. >> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities >> header. >> " scripts/utilities/bp_nrdb.PLS >> Make a non-redundant database based on sequence, not id. Requires >> Digest::MD5." >> Alternatively, you can make a hash using the sequences as keys. >> Regards, >> Bernd >> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...> >> wrote: >>> Hi, there , >>> >>> I have a sequence dataset which contains about 200 sequences. >>> there are some identical sequences in this. is there any bioperl >>> modules which can remove those identical sequences? >>> >>> thanks a lot. >>> yours, >>> shaohua >>> ----- Original Message ----- >>> From: "Benbo" <btemperton@...> >>> To: <Bioperl-l@...> >>> Sent: Sunday, August 03, 2008 4:05 AM >>> Subject: [Bioperl-l] Finding possible primers regex >>> >>> >>>> Hi there, >>>> I'm trying to write a perl script to scan an aligned multiple >>>> entry fasta >>>> file and find possible primers. So far I've produced a string >>>> which contains >>>> bases which match all sequences and * where they don't match e.g. >>>> 1) TTAGCCTAA >>>> 2) TTAGCAGAA >>>> 3) TTACCCTAA >>>> >>>> would give TTA*C**AA. >>>> >>>> I want to parse this string and pull out all sequences which are >>>> 18-21 bp in >>>> length and have no more than 4 * in them. >>>> >>>> So far, I've got this: >>>> >>>> while($fragment_match =~ /([GTAC*]{18,21})/g){ >>>> print "$1\n"; >>>> } >>>> >>>> hoping to match all fragments 18-21 characters in length. However >>>> even that >>>> doesn't work as it has essentially chunked it into 21 char >>>> blocks, rather >>>> than what I hoped for of >>>> 0-18 >>>> 0-19 >>>> 0-20 >>>> 0-21 >>>> 1-19 >>>> 1-20 >>>> 1-21 >>>> 1-22 >>>> >>>> etc. >>>> >>>> Can anyone let me know if this is already possible in BioPerl, or >>>> how one >>>> would go about it with regex. Sadly I'm fairly new to perl and >>>> getting to >>>> grips with BioPerl, so please treat me gently :). >>>> >>>> Many thanks, >>>> >>>> Ben >>>> >>>> >>>> >>>> -- >>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html >>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l@... >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l@... >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l@... >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > -- > ___________________________________ > Diego Mauricio Riaño Pachón > Biologist - PhD student > AG Mueller-Roeber > Institute for Biochemistry and Biology > University of Potsdam > > Address: Karl-Liebknecht-Str. 24-25 > Haus 20 > 14476 Golm > Germany > > Tel: +49 331 977 2809 > Fax: +49 331 977 2512 > > web: http://www.geocities.com/dmrp.geo > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Marie-Claude Hofmann College of Veterinary Medicine University of Illinois Urbana-Champaign _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
Re: Finding possible primers regexThis looks like a neat trick. Do you think it's worth including as a
SimpleAlign method (obviously w/o the printing to STDOUT)? I can imagine that a lot of people might appreciate it. -hilmar On Aug 4, 2008, at 12:08 AM, Chris Fields wrote: > On Aug 2, 2008, at 3:05 PM, Benbo wrote: > >> >> Hi there, >> I'm trying to write a perl script to scan an aligned multiple entry >> fasta >> file and find possible primers. So far I've produced a string which >> contains >> bases which match all sequences and * where they don't match e.g. >> 1) TTAGCCTAA >> 2) TTAGCAGAA >> 3) TTACCCTAA >> >> would give TTA*C**AA. >> >> I want to parse this string and pull out all sequences which are >> 18-21 bp in >> length and have no more than 4 * in them. >> >> So far, I've got this: >> >> while($fragment_match =~ /([GTAC*]{18,21})/g){ >> print "$1\n"; >> } >> >> hoping to match all fragments 18-21 characters in length. However >> even that >> doesn't work as it has essentially chunked it into 21 char blocks, >> rather >> than what I hoped for of >> 0-18 >> 0-19 >> 0-20 >> 0-21 >> 1-19 >> 1-20 >> 1-21 >> 1-22 >> >> etc. >> >> Can anyone let me know if this is already possible in BioPerl, or >> how one >> would go about it with regex. Sadly I'm fairly new to perl and >> getting to >> grips with BioPerl, so please treat me gently :). >> >> Many thanks, >> >> Ben > > There is a trick to this which is discussed more extensively in > 'Mastering Regular Expressions'. Essentially you have to embed code > into the regex and trick the parser into backtracking using a > negative lookahead. The match itself fails (i.e. no match is > returned), but the embedded code is executed for each match attempt, > > The following script is a slight modification of one I used which > checks the consensus string from the input alignment (in aligned > FASTA format here), extracts the alignment slice using that match, > then spit the alignment out to STDOUT in clustalw format. This > should work for perl 5.8 and up, but it's only been tested on perl > 5.10. You should be able to use this to fit what you want. > > my $in = Bio::AlignIO->new(-file => $file, > -format => 'fasta'); > my $out = Bio::AlignIO->new(-fh => \*STDOUT, > -format => 'clustalw'); > > while (my $aln = $in->next_aln) { > my $c = $aln->consensus_string(100); > my @matches; > $c =~ m/ > ([GTAC?]{18,21}) > (?{my $match = check_match($1); > push @matches, [$match, > pos(), > length($match)] > if defined $match;}) > (?!) > /xig; > for my $match (@matches) { > my ($hit, $st, $end) = ($match->[0], > $match->[1] - $match->[2] + 1, > $match->[1]); > my $newaln = $aln->slice($st, $end); > $out->write_aln($newaln); > } > } > > sub check_match { > my $match = shift; > return unless $match; > my $ct = $match =~ tr/?/?/; > return $match if $ct <= 4; > } > > > chris > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l@... > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
|
|
|
|
|
Re: Finding possible primers regexActually, now that you ask I'm wondering whether one wouldn't
sometimes want to retain the relationship between the match and the resulting spliced alignment? If so, neither AlignIO nor array would accomplish that, right? Other than that I myself don't have a strong preference either way. I suppose AlignIO stream is somewhat more extensible, since as you say it could be coupled to a file if the resulting set of alignments is really large. -hilmar On Aug 11, 2008, at 3:50 PM, Christopher Fields wrote: > When I can I could try generating a method which accepts a regex/ > Bio::Tools::SeqPattern and returns an AlignIO stream or array of > SimpleAlign instances (the former could be attached to a temp file > for iteration). Any preference? > > chris > > ---- Original message ---- >> Date: Sat, 9 Aug 2008 12:07:30 -0400 >> From: Hilmar Lapp <hlapp@...> >> Subject: Re: [Bioperl-l] Finding possible primers regex >> To: Chris Fields <cjfields@...> >> Cc: Benbo <btemperton@...>, Bioperl-l@... >> >> This looks like a neat trick. Do you think it's worth including as a >> SimpleAlign method (obviously w/o the printing to STDOUT)? I can >> imagine that a lot of people might appreciate it. >> >> -hilmar >> >> On Aug 4, 2008, at 12:08 AM, Chris Fields wrote: >> >>> On Aug 2, 2008, at 3:05 PM, Benbo wrote: >>> >>>> >>>> Hi there, >>>> I'm trying to write a perl script to scan an aligned multiple entry >>>> fasta >>>> file and find possible primers. So far I've produced a string which >>>> contains >>>> bases which match all sequences and * where they don't match e.g. >>>> 1) TTAGCCTAA >>>> 2) TTAGCAGAA >>>> 3) TTACCCTAA >>>> >>>> would give TTA*C**AA. >>>> >>>> I want to parse this string and pull out all sequences which are >>>> 18-21 bp in >>>> length and have no more than 4 * in them. >>>> >>>> So far, I've got this: >>>> >>>> while($fragment_match =~ /([GTAC*]{18,21})/g){ >>>> print "$1\n"; >>>> } >>>> >>>> hoping to match all fragments 18-21 characters in length. However >>>> even that >>>> doesn't work as it has essentially chunked it into 21 char blocks, >>>> rather >>>> than what I hoped for of >>>> 0-18 >>>> 0-19 >>>> 0-20 >>>> 0-21 >>>> 1-19 >>>> 1-20 >>>> 1-21 >>>> 1-22 >>>> >>>> etc. >>>> >>>> Can anyone let me know if this is already possible in BioPerl, or >>>> how one >>>> would go about it with regex. Sadly I'm fairly new to perl and >>>> getting to >>>> grips with BioPerl, so please treat me gently :). >>>> >>>> Many thanks, >>>> >>>> Ben >>> >>> There is a trick to this which is discussed more extensively in >>> 'Mastering Regular Expressions'. Essentially you have to embed code >>> into the regex and trick the parser into backtracking using a >>> negative lookahead. The match itself fails (i.e. no match is >>> returned), but the embedded code is executed for each match attempt, >>> >>> The following script is a slight modification of one I used which >>> checks the consensus string from the input alignment (in aligned >>> FASTA format here), extracts the alignment slice using that match, >>> then spit the alignment out to STDOUT in clustalw format. This >>> should work for perl 5.8 and up, but it's only been tested on perl >>> 5.10. You should be able to use this to fit what you want. >>> >>> my $in = Bio::AlignIO->new(-file => $file, >>> -format => 'fasta'); >>> my $out = Bio::AlignIO->new(-fh => \*STDOUT, >>> -format => 'clustalw'); >>> >>> while (my $aln = $in->next_aln) { >>> my $c = $aln->consensus_string(100); >>> my @matches; >>> $c =~ m/ >>> ([GTAC?]{18,21}) >>> (?{my $match = check_match($1); >>> push @matches, [$match, >>> pos(), >>> length($match)] >>> if defined $match;}) >>> (?!) >>> /xig; >>> for my $match (@matches) { >>> my ($hit, $st, $end) = ($match->[0], >>> $match->[1] - $match->[2] + 1, >>> $match->[1]); >>> my $newaln = $aln->slice($st, $end); >>> $out->write_aln($newaln); >>> } >>> } >>> >>> sub check_match { >>> my $match = shift; >>> return unless $match; >>> my $ct = $match =~ tr/?/?/; >>> return $match if $ct <= 4; >>> } >>> >>> >>> chris >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l@... >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== _______________________________________________ Bioperl-l mailing list Bioperl-l@... http://lists.open-bio.org/mailman/listinfo/bioperl-l |
| Free embeddable forum powered by Nabble | Forum Help |