Finding possible primers regex

View: New views
10 Messages — Rating Filter:   Alert me  

Finding possible primers regex

by Benbo :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi there,
I'm trying to write a perl script to scan an aligned multiple entry fasta file and find possible primers. So far I've produced a string which contains bases which match all sequences and * where they don't match e.g.
1) TTAGCCTAA
2) TTAGCAGAA
3) TTACCCTAA

would give TTA*C**AA.

I want to parse this string and pull out all sequences which are 18-21 bp in length and have no more than 4 * in them.

So far, I've got this:

while($fragment_match =~ /([GTAC*]{18,21})/g){
print "$1\n";
}

hoping to match all fragments 18-21 characters in length. However even that doesn't work as it has essentially chunked it into 21 char blocks, rather than what I hoped for of
0-18
0-19
0-20
0-21
1-19
1-20
1-21
1-22

etc.

Can anyone let me know if this is already possible in BioPerl, or how one would go about it with regex. Sadly I'm fairly new to perl and getting to grips with BioPerl, so please treat me gently :).

Many thanks,

Ben



Re: Finding possible primers regex

by Chris Fields :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Aug 2, 2008, at 3:05 PM, Benbo wrote:

>
> Hi there,
> I'm trying to write a perl script to scan an aligned multiple entry  
> fasta
> file and find possible primers. So far I've produced a string which  
> contains
> bases which match all sequences and * where they don't match e.g.
> 1) TTAGCCTAA
> 2) TTAGCAGAA
> 3) TTACCCTAA
>
> would give TTA*C**AA.
>
> I want to parse this string and pull out all sequences which are  
> 18-21 bp in
> length and have no more than 4 * in them.
>
> So far, I've got this:
>
> while($fragment_match =~ /([GTAC*]{18,21})/g){
> print "$1\n";
> }
>
> hoping to match all fragments 18-21 characters in length. However  
> even that
> doesn't work as it has essentially chunked it into 21 char blocks,  
> rather
> than what I hoped for of
> 0-18
> 0-19
> 0-20
> 0-21
> 1-19
> 1-20
> 1-21
> 1-22
>
> etc.
>
> Can anyone let me know if this is already possible in BioPerl, or  
> how one
> would go about it with regex. Sadly I'm fairly new to perl and  
> getting to
> grips with BioPerl, so please treat me gently :).
>
> Many thanks,
>
> Ben

There is a trick to this which is discussed more extensively in  
'Mastering Regular Expressions'.  Essentially you have to embed code  
into the regex and trick the parser into backtracking using a negative  
lookahead.  The match itself fails (i.e. no match is returned), but  
the embedded code is executed for each match attempt,

The following script is a slight modification of one I used which  
checks the consensus string from the input alignment (in aligned FASTA  
format here), extracts the alignment slice using that match, then spit  
the alignment out to STDOUT in clustalw format.  This should work for  
perl 5.8 and up, but it's only been tested on perl 5.10.  You should  
be able to use this to fit what you want.

my $in = Bio::AlignIO->new(-file => $file,
                            -format => 'fasta');
my $out = Bio::AlignIO->new(-fh => \*STDOUT,
                            -format => 'clustalw');

while (my $aln = $in->next_aln) {
     my $c = $aln->consensus_string(100);
     my @matches;
     $c =~ m/
         ([GTAC?]{18,21})
         (?{my $match = check_match($1);
            push @matches, [$match,
                            pos(),
                            length($match)]
               if defined $match;})
         (?!)
         /xig;
     for my $match (@matches) {
         my ($hit, $st, $end) = ($match->[0],
                                 $match->[1] - $match->[2] + 1,
                                 $match->[1]);
         my $newaln = $aln->slice($st, $end);
         $out->write_aln($newaln);
     }
}

sub check_match {
     my $match = shift;
     return unless $match;
     my $ct = $match =~ tr/?/?/;
     return $match if $ct <= 4;
}


chris

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

how to remove indentical sequences from a dataset

by Shaohua Fan-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi, there ,

I have a sequence dataset which contains about 200 sequences. there are some identical sequences in this. is there any bioperl modules  which can remove those identical sequences?

thanks a lot.
yours,
shaohua
----- Original Message -----
From: "Benbo" <btemperton@...>
To: <Bioperl-l@...>
Sent: Sunday, August 03, 2008 4:05 AM
Subject: [Bioperl-l] Finding possible primers regex


>
> Hi there,
> I'm trying to write a perl script to scan an aligned multiple entry fasta
> file and find possible primers. So far I've produced a string which contains
> bases which match all sequences and * where they don't match e.g.
> 1) TTAGCCTAA
> 2) TTAGCAGAA
> 3) TTACCCTAA
>
> would give TTA*C**AA.
>
> I want to parse this string and pull out all sequences which are 18-21 bp in
> length and have no more than 4 * in them.
>
> So far, I've got this:
>
> while($fragment_match =~ /([GTAC*]{18,21})/g){
> print "$1\n";
> }
>
> hoping to match all fragments 18-21 characters in length. However even that
> doesn't work as it has essentially chunked it into 21 char blocks, rather
> than what I hoped for of
> 0-18
> 0-19
> 0-20
> 0-21
> 1-19
> 1-20
> 1-21
> 1-22
>
> etc.
>
> Can anyone let me know if this is already possible in BioPerl, or how one
> would go about it with regex. Sadly I'm fairly new to perl and getting to
> grips with BioPerl, so please treat me gently :).
>
> Many thanks,
>
> Ben
>
>
>
> --
> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: how to remove indentical sequences from a dataset

by Bernd Web :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi,

There is a BioPerl Utility script doing this.
See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities header.

" scripts/utilities/bp_nrdb.PLS
    Make a non-redundant database based on sequence, not id. Requires
Digest::MD5."

Alternatively, you can make a hash using the sequences as keys.


Regards,
Bernd

On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...> wrote:

> Hi, there ,
>
> I have a sequence dataset which contains about 200 sequences. there are some identical sequences in this. is there any bioperl modules  which can remove those identical sequences?
>
> thanks a lot.
> yours,
> shaohua
> ----- Original Message -----
> From: "Benbo" <btemperton@...>
> To: <Bioperl-l@...>
> Sent: Sunday, August 03, 2008 4:05 AM
> Subject: [Bioperl-l] Finding possible primers regex
>
>
>>
>> Hi there,
>> I'm trying to write a perl script to scan an aligned multiple entry fasta
>> file and find possible primers. So far I've produced a string which contains
>> bases which match all sequences and * where they don't match e.g.
>> 1) TTAGCCTAA
>> 2) TTAGCAGAA
>> 3) TTACCCTAA
>>
>> would give TTA*C**AA.
>>
>> I want to parse this string and pull out all sequences which are 18-21 bp in
>> length and have no more than 4 * in them.
>>
>> So far, I've got this:
>>
>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>> print "$1\n";
>> }
>>
>> hoping to match all fragments 18-21 characters in length. However even that
>> doesn't work as it has essentially chunked it into 21 char blocks, rather
>> than what I hoped for of
>> 0-18
>> 0-19
>> 0-20
>> 0-21
>> 1-19
>> 1-20
>> 1-21
>> 1-22
>>
>> etc.
>>
>> Can anyone let me know if this is already possible in BioPerl, or how one
>> would go about it with regex. Sadly I'm fairly new to perl and getting to
>> grips with BioPerl, so please treat me gently :).
>>
>> Many thanks,
>>
>> Ben
>>
>>
>>
>> --
>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@...
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: how to remove indentical sequences from a dataset

by Diego Mauricio Riano Pachon :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi all,

Or you might try a non-bioperl solution that works pretty well, check:

http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86

Best,

Diego

Bernd Web wrote:

> Hi,
>
> There is a BioPerl Utility script doing this.
> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities header.
>
> " scripts/utilities/bp_nrdb.PLS
>     Make a non-redundant database based on sequence, not id. Requires
> Digest::MD5."
>
> Alternatively, you can make a hash using the sequences as keys.
>
>
> Regards,
> Bernd
>
> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...> wrote:
>> Hi, there ,
>>
>> I have a sequence dataset which contains about 200 sequences. there are some identical sequences in this. is there any bioperl modules  which can remove those identical sequences?
>>
>> thanks a lot.
>> yours,
>> shaohua
>> ----- Original Message -----
>> From: "Benbo" <btemperton@...>
>> To: <Bioperl-l@...>
>> Sent: Sunday, August 03, 2008 4:05 AM
>> Subject: [Bioperl-l] Finding possible primers regex
>>
>>
>>> Hi there,
>>> I'm trying to write a perl script to scan an aligned multiple entry fasta
>>> file and find possible primers. So far I've produced a string which contains
>>> bases which match all sequences and * where they don't match e.g.
>>> 1) TTAGCCTAA
>>> 2) TTAGCAGAA
>>> 3) TTACCCTAA
>>>
>>> would give TTA*C**AA.
>>>
>>> I want to parse this string and pull out all sequences which are 18-21 bp in
>>> length and have no more than 4 * in them.
>>>
>>> So far, I've got this:
>>>
>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>> print "$1\n";
>>> }
>>>
>>> hoping to match all fragments 18-21 characters in length. However even that
>>> doesn't work as it has essentially chunked it into 21 char blocks, rather
>>> than what I hoped for of
>>> 0-18
>>> 0-19
>>> 0-20
>>> 0-21
>>> 1-19
>>> 1-20
>>> 1-21
>>> 1-22
>>>
>>> etc.
>>>
>>> Can anyone let me know if this is already possible in BioPerl, or how one
>>> would go about it with regex. Sadly I'm fairly new to perl and getting to
>>> grips with BioPerl, so please treat me gently :).
>>>
>>> Many thanks,
>>>
>>> Ben
>>>
>>>
>>>
>>> --
>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@...
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@...
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>


--
___________________________________
Diego Mauricio Riaño Pachón
Biologist - PhD student
AG Mueller-Roeber
Institute for Biochemistry and Biology
University of Potsdam

Address: Karl-Liebknecht-Str. 24-25
         Haus 20
         14476 Golm
         Germany

Tel: +49 331 977 2809
Fax: +49 331 977 2512

web: http://www.geocities.com/dmrp.geo
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: how to remove indentical sequences from a dataset

by Chris Fields :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Here are two links which go into detail (the last is a specific  
implementation):

http://en.wikipedia.org/wiki/Sequence_clustering
http://www.bioinformatics.org/cd-hit/

chris

On Aug 5, 2008, at 5:28 AM, Diego Mauricio Riano Pachon wrote:

> Hi all,
>
> Or you might try a non-bioperl solution that works pretty well, check:
>
> http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86
>
> Best,
>
> Diego
>
> Bernd Web wrote:
>> Hi,
>> There is a BioPerl Utility script doing this.
>> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities  
>> header.
>> " scripts/utilities/bp_nrdb.PLS
>>    Make a non-redundant database based on sequence, not id. Requires
>> Digest::MD5."
>> Alternatively, you can make a hash using the sequences as keys.
>> Regards,
>> Bernd
>> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...>  
>> wrote:
>>> Hi, there ,
>>>
>>> I have a sequence dataset which contains about 200 sequences.  
>>> there are some identical sequences in this. is there any bioperl  
>>> modules  which can remove those identical sequences?
>>>
>>> thanks a lot.
>>> yours,
>>> shaohua
>>> ----- Original Message -----
>>> From: "Benbo" <btemperton@...>
>>> To: <Bioperl-l@...>
>>> Sent: Sunday, August 03, 2008 4:05 AM
>>> Subject: [Bioperl-l] Finding possible primers regex
>>>
>>>
>>>> Hi there,
>>>> I'm trying to write a perl script to scan an aligned multiple  
>>>> entry fasta
>>>> file and find possible primers. So far I've produced a string  
>>>> which contains
>>>> bases which match all sequences and * where they don't match e.g.
>>>> 1) TTAGCCTAA
>>>> 2) TTAGCAGAA
>>>> 3) TTACCCTAA
>>>>
>>>> would give TTA*C**AA.
>>>>
>>>> I want to parse this string and pull out all sequences which are  
>>>> 18-21 bp in
>>>> length and have no more than 4 * in them.
>>>>
>>>> So far, I've got this:
>>>>
>>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>>> print "$1\n";
>>>> }
>>>>
>>>> hoping to match all fragments 18-21 characters in length. However  
>>>> even that
>>>> doesn't work as it has essentially chunked it into 21 char  
>>>> blocks, rather
>>>> than what I hoped for of
>>>> 0-18
>>>> 0-19
>>>> 0-20
>>>> 0-21
>>>> 1-19
>>>> 1-20
>>>> 1-21
>>>> 1-22
>>>>
>>>> etc.
>>>>
>>>> Can anyone let me know if this is already possible in BioPerl, or  
>>>> how one
>>>> would go about it with regex. Sadly I'm fairly new to perl and  
>>>> getting to
>>>> grips with BioPerl, so please treat me gently :).
>>>>
>>>> Many thanks,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l@...
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@...
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@...
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> --
> ___________________________________
> Diego Mauricio Riaño Pachón
> Biologist - PhD student
> AG Mueller-Roeber
> Institute for Biochemistry and Biology
> University of Potsdam
>
> Address: Karl-Liebknecht-Str. 24-25
> Haus 20
> 14476 Golm
> Germany
>
> Tel: +49 331 977 2809
> Fax: +49 331 977 2512
>
> web: http://www.geocities.com/dmrp.geo
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign





_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: how to remove indentical sequences from a dataset

by Shaohua Fan-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

hi,
thanks a lot for the help!
cheers,
shaohua
----- Original Message -----
From: "Chris Fields" <cjfields@...>
To: "Diego Mauricio Riano Pachon" <diriano@...>
Cc: "Bernd Web" <bernd.web@...>; <Bioperl-l@...>; "Shaohua Fan" <lengjingmao@...>
Sent: Tuesday, August 05, 2008 11:19 PM
Subject: Re: [Bioperl-l] how to remove indentical sequences from a dataset


Here are two links which go into detail (the last is a specific  
implementation):

http://en.wikipedia.org/wiki/Sequence_clustering
http://www.bioinformatics.org/cd-hit/

chris

On Aug 5, 2008, at 5:28 AM, Diego Mauricio Riano Pachon wrote:

> Hi all,
>
> Or you might try a non-bioperl solution that works pretty well, check:
>
> http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86
>
> Best,
>
> Diego
>
> Bernd Web wrote:
>> Hi,
>> There is a BioPerl Utility script doing this.
>> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities  
>> header.
>> " scripts/utilities/bp_nrdb.PLS
>>    Make a non-redundant database based on sequence, not id. Requires
>> Digest::MD5."
>> Alternatively, you can make a hash using the sequences as keys.
>> Regards,
>> Bernd
>> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao@...>  
>> wrote:
>>> Hi, there ,
>>>
>>> I have a sequence dataset which contains about 200 sequences.  
>>> there are some identical sequences in this. is there any bioperl  
>>> modules  which can remove those identical sequences?
>>>
>>> thanks a lot.
>>> yours,
>>> shaohua
>>> ----- Original Message -----
>>> From: "Benbo" <btemperton@...>
>>> To: <Bioperl-l@...>
>>> Sent: Sunday, August 03, 2008 4:05 AM
>>> Subject: [Bioperl-l] Finding possible primers regex
>>>
>>>
>>>> Hi there,
>>>> I'm trying to write a perl script to scan an aligned multiple  
>>>> entry fasta
>>>> file and find possible primers. So far I've produced a string  
>>>> which contains
>>>> bases which match all sequences and * where they don't match e.g.
>>>> 1) TTAGCCTAA
>>>> 2) TTAGCAGAA
>>>> 3) TTACCCTAA
>>>>
>>>> would give TTA*C**AA.
>>>>
>>>> I want to parse this string and pull out all sequences which are  
>>>> 18-21 bp in
>>>> length and have no more than 4 * in them.
>>>>
>>>> So far, I've got this:
>>>>
>>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>>> print "$1\n";
>>>> }
>>>>
>>>> hoping to match all fragments 18-21 characters in length. However  
>>>> even that
>>>> doesn't work as it has essentially chunked it into 21 char  
>>>> blocks, rather
>>>> than what I hoped for of
>>>> 0-18
>>>> 0-19
>>>> 0-20
>>>> 0-21
>>>> 1-19
>>>> 1-20
>>>> 1-21
>>>> 1-22
>>>>
>>>> etc.
>>>>
>>>> Can anyone let me know if this is already possible in BioPerl, or  
>>>> how one
>>>> would go about it with regex. Sadly I'm fairly new to perl and  
>>>> getting to
>>>> grips with BioPerl, so please treat me gently :).
>>>>
>>>> Many thanks,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l@...
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@...
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@...
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
> --
> ___________________________________
> Diego Mauricio Riaño Pachón
> Biologist - PhD student
> AG Mueller-Roeber
> Institute for Biochemistry and Biology
> University of Potsdam
>
> Address: Karl-Liebknecht-Str. 24-25
> Haus 20
> 14476 Golm
> Germany
>
> Tel: +49 331 977 2809
> Fax: +49 331 977 2512
>
> web: http://www.geocities.com/dmrp.geo
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign




_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Finding possible primers regex

by Hilmar Lapp :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

This looks like a neat trick. Do you think it's worth including as a  
SimpleAlign method (obviously w/o the printing to STDOUT)? I can  
imagine that a lot of people might appreciate it.

        -hilmar

On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:

> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>
>>
>> Hi there,
>> I'm trying to write a perl script to scan an aligned multiple entry  
>> fasta
>> file and find possible primers. So far I've produced a string which  
>> contains
>> bases which match all sequences and * where they don't match e.g.
>> 1) TTAGCCTAA
>> 2) TTAGCAGAA
>> 3) TTACCCTAA
>>
>> would give TTA*C**AA.
>>
>> I want to parse this string and pull out all sequences which are  
>> 18-21 bp in
>> length and have no more than 4 * in them.
>>
>> So far, I've got this:
>>
>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>> print "$1\n";
>> }
>>
>> hoping to match all fragments 18-21 characters in length. However  
>> even that
>> doesn't work as it has essentially chunked it into 21 char blocks,  
>> rather
>> than what I hoped for of
>> 0-18
>> 0-19
>> 0-20
>> 0-21
>> 1-19
>> 1-20
>> 1-21
>> 1-22
>>
>> etc.
>>
>> Can anyone let me know if this is already possible in BioPerl, or  
>> how one
>> would go about it with regex. Sadly I'm fairly new to perl and  
>> getting to
>> grips with BioPerl, so please treat me gently :).
>>
>> Many thanks,
>>
>> Ben
>
> There is a trick to this which is discussed more extensively in  
> 'Mastering Regular Expressions'.  Essentially you have to embed code  
> into the regex and trick the parser into backtracking using a  
> negative lookahead.  The match itself fails (i.e. no match is  
> returned), but the embedded code is executed for each match attempt,
>
> The following script is a slight modification of one I used which  
> checks the consensus string from the input alignment (in aligned  
> FASTA format here), extracts the alignment slice using that match,  
> then spit the alignment out to STDOUT in clustalw format.  This  
> should work for perl 5.8 and up, but it's only been tested on perl  
> 5.10.  You should be able to use this to fit what you want.
>
> my $in = Bio::AlignIO->new(-file => $file,
>                           -format => 'fasta');
> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>                           -format => 'clustalw');
>
> while (my $aln = $in->next_aln) {
>    my $c = $aln->consensus_string(100);
>    my @matches;
>    $c =~ m/
>        ([GTAC?]{18,21})
>        (?{my $match = check_match($1);
>           push @matches, [$match,
>                           pos(),
>                           length($match)]
>              if defined $match;})
>        (?!)
>        /xig;
>    for my $match (@matches) {
>        my ($hit, $st, $end) = ($match->[0],
>                                $match->[1] - $match->[2] + 1,
>                                $match->[1]);
>        my $newaln = $aln->slice($st, $end);
>        $out->write_aln($newaln);
>    }
> }
>
> sub check_match {
>    my $match = shift;
>    return unless $match;
>    my $ct = $match =~ tr/?/?/;
>    return $match if $ct <= 4;
> }
>
>
> chris
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@...
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================



_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Parent Message unknown Re: Finding possible primers regex

by Chris Fields-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

When I can I could try generating a method which accepts a regex/Bio::Tools::SeqPattern and returns an AlignIO stream or array of SimpleAlign instances (the former could be attached to a temp file for iteration).  Any preference?

chris

---- Original message ----

>Date: Sat, 9 Aug 2008 12:07:30 -0400
>From: Hilmar Lapp <hlapp@...>  
>Subject: Re: [Bioperl-l] Finding possible primers regex  
>To: Chris Fields <cjfields@...>
>Cc: Benbo <btemperton@...>, Bioperl-l@...
>
>This looks like a neat trick. Do you think it's worth including as a  
>SimpleAlign method (obviously w/o the printing to STDOUT)? I can  
>imagine that a lot of people might appreciate it.
>
> -hilmar
>
>On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:
>
>> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>>
>>>
>>> Hi there,
>>> I'm trying to write a perl script to scan an aligned multiple entry  
>>> fasta
>>> file and find possible primers. So far I've produced a string which  
>>> contains
>>> bases which match all sequences and * where they don't match e.g.
>>> 1) TTAGCCTAA
>>> 2) TTAGCAGAA
>>> 3) TTACCCTAA
>>>
>>> would give TTA*C**AA.
>>>
>>> I want to parse this string and pull out all sequences which are  
>>> 18-21 bp in
>>> length and have no more than 4 * in them.
>>>
>>> So far, I've got this:
>>>
>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>> print "$1\n";
>>> }
>>>
>>> hoping to match all fragments 18-21 characters in length. However  
>>> even that
>>> doesn't work as it has essentially chunked it into 21 char blocks,  
>>> rather
>>> than what I hoped for of
>>> 0-18
>>> 0-19
>>> 0-20
>>> 0-21
>>> 1-19
>>> 1-20
>>> 1-21
>>> 1-22
>>>
>>> etc.
>>>
>>> Can anyone let me know if this is already possible in BioPerl, or  
>>> how one
>>> would go about it with regex. Sadly I'm fairly new to perl and  
>>> getting to
>>> grips with BioPerl, so please treat me gently :).
>>>
>>> Many thanks,
>>>
>>> Ben
>>
>> There is a trick to this which is discussed more extensively in  
>> 'Mastering Regular Expressions'.  Essentially you have to embed code  
>> into the regex and trick the parser into backtracking using a  
>> negative lookahead.  The match itself fails (i.e. no match is  
>> returned), but the embedded code is executed for each match attempt,
>>
>> The following script is a slight modification of one I used which  
>> checks the consensus string from the input alignment (in aligned  
>> FASTA format here), extracts the alignment slice using that match,  
>> then spit the alignment out to STDOUT in clustalw format.  This  
>> should work for perl 5.8 and up, but it's only been tested on perl  
>> 5.10.  You should be able to use this to fit what you want.
>>
>> my $in = Bio::AlignIO->new(-file => $file,
>>                           -format => 'fasta');
>> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>>                           -format => 'clustalw');
>>
>> while (my $aln = $in->next_aln) {
>>    my $c = $aln->consensus_string(100);
>>    my @matches;
>>    $c =~ m/
>>        ([GTAC?]{18,21})
>>        (?{my $match = check_match($1);
>>           push @matches, [$match,
>>                           pos(),
>>                           length($match)]
>>              if defined $match;})
>>        (?!)
>>        /xig;
>>    for my $match (@matches) {
>>        my ($hit, $st, $end) = ($match->[0],
>>                                $match->[1] - $match->[2] + 1,
>>                                $match->[1]);
>>        my $newaln = $aln->slice($st, $end);
>>        $out->write_aln($newaln);
>>    }
>> }
>>
>> sub check_match {
>>    my $match = shift;
>>    return unless $match;
>>    my $ct = $match =~ tr/?/?/;
>>    return $match if $ct <= 4;
>> }
>>
>>
>> chris
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@...
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>--
>===========================================================
>: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>===========================================================
>
>
>
_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Re: Finding possible primers regex

by Hilmar Lapp :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Actually, now that you ask I'm wondering whether one wouldn't  
sometimes want to retain the relationship between the match and the  
resulting spliced alignment? If so, neither AlignIO nor array would  
accomplish that, right?

Other than that I myself don't have a strong preference either way. I  
suppose AlignIO stream is somewhat more extensible, since as you say  
it could be coupled to a file if the resulting set of alignments is  
really large.

        -hilmar

On Aug 11, 2008, at 3:50 PM, Christopher Fields wrote:

> When I can I could try generating a method which accepts a regex/
> Bio::Tools::SeqPattern and returns an AlignIO stream or array of  
> SimpleAlign instances (the former could be attached to a temp file  
> for iteration).  Any preference?
>
> chris
>
> ---- Original message ----
>> Date: Sat, 9 Aug 2008 12:07:30 -0400
>> From: Hilmar Lapp <hlapp@...>
>> Subject: Re: [Bioperl-l] Finding possible primers regex
>> To: Chris Fields <cjfields@...>
>> Cc: Benbo <btemperton@...>, Bioperl-l@...
>>
>> This looks like a neat trick. Do you think it's worth including as a
>> SimpleAlign method (obviously w/o the printing to STDOUT)? I can
>> imagine that a lot of people might appreciate it.
>>
>> -hilmar
>>
>> On Aug 4, 2008, at 12:08 AM, Chris Fields wrote:
>>
>>> On Aug 2, 2008, at 3:05 PM, Benbo wrote:
>>>
>>>>
>>>> Hi there,
>>>> I'm trying to write a perl script to scan an aligned multiple entry
>>>> fasta
>>>> file and find possible primers. So far I've produced a string which
>>>> contains
>>>> bases which match all sequences and * where they don't match e.g.
>>>> 1) TTAGCCTAA
>>>> 2) TTAGCAGAA
>>>> 3) TTACCCTAA
>>>>
>>>> would give TTA*C**AA.
>>>>
>>>> I want to parse this string and pull out all sequences which are
>>>> 18-21 bp in
>>>> length and have no more than 4 * in them.
>>>>
>>>> So far, I've got this:
>>>>
>>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>>> print "$1\n";
>>>> }
>>>>
>>>> hoping to match all fragments 18-21 characters in length. However
>>>> even that
>>>> doesn't work as it has essentially chunked it into 21 char blocks,
>>>> rather
>>>> than what I hoped for of
>>>> 0-18
>>>> 0-19
>>>> 0-20
>>>> 0-21
>>>> 1-19
>>>> 1-20
>>>> 1-21
>>>> 1-22
>>>>
>>>> etc.
>>>>
>>>> Can anyone let me know if this is already possible in BioPerl, or
>>>> how one
>>>> would go about it with regex. Sadly I'm fairly new to perl and
>>>> getting to
>>>> grips with BioPerl, so please treat me gently :).
>>>>
>>>> Many thanks,
>>>>
>>>> Ben
>>>
>>> There is a trick to this which is discussed more extensively in
>>> 'Mastering Regular Expressions'.  Essentially you have to embed code
>>> into the regex and trick the parser into backtracking using a
>>> negative lookahead.  The match itself fails (i.e. no match is
>>> returned), but the embedded code is executed for each match attempt,
>>>
>>> The following script is a slight modification of one I used which
>>> checks the consensus string from the input alignment (in aligned
>>> FASTA format here), extracts the alignment slice using that match,
>>> then spit the alignment out to STDOUT in clustalw format.  This
>>> should work for perl 5.8 and up, but it's only been tested on perl
>>> 5.10.  You should be able to use this to fit what you want.
>>>
>>> my $in = Bio::AlignIO->new(-file => $file,
>>>                          -format => 'fasta');
>>> my $out = Bio::AlignIO->new(-fh => \*STDOUT,
>>>                          -format => 'clustalw');
>>>
>>> while (my $aln = $in->next_aln) {
>>>   my $c = $aln->consensus_string(100);
>>>   my @matches;
>>>   $c =~ m/
>>>       ([GTAC?]{18,21})
>>>       (?{my $match = check_match($1);
>>>          push @matches, [$match,
>>>                          pos(),
>>>                          length($match)]
>>>             if defined $match;})
>>>       (?!)
>>>       /xig;
>>>   for my $match (@matches) {
>>>       my ($hit, $st, $end) = ($match->[0],
>>>                               $match->[1] - $match->[2] + 1,
>>>                               $match->[1]);
>>>       my $newaln = $aln->slice($st, $end);
>>>       $out->write_aln($newaln);
>>>   }
>>> }
>>>
>>> sub check_match {
>>>   my $match = shift;
>>>   return unless $match;
>>>   my $ct = $match =~ tr/?/?/;
>>>   return $match if $ct <= 4;
>>> }
>>>
>>>
>>> chris
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@...
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> --
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>

--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================



_______________________________________________
Bioperl-l mailing list
Bioperl-l@...
http://lists.open-bio.org/mailman/listinfo/bioperl-l