preg_replace with UTF-8

by SleePy-4

I am having a minor issue with preg_replace not working as expected
on UTF-8 strings. So far I have found that \w does not seem to match
non-ASCII characters in UTF-8 strings, even with the u modifier.

This is my test PHP file:
<?php
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

$data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 Test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>


I would expect it to be:
Data before: ooooooooooooooooooooooo
Data After: oooooo < >oooooo < >oooooo < >ooooo
---
Data before: ффффффффффффффффффффффф
Data After: фффффф < >фффффф < >фффффф < >ффффф

But what I get is:
Data before: ooooooooooooooooooooooo
Data After: oooooo < >oooooo < >oooooo < >ooooo
---
Data before: ффффффффффффффффффффффф
Data After: ффффффффффффффффффффффф

Did I go about this the wrong way, or is this a bug in PHP itself?
I tested this in PHP 5.3, 5.2.9 and 6.0 (a snapshot from a couple of
weeks ago) and got the same results.


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Re: preg_replace with UTF-8

by Georgi Alexandrov

On Mon, Jul 6, 2009 at 4:54 AM, SleePy <sleepingkiller@...> wrote:

> I seem to be having a minor issue with preg_replace not working as expected
> when using UTF-8 strings. So far I have found out that \w doesn't seem to be
> detecting UTF-8 strings.

Have you tried mb_ereg_replace?
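
For reference, a minimal sketch of what that suggestion might look like. It assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8; the variable names are only illustrative. Note that mb_ereg_replace takes the pattern without delimiters and uses \1 rather than $1 for the backreference.

```php
<?php
// Assumption: the mbstring extension is available.
// Tell the multibyte regex engine that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

// Same 23-character Cyrillic test string as in the original post.
$data = str_repeat('ф', 23);

// No pattern delimiters here, and the backreference is written \1.
$result = mb_ereg_replace('([\w.]{6})', '\1 < >', $data);

echo 'Data After: ', $result, "\n";
```

Whether this is preferable to preg_replace depends on the setup, since it ties the code to mbstring.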



Re: preg_replace with UTF-8

by Andrew Ballard

On Sun, Jul 5, 2009 at 9:54 PM, SleePy<sleepingkiller@...> wrote:

> I seem to be having a minor issue with preg_replace not working as expected
> when using UTF-8 strings. So far I have found out that \w doesn't seem to be
> detecting UTF-8 strings.

From the manual on PCRE syntax:
"A 'word' character is any letter or digit or the underscore
character, that is, any character which can be part of a Perl 'word'.
The definition of letters and digits is controlled by PCRE's character
tables, and may vary if locale-specific matching is taking place. For
example, in the 'fr' (French) locale, some character codes greater
than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match."
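
To see how a particular PHP build classifies characters, a small check along these lines may help. It is only a sketch: the sample characters are arbitrary, and what \w reports for 'ф' depends on whether the PCRE library in use applies Unicode-aware character tables, so no output is shown for it. The \p{L} property, by contrast, is defined to match any Unicode letter when the u modifier is set.

```php
<?php
// Compare how \w and \p{L} classify single characters under the u modifier.
// The sample characters are arbitrary choices for illustration.
$samples = ['o', 'ф', '_', '5'];

foreach ($samples as $char) {
    $isWord   = preg_match('/^\w$/u', $char) === 1;     // result for 'ф' is build-dependent
    $isLetter = preg_match('/^\p{L}$/u', $char) === 1;  // true for any Unicode letter
    printf(
        "%s  \\w: %s  \\p{L}: %s\n",
        $char,
        var_export($isWord, true),
        var_export($isLetter, true)
    );
}
```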

I'm not sure if this is exactly what you want (or if it might let more
things slip past than you intend), but try this:

<?php
// ASCII test
$data = 'ooooooooooooooooooooooo';
echo 'Data before: ', $data, '<br />';

// \pL matches any Unicode letter; the u modifier makes PCRE treat the
// pattern and the subject as UTF-8. A dot needs no escape inside [].
$data = preg_replace('~([\w\pL.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

// UTF-8 test
$data = 'ффффффффффффффффффффффф';
echo '<hr />Data before: ', $data, '<br />';

$data = preg_replace('~([\w\pL.]{6})~u', '$1 < >', $data);
echo 'Data After: ', $data;

?>
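
If relying on \w feels too dependent on the PCRE build, one alternative is to spell out the Unicode classes. The following is a rough sketch rather than a drop-in fix: break_long_words is a hypothetical helper name, and the choice of \p{L}, \p{N}, underscore and dot as "word" characters is an assumption about what should count.

```php
<?php
// Hypothetical helper (not from this thread): insert a marker after every
// run of $size word-like characters, naming the Unicode classes explicitly.
function break_long_words(string $text, int $size = 6, string $marker = ' < >'): string
{
    // \p{L} = any letter, \p{N} = any number; the u modifier makes PCRE
    // treat both the pattern and the subject as UTF-8.
    $pattern = '~([\p{L}\p{N}_.]{' . $size . '})~u';
    $result = preg_replace($pattern, '$1' . $marker, $text);

    // preg_replace returns null on error (for example, invalid UTF-8 input),
    // in which case the text is handed back unchanged.
    return $result === null ? $text : $result;
}

echo break_long_words(str_repeat('ф', 23)), "\n";
// prints: фффффф < >фффффф < >фффффф < >ффффф
```

The marker is concatenated into the replacement string as-is, so a marker containing $ or \ followed by digits would need escaping first.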

Andrew



Re: preg_replace with UTF-8

by SleePy-4

Thank you, Andrew.
That does break up the UTF-8 strings as intended, so I will play with it from there.

On Jul 6, 2009, at 8:50 AM, Andrew Ballard wrote:

> I'm not sure if this is exactly what you want (or if it might let more
> things slip past than you intend), but try this:

