Re: preg_replace with UTF-8

by SleePy-4

Thank you, Andrew.
That does seem to break up the UTF-8 strings, so I will experiment from there.

On Jul 6, 2009, at 8:50 AM, Andrew Ballard wrote:

> On Sun, Jul 5, 2009 at 9:54 PM, SleePy<sleepingkiller@...> wrote:
>> I seem to be having a minor issue with preg_replace not working as
>> expected when using UTF-8 strings. So far I have found out that \w
>> doesn't seem to match the characters in UTF-8 strings.
>>
>> This is my test php file:
>> <?php
>> $data = 'ooooooooooooooooooooooo';
>> echo 'Data before: ', $data, '<br />';
>>
>> $data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
>> echo 'Data After: ', $data;
>>
>> // UTF-8 Test
>> $data = 'ффффффффффффффффффффффф';
>> echo '<hr />Data before: ', $data, '<br />';
>>
>> $data = preg_replace('~([\w\.]{6})~u', '$1 < >', $data);
>> echo 'Data After: ', $data;
>>
>> ?>
>>
>>
>> I would expect it to be:
>> Data before: ooooooooooooooooooooooo
>> Data After: oooooo < >oooooo < >oooooo < >ooooo
>> ---
>> Data before: ффффффффффффффффффффффф
>> Data After: фффффф < >фффффф < >фффффф < >ффффф
>>
>> But what I get is:
>> Data before: ooooooooooooooooooooooo
>> Data After: oooooo < >oooooo < >oooooo < >ooooo
>> ---
>> Data before: ффффффффффффффффффффффф
>> Data After: ффффффффффффффффффффффф
>>
>> Did I go about this the wrong way, or is this a bug in PHP itself?
>> I tested this in PHP 5.3, 5.2.9 and 6.0 (snapshot from a couple of
>> weeks ago) and received the same results.
>>
>>
>> --
>> PHP General Mailing List (http://www.php.net/)
>> To unsubscribe, visit: http://www.php.net/unsub.php
>>
>>
>
> From the manual on PCRE syntax:
> "A 'word' character is any letter or digit or the underscore
> character, that is, any character which can be part of a Perl 'word'.
> The definition of letters and digits is controlled by PCRE's character
> tables, and may vary if locale-specific matching is taking place. For
> example, in the 'fr' (French) locale, some character codes greater
> than 128 are used for accented letters, and these are matched by \w.
>
> These character type sequences can appear both inside and outside
> character classes. They each match one character of the appropriate
> type. If the current matching point is at the end of the subject
> string, all of them fail, since there is no character to match."
>
> I'm not sure if this is exactly what you want (or if it might let more
> things slip past than you intend), but try this:
>
> <?php
> $data = 'ooooooooooooooooooooooo';
> echo 'Data before: ', $data, '<br />';
>
> $data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
> echo 'Data After: ', $data;
>
> // UTF-8 Test
> $data = 'ффффффффффффффффффффффф';
> echo '<hr />Data before: ', $data, '<br />';
>
> $data = preg_replace('~([\w\pL\.]{6})~u', '$1 < >', $data);
> echo 'Data After: ', $data;
>
> ?>
>
> Andrew
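
For anyone reading this later, here is a minimal sketch of the same idea wrapped in a function. It is only an illustration, not something posted in the thread: the function name break_long_words() and its parameters are made up, the type declarations assume PHP 7 or later, and the source file is assumed to be saved as UTF-8. It spells the character class out with Unicode properties (\p{L} for letters, \p{N} for numbers) instead of relying on \w, since whether \w matches non-ASCII letters under the u modifier may depend on the PHP and PCRE versions in use.

```php
<?php
// Hypothetical helper (not from the thread): appends $marker after every
// run of $length letters, digits, underscores, or dots.
function break_long_words(string $text, int $length = 6, string $marker = ' < >'): string
{
    // \p{L} matches any Unicode letter and \p{N} any Unicode number.
    // The u modifier makes PCRE treat the pattern and subject as UTF-8.
    $pattern = '~[\p{L}\p{N}_.]{' . $length . '}~u';

    // A callback is used so that $marker is inserted literally, even if it
    // contains characters such as "$" that preg_replace() would interpret.
    $result = preg_replace_callback(
        $pattern,
        function (array $match) use ($marker) {
            return $match[0] . $marker;
        },
        $text
    );

    // The preg functions return null on failure, for example when the
    // subject is not valid UTF-8.
    if ($result === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return $result;
}

echo break_long_words(str_repeat('o', 23)), PHP_EOL;
echo break_long_words(str_repeat('ф', 23)), PHP_EOL;
```

With the 23-character test strings from the original message, both calls should produce three six-character groups followed by the remaining five characters. Checking for a null result is worth keeping, because a subject that is not valid UTF-8 makes the call fail rather than simply leaving the string unchanged.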


