SimpleXML - UTF8


SimpleXML - UTF8

by John Campbell-6

I hope there is an easy answer to this:

I am using a remote XML service that, about 1 time in 100, returns XML
with invalid UTF-8 bytes.  I don't have any control over the remote
service, but SimpleXML pukes when I pass malformed UTF-8 to it.  Does
anyone know of a simple way to clean up bad UTF-8 bytes, e.g. replace
the invalid bytes with a '?'?

Rgds,
John Campbell
_______________________________________________
New York PHP Users Group Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk

http://www.nyphp.org/Show-Participation

Re: SimpleXML - UTF8

by dorgan

A simple str_replace() can replace the invalid characters, but you have to know what they are first.
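
To find out which bytes are the offenders, one option is to measure the longest well-formed UTF-8 prefix of the response; the first invalid byte sits right after it. The sketch below is only an illustration: firstInvalidUtf8Offset() is a hypothetical helper name that does not come from this thread, and it assumes PHP 7.1 or later. The pattern is a plain byte-level regular expression (no /u flag), so it needs neither mbstring nor iconv.

```php
<?php
// Hypothetical helper (not from the thread): returns the byte offset of the
// first byte that is not part of a well-formed UTF-8 sequence, or null when
// the whole string is valid UTF-8.
function firstInvalidUtf8Offset(string $bytes): ?int
{
    // Byte-level pattern for one well-formed UTF-8 character. It rejects
    // overlong forms, surrogates and code points above U+10FFFF.
    $char = '[\x00-\x7F]'
          . '|[\xC2-\xDF][\x80-\xBF]'
          . '|\xE0[\xA0-\xBF][\x80-\xBF]'
          . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
          . '|\xED[\x80-\x9F][\x80-\xBF]'
          . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
          . '|[\xF1-\xF3][\x80-\xBF]{3}'
          . '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

    // Possessive repetition anchored at the start: consumes the longest
    // valid prefix and stops at the first byte that does not fit.
    if (preg_match('/^(?:' . $char . ')*+/', $bytes, $m) !== 1) {
        throw new RuntimeException('preg_match failed, error code ' . preg_last_error());
    }

    $validLength = strlen($m[0]);

    return $validLength === strlen($bytes) ? null : $validLength;
}
```

With the offset in hand, something like bin2hex(substr($xml, $offset, 4)) shows the raw bytes, which is what a str_replace() call would need to target.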


----- Original Message -----
From: "John Campbell" <jcampbell1@...>
To: "NYPHP Talk" <talk@...>
Sent: Saturday, October 17, 2009 4:59:53 PM
Subject: [nyphp-talk] SimpleXML - UTF8

I hope there is an easy answer to this:

I am using a remote XML service that, about 1 time in 100, returns XML
with invalid UTF-8 bytes.  I don't have any control over the remote
service, but SimpleXML pukes when I pass malformed UTF-8 to it.  Does
anyone know of a simple way to clean up bad UTF-8 bytes, e.g. replace
the invalid bytes with a '?'?

Rgds,
John Campbell

Re: SimpleXML - UTF8

by Adrian Noland-2

I have this handy function I pulled from somewhere else. Does it help?

Apologies if the actual characters don't come across in the email.

    /**
     * Scrubs characters for which PHP's get_html_translation_table() has no HTML entity
     * (currently bug #34577 in the bugs.php.net database).
     * $a1 lists characters that commonly appear unescaped in the listing description.
     * $a2 maps each of them to an accepted plain form, the correct HTML entity, or an empty string.
     *
     * @param string $string string to scrub
     * @return string $string clean string
     */
    public static function xmlStringScrub($string) {
        $a1 = array("�","�","�","�", "�","�", "�", "�", "�", "�", "�","�","�","�","�", "�", "�");
        $a2 = array(".","-","&bull;","", "'","'", '"', '"', "-", "-", ",", "^",",","","&euro;", "&reg;", "&trade;");
        $string = htmlentities($string, ENT_QUOTES);
        $string = str_replace($a1, $a2, $string);
        $string = utf8_encode($string);
        return $string;
    }



Re: SimpleXML - UTF8

by Fernando Gabrieli

Hi, I believe mb_convert_encoding (php.net/mb_convert_encoding) will substitute "?" or simply drop a character when it can't convert it. There's also iconv (php.net/iconv).

On Sat, Oct 17, 2009 at 5:59 PM, John Campbell <jcampbell1@...> wrote:
I hope there is an easy answer to this:

I am using a remote XML service that, about 1 time in 100, returns XML
with invalid UTF-8 bytes.  I don't have any control over the remote
service, but SimpleXML pukes when I pass malformed UTF-8 to it.  Does
anyone know of a simple way to clean up bad UTF-8 bytes, e.g. replace
the invalid bytes with a '?'?

Rgds,
John Campbell

Re: SimpleXML - UTF8

by Dan Cech

John Campbell wrote:
> I am using a remote XML service that, about 1 time in 100, returns XML
> with invalid UTF-8 bytes.  I don't have any control over the remote
> service, but SimpleXML pukes when I pass malformed UTF-8 to it.  Does
> anyone know of a simple way to clean up bad UTF-8 bytes, e.g. replace
> the invalid bytes with a '?'?

Try:

$text = @iconv('UTF-8','UTF-8//TRANSLIT',$text);
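
Because the '@' hides any failure, it can help to wrap the call so that a refused conversion is visible to the caller. A minimal sketch, assuming PHP 7.1 or later; scrubWithIconv() is a hypothetical name, not something from this thread. The PHP manual cautions that //TRANSLIT is system dependent, and in practice //IGNORE varies between iconv implementations too, so the wrapper checks for false and does not assume a cleaned string always comes back.

```php
<?php
// Hypothetical wrapper around the iconv() call suggested above.
// Returns the converted string, or null when iconv refuses the conversion
// (illegal input it will not skip, or a suffix this build does not support).
function scrubWithIconv(string $bytes, string $suffix = '//TRANSLIT'): ?string
{
    // '@' silences the notice or warning iconv raises on failure;
    // the failure is reported through the return value instead.
    $clean = @iconv('UTF-8', 'UTF-8' . $suffix, $bytes);

    return $clean === false ? null : $clean;
}
```

A caller can try scrubWithIconv($xml, '//IGNORE') when the default suffix returns null, and fall back to another approach if both do.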

Dan

Re: SimpleXML - UTF8

by John Campbell-6

On Mon, Oct 19, 2009 at 7:32 AM, Dan Cech <dcech@...> wrote:
> Try:
>
> $text = @iconv('UTF-8','UTF-8//TRANSLIT',$text);

Thanks Dan,

I knew there had to be something simple.

It looks like mb_convert_encoding($txt,'UTF-8','UTF-8') will work
similarly, but just deletes the offending bytes.
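
Whether mbstring drops the offending bytes or substitutes something for them depends on its substitute-character setting, which can be set explicitly with mb_substitute_character(). A rough sketch, assuming the mbstring extension is loaded and PHP 7.0 or later; scrubWithMbstring() is a hypothetical name.

```php
<?php
// Hypothetical helper: re-encodes UTF-8 to UTF-8 so that mbstring handles
// every invalid byte sequence according to $substitute.
//   0x3F   -> replace with '?'
//   'none' -> drop the invalid bytes
function scrubWithMbstring(string $bytes, $substitute = 0x3F): string
{
    // The substitute character is a global mbstring setting, so remember
    // the current value and restore it afterwards.
    $previous = mb_substitute_character();
    mb_substitute_character($substitute);

    try {
        return mb_convert_encoding($bytes, 'UTF-8', 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}
```

The scrubbed string can then be passed to simplexml_load_string() in place of the raw response.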

Regards,
John Campbell