How to exclude DTD and namespace during Sax (TagSoup) parsing in JDOM


How to exclude DTD and namespace during Sax (TagSoup) parsing in JDOM

by Jack Bush


Hi All,

 

I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of this document is as follows:

 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
……..
</head>
<body>
    <div id="container">
        <div id="content">
            <table class="sresults">
                <tr>
                    <td>
                        <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Jose , CA "> san jose </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Francisco , CA "> san francisco </a>
                    </td>
                    <td>
                        <a href="http://www.abc.com/areas" title=" San Diego , CA "> San diego </a>
                    </td>
                </tr>
……….
</body>
</html>

 

Below are the relevant code snippets illustrating how I have attempted to retrieve the contents (the value of <a>):

 

    import java.io.*;
    import java.util.*;
    import org.jdom.*;
    import org.jdom.input.SAXBuilder;
    import org.jdom.xpath.*;
    import org.saxpath.*;
    import org.ccil.cowan.tagsoup.Parser;

    // Read the HTML file and let TagSoup feed SAX events to JDOM.
    FileReader frInHtml = new FileReader("C:\\Tmp\\ABC.html");
    BufferedReader brInHtml = new BufferedReader(frInHtml);
    SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);

    // Every step uses the "ns" prefix, which is bound to the XHTML namespace below.
    XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
    xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");

    // Print the text value of each matching <a> element.
    java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
    Iterator iterator = list.iterator();
    while (iterator.hasNext())
    {
        Object object = iterator.next();
    //  if (object instanceof Element)
    //      System.out.println(((Element) object).getTextNormalize());
        if (object instanceof Content)
            System.out.println(((Content) object).getValue());
    }

….

I would like to achieve the following objectives if possible:

(i) Exclude the DTD and namespace in order to simplify the parsing process. How could this be done?

(ii) Failing to exclude the DTD, how can I change the lookup of a PUBLIC DTD to a local SYSTEM one and include a local DTD for reference?

I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5 and TagSoup 1.2 on Windows XP.

 Any assistance would be appreciated.

 Thanks in advance,

 Jack



To unsubscribe, send a blank email to tagsoup-friends-unsubscribe@...
_______________________________________________
To control your jdom-interest membership:
http://www.jdom.org/mailman/options/jdom-interest/youraddr@...

Re: How to exclude DTD and namespace during Sax (TagSoup) parsing in JDOM

by Laurent Bihanic


Hi Jack,

Some answers below,

Laurent

Jack Bush wrote:

> I am having difficulty parsing a namespaced HTML document using Saxon and
> the TagSoup parser. The relevant content of this document is as follows:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
> </head>
> <body>
>     <div id="container">
>         <div id="content">
>             <table class="sresults">
>                 <tr>
>                     <td>
>                         <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a>
>                     </td>
...

> Below are the relevant code snippets illustrating how I have attempted to
> retrieve the contents (the value of <a>):
>              import java.util.*;
>              import org.jdom.*;
>              import org.jdom.xpath.*;
>              import org.saxpath.*;
>              import org.ccil.cowan.tagsoup.Parser;
>
>              frInHtml = new FileReader("C:\\Tmp\\ABC.html");
>              brInHtml = new BufferedReader(frInHtml);
>              SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
>              org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
>              XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
>              xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
...

> I would like to achieve the following objectives if possible:
>
> (i) Exclude the DTD and namespace in order to simplify the parsing
> process. How could this be done?

You can't exclude the DTD. It is required in case the HTML document uses
entities not defined by XML, such as &nbsp; or &agrave;.

I don't think TagSoup will let you remove the namespace. It does the best
possible thing by making it the default namespace. But XPath 1.0 has no notion
of a default (i.e. unprefixed) namespace: an unprefixed name in an expression
only matches elements that are in no namespace, so you have to bind a prefix
and use it in every step.
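
To illustrate the prefix requirement, here is a minimal sketch. It uses the JDK's own javax.xml.xpath API on a DOM tree rather than JDOM's XPath class, purely so that it runs without extra jars; the class name XPathNamespaceDemo, the helper select and the inline sample document are made up for this example. The behaviour it shows (unprefixed steps match nothing, prefixed steps match) is standard XPath 1.0 and should carry over to the JDOM expression above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Binds the (arbitrary) prefix "ns" to the XHTML namespace.
    static final NamespaceContext NS_CONTEXT = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "ns" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                    ? Collections.<String>emptyList().iterator()
                    : Collections.singletonList(prefix).iterator();
        }
    };

    // Parses xml namespace-aware and returns the trimmed text of each selected node.
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(NS_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://www.abc.com/areas\"> hollywood </a>"
                + "<a href=\"http://www.abc.com/areas\"> san jose </a>"
                + "</body></html>";
        // Unprefixed steps only match elements in no namespace: nothing is selected.
        System.out.println(select(xml, "/html/body/a"));
        // Prefixed steps match the elements in the XHTML default namespace.
        System.out.println(select(xml, "/ns:html/ns:body/ns:a"));
        // local-name() ignores the namespace, at the cost of a clumsier expression.
        System.out.println(select(xml, "//*[local-name()='a']"));
    }
}
```

The local-name() form is the closest thing to "ignoring" the namespace inside the expression itself, but binding a prefix as in the original code is usually easier to read.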

> (ii) Failing to exclude the DTD, how can I change the lookup of a PUBLIC DTD
> to a local SYSTEM one and include a local DTD for reference?

You could try attaching a SAX EntityResolver to SAXBuilder, which will pass it
on to TagSoup. TagSoup will use it to resolve the DOCTYPE.
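
A rough sketch of such a resolver follows. The class name LocalDtdResolver, the hits counter and the parseRootName helper are invented for this example, and the demo drives the JDK's built-in parser rather than TagSoup so that it runs without extra jars; whether TagSoup actually consults the resolver is not verified here. The resolver maps the XHTML 1.0 Transitional public ID to a local DTD URI if one is given, and otherwise hands back an empty DTD so that nothing is fetched over the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdResolver implements EntityResolver {
    static final String XHTML_TRANSITIONAL_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // URI of a local copy of the DTD, or null to substitute an empty DTD (assumption).
    private final String localDtdUri;
    // Counts how often the XHTML public ID was resolved; only here for the demo.
    int hits = 0;

    LocalDtdResolver(String localDtdUri) {
        this.localDtdUri = localDtdUri;
    }

    public InputSource resolveEntity(String publicId, String systemId) {
        if (!XHTML_TRANSITIONAL_ID.equals(publicId)) {
            return null; // let the parser fall back to its default lookup
        }
        hits++;
        return localDtdUri != null
                ? new InputSource(localDtdUri)
                : new InputSource(new StringReader(""));
    }

    // Parses xml with the given resolver and returns the root element's local name.
    static String parseRootName(String xml, EntityResolver resolver) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setEntityResolver(resolver);
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE html PUBLIC \"" + XHTML_TRANSITIONAL_ID + "\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        LocalDtdResolver resolver = new LocalDtdResolver(null);
        System.out.println(parseRootName(xml, resolver)); // root element name
        System.out.println(resolver.hits > 0);            // true if the resolver was consulted
    }
}
```

With JDOM the same object would be attached via saxBuilder.setEntityResolver(new LocalDtdResolver("file:///C:/Tmp/xhtml1-transitional.dtd")), where the file path is only a placeholder. Note that substituting an empty DTD leaves entities such as &nbsp; undefined for a plain XML parser, so a real local copy is the safer choice when the document uses them.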

> I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5 and TagSoup
> 1.2 on Windows XP.
>
>  Any assistance would be appreciated.
>
>  Thanks in advance,
>
>  Jack