How to exclude DTD and namespace during Sax (TagSoup) parsing in JDOM
Hi All,
I am having difficulty using Saxon and the TagSoup parser to parse a namespaced HTML document. The relevant content of this document is as follows:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    ……..
    </head>
    <body>
    <div id="container">
    <div id="content">
    <table class="sresults">
    <tr>
    <td> <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a> </td>
    <td> <a href="http://www.abc.com/areas" title=" San Jose , CA "> san jose </a> </td>
    <td> <a href="http://www.abc.com/areas" title=" San Francisco , CA "> san francisco </a> </td>
    <td> <a href="http://www.abc.com/areas" title=" San Diego , CA "> San diego </a> </td>
    </tr>
    ……….
    </body>
    </html>
Below are the relevant code snippets, which illustrate how I have attempted to retrieve the contents (the value of <a>):
    import java.io.*;
    import java.util.*;
    import org.jdom.*;
    import org.jdom.input.SAXBuilder;
    import org.jdom.xpath.*;
    import org.saxpath.*;
    import org.ccil.cowan.tagsoup.Parser;

    // Read the HTML file and let JDOM use TagSoup as its SAX parser.
    FileReader frInHtml = new FileReader("C:\\Tmp\\ABC.html");
    BufferedReader brInHtml = new BufferedReader(frInHtml);
    SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);

    // Every step uses the "ns" prefix, which is bound to the XHTML namespace below.
    XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
    xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
    java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
    Iterator iterator = list.iterator();
    while (iterator.hasNext())
    {
        Object object = iterator.next();
        // if (object instanceof Element)
        //     System.out.println(((Element) object).getTextNormalize());
        if (object instanceof Content)
            System.out.println(((Content) object).getValue());
    }
    ….

I would like to achieve the following objectives if possible:

( i ) Exclude the DTD and namespace in order to simplify the parsing process. How could this be done?

( ii ) Failing to exclude the DTD, how can I change the lookup of a PUBLIC DTD to a local SYSTEM one and include a local DTD for reference?

I am running JDK 1.6.0_06, Netbeans 6.1, JDOM 1.1, Saxon6-5-5, Tagsoup 1.2 on Windows XP platform.

Any assistance would be appreciated.

Thanks in advance,

Jack
_______________________________________________ To control your jdom-interest membership: http://www.jdom.org/mailman/options/jdom-interest/youraddr@...
Re: How to exclude DTD and namespace during Sax (TagSoup) parsing in JDOM
Hi Jack,
Some answers below,
Laurent

Jack Bush wrote:
> I am having difficulty parsing using Saxon and TagSoup parser on a
> namespace html document. The relevant content of this document are as
> follows:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
> </head>
> <body>
> <div id="container">
> <div id="content">
> <table class="sresults">
> <tr>
> <td>
> <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a>
> </td>
> Below is the relevant code snippets illustrates how I have attempted to
> retrieve the contents (value of <a>):
> import java.util.*;
> import org.jdom.*;
> import org.jdom.xpath.*;
> import org.saxpath.*;
> import org.ccil.cowan.tagsoup.Parser;
>
> frInHtml = new FileReader("C:\\Tmp\\ABC.html");
> brInHtml = new BufferedReader(frInHtml);
> SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
> org.jdom.Document jdomDocument = saxbuilder.build(brInHtml);
> XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
> xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
> I would like to achieve the following objectives if possible:
>
> ( i ) Exclude DTD and namespace in order to simplifying the parsing
> process. How this could be done?

You can't exclude the DTD. It is required in case the HTML document uses entities not defined by XML, such as &agrave; (à), etc.

I don't think Tagsoup will let you remove the namespace. It does the best possible thing by making it the default namespace. But XPath does not support any default (i.e. unprefixed) namespace, so you have to use a prefix.
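To make the prefix point concrete, here is a minimal sketch. It deliberately uses only the JDK's built-in DOM parser and javax.xml.xpath instead of JDOM and TagSoup, so it is an analogy to the setup in this thread, not a drop-in fix. The class name XhtmlXPathDemo, the SAMPLE string and the select helper are made up for the example. It shows that an unprefixed path matches nothing in a namespaced document, that a bound prefix works, and that a local-name() test is one way to ignore the namespace in the expression itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlXPathDemo {
    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical, cut-down version of the document from the question.
    public static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div id=\"container\"><div id=\"content\"><table class=\"sresults\"><tr>"
        + "<td><a href=\"http://www.abc.com/areas\"> hollywood </a></td>"
        + "<td><a href=\"http://www.abc.com/areas\"> san jose </a></td>"
        + "</tr></table></div></div></body></html>";

    // Evaluates expr against xml and returns the trimmed text of each selected node.
    public static List<String> select(String xml, String expr) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // keep the XHTML namespace on the elements
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "ns" to the XHTML namespace, like xpath.addNamespace(...) in JDOM.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Prefixed steps: match the namespaced elements.
        System.out.println(select(SAMPLE, "/ns:html/ns:body//ns:td/ns:a"));
        // Unprefixed steps: XPath 1.0 looks for elements in no namespace.
        System.out.println(select(SAMPLE, "/html/body//td/a"));
        // local-name() ignores the namespace, at the cost of a longer expression.
        System.out.println(select(SAMPLE, "//*[local-name()='a']"));
    }
}
```

With JDOM the equivalent of the NamespaceContext above is the xpath.addNamespace("ns", ...) call already present in the original code, so the prefixed expression there is the right shape.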
> ( ii ) Failing to exclude DTD, how to change the lookup of a PUBLIC DTD
> to a local SYSTEM one and include a local DTD for reference?

You could try attaching a SAX EntityResolver to SAXBuilder, which will pass it on to Tagsoup. Tagsoup will use it to resolve the DOCTYPE.

> I am running JDK 1.6.0_06, Netbeans 6.1, JDOM 1.1, Saxon6-5-5, Tagsoup
> 1.2 on Windows XP platform.
>
> Any assistance would be appreciated.
>
> Thanks in advance,
>
> Jack
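A rough sketch of the EntityResolver idea follows. It again uses only the JDK's own parser so that it runs without JDOM or TagSoup on the classpath. The class name LocalDtdResolverDemo and its helpers are invented for the example, and the local DTD path in the comment is hypothetical. The resolver intercepts the XHTML public identifier and hands back an empty DTD, so the parser never fetches the remote file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalDtdResolverDemo {
    public static final String XHTML_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";

    // Hypothetical minimal document carrying the DOCTYPE from the question.
    public static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head><body/></html>";

    // Returns a resolver that records each public ID it is asked about and
    // answers the XHTML one locally instead of going to the network.
    public static EntityResolver offlineResolver(final List<String> seenPublicIds) {
        return new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                seenPublicIds.add(publicId);
                if (XHTML_PUBLIC_ID.equals(publicId)) {
                    // Empty DTD: nothing is downloaded. To use a local copy instead,
                    // return e.g. new InputSource("file:///C:/Tmp/xhtml1-transitional.dtd")
                    // (hypothetical path).
                    return new InputSource(new StringReader(""));
                }
                return null; // fall back to the parser's default lookup
            }
        };
    }

    // Parses xml with the given resolver and returns the root element's local name.
    public static String parseRootName(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getLocalName();
    }

    public static void main(String[] args) throws Exception {
        List<String> seen = new ArrayList<String>();
        System.out.println(parseRootName(SAMPLE, offlineResolver(seen)));
        System.out.println(seen);
    }
}
```

In the JDOM code from the question, the same resolver object would be attached with saxBuilder.setEntityResolver(...) before calling build. Whether TagSoup actually consults it for the DOCTYPE is worth checking in your own setup. Note also that an empty DTD declares no entities, so with a plain XML parser references such as &agrave; would then fail to resolve; a local copy of the real DTD avoids that.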