[tagsoup-friends] How to parse XML document with default namespace with JDOM XPath
Hi All,
I am having difficulty parsing a namespaced HTML document using Saxon and the TagSoup parser. The relevant content of the document is as follows:
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    ……..
    </head>
    <body>
    <div id="container">
    <div id="content">
    <table class="sresults">
    <tr>
    <td> <a href="http://www.abc.com/areas" title=" Hollywood , CA "> hollywood </a> </td>
    <td> <a href="http://www.abc.com/areas" title=" San Jose , CA "> san jose </a> </td>
    <td> <a href="http://www.abc.com/areas" title=" San Francisco , CA "> san francisco </a> </td>
    <td> <a href="http://www.abc.com/areas" title=" San Diego , CA "> San diego </a> </td>
    </tr>
    ……….
    </body>
    </html>
Below are the relevant code snippets, which illustrate how I have attempted to retrieve the contents (the value of <a>):
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.*;
    import org.jdom.*;
    import org.jdom.input.SAXBuilder;
    import org.jdom.xpath.*;
    import org.saxpath.*;
    import org.ccil.cowan.tagsoup.Parser;
    ( 1 ) frInHtml = new FileReader("C:\\Tmp\\ABC.html");
    ( 2 ) brInHtml = new BufferedReader(frInHtml);
    ( 3 ) // SAXBuilder saxBuilder = new SAXBuilder("org.apache.xerces.parsers.SAXParser");
    ( 4 ) SAXBuilder saxBuilder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
    ( 5 ) org.jdom.Document jdomDocument = saxBuilder.build(brInHtml);
    ( 6 ) XPath xpath = XPath.newInstance("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']/ns:table[@class='sresults']/ns:tr/ns:td/ns:a");
    ( 7 ) xpath.addNamespace("ns", "http://www.w3.org/1999/xhtml");
    ( 8 ) java.util.List list = (java.util.List) (xpath.selectNodes(jdomDocument));
    ( 9 ) Iterator iterator = list.iterator();
    ( 10 ) while (iterator.hasNext())
    ( 11 ) {
    ( 12 ) Object object = iterator.next();
    ( 13 ) // if (object instanceof Element)
    ( 14 ) // System.out.println(((Element) object).getTextNormalize());
    ( 15 ) if (object instanceof Content)
    ( 16 ) System.out.println(((Content) object).getValue());
    }
    ….
This program works on the same document when the default namespace is removed, in which case the "ns" prefix in the XPath statements (lines 6-7) is not needed either. In the past I also used "org.apache.xerces.parsers.SAXParser" to successfully retrieve the content of <a> from the same document without the default namespace.
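To make the prefix question concrete, here is a minimal sketch of the same lookup using only the JDK's built-in DOM parser and javax.xml.xpath, not JDOM or TagSoup (that substitution is an assumption made so the example runs without extra jars). The class name NamespaceXPathDemo and the cut-down SAMPLE document are hypothetical. It shows that, once the default namespace is present, an unprefixed path selects nothing, while a path whose "ns" prefix is bound to the XHTML namespace URI selects the <a> elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical cut-down stand-in for ABC.html (no DOCTYPE, so nothing is fetched).
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\"><body>"
        + "<div id=\"container\"><div id=\"content\"><table class=\"sresults\"><tr>"
        + "<td><a href=\"http://www.abc.com/areas\"> hollywood </a></td>"
        + "<td><a href=\"http://www.abc.com/areas\"> san jose </a></td>"
        + "</tr></table></div></div></body></html>";

    // Evaluates an XPath 1.0 expression against SAMPLE and returns the trimmed text of each hit.
    static List<String> select(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this the DOM carries no namespace information
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the prefix "ns" to the XHTML namespace; the prefix name itself is arbitrary.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "ns".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { throw new UnsupportedOperationException(); }
                public Iterator<String> getPrefixes(String uri) { throw new UnsupportedOperationException(); }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names only match elements in no namespace, so this finds nothing.
        System.out.println(select("/html/body//a"));
        // Prefixed names match elements in the bound namespace.
        System.out.println(select("/ns:html/ns:body/ns:div[@id='container']/ns:div[@id='content']"
            + "/ns:table[@class='sresults']/ns:tr/ns:td/ns:a"));
        // Namespace-agnostic alternative: compare local names only.
        System.out.println(select("//*[local-name()='a']"));
    }
}
```

In XPath 1.0 an unprefixed element name always means "no namespace", which is why the default namespace in the document forces a prefix in the expression. The local-name() form avoids the prefix at the cost of a less precise path.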
I would like to achieve the following objectives if possible:
( i ) Exclude the DTD and namespace in order to simplify the parsing process. How could this be done?
( ii ) If this is not possible, how should the namespace be included in the XPath statements (lines 6-7) so that the value of <a> is picked up correctly?
( iii ) Would changing from "org.apache.xerces.parsers.SAXParser" to "org.ccil.cowan.tagsoup.Parser" make any difference as far as using XPath is concerned?
( iv ) Failing to exclude the DTD, how can the lookup of a PUBLIC DTD be changed to a local SYSTEM one, so that a local DTD is used for reference?
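For objectives (i) and (iv), one commonly used technique with SAX-based parsers is an org.xml.sax.EntityResolver that intercepts the DTD lookup. The sketch below is only an illustration under stated assumptions: it uses the JDK's built-in DOM parser, not JDOM or TagSoup, and the class name DtdSkippingDemo and the SAMPLE document are hypothetical. The resolver hands the parser an empty DTD so that nothing is fetched from www.w3.org. Returning an InputSource that points at a local copy of the DTD instead would be the variant for (iv).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class DtdSkippingDemo {
    // Hypothetical cut-down stand-in for ABC.html, including the DOCTYPE from the post.
    static final String SAMPLE =
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head>"
        + "<body><a href=\"http://www.abc.com/areas\"> hollywood </a></body></html>";

    // Parses SAMPLE without fetching the external DTD and returns the root element's local name.
    static String parseRootName() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setEntityResolver(new EntityResolver() {
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Substitute an empty DTD for whatever external entity was requested.
                    // To use a local copy instead, return an InputSource for that file here.
                    return new InputSource(new StringReader(""));
                }
            });
            Document doc = builder.parse(new InputSource(new StringReader(SAMPLE)));
            return doc.getDocumentElement().getLocalName();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseRootName());
    }
}
```

One trade-off of the empty-DTD approach is that entities defined only in the DTD, such as &nbsp;, can no longer be resolved, so a document that uses them would fail to parse. Whether the same resolver can be attached to the JDOM builder in the post should be checked against the JDOM 1.1 API documentation.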
I am running JDK 1.6.0_06, NetBeans 6.1, JDOM 1.1, Saxon 6.5.5 and TagSoup 1.2 on Windows XP.
Any assistance would be appreciated.
Thanks in advance,
Jack