Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

Hello:

I'm a newbie to the Java XML processing world, but an oldie at SGML/XML processing (mostly using OmniMark for the last 15 years).

I'm working on a project in which many thousands of documents (average size 14K) must be parsed together, both to validate them and to extract certain metadata from all documents to create a single metadata output file. Every document contains a <!DOCTYPE> statement with a SYSTEM identifier that names the DTD file. The DTD is fairly large and complex. In my experience, the time that it takes to parse the DTD for a document can be a significant percentage of the processing time for the document, especially when the document is relatively small.

I have used the WoodStoX "ValidateWithDtd.java" sample program with WoodStoX 4.0.2 as a model for parsing with validation. If I remove the <!DOCTYPE> statement from my small set of test documents, all documents are processed fine. But if the <!DOCTYPE> statement is present, then I get the exception:

    Exception in thread "main" com.ctc.wstx.exc.WstxParsingException: (was java.io.FileNotFoundException)
    [DTD name] (The system cannot find the file specified) at
    [row,col,system-id]: [2,120,"[input file name]

I don't see any way to configure the parser to ignore the <!DOCTYPE> statement. Am I missing something, or is it not possible to use a preloaded DTD if a <!DOCTYPE> statement is present in the input file?

Thanks in advance for any suggestions.

Jack......

-----------------------------
Jack S. Rugh
Retrieval Systems Corporation
2071 Chain Bridge Road
Suite 510
Vienna, VA 22182
703-749-0012 ext. 335
http://retrievalsystems.com
-----------------------------

---------------------------------------------------------------------
To unsubscribe from this list, please visit: http://xircles.codehaus.org/manage_email
Re: Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

On Fri, Apr 3, 2009 at 6:56 PM, Jack Rugh <jrugh@...> wrote:
> [...]
> I don't see any way to configure the parser to ignore the <!DOCTYPE>
> statement. Am I missing something, or is it not possible to use a
> preloaded DTD if a <!DOCTYPE> statement is present in the input file?

It is possible. You can ignore the DOCTYPE altogether (set property XMLInputFactory.SUPPORT_DTD to false), redirect access to the external DTD subset, or feed an alternate DTD schema.

I don't have code at hand right now, but if you have access to the Woodstox Subversion repository, the ValidateDTD tool (under wstx-tools, a sibling of the main 'wstx' in svn) shows how to override it.

But let's see... looking at the Woodstox javadocs (http://woodstox.codehaus.org/4.0.3/javadoc/index.html), there are actually a couple of ways.
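As a minimal sketch of the "ignore DOCTYPE altogether" option, the example below sets XMLInputFactory.SUPPORT_DTD to false and reads a document whose DOCTYPE names a DTD file that does not exist. It uses only the standard StAX API from the JDK, so it should behave the same with whichever StAX implementation is on the classpath; the class name, method name, and sample document are made up for illustration. As far as I know, turning DTD support off also means the document is not validated against the DTD, so this fits a metadata-extraction pass rather than a validating one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of Woodstox or its sample programs.
class IgnoreDoctype {

    /** Returns the local names of all start elements, with DTD processing switched off. */
    static List<String> elementNames(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // With DTD support disabled, the external subset named in DOCTYPE is not loaded.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);

        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        names.add(reader.getLocalName());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // The DTD file named here deliberately does not exist.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE doc SYSTEM \"no-such-file.dtd\">\n"
                + "<doc><title>Sample</title></doc>";
        System.out.println(elementNames(xml));
    }
}
```

Documents that rely on entities declared in the DTD would probably fail or lose content with this setting, so it is worth checking against real input first.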
First, you can set the property WstxInputProperties.P_DTD_RESOLVER (via XMLInputFactory.setProperty(), or the Woodstox stream reader's setProperty(); it just must be done before the DTD event is processed) to define how the DOCTYPE reference is resolved (and you can feed an alternate source). This is similar to how an entity resolver works with the SAX API.

Or, you can call ReaderConfig.setDTDOverride(schema); to load the schema, you need to get a DTD instance read (via XMLValidationSchemaFactory). This is what ValidateXML uses with command line switches, I think.

... all of which should be documented in a better way. :-)

Please let me know if it's hard to figure out the details of the above. Someone on this list should be able to help more.

Hope this helps,

-+ Tatu +-
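The resolver option above is compared to a SAX entity resolver. The sketch below shows that SAX analogue only, using the parser bundled with the JDK; it is not Woodstox's P_DTD_RESOLVER API. The DTD text is held in memory and handed back whenever the parser asks for an external entity, so the file named in the DOCTYPE is never opened. The class name, method name, and the tiny DTD are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class illustrating the SAX entity-resolver pattern.
class PreloadedDtdSax {

    /** Validates xml against dtdText and returns the validation error messages. */
    static List<String> validate(String xml, String dtdText) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // DTD validation
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Answer every external entity request with the in-memory DTD,
            // whatever system id the DOCTYPE declares. A real resolver would
            // probably check publicId/systemId before substituting.
            reader.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader(dtdText)));

            // Collect recoverable (validity) errors instead of ignoring them.
            reader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });

            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dtd = "<!ELEMENT doc (title)>\n<!ELEMENT title (#PCDATA)>";
        String doctype = "<!DOCTYPE doc SYSTEM \"no-such-file.dtd\">";
        System.out.println(validate(doctype + "<doc><title>Sample</title></doc>", dtd));
        System.out.println(validate(doctype + "<doc><para>Sample</para></doc>", dtd));
    }
}
```

Note that a resolver only changes where the DTD text comes from; the parser still reads the DTD for every document. Reusing an already parsed DTD across documents is what the override approach described above is for.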
RE: Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

Hi Tatu:
Thanks for the quick response. I first implemented the technique from the ValidateDTD tool. It worked fine. Then, when I saw that the setFeature method in XMLStreamReader2 was deprecated, I switched to the "setProperty" method in XMLInputFactory2. That also worked. So, I will stay with it.

Jack.......
Re: Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

On Tue, Apr 7, 2009 at 7:58 AM, Jack Rugh <jrugh@...> wrote:
> Hi Tatu:
>
> Thanks for the quick response. I first implemented the technique from
> the ValidateDTD tool. It worked fine. Then, when I saw that the
> setFeature method in XMLStreamReader2 was deprecated, I switched to the
> "setProperty" method in XMLInputFactory2. That also worked. So, I will
> stay with it.

Great. Good to hear it worked.

And yes, setFeature() was something I thought would make sense to add (with 2.0, I think), but in the end it wasn't really generally needed, so I decided to just use setProperty()/getProperty() for generic configuration. ReaderConfig and WriterConfig also have type-safe, Woodstox-specific access methods.

-+ Tatu +-