Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

by Jack Rugh

Hello:

I'm a newbie to the Java XML processing world, but an oldie at SGML/XML
processing (mostly using OmniMark for the last 15 years).

I'm working on a project in which many thousands of documents (average
size 14K) must be parsed together to both validate them and to extract
certain metadata from all documents to create a single metadata output
file.  Every document contains a <!DOCTYPE> statement with a SYSTEM
identifier that names the DTD file.  The DTD is fairly large and complex.
In my experience, the time that it takes to parse the DTD for a document
can be a significant percentage of the processing time for the document,
especially when the document is relatively small.

I have used the Woodstox "ValidateWithDtd.java" sample program with
Woodstox 4.0.2 as a model for parsing with validation.  If I remove the
<!DOCTYPE> statement from my small set of test documents, all documents
are processed fine.  But if the <!DOCTYPE> statement is present, then I
get the exception:

Exception in thread "main" com.ctc.wstx.exc.WstxParsingException: (was
java.io.FileNotFoundException)
[DTD name] (The system cannot find the file specified) at
[row,col,system-id]: [2,120,"[input file name]

I don't see any way to configure the parser to ignore the <!DOCTYPE>
statement.  Am I missing something, or is it not possible to use a
preloaded DTD if a <!DOCTYPE> statement is present in the input file?

Thanks in advance for any suggestions.

Jack......
-----------------------------
Jack S. Rugh
Retrieval Systems Corporation
2071 Chain Bridge Road
Suite 510
Vienna, VA 22182
703-749-0012 ext. 335
http://retrievalsystems.com
-----------------------------

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email



Re: Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

by Cowtowncoder

On Fri, Apr 3, 2009 at 6:56 PM, Jack Rugh <jrugh@...> wrote:

> I don't see any way configure the parse to ignore the <!DOCTYPE>
> statement.  Am I missing something, or is it not possible to use a
> preloaded DTD if a <!DOCTYPE> statement is present in the input file?

It is possible: you can ignore the DOCTYPE altogether (set property
XMLInputFactory.SUPPORT_DTD to false), redirect access to the
external DTD subset, or feed in an alternate DTD schema.
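
For illustration, here is a minimal sketch of the first option using only the standard StAX API. The class name, method name and sample document are made up for the example. Note that with DTD support switched off the reader does no DTD validation at all, and a document that uses entities declared only in the DTD would probably fail, so this fits only when validation is handled some other way.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class IgnoreDoctype {
    // Returns the local name of the root element, or null if none is found.
    static String rootElementName(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // With DTD support off, the SYSTEM identifier in the DOCTYPE is not loaded.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // "missing.dtd" is a hypothetical file that does not exist.
        String xml = "<!DOCTYPE doc SYSTEM \"missing.dtd\"><doc>text</doc>";
        System.out.println(rootElementName(xml));
    }
}
```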

I don't have code at hand right now, but if you have access to the
Woodstox subversion repository, the ValidateDTD tool (under wstx-tools;
sibling to main 'wstx' in svn) shows how to override it.

But let's see... looking at the Woodstox javadocs
(http://woodstox.codehaus.org/4.0.3/javadoc/index.html), there are
a couple of ways actually.
First, you can set property

        WstxInputProperties.P_DTD_RESOLVER

(via XMLInputFactory.setProperty, or the Woodstox stream reader's
setProperty(); it just has to be done before the DTD event is processed)
to define how the DOCTYPE reference is resolved (the resolver can feed an
alternate source). This is similar to how an entity resolver works with
the SAX API.
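
As a rough sketch of the resolver idea, the example below uses the standard javax.xml.stream.XMLResolver, registered through XMLInputFactory.setXMLResolver(), so it runs without Woodstox. My assumption is that with Woodstox the same resolver object could be set under WstxInputProperties.P_DTD_RESOLVER instead, as described above. The in-memory DTD text and all class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RedirectDtd {
    // Hypothetical local copy of the DTD; a real one would come from a known file.
    static final String LOCAL_DTD = "<!ELEMENT doc (#PCDATA)>";

    // Parses the document and returns the system ids the parser asked to resolve.
    static List<String> parseAndRecordLookups(String xml) throws XMLStreamException {
        List<String> requested = new ArrayList<>();
        // Answer every external lookup with the local DTD instead of the named file.
        XMLResolver resolver = (publicId, systemId, baseUri, namespace) -> {
            requested.add(systemId);
            return new ByteArrayInputStream(LOCAL_DTD.getBytes(StandardCharsets.UTF_8));
        };
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setXMLResolver(resolver); // must be set before the DTD event is read
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                reader.next();
            }
        } finally {
            reader.close();
        }
        return requested;
    }

    public static void main(String[] args) throws XMLStreamException {
        // "missing.dtd" is a hypothetical file that does not exist.
        String xml = "<!DOCTYPE doc SYSTEM \"missing.dtd\"><doc>text</doc>";
        System.out.println(parseAndRecordLookups(xml));
    }
}
```

A resolver like this still hands the DTD text to the parser for every document, so by itself it avoids the missing-file error but not the repeated DTD parsing cost.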

Or, you can call ReaderConfig.setDTDOverride(schema); to load the
schema, you need to read the DTD into a schema instance (via
XMLValidationSchemaFactory). This is what ValidateXML uses with command
line switches, I think.
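
A sketch of how the override route might look, assuming Woodstox 4.x and the stax2-api jar are on the classpath. It parses the DTD once into a schema object and sets it on the factory under XMLInputFactory2.P_DTD_OVERRIDE; that property name is my assumption for the generic setProperty() equivalent of the override call above. The DTD text, class name and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import com.ctc.wstx.stax.WstxInputFactory;
import org.codehaus.stax2.XMLInputFactory2;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.validation.XMLValidationException;
import org.codehaus.stax2.validation.XMLValidationSchema;
import org.codehaus.stax2.validation.XMLValidationSchemaFactory;

class PreloadedDtd {
    // Hypothetical DTD text; in practice this would be read from the real DTD file.
    static final String DTD_TEXT = "<!ELEMENT doc (title)>\n<!ELEMENT title (#PCDATA)>";

    // Parse the DTD once so the result can be reused for every document.
    static XMLValidationSchema loadDtd() throws XMLStreamException {
        XMLValidationSchemaFactory schemaFactory =
            XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_DTD);
        return schemaFactory.createSchema(new StringReader(DTD_TEXT));
    }

    // Factory whose readers validate against the preloaded DTD (assumed behavior:
    // the override replaces whatever the DOCTYPE names).
    static XMLInputFactory2 newFactory(XMLValidationSchema dtd) {
        XMLInputFactory2 factory = new WstxInputFactory();
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.TRUE);
        factory.setProperty(XMLInputFactory2.P_DTD_OVERRIDE, dtd);
        return factory;
    }

    // Reads the whole document; a validation problem surfaces as an exception.
    static boolean isValid(XMLInputFactory2 factory, String xml) throws XMLStreamException {
        XMLStreamReader2 reader =
            (XMLStreamReader2) factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                reader.next();
            }
            return true;
        } catch (XMLValidationException e) {
            return false;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLInputFactory2 factory = newFactory(loadDtd());
        // "missing.dtd" is a hypothetical file that does not exist.
        String xml = "<!DOCTYPE doc SYSTEM \"missing.dtd\"><doc><title>A</title></doc>";
        System.out.println(isValid(factory, xml));
    }
}
```

Sharing one factory and one schema object across all documents is the point here: the large DTD is parsed a single time rather than once per document.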

... all of which should be documented in a better way. :-)

Please let me know if it's hard to figure out details wrt above.
Someone on this list should be able to help more.

Hope this helps,

-+ Tatu +-




RE: Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

by Jack Rugh

Hi Tatu:

Thanks for the quick response.  I first implemented the technique from the ValidateDTD tool.  It worked fine.  Then, when I saw that the setFeature method in XMLStreamReader2 was deprecated, I switched to the  "setProperty" method in XMLInputFactory2.  That also worked.  So, I will stay with it.

Jack.......



Re: Validating with a preloaded DTD when instance includes a <!DOCTYPE> statement

by Cowtowncoder

On Tue, Apr 7, 2009 at 7:58 AM, Jack Rugh <jrugh@...> wrote:
> Hi Tatu:
>
> Thanks for the quick response.  I first implemented the technique from the ValidateDTD tool.  It worked fine.  Then, when I saw that the setFeature method in XMLStreamReader2 was deprecated, I switched to the  "setProperty" method in XMLInputFactory2.  That also worked.  So, I will stay with it.

Great. Good to hear it worked.

And yes, setFeature() was something I thought would make sense to add
(with 2.0, I think), but in the end it wasn't really needed, so I
decided to just use setProperty()/getProperty() for generic
configuration.
ReaderConfig/WriterConfig also have type-safe, Woodstox-specific access methods.

-+ Tatu +-
