Stripping comments

One task I have is to package our source XML files for use by integrators. Before doing so, I'd like to strip the comments from these files, as they may contain sensitive information.

I was thinking that this could be done by processing each file through Saxon using a stylesheet which strips out comments and outputs the XML again. But rather than risk reinventing the wheel, I was wondering if anyone out there has implemented a DocBook comment stripper in their build process?

Thanks,

P.
Re: Stripping comments

You could use XSLT, but you might not like the results. 8^)

You start with an identity stylesheet such as the following:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="1.0">

      <xsl:output indent="no"/>

      <xsl:template match="node()|@*">
        <xsl:copy>
          <xsl:apply-templates select="@*"/>
          <xsl:apply-templates/>
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>

Then you add a template to strip out comments:

    <xsl:template match="comment()"/>

There are several problems with using XSLT though:

1. Entity references are expanded, not preserved as entity references. You can't hide them in XSLT because the parser expands them before the stylesheet sees them.

2. Any DOCTYPE declaration is removed. You have to copy your doctype public and system identifiers to the xsl:output element's doctype-public and doctype-system attributes. The stylesheet can't do it, because the DOCTYPE is not accessible to XPath. Any internal DTD subset is lost, as there is no way for xsl:output to specify it.

3. Default DocBook attributes are added. You will end up with a lot of moreinfo="none" attributes on elements like literal.

4. The output will differ in other ways because the XML is parsed and then re-serialized: attribute order may be different, empty elements may be expressed differently, and character references will become native UTF-8 (unless you specify a different output encoding). These differences will show up in a text diff program, but not in an XML-aware differencing program.

Generally, I use Perl for such filtering. An XML comment can be matched by a well-defined regular expression, and Perl doesn't mess with any XML stuff. I read the entire file into a single string, globally replace comments with nothing, and then print the string.

Bob Stayton
Sagehill Enterprises
DocBook Consulting
bobs@...
----- Original Message -----
From: "Paul Moloney" <paul_moloney@...>
To: <docbook-apps@...>
Sent: Thursday, March 29, 2007 6:45 AM
Subject: [docbook-apps] Stripping comments

> View this message in context:
> http://www.nabble.com/Stripping-comments-tf3486783.html#a9734912
> Sent from the docbook apps mailing list archive at Nabble.com.
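The read-the-whole-file-and-replace approach Bob describes can be sketched roughly as follows. This is a Python rendering, not the Perl he uses, and the names `strip_comments` and `strip_file` are invented for illustration. It treats the file purely as text, so the DOCTYPE, entity references, and attribute order pass through untouched. It assumes that `<!--` never appears inside a CDATA section, where it would not be a real comment.

```python
import re

# An XML comment: "<!--", then any text, then the first "-->" that follows.
# re.DOTALL lets "." match newlines, so multi-line comments are covered.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(text: str) -> str:
    """Return text with every XML comment replaced by nothing."""
    return COMMENT_RE.sub("", text)


def strip_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Read src_path as one string, strip comments, write the result to dst_path.

    newline="" keeps the original line endings unchanged.
    The utf-8 default is an assumption; pass the file's real encoding.
    """
    with open(src_path, encoding=encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(strip_comments(text))
```

Calling `strip_file("book.xml", "book.stripped.xml")` (both file names hypothetical) writes a comment-free copy and leaves the source file alone.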
Re: Stripping comments

Here's a quick Perl solution that doesn't read everything into memory and seems to handle some of the edge cases. Try it out on a few things to verify that everything is okay before completely trusting it, though. :)

Copy the lines between '----------------------' into a file (say strip_xml_comments.pl).

(If on Unix, do this step first.)

    chmod 755 strip_xml_comments.pl

Make a backup copy of any and all files that you'll be using. (The script should work fine as is, but it's *MUCH* better to be safe than sorry. :)

Now you should be able to run the script on a copy of your input file.

    strip_xml_comments.pl my_xml_input_file.xml

The script will make a backup copy of its own with '.orig' at the end of the name. (Please don't just rely on this feature -- make your own backup.)

Verify that everything looks okay and integrate it into your application stream.

Here's the script
----------------------
#!/usr/bin/perl -w -i.orig
#
# NB: Delete the '.orig' portion if backup copies are not desired
#
#
# Delete XML comments.
#
use strict;

my $in_comment = 0;

#
# Go through every file given on the command line
#
while( <> ) {
    #
    # Inside a multi-line comment: skip lines until the closing
    # delimiter, then remove everything up to and including it.
    #
    if( $in_comment ) {
        next unless s/\A.*?-->//s;
        $in_comment = 0;
    }

    #
    # Match inline comments
    #
    s {
        <!--    # Match the opening delimiter.
        .*?     # Match a minimal number of characters.
        -->     # Match the closing delimiter.
    } []gsx;

    #
    # Start of a multi-line comment: remove from the opening delimiter
    # to the end of the line, keeping any text that comes before it.
    # NB: All in-line comments have already been removed
    #
    if( s/<!--.*//s ) {
        $in_comment = 1;
    }

    print;      # Print whatever is left of the current line

    #
    # Don't carry an unterminated comment over into the next file
    #
    $in_comment = 0 if eof;
}
----------------------

Note that the code is a simple modification of one of the examples from the perlre man page (http://perldoc.perl.org/perlre.html).

Hopefully this will suit your purposes!
kells
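The advice above to try a stripper on a few files before trusting it can be partly automated. The following is a minimal sketch in Python, with invented helper names (`count_comment_openers`, `content_preserved`). It checks two things: that no comment opener is left in the output, and that everything outside comments survived, ignoring whitespace differences. It is a text-level check only and does not confirm that the output is still well-formed XML.

```python
import re

# Same text-level notion of a comment as a regex-based stripper would use.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def count_comment_openers(text: str) -> int:
    """Number of "<!--" sequences in text; 0 is expected after stripping."""
    return text.count("<!--")


def _squash(text: str) -> str:
    """Drop all whitespace so line-joining differences are ignored."""
    return "".join(text.split())


def content_preserved(original: str, stripped: str) -> bool:
    """True if stripped equals original minus its comments, whitespace aside."""
    expected = COMMENT_RE.sub("", original)
    return _squash(expected) == _squash(stripped)
```

A stripper that drops the text sharing a line with a comment delimiter would fail `content_preserved`, which is the kind of edge case worth catching before the files go out to integrators.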
Re: Stripping comments

Thanks for the help; will try this out and let you know how it goes...

P.