[jira] Created: (TIKA-251) package parser ignoring tika-config.xml

package parser ignoring tika-config.xml
----------------------------------------

                 Key: TIKA-251
                 URL: https://issues.apache.org/jira/browse/TIKA-251
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4
            Reporter: Jonathan Koren

I created my own ContentHandler, XmlParser that echoes out the DOM tree of the XML file being parsed. I modified tika-config so that AutoDetectParser will call this parser for XML files:

    <parser name="parse-xml" class="XmlParser">
        <mime>application/xml</mime>
    </parser>

If Tika parses an XML file directly, the right thing is done:

    resourceName: 1001281.xml
    ComplexIndexerTaskThread()
    XmlParser Begins
    SCH: start document
    SCH: start element nitf
    SCH: a: change.date=June 10, 2005
    SCH: a: change.time=19:30
    SCH: a: version=-//IPTC//DTD NITF 3.3//EN
    SCH: start element head
    SCH: start element title
    Apprentices Sample Life Of Doctors In Villages
    SCH: end element title
    SCH: start element meta
    SCH: a: content=Y11DOC$01
    SCH: a: name=slug

and so on for the fragment:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
    <nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
    <head>
    <title>Apprentices Sample Life Of Doctors In Villages</title>
    <meta content="Y11DOC$01" name="slug"/>

Now, if I put this XML file within a gzipped tar file, my XmlParser isn't called. Instead the file is somehow converted to plain text, which is not correct. Example output:

    fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
    resourceName: aaa.tar.gz
    ComplexIndexerTaskThread()
    SCH: start document
    SCH: start element html
    SCH: start element head
    SCH: start element title
    SCH: end element title
    SCH: end element head
    SCH: start element body
    SCH: start element div
    SCH: a: class=package-entry
    SCH: subfile 1 detected!
    SCH: start element h1
    aaa.tar
    SCH: subfile 1's name is aaa.tar
    SCH: end element h1
    SCH: start element div
    SCH: a: class=package-entry
    SCH: subfile 2 detected!
    SCH: start element h1
    1001281.xml
    SCH: subfile 2's name is 1001281.xml
    SCH: end element h1
    SCH: start element p
    Apprentices Sample Life Of Doctors In Villages

and so on.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
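For readers trying to reproduce the trace in this report, here is a minimal sketch of a SAX handler that prints events in a similar "SCH:" format. It is not the reporter's code: the class names `EchoHandlerSketch` and `EchoHandler` are hypothetical, it uses only the JDK's built-in SAX parser rather than Tika, and the sample input drops the DOCTYPE line so the parser does not try to fetch the external DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EchoHandlerSketch {

    /** Records SAX events as text lines, imitating the "SCH:" trace in the report. */
    public static class EchoHandler extends DefaultHandler {
        public final List<String> lines = new ArrayList<>();

        @Override
        public void startDocument() {
            lines.add("SCH: start document");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            lines.add("SCH: start element " + qName);
            // One line per attribute, in the order the parser reports them.
            for (int i = 0; i < atts.getLength(); i++) {
                lines.add("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Skip whitespace-only text so the trace stays readable.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            lines.add("SCH: end element " + qName);
        }
    }

    /** Parses the given XML string and returns the recorded trace lines. */
    public static List<String> trace(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        // Simplified, well-formed variant of the fragment quoted in the report.
        String xml = "<nitf change.date=\"June 10, 2005\" change.time=\"19:30\">"
                + "<head><title>Apprentices Sample Life Of Doctors In Villages</title>"
                + "<meta content=\"Y11DOC$01\" name=\"slug\"/></head></nitf>";
        for (String line : trace(xml)) {
            System.out.println(line);
        }
    }
}
```

When the handler sees `start element html` and `class=package-entry` divs instead of `start element nitf`, as in the second trace above, the events are coming from the package parser's XHTML output rather than from the custom XML parser.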
[jira] Updated: (TIKA-251) package parser ignoring tika-config.xml

[ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Koren updated TIKA-251:
--------------------------------

    Priority: Minor  (was: Major)
[jira] Commented: (TIKA-251) package parser ignoring tika-config.xml

[ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724199#action_12724199 ]

Jukka Zitting commented on TIKA-251:
------------------------------------

The package parser might not be picking up your custom configuration. Are you using a recent version from trunk? See TIKA-238, which should fix the issue of a PackageParser always using the default Tika configuration.
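The behaviour described in this comment can be illustrated with a small stand-alone sketch. It does not use Tika's classes or reproduce the TIKA-238 change; `Parser`, `Config`, and both `parseEntries...` methods are hypothetical names invented for illustration. The point is only the difference between a container parser that builds its own default configuration for nested entries and one that reuses the configuration the caller supplied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DelegateConfigSketch {

    /** Hypothetical stand-in for a parser; returns a label instead of emitting SAX events. */
    public interface Parser {
        String parse(String entryName);
    }

    /** Hypothetical parser registry keyed by file extension (Tika itself keys on MIME type). */
    public static class Config {
        public final Map<String, Parser> parsers = new HashMap<>();

        /** Built-in configuration: XML entries get generic text extraction. */
        public static Config defaults() {
            Config config = new Config();
            config.parsers.put("xml", name -> "default text extraction of " + name);
            return config;
        }

        public Parser forName(String entryName) {
            String ext = entryName.substring(entryName.lastIndexOf('.') + 1);
            return parsers.getOrDefault(ext, name -> "unknown type: " + name);
        }
    }

    /** Reported behaviour: nested entries are parsed with a fresh default configuration. */
    public static List<String> parseEntriesWithDefaults(List<String> entries) {
        Config config = Config.defaults(); // the caller's custom parsers are never consulted
        return parseEntries(entries, config);
    }

    /** Intended behaviour: nested entries are parsed with the caller's configuration. */
    public static List<String> parseEntries(List<String> entries, Config config) {
        List<String> results = new ArrayList<>();
        for (String entry : entries) {
            results.add(config.forName(entry).parse(entry));
        }
        return results;
    }

    public static void main(String[] args) {
        Config custom = Config.defaults();
        custom.parsers.put("xml", name -> "custom XmlParser on " + name);

        List<String> entries = Arrays.asList("1001281.xml");
        System.out.println(parseEntriesWithDefaults(entries)); // custom parser is ignored
        System.out.println(parseEntries(entries, custom));     // custom parser is used
    }
}
```

In this sketch the fix amounts to passing the configured delegate down to wherever nested entries are parsed, instead of constructing a default one at that point.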
[jira] Commented: (TIKA-251) package parser ignoring tika-config.xml

[ https://issues.apache.org/jira/browse/TIKA-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724352#action_12724352 ]

Jonathan Koren commented on TIKA-251:
-------------------------------------

Just updated and reran `mvn install` to make sure.

    bash-3.2# svn update
    At revision 788551.