package parser ignoring tika-config.xml

by Jonathan Koren-3

I created my own ContentHandler and a parser, XmlParser, that echo out
the DOM tree of the XML file being parsed.  I modified tika-config.xml so
that AutoDetectParser will call this parser for XML files:

         <parser name="parse-xml" class="XmlParser">
                 <mime>application/xml</mime>
         </parser>

If Tika parses an XML file directly, the right thing happens:

        resourceName: 1001281.xml
        ComplexIndexerTaskThread()
        XmlParser Begins
        SCH: start document
        SCH: start element nitf
        SCH: a: change.date=June 10, 2005
        SCH: a: change.time=19:30
        SCH: a: version=-//IPTC//DTD NITF 3.3//EN
        SCH: start element head
        SCH: start element title
        Apprentices Sample Life Of Doctors In Villages
        SCH: end element title
        SCH: start element meta
        SCH: a: content=Y11DOC$01
        SCH: a: name=slug

and so on for the fragment:

        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE nitf SYSTEM "http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
        <nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD NITF 3.3//EN">
        <head>
        <title>Apprentices Sample Life Of Doctors In Villages</title>
        <meta content="Y11DOC$01" name="slug"/>
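
For reference, a handler that produces output in the style shown above
can be sketched roughly as follows.  This is only a minimal stand-in
using the JDK's SAX API, not the actual XmlParser or ContentHandler from
this message and not Tika's Parser interface; the names EchoHandler and
echo are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the echoing handler; not the original code.
class EchoHandler extends DefaultHandler {
    // Every echoed line is also collected so callers can inspect it.
    final List<String> lines = new ArrayList<>();

    private void emit(String line) {
        lines.add(line);
        System.out.println(line);
    }

    @Override
    public void startDocument() {
        emit("SCH: start document");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        emit("SCH: start element " + qName);
        for (int i = 0; i < atts.getLength(); i++) {
            emit("SCH: a: " + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Skip whitespace-only text nodes to keep the echo readable.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            emit(text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        emit("SCH: end element " + qName);
    }

    // Parses an XML string with the JDK's default SAX parser and returns the echoed lines.
    // The sample input has no DOCTYPE, so no external DTD is fetched.
    static List<String> echo(String xml) throws Exception {
        EchoHandler handler = new EchoHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        echo("<nitf version=\"3.3\"><head><title>Sample</title></head></nitf>");
    }
}
```

The point of the comparison is that element and attribute events like
these only appear when an XML-aware parser is actually invoked for the
file.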


Now, if I put this XML file inside a gzipped tar file, my XmlParser
isn't called.  Instead, its content is somehow converted to plain text,
which is not correct.  Example output:

        fullpathname: /Users/jonathan/devel/cs/spade/aaa.tar.gz
        resourceName: aaa.tar.gz
        ComplexIndexerTaskThread()
        SCH: start document
        SCH: start element html
        SCH: start element head
        SCH: start element title

        SCH: end element title

        SCH: end element head
        SCH: start element body
        SCH: start element div
        SCH: a: class=package-entry
        SCH: subfile 1 detected!
        SCH: start element h1
        aaa.tar
        SCH: subfile 1's name is aaa.tar

        SCH: end element h1
        SCH: start element div
        SCH: a: class=package-entry
        SCH: subfile 2 detected!
        SCH: start element h1
        1001281.xml
        SCH: subfile 2's name is 1001281.xml

        SCH: end element h1
        SCH: start element p


     Apprentices Sample Life Of Doctors In Villages


and so on.

Why is PackageParser ignoring the configuration in tika-config.xml?
This shouldn't be the intended behavior.  If a user has configured Tika
to handle certain MIME types specially, then files matching those MIME
types should be handled specially wherever they are found.  I suspect
this is a problem with how MIME types are detected.
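
To make the expected behavior concrete, here is a rough sketch of
per-entry dispatch.  It is not how Tika's PackageParser is implemented.
It uses a zip archive from java.util.zip in place of a gzipped tar
(the JDK has no tar reader), a hypothetical in-memory map in place of
tika-config.xml, and naive detection by file extension.  All class and
method names are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PackageDispatchSketch {
    // Hypothetical registry standing in for tika-config.xml: MIME type -> parser name.
    static final Map<String, String> CONFIGURED = new HashMap<>();
    static {
        CONFIGURED.put("application/xml", "parse-xml");
    }

    // Naive detection by file extension; an assumption made for this sketch only.
    static String detect(String name) {
        return name.endsWith(".xml") ? "application/xml" : "text/plain";
    }

    // Walks the archive and records which parser each entry would be routed to.
    static List<String> route(byte[] zipBytes) throws IOException {
        List<String> routed = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                String mime = detect(e.getName());
                String parser = CONFIGURED.getOrDefault(mime, "default");
                routed.add(e.getName() + " -> " + parser);
            }
        }
        return routed;
    }

    // Builds a small in-memory zip with one XML entry and one text entry.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf)) {
            out.putNextEntry(new ZipEntry("1001281.xml"));
            out.write("<nitf/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("notes.txt"));
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        route(sampleZip()).forEach(System.out::println);
    }
}
```

In this sketch the XML entry is routed to the configured parser even
though it sits inside an archive, which is the behavior I would expect
for 1001281.xml inside aaa.tar.gz.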


--
Jonathan Koren
jonathan@...
http://www.soe.ucsc.edu/~jonathan/