Can Nutch crawl XLS and XLSX files?

by tarunsapra

Hi people,

I am not able to find the answer to the above question. Please help if you know the answer.
Also, I have read that Nutch has plugins to parse Word and PDF files. Am I correct?

Re: can Nutch crawl XLS and XLSX file???

by John Whelan

Nutch can index MS Word, MS PowerPoint, MS Excel, and PDF files. For these types to be parsed during a crawl, the corresponding plugins must be listed in the plugin.includes value in nutch/conf/nutch-site.xml (the value should contain 'parse-(msexcel|mspowerpoint|msword|pdf)').
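
For illustration, here is a minimal sketch of what that property might look like in nutch-site.xml. Only the four parse plugins (msexcel, mspowerpoint, msword, pdf) come from the description above; the other entries in the value (protocol-http, urlfilter-regex, parse-text, parse-html, index-basic, query-*) are assumed placeholders, so start from whatever plugin.includes value your own nutch-default.xml ships with and add the parse plugins to it.

```xml
<!-- nutch/conf/nutch-site.xml (sketch only) -->
<!-- Assumption: every plugin below other than msexcel, mspowerpoint,
     msword and pdf is a placeholder; copy the real list from the
     plugin.includes entry in your nutch-default.xml. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|msexcel|mspowerpoint|msword|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin names, which is why the parse plugins can be grouped inside one parenthesized alternation.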

I was not sure if the new XLSX format was supported, so I looked at nutch/conf/tika-mimetypes.xml. From what I can tell, XLSX files are not supported (as of the 11/6/2009 build of Nutch); only the following extensions are mapped to the Excel MIME type:

        <mime-type type="application/vnd.ms-excel">
                <magic priority="50">
                        <match value="Microsoft Excel 5.0 Worksheet" type="string"
                                offset="2080" />
                </magic>
                <glob pattern="*.xls" />
                <glob pattern="*.xlc" />
                <glob pattern="*.xll" />
                <glob pattern="*.xlm" />
                <glob pattern="*.xlw" />
                <glob pattern="*.xla" />
                <glob pattern="*.xlt" />
                <glob pattern="*.xld" />
                <alias type="application/msexcel" />
        </mime-type>

...of course, I could be wrong.