Having pluggable parts, as you suggest, is definitely the
way to go for optimum power and flexibility. However, IMHO,
for the simplest use cases, and for beginning users,
this approach may complicate Tika's use and discourage adoption.
I suggest an alternative, simplified interface (see below)
for these uses/users.
Renovating the entrance gate to Tika-land in this way
could increase the number of beginning users who go on
to become advanced users, and hopefully developers.
A larger installed base could in turn attract more
resources to the project, human and otherwise.
* * *
It's been a while since I worked on Tika, and it has evolved in the
meantime, so I'm not very adept at it these days.
Let me turn that to the project's advantage and tell you
what I would value in Tika as a new user.
For the simple cases, I would suggest hiding things like parser
implementations, metadata objects, and content handlers. The simplest
cases with document type autodetection could be handled by:
I realize that this approach is not very efficient with multiple
documents, since there is setup work that needs to be done for each
document, but it is probably not an issue for most casual users.
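
A rough sketch of what such a one-call entry point might look like,
written without any Tika classes so that it stands alone. SimpleText,
TextParser, detectType and extract are made-up names for illustration,
not Tika API, and the extension-based detection is only a placeholder
for real type autodetection.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical one-call facade; none of these names are Tika API. */
public final class SimpleText {

    /** Stand-in for a parser implementation that the caller never sees. */
    interface TextParser {
        String parse(Path file) throws IOException;
    }

    private SimpleText() {}

    /** Placeholder for autodetection: looks at the file extension only. */
    public static String detectType(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    /** Rebuilt on every call: this is the per-document setup cost. */
    private static Map<String, TextParser> buildParsers() {
        Map<String, TextParser> parsers = new HashMap<>();
        parsers.put("txt",
            f -> new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
        parsers.put("csv",
            f -> new String(Files.readAllBytes(f), StandardCharsets.UTF_8)
                     .replace(',', ' '));
        return parsers;
    }

    /** The only method a beginning user would need to call. */
    public static String extract(Path file) throws IOException {
        TextParser parser = buildParsers().get(detectType(file));
        if (parser == null) {
            throw new IOException("No parser available for " + file);
        }
        return parser.parse(file);
    }
}
```

A caller would write only SimpleText.extract(path) and never touch a
parser, metadata object, or content handler. An advanced user who needs
to process many documents could still use the pluggable parts directly
and pay the setup cost once.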
Another question... I used Tika to parse an Excel spreadsheet, and it
created an XML file. How could I insert a handler for parsing
documents with multiple records (such as Excel spreadsheets), so
that I could, for example, insert each record into a database instead
of writing XML to a file? Rather than writing a full-blown XML
content handler, I wonder if we could simplify it to something like
this:
public interface RecordProcessor {
    void processRecord(Object[] fields); // or List