Moving Functionality from CLI to ParseUtils

Hi, all. Long time no talk. I had been working part time and on a kind of sabbatical during which I abandoned Java in favor of studying Ruby and Clojure, and attending and organizing BarCamps. About three months ago, I started a new job, working with Java again.

The need to extract structured data from Excel spreadsheets arose, and I wrote a JRuby script that called Tika to manage the parsing. In the process, I think I identified some possible improvements to Tika.

It would be nice to simplify one of the simplest use cases, where you want Tika to parse a document using default configurations, and specify its output stream. There is a very general mechanism for parsing in CLI, but it is not possible to override the output stream default (stdout), and it is awkward to call it from a program rather than on the command line.

I have two suggestions:

1) Make the output destination a configuration option (a command line parameter) that defaults to stdout (perhaps "-o"). Although it's easy to redirect output on the command line, it's not quite so simple when that command is called within a script that itself may be redirected. Also, when the command is executed from within another program, there may be issues as well.

2) Move the methods that do the work to ParseUtils, and leave only a thin command line wrapper around them in CLI. It would be helpful for scripts and Java programs to have these easy-to-use methods available too. It seems wasteful to force the caller to construct a command line to do this.

What do you think?

Cheers,
Keith
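The two suggestions above could fit together roughly as follows. This is only a sketch under stated assumptions: `ExtractLib`, `ExtractCli`, and `extract` are hypothetical names, not existing Tika classes, and the byte copy merely stands in for the real parsing work. The point is the shape: the work lives in a method that takes the output stream as a parameter, and the command line wrapper only interprets "-o" and falls back to stdout.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical library-side helper (the role proposed for ParseUtils):
// scripts and Java programs can call it without building a command line.
final class ExtractLib {
    private ExtractLib() {}

    // Stand-in for the real parsing work: this sketch only copies bytes.
    static void extract(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }
}

// Thin command line wrapper: argument handling only, no parsing logic.
final class ExtractCli {
    private ExtractCli() {}

    public static void main(String[] args) throws IOException {
        String inputPath = null;
        String outputPath = null;
        for (int i = 0; i < args.length; i++) {
            if ("-o".equals(args[i]) && i + 1 < args.length) {
                outputPath = args[++i];   // value that follows -o
            } else {
                inputPath = args[i];      // anything else is the input file
            }
        }
        if (inputPath == null) {
            System.err.println("usage: ExtractCli [-o output] input");
            return;
        }
        try (InputStream in = new FileInputStream(inputPath)) {
            if (outputPath == null) {
                ExtractLib.extract(in, System.out);   // default: stdout
            } else {
                try (OutputStream out = new FileOutputStream(outputPath)) {
                    ExtractLib.extract(in, out);
                }
            }
        }
    }
}
```

A JRuby script or Java program would then call `ExtractLib.extract(in, out)` directly with whatever stream it likes, and only shell users would go through `ExtractCli`.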
Re: Moving Functionality from CLI to ParseUtils

Hi,

On Sat, Jul 4, 2009 at 9:56 PM, Keith R. Bennett <kbennett@...> wrote:
> In the process, I think I identified some possible improvements to Tika. It
> would be nice to simplify one of the simplest use cases, where you want
> Tika to parse a document using default configurations, and specify its output
> stream.
> [...]
> What do you think?

Instead of a fixed facade like ParseUtils I personally prefer a set of components that I can combine in different ways to solve all kinds of use cases. For example your case would be easy to solve like this:

    InputStream input = ...; // Where your input is coming from
    OutputStream output = ...; // Where your output is going to
    new AutoDetectParser().parse(
            input, new BodyContentHandler(output), new Metadata());

Of course a static facade method like ParseUtils.parse(File input, File output) might be easier for occasional users. Did you have some specific method signatures in mind?

BR,
Jukka Zitting
Re: Moving Functionality from CLI to ParseUtils

Jukka -

Having pluggable parts, as you suggest, is definitely the way to go for optimum power and flexibility. However, IMHO, for the simplest use cases, and for beginning users, this approach may discourage and complicate Tika's use. I suggest an alternate simplified interface (see below) for these uses/users.

Renovating the entrance gate to Tika-land in this way could result in an increase in the number of beginning users, who continue on to be advanced users, and hopefully developers. A larger installed base could then result in attracting more resources to the project, human and otherwise.

* * *

It's been a while since I worked on Tika, and it's evolved in the meantime, so I'm not very adept at it these days. As such, let me use this to the project's advantage, and let you know what I would value in Tika as a new user.

For the simple cases, I would suggest hiding things like parser implementations, metadata objects, and content handlers. The simplest cases with document type autodetection could be handled by:

    parse(InputStream inputStream, OutputStream outputStream)

Then, to specify the document type, we could add a MimeType string argument:

    parse(InputStream inputStream, OutputStream outputStream, String mimeType)

I realize that this approach is not very efficient with multiple documents, since there is setup work that needs to be done for each document, but it is probably not an issue for most casual users.

Another question... I used Tika to parse an Excel spreadsheet, and it created an XML file. How could I insert a handler for parsing documents with multiple records (such as an Excel spreadsheet), so that I could, for example, insert the record into a database instead of writing XML to a file? Rather than writing a full-blown XML content handler, I wonder if we could simplify it to something like this:

    public interface RecordProcessor {
        void processRecord(Object [] fields); // or List
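For illustration, a per-record callback along the lines of the `RecordProcessor` idea above might be used as in the sketch below. Everything except the `processRecord(Object[] fields)` signature is an assumption: `RecordFeeder` is a hypothetical stand-in for whatever would read rows out of a spreadsheet, and `CollectingProcessor` keeps the records in memory where a real implementation would probably issue a database insert.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Callback in the spirit of the proposal: one call per record.
interface RecordProcessor {
    void processRecord(Object[] fields);
}

// Hypothetical driver standing in for a spreadsheet parser: it simply
// hands each already-extracted row to the processor.
final class RecordFeeder {
    private RecordFeeder() {}

    static void feed(List<Object[]> rows, RecordProcessor processor) {
        for (Object[] row : rows) {
            processor.processRecord(row);
        }
    }
}

// Example processor that collects records instead of writing XML.
// A database-backed version would run an INSERT here instead.
final class CollectingProcessor implements RecordProcessor {
    final List<String> lines = new ArrayList<>();

    @Override
    public void processRecord(Object[] fields) {
        lines.add(Arrays.toString(fields));
    }
}
```

The caller never sees XML or SAX events in this arrangement; it only receives one array of field values per row.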
Re: Moving Functionality from CLI to ParseUtils

Hi,

2009/7/11 keithrbennett <keithrbennett@...>:
> Having pluggable parts, as you suggest, is definitely the
> way to go for optimum power and flexibility. However, IMHO,
> for the simplest use cases, and for beginning users,
> this approach may discourage and complicate Tika's use.
> I suggest an alternate simplified interface (see below)
> for these uses/users.

Agreed, the more I think about this the more I think having something like this would be useful. My proposal would be to add an org.apache.tika.Tika facade class with static methods for the most important simple use cases.

> For the simple cases, I would suggest hiding things like parser
> implementations, metadata objects, and content handlers. The simplest
> cases with document type autodetection could be handled by:
>
> parse(InputStream inputStream, OutputStream outputStream)

I guess the most important parsing use case is to produce a Reader for use in Lucene indexing. Thus I would add a method like this:

    Reader parse(InputStream);

Some clients may prefer to have it all in a simple string (with all the caveats of large inputs, perhaps we should have some built-in output size limit), so we could also do:

    String parseToString(InputStream);

The XHTML output is probably only useful in more sophisticated use cases, where the Parser interface and an appropriate ContentHandler can be used directly.

> Then, to specify the document type, we could add a MimeType string
> argument:
>
> parse(InputStream inputStream, OutputStream outputStream,
> String mimeType)

Tika is already pretty good at auto-detecting the document type, and in my experience the file name is much more useful in helping type detection than any externally provided type information. Tika likely has a much more complete set of file name glob patterns than what probably was used to produce the external type information. Thus I'd rather give the proposed parse method information about the file name when available.

And instead of adding an explicit argument, we could just as well add overloaded methods that also take care of correctly opening and closing the file (or URL resource) as needed. Something like this:

    Reader parse(File);
    Reader parse(URL);

Similarly for the parseToString method. In more complex cases (e.g. if the file is inside a database field) one can always use the Parser interface directly.

And while we're at it, there are many cases where an application needs to figure out the type of a given document. Instead of coming up with its own glob patterns and the like, an application could use Tika functionality through potential facade methods like the following that would return the auto-detected media type of the given document:

    String detect(InputStream);
    String detect(File);
    String detect(URL);

WDYT?

> Another question...I used Tika to parse an Excel spreadsheet. and it
> created an XML file. How could I insert a handler for parsing
> documents with multiple records (such as an Excel spreadsheets, so
> that I could, for example, insert the record into a data base instead
> of writing XML to a file?

That's a big can of worms as each document type comes with its own structure and semantics. Tika avoids this problem by focusing on just the contained text and some very generic structural information. If you need more detailed structural information, you'll inevitably hit type-specific features and my recommendation would be to directly use the appropriate parser library. For example, I'd use POI directly for pulling specific information out of Excel spreadsheets.

BR,
Jukka Zitting
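The facade shape discussed in the message above (a stream-based method, a string variant with a built-in size limit, and a File overload that opens and closes the file itself) can be sketched with the standard library alone. This is not the proposed org.apache.tika.Tika class: `SimpleFacade`, `MAX_CHARS`, and `readLimited` are hypothetical names, the limit value is arbitrary, and decoding the stream as UTF-8 merely stands in for real parsing.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical facade with static methods for the simple use cases.
final class SimpleFacade {
    // Assumed cap on the number of characters parseToString returns.
    static final int MAX_CHARS = 100_000;

    private SimpleFacade() {}

    // Stand-in for real parsing: decodes the stream as UTF-8 text.
    static Reader parse(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Returns at most MAX_CHARS characters so that a huge input
    // cannot exhaust memory.
    static String parseToString(InputStream in) throws IOException {
        return readLimited(parse(in), MAX_CHARS);
    }

    // Overload that opens and closes the file on the caller's behalf.
    static String parseToString(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return parseToString(in);
        }
    }

    // Reads from the reader until it ends or the limit is reached.
    static String readLimited(Reader reader, int limit) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        while (text.length() < limit) {
            int wanted = Math.min(buffer.length, limit - text.length());
            int n = reader.read(buffer, 0, wanted);
            if (n == -1) {
                break;
            }
            text.append(buffer, 0, n);
        }
        return text.toString();
    }
}
```

Keeping the File overload as a thin wrapper around the stream version means resource handling lives in one place, and callers with unusual sources (a database field, say) can still use the stream method directly.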
Re: Moving Functionality from CLI to ParseUtils

Jukka and All -

I think a Tika facade would be awesome.

I guess where I mentioned streams, I should be mentioning readers and writers instead.

BTW, how can I insert new text into quoted sections of a message in Nabble?

Regarding a method that returns a Reader (rather than taking a Writer) being better for Lucene: for other use cases a Writer might be more convenient (for writing to files, for example).

Having a method that takes a Writer would, I think, be more useful than having a method returning a string because it could 1) support sizes larger than memory capacity, 2) easily support output to files, and 3) still support strings (by using a StringWriter).

Speaking of Lucene, I have never used Lucene directly, so I lack the context to understand the Tika/Lucene integration. All my input is from the point of view of someone who just wants to parse text from documents and do things other than text search. So if I neglect to include Lucene in my outlook, rest assured that it is just ignorance and nothing more. ;)

Regarding XHTML, we already support it on the command line. My sense is that Excel spreadsheet parsing would be used more often for structured data than for raw text (that's certainly true for me), so I hope we could keep that. I understand your suggestion to use POI directly for more sophisticated document handling, though.

Everything else sounded good to me.

Regards,
Keith
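The point about a Writer-taking method still covering the string case can be shown in a few lines. This is a sketch only: `WriterSketch` and both method names are hypothetical, and the character copy stands in for real text extraction.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;

final class WriterSketch {
    private WriterSketch() {}

    // Hypothetical Writer-taking method. It streams in fixed-size chunks,
    // so memory use does not grow with the size of the document.
    static void parse(Reader source, Writer target) throws IOException {
        char[] buffer = new char[4096];
        int n;
        while ((n = source.read(buffer)) != -1) {
            target.write(buffer, 0, n);
        }
        target.flush();
    }

    // String convenience built on the Writer version via StringWriter.
    static String parseToString(Reader source) throws IOException {
        StringWriter text = new StringWriter();
        parse(source, text);
        return text.toString();
    }
}
```

Passing a file-backed Writer (for example one from `java.nio.file.Files.newBufferedWriter`) to the same `parse` method covers the output-to-file case without any extra API.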