rdf output

Hi, I just checked out Tika for the first time. Since I could put it to good use, I have a few newbie questions:

1. When will Tika switch to Apache's PDFBox (is it still not mature enough)?
2. Is it possible to skip HTML tags with Tika? Say I don't want the contents of <script> or <style> in my resulting plain text.
3. Most importantly, are there any plans for outputting the result as RDF? I'm currently using Aperture, but I would be more than happy to switch to an Apache project, and I'm also willing to contribute on that one.

Any insight appreciated.

wkr
www.turnguard.com
Re: rdf output

Hi,

On Sun, Sep 20, 2009 at 10:22 PM, jakobitsch juergen <tschyrgie@...> wrote:
> 1. when will tika switch to apache's pdf box (is it still not mature enough?)

As soon as the PDFBox 0.8.0 release is officially out and available from the central Maven repository. I expect this to happen within a week, so Tika 0.5 will be based on Apache PDFBox.

> 2. is it possible to skip html tags with tika (say i don't want to have <script>
> or <style> contents in my resulting plain text

Yes. That's actually what the HTML parser in Tika is programmed to do by default. See the DISCARD_ELEMENTS set in org.apache.tika.parser.html.HTMLParser.

> 3. are there any plan for outputing the result into RDF (currently i'm using aperture),
> but i would be more than happy to switch to an apache project and i'm also willing
> to contribute on that one.

We've had discussions about using XMP for expressing and handling extracted document metadata. So far we haven't reached a clear consensus and not much work has been done on this yet, but contributions are of course welcome.

BR,

Jukka Zitting
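For readers who want to see the idea behind answer 2 in isolation, here is a minimal sketch of a SAX handler that drops the text inside <script> and <style>. It is not Tika's code: the class name DiscardingTextHandler, the helper extractText() and the contents of the set are assumptions made for illustration, with only the set's name borrowed from the message above. It uses the JDK's own SAX parser, so the input has to be well-formed XHTML.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only (not Tika's implementation): collects plain text
// from SAX events and drops everything inside the discarded elements.
class DiscardingTextHandler extends DefaultHandler {
    // Hypothetical stand-in for the DISCARD_ELEMENTS set mentioned in the thread.
    private static final Set<String> DISCARD_ELEMENTS = Set.of("script", "style");

    private final StringBuilder text = new StringBuilder();
    private int discardDepth = 0; // greater than 0 while inside a discarded element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Count nesting so that child elements of a discarded element stay discarded.
        if (discardDepth > 0 || DISCARD_ELEMENTS.contains(qName.toLowerCase(Locale.ROOT))) {
            discardDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (discardDepth > 0) {
            discardDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are outside every discarded element.
        if (discardDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses a well-formed XHTML string and returns the text that was kept.
    static String extractText(String xhtml) {
        DiscardingTextHandler handler = new DiscardingTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text.toString();
    }
}
```

Calling DiscardingTextHandler.extractText() on a small page with a <style> block in the head and a <script> block in the body returns only the paragraph text. Real-world HTML is rarely well-formed, so a tolerant HTML parser would have to sit in front of a handler like this.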
Re: rdf output

Hi Jukka,

On Sep 20, 2009, at 2:26pm, Jukka Zitting wrote:
>> 2. is it possible to skip html tags with tika (say i don't want to
>> have <script>
>> or <style> contents in my resulting plain text
>
> Yes. That's actually what the HTML parser in Tika is programmed to do
> by default. See the DISCARD_ELEMENTS set in
> org.apache.tika.parser.html.HTMLParser.

I recently ran into the need to customize the behavior of the HtmlParser, in terms of which tags it passes through. In particular, the <span> tag contained attributes I wanted, but <span> isn't part of the "SAFE_ELEMENTS" set.

1. From what I can see, <span> should be part of the XHTML safe set.
2. It would be great to have some way to easily customize this behavior, e.g. a protected isSafeElement() method.
3. It looks like the code currently skips calling startElement/endElement for non-safe tags, but still outputs any characters found between those tags. Depending on where the non-safe tag occurs, this could result in an invalid XHTML document, e.g. if you had <body><non-safe tag>some text</non-safe tag></body> this would output <body>some text</body>

Thanks,

-- Ken

>>
>> 3. are there any plan for outputing the result into RDF (currently
>> i'm using aperture),
>> but i would be more than happy to switch to an apache project and
>> i'm also willing
>> to contribute on that one.
>
> We've had discussions about using XMP for expressing and handling
> extracted document metadata. So far we haven't reached clear consensus
> and not much work has yet been done about this, but contributions are
> of course welcome.
>
> BR,
>
> Jukka Zitting
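Ken's points 2 and 3 can be illustrated with a small, self-contained sketch. Nothing here is taken from Tika's source: SafeElementFilter, SpanAwareFilter, the filter() helper and the tiny safe set are hypothetical, and only the isSafeElement() hook name comes from the suggestion above. The sketch shows a filter that drops the start and end events of non-safe tags while still passing their text through, plus a subclass that opts <span> back in.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: re-serializes safe elements and silently drops
// the tags (but not the text) of everything else.
class SafeElementFilter extends DefaultHandler {
    // Hypothetical, deliberately tiny safe set; not Tika's actual list.
    private static final Set<String> SAFE_ELEMENTS = Set.of("html", "body", "p");

    private final StringBuilder out = new StringBuilder();

    // Extension point in the spirit of the proposed protected isSafeElement().
    protected boolean isSafeElement(String name) {
        return SAFE_ELEMENTS.contains(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isSafeElement(qName.toLowerCase(Locale.ROOT))) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                // Attribute values are not escaped, to keep the sketch short.
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isSafeElement(qName.toLowerCase(Locale.ROOT))) {
            out.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text always passes through, even inside a non-safe element.
        out.append(ch, start, length);
    }

    // Runs the given filter over a well-formed XHTML string and returns its output.
    static String filter(String xhtml, SafeElementFilter handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.out.toString();
    }
}

// Customization via the hook: treat <span> as safe as well.
class SpanAwareFilter extends SafeElementFilter {
    @Override
    protected boolean isSafeElement(String name) {
        return "span".equals(name) || super.isSafeElement(name);
    }
}
```

With the default filter, <body><unsafe>some text</unsafe></body> comes out as <body>some text</body>, which is the flattening described in point 3. With SpanAwareFilter, a <span class="x"> element and its attribute survive unchanged.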