Hugo Pinto wrote:
> Hello,
>
> I am using Nutch for mirroring, rather than crawling and indexing.
> I need direct access to the cached data in my Nutch index, but I am
> unable to find an easy way to get it.
> I browsed the documentation (wiki, javadocs) and skimmed the code, but
> found no straightforward way to do it.
> Could anyone suggest a place to look for more information, or, if you
> have done this before, share a few tips?
Most likely what you need is not the Lucene index but the segments
(shards), right? There's a utility called SegmentReader (available from
the command line as readseg), and you can use its API to retrieve either
all records or individual records from a segment (using the URL as key).
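
As a rough illustration of the command-line route, here is a minimal Python
sketch that builds readseg invocations. It assumes a Nutch release whose
readseg tool accepts -get, -dump and the -no* filter options, and that the
launcher script lives at bin/nutch; the helper names (readseg_command,
run_readseg) and the segment path in the usage note are hypothetical, so
check the usage text of your own version before relying on it.

```python
import subprocess
from typing import List, Optional

# Assumed location of the Nutch launcher script; adjust for your install.
NUTCH_BIN = "bin/nutch"


def readseg_command(segment_dir: str,
                    url: Optional[str] = None,
                    dump_dir: Optional[str] = None,
                    content_only: bool = False) -> List[str]:
    """Build a readseg command line (hypothetical helper).

    Pass `url` to fetch a single record keyed by URL (-get), or
    `dump_dir` to dump every record in the segment (-dump).
    """
    if (url is None) == (dump_dir is None):
        raise ValueError("pass exactly one of url or dump_dir")
    if url is not None:
        cmd = [NUTCH_BIN, "readseg", "-get", segment_dir, url]
    else:
        cmd = [NUTCH_BIN, "readseg", "-dump", segment_dir, dump_dir]
    if content_only:
        # Suppress every record type except the raw fetched content.
        cmd += ["-nofetch", "-nogenerate", "-noparse",
                "-noparsedata", "-noparsetext"]
    return cmd


def run_readseg(cmd: List[str]) -> str:
    """Run the command and return whatever readseg writes to stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, run_readseg(readseg_command("crawl/segments/20060101000000",
url="http://example.com/", content_only=True)) would ask for the stored
content of one page, while passing dump_dir instead of url dumps the whole
segment into that directory.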
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com