Direct Access to Cached Data

by Hugo Pinto

Hello,

I am using Nutch for mirroring, rather than crawling and indexing.
I need to directly access the cached data in my Nutch index, but I am
unable to find an easy way to do so.
I browsed the documentation (wiki, javadocs) and skimmed the code, but
found no straightforward way to do it.
Could anyone suggest a place to look for more information, or has anyone
done this before who could share a few tips?

Thank you,
--
Hugo Pinto
Artificial Intelligence, Computational Linguistics and Computer Games
http://www.hugopinto.net

Re: Direct Access to Cached Data

by Andrzej Bialecki

Hugo Pinto wrote:
> Hello,
>
> I am using Nutch for mirroring, rather than crawling and indexing.
> I need to directly access the cached data in my Nutch index, but I am
> unable to find an easy way to do so.
> I browsed the documentation (wiki, javadocs) and skimmed the code, but
> found no straightforward way to do it.
> Could anyone suggest a place to look for more information, or has anyone
> done this before who could share a few tips?

Most likely what you need is not the Lucene index, but the segments
(shards), right? There's a utility called SegmentReader (available from
cmd-line as readseg), and you can use its API to retrieve either all or
individual records from a segment (using URL as key).
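
For a mirroring script, the readseg route mentioned above might be driven
roughly as in the sketch below. It only builds the command lines for
fetching one record by URL or dumping a whole segment. The launcher path,
the segment path, the helper names, and the exact switches are assumptions:
the -get/-dump modes and the -no* switches follow the readseg usage text of
Nutch 1.x as I remember it, so check the output of "bin/nutch readseg" on
your version before relying on them.

```python
import subprocess
from typing import List

# Assumption: relative path to the Nutch launcher script in your install.
NUTCH_BIN = "bin/nutch"

# Assumption: switches from the Nutch 1.x readseg usage text. Together they
# suppress every segment part except the raw fetched content.
SKIP_ALL_BUT_CONTENT = (
    "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext",
)


def readseg_get_cmd(segment_dir: str, url: str,
                    content_only: bool = True,
                    nutch_bin: str = NUTCH_BIN) -> List[str]:
    """Build the command that looks up one record, using the URL as key."""
    cmd = [nutch_bin, "readseg", "-get", segment_dir, url]
    if content_only:
        cmd.extend(SKIP_ALL_BUT_CONTENT)
    return cmd


def readseg_dump_cmd(segment_dir: str, output_dir: str,
                     content_only: bool = True,
                     nutch_bin: str = NUTCH_BIN) -> List[str]:
    """Build the command that dumps every record of a segment to output_dir."""
    cmd = [nutch_bin, "readseg", "-dump", segment_dir, output_dir]
    if content_only:
        cmd.extend(SKIP_ALL_BUT_CONTENT)
    return cmd


def run(cmd: List[str]) -> str:
    """Run a prepared command and return its standard output as text."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical segment directory and URL, shown for illustration only.
    # The command is printed, not executed, so this runs without Nutch.
    segment = "crawl/segments/20240101000000"
    print(" ".join(readseg_get_cmd(segment, "http://example.com/")))
```

Passing the result of readseg_get_cmd(...) to run(...) would execute the
lookup on a machine where Nutch is installed. The builders are kept
separate from run so the command can be inspected or logged first.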


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com