RE: How to index files only with specific type

Disable the HTML parser in nutch-site.xml and keep only your parser. You can also add this to your filter file: -(htm|html)$

Thanks.

> Date: Mon, 26 Oct 2009 17:53:11 +0300
> Subject: How to index files only with specific type
> From: dfundak@...
> To: nutch-user@...
>
> Hi, I've created a parser and an indexer for a specific file type (geo XML meta file - KML).
> I am trying to crawl a couple of sites and index only files of this type.
> I don't want to index HTML or anything else.
> How can I achieve this?
> Thanks.

Re: How to index files only with specific type

If I disable the HTML parser (remove "parse-(html" from the plugin.includes property), HTML files don't get parsed, so I don't get the outlinks from the HTML pages to the KML files. That means I can't parse and index the KML files.
I might be wrong, but I have a feeling that it's not possible without modifying the source code.

Thanks.

2009/10/26 BELLINI ADAM <mbellil@...>:
> disable the html-parser from the nutch-site and keep only your parser.
> you can also add in your filter file this: -(htm|html)$
>
> thx

Re: How to index files only with specific type

Dmitriy Fundak wrote:
> If I disable html-parser (remove "parse-(html" from plugin.includes
> property) html files didn't get parsed
> So didn't get outlinks to kml files from html.
> So I can't parse and index kml files.
> I might not be right, but I have a feeling that it's not possible
> without modifying source code.

It's possible to do this with a custom indexing filter - see the other indexing filters to get a feeling for what's involved. Or you could do this with a scoring filter, too, although the scoring API looks more complicated.

Either way, when you execute the Indexer, these filters are run in a chain, and if one of them returns null then that document is discarded, i.e. it's not added to the output index. So it's easy to examine the content type (or just the URL of the document) in your indexing filter and either pass the document on or reject it by returning null.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

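The chain behaviour Andrzej describes can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions: `Doc`, `DocFilter`, `runChain` and `indexedUrls` are hypothetical stand-ins invented for illustration, not the Nutch indexing filter API, and the KML content type used here is an assumption about what the custom parser reports. The point is the contract: a filter returns the document to keep it, or null to drop it, so HTML pages can still be parsed for outlinks without reaching the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of an indexing filter chain. Doc and DocFilter are
// hypothetical stand-ins, NOT the Nutch IndexingFilter API.
class FilterChainSketch {

    // Hypothetical document carrying only the fields the filter inspects.
    static final class Doc {
        final String url;
        final String contentType;

        Doc(String url, String contentType) {
            this.url = url;
            this.contentType = contentType;
        }
    }

    // Hypothetical filter contract: return the document to keep it,
    // or null to have it discarded.
    interface DocFilter {
        Doc filter(Doc doc);
    }

    // Assumed MIME type for KML; the thread does not say what the
    // custom parser actually reports.
    static final String KML_TYPE = "application/vnd.google-earth.kml+xml";

    // Keeps KML documents, identified by content type or by URL suffix.
    static final DocFilter KML_ONLY = doc -> {
        boolean byType = KML_TYPE.equals(doc.contentType);
        boolean byUrl = doc.url.toLowerCase(Locale.ROOT).endsWith(".kml");
        return (byType || byUrl) ? doc : null;
    };

    // Runs the filters in order and stops at the first null result.
    static Doc runChain(List<DocFilter> filters, Doc doc) {
        Doc current = doc;
        for (DocFilter f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    // Returns the URLs of the documents that survive the chain.
    static List<String> indexedUrls(List<DocFilter> filters, List<Doc> docs) {
        List<String> kept = new ArrayList<>();
        for (Doc d : docs) {
            Doc out = runChain(filters, d);
            if (out != null) {
                kept.add(out.url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Doc> crawled = Arrays.asList(
            new Doc("http://example.com/index.html", "text/html"),
            new Doc("http://example.com/places.kml", KML_TYPE));
        System.out.println(indexedUrls(Arrays.asList(KML_ONLY), crawled));
    }
}
```

In a real plugin the same decision would sit inside the indexing filter's own filter method, whose exact signature depends on the Nutch version in use.
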
Re: How to index files only with specific type

Checking the URL suffix and returning null if it's not the one I need did the trick.
Thanks, Andrzej.

2009/10/27 Andrzej Bialecki <ab@...>:
> It's possible to do this with a custom indexing filter - see other indexing
> filters to get a feeling of what's involved. Or you could do this with a
> scoring filter, too, although the scoring API looks more complicated.
>
> Either way, when you execute the Indexer, these filters are run in a chain,
> and if one of them returns null then that document is discarded, i.e. it's
> not added to the output index. So, it's easy to examine in your indexing
> filter the content type (or just a URL of the document) and either pass the
> document on or reject it by returning null.

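For anyone repeating the URL suffix check, a plain `endsWith` on the whole URL misses links that carry a query string or use an upper-case extension. Below is a minimal sketch of a more forgiving check; `UrlSuffixCheck` and `hasSuffix` are hypothetical names, not part of Nutch, and the thread does not show the code that was actually used. An indexing filter could return null whenever this check fails.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a URL's path
// ends with a given suffix, ignoring case, query string and fragment.
class UrlSuffixCheck {

    static boolean hasSuffix(String url, String suffix) {
        String path;
        try {
            // getPath() excludes the query string and the fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            // Malformed URL: treat it as not matching.
            return false;
        }
        if (path == null) {
            // Opaque URIs such as mailto: have no path.
            return false;
        }
        return path.toLowerCase(Locale.ROOT)
                   .endsWith(suffix.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasSuffix("http://example.com/places.KML?v=2", ".kml"));
        System.out.println(hasSuffix("http://example.com/index.html", ".kml"));
    }
}
```

Matching on the content type instead, as Andrzej suggests, avoids depending on URL conventions, at the cost of trusting what the server or parser reports.
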