[jira] Created: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed
-----------------------------------------------------------------------

                 Key: NUTCH-585
                 URL: https://issues.apache.org/jira/browse/NUTCH-585
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
         Environment: All operating systems
            Reporter: Andrea Spinelli
            Priority: Minor

We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and they generate spurious matches.

We have modified the plugin so that it ignores HTML code between certain HTML comments, like

<!-- START-IGNORE -->
... ignored part ...
<!-- STOP-IGNORE -->

We feel this might be useful to someone else, perhaps with the comment strings factored out as properties in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).

We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

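The patch itself is not shown in this thread, so the following is only a minimal sketch of the idea described in the issue, not the reporter's code. It works on the raw HTML string rather than inside the parse-html plugin, and the class name `IgnoreMarkers` and method `strip` are hypothetical. The marker strings are the ones given in the issue; in the proposed design they would presumably be read from the parser.html.ignore.start and parser.html.ignore.stop properties instead of being constants.

```java
// Illustrative sketch only: removes everything between two marker comments.
// IgnoreMarkers and strip are hypothetical names, not part of Nutch.
final class IgnoreMarkers {
    // Marker strings from the issue text; the issue proposes making them configurable.
    static final String START = "<!-- START-IGNORE -->";
    static final String STOP = "<!-- STOP-IGNORE -->";

    /** Returns html with every start..stop region removed, markers included. */
    static String strip(String html, String start, String stop) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int s = html.indexOf(start, pos);
            if (s < 0) {
                // No further ignored region: keep the remainder of the page.
                out.append(html, pos, html.length());
                return out.toString();
            }
            // Keep the text that precedes the start marker.
            out.append(html, pos, s);
            int e = html.indexOf(stop, s + start.length());
            if (e < 0) {
                // Unclosed region: this sketch drops the rest of the page.
                return out.toString();
            }
            pos = e + stop.length();
        }
    }

    public static void main(String[] args) {
        String page = "<p>Article text</p>"
                + START + "<a href=\"?bg=red\">red</a>" + STOP
                + "<p>More text</p>";
        System.out.println(strip(page, START, STOP)); // prints <p>Article text</p><p>More text</p>
    }
}
```

Dropping the remainder of the page when a stop marker is missing is a choice made for this sketch; keeping the remainder would be equally defensible.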
Re: [jira] Created: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

Hi Andrea,

It sounds like your addition is useful only for people crawling sites under their control. It's not useful for Internet crawling. Is there a way to make this useful to a wider audience without overly complicating the code?

I'm trying to think of specific scenarios where this could be useful in an Internet crawl:

1) Known ad-hosting providers? (Doubleclick, etc.)
2) Known domain-parking (godaddy) or domain-squatting patterns?

But these examples have problems:

(1) -- I can't think of any offhand that would put indexable text into the page (it would be an <embed>, <img>, or <script>)
(2) -- I'd want to filter domain-squatters earlier in the processing chain, so we ignore their links too

--matt

On Nov 29, 2007, at 6:13 AM, Andrea Spinelli (JIRA) wrote:
> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

--
Matt Kangas / kangas@...

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547642 ]

Otis Gospodnetic commented on NUTCH-585:
----------------------------------------

A more general solution is needed. This solution should not rely on a priori marked-up content as in your example, but should automatically recognize things like footers, sidebars, repeating navigation and other elements, etc. I am sure there are PhD theses out there on this topic...

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548217 ]

Andrea Spinelli commented on NUTCH-585:
---------------------------------------

I absolutely agree that a more general solution is needed; however, I think that some current Nutch users might benefit from a quick fix. If there is no opposition, I could submit a patch (fewer than 20 lines).

On the other hand, does anybody think that blocking selected portions of text could pose serious architectural or stability risks?

About the more general solution, do you think there is a viable path from here to there?

-- andrea

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548420 ]

Matt Kangas commented on NUTCH-585:
-----------------------------------

Simplest path forward that I can think of:

1) Add a new indexing plugin extension-point for filtering page content.
2) Put your "a priori marked-up content" exclusion logic into a plugin.
3) Someone else figures out a more general-purpose solution later, and swaps out your plugin at that time.

Ergo, you generalize the interface, and lazy-load the more general implementation. :-)

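The three steps above can be sketched in plain Java. This is only an illustration of the "generalize the interface, swap the implementation later" idea, under stated assumptions: `ContentFilter`, `MarkerContentFilter` and `ContentFilterChain` are hypothetical names invented here, not existing Nutch extension points or classes, and a real extension point would be declared and loaded through Nutch's plugin system rather than wired up by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point; all types in this sketch are hypothetical, not Nutch APIs.
final class ContentFilterDemo {
    public static void main(String[] args) {
        ContentFilterChain chain = new ContentFilterChain()
                .add(new MarkerContentFilter("<!-- START-IGNORE -->", "<!-- STOP-IGNORE -->"));
        String page = "Article <!-- START-IGNORE -->menu links<!-- STOP-IGNORE -->body";
        System.out.println(chain.apply("http://example.com/", page)); // prints Article body
    }
}

/** Step 1: the general interface for filtering page content before indexing. */
interface ContentFilter {
    String filter(String url, String content);
}

/** Step 2: the marker-based exclusion logic as one implementation. */
final class MarkerContentFilter implements ContentFilter {
    private final String start;
    private final String stop;

    MarkerContentFilter(String start, String stop) {
        this.start = start;
        this.stop = stop;
    }

    @Override
    public String filter(String url, String content) {
        StringBuilder out = new StringBuilder(content.length());
        int pos = 0;
        int s;
        while ((s = content.indexOf(start, pos)) >= 0) {
            out.append(content, pos, s);
            int e = content.indexOf(stop, s + start.length());
            if (e < 0) {
                return out.toString(); // unclosed region: drop the rest (sketch choice)
            }
            pos = e + stop.length();
        }
        out.append(content, pos, content.length());
        return out.toString();
    }
}

/** Step 3: callers depend only on the chain, so implementations can be swapped later. */
final class ContentFilterChain {
    private final List<ContentFilter> filters = new ArrayList<>();

    ContentFilterChain add(ContentFilter filter) {
        filters.add(filter);
        return this;
    }

    String apply(String url, String content) {
        String result = content;
        for (ContentFilter f : filters) {
            result = f.filter(url, result);
        }
        return result;
    }
}
```

Because the caller sees only `ContentFilter`, a later and more general boilerplate detector could replace `MarkerContentFilter` without touching the code that calls the chain.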
[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574613#action_12574613 ]

Siddharth Jha commented on NUTCH-585:
-------------------------------------

Hello All,

We are implementing a search engine based on Lucene/Nutch and we are also facing problems with caching. Would it be possible for you to help me out on this issue and provide me a code snippet?

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764651#action_12764651 ]

cwinay@... commented on NUTCH-585:
----------------------------------------

Hi,

Is it possible for you to share the code with me? I seem to have found a use for the facility you wish to add to Nutch.

I'm using a content management system called Infoglue to create my website. The pages I create for my site have a fixed template containing a header, a footer and a menu system. I would like Nutch to index the template content only for the home page, and just the relevant (non-template) content on the inner pages.

So please share your idea and/or code. Details of the implementation are appreciated. So far I have just been a naive Nutch user.

Thanks a lot.
Winz

Quoted from: http://www.nabble.com/-jira--Created%3A-%28NUTCH-585%29--PARSE-HTML-plugin--Block-certain-parts-of-HTML-code-from-being-indexed-tp14023775p14023775.html

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764709#action_12764709 ]

Andrea Spinelli commented on NUTCH-585:
---------------------------------------

Yes, I'd be glad to do that. There are some caveats, though:

1. I worked on a very old version of Nutch (0.7.2)
2. I have to dig through my sources to find our patch, because it was written a long time ago

We have a week of demos; I will write back at the end of the working week.

Hi
Andrea

--
Andrea Spinelli - team QUALITY
email: andrea.spinelli@...
phone: +39-035-636029
fax: +39-035-638129
surface-mail: Via Sigismondi 40, 24018 Villa d'Alme', BG
--
This message is confidential; pursuant to D.P.R. 44/314159/2718 of 01/04/2009 you may not publish or forward it, and you may never again use any of the Italian words it contains. If you receive it in error, spray it with pepper spray, delete it, format your hard disk and then send a postcard to the address above to let us know.

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771625#action_12771625 ]

David Stuart commented on NUTCH-585:
------------------------------------

Hi Andrea,

I hope your week of demos went well. I too would be interested in this code, as I would like to look at extending it to be slightly more generic, allowing for regular expression matches or an XPath-like model (the plan is still forming).

From the web crawler's point of view it would be a hard one to get right, but we have about 26 sites that are well known to us that we wish to crawl. They have common blocks that we wish to remove, which a configurable version of your code may achieve.

I look forward to seeing your patch.

Regards,
David Stuart

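The regular-expression variant mentioned in the last comment could look roughly like the sketch below. It is an assumption-laden illustration, not code from the thread: `RegexBlockRemover` is a hypothetical name, the patterns are invented examples, and in practice the pattern list would come from configuration. Regular expressions cannot reliably match nested HTML elements, so this only suits known, simple blocks on sites whose markup is under control.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: removes every match of a configurable list of regexes.
final class RegexBlockRemover {
    private final List<Pattern> patterns = new ArrayList<>();

    RegexBlockRemover(List<String> regexes) {
        for (String regex : regexes) {
            // DOTALL lets '.' span line breaks inside a block.
            patterns.add(Pattern.compile(regex, Pattern.DOTALL));
        }
    }

    /** Applies each pattern in order and deletes whatever it matches. */
    String remove(String html) {
        String result = html;
        for (Pattern p : patterns) {
            result = p.matcher(result).replaceAll("");
        }
        return result;
    }

    public static void main(String[] args) {
        // Example patterns only; non-greedy so each block ends at its first closing tag.
        RegexBlockRemover remover = new RegexBlockRemover(Arrays.asList(
                "<!-- START-IGNORE -->.*?<!-- STOP-IGNORE -->",
                "<div id=\"footer\">.*?</div>"));
        String page = "<p>Content</p><div id=\"footer\">Site footer</div>";
        System.out.println(remover.remove(page)); // prints <p>Content</p>
    }
}
```

An XPath-like model would instead operate on the parsed DOM, which avoids the nesting problem at the cost of more code.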