[jira] Created: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

Generation of multiple segments in multiple runs returns only 1 segment
-----------------------------------------------------------------------

                 Key: NUTCH-707
                 URL: https://issues.apache.org/jira/browse/NUTCH-707
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0
         Environment: Ubuntu Hardy (8.04), Java 1.5.0 64b.
            Reporter: Michael Chan
             Fix For: 0.9.0

To generate multiple segments, generator.update.crawldb is set to true and -topN is set to the desired segment size. However, only one segment of size N is generated.

For example, I tried it with a db containing 10,000+ links (according to the dump). With generator.update.crawldb set to true and -topN set to 5, only 1 segment of size 5 is produced.

It seems to me the problem is due to incorrect recording of the generation time. Selector.map assigns the generation time to every URL, even though reduce only collects N of them. That is fine if the generator is run once and the db isn't updated. But when the generator is run again within genDelay, all the remaining URLs are excluded. So I suggest that the generation time be assigned in reduce rather than in map.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
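The diagnosis in the report can be modeled with a small, self-contained Java sketch. This is not Nutch code: the class `GeneratorSketch`, the `generate` method, the `markInMap` flag and the in-memory map standing in for the crawldb are all hypothetical, and the 7-day `GEN_DELAY_MS` value is an assumption chosen only so that the second run falls inside the delay window. It simply contrasts stamping the generation time on every candidate (as the report says Selector.map does) with stamping it only on the N URLs that are actually kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the reported behaviour; not the Nutch Generator.
class GeneratorSketch {
    // Assumed delay window (7 days in ms); any value larger than the gap
    // between the two runs shows the same effect.
    static final long GEN_DELAY_MS = 7L * 24 * 60 * 60 * 1000;

    // Stand-in for the crawldb: URL -> last generation time (0 = never generated).
    static Map<String, Long> newDb(int size) {
        Map<String, Long> db = new LinkedHashMap<>();
        for (int i = 0; i < size; i++) {
            db.put("http://example.com/" + i, 0L);
        }
        return db;
    }

    // One generator run. markInMap=true models the reported behaviour,
    // markInMap=false models the reporter's suggested change.
    static List<String> generate(Map<String, Long> db, int topN, long now, boolean markInMap) {
        List<String> candidates = new ArrayList<>();
        // "map" phase: skip URLs generated less than GEN_DELAY_MS ago.
        for (Map.Entry<String, Long> e : db.entrySet()) {
            long genTime = e.getValue();
            if (genTime != 0 && now - genTime < GEN_DELAY_MS) {
                continue;
            }
            if (markInMap) {
                e.setValue(now); // every candidate is stamped, selected or not
            }
            candidates.add(e.getKey());
        }
        // "reduce" phase: keep only the first topN candidates.
        List<String> segment =
            new ArrayList<>(candidates.subList(0, Math.min(topN, candidates.size())));
        if (!markInMap) {
            for (String url : segment) {
                db.put(url, now); // only the selected URLs are stamped
            }
        }
        return segment;
    }

    public static void main(String[] args) {
        Map<String, Long> buggy = newDb(20);
        int b1 = generate(buggy, 5, 1000L, true).size();
        int b2 = generate(buggy, 5, 2000L, true).size();
        System.out.println("mark in map:    " + b1 + " then " + b2); // 5 then 0

        Map<String, Long> fixed = newDb(20);
        int f1 = generate(fixed, 5, 1000L, false).size();
        int f2 = generate(fixed, 5, 2000L, false).size();
        System.out.println("mark in reduce: " + f1 + " then " + f2); // 5 then 5
    }
}
```

In this model the second run finds nothing to select when every candidate was stamped in the first run, which matches the symptom of a single segment of size N. It illustrates the reporter's hypothesis only and makes no claim about where the defect sits in the real code base.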
[jira] Updated: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Chan updated NUTCH-707:
-------------------------------
    Attachment: GeneratorDiff
[jira] Updated: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated NUTCH-707:
-----------------------------------
    Fix Version/s:     (was: 0.9.0)
                   1.1
[jira] Closed: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

     [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-707.
-----------------------------------
    Resolution: Fixed
      Assignee: Andrzej Bialecki
[jira] Commented: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

    [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763994#action_12763994 ]

Andrzej Bialecki commented on NUTCH-707:
-----------------------------------------

Fixed - the bug was actually present in CrawlDbUpdater. Thanks!
[jira] Commented: (NUTCH-707) Generation of multiple segments in multiple runs returns only 1 segment

    [ https://issues.apache.org/jira/browse/NUTCH-707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764293#action_12764293 ]

Hudson commented on NUTCH-707:
------------------------------

Integrated in Nutch-trunk #959 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/959/])
    Generation of multiple segments in multiple runs returns only 1 segment.