[jira] Created: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

Key: NUTCH-702
URL: https://issues.apache.org/jira/browse/NUTCH-702
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: julien nioche

CrawlDatum systematically instantiates its metadata, which is quite wasteful, especially in CrawlDbReducer, which generates a new CrawlDatum for each incoming link before storing it in a List.
Initial testing of lazy instantiation shows an improvement in both speed and memory consumption. I will generate a patch for it ASAP.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
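For readers who have not seen the pattern, here is a minimal sketch of lazy instantiation. It assumes a plain `HashMap` instead of the Hadoop map type that CrawlDatum uses, and `LazyDatum`, `hasMetaData` and `isMetaDataAllocated` are hypothetical names for illustration only, not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum; NOT the Nutch class.
class LazyDatum {
    // Stays null until a caller actually needs the metadata.
    private Map<String, String> metaData;

    // Lazy accessor: the map is allocated on first use only.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Read-only check that does not force an allocation.
    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }

    // Exposed only so the demo can observe whether the map exists yet.
    boolean isMetaDataAllocated() {
        return metaData != null;
    }
}

class LazyDatumDemo {
    public static void main(String[] args) {
        LazyDatum datum = new LazyDatum();
        System.out.println("allocated after construction: " + datum.isMetaDataAllocated());
        datum.getMetaData().put("key", "value");
        System.out.println("allocated after first write: " + datum.isMetaDataAllocated());
    }
}
```

An object that never touches its metadata never pays for the map, which is where the saving would come from when a reducer creates one object per incoming link.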

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-702:
--------------------------------
Attachment: lazyMetadataInstanciation.patch

Patch for lazy instantiation of metadata in CrawlDatum.

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-702:
--------------------------------
Attachment: (was: lazyMetadataInstanciation.patch)

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-702:
--------------------------------
Attachment: NUTCH-702.patch

Patch for lazy instantiation of metadata in CrawlDatum (replaces the previous one).

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683772#action_12683772 ]

Edwin Chu commented on NUTCH-702:
---------------------------------
I constantly hit OutOfMemoryError in CrawlDbReducer on a system with 7 GB of main memory. I agree that creating a CrawlDatum for each incoming link is the cause.

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713383#action_12713383 ]

Dmitry Lihachev commented on NUTCH-702:
---------------------------------------
I hit an NPE when using this patch:

{code}
java.lang.NullPointerException
    at org.apache.nutch.crawl.CrawlDatum.putAllMetaData(CrawlDatum.java:216)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:134)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
{code}

Line 216 contains this code:

{code}
metaData.put(e.getKey(), e.getValue());
{code}

but metaData has not been initialized at that point. I think the problem can be solved by changing that line to:

{code}
getMetaData().put(e.getKey(), e.getValue());
{code}
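The failure mode described above can be reproduced in isolation. The sketch below is an illustration under assumptions, not the patched CrawlDatum: `SketchDatum` and `putAllMetaDataUnsafe` are hypothetical names, and a `HashMap` stands in for the real metadata type. It contrasts writing through the raw field with writing through the lazy accessor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a lazily initialized CrawlDatum; NOT the Nutch class.
class SketchDatum {
    private Map<String, String> metaData; // null until first use

    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Buggy variant: dereferences the field directly, so it throws a
    // NullPointerException when nothing has allocated the map yet.
    void putAllMetaDataUnsafe(Map<String, String> other) {
        for (Map.Entry<String, String> e : other.entrySet()) {
            metaData.put(e.getKey(), e.getValue());
        }
    }

    // Fixed variant: goes through the accessor, which allocates on demand.
    void putAllMetaData(Map<String, String> other) {
        for (Map.Entry<String, String> e : other.entrySet()) {
            getMetaData().put(e.getKey(), e.getValue());
        }
    }
}

class SketchDatumDemo {
    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("a", "1");

        try {
            new SketchDatum().putAllMetaDataUnsafe(source);
        } catch (NullPointerException e) {
            System.out.println("direct field access failed with an NPE");
        }

        SketchDatum fixed = new SketchDatum();
        fixed.putAllMetaData(source);
        System.out.println("copied through the accessor: " + fixed.getMetaData());
    }
}
```

Once a field becomes lazy, every internal write has to go through the accessor. Note that the unsafe variant only fails when the source map is non-empty, which may be why the problem did not show up immediately.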

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-702:
--------------------------------
Attachment: NUTCH-702.patch.v2

Fixed the bug reported by Dmitry Lihachev.

[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-702:
--------------------------------
Attachment: NUTCH-702v3.patch

Reattaching Julien's last patch with some minor code cleanups. Could anyone confirm that the NPE is gone (and that my cleanup did not break anything)? :)

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748840#action_12748840 ]

Julien Nioche commented on NUTCH-702:
-------------------------------------
There have been quite a few related questions on the mailing lists since I released the patch. As pointed out by Andrzej, the best way to avoid out-of-memory errors is to limit the number of inlinks taken into account per URL (see nutch-default.xml). However, because this patch reduces the memory footprint, it allows a higher value to be set for that parameter.

The main advantage of this patch is that it speeds up all phases of a crawl that involve the crawlDB. As an illustration, I compared the default version with the patched one. The average results were:

Phase                  Original   Patched
Injection (1M URLs)    98 secs    65 secs
Generation             37 secs    37 secs
Stats                  13 secs    19 secs
Update                 40 secs    19 secs

I find it a bit surprising that the patched version is not faster on the generation (any idea?). On the other phases it seems to be about 50% faster, except for the update, where it is twice as fast.

J.
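To make the point about capping inlinks concrete, here is a hedged sketch of the general technique: a bounded min-heap that retains only the highest-scoring entries, so memory use is fixed by the cap rather than by the number of incoming links. It has not been checked against the Nutch source; `BoundedInlinks`, `offer` and `lowestKept` are hypothetical names, and the cap would in practice come from whichever nutch-default.xml setting limits inlinks.

```java
import java.util.PriorityQueue;

// Hypothetical illustration of capping per-URL inlinks; NOT Nutch code.
class BoundedInlinks {
    private final int maxInlinks;
    // Min-heap: the lowest retained score is always at the head.
    private final PriorityQueue<Float> scores = new PriorityQueue<>();

    BoundedInlinks(int maxInlinks) {
        if (maxInlinks <= 0) {
            throw new IllegalArgumentException("maxInlinks must be positive");
        }
        this.maxInlinks = maxInlinks;
    }

    // Keeps the score only if there is room or it beats the current minimum.
    void offer(float score) {
        if (scores.size() < maxInlinks) {
            scores.add(score);
        } else if (score > scores.peek()) {
            scores.poll();
            scores.add(score);
        }
    }

    int size() {
        return scores.size();
    }

    // Lowest score still retained; only valid once something was offered.
    float lowestKept() {
        return scores.peek();
    }
}

class BoundedInlinksDemo {
    public static void main(String[] args) {
        BoundedInlinks inlinks = new BoundedInlinks(3);
        for (int i = 1; i <= 1000; i++) {
            inlinks.offer((float) i);
        }
        System.out.println("retained " + inlinks.size() + " of 1000 offered scores");
    }
}
```

With a structure like this, a URL with a million inlinks costs the reducer no more memory than one with `maxInlinks` of them, at the price of ignoring the lower-scoring links.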

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748841#action_12748841 ]

Julien Nioche commented on NUTCH-702:
-------------------------------------
Of course, the stats figures were meant to be:

stats:
original: 19 secs
patched: 13 secs
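For anyone wanting to check the "50% faster" and "twice as fast" readings, the arithmetic on the timings reported in this thread (with the corrected stats figures) can be sketched as follows. The class and method names are made up for the example, and "faster" is interpreted here as the ratio of original time to patched time, which is an assumption about what was meant.

```java
// Arithmetic on the timings quoted in this thread; names are hypothetical.
class CrawlDbTimings {
    // Speedup factor: how many times faster the patched run is.
    static double speedup(double originalSecs, double patchedSecs) {
        return originalSecs / patchedSecs;
    }

    // Share of the original wall-clock time that the patch saves, in percent.
    static double timeSavedPercent(double originalSecs, double patchedSecs) {
        return 100.0 * (originalSecs - patchedSecs) / originalSecs;
    }

    public static void main(String[] args) {
        String[] phases = {"inject", "generate", "stats", "update"};
        double[] original = {98, 37, 19, 40};
        double[] patched = {65, 37, 13, 19};
        for (int i = 0; i < phases.length; i++) {
            System.out.printf("%-9s speedup %.2fx, time saved %.1f%%%n",
                    phases[i],
                    speedup(original[i], patched[i]),
                    timeSavedPercent(original[i], patched[i]));
        }
    }
}
```

On these numbers, injection and stats come out at roughly 1.5x (about a third less wall-clock time), generation is unchanged, and the update is about 2.1x.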

[jira] Closed: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-702.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.1

Committed. Thanks for the patch, Julien.

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752913#action_12752913 ]

Hudson commented on NUTCH-702:
------------------------------
Integrated in Nutch-trunk #929 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/929/])
- Lazy Instanciation of Metadata in CrawlDatum. Contributed by Julien Nioche.

[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum
[ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757637#action_12757637 ]

zhangxihua commented on NUTCH-702:
----------------------------------
I used Nutch 1.1 as integrated in Nutch-trunk #929, but I still run out of memory:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Unknown Source)
    at java.util.ArrayList.ensureCapacity(Unknown Source)
    at java.util.ArrayList.add(Unknown Source)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:97)
    at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)