[jira] Created: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum


[jira] Created: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum

by JIRA jira@apache.org

Lazy Instanciation of Metadata in CrawlDatum
--------------------------------------------

                 Key: NUTCH-702
                 URL: https://issues.apache.org/jira/browse/NUTCH-702
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.0.0
            Reporter: julien nioche


CrawlDatum systematically instantiates its metadata, which is quite wasteful, especially in CrawlDbReducer, which generates a new CrawlDatum for each incoming link before storing it in a List.

Initial testing of lazy instantiation shows an improvement in both speed and memory consumption. I will generate a patch for it ASAP.
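
As a rough illustration of the idea described above (not the actual patch), lazy instantiation could look like the following minimal sketch. The class `LazyDatum` and the method `hasMetaData` are hypothetical names, and a plain `HashMap` stands in for whatever map type CrawlDatum really uses.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum, for illustration only.
class LazyDatum {
    // Left null until something is stored, so metadata-free records stay small.
    private Map<String, String> metaData;

    // Creates the map on first access instead of in the constructor.
    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Lets callers test for metadata without forcing an allocation.
    boolean hasMetaData() {
        return metaData != null && !metaData.isEmpty();
    }
}

class LazyDatumDemo {
    public static void main(String[] args) {
        LazyDatum datum = new LazyDatum();
        System.out.println(datum.hasMetaData()); // no map has been allocated yet
        datum.getMetaData().put("key", "value"); // first access allocates the map
        System.out.println(datum.hasMetaData());
    }
}
```

With this shape, a record created for an incoming link that never receives metadata carries only a null reference rather than an empty map.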

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



     [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-702:
--------------------------------

    Attachment: lazyMetadataInstanciation.patch

Patch for lazy instantiation of metadata in CrawlDatum.



[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



     [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-702:
--------------------------------

    Attachment:     (was: lazyMetadataInstanciation.patch)



[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



     [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated NUTCH-702:
--------------------------------

    Attachment: NUTCH-702.patch

Patch for lazy instantiation of metadata in CrawlDatum (replaces the previous one).



[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



    [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683772#action_12683772 ]

Edwin Chu commented on NUTCH-702:
---------------------------------

I have constantly encountered OutOfMemoryError in CrawlDbReducer on a system with 7 GB of main memory. I agree that creating a CrawlDatum for each incoming link is the cause.



[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



    [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713383#action_12713383 ]

Dmitry Lihachev commented on NUTCH-702:
---------------------------------------

I caught an NPE when using this patch:
{code}
java.lang.NullPointerException
        at org.apache.nutch.crawl.CrawlDatum.putAllMetaData(CrawlDatum.java:216)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:134)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
{code}
At line 216 we can see this code:
{code}
     metaData.put(e.getKey(), e.getValue());
{code}
but metaData is not yet initialized.

I think the problem can be solved by changing this code to
{code}
     getMetaData().put(e.getKey(), e.getValue());
{code}
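
To make the failure mode concrete, here is a minimal, self-contained sketch of the bug and the proposed fix. It is not the real CrawlDatum: the class `SketchDatum`, the method `putAllMetaDataUnsafe`, and the use of `HashMap` are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a CrawlDatum with lazily created metadata.
class SketchDatum {
    private Map<String, String> metaData; // null until first use

    Map<String, String> getMetaData() {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        return metaData;
    }

    // Buggy variant: writes through the field, which may still be null.
    void putAllMetaDataUnsafe(SketchDatum other) {
        if (other.metaData == null) {
            return; // nothing to copy
        }
        for (Map.Entry<String, String> e : other.metaData.entrySet()) {
            metaData.put(e.getKey(), e.getValue()); // NPE if this.metaData is null
        }
    }

    // Fixed variant: goes through the lazy getter, as suggested above.
    void putAllMetaData(SketchDatum other) {
        if (other.metaData == null) {
            return; // nothing to copy, so no map is allocated here either
        }
        for (Map.Entry<String, String> e : other.metaData.entrySet()) {
            getMetaData().put(e.getKey(), e.getValue());
        }
    }
}
```

The general point is that once a field is lazily initialized, every write inside the class has to go through the getter (or an explicit null check) rather than touching the field directly.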




[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



     [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-702:
--------------------------------

    Attachment: NUTCH-702.patch.v2

Fixed bug reported by Dmitry Lihachev



[jira] Updated: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



     [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-702:
--------------------------------

    Attachment: NUTCH-702v3.patch

Reattaching Julien's last patch. I have made some minor code cleanups.

Could anyone confirm that the NPE is gone (and that my cleanup did not break anything)? :)



[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



    [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748840#action_12748840 ]

Julien Nioche commented on NUTCH-702:
-------------------------------------

There have been quite a few related questions on the mailing lists since the patch was released. As pointed out by Andrzej, the best way to avoid memory exceptions is to limit the number of inlinks per URL (see nutch-default.xml); however, this patch allows a higher value to be set for this parameter, as it reduces the memory footprint.
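
For readers wondering what such a limit amounts to, the sketch below shows one possible way of capping how many inlinks are retained while reducing. It is an assumption-laden illustration, not Nutch's code: the class `InlinkCapSketch`, the method `keepBest`, the use of plain scores instead of CrawlDatum objects, and the choice to keep the highest-scoring entries are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper; scores stand in for the per-inlink records.
class InlinkCapSketch {
    // Keeps at most maxInlinks scores, discarding the lowest once the cap is exceeded.
    static List<Float> keepBest(Iterable<Float> inlinkScores, int maxInlinks) {
        // Min-heap: the head is always the weakest score currently kept.
        PriorityQueue<Float> best = new PriorityQueue<>();
        for (Float score : inlinkScores) {
            best.add(score);
            if (best.size() > maxInlinks) {
                best.poll(); // drop the weakest so memory use stays bounded
            }
        }
        List<Float> result = new ArrayList<>(best);
        result.sort(Comparator.reverseOrder()); // highest score first
        return result;
    }
}
```

However the cap is implemented, memory use per URL then depends on the limit rather than on the number of incoming links.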

The main advantage of this patch is that it speeds up all phases of a crawl involving the crawlDB. As an illustration, I compared the default version with the patched one; the average results were:

injection (1M URLs):
original: 98 secs
patched: 65 secs

generation:
original: 37 secs
patched: 37 secs

stats:
original: 13 secs
patched: 19 secs

update:
original: 40 secs
patched: 19 secs


I find it a bit surprising that the patched version is not faster on generation (any ideas?). On the other phases it seems to be about 50% faster, except for the update, where it is about twice as fast.
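
The percentages quoted above can be checked by dividing the original time by the patched time for each phase. The short sketch below does that arithmetic; the class and method names are hypothetical, and the stats figures use the 19 and 13 seconds given in the follow-up correction.

```java
// Hypothetical helper that turns the reported timings into speedup ratios.
class SpeedupSketch {
    // Ratio of original to patched wall-clock time; 1.0 means no change.
    static double speedup(double originalSecs, double patchedSecs) {
        return originalSecs / patchedSecs;
    }

    public static void main(String[] args) {
        System.out.printf("injection:  %.2fx%n", speedup(98, 65));
        System.out.printf("generation: %.2fx%n", speedup(37, 37));
        System.out.printf("stats:      %.2fx%n", speedup(19, 13)); // corrected figures
        System.out.printf("update:     %.2fx%n", speedup(40, 19));
    }
}
```

Injection and stats come out at roughly 1.5 times the original speed, generation is unchanged, and update is a little over twice as fast.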

J.




[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



    [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748841#action_12748841 ]

Julien Nioche commented on NUTCH-702:
-------------------------------------

Of course, it was meant to be:

stats:
original : 19 secs
patched : 13 secs



[jira] Closed: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



     [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-702.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Committed.

Thanks for the patch, Julien.



[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



    [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752913#action_12752913 ]

Hudson commented on NUTCH-702:
------------------------------

Integrated in Nutch-trunk #929 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/929/])
     - Lazy Instanciation of Metadata in CrawlDatum. Contributed by Julien Nioche.




[jira] Commented: (NUTCH-702) Lazy Instanciation of Metadata in CrawlDatum



    [ https://issues.apache.org/jira/browse/NUTCH-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757637#action_12757637 ]

zhangxihua commented on NUTCH-702:
----------------------------------

I used Nutch 1.1 as integrated in Nutch-trunk #929, but I still run out of memory:
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Unknown Source)
        at java.util.ArrayList.ensureCapacity(Unknown Source)
        at java.util.ArrayList.add(Unknown Source)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:97)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
