[Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

by Apache Wiki

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "ApacheConUs2009MeetUp" page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6

--------------------------------------------------

- We were planning to have a "Web Crawler Developer" !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
+ We had a "Web Crawler Developer" !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
 
- Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
+ It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm.
 
- So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :)
+ == Attendees ==
+
+  * Andrzej Bialecki - Apache Nutch
+  * Thorsten xxx - Apache Droids
+  * Michael Stack - Formerly with Heritrix, now HBase
+  * Ken Krugler - Bixo
+
+ == Topics ==
+
+ === Roadmaps ===
+
+ Nutch - become more component based.
+ Droids - get more people involved.
+
+ === Sharable Components ===
+
+  * robots.txt parsing
+  * URL normalization
+  * URL filtering
+  * Page cleansing
+   * General purpose
+   * Specialized
+  * Sub-page parsing (portlets)
+  * AJAX-ish page interactions
+  * Document parsing (via Tika)
+  * HttpClient (configuration)
+  * Text similarity
+  * Mime/charset/language detection
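
As a rough illustration of the first item in the list above, here is a minimal sketch of a robots.txt check. It uses Python's standard `urllib.robotparser` rather than any Nutch, Droids, or Bixo code; the helper name `build_rules`, the sample rules, and the `example-crawler` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Sample robots.txt content, invented for this sketch.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

rules = build_rules(SAMPLE)

# A crawler would ask this question before fetching each URL.
allowed_public = rules.can_fetch("example-crawler", "http://example.com/index.html")
allowed_private = rules.can_fetch("example-crawler", "http://example.com/private/page.html")
print(allowed_public, allowed_private)
```

Keeping the parsing separate from the fetching is one way such a component could stay independent of any particular crawler's HTTP layer.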
+
+ === Tika ===
+
+  * Needs help to become really usable
+  * Would benefit from large test corpus
+  * Could do comparison with Nutch parser
+  * Needs option for direct DOM querying (screen scraping tasks)
+  * Handles mime & charset detection now (some issues)
+  * Could be extended to include language detection (wrap other impl)
+
+ === URL Normalization ===
+
+  * Includes both domain (www.x.com == x.com), path, and query portions of URL
+  * Often site-specific rules
+   * Option to derive rules using URLs to similar documents.
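
The three parts mentioned above (domain, path, query) could be handled along these lines. This is only a sketch under stated assumptions: the function name `normalize_url` and the `strip_www` flag are hypothetical, and treating `www.x.com` as equal to `x.com` is exactly the kind of rule that may need to be switched per site.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default ports that can be dropped from the normalized form.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str, strip_www: bool = True) -> str:
    """Return a canonical form of url (hypothetical helper, not a Nutch API)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # Domain: lowercase, optionally treat www.x.com as x.com.
    host = (parts.hostname or "").lower()
    if strip_www and host.startswith("www."):
        host = host[4:]
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Path: resolve "." and ".." segments, keep a trailing slash if present.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"

    # Query: sort parameters so that ordering differences disappear.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # The fragment is dropped, since it is not sent to the server.
    return urlunsplit((scheme, netloc, path, query, ""))


print(normalize_url("HTTP://WWW.X.com:80/a/./b/../c?b=2&a=1#frag"))
```

Site-specific rules (for example, removing session-id parameters) would sit on top of a generic step like this one.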
+
+ === AJAX-ish Page Interaction ===
+
+  * Not applicable for broad/general crawling
+  * Can be very important for specific web sites
+  * Use Selenium or headless Mozilla
+
+ === Component API Issues ===
+
+  * Want to avoid using an API that's tied too closely to any implementation.
+  * One option is to have a simple (e.g. URL param) API that takes meta-data.
+   * Similar to how Tika passes in meta-data.
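
One way to read the "URL plus meta-data" idea above is sketched below. None of this is an agreed API: the `UrlComponent` signature, the `run_chain` helper, and the `blocked-suffixes` meta-data key are all assumptions made for illustration. Each component sees only a URL string and a plain string-to-string map, so nothing ties it to a particular crawler's classes.

```python
from typing import Callable, Dict, Iterable, Optional

# Meta-data is a plain string map, so no crawler-specific types leak in.
Metadata = Dict[str, str]

# A component returns the (possibly rewritten) URL, or None to reject it.
UrlComponent = Callable[[str, Metadata], Optional[str]]


def strip_fragment(url: str, metadata: Metadata) -> Optional[str]:
    """Normalizer-style component: drop everything after '#'."""
    return url.split("#", 1)[0]


def reject_by_suffix(url: str, metadata: Metadata) -> Optional[str]:
    """Filter-style component, driven by a hypothetical 'blocked-suffixes' key."""
    suffixes = [s for s in metadata.get("blocked-suffixes", "").split(",") if s]
    if any(url.lower().endswith(s) for s in suffixes):
        return None
    return url


def run_chain(url: str, metadata: Metadata,
              components: Iterable[UrlComponent]) -> Optional[str]:
    """Apply components in order, stopping as soon as one rejects the URL."""
    current: Optional[str] = url
    for component in components:
        current = component(current, metadata)
        if current is None:
            return None
    return current


chain = [strip_fragment, reject_by_suffix]
meta = {"blocked-suffixes": ".jpg,.png"}
print(run_chain("http://example.com/a.html#top", meta, chain))
print(run_chain("http://example.com/logo.PNG", meta, chain))
```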
+
+ === Hosting Options ===
+
+  * As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
+  * As part of Droids - but Droids is both a framework (queue-based) and set of components.
+  * New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
+  * Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
+
+ == Next Steps ==
+
+  * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
+  * Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
+  * Make decision about build system (and then move on to code formatting debate :))
+   * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
+  * Start contributing code
+   * Ken will put in robots.txt parser.
+
+ == Original Discussion Topic List ==
 
  Below are some potential topics for discussion - feel free to add/comment.
 

Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

by Bartosz Gadzimski

Hello,

Will there be any materials after the meeting? Wiki pages, slides,
video, podcasts? Would be great!

Thanks,
Bartosz



Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

by Ken Krugler

Hi Bartosz,

I've updated the wiki, and others who attended might add/edit as  
necessary.

No video/podcast - it wasn't so high tech as that, just three of us in  
a spare room with Thorsten on Skype.

We're still waiting for some input from the Heritrix team, I think,  
before moving forward.

-- Ken


On Nov 5, 2009, at 12:01am, Bartosz Gadzimski wrote:

> Hello,
>
> Will there be any materials after the meeting? Wiki pages,
> slides, video, podcasts? Would be great!
>
> Thanks,
> Bartosz

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

by Andrzej Bialecki

Bartosz Gadzimski wrote:
> Hello,
>
> Will there be any materials after the meeting? Wiki pages, slides,
> video, podcasts? Would be great!

I will be adding some stuff later - no slides, as Ken said it was very
informal. Though the slides that I present during my Nutch talk have a
section about the roadmap, and I will be adding these to the wiki, too.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com