[Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "ApacheConUs2009MeetUp" page has been changed by KenKrugler.
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6

--------------------------------------------------

- We were planning to have a "Web Crawler Developer" !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.
+ We had a "Web Crawler Developer" !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland.

- Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp.
+ It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm.

- So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :)
+ == Attendees ==
+
+  * Andrzej Bialecki - Apache Nutch
+  * Thorsten xxx - Apache Droids
+  * Michael Stack - Formerly with Heritrix, now HBase
+  * Ken Krugler - Bixo
+
+ == Topics ==
+
+ === Roadmaps ===
+
+ Nutch - become more component based.
+ Droids - get more people involved.
+
+ === Sharable Components ===
+
+  * robots.txt parsing
+  * URL normalization
+  * URL filtering
+  * Page cleansing
+   * General purpose
+   * Specialized
+  * Sub-page parsing (portlets)
+  * AJAX-ish page interactions
+  * Document parsing (via Tika)
+  * HttpClient (configuration)
+  * Text similarity
+  * Mime/charset/language detection
+
+ === Tika ===
+
+  * Needs help to become really usable
+  * Would benefit from large test corpus
+  * Could do comparison with Nutch parser
+  * Needs option for direct DOM querying (screen scraping tasks)
+  * Handles mime & charset detection now (some issues)
+  * Could be extended to include language detection (wrap other impl)
+
+ === URL Normalization ===
+
+  * Includes both domain (www.x.com == x.com), path, and query portions of URL
+  * Often site-specific rules
+  * Option to derive rules using URLs to similar documents.
+
+ === AJAX-ish Page Interaction ===
+
+  * Not applicable for broad/general crawling
+  * Can be very important for specific web sites
+  * Use Selenium or headless Mozilla
+
+ === Component API Issues ===
+
+  * Want to avoid using an API that's tied too closely to any implementation.
+  * One option is to have simple (e.g. URL param) API that takes meta-data.
+  * Similar to Tika passing in of meta-data.
+
+ === Hosting Options ===
+
+  * As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch.
+  * As part of Droids - but Droids is both a framework (queue-based) and set of components.
+  * New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids.
+  * Google code - seems like a good short-term solution, to judge level of interest and help shake out issues.
+
+ == Next Steps ==
+
+  * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page.
+  * Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up.
+  * Make decision about build system (and then move on to code formatting debate :))
+   * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good.
+  * Start contributing code
+   * Ken will put in robots.txt parser.
+
+ == Original Discussion Topic List ==

  Below are some potential topics for discussion - feel free to add/comment.
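
The URL normalization item in the notes (domain, path, and query portions, with rules that are often site-specific) could be sketched roughly as below. This is only an illustration using the Python standard library, not code from Nutch, Droids, Bixo, or the meeting; `normalize_url` and its `strip_www` flag are hypothetical names, and treating `www.x.com` as equal to `x.com` is exactly the kind of site-specific rule that would need to be configurable.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url, strip_www=True):
    """Return a canonical form of url (hypothetical helper, sketch only).

    Covers the three portions named in the notes: domain, path, query.
    Does not handle IPv6 literals, and silently drops any user:password part.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # Domain: lowercase, optionally treat www.x.com as x.com (site-specific).
    host = (parts.hostname or "").lower()
    if strip_www and host.startswith("www."):
        host = host[4:]
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Path: resolve "." and ".." segments, keep a trailing slash if present.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"

    # Query: sort parameters so that ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


print(normalize_url("HTTP://WWW.X.com:80/a/./b/../c?b=2&a=1#frag"))
# prints http://x.com/a/c?a=1&b=2
```

Sorting query parameters and stripping `www.` are not safe for every site, which is why the notes call out site-specific rules; a shared component would presumably let each rule be switched on or off per host.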

Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

Hello,

Will there be any materials after the meeting? Wiki pages, slides, video, podcasts? That would be great!

Thanks,
Bartosz

Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

Hi Bartosz,

I've updated the wiki, and others who attended might add/edit as necessary. No video/podcast - it wasn't so high tech as that, just three of us in a spare room with Thorsten on Skype. We're still waiting for some input from the Heritrix team, I think, before moving forward.

-- Ken

On Nov 5, 2009, at 12:01am, Bartosz Gadzimski wrote:

> Hello,
>
> Will there be any materials after the meeting? Wiki pages,
> slides, video, podcasts? That would be great!
>
> Thanks,
> Bartosz

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: [Nutch Wiki] Update of "ApacheConUs2009MeetUp" by KenKrugler

Bartosz Gadzimski wrote:
> Hello,
>
> Are there will be any materials after the meeting? Wiki pages, slides,
> video, podcasts? Would be grate!

I will be adding some stuff later - no slides, as Ken said it was very informal. Though the slides that I present during my Nutch talk have a section about the roadmap, and I will be adding these to the wiki, too.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com