|
View:
New views
9 Messages
—
Rating Filter:
Alert me
|
|
|
TU Berlin Winter of Code ProjectHello everybody,
we are a group of 6 master students of the Technical University of Berlin who are currently working on a winter term project using Mahout. Our - so called "Winter of Code" - project is mentored by Isabel Drost and will run until February 2010. The goal of our project is to develop a cloud-based blog search engine - think: "google news for beginners ;)". The engine should be highly scalabe and use Hadoop/Mahout to performing topical clustering and topic discovery for crawled blog entries. Based on suggestions by Isabel, we currently think of the following layered architecture: I. Layer: Web-crawling A web-crawler (e.g. Herititrix) is provided with a set of known blog URLs to perform web crawls. Heritrix is configured with a simple text filter to only crawl urls containing the word "blog" from a prespecified TLD (so we "know" which language the blog entries use). We plan on outputting the crawl data directly to HDFS (e.g. via hbase-writer). II. Layer: Preprocessing The data is probably not structured enough to be directly processable by a machine, so it has to be preprocessed. This step could e.g. consist of extracting the blog fulltext from the crawl, stemming it, finding named entitites and tagging them. We currently think of using UIMA for this layer. III. Layer: Feature extraction In order to use clustering algorithms. we need to perform a feature extraction. This could e.g. consist of generating feature vectors, a similiarity matrix, a link graph, etc. The goal of this layer is to have a representation of the web crawl that can be processed by Mahout. The feature extraction will likely be implemented via a custom-written Hadoop job. IV. Layer: Clustering This step consists of using a given Mahout clustering algorithm (or a newly implemented algorithm) to cluster the blogs based on the extraced features. For now we are probably going to use a very simple k-means clustering of word frequency. We plan to switch to a more sophisticated approach once the basic infrastructure is sound :) V. Layer: Topic Discovery Once the blog entries are clustered, each cluster needs to be assigned a topic. This topic should be automatically determined from the blog entries inside the cluster. Again, for now we will probably use a very simple approach: e.g. use the most frequent words inside the cluster (or within the center of the cluster) as topic tags. VI. Layer: Search Engine In order to search for blogs the tagged cluster-centers and topics provided by Mahout need to be recombined with the information form the blog crawl. This recombined data should then be fed into a search engine, so users can search for a specific entry. We will probably use Solr for this step, tagging each blog entry with it's respective cluster topic tag(s) and creating a search index on those tags. VII. Layer: User Front-End This will probably be a simple web-page that sends request with the Search Engine layer to return results to the user. This is obviously only a first draft of what we think would be a suited overall architecture, so there is probably lots of room for improvement. We are for example currently looking into multiple more sophisticated clustering approaches (e.g. spectral-clustering, graph-based clustering), ways of representing the clustered information (e.g. using hierarchical instead of partitional clustering, so the user can "drill down" by topic into the results) or architectural changes (e.g. using a "feedback loop", so search results can be used for further analysis). So, if you have any remarks, notes or suggestions we would be happy to hear from you :) Cheers and looking forward to discussions with you, Max |
|
|
Re: TU Berlin Winter of Code ProjectOn Nov 6, 2009, at 5:06 AM, Max Heimel wrote: > Hello everybody, > > we are a group of 6 master students of the Technical University of > Berlin who are currently working on a winter term project using > Mahout. Our - so called "Winter of Code" - project is mentored by > Isabel Drost and will run until February 2010. The goal of our project > is to develop a cloud-based blog search engine - think: "google news > for beginners ;)". The engine should be highly scalabe and use > Hadoop/Mahout to performing topical clustering and topic discovery for > crawled blog entries. > > Based on suggestions by Isabel, we currently think of the following > layered architecture: I like the layered approach, this should make it easier for others to adapt and use. > > I. Layer: Web-crawling > A web-crawler (e.g. Herititrix) is provided with a set of known blog > URLs to perform web crawls. Heritrix is configured with a simple text > filter to only crawl urls containing the word "blog" from a > prespecified TLD (so we "know" which language the blog entries use). > We plan on outputting the crawl data directly to HDFS (e.g. via > hbase-writer). > > II. Layer: Preprocessing > The data is probably not structured enough to be directly processable > by a machine, so it has to be preprocessed. This > step could e.g. consist of extracting the blog fulltext from the > crawl, stemming it, finding named entitites and tagging them. > We currently think of using UIMA for this layer. This could likely be done as M/R jobs too and contributed to Mahout utils module if so desired. > > III. Layer: Feature extraction > In order to use clustering algorithms. we need to perform a feature > extraction. This could e.g. consist of generating feature vectors, a > similiarity matrix, a link graph, etc. The goal of this layer is to > have a representation of the web crawl that can be processed by > Mahout. The feature extraction will likely be implemented via a > custom-written Hadoop job. It will be really useful to hear your feedback on what works and doesn't here, especially on noisy web data. > > IV. Layer: Clustering > This step consists of using a given Mahout clustering algorithm (or a > newly implemented algorithm) to cluster the blogs based on the > extraced features. For now we are probably going to use a very simple > k-means clustering of word frequency. We plan to switch to a more > sophisticated approach once the basic infrastructure is sound :) It should be pretty straightforward to use the different implementations in Mahout here. I'd really love to hear benchmarks, etc. > > V. Layer: Topic Discovery > Once the blog entries are clustered, each cluster needs to be assigned > a topic. This topic should be automatically determined from the blog > entries inside the cluster. Again, for now we will probably use a > very simple approach: e.g. use the most frequent words inside the > cluster > (or within the center of the cluster) as topic tags. See the patch on log-likelihood up in Mahout's JIRA. Feedback on this would be great. > > VI. Layer: Search Engine > In order to search for blogs the tagged cluster-centers and topics > provided > by Mahout need to be recombined with the information form the blog > crawl. This recombined data should then be fed into a search engine, > so users can search for a specific entry. We will probably use Solr > for this step, tagging each blog entry with it's respective cluster > topic > tag(s) and creating a search index on those tags. I'd love to see the clustering stuff worked into Solr in the ClusteringComponent. See contrib/clustering within Solr. You might find moving this layer up closer to the crawl actually makes preprocessing and feature extraction a whole lot easier. > > VII. Layer: User Front-End > This will probably be a simple web-page that sends request with the > Search Engine layer to return results to the user. > > This is obviously only a first draft of what we think would be a > suited overall > architecture, so there is probably lots of room for improvement. We > are for example currently looking into multiple more sophisticated > clustering approaches (e.g. spectral-clustering, graph-based > clustering), ways of representing the clustered information > (e.g. using hierarchical instead of partitional clustering, so the > user can > "drill down" by topic into the results) or architectural changes > (e.g. using > a "feedback loop", so search results can be used for further > analysis). > So, if you have any remarks, notes or suggestions we would be happy to > hear from you :) Those all sound really good. Looking forward to hearing more. -------------------------- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search |
|
|
Re: TU Berlin Winter of Code ProjectNamed entity extraction and feature extraction are both going to be very
challenging in the web environment. On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@...>wrote: > >> II. Layer: Preprocessing >> The data is probably not structured enough to be directly processable >> by a machine, so it has to be preprocessed. This >> step could e.g. consist of extracting the blog fulltext from the >> crawl, stemming it, finding named entitites and tagging them. >> We currently think of using UIMA for this layer. >> > > This could likely be done as M/R jobs too and contributed to Mahout utils > module if so desired. > > > >> III. Layer: Feature extraction >> In order to use clustering algorithms. we need to perform a feature >> extraction. This could e.g. consist of generating feature vectors, a >> similiarity matrix, a link graph, etc. The goal of this layer is to >> have a representation of the web crawl that can be processed by >> Mahout. The feature extraction will likely be implemented via a >> custom-written Hadoop job. >> > > It will be really useful to hear your feedback on what works and doesn't > here, especially on noisy web data. -- Ted Dunning, CTO DeepDyve |
|
|
Re: TU Berlin Winter of Code ProjectThe question that I don't see addressed is whether you choose to use a fully
streaming approach as is done in Bixo or whether you will use a document repository approach as is more common in most search engines. Hbase is reputedly ready enough to serve as a document repository. Using such an approach would be very helpful for the incremental nature of web crawls. What is the plan in this regard? On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@...>wrote: > > This is obviously only a first draft of what we think would be a suited > overall > architecture -- Ted Dunning, CTO DeepDyve |
|
|
Re: TU Berlin Winter of Code ProjectHi Ted,
we don't plan on using a streaming approach: Each layer has to finish its work completeley before the next layer can start processing. The data transfer between layers happens via HDFS or - as you mentioned - HBase. We were planning on using HBase at least for storing the initial crawl (using the HBasewriter Plugin for Heritrix), but we have to see whether/where HBase fits in during the later stages. I must admit that I haven't heard of Bixo yet, so I will have to take a look into how their architecture looks like. As for the named entity/feature extraction: yes, this is probably going to be one of the most challenging problems for the project. We are currently looking into several research papers on this topic and have just started our discussion regarding which method looks the most promising to us. Now, we obviously aren't experts on this topic (after all we do this project to learn about parallel machine learning), so we will probably try to include you guys into this discussion as soon as we have an initial proposal figured out :) Cheers, Max On Fri, Nov 6, 2009 at 8:57 PM, Ted Dunning <ted.dunning@...> wrote: > The question that I don't see addressed is whether you choose to use a fully > streaming approach as is done in Bixo or whether you will use a document > repository approach as is more common in most search engines. > > Hbase is reputedly ready enough to serve as a document repository. Using > such an approach would be very helpful for the incremental nature of web > crawls. > > What is the plan in this regard? > > On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@...>wrote: > >> >> This is obviously only a first draft of what we think would be a suited >> overall >> architecture > > > > > -- > Ted Dunning, CTO > DeepDyve > |
|
|
Re: TU Berlin Winter of Code ProjectHi Max (& Ted),
On Nov 6, 2009, at 11:57am, Ted Dunning wrote: > The question that I don't see addressed is whether you choose to use > a fully > streaming approach as is done in Bixo or whether you will use a > document > repository approach as is more common in most search engines. I think the issue here isn't about streaming vs. document repository - all systems have elements of both, it's just that... a. Bixo exposes this more explicitly, by focusing on the workflow aspects of web mining. But Nutch also has sequences of map-reduce tasks that are run during a crawl (e.g. filter URLs, group them, then fetch & parse). b. Bixo doesn't have a baked in URL database, or file-system scheme for saving content. If you look at the example SimpleCrawlTool class in Bixo, for example, you'll see that it (similar to Nutch) is using a SequenceFile to store the URL state, and sequence files in sub-directories for fetched content & parse results. But Bixo just does the simple thing of propagates the URL state forward into successive crawl directories, versus updating a single URL database. Having a URL DB is what you'd want for large-scale web crawling. If you wanted to configure Bixo to use HBase to store the URL state and fetched/parsed content, you'd use an HBase tap (in Cascading- speak) versus the Hfs tap. > Hbase is reputedly ready enough to serve as a document repository. > Using > such an approach would be very helpful for the incremental nature of > web > crawls. I'd gotten the same input from Andrew Purtell, who's been able to stream lots of crawl data into HBase, after a bit of fiddling with configuration settings and also some patching on the writer side of things. As far as pre-processing and feature extraction, both could be implemented as Cascading operations (that wind up mapping to Hadoop tasks). As Ted noted, actually doing the named entity extraction and feature extraction will be the real challenge. See this talk for an example of doing web mining using Bixo - http://www.slideshare.net/sh1mmer/the-bixo-web-mining-toolkit -- Ken > On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll > <gsingers@...>wrote: > >> >> This is obviously only a first draft of what we think would be a >> suited >> overall >> architecture > > > > > -- > Ted Dunning, CTO > DeepDyve -------------------------------------------- Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g |
|
|
Re: TU Berlin Winter of Code ProjectOn Fri Ted Dunning <ted.dunning@...> wrote:
> The question that I don't see addressed is whether you choose to use > a fully streaming approach as is done in Bixo or whether you will use > a document repository approach as is more common in most search > engines. I guess even when using a streaming approach a repository for temporary results is necessary to decouple those stages that are expensive and hard to reproduce. E.g. crawling to HBase and reading the results from there for further processing should prevent failures in post processing resulting in having to rerun the crawl. Most likely there are more of these points further down the processing chain as well. Isabel |
|
|
Re: TU Berlin Winter of Code Projectcorrect.
Also, it is very nice to be able to run the crawler again to get some new content and then only run following stages on the changed documents. With something like HBase, it is pretty easy to run only on documents that have a "needs-work" flag set. With a streaming approach, you have to segment all of your files according to incremental tranches and construct the job graph on the fly according to which inputs have appeared. It can work, but the job graph becomes prodigiously large after a bit. The complementary problem with the repository approach is the proliferation and complexity of the state indicators, especially since you want to be able to avoid scanning the entire repository by using the column nature of the repository. That means you generally can't do a scan based on a last update field, but rather you need to encode your dependencies by setting one of many flags in the code. That, in turn, means that the work-flow is encoded in your programs rather than outside them in the framework.. On Tue, Nov 10, 2009 at 3:17 AM, Isabel Drost <isabel@...> wrote: > On Fri Ted Dunning <ted.dunning@...> wrote: > > > The question that I don't see addressed is whether you choose to use > > a fully streaming approach as is done in Bixo or whether you will use > > a document repository approach as is more common in most search > > engines. > > I guess even when using a streaming approach a repository for temporary > results is necessary to decouple those stages that are expensive and > hard to reproduce. E.g. crawling to HBase and reading the results from > there for further processing should prevent failures in post processing > resulting in having to rerun the crawl. Most likely there are more of > these points further down the processing chain as well. > > Isabel > -- Ted Dunning, CTO DeepDyve |
|
|
Re: TU Berlin Winter of Code ProjectOn Friday 06 November 2009 20:47:00 Grant Ingersoll wrote:
> On Nov 6, 2009, at 5:06 AM, Max Heimel wrote: > > II. Layer: Preprocessing > > The data is probably not structured enough to be directly processable > > by a machine, so it has to be preprocessed. This > > step could e.g. consist of extracting the blog fulltext from the > > crawl, stemming it, finding named entitites and tagging them. > > We currently think of using UIMA for this layer. > > This could likely be done as M/R jobs too and contributed to Mahout > utils module if so desired. Though I know of code* at TU for retrieving blog urls via Yahoo! Boss and "guessing" the rss feed url. In a first iteration this might be a nice way of getting around the problem of having to parse the html code and separating blog posting from comments from navigational code. Isabel * That is fine to publish under Apache Software License according to the guys at the research group. -- QOTD: If you lose a son you can always get another, but there's only one Maltese Falcon. -- Sidney Greenstreet, "The Maltese Falcon" |\ _,,,---,,_ Web: <http://www.isabel-drost.de> /,`.-'`' -. ;-;;,_ |,4- ) )-,_..;\ ( `'-' '---''(_/--' `-'\_) (fL) IM: <xmpp://MaineC.@...> |
| Free embeddable forum powered by Nabble | Forum Help |