TU Berlin Winter of Code Project

View: New views
9 Messages — Rating Filter:   Alert me  

TU Berlin Winter of Code Project

by Max Heimel :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello everybody,

we are a group of 6 master students of the Technical University of
Berlin who are currently working on a winter term project using
Mahout. Our - so called "Winter of Code" - project is mentored by
Isabel Drost and will run until February 2010. The goal of our project
is to develop a cloud-based blog search engine - think: "google news
for beginners ;)". The engine should be highly scalabe and use
Hadoop/Mahout to performing topical clustering and topic discovery for
crawled blog entries.

Based on suggestions by Isabel, we currently think of the following
layered architecture:

I. Layer: Web-crawling
A web-crawler (e.g. Herititrix) is provided with a set of known blog
URLs to perform web crawls. Heritrix is configured with a simple text
filter to only crawl urls containing the word "blog" from a
prespecified TLD (so we "know" which language the blog entries use).
We plan on outputting the crawl data directly to HDFS (e.g. via hbase-writer).

II. Layer: Preprocessing
The data is probably not structured enough to be directly processable
by a machine, so it has to be preprocessed. This
step could e.g. consist of extracting the blog fulltext from the
crawl, stemming it, finding named entitites and tagging them.
We currently think of using UIMA for this layer.

III. Layer: Feature extraction
In order to use clustering algorithms. we need to perform a feature
extraction. This could e.g. consist of generating feature vectors, a
similiarity matrix, a link graph, etc. The goal of this layer is to
have a representation of the web crawl that can be processed by
Mahout. The feature extraction will likely be implemented via a
custom-written Hadoop job.

IV. Layer: Clustering
This step consists of using a given Mahout clustering algorithm (or a
newly implemented algorithm) to cluster the blogs based on the
extraced features. For now we are probably going to use a very simple
k-means clustering of word frequency. We plan to switch to a more
sophisticated approach once the basic infrastructure is sound :)

V. Layer: Topic Discovery
Once the blog entries are clustered, each cluster needs to be assigned
a topic. This topic should be automatically determined from the blog
entries inside the cluster. Again, for now we will probably use a
very simple approach: e.g. use the most frequent words inside the cluster
(or within the center of the cluster) as topic tags.

VI. Layer: Search Engine
In order to search for blogs the tagged cluster-centers and topics provided
by Mahout need to be recombined with the information form the blog
crawl. This recombined data should then be fed into a search engine,
so users can search for a specific entry. We will probably use Solr
for this step, tagging each blog entry with it's respective cluster topic
tag(s) and creating a search index on those tags.

VII. Layer: User Front-End
This will probably be a simple web-page that sends request with the
Search Engine layer to return results to the user.

This is obviously only a first draft of what we think would be a suited overall
architecture, so there is probably lots of room for improvement. We
are for example currently looking into multiple more sophisticated
clustering approaches (e.g. spectral-clustering, graph-based
clustering), ways of representing the clustered information
(e.g. using hierarchical instead of partitional clustering, so the user can
"drill down" by topic into the results) or architectural changes (e.g. using
a "feedback loop", so search results can be used for further analysis).
So, if you have any remarks, notes or suggestions we would be happy to
hear from you :)

Cheers and looking forward to discussions with you,

Max

Re: TU Berlin Winter of Code Project

by Grant Ingersoll-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Nov 6, 2009, at 5:06 AM, Max Heimel wrote:

> Hello everybody,
>
> we are a group of 6 master students of the Technical University of
> Berlin who are currently working on a winter term project using
> Mahout. Our - so called "Winter of Code" - project is mentored by
> Isabel Drost and will run until February 2010. The goal of our project
> is to develop a cloud-based blog search engine - think: "google news
> for beginners ;)". The engine should be highly scalabe and use
> Hadoop/Mahout to performing topical clustering and topic discovery for
> crawled blog entries.
>
> Based on suggestions by Isabel, we currently think of the following
> layered architecture:

I like the layered approach, this should make it easier for others to  
adapt and use.

>
> I. Layer: Web-crawling
> A web-crawler (e.g. Herititrix) is provided with a set of known blog
> URLs to perform web crawls. Heritrix is configured with a simple text
> filter to only crawl urls containing the word "blog" from a
> prespecified TLD (so we "know" which language the blog entries use).
> We plan on outputting the crawl data directly to HDFS (e.g. via  
> hbase-writer).
>
> II. Layer: Preprocessing
> The data is probably not structured enough to be directly processable
> by a machine, so it has to be preprocessed. This
> step could e.g. consist of extracting the blog fulltext from the
> crawl, stemming it, finding named entitites and tagging them.
> We currently think of using UIMA for this layer.

This could likely be done as M/R jobs too and contributed to Mahout  
utils module if so desired.

>
> III. Layer: Feature extraction
> In order to use clustering algorithms. we need to perform a feature
> extraction. This could e.g. consist of generating feature vectors, a
> similiarity matrix, a link graph, etc. The goal of this layer is to
> have a representation of the web crawl that can be processed by
> Mahout. The feature extraction will likely be implemented via a
> custom-written Hadoop job.

It will be really useful to hear your feedback on what works and  
doesn't here, especially on noisy web data.


>
> IV. Layer: Clustering
> This step consists of using a given Mahout clustering algorithm (or a
> newly implemented algorithm) to cluster the blogs based on the
> extraced features. For now we are probably going to use a very simple
> k-means clustering of word frequency. We plan to switch to a more
> sophisticated approach once the basic infrastructure is sound :)

It should be pretty straightforward to use the different  
implementations in Mahout here.  I'd really love to hear benchmarks,  
etc.


>
> V. Layer: Topic Discovery
> Once the blog entries are clustered, each cluster needs to be assigned
> a topic. This topic should be automatically determined from the blog
> entries inside the cluster. Again, for now we will probably use a
> very simple approach: e.g. use the most frequent words inside the  
> cluster
> (or within the center of the cluster) as topic tags.

See the patch on log-likelihood up in Mahout's JIRA.  Feedback on this  
would be great.


>
> VI. Layer: Search Engine
> In order to search for blogs the tagged cluster-centers and topics  
> provided
> by Mahout need to be recombined with the information form the blog
> crawl. This recombined data should then be fed into a search engine,
> so users can search for a specific entry. We will probably use Solr
> for this step, tagging each blog entry with it's respective cluster  
> topic
> tag(s) and creating a search index on those tags.

I'd love to see the clustering stuff worked into Solr in the  
ClusteringComponent.  See contrib/clustering within Solr.  You might  
find moving this layer up closer to the crawl actually makes  
preprocessing and feature extraction a whole lot easier.


>
> VII. Layer: User Front-End
> This will probably be a simple web-page that sends request with the
> Search Engine layer to return results to the user.
>
> This is obviously only a first draft of what we think would be a  
> suited overall
> architecture, so there is probably lots of room for improvement. We
> are for example currently looking into multiple more sophisticated
> clustering approaches (e.g. spectral-clustering, graph-based
> clustering), ways of representing the clustered information
> (e.g. using hierarchical instead of partitional clustering, so the  
> user can
> "drill down" by topic into the results) or architectural changes  
> (e.g. using
> a "feedback loop", so search results can be used for further  
> analysis).
> So, if you have any remarks, notes or suggestions we would be happy to
> hear from you :)

Those all sound really good.  Looking forward to hearing more.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: TU Berlin Winter of Code Project

by Ted Dunning :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Named entity extraction and feature extraction are both going to be very
challenging in the web environment.

On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@...>wrote:

>
>> II. Layer: Preprocessing
>> The data is probably not structured enough to be directly processable
>> by a machine, so it has to be preprocessed. This
>> step could e.g. consist of extracting the blog fulltext from the
>> crawl, stemming it, finding named entitites and tagging them.
>> We currently think of using UIMA for this layer.
>>
>
> This could likely be done as M/R jobs too and contributed to Mahout utils
> module if so desired.
>
>
>
>> III. Layer: Feature extraction
>> In order to use clustering algorithms. we need to perform a feature
>> extraction. This could e.g. consist of generating feature vectors, a
>> similiarity matrix, a link graph, etc. The goal of this layer is to
>> have a representation of the web crawl that can be processed by
>> Mahout. The feature extraction will likely be implemented via a
>> custom-written Hadoop job.
>>
>
> It will be really useful to hear your feedback on what works and doesn't
> here, especially on noisy web data.




--
Ted Dunning, CTO
DeepDyve

Re: TU Berlin Winter of Code Project

by Ted Dunning :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

The question that I don't see addressed is whether you choose to use a fully
streaming approach as is done in Bixo or whether you will use a document
repository approach as is more common in most search engines.

Hbase is reputedly ready enough to serve as a document repository.  Using
such an approach would be very helpful for the incremental nature of web
crawls.

What is the plan in this regard?

On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@...>wrote:

>
> This is obviously only a first draft of what we think would be a suited
> overall
> architecture




--
Ted Dunning, CTO
DeepDyve

Re: TU Berlin Winter of Code Project

by Max Heimel :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Ted,

we don't plan on using a streaming approach: Each layer has to finish
its work completeley before the next layer can start processing. The
data transfer between layers happens via HDFS or - as you mentioned -
HBase. We were planning on using HBase at least for storing the
initial crawl (using the HBasewriter Plugin for Heritrix), but we have
to see whether/where HBase fits in during the later stages. I must
admit that I haven't heard of Bixo yet, so I will have to take a look
into how their architecture looks like.

As for the named entity/feature extraction: yes, this is probably
going to be one of the most challenging problems for the project. We
are currently looking into several research papers on this topic and
have just started our discussion regarding which method looks the most
promising to us. Now, we obviously aren't experts on this topic (after
all we do this project to learn about parallel machine learning), so
we will probably try to include you guys into this discussion as soon
as we have an initial proposal figured out :)

Cheers,
Max

On Fri, Nov 6, 2009 at 8:57 PM, Ted Dunning <ted.dunning@...> wrote:

> The question that I don't see addressed is whether you choose to use a fully
> streaming approach as is done in Bixo or whether you will use a document
> repository approach as is more common in most search engines.
>
> Hbase is reputedly ready enough to serve as a document repository.  Using
> such an approach would be very helpful for the incremental nature of web
> crawls.
>
> What is the plan in this regard?
>
> On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@...>wrote:
>
>>
>> This is obviously only a first draft of what we think would be a suited
>> overall
>> architecture
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: TU Berlin Winter of Code Project

by Ken Krugler :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Max (& Ted),

On Nov 6, 2009, at 11:57am, Ted Dunning wrote:

> The question that I don't see addressed is whether you choose to use  
> a fully
> streaming approach as is done in Bixo or whether you will use a  
> document
> repository approach as is more common in most search engines.

I think the issue here isn't about streaming vs. document repository -  
all systems have elements of both, it's just that...

a. Bixo exposes this more explicitly, by focusing on the workflow  
aspects of web mining.

But Nutch also has sequences of map-reduce tasks that are run during a  
crawl (e.g. filter URLs, group them, then fetch & parse).

b. Bixo doesn't have a baked in URL database, or file-system scheme  
for saving content.

If you look at the example SimpleCrawlTool class in Bixo, for example,  
you'll see that it (similar to Nutch) is using a SequenceFile to store  
the URL state, and sequence files in sub-directories for fetched  
content & parse results.

But Bixo just does the simple thing of propagates the URL state  
forward into successive crawl directories, versus updating a single  
URL database. Having a URL DB is what you'd want for large-scale web  
crawling.

If you wanted to configure Bixo to use HBase to store the URL state  
and fetched/parsed content, you'd use an HBase tap (in Cascading-
speak) versus the Hfs tap.

> Hbase is reputedly ready enough to serve as a document repository.  
> Using
> such an approach would be very helpful for the incremental nature of  
> web
> crawls.

I'd gotten the same input from Andrew Purtell, who's been able to  
stream lots of crawl data into HBase, after a bit of fiddling with  
configuration settings and also some patching on the writer side of  
things.

As far as pre-processing and feature extraction, both could be  
implemented as Cascading operations (that wind up mapping to Hadoop  
tasks).

As Ted noted, actually doing the named entity extraction and feature  
extraction will be the real challenge.

See this talk for an example of doing web mining using Bixo - http://www.slideshare.net/sh1mmer/the-bixo-web-mining-toolkit

-- Ken


> On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll  
> <gsingers@...>wrote:
>
>>
>> This is obviously only a first draft of what we think would be a  
>> suited
>> overall
>> architecture
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: TU Berlin Winter of Code Project

by Isabel Drost-4 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Fri Ted Dunning <ted.dunning@...> wrote:

> The question that I don't see addressed is whether you choose to use
> a fully streaming approach as is done in Bixo or whether you will use
> a document repository approach as is more common in most search
> engines.

I guess even when using a streaming approach a repository for temporary
results is necessary to decouple those stages that are expensive and
hard to reproduce. E.g. crawling to HBase and reading the results from
there for further processing should prevent failures in post processing
resulting in having to rerun the crawl. Most likely there are more of
these points further down the processing chain as well.

Isabel

Re: TU Berlin Winter of Code Project

by Ted Dunning :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

correct.

Also, it is very nice to be able to run the crawler again to get some new
content and then only run following stages on the changed documents.  With
something like HBase, it is pretty easy to run only on documents that have a
"needs-work" flag set.  With a streaming approach, you have to segment all
of your files according to incremental tranches and construct the job graph
on the fly according to which inputs have appeared.  It can work, but the
job graph becomes prodigiously large after a bit.  The complementary problem
with the repository approach is the proliferation and complexity of the
state indicators, especially since you want to be able to avoid scanning the
entire repository by using the column nature of the repository.  That means
you generally can't do a scan based on a last update field, but rather you
need to encode your dependencies by setting one of many flags in the code.
That, in turn, means that the work-flow is encoded in your programs rather
than outside them in the framework..

On Tue, Nov 10, 2009 at 3:17 AM, Isabel Drost <isabel@...> wrote:

> On Fri Ted Dunning <ted.dunning@...> wrote:
>
> > The question that I don't see addressed is whether you choose to use
> > a fully streaming approach as is done in Bixo or whether you will use
> > a document repository approach as is more common in most search
> > engines.
>
> I guess even when using a streaming approach a repository for temporary
> results is necessary to decouple those stages that are expensive and
> hard to reproduce. E.g. crawling to HBase and reading the results from
> there for further processing should prevent failures in post processing
> resulting in having to rerun the crawl. Most likely there are more of
> these points further down the processing chain as well.
>
> Isabel
>



--
Ted Dunning, CTO
DeepDyve

Re: TU Berlin Winter of Code Project

by Isabel Drost-4 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Friday 06 November 2009 20:47:00 Grant Ingersoll wrote:

> On Nov 6, 2009, at 5:06 AM, Max Heimel wrote:
> > II. Layer: Preprocessing
> > The data is probably not structured enough to be directly processable
> > by a machine, so it has to be preprocessed. This
> > step could e.g. consist of extracting the blog fulltext from the
> > crawl, stemming it, finding named entitites and tagging them.
> > We currently think of using UIMA for this layer.
>
> This could likely be done as M/R jobs too and contributed to Mahout
> utils module if so desired.
+1

Though I know of code* at TU for retrieving blog urls via Yahoo! Boss
and "guessing" the rss feed url. In a first iteration this might be a nice
way of getting around the problem of having to parse the html code and
separating blog posting from comments from navigational code.

Isabel


* That is fine to publish under Apache Software License according to the guys
at the research group.

--
QOTD: If you lose a son you can always get another, but there's only one
Maltese Falcon.   -- Sidney Greenstreet, "The Maltese Falcon"
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xmpp://MaineC.@...>



signature.asc (204 bytes) Download Attachment