Hallo,
now I'm writing on my diploma on near duplicate detection. I'm trying to optimize Broder's Algorithm (if you're interested, the paper is called „Syntactic Clustering of the Web“), for near duplicate search and I'm focusing on the detection of similar documents. The last step of the algorithm is clustering. Having my pairs of similar documents I now have to build clusters.
Now my data looks like this:
d1,d2
d1,d3
d2,d3
d4,d5
d4,d6
d4,d7
Broder writes that he uses a union-find algorithm to build clusters, but doesn't mention which. If I understand it correctly union-find algorithms are: single-link clustering and group-average link clustering. Is that correct? Could you suggest any library which would implement any of them? Maybe you can suggest any reliable links on the issue.
As I didn't find any information on that issue. I thought that some hierarchial clustering algorithm would be the best choice, while the number of clusters is not known in advance. So I tried the WEKA Cobweb implementation. I first created a similarity matrix which looks like this:
d1 d2 d3 d4 d5 d6 d7
d1 1 1 1 0 0 0 0
d2 1 1 1 0 0 0 0
d3 1 1 1 0 0 0 0
d4 0 0 0 1 1 1 1
d5 0 0 0 1 1 0 0
d6 0 0 0 1 0 1 0
d7 0 0 0 1 0 0 1
Cobweb correctly builds a cluster of documents {d1,d2,d3} but the rest of the documents appear in different clusters, not even something like {d4,d5}, {d4,d6}, {d4,d7}. Maybe I must select different acuity or cutoff values. I use the default ones till now. If yes could you please give a hint. I found out, that acuity „is the minimum value for the standard deviation“, and cutoof „sets the category utility threshold by which to prune nodes“. I think, I can even understand what they mean (I browsed through the „Data Mining“ book) but I'm not sure if they can help me, because the number of documents will vary (can be hundreds or thousands). Or maybe I'm mistaken and Cobweb is not a good choice at all?
I'm not too lazy to research more about clustering it's just that it would draw me away from the focus of my diploma for quite a long time.
I would be very thankful for any kind of help, also some keywords perhaps.