A fuzzy-based algorithm for web document clustering

Menahem Friedman, Abraham Kandel, Moti Schneider, Mark Last, Bracha Shapira, Yuval Elovici, Omer Zaafrany

IEEE Annual Meeting of the Fuzzy Information Processing Society, 2004. NAFIPS '04. 2 …, 2004

Most existing document clustering methods assume that each document is represented by a fixed-size vector of key terms or key phrases. This assumption is unrealistic for large and diverse document collections such as the World Wide Web. We propose a new fuzzy-based document clustering method (FDCM) that clusters documents represented by variable-length vectors. Each vector element consists of two fields: the first identifies a key phrase in the document (its name), and the second gives the frequency of that key phrase within the document. A new averaging method is defined for calculating cluster centroids, and a membership function is developed for relating new documents to existing clusters. The proposed approach is described in detail, and we show how it is implemented in a real-world application in the area of Web monitoring.
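
The abstract does not give the formulas behind FDCM, so the following is only an illustrative sketch of the data model it describes: documents as variable-length collections of (key phrase, frequency) pairs, a cluster centroid obtained by averaging, and a membership score for a new document. The averaging rule (absent phrases count as zero), the cosine similarity, and the normalization of memberships to sum to one are all assumptions made for illustration, not the paper's definitions. The names `centroid`, `cosine`, and `memberships` are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List

# A document is a variable-length mapping: key phrase -> frequency.
Doc = Dict[str, float]


def centroid(docs: List[Doc]) -> Doc:
    """Average the frequency of every phrase seen in the cluster.

    ASSUMPTION: a phrase missing from a document contributes 0.
    This is a stand-in, not the averaging method defined in the paper.
    """
    if not docs:
        raise ValueError("cannot average an empty cluster")
    totals: Dict[str, float] = defaultdict(float)
    for doc in docs:
        for phrase, freq in doc.items():
            totals[phrase] += freq
    return {phrase: total / len(docs) for phrase, total in totals.items()}


def cosine(a: Doc, b: Doc) -> float:
    """Cosine similarity between two sparse phrase-frequency vectors."""
    dot = sum(freq * b.get(phrase, 0.0) for phrase, freq in a.items())
    norm_a = sqrt(sum(f * f for f in a.values()))
    norm_b = sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def memberships(doc: Doc, centroids: List[Doc]) -> List[float]:
    """Fuzzy membership of `doc` in each cluster.

    ASSUMPTION: membership is the cosine similarity to each centroid,
    normalized so the degrees sum to 1 (all zeros if nothing matches).
    """
    sims = [cosine(doc, c) for c in centroids]
    total = sum(sims)
    return [s / total if total else 0.0 for s in sims]


if __name__ == "__main__":
    security = centroid([{"firewall": 2, "malware": 4}, {"firewall": 4}])
    sports = centroid([{"football": 3, "league": 1}])
    new_doc = {"malware": 1, "league": 1, "weather": 5}
    print(memberships(new_doc, [security, sports]))
```

Storing each document as a dictionary keeps the representation variable in length: two documents need not share any phrases, and a centroid simply accumulates the union of the phrases in its cluster.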