RSS
热门关键字:  数据挖掘  数据仓库  人工智能  搜索引擎  数据挖掘导论

CHAMELEON: A Hierarchical Clustering Algorithm

来源: 作者:unkonwn 时间:2004-11-30 点击:

Clustering in data mining [SADC93, CHY96] is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized [JD88, KR90, PAS96, CHY96]. These discovered clusters can be used to explain the characteristics of the underlying data distribution, and thus serve as the foundation for other data mining and analysis techniques. The applications of clustering include characterization of different customer groups based upon purchasing patterns, categorization of documents on the World Wide Web [BGGC99a, BGGC99b], grouping of genes and proteins that have similar functionality [HHS92, NRSC95, SCCC95, HKKM98], grouping of spatial locations prone to earth quakes from seismological data [BR98, XEKS98], etc. Existing clustering algorithms, such as K-means [JD88], PAM [KR90], CLARANS [NH94], DBSCAN [EKSX96], CURE [GRS98], and ROCK [GRS99] are designed to find clusters that fit some static models. For example, K-means, PAM, and CLARANS assume that clusters are hyper-ellipsoidal (or globular) and are of similar sizes. DBSCAN assumes that all points within genuine clusters are density reachable 1 and points across different clusters are not. Agglomerative hierarchical clustering algorithms, such as CURE and ROCK use a static model to determine the most similar cluster to merge in the hierarchical clustering. CURE measures the similarity of two clusters based on the similarity of the closest pair of the representative points belonging to different clusters, without considering the internal closeness (i.e., density or homogeneity) of the two clusters involved. ROCK measures the similarity of two clusters by comparing the aggregate inter-connectivity of two clusters against a user-specified static inter-connectivity model, and thus ignores the potential variations in the inter-connectivity of different clusters within the same data set. These algorithms can breakdown if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms breakdown when the data consists of clusters that are of diverse shapes, densities, and sizes. 数据挖掘研究院

In this paper, we present a novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between two clusters are comparable to the internal inter-connectivity of the clusters and closeness of items within the clusters. The merging process using the dynamic model presented in this paper facilitates discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in CHAMELEON is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of CHAMELEON in a number of data sets that contain points in 2D space, and contain clusters of different shapes, densities, sizes, noise, and artifacts. 数据挖掘研究院

The rest of the paper is organized as follows. Section 2 gives an overview of related clustering algorithms. Section 3 presents the limitations of the recently proposed state of the art clustering algorithms. We present our new clustering algorithm in Section 4. Section 5 gives the experimental results. Section 6 contains conclusions and directions for future work.

数据挖掘研究院

In this section, we give a brief description of existing clustering algorithms.

数据挖掘研究院

Partitional clustering attempts to break a data set into K clusters such that the partition optimizes a given criterion [JD88, KR90, NH94, CS96]. Centroid-based approaches, as typified by K means [JD88] and ISODATA [BH64], try to assign points to clusters such that the mean square distance of points to the centroid of the assigned cluster is minimized. Centroid-based techniques are suitable only for data in metric spaces (e.g., Euclidean space) in which it is possible to compute a centroid of a given set of points. Medoid-based methods, as typified by PAM (Partitioning Around Medoids) [KR90] and CLARANS [NH94], work with similarity data, i.e., data in an arbitrary similarity space [GRGC99]. These techniques try to find representative points (medoids) so as to minimize the sum of the distances of points from their closest medoid.

A major drawback of both of these schemes is that they fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own cluster. This can happen in many natural clusters [HKKM97, GRS99]; for example, if there is a large variation in cluster sizes (as in Figure 1 (a)) or when cluster shapes are convex (as in Figure 1 (b)).

资料全文下载

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?