Data clustering is important in many fields, including data mining [FPSU96], statistical data analysis [KR89,BR93], compression [ZRL97], and vector quantization. Applications include data analysis and modeling [FDW97,FHS96], image segmentation, marketing, fraud detection, predictive modeling, data summarization, and general data reporting tasks [B*96]. It has important applications in data cleaning and exploratory data analysis. Clustering is a crucial data mining step and performing the task over massive databases is essential. Previous scalable clustering work has focused on k-Means-type approaches [ZRL97,BFR98] and region growing [NH94, SEKX98, AGGR98]. These techniques, while effective, do not derive statistical models of the data (i.e. they are based on notions of distance metrics, etc.) and they do not allow for cluster overlap (i.e. a data record may belong to different clusters with different membership probabilities). In this paper, we focus on the task of scaling the most effective technique available for proper probabilistic clustering: the Expectation-Maximization (EM) algorithm [DLR77, CS96]. EM has additional desirable properties in that it does not require the specification of distance measures and readily admits categorical and continuous attributes (which is untrue of other clustering algorithms that either operate on continuous, e.g. k-Means-type algorithms, or categorical [GKR98] data exclusively). EM has been shown to be superior to other alternatives for statistical modeling purposes [GMPS97,PE96,B95,CS96,NH99].
The clustering problem has been formulated in various ways in the statistics [KR89,BR93,B95,S92,S86], pattern recognition [DH73,F90], optimization [BMS97,SI84], and machine learning literature [F87]. The fundamental problem is that of grouping together (clustering) data items that are similar to each other. The most general view places clustering in the framework of density estimation [S86, S92, BR93]. Data is generally not uniformly distributed. Some combinations of attribute values are more likely than others. Clustering can be viewed as identifying the dense regions of the probability density of the data source. An effective representation of the probability density function is the mixture model: a model consisting of several components (e.g. a model consisting of the sum of 3 Gaussians). Each component generates a set of data records (a “cluster”). The data set is then a mixture of clusters and the problem is to identify the data points constituting a cluster and inferring the properties of the distribution governing each cluster.
Consider a simple example with data consisting of 2 attributes: age and income. One may choose to model the data as a single cluster and report that the data records have an average age of 41 years and an average income of $26K/year (with associated variances). However, this is rather deceptive and uninformative. The data may be a mixture of working people, retired people, and children. A more informative summary is to identify these subsets or clusters, and report the cluster parameters. Results may now be: 20% of the data have average age 12 and zero income, 45% have average age 38 and average income $45K, and 30% have average age 72 and average income $20K, while 5% of data had unknown incomes and ages. Assuming we have no apriori definition of the various sub-populations, how do we discover what they are? Furthermore, how do we do this if the data has many more dimensions and the various partitions are not obvious? This is where clustering plays an important role in identifying these dense regions in multidimensional data.

