EM Algorithm
Expectation Maximization (EM) algorithm is used to estimate the probability density of a set of given data. In order to model the probability density of the data, Gaussian Mixture Model is used. The probability density of the data modeled as the weighted sum of a number of gaussian distributions.
During the initialization, the applet reads the data file from the host which is has been downloaded. The URL of the data file is defined in the html source code. Default number of gaussians is also defined in html code. Number of gaussians for the algorithm can be selected prior to the first execution of the algorithm and cannot be changed there after.
数据挖掘实验室
The initial estimates of the parameters of the gaussian (mean and the variance) is randomly selected. However, during this random guess of the parameters, the range and the variance of the data is used. 数据挖掘论坛
Implementation
EM algorithm is implemented as a class named em and it is capable of dealing with any dimensional data. The em class has two constructors. One constructor accepts a pointer to the data which is stored in a N x d dimensional double array where N denotes the number of data and d represents the dimensionality of the data. The other constructor accepts the number of gaussian components to be used as a parameter as well. If the former constructor is used, the number of components must be set using the setParameter method of the class. Once all the parameters set, the em object randomly sets the initial parameters of the gaussian components. The iterate method em uses various methods available in order to find a new set of estimates for the parameters. The iteration process continues until the parameters converge. The various methods used during iteration computes the following functions: P(j|xn ), P(xn|j), P(j), P(xn ), etc. The details of the calculation of these functions can be found in the source code. The details of the em algorithm can be found in "Neural Networks for Pattern Recognition" by Christopher M. Bishop.
数据挖掘工具
Although the class em was enough to determine the parameters of the Gaussian mixture model, I have decided to implement the output routines in a class called em_graph derived from class em. The derived class em_graph uses the range information of the data in order to scale and plot the data and the representation of the resultant gaussian parameters. Even tough, I have not tried the em_graph class with a data of higher dimensionality, I am pretty confident that the class is capable of displaying the first to dimension of the data without any problems.
Finally, the applet em_app reads the data from the file specified in the html source and creates a em_graph object. The details of this class is not relevant to the EM algorithm, therefore it will not be explained here. 数据挖掘实验室