Most of the existing algorithms, use local heuristics to handle the computational complexity. The computational complexity of these algorithms ranges from O(AN logN) to O(AN(logN)2 ) with N training data items and A attributes. These algorithms are fast enough for application domains where N is relatively small. However, in the data mining domain where millions of records and a large number of attributes are involved, the execution time of these algorithms can become prohibitive, particularly in interactive applications. Parallel algorithms have been suggested by many groups developing data mining algorithms. We discuss below two approaches that have been used.[2] 数据挖掘实验室
Initially N training data items are randomly distributed to P processors such that each processors has N=P data items. At this point, all the processors cooperate to expand the root node of a decision tree. For this, processors need to decide on an attribute to use to generate child nodes of the root. This can be done in three steps. In the first step, every processor collects the class distribution information of the local data. In the second step, the processors exchange the local class distribution information using global reduction. Finally, each processor can simultaneously compute entropy gains of the attributes and find the best attribute for splitting the root node.There are two approaches for further progress. In Synchronous Tree Construction Approach, the entire set of processors synchronously expand one node of the decision tree at a time. In partitioned Tree Construction Approach, each new generated node is expanded by a subset of processors that helped the expansion of the parent node. These are discussed below . 数据挖掘研究院
In this approach, all processors construct a decision tree synchronously by sending andreceiving class distribution information of local data.
Major steps for the approach are shown below: 数据挖掘实验室
- Select a node to expand according to a decision tree expansion strategy (eg. Depth-First, Breadth-First or Best-First), and call that node as the current node.
- For each data attribute, collect class distribution information of the local data at the current node.
- Exchange the class distribution information with all other processors and add up the class distribution information to get a complete distribution of all the attributes.
- Calculate the entropy gains of each attribute and select the best attribute for child node expansion.
- Based on the attribute values, create child nodes and partition the data according to the values of the attribute chosen.
- Repeat the above steps (1--5) until no more nodes are available for the expansion.
Load imbalance can be reduced if all the nodes on the frontier are expanded simultaneously, i.e. one pass of all the data at each processor is used to compute the class distribution information for all nodes on the frontier. Note that this improvement also reduces the number of times communications are done and reduces the message start-up overhead, but it does not reduce the overall volume of communications. Now the only source of load imbalance is when some leaf nodes become terminal nodes. This load imbalance can be further minimized if the training data set is distributed randomly. 数据挖掘实验室
In this approach, each leaf node n of the frontier of the decision tree is handled by a distinct subset of processors P(n). Once the node n is expanded into child nodes, n1, n2,…,nk, the processor group P(n) is also partitioned into k parts, P1, P2,…,Pk , such that Pi handle node ni . All the data items are shuffled such that the processors in group Pi have data items that belong to the leaf ni only. Major steps for the approach is shown below:- Expand a node based on the method discussed in the beginning of the Section 3 for expanding the root node.
- Assign a subset of processor to each leaf node such that number of processors assigned to a leaf node is proportional to the number of the data items contained in the node.
- Shuffle the training data such that each subset of processors has data item that belongs to the leaf nodes it is responsible for.
- Processor subsets assigned to different nodes develop subtrees of the responsible nodes independently, by following the above steps recursively.
- Partition leaf nodes into |P(n)| groups such that each group has about the equal number of data items. Assign each processor to one node group.
- Shuffle the training data such that each processor has data item that belongs to the leaf nodes it is responsible for.
- Now the expansion of the subtrees rooted at a node group proceeds completely independently at each process.
- At the end, the whole decision tree is constructed by combining subtrees of each processor.

