RSS
热门关键字:  数据挖掘  数据仓库  商业智能  搜索引擎  人工智能

数据挖掘并行算法 Parallel Algorithms

来源: 作者:unkonwn 时间:2004-12-11 点击:

Most of the existing algorithms, use local heuristics to handle the computational complexity. The computational complexity of these algorithms ranges from O(AN logN) to O(AN(logN)2 ) with N training data items and A attributes. These algorithms are fast enough for application domains where N is relatively small. However, in the data mining domain where millions of records and a large number of attributes are involved, the execution time of these algorithms can become prohibitive, particularly in interactive applications. Parallel algorithms have been suggested by many groups developing data mining algorithms. We discuss below two approaches that have been used.[2] 数据挖掘实验室

  • Basic Idea:
Initially N training data items are randomly distributed to P processors such that each processors has N=P data items. At this point, all the processors cooperate to expand the root node of a decision tree. For this, processors need to decide on an attribute to use to generate child nodes of the root. This can be done in three steps. In the first step, every processor collects the class distribution information of the local data. In the second step, the processors exchange the local class distribution information using global reduction. Finally, each processor can simultaneously compute entropy gains of the attributes and find the best attribute for splitting the root node.

There are two approaches for further progress. In Synchronous Tree Construction Approach, the entire set of processors synchronously expand one node of the decision tree at a time. In partitioned Tree Construction Approach, each new generated node is expanded by a subset of processors that helped the expansion of the parent node. These are discussed below . 数据挖掘研究院

  • Synchronous Tree Construction Approach
In this approach, all processors construct a decision tree synchronously by sending and

receiving class distribution information of local data.

Major steps for the approach are shown below: 数据挖掘实验室

  1. Select a node to expand according to a decision tree expansion strategy (eg. Depth-First, Breadth-First or Best-First), and call that node as the current node.
  2. For each data attribute, collect class distribution information of the local data at the current node.
  3. Exchange the class distribution information with all other processors and add up the class distribution information to get a complete distribution of all the attributes.
  4. Calculate the entropy gains of each attribute and select the best attribute for child node expansion.
  5. Based on the attribute values, create child nodes and partition the data according to the values of the attribute chosen.
  6. Repeat the above steps (1--5) until no more nodes are available for the expansion.
The advantage of this approach is that it does not require any movement of the training data items. However, this algorithm suffers from high communication cost and load imbalance.

Load imbalance can be reduced if all the nodes on the frontier are expanded simultaneously, i.e. one pass of all the data at each processor is used to compute the class distribution information for all nodes on the frontier. Note that this improvement also reduces the number of times communications are done and reduces the message start-up overhead, but it does not reduce the overall volume of communications. Now the only source of load imbalance is when some leaf nodes become terminal nodes. This load imbalance can be further minimized if the training data set is distributed randomly. 数据挖掘实验室

  • Partitioned Tree Construction Approach
In this approach, each leaf node n of the frontier of the decision tree is handled by a distinct subset of processors P(n). Once the node n is expanded into child nodes, n1, n2,…,nk, the processor group P(n) is also partitioned into k parts, P1, P2,…,Pk , such that Pi handle node ni . All the data items are shuffled such that the processors in group Pi have data items that belong to the leaf ni only. Major steps for the approach is shown below:
  1. Expand a node based on the method discussed in the beginning of the Section 3 for expanding the root node.
2. (a) If the number of leaf nodes is less than |P(n)|,
  1. Assign a subset of processor to each leaf node such that number of processors assigned to a leaf node is proportional to the number of the data items contained in the node.
  2. Shuffle the training data such that each subset of processors has data item that belongs to the leaf nodes it is responsible for.
  3. Processor subsets assigned to different nodes develop subtrees of the responsible nodes independently, by following the above steps recursively.
(b) Otherwise,
  1. Partition leaf nodes into |P(n)| groups such that each group has about the equal number of data items. Assign each processor to one node group.
  2. Shuffle the training data such that each processor has data item that belongs to the leaf nodes it is responsible for.
  3. Now the expansion of the subtrees rooted at a node group proceeds completely independently at each process.
  1. At the end, the whole decision tree is constructed by combining subtrees of each processor.
The advantage of this approach is that once a processor becomes solely responsible for a node, it can develop a subtree of the decision tree independently without any communication overhead. There are a number of disadvantages of this approach. First disadvantage is that it requires data movement after each node expansion until one process becomes responsible for an entire subtree. The communication cost is particularly expensive in the expansion of the upper part of the decision tree. Second disadvantage is due to load balancing.
最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?