Established companies have had decades to accumulate masses of data about their customers, suppliers, and products and services. The rapid pace of e-commerce means that Web startups can become huge enterprises in months, not years, amassing proportionately large databases as they grow. Data mining, also known as knowledge discovery in databases,1 gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making.
Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory, and retrieving data from disk is markedly slower than accessing data in RAM. Thus, to be efficient, the data-mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if, given a fixed amount of main memory, its runtime increases linearly with the number of records in the input database.
Recent work has focused on scaling data-mining algorithms to very large data sets. In this survey, we describe a broad range of algorithms that address three classical data-mining problems: market basket analysis, clustering, and classification.
MARKET BASKET ANALYSIS
A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity: for example, a customer’s visit to a grocery store or an online purchase from a virtual store such as Amazon.com. Retailers accumulate huge collections of transactions by recording business activity over time. One common analysis run against a transactions database is to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it. Businesses can use knowledge of these patterns to improve the placement of items in a store or the layout of mail-order catalog pages and Web pages.
An itemset containing i items is called an i-itemset. The percentage of transactions that contain an itemset is called the itemset’s support. For an itemset to be interesting, its support must be higher than a user-specified minimum; such itemsets are said to be frequent. Figure 1 shows three transactions stored in a relational database system. The database has five fields: a transaction identifier, a customer identifier, the item purchased, its price, and the transaction date. The first transaction shows a customer who bought a computer, MS Office, and Doom. As an example, the 2-itemset {hard disk, Doom} has a support of 67 percent.
Why is finding frequent itemsets a nontrivial problem? First, the number of customer transactions can be very large and usually will not fit in memory. Second, the potential number of frequent itemsets is exponential in the number of different items, although the actual number of frequent itemsets can be much smaller. The example in Figure 1 shows four different items, so there are 2^4 - 1 = 15 potential frequent itemsets. If the minimum support is 60 percent, only five itemsets are actually frequent. Thus, we want algorithms that are scalable with respect to the number of transactions and examine as few infrequent itemsets as possible. Efficient algorithms have been designed to address these criteria. The Apriori algorithm2 provided one early solution, which subsequent algorithms built upon.
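The definitions of support and frequent itemsets can be made concrete with a small brute-force sketch in Python. The three transactions below are an assumption: they are hypothetical baskets chosen to agree with the numbers quoted in the text (four distinct items, 67 percent support for {hard disk, Doom}, five frequent itemsets at 60 percent minimum support) and are not the actual rows of Figure 1. The function names are likewise illustrative. The sketch enumerates every candidate itemset, which is exactly the exponential approach that does not scale; it is not the Apriori algorithm.

```python
from itertools import combinations

# Hypothetical baskets consistent with the figures quoted in the text;
# these are NOT the actual contents of Figure 1.
TRANSACTIONS = [
    {"computer", "MS Office", "Doom"},
    {"hard disk", "Doom"},
    {"computer", "hard disk", "Doom"},
]


def support(itemset, transactions):
    """Return the fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def all_itemsets(transactions):
    """Yield every nonempty itemset over the distinct items (2^n - 1 of them)."""
    items = sorted(set().union(*transactions))
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def frequent_itemsets(transactions, min_support):
    """Brute force: compute the support of every candidate and keep the frequent ones."""
    result = {}
    for candidate in all_itemsets(transactions):
        s = support(candidate, transactions)
        if s >= min_support:
            result[candidate] = s
    return result


if __name__ == "__main__":
    print(len(list(all_itemsets(TRANSACTIONS))))  # number of candidate itemsets
    for itemset, s in frequent_itemsets(TRANSACTIONS, 0.60).items():
        print(sorted(itemset), f"{s:.0%}")
```

Every call to `support` scans the whole transaction list, and the number of candidates doubles with each additional distinct item, so this approach is only workable for toy inputs. Scalable algorithms aim to avoid both costs by limiting the number of passes over the data and by pruning candidates that cannot be frequent.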

