Insightful Miner
The next tool in order of overall usefulness (see Table 1), Insightful Miner, follows naturally after Clementine for another reason: this tool has the best selection of ETL functions of any data mining tool on the market. These functions include:
- Merging, appending, sorting and filtering (similar to Clementine and Statistica Data Miner)
- Slicing and dicing of input data for purposes of data exploration. These functions are common in database management and business intelligence (BI) tools, but it is very rare to find them in a data mining tool.
- Joining: creates a new data set by combining columns of two other data sets.
- Stacking and unstacking: creates a new column by combining two or more columns and vice versa.
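The joining and stacking operations in the list above can be illustrated with a minimal Python sketch. It works on plain lists of dictionaries rather than on Insightful Miner itself; the function names (`join`, `stack`, `unstack`) and the sample columns are hypothetical and chosen only for illustration.

```python
def join(left, right, key):
    """Combine the columns of two data sets that share a key column.

    Assumes each key appears at most once in `right` (inner join).
    """
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def stack(rows, id_col, value_cols, name_col="variable", value_col="value"):
    """Fold several columns into a single value column (wide to long)."""
    return [
        {id_col: row[id_col], name_col: col, value_col: row[col]}
        for row in rows
        for col in value_cols
    ]


def unstack(rows, id_col, name_col="variable", value_col="value"):
    """Spread a single value column back into several columns (long to wide)."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[row[name_col]] = row[value_col]
    return list(wide.values())


# Hypothetical sample data.
customers = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"}]
spend = [{"id": 1, "q1": 10, "q2": 12}, {"id": 2, "q1": 7, "q2": 9}]

joined = join(customers, spend, "id")         # one row per customer, all columns
long_form = stack(spend, "id", ["q1", "q2"])  # one row per customer and quarter
round_trip = unstack(long_form, "id")         # back to the original layout
```

Stacking followed by unstacking returns the original layout, which is a convenient sanity check when reshaping data before modeling.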
The only data mining toolset with greater ETL capability was Torrent Orchestrate, which was purchased by Ascential Software in 2001. IBM acquired Ascential recently.
The rich ETL capability of Insightful Miner is mated with a graphical programming interface, similar to that of Clementine, Statistica Data Miner and SAS Enterprise Miner. In addition, many useful algorithms are integrated as analysis nodes (neural networks, classification and regression trees, logistic regression and naive Bayes models).
Insightful Miner provides a number of very valuable capabilities for data mining. Firstly, it is built around the S statistical language, providing a rich statistical analysis and graphics capability via the menu-based S-Plus implementation of the S language. The statistical abilities of Insightful Miner rival those of Statistica Data Miner in their power and completeness.
In addition, Insightful Miner is built on a pipeline architecture that permits easy scaling to analysis of large data sets. This means that the data analysis algorithms operate in streaming mode using incremental forms of statistical analysis (e.g., based on provisional means and standard deviations). For many data miners, this may not be important. But if you must analyze large data sets from massive data warehouses, this feature can be very important. It means that you don't have to extract data to external data sets; you can stream data directly from the source data structures through the analysis algorithms and build solutions incrementally. The only other tools that can do that are Statistica Data Miner (the first tool to provide this capability) and one algorithm in Fair Isaac's Model Builder tool (not reviewed here).
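To make the idea of incremental statistics concrete, here is a minimal Python sketch of a running mean and standard deviation that sees each record only once. It uses Welford's update, a standard formulation of the provisional-means approach; it is not taken from Insightful Miner, and the class name `RunningStats` and the sample values are assumptions for illustration.

```python
import math


class RunningStats:
    """Mean and standard deviation updated one value at a time (Welford)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # provisional mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; defined as 0.0 for fewer than two values.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def stream(values):
    """Stand-in for a cursor over a large source table (hypothetical)."""
    for value in values:
        yield value


stats = RunningStats()
for value in stream([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):
    stats.update(value)  # no record is kept in memory after this call
```

Because only three numbers are retained between records, memory use does not grow with the size of the data set, which is the property that makes streaming-mode analysis scale.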
The other extremely useful part of Insightful Miner is its set of model evaluation tools. The tool outputs both a coincidence (confusion) matrix and the percent classification accuracy for both output states. Often it is very useful to know how accurate the model is for the positives and negatives separately, rather than to report just the global accuracy. Like the model comparison node in Statistica Data Miner, the Lift Chart node in Insightful Miner will accept multiple inputs to produce an overlaid lift chart. But this tool does not provide for a final classification based on a voting process among the algorithms. Finally, an overlaid ROC (receiver operating characteristic) chart can be output from multiple inputs. The area under the ROC curve represents the classification power of an algorithm: it summarizes how the algorithm performs at different cut-points along the range of classification probabilities used to create the binary classifications.
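The difference between global accuracy, per-class accuracy and the area under the ROC curve can be shown with a short Python sketch. This is an illustration under stated assumptions, not the output of Insightful Miner: the labels, scores and the 0.5 cut-point are made up, and the area is computed with the pairwise (Mann-Whitney) formulation, which is equivalent to integrating the ROC curve.

```python
def confusion_matrix(actual, predicted):
    """Count the four cells of a binary coincidence (confusion) matrix."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
    }


def class_accuracies(cm):
    """Accuracy for positives and negatives separately, plus the global figure.

    Assumes both classes are present in the data.
    """
    return {
        "positive": cm["tp"] / (cm["tp"] + cm["fn"]),
        "negative": cm["tn"] / (cm["tn"] + cm["fp"]),
        "global": (cm["tp"] + cm["tn"]) / sum(cm.values()),
    }


def roc_auc(actual, scores):
    """Area under the ROC curve: the share of positive/negative pairs in
    which the positive case scores higher (ties count one half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # cut-point of 0.5

cm = confusion_matrix(actual, predicted)
accuracy = class_accuracies(cm)
auc = roc_auc(actual, scores)
```

In this toy case the global accuracy is 75 percent, but the model is right on 80 percent of the negatives and only two of the three positives, which is exactly the kind of gap a single global figure hides.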
These features are orchestrated by Insightful Miner to provide an analytical platform that is configurable and extendable throughout the business enterprise. It is scalable and can grow as your data analysis needs grow. The very flexible S-Plus framework provides a powerful and extensible programming environment. Maybe best of all, Insightful Miner provides perpetual licensing without annual rental agreements.
The Future for Insightful Miner?
The scalable architecture of Insightful Miner should be leveraged to create Tool Kits for analyzing massive data sets. As data sets increase in size, traditional data mining tools become less and less efficient for analysis. Two approaches to analytical scalability can be followed: parallelism and streaming-mode operation. Large hardware parallel systems of IBM and NCR are very expensive. If you don't need parallelism to efficiently process storage and retrieval of massive data sets for other operational purposes, streaming-mode operation is the way to go. It is possible to build streaming-mode versions of machine learning programs also (e.g., neural nets and decision trees). These capabilities could become the "killer-apps" in the world of data mining of massive data sets.
KXEN
In the past, KXEN stood alone in many respects. It was the only implementation of statistical learning theory, it was highly automated, it minimized the amount of data preparation necessary before modeling, and in many cases the predictive accuracy of its algorithm was the highest among competitors. This situation is still largely true today, but the gap is narrowing.
KXEN is composed of several modules:
- K2C - Consistent Coder
- K2R - Robust Regression
- K2S - Smart Segmenter
- KSVM - Support Vector Machine
- KTS - Time Series
- KEL - Event Logger
- KMX - Model Export
- KAR - Association Rules
- KXEN Assistant - Menu-based Interface
All of these modules are available through the menu interface, and they can be packaged separately or in bundles. For example, SmartFocus (based in Bristol, UK) offers a suite of smart marketing support packages, including SmartModeler (composed of KXEN K2C and K2R). In fact, the primary focus of KXEN is to provide other companies with embedded data mining capabilities. This business model can't fail to win in the future. Data mining must become function-based, rather than tool-based. The analytical functions necessary to support mining of nonlinear data sets must become integrated into the very structure of other software tools, similar to that of arithmetic operations. Business users of standard vertical industry tools must be able to take data mining tools for granted. Today, there is as much art as science in data mining. This is due primarily to the structure of the analytical tools. Yes, there are problems in every data set that must be solved, and wrinkles that must be smoothed before running the algorithm. But many of these issues can be handled automatically, at least from a theoretical standpoint. The trick is to invent automated tools that perform the same operations as humans perform or obviate them. KXEN does both, to a great extent.
KXEN automatically performs:
- Data standardization and recoding,
- Data segmentation into "smart" segments,
- Creation of many derived variables from combinations and transforms of existing variables,
- Handling of missing data by inserting intuitive inferred values.
Other requirements that usually must be met in data preparation are removed completely. A good example is in classification model evaluation. Most data mining tools evaluate models using the Global Accuracy approach (see Part I). This approach evaluates the accuracy of both the positive and negative classes of a binary classification. But many data mining applications (e.g., CRM applications) focus on one or the other. KXEN focuses its evaluation on the accuracy of the positives only, as reflected in the Ki metric. Many candidate models can be built with KXEN, using different parameters, and the best model can be selected by comparing Ki values. No expression language is available in KXEN for creating new variables, but Version 3.3 has a nice facility for doing so. Also, Version 3.3 provides a scripting capability to permit a user to save a modeling session into a script for running later.
All you have to do with KXEN is enter the names and paths of the input and output data sets, select the variables to include in the model, and hit the "generate" button, and in a very short time, the solution is found. How it is found is part of the genius of KXEN. The approach to identifying the best model follows the structural risk minimization theory of Vladimir Vapnik. Vapnik's theories represent the next generation of statistical analysis: First Generation - Parametric Analysis of Fisher; Second Generation - General Linear Model, Parametric Non-Linear, and Categorical Analysis; Third Generation - Machine Learning; Fourth Generation - Statistical Learning Theory. I expect that all the data mining tools will gravitate toward Statistical Learning Theory, because the theoretical basis is much more generalized, and it avoids many of the assumptions that constrain other approaches (normality, linearity, independence, etc.).
Present Uses of KXEN
Even though KXEN is designed for ease in embedded use, over 80 percent of its customers use it in standalone mode. Why? The reason is not that KXEN has the best user interface; it does not. Nor is the reason that it enables a large variety of statistical and machine learning analyses; KXEN uses only one learning technique. I believe the reason is two-fold: velocity of model building activities and automation of the data mining process.
Velocity
KXEN goes beyond rapid prototyping to rapid final model production. Users at a major bank in the UK, for example, create thousands of models to direct specific customer interactions to optimize business via the myriad of combinations of offers, channels, and customers. One-to-one inbound marketing programs based on customer profiles driven by propensity models can generate huge rewards for companies like the UK bank.
Automation
When using other data mining tools, as much as 90 percent of the time spent in building a model is consumed by data preparation. KXEN does almost all of this automatically. The only data preparation that must be done is to create the Customer Analytic Record (CAR). A CAR suitable for analysis must hold all of the variables associated with a given customer in a single data record. Also, appropriate dummy variables must be created to eliminate codes like "other" and "unknown"; otherwise the modeling algorithm will pick up on these as modeling variables. Creation of the CAR can be done easily in SQL. Modeling of the CAR can be done via ODBC, flat file, or via a call-level interface. Scoring of the database can be done in SQL, because KXEN is the only tool that I know of that can output models directly in SQL. The closest tool to KXEN in this capacity is SAS, which can output models in SAS code to run against the host database.
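As a rough illustration of building a CAR in SQL, the following Python sketch uses the standard `sqlite3` module with an in-memory database. The table names, column names and sample rows are all hypothetical, and treating "unknown" as the all-zero case of the dummy variables is one possible reading of the dummy-variable advice above, not a documented KXEN requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, channel TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'web'), (2, 'unknown'), (3, 'store');
    INSERT INTO orders VALUES (1, 20.0), (1, 35.0), (3, 12.5);
""")

# One output row per customer: order history is aggregated, and the channel
# code becomes 0/1 dummy variables. The 'unknown' code gets no dummy of its
# own, so it cannot enter the model as a variable.
car_sql = """
    SELECT c.customer_id,
           COUNT(o.amount)              AS order_count,
           COALESCE(SUM(o.amount), 0.0) AS total_spend,
           CASE WHEN c.channel = 'web'   THEN 1 ELSE 0 END AS channel_web,
           CASE WHEN c.channel = 'store' THEN 1 ELSE 0 END AS channel_store
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.channel
    ORDER BY c.customer_id
"""
car = conn.execute(car_sql).fetchall()
conn.close()
```

The `LEFT JOIN` keeps customers with no orders in the CAR, which matters because a propensity model needs the non-buyers as well as the buyers.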
Preview of KXEN Version 3.3
This version is still in beta form, but I did get a preview of it. My initial impression is ... wow! When this version is released, it will provide the capabilities of a complete data integration and data mining workbench. Insightful Miner is the only tool that comes close. It will have many new tools to facilitate a full range of data operations from legacy source systems to automated model scoring. The new ETL capabilities will work with most common databases to permit virtually anything you can do in SQL, but without the pain of writing the code. This ETL function generator is so strong, I can imagine that some of my old data warehousing cronies might use it just as a SQL front end. My favorite new tool, though, is a wonderful automatic iteration tool, which permits you to find the minimum number of variables that produce more than a user-definable threshold percentage of the total performance. For example, you can set the threshold to 98 percent, and it will find the fewest variables that meet the threshold, using one of several loss functions. I have spent many days performing this operation manually.
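The kind of search this iteration tool automates can be sketched in a few lines of Python. This is not KXEN's algorithm, which is not described here; it is a simple greedy backward elimination, which finds a small variable set but is not guaranteed to find the true minimum. The function name, the variable names and the additive toy scoring function are all assumptions standing in for a real model-fitting step.

```python
def smallest_variable_set(variables, score, threshold=0.98):
    """Greedy backward elimination: repeatedly drop the variable whose
    removal hurts least, stopping before the score falls below
    threshold * (score of the full variable set)."""
    selected = list(variables)
    target = threshold * score(selected)
    while len(selected) > 1:
        # Build every subset that omits exactly one remaining variable.
        candidates = [[v for v in selected if v != drop] for drop in selected]
        best = max(candidates, key=score)
        if score(best) < target:
            break  # any further removal would breach the threshold
        selected = best
    return selected


# Toy stand-in for model performance: each variable adds a fixed amount.
contribution = {
    "income": 0.50,
    "age": 0.30,
    "tenure": 0.185,
    "region": 0.01,
    "gender": 0.005,
}


def toy_score(subset):
    return sum(contribution[v] for v in subset)


kept = smallest_variable_set(list(contribution), toy_score, threshold=0.98)
```

With a real model, `score` would refit and evaluate the model on each candidate subset, which is why doing this by hand takes days.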
The Future for KXEN?
As for Clementine, the future of KXEN may lie outside the box of its GUI. I expect to hear of many more analytical and planning packages (e.g., SAP) incorporating data mining capabilities as enablers. Data mining is not really an "end" per se, but a means to an end. These "means" will become progressively submerged in the infrastructure of the products they serve until they are as natural to use as standard arithmetic and graphical techniques. KXEN will undoubtedly lead the charge in this direction.
XL-Miner
This tool scored the lowest in the features analysis, but not by much. It was overshadowed by the other tools largely because of its lack of ETL and descriptive statistical capabilities in the interface. But that lack (and several others in the tool) is partially compensated for by the integration of XL-Miner into Excel. When you take into account all the capabilities of the Excel spreadsheet (including the analysis plug-in), many of these apparent gaps disappear. Eleven percent of KD-Nuggets viewers use Excel as a data mining tool. Presumably, they use Excel for data exploration and analysis prior to modeling in another tool. For that reason, XL-Miner was given a moderate score for the capabilities absent in the tool, but present in Excel.
The great benefit of its integration with Excel is offset to some degree by XL-Miner's greatest weakness - limitation of data set size by spreadsheet limits (65,536 rows and 256 columns). This is not a fatal weakness; many CRM models can be trained acceptably on relatively small samples of the data universe. However, for medium to large applications, another tool must be used. Another plus for XL-Miner is its low cost ($850 - general; $100 - student), which makes it affordable as an addition to a tool with a larger analysis capacity. This means that you can do all of your data assessment, data exploration and reduction of the number of variables to be submitted to the modeling algorithm (dimensionality) directly in Excel on samples of the total data set (if necessary). The only major weaknesses left (in CRM applications) are the lack of ETL capabilities and limitation on data set size.
Apart from those weaknesses, XL-Miner is a very complete tool for CRM analyses. It includes menu options for data sampling, handling of missing values, binning of continuous data (for categorical analysis), transformation of categorical data, and data set partitioning. And, the partitioning option includes one capability that does not exist in any other tool, except Clementine - Over-sampling (balancing). Clementine does it via the Distribution and Balance Nodes. XL-Miner does it by the menu option, "Partitioning with over-sampling." This option permits the user to set the desired relative frequency of the positive class (50 percent by default) to apply to data sampling. When analyzing a data set for direct marketing (for example), the tool collects all of the rare responder class, and randomly samples the non-responder class to equal the number of responders (if the desired relative frequency is set to 50 percent). The balanced data set is suitable for submission to a neural net or decision tree algorithm. This is a very valuable capability for CRM data mining. For many data mining tools, data set balancing must be done manually, either by using the general database/spreadsheet functions of the tool or by some other tool.
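The balancing step described above is easy to express in code. The following Python sketch keeps every record of the rare class and randomly samples the common class to reach a desired relative frequency. It illustrates the idea only and is not XL-Miner's implementation; the function name `balance`, the `responded` field and the 5 percent response rate are assumptions.

```python
import random


def balance(records, is_positive, positive_share=0.5, seed=0):
    """Keep all positives and randomly sample negatives so that positives
    make up roughly `positive_share` of the result."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    # Negatives needed for the requested class mix.
    wanted = round(len(positives) * (1 - positive_share) / positive_share)
    sampled = rng.sample(negatives, min(wanted, len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)  # avoid presenting all positives first
    return balanced


# Hypothetical mailing file with a 5 percent response rate.
records = [{"id": i, "responded": i % 20 == 0} for i in range(1000)]
balanced = balance(records, lambda r: r["responded"])
responders = sum(1 for r in balanced if r["responded"])
```

Strictly speaking this under-samples the non-responders rather than duplicating responders, which matches the procedure described in the text.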
Three options are provided for data reduction (reduction of dimensionality): principal components analysis, hierarchical clustering and K-means clustering. These tools can help to identify the set of variables that have a sufficient relationship with the target variable to include them in the analysis. Data exploration is enhanced beyond the capabilities of Excel by the provision of a scatter plot matrix to help identify the final short-list of variables used for modeling. Only Statistica Data Miner, among the tools evaluated here, has that capability. Prediction and classification are accomplished with a relatively rich set of standard data mining algorithms, including CART and a naive Bayes classifier. There is even an association algorithm for creating association rules.
Finally, XL-Miner provides a flexible metadata mapping option for relating variables in the model with those in data sets to be scored by the model. The only other tool that does that is Clementine. This capability permits scoring of data sets from different systems with different metadata. Variables among the modeling and scoring data sets may be identical but named differently. Or, variables might not be identical, but close enough for them to be mapped to each other. For example, a model might be trained on household income, but the deployment data set might be scored on the basis of median income in the census block where a given prospect lives (because household income is not available for the deployment population). This capability can be of enormous benefit for deploying models in an industry vertical market that were developed in another. This approach can help to jumpstart modeling operations for a new product or a new vertical market lacking historical data to support modeling.
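The mapping idea can be sketched as a simple lookup from model variable names to the field names found in the data set to be scored. The Python below is a hypothetical illustration, not XL-Miner's mechanism; the field names and the `map_record` helper are invented for the example.

```python
def map_record(record, mapping):
    """Rename scoring-set fields to the variable names the model expects.

    `mapping` goes from model variable name to scoring-set field name.
    """
    missing = [src for src in mapping.values() if src not in record]
    if missing:
        raise KeyError(f"scoring record lacks fields: {missing}")
    return {model_var: record[src] for model_var, src in mapping.items()}


# Model trained on household income; the deployment data only has the
# median income of the census block (all names are hypothetical).
mapping = {"household_income": "block_median_income", "age": "cust_age"}
prospect = {"cust_age": 41, "block_median_income": 58000, "zip": "30301"}

model_input = map_record(prospect, mapping)
```

Fields that the model does not use, such as `zip` here, are simply left out of the mapped record.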
The Future for XL-Miner
One growth path for XL-Miner is to supplement the relatively sparse set of database sources for input of data to Excel. The lack of integration with other data mining tools is a hole that should be plugged. Particularly desirable are capabilities to import and export SAS and SPSS data sets. If XL-Miner is to be used in conjunction with other data mining tools, it must be able to interface with common tools that data miners use. Currently, database import capabilities are constrained by those of Excel (SQL-Server, Access, dBase/FoxBase, Oracle and Paradox). It is a relatively easy task to develop ODBC drivers for other database systems configured for use in Excel. Other candidates for inclusion are: NCR Teradata, IBM UDB and SP2, SAP and PeopleSoft. Both input and output capabilities should be provided.
The other growth path for XL-Miner is removal of the current data set size limitation of 65,536 records. XL-Miner could include "paging" operations through large data sets by analyzing blocks of data in different tabs of the spreadsheet, each of which can contain 65,536 rows. This means that the XL-Miner macros must be modified to page through large data sets like word processors page through large text documents, keeping track of top and bottom virtual block references. Large data sets could be read into a set of spreadsheet tabs (sheets) in blocks of 65,536 records. By including sheet references in processing streams of the macros, a neural net (for example) could be trained on data sets much larger than 65,536 records. This sort of processing was commonplace in the old DOS world of PC applications limited by the 640K addressable memory constraint. Maybe Microsoft will follow this path someday with the development of Excel itself. Another fascinating prospect is for Microsoft to acquire XL-Miner and add it to the list of available plug-ins furnished with the tool. Until then, XL-Miner will just have to hoof it alone.
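The paging scheme proposed above can be sketched in Python as a loop over sheet-sized blocks. This is an illustration of the idea only, with no connection to XL-Miner's macros; the function names and the use of a simple mean as the "analysis" are assumptions made to keep the example short.

```python
from itertools import islice

SHEET_ROWS = 65_536  # row limit of a single spreadsheet tab


def blocks(records, block_size=SHEET_ROWS):
    """Yield successive sheet-sized blocks from any iterable of records."""
    iterator = iter(records)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


def paged_mean(records, block_size=SHEET_ROWS):
    """Compute a mean one block at a time, carrying only running totals
    from sheet to sheet. Assumes the input is not empty."""
    total, count, sheets = 0.0, 0, 0
    for block in blocks(records, block_size):
        total += sum(block)
        count += len(block)
        sheets += 1
    return total / count, sheets


# 200,000 hypothetical records, far more than fit on one sheet.
mean, sheets = paged_mean(range(200_000))
```

Only one block is held in memory at a time, so the same loop structure would let an incremental learning algorithm see every record without ever exceeding the per-sheet limit.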
How do you choose the best data mining tool for your use in CRM? The answer is not a simple one. Some factors that you might consider, and the tools best suited for them, include:
Expected Data Mining Venue
In the data mining tool?
- SPSS Clementine
- SAS Enterprise Miner
- Statistica Data Miner
- Insightful Miner
In the database?
- Statistica Data Miner
- SAS Enterprise Miner
Embedded in an application?
- KXEN
- SAS Enterprise Miner
In financial operations based on spreadsheets?
- XL-Miner
- SAS Add-in for Microsoft Office
Academics?
- XL-Miner
- SAS (Academic license)
- SPSS Clementine (student edition)
Expected Purpose of the Models
To support direct mail operations?
- SPSS Clementine
- Statistica Data Miner
- SAS Enterprise Miner
To support management rules reporting?
- SAS Enterprise Miner (decision trees)
- SPSS Clementine
- XL-Miner
To support sales forecasting?
- Statistica Data Miner (time-series algorithms)
- SAS Forecast Server, in conjunction with SAS-EM
To support strategic marketing operations?
- Insightful Miner (slicing and dicing capability)
To support customer behavior modeling?
- SPSS Clementine
- Statistica Data Miner
- SAS Enterprise Miner
To support Six-Sigma industrial applications?
- Statistica Data Miner (Six-sigma algorithms).

