The Truth about Data Mining


A business intelligence (BI) implementation can be considered two-tiered. The first tier comprises standard reporting, ad hoc reporting, multidimensional analysis, dashboards, scorecards, and alerts. The second tier is more commonly found in organizations that have successfully built a mature first tier. Advanced data analysis through predictive modeling and forecasting defines this tier—in other words, data mining.

Data mining has broad reach and application. It can be applied in any situation where potentially useful knowledge must be discovered in vast amounts of data. Throughout this article, the word knowledge refers to meaningful patterns, derived through data mining techniques, that can advance an organization's goals (such as higher company revenue, Web site traffic, or crop yield, and improved health care). The field of data mining brings together techniques from statistics; machine learning (the design and development of algorithms that allow systems to learn and to improve their own performance based on their own experience); neural networks (mathematical or computational models based on nervous systems); database technology; high-performance computing (the use of supercomputers and computer clusters); and spatial data analysis (techniques to study entities using their topological, geometric, or geographic features), to name a few. Data mining is a complex area of study and is still considered esoteric and difficult to implement in many BI environments.

The Raison d'Etre

Data mining refers to the process of extracting hidden patterns from large amounts of data. The term mining is often used in conjunction with an end product, such as gold or coal; however, the end product of data mining is not data, but knowledge. Data mining is used in a variety of situations, but here are the most common business scenarios in which it can be considered a potential solution:

  • Data explosion. When the amount of data grows significantly, only specialized statistical models can help uncover important patterns; in this situation, simple reporting and multidimensional analysis may fail.

  • Predicting behavior. There are situations in which organizations may need to predict customer behavior. For instance, churn analysis helps organizations identify which of their customers are likely to leave them for competitors. Modeling diseases in an animal population based on information relevant to the species in question can, through predictions, help estimate risk of disease.

  • Cross-selling. More commonly known in this scenario as market basket analysis, data mining can provide insight into cross-selling patterns. Online bookstores use this technique to recommend books related to the one being reviewed or purchased.

  • Taxonomy formations. Data mining can be applied in situations in which training data (data that is used to train a mining model) is missing any class labels. Class labels are used to conceptualize data. For instance, in an analysis that examines the relationships between seasonality and sales of products, seasonality can be characterized as spring, summer, and fall. Clustering or segmentation is the process of partitioning data into classes, or even hierarchies of classes, for which members of a group have similar characteristics.

  • Forecasting. In order to estimate future values of entities, forecasting techniques need to be applied to data. For instance, by forecasting the demand for its products in the future, a retailer can plan production.

Why Not OLAP or Statistics?

Data mining includes advanced techniques for understanding data that far exceed the capabilities of online analytical processing (OLAP). OLAP tools provide the means to perform multidimensional analysis by using powerful algorithms for aggregating data. While OLAP can help look at the sales of a certain product within a specific region and time period, data mining can discover relationships between various attributes in the data and deduce why sales in a certain region may have dropped over a certain time period. OLAP and data mining are frequently used in conjunction with each other, and we often find a happy coexistence of these two technologies in data warehousing and BI environments.

Comparing statistics and data mining, on the other hand, is not as straightforward. A principal reason is that they belong to two separate branches of study: mathematics and computer science, respectively. While data mining involves exploring large amounts of data (gigabytes or terabytes) to discover otherwise hidden or unknown patterns, statistics is concerned with confirming a hypothesis by establishing a model and providing evidence that either supports the underlying theory or fails to support it. Moreover, many statistical packages may not even handle the volumes of data that are quite normal in data mining processes.

Another distinction is that data collection is a principal component of statistics. Assembling data that is appropriate to test a hypothesis is of paramount importance. Data mining, on the other hand, is applied to data that has already been collected. As a result, data mining, rather than statistical techniques, naturally fits in BI environments.

Architecture of a Data Mining System

In describing the architecture of a data mining system, we assume the presence of a data warehouse or data store containing organizational data. Although data mining can be applied to a wide range of data sources, it is beneficial to start with a data warehouse in which facts and dimensions have been identified and a data cleansing framework is in place in order to ensure good data quality.

1. The Knowledge Base
The "crust" of a data mining system is an organization's knowledge base. This is the domain knowledge that describes an organization's data. It includes concept hierarchies that organize attributes or attribute values from low-level concepts or classes to high-level or general concepts. Concepts can be implicit, such as addresses that are described by number, street, city, state, and country. Concept hierarchies can also be created by organizing values. An example of such a hierarchy, commonly known as a set-grouping hierarchy, is company size. It can be defined as micro (< 5 employees), small (5 to 100 employees), medium (100 to 500 employees), and large (> 500 employees).
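The set-grouping hierarchy above can be sketched in Python as a mapping from a raw value to a higher-level concept. The function name and the treatment of the shared boundary values (100 and 500 employees) are assumptions made for illustration; the text does not say which class those values belong to.

```python
def company_size(employees: int) -> str:
    """Map an employee count to a company-size concept.

    Assumption: boundary values fall into the lower class,
    so 100 employees is "small" and 500 employees is "medium".
    """
    if employees < 0:
        raise ValueError("employee count cannot be negative")
    if employees < 5:
        return "micro"
    if employees <= 100:
        return "small"
    if employees <= 500:
        return "medium"
    return "large"


# Roll a few raw values up to their higher-level concept.
sizes = [company_size(n) for n in (3, 5, 100, 101, 500, 501)]
print(sizes)
```

A mining tool would typically apply such a mapping before analysis so that patterns are found at the level of "small" or "large" companies rather than individual head counts.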

Interestingness measures constitute another example of domain knowledge. These measures help rank or filter the rules that are generated from data to determine the patterns that will be most useful for a business. They can include objective measures, which are identified statistically, and subjective measures, which are derived from user beliefs about relationships in the data and help evaluate how expected or unexpected the results of data mining are. The knowledge base is an essential input at all stages of the data mining process.
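As a hedged illustration of objective measures, the sketch below computes support, confidence, and lift for a rule of the form "A implies B". These three measures are widely used to rank association rules, but the article does not name them, so treat the choice as an assumption. The transactions and function names are made up for the example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, set(antecedent) | set(consequent)) / base


def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall.

    Values above 1 suggest the items occur together more often than chance.
    """
    base = support(transactions, consequent)
    if base == 0:
        raise ValueError("consequent never occurs in the transactions")
    return confidence(transactions, antecedent, consequent) / base


# Hypothetical purchase history: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
print(lift(baskets, {"bread"}, {"milk"}))
```

Rules can then be sorted or filtered by these values, for example by discarding any rule whose support falls below a threshold chosen by the business.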

2. The Data Mining Process

Figure 1. Creation of the data mining model.

Discussion of the data mining process in this article is centered on modeling and assessment. The data mining model constitutes the core, or the center, of data mining. The first step is the creation of the model through the selection of data relevant to the goal of data mining. For example, if an education research exercise requires the study of the performance of students across several cities in a specific state, only data from that state is relevant. Furthermore, if the goal is to study relationships between student attendance and the occupation and income of parents, the attributes relevant to the study include the attendance of the entity student (and not the grades or age) and the income and occupation of the entity parent (and not the age or ethnicity).
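The selection step described above can be sketched as a filter on rows followed by a projection onto the relevant attributes. The record layout, field names, and sample values below are invented for illustration only.

```python
# Hypothetical source records; a real system would read these from the warehouse.
records = [
    {"state": "TX", "city": "Austin", "attendance": 0.95, "grade": "A",
     "student_age": 15, "parent_income": 72000, "parent_occupation": "nurse"},
    {"state": "TX", "city": "Dallas", "attendance": 0.81, "grade": "C",
     "student_age": 16, "parent_income": 41000, "parent_occupation": "driver"},
    {"state": "OK", "city": "Tulsa", "attendance": 0.90, "grade": "B",
     "student_age": 15, "parent_income": 55000, "parent_occupation": "teacher"},
]

# Only the attributes tied to the stated goal are kept.
RELEVANT_ATTRIBUTES = ("attendance", "parent_income", "parent_occupation")


def select_training_data(rows, state, attributes=RELEVANT_ATTRIBUTES):
    """Keep rows for one state and drop attributes unrelated to the goal."""
    return [
        {name: row[name] for name in attributes}
        for row in rows
        if row["state"] == state
    ]


training_rows = select_training_data(records, "TX")
print(training_rows)
```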

Once the goal or aim of the data mining exercise is established, the choice of data mining function or algorithm has to be made. The model is structured to store results found by the data mining algorithm. The following table broadly outlines commonly used algorithms (a full discussion on these algorithms is beyond the scope of this article).

Algorithm Description

Association Rules

This algorithm helps uncover items that are associated with each other. A well-known implementation of this algorithm is market basket analysis, where a question such as "If a customer purchases items A and B, what else is he or she likely to purchase?" is answered by examining associations of A and B with other items purchased in the past.

Clustering

Clustering creates groups of data objects based on their similarity. Objects within the same cluster are similar to each other and dissimilar to objects in other clusters. Clustering has a wide applicability: in biology to develop taxonomies; in business to group customers based on purchasing behavior; and in geography to group locations.

Decision Trees

A decision tree is a structure where a branch or a split divides the dataset to partition data distribution. Each split is based on an attribute that causes a significant division in the data. Predictions can be made by applying the new attribute values to the decision tree.

Naïve Bayes

The Bayes algorithm has a systematic method of learning based on evidence. It combines conditional and unconditional probabilities to calculate the probability of a hypothesis.


Regression

Regression helps discover the dependency of the value of one attribute on the values of other attributes within the same entity or object. Regression is similar to decision trees in that it helps classify the data, but it predicts continuous rather than discrete attributes.

Time Series

A time series represents data at various intervals of time or any other indicator of chronology. The time series algorithm is used to forecast future values such as demand and Web site traffic using techniques in autoregression (a special branch of regression analysis dedicated to the analysis of time series) and decision trees.
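To make the Naïve Bayes entry above more concrete, the sketch below applies Bayes' theorem to combine an unconditional probability (the prior) with conditional probabilities (the likelihoods). It is only the core calculation, not a full Naïve Bayes classifier, which would multiply likelihoods across several attributes under an independence assumption. The churn scenario, the numbers, and the function name are hypothetical.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' theorem for a hypothesis H and a piece of evidence E.

    prior                = P(H)
    likelihood           = P(E | H)
    likelihood_given_not = P(E | not H)
    Returns P(H | E).
    """
    for p in (prior, likelihood, likelihood_given_not):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    if evidence == 0:
        raise ValueError("the evidence has zero probability")
    return likelihood * prior / evidence


# Hypothetical figures: 20% of customers churn; 50% of churners filed a
# complaint, compared with 10% of customers who stayed.
p_churn_given_complaint = posterior(
    prior=0.2, likelihood=0.5, likelihood_given_not=0.1
)
print(round(p_churn_given_complaint, 4))
```

Under these assumed figures, a complaint raises the estimated probability of churn from 20 percent to roughly 56 percent.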

Figure 2. Training the data mining model.

Model training involves running the data mining algorithm on historical data (also known as training data). The algorithm analyzes and finds relationships in the data. These are produced as patterns and stored in the data mining model to create a trained mining model. Training can be a lengthy process as it involves applying the mining algorithm to vast amounts of data in an iterative manner.

Figure 3. Predicting with the trained mining model.

Prediction involves passing a new set of data through the trained model. Rules and patterns found in training are applied to the data to create predictions. Prediction can be applied in real time to act on data as it arrives. The trained mining model represents all possible values of relevant attributes and includes a probability value associated with each combination. Prediction can imply the process of determining a discrete value or class label (as in classification techniques), or the prediction of continuous values (as in regression techniques).
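The train-then-predict cycle can be sketched with a very simple model, a one-variable least-squares regression. This is an illustration under stated assumptions: the class name, the data, and the choice of algorithm are hypothetical, and a real mining model stores far richer patterns than a slope and an intercept.

```python
class LinearModel:
    """A toy mining model: training stores the pattern, prediction applies it."""

    def __init__(self):
        self.slope = None
        self.intercept = None

    def train(self, xs, ys):
        """Fit y = intercept + slope * x to historical (training) data."""
        n = len(xs)
        if n < 2 or n != len(ys):
            raise ValueError("need at least two paired observations")
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            raise ValueError("all x values are identical")
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        """Apply the stored pattern to a new, unseen value."""
        if self.slope is None:
            raise RuntimeError("model has not been trained")
        return self.intercept + self.slope * x


# Hypothetical history that follows y = 2x + 1 exactly.
history_x = [1, 2, 3, 4]
history_y = [3, 5, 7, 9]

model = LinearModel().train(history_x, history_y)
print(model.predict(5))
```

Because the result is a continuous value, this corresponds to the regression style of prediction; a classification model would return a class label instead.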

3. Assessment
The final step is the assessment of the data mining model. A prudent approach to data mining is to build several models. This is done either by applying multiple algorithms to the same dataset, or by building multiple models by tuning the same algorithm until the desired level of accuracy is achieved. Predictions against the model can be compared to known results to arrive at a measure of accuracy. It is advisable to separate data used for testing a model from data used to train a model.
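Keeping test data separate from training data, and then comparing predictions to known results, might look like the following sketch. The split fraction, the fixed seed, and the function names are assumptions chosen for the example.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Share of predictions that match the known results."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


train_rows, test_rows = split_train_test(list(range(10)))
print(len(train_rows), len(test_rows))

# Hypothetical predictions compared with known outcomes on the test set.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

The same accuracy function can be applied to each candidate model so that models built with different algorithms or settings are compared on identical held-out data.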

A cumulative gains chart is among several techniques that test the accuracy of a model. In a cumulative gains chart, the accuracy of a model is estimated for a target value chosen by the user. For example, the target can be a percentage of customers who will respond to an e-mail campaign. A baseline (or random model) always indicates that X percent of the target will be achieved with X percent of the data. This indicates results of a campaign for which users are picked at random rather than by using a mining model. Using the predictions of the model, the percentage of positive responses is mapped to the percentage of data selected to create a lift curve. The chart below illustrates the following example.

  1. From the data used for testing, we know that 40 percent of the data represents the target. An ideal model would therefore reach 100 percent of the target with the first 40 percent of the data.

  2. Using the predictions of the model, it is observed that the model can reach 100 percent of the target with 90 percent of the data.

  3. If we were to select 50 percent of the data using the mining model (see lift curve), we would reach 90 percent of the target, which is 36 percent of the data (i.e., 90 percent of 40 percent).

  4. If we were to pick the same 50 percent of customers at random (see baseline), we would reach only 50 percent of the target, which is 20 percent of the data (i.e., 50 percent of 40 percent).

Figure 4. Cumulative gains chart.

The closer the lift curve is to the ideal model (and consequently, the greater the area between the baseline and the lift curve), the better the predictive accuracy of the model.
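The points of a lift curve can be computed by ranking the test cases by model score and counting how many targets have been captured at each cutoff. The sketch below uses ten made-up customers, four of whom responded (40 percent, as in the example above); the scores and the function name are hypothetical, and the resulting percentages differ from those in the article's chart.

```python
def cumulative_gains(scores, labels):
    """Return (fraction_of_data, fraction_of_targets_captured) points.

    `scores` are model outputs (higher means more likely to be a target);
    `labels` are the known results, 1 for a target and 0 otherwise.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_targets = sum(labels)
    if total_targets == 0:
        raise ValueError("test data contains no targets")
    # Rank cases from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    captured = 0
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((i / len(ranked), captured / total_targets))
    return points


# Hypothetical test set: 10 customers, 4 of whom responded.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

curve = cumulative_gains(scores, labels)
data_fraction, target_fraction = curve[5]
print(data_fraction, target_fraction)
```

At the 50 percent cutoff this toy model captures 75 percent of the responders, whereas the baseline would capture 50 percent by definition.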


Data Mining Vendors

SAS is a leader in the data mining market and has an impressive track record of successful implementations. Its Enterprise Miner offers a wide range of predictive analytics and visualization capabilities. The product encapsulates SAS' data mining process that it calls SEMMA: sampling (extracting a representative sample that can be manipulated easily, and partitioning data for training and testing); exploration (searching for unexpected trends or patterns through visual exploration or statistical techniques); modification (iteratively processing data to focus on relevant data and include new data periodically); modeling (applying data mining algorithms to generate predictions); and assessment (testing the model for quality and accuracy).

SPSS offers several families of products for statistical analysis and data mining. The PASW Modeler provides advanced analytical functions and visualization. It promises to integrate seamlessly with existing IT infrastructure, and uses multithreading, clustering, and embedded algorithms for high performance and scalability. In addition to a wide range of mining algorithms, SPSS offers Web mining and text analytics as add-on products.

Angoss Software offers an on-demand customer analytics solution focused on addressing sales and marketing strategies. Its KnowledgeSEEKER provides visualization for data exploration; and the KnowledgeSTUDIO represents the tool for modeling, with access to a variety of algorithms including decision trees, regression, and clustering.

Microsoft has taken a significant step into the data mining arena with the release of SQL Server 2005. SQL Server data mining is one of the components of Microsoft's BI suite. It includes several data mining algorithms developed through collaboration between the Microsoft research and SQL Server product teams. SQL Server data mining integrates with other parts of the BI suite: analysis services, integration services, and reporting services.

In Conclusion

It is essential to lay the groundwork for the complex process of data mining. This includes having a thorough understanding of business data entities and their interrelationships. In addition, data mining must not be a one-time process. Rather, in each instance, it must be applied iteratively, and training data must be reviewed and maintained periodically. When applied appropriately, it has the potential to uncover knowledge, the "gold" in business.
