Data Mining Methods
A variety of methods are available for performing data mining studies, including classification, regression, clustering, and association. Most data mining software tools employ more than one technique (or algorithm) for each of these methods. This section describes the most popular data mining methods and explains their representative techniques.
Classification
Classification is perhaps the most frequently used data mining method for real-world problems. As a popular member of the machine-learning family of techniques, classification learns patterns from past data (a set of information – traits, variables, features – on characteristics of the previously labeled items, objects, or events) in order to place new instances into their respective groups or classes. For example, one could use classification to predict whether the weather on a particular day will be “sunny,” “rainy,” or “cloudy.” Popular classification tasks include credit approval, store location, fraud detection, and telecommunication. If what is being predicted is a class label, the prediction problem is called classification, whereas if it is a numeric value, the prediction problem is called regression.
Even though clustering can also be used to determine groups of things, there is a significant difference between classification and clustering. Classification learns the function between the characteristics of things and their membership through a supervised learning process where both types of variables (the inputs and the class labels) are presented to the algorithm. In clustering, the membership of the objects is learned through an unsupervised learning process where only the input variables are presented to the algorithm. Unlike classification, clustering does not have a supervising mechanism that guides the learning process. Instead, clustering algorithms use one or more heuristics to discover natural groupings of objects.
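The difference in what each kind of algorithm is given can be illustrated with a minimal sketch. It is not taken from any particular tool: the function names, the one-dimensional toy data, and the choice of a nearest-centroid classifier and a two-cluster k-means heuristic are all assumptions made for illustration.

```python
def fit_centroids(values, labels):
    """Supervised: class labels are supplied, so each class mean is computed directly."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def classify(centroids, value):
    """Assign a new instance to the class whose mean is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: no labels are supplied; a k-means heuristic with k = 2."""
    centers = [min(values), max(values)]  # simple starting guess (an assumption)
    for _ in range(iterations):
        clusters = [[], []]
        for value in values:
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            clusters[nearest].append(value)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = fit_centroids(values, labels)  # sees inputs and labels
predicted = classify(centroids, 4.9)
centers = two_means(values)                # sees inputs only
```

The classifier returns a named class because it was shown the names during training. The clustering routine can only return unnamed group centers, which an analyst must interpret afterward.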
The most common two-step methodology of classification-type prediction involves model development/training and model testing/deployment. In the model development phase, a collection of input data, including the actual class labels, is used. After a model has been trained, the model is tested against the holdout sample for accuracy assessment and eventually deployed for actual use where it is to predict classes of new data instances. Several factors are considered in assessing the model, including the following:
- Predictive accuracy
- Speed
- Robustness
- Scalability
- Interpretability
Estimating the True Accuracy of Classification Models
In classification problems, the primary source for accuracy estimation is the confusion matrix. Estimating the accuracy of a classification model induced by a supervised learning algorithm is important for the following two reasons: First, the estimate indicates the model's likely future prediction accuracy, which could imply the level of confidence one should have in the classifier’s output in the prediction system. Second, the estimate can be used for choosing a classifier from a given set. The following are among the most popular estimation methodologies used for classification-type data mining models.
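As a rough illustration of how accuracy is read off a confusion matrix, the sketch below counts (actual, predicted) label pairs and divides the matching pairs by the total number of instances. The function names and the small weather sample are hypothetical and are used only for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual label, predicted label) pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    matrix = confusion_matrix(actual, predicted)
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / len(actual)


# Hypothetical test-set labels and a classifier's predictions for them.
actual = ["sunny", "rainy", "sunny", "cloudy", "rainy", "sunny"]
predicted = ["sunny", "rainy", "cloudy", "cloudy", "sunny", "sunny"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)  # 4 of the 6 predictions are correct
```

The off-diagonal entries, such as `matrix[("rainy", "sunny")]`, show which classes the model confuses with one another, which a single accuracy figure hides.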
SIMPLE SPLIT The simple split partitions the data into two mutually exclusive subsets called a training set and a test set. It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set. The training set is used by the inducer, and the built classifier is then tested on the test set. An exception to this rule occurs when the classifier is an artificial neural network. In this case, the data is partitioned into three mutually exclusive subsets: training, validation, and testing. The validation set is used during model building to prevent overfitting.
The main criticism of this method is that it assumes the data in the two subsets are of the same kind (that is, similarly distributed). Because this is a simple random partitioning, in most realistic data sets where the data are skewed on the classification variable, such an assumption may not hold true. To improve this situation, stratified sampling is suggested, where the output variable defines the strata. Even though this is an improvement over the simple split, it still carries the bias associated with a single random partitioning.
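A stratified version of the simple split can be sketched as follows. This is one possible implementation under stated assumptions: the function name, the `label_of` accessor, the fixed random seed, and the synthetic records with a skewed "fraud" class are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, train_fraction=2 / 3, seed=0):
    """Split records into training and test sets, class by class.

    Each class (stratum) is shuffled and cut separately, so both subsets
    keep roughly the same class proportions as the full data set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Synthetic, skewed data: 9 "fraud" records out of 90.
records = [(i, "fraud" if i % 10 == 0 else "ok") for i in range(90)]
train, test = stratified_split(records, label_of=lambda r: r[1])
```

With these inputs the rare class is divided two-to-one between the subsets, just like the data as a whole. A plain random split gives no such guarantee and could leave the test set with very few "fraud" records, or none.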
k-FOLD CROSS-VALIDATION In order to minimize the bias associated with the random sampling of the training and holdout data samples in comparing the predictive accuracy of two or more methods, one can use a methodology called k-fold cross-validation. In k-fold cross-validation, also called rotation estimation, the complete data set is randomly split into k mutually exclusive subsets (folds) of approximately equal size. The classification model is trained and tested k times. Each time, it is trained on all but one fold and then tested on the remaining single fold. The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures.
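The rotation described above can be sketched in a few lines. The helper names, the fixed seed, and the toy data are assumptions for illustration, and the majority-class "learner" is only a placeholder standing in for a real classification technique.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, k, train, predict, seed=0):
    """Train on k-1 folds, test on the remaining fold, and rotate k times."""
    fold_accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in held_out)
        fold_accuracies.append(hits / len(held_out))
    # The cross-validation estimate is the average of the k fold accuracies.
    return sum(fold_accuracies) / k, fold_accuracies


def train_majority(train_features, train_labels):
    """Placeholder learner: remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature):
    """Placeholder predictor: always answer with the remembered label."""
    return model


features = list(range(20))
labels = ["ok"] * 15 + ["fraud"] * 5
mean_acc, fold_accs = cross_validate(
    features, labels, k=5, train=train_majority, predict=predict_majority
)
```

Because every instance is used for testing exactly once, the averaged estimate depends less on one lucky or unlucky partition than a single split does.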
ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES Other popular assessment methodologies include the following:
- Leave-one-out
- Bootstrapping
- Jackknifing
- Area under the ROC curve
CLASSIFICATION TECHNIQUES A number of techniques are used for classification modeling, including the following:
- Decision tree analysis
- Statistical analysis
- Neural networks
- Case-based reasoning
- Bayesian classifiers
- Genetic algorithms
- Rough sets