About the Book
Learn Data Mining by doing data mining
Data mining can be revolutionary, but only when it's done right. The powerful black-box data mining software now available can produce disastrously misleading results unless applied by a skilled and knowledgeable analyst. Discovering Knowledge in Data: An Introduction to Data Mining provides both the practical experience and the theoretical insight needed to reveal valuable information hidden in large data sets.
Employing a "white box" methodology and real-world case studies, this step-by-step guide walks readers through the various algorithms and statistical structures that underlie the software and presents examples of their operation on actual large data sets. Principal topics include:
* Data preprocessing and classification
* Exploratory analysis
* Decision trees
* Neural and Kohonen networks
* Hierarchical and k-means clustering
* Association rules
* Model evaluation techniques
Complete with scores of screenshots and diagrams to encourage graphical learning, Discovering Knowledge in Data: An Introduction to Data Mining gives students in Business, Computer Science, and Statistics as well as professionals in the field the power to turn any data warehouse into actionable knowledge.
An Instructor's Manual presenting detailed solutions to all the problems in the book is available online.
Table of Contents:
PREFACE
1 INTRODUCTION TO DATA MINING
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP–DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
Description
Estimation
Prediction
Classification
Clustering
Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises
2 DATA PREPROCESSING
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min–Max Normalization
Z-Score Standardization
Numerical Methods for Identifying Outliers
References
Exercises
3 EXPLORATORY DATA ANALYSIS
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Binning
Summary
References
Exercises
4 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
How Confident Are We in Our Estimates?
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Confidence Intervals for the Mean Value of y Given x
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
References
Exercises
5 k-NEAREST NEIGHBOR ALGORITHM
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias–Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Reference
Exercises
6 DECISION TREES
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises
7 NEURAL NETWORKS
Input and Output Encoding
Neural Networks for Estimation and Prediction
Simple Example of a Neural Network
Sigmoid Activation Function
Back-Propagation
Gradient Descent Method
Back-Propagation Rules
Example of Back-Propagation
Termination Criteria
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
References
Exercises
8 HIERARCHICAL AND k-MEANS CLUSTERING
Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict Churn
References
Exercises
9 KOHONEN NETWORKS
Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Cluster Validity
Application of Clustering Using Kohonen Networks
Interpreting the Clusters
Cluster Profiles
Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises
10 ASSOCIATION RULES
Affinity Analysis and Market Basket Analysis
Data Representation for Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension from Flag Data to General Categorical Data
Information-Theoretic Approach: Generalized Rule Induction Method
J-Measure
Application of Generalized Rule Induction
When Not to Use Association Rules
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns versus Global Models
References
Exercises
11 MODEL EVALUATION TECHNIQUES
Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Techniques for the Classification Task
Error Rate, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
Reference
Exercises
EPILOGUE: "WE'VE ONLY JUST BEGUN"
INDEX
About the Author:
DANIEL T. LAROSE received his PhD in statistics from the University of Connecticut. An associate professor of statistics at Central Connecticut State University, he developed and directs Data Mining@CCSU, the world's first online master of science program in data mining. He has also worked as a data mining consultant for Connecticut-area companies. He is currently working on the next two books of his three-volume series on Data Mining: Data Mining Methods and Models and Data Mining the Web: Uncovering Patterns in Web Content, scheduled to publish respectively in 2005 and 2006.
Review:
"...an excellent introductory book of data mining. I recommend it for every one who wants to learn data mining." (Journal of Statistical Software, May 2006)
"...selected material is described in a simple, clear, and…precise way...case studies…examples, and screen shots has definitely added to the learning value of the book." (Journal of Biopharmaceutical Statistics, January/February 2006)
"...does a good job introducing data mining to novices...it skillfully previews some of the basic statistical issues needed to understand data mining techniques." (Journal of the American Statistical Association, December 2005)
"If you need a book to help colleagues understand your data mining procedures and results, this is the one you want to give them." (Technometrics, November 2005)
"…an excellent book…it should be useful for anyone interested in analysing epidemiological data." (Statistics in Medical Research, October 2005)
"...an excellent 'white-box' overview of established approaches for data analysis, in which readers are shown how, why, and when the methods work." (CHOICE, April 2005)
"Larose has the making of a good series of books on data mining…I, for one, look forward to the next two books in the series." (Computing Reviews.com, February 15, 2005)