Data Mining
Third Edition
Data Modeling Essentials, 3rd Edition
Graeme C. Simsion, Graham C. Witt

Developing High Quality Data Models
Matthew West

Location-Based Services
Jochen Schiller, Agnes Voisard

Managing Time in Relational Databases: How to Design, Update, and Query Temporal Data
Tom Johnston, Randall Weis

Database Modeling with Microsoft® Visio for Enterprise Architects
Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean

Designing Data-Intensive Web Applications
Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, Maristella Matera

Mining the Web: Discovering Knowledge from Hypertext Data
Soumen Chakrabarti

Advanced SQL:1999—Understanding Object-Relational and Other Advanced Features
Jim Melton

Database Tuning: Principles, Experiments, and Troubleshooting Techniques
Dennis Shasha, Philippe Bonnet

SQL:1999—Understanding Relational Language Components
Jim Melton, Alan R. Simon

Information Visualization in Data Mining and Knowledge Discovery
Edited by Usama Fayyad, Georges G. Grinstein, Andreas Wierse

Transactional Information Systems
Gerhard Weikum, Gottfried Vossen

Spatial Databases
Philippe Rigaux, Michel Scholl, and Agnes Voisard

Managing Reference Data in Enterprise Databases
Malcolm Chisholm

Understanding SQL and Java Together
Jim Melton, Andrew Eisenberg

Database: Principles, Programming, and Performance, 2nd Edition
Patrick and Elizabeth O’Neil

The Object Data Standard
Edited by R. G. G. Cattell, Douglas Barry

Data on the Web: From Relations to Semistructured Data and XML
Serge Abiteboul, Peter Buneman, Dan Suciu

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 3rd Edition
Ian Witten, Eibe Frank, Mark A. Hall

Joe Celko’s Data and Databases: Concepts in Practice
Joe Celko

Developing Time-Oriented Database Applications in SQL
Richard T. Snodgrass

Web Farming for the Data Warehouse
Richard D. Hackathorn

Management of Heterogeneous and Autonomous Database Systems
Edited by Ahmed Elmagarmid, Marek Rusinkiewicz, Amit Sheth

Object-Relational DBMSs, 2nd Edition
Michael Stonebraker, Paul Brown, with Dorothy Moore

Universal Database Management: A Guide to Object/Relational Technology
Cynthia Maro Saracco

Readings in Database Systems, 3rd Edition
Edited by Michael Stonebraker, Joseph M. Hellerstein

Understanding SQL’s Stored Procedures: A Complete Guide to SQL/PSM
Jim Melton

Principles of Multimedia Database Systems
V. S. Subrahmanian

Principles of Database Query Processing for Advanced Applications
Clement T. Yu, Weiyi Meng

Advanced Database Systems
Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, Roberto Zicari

Principles of Transaction Processing, 2nd Edition
Philip A. Bernstein, Eric Newcomer

Using the New DB2: IBM’s Object-Relational Database System
Don Chamberlin

Distributed Algorithms
Nancy A. Lynch

Active Database Systems: Triggers and Rules for Advanced Database Processing
Edited by Jennifer Widom, Stefano Ceri

Migrating Legacy Systems: Gateways, Interfaces, and the Incremental Approach
Michael L. Brodie, Michael Stonebraker

Atomic Transactions
Nancy Lynch, Michael Merritt, William Weihl, Alan Fekete

Query Processing for Advanced Database Systems
Edited by Johann Christoph Freytag, David Maier, Gottfried Vossen

Transaction Processing
Jim Gray, Andreas Reuter

Database Transaction Models for Advanced Applications
Edited by Ahmed K. Elmagarmid

A Guide to Developing Client/Server SQL Applications
Setrag Khoshafian, Arvola Chan, Anna Wong, Harry K. T. Wong
Morgan Kaufmann Publishers is an imprint of Elsevier. 225 Wyman Street, Waltham, MA 02451, USA
© 2012 by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices, may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Han, Jiawei. Data mining : concepts and techniques / Jiawei Han, Micheline Kamber, Jian Pei. – 3rd ed. p. cm. ISBN 978-0-12-381479-
- Data mining. I. Kamber, Micheline. II. Pei, Jian. III. Title. QA76.9.D343H36 2011 006.3′12–dc22 2011010635
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.elsevierdirect.com
Printed in the United States of America 11 12 13 14 15 10 9 8 7 6 5 4 3 2 1
To Y. Dora and Lawrence for your love and encouragement
J.H.
To Erik, Kevan, Kian, and Mikael for your love and inspiration
M.K.
To my wife, Jennifer, and daughter, Jacqueline
J.P.
Contents
Foreword
Foreword to Second Edition
Preface
Acknowledgments
About the Authors
- Chapter 1 Introduction
- 1.1 Why Data Mining?
- 1.1.1 Moving toward the Information Age
- 1.1.2 Data Mining as the Evolution of Information Technology
- 1.2 What Is Data Mining?
- 1.3 What Kinds of Data Can Be Mined?
- 1.3.1 Database Data
- 1.3.2 Data Warehouses
- 1.3.3 Transactional Data
- 1.3.4 Other Kinds of Data
- 1.4 What Kinds of Patterns Can Be Mined?
- 1.4.1 Class/Concept Description: Characterization and Discrimination
- 1.4.2 Mining Frequent Patterns, Associations, and Correlations
- 1.4.3 Classification and Regression for Predictive Analysis
- 1.4.4 Cluster Analysis
- 1.4.5 Outlier Analysis
- 1.4.6 Are All Patterns Interesting?
- 1.5 Which Technologies Are Used?
- 1.5.1 Statistics
- 1.5.2 Machine Learning
- 1.5.3 Database Systems and Data Warehouses
- 1.5.4 Information Retrieval
- 1.6 Which Kinds of Applications Are Targeted?
- 1.6.1 Business Intelligence
- 1.6.2 Web Search Engines
- 1.7 Major Issues in Data Mining
- 1.7.1 Mining Methodology
- 1.7.2 User Interaction
- 1.7.3 Efficiency and Scalability
- 1.7.4 Diversity of Database Types
- 1.7.5 Data Mining and Society
- 1.8 Summary
- 1.9 Exercises
- 1.10 Bibliographic Notes
- Chapter 2 Getting to Know Your Data
- 2.1 Data Objects and Attribute Types
- 2.1.1 What Is an Attribute?
- 2.1.2 Nominal Attributes
- 2.1.3 Binary Attributes
- 2.1.4 Ordinal Attributes
- 2.1.5 Numeric Attributes
- 2.1.6 Discrete versus Continuous Attributes
- 2.2 Basic Statistical Descriptions of Data
- 2.2.1 Measuring the Central Tendency: Mean, Median, and Mode
- 2.2.2 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
- 2.2.3 Graphic Displays of Basic Statistical Descriptions of Data
- 2.3 Data Visualization
- 2.3.1 Pixel-Oriented Visualization Techniques
- 2.3.2 Geometric Projection Visualization Techniques
- 2.3.3 Icon-Based Visualization Techniques
- 2.3.4 Hierarchical Visualization Techniques
- 2.3.5 Visualizing Complex Data and Relations
- 2.4 Measuring Data Similarity and Dissimilarity
- 2.4.1 Data Matrix versus Dissimilarity Matrix
- 2.4.2 Proximity Measures for Nominal Attributes
- 2.4.3 Proximity Measures for Binary Attributes
- 2.4.4 Dissimilarity of Numeric Data: Minkowski Distance
- 2.4.5 Proximity Measures for Ordinal Attributes
- 2.4.6 Dissimilarity for Attributes of Mixed Types
- 2.4.7 Cosine Similarity
- 2.5 Summary
- 2.6 Exercises
- 2.7 Bibliographic Notes
- Chapter 3 Data Preprocessing
- 3.1 Data Preprocessing: An Overview
- 3.1.1 Data Quality: Why Preprocess the Data?
- 3.1.2 Major Tasks in Data Preprocessing
- 3.2 Data Cleaning
- 3.2.1 Missing Values
- 3.2.2 Noisy Data
- 3.2.3 Data Cleaning as a Process
- 3.3 Data Integration
- 3.3.1 Entity Identification Problem
- 3.3.2 Redundancy and Correlation Analysis
- 3.3.3 Tuple Duplication
- 3.3.4 Data Value Conflict Detection and Resolution
- 3.4 Data Reduction
- 3.4.1 Overview of Data Reduction Strategies
- 3.4.2 Wavelet Transforms
- 3.4.3 Principal Components Analysis
- 3.4.4 Attribute Subset Selection
- 3.4.5 Regression and Log-Linear Models: Parametric Data Reduction
- 3.4.6 Histograms
- 3.4.7 Clustering
- 3.4.8 Sampling
- 3.4.9 Data Cube Aggregation
- 3.5 Data Transformation and Data Discretization
- 3.5.1 Data Transformation Strategies Overview
- 3.5.2 Data Transformation by Normalization
- 3.5.3 Discretization by Binning
- 3.5.4 Discretization by Histogram Analysis
- 3.5.5 Discretization by Cluster, Decision Tree, and Correlation Analyses
- 3.5.6 Concept Hierarchy Generation for Nominal Data
- 3.6 Summary
- 3.7 Exercises
- 3.8 Bibliographic Notes
- Chapter 4 Data Warehousing and Online Analytical Processing
- 4.1 Data Warehouse: Basic Concepts
- 4.1.1 What Is a Data Warehouse?
- 4.1.2 Differences between Operational Database Systems and Data Warehouses
- 4.1.3 But, Why Have a Separate Data Warehouse?
- 4.1.4 Data Warehousing: A Multitiered Architecture
- 4.1.5 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
- 4.1.6 Extraction, Transformation, and Loading
- 4.1.7 Metadata Repository
- 4.2 Data Warehouse Modeling: Data Cube and OLAP
- 4.2.1 Data Cube: A Multidimensional Data Model
- 4.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
- 4.2.3 Dimensions: The Role of Concept Hierarchies
- 4.2.4 Measures: Their Categorization and Computation
- 4.2.5 Typical OLAP Operations
- 4.2.6 A Starnet Query Model for Querying Multidimensional Databases
- 4.3 Data Warehouse Design and Usage
- 4.3.1 A Business Analysis Framework for Data Warehouse Design
- 4.3.2 Data Warehouse Design Process
- 4.3.3 Data Warehouse Usage for Information Processing
- 4.3.4 From Online Analytical Processing to Multidimensional Data Mining
- 4.4 Data Warehouse Implementation
- 4.4.1 Efficient Data Cube Computation: An Overview
- 4.4.2 Indexing OLAP Data: Bitmap Index and Join Index
- 4.4.3 Efficient Processing of OLAP Queries
- 4.4.4 OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP
- 4.5 Data Generalization by Attribute-Oriented Induction
- 4.5.1 Attribute-Oriented Induction for Data Characterization
- 4.5.2 Efficient Implementation of Attribute-Oriented Induction
- 4.5.3 Attribute-Oriented Induction for Class Comparisons
- 4.6 Summary
- 4.7 Exercises
- 4.8 Bibliographic Notes
- Chapter 5 Data Cube Technology
- 5.1 Data Cube Computation: Preliminary Concepts
- 5.1.1 Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Cube Shell
- 5.1.2 General Strategies for Data Cube Computation
- 5.2 Data Cube Computation Methods
- 5.2.1 Multiway Array Aggregation for Full Cube Computation
- 5.2.2 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
- 5.2.3 Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
- 5.2.4 Precomputing Shell Fragments for Fast High-Dimensional OLAP
- 5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology
- 5.3.1 Sampling Cubes: OLAP-Based Mining on Sampling Data
- 5.3.2 Ranking Cubes: Efficient Computation of Top-k Queries
- 5.4 Multidimensional Data Analysis in Cube Space
- 5.4.1 Prediction Cubes: Prediction Mining in Cube Space
- 5.4.2 Multifeature Cubes: Complex Aggregation at Multiple Granularities
- 5.4.3 Exception-Based, Discovery-Driven Cube Space Exploration
- 5.5 Summary
- 5.6 Exercises
- 5.7 Bibliographic Notes
- Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
- 6.1 Basic Concepts
- 6.1.1 Market Basket Analysis: A Motivating Example
- 6.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
- 6.2 Frequent Itemset Mining Methods
- 6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation
- 6.2.2 Generating Association Rules from Frequent Itemsets
- 6.2.3 Improving the Efficiency of Apriori
- 6.2.4 A Pattern-Growth Approach for Mining Frequent Itemsets
- 6.2.5 Mining Frequent Itemsets Using Vertical Data Format
- 6.2.6 Mining Closed and Max Patterns
- 6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods
- 6.3.1 Strong Rules Are Not Necessarily Interesting
- 6.3.2 From Association Analysis to Correlation Analysis
- 6.3.3 A Comparison of Pattern Evaluation Measures
- 6.4 Summary
- 6.5 Exercises
- 6.6 Bibliographic Notes
- Chapter 7 Advanced Pattern Mining
- 7.1 Pattern Mining: A Road Map
- 7.2 Pattern Mining in Multilevel, Multidimensional Space
- 7.2.1 Mining Multilevel Associations
- 7.2.2 Mining Multidimensional Associations
- 7.2.3 Mining Quantitative Association Rules
- 7.2.4 Mining Rare Patterns and Negative Patterns
- 7.3 Constraint-Based Frequent Pattern Mining
- 7.3.1 Metarule-Guided Mining of Association Rules
- 7.3.2 Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space
- 7.4 Mining High-Dimensional Data and Colossal Patterns
- 7.4.1 Mining Colossal Patterns by Pattern-Fusion
- 7.5 Mining Compressed or Approximate Patterns
- 7.5.1 Mining Compressed Patterns by Pattern Clustering
- 7.5.2 Extracting Redundancy-Aware Top-k Patterns
- 7.6 Pattern Exploration and Application
- 7.6.1 Semantic Annotation of Frequent Patterns
- 7.6.2 Applications of Pattern Mining
- 7.7 Summary
- 7.8 Exercises
- 7.9 Bibliographic Notes
- Chapter 8 Classification: Basic Concepts
- 8.1 Basic Concepts
- 8.1.1 What Is Classification?
- 8.1.2 General Approach to Classification
- 8.2 Decision Tree Induction
- 8.2.1 Decision Tree Induction
- 8.2.2 Attribute Selection Measures
- 8.2.3 Tree Pruning
- 8.2.4 Scalability and Decision Tree Induction
- 8.2.5 Visual Mining for Decision Tree Induction
- 8.3 Bayes Classification Methods
- 8.3.1 Bayes’ Theorem
- 8.3.2 Naïve Bayesian Classification
- 8.4 Rule-Based Classification
- 8.4.1 Using IF-THEN Rules for Classification
- 8.4.2 Rule Extraction from a Decision Tree
- 8.4.3 Rule Induction Using a Sequential Covering Algorithm
- 8.5 Model Evaluation and Selection
- 8.5.1 Metrics for Evaluating Classifier Performance
- 8.5.2 Holdout Method and Random Subsampling
- 8.5.3 Cross-Validation
- 8.5.4 Bootstrap
- 8.5.5 Model Selection Using Statistical Tests of Significance
- 8.5.6 Comparing Classifiers Based on Cost–Benefit and ROC Curves
- 8.6 Techniques to Improve Classification Accuracy
- 8.6.1 Introducing Ensemble Methods
- 8.6.2 Bagging
- 8.6.3 Boosting and AdaBoost
- 8.6.4 Random Forests
- 8.6.5 Improving Classification Accuracy of Class-Imbalanced Data
- 8.7 Summary
- 8.8 Exercises
- 8.9 Bibliographic Notes
- Chapter 9 Classification: Advanced Methods
- 9.1 Bayesian Belief Networks
- 9.1.1 Concepts and Mechanisms
- 9.1.2 Training Bayesian Belief Networks
- 9.2 Classification by Backpropagation
- 9.2.1 A Multilayer Feed-Forward Neural Network
- 9.2.2 Defining a Network Topology
- 9.2.3 Backpropagation
- 9.2.4 Inside the Black Box: Backpropagation and Interpretability
- 9.3 Support Vector Machines
- 9.3.1 The Case When the Data Are Linearly Separable
- 9.3.2 The Case When the Data Are Linearly Inseparable
- 9.4 Classification Using Frequent Patterns
- 9.4.1 Associative Classification
- 9.4.2 Discriminative Frequent Pattern–Based Classification
- 9.5 Lazy Learners (or Learning from Your Neighbors)
- 9.5.1 k-Nearest-Neighbor Classifiers
- 9.5.2 Case-Based Reasoning
- 9.6 Other Classification Methods
- 9.6.1 Genetic Algorithms
- 9.6.2 Rough Set Approach
- 9.6.3 Fuzzy Set Approaches
- 9.7 Additional Topics Regarding Classification
- 9.7.1 Multiclass Classification
- 9.7.2 Semi-Supervised Classification
- 9.7.3 Active Learning
- 9.7.4 Transfer Learning
- 9.8 Summary
- 9.9 Exercises
- 9.10 Bibliographic Notes
- Chapter 10 Cluster Analysis: Basic Concepts and Methods
- 10.1 Cluster Analysis
- 10.1.1 What Is Cluster Analysis?
- 10.1.2 Requirements for Cluster Analysis
- 10.1.3 Overview of Basic Clustering Methods
- 10.2 Partitioning Methods
- 10.2.1 k-Means: A Centroid-Based Technique
- 10.2.2 k-Medoids: A Representative Object-Based Technique
- 10.3 Hierarchical Methods
- 10.3.1 Agglomerative versus Divisive Hierarchical Clustering
- 10.3.2 Distance Measures in Algorithmic Methods
- 10.3.3 BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees
- 10.3.4 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling
- 10.3.5 Probabilistic Hierarchical Clustering
- 10.4 Density-Based Methods
- 10.4.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density
- 10.4.2 OPTICS: Ordering Points to Identify the Clustering Structure
- 10.4.3 DENCLUE: Clustering Based on Density Distribution Functions
- 10.5 Grid-Based Methods
- 10.5.1 STING: STatistical INformation Grid
- 10.5.2 CLIQUE: An Apriori-like Subspace Clustering Method
- 10.6 Evaluation of Clustering
- 10.6.1 Assessing Clustering Tendency
- 10.6.2 Determining the Number of Clusters
- 10.6.3 Measuring Clustering Quality
- 10.7 Summary
- 10.8 Exercises
- 10.9 Bibliographic Notes
- Chapter 11 Advanced Cluster Analysis
- 11.1 Probabilistic Model-Based Clustering
- 11.1.1 Fuzzy Clusters
- 11.1.2 Probabilistic Model-Based Clusters
- 11.1.3 Expectation-Maximization Algorithm
- 11.2 Clustering High-Dimensional Data
- 11.2.1 Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies
- 11.2.2 Subspace Clustering Methods
- 11.2.3 Biclustering
- 11.2.4 Dimensionality Reduction Methods and Spectral Clustering
- 11.3 Clustering Graph and Network Data
- 11.3.1 Applications and Challenges
- 11.3.2 Similarity Measures
- 11.3.3 Graph Clustering Methods
- 11.4 Clustering with Constraints
- 11.4.1 Categorization of Constraints
- 11.4.2 Methods for Clustering with Constraints
- 11.5 Summary
- 11.6 Exercises
- 11.7 Bibliographic Notes
- Chapter 12 Outlier Detection
- 12.1 Outliers and Outlier Analysis
- 12.1.1 What Are Outliers?
- 12.1.2 Types of Outliers
- 12.1.3 Challenges of Outlier Detection
- 12.2 Outlier Detection Methods
- 12.2.1 Supervised, Semi-Supervised, and Unsupervised Methods
- 12.2.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods
- 12.3 Statistical Approaches
- 12.3.1 Parametric Methods
- 12.3.2 Nonparametric Methods
- 12.4 Proximity-Based Approaches
- 12.4.1 Distance-Based Outlier Detection and a Nested Loop Method
- 12.4.2 A Grid-Based Method
- 12.4.3 Density-Based Outlier Detection
- 12.5 Clustering-Based Approaches
- 12.6 Classification-Based Approaches
- 12.7 Mining Contextual and Collective Outliers
- 12.7.1 Transforming Contextual Outlier Detection to Conventional Outlier Detection
- 12.7.2 Modeling Normal Behavior with Respect to Contexts
- 12.7.3 Mining Collective Outliers
- 12.8 Outlier Detection in High-Dimensional Data
- 12.8.1 Extending Conventional Outlier Detection
- 12.8.2 Finding Outliers in Subspaces
- 12.8.3 Modeling High-Dimensional Outliers
- 12.9 Summary
- 12.10 Exercises
- 12.11 Bibliographic Notes
- Chapter 13 Data Mining Trends and Research Frontiers
- 13.1 Mining Complex Data Types
- 13.1.1 Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences
- 13.1.2 Mining Graphs and Networks
- 13.1.3 Mining Other Kinds of Data
- 13.2 Other Methodologies of Data Mining
- 13.2.1 Statistical Data Mining
- 13.2.2 Views on Data Mining Foundations
- 13.2.3 Visual and Audio Data Mining
- 13.3 Data Mining Applications
- 13.3.1 Data Mining for Financial Data Analysis
- 13.3.2 Data Mining for Retail and Telecommunication Industries
- 13.3.3 Data Mining in Science and Engineering
- 13.3.4 Data Mining for Intrusion Detection and Prevention
- 13.3.5 Data Mining and Recommender Systems
- 13.4 Data Mining and Society
- 13.4.1 Ubiquitous and Invisible Data Mining
- 13.4.2 Privacy, Security, and Social Impacts of Data Mining
- 13.5 Data Mining Trends
- 13.6 Summary
- 13.7 Exercises
- 13.8 Bibliographic Notes
Foreword
Analyzing large amounts of data is a necessity. Even popular science books, like “super crunchers,” give compelling cases where large amounts of data yield discoveries and intuitions that surprise even experts. Every enterprise benefits from collecting and analyzing its data: Hospitals can spot trends and anomalies in their patient records, search engines can do better ranking and ad placement, and environmental and public health agencies can spot patterns and abnormalities in their data. The list continues, with cybersecurity and computer network intrusion detection; monitoring of the energy consumption of household appliances; pattern analysis in bioinformatics and pharmaceutical data; financial and business intelligence data; spotting trends in blogs, Twitter, and many more. Storage is inexpensive and getting even less so, as are data sensors. Thus, collecting and storing data is easier than ever before. The problem then becomes how to analyze the data.

This is exactly the focus of this Third Edition of the book. Jiawei, Micheline, and Jian give encyclopedic coverage of all the related methods, from the classic topics of clustering and classification, to database methods (e.g., association rules, data cubes) to more recent and advanced topics (e.g., SVD/PCA, wavelets, support vector machines). The exposition is extremely accessible to beginners and advanced readers alike. The book gives the fundamental material first and the more advanced material in follow-up chapters. It also has numerous rhetorical questions, which I found extremely helpful for maintaining focus.

We have used the first two editions as textbooks in data mining courses at Carnegie Mellon and plan to continue to do so with this Third Edition. The new version has significant additions: Notably, it has more than 100 citations to works from 2006 onward, focusing on more recent material such as graphs and social networks, sensor networks, and outlier detection. This book has a new section for visualization, has expanded outlier detection into a whole chapter, and has separate chapters for advanced