These notes, from the book 'Data Mining: Concepts and Techniques' by Jiawei Han, cover mining frequent patterns and association rules: the basic concepts of frequent patterns and association rules; closed patterns and max-patterns; scalable methods for mining frequent patterns, including candidate generation and pruning; partitioning of patterns and databases; mining frequent patterns with FP-trees; visualization of association rules; system optimization; and constraints in data mining.
Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Revised by Zhongfei (Mark) Zhang, Department of Computer Science, SUNY Binghamton
Chapter 5: Mining Frequent Patterns, Association and Correlations
Why Is Frequent Pattern Mining Important?

- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cube and cube-gradient
  - Semantic data compression: fascicles
- Broad applications
Basic Concepts: Frequent Patterns and Association Rules

- Itemset X = {x1, …, xk}
- Goal: find all rules X → Y with minimum support and confidence, where
  - support s = the probability that a transaction contains X ∪ Y
  - confidence c = the conditional probability that a transaction containing X also contains Y

Example transaction database:

  Transaction-id | Items bought
  10             | A, B, D
  20             | A, C, D
  30             | A, D, E
  40             | B, E, F
  50             | B, C, D, E, F

Let sup_min = 50% and conf_min = 50%. The frequent patterns are {A:3, B:3, D:4, E:3, AD:3}, and the association rules are:

  A → D (support 60%, confidence 100%)
  D → A (support 60%, confidence 75%)

(The slide illustrates support and confidence with the classic beer-and-diapers picture: one circle of customers buying beer, one of customers buying diapers, and the overlap of customers buying both.)
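As a quick check of these numbers, here is a minimal Python sketch (an illustration added for these notes, not part of the original slides; the names transactions, support, and confidence are my own):

    # The five transactions from the example above
    transactions = [
        {"A", "B", "D"},            # 10
        {"A", "C", "D"},            # 20
        {"A", "D", "E"},            # 30
        {"B", "E", "F"},            # 40
        {"B", "C", "D", "E", "F"},  # 50
    ]

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        """Conditional probability that a transaction with `lhs` also contains `rhs`."""
        return support(set(lhs) | set(rhs)) / support(lhs)

    print(support({"A", "D"}))       # 0.6  -> A → D has 60% support
    print(confidence({"A"}, {"D"}))  # 1.0  -> 100% confidence
    print(confidence({"D"}, {"A"}))  # 0.75 -> 75% confidence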
Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns, so reporting all frequent patterns explicitly can be infeasible.
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X.
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X.
- Example: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.
  - Closed itemsets: <a1, …, a100> (support 1) and <a1, …, a50> (support 2).
  - Max-pattern: <a1, …, a100> (support 1).
- Closed patterns compress the set of frequent patterns without losing support information; max-patterns are more compact still, but lose the supports of sub-patterns.
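To make the definitions concrete, the following Python sketch (added here as an illustration; the toy database and all names are my own) brute-forces the frequent itemsets of a tiny database and filters out the closed and maximal ones:

    from itertools import combinations

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
    min_sup = 2  # absolute support count

    def support(itemset):
        return sum(set(itemset) <= t for t in transactions)

    # Enumerate all frequent itemsets (feasible only for tiny examples)
    items = sorted(set().union(*transactions))
    frequent = {frozenset(c): support(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(c) >= min_sup}

    # Closed: frequent, and no frequent superset has the same support
    closed = {x for x in frequent
              if not any(x < y and frequent[y] == frequent[x] for y in frequent)}
    # Maximal: frequent, and no superset is frequent at all
    maximal = {x for x in frequent if not any(x < y for y in frequent)}

    print(sorted(map(sorted, closed)))   # [['a'], ['a', 'b'], ['a', 'c']]
    print(sorted(map(sorted, maximal)))  # [['a', 'b'], ['a', 'c']]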
Apriori: A Candidate Generation-and-Test Approach

- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets.
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
  - Test the candidates against the DB.
  - Terminate when no frequent or candidate set can be generated.
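The level-wise loop can be written compactly in Python (a simplified sketch, not the book's reference implementation; candidate generation here is the naive join-all-pairs variant, with the same subset-based pruning):

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return {itemset: support} for all frequent itemsets; min_sup is an absolute count."""
        items = sorted(set().union(*transactions))
        # One scan of the DB for the frequent 1-itemsets
        freq = {frozenset([i]): s for i in items
                if (s := sum(i in t for t in transactions)) >= min_sup}
        result, k = dict(freq), 1
        while freq:
            # Length-(k+1) candidates from unions of frequent k-itemsets ...
            candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
            # ... pruned: every k-subset must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in freq for s in combinations(c, k))}
            # Count candidates against the DB and keep those meeting min_sup
            freq = {c: s for c in candidates
                    if (s := sum(c <= t for t in transactions)) >= min_sup}
            result.update(freq)
            k += 1
        return result

Run on the five-transaction database from the basic-concepts sketch with min_sup = 3, this returns exactly the frequent patterns listed there: A:3, B:3, D:4, E:3, and AD:3.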
The Apriori Algorithm: An Example (sup_min = 2)

Database TDB:

  Tid | Items bought
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (prune {D}, which is below sup_min): {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
Important Details of Apriori

- How to generate candidates?
  - Step 1: self-join L_k with itself
  - Step 2: pruning
- How to count the supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3 yields abcd (from abc and abd) and acde (from acd and ace)
  - Pruning: acde is removed because ade is not in L3
  - Result: C4 = {abcd}
How to Generate Candidates?

- Suppose the items in each itemset of L_{k-1} are listed in a fixed order.
- Step 1: self-joining L_{k-1}:

    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2},
          p.item_{k-1} < q.item_{k-1}

- Step 2: pruning:

    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
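A direct Python translation of this pseudocode might look as follows (a sketch assuming each itemset in L_{k-1} is stored as a sorted tuple; the function name is my own):

    def generate_candidates(L_prev):
        """Self-join L_{k-1}, then prune candidates having an infrequent (k-1)-subset."""
        k = len(next(iter(L_prev))) + 1
        # Step 1: self-join -- pairs agreeing on their first k-2 items,
        # joined so that the resulting k-tuple stays sorted
        candidates = {p[:-1] + (p[-1], q[-1])
                      for p in L_prev for q in L_prev
                      if p[:-1] == q[:-1] and p[-1] < q[-1]}
        # Step 2: prune -- every (k-1)-subset must be in L_{k-1}
        return {c for c in candidates
                if all(c[:i] + c[i + 1:] in L_prev for i in range(k))}

    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
    print(generate_candidates(L3))  # {('a', 'b', 'c', 'd')} -- acde is pruned (ade not in L3)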
Partition: Scan the Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find the local frequent patterns within each partition.
- Scan 2: consolidate the global frequent patterns by counting all local candidates over the whole DB.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.
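The two-scan idea can be sketched in a few lines of Python (my own illustration; it reuses the apriori function sketched above to mine each partition locally):

    def partition_mining(transactions, min_sup_ratio, num_parts=2):
        """Two-scan partition algorithm: local mining per partition, then one global count."""
        n = len(transactions)
        size = -(-n // num_parts)  # ceiling division: partition size
        # Scan 1: mine each partition with a proportionally scaled threshold;
        # every globally frequent itemset must be locally frequent somewhere
        candidates = set()
        for i in range(0, n, size):
            part = transactions[i:i + size]
            candidates |= set(apriori(part, max(1, int(min_sup_ratio * len(part)))))
        # Scan 2: count each local candidate over the full database
        return {c: s for c in candidates
                if (s := sum(c <= t for t in transactions)) >= min_sup_ratio * n}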
DIC: Reduce the Number of Scans

- DIC (dynamic itemset counting) starts counting an itemset as soon as all of its subsets are known to be frequent, rather than waiting for the next full scan:
  - Once both A and D are determined frequent, the counting of AD begins.
  - Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
- (The slide shows the itemset lattice over {A, B, C, D}, from {} up to ABCD, next to a timeline contrasting Apriori, which finishes counting 1-itemsets before starting 2-itemsets, with DIC, which begins counting 2- and 3-itemsets partway through a scan.)
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.
Mining Frequent Patterns Without Candidate Generation
Construct FP-tree from a Transaction Database (min_support = 3)

  TID | Items bought              | (ordered) frequent items
  100 | f, a, c, d, g, i, m, p    | f, c, a, m, p
  200 | a, b, c, f, l, m, o       | f, c, a, b, m
  300 | b, f, h, j, o, w          | f, b
  400 | b, c, k, s, p             | c, b, p
  500 | a, f, c, e, l, p, m, n    | f, c, a, m, p

Steps:
1. Scan the DB once and find the frequent single items.
2. Sort the frequent items in descending order of frequency; this f-list is the header table: f:4, c:4, a:3, b:3, m:3, p:3.
3. Scan the DB again and insert each transaction's ordered frequent items as a path from the root, sharing common prefixes and incrementing the counts along them.

Resulting FP-tree (root {}): the shared prefix path f:4 → c:3 → a:3 branches into m:2 → p:2 and into b:1 → m:1; f:4 also has a child b:1; a separate branch c:1 → b:1 → p:1 hangs directly off the root. The header table links together all nodes carrying the same item.
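The construction can be sketched compactly in Python (a simplified illustration written for these notes; header-table node-links are kept as plain lists):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 1, parent
            self.children = {}  # item -> FPNode

    def build_fptree(transactions, min_support):
        """Two scans: count single items, then insert ordered frequent items as shared-prefix paths."""
        # Scan 1: frequent items in descending frequency (the f-list);
        # ties may fall in any order, but the order must be used consistently
        counts = Counter(item for t in transactions for item in t)
        flist = [i for i, c in counts.most_common() if c >= min_support]
        rank = {item: r for r, item in enumerate(flist)}
        root, header = FPNode(None, None), {item: [] for item in flist}
        # Scan 2: insert each transaction's frequent items in f-list order
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    node.children[item] = FPNode(item, node)
                    header[item].append(node.children[item])  # node-link for this item
                else:
                    node.children[item].count += 1
                node = node.children[item]
        return root, header

Built over the five transactions above with min_support = 3, the root's f child has count 4 and that node's c child count 3, matching the tree described above.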