Apriori Algorithm and its Example


Introduction

Short stories or tales always help us understand a concept better, but this one is a true story: Wal-Mart's beer and diaper parable. A salesperson from Wal-Mart tried to increase the store's sales by bundling products together and giving discounts on them. He bundled bread and jam, which made it easy for a customer to find them together, and the discount encouraged customers to buy them together.

To find more such opportunities and more products that could be tied together, the salesperson analyzed all the sales records. What he found was intriguing: many customers who purchased diapers also bought beer. The two products are obviously unrelated, so he decided to dig deeper. He found that raising kids is grueling, and to relieve stress, parents bought beer. He paired diapers with beer, and the sales escalated. This is a perfect example of association rules in data mining.

This article takes you through a beginner's-level explanation of the Apriori algorithm in data mining. We will also look at the definition of association rules. Toward the end, we will look at the pros and cons of the Apriori algorithm, along with its R implementation.

Let's begin by understanding what the Apriori algorithm is and why it is important to learn.

Apriori Algorithm

With the quick growth of e-commerce applications, vast quantities of data accumulate in months rather than years. Data mining, also known as Knowledge Discovery in Databases (KDD), is the process of examining such data to find anomalies, correlations, patterns, and trends in order to predict outcomes.

The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and the relevant association rules. It is devised to operate on a database containing a large number of transactions, for instance, the items bought by customers in a store.

It is very important for effective market basket analysis, which helps customers purchase their items with more ease and thereby increases the store's sales. It has also been used in the field of healthcare for the detection of adverse drug reactions (ADRs), producing association rules that indicate which combinations of medications and patient characteristics lead to ADRs.

Association rules

Association rule learning is a prominent and well-explored method for determining relations among variables in large databases. Let us take a look at the formal definition of the association rule problem given by Rakesh Agrawal, the President and Founder of the Data Insights Laboratories.

Let I = {i1, i2, i3, …, in} be a set of n attributes called items, and let D = {t1, t2, …, tm} be a set of transactions, called the database. Every transaction ti in D has a unique transaction ID and consists of a subset of the items in I.

A rule is defined as an implication X ⟶ Y, where X and Y are subsets of I (X, Y ⊆ I) that have no element in common, i.e., X ∩ Y = ∅. X and Y are called the antecedent and the consequent of the rule, respectively.

Let's take an easy example from the supermarket sphere. The example we are considering is quite small; in practical situations, datasets contain millions or billions of transactions. The set of items is I = {Onion, Burger, Potato, Milk, Beer}, and the database consists of six transactions. Each transaction is a tuple of 0s and 1s, where 0 represents the absence of an item and 1 its presence. An example of a rule in this scenario would be {Onion, Potato} => {Burger}, which means that if onion and potato are bought, customers also buy a burger.

Transaction ID   Onion   Potato   Burger   Milk   Beer
t1               1       1        1        0      0
t2               0       1        1        1      0
t3               0       0        0        1      1
t4               1       1        0        1      0
t5               1       1        1        0      1
t6               1       1        1        1      1

There are multiple rules possible even from a very small database, so in order to select the interesting ones, we use constraints on various measures of interest and significance. We will look at some of these useful measures such as support, confidence, lift and conviction.

Support

The support of an itemset X, supp(X), is the proportion of transactions in the database in which the itemset X appears. It signifies the popularity of an itemset.

supp(X) = (Number of transactions in which X appears) / (Total number of transactions)

In the example above, supp(Onion) = 4/6 ≈ 0.67, since Onion appears in four of the six transactions (t1, t4, t5, and t6).

If the sales of a particular product (item) above a certain proportion have a meaningful effect on profits, that proportion can be considered the support threshold. Furthermore, we can identify itemsets whose support values are beyond this threshold as significant itemsets.
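To make this concrete, here is a minimal sketch in R, the language whose implementation the article refers to. The `db` matrix encodes the 0/1 transaction table above, and `supp()` is our own illustrative helper, not a function from any package:

# The 0/1 transaction matrix from the table above (rows = transactions)
db <- matrix(c(1, 1, 1, 0, 0,
               0, 1, 1, 1, 0,
               0, 0, 0, 1, 1,
               1, 1, 0, 1, 0,
               1, 1, 1, 0, 1,
               1, 1, 1, 1, 1),
             nrow = 6, byrow = TRUE,
             dimnames = list(paste0("t", 1:6),
                             c("Onion", "Potato", "Burger", "Milk", "Beer")))

# Support = fraction of transactions in which every item of the set appears
supp <- function(items) mean(apply(db[, items, drop = FALSE], 1, all))

supp("Onion")               # 4/6, about 0.67
supp(c("Onion", "Potato"))  # also 4/6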

Confidence

Confidence of a rule is defined as follows:

conf(X ⟶ Y) = supp(X ∪ Y) / supp(X)

It signifies the likelihood of the item Y being purchased when the item X is purchased. So, for the rule {Onion, Potato} => {Burger}:

conf({Onion, Potato} ⟶ {Burger}) = supp({Onion, Potato, Burger}) / supp({Onion, Potato}) = (3/6) / (4/6) = 0.75

This implies that for 75% of the transactions containing onion and potato, the rule is correct. It can also be interpreted as the conditional probability P(Y|X), i.e., the probability of finding the itemset Y in a transaction given that the transaction already contains X.
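Continuing the R sketch from the Support section (again with our own helper names), confidence is just a ratio of two supports:

# conf(X -> Y) = supp(X union Y) / supp(X)
conf <- function(x, y) supp(c(x, y)) / supp(x)

conf(c("Onion", "Potato"), "Burger")   # (3/6) / (4/6) = 0.75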

It can give some important insights, but it also has a major drawback. It only takes into account the popularity of the itemset X and not the popularity of Y. If Y is as popular as X, then there will be a higher probability that a transaction containing X also contains Y, which inflates the confidence regardless of any real association. This is the drawback that measures such as lift and conviction correct for; lift, for instance, divides supp(X ∪ Y) by supp(X) × supp(Y), the support we would expect if X and Y were independent.

An example of the Apriori algorithm

Let us now walk through the algorithm on the transaction database above, with a support threshold of 50%, i.e., an itemset must appear in at least three of the six transactions to be significant.

Step 1: Count the frequency of each item, i.e., the number of transactions in which it occurs.

Item        Frequency (No. of transactions)
Onion(O)    4
Potato(P)   5
Burger(B)   4
Milk(M)     4
Beer(Be)    2

Step 2: We know that only those elements are significant for which the support is greater than or equal to the threshold support. Here, the support threshold is 50%, hence only those items are significant which occur in at least three transactions, and such items are Onion(O), Potato(P), Burger(B), and Milk(M). Therefore, we are left with:

Item        Frequency (No. of transactions)
Onion(O)    4
Potato(P)   5
Burger(B)   4
Milk(M)     4

The table above represents the single items that are purchased frequently by the customers.

Step 3: The next step is to make all the possible pairs of the significant items, keeping in mind that the order doesn't matter, i.e., AB is the same as BA. To do this, take the first item and pair it with all the others: OP, OB, OM. Similarly, take the second item and pair it with the items that follow it: PB, PM. We skip PO because it is the same as OP, which already exists. So, all the pairs in our example are OP, OB, OM, PB, PM, and BM.

Step 4: We will now count the occurrences of each pair in all the transactions.

Itemset   Frequency (No. of transactions)
OP        4
OB        3
OM        2
PB        4
PM        3
BM        2

Step 5: Again, only those itemsets are significant which cross the support threshold, and those are OP, OB, PB, and PM.

Step 6: Now let's say we would like to look for a set of three items that are purchased together. We will use the itemsets found in Step 5 and create the sets of 3 items.

To create the sets of 3 items, another rule, called self-join, is required. It says that from the item pairs OP, OB, PB, and PM, we look for two pairs with an identical first letter, and so we get:

  • OP and OB, which gives OPB
  • PB and PM, which gives PBM

Next, we find the frequency for these two itemsets.

Itemset   Frequency (No. of transactions)
OPB       4
PBM       3

Applying the threshold rule again, we find that OPB is the only significant itemset.

Therefore, the set of 3 items that was purchased most frequently is OPB.
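The pair-joining step above can be expressed in a few lines of R. This is a sketch using the two-letter labels from the walkthrough; `self_join_pairs` is our own illustrative name:

# Join two pairs whenever they share the same first letter,
# e.g. OP and OB give OPB
self_join_pairs <- function(pairs) {
  out <- character(0)
  for (i in seq_along(pairs))
    for (j in seq_along(pairs))
      if (j > i && substr(pairs[i], 1, 1) == substr(pairs[j], 1, 1))
        out <- c(out, paste0(pairs[i], substr(pairs[j], 2, 2)))
  out
}

self_join_pairs(c("OP", "OB", "PB", "PM"))   # "OPB" "PBM"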

The example we considered was a fairly simple one, and mining the frequent itemsets stopped at 3 items, but in practice there are dozens of items and this process can continue to much larger itemsets. Suppose we got the significant 3-item sets OPQ, OPR, OQR, OQS, and PQR, and now we want to generate the 4-item sets. For this, we look at the sets which have their first two letters in common, i.e.,

  • OP Q and OP R gives OPQR
  • OQ R and OQ S gives OQRS

In general, we have to look for sets which only differ in their last letter/item.

Now that we have looked at an example of the functionality of Apriori Algorithm, let us formulate the general process.

General Process of the Apriori algorithm

The entire algorithm can be divided into two steps:

Step 1: Apply minimum support to find all the frequent sets with k items in a database.

Step 2: Use the self-join rule to find the frequent sets with k+1 items with the help of the frequent k-itemsets. Repeat this process from k = 1 to the point when we are unable to apply the self-join rule.

This approach of extending a frequent itemset one at a time is called the “bottom up” approach.
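As a rough sketch of this bottom-up loop in R, reusing the `db` matrix and `supp()` helper defined earlier (the function name is our own, and only support-based pruning is shown, without the full algorithm's extra subset-pruning step):

# Grow frequent itemsets level by level: self-join, then prune by support
apriori_itemsets <- function(minsup) {
  level <- as.list(colnames(db)[colMeans(db) >= minsup])  # frequent 1-itemsets
  frequent <- level
  while (length(level) > 1) {
    # self-join: merge two k-itemsets that agree on all but their last item
    cand <- list()
    for (i in seq_along(level))
      for (j in seq_along(level))
        if (j > i) {
          a <- level[[i]]; b <- level[[j]]; k <- length(a)
          if (k == 1 || identical(a[-k], b[-k]))
            cand[[length(cand) + 1]] <- sort(union(a, b))
        }
    # prune: keep only the candidates that meet the minimum support
    level <- Filter(function(s) supp(s) >= minsup, unique(cand))
    frequent <- c(frequent, level)
  }
  frequent
}

# Note: in the 0/1 matrix, Beer appears in 3 of the 6 transactions, so at a
# 50% cutoff it survives the first level here, unlike in the hand-worked table;
# the surviving pairs and the final triple {Burger, Onion, Potato} still match
apriori_itemsets(0.5)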

Mining Association Rules

Till now, we have looked at the Apriori algorithm with respect to frequent itemset generation. There is another task for which we can use this algorithm, i.e., finding association rules efficiently.

For finding association rules, we need to find all rules having support greater than the threshold support and confidence greater than the threshold confidence.

But how do we find these? One possible way is brute force: list all the possible association rules and calculate the support and confidence for each rule, then eliminate the rules that fail the threshold support and confidence. But this is computationally very heavy and prohibitive, as the number of possible association rules increases exponentially with the number of items.

Given there are n items in the set I, the total number of possible association rules is 3^n − 2^(n+1) + 1. For the n = 5 items of our toy example, that is already 3^5 − 2^6 + 1 = 180 candidate rules.

We can instead use the two-step approach to find the association rules efficiently: first, mine all the frequent itemsets whose support crosses the threshold support, and then, from every frequent itemset, generate candidate rules by splitting it into an antecedent and a consequent, keeping only the rules whose confidence crosses the threshold confidence.
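As a sketch of what the R implementation mentioned in the introduction can look like, here is the same toy database run through the arules package. This assumes arules is installed; it is a standard choice for association rule mining in R, though the article's own code is not shown in this preview:

library(arules)

# The six transactions from the table, as item lists
txns <- list(
  t1 = c("Onion", "Potato", "Burger"),
  t2 = c("Potato", "Burger", "Milk"),
  t3 = c("Milk", "Beer"),
  t4 = c("Onion", "Potato", "Milk"),
  t5 = c("Onion", "Potato", "Burger", "Beer"),
  t6 = c("Onion", "Potato", "Burger", "Milk", "Beer")
)
trans <- as(txns, "transactions")

# Keep rules passing both thresholds; minlen = 2 demands a non-empty antecedent
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.75, minlen = 2))
inspect(rules)   # includes {Onion, Potato} => {Burger} with confidence 0.75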

This is how stores can provide us discounts on certain bundles of products. The use cases of the Apriori algorithm stretch to Google's auto-completion features and Amazon's recommendation systems.

This tutorial aimed to make the reader familiar with the fundamentals of the Apriori algorithm and the general process followed to mine frequent itemsets. Hope you are familiar now!