Jeune has a fascination for building and solving things. His interests include Algorithms, Development Practices and Software Design. When not immersed in his passions, he spends time experimenting in the kitchen. Jose is a DZone MVB and is not an employee of DZone and has posted 10 posts at DZone. You can read more from them at their website. View Full User Profile

# My Implementation of the Apriori Algorithm

03.25.2012
| 17329 views |

This is a self imposed machine problem I wrote over a frantic afternoon for my lesson on Frequent Itemsets and the Apriori Algorithm.

I wanted to write a program that would find the top five frequent item sets among a set of baskets.

Consider the item set below:

D = [ ('milk', 'cheese', 'tuna'),
('cheese', 'mushroom'),
('cheese', 'eggs'),
('milk', 'cheese', 'mushroom'),
('milk', 'eggs'),
('cheese', 'eggs'),
('milk', 'cheese', 'mushroom', 'tuna'),
('milk', 'cheese', 'eggs') ]


The goal I want for my program is to be able to induce that

$milk \to cheese$ (the presence of milk somehow implies the presence of cheese as well)

or

$milk,cheese \to mushroom$ (milk, cheese and mushroom) somehow go together.

Just right now, I think I could also find a correlation between item sets and dates.

Anyway, as I have learned the key is to use map reduce and I was able to do this quite easily with a NoSQL database like Mongo DB.

A link to the code I wrote is available on github.

So far, I am able to generate the candidate item set for 2-combinations of the universal sets. This is another way of saying that the program I wrote is to count the number of times a two combination appears in all the baskets. For example, it is able to count that (‘milk’, ‘cheese’) occurs four times. Below is a screenshot of the script doing just that:

The next step after this is to generate the frequent item sets from the the candidate item set. This just means, I have to filter out non frequent 2-combinations. This can easily be done in two steps and is based on a support count that is defined before hand.

Let’s say a support count of 2 is defined, then the first thing to do is to check if the count of the 1 item sets that make up the two item sets have support count $\geq 2$. Remember that the key to the apriori algorithm is that the subsets of a frequent item set must also be frequent.

This can easily be derived by running the script I wrote:

Now that I am able to generate the candidate item sets for any n-combination of the universal set, some next steps for my program would be to

1. Generate the frequent item set given a candidate item set.
2. Report the top 5 item sets with the highest interest and highest confidence.

Performance considerations also come to mind as my implementation of the algorithm has not been tested for a large data set.

Published at DZone with permission of Jose Asuncion, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:
"Starting from scratch" is seductive but disease ridden