Jeune has a fascination for building and solving things. His interests include Algorithms, Development Practices and Software Design. When not immersed in his passions, he spends time experimenting in the kitchen. Jose is a DZone MVB and is not an employee of DZone and has posted 10 posts at DZone. You can read more from them at their website.

Getting Started With Data Mining

03.04.2012

These are notes from my data mining class, which I am consolidating here on my blog as a way to assimilate the lessons.

1. The market basket model is probably the easiest introduction for anyone interested in data mining. The concept is simple. There are baskets and there are items in those baskets.

The market basket model: there are items, and there are baskets that hold those items. Each basket is a set of items, also called an itemset.


2. Closely related to the market basket model is the concept of frequent itemsets. Intuitively, a set of items is frequent if it appears in many baskets.

3. The following terms are used a lot when talking about frequent itemsets:

  • Support count – the number of baskets, out of a given set of baskets, that contain the itemset as a subset. For example, in the basket set

    [ ('cheese', 'milk', 'eggs'), ('milk'), ('milk', 'eggs', 'bread') ]


    the support count of the itemset (‘milk’, ‘eggs’) is 2, since it is a subset of two of the three baskets.

  • Support threshold – a numerical limit that draws the line between frequent and non-frequent itemsets: an itemset is frequent if its support count is at least the threshold. For example, in the basket set above, if one sets the support threshold at 2, then the itemset (‘milk’, ‘eggs’) is frequent because its support count of 2 meets the threshold.
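To make these two terms concrete, here is a minimal Python sketch that counts support over the example baskets and lists the itemsets that meet a threshold. The function names `support_count` and `frequent_itemsets` are my own, chosen for illustration, and the brute-force enumeration of candidates is an assumption made for clarity, not an efficient mining algorithm.

```python
from itertools import combinations

# The three example baskets from above, stored as sets so that
# subset tests are straightforward.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def frequent_itemsets(baskets, threshold, size):
    """All itemsets of the given size whose support count is >= threshold.

    Naive approach: try every combination of `size` distinct items.
    """
    items = sorted(set().union(*baskets))
    return [
        set(candidate)
        for candidate in combinations(items, size)
        if support_count(candidate, baskets) >= threshold
    ]


print(support_count({"milk", "eggs"}, baskets))  # 2
print(frequent_itemsets(baskets, threshold=2, size=2))
```

With a threshold of 2, the only frequent pair in this tiny data set is milk and eggs. Trying every combination is fine for three baskets but grows very quickly with the number of distinct items, so treat this as a way to check the definitions rather than something to run on real data.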

4. Frequent itemsets are often presented as if-then rules of the form I \to j, where I is a set of items and j is a single item. This representation is called an association rule. In words: if I appears in a basket, then j is “likely” to appear in that basket as well.

5. In data mining parlance, the concept of ‘likely’ is more formally known as the confidence of the rule I \to j. Mathematically,

Confidence(I \to j) = \frac{Support(I \cup \{j\})}{Support(I)}


The insights behind this formula are that

  • The number of baskets containing I \cup {j} cannot exceed the number of baskets containing I, because every basket that contains I \cup {j} also contains I. A basket with I in it may or may not also have j in it. The confidence is therefore always between 0 and 1.
  • Having said that, the more of the baskets with I that also contain j, the better. This makes the confidence in the rule stronger.
  • If Support(I \cup {j}) is the same as Support(I), then the confidence is 1, or 100%. In other words, whenever I appears in a basket, j appears in it too.
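As a quick check of the confidence formula, here is a small Python sketch that evaluates it on the example baskets from item 3. The helper names `support_count` and `confidence` are hypothetical, introduced only for this illustration.

```python
# Example baskets from item 3, stored as sets.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(I, j, baskets):
    """Confidence of the rule I -> j: Support(I ∪ {j}) / Support(I)."""
    support_I = support_count(I, baskets)
    if support_I == 0:
        # The rule is undefined if I never appears in any basket.
        raise ValueError("itemset I does not appear in any basket")
    return support_count(set(I) | {j}, baskets) / support_I


print(confidence({"milk"}, "eggs", baskets))  # 2 of 3 milk baskets have eggs: about 0.667
print(confidence({"eggs"}, "milk", baskets))  # both egg baskets have milk: 1.0
```

Note that the rule is not symmetric: eggs \to milk holds every time in this data, while milk \to eggs holds in only two of the three baskets that contain milk.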

6. In an association rule I \to j, interest is an indicator of how the presence of the items on the left, I, affects the presence of the item on the right, j. The formula is:

Interest(I \to j) = Confidence(I \to j) - \frac{\text{\# of baskets containing } j}{\text{\# of baskets}}


The insight behind this formula is that

  • If the confidence exceeds the fraction of baskets that contain j, then j shows up more often in baskets with I than in baskets overall. It can then be said that there is indeed a correlation between I and j, and/or that the presence of I somehow affects the presence of j.
  • On the other hand, if there are significantly more baskets with j but without I, then the association rule I \to j isn’t really strong. In that case j appears often regardless of I, so it is probably not the presence of I that implies the presence of j but something else. The instances when I and j are together in a basket can be said to be isolated.
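The interest formula can be tried out on the example baskets from item 3 with the following Python sketch. The function names are hypothetical, and the three rules evaluated at the end are examples I picked to show a zero, a positive and a negative interest.

```python
# Example baskets from item 3, stored as sets.
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(I, j, baskets):
    """Confidence of the rule I -> j: Support(I ∪ {j}) / Support(I)."""
    support_I = support_count(I, baskets)
    if support_I == 0:
        raise ValueError("itemset I does not appear in any basket")
    return support_count(set(I) | {j}, baskets) / support_I


def interest(I, j, baskets):
    """Confidence of I -> j minus the fraction of all baskets that contain j."""
    fraction_with_j = support_count({j}, baskets) / len(baskets)
    return confidence(I, j, baskets) - fraction_with_j


print(interest({"eggs"}, "milk", baskets))     # 0.0: milk is in every basket anyway
print(interest({"cheese"}, "eggs", baskets))   # positive, about 0.333
print(interest({"bread"}, "cheese", baskets))  # negative, about -0.333
```

The first rule is a useful caution: eggs \to milk has a confidence of 1, yet its interest is 0, because milk is in every basket whether or not eggs are present.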




Published at DZone with permission of Jose Asuncion, author and DZone MVB. (source)
