LUCS-KDD

NOTE ON THE SUPPORT AND CONFIDENCE FRAMEWORK




Frans Coenen

Department of Computer Science

The University of Liverpool

27 February 2008


The support and confidence framework is the most commonly used framework in Association Rule Mining (ARM) for identifying, and consequently defining, "interesting" associations described in the form of Association Rules (ARs).

Support is the number of occurrences in a dataset of some set of attributes (such sets are referred to as itemsets); some authors refer to this number as the support count of the itemset.
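As a minimal sketch of the definition above, the support count of an itemset can be computed by counting the records that contain every item in it. The function name `support_count`, the representation of records as Python sets, and the toy dataset are all assumptions made for illustration; they are not taken from any LUCS-KDD software.

```python
def support_count(itemset, records):
    """Return the number of records that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for record in records if items <= set(record))


# Hypothetical toy dataset: each record is a set of attribute labels.
records = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

print(support_count({"a", "b"}, records))  # 3 (first, second and fifth records)
```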

Confidence indicates how strongly the data support an AR, i.e. how "confident" we are about the validity of the rule. It is usually expressed as a percentage and is calculated by dividing the support for the union of the antecedent and consequent of an AR by the support for its antecedent alone. A rule with a confidence of 100% means that every record in the dataset that contains the antecedent also contains the consequent; in other words, this is a very strong rule.
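The calculation described above can be sketched as follows. The helper names (`support_count`, `confidence`) and the four-record toy dataset are hypothetical, chosen only to show one rule that reaches 100% and one that does not.

```python
def support_count(itemset, records):
    """Return the number of records that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for record in records if items <= set(record))


def confidence(antecedent, consequent, records):
    """Confidence of the rule antecedent -> consequent, as a percentage."""
    antecedent, consequent = set(antecedent), set(consequent)
    antecedent_support = support_count(antecedent, records)
    if antecedent_support == 0:
        raise ValueError("antecedent does not occur in the dataset")
    # Support for the union of antecedent and consequent, divided by
    # the support for the antecedent alone.
    union_support = support_count(antecedent | consequent, records)
    return 100.0 * union_support / antecedent_support


# Hypothetical toy dataset.
records = [
    {"a", "b"},
    {"a", "b", "c"},
    {"b", "c"},
    {"a", "b"},
]

print(confidence({"a"}, {"b"}, records))  # 100.0: every record with a also has b
print(confidence({"b"}, {"a"}, records))  # 75.0: one record has b without a
```

Note that the two rules are built from the same itemset {a, b} yet have different confidence values, because the divisor is the support of the antecedent only.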

ARs are generated from what are called frequent itemsets: itemsets whose support is above some user-specified support threshold, expressed as a percentage of the number of records in the dataset. The threshold is typically given a low value so that no potentially interesting rules are missed. However, the appropriate value depends on the number of records in the dataset. If we choose a value of 1% and have only 100 records, then every item combination present in the dataset will be considered frequent, since a single occurrence is enough to reach the threshold (which might entail a significant processing overhead). Alternatively, if we have 1000 records, then an itemset would have to appear in at least 10 records for it to be considered frequent.
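The arithmetic in the two examples above can be checked with a short sketch that converts a percentage threshold into a minimum number of records and then lists the frequent itemsets by exhaustive enumeration. Two assumptions are made here: an itemset counts as frequent when its support is at least the threshold (the "at least 10 records" reading), and the enumeration is a naive brute-force illustration, not an efficient ARM algorithm. The names `min_support_count` and `frequent_itemsets` are hypothetical.

```python
import math
from itertools import combinations


def min_support_count(threshold_percent, num_records):
    """Smallest number of records that meets the percentage threshold."""
    return max(1, math.ceil(threshold_percent * num_records / 100))


def frequent_itemsets(records, threshold_percent):
    """Map each frequent itemset to its support count (brute force)."""
    records = [set(r) for r in records]
    min_count = min_support_count(threshold_percent, len(records))
    items = sorted(set().union(*records))
    result = {}
    # Try every combination of items; only practical for tiny datasets.
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for r in records if set(candidate) <= r)
            if count >= min_count:
                result[frozenset(candidate)] = count
    return result


print(min_support_count(1, 100))   # 1: any itemset that occurs at all is frequent
print(min_support_count(1, 1000))  # 10

# Hypothetical toy dataset with a 50% threshold (at least 2 of 4 records).
records = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "b"}]
for itemset, count in frequent_itemsets(records, 50).items():
    print(sorted(itemset), count)
```

The number of candidate combinations doubles with every additional item, which is the processing overhead that a very low threshold can incur.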

Once the frequent itemsets in a dataset have been identified, the ARs can be generated. Each frequent itemset of size greater than one can produce two or more ARs (an itemset of k items can be split into a non-empty antecedent and a non-empty consequent in 2^k - 2 ways). To reduce this number, only those rules whose confidence is above a given confidence threshold are selected. The confidence threshold value chosen is usually quite high.
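A minimal sketch of this rule generation step is given below, assuming that every non-empty proper subset of a frequent itemset is tried as an antecedent, with the remaining items forming the consequent, and that a rule is kept when its confidence is at least the threshold. The function `rules_from_itemset` and the toy dataset are hypothetical.

```python
from itertools import combinations


def support_count(itemset, records):
    """Return the number of records that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for record in records if items <= set(record))


def rules_from_itemset(itemset, records, min_confidence_percent):
    """List (antecedent, consequent, confidence) for rules meeting the threshold."""
    itemset = frozenset(itemset)
    whole_support = support_count(itemset, records)
    if whole_support == 0:
        return []  # the itemset never occurs, so it yields no rules
    rules = []
    # Antecedent sizes 1 .. k-1, so that the consequent is never empty.
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            conf = 100.0 * whole_support / support_count(antecedent, records)
            if conf >= min_confidence_percent:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical toy dataset.
records = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "b"}]

# The itemset {a, b} yields two candidate rules; only a -> b reaches 80%.
for antecedent, consequent, conf in rules_from_itemset({"a", "b"}, records, 80):
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

With a threshold of 0 the three-item set {a, b, c} yields six rules on this dataset, which illustrates how quickly the number of candidate rules grows with itemset size.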




Created and maintained by Frans Coenen. Last updated 13 February 2008