Lesson#35
CONFIDENCE AND SUPPORT
Two measures are used in association mining: support and
confidence. Confidence is a measure of how often the relationship
holds true, e.g., what percentage of the time people who bought
milk also bought eggs. Support is the percentage of all
transactions in which the two items occur together.
Mathematically, they can be expressed as follows if we take the
example of eggs and milk:
Confidence = Transactions (milk and eggs) / Transactions (milk)
If the number of transactions involving both milk and eggs is 25
and the number of transactions involving milk is 75, then
confidence is 25/75*100 = 33.3%
Support = Transactions (eggs and milk) / Total no. of transactions
If the number of transactions involving both eggs and milk is 10
and the total number of transactions in a day is 50, then
support is 10/50*100 = 20%
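The two measures can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the five baskets are
made-up sample data (not from the lesson), the function names are
hypothetical, and confidence is computed in the sense of the prose
definition above, i.e. the share of milk transactions that also
contain eggs.

```python
def support(transactions, items):
    """Percentage of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return 100.0 * hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Percentage of transactions with `antecedent` that also contain `consequent`."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return 100.0 * n_both / n_antecedent if n_antecedent else 0.0


# Hypothetical sample data: each set is one customer's basket.
transactions = [
    {"milk", "eggs"},
    {"milk", "eggs", "bread"},
    {"milk"},
    {"bread"},
    {"eggs", "bread"},
]

# Milk and eggs appear together in 2 of the 5 baskets.
milk_eggs_support = support(transactions, ["milk", "eggs"])
# Of the 3 baskets that contain milk, 2 also contain eggs.
milk_eggs_confidence = confidence(transactions, ["milk"], ["eggs"])

print(f"support = {milk_eggs_support:.1f}%")
print(f"confidence = {milk_eggs_confidence:.1f}%")
```

With this sample data, support is 40% and confidence is roughly
66.7%, which shows that the two measures answer different
questions about the same pair of items.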
Suppose confidence is 90% but support is only 5%. We can gather
from this that the two items have a very strong affinity with
each other, such that when one is sold the other is usually sold
with it; however, the chance of this pair being purchased out of
the total number of transactions is very slim, just 5%.
One can adjust these measures to discover items having a
corresponding level of association and set marketing strategy
accordingly. So, if I feed the data to the association mining tool
and specify the percentages of confidence and support, it will
list the items whose association corresponds to these percentages.
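The idea of specifying threshold percentages can be sketched as
follows. This is a brute-force illustration over item pairs, not
the method of any particular mining tool (real tools use more
efficient algorithms such as Apriori). The function name
`find_rules`, the baskets, and the threshold values are all
assumptions made for the example.

```python
from itertools import permutations


def find_rules(transactions, min_support, min_confidence):
    """List (antecedent, consequent, support %, confidence %) for item
    pairs that meet both thresholds."""
    total = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_both = sum(1 for t in transactions if a in t and b in t)
        if n_a == 0:
            continue
        sup = 100.0 * n_both / total   # share of all baskets with both items
        conf = 100.0 * n_both / n_a    # share of baskets with `a` that also have `b`
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules


# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rules = find_rules(baskets, min_support=50.0, min_confidence=75.0)
for a, b, sup, conf in rules:
    print(f"{a} -> {b}: support {sup:.0f}%, confidence {conf:.0f}%")
```

Raising either threshold shortens the list; lowering them lets
weaker associations through, so the thresholds act as the knobs
described above.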
Results of association mining are shown with the help of double
arrows as indicated below:
Bread <--> Butter
Computer <--> Furniture
Clothes <--> Shoes
Using the results of association mining, a marketer can take a
number of useful steps to set or modify marketing strategy. For
example, items that have a close affinity with each other can be
shelved together to improve customer service. Certain promotional
schemes can also be introduced in view of the association mining
results.
Characterization
Characterization is the discovery of interesting concepts in
concise and succinct terms at generalized levels, for examining
the general behavior of the data. For example, in a database of
graduate students of a university, students of different
nationalities can be enrolled in different departments such as
music, history, physics etc. We can apply the characterization
technique to find a generalized answer to the question of how
many students of a particular country are studying science or
arts. See the following example:
Student name    Department     City of residence
Imran           History        Karachi
Alice           Physics        London
Ali             Literature     Lahore
Bob             Mathematics    Toronto
…
In the above example, a characterization tool can tell us, for
instance, that 2 Pakistani students are studying arts. Note that
the concepts of location and field of education are generalized
to Pakistan and arts, respectively.
The two algorithms used in characterization are Version Space
Search and Attribute-Oriented Induction.
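The generalization step in the student example can be sketched
roughly as below, in the spirit of attribute-oriented induction
rather than as a full implementation of it. The two mapping tables
are assumed concept hierarchies that the lesson does not spell out,
and the function name `characterize` is hypothetical.

```python
from collections import Counter

# Assumed concept hierarchies (not given in the lesson).
CITY_TO_COUNTRY = {
    "Karachi": "Pakistan",
    "Lahore": "Pakistan",
    "London": "UK",
    "Toronto": "Canada",
}
DEPARTMENT_TO_FIELD = {
    "History": "arts",
    "Literature": "arts",
    "Physics": "science",
    "Mathematics": "science",
}

# The four rows listed in the table above.
students = [
    ("Imran", "History", "Karachi"),
    ("Alice", "Physics", "London"),
    ("Ali", "Literature", "Lahore"),
    ("Bob", "Mathematics", "Toronto"),
]


def characterize(rows):
    """Drop the name, generalize city and department, and count
    how many rows fall into each generalized (country, field) pair."""
    return Counter(
        (CITY_TO_COUNTRY[city], DEPARTMENT_TO_FIELD[dept])
        for _name, dept, city in rows
    )


summary = characterize(students)
for (country, field), count in sorted(summary.items()):
    print(f"{country}, {field}: {count}")
```

For the four rows in the table, the pair (Pakistan, arts) gets a
count of 2, which is the generalized answer quoted in the text.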
Clustering
A cluster is a group of data objects that are similar to one
another within the same cluster and dissimilar to the objects in
other clusters. Examples include clusters of distinct groups of
customers, categories of emails in a mailing list database, and
different categories of web usage from log files. Clustering
serves as a preprocessing step for other algorithms such as
classification and characterization. The K-means algorithm is
normally used in clustering. In the example below you can see
four clusters of customers based on their income level. The
K-means algorithm displays the result in the format shown in
Fig. 1 below:
[Figure: four clusters of customers by income level]
Cluster 1: Income < 1,00,000
Cluster 2: Income >= 1,00,000 and <= 2,00,000
Cluster 3: Income > 2,00,000 and <= 3,50,000
Cluster 4: Income > 3,50,000
Fig. 1
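A minimal sketch of K-means on a single attribute (income) is
given below to show how such groups can emerge. The income values
are hypothetical, the function name `kmeans_1d` is an assumption,
and the starting centroids are picked deterministically from the
sorted data so that the run is repeatable (practical
implementations usually start from random points).

```python
def kmeans_1d(values, k, iterations=100):
    """Plain K-means on one numeric attribute; returns (centroids, clusters)."""
    data = sorted(values)
    # Deterministic start: k values spread evenly across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids, clusters


# Hypothetical annual incomes of twelve customers.
incomes = [
    300000, 50000, 450000, 120000, 70000, 250000,
    180000, 500000, 60000, 330000, 150000, 400000,
]

centroids, clusters = kmeans_1d(incomes, k=4)
for centre, members in zip(centroids, clusters):
    print(f"centre {centre:,.0f}: {members}")
```

On this sample the four resulting groups fall within the four
income bands of Fig. 1, although K-means itself is never told the
band boundaries; it only minimizes the distance of each value to
its cluster centre.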
Online Analytical Processing (OLAP)
OLAP makes use of background knowledge regarding the domain of
the data being studied in order to allow the presentation of data
at different levels of abstraction. It is different from data
mining in the sense that it does not provide any patterns for
making predictions; rather, the information stored in databases
can be presented/viewed in a convenient format at different
levels, which helps decision makers or managers. The result of
OLAP is displayed in the form of a data cube as shown in Fig. 2
below:
[Figure: Data Cube in OLAP]
Dimensions: Time (quarters Q1, Q2, Q3, Q4); Item types (furniture,
computer, phone, grocery); Location (cities: Lahore, Karachi)
Cell values shown: 605, 825, 400, 440, 345
Fig. 2
Note that in the above diagram, time, item type and location are
the three dimensions. The OLAP data cube indicates the sale of 605
and 825 units of furniture and computers, respectively, in Lahore
in the first quarter of the year; 440 units of furniture and 345
phone sets in Karachi in the first quarter; and 400 grocery items
in Lahore during the second quarter. Results can also be displayed
through a data cube against more than three dimensions. For
instance, the variables ‘warehouse’ and ‘customer type’ may be
added as dimensions to view the sale results. An OLAP tool allows
the use of different processes, namely drill-down, roll-up, slice,
dice etc. Using drill-down we can dig further into the data to
retrieve more specific information. For example, using drill-down
I can find the sale of furniture in a specific month of the first
quarter, say, February. Roll-up is the reverse of drill-down. In
it we can sum up or integrate the information in a particular
dimension to show the result. For example, the sale of furniture
or computers in a particular year (rather than a specific quarter)
can be viewed using roll-up. Similarly, through slice and dice,
information can be presented that is specific to certain
dimensions of the data cube.
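The roll-up, slice and dice operations can be sketched on a tiny
cube that holds only the five cell values read from Fig. 2. This is
an illustrative sketch, not how any particular OLAP tool stores its
data; the dictionary layout and the function names are assumptions.
Drill-down is left out because it would need month-level figures,
which the lesson does not give.

```python
from collections import defaultdict

DIMENSIONS = ("quarter", "item", "city")

# Cells read from Fig. 2: (quarter, item type, city) -> units sold.
cube = {
    ("Q1", "furniture", "Lahore"): 605,
    ("Q1", "computer", "Lahore"): 825,
    ("Q1", "furniture", "Karachi"): 440,
    ("Q1", "phone", "Karachi"): 345,
    ("Q2", "grocery", "Lahore"): 400,
}


def roll_up(cube, drop):
    """Sum the cube over one dimension, removing it from the keys."""
    idx = DIMENSIONS.index(drop)
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[key[:idx] + key[idx + 1:]] += units
    return dict(totals)


def dice(cube, **allowed):
    """Keep cells whose coordinates are among the allowed values
    for each named dimension (a sub-cube)."""
    return {
        key: units
        for key, units in cube.items()
        if all(key[DIMENSIONS.index(dim)] in values
               for dim, values in allowed.items())
    }


def slice_(cube, dim, value):
    """A slice fixes a single value on one dimension."""
    return dice(cube, **{dim: [value]})


by_quarter_item = roll_up(cube, "city")        # totals across both cities
first_quarter = slice_(cube, "quarter", "Q1")  # only Q1 cells
lahore_tech = dice(cube, city=["Lahore"], item=["computer", "phone"])

print(by_quarter_item[("Q1", "furniture")])
print(first_quarter)
print(lahore_tech)
```

Rolling up over the city dimension combines the furniture sales of
Lahore and Karachi in the first quarter (605 + 440), while the
slice and dice calls simply restrict which cells of the cube are
shown.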
SAS (Enterprise Miner) and DBMiner are two commonly used tools
for data mining and OLAP. Note that characterization can be
applied to any data type, whereas OLAP is generally used for
numeric data alone.