Data Mining can be defined as the task of discovering
interesting patterns from large amounts of data,
where the data can be stored in databases, data warehouses, or
other information repositories.
Data mining has many business applications in today’s world.
We can identify the behavior of our
customers and can effectively target them with personalized
messages using data mining techniques.
Assume that there is a shopping store where data about customers
has been recorded over a period of time. Applying a data mining
technique to the customers’ data can reveal patterns that provide
useful information. For example, a pattern may tell us that
people with a certain demographic profile (male, aged over 20
years) coming from a particular location have shown an inclination
to buy computer-related items. This is an interesting clue for
marketers. If a computer-related item is to be marketed in future,
the marketing effort should be focused on such persons instead of
sending marketing messages at random. In other words, the persons
indicated by the pattern are the ones who are likely to respond to
this kind of marketing initiative. Thus, a company that follows
the pattern can save time, energy and mailing cost.
Data warehouse
A data warehouse is a repository for long-term storage of data
from multiple sources, organized so as to support management
decision making. Fig. 1 below shows how data collected at
different sources is cleaned, transformed, integrated and loaded
into a data warehouse, from where it can be accessed by clients
for data mining and pattern evaluation.
Fig. 1: Data sources in Karachi, Lahore, Islamabad and Faisalabad
-> Clean, Transform, Integrate, Load -> Data warehouse
-> Query and Analysis tools -> Clients
Knowledge discovery
A knowledge discovery process includes data cleaning, data
integration, data selection, data transformation,
data mining, pattern evaluation and knowledge presentation.
Fig. 2 shows the knowledge discovery process:
Fig. 2: Databases -> (Cleaning and Integration) -> Data Warehouse
-> (Selection and Transformation) -> Data Mining -> Patterns
-> (Evaluation and Presentation) -> Knowledge
Note that data mining is only one step in the overall knowledge
discovery process. Data must be cleaned, transformed, selected and
integrated before data mining is performed. Data cleaning means
that missing values are filled in wherever needed and that
impossible or erroneous values are replaced by correct or
reasonable ones. For example, if the age of a person is typed as
1000 years in the column ‘age’, an average age value can be put in
its place. Data selection means keeping only the data that is fit
and relevant for the analysis; for instance, a row with many
erroneous or missing values can be discarded altogether. In data
transformation, the data from all the different sources is
converted into the same format. For example, a date typed under a
column should follow the same format across all the data collected
from different sources. In data integration, data from all the
sources is combined and housed in the data warehouse. This
cleaned, transformed, selected and integrated data is then fed
from the data warehouse to the data mining tool. The resulting
patterns are evaluated by managers, and useful knowledge is thus
gained. Note that almost 80% of the total time used in a knowledge
discovery process is spent on just making the data fit for mining,
that is, on data cleaning, data transformation, data selection
etc.
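The preparation steps described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the field names (name, age, joined), the age cutoff of 120 years, the limit of one missing field per row and the two accepted date formats are all hypothetical choices, not part of any particular tool.

```python
from datetime import datetime
from statistics import mean

MAX_PLAUSIBLE_AGE = 120            # assumed cutoff for an "impossible" age
MAX_MISSING_FIELDS = 1             # assumed limit before a row is discarded
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats used by the sources


def to_iso_date(text):
    """Transformation: convert a date in any known format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def is_valid_age(age):
    return age is not None and 0 < age <= MAX_PLAUSIBLE_AGE


def prepare(rows):
    # Selection: discard rows that have too many missing values.
    kept = [dict(r) for r in rows
            if sum(v is None for v in r.values()) <= MAX_MISSING_FIELDS]

    # Cleaning: replace missing or impossible ages with the average valid age.
    average_age = round(mean(r["age"] for r in kept if is_valid_age(r["age"])))
    for r in kept:
        if not is_valid_age(r["age"]):
            r["age"] = average_age

    # Transformation: bring every date into the same format.
    for r in kept:
        if r["joined"] is not None:
            r["joined"] = to_iso_date(r["joined"])
    return kept


# Hypothetical records collected from different sources.
raw_rows = [
    {"name": "A", "age": 30, "joined": "2020-01-15"},
    {"name": "B", "age": 1000, "joined": "15/02/2020"},  # impossible age
    {"name": "C", "age": 20, "joined": "2021-03-01"},
    {"name": None, "age": None, "joined": "01/01/2019"},  # two missing values
]

cleaned = prepare(raw_rows)
for row in cleaned:
    print(row)
```

In this sketch the fourth row is discarded, and the age of 1000 years is replaced by 25, the average of the two valid ages.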
Types of Data Mining
There are four main types of data mining as follows:
Classification
Association
Characterization
Clustering
Classification and association are predictive types of data
mining while characterization and clustering
represent the descriptive type.
Classification
Classification builds a predictive model that assigns samples to
different classes. The results of this type of mining are
represented as if-then rules, decision trees, neural networks etc.
Two important algorithms used for this type are the ID3 algorithm
and Bayesian classification. A decision tree is a graphical
representation of the if-then rules. Fig. 3 below shows the
result of classification in the form of a decision tree.
Initially, the whole data is divided into two sets: training data
and test data.
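As a rough illustration of how an algorithm such as ID3 picks its best predictor, the sketch below computes information gain, the measure ID3 uses to choose the attribute to split on. The rows and the attribute names (age_band, city) are hypothetical, and this is not a full decision-tree builder.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(group) / len(rows) * entropy(group)
                    for group in groups.values())
    return base - remainder


# Hypothetical training rows with 'sex' as the target attribute.
rows = [
    {"age_band": "20-25", "city": "Lahore", "sex": "M"},
    {"age_band": "20-25", "city": "Karachi", "sex": "M"},
    {"age_band": "26+", "city": "Lahore", "sex": "F"},
    {"age_band": "26+", "city": "Karachi", "sex": "F"},
]

gains = {attr: information_gain(rows, attr, "sex")
         for attr in ("age_band", "city")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data the age band separates the two classes completely while the city tells us nothing, so the age band has the higher gain and would be chosen as the first split.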
In the example below, ‘sex’ is the target attribute/variable with
males and females as the two classes. When no mining is done and
persons are picked at random, we find that males are 55% and
females 45% in the training data. The test data indicates a
similar result, with a variation of 1 or 2%. The classification
algorithm may find the variable ‘age’ to be the best predictor of
males, such that when the age is between 20 and 25 years the
percentage of males rises to 60% in the training data and 59% in
the test data. Similarly, education and annual income can be
discovered as further predictors for males, and so on. Thus, you
can find a pattern that when age is between 20 and 25 years,
education is matric or below and annual income is less than one
lac (assuming that the model ends at annual income), there is a
65% probability (in the training data) and a 66% probability (in
the test data) that the person is male. Similarly, a pattern for
predicting females can also be obtained. Note that by using
classification mining your probability of reaching males has
increased from 55% (when no model is used) to 65% when the model
is applied. Hence, if you want to market a product to males, you
can use the model or pattern discovered through classification
mining. Following this model, there would be a 65% chance that
your message reaches the desired class of persons (males). You can
send marketing messages to persons having the above profile to
increase the response rate. It would save time, energy and mailing
cost.
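The comparison between picking persons at random and picking them by the discovered profile can be sketched in Python as follows. The data is made up so that the shares come out close to the figures quoted above; the field names, the education labels and the 2% tolerance are assumptions for illustration only.

```python
ONE_LAC = 100_000  # one lac = 100,000


def make_people(males, females, age, education, income):
    """Build hypothetical person records that share the same profile."""
    base = {"age": age, "education": education, "income": income}
    return ([dict(base, sex="M") for _ in range(males)]
            + [dict(base, sex="F") for _ in range(females)])


def matches_profile(person):
    """The discovered rule: age 20-25, matric or below, income under one lac."""
    return (20 <= person["age"] <= 25
            and person["education"] in {"matric", "below matric"}
            and person["income"] < ONE_LAC)


def male_share(people):
    """Percentage of males in a list of person records."""
    return 100.0 * sum(p["sex"] == "M" for p in people) / len(people)


def rule_confirmed(train, test, tolerance=2.0):
    """Accept the rule only if train and test shares differ by <= tolerance."""
    train_share = male_share([p for p in train if matches_profile(p)])
    test_share = male_share([p for p in test if matches_profile(p)])
    return abs(train_share - test_share) <= tolerance


# Hypothetical training and test sets.
train = (make_people(13, 7, 22, "matric", 80_000)
         + make_people(9, 11, 40, "graduate", 300_000))
test = (make_people(33, 17, 23, "matric", 90_000)
        + make_people(23, 27, 35, "graduate", 250_000))

print(male_share(train))                                        # 55.0
print(male_share([p for p in train if matches_profile(p)]))     # 65.0
print(male_share([p for p in test if matches_profile(p)]))      # 66.0
print(rule_confirmed(train, test))                              # True
```

Selecting only the persons who match the profile raises the share of males from 55% to 65% in this made-up training set, and the test set confirms the rule within the assumed tolerance.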
In another example, the three classes in a sales campaign may be
‘good response’, ‘mild response’ and ‘no response’, and different
features of items such as ‘price’, ‘brand’, ‘category’ etc. can be
found as predictors by the algorithm.
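Fig. 3: Decision tree for the target attribute ‘sex’. Each node
gives the share of males (M) and females (F) in the training data
and the test data.

     Training Data   Test Data
M    55%             56%
F    45%             44%

M    65%             66%
F    35%             34%

M    62%             64%
F    38%             36%

M    60%             59%
F    40%             41%

M    40%             39%
F    60%             61%

M    35%             36%
F    65%             64%

Conditions shown in the tree:
Age >= 20 and <= 25 years
Education : Matric or Below
Location : rural area
Marital Status : unmarried
Annual Income < one lac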
Note that we split the data into training and test data to judge
the effectiveness of a rule. A rule (for example, age >= 20 and
<= 25 years) is accepted by the tool only if the test data also
confirms it, with a variation of up to 1 or 2%. The model is then
applied in practice and the results are analyzed to calculate the
efficiency of the tool/model.
Efficiency = (actual / theoretical) * 100
If, after applying the model, we actually reach 50% males whereas
the predicted value was 66% (we take the figure in the test data
for the calculation), then
Efficiency = (50 / 66) * 100 = 75.76%
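The same arithmetic can be written as a short Python function. The function and parameter names are hypothetical; the formula is the one given above.

```python
def efficiency(actual_pct, predicted_pct):
    """Efficiency = actual / theoretical * 100, both given as percentages."""
    if predicted_pct <= 0:
        raise ValueError("predicted percentage must be positive")
    return actual_pct / predicted_pct * 100


# 50% males actually reached against 66% predicted by the test data.
result = round(efficiency(50, 66), 2)
print(result)  # 75.76
```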
Whether the same model should be used in the future depends on
its efficiency. Normally, an efficiency close to 80% is considered
a good value.
Association
Association analysis is the discovery of association rules showing
attribute-value conditions that frequently occur together in a
given set of data. It is widely used for market basket analysis.
For example, if we record the sales of a big shopping store in
databases, then by applying association mining we may discover
that certain items have a strong affinity with each other, such
that when one item is purchased the other is usually purchased,
too. The Apriori algorithm is commonly used for association
mining.
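A minimal Python sketch of market basket analysis is given below. It is not the full Apriori algorithm: it only counts single items and pairs, and scores each candidate rule with support and confidence, the two measures usually used to judge association rules. The baskets and the thresholds are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence) between pairs of items."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(pair for basket in baskets
                          for pair in combinations(sorted(set(basket)), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n              # share of baskets holding both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Share of baskets with lhs that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical sales records, one list of items per purchase.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

rules = pair_rules(baskets, min_support=0.4, min_confidence=0.7)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In this made-up data, bread and butter appear together in three of the five baskets, and every basket containing butter also contains bread, so the rule butter -> bread has a confidence of 1.0.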