The aim of this lecture is to introduce you to the study of Human
Computer Interaction, so that after studying it you will be able to:
. Understand what evaluation is in the development process
. Understand different evaluation paradigms and techniques
What to evaluate?
There is a huge variety of interactive products with a vast
array of features that need
to be evaluated. Some features, such as the sequence of links to
be followed to find an
item on a website, are often best evaluated in a laboratory,
since such a
setting allows
the evaluators to control what they want to investigate. Other
aspects, such as whether
a collaborative toy is robust and whether children enjoy
interacting with it, are better
evaluated in natural settings, so that evaluators can see what
children do when left to
their own devices.
John Gould and his colleagues (Gould et al., 1990; Gould and Lewis, 1985)
recommended three principles for developing the 1984 Olympic Message System:
. Focus on users and their tasks
. Observe, measure, and analyze their performance with the system
. Design iteratively
Since the OMS study, a number of new evaluation techniques have
been developed.
There has also been a growing trend towards observing how people
interact with the
system in their work, home, and other settings, the goal being
to obtain a better
understanding of how the product is (or will be) used in its
intended setting. For
example, at work people are frequently being interrupted by
phone calls, others
knocking at their door, email arriving, and so on—to the extent
that many tasks are
interrupt-driven. Only rarely does someone carry a task out from
beginning to end
without stopping to do something else. Hence the way people
carry out an activity
(e.g., preparing a report) in the real world is very different
from how it may be
observed in a laboratory. Furthermore, this observation has
implications for the way
products should be designed.
Why do you need to evaluate?
Just as designers shouldn't assume that everyone is like them,
they also shouldn't
presume that following design guidelines guarantees good
usability. Evaluation is
needed to check that users can use the product and like it.
Furthermore, nowadays
users look for much more than just a usable system, as the
Nielsen Norman Group, a
usability consultancy company, point out (www.nngroup.com):
"User experience" encompasses all aspects of the end-user's
interaction ...
the first requirement for an exemplary user experience is to
meet the exact
needs of the customer, without fuss or bother. Next comes
simplicity and
elegance that produce products that are a joy to own, a joy to use."
Bruce Tognazzini, another successful usability consultant,
comments
(www.asktog.com) that:
“Iterative design, with its repeating cycle of design and
testing, is the
only validated methodology in existence that will consistently
produce
successful results. If you don't have user-testing as an
integral part of
your design process you are going to throw buckets of money down
the
drain.”
Tognazzini points out that there are five good reasons for
investing in user
testing:
1. Problems are fixed before the product is shipped, not after.
2. The team can concentrate on real problems, not imaginary
ones.
3. Engineers code instead of debating.
4. Time to market is sharply reduced.
5. Finally, upon first release, your sales department has a
rock-solid design it can sell
without having to pepper their pitches with how it will all
actually work in release 1.1
or 2.0.
Now that there is a diversity of interactive products, it is not
surprising that the range
of features to be evaluated is very broad. For example,
developers of a new web
browser may want to know if users find items faster with their
product. Government
authorities may ask if a computerized system for controlling
traffic lights results in
fewer accidents. Makers of a toy may ask if six-year-olds can
manipulate the controls
and whether they are engaged by its furry case and pixie face. A
company that
develops the casing for cell phones may ask if the shape, size,
and color of the case is
appealing to teenagers. A new dotcom company may want to assess
market reaction
to its new home page design.
This diversity of interactive products, coupled with new user
expectations, poses
interesting challenges for evaluators, who, armed with many well
tried and tested
techniques, must now adapt them and develop new ones. As well as
usability, user
experience goals can be extremely important for a product's
success.
When to evaluate?
The product being developed may be a brand-new product or an
upgrade of an existing
product. If the product is new, then considerable time is
usually invested in market
research. Designers often support this process by developing
mockups of the potential
product that are used to elicit reactions from potential users.
As well as helping to
assess market need, this activity contributes to understanding
users' needs and early
requirements. As we said in an earlier lecture, sketches, screen mockups, and other
low-fidelity prototyping techniques are used to represent design ideas. Many of
these same techniques are used to elicit users' opinions in evaluation
(e.g., questionnaires and
interviews), but the purpose and focus of evaluation are
different. The goal of evaluation
is to assess how well a design fulfills users' needs and whether
users like it.
In the case of an upgrade, there is limited scope for change and
attention is focused on
improving the overall product. This type of design is well
suited to usability
engineering in which evaluations compare user performance and
attitudes with those
for previous versions. Some products, such as office systems, go
through many
versions, and successful products may reach double-digit version
numbers. In
contrast, new products do not have previous versions and there
may be nothing
comparable on the market, so more radical changes are possible
if evaluation results
indicate a problem.
Evaluations done during design to check that the product
continues to meet users'
needs are known as formative evaluations. Evaluations that are
done to assess the
success of a finished product, such as those to satisfy a
sponsoring agency or to check
that a standard is being upheld, are known as summative
evaluations. Agencies such as the
National Institute of Standards and Technology (NIST) in the
USA, the International
Standards Organization (ISO) and the British Standards Institute
(BSI) set standards
by which products produced by others are evaluated.
29.1 Evaluation paradigms and techniques
Before we describe the techniques used in evaluation studies, we
shall start by
proposing some key terms. Terminology in this field tends to be
loose and often
confusing so it is a good idea to be clear from the start what
you mean. We start with
the much-used term user studies, defined by Abigail Sellen in
her interview as
follows: "user studies essentially involve looking at how people
behave either in their
natural [environments], or in the laboratory, both with old
technologies and with new
ones." Any kind of evaluation, whether it is a user study or
not, is guided either
explicitly or implicitly by a set of beliefs that may also be
underpinned by theory.
These beliefs and the practices (i.e., the methods or
techniques) associated with them
are known as an evaluation paradigm, which you should not confuse with the
"interaction paradigms." Often evaluation paradigms are related
to a particular
discipline in that they strongly influence how people from the
discipline think about
evaluation. Each paradigm has particular methods and techniques
associated with it.
So that you are not confused, we want to state explicitly that
we will not be
distinguishing between methods and techniques. We tend to talk
about techniques, but
you may find that others call them methods. An example of
the relationship
between a paradigm and the techniques used by evaluators
following that paradigm
can be seen for usability testing, which is an applied science
and engineering
paradigm. The techniques associated with usability testing are: user testing in a
user testing in a
controlled environment; observation of user activity in the
controlled environment and
the field; and questionnaires and interviews.
Evaluation paradigms
In this lecture we identify four core evaluation paradigms: (1)
“quick and dirty” evaluations;
(2) usability testing; (3) field studies; and (4) predictive
evaluation. Other
people may use slightly different terms to refer to similar
paradigms.
"Quick and dirty" evaluation
A "quick and dirty" evaluation is a common practice in which
designers informally
get feedback from users or consultants to confirm that their
ideas are in line with
users' needs and are liked. "Quick and dirty" evaluations can be
done at any stage and
the emphasis is on fast input rather than carefully documented
findings. For example,
early in design developers may meet informally with users to get
feedback on ideas
for a new product (Hughes et al., 1994). At later stages similar
meetings may occur to
try out an idea for an icon, check whether a graphic is liked,
or confirm that
information has been appropriately categorized on a webpage.
This approach is often
called "quick and dirty" because it is meant to be done in a
short space of time.
Getting this kind of feedback is an essential ingredient of
successful design.
As discussed in earlier lectures, any involvement with users
will be highly informative
and you can learn a lot early in design by observing what people
do and talking to
them informally. The data collected is usually descriptive and
informal and it is fed
back into the design process as verbal or written notes,
sketches and anecdotes, etc.
Another source comes from consultants, who use their knowledge
of user behavior,
the market place and technical know-how, to review software
quickly and provide
suggestions for improvement. It is an approach that has become
particularly popular
in web design where the emphasis is usually on short timescales.
Usability testing
Usability testing was the dominant approach in the 1980s
(Whiteside et al., 1988), and
remains important, although, as you will see, field studies and
heuristic evaluations
have grown in prominence. Usability testing involves measuring
typical users’
performance on carefully prepared tasks that are typical of
those for which the system
was designed. Users’ performance is generally measured in terms
of number of errors
and time to complete the task. As the users perform these tasks,
they are watched and
recorded on video and by logging their interactions with
software. This observational
data is used to calculate performance times, identify errors,
and help explain why the
users did what they did. User satisfaction questionnaires and
interviews are also used
to elicit users’ opinions.
The defining characteristic of usability testing is that it is
strongly controlled by the
evaluator (Mayhew, 1999). There is no mistaking that the
evaluator is in charge!
Typically tests take place in laboratory-like conditions that
are controlled. Casual
visitors are not allowed, telephone calls are stopped, and
there is no possibility of
talking to colleagues, checking email, or doing any of the other
tasks that most of us
rapidly switch among in our normal lives. Everything that the
participant does is
recorded—every key press, comment, pause, expression, etc., so
that it can be used as
data.
Quantifying users' performance is a dominant theme in usability
testing. However,
unlike research experiments, variables are not manipulated and
the typical number of
participants is too small for much statistical analysis. User
satisfaction data from
questionnaires tends to be categorized and average ratings are
presented. Sometimes
video or anecdotal evidence is also included to illustrate
problems that users
encounter. Some evaluators then summarize this data in a
usability specification so
that developers can use it to test future prototypes or versions
of the product against it.
Optimal performance levels and minimal levels of acceptance are
often specified and
current levels noted. Changes in the design can then be agreed
and engineered—hence
the term "usability engineering."
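To make this concrete, the short Python sketch below shows one way logged test data might be summarized and checked against a usability specification. It is only an illustration: the task measurements, threshold values, and variable names are invented, not taken from any real study.

# Hypothetical sketch: summarizing usability-test data against a usability specification.
# Task measurements and thresholds are invented for illustration only.
from statistics import mean, stdev

# Logged completion times (seconds) and error counts for one task, one value per participant.
completion_times = [182, 210, 167, 240, 195, 205, 178, 221]
error_counts = [2, 4, 1, 5, 3, 2, 1, 4]

# A simple usability specification: optimal target and minimal acceptable levels.
spec = {
    "optimal_time_s": 180,         # optimal performance level
    "max_acceptable_time_s": 230,  # minimal level of acceptance
    "max_acceptable_errors": 4,
}

avg_time, sd_time = mean(completion_times), stdev(completion_times)
avg_errors = mean(error_counts)

print(f"Mean time: {avg_time:.1f}s (sd {sd_time:.1f}s), mean errors: {avg_errors:.1f}")
print("Time within acceptable level:", avg_time <= spec["max_acceptable_time_s"])
print("Errors within acceptable level:", avg_errors <= spec["max_acceptable_errors"])

A summary like this can then serve as the benchmark against which future prototypes or versions are tested.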
Field studies
The distinguishing feature of field studies is that they are
done in natural settings with
the aim of increasing understanding about what users do
naturally and how
technology impacts them. In product design, field studies can be used to (1) help
identify opportunities for new technology; (2) determine requirements for design; (3)
facilitate the introduction of technology; and (4) evaluate technology (Bly, 1997).
We introduced qualitative techniques such as interviews,
observation, participant
observation, and ethnography that are used in field studies. The
exact choice of
techniques is often influenced by the theory used to analyze the
data. The data takes
the form of events and conversations that are recorded as notes,
or by audio or video
recording, and later analyzed using a variety of analysis
techniques such as content,
discourse, and conversational analysis. These techniques vary
considerably. In content
analysis, for example, the data is analyzed into content
categories, whereas in
discourse analysis the use of words and phrases is examined.
Artifacts are also
collected. In fact, anything that helps to show what people do
in their natural contexts
can be regarded as data.
In this lecture we distinguish between two overall approaches to
field studies. The
first involves observing explicitly and recording what is
happening, as an outsider
looking on. Qualitative techniques are used to collect the data,
which may then be
analyzed qualitatively or quantitatively. For example, the
number of times a particular
event is observed may be presented in a bar graph with means and
standard
deviations.
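As a purely illustrative example of this kind of quantitative analysis, the Python sketch below tallies coded observation events per session and reports a mean and standard deviation for each event type; the event codes and counts are invented.

# Hypothetical sketch: tallying coded field-observation events per session and
# summarizing them with means and standard deviations (the figures that might feed a bar graph).
from collections import Counter
from statistics import mean, stdev

# Each inner list is one observation session's coded events (codes invented for illustration).
sessions = [
    ["interruption", "help-request", "interruption", "workaround"],
    ["interruption", "workaround", "workaround"],
    ["help-request", "interruption", "interruption", "interruption"],
]

event_types = {code for session in sessions for code in session}
for code in sorted(event_types):
    per_session = [Counter(session)[code] for session in sessions]
    print(f"{code}: mean {mean(per_session):.2f}, sd {stdev(per_session):.2f} per session")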
In some field studies the evaluator may be an insider or even a
participant.
Ethnography is a particular type of insider evaluation in which
the aim is to explore
the details of what happens in a particular social setting. “In
the context of human
computer interaction, ethnography is a means of studying work
(or other activities) in
order to inform the design of information systems and understand
aspects of their use”
(Shapiro, 1995, p. 8).
Predictive evaluation
In predictive evaluations experts apply their knowledge of
typical users, often guided
by heuristics, to predict usability problems. Another approach
involves theoretically
based models. The key feature of predictive evaluation is that
users need not be present,
which makes the process quick, relatively inexpensive, and thus
attractive to
companies; but it has limitations.
In recent years heuristic evaluation in which experts review the
software product
guided by tried and tested heuristics has become popular
(Nielsen and Mack, 1994).
Usability guidelines (e.g., always provide clearly marked exits)
were designed
primarily for evaluating screen-based products (e.g. form
fill-ins, catalogs,
etc.). With the advent of a range of new interactive products
(e.g., the web, mobiles,
collaborative technologies), this original set of heuristics has
been found insufficient.
While some are still applicable (e.g., speak the users'
language), others are
inappropriate. New sets of heuristics are also needed that are
aimed at evaluating
different classes of interactive products. In particular,
specific heuristics are needed
that are tailored to evaluating web-based products, mobile
devices, collaborative
technologies, computerized toys, etc. These should be based on a
combination of
usability and user experience goals, new research findings and
market research. Care
is needed in using sets of heuristics. Designers are sometimes
led astray by findings
from heuristic evaluations that turn out not to be as accurate
as they at first seemed.
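The Python sketch below illustrates, in a simplified and hypothetical way, how findings from a heuristic evaluation might be recorded and ranked by severity; the heuristic names echo well-known guidelines, but the problems and the 0-4 severity scale are invented for illustration.

# Hypothetical sketch: recording problems found during a heuristic evaluation.
# The findings and the 0-4 severity scale are invented for illustration.
from dataclasses import dataclass

@dataclass
class Finding:
    heuristic: str   # which heuristic the problem violates
    problem: str     # short description of the usability problem
    severity: int    # 0 = not a problem ... 4 = usability catastrophe

findings = [
    Finding("Clearly marked exits", "No way to cancel the checkout wizard", 3),
    Finding("Speak the users' language", "Error message shows internal code E-4012", 2),
]

# Simple roll-up: worst problems first, so the team sees what to fix before shipping.
for f in sorted(findings, key=lambda f: f.severity, reverse=True):
    print(f"[severity {f.severity}] {f.heuristic}: {f.problem}")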
The summary below presents the key aspects of each evaluation paradigm for the
following issues:
. the role of users
. who controls the process and the relationship between evaluators and users during the evaluation
. the location of the evaluation
. when the evaluation is most useful
. the type of data collected and how it is analyzed
. how the evaluation findings are fed back into the design process
. the philosophy and theory that underlies the evaluation paradigms.
Role of users
. "Quick and dirty": natural behavior.
. Usability testing: to carry out set tasks.
. Field studies: natural behavior.
. Predictive: users generally not involved.

Who controls
. "Quick and dirty": evaluators take minimum control.
. Usability testing: evaluators strongly in control.
. Field studies: evaluators try to develop relationships with users.
. Predictive: expert evaluators.

Location
. "Quick and dirty": natural environment or laboratory.
. Usability testing: laboratory.
. Field studies: natural environment.
. Predictive: laboratory-oriented but often happens on customer's premises.

When used
. "Quick and dirty": any time you want to get feedback about a design quickly; techniques from other evaluation paradigms can be used, e.g., experts review software.
. Usability testing: with a prototype or product.
. Field studies: most often used early in design to check that users' needs are being met or to assess problems or design opportunities.
. Predictive: expert reviews (often done by consultants) with a prototype, but can occur at any time; models are used to assess specific aspects of a potential design.

Type of data
. "Quick and dirty": usually qualitative, informal descriptions.
. Usability testing: quantitative, sometimes statistically validated; users' opinions collected by questionnaire or interview.
. Field studies: qualitative descriptions often accompanied with sketches, scenarios, quotes, and other artifacts.
. Predictive: list of problems from expert reviews; quantitative figures from models, e.g., how long it takes to perform a task using two designs.

Fed back into design by
. "Quick and dirty": sketches, quotes, descriptive report.
. Usability testing: report of performance measures, errors, etc.; findings provide a benchmark for future versions.
. Field studies: descriptions that include quotes, sketches, anecdotes, and sometimes time logs.
. Predictive: reviewers provide a list of problems, often with suggested solutions; times calculated from models are given to designers.

Philosophy
. "Quick and dirty": user-centered, highly practical approach.
. Usability testing: applied approach based on experimentation, i.e., usability engineering.
. Field studies: may be objective observation or ethnographic.
. Predictive: practical heuristics and practitioner expertise underpin expert reviews; theory underpins models.
Techniques
There are many evaluation techniques and they can be categorized in various ways,
but in this lecture we will examine techniques for:
. observing users
. asking users their opinions
. asking experts their opinions
. testing users' performance
. modeling users' task performance to predict the efficacy of a user interface
The brief descriptions below offer an overview of each category.
Be aware that some
techniques are used in different ways in different evaluation
paradigms.
Observing users
Observation techniques help to identify needs leading to new
types of products and
help to evaluate prototypes. Notes, audio, video, and
interaction logs are well-known
ways of recording observations and each has benefits and
drawbacks. Obvious
challenges for evaluators are how to observe without disturbing
the people being
observed and how to analyze the data, particularly when large
quantities of video data
are collected or when several different types must be integrated
to tell the story (e.g.,
notes, pictures, sketches from observers).
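As a rough, hypothetical illustration of what an interaction log can look like, the Python sketch below writes timestamped observation events to a CSV file; the event names and fields are assumptions rather than any standard format.

# Hypothetical sketch: a minimal timestamped interaction log for an observation session.
# Event names, fields, and the output path are invented; real logging tools capture far more detail.
import csv
import time

LOG_PATH = "session_log.csv"  # assumed output file

def log_event(writer, participant: str, event: str, detail: str = "") -> None:
    """Append one timestamped observation to the log."""
    writer.writerow([time.time(), participant, event, detail])

with open(LOG_PATH, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "participant", "event", "detail"])
    log_event(writer, "P01", "key_press", "Ctrl+F")
    log_event(writer, "P01", "pause", "looks away from screen")
    log_event(writer, "P01", "comment", "I can't find the search box")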
Asking users
Asking users what they think of a product—whether it does what
they want; whether
they like it; whether the aesthetic design appeals; whether they
had problems using it;
whether they want to use it again—is an obvious way of getting
feedback. Interviews
and questionnaires are the main techniques for doing this. The
questions asked can be
unstructured or tightly structured. They can be asked of a few
people or of hundreds.
Interview and questionnaire techniques are also being developed
for use with email
and the web.
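For illustration only, the Python sketch below summarizes Likert-scale answers (1 = strongly disagree to 5 = strongly agree) to a single questionnaire item; the question wording and ratings are invented.

# Hypothetical sketch: summarizing Likert-scale responses to one questionnaire item.
from collections import Counter
from statistics import mean, median

question = "The product was easy to use."
ratings = [4, 5, 3, 4, 2, 5, 4, 4, 3, 5]

print(question)
print("Responses per rating:", dict(sorted(Counter(ratings).items())))
print(f"Median: {median(ratings)}, mean: {mean(ratings):.1f}")  # the median suits ordinal data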
Asking experts
Software inspections and reviews are long established techniques
for evaluating
software code and structure. During the 1980s versions of
similar techniques were
developed for evaluating usability. Guided by heuristics,
experts step through tasks
role-playing typical users and identify problems. Developers
like this approach because
it is usually relatively inexpensive and quick to perform
compared with laboratory
and field evaluations that involve users. In addition, experts
frequently suggest
solutions to problems.
User testing
Measuring user performance to compare two or more designs has
been the bedrock of
usability testing. As we said earlier when discussing usability
testing, these tests are
usually conducted in controlled settings and involve typical
users performing typical,
well-defined tasks. Data is collected so that performance can be
analyzed. Generally
the time taken to complete a task, the number of errors made,
and the navigation path
through the product are recorded. Descriptive statistical
measures such as means and
standard deviations are commonly used to report the results.
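For example, here is a minimal Python sketch of how such descriptive statistics might be reported when comparing two designs; the completion times are invented for illustration.

# Hypothetical sketch: descriptive statistics for comparing task-completion times (seconds)
# on two alternative designs. The measurements are invented.
from statistics import mean, stdev

times = {
    "Design A": [95, 110, 102, 88, 120, 99],
    "Design B": [130, 125, 140, 118, 150, 135],
}

for design, samples in times.items():
    print(f"{design}: n={len(samples)}, mean={mean(samples):.1f}s, sd={stdev(samples):.1f}s")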
Modeling users’ task performance
There have been various attempts to model human-computer
interaction so as to
predict the efficiency and problems associated with different
designs at an early stage
without building elaborate prototypes. These techniques are
successful for systems
with limited functionality such as telephone systems. GOMS and
the keystroke model
are the best known techniques.
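To give a flavor of this kind of prediction, the Python sketch below performs a keystroke-level-model style calculation; the operator times are the commonly quoted textbook approximations and the task breakdown is hypothetical.

# Hypothetical sketch in the spirit of the Keystroke-Level Model (KLM).
# Operator times are the commonly quoted textbook approximations (seconds);
# the task breakdown is invented for illustration.
OPERATOR_TIMES = {
    "K": 0.2,   # press a key or button (average skilled typist)
    "P": 1.1,   # point at a target with a mouse
    "H": 0.4,   # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def predict_time(operators: str) -> float:
    """Sum the operator times for a sequence such as 'MPKMPK'."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# Hypothetical task: select a file with the mouse, then delete it from a menu.
sequence = "HMPK" + "MPK"   # home to mouse, think, point, click; think, point to menu item, click
print(f"Predicted task time: {predict_time(sequence):.2f} s")

In practice such predicted times are most useful for comparing alternative designs rather than as absolute figures.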