The aim of this lecture is to introduce you to the study of Human
Computer Interaction, so that after studying it you will be able to:
. Understand what evaluation is in the development process
. Understand different evaluation paradigms and techniques
What to evaluate?
There is a huge variety of interactive products with a vast
array of features that need
to be evaluated. Some features, such as the sequence of links to
be followed to find an
item on a website, are often best evaluated in a laboratory,
since such a
setting allows
the evaluators to control what they want to investigate. Other
aspects, such as whether
a collaborative toy is robust and whether children enjoy
interacting with it, are better
evaluated in natural settings, so that evaluators can see what
children do when left to
their own devices.
John Gould and his colleagues (Gould et al., 1990; Gould and Lewis, 1985)
recommended three principles for developing the 1984 Olympic Message System:
. Focus on users and their tasks
. Observe, measure, and analyze their performance with the system
. Design iteratively
Since the OMS study, a number of new evaluation techniques have
been developed.
There has also been a growing trend towards observing how people
interact with the
system in their work, home, and other settings, the goal being
to obtain a better
understanding of how the product is (or will be) used in its
intended setting. For
example, at work people are frequently being interrupted by
phone calls, others
knocking at their door, email arriving, and so on—to the extent
that many tasks are
interrupt-driven. Only rarely does someone carry a task out from
beginning to end
without stopping to do something else. Hence the way people
carry out an activity
(e.g., preparing a report) in the real world is very different
from how it may be
observed in a laboratory. Furthermore, this observation has
implications for the way
products should be designed.
Why do you need to evaluate?
Just as designers shouldn't assume that everyone is like them,
they also shouldn't
presume that following design guidelines guarantees good
usability. Evaluation is
needed to check that users can use the product and like it.
Furthermore, nowadays
users look for much more than just a usable system, as the
Nielsen Norman Group, a
usability consultancy company, point out (www.nngroup.com):
"User experience" encompasses all aspects of the end-user's
interaction ...
the first requirement for an exemplary user experience is to
meet the exact
needs of the customer, without fuss or bother. Next comes
simplicity and
elegance that produce products that are a joy to own, a joy to use."
Bruce Tognazzini, another successful usability consultant,
comments
(www.asktog.com) that:
“Iterative design, with its repeating cycle of design and
testing, is the
only validated methodology in existence that will consistently
produce
successful results. If you don't have user-testing as an
integral part of
your design process you are going to throw buckets of money down
the
drain.”
Tognazzini points out that there are five good reasons for
investing in user
testing:
1. Problems are fixed before the product is shipped, not after.
2. The team can concentrate on real problems, not imaginary
ones.
3. Engineers code instead of debating.
4. Time to market is sharply reduced.
5. Finally, upon first release, your sales department has a
rock-solid design it can sell
without having to pepper their pitches with how it will all
actually work in release 1.1
or 2.0.
Now that there is a diversity of interactive products, it is not
surprising that the range
of features to be evaluated is very broad. For example,
developers of a new web
browser may want to know if users find items faster with their
product. Government
authorities may ask if a computerized system for controlling
traffic lights results in
fewer accidents. Makers of a toy may ask if six-year-olds can
manipulate the controls
and whether they are engaged by its furry case and pixie face. A
company that
develops the casing for cell phones may ask if the shape, size,
and color of the case is
appealing to teenagers. A new dotcom company may want to assess
market reaction
to its new home page design.
This diversity of interactive products, coupled with new user
expectations, poses
interesting challenges for evaluators, who, armed with many well
tried and tested
techniques, must now adapt them and develop new ones. As well as
usability, user
experience goals can be extremely important for a product's
success.
When to evaluate?
The product being developed may be a brand-new product or an
upgrade of an existing
product. If the product is new, then considerable time is
usually invested in market
research. Designers often support this process by developing
mockups of the potential
product that are used to elicit reactions from potential users.
As well as helping to
assess market need, this activity contributes to understanding
users' needs and early
requirements. As we said in an earlier lecture, sketches, screen mockups, and other
low-fidelity prototyping techniques are used to represent design ideas. Many of
these same techniques are used to elicit users' opinions in evaluation
(e.g., questionnaires and
interviews), but the purpose and focus of evaluation are
different. The goal of evaluation
is to assess how well a design fulfills users' needs and whether
users like it.
In the case of an upgrade, there is limited scope for change and
attention is focused on
improving the overall product. This type of design is well
suited to usability
engineering in which evaluations compare user performance and
attitudes with those
for previous versions. Some products, such as office systems, go
through many
versions, and successful products may reach double-digit version
numbers. In
contrast, new products do not have previous versions and there
may be nothing
comparable on the market, so more radical changes are possible
if evaluation results
indicate a problem.
Evaluations done during design to check that the product
continues to meet users'
needs are known as formative evaluations. Evaluations that are
done to assess the
success of a finished product, such as those to satisfy a
sponsoring agency or to check
that a standard is being upheld, are known as summative
evaluations. Agencies such as the
National Institute of Standards and Technology (NIST) in the
USA, the International
Standards Organization (ISO) and the British Standards Institute
(BSI) set standards
by which products produced by others are evaluated.
29.1 Evaluation paradigms and techniques
Before we describe the techniques used in evaluation studies, we
shall start by
proposing some key terms. Terminology in this field tends to be
loose and often
confusing so it is a good idea to be clear from the start what
you mean. We start with
the much-used term user studies, defined by Abigail Sellen in
her interview as
follows: "user studies essentially involve looking at how people
behave either in their
natural [environments], or in the laboratory, both with old
technologies and with new
ones." Any kind of evaluation, whether it is a user study or
not, is guided either
explicitly or implicitly by a set of beliefs that may also be
underpinned by theory.
These beliefs and the practices (i.e., the methods or
techniques) associated with them
are known as an evaluation paradigm, which you should not confuse with the
"interaction paradigms." Often evaluation paradigms are related
to a particular
discipline in that they strongly influence how people from the
discipline think about
evaluation. Each paradigm has particular methods and techniques
associated with it.
So that you are not confused, we want to state explicitly that
we will not be
distinguishing between methods and techniques. We tend to talk
about techniques, but
you may find that others call them methods. An example of
the relationship
between a paradigm and the techniques used by evaluators
following that paradigm
can be seen for usability testing, which is an applied science
and engineering
paradigm. The techniques associated with usability testing are: user testing in a
user testing in a
controlled environment; observation of user activity in the
controlled environment and
the field; and questionnaires and interviews.
Evaluation paradigms
In this lecture we identify four core evaluation paradigms: (1)
“quick and dirty” evaluations;
(2) usability testing; (3) field studies; and (4) predictive
evaluation. Other
people may use slightly different terms to refer to similar
paradigms.
"Quick and dirty" evaluation
A "quick and dirty" evaluation is a common practice in which
designers informally
get feedback from users or consultants to confirm that their
ideas are in line with
users' needs and are liked. "Quick and dirty" evaluations can be
done at any stage and
the emphasis is on fast input rather than carefully documented
findings. For example,
early in design developers may meet informally with users to get
feedback on ideas
for a new product (Hughes et al., 1994). At later stages similar
meetings may occur to
try out an idea for an icon, check whether a graphic is liked,
or confirm that
information has been appropriately categorized on a webpage.
This approach is often
called "quick and dirty" because it is meant to be done in a
short space of time.
Getting this kind of feedback is an essential ingredient of
successful design.
As discussed in earlier lectures, any involvement with users
will be highly informative
and you can learn a lot early in design by observing what people
do and talking to
them informally. The data collected is usually descriptive and
informal and it is fed
back into the design process as verbal or written notes,
sketches and anecdotes, etc.
Another source comes from consultants, who use their knowledge
of user behavior,
the market place and technical know-how, to review software
quickly and provide
suggestions for improvement. It is an approach that has become
particularly popular
in web design where the emphasis is usually on short timescales.
Usability testing
Usability testing was the dominant approach in the 1980s
(Whiteside et al., 1988), and
remains important, although, as you will see, field studies and
heuristic evaluations
have grown in prominence. Usability testing involves measuring
typical users’
performance on carefully prepared tasks that are typical of
those for which the system
was designed. Users’ performance is generally measured in terms
of number of errors
and time to complete the task. As the users perform these tasks,
they are watched and
recorded on video and by logging their interactions with
software. This observational
data is used to calculate performance times, identify errors,
and help explain why the
users did what they did. User satisfaction questionnaires and
interviews are also used
to elicit users’ opinions.
The defining characteristic of usability testing is that it is
strongly controlled by the
evaluator (Mayhew, 1999). There is no mistaking that the
evaluator is in charge!
Typically tests take place in laboratory-like conditions that
are controlled. Casual
visitors are not allowed, telephone calls are stopped, and
there is no possibility of
talking to colleagues, checking email, or doing any of the other
tasks that most of us
rapidly switch among in our normal lives. Everything that the
participant does is
recorded—every key press, comment, pause, expression, etc., so
that it can be used as
data.
Quantifying users' performance is a dominant theme in usability
testing. However,
unlike research experiments, variables are not manipulated and
the typical number of
participants is too small for much statistical analysis. User
satisfaction data from
questionnaires tends to be categorized and average ratings are
presented. Sometimes
video or anecdotal evidence is also included to illustrate
problems that users
encounter. Some evaluators then summarize this data in a
usability specification so
that developers can use it to test future prototypes or versions
of the product against it.
Optimal performance levels and minimal levels of acceptance are
often specified and
current levels noted. Changes in the design can then be agreed
and engineered—hence
the term "usability engineering."
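To make this concrete, the short Python sketch below shows one way logged test data might be summarized and checked against a usability specification. It is only an illustration: the task measurements, threshold values, and variable names are invented, not taken from any real study.

# Hypothetical sketch: summarizing usability-test data against a usability specification.
# Task measurements and thresholds are invented for illustration only.
from statistics import mean, stdev

# Logged completion times (seconds) and error counts for one task, one value per participant.
completion_times = [182, 210, 167, 240, 195, 205, 178, 221]
error_counts = [2, 4, 1, 5, 3, 2, 1, 4]

# A simple usability specification: optimal target and minimal acceptable levels.
spec = {
    "optimal_time_s": 180,         # optimal performance level
    "max_acceptable_time_s": 230,  # minimal level of acceptance
    "max_acceptable_errors": 4,
}

avg_time, sd_time = mean(completion_times), stdev(completion_times)
avg_errors = mean(error_counts)

print(f"Mean time: {avg_time:.1f}s (sd {sd_time:.1f}s), mean errors: {avg_errors:.1f}")
print("Time within acceptable level:", avg_time <= spec["max_acceptable_time_s"])
print("Errors within acceptable level:", avg_errors <= spec["max_acceptable_errors"])

A summary like this can then serve as the benchmark against which future prototypes or versions are tested.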
Field studies
The distinguishing feature of field studies is that they are
done in natural settings with
the aim of increasing understanding about what users do
naturally and how
technology impacts them. In product design, field studies can be used to (1) help
identify opportunities for new technology; (2) determine requirements for design; (3)
facilitate the introduction of technology; and (4) evaluate technology (Bly, 1997).
We introduced qualitative techniques such as interviews,
observation, participant
observation, and ethnography that are used in field studies. The
exact choice of
techniques is often influenced by the theory used to analyze the
data. The data takes
the form of events and conversations that are recorded as notes,
or by audio or video
recording, and later analyzed using a variety of analysis
techniques such as content,
discourse, and conversational analysis. These techniques vary
considerably. In content
analysis, for example, the data is analyzed into content
categories, whereas in
discourse analysis the use of words and phrases is examined.
Artifacts are also
collected. In fact, anything that helps to show what people do
in their natural contexts
can be regarded as data.
In this lecture we distinguish between two overall approaches to
field studies. The
first involves observing explicitly and recording what is
happening, as an outsider
looking on. Qualitative techniques are used to collect the data,
which may then be
analyzed qualitatively or quantitatively. For example, the
number of times a particular
event is observed may be presented in a bar graph with means and
standard
deviations.
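As a purely illustrative example of this kind of quantitative analysis, the Python sketch below tallies coded observation events per session and reports a mean and standard deviation for each event type; the event codes and counts are invented.

# Hypothetical sketch: tallying coded field-observation events per session and
# summarizing them with means and standard deviations (the figures that might feed a bar graph).
from collections import Counter
from statistics import mean, stdev

# Each inner list is one observation session's coded events (codes invented for illustration).
sessions = [
    ["interruption", "help-request", "interruption", "workaround"],
    ["interruption", "workaround", "workaround"],
    ["help-request", "interruption", "interruption", "interruption"],
]

event_types = {code for session in sessions for code in session}
for code in sorted(event_types):
    per_session = [Counter(session)[code] for session in sessions]
    print(f"{code}: mean {mean(per_session):.2f}, sd {stdev(per_session):.2f} per session")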
In some field studies the evaluator may be an insider or even a
participant.
Ethnography is a particular type of insider evaluation in which
the aim is to explore
the details of what happens in a particular social setting. “In
the context of human
computer interaction, ethnography is a means of studying work
(or other activities) in
order to inform the design of information systems and understand
aspects of their use”
(Shapiro, 1995, p. 8).
Predictive evaluation
In predictive evaluations experts apply their knowledge of
typical users, often guided
by heuristics, to predict usability problems. Another approach
involves theoretically
based models. The key feature of predictive evaluation is that
users need not be present,
which makes the process quick, relatively inexpensive, and thus
attractive to
companies; but it has limitations.
In recent years heuristic evaluation in which experts review the
software product
guided by tried and tested heuristics has become popular
(Nielsen and Mack, 1994).
Usability guidelines (e.g., always provide clearly marked exits)
were designed
primarily for evaluating screen-based products (e.g. form
fill-ins, catalogs,
etc.). With the advent of a range of new interactive products
(e.g., the web, mobiles,
collaborative technologies), this original set of heuristics has
been found insufficient.
While some are still applicable (e.g., speak the users'
language), others are
inappropriate. New sets of heuristics are also needed that are
aimed at evaluating
different classes of interactive products. In particular,
specific heuristics are needed
that are tailored to evaluating web-based products, mobile
devices, collaborative
technologies, computerized toys, etc. These should be based on a
combination of
usability and user experience goals, new research findings and
market research. Care
is needed in using sets of heuristics. Designers are sometimes
led astray by findings
from heuristic evaluations that turn out not to be as accurate
as they at first seemed.
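The Python sketch below illustrates, in a simplified and hypothetical way, how findings from a heuristic evaluation might be recorded and ranked by severity; the heuristic names echo well-known guidelines, but the problems and the 0-4 severity scale are invented for illustration.

# Hypothetical sketch: recording problems found during a heuristic evaluation.
# The findings and the 0-4 severity scale are invented for illustration.
from dataclasses import dataclass

@dataclass
class Finding:
    heuristic: str   # which heuristic the problem violates
    problem: str     # short description of the usability problem
    severity: int    # 0 = not a problem ... 4 = usability catastrophe

findings = [
    Finding("Clearly marked exits", "No way to cancel the checkout wizard", 3),
    Finding("Speak the users' language", "Error message shows internal code E-4012", 2),
]

# Simple roll-up: worst problems first, so the team sees what to fix before shipping.
for f in sorted(findings, key=lambda f: f.severity, reverse=True):
    print(f"[severity {f.severity}] {f.heuristic}: {f.problem}")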
The summary below presents the key aspects of each evaluation paradigm for the
following issues:
. the role of users
. who controls the process and the relationship between evaluators and users during the evaluation
. the location of the evaluation
. when the evaluation is most useful
. the type of data collected and how it is analyzed
. how the evaluation findings are fed back into the design process
. the philosophy and theory that underlies the evaluation paradigms.
Role of users
. "Quick and dirty": natural behavior.
. Usability testing: to carry out set tasks.
. Field studies: natural behavior.
. Predictive: users generally not involved.

Who controls
. "Quick and dirty": evaluators take minimum control.
. Usability testing: evaluators strongly in control.
. Field studies: evaluators try to develop relationships with users.
. Predictive: expert evaluators.

Location
. "Quick and dirty": natural environment or laboratory.
. Usability testing: laboratory.
. Field studies: natural environment.
. Predictive: laboratory-oriented but often happens on customer's premises.

When used
. "Quick and dirty": any time you want to get feedback about a design quickly; techniques from other evaluation paradigms can be used, e.g., experts review software.
. Usability testing: with a prototype or product.
. Field studies: most often used early in design to check that users' needs are being met or to assess problems or design opportunities.
. Predictive: expert reviews (often done by consultants) with a prototype, but can occur at any time; models are used to assess specific aspects of a potential design.

Type of data
. "Quick and dirty": usually qualitative, informal descriptions.
. Usability testing: quantitative, sometimes statistically validated; users' opinions collected by questionnaire or interview.
. Field studies: qualitative descriptions often accompanied with sketches, scenarios, quotes, and other artifacts.
. Predictive: list of problems from expert reviews; quantitative figures from models, e.g., how long it takes to perform a task using two designs.

Fed back into design by
. "Quick and dirty": sketches, quotes, descriptive report.
. Usability testing: report of performance measures, errors, etc.; findings provide a benchmark for future versions.
. Field studies: descriptions that include quotes, sketches, anecdotes, and sometimes time logs.
. Predictive: reviewers provide a list of problems, often with suggested solutions; times calculated from models are given to designers.

Philosophy
. "Quick and dirty": user-centered, highly practical approach.
. Usability testing: applied approach based on experimentation, i.e., usability engineering.
. Field studies: may be objective observation or ethnographic.
. Predictive: practical heuristics and practitioner expertise underpin expert reviews; theory underpins models.
Techniques
There are many evaluation techniques and they can be categorized in various ways,
but in this lecture we will examine techniques for:
. observing users
. asking users their opinions
. asking experts their opinions
. testing users' performance
. modeling users' task performance to predict the efficacy of a user interface
The brief descriptions below offer an overview of each category.
Be aware that some
techniques are used in different ways in different evaluation
paradigms.
Observing users
Observation techniques help to identify needs leading to new
types of products and
help to evaluate prototypes. Notes, audio, video, and
interaction logs are well-known
ways of recording observations and each has benefits and
drawbacks. Obvious
challenges for evaluators are how to observe without disturbing
the people being
observed and how to analyze the data, particularly when large
quantities of video data
are collected or when several different types must be integrated
to tell the story (e.g.,
notes, pictures, sketches from observers).
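As a rough, hypothetical illustration of what an interaction log can look like, the Python sketch below writes timestamped observation events to a CSV file; the event names and fields are assumptions rather than any standard format.

# Hypothetical sketch: a minimal timestamped interaction log for an observation session.
# Event names, fields, and the output path are invented; real logging tools capture far more detail.
import csv
import time

LOG_PATH = "session_log.csv"  # assumed output file

def log_event(writer, participant: str, event: str, detail: str = "") -> None:
    """Append one timestamped observation to the log."""
    writer.writerow([time.time(), participant, event, detail])

with open(LOG_PATH, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "participant", "event", "detail"])
    log_event(writer, "P01", "key_press", "Ctrl+F")
    log_event(writer, "P01", "pause", "looks away from screen")
    log_event(writer, "P01", "comment", "I can't find the search box")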
Asking users
Asking users what they think of a product—whether it does what
they want; whether
they like it; whether the aesthetic design appeals; whether they
had problems using it;
whether they want to use it again—is an obvious way of getting
feedback. Interviews
and questionnaires are the main techniques for doing this. The
questions asked can be
unstructured or tightly structured. They can be asked of a few
people or of hundreds.
Interview and questionnaire techniques are also being developed
for use with email
and the web.
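For illustration only, the Python sketch below summarizes Likert-scale answers (1 = strongly disagree to 5 = strongly agree) to a single questionnaire item; the question wording and ratings are invented.

# Hypothetical sketch: summarizing Likert-scale responses to one questionnaire item.
from collections import Counter
from statistics import mean, median

question = "The product was easy to use."
ratings = [4, 5, 3, 4, 2, 5, 4, 4, 3, 5]

print(question)
print("Responses per rating:", dict(sorted(Counter(ratings).items())))
print(f"Median: {median(ratings)}, mean: {mean(ratings):.1f}")  # the median suits ordinal data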
Asking experts
Software inspections and reviews are long established techniques
for evaluating
software code and structure. During the 1980s versions of
similar techniques were
developed for evaluating usability. Guided by heuristics,
experts step through tasks
role-playing typical users and identify problems. Developers
like this approach because
it is usually relatively inexpensive and quick to perform
compared with laboratory
and field evaluations that involve users. In addition, experts
frequently suggest
solutions to problems.
User testing
Measuring user performance to compare two or more designs has
been the bedrock of
usability testing. As we said earlier when discussing usability
testing, these tests are
usually conducted in controlled settings and involve typical
users performing typical,
well-defined tasks. Data is collected so that performance can be
analyzed. Generally
the time taken to complete a task, the number of errors made,
and the navigation path
through the product are recorded. Descriptive statistical
measures such as means and
standard deviations are commonly used to report the results.
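For example, here is a minimal Python sketch of how such descriptive statistics might be reported when comparing two designs; the completion times are invented for illustration.

# Hypothetical sketch: descriptive statistics for comparing task-completion times (seconds)
# on two alternative designs. The measurements are invented.
from statistics import mean, stdev

times = {
    "Design A": [95, 110, 102, 88, 120, 99],
    "Design B": [130, 125, 140, 118, 150, 135],
}

for design, samples in times.items():
    print(f"{design}: n={len(samples)}, mean={mean(samples):.1f}s, sd={stdev(samples):.1f}s")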
Modeling users’ task performance
There have been various attempts to model human-computer
interaction so as to
predict the efficiency and problems associated with different
designs at an early stage
without building elaborate prototypes. These techniques are
successful for systems
with limited functionality such as telephone systems. GOMS and
the keystroke model
are the best known techniques.
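To give a flavor of this kind of prediction, the Python sketch below performs a keystroke-level-model style calculation; the operator times are the commonly quoted textbook approximations and the task breakdown is hypothetical.

# Hypothetical sketch in the spirit of the Keystroke-Level Model (KLM).
# Operator times are the commonly quoted textbook approximations (seconds);
# the task breakdown is invented for illustration.
OPERATOR_TIMES = {
    "K": 0.2,   # press a key or button (average skilled typist)
    "P": 1.1,   # point at a target with a mouse
    "H": 0.4,   # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def predict_time(operators: str) -> float:
    """Sum the operator times for a sequence such as 'MPKMPK'."""
    return sum(OPERATOR_TIMES[op] for op in operators)

# Hypothetical task: select a file with the mouse, then delete it from a menu.
sequence = "HMPK" + "MPK"   # home to mouse, think, point, click; think, point to menu item, click
print(f"Predicted task time: {predict_time(sequence):.2f} s")

In practice such predicted times are most useful for comparing alternative designs rather than as absolute figures.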