Hi Class,
Let’s take a look at what you learned in the course. Looking back on your educational journey gives you a chance to truly reflect on all you have learned and the knowledge you have gained. Ideally, you will put these concepts right into practice in your job or career!
This assignment asks you to reflect on the past 13 weeks and capture your key learning moments.
For each of weeks 1-13, write a three-paragraph summary that addresses the following:
1) Three things you learned from the week’s lesson.
2) Why each one is important.
3) How you would use them, or how a business would use them.
Include a title page, prepare the paper in APA format, and write in paragraph form. Use “Week 1,” “Week 2,” etc., as subtitles. After you present the first concept you learned, follow it with why it is important and how you would use it before going on to the second concept. Cover a minimum of three concepts per week. This assignment is worth 100 points.
TENTH EDITION
BUSINESS INTELLIGENCE
AND ANALYTICS:
SYSTEMS FOR DECISION SUPPORT
Ramesh Sharda
Oklahoma State University
Dursun Delen
Oklahoma State University
Efraim Turban
University of Hawaii
With contributions by
J. E. Aronson
The University of Georgia
Ting-Peng Liang
National Sun Yat-sen University
David King
JDA Software Group, Inc.
PEARSON
Boston Columbus Indianapolis New York San Francisco Upper Saddle River
Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto
Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo
Editor in Chief: Stephanie Wall
Executive Editor: Bob Horan
Program Manager Team Lead: Ashley Santora
Program Manager: Denise Vaughn
Executive Marketing Manager: Anne Fahlgren
Project Manager Team Lead: Judy Leale
Project Manager: Tom Benfatti
Operations Specialist: Michelle Klein
Creative Director: Jayne Conte
Cover Designer: Suzanne Behnke
Digital Production Project Manager: Lisa Rinaldi
Full-Service Project Management: George Jacob, Integra Software Solutions
Printer/Binder: Edwards Brothers Malloy-Jackson Road
Cover Printer: Lehigh/Phoenix-Hagerstown
Text Font: Garamond
Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook
appear on the appropriate page within text.
Microsoft and/or its respective suppliers make no representations about the suitability of the information
contained in the documents and related graphics published as part of the services for any purpose. All such
documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or its
respective suppliers hereby disclaim all warranties and conditions with regard to this information, including
all warranties and conditions of merchantability, whether express, implied or statutory, fitness for a particular
purpose, title and non-infringement. In no event shall Microsoft and/or its respective suppliers be liable for
any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or
profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection
with the use or performance of information available from the services.
The documents and related graphics contained herein could include technical inaccuracies or typographical
errors. Changes are periodically added to the information herein. Microsoft and/or its respective suppliers may
make improvements and/or changes in the product(s) and/or the program(s) described herein at any time.
Partial screen shots may be viewed in full within the software version specified.
Microsoft® Windows®, and Microsoft Office® are registered trademarks of the Microsoft Corporation in the U.S.A.
and other countries. This book is not sponsored or endorsed by or affiliated with the Microsoft Corporation.
Copyright © 2015, 2011, 2007 by Pearson Education, Inc., One Lake Street, Upper Saddle River,
New Jersey 07458. All rights reserved. Manufactured in the United States of America. This publication
is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited
reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic,
mechanical, photocopying, recording, or likewise. To obtain permission(s) to use material from this work,
please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street,
Upper Saddle River, New Jersey 07458, or you may fax your request to 201-236-3290.
Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks.
Where those designations appear in this book, and the publisher was aware of a trademark claim, the
designations have been printed in initial caps or all caps.
Library of Congress Cataloging-in-Publication Data
Turban, Efraim.
[Decision support and expert systems]
Business intelligence and analytics: systems for decision support/Ramesh Sharda, Oklahoma State University,
Dursun Delen, Oklahoma State University, Efraim Turban, University of Hawaii; with contributions
by J. E. Aronson, The University of Georgia, Ting-Peng Liang, National Sun Yat-sen University,
David King, JDA Software Group, Inc.-Tenth edition.
pages cm
ISBN-13: 978-0-13-305090-5
ISBN-10: 0-13-305090-4
1. Management-Data processing. 2. Decision support systems. 3. Expert systems (Computer science)
4. Business intelligence. I. Title.
HD30.2.T87 2014
658.4’038011-dc23 2013028826
10 9 8 7 6 5 4 3 2 1
PEARSON
ISBN 10: 0-13-305090-4
ISBN 13: 978-0-13-305090-5
BRIEF CONTENTS
Preface xxi
About the Authors xxix
PART I Decision Making and Analytics: An Overview 1
Chapter 1 An Overview of Business Intelligence, Analytics,
and Decision Support 2
Chapter 2 Foundations and Technologies for Decision Making 37
PART II Descriptive Analytics 77
Chapter 3 Data Warehousing 78
Chapter 4 Business Reporting, Visual Analytics, and Business
Performance Management 135
PART III Predictive Analytics 185
Chapter 5 Data Mining 186
Chapter 6 Techniques for Predictive Modeling 243
Chapter 7 Text Analytics, Text Mining, and Sentiment Analysis 288
Chapter 8 Web Analytics, Web Mining, and Social Analytics 338
PART IV Prescriptive Analytics 391
Chapter 9 Model-Based Decision Making: Optimization and Multi-
Criteria Systems 392
Chapter 10 Modeling and Analysis: Heuristic Search Methods and
Simulation 435
Chapter 11 Automated Decision Systems and Expert Systems 469
Chapter 12 Knowledge Management and Collaborative Systems 507
PART V Big Data and Future Directions for Business
Analytics 541
Chapter 13 Big Data and Analytics 542
Chapter 14 Business Analytics: Emerging Trends and Future
Impacts 592
Glossary 634
Index 648
CONTENTS
Preface xxi
About the Authors xxix
Part I Decision Making and Analytics: An Overview 1
Chapter 1 An Overview of Business Intelligence, Analytics, and
Decision Support 2
1.1 Opening Vignette: Magpie Sensing Employs Analytics to
Manage a Vaccine Supply Chain Effectively and Safely 3
1.2 Changing Business Environments and Computerized
Decision Support 5
The Business Pressures-Responses-Support Model 5
1.3 Managerial Decision Making 7
The Nature of Managers’ Work 7
The Decision-Making Process 8
1.4 Information Systems Support for Decision Making 9
1.5 An Early Framework for Computerized Decision
Support 11
The Gorry and Scott-Morton Classical Framework 11
Computer Support for Structured Decisions 12
Computer Support for Unstructured Decisions 13
Computer Support for Semistructured Problems 13
1.6 The Concept of Decision Support Systems (DSS) 13
DSS as an Umbrella Term 13
Evolution of DSS into Business Intelligence 14
1.7 A Framework for Business Intelligence (BI) 14
Definitions of BI 14
A Brief History of BI 14
The Architecture of BI 15
Styles of BI 15
The Origins and Drivers of BI 16
A Multimedia Exercise in Business Intelligence 16
► APPLICATION CASE 1.1 Sabre Helps Its Clients Through Dashboards
and Analytics 17
The DSS-BI Connection 18
1.8 Business Analytics Overview 19
Descriptive Analytics 20
► APPLICATION CASE 1.2 Eliminating Inefficiencies at Seattle
Children’s Hospital 21
► APPLICATION CASE 1.3 Analysis at the Speed of Thought 22
Predictive Analytics 22
► APPLICATION CASE 1.4 Moneyball: Analytics in Sports and Movies 23
► APPLICATION CASE 1.5 Analyzing Athletic Injuries 24
Prescriptive Analytics 24
► APPLICATION CASE 1.6 Industrial and Commercial Bank of China
(ICBC) Employs Models to Reconfigure Its Branch Network 25
Analytics Applied to Different Domains 26
Analytics or Data Science? 26
1.9 Brief Introduction to Big Data Analytics 27
What Is Big Data? 27
► APPLICATION CASE 1.7 Gilt Groupe’s Flash Sales Streamlined by Big
Data Analytics 29
1.10 Plan of the Book 29
Part I: Business Analytics: An Overview 29
Part II: Descriptive Analytics 30
Part Ill: Predictive Analytics 30
Part IV: Prescriptive Analytics 31
Part V: Big Data and Future Directions for Business Analytics 31
1.11 Resources, Links, and the Teradata University Network
Connection 31
Resources and Links 31
Vendors, Products, and Demos 31
Periodicals 31
The Teradata University Network Connection 32
The Book’s Web Site 32
Chapter Highlights 32 • Key Terms 33
Questions for Discussion 33 • Exercises 33
► END-OF-CHAPTER APPLICATION CASE Nationwide Insurance Used BI
to Enhance Customer Service 34
References 35
Chapter 2 Foundations and Technologies for Decision Making 37
2.1 Opening Vignette: Decision Modeling at HP Using
Spreadsheets 38
2.2 Decision Making: Introduction and Definitions 40
Characteristics of Decision Making 40
A Working Definition of Decision Making 41
Decision-Making Disciplines 41
Decision Style and Decision Makers 41
2.3 Phases of the Decision-Making Process 42
2.4 Decision Making: The Intelligence Phase 44
Problem (or Opportunity) Identification 45
► APPLICATION CASE 2.1 Making Elevators Go Faster! 45
Problem Classification 46
Problem Decomposition 46
Problem Ownership 46
2.5 Decision Making: The Design Phase 47
Models 47
Mathematical (Quantitative) Models 47
The Benefits of Models 47
Selection of a Principle of Choice 48
Normative Models 49
Suboptimization 49
Descriptive Models 50
Good Enough, or Satisficing 51
Developing (Generating) Alternatives 52
Measuring Outcomes 53
Risk 53
Scenarios 54
Possible Scenarios 54
Errors in Decision Making 54
2.6 Decision Making: The Choice Phase 55
2.7 Decision Making: The Implementation Phase 55
2.8 How Decisions Are Supported 56
Support for the Intelligence Phase 56
Support for the Design Phase 57
Support for the Choice Phase 58
Support for the Implementation Phase 58
2.9 Decision Support Systems: Capabilities 59
A DSS Application 59
2.10 DSS Classifications 61
The AIS SIGDSS Classification for DSS 61
Other DSS Categories 63
Custom-Made Systems Versus Ready-Made Systems 63
2.11 Components of Decision Support Systems 64
The Data Management Subsystem 65
The Model Management Subsystem 65
► APPLICATION CASE 2.2 Station Casinos Wins by Building Customer
Relationships Using Its Data 66
► APPLICATION CASE 2.3 SNAP DSS Helps OneNet Make
Telecommunications Rate Decisions 68
The User Interface Subsystem 68
The Knowledge-Based Management Subsystem 69
► APPLICATION CASE 2.4 From a Game Winner to a Doctor! 70
Chapter Highlights 72 • Key Terms 73
Questions for Discussion 73 • Exercises 74
► END-OF-CHAPTER APPLICATION CASE Logistics Optimization in a
Major Shipping Company (CSAV) 74
References 75
Part II Descriptive Analytics 77
Chapter 3 Data Warehousing 78
3.1 Opening Vignette: Isle of Capri Casinos Is Winning with
Enterprise Data Warehouse 79
3.2 Data Warehousing Definitions and Concepts 81
What Is a Data Warehouse? 81
A Historical Perspective to Data Warehousing 81
Characteristics of Data Warehousing 83
Data Marts 84
Operational Data Stores 84
Enterprise Data Warehouses (EDW) 85
Metadata 85
► APPLICATION CASE 3.1 A Better Data Plan: Well-Established TELCOs
Leverage Data Warehousing and Analytics to Stay on Top in a
Competitive Industry 85
3.3 Data Warehousing Process Overview 87
► APPLICATION CASE 3.2 Data Warehousing Helps MultiCare Save
More Lives 88
3.4 Data Warehousing Architectures 90
Alternative Data Warehousing Architectures 93
Which Architecture Is the Best? 96
3.5 Data Integration and the Extraction, Transformation, and
Load (ETL) Processes 97
Data Integration 98
► APPLICATION CASE 3.3 BP Lubricants Achieves BIGS Success 98
Extraction, Transformation, and Load 100
3.6 Data Warehouse Development 102
► APPLICATION CASE 3.4 Things Go Better with Coke’s Data
Warehouse 103
Data Warehouse Development Approaches 103
► APPLICATION CASE 3.5 Starwood Hotels & Resorts Manages Hotel
Profitability with Data Warehousing 106
Additional Data Warehouse Development Considerations 107
Representation of Data in Data Warehouse 108
Analysis of Data in the Data Warehouse 109
OLAP Versus OLTP 110
OLAP Operations 110
3.7 Data Warehousing Implementation Issues 113
► APPLICATION CASE 3.6 EDW Helps Connect State Agencies in
Michigan 115
Massive Data Warehouses and Scalability 116
3.8 Real-Time Data Warehousing 117
► APPLICATION CASE 3.7 Egg Plc Fries the Competition in Near Real
Time 118
3.9 Data Warehouse Administration, Security Issues, and Future
Trends 121
The Future of Data Warehousing 123
3.10 Resources, Links, and the Teradata University Network
Connection 126
Resources and Links 126
Cases 126
Vendors, Products, and Demos 127
Periodicals 127
Additional References 127
The Teradata University Network (TUN) Connection 127
Chapter Highlights 128 • Key Terms 128
Questions for Discussion 128 • Exercises 129
► END-OF-CHAPTER APPLICATION CASE Continental Airlines Flies High
with Its Real-Time Data Warehouse 131
References 132
Chapter 4 Business Reporting, Visual Analytics, and Business
Performance Management 135
4.1 Opening Vignette: Self-Service Reporting Environment
Saves Millions for Corporate Customers 136
4.2 Business Reporting Definitions and Concepts 139
What Is a Business Report? 140
► APPLICATION CASE 4.1 Delta Lloyd Group Ensures Accuracy and
Efficiency in Financial Reporting 141
Components of the Business Reporting System 143
► APPLICATION CASE 4.2 Flood of Paper Ends at FEMA 144
4.3 Data and Information Visualization 145
► APPLICATION CASE 4.3 Tableau Saves Blastrac Thousands of Dollars
with Simplified Information Sharing 146
A Brief History of Data Visualization 147
► APPLICATION CASE 4.4 TIBCO Spotfire Provides Dana-Farber Cancer
Institute with Unprecedented Insight into Cancer Vaccine Clinical
Trials 149
4.4 Different Types of Charts and Graphs 150
Basic Charts and Graphs 150
Specialized Charts and Graphs 151
4.5 The Emergence of Data Visualization and Visual
Analytics 154
Visual Analytics 156
High-Powered Visual Analytics Environments 158
4.6 Performance Dashboards 160
► APPLICATION CASE 4.5 Dallas Cowboys Score Big with Tableau and
Teknion 161
Dashboard Design 162
► APPLICATION CASE 4.6 Saudi Telecom Company Excels with
Information Visualization 163
What to Look For in a Dashboard 164
Best Practices in Dashboard Design 165
Benchmark Key Performance Indicators with Industry Standards 165
Wrap the Dashboard Metrics with Contextual Metadata 165
Validate the Dashboard Design by a Usability Specialist 165
Prioritize and Rank Alerts/Exceptions Streamed to the Dashboard 165
Enrich Dashboard with Business Users’ Comments 165
Present Information in Three Different Levels 166
Pick the Right Visual Construct Using Dashboard Design Principles 166
Provide for Guided Analytics 166
4.7 Business Performance Management 166
Closed-Loop BPM Cycle 167
► APPLICATION CASE 4.7 IBM Cognos Express Helps Mace for Faster
and Better Business Reporting 169
4.8 Performance Measurement 170
Key Performance Indicator (KPI) 171
Performance Measurement System 172
4.9 Balanced Scorecards 172
The Four Perspectives 173
The Meaning of Balance in BSC 174
Dashboards Versus Scorecards 174
4.10 Six Sigma as a Performance Measurement System 175
The DMAIC Performance Model 176
Balanced Scorecard Versus Six Sigma 176
Effective Performance Measurement 177
► APPLICATION CASE 4.8 Expedia.com’s Customer Satisfaction
Scorecard 178
Chapter Highlights 179 • Key Terms 180
Questions for Discussion 181 • Exercises 181
► END-OF-CHAPTER APPLICATION CASE Smart Business Reporting
Helps Healthcare Providers Deliver Better Care 182
References 184
Part III Predictive Analytics 185
Chapter 5 Data Mining 186
5.1 Opening Vignette: Cabela’s Reels in More Customers with
Advanced Analytics and Data Mining 187
5.2 Data Mining Concepts and Applications 189
► APPLICATION CASE 5.1 Smarter Insurance: Infinity P&C Improves
Customer Service and Combats Fraud with Predictive Analytics 191
Definitions, Characteristics, and Benefits 192
► APPLICATION CASE 5.2 Harnessing Analytics to Combat Crime:
Predictive Analytics Helps Memphis Police Department Pinpoint Crime
and Focus Police Resources 196
How Data Mining Works 197
Data Mining Versus Statistics 200
5.3 Data Mining Applications 201
► APPLICATION CASE 5.3 A Mine on Terrorist Funding 203
5.4 Data Mining Process 204
Step 1: Business Understanding 205
Step 2: Data Understanding 205
Step 3: Data Preparation 206
Step 4: Model Building 208
► APPLICATION CASE 5.4 Data Mining in Cancer Research 210
Step 5: Testing and Evaluation 211
Step 6: Deployment 211
Other Data Mining Standardized Processes and Methodologies 212
5.5 Data Mining Methods 214
Classification 214
Estimating the True Accuracy of Classification Models 215
Cluster Analysis for Data Mining 220
► APPLICATION CASE 5.5 2degrees Gets a 1275 Percent Boost in Churn
Identification 221
Association Rule Mining 224
5.6 Data Mining Software Tools 228
► APPLICATION CASE 5.6 Data Mining Goes to Hollywood: Predicting
Financial Success of Movies 231
5.7 Data Mining Privacy Issues, Myths, and Blunders 234
Data Mining and Privacy Issues 234
► APPLICATION CASE 5.7 Predicting Customer Buying Patterns-The
Target Story 235
Data Mining Myths and Blunders 236
Chapter Highlights 237 • Key Terms 238
Questions for Discussion 238 • Exercises 239
► END-OF-CHAPTER APPLICATION CASE Macys.com Enhances Its
Customers’ Shopping Experience with Analytics 241
References 241
Chapter 6 Techniques for Predictive Modeling 243
6.1 Opening Vignette: Predictive Modeling Helps Better
Understand and Manage Complex Medical
Procedures 244
6.2 Basic Concepts of Neural Networks 247
Biological and Artificial Neural Networks 248
► APPLICATION CASE 6.1 Neural Networks Are Helping to Save Lives in
the Mining Industry 250
Elements of ANN 251
Network Information Processing 252
Neural Network Architectures 254
► APPLICATION CASE 6.2 Predictive Modeling Is Powering the Power
Generators 256
6.3 Developing Neural Network-Based Systems 258
The General ANN Learning Process 259
Backpropagation 260
6.4 Illuminating the Black Box of ANN with Sensitivity
Analysis 262
► APPLICATION CASE 6.3 Sensitivity Analysis Reveals Injury Severity
Factors in Traffic Accidents 264
6.5 Support Vector Machines 265
► APPLICATION CASE 6.4 Managing Student Retention with Predictive
Modeling 266
Mathematical Formulation of SVMs 270
Primal Form 271
Dual Form 271
Soft Margin 271
Nonlinear Classification 272
Kernel Trick 272
6.6 A Process-Based Approach to the Use of SVM 273
Support Vector Machines Versus Artificial Neural Networks 274
6.7 Nearest Neighbor Method for Prediction 275
Similarity Measure: The Distance Metric 276
Parameter Selection 277
► APPLICATION CASE 6.5 Efficient Image Recognition and
Categorization with kNN 278
Chapter Highlights 280 • Key Terms 280
Questions for Discussion 281 • Exercises 281
► END-OF-CHAPTER APPLICATION CASE Coors Improves Beer Flavors
with Neural Networks 284
References 285
Chapter 7 Text Analytics, Text Mining, and Sentiment Analysis 288
7.1 Opening Vignette: Machine Versus Men on Jeopardy!: The
Story of Watson 289
7.2 Text Analytics and Text Mining Concepts and
Definitions 291
► APPLICATION CASE 7.1 Text Mining for Patent Analysis 295
7.3 Natural Language Processing 296
► APPLICATION CASE 7.2 Text Mining Improves Hong Kong
Government’s Ability to Anticipate and Address Public Complaints 298
7.4 Text Mining Applications 300
Marketing Applications 301
Security Applications 301
► APPLICATION CASE 7.3 Mining for Lies 302
Biomedical Applications 304
Academic Applications 305
► APPLICATION CASE 7.4 Text Mining and Sentiment Analysis Help
Improve Customer Service Performance 306
7.5 Text Mining Process 307
Task 1: Establish the Corpus 308
Task 2: Create the Term-Document Matrix 309
Task 3: Extract the Knowledge 312
► APPLICATION CASE 7.5 Research Literature Survey with Text
Mining 314
7.6 Text Mining Tools 317
Commercial Software Tools 317
Free Software Tools 317
► APPLICATION CASE 7.6 A Potpourri of Text Mining Case Synopses 318
7.7 Sentiment Analysis Overview 319
► APPLICATION CASE 7.7 Whirlpool Achieves Customer Loyalty and
Product Success with Text Analytics 321
7.8 Sentiment Analysis Applications 323
7.9 Sentiment Analysis Process 325
Methods for Polarity Identification 326
Using a Lexicon 327
Using a Collection of Training Documents 328
Identifying Semantic Orientation of Sentences and Phrases 328
Identifying Semantic Orientation of Document 328
7.10 Sentiment Analysis and Speech Analytics 329
How Is It Done? 329
► APPLICATION CASE 7.8 Cutting Through the Confusion: Blue Cross
Blue Shield of North Carolina Uses Nexidia’s Speech Analytics to Ease
Member Experience in Healthcare 331
Chapter Highlights 333 • Key Terms 333
Questions for Discussion 334 • Exercises 334
► END-OF-CHAPTER APPLICATION CASE BBVA Seamlessly Monitors
and Improves Its Online Reputation 335
References 336
Chapter 8 Web Analytics, Web Mining, and Social Analytics 338
8.1 Opening Vignette: Security First Insurance Deepens
Connection with Policyholders 339
8.2 Web Mining Overview 341
8.3 Web Content and Web Structure Mining 344
► APPLICATION CASE 8.1 Identifying Extremist Groups with Web Link
and Content Analysis 346
8.4 Search Engines 347
Anatomy of a Search Engine 347
1. Development Cycle 348
Web Crawler 348
Document Indexer 348
2. Response Cycle 349
Query Analyzer 349
Document Matcher/Ranker 349
How Does Google Do It? 351
► APPLICATION CASE 8.2 IGN Increases Search Traffic by 1500 Percent 353
8.5 Search Engine Optimization 354
Methods for Search Engine Optimization 355
► APPLICATION CASE 8.3 Understanding Why Customers Abandon
Shopping Carts Results in $10 Million Sales Increase 357
8.6 Web Usage Mining (Web Analytics) 358
Web Analytics Technologies 359
► APPLICATION CASE 8.4 Allegro Boosts Online Click-Through Rates by
500 Percent with Web Analysis 360
Web Analytics Metrics 362
Web Site Usability 362
Traffic Sources 363
Visitor Profiles 364
Conversion Statistics 364
8.7 Web Analytics Maturity Model and Web Analytics Tools 366
Web Analytics Tools 368
Putting It All Together-A Web Site Optimization Ecosystem 370
A Framework for Voice of the Customer Strategy 372
8.8 Social Analytics and Social Network Analysis 373
Social Network Analysis 374
Social Network Analysis Metrics 375
► APPLICATION CASE 8.5 Social Network Analysis Helps
Telecommunication Firms 375
Connections 376
Distributions 376
Segmentation 377
8.9 Social Media Definitions and Concepts 377
How Do People Use Social Media? 378
► APPLICATION CASE 8.6 Measuring the Impact of Social Media at
Lollapalooza 379
8.10 Social Media Analytics 380
Measuring the Social Media Impact 381
Best Practices in Social Media Analytics 381
► APPLICATION CASE 8.7 eHarmony Uses Social Media to Help Take the
Mystery Out of Online Dating 383
Social Media Analytics Tools and Vendors 384
Chapter Highlights 386 • Key Terms 387
Questions for Discussion 387 • Exercises 388
► END-OF-CHAPTER APPLICATION CASE Keeping Students on Track with
Web and Predictive Analytics 388
References 390
Part IV Prescriptive Analytics 391
Chapter 9 Model-Based Decision Making: Optimization and
Multi-Criteria Systems 392
9.1 Opening Vignette: Midwest ISO Saves Billions by Better
Planning of Power Plant Operations and Capacity
Planning 393
9.2 Decision Support Systems Modeling 394
► APPLICATION CASE 9.1 Optimal Transport for ExxonMobil
Downstream Through a DSS 395
Current Modeling Issues 396
► APPLICATION CASE 9.2 Forecasting/Predictive Analytics Proves to Be
a Good Gamble for Harrah’s Cherokee Casino and Hotel 397
9.3 Structure of Mathematical Models for Decision Support 399
The Components of Decision Support Mathematical Models 399
The Structure of Mathematical Models 401
9.4 Certainty, Uncertainty, and Risk 401
Decision Making Under Certainty 402
Decision Making Under Uncertainty 402
Decision Making Under Risk (Risk Analysis) 402
► APPLICATION CASE 9.3 American Airlines Uses
Should-Cost Modeling to Assess the Uncertainty of Bids
for Shipment Routes 403
9.5 Decision Modeling with Spreadsheets 404
► APPLICATION CASE 9.4 Showcase Scheduling at Fred Astaire East
Side Dance Studio 404
9.6 Mathematical Programming Optimization 407
► APPLICATION CASE 9.5 Spreadsheet Model Helps Assign Medical
Residents 407
Mathematical Programming 408
Linear Programming 408
Modeling in LP: An Example 409
Implementation 414
9.7 Multiple Goals, Sensitivity Analysis, What-If Analysis,
and Goal Seeking 416
Multiple Goals 416
Sensitivity Analysis 417
What-If Analysis 418
Goal Seeking 418
9.8 Decision Analysis with Decision Tables and Decision
Trees 420
Decision Tables 420
Decision Trees 422
9.9 Multi-Criteria Decision Making With Pairwise
Comparisons 423
The Analytic Hierarchy Process 423
► APPLICATION CASE 9.6 U.S. HUD Saves the House by Using
AHP for Selecting IT Projects 423
Tutorial on Applying Analytic Hierarchy Process Using Web-HIPRE 425
Chapter Highlights 429 • Key Terms 430
Questions for Discussion 430 • Exercises 430
► END-OF-CHAPTER APPLICATION CASE Pre-Positioning of Emergency
Items for CARE International 433
References 434
Chapter 10 Modeling and Analysis: Heuristic Search Methods and
Simulation 435
10.1 Opening Vignette: System Dynamics Allows Fluor
Corporation to Better Plan for Project and Change
Management 436
10.2 Problem-Solving Search Methods 437
Analytical Techniques 438
Algorithms 438
Blind Searching 439
Heuristic Searching 439
► APPLICATION CASE 10.1 Chilean Government Uses Heuristics to
Make Decisions on School Lunch Providers 439
10.3 Genetic Algorithms and Developing GA Applications 441
Example: The Vector Game 441
Terminology of Genetic Algorithms 443
How Do Genetic Algorithms Work? 443
Limitations of Genetic Algorithms 445
Genetic Algorithm Applications 445
10.4 Simulation 446
► APPLICATION CASE 10.2 Improving Maintenance Decision Making in
the Finnish Air Force Through Simulation 446
► APPLICATION CASE 10.3 Simulating Effects of Hepatitis B
Interventions 447
Major Characteristics of Simulation 448
Advantages of Simulation 449
Disadvantages of Simulation 450
The Methodology of Simulation 450
Simulation Types 451
Monte Carlo Simulation 452
Discrete Event Simulation 453
10.5 Visual Interactive Simulation 453
Conventional Simulation Inadequacies 453
Visual Interactive Simulation 453
Visual Interactive Models and DSS 454
► APPLICATION CASE 10.4 Improving Job-Shop Scheduling Decisions
Through RFID: A Simulation-Based Assessment 454
Simulation Software 457
10.6 System Dynamics Modeling 458
10.7 Agent-Based Modeling 461
► APPLICATION CASE 10.5 Agent-Based Simulation Helps Analyze
Spread of a Pandemic Outbreak 463
Chapter Highlights 464 • Key Terms 464
Questions for Discussion 465 • Exercises 465
► END-OF-CHAPTER APPLICATION CASE HP Applies Management
Science Modeling to Optimize Its Supply Chain and Wins a Major
Award 465
References 467
Chapter 11 Automated Decision Systems and Expert Systems 469
11.1 Opening Vignette: InterContinental Hotel Group Uses
Decision Rules for Optimal Hotel Room Rates 470
11.2 Automated Decision Systems 471
► APPLICATION CASE 11.1 Giant Food Stores Prices the Entire
Store 472
11.3 The Artificial Intelligence Field 475
11.4 Basic Concepts of Expert Systems 477
Experts 477
Expertise 478
Features of ES 478
► APPLICATION CASE 11.2 Expert System Helps in Identifying Sport
Talents 480
11.5 Applications of Expert Systems 480
► APPLICATION CASE 11.3 Expert System Aids in Identification of
Chemical, Biological, and Radiological Agents 481
Classical Applications of ES 481
Newer Applications of ES 482
Areas for ES Applications 483
11.6 Structure of Expert Systems 484
Knowledge Acquisition Subsystem 484
Knowledge Base 485
Inference Engine 485
User Interface 485
Blackboard (Workplace) 485
Explanation Subsystem (Justifier) 486
Knowledge-Refining System 486
► APPLICATION CASE 11.4 Diagnosing Heart Diseases by Signal
Processing 486
11.7 Knowledge Engineering 487
Knowledge Acquisition 488
Knowledge Verification and Validation 490
Knowledge Representation 490
Inferencing 491
Explanation and Justification 496
11.8 Problem Areas Suitable for Expert Systems 497
11.9 Development of Expert Systems 498
Defining the Nature and Scope of the Problem 499
Identifying Proper Experts 499
Acquiring Knowledge 499
Selecting the Building Tools 499
Coding the System 501
Evaluating the System 501
► APPLICATION CASE 11.5 Clinical Decision Support System for Tendon
Injuries 501
11.10 Concluding Remarks 502
Chapter Highlights 503 • Key Terms 503
Questions for Discussion 504 • Exercises 504
► END-OF-CHAPTER APPLICATION CASE Tax Collections Optimization
for New York State 504
References 505
Chapter 12 Knowledge Management and Collaborative Systems 507
12.1 Opening Vignette: Expertise Transfer System to Train
Future Army Personnel 508
12.2 Introduction to Knowledge Management 512
Knowledge Management Concepts and Definitions 513
Knowledge 513
Explicit and Tacit Knowledge 515
12.3 Approaches to Knowledge Management 516
The Process Approach to Knowledge Management 517
The Practice Approach to Knowledge Management 517
Hybrid Approaches to Knowledge Management 518
Knowledge Repositories 518
12.4 Information Technology (IT) in Knowledge
Management 520
The KMS Cycle 520
Components of KMS 521
Technologies That Support Knowledge Management 521
12.5 Making Decisions in Groups: Characteristics, Process,
Benefits, and Dysfunctions 523
Characteristics of Groupwork 523
The Group Decision-Making Process 524
The Benefits and Limitations of Groupwork 524
12.6 Supporting Groupwork with Computerized Systems 526
An Overview of Group Support Systems (GSS) 526
Groupware 527
Time/Place Framework 527
12.7 Tools for Indirect Support of Decision Making 528
Groupware Tools 528
Groupware 530
Collaborative Workflow 530
Web 2.0 530
Wikis 531
Collaborative Networks 531
12.8 Direct Computerized Support for Decision Making:
From Group Decision Support Systems to Group Support
Systems 532
Group Decision Support Systems (GDSS) 532
Group Support Systems 533
How GDSS (or GSS) Improve Groupwork 533
Facilities for GDSS 534
Chapter Highlights 535 • Key Terms 536
Questions for Discussion 536 • Exercises 536
► END-OF-CHAPTER APPLICATION CASE Solving Crimes by Sharing
Digital Forensic Knowledge 537
References 539
Part V Big Data and Future Directions for Business
Analytics 541
Chapter 13 Big Data and Analytics 542
13.1 Opening Vignette: Big Data Meets Big Science at CERN 543
13.2 Definition of Big Data 546
The Vs That Define Big Data 547
► APPLICATION CASE 13.1 Big Data Analytics Helps Luxottica Improve
Its Marketing Effectiveness 550
13.3 Fundamentals of Big Data Analytics 551
Business Problems Addressed by Big Data Analytics 554
► APPLICATION CASE 13.2 Top 5 Investment Bank Achieves Single
Source of Truth 555
13.4 Big Data Technologies 556
MapReduce 557
Why Use MapReduce? 558
Hadoop 558
How Does Hadoop Work? 558
Hadoop Technical Components 559
Hadoop: The Pros and Cons 560
NoSQL 562
► APPLICATION CASE 13.3 eBay’s Big Data Solution 563
13.5 Data Scientist 565
Where Do Data Scientists Come From? 565
► APPLICATION CASE 13.4 Big Data and Analytics in Politics 568
13.6 Big Data and Data Warehousing 569
Use Case(s) for Hadoop 570
Use Case(s) for Data Warehousing 571
The Gray Areas (Any One of the Two Would Do the Job) 572
Coexistence of Hadoop and Data Warehouse 572
13.7 Big Data Vendors 574
► APPLICATION CASE 13.5 Dublin City Council Is Leveraging Big Data
to Reduce Traffic Congestion 575
► APPLICATION CASE 13.6 Creditreform Boosts Credit Rating Quality
with Big Data Visual Analytics 580
13.8 Big Data and Stream Analytics 581
Stream Analytics Versus Perpetual Analytics 582
Critical Event Processing 582
Data Stream Mining 583
13.9 Applications of Stream Analytics 584
e-commerce 584
Telecommunications 584
► APPLICATION CASE 13.7 Turning Machine-Generated Streaming Data
into Valuable Business Insights 585
Law Enforcement and Cyber Security 586
Power Industry 587
Financial Services 587
Health Sciences 587
Government 587
Chapter Highlights 588 • Key Terms 588
Questions for Discussion 588 • Exercises 589
► END-OF-CHAPTER APPLICATION CASE Discovery Health Turns Big
Data into Better Healthcare 589
References 591
Chapter 14 Business Analytics: Emerging Trends and Future
Impacts 592
14.1 Opening Vignette: Oklahoma Gas and Electric Employs
Analytics to Promote Smart Energy Use 593
14.2 Location-Based Analytics for Organizations 594
Geospatial Analytics 594
► APPLICATION CASE 14.1 Great Clips Employs Spatial Analytics to
Shave Time in Location Decisions 596
A Multimedia Exercise in Analytics Employing Geospatial Analytics 597
Real-Time Location Intelligence 598
► APPLICATION CASE 14.2 Quiznos Targets Customers for Its
Sandwiches 599
14.3 Analytics Applications for Consumers 600
► APPLICATION CASE 14.3 A Life Coach in Your Pocket 601
14.4 Recommendation Engines 603
14.5 Web 2.0 and Online Social Networking 604
Representative Characteristics of Web 2.0 605
Social Networking 605
A Definition and Basic Information 606
Implications of Business and Enterprise Social Networks 606
14.6 Cloud Computing and BI 607
Service-Oriented DSS 608
Data-as-a-Service (DaaS) 608
Information-as-a-Service (Information on Demand) (IaaS) 611
Analytics-as-a-Service (AaaS) 611
14.7 Impacts of Analytics in Organizations: An Overview 613
New Organizational Units 613
Restructuring Business Processes and Virtual Teams 614
The Impacts of ADS Systems 614
Job Satisfaction 614
Job Stress and Anxiety 614
Analytics’ Impact on Managers’ Activities and Their Performance 615
14.8 Issues of Legality, Privacy, and Ethics 616
Legal Issues 616
Privacy 617
Recent Technology Issues in Privacy and Analytics 618
Ethics in Decision Making and Support 619
14.9 An Overview of the Analytics Ecosystem 620
Analytics Industry Clusters 620
Data Infrastructure Providers 620
Data Warehouse Industry 621
Middleware Industry 622
Data Aggregators/Distributors 622
Analytics-Focused Software Developers 622
Reporting/Analytics 622
Predictive Analytics 623
Prescriptive Analytics 623
Application Developers or System Integrators: Industry Specific or General 624
Analytics User Organizations 625
Analytics Industry Analysts and Influencers 627
Academic Providers and Certification Agencies 628
Chapter Highlights 629 • Key Terms 629
Questions for Discussion 629 • Exercises 630
► END-OF-CHAPTER APPLICATION CASE Southern States Cooperative
Optimizes Its Catalog Campaign 630
References 632
Glossary 634
Index 648
PREFACE
Analytics has become the technology driver of this decade. Companies such as IBM,
Oracle, Microsoft, and others are creating new organizational units focused on analytics
that help businesses become more effective and efficient in their operations. Decision
makers are using more computerized tools to support their work. Even consumers are
using analytics tools directly or indirectly to make decisions on routine activities such as
shopping, healthcare, and entertainment. The field of decision support systems (DSS)/
business intelligence (BI) is evolving rapidly to become more focused on innovative appli-
cations of data streams that were not even captured some time back, much less analyzed
in any significant way. New applications turn up daily in healthcare, sports, entertain-
ment, supply chain management, utilities, and virtually every industry imaginable.
The theme of this revised edition is BI and analytics for enterprise decision support.
In addition to traditional decision support applications, this edition expands the reader’s
understanding of the various types of analytics by providing examples, products, services,
and exercises, and by discussing Web-related issues throughout the text. We highlight Web
intelligence/Web analytics, which parallel BI/business analytics (BA) for e-commerce and
other Web applications. The book is supported by a Web site (pearsonhighered.com/
sharda) and also by an independent site at dssbibook.com. We will also provide links
to software tutorials through a special section of the Web site.
The purpose of this book is to introduce the reader to these technologies that are
generally called analytics but have been known by other names. The core technology
consists of DSS, BI, and various decision-making techniques. We use these terms inter-
changeably. This book presents the fundamentals of the techniques and the manner in
which these systems are constructed and used. We follow an EEE approach to introduc-
ing these topics: Exposure, Experience, and Explore. The book primarily provides
exposure to various analytics techniques and their applications. The idea is that a student
will be inspired to learn from how other organizations have employed analytics to make
decisions or to gain a competitive edge. We believe that such exposure to what is being
done with analytics and how it can be achieved is the key component of learning about
analytics. In describing the techniques, we also introduce specific software tools that can
be used for developing such applications. The book is not limited to any one software
tool, so the students can experience these techniques using any number of available
software tools. Specific suggestions are given in each chapter, but the student and the
professor are able to use this book with many different software tools. Our book’s com-
panion Web site will include specific software guides, but students can gain experience
with these techniques in many different ways. Finally, we hope that this exposure and
experience enable and motivate readers to explore the potential of these techniques in
their own domain. To facilitate such exploration, we include exercises that direct readers
to the Teradata University Network and other sites, as well as team-oriented exercises
where appropriate. We will also highlight new and innovative applications that we
learn about on the book’s companion Web sites.
Most of the specific improvements made in this tenth edition concentrate on three
areas: reorganization, content update, and a sharper focus. Despite the many changes, we
have preserved the comprehensiveness and user friendliness that have made the text a
market leader. We have also reduced the book’s size by eliminating older and redundant
material and by combining material that was not used by a majority of professors. At the
same time, we have kept several of the classical references intact. Finally, we present
accurate and updated material that is not available in any other text. We next describe the
changes in the tenth edition.
WHAT’S NEW IN THE TENTH EDITION?
With the goal of improving the text, this edition marks a major reorganization of the text
to reflect the focus on analytics. The last two editions transformed the book from the
traditional DSS to BI and fostered a tight linkage with the Teradata University Network
(TUN). This edition is now organized around three major types of analytics. The new
edition has many timely additions, and the dated content has been deleted. The following
major specific changes have been made:
• New organization. The book is now organized around three types of analytics:
descriptive, predictive, and prescriptive, a classification promoted by INFORMS. After
introducing the topics of DSS/BI and analytics in Chapter 1 and covering the founda-
tions of decision making and decision support in Chapter 2, the book begins with an
overview of data warehousing and data foundations in Chapter 3. This part then cov-
ers descriptive or reporting analytics, specifically, visualization and business perfor-
mance measurement. Chapters 5-8 cover predictive analytics. Chapters 9-12 cover
prescriptive and decision analytics as well as other decision support systems topics.
Some of the coverage from Chapters 3-4 in previous editions will now be found in
the new Chapters 9 and 10. Chapter 11 covers expert systems as well as the new
rule-based systems that are commonly built for implementing analytics. Chapter 12
combines two topics that were key chapters in earlier editions-knowledge manage-
ment and collaborative systems. Chapter 13 is a new chapter that introduces big data
and analytics. Chapter 14 concludes the book with discussion of emerging trends
and topics in business analytics, including location intelligence, mobile computing,
cloud-based analytics, and privacy/ethical considerations in analytics. This chapter
also includes an overview of the analytics ecosystem to help the user explore all of
the different ways one can participate and grow in the analytics environment. Thus,
the book marks a significant departure from the earlier editions in organization. Of
course, it is still possible to teach a course with a traditional DSS focus with this book
by covering Chapters 1-4, Chapters 9-12, and possibly Chapter 14.
• New chapters. The following chapters have been added:
Chapter 8, “Web Analytics, Web Mining, and Social Analytics.” This chapter
covers the popular topics of Web analytics and social media analytics. It is an
almost entirely new chapter (95% new material).
Chapter 13, “Big Data and Analytics.” This chapter introduces the hot topics of
Big Data and analytics. It covers the basic components and characteristics of Big Data
techniques. It is also a new chapter (99% new material).
Chapter 14, “Business Analytics: Emerging Trends and Future Impacts.”
This chapter examines several new phenomena that are already changing or are
likely to change analytics. It includes coverage of geospatial analytics, location-
based analytics applications, consumer-oriented analytical applications, mobile plat-
forms, and cloud-based analytics. It also updates some coverage from the previous
edition on ethical and privacy considerations. It concludes with a major discussion
of the analytics ecosystem (90% new material).
• Streamlined coverage. We have made the book shorter by keeping the most
commonly used content. We also mostly eliminated the preformatted online con-
tent. Instead, we will use a Web site to provide updated content and links on a
regular basis. We also reduced the number of references in each chapter.
• Revamped author team. Building upon the excellent content that has been
prepared by the authors of the previous editions (Turban, Aronson, Liang, King,
Sharda, and Delen), this edition was revised by Ramesh Sharda and Dursun Delen.
Both Ramesh and Dursun have worked extensively in DSS and analytics and have
industry as well as research experience.
• A live-update Web site. Adopters of the textbook will have access to a Web site that
will include links to news stories, software, tutorials, and even YouTube videos related
to topics covered in the book. This site will be accessible at http://dssbibook.com.
• Revised and updated content. Almost all of the chapters have new opening
vignettes and closing cases that are based on recent stories and events. In addition,
application cases throughout the book have been updated to include recent exam-
ples of applications of a specific technique/model. These application case stories
now include suggested questions for discussion to encourage class discussion as
well as further exploration of the specific case and related materials. New Web site
links have been added throughout the book. We also deleted many older product
links and references. Finally, most chapters have new exercises, Internet assign-
ments, and discussion questions throughout.
Specific changes made in chapters that have been retained from the previous edi-
tions are summarized next:
Chapter 1, “An Overview of Business Intelligence, Analytics, and Decision
Support,” introduces the three types of analytics as proposed by INFORMS: descriptive,
predictive, and prescriptive analytics. As noted earlier, this classification is used in guiding
the complete reorganization of the book itself. It includes about 50 percent new material.
All of the case stories are new.
Chapter 2, “Foundations and Technologies for Decision Making,” combines mate-
rial from earlier Chapters 1, 2, and 3 to provide a basic foundation for decision making in
general and computer-supported decision making in particular. It eliminates some dupli-
cation that was present in Chapters 1-3 of the previous editions. It includes 35 percent
new material. Most of the cases are new.
Chapter 3, “Data Warehousing”
• 30 percent new material, including the cases
• New opening case
• Mostly new cases throughout
• NEW: A historic perspective to data warehousing-how did we get here?
• Better coverage of multidimensional modeling (star schema and snowflake schema)
• An updated coverage on the future of data warehousing
Chapter 4, “Business Reporting, Visual Analytics, and Business Performance
Management”
• 60 percent of the material is new-especially in visual analytics and reporting
• Most of the cases are new
Chapter 5, “Data Mining”
• 25 percent of the material is new
• Most of the cases are new
Chapter 6, “Techniques for Predictive Modeling”
• 55 percent of the material is new
• Most of the cases are new
• New sections on SVM and kNN
Chapter 7, “Text Analytics, Text Mining, and Sentiment Analysis”
• 50 percent of the material is new
• Most of the cases are new
• New section (1/3 of the chapter) on sentiment analysis
Chapter 8, “Web Analytics, Web Mining, and Social Analytics” (New Chapter)
• 95 percent of the material is new
Chapter 9, “Model-Based Decision Making: Optimization and Multi-Criteria Systems”
• All new cases
• Expanded coverage of analytic hierarchy process
• New examples of mixed-integer programming applications and exercises
• About 50 percent new material
In addition, all the Microsoft Excel-related coverage has been updated to work with
Microsoft Excel 2010.
Chapter 10, “Modeling and Analysis: Heuristic Search Methods and Simulation”
• This chapter now introduces genetic algorithms and various types of simulation
models
• It includes new coverage of other types of simulation modeling such as agent-based
modeling and system dynamics modeling
• New cases throughout
• About 60 percent new material
Chapter 11, “Automated Decision Systems and Expert Systems”
• Expanded coverage of automated decision systems including examples from the
airline industry
• New examples of expert systems
• New cases
• About 50 percent new material
Chapter 12, “Knowledge Management and Collaborative Systems”
• Significantly condensed coverage of these two topics combined into one chapter
• New examples of KM applications
• About 25 percent new material
Chapters 13 and 14 are mostly new chapters, as described earlier.
We have retained many of the enhancements made in the last editions and updated
the content. These are summarized next:
• Links to Teradata University Network (TUN). Most chapters include new links
to TUN (teradatauniversitynetwork.com). We encourage the instructors to regis-
ter and join teradatauniversitynetwork.com and explore various content available
through the site. The cases, white papers, and software exercises available through
TUN will keep your class fresh and timely.
• Book title. As is already evident, the book’s title and focus have changed
substantially.
• Software support. The TUN Web site provides software support at no charge.
It also provides links to free data mining and other software. In addition, the site
provides exercises in the use of such software.
THE SUPPLEMENT PACKAGE: PEARSONHIGHERED.COM/SHARDA
A comprehensive and flexible technology-support package is available to enhance the
teaching and learning experience. The following instructor and student supplements are
available on the book’s Web site, pearsonhighered.com/sharda:
• Instructor’s Manual. The Instructor’s Manual includes learning objectives for the
entire course and for each chapter, answers to the questions and exercises at the end
of each chapter, and teaching suggestions (including instructions for projects). The
Instructor’s Manual is available on the secure faculty section of pearsonhighered
.com/sharda.
• Test Item File and TestGen Software. The Test Item File is a comprehensive
collection of true/false, multiple-choice, fill-in-the-blank, and essay questions. The
questions are rated by difficulty level, and the answers are referenced by book page
number. The Test Item File is available in Microsoft Word and in TestGen. Pearson
Education’s test-generating software is available from www.pearsonhighered.com/irc.
The software is PC/MAC compatible and preloaded with all of the Test
Item File questions. You can manually or randomly view test questions and drag-
and-drop to create a test. You can add or modify test-bank questions as needed. Our
TestGens are converted for use in Blackboard, WebCT, Moodle, D2L, and Angel.
These conversions can be found on pearsonhighered.com/sharda. The TestGen
is also available in Respondus and can be found on www.respondus.com.
• PowerPoint slides. PowerPoint slides are available that illuminate and build
on key concepts in the text. Faculty can download the PowerPoint slides from
pearsonhighered.com/sharda.
ACKNOWLEDGMENTS
Many individuals have provided suggestions and criticisms since the publication of the
first edition of this book. Dozens of students participated in class testing of various chap-
ters, software, and problems and assisted in collecting material. It is not possible to name
everyone who participated in this project, but our thanks go to all of them. Certain indi-
viduals made significant contributions, and they deserve special recognition.
First, we appreciate the efforts of those individuals who provided formal reviews of
the first through tenth editions (school affiliations as of the date of review):
Robert Blanning, Vanderbilt University
Ranjit Bose, University of New Mexico
Warren Briggs, Suffolk University
Lee Roy Bronner, Morgan State University
Charles Butler, Colorado State University
Sohail S. Chaudry, University of Wisconsin-La Crosse
Kathy Chudoba, Florida State University
Wingyan Chung, University of Texas
Woo Young Chung, University of Memphis
Paul “Buddy” Clark, South Carolina State University
Pi’Sheng Deng, California State University-Stanislaus
Joyce Elam, Florida International University
Kurt Engemann, Iona College
Gary Farrar, Jacksonville University
George Federman, Santa Clara City College
Jerry Fjermestad, New Jersey Institute of Technology
Joey George, Florida State University
Paul Gray, Claremont Graduate School
Orv Greynholds, Capital College (Laurel, Maryland)
Martin Grossman, Bridgewater State College
Ray Jacobs, Ashland University
Leonard Jessup, Indiana University
Jeffrey Johnson, Utah State University
Jahangir Karimi, University of Colorado Denver
Saul Kassicieh, University of New Mexico
Anand S. Kunnathur, University of Toledo
Shao-ju Lee, California State University at Northridge
Yair Levy, Nova Southeastern University
Hank Lucas, New York University
Jane Mackay, Texas Christian University
George M. Marakas, University of Maryland
Dick Mason, Southern Methodist University
Nick McGaughey, San Jose State University
Ido Millet, Pennsylvania State University-Erie
Benjamin Mittman, Northwestern University
Larry Moore, Virginia Polytechnic Institute and State University
Simitra Mukherjee, Nova Southeastern University
Marianne Murphy, Northeastern University
Peter Mykytyn, Southern Illinois University
Natalie Nazarenko, SUNY College at Fredonia
Souren Paul, Southern Illinois University
Joshua Pauli, Dakota State University
Roger Alan Pick, University of Missouri-St. Louis
W. “RP” Raghupathi, California State University-Chico
Loren Rees, Virginia Polytechnic Institute and State University
David Russell, Western New England College
Steve Ruth, George Mason University
Vartan Safarian, Winona State University
Glenn Shephard, San Jose State University
Jung P. Shim, Mississippi State University
Meenu Singh, Murray State University
Randy Smith, University of Virginia
James T.C. Teng, University of South Carolina
John VanGigch, California State University at Sacramento
David Van Over, University of Idaho
Paul J.A. van Vliet, University of Nebraska at Omaha
B. S. Vijayaraman, University of Akron
Howard Charles Walton, Gettysburg College
Diane B. Walz, University of Texas at San Antonio
Paul R. Watkins, University of Southern California
Randy S. Weinberg, Saint Cloud State University
Jennifer Williams, University of Southern Indiana
Steve Zanakis, Florida International University
Fan Zhao, Florida Gulf Coast University
Several individuals contributed material to the text or the supporting material.
Susan Baxley and Dr. David Schrader of Teradata provided special help in identifying
new TUN content for the book and arranging permissions for the same. Peter Horner,
editor of OR/MS Today, allowed us to summarize new application stories from OR/
MS Today and Analytics Magazine. We also thank INFORMS for their permission to
highlight content from Interfaces. Prof. Rick Wilson contributed some examples and
exercise questions for Chapter 9. Assistance from Natraj Ponna, Daniel Asamoah, Amir
Hassan-Zadeh, Kartik Dasika, Clara Gregory, and Amy Wallace (all of Oklahoma State
University) is gratefully acknowledged for this edition. We also acknowledge Narges
Kasiri (Ithaca College) for the write-up on system dynamics modeling and Jongswas
Chongwatpol (NIDA, Thailand) for the material on SIMIO software. For the previous edi-
tion, we acknowledge the contributions of Dave King (JDA Software Group, Inc.) and
Jerry Wagner (University of Nebraska-Omaha). Major contributors for earlier editions
include Mike Goul (Arizona State University) and Leila A. Halawi (Bethune-Cookman
College), who provided material for the chapter on data warehousing; Christy Cheung
(Hong Kong Baptist University), who contributed to the chapter on knowledge man-
agement; Linda Lai (Macau Polytechnic University of China); Dave King (JDA Software
Group, Inc.); Lou Frenzel, an independent consultant whose books Crash Course in
Artificial Intelligence and Expert Systems and Understanding of Expert Systems (both
published by Howard W. Sams, New York, 1987) provided material for the early edi-
tions; Larry Medsker (American University), who contributed substantial material on neu-
ral networks; and Richard V. McCarthy (Quinnipiac University), who performed major
revisions in the seventh edition.
Previous editions of the book have also benefited greatly from the efforts of many
individuals who contributed advice and interesting material (such as problems), gave
feedback on material, or helped with class testing. These individuals are Warren Briggs
(Suffolk University), Frank DeBalough (University of Southern California), Mei-Ting
Cheung (University of Hong Kong), Alan Dennis (Indiana University), George Easton
(San Diego State University), Janet Fisher (California State University, Los Angeles),
David Friend (Pilot Software, Inc.), the late Paul Gray (Claremont Graduate School),
Mike Henry (OSU), Dustin Huntington (Exsys, Inc.), Subramanian Rama Iyer (Oklahoma
State University), Angie Jungermann (Oklahoma State University), Elena Karahanna
(The University of Georgia), Mike McAulliffe (The University of Georgia), Chad Peterson
(The University of Georgia), Neil Rabjohn (York University), Jim Ragusa (University of
Central Florida), Alan Rowe (University of Southern California), Steve Ruth (George
Mason University), Linus Schrage (University of Chicago), Antonie Stam (University of
Missouri), Ron Swift (NCR Corp.), Merril Warkentin (then at Northeastern University),
Paul Watkins (The University of Southern California), Ben Mortagy (Claremont Graduate
School of Management), Dan Walsh (Bellcore), Richard Watson (The University of
Georgia), and the many other instructors and students who have provided feedback.
Several vendors cooperated by providing development and/or demonstra-
tion software: Expert Choice, Inc. (Pittsburgh, Pennsylvania), Nancy Clark of Exsys,
Inc. (Albuquerque, New Mexico), Jim Godsey of GroupSystems, Inc. (Broomfield,
Colorado), Raimo Hamalainen of Helsinki University of Technology, Gregory Piatetsky-
Shapiro of KDnuggets.com, Logic Programming Associates (UK), Gary Lynn of
NeuroDimension Inc. (Gainesville, Florida), Palisade Software (Newfield, New York),
Jerry Wagner of Planners Lab (Omaha, Nebraska), Promised Land Technologies (New
Haven, Connecticut), Salford Systems (La Jolla, California), Sense Networks (New York,
New York), Gary Miner of StatSoft, Inc. (Tulsa, Oklahoma), Ward Systems Group, Inc.
(Frederick, Maryland), Idea Fisher Systems, Inc. (Irving, California), and Wordtech
Systems (Orinda, California).
Special thanks to the Teradata University Network and especially to Hugh Watson,
Michael Goul, and Susan Baxley, Program Director, for their encouragement to tie this
book with TUN and for providing useful material for the book.
Many individuals helped us with administrative matters and editing, proofreading,
and preparation. The project began with Jack Repcheck (a former Macmillan editor), who
initiated this project with the support of Hank Lucas (New York University). Judy Lang
collaborated with all of us, provided editing, and guided us during the entire project
through the eighth edition.
Finally, the Pearson team is to be commended: Executive Editor Bob Horan, who
orchestrated this project; Kitty Jarrett, who copyedited the manuscript; and the produc-
tion team, Tom Benfatti at Pearson, George and staff at Integra Software Services, who
transformed the manuscript into a book.
We would like to thank all these individuals and corporations. Without their help,
the creation of this book would not have been possible. Ramesh and Dursun want to
specifically acknowledge the contributions of previous coauthors Janine Aronson, David
King, and T. P. Liang, whose original contributions constitute significant components of
the book.
R.S.
D.D.
E.T.
Note that Web site URLs are dynamic. As this book went to press, we verified that all the cited Web sites were
active and valid. Web sites to which we refer in the text sometimes change or are discontinued because
companies change names, are bought or sold, merge, or fail. Sometimes Web sites are down for maintenance, repair,
or redesign. Most organizations have dropped the initial “www” designation for their sites, but some still use
it. If you have a problem connecting to a Web site that we mention, please be patient and simply run a Web
search to try to identify the new site. Most times, the new site can be found quickly. Some sites also require a
free registration before allowing you to see the content. We apologize in advance for this inconvenience.
ABOUT THE AUTHORS
Ramesh Sharda (M.B.A., Ph.D., University of Wisconsin-Madison) is director of the
Ph.D. in Business for Executives Program and Institute for Research in Information
Systems (IRIS), ConocoPhillips Chair of Management of Technology, and a Regents
Professor of Management Science and Information Systems in the Spears School of
Business at Oklahoma State University (OSU). About 200 papers describing his research
have been published in major journals, including Operations Research, Management
Science, Information Systems Research, Decision Support Systems, and Journal of MIS.
He cofounded the AIS SIG on Decision Support Systems and Knowledge Management
(SIGDSS). Dr. Sharda serves on several editorial boards, including those of INFORMS
Journal on Computing, Decision Support Systems, and ACM Transactions on Management
Information Systems. He has authored and edited several textbooks and research books
and serves as the co-editor of several book series (Integrated Series in Information
Systems, Operations Research/Computer Science Interfaces, and Annals of Information
Systems) with Springer. He is also currently serving as the executive director of the
Teradata University Network. His current research interests are in decision support
systems, business analytics, and technologies for managing information overload.
Dursun Delen (Ph.D., Oklahoma State University) holds the Spears and Patterson Chairs in
Business Analytics, is Director of Research for the Center for Health Systems Innovation,
and is Professor of Management Science and Information Systems in the Spears School of
Business at Oklahoma State University (OSU). Prior to his academic career, he worked
for a privately owned research and consultancy company, Knowledge Based Systems
Inc., in College Station, Texas, as a research scientist for five years, during which he led
a number of decision support and other information systems-related research projects
funded by federal agencies such as DoD, NASA, NIST, and DOE. Dr. Delen's research has
appeared in major journals including Decision Support Systems, Communications of the
ACM, Computers and Operations Research, Computers in Industry, Journal of Production
Operations Management, Artificial Intelligence in Medicine, and Expert Systems with
Applications, among others. He recently published four textbooks: Advanced Data Mining
Techniques with Springer, 2008; Decision Support and Business Intelligence Systems with
Prentice Hall, 2010; Business Intelligence: A Managerial Approach, with Prentice Hall,
2010; and Practical Text Mining, with Elsevier, 2012. He is often invited to national and
international conferences for keynote addresses on topics related to data/text mining,
business intelligence, decision support systems, and knowledge management. He served
as the general co-chair for the 4th International Conference on Network Computing and
Advanced Information Management (September 2-4, 2008, in Seoul, South Korea) and
regularly chairs tracks and mini-tracks at various information systems conferences. He is
the associate editor-in-chief for International Journal of Experimental Algorithms, associate
editor for International Journal of RF Technologies and Journal of Decision Analytics,
and is on the editorial boards of five other technical journals. His research and teaching
interests are in data and text mining, decision support systems, knowledge management,
business intelligence, and enterprise modeling.
Efraim Turban (M.B.A., Ph.D., University of California, Berkeley) is a visiting scholar
at the Pacific Institute for Information System Management, University of Hawaii. Prior
to this, he was on the staff of several universities, including City University of Hong
Kong; Lehigh University; Florida International University; California State University, Long
Beach; Eastern Illinois University; and the University of Southern California. Dr. Turban
is the author of more than 100 refereed papers published in leading journals, such as
Management Science, MIS Quarterly, and Decision Support Systems. He is also the author
of 20 books, including Electronic Commerce: A Managerial Perspective and Information
Technology for Management. He is also a consultant to major corporations worldwide.
Dr. Turban's current areas of interest are Web-based decision support systems, social
commerce, and collaborative decision making.
PART I
Decision Making and Analytics: An Overview
LEARNING OBJECTIVES FOR PART I
• Understand the need for business analytics
• Understand the foundations and key issues of
managerial decision making
• Understand the major categories and
applications of business analytics
• Learn the major frameworks of computerized
decision support: analytics, decision support
systems (DSS), and business intelligence (BI)
This book deals with a collection of computer technologies that support managerial work, essentially
decision making. These technologies have had a profound impact on corporate strategy, perfor-
mance, and competitiveness. These techniques broadly encompass analytics, business intelligence,
and decision support systems, as shown throughout the book. In Part I, we first provide an overview
of the whole book in one chapter. We cover several topics in this chapter. The first topic is managerial
decision making and its computerized support; the second is frameworks for decision support. We
then introduce business analytics and business intelligence. We also provide examples of applications
of these analytical techniques, as well as a preview of the entire book. The second chapter within
Part I introduces the foundational methods for decision making and relates these to computerized
decision support. It also covers the components and technologies of decision support systems.
An Overview of Business Intelligence,
Analytics, and Decision Support
LEARNING OBJECTIVES
• Understand today’s turbulent business
environment and describe how
organizations survive and even excel in
such an environment (solving problems
and exploiting opportunities)
• Understand the need for computerized
support of managerial decision making
• Understand an early framework for
managerial decision making
• Learn the conceptual foundations of
the decision support systems (DSS)
methodology
• Describe the business intelligence (BI)
methodology and concepts and relate
them to DSS
• Understand the various types of analytics
• List the major tools of computerized
decision support
The business environment (climate) is constantly changing, and it is becoming more
and more complex. Organizations, private and public, are under pressures that
force them to respond quickly to changing conditions and to be innovative in the
way they operate. Such activities require organizations to be agile and to make frequent
and quick strategic, tactical, and operational decisions, some of which are very complex.
Making such decisions may require considerable amounts of relevant data, information,
and knowledge. Processing these, in the framework of the needed decisions, must be
done quickly, frequently in real time, and usually requires some computerized support.
This book is about using business analytics as computerized support for manage-
rial decision making. It concentrates on both the theoretical and conceptual founda-
tions of decision support, as well as on the commercial tools and techniques that are
available. This introductory chapter provides more details of these topics as well as an
overview of the book. This chapter has the following sections:
1.1 Opening Vignette: Magpie Sensing Employs Analytics to Manage a Vaccine Supply Chain Effectively and Safely
1.2 Changing Business Environments and Computerized Decision Support
1.3 Managerial Decision Making
1.4 Information Systems Support for Decision Making
1.5 An Early Framework for Computerized Decision Support
1.6 The Concept of Decision Support Systems (DSS)
1.7 A Framework for Business Intelligence (BI)
1.8 Business Analytics Overview
1.9 Brief Introduction to Big Data Analytics
1.10 Plan of the Book
1.11 Resources, Links, and the Teradata University Network Connection
Note: The acronym DSS is treated as both singular and plural throughout this book. Similarly, other acronyms, such
as MIS and GSS, designate both plural and singular forms. This is also true of the word analytics.
1.1 OPENING VIGNETTE: Magpie Sensing Employs
Analytics to Manage a Vaccine Supply Chain
Effectively and Safely
Cold chain in healthcare is defined as the temperature-controlled supply chain involving a
system of transporting and storing vaccines and pharmaceutical drugs. It consists of three
major components-transport and storage equipment, trained personnel, and efficient
management procedures. The majority of the vaccines in the cold chain are typically main-
tained at a temperature of 35–46 degrees Fahrenheit [2–8 degrees Centigrade]. Maintaining
cold chain integrity is extremely important for healthcare product manufacturers.
Especially for vaccines, improper storage and handling practices that compromise
vaccine viability prove a costly, time-consuming affair. Vaccines must be stored properly
from manufacture until they are available for use. Any extreme temperatures of heat or
cold will reduce vaccine potency; such vaccines, if administered, might not yield effective
results or could cause adverse effects.
Effectively maintaining the temperatures of storage units throughout the healthcare
supply chain in real time (i.e., beginning from the gathering of the resources through
manufacturing, distribution, and dispensing of the products) is the most effective solution desired
in the cold chain. Also, the location-tagged real-time environmental data about the storage
units helps in monitoring the cold chain for spoiled products. The chain of custody can
be easily identified to assign product liability.
A study conducted by the Centers for Disease Control and Prevention (CDC) looked at
the handling of cold chain vaccines by 45 healthcare providers around the United States and
reported that three-quarters of the providers experienced serious cold chain violations.
A WAY TOWARD A POSSIBLE SOLUTION
Magpie Sensing, a start-up project under Ebers Smith and Douglas Associated LLC, provides
a suite of cold chain monitoring and analysis technologies for the healthcare industry.
Its solution is a shippable, wireless temperature and humidity monitor that provides real-time,
location-aware tracking of cold chain products during shipment. Magpie Sensing's solutions
rely on rich analytics algorithms that leverage the data gathered from the monitoring
devices to improve the efficiency of cold chain processes and predict cold storage
problems before they occur.
Magpie Sensing applies all three types of analytical techniques (descriptive, predictive,
and prescriptive analytics) to turn the raw data returned from the monitoring devices
into actionable recommendations and warnings.
The properties of the cold storage system, which include the set point of the storage
system’s thermostat, the typical range of temperature values in the storage system, and
4 Part I • Decision Making and Analytics: An Oveiview
the duty cycle of the system’s compressor, are monitored and reported in real time. This
information helps trained personnel to ensure that the storage unit is properly configured
to store a particular product. All the temperature information is displayed on a Web dash-
board that shows a graph of the temperature inside the specific storage unit.
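To make this descriptive layer concrete, the following is a minimal sketch (our illustration, not Magpie Sensing's actual code; all names and figures are hypothetical) of how the set point, observed range, and compressor duty cycle might be summarized from logged readings:

```python
from statistics import mean, median

def describe_storage_unit(readings):
    """Summarize logged (temperature_F, compressor_on) samples
    taken at a fixed interval from one storage unit."""
    temps = [t for t, _ in readings]
    on_count = sum(1 for _, on in readings if on)
    return {
        "estimated_set_point": median(temps),        # crude thermostat target
        "observed_range": (min(temps), max(temps)),  # typical band of values
        "mean_temp": round(mean(temps), 2),
        "duty_cycle": on_count / len(readings),      # fraction of time compressor runs
    }

# Five readings taken every 5 minutes (hypothetical data)
sample = [(38.2, True), (37.5, False), (39.1, False), (40.3, True), (38.8, True)]
print(describe_storage_unit(sample))
```

The median serves as a crude set-point estimate here because a correctly cycling unit spends most of its time near the thermostat target.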
Based on information derived from the monitoring devices, Magpie’s predictive ana-
lytic algorithms can determine the set point of the storage unit’s thermostat and alert the
system’s users if the system is incorrectly configured, depending upon the various types
of products stored. This offers a solution to the users of consumer refrigerators where
the thermostat is not temperature graded. Magpie’s system also sends alerts about pos-
sible temperature violations based on the storage unit’s average temperature and subse-
quent compressor cycle runs, which may drop the temperature below the freezing point.
Magpie's predictive analytics further report possible human errors, such as failure to shut
the storage unit doors or the presence of an incomplete seal, by analyzing the temperature
trend and alerting users via Web interface, text message, or audible alert before the
temperature bounds are actually violated. In a similar way, a compressor or a power
failure can be detected; the estimated time before the storage unit reaches an unsafe
temperature is also reported, which prepares users to arrange backup solutions, such as
using dry ice, until power is restored.
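A warning issued before the bounds are actually violated amounts to trend extrapolation. As a hedged stand-in for Magpie's proprietary algorithms, a simple linear fit over recent samples conveys the idea:

```python
def minutes_until_violation(recent_temps, interval_min, lo=35.0, hi=46.0):
    """Fit a linear trend to equally spaced readings (oldest first) and
    estimate minutes until the 35-46 F safe band is breached.
    Returns None when no trend is detected."""
    n = len(recent_temps)
    x_bar = (n - 1) / 2
    y_bar = sum(recent_temps) / n
    slope = sum((x - x_bar) * (y - y_bar)
                for x, y in enumerate(recent_temps)) / \
            sum((x - x_bar) ** 2 for x in range(n))
    if abs(slope) < 1e-9:
        return None  # temperature is stable
    bound = hi if slope > 0 else lo
    steps = (bound - recent_temps[-1]) / slope
    return steps * interval_min if steps > 0 else 0.0

# Door left ajar: temperature creeping up 0.5 F per 5-minute sample
print(minutes_until_violation([40.0, 40.5, 41.0, 41.5], interval_min=5))  # 45.0
```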
In addition to predictive analytics, Magpie Sensing’s analytics systems can provide
prescriptive recommendations for improving the cold storage processes and business
decision making. Prescriptive analytics help users dial in the optimal temperature setting,
which helps to achieve the right balance between freezing and spoilage risk; this, in turn,
provides a cushion of time to react to the situation before the products spoil. Its prescriptive
analytics also gather useful meta-information on cold storage units, including the times of
day that are busiest and the periods when the system's doors are opened. This information
can be used to shape design plans and institutional policies that ensure the system is
properly maintained and not overused.
Furthermore, prescriptive analytics can be used to guide equipment purchase decisions
by constantly analyzing the performance of current storage units. Decisions on distributing
products across the available storage units can then be made according to each unit's
efficiency and each product's sensitivity.
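The "right balance between freezing and spoilage risk" can be pictured as minimizing the sum of two opposing risk curves; the curves below are invented purely for illustration:

```python
def choose_set_point(candidates, freeze_risk, spoil_risk):
    """Pick the candidate set point (deg F) minimizing combined risk.
    freeze_risk / spoil_risk: functions mapping a set point to a 0-1
    risk estimate, e.g., fitted from historical temperature excursions."""
    return min(candidates, key=lambda t: freeze_risk(t) + spoil_risk(t))

# Toy risk curves: freezing risk falls and spoilage risk rises with temperature
freeze = lambda t: max(0.0, (38.0 - t) / 6.0)
spoil = lambda t: max(0.0, (t - 42.0) / 6.0)
print(choose_set_point([36, 38, 40, 42, 44], freeze, spoil))  # 38 (first zero-risk point)
```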
Using Magpie Sensing's cold chain analytics, manufacturers can avoid additional
manufacturing time and expenditure by ensuring that product safety is secured throughout
the supply chain and that effective products are administered to patients. Compliance
with state and federal safety regulations can be better achieved through automatic data
gathering and reporting about the products involved in the cold chain.
QUESTIONS FOR THE OPENING VIGNETTE
1. What information is provided by the descriptive analytics employed at Magpie
Sensing?
2. What type of support is provided by the predictive analytics employed at Magpie
Sensing?
3. How does prescriptive analytics help in business decision making?
4. In what ways can actionable information be reported in real time to concerned
users of the system?
5. In what other situations might real-time monitoring applications be needed?
WHAT WE CAN LEARN FROM THIS VIGNETTE
This vignette illustrates how data from a business process can be used to generate insights
at various levels. First, the graphical analysis of the data (termed reporting analytics) allows
users to get a good feel for the situation. Then, additional analysis using data mining
techniques can be used to estimate what future behavior would be like. This is the domain
of predictive analytics. Such analysis can then be taken to create specific recommendations
for operators. This is an example of what we call prescriptive analytics. Finally, this open-
ing vignette also suggests that innovative applications of analytics can create new business
ventures. Identifying opportunities for applications of analytics and assisting with decision
making in specific domains is an emerging entrepreneurial opportunity.
Sources: Magpiesensing.com, “Magpie Sensing Cold Chain Analytics and Monitoring,” magpiesensing.com/
wp-content/uploads/2013/01/ColdChainAnalyticsMagpieSensing-Whitepaper (accessed July 2013);
Centers for Disease Control and Prevention, Vaccine Storage and Handling, http://www.cdc.gov/vaccines/pubs/
pinkbook/vac-storage.html#storage (accessed July 2013); A. Zaleski, “Magpie Analytics System Tracks Cold-
Chain Products to Keep Vaccines, Reagents Fresh” (2012), technicallybaltimore.com/profiles/startups/magpie-
analytics-system-tracks-cold-chain-products-to-keep-vaccines-reagents-fresh (accessed February 2013).
1.2 CHANGING BUSINESS ENVIRONMENTS AND COMPUTERIZED
DECISION SUPPORT
The opening vignette illustrates how a company can employ technologies to make sense
of data and make better decisions. Companies are moving aggressively to computerized
support of their operations. To understand why companies are embracing computer-
ized support, including business intelligence, we developed a model called the Business
Pressures-Responses-Support Model, which is shown in Figure 1.1.
The Business Pressures-Responses-Support Model
The Business Pressures-Responses-Support Model, as its name indicates, has three com-
ponents: business pressures that result from today’s business climate, responses (actions
taken) by companies to counter the pressures (or to take advantage of the opportunities
available in the environment), and computerized support that facilitates the monitoring
of the environment and enhances the response actions taken by organizations.
[FIGURE 1.1 The Business Pressures-Responses-Support Model. The figure shows business environmental factors (globalization, customer demand, government regulations, market conditions, competition, etc.) creating pressures and opportunities for the organization; organizational responses (strategy, partners' collaboration, real-time response, agility, increased productivity, new vendors, new business models, etc.); and decisions and support (analyses, predictions, decisions) enabled by integrated computerized decision support and business intelligence.]
THE BUSINESS ENVIRONMENT The environment in which organizations operate today
is becoming more and more complex. This complexity creates opportunities on the one
hand and problems on the other. Take globalization as an example. Today, you can eas-
ily find suppliers and customers in many countries, which means you can buy cheaper
materials and sell more of your products and services; great opportunities exist. However,
globalization also means more and stronger competitors. Business environment factors
can be divided into four major categories: markets, consumer demands, technology, and
societal. These categories are summarized in Table 1.1.
Note that the intensity of most of these factors increases with time, leading to
more pressures, more competition, and so on. In addition, organizations and departments
within organizations face decreased budgets and amplified pressures from top managers
to increase performance and profit. In this kind of environment, managers must respond
quickly, innovate, and be agile. Let’s see how they do it.
ORGANIZATIONAL RESPONSES: BE REACTIVE, ANTICIPATIVE, ADAPTIVE, AND PROACTIVE
Both private and public organizations are aware of today’s business environment and
pressures. They use different actions to counter the pressures. Vodafone New Zealand
Ltd (Krivda, 2008), for example, turned to BI to improve communication and to support
executives in its effort to retain existing customers and increase revenue from these cus-
tomers. Managers may take other actions, including the following:
• Employ strategic planning.
• Use new and innovative business models.
• Restructure business processes.
• Participate in business alliances.
• Improve corporate information systems.
• Improve partnership relationships.
TABLE 1.1 Business Environment Factors That Create Pressures on Organizations

Markets: strong competition; expanding global markets; booming electronic markets on the Internet; innovative marketing methods; opportunities for outsourcing with IT support; need for real-time, on-demand transactions.

Consumer demands: desire for customization; desire for quality, diversity of products, and speed of delivery; customers getting powerful and less loyal.

Technology: more innovations, new products, and new services; increasing obsolescence rate; increasing information overload; social networking, Web 2.0 and beyond.

Societal: growing government regulations and deregulation; workforce more diversified, older, and composed of more women; prime concerns of homeland security and terrorist attacks; necessity of the Sarbanes-Oxley Act and other reporting-related legislation; increasing social responsibility of companies; greater emphasis on sustainability.
• Encourage innovation and creativity.
• Improve customer service and relationships.
• Employ social media and mobile platforms for e-commerce and beyond.
• Move to make-to-order production and on-demand manufacturing and services.
• Use new IT to improve communication, data access (discovery of information), and
collaboration.
• Respond quickly to competitors’ actions (e.g., in pricing, promotions, new products
and services).
• Automate many tasks of white-collar employees.
• Automate certain decision processes, especially those dealing with customers.
• Improve decision making by employing analytics.
Many, if not all, of these actions require some computerized support. These and other
response actions are frequently facilitated by computerized decision support (DSS).
CLOSING THE STRATEGY GAP One of the major objectives of computerized decision
support is to facilitate closing the gap between the current performance of an organi-
zation and its desired performance, as expressed in its mission, objectives, and goals,
and the strategy to achieve them. In order to understand why computerized support
is needed and how it is provided, especially for decision-making support, let’s look at
managerial decision making.
SECTION 1.2 REVIEW QUESTIONS
1. List the components of and explain the Business Pressures-Responses-Support
Model.
2. What are some of the major factors in today’s business environment?
3. What are some of the major response activities that organizations take?
1.3 MANAGERIAL DECISION MAKING
Management is a process by which organizational goals are achieved by using
resources . The resources are considered inputs, and attainment of goals is viewed as
the output of the process. The degree of success of the organization and the manager
is often measured by the ratio of outputs to inputs. This ratio is an indication of the
organization’s productivity, which is a reflection of the organizational and managerial
pe,fonnance.
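Expressed as code, the measure is a single ratio (the figures below are made up for illustration):

```python
# Productivity = valued outputs / consumed inputs (hypothetical dollar figures)
outputs = 1_250_000  # value of goods and services produced
inputs = 1_000_000   # cost of labor, capital, and materials consumed
print(f"Productivity ratio: {outputs / inputs:.2f}")  # 1.25
```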
The level of productivity or the success of management depends on the performance
of managerial functions, such as planning, organizing, directing, and controlling.
To perform their functions, managers engage in a continuous process of making
decisions. Making a decision means selecting the best alternative from two or more
solutions.
The Nature of Managers’ Work
Mintzberg's (2008) classic study of top managers and several replicated studies suggest
that managers perform 10 major roles that can be classified into three major categories:
interpersonal, informational, and decisional (see Table 1.2).
To perform these roles, managers need information that is delivered efficiently and
in a timely manner to personal computers (PCs) on their desktops and to mobile devices.
This information is delivered by networks, generally via Web technologies.
In addition to obtaining information necessary to better perform their roles, manag-
ers use computers directly to support and improve decision making, which is a key task
TABLE 1.2 Mintzberg's 10 Managerial Roles

Interpersonal
Figurehead: Is symbolic head; obliged to perform a number of routine duties of a legal or social nature.
Leader: Is responsible for the motivation and activation of subordinates; responsible for staffing, training, and associated duties.
Liaison: Maintains self-developed network of outside contacts and informers who provide favors and information.

Informational
Monitor: Seeks and receives a wide variety of special information (much of it current) to develop a thorough understanding of the organization and environment; emerges as the nerve center of the organization's internal and external information.
Disseminator: Transmits information received from outsiders or from subordinates to members of the organization; some of this information is factual, and some involves interpretation and integration.
Spokesperson: Transmits information to outsiders about the organization's plans, policies, actions, results, and so forth; serves as an expert on the organization's industry.

Decisional
Entrepreneur: Searches the organization and its environment for opportunities and initiates improvement projects to bring about change; supervises design of certain projects.
Disturbance handler: Is responsible for corrective action when the organization faces important, unexpected disturbances.
Resource allocator: Is responsible for the allocation of organizational resources of all kinds; in effect, is responsible for the making or approval of all significant organizational decisions.
Negotiator: Is responsible for representing the organization at major negotiations.

Sources: Compiled from H. A. Mintzberg, The Nature of Managerial Work. Prentice Hall, Englewood Cliffs, NJ, 1980; and H. A. Mintzberg, The Rise and Fall of Strategic Planning. The Free Press, New York, 1993.
that is part of most of these roles. Many managerial activities in all roles revolve around
decision making. Managers, especially those at high managerial levels, are primarily deci-
sion makers. We review the decision-making process next but will study it in more detail
in the next chapter.
The Decision-Making Process
For years, managers considered decision making purely an art: a talent acquired over a
long period through experience (i.e., learning by trial-and-error) and by using intuition.
Management was considered an art because a variety of individual styles could be used
in approaching and successfully solving the same types of managerial problems. These
styles were often based on creativity, judgment, intuition, and experience rather than
on systematic quantitative methods grounded in a scientific approach. However, recent
research suggests that companies with top managers who are more focused on persistent
work (almost dullness) tend to outperform those with leaders whose main strengths are
interpersonal communication skills (Kaplan et al., 2008; Brooks, 2009). It is more impor-
tant to emphasize methodical, thoughtful, analytical decision making rather than flashi-
ness and interpersonal communication skills.
Managers usually make decisions by following a four-step process (we learn more
about these in Chapter 2):
1. Define the problem (i.e., a decision situation that may deal with some difficulty or
with an opportunity).
2. Construct a model that describes the real-world problem.
3. Identify possible solutions to the modeled problem and evaluate the solutions.
4. Compare, choose, and recommend a potential solution to the problem.
To follow this process, one must make sure that sufficient alternative solutions are
being considered, that the consequences of using these alternatives can be reasonably
predicted, and that comparisons are done properly. However, the environmental factors
listed in Table 1.1 make such an evaluation process difficult for the following reasons:
• Technology, information systems, advanced search engines, and globalization result
in more and more alternatives from which to choose.
• Government regulations and the need for compliance, political instability and ter-
rorism, competition, and changing consumer demands produce more uncertainty,
making it more difficult to predict consequences and the future.
• Other factors are the need to make rapid decisions, the frequent and unpredictable
changes that make trial-and-error learning difficult, and the potential costs of making
mistakes.
• These environments are growing more complex every day. Therefore, making deci-
sions today is indeed a complex task.
Because of these trends and changes, it is nearly impossible to rely on a trial-and-
error approach to management, especially for decisions for which the factors shown in
Table 1.1 are strong influences. Managers must be more sophisticated; they must use the
new tools and techniques of their fields. Most of those tools and techniques are discussed
in this book. Using them to support decision making can be extremely rewarding in
making effective decisions. In the following section, we look at why we need computer
support and how it is provided.
SECTION 1.3 REVIEW QUESTIONS
1. Describe the three major managerial roles, and list some of the specific activities in each.
2. Why have some argued that management is the same as decision making?
3. Describe the four steps managers take in making a decision.
1.4 INFORMATION SYSTEMS SUPPORT FOR DECISION MAKING
From traditional uses in payroll and bookkeeping functions, computerized systems have
penetrated complex managerial areas ranging from the design and management of auto-
mated factories to the application of analytical methods for the evaluation of proposed
mergers and acquisitions. Nearly all executives know that information technology is vital
to their business and extensively use information technologies.
Computer applications have moved from transaction processing and monitoring
activities to problem analysis and solution applications, and much of the activity is done
with Web-based technologies, in many cases accessed through mobile devices. Analytics
and BI tools such as data warehousing, data mining, online analytical processing (OLAP),
dashboards, and the use of the Web for decision support are the cornerstones of today's
modern management. Managers must have high-speed, networked information sys-
tems (wireline or wireless) to assist them with their most important task: making deci-
sions. Besides the obvious growth in hardware, software, and network capacities, some
developments have clearly contributed to facilitating growth of decision support and
analytics in a number of ways, including the following:
• Group communication and collaboration. Many decisions are made today by
groups whose members may be in different locations. Groups can collaborate and
communicate readily by using Web-based tools as well as the ubiquitous smartphones.
Collaboration is especially important along the supply chain, where partners (all the
way from vendors to customers) must share information. Assembling a group of
decision makers, especially experts, in one place can be costly. Information systems
can improve the collaboration process of a group and enable its members to be at dif-
ferent locations (saving travel costs). We will study some applications in Chapter 12.
• Improved data management. Many decisions involve complex computations.
Data for these can be stored in different databases anywhere in the organization
and even possibly at Web sites outside the organization. The data may include text,
sound, graphics, and video, and they can be in different languages. It may be neces-
sary to transmit data quickly from distant locations. Systems today can search, store,
and transmit needed data quickly, economically, securely, and transparently.
• Managing giant data warehouses and Big Data. Large data warehouses, like
the ones operated by Walmart, contain terabytes and even petabytes of data. Special
methods, including parallel computing, are available to organize, search, and mine
the data. The costs related to data warehousing are declining. Technologies that fall
under the broad category of Big Data have enabled massive data coming from a
variety of sources and in many different forms, which allows a very different view
into organizational performance that was not possible in the past.
• Analytical support. With more data and analysis technologies, more alterna-
tives can be evaluated, forecasts can be improved, risk analysis can be performed
quickly, and the views of experts (some of whom may be in remote locations) can
be collected quickly and at a reduced cost. Expertise can even be derived directly
from analytical systems. With such tools, decision makers can perform complex
simulations, check many possible scenarios, and assess diverse impacts quickly and
economically. This, of course, is the focus of several chapters in the book.
• Overcoming cognitive limits in processing and storing information. According
to Simon (1977), the human mind has only a limited ability to process and store information.
People sometimes find it difficult to recall and use information in an error-free
fashion due to their cognitive limits. The term cognitive limits indicates that an indi-
vidual’s problem-solving capability is limited when a wide range of diverse information
and knowledge is required. Computerized systems enable people to overcome their
cognitive limits by quickly accessing and processing vast amounts of stored information
(see Chapter 2).
• Knowledge management. Organizations have gathered vast stores of informa-
tion about their own operations, customers, internal procedures, employee interac-
tions, and so forth through the unstructured and structured communications taking
place among the various stakeholders. Knowledge management systems (KMS,
Chapter 12) have become sources of formal and informal support for decision
making to managers, although sometimes they may not even be called KMS.
• Anywhere, any time support. Using wireless technology, managers can access
information anytime and from any place, analyze and interpret it, and communicate
with those involved. This perhaps is the biggest change that has occurred in the last
few years. The speed at which information needs to be processed and converted
into decisions has truly changed expectations for both consumers and businesses.
These and other capabilities have been driving the use of computerized decision support
since the late 1960s, but especially since the mid-1990s. The growth of mobile technologies,
social media platforms, and analytical tools has enabled a much higher level of information
systems support for managers. In the next sections we study a historical classification of
decision support tasks, which leads to an introduction to decision support systems. We
then present an overview of the technologies that have been broadly referred to as business
intelligence. From there we broaden our horizons to introduce various types of analytics.
SECTION 1.4 REVIEW QUESTIONS
1. What are some of the key system-oriented trends that have fostered IS-supported
decision making to a new level?
2. List some capabilities of information systems that can facilitate managerial decision
making.
3. How can a computer help overcome the cognitive limits of humans?
1.5 AN EARLY FRAMEWORK FOR COMPUTERIZED
DECISION SUPPORT
An early framework for computerized decision support includes several major concepts
that are used in forthcoming sections and chapters of this book. Gorry and Scott-Morton
created and used this framework in the early 1970s, and the framework then evolved into
a new technology called DSS.
The Gorry and Scott-Morton Classical Framework
Gorry and Scott-Morton (1971) proposed a framework that is a 3-by-3 matrix, as shown in
Figure 1.2. The two dimensions are the degree of structuredness and the types of control.
FIGURE 1.2 Decision Support Frameworks. The 3-by-3 matrix crosses the degree of structuredness (rows) with the type of control (columns: operational control, managerial control, strategic planning):

Structured decisions
Cell 1 (operational control): accounts receivable, accounts payable, order entry.
Cell 2 (managerial control): budget analysis, short-term forecasting, personnel reports, make-or-buy.
Cell 3 (strategic planning): financial management (investment portfolio), warehouse location, distribution systems.

Semistructured decisions
Cell 4 (operational control): production scheduling, inventory control.
Cell 5 (managerial control): credit evaluation, budget preparation, plant layout, project scheduling, reward system design, inventory categorization.
Cell 6 (strategic planning): building a new plant, mergers and acquisitions, new product planning, compensation planning, quality assurance, HR policies, inventory planning.

Unstructured decisions
Cell 7 (operational control): buying software, approving loans, operating a help desk, selecting a cover for a magazine.
Cell 8 (managerial control): negotiating, recruiting an executive, buying hardware, lobbying.
Cell 9 (strategic planning): R&D planning, new technology development, social responsibility planning.
DEGREE OF STRUCTUREDNESS The left side of Figure 1.2 is based on Simon’s (1977) idea
that decision-making processes fall along a continuum that ranges from highly structured
(sometimes called programmed) to highly unstructured (i.e., nonprogrammed) decisions.
Structured processes are routine and typically repetitive problems for which standard
solution methods exist. Unstructured processes are fuzzy, complex problems for which
there are no cut-and-dried solution methods.
An unstructured problem is one where the articulation of the problem or the solu-
tion approach may be unstructured in itself. In a structured problem, the procedures
for obtaining the best (or at least a good enough) solution are known. Whether the prob-
lem involves finding an appropriate inventory level or choosing an optimal investment
strategy, the objectives are clearly defined. Common objectives are cost minimization and
profit maximization.
Semistructured problems fall between structured and unstructured problems, hav-
ing some structured elements and some unstructured elements. Keen and Scott-Morton
(1978) mentioned trading bonds, setting marketing budgets for consumer products, and
performing capital acquisition analysis as semistructured problems.
TYPES OF CONTROL The second half of the Gorry and Scott-Morton framework
(refer to Figure 1.2) is based on Anthony's (1965) taxonomy, which defines three
broad categories that encompass all managerial activities: strategic planning, which
involves defining long-range goals and policies for resource allocation; manage-
ment control, the acquisition and efficient use of resources in the accomplishment of
organizational goals; and operational control, the efficient and effective execution of
specific tasks.
THE DECISION SUPPORT MATRIX Anthony’s and Simon’s taxonomies are combined in the
nine-cell decision support matrix shown in Figure 1.2. The initial purpose of this matrix
was to suggest different types of computerized support to different cells in the matrix.
Gorry and Scott-Morton suggested, for example, that for semistructured decisions and
unstructured decisions, conventional management information systems (MIS) and man-
agement science (MS) tools are insufficient. Human intellect and a different approach to
computer technologies are necessary. They proposed the use of a supportive information
system, which they called a DSS.
Note that the more structured and operational control-oriented tasks (such as
those in cells 1, 2, and 4) are usually performed by lower-level managers, whereas
the tasks in cells 6, 8, and 9 are the responsibility of top executives or highly trained
specialists.
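For readers who think in code, the framework can be treated as a simple lookup from (structuredness, control type) to a cell number and the kind of support the chapter associates with it. This encoding is our own illustration of Figure 1.2, not part of the original framework:

```python
# The nine cells of Figure 1.2, keyed by (structuredness, control type).
# Support assignments follow the chapter's argument: conventional MIS and
# management science (MS) tools suffice for structured cells, while DSS
# (plus human judgment) is proposed for the rest.
DECISION_MATRIX = {
    ("structured", "operational"): (1, "conventional MIS/MS tools"),
    ("structured", "managerial"): (2, "conventional MIS/MS tools"),
    ("structured", "strategic"): (3, "conventional MIS/MS tools"),
    ("semistructured", "operational"): (4, "DSS"),
    ("semistructured", "managerial"): (5, "DSS"),
    ("semistructured", "strategic"): (6, "DSS"),
    ("unstructured", "operational"): (7, "DSS plus human judgment"),
    ("unstructured", "managerial"): (8, "DSS plus human judgment"),
    ("unstructured", "strategic"): (9, "DSS plus human judgment"),
}

cell, support = DECISION_MATRIX[("semistructured", "strategic")]
print(f"Mergers and acquisitions fall in cell {cell}: {support}")
```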
Computer Support for Structured Decisions
Computers have historically supported structured and some semistructured decisions,
especially those that involve operational and managerial control, since the 1960s.
Operational and managerial control decisions are made in all functional areas , especially
in finance and production (i.e., operations) management.
Structured problems, which are encountered repeatedly, have a high level of struc-
ture. It is therefore possible to abstract, analyze, and classify them into specific catego-
ries. For example, a make-or-buy decision is one category. Other examples of categories
are capital budgeting, allocation of resources, distribution, procurement, planning, and
inventory control decisions. For each category of decision, an easy-to-apply prescribed
model and solution approach have been developed, generally as quantitative formulas.
Therefore, it is possible to use a scientific approach for automating portions of manage-
rial decision making.
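As one example of such a prescribed quantitative model, inventory control decisions are classically supported by the economic order quantity (EOQ) formula; this sketch is our own illustration (the chapter does not work this example), with hypothetical figures:

```python
from math import sqrt

def economic_order_quantity(annual_demand, order_cost, holding_cost):
    """Classic EOQ model: the order size minimizing the sum of ordering
    and holding costs, Q* = sqrt(2 * D * S / H)."""
    return sqrt(2 * annual_demand * order_cost / holding_cost)

# Hypothetical figures: 12,000 units/year demand, $50 per order,
# $2.40 per unit per year to hold inventory
print(f"Order quantity: {economic_order_quantity(12_000, 50, 2.40):.0f} units")  # ~707
```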
Computer Support for Unstructured Decisions
Unstructured problems can be only partially supported by standard computerized quan-
titative methods. It is usually necessary to develop customized solutions. However, such
solutions may benefit from data and information generated from corporate or external
data sources. Intuition and judgment may play a large role in these types of decisions, as
may computerized communication and collaboration technologies, as well as knowledge
management (see Chapter 12).
Computer Support for Semistructured Problems
Solving semistructured problems may involve a combination of standard solution pro-
cedures and human judgment. Management science can provide models for the portion
of a decision-making problem that is structured. For the unstructured portion, a DSS can
improve the quality of the information on which the decision is based by providing, for
example, not only a single solution but also a range of alternative solutions, along with
their potential impacts. These capabilities help managers to better understand the nature
of problems and, thus, to make better decisions.
SECTION 1.5 REVIEW QUESTIONS
1. What are structured, unstructured, and semistructured decisions? Provide two exam-
ples of each.
2. Define operational control, managerial control, and strategic planning. Provide two
examples of each.
3. What are the nine cells of the decision framework? Explain what each is for.
4. How can computers provide support for making structured decisions?
5. How can computers provide support to semistructured and unstructured decisions?
1.6 THE CONCEPT OF DECISION SUPPORT SYSTEMS (DSS)
In the early 1970s, Scott-Morton first articulated the major concepts of DSS. He defined
decision support systems (DSS) as “interactive computer-based systems, which help
decision makers utilize data and models to solve unstructured problems” (Gorry and
Scott-Morton, 1971). The following is another classic DSS definition, provided by Keen
and Scott-Morton (1978):
Decision support systems couple the intellectual resources of individuals with
the capabilities of the computer to improve the quality of decisions. It is a
computer-based support system for management decision makers who deal
with semistructured problems.
Note that the term decision support system, like management information system (MIS)
and other terms in the field of IT, is a content-free expression (i.e., it means different
things to different people). Therefore, there is no universally accepted definition of DSS.
(We present additional definitions in Chapter 2.) Actually, DSS can be viewed as a con-
ceptual methodology-that is, a broad, umbrella term. However, some view DSS as a nar-
rower, specific decision support application.
DSS as an Umbrella Term
The term DSS can be used as an umbrella term to describe any computerized system that
supports decision making in an organization. An organization may have a knowledge
management system to guide all its personnel in their problem solving. Another organiza-
tion may have separate support systems for marketing, finance, and accounting; a sup-
ply chain management (SCM) system for production; and several rule-based systems for
product repair diagnostics and help desks. DSS encompasses them all.
Evolution of DSS into Business Intelligence
In the early days of DSS, managers let their staff do some supportive analysis by using
DSS tools. As PC technology advanced, a new generation of managers evolved, one
that was comfortable with computing and knew that technology can directly help
make intelligent business decisions faster. New tools such as OLAP, data warehousing,
data mining, and intelligent systems, delivered via Web technology, added promised
capabilities and easy access to tools, models, and data for computer-aided decision
making. These tools started to appear under the names BI and business analytics in
the mid-1990s. We introduce these concepts next, and relate the DSS and BI concepts
in the following sections.
SECTION 1.6 REVIEW QUESTIONS
1. Provide two definitions of DSS.
2. Describe DSS as an umbrella term.
1.7 A FRAMEWORK FOR BUSINESS INTELLIGENCE (BI)
The decision support concepts presented in Sections 1.5 and 1.6 have been implemented
incrementally, under different names, by many vendors that have created tools and meth-
odologies for decision support. As the enterprise-wide systems grew, managers were
able to access user-friendly reports that enabled them to make decisions quickly. These
systems, which were generally called executive information systems (EIS), then began to
offer additional visualization, alerts, and performance measurement capabilities. By 2006,
the major commercial products and services appeared under the umbrella term business
intelligence (BI).
Definitions of BI
Business intelligence (BI) is an umbrella term that combines architectures, tools,
databases, analytical tools, applications, and methodologies. It is, like DSS, a content-free
expression, so it means different things to different people. Part of the confusion about
BI lies in the flurry of acronyms and buzzwords that are associated with it (e.g., business
performance management [BPM]). BI's major objective is to enable interactive access
(sometimes in real time) to data, to enable manipulation of data, and to give business
managers and analysts the ability to conduct appropriate analyses. By analyzing historical
and current data, situations, and performances, decision makers get valuable insights that
enable them to make more informed and better decisions. The process of BI is based on
the transformation of data to information, then to decisions, and finally to actions.
A Brief History of BI
The term BI was coined by the Gartner Group in the mid-1990s. However, the concept is
much older; it has its roots in the MIS reporting systems of the 1970s. During that period,
reporting systems were static and two-dimensional and had no analytical capabilities. In the
early 1980s, the concept of executive information systems (EIS) emerged. This concept
expanded the computerized support to top-level managers and executives.
[FIGURE 1.3 Evolution of Business Intelligence (BI). The figure traces the tools that converged into BI: querying and reporting, data warehouses, data marts, EIS/ESS, financial reporting, OLAP, digital cockpits and dashboards, scorecards and dashboards, workflow, alerts and notifications, data mining, predictive analytics, broadcasting tools, and portals.]

Some of the capabilities introduced were dynamic multidimensional (ad hoc or on-demand) reporting,
forecasting and prediction, trend analysis, drill-down to details, status access, and critical
success factors. These features appeared in dozens of commercial products until the
mid-1990s. Then the same capabilities and some new ones appeared under the name BI.
Today, a good BI-based enterprise information system contains all the information
executives need. So, the original concept of EIS was transformed into BI. By 2005, BI systems
started to include artificial intelligence capabilities as well as powerful analytical capabili-
ties. Figure 1.3 illustrates the various tools and techniques that may be included in a BI
system. It illustrates the evolution of BI as well. The tools shown in Figure 1.3 provide the
capabilities of BI. The most sophisticated BI products include most of these capabilities;
others specialize in only some of them. We will study several of these capabilities in more
detail in Chapters 5 through 9.
The Architecture of BI
A BI system has four major components: a data warehouse, with its source data; business
analytics, a collection of tools for manipulating, mining, and analyzing the data in the data
warehouse; business performance management (BPM) for monitoring and analyzing
performance; and a user interface (e.g., a dashboard). The relationship among these components is
illustrated in Figure 1.4. We will discuss these components in detail in Chapters 3 through 9.
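A deliberately simplified sketch (the component stand-ins are ours, not from any BI product) shows how the four components fit together as a pipeline:

```python
def run_bi_pipeline(fetch, analyze, evaluate, render, query):
    """Wire the four components together:
    data warehouse -> business analytics -> BPM -> user interface."""
    data = fetch(query)         # data warehouse with its source data
    results = analyze(data)     # business analytics tools
    alerts = evaluate(results)  # business performance management (BPM)
    render(results, alerts)     # user interface, e.g., a dashboard

# Minimal stand-ins for each component (invented numbers)
run_bi_pipeline(
    fetch=lambda q: [120, 95, 143],  # monthly sales pulled for the query
    analyze=lambda d: {"mean_sales": round(sum(d) / len(d), 1)},
    evaluate=lambda r: ["below target"] if r["mean_sales"] < 125 else [],
    render=print,
    query="sales by month",
)
```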
Styles of BI
The architecture of BI depends on its applications. MicroStrategy Corp. distinguishes five
styles of BI and offers special tools for each. The five styles are report delivery and alert-
ing; enterprise reporting (using dashboards and scorecards); cube analysis (also known as
slice-and-dice analysis); ad hoc queries; and statistics and data mining.
[FIGURE 1.4 A High-Level Architecture of BI. The figure shows a data warehouse environment (technical staff build the data warehouse: organizing, summarizing, standardizing), a business analytics environment (manipulation and results, with intelligent systems as a future component), performance and strategy (managers/executives and BPM strategies), and a user interface (browser, portal, dashboard). Source: Based on W. Eckerson, Smart Companies in the 21st Century: The Secrets of Creating Successful Business Intelligent Solutions. The Data Warehousing Institute, Seattle, WA, 2003, p. 32, Illustration 5.]
The Origins and Drivers of BI
Where did modern approaches to data warehousing (DW) and BI come from? What are
their roots, and how do those roots affect the way organizations are managing these initia-
tives today? Today’s investments in information technology are under increased scrutiny
in terms of their bottom-line impact and potential. The same is true of DW and the BI
applications that make these initiatives possible.
Organizations are being compelled to capture, understand, and harness their data
to support decision making in order to improve business operations. Legislation and
regulation (e.g., the Sarbanes-Oxley Act of 2002) now require business leaders to document
their business processes and to sign off on the legitimacy of the information they
rely on and report to stakeholders. Moreover, business cycle times are now extremely
compressed; faster, more informed, and better decision making is therefore a competitive
imperative. Managers need the right information at the right time and in the right place.
This is the mantra for modern approaches to BI.
Organizations have to work smart. Paying careful attention to the management of BI
initiatives is a necessary aspect of doing business. It is no surprise, then, that organizations
are increasingly championing BI. You will hear about more BI successes and the fundamentals
of those successes in Chapters 3 through 9. Examples of many applications of BI
are provided in Table 1.3. Application Case 1.1 illustrates one such application of BI that
has helped many airlines, as well as the companies offering such services to the airlines.
A Multimedia Exercise in Business Intelligence
Teradata University Network (TUN) includes some videos along the lines of the televi-
sion show CSI to illustrate concepts of analytics in different industries. These are called
“BSI Videos (Business Scenario Investigations).” Not only are these entertaining, but
they also provide the class with some questions for discussion. For starters, please go to
teradatauniversitynetwork.com/teach-and-learn/library-item/?LibraryItemId=889.
Watch the video that appears on YouTube. Essentially, you have to assume the role of a
customer service center professional. An incoming flight is running late, and several pas-
sengers are likely to miss their connecting flights. There are seats on one outgoing flight
that can accommodate two of the four passengers. Which two passengers should be given
TABLE 1.3 Business Value of BI Analytical Applications

Customer segmentation. Business question: What market segments do my customers fall into, and what are their characteristics? Business value: Personalize customer relationships for higher satisfaction and retention.

Propensity to buy. Business question: Which customers are most likely to respond to my promotion? Business value: Target customers based on their need to increase their loyalty to your product line; also, increase campaign profitability by focusing on those most likely to buy.

Customer profitability. Business question: What is the lifetime profitability of my customer? Business value: Make individual business interaction decisions based on the overall profitability of customers.

Fraud detection. Business question: How can I tell which transactions are likely to be fraudulent? Business value: Quickly determine fraud and take immediate action to minimize cost.

Customer attrition. Business question: Which customer is at risk of leaving? Business value: Prevent loss of high-value customers and let go of lower-value customers.

Channel optimization. Business question: What is the best channel to reach my customer in each segment? Business value: Interact with customers based on their preference and your need to manage cost.

Source: A. Ziama and J. Kasher, Data Mining Primer for the Data Warehousing Professional. Teradata, Dayton, OH, 2004.
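As a flavor of the first row of the table, customer segmentation can be as simple as bucketing customers on recency, frequency, and monetary value; the thresholds and data below are invented purely for illustration:

```python
def segment(recency_days, orders_per_year, annual_spend):
    """Assign a coarse RFM-style segment; thresholds are illustrative."""
    if annual_spend > 5_000 and recency_days < 30:
        return "high-value active"
    if recency_days > 180:
        return "at risk of attrition"
    if orders_per_year >= 12:
        return "frequent core"
    return "occasional"

# (recency in days, orders per year, annual spend in dollars)
customers = [(12, 24, 8_200), (200, 2, 450), (45, 6, 1_900)]
print([segment(*c) for c in customers])
# -> ['high-value active', 'at risk of attrition', 'occasional']
```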
Application Case 1.1
Sabre Helps Its Clients Through Dashboards and Analytics
Sabre is one of the world leaders in the travel indus-
try, providing both business-to-consumer services as
well as business-to-business services. It serves travel-
ers, travel agents, corporations, and travel suppliers
through its four main companies: Travelocity, Sabre
Travel Network, Sabre Airline Solutions, and Sabre
Hospitality Solutions. The current volatile global eco-
nomic environment poses significant competitive chal-
lenges to the airline industry. To stay ahead of the
competition, Sabre Airline Solutions recognized that
airline executives needed enhanced tools for manag-
ing their business decisions by eliminating the tradi-
tional, manual, time-consuming process of collect-
ing and aggregating financial and other information
needed for actionable initiatives. This enables real-time
decision support at airlines throughout the world that
maximize their (and, in turn, Sabre’s) return on infor-
mation by driving insights, actionable intelligence, and
value for customers from the growing data.
Sabre developed an Enterprise Travel Data
Warehouse (ETDW) using Teradata to hold its mas-
sive reservations data. ETDW is updated in near-real
time with batches that run every 15 minutes, gathering
data from all of Sabre’s businesses. Sabre uses its
ETDW to create Sabre Executive Dashboards that pro-
vide near-real-time executive insights using a Cognos
8 BI platform with Oracle Data Integrator and Oracle
Goldengate technology infrastructure. The Executive
Dashboards offer their client airlines’ top-level man-
agers and decision makers a timely, automated, user-
friendly solution, aggregating critical performance
metrics in a succinct way and providing at a glance
a 360-degree view of the overall health of the airline.
At one airline, Sabre’s Executive Dashboards provide
senior management with a daily and intra-day snap-
shot of key performance indicators in a single appli-
cation, replacing the once-a-week, 8-hour process of
generating the same report from various data sources.
The use of dashboards is not limited to external
customers; Sabre also uses them for its assessment
of internal operational performance.
The dashboards help Sabre’s customers to have
a clear understanding of the data through the visual
displays that incorporate interactive drill-down capa-
bilities. It replaces flat presentations and allows for
more focused review of the data with less effort and
(Continued)
Application Case 1.1 (Continued)
time. This facilitates team dialog by making the data/
metrics pertaining to sales performance, including
ticketing, seats sold and flown, operational perfor-
mance such as data on flight movement and track-
ing, customer reservations, inventory, and revenue
across an airline’s multiple distribution channels, avail-
able to many stakeholders. The dashboard systems
provide scalable infrastructure, graphical user interface
(GUI) support, data integration, and data aggregation
that empower airline executives to be more proactive
in taking actions that lead to positive impacts on the
overall health of their airline.
With its ETDW, Sabre could also develop other
Web-based analytical and reporting solutions that lev-
erage data to gain customer insights through analysis
of customer profiles and their sales interactions to cal-
culate customer value. This enables better customer
segmentation and insights for value-added services.
QUESTIONS FOR DISCUSSION
1. What is traditional reporting? How is it used in
organizations?
2. How can analytics be used to transform tradi-
tional reporting?
3. How can interactive reporting assist organiza-
tions in decision making?
What We Can Learn from This Application
Case
This Application Case shows that organizations
that earlier used reporting only for tracking their
internal business activities and meeting compliance
requirements set out by the government are now
moving toward generating actionable intelligence
from their transactional business data. Reporting
has become broader as organizations are now try-
ing to analyze archived transactional data to under-
stand underlying hidden trends and patterns that
would enable them to make better decisions by
gaining insights into problematic areas and resolv-
ing them to pursue current and future market
opportunities. Reporting has advanced to interac-
tive online reports that enable users to pull and
quickly build custom reports as required and even
present the reports aided by visualization tools
that have the ability to connect to the database,
providing the capabilities of digging deep into
summarized data.
Source: Teradata.com, “Sabre Airline Solutions,” teradata.com/t/
case-studies/Sabre-Airline-Solutions-EB6281 (accessed
February 2013).
Which two passengers should be given priority? You are given information about customers’
profiles and their relationship with the airline. Your decisions might change as you learn more about those customers’ profiles.
Watch the video, pause it as appropriate, and answer the questions on which pas-
sengers should be given priority. Then resume the video to get more information. After
the video is complete, you can see the slides related to this video and how the analysis
was prepared on a slide set at teradatauniversitynetwork.com/templates/Download.
aspx?ContentItemId=891. Please note that access to this content requires initial registration.
This multimedia excursion provides an example of how additional information made
available through an enterprise data warehouse can assist in decision making.
The DSS-BI Connection
By now, you should be able to see some of the similarities and differences between DSS
and BI. First, their architectures are very similar because BI evolved from DSS. However,
BI implies the use of a data warehouse, whereas DSS may or may not have such a feature.
BI is, therefore, more appropriate for large organizations (because data warehouses are
expensive to build and maintain), but DSS can be appropriate for any type of organization.
Second, most DSS are constructed to directly support specific decision making. BI
systems, in general, are geared to provide accurate and timely information, and they
support decision making indirectly. This situation is changing, however, as more and more
decision support tools are being added to BI software packages.
Third, BI has an executive and strategy orientation, especially in its BPM and dash-
board components. DSS, in contrast, is oriented toward analysts.
Fourth, most BI systems are constructed with commercially available tools and com-
ponents that are fitted to the needs of organizations. In building DSS, the interest may
be in constructing solutions to very unstructured problems. In such situations, more pro-
gramming (e.g., using tools such as Excel) may be needed to customize the solutions.
Fifth, DSS methodologies and even some tools were developed mostly in the aca-
demic world. BI methodologies and tools were developed mostly by software companies.
(See Zaman, 2005, for information on how BI has evolved.)
Sixth, many of the tools that BI uses are also considered DSS tools. For example,
data mining and predictive analysis are core tools in both areas.
Although some people equate DSS with BI, these systems are not, at present, the
same. It is interesting to note that some people believe that DSS is a part of BI, one of its
analytical tools. Others think that BI is a special case of DSS that deals mostly with report-
ing, communication, and collaboration (a form of data-oriented DSS). Another explana-
tion (Watson, 2005) is that BI is the result of a continuous revolution and, as such, DSS is
one of BI’s original elements. In this book, we separate DSS from BI. However, we point
to the DSS-BI connection frequently. Further, as noted from the next section onward, in
many circles BI has been subsumed by the new term analytics or data science.
SECTION 1.7 REVIEW QUESTIONS
1. Define BI.
2. List and describe the major components of BI.
3. What are the major similarities and differences of DSS and BI?
1.8 BUSINESS ANALYTICS OVERVIEW
The word “analytics” has replaced the previous individual components of computerized
decision support technologies that have been available under various labels in the past.
Indeed, many practitioners and academics now use the word analytics in place of BI.
Although many authors and consultants have defined it slightly differently, one can view
analytics as the process of developing actionable decisions or recommendations for actions
based on insights generated from historical data. The Institute for Operations Research
and the Management Sciences (INFORMS) has created a major initiative to organize and pro-
mote analytics. According to INFORMS, analytics represents the combination of computer
technology, management science techniques, and statistics to solve real problems. Of
course, many other organizations have proposed their own interpretations and motivations
for analytics. For example, SAS Institute Inc. proposed eight levels of analytics that begin
with standardized reports from a computer system. These reports essentially provide a
sense of what is happening with an organization. Additional technologies have enabled
us to create more customized reports that can be generated on an ad hoc basis. The next
extension of reporting takes us to online analytical processing (OLAP)-type queries that
allow a user to dig deeper and determine the specific sources of concern or opportunities.
Technologies available today can also automatically issue alerts for a decision maker
when performance issues warrant such alerts. At a consumer level, we see such alerts for
weather or other issues. But similar alerts can also be generated in specific settings when
sales rise above or fall below a certain level within a certain time period or when the inventory
for a specific product is running low. All of these applications are made possible through
analysis and queries on data being collected by an organization. The next level of analysis
might entail statistical analysis to better understand patterns. These can then be taken a
step further to develop forecasts or models for predicting how customers might respond to
a specific marketing campaign or ongoing service/product offerings. When an organization
has a good view of what is happening and what is likely to happen, it can also employ
other techniques to make the best decisions under the circumstances. These eight levels of
analytics are described in more detail in a white paper by SAS (sas.com/news/sascom/analytics_levels).

FIGURE 1.5 Three Types of Analytics. [The figure presents the three types in two views: as successive steps of a ladder and as three interconnected circles. Descriptive comprises reporting, visualization, periodic and ad hoc reporting, and trend analysis; Predictive comprises statistical analysis and data mining; Prescriptive comprises management science models and solution.]
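To make the alerting idea discussed above concrete, the following is a minimal sketch of a rule-based alert; the sales bands, reorder points, and notify() stub are all hypothetical.

# Minimal, hypothetical sketch of threshold-based alerting as described above.
# Sales bands, reorder points, and the notify() stub are invented for illustration.
daily_sales = {"widgets": 18_500, "gadgets": 2_100}       # today's sales in $
inventory   = {"widgets": 40, "gadgets": 900}             # units on hand

SALES_BANDS   = {"widgets": (5_000, 15_000), "gadgets": (1_000, 10_000)}
REORDER_POINT = {"widgets": 100, "gadgets": 250}

def notify(message: str) -> None:
    # Stand-in for an e-mail, dashboard, or pager integration.
    print("ALERT:", message)

for product, amount in daily_sales.items():
    low, high = SALES_BANDS[product]
    if amount > high:
        notify(f"{product}: sales ${amount:,} above expected band (> ${high:,})")
    elif amount < low:
        notify(f"{product}: sales ${amount:,} below expected band (< ${low:,})")

for product, units in inventory.items():
    if units < REORDER_POINT[product]:
        notify(f"{product}: inventory {units} units below reorder point")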
This idea of looking at all the data to understand what is happening, what will
happen, and how to make the best of it has also been encapsulated by INFORMS in
proposing three levels of analytics. These three levels are identified (informs.org/
Community/Analytics) as descriptive, predictive, and prescriptive. Figure 1.5 presents
two graphical views of these three levels of analytics. One view suggests that these three
are somewhat independent steps (of a ladder) and one type of analytics application leads
to another. The interconnected circles view suggests that there is actually some overlap
across these three types of analytics. In either case, the interconnected nature of different
types of analytics applications is evident. We next introduce these three levels of analytics.
Descriptive Analytics
Descriptive or reporting analytics refers to knowing what is happening in the
organization and understanding underlying trends and causes of such occurrences.
This involves, first of all, consolidation of data sources and availability of
all relevant data in a form that enables appropriate reporting and analysis. Usually
the development of this data infrastructure is part of data warehouses, which we study in
Chapter 3. From this data infrastructure we can develop appropriate reports, queries,
alerts, and trends using various reporting tools and techniques. We study these in
Chapter 4.
A significant technology that has become a key player in this area is visualization.
Using the latest visualization tools in the marketplace, we can now develop powerful
insights into the operations of our organization. Application Cases 1.2 and 1.3 highlight
some such applications in the healthcare domain. Color renderings of such applications
are available on the companion Web site and also on Tableau’s Web site. Chapter 4
covers visualization in more detail.
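As a small illustration of the consolidation-and-reporting step just described, the following sketch pulls two hypothetical data sources together and produces a simple descriptive report with pandas; all data and column names are invented.

# Hypothetical sketch: consolidate two data sources and produce a simple
# descriptive report (sales by region and month). All data is invented.
import pandas as pd

online = pd.DataFrame({
    "date":   pd.to_datetime(["2013-01-05", "2013-01-20", "2013-02-03"]),
    "region": ["East", "West", "East"],
    "amount": [1200.0, 800.0, 950.0],
})
in_store = pd.DataFrame({
    "date":   pd.to_datetime(["2013-01-11", "2013-02-14"]),
    "region": ["West", "West"],
    "amount": [400.0, 1500.0],
})

# Consolidation step: one table, one schema.
sales = pd.concat([online, in_store], ignore_index=True)
sales["month"] = sales["date"].dt.to_period("M")

# Descriptive report: totals by region and month.
report = sales.pivot_table(index="region", columns="month",
                           values="amount", aggfunc="sum", fill_value=0)
print(report)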
Application Case 1.2
Eliminating Inefficiencies at Seattle Children’s Hospital
Seattle Children’s was the seventh highest ranked
children’s hospital in 2011, according to U.S. News
& World Report. For any organization that is com-
mitted to saving lives, identifying and removing the
inefficiencies from systems and processes so that
more resources become available to cater to patient
care become very important. At Seattle Children’s,
management is continuously looking for new ways
to improve the quality, safety, and processes from
the time a patient is admitted to the time they are
discharged. To this end, they spend a lot of time
analyzing the data associated with patient visits.
To quickly turn patient and hospital data into
insights, Seattle Children’s implemented Tableau
Software’s business intelligence application. It pro-
vides browser-based, easy-to-use analytics to the
stakeholders; this makes it intuitive for individuals to
create visualizations and to understand what the data
has to offer. The data analysts, business managers,
and financial analysts as well as clinicians, doctors,
and researchers are all using descriptive analytics
to solve different problems in a much faster way.
They are developing visual systems on their own,
resulting in dashboards and scorecards that help
in defining the standards, the current performance
achieved measured against the standards, and how
these systems will grow into the future. Through the
use of monthly and daily dashboards, day-to-day
decision making at Seattle Children’s has improved
significantly.
Seattle Children’s measures patient wait-times
and analyzes them with the help of visualizations
to discover the root causes and contributing factors
for patient waiting. They found that early delays
cascaded during the day. They focused on on-time
appointments for patient services as one of the solu-
tions to improving overall patient waiting time and
increasing the availability of beds. Seattle Children’s
saved about $3 million from the supply chain, and
with the help of tools like Tableau, they are find-
ing new ways to increase savings while treating as
many patients as possible by making the existing
processes more efficient.
QUESTIONS FOR DISCUSSION
1. Who are the users of the tool?
2. What is a dashboard?
3. How does visualization help in decision making?
4. What are the significant results achieved by the
use of Tableau?
What We Can Learn from This Application
Case
This Application Case shows that reporting analyt-
ics involving visualizations such as dashboards can
offer major insights into existing data and show how
a variety of users in different domains and depart-
ments can contribute toward process and qual-
ity improvements in an organization. Furthermore,
exploring the data visually can help in identifying
the root causes of problems and provide a basis for
working toward possible solutions.
Source: Tableausoftware.com, “Eliminating Waste at Seattle
Children’s,” tableausoftware.com/eliminating-waste-at-seattle-
childrens (accessed February 2013).
Application Case 1.3
Analysis at the Speed of Thought
Kaleida Health, the largest healthcare provider in
western New York, has more than 10,000 employ-
ees, five hospitals, a number of clinics and nursing
homes, and a visiting-nurse association that deals
with millions of patient records. Kaleida’s traditional
reporting tools were inadequate to handle the grow-
ing data, and they were faced with the challenge of
finding a business intelligence tool that could handle
large data sets effortlessly, quickly, and with a much
deeper analytic capability.
At Kaleida, many of the calculations are now
done in Tableau, primarily pulling the data from
Oracle databases into Excel and importing the
data into Tableau. For many of the monthly ana-
lytic reports, data is directly extracted into Tableau
from the data warehouse; many of the data queries
are saved and rerun, resulting in time savings when
dealing with millions of records, each having more
than 40 fields per record. Besides speed, Kaleida
also uses Tableau to merge different tables for gen-
erating extracts.
Using Tableau, Kaleida can analyze emergency
room data to determine the number of patients who
visit more than 10 times a year. The data often reveal
that people frequently use emergency room and
ambulance services inappropriately for stomach-
aches, headaches, and fevers. Kaleida can manage
resource utilization (the use and cost of supplies),
which will ultimately lead to efficiency and standard-
ization of supplies management across the system.
Kaleida now has its own business intelligence
department and uses Tableau to compare itself to
other hospitals across the country. Comparisons are
made on various aspects, such as length of patient
stay, hospital practices, market share, and partner-
ships with doctors.
QUESTIONS FOR DISCUSSION
1. What are the desired functionalities of a report-
ing tool?
2. What advantages were derived by using a report-
ing tool in the case?
What We Can Learn from This Application
Case
Correct selection of a reporting tool is extremely
important, especially if an organization wants to
derive value from reporting. The generated reports
and visualizations should be easily discernible; they
should help people in different sectors make sense
out of the reports, identify the problematic areas,
and contribute toward improving them. Many future
organizations will require reporting analytic tools
that are fast and capable of handling huge amounts
of data efficiently to generate desired reports with-
out the need for third-party consultants and service
providers. A truly useful reporting tool can exempt
organizations from unnecessary expenditure.
Source: Tableausoftware.com, “Kaleida Health Finds Efficiencies,
Stays Competitive,” tableausoftware.com/learn/stories/user-
experience-speed-thought-kaleida-health (accessed February
2013).
Predictive Analytics
Predictive analytics aims to determine what is likely to happen in the future. This analy-
sis is based on statistical techniques as well as other more recently developed techniques
that fall under the general category of data mining. The goal of these techniques is to be
able to predict whether the customer is likely to switch to a competitor (“churn”), what the cus-
tomer is likely to buy next and how much, what promotion a customer would respond
to, or whether this customer is a good credit risk. A number of techniques are used in
developing predictive analytics applications, including various classification algorithms.
For example, as described in Chapters 5 and 6, we can use classification techniques such
as decision tree models and neural networks to predict how well a motion picture will
do at the box office. We can also use clustering algorithms for segmenting customers
into different clusters to be able to target specific promotions to them. Finally, we can
use association mining techniques to estimate relationships between different purchasing
behaviors. That is, if a customer buys one product, what else is the customer likely to pur-
chase? Such analysis can assist a retailer in recommending or promoting related products.
For example, any product search on Amazon.com results in the retailer also suggesting
other similar products that may interest the customer. We will study these techniques and
their applications in Chapters 6 through 9. Application Cases 1.4 and 1.5 highlight some
similar applications. Application Case 1.4 introduces a movie you may have heard of:
Moneyball. It is perhaps one of the best examples of applications of predictive analytics
in sports.
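To ground the churn example above, here is a minimal sketch of training a decision tree classifier on historical customer records; the features, records, and library choice (scikit-learn) are all assumptions made for illustration.

# Hypothetical churn-prediction sketch using a decision tree classifier.
# Features, data, and labels are invented; real models use far more history.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each row: [months as customer, support calls last quarter, monthly spend $]
X = [
    [24, 0, 80], [3, 5, 20], [36, 1, 120], [2, 4, 25],
    [18, 0, 60], [5, 6, 30], [48, 2, 150], [1, 3, 15],
]
y = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = churned, 0 = stayed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Score a new customer: the churn probability drives the retention offer.
new_customer = [[4, 5, 22]]
print("churn probability:", model.predict_proba(new_customer)[0][1])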
Application Case 1.4
Moneyball: Analytics in Sports and Movies
Moneyball, a biographical sports drama film, was
released in 2011 and directed by Bennett Miller. The
film was based on Michael Lewis’s book, Moneyball.
The movie gave a detailed account of the Oakland
Athletics baseball team during the 2002 season and
the Oakland general manager’s efforts to assemble a
competitive team.
The Oakland Athletics suffered a big loss to the
New York Yankees in the 2001 postseason. As a result,
Oakland lost many of its star players to free agency
and ended up with a weak team with unfavorable
financial prospects. The general manager’s efforts to
reassemble a competitive team were stymied because
Oakland had a limited payroll. The scouts for the
Oakland Athletics followed the old baseball custom
of making subjective decisions when selecting the
team members. The general manager then met a
young computer whiz with an economics degree
from Yale and decided to appoint
him as the new assistant general manager.
The assistant general manager had a deep pas-
sion for baseball and had the expertise to crunch
the numbers for the game. His love for the game
made him develop a radical way of understanding
baseball statistics. He was a disciple of Bill James, a
marginal figure who offered rationalized techniques
to analyze baseball. James looked at baseball statis-
tics in a different way, crunching the numbers purely
on facts and eliminating subjectivity. James pio-
neered the nontraditional analysis method called the
Sabermetric approach, whose name derives from SABR,
the Society for American Baseball Research.
The assistant general manager followed the
Sabermetric approach by building a prediction
model to help the Oakland Athletics select play-
ers based on their “on-base percentage” (OBP), a
statistic that measured how often a batter reached
base for any reason other than fielding error, field-
er’s choice, dropped/uncaught third strike, fielder’s
obstruction, or catcher’s interference. Rather than
relying on the scouts’ experience and intuition, the
assistant general manager selected players based
almost exclusively on OBP.
Spoiler Alert: The new team beat all odds, won
20 consecutive games, and set an American League
record.
QUESTIONS FOR DISCUSSION
1. How is predictive analytics applied in Moneyball?
2. What is the difference between objective and
subjective approaches in decision making?
What We Can Learn from This Application
Case
Analytics finds its use in a variety of industries. It
helps organizations rethink their traditional prob-
lem-solving abilities, which are most often subjec-
tive, relying on the same old processes to find a
solution. Analytics takes the radical approach of
using historical data to find fact-based solutions
that will remain appropriate for making even future
decisions.
Source: Wikipedia, “On-Base Percentage,” en.wikipedia.org/
wiki/On_base_percentage (accessed January 2013); Wikipedia,
“Sabermetrics,” en.wikipedia.org/wiki/Sabermetrics (accessed
January 2013).
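As a small worked illustration of the OBP statistic the case describes, the sketch below computes OBP from standard counting statistics, using the conventional formula OBP = (H + BB + HBP) / (AB + BB + HBP + SF), and ranks hypothetical players by it; all player data is invented.

# Compute on-base percentage (OBP) and rank hypothetical players by it.
# OBP = (hits + walks + hit-by-pitch) / (at-bats + walks + hit-by-pitch + sacrifice flies)
players = {
    # name: (H, BB, HBP, AB, SF) -- all numbers invented
    "Player A": (150, 70, 5, 500, 4),
    "Player B": (170, 30, 2, 520, 6),
    "Player C": (120, 95, 8, 450, 3),
}

def obp(h: int, bb: int, hbp: int, ab: int, sf: int) -> float:
    return (h + bb + hbp) / (ab + bb + hbp + sf)

ranked = sorted(players.items(), key=lambda kv: obp(*kv[1]), reverse=True)
for name, stats in ranked:
    print(f"{name}: OBP = {obp(*stats):.3f}")

Note how Player C, despite the fewest hits, ranks first on OBP; this is exactly the kind of counterintuitive, fact-based ranking the case attributes to the Sabermetric approach.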
Application Case 1.5
Analyzing Athletic Injuries
Any athletic activity is prone to injuries. If the inju-
ries are not handled properly, then the team suf-
fers. Using analytics to understand injuries can help
in deriving valuable insights that would enable
the coaches and team doctors to manage the team
composition, understand player profiles, and ulti-
mately aid in better decision making concerning
which players might be available to play at any
given time.
In an exploratory study, Oklahoma State
University analyzed American football-related sport
injuries by using reporting and predictive analytics.
The project followed the CRISP-DM methodol-
ogy to understand the problem of making recom-
mendations on managing injuries, understanding
the various data elements collected about injuries,
cleaning the data, developing visualizations to draw
various inferences, building predictive models to
analyze the injury healing time period, and drawing
sequence rules to predict the relationships among the
injuries and the various body parts afflicted with
injuries.
The injury data set consisted of more than
560 football injury records, which were categorized
into injury-specific variables (body part/site/later-
ality, action taken, severity, injury type, injury start
and healing dates) and player/sport-specific varia-
bles (player ID, position played, activity, onset, and
game location). Healing time was calculated for each
record, which was classified into different sets of
time periods: 0-1 month, 1-2 months, 2-4 months,
4-6 months, and 6-24 months.
Various visualizations were built to draw
inferences from injury data set information depict-
ing the healing time period associated with players’
positions, severity of injuries and the healing time
period, treatment offered and the associated healing
time period, major injuries afflicting body parts, and
so forth.
Neural network models were built to pre-
dict each of the healing categories using IBM SPSS
Modeler. Some of the predictor variables were cur-
rent status of injury, severity, body part, body site,
type of injury, activity, event location, action taken,
and position played. The success of classifying the
healing category was quite good: Accuracy was 79.6
percent. Based on the analysis, many business rec-
ommendations were suggested, including employ-
ing more specialists’ input from injury onset instead
of letting the training room staff screen the injured
players; training players at defensive positions to
avoid being injured; and holding practices to thor-
oughly check safety mechanisms.
QUESTIONS FOR DISCUSSION
1. What types of analytics are applied in the injury
analysis?
2. How do visualizations aid in understanding the
data and delivering insights into the data?
3. What is a classification problem?
4. What can be derived by performing sequence
analysis?
What We Can Learn from This Application
Case
For any analytics project, it is always important
to understand the business domain and the cur-
rent state of the business problem through exten-
sive analysis of the only resource: historical data.
Visualizations often provide a great tool for gaining
initial insights into data, which can be further
refined based on expert opinions to identify the rela-
tive importance of the data elements related to the
problem. Visualizations also aid in generating ideas
for obscure business problems, which can be pur-
sued in building predictive models that could help
organizations in decision making.
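The kind of neural-network classification the case describes can be sketched as follows. Note that the study used IBM SPSS Modeler; this illustrative stand-in uses scikit-learn instead, with invented, numerically encoded features rather than the study's actual data.

# Hypothetical sketch of neural-network classification of healing-time
# categories, analogous to (but not the same as) the case's SPSS Modeler work.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Each row: [severity (1-5), body-part code, injury-type code, position code]
X = [
    [1, 0, 2, 1], [4, 3, 1, 0], [2, 1, 0, 2], [5, 3, 1, 1],
    [1, 2, 2, 0], [3, 0, 1, 2], [4, 1, 0, 1], [2, 2, 2, 0],
]
# Healing-time category: 0 = "0-1 month", 1 = "1-2 months", 2 = "2-4 months"
y = [0, 2, 0, 2, 0, 1, 2, 1]

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
model.fit(X, y)
print("predicted healing category:", model.predict([[3, 1, 1, 2]])[0])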
Prescriptive Analytics
The third category of analytics is termed prescriptive analytics. The goal of prescriptive
analytics is to recognize what is going on as well as the likely forecast and make decisions
to achieve the best performance possible. This group of techniques has historically been
studied under the umbrella of operations research or management science and has gen-
erally been aimed at optimizing the performance of a system. The goal here is to provide
a decision or a recommendation for a specific action. These recommendations can take
the form of a specific yes/no decision for a problem, a specific amount (say, the price for a
specific item or the airfare to charge), or a complete set of production plans. The decisions
may be presented to a decision maker in a report or may directly be used in an automated
decision rules system (e.g., in airline pricing systems). Thus, these types of analytics can
also be termed decision or normative analytics. Application Case 1.6 gives an example
of such prescriptive analytics applications. We will learn about some of these techniques
and several additional applications in Chapters 10 through 12.
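To make the optimization idea concrete, here is a minimal sketch of a small linear program of the kind studied in management science, solved with SciPy; the products, per-unit profits, and resource limits are invented.

# Hypothetical production-planning sketch: choose quantities of two products
# to maximize profit subject to labor and material limits (a linear program).
from scipy.optimize import linprog

# Profit per unit: product A = $40, product B = $30.
# linprog minimizes, so negate the objective to maximize.
c = [-40, -30]

# Resource usage per unit:          A    B        limit
A_ub = [[2, 1],    # labor hours:   2    1    <=  100
        [1, 2]]    # material kg:   1    2    <=   80
b_ub = [100, 80]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
qty_a, qty_b = res.x
print(f"make {qty_a:.1f} of A and {qty_b:.1f} of B, profit ${-res.fun:.0f}")

The solver recommends a specific production plan (here, 40 units of A and 20 of B for $2,200), which is exactly the "complete set of production plans" kind of output described above.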
Application Case 1.6
Industrial and Commercial Bank of China (ICBC) Employs Models
to Reconfigure Its Branch Network
The Industrial and Commercial Bank of China
(ICBC) has more than 16,000 branches and serves
over 230 million individual customers and 3.6 mil-
lion corporate clients. Its daily financial transactions
total about $180 million. It is also the largest pub-
licly traded bank in the world in terms of market
capitalization, deposit volume, and profitability. To
stay competitive and increase profitability, ICBC was
faced with the challenge of quickly adapting to the fast-
paced economic growth, urbanization, and increase
in personal wealth of the Chinese. Changes had to be
implemented in over 300 cities with high variability
in customer behavior and financial status. Obviously,
the nature of the challenges in such a huge economy
meant that a large-scale optimization solution had to
be developed to locate branches in the right places,
with the right services, to serve the right customers.
With their existing method, ICBC used to decide
where to open new branches through a scoring model
in which different variables with varying weights were
used as inputs. Some of the variables were customer
flow, number of residential households, and number
of competitors in the intended geographic region. This
method was deficient in determining the customer dis-
tribution of a geographic area. The existing method
was also unable to optimize the distribution of bank
branches in the branch network. With support from
IBM, a branch reconfiguration (BR) tool was devel-
oped. Inputs for the BR system are in three parts:
a. Geographic data with 83 different categories
b. Demographic and economic data with 22 dif-
ferent categories
c. Branch transactions and performance data that
consisted of more than 60 million transaction
records each day
These three inputs helped generate accurate cus-
tomer distribution for each area and, hence, helped
the bank optimize its branch network. The BR system
consisted of a market potential calculation model, a
branch network optimization model, and a branch
site evaluation model. In the market potential model,
customer volume and value are measured based
on input data and expert knowledge. For instance,
expert knowledge would help determine whether per-
sonal income should be weighted more than gross
domestic product (GDP). The geographic areas are
also demarcated into cells, and the preference of one
cell over another is determined. In the branch net-
work optimization model, mixed integer program-
ming is used to locate branches in candidate cells
so that they cover the largest market potential areas.
In the branch site evaluation model, the value of
establishing bank branches at specific locations is
determined.
Since 2006, the BR system has been improved
through an iterative development process. ICBC’s
branch reconfiguration tool has increased deposits
by $21.2 billion since its inception. This increase
in deposits is because the bank can now reach
more customers with the right services by use of
its optimization tool. In a specific example, when
BR was implemented in Suzhou in 2010, deposits
increased to $13.67 billion from an initial level of
$7.56 billion in 2007. Hence, the BR tool assisted
in an increase of deposits to the tune of $6.11
billion between 2007 and 2010. This project was
selected as a finalist in the Edelman Competition
2011, which is run by INFORMS to promote actual
applications of management science/operations
research models.
(Continued)
Application Case 1.6 (Continued)
QUESTIONS FOR DISCUSSION
1. How can analytical techniques help organiza-
tions to retain competitive advantage?
2. How can descriptive and predictive analytics
help in pursuing prescriptive analytics?
3. What kinds of prescriptive analytic techniques
are employed in the case study?
4. Are the prescriptive models, once built, good
forever?
What We Can Learn from This Application
Case
Many organizations in the world are now embrac-
ing analytical techniques to stay competitive
and achieve growth. Many organizations provide
consulting solutions to businesses in employ-
ing prescriptive analytical solutions. It is equally
important to have proactive decision makers in
organizations who are aware of the changing eco-
nomic environment as well as the advancements
in the field of analytics to ensure that appropriate
models are employed. This case shows an example
of geographic market segmentation and customer
behavioral segmentation techniques used to isolate the
profitability of customers and employ optimization
techniques to locate the branches that deliver high
profitability in each geographic segment.
Source: X. Wang et al., “Branch Reconfiguration Practice Through
Operations Research in Industrial and Commercial Bank of China,”
Interfaces, January/February 2012, Vol. 42, No. 1, pp. 33-44; DOI:
10.1287/inte.1110.0614.
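The mixed integer programming step the case describes can be illustrated with a toy coverage model: open a limited number of candidate branch sites so that the covered cells' total market potential is maximized. Everything below (the candidate sites, coverage sets, potentials, and the SciPy formulation) is a hypothetical sketch, not ICBC's actual model.

# Toy branch-coverage model solved as a mixed integer program with SciPy.
# Variables: x_j = 1 if candidate site j is opened; y_c = 1 if cell c is covered.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

potential = np.array([5.0, 3.0, 4.0, 2.0, 6.0])   # market potential per cell (invented)
covers = [{0, 1}, {1, 2, 3}, {3, 4}]              # cells each candidate site covers
max_branches = 2

n_sites, n_cells = len(covers), len(potential)
n = n_sites + n_cells                             # x variables first, then y variables

# Maximize covered potential == minimize its negative.
c = np.concatenate([np.zeros(n_sites), -potential])

rows, lb, ub = [], [], []
# y_c - sum_{j covers c} x_j <= 0: a cell counts only if some open site covers it.
for cell in range(n_cells):
    row = np.zeros(n)
    row[n_sites + cell] = 1.0
    for j, cov in enumerate(covers):
        if cell in cov:
            row[j] = -1.0
    rows.append(row); lb.append(-np.inf); ub.append(0.0)
# Budget: at most max_branches sites may be opened.
row = np.zeros(n); row[:n_sites] = 1.0
rows.append(row); lb.append(-np.inf); ub.append(max_branches)

res = milp(c,
           constraints=LinearConstraint(np.array(rows), lb, ub),
           integrality=np.ones(n),                # all variables integer
           bounds=Bounds(0, 1))                   # ...and binary via 0/1 bounds
open_sites = [j for j in range(n_sites) if res.x[j] > 0.5]
print("open sites:", open_sites, "covered potential:", -res.fun)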
Analytics Applied to Different Domains
Applications of analytics in various industry sectors have spawned many related areas or
at least buzzwords. It is almost fashionable to attach the word analytics to any specific
industry or type of data. Besides the general category of text analytics (aimed at getting
value out of text, to be studied in Chapter 6) and Web analytics (analyzing Web data
streams, Chapter 7), many industry- or problem-specific analytics professions/streams
have emerged. Examples of such areas are marketing analytics, retail analytics, fraud ana-
lytics, transportation analytics, health analytics, sports analytics, talent analytics, behav-
ioral analytics, and so forth. For example, Application Case 1.1 could also be termed
a case study in airline analytics. Application Cases 1.2 and 1.3 would belong to health
analytics; Application Cases 1.4 and 1.5 to sports analytics; Application Case 1.6 to bank
analytics; and Application Case 1.7 to retail analytics. The End-of-Chapter Application
Case could be termed insurance analytics. Literally, any systematic analysis of data in a
specific sector is being labeled as “(fill-in-the-blank)” analytics. Although this may result in
overselling the concept of analytics, the benefit is that more people in specific industries
are aware of the power and potential of analytics. It also provides a focus to professionals
developing and applying the concepts of analytics in a vertical sector. Although many of
the techniques used to develop analytics applications may be common, there are unique issues
within each vertical segment that influence how the data may be collected, processed,
and analyzed and how the applications are implemented. Thus, the differentiation of analytics based
on a vertical focus is good for the overall growth of the discipline.
Analytics or Data Science?
Even as the concept of analytics is becoming popular in industry and academic circles,
another term has already been introduced and is gaining currency. The new term is data
science. Thus, the practitioners of data science are data scientists. Mr. D. J. Patil of LinkedIn
is sometimes credited with creating the term data science. There have been some attempts
to describe the differences between data analysts and data scientists (e.g., see the study at
emc.com/collateral/about/news/emc-data-science-study-wp). One view is that
data analyst is just another term for professionals who were doing business intelligence in
the form of data compilation, cleaning, reporting, and perhaps some visualization. Their
skill sets included Excel, some SQL knowledge, and reporting. A reader of Section 1.8
would recognize that as descriptive or reporting analytics. In contrast, a data scientist is
responsible for predictive analysis, statistical analysis, and more advanced analytical tools
and algorithms. They may have a deeper knowledge of algorithms and may recognize
them under various labels, such as data mining, knowledge discovery, and machine learning.
Some of these professionals may also need deeper programming knowledge to
be able to write code for data cleaning and analysis in current Web-oriented languages
such as Java and Python. Again, our readers should recognize these as falling under the
predictive and prescriptive analytics umbrella. Our view is that the distinction between
analytics and data science is more a matter of degree of technical knowledge and skill sets than
of function. It may also be more of a distinction across disciplines. Computer science,
statistics, and applied mathematics programs appear to prefer the data science label,
reserving the analytics label for more business-oriented professionals. As another example
of this, applied physics professionals have proposed using network science as the term
for describing analytics that relate to groups of people (social networks, supply chain
networks, and so forth). See barabasilab.neu.edu/networksciencebook/downlPDF.
html for an evolving textbook on this topic.
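As a small illustration of the data-cleaning work mentioned above, here is a hedged sketch using Python and pandas; the messy records and cleaning rules are invented for the example.

# Hypothetical data-cleaning sketch: normalize text fields, fix types,
# drop duplicates, and fill missing values. All records are invented.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["  Alice ", "BOB", "alice", None, "Carol"],
    "amount":   ["100", "250.5", "100", "75", "not available"],
})

clean = raw.copy()
clean["customer"] = clean["customer"].str.strip().str.title()   # "  Alice " -> "Alice"
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # bad text -> NaN
clean = clean.dropna(subset=["customer"])      # rows with no usable key are dropped
clean = clean.drop_duplicates()                # exact repeats add no information
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
print(clean)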
Aside from a clear difference in the skill sets of professionals who only have to do
descriptive/reporting analytics versus those who engage in all three types of analytics, the
distinction between the two labels is fuzzy, at best. We observe that graduates of our
analytics programs tend to be responsible for tasks more in line with data science profes-
sionals (as defined by some circles) than with just reporting analytics. This book is clearly aimed
at introducing the capabilities and functionality of all analytics (which includes data sci-
ence), not just reporting analytics. From now on, we will use these terms interchangeably.
SECTION 1.8 REVIEW QUESTIONS
1. Define analytics.
2. What is descriptive analytics? What various tools are employed in descriptive analytics?
3. How is descriptive analytics different from traditional reporting?
4. What is a data warehouse? How can data warehousing technology help in ena-
bling analytics?
5. What is predictive analytics? How can organizations employ predictive analytics?
6. What is prescriptive analytics? What kinds of problems can be solved by prescrip-
tive analytics?
7. Define modeling from the analytics perspective.
8. Is it a good idea to follow a hierarchy of descriptive and predictive analytics before
applying prescriptive analytics?
9. How can analytics aid in objective decision making?
1.9 BRIEF INTRODUCTION TO BIG DATA ANALYTICS
What Is Big Data?
Our brains work extremely quickly and are efficient and versatile in processing large
amounts of all kinds of data: images, text, sounds, smells, and video. We process all of
these different forms of data relatively easily. Computers, on the other hand, are still finding it
hard to keep up with the pace at which data is generated, let alone analyze it quickly.
This is why we have the problem of Big Data. So what is Big Data? Simply put, it is data that cannot
be stored in a single storage unit. Big Data typically refers to data that is arriving in
many different forms, be they structured, unstructured, or in a stream. Major sources
of such data are clickstreams from Web sites, postings on social media sites such as
Facebook, or data from traffic, sensors, or weather. A Web search engine like Google
needs to search and index billions of Web pages in order to give you relevant search
results in a fraction of a second. Although this is not done in real time, generating an
index of all the Web pages on the Internet is not an easy task. Luckily for Google, it
was able to solve this problem. Among other tools, it has employed Big Data analytical
techniques.
There are two aspects to managing data on this scale: storing and processing. If we
could purchase an extremely expensive storage solution to store all the data at one place
on one unit, making this unit fault tolerant would involve major expense. An ingenious
solution was proposed that involved storing this data in chunks on different machines
connected by a network, putting a copy or two of each chunk in different locations on
the network, both logically and physically. It was originally used at Google (then called
the Google File System) and was later developed and released as an Apache project as the Hadoop
Distributed File System (HDFS).
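The chunk-and-replicate idea behind HDFS can be sketched as follows; this is purely illustrative (a round-robin placement over invented node names), not real HDFS code or its actual placement policy.

# Illustrative sketch of HDFS-style chunking and replication (not real HDFS code):
# split data into fixed-size chunks and place each chunk on several distinct nodes.
def place_chunks(data: bytes, chunk_size: int, nodes: list[str], replicas: int = 3):
    """Return {chunk_index: [nodes holding a copy]} using round-robin placement."""
    assert replicas <= len(nodes), "need at least as many nodes as replicas"
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {}
    for idx in range(len(chunks)):
        # Chunk idx goes to `replicas` consecutive nodes in the ring.
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
    return placement

nodes = ["node-a", "node-b", "node-c", "node-d"]
plan = place_chunks(b"x" * 1000, chunk_size=256, nodes=nodes, replicas=3)
for idx, holders in plan.items():
    print(f"chunk {idx}: {holders}")
# Losing any single node still leaves two copies of every chunk.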
However, storing this data is only half the problem. Data is worthless if it does
not provide business value, and for it to provide business value, it has to be analyzed.
How are such vast amounts of data analyzed? Moving all of the data to one powerful
computer for processing does not work; at this scale it would create a huge overhead on
even the most powerful computer. So another ingenious solution was proposed: Push the
computation to the data, instead of pushing the data to a computing node. This was a new
paradigm, and it gave rise to a whole new way of processing data. This is what we know
today as the MapReduce programming paradigm, which made processing Big Data a
reality. MapReduce was originally developed at Google, and a subsequent version was
released by the Apache project called Hadoop MapReduce.
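The canonical teaching example of the MapReduce paradigm is word counting. Below is a minimal in-memory imitation of the map, shuffle, and reduce phases; it is illustrative only and does not use Hadoop's actual API.

# In-memory imitation of MapReduce word counting (illustrative, not Hadoop's API).
from collections import defaultdict

documents = ["big data needs big ideas", "data beats opinions"]

# Map phase: each document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key. In a real cluster this is the step that
# moves data between machines so all pairs for a key land on the same reducer.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine each key's values into a final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g., {'big': 2, 'data': 2, ...}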
Today, when we talk about storing, processing, or analyzing Big Data, HDFS and
MapReduce are involved at some level. Other relevant standards and software solutions
have been proposed. Although the major toolkit is available as open source, several
companies have been launched to provide training or specialized analytical hardware or
software services in this space. Some examples are HortonWorks, Cloudera, and Teradata
Aster.
Over the past few years, what is called Big Data has changed more and more as Big
Data applications have appeared. The need to process data coming in at a rapid rate added
velocity to the equation. One example of fast data processing is algorithmic trading,
the use of electronic platforms based on algorithms for trading shares on the financial
market, which operates on the order of microseconds. The need to process different
kinds of data added variety to the equation. Another example of the wide variety of
data is sentiment analysis, which uses various forms of data from social media platforms
and customer responses to gauge sentiments. Today Big Data is associated with almost
any kind of large data that has the characteristics of volume, velocity, and variety.
Application Case 1.7 illustrates one example of Big Data analytics. We will study Big
Data characteristics in more detail in Chapters 3 and 13.
SECTION 1.9 REVIEW QUESTIONS
1. What is Big Data analytics?
2. What are the sources of Big Data?
3. What are the characteristics of Big Data?
4. What processing technique is applied to process Big Data?
Application Case 1.7
Gilt Groupe’s Flash Sales Streamlined by Big Data Analytics
Gilt Groupe is an online destination offering flash
sales for major brands by selling their clothing and
accessories. It offers its members exclusive discounts
on high-end clothing and other apparel. After regis-
tering with Gilt, customers are sent e-mails containing
a variety of offers. Customers are given a 36-48 hour
window to make purchases using these offers. There
are about 30 different sales each day. While a typical
department store turns over its inventory two or three
times a year, Gilt does it eight to 10 times a year. Thus,
they have to manage their inventory extremely well
or they could incur extremely high inventory costs.
In order to do this, analytics software developed at
Gilt keeps track of every customer click: what
brands the customers click on, what colors
they choose, what styles they pick, and what they
end up buying. Then Gilt tries to predict what these
customers are more likely to buy and stocks inven-
tory according to these predictions. Customers are
sent customized alerts to sale offers depending on the
suggestions by the analytics software.
That, however, is not the whole process. The
software also monitors what offers the customers
choose from the recommended offers to make more
accurate predictions and to increase the effectiveness
of its personalized recommendations. Some custom-
ers do not check e-mail that often. Gilt’s analytics
software keeps track of responses to offers and sends
the same offer 3 days later to those customers who
haven’t responded. Gilt also keeps track of what
customers are saying in general about Gilt’s prod-
ucts by analyzing Twitter feeds for sentiment.
Gilt’s recommendation software is based on Teradata
Aster’s technology solution, which includes Big Data
analytics technologies.
QUESTIONS FOR DISCUSSION
1. What makes this case study an example of Big
Data analytics?
2. What types of decisions does Gilt Groupe have
to make?
What We Can Learn from This Application
Case
There is continuous growth in the amount of struc-
tured and unstructured data, and many organiza-
tions are now tapping these data to make actionable
decisions. Big Data analytics is now enabled by
advancements in technologies that aid in the storage
and processing of vast amounts of rapidly growing data.
Source: Asterdata.com, “Gilt Groupe Speaks on Digital Marketing
Optimization,” asterdata.com/gilt_groupe_video.php (accessed
February 2013).
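A toy sketch of the click-tracking and offer logic the case describes might look like the following; the data model and scoring rule are invented, and only the 3-day re-send interval comes from the case.

# Toy sketch of click-based propensity scoring and offer re-sending,
# loosely following the behavior described in the case. All data is invented.
from datetime import datetime, timedelta
from collections import Counter

clicks = [          # (customer, brand) pairs harvested from the click stream
    ("alice", "BrandX"), ("alice", "BrandX"), ("alice", "BrandY"),
    ("bob", "BrandZ"),
]

# Naive propensity: the brand a customer clicks most is offered first.
by_customer = {}
for customer, brand in clicks:
    by_customer.setdefault(customer, Counter())[brand] += 1

for customer, counts in by_customer.items():
    best_brand, _ = counts.most_common(1)[0]
    print(f"send {customer} an offer featuring {best_brand}")

# Re-send an offer 3 days later if the customer has not responded.
offers_sent = {"alice": datetime(2013, 2, 1, 9, 0)}
responded = set()          # customers who acted on the offer
now = datetime(2013, 2, 5, 9, 0)
for customer, sent_at in offers_sent.items():
    if customer not in responded and now - sent_at >= timedelta(days=3):
        print(f"re-send offer to {customer}")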
1.10 PLAN OF THE BOOK
The previous sections have given you an understanding of the need for using informa-
tion technology in decision making; an IT-oriented view of various types of decisions;
and the evolution of decision support systems into business intelligence, and now into
analytics. In the last two sections we have seen an overview of various types of analyt-
ics and their applications. Now we are ready for a more detailed managerial excursion
into these topics, along with some potentially deep hands-on experience in some of the
technical topics. The 14 chapters of this book are organized into five parts, as shown in
Figure 1.6.
Part I: Business Analytics: An Overview
In Chapter 1, we provided an introduction, definitions, and an overview of decision sup-
port systems, business intelligence, and analytics, including Big Data analytics. Chapter 2
covers the basic phases of the decision-making process and introduces decision support
systems in more detail.
FIGURE 1.6 Plan of the Book. [The figure maps the book’s organization: Part I, Decision Making and Analytics: An Overview (Chapter 1, An Overview of Business Intelligence, Analytics, and Decision Support; Chapter 2, Foundations and Technologies for Decision Making); Part II, Descriptive Analytics (Chapter 3, Data Warehousing; Chapter 4, Business Reporting, Visual Analytics, and Business Performance Management); Part III, Predictive Analytics (Chapter 5, Data Mining; Chapter 6, Techniques for Predictive Modeling; Chapter 7, Text Analytics, Text Mining, and Sentiment Analysis; Chapter 8, Web Analytics, Web Mining, and Social Analytics); Part IV, Prescriptive Analytics (Chapter 9, Model-Based Decision Making: Optimization and Multi-Criteria Systems; Chapter 10, Modeling and Analysis: Heuristic Search Methods and Simulation; Chapter 11, Automated Decision Systems and Expert Systems; Chapter 12, Knowledge Management and Collaborative Systems); Part V, Big Data and Future Directions for Business Analytics (Chapter 13, Big Data and Analytics; Chapter 14, Business Analytics: Emerging Trends and Future Impacts); and Part VI, Online Supplements (software demos, data files for exercises, PowerPoint slides).]

Part II: Descriptive Analytics
Part II begins with an introduction to data warehousing issues, applications, and technolo-
gies in Chapter 3. Data represent the fundamental backbone of any decision support and
analytics application. Chapter 4 describes business reporting, visualization technologies,
and applications. It also includes a brief overview of business performance management
techniques and applications, a topic that has been a key part of traditional BI.
Part III: Predictive Analytics
Part III comprises a large part of the book. It begins with an introduction to predictive
analytics applications in Chapter 5. It includes many of the common application tech-
niques: classification, clustering, association mining, and so forth. Chapter 6 includes a
technical description of selected data mining techniques, especially neural network mod-
els. Chapter 7 focuses on text mining applications. Similarly, Chapter 8 focuses on Web
analytics, including social media analytics, sentiment analysis, and other related topics.
Part IV: Prescriptive Analytics
Part IV introduces decision analytic techniques, which are also called prescriptive analyt-
ics. Specifically, Chapter 9 covers selected models that may be implemented in spread-
sheet environments. It also covers a popular multi-objective decision technique, the
analytic hierarchy process.
Chapter 10 then introduces other model-based decision-making techniques, espe-
cially heuristic models and simulation. Chapter 11 introduces automated decision systems,
including expert systems. This part concludes with a brief discussion of knowledge
management and group support systems in Chapter 12.
Part V: Big Data and Future Directions for Business Analytics
Part V begins with a more detailed coverage of Big Data and analytics in Chapter 13.
Chapter 14 attempts to integrate all the material covered in this book and
concludes with a discussion of emerging trends, such as how the ubiquity of wire-
less and GPS devices and other sensors is resulting in the creation of massive new
databases and unique applications. A new breed of data mining and BI companies is
emerging to analyze these new databases and create a much better and deeper under-
standing of customers’ behaviors and movements. The chapter also covers cloud-based
analytics, recommendation systems, and a brief discussion of the security/privacy dimen-
sions of analytics. It concludes the book by also presenting a discussion of the analytics
ecosystem. An understanding of the ecosystem and the various players in the analytics
industry highlights the various career opportunities for students and practitioners of
analytics.
1.11 RESOURCES, LINKS, AND THE TERADATA UNIVERSITY
NETWORK CONNECTION
The use of this chapter and most other chapters in this book can be enhanced by the tools
described in the following sections.
Resources and Links
We recommend the following major resources and links:
• The Data Warehousing Institute (tdwi.org)
• Information Management (information-management.com)
• DSS Resources (dssresources.com)
• Microsoft Enterprise Consortium (enterprise.waltoncollege.uark.edu/mec.asp)
Vendors, Products, and Demos
Most vendors provide software demos of their products and applications. Information
about products, architecture, and software is available at dssresources.com.
Periodicals
We recommend the following periodicals:
• Decision Support Systems
• CIO Insight (cioinsight.com)
• Technology Evaluation (technologyevaluation.com)
• Baseline Magazine (baselinemag.com)
The Teradata University Network Connection
This book is tightly connected with the free resources provided by Teradata University
Network (TUN; see teradatauniversitynetwork.com). The TUN portal is divided
into two major parts: one for students and one for faculty. This book is connected to
the TUN portal via a special section at the end of each chapter. That section includes
appropriate links for the specific chapter, pointing to relevant resources. In addition,
we provide hands-on exercises, using software and other material (e.g., cases) avail-
able at TUN.
The Book’s Web Site
This book’s Web site, pearsonhighered.com/turban, contains supplemental textual
material organized as Web chapters that correspond to the printed book’s chapters. The
topics of these chapters are listed in the online chapter table of contents. Other content is
also available on an independent Web site (dssbibook.com).²
Chapter Highlights
• The business environment is becoming complex
and is rapidly changing, making decision making
more difficult.
• Businesses must respond and adapt to the chang-
ing environment rapidly by making faster and
better decisions.
• The time frame for making decisions is shrinking,
whereas the global nature of decision making is
expanding, necessitating the development and
use of computerized DSS.
• Computerized support for managers is often
essential for the survival of an organization.
• An early decision support framework divides
decision situations into nine categories, depending
on the degree of structuredness and managerial
activities. Each category is supported differently.
• Structured repetitive decisions are supported by
standard quantitative analysis methods, such as MS,
MIS, and rule-based automated decision support.
• DSS use data, models, and sometimes knowledge
management to find solutions for semistructured
and some unstructured problems.
• BI methods utilize a central repository called a
data warehouse that enables efficient data mining,
OLAP, BPM, and data visualization.
• BI architecture includes a data warehouse, busi-
ness analytics tools used by end users, and a user
interface (such as a dashboard).
• Many organizations employ descriptive analytics
to replace their traditional flat reporting with inter-
active reporting that provides insights, trends, and
patterns in the transactional data.
• Predictive analytics enable organizations to estab-
lish predictive rules that drive the business out-
comes through historical data analysis of the
existing behavior of the customers.
• Prescriptive analytics help in building models that
involve forecasting and optimization techniques
based on the principles of operations research
and management science to help organizations
make better decisions.
• Big Data analytics focuses on unstructured, large
data sets that may also include vastly different
types of data for analysis.
• Analytics as a field is also known by industry-
specific application names, such as sports analytics.
It is also known by other related names, such as
data science or network science.
² As this book went to press, we verified that all the cited Web sites were active and valid. However, URLs are
dynamic. Web sites to which we refer in the text sometimes change or are discontinued because companies
change names, are bought or sold, merge, or fail. Sometimes Web sites are down for maintenance, repair, or
redesign. Many organizations have dropped the initial “www” designation for their sites, but some still use it. If
you have a problem connecting to a Web site that we mention, please be patient and simply run a Web search
to try to identify the possible new site. Most times, you can quickly find the new site through one of the popular
search engines. We apologize in advance for this inconvenience.
Key Terms
business intelligence (BI)
dashboard
data mining
decision (or normative) analytics
decision support system (DSS)
descriptive (or reporting) analytics
predictive analytics
prescriptive analytics
semistructured problem
structured problem
unstructured problem
Questions for Discussion
1. Give examples for the content of each cell in Figure 1.2.
2. Survey the literature from the past 6 months to find one
application each for DSS, BI, and analytics. Summarize
the applications on one page and submit it with the exact
sources.
3. Observe an organization with which you are familiar. List
three decisions it makes in each of the following categories:
strategic planning, management control (tactical planning),
and operational planning and control.
4. Distinguish BI from DSS.
5. Compare and contrast predictive analytics with prescrip-
tive and descriptive analytics. Use examples.
Exercises
Teradata University Network (TUN) and Other
Hands-On Exercises
1. Go to teradatauniversitynetwork.com. Using the reg-
istration your instructor provides, log on and learn the
content of the site. You will receive assignments related
to this site. Prepare a list of 20 items in the site that you
think could be beneficial to you.
2. Enter the TUN site and select “cases, projects and assign-
ments.” Then select the case study “Harrah’s High Payoff
from Customer Information.” Answer the following ques-
tions about this case:
a. What information does the data mining generate?
b. How is this information helpful to management in
decision making? (Be specific.)
c. List the types of data that are mined.
d. Is this a DSS or BI application? Why?
3. Go to teradatauniversitynetwork.com and find the paper
titled “Data Warehousing Supports Corporate Strategy at First
American Corporation” (by Watson, Wixom, and Goodhue).
Read the paper and answer the following questions:
a. What were the drivers for the DW/BI project in the
company?
b. What strategic advantages were realized?
c. What operational and tactical advantages were achieved?
d. What were the critical success factors (CSF) for the
implementation?
4. Go to analytics-magazine.org/issues/digital-editions
and find the January/February 2012 edition titled “Special
Issue: The Future of Healthcare.” Read the article “Predictive
descriptive (or re porting)
analytics
predictive analytics
prescriptive analytics
semistructured
problem
structured problem
unstructured proble m
strategic planning, manageme nt control (tactical planning),
and o p e ratio nal planning a nd contro l.
4. Distinguish BI from DSS.
5. Compa re a nd contrast pre dictive a nalytics with prescrip-
tive and descriptive analytics. Use examples.
Analytics-Saving Lives and Lowering Medical Bills.”
Answer the following questions:
a . What is the proble m that is be ing addressed by apply-
ing predictive analytics?
b . What is the FICO Medication Adhere nce Score?
c. How is a prediction mo del traine d to predict the FICO
Medicatio n Adherence Score? Did the prediction
model classify FICO Medication Adhe re nce Score?
d. Zoom in o n Figure 4 and explain w h at kind of tech-
nique is applied o n the generated results.
e. List some of the actio nable decisions that were based
on the results of the p redictions.
5 . Go to analytics-magazine.org/issues/digital-editions
and find the January/ February 2013 editio n titled “Work
Social. ” Read the a rticle “Big Data, Analytics and Elections”
a nd answer the followin g questions:
a . What kinds of Big Data were a nalyzed in the article?
Comment o n some of the sources of Big Data.
b. Explain the term integrated system. Wha t othe r tech-
nical term suits integrated system?
c. What kinds of da ta a nalysis techniques are e mployed
in the project? Comment on some initiatives that
resulted fro m data a nalysis.
d . What a re the diffe re nt prediction problems a nswered
by the models?
e. List some o f the actionable decisions taken that were
based on the predicatio n results.
f. Identify two applications of Big Data a nalytics that are
not listed in the article.
6. Search the Internet for material regarding the work of managers and the role analytics play. What kind of references to consulting firms, academic departments, and programs do you find? What major areas are represented? Select five sites that cover one area and report your findings.
7. Explore the public areas of dssresources.com. Prepare a list of its major available resources. You might want to refer to this site as you work through the book.
8. Go to microstrategy.com. Find information on the five styles of BI. Prepare a summary table for each style.
9. Go to oracle.com and click the Hyperion link under Applications. Determine what the company's major products are. Relate these to the support technologies cited in this chapter.

End-of-Chapter Application Case
Nationwide Insurance Used BI to Enhance Customer Service

Nationwide Mutual Insurance Company, headquartered in Columbus, Ohio, is one of the largest insurance and financial services companies, with $23 billion in revenues and more than $160 billion in statutory assets. It offers a comprehensive range of products through its family of 100-plus companies, with insurance products for auto, motorcycle, boat, life, homeowners, and farms. It also offers financial products and services including annuities, mortgages, mutual funds, pensions, and investment management.

Nationwide strives to achieve greater efficiency in all operations by managing its expenses along with its ability to grow its revenue. It recognizes the use of its strategic asset of information combined with analytics to outpace competitors in strategic and operational decision making even in complex and unpredictable environments.

Historically, Nationwide's business units worked independently and with a lot of autonomy. This led to duplication of efforts, widely dissimilar data processing environments, and extreme data redundancy, resulting in higher expenses. The situation got complicated when Nationwide pursued any mergers or acquisitions.

Nationwide, using enterprise data warehouse technology from Teradata, set out to create, from the ground up, a single, authoritative environment for clean, consistent, and complete data that can be effectively used for best-practice analytics to make strategic and tactical business decisions in the areas of customer growth, retention, product profitability, cost containment, and productivity improvements. Nationwide transformed its siloed business units, which were supported by stove-piped data environments, into integrated units by using cutting-edge analytics that work with clear, consolidated data from all of its business units. The Teradata data warehouse at Nationwide has grown from 400 gigabytes to more than 100 terabytes and supports 85 percent of Nationwide's business with more than 2,500 users.
Integrated Customer Knowledge

Nationwide's Customer Knowledge Store (CKS) initiative developed a customer-centric database that integrated customer, product, and externally acquired data from more than 48 sources into a single customer data mart to deliver a holistic view of customers. This data mart was coupled with Teradata's customer relationship management application to create and manage effective customer marketing campaigns that use behavioral analysis of customer interactions to drive customer management actions (CMAs) for target segments. Nationwide added more sophisticated customer analytics that looked at customer portfolios and the effectiveness of various marketing campaigns. This data analysis helped Nationwide to initiate proactive customer communications around customer lifetime events like marriage, birth of a child, or home purchase and had a significant impact on improving customer satisfaction. Also, by integrating customer contact history, product ownership, and payment information, Nationwide's behavioral analytics teams further created prioritized models that could identify which specific customer interaction was important for a customer at any given time. This resulted in a one percentage point improvement in customer retention rates and significant improvement in customer enthusiasm scores. Nationwide also achieved 3 percent annual growth in incremental sales by using CKS.

There are other uses of the customer database. In one of the initiatives, by integrating customer telephone data from multiple systems into CKS, the relationship managers at Nationwide try to be proactive in contacting customers in advance of a possible weather catastrophe, such as a hurricane or flood, to provide the primary policyholder information and explain the claims processes. These and other analytic insights now drive Nationwide to provide extremely personal customer service.
Financial Operations

A similar performance payoff from integrated information was also noted in financial operations. Nationwide's decentralized management style resulted in a fragmented financial reporting environment that included more than 14 general ledgers, 20 charts of accounts, 17 separate data repositories, 12 different reporting tools, and hundreds of thousands of spreadsheets. There was no common central view of the business, which resulted in labor-intensive, slow, and inaccurate reporting.
About 75 percent of the effort was spent on acquiring, cleaning, consolidating, and validating the data, and very little time was spent on meaningful analysis of the data.
The Financial Performance Management initiative implemented a new operating approach that worked on a single data and technology architecture with a common set of systems standardizing the process of reporting. It enabled Nationwide to operate analytical centers of excellence with world-class planning, capital management, risk assessment, and other decision support capabilities that delivered timely, accurate, and efficient accounting, reporting, and analytical services.

The data from more than 200 operational systems was sent to the enterprise-wide data warehouse and then distributed to various applications and analytics. This resulted in a 50 percent improvement in the monthly closing process, with closing intervals reduced from 14 days to 7 days.
Postmerger Data Integration

Nationwide's Goal State Rate Management initiative enabled the company to merge Allied Insurance's automobile policy system into its existing system. Both Nationwide and Allied source systems were custom-built applications that did not share any common values or process data in the same manner. Nationwide's IT department decided to bring all the data from the source systems into a centralized data warehouse, organized in an integrated fashion that resulted in standard dimensional reporting and helped Nationwide in performing what-if analyses. The data analysis team could identify previously unknown potential differences in the data environment, where premium rates were calculated differently on the Nationwide and Allied sides. Correcting all of these benefited Nationwide's policyholders because they were safeguarded from experiencing wide premium rate swings.
Enhanced Reporting

Nationwide's legacy reporting system, which catered to the needs of property and casualty business units, took weeks to compile and deliver the needed reports to the agents. Nationwide determined that it needed better access to sales and policy information to reach its sales targets. It chose a single data warehouse approach and, after careful assessment of the needs of sales management and individual agents, selected a business intelligence platform that would integrate dynamic enterprise dashboards into its reporting systems, making it easy for the agents and associates to view policy information at a glance. The new reporting system, dubbed Revenue Connection, also enabled users to analyze the information with a lot of interactive and drill-down-to-details capabilities at various levels that eliminated the need to generate custom ad hoc reports. Revenue Connection virtually eliminated requests for manual policy audits, resulting in huge savings in time and money for the business and technology teams. The reports were produced in 4 to 45 seconds, rather than days or weeks, and productivity in some units improved by 20 to 30 percent.
QUESTIONS FOR DISCUSSION

1. Why did Nationwide need an enterprise-wide data warehouse?
2. How did integrated data drive the business value?
3. What forms of analytics are employed at Nationwide?
4. With integrated data available in an enterprise data warehouse, what other applications could Nationwide potentially develop?

What We Can Learn from This Application Case

The proper use of integrated information in organizations can help achieve better business outcomes. Many organizations now rely on data warehousing technologies to perform online analytical processing on the data to derive valuable insights. The insights are used to develop predictive models that further enable the growth of the organizations by more precisely assessing customer needs. Increasingly, organizations are moving toward deriving value from analytical applications in real time with the help of integrated data from real-time data warehousing technologies.
Source: Teradata.com, "Nationwide, Delivering an On Your Side Experience," teradata.com/WorkArea/linkit.aspx?LinkIdentifier=id&ItemID=14714 (accessed February 2013).

References

Anthony, R. N. (1965). Planning and Control Systems: A Framework for Analysis. Cambridge, MA: Harvard University Graduate School of Business.

Asterdata.com. "Gilt Groupe Speaks on Digital Marketing Optimization." www.asterdata.com/gilt_groupe_video.php (accessed February 2013).
Barabasilab.neu.edu. "Network Science." barabasilab.neu.edu/networksciencebook/downlPDF.html (accessed February 2013).

Brooks, D. (2009, May 18). "In Praise of Dullness." New York Times, nytimes.com/2009/05/19/opinion/19brooks.html (accessed February 2013).
Centers for Disease Control and Prevention, Vaccines for Children Program. "Module 6 of the VFC Operations Guide." cdc.gov/vaccines/pubs/pinkbook/vac-storage.html#storage (accessed January 2013).

Eckerson, W. (2003). Smart Companies in the 21st Century: The Secrets of Creating Successful Business Intelligence Solutions. Seattle, WA: The Data Warehousing Institute.

Emc.com. "Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field." emc.com/collateral/about/news/emc-data-science-study-wp.pdf (accessed February 2013).

Gorry, G. A., and M. S. Scott-Morton. (1971). "A Framework for Management Information Systems." Sloan Management Review, Vol. 13, No. 1, pp. 55-70.

INFORMS. "Analytics Section Overview." informs.org/Community/Analytics (accessed February 2013).

Keen, P. G. W., and M. S. Scott-Morton. (1978). Decision Support Systems: An Organizational Perspective. Reading, MA: Addison-Wesley.

Krivda, C. D. (2008, March). "Dialing Up Growth in a Mature Market." Teradata Magazine, pp. 1-3.

Magpiesensing.com. "MagpieSensing Cold Chain Analytics and Monitoring." magpiesensing.com/wp-content/uploads/2013/01/ColdChainAnalyticsMagpieSensingWhitepaper (accessed January 2013).

Mintzberg, H. A. (1980). The Nature of Managerial Work. Englewood Cliffs, NJ: Prentice Hall.

Mintzberg, H. A. (1993). The Rise and Fall of Strategic Planning. New York: The Free Press.

Simon, H. (1977). The New Science of Management Decision. Englewood Cliffs, NJ: Prentice Hall.

Tableausoftware.com. "Eliminating Waste at Seattle Children's." tableausoftware.com/eliminating-waste-at-seattle-childrens (accessed February 2013).

Tableausoftware.com. "Kaleida Health Finds Efficiencies, Stays Competitive." tableausoftware.com/learn/stories/user-experience-speed-thought-kaleida-health (accessed February 2013).

Teradata.com. "Nationwide, Delivering an On Your Side Experience." teradata.com/case-studies/delivering-on-your-side-experience (accessed February 2013).

Teradata.com. "Sabre Airline Solutions." teradata.com/t/case-studies/Sabre-Airline-Solutions-EB6281 (accessed February 2013).

Wang, X., et al. (2012, January/February). "Branch Reconfiguration Practice Through Operations Research in Industrial and Commercial Bank of China." Interfaces, Vol. 42, No. 1, pp. 33-44.

Watson, H. (2005, Winter). "Sorting Out What's New in Decision Support." Business Intelligence Journal.

Wikipedia. "On Base Percentage." en.wikipedia.org/wiki/On_base_percentage (accessed January 2013).

Wikipedia. "Sabermetrics." en.wikipedia.org/wiki/Sabermetrics (accessed January 2013).

Zaleski, A. (2012). "Magpie Analytics System Tracks Cold-Chain Products to Keep Vaccines, Reagents Fresh." TechnicallyBaltimore.com (accessed February 2013).

Zaman, M. (2009, April). "Business Intelligence: Its Ins and Outs." technologyevaluation.com (accessed February 2013).

Ziama, A., and J. Kasher. (2004). "Data Mining Primer for the Data Warehousing Professional." Dayton, OH: Teradata.
CHAPTER 2

Foundations and Technologies for Decision Making

LEARNING OBJECTIVES

• Understand the conceptual foundations of decision making
• Understand Simon's four phases of decision making: intelligence, design, choice, and implementation
• Understand the essential definition of DSS
• Understand important DSS classifications
• Learn how DSS support for decision making can be provided in practice
• Understand DSS components and how they integrate

Our major focus in this book is the support of decision making through computer-based information systems. The purpose of this chapter is to describe the conceptual foundations of decision making and how decision support is provided. This chapter includes the following sections:
2.1 Opening Vignette: Decision Modeling at HP Using Spreadsheets 38
2.2 Decision Making: Introduction and Definitions 40
2.3 Phases of the Decision-Making Process 42
2.4 Decision Making: The Intelligence Phase 44
2.5 Decision Making: The Design Phase 47
2.6 Decision Making: The Choice Phase 55
2.7 Decision Making: The Implementation Phase 55
2.8 How Decisions Are Supported 56
2.9 Decision Support Systems: Capabilities 59
2.10 DSS Classifications 61
2.11 Components of Decision Support Systems 64
2.1 OPENING VIGNETTE: Decision Modeling at HP Using
Spreadsheets
HP is a major manufacturer of computers, printers, and many industrial products. Its vast
product line leads to many decision problems. Olavson and Fry (2008) have worked on
many spreadsheet models for assisting decision makers at HP and have identified several
lessons from both their successes and their failures when it comes to constructing and
applying spreadsheet-based tools. They define a tool as “a reusable, analytical solution
designed to be handed off to nontechnical end users to assist them in solving a repeated
business problem."
When trying to solve a problem, HP developers consider the three phases in devel-
oping a model. The first phase is problem framing, where they consider the following
questions in order to develop the best solution for the problem:
• Will analytics solve the problem?
• Can an existing solution be leveraged?
• Is a tool needed?
The first question is important because the problem may not be of an analytic nature, and therefore, a spreadsheet tool may not be of much help in the long run without fixing the nonanalytical part of the problem first. For example, many inventory-related issues arise because of the inherent differences between the goals of marketing and supply chain groups. Marketing likes to have the maximum variety in the product line, whereas supply chain management focuses on reducing the inventory costs. This difference is partially outside the scope of any model. Coming up with nonmodeling solutions is important as well. If the problem arises due to "misalignment" of incentives or unclear lines of authority or plans, no model can help. Thus, it is important to identify the root issue.
The second question is important because sometimes an existing tool may solve a
problem that then saves time and money. Sometimes modifying an existing tool may solve
the problem, again saving some time and money, but sometimes a custom tool is neces-
sary to solve the problem. This is clearly worthwhile to explore.
The third question is important because sometimes a new computer-based system
is not required to solve the problem. The developers have found that they often use
analytically derived decision guidelines instead of a tool. This solution requires less time
for development and training, has lower maintenance requirements, and also provides
simpler and more intuitive results. That is, after they have explored the problem more deeply,
the developers may determine that it is better to present decision rules that can be eas-
ily implemented as guidelines for decision making rather than asking the managers to
run some type of a computer model. This results in easier training, better understanding
of the rules being proposed, and increased acceptance. It also typically leads to lower
development costs and reduced time for deployment.
If a model has to be built, the developers move on to the second phase: the actual design and development of the tools. Adhering to five guidelines tends to increase the probability that the new tool will be successful. The first guideline is to develop a prototype as quickly as possible. This allows the developers to test the designs, demonstrate various features and ideas for the new tools, get early feedback from the end users to see what works for them and what needs to be changed, and test adoption. Developing a prototype also prevents the developers from overbuilding the tool and yet allows them to construct more scalable and standardized software applications later. Additionally, by developing a prototype, developers can stop the process once the tool is "good enough," rather than building a standardized solution that would take longer to build and be more expensive.
The second guideline is to "build insight, not black boxes." The HP spreadsheet model developers believe that this is important, because often just entering some data and receiving a calculated output is not enough. The users need to be able to think of alternative scenarios, and the tool does not support this if it is a "black box" that provides only one recommendation. They argue that a tool is best only if it provides information to help make and support decisions rather than just give the answers. They also believe that an interactive tool helps the users to understand the problem better, therefore leading to more informed decisions.

The third guideline is to "remove unneeded complexity before handoff." This is important, because as a tool becomes more complex it requires more training and expertise, more data, and more recalibrations. The risk of bugs and misuse also increases. Sometimes it is best to study the problem, begin modeling and analysis, and then start shaping the program into a simple-to-use tool for the end user.

The fourth guideline is to "partner with end users in discovery and design." By working with the end users, the developers get a better feel for the problem and a better idea of what the end users want. It also increases the end users' ability to use analytic tools. The end users also gain a better understanding of the problem and how it is solved using the new tool. Additionally, including the end users in the development process enhances the decision makers' analytical knowledge and capabilities. By working together, their knowledge and skills complement each other in the final solution.

The fifth guideline is to "develop an Operations Research (OR) champion." By involving end users in the development process, the developers create champions for the new tools who then go back to their departments or companies and encourage their coworkers to accept and use them. The champions are then the experts on the tools in their areas and can then help those being introduced to the new tools. Having champions increases the possibility that the tools will be adopted into the businesses successfully.
The final stage is the handoff, when the final tools that provide complete solutions are given to the businesses. When planning the handoff, it is important to answer the following questions:

• Who will use the tool?
• Who owns the decisions that the tool will support?
• Who else must be involved?
• Who is responsible for maintenance and enhancement of the tool?
• When will the tool be used?
• How will the use of the tool fit in with other processes?
• Does it change the processes?
• Does it generate input into those processes?
• How will the tool impact business performance?
• Are the existing metrics sufficient to reward this aspect of performance?
• How should the metrics and incentives be changed to maximize impact to the business from the tool and process?

By keeping these lessons in mind, developers and proponents of computerized decision support in general and spreadsheet-based models in particular are likely to enjoy greater success.
QUESTIONS FOR THE OPENING VIGNETTE
1. What are some of the key questions to be asked in supporting decision making through DSS?
2. What guidelines can be learned from this vignette about developing DSS?
3. What lessons should be kept in mind for successful model implementation?
WHAT WE CAN LEARN FROM THIS VIGNETTE

This vignette relates to providing decision support in a large organization:

• Before building a model, decision makers should develop a good understanding of the problem that needs to be addressed.
• A model may not be necessary to address the problem.
• Before developing a new tool, decision makers should explore reuse of existing tools.
• The goal of model building is to gain better insight into the problem, not just to generate more numbers.
• Implementation plans should be developed along with the model.

Source: Based on T. Olavson and C. Fry, "Spreadsheet Decision-Support Tools: Lessons Learned at Hewlett-Packard," Interfaces, Vol. 38, No. 4, July/August 2008, pp. 300-310.
2.2 DECISION MAKING: INTRODUCTION AND DEFINITIONS

We are about to examine how decision making is practiced and some of the underlying theories and models of decision making. You will also learn about the various traits of decision makers, including what characterizes a good decision maker. Knowing this can help you to understand the types of decision support tools that managers can use to make more effective decisions. In the following sections, we discuss various aspects of decision making.
Characteristics of Decision Making

In addition to the characteristics presented in the opening vignette, decision making may involve the following:

• Groupthink (i.e., group members accept the solution without thinking for themselves) can lead to bad decisions.
• Decision makers are interested in evaluating what-if scenarios.
• Experimentation with a real system (e.g., develop a schedule, try it, and see how well it works) may result in failure.
• Experimentation with a real system is possible only for one set of conditions at a time and can be disastrous.
• Changes in the decision-making environment may occur continuously, leading to invalidating assumptions about a situation (e.g., deliveries around holiday times may increase, requiring a different view of the problem).
• Changes in the decision-making environment may affect decision quality by imposing time pressure on the decision maker.
• Collecting information and analyzing a problem takes time and can be expensive. It is difficult to determine when to stop and make a decision.
• There may not be sufficient information to make an intelligent decision.
• Too much information may be available (i.e., information overload).

To determine how real decision makers make decisions, we must first understand the process and the important issues involved in decision making. Then we can understand appropriate methodologies for assisting decision makers and the contributions information systems can make. Only then can we develop DSS to help decision makers.
This chapter is organized based on the three key words that form the term DSS: decision, support, and systems. A decision maker should not simply apply IT tools blindly. Rather, the decision maker gets support through a rational approach that simplifies reality and provides a relatively quick and inexpensive means of considering various alternative courses of action to arrive at the best (or at least a very good) solution to the problem.
A Working Definition of Decision Making

Decision making is a process of choosing among two or more alternative courses of action for the purpose of attaining one or more goals. According to Simon (1977), managerial decision making is synonymous with the entire management process. Consider the important managerial function of planning. Planning involves a series of decisions: What should be done? When? Where? Why? How? By whom? Managers set goals, or plan; hence, planning implies decision making. Other managerial functions, such as organizing and controlling, also involve decision making.
Decision-Making Disciplines

Decision making is directly influenced by several major disciplines, some of which are behavioral and some of which are scientific in nature. We must be aware of how their philosophies can affect our ability to make decisions and provide support. Behavioral disciplines include anthropology, law, philosophy, political science, psychology, social psychology, and sociology. Scientific disciplines include computer science, decision analysis, economics, engineering, the hard sciences (e.g., biology, chemistry, physics), management science/operations research, mathematics, and statistics.

An important characteristic of management support systems (MSS) is their emphasis on the effectiveness, or "goodness," of the decision produced rather than on the computational efficiency of obtaining it; efficiency is usually a major concern of a transaction processing system. Most Web-based DSS are focused on improving decision effectiveness. Efficiency may be a by-product.
Decision Style and Decision Makers
In the following sections, we examine the notion of decision style and specific aspects of decision makers.
DECISION STYLE Decision style is the manner by which decision makers think and react to problems. This includes the way they perceive a problem, their cognitive responses, and how values and beliefs vary from individual to individual and from situation to situation. As a result, people make decisions in different ways. Although there is a general process of decision making, it is far from linear. People do not follow the same steps of the process in the same sequence, nor do they use all the steps. Furthermore, the emphasis, time allotment, and priorities given to each step vary significantly, not only from one person to another, but also from one situation to the next. The manner in which managers make decisions (and the way they interact with other people) describes their decision style. Because decision styles depend on the factors described earlier, there are many decision styles. Personality temperament tests are often used to determine decision styles. Because there are many such tests, it is important to try to equate them in determining decision style. However, the various tests measure somewhat different aspects of personality, so they cannot be equated.

Researchers have identified a number of decision-making styles. These include heuristic and analytic styles. One can also distinguish between autocratic versus democratic styles. Another style is consultative (with individuals or groups). Of course, there are many combinations and variations of styles. For example, a person can be analytic and autocratic, or consultative (with individuals) and heuristic.
For a computerized system to successfully support a manager, it should fit the decision situation as well as the decision style. Therefore, the system should be flexible and adaptable to different users. The ability to ask what-if and goal-seeking questions provides flexibility in this direction. A Web-based interface using graphics is a desirable feature in supporting certain decision styles. If a DSS is to support varying styles, skills, and knowledge, it should not attempt to enforce a specific process. Rather, it should help decision makers use and develop their own styles, skills, and knowledge.
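To make these two kinds of questions concrete, here is a minimal Python sketch of our own (not from the book; the model, numbers, and function names are invented for illustration). The what-if loop recomputes a toy profit model under alternative prices; the goal seeker inverts the question, searching for the price that achieves a target profit:

    # Toy profit model: profit as a function of a decision variable (price).
    def profit(units_sold, price, unit_cost=6.0, fixed_cost=50_000.0):
        return units_sold * (price - unit_cost) - fixed_cost

    # What-if analysis: vary an input and observe the resulting profit.
    for price in (9.0, 10.0, 11.0):
        print(f"price={price:.2f} -> profit={profit(20_000, price):>9,.0f}")

    # Goal seeking: find the price that yields a target profit.
    # Simple bisection; assumes profit rises with price on [lo, hi].
    def goal_seek(target, lo=6.0, hi=20.0, tol=1e-6):
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if profit(20_000, mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    print(f"price for a $100,000 profit: {goal_seek(100_000.0):.2f}")  # about 13.50

A spreadsheet's Goal Seek command performs essentially this kind of iterative search behind the scenes.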
Different decision styles require different types of support. A major factor that determines the type of support required is whether the decision maker is an individual or a group. Individual decision makers need access to data and to experts who can provide advice, whereas groups additionally need collaboration tools. Web-based DSS can provide support to both.

A lot of information is available on the Web about cognitive styles and decision styles (e.g., see Birkman International, Inc., birkman.com; Keirsey Temperament Sorter and Keirsey Temperament Theory-II, keirsey.com). Many personality/temperament tests are available to help managers identify their own styles and those of their employees. Identifying an individual's style can help establish the most effective communication patterns and ideal tasks for which the person is suited.
DECISION MAKERS Decisions are often made by individuals, especially at lower managerial levels and in small organizations. There may be conflicting objectives even for a sole decision maker. For example, when making an investment decision, an individual investor may consider the rate of return on the investment, liquidity, and safety as objectives. Finally, decisions may be fully automated (but only after a human decision maker decides to do so!).

This discussion of decision making focuses in large part on an individual decision maker. Most major decisions in medium-sized and large organizations are made by groups. Obviously, there are often conflicting objectives in a group decision-making setting. Groups can be of variable size and may include people from different departments or from different organizations. Collaborating individuals may have different cognitive styles, personality types, and decision styles. Some clash, whereas others are mutually enhancing. Consensus can be a difficult political problem. Therefore, the process of decision making by a group can be very complicated. Computerized support can greatly enhance group decision making. Computer support can be provided at a broad level, enabling members of whole departments, divisions, or even entire organizations to collaborate online. Such support has evolved over the past few years into enterprise information systems (EIS) and includes group support systems (GSS), enterprise resource management (ERM)/enterprise resource planning (ERP), supply chain management (SCM), knowledge management systems (KMS), and customer relationship management (CRM) systems.
SECTION 2.2 REVIEW QUESTIONS

1. What are the various aspects of decision making?
2. Identify similarities and differences between individual and group decision making.
3. Define decision style and describe why it is important to consider in the decision-making process.
4. What are the benefits of mathematical models?
2.3 PHASES OF THE DECISION-MAKING PROCESS

It is advisable to follow a systematic decision-making process. Simon (1977) said that this involves three major phases: intelligence, design, and choice. He later added a fourth phase, implementation. Monitoring can be considered a fifth phase, a form of feedback. However, we view monitoring as the intelligence phase applied to the implementation phase. Simon's model is the most concise and yet complete characterization of rational decision making. A conceptual picture of the decision-making process is shown in Figure 2.1.

[FIGURE 2.1 The Decision-Making/Modeling Process. The figure traces the intelligence phase (organization objectives; search and scanning procedures; data collection; problem identification, ownership, and classification; problem statement), the design phase (formulate a model; validate the model; set criteria for choice; search for alternatives; predict and measure outcomes), and the choice phase (solution to the model; sensitivity analysis; selection of the best or a good alternative; plan for implementation; verification and testing of the proposed solution), followed by implementation of the solution. Success solves the real problem; failure loops back to earlier phases, with simplification and assumptions feeding the modeling steps.]

There is a continuous flow of activity from intelligence to design to choice (see the bold lines in Figure 2.1), but at any phase, there may be a return to a previous phase (feedback). Modeling is an essential part of this process. The seemingly chaotic nature of following a haphazard path from problem discovery to solution via decision making can be explained by these feedback loops.
The decision-making process starts with the intelligence phase; in this phase, the decision maker examines reality and identifies and defines the problem. Problem ownership is established as well. In the design phase, a model that represents the system is constructed. This is done by making assumptions that simplify reality and writing down the relationships among all the variables. The model is then validated, and criteria are determined in a principle of choice for evaluation of the alternative courses of action that are identified. Often, the process of model development identifies alternative solutions, and vice versa.

The choice phase includes selection of a proposed solution to the model (not necessarily to the problem it represents). This solution is tested to determine its viability. When the proposed solution seems reasonable, we are ready for the last phase: implementation of the decision (not necessarily of a system). Successful implementation results in solving the real problem. Failure leads to a return to an earlier phase of the process. In fact, we can return to an earlier phase during any of the latter three phases. The decision-making situations described in the opening vignette follow Simon's four-phase model, as do almost all other decision-making situations. Web impacts on the four phases, and vice versa, are shown in Table 2.1 (a small sketch of the phases and their feedback loops follows the table).
TABLE 2.1 Simon's Four Phases of Decision Making and the Web

Intelligence
  Web impacts: Access to information to identify problems and opportunities from internal and external data sources; access to analytics methods to identify opportunities; collaboration through group support systems (GSS) and knowledge management systems (KMS).
  Impacts on the Web: Identification of opportunities for e-commerce, Web infrastructure, hardware and software tools, etc.; intelligent agents, which reduce the burden of information overload; smart search engines.

Design
  Web impacts: Access to data, models, and solution methods; use of online analytical processing (OLAP), data mining, and data warehouses; collaboration through GSS and KMS; similar solutions available from KMS.
  Impacts on the Web: Brainstorming methods (e.g., GSS) to collaborate in Web infrastructure design; models and solutions of Web infrastructure issues.

Choice
  Web impacts: Access to methods to evaluate the impacts of proposed solutions.
  Impacts on the Web: Decision support system (DSS) tools, which examine and establish criteria from models to determine Web, intranet, and extranet infrastructure; DSS tools, which determine how to route messages.

Implementation
  Web impacts: Web-based collaboration tools (e.g., GSS) and KMS, which can assist in implementing decisions; tools, which monitor the performance of e-commerce and other sites, including intranets, extranets, and the Internet.
  Impacts on the Web: Decisions implemented on browser and server design and access, which ultimately determined how to set up the various components that have evolved into the Internet.
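The flow-with-feedback structure of Figure 2.1 can also be expressed in a few lines of code. The following Python sketch is our own illustration, not anything from the book: each phase either hands off to the next or sends the process back to an earlier phase, mirroring the feedback loops described above.

    # Simon's four phases with feedback loops (illustrative only).
    PHASES = ["intelligence", "design", "choice", "implementation"]

    def run_decision_process(evaluate):
        """evaluate(phase) returns True to proceed, or the name of an
        earlier phase to revisit (a feedback loop)."""
        i = 0
        while i < len(PHASES):
            outcome = evaluate(PHASES[i])
            if outcome is True:
                i += 1                      # continuous forward flow
            else:
                i = PHASES.index(outcome)   # feedback to an earlier phase
        return "problem solved"

    # Example: a failed implementation sends us back to the design phase once.
    attempts = {"implementation": 0}
    def evaluate(phase):
        if phase == "implementation":
            attempts[phase] += 1
            return True if attempts[phase] > 1 else "design"
        return True

    print(run_decision_process(evaluate))  # -> problem solved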
Note that there are many other decision-making processes. Notable among them is the Kepner-Tregoe method (Kepner and Tregoe, 1998), which has been adopted by many firms because its tools are readily available from Kepner-Tregoe, Inc. (kepner-tregoe.com). We have found that these alternative models, including the Kepner-Tregoe method, readily map into Simon's four-phase model.

We next turn to a detailed discussion of the four phases identified by Simon.
SECTION 2.3 REVIEW QUESTIONS
1. List and briefly describe Simon's four phases of decision making.
2. What are the impacts of the Web on the phases of decision making?
2.4 DECISION MAKING: THE INTELLIGENCE PHASE

Intelligence in decision making involves scanning the environment, either intermittently or continuously. It includes several activities aimed at identifying problem situations or opportunities. It may also include monitoring the results of the implementation phase of a decision-making process.
Problem (or Opportunity) Identification

The intelligence phase begins with the identification of organizational goals and objectives related to an issue of concern (e.g., inventory management, job selection, lack of or incorrect Web presence) and determination of whether they are being met. Problems occur because of dissatisfaction with the status quo. Dissatisfaction is the result of a difference between what people desire (or expect) and what is occurring. In this first phase, a decision maker attempts to determine whether a problem exists, identify its symptoms, determine its magnitude, and explicitly define it. Often, what is described as a problem (e.g., excessive costs) may be only a symptom (i.e., measure) of a problem (e.g., improper inventory levels). Because real-world problems are usually complicated by many interrelated factors, it is sometimes difficult to distinguish between the symptoms and the real problem. New opportunities and problems certainly may be uncovered while investigating the causes of symptoms. For example, Application Case 2.1 describes a classic story of recognizing the correct problem.

The existence of a problem can be determined by monitoring and analyzing the organization's productivity level. The measurement of productivity and the construction of a model are based on real data. The collection of data and the estimation of future data are among the most difficult steps in the analysis. The following are some issues that may arise during data collection and estimation and thus plague decision makers:

• Data are not available. As a result, the model is made with, and relies on, potentially inaccurate estimates.
• Obtaining data may be expensive.
• Data may not be accurate or precise enough.
• Data estimation is often subjective.
• Data may be insecure.
• Important data that influence the results may be qualitative (soft).
• There may be too many data (i.e., information overload).
Application Case 2.1

Making Elevators Go Faster!

This story has been reported in numerous places and has almost become a classic example to explain the need for problem identification. Ackoff (as cited in Larson, 1987) described the problem of managing complaints about slow elevators in a tall hotel tower. After trying many solutions for reducing the complaints (staggering elevators to go to different floors, adding operators, and so on), the management determined that the real problem was not the actual waiting time but rather the perceived waiting time. So the solution was to install full-length mirrors on elevator doors on each floor. As Hesse and Woolsey (1975) put it, "the women would look at themselves in the mirrors and make adjustments, while the men would look at the women, and before they knew it, the elevator was there." By reducing the perceived waiting time, the problem went away. Baker and Cameron (1996) give several other examples of distractions, including lighting, displays, and so on, that organizations use to reduce perceived waiting time. If the real problem is identified as perceived waiting time, it can make a big difference in the proposed solutions and their costs. For example, full-length mirrors probably cost a whole lot less than adding an elevator!

Sources: Based on J. Baker and M. Cameron, "The Effects of the Service Environment on Affect and Consumer Perception of Waiting Time: An Integrative Review and Research Propositions," Journal of the Academy of Marketing Science, Vol. 24, September 1996, pp. 338-349; R. Hesse and G. Woolsey, Applied Management Science: A Quick and Dirty Approach, SRA Inc., Chicago, 1975; R. C. Larson, "Perspectives on Queues: Social Justice and the Psychology of Queuing," Operations Research, Vol. 35, No. 6, November/December 1987, pp. 895-905.
• Outcomes (or results) may occur over an extended period. As a result, revenues, expenses, and profits will be recorded at different points in time. To overcome this difficulty, a present-value approach can be used if the results are quantifiable.
• It is assumed that future data will be similar to historical data. If this is not the case, the nature of the change has to be predicted and included in the analysis.

When the preliminary investigation is completed, it is possible to determine whether a problem really exists, where it is located, and how significant it is. A key issue is whether an information system is reporting a problem or only the symptoms of a problem. For example, if reports indicate that sales are down, there is a problem, but the situation, no doubt, is symptomatic of the problem. It is critical to know the real problem. Sometimes it may be a problem of perception, incentive mismatch, or organizational processes rather than a poor decision model.
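As a tiny illustration of this monitoring idea (all names and numbers below are invented, not from the text), the following Python sketch compares what is occurring against what is expected and flags large gaps as symptoms worth investigating. Deciding whether a flagged symptom reflects a machine fault, a supplier issue, or a wrong expectation is the real problem-identification work:

    # Intelligence-phase monitoring: flag gaps between expected and actual.
    expected_units_per_day = 1_000
    actual_units = {"Mon": 980, "Tue": 1_010, "Wed": 870, "Thu": 860, "Fri": 855}
    tolerance = 0.05  # flag shortfalls greater than 5 percent

    for day, units in actual_units.items():
        shortfall = (expected_units_per_day - units) / expected_units_per_day
        if shortfall > tolerance:
            print(f"{day}: output {units} is {shortfall:.0%} below expectation")
    # Wed, Thu, and Fri are flagged as symptoms of a possible problem.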
Problem Classification

Problem classification is the conceptualization of a problem in an attempt to place it in a definable category, possibly leading to a standard solution approach. An important approach classifies problems according to the degree of structuredness evident in them. This ranges from totally structured (i.e., programmed) to totally unstructured (i.e., unprogrammed), as described in Chapter 1.
Problem Decomposition

Many complex problems can be divided into subproblems. Solving the simpler subproblems may help in solving a complex problem. Also, seemingly poorly structured problems sometimes have highly structured subproblems. Just as a semistructured problem results when some phases of decision making are structured whereas other phases are unstructured, so when some subproblems of a decision-making problem are structured and others are unstructured, the problem itself is semistructured. As a DSS is developed and the decision maker and development staff learn more about the problem, the problem gains structure. Decomposition also facilitates communication among decision makers. Decomposition is one of the most important aspects of the analytic hierarchy process (AHP), discussed in Chapter 11, which helps decision makers incorporate both qualitative and quantitative factors into their decision-making models.
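A compact way to see the value of decomposition is a weighted-criteria sketch. The Python below is a deliberate simplification of AHP (real AHP derives the weights from pairwise-comparison matrices rather than assuming them), and every vendor name, weight, and score is invented:

    # Decompose "pick a vendor" into weighted subproblems (criteria),
    # score each subproblem separately, then roll the scores back up.
    criteria_weights = {"cost": 0.5, "quality": 0.3, "support": 0.2}

    scores = {  # each alternative scored per criterion on a 0-10 scale
        "Vendor A": {"cost": 8, "quality": 6, "support": 5},
        "Vendor B": {"cost": 5, "quality": 9, "support": 9},
    }

    def overall(alternative):
        return sum(criteria_weights[c] * s for c, s in scores[alternative].items())

    for name in scores:
        print(f"{name}: {overall(name):.2f}")  # Vendor A: 6.80, Vendor B: 7.00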
Problem Ownership

In the intelligence phase, it is important to establish problem ownership. A problem exists in an organization only if someone or some group takes on the responsibility of attacking it and if the organization has the ability to solve it. The assignment of authority to solve the problem is called problem ownership. For example, a manager may feel that he or she has a problem because interest rates are too high. Because interest rate levels are determined at the national and international levels, and most managers can do nothing about them, high interest rates are the problem of the government, not a problem for a specific company to solve. The problem companies actually face is how to operate in a high-interest-rate environment. For an individual company, the interest rate level should be handled as an uncontrollable (environmental) factor to be predicted.

When problem ownership is not established, either someone is not doing his or her job or the problem at hand has yet to be identified as belonging to anyone. It is then important for someone to either volunteer to own it or assign it to someone.

The intelligence phase ends with a formal problem statement.
SECTION 2.4 REVIEW QUESTIONS

1. What is the difference between a problem and its symptoms?
2. Why is it important to classify a problem?
3. What is meant by problem decomposition?
4. Why is establishing problem ownership so important in the decision-making process?
2.5 DECISION MAKING: THE DESIGN PHASE

The design phase involves finding or developing and analyzing possible courses of action. These include understanding the problem and testing solutions for feasibility. A model of the decision-making problem is constructed, tested, and validated. Let us first define a model.

Models1

A major characteristic of a DSS and many BI tools (notably those of business analytics) is the inclusion of at least one model. The basic idea is to perform the DSS analysis on a model of reality rather than on the real system. A model is a simplified representation or abstraction of reality. It is usually simplified because reality is too complex to describe exactly and because much of the complexity is actually irrelevant in solving a specific problem.

1 Caution: Many students and professionals view models strictly as those of "data modeling" in the context of systems analysis and design. Here, we consider analytical models such as those of linear programming, simulation, and forecasting.
Mathematical (Quantitative) Models

The complexity of relationships in many organizational systems is described mathematically. Most DSS analyses are performed numerically with mathematical or other quantitative models.
The Benefits of Models

We use models for the following reasons:

• Manipulating a model (changing decision variables or the environment) is much easier than manipulating a real system. Experimentation is easier and does not interfere with the organization's daily operations.
• Models enable the compression of time. Years of operations can be simulated in minutes or seconds of computer time.
• The cost of modeling analysis is much lower than the cost of a similar experiment conducted on a real system.
• The cost of making mistakes during a trial-and-error experiment is much lower when models are used than with real systems.
• The business environment involves considerable uncertainty. With modeling, a manager can estimate the risks resulting from specific actions.
• Mathematical models enable the analysis of a very large, sometimes infinite, number of possible solutions. Even in simple problems, managers often have a large number of alternatives from which to choose.
• Models enhance and reinforce learning and training.
• Models and solution methods are readily available.
Modeling involves conceptualizing a problem and abstracting it to quantitative and/or qualitative form (see Chapter 9). For a mathematical model, the variables are
identified, and their mutual relationships are established. Simplifications are made, whenever necessary, through assumptions. For example, a relationship between two variables may be assumed to be linear even though in reality there may be some nonlinear effects. A proper balance between the level of model simplification and the representation of reality must be obtained because of the cost-benefit trade-off. A simpler model leads to lower development costs, easier manipulation, and a faster solution but is less representative of the real problem and can produce inaccurate results. However, a simpler model generally requires fewer data, or the data are aggregated and easier to obtain.
The process of modeling is a combination of art and science. As a science, there are many standard model classes available, and, with practice, an analyst can determine which one is applicable to a given situation. As an art, creativity and finesse are required when determining what simplifying assumptions can work, how to combine appropriate features of the model classes, and how to integrate models to obtain valid solutions.
Models have decision variables that describe the alternatives from among which a manager must choose (e.g., how many cars to deliver to a specific rental agency, how to advertise at specific times, which Web server to buy or lease), a result variable or a set of result variables (e.g., profit, revenue, sales) that describes the objective or goal of the decision-making problem, and uncontrollable variables or parameters (e.g., economic conditions) that describe the environment. The process of modeling involves determining the (usually mathematical, sometimes symbolic) relationships among the variables. These topics are discussed in Chapter 9.
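The three kinds of variables are easy to see in a toy quantitative model. The following Python sketch is invented for illustration (the demand and cost equations are ours, not the book's):

    # decision variables: ad_spend, price (what the manager controls)
    # uncontrollable variable: market_growth (the environment)
    # result variable: the returned profit (the objective)
    def profit_model(ad_spend, price, market_growth):
        base_demand = 10_000
        demand = base_demand * (1 + market_growth) + 2.5 * ad_spend / price
        revenue = demand * price
        cost = 4.0 * demand + ad_spend + 20_000
        return revenue - cost

    # Evaluate one alternative under one assumed state of the environment.
    print(f"{profit_model(ad_spend=5_000, price=9.0, market_growth=0.03):,.0f}")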
Selection of a Principle of Choice

A principle of choice is a criterion that describes the acceptability of a solution approach. In a model, it is a result variable. Selecting a principle of choice is not part of the choice phase but involves how a person establishes decision-making objective(s) and incorporates the objective(s) into the model(s). Are we willing to assume high risk, or do we prefer a low-risk approach? Are we attempting to optimize or satisfice? It is also important to recognize the difference between a criterion and a constraint (see Technology Insights 2.1). Among the many principles of choice, normative and descriptive are of prime importance.
TECHNOLOGY INSIGHTS 2.1 The Difference Between a Criterion and a Constraint

Many people new to the formal study of decision making inadvertently confuse the concepts of criterion and constraint. Often, this is because a criterion may imply a constraint, either implicit or explicit, thereby adding to the confusion. For example, there may be a distance criterion that the decision maker does not want to travel too far from home. However, there is an implicit constraint that the alternatives from which he selects must be within a certain distance from his home. This constraint effectively says that if the distance from home is greater than a certain amount, then the alternative is not feasible, or, rather, that the distance to an alternative must be less than or equal to a certain number. (This would be a formal relationship in some models; in the model in this case, it reduces the search by considering fewer alternatives.) This is similar to what happens when selecting a university: schools beyond a single day's driving distance would not be considered by most people, and, in fact, the utility function (criterion value) of distance can start out low close to home, peak at about 70 miles (about 100 km), say, the distance between Atlanta (home) and Athens, Georgia, and sharply drop off thereafter.
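The university-selection example translates directly into code. In this Python sketch (ours; the shape of the utility curve, the cutoff, and all distances are invented), the constraint screens out infeasible alternatives outright, while the criterion merely ranks the survivors:

    MAX_MILES = 400  # constraint: beyond a day's drive is simply infeasible

    def distance_utility(miles):
        """Criterion value: low near home, peaking near 70 miles,
        dropping off sharply thereafter."""
        if miles <= 70:
            return miles / 70
        return max(0.0, 1 - (miles - 70) / 330)

    schools = {"Near U": 10, "Athens U": 70, "Mid U": 200, "Far U": 600}

    feasible = {s: d for s, d in schools.items() if d <= MAX_MILES}  # constraint
    ranked = sorted(feasible, key=lambda s: distance_utility(feasible[s]),
                    reverse=True)                                    # criterion
    print(ranked)  # ['Athens U', 'Mid U', 'Near U']; Far U was never ranked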
Normative Models

Normative models are models in which the chosen alternative is demonstrably the best of all possible alternatives. To find it, the decision maker should examine all the alternatives and prove that the one selected is indeed the best, which is what the person would normally want. This process is basically optimization. This is typically the goal of what we call prescriptive analytics (Part IV). In operational terms, optimization can be achieved in one of three ways (a small sketch after this list illustrates each route):

1. Get the highest level of goal attainment from a given set of resources. For example, which alternative will yield the maximum profit from an investment of $10 million?
2. Find the alternative with the highest ratio of goal attainment to cost (e.g., profit per dollar invested) or maximize productivity.
3. Find the alternative with the lowest cost (or smallest amount of other resources) that will meet an acceptable level of goals. For example, if your task is to select hardware for an intranet with a minimum bandwidth, which alternative will accomplish this goal at the least cost?
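Under the assumption of a small, fully enumerable set of alternatives (all names and figures below are invented), each of the three routes can be written as a one-line search in Python:

    alternatives = [
        {"name": "A", "cost": 10_000_000, "profit": 1_200_000},
        {"name": "B", "cost": 10_000_000, "profit": 1_500_000},
        {"name": "C", "cost": 6_000_000,  "profit": 1_100_000},
        {"name": "D", "cost": 4_000_000,  "profit":   900_000},
    ]

    # 1. Highest goal attainment from given resources ($10 million budget).
    best = max((a for a in alternatives if a["cost"] <= 10_000_000),
               key=lambda a: a["profit"])
    print("1:", best["name"])         # B

    # 2. Highest ratio of goal attainment to cost (profit per dollar).
    best_ratio = max(alternatives, key=lambda a: a["profit"] / a["cost"])
    print("2:", best_ratio["name"])   # D

    # 3. Lowest cost that meets an acceptable goal level ($1 million profit).
    cheapest = min((a for a in alternatives if a["profit"] >= 1_000_000),
                   key=lambda a: a["cost"])
    print("3:", cheapest["name"])     # C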
Normative decision theory is based on the following assumptions of rational decision makers:

• Humans are economic beings whose objective is to maximize the attainment of goals; that is, the decision maker is rational. (More of a good thing [revenue, fun] is better than less; less of a bad thing [cost, pain] is better than more.)
• For a decision-making situation, all viable alternative courses of action and their consequences, or at least the probability and the values of the consequences, are known.
• Decision makers have an order or preference that enables them to rank the desirability of all consequences of the analysis (best to worst).
Are decision makers really rational? Though there may be major anomalies in the presumed rationality of financial and economic behavior, we take the view that they could be caused by incompetence, lack of knowledge, multiple goals being framed inadequately, misunderstanding of a decision maker's true expected utility, and time-pressure impacts. There are other anomalies, often caused by time pressure. For example, Stewart (2002) described a number of researchers working with intuitive decision making. The idea of "thinking with your gut" is obviously a heuristic approach to decision making. It works well for firefighters and military personnel on the battlefield. One critical aspect of decision making in this mode is that many scenarios have been thought through in advance. Even when a situation is new, it can quickly be matched to an existing one on the fly, and a reasonable solution can be obtained (through pattern recognition). Luce et al. (2004) described how emotions affect decision making, and Pauly (2004) discussed inconsistencies in decision making.

We believe that irrationality is caused by the factors listed previously. For example, Tversky et al. (1990) investigated the phenomenon of preference reversal, which is a known problem in applying the AHP to problems. Also, some criterion or preference may be omitted from the analysis. Ratner et al. (1999) investigated how variety can cause individuals to choose less-preferred options, even though they will enjoy them less. But we maintain that variety clearly has value, is part of a decision maker's utility, and is a criterion and/or constraint that should be considered in decision making.
Suboptimization
By definition, optimization requires a decision maker to consider the impact of each alternative course of action on the entire organization because a decision made in one area may have significant effects (positive or negative) on other areas. Consider, for example, a
marketing department that implements an electronic commerce (e-commerce) site. Within hours, orders far exceed production capacity. The production department, which plans its own schedule, cannot meet demand. It may gear up for as high a demand as possible. Ideally and independently, the department should produce only a few products in extremely large quantities to minimize manufacturing costs. However, such a plan might result in large, costly inventories and marketing difficulties caused by the lack of a variety of products, especially if customers start to cancel orders that are not met in a timely way. This situation illustrates the sequential nature of decision making.
A systems point of view assesses the impact of every decision on the entire system. Thus, the marketing department should make its plans in conjunction with other departments. However, such an approach may require a complicated, expensive, time-consuming analysis. In practice, the MSS builder may close the system within narrow boundaries, considering only the part of the organization under study (the marketing and/or production department, in this case). By simplifying, the model then does not incorporate certain complicated relationships that describe interactions with and among the other departments. The other departments can be aggregated into simple model components. Such an approach is called suboptimization.

If a suboptimal decision is made in one part of the organization without considering the details of the rest of the organization, then an optimal solution from the point of view of that part may be inferior for the whole. However, suboptimization may still be a very practical approach to decision making, and many problems are first approached from this perspective. It is possible to reach tentative conclusions (and generally usable results) by analyzing only a portion of a system, without getting bogged down in too many details. After a solution is proposed, its potential effects on the remaining departments of the organization can be tested. If no significant negative effects are found, the solution can be implemented.
Suboptimization may also apply when simplifying assumptions are used in modeling a specific problem. There may be too many details or too many data to incorporate into a specific decision-making situation, and so not all of them are used in the model. If the solution to the model seems reasonable, it may be valid for the problem and thus be adopted. For example, in a production department, parts are often partitioned into A/B/C inventory categories. Generally, A items (e.g., large gears, whole assemblies) are expensive (say, $3,000 or more each), built to order in small batches, and inventoried in low quantities; C items (e.g., nuts, bolts, screws) are very inexpensive (say, less than $2) and ordered and used in very large quantities; and B items fall in between. All A items can be handled by a detailed scheduling model and physically monitored closely by management; B items are generally somewhat aggregated, their groupings are scheduled, and management reviews these parts less frequently; and C items are not scheduled but are simply acquired or built based on a policy defined by management with a simple economic order quantity (EOQ) ordering system that assumes constant annual demand. The policy might be reviewed once a year. This situation applies when determining all criteria or modeling the entire problem becomes prohibitively time-consuming or expensive.
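As a small illustration of the C-item policy just described, the classic EOQ formula balances annual ordering cost against annual holding cost. The demand and cost figures below are invented for illustration.

import math

def eoq(annual_demand, order_cost, holding_cost_per_unit):
    # Classic economic order quantity: sqrt(2 * D * S / H).
    # Assumes constant annual demand, as noted in the text.
    return math.sqrt(2 * annual_demand * order_cost / holding_cost_per_unit)

# Hypothetical C item: 50,000 screws/year, $20 per order, $0.04/unit/year to hold.
q = eoq(50_000, 20.0, 0.04)
print("Order about %d units per order, %.1f orders per year" % (q, 50_000 / q))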
Suboptimization may also involve simply bounding the search for an optimum (e.g., by a heuristic) by considering fewer criteria or alternatives or by eliminating large portions of the problem from evaluation. If it takes too long to solve a problem, a good-enough solution found already may be used and the optimization effort terminated.
Descriptive Models
Descriptive models describe things as they are or as they are believed to be. These models are typically mathematically based. Descriptive models are extremely useful in DSS for investigating the consequences of various alternative courses of action under
different configurations of inputs and processes. However, because a descriptive analysis checks the performance of the system for a given set of alternatives (rather than for all alternatives), there is no guarantee that an alternative selected with the aid of descriptive analysis is optimal. In many cases, it is only satisfactory.
Simulation is probably the most common descriptive modeling method. Simulation is the imitation of reality and has been applied to many areas of decision making. Computer and video games are a form of simulation: An artificial reality is created, and the game player lives within it. Virtual reality is also a form of simulation because the environment is simulated, not real. A common use of simulation is in manufacturing. Again, consider the production department of a firm with complications caused by the marketing department. The characteristics of each machine in a job shop along the supply chain can be described mathematically. Relationships can be established based on how each machine physically runs and relates to others. Given a trial schedule of batches of parts, it is possible to measure how batches flow through the system and to use the statistics from each machine. Alternative schedules may then be tried and the statistics recorded until a reasonable schedule is found. Marketing can examine access and purchase patterns on its Web site. Simulation can be used to determine how to structure a Web site for improved performance and to estimate future purchases. Both departments can therefore use primarily experimental modeling methods.
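What "trying alternative schedules and recording the statistics" can look like in code: the following minimal sketch (the batch times and the variability model are invented for illustration) simulates batches flowing through a single machine and compares two trial schedules by average flow time.

import random

def simulate(schedule, runs=1000, seed=42):
    # Batches are processed one after another on a single machine.
    # Each batch's actual processing time varies around its nominal time.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        clock = 0.0
        flow_times = []
        for nominal in schedule:
            clock += rng.uniform(0.8 * nominal, 1.2 * nominal)
            flow_times.append(clock)       # completion time of this batch
        total += sum(flow_times) / len(flow_times)
    return total / runs                    # average flow time across runs

batches = [8, 3, 5, 2, 9, 4]               # nominal hours, hypothetical
fifo = simulate(batches)
spt = simulate(sorted(batches))            # shortest-processing-time first
print("FIFO avg flow time: %.1f h, SPT: %.1f h" % (fifo, spt))

The shortest-processing-time schedule yields a lower average flow time than first-in-first-out here, the kind of recorded statistic a scheduler would compare before trying the next alternative.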
Classes of descriptive models include the following:

• Complex inventory decisions
• Environmental impact analysis
• Financial planning
• Information flow
• Markov analysis (predictions)
• Scenario analysis
• Simulation (alternative types)
• Technological forecasting
• Waiting-line (queuing) management
A number of nonmathematical descriptive models are available for decision making. One is the cognitive map (see Eden and Ackermann, 2002; and Jenkins, 2002). A cognitive map can help a decision maker sketch out the important qualitative factors and their causal relationships in a messy decision-making situation. This helps the decision maker (or decision-making group) focus on what is relevant and what is not, and the map evolves as more is learned about the problem. The map can help the decision maker understand issues better, focus better, and reach closure. One interesting software tool for cognitive mapping is Decision Explorer from Banxia Software Ltd. (banxia.com; try the demo).

Another descriptive decision-making model is the use of narratives to describe a decision-making situation. A narrative is a story that helps a decision maker uncover the important aspects of the situation and leads to better understanding and framing. This is extremely effective when a group is making a decision, and it can lead to a more common viewpoint, also called a frame. Juries in court trials typically use narrative-based approaches in reaching verdicts (see Allan, Frame, and Turney, 2003; Beach, 2005; and Denning, 2000).
Good Enough, or Satisficing
According to Simon (1977), most human decision making, whether organizational or individual, involves a willingness to settle for a satisfactory solution, "something less than the best." When satisficing, the decision maker sets up an aspiration, a goal, or a desired
level of performance and then searches the alternatives until one is found that achieves this level. The usual reasons for satisficing are time pressures (e.g., decisions may lose value over time), the inability to achieve optimization (e.g., solving some models could take a really long time), and recognition that the marginal benefit of a better solution is not worth the marginal cost to obtain it (e.g., in searching the Internet, you can look at only so many Web sites before you run out of time and energy). In such a situation, the decision maker is behaving rationally, though in reality he or she is satisficing. Essentially, satisficing is a form of suboptimization. There may be a best solution, an optimum, but it would be difficult, if not impossible, to attain it. With a normative model, too much computation may be involved; with a descriptive model, it may not be possible to evaluate all the sets of alternatives.
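A satisficing search is easy to express in code: scan alternatives in arrival order and stop at the first one that meets the aspiration level. The supplier bids and the scoring rule below are invented for illustration.

def satisfice(alternatives, evaluate, aspiration):
    # Return the first alternative whose value meets the aspiration level,
    # rather than scanning everything for the true optimum.
    for alt in alternatives:
        if evaluate(alt) >= aspiration:
            return alt
    return None                        # nothing good enough was found

# Hypothetical supplier bids, scored by (quality - cost) in arbitrary units.
bids = [{"name": "A", "quality": 7, "cost": 5},
        {"name": "B", "quality": 9, "cost": 4},
        {"name": "C", "quality": 10, "cost": 2}]
choice = satisfice(bids, lambda b: b["quality"] - b["cost"], aspiration=4)
print(choice["name"])   # prints "B": good enough, though "C" scores higher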
Related to satisficing is Simon's idea of bounded rationality. Humans have a limited capacity for rational thinking; they generally construct and analyze a simplified model of a real situation by considering fewer alternatives, criteria, and/or constraints than actually exist. Their behavior with respect to the simplified model may be rational. However, the rational solution for the simplified model may not be rational for the real-world problem. Rationality is bounded not only by limitations on human processing capacities, but also by individual differences, such as age, education, knowledge, and attitudes. Bounded rationality is also why many models are descriptive rather than normative. This may also explain why so many good managers rely on intuition, an important aspect of good decision making (see Stewart, 2002; and Pauly, 2004).
Because rationality and the use of normative models lead to good decisions, it is natural to ask why so many bad decisions are made in practice. Intuition is a critical factor that decision makers use in solving unstructured and semistructured problems. The best decision makers recognize the trade-off between the marginal cost of obtaining further information and analysis versus the benefit of making a better decision. But sometimes decisions must be made quickly, and, ideally, the intuition of a seasoned, excellent decision maker is called for. When adequate planning, funding, or information is not available, or when a decision maker is inexperienced or ill trained, disaster can strike.
Developing (Generating) Alternatives
A significant part of the model-building process is generating alternatives. In optimization models (such as linear programming), the alternatives may be generated automatically by the model. In most decision situations, however, it is necessary to generate alternatives manually. This can be a lengthy process that involves searching and creativity, perhaps utilizing electronic brainstorming in a GSS. It takes time and costs money. Issues such as when to stop generating alternatives can be very important. Too many alternatives can be detrimental to the process of decision making. A decision maker may suffer from information overload.
Generating alternatives is heavily dependent on the availability and cost of information and requires expertise in the problem area. This is the least formal aspect of problem solving. Alternatives can be generated and evaluated using heuristics. The generation of alternatives from either individuals or groups can be supported by electronic brainstorming software in a Web-based GSS.

Note that the search for alternatives usually occurs after the criteria for evaluating the alternatives are determined. This sequence can ease the search for alternatives and reduce the effort involved in evaluating them, but identifying potential alternatives can sometimes aid in identifying criteria.
The outcome of every proposed alternative must be established. Depending on whether the decision-making problem is classified as one of certainty, risk, or uncertainty, different modeling approaches may be used (see Drummond, 2001; and Koller, 2000). These are discussed in Chapter 9.
Measuring Outcomes
The value of an alternative is evaluated in terms of goal attainment. Sometimes an outcome is expressed directly in terms of a goal. For example, profit is an outcome, profit maximization is a goal, and both are expressed in dollar terms. An outcome such as customer satisfaction may be measured by the number of complaints, by the level of loyalty to a product, or by ratings found through surveys. Ideally, a decision maker would want to deal with a single goal, but in practice, it is not unusual to have multiple goals (see Barba-Romero, 2001; and Koksalan and Zionts, 2001). When groups make decisions, each group participant may have a different agenda. For example, executives might want to maximize profit, marketing might want to maximize market penetration, operations might want to minimize costs, and stockholders might want to maximize the bottom line. Typically, these goals conflict, so special multiple-criteria methodologies have been developed to handle this. One such method is the AHP. We will study AHP in Chapter 9.
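The AHP itself is deferred to Chapter 9, but the flavor of multiple-criteria evaluation can be shown with a simpler weighted additive score (a common shortcut, not the AHP). The alternatives, criteria weights, and 0-10 scores below are hypothetical.

# Simple weighted additive scoring across conflicting goals (not the AHP).
# Weights and 0-10 scores are hypothetical.
weights = {"profit": 0.5, "market_share": 0.3, "cost_control": 0.2}

alternatives = {
    "expand_online": {"profit": 8, "market_share": 9, "cost_control": 4},
    "cut_costs":     {"profit": 6, "market_share": 3, "cost_control": 9},
    "new_product":   {"profit": 7, "market_share": 7, "cost_control": 5},
}

def weighted_score(scores):
    return sum(weights[c] * scores[c] for c in weights)

# Rank the alternatives from best to worst by weighted score.
for name, scores in sorted(alternatives.items(),
                           key=lambda kv: -weighted_score(kv[1])):
    print("%-14s %.2f" % (name, weighted_score(scores)))

Changing the weights to reflect a different stakeholder's agenda (say, weighting cost control most heavily) can reverse the ranking, which is precisely why the conflicting-goals problem needs explicit methodology.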
Risk
All decisions are made in an inherently unstable environment. This is due to the many unpredictable events in both the economic and physical environments. Some risk (measured as probability) may be due to internal organizational events, such as a valued employee quitting or becoming ill, whereas others may be due to natural disasters, such as a hurricane. Aside from the human toll, one economic aspect of Hurricane Katrina was that the price of a gallon of gasoline doubled overnight due to uncertainty in the port capabilities, refining, and pipelines of the southern United States. What can a decision maker do in the face of such instability?

In general, people have a tendency to measure uncertainty and risk badly. Purdy (2005) said that people tend to be overconfident and have an illusion of control in decision making. The results of experiments by Adam Goodie at the University of Georgia indicate that most people are overconfident most of the time (Goodie, 2004). This may explain why people often feel that one more pull of a slot machine will definitely pay off.
However, methodologies for handling extreme uncertainty do exist. For example, Yakov (2001) described a way to make good decisions based on very little information, using an information gap theory and methodology approach. Aside from estimating the potential utility or value of a particular decision's outcome, the best decision makers are capable of accurately estimating the risk associated with the outcomes that result from making each decision. Thus, one important task of a decision maker is to attribute a level of risk to the outcome associated with each potential alternative being considered. Some decisions may lead to unacceptable risks in terms of success and can therefore be discarded or discounted immediately.

In some cases, some decisions are assumed to be made under conditions of certainty simply because the environment is assumed to be stable. Other decisions are made under conditions of uncertainty, where risk is unknown. Still, a good decision maker can make working estimates of risk. Also, the process of developing BI/DSS involves learning more about the situation, which leads to a more accurate assessment of the risks.
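One simple way to attribute a level of risk to an alternative is Monte Carlo estimation: simulate the uncertain inputs many times and report both the expected outcome and the probability of an unacceptable one. The demand distribution and cost figures in this sketch are invented for illustration.

import random

def profit(demand, price=30.0, unit_cost=18.0, fixed_cost=50_000.0):
    # Hypothetical single-product profit model.
    return demand * (price - unit_cost) - fixed_cost

rng = random.Random(7)
# Demand is uncertain: assume it is roughly normal around 5,000 units.
outcomes = [profit(rng.gauss(5_000, 1_500)) for _ in range(100_000)]

expected = sum(outcomes) / len(outcomes)
p_loss = sum(o < 0 for o in outcomes) / len(outcomes)
print("expected profit: $%.0f" % expected)
print("probability of a loss: %.1f%%" % (100 * p_loss))

Two alternatives with the same expected profit can carry very different loss probabilities, which is the distinction a risk-aware decision maker needs to see.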
Scenarios
A scenario is a statement of assumptions about the operating environment of a particular system at a given time; that is, it is a narrative description of the decision-situation setting. A scenario describes the decision and uncontrollable variables and parameters for a specific modeling situation. It may also provide the procedures and constraints for the modeling.

Scenarios originated in the theater, and the term was borrowed for war gaming and large-scale simulations. Scenario planning and analysis is a DSS tool that can capture a whole range of possibilities. A manager can construct a series of scenarios (i.e., what-if cases), perform computerized analyses, and learn more about the system and decision-making problem while analyzing it. Ideally, the manager can identify an excellent, possibly optimal, solution to the model of the problem.

Scenarios are especially helpful in simulations and what-if analyses. In both cases, we change scenarios and examine the results. For example, we can change the anticipated demand for hospitalization (an input variable for planning), thus creating a new scenario. Then we can measure the anticipated cash flow of the hospital for each scenario.
Scenarios play an important role in decision making because they:

• Help identify opportunities and problem areas
• Provide flexibility in planning
• Identify the leading edges of changes that management should monitor
• Help validate major modeling assumptions
• Allow the decision maker to explore the behavior of a system through a model
• Help to check the sensitivity of proposed solutions to changes in the environment, as described by the scenario
Possible Scenarios
There may be thousands of possible scenarios for every decision situation. However, the following are especially useful in practice:

• The worst possible scenario
• The best possible scenario
• The most likely scenario
• The average scenario

The scenario determines the context of the analysis to be performed; a small sketch of this four-scenario analysis follows.
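Here is the promised sketch, reusing the hospital cash-flow idea from earlier in this section; all admission counts and dollar figures are invented for illustration.

# What-if analysis of a hypothetical hospital's cash flow under the four
# scenario types listed above. All parameters are illustrative assumptions.
def cash_flow(admissions, revenue_per_admission=9_000.0,
              variable_cost=6_500.0, fixed_cost=8_000_000.0):
    # Contribution margin per admission times volume, less fixed cost.
    return admissions * (revenue_per_admission - variable_cost) - fixed_cost

scenarios = {"worst": 2_800, "best": 5_200, "most likely": 4_000}
scenarios["average"] = sum(scenarios.values()) // len(scenarios)

for name, admissions in scenarios.items():
    print(f"{name:<12} admissions={admissions:5,d}  "
          f"cash flow: ${cash_flow(admissions):>12,.0f}")

Under these assumptions the worst scenario shows a negative cash flow, which tells the planner exactly which environmental changes (admission demand) to monitor.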
Errors in Decision Making
The model is a critical component in the decision-making process, but a decision maker may make a number of errors in its development and use. Validating the model before it is used is critical. Gathering the right amount of information, with the right level of precision and accuracy, to incorporate into the decision-making process is also critical. Sawyer (1999) described "the seven deadly sins of decision making," most of which are behavior or information related.
SECTION 2.5 REVIEW QUESTIONS
1. Define optimization and contrast it with suboptimization.
2. Compare the normative and descriptive approaches to decision making.
3. Define rational decision making. What does it really mean to be a rational decision maker?
4. Why do people exhibit bounded rationality when solving problems?
5. Define scenario. How is a scenario used in decision making?
6. Some "errors" in decision making can be attributed to the notion of decision making from the gut. Explain what is meant by this and how such errors can happen.
2.6 DECISION MAKING: THE CHOICE PHASE
Choice is the critical act of decision making. The choice phase is the one in which the actual decision and the commitment to follow a certain course of action are made. The boundary between the design and choice phases is often unclear because certain activities can be performed during both of them and because the decision maker can return frequently from choice activities to design activities (e.g., generate new alternatives while performing an evaluation of existing ones). The choice phase includes the search for, evaluation of, and recommendation of an appropriate solution to a model. A solution to a model is a specific set of values for the decision variables in a selected alternative. Choices can be evaluated as to their viability and profitability.

Note that solving a model is not the same as solving the problem the model represents. The solution to the model yields a recommended solution to the problem. The problem is considered solved only if the recommended solution is successfully implemented.
Solving a decision-making model involves searching for an appropriate course of action. Search approaches include analytical techniques (i.e., solving a formula), algorithms (i.e., step-by-step procedures), heuristics (i.e., rules of thumb), and blind searches (i.e., shooting in the dark, ideally in a logical way). These approaches are examined in Chapter 9.
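The difference between a heuristic and a blind search can be made concrete with a toy example (the cost curve and all parameters are invented for illustration): finding a low-cost value of a single decision variable by hill climbing versus by random guessing.

import random

def cost(x):
    # Toy cost curve for a single decision variable (illustrative only).
    return (x - 37) ** 2 + 10

def hill_climb(start=0, step=1, limit=1000):
    # Heuristic search: repeatedly move to the cheaper neighbor.
    x = start
    for _ in range(limit):
        neighbor = min((x - step, x + step), key=cost)
        if cost(neighbor) >= cost(x):
            return x                   # local minimum reached
        x = neighbor
    return x

def blind_search(tries=50, seed=1):
    # Blind search: sample candidate values at random, keep the best seen.
    rng = random.Random(seed)
    return min((rng.randint(-100, 100) for _ in range(tries)), key=cost)

print("heuristic:", hill_climb(), "blind:", blind_search())

Hill climbing exploits local structure and lands exactly on the minimum here; the blind search only gets close. On less well-behaved cost curves, the heuristic can instead be trapped in a local minimum, which is one reason both approaches are studied in Chapter 9.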
Each alternative must be evaluated. If an alternative has multiple goals, they must all be examined and balanced against each other. Sensitivity analysis is used to determine the robustness of any given alternative; slight changes in the parameters should ideally lead to slight or no changes in the alternative chosen. What-if analysis is used to explore major changes in the parameters. Goal seeking helps a manager determine values of the decision variables to meet a specific objective. All this is discussed in Chapter 9.
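Goal seeking inverts a model: instead of computing the outcome from the inputs, it searches for the input value that produces a target outcome. Below is a minimal sketch, with a hypothetical profit model and target, that uses bisection (one of several possible search methods) to find the sales volume needed to hit a profit goal.

def profit(units, price=25.0, unit_cost=14.0, fixed_cost=120_000.0):
    # Hypothetical profit model: the "forward" calculation.
    return units * (price - unit_cost) - fixed_cost

def goal_seek(target, lo=0.0, hi=1_000_000.0, tol=0.01):
    # Bisection: profit() is increasing in units, so halve the interval
    # until the profit at the midpoint is close enough to the target.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if profit(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

units = goal_seek(target=50_000.0)
print("Sell about %.0f units to earn $50,000" % units)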
SECTION 2.6 REVIEW QUESTIONS
1. Explain the difference between a principle of choice and the actual choice phase of decision making.
2. Why do some people claim that the choice phase is the point in time when a decision is really made?
3. How can sensitivity analysis help in the choice phase?
2.7 DECISION MAKING: THE IMPLEMENTATION PHASE
In The Prince, Machiavelli astutely noted some 500 years ago that there was "nothing more difficult to carry out, nor more doubtful of success, nor more dangerous to handle, than to initiate a new order of things." The implementation of a proposed solution to a problem is, in effect, the initiation of a new order of things or the introduction of change. And change must be managed. User expectations must be managed as part of change management.

The definition of implementation is somewhat complicated because implementation is a long, involved process with vague boundaries. Simplistically, the implementation phase involves putting a recommended solution to work, not necessarily implementing a computer system. Many generic implementation issues, such as resistance to change, degree of support of top management, and user training, are important in dealing with
information system supported decision making. Indeed, many previous technology-related waves (e.g., business process reengineering [BPR], knowledge management, etc.) have faced mixed results mainly because of change management challenges and issues. Management of change is almost an entire discipline in itself, so we recognize its importance and encourage the readers to focus on it independently. Implementation also includes a thorough understanding of project management. The importance of project management goes far beyond analytics, so the last few years have witnessed a major growth in certification programs for project managers. A very popular certification now is Project Management Professional (PMP). See pmi.org for more details.
Implementation must also involve collecting and analyzing data to learn from the previous decisions and improve the next decision. Although analysis of data is usually conducted to identify the problem and/or the solution, analytics should also be employed in the feedback process. This is especially true for any public policy decisions. We need to be sure that the data being used for problem identification is valid. Sometimes people find this out only after the implementation phase.

The decision-making process, though conducted by people, can be improved with computer support, which is the subject of the next section.
SECTION 2.7 REVIEW QUESTIONS
1. Define implementation.
2. How can DSS support the implementation of a decision?
2.8 HOW DECISIONS ARE SUPPORTED
In Chapter 1, we discussed the need for computerized decision support and briefly described some decision aids. Here we relate specific technologies to the decision-making process (see Figure 2.2). Databases, data marts, and especially data warehouses are important technologies in supporting all phases of decision making. They provide the data that drive decision making.
Support for the Intelligence Phase
The primary requirement of decision support for the intelligence phase is the ability to scan external and internal information sources for opportunities and problems and to interpret what the scanning discovers.

[Figure 2.2 DSS Support: the intelligence, design, choice, and implementation phases, each mapped to supporting technologies such as MIS, data mining, OLAP, ES, ERP, ESS, SCM, CRM, KMS, ANN, GSS, and management science.]

Web tools and sources are extremely useful for environmental scanning. Web browsers provide useful front ends for a variety of tools, from OLAP to data
mining and data warehouses. Data sources can be internal or external. Internal sources may
be accessible via a corporate intranet. External sources are many and varied.
Decision support/BI technologies can be very helpful. For example, a data warehouse can support the intelligence phase by continuously monitoring both internal and external information, looking for early signs of problems and opportunities through a Web-based enterprise information portal (also called a dashboard). Similarly, (automatic) data (and Web) mining (which may include expert systems [ES], CRM, genetic algorithms, neural networks, and other analytics systems) and (manual) OLAP also support the intelligence phase by identifying relationships among activities and other factors. Geographic information systems (GIS) can be utilized either as stand-alone systems or integrated with these systems so that a decision maker can determine opportunities and problems in a spatial sense. These relationships can be exploited for competitive advantage (e.g., CRM identifies classes of customers to approach with specific products and services). A KMS can be used to identify similar past situations and how they were handled. GSS can be used to share information and for brainstorming. As seen in Chapter 14, even cell phone and GPS data can be captured to create a micro-view of customers and their habits.
Another aspect of identifying internal problems and capabilities involves monitoring the current status of operations. When something goes wrong, it can be identified quickly and the problem can be solved. Tools such as business activity monitoring (BAM), business process management (BPM), and product life-cycle management (PLM) provide such capability to decision makers. Both routine and ad hoc reports can aid in the intelligence phase. For example, regular reports can be designed to assist in the problem-finding activity by comparing expectations with current and projected performance. Web-based OLAP tools are excellent at this task. So are visualization tools and electronic document management systems.
Expert systems (ES), in contrast, can render advice regarding the nature of a problem, its classification, its seriousness, and the like. ES can advise on the suitability of a solution approach and the likelihood of successfully solving the problem. One of the primary areas of ES success is interpreting information and diagnosing problems. This capability can be exploited in the intelligence phase. Even intelligent agents can be used to identify opportunities.

Much of the information used in seeking new opportunities is qualitative, or soft. This indicates a high level of unstructuredness in the problems, thus making DSS quite useful in the intelligence phase.
The Internet and advanced database technologies have created a glut of data and information available to decision makers, so much that it can detract from the quality and speed of decision making. It is important to recognize some issues in using data and analytics tools for decision making. First, to paraphrase baseball great Vin Scully, "data should be used the way a drunk uses a lamppost. For support, not for illumination." This is especially true when the focus is on understanding the problem. We should recognize that not all the data that may help understand the problem is available. To quote Einstein, "Not everything that counts can be counted, and not everything that can be counted counts." There might be other issues that have to be recognized as well.
Support for the Design Phase
The design phase involves generating alternative courses of action, discussing the criteria for choices and their relative importance, and forecasting the future consequences of using various alternatives. Several of these activities can use standard models provided by a DSS (e.g., financial and forecasting models, available as applets). Alternatives for structured problems can be generated through the use of either standard or special models.
However, the generation of alternatives for complex problems requires expertise that can be provided only by a human, brainstorming software, or an ES. OLAP and data mining software are quite useful in identifying relationships that can be used in models. Most DSS have quantitative analysis capabilities, and an internal ES can assist with qualitative methods as well as with the expertise required in selecting quantitative analysis and forecasting models. A KMS should certainly be consulted to determine whether such a problem has been encountered before or whether there are experts on hand who can provide quick understanding and answers. CRM systems, revenue management systems, ERP, and SCM systems software are useful in that they provide models of business processes that can test assumptions and scenarios. If a problem requires brainstorming to help identify important issues and options, a GSS may prove helpful. Tools that provide cognitive mapping can also help. Cohen et al. (2001) described several Web-based tools that provide decision support, mainly in the design phase, by providing models and reporting of alternative results. Each of their cases has saved millions of dollars annually by utilizing these tools. Such DSS are helping engineers in product design as well as decision makers solving business problems.
Support for the Choice Phase
In addition to providing models that rapidly identify a best or good-enough alternative, a DSS can support the choice phase through what-if and goal-seeking analyses. Different scenarios can be tested for the selected option to reinforce the final decision. Again, a KMS helps identify similar past experiences; CRM, ERP, and SCM systems are used to test the impacts of decisions in establishing their value, leading to an intelligent choice. An ES can be used to assess the desirability of certain solutions as well as to recommend an appropriate solution. If a group makes a decision, a GSS can provide support to lead to consensus.
Support for the Implementation Phase
This is where "making the decision happen" occurs. The DSS benefits provided during implementation may be as important as or even more important than those in the earlier phases. DSS can be used in implementation activities such as decision communication, explanation, and justification.

Implementation-phase DSS benefits are partly due to the vividness and detail of analyses and reports. For example, one chief executive officer (CEO) gives employees and external parties not only the aggregate financial goals and cash needs for the near term, but also the calculations, intermediate results, and statistics used in determining the aggregate figures. In addition to communicating the financial goals unambiguously, the CEO signals other messages. Employees know that the CEO has thought through the assumptions behind the financial goals and is serious about their importance and attainability. Bankers and directors are shown that the CEO was personally involved in analyzing cash needs and is aware of and responsible for the implications of the financing requests prepared by the finance department. Each of these messages improves decision implementation in some way.
As mentioned earlier, reporting systems and other tools variously labeled as BAM, BPM, KMS, EIS, ERP, CRM, and SCM are all useful in tracking how well an implementation is working. GSS is useful for a team to collaborate in establishing implementation effectiveness. For example, a decision might be made to get rid of unprofitable customers. An effective CRM can identify classes of customers to get rid of, identify the impact of doing so, and then verify that it really worked that way.

All phases of the decision-making process can be supported by improved communication through collaborative computing via GSS and KMS. Computerized systems can facilitate communication by helping people explain and justify their suggestions and opinions.
Decision implementation can also be supported by ES. An ES can be used as an advisory system regarding implementation problems (such as handling resistance to change). Finally, an ES can provide training that may smooth the course of implementation.

Impacts along the value chain, though reported by an EIS through a Web-based enterprise information portal, are typically identified by BAM, BPM, SCM, and ERP systems. CRM systems report and update internal records, based on the impacts of the implementation. These inputs are then used to identify new problems and opportunities, a return to the intelligence phase.
SECTION 2.8 REVIEW QUESTIONS
1. Describe how DSS/BI technologies and tools can aid in each phase of decision making.
2. Describe how new technologies can provide decision-making support.
Now that we have studied how technology can assist in decision making, we study some
details of decision support systems (DSS) in the next two sections.
2.9 DECISION SUPPORT SYSTEMS: CAPABILITIES
The early definitions of a DSS identified it as a system intended to support managerial decision makers in semistructured and unstructured decision situations. DSS were meant to be adjuncts to decision makers, extending their capabilities but not replacing their judgment. They were aimed at decisions that required judgment or at decisions that could not be completely supported by algorithms. Not specifically stated but implied in the early definitions was the notion that the system would be computer based, would operate interactively online, and preferably would have graphical output capabilities, now simplified via browsers and mobile devices.
A DSS Application
A DSS is typically built to support the solution of a certain problem or to evaluate an
opportunity. This is a key difference between DSS and BI applications. In a very strict
sense, business intelligence (BI) systems monitor situations and identify problems and/
or opportunities, using analytic methods. Reporting plays a major role in BI; the user
generally must identify whether a particular situation warrants attention, and then analyti-
cal methods can be applied. Again, although models and data access (generally through
a data warehouse) are included in BI, DSS typically have their own databases and are
developed to solve a specific problem or set of problems. They are therefore called
DSS applications.
Formally, a DSS is an approach (or methodology) for supporting decision making. It uses an interactive, flexible, adaptable computer-based information system (CBIS) especially developed for supporting the solution to a specific unstructured management problem. It uses data, provides an easy user interface, and can incorporate the decision maker's own insights. In addition, a DSS includes models and is developed (possibly by end users) through an interactive and iterative process. It can support all phases of decision making and may include a knowledge component. Finally, a DSS can be used by a single user or can be Web based for use by many people at several locations.

Because there is no consensus on exactly what a DSS is, there is obviously no agreement on the standard characteristics and capabilities of DSS. The capabilities in Figure 2.3 constitute an ideal set, some members of which are described in the definitions of DSS and illustrated in the application cases.
[Figure 2.3 Key Characteristics and Capabilities of DSS: the 14 capabilities enumerated below, arranged around a central DSS core.]

The key characteristics and capabilities of DSS (as shown in Figure 2.3) are:
1. Support for decision makers, mainly in semistructured and unstructured situations, by bringing together human judgment and computerized information. Such problems cannot be solved (or cannot be solved conveniently) by other computerized systems or through use of standard quantitative methods or tools. Generally, these problems gain structure as the DSS is developed. Even some structured problems have been solved by DSS.
2. Support for all managerial levels, ranging from top executives to line managers.
3. Support for individuals as well as groups. Less-structured problems often require the involvement of individuals from different departments and organizational levels or even from different organizations. DSS support virtual teams through collaborative Web tools. DSS have been developed to support individual and group work, as well as to support individual decision making and groups of decision makers working somewhat independently.
4. Support for interdependent and/or sequential decisions. The decisions may be made once, several times, or repeatedly.
5. Support in all phases of the decision-making process: intelligence, design, choice, and implementation.
6. Support for a variety of decision-making processes and styles.
7. The decision maker should be reactive, able to confront changing conditions quickly, and able to adapt the DSS to meet these changes. DSS are flexible, so users can add, delete, combine, change, or rearrange basic elements. They are also flexible in that they can be readily modified to solve other, similar problems.
8. User-friendliness, strong graphical capabilities, and a natural language interactive human-machine interface can greatly increase the effectiveness of DSS. Most new DSS applications use Web-based interfaces or mobile platform interfaces.
9. Improvement of the effectiveness of decision making (e.g., accuracy, timeliness, quality) rather than its efficiency (e.g., the cost of making decisions). When DSS are deployed, decision making often takes longer, but the decisions are better.
10. The decision maker has complete control over all steps of the decision-making process in solving a problem. A DSS specifically aims to support, not to replace, the decision maker.
11. End users are able to develop and modify simple systems by themselves. Larger systems can be built with assistance from information system (IS) specialists. Spreadsheet packages have been utilized in developing simpler systems. OLAP and data mining software, in conjunction with data warehouses, enable users to build fairly large, complex DSS.
12. Models are generally utilized to analyze decision-making situations. The modeling capability enables experimentation with different strategies under different configurations.
13. Access is provided to a variety of data sources, formats, and types, including GIS, multimedia, and object-oriented data.
14. The DSS can be employed as a stand-alone tool used by an individual decision maker in one location or distributed throughout an organization and in several organizations along the supply chain. It can be integrated with other DSS and/or applications, and it can be distributed internally and externally, using networking and Web technologies.
These key DSS characteristics and capabilities allow decision makers to make better, more consistent decisions in a timely manner, and they are provided by the major DSS components, which we will describe after discussing various ways of classifying DSS (next).
SECTION 2.9 REVIEW QUESTIONS
1. List the key characteristics and capabilities of DSS.
2. Describe how providing support to a workgroup is different from providing support to group work. Explain why it is important to differentiate these concepts.
3. What kinds of DSS can end users develop in spreadsheets?
4. Why is it so important to include a model in a DSS?
2.10 DSS CLASSIFICATIONS
DSS applications have been classified in several different ways (see Power, 2002; Power and Sharda, 2009). The design process, as well as the operation and implementation of DSS, depends in many cases on the type of DSS involved. However, remember that not every DSS fits neatly into one category. Most fit into the classification provided by the Association for Information Systems Special Interest Group on Decision Support Systems (AIS SIGDSS). We discuss this classification but also point out a few other attempts at classifying DSS.
The AIS SIGDSS Classification for DSS
The AIS SIGDSS (ais.site-ym.com/group/SIGDSS) has adopted a concise classification scheme for DSS that was proposed by Power (2002). It includes the following categories:

• Communications-driven and group DSS (GSS)
• Data-driven DSS
• Document-driven DSS
• Knowledge-driven DSS, data mining, and management ES applications
• Model-driven DSS
There may also be hybrids that combine two or more categories. These are called
compound DSS. We discuss the major categories next.
COMMUNICATIONS-DRIVEN AND GROUP DSS Communications-driven and group DSS (GSS) include DSS that use computer, collaboration, and communication technologies to support groups in tasks that may or may not include decision making. Essentially, all DSS that support any kind of group work fall into this category. They include those that support meetings, design collaboration, and even supply chain management. Knowledge management systems (KMS) that are developed around communities that practice collaborative work also fall into this category. We discuss these in more detail in later chapters.
DATA-DRIVEN DSS Data-driven DSS are primarily involved with data and processing them into information and presenting the information to a decision maker. Many DSS developed in OLAP and reporting analytics software systems fall into this category. There is minimal emphasis on the use of mathematical models.

In this type of DSS, the database organization, often in a data warehouse, plays a major role in the DSS structure. Early generations of database-oriented DSS mainly used the relational database configuration. The information handled by relational databases tends to be voluminous, descriptive, and rigidly structured. A database-oriented DSS features strong report generation and query capabilities. Indeed, this is primarily the current application of the tools marketed under the BI umbrella or under the label of reporting/business analytics. The chapters on data warehousing and business performance management (BPM) describe several examples of this category of DSS.
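The report-and-query character of a data-driven DSS can be illustrated in a few lines with Python's built-in sqlite3 module; the sales table and its figures are invented for illustration.

import sqlite3

# A toy data-driven DSS: load sales facts, then answer an ad hoc query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "widgets", 120_000), ("East", "gadgets", 80_000),
    ("West", "widgets", 95_000), ("West", "gadgets", 140_000),
])

# Report: revenue by region, highest first, the kind of summary a
# database-oriented DSS would surface on a dashboard.
for region, total in con.execute(
        "SELECT region, SUM(revenue) AS total FROM sales "
        "GROUP BY region ORDER BY total DESC"):
    print(region, total)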
DOCUMENT-DRIVEN DSS Document-driven DSS rely on knowledge coding, analysis, search, and retrieval for decision support. They essentially include all DSS that are text based. Most KMS fall into this category. These DSS also have minimal emphasis on utilizing mathematical models. For example, a system that we built for the U.S. Army's Defense Ammunitions Center falls in this category. The main objective of document-driven DSS is to provide support for decision making using documents in various forms: oral, written, and multimedia.
KNOWLEDGE-DRIVEN DSS, DATA MINING, AND MANAGEMENT EXPERT SYSTEMS APPLICATIONS These DSS involve the application of knowledge technologies to address specific decision support needs. Essentially, all artificial intelligence-based DSS fall into this category. When symbolic storage is utilized in a DSS, it is generally in this category. ANN and ES are included here. Because the benefits of these intelligent DSS or knowledge-based DSS can be large, organizations have invested in them. These DSS are utilized in the creation of automated decision-making systems, as described in Chapter 12. The basic idea is that rules are used to automate the decision-making process. These rules are basically either an ES or structured like one. This is important when decisions must be made quickly, as in many e-commerce situations.
MODEL-DRIVEN DSS The major emphases of DSS that are primarily developed around one or more (large-scale/complex) optimization or simulation models typically include significant activities in model formulation, model maintenance, model management in distributed computing environments, and what-if analyses. Many large-scale applications fall into this category. Notable examples include those used by Procter & Gamble (Farasyn et al., 2008), HP (Olavson and Fry, 2008), and many others.

The focus of such systems is on using the model(s) to optimize one or more objectives (e.g., profit). The most common end-user tool for DSS development is Microsoft Excel. Excel includes dozens of statistical packages, a linear programming package (Solver), and many financial and management science models. We will study these in more detail in Chapter 9. These DSS typically can be grouped under the new label of prescriptive analytics.
COMPOUND DSS A compound, or hybrid, DSS includes two or more of the major categories described earlier. Often, an ES can benefit by utilizing some optimization, and clearly a data-driven DSS can feed a large-scale optimization model. Sometimes documents are critical in understanding how to interpret the results of visualizing data from a data-driven DSS.

An emerging example of a compound DSS is a product offered by WolframAlpha (wolframalpha.com). It compiles knowledge from outside databases, models, algorithms, documents, and so on to provide answers to specific questions. For example, it can find and analyze current data for a stock and compare it with other stocks. It can also tell you how many calories you will burn when performing a specific exercise or the side effects of a particular medicine. Although it is in early stages as a collection of knowledge components from many different areas, it is a good example of a compound DSS in getting its knowledge from many diverse sources and attempting to synthesize it.
Other DSS Categories
Many other proposals have been made to classify DSS. Perhaps the first formal attempt
was by Alter (1980). Several other important categories of DSS include (1) institutional
and ad hoc DSS; (2) personal, group, and organizational support; (3) individual support
system versus GSS; and (4) custom-made systems versus ready-made systems. We discuss
some of these next.
INSTITUTIONAL AND AD HOC DSS Institutional DSS (see Donovan and Madnick, 1977) deal with decisions of a recurring nature. A typical example is a portfolio management system (PMS), which has been used by several large banks for supporting investment decisions. An institutionalized DSS can be developed and refined as it evolves over a number of years, because the DSS is used repeatedly to solve identical or similar problems. It is important to remember that an institutional DSS may not be used by everyone in an organization; it is the recurring nature of the decision-making problem that determines whether a DSS is institutional versus ad hoc.

Ad hoc DSS deal with specific problems that are usually neither anticipated nor recurring. Ad hoc decisions often involve strategic planning issues and sometimes management control problems. Justifying a DSS that will be used only once or twice is a major issue in DSS development. Countless ad hoc DSS applications have evolved into institutional DSS. Either the problem recurs and the system is reused or others in the organization have similar needs that can be handled by the formerly ad hoc DSS.
Custom-Made Systems Versus Ready-Made Systems
Many DSS are custom made for individual users and organizations. However, a comparable problem may exist in similar organizations. For example, hospitals, banks, and universities share many similar problems. Similarly, certain nonroutine problems in a functional area (e.g., finance, accounting) can repeat themselves in the same functional area of different organizations. Therefore, it makes sense to build generic DSS that can be used (sometimes with modifications) in several organizations. Such DSS are called ready-made and are sold by various vendors (e.g., Cognos, MicroStrategy, Teradata). Essentially, the database, models, interface, and other support features are built in: Just add an organization's data and logo. The major OLAP and analytics vendors provide DSS templates for a variety of functional areas, including finance, real estate, marketing, and accounting. The number of ready-made DSS continues to increase because of their flexibility and low cost. They are typically developed using Internet technologies for database access and communications, and Web browsers for interfaces. They also readily incorporate OLAP and other easy-to-use DSS generators.

One complication in terminology results when an organization develops an institutional system but, because of its structure, uses it in an ad hoc manner. An organization can build a large data warehouse but then use OLAP tools to query it and perform ad hoc analysis to solve nonrecurring problems. The DSS exhibits the traits of ad hoc and institutional systems and also of custom and ready-made systems. Several ERP, CRM, knowledge management (KM), and SCM companies offer DSS applications online. These kinds of systems can be viewed as ready-made, although typically they require modifications (sometimes major) before they can be used effectively.
SECTION 2.10 REVIEW QUESTIONS
1. List the DSS classifications of the AIS SIGDSS.
2. Define document-driven DSS.
3. List the capabilities of institutional DSS and ad hoc DSS.
4. Define the term ready-made DSS.
2.11 COMPONENTS OF DECISION SUPPORT SYSTEMS
A DSS application can be composed of a data management subsystem, a model management subsystem, a user interface subsystem, and a knowledge-based management subsystem. We show these in Figure 2.4.
[Figure 2.4 Schematic View of DSS: external and internal data, other computer-based systems, and the organizational knowledge base feed the data management and model management subsystems (the latter drawing on external models), which connect through the knowledge-based subsystems and the user interface to the manager (user), with access via the Internet, intranets, and extranets.]
[Figure 2.5 Structure of the Data Management Subsystem: internal data sources (finance, production), private personal data, and the corporate data warehouse feed the decision support database, which is managed by a database management system (retrieval, inquiry, update, report generation, delete) with a query facility and data directory, linked to the interface management, model management, and knowledge-based subsystems and to the organizational knowledge base.]

The Data Management Subsystem
The data management subsystem includes a database that contains relevant data for the situation and is managed by software called the database management system (DBMS; this acronym is used as both singular and plural, system and systems, as are many other acronyms in this text). The data management subsystem can be interconnected with the corporate data warehouse, a repository for relevant corporate decision-making data. Usually, the data are stored or accessed via a database Web server. The data management subsystem is composed of the following elements:

• DSS database
• Database management system
• Data directory
• Query facility

These elements are shown schematically in Figure 2.5 (in the shaded area). The figure also shows the interaction of the data management subsystem with the other parts of the DSS, as well as its interaction with several data sources. Many of the BI or descriptive analytics applications derive their strength from the data management side of the subsystems. Application Case 2.2 provides an example of a DSS that focuses on data.
The Model Management Subsystem
The model management subsystem is the component that includes financial, statistical, management science, or other quantitative models that provide the system's analytical capabilities and appropriate software management. Modeling languages for building custom models are also included. This software is often called a model base management system (MBMS).
Application Case 2.2
Station Casinos Wins by Building Customer Relationships Using Its Data
Station Casinos is a major provider of gaming for Las Vegas-area residents. It owns about 20 properties in Nevada and other states, employs over 12,000 people, and has revenue of over $1 billion.

Station Casinos wanted to develop an in-depth view of each customer/guest who visited Station Casinos properties. This would permit them to better understand customer trends as well as enhance their one-to-one marketing for each guest. The company employed the Teradata warehouse to develop the "Total Guest Worth" solution. The project used Aprimo Relationship Manager, Informatica, and Cognos to capture, analyze, and segment customers. Almost 500 different data sources were integrated to develop the full view of a customer. As a result, the company was able to realize the following benefits:

• Customer segments were expanded from 14 (originally) to 160 segments so as to be able to target more specific promotions to each segment.
• A 4 percent to 6 percent increase in monthly slot profit.
• Slot promotion costs were reduced by $1 million (from $13 million per month) by better targeting the customer segments.
• A 14 percent improvement in guest retention.
• Increased new-member acquisition by 160 percent.
• Reduction in data error rates from as high as 80 percent to less than 1 percent.
• Reduced the time to analyze a campaign's effectiveness from almost 2 weeks to just a few hours.
QUESTIONS FOR DISCUSSION

1. Why is this decision support system classified as a data-focused DSS?
2. What were some of the benefits from implementing this solution?

Source: Teradata.com, "No Limits: Station Casinos Breaks the Mold on Customer Relationships," teradata.com/case-studies/Station-Casinos-No-Limits-Station-Casinos-Breaks-the-Mold-on-Customer-Relationships-Executive-Summary-eb6410 (accessed February 2013).
system (MBMS). This component can be connected to corporate or external storage of models. Model solution methods and management systems are implemented in Web development systems (such as Java) to run on application servers. The model management subsystem of a DSS is composed of the following elements:

• Model base
• MBMS
• Modeling language
• Model directory
• Model execution, integration, and command processor

These elements and their interfaces with other DSS components are shown in Figure 2.6. At a higher level than building blocks, it is important to consider the different types of models and solution methods needed in the DSS. Often at the start of development, there is some sense of the model types to be incorporated, but this may change as more is learned about the decision problem. Some DSS development systems include a wide variety of components (e.g., Analytica from Lumina Decision Systems), whereas others have a single one (e.g., Lindo). Often, the results of one type of model component (e.g., forecasting) are used as input to another (e.g., production scheduling). In some cases, a modeling language is a component that generates input to a solver, whereas in other cases, the two are combined.

Because DSS deal with semistructured or unstructured problems, it is often necessary to customize models, using programming tools and languages. Some examples of these are .NET Framework languages, C++, and Java. OLAP software may also be used to work with models in data analysis. Even languages for simulation such as Arena and statistical packages such as those of SPSS offer modeling tools developed through the use of a proprietary programming language.
FIGURE 2.6 Structure of the Model Management Subsystem.

For small and medium-sized DSS or for less complex ones, a spreadsheet (e.g., Excel) is usually used. We will use Excel for many key examples in this book.
Application Case 2.3 describes a spreadsheet-based DSS. However, using a spreadsheet for modeling a problem of any significant size presents problems with documentation and error diagnosis. It is very difficult to determine or understand nested, complex relationships in spreadsheets created by someone else. This makes it difficult to modify a model built by someone else. A related issue is the increased likelihood of errors creeping into the formulas. With all the equations appearing in the form of cell references, it is challenging to figure out where an error might be. These issues were addressed in an early generation of DSS development software that was available on mainframe computers in the 1980s. One such product was called Interactive Financial Planning System (IFPS). Its developer, Dr. Gerald Wagner, then released a desktop software called Planners Lab. Planners Lab includes the following components: (1) an easy-to-use algebraically oriented model-building language and (2) an easy-to-use state-of-the-art option for visualizing model output, such as answers to what-if and goal seek questions to analyze results of changes in assumptions. The combination of these components enables business managers and analysts to build, review, and challenge the assumptions that underlie decision-making scenarios.
Planners Lab makes it possible for the decision makers to "play" with assumptions to reflect alternative views of the future. Every Planners Lab model is an assemblage of assumptions about the future. Assumptions may come from databases of historical performance, market research, and the decision makers' minds, to name a few sources. Most assumptions about the future come from the decision makers' accumulated experiences in the form of opinions.

The resulting collection of equations is a Planners Lab model that tells a readable story for a particular scenario. Planners Lab lets decision makers describe their plans in their own words and with their own assumptions. The product's raison d'être is that a simulator should facilitate a conversation with the decision maker in the process of describing business assumptions. All assumptions are described in English equations (or the user's native language).
Application Case 2.3
SNAP DSS Helps OneNet Make Telecommunications Rate Decisions
Telecommunications network services to educational institutions and government entities are typically provided by a mix of private and public organizations. Many states in the United States have one or more state agencies that are responsible for providing network services to schools, colleges, and other state agencies. One example of such an agency is OneNet in Oklahoma. OneNet is a division of the Oklahoma State Regents for Higher Education and operated in cooperation with the Office of State Finance.

Usually agencies such as OneNet operate as an enterprise-type fund. They must recover their costs through billing their clients and/or by justifying appropriations directly from the state legislatures. This cost recovery should occur through a pricing mechanism that is efficient, simple to implement, and equitable. This pricing model typically needs to recognize many factors: convergence of voice, data, and video traffic on the same infrastructure; diversity of the user base in terms of educational institutions, state agencies, and so on; diversity of applications in use by state clients, from e-mail to videoconferences, IP telephony, and distance learning; recovery of current costs, as well as planning for upgrades and future developments; and leverage of the shared infrastructure to enable further economic development and collaborative work across the state that leads to innovative uses of OneNet.

These considerations led to the development of a spreadsheet-based model. The system, SNAP-DSS, or Service Network Application and Pricing (SNAP)-based DSS, was developed in Microsoft Excel 2007 and used the VBA programming language.

The SNAP-DSS offers OneNet the ability to select the rate card options that best fit the preferred pricing strategies by providing a real-time, user-friendly, graphical user interface (GUI). In addition, the SNAP-DSS not only illustrates the influence of the changes in the pricing factors on each rate card option, but also allows the user to analyze various rate card options in different scenarios using different parameters. This model has been used by OneNet financial planners to gain insights into their customers and analyze many what-if scenarios of different rate plan options.

Source: Based on J. Chongwatpol and R. Sharda, "SNAP: A DSS to Analyze Network Service Pricing for State Networks," Decision Support Systems, Vol. 50, No. 1, December 2010, pp. 347-359.
The best way to learn how to use Planners Lab is to launch the software and follow
the tutorials. The software can be downloaded at plannerslab.com.
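To illustrate the kind of assumption-driven what-if and goal-seek analysis described above, here is a minimal sketch in Python. It is not Planners Lab's modeling language; the profit model and all numbers are illustrative assumptions.

# A one-line model built from assumptions.
def profit(units, price, unit_cost, fixed_cost):
    return units * (price - unit_cost) - fixed_cost

# What-if: vary one assumption and observe the result.
for price in (9.0, 10.0, 11.0):
    print(price, profit(units=5000, price=price, unit_cost=6.0, fixed_cost=12000))

# Goal seek: find the price that yields a target profit (bisection search,
# valid here because profit rises with price).
def goal_seek(target, lo=6.0, hi=20.0, tol=1e-6):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if profit(5000, mid, 6.0, 12000) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(goal_seek(target=10000.0))  # about 10.4: the price needed for a $10,000 profit

Changing any assumption and rerunning the model is the "play with assumptions" conversation the text describes.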
The User Interface Subsystem
The user communicates with and commands the DSS through the user interface subsystem. The user is considered part of the system. Researchers assert that some of the unique contributions of DSS are derived from the intensive interaction between the computer and the decision maker. The Web browser provides a familiar, consistent graphical user interface (GUI) structure for most DSS. For locally used DSS, a spreadsheet also provides a familiar user interface. A difficult user interface is one of the major reasons managers do not use computers and quantitative analyses as much as they could, given the availability of these technologies. The Web browser has been recognized as an effective DSS GUI because it is flexible, user friendly, and a gateway to almost all sources of necessary information and data. Essentially, Web browsers have led to the development of portals and dashboards, which front end many DSS.

Explosive growth in portable devices including smartphones and tablets has changed the DSS user interfaces as well. These devices allow either handwritten input or typed input from internal or external keyboards. Some DSS user interfaces utilize natural-language input
(i.e., text in a human language) so that the users can easily express themselves in a meaningful way. Because of the fuzzy nature of human language, it is fairly difficult to develop software to interpret it. However, these packages increase in accuracy every year, and they will ultimately lead to accurate input, output, and language translators.

Cell phone inputs through SMS are becoming more common for at least some consumer DSS-type applications. For example, one can send an SMS request for search on any topic to GOOGL (46645). It is most useful in locating nearby businesses, addresses, or phone numbers, but it can also be used for many other decision support tasks. For example, users can find definitions of words by entering the word "define" followed by a word, such as "define extenuate." Some of the other capabilities include:

• Translations: "Translate thanks in Spanish."
• Price lookups: "Price 32GB iPhone."
• Calculator: Although you would probably just want to use your phone's built-in calculator function, you can send a math expression as an SMS for an answer.
• Currency conversions: "10 usd in euros."
• Sports scores and game times: Just enter the name of a team ("NYC Giants"), and Google SMS will send the most recent game's score and the date and time of the next match.

This type of SMS-based search capability is also available for other search engines, including Yahoo! and Microsoft's new search engine Bing.
With the emergence of smartphones such as Apple's iPhone and Android smartphones from many vendors, many companies are developing applications (commonly called apps) to provide purchasing-decision support. For example, Amazon.com's app allows a user to take a picture of any item in a store (or wherever) and send it to Amazon.com. Amazon.com's graphics-understanding algorithm tries to match the image to a real product in its databases and sends the user a page similar to Amazon.com's product info pages, allowing users to perform price comparisons in real time. Thousands of other apps have been developed that provide consumers support for decision making on finding and selecting stores/restaurants/service providers on the basis of location, recommendations from others, and especially from their own social circles.

Voice input for these devices and PCs is common and fairly accurate (but not perfect). When voice input with accompanying speech-recognition software (and readily available text-to-speech software) is used, verbal instructions with accompanied actions and outputs can be invoked. Examples of voice input that can be used for a general-purpose DSS are Apple's Siri application and Google's Google Now service, which are incorporated into the portable devices described earlier. For example, a user can give her zip code and say "pizza delivery." These devices provide the search results and can even place a call to a business.
Recent efforts in business process management (BPM) have led to inputs directly from physical devices for analysis via DSS. For example, radio-frequency identification (RFID) chips can record data from sensors in railcars or in-process products in a factory. Data from these sensors (e.g., recording an item's status) can be downloaded at key locations and immediately transmitted to a database or data warehouse, where they can be analyzed and decisions can be made concerning the status of the items being monitored. Walmart and Best Buy are developing this technology in their SCM, and such sensor networks are also being used effectively by other firms.
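A minimal sketch of this sensor-to-decision pattern follows, in Python. The readings, item names, and the temperature threshold are hypothetical.

# Hypothetical cold-chain limit for monitored items.
TEMP_LIMIT_C = 8.0

readings = [
    {"item": "railcar-17", "temp_c": 6.5},
    {"item": "railcar-42", "temp_c": 9.1},
]

# As each reading is downloaded at a key location, out-of-range items
# are flagged so a decision can be made about them.
for r in readings:
    if r["temp_c"] > TEMP_LIMIT_C:
        print("ALERT:", r["item"], "exceeds", TEMP_LIMIT_C, "degrees C")

In practice the readings would stream into a database or data warehouse and the decision rules would be far richer, but the flow (sense, transmit, analyze, decide) is the same.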
The Knowledge-Based Management Subsystem
The knowledge-based management subsystem can support any of the other subsystems or act as an independent component. It provides intelligence to augment the decision maker's own. It can be interconnected with the organization's knowledge repository (part of a knowledge management system [KMS]), which is sometimes called the organizational knowledge base. Knowledge may be provided via Web servers. Many artificial intelligence methods have been implemented in Web development systems such as Java and are easy to integrate into the other DSS components. One of the most widely publicized knowledge-based DSS is IBM's Watson computer system. It is described in Application Case 2.4.
We conclude the sections on the three major DSS components with information on some recent technology and methodology developments that affect DSS and decision making. Technology Insights 2.2 summarizes some emerging developments in user interfaces.
Application Case 2.4
From a Game Winner to a Doctor!
The television show Jeopardy! inspired an IBM research team to build a supercomputer named Watson that successfully took on the challenge of playing Jeopardy! and beat the other human competitors. Since then, Watson has evolved into a question-answering computing platform that is now being used commercially in the medical field and is expected to find its use in many other areas.

Watson is a cognitive system built on clusters of powerful processors supported by IBM's DeepQA® software. Watson employs a combination of techniques like natural-language processing, hypothesis generation and evaluation, and evidence-based learning to overcome the constraints imposed by programmatic computing. This enables Watson to work on massive amounts of real-world, unstructured Big Data efficiently.

In the medical field, it is estimated that the amount of medical information doubles every 5 years. This massive growth limits a physician's decision-making ability in diagnosis and treatment of illness using an evidence-based approach. With the advancements being made in the medical field every day, physicians do not have enough time to read every journal that can help them in keeping up-to-date with the latest advancements. Patient histories and electronic medical records contain lots of data. If this information can be analyzed in combination with vast amounts of existing medical knowledge, many useful clues can be provided to the physicians to help them identify diagnostic and treatment options.

Watson, dubbed Dr. Watson, with its advanced machine learning capabilities, now finds a new role as a computer companion that assists physicians by providing relevant real-time information for critical decision making in choosing the right diagnostic and treatment procedures. (Also see the opening vignette for Chapter 7.)

Memorial Sloan-Kettering Cancer Center (MSKCC), New York, and WellPoint, a major insurance provider, have begun using Watson as a treatment advisor in oncology diagnosis. Watson learned the process of diagnosis and treatment through its natural-language processing capabilities, which enabled it to leverage the unstructured data with an enormous amount of clinical expertise data, molecular and genomic data from existing cancer case histories, journal articles, physicians' notes, and guidelines and best practices from the National Comprehensive Cancer Network. It was then trained by oncologists to apply the knowledge gained in comparing an individual patient's medical information against a wide variety of treatment guidelines, published research, and other insights to provide individualized, confidence-scored recommendations to the physicians.

At MSKCC, Watson facilitates evidence-based support for every suggestion it makes while analyzing an individual case by bringing out the facts from medical literature that point to a particular suggestion. It also provides a platform for the physicians to look at the case from multiple directions by doing further analysis relevant to the individual case. Its voice recognition capabilities allow physicians to speak to Watson, enabling it to be a perfect assistant that helps physicians in critical evidence-based decision making.

WellPoint also trained Watson with a vast history of medical cases and now relies on Watson's hypothesis generation and evidence-based learning to generate recommendations in providing approval for medical treatments based on the clinical and patient data. Watson also assists the insurance providers in detecting fraudulent claims and protecting physicians from malpractice claims.

Watson provides an excellent example of a knowledge-based DSS that employs multiple advanced technologies.
QUESTIONS FOR DISCUSSION

1. What is a cognitive system? How can it assist in real-time decision making?
2. What is evidence-based decision making?
3. What is the role played by Watson in the discussion?
4. Does Watson eliminate the need for human decision making?

What We Can Learn from This Application Case

Advancements in technology now enable the building of powerful, cognitive computing platforms combined with complex analytics. These systems are
impacting the decision-making process radically by shifting it from an opinion-based process to a more real-time, evidence-based process, thereby turning available information intelligence into actionable wisdom that can be readily employed across many industrial sectors.
Sources: Ibm.com, "IBM Watson: Ushering In a New Era of Computing," www-03.ibm.com/innovation/us/watson (accessed February 2013); Ibm.com, "IBM Watson Helps Fight Cancer with Evidence-Based Diagnosis and Treatment Suggestions," www-03.ibm.com/innovation/us/watson/pdf/MSK_Case_Study_IMC14794 (accessed February 2013); Ibm.com, "IBM Watson Enables More Effective Healthcare Preapproval Decisions Using Evidence-Based Learning," www-03.ibm.com/innovation/us/watson/pdf/WellPoint_Case_Study_IMC14792 (accessed February 2013).
TECHNOLOGY INSIGHTS 2.2 Next Generation of Input Devices

The last few years have seen exciting developments in user interfaces. Perhaps the most common example of the new user interfaces is the iPhone's multi-touch interface, which allows a user to zoom, pan, and scroll through a screen just with the use of a finger. The success of the iPhone has spawned development of similar user interfaces from many other providers, including Blackberry, HTC, LG, Motorola (a part of Google), Microsoft, Nokia, Samsung, and others. The mobile platform has become the major access mechanism for all decision support applications.

In the last few years, gaming devices have evolved significantly to be able to receive and process gesture-based inputs. In 2007, Nintendo introduced the Wii game platform, which is able to process motions and gestures. Microsoft's Kinect is able to recognize image movements and use that to discern inputs. The next generation of these technologies is in the form of mind-reading platforms. A company called Emotiv (en.wikipedia.org/wiki/Emotiv) made big news in early 2008 with a promise to deliver a game controller that a user would be able to control by thinking about it. These technologies are to be based on electroencephalography (EEG), the technique of reading and processing the electrical activity at the scalp level as a result of specific thoughts in the brain. The technical details are available on Wikipedia (en.wikipedia.org/wiki/Electroencephalography) and the Web. Although EEG has not yet been known to be used as a DSS user interface (at least to the authors), its potential is significant for many other DSS-type applications. Many other companies are developing similar technologies.

It is also possible to speculate on other developments on the horizon. One major growth area is likely to be in wearable devices. Google's wearable glasses, labeled "augmented reality" glasses, will likely emerge as a new user interface for decision support in both consumer and corporate decision settings. Similarly, Apple is supposed to be working on iOS-based wristwatch-type computers. These devices will significantly impact how we interact with a system and use the system for decision support. So it is a safe bet that user interfaces are going to change significantly in the next few years. Their first use will probably be in gaming and consumer applications, but the business and DSS applications won't be far behind.

Sources: Various Wikipedia sites and the company Web sites provided in the feature.
Many developments in DSS components are the result of new developments in hardware and software computer technology, data warehousing, data mining, OLAP, Web technologies, integration of technologies, and DSS application to various and new functional areas. There is also a clear link between hardware and software capabilities and improvements in DSS. Hardware continues to shrink in size while increasing in speed and other capabilities. The sizes of databases and data warehouses have increased dramatically. Data warehouses now provide hundreds of petabytes of sales data for retail organizations and content for major news networks.

We expect to see more seamless integration of DSS components as they adopt Web technologies, especially XML. These Web-based technologies have become the center of activity in developing DSS. Web-based DSS have reduced technological barriers and have made it easier and less costly to make decision-relevant information and model-driven DSS available to managers and staff users in geographically distributed locations, especially through mobile devices.

DSS are becoming more embedded in other systems. Similarly, a major area to expect improvements in DSS is in GSS in supporting collaboration at the enterprise level. This is true even in the educational arena. Almost every new area of information systems involves some level of decision-making support. Thus, DSS, either directly or indirectly, has impacts on CRM, SCM, ERP, KM, PLM, BAM, BPM, and other EIS. As these systems evolve, the active decision-making component that utilizes mathematical, statistical, or even descriptive models increases in size and capability, although it may be buried deep within the system. Finally, different types of DSS components are being integrated more frequently. For example, GIS are readily integrated with other, more traditional, DSS components and tools for improved decision making.

By definition, a DSS must include the three major components: DBMS, MBMS, and user interface. The knowledge-based management subsystem is optional, but it can provide many benefits by providing intelligence in and to the three major components. As in any other MIS, the user may be considered a component of DSS.
Chapter Highlights

• Managerial decision making is synonymous with the whole process of management.
• Human decision styles need to be recognized in designing systems.
• Individual and group decision making can both be supported by systems.
• Problem solving is also opportunity evaluation.
• A model is a simplified representation or abstraction of reality.
• Decision making involves four major phases: intelligence, design, choice, and implementation.
• In the intelligence phase, the problem (opportunity) is identified, classified, and decomposed (if needed), and problem ownership is established.
• In the design phase, a model of the system is built, criteria for selection are agreed on, alternatives are generated, results are predicted, and a decision methodology is created.
• In the choice phase, alternatives are compared, and a search for the best (or a good-enough) solution is launched. Many search techniques are available.
• In implementing alternatives, a decision maker should consider multiple goals and sensitivity-analysis issues.
• Satisficing is a willingness to settle for a satisfactory solution. In effect, satisficing is suboptimizing. Bounded rationality results in decision makers satisficing.
• Computer systems can support all phases of decision making by automating many of the required tasks or by applying artificial intelligence.
• A DSS is designed to support complex managerial problems that other computerized techniques cannot. DSS is user oriented, and it uses data and models.
• DSS are generally developed to solve specific managerial problems, whereas BI systems typically report status, and, when a problem is discovered, their analysis tools are utilized by decision makers.
• DSS can provide support in all phases of the decision-making process and to all managerial levels for individuals, groups, and organizations.
• DSS is a user-oriented tool. Many applications can be developed by end users, often in spreadsheets.
• DSS can improve the effectiveness of decision making, decrease the need for training, improve management control, facilitate communication, save effort by the users, reduce costs, and allow for more objective decision making.
• The AIS SIGDSS classification of DSS includes communications-driven and group DSS (GSS), data-driven DSS, document-driven DSS, knowledge-driven DSS, data mining and management ES applications, and model-driven DSS. Several other classifications map into this one.
• Several useful classifications of DSS are based on why they are developed (institutional versus ad hoc), what level within the organization they support (personal, group, or organizational), whether they support individual work or group work (individual DSS versus GSS), and how they are developed (custom versus ready-made).
• The major components of a DSS are a database and its management, a model base and its management, and a user-friendly interface. An intelligent (knowledge-based) component can also be included. The user is also considered to be a component of a DSS.
• Data warehouses, data mining, and OLAP have made it possible to develop DSS quickly and easily.
• The data management subsystem usually includes a DSS database, a DBMS, a data directory, and a query facility.
• The model base includes standard models and models specifically written for the DSS.
• Custom-made models can be written in programming languages, in special modeling languages, and in Web-based development systems (e.g., Java, the .NET Framework).
• The user interface (or dialog) is of utmost importance. It is managed by software that provides the needed capabilities. Web browsers and smartphones/tablets commonly provide a friendly, consistent DSS GUI.
• The user interface capabilities of DSS have moved into small, portable devices, including smartphones, tablets, and so forth.
Key Terms

ad hoc DSS
algorithm
analytical techniques
business intelligence (BI)
choice phase
data warehouse
database management system (DBMS)
decision making
decision style
decision variable
descriptive model
design phase
DSS application
effectiveness
efficiency
implementation phase
institutional DSS
intelligence phase
model base management system (MBMS)
normative model
optimization
organizational knowledge base
principle of choice
problem ownership
problem solving
satisficing
scenario
sensitivity analysis
simulation
suboptimization
user interface
what-if analysis
Questions for Discussion

1. Why is intuition still an important aspect of decision making?
2. Define efficiency and effectiveness, and compare and contrast the two.
3. Why is it important to focus on the effectiveness of a decision, not necessarily the efficiency of making a decision?
4. What are some of the measures of effectiveness in a toy manufacturing plant, a restaurant, an educational institution, and the U.S. Congress?
5. Even though implementation of a decision involves change, and change management is very difficult, explain how change management has not changed very much in thousands of years. Use specific examples throughout history.
6. Your company is considering opening a branch in China. List typical activities in each phase of the decision (intelligence, design, choice, implementation) of whether to open a branch.
7. You are about to buy a car. Using Simon's four-phase model, describe your activities at each step.
8. Explain, through an example, the support given to decision makers by computers in each phase of the decision process.
9. Some experts believe that the major contribution of DSS is to the implementation of a decision. Why is this so?
10. Review the major characteristics and capabilities of DSS. How do each of them relate to the major components of DSS?
11. List some internal data and external data that could be found in a DSS for a university's admissions office.
12. Why does a DSS need a DBMS, a model management system, and a user interface, but not necessarily a knowledge-based management system?
13. What are the benefits and the limitations of the AIS SIGDSS classification for DSS?
14. Search for a ready-made DSS. What type of industry is its market? Explain why it is a ready-made DSS.
Exercises

Teradata University Network (TUN) and Other Hands-On Exercises

1. Choose a case at TUN or use the case that your instructor chooses. Describe in detail what decisions were to be made in the case and what process was actually followed. Be sure to describe how technology assisted or hindered the decision-making process and what the decision's impacts were.
2. Most companies and organizations have downloadable demos or trial versions of their software products on the Web so that you can copy and try them out on your own computer. Others have online demos. Find one that provides decision support, try it out, and write a short report about it. Include details about the intended purpose of the software, how it works, and how it supports decision making.
3. Comment on Simon's (1977) philosophy that managerial decision making is synonymous with the whole process of management. Does this make sense? Explain. Use a real-world example in your explanation.
4. Consider a situation in which you have a preference about where you go to college: You want to be not too far away from home and not too close. Why might this situation arise? Explain how this situation fits with rational decision-making behavior.
5. Explore teradatauniversitynetwork.com. In a report, describe at least three interesting DSS applications and three interesting DSS areas (e.g., CRM, SCM) that you have discovered there.
6. Examine Daniel Power's DSS Resources site at dssresources.com. Take the Decision Support Systems Web Tour (dssresources.com/tour/index.html). Explore other areas of the Web site.
End-of-Chapter Application Case

Logistics Optimization in a Major Shipping Company (CSAV)
Introduction
Compañía Sud Americana de Vapores (CSAV) is a shipping company headquartered in Chile, South America, and is the sixth largest shipping company in the world. Its operations in over 100 countries worldwide are managed from seven regional offices. CSAV operates 700,000 containers valued at $2 billion. Less than 10 percent of these containers are owned by CSAV. The rest are acquired from other third-party companies on lease. At the heart of CSAV's business operations is its container fleet, which is second only to vessel fuel in terms of cost. As part of its strategic planning, the company recognized that addressing the problem of empty container logistics would help reduce operational cost. In a typical cycle of a cargo container, a shipper first acquires an empty container from a container depot. The container is then loaded onto a truck and sent to the merchant, who fills it with products. Finally, the container is sent by truck to the ship for onward transport to the destination. Typically, there are transshipments along the way where a container may be moved from one vessel to another until it gets to its destination. At the destination, the container is transported to the consignee. After the container is emptied, it is sent to the nearest CSAV depot, where maintenance is done on the container.
CSAV recognized four main challenges in its empty container logistics problem:

• Imbalance. Some geographic regions are net exporters while others are net importers. Places like China are net exporters; hence, there are always shortages of containers. North America is a net importer; it always has a surplus of containers. This uneven flow creates an imbalance of containers.
• Uncertainty. Factors like demand, date of return of empty containers, travel times, and the ship's capacity for empty containers create uncertainty in the location and availability of containers.
• Information handling and sharing. Huge loads of data need to be processed every day. CSAV processes 400,000 container transactions every day. Timely decisions based on accurate information had to be generated in order to help reduce safety stocks of empty containers.
• Coordination of interrelated decisions worldwide. Previously, decisions were made at the local level. Consequently, in order to alleviate the empty container problem, decisions regarding movement of empty containers at various locations had to be coordinated.
Methodology/Solution

CSAV developed an integrated system called Empty Container Logistics Optimization (ECO) using moving average, trended and seasonal time series, and sales force forecast (CFM) methods. The ECO system comprises a forecasting model, an inventory model, a multi-commodity (MC) network flow model, and a Web interface. The forecasting model draws data from the regional offices, processes it, and feeds the resultant information to the inventory model. Among the information the forecasting model generates are the space in the vessel for empty containers and container demand. The forecasting module also helps reduce forecast error and, hence, allows CSAV's depots to maintain lower safety stocks. The inventory model calculates the safety stocks and feeds them to the MC network flow model. The MC network flow model is the core of the ECO system. It provides information for optimal decisions to be made regarding inventory levels, container repositioning flows, and the leasing and return of empty containers. The objective function is to minimize empty container logistics cost, which is mostly a result of leasing, repositioning, storage, loading, and discharge operations.
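As a rough illustration of how such a chain of models fits together, here is a minimal sketch in Python of a moving-average demand forecast feeding a safety-stock calculation. The demand numbers and the 1.65 service factor (roughly a 95 percent service level) are illustrative assumptions, not CSAV's actual data or parameters.

from statistics import mean, stdev

# Hypothetical weekly demand for empty containers at one depot.
weekly_demand = [310, 295, 330, 340, 305, 325]

forecast = mean(weekly_demand[-4:])         # 4-week moving average
safety_stock = 1.65 * stdev(weekly_demand)  # buffer against demand uncertainty

print("forecast:", round(forecast), "safety stock:", round(safety_stock))

In the ECO system, figures like these would then become inputs to the multi-commodity network flow model that decides repositioning, leasing, and returns.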
Results/Benefits

The ECO system activities in all regional centers are well coordinated while still maintaining flexibility and creativity in their operations. The system resulted in a 50 percent reduction in inventory stock. The generation of intelligent information from historical transactional data helped increase efficiency of operation. For instance, the empty time per container cycle decreased from a high of 47.2 days in 2009 to only 27.3 days the following year, resulting in a 60 percent increase in average empty container turnover. Also, container cycles increased from a record low of 3.8 cycles in 2009 to 4.8 cycles
in 2010. Moreover, when the ECO system was implemented in 2010, the excess cost per full voyage became $35 cheaper than the average cost for the period between 2006 and 2009. This resulted in cost savings of $101 million on all voyages in 2010. It was estimated that ECO's direct contribution to this cost reduction was about 80 percent ($81 million). CSAV projected that ECO would help generate $200 million in profits in the 2 years following its implementation in 2010.
CASE QUESTIONS

1. Explain why solving the empty container logistics problem contributes to cost savings for CSAV.
2. What are some of the qualitative benefits of the optimization model for the empty container movements?
3. What are some of the key benefits of the forecasting model in the ECO system implemented by CSAV?
4. Perform an online search to determine how other shipping companies handle the empty container problem. Do you think the ECO system would directly benefit those companies?
5. Besides shipping logistics, can you think of any other domain where such a system would be useful in reducing cost?
What We Can Learn from This End-of-Chapter Application Case

The empty container problem is faced by most shipping companies. The problem is partly caused by an imbalance in the demand for empty containers between different geographic areas. CSAV used an optimization system to solve the empty container problem. The case demonstrates a situation where a business problem is solved not just by one method or model, but by a combination of different operations research and analytics methods. For instance, we realize that the optimization model used by CSAV consisted of different submodels such as the forecasting and inventory models. The shipping industry is only one sector among a myriad of sectors where optimization models are used to decrease the cost of business operations. The lessons learned in this case could be explored in other domains such as manufacturing and supply chain.

Source: R. Epstein et al., "A Strategic Empty Container Logistics Optimization in a Major Shipping Company," Interfaces, Vol. 42, No. 1, January-February 2012, pp. 5-16.
References

Allan, N., R. Frame, and I. Turney. (2003). "Trust and Narrative: Experiences of Sustainability." The Corporate Citizen, Vol. 3, No. 2.

Alter, S. L. (1980). Decision Support Systems: Current Practices and Continuing Challenges. Reading, MA: Addison-Wesley.

Baker, J., and M. Cameron. (1996, September). "The Effects of the Service Environment on Affect and Consumer Perception of Waiting Time: An Integrative Review and Research Propositions." Journal of the Academy of Marketing Science, Vol. 24, pp. 338-349.
Barba-Romero, S. (2001, July/August). "The Spanish Government Uses a Discrete Multicriteria DSS to Determine Data Processing Acquisitions." Interfaces, Vol. 31, No. 4, pp. 123-131.

Beach, L. R. (2005). The Psychology of Decision Making: People in Organizations, 2nd ed. Thousand Oaks, CA: Sage.

Birkman International, Inc., birkman.com; Keirsey Temperament Sorter and Keirsey Temperament Theory-II, keirsey.com.

Chongwatpol, J., and R. Sharda. (2010, December). "SNAP: A DSS to Analyze Network Service Pricing for State Networks." Decision Support Systems, Vol. 50, No. 1, pp. 347-359.

Cohen, M.-D., C. B. Charles, and A. L. Medaglia. (2001, March/April). "Decision Support with Web-Enabled Software." Interfaces, Vol. 31, No. 2, pp. 109-129.

Denning, S. (2000). The Springboard: How Storytelling Ignites Action in Knowledge-Era Organizations. Burlington, MA: Butterworth-Heinemann.

Donovan, J. J., and S. E. Madnick. (1977). "Institutional and Ad Hoc DSS and Their Effective Use." Data Base, Vol. 8, No. 3, pp. 79-88.

Drummond, H. (2001). The Art of Decision Making: Mirrors of Imagination, Masks of Fate. New York: Wiley.

Eden, C., and F. Ackermann. (2002). "Emergent Strategizing." In A. Huff and M. Jenkins (eds.), Mapping Strategic Thinking. Thousand Oaks, CA: Sage Publications.

Epstein, R., et al. (2012, January/February). "A Strategic Empty Container Logistics Optimization in a Major Shipping Company." Interfaces, Vol. 42, No. 1, pp. 5-16.

Farasyn, I., K. Perkoz, and W. Van de Velde. (2008, July/August). "Spreadsheet Models for Inventory Target Setting at Procter and Gamble." Interfaces, Vol. 38, No. 4, pp. 241-250.

Goodie, A. (2004, Fall). "Goodie Studies Pathological Gamblers' Risk-Taking Behavior." The Independent Variable. Athens, GA: The University of Georgia, Institute of Behavioral Research. ibr.uga.edu/publications/fall2004 (accessed February 2013).

Hesse, R., and G. Woolsey. (1975). Applied Management Science: A Quick and Dirty Approach. Chicago: SRA Inc.

Ibm.com. "IBM Watson: Ushering In a New Era of Computing." www-03.ibm.com/innovation/us/watson (accessed February 2013).

Ibm.com. "IBM Watson Helps Fight Cancer with Evidence-Based Diagnosis and Treatment Suggestions." www-03.ibm.com/innovation/us/watson/pdf/MSK_Case_Study_IMC14794 (accessed February 2013).

Ibm.com. "IBM Watson Enables More Effective Healthcare Preapproval Decisions Using Evidence-Based Learning." www-03.ibm.com/innovation/us/watson/pdf/WellPoint_Case_Study_IMC14792 (accessed February 2013).

Jenkins, M. (2002). "Cognitive Mapping." In D. Partington (ed.), Essential Skills for Management Research. Thousand Oaks, CA: Sage Publications.

Kepner, C., and B. Tregoe. (1998). The New Rational Manager. Princeton, NJ: Kepner-Tregoe.

Koksalan, M., and S. Zionts (eds.). (2001). Multiple Criteria Decision Making in the New Millennium. Heidelberg: Springer-Verlag.

Koller, G. R. (2000). Risk Modeling for Determining Value and Decision Making. Boca Raton, FL: CRC Press.

Larson, R. C. (1987, November/December). "Perspectives on Queues: Social Justice and the Psychology of Queueing." Operations Research, Vol. 35, No. 6, pp. 895-905.

Luce, M. F., J. W. Payne, and J. R. Bettman. (2004). "The Emotional Nature of Decision Trade-offs." In S. J. Hoch, H. C. Kunreuther, and R. E. Gunther (eds.), Wharton on Making Decisions. New York: Wiley.

Olavson, T., and C. Fry. (2008, July/August). "Spreadsheet Decision-Support Tools: Lessons Learned at Hewlett-Packard." Interfaces, Vol. 38, No. 4, pp. 300-310.

Pauly, M. V. (2004). "Split Personality: Inconsistencies in Private and Public Decisions." In S. J. Hoch, H. C. Kunreuther, and R. E. Gunther (eds.), Wharton on Making Decisions. New York: Wiley.

Power, D. J. (2002). Decision Making Support Systems: Achievements, Trends and Challenges. Hershey, PA: Idea Group Publishing.

Power, D. J., and R. Sharda. (2009). "Decision Support Systems." In S. Y. Nof (ed.), Springer Handbook of Automation. New York: Springer.

Purdy, J. (2005, Summer). "Decisions, Delusions, & Debacles." UGA Research Magazine.

Ratner, R. K., B. E. Kahn, and D. Kahneman. (1999, June). "Choosing Less-Preferred Experiences for the Sake of Variety." Journal of Consumer Research, Vol. 26, No. 1.

Sawyer, D. C. (1999). Getting It Right: Avoiding the High Cost of Wrong Decisions. Boca Raton, FL: St. Lucie Press.

Simon, H. (1977). The New Science of Management Decision. Englewood Cliffs, NJ: Prentice Hall.

Stewart, T. A. (2002, November). "How to Think with Your Gut." Business 2.0.

Teradata.com. "No Limits: Station Casinos Breaks the Mold on Customer Relationships." teradata.com/case-studies/Station-Casinos-No-Limits-Station-Casinos-Breaks-the-Mold-on-Customer-Relationships-Executive-Summary-eb6410 (accessed February 2013).

Tversky, A., P. Slovic, and D. Kahneman. (1990, March). "The Causes of Preference Reversal." American Economic Review, Vol. 80, No. 1.

Yakov, B.-H. (2001). Information Gap Decision Theory: Decisions Under Severe Uncertainty. New York: Academic Press.
PART II

Descriptive Analytics
LEARNING OBJECTIVES FOR PART II

• Learn the role of descriptive analytics (DA) in solving business problems
• Learn the basic definitions, concepts, and architectures of data warehousing (DW)
• Learn the role of data warehouses in managerial decision support
• Learn the capabilities of business reporting and visualization as enablers of DA
• Learn the importance of information visualization in managerial decision support
• Learn the foundations of the emerging field of visual analytics
• Learn the capabilities and limitations of dashboards and scorecards
• Learn the fundamentals of business performance management (BPM)
Descriptive analytics, often referred to as business intelligence, uses data and models to answer the "what happened?" and "why did it happen?" questions in business settings. It is perhaps the most fundamental echelon in the three-step analytics continuum upon which predictive and prescriptive analytics capabilities are built. As you will see in the following chapters, the key enablers of descriptive analytics include data warehousing, business reporting, decision dashboards/scorecards, and visual analytics.
CHAPTER 3

Data Warehousing
LEARNING OBJECTIVES

• Understand the basic definitions and concepts of data warehouses
• Explain the role of data warehouses in decision support
• Understand data warehousing architectures
• Explain data integration and the extraction, transformation, and load (ETL) processes
• Describe the processes used in developing and managing data warehouses
• Describe real-time (active) data warehousing
• Explain data warehousing operations
• Understand data warehouse administration and security issues
The concept of data warehousing has been around since the late 1980s. This chapter provides the foundation for an important type of database, called a data warehouse, which is primarily used for decision support and provides improved analytical capabilities. We discuss data warehousing in the following sections:

3.1 Opening Vignette: Isle of Capri Casinos Is Winning with Enterprise Data Warehouse
3.2 Data Warehousing Definitions and Concepts
3.3 Data Warehousing Process Overview
3.4 Data Warehousing Architectures
3.5 Data Integration and the Extraction, Transformation, and Load (ETL) Processes
3.6 Data Warehouse Development
3.7 Data Warehousing Implementation Issues
3.8 Real-Time Data Warehousing
3.9 Data Warehouse Administration, Security Issues, and Future Trends
3.10 Resources, Links, and the Teradata University Network Connection
3.1 OPENING VIGNETTE: Isle of Capri Casinos Is Winning with Enterprise Data Warehouse
Isle of Capri is a unique and innovative player in the gaming industry. After entering the market in Biloxi, Mississippi, in 1992, Isle has grown into one of the country's largest publicly traded gaming companies, mostly by establishing properties in the southeastern United States and in the country's heartland. Isle of Capri Casinos, Inc., is currently operating 18 casinos in seven states, serving nearly 2 million visitors each year.
CHALLENGE

Even though they seem to have a differentiating edge compared to others in the highly competitive gaming industry, Isle is not entirely unique. Like any gaming company, Isle's success depends largely on its relationship with its customers: its ability to create a gaming, entertainment, and hospitality atmosphere that anticipates customers' needs and exceeds their expectations. Meeting such a goal is impossible without two important components: a company culture that is laser-focused on making the customer experience an enjoyable one, and a data and technology architecture that enables Isle to constantly deepen its understanding of its customers, as well as the various ways customer needs can be efficiently met.
SOLUTION

After an initial data warehouse implementation was derailed in 2005, in part by Hurricane Katrina, Isle decided to reboot the project with entirely new components and Teradata as the core solution and key partner, along with IBM Cognos for Business Intelligence. Shortly after that choice was made, Isle brought on a management team that clearly understood how the Teradata and Cognos solution could enable key decision makers throughout the operation to easily frame their own initial queries, as well as timely follow-up questions, thus opening up a wealth of possibilities to enhance the business.
RESULTS

Thanks to its successful implementation of a comprehensive data warehousing and business intelligence solution, Isle has achieved some deeply satisfying results. The company has dramatically accelerated and expanded the process of information gathering and dispersal, producing about 150 reports on a daily basis, 100 weekly, and 50 monthly, in addition to ad hoc queries, completed within minutes, all day every day. Prior to an enterprise data warehouse (EDW) from Teradata, Isle produced about 5 monthly reports per property, but because they took a week or more to produce, properties could not begin to analyze monthly activity until the second week of the following month. Moreover, none of the reports analyzed anything less than an entire month at a time; today, reports using up-to-the-minute data on specific customer segments at particular properties are available, often the same day, enabling the company to react much more quickly to a wide range of customer needs.

Isle has cut in half the time needed to construct its core monthly direct-mail campaigns and can generate less involved campaigns practically on the spot. In addition to moving faster, Isle has honed the process of segmentation and now can cross-reference a wide range of attributes, such as overall customer value, gaming behaviors, and hotel preferences. This enables them to produce more targeted campaigns aimed at particular customer segments and particular behaviors.

Isle also has enabled its management and employees to further deepen their understanding of customer behaviors by connecting data from its hotel systems and data from
its customer-tracking systems, and to act on that understanding through improved marketing campaigns and heightened levels of customer service. For example, the addition of hotel data offered new insights about the increased gaming local patrons do when they stay at a hotel. This, in turn, enabled new incentive programs (such as a free hotel night) that have pleased locals and increased Isle's customer loyalty.

The hotel data also has enhanced Isle's customer hosting program. By automatically notifying hosts when a high-value guest arrives at a hotel, hosts have forged deeper relationships with their most important clients. "This is by far the best tool we've had since I've been at the company," wrote one of the hosts.

Isle of Capri can now do more accurate property-to-property comparisons and analyses, largely because Teradata consolidated disparate data housed at individual properties and centralized it in one location. One result: A centralized intranet site posts daily figures for each individual property, so they can compare such things as performance of revenue from slot machines and table games, as well as complimentary redemption values. In addition, the IBM Cognos Business Intelligence tool enables additional comparisons, such as direct-mail redemption values, specific direct-mail program response rates, direct-mail-incented gaming revenue, hotel-incented gaming revenue, noncomplimentary (cash) revenue from hotel room reservations, and hotel room occupancy. One clear benefit is that it holds individual properties accountable for constantly raising the bar.

Beginning with an important change in marketing strategy that shifted the focus to customer days, time and again the Teradata/IBM Cognos BI implementation has demonstrated the value of extending the power of data throughout Isle's enterprise. This includes immediate analysis of response rates to marketing campaigns and the addition of profit and loss data that has successfully connected customer value and total property value. One example of the power of this integration: By joining customer value and total property value, Isle gains a better understanding of its retail customers, a population invisible to them before, enabling them to more effectively target marketing efforts, such as radio ads.

Perhaps most significantly, Isle has begun to add slot machine data to the mix. The most important and immediate impact will be the way in which customer value will inform purchasing of new machines and product placement on the casino floor. Down the road, the addition of this data also might position Isle to take advantage of server-based gaming, where slot machines on the casino floor will essentially be computer terminals that enable the casino to switch a game to a new one in a matter of seconds.

In short, as Isle constructs its solutions for regularly funneling slot machine data into the warehouse, its ability to use data to re-imagine the floor and forge ever deeper and more lasting relationships will exceed anything it might have expected when it embarked on this project.
QUESTIONS FOR THE OPENING VIGNETTE

1. Why is it important for Isle to have an EDW?
2. What were the business challenges or opportunities that Isle was facing?
3. What was the process Isle followed to realize EDW? Comment on the potential challenges Isle might have had going through the process of EDW development.
4. What were the benefits of implementing an EDW at Isle? Can you think of other potential benefits that were not listed in the case?
5. Why do you think large enterprises like Isle in the gaming industry can succeed without having a capable data warehouse/business intelligence infrastructure?
WHAT WE CAN LEARN FROM THIS VIGNETTE

The opening vignette illustrates the strategic value of implementing an enterprise data warehouse, along with its supporting BI methods. Isle of Capri Casinos was able to leverage its data assets spread throughout the enterprise to be used by knowledge workers (wherever and whenever they are needed) to make accurate and timely decisions. The data warehouse integrated various databases throughout the organization into a single, in-house enterprise unit to generate a single version of the truth for the company, putting all decision makers, from planning to marketing, on the same page. Furthermore, by regularly funneling slot machine data into the warehouse, combined with customer-specific rich data that come from a variety of sources, Isle significantly improved its ability to discover patterns to re-imagine/reinvent the gaming floor operations and forge ever deeper and more lasting relationships with its customers. The key lesson here is that an enterprise-level data warehouse combined with a strategy for its use in decision support can result in significant benefits (financial and otherwise) for an organization.

Sources: Teradata, Customer Success Stories, teradata.com/t/case-studies/Isle-of-Capri-Casinos-Executive-Summary-EB6277 (accessed February 2013); www-01.ibm.com/software/analytics/cognos.
3.2 DATA WAREHOUSING DEFINITIONS AND CONCEPTS
Using real-time data warehousing in conjunction with DSS and BI tools is an important way to conduct business processes. The opening vignette demonstrates a scenario in which a real-time active data warehouse supported decision making by analyzing large amounts of data from various sources to provide rapid results to support critical processes. The single version of the truth stored in the data warehouse and provided in an easily digestible form expands the boundaries of Isle of Capri's innovative business processes. With real-time data flows, Isle can view the current state of its business and quickly identify problems, which is the first and foremost step toward solving them analytically.
Decision makers require concise, dependable information about current operations, trends, and changes. Data are often fragmented in distinct operational systems, so managers often make decisions with partial information, at best. Data warehousing cuts through this obstacle by accessing, integrating, and organizing key operational data in a form that is consistent, reliable, timely, and readily available, wherever and whenever needed.
What Is a Data Warehouse?
In simple terms, a data warehouse (DW) is a pool of data produced to support decision making; it is also a repository of current and historical data of potential interest to managers throughout the organization. Data are usually structured to be available in a form ready for analytical processing activities (i.e., online analytical processing [OLAP], data mining, querying, reporting, and other decision support applications). A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process.
A Historical Perspective to Data Warehousing
Even though data warehousing is a relatively new term in information technology, its roots can be traced way back in time, even before computers were widely used. In the early 1900s, people were using data (though mostly via manual methods) to formulate trends to help business users make informed decisions, which is the most prevailing purpose of data warehousing.
The motivations that led to developing data warehousing technologies go back to the 1970s, when the computing world was dominated by mainframes. Real business data-processing applications, the ones run on the corporate mainframes, had complicated file structures using early-generation databases (not the table-oriented relational databases most applications use today) in which they stored data. Although these applications did a decent job of performing routine transactional data-processing functions, the data created as a result of these functions (such as information about customers, the products they ordered, and how much money they spent) was locked away in the depths of the files and databases. When aggregated information such as sales trends by region and by product type was needed, one had to formally request it from the data-processing department, where it was put on a waiting list with a couple hundred other report requests (Hammergren and Simon, 2009). Even though the need for information and the data that could be used to generate it existed, the database technology was not there to satisfy it. Figure 3.1 shows a timeline of some of the significant events that led to the development of data warehousing.
Later in this decade, commercial hardware and software companies began to emerge with solutions to this problem. Between 1976 and 1979, the concept for a new company, Teradata, grew out of research at the California Institute of Technology (Caltech), driven from discussions with Citibank's advanced technology group. Founders worked to design a database management system for parallel processing with multiple microprocessors, targeted specifically for decision support. Teradata was incorporated on July 13, 1979, and started in a garage in Brentwood, California. The name Teradata was chosen to symbolize the ability to manage terabytes (trillions of bytes) of data.
The 1980s were the decade of personal computers and minicomputers. Before anyone knew it, real computer applications were no longer only on mainframes; they were all over the place, everywhere you looked in an organization. That led to a portentous problem called islands of data. The solution to this problem led to a new type of software, called a distributed database management system, which would magically pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user's question. Although the concept was a good one and early results from research were promising, the results were plain and simple: They just didn't work efficiently in the real world, and the islands-of-data problem still existed.
FIGURE 3.1 A List of Events That Led to Data Warehousing Development. [Timeline spanning the 1970s through the 2010s; events include: mainframe computers, simple data entry, routine reporting, centralized data storage, and primitive database structures; Teradata incorporated; mini/personal computers (PCs) and business applications for PCs; distributed DBMS and relational DBMS; Teradata ships commercial databases; the term business data warehouse coined; data warehousing was born; Inmon's Building the Data Warehouse and Kimball's The Data Warehouse Toolkit; EDW architecture design; exponentially growing Web data; consolidation of the DW/BI industry; data warehouse appliances emerged; business intelligence popularized; data mining and predictive modeling; open source software; SaaS, PaaS, and cloud computing; Big Data analytics, social media analytics, text and Web analytics; Hadoop, MapReduce, and NoSQL; in-memory and in-database analytics.]
Meanwhile, Teradata began shipping commercial products to solve this problem. Wells Fargo Bank received the first Teradata test system in 1983, a parallel RDBMS (relational database management system) for decision support, the world's first. By 1984, Teradata released a production version of their product, and in 1986, Fortune magazine named Teradata Product of the Year. Teradata, still in existence today, built the first data warehousing appliance, a combination of hardware and software to solve the data warehousing needs of many. Other companies began to formulate their strategies, as well.
During this decade several other events happened, collectively making it the decade of data warehousing innovation. For instance, Ralph Kimball founded Red Brick Systems in 1986. Red Brick began to emerge as a visionary software company by discussing how to improve data access; in 1988, Barry Devlin and Paul Murphy of IBM Ireland introduced the term business data warehouse as a key component of business information systems.
In the 1990s a new approach to solving the islands-of-data problem surfaced. If the 1980s approach of reaching out and accessing data directly from the files and databases didn't work, the 1990s philosophy involved going back to the 1970s method, in which data from those places was copied to another location, only doing it right this time; hence, data warehousing was born. In 1993, Bill Inmon wrote the seminal book Building the Data Warehouse. Many people recognize Bill as the father of data warehousing. Additional publications emerged, including the 1996 book by Ralph Kimball, The Data Warehouse Toolkit, which discussed general-purpose dimensional design techniques to improve the data architecture for query-centered decision support systems.
In the 2000s, in the world of data warehousing, both popularity and the amount of data continued to grow. The vendor community and options began to consolidate. In 2006, Microsoft acquired ProClarity, jumping into the data warehousing market. In 2007, Oracle purchased Hyperion, SAP acquired Business Objects, and IBM merged with Cognos. The data warehousing leaders of the 1990s have been swallowed by some of the largest providers of information system solutions in the world. During this time, other innovations have emerged, including data warehouse appliances from vendors such as Netezza (acquired by IBM), Greenplum (acquired by EMC), DATAllegro (acquired by Microsoft), and performance management appliances that enable real-time performance monitoring. These innovative solutions provided cost savings because they were plug-compatible with legacy data warehouse solutions.
In the 2010s the big buzz has been Big Data. Many believe that Big Data is going to make an impact on data warehousing as we know it. Either they will find a way to coexist (which seems to be the most likely case, at least for several years) or Big Data (and the technologies that come with it) will make traditional data warehousing obsolete. The technologies that came with Big Data include Hadoop, MapReduce, NoSQL, Hive, and so forth. Maybe we will see a new term coined in the world of data that combines the needs and capabilities of traditional data warehousing and the Big Data phenomenon.
Characteristics of Data Warehousing
A common way of introducing data warehousing is to refer to its fundamental characteristics (see Inmon, 2005):
• Subject oriented. Data are organized by detailed subject, such as sales, products, or customers, containing only information relevant for decision support. Subject orientation enables users to determine not only how their business is performing, but why. A data warehouse differs from an operational database in that most operational databases have a product orientation and are tuned to handle transactions that update the database. Subject orientation provides a more comprehensive view of the organization.
• Integrated. Integration is closely related to subject orientation. Data warehouses must place data from different sources into a consistent format. To do so, they must deal with naming conflicts and discrepancies among units of measure. A data warehouse is presumed to be totally integrated.
• Time variant (time series). A warehouse maintains historical data. The data do not necessarily provide current status (except in real-time systems). They detect trends, deviations, and long-term relationships for forecasting and comparisons, leading to decision making. Every data warehouse has a temporal quality. Time is the one important dimension that all data warehouses must support. Data for analysis from multiple sources contains multiple time points (e.g., daily, weekly, monthly views).
• Nonvolatile. After data are entered into a data warehouse, users cannot change or update the data. Obsolete data are discarded, and changes are recorded as new data.
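To make the last two characteristics concrete, here is a minimal Python sketch (the class, field, and value names are ours, not from the text): the store is append-only (nonvolatile), and every row carries a time point so queries can compare values across time (time variant).

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical illustration: a warehouse table is append-only (nonvolatile)
# and keyed by time (time variant). Names and schema are illustrative only.
@dataclass
class SalesWarehouse:
    rows: list = field(default_factory=list)  # append-only store

    def load(self, snapshot_date: date, customer_id: int, sales_amount: float):
        """Loads never overwrite history; corrections arrive as new rows."""
        self.rows.append(
            {"snapshot_date": snapshot_date,
             "customer_id": customer_id,
             "sales_amount": sales_amount}
        )

    def trend(self, customer_id: int):
        """Time-variant queries compare values across time points."""
        return sorted(
            (r["snapshot_date"], r["sales_amount"])
            for r in self.rows if r["customer_id"] == customer_id
        )

dw = SalesWarehouse()
dw.load(date(2013, 1, 31), customer_id=42, sales_amount=150.0)
dw.load(date(2013, 2, 28), customer_id=42, sales_amount=175.0)
print(dw.trend(42))  # monthly views of the same subject over time
```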
These characteristics enable data warehouses to be tuned almost exclusively for data access. Some additional characteristics may include the following:
• Web based. Data warehouses are typically designed to provide an efficient computing environment for Web-based applications.
• Relational/multidimensional. A data warehouse uses either a relational structure or a multidimensional structure. A recent survey on multidimensional structures can be found in Romero and Abelló (2009).
• Client/server. A data warehouse uses the client/server architecture to provide easy access for end users.
• Real time. Newer data warehouses provide real-time, or active, data-access and analysis capabilities (see Basu, 2003; and Bonde and Kuckuk, 2004).
• Include metadata. A data warehouse contains metadata (data about data) about how the data are organized and how to effectively use them.
Whereas a data warehouse is a repository of data, data warehousing is literally the entire process (see Watson, 2002). Data warehousing is a discipline that results in applications that provide decision support capability, allows ready access to business information, and creates business insight. The three main types of data warehouses are data marts, operational data stores (ODS), and enterprise data warehouses (EDW). In addition to discussing these three types of warehouses next, we also discuss metadata.
Data Marts
Whereas a data warehouse combines databases across an entire enterprise, a data mart is usually smaller and focuses on a particular subject or department. A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g., marketing, operations). A data mart can be either dependent or independent. A dependent data mart is a subset that is created directly from the data warehouse. It has the advantages of using a consistent data model and providing quality data. Dependent data marts support the concept of a single enterprise-wide data model, but the data warehouse must be constructed first. A dependent data mart ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users. The high cost of data warehouses limits their use to large companies. As an alternative, many firms use a lower-cost, scaled-down version of a data warehouse referred to as an independent data mart. An independent data mart is a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW.
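A minimal, hypothetical sketch of the dependent case (table and field names are ours): the mart is derived as a single-subject subset of the EDW, so mart users see the same version of the data as warehouse users. An independent mart would instead be loaded straight from its own sources, risking inconsistent definitions.

```python
# Hypothetical sketch: a dependent data mart carved from the EDW as a
# single-subject subset. The row data are illustrative only.
edw = [
    {"subject": "marketing", "customer_id": 1, "campaign": "radio", "spend": 200.0},
    {"subject": "operations", "customer_id": 1, "machine_id": 7, "hours": 5.0},
    {"subject": "marketing", "customer_id": 2, "campaign": "email", "spend": 50.0},
]

def build_dependent_mart(warehouse, subject_area):
    """Derive a data mart as a filtered view of the warehouse, so every
    mart user sees the same version of the data as warehouse users."""
    return [row for row in warehouse if row["subject"] == subject_area]

marketing_mart = build_dependent_mart(edw, "marketing")
print(marketing_mart)
```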
Operational Data Stores
An operational data store (ODS) provides a fairly recent form of customer information file (CIF). This type of database is often used as an interim staging area for a data warehouse. Unlike the static contents of a data warehouse, the contents of an ODS are updated throughout the course of business operations. An ODS is used for short-term decisions involving mission-critical applications rather than for the medium- and long-term decisions associated with an EDW. An ODS is similar to short-term memory in that it stores only very recent information. In comparison, a data warehouse is like long-term memory because it stores permanent information. An ODS consolidates data from multiple source systems and provides a near-real-time, integrated view of volatile, current data. The extraction, transformation, and load (ETL) processes (discussed later in this chapter) for an ODS are identical to those for a data warehouse. Finally, oper marts (see Imhoff, 2001) are created when operational data need to be analyzed multidimensionally. The data for an oper mart come from an ODS.
Enterprise Data Warehouses (EDW)
An enterprise data warehouse (EDW) is a large-scale data warehouse that is used across the enterprise for decision support. It is the type of data warehouse that Isle of Capri developed, as described in the opening vignette. The large-scale nature provides integration of data from many sources into a standard format for effective BI and decision support applications. EDW are used to provide data for many types of DSS, including CRM, supply chain management (SCM), business performance management (BPM), business activity monitoring (BAM), product life-cycle management (PLM), revenue management, and sometimes even knowledge management systems (KMS). Application Case 3.1 shows the variety of benefits that telecommunication companies leverage from implementing data warehouse-driven analytics solutions.
Metadata
Metadata are data about data (e.g., see Sen, 2004; and Zhao, 2005). Metadata describe the structure of and some meaning about data, thereby contributing to their effective or
Application Case 3.1
A Better Data Plan: Well-Established TELCOs Leverage Data Warehousing and Analytics
to Stay on Top in a Competitive Industry
Mobile service providers (i.e., Telecommunication Companies, or TELCOs in short) that helped trigger the explosive growth of the industry in the mid- to late-1990s have long reaped the benefits of being first to market. But to stay competitive, these companies must continuously refine everything from customer service to plan pricing. In fact, veteran carriers face many of the same challenges that up-and-coming carriers do: retaining customers, decreasing costs, fine-tuning pricing models, improving customer satisfaction, acquiring new customers, and understanding the role of social media in customer loyalty.
Highly targeted data analytics play an ever-more-critical role in helping carriers secure or improve their standing in an increasingly competitive marketplace. Here's how some of the world's leading providers are creating a strong future based on solid business and customer intelligence.
Customer Retention
It's no secret that the speed and success with which a provider handles service requests directly affects customer satisfaction and, in turn, the propensity to churn. But getting down to which factors have the greatest impact is a challenge.
"If we could trace the steps involved with each process, we could understand points of failure and acceleration," notes Roxanne Garcia, manager of the Commercial Operations Center for Telefonica de Argentina. "We could measure workflows both within and across functions, anticipate rather than react to performance indicators, and improve the overall satisfaction with onboarding new customers."
The company's solution was its traceability project, which began with 10 dashboards in 2009. It has since realized US$2.4 million in annualized revenues
and cost savings, shortened customer provisioning
times and reduced customer defections by 30%.
Cost Reduction
Staying ahead of the game in any industry depends,
in large part, on keeping costs in line. For France’s
Bouygues Telecom, cost reduction came in the form
of automation. Aladin, the company’s Teradata-based
marketing operations management system, auto-
mates marketing/communications collateral produc-
tion. It delivered more than US$1 million in savings
in a single year while tripling email campaign and
content production.
"The goal is to be more productive and responsive, to simplify teamwork, [and] to standardize and protect our expertise," notes Catherine Corrado, the company's project lead and retail communications manager. "[Aladin lets] team members focus on value-added work by reducing low-value tasks. The end result is more quality and more creative [output]."
An unintended but very welcome benefit of
Aladin is that other departments have been inspired to
begin deploying similar projects for everything from
call center support to product/offer launch processes.
Customer Acquisition
With market penetration near or above 100% in many countries, thanks to consumers who own multiple devices, the issue of new customer acquisition is no small challenge. Pakistan's largest carrier, Mobilink, also faces the difficulty of operating in a market where 98% of users have a pre-paid plan that requires regular purchases of additional minutes.
"Topping up, in particular, keeps the revenues strong and is critical to our company's growth," says Umer Afzal, senior manager, BI. "Previously we lacked the ability to enhance this aspect of incremental growth. Our sales information model gave us that ability because it helped the distribution team plan sales tactics based on smarter data-driven strategies that keep our suppliers [of SIM cards, scratch cards and electronic top-up capability] fully stocked."
As a result, Mobilink has not only grown sub-
scriber recharges by 2% but also expanded new cus-
tomer acquisition by 4% and improved the profitability
of those sales by 4%.
Social Networking
The expanding use of social networks is changing how many organizations approach everything from customer service to sales and marketing. More carriers are turning their attention to social networks to better understand and influence customer behavior.
Mobilink has initiated a social network analysis project that will enable the company to explore the concept of viral marketing and identify key influencers who can act as brand ambassadors to cross-sell products. Velcom is looking for similar key influencers as well as low-value customers whose social value can be leveraged to improve existing relationships. Meanwhile, Swisscom is looking to combine the social network aspect of customer behavior with the rest of its analysis over the next several months.
Rise to the Challenge
While each market presents its own unique challenges, most mobile carriers spend a great deal of time and resources creating, deploying, and refining plans to address each of the challenges outlined here. The good news is that just as the industry and mobile technology have expanded and improved over the years, so also have the data analytics solutions that have been created to meet these challenges head on.
Sound data analysis uses existing customer, business, and market intelligence to predict and influence future behaviors and outcomes. The end result is a smarter, more agile, and more successful approach to gaining market share and improving profitability.
QUESTIONS FOR DISCUSSION
1. What are the main challenges for TELCOs?
2. How can data warehousing and data analytics
help TELCOs in overcoming their challenges?
3. Why do you think TELCOs are well suited to take
full advantage of data analytics?
Source: Teradata Magazine, Case Study by Colleen Marble, "A Better Data Plan: Well-Established Telcos Leverage Analytics to Stay on Top in a Competitive Industry," http://www.teradatamagazine.com/v13n01/Features/A-Better-Data-Plan/ (accessed September 2013).
ineffective use. Mehra (2005) indicated that few organizations really understand metadata, and fewer understand how to design and implement a metadata strategy. Metadata are generally defined in terms of usage as technical or business metadata. Pattern is another way to view metadata. According to the pattern view, we can differentiate between syntactic metadata (i.e., data describing the syntax of data), structural metadata (i.e., data describing the structure of the data), and semantic metadata (i.e., data describing the meaning of the data in a specific domain).
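As a rough illustration of these three patterns, the hypothetical entry below (field names are ours, not from the text) attaches syntactic, structural, and semantic metadata to a single warehouse column.

```python
# A minimal, hypothetical sketch of the three metadata patterns described
# above, attached to one warehouse column. All names are illustrative.
column_metadata = {
    "name": "sales_amount",
    "syntactic": {          # describes the syntax of the data
        "data_type": "DECIMAL(12,2)",
        "nullable": False,
    },
    "structural": {         # describes the structure the data lives in
        "table": "fact_sales",
        "joins_to": ["dim_customer", "dim_date"],
    },
    "semantic": {           # describes the meaning in the business domain
        "definition": "Net sales in U.S. dollars, after discounts",
        "owner": "Finance",
    },
}

# A BI tool could use such entries to label reports or validate loads.
print(column_metadata["semantic"]["definition"])
```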
We next explain traditional metadata patterns and insights into how to implement an effective metadata strategy via a holistic approach to enterprise metadata integration. The approach includes ontology and metadata registries; enterprise information integration (EII); extraction, transformation, and load (ETL); and service-oriented architectures (SOA). Effectiveness, extensibility, reusability, interoperability, efficiency and performance, evolution, entitlement, flexibility, segregation, user interface, versioning, versatility, and low maintenance cost are some of the key requirements for building a successful metadata-driven enterprise.
According to Kassam (2002), business metadata comprise information that increases our understanding of traditional (i.e., structured) data. The primary purpose of metadata should be to provide context to the reported data; that is, it provides enriching information that leads to the creation of knowledge. Business metadata, though difficult to provide efficiently, release more of the potential of structured data. The context need not be the same for all users. In many ways, metadata assist in the conversion of data and information into knowledge. Metadata form a foundation for a metabusiness architecture (see Bell, 2001). Tannenbaum (2002) described how to identify metadata requirements. Vaduva and Vetterli (2001) provided an overview of metadata management for data warehousing. Zhao (2005) described five levels of metadata management maturity: (1) ad hoc, (2) discovered, (3) managed, (4) optimized, and (5) automated. These levels help in understanding where an organization is in terms of how and how well it uses its metadata.
The design, creation, and use of metadata (descriptive or summary data about data) and its accompanying standards may involve ethical issues. There are ethical considerations involved in the collection and ownership of the information contained in metadata, including privacy and intellectual property issues that arise in the design, collection, and dissemination stages (for more, see Brody, 2003).
SECTION 3.2 REVIEW QUESTIONS
1. What is a data warehouse?
2. How does a data warehouse differ from a database?
3. What is an ODS?
4. Differentiate among a data mart, an ODS, and an EDW.
5. Explain the importance of metadata.
3.3 DATA WAREHOUSING PROCESS OVERVIEW
Organizations, private and public, continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems. Maintaining and using these data and information becomes extremely complex, especially as scalability issues arise. In addition, the number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet. Working with multiple databases, either integrated in a data warehouse or not, has become an extremely difficult task requiring considerable expertise, but it can provide immense benefits far exceeding its cost. As an illustrative example, Figure 3.2 shows business benefits of the enterprise data warehouse built by Teradata for a major automobile manufacturer.
FIGURE 3.2 Data-Driven Decision Making: Business Benefits of an Enterprise Data Warehouse. [Figure: one management and analytical platform for product configuration, warranty, and diagnostic readout data yields reduced infrastructure expense (2/3 cost reduction through data mart consolidation; one strategic platform for business intelligence), reduced warranty expense (improved reimbursement accuracy through improved claim data quality), improved cost of quality (faster identification, prioritization, and resolution of quality issues), accurate environmental performance reporting (compliance reporting), and IT architecture standardization.]
Application Case 3.2
Data Warehousing Helps MultiCare Save More Lives
In the spring of 2012, leadership at MultiCare Health System (MultiCare), a Tacoma, Washington-based health system, realized the results of a 12-month journey to reduce septicemia.
The effort was supported by the system's top leadership, who participated in a data-driven approach to prioritize care improvement based on an analysis of resources consumed and variation in care outcomes. Reducing septicemia (mortality rates) was a top priority for MultiCare as a result of three hospitals performing below, and one that was performing well below, national mortality averages.
In September 2010, MultiCare implemented
Health Catalyst’s Adaptive Data Warehouse, a
healthcare-specific data model, and subsequent clin-
ical and process improvement services to measure
and effect care through organizational and process
improvements. Two major factors contributed to the
rapid reduction in septicemia mortality.
Clinical Data to Drive Improvement
The Adaptive Data Warehouse™ organized and simplified data from multiple data sources across the continuum of care. It became the single source of truth requisite to see care improvement opportunities and to measure change. It also proved to be an important means to unify clinical, IT, and financial leaders and to drive accountability for performance improvement.
Because it proved difficult to define sepsis due to the complex comorbidity factors leading to septicemia, MultiCare partnered with Health Catalyst to refine the clinical definition of sepsis. Health Catalyst's data work allowed MultiCare to explore around the boundaries of the definition and to ultimately settle on an algorithm that defined a septic patient. The iterative work resulted in increased confidence in the severe sepsis cohort.
System-Wide Critical Care Collaborative
The establishment and collaborative efforts of permanent, integrated teams consisting of clinicians, technologists, analysts, and quality personnel were essential for accelerating MultiCare's efforts to reduce septicemia mortality. Together the collaborative addressed three key bodies of work: standard of care definition, early identification, and efficient delivery of the defined-care standard.
Standard of Care: Severe Sepsis Order Set
The Critical Care Collaborative streamlined several sepsis order sets from across the organization into one system-wide standard for the care of severely septic patients. Adult patients presenting with sepsis receive the same care, no matter at which MultiCare hospital they present.
Early Identification: Modified Early
Warning System (MEWS)
MultiCare developed a modified early warning system (MEWS) dashboard that leveraged the cohort definition and the clinical EMR to quickly identify patients who were trending toward a sudden downturn. Hospital staff constantly monitor MEWS, which serves as an early detection tool for caregivers to provide preemptive interventions.
Efficient Delivery: Code Sepsis
(“Time Is Tissue”)
The final key piece of clinical work undertaken by the Collaborative was to ensure timely implementation of the defined standard of care to patients who are more efficiently identified. That model already exists in healthcare and is known as the "code" process. Similar to other "code" processes (code trauma, code neuro, code STEMI), code sepsis at MultiCare is designed to bring together essential caregivers in order to efficiently deliver time-sensitive, life-saving treatments to the patient presenting with severe sepsis.
In just 12 months, MultiCare was able to reduce septicemia mortality rates by an average of 22 percent, leading to more than $1.3 million in validated cost savings during that same period. The sepsis cost reductions and quality of care improvements have raised the expectation that similar results can be realized in other areas of MultiCare, including heart failure, emergency department performance, and inpatient throughput.
QUESTIONS FOR DISCUSSION
1. What do you think is the role of data warehousing in healthcare systems?
2. How did MultiCare use data warehousing to improve health outcomes?
Source: healthcatalyst.com/success_stories/multicare-2 (accessed February 2013).
Many organizations need to create data warehouses, massive data stores of time-series data for decision support. Data are imported from various external and internal resources and are cleansed and organized in a manner consistent with the organization's needs. After the data are populated in the data warehouse, data marts can be loaded for a specific area or department. Alternatively, data marts can be created first, as needed, and then integrated into an EDW. Often, though, data marts are not developed, but data are simply loaded onto PCs or left in their original state for direct manipulation using BI tools.
In Figure 3.3, we show the data warehouse concept. The following are the major components of the data warehousing process:
• Data sources. Data are sourced from multiple independent operational "legacy" systems and possibly from external data providers (such as the U.S. Census). Data may also come from an OLTP or ERP system. Web data in the form of Web logs may also feed a data warehouse.
• Data extraction and transformation. Data are extracted and properly transformed using custom-written or commercial software called ETL.
• Data loading. Data are loaded into a staging area, where they are transformed and cleansed. The data are then ready to load into the data warehouse and/or data marts.
• Comprehensive database. Essentially, this is the EDW to support all decision analysis by providing relevant summarized and detailed information originating from many different sources.
• Metadata. Metadata are maintained so that they can be assessed by IT personnel and users. Metadata include software programs about data and rules for organizing data summaries that are easy to index and search, especially with Web tools.
FIGURE 3.3 A Data Warehouse Framework and Views. [Figure: data sources feed an ETL process (select, extract, transform, integrate, load) into an enterprise data warehouse with metadata; from the warehouse, data are replicated to data marts (or, under a no-data-marts option, accessed directly) and reached by access applications such as data/text mining, OLAP, dashboards, the Web, and visualization tools.]
• Middleware tools. Middleware tools enable access to the data warehouse. Power users such as analysts may write their own SQL queries. Others may employ a managed query environment, such as Business Objects, to access data. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools.
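Putting the components above together, here is a minimal, hypothetical ETL sketch in Python (the file name, column names, and quality rule are ours, not from the text): rows are extracted from an operational export, transformed and cleansed in a staging list, and then loaded into the warehouse.

```python
import csv
from datetime import datetime

# Create a tiny stand-in for an operational export so the sketch runs end to end.
with open("daily_sales_export.csv", "w") as f:
    f.write("customer_id,date,amount\n1,01/31/2013,150.00\n,02/01/2013,99.00\n")

def extract(path):
    """Extract: pull raw rows from an operational source (here, a CSV export)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(raw_rows):
    """Transform/cleanse in staging: standardize names, types, and units."""
    staged = []
    for row in raw_rows:
        if not row.get("customer_id"):      # drop rows failing a quality rule
            continue
        staged.append({
            "customer_id": int(row["customer_id"]),
            "sale_date": datetime.strptime(row["date"], "%m/%d/%Y").date(),
            "amount_usd": round(float(row["amount"]), 2),
        })
    return staged

def load(staged_rows, warehouse):
    """Load: append the cleansed rows to the warehouse (nonvolatile store)."""
    warehouse.extend(staged_rows)

warehouse = []
load(transform(extract("daily_sales_export.csv")), warehouse)
print(warehouse)  # one cleansed row; the row missing customer_id was rejected
```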
SECTION 3.3 REVIEW QUESTIONS
1. Describe the data warehousing process.
2. Describe the major components of a data warehouse.
3. Identify and discuss the role of middleware tools.
3.4 DATA WAREHOUSING ARCHITECTURES
There are several basic information system architectures that can be used for data warehousing. Generally speaking, these architectures are commonly called client/server or n-tier architectures, of which two-tier and three-tier architectures are the most common (see Figures 3.4 and 3.5), but sometimes there is simply one tier. These types of multi-tiered architectures are known to be capable of serving the needs of large-scale, performance-demanding information systems such as data warehouses.
FIGURE 3.4 Architecture of a Three-Tier Data Warehouse. [Tier 1: client workstation; Tier 2: application server; Tier 3: database server.]
FIGURE 3.5 Architecture of a Two-Tier Data Warehouse. [Tier 1: client workstation; Tier 2: application and database server.]
Referring to the use of n-tiered architectures for data warehousing, Hoffer et al. (2007) distinguished among these architectures by dividing the data warehouse into three parts:
1. The data warehouse itself, which contains the data and associated software
2. Data acquisition (back-end) software, which extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse
3. Client (front-end) software, which allows users to access and analyze data from the warehouse (a DSS/BI/business analytics [BA] engine)
In a three-tier architecture, operational systems contain the data and the software for data acquisition in one tier (i.e., the server), the data warehouse is another tier, and the third tier includes the DSS/BI/BA engine (i.e., the application server) and the client (see Figure 3.4). Data from the warehouse are processed twice and deposited in an additional multidimensional database, organized for easy multidimensional analysis and presentation, or replicated in data marts. The advantage of the three-tier architecture is its separation of the functions of the data warehouse, which eliminates resource constraints and makes it possible to easily create data marts.
In a two-tier architecture, the DSS engine physically runs on the same hardware platform as the data warehouse (see Figure 3.5). Therefore, it is more economical than the three-tier structure. The two-tier architecture can have performance problems for large data warehouses that work with data-intensive applications for decision support.
Much of the common wisdom assumes an absolutist approach, maintaining that one solution is better than the other, regardless of the organization's circumstances and unique needs. To further complicate these architectural decisions, many consultants and software vendors focus on one portion of the architecture, therefore limiting their capacity and motivation to assist an organization through the options based on its needs. But these aspects are being questioned and analyzed. For example, Ball (2005) provided decision criteria for organizations that plan to implement a BI application and have already determined their need for multidimensional data marts but need help determining the appropriate tiered architecture. His criteria revolve around forecasting needs for space and speed of access (see Ball, 2005, for details).
Data warehousing and the Internet are two key technologies that offer important solutions for managing corporate data. The integration of these two technologies produces Web-based data warehousing. In Figure 3.6, we show the architecture of Web-based data warehousing. The architecture is three tiered and includes the PC client, Web server, and application server. On the client side, the user needs an Internet connection and a Web browser (preferably Java enabled) through the familiar graphical user interface (GUI). The Internet/intranet/extranet is the communication medium between client
FIGURE 3.6 Architecture of Web-Based Data Warehousing. [Figure: a client (Web browser) exchanges Web pages over the Internet/intranet/extranet with a Web server, which is backed by an application server and a data warehouse.]
and servers. On the server side, a Web server is used to manage the inflow and outflow of information between client and server. It is backed by both a data warehouse and an application server. Web-based data warehousing offers several compelling advantages, including ease of access, platform independence, and lower cost.
The Vanguard Group moved to a Web-based, three-tier architecture for its enterprise architecture to integrate all its data and provide customers with the same views of data as internal users (Dragoon, 2003). Likewise, Hilton migrated all its independent client/server systems to a three-tier data warehouse, using a Web design enterprise system. This change involved an investment of $3.8 million (excluding labor) and affected 1,500 users. It increased processing efficiency (speed) by a factor of six. When it was deployed, Hilton expected to save $4.5 to $5 million annually. Finally, Hilton experimented with Dell's clustering (i.e., parallel computing) technology to enhance scalability and speed (see Anthes, 2003).
Web architectures for data warehousing are similar in structure to other data warehousing architectures, requiring a design choice for housing the Web data warehouse with the transaction server or as a separate server(s). Page-loading speed is an important consideration in designing Web-based applications; therefore, server capacity must be planned carefully.
Several issues must be considered when deciding which architecture to use. Among them are the following:
• Which database management system (DBMS) should be used? Most data warehouses are built using relational database management systems (RDBMS). Oracle (Oracle Corporation, oracle.com), SQL Server (Microsoft Corporation, microsoft.com/sql), and DB2 (IBM Corporation, http://www-01.ibm.com/software/data/db2/) are the ones most commonly used. Each of these products supports both client/server and Web-based architectures.
• Will parallel processing and/or partitioning be used? Parallel processing enables multiple CPUs to process data warehouse query requests simultaneously and provides scalability. Data warehouse designers need to decide whether the database tables will be partitioned (i.e., split into smaller tables) for access efficiency and what the criteria will be; a minimal sketch of this idea follows the list. This is an important consideration that is necessitated by the large amounts of data contained in a typical data warehouse. A recent survey on parallel and distributed data warehouses can be found in Furtado (2009). Teradata (teradata.com) has successfully adopted and often commented on its novel implementation of this approach.
• Will data migration tools be used to load the data warehouse? Moving data from an existing system into a data warehouse is a tedious and laborious task. Depending on the diversity and the location of the data assets, migration may be a relatively simple procedure or (in contrast) a months-long project. The results of a thorough assessment of the existing data assets should be used to determine whether to use migration tools and, if so, what capabilities to seek in those commercial tools.
• What tools will be used to support data retrieval and analysis? Often it is necessary to use specialized tools to periodically locate, access, analyze, extract, transform, and load necessary data into a data warehouse. A decision has to be made on (1) developing the migration tools in-house, (2) purchasing them from a third-party provider, or (3) using the ones provided with the data warehouse system. Overly complex, real-time migrations warrant specialized third-party ETL tools.
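As a rough sketch of the partitioning idea raised in the second bullet above (all names and row data are ours; real warehouses such as Teradata implement this inside the DBMS engine), the following Python fragment hash-partitions rows into smaller tables and fans a query out over the partitions in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: rows are hash-partitioned by customer_id into smaller
# tables, and a query is run over all partitions concurrently.
NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def insert(row):
    """Route each row to its partition by hashing the partitioning key."""
    partitions[hash(row["customer_id"]) % NUM_PARTITIONS].append(row)

def scan_partition(part, predicate):
    """Each worker scans only its own, smaller table."""
    return [r for r in part if predicate(r)]

for cid, amt in [(1, 10.0), (2, 25.0), (3, 40.0), (6, 55.0)]:
    insert({"customer_id": cid, "amount": amt})

# Parallel scan: partitions are searched concurrently, then results merged.
with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
    chunks = pool.map(lambda p: scan_partition(p, lambda r: r["amount"] > 20),
                      partitions)
result = [row for chunk in chunks for row in chunk]
print(result)
```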
Alternative Data Warehousing Architectures
At the highest level, data warehouse architecture design viewpoints can be categorized into enterprise-wide data warehouse (EDW) design and data mart (DM) design (Golfarelli and Rizzi, 2009). In Figure 3.7 (parts a-e), we show some alternatives to the basic architectural design types that are neither pure EDW nor pure DM, but in between or beyond the traditional architectural structures. Notable new ones include hub-and-spoke and federated architectures. The five architectures shown in Figure 3.7 (parts a-e) are proposed by Ariyachandra and Watson (2005, 2006a, and 2006b). Previously, in an extensive study, Sen and Sinha (2005) identified 15 different data warehousing methodologies. The sources of these methodologies are classified into three broad categories: core-technology vendors, infrastructure vendors, and information-modeling companies.
a. Independent data marts. This is arguably the simplest and the least costly architecture alternative. The data marts are developed to operate independently of each other to serve the needs of individual organizational units. Because of their independence, they may have inconsistent data definitions and different dimensions and measures, making it difficult to analyze data across the data marts (i.e., it is difficult, if not impossible, to get to the "one version of the truth").
b. Data mart bus architecture. This architecture is a viable alternative to the independent data marts where the individual marts are linked to each other via some kind of middleware. Because the data are linked among the individual marts, there is a better chance of maintaining data consistency across the enterprise (at least at the metadata level). Even though it allows for complex data queries across data marts, the performance of these types of analysis may not be at a satisfactory level.
c. Hub-and-spoke architecture. This is perhaps the most famous data warehousing architecture today. Here the attention is focused on building a scalable and maintainable infrastructure (often developed in an iterative way, subject area by subject area) that includes a centralized data warehouse and several dependent data marts (each for an organizational unit). This architecture allows for easy customization of user interfaces and reports. On the negative side, this architecture lacks the holistic enterprise view and may lead to data redundancy and data latency.
d. Centralized data warehouse. The centralized data warehouse architecture is similar to the hub-and-spoke architecture except that there are no dependent data marts; instead, there is a gigantic enterprise data warehouse that serves the needs
FIGURE 3.7 Alternative Data Warehouse Architectures: (a) Independent Data Marts Architecture; (b) Data Mart Bus Architecture with Linked Dimensional Data Marts; (c) Hub-and-Spoke Architecture (Corporate Information Factory); (d) Centralized Data Warehouse Architecture; (e) Federated Architecture. [In (a) through (d), source systems feed an ETL process and staging area that load, respectively: independent data marts (atomic/summarized data); dimensionalized data marts linked by conformed dimensions (atomic/summarized data); a normalized relational warehouse (atomic data) with dependent data marts (summarized/some atomic data); or a normalized relational warehouse alone (atomic/some summarized data), each serving end-user access and applications. In (e), existing data warehouses, data marts, and legacy systems are joined through data mapping/metadata and logical/physical integration of common data elements.] Source: Adapted from T. Ariyachandra and H. Watson, "Which Data Warehouse Architecture Is Most Successful?" Business Intelligence Journal, Vol. 11, No. 1, First Quarter, 2006, pp. 4–6.
of all organizational units. This centralized approach provides users with access to all data in the data warehouse instead of limiting them to data marts. In addition, it reduces the amount of data the technical team has to transfer or change, therefore simplifying data management and administration. If designed and implemented properly, this architecture provides a timely and holistic view of the enterprise to
whomever, whenever, and wherever they may be within the organization. The central data warehouse architecture, which is advocated mainly by Teradata Corp., advises using data warehouses without any data marts (see Figure 3.8).
FIGURE 3.8 Teradata Corporation's Enterprise Data Warehouse. [Figure: transactional users feed transactional data through data transformation into an operational data store (ODS) and an "enterprise" data warehouse; data replication populates data marts serving decision users, strategic users, tactical users, reporting/OLAP users, data miners, and event-driven/closed-loop applications.] Source: Teradata Corporation (teradata.com). Used with permission.
e. Federated data warehouse. The federated approach is a concession to the natural forces that undermine the best plans for developing a perfect system. It uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions. Essentially, the federated approach involves integrating disparate systems. In a federated architecture, existing decision support structures are left in place, and data are accessed from those sources as needed. The federated approach is supported by middleware vendors that propose distributed query and join capabilities. These eXtensible Markup Language (XML)-based tools offer users a global view of distributed data sources, including data warehouses, data marts, Web sites, documents, and operational systems. When users choose query objects from this view and press the submit button, the tool automatically queries the distributed sources, joins the results, and presents them to the user. Because of performance and data quality issues, most experts agree that federated approaches work well to supplement data warehouses, not replace them (see Eckerson, 2005).
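A toy, hypothetical sketch of the federated idea in item (e) above (source names and fields are ours, not any specific product's API): the two sources stay where they live, and a middleware-style function queries each and joins the partial results on the fly.

```python
# Hypothetical federated query: sources are left in place; a middleware-style
# function queries both and merges the results into one user-facing view.
warehouse_sales = [
    {"customer_id": 1, "total_sales": 1200.0},
    {"customer_id": 2, "total_sales": 850.0},
]
crm_system = [
    {"customer_id": 1, "segment": "VIP"},
    {"customer_id": 2, "segment": "Standard"},
]

def federated_query(customer_id):
    """Query each source where it lives, then join the partial results."""
    sales = next((r for r in warehouse_sales
                  if r["customer_id"] == customer_id), {})
    profile = next((r for r in crm_system
                    if r["customer_id"] == customer_id), {})
    return {**profile, **sales}   # merged view presented to the user

print(federated_query(1))  # {'customer_id': 1, 'segment': 'VIP', 'total_sales': 1200.0}
```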
Ariyachandra and Watson (2005) identified 10 factors that potentially affect the architecture selection decision:
1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
These factors are similar to many success factors described in the literature for information systems projects and DSS and BI projects. Technical issues, beyond providing technology that is feasibly ready for use, are important, but often not as important as behavioral issues, such as meeting upper management's information needs and user involvement in the development process (a social/political factor). Each data warehousing architecture has specific applications for which it is most (and least) effective and thus provides maximal benefits to the organization. However, overall, the data mart structure seems to be the least effective in practice. See Ariyachandra and Watson (2006a) for some additional details.
Which Architecture Is the Best?
Ever since data warehousing became a critical part of modern enterprises, the question of which data warehouse architecture is the best has been a topic of regular discussion. The two gurus of the data warehousing field, Bill Inmon and Ralph Kimball, are at the heart of this discussion. Inmon advocates the hub-and-spoke architecture (e.g., the Corporate Information Factory), whereas Kimball promotes the data mart bus architecture with conformed dimensions. Other architectures are possible, but these two options are fundamentally different approaches, and each has strong advocates. To shed light on this controversial question, Ariyachandra and Watson (2006b) conducted an empirical study. To collect the data, they used a Web-based survey targeted at individuals involved in data warehouse implementations. Their survey included questions about the respondent, the respondent's company, the company's data warehouse, and the success of the data warehouse architecture.
In total, 454 respondents provided usable information. Surveyed companies ranged from small (less than $10 million in revenue) to large (in excess of $10 billion). Most of the companies were located in the United States (60%) and represented a variety of industries, with the financial services industry (15%) providing the most responses. The predominant architecture was the hub-and-spoke architecture (39%), followed by the bus architecture (26%), the centralized architecture (17%), independent data marts (12%), and the federated architecture (4%). The most common platform for hosting the data warehouses was Oracle (41%), followed by Microsoft (19%) and IBM (18%). The average (mean) gross revenue varied from $3.7 billion for independent data marts to $6 billion for the federated architecture.
They used four measures to assess the success of the architectures: (1) information quality, (2) system quality, (3) individual impacts, and (4) organizational impacts. The questions used a seven-point scale, with the higher score indicating a more successful architecture. Table 3.1 shows the average scores for the measures across the architectures.
As the results of the study indicate, independent data marts scored the lowest on all measures. This finding confirms the conventional wisdom that independent data marts are a poor architectural solution. Next lowest on all measures was the federated architecture. Firms sometimes have disparate decision support platforms resulting from mergers and acquisitions, and they may choose a federated approach, at least in the short run. The findings suggest that the federated architecture is not an optimal long-term solution.
What is interesting, however, is the similarity of the averages for the bus, hub-and-spoke, and centralized architectures. The differences are sufficiently small that no claims can be made for a particular architecture's superiority over the others, at least based on a simple comparison of these success measures.

TABLE 3.1 Average Assessment Scores for the Success of the Architectures

Measure                   Independent   Bus            Hub-and-Spoke   Centralized Architecture     Federated
                          Data Marts    Architecture   Architecture    (No Dependent Data Marts)    Architecture
Information Quality       4.42          5.16           5.35            5.23                         4.73
System Quality            4.59          5.60           5.56            5.41                         4.69
Individual Impacts        5.08          5.80           5.62            5.64                         5.15
Organizational Impacts    4.66          5.34           5.24            5.30                         4.77
They also collected data on the domain (e.g., varying from a subunit to companywide) and the size (i.e., amount of data stored) of the warehouses. They found that the hub-and-spoke architecture is typically used with more enterprise-wide implementations and larger warehouses. They also investigated the cost and time required to implement the different architectures. Overall, the hub-and-spoke architecture was the most expensive and time-consuming to implement.
SECTION 3.4 REVIEW QUESTIONS
1. What are the key similarities and differences between a two-tiered architecture and a three-tiered architecture?
2. How has the Web influenced data warehouse design?
3. List the alternative data warehousing architectures discussed in this section.
4. What issues should be considered when deciding which architecture to use in devel-
oping a data warehouse? List the 10 most important factors.
5. Which data warehousing architecture is the best? Why?
3.5 DATA INTEGRATION AND THE EXTRACTION, TRANSFORMATION, AND LOAD (ETL) PROCESSES
Global competitive pressures, demand for return on investment (ROI), management and investor inquiry, and government regulations are forcing business managers to rethink how they integrate and manage their businesses. A decision maker typically needs access to multiple sources of data that must be integrated. Before data warehouses, data marts, and BI software, providing access to data sources was a major, laborious process. Even with modern Web-based data management tools, recognizing what data to access and providing them to the decision maker is a nontrivial task that requires database specialists. As data warehouses grow in size, the issues of integrating data grow as well.
The business analysis needs continue to evolve. Mergers and acquisitions, regulatory requirements, and the introduction of new channels can drive changes in BI requirements. In addition to historical, cleansed, consolidated, and point-in-time data, business users increasingly demand access to real-time, unstructured, and/or remote data. And everything must be integrated with the contents of an existing data warehouse. Moreover, access via PDAs and through speech recognition and synthesis is becoming more commonplace, further complicating integration issues (Edwards, 2003). Many integration projects involve enterprise-wide systems. Orovic (2003) provided a checklist of what works and what does not work when attempting such a project. Properly integrating data from
various databases and other disparate sources is difficult. But when it is not done prop-
erly, it can lead to disaster in enterprise-wide systems such as CRM, ERP, and supply chain
projects (Nash, 2002).
Data Integration
Data integration comprises three major processes that, when correctly implemented, permit data to be accessed and made accessible to an array of ETL and analysis tools and the data warehousing environment: data access (i.e., the ability to access and extract data from any data source), data federation (i.e., the integration of business views across multiple data stores), and change capture (based on the identification, capture, and delivery of the changes made to enterprise data sources). See Application Case 3.3 for an example of how BP Lubricants benefits from implementing a data warehouse that integrates data
Application Case 3.3
BP Lubricants Achieves BIGS Success
BP Lubricants established the BIGS program following recent merger activity to deliver globally consistent and transparent management information. As well as timely business intelligence, BIGS provides detailed, consistent views of performance across functions such as finance, marketing, sales, and supply and logistics.
BP is one of the world's largest oil and petrochemicals groups. Part of the BP plc group, BP Lubricants is an established leader in the global automotive lubricants market. Perhaps best known for its Castrol brand of oils, the business operates in over 100 countries and employs 10,000 people. Strategically, BP Lubricants is concentrating on further improving its customer focus and increasing its effectiveness in automotive markets. Following recent merger activity, the company is undergoing transformation to become more effective and agile and to seize opportunities for rapid growth.
Challenge
Following recent merger activity, BP Lubricants
wanted to improve the consistency, transparency, and
accessibility of management information and business
intelligence. In order to do so, it needed to integrate
data held in disparate source systems, without the
delay of introducing a standardized ERP system.
Solution
BP Lubricants implemented the pilot for its Business
Intelligence and Global Standards (BIGS) program, a
strategic initiative for management information and
business intelligence. At the heart of BIGS is Kalido,
an adaptive enterprise data warehousing solution for
preparing, implementing, operating, and managing
data warehouses.
Kalido's federated enterprise data warehous-
ing solution supported the pilot program's com-
plex data integration and diverse reporting require-
ments. To adapt to the program's evolving reporting
requirements, the software also enabled the under-
lying information architecture to be easily modi-
fied at high speed while preserving all information.
The system integrates and stores information from
multiple source systems to provide consolidated
views for:
• Marketing. Customer proceeds and mar-
gins for market segments with drill down to
invoice-level detail
• Sales. Sales invoice reporting augmented
with both detailed tariff costs and actual
payments
• Finance. Globally standard profit and loss,
balance sheet, and cash flow statements, with
auditability; customer debt management
• Supply and logistics. Consolidated view of
order and movement processing across
multiple ERP platforms
Benefits
By improving the visibility of consistent, timely
data, BIGS provides the information needed to
assist the business in identifying a multitude of
business opportunities to maximize margins and/or
manage associated costs. Typical responses to the
benefits of consistent data resulting from the BIGS
pilot include:
• Improved consistency and transparency of
business data
• Easier, faster, and more flexible reporting
• Accommodation of both global and local
standards
• Fast, cost-effective, and flexible implementa-
tion cycle
• Minimal disruption of existing business pro-
cesses and the day-to-day business
• Identification of data quality issues and
encouragement of their resolution
• Improved ability to respond intelligently to
new business opportunities
QUESTIONS FOR DISCUSSION
1. What is BIGS at BP Lubricants?
2. What were the challenges, the proposed solu-
tion, and the obtained results with BIGS?
Sources: Kalido, "BP Lubricants Achieves BIGS, Key IT Solutions,"
http://www.kalido.com/customer-stories/bp-plc.htm
(accessed August 2013); Kalido, "BP Lubricants Achieves
BIGS Success," kalido.com/collateral/Documents/English-US/
CS-BP%20BIGS (accessed August 2013); and BP Lubricants
home page, bp.com/lubricanthome.do (accessed August 2013).
from many sources. Some vendors, such as SAS Institute, Inc., have developed strong
data integration tools. The SAS enterprise data integration server includes customer data
integration tools that improve data quality in the integration process. The Oracle Business
Intelligence Suite assists in integrating data as well.
A major purpose of a data warehouse is to integrate data from multiple systems.
Various integration technologies enable data and metadata integration:
• Enterprise application integration (EAI)
• Service-oriented architecture (SOA)
• Enterprise information integration (EII)
• Extraction, transformation, and load (ETL)
Enterprise application integration (EAI) provides a vehicle for pushing data
from source systems into the data warehouse. It involves integrating application function-
ality and is focused on sharing functionality (rather than data) across systems, thereby
enabling flexibility and reuse. Traditionally, EAI solutions have focused on enabling
application reuse at the application programming interface (API) level. More recently,
EAI has been accomplished by using SOA coarse-grained services (a collection of business
processes or functions) that are well defined and documented. Using Web services is a
specialized way of implementing an SOA. EAI can be used to facilitate data acquisition
directly into a near-real-time data warehouse or to deliver decisions to the OLTP systems.
There are many different approaches to and tools for EAI implementation.
Enterprise information integration (EII) is an evolving tool space that promises
real-time data integration from a variety of sources, such as relational databases, Web
services, and multidimensional databases. It is a mechanism for pulling data from source
systems to satisfy a request for information. EII tools use predefined metadata to populate
views that make integrated data appear relational to end users. XML may be the most
important aspect of EII because XML allows data to be tagged either at creation time
or later. These tags can be extended and modified to accommodate almost any area of
knowledge (see Kay, 2005).
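To make the federation idea concrete, the following minimal sketch simulates an EII-style
virtual view in Python using the built-in sqlite3 module. The source names (crm, erp) and
the unified_customers view are hypothetical, invented purely for illustration; real EII tools
work across heterogeneous sources and are driven by predefined metadata rather than a
hand-written view.

import sqlite3

# Hypothetical sketch of EII-style federation: one virtual view over
# two separate data stores, resolved at query time (no warehouse copy).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")  # stand-in for source system 1
conn.execute("ATTACH DATABASE ':memory:' AS erp")  # stand-in for source system 2

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE erp.clients (customer_id INTEGER, full_name TEXT)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Acme Ltd.')")
conn.execute("INSERT INTO erp.clients VALUES (2, 'Globex Corp.')")

# Predefined metadata (here simply a view definition) makes the
# integrated data appear as a single relational table to end users.
conn.execute("""
    CREATE TEMP VIEW unified_customers AS
    SELECT id, name, 'crm' AS source FROM crm.customers
    UNION ALL
    SELECT customer_id, full_name, 'erp' FROM erp.clients
""")

# Data is pulled from the sources when the request for information
# arrives; nothing is physically integrated ahead of time.
for row in conn.execute("SELECT * FROM unified_customers"):
    print(row)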
Physical data integration has conventionally been the main mechanism for creating
an integrated view with data warehouses and data marts. With the advent of EII tools (see
Kay, 2005), new virtual data integration patterns are feasible. Manglik and Mehra (2005)
discussed the benefits and constraints of new data integration patterns that can expand
traditional physical methodologies to present a comprehensive view for the enterprise.
We next turn to the approach for loading data into the warehouse: ETL.
Extraction, Transformation, and Load
At the heart of the technical side of the data warehousing process is extraction, trans-
formation, and load (ETL). ETL technologies, which have existed for some time, are
instrumental in the process and use of data warehouses. The ETL process is an integral
component in any data-centric project. IT managers are often faced with challenges
because the ETL process typically consumes 70 percent of the time in a data-centric project.
The ETL process consists of extraction (i.e., reading data from one or more data-
bases), transformation (i.e., converting the extracted data from its previous form into the
form in which it needs to be so that it can be placed into a data warehouse or simply
another database), and load (i.e., putting the data into the data warehouse). Transformation
occurs by using rules or lookup tables or by combining the data with other data. The three
database functions are integrated into one tool to pull data out of one or more databases
and place them into another, consolidated database or a data warehouse.
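As a concrete, miniature illustration of those three steps, the sketch below extracts rows
from a source table, applies a lookup-table transformation, and loads the result into a
warehouse table. It is only a sketch: the table names (src_orders, dw_sales) and the
country lookup are hypothetical, and Python's built-in sqlite3 module stands in for both
the source system and the warehouse.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INT, country_code TEXT, amount REAL);
    INSERT INTO src_orders VALUES (1, 'US', 120.0), (2, 'DE', 80.5);
    CREATE TABLE dw_sales (order_id INT, country TEXT, amount REAL);
""")

# Transformation rule implemented as a simple lookup table.
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}

# Extract: read the data from the source database.
rows = conn.execute("SELECT order_id, country_code, amount FROM src_orders")

# Transform: convert each row into the form the warehouse expects.
transformed = [(oid, COUNTRY_LOOKUP.get(code, "Unknown"), amount)
               for oid, code, amount in rows]

# Load: put the cleansed, converted data into the warehouse table.
conn.executemany("INSERT INTO dw_sales VALUES (?, ?, ?)", transformed)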
ETL tools also transport data between sources and targets, document how data
elements (e.g., metadata) change as they move between source and target, exchange
metadata with other applications as needed, and administer all runtime processes and
operations (e.g., scheduling, error management, audit logs, statistics). ETL is extremely
important for data integration as well as for data warehousing. The purpose of the ETL
process is to load the warehouse with integrated and cleansed data. The data used in ETL
processes can come from any source: a mainframe application, an ERP application, a CRM
tool, a flat file, an Excel spreadsheet, or even a message queue. In Figure 3.9, we outline
the ETL process.
The process of migrating data to a data warehouse involves the extraction of data
from all relevant sources. Data sources may consist of files extracted from OLTP databases,
spreadsheets, personal databases (e.g., Microsoft Access), or external files. Typically, all
the input files are written to a set of staging tables, which are designed to facilitate the
load process. A data warehouse contains numerous business rules that define such things
as how the data will be used, summarization rules, standardization of encoded attributes,
and calculation rules. Any data quality issues pertaining to the source files need to be
corrected before the data are loaded into the data warehouse. One of the benefits of a
FIGURE 3.9 The ETL Process. (Data from packaged applications, legacy systems, other internal applications, and transient data sources flows through extract, transform, cleanse, and load steps into the data warehouse and data marts.)
well-designed data warehouse is that these rules can be stored in a metadata repository
and applied to the data warehouse centrally. This differs from an OLTP approach, which
typically has data and business rules scattered throughout the system. The process of
loading data into a data warehouse can be performed either through data transformation
tools that provide a GUI to aid in the development and maintenance of business rules
or through more traditional methods, such as developing programs or utilities to load
the data warehouse, using programming languages such as PL/SQL, C++, Java, or .NET
Framework languages. This decision is not easy for organizations. Several issues affect
whether an organization will purchase data transformation tools or build the transforma-
tion process itself:
• Data transformation tools are expensive.
• Data transformation tools may have a long learning curve.
• It is difficult to measure how the IT organization is doing until it has learned to use
the data transformation tools.
In the long run, a transformation-tool approach should simplify the maintenance of
an organization's data warehouse. Transformation tools can also be effective in detecting
and scrubbing (i.e., removing) anomalies in the data. OLAP and data mining tools rely
on how well the data are transformed.
As an example of effective ETL, Motorola, Inc., uses ETL to feed its data warehouses.
Motorola collects information from 30 different procurement systems and sends it to
its global SCM data warehouse for analysis of aggregate company spending (see Songini,
2004).
Solomon (2005) classified ETL technologies into four categories: sophisticated, ena-
bler, simple, and rudimentary. It is generally acknowledged that tools in the sophisticated
category will result in the ETL process being better documented and more accurately
managed as the data warehouse project evolves.
Even though it is possible for programmers to develop software for ETL, it is simpler
to use an existing ETL tool. The following are some of the important criteria in selecting
an ETL tool (see Brown, 2004):
• Ability to read from and write to an unlimited number of data source architectures
• Automatic capturing and delivery of metadata
• A history of conforming to open standards
• An easy-to-use interface for the developer and the functional user
Performing extensive ETL may be a sign of poorly managed data and a
fundamental lack of a coherent data management strategy. Karacsony (2006) indi-
cated that there is a direct correlation between the extent of redundant data and the
number of ETL processes. When data are managed correctly as an enterprise asset,
ETL efforts are significantly reduced, and redundant data are completely eliminated.
This leads to huge savings in maintenance and greater efficiency in new develop-
ment while also improving data quality. Poorly designed ETL processes are costly to
maintain, change, and update. Consequently, it is crucial to make the proper choices
in terms of the technology and tools to use for developing and maintaining the ETL
process.
A number of packaged ETL tools are available. Database vendors currently offer ETL
capabilities that both enhance and compete with independent ETL tools. SAS acknowl-
edges the importance of data quality and offers a fully integrated solution that merges
ETL and data quality to transform data into strategically valuable assets.
Other ETL software providers include Microsoft, Oracle, IBM, Informatica, Embarcadero,
and Tibco. For additional information on ETL, see Golfarelli and Rizzi (2009), Karacsony
(2006), and Songini (2004).
SECTION 3.5 REVIEW QUESTIONS
1. Describe data integration.
2. Describe the three steps of the ETL process.
3. Why is the ETL process so important for data warehousing efforts?
3.6 DATA WAREHOUSE DEVELOPMENT
A data warehousing project is a major undertaking for any organization. It is more
complicated than a simple mainframe selection and implementation project because it
comprises and influences many departments and many input and output interfaces, and
it can be part of a CRM business strategy. A data warehouse provides several benefits that
can be classified as direct and indirect. Direct benefits include the following:
• End users can perform extensive analysis in numerous ways.
• A consolidated view of corporate data (i.e., a single version of the truth) is possible.
• Better and more timely information is possible. A data warehouse permits informa-
tion processing to be off-loaded from costly operational systems onto low-cost serv-
ers; therefore, many more end-user information requests can be processed more
quickly.
• Enhanced system performance can result. A data warehouse frees production
processing because some operational system reporting requirements are moved
to DSS.
• Data access is simplified.
Indirect benefits result from end users using these direct benefits. On the whole,
these benefits enhance business knowledge, present competitive advantage, improve cus-
tomer service and satisfaction, facilitate decision making, and help in reforming business
processes; therefore, they are the strongest contributions to competitive advantage. (For
a discussion of how to create a competitive advantage through data warehousing, see
Parzinger and Fralick, 2001.) For a detailed discussion of how organizations can obtain
exceptional levels of payoffs, see Watson et al. (2002). Given the potential benefits that a
data warehouse can provide and the substantial investments in time and money that such
a project requires, it is critical that an organization structure its data warehouse project
to maximize the chances of success. In addition, the organization must, obviously, take
costs into consideration. Kelly (2001) described an ROI approach that considers benefits
in the categories of keepers (i.e., money saved by improving traditional decision support
functions); gatherers (i.e., money saved due to automated collection and dissemination
of information); and users (i.e., money saved or gained from decisions made using the
data warehouse). Costs include those related to hardware, software, network bandwidth,
internal development, internal support, training, and external consulting. The net pre-
sent value (NPV) is calculated over the expected life of the data warehouse. Because
the benefits are broken down approximately as 20 percent for keepers, 30 percent for
gatherers, and 50 percent for users, Kelly indicated that users should be involved in the
development process, a success factor typically mentioned as critical for systems that
imply change in an organization.
Application Case 3.4 provides an example of a data warehouse that was developed
and delivered intense competitive advantage for the Hokuriku (Japan) Coca-Cola Bottling
Company. The system was so successful that plans are underway to expand it to encom-
pass the more than 1 million Coca-Cola vending machines in Japan.
Clearly defining the business objective, gathering project support from management
and end users, setting reasonable time frames and budgets, and managing expectations
are critical to a successful data warehousing project. A data warehousing strategy is a
Application Case 3.4
Things Go Better with Coke's Data Warehouse
In the face of competitive pressures and consumer
demand, how does a successful bottling company
ensure that its vending machines are profitable? The
answer for Hokuriku Coca-Cola Bottling Company
(HCCBC) is a data warehouse and analytical soft-
ware implemented by Teradata Corp. HCCBC built
the system in response to a data warehousing system
developed by its rival, Mikuni. The data warehouse
collects not only historical data but also near-real-
time data from each vending machine (viewed as
a store) that could be transmitted via wireless con-
nection to headquarters. The initial phase of the
project was deployed in 2001. The data warehouse
approach provides detailed product information,
such as time and date of each sale, when a prod-
uct sells out, whether someone was short-changed,
and whether the machine is malfunctioning. In each
case, an alert is triggered, and the vending machine
immediately reports it to the data center over a wire-
less transmission system. (Note that Coca-Cola in
the United States has used modems to link vending
machines to distributors for over a decade.)
In 2002, HCCBC conducted a pilot test and put
all its Nagano vending machines on a wireless net-
work to gather near-real-time point of sale (POS)
data from each one. The results were astounding
because they accurately forecasted demand and
identified problems quickly. Total sales immediately
increased 10 percent. In addition, due to the more
accurate machine servicing, overtime and other costs
decreased 46 percent. Moreover, each salesperson
was able to service up to 42 percent more vending
machines.
The test was so successful that planning began
to expand it to encompass the entire enterprise
(60,000 machines), using an active data warehouse.
Eventually, the data warehousing solution will ide-
ally expand across corporate boundaries into the
entire Coca-Cola Bottlers network so that the more
than 1 million vending machines in Japan will be
networked, leading to immense cost savings and
higher revenue.
QUESTIONS FOR DISCUSSION
1. How did Coca-Cola in Japan use data warehous-
ing to improve its business processes?
2. What were the results of their enterprise active
data warehouse implementation?
Sources: Adapted from K. D. Schwartz, "Decisions at the Touch of
a Button," Teradata Magazine, teradata.com/t/page/117774/
index.html (accessed June 2009); K. D. Schwartz, "Decisions at
the Touch of a Button," DSS Resources, March 2004, pp. 28-31,
dssresources.com/cases/coca-colajapan/index.html
(accessed April 2006); and Teradata Corp., "Coca-Cola Japan
Puts the Fizz Back in Vending Machine Sales," teradata.com/t/
page/118866/index.html (accessed June 2009).
blueprint for the successful introduction of the data warehouse. The strategy should
describe where the company wants to go, why it wants to go there, and what it will do
when it gets there. It needs to take into consideration the organization's vision, structure,
and culture. See Matney (2003) for the steps that can help in developing a flexible and
efficient support strategy. When the plan and support for a data warehouse are estab-
lished, the organization needs to examine data warehouse vendors. (See Table 3.2 for
a sample list of vendors; also see The Data Warehousing Institute [tdwi.org] and DM
Review [information-management.com].) Many vendors provide software demos of
their data warehousing and BI products.
Data Warehouse Development Approaches
Many organizations need to create the data warehouses used for decision support. Two
competing approaches are employed. The first approach is that of Bill Inmon, who is
often called "the father of data warehousing." Inmon supports a top-down development
approach that adapts traditional relational database tools to the development needs of an
TABLE 3.2 Sample List of Data Warehousing Vendors

Vendor: Product Offerings
Business Objects (businessobjects.com): A comprehensive set of business intelligence and data visualization software (now owned by SAP)
Computer Associates (cai.com): Comprehensive set of data warehouse (DW) tools and products
DataMirror (datamirror.com): DW administration, management, and performance products
Data Advantage Group (dataadvantagegroup.com): Metadata software
Dell (dell.com): DW servers
Embarcadero Technologies (embarcadero.com): DW administration, management, and performance products
Greenplum (greenplum.com): Data warehousing and data appliance solution provider (now owned by EMC)
Harte-Hanks (harte-hanks.com): Customer relationship management (CRM) products and services
HP (hp.com): DW servers
Hummingbird Ltd. (hummingbird.com, now a subsidiary of Open Text): DW engines and exploration warehouses
Hyperion Solutions (hyperion.com, now an Oracle company): Comprehensive set of DW tools, products, and applications
IBM InfoSphere (www-01.ibm.com/software/data/infosphere/): Data integration, DW, master data management, and big data products
Informatica (informatica.com): DW administration, management, and performance products
Microsoft (microsoft.com): DW tools and products
Netezza: DW software and hardware (DW appliance) provider (now owned by IBM)
Oracle (including PeopleSoft and Siebel) (oracle.com): DW, ERP, and CRM tools, products, and applications
SAS Institute (sas.com): DW tools, products, and applications
Siemens (siemens.com): DW servers
Sybase (sybase.com): Comprehensive set of DW tools and applications
Teradata (teradata.com): DW tools, DW appliances, DW consultancy, and applications
enterprise-wide data warehouse, also known as the EDW approach. The second approach
is that of Ralph Kimball, who proposes a bottom-up approach that employs dimensional
modeling, also known as the data mart approach.
Knowing how these two models are alike and how they differ helps us understand
the basic data warehouse concepts (e.g., see Breslin, 2004). Table 3.3 compares the two
approaches. We describe these approaches in detail next.
THE INMON MODEL: THE EDW APPROACH Inmon's approach emphasizes top-down
development, employing established database development methodologies and tools,
such as entity-relationship diagrams (ERD), and an adjustment of the spiral development
approach. The EDW approach does not preclude the creation of data marts. The EDW is
the ideal in this approach because it provides a consistent and comprehensive view of the
enterprise. Murtaza (1998) presented a framework for developing EDWs.
THE KIMBALL MODEL: THE DATA MART APPROACH Kimball's data mart strategy is a "plan
big, build small" approach. A data mart is a subject-oriented or department-oriented data
warehouse. It is a scaled-down version of a data warehouse that focuses on the requests
TABLE 3.3 Contrasts Between the Data Mart and EDW Development Approaches

Effort
Scope: Data mart: One subject area. EDW: Several subject areas.
Development time: Data mart: Months. EDW: Years.
Development cost: Data mart: $10,000 to $100,000+. EDW: $1,000,000+.
Development difficulty: Data mart: Low to medium. EDW: High.
Data prerequisite for sharing: Data mart: Common (within business area). EDW: Common (across enterprise).
Sources: Data mart: Only some operational and external systems. EDW: Many operational and external systems.
Size: Data mart: Megabytes to several gigabytes. EDW: Gigabytes to petabytes.
Time horizon: Data mart: Near-current and historical data. EDW: Historical data.
Data transformations: Data mart: Low to medium. EDW: High.
Update frequency: Data mart: Hourly, daily, weekly. EDW: Weekly, monthly.

Technology
Hardware: Data mart: Workstations and departmental servers. EDW: Enterprise servers and mainframe computers.
Operating system: Data mart: Windows and Linux. EDW: Unix, z/OS, OS/390.
Databases: Data mart: Workgroup or standard database servers. EDW: Enterprise database servers.

Usage
Number of simultaneous users: Data mart: 10s. EDW: 100s to 1,000s.
User types: Data mart: Business area analysts and managers. EDW: Enterprise analysts and senior executives.
Business spotlight: Data mart: Optimizing activities within the business area. EDW: Cross-functional optimization and decision making.

Sources: Adapted from J. Van den Hoven, "Data Marts: Plan Big, Build Small," in IS Management Handbook, 8th ed., CRC Press, Boca Raton, FL, 2003; and T. Ariyachandra and H. Watson, "Which Data Warehouse Architecture Is Most Successful?" Business Intelligence Journal, Vol. 11, No. 1, First Quarter 2006, pp. 4-6.
of a specific department, such as marketing or sales. This model applies dimensional data
modeling, which starts with tables. Kimball advocated a development methodology that
entails a bottom-up approach, which in the case of data warehouses means building one
data mart at a time.
WHICH MODEL IS BEST? There is no one-size-fits-all strategy to data warehousing. An
enterprise's data warehousing strategy can evolve from a simple data mart to a complex
data warehouse in response to user demands, the enterprise's business requirements, and
the enterprise's maturity in managing its data resources. For many enterprises, a data mart
is frequently a convenient first step to acquiring experience in constructing and manag-
ing a data warehouse while presenting business users with the benefits of better access
to their data; in addition, a data mart commonly indicates the business value of data
warehousing. Ultimately, engineering an EDW that consolidates old data marts and data
warehouses is the ideal solution (see Application Case 3.5). However, the development
of individual data marts can often provide many benefits along the way toward develop-
ing an EDW, especially if the organization is unable or unwilling to invest in a large-scale
project. Data marts can also demonstrate feasibility and success in providing benefits.
This could potentially lead to an investment in an EDW. Table 3.4 summarizes the most
essential characteristic differences between the two models.
Application Case 3.5
Starwood Hotels & Resorts Manages Hotel Profitability with Data Warehousing
Starwood Hotels & Resorts Worldwide, Inc., is one of
the leading hotel and leisure companies in the world,
with 1,112 properties in nearly 100 countries and
154,000 employees at its owned and managed prop-
erties. Starwood is a fully integrated owner, operator,
and franchisor of hotels, resorts, and residences with
the following internationally renowned brands: St.
Regis®, The Luxury Collection®, W®, Westin®, Le
Meridien®, Sheraton®, Four Points® by Sheraton,
Aloft®, and Element(SM). The company boasts one
of the industry's leading loyalty programs, Starwood
Preferred Guest (SPG), allowing members to earn
and redeem points for room stays, room upgrades,
and flights, with no blackout dates. Starwood also
owns Starwood Vacation Ownership Inc., a pre-
mier provider of world-class vacation experiences
through villa-style resorts and privileged access to
Starwood brands.
Challenge
Starwood Hotels has significantly increased the num-
ber of hotels it operates over the past few years
through global corporate expansion, particularly in
the Asia/Pacific region. This has resulted in a dra-
matic rise in the need for business-critical informa-
tion about Starwood's hotels and customers. All
Starwood hotels globally use a single enterprise data
warehouse to retrieve information critical to efficient
hotel management, such as that regarding revenue,
central reservations, and rate plan reports. In addi-
tion, Starwood Hotels' management runs important
daily operating reports from the data warehouse for
a wide range of business functions. Starwood's enter-
prise data warehouse spans almost all areas within
the company, so it is essential not only for central-
reservation and consumption information, but also to
Starwood's loyalty program, which relies on all guest
information, sales information, corporate sales infor-
mation, customer service, and other data that man-
agers, analysts, and executives depend on to make
operational decisions.
The company is committed to knowing and ser-
vicing its guests, yet, "as data growth and demands
grew too great for the company's legacy system, it
was falling short in delivering the information hotel
managers and administrators required on a daily
basis, since central reservation system (CRS) reports
could take as long as 18 hours," said Richard Chung,
Starwood Hotels' director of data integration. Chung
added that hotel managers would receive the tran-
sient pace report, which presents market-segmented
information on reservations, 5 hours later than it
was needed. Such delays prevented managers from
adjusting rates appropriately, which could result in
lost revenue.
Solution and Results
After reviewing several vendor offerings, Starwood
Hotels selected Oracle Exadata Database Machine
X2-2 HC Full Rack and Oracle Exadata Database
Machine X2-2 HP Full Rack, running on Oracle Linux.
"With the implementation of Exadata, Starwood
Hotels can complete extract, transform, and load
(ETL) operations for operational reports in 4 to 6
hours, as opposed to 18 to 24 hours previously, a
six-fold improvement," Chung said. Real-time feeds,
which were not possible before, now allow transac-
tions to be posted immediately to the data ware-
house, and users can access the changes in 5 to 10
minutes instead of 24 hours, making the process up
to 288 times faster.
Accelerated data access allows all Starwood
properties to get the same, up-to-date data needed
for their reports, globally. Previously, hotel managers
in some areas could not do same-day or next-day
analyses. There were some locations that got fresh
data and others that got older data. Hotel managers,
worldwide, now have up-to-date data for their hotels,
increasing efficiency and profitability, improving cus-
tomer service by making sure rooms are available
for premier customers, and improving the company's
ability to manage room occupancy rates. Additional
reporting tools, such as those used for CRM and sales
and catering, also benefited from the improved pro-
cessing. Other critical reporting has benefited as well.
Marketing campaign management is also more effi-
cient now that managers can analyze results in days
or weeks instead of months.
"Oracle Exadata Database Machine enables
us to move forward with an environment that pro-
vides our hotel managers and corporate executives
with nea r-real-time information to make optimal
Ch apte r 3 • Data Ware housing 107
business decisio ns and p rovide ideal ame nities for
o ur guests. " -Gordon Lig ht, Business Re latio n ship
Man ager, Starwoo d Ho te ls & Resorts Wo rldw ide,
Inc .
2. How d id Starwood Hotels & Resorts u se data
wareho using fo r be tter profitability?
3. What w ere the ch allenges, the proposed solu-
tion , and the obtained results?
QUESTIONS FOR DISCUSSION Source: O racle custo mer success story, www.oracle.com/us/
corporate/ customers/ customersearch/ starwood-hotels-1-
exadata-sl-1855106.html; Starwood Hotels and Resorts,
starwoodhotels.com (accessed July 2013).
1. How big and complex are the business o p era -
tio ns of Starwood Ho tels & Resorts?
Additional Data Warehouse Development Considerations
Some organizations want to completely outsource their data warehousing efforts. They
simply do not want to deal with software and hardware acquisitions, and they do
not want to manage their information systems. One alternative is to use hosted data
warehouses. In this scenario, another firm, ideally one that has a lot of experience
TABLE 3.4 Essential Differences Between Inmon's and Kimball's Approaches

Methodology and Architecture
Overall approach: Inmon: Top-down. Kimball: Bottom-up.
Architecture structure: Inmon: Enterprise-wide (atomic) data warehouse "feeds" departmental databases. Kimball: Data marts model a single business process, and enterprise consistency is achieved through a data bus and conformed dimensions.
Complexity of the method: Inmon: Quite complex. Kimball: Fairly simple.
Comparison with established development methodologies: Inmon: Derived from the spiral methodology. Kimball: Four-step process; a departure from relational database management system (RDBMS) methods.
Discussion of physical design: Inmon: Fairly thorough. Kimball: Fairly light.

Data Modeling
Data orientation: Inmon: Subject or data driven. Kimball: Process oriented.
Tools: Inmon: Traditional (entity-relationship diagrams [ERD], data flow diagrams [DFD]). Kimball: Dimensional modeling; a departure from relational modeling.
End-user accessibility: Inmon: Low. Kimball: High.

Philosophy
Primary audience: Inmon: IT professionals. Kimball: End users.
Place in the organization: Inmon: Integral part of the corporate information factory. Kimball: Transformer and retainer of operational data.
Objective: Inmon: Deliver a sound technical solution based on proven database methods and technologies. Kimball: Deliver a solution that makes it easy for end users to directly query the data and still get reasonable response times.

Sources: Adapted from M. Breslin, "Data Warehousing Battle of the Giants: Comparing the Basics of Kimball and Inmon Models," Business Intelligence Journal, Vol. 9, No. 1, Winter 2004, pp. 6-20; and T. Ariyachandra and H. Watson, "Which Data Warehouse Architecture Is Most Successful?" Business Intelligence Journal, Vol. 11, No. 1, First Quarter 2006.
108 Pan II • Descriptive Analytics
TECHNOLOGY INSIGHTS 3.1 Hosted Data Warehouses
A hosted data warehouse has nearly the same, if not more, functionality as an on-site data ware-
house, but it does not consume computer resources on client premises. A hosted data warehouse
offers the benefits of BI minus the cost of computer upgrades, network upgrades, software
licenses, in-house development, and in-house support and maintenance.
A hosted data warehouse offers the following benefits:
• Requires minimal investment in infrastructure
• Frees up capacity on in-house systems
• Frees up cash flow
• Makes powerful solutions affordable
• Enables powerful solutions that provide for growth
• Offers better quality equipment and software
• Provides faster connections
• Enables users to access data from remote locations
• Allows a company to focus on core business
• Meets storage needs for large volumes of data
Despite its benefits, a hosted data warehouse is not necessarily a good fit for every organi-
zation. Large companies with revenue upwards of $500 million could lose money if they already
have underused internal infrastructure and IT staff. Furthermore, companies that see the para-
digm shift of outsourcing applications as a loss of control of their data are not likely to use a
business intelligence service provider (BISP). Finally, the most significant and common argument
against implementing a hosted data warehouse is that it may be unwise to outsource sensitive
applications for reasons of security and privacy.
Sources: Compiled from M. Thornton and M. Lampa, "Hosted Data Warehouse," Journal of Data Warehousing,
Vol. 7, No. 2, 2002, pp. 27-34; and M. Thornton, "What About Security? The Most Common, but Unwarranted,
Objection to Hosted Data Warehouses," DM Review, Vol. 12, No. 3, March 18, 2002, pp. 30-43.
and expertise, develops and maintains the data warehouse. However, there are
security and privacy concerns with this approach. See Technology Insights 3.1 for
some details.
Representation of Data in Data Warehouse
A typical data warehouse structure is shown in Figure 3.3. Many variations of data ware-
house architecture are possible (see Figure 3.7). No matter what the architecture is,
the design of data representation in the data warehouse has always been based on the
concept of dimensional modeling. Dimensional modeling is a retrieval-based system
that supports high-volume query access. Representation and storage of data in a data
warehouse should be designed in such a way that it not only accommodates but also
boosts the processing of complex multidimensional queries. Often, the star schema and
the snowflake schema are the means by which dimensional modeling is implemented in
data warehouses.
The star schema (sometimes referenced as the star join schema) is the most commonly
used and the simplest style of dimensional modeling. A star schema contains a central fact
table surrounded by and connected to several dimension tables (Adamson, 2009). The
fact table contains a large number of rows that correspond to observed facts and external
links (i.e., foreign keys). A fact table contains the descriptive attributes needed to perform
decision analysis and query reporting, and foreign keys are used to link to dimension
tables. The decision analysis attributes consist of performance measures, operational met-
rics, aggregated measures (e.g., sales volumes, customer retention rates, profit margins,
production costs, scrap rates, and so forth), and all the other metrics needed to analyze the
organization's performance. In other words, the fact table primarily addresses what the
data warehouse supports for decision analysis.
Surrounding the central fact tables (and linked via foreign keys) are dimension
tables. The dimension tables contain classification and aggregation information about the
central fact rows. Dimension tables contain attributes that describe the data contained
within the fact table; they address how data will be analyzed and summarized. Dimension
tables have a one-to-many relationship with rows in the central fact table. In querying,
the dimensions are used to slice and dice the numerical values in the fact table to address
the requirements of an ad hoc information need. The star schema is designed to provide
fast query-response time, simplicity, and ease of maintenance for read-only database
structures. A simple star schema is shown in Figure 3.10a. The star schema is considered
a special case of the snowflake schema.
The snowflake schema is a logical arrangement of tables in a multidimensional
database in such a way that the entity-relationship diagram resembles a snowflake in
shape. Closely related to the star schema, the snowflake schema is represented by central-
ized fact tables (usually only one) that are connected to multiple dimensions. In the snow-
flake schema, however, dimensions are normalized into multiple related tables, whereas
the star schema's dimensions are denormalized, with each dimension represented
by a single table. A simple snowflake schema is shown in Figure 3.10b.
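The difference between the two layouts is easiest to see in table definitions. The sketch
below declares a minimal star schema (one fact table, denormalized dimensions) and then
a snowflake variant in which the product dimension is normalized into related tables. All
table and column names are hypothetical, loosely following Figure 3.10, and Python's
built-in sqlite3 module is used only as a convenient stand-in for a warehouse database.

import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: a central fact table linked by foreign keys to
# denormalized dimension tables (one wide table per dimension).
conn.executescript("""
    CREATE TABLE dim_date    (date_id INT PRIMARY KEY, day TEXT, quarter TEXT);
    CREATE TABLE dim_product (product_id INT PRIMARY KEY, line_item TEXT,
                              brand TEXT, category TEXT);
    CREATE TABLE fact_sales  (date_id INT REFERENCES dim_date,
                              product_id INT REFERENCES dim_product,
                              units_sold INT, revenue REAL);
""")

# Snowflake variant: the product dimension is normalized into
# related tables instead of one wide, denormalized table.
conn.executescript("""
    CREATE TABLE dim_category   (category_id INT PRIMARY KEY, category TEXT);
    CREATE TABLE dim_brand      (brand_id INT PRIMARY KEY, brand TEXT,
                                 category_id INT REFERENCES dim_category);
    CREATE TABLE dim_product_sf (product_id INT PRIMARY KEY, line_item TEXT,
                                 brand_id INT REFERENCES dim_brand);
""")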
Analysis of Data in the Data Warehouse
Once the data are properly stored in a data warehouse, they can be used in various ways to
support organizational decision making. OLAP (online analytical processing) is arguably
the most commonly used data analysis technique in data warehouses, and it has been
growing in popularity due to the exponential increase in data volumes and the recogni-
tion of the business value of data-driven analytics. Simply put, OLAP is an approach to
quickly answering ad hoc questions by executing multidimensional analytical queries
against organizational data repositories (i.e., data warehouses, data marts).
FIGURE 3.10 (a) The Star Schema, and (b) the Snowflake Schema. (In the star schema, a central sales fact table with measures such as UnitsSold links directly to denormalized date, product, people, and geography dimensions. In the snowflake schema, the sales fact table links to normalized dimension tables, e.g., time split into month and quarter; product split into line item, brand, and category; people into division; and geography into store and location.)
OLAP Versus OLTP
OLTP (online transaction processing) is a term used for transaction systems,
which are primarily responsible for capturing and storing data related to day-to-day busi-
ness functions in systems such as ERP, CRM, SCM, point of sale, and so forth. The OLTP
system addresses a critical business need, automating daily business transactions and
running real-time reports and routine analyses. But these systems are not designed for
ad hoc analysis and complex queries that deal with a number of data items. OLAP, on the
other hand, is designed to address this need by providing ad hoc analysis of organizational
data much more effectively and efficiently. OLAP and OLTP rely heavily on each other:
OLAP uses the data captured by OLTP, and OLTP automates the business processes that
are managed by decisions supported by OLAP. Table 3.5 provides a multi-criteria
comparison between OLTP and OLAP.
OLAP Operations
The main operational structure in OLAP is based on a concept called cube. A cube in
OLAP is a multidimensional data structure (actual or virtual) that allows fast analysis of
data. It can also be defined as the capability of efficiently manipulating and analyzing data
from multiple perspectives. The arrangement of data into cubes aims to overcome a limita-
tion of relational databases: Relational databases are not well suited for near-instantaneous
analysis of large amounts of data. Instead, they are better suited for manipulating records
(adding, deleting, and updating data) that represent a series of transactions. Although
many report-writing tools exist for relational databases, these tools are slow when a multi-
dimensional query that encompasses many database tables needs to be executed.
Using OLAP, an analyst can navigate through the database and screen for a par-
ticular subset of the data (and its progression over time) by changing the data's orienta-
tions and defining analytical calculations. These types of user-initiated navigation of data
through the specification of slices (via rotations) and drill down/up (via aggregation and
disaggregation) are sometimes called "slice and dice." Commonly used OLAP operations
include slice and dice, drill down, roll up, and pivot; a short illustrative sketch follows
the definitions below.
• Slice. A slice is a subset of a multidimensional array (usually a two-dimensional
representation) corresponding to a single value set for one (or more) of the dimen-
sions not in the subset. A simple slicing operation on a three-dimensional cube is
shown in Figure 3.11.
TABLE 3.5 A Comparison Between OLTP and OLAP

Purpose: OLTP: To carry out day-to-day business functions. OLAP: To support decision making and provide answers to business and management queries.
Data source: OLTP: Transaction database (a normalized data repository primarily focused on efficiency and consistency). OLAP: Data warehouse or data mart (a nonnormalized data repository primarily focused on accuracy and completeness).
Reporting: OLTP: Routine, periodic, narrowly focused reports. OLAP: Ad hoc, multidimensional, broadly focused reports and queries.
Resource requirements: OLTP: Ordinary relational databases. OLAP: Multiprocessor, large-capacity, specialized databases.
Execution speed: OLTP: Fast (recording of business transactions and routine reports). OLAP: Slow (resource-intensive, complex, large-scale queries).
FIGURE 3.11 Slicing Operations on a Simple Three-Dimensional Data Cube. (The cube's dimensions are product, geography, and time; cells are filled with numbers representing sales volumes. Example slices: sales volumes of a specific product over variable time and region, of a specific region over variable time and products, and of a specific time over variable region and products.)
• Dice. The dice operation is a slice on more than two dimensions of a data cube.
• Drill down/up. Drilling down or up is a specific OLAP technique whereby the
user navigates among levels of data ranging from the most summarized (up) to the
most detailed (down).
• Roll-up. A roll-up involves computing all of the data relationships for one or more
dimensions. To do this, a computational relationship or formula might be defined.
• Pivot. A pivot is a means of changing the dimensional orientation of a report or
ad hoc query-page display.
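A rough feel for these operations can be had with any group-by engine. The short sketch
below is a hypothetical example built with the pandas library (the column names are
invented for the illustration); it applies a slice, a roll-up, and a pivot to a tiny
three-dimensional sales cube. Drill down/up would correspond to grouping at finer or
coarser levels of the same hierarchy.

import pandas as pd

# A tiny "cube": three dimensions (region, product, quarter) and one measure.
cube = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["Soda", "Water", "Soda", "Water"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 150, 90, 120],
})

# Slice: fix one dimension at a single value (here, quarter = Q1).
q1_slice = cube[cube["quarter"] == "Q1"]

# Roll-up: aggregate the measure upward along one or more dimensions.
sales_by_region = cube.groupby("region")["sales"].sum()

# Pivot: change the dimensional orientation of the report.
report = cube.pivot_table(index="region", columns="quarter",
                          values="sales", aggfunc="sum")

print(q1_slice, sales_by_region, report, sep="\n\n")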
VARIATIONS OF OLAP OLAP has a few variations; among them ROLAP, MOLAP, and
HOLAP are the most common ones.
ROLAP stands for Relational Online Analytical Processing. ROLAP is an alternative
to the MOLAP (Multidimensional OLAP) technology. Although both ROLAP and MOLAP
analytic tools are designed to allow analysis of data through the use of a multidimensional
data model, ROLAP differs significantly in that it does not require the precomputation
and storage of information. Instead, ROLAP tools access the data in a relational database
and generate SQL queries to calculate information at the appropriate level when an end
user requests it. With ROLAP, it is possible to create additional database tables (summary
tables or aggregations) that summarize the data at any desired combination of dimensions.
While ROLAP uses a relational database source, generally the database must be carefully
designed for ROLAP use. A database that was designed for OLTP will not function well as
a ROLAP database. Therefore, ROLAP still involves creating an additional copy of the data.
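Because ROLAP computes aggregates on demand by issuing SQL against the relational
source, one way to picture a ROLAP engine is as a query generator. The function below
is a deliberately simplified, hypothetical sketch of that idea; real ROLAP tools add join
logic across dimension tables, aggregate navigation, and security, none of which is
shown here.

def rolap_query(fact_table, measures, dimensions):
    # Generate the kind of SQL a ROLAP tool might issue to compute a
    # roll-up at the requested combination of dimensions (sketch only).
    select_cols = ", ".join(dimensions +
                            ["SUM({0}) AS {0}".format(m) for m in measures])
    return "SELECT {0} FROM {1} GROUP BY {2}".format(
        select_cols, fact_table, ", ".join(dimensions))

# Example: aggregate units sold by region and quarter, on request.
print(rolap_query("fact_sales", ["units_sold"], ["region", "quarter"]))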
MOLAP is an alternative to the ROLAP technology. MOLAP differs from ROLAP
significantly in that it requires the precomputation and storage of information in the
cube, an operation known as preprocessing. MOLAP stores this data in an optimized
multidimensional array storage, rather than in a relational database (which is often the
case for ROLAP).
The undesirable trade-off between ROLAP and MOLAP with regard to the addi-
tional ETL (extract, transform, and load) cost and slow query performance has led to
a search for better approaches that balance the pros and cons of the two. This search
resulted in HOLAP (Hybrid Online Analytical Processing), which is a combination of
ROLAP and MOLAP. HOLAP allows storing part of the data in a MOLAP store and
another part of the data in a ROLAP store. The degree of control that the cube designer
has over this partitioning varies from product to product. Technology Insights 3.2
provides an opportunity for conducting a simple hands-on analysis with the
MicroStrategy BI tool.
TECHNOLOGY INSIGHTS 3.2 Hands-On Data Warehousing
with MicroStrategy
MicroStrategy is the leading independent provider of business intelligence, data warehousing,
performance management, and business reporting solutions. The other big players in this market
were recently acquired by large IT firms: Hyperion was acquired by Oracle; Cognos was acquired
by IBM; and Business Objects was acquired by SAP. Despite these recent acquisitions, the busi-
ness intelligence and data warehousing market remains active, vibrant, and full of opportunities.
Following is a step-by-step approach to using MicroStrategy software to analyze a hypo-
thetical business situation. A more comprehensive version of this hands-on exercise can be
found at the TUN Web site. According to this hypothetical scenario, you (the vice president of
sales at a global telecommunications company) are planning a business visit to the European
region. Before meeting with the regional salespeople on Monday, you want to know the sales
representatives' activities for the last quarter (Quarter 4 of 2004). You are to create such an
ad hoc report using MicroStrategy's Web access. In order to create this and many other OLAP
reports, you will need the access code for the TeradataUniversityNetwork.com Web site. It
is free of charge for educational use, and only your professor will be able to get the necessary
access code for you to utilize not only MicroStrategy software but also a large collection of other
business intelligence resources at this site.
Once you are in TeradataUniversityNetwork, you need to go to "APPLY & DO" and select
"MicroStrategy BI" from the "Software" section. On the "MicroStrategy/BI" Web page, follow
these steps:
1. Click on the link for "MicroStrategy Application Modules." This will lead you to a page that
shows a list of previously built MicroStrategy applications.
2. Select the "Sales Force Analysis Module." This module is designed to provide you with in-
depth insight into the entire sales process. This insight in turn allows you to increase lead
conversions, optimize product lines, take advantage of your organization's most successful
sales practices, and improve your sales organization's effectiveness.
3. In the "Sales Force Analysis Module" site you will see three sections: View, Create, and
Tools. In the View section, click on the link for "Shared Reports." This link will take you to
a place where a number of previously created sales reports are listed for everybody's use.
4. On the "Shared Reports" page, click on the folder named "Pipeline Analysis." Pipeline
Analysis reports provide insight into all open opportunities and deals in the sales pipeline.
These reports measure the current status of the sales pipeline, detect changing trends and
key events, and identify key open opportunities. You want to review what is in the pipe-
line for each sales rep, as well as whether or not they hit their sales quota last quarter.
5. On the "Pipeline Analysis" page, click on the report named "Current Pipeline vs. Quota by
Sales Region and District." This report presents the current pipeline status for each sales
district within a sales region. It also projects whether target quotas can be achieved for the
current quarter.
6. On the "Current Pipeline vs. Quota by Sales Region and District" page, select (with a single
click) "2004 Q4" as the report parameter, indicating that you want to see how the repre-
sentatives performed against their quotas for the last quarter.
7. Run the report by clicking on the "Run Report" button at the bottom of the page. This will
lead you to a sales report page where the values for each metric are calculated for all three
European sales regions. In this interactive report, you can easily change the region from
Europe to United States or Canada using the pull-down combo box, or you can drill into
one of the three European regions by simply clicking on the appropriate region's heading
to see a more detailed analysis of the selected region.
SECTION 3.6 REVIEW QUESTIONS
1. List the benefits of data warehouses.
2. List several criteria for selecting a data warehouse vendor, and describe why they are
important.
3. What is OLAP and how does it differ from OLTP?
4. What is a cube? What do drill down, roll up, and slice and dice mean?
5. What are ROLAP, MOLAP, and HOLAP? How do they differ from OLAP?
3.7 DATA WAREHOUSING IMPLEMENTATION ISSUES
Implementing a data warehouse is generally a massive effort that must be planned and
executed according to established methods. However, the project life cycle has many
facets, and no single person can be an expert in each area. Here we discuss specific ideas
and issues as they relate to data warehousing.
People want to know how successful their BI and data warehousing initiatives
are in comparison to those of other companies. Ariyachandra and Watson (2006a) pro-
posed some benchmarks for BI and data warehousing success. Watson et al. (1999)
researched data warehouse failures. Their results showed that people define a "failure" in
different ways, and this was confirmed by Ariyachandra and Watson (2006a). The Data
Warehousing Institute (tdwi.org) has developed a data warehousing maturity model
that an enterprise can apply in order to benchmark its evolution. The model offers a fast
means to gauge where the organization's data warehousing initiative is now and where
it needs to go next. The maturity model consists of six stages: prenatal, infant, child,
teenager, adult, and sage. Business value rises as the data warehouse progresses through
each succeeding stage. The stages are identified by a number of characteristics, including
scope, analytic structure, executive perceptions, types of analytics, stewardship, funding,
technology platform, change management, and administration. See Eckerson et al. (2009)
and Eckerson (2003) for more details.
Data warehouse projects have many risks. Most of them are also found in other IT
projects, but data warehousing risks are more serious because data warehouses are expen-
sive, time- and resource-demanding, large-scale projects. Each risk should be assessed at
the inception of the project. When developing a successful data warehouse, it is important
to carefully consider various risks and avoid the following issues:
• Starting with the wrong sponsorship chain. You need an executive sponsor
who has influence over the necessary resources to support and invest in the data
warehouse. You also need an executive project driver, someone who has earned
the respect of other executives, has a healthy skepticism about technology, and is
decisive but flexible. You also need an IS/IT manager to head up the project.
• Setting expectations that you cannot meet. You do not want to frustrate exec-
utives at the moment of truth. Every data warehousing project has two phases:
Phase 1 is the selling phase, in which you internally market the project by selling
the benefits to those who have access to needed resources. Phase 2 is the struggle to
meet the expectations described in Phase 1. For a mere $1 to $7 million, hopefully,
you can deliver.
• Engaging in politically naive behavior. Do not simply state that a data ware-
house will help managers make better decisions. This may imply that you feel they
have been making bad decisions until now. Sell the idea that they will be able to get
the information they need to help in decision making.
• Loading the warehouse with information just because it is available. Do
not let the data warehouse become a data landfill. This would unnecessarily slow
the use of the system. There is a trend toward real-time computing and analysis.
Data warehouses must be shut down to load data in a timely way.
• Believing that data warehousing database design is the same as transac-
tional database design. In general, it is not. The goal of data warehousing is to
access aggregates rather than a single or a few records, as in transaction-processing
systems. Content is also different, as is evident in how data are organized. DBMS
tend to be nonredundant, normalized, and relational, whereas data warehouses are
redundant, not normalized, and multidimensional.
• Choosing a data warehouse manager who is technology oriented rather
than user oriented. One key to data warehouse success is to understand that the
users must get what they need, not advanced technology for technology's sake.
• Focusing on traditional internal record-oriented data and ignoring the
value of external data and of text, images, and, perhaps, sound and
video. Data come in many formats and must be made accessible to the right peo-
ple at the right time and in the right format. They must be cataloged properly.
• Delivering data with overlapping and confusing definitions. Data cleans-
ing is a critical aspect of data warehousing. It includes reconciling conflicting data
definitions and formats organization-wide. Politically, this may be difficult because
it involves change, typically at the executive level.
• Believing promises of performance, capacity, and scalability. Data ware-
houses generally require more capacity and speed than is originally budgeted for.
Plan ahead to scale up.
• Believing that your problems are over when the data warehouse is up and
running. DSS/BI projects tend to evolve continually. Each deployment is an iteration
of the prototyping process. There will always be a need to add more and different data
sets to the data warehouse, as well as additional analytic tools for existing and addi-
tional groups of decision makers. High energy and annual budgets must be planned for
because success breeds success. Data warehousing is a continuous process.
• Focusing on ad hoc data mining and periodic reporting instead of
alerts. The natural progression of information in a data warehouse is (1) extract
the data from legacy systems, cleanse them, and feed them to the warehouse; (2)
support ad hoc reporting until you learn what people want; and (3) convert the ad
hoc reports into regularly scheduled reports. This process of learning what people
want in order to provide it seems natural, but it is not optimal or even practical.
Managers are busy and need time to read reports. Alert systems are better than
periodic reporting systems and can make a data warehouse mission critical. Alert
systems monitor the data flowing into the warehouse and inform all key people who
have a need to know as soon as a critical event occurs.
In many organizations, a data warehouse will be successful only if there is strong
senior management support for its development and if there is a project champion who
is high up in the organizational chart. Although this would likely be true for any large-
scale IT project, it is especially important for a data warehouse realization. The successful
implementation of a data warehouse results in the establishment of an architectural frame-
work that may allow for decision analysis throughout an organization and in some cases
also provides comprehensive SCM by granting access to information on an organization's
customers and suppliers. The implementation of Web-based data warehouses (sometimes
called Webhousing) has facilitated ease of access to vast amounts of data, but it is dif-
ficult to determine the hard benefits associated with a data warehouse. Hard benefits are
defined as benefits to an organization that can be expressed in monetary terms. Many
organizations have limited IT resources and must prioritize projects. Management support
and a strong project champion can help ensure that a data warehouse project will receive
the resources necessary for successful implementation. Data warehouse resources can
be a significant cost, in some cases requiring high-end processors and large increases in
direct-access storage devices (DASD). Web-based data warehouses may also have special
security requirements to ensure that only authorized users have access to the data.
User participation in the development of data and access modeling is a critical suc-
cess factor in data warehouse development. During data modeling, expertise is required
to determine what data are needed, define business rules associated with the data, and
decide what aggregations and other calculations may be necessary. Access modeling is
needed to determine how data are to be retrieved from a data warehouse, and it assists in
the physical definition of the warehouse by helping to define which data require index-
ing. It may also indicate whether dependent data marts are needed to facilitate informa-
tion retrieval. The team skills needed to develop and implement a data warehouse include
in-depth knowledge of the database technology and development tools used. Source sys-
tems and development technology, as mentioned previously, reference the many inputs
and the processes used to load and maintain a data warehouse.
Application Case 3.6 presents an excellent example of a large-scale implementation
of an integrated data warehouse by a state government.
Application Case 3.6
EDW Helps Connect State Agencies in Michigan
Through customer service, resource optimization, and the innovative use of information and technology, the Michigan Department of Technology, Management & Budget (DTMB) impacts every area of government. Nearly 10,000 users in five major departments, 20 agencies, and more than 100 bureaus rely on the EDW to do their jobs more effectively and better serve Michigan residents. The EDW achieves $1 million per business day in financial benefits.
The EDW helped Michigan achieve $200 million in annual financial benefits within the Department of Community Health alone, plus another $75 million per year within the Department of Human Services (DHS). These savings include program integrity benefits, cost avoidance due to improved outcomes, sanction avoidance, operational efficiencies, and the recovery of inappropriate payments within its Medicaid program.
The Michigan DHS data warehouse (DW) provides unique and innovative information critical to the efficient operation of the agency from both a strategic and tactical level. Over the last 10 years, the DW has yielded a 15:1 cost-effectiveness ratio. Consolidated information from the DW now contributes to nearly every function of DHS, including
accurate delivery of and accounting for benefits delivered to almost 2.5 million DHS public assistance clients.
Michigan has been ambitious in its attempts to solve real-life problems through the innovative sharing and comprehensive analyses of data. Its approach to BI/DW has always been "enterprise" (statewide) in nature, rather than having separate BI/DW platforms for each business area or state agency. By removing barriers to sharing enterprise data across business units, Michigan has leveraged massive amounts of data to create innovative approaches to the use of BI/DW, delivering efficient, reliable enterprise solutions using multiple channels.
QUESTIONS FOR DISCUSSION
1. Why would a state invest in a large and expensive IT infrastructure (such as an EDW)?
2. What are the size and complexity of the EDW used by state agencies in Michigan?
3. What were the challenges, the proposed solution, and the obtained results of the EDW?
Source: Compiled from TDWI Best Practices Awards 2012 Winner, Enterprise Data Warehousing, Government and Non-Profit Category, "Michigan Departments of Technology, Management & Budget (DTMB), Community Health (DCH), and Human Services (DHS)," featured in TDWI What Works, Vol. 34, p. 22; and michigan.michigan.gov.
Massive Data Warehouses and Scalability
In addition to flexibility, a data warehouse needs to support scalability. The main issues pertaining to scalability are the amount of data in the warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and the complexity of user queries. A data warehouse must scale both horizontally and vertically. The warehouse will grow as a function of data growth and the need to expand the warehouse to support new business functionality. Data growth may be a result of the addition of current cycle data (e.g., this month's results) and/or historical data.
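As a back-of-the-envelope illustration of these growth drivers, the short sketch below (not from the text; the starting size and growth rate are hypothetical) projects warehouse size under compound monthly data growth, the kind of estimate a capacity planner might make.

    # Hypothetical capacity projection for data warehouse scalability planning.
    # The starting size and monthly growth rate are made-up illustration values.
    def project_size(start_tb: float, monthly_growth: float, months: int) -> float:
        """Compound the warehouse size by a fixed monthly growth rate."""
        size = start_tb
        for _ in range(months):
            size *= 1 + monthly_growth
        return size

    # A 50 TB warehouse growing 5% per month roughly triples in two years.
    print(f"{project_size(50.0, 0.05, 24):.1f} TB after 24 months")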
Hicks (2001) described huge databases and data warehouses. Walmart is continually increasing the size of its massive data warehouse. Walmart is believed to use a warehouse with hundreds of terabytes of data to study sales trends, track inventory, and perform other tasks. IBM recently publicized its 50-terabyte warehouse benchmark (IBM, 2009). The U.S. Department of Defense is using a 5-petabyte data warehouse and repository to hold medical records for 9 million military personnel. Because of the storage required to archive its news footage, CNN also has a petabyte-sized data warehouse.
Given that the size of data warehouses is expanding at an exponential rate, scalability is an important issue. Good scalability means that queries and other data-access functions will grow (ideally) linearly with the size of the warehouse. See Rosenberg (2006) for approaches to improve query performance. In practice, specialized methods have been developed to create scalable data warehouses. Scalability is difficult when managing hundreds of terabytes or more. Terabytes of data have considerable inertia, occupy a lot of physical space, and require powerful computers. Some firms use parallel processing, and others use clever indexing and search schemes to manage their data. Some spread their data across different physical data stores. As more data warehouses approach the petabyte size, better and better solutions to scalability continue to be developed.
Hall (2002) also addressed scalability issues. AT&T is an industry leader in deploying and using massive data warehouses. With its 26-terabyte data warehouse, AT&T can detect fraudulent use of calling cards and investigate calls related to kidnappings and other crimes. It can also compute millions of call-in votes from television viewers selecting the next American Idol.
For a sample of successful data warehousing implementations, see Edwards (2003). Jukic and Lang (2004) examined the trends and specific issues related to the use of offshore resources in the development and support of data warehousing and BI applications. Davison (2003) indicated that IT-related offshore outsourcing had been growing at 20 to 25 percent per year. When considering offshoring data warehousing projects, careful consideration must be given to culture and security (for details, see Jukic and Lang, 2004).
SECTION 3.7 REVIEW QUESTIONS
1. What are the major DW implementation tasks that can be performed in parallel?
2. List and discuss the most pronounced DW implementation guidelines.
3. When developing a successful data warehouse, what are the most important risks and issues to consider and potentially avoid?
4. What is scalability? How does it apply to DW?
3.8 REAL-TIME DATA WAREHOUSING
Data warehousing and BI tools traditionally focus on assisting managers in making strategic and tactical decisions. Increased data volumes and accelerating update speeds are fundamentally changing the role of the data warehouse in modern business. For many businesses, making fast and consistent decisions across the enterprise requires more than a traditional data warehouse or data mart. Traditional data warehouses are not business critical. Data are commonly updated on a weekly basis, and this does not allow for responding to transactions in near-real-time.
More data, coming in faster and requiring immediate conversion into decisions, means that organizations are confronting the need for real-time data warehousing. This is because decision support has become operational, integrated BI requires closed-loop analytics, and yesterday's ODS will not support existing requirements.
In 2003, with the advent of real-time data warehousing, there was a shift toward using these technologies for operational decisions. Real-time data warehousing (RDW), also known as active data warehousing (ADW), is the process of loading and providing data via the data warehouse as they become available. It evolved from the EDW concept. The active traits of an RDW/ADW supplement and expand traditional data warehouse functions into the realm of tactical decision making. People throughout the organization who interact directly with customers and suppliers will be empowered with information-based decision making at their fingertips. Even further leverage results when an ADW provides information directly to customers and suppliers. The reach and impact of information access for decision making can positively affect almost all aspects of customer service, SCM, logistics, and beyond. E-business has become a major catalyst in the demand for active data warehousing (see Armstrong, 2000). For example, online retailer Overstock.com, Inc. (overstock.com) connected data users to a real-time data warehouse. At Egg plc, the world's largest purely online bank, a customer data warehouse is refreshed in near-real-time. See Application Case 3.7.
As business needs evolve, so do the requirements of the data warehouse. At the most basic level, a data warehouse simply reports what happened. At the next level, some analysis occurs. As the system evolves, it provides prediction capabilities, which lead to the next level of operationalization. At its highest evolution, the ADW is capable of making events happen (e.g., activities such as creating sales and marketing campaigns or identifying and exploiting opportunities). See Figure 3.12 for a graphic description of this evolutionary process. A recent survey on managing the evolution of data warehouses can be found in Wrembel (2009).
Application Case 3.7
Egg Plc Fries the Competition in Near Real Time
Egg plc, now a part of Yorkshire Building Society (egg.com), is the world's largest online bank. It provides banking, insurance, investments, and mortgages to more than 3.6 million customers through its Internet site. In 1998, Egg selected Sun Microsystems to create a reliable, scalable, secure infrastructure to support its more than 2.5 million daily transactions.
In 2001, the system was upgraded to eliminate latency problems. This new customer data warehouse (CDW) used Sun, Oracle, and SAS software products. The initial data warehouse had about 10 terabytes of data and used a 16-CPU server. The system provides near-real-time data access. It provides data warehouse and data mining services to internal users, and it provides a requisite set of customer data to the customers themselves. Hundreds of sales and marketing campaigns are constructed using near-real-time data (within several minutes). Better yet, the system enables faster decision making about specific customers and customer classes.
QUESTIONS FOR DISCUSSION
1. What kind of business is Egg plc in? What is the competitive landscape?
2. How did Egg plc use near-real-time data warehousing for competitive advantage?
Sources: Compiled from "Egg's Customer Data Warehouse Hits the Mark," DM Review, Vol. 15, No. 10, October 2005, pp. 24-28; Sun Microsystems, "Egg Banks on Sun to Hit the Mark with Customers," September 19, 2005, sun.com/smi/Press/sunflash/2005-09/sunflash.20050919.1.xml (accessed April 2006); and ZD Net UK, "Sun Case Study: Egg's Customer Data Warehouse," whitepapers.zdnet.co.uk/0,39025945,60159401p-39000449q,00.htm (accessed June 2009).
FIGURE 3.12 Enterprise Decision Evolution. Source: Courtesy of Teradata Corporation. Used with permission. [The figure plots workload complexity against data sophistication across five stages: REPORTING (What happened? Primarily batch and some ad hoc reports), ANALYZING (Why did it happen? Increase in ad hoc analysis; segmentation and profiles), PREDICTING (What will happen? Analytical modeling grows; predictive models), OPERATIONALIZING (What is happening now? Continuous update and time-sensitive queries become important; real-time decisioning applications), and ACTIVATING (Make it happen! Event-based triggering takes hold; enterprise decisioning management).]
FIGURE 3.13 The Teradata Active EDW. Source: Courtesy of Teradata Corporation. Used with permission. [The figure lists six active traits: Active Access (front-line operational decisions or services supported by NRT access; service-level agreements of 5 seconds or less); Active Load (intra-day data acquisition; mini-batch to near-real-time [NRT] trickle data feeds measured in minutes or seconds); Active Events (proactive monitoring of business activity, initiating intelligent actions based on rules and context, to systems or users supporting an operational business process); Active Workload Management (dynamically managing system resources for optimum performance and resource utilization in a mixed-workload environment); Active Enterprise Integration (integration into the enterprise architecture for delivery of intelligent decisioning services); and Active Availability (business continuity to support the requirements of the business, up to 24x7).]
Teradata Corporation provides the baseline requirements to support an EDW. It also provides the new traits of active data warehousing required to deliver data freshness, performance, and availability and to enable enterprise decision management (see Figure 3.13 for an example).
An ADW offers an integrated information repository to drive strategic and tactical decision support within an organization. With real-time data warehousing, instead of extracting operational data from an OLTP system in nightly batches into an ODS, data are assembled from OLTP systems as and when events happen and are moved at once into the data warehouse. This permits the instant updating of the data warehouse and the elimination of an ODS. At this point, tactical and strategic queries can be made against the RDW to use immediate as well as historical data.
According to Basu (2003), the most distinctive difference between a traditional data warehouse and an RDW is the shift in the data acquisition paradigm. Some of the business cases and enterprise requirements that led to the need for data in real time include the following:
• A business often cannot afford to wait a whole day for its operational data to load into the data warehouse for analysis.
• Until now, data warehouses have captured snapshots of an organization's fixed states instead of incremental real-time data showing every state change and almost analogous patterns over time.
• With a traditional hub-and-spoke architecture, keeping the metadata in sync is difficult. It is also costly to develop, maintain, and secure many systems as opposed to one huge data warehouse so that data are centralized for BI/BA tools.
• In cases of huge nightly batch loads, the necessary ETL setup and processing power for large nightly data warehouse loading might be very high, and the processes might take too long. An EAI with real-time data collection can reduce or eliminate the nightly batch processes. (A minimal trickle-load sketch follows this list.)
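To illustrate the shift from one nightly batch to trickle loading, here is a minimal, hypothetical sketch (not from the book) of a micro-batch loader that moves newly arrived OLTP events into the warehouse every few seconds; the queue, table, and parameter names are assumptions.

    # Hypothetical micro-batch ("trickle") loader sketch: instead of one nightly
    # batch, small batches of new OLTP events are applied every few seconds.
    import queue
    import time

    oltp_events: "queue.Queue[dict]" = queue.Queue()   # stands in for a change-data feed
    warehouse_fact_table: list = []                    # stands in for the DW fact table

    def trickle_load(batch_window_sec: float = 5.0, max_batches: int = 3) -> None:
        for _ in range(max_batches):
            time.sleep(batch_window_sec)               # wait one micro-batch window
            batch = []
            while not oltp_events.empty():             # drain whatever arrived
                batch.append(oltp_events.get())
            if batch:
                warehouse_fact_table.extend(batch)     # apply as one small load
                print(f"loaded {len(batch)} events; "
                      f"table now has {len(warehouse_fact_table)} rows")

    # Simulate a few OLTP transactions arriving, then run the loader briefly.
    for i in range(4):
        oltp_events.put({"order_id": i, "amount": 10 * i})
    trickle_load(batch_window_sec=0.1)

The design choice is the batch window: shrinking it toward zero approaches true event-by-event loading, at the price of more load overhead on the warehouse.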
Despite the benefits of an RDW, developing one can create its own set of issues. These problems relate to architecture, data modeling, physical database design, storage and scalability, and maintainability. In addition, depending on exactly when data are accessed, even down to the microsecond, different versions of the truth may be extracted and created, which can confuse team members. For details, refer to Basu (2003) and Terr (2004).
Real-time solutions present a remarkable set of challenges to BI activities. Although it is not ideal for all solutions, real-time data warehousing may be successful if the organization develops a sound methodology to handle project risks, incorporates proper planning, and focuses on quality assurance activities. Understanding the common challenges and applying best practices can reduce the extent of the problems that are often a part of implementing complex data warehousing systems that incorporate BI/BA methods. Details and real implementations are discussed by Burdett and Singh (2004) and Wilk (2003). Also see Akbay (2006) and Ericson (2006).
See Technology Insights 3.3 for some details on how the real-time concept evolved. The flight management dashboard application at Continental Airlines (see the End-of-Chapter Application Case) illustrates the power of real-time BI in accessing a data warehouse for use in face-to-face customer interaction situations. The operations staff uses the real-time system to identify issues in the Continental flight network. As another example, UPS invested $600 million so it could use real-time data and processes. The investment was expected to cut 100 million delivery miles and save 14 million gallons of fuel annually by managing its real-time package-flow technologies (see Malykhina, 2003). Table 3.6 compares traditional and active data warehousing environments.
Real-time data warehousing, near-real-time data warehousing, zero-latency warehousing, and active data warehousing are different names used in practice to describe the same concept. Gonzales (2005) presented different definitions for ADW. According to Gonzales, ADW is only one option that provides blended tactical and strategic data on demand. The architecture to build an ADW is very similar to the corporate information factory architecture developed by Bill Inmon. The only difference between a corporate information factory and an ADW is the implementation of both data stores in a single
TECHNOLOGY INSIGHTS 3.3 The Real-Time Realities of Active Data Warehousing
By 2003, the role of data warehousing in practice was growing rapidly. Real-time systems, though a novelty, were the latest buzz, along with the major complications of providing data and information instantaneously to those who need them. Many experts, including Peter Coffee, eWeek's technology editor, believe that real-time systems must feed a real-time decision-making process. Stephen Brobst, CTO of the Teradata division of NCR, indicated that active data warehousing is a process of evolution in how an enterprise uses data. Active means that the data warehouse is also used as an operational and tactical tool. Brobst provided a five-stage model that fits Coffee's experience (2003) of how organizations "grow" in their data utilization (see Brobst et al., 2005). These stages (and the questions they purport to answer) are reporting (What happened?), analysis (Why did it happen?), prediction (What will happen?), operationalizing (What is happening?), and active warehousing (What do I want to happen?). The last stage, active warehousing, is where the greatest benefits may be obtained. Many organizations are enhancing centralized data warehouses to serve both operational and strategic decision making.
Sources: Adapted from P. Coffee, "'Active' Warehousing," eWeek, Vol. 20, No. 25, June 23, 2003, p. 36; and Teradata Corp., "Active Data Warehousing," teradata.com/active-data-warehousing/ (accessed August 2013).
TABLE 3.6 Comparison Between Traditional and Active Data Warehousing Environments
Traditional Data Warehouse Environment vs. Active Data Warehouse Environment:
• Strategic decisions only vs. strategic and tactical decisions.
• Results sometimes hard to measure vs. results measured with operations.
• Daily, weekly, monthly data currency acceptable (summaries often appropriate) vs. only comprehensive detailed data available within minutes being acceptable.
• Moderate user concurrency vs. a high number (1,000 or more) of users accessing and querying the system simultaneously.
• Highly restrictive reporting used to confirm or check existing processes and patterns (often using predeveloped summary tables or data marts) vs. flexible ad hoc reporting, as well as machine-assisted modeling (e.g., data mining) to discover new hypotheses and relationships.
• Power users, knowledge workers, and internal users vs. operational staffs, call centers, and external users.
Sources: Adapted from P. Coffee, "'Active' Warehousing," eWeek, Vol. 20, No. 25, June 23, 2003, p. 36; and Teradata Corp., "Active Data Warehousing," teradata.com/active-data-warehousing/ (accessed August 2013).
environment. However, an SOA based on XML and Web services provides another option for blending tactical and strategic data on demand.
One critical issue in real-time data warehousing is that not all data should be updated continuously. This may certainly cause problems when reports are generated in real time, because one person's results may not match another person's. For example, a company using Business Objects Web Intelligence noticed a significant problem with real-time intelligence: real-time reports produced at slightly different times differ (see Peterson, 2003). Also, it may not be necessary to update certain data continuously (e.g., course grades that are 3 or more years old).
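One common mitigation for the "two versions of the truth" problem, shown in this hypothetical sketch (not from the book), is to stamp each load with a batch timestamp and pin every report in a run to the same "as of" point, so reports generated seconds apart still agree; all names and values are illustrative.

    # Hypothetical "as of" query sketch: pinning reports to one load timestamp
    # keeps two reports run seconds apart from seeing different truths.
    rows = [
        {"region": "east", "sales": 100, "loaded_at": 1},
        {"region": "east", "sales": 40,  "loaded_at": 2},  # arrives mid-report-run
    ]

    def total_sales(as_of: int) -> int:
        """Sum sales using only rows loaded at or before the pinned timestamp."""
        return sum(r["sales"] for r in rows if r["loaded_at"] <= as_of)

    pinned = 1                        # both reports use the same snapshot
    print(total_sales(as_of=pinned))  # 100
    print(total_sales(as_of=pinned))  # 100 again, consistent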
Real-time requirements change the way we view the design of databases, data warehouses, OLAP, and data mining tools, because they are literally updated concurrently while queries are active. But the substantial business value in doing so has been demonstrated, so it is crucial that organizations adopt these methods in their business processes. Careful planning is critical in such implementations.
SECTION 3.8 REVIEW QUESTIONS
1. What is an RDW?
2. List the benefits of an RDW.
3. What are the major differences between a traditional data warehouse and an RDW?
4. List some of the drivers for RDW.
3.9 DATA WAREHOUSE ADMINISTRATION, SECURITY ISSUES, AND FUTURE TRENDS
Data warehouses provide a distinct competitive edge to enterprises that effectively create and use them. Due to its huge size and its intrinsic nature, a data warehouse requires especially strong monitoring in order to sustain satisfactory efficiency and productivity. The successful administration and management of a data warehouse entails skills and proficiency that go past what is required of a traditional database administrator (DBA).
A data warehouse administrator (DWA) should be familiar with high-performance software, hardware, and networking technologies. He or she should also possess solid business insight. Because data warehouses feed BI systems and DSS that help managers with their decision-making activities, the DWA should be familiar with the decision-making processes so as to suitably design and maintain the data warehouse structure. It is particularly significant for a DWA to keep the existing requirements and capabilities of the data warehouse stable while simultaneously providing flexibility for rapid improvements. Finally, a DWA must possess excellent communications skills. See Benander et al. (2000) for a description of the key differences between a DBA and a DWA.
Security and privacy of information are main and significant concerns for a data warehouse professional. The U.S. government has passed regulations (e.g., the Gramm-Leach-Bliley privacy and safeguards rules, the Health Insurance Portability and Accountability Act of 1996 [HIPAA]), instituting obligatory requirements in the management of customer information. Hence, companies must create security procedures that are effective yet flexible enough to conform to numerous privacy regulations. According to Elson and Leclerc (2005), effective security in a data warehouse should focus on four main areas:
1. Establishing effective corporate and security policies and procedures. An effective security policy should start at the top, with executive management, and should be communicated to all individuals within the organization.
2. Implementing logical security procedures and techniques to restrict access. This includes user authentication, access controls, and encryption technology.
3. Limiting physical access to the data center environment.
4. Establishing an effective internal control review process with an emphasis on security and privacy.
See Technology Insights 3.4 for a description of Ambeo's important software tool that monitors security and privacy of data warehouses. Finally, keep in mind that accessing a data warehouse via a mobile device should always be performed cautiously. In this instance, data should only be accessed as read-only.
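To illustrate the read-only precaution for mobile access, here is a minimal, hypothetical sketch (not from the book) of a cursor wrapper that rejects anything but SELECT statements for mobile sessions; the class and channel names are assumptions, with the standard-library sqlite3 module standing in for the warehouse.

    # Hypothetical read-only guard sketch for mobile data warehouse access.
    import sqlite3

    class ReadOnlyCursor:
        """Wraps a database cursor and allows only SELECT statements."""

        def __init__(self, cursor, channel: str):
            self._cursor = cursor
            self._channel = channel   # e.g., "mobile" or "desktop"

        def execute(self, sql: str, params=()):
            if self._channel == "mobile" and not sql.lstrip().upper().startswith("SELECT"):
                raise PermissionError("mobile sessions are read-only")
            return self._cursor.execute(sql, params)

    # Usage with an in-memory database standing in for the warehouse.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (amount REAL)")
    cur = ReadOnlyCursor(conn.cursor(), channel="mobile")
    cur.execute("SELECT COUNT(*) FROM sales")          # allowed
    try:
        cur.execute("DELETE FROM sales")               # blocked
    except PermissionError as e:
        print(e)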
In the near term, data warehousing developments will be determined by noticeable factors (e.g., data volumes, increased intolerance for latency, the diversity and complexity of data types) and less noticeable factors (e.g., unmet end-user requirements for dashboards, balanced scorecards, master data management, information quality). Given these drivers, Moseley (2009) and Agosta (2006) suggested that data warehousing trends will lean toward simplicity, value, and performance.
TECHNOLOGY INSIGHTS 3.4 Ambeo Delivers Proven Data-Access Auditing Solution
Since 1997, Ambeo (ambeo.com; now Embarcadero Technologies, Inc.) has deployed technology that provides performance management, data usage tracking, data privacy auditing, and monitoring to Fortune 1000 companies. These firms have some of the largest database environments in existence. Ambeo data-access auditing solutions play a major role in an enterprise information security infrastructure.
The Ambeo technology is a relatively easy solution that records everything that happens in the databases, with low or zero overhead. In addition, it provides data-access auditing that identifies exactly who is looking at data, when they are looking, and what they are doing with the data. This real-time monitoring helps quickly and effectively identify security breaches.
Sources: Adapted from "Ambeo Delivers Proven Data Access Auditing Solution," Database Trends and Applications, Vol. 19, No. 7, July 2005; and Ambeo, "Keeping Data Private (and Knowing It): Moving Beyond Conventional Safeguards to Ensure Data Privacy," ambeo.com/why_ambeo_white_papers.html (accessed May 2009).
The Future of Data Warehousing
The field of data warehousing has been a vibrant area in information technology in the last couple of decades, and the evidence in the BI/BA and Big Data world shows that the importance of the field will only increase. Following are some of the recently popularized concepts and technologies that will play a significant role in defining the future of data warehousing.
Sourcing (mechanisms for acquisition of data from diverse and dispersed sources):
• Web, social media, and Big Data. The recent upsurge in the use of the Web for personal as well as business purposes, coupled with the tremendous interest in social media, creates opportunities for analysts to tap into very rich data sources. Because of the sheer volume, velocity, and variety of the data, a new term, Big Data, has been coined to name the phenomenon. Taking advantage of Big Data requires development of new and dramatically improved BI/BA technologies, which will result in a revolutionized data warehousing world.
• Open source software. Use of open source software tools is increasing at an unprecedented level in warehousing, business intelligence, and data integration. There are good reasons for the upswing of open source software used in data warehousing (Russom, 2009): (1) The recession has driven up interest in low-cost open source software; (2) open source tools are coming into a new level of maturity; and (3) open source software augments traditional enterprise software without replacing it.
• SaaS (software as a service), "The Extended ASP Model." SaaS is a creative way of deploying information system applications where the provider licenses its applications to customers for use as a service on demand (usually over the Internet). SaaS software vendors may host the application on their own servers or upload the application to the consumer site. In essence, SaaS is the new and improved version of the ASP model. For data warehouse customers, finding SaaS-based software applications and resources that meet specific needs and requirements can be challenging. As these software offerings become more agile, the appeal and the actual use of SaaS as the choice of data warehousing platform will also increase.
• Cloud computing. Cloud computing is perhaps the newest and most innovative platform choice to come along in years. Numerous hardware and software resources are pooled and virtualized, so that they can be freely allocated to applications and software platforms as resources are needed. This enables information system applications to dynamically scale up as workloads increase. Although cloud computing and similar virtualization techniques are fairly well established for operational applications today, they are just now starting to be used as data warehouse platforms of choice. The dynamic allocation of a cloud is particularly useful when the data volume of the warehouse varies unpredictably, making capacity planning difficult.
Infrastructure (architectural hardware and software enhancements):
• Columnar (a new way to store and access data in the database). A column-oriented database management system (also commonly called a columnar database) is a system that stores data tables as sections of columns of data rather than as rows of data (which is the way most relational database management systems do it). That is, these columnar databases store data by columns instead of rows, so all values of a single column are stored consecutively on disk. Such a structure gives a much finer grain of control to the relational database management system: it can access only the columns required for the query, as opposed to being forced to access all columns of the row. It performs significantly better for queries that need only a small percentage of a table's columns, but significantly worse when most of the columns are needed, because of the overhead of stitching the columns back together to form the result sets. Comparisons between row-oriented and column-oriented data layouts are typically concerned with the efficiency of hard-disk access for a given workload, disk access being one of the most time-consuming operations in a computer. Based on the task at hand, one may be significantly more advantageous than the other. Column-oriented organizations are more efficient when (1) an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset can be faster than reading all data, and (2) new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows. Row-oriented organizations are more efficient when (1) many columns of a single row are required at the same time and row size is relatively small, as the entire row can be retrieved with a single disk seek, and (2) a new row is written with all of its column data supplied at the same time, as the entire row can be written with a single disk seek. Additionally, since the data stored in a column are of uniform type, columnar storage lends itself better to compression; significant storage-size optimization is available in column-oriented data that is not available in row-oriented data. Such optimal compression of data reduces storage size, making it more economically justifiable to pursue in-memory or solid-state storage alternatives. (A minimal sketch contrasting the two layouts appears after this list.)
• Real-time data warehousing. Real-time data warehousing implies that the refresh cycle of an existing data warehouse updates the data more frequently (almost at the same time as the data become available at operational databases). These real-time data warehouse systems can achieve near-real-time update of data, where the data latency typically is in the range of minutes to hours. As the latency gets smaller, the cost of data update seems to increase exponentially. Future advancements on many technological fronts (ranging from automatic data acquisition to intelligent software agents) are needed to make real-time data warehousing a reality with an affordable price tag.
• Data warehouse appliances (all-in-one solutions to DW). A data warehouse appliance consists of an integrated set of servers, storage, operating system(s), database management systems, and software specifically preinstalled and preoptimized for data warehousing. In practice, data warehouse appliances provide solutions for the mid-to-big data warehouse market, offering low-cost performance on data volumes in the terabyte to petabyte range. In order to improve performance, most data warehouse appliance vendors use massively parallel processing architectures. Even though most database and data warehouse vendors provide appliances nowadays, many believe that Teradata was the first to provide a commercial data warehouse appliance product. What is often observed now is the emergence of data warehouse bundles, where vendors combine their hardware and database software as a data warehouse platform. From a benefits standpoint, data warehouse appliances have significantly low total cost of ownership, which includes initial purchase costs, ongoing maintenance costs, and the cost of changing capacity as the data grow. The resource cost for monitoring and tuning the data warehouse makes up a large part of the total cost of ownership, often as much as 80 percent. DW appliances reduce administration for day-to-day operations, setup, and integration. Since data warehouse appliances provide a single-vendor solution, they tend to better optimize the hardware and software within the appliance. Such unified integration maximizes the chances of successful integration and testing of the DBMS, storage, and operating system by avoiding some of the compatibility issues that arise from multi-vendor solutions. A data warehouse appliance also provides a single point of contact for problem resolution and a much simpler upgrade path for both software and hardware.
• Data management technologies and practices. Some of the most pressing needs for a next-generation data warehouse platform involve technologies and practices that we generally don't think of as part of the platform. In particular, many users need to update the data management tools that process data for use through data warehousing. The future holds strong growth for master data management (MDM). This relatively new, but extremely important, concept is gaining popularity for many reasons, including the following: (1) Tighter integration with operational systems demands MDM; (2) most data warehouses still lack MDM and data quality functions; and (3) regulatory and financial reports must be perfectly clean and accurate.
• In-database processing technology (putting the algorithms where the data is). In-database processing (also called in-database analytics) refers to the integration of the algorithmic extent of data analytics into the data warehouse. By doing so, the data and the analytics that work off the data live within the same environment. Having the two in close proximity increases the efficiency of computationally intensive analytics procedures. Today, many large database-driven decision support systems, such as those used for credit card fraud detection and investment risk management, use this technology because it provides significant performance improvements over traditional methods in a decision environment where time is of the essence. In-database processing is a complex endeavor compared to the traditional way of conducting analytics, where the data is moved out of the database (often in a flat file format that consists of rows and columns) into a separate analytics environment (such as SAS Enterprise Miner, Statistica Data Miner, or IBM SPSS Modeler) for processing. In-database processing makes more sense for high-throughput, real-time application environments, including fraud detection, credit scoring, risk management, transaction processing, pricing and margin analysis, usage-based micro-segmenting, behavioral ad targeting, and recommendation engines, such as those used by customer service organizations to determine next-best actions. In-database processing is performed and promoted as a feature by many of the major data warehousing vendors, including Teradata (integrating SAS analytics capabilities into the data warehouse appliances), IBM Netezza, EMC Greenplum, and Sybase, among others.
• In-memory storage technology (moving the data into memory for faster processing). Conventional database systems, such as relational database management systems, typically use physical hard drives to store data for an extended period of time. When a data-related process is requested by an application, the database management system loads the data (or parts of the data) into the main memory, processes it, and responds back to the application. Although data (or parts of the data) are temporarily cached in the main memory in a database management system, the primary storage location remains a magnetic hard disk. In contrast, an in-memory database system keeps the data permanently in the main memory. When a data-related process is requested by an application, the database management system directly accesses the data, which is already in the main memory, processes it, and responds back to the requesting application. This direct access to data in main memory makes the processing of data orders of magnitude faster than the traditional method. The main benefit of in-memory technology (maybe its only benefit) is the incredible speed at which it accesses the data. The disadvantages include the cost of a very large main memory (even though memory is getting cheaper, it still costs a great deal to have enough main memory to hold all of a company's data) and the need for sophisticated data recovery strategies (since main memory is volatile and can be wiped out accidentally).
• New database management systems. A data warehouse platform consists of several basic components, of which the most critical is the database management system (DBMS). This is only natural, given the fact that the DBMS is the component of the platform where the most work must be done to implement a data model and optimize it for query performance. Therefore, the DBMS is where many next-generation innovations are expected to happen.
• Advanced analytics. Users can choose different analytic methods as they move beyond basic OLAP-based methods and into advanced analytics. Some users choose advanced analytic methods based on data mining, predictive analytics, statistics, artificial intelligence, and so on. Still, the majority of users seem to be choosing SQL-based methods. Either SQL-based or not, advanced analytics seem to be among the most important promises of next-generation data warehousing.
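As promised in the columnar item above, here is a minimal illustrative sketch (not from the book) contrasting row-oriented and column-oriented layouts for a simple aggregate; the table and field names are made up.

    # Illustrative contrast of row-oriented vs. column-oriented layouts.
    # A columnar layout lets an aggregate read one column's values contiguously.

    # Row-oriented: a list of complete records.
    rows = [
        {"order_id": 1, "region": "east", "amount": 100.0},
        {"order_id": 2, "region": "west", "amount": 250.0},
        {"order_id": 3, "region": "east", "amount": 75.0},
    ]

    # Column-oriented: one array per column, values stored consecutively.
    columns = {
        "order_id": [1, 2, 3],
        "region":   ["east", "west", "east"],
        "amount":   [100.0, 250.0, 75.0],
    }

    # SUM(amount): the row layout touches every field of every record,
    # while the columnar layout scans a single contiguous array.
    print(sum(r["amount"] for r in rows))   # 425.0
    print(sum(columns["amount"]))           # 425.0

    # Fetching one whole record favors the row layout: one lookup versus
    # reassembling the record from every column array.
    print(rows[1])
    print({name: col[1] for name, col in columns.items()})

The same trade-off the text describes is visible here in miniature: the aggregate needs only the amount column, while retrieving a full record forces the columnar layout to stitch values back together.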
The future of data warehousing seems to be full of promise and significant challenges. As the world of business becomes more global and complex, the need for business intelligence and data warehousing tools will also become more prominent. The fast-improving information technology tools and techniques seem to be moving in the right direction to address the needs of future business intelligence systems.
SECTION 3.9 REVIEW QUESTIONS
1. What steps can an organization take to ensure the security and confidentiality of customer data in its data warehouse?
2. What skills should a DWA possess? Why?
3. What recent technologies may shape the future of data warehousing? Why?
3.10 RESOURCES, LINKS, AND THE TERADATA UNIVERSITY NETWORK CONNECTION
The use of this chapter and most other chapters in this book can be enhanced by the tools described in the following sections.
Resources and Links
We recommend looking at the following resources and links for further reading and explanations:
• The Data Warehouse Institute (tdwi.org)
• DM Review (information-management.com)
• DSS Resources (dssresources.com)
Cases
All major MSS vendors (e.g., MicroStrategy, Microsoft, Oracle, IBM, Hyperion, Cognos, Exsys, Fair Isaac, SAP, Information Builders) provide interesting customer success stories. Academic-oriented cases are available at the Harvard Business School Case Collection (harvardbusinessonline.hbsp.harvard.edu), Business Performance Improvement Resource (bpir.com), IGI Global Disseminator of Knowledge (igi-global.com), Ivy League Publishing (ivylp.com), ICFAI Center for Management Research (icmr.icfai.org/casestudies/icmr_case_studies.htm), KnowledgeStorm (knowledgestorm.com), and other sites. For additional case resources, see Teradata University Network (teradatauniversitynetwork.com). For data warehousing cases, we specifically recommend the following from the Teradata University Network (teradatauniversitynetwork.com): "Continental Airlines Flies High with Real-Time Business Intelligence," "Data Warehouse Governance at Blue Cross and Blue Shield of North Carolina," "3M Moves to a Customer Focus Using a Global Data Warehouse," "Data Warehousing Supports Corporate Strategy at First American Corporation," "Harrah's High Payoff from Customer Information," and "Whirlpool." We also recommend the Data Warehousing Failures Assignment, which consists of eight short cases on data warehousing failures.
Vendors, Products, and Demos
A comprehensive list of vendors, products, and demos is available at DM Review (dmreview.com). Vendors are listed in Table 3.2. Also see technologyevaluation.com.
Periodicals
We recommend the following periodicals:
• Baseline (baselinemag.com)
• Business Intelligence Journal (tdwi.org)
• CIO (cio.com)
• CIO Insight (cioinsight.com)
• Computerworld (computerworld.com)
• Decision Support Systems (elsevier.com)
• DM Review (dmreview.com)
• eWeek (eweek.com)
• InfoWeek (infoweek.com)
• InfoWorld (infoworld.com)
• InternetWeek (internetweek.com)
• Management Information Systems Quarterly (MIS Quarterly; misq.org)
• Technology Evaluation (technologyevaluation.com)
• Teradata Magazine (teradata.com)
Additional References
For additional information on data warehousing, see the following:
• C. Imhoff, N. Galemmo, and J. G. Geiger. (2003). Mastering Data Warehouse Design: Relational and Dimensional Techniques. New York: Wiley.
• D. Marco and M. Jennings. (2004). Universal Meta Data Models. New York: Wiley.
• J. Wang. (2005). Encyclopedia of Data Warehousing and Mining. Hershey, PA: Idea Group Publishing.
For more on databases, the structure on which data warehouses are developed, see the following:
• R. T. Watson. (2006). Data Management, 5th ed. New York: Wiley.
The Teradata University Network (TUN) Connection
TUN (teradatauniversitynetwork.com) provides a wealth of information and cases on data warehousing. One of the best is the Continental Airlines case, which we require you to solve in a later exercise. Other recommended cases are mentioned earlier in this chapter. At TUN, if you click the Courses tab and select Data Warehousing, you will see links to many relevant articles, assignments, book chapters, course Web sites, PowerPoint presentations, projects, research reports, syllabi, and Web seminars. You will also find links to active data warehousing software demonstrations. Finally, you will see links to Teradata (teradata.com), where you can find additional information, including excellent data warehousing success stories, white papers, Web-based courses, and the online version of Teradata Magazine.
Chapter Highlights
• A data warehouse is a specially constructed data repository where data are organized so that they can be easily accessed by end users for several applications.
• Data marts contain data on one topic (e.g., marketing). A data mart can be a replication of a subset of data in the data warehouse. Data marts are a less expensive solution that can be replaced by or can supplement a data warehouse. Data marts can be independent of or dependent on a data warehouse.
• An ODS is a type of customer-information-file database that is often used as a staging area for a data warehouse.
• Data integration comprises three major processes: data access, data federation, and change capture. When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments.
• ETL technologies pull data from many sources, cleanse them, and load them into a data warehouse. ETL is an integral process in any data-centric project.
• Real-time or active data warehousing supplements and expands traditional data warehousing, moving into the realm of operational and tactical decision making by loading data in real time and providing data to users for active decision making.
• The security and privacy of data and information are critical issues for a data warehouse professional.
Key Terms
active data warehousing (ADW); cube; data integration; data mart; data warehouse (DW); data warehouse administrator (DWA); dependent data mart; dimensional modeling; dimension table; drill down; enterprise application integration (EAI); enterprise data warehouse (EDW); enterprise information integration (EII); extraction, transformation, and load (ETL); independent data mart; metadata; OLTP; oper mart; operational data store (ODS); real-time data warehousing (RDW); snowflake schema; star schema
Questions for Discussion
1. Compare data integration and ETL. How are they related?
2. What is a data warehouse, and what are its benefits? Why is Web accessibility important with a data warehouse?
3. A data mart can replace a data warehouse or complement it. Compare and discuss these options.
4. Discuss the major drivers and benefits of data warehousing to end users.
5. List the differences and/or similarities between the roles of a database administrator and a data warehouse administrator.
6. Describe how data integration can lead to higher levels of data quality.
7. Compare the Kimball and Inmon approaches toward data warehouse development. Identify when each one is most effective.
8. Discuss security concerns involved in building a data warehouse.
Exercises
Teradata University and Other Hands-On Exercises
1. Consider the case describing the development and application of a data warehouse for Coca-Cola Japan (a summary appears in Application Case 3.4), available at the DSS Resources Web site, http://dssresources.com/cases/coca-colajapan/. Read the case and answer the nine questions for further analysis and discussion.
2. Read the Ball (2005) article and rank-order the criteria (ideally for a real organization). In a report, explain how important each criterion is and why.
3. Explain when you should implement a two- or three-tiered architecture when considering developing a data warehouse.
4. Read the full Continental Airlines case (summarized in the End-of-Chapter Application Case) at teradatauniversitynetwork.com and answer the questions.
5. At teradatauniversitynetwork.com, read and answer the questions to the case "Harrah's High Payoff from Customer Information." Relate Harrah's results to how airlines and other casinos use their customer data.
6. At teradatauniversitynetwork.com, read and answer the questions of the assignment "Data Warehousing Failures." Because eight cases are described in that assignment, the class may be divided into eight groups, with one case assigned per group. In addition, read Ariyachandra and Watson (2006a), and for each case identify how the failure occurred as related to not focusing on one or more of the reference's success factor(s).
7. At teradatauniversitynetwork.com, read and answer the questions with the assignment "Ad-Vent Technology: Using the MicroStrategy Sales Analytic Model." The MicroStrategy software is accessible from the TUN site. Also, you might want to use Barbara Wixom's PowerPoint presentation about the MicroStrategy software ("Demo Slides for MicroStrategy Tutorial Script"), which is also available at the TUN site.
8. At teradatauniversitynetwork.com, watch the Web seminars titled "Real-Time Data Warehousing: The Next Generation of Decision Support Data Management" and "Building the Real-Time Enterprise." Read the article "Teradata's Real-Time Enterprise Reference Architecture: A Blueprint for the Future of IT," also available at this site. Describe how real-time concepts and technologies work and how they can be used to extend existing data warehousing and BI architectures to support day-to-day decision making. Write a report indicating how real-time data warehousing is specifically providing competitive advantage for organizations. Describe in detail the difficulties in such implementations and operations and describe how they are being addressed in practice.
9. Investigate current data warehouse development implementation through offshoring. Write a report about it. In class, debate the issue in terms of the benefits and costs, as well as social factors.
10. At teradatauniversitynetwork.com, watch the Web seminars "Data Integration Renaissance: New Drivers and Emerging Approaches," "In Search of a Single Version of the Truth: Strategies for Consolidating Analytic Silos," and "Data Integration: Using ETL, EAI, and EII Tools to Create an Integrated Enterprise." Also read the "Data Integration" research report. Compare and contrast the presentations. What is the most important issue described in these seminars? What is the best way to handle the strategies and challenges of consolidating data marts and spreadsheets into a unified data warehousing architecture? Perform a Web search to identify the latest developments in the field. Compare the presentations to the material in the text and the new material that you found.
11. Consider the future of data warehousing. Perform a Web search on this topic. Also, read these two articles: L. Agosta, "Data Warehousing in a Flat World: Trends for 2006," DM Direct Newsletter, March 31, 2006; and J. G. Geiger, "CIFe: Evolving with the Times," DM Review, November 2005, pp. 38-41. Compare and contrast your findings.
12. Access teradatauniversitynetwork.com. Identify the latest articles, research reports, and cases on data warehousing. Describe recent developments in the field. Include in your report how data warehousing is used in BI and DSS.
Team Assignments and Role-Playing Projects
1. Kathryn Avery has been a DBA with a nationwide retail chain (Big Chain) for the past 6 years. She has recently been asked to lead the development of Big Chain's first data warehouse. The project has the sponsorship of senior management and the CIO. The rationale for developing the data warehouse is to advance the reporting systems, particularly in sales and marketing, and, in the longer term, to improve Big Chain's CRM. Kathryn has been to a Data Warehousing Institute conference and has been doing some reading, but she is still mystified about development methodologies. She knows there are two groups, EDW (Inmon) and architected data marts (Kimball), that have robust features.
Initially, she believed that the two methodologies were extremely dissimilar, but as she has examined them more carefully, she isn't so certain. Kathryn has a number of questions that she would like answered:
a. What are the real differences between the methodologies?
b. What factors are important in selecting a particular methodology?
c. What should be her next steps in thinking about a methodology?
Help Kathryn answer these questions. (This exercise was adapted from K. Duncan, L. Reeves, and J. Griffin, "BI Experts' Perspective," Business Intelligence Journal, Vol. 8, No. 4, Fall 2003, pp. 14-19.)
2. Jeet Kumar is the administrator of data warehousing at a big regional bank. He was appointed 5 years ago to implement a data warehouse to support the bank's CRM business strategy. Using the data warehouse, the bank has been successful in integrating customer information, understanding customer profitability, attracting customers, enhancing customer relationships, and retaining customers.
Over the years, the bank's data warehouse has moved closer to real time by moving to more frequent refreshes of the data warehouse. Now, the bank wants to implement customer self-service and call center applications that require even fresher data than is currently available in the warehouse.
Jeet wants some support in considering the possibilities for presenting fresher data. One alternative is to entirely commit to implementing real-time data warehousing. His ETL vendor is prepared to assist him in making this change. Nevertheless, Jeet has been informed about EAI and EII technologies and wonders how they might fit into his plans.
In particular, he has the following questions:
a. What exactly are EAI and EII technologies?
b. How are EAI and EII related to ETL?
c. How are EAI and EII related to real-time data warehousing?
d. Are EAI and EII required, complementary, or alternatives to real-time data warehousing?
Help Jeet answer these questions. (This exercise was adapted from S. Brobst, E. Levy, and C. Muzilla, "Enterprise Application Integration and Enterprise Information Integration," Business Intelligence Journal, Vol. 10, No. 2, Spring 2005, pp. 27-33.)
3. Interview administrators in your college or executives in your organization to determine how data warehousing could assist them in their work. Write a proposal describing your findings. Include cost estimates and benefits in your report.
4. Go through the list of data warehousing risks described in this chapter and find two examples of each in practice.
5. Access teradata.com and read the white papers "Measuring Data Warehouse ROI" and "Realizing ROI: Projecting and Harvesting the Business Value of an Enterprise Data Warehouse." Also, watch the Web-based course "The ROI Factor: How Leading Practitioners Deal with the Tough Issue of Measuring DW ROI." Describe the most important issues described in them. Compare these issues to the success factors described in Ariyachandra and Watson (2006a).
6. Read the article by K. Liddell Avery and Hugh J. Watson, "Training Data Warehouse End Users," Business Intelligence Journal, Vol. 9, No. 4, Fall 2004, pp. 40-51 (which is available at teradatauniversitynetwork.com). Consider the different classes of end users, describe their difficulties, and discuss the benefits of appropriate training for each group. Have each member of the group take on one of the roles and have a discussion about how an appropriate type of data warehousing training would be good for each of you.
Internet Exercises
1. Search the Internet to find information about data warehousing. Identify some newsgroups that have an interest in this concept. Explore ABI/INFORM in your library, e-library, and Google for recent articles on the topic. Begin with tdwi.org, technologyevaluation.com, and the major vendors: teradata.com, sas.com, oracle.com, and ncr.com. Also check do.com, information-management.com, dssresources.com, and db2mag.com.
2. Survey some ETL tools and vendors. Start with fairisaac.com and egain.com. Also consult information-management.com.
3. Contact some data warehouse vendors and obtain information about their products. Give special attention to vendors that provide tools for multiple purposes, such as Cognos, Software AG, SAS Institute, and Oracle. Free online demos are available from some of these vendors. Download a demo or two and try them. Write a report describing your experience.
4. Explore teradata.com for developments and success stories about data warehousing. Write a report about what you have discovered.
5. Explore teradata.com for white papers and Web-based courses on data warehousing. Read the former and watch the latter. (Divide the class so that all the sources are covered.) Write what you have discovered in a report.
6. Find recent cases of successful data warehousing applications. Go to data warehouse vendors' sites and look for cases or success stories. Select one and write a brief summary to present to your class.
End-of-Chapter Application Case
Continental Airlines Flies High with Its Real-Time Data Warehouse
As business intelligence (BI) becomes a critical component of daily operations, real-time data warehouses that provide end users with rapid updates and alerts generated from transactional systems are increasingly being deployed. Real-time data warehousing and BI, supporting its aggressive Go Forward business plan, have helped Continental Airlines alter its industry status from "worst to first" and then from "first to favorite." Continental Airlines (now a part of United Airlines) is a leader in real-time DW and BI. In 2004, Continental won the Data Warehousing Institute's Best Practices and Leadership Award. Even though it has been a while since Continental Airlines deployed its hugely successful real-time DW and BI infrastructure, it is still regarded as one of the best examples and a seminal success story for real-time active data warehousing.
Problem(s)
Continental Airlines was founded in 1934, with a single-engine Lockheed aircraft in the southwestern United States. As of 2006, Continental was the fifth largest airline in the United States and the seventh largest in the world. Continental had the broadest global route network of any U.S. airline, with more than 2,300 daily departures to more than 227 destinations.
Back in 1994, Continental was in deep financial trouble. It had filed for Chapter 11 bankruptcy protection twice and was heading for its third, and probably final, bankruptcy. Ticket sales were hurting because performance on factors that are important to customers was dismal, including a low percentage of on-time departures, frequent baggage arrival problems, and too many customers turned away due to overbooking.
Solution
The revival of Continental began in 1994, when Gordon Bethune became CEO and initiated the Go Forward plan, which consisted of four interrelated parts to be implemented simultaneously. Bethune targeted the need to improve customer-valued performance measures by better understanding customer needs as well as customer perceptions of the value of services that were and could be offered. Financial management practices were also targeted for a significant overhaul. As late as 1998, the airline had separate databases for marketing and operations, all hosted and managed by outside vendors. Processing queries and instigating marketing programs for its high-value customers were time-consuming and ineffective. In addition, information that the workforce needed to make quick decisions was simply not available. In 1999, Continental chose to integrate its marketing, IT, revenue, and operational data sources into a single, in-house EDW. The data warehouse provided a variety of early, major benefits.
As soon as Continental returned to profitability and ranked first in the airline industry in many performance metrics, Bethune and his management team raised the bar by escalating the vision. Instead of just performing best, they wanted Continental to be their customers' favorite airline. The Go Forward plan established more actionable ways to move from first to favorite among customers. Technology became increasingly critical for supporting these new initiatives. In the early days, having access to historical, integrated information was sufficient. This produced substantial strategic value. But it became increasingly imperative for the data warehouse to provide real-time, actionable information to support enterprise-wide tactical decision making and business processes.
Luckily, the warehouse team had expected and arranged for the real-time shift. From the very beginning, the team had created an architecture to handle real-time data feeds into the warehouse, extracts of data from legacy systems into the warehouse, and tactical queries to the warehouse that required almost immediate response times. In 2001, real-time data became available from the warehouse, and the amount stored grew rapidly. Continental moves real-time data (ranging from to-the-minute to hourly) about customers, reservations, check-ins, operations, and flights from its main operational systems to the warehouse. Continental's real-time applications include the following:
• Revenue management and accounting
• Customer relationship management (CRM)
• Crew operations and payroll
• Security and fraud
• Flight operations
Results
In the first year alone after the data warehouse project was deployed, Continental identified and eliminated over $7 million in fraud and reduced costs by $41 million. With a $30 million investment in hardware and software over 6 years, Continental has reached over $500 million in increased revenues and cost savings in marketing, fraud detection, demand forecasting and tracking, and improved data center management. The single, integrated, trusted view of the business (i.e., the single version of the truth) has led to better, faster decision making.
Because of its tremendous success, Continental's DW implementation has been recognized as an excellent example of real-time BI, based on its scalable and extensible architecture, practical decisions on what data are captured in real time, strong relationships with end users, a small and highly competent data warehouse staff, sensible weighing of strategic and tactical decision support requirements, understanding of the synergies between decision support and operations, and changed business processes that use real-time data.
QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE
1. Describe the benefits of implementing the Continental Go Forward strategy.
2. Explain why it is important for an airline to use a real-time data warehouse.
3. Identify the major differences between the traditional data warehouse and a real-time data warehouse, as was implemented at Continental.
4. What strategic advantage can Continental derive from the real-time system as opposed to a traditional information system?
Sources: Adapted from H. Watson, B. Wixom, J. Hoffer, R. Anderson-Lehman, and A. Reynolds, "Real-Time Business Intelligence: Best Practices at Continental Airlines," Information Systems Management Journal, Winter 2006, pp. 7-18; R. Anderson-Lehman, H. Watson, B. Wixom, and J. Hoffer, "Continental Airlines Flies High with Real-Time Business Intelligence," MIS Quarterly Executive, Vol. 3, No. 4, December 2004, pp. 163-176 (available at teradatauniversitynetwork.com); H. Watson, "Real Time: The Next Generation of Decision-Support Data Management," Business Intelligence Journal, Vol. 10, No. 3, 2005, pp. 4-6; M. Edwards, "2003 Best Practices Awards Winners: Innovators in Business Intelligence and Data Warehousing," Business Intelligence Journal, Fall 2003, pp. 57-64; R. Westervelt, "Continental Airlines Builds Real-Time Data Warehouse," August 20, 2003, searchoracle.techtarget.com; R. Clayton, "Enterprise Business Performance Management: Business Intelligence + Data Warehouse = Optimal Business Performance," Teradata Magazine, September 2005; and The Data Warehousing Institute, "2003 Best Practices Summaries: Enterprise Data Warehouse," 2003.
References
Adamson, C. (2009). The Star Schema Handbook: The Complete Reference to Dimensional Data Warehouse Design. Hoboken, NJ: Wiley.
Adelman, S., and L. Moss. (2001, Winter). "Data Warehouse Risks." Journal of Data Warehousing, Vol. 6, No. 1.
Agosta, L. (2006, January). "The Data Strategy Adviser: The Year Ahead-Data Warehousing Trends 2006." DM Review, Vol. 16, No. 1.
Akbay, S. (2006, Quarter 1). "Data Warehousing in Real Time." Business Intelligence Journal, Vol. 11, No. 1.
Ambeo. (2005, July). "Ambeo Delivers Proven Data Access Auditing Solution." Database Trends and Applications, Vol. 19, No. 7.
Anthes, G. H. (2003, June 30). "Hilton Checks into New Suite." Computerworld, Vol. 37, No. 26.
Ariyachandra, T., and H. Watson. (2005). "Key Factors in Selecting a Data Warehouse Architecture." Business Intelligence Journal, Vol. 10, No. 3.
Ariyachandra, T., and H. Watson. (2006a, January). "Benchmarks for BI and Data Warehousing Success." DM Review, Vol. 16, No. 1.
Ariyachandra, T., and H. Watson. (2006b). "Which Data Warehouse Architecture Is Most Successful?" Business Intelligence Journal, Vol. 11, No. 1.
Armstrong, R. (2000, Quarter 3). "E-nalysis for the E-business." Teradata Magazine Online, teradata.com.
Ball, S. K. (2005, November 14). "Do You Need a Data Warehouse Layer in Your Business Intelligence Architecture?" datawarehouse.ittoolbox.com/documents/industry-articles/do-you-need-a-data-warehouse-layer-in-your-business-intelligencearchitecture-2729 (accessed June 2009).
Barquin, R., A. Faller, and H. Edelstein. (1997). "Ten Mistakes to Avoid for Data Warehousing Managers." In R. Barquin and H. Edelstein (eds.), Building, Using, and Managing the Data Warehouse. Upper Saddle River, NJ: Prentice Hall.
Basu, R. (2003, November). "Challenges of Real-Time Data Warehousing." DM Review.
Bell, L. D. (2001, Spring). "MetaBusiness Meta Data for the Masses: Administering Knowledge Sharing for Your Data Warehouse." Journal of Data Warehousing, Vol. 6, No. 3.
Intelligence,” MIS Quarterly Executive, Vol. 3, No. 4, Decem ber
2004, pp. 163-176 (ava ilable at teradatauniversitynetwork.com);
H . Watson, “Real Time: The Next Gene ratio n of Decision-Su pport
Data Ma nageme nt,” Business Intelligence Journal, Vol. 10, No. 3,
2005, pp. 4-6; M. Edwa rds , “2003 Best Practices Awards Winners:
Innovators in Business Inte lligence a nd Data Wa re ho us ing ,” Business
IntelligenceJournal, Fall 2003, pp. 57-64; R. Westervelt, “Contine ntal
Airlines Builds Rea l-Time Data Warehouse,” August 20, 2003,
searchoracle.techtarget.com; R. Clayton , “Ente rprise Business
Performance Manageme nt: Business Intelligence + Data Warehouse
= Optimal Business Performance,” Teradata Magazine, Septe mber
2005, a nd The Data Warehousing Institute , “2003 Best Practices
Summaries: Enterprise Data Warehouse,” 2003.
Benander, A., B. Benander, A. Fadlalla, and G. James. (2000, Winter). "Data Warehouse Administration and Management." Information Systems Management, Vol. 17, No. 1.
Bonde, A., and M. Kuckuk. (2004, April). "Real World Business Intelligence: The Implementation Perspective." DM Review, Vol. 14, No. 4.
Breslin, M. (2004, Winter). "Data Warehousing Battle of the Giants: Comparing the Basics of Kimball and Inmon Models." Business Intelligence Journal, Vol. 9, No. 1.
Brobst, S., E. Levy, and C. Muzilla. (2005, Spring). "Enterprise Application Integration and Enterprise Information Integration." Business Intelligence Journal, Vol. 10, No. 3.
Brody, R. (2003, Summer). "Information Ethics in the Design and Use of Metadata." IEEE Technology and Society Magazine, Vol. 22, No. 3.
Brown, M. (2004, May 9-12). "8 Characteristics of a Successful Data Warehouse." Proceedings of the Twenty-Ninth Annual SAS Users Group International Conference (SUGI 29). Montreal, Canada.
Burdett, J., and S. Singh. (2004). "Challenges and Lessons Learned from Real-Time Data Warehousing." Business Intelligence Journal, Vol. 9, No. 4.
Coffee, P. (2003, June 23). "'Active' Warehousing." eWeek, Vol. 20, No. 25.
Cooper, B. L., H. J. Watson, B. H. Wixom, and D. L. Goodhue. (1999, August 15-19). "Data Warehousing Supports Corporate Strategy at First American Corporation." SIM International Conference, Atlanta.
Cooper, B. L., H. J. Watson, B. H. Wixom, and D. L. Goodhue. (2000). "Data Warehousing Supports Corporate Strategy at First American Corporation." MIS Quarterly, Vol. 24, No. 4, pp. 547-567.
Dasu, T., and T. Johnson. (2003). Exploratory Data Mining and Data Cleaning. New York: Wiley.
Davison, D. (2003, November 14). "Top 10 Risks of Offshore Outsourcing." META Group Research Report, now Gartner, Inc., Stamford, CT.
Devlin, B. (2003, Quarter 2). "Solving the Data Warehouse Puzzle." DB2 Magazine.
Dragoon, A. (2003, July 1). "All for One View." CIO.
Eckerson, W. (2003, Fall). "The Evolution of ETL." Business Intelligence Journal, Vol. 8, No. 4.
Eckerson, W. (2005, April 1). "Data Warehouse Builders Advocate for Different Architectures." Application Development Trends.
Eckerson, W., R. Hackathorn, M. McGivern, C. Twogood, and G. Watson. (2009). "Data Warehousing Appliances." Business Intelligence Journal, Vol. 14, No. 1, pp. 40-48.
Edwards, M. (2003, Fall). "2003 Best Practices Awards Winners: Innovators in Business Intelligence and Data Warehousing." Business Intelligence Journal, Vol. 8, No. 4.
"Egg's Customer Data Warehouse Hits the Mark." (2005, October). DM Review, Vol. 15, No. 10, pp. 24-28.
Elson, R., and R. Leclerc. (2005). "Security and Privacy Concerns in the Data Warehouse Environment." Business Intelligence Journal, Vol. 10, No. 3.
Ericson, J. (2006, March). "Real-Time Realities." BI Review.
Furtado, P. (2009). "A Survey of Parallel and Distributed Data Warehouses." International Journal of Data Warehousing and Mining, Vol. 5, No. 2, pp. 57-78.
Golfarelli, M., and S. Rizzi. (2009). Data Warehouse Design: Modern Principles and Methodologies. San Francisco: McGraw-Hill Osborne Media.
Gonzales, M. (2005, Quarter 1). "Active Data Warehouses Are Just One Approach for Combining Strategic and Technical Data." DB2 Magazine.
Hall, M. (2002, April 15). "Seeding for Data Growth." Computerworld, Vol. 36, No. 16.
Hammergren, T. C., and A. R. Simon. (2009). Data Warehousing for Dummies, 2nd ed. Hoboken, NJ: Wiley.
Hicks, M. (2001, November 26). "Getting Pricing Just Right." eWeek, Vol. 18, No. 46.
Hoffer, J. A., M. B. Prescott, and F. R. McFadden. (2007). Modern Database Management, 8th ed. Upper Saddle River, NJ: Prentice Hall.
Hwang, M., and H. Xu. (2005, Fall). "A Survey of Data Warehousing Success Issues." Business Intelligence Journal, Vol. 10, No. 4.
IBM. (2009). 50 TB Data Warehouse Benchmark on IBM System z. Armonk, NY: IBM Redbooks.
Imhoff, C. (2001, May). "Power Up Your Enterprise Portal." E-Business Advice.
Inmon, W. H. (2005). Building the Data Warehouse, 4th ed. New York: Wiley.
Inmon, W. H. (2006, January). "Information Management: How Do You Tune a Data Warehouse?" DM Review, Vol. 16, No. 1.
Jukic, N., and C. Lang. (2004, Summer). "Using Offshore Resources to Develop and Support Data Warehousing Applications." Business Intelligence Journal, Vol. 9, No. 3.
Kalido. "BP Lubricants Achieves BIGS Success." kalido.com/collateral/Documents/English-US/CS-BP%20BIGS.pdf (accessed August 2009).
Karacsony, K. (2006, January). "ETL Is a Symptom of the Problem, not the Solution." DM Review, Vol. 16, No. 1.
Kassam, S. (2002, April 16). "Freedom of Information." Intelligent Enterprise, Vol. 5, No. 7.
Kay, R. (2005, September 19). "EII." Computerworld, Vol. 39, No. 38.
Kelly, C. (2001, June 14). "Calculating Data Warehousing ROI." SearchSQLServer.com Tips.
Malykhina, E. (2003, January 3). "The Real-Time Imperative." InformationWeek, Issue 1020.
Manglik, A., and V. Mehra. (2005, Winter). "Extending Enterprise BI Capabilities: New Patterns for Data Integration." Business Intelligence Journal, Vol. 10, No. 1.
Martins, C. (2005, December 13). "HP to Consolidate Data Marts into Single Warehouse." Computerworld.
Matney, D. (2003, Spring). "End-User Support Strategy." Business Intelligence Journal, Vol. 8, No. 3.
McCloskey, D. W. (2002). Choosing Vendors and Products to Maximize Data Warehousing Success. New York: Auerbach Publications.
Mehra, V. (2005, Summer). "Building a Metadata-Driven Enterprise: A Holistic Approach." Business Intelligence Journal, Vol. 10, No. 3.
Moseley, M. (2009). "Eliminating Data Warehouse Pressures with Master Data Services and SOA." Business Intelligence Journal, Vol. 14, No. 2, pp. 33-43.
Murtaza, A. (1998, Fall). "A Framework for Developing Enterprise Data Warehouses." Information Systems Management, Vol. 15, No. 4.
Nash, K. S. (2002, July). "Chemical Reaction." Baseline.
Orovic, V. (2003, June). "To Do & Not to Do." eAI Journal.
Parzinger, M. J., and M. N. Fralick. (2001, July). "Creating Competitive Advantage Through Data Warehousing." Information Strategy, Vol. 17, No. 4.
Peterson, T. (2003, April 21). "Getting Real About Real Time." Computerworld, Vol. 37, No. 16.
Raden, N. (2003, June 30). "Real Time: Get Real, Part II." Intelligent Enterprise.
Reeves, L. (2009). Manager's Guide to Data Warehousing. Hoboken, NJ: Wiley.
Romero, O., and A. Abelló. (2009). "A Survey of Multidimensional Modeling Methodologies." International Journal of Data Warehousing and Mining, Vol. 5, No. 2, pp. 1-24.
Rosenberg, A. (2006, Quarter 1). "Improving Query Performance in Data Warehouses." Business Intelligence Journal, Vol. 11, No. 1.
Russom, P. (2009). Next Generation Data Warehouse Platforms. TDWI Best Practices Report, available at www.tdwi.org (accessed January 2010).
Sammon, D., and P. Finnegan. (2000, Fall). "The Ten Commandments of Data Warehousing." Database for Advances in Information Systems, Vol. 31, No. 4.
Sapir, D. (2005, May). "Data Integration: A Tutorial." DM Review, Vol. 15, No. 5.
Saunders, T. (2009). "Cooking up a Data Warehouse." Business Intelligence Journal, Vol. 14, No. 2, pp. 16-23.
Schwartz, K. D. "Decisions at the Touch of a Button." Teradata Magazine (accessed June 2009).
Schwartz, K. D. (2004, March). "Decisions at the Touch of a Button." DSS Resources, pp. 28-31. dssresources.com/cases/coca-colajapan/index.html (accessed April 2006).
Sen, A. (2004, April). "Metadata Management: Past, Present and Future." Decision Support Systems, Vol. 37, No. 1.
Sen, A., and P. Sinha. (2005). "A Comparison of Data Warehousing Methodologies." Communications of the ACM, Vol. 48, No. 3.
Solomon, M. (2005, Winter). "Ensuring a Successful Data Warehouse Initiative." Information Systems Management Journal.
Songini, M. L. (2004, February 2). "ETL Quickstudy." Computerworld, Vol. 38, No. 5.
Sun Microsystems. (2005, September 19). "Egg Banks on Sun to Hit the Mark with Customers." sun.com/smi/Press/sunflash/2005-09/sunflash.20050919.1.xml (accessed April 2006; no longer available online).
Tannenbaum, A. (2002, Spring). "Identifying Meta Data Requirements." Journal of Data Warehousing, Vol. 7, No. 3.
Tennant, R. (2002, May 15). "The Importance of Being Granular." Library Journal, Vol. 127, No. 9.
Teradata Corp. "A Large US-Based Insurance Company Masters Its Finance Data." (accessed July 2009).
Teradata Corp. "Active Data Warehousing." teradata.com/active-data-warehousing/ (accessed April 2006).
Teradata Corp. "Coca-Cola Japan Puts the Fizz Back in Vending Machine Sales." (accessed June 2009).
Teradata. "Enterprise Data Warehouse Delivers Cost Savings and Process Efficiencies." teradata.com/t/resources/case-studies/NCR-Corporation-eb4455 (accessed June 2009).
Terr, S. (2004, February). "Real-Time Data Warehousing: Hardware and Software." DM Review, Vol. 14, No. 3.
Thornton, M. (2002, March 18). "What About Security? The Most Common, but Unwarranted, Objection to Hosted Data Warehouses." DM Review, Vol. 12, No. 3, pp. 30-43.
Thornton, M., and M. Lampa. (2002). "Hosted Data Warehouse." Journal of Data Warehousing, Vol. 7, No. 2, pp. 27-34.
Turban, E., D. Leidner, E. McLean, and J. Wetherbe. (2006). Information Technology for Management, 5th ed. New York: Wiley.
Vaduva, A., and T. Vetterli. (2001, September). "Metadata Management for Data Warehousing: An Overview." International Journal of Cooperative Information Systems, Vol. 10, No. 3.
Van den Hoven, J. (1998). "Data Marts: Plan Big, Build Small." Information Systems Management, Vol. 15, No. 1.
Watson, H. J. (2002). "Recent Developments in Data Warehousing." Communications of the ACM, Vol. 8, No. 1.
Watson, H. J., D. L. Goodhue, and B. H. Wixom. (2002). "The Benefits of Data Warehousing: Why Some Organizations Realize Exceptional Payoffs." Information & Management, Vol. 39.
Watson, H., J. Gerard, L. Gonzalez, M. Haywood, and D. Fenton. (1999). "Data Warehouse Failures: Case Studies and Findings." Journal of Data Warehousing, Vol. 4, No. 1.
Weir, R. (2002, Winter). "Best Practices for Implementing a Data Warehouse." Journal of Data Warehousing, Vol. 7, No. 1.
Wilk, L. (2003, Spring). "Data Warehousing and Real-Time Computing." Business Intelligence Journal, Vol. 8, No. 3.
Wixom, B., and H. Watson. (2001, March). "An Empirical Investigation of the Factors Affecting Data Warehousing Success." MIS Quarterly, Vol. 25, No. 1.
Wrembel, R. (2009). "A Survey of Managing the Evolution of Data Warehouses." International Journal of Data Warehousing and Mining, Vol. 5, No. 2, pp. 24-56.
ZD Net UK. "Sun Case Study: Egg's Customer Data Warehouse." whitepapers.zdnet.co.uk/0,39025945,60159401p-39000449q,00.htm (accessed June 2009).
Zhao, X. (2005, October 7). "Meta Data Management Maturity Model." DM Direct Newsletter.
CHAPTER 4
Business Reporting, Visual Analytics, and Business Performance Management
LEARNING OBJECTIVES
• Define business reporting and understand its historical evolution
• Recognize the need for and the power of business reporting
• Understand the importance of data/information visualization
• Learn different types of visualization techniques
• Appreciate the value that visual analytics brings to BI/BA
• Know the capabilities and limitations of dashboards
• Understand the nature of business performance management (BPM)
• Learn the closed-loop BPM methodology
• Describe the basic elements of the balanced scorecard
A report is a communication artifact prepared with the specific intention of relaying information in a presentable form. If it concerns business matters, then it is called a business report. Business reporting is an essential part of the business intelligence movement toward improving managerial decision making. Nowadays, these reports are more visually oriented, often using colors and graphical icons that collectively look like a dashboard to enhance the information content. Business reporting and business performance management (BPM) are both enablers of business intelligence and analytics. As a decision support tool, BPM is more than just a reporting technology. It is an integrated set of processes, methodologies, metrics, and applications designed to drive the overall financial and operational performance of an enterprise. It helps enterprises translate their strategies and objectives into plans, monitor performance against those plans, analyze variations between actual and planned results, and adjust their objectives and actions in response to this analysis.
This chapter starts by examining the need for and the power of business reporting. With the emergence of analytics, business reporting evolved into dashboards and visual analytics, which, compared to traditional descriptive reporting, are much more predictive and prescriptive. Coverage of dashboards and visual analytics is followed by a comprehensive introduction to BPM. As you will see and appreciate, BPM and visual analytics have a symbiotic relationship (over scorecards and dashboards) in which they benefit from each other's strengths.
4.1 Opening Vignette: Self-Service Reporting Environment Saves Millions for Corporate Customers
4.2 Business Reporting Definitions and Concepts
4.3 Data and Information Visualization
4.4 Different Types of Charts and Graphs
4.5 The Emergence of Data Visualization and Visual Analytics
4.6 Performance Dashboards
4.7 Business Performance Management
4.8 Performance Measurement
4.9 Balanced Scorecards
4.10 Six Sigma as a Performance Measurement System
4.1 OPENING VIGNETTE: Self-Service Reporting Environment Saves Millions for Corporate Customers
Headquartered in Omaha, Nebraska, Travel and Transport, Inc., is the sixth largest travel management company in the United States, with more than 700 employee-owners located nationwide. The company has extensive experience in multiple verticals, including travel management, loyalty solutions programs, meeting and incentive planning, and leisure travel services.
CHALLENGE
In the field of employee travel services, the ability to effectively communicate a value proposition to existing and potential customers is critical to winning and retaining business. With travel arrangements often made on an ad hoc basis, customers find it difficult to analyze costs or instate optimal purchase agreements. Travel and Transport wanted to overcome these challenges by implementing an integrated reporting and analysis system to enhance relationships with existing clients while providing the kind of value-added services that would attract new prospects.
SOLUTION
Travel and Transport implemented Information Builders' WebFOCUS business intelligence (BI) platform (called eTTek Review) as the foundation of a dynamic customer self-service BI environment. This dashboard-driven expense-management application helps more than 800 external clients, such as Robert W. Baird & Co., MetLife, and American Family Insurance, to plan, track, analyze, and budget their travel expenses more efficiently and to benchmark them against similar companies, saving them millions of dollars. More than 200 internal employees, including customer service specialists, also have access to the system, using it to generate more precise forecasts for clients and to streamline and accelerate other key support processes such as quarterly reviews.
Thanks to WebFOCUS, Travel and Transport doesn't just tell its clients how much they are saving by using its services; it shows them. This has helped the company differentiate itself in a market defined by aggressive competition. Additionally, WebFOCUS eliminates manual report compilation for client service specialists, saving the company close to $200,000 in lost time each year.
AN INTUITIVE, GRAPHICAL WAY TO MANAGE TRAVEL DATA
Using stunning graphics created with WebFOCUS and Adobe Flex, the business intelligence system provides access to thousands of reports that show individual client metrics, benchmarked information against aggregated market data, and even ad hoc reports that users can specify as needed. "For most of our corporate customers, we thoroughly manage their travel from planning and reservations to billing, fulfillment, and ongoing analysis," says Mike Kubasik, senior vice president and CIO at Travel and Transport. "WebFOCUS is important to our business. It helps our customers monitor employee spending, book travel with preferred vendors, and negotiate corporate purchasing agreements that can save them millions of dollars per year."
Clients love it, and it's giving Travel and Transport a competitive edge in a crowded marketplace. "I use Travel and Transport's eTTek Review to automatically e-mail reports throughout the company for a variety of reasons, such as monitoring travel trends and company expenditures and assisting with airline expense reconciliation and allocations," says Cathy Moulton, vice president and travel manager at Robert W. Baird & Co., a prominent financial services company. What she loves about the WebFOCUS-enabled Web portal is that it makes all of the company's travel information available in just a few clicks. "I have the data at my fingertips," she adds. "I don't have to wait for someone to go in and do it for me. I can set up the reports on my own. Then we can go to the hotels and preferred vendors armed with detailed information that gives us leverage to negotiate our rates."
Robert W. Baird & Co. isn't the only firm benefiting from this advanced access to reporting. Many of Travel and Transport's other clients are also happy with the technology. "With Travel and Transport's state-of-the-art reporting technology, MetLife is able to measure its travel program through data analysis, standard reporting, and the ability to create ad hoc reports dynamically," says Tom Molesky, director of travel services at MetLife. "Metrics derived from actionable data provide direction and drive us toward our goals. This is key to helping us negotiate with our suppliers, enforce our travel policy, and save our company money. Travel and Transport's leading-edge product has helped us to meet and, in some cases, exceed our travel goals."
READY FOR TAKEOFF
Travel and Transport used WebFOCUS to create an online system that allows clients to access information directly, so they won't have to rely on the IT department to run reports for them. Its objective was to give customers online tools to monitor corporate travel expenditures throughout their companies. By giving clients access to the right data, Travel and Transport can help make sure its customers are getting the best pricing from airlines, hotels, car rental companies, and other vendors. "We needed more than just pretty reports," Kubasik recalls, looking back on the early phases of the BI project. "We wanted to build a reporting environment that was powerful enough to handle transaction-intensive operations, yet simple enough to deploy over the Web." It was a winning formula. Clients and customer service specialists continue to use eTTek Review to create forecasts for the coming year and to target specific areas of business travel expenditures. These users can choose from dozens of management reports. Popular reports include travel summary, airline compliance, hotel analysis, and car analysis. Travel managers at about 700 corporations use these reports to analyze corporate travel spending on a daily, weekly, monthly, quarterly, and annual basis. About 160 standard reports and more than 3,000 custom reports are currently set up in eTTek Review, including everything from noncompliance reports that reveal why an employee did not obtain the lowest airfare for a particular flight to executive overviews that summarize spending patterns. Most reports are parameter driven with Information Builders' unique guided ad hoc reporting technology.
PEER REVIEW SYSTEM KEEPS EXPENSES ON TRACK
Users can also run reports that compare their own travel metrics with aggregated travel data from other Travel and Transport clients. This benchmarking service lets them gauge whether their expenditures, preferred rates, and other metrics are in line with those of other companies of a similar size or within the same industry. By pooling the data, Travel and Transport helps protect individual clients' information while also enabling its entire customer base to achieve lower rates by giving them leverage for their negotiations.
Reports can be run interactively or in batch mode, with results displayed on the screen, stored in a library, saved to a PDF file, loaded into an Excel spreadsheet, or sent as an Active Report that permits additional analysis. "Our clients love the visual metaphors provided by Information Builders' graphical displays, including Adobe Flex and WebFOCUS Active PDF files," explains Steve Cords, IT manager at Travel and Transport and team leader for the eTTek Review project. "Most summary reports have drill-down capability to a detailed report. All reports can be run for a particular hierarchy structure, and more than one hierarchy can be selected."
Of course, users never see the code that makes all of this possible. They operate in an intuitive dashboard environment with drop-down menus and drillable graphs, all accessible through a browser-based interface that requires no client-side software. This architecture makes it easy and cost-effective for users to tap into eTTek Review from any location. Collectively, customers run an estimated 50,000 reports per month. About 20,000 of those reports are automatically generated and distributed via WebFOCUS ReportCaster.
AN EFFICIENT ARCHITECTURE THAT YIELDS SOARING RESULTS
Travel and Transport captures travel information from reservation systems known as Global Distribution Systems (GDS) via a proprietary back-office system that resides in a DB2 database on an IBM iSeries computer. They use SQL tables to store user IDs and passwords, and use other databases to store the information. "The database can be sorted according to a specific hierarchy to match the breakdown of reports required by each company," continues Cords. "If they want to see just marketing and accounting information, we can deliver it. If they want to see the particular level of detail reflecting a given cost center, we can deliver that, too."
Because all data is securely stored for three years, clients can generate trend reports to compare current travel to previous years. They can also use the BI system to monitor where employees are traveling at any point in time. The reports are so easy to use that Cords and his team have started replacing outdated processes with new automated ones using the same WebFOCUS technology. The company also uses WebFOCUS to streamline its quarterly review process. In the past, client service managers had to manually create these quarterly reports by aggregating data from a variety of clients. The 80-page report took one week to create at the end of every quarter.
Travel and Transport has completely automated the quarterly review system using WebFOCUS so the managers can select the pages, percentages, and specific data they want to include. This gives them more time to do further analysis and make better use of the information. Cords estimates that the time savings add up to about $200,000 every year for this project alone. "Metrics derived from actionable data are key to helping us negotiate with our suppliers, enforce our travel policy, and save our company money," continues Cords. "During the recession, the travel industry was hit particularly hard, but Travel and Transport managed to add new multimillion-dollar accounts even in the worst of times. We attribute a lot of this growth to the cutting-edge reporting technology we offer to clients."
QUESTIONS FOR THE OPENING VIGNETTE
1. What does Travel and Transport, Inc., do?
2. Describe the complexity and the competitive nature of the business environment in which Travel and Transport, Inc., functions.
3. What were the main business challenges?
4. What was the solution? How was it implemented?
5. Why do you think a multi-vendor, multi-tool solution was implemented?
6. List and comment on at least three main benefits of the implemented system. Can you think of other potential benefits that are not mentioned in the case?
WHAT WE CAN LEARN FROM THIS VIGNETTE
Trying to survive (and thrive) in a highly competitive industry, Travel and Transport, Inc., was aware of the need to create and effectively communicate a value proposition to its existing and potential customers. As is the case in many industries, in the travel business, success or mere survival depends on continuously winning new customers while retaining the existing ones. The key was to provide value-added services so that clients can efficiently analyze costs and other options to quickly instate optimal purchase agreements. Using WebFOCUS (an integrated reporting and information visualization environment by Information Builders), Travel and Transport empowered its clients to access information whenever and wherever they need it. Information is the power that decision makers need the most to make better and faster decisions. When economic conditions are tight, every managerial decision, every business transaction, counts. Travel and Transport used a variety of reputable vendors/products (hardware and software) to create a cutting-edge reporting technology so that its clients can make better, faster decisions to improve their financial well-being.
Source: Information Builders, Customer Success Story, informationbuilders.com/applications/travel-and-transport (accessed February 2013).
4.2 BUSINESS REPORTING DEFINITIONS AND CONCEPTS
Decision makers need information to make accurate and timely decisions. Information is essentially the contextualization of data. Information is often provided in the form of a written report (digital or on paper), although it can also be provided orally. Simply put, a report is any communication artifact prepared with the specific intention of conveying information in a presentable form to whoever needs it, whenever and wherever they may need it. It is usually a document that contains information (usually derived from data and personal experiences) organized in a narrative, graphic, and/or tabular form, prepared periodically (recurring) or on an as-required (ad hoc) basis, referring to specific time periods, events, occurrences, or subjects.
In business settings, types of reports include memos, minutes, lab reports, sales reports, progress reports, justification reports, compliance reports, annual reports, and policies and procedures. Reports can fulfill many different (but often related) functions. Here are a few of the most prevalent ones:
• To ensure that all departments are functioning properly
• To provide information
• To provide the results of an analysis
• To persuade others to act
• To create an organizational memory (as part of a knowledge management system)
Reports can be lengthy at times. For such reports, there usually is an executive summary for those who do not have the time or interest to go through it all. The summary (or abstract, or, more commonly, executive brief) should be crafted carefully, expressing only the important points in a very concise and precise manner, and running no more than a page or two.
In addition to business reports, examples of other types of reports include crime scene reports, police reports, credit reports, scientific reports, recommendation reports, white papers, annual reports, auditor's reports, workplace reports, census reports, trip reports, progress reports, investigative reports, budget reports, policy reports, demographic reports, appraisal reports, inspection reports, and military reports, among others. In this chapter we are particularly interested in business reports.
What Is a Business Report?
A business report is a written document that contains information regarding business matters. Business reporting (also called enterprise reporting) is an essential part of the larger drive toward improved managerial decision making and organizational knowledge management. The foundation of these reports is various sources of data coming from both inside and outside the organization. Creation of these reports involves ETL (extract, transform, and load) procedures in coordination with a data warehouse and then the use of one or more reporting tools. While reports can be distributed in print form or via e-mail, they are typically accessed via a corporate intranet.
Due to the expansion of information technology, coupled with the need for improved competitiveness in businesses, there has been an increase in the use of computing power to produce unified reports that join different views of the enterprise in one place. Usually, this reporting process involves querying structured data sources, most of which are created by using different logical data models and data dictionaries, to produce a human-readable, easily digestible report. These types of business reports allow managers and coworkers to stay informed and involved, review options and alternatives, and make informed decisions. Figure 4.1 shows the continuous cycle of data acquisition → information generation → decision making → business process management. Perhaps the most critical task in this cyclic process is the reporting (i.e., information generation): converting data from different sources into actionable information.
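To make that data-to-information conversion concrete, here is a minimal, illustrative sketch (not from the text; the library choice, table contents, and column names are all assumptions) that joins two hypothetical data sources and aggregates them into a small, human-readable report:

```python
# A minimal sketch of converting data from different sources into actionable
# information: transactions are aggregated, then contextualized with targets.
# All names and figures are hypothetical.
import pandas as pd

sales = pd.DataFrame({"region": ["East", "West", "East"],
                      "amount": [1200.0, 950.0, 430.0]})    # source 1
targets = pd.DataFrame({"region": ["East", "West"],
                        "target": [1500.0, 1000.0]})        # source 2

# Aggregate the raw transactions, then join in the targets -- the
# contextualization step that turns data into information.
report = (sales.groupby("region", as_index=False)["amount"].sum()
               .merge(targets, on="region"))
report["pct_of_target"] = (report["amount"] / report["target"] * 100).round(1)

print(report.to_string(index=False))   # a simple, human-readable report
```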
The key to any successful report is clarity, brevity, completeness, and correctness. In terms of content and format, there are only a few categories of business report: informal, formal, and short. Informal reports are usually up to 10 pages long; are routine and internal; follow a letter or memo format; and use personal pronouns and contractions. Formal reports are 10 to 100 pages long; do not use personal pronouns or contractions; include a title page, table of contents, and an executive summary; are based on deep research or an analytic study; and are distributed to external or internal people with a need-to-know designation. Short reports inform people about events or system status changes and are often periodic, investigative, compliance, or situational in focus.
The nature of a report also changes significantly based on for whom it is created. Most of the research in effective reporting is dedicated to internal reports that inform stakeholders and decision makers within the organization. There are also external reports between businesses and the government (e.g., for tax purposes or for regular filings to the Securities and Exchange Commission). These formal reports are mostly standardized and periodically filed either nationally or internationally.
FIGURE 4.1 The Role of Information Reporting in Managerial Decision Making. (The figure shows the cycle: data from transactional records flows into data repositories, is converted into information through reporting, and drives action through decisions across business functions.)
Standard Business Reporting, a collection of international programs instigated by a number of governments, aims to reduce the regulatory burden for business by simplifying and standardizing reporting requirements. The idea is to make business the epicenter when it comes to managing business-to-government reporting obligations. Businesses conduct their own financial administration; the facts they record and the decisions they make should drive their reporting. The government should be able to receive and process this information without imposing undue constraints on how businesses administer their finances. Application Case 4.1 illustrates an excellent example of overcoming the challenges of financial reporting.
Application Case 4.1
Delta Lloyd Group Ensures Accuracy and Efficiency in Financial Reporting
Delta Lloyd Group is a financial services provider based in the Netherlands. It offers insurance, pensions, investing, and banking services to its private and corporate clients through its three strong brands: Delta Lloyd, OHRA, and ABN AMRO Insurance. Since its founding in 1807, the company has grown in the Netherlands, Germany, and Belgium, and now employs around 5,400 permanent staff. Its 2011 full-year financial reports show €5.5 billion in gross written premiums, with shareholders' funds amounting to €3.9 billion and investments under management worth nearly €74 billion.
Challenges
Since Delta Lloyd Group is publicly listed on the NYSE Euronext Amsterdam, it is obliged to produce annual and half-year reports. Various subsidiaries in Delta Lloyd Group must also produce reports to fulfill local legal requirements: for example, banking and insurance reports are obligatory in the Netherlands.
In addition, Delta Lloyd Group must provide reports
to meet international requirements, such as the
IFRS (International Financial Reporting Standards)
for accounting and the EU Solvency I Directive for
insurance companies. The data for these reports is
gathered by the group’s finance department, which
is divided into small teams in several locations, and
then converted into XML so that it can be published
on the corporate Web site.
Importance of Accuracy
The most challenging part of the reporting process is the "last mile": the stage at which the consolidated figures are cited, formatted, and described to form the final text of the report. Delta Lloyd Group was using Microsoft Excel for the last-mile stage of the reporting process. To minimize the risk of errors, the finance team needed to manually check all the data in its reports for accuracy. These manual checks were very time-consuming. Arnold Honig, team leader for reporting at Delta Lloyd Group, comments: "Accuracy is essential in financial reporting, since errors could lead to penalties, reputational damage, and even a negative impact on the company's stock price. We needed a new solution that would automate some of the last-mile processes and reduce the risk of manual error."
Solution
The group decided to implement IBM Cognos Financial Statement Reporting (FSR). The implementation of the software was completed in just 6 weeks during the late summer. This rapid implementation gave the finance department enough time to prepare a trial draft of the annual report in FSR, based on figures from the third financial quarter. The successful creation of this draft gave Delta Lloyd Group enough confidence to use Cognos FSR for the final version of the annual report, which was published shortly after the end of the year.
Results
Employees are delighted with the IBM Cognos FSR solution. Delta Lloyd Group has divided the annual report into chapters, and each member of the reporting team is responsible for one chapter. Arnold Honig says, "Since employees can work on documents simultaneously, they can share the huge workload involved in report generation. Before, the reporting process was inefficient, because only one person could work on the report at a time."
Since the workload can be divided up, staff can complete the report with less overtime. Arnold Honig comments, "Previously, employees were putting in 2 weeks of overtime during the 8 weeks required to generate a report. This year, the 10 members of staff involved in the report generation process worked 25 percent less overtime, even though they were still getting used to the new software. This is a big win for Delta Lloyd Group and its staff." The group is expecting further reductions in employee overtime in the future as staff becomes more familiar with the software.
Accurate Reports
The IBM Cognos FSR solution automates key stages in the report-writing process by populating the final report with accurate, up-to-date financial data. Wherever the text of the report needs to mention a specific financial figure, the finance team simply inserts a "variable," a tag that is linked to an underlying data source. Wherever the variable appears in the document, FSR will pull the figure through from the source into the report. If the value of the figure needs to be changed, the team can simply update it in the source, and the new value will automatically flow through into the text, maintaining accuracy and consistency of data throughout the report.
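The tag-and-pull mechanism described above can be pictured with a short sketch. This is illustrative Python, not the actual Cognos FSR product or its tag syntax; the figure names and values are hypothetical:

```python
# Illustrative sketch of a "variable" tag mechanism: every {tag} in the report
# text resolves against one underlying source, so updating the source updates
# each mention consistently. Names and values are hypothetical.
figures = {"gross_premiums_bn": 5.5, "shareholder_funds_bn": 3.9}

template = ("Gross written premiums were EUR {gross_premiums_bn} billion, "
            "with shareholders' funds of EUR {shareholder_funds_bn} billion.")

def render(template: str, source: dict) -> str:
    # str.format pulls each tagged figure from the single source of truth.
    return template.format(**source)

print(render(template, figures))
figures["gross_premiums_bn"] = 5.6   # change the figure in the source ...
print(render(template, figures))     # ... and every mention flows through
```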
Arnold Honig comments, "The ability to update figures automatically across the whole report reduces the scope for the manual error inherent in spreadsheet-based processes and activities. Since we have full control of our reporting processes, we can produce better quality reports more efficiently and reduce our business risk." IBM Cognos FSR also provides a comparison feature, which highlights any changes made to reports. This feature makes it quicker and easier for users to review new versions of documents and ensure the accuracy of their reports.
Adhering to Industry Regulations
In the future, Delta Lloyd Group is planning to extend its use of IBM Cognos FSR to generate internal management reports. It will also help Delta Lloyd Group to meet industry regulatory standards, which are becoming stricter. Arnold Honig comments, "The EU Solvency II Directive will come into effect soon, and our Solvency II reports will need to be tagged with eXtensible Business Reporting Language [XBRL]. By implementing IBM Cognos FSR, which fully supports XBRL tagging, we have equipped ourselves to meet both current and future regulatory requirements."
QUESTIONS FOR DISCUSSION
1. How did Delta Lloyd Group improve accuracy and efficiency in financial reporting?
2. What were the challenges, the proposed solution, and the obtained results?
3. Why is it important for Delta Lloyd Group to comply with industry regulations?
Source: IBM, Customer Success Story, "Delta Lloyd Group Ensures Accuracy in Financial Reporting," public.dhe.ibm.com/common/ssi/ecm/en/ytc03561nlen/YTC03561NLEN.PDF (accessed February 2013); and www.deltalloydgroep.com.
Even though there is a wide variety of business reports, the ones that are often used for managerial purposes can be grouped into three major categories (Hill, 2013).
METRIC MANAGEMENT REPORTS In many organizations, business performance is managed through outcome-oriented metrics. For external groups, these are service-level agreements (SLAs). For internal management, they are key performance indicators (KPIs). Typically, there are enterprise-wide agreed-upon targets to be tracked over a period of time. They may be used as part of other management strategies such as Six Sigma or Total Quality Management (TQM).
DASHBOARD-TYPE REPORTS A popular idea in business reporting in recent years has been to present a range of different performance indicators on one page, like a dashboard in a car. Typically, dashboard vendors provide a set of predefined reports with static elements and a fixed structure, but also allow for customization of the dashboard widgets and views and for setting targets for various metrics. It is common to have color-coded traffic lights defined for performance (red, orange, green) to draw management attention to particular areas; a minimal sketch of this thresholding logic follows. More details on dashboards are given later in this chapter.
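As a rough illustration of that traffic-light logic (a sketch only; the thresholds, KPI names, and figures are hypothetical, not taken from any vendor's product):

```python
# A minimal sketch of color-coded dashboard status logic: green when the
# target is met, orange when close, red otherwise. Thresholds are hypothetical.
def kpi_status(actual: float, target: float, warn_ratio: float = 0.9) -> str:
    ratio = actual / target
    if ratio >= 1.0:
        return "green"
    return "orange" if ratio >= warn_ratio else "red"

kpis = {"on-time departures (%)": (78.0, 85.0),
        "revenue per employee ($K)": (102.0, 100.0)}
for name, (actual, target) in kpis.items():
    print(f"{name}: {kpi_status(actual, target)}")   # prints each KPI's light
```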
BALANCED SCORECARD-TYPE REPORTS This is a method developed by Kaplan and Norton that attempts to present an integrated view of success in an organization. In addition to financial performance, balanced scorecard-type reports also include customer, business process, and learning and growth perspectives. More details on balanced scorecards are provided later in this chapter.
Components of the Business Reporting System
Although each business reporting system has its unique characteristics, there seems to be a generic pattern that is common across organizations and technology architectures. Think of this generic pattern as having the business user on one end of the reporting continuum and the data sources on the other end. Based on the needs and requirements of the business user, the data is captured, stored, consolidated, and converted to the desired reports using a set of predefined business rules. To be successful, such a system needs an overarching assurance process that covers the entire value chain and moves back and forth, ensuring that reporting requirements and information delivery are properly aligned (Hill, 2008). Following are the most common components of a business reporting system; a minimal code sketch of how several of them fit together appears after the list.
• OLTP (online transaction processing). A system that measures some aspect of the real world as events (e.g., transactions) and records them into enterprise databases. Examples include ERP systems, POS systems, Web servers, RFID readers, handheld inventory readers, card readers, and so forth.
• Data supply. A system that takes recorded events/transactions and delivers them reliably to the reporting system. The data access can be push or pull, depending on whether or not it is responsible for initiating the delivery process. It can also be polled (or batched) if the data are transferred periodically, or triggered (or online) if data are transferred in case of a specific event.
• ETL (extract, transform, and load). The intermediate step where the recorded transactions/events are checked for quality, put into the appropriate format, and inserted into the desired data format.
• Data storage. The storage area for the data and metadata. It could be a flat file or a spreadsheet, but it is usually a relational database management system (RDBMS) set up as a data mart, data warehouse, or operational data store (ODS); it often employs online analytical processing (OLAP) functions like cubes.
• Business logic. The explicit steps for how the recorded transactions/events are to be converted into metrics, scorecards, and dashboards.
• Publication. The system that builds the various reports and hosts them (for users) or disseminates them (to users). These systems may also provide notification, annotation, collaboration, and other services.
• Assurance. A good business reporting system is expected to offer a quality service to its users. This includes determining if and when the right information is to be delivered to the right people in the right way/format.
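The following minimal sketch shows how several of these components might fit together in miniature, using SQLite as a stand-in data store. The event records, table, and column names are hypothetical, and a production system would of course use the dedicated tools described above:

```python
# Miniature walk-through of the pipeline above: data supply -> ETL ->
# data storage -> business logic. All names and records are hypothetical.
import sqlite3

raw_events = [("2013-01-05", "east", "1200"),   # delivered by the data supply
              ("2013-01-06", "west", "oops"),   # a bad record to be rejected
              ("2013-01-07", "east", "430")]

conn = sqlite3.connect(":memory:")              # the "data storage" component
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")

# ETL: check quality, put into the appropriate format, load into storage.
for day, region, amount in raw_events:
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                     (day, region.upper(), float(amount)))
    except ValueError:
        pass  # a real system would log rejected records for quality review

# Business logic: convert the stored events into a metric for publication.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(f"{region}: {total:,.2f}")
```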
Application Case 4.2 is an excellent example that illustrates the power and the utility of automated report generation for a large (and, at a time of natural crisis, somewhat chaotic) organization like FEMA.
Application Case 4.2
Flood of Paper Ends at FEMA
Staff at the Federal Emergency Management Agency (FEMA), a U.S. federal agency that coordinates disaster response when the President declares a national disaster, always got two floods at once. First, water covered the land. Next, a flood of paper, required to administer the National Flood Insurance Program (NFIP), covered their desks: pallets and pallets of green-striped reports poured off a mainframe printer and into their offices. Individual reports were sometimes 18 inches thick, with a nugget of information about insurance claims, premiums, or payments buried in them somewhere.
Bill Barton and Mike Miles don't claim to be able to do anything about the weather, but the project manager and computer scientist, respectively, from Computer Sciences Corporation (CSC) have used WebFOCUS software from Information Builders to turn back the flood of paper generated by the NFIP. The program allows the government to work together with national insurance companies to collect flood insurance premiums and pay claims for flooding in communities that adopt flood control measures. As a result of CSC's work, FEMA staff no longer leaf through paper reports to find the data they need. Instead, they browse insurance data posted on NFIP's BureauNet intranet site, select just the information they want to see, and get an on-screen report or download the data as a spreadsheet.
And that is only the start of the savings that WebFOCUS has provided. The number of times that NFIP staff asks CSC for special reports has dropped by half, because NFIP staff can generate many of the special reports they need without calling on a programmer to develop them. Then there is the cost of creating BureauNet in the first place. Barton estimates that using conventional Web and database software to export data from FEMA's mainframe, store it in a new database, and link that to a Web server would have cost about 100 times as much (more than $500,000) and taken about two years to complete, compared with the few months Miles spent on the WebFOCUS solution.
When Tropical Storm Allison, a huge slug of sodden, swirling clouds, moved out of the Gulf of Mexico onto the Texas and Louisiana coastline in June 2001, it killed 34 people, most from drowning; damaged or destroyed 16,000 homes and businesses; and displaced more than 10,000 families. President George W. Bush declared 28 Texas counties disaster areas, and FEMA moved in to help. This was the first serious test for BureauNet, and it delivered. This first comprehensive use of BureauNet resulted in FEMA field staff readily accessing what they needed and when they needed it, and asking for many new types of reports.
Fortunately, Miles and WebFOCUS were up to the task. In some cases, Barton says, "FEMA would ask for a new type of report one day, and Miles would have it on BureauNet the next day, thanks to the speed with which he could create new reports in WebFOCUS."
The sudden demand on the system had little impact on its performance, notes Barton. "It handled the demand just fine," he says. "We had no problems with it at all." "And it made a huge difference to FEMA and the job they had to do. They had never had that level of access before, never had been able to just click on their desktop and generate such detailed and specific reports."
QUESTIONS FOR DISCUSSION
1. What is FEMA and what does it do?
2. What are the main challenges that FEMA faces?
3. How did FEMA improve its inefficient reporting practices?
Sources: Information Builders, Customer Success Story, "Useful Information Flows at Disaster Response Agency," informationbuilders.com/applications/fema (accessed January 2013); and fema.gov.
SECTION 4.2 REVIEW QUESTIONS
1. What is a report? What are reports used for?
2. What is a business report? What are the main characteristics of a good business report?
3. Describe the cyclic process of management and comment on the role of business reports.
4. List and describe the three major categories of business reports.
5. What are the main components of a business reporting system?
4.3 DATA AND INFORMATION VISUALIZATION
Data visualization (or more appropriately, information visualization) has been defined as "the use of visual representations to explore, make sense of, and communicate data" (Few, 2008). Although the name that is commonly used is data visualization, usually what is meant by this is information visualization. Since information is the aggregation, summarization, and contextualization of data (raw facts), what is portrayed in visualizations is the information and not the data. However, since the two terms data visualization and information visualization are used interchangeably and synonymously, in this chapter we will follow suit.
Data visualization is closely related to the fields of information graphics, information visualization, scientific visualization, and statistical graphics. Until recently, the major
forms of data visualization available in business intelligence applications have included charts and graphs, as well as the other types of visual elements used to create scorecards and dashboards. Application Case 4.3 shows how visual reporting tools can help facilitate the cost-effective creation and sharing of business information.
Application Case 4.3
Tableau Saves Blastrac Thousands of Dollars with Simplified Information Sharing
Blastrac, a self-proclaimed global leader in portable
surface preparation technologies and equipment
(e.g., shot blasting, grinding, polishing, scarifying,
scraping, milling, and cutting equipment), depended
on the creation and distribution of reports across the
organization to make business decisions. However,
the company did not have a consistent reporting
method in place and, consequently, preparation of
reports for the company’s various needs (sales data,
working capital, inventory, purchase analysis, etc.)
was tedious. Blastrac’s analysts each spent nearly
one whole day per week (a total of 20 to 30 hours)
extracting data from the multiple enterprise resource
planning (ERP) systems, loading it into several Excel
spreadsheets, creating filtering capabilities and
establishing predefined pivot tables.
Not only were these massive spreadsheets
often inaccurate and consistently hard to under-
stand, but also they were virtually useless for the
sales team, which couldn’t work with the complex
format. In addition, each consumer of the reports
had different needs.
Blastrac Vice President and CIO Dan Murray began looking for a solution to the company's reporting troubles. He quickly ruled out the rollout of a single ERP system, a multimillion-dollar proposition. He also eliminated the possibility of an enterprise-wide business intelligence (BI) platform deployment because of cost: quotes from five different vendors ranged from $130,000 to over $500,000. What Murray needed was a solution that was affordable, could deploy quickly without disrupting current systems, and was able to represent data consistently regardless of the multiple currencies Blastrac operates in.
The Solution and the Results
Working with Interworks, Inc., an IT services consulting firm out of Oklahoma, Murray and his team finessed the data sources. Murray then deployed two data visualization tools from Tableau Software: Tableau Desktop, a visual data analysis solution that allowed Blastrac analysts to quickly and easily create intuitive and visually compelling reports, and Tableau Reader, a free application that enabled everyone across the company to directly interact with the reports, filtering, sorting, extracting, and printing data as fit their needs, all at a total cost of less than one-third of the lowest competing BI quote.
With only one hour per week now required to create reports (a roughly 95 percent reduction in report preparation time) and with updates to these reports happening automatically through Tableau, Murray and his team are able to proactively identify major business events reflected in company data, such as an exceptionally large sale, instead of reacting to incoming questions from employees as they had been forced to do previously.
"Prior to deploying Tableau, I spent countless hours customizing and creating new reports based on individual requests, which was not efficient or productive for me," said Murray. "With Tableau, we create one report for each business area, and, with very little training, they can explore the data themselves. By deploying Tableau, I not only saved thousands of dollars and endless months of deployment, but I'm also now able to create a product that is infinitely more valuable for people across the organization."
QUESTIONS FOR DISCUSSION
1. How did Blastrac achieve significant cost savings in reporting and information sharing?
2. What were the challenge, the proposed solution,
and the obtained results?
Sources: tableausoftware.com/learn/stories/spotlight-blastric;
blastrac.com/about-us; and interworks.com.
To better understand the current and future trends in the field of data visualization,
it helps to begin with some historical context.
A Brief History of Data Visualization
Despite the fact that predecessors to data visualization date back to the second century AD, most developments have occurred in the last two and a half centuries, predominantly during the last 30 years (Few, 2007). Although visualization was not widely recognized as a discipline until fairly recently, today's most popular visual forms date back a few centuries. Geographical exploration, mathematics, and popularized history spurred the creation of early maps, graphs, and timelines as far back as the 1600s, but William Playfair is widely credited as the inventor of the modern chart, having created the first widely distributed line and bar charts in his Commercial and Political Atlas of 1786 and what is generally considered to be the first pie chart in his Statistical Breviary, published in 1801 (see Figure 4.2).
Perhaps the most notable innovator of information graphics during this period was Charles Joseph Minard, who graphically portrayed the losses suffered by Napoleon's army in the Russian campaign of 1812 (see Figure 4.3). Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. The path of Napoleon's retreat from Moscow in the bitterly cold winter is depicted by the dark lower band, which is tied to temperature and time scales.
[Playfair chart: "Exports and Imports to and from Denmark & Norway from 1700 to 1780."]
FIGURE 4.2 The First Pie Chart Created by William Playfair in 1801. Source: en.wikipedia.org.
FIGURE 4.3 Decimation of Napoleon's Army During the 1812 Russian Campaign. Source: en.wikipedia.org.
Popular visualization expert, author, and critic Edward Tufte says that this "may well be the best statistical graphic ever drawn." In this graphic, Minard managed to simultaneously represent several data dimensions (the size of the army, direction of movement, geographic locations, outside temperature, etc.) in an artistic and informative manner. Many more great visualizations were created in the 1800s, and most of them are chronicled on Tufte's Web site (edwardtufte.com) and in his visualization books.
The 1900s saw the rise of a more formal, empirical attitude toward visualization, which tended to focus on aspects such as color, value scales, and labeling. In the mid-1900s, cartographer and theorist Jacques Bertin published his Semiologie Graphique, which some say serves as the theoretical foundation of modern information visualization. While most of his patterns are either outdated by more recent research or completely inapplicable to digital media, many are still very relevant.
In the 2000s, the Internet emerged as a new medium for visualization and brought with it a whole lot of new tricks and capabilities. Not only has the worldwide, digital distribution of both data and visualization made them more accessible to a broader audience (raising visual literacy along the way), but it has also spurred the design of new forms that incorporate interaction, animation, graphics-rendering technology unique to screen media, and real-time data feeds to create immersive environments for communicating and consuming data.
Companies and individuals are, seemingly all of a sudden, interested in data; that interest has, in turn, sparked a need for visual tools that help them understand it. Cheap hardware sensors and do-it-yourself frameworks for building your own system are driving down the costs of collecting and processing data. Countless other applications, software tools, and low-level code libraries are springing up to help people collect, organize, manipulate, visualize, and understand data from practically any source. The Internet has also served as a fantastic distribution channel for visualizations; a diverse community of designers, programmers, cartographers, tinkerers, and data wonks has assembled to disseminate all sorts of new ideas and tools for working with data in both visual and nonvisual forms.
Google Maps has also single-handedly democratized both the interface conventions (click to pan, double-click to zoom) and the technology (256-pixel square map tiles with predictable file names) for displaying interactive geography online, to the extent that most people just know what to do when they're presented with a map online. Flash has served well as a cross-browser platform on which to design and develop rich, beautiful Internet applications incorporating interactive data visualization and maps; now, new browser-native technologies such as canvas and SVG (sometimes collectively included under the umbrella of HTML5) are emerging to challenge Flash's supremacy and extend the reach of dynamic visualization interfaces to mobile devices.
The future of data/information visualization is very hard to predict. We can only extrapolate from what has already been invented: more three-dimensional visualization, more immersive experiences with multidimensional data in virtual reality environments, and holographic visualization of information. There is a good chance that, before the end of this decade, we will see something invented in the information visualization realm that we have never seen before. Application Case 4.4 shows how the Dana-Farber Cancer Institute used information visualization to better understand cancer vaccine clinical trials.
Application Case 4.4
TIBCO Spotfire Provides Dana-Farber Cancer Institute with Unprecedented Insight into Cancer
Vaccine Clinical Trials
When Karen Maloney, business development manager
of the Cancer Vaccine Center (CVC) at Dana-Farber
Cancer Institute in Boston, decided to investigate the
competitive landscape of the cancer vaccine field, she
looked to a strategic planning and marketing MBA
class at Babson College in Wellesley, Massachusetts,
for help with the research project. There she met
Xiaohong Cao, whose bioinformatics background led
to the decision to focus on clinical vaccine trials as
representative of potential competition. This became
Dana-Farber CVC's first organized attempt to assess
in-depth the cancer vaccine market.
Cao focused on the analysis of 645 clinical trials related to cancer vaccines. The data was extracted in XML from the ClinicalTrials.gov Web site and included categories such as "Summary of Purpose," "Trial Sponsor," "Phase of the Trial," "Recruiting Status," and "Location." Additional statistics on cancer types, including incidence and survival rates, were retrieved from the National Cancer Institute Surveillance data.
Challenge and Solution
Although information from clinical vaccine trials is organized fairly well into categories and can be downloaded, there is great inconsistency and redundancy inherent in the data registry. To gain a good understanding of the landscape, both an overview and an in-depth analytic capability were required simultaneously. It would have been very difficult, not to mention incredibly time-consuming, to analyze information from the multiple data sources separately in order to understand the relationships underlying the data or identify trends and patterns using spreadsheets. And to attempt to use a traditional business intelligence tool would have required significant IT resources. Cao proposed using the TIBCO Spotfire DXP (Spotfire) computational and visual analysis tool for data exploration and discovery.
Results
With the help of Cao and Spotfire software, Dana-Farber's CVC developed a first-of-its-kind analysis approach to rapidly extract complex data specifically for cancer vaccines from the major clinical trial repository. Summarization and visualization of these data represent a cost-effective means of making informed decisions about future cancer vaccine clinical trials. The findings are helping the CVC at Dana-Farber understand its competition and the diseases they are working on to help shape its strategy in the marketplace.
Spotfire software's visual and computational analysis approach provides the CVC at Dana-Farber and the research community at large with a better understanding of the cancer vaccine clinical trials landscape and enables rapid insight into the hotspots of cancer vaccine activity, as well as into the identification of neglected cancers.
"The whole field of medical research is going through an enormous transformation, in part driven by information technology," adds Brusic. "Using a tool like Spotfire for analysis is a promising area in this field because it helps integrate information from multiple sources, ask specific questions, and rapidly extract new knowledge from the data that was previously not easily attainable."
QUESTIONS FOR DISCUSSION
1. How did Dana-Farber Cancer Institute use TIBCO Spotfire to enhance information reporting and visualization?
2. What were the challenge, the proposed solution, and the obtained results?
QUESTIONS FOR DISCUSSION
1. How did Dana-Farber Cancer Institute use TIBCO
Spotfire to enhance information reporting and
visualization?
2. What were the challenge, the proposed solution,
and the obtained results?
Sources: TIBCO Spotfire, Customer Success Story, "TIBCO Spotfire Provides Dana-Farber Cancer Institute with Unprecedented Insight into Cancer Vaccine Clinical Trials," spotfire.tibco.com/-/media/content-center/case-studies/dana-farber.ashx (accessed March 2013); and Dana-Farber Cancer Institute, dana-farber.org.
SECTION 4.3 REVIEW QUESTIONS
1. What is data visualization? Why is it needed?
2. What are the historical roots of data visualization?
3. Carefully analyze Charles Joseph Minard's graphical portrayal of Napoleon's march. Identify and comment on all of the information dimensions captured in this classic diagram.
4. Who is Edward Tufte? Why do you think we should know about his work?
5. What do you think the "next big thing" is in data visualization?
4.4 DIFFERENT TYPES OF CHARTS AND GRAPHS
Often end users of business analytics systems are not sure what type of chart or graph to use for a specific purpose. Some charts and/or graphs are better at answering certain types of questions. What follows is a short description of the types of charts and/or graphs commonly found in most business analytics tools and of the types of questions they are best suited to answer.
Basic Charts and Graphs
What follows are the basic charts and graphs that are commonly used for information
visualization.
LINE CHART Line charts are perhaps the most frequently used graphical visuals for time-series data. Line charts (or line graphs) show the relationship between two variables; they are most often used to track changes or trends over time (having one of the variables set to time on the x-axis). Line charts sequentially connect individual data points to help infer changing trends over a period of time. Line charts are often used to show time-dependent changes in the values of some measure, such as changes in a specific stock price over a 5-year period or changes in the number of daily customer service calls over a month.
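As a simple illustration, the following matplotlib sketch draws a line chart of a hypothetical measure (daily customer service calls over a month); the data are fabricated purely for demonstration.

    import numpy as np
    import matplotlib.pyplot as plt

    days = np.arange(1, 31)                  # days of the month (time on the x-axis)
    calls = 120 + 10 * np.sin(days / 4.0) + np.random.randint(-8, 9, size=days.size)

    plt.plot(days, calls, marker="o")        # connect the points to show the trend
    plt.xlabel("Day of month")
    plt.ylabel("Customer service calls")
    plt.title("Daily Customer Service Calls")
    plt.show()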
BAR CHART Bar charts are among the most basic visuals used for data representation. Bar charts are effective when you have nominal data or numerical data that splits nicely into different categories so you can quickly see comparative results and trends within your data. Bar charts are often used to compare data across multiple categories, such as percent advertising spending by department or by product category. Bar charts can be vertically or horizontally oriented. They can also be stacked on top of each other to show multiple dimensions in a single chart.
PIE CHART Pie charts are, as the name implies, pie-looking charts that are visually appealing. Because they are so visually attractive, they are often used incorrectly. Pie charts should only be used to illustrate relative proportions of a specific measure. For instance, they can be used to show the relative percentage of an advertising budget spent on different product lines, or they can show the relative proportions of majors declared by college students in their sophomore year. If there are more than just a few categories to show (say, more than four), one should seriously consider using a bar chart instead of a pie chart, as the sketch below illustrates.
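The sketch below contrasts the two: the same hypothetical advertising-budget breakdown is drawn as both a pie chart and a bar chart. With six categories, the bar chart is usually easier to read accurately.

    import matplotlib.pyplot as plt

    categories = ["TV", "Web", "Print", "Radio", "Outdoor", "Other"]  # hypothetical
    share = [30, 25, 18, 12, 9, 6]                                    # percent of budget

    fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))
    ax_pie.pie(share, labels=categories, autopct="%1.0f%%")  # slices are hard to compare
    ax_pie.set_title("Pie chart")
    ax_bar.bar(categories, share)                            # heights are easy to compare
    ax_bar.set_ylabel("Percent of advertising budget")
    ax_bar.set_title("Bar chart")
    plt.show()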
SCATTER PLOT Scatter plots are often used to explore relationships between two or three variables (in 2D or 3D visuals). Since they are visual exploration tools, having more than three variables, translating into more than three dimensions, is not easily achievable. Scatter plots are an effective way to explore the existence of trends, concentrations, and outliers. For instance, in a two-variable (two-axis) graph, a scatter plot can be used to illustrate the correlation between age and weight of heart disease patients, or it can illustrate the relationship between the number of customer care representatives and the number of open customer service claims. Often, a trend line is superimposed on a two-dimensional scatter plot to illustrate the nature of the relationship.
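A minimal sketch of such a plot, with hypothetical (randomly generated) age and weight values and a least-squares trend line superimposed:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    age = rng.uniform(30, 75, 80)                      # hypothetical patient ages
    weight = 50 + 0.6 * age + rng.normal(0, 6, 80)     # correlated, noisy weights

    slope, intercept = np.polyfit(age, weight, 1)      # fit the linear trend
    xs = np.sort(age)
    plt.scatter(age, weight, alpha=0.6)
    plt.plot(xs, slope * xs + intercept, color="red")  # superimposed trend line
    plt.xlabel("Age (years)")
    plt.ylabel("Weight (kg)")
    plt.show()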
BUBBLE CHART Bubble charts are often enhanced versions of scatter plots. The bubble chart is not a new visualization type; instead, it should be viewed as a technique to enrich data illustrated in scatter plots (or even geographic maps). By varying the size and/or color of the circles, one can add additional data dimensions, offering more enriched meaning about the data. For instance, it can be used to show a competitive view of college-level class attendance by major and by time of the day, or it can be used to show profit margin by product type and by geographic region.
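Because a bubble chart is essentially a scatter plot whose marker size (and, optionally, color) encodes extra dimensions, it can be sketched as follows; the product figures are hypothetical.

    import matplotlib.pyplot as plt

    margin = [12, 18, 9, 22, 15]        # profit margin (%) on the x-axis
    growth = [3, 7, 1, 5, 9]            # sales growth (%) on the y-axis
    revenue = [400, 150, 800, 90, 300]  # third dimension: bubble size
    region = [0, 1, 2, 3, 1]            # fourth dimension: bubble color (region code)

    plt.scatter(margin, growth, s=revenue, c=region, alpha=0.5, cmap="viridis")
    plt.xlabel("Profit margin (%)")
    plt.ylabel("Sales growth (%)")
    plt.title("Bubble size = revenue, color = region")
    plt.show()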
Specialized Charts and Graphs
The graphs and charts that we review in this section are either derived from the basic charts as special cases or they are relatively new and specific to a problem type and/or an application area.
HISTOGRAM Graphically speaking, a histogram looks just like a bar chart. The difference between histograms and generic bar charts is the information that is portrayed in them. Histograms are used to show the frequency distribution of one variable or several variables. In a histogram, the x-axis is often used to show the categories or ranges, and the y-axis is used to show the measures/values/frequencies. Histograms show the distributional shape of the data. That way, one can visually examine whether the data are distributed normally, exponentially, and so on. For instance, one can use a histogram to illustrate the exam performance of a class, where the distribution of the grades as well as a comparative analysis of individual results can be shown, or one can use a histogram to show the age distribution of a customer base.
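A short sketch that plots the frequency distribution of hypothetical (randomly generated) exam grades, so the distributional shape can be inspected visually:

    import numpy as np
    import matplotlib.pyplot as plt

    grades = np.random.default_rng(1).normal(loc=75, scale=10, size=200)  # hypothetical

    plt.hist(grades, bins=10, edgecolor="black")  # x-axis: grade ranges; y-axis: counts
    plt.xlabel("Exam grade")
    plt.ylabel("Number of students")
    plt.title("Distribution of Exam Grades")
    plt.show()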
GANTT CHART Gantt charts are a special case of horizontal bar charts that are used to portray project timelines, project tasks/activity durations, and overlap among the tasks/activities. By showing start and end dates/times of tasks/activities and the overlapping relationships, Gantt charts make an invaluable aid for the management and control of projects. For instance, Gantt charts are often used to show the project timeline, task overlaps, relative task completions (a partial bar illustrating the completion percentage inside a bar that shows the actual task duration), resources assigned to each task, milestones, and deliverables.
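matplotlib has no dedicated Gantt chart function, but a horizontal bar chart with offset bars approximates one; the tasks and dates below are hypothetical.

    import matplotlib.pyplot as plt

    tasks = ["Requirements", "Design", "Build", "Test"]  # hypothetical project tasks
    start = [0, 10, 25, 50]                              # start day of each task
    duration = [12, 20, 30, 15]                          # task duration in days

    plt.barh(tasks, duration, left=start)                # offset each bar by its start
    plt.xlabel("Project day")
    plt.title("Project Timeline (Gantt-style)")
    plt.gca().invert_yaxis()                             # list the first task on top
    plt.show()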
PERT CHART PERT charts (also called network diagrams) are developed primarily to simplify the planning and scheduling of large and complex projects. A PERT chart shows precedence relationships among the project activities/tasks. It is composed of nodes (represented as circles or rectangles) and edges (represented with directed arrows). Based on the selected PERT chart convention, either the nodes or the edges may be used to represent the project activities/tasks (activity-on-node versus activity-on-arrow representation schema).
GEOGRAPHIC MAP When the data set includes any kind of location data (e.g., physical addresses, postal codes, state names or abbreviations, country names, latitude/longitude, or some type of custom geographic encoding), it is better and more informative to see the data on a map. Maps usually are used in conjunction with other charts and graphs, as opposed to by themselves. For instance, one can use maps to show the distribution of customer service requests by product type (depicted in pie charts) by geographic location. Often a large variety of information (e.g., age distribution, income distribution, education, economic growth, population changes, etc.) can be portrayed in a geographic map to help decide where to open a new restaurant or a new service station. These types of systems are often called geographic information systems (GIS).
BULLET Bullet graphs are often used to show progress toward a goal. A bullet graph is essentially a variation of a bar chart. Often they are used in place of gauges, meters, and thermometers in dashboards to more intuitively convey the meaning within a much smaller space. Bullet graphs compare a primary measure (e.g., year-to-date revenue) to one or more other measures (e.g., annual revenue target) and present this in the context of defined performance metrics (e.g., sales quota). A bullet graph can intuitively illustrate how the primary measure is performing against overall goals (e.g., how close a sales representative is to achieving his/her annual quota).
HEAT MAP Heat maps are great visuals for illustrating the comparison of continuous values across two categories using color. The goal is to help the user quickly see where the intersection of the categories is strongest and weakest in terms of numerical values of the measure being analyzed. For instance, heat maps can be used to show a segmentation analysis of the target market, where the measure (the color gradient) is the purchase amount and the dimensions are age and income.
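A minimal heat map sketch for that segmentation example, with made-up purchase amounts across hypothetical age and income bands:

    import numpy as np
    import matplotlib.pyplot as plt

    age_bands = ["18-30", "31-45", "46-60", "60+"]
    income_bands = ["Low", "Medium", "High"]
    purchases = np.array([[120, 340, 210],   # hypothetical purchase amounts ($)
                          [260, 510, 430],
                          [310, 620, 580],
                          [190, 410, 520]])

    plt.imshow(purchases, cmap="YlOrRd")     # the color gradient encodes the measure
    plt.xticks(range(len(income_bands)), income_bands)
    plt.yticks(range(len(age_bands)), age_bands)
    plt.colorbar(label="Purchase amount ($)")
    plt.show()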
HIGHLIGHT TABLE Highlight tables are intended to take heat maps one step further. In addition to showing how data intersect by using color, highlight tables add a number on top to provide additional detail. That is, a highlight table is a two-dimensional table with cells populated with numerical values and gradients of colors. For instance, one can show sales representative performance by product type and by sales volume.
TREE MAP Tree maps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with
smaller rectangles representing sub-branches. A leaf node's rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data. When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of tree maps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
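Assuming the third-party squarify package is available (it implements the squarified treemap layout and draws via matplotlib), a tiny treemap of a hypothetical product hierarchy can be sketched as:

    import matplotlib.pyplot as plt
    import squarify  # third-party package; assumed installed (pip install squarify)

    sizes = [500, 250, 150, 100]                    # hypothetical sales by product line
    labels = ["Widgets", "Gadgets", "Parts", "Other"]

    squarify.plot(sizes=sizes, label=labels, alpha=0.8)  # nested-rectangle layout
    plt.axis("off")                                      # a treemap needs no axes
    plt.show()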
Even though these charts and graphs cover a major part of what is commonly used in information visualization, they by no means cover it all. Nowadays, one can find many other specialized graphs and charts that serve a specific purpose. Furthermore, current trends are to combine/hybridize and animate these charts for better-looking and more intuitive visualization of today's complex and volatile data sources. For instance, the interactive, animated bubble charts available at the Gapminder Web site (gapminder.org) provide an intriguing way of exploring world health, wealth, and population data from a multidimensional perspective. Figure 4.4 depicts the sorts of displays available at the site. In this graph, population size, life expectancy, and per capita income at the continent level are shown; also given is a time-varying animation that shows how these variables changed over time.
[Gapminder bubble chart: life expectancy (y-axis) versus income per person (GDP/capita, PPP$ inflation-adjusted; x-axis, log scale), with bubble size representing total population and an animation timeline spanning 1800 to 2000.]
FIGURE 4.4 A Gapminder Chart That Shows Wealth and Health of Nations. Source: gapminder.org.
SECTION 4.4 REVIEW QUESTIONS
1. Why do you think there are large numbers of different types of charts and graphs?
2. What are the main differences among line, bar, and pie charts? When should you choose to use one over the others?
3. Why would you use a geographic map? What other types of charts can be combined
with a geographic map?
4. Find two more charts that are not covered in this section, and comment on their usability.
4.5 THE EMERGENCE OF DATA VISUALIZATION
AND VISUAL ANALYTICS
As Seth Grimes (2009) has noted, there is a "growing palette" of data visualization techniques and tools that enable the users of business analytics and business intelligence systems to better "communicate relationships, add historical context, uncover hidden correlations and tell persuasive stories that clarify and call to action." The latest Magic Quadrant on Business Intelligence and Analytics Platforms, released by Gartner in February 2013, further emphasizes the importance of visualization in business intelligence. As the chart shows, most of the solution providers in the Leaders quadrant are either relatively recently founded information visualization companies (e.g., Tableau Software, QlikTech, Tibco Spotfire) or well-established, large analytics companies (e.g., SAS, IBM, Microsoft, SAP, MicroStrategy) that are increasingly focusing their efforts on information visualization and visual analytics. Details on Gartner's latest Magic Quadrant are given in Technology Insights 4.1.
TECHNOLOGY INSIGHTS 4.1 Gartner Magic Quadrant for Business
Intelligence and Analytics Platforms
Gartner, Inc., the creator of Magic Quadrants, is a leading information technology research and advisory company. Founded in 1979, Gartner has 5,300 associates, including 1,280 research analysts and consultants, and numerous clients in 85 countries.
Magic Quadrant is a research method designed and implemented by Gartner to monitor and evaluate the progress and positions of companies in a specific, technology-based market. By applying a graphical treatment and a uniform set of evaluation criteria, Magic Quadrant helps users to understand how technology providers are positioned within a market.
Gartner changed the name of this Magic Quadrant from "Business Intelligence Platforms" to "Business Intelligence and Analytics Platforms" in 2012 to emphasize the growing importance of analytics capabilities to the information systems that organizations are now building. Gartner defines the business intelligence and analytics platform market as a software platform that delivers 15 capabilities across three categories: integration, information delivery, and analysis. These capabilities enable organizations to build precise systems of classification and measurement to support decision making and improve performance.
Figure 4.5 illustrates the latest Magic Quadrant for Business Intelligence and Analytics Platforms. Magic Quadrant places providers in four groups (niche players, challengers, visionaries, and leaders) along two dimensions: completeness of vision (x-axis) and ability to execute (y-axis). As the quadrant clearly shows, most of the well-known BI/BA providers are positioned in the "leaders" category, while many of the lesser-known, relatively new, emerging providers are positioned in the "niche players" category.
Right now, most of the activity in the business intelligence and analytics platform market is from organizations that are trying to mature their visualization capabilities and to move from descriptive to diagnostic (i.e., predictive and prescriptive) analytics. The vendors in the market have overwhelmingly concentrated on meeting this user demand.
[Quadrant chart, as of February 2013: x-axis "completeness of vision," y-axis "ability to execute"; quadrants labeled challengers, leaders, niche players, and visionaries. Vendors plotted include Microsoft, Tableau Software, QlikTech, LogiXML, Birst, Actuate, Oracle, IBM, SAS, MicroStrategy, Tibco Spotfire, Information Builders, SAP, Board International, Panorama Software, Alteryx, Jaspersoft, Salient Management Company, Pentaho, Targit, GoodData, and Arcplan.]
FIGURE 4.5 Magic Quadrant for Business Intelligence and Analytics Platforms. Source: gartner.com.
If there were a single market theme in 2012, it would be that data discovery/visualization became a mainstream architecture. For years, data discovery/visualization vendors such as QlikTech, Salient Management Company, Tableau Software, and Tibco Spotfire received more positive feedback than vendors offering OLAP cube and semantic-layer-based architectures. In 2012, the market responded:
• MicroStrategy significantly improved Visual Insight.
• SAP launched Visual Intelligence.
• SAS launched Visual Analytics.
• Microsoft bolstered PowerPivot with Power View.
• IBM launched Cognos Insight.
• Oracle acquired Endeca.
• Actuate acquired Quiterian.
This emphasis on data discovery/visualization from most of the leaders in the market, which are now promoting tools with business-user-friendly data integration coupled with embedded storage and computing layers (typically in-memory/columnar) and unfettered drilling, accelerates the trend toward decentralization and user empowerment of BI and analytics, and greatly enables organizations' ability to perform diagnostic analytics.
Source: Gartner Magic Quadrant, released on February 5, 2013, gartner.com (accessed February 2013).
In business intelligence and analytics, the key challenges for visualization have revolved around the intuitive representation of large, complex data sets with multiple dimensions and measures. For the most part, the typical charts, graphs, and other visual elements used in these applications usually involve two dimensions, sometimes three, and fairly small subsets of data sets. In contrast, the data in these systems reside in a
data warehouse. At a minimum, these warehouses involve a range of dimensions (e.g., product, location, organizational structure, time), a range of measures, and millions of cells of data. In an effort to address these challenges, a number of researchers have developed a variety of new visualization techniques.
Visual Analytics
Visual analytics is a recently coined term that is often used loosely to mean nothing more than information visualization. What is meant by visual analytics, however, is the combination of visualization and predictive analytics. While information visualization is aimed at answering "what happened" and "what is happening" and is closely associated with business intelligence (routine reports, scorecards, and dashboards), visual analytics is aimed at answering "why is it happening" and "what is more likely to happen" and is usually associated with business analytics (forecasting, segmentation, correlation analysis). Many of the information visualization vendors are adding capabilities that allow them to call themselves visual analytics solution providers. One of the top, long-time analytics solution providers, SAS Institute, is approaching it from another direction: It is embedding its analytics capabilities into a high-performance data visualization environment that it calls visual analytics.
Visual or not, automated or manual, online or paper based, business reporting is not much different from telling a story. Technology Insights 4.2 provides a different, unorthodox viewpoint on better business reporting.
TECHNOLOGY INSIGHTS 4.2 Telling Great Stories with Data
and Visualization
Everyone who has data to analyze has stories to tell, whether it's diagnosing the reasons for manufacturing defects, selling a new idea in a way that captures the imagination of your target audience, or informing colleagues about a particular customer service improvement program. And when it's telling the story behind a big strategic choice so that you and your senior management team can make a solid decision, providing a fact-based story can be especially challenging. In all cases, it's a big job. You want to be interesting and memorable; you know you need to keep it simple for your busy executives and colleagues. Yet you also know you have to be factual, detail oriented, and data driven, especially in today's metric-centric world.
It's tempting to present just the data and facts, but when colleagues and senior management are overwhelmed by data and facts without context, you lose. We have all experienced presentations with large slide decks, only to find that the audience is so overwhelmed with data that they don't know what to think, or they are so completely tuned out that they take away only a fraction of the key points.
Start engaging your executive team and explaining your strategies and results more powerfully by approaching your assignment as a story. You will need the "what" of your story (the facts and data), but you also need the "who?," the "how?," the "why?," and the often missed "so what?" It's these story elements that will make your data relevant and tangible for your audience. Creating a good story can aid you and senior management in focusing on what is important.
Why Story?
Stories bring life to data and facts. They can help you make sense and order out of a disparate collection of facts. They make it easier to remember key points and can paint a vivid picture of what the future can look like. Stories also create interactivity: people put themselves into stories and can relate to the situation.
Cultures have long used storytelling to pass on knowledge and content. In some cultures, storytelling is critical to their identity. For example, in New Zealand, some of the Maori people tattoo their faces with mokus. A moku is a facial tattoo containing a story about ancestors, the family tribe. A man may have a tattoo design on his face that shows features of a hammerhead to highlight unique qualities about his lineage. The design he chooses signifies what is part of his "true self" and his ancestral home.
Likewise, when we are trying to understand a story, the storyteller navigates to finding the "true north." If senior management is looking to discuss how they will respond to a competitive change, a good story can make sense and order out of a lot of noise. For example, you may have facts and data from two studies, one including results from an advertising study and one from a product satisfaction study. Developing a story for what you measured across both studies can help people see the whole where there were disparate parts. For rallying your distributors around a new product, you can employ a story to give vision to what the future can look like. Most importantly, storytelling is interactive: typically the presenter uses words and pictures that audience members can put themselves into. As a result, they become more engaged and better understand the information.
So What Is a Good Story?
Most people can easily rattle off their favorite film or book. Or they remember a funny story that a colleague recently shared. Why do people remember these stories? Because they contain certain characteristics. First, a good story has great characters. In some cases, the reader or viewer has a vicarious experience where they become involved with the character. The character then has to be faced with a challenge that is difficult but believable. There must be hurdles that the character overcomes. And finally, the outcome or prognosis is clear by the end of the story. The situation may not be resolved, but the story has a clear endpoint.
Think of Your Analysis as a Story-Use a Story Structure
When crafting a data-rich story, the first objective is to find the story. Who are the characters? What is the drama or challenge? What hurdles have to be overcome? And at the end of your story, what do you want your audience to do as a result?
Once you know the core story, craft your other story elements: define your characters, understand the challenge, identify the hurdles, and crystallize the outcome or decision question. Make sure you are clear about what you want people to do as a result. This will shape how your audience will recall your story. With the story elements in place, write out the storyboard, which represents the structure and form of your story. Although it's tempting to skip this step, it is better first to understand the story you are telling and then to focus on the presentation structure and form. Once the storyboard is in place, the other elements will fall into place. The storyboard will help you to think about the best analogies or metaphors, to clearly set up the challenge or opportunity, and to finally see the flow and transitions needed. The storyboard also helps you focus on the key visuals (graphs, charts, and graphics) that you need your executives to recall.
In summary, don't be afraid to use data to tell great stories. Being factual, detail oriented, and data driven is critical in today's metric-centric world, but it does not have to mean being boring and lengthy. In fact, by finding the real stories in your data and following the best practices, you can get people to focus on your message, and thus on what's important. Here are those best practices:
1. Think of your analysis as a story; use a story structure.
2. Be authentic: your story will flow.
3. Be visual: think of yourself as a film editor.
4. Make it easy for your audience and you.
5. Invite and direct discussion.
Source: Elissa Fink and Susan J. Moore, "Five Best Practices for Telling Great Stories with Data," 2012, white paper by Tableau Software, Inc., tableausoftware.com/whitepapers/telling-stories-with-data (accessed February 2013).
High-Powered Visual Analytics Environments
Due to the increasing demand for visual analytics coupled with fast-growing data volumes, there is an exponential movement toward investing in highly efficient visualization systems. With its latest move into visual analytics, the statistical software giant SAS Institute is now among those leading this wave. Its new product, SAS Visual Analytics, is a very high-performance, in-memory solution for exploring massive amounts of data in a very short time (almost instantaneously). It empowers users to spot patterns, identify opportunities for further analysis, and convey visual results via Web reports or mobile platforms such as tablets and smartphones. Figure 4.6 shows the high-level architecture of the SAS Visual Analytics platform. On one end of the architecture, there are universal Data Builder and Administrator capabilities, leading into Explorer, Report Designer, and Mobile BI modules, collectively providing an end-to-end visual analytics solution.
Some of the key benefits proposed by SAS are:
• Empower all users with data exploration techniques and approachable analytics to drive improved decision making. SAS Visual Analytics enables different types of users to conduct fast, thorough explorations on all available data. Subsetting or sampling of data is not required. Easy-to-use, interactive Web interfaces broaden the audience for analytics, enabling everyone to glean new insights. Users can look at more options, make more precise decisions, and drive success even faster than before.
• Answer complex questions faster, enhancing the contributions from your analytic talent. SAS Visual Analytics augments the data discovery and exploration process by providing extremely fast results to enable better, more focused analysis. Analytically savvy users can identify areas of opportunity or concern from vast amounts of data so further investigation can take place quickly.
• Improve information sharing and collaboration. Large numbers of users, including those with limited analytical skills, can quickly view and interact with reports and charts via the Web, Adobe PDF files, and iPad mobile devices, while IT maintains control of the underlying data and security. SAS Visual Analytics provides the right information to the right person at the right time to improve productivity and organizational knowledge.
FIGURE 4.6 An Overview of SAS Visual Analytics Architecture. Source: SAS.com.
FIGURE 4.7 A Screenshot from SAS Visual Analytics. Source: SAS.com.
• Liberate IT by giving users a new way to access the information they need. Free IT from the constant barrage of demands from users who need access to different amounts of data, different data views, ad hoc reports, and one-off requests for information. SAS Visual Analytics enables IT to easily load and prepare data for multiple users. Once data is loaded and available, users can dynamically explore data, create reports, and share information on their own.
• Provide room to grow at a self-determined pace. SAS Visual Analytics provides the option of using commodity hardware or database appliances from EMC Greenplum and Teradata. It is designed from the ground up for performance optimization and scalability to meet the needs of any size organization.
Figure 4.7 shows a screenshot of an SAS Visual Analytics platform where time-series forecasting and confidence intervals around the forecast are depicted. A wealth of information on SAS Visual Analytics, along with access to the tool itself for teaching and learning purposes, can be found at teradatauniversitynetwork.com.
SECTION 4.5 REVIEW QUESTIONS
1. What are the reasons for the recent emergence of visual analytics?
2. Look at Gartner's Magic Quadrant for Business Intelligence and Analytics Platforms. What do you see? Discuss and justify your observations.
3. What is the difference between information visualization and visual analytics?
4. Why should storytelling be a part of your reporting and data visualization?
5. What is a high-powered visual analytics environment? Why do we need it?
4.6 PERFORMANCE DASHBOARDS
Performance dashboards are common components of most, if not all, performance management systems, performance measurement systems, BPM software suites, and BI platforms. Dashboards provide visual displays of important information that is consolidated and arranged on a single screen so that the information can be digested at a single glance and easily drilled into and further explored. A typical dashboard is shown in Figure 4.8. This particular executive dashboard displays a variety of KPIs for a hypothetical software company called Sonatica (selling audio tools). This executive dashboard shows a high-level view of the different functional groups surrounding the products, starting from a general overview to the marketing efforts, sales, finance, and support departments. All of this is intended to give executive decision makers a quick and accurate idea of what is going on within the organization. On the left side of the dashboard, we can see (in a time-series fashion) the quarterly changes in revenues, expenses, and margins, as well as the comparison of those figures to previous years' monthly numbers. On the upper-right side we see two dials with color-coded regions showing the amount of monthly expenses for support services (dial on the left) and the amount of other expenses (dial on the right).
[Executive dashboard with a date-range selector; quarterly revenue, expense, and margin time-series charts; two hover-over monthly expense dials with nominal and excessive ranges; and a sales distribution map (USD) with bands from $0 to $95,000.]
FIGURE 4.8 A Sample Executive Dashboard. Source: dundas.com.
As the color coding indicates, while the monthly support expenses are well within the normal ranges, the other expenses are in the red (or darker) region, indicating excessive values. The geographic map on the bottom right shows the distribution of sales at the country level throughout the world. Behind these graphical icons there is a variety of mathematical functions aggregating numerous data points into meaningful high-level figures. By clicking on these graphical icons, the consumer of this information can drill down to more granular levels of information and data.
Dashboards are used in a wide variety of businesses for a wide variety of reasons. For instance, in Application Case 4.5, you will find the summary of a successful implementation of information dashboards by the Dallas Cowboys football team.
Application Case 4.5
Dallas Cowboys Score Big with Tableau and Teknion
Founded in 1960, the Dallas Cowboys are a professional American football team headquartered in Irving, Texas. The team has a large national following, which is perhaps best represented by the NFL record for the number of consecutive games at sold-out stadiums.
Challenge
Bill Priakos, COO of the Dallas Cowboys Merchandising Division, and his team needed more visibility into their data so they could run the business more profitably. Microsoft was selected as the baseline platform for this upgrade, as well as for a number of other sales, logistics, and e-commerce applications. The Cowboys expected that this new information architecture would provide the needed analytics and reporting. Unfortunately, this was not the case, and the search began for a robust dashboarding, analytics, and reporting tool to fill this gap.
Solution and Results
Tableau and Teknion together provided real-time reporting and dashboard capabilities that exceeded the Cowboys' requirements. Systematically and methodically, the Teknion team worked side by side with data owners and data users within the Dallas Cowboys to deliver all required functionality, on time and under budget. "Early in the process, we were able to get a clear understanding of what it would take to run a more profitable operation for the Cowboys," said Teknion Vice President Bill Luisi. "This process step is a key step in Teknion's approach with any client, and it always pays huge dividends as the implementation plan progresses."
Added Luisi, "Of course, Tableau worked very closely with us and the Cowboys during the entire project. Together, we made sure that the Cowboys could achieve their reporting and analytical goals in record time."
Now, for the first time, the Dallas Cowboys are able to monitor their complete merchandising activities from manufacture to end customer and see not only what is happening across the life cycle, but also drill down even further into why it is happening.
Today, this BI solution is used to report on and analyze the business activities of the Merchandising Division, which is responsible for all of the Dallas Cowboys' brand sales. Industry estimates say that the Cowboys generate 20 percent of all NFL merchandise sales, which reflects the fact that they are the most recognized sports franchise in the world.
According to Eric Lai, a ComputerWorld reporter, Tony Romo and the rest of the Dallas Cowboys may have been only average on the football field in the last few years, but off the field, especially in the merchandising arena, they remain America's team.
QUESTIONS FOR DISCUSSION
1. How did the Dallas Cowboys use information
visualization?
2. What were the challenge, the proposed solution, and the obtained results?
Sources: Tableau, Case Study, tableausoftware.com/learn/stories/tableau-and-teknion-exceed-cowboys-requirements (accessed February 2013); and E. Lai, "BI Visualization Tool Helps Dallas Cowboys Sell More Tony Romo Jerseys," ComputerWorld, October 8, 2009.
Dashboard Design
Dashboards are not a new concept. Their roots can be traced at least to the EIS of the 1980s. Today, dashboards are ubiquitous. For example, a few years back, Forrester Research estimated that over 40 percent of the largest 2,000 companies in the world use the technology (Ante and McGregor, 2006). Since then, one can safely assume that this number has gone up quite significantly. In fact, nowadays it would be rather unusual to see a large company using a BI system that does not employ some sort of performance dashboard. The Dashboard Spy Web site (dashboardspy.com/about) provides further evidence of their ubiquity. The site contains descriptions and screenshots of thousands of BI dashboards, scorecards, and BI interfaces used by businesses of all sizes and industries, nonprofits, and government agencies.
According to Eckerson (2006), a well-known expert on BI in general and dashboards in particular, the most distinctive feature of a dashboard is its three layers of information:
1. Monitoring. Graphical, abstracted data to monitor key performance metrics.
2. Analysis. Summarized dimensional data to analyze the root cause of problems.
3. Management. Detailed operational data that identify what actions to take to resolve a problem.
Because of these layers, dashboards pack a lot of information into a single screen. According to Few (2005), "The fundamental challenge of dashboard design is to display all the required information on a single screen, clearly and without distraction, in a manner that can be assimilated quickly." To speed assimilation of the numbers, the numbers need to be placed in context. This can be done by comparing the numbers of interest to other baseline or target numbers, by indicating whether the numbers are good or bad, by denoting whether a trend is better or worse, and by using specialized display widgets or components to set the comparative and evaluative context.
Some of the common comparisons that are typically made in business intelligence systems include comparisons against past values, forecasted values, targeted values, benchmark or average values, multiple instances of the same measure, and the values of other measures (e.g., revenues versus costs). In Figure 4.8, the various KPIs are set in context by comparing them with targeted values, the revenue figure is set in context by comparing it with marketing costs, and the figures for the various stages of the sales pipeline are set in context by comparing one stage with another.
Even with comparative measures, it is important to specifically point out whether a particular number is good or bad and whether it is trending in the right direction. Without these sorts of evaluative designations, it can be time-consuming to determine the status of a particular number or result. Typically, either specialized visual objects (e.g., traffic lights) or visual attributes (e.g., color coding) are used to set the evaluative context. Again, for the dashboard in Figure 4.8, color coding (or varying gray tones) is used with the gauges to designate whether the KPI is good or bad, and green up arrows are used with the various stages of the sales pipeline to indicate whether the results for those stages are trending up or down and whether up or down is good or bad. Although not used in this particular example, additional colors (red and orange, for instance) could be used to represent other states on the various gauges. An interesting and informative dashboard-driven reporting solution built specifically for a very large telecommunication company is featured in Application Case 4.6.
Application Case 4.6
Saudi Telecom Company Excels with Information Visualization
Supplying Internet and mobile services to over 160 million customers across the Middle East, Saudi Telecom Company (STC) is one of the largest providers in the region, extending as far as Africa and South Asia. With millions of customers contacting STC daily for billing, payment, network usage, and support, all of this information has to be monitored somewhere. Located in the headquarters of STC is a data center that features a soccer field-sized wall of monitors all displaying information regarding network statistics, service analytics, and customer calls.
The Problem
When you have acres of information in front of you, prioritizing and contextualizing the data are paramount in understanding it. STC needed to identify the relevant metrics, properly visualize them, and provide them to the right people, often with time-sensitive information. "The executives didn't have the ability to see key performance indicators," said Waleed Al Eshaiwy, manager of the data center at STC. "They would have to contact the technical teams to get status reports. By that time, it would often be too late and we would be reacting to problems rather than preventing them."
The Solution
After carefully evaluating several vendors, STC made the decision to go with Dundas because of its rich data visualization alternatives. Dundas business intelligence consultants worked on-site in STC's headquarters in Riyadh to refine the telecommunication dashboards so they functioned properly. "Even if someone were to show you what was in the database, line by line, without visualizing it, it would be difficult to know what was going on," said Waleed, who worked closely with Dundas consultants. The success that STC experienced led to engagement on an enterprise-wide, mission-critical project to transform their data center and create a more proactive monitoring environment. This project culminated with the monitoring systems in STC's data center finally transforming from reactive to proactive. Figure 4.9 shows a sample dashboard for call center management.
The Benefits
"Dundas' information visualization tools allowed us to see trends and correct issues before they became problems," said Mr. Eshaiwy. He added, "We decreased the amount of service tickets by 55 percent the year that we started using the information visualization tools and dashboards. The availability of the system increased, which meant customer satisfaction levels increased, which led to an increased customer base, which of course led to increased revenues." With new, custom KPIs becoming visually available to the STC team, Dundas' dashboards currently occupy nearly a quarter of the soccer field-sized monitor wall. "Everything is on my screen, and I can drill down and find whatever I need to know," explained Waleed. He added, "Because of the design and structure of the dashboards, we can very quickly recognize the root cause of the problems and take appropriate action." According to Mr. Eshaiwy, Dundas is a success: "The adoption rates are excellent, it's easy to use, and it's one of the most successful projects that we have implemented. Even visitors who stop by my office are grabbed right away by the look of the dashboard!"
QUESTIONS FOR DISCUSSION
1. Why do you think telecommunications compa-
nies are among the prime users of information
visualization tools?
2. How did Saudi Telecom use information
visualization?
3. What were their challenges, the proposed solu-
tion, and the obtained results?
Source: Dundas, Customer Success Story, "Saudi Telecom Company Used Dundas' Information Visualization Solution," dundas.com/wp-content/uploads/Saudi-Telecom-Company-Case-Studyl (accessed February 2013).
[Figure 4.9: "Call Center Dashboard" sample from Dundas Data Visualization, Inc., showing per-agent call center statistics (calls handled, rates, and related measures).]
                            True Class
                            Positive                     Negative
Predicted    Positive       True Positive Count (TP)     False Positive Count (FP)
Class        Negative       False Negative Count (FN)    True Negative Count (TN)
FIGURE 5.8 A Simple Confusion Matrix for Tabulation of Two-Class Classification Results.
Estimating the True Accuracy of Classification Models
In classification problems, the primary source for accuracy estimation is the confusion matrix (also called a classification matrix or a contingency table). Figure 5.8 shows a confusion matrix for a two-class classification problem. The numbers along the diagonal from the upper left to the lower right represent correct decisions, and the numbers outside this diagonal represent the errors.
Table 5.2 provides equations for common accuracy metrics for classification models.
When the classification problem is not binary, the confusion matrix gets bigger (a square matrix whose dimension equals the number of unique class labels), and accuracy metrics become limited to per-class accuracy rates and the overall classifier accuracy:
(True Classification Rate)_i = (True Classification)_i / ((True Classification)_i + Σ_{j≠i} (False Classification)_{ij})

(Overall Classifier Accuracy) = Σ_{i=1}^{n} (True Classification)_i / (Total Number of Cases)
Estimating the accuracy of a classification model (or classifier) induced by a supervised learning algorithm is important for the following two reasons: First, it can be used to estimate its future prediction accuracy, which could imply the level of confidence one should have in the classifier's output in the prediction system. Second, it can be used for choosing a classifier from a given set (identifying the "best" classification model among the many trained). The following are among the most popular estimation methodologies used for classification-type data mining models.
SIMPLE SPLIT The simple split (or holdout or test sample estimation) partitions the data into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set. The training set is used by the inducer (model builder), and the built classifier is then tested on the test set. An exception to this rule occurs when the classifier is an artificial neural network. In this case, the data is partitioned into three mutually exclusive subsets: training, validation, and testing.
TABLE 5.2 Common Accuracy Metrics for Classification Models

True Positive Rate = TP / (TP + FN)
    The ratio of correctly classified positives divided by the total positive count (i.e., hit rate or recall)

True Negative Rate = TN / (TN + FP)
    The ratio of correctly classified negatives divided by the total negative count (i.e., specificity)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
    The ratio of correctly classified instances (positives and negatives) divided by the total number of instances

Precision = TP / (TP + FP)
    The ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified positives

Recall = TP / (TP + FN)
    The ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified negatives

[Figure: preprocessed data are randomly split, two-thirds into training data that feeds model development (the classifier) and one-third into testing data that feeds model assessment (scoring) to estimate prediction accuracy.]
FIGURE 5.9 Simple Random Data Splitting.
The validation set is used during model building to prevent overfitting (more on artificial neural networks can be found in Chapter 6). Figure 5.9 shows the simple split methodology.
The main criticism of this method is that it assumes the data in the two subsets are of the same kind (i.e., have exactly the same properties). Because this is a simple random partitioning, in most realistic data sets where the data are skewed on the classification variable, such an assumption may not hold true. To improve this situation, stratified sampling is suggested, where the strata become the output variable. Even though this is an improvement over the simple split, it still carries a bias associated with the single random partitioning.
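The mechanics are easy to make concrete. Below is a minimal, self-contained Python sketch of the simple split and of the Table 5.2 metrics; the function names, the toy record list, and the label vectors are our own illustrative assumptions, not part of any particular tool.

import random

def simple_split(records, train_fraction=2/3, seed=42):
    # Partition records into two mutually exclusive subsets.
    rng = random.Random(seed)
    shuffled = records[:]              # copy, so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy_metrics(actual, predicted, positive="yes"):
    # Tally the four confusion-matrix cells, then apply the Table 5.2 formulas.
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return {
        "true_positive_rate": tp / (tp + fn),     # hit rate / recall
        "true_negative_rate": tn / (tn + fp),     # specificity
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
    }

train, test = simple_split(list(range(30)))
print(len(train), len(test))           # 20 10

actual    = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "yes", "no", "no", "yes"]
print(accuracy_metrics(actual, predicted))

A stratified variant would shuffle and cut within each class label separately, preserving the class proportions in both subsets.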
k-FOLD CROSS-VALIDATION In order to minimize the bias associated with the random sampling of the training and holdout data samples in comparing the predictive accuracy of two or more methods, one can use a methodology called k-fold cross-validation. In k-fold cross-validation, also called rotation estimation, the complete data set is randomly split into k mutually exclusive subsets of approximately equal size. The classification model is trained and tested k times. Each time it is trained on all but one fold and then tested on the remaining single fold. The cross-validation estimate of the overall accuracy
of a model is calculated by simply averaging the k individual accuracy measures, as shown in the following equation:

CVA = (1/k) Σ_{i=1}^{k} A_i

where CVA stands for cross-validation accuracy, k is the number of folds used, and A_i is the accuracy measure (e.g., hit rate, sensitivity, specificity) of fold i.
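The procedure itself is short enough to sketch directly. The following Python fragment is an illustration under stated assumptions, not a library routine; train_and_score is a placeholder for any function that builds a classifier on its first argument and returns its accuracy on the second, and the trivial majority-class scorer below is our own stand-in.

import random
from collections import Counter

def k_fold_cross_validation(records, k, train_and_score, seed=42):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]    # k mutually exclusive folds
    accuracies = []
    for i in range(k):
        held_out = folds[i]                        # test on this fold ...
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        accuracies.append(train_and_score(training, held_out))
    return sum(accuracies) / k                     # CVA: the average accuracy

def majority_scorer(train, test):
    # Stand-in classifier: always predict the majority training label.
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

records = [((i,), "yes" if i % 3 else "no") for i in range(30)]
print(k_fold_cross_validation(records, k=10, train_and_score=majority_scorer))

Setting k equal to the number of records turns the same function into the leave-one-out estimation described next.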
ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES Other popular assessment methodologies include the following:
• Leave-one-out. The leave-one-out method is similar to k-fold cross-validation where each fold contains a single case; that is, every data point is used for testing exactly once, and as many models are developed as there are data points. This is a time-consuming methodology, but for small data sets it is sometimes a viable option.
• Bootstrapping. With bootstrapping, a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the data set is used for testing. This process is repeated as many times as desired.
• Jackknifing. Similar to the leave-one-out methodology, with jackknifing the accuracy is calculated by leaving one sample out at each iteration of the estimation process.
• Area under the ROC curve. The area under the ROC curve is a graphical assessment technique in which the true positive rate is plotted on the y-axis and the false positive rate on the x-axis. The area under the ROC curve determines the accuracy measure of a classifier: A value of 1 indicates a perfect classifier, whereas 0.5 indicates no better than random chance; in reality, values range between the two extreme cases. For example, in Figure 5.10, A has a better classification performance than B, while C is no better than the random chance of flipping a coin.
[Figure: ROC curves for three classifiers, with the true positive rate (sensitivity) on the y-axis and the false positive rate (1 - specificity) on the x-axis. Curve A bows closest to the upper-left corner, curve B lies below it, and curve C falls on the diagonal chance line.]
FIGURE 5.10 A Sample ROC Curve.
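The AUC also has a useful probabilistic reading: it is the chance that a randomly chosen positive case is scored higher than a randomly chosen negative one (ties counting half). A minimal Python sketch under that reading, with made-up labels and scores:

def auc_from_scores(labeled_scores):
    # labeled_scores: (label, score) pairs, label 1 = positive, 0 = negative
    positives = [s for label, s in labeled_scores if label == 1]
    negatives = [s for label, s in labeled_scores if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# 1.0 would be a perfect classifier; 0.5 is random chance (curve C above).
scores = [(1, 0.9), (1, 0.8), (1, 0.55), (0, 0.6), (0, 0.3), (0, 0.1)]
print(auc_from_scores(scores))   # 0.888...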
CLASSIFICATION TECHNIQUES A number of techniques (or algorithms) are used for classification modeling, including the following:
• Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena. A detailed description of this technique is given in the following section.
• Statistical analysis. Statistical techniques were the primary classification algorithms for many years until the emergence of machine-learning techniques. Statistical classification techniques include logistic regression and discriminant analysis, both of which make the assumptions that the relationships between the input and output variables are linear in nature, the data is normally distributed, and the variables are not correlated and are independent of each other. The questionable nature of these assumptions has led to the shift toward machine-learning techniques.
• Neural networks. These are among the most popular machine-learning techniques that can be used for classification-type problems. A detailed description of this technique is presented in Chapter 6.
• Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category.
• Bayesian classifiers. This approach uses probability theory to build classification models based on past occurrences that are capable of placing a new instance into a most probable class (or category).
• Genetic algorithms. This approach uses the analogy of natural evolution to build directed-search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collections of rules) for classification problems.
A complete description of all of these classification techniques is beyond the scope of this book; thus, only several of the most popular ones are presented here.
DECISION TREES Before describing the details of decision trees, we need to discuss some simple terminology. First, decision trees include many input variables that may have an impact on the classification of different patterns. These input variables are usually called attributes. For example, if we were to build a model to classify loan risks on the basis of just two characteristics (income and a credit rating), these two characteristics would be the attributes and the resulting output would be the class label (e.g., low, medium, or high risk). Second, a tree consists of branches and nodes. A branch represents the outcome of a test to classify a pattern using one of the attributes. A leaf node at the end represents the final class choice for a pattern (a chain of branches from the root node to the leaf node can be represented as a complex if-then statement).
The basic idea behind a decision tree is that it recursively divides a training set until each division consists entirely or primarily of examples from one class. Each nonleaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data are to be divided further. Decision tree algorithms, in general, build an initial tree from the training data such that each leaf node is pure, and they then prune the tree to increase its generalization, and, hence, the prediction accuracy on test data.
In the growth phase, the tree is built by recursively dividing the data until each division is either pure (i.e., contains members of the same class) or relatively small. The basic idea is to ask questions whose answers would provide the most information, similar to what we may do when playing the game "Twenty Questions."
The split used to partition the data depends on the type of the attribute used in the split. For a continuous attribute A, splits are of the form value(A) < x, where x is some "optimal" split value of A. For example, the split based on income could be "Income < 50000." For a categorical attribute A, splits are of the form value(A) belongs to x, where x is a subset of A. As an example, the split could be on the basis of gender: "Male versus Female."
A general algorithm for building a decision tree is as follows (a minimal code sketch follows the list):
1. Create a root node and assign all of the training data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive (nonoverlapping) subsets along the lines of the specific split and move the subsets to the branches.
4. Repeat steps 2 and 3 for each and every leaf node until the stopping criterion is reached (e.g., the node is dominated by a single class label).
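Here is a compact Python sketch of those four steps, under stated assumptions: each row is a dictionary of attribute values plus a class label, and the best-split chooser is a deliberately crude, hypothetical stand-in (a real implementation would use the Gini index or information gain discussed below).

from collections import Counter

def select_best_splitting_attribute(rows, attributes, target):
    # Hypothetical stand-in: prefer the attribute with the fewest distinct
    # (value, class) combinations; swap in Gini index or information gain.
    return min(attributes,
               key=lambda a: len({(row[a], row[target]) for row in rows}))

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stopping criterion: one class dominates, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    split_attr = select_best_splitting_attribute(rows, attributes, target)
    tree = {split_attr: {}}
    for value in {row[split_attr] for row in rows}:   # one branch per value
        subset = [row for row in rows if row[split_attr] == value]
        remaining = [a for a in attributes if a != split_attr]
        tree[split_attr][value] = build_tree(subset, remaining, target)
    return tree

rows = [
    {"income": "high", "credit": "good", "risk": "low"},
    {"income": "high", "credit": "poor", "risk": "medium"},
    {"income": "low",  "credit": "good", "risk": "medium"},
    {"income": "low",  "credit": "poor", "risk": "high"},
]
print(build_tree(rows, ["income", "credit"], "risk"))

The toy rows mirror the loan-risk example above: income and credit rating are the attributes, and risk is the class label.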
Many different algorithms have been proposed for creating decision trees. These algorithms differ primarily in terms of the way in which they determine the splitting attribute (and its split values), the order of splitting the attributes (splitting the same attribute only once or many times), the number of splits at each node (binary versus ternary), the stopping criteria, and the pruning of the tree (pre- versus postpruning). Some of the most well-known algorithms are ID3 (followed by C4.5 and C5 as the improved versions of ID3) from machine learning, classification and regression trees (CART) from statistics, and the chi-squared automatic interaction detector (CHAID) from pattern recognition.
When building a decision tree, the goal at each node is to determine the attribute and the split point of that attribute that best divides the training records in order to purify the class representation at that node. To evaluate the goodness of the split, some splitting indices have been proposed. Two of the most common ones are the Gini index and information gain. The Gini index is used in the CART and SPRINT (Scalable PaRallelizable Induction of Decision Trees) algorithms. Versions of information gain are used in ID3 (and its newer versions, C4.5 and C5).
The Gini index has been used in economics to measure the diversity of a population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute or variable. The best split is the one that increases the purity of the sets resulting from a proposed split. Let us briefly look into a simple calculation of the Gini index:
If a data set S contains examples from n classes, the Gini index is defined as

gini(S) = 1 - Σ_{j=1}^{n} (p_j)^2

where p_j is the relative frequency of class j in S. If data set S is split into two subsets, S1 and S2, with sizes N1 and N2, respectively, the Gini index of the split data is defined as

gini_split(S) = (N1/N) gini(S1) + (N2/N) gini(S2)

The attribute/split combination that provides the smallest gini_split(S) is chosen to split the node. In such a determination, one should enumerate all possible splitting points for each attribute.
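The two formulas translate directly into Python; a minimal illustration with made-up class labels:

def gini(labels):
    # gini(S) = 1 - sum of squared relative class frequencies
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # Size-weighted Gini index of a proposed two-way split
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["low"] * 4 + ["high"] * 4))       # 0.5: a maximally impure node
print(gini_split(["low"] * 4, ["high"] * 4))  # 0.0: a pure (ideal) split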
Information gain is the splitting mechanism used in ID3, which is perhaps the most widely known decision tree algorithm. It was developed by Ross Quinlan in 1986, and since then he has evolved this algorithm into the C4.5 and C5 algorithms. The basic idea behind ID3 (and its variants) is to use a concept called entropy in place of the Gini index. Entropy measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, there is no uncertainty or randomness in that data set, so the entropy is zero. The objective of this approach is to build subtrees so that the entropy of each final subset is zero (or close to zero). Let us also look at the calculation of the information gain.
Assume that there are two classes, P (positive) and N (negative). Let the set of examples S contain p counts of class P and n counts of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

I(p, n) = -(p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

Assume that, using attribute A, the set S will be partitioned into sets {S1, S2, ..., Sv}. If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

E(A) = Σ_{i=1}^{v} ((p_i + n_i) / (p + n)) I(p_i, n_i)

Then, the information that would be gained by branching on attribute A would be

Gain(A) = I(p, n) - E(A)
These calculations are repeated for each and every attribute, and the one with the highest information gain is selected as the splitting attribute. The basic ideas behind these splitting indices are rather similar to each other, but the specific algorithmic details vary. A detailed definition of the ID3 algorithm and its splitting mechanism can be found in Quinlan (1986).
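A short Python sketch of these formulas, using hypothetical counts (9 positive and 5 negative cases split by some attribute into three subsets):

import math

def information(p, n):
    # I(p, n): expected information needed to decide between classes P and N
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def information_gain(p, n, partitions):
    # Gain(A) = I(p, n) - E(A); partitions is a list of (p_i, n_i) pairs
    e_a = sum((pi + ni) / (p + n) * information(pi, ni) for pi, ni in partitions)
    return information(p, n) - e_a

print(information_gain(9, 5, [(2, 3), (4, 0), (3, 2)]))   # about 0.247

The attribute yielding the largest such gain would be selected as the split.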
Application Case 5.5 illustrates how significant the gains may be if the right data mining techniques are used for a well-defined business problem.
Cluster Analysis for Data Mining
Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters. The method is commonly used in biology, medicine, genetics, social network analysis, anthropology, archaeology, astronomy, character recognition, and even in MIS development. As data mining has increased in popularity, the underlying techniques have been applied to business, especially to marketing. Cluster analysis has been used extensively for fraud detection (both credit card and e-commerce fraud) and market segmentation of customers in contemporary CRM systems. More applications in business continue to be developed as the strength of cluster analysis is recognized and used.
Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Each cluster describes the class to which its members belong. An obvious one-dimensional example of cluster analysis is to establish score ranges into which to assign class grades for a college class. This is similar to the cluster analysis problem that the U.S. Treasury faced when establishing new tax brackets in the 1980s. A fictional example of clustering occurs in J. K. Rowling's Harry Potter books. The Sorting Hat determines to which House (e.g., dormitory) to assign first-year students at the Hogwarts School. Another example involves determining how to seat guests at a wedding. As far as data mining goes, the importance of cluster analysis is that it may reveal associations and structures in data that were not previously apparent but are sensible and useful once found.
Application Case 5.5
2degrees Gets a 1275 Percent Boost in Churn Identification
2degrees is New Zealand's fastest growing mobile telecommunications company. In less than 3 years, they have transformed the landscape of New Zealand's mobile telecommunications market. Entering very much as the challenger and battling with incumbents entrenched in the market for over 18 years, 2degrees has won over 580,000 customers and has revenues of more than $100 million in just their third year of operation. Last year's growth was 3761 percent.
Situation
2degrees' information solutions manager, Peter McCallum, explains that predictive analytics had been on the radar at the company for some time. "At 2degrees there are a lot of analytically aware people, from the CEO down. Once we got to the point in our business that we were interested in deploying advanced predictive analytics techniques, we started to look at what was available in the marketplace." It soon became clear that although on paper there were several options, the reality was that the cost of deploying the well-known solutions made it very difficult to build a business case, particularly given that the benefits to the business were as yet unproven.
After careful evaluation, 2degrees decided upon a suite of analytics solutions from 11Ants consisting of Customer Response Analyzer, Customer Churn Analyzer, and Model Builder.
When asked why they chose 11Ants Analytics' solutions, Peter said, "One of the beauties of the 11Ants Analytics solution was that it allowed us to get up and running quickly and very economically. We could test the water and determine what the ROI was likely to be for predictive analytics, making it a lot easier to build a business case for future analytics projects. Yet we didn't really have to sacrifice anything in terms of functionality; in fact, the churn models we've built have performed exceptionally well."
11Ants Analytics director of business development, Tom Fuyala, comments: "We are dedicated to getting organizations up and running with predictive analytics faster, without compromising the quality of the results. With other solutions you must [use] trial and error through multiple algorithms manually, but with 11Ants Analytics solutions the entire optimization and management of the algorithms is automated, allowing thousands to be trialed in a few minutes. The benefits of this approach are evidenced in the real-world results."
Peter is also impressed by the ease of use. "The simplicity was a big deal to us. Not having to have the statistical knowledge in-house was definitely a selling point. Company culture was also a big factor in our decision making. 11Ants Analytics felt like a good fit. They've been very responsive and have been great to work with. The turnaround on some of the custom requests we have made has been fantastic."
Peter also likes the fact that models can be built with the desktop modeling tools and then deployed against the enterprise customer database with 11Ants Predictor. "Once the model has been built, it is easy to deploy it in 11Ants Predictor to run against Oracle and score our entire customer base very quickly. The speed with which 11Ants Predictor can re-score hundreds of thousands of customers is fantastic. We presently re-score our customer base monthly, but it is so easy that we could be re-scoring daily if we wanted."
Benefits
2degrees put 11Ants Analytics solutions to work quickly with very satisfying results. The initial project was to focus on an all-too-common problem in the mobile telecommunications industry: customer churn (customers leaving). For this they deployed 11Ants Customer Churn Analyzer.
2degrees was interested in identifying customers most at risk of churning by analyzing data such as time on network, days since last top-up, activation channel, whether the customer ported their number or not, customer plan, and outbound calling behaviors over the preceding 90 days.
A carefully controlled experiment was run over a period of 3 months, and the results were tabulated and analyzed. The results were excellent: Customers identified as churners by 11Ants Customer Churn Analyzer were a game-changing 1275 percent more likely to be churners than customers chosen at
random. This can also be expressed as an increase in lift of 12.75 at 5 percent (the 5% of the total population identified as most likely to churn by the model). At 10 percent, lift was 7.28. Other benefits included the various insights that 11Ants Customer Churn Analyzer provided, for instance, validating things that staff had intuitively felt, such as time on network's strong relationship with churn, and highlighting areas where product enhancement would be beneficial.
Armed with the information about which customers were most at risk of defecting, 2degrees could now focus retention efforts on those identified as most at risk, thereby getting a substantially higher return on investment on retention marketing expenditure. The bottom line is significantly better results for fewer dollars spent.
2degrees head of customers, Matt Hobbs, provides a perspective on why this is not just important to 2degrees but also to their customers: "Churn prediction is a valuable tool for customer marketing and we are excited about the capabilities 11Ants Analytics provides to identify customers who display indications of churning behavior. This is beneficial to both 2degrees and to our customers."
• To customers go the benefits of identification
(if you are not likely to churn, you are not
being constantly annoyed by messages asking
you to stay) and appropriateness (customers
receive offers that actually are appropriate to
their usage-minutes for someone who likes to
talk, texts for someone who likes to text, etc.).
• To 2degrees go the benefits of targeting (by
identifying a smaller group of at-risk custom-
ers, retention offers can be richer because
of the reduction in the number of people
who may receive it but not need it) and
appropriateness.
By aligning these benefits for both 2degrees and the
customer, the outcomes 2degrees are experiencing
are vastly improved.
QUESTIONS FOR DISCUSSION
1. What does 2degrees do? Why is it important for
2degrees to accurately identify churn?
2. What were the challenges, the proposed solu-
tion, and the obtained results?
3. How can data mining help in identifying cus-
tomer churn? How do some companies do it
without using data mining tools and techniques?
Source: 11Ants Analytics Customer Story, "1275% Boost in Churn Identification at 2degrees," 11antsanalytics.com/casestudies/2degrees_casestudy.aspx (accessed January 2013).
Cluster analysis results may be used to:
• Identify a classification scheme (e.g., types of customers)
• Suggest statistical models to describe populations
• Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
• Provide measures of definition, size, and change in what were previously broad concepts
• Find typical cases to label and represent classes
• Decrease the size and complexity of the problem space for other data mining methods
• Identify outliers in a specific domain (e.g., rare-event detection)
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS Clustering algorithms usually require one to specify the number of clusters to find. If this number is not known from prior knowledge, it should be chosen in some way. Unfortunately, there is no optimal way of calculating what this number is supposed to be. Therefore, several different
heuristic methods have been proposed. The following are among the most commonly referenced ones:
• Look at the percentage of variance explained as a function of the number of clusters; that is, choose a number of clusters so that adding another cluster would not give much better modeling of the data. Specifically, if one graphs the percentage of variance explained by the clusters, there is a point at which the marginal gain will drop (giving an angle in the graph), indicating the number of clusters to be chosen.
• Set the number of clusters to (n/2)^(1/2), where n is the number of data points.
• Use the Akaike Information Criterion (AIC), which is a measure of the goodness of fit (based on the concept of entropy), to determine the number of clusters.
• Use the Bayesian Information Criterion (BIC), which is a model-selection criterion (based on maximum likelihood estimation), to determine the number of clusters.
ANALYSIS METHODS Cluster analysis may be based on one or more of the following general methods:
• Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
• Neural networks (with the architecture called self-organizing map, or SOM)
• Fuzzy logic (e.g., fuzzy c-means algorithm)
• Genetic algorithms
Each of these methods generally works with one of two general method classes:
• Divisive. With divisive classes, all items start in one cluster and are broken apart.
• Agglomerative. With agglomerative classes, all items start in individual clusters, and the clusters are joined together.
Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items. Popular distance measures include Euclidean distance (the ordinary distance between two points that one would measure with a ruler) and Manhattan distance (also called the rectilinear distance, or taxicab distance, between two points). Often, they are based on true distances that are measured, but this need not be so, as is typically the case in IS development. Weighted averages may be used to establish these distances. For example, in an IS development project, individual modules of the system may be related by the similarity between their inputs, outputs, processes, and the specific data used. These factors are then aggregated, pairwise by item, into a single distance measure.
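For points in the plane, the two common measures are one-liners in Python; the coordinates below are made up for illustration:

def euclidean_distance(a, b):
    # Ordinary straight-line ("ruler") distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan_distance(a, b):
    # Rectilinear ("taxicab") distance along the grid axes
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean_distance(p, q))   # 5.0
print(manhattan_distance(p, q))   # 7.0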
K-MEANS CLUSTERING ALGORITHM The k-means algorithm (where k stands for the predetermined number of clusters) is arguably the most referenced clustering algorithm. It has its roots in traditional statistical analysis. As the name implies, the algorithm assigns each data point (customer, event, object, etc.) to the cluster whose center (also called the centroid) is the nearest. The center is calculated as the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster. The algorithm steps are listed below and shown graphically in Figure 5.11:
Initialization step: Choose the number of clusters (i.e., the value of k).
Step 1: Randomly generate k points as initial cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Recompute the new cluster centers.
[Figure: three scatter plots showing the cluster assignments and centers after Step 1, Step 2, and Step 3.]
FIGURE 5.11 A Graphical Illustration of the Steps in k-Means Algorithm.
Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).
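A minimal, self-contained Python sketch of these steps for 2-D points (the data, names, and convergence test are illustrative assumptions):

import random

def k_means(points, k, iterations=100, seed=42):
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # Step 1: initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # Step 2: nearest center wins
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        new_centers = [                           # Step 3: recompute centers
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]                 # keep an empty cluster's center
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (0.5, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centers, clusters = k_means(points, k=2)
print(centers)   # two centers, one near each group of points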
Association Rule Mining
Association rule mining (also known as affinity analysis or market-basket analysis) is a popular data mining method that is commonly used as an example to explain what data mining is and what it can do to a technologically less savvy audience. Most of you might have heard the famous (or infamous, depending on how you look at it) relationship discovered between the sales of beer and diapers at grocery stores. As the story goes, a large supermarket chain (maybe Walmart, maybe not; there is no consensus on which supermarket chain it was) did an analysis of customers' buying habits and found a statistically significant correlation between purchases of beer and purchases of diapers. It was theorized that the reason for this was that fathers (presumably young men) were stopping off at the supermarket to buy diapers for their babies (especially on Thursdays), and since they could no longer go to the sports bar as often, would buy beer as well. As a result of this finding, the supermarket chain is alleged to have placed the diapers next to the beer, resulting in increased sales of both.
In essence, association rule mining aims to find interesting relationships (affinities) between variables (items) in large databases. Because of its successful application to retail business problems, it is commonly called market-basket analysis. The main idea in market-basket analysis is to identify strong relationships among different products (or services) that are usually purchased together (show up in the same basket together, either a physical basket at a grocery store or a virtual basket at an e-commerce Web site). For example, 65 percent of those who buy comprehensive automobile insurance also buy health insurance; 80 percent of those who buy books online also buy music online; 60 percent of those who have high blood pressure and are overweight have high cholesterol; and 70 percent of the customers who buy a laptop computer and virus protection software also buy an extended service plan.
The input to market-basket analysis is simple point-of-sale transaction data, where a number of products and/or services purchased together (just like the contents of a purchase receipt) are tabulated under a single transaction instance. The outcome of the analysis is invaluable information that can be used to better understand customer-purchase behavior in order to maximize the profit from business transactions. A business can take advantage of such knowledge by (1) putting the items next to each other to make it more convenient for the customers to pick them up together and not forget to buy one when buying the others (increasing sales volume); (2) promoting the items as a package (do not put one on sale if the other(s) are on sale); and (3) placing them apart from each other so that the customer has to walk the aisles to search for them, and by doing so potentially see and buy other items.
Applications of market-basket analysis include cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration. In essence, market-basket analysis helps businesses infer customer needs and preferences from their purchase patterns. Outside the business realm, association rules are successfully used to discover relationships between symptoms and illnesses, diagnosis and patient characteristics and treatments (which can be used in medical DSS), and genes and their functions (which can be used in genomics projects), among others. Here are a few common areas and uses for association rule mining:
• Sales transactions: Combinations of retail products purchased together can be used to improve product placement on the sales floor (placing products that go together in close proximity) and promotional pricing of products (not having promotions on both products that are often purchased together).
• Credit card transactions: Items purchased with a credit card provide insight into other products the customer is likely to purchase, or into fraudulent use of the credit card number.
• Banking services: The sequential patterns of services used by customers (checking account followed by savings account) can be used to identify other services they may be interested in (investment account).
• Insurance service products: Bundles of insurance products bought by customers (car insurance followed by home insurance) can be used to propose additional insurance products (life insurance); or, unusual combinations of insurance claims can be a sign of fraud.
• Telecommunication services: Commonly purchased groups of options (e.g., call waiting, caller ID, three-way calling, etc.) help better structure product bundles to maximize revenue; the same is also applicable to multi-channel telecom providers with phone, TV, and Internet service offerings.
• Medical records: Certain combinations of conditions can indicate increased risk of various complications; or, certain treatment procedures at certain medical facilities can be tied to certain types of infections.
A good question to ask with respect to the patterns/relationships that association rule mining can discover is "Are all association rules interesting and useful?" In order to answer such a question, association rule mining uses three common metrics: support, confidence, and lift. Before defining these terms, let's get a little technical by showing what an association rule looks like:

X ⇒ Y [Supp (%), Conf (%)]

{Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]

Here, X (products and/or services; called the left-hand side, LHS, or the antecedent) is associated with Y (products and/or services; called the right-hand side, RHS, or consequent). S is the support, and C is the confidence for this particular rule. Here are the simple formulas for Supp, Conf, and Lift:

Support = Supp(X ⇒ Y) = (number of baskets that contain both X and Y) / (total number of baskets)
Confidence = Conf(X ⇒ Y) = Supp(X ⇒ Y) / Supp(X)

Lift(X ⇒ Y) = Conf(X ⇒ Y) / Expected Conf(X ⇒ Y)
            = (S(X ⇒ Y) / S(X)) / ((S(X) * S(Y)) / S(X))
            = S(X ⇒ Y) / (S(X) * S(Y))
The support (S) of a collection of products is the measure of how often these products and/or services (i.e., LHS + RHS = Laptop Computer, Antivirus Software, and Extended Service Plan) appear together in the same transaction, that is, the proportion of transactions in the data set that contain all of the products and/or services mentioned in a specific rule. In this example, 30 percent of all transactions in the hypothetical store database had all three products present in a single sales ticket. The confidence of a rule is the measure of how often the products and/or services on the RHS (consequent) go together with the products and/or services on the LHS (antecedent), that is, the proportion of transactions containing the LHS that also include the RHS. In other words, it is the conditional probability of finding the RHS of the rule present in transactions where the LHS of the rule already exists. The lift value of an association rule is the ratio of the confidence of the rule to the expected confidence of the rule. The expected confidence of a rule is defined as the product of the support values of the LHS and the RHS divided by the support of the LHS.
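All three formulas reduce to a few lines of Python. The basket data below are hypothetical, echoing the laptop/antivirus/service-plan rule above:

def rule_metrics(baskets, lhs, rhs):
    # Support, confidence, and lift for the rule LHS => RHS
    n = len(baskets)
    supp_lhs = sum(lhs <= b for b in baskets) / n            # S(X)
    supp_rhs = sum(rhs <= b for b in baskets) / n            # S(Y)
    supp_both = sum((lhs | rhs) <= b for b in baskets) / n   # S(X => Y)
    confidence = supp_both / supp_lhs
    lift = confidence / supp_rhs           # equals S(X => Y) / (S(X) * S(Y))
    return supp_both, confidence, lift

baskets = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus", "service plan", "mouse"},
    {"laptop", "antivirus"},
    {"laptop", "mouse"},
    {"bread", "milk"},
    {"laptop", "antivirus", "service plan"},
]
print(rule_metrics(baskets, {"laptop", "antivirus"}, {"service plan"}))
# (0.5, 0.75, 1.5): the rule holds in 50% of baskets, with 75% confidence.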
Several algorithms are available for discovering association rules. Some well-known algorithms include Apriori, Eclat, and FP-Growth. These algorithms only do half the job, which is to identify the frequent itemsets in the database. Once the frequent itemsets are identified, they need to be converted into rules with antecedent and consequent parts. Determination of the rules from frequent itemsets is a straightforward matching process, but the process may be time-consuming with large transaction databases. Even though there can be many items in each section of the rule, in practice the consequent part usually contains a single item. In the following section, one of the most popular algorithms for identification of frequent itemsets is explained.
APRIORI ALGORITHM The Apriori algorithm is the most commonly used algorithm to discover association rules. Given a set of itemsets (e.g., sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets that are common to at least a minimum number of the itemsets (i.e., complies with a minimum support). Apriori uses a bottom-up approach, where frequent subsets are extended one item at a time (a method known as candidate generation, whereby the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, etc.), and groups of candidates at each level are tested against the data for minimum support. The algorithm terminates when no further successful extensions are found.
As an illustrative example, consider the following. A grocery store tracks sales transactions by SKU (stock-keeping unit) and thus knows which items are typically purchased together. The database of transactions, along with the subsequent steps in identifying the frequent itemsets, is shown in Figure 5.12. Each SKU in the transaction database corresponds to a product, such as "1 = butter," "2 = bread," "3 = water," and so on. The first step in Apriori is to count up the frequencies (i.e., the supports) of each item (one-item itemsets). For this overly simplified example, let us set the minimum support to 3 (or 50%, meaning an itemset is considered to be a frequent
Raw Transaction Data          One-Item Itemsets    Two-Item Itemsets    Three-Item Itemsets
Transaction No   SKUs         Itemset   Support    Itemset   Support    Itemset    Support
1001             1, 2, 3, 4   1         3          1, 2      3          1, 2, 4    3
1002             2, 3, 4      2         6          1, 3      2          2, 3, 4    3
1003             2, 3         3         4          1, 4      3
1004             1, 2, 4      4         5          2, 3      4
1005             1, 2, 3, 4                        2, 4      5
1006             2, 4                              3, 4      3

FIGURE 5.12 Identification of Frequent Itemsets in Apriori Algorithm.
itemset if it shows up in at least 3 out of 6 transactions in the database). Because all of the one-item itemsets have at least 3 in the support column, they are all considered frequent itemsets. However, had any of the one-item itemsets not been frequent, they would not have been included as a possible member of possible two-item pairs. In this way, Apriori prunes the tree of all possible itemsets. As Figure 5.12 shows, using one-item itemsets, all possible two-item itemsets are generated, and the transaction database is used to calculate their support values. Because the two-item itemset {1, 3} has a support less than 3, it should not be included in the frequent itemsets that will be used to generate the next-level itemsets (three-item itemsets). The algorithm seems deceivingly simple, but only for small data sets. In much larger data sets, especially those with huge numbers of items present in low quantities and small numbers of items present in big quantities, the search and calculation become a computationally intensive process.
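A minimal Python sketch of this bottom-up candidate generation, run against the six transactions of Figure 5.12 (the function name is our own):

def apriori_frequent_itemsets(transactions, min_support):
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({item for t in transactions for item in t})
    frequent = {}
    level = [frozenset([i]) for i in items]          # one-item candidates
    while level:
        survivors = [c for c in level if support(c) >= min_support]
        frequent.update({c: support(c) for c in survivors})
        # Candidate generation: extend surviving itemsets by one item.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1})
    return frequent

transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
for itemset, supp in sorted(apriori_frequent_itemsets(transactions, 3).items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)   # reproduces the supports in Figure 5.12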
SECTION 5.5 REVIEW QUESTIONS
1. Identify at least three of the main data mining methods.
2. Give examples of situations in which classification would be an appropriate data mining technique. Give examples of situations in which regression would be an appropriate data mining technique.
3. List and briefly define at least two classification techniques.
4. What are some of the criteria for comparing and selecting the best classification technique?
5. Briefly describe the general algorithm used in decision trees.
6. Define Gini index. What does it measure?
7. Give examples of situations in which cluster analysis would be an appropriate data mining technique.
8. What is the major difference between cluster analysis and classification?
9. What are some of the methods for cluster analysis?
10. Give examples of situations in which association would be an appropriate data mining technique.
5.6 DATA MINING SOFTWARE TOOLS
Many software vendors provide powerful data mining tools. Examples of these vendors include IBM (IBM SPSS Modeler, formerly known as SPSS PASW Modeler and Clementine), SAS (Enterprise Miner), StatSoft (Statistica Data Miner), KXEN (Infinite Insight), Salford (CART, MARS, TreeNet, RandomForest), Angoss (KnowledgeSTUDIO, KnowledgeSeeker), and Megaputer (PolyAnalyst). Noticeably but not surprisingly, the most popular data mining tools are developed by the well-established statistical software companies (SPSS, SAS, and StatSoft), largely because statistics is the foundation of data mining, and these companies have the means to cost-effectively develop them into full-scale data mining systems. Most of the business intelligence tool vendors (e.g., IBM Cognos, Oracle Hyperion, SAP Business Objects, MicroStrategy, Teradata, and Microsoft) also have some level of data mining capabilities integrated into their software offerings. These BI tools are still primarily focused on multidimensional modeling and data visualization and are not considered to be direct competitors of the data mining tool vendors.
In addition to these commercial tools, several open source and/or free data mining software tools are available online. Probably the most popular free (and open source) data mining tool is Weka, which is developed by a number of researchers from the University of Waikato in New Zealand (the tool can be downloaded from cs.waikato.ac.nz/ml/weka). Weka includes a large number of algorithms for different data mining tasks and has an intuitive user interface. Another recently released, free (for noncommercial use) data mining tool is RapidMiner (developed by Rapid-I; it can be downloaded from rapid-i.com). Its graphically enhanced user interface, employment of a rather large number of algorithms, and incorporation of a variety of data visualization features set it apart from the rest of the free tools. Another free and open source data mining tool with an appealing graphical user interface is KNIME (which can be downloaded from knime.org). The main difference between commercial tools, such as Enterprise Miner, IBM SPSS Modeler, and Statistica, and free tools, such as Weka, RapidMiner, and KNIME, is computational efficiency. The same data mining task involving a large data set may take a whole lot longer to complete with the free software, and for some algorithms may not even complete (i.e., crashing due to the inefficient use of computer memory). Table 5.3 lists a few of the major products and their Web sites.
A suite of business intelligence capabilities that has become increasingly more popular for data mining projects is Microsoft SQL Server, where data and the models are stored in the same relational database environment, making model management a considerably easier task. The Microsoft Enterprise Consortium serves as the worldwide source for access to Microsoft's SQL Server 2012 software suite for academic purposes (teaching and research). The consortium has been established to enable universities around the world to access enterprise technology without having to maintain the necessary hardware and software on their own campus. The consortium provides a wide range of business intelligence development tools (e.g., data mining, cube building, business reporting) as well as a number of large, realistic data sets from Sam's Club, Dillard's, and Tyson Foods. The Microsoft Enterprise Consortium is free of charge and can only be used for academic purposes. The Sam M. Walton College of Business at the University of Arkansas hosts the enterprise system and allows consortium members and their students to access these resources by using a simple remote desktop connection. The details about becoming a part of the consortium along with easy-to-follow tutorials and examples can be found at enterprise.waltoncollege.uark.edu.
TABLE 5.3 Selected Data Mining Software
Product Name Web Site (URL)
IBM SPSS Modeler ibm.com/software/analytics/spss/products/modeler/
SAS Enterprise Miner sas.com/technologies/bi/analytics/index.html
Statistica statsoft.com/products/dataminer.htm
Intelligent Miner ibm.com/software/data/iminer
PolyAnalyst megaputer.com/polyanalyst.php
CART, MARS, TreeNet, RandomForest salford-systems.com
Insightful Miner insightful.com
XLMiner xlminer.net
KXEN (Knowledge eXtraction ENgines) kxen.com
GhostMiner fqs.pl/ghostminer
Microsoft SQL Server Data Mining microsoft.com/sqlserver/2012/data-mining.aspx
Knowledge Miner knowledgeminer.net
Teradata Warehouse Miner ncr.com/products/software/teradata_mining.htm
Oracle Data Mining (ODM) otn.oracle.com/products/bi/9idmining.html
Fair Isaac Business Science fairisaac.com/edm
Delta Master bissantz.de
iData Analyzer infoacumen.com
Orange Data Mining Tool ailab.si/orange
Zementis Predictive Analytics zementis.com
In May 2012, kdnuggets.com conducted the thirteenth annual Software Poll on the following question: "What Analytics, Data Mining, and Big Data software have you used in the past 12 months for a real project (not just evaluation)?" Here are some of the interesting findings that came out of the poll:
• For the first time (in the last 13 years of polling on the same question), the number of users of free/open source software exceeded the number of users of commercial software.
• Among voters, 28 percent used commercial software but not free software, 30 percent used free software but not commercial, and 41 percent used both.
• The usage of Big Data tools grew fivefold: 15 percent used them in 2012, versus about 3 percent in 2011.
• R, RapidMiner, and KNIME are the most popular free/open source tools, while StatSoft's Statistica, SAS's Enterprise Miner, and IBM's SPSS Modeler are the most popular commercial data mining tools.
• Among those who wrote their own analytics code, R, SQL, Java, and Python were the most popular languages.
To reduce bias through multiple voting, in this poll kdnuggets.com used e-mail verification, which reduced the total number of votes compared to 2011 but made the results more representative. The results for data mining software tools are shown in Figure 5.13, while the results for Big Data software tools and for the platforms/languages used for writing one's own analytics code are shown in Figure 5.14.
Application Case 5.6 is about a research study in which a number of software tools and data mining techniques are used to build models that predict the financial success (box-office receipts) of Hollywood movies while they are nothing more than ideas.
[Figure: horizontal bar chart of poll votes per tool. Leading entries: R (245), Excel (238), Rapid-I RapidMiner (213), KNIME (174), Weka/Pentaho (118), StatSoft Statistica (112), SAS (101), Rapid-I RapidAnalytics (83), MATLAB (80), IBM SPSS Statistics (62), IBM SPSS Modeler (54), and SAS Enterprise Miner (46), followed by Orange, Microsoft SQL Server, other free software, TIBCO Spotfire/S+/Miner, Tableau, Oracle Data Miner, other commercial software, JMP, Mathematica, Miner3D, IBM Cognos, Stata, Zementis, KXEN, Bayesia, C4.5/C5.0/See5, Revolution Computing, Salford SPM/CART/MARS/TreeNet/RF, XLSTAT, SAP (BusinessObjects/Sybase/Hana), Angoss, RapidInsight/Veera, Teradata Miner, and 11Ants Analytics, down to WordStat (3) and Predixion Software (3).]
FIGURE 5.13 Popular Data Mining Software Tools (Poll Results). Source: Used with permission of kdnuggets.com.
[Figure: two bar charts. "Big Data software tools/platforms used for your analytics projects": Apache Hadoop/Hbase/Pig/Hive, Amazon Web Services (AWS), NoSQL databases, other Hadoop-based tools, and other Big Data software. "Platforms/languages used for your own analytics code": R (245), SQL (185), Java (138), Python (119), C/C++ (66), other languages (57), Perl (37), Awk/Gawk/Shell, and F#.]
FIGURE 5.14 Popular Big Data Software Tools and Platforms/Languages Used. Source: Results of a poll conducted by kdnuggets.com.
Application Case 5.6
Data Mining Goes to Hollywood: Predicting Financial Success of Movies
Predicting box-office receipts (i.e., financial success)
of a particular motion picture is an interesting and
challenging problem. According to some domain
experts, the movie industry is the "land of hunches
and wild guesses" due to the difficulty associated
with forecasting product demand, making the
movie business in Hollywood a risky endeavor.
In support of such observations, Jack Valenti (the
longtime president and CEO of the Motion Picture
Association of America) once mentioned that " ... no
one can tell you how a movie is going to do in the
marketplace ... not until the film opens in darkened
theatre and sparks fly up between the screen and
the audience." Entertainment industry trade journals and magazines have been full of examples, statements, and experiences that support such a claim.
Like many other researchers who have attempted
to shed light on this challenging real-world problem,
Ramesh Sharda and Dursun Delen have been exploring the use of data mining to predict the financial performance of a motion picture at the box office before it even enters production (while the movie is nothing more than a conceptual idea). In their highly publicized prediction models, they convert the forecasting
(or regression) problem into a classification problem; that is, rather than forecasting the point estimate of box-office receipts, they classify a movie based on its box-office receipts in one of nine categories, ranging from "flop" to "blockbuster," making the problem a multinomial classification problem. Table 5.4 illustrates the definition of the nine classes in terms of the range of box-office receipts.

TABLE 5.4 Movie Classification Based on Receipts

Class No.   Range (in millions of dollars)
1           < 1 (Flop)
2           > 1, < 10
3           > 10, < 20
4           > 20, < 40
5           > 40, < 65
6           > 65, < 100
7           > 100, < 150
8           > 150, < 200
9           > 200 (Blockbuster)
Data
Data was collected from a variety of movie-related databases (e.g., ShowBiz, IMDb, IMSDb, AllMovie, etc.) and consolidated into a single data set. The data set for the most recently developed models contained 2,632 movies released between 1998 and 2006. A summary of the independent variables along with their specifications is provided in Table 5.5. For more descriptive details and justification for inclusion of these independent variables, the reader is referred to Sharda and Delen (2007).
Methodology
Using a variety of data mining methods, including neural networks, decision trees, support vector machines, and three types of ensembles, Sharda and Delen developed the prediction models. The data from 1998 to 2005 were used as training data to build the prediction models, and the data from 2006 were used as the test data to assess and compare the models' prediction accuracy. Figure 5.15 shows a screenshot of IBM SPSS Modeler (formerly the Clementine data mining tool) depicting the process map employed for the prediction problem. The upper-left side of the process map shows the model development process, and the lower-right corner of the process map shows the model assessment (i.e., testing or scoring) process (more details on the IBM SPSS Modeler tool and its usage can be found on the book's Web site).

[Screenshot of the IBM SPSS Modeler process map: the model development process (upper left) feeds the trained models into the model assessment process (lower right).]
FIGURE 5.15 Process Flow Screenshot for the Box-Office Prediction System. Source: Used with permission from IBM SPSS.

TABLE 5.4 Movie Classification Based on Receipts

Class No.   Range (in millions of dollars)
1           < 1 (Flop)
2           > 1 and < 10
3           > 10 and < 20
4           > 20 and < 40
5           > 40 and < 65
6           > 65 and < 100
7           > 100 and < 150
8           > 150 and < 200
9           > 200 (Blockbuster)

TABLE 5.5 Summary of Independent Variables

Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                  High, Medium, Low
Sequel                 2                  Yes, No
Number of screens      -                  A positive integer between 1 and 3876

Results
Table 5.6 provides the prediction results of all three individual data mining methods as well as the results of the three different ensembles. The first performance measure is the percent correct classification rate, which is called bingo. Also reported in the table is the 1-Away correct classification rate (i.e., within one category). The results indicate that SVM performed the best among the individual prediction models, followed by ANN; the worst of the three
was the CART decision tree algorithm. In general, the ensemble models performed better than the individual prediction models, of which the fusion algorithm performed the best. What is probably more important to decision makers, and standing out in the results table, is the significantly lower standard deviation obtained from the ensembles compared to the individual models.
TABLE 5.6 Tabulated Prediction Results for Individual and Ensemble Models

                         Individual Models            Ensemble Models
Performance Measure      SVM      ANN      CART       Random Forest   Boosted Tree   Fusion (Average)
Count (Bingo)            192      182      140        189             187            194
Count (1-Away)           104      120      126        121             104            120
Accuracy (% Bingo)       55.49%   52.60%   40.46%     54.62%          54.05%         56.07%
Accuracy (% 1-Away)      85.55%   87.28%   76.88%     89.60%          84.10%         90.75%
Standard deviation       0.93     0.87     1.05       0.76            0.84           0.63

Conclusion
The researchers claim that these prediction results are better than any reported in the published literature for
this problem domain. Beyond the attractive accuracy of their prediction results of the box-office receipts, these models could also be used to further analyze (and potentially optimize) the decision variables in order to maximize the financial return. Specifically, the parameters used for modeling could be altered using the already trained prediction models in order to better understand the impact of different parameters on the end results. During this process, which is commonly referred to as sensitivity analysis, the decision maker of a given entertainment firm could find out, with a fairly high accuracy level, how much value a specific actor (or a specific release date, or the addition of more technical effects, etc.) brings to the financial success of a film, making the underlying system an invaluable decision aid.
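To make the idea concrete, the sketch below is a toy version of this kind of sensitivity analysis: hold a candidate movie profile fixed, vary one decision variable, and watch how a trained model's predicted class changes. The model, the three-variable profile, and all values are invented for illustration; they are not the authors' actual models.

```python
# Toy sensitivity analysis: vary one input, hold the rest fixed, observe the
# prediction. All features, values, and class labels are made up.
from sklearn.tree import DecisionTreeClassifier

# toy training data: [star_value, special_effects, screens] -> success class
X = [[0, 0, 500], [1, 1, 1500], [2, 2, 3000], [0, 1, 800], [2, 1, 2500]]
y = [1, 3, 7, 2, 6]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

base = [0, 1, 1500]  # a candidate movie profile
for star_value in (0, 1, 2):  # Low, Medium, High
    scenario = [star_value] + base[1:]
    print(f"star value={star_value} -> predicted class {model.predict([scenario])[0]}")
```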
QUESTIONS FOR DISCUSSION
1. Why is it important for Hollywood professionals to predict the financial success of movies?
2. How can data mining be used to predict the financial success of movies before the start of their production process?
3. How do you think Hollywood performed, and perhaps is still performing, this task without the help of data mining tools and techniques?
Sources: R. Sharda and D. Delen, "Predicting Box-Office Success of Motion Pictures with Neural Networks," Expert Systems with Applications, Vol. 30, 2006, pp. 243-254; D. Delen, R. Sharda, and P. Kumar, "Movie Forecast Guru: A Web-based DSS for Hollywood Managers," Decision Support Systems, Vol. 43, No. 4, 2007, pp. 1151-1170.
SECTION 5.6 REVIEW QUESTIONS
1. What are the most popular commercial data mining tools?
2. Why do you think the most popular tools are developed by statistics companies?
3. What are the most popular free data mining tools?
4. What are the main differences between commercial and free data mining software tools?
5. What would be your top five selection criteria for a data mining tool? Explain.
5.7 DATA MINING PRIVACY ISSUES, MYTHS, AND BLUNDERS
Data Mining and Privacy Issues
Data that is collected, stored, and analyzed in data mining often contains information about real people. Such information may include identification data (name, address, Social Security number, driver's license number, employee number, etc.), demographic data (e.g., age, sex, ethnicity, marital status, number of children, etc.), financial data (e.g., salary, gross family income, checking or savings account balance, home ownership, mortgage or loan account specifics, credit card limits and balances, investment account specifics, etc.), purchase history (i.e., what is bought from where and when, either from the vendor's transaction records or from credit card transaction specifics), and other personal data (e.g., anniversary, pregnancy, illness, loss in the family, bankruptcy filings, etc.). Most of these data can be accessed through some third-party data providers. The main question here is the privacy of the person to whom the data belongs. In order to maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations. One way to accomplish this is the process of de-identification of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual. Many publicly available data sources (e.g., CDC data, SEER data, UNOS data, etc.) are already de-identified. Prior to accessing these data sources, users are often asked to consent that under no circumstances will they try to identify the individuals behind those figures.
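As a minimal illustration of this de-identification step (a hypothetical sketch: the column names and the salted-hash approach are illustrative choices, not a prescribed standard), direct identifiers can be dropped and the record key replaced with a one-way hash before the data reach the mining tool:

```python
import hashlib

import pandas as pd

# Hypothetical customer extract; column names are invented for illustration.
customers = pd.DataFrame({
    "ssn": ["111-22-3333", "444-55-6666"],
    "name": ["Jane Doe", "John Roe"],
    "age": [34, 51],
    "salary": [72000, 58000],
})

SALT = "replace-with-a-secret-salt"  # kept out of the released data set

def pseudonymize(value: str) -> str:
    """One-way hash so records can be linked without exposing identity."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

# Replace the identifying key with its hash, then drop direct identifiers.
deidentified = customers.assign(record_id=customers["ssn"].map(pseudonymize))
deidentified = deidentified.drop(columns=["ssn", "name"])
print(deidentified)
```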
There have been a number of instances in the recent past where companies shared their customer data with others without seeking the explicit consent of their customers. For instance, as most of you might recall, in 2003, JetBlue Airways provided more than a million passenger records of their customers to Torch Concepts, a U.S. government contractor. Torch then subsequently augmented the passenger data with additional information such as family size and Social Security numbers, purchased from a data broker called Acxiom. The consolidated personal database was intended to be used for a data mining project in order to develop potential terrorist profiles. All of this was done without notification or consent of passengers. When news of the activities got out, however, dozens of privacy lawsuits were filed against JetBlue, Torch, and Acxiom, and several U.S. senators called for an investigation into the incident (Wald, 2004). Similar, but not as dramatic, privacy-related news has come out in the recent past about popular social network companies that allegedly were selling customer-specific data to other companies for personalized target marketing.
There was another peculiar story about privacy concerns that made it into the headlines in 2012. In this instance, the company did not even use any private and/or personal data. Legally speaking, there was no violation of any laws. It was about Target and is summarized in Application Case 5.7.
Application Case 5.7
Predicting Customer Buying Patterns: The Target Story
In early 2012, an infamous story appeared concerning Target's practice of predictive analytics. The story was about a teenage girl who was being sent advertising flyers and coupons by Target for the kinds of things that a new mother-to-be would buy from a store like Target. The story goes like this: An angry man went into a Target outside of Minneapolis, demanding to talk to a manager: "My daughter got this in the mail!" he said. "She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?" The manager didn't have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man's daughter and contained advertisements for maternity clothing, nursery furniture, and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. "I had a talk with my daughter," he said. "It turns out there's been some activities in my house I haven't been completely aware of. She's due in August. I owe you an apology."
As it turns out, Target figured out a teen girl was pregnant before her father did! Here is how they did it. Target assigns every customer a Guest ID number (tied to their credit card, name, or e-mail address) that becomes a placeholder that keeps a history of everything they have bought. Target augments this data with any demographic information that it has collected from them or bought from other information sources. Using this information, Target looked at historical buying data for all the females who had signed up for Target baby registries in the past. They analyzed the data from all directions, and soon enough some useful patterns emerged. For example, lotions and special vitamins were among the products with interesting purchase patterns. Lots of people buy lotion, but the analysts noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium, and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals that they could be getting close to their delivery date. In the end, the analysts were able to identify about 25 products that, when analyzed together, allowed them to assign each shopper a
"pregnancy prediction" score. More important, they
could also estimate a woman's due date to within a
small window, so Target could send coupons timed
to very specific stages of her pregnancy.
If you look at this practice from a legal p e rspec-
tive, you would conclude that Target did not use any
information that violates customer privacy; rather,
they used transactional data that most every o the r
retail chain is collecting and storing (and perhaps
analyzing) about their customers. What was disturb-
ing in this scenario was perhaps the targeted concept:
pregnancy. There are certain events or concepts that
should be off limits or treated extremely cautiously,
such as terminal disease, divorce, and bankruptcy.
QUESTIONS FOR DISCUSSION
1. What do you think about data mining and its implications concerning privacy? What is the threshold between knowledge discovery and privacy infringement?
2. Did Target go too far? Did they do anything illegal? What do you think they should have done? What do you think they should do now (quit these types of practices)?
Sources: K. Hill, "How Target Figured Out a Teen Girl Was Pregnant Before Her Father Did," Forbes, February 13, 2012; and R. Nolan, "Behind the Cover Story: How Much Does Target Know?" NYTimes.com, February 21, 2012.
Data Mining Myths and Blunders
Data mining is a powerful analytical tool that enables business executives to advance from describing the nature of the past to predicting the future. It helps marketers find patterns that unlock the mysteries of customer behavior. The results of data mining can be used to increase revenue, reduce expenses, identify fraud, and locate business opportunities, offering a whole new realm of competitive advantage. As an evolving and maturing field, data mining is often associated with a number of myths, including the following (Zaima, 2003):
Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires deliberate, proactive design and use.

Myth: Data mining is not yet viable for business applications.
Reality: The current state of the art is ready to go for almost any business.

Myth: Data mining requires a separate, dedicated database.
Reality: Because of advances in database technology, a dedicated database is not required, even though it may be desirable.

Myth: Only those with advanced degrees can do data mining.
Reality: Newer Web-based tools enable managers of all educational levels to do data mining.

Myth: Data mining is only for large firms that have lots of customer data.
Reality: If the data accurately reflect the business or its customers, a company can use data mining.
Data mining visionaries have gained enormous competitive advantage by understanding that these myths are just that: myths.
The following 10 data mining mistakes are often made in practice (Skalak, 2001; Shultz, 2004), and you should try to avoid them:
1. Selecting the wrong problem for data mining.
2. Ignoring what your sponsor thinks data mining is and what it really can and cannot do.
3. Leaving insufficient time for data preparation. It takes more effort than is generally understood.
4. Looking only at aggregated results and not at individual records. IBM's DB2 IMS can highlight individual records of interest.
5. Being sloppy about keeping track of the data mining procedure and results.
6. Ignoring suspicious findings and quickly moving on.
7. Running mining algorithms repeatedly and blindly. It is important to think hard about the next stage of data analysis. Data mining is a very hands-on activity.
8. Believing everything you are told about the data.
9. Believing everything you are told about your own data mining analysis.
10. Measuring your results differently from the way your sponsor measures them.
SECTION 5.7 REVIEW QUESTIONS
1. What are the privacy issues in data mining?
2. How do you think the discussion between privacy and data mining will progress? Why?
3. What are the most common myths about data mining?
4. What do you think are the reasons for these myths about data mining?
5. What are the most common data mining mistakes/blunders? How can they be minimized and/or eliminated?
Chapter Highlights
• Data mining is the process of discovering new knowledge from databases.
• Data mining can use simple flat files as data sources, or it can be performed on data in data warehouses.
• There are many alternative names and definitions for data mining.
• Data mining is at the intersection of many disciplines, including statistics, artificial intelligence, and mathematical modeling.
• Companies use data mining to better understand their customers and optimize their operations.
• Data mining applications can be found in virtually every area of business and government, including healthcare, finance, marketing, and homeland security.
• Three broad categories of data mining tasks are prediction (classification or regression), clustering, and association.
• Similar to other information systems initiatives, a data mining project must follow a systematic project management process to be successful.
• Several data mining processes have been proposed: CRISP-DM, SEMMA, KDD, and so forth.
• CRISP-DM provides a systematic and orderly way to conduct data mining projects.
• The earlier steps in data mining projects (i.e., understanding the domain and the relevant data) consume most of the total project time (often more than 80% of the total time).
• Data preprocessing is essential to any successful data mining study. Good data leads to good information; good information leads to good decisions.
• Data preprocessing includes four main steps: data consolidation, data cleaning, data transformation, and data reduction.
• Classification methods learn from previous examples containing inputs and the resulting class labels, and once properly trained they are able to classify future cases.
• Clustering partitions pattern records into natural segments or clusters. Each segment's members share similar characteristics.
• A number of different algorithms are commonly used for classification. Commercial implementations include ID3, C4.5, C5, CART, and SPRINT.
• Decision trees partition data by branching along different attributes so that each leaf node has all the patterns of one class.
• The Gini index and information gain (entropy) are two popular ways to determine branching choices in a decision tree.
• The Gini index measures the purity of a sample. If everything in a sample belongs to one class, the Gini index value is zero.
• Several assessment techniques can measure the prediction accuracy of classification models, including simple split, k-fold cross-validation, bootstrapping, and area under the ROC curve.
• Cluster algorithms are used when the data records do not have predefined class identifiers (i.e., it is not known to what class a particular record belongs).
• Cluster algorithms compute measures of similarity in order to group similar cases into clusters.
• The most commonly used similarity measure in cluster analysis is a distance measure.
• The most commonly used clustering algorithms are k-means and self-organizing maps.
• Association rule mining is used to discover two or more items (or events or concepts) that go together.
• Association rule mining is commonly referred to as market-basket analysis.
• The most commonly used association algorithm is Apriori, whereby frequent itemsets are identified through a bottom-up approach.
• Association rules are assessed based on their support and confidence measures.
• Many commercial and free data mining tools are available.
• The most popular commercial data mining tools are SPSS PASW and SAS Enterprise Miner.
• The most popular free data mining tools are Weka and RapidMiner.

Key Terms

Apriori algorithm; area under the ROC curve; association; bootstrapping; categorical data; classification; clustering; confidence; CRISP-DM; data mining; decision tree; distance measure; entropy; Gini index; information gain; interval data; k-fold cross-validation; knowledge discovery in databases (KDD); lift; link analysis; Microsoft Enterprise Consortium; Microsoft SQL Server; nominal data; numeric data; ordinal data; prediction; RapidMiner; ratio data; regression; SEMMA; sequence mining; simple split; support; Weka

Questions for Discussion
1. Define data mining. Why are there many names and definitions for data mining?
2. What are the main reasons for the recent popularity of data mining?
3. Discuss what an organization should consider before making a decision to purchase data mining software.
4. Distinguish data mining from other analytical tools and techniques.
5. Discuss the main data mining methods. What are the fundamental differences among them?
6. What are the main data mining application areas? Discuss the commonalities of these areas that make them a prospect for data mining studies.
7. Why do we need a standardized data mining process? What are the most commonly used data mining processes?
8. Discuss the differences between the two most commonly used data mining processes.
9. Are data mining processes a mere sequential set of activities? Explain.
10. Why do we need data preprocessing? What are the main tasks and relevant techniques used in data preprocessing?
11. Discuss the reasoning behind the assessment of classification models.
12. What is the main difference between classification and clustering? Explain using concrete examples.
13. Moving beyond the chapter discussion, where else can association be used?
14. What are the privacy issues with data mining? Do you think they are substantiated?
15. What are the most common myths and mistakes about data mining?
Exercises
Teradata University Network (TUN) and Other Hands-on Exercises
1. Visit teradatauniversitynetwork.com. Identify case studies and white papers about data mining. Describe recent developments in the field.
2. Go to teradatauniversitynetwork.com or a URL provided by your instructor. Locate Web seminars related to data mining. In particular, locate a seminar given by C. Imhoff and T. Zouqes. Watch the Web seminar. Then answer the following questions:
a. What are some of the interesting applications of data mining?
b. What types of payoffs and costs can organizations expect from data mining initiatives?
3. For this exercise, your goal is to build a model to identify inputs or predictors that differentiate risky customers from others (based on patterns pertaining to previous customers) and then use those inputs to predict new risky customers. This sample case is typical for this domain.
The sample data to be used in this exercise are in Online File W5.1 in the file CreditRisk.xlsx. The data set has 425 cases and 15 variables pertaining to past and current customers who have borrowed from a bank for various reasons. The data set contains customer-related information such as financial standing, reason for the loan, employment, demographic information, and the outcome or dependent variable for credit standing, classifying each case as good or bad, based on the institution's past experience.
Take 400 of the cases as training cases and set aside the other 25 for testing. Build a decision tree model to learn the characteristics of the problem. Test its performance on the other 25 cases. Report on your model's learning and testing performance. Prepare a report that identifies the decision tree model and training parameters, as well as the resulting performance on the test set. Use any decision tree software; a minimal scikit-learn sketch appears after Exercise 5. (This exercise is courtesy of StatSoft, Inc., based on a German data set from ftp.ics.uci.edu/pub/machine-learning-databases/statlog/german renamed CreditRisk and altered.)
4. For this exercise, you will replicate (on a smaller scale) the box-office prediction modeling explained in Application Case 5.6. Download the training data set from Online File W5.2, MovieTrain.xlsx, which is in Microsoft Excel format. Use the data description given in Application Case 5.6 to understand the domain and the problem you are trying to solve. Pick and choose your independent variables. Develop at least three classification models (e.g., decision tree, logistic regression, neural networks). Compare the accuracy results using 10-fold cross-validation and percentage split techniques, use confusion matrices, and comment on the outcome. Test the models you have developed on the test set (see Online File W5.3, MovieTest.xlsx). Analyze the results with different models and come up with the best classification model, supporting it with your results.
5. This exercise is aimed at introducing you to association rule mining. The Excel data set basketslntrans.xlsx has around 2,800 observations/records of supermarket transaction data. Each record contains the customer's ID and the products that they have purchased. Use this data set to understand the relationships among products (i.e., which products are purchased together). Look for interesting relationships and add screenshots of any subtle association patterns that you might find. More specifically, answer the following questions.
Which association rules do you think are most important?
Based on some of the association rules you found, make at least three business recommendations that might be beneficial to the company. These recommendations may include ideas about shelf organization, upselling, or cross-selling products. (Bonus points will be given to new/innovative ideas.)
What are the Support, Confidence, and Lift values for the following rule? (A worked sketch of these three measures follows this exercise list.)
Wine, Canned Veg → Frozen Meal
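For the last question above, here is a minimal sketch of how Support, Confidence, and Lift are computed for a rule. The handful of transactions is invented for illustration and is not taken from the exercise's data set:

```python
# Support, confidence, and lift for the rule {Wine, Canned Veg} -> {Frozen Meal},
# computed over a tiny made-up transaction list.
transactions = [
    {"Wine", "Canned Veg", "Frozen Meal"},
    {"Wine", "Canned Veg"},
    {"Frozen Meal", "Beer"},
    {"Wine", "Canned Veg", "Frozen Meal", "Fish"},
]
n = len(transactions)
antecedent, consequent = {"Wine", "Canned Veg"}, {"Frozen Meal"}

both = sum(1 for t in transactions if antecedent | consequent <= t) / n
ante = sum(1 for t in transactions if antecedent <= t) / n
cons = sum(1 for t in transactions if consequent <= t) / n

support = both              # P(antecedent and consequent together)
confidence = both / ante    # P(consequent | antecedent)
lift = confidence / cons    # confidence relative to the consequent's baseline
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```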
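And for Exercise 3, a minimal sketch of the requested train/test workflow; scikit-learn stands in for "any decision tree software," and the target column name (CreditStanding) is an assumption about the file layout, so adjust it to the actual file:

```python
# Decision tree on CreditRisk.xlsx: first 400 cases for training, last 25 for
# testing. The target column name "CreditStanding" is assumed, not confirmed.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_excel("CreditRisk.xlsx")
X = pd.get_dummies(data.drop(columns=["CreditStanding"]))  # encode categoricals
y = data["CreditStanding"]

X_train, X_test = X.iloc[:400], X.iloc[400:]
y_train, y_test = y.iloc[:400], y.iloc[400:]

tree = DecisionTreeClassifier(max_depth=5, random_state=1)  # parameters to report
tree.fit(X_train, y_train)

print("Training accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```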
Team Assignments and Role-Playing Projects
1. Examine how new data-capture devices such as radio-frequency identification (RFID) tags help organizations accurately identify and segment their customers for activities such as targeted marketing. Many of these applications involve data mining. Scan the literature and the Web and then propose five potential new data mining applications that can use the data created with RFID technology. What issues could arise if a country's laws required such devices to be embedded in everyone's body for a national identification system?
2. Interview administrators in your college or executives in your organization to determine how data warehousing, data mining, OLAP, and visualization tools could assist them in their work. Write a proposal describing your findings. Include cost estimates and benefits in your report.
3. A very good repository of data that has been used to test the performance of many data mining algorithms is available at ics.uci.edu/~mlearn/MLRepository.html. Some of the data sets are meant to test the limits of current machine-learning algorithms and to compare their performance with new approaches to learning. However, some of the smaller data sets can be useful for exploring the functionality of any data mining software, or the software that is available as companion software with this book, such as Statistica Data Miner. Download at least one data set from this repository (e.g., Credit Screening Databases, Housing Database) and apply decision tree or clustering methods, as appropriate. Prepare a report based on your results. (Some of these exercises may be used as semester-long term projects, for example.)
4. There are large and feature-rich data sets made available by the U.S. government or its subsidiaries on the Internet: for instance, the Centers for Disease Control and Prevention data sets (cdc.gov/DataStatistics), the National Cancer Institute's Surveillance Epidemiology and End Results data sets (seer.cancer.gov/data), and the Department of Transportation's Fatality Analysis Reporting System crash data sets (nhtsa.gov/FARS). These data sets are not preprocessed for data mining, which makes them a great resource for experiencing the complete data mining process. Another rich source for a collection of analytics data sets is listed on KDNuggets.com (kdnuggets.com/datasets/index.html).
5. Consider the following data set, which includes three attributes and a classification for admission decisions into an MBA program:
a. Using the data shown, develop your own manual expert rules for decision making.
b. Use the Gini index to build a decision tree. You can use manual calculations or a spreadsheet to perform the basic calculations; a small worked sketch follows this table.
c. Use an automated decision tree software program to build a tree for the same data.

GMAT   GPA    Quantitative GMAT Score (percentile)   Decision
650    2.75   35                                     No
580    3.50   70                                     No
600    3.50   75                                     Yes
450    2.95   80                                     No
700    3.25   90                                     Yes
590    3.50   80                                     Yes
400    3.85   45                                     No
640    3.50   75                                     Yes
540    3.00   60                                     ?
690    2.85   80                                     ?
490    4.00   65                                     ?
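For part (b), the small sketch below computes the Gini index of one candidate split over the eight labeled records. The split attribute and threshold are an arbitrary first candidate; a full tree would compare all candidate splits this way and branch on the one with the lowest weighted impurity:

```python
# Gini impurity of a candidate split on the eight labeled admission records.
records = [  # (GMAT, GPA, quant_percentile, decision)
    (650, 2.75, 35, "No"), (580, 3.50, 70, "No"), (600, 3.50, 75, "Yes"),
    (450, 2.95, 80, "No"), (700, 3.25, 90, "Yes"), (590, 3.50, 80, "Yes"),
    (400, 3.85, 45, "No"), (640, 3.50, 75, "Yes"),
]

def gini(group):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure)."""
    if not group:
        return 0.0
    p_yes = sum(1 for r in group if r[3] == "Yes") / len(group)
    return 1.0 - (p_yes ** 2 + (1.0 - p_yes) ** 2)

# Candidate split: Quantitative GMAT percentile >= 70
left = [r for r in records if r[2] >= 70]
right = [r for r in records if r[2] < 70]
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
print(f"left={gini(left):.3f} right={gini(right):.3f} weighted={weighted:.3f}")
```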
Internet Exercises
1. Visit the AI Exploratorium at cs.ualberta.ca/~aixplore. Click the Decision Tree link. Read the narrative on basketball game statistics. Examine the data and then build a decision tree. Report your impressions of the accuracy of this decision tree. Also, explore the effects of different algorithms.
2. Survey some data mining tools and vendors. Start with fairisaac.com and egain.com. Consult dmreview.com and identify some data mining products and service providers that are not mentioned in this chapter.
3. Find recent cases of successful data mining applications. Visit the Web sites of some data mining vendors and look for cases or success stories. Prepare a report summarizing five new case studies.
4. Go to vendor Web sites (especially those of SAS, SPSS, Cognos, Teradata, StatSoft, and Fair Isaac) and look at success stories for BI (OLAP and data mining) tools. What do the various success stories have in common? How do they differ?
5. Go to statsoft.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
6. Go to sas.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
7. Go to spss.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
8. Go to teradata.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
9. Go to fairisaac.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
10. Go to salfordsystems.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
11. Go to rulequest.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
12. Go to kdnuggets.com. Explore the sections on applications as well as software. Find names of at least three additional packages for data mining and text mining.
End-of-Chapter Application Case
Macys.com Enhances Its Customers' Shopping Experience with Analytics
After more than 80 years in business, Macy's Inc. is one of America's most iconic retailers. With annual revenues exceeding $20 billion, Macy's enjoys a loyal base of customers who come to its stores and shop online each day. To continue its legacy of providing stellar customer service and the right selection of products, the retailer's e-commerce division, Macys.com, is using analytics to better understand and enhance its customers' online shopping experience, while helping to increase the retailer's overall profitability.
To more effectively measure and understand the impact of its online marketing initiatives on Macy's store sales, Macys.com increased its analytical capabilities with SAS Enterprise Miner (one of the premier data mining tools in the market), resulting in an e-mail subscription churn reduction of 20 percent. It also uses SAS to automate report generation, saving more than $500,000 a year in comp analyst time.
Ending "One Size Fits All" E-Mail Marketing
"We want to understand customer lifetime value," explains Kerem Tomak, vice president of analytics for Macys.com. "We want to understand how long our customers have been with us, how often an e-mail from us triggers a visit to our site. This helps us better understand who our best customers are and how engaged they are with us. [With that knowledge] we can give our valuable customers the right promotions in order to serve them the best way possible.
"Customers share a lot of information with us (their likes and dislikes), and our task is to support them in return for their loyalty by providing them with what they want, instantly," adds Tomak. Macys.com uses Hadoop as a data platform for SAS Enterprise Miner.
Initially, Tomak was worried that segmenting customers and sending fewer, but more specific, e-mails would reduce traffic to the Web site. "The general belief was that we had to blast everyone," Tomak said. Today, e-mails are sent less frequently, but with more thought, and the retailer has reduced its subscription churn rate by approximately 20 percent.
Time Savings, Lower Costs
Tomak's group is responsible for creating a variety of mission-critical reports (some daily, some weekly, others monthly) that go to employees in marketing and finance. These data-rich reports were taking analysts 4 to 12 hours to produce, much of it busy work that involved cutting and pasting from Excel spreadsheets. Macys.com is now using SAS to automate the reports. "This cuts the time dramatically. It saves us more than $500,000 a year in terms of comp FTE hours saved, a really big impact," Tomak says, noting that the savings began within about 3 months of installing SAS.
Now his staff can maximize time spent on providing value-added analyses and insights to provide content, products, and offers that guarantee a personalized shopping experience for Macys.com customers.
"Macy's is a very information-hungry organization, and requests for ad hoc reports come from all over the company. These streamlined systems eliminate error, guarantee accuracy, and increase the speed with which we can address requests," Tomak says. "Each time we use the software, we find new ways of doing things, and we are more and more impressed by the speed at which it churns out data and models."
Moving Forward
"With the extra time, the team has moved from being reactionary to proactive, meaning they can examine more data, spend quality time analyzing, and become internal consultants who provide more insight behind the data," he says. "This will be important to supporting the strategy and driving the next generation of Macys.com."
As competition increases in the online retailing world, Tomak says there is a push toward generating more accurate, real-time decisions about customer preferences. The ability to gain customer insight across channels is a critical part of improving customer satisfaction and revenues, and Macys.com uses SAS Enterprise Miner to validate and guide the site's cross- and up-sell offer algorithms.
Source: www.sas.com/success/macy.html.
References
Bhandari, I., E. Colet, J. Parker, Z. Pines, R. Pratap, and K. Ramanujam. (1997). "Advanced Scout: Data Mining and Knowledge Discovery in NBA Data." Data Mining and Knowledge Discovery, Vol. 1, No. 1, pp. 121-125.
Buck, N. (December 2000/January 2001). "Eureka! Knowledge Discovery." Software Magazine.
Chan, P. K., W. Fan, A. Prodromidis, and S. Stolfo. (1999). "Distributed Data Mining in Credit Card Fraud Detection." IEEE Intelligent Systems, Vol. 14, No. 6, pp. 67-74.
CRISP-DM. (2013). "Cross-Industry Standard Process for Data Mining (CRISP-DM)." www.the-modeling-agency.com/crisp-dm (accessed February 2, 2013).
Davenport, T. H. (2006, January). "Competing on Analytics." Harvard Business Review.
Delen, D., R. Sharda, and P. Kumar. (2007). "Movie Forecast Guru: A Web-based DSS for Hollywood Managers." Decision Support Systems, Vol. 43, No. 4, pp. 1151-1170.
Delen, D., D. Cogdell, and N. Kasap. (2012). "A Comparative Analysis of Data Mining Methods in Predicting NCAA Bowl Outcomes." International Journal of Forecasting, Vol. 28, pp. 543-552.
Delen, D. (2009). "Analysis of Cancer Data: A Data Mining Approach." Expert Systems, Vol. 26, No. 1, pp. 100-112.
Delen, D., G. Walker, and A. Kadam. (2005). "Predicting Breast Cancer Survivability: A Comparison of Three Data Mining Methods." Artificial Intelligence in Medicine, Vol. 34, No. 2, pp. 113-127.
Dunham, M. (2003). Data Mining: Introductory and Advanced Topics. Upper Saddle River, NJ: Prentice Hall.
EPIC. (2013). Electronic Privacy Information Center. "Case Against JetBlue Airways Corporation and Acxiom Corporation." epic.org/privacy/airtravel/jetblue/ftccomplaint.html (accessed January 14, 2013).
Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. (1996). "From Data Mining to Knowledge Discovery in Databases." AI Magazine, Vol. 17, No. 3, pp. 37-54.
Hoffman, T. (1998, December 7). "Banks Turn to IT to Reclaim Most Profitable Customers." Computerworld.
Hoffman, T. (1999, April 19). "Insurers Mine for Age-Appropriate Offering." Computerworld.
Kohonen, T. (1982). "Self-Organized Formation of Topologically Correct Feature Maps." Biological Cybernetics, Vol. 43, No. 1, pp. 59-69.
Nemati, H. R., and C. D. Barko. (2001). "Issues in Organizational Data Mining: A Survey of Current Practices." Journal of Data Warehousing, Vol. 6, No. 1, pp. 25-36.
North, M. (2012). Data Mining for the Masses. A Global Text Project Book. sites.google.com/site/dataminingforthemasses (accessed June 2013).
Quinlan, J. R. (1986). "Induction of Decision Trees." Machine Learning, Vol. 1, pp. 81-106.
SEMMA. (2009). "SAS's Data Mining Process: Sample, Explore, Modify, Model, Assess." sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html (accessed August 2009).
Sharda, R., and D. Delen. (2006). "Predicting Box-Office Success of Motion Pictures with Neural Networks." Expert Systems with Applications, Vol. 30, pp. 243-254.
Shultz, R. (2004, December 7). "Live from NCDM: Tales of Database Buffoonery." directmag.com/news/ncdm-12-07-04/index.html (accessed April 2009).
Skalak, D. (2001). "Data Mining Blunders Exposed!" DB2 Magazine, Vol. 6, No. 2, pp. 10-13.
StatSoft. (2006). "Data Mining Techniques." statsoft.com/textbook/stdatmin.html (accessed August 2006).
Wald, M. L. (2004). "U.S. Calls Release of JetBlue Data Improper." The New York Times, February 21, 2004.
Wilson, R., and R. Sharda. (1994). "Bankruptcy Prediction Using Neural Networks." Decision Support Systems, Vol. 11, pp. 545-557.
Wright, C. (2012). "Statistical Predictors of March Madness: An Examination of the NCAA Men's Basketball Championship." http://economics-files.pomona.edu/GarySmith/Econ190/Wright%20March%20Madness%20Final%20Paper (accessed February 2, 2013).
Zaima, A. (2003). "The Five Myths of Data Mining." What Works: Best Practices in Business Intelligence and Data Warehousing, Vol. 15, The Data Warehousing Institute, Chatsworth, CA, pp. 42-43.
CHAPTER 6
Techniques for Predictive Modeling
LEARNING OBJECTIVES
• Understand the concept and definitions of artificial neural networks (ANN)
• Learn the different types of ANN architectures
• Know how learning happens in ANN
• Understand the concept and structure of support vector machines (SVM)
• Learn the advantages and disadvantages of SVM compared to ANN
• Understand the concept and formulation of the k-nearest neighbor algorithm (kNN)
• Learn the advantages and disadvantages of kNN compared to ANN and SVM
Predictive modeling is perhaps the most commonly practiced branch in data mining. It allows decision makers to estimate what the future holds by means of learning from the past. In this chapter, we study the internal structures, capabilities/limitations, and applications of the most popular predictive modeling techniques, such as artificial neural networks, support vector machines, and k-nearest neighbor. These techniques are capable of addressing both classification- and regression-type prediction problems. Often, they are applied to complex prediction problems where other techniques are not capable of producing satisfactory results. In addition to these three (which are covered in this chapter), other notable prediction modeling techniques include regression (linear or nonlinear), logistic regression (for classification-type prediction problems), naïve Bayes (probabilistically oriented classification modeling), and different types of decision trees (covered in Chapter 5).
6.1 Opening Vignette: Predictive Modeling Helps Better Understand and Manage Complex Medical Procedures
6.2 Basic Concepts of Neural Networks
6.3 Developing Neural Network-Based Systems
6.4 Illuminating the Black Box of ANN with Sensitivity Analysis
6.5 Support Vector Machines
6.6 A Process-Based Approach to the Use of SVM
6.7 Nearest Neighbor Method for Prediction
6.1 OPENING VIGNETTE: Predictive Modeling Helps Better
Understand and Manage Complex Medical Procedures
Healthcare has become one of the most important issues to have a direct impact on quality of life in the United States and around the world. While the demand for healthcare services is increasing because of the aging population, the supply side is having problems keeping up with the level and quality of service. In order to close the gap, healthcare systems ought to significantly improve their operational effectiveness and efficiency. Effectiveness (doing the right thing, such as diagnosing and treating accurately) and efficiency (doing it the right way, such as using the least amount of resources and time) are the two fundamental pillars upon which the healthcare system can be revived. A promising way to improve healthcare is to take advantage of predictive modeling techniques along with large and feature-rich data sources (true reflections of medical and healthcare experiences) to support accurate and timely decision making.
According to the American Heart Association, cardiovascular disease (CVD) is the underlying cause for over 20 percent of deaths in the United States. Since 1900, CVD has been the number-one killer every year except 1918, which was the year of the great flu pandemic. CVD kills more people than the next four leading causes of death combined: cancer, chronic lower respiratory disease, accidents, and diabetes mellitus. Out of all CVD deaths, more than half are attributed to coronary diseases. Not only does CVD take a huge toll on the personal health and well-being of the population, but it is also a great drain on the healthcare resources in the United States and elsewhere in the world. The direct and indirect costs associated with CVD for a year are estimated to be in excess of $500 billion. A common surgical procedure to cure a large variant of CVD is called coronary artery bypass grafting (CABG). Even though the cost of a CABG surgery depends on patient- and service provider-related factors, the average rate is between $50,000 and $100,000 in the United States. As an illustrative example, Delen et al. (2012) carried out an analytics study where they used various predictive modeling methods to predict the outcome of a CABG and applied an information fusion-based sensitivity analysis on the trained models to better understand the importance of the prognostic factors. The main goal was to illustrate that predictive and explanatory analysis of large and feature-rich data sets provides invaluable information to make more efficient and effective decisions in healthcare.
RESEARCH METHOD
Figure 6.1 shows the model development and testing process used by Delen et al. They employed four different types of prediction models (artificial neural networks, support vector machines, and two types of decision trees, C5 and CART) and went through a large number of experimental runs to calibrate the modeling parameters for each model type. Once the models were developed, they were tested on the test data set. Finally, the trained models were exposed to a sensitivity analysis procedure where the contribution of the variables was measured. Table 6.1 shows the test results for the four different types of prediction models.
[Process map: preprocessed data (in Excel format) is partitioned into training, testing, and validation sets; each of the four model types (ANN, SVM, DT/C5, and DT/CART) is trained and calibrated, tested, and run through a sensitivity analysis; the outputs are the tabulated model testing results (Accuracy, Sensitivity, and Specificity) and the integrated (fused) sensitivity analysis results.]
FIGURE 6.1 A Process Map for Training and Testing of the Four Predictive Models.
TABLE 6.1 Prediction Accuracy Results for All Four Model Types Based on the Test Data Set

Model Type (1)     Confusion Matrix (2)      Accuracy (3)   Sensitivity (3)   Specificity (3)
                   Pos (1)     Neg (0)
ANN    Pos (1)       749         230           74.72%         76.51%            72.93%
       Neg (0)       265         714
SVM    Pos (1)       876         103           87.74%         89.48%            86.01%
       Neg (0)       137         842
C5     Pos (1)       876         103           79.62%         80.29%            78.96%
       Neg (0)       137         842
CART   Pos (1)       660         319           71.15%         67.42%            74.87%
       Neg (0)       246         733

(1) Acronyms for model types: ANN: Artificial Neural Networks; SVM: Support Vector Machines; C5: a popular decision tree algorithm; CART: Classification and Regression Trees.
(2) Prediction results for the test data samples are shown in a confusion matrix, where the rows represent the actuals and the columns represent the predicted cases.
(3) Accuracy, Sensitivity, and Specificity are the three performance measures that were used in comparing the four prediction models.
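As a quick check of how the three measures in Table 6.1 follow from a confusion matrix, the sketch below recomputes them from the ANN cell counts (rows are actuals, columns are predictions):

```python
# Accuracy, sensitivity, and specificity from the ANN confusion matrix above.
tp, fn = 749, 230   # actual positive: predicted positive / predicted negative
fp, tn = 265, 714   # actual negative: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)   # fraction of all cases correct
sensitivity = tp / (tp + fn)                 # true-positive rate
specificity = tn / (tn + fp)                 # true-negative rate
print(f"accuracy={accuracy:.2%} sensitivity={sensitivity:.2%} "
      f"specificity={specificity:.2%}")     # 74.72%, 76.51%, 72.93%
```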
RESULTS
In this study, they showed the power of data mining in predicting the outcome and in analyzing the prognostic factors of complex medical procedures such as CABG surgery. They showed that using a number of prediction methods (as opposed to only one) in a competitive experimental setting has the potential to produce better predictive as well as explanatory results. Among the four methods that they used, SVMs produced the best results, with a prediction accuracy of 88 percent on the test data sample. The information fusion-based sensitivity analysis revealed the ranked importance of the independent variables. The fact that some of the top variables identified in this analysis overlap with the most important variables identified in previously conducted clinical and biological studies confirms the validity and effectiveness of the proposed data mining methodology.
From the managerial standpoint, clinical decision support systems that use the outcome of data mining studies (such as the ones presented in this case study) are not meant to replace healthcare managers and/or medical professionals. Rather, they intend to support them in making accurate and timely decisions to optimally allocate resources in order to increase the quantity and quality of medical services. There still is a long way to go before we can see these decision aids used extensively in healthcare practices. Among others, there are behavioral, ethical, and political reasons for this resistance to adoption. Maybe the need and the government incentives for better healthcare systems will expedite the adoption.
QUESTIONS FOR THE OPENING VIGNETTE
1. Why is it important to study medical procedures? What is the value in predicting outcomes?
2. What factors do you think are the most important in better understanding and managing healthcare? Consider both managerial and clinical aspects of healthcare.
3. What would be the impact of predictive modeling on healthcare and medicine? Can predictive modeling replace medical or managerial personnel?
4. What were the outcomes of the study? Who can use these results? How can the results be implemented?
5. Search the Internet to locate two additional cases where predictive modeling is used to understand and manage complex medical procedures.
WHAT WE CAN LEARN FROM THIS VIGNETTE
As you will see in this chapter, predictive modeling techniques can be applied to a wide range of problem areas, from standard business problems of assessing customer needs to understanding and enhancing the efficiency of production processes to improving healthcare and medicine. This vignette illustrates an innovative application of predictive modeling to better predict, understand, and manage coronary bypass grafting procedures. As the results indicate, these sophisticated predictive modeling techniques are capable of predicting and explaining such complex phenomena. Evidence-based medicine is a relatively new term coined in the healthcare arena, where the main idea is to dig deep into past experiences to discover new and useful knowledge to improve medical and managerial procedures in healthcare. As we all know, healthcare needs all the help that it can get. Compared to traditional research, which is clinical and biological in nature, data-driven studies provide an out-of-the-box view to medicine and the management of medical systems.
Sources: D. Delen, A. Oztekin, and L. Tomak, "An Analytic Approach to Better Understanding and Management of Coronary Surgeries," Decision Support Systems, Vol. 52, No. 3, 2012, pp. 698-705; and American Heart Association, "Heart Disease and Stroke Statistics: 2012 Update," heart.org (accessed February 2013).
6.2 BASIC CONCEPTS OF NEURAL NETWORKS
Neural networks represent a brain metaphor for information processing. These models are biologically inspired rather than an exact replica of how the brain actually functions. Neural networks have been shown to be very promising systems in many forecasting and business classification applications due to their ability to "learn" from the data, their nonparametric nature (i.e., no rigid assumptions), and their ability to generalize. Neural computing refers to a pattern-recognition methodology for machine learning. The resulting model from neural computing is often called an artificial neural network (ANN) or a neural network. Neural networks have been used in many business applications for pattern recognition, forecasting, prediction, and classification. Neural network computing is a key component of any data mining toolkit. Applications of neural networks abound in finance, marketing, manufacturing, operations, information systems, and so on. Therefore, we devote this chapter to developing a better understanding of neural network models, methods, and applications.
The human brain possesses bewildering capabilities for information processing and problem solving that modern computers cannot compete with in many aspects. It has been postulated that a model or a system that is enlightened and supported by the results from brain research, with a structure similar to that of biological neural networks, could exhibit similar intelligent functionality. Based on this bottom-up approach, ANN (also known as connectionist models, parallel distributed processing models, neuromorphic systems, or simply neural networks) have been developed as biologically inspired and plausible models for various tasks.
Biological neural networks are composed of many massively interconnected neurons. Each neuron possesses axons and dendrites, fingerlike projections that enable the neuron to communicate with its neighboring neurons by transmitting and receiving electrical and chemical signals. More or less resembling the structure of their biological counterparts, ANN are composed of interconnected, simple processing elements called artificial neurons. When processing information, the processing elements in an ANN operate concurrently and collectively, similar to biological neurons. ANN possess some desirable traits similar to those of biological neural networks, such as the abilities to learn, to self-organize, and to support fault tolerance.
Coming along a winding journey, ANN have been investigated by researchers for more than half a century. The formal study of ANN began with the pioneering work of McCulloch and Pitts in 1943. Inspired by the results of biological experiments and observations, McCulloch and Pitts (1943) introduced a simple model of a binary artificial neuron that captured some of the functions of biological neurons. Using information-processing machines to model the brain, McCulloch and Pitts built their neural network model using a large number of interconnected artificial binary neurons. From these beginnings, neural network research became quite popular in the late 1950s and early 1960s. After a thorough analysis of an early neural network model (called the perceptron, which used no hidden layer) as well as a pessimistic evaluation of the research potential by Minsky and Papert in 1969, interest in neural networks diminished.
During the past two decades, there has been an exciting resurgence in ANN studies due to the introduction of new network topologies, new activation functions, and new learning algorithms, as well as progress in neuroscience and cognitive science. Advances in theory and methodology have overcome many of the obstacles that hindered neural network research a few decades ago. Evidenced by the appealing results of numerous studies, neural networks are gaining in acceptance and popularity. In addition, the desirable features in neural information processing make neural networks attractive for solving complex problems. ANN have been applied to numerous complex problems in a variety of application settings. The successful use of neural network applications has inspired renewed interest from industry and business.
Biological and Artificial Neural Networks
The human brain is composed of special cells called neurons. These cells do not die and replenish when a person is injured (all other cells reproduce to replace themselves and then die). This phenomenon may explain why humans retain information for an extended period of time and start to lose it when they get old, as the brain cells gradually start to die. Information storage spans sets of neurons. The brain has anywhere from 50 billion to 150 billion neurons, of which there are more than 100 different kinds. Neurons are partitioned into groups called networks. Each network contains several thousand highly interconnected neurons. Thus, the brain can be viewed as a collection of neural networks.
The ability to learn and to react to changes in our environment requires intelligence. The brain and the central nervous system control thinking and intelligent behavior. People who suffer brain damage have difficulty learning and reacting to changing environments. Even so, undamaged parts of the brain can often compensate with new learning.
A portion of a n etwork composed of two cells is shown in Figure 6.2. The cell itself
includes a nucleus (the central processing portion of the n euron). To the left of cell 1,
the dendrites provide input signals to the cell. To the right, the axon sends output signals
to cell 2 via the axon terminals. These axon terminals merge with the dendrites of cell 2. Signals can be transmitted unchanged, or they can be altered by synapses. A synapse is able to increase or decrease the strength of the connection between neurons and cause excitation or inhibition of a subsequent neuron. This is how information is stored in neural networks.

FIGURE 6.2 Portion of a Biological Neural Network: Two Interconnected Cells/Neurons (showing the soma, dendrites, axon, and synapse).
An ANN emulates a biological neural network. Neural computing actually uses a very limited set of concepts from biological neural systems (see Technology Insights 6.1). It is more of an analogy to the human brain than an accurate model of it. Neural concepts usually are implemented as software simulations of the massively parallel processes involved in processing interconnected elements (also called artificial neurons, or neurodes) in a network architecture. The artificial neuron receives inputs analogous to the electrochemical impulses that dendrites of biological neurons receive from other neurons. The output of the artificial neuron corresponds to signals sent from a biological neuron over its axon. These artificial signals can be changed by weights in a manner similar to the physical changes that occur in the synapses (see Figure 6.3).
Several ANN paradigms have been proposed for applications in a variety of problem domains. Perhaps the easiest way to differentiate among the various neural models is on the basis of how they structurally emulate the human brain, the way they process information, and how they learn to perform their designated tasks.
FIGURE 6.3 Processing Information in an Artificial Neuron. Inputs X1, ..., Xn are multiplied by weights W1, ..., Wn, combined by the summation function S = Σ XiWi, and passed through the transfer function f(S) to produce the output.
TECHNOLOGY INSIGHTS 6.1 The Relationship Between Biological
and Artificial Neural Networks

The following list shows some of the relationships between biological and artificial networks.

Biological                Artificial
Soma                      Node
Dendrites                 Input
Axon                      Output
Synapse                   Weight
Slow                      Fast
Many neurons (10^9)       Few neurons (a dozen to hundreds of thousands)

Sources: L. Medsker and J. Liebowitz, Design and Development of Expert Systems and Neural Networks, Macmillan, New York, 1994, p. 163; and F. Zahedi, Intelligent Systems for Business: Expert Systems with Neural Networks, Wadsworth, Belmont, CA, 1993.
Because they are biologically inspired, the main processing elements of a neural network are individual neurons, analogous to the brain's neurons. These artificial neurons receive the information from other neurons or external input stimuli, perform a transformation on the inputs, and then pass on the transformed information to other neurons or external outputs. This is similar to how it is currently thought that the human brain works. Passing information from neuron to neuron can be thought of as a way to activate, or trigger, a response from certain neurons based on the information or stimulus received.
How information is processed by a neural network is inherently a function of its structure. Neural networks can have one or more layers of neurons. These neurons can be highly or fully interconnected, or only certain layers can be connected. Connections between neurons have an associated weight. In essence, the "knowledge" possessed by the network is encapsulated in these interconnection weights. Each neuron calculates a weighted sum of the incoming neuron values, transforms this input, and passes on its neural value as the input to subsequent neurons. Typically, although not always, this input/output transformation process at the individual neuron level is performed in a nonlinear fashion.
Application Case 6.1 provides an interesting example of the use of neural networks as a prediction tool in the mining industry.
Application Case 6.1
Neural Networks Are Helping to Save Lives in the Mining Industry
In the mining industry, most of the underground
injuries and fatalities are due to rock falls (i.e., fall
of hanging wall/roof). The method that has been
used for many years in the mines when determin-
ing the integrity of the hanging wall is to tap the
hanging wall with a sounding bar and listen to the
sound emitted. An experienced miner can differenti-
ate an intact/solid hanging wall from a detached/
loose hanging wall by the sound that is emitted. This
method is subjective. The Council for Scientific and Industrial Research (CSIR) in South Africa has developed a device that assists any miner in making an objective decision when determining the integrity of the hanging wall. A trained neural network model is embedded into the device. The device then records the sound emitted when a hanging wall is tapped. The sound is then preprocessed before being input into a trained neural network model, and the trained model classifies the hanging wall as either intact or detached.
Mr. Teboho Nyareli, working as a research engineer at CSIR, who holds a master's degree in electronic engineering from the University of Cape Town in South Africa, used NeuroSolutions, a popular artificial neural network modeling software developed by NeuroDimensions, Inc., to develop the classification-type prediction models. The multilayer perceptron-type ANN architecture that he built achieved better than 70 percent prediction accuracy on the hold-out sample. Currently, the prototype system is undergoing a final set of tests before being deployed as a decision aid, to be followed by the commercialization phase. The following figure shows a snapshot of NeuroSolutions' model-building platform.
QUESTIONS FOR DISCUSSION
1. How did neural networks help save lives in the
mining industry?
2. What were the challenges, the proposed solu-
tion, and the obtained results?
Source: NeuroSolutions customer success story, neurosolutions.com/resources/nyareli.html (accessed February 2013).
Elements of ANN
A neural network is composed of processing elements that are organized in different ways to form the network's structure. The basic processing unit is the neuron. A number of neurons are then organized into a network. Neurons can be organized in a number of different ways; these various network patterns are referred to as topologies. One popular approach, known as the feedforward-backpropagation paradigm (or simply backpropagation), allows all neurons to link the output in one layer to the input of the next layer, but it does not allow any feedback linkage (Haykin, 2009). Backpropagation is the most commonly used network paradigm.
PROCESSING ELEMENTS The processing elements (PE) of an ANN are artificial neurons. Each neuron receives inputs, processes them, and delivers a single output, as shown in Figure 6.3. The input can be raw input data or the output of other processing elements. The output can be the final result (e.g., 1 means yes, 0 means no), or it can be input to other neurons.
FIGURE 6.4 Neural Network with One Hidden Layer. Inputs X1, X2, and X3 enter the input layer; each processing element (PE) computes a weighted sum (Σ) and applies a transfer function (f).
NETWORK STRUCTURE Each ANN is composed of a collection of neurons that are grouped into layers. A typical structure is shown in Figure 6.4. Note the three layers: input, intermediate (called the hidden layer), and output. A hidden layer is a layer of neurons that takes input from the previous layer and converts those inputs into outputs for further processing. Several hidden layers can be placed between the input and output layers, although it is common to use only one hidden layer. In that case, the hidden layer simply converts inputs into a nonlinear combination and passes the transformed inputs to the output layer. The most common interpretation of the hidden layer is as a feature-extraction mechanism; that is, the hidden layer converts the original inputs in the problem into a higher-level combination of such inputs.

Like a biological network, an ANN can be organized in several different ways (i.e., topologies or architectures); that is, the neurons can be interconnected in different ways. When information is processed, many of the processing elements perform their computations at the same time. This parallel processing resembles the way the brain works, and it differs from the serial processing of conventional computing.
Network Information Processing
Once the structure of a neural network is determined, information can be processed. We now present the major concepts related to network information processing.
INPUT Each input corresponds to a single attribute. For example, if the problem is to decide on approval or disapproval of a loan, attributes could include the applicant's income level, age, and home ownership status. The numeric value, or representation, of an attribute is the input to the network. Several types of data, such as text, pictures, and voice, can be used as inputs. Preprocessing may be needed to convert symbolic data into meaningful inputs or to scale the data.
OUTPUTS The output of a network contains the solution to a problem. For example, in the case of a loan application, the output can be yes or no. The ANN assigns numeric values to the output, such as 1 for "yes" and 0 for "no." The purpose of the network is to compute the output values. Often, postprocessing of the output is required because some networks use two outputs: one for "yes" and another for "no." It is common to round the outputs to the nearest 0 or 1.
CONNECTION WEIGHTS Connection weights are the key elements of an ANN. They express the relative strength (or mathematical value) of the input data or the many connections that transfer data from layer to layer. In other words, weights express the relative importance of each input to a processing element and, ultimately, the output. Weights are crucial in that they store learned patterns of information. It is through repeated adjustments of weights that a network learns.
SUMMATION FUNCTION The summation function computes the weighted sum of all the input elements entering each processing element. A summation function multiplies each input value by its weight and totals the values for a weighted sum Y. The formula for n inputs in one processing element (see Figure 6.5a) is:

Y = Σ XiWi (summed over i = 1 to n)
For the jth neuron of several processing neurons in a layer (see Figure 6.5b), the formula is:

Yj = Σ XiWij (summed over i = 1 to n)

FIGURE 6.5 Summation Function for (a) a Single Neuron and (b) Several Neurons. In panel (b), for example, Y1 = X1W11 + X2W21, Y2 = X1W12 + X2W22, and Y3 = X2W23.
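In matrix form, the layer-level summation in Figure 6.5b is simply a vector-matrix product. The following tiny NumPy check uses made-up values for X and W (the figure leaves them unspecified); the zero weight mirrors the figure's missing connection from X1 to the third neuron.

import numpy as np

X = np.array([1.0, 2.0])               # two inputs (illustrative values)
W = np.array([[0.5, 0.1, 0.0],         # W[i, j] connects input i to neuron j
              [0.3, 0.2, 0.4]])        # W[0, 2] = 0: X1 does not feed neuron 3
Y = X @ W                              # Yj = sum over i of Xi * Wij
print(Y)                               # [Y1, Y2, Y3]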
TRANSFORMATION (TRANSFER) FUNCTION The summation function computes the internal stimulation, or activation level, of the neuron. Based on this level, the neuron may or may not produce an output. The relationship between the internal activation level and the output can be linear or nonlinear. The relationship is expressed by one of several types of transformation (transfer) functions. The transformation function takes the summed inputs coming into a neuron from other neurons/sources and produces an output based on its functional form. Selection of the specific function affects the network's operation. The sigmoid (logistic activation) function (or sigmoid transfer function) is an S-shaped transfer function in the range of 0 to 1, and it is a popular as well as useful nonlinear transfer function:

Y_T = 1/(1 + e^-Y)

where Y_T is the transformed (i.e., normalized) value of Y (see Figure 6.6).
The transformation modifies the output levels to reasonable values (typically between 0 and 1). This transformation is performed before the output reaches the next level. Without such a transformation, the value of the output can become very large, especially when there are several layers of neurons. Sometimes a threshold value is used instead of a transformation function. A threshold value is a hurdle value for the output of a neuron to trigger the next level of neurons. If an output value is smaller than the threshold value, it will not be passed to the next level of neurons. For example, any value of 0.5 or less becomes 0, and any value above 0.5 becomes 1. A transformation can occur at the output of each processing element, or it can be performed only at the final output nodes.

Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2
Transfer function: Y_T = 1/(1 + e^-1.2) = 0.77

FIGURE 6.6 Example of ANN Transfer Function (a processing element with Y = 1.2 and Y_T = 0.77).
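To make these mechanics concrete, the following minimal Python sketch reproduces the numbers in Figure 6.6 for a single processing element. The inputs (3, 1, 2), weights (0.2, 0.4, 0.1), and the 0.5 threshold variant come from the figure and the text above; the function name is our own.

import math

def neuron_output(inputs, weights):
    # A single PE: summation function followed by a sigmoid transfer function
    y = sum(x * w for x, w in zip(inputs, weights))  # Y = sum of Xi * Wi
    y_t = 1.0 / (1.0 + math.exp(-y))                 # Y_T = 1 / (1 + e^-Y)
    return y, y_t

y, y_t = neuron_output([3, 1, 2], [0.2, 0.4, 0.1])   # values from Figure 6.6
print(round(y, 1), round(y_t, 2))                    # prints: 1.2 0.77

# A hard threshold could be used instead of a smooth transfer function:
output = 1 if y_t > 0.5 else 0                       # 0.5 or less becomes 0; above 0.5 becomes 1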
HIDDEN LAYERS Complex practical applications require one or more hidden layers between the input and output neurons and a correspondingly large number of weights. Many commercial ANN include three and sometimes up to five layers, with each containing 10 to 1,000 processing elements. Some experimental ANN use millions of processing elements. Because each layer increases the training effort exponentially and also increases the computation required, the use of more than three hidden layers is rare in most commercial systems.
Neural Network Architectures

There are several neural network architectures (for specifics of models and/or algorithms, see Haykin, 2009). The most common ones include feedforward (multilayer perceptron with backpropagation), associative memory, recurrent networks, Kohonen's self-organizing feature maps, and Hopfield networks. The generic architecture of a feedforward network is shown in Figure 6.4, where the information flows unidirectionally from input layer to hidden layers to output layer. In contrast, Figure 6.7 shows a pictorial representation of a recurrent neural network architecture, where the connections between the layers are not unidirectional; rather, there are many connections in every direction between the layers and neurons, creating a complex connection structure. Many experts believe this better mimics the way biological neurons are structured in the human brain.

FIGURE 6.7 A Recurrent Neural Network Architecture (H indicates a "hidden" neuron without a target output).
KOHONEN'S SELF-ORGANIZING FEATURE MAPS First introduced by the Finnish professor Teuvo Kohonen, Kohonen's self-organizing feature maps (Kohonen networks, or SOM for short) provide a way to represent multidimensional data in much lower-dimensional spaces, usually one or two dimensions. One of the most interesting aspects of SOM is that they learn to classify data without supervision (i.e., there is no output vector). Remember, in supervised learning techniques, such as backpropagation, the training data consists of vector pairs: an input vector and a target vector. Because of their self-organizing capability, SOM are commonly used for clustering tasks where a group of cases is assigned to an arbitrary number of natural groups. Figure 6.8a illustrates a very small Kohonen network of 4 x 4 nodes connected to the input layer (with three inputs), representing a two-dimensional vector.
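To illustrate the unsupervised learning idea, here is a minimal NumPy sketch of a Kohonen-style map with a 4 x 4 grid and three inputs, mirroring Figure 6.8a. The learning-rate and neighborhood schedules and the random training vectors are illustrative assumptions, not part of the original example.

import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, n_inputs = 4, 4, 3                  # 4 x 4 map with three inputs (Figure 6.8a)
weights = rng.random((grid_w, grid_h, n_inputs))    # one weight vector per map node
coords = np.dstack(np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij"))

data = rng.random((500, n_inputs))                  # unlabeled training vectors (no target output)

for t, x in enumerate(data):
    frac = 1 - t / len(data)
    lr = 0.5 * frac                                 # decaying learning rate (assumed schedule)
    radius = 2.0 * frac + 0.5                       # decaying neighborhood radius (assumed)
    # Find the best-matching unit (BMU): the node whose weights are closest to x
    bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)), (grid_w, grid_h))
    # Pull the BMU and its grid neighbors toward x, weighted by distance on the grid
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
    influence = np.exp(-grid_dist**2 / (2 * radius**2))
    weights += lr * influence[..., None] * (x - weights)

# After training, each case maps to a grid cell; nearby cells hold similar cases (clusters)
print(np.unravel_index(np.argmin(np.linalg.norm(weights - data[0], axis=2)), (grid_w, grid_h)))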
HOPFIELD NETWORKS The Hopfield network is another interesting neural network architecture, first introduced by John Hopfield (1982). Hopfield demonstrated in a series of research articles in the early 1980s how highly interconnected networks of nonlinear neurons can be extremely effective in solving complex computational problems. These networks were shown to provide novel and quick solutions to a family of problems stated in terms of a desired objective subject to a number of constraints (i.e., constraint optimization problems). One of the major advantages of Hopfield neural networks is the fact that their structure can be realized on an electronic circuit board, possibly on a VLSI (very large-scale integration) circuit, to be used as an online solver with a parallel-distributed
process. Architecturally, a general Hopfield network is represented as a single large layer of neurons with total interconnectivity; that is, each neuron is connected to every other neuron within the network (see Figure 6.8b).

FIGURE 6.8 Graphical Depiction of (a) Kohonen and (b) Hopfield ANN Structures.
Ultimately, the architecture of a neural network model is driven by the task it is intended to carry out. For instance, neural network models have been used as classifiers, as forecasting tools, as customer segmentation mechanisms, and as general optimizers. As shown later in this chapter, neural network classifiers are typically multilayer models in which information is passed from one layer to the next, with the ultimate goal of mapping an input to the network to a specific category, as identified by an output of the network. A neural model used as an optimizer, in contrast, can be a single layer of neurons, highly interconnected, and can compute neuron values iteratively until the model converges to a stable state. This stable state represents an optimal solution to the problem under analysis.

Application Case 6.2 summarizes the use of predictive modeling (e.g., neural networks) in addressing several challenging problems in the electric power industry.
Application Case 6.2
Predictive Modeling Is Powering the Power Generators
The electrical power industry produces and delivers electric energy (electricity or power) to both residential and business customers, wherever and whenever they need it. Electricity can be generated from a multitude of sources. Most often, electricity is produced at a power station using electromechanical generators that are driven by heat engines fueled by chemical combustion (by burning coal, petroleum, or natural gas) or nuclear fission (in a nuclear reactor). Generation of electricity can also be accomplished by other means, such as kinetic energy (through falling/flowing water or wind that activates turbines), solar energy (through the energy emitted by the sun, either light or heat), or geothermal energy (through the steam or hot water coming from deep layers of the earth). Once generated, the electric energy is distributed through a power grid infrastructure.
Even though some energy-generation methods are favored over others, all forms of electricity generation have positive and negative aspects. Some are environmentally favored but are economically unjustifiable; others are economically superior but environmentally prohibitive. In a market economy, the options with fewer overall costs are generally chosen above all other sources. It is not clear yet which form can best meet the necessary demand for electricity without permanently damaging the environment. Current trends indicate that increasing the shares of renewable energy and distributed generation from mixed sources has the promise of reducing/balancing environmental and economic risks.
The electrical power industry is a highly regulated, complex business endeavor. There are four distinct roles that companies choose to participate in: power producers, transmitters, distributors, and retailers. Connecting all of the producers to all of the customers is accomplished through a complex structure, called the power grid. Although all aspects of the electricity industry are witnessing stiff competition, power generators are perhaps the ones getting the lion's share of it. To be competitive, producers of power need to maximize the use of their variety of resources by making the right decisions at the right time.
StatSoft, one of the fastest growing providers of customized analytics solutions, developed integrated decision support tools for power generators. Leveraging the data that come from the production process, these data mining-driven software tools help technicians and managers rapidly optimize the process parameters to maximize the power output while minimizing the risk of adverse effects. Following are a few examples of what these advanced analytics tools, which include ANN and SVM, can accomplish for power generators.
• Optimize Operation Parameters

Problem: A coal-burning 300 MW multicyclone unit required optimization for consistent high flame temperatures to avoid forming slag and burning excess fuel oil.

Solution: Using StatSoft's predictive modeling tools (along with 12 months of 3-minute historical data), optimized control parameter settings for stoichiometric ratios, coal flows, primary air, tertiary air, and split secondary air damper flows were identified and implemented.

Results: After optimizing the control parameters, flame temperatures showed strong responses, resulting in cleaner combustion with higher and more stable flame temperatures.
• Predict Problems Before They Happen

Problem: A 400 MW coal-fired DRB-4Z burner required optimization for consistent and robust low-NOx operations to avoid excursions and expensive downtime. Identify root causes of ammonia slip in a selective noncatalytic reduction process for NOx reduction.

Solution: Apply predictive analytics methodologies (along with historical process data) to predict and control variability; then target processes for better performance, thereby reducing both average NOx and variability.

Results: Optimized settings for combinations of control parameters resulted in consistently lower NOx emissions with less variability (and no excursions) over continued operations at low load, including predicting failures or unexpected maintenance issues.
• Reduce Emissions (NOx, CO)

Problem: While NOx emissions for higher loads were within acceptable ranges, a 400 MW coal-fired DRB-4Z burner was not optimized for low-NOx operations under low load (50-175 MW).

Solution: Using data-driven predictive modeling technologies with historical data, optimized parameter settings for changes to airflow were identified, resulting in a set of specific, achievable input parameter ranges that were easily implemented into the existing DCS (digital control system).

Results: After optimization, NOx emissions under low-load operations were comparable to NOx emissions under higher loads.
As these specific examples illustrate, there are numerous opportunities for advanced analytics to make a significant contribution to the power industry. Using data and predictive models could help decision makers get the best efficiency from their production systems while minimizing the impact on the environment.
QUESTIONS FOR DISCUSSION

1. What are the key environmental concerns in the electric power industry?
2. What are the main application areas for predictive modeling in the electric power industry?
3. How was predictive modeling used to address a variety of problems in the electric power industry?

Source: StatSoft, Success Stories, power.statsoft.com/files/statsoft-powersolutions (accessed February 2013).
SECTION 6.2 REVIEW QUESTIONS

1. What is an ANN?
2. Explain the following terms: neuron, axon, and synapse.
3. How do weights function in an ANN?
4. What is the role of the summation and transformation functions?
5. What are the most common ANN architectures? How do they differ from each other?
6.3 DEVELOPING NEURAL NETWORK-BASED SYSTEMS

Although the development process of ANN is similar to the structured design methodologies of traditional computer-based information systems, some phases are unique or have some unique aspects. In the process described here, we assume that the preliminary steps of system development, such as determining information requirements, conducting a feasibility analysis, and gaining a champion in top management for the project, have been completed successfully. Such steps are generic to any information system.
As shown in Figure 6.9, the development process for an ANN application includes nine steps. In step 1, the data to be used for training and testing the network are collected. Important considerations are that the particular problem is amenable to a neural network solution and that adequate data exist and can be obtained. In step 2, training data must be identified, and a plan must be made for testing the performance of the network.
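In practice, step 2 is a data-partitioning exercise. The sketch below shows one common way to do it with scikit-learn; the synthetic data, split ratios, and random seed are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((1000, 5))                    # 1,000 collected cases, five input attributes
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # known outputs (an illustrative rule)

# Separate the data into training (60%), validation (20%), and testing (20%) sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)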
In steps 3 and 4, a network architecture and a learning method are selected. The availability of a particular development tool or the capabilities of the development personnel may determine the type of neural network to be constructed. Also, certain problem types have demonstrated high success rates with certain configurations (e.g., multilayer feedforward neural networks for bankruptcy prediction; see Altman (1968), Wilson and Sharda (1994), and Olson et al. (2012)). Important considerations are the exact number of neurons and the number of layers. Some packages use genetic algorithms to select the network design.
There are several parameters for tuning the network to the desired learning-performance level. Part of the process in step 5 is the initialization of the network weights and parameters, followed by the modification of the parameters as training-performance feedback is received. Often, the initial values are important in determining the efficiency and length of training. Some methods change the parameters during training to enhance performance.
Step 6 transforms the application data into the type and format required by the neural network. This may require writing software to preprocess the data or performing these operations directly in an ANN package. Data storage and manipulation techniques and processes must be designed for conveniently and efficiently retraining the neural network when needed. The application data representation and ordering often influence the efficiency and possibly the accuracy of the results.
In steps 7 and 8, training and testing are conducted iteratively by presenting input and desired or known output data to the network. The network computes the outputs and adjusts the weights until the computed outputs are within an acceptable tolerance of the known outputs for the input cases. The desired outputs and their relationships to input data are derived from historical data (i.e., a portion of the data collected in step 1).
In step 9, a stable set of weights is obtained. Now the network can reproduce the desired outputs, given inputs such as those in the training set. The network is ready for use as a stand-alone system or as part of another software system where new input data will be presented to it and its output will be a recommended decision.

In the following sections, we examine these steps in more detail.
FIGURE 6.9 Development Process of an ANN Model. The flowchart proceeds through nine steps: (1) collect, organize, and format the data; (2) separate the data into training, validation, and testing sets; (3) decide on a network architecture and structure; (4) select a learning algorithm; (5) set network parameters and initialize their values; (6) initialize weights and start training (and validation); (7) stop training and freeze the network weights; (8) test the trained network; and (9) deploy the network for use on unknown new cases. Feedback loops return to earlier steps to get more data or reformat it, re-separate the data into subsets, change the network architecture, learning algorithm, or parameters, or reset and restart the training.
The General ANN Learning Process
In supervised learning, the learning process is inductive; that is, connection weights are derived from existing cases. The usual process of learning involves three tasks (see Figure 6.10):

1. Compute temporary outputs.
2. Compare outputs with desired targets.
3. Adjust the weights and repeat the process.
FIGURE 6.10 Supervised Learning Process of an ANN. The loop computes an output, compares it with the desired target, and either adjusts the weights and repeats or stops learning.
When existing outputs are available for comparison, the learning process starts by setting the connection weights. These are set via rules or at random. The difference between the actual output (Y or Y_T) and the desired output (Z) for a given set of inputs is an error called delta (in calculus, the Greek symbol delta, Δ, means "difference"). The objective is to minimize delta (i.e., reduce it to 0 if possible), which is done by adjusting the network's weights. The key is to change the weights in the right direction, making changes that reduce delta (i.e., error). We will show how this is done later.
Information processing with an ANN consists of attempting to recognize patterns of activities (i.e., pattern recognition). During the learning stages, the interconnection weights change in response to training data presented to the system.

Different ANN compute delta in different ways, depending on the learning algorithm being used. Hundreds of learning algorithms are available for various situations and configurations of ANN. Perhaps the one that is most commonly used and is easiest to understand is backpropagation.
Backpropagation

Backpropagation (short for back-error propagation) is the most widely used supervised learning algorithm in neural computing (Principe et al., 2000). It is very easy to implement. A backpropagation network includes one or more hidden layers. This type of network is considered feedforward because there are no interconnections between the output of a processing element and the input of a node in the same layer or in a preceding layer. Externally provided correct patterns are compared with the neural network's output during (supervised) training, and feedback is used to adjust the weights until the network has categorized all the training patterns as correctly as possible (the error tolerance is set in advance).

Starting with the output layer, errors between the actual and desired outputs are used to correct the weights for the connections to the previous layer (see Figure 6.11).
FIGURE 6.11 Backpropagation of Error for a Single Neuron. The error α(Z1 − Y1) is fed back to adjust the weights of the inputs X1, ..., Xn entering the summation S = Σ XiWi, whose result passes through the transfer function Y = f(S).
For any output neuron j, the error (delta) = (Zj − Yj)(df/dx), where Z and Y are the desired and actual outputs, respectively. Using the sigmoid function, f = [1 + exp(−x)]^−1, where x is proportional to the sum of the weighted inputs to the neuron, is an effective way to compute the output of a neuron in practice. With this function, the derivative of the sigmoid is df/dx = f(1 − f), and the error becomes a simple function of the desired and actual outputs. The factor f(1 − f), the derivative of the logistic function, serves to keep the error correction well bounded. The weights of each input to the jth neuron are then changed in proportion to this calculated error. A more complicated expression can be derived to work backward in a similar way from the output neurons through the hidden layers to calculate the corrections to the associated weights of the inner neurons. This complicated method is an iterative approach to solving a nonlinear optimization problem that is very similar in spirit to the one characterizing multiple linear regression.
The learning algorithm includes the following procedure (a minimal single-neuron sketch follows the list):

1. Initialize weights with random values and set other parameters.
2. Read in the input vector and the desired output.
3. Compute the actual output via the calculations, working forward through the layers.
4. Compute the error.
5. Change the weights by working backward from the output layer through the hidden layers.
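The following Python sketch applies these five steps to a single sigmoid neuron (i.e., no hidden layer). The training cases, learning rate, and epoch count are illustrative assumptions; the delta computation uses the f(1 − f) derivative discussed above.

import math, random

random.seed(1)

def f(x):
    # Sigmoid transfer function
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative training set: input vectors and desired outputs Z (a logical OR pattern)
cases = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

# Step 1: initialize weights (and a bias) with random values; set the learning rate
w = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
b = random.uniform(-0.5, 0.5)
alpha = 0.5

for epoch in range(2000):
    for x, z in cases:                          # Step 2: read input vector and desired output
        y = f(sum(xi * wi for xi, wi in zip(x, w)) - b)   # Step 3: compute the actual output
        delta = (z - y) * y * (1 - y)           # Step 4: error = (Z - Y) * f * (1 - f)
        for i in range(len(w)):                 # Step 5: change weights working backward
            w[i] += alpha * delta * x[i]
        b -= alpha * delta                      # the bias moves opposite to the weights

print([round(f(sum(xi * wi for xi, wi in zip(x, w)) - b), 2) for x, _ in cases])

With one or more hidden layers, step 5 repeats this correction layer by layer, propagating each neuron's share of the error backward toward the inputs.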
This procedure is repeated for the entire set of input vectors until the desired output and the actual output agree within some predetermined tolerance. Given the calculation requirements for one iteration, a large network can take a very long time to train; therefore, in one variation, a set of cases is run forward and an aggregated error is fed backward to speed up learning. Sometimes, depending on the initial random weights and network parameters, the network does not converge to a satisfactory performance level. When this is the case, new random weights must be generated, and the network parameters, or even its structure, may have to be modified before another attempt is made. Current research is aimed at developing algorithms and using parallel computers to improve this process. For example, genetic algorithms can be used to guide the selection of the network parameters in order to maximize the desired output. In fact, most commercial ANN software tools are now using GA to help users "optimize" the network parameters. Technology Insights 6.2 discusses some of the most popular neural network software and offers some Web links to more comprehensive ANN-related software sites.
TECHNOLOGY INSIGHTS 6.2 ANN Software

Many tools are available for developing neural networks (see this book's Web site and the resource lists at PC AI, pcai.com). Some of these tools function like software shells. They provide a set of standard architectures, learning algorithms, and parameters, along with the ability to manipulate the data. Some development tools can support up to several dozen network paradigms and learning algorithms.

Neural network implementations are also available in most of the comprehensive data mining tools, such as the SAS Enterprise Miner, IBM SPSS Modeler (formerly Clementine), and Statistica Data Miner. Weka, RapidMiner, and KNIME are free, open source data mining tools that include neural network capabilities. These free tools can be downloaded from their respective Web sites; simple Internet searches on the names of these tools should lead you to the download pages. Also, most of the commercial software tools are available for download and use for evaluation purposes (usually, they are limited in duration of availability and/or functionality).
Many specialized neural network tools enable the building and deployment of a neural network model in practice. Any listing of such tools would be incomplete. Online resources such as Wikipedia (en.wikipedia.org/wiki/Artificial_neural_network), Google's or Yahoo!'s software directory, and the vendor listings on pcai.com are good places to locate the latest information on neural network software vendors. Some of the vendors that have been around for a while and have reported industrial applications of their neural network software include California Scientific (BrainMaker), NeuralWare, NeuroDimension Inc., Ward Systems Group (Neuroshell), and Megaputer. Again, the list can never be complete.

Some ANN development tools are spreadsheet add-ins. Most can read spreadsheet, database, and text files. Some are freeware or shareware. Some ANN systems have been developed in Java to run directly on the Web and are accessible through a Web browser interface. Other ANN products are designed to interface with expert systems as hybrid development products.

Developers may instead prefer to use more general programming languages, such as C++, or a spreadsheet to program the model and perform the calculations. A variation on this is to use a library of ANN routines. For example, hav.Software (hav.com) provides a library of C++ classes for implementing stand-alone or embedded feedforward, simple recurrent, and random-order recurrent neural networks. Computational software such as MATLAB also includes neural network-specific libraries.
SECTION 6.3 REVIEW QUESTIONS

1. List the nine steps in conducting a neural network project.
2. What are some of the design parameters for developing a neural network?
3. How does backpropagation learning work?
4. Describe the different types of neural network software available today.
5. How are neural networks implemented in practice when the training/testing is complete?
6.4 ILLUMINATING THE BLACK BOX OF ANN
WITH SENSITIVITY ANALYSIS
Neural networks have been used as an effective tool for solving highly complex real-world problems in a wide range of application areas. Even though ANN have been proven in many problem scenarios to be superior predictors and/or cluster identifiers (compared to their traditional counterparts), in some applications there exists an additional need to know "how it does what it does." ANN are typically thought of as black boxes, capable of solving complex problems but lacking the explanation of their capabilities. This phenomenon is commonly referred to as the "black-box" syndrome.
It is important to be able to explain a model's "inner being"; such an explanation offers assurance that the network has been properly trained and will behave as desired once deployed in a business intelligence environment. Such a need to "look under the hood" might be attributable to a relatively small training set (as a result of the high cost of data acquisition) or a very high liability in case of a system error. One example of such an application is the deployment of airbags in automobiles. Here, both the cost of data acquisition (crashing cars) and the liability concerns (danger to human lives) are rather significant. Another representative example of the importance of explanation is loan-application processing. If an applicant is refused a loan, he or she has the right to know why. Having a prediction system that does a good job of differentiating good and bad applications may not be sufficient if it does not also provide the justification for its predictions.
A variety of techniques has been proposed for the analysis and evaluation of trained neural networks. These techniques provide a clear interpretation of how a neural network does what it does; that is, specifically how (and to what extent) the individual inputs factor into the generation of specific network output. Sensitivity analysis has been the front-runner of the techniques proposed for shedding light on the "black-box" characterization of trained neural networks.
Sensitivity analysis is a method for extracting the cause-and-effect relationships between the inputs and the outputs of a trained neural network model. In the process of performing sensitivity analysis, the trained neural network's learning capability is disabled so that the network weights are not affected. The basic procedure behind sensitivity analysis is that the inputs to the network are systematically perturbed within the allowable value ranges and the corresponding change in the output is recorded for each and every input variable (Principe et al., 2000). Figure 6.12 shows a graphical illustration of this process. The first input is varied between its mean plus and minus a user-defined number of standard deviations (or, for categorical variables, all of its possible values are used) while all other input variables are fixed at their respective means (or modes). The network output is computed for a user-defined number of steps above and below the mean. This process is repeated for each input. As a result, a report is generated to summarize the variation of each output with respect to the variation in each input. The generated report often contains a column plot (along with numeric values presented on the x-axis), reporting the relative sensitivity values for each input variable. A representative example of sensitivity analysis on ANN models is provided in Application Case 6.3.
FIGURE 6.12 A Figurative Illustration of Sensitivity Analysis on an ANN Model. Systematically perturbed inputs are fed into the trained ANN ("the black box"), and the corresponding change in outputs is observed.
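The procedure just described can be sketched in a few lines of Python. In the example below, the trained network and data are synthetic stand-ins (scikit-learn's MLPRegressor plays the role of the trained ANN); only prediction is used, so the network weights stay untouched.

import numpy as np
from sklearn.neural_network import MLPRegressor

def sensitivity_analysis(model, X, n_steps=10, n_std=2.0):
    # Vary one input at a time across mean +/- n_std standard deviations while
    # all other inputs are fixed at their means; record the output's response.
    means, stds = X.mean(axis=0), X.std(axis=0)
    spans = []
    for j in range(X.shape[1]):
        grid = np.tile(means, (n_steps, 1))            # all other variables held at means
        grid[:, j] = np.linspace(means[j] - n_std * stds[j],
                                 means[j] + n_std * stds[j], n_steps)
        out = model.predict(grid)                      # learning disabled: prediction only
        spans.append(out.max() - out.min())
    return np.array(spans) / sum(spans)                # relative sensitivity of each input

# Illustrative data: the output depends strongly on x0, weakly on x1, not at all on x2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
print(sensitivity_analysis(net, X))                    # x0 should dominate the report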
264 Pan III • Predictive Ana lytics
Application Case 6.3
Sensitivity Analysis Reveals Injury Severity Factors in Traffic Accidents
According to the National Highway Traffic Safety Administration, over 6 million traffic accidents claim more than 41,000 lives each year in the United States. Causes of accidents and related injury severity are of special interest to traffic-safety researchers. Such research is aimed not only at reducing the number of accidents but also the severity of injury. One way to accomplish the latter is to identify the most profound factors that affect injury severity. Understanding the circumstances under which drivers and passengers are more likely to be severely injured (or killed) in an automobile accident can help improve the overall driving safety situation. Factors that potentially elevate the risk of injury severity of vehicle occupants in the event of an automotive accident include demographic and/or behavioral characteristics of the person (e.g., age, gender, seatbelt usage, use of drugs or alcohol while driving), environmental factors and/or roadway conditions at the time of the accident (e.g., surface conditions, weather or light conditions, the direction of impact, vehicle orientation in the crash, occurrence of a rollover), as well as technical characteristics of the vehicle itself (e.g., vehicle's age, body type).
In an exploratory data mining study, Delen et al. (2006) used a large sample of data (30,358 police-reported accident records obtained from the General Estimates System of the National Highway Traffic Safety Administration) to identify which factors become increasingly more important in escalating the probability of injury severity during a traffic crash. Accidents examined in this study included a geographically representative sample of multiple-vehicle collision accidents, single-vehicle fixed-object collisions, and single-vehicle noncollision (rollover) crashes.
Contrary to many of the previous studies conducted in this domain, which have primarily used regression-type generalized linear models where the functional relationships between injury severity and crash-related factors are assumed to be linear (an oversimplification of the reality in most real-world situations), Delen and his colleagues decided to go in a different direction. Because ANN are known to be superior in capturing highly nonlinear, complex relationships between the predictor variables (crash factors) and the target variable (severity level of the injuries), they decided to use a series of ANN models to estimate the significance of the crash factors on the level of injury severity sustained by the driver.
From a methodological standpoint, they followed a two-step process. In the first step, they developed a series of prediction models (one for each injury severity level) to capture the in-depth relationships between the crash-related factors and a specific level of injury severity. In the second step, they conducted sensitivity analysis on the trained neural network models to identify the prioritized importance of crash-related factors as they relate to different injury severity levels. In the formulation of the study, the five-class prediction problem was decomposed into a number of binary classification models in order to obtain the granularity of information needed to identify the "true" cause-and-effect relationships between the crash-related factors and different levels of injury severity.
The results revealed considerable differences among the models built for different injury severity levels. This implies that the most influential factors in prediction models highly depend on the level of injury severity. For example, the study revealed that the variable seatbelt use was the most important determinant for predicting higher levels of injury severity (such as incapacitating injury or fatality), but it was one of the least significant predictors for lower levels of injury severity (such as non-incapacitating injury and minor injury). Another interesting finding involved gender: The drivers' gender was among the significant predictors for lower levels of injury severity, but it was not among the significant factors for higher levels of injury severity, indicating that more serious injuries do not depend on the driver being male or female. Yet another interesting and somewhat intuitive finding of the study indicated that age becomes an increasingly more significant factor as the level of injury severity increases, implying that older people are more likely to incur severe injuries (and fatalities) in serious automobile crashes than younger people.
QUESTIONS FOR DISCUSSION

1. How does sensitivity analysis shed light on the black box (i.e., neural networks)?
2. Why would someone choose to use a black-box tool like neural networks over theoretically sound, mostly transparent statistical tools like logistic regression?
3. In this case, how did neural networks and sensitivity analysis help identify injury-severity factors in traffic accidents?

Source: D. Delen, R. Sharda, and M. Bessonov, "Identifying Significant Predictors of Injury Severity in Traffic Accidents Using a Series of Artificial Neural Networks," Accident Analysis and Prevention, Vol. 38, No. 3, 2006, pp. 434-444.

SECTION 6.4 REVIEW QUESTIONS

1. What is the so-called "black-box" syndrome?
2. Why is it important to be able to explain an ANN's model structure?
3. How does sensitivity analysis work?
4. Search the Internet to find other ANN explanation methods.
6.5 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) are among the most popular machine-learning techniques, mostly because of their superior predictive power and their theoretical foundation. SVMs are among the supervised learning methods that produce input-output functions from a set of labeled training data. The function between the input and output vectors can be either a classification function (used to assign cases into predefined classes) or a regression function (used to estimate the continuous numerical value of the desired output). For classification, nonlinear kernel functions are often used to transform the input data (which naturally represent highly complex nonlinear relationships) into a high-dimensional feature space in which the input data become linearly separable. Then, the maximum-margin hyperplanes are constructed to optimally separate the output classes from each other in the training data.
Given a classification-type prediction problem, generally speaking, many linear classifiers (hyperplanes) can separate the data into multiple subsections, each representing one of the classes (see Figure 6.13a, where the two classes are represented by circles and squares). However, only one hyperplane achieves the maximum separation between the classes (see Figure 6.13b, where the hyperplane and the two maximum-margin hyperplanes are separating the two classes).

Data used in SVMs may have more than two dimensions. In that case, we would be interested in separating the data using an (n − 1)-dimensional hyperplane, where n is the number of dimensions. This may be seen as a typical form of linear classifier, where we are interested in finding the (n − 1)-dimensional hyperplane for which the distance from the hyperplane to the nearest data points is maximized. The assumption is that the larger the margin, or distance between these parallel hyperplanes, the better the generalization power of the classifier (i.e., the prediction power of the SVM model). If such hyperplanes exist, they can be found using quadratic optimization modeling. The hyperplane in the middle of the margin is known as the maximum-margin hyperplane, and such a linear classifier is known as a maximum-margin classifier.
In addition to their solid mathematical foundation in statistical learning theory, SVMs have also demonstrated highly competitive performance in numerous real-world prediction problems, such as medical diagnosis, bioinformatics, face/voice recognition, demand forecasting, image processing, and text mining, which has established SVMs as one of the
most popular analytics tools for knowledge discovery and data mining. Similar to artificial neural networks, SVMs possess the well-known ability of being universal approximators of any multivariate function to any desired degree of accuracy. Therefore, they are of particular interest for modeling highly nonlinear, complex problems, systems, and processes. In the research study summarized in Application Case 6.4, SVMs are used to successfully predict freshman student attrition.

FIGURE 6.13 Separation of the Two Classes Using Hyperplanes: (a) many possible separating hyperplanes; (b) the maximum-margin hyperplane.
Application Case 6.4
Managing Student Retention with Predictive Modeling
Generally, student attrition at a university is defined by the number of students who do not complete a degree at that institution. It has become one of the most challenging problems for decision makers in academic institutions. In spite of all of the programs and services designed to help retain students, according to the U.S. Department of Education, Center for Educational Statistics (nces.ed.gov), only about half of those who enter higher education actually graduate with a bachelor's degree. Enrollment management and the retention of students have become a top priority for administrators of colleges and universities in the United States and other developed countries around the world. High rates of student attrition usually result in loss of financial resources, lower graduation rates, and inferior perception of the school in the eyes of all stakeholders. The legislators and policymakers who oversee higher education and allocate funds, the parents who pay for their children's education in order to prepare them for a better future, and the students who make college choices look for evidence of institutional quality (such as a low attrition rate) and reputation to guide their college selection decisions.
The statistics show that the vast majority
of students withdraw from the university during
their first year (i.e., freshman year) at the col-
lege. Since most of the student dropouts occur
at the end of the first year, many of the student
retention/ attrition research studies (including the
one summarized here) have focused on first-year
dropouts (or the number of students that do not
return for the second year). Traditionally, student
retention-related research has been survey driven
(e.g., surveying a student cohort and following
them for a specified period of time to determine
whether they continue their education). Using such
a research design, researchers worked on devel-
oping and validating theoretical models including
the famous student integration model developed
by Tinto. An alternative (or a complementary)
approach to the traditional survey-based retention
research is an analytic approach where the data
commonly found in institutional databases is used.
Educational institutions routinely collect a broad
range of information about their students, includ-
ing demographics, educational background, social
involvement, socioeconomic status, and academic
progress.
Research Method
In order to improve student retention, one should try to understand the non-trivial reasons behind the attrition. To be successful, one should also be able to accurately identify those students that are at risk of dropping out. This is where analytics come in handy. Using institutional data, prediction models can be developed to accurately identify the students at risk of dropout, so that limited resources (people, money, time, etc., at an institution's student success center) can be optimally used to retain most of them.

In this study, using 5 years of freshman student data (obtained from the university's existing databases) along with several data mining techniques, four types of prediction models are developed and tested to identify the best predictor of freshman attrition. In order to explain the phenomenon (i.e., to identify the relative importance of the variables), a sensitivity analysis of the developed models is also conducted. The main goals of this and other similar analytic studies are to (1) develop models to correctly identify the freshman students who are most likely to drop out after their freshman year, and (2) identify the most important variables by applying sensitivity analyses on the developed models. The models are formulated in such a way that the prediction occurs at the end of the first semester (usually the fall semester) so that decision makers can properly craft intervention programs during the next semester (the spring semester) in order to retain at-risk students.
Figure 6.14 shows a graphical illustration of the research methodology. First, data about the students are collected from multiple sources and consolidated (see Table 6.2 for the variables used in this study). Next, the data are preprocessed to handle missing values and other anomalies. The preprocessed data are then pushed through a 10-fold cross-validation experiment where, for each model type, 10 different models are developed and tested for comparison purposes.
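For readers who want to see the experimental design in code, below is a brief sketch of a 10-fold cross-validation loop in scikit-learn. The random stand-in data (and the choice of an SVM as the model being evaluated) are illustrative assumptions, not the study's actual data or code.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((3090, 10))                 # stand-in for the preprocessed student data
y = rng.integers(0, 2, size=3090)          # 1 = returned for the second fall, 0 = dropped out

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=1).split(X, y):
    model = SVC().fit(X[train_idx], y[train_idx])    # one of the four model types compared
    accuracies.append(model.score(X[test_idx], y[test_idx]))
print(f"mean 10-fold accuracy: {np.mean(accuracies):.3f}")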
Results

The results (see Table 6.3) showed that, given sufficient data with the proper variables, data mining techniques are capable of predicting freshman student attrition with approximately 80 percent accuracy. Among the four individual prediction models used in this study, support vector machines performed the best, followed by decision trees, neural networks, and logistic regression.

The sensitivity analysis on the trained prediction models indicated that the most important predictors of student attrition are those related to the past and present educational success of the student (such as the ratio of completed credit hours to the total number of hours enrolled) and whether the student is receiving financial help.
QUESTIONS FOR DISCUSSION

1. Why is attrition one of the most important issues in higher education?
2. How can predictive analytics (ANN, SVM, and so forth) be used to better manage student retention?
3. What are the main challenges and potential solutions to the use of analytics in retention management?

Sources: Compiled from D. Delen, "A Comparative Analysis of Machine Learning Techniques for Student Retention Management," Decision Support Systems, Vol. 49, No. 4, 2010, pp. 498-506; V. Tinto, Leaving College: Rethinking the Causes and Cures of Student Attrition, University of Chicago Press, 1987; and D. Delen, "Predicting Student Attrition with Data Mining Methods," Journal of College Student Retention, Vol. 13, No. 1, 2011, pp. 17-35.
FIGURE 6.14 The Process of Developing and Testing Prediction Models. Raw data are preprocessed; the preprocessed data feed an experimental design (10-fold cross-validation); prediction models (decision trees, neural networks, support vector machines, and logistic regression) are built and tested, with results tallied in confusion matrices of correctly and incorrectly predicted YES and NO cases; the chosen model is then deployed.
TABLE 6.2 List of Variables Used in the Student Retention Project

No.  Variable                                   Data Type
1    College                                    Multi Nominal
2    Degree                                     Multi Nominal
3    Major                                      Multi Nominal
4    Concentration                              Multi Nominal
5    Fall Hours Registered                      Number
6    Fall Earned Hours                          Number
7    Fall GPA                                   Number
8    Fall Cumulative GPA                        Number
9    Spring Hours Registered                    Number
10   Spring Earned Hours                        Number
11   Spring GPA                                 Number
12   Spring Cumulative GPA                      Number
13   Second Fall Registered (Y/N)               Nominal
14   Ethnicity                                  Nominal
15   Sex                                        Binary Nominal
16   Residential Code                           Binary Nominal
17   Marital Status                             Binary Nominal
18   SAT High Score Comprehensive               Number
19   SAT High Score English                     Number
20   SAT High Score Reading                     Number
21   SAT High Score Math                        Number
22   SAT High Score Science                     Number
23   Age                                        Number
24   High School GPA                            Number
25   High School Graduation Year and Month      Date
26   Starting Term as New Freshmen              Multi Nominal
27   TOEFL Score                                Number
28   Transfer Hours                             Number
29   CLEP Earned Hours                          Number
30   Admission Type                             Multi Nominal
31   Permanent Address State                    Multi Nominal
32   Received Fall Financial Aid                Binary Nominal
33   Received Spring Financial Aid              Binary Nominal
34   Fall Student Loan                          Binary Nominal
35   Fall Grant/Tuition Waiver/Scholarship      Binary Nominal
36   Fall Federal Work Study                    Binary Nominal
37   Spring Student Loan                        Binary Nominal
38   Spring Grant/Tuition Waiver/Scholarship    Binary Nominal
39   Spring Federal Work Study                  Binary Nominal
TABLE 6.3 Prediction Results for the Four Data Mining Methods (A 10-fold cross-validation with a balanced data set is used to obtain these test results.)

                        ANN (MLP)        DT (C5)          SVM              LR
                        No      Yes      No      Yes      No      Yes      No      Yes
Confusion    No         2309    464      2311    417      2313    386      2125    626
Matrix       Yes        781     2626     779     2673     777     2704     965     2464
SUM                     3090    3090     3090    3090     3090    3090     3090    3090
Per-class Accuracy      74.72%  84.98%   74.79%  86.50%   74.85%  87.51%   68.77%  79.74%
Overall Accuracy        79.85%           80.65%           81.18%           74.26%
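To make the arithmetic in Table 6.3 concrete, here is a minimal sketch (in Python; the variable names are our own) that recomputes the per-class and overall accuracies for the SVM column from its confusion matrix counts.

# A sketch recomputing the Table 6.3 accuracy figures for the SVM column.
# Each actual class ("no", "yes") has 3,090 test cases in the balanced set.
confusion = {
    ("no", "no"):   2313,  # actual No,  predicted No  (correct)
    ("no", "yes"):   777,  # actual No,  predicted Yes (incorrect)
    ("yes", "yes"): 2704,  # actual Yes, predicted Yes (correct)
    ("yes", "no"):   386,  # actual Yes, predicted No  (incorrect)
}

for cls in ("no", "yes"):
    total = confusion[(cls, "no")] + confusion[(cls, "yes")]
    print(f"Per-class accuracy ({cls}): {confusion[(cls, cls)] / total:.2%}")

overall = (confusion[("no", "no")] + confusion[("yes", "yes")]) / sum(confusion.values())
print(f"Overall accuracy: {overall:.2%}")  # 81.18%, matching Table 6.3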
Mathematical Formulation of SVMs
Consider data points in the training data set of the form

{(x_1, c_1), (x_2, c_2), ..., (x_n, c_n)}

where c_i is the class label taking a value of either 1 (i.e., "yes") or 0 (i.e., "no"), while x_i is the input variable vector. That is, each data point is an m-dimensional real vector, usually of scaled [0, 1] or [-1, 1] values. The normalization and/or scaling are important steps to guard against variables/attributes with larger variance that might otherwise dominate the classification formulae. We can view this as training data, which denotes the correct classification (something that we would like the SVM to eventually achieve) by means of a dividing hyperplane, which takes the mathematical form

w · x - b = 0.

The vector w points perpendicular to the separating hyperplane. Adding the offset parameter b allows us to increase the margin. In its absence, the hyperplane is forced to pass through the origin, restricting the solution. As we are interested in the maximum margin, we are interested in the support vectors and the hyperplanes (parallel to the optimal hyperplane) closest to these support vectors in either class. It can be shown that these parallel hyperplanes can be described by the equations

w · x - b = 1,
w · x - b = -1.

If the training data are linearly separable, we can select these hyperplanes so that there are no points between them and then try to maximize their distance (see Figure 6.13b). By using geometry, we find the distance between the hyperplanes to be 2/‖w‖, so we want to minimize ‖w‖. To exclude data points, we need to ensure that for all i either

w · x_i - b ≥ 1   or
w · x_i - b ≤ -1.
This can be rewritten as

c_i(w · x_i - b) ≥ 1,   1 ≤ i ≤ n.

Primal Form

The problem now is to minimize ‖w‖ subject to the constraint c_i(w · x_i - b) ≥ 1, 1 ≤ i ≤ n. This is a quadratic programming (QP) optimization problem. More clearly,

Minimize (1/2)‖w‖²
subject to c_i(w · x_i - b) ≥ 1,   1 ≤ i ≤ n.

The factor of 1/2 is used for mathematical convenience.
Dual Form

Writing the classification rule in its dual form reveals that classification is only a function of the support vectors, that is, the training data that lie on the margin. The dual of the SVM can be shown to be

max Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j} α_i α_j c_i c_j (x_i · x_j),   subject to α_i ≥ 0,

where the α terms constitute a dual representation for the weight vector in terms of the training set:

w = Σ_i α_i c_i x_i
Soft Margin

In 1995, Cortes and Vapnik suggested a modified maximum margin idea that allows for mislabeled examples. If there exists no hyperplane that can split the "yes" and "no" examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible while still maximizing the distance to the nearest cleanly split examples. This work popularized the expression support vector machine, or SVM. The method introduces slack variables ξ_i, which measure the degree of misclassification of the datum x_i:

c_i(w · x_i - b) ≥ 1 - ξ_i

The objective function is then increased by a function that penalizes nonzero ξ_i, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem transforms to

min ‖w‖² + C Σ_i ξ_i   such that   c_i(w · x_i - b) ≥ 1 - ξ_i,   1 ≤ i ≤ n.

This constraint, along with the objective of minimizing ‖w‖, can be solved using Lagrange multipliers. The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant C appearing only as an additional constraint on the Lagrange multipliers. Nonlinear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken, the problem becomes non-convex, and it is then considerably more difficult to find a global solution.
Nonlinear Classification

The original optimal hyperplane algorithm, proposed by Vladimir Vapnik in 1963 while he was a doctoral student at the Institute of Control Science in Moscow, was a linear classifier. However, in 1992, Boser, Guyon, and Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick (originally proposed by Aizerman et al., 1964) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation may be nonlinear and the transformed space high dimensional; thus, though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space.

If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum margin classifiers are well regularized, so the infinite dimension does not spoil the results. Some common kernels include:

• Polynomial (homogeneous): k(x, x′) = (x · x′)^d
• Polynomial (inhomogeneous): k(x, x′) = (x · x′ + 1)^d
• Radial basis function: k(x, x′) = exp(−γ‖x − x′‖²), for γ > 0
• Gaussian radial basis function: k(x, x′) = exp(−‖x − x′‖² / (2σ²))
• Sigmoid: k(x, x′) = tanh(κ(x · x′) + c), for some κ > 0 and c < 0
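As an illustration only (the book provides no code), the following sketch implements the kernels listed above in Python with NumPy; the parameter defaults (d, c, gamma, sigma, kappa) are arbitrary placeholders, not recommended values.

# A minimal sketch of the common SVM kernels; x and y are equal-length
# 1-D feature vectors (NumPy arrays).
import numpy as np

def polynomial(x, y, d=2, c=0.0):
    """Homogeneous (c = 0) or inhomogeneous (c = 1) polynomial kernel."""
    return (np.dot(x, y) + c) ** d

def rbf(x, y, gamma=0.5):
    """Radial basis function kernel, for gamma > 0."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gaussian_rbf(x, y, sigma=1.0):
    """Gaussian RBF; equivalent to rbf() with gamma = 1 / (2 * sigma**2)."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid(x, y, kappa=0.01, c=-1.0):
    """Sigmoid kernel, for some kappa > 0 and c < 0."""
    return np.tanh(kappa * np.dot(x, y) + c)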
Kernel Trick

In machine learning, the kernel trick is a method for converting a linear classifier algorithm into a nonlinear one by using a nonlinear function to map the original observations into a higher-dimensional space; this makes a linear classification in the new space equivalent to nonlinear classification in the original space.

This is done using Mercer's theorem, which states that any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space. More specifically, if the arguments to the kernel are in a measurable space X, and if the kernel is positive semi-definite, i.e.,

Σ_{i,j} K(x_i, x_j) c_i c_j ≥ 0

for any finite subset {x_1, ..., x_n} of X and subset {c_1, ..., c_n} of objects (typically real numbers or even molecules), then there exists a function φ(x) whose range is in an inner product space of possibly high dimension, such that

K(x, y) = φ(x) · φ(y)

The kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a nonlinear algorithm. This nonlinear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable, because the high-dimensional space may be infinite-dimensional (as is the case when the kernel is a Gaussian).
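The following sketch (our own worked example, assuming NumPy) illustrates this numerically: for 2-D inputs, the quadratic kernel k(x, y) = (x · y + 1)² equals the dot product of an explicit 6-dimensional feature map φ, so φ never has to be computed when the kernel is used.

# A sketch verifying the kernel trick for k(x, y) = (x . y + 1)^2 on 2-D
# inputs: the kernel value equals a dot product in a 6-dimensional space.
import numpy as np

def phi(v):
    """Explicit feature map for the inhomogeneous quadratic kernel (2-D input)."""
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
kernel_value = (np.dot(x, y) + 1) ** 2           # computed in the input space
explicit_value = np.dot(phi(x), phi(y))          # computed in the feature space
assert np.isclose(kernel_value, explicit_value)  # both equal 25.0 here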
Although the origin of the term kernel trick is not known, the kernel trick was first published by Aizerman et al. (1964). It has been applied to several kinds of algorithms in machine learning and statistics, including:

• Perceptrons
• Support vector machines
• Principal components analysis
• Fisher's linear discriminant analysis
• Clustering
SECTION 6.5 REVIEW QUESTIONS

1. How do SVMs work?
2. What are the advantages and disadvantages of SVMs?
3. What is the meaning of "maximum margin hyperplanes"? Why are they important in SVMs?
4. What is the "kernel trick"? How is it used in SVMs?
6.6 A PROCESS-BASED APPROACH TO THE USE OF SVM

Due largely to their better classification results, support vector machines (SVMs) have recently become a popular technique for classification-type problems. Even though people consider them easier to use than artificial neural networks, users who are not familiar with the intricacies of SVMs often get unsatisfactory results. In this section we provide a process-based approach to the use of SVM, which is more likely to produce better results. A pictorial representation of the three-step process is given in Figure 6.15.
NUMERICIZING THE DATA SVMs require that each data instance be represented as a vector of real numbers. Hence, if there are categorical attributes, we first have to convert them into numeric data. A common recommendation is to use m pseudo-binary variables to represent an m-class attribute (where m ≥ 3). In practice, only one of the m variables assumes the value of "1" while the others assume the value of "0", based on the actual class of the case (this is also called 1-of-m representation). For example, a three-category attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).
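A minimal sketch of this 1-of-m representation in plain Python (the function name and category ordering are our own):

# Map a categorical value to an m-length binary vector (1-of-m encoding).
def one_of_m(value, categories=("red", "green", "blue")):
    return tuple(1 if value == c else 0 for c in categories)

print(one_of_m("red"))    # (1, 0, 0)
print(one_of_m("green"))  # (0, 1, 0)
print(one_of_m("blue"))   # (0, 0, 1)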
NORMALIZING THE DATA As was the case for artificial neural networks, SVMs also require normalization and/or scaling of numerical values. The main advantage of normalization is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is that it helps with the numerical calculations performed during the iterative process of model building. Because kernel values usually depend on the inner products of feature vectors (e.g., the linear kernel and the polynomial kernel), large attribute values might slow the training process. A common recommendation is to normalize each attribute to the range [-1, +1] or [0, 1]. Of course, we have to use the same normalization method to scale the testing data before testing.
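A minimal sketch of this kind of min-max scaling in plain Python (the helper names are ours); note that the minimum and maximum come from the training data and are reused on the testing data, as the text requires:

# Fit a min-max range on the training data, then scale any value into a
# chosen target range using that same fitted range.
def fit_minmax(column):
    return min(column), max(column)

def scale(value, lo, hi, target=(0.0, 1.0)):
    """Linearly map value from [lo, hi] into the target range."""
    a, b = target
    return a + (value - lo) * (b - a) / (hi - lo)

train = [12.0, 30.0, 18.0, 24.0]
lo, hi = fit_minmax(train)                  # fit on training data only
print([scale(v, lo, hi) for v in train])    # [0.0, 1.0, 0.333..., 0.666...]
print(scale(27.0, lo, hi, (-1.0, 1.0)))     # a test value scaled into [-1, +1]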
SELECT THE KERNEL TYPE AND KERNEL PARAMETERS Even though there are only four common kernels, as mentioned in the previous section, one must decide which one to use (or whether to try them all, one at a time, using a simple experimental design approach). Once the kernel type is selected, one needs to select the value of the penalty parameter C and the kernel parameters. Generally speaking, RBF is a reasonable first choice for the kernel type. The RBF kernel aims to nonlinearly map data into a higher-dimensional space; by doing so (unlike with a linear kernel) it handles the cases where the relation between
input and output vectors is highly nonlinear. Besides, one should note that the linear kernel is just a special case of the RBF kernel. There are two parameters to choose for RBF kernels: C and γ. It is not known beforehand which C and γ are best for a given prediction problem; therefore, some kind of parameter search method needs to be used. The goal of the search is to identify optimal values for C and γ so that the classifier can accurately predict unknown data (i.e., testing data). The two most commonly used search methods are cross-validation and grid search.

[Figure 6.15 (flowchart): Training data → Preprocess the Data (scrub the data: identify and handle missing, incorrect, and noisy values; transform the data: numericize, normalize, and standardize) → Develop the Model (select the kernel type: RBF, Sigmoid, or Polynomial; determine the kernel values via v-fold cross-validation or grid search; experimentation: training/testing) → Validated SVM model → Deploy the Model (extract the model coefficients; code the trained model into the decision support system; monitor and maintain the model) → Prediction Model.]

FIGURE 6.15 A Simple Process Description for Developing SVM Models.
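As one possible realization of this search (the book prescribes no tool), the sketch below uses scikit-learn, assumed to be available, to grid-search C and gamma for an RBF-kernel SVM with 10-fold cross-validation on a bundled sample data set:

# Grid search over C and gamma for an RBF-kernel SVM, scored by 10-fold
# cross-validation; attributes are first normalized to [0, 1].
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

# Exponentially spaced candidate values, a common recommendation.
param_grid = {"C": [2**k for k in range(-5, 6, 2)],
              "gamma": [2**k for k in range(-9, 2, 2)]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)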
DEPLOY THE MODEL Once an "optimal" SVM prediction model has been developed, the next step is to integrate it into the decision support system. For that, there are two options: (1) converting the model into a computational object (e.g., a Web service, Java Bean, or COM object) that takes the input parameter values and provides output predictions, or (2) extracting the model coefficients and integrating them directly into the decision support system. SVM models are useful (i.e., accurate, actionable) only as long as the behavior of the underlying domain stays the same; if it changes for some reason, so does the accuracy of the model. Therefore, one should continuously assess the performance of the models and decide when they are no longer accurate and, hence, need to be retrained.
Support Vector Machines Versus Artificial Neural Networks

Even though some people characterize SVMs as a special case of ANNs, most recognize them as two competing machine-learning techniques with different qualities. Here are a few points that help SVMs stand out against ANNs. Historically, the development of ANNs
followed a heuristic path, with applications and extensive experimentation preceding theory. In contrast, the development of SVMs involved sound statistical learning theory first, then implementation and experiments. A significant advantage of SVMs is that while ANNs may suffer from multiple local minima, the solutions to SVMs are global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. One reason that SVMs often outperform ANNs in practice is that they successfully deal with the overfitting problem, which is a big issue with ANNs.

Besides these advantages (from a practical point of view), SVMs also have some limitations. An important issue that is not entirely solved is the selection of the kernel type and kernel function parameters. A second and perhaps more important limitation of SVMs is their speed and size, in both the training and testing cycles. Model building in SVMs involves complex and time-demanding calculations. From the practical point of view, perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks. Despite these limitations, because SVMs are based on a sound theoretical foundation and the solutions they produce are global and unique (as opposed to getting stuck in a suboptimal alternative such as a local minimum), nowadays they are arguably among the most popular prediction modeling techniques in the data mining arena. Their use and popularity will only increase as popular commercial data mining tools incorporate them into their modeling arsenals.
SECTION 6.6 REVIEW QUESTIONS

1. What are the main steps and decision points in developing an SVM model?
2. How do you determine the optimal kernel type and kernel parameters?
3. Compared to ANN, what are the advantages of SVM?
4. What are the common application areas for SVM? Conduct a search on the Internet to identify popular application areas and the specific SVM software tools used in those applications.
6.7 NEAREST NEIGHBOR METHOD FOR PREDICTION

Data mining algorithms tend to be highly mathematical and computationally intensive. The two popular ones covered in the previous sections (i.e., ANNs and SVMs) involve time-demanding, computationally intensive iterative mathematical derivations. In contrast, the k-nearest neighbor algorithm (or kNN, in short) seems almost overly simplistic for a competitive prediction method, and it is easy to understand (and explain to others) what it does and how it does it. kNN is a prediction method for classification- as well as regression-type prediction problems. kNN is a type of instance-based learning (or lazy learning), where the function is only approximated locally and all computations are deferred until the actual prediction.

The k-nearest neighbor algorithm is among the simplest of all machine-learning algorithms: In classification-type prediction, for instance, a case is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (where k is a positive integer). If k = 1, then the case is simply assigned to the class of its nearest neighbor. To illustrate the concept with an example, let us look at Figure 6.16, where a simple two-dimensional space represents the values of the two variables (x, y); the star represents a new case (or object); and circles and squares represent known cases (or examples). The task is to assign the new case to either circles or squares based on its closeness (similarity) to one or the other. If you set the value of k to 1 (k = 1), the assignment should be made to square, because the closest example to the star is a square. If you set the value of k to 3 (k = 3), then the assignment should be made to
circle, because there are two circles and one square and, hence, by the simple majority vote rule, circle gets the assignment of the new case. Similarly, if you set the value of k to 5 (k = 5), then the assignment should be made to the square class. This overly simplified example is meant to illustrate the importance of the value that one assigns to k.

[Figure 6.16 (scatter plot): a new case (star) in a two-dimensional (x, y) space surrounded by known circle and square examples, with neighborhoods drawn for k = 3 and k = 5.]

FIGURE 6.16 The Importance of the Value of k in the kNN Algorithm.

The same method can also be used for regression-type prediction tasks, by simply averaging the values of the k nearest neighbors and assigning this result to the case being predicted. It can be useful to weight the contributions of the neighbors so that the nearer neighbors contribute more to the average than the more distant ones. A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. This scheme is essentially a generalization of linear interpolation.

The neighbors are taken from a set of cases for which the correct classification (or, in the case of regression, the numerical value of the output) is known. This can be thought of as the training set for the algorithm, even though no explicit training step is required. The k-nearest neighbor algorithm is sensitive to the local structure of the data.
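A from-scratch sketch of the majority-vote kNN prediction just described, with Euclidean distance, an optional 1/d weighting, and toy square/circle examples of our own:

# Classify a new case by a (optionally distance-weighted) majority vote
# of its k nearest neighbors under Euclidean distance.
import math
from collections import Counter

def knn_predict(examples, new_case, k=3, weighted=False):
    """examples: list of (feature_vector, class_label) pairs."""
    nearest = sorted(
        (math.dist(x, new_case), label) for x, label in examples
    )[:k]
    votes = Counter()
    for d, label in nearest:
        votes[label] += 1.0 / d if weighted and d > 0 else 1.0
    return votes.most_common(1)[0][0]

examples = [((1, 1), "square"), ((2, 1), "square"),
            ((4, 4), "circle"), ((5, 4), "circle"), ((4, 5), "circle")]
print(knn_predict(examples, (2, 2), k=1))  # 'square': nearest example wins
print(knn_predict(examples, (3, 3), k=3))  # majority vote among 3 neighbors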
Similarity Measure: The Distance Metric

One of the two critical decisions an analyst has to make when using kNN is determining the similarity measure (the other is determining the value of k, which is explained next). In the kNN algorithm, the similarity measure is a mathematically calculable distance metric. Given a new case, kNN makes predictions based on the outcome of the k neighbors closest in distance to that point. Therefore, to make predictions with kNN, we need to define a metric for measuring the distance between the new case and the cases from the examples. One of the most popular choices for measuring this distance is the Euclidean distance (Equation 3), which is simply the linear distance between two points in a dimensional space; the other popular one is the rectilinear distance (a.k.a. city-block or Manhattan distance) (Equation 2). Both of these distance measures are special cases of the Minkowski distance (Equation 1):

d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)     (Equation 1)
where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects (e.g., a new case and an example in the data set), and q is a positive integer.

If q = 1, then d is called the Manhattan distance (Equation 2).
If q = 2, then d is called the Euclidean distance (Equation 3).

Obviously, these measures apply only to numerically represented data. How about nominal data? There are ways to measure distance for non-numerical data as well. In the simplest case, for a multi-value nominal variable, the distance is zero if the value of that variable for the new case and for the example case are the same, and one otherwise. In cases such as text classification, more sophisticated metrics exist, such as the overlap metric (or Hamming distance). Often, the classification accuracy of kNN can be improved significantly if the distance metric is determined through an experimental design in which different metrics are tried and tested to identify the best one for the given problem.
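A minimal sketch of the Minkowski family (Equation 1), its Manhattan and Euclidean special cases, and the simple nominal-value distance described above (function names are ours):

# Minkowski distance and its q = 1 and q = 2 special cases, plus a
# match/no-match distance for nominal attributes.
def minkowski(i, j, q):
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

def manhattan(i, j):          # q = 1, rectilinear / city-block
    return minkowski(i, j, 1)

def euclidean(i, j):          # q = 2, straight-line distance
    return minkowski(i, j, 2)

def nominal_distance(a, b):   # 0 if the values match, otherwise 1
    return 0 if a == b else 1

print(manhattan((1, 2), (4, 6)))  # 7.0
print(euclidean((1, 2), (4, 6)))  # 5.0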
Parameter Selection

The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification (or regression) but also make the boundaries between classes less distinct. An "optimal" value of k can be found by heuristic techniques, for instance, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbor algorithm.
CROSS-VALIDATION Cross-validation is a well-established experimentation technique that can be used to determine optimal values for a set of unknown model parameters. It applies to most, if not all, machine-learning techniques in which a number of model parameters must be determined. The general idea of this experimentation method is to divide the data sample into a number of randomly drawn, disjoint sub-samples (i.e., v folds). For each potential value of k, the kNN model is used to make predictions on the vth fold while using the other v - 1 folds as the examples, and the error is evaluated. The common choice for this error is the root-mean-squared error (RMSE) for regression-type predictions and the percentage of correctly classified instances (i.e., hit rate) for classification-type predictions. This process of testing each fold against the remaining folds repeats v times. At the end of the v cycles, the computed errors are accumulated to yield a goodness measure of the model (i.e., how well the model predicts with the current value of k). Finally, the k value that produces the smallest overall error is chosen as the optimal value for that problem. Figure 6.17 shows a simple process in which the training data are used to determine optimal values for k and the distance metric, which are then used to predict new incoming cases.

[Figure 6.17 (flowchart): Historic Data split into a Training Set and a Validation Set for Parameter Setting (distance measure, value of k); the chosen parameters are then used in Predicting to classify (or forecast) new cases from New Data using the k most similar cases.]

FIGURE 6.17 The Process of Determining the Optimal Values for Distance Metric and k.
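As one way to carry out this procedure (again, the book names no tool), the following sketch uses scikit-learn, assumed to be available, to score odd values of k by 10-fold cross-validated hit rate on a bundled sample data set and keep the best one:

# Choose k for kNN by 10-fold cross-validation: the k with the highest
# mean hit rate across the folds wins.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16, 2):                  # odd k values help avoid tied votes
    model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    scores[k] = cross_val_score(model, X, y, cv=10).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))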
As we observed in the simple example given earlier, the accuracy of the kNN algorithm can be significantly different with different values of k. Furthermore, the predictive power of the kNN algorithm degrades in the presence of noisy, inaccurate, or irrelevant features. Much research effort has been put into feature selection and normalization/scaling to ensure reliable prediction results. A particularly popular approach is the use of evolutionary algorithms (e.g., genetic algorithms) to optimize the set of features included in the kNN prediction system. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes.

A drawback of the basic majority voting classification in kNN is that the classes with the more frequent examples tend to dominate the prediction of the new vector, as they
tend to come up in the k nearest neighbors when the neighbors are computed, simply because of their large number. One way to overcome this problem is to weight the classification by taking into account the distance from the test point to each of its k nearest neighbors. Another way to overcome this drawback is by adding one level of abstraction in the data representation.

The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes kNN computationally tractable even for large data sets. Application Case 6.5 talks about the superior capabilities of kNN in image recognition and categorization.
Application Case 6.5
Efficient Image Recognition and Categorization with kNN

Image recognition is an emerging data mining application field involved in processing, analyzing, and categorizing visual objects such as pictures. In the process of recognition (or categorization), images are first transformed into a multidimensional feature space and then, using machine-learning techniques, are categorized into a finite number of classes. Application areas of image recognition and categorization range from agriculture to homeland security, personalized marketing to environmental protection. Image recognition is an integral part of an artificial intelligence field called computer vision. As a technological discipline, computer vision seeks to develop computer systems that are capable of "seeing" and reacting to their environment. Examples of applications of computer vision include systems for process automation (industrial robots), navigation (autonomous vehicles), monitoring/detecting (visual surveillance), searching and sorting visuals (indexing databases of images and image sequences), engaging (computer-human interaction), and inspection (manufacturing processes).
While the field of visual recognition and category recognition has been progressing rapidly, much remains to be done to reach human-level performance. Current approaches are capable of dealing with only a limited number of categories (100 or so) and are computationally expensive. Many machine-learning techniques (including ANN, SVM, and kNN) are used to develop computer systems for visual recognition and categorization. Though commendable results have been obtained, generally speaking, none of these tools in their current form is capable of developing systems that can compete with humans.

In a research project, several researchers from the Computer Science Division of the Electrical Engineering and Computer Science Department at the University of California, Berkeley, used an innovative ensemble approach to image categorization (Zhang et al., 2006). They considered visual category recognition in the framework of measuring similarities, or perceptual distances, to develop examples of categories. Their recognition and categorization approach was quite flexible, permitting recognition based on color, texture, and particularly shape. While nearest neighbor classifiers (i.e., kNN) are natural in this setting, they suffered from the problem of high variance (in the bias-variance decomposition) in the case of limited sampling. Alternatively, one could choose to use support vector machines, but they involve time-consuming optimization and computations. The researchers proposed a hybrid of these two methods, which deals naturally with the multiclass setting, has reasonable computational complexity both in training and at run time, and yields excellent results in practice. The basic idea was to find close neighbors to a query sample and train a local support vector machine that preserves the distance function on the collection of neighbors.

Their method can be applied to large, multiclass data sets, where it outperforms nearest neighbor and support vector machines and remains efficient even when the problem would otherwise be intractable. A wide variety of distance functions were used, and their experiments showed state-of-the-art performance on a number of benchmark data sets for shape and texture classification (MNIST, USPS, CUReT) and object recognition (Caltech-101).
Another group of researchers (Boiman et al., 2008) argued that two practices commonly used in image classification methods (namely, SVM- and ANN-type model-driven approaches and kNN-type non-parametric approaches) have led to less-than-desired performance outcomes. They also claimed that a hybrid method can improve the performance of image recognition and categorization. They proposed a trivial Naive Bayes kNN-based classifier, which employs kNN distances in the space of the local image descriptors (and not in the space of images). They claimed that, although the modified kNN method is extremely simple and efficient and requires no learning/training phase, its performance ranks among the top leading learning-based parametric image classifiers. Empirical comparisons of their method were shown on several challenging image categorization databases (Caltech-101, Caltech-256, and Graz-01).

In addition to image recognition and categorization, kNN has been successfully applied to complex classification problems, such as content retrieval (handwriting detection, video content analysis, and body and sign language, where communication is done using body or hand gestures), gene expression (another area where kNN tends to perform better than other state-of-the-art techniques; in fact, a combination of kNN and SVM is one of the most popular techniques used here), and protein-to-protein interaction and 3D structure prediction (graph-based kNN is often used for interaction structure prediction).
QUESTIONS FOR DISCUSSION

1. Why is image recognition/classification a worthy but difficult problem?
2. How can kNN be effectively used for image recognition/classification applications?

Sources: H. Zhang, A. C. Berg, M. Maire, and J. Malik, "SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition," Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, 2006, pp. 2126-2136; O. Boiman, E. Shechtman, and M. Irani, "In Defense of Nearest-Neighbor Based Image Classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-8.
SECTION 6.7 REVIEW QUESTIONS

1. What is special about the kNN algorithm?
2. What are the advantages and disadvantages of kNN as compared to ANN and SVM?
3. What are the critical success factors for a kNN implementation?
4. What is a similarity (or distance) measure? How can it be applied to both numerical and nominal valued variables?
5. What are the common applications of kNN?
Chapter Highlights

• Neural computing involves a set of methods that emulate the way the human brain works. The basic processing unit is a neuron. Multiple neurons are grouped into layers and linked together.
• In a neural network, the knowledge is stored in the weight associated with each connection between two neurons.
• Backpropagation is the most popular paradigm in business applications of neural networks. Most business applications are handled using this algorithm.
• A backpropagation-based neural network consists of an input layer, an output layer, and a certain number of hidden layers (usually one). The nodes in one layer are fully connected to the nodes in the next layer. Learning is done through a trial-and-error process of adjusting the connection weights.
• Each node at the input layer typically represents a single attribute that may affect the prediction.
• Neural network learning can occur in supervised or unsupervised mode.
• In supervised learning mode, the training patterns include a correct answer/classification/forecast.
• In unsupervised learning mode, there are no known answers. Thus, unsupervised learning is used for clustering or exploratory data analysis.
• The usual process of learning in a neural network involves three steps: (1) compute temporary outputs based on inputs and random weights, (2) compare outputs with desired targets, and (3) adjust the weights and repeat the process.
• The delta rule is commonly used to adjust the weights. It includes a learning rate and a momentum parameter.
• Developing neural network-based systems requires a step-by-step process. It includes data preparation and preprocessing, training and testing, and conversion of the trained model into a production system.
• Neural network software is available to allow easy experimentation with many models. Neural network modules are included in all major data mining software tools. Specific neural network packages are also available. Some neural network tools are available as spreadsheet add-ins.
• After a trained network has been created, it is usually implemented in end-user systems through programming languages such as C++, Java, and Visual Basic. Most neural network tools can generate code for the trained network in these languages.
• Many neural network models beyond backpropagation exist, including radial basis functions, support vector machines, Hopfield networks, and Kohonen's self-organizing maps.
• Neural network applications abound in almost all business disciplines as well as in virtually all other functional areas.
• Business applications of neural networks include finance, bankruptcy prediction, time-series forecasting, and so on.
• New applications of neural networks are emerging in healthcare, security, and so on.
Key Terms

artificial neural network (ANN), axon, backpropagation, connection weight, dendrite, hidden layer, k-nearest neighbor, Kohonen's self-organizing feature map, neural computing, neural network, neuron, nucleus, parallel processing, pattern recognition, perceptron, processing element (PE), sigmoid (logistic activation) function, summation function, supervised learning, synapse, threshold value, transformation (transfer) function
Questions for Discussion

1. Compare artificial and biological neural networks. What aspects of biological networks are not mimicked by artificial ones? What aspects are similar?
2. The performance of an ANN relies heavily on the summation and transformation functions. Explain the combined effects of the summation and transformation functions and how they differ from statistical regression analysis.
3. ANNs can be used for both supervised and unsupervised learning. Explain how they learn in a supervised mode and in an unsupervised mode.
4. Explain the difference between a training set and a testing set. Why do we need to differentiate them? Can the same set be used for both purposes? Why or why not?
5. Say that a neural network has been constructed to predict the creditworthiness of applicants. There are two output nodes: one for yes (1 = yes, 0 = no) and one for no (1 = no, 0 = yes). An applicant receives a score of 0.83 for the "yes" output node and 0.44 for the "no" output node. Discuss what may have happened and whether the applicant is a good credit risk.
6. Everyone would like to make a great deal of money on the stock market. Only a few are very successful. Why is using an ANN a promising approach? What can it do that other decision support technologies cannot do? How could it fail?
Exercises

Teradata University Network (TUN) and Other Hands-On Exercises

1. Go to the Teradata University Network Web site (teradatauniversitynetwork.com) or the URL given by your instructor. Locate Web seminars related to data mining and neural networks. Specifically, view the seminar given by Professor Hugh Watson at the SPIRIT2005 conference at Oklahoma State University; then answer the following questions:
a. Which real-time application at Continental Airlines may have used a neural network?
b. What inputs and outputs can be used in building a neural network application?
c. Given that Continental's data mining applications are in real time, how might Continental implement a neural network in practice?
d. What other neural network applications would you propose for the airline industry?
2. Go to the Teradata University Network Web site (teradatauniversitynetwork.com) or the URL given by your instructor. Locate the Harrah's case. Read the case and answer the following questions:
a. Which of the Harrah's data applications are most likely implemented using neural networks?
b. What other applications could Harrah's develop using the data it is collecting from its customers?
c. What are some concerns you might have as a customer at this casino?
3. The bankruptcy-prediction problem can be viewed as a problem of classification. The data set you will be using for this problem includes five ratios that have been computed from the financial statements of real-world firms. These five ratios have been used in studies involving bankruptcy prediction. The first sample includes data on firms that went bankrupt and firms that didn't. This will be your training sample for the neural network. The second sample of 10 firms also consists of some bankrupt firms
and some nonbankrupt firms. Your goal is to use neural networks, support vector machines, and nearest neighbor algorithms to build a model using the first 20 data points, and then test its performance on the other 10 data points. (Try to analyze the new cases yourself manually before you run the neural network and see how well you do.) The following tables show the training sample and test data you should use for this exercise.
Training Sample

Firm   WC/TA    RE/TA     EBIT/TA   MVE/TD   S/TA     BR/NB
1      0.1650   0.1192    0.2035    0.8130   1.6702   1
2      0.1415   0.3868    0.0681    0.5755   1.0579   1
3      0.5804   0.3331    0.0810    1.1964   1.3572   1
4      0.2304   0.2960    0.1225    0.4102   3.0809   1
5      0.3684   0.3913    0.0524    0.1658   1.1533   1
6      0.1527   0.3344    0.0783    0.7736   1.5046   1
7      0.1126   0.3071    0.0839    1.3429   1.5736   1
8      0.0141   0.2366    0.0905    0.5863   1.4651   1
9      0.2220   0.1797    0.1526    0.3459   1.7237   1
10     0.2776   0.2567    0.1642    0.2968   1.8904   1
11     0.2689   0.1729    0.0287    0.1224   0.9277   0
12     0.2039   -0.0476   0.1263    0.8965   1.0457   0
13     0.5056   -0.1951   0.2026    0.5380   1.9514   0
14     0.1759   0.1343    0.0946    0.1955   1.9218   0
15     0.3579   0.1515    0.0812    0.1991   1.4582   0
16     0.2845   0.2038    0.0171    0.3357   1.3258   0
17     0.1209   0.2823    -0.0113   0.3157   2.3219   0
18     0.1254   0.1956    0.0079    0.2073   1.4890   0
19     0.1777   0.0891    0.0695    0.1924   1.6871   0
20     0.2409   0.1660    0.0746    0.2516   1.8524   0
Test Data

Firm   WC/TA    RE/TA    EBIT/TA   MVE/TD   S/TA     BR/NB
A      0.1759   0.1343   0.0946    0.1955   1.9218   ?
B      0.3732   0.3483   -0.0013   0.3483   1.8223   ?
C      0.1725   0.3238   0.1040    0.8847   0.5576   ?
D      0.1630   0.3555   0.0110    0.3730   2.8307   ?
E      0.1904   0.2011   0.1329    0.5580   1.6623   ?
F      0.1123   0.2288   0.0100    0.1884   2.7186   ?
G      0.0732   0.3526   0.0587    0.2349   1.7432   ?
H      0.2653   0.2683   0.0235    0.5118   1.8350   ?
I      0.1070   0.0787   0.0433    0.1083   1.2051   ?
J      0.2921   0.2390   0.0673    0.3402   0.9277   ?
Describe the results of the neural network, support vector machine, and nearest neighbor model predictions, including software, architecture, and training information.

4. The purpose of this exercise is to develop models to predict forest cover type using a number of cartographic measures. The given data set (Online File W6.1) includes four wilderness areas found in the Roosevelt National Forest of northern Colorado. A total of 12 cartographic measures were utilized as independent variables; seven major forest cover types were used as dependent variables. The following table provides a short description of these independent and dependent variables:

Number   Name                                   Description

Independent Variables
1        Elevation                              Elevation in meters
2        Aspect                                 Aspect in degrees azimuth
3        Slope                                  Slope in degrees
4        Horizontal_Distance_To_Hydrology       Horizontal distance to nearest surface-water features
5        Vertical_Distance_To_Hydrology         Vertical distance to nearest surface-water features
6        Horizontal_Distance_To_Roadways        Horizontal distance to nearest roadway
7        Hillshade_9am                          Hill shade index at 9 A.M., summer solstice
8        Hillshade_Noon                         Hill shade index at noon, summer solstice
9        Hillshade_3pm                          Hill shade index at 3 P.M., summer solstice
10       Horizontal_Distance_To_Fire_Points     Horizontal distance to nearest wildfire ignition points
11       Wilderness_Area (4 binary variables)   Wilderness area designation
12       Soil_Type (40 binary variables)        Soil type designation

Dependent Variable
         Cover_Type (7 unique types)            Forest cover type designation

Note: More details about the data set (variables and observations) can be found in the online file.

This is an excellent example of a multiclass classification problem. The data set is rather large (with 581,012 unique instances) and feature rich. As you will see, the data are also raw and skewed (unbalanced across the different cover types). As a model builder, you are to make the necessary decisions to preprocess the data and build the best possible predictor. Use your favorite tool to build models for neural networks, support vector machines, and nearest neighbor algorithms, and document the details of your results and experiences in a written report. Use screenshots within your report to illustrate important and interesting findings. You are expected to discuss and justify any decisions you make along the way.

The reuse of this data set is unlimited with retention of the copyright notice for Jock A. Blackard and Colorado State University.
Team Assignments and Role-Playing Projects

1. Consider the following set of data that relates daily electricity usage as a function of outside high temperature (for the day):

Temperature, X    Kilowatts, Y
46.8              12,530
52.1              10,800
55.1              10,180
59.2              9,730
61.9              9,750
66.2              10,230
69.9              11,160
76.8              13,910
79.7              15,110
79.3              15,690
80.2              17,020
83.3              17,880
a. Plot the raw data. What pattern do you see? What do you think is really affecting electricity usage?
b. Solve this problem with linear regression Y = a + bX (in a spreadsheet). How well does this work? Plot your results. What is wrong? Calculate the sum-of-squares error and R².
c. Solve this problem by using nonlinear regression. We recommend a quadratic function, Y = a + b1X + b2X². How well does this work? Plot your results. Is anything wrong? Calculate the sum-of-squares error and R².
d. Break up the problem into three sections (look at the plot) and solve it using three linear regression models, one for each section. How well does this work? Plot your results. Calculate the sum-of-squares error and R². Is this modeling approach appropriate? Why or why not?
e. Build a neural network to solve the original problem. (You may have to scale the X and Y values to be between 0 and 1.) Train it (on the entire set of data) and solve the problem (i.e., make predictions for each of the original data items). How well does this work? Plot your results. Calculate the sum-of-squares error and R².
f. Which method works best and why?
2. Build a real-world neural network. Using demo software downloaded from the Web (e.g., NeuroSolutions at neurodimension.com or another site), identify real-world data (e.g., start searching on the Web at ics.uci.edu/~mlearn/MLRepository.html or use data from an organization with which someone in your group has a contact) and build a neural network to make predictions. Topics might include sales forecasts, predicting success in an academic program (e.g., predict GPA from high school rating and SAT scores, being careful to look out for "bad" data, such as GPAs of 0.0), or housing prices; or survey the class for weight, gender, and height and try to predict height based on the other two factors. You could also use U.S. Census data on this book's Web site or at census.gov, by state, to identify a relationship between education level and income. How good are your predictions? Compare the results to predictions generated using standard statistical methods (regression). Which method is better? How could your system be embedded in a DSS for real decision making?
3. For each of the following applications, would it be better to use a neural network or an expert system? Explain your answers, including possible exceptions or special conditions.
a. Diagnosis of a well-established but complex disease
b. Price-lookup subsystem for a high-volume merchandise seller
c. Automated voice-inquiry processing system
d. Training of new employees
e. Handwriting recognition
4. Consider the following data set, which includes three attributes and a classification for admission decisions into an MBA program:

GMAT   GPA    Quantitative GMAT   Decision
650    2.75   35                  NO
580    3.50   70                  NO
600    3.50   75                  YES
450    2.95   80                  NO
700    3.25   90                  YES
590    3.50   80                  YES
400    3.85   45                  NO
640    3.50   75                  YES
540    3.00   60                  ?
690    2.85   80                  ?
490    4.00   65                  ?
a. Using the data given here as examples, develop your own manual expert rules for decision making.
b. Build and test a neural network model using your favorite data mining tool. Experiment with different model parameters to "optimize" the predictive power of your model.
c. Build and test a support vector machine model using your favorite data mining tool. Experiment with different model parameters to "optimize" the predictive power of your model. Compare the results of ANN and SVM.
d. Report the predictions on the last three observations from each of the three classification approaches (ANN, SVM, and kNN). Comment on the results.
e. Comment on the similarities and differences of these three prediction approaches. What did you learn from this exercise?
5. You have worked on neural networks and other data mining techniques. Give examples of where each of these has been used. Based on your knowledge, how would you differentiate among these techniques? Assume that a few years from now you come across a situation in which neural networks or other data mining techniques could be used to build an interesting application for your organization. You have an intern working with you to do the grunt work. How will you decide whether the application is appropriate for a neural network or for another data mining model? Based on your homework assignments, what specific software guidance can you provide to get your intern to be productive for you quickly? Your answer might mention the specific software, describe how to go about setting up the model/neural network, and validate the application.
Internet Exercises

1. Explore the Web sites of several neural network vendors, such as California Scientific Software (calsci.com), NeuralWare (neuralware.com), and Ward Systems Group (wardsystems.com), and review some of their products. Download at least two demos and install, run, and compare them.
2. A very good repository of data that has been used to test the performance of neural network and other machine-learning algorithms can be accessed at ics.uci.edu/~mlearn/MLRepository.html. Some of the data sets are really meant to test the limits of current machine-learning algorithms and compare their performance against new approaches to learning. However, some of the smaller data sets can be useful for exploring the functionality of the software you might download in Internet Exercise 1 or the software that is available at StatSoft.com (i.e., Statistica Data Miner, with extensive neural network capabilities). Download at least one data set from the UCI repository (e.g., Credit Screening Databases, Housing Database). Then apply neural networks as well as decision tree methods, as appropriate. Prepare a report on your results. (Some of these exercises could also be completed in a group or may even be proposed as semester-long projects for term papers and so on.)
3. Go to calsci.com and read about the company's various business applications. Prepare a report that summarizes the applications.
4. Go to nd.com. Read about the company's applications in investment and trading. Prepare a report about them.
5. Go to nd.com. Download the trial version of NeuroSolutions for Excel and experiment with it, using one of the data sets from the exercises in this chapter. Prepare a report about your experience with the tool.
6. Go to neoxi.com. Identify at least two software tools that have not been mentioned in this chapter. Visit the Web sites of those tools and prepare a brief report on the tools' capabilities.
7. Go to neuroshell.com. Look at the Gee Whiz examples. Comment on the feasibility of achieving the results claimed by the developers of this neural network model.
8. Go to easynn.com. Download the trial version of the software. After installing the software, find the sample file called Houseprices.tvq. Retrain the neural network and test the model by supplying some data. Prepare a report about your experience with this software.
9. Visit statsoft.com. Download at least three white papers on applications. Which of these applications may have used neural networks?
10. Go to neuralware.com. Prepare a report about the products the company offers.
End-of-Chapter Application Case

Coors Improves Beer Flavors with Neural Networks

Coors Brewers Ltd., based in Burton-upon-Trent, Britain's brewing capital, is proud of having the United Kingdom's top beer brands, a 20 percent share of the market, years of experience, and some of the best people in the business. Popular brands include Carling (the country's bestselling lager), Grolsch, Coors Fine Light Beer, Sol, and Korenwolf.

Problem

Today's customer has a wide variety of options regarding what he or she drinks. A drinker's choice depends on various factors, including mood, venue, and occasion. Coors' goal is to ensure that the customer chooses a Coors brand no matter what the circumstances are.

According to Coors, creativity is the key to long-term success. To be the customer's brand of choice, Coors needs to be creative and anticipate the customer's rapidly changing moods. An important issue with beers is flavor; each beer has a distinctive flavor. These flavors are mostly determined through panel tests. However, such tests take time. If Coors could understand beer flavor based solely on its chemical composition, it would open up new avenues for creating beer that suits customer expectations.

The relationship between chemical analysis and beer flavor is not yet clearly understood. Substantial data exist
on the chemical composition of a beer and its sensory analysis. Coors needed a mechanism to link the two together. Neural networks were applied to create the link between chemical composition and sensory analysis.
Solution

Over the years, Coors Brewers Ltd. has accumulated a significant amount of data related to the final product analysis, which has been supplemented by sensory data provided by the trained in-house testing panel. Some of the analytical inputs and sensory outputs are shown in the following table:

Analytical Data: Inputs    Sensory Data: Outputs
Alcohol                    Alcohol
Color                      Estery
Calculated bitterness      Malty
Ethyl acetate              Grainy
Isobutyl acetate           Burnt
Ethyl butyrate             Hoppy
Isoamyl acetate            Toffee
Ethyl hexanoate            Sweet
A single neural network, restricted to a single quality and flavor, was first used to model the relationship between the analytical and sensory data. The neural network was based on a package solution supplied by NeuroDimension, Inc. (nd.com). The neural network consisted of an MLP architecture with two hidden layers. Data were normalized within the network, thereby enabling comparison between the results for the various sensory outputs. The neural network was trained (to learn the relationship between the inputs and outputs) through the presentation of many relevant input/output combinations. When there was no observed improvement in the network error over the last 100 epochs, training was automatically terminated. Training was carried out 50 times to ensure that a meaningful mean network error could be calculated for comparison purposes. Prior to each training run, a different training and cross-validation data set was presented by randomizing the source data records, thereby removing any bias.
This technique produced poor results, due to two major factors. First, concentrating on a single product's quality meant that the variation in the data was quite low. The neural network could not extract useful relationships from the data. Second, it was probable that only a subset of the provided inputs would have an impact on the selected beer flavor. Performance of the neural network was affected by "noise" created by inputs that had no impact on flavor.

A more diverse product range was included in the training range to address the first factor. Identifying the most important analytical inputs was more challenging. This challenge was addressed by using a software switch that enabled the neural network to be trained on all possible combinations of inputs. The switch was not used to disable a significant input; if a significant input were disabled, we could expect the network error to increase. If the disabled input was insignificant, then the network error would either remain unchanged or be reduced due to the removal of noise. This approach is called an exhaustive search because all possible combinations are evaluated. The technique, although conceptually simple, was computationally impractical with the numerous inputs; the number of possible combinations was 16.7 million per flavor.

A more efficient method of searching for the relevant inputs was required. A genetic algorithm was the solution to the problem. A genetic algorithm was able to manipulate the different input switches in response to the error term from the neural network. The objective of the genetic algorithm was to minimize the network error term. When this minimum was reached, the switch settings would identify the analytical inputs that were most likely to predict the flavor.
Results
After determining what inputs were relevant, it was possible to identify which flavors could be predicted more skillfully. The network was trained multiple times using the relevant inputs previously identified. Before each training run, the network data were randomized to ensure that a different training and cross-validation data set was used. Network error was recorded after each training run. The testing set used for assessing the performance of the trained network contained approximately 80 records out of the sample data. The neural network accurately predicted a few flavors by using the chemical inputs. For example, "burnt" flavor was predicted with a correlation coefficient of 0.87.
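As a minimal illustration of how that skill figure is computed (the scores below are made up; the case's actual panel data are not given), the correlation coefficient relates the taste panel's scores to the network's predictions on the held-out records:

import numpy as np

panel_scores = np.array([2.1, 0.4, 3.8, 1.2, 2.9, 0.8])  # illustrative only
predictions  = np.array([1.9, 0.7, 3.5, 1.5, 2.6, 1.1])  # illustrative only

r = np.corrcoef(panel_scores, predictions)[0, 1]
print(f"correlation coefficient: {r:.2f}")  # 0.87 was reported for "burnt"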
Today, a limited number of flavors are being predicted by using the analytical data. Sensory response is extremely complex, with many potential interactions and hugely variable sensitivity thresholds. Standard instrumental analysis tends to be of gross parameters, and for practical and economical reasons, many flavor-active compounds are simply not measured. The relationship of flavor and analysis can be effectively modeled only if a large number of flavor-contributory analytes are considered. What is more, in addition to the obvious flavor-active materials, mouth-feel and physical contributors should also be considered in the overall sensory profile. With further development of the input parameters, the accuracy of the neural network models will improve.
QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE
1. Why is beer flavor important to Coors' profitability?
2. What is the objective of the neural network used at Coors?
3. Why were the results of Coors' neural network initially poor, and what was done to improve the results?
4. What benefits might Coors derive if this project is successful?
5. What modifications would you make to improve the results of beer flavor prediction?
Sources: Compiled from C. I. Wilson and L. Threapleton, "Application of Artificial Intelligence for Predicting Beer Flavours from Chemical Analysis," Proceedings of the 29th European Brewery Congress, Dublin, Ireland, May 17-22, 2003, neurosolutions.com/resources/apps/beer.html (accessed February 2013); and R. Nischwitz, M. Goldsmith, M. Lees, P. Rogers, and L. Macleod, "Developing Functional Malt Specifications for Improved Brewing Performance," The Regional Institute Ltd., regional.org.au/au/abts/1999/nischwitz.htm (accessed February 2013).
References
Ainscough, T. L., and J. E. Aronson. (1999). "A Neural Networks Approach for the Analysis of Scanner Data." Journal of Retailing and Consumer Services, Vol. 6.
Aizerman, M., E. Braverman, and L. Rozonoer. (1964). "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning." Automation and Remote Control, Vol. 25, pp. 821-837.
Altman, E. I. (1968). "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy." Journal of Finance, Vol. 23.
California Scientific. "Maximize Returns on Direct Mail with BrainMaker Neural Networks Software." calsci.com/DirectMail.html (accessed August 2009).
Collard, J. E. (1990). "Commodity Trading with a Neural Net." Neural Network News, Vol. 2, No. 10.
Collins, E., S. Ghosh, and C. L. Scofield. (1988). "An Application of a Multiple Neural Network Learning System to Emulation of Mortgage Underwriting Judgments." IEEE International Conference on Neural Networks, Vol. 2, pp. 459-466.
Das, R., I. Turkoglu, and A. Sengur. (2009). "Effective Diagnosis of Heart Disease Through Neural Networks Ensembles." Expert Systems with Applications, Vol. 36, pp. 7675-7680.
Davis, J. T., A. Episcopos, and S. Wettimuny. (2001). "Predicting Direction Shifts on Canadian-U.S. Exchange Rates with Artificial Neural Networks." International Journal of Intelligent Systems in Accounting, Finance and Management, Vol. 10, No. 2.
Delen, D., and E. Sirakaya. (2006). "Determining the Efficacy of Data-Mining Methods in Predicting Gaming Ballot Outcomes." Journal of Hospitality & Tourism Research, Vol. 30, No. 3, pp. 313-332.
Delen, D., R. Sharda, and M. Bessonov. (2006). "Identifying Significant Predictors of Injury Severity in Traffic Accidents Using a Series of Artificial Neural Networks." Accident Analysis and Prevention, Vol. 38, No. 3, pp. 434-444.
Dutta, S., and S. Shakhar. (1988, July 24-27). "Bond-Rating: A Non-Conservative Application of Neural Networks." Proceedings of the IEEE International Conference on Neural Networks, San Diego, CA.
Estevez, P. A., M. H. Claudio, and C. A. Perez. "Prevention in Telecommunications Using Fuzzy Rules and Neural Networks." cec.uchile.cl/~pestevez/RIO (accessed May 2009).
Fadlalla, A., and C. Lin. (2001). "An Analysis of the Applications of Neural Networks in Finance." Interfaces, Vol. 31, No. 4.
Fishman, M., D. Barr, and W. Loick. (1991, April). "Using Neural Networks in Market Analysis." Technical Analysis of Stocks and Commodities.
Fozzard, R., G. Bradshaw, and L. Ceci. (1989). "A Connectionist Expert System for Solar Flare Forecasting." In D. S. Touretsky (ed.), Advances in Neural Information Processing Systems, Vol. 1. San Mateo, CA: Kaufman.
Francett, B. (1989, January). "Neural Nets Arrive." Computer Decisions.
Gallant, S. (1988, February). "Connectionist Expert Systems." Communications of the ACM, Vol. 31, No. 2.
Güler, I., Z. Gökçil, and E. Gülbandilar. (2009). "Evaluating Traumatic Brain Injuries Using Artificial Neural Networks." Expert Systems with Applications, Vol. 36, pp. 10424-10427.
Haykin, S. S. (2009). Neural Networks and Learning Machines, 3rd ed. Upper Saddle River, NJ: Prentice Hall.
Hill, T., T. Marquez, M. O'Connor, and M. Remus. (1994). "Neural Network Models for Forecasting and Decision Making." International Journal of Forecasting, Vol. 10.
Hopfield, J. (1982, April). "Neural Networks and Physical Systems with Emergent Collective Computational Abilities." Proceedings of National Academy of Science, Vol. 79, No. 8.
Hopfield, J. J., and D. W. Tank. (1985). "Neural Computation of Decisions in Optimization Problems." Biological Cybernetics, Vol. 52.
Iyer, S. R., and R. Sharda. (2009). "Prediction of Athletes' Performance Using Neural Networks: An Application in Cricket Team Selection." Expert Systems with Applications, Vol. 36, No. 3, pp. 5510-5522.
Kamijo, K., and T. Tanigawa. (1990, June 7-11). "Stock Price Pattern Recognition: A Recurrent Neural Network Approach." International Joint Conference on Neural Networks, San Diego.
Lee, P. Y., S. C. Hui, and A. C. M. Fong. (2002, September/October). "Neural Networks for Web Content Filtering." IEEE Intelligent Systems.
Liang, T. P. (1992). "A Composite Approach to Automated Knowledge Acquisition." Management Science, Vol. 38, No. 1.
Loeffelholz, B., E. Bednar, and K. W. Bauer. (2009). "Predicting NBA Games Using Neural Networks." Journal of Quantitative Analysis in Sports, Vol. 5, No. 1.
McCulloch, W. S., and W. H. Pitts. (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity." Bulletin of Mathematical Biophysics, Vol. 5.
Medsker, L., and J. Liebowitz. (1994). Design and Development of Expert Systems and Neural Networks. New York: Macmillan, p. 163.
Mighell, D. (1989). "Back-Propagation and Its Application to Handwritten Signature Verification." In D. S. Touretsky (ed.), Advances in Neural Information Processing Systems. San Mateo, CA: Kaufman.
Minsky, M., and S. Papert. (1969). Perceptrons. Cambridge, MA: MIT Press.
Neural Technologies. "Combating Fraud: How a Leading Telecom Company Solved a Growing Problem." neuralt.com/iqs/dlsfa.list/dlcpti.7/downloads.html (accessed March 2009).
Nischwitz, R., M. Goldsmith, M. Lees, P. Rogers, and L. Macleod. "Developing Functional Malt Specifications for Improved Brewing Performance." The Regional Institute Ltd., regional.org.au/au/abts/1999/nischwitz.htm (accessed May 2009).
Olson, D. L., D. Delen, and Y. Meng. (2012). "Comparative Analysis of Data Mining Models for Bankruptcy Prediction." Decision Support Systems, Vol. 52, No. 2, pp. 464-473.
Piatetsky-Shapiro, G. "ISR: Microsoft Success Using Neural Network for Direct Marketing." kdnuggets.com/news/94/n9.txt (accessed May 2009).
Principe, J. C., N. R. Euliano, and W. C. Lefebvre. (2000). Neural and Adaptive Systems: Fundamentals Through Simulations. New York: Wiley.
Rochester, J. (ed.). (1990, February). "New Business Uses for Neurocomputing." I/S Analyzer.
Sirakaya, E., D. Delen, and H-S. Choi. (2005). "Forecasting Gaming Referenda." Annals of Tourism Research, Vol. 32, No. 1, pp. 127-149.
Sordo, M., H. Buxton, and D. Watson. (2001). "A Hybrid Approach to Breast Cancer Diagnosis." In L. Jain and P. DeWilde (eds.), Practical Applications of Computational Intelligence Techniques, Vol. 16. Norwell, MA: Kluwer.
Surkan, A., and J. Singleton. (1990). "Neural Networks for Bond Rating Improved by Multiple Hidden Layers." Proceedings of the IEEE International Conference on Neural Networks, Vol. 2.
Tang, Z., C. de Almieda, and P. Fishwick. (1991). "Time-Series Forecasting Using Neural Networks vs. Box-Jenkins Methodology." Simulation, Vol. 57, No. 5.
Thaler, S. L. (2002, January/February). "AI for Network Protection: LITMUS: Live Intrusion Tracking via Multiple Unsupervised STANNOs." PC AI.
Walczak, S., W. E. Pofahl, and R. J. Scorpio. (2002). "A Decision Support Tool for Allocating Hospital Bed Resources and Determining Required Acuity of Care." Decision Support Systems, Vol. 34, No. 4.
Wallace, M. P. (2008, July). "Neural Networks and Their Applications in Finance." Business Intelligence Journal, pp. 67-76.
Wen, U-P., K-M. Lan, and H-S. Shih. (2009). "A Review of Hopfield Neural Networks for Solving Mathematical Programming Problems." European Journal of Operational Research, Vol. 198, pp. 675-687.
Wilson, C. I., and L. Threapleton. (2003, May 17-22). "Application of Artificial Intelligence for Predicting Beer Flavours from Chemical Analysis." Proceedings of the 29th European Brewery Congress, Dublin, Ireland. neurosolutions.com/resources/apps/beer.html (accessed May 2009).
Wilson, R., and R. Sharda. (1994). "Bankruptcy Prediction Using Neural Networks." Decision Support Systems, Vol. 11.
Zahedi, F. (1993). Intelligent Systems for Business: Expert Systems with Neural Networks. Belmont, CA: Wadsworth.
CHAPTER 7
Text Analytics, Text Mining, and Sentiment Analysis
LEARNING OBJECTIVES
• Describe text mining and understand the
need for text mining
• Differentiate among text analytics, text
mining, and data mining
• Understand the different application
areas for text mining
• Know the process for carrying out a text
mining project
• Appreciate the different methods to
introduce structure to text-based data
• Describe sentiment analysis
• Develop familiarity with popular
applications of sentiment analysis
• Learn the common methods for
sentiment analysis
• Become familiar with speech analytics
as it relates to sentiment analysis
This chapter provides a rather comprehensive overview of text mining and one of its most popular applications, sentiment analysis, as they both relate to business analytics and decision support systems. Generally speaking, sentiment analysis is a derivative of text mining, and text mining is essentially a derivative of data mining. Because textual data is increasing in volume faster than the data in structured databases, it is important to know some of the techniques used to extract actionable information from this large quantity of unstructured data.
7.1 Opening Vignette: Machine Versus Men on Jeopardy!: The Story of Watson
7.2 Text Analytics and Text Mining Concepts and Definitions
7.3 Natural Language Processing
7.4 Text Mining Applications
7.5 Text Mining Process
7.6 Text Mining Tools
7.7 Sentiment Analysis Overview
7.8 Sentiment Analysis Applications
7.9 Sentiment Analysis Process
7.10 Sentiment Analysis and Speech Analytics
7.1 OPENING VIGNETTE: Machine Versus Men on Jeopardy!:
The Story of Watson
Can a machine beat the best of man at what man is supposed to be the best at? Evidently, yes, and the machine's name is Watson. Watson is an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. It was developed in 2010 by an IBM Research team as part of the DeepQA project and was named after IBM's first president, Thomas J. Watson.
BACKGROUND
Roughly 3 years ago, IBM Research was looking for a major research challenge to rival the scientific and popular interest of Deep Blue, the computer chess-playing champion, which would also have clear relevance to IBM business interests. The goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. Accordingly, IBM Research undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show, Jeopardy! The extent of the challenge included fielding a real-time automatic contestant on the show, capable of listening, understanding, and responding, not merely a laboratory exercise.
COMPETING AGAINST THE BEST
In 2011, as a test of its abilities, Watson competed on the quiz show Jeopardy!, which was the first ever human-versus-machine matchup for the show. In a two-game, combined-point match (broadcast in three Jeopardy! episodes during February 14-16), Watson beat Brad Rutter, the biggest all-time money winner on Jeopardy!, and Ken Jennings, the record holder for the longest championship streak (75 days). In these episodes, Watson consistently outperformed its human opponents on the game's signaling device, but had trouble responding to a few categories, notably those having short clues containing only a few words. Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage. During the game Watson was not connected to the Internet.
Meeting the Jeopardy! challenge required advancing and incorporating a variety of QA technologies (text mining and natural language processing) including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning. Winning at Jeopardy! required accurately computing confidence in its answers. The questions and content are ambiguous and noisy, and none of the individual algorithms is perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy! parlance, this confidence is used to determine whether the computer will "ring in" or "buzz in" for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between 1 and 6 seconds, with an average around 3 seconds.
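A minimal Python sketch of this buzz-in logic, not IBM's actual DeepQA code, might look as follows; the component confidences, weights, and threshold are assumed values standing in for DeepQA's learned confidence model.

def combine(confidences, weights):
    # A weighted average stands in for the learned confidence model.
    total = sum(w * c for w, c in zip(weights, confidences))
    return total / sum(weights)

component_confidences = [0.92, 0.71, 0.85]   # hypothetical scorer outputs
learned_weights       = [0.5, 0.2, 0.3]      # assumed, learned offline

overall = combine(component_confidences, learned_weights)
BUZZ_THRESHOLD = 0.80                        # assumed risk threshold

if overall >= BUZZ_THRESHOLD:
    print(f"Buzz in (confidence {overall:.2f})")
else:
    print(f"Stay silent (confidence {overall:.2f})")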
HOW DOES WATSON DO IT?
The system behind Watson, which is called DeepQA, is a massively parallel, text mining-focused, probabilistic evidence-based computational architecture. For the Jeopardy! challenge, Watson used more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. Far more important than any particular technique is how they were combined in DeepQA so that overlapping approaches can bring their strengths to bear and contribute to improvements in accuracy, confidence, and speed.
DeepQA is an architecture with an accompanying methodology, which is not specific to the Jeopardy! challenge. The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of the latest and greatest in text analytics.
• Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.
• Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.
• Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.
• Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.
Figure 7.1 illustrates the DeepQA architecture at a very high level. More technical details about the various architectural components and their specific roles and capabilities can be found in Ferrucci et al. (2010).
FIGURE 7.1 A High-Level Depiction of DeepQA Architecture. [Figure: a pipeline from question analysis and query decomposition, through primary search over answer sources and candidate answer generation, to hypothesis generation, soft filtering, support evidence retrieval and deep evidence scoring over evidence sources, hypothesis and evidence scoring with trained models, and synthesis of the final answer.]
CONCLUSION
The Jeopardy! challenge helped IBM address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy! quiz show.
IBM claims to have developed many computational and linguistic algorithms to address different kinds of issues and requirements in QA. Even though the internals of these algorithms are not known, it is evident that they made the most of text analytics and text mining. Now IBM is working on a version of Watson to take on surmountable problems in healthcare and medicine (Feldman et al., 2012).
QUESTIONS FOR THE OPENING VIGNETTE
1. What is Watson? What is special about it?
2. What technologies were used in building Watson (both hardware and software)?
3. What are the innovative characteristics of DeepQA architecture that made Watson
superior?
4. Why did IBM spend all that time and money to build Watson? Where is the ROI?
5. Conduct an Internet search to identify other previously developed "smart machines"
(by IBM or others) that compete against the best of man. What technologies did
they use?
WHAT WE CAN LEARN FROM THIS VIGNETTE
It is safe to say that computer technology, on both the hardware and software fronts, is
advancing faster than anything else in the last 50-plus years. Things that were too big, too
complex, impossible to solve are now well within the reach of information technology. One
of those enabling technologies is perhaps text analytics/ text mining. We created databases
to structure the data so that it can be processed by computers. Text, on the other hand, has
always been meant for humans to process. Can machines do the things that require human
creativity and intelligence, and which were not originally designed for machines? Evidently ,
yes! Watson is a great example of the distance that we have traveled in addressing the impos-
sible. Computers are now inte lligent enough to take on men at what we think men are the
best at. Understanding the question that was posed in spoken human language, processing
and digesting it, searching for an answer, and replying within a few seconds was something
that we could not have imagined possible before Watson actually did it. In this chapter, you
will learn the tools and techniques embedded in Watson and many other smart machines to
create miracles in tackling problems that were once believed impossible to solve.
Sources: D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, "Building Watson: An Overview of the DeepQA Project," AI Magazine, Vol. 31, No. 3, 2010; DeepQA, DeepQA Project: FAQ, IBM Corporation, 2011, research.ibm.com/deepqa/faq.shtml (accessed January 2013); and S. Feldman, J. Hanover, C. Burghard, and D. Schubmehl, "Unlocking the Power of Unstructured Data," IBM white paper, 2012, www-01.ibm.com/software/ebusiness/jstart/downloads/unlockingUnstructuredData (accessed February 2013).
7.2 TEXT ANALYTICS AND TEXT MINING CONCEPTS AND DEFINITIONS
The information age that we are living in is characterized by the rapid growth in the amount of data and information collected, stored, and made available in electronic format. The vast majority of business data is stored in text documents that are virtually unstructured. According to a study by Merrill Lynch and Gartner, 85 percent of all corporate data is captured and stored in some sort of unstructured form (McKnight, 2005). The same study also stated that this unstructured data is doubling in size every 18 months. Because knowledge is power in today's business world, and knowledge is derived from data and information, businesses that effectively and efficiently tap into their text data sources will have the necessary knowledge to make better decisions, leading to a competitive advantage over those businesses that lag behind. This is where the need for text analytics and text mining fits into the big picture of today's businesses.
Even though the overarching goal for both text analytics and text mining is to turn unstructured textual data into actionable information through the application of natural language processing (NLP) and analytics, their definitions are somewhat different, at least to some experts in the field. According to them, text analytics is a broader concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining, whereas text mining is primarily focused on discovering new and useful knowledge from the textual data sources. Figure 7.2 illustrates the relationships between text analytics and text mining along with other related application areas. The bottom of Figure 7.2 lists the main disciplines (the foundation of the house) that play a critical role in the development of these increasingly more popular application areas. Based on this definition of text analytics and text mining, one could simply formulate the difference between the two as follows:
Text Analytics = Information Retrieval + Information Extraction + Data Mining + Web Mining

or simply

Text Analytics = Information Retrieval + Text Mining
FIGURE 7.2 Text Analytics, Related Application Areas, and Enabling Disciplines. [Figure: text analytics depicted as an umbrella over text mining, Web mining, information retrieval/extraction, and data mining, resting on the foundation disciplines described in the text.]
Compared to text mining, text analytics is a relatively new term. With the recent emphasis on analytics, as has been the case in many other related technical application areas (e.g., consumer analytics, competitive analytics, visual analytics, social analytics, and so forth), the text field has also wanted to get on the analytics bandwagon. While the term text analytics is more commonly used in a business application context, text mining is frequently used in academic research circles. Even though they may be defined somewhat differently at times, text analytics and text mining are usually used synonymously, and we (the authors of this book) concur with this.
Text mining (also known as text data mining or knowledge discovery in textual databases) is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources. Remember that data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases, where the data are organized in records structured by categorical, ordinal, or continuous variables. Text mining is the same as data mining in that it has the same purpose and uses the same processes, but with text mining the input to the process is a collection of unstructured (or less structured) data files such as Word documents, PDF files, text excerpts, XML files, and so on. In essence, text mining can be thought of as a process (with two main steps) that starts with imposing structure on the text-based data sources, followed by extracting relevant information and knowledge from this structured text-based data using data mining techniques and tools.
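A minimal Python sketch of these two steps, using scikit-learn as an assumed tool choice (the book does not prescribe one) and made-up documents, might look as follows:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The warranty claim cites a cracked housing.",
    "Customer praises the fast, friendly service.",
    "Housing cracked again after replacement.",
    "Service desk resolved my issue quickly.",
]

# Step 1: impose structure on the unstructured text (a TF-IDF matrix).
X = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Step 2: extract knowledge from the structured form with a data
# mining technique; here, clustering into natural groupings.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(documents, labels):
    print(label, doc)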
The benefits of text mining are obvious in the areas where very large amounts of textual data are being generated, such as law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), and marketing (customer comments). For example, the free-form text-based interactions with customers in the form of complaints (or praises) and warranty claims can be used to objectively identify product and service characteristics that are deemed to be less than perfect and can be used as input to better product development and service allocations. Likewise, market outreach programs and focus groups generate large amounts of data. By not restricting product or service feedback to a codified form, customers can present, in their own words, what they think about a company's products and services. Another area where the automated processing of unstructured text has had a lot of impact is in electronic communications and e-mail. Text mining not only can be used to classify and filter junk e-mail, but it can also be used to automatically prioritize e-mail based on importance level as well as generate automatic responses (Weng and Liu, 2004). The following are among the most popular application areas of text mining:
• Information extraction. Identification of key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching. Perhaps the most commonly used form of information extraction is named entity extraction. Named entity extraction includes named entity recognition (recognition of known entity names for people and organizations, place names, temporal expressions, and certain types of numerical expressions, using existing knowledge of the domain), co-reference resolution (detection of co-reference and anaphoric links between text entities), and relationship extraction (identification of relations between entities).
• Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
• Summarization. Summarizing a document to save time on the part of the reader.
• Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
• Clustering. Grouping similar documents without having a predefined set of categories.
• Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
• Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
See Technology Insights 7.1 for explanations of some of the terms and concepts used in text mining. Application Case 7.1 describes the use of text mining in patent analysis.
TECHNOLOGY INSIGHTS 7.1 Text Mining Lingo
The following list describes some commonly used text mining terms:
• Unstructured data (versus structured data). Structured data has a predetermined format. It is usually organized into records with simple data values (categorical, ordinal, and continuous variables) and stored in databases. In contrast, unstructured data does not have a predetermined format and is stored in the form of textual documents. In essence, the structured data is for the computers to process while the unstructured data is for humans to process and understand.
• Corpus. In linguistics, a corpus (plural corpora) is a large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.
• Terms. A term is a single word or multiword phrase extracted directly from the corpus of a specific domain by means of natural language processing (NLP) methods.
• Concepts. Concepts are features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology. Compared to terms, concepts are the result of higher level abstraction.
• Stemming. Stemming is the process of reducing inflected words to their stem (or base or root) form. For instance, stemmer, stemming, and stemmed are all based on the root stem.
• Stop words. Stop words (or noise words) are words that are filtered out prior to or after processing of natural language data (i.e., text). Even though there is no universally accepted list of stop words, most natural language processing tools use a list that includes articles (a, an, the, of, etc.), auxiliary verbs (is, are, was, were, etc.), and context-specific words that are deemed not to have differentiating value.
• Synonyms and polysemes. Synonyms are syntactically different words (i.e., spelled differently) with identical or at least similar meanings (e.g., movie, film, and motion picture). In contrast, polysemes, which are also called homonyms, are syntactically identical words (i.e., spelled exactly the same) with different meanings (e.g., bow can mean "to bend forward," "the front of the ship," "the weapon that shoots arrows," or "a kind of tied ribbon").
• Tokenizing. A token is a categorized block of text in a sentence. The block of text corresponding to the token is categorized according to the function it performs. This assignment of meaning to blocks of text is known as tokenizing. A token can look like anything; it just needs to be a useful part of the structured text.
• Term dictionary. A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.
• Word frequency. The number of times a word is found in a specific document.
• Part-of-speech tagging. The process of marking up the words in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, adverbs, etc.) based on a word's definition and the context in which it is used.
• Morphology. A branch of the field of linguistics and a part of natural language processing that studies the internal structure of words (patterns of word-formation within a language or across languages).
• Term-by-document matrix (occurrence matrix). A common representation schema of the frequency-based relationship between the terms and documents in tabular format where terms are listed in rows, documents are listed in columns, and the frequency between the terms and documents is listed in cells as integer values.
• Singular-value decomposition (latent semantic indexing). A dimensionality reduction method used to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies using a matrix manipulation method similar to principal component analysis. (A short sketch after this list illustrates several of these terms in code.)
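The following sketch ties several of these terms together using scikit-learn (an assumed tool choice): CountVectorizer tokenizes, removes English stop words, and builds the occurrence matrix, while TruncatedSVD performs the singular-value decomposition behind latent semantic indexing. Note that the vectorizer produces the document-by-term orientation, the transpose of the term-by-document matrix described above.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Stemming reduces inflected words to their stem.",
    "Stop words are filtered out before processing.",
    "The occurrence matrix relates terms to documents.",
]

# Tokenize, drop English stop words, and count term frequencies.
vectorizer = CountVectorizer(stop_words="english")
tdm = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # the extracted terms
print(tdm.toarray())                        # frequencies as integer values

# Reduce the matrix to a 2-dimensional "concept" space.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tdm)
print(reduced.shape)                        # (3 documents, 2 dimensions)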
Application Case 7.1
Text Mining for Patent Analysis
A patent is a set of exclusive rights granted by a
country to an inventor for a limited period of time
in exchange for a disclosure of an invention (note
that the procedure for granting patents, the require-
ments placed on the patentee, and the extent of the
exclusive rights vary widely from country to coun-
try). The disclosure of these inventions is critical to
future advancements in science and technology. If
carefully analyzed, patent documents can help iden-
tify emerging technologies, inspire novel solutions,
foster symbiotic partnerships, and enhance overall
awareness of business' capabilities and limitations.
Patent analysis is the use of analytical tech-
niques to extract valuable knowledge from patent
databases. Countries or groups of countries that
maintain patent databases (e.g., the United States, the
European Union, Japan) add tens of millions of new
patents each year. It is nearly impossible to efficiently
process such enormous amounts of semistructured
data (patent documents usually contain partially
structured and partially textual data). Patent analy-
sis with semiautomated software tools is one way to
ease the processing of these very large databases.
A Representative Example of Patent
Analysis
Eastman Kodak employs more than 5,000 scientists, engineers, and technicians around the world. During the twentieth century, these knowledge workers and their predecessors claimed nearly 20,000 patents, putting the company among the top 10 patent holders in the world. Being in the business of constant change, the company knows that success (or mere survival) depends on its ability to apply more than a century's worth of knowledge about imaging science and technology to new uses and to secure those new uses with patents.
Appreciating the value of patents, Kodak not only generates new patents but also analyzes those created by others. Using dedicated analysts and state-of-the-art software tools (including specialized text mining tools from ClearForest Corp.), Kodak continuously digs deep into various data sources (patent databases, new release archives, and product announcements) in order to develop a holistic view of the competitive landscape. Proper analysis of patents can bring companies like Kodak a wide range of benefits:
• It enables competitive intelligence. Knowing what competitors are doing can help a company to develop countermeasures.
• It can help the company make critical business decisions, such as what new products, product lines, and/or technologies to get into or what mergers and acquisitions to pursue.
• It can aid in identifying and recruiting the best and brightest new talent, those whose names appear on the patents that are critical to the company's success.
• It can help the company to identify the unauthorized use of its patents, enabling it to take action to protect its assets.
• It can identify complementary inventions to build symbiotic partnerships or to facilitate mergers and/or acquisitions.
• It prevents competitors from creating similar products and it can help protect the company from patent infringement lawsuits.
Using patent analysis as a rich source of
knowledge and a strategic weapon (both defensive
as well as offensive), Kodak not only survives but
excels in its market segment defined by innovation
and constant change.
(Continued)
Application Case 7.1 (Continued)
QUESTIONS FOR DISCUSSION
1. Why is it important for companies to keep up with patent filings?
2. How did Kodak use text analytics to better analyze patents?
3. What were the challenges, the proposed solution, and the obtained results?

Sources: P. X. Chiem, "Kodak Turns Knowledge Gained About Patents into Competitive Intelligence," Knowledge Management, 2001, pp. 11-12; Y-H. Tseng, C-J. Lin, and Y-I. Lin, "Text Mining Techniques for Patent Analysis," Information Processing & Management, Vol. 43, No. 5, 2007, pp. 1216-1247.

SECTION 7.2 REVIEW QUESTIONS
1. What is text analytics? How does it differ from text mining?
2. What is text mining? How does it differ from data mining?
3. Why is the popularity of text mining as an analytics tool increasing?
4. What are some of the most popular application areas of text mining?
7.3 NATURAL LANGUAGE PROCESSING
Some of the early text mining applications used a simplified representation called bag-of-words when introducing structure to a collection of text-based documents in order to classify them into two or more predetermined classes or to cluster them into natural groupings. In the bag-of-words model, text, such as a sentence, paragraph, or complete document, is represented as a collection of words, disregarding the grammar or the order in which the words appear. The bag-of-words model is still used in some simple document classification tools. For instance, in spam filtering an e-mail message can be modeled as an unordered collection of words (a bag-of-words) that is compared against two different predetermined bags. One bag is filled with words found in spam messages and the other is filled with words found in legitimate e-mails. Although some of the words are likely to be found in both bags, the "spam" bag will contain spam-related words such as stock, Viagra, and buy much more frequently than the legitimate bag, which will contain more words related to the user's friends or workplace. The level of match between a specific e-mail's bag-of-words and the two bags containing the descriptors determines the membership of the e-mail as either spam or legitimate.
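A minimal sketch of such a bag-of-words spam filter, with made-up training messages and scikit-learn as an assumed tool choice, might look as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "buy cheap stock now",          # spam
    "viagra discount buy today",    # spam
    "lunch with the project team",  # legitimate
    "meeting notes from work",      # legitimate
]
labels = ["spam", "spam", "legit", "legit"]

# Each message becomes an unordered collection of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Naive Bayes compares a new bag-of-words against the two "bags".
model = MultinomialNB().fit(X, labels)
test = vectorizer.transform(["cheap stock buy"])
print(model.predict(test))  # ['spam']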
Naturally, we (humans) do not use words without some order or structure. We use words in sentences, which have semantic as well as syntactic structure. Thus, automated techniques (such as text mining) need to look for ways to go beyond the bag-of-words interpretation and incorporate more and more semantic structure into their operations. The current trend in text mining is toward including many of the advanced features that can be obtained using natural language processing.
It has been shown that the bag-of-words method may not produce good enough information content for text mining tasks (e.g., classification, clustering, association). A good example of this can be found in evidence-based medicine. A critical component of evidence-based medicine is incorporating the best available research findings into the clinical decision-making process, which involves appraisal of the information collected from the printed media for validity and relevance. Several researchers from the University of Maryland developed evidence assessment models using a bag-of-words method (Lin and Demner, 2005). They employed popular machine-learning methods along with more than half a million research articles collected from MEDLINE (Medical Literature Analysis and Retrieval System Online). In their models, they represented each abstract as a bag-of-words, where each stemmed term represented a feature. Despite using popular classification methods with proven experimental design methodologies, their prediction results were not much better than simple guessing, which may indicate that the bag-of-words is not generating a good enough representation of the research articles in this domain; hence, more advanced techniques such as natural language processing are needed.
Natural language processing (NLP) is an important component of text mining and is a subfield of artificial intelligence and computational linguistics. It studies the problem of "understanding" the natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate. The goal of NLP is to move beyond syntax-driven text manipulation (which is often called "word counting") to a true understanding and processing of natural language that considers grammatical and semantic constraints as well as the context.
The definition and scope of the word "understanding" is one of the major discussion topics in NLP. Considering that the natural human language is vague and that a true understanding of meaning requires extensive knowledge of a topic (beyond what is in the words, sentences, and paragraphs), will computers ever be able to understand natural language the same way and with the same accuracy that humans do? Probably not! NLP has come a long way from the days of simple word counting, but it has an even longer way to go to really understanding natural human language. The following are just a few of the challenges commonly associated with the implementation of NLP (a short example after this list shows two of them in code):
• Part-of-speech tagging. It is difficult to mark up terms in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, adverbs, etc.) because the part of speech depends not only on the definition of the term but also on the context within which it is used.
• Text segmentation. Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries. In these instances, the text-parsing task requires the identification of word boundaries, which is often a difficult task. Similar challenges in speech segmentation emerge when analyzing spoken language, because sounds representing successive letters and words blend into each other.
• Word sense disambiguation. Many words have more than one meaning. Selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used.
• Syntactic ambiguity. The grammar for natural languages is ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information.
• Imperfect or irregular input. Foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task.
• Speech acts. A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action. For example, "Can you pass the class?" requests a simple yes/no answer, whereas "Can you pass the salt?" is a request for a physical action to be performed.
It is a longstanding dream of the artificial intelligence community to have algorithms that are capable of automatically reading and obtaining knowledge from text. By applying a learning algorithm to parsed text, researchers from Stanford University's NLP lab have developed methods that can automatically identify the concepts and relationships between those concepts in the text. By applying a unique procedure to large amounts of text, their algorithms automatically acquire hundreds of thousands of items of world knowledge and use them to produce significantly enhanced repositories for WordNet. WordNet is a laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets. It is a major resource for NLP applications, but it has proven to be very expensive to build and maintain manually. By automatically inducing knowledge into WordNet, the potential exists to make WordNet an even greater and more comprehensive resource for NLP at a fraction of the cost. One prominent area where the benefits of NLP and WordNet are already being harvested is in customer relationship management (CRM). Broadly speaking, the goal of CRM is to maximize customer value by better understanding and effectively responding to customers' actual and perceived needs. An important area of CRM, where NLP is making a significant impact, is sentiment analysis. Sentiment analysis is a technique used to detect favorable and unfavorable opinions toward specific products and services using large numbers of textual data sources (customer feedback in the form of Web postings). A detailed coverage of sentiment analysis and WordNet is given in Section 7.7.
Text mining is also used in assessing public complaints. Application Case 7.2 provides an example where text mining is used to anticipate and address public complaints in Hong Kong.
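A minimal sketch of querying WordNet through NLTK's interface (an assumed access path; it requires a one-time nltk.download("wordnet")) shows the synonym sets and semantic relations described above:

from nltk.corpus import wordnet as wn

# Each synset is a set of synonyms with a hand-coded definition;
# the polysemous word "bow" belongs to several of them.
for synset in wn.synsets("bow")[:4]:
    print(synset.name(), "-", synset.definition())

# Semantic relations between synonym sets:
movie = wn.synsets("movie")[0]
print(movie.lemma_names())   # e.g., ['movie', 'film', 'picture', ...]
print(movie.hypernyms())     # the more general concept it belongs to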
Application Case 7.2
Text Mining Improves Hong Kong Government's Ability to Anticipate
and Address Public Complaints
The 1823 Call Centre of the Hong Kong government's
Efficiency Unit acts as a single point of contact for
handling public inquiries and complaints on behalf
of many government departments. 1823 operates
round-the-clock, including during Sundays and pub-
lic holidays. Each year, it answers about 2.65 million
calls and 98,000 e-mails, including inquiries, suggestions, and complaints. "Having received so many calls and e-mails, we gather substantial volumes of data. The next step is to make sense of the data," says the Efficiency Unit's assistant director, W. F. Yuk.
"Now, with SAS text mining technologies, we can
obtain deep insights through uncovering the hidden
relationship between words and sentences of com-
plaints information, spot emerging trends and pub-
lic concerns, and produce high-quality complaints
intelligence for the departments we serve."
Building a "Complaints Intelligence
System"
The Efficiency Unit aims to be the preferred con-
sulting partner for a ll government bureaus and
departments and to advance the delivery of world-
class public services to the people of Hong Kong.
The Unit launched the 1823 Call Centre in 2001.
One of 1823's main functions is handling complaints; 10 percent of the calls received last year
were complaints. The Efficiency Unit recognized
that there are social messages hidden in the com-
plaints data, which provides important feedback on
public service and highlights opportunities for ser-
vice improvement. Rather than simply handling calls
and e-mails, the Unit seeks to use the complaints
information collected to gain a better understanding
of daily issues for the public.
"We previously compiled some reports on
complaint statistics for reference by government
departments, " says Yuk. "However, through 'eye-
ball' observations, it was absolutely impossible to
effectively reveal new or more complex potential
public issues and identify their root causes, as most
of the complaints were recorded in unstructured
textual format," says Yuk. Aiming to build a plat-
form, called the Complaints Intelligence System, the
Unit required a robust and powerful suite of text
processing and mining solutions that could uncover
the trends, patterns, and relationships inherent in
the complaints.
Uncovering Root Causes of Issues from
Unstructured Data
The Efficiency Unit chose to deploy SAS Text Miner,
which can access and analyze various text for-
mats, including e-mails received by the 1823 Call
Centre. "The solution consolidates all information
and uncovers hidden relationships through sta-
tistical modeling analyses," says Yuk. "It helps us
understand hidden social issues so that government
departments can discover them before they become
serious, and thus seize the opportunities for service improvement."
Equipped with text analytics, the departments
can better understand underlying issues and quickly
respond even as situations evolve. Senior manage-
ment can access accurate, up-to-date information
from the Complaints Intelligence System.
Performance Reports at Fingertips
With the platform for SAS Business Analytics in place,
the Efficiency Unit gets a boost from the system's
ability to instantly generate reports. For instance, it
previously took a week to compile reports on key
performance indicators such as abandoned call rate,
customer satisfaction rate, and first-time resolution
rate. Now, these reports can be created at the click
of a mouse through performance dashboards, as
all complaints information is consolidated into the
Complaints Intelligence System. This enables effective monitoring of the 1823 Call Centre's operations and service quality.
Strong Language Capabilities, Customized
Services
Of particular importance in Hong Kong, SAS Text Miner has strong language capabilities (supporting English and traditional and simplified Chinese) and can perform automated spelling correction.
The solution is also aided by the SAS capability of
developing customized lists of synonyms such as
the full and short forms of different government
departments and to parse Chinese text for similar or
identical terms whose meanings and connotations
change, often dramatically, depending on the con-
text in which they are used. "Also, throughout this
4-month project, SAS has proved to be our trusted
partner," said Yuk. "We are satisfied with the com-
prehensive support provided by the SAS Hong Kong
tea1n."
Informed Decisions Develop Smart
Strategies
"Using SAS Text Miner, 1823 can quickly discover
the correlations among some key words in the
complaints," says Yuk. "For instance, we can spot
districts with frequent complaints received concern-
ing public health issues such as dead birds found
in residential areas. We can then inform relevant
government departments and property manage-
ment companies, so that they can allocate adequate
resources to step up cleaning work to avoid spread
of potential pandemics.
"The public's views are of course extremely
important to the government. By decoding the
'messages' through statistical and root-cause analy-
ses of complaints data , the government can better
understand the voice of the people, and help gov-
ernment departments improve service delivery,
make informed decisions, and develop smart strat-
egies. This in turn helps boost public satisfaction
with the government, and build a quality city,"
said W. F. Yuk, Assistant Director, Hong Kong
Efficiency Unit.
QUESTIONS FOR DISCUSSION
1. How did the Hong Kong government use text
mining to better serve its constituents?
2. What were the challenges, the proposed solution, and the obtained results?

Sources: SAS Institute, Customer Success Story, sas.com/success/pdf/hongkongeu (accessed February 2013); and enterpriseinnovation.net/whitepaper/text-mining-improves-hong-kong-governments-ability-anticipate-and-address-public.
NLP has successfully been applied to a variety of domains to carry out, via computer programs, tasks of automatically processing natural human language that previously could be done only by humans. The following are among the most popular of these tasks:
• Question answering. The task of automatically answering a question posed in natural language; that is, producing a human-language answer when given a human-language question. To find the answer to a question, the computer program may use either a prestructured database or a collection of natural language documents (a text corpus such as the World Wide Web).
• Automatic summarization. The creation of a shortened version of a textual document by a computer program that contains the most important points of the original document.
• Natural language generation. Systems convert information from computer databases into readable human language.
• Natural language understanding. Systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
• Machine translation. The automatic translation of one human language to another.
• Foreign language reading. A computer program that assists a nonnative language speaker to read a foreign language with correct pronunciation and accents on different parts of the words.
• Foreign language writing. A computer program that assists a nonnative language user in writing in a foreign language.
• Speech recognition. Converts spoken words to machine-readable input. Given a sound clip of a person speaking, the system produces a text dictation.
• Text-to-speech. Also called speech synthesis, a computer program automatically converts normal language text into human speech.
• Text proofing. A computer program reads a proof copy of a text in order to detect and correct any errors.
• Optical character recognition. The automatic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable textual documents.
The success and popularity of text mining depend greatly on advancements in NLP in both generation and understanding of human languages. NLP enables the extraction of features from unstructured text so that a wide variety of data mining techniques can be used to extract knowledge (novel and useful patterns and relationships) from it. In that sense, simply put, text mining is a combination of NLP and data mining.
SECTION 7 .3 REVIEW QUESTIONS
1. What is natural language processing?
2. How does NLP relate to text mining?
3. What are some of the benefits and challenges of NLP?
4. What are the most common tasks addressed by NLP?
7.4 TEXT MINING APPLICATIONS
As the amount of unstructured data collected by organizations increases, so does the value proposition and popularity of text mining tools. Many organizations are now realizing the importance of extracting knowledge from their document-based data repositories through the use of text mining tools. The following is only a small subset of the exemplary application categories of text mining.
Marketing Applications
Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Text generated by call center notes as well as transcriptions of voice conversations with customers can be analyzed by text mining algorithms to extract novel, actionable information about customers' perceptions toward a company's products and services. Additionally, blogs, user reviews of products at independent Web sites, and discussion board postings are a gold mine of customer sentiments. This rich collection of information, once properly analyzed, can be used to increase satisfaction and the overall lifetime value of the customer (Coussement and Van den Poel, 2008).
Text mining has become invaluable for customer relationship management. Companies can use text mining to analyze rich sets of unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior. Coussement and Van den Poel (2009) successfully applied text mining to significantly improve the ability of a model to predict customer churn (i.e., customer attrition) so that those customers identified as most likely to leave a company are accurately identified for retention tactics.
Ghani et al. (2006) used text mining to develop a system capable of inferring implicit and explicit attributes of products to enhance retailers' ability to analyze product databases. Treating products as sets of attribute-value pairs rather than as atomic entities can potentially boost the effectiveness of many business applications, including demand forecasting, assortment optimization, product recommendations, assortment comparison across retailers and manufacturers, and product supplier selection. The proposed system allows a business to represent its products in terms of attributes and attribute values without much manual effort. The system learns these attributes by applying supervised and semi-supervised learning techniques to product descriptions found on retailers' Web sites.
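A minimal sketch of the churn idea described above, with entirely made-up data and scikit-learn as an assumed tool choice, combines text-derived features with a structured attribute to fit a simple model:

import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "customer angry about repeated billing errors",
    "happy with new plan, asked about upgrades",
    "threatened to cancel service over outage",
    "routine address change, no complaints",
]
tenure_months = np.array([[3], [40], [6], [28]])   # structured feature
churned = [1, 0, 1, 0]                             # made-up labels

text_features = TfidfVectorizer().fit_transform(notes)
X = hstack([text_features, tenure_months])         # text + structured data
model = LogisticRegression().fit(X, churned)
print(model.predict(X))                            # in-sample sanity check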
Security Applications
One of the largest and most prominent text mining applications in the security domain is
probably the highly classified ECHELON surveillance system. As rumor has it, ECHELON
is assumed to be capable of identifying the content of telephone calls, faxes , e-mails, and
other types of data and intercepting information sent via satellites, public switched tele-
phone networks, and microwave links.
In 2007, EUROPOL developed an integrated system capable of accessing, storing,
and analyzing vast amounts of structured and unstructured data sources in order to track
transnational organized crime. Called the Overall Analysis System for Intelligence Support
(OASIS), this system aims to integrate the most advanced data and text mining tech-
nologies available in today's market. The system has enabled EUROPOL to make sig-
nificant progress in supporting its law enforcement objectives at the international level
(EUROPOL, 2007).
The U.S. Federal Bureau of Investigation (FBI) and the Central Intelligence Agency (CIA), under the direction of the Department of Homeland Security, are jointly develop-
ing a supercomputer data and text mining system. The system is expected to create a
gigantic data warehouse along w ith a variety of data and text mining modules to meet
the knowledge-discovery needs of federal, state, and local law enforcement agencies.
Prior to this project, the FBI and CIA each had its own separate databases, with little or
no interconnection.
Another security-related application of text mining is in the area of deception
detection. Applying text mining to a large set of real-world criminal (person-of-interest)
statements, Fuller et al. (2008) developed prediction models to differentiate deceptive
statements from truthful ones. Using a rich set of cues extracted from the textual state-
ments, the model predicted the holdout samples with 70 percent accuracy, which is
believed to be a significant success considering that the cues are extracted only from textual statements (no verbal or visual cues are present). Furthermore, compared to other deception-detection techniques, such as polygraph, this method is nonintrusive and widely applicable to not only textual data, but also (potentially) to transcriptions of voice recordings. A more detailed description of text-based deception detection is provided in Application Case 7.3.
Application Case 7.3
Mining for Lies
Driven by advancements in Web-based information technologies and increasing globalization, computer-mediated communication continues to filter into everyday life, bringing with it new venues for deception. The volume of text-based chat, instant messaging, text messaging, and text generated by online communities of practice is increasing rapidly. Even e-mail continues to grow in use. With the massive growth of text-based communication, the potential for people to deceive others through computer-mediated communication has also grown, and such deception can have disastrous results.
Unfortunately, in general, humans tend to perform poorly at deception-detection tasks. This phenomenon is exacerbated in text-based communications. A large part of the research on deception detection (also known as credibility assessment) has involved face-to-face meetings and interviews. Yet, with the growth of text-based communication, text-based deception-detection techniques are essential.
Techniques for successfully detecting deception (that is, lies) have wide applicability. Law enforcement can use decision support tools and techniques to investigate crimes, conduct security screening in airports, and monitor communications of suspected terrorists. Human resources professionals might use deception-detection tools to screen applicants. These tools and techniques also have the potential to screen e-mails to uncover fraud or other wrongdoings committed by corporate officers. Although some people believe that they can readily identify those who are not being truthful, a summary of deception research showed that, on average, people are only 54 percent accurate in making veracity determinations (Bond and DePaulo, 2006). This figure may actually be worse when humans try to detect deception in text.
Using a combination of text mining and data mining techniques, Fuller et al. (2008) analyzed person-of-interest statements completed by people involved in crimes on military bases. In these statements, suspects and witnesses are required to write their recollection of the event in their own words. Military law enforcement personnel searched archival data for statements that they could conclusively identify as being truthful or deceptive. These decisions were made on the basis of corroborating evidence and case resolution. Once the statements were labeled as truthful or deceptive, the law enforcement personnel removed identifying information and gave the statements to the research team. In total, 371 usable statements were received for analysis. The text-based deception-detection method used by Fuller et al. (2008) was based on a process known as message feature mining, which relies on elements of data and text mining techniques. A simplified depiction of the process is provided in Figure 7.3.
First, the researchers prepared the data for processing. The original handwritten statements had to be transcribed into a word processing file. Second, features (i.e., cues) were identified. The researchers identified 31 features representing categories or types of language that are relatively independent of the text content and that can be readily analyzed by automated means. For example, first-person pronouns such as I or me can be identified without analysis of the surrounding text. Table 7.1 lists the categories and an example list of features used in this study.
The features were extracted from the textual statements and input into a flat file for further processing. Using several feature-selection methods along with 10-fold cross-validation, the researchers compared the prediction accuracy of three popular data mining methods. Their results indicated that neural network models performed the best, with 73.46 percent prediction accuracy on test data samples; decision trees performed second best, with 71.60 percent accuracy; and logistic regression was last, with 67.28 percent accuracy.
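The study's exact feature set and modeling tools are not published as code, but the general workflow (a matrix of quantified linguistic cues in, cross-validated classifier accuracies out) is easy to sketch. Below is a minimal illustration in Python with scikit-learn; the library choice and the synthetic data are assumptions for illustration, not part of the original study.

```python
# A minimal sketch of a Fuller et al. (2008)-style model comparison.
# X holds quantified linguistic cues per statement (synthetic here);
# y holds labels (1 = deceptive, 0 = truthful), also synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((371, 31))          # 371 statements x 31 cue features
y = rng.integers(0, 2, size=371)   # truthful/deceptive labels

models = {
    "neural network": make_pipeline(StandardScaler(),
                                    MLPClassifier(max_iter=1000, random_state=0)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.2%} mean accuracy")
```

With real cue features in place of the random stand-ins, the printed accuracies would be directly comparable to the 73.46/71.60/67.28 percent figures reported above.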
[Figure 7.3 depicts the process as a pipeline: statements transcribed for processing; statements labeled as truthful or deceptive by law enforcement; text processing software identified cues in statements; text processing software generated quantified cues; cues extracted and selected; classification models trained and tested on quantified cues.]
FIGURE 7.3 Text-Based Deception-Detection Process. Source: C. M. Fuller, D. Biros, and D. Delen, "Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection," in Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS), January 2008, Big Island, HI, IEEE Press, pp. 80-99.
TABLE 7.1 Categories and Examples of Linguistic Features Used in Deception Detection

Number  Construct (Category)  Example Cues
1       Quantity              Verb count, noun-phrase count, etc.
2       Complexity            Average number of clauses, average sentence length, etc.
3       Uncertainty           Modifiers, modal verbs, etc.
4       Nonimmediacy          Passive voice, objectification, etc.
5       Expressivity          Emotiveness
6       Diversity             Lexical diversity, redundancy, etc.
7       Informality           Typographical error ratio
8       Specificity           Spatiotemporal information, perceptual information, etc.
9       Affect                Positive affect, negative affect, etc.
The results indicate that automated text-based
deception detection has the potential to aid those
who must try to detect lies in text and can be suc-
cessfully applied to real-world data. The accuracy
of these techniques exceeded the accuracy of most
other deception-detection techniques even though it
was limited to textual cues.
QUESTIONS FOR DISCUSSION
1. Why is it difficult to detect deception?
2. How can text/data mining be used to detect
deception in text?
3. What do you think are the main challenges for
such an automated system?
Sources: C. M. Fuller, D. Biros, and D. Delen, "Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection," in Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS), 2008, Big Island, HI, IEEE Press, pp. 80-99; C. F. Bond and B. M. DePaulo, "Accuracy of Deception Judgments," Personality and Social Psychology Review, Vol. 10, No. 3, 2006, pp. 214-234.

Biomedical Applications
Text mining holds great potential for the medical field in general and biomedicine in particular for several reasons. First, the published literature and publication outlets (especially with the advent of open access journals) in the field are expanding at an exponential rate. Second, compared to most other fields, the medical literature is more standardized and orderly, making it a more "minable" information source. Finally, the terminology used in this literature is relatively constant, having a fairly standardized ontology. What follows are a few exemplary studies where text mining techniques were successfully used in extracting novel patterns from biomedical literature.
Experimental techniques such as DNA microarray analysis, serial analysis of gene expression (SAGE), and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins. As in any other experimental approach, it is necessary to analyze this vast amount of data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.
Knowing the location of a protein within a cell can help to elucidate its role in biological processes and to determine its potential as a drug target. Numerous location-prediction systems are described in the literature; some focus on specific organisms, whereas others attempt to analyze a wide range of organisms. Shatkay et al. (2007) proposed a comprehensive system that uses several types of sequence- and text-based features to predict the location of proteins. The main novelty of their system lies in the way in which it selects its text sources and features and integrates them with sequence-based features. They tested the system on previously used data sets and on new data sets devised specifically to test its predictive power. The results showed that their system consistently beat previously reported results.
Chun et al. (2006) described a system that extracts disease-gene relationships from literature accessed via MedLine. They constructed a dictionary for disease and gene names from six public databases and extracted relation candidates by dictionary matching. Because dictionary matching produces a large number of false positives, they developed a method of machine learning-based named entity recognition (NER) to filter out false recognitions of disease/gene names. They found that the success of relation extraction is heavily dependent on the performance of NER filtering and that the filtering improved the precision of relation extraction by 26.7 percent, at the cost of a small reduction in recall.
Figure 7.4 shows a simplified depiction of a multilevel text analysis process for discovering gene-protein relationships (or protein-protein interactions) in the biomedical literature (Nakov et al., 2005). As can be seen in this simplified example that uses a simple sentence from biomedical text, first (at the bottom three levels) the text is tokenized
using part-of-speech tagging and shallow parsing. The tokenized terms (words) are then matched (and interpreted) against the hierarchical representation of the domain ontology to derive the gene-protein relationship. Application of this method (and/or some variation of it) to the biomedical literature offers great potential to decode the complexities in the Human Genome Project.
[Figure 7.4 shows the example sentence "... expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53" annotated at multiple levels: the words themselves, part-of-speech tags (NN, IN, VBZ, etc.), shallow-parse phrases (NP, PP), gene/protein identifiers, and ontology concept codes.]
FIGURE 7.4 Multilevel Analysis of Text for Gene/Protein Interaction Identification. Source: P. Nakov, A. Schwartz, B. Wolf, and M. A. Hearst, "Supporting Annotation Layers for Natural Language Processing," Proceedings of the Association for Computational Linguistics (ACL), interactive poster and demonstration sessions, 2005, Ann Arbor, MI, Association for Computational Linguistics, pp. 65-68.
Academic Applications
The issue of text mining is of great importance to publishers who hold large databases
of information requiring indexing for better retrieval. This is particularly true in scientific
disciplines, in which highly specific information is often contained within written text.
Initiatives have been launched, such as Nature's proposal for an Open Text Mining Interface
(OTMI) and the National Institutes of Health's common Journal Publishing Document Type
Definition (DTD), which would provide semantic cues to machines to answer specific que-
ries contained within text without removing publisher barriers to public access.
Academic institutions have also launched text mining initiatives. For example, the National Centre for Text Mining, a collaborative effort between the Universities of Manchester and Liverpool, provides customized tools, research facilities, and advice on text mining to the academic community. With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the social sciences. In the United States, the School of Information at the University of California, Berkeley, is developing a program called BioText to assist bioscience researchers in text mining and analysis.
As described in this section, text mining has a wide variety of applications in a number of different disciplines. See Application Case 7.4 for an example of how a financial services firm is using text mining to improve its customer service performance.
Application Case 7.4
Text Mining and Sentiment Analysis Help Improve Customer Service Performance
The company is a financial services firm that provides a broad range of solutions and services to a global customer base. The company has a comprehensive network of facilities around the world, with over 5,000 associates assisting their customers. Customers lodge service requests by telephone, email, or through an online chat interface.
As a B2C service provider, the company strives to maintain high standards for effective communication between their associates and customers, and tries to monitor customer interactions at every opportunity. The broad objective of this service performance monitoring is to maintain satisfactory quality of service over time and across the organization. To this end, the company has devised a set of standards for service excellence, to which all customer interactions are expected to adhere. These standards comprise different qualitative measures of service levels (e.g., associates should use clear and understandable language, associates should always maintain a professional and friendly demeanor, etc.). Associates' performances are measured based on compliance with these quality standards. Organizational units at different levels, like teams, departments, and the company as a whole, also receive scores based on associate performances. The evaluations and remunerations of not only the associates but also of management are influenced by these service performance scores.
Challenge
Continually monitoring service levels is essential
for service quality control. Customer surveys are
an excellent way of gathering feedback about
service levels. An even richer source of information
is the corpus of associate-customer interactions.
Historically, the company manually evaluated a sample of associate-customer interactions and survey responses for compliance with excellence standards. This approach, in addition to being subjective and error-prone, was time- and labor-intensive. Advances in machine learning and computational linguistics offer an opportunity to objectively evaluate all customer interactions in a timely manner.
The company needs a system for (1) automatically evaluating associate-customer interactions for compliance with quality standards and (2) analyzing survey responses to extract positive and negative feedback. The analysis must be able to account for the wide diversity of expression in natural language (e.g., pleasant and reassuring tone, acceptable language, appropriate abbreviations, addressing all of the customers' issues, etc.).
Solution
PolyAnalyst 6.5™ by Megaputer Intelligence is a
data mining and analysis platform that provides a
comprehensive set of tools for analyzing structured
and unstructured data. PolyAnalyst's text analysis
tools are used for extracting complex word pat-
terns, grammatical and semantic relationships, and
expressions of sentiment. The results of these text
analyses are then classified into context-specific
themes to identify actionable issues, which can
be assigned to relevant individuals responsible for
their resolution. The system can be programmed to provide feedback in case of insufficient classification so that analyses can be modified or amended. The relationships between structured fields and text analysis results are also established in order to identify patterns and interactions. The system publishes the results of analyses through graphical, interactive, Web-based reports. Users create analysis scenarios using a drag-and-drop graphical user interface (GUI). These scenarios are reusable solutions that can be programmed to automate the analysis and report generation process.
A set of specific criteria was designed to capture and automatically detect compliance with the company's Quality Standards. The figure below displays an example of an associate's response, as well as the quality criteria that it succeeds or fails to match.
[Figure: An example associate response annotated with quality criteria. The response "Greg, I am forwarding this item to the correct department for processing. Accounting, upon processing, please advise Greg of the correct path for similar requests. Thank you and have a nice day!" matches the criteria "Mentioned customer's name," "Mentioned next department," and "Pleasantry to next department," but is flagged "Request sent to incorrect department."]
As illustrated above, this comment matches several criteria while failing to match one, and contributes accordingly to the associate's performance score. These scores are then automatically calculated and aggregated across various organizational units. It is relatively easy to modify the system in case of changes in quality standards, and the changes can be quickly applied to historical data. The system also has an integrated case management system, which generates email alerts in case of
drops in service quality and allows users to track
the progress of issue resolution.
Tangible Results
1. Completely automated analysis; saves time.
2. Analysis of the entire dataset (>1 million records per year); no need for sampling.
3. 45% cost savings over traditional analysis.
4. Weekly processing. With traditional analysis, data could be processed only monthly due to time and resource constraints.
5. Analysis is not subject to the analyst's subjectivity.
   a. Increased accuracy.
   b. Increased uniformity.
6. Greater accountability. Associates can review the analysis and raise concerns in case of discrepancies.
Future Directions
Currently the corpus of associate-customer inter-
actions does not include transcripts of phone
conversations. By incorporating speech recognition
capability, the system can become a one-stop destination for analyzing all customer interactions. The system could also potentially be used in real time, rather than for periodic analyses.
QUESTIONS FOR DISCUSSION
1. How did the financial services firm use text
mining and text analytics to improve its customer
service performance?
2. What were the challenges, the proposed solu-
tion, and the obtained results?
Source: Megaputer, Customer Success Story, megaputer.com (accessed September 2013).

SECTION 7.4 REVIEW QUESTIONS
1. List and briefly discuss some of the text mining applications in marketing.
2. How can text mining be used in security and counterterrorism?
3. What are some promising text mining applications in biomedicine?
7.5 TEXT MINING PROCESS
In order to be successful, text mining studies should follow a sound methodology based on best practices. A standardized process model is needed similar to CRISP-DM, which is the industry standard for data mining projects (see Chapter 5). Even though most parts of CRISP-DM are also applicable to text mining projects, a specific process model for text mining would include much more elaborate data preprocessing activities. Figure 7.5 depicts a high-level context diagram of a typical text mining process (Delen and Crossland, 2008). This context diagram presents the scope of the process, emphasizing its interfaces with the larger environment. In essence, it draws boundaries around the specific process to explicitly identify what is included in (and excluded from) the text mining process.
[Figure 7.5 shows a context diagram: unstructured data (text) and structured data (databases) flow into the central process, "Extract knowledge from available data sources"; the output is context-specific knowledge; the controls (constraints) are software/hardware limitations, privacy issues, and linguistic limitations; the mechanisms are domain expertise and tools and techniques.]
FIGURE 7.5 Context Diagram for the Text Mining Process.
As the context diagram indicates, the input (inward connection to the left edge of the box) into the text-based knowledge-discovery process is the unstructured as well as structured data collected, stored, and made available to the process. The output (outward extension from the right edge of the box) of the process is the context-specific knowledge that can be used for decision making. The controls, also called the constraints (inward connection to the top edge of the box), of the process include software and hardware limitations, privacy issues, and the difficulties related to processing the text that is presented in the form of natural language. The mechanisms (inward connection to the bottom edge of the box) of the process include proper techniques, software tools, and domain expertise. The primary purpose of text mining (within the context of knowledge discovery) is to process unstructured (textual) data (along with structured data, if relevant to the problem being addressed and available) to extract meaningful and actionable patterns for better decision making.
At a very high level, the text mining process can be broken down into three consecutive tasks, each of which has specific inputs to generate certain outputs (see Figure 7.6). If, for some reason, the output of a task is not what is expected, a backward redirection to the previous task execution is necessary.
Task 1: Establish the Corpus
The main purpose of the first task is to collect all of the documents related to the context (domain of interest) being studied. This collection may include textual documents, XML files, e-mails, Web pages, and short notes. In addition to the readily available textual data, voice recordings may also be transcribed using speech-recognition algorithms and made a part of the text collection.
Once collected, the text documents are transformed and organized in a manner such that they are all in the same representational form (e.g., ASCII text files) for computer processing. The organization of the documents can be as simple as a collection of digitized text excerpts stored in a file folder or it can be a list of links to a collection of Web pages in a specific domain. Many commercially available text mining software tools
could accept these as input and convert them into a flat file for processing. Alternatively, the flat file can be prepared outside the text mining software and then presented as the input to the text mining application.
[Figure 7.6 depicts the three-task process with feedback loops between consecutive tasks:
Task 1, Establish the Corpus: collect and organize the domain-specific unstructured data. Inputs: a variety of relevant unstructured (and semistructured) data sources such as text, XML, HTML, etc. Output: a collection of documents in some digitized format for computer processing.
Task 2, Create the Term-Document Matrix: introduce structure to the corpus. Output: a flat file called a term-document matrix where the cells are populated with the term frequencies.
Task 3, Extract Knowledge: discover novel patterns from the T-D matrix. Output: a number of problem-specific classification, association, and clustering models and visualizations.]
FIGURE 7.6 The Three-Step Text Mining Process.
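As a concrete illustration of Task 1, the following minimal Python sketch gathers plain-text files from a folder into a uniform in-memory corpus. The folder name, the restriction to .txt files, and the UTF-8 encoding are illustrative assumptions, not requirements of the process described above.

```python
# A minimal corpus-building sketch (Task 1). Folder name and encoding
# are hypothetical; real corpora may mix e-mails, Web pages, XML, etc.
from pathlib import Path

def establish_corpus(folder: str) -> list[str]:
    """Read every .txt file in `folder` into one uniform list of strings."""
    corpus = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Normalize everything to plain Unicode text for later processing.
        corpus.append(path.read_text(encoding="utf-8", errors="replace"))
    return corpus

documents = establish_corpus("collected_documents")  # hypothetical folder
print(f"Corpus established with {len(documents)} documents.")
```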
Task 2: Create the Term-Document Matrix
In this task, the digitized and organized documents (the corpus) are used to create the term-document matrix (TDM). In the TDM, rows represent the documents and columns represent the terms. The relationships between the terms and documents are characterized by indices (i.e., a relational measure that can be as simple as the number of occurrences of the term in respective documents). Figure 7.7 is a typical example of a TDM.
[Figure 7.7 shows a sample term-document matrix: the rows are Document 1 through Document 6, the columns are terms extracted from the corpus, and each cell holds the number of times the term occurs in the document (e.g., 1, 2, or 3); blank cells indicate zero occurrences.]
FIGURE 7.7 A Simple Term-Document Matrix.
The goal is to convert the list of organized documents (the corpus) into a TDM where the cells are filled with the most appropriate indices. The assumption is that the essence of a document can be represented with a list and frequency of the terms used in that document. However, are all terms important when characterizing documents? Obviously, the answer is "no." Some terms, such as articles, auxiliary verbs, and terms used in almost all of the documents in the corpus, have no differentiating power and therefore should be excluded from the indexing process. This list of terms, commonly called stop terms or stop words, is specific to the domain of study and should be identified by the domain experts. On the other hand, one might choose a set of predetermined terms under which the documents are to be indexed (this list of terms is conveniently called include terms or dictionary). Additionally, synonyms (pairs of terms that are to be treated the same) and specific phrases (e.g., "Eiffel Tower") can also be provided so that the index entries are more accurate.
Another filtration that should take place to accurately create the indices is stemming, which refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of a verb are identified and indexed as the same word. For example, stemming will ensure that modeling and modeled will be recognized as the word model.
The first generation of the TDM includes all of the unique terms identified in the corpus (as its columns), excluding the ones in the stop term list; all of the documents (as its rows); and the occurrence count of each term for each document (as its cell values). If, as is commonly the case, the corpus includes a rather large number of documents, then there is a very good chance that the TDM will have a very large number of terms. Processing such a large matrix might be time-consuming and, more importantly, might lead to extraction of inaccurate patterns. At this point, one has to decide the following: (1) What is the best representation of the indices? and (2) How can we reduce the dimensionality of this matrix to a manageable size?
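To make the TDM construction concrete, here is a minimal sketch using the open source scikit-learn and NLTK libraries (a tool choice assumed for illustration; the commercial tools discussed later perform the same steps internally). It builds a tiny TDM with a toy stop-word list and Porter stemming, so that, for example, "modeling" and "modeled" index as the same root.

```python
# A minimal term-document matrix sketch with stop-word removal and stemming.
# The three-document corpus and the two-word stop list are invented examples.
from nltk.stem import PorterStemmer                      # pip install nltk
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Investment risk modeling for software projects",
    "Modeling and managing software project risk",
    "Customer churn prediction models",
]

stemmer = PorterStemmer()

def stemmed_tokenizer(text: str) -> list[str]:
    # Reduce words to their roots so "modeling"/"modeled" index as "model".
    return [stemmer.stem(tok) for tok in text.lower().split()]

vectorizer = CountVectorizer(tokenizer=stemmed_tokenizer,
                             token_pattern=None,
                             stop_words=["and", "for"])  # tiny stop list
tdm = vectorizer.fit_transform(corpus)   # rows = documents, cols = terms
print(vectorizer.get_feature_names_out())
print(tdm.toarray())                     # occurrence counts in the cells
```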
REPRESENTING THE INDICES Once the input documents are indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the extracted information. The raw term frequencies generally reflect how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs one time in document A, but three times in document B, then it is not necessarily reasonable to conclude that this word is three times as important a descriptor of document B as compared to document A. In order to have a more consistent TDM for further analysis, these raw indices need to be normalized. As opposed to showing the actual frequency counts, the numerical representation between terms and documents can be normalized using a number of alternative methods. The following are a few of the most commonly used normalization methods (StatSoft, 2009):
• Log frequencies. The raw frequencies can be transformed using the log function. This transformation would "dampen" the raw frequencies and how they affect the results of subsequent analysis.

f(wf) = 1 + log(wf)   for wf > 0

In the formula, wf is the raw word (or term) frequency and f(wf) is the result of the log transformation. This transformation is applied to all of the raw frequencies in the TDM where the frequency is greater than zero.
• Binary frequencies. Likewise, an even simpler transformation can be used to enumerate whether a term is used in a document.

f(wf) = 1   for wf > 0

The resulting TDM will contain only 1s and 0s to indicate the presence or absence of the respective words. Again, this transformation will dampen the effect of the raw frequency counts on subsequent computations and analyses.
• Inverse document frequencies. Another issue that one may want to consider more carefully and reflect in the indices used in further analyses is the relative document frequencies (df) of different terms. For example, a term such as guess may occur frequently in all documents, whereas another term, such as software, may appear only a few times. The reason is that one might make guesses in various contexts, regardless of the specific topic, whereas software is a more semantically focused term that is only likely to occur in documents that deal with computer software. A common and very useful transformation that reflects both the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (term frequencies) is the so-called inverse document frequency (Manning and Schutze, 2009). This transformation for the ith word and jth document can be written as:

idf(i, j) = 0                                   if wf_ij = 0
idf(i, j) = [1 + log(wf_ij)] * log(N / df_i)    if wf_ij >= 1

In this formula, N is the total number of documents, and df_i is the document frequency for the ith word (the number of documents that include this word). Hence, it can be seen that this formula includes both the dampening of the simple word frequencies via the log function (described here) and a weighting factor that evaluates to 0 if the word occurs in all documents [i.e., log(N/N) = log(1) = 0], and to the maximum value when a word only occurs in a single document [i.e., log(N/1) = log(N)]. It can easily be seen how this transformation will create indices that reflect both the relative frequencies of occurrences of words as well as their semantic specificities over the documents included in the analysis. This is the most commonly used transformation in the field.
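The three normalization schemes can be sketched directly in NumPy over a small raw frequency matrix; the 3 x 3 matrix below is an invented example, included only to show the arithmetic of the formulas above.

```python
# Log, binary, and inverse-document-frequency normalization of a raw TDM.
import numpy as np

wf = np.array([[1, 0, 3],      # rows = documents, columns = terms
               [2, 1, 0],
               [1, 1, 0]], dtype=float)

safe = np.where(wf > 0, wf, 1.0)   # placeholder 1s avoid log(0); masked below
log_wf = np.log(safe)

# Log frequencies: f(wf) = 1 + log(wf) for wf > 0
log_freq = np.where(wf > 0, 1 + log_wf, 0.0)

# Binary frequencies: f(wf) = 1 for wf > 0
binary = (wf > 0).astype(int)

# Inverse document frequencies:
# idf(i, j) = 0 if wf_ij = 0, else [1 + log(wf_ij)] * log(N / df_i)
N = wf.shape[0]                    # total number of documents
df = (wf > 0).sum(axis=0)          # number of documents containing each term
idf = np.where(wf > 0, (1 + log_wf) * np.log(N / df), 0.0)

print(log_freq, binary, idf, sep="\n\n")
```

Note how the term appearing in all three documents (the first column) receives an idf weight of 0, exactly as the log(N/N) = 0 case in the formula predicts.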
REDUCING THE DIMENSIONALITY OF THE MATRIX Because the TDM is often very large and rather sparse (most of the cells filled with zeros), another important question is "How do we reduce the dimensionality of this matrix to a manageable size?" Several options are available for managing the matrix size:
• A domain expert goes through the list of terms and eliminates those that do not make much sense for the context of the study (this is a manual, labor-intensive process).
• Eliminate terms with very few occurrences in very few documents.
• Transform the matrix using singular value decomposition.
Singular value decomposition (SVD), which is closely related to principal components analysis, reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents) possible (Manning and Schutze, 1999). Ideally, the analyst might identify the two or three most salient dimensions that account for most of the variability (differences) between the words and documents, thus identifying the latent semantic space that organizes the words and documents in the analysis. Once such dimensions are identified, the underlying "meaning" of what is contained (discussed or described) in the documents has been extracted. Specifically, assume that matrix A represents an m × n term occurrence matrix where m is the number of input documents and n is the number of terms selected for analysis. The SVD computes the m × r orthogonal matrix U, the n × r orthogonal matrix V, and the r × r matrix D, so that A = UDV' and r is the number of eigenvalues of A'A.
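In practice the SVD step is rarely coded by hand. A minimal sketch using scikit-learn's TruncatedSVD (a library choice assumed here, applied to a toy four-document corpus) shows how documents are projected into a low-dimensional latent semantic space:

```python
# Reduce a sparse term-document matrix with truncated SVD (latent semantic
# analysis). The four-document corpus is invented for illustration.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "supply chain knowledge management",
    "absorptive capacity in supply chains",
    "signaling games for information products",
    "internet advertising and product positioning",
]
A = TfidfVectorizer().fit_transform(corpus)      # documents x terms (sparse)
svd = TruncatedSVD(n_components=2, random_state=0)
docs_2d = svd.fit_transform(A)                   # documents in latent space
print(docs_2d)                                   # 2 salient dimensions per doc
print(svd.explained_variance_ratio_)             # variability each captures
```

With two latent dimensions, the two supply-chain documents and the two marketing documents land near each other, which is exactly the "latent semantic space" behavior described above.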
Task 3: Extract the Knowledge
Using the well-structured TDM, potentially augmented with other structured data elements, novel patterns are extracted in the context of the specific problem being addressed. The main categories of knowledge extraction methods are classification, clustering, association, and trend analysis. A short description of these methods follows.
CLASSIFICATION Arguably the most common knowledge-discovery topic in analyzing complex data sources is the classification (or categorization) of certain objects. The task is to classify a given data instance into a predetermined set of categories (or classes). As it applies to the domain of text mining, the task is known as text categorization, where for a given set of categories (subjects, topics, or concepts) and a collection of text documents the goal is to find the correct topic (subject or concept) for each document using models developed with a training data set that includes both the documents and actual document categories. Today, automated text classification is applied in a variety of contexts, including automatic or semiautomatic (interactive) indexing of text, spam filtering, Web page categorization under hierarchical catalogs, automatic generation of metadata, detection of genre, and many others.
The two main approaches to text classification are knowledge engineering and machine learning (Feldman and Sanger, 2007). With the knowledge-engineering approach, an expert's knowledge about the categories is encoded into the system either declaratively or in the form of procedural classification rules. With the machine-learning approach, a general inductive process builds a classifier by learning from a set of preclassified examples. As the number of documents increases at an exponential rate and as knowledge experts become harder to come by, the popularity trend between the two is shifting toward the machine-learning approach.
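A minimal machine-learning text categorization sketch, using scikit-learn and an invented spam-filtering mini-corpus (both assumptions made for illustration), shows the inductive process of building a classifier from preclassified examples:

```python
# Text categorization with the machine-learning approach: learn a classifier
# from preclassified documents, then categorize unseen text. The tiny
# training set and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "win a free prize claim now",         # spam
    "cheap meds limited offer",           # spam
    "meeting agenda for project review",  # legitimate
    "quarterly budget report attached",   # legitimate
]
train_labels = ["spam", "spam", "legit", "legit"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_docs, train_labels)   # learn from preclassified examples
print(classifier.predict(["free offer claim your prize",
                          "please review the attached report"]))
```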
CLUSTERING Clustering is an unsupervised process whereby objects are classified into "natural" groups called clusters. Compared to categorization, where a collection of preclassified training examples is used to develop a model based on the descriptive features of the classes in order to classify a new unlabeled example, in clustering the problem is to group an unlabeled collection of objects (e.g., documents, customer comments, Web pages) into meaningful clusters without any prior knowledge.
Clustering is useful in a wide range of applications, from document retrieval to enabling better Web content searches. In fact, one of the prominent applications of clustering is the analysis and navigation of very large text collections, such as Web pages. The basic underlying assumption is that relevant documents tend to be more similar to each other than to irrelevant ones. If this assumption holds, the clustering of documents based on the similarity of their content improves search effectiveness (Feldman and Sanger, 2007):
• Improved search recall. Clustering, because it is based on overall similarity as opposed to the presence of a single term, can improve the recall of a query-based search in such a way that when a query matches a document its whole cluster is returned.
• Improved search precision. Clustering can also improve search precision. As the number of documents in a collection grows, it becomes difficult to browse through the list of matched documents. Clustering can help by grouping the documents into a number of much smaller groups of related documents, ordering them by relevance, and returning only the documents from the most relevant group (or groups).
The two most popular clustering methods are scatter/gather clustering and query-specific clustering:
• Scatter/gather. This document browsing method uses clustering to enhance the efficiency of human browsing of documents when a specific search query cannot be formulated. In a sense, the method dynamically generates a table of contents for the collection and adapts and modifies it in response to the user selection.
• Query-specific clustering. This method employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less similar documents, creating a spectrum of relevance levels among the documents. This method performs consistently well for document collections of realistically large sizes.
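The following minimal sketch clusters a toy document collection with k-means over tf-idf vectors (scikit-learn; the corpus and the choice of two clusters are illustrative assumptions, and neither scatter/gather nor query-specific clustering is implemented here):

```python
# Unsupervised document clustering: group similar documents without labels.
# The four-document corpus and k = 2 are invented for illustration.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "database systems and query processing",
    "sql query optimization in databases",
    "neural networks for image recognition",
    "deep learning improves image classification",
]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
for doc, label in zip(docs, km.labels_):
    print(label, doc)   # similar documents should share a cluster label
```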
ASSOCIATION A formal definition and detailed description of association was provided in the chapter on data mining (Chapter 5). Association, or association rule learning in data mining, is a popular and well-researched technique for discovering interesting relationships among variables in large databases. The main idea in generating association rules (or solving market-basket problems) is to identify the frequent sets that go together.
In text mining, associations specifically refer to the direct relationships between concepts (terms) or sets of concepts. The concept set association rule A ⇒ C, relating two frequent concept sets A and C, can be quantified by the two basic measures of support and confidence. In this case, confidence is the percentage of documents that include all the concepts in C within the subset of documents that include all the concepts in A. Support is the percentage (or number) of documents that include all the concepts in A and C. For instance, in a document collection the concept "Software Implementation Failure" may appear most often in association with "Enterprise Resource Planning" and "Customer Relationship Management" with significant support (4%) and confidence (55%), meaning that 4 percent of the documents had all three concepts represented together in the same document, and of the documents that included "Software Implementation Failure," 55 percent of them also included "Enterprise Resource Planning" and "Customer Relationship Management."
Text mining with association rules was used to analyze published literature (news and academic articles posted on the Web) to chart the outbreak and progress of bird flu (Mahgoub et al., 2008). The idea was to automatically identify the associations among geographic areas, spreading across species, and countermeasures (treatments).
TREND ANALYSIS Recent methods of trend analysis in text mining have been based on the notion that the various types of concept distributions are functions of document collections; that is, different collections lead to different concept distributions for the same set of concepts. It is therefore possible to compare two distributions that are otherwise identical except that they are from different subcollections. One notable direction of this type of analysis is having two collections from the same source (such as from the same set of academic journals) but from different points in time. Delen and Crossland (2008) applied trend analysis to a large number of academic articles (published in the three highest-rated academic journals) to identify the evolution of key concepts in the field of information systems.
As described in this section, a number of methods are available for text mining. Application Case 7.5 describes the use of a number of different techniques in analyzing a large set of literature.
Application Case 7.5
Research Literature Survey with Text Mining
Researchers conducting searches and reviews of rel-
evant literature face an increasingly complex and
voluminous task. In extending the body of relevant
knowledge, it has always been important to work
hard to gather, organize, analyze, and assimilate
existing information from the literature, particularly
from one’s home discipline. With the increasing
abundance of potentially significant research being
reported in related fields, and even in what are tra-
ditionally deemed to be nonrelated fields of study,
the researcher’s task is ever more daunting, if a thor-
ough job is desired.
In new streams of research, the researcher’s
task may be even more tedious and complex. Trying
to ferret out relevant work that others have reported
may be difficult, at best, and perhaps even near
impossible if traditional, largely manual reviews
of published literature are required. Even with a
legion of dedicated graduate students or helpful col-
leagues, trying to cover all potentially relevant pub-
lished work is problematic.
Many scholarly conferences take place every
year. In addition to extending the body of knowl-
edge of the current focus of a conference, organiz-
ers often desire to offer additional mini-tracks and
workshops. In many cases, these additional events
are intended to introduce the attendees to signifi-
cant streams of research in related fields of study
and to try to identify the “next big thing” in terms of
research interests and focus. Identifying reasonable
candidate topics for such mini-tracks and workshops
is often subjective rather than derived objectively
from the existing and emerging research.
In a recent study, Delen and Crossland (2008)
proposed a method to greatly assist and enhance
the efforts of the researchers by enabling a semi-
automated analysis of large volumes of published
literature through the application of text mining.
Using standard digital libraries and online publica-
tion search engines, the authors downloaded and
collected all of the available articles for the three
major journals in the field of management informa-
tion systems: MIS Quarterly (MISQ), Information
Systems Research (ISR), and the Journal of
Management Information Systems (JMIS). In order
to maintain the same time interval for a ll three
journals (for potential comparative longitudinal
studies), the journal with the most recent starting
date for its digital publication availability was used
as the start time for this study (i.e., JMIS articles
have been digitally available since 1994). For each
article, they extracted the title, abstract, author list, published keywords, volume, issue number, and year of publication. They then loaded all of the article data into a simple database file. Also included
in the combined data set was a field that designated
the journal type of each article for likely discrimi-
natory analysis. Editorial notes, research notes, and
executive overviews were omitted from the collec-
tion. Table 7.2 shows how the data was presented in a tabular format.

TABLE 7.2 Tabular Representation of the Fields Included in the Combined Data Set

Record 1. Journal: MISQ; Year: 2005; Author(s): A. Malhotra, S. Gossain, and O. A. El Sawy; Title: "Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation"; Vol/No: 29/1; Pages: 145-187; Keywords: knowledge management, supply chain, absorptive capacity, interorganizational information systems, configuration approaches; Abstract: "The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganization partnerships for sharing..."

Record 2. Journal: ISR; Year: 1999; Author(s): D. Robey and M. C. Boudreau; Title: "Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications"; Pages: 165-185; Keywords: organizational transformation, impacts of technology, organization theory, research methodology, intraorganizational power, electronic communication, misimplementation, culture, systems; Abstract: "Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory..."

Record 3. Journal: JMIS; Year: 2001; Author(s): R. Aron and E. K. Clemons; Title: "Achieving the optimal balance between investment in quality and investment in self-promotion for information products"; Pages: 65-88; Keywords: information products, Internet advertising, product positioning, signaling, signaling games; Abstract: "When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of..."
In the analysis phase, they chose to use only
the abstract of an article as the source of infor-
mation extraction. They chose not to include the
keywords listed with the publications for two
main reasons: (1) under normal circumstances,
the abstract would already include the listed key-
words, and therefore inclusion of the listed key-
words for the analysis would mean repeating the
same information and potentially giving them
unmerited weight; and (2) the listed keywords
may be terms that authors would like their article
to be associated with (as opposed to what is really
contained in the article), therefore potentially
introducing unquantifiable bias to the analysis of
the content.
The first exploratory study was to look at
the longitudinal perspective of the three journals
(i.e., evolution of research topics over time). In
order to conduct a longitudinal study, they divided
the 12-year period (from 1994 to 2005) into four
3-year periods for each of the three journals. This
framework led to 12 text mining experiments with
12 mutually exclusive data sets. At this point, for
each of the 12 data sets they used text mining to
extract the most descriptive terms from these col-
lections of articles represented by their abstracts.
The results were tabulated and examined for time-
varying changes in the terms published in these
three journals.
As a second exploration, using the complete
data set (including all three journals and all four
periods), they conducted a clustering analysis. Clustering is arguably the most commonly used text mining technique. Clustering was used in this study to identify the natural groupings of the articles (by putting them into separate clusters) and then to list the most descriptive terms that characterized those clusters. They used singular value decomposition to reduce the dimensionality of the term-by-document matrix and then an expectation-maximization algorithm to create the clusters. They conducted several experiments to identify the optimal number of clusters, which turned out to be nine. After the construction of the nine clusters, they analyzed the content of those clusters from two perspectives: (1) representation of the journal type (see Figure 7.8) and (2) representation of time. The idea was to explore the potential differences and/or commonalities among the three
journals and potential changes in the emphasis on those clusters; that is, to answer questions such as "Are there clusters that represent different research themes specific to a single journal?" and "Is there a time-varying characterization of those clusters?" They discovered and discussed several interesting patterns using tabular and graphical representation of their findings (for further information, see Delen and Crossland, 2008).
QUESTIONS FOR DISCUSSION
1. How can text mining be used to ease the task of
literature review?
2. What are the common outcomes of a text mining
project on a specific collection of journal articles?
Can you think of other potential outcomes not
mentioned in this case?
[Figure 7.8 consists of nine bar-chart panels, CLUSTER: 1 through CLUSTER: 9; each panel plots the number of articles (0 to 100) from ISR, JMIS, and MISQ that fall into that cluster.]
FIGURE 7.8 Distribution of the Number of Articles for the Three Journals over the Nine Clusters. Source: D. Delen and M. Crossland, "Seeding the Survey and Analysis of Research Literature with Text Mining," Expert Systems with Applications, Vol. 34, No. 3, 2008, pp. 1707-1720.
SECTION 7.5 REVIEW QUESTIONS
1. What are the main steps in the text mining process?
2. What is the reason for normalizing word frequencies? What are the common methods
for normalizing word frequencies?
3. What is singular value decomposition? How is it used in text mining?
4. What are the main knowledge extraction methods from a corpus?
7.6 TEXT MINING TOOLS
As the value of text mining is being realized by more and more organizations, the number of software tools offered by software companies and nonprofits is also increasing. Following are some of the popular text mining tools, which we classify as commercial software tools and free (and/or open source) software tools.
Commercial Software Tools
The following are some of the most popular software tools used for text mining. Note
that many companies offer demonstration versions of their products on their Web sites.
1. ClearForest offers text analysis and visualization tools.
2. IBM offers SPSS Modeler and data and text analytics toolkits.
3. Megaputer Text Analyst offers semantic analysis of free-form text, summarization,
clustering, navigation, and natural language retrieval with search dynamic refocusing.
4. SAS Text Miner provides a rich suite of text processing and analysis tools.
5. KXEN Text Coder (KTC) offers a text analytics solution for automatically preparing
and transforming unstructured text attributes into a structured representation for use
in KXEN Analytic Framework.
6. The Statistica Text Mining engine provides easy-to-use text mining functionality
with exceptional visualization capabilities.
7. VantagePoint provides a variety of interactive graphical views and analysis tools
with powerful capabilities to discover knowledge from text databases.
8. The WordStat analysis module from Provalis Research analyzes textual information such as responses to open-ended questions, interviews, etc.
9. Clarabridge text mining software provides end-to-end solutions for customer experience professionals wishing to transform customer feedback for marketing, service, and product improvements.
Free Software Tools
Free software tools, some of which are open source, are available from a number of nonprofit organizations:
1. RapidMiner, one of the most popular free, open source software tools for data mining and text mining, features a graphically appealing, drag-and-drop user interface.
2. Open Calais is an open source toolkit for including semantic functionality within
your blog, content management system, Web site, or application.
3. GATE is a leading open source toolkit for text mining. It has a free open source
framework (or SDK) and graphical development environment.
4. LingPipe is a suite of Java libraries for the linguistic analysis of human language.
5. S-EM (Spy-EM) is a text classification system that learns from positive and unlabeled
examples.
6. Vivisimo/Clusty is a Web search and text-clustering engine.
Often, innovative applications of text mining come from the collective use of several software tools. Application Case 7.6 illustrates a few customer case study synopses where text mining and advanced analytics are used to address a variety of business challenges.
Application Case 7.6
A Potpourri of Text Mining Case Synopses
1. Alberta's Parks Division gains insight
from unstructured data
Business Issue:
Alberta's Parks Division was relying on manual processes to respond to stakeholders, which was time-consuming and made it difficult to glean insight from unstructured data sources.
Solution:
Using SAS Text Miner, the Parks Division is able to
reduce a three-week process down to a couple of
days, and discover new insights in a matter of minutes.
Benefits:
The solution has not only automated manual tasks, but also provides insight into both structured and unstructured data sources that was previously not possible.
"We now have opportunities to channel customer communications into products and services that meet their needs. Having the analytics will enable us to better support changes in program delivery," said Roy Finzel, Manager of Business Integration and Analysis, Alberta Tourism, Parks and Recreation.
For more details, please go to http://www.sas.com/success/alberta-parks2012.html
2. American Honda Saves Millions by
Using Text and Data Mining
Business Issue:
One of the most admired and recognized automobile brands in the United States, American Honda wanted to detect and contain warranty and call center issues before they become widespread.
Solution:
SAS Text Miner helps American Honda spot patterns in a wide range of data and text to pinpoint problems early, ensuring safety, quality, and customer satisfaction.
Benefits:
"SAS is helping us make discoveries so that we can
address the core issues before they ever become
problems- and we can make sure that we are
addressing the right causes. We're talking about hun-
dreds of millions of dollars in savings," said Tracy
Cermack, Project Manager in the Service Enginee ring
Information Department, American Honda Motor Co.
For more details, please go to http.j/www.sas.
com/successjhonda.html
3. MaspexWadowice Group Analyzes
Online Brand Image with Text Mining
Business Issue:
MaspexWadowice Group, a dominant player among
food and beverage manufacturers in Central and
Eastern Europe, wanted to analyze social media channels to monitor a product's brand image and see how it compares with its general perception in the market.
Solution:
MaspexWadowice Group chose to use SAS Text Miner, which is a part of the SAS Business Analytics capabilities, to tap into social media data sources.
Benefits:
Maspex gained a competitive advantage through
better consumer insights, resulting in more effective
and efficient marketing efforts.
"This will allow us to plan and implement
our marketing and communications activities more
effectively , in particular those using a Web-based
channel," said Marcin Lesniak, Research Manager,
MaspexWadowice Group .
For more details, please go to http.//www.sas.
com/success/maspex-wadowice.html
4. Viseca Card Services Reduces Fraud
Loss with Text Analytics
Business Issue:
Switzerland's largest credit card company aimed to
prevent losses by detecting and preventing fraud
on Viseca Card Services' 1 million credit cards and
more than 100,000 daily transactions.
Solution:
They chose to use a suite of analytics tools from SAS, including SAS® Enterprise Miner™, SAS® Enterprise Guide®, SAS Text Miner, and SAS BI Server.
Benefits:
Eighty-one percent of a ll fraud cases are found
within a day, and total fraud loss has been reduced
by 15 percent. Even as the number of fraud cases
across the industry has doubled, Viseca Card Services
has reduced loss per fraud case by 40 percent.
"Thanks to SAS Analytics our total fraud loss has
been reduced by 15 percent. We have one of the best
fraud prevention ratings in Switzerland and our busi-
ness case for fraud prevention is straightforward: Our
returns are simply more than our investment," said
Marcel Bieler, Business Analyst, Viseca Card Se1vices.
For more details, please go to http.//www.sas.
com/successjVisecacardsvcs.html
5. Improving Quality with Text Mining and Advanced Analytics
Business Issue:
Whirlpool Corp., the world's leading manufacturer and marketer of major home appliances, wanted to reduce service calls by finding defects through warranty analysis and correcting them quickly.
Solution:
SAS Warranty Analysis and early-warning tools on the SAS Enterprise BI Server distill and analyze warranty claims data to quickly detect product issues. The tools used in this project included SAS Enterprise BI Server, SAS Warranty Analysis, SAS Enterprise Guide, and SAS Text Miner.
Benefits:
Whirlpool Corp. aims to cut the overall cost of quality, and SAS is playing a significant part in that objective. Expectations of the SAS Warranty Analysis solution include a significant reduction in Whirlpool's issue detection-to-correction cycle, a three-month decrease in initial issue detection, and a potential to cut overall warranty expenditures with significant quality, productivity, and efficiency gains.
"SAS brings a level of analytics to business intelligence that no one else matches," said John Kerr, General Manager of Quality and Operational Excellence, Whirlpool Corp.
For more details, please go to http://www.sas.com/success/whirlpool.html
QUESTIONS FOR DISCUSSION
1. What do you think are the common characteristics of the kind of challenges these five companies were facing?
2. What are the types of solution methods and tools proposed in these case synopses?
3. What do you think are the key benefits of using text mining and advanced analytics (compared to the traditional way to do the same)?
Sources: SAS, www.sas.com/success/ (accessed September 2013).
SECTION 7.6 REVIEW QUESTIONS
1. What are some of the most popular text mining software tools?
2. Why do you think most of the text mining tools are offered by statistics companies?
3. What do you think are the pros and cons of choosing a free text mining tool over a commercial tool?
7.7 SENTIMENT ANALYSIS OVERVIEW
We, humans, are social beings. We are adept at utilizing a variety of means to communicate. We often consult financial discussion forums before making an investment decision; ask our friends for their opinions on a newly opened restaurant or a newly released movie; and conduct Internet searches and read consumer reviews and expert reports before making a big purchase like a house, a car, or an appliance. We rely on others' opinions to make better decisions, especially in an area where we don't have a lot of knowledge or experience. Thanks to the growing availability and popularity of opinion-rich Internet resources such as social media outlets (e.g., Twitter, Facebook), online review sites, and personal blogs, it is now easier than ever to find opinions of others (thousands of them, as a matter of fact) on everything from the latest gadgets to political and public figures. Even though not everybody expresses opinions over the Internet, thanks to the fast-growing numbers and capabilities of social communication channels, the number of people who do is increasing exponentially.
Sentiment is a difficult word to define. It is often linked to or confused with other terms like belief, view, opinion, and conviction. Sentiment suggests a settled opinion reflective of one's feelings (Mejova, 2009). Sentiment has some unique properties that set it apart from other concepts that we may want to identify in text. Often we want to categorize text by topic, which may involve dealing with whole taxonomies of topics. Sentiment classification, on the other hand, usually deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or even a range in strength of opinion (Pang and Lee, 2008). These classes span many topics, users, and documents. Although dealing with only a few classes may seem like an easier task than standard text analysis, it is far from the truth.
As a field of research, sentiment analysis is closely related to computational linguistics, natural language processing, and text mining. Sentiment analysis has many names. It's often referred to as opinion mining, subjectivity analysis, and appraisal extraction, with some connections to affective computing (computer recognition and expression of emotion). The sudden upsurge of interest and activity in the area of sentiment analysis (i.e., opinion mining), which deals with the automatic extraction of opinions, feelings, and subjectivity in text, is creating opportunities and threats for businesses and individuals alike. The ones who embrace and take advantage of it will greatly benefit from it. Every opinion put on the Internet by an individual or a company will be accredited to the originator (good or bad) and will be retrieved and mined by others (often automatically by computer programs).
Sentiment analysis is trying to answer the question "What do people feel about a certain topic?" by digging into opinions of many using a variety of automated tools. Bringing together researchers and practitioners in business, computer science, computational linguistics, data mining, text mining, psychology, and even sociology, sentiment analysis aims to expand traditional fact-based text analysis to new frontiers, to realize opinion-oriented information systems. In a business setting, especially in marketing and customer relationship management, sentiment analysis seeks to detect favorable and unfavorable opinions toward specific products and/or services using large numbers of textual data sources (customer feedback in the form of Web postings, tweets, blogs, etc.).
Sentiment that appears in text comes in two flavors: explicit, where the subjective sentence directly expresses an opinion ("It's a wonderful day"), and implicit, where the text implies an opinion ("The handle breaks too easily"). Most of the earlier work done in sentiment analysis focused on the first kind of sentiment, since it was easier to analyze. Current trends are to implement analytical methods to consider both implicit and explicit sentiments. Sentiment polarity is a particular feature of text that sentiment analysis primarily focuses on. It is usually dichotomized into two classes, positive and negative, but polarity can also be thought of as a range. A document containing several opinionated statements would have a mixed polarity overall, which is different from not having a polarity at all (being objective) (Mejova, 2009).
Timely collection and analysis of textual data, which may be coming from a variety of sources, ranging from customer call center transcripts to social media postings, is a crucial part of the capabilities of proactive and customer-focused companies nowadays.
[Figure: a social media dashboard with panels for real-time social signal volume, provider share of voice, topic mentions for all carriers, and provider negative sentiment.]
FIGURE 7.9 A Sample Social Media Dashboard for Continuous Brand Analysis. Source: Attensity.
These real-time analyses of textual data are often visualized in easy-to-understand dashboards. Attensity is one of those companies that provide such end-to-end solutions to companies' text analytics needs (Figure 7.9 shows an example social media analytics dashboard created by Attensity). Application Case 7.7 provides an Attensity customer success story, where a large consumer product manufacturer used text analytics and sentiment analysis to better connect with its customers.
Application Case 7.7
Whirlpool Achieves Customer Loyalty and Product Success with Text Analytics
Background
Every day, a substantial amount of new customer feedback data, rich in sentiment, customer issues, and product insights, becomes available to organizations through e-mails, repair notes, CRM notes, and online in social media. Within that data exists a wealth of insight into how customers feel about products, services, brands, and much more. That data also holds information about potential issues that could easily impact a product's long-term success and a company's bottom line. This data is invaluable to marketing, product, and service managers across every industry.
Attensity, a premier text analytics solution provider, combines the company's rich text analytics applications within customer-specific BI platforms. The result is an intuitive solution that enables customers to fully leverage critical data assets to discover invaluable business insight and to foster better and faster decision making.
Whirlpool is the world's leading manufacturer and marketer of major home appliances, with annual sales of approximately $19 billion, 67,000 employees, and nearly 70 manufacturing and technology research centers around the world. Whirlpool recognizes that consumers lead busy, active lives, and continues to create solutions that help consumers optimize productivity and efficiency in the home. In addition to designing appliance solutions based on consumer insight, Whirlpool's brand is dedicated to creating ENERGY STAR-qualified appliances like the Resource Saver side-by-side refrigerator, which recently was rated the #1 brand for side-by-side refrigerators.
Business Challenge
Customer satisfaction and feedback are at the center of how Whirlpool drives its overarching business strategy. As such, gaining insight into customer satisfaction and product feedback is paramount. One of Whirlpool's goals is to more effectively understand and react to customer and product feedback data originating from blogs, e-mails, reviews, forums, repair notes, and other data sources. Whirlpool also strives to enable its managers to report on longitudinal data and be able to compare issues by brand over time. Whirlpool has entrusted Attensity's text analytics solutions; and with that, Whirlpool listens and acts on customer data in its service department, its innovation and product development groups, and in marketing every day.
Methods and the Benefits
To face its business requirements head-on, Whirlpool uses Attensity products for deep text analytics of its multi-channel customer data, which includes e-mails, CRM notes, repair notes, warranty data, and social media. More than 300 business users at Whirlpool use text analytics solutions every day to get to the root cause of product issues and receive alerts on emerging issues. Users of Attensity's analytics products at Whirlpool include product/service managers, corporate/product safety staff, consumer advocates, service quality staff, innovation managers, the Category Insights team, and all of Whirlpool's manufacturing divisions (across five countries).
Attensity's Text Analytics application has played a particularly critical role for Whirlpool. Whirlpool relies on the application to conduct deep analysis of the voice of the customer, with the goal of identifying product quality issues and innovation opportunities, and to drive those insights more broadly across the organization. Users conduct in-depth analysis of customer data and then extend access to that analysis to business users all over the world.
Whirlpool has been able to more proactively identify and mitigate quality issues before issues escalate and claims are filed. Whirlpool has also been able to avoid recalls, which has the dual benefit of increased customer loyalty and reduced costs (realizing 80% savings on their costs of recalls due to early detection). Having insight into customer feedback and product issues has also resulted in more efficient customer support and ultimately in better products. Whirlpool's customer support agents now receive fewer product service support calls, and when agents do receive a call, it's easier for them to leverage the interaction to improve products and services.
The process of launching new products has also been enhanced by having the ability to analyze its customers' needs and fit new products and services to those needs appropriately. When a product is launched, Whirlpool can use external customer feedback data to stay on top of potential product issues and address them in a timely fashion.
Michael Page, development and testing manager for Quality Analytics at Whirlpool Corporation, affirms these types of benefits: "Attensity's products have provided immense value to our business. We've been able to proactively address customer feedback and work toward high levels of customer service and product success."
QUESTIONS FOR DISCUSSION
1. How did Whirlpool use capabilities of text analytics to better understand their customers and improve product offerings?
2. What were the challenges, the proposed solution, and the obtained results?
Source: Attensity, Customer Success Story, www.attensity.com/2010/08/21/whirlpool-2/ (accessed August 2013).
SECTION 7.7 REVIEW QUESTIONS
1. What is sentiment analysis? How does it relate to text mining?
2. What are the sources of data for sentiment analysis?
3. What are the common challenges that sentiment analysis has to deal with?
7.8 SENTIMENT ANALYSIS APPLICATIONS
Compared to traditional sentiment analysis methods, which were survey based or focus group centered, costly, and time-consuming (and therefore driven from small samples of participants), the new face of text analytics-based sentiment analysis is a limit breaker. Current solutions automate very large-scale data collection, filtering, classification, and clustering methods via natural language processing and data mining technologies that handle both factual and subjective information. Sentiment analysis is perhaps the most popular application of text analytics, tapping into data sources like tweets, Facebook posts, online communities, discussion boards, Web logs, product reviews, call center logs and recordings, product rating sites, chat rooms, price comparison portals, search engine logs, and newsgroups. The following applications of sentiment analysis are meant to illustrate the power and the widespread coverage of this technology.
VOICE OF THE CUSTOMER (VOC) Voice of the customer (VOC) is an integral part of analytic CRM and customer experience management systems. As the enabler of VOC, sentiment analysis can access a company's product and service reviews (either continuously or periodically) to better understand and better manage customer complaints and praises. For instance, a motion picture advertising/marketing company may detect negative sentiment toward a movie that is about to open in theatres (based on its trailers) and quickly change the composition of trailers and advertising strategy (on all media outlets) to mitigate the negative impact. Similarly, a software company may detect the negative buzz regarding bugs found in its newly released product early enough to release patches and quick fixes to alleviate the situation.
Often, the focus of VOC is individual customers, their service- and support-related needs, wants, and issues. VOC draws data from the full set of customer touch points, including e-mails, surveys, call center notes/recordings, and social media postings, and matches customer voices to transactions (inquiries, purchases, returns) and individual customer profiles captured in enterprise operational systems. VOC, mostly driven by sentiment analysis, is a key element of customer experience management initiatives, where the goal is to create an intimate relationship with the customer.
VOICE OF THE MARKET (VOM) Voice of the market is about understanding aggregate opinions and trends. It's about knowing what stakeholders (customers, potential customers, influencers, whoever) are saying about your (and your competitors') products and services. A well-done VOM analysis helps companies with competitive intelligence and product development and positioning.
VOICE OF THE EMPLOYEE (VOE) Traditionally, VOE has been limited to employee satisfaction surveys. Text analytics in general (and sentiment analysis in particular) is a huge enabler of assessing the VOE. Using rich, opinionated textual data is an effective and efficient way to listen to what employees are saying. As we all know, happy employees empower customer experience efforts and improve customer satisfaction.
BRAND MANAGEMENT Brand management focuses on listening to social media, where anyone (past/current/prospective customers, industry experts, other authorities) can post opinions that can damage or boost your reputation. There are a number of relatively
newly launched start-up companies that offer analytics-driven brand management services for others. Brand management is product and company (rather than customer) focused. It attempts to shape perceptions rather than to manage experiences using sentiment analysis techniques.
FINANCIAL MARKETS Predicting the future values of individual (or a group of) stocks has been an interesting and seemingly unsolvable problem. What makes a stock (or a group of stocks) move up or down is anything but an exact science. Many believe that the stock market is mostly sentiment driven, making it anything but rational (especially for short-term stock movements). Therefore, use of sentiment analysis in financial markets has gained significant popularity. Automated analysis of market sentiment using social media, news, blogs, and discussion groups seems to be a promising way to anticipate market movements. If done correctly, sentiment analysis can identify short-term stock movements based on the buzz in the market, potentially impacting liquidity and trading.
POLITICS As we all know, opinions matter a great deal in politics. Because political discussions are dominated by quotes, sarcasm, and complex references to persons, organizations, and ideas, politics is one of the most difficult, and potentially fruitful, areas for sentiment analysis. By analyzing the sentiment on election forums, one may predict who is more likely to win or lose. Sentiment analysis can help understand what voters are thinking and can clarify a candidate's position on issues. Sentiment analysis can help political organizations, campaigns, and news analysts to better understand which issues and positions matter the most to voters. The technology was successfully applied by both parties to the 2008 and 2012 American presidential election campaigns.
GOVERNMENT INTELLIGENCE Government intelligence is another application that has been used by intelligence agencies. For example, it has been suggested that one could monitor sources for increases in hostile or negative communications. Sentiment analysis can allow the automatic analysis of the opinions that people submit about pending policy or government-regulation proposals. Furthermore, monitoring communications for spikes in negative sentiment may be of use to agencies like Homeland Security.
OTHER INTERESTING AREAS Sentiments of customers can be used to better design e-commerce sites (product suggestions, upsell/cross-sell advertising), better place advertisements (e.g., placing dynamic advertisements of products and services that consider the sentiment on the page the user is browsing), and manage opinion- or review-oriented search engines (i.e., an opinion-aggregation Web site, an alternative to sites like Epinions, summarizing user reviews). Sentiment analysis can help with e-mail filtration by categorizing and prioritizing incoming e-mails (e.g., it can detect strongly negative or flaming e-mails and forward them to the proper folder), as well as with citation analysis, where it can determine whether an author is citing a piece of work as supporting evidence or as research that he or she dismisses.
SECTION 7.8 REVIEW QUESTIONS
1. What are the most popular application areas for sentiment analysis? Why?
2. How can sentiment analysis be used for brand management?
3. What would be the expected benefits and beneficiaries of sentiment analysis in politics?
4. How can sentiment analysis be used in predicting financial markets?
7.9 SENTIMENT ANALYSIS PROCESS
Because of the complexity of the problem (underlying concepts, expressions in text, context in which the text is expressed, etc.), there is no readily available standardized process to conduct sentiment analysis. However, based on the published work in the field of sentiment analysis so far (both on research methods and range of applications), a simple, multi-step logical process, as given in Figure 7.10, seems to be an appropriate methodology for sentiment analysis. These logical steps are iterative (i.e., feedback, corrections, and iterations are part of the discovery process) and experimental in nature, and once completed and combined, are capable of producing desired insight about the opinions in the text collection.
STEP 1: SENTIMENT DETECTION After the retrieval and preparation of the text documents, the first main task in sentiment analysis is the detection of objectivity. Here the goal is to differentiate between a fact and an opinion, which may be viewed as classification of text as objective or subjective. This may also be characterized as calculation of O-S polarity (Objectivity-Subjectivity polarity, which may be represented with a numerical value ranging from 0 to 1). If the objectivity value is close to 1, then there is no opinion to mine (i.e., it is a fact); therefore, the process goes back and grabs the next text data to analyze. Usually opinion
[Figure: a flowchart in which each statement in the textual data is scored for O-S polarity (Step 1); if a sentiment is present, its N-P polarity is calculated (Step 2), the target of the sentiment is identified (Step 3), and the polarity, strength, and target are recorded, then tabulated and aggregated (Step 4).]
FIGURE 7.10 A Multi-Step Process to Sentiment Analysis.
detection is based on the examination of adjectives in text. For example, the polarity of "what a wonderful work" can be determined relatively easily by looking at the adjective.
STEP 2: N-P POLARITY CLASSIFICATION The second main task is that of polarity classification. Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities, or locate its position on the continuum between these two polarities (Pang and Lee, 2008). When viewed as a binary feature, polarity classification is the binary classification task of labeling an opinionated document as expressing either an overall positive or an overall negative opinion (e.g., thumbs up or thumbs down). In addition to the identification of N-P polarity, one should also be interested in identifying the strength of the sentiment (as opposed to just positive, it may be expressed as mildly, moderately, strongly, or very strongly positive). Most of this research was done on product or movie reviews, where the definitions of "positive" and "negative" are quite clear. Other tasks, such as classifying news as "good" or "bad," present some difficulty. For instance, an article may contain negative news without explicitly using any subjective words or terms. Furthermore, these classes usually appear intermixed when a document expresses both positive and negative sentiments. Then the task can be to identify the main (or dominating) sentiment of the document. Still, for lengthy texts, the tasks of classification may need to be done at several levels: term, phrase, sentence, and perhaps document level. For those, it is common to use the outputs of one level as the inputs for the next higher layer. Several methods used to identify the polarity and strengths of the polarity are explained in the next section.
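As one concrete stand-in for such a polarity classifier, the sketch below uses NLTK's VADER analyzer (an assumed tool choice, requiring the vader_lexicon resource) to produce an N-P polarity score and map it to a strength label; the strength bands are illustrative, not a standard.

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def np_polarity(text):
    score = sia.polarity_scores(text)["compound"]  # ranges over [-1, 1]
    label = "positive" if score >= 0 else "negative"
    strength = ("very strongly" if abs(score) > 0.75 else
                "strongly" if abs(score) > 0.5 else
                "moderately" if abs(score) > 0.25 else "mildly")
    return f"{strength} {label} ({score:+.2f})"

print(np_polarity("An absolute masterpiece of a movie."))
print(np_polarity("The handle breaks too easily."))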
STEP 3: TARGET IDENTIFICATION The goal of this step is to accurately identify the target of the expressed sentiment (e.g., a person, a product, an event, etc.). The difficulty of this task depends largely on the domain of the analysis. Even though it is usually easy to accurately identify the target for product or movie reviews, because the review is directly connected to the target, it may be quite challenging in other domains. For instance, lengthy, general-purpose text such as Web pages, news articles, and blogs do not always have a predefined topic that they are assigned to, and often mention many objects, any of which may be deduced as the target. Sometimes there is more than one target in a sentiment sentence, which is the case in comparative texts. A subjective comparative sentence orders objects in order of preference, for example, "This laptop computer is better than my desktop PC." These sentences can be identified using comparative adjectives and adverbs (more, less, better, longer), superlative adjectives (most, least, best), and other words (such as same, differ, win, prefer, etc.). Once the sentences have been retrieved, the objects can be put in an order that is most representative of their merits, as described in the text.
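The comparative cues listed above can be spotted mechanically. The sketch below (our illustration) relies on NLTK part-of-speech tags, where JJR/RBR mark comparatives and JJS/RBS superlatives, and simply treats the nouns as candidate targets; a real system would need far more careful target extraction.

import nltk

def comparative_cues(sentence):
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    cues = [w for w, t in tags if t in ("JJR", "RBR", "JJS", "RBS")]
    targets = [w for w, t in tags if t.startswith("NN")]  # candidate objects
    return cues, targets

cues, targets = comparative_cues(
    "This laptop computer is better than my desktop PC.")
print("cues:", cues)        # e.g., ['better']
print("targets:", targets)  # e.g., ['laptop', 'computer', 'desktop', 'PC']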
STEP 4: COLLECTION AND AGGREGATION Once the sentiments of all text data points in the document are identified and calculated, in this step they are aggregated and converted to a single sentiment measure for the whole document. This aggregation may be as simple as summing up the polarities and strengths of all texts, or as complex as using semantic aggregation techniques from natural language processing to come up with the ultimate sentiment.
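A toy version of the simple summing approach: given per-statement records of the target and signed sentiment strength produced by the earlier steps (the records below are made up for illustration), Step 4 reduces them to one score per target.

from collections import defaultdict

# Made-up (target, signed strength) records standing in for Steps 1-3 output
records = [("battery", +0.8), ("battery", +0.4),
           ("screen", -0.6), ("battery", -0.2)]

totals = defaultdict(list)
for target, signed_strength in records:
    totals[target].append(signed_strength)

for target, scores in totals.items():
    print(target, "->", round(sum(scores) / len(scores), 2))
# battery -> 0.33 (mildly positive); screen -> -0.6 (negative)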
Methods for Polarity Identification
As mentioned in the previous section, polarity identification, identifying the polarity of a text, can be made at the word, term, sentence, or document level. The most granular level for polarity identification is at the word level. Once the polarity identification is made at the word level, then it can be aggregated to the next higher level, and then the next, until the level of aggregation desired from the sentiment analysis is reached. There
seem to be two dominant techniques used for identification of polarity at the word/term level, each having its advantages and disadvantages:
1. Using a lexicon as a reference library (either developed manually or automatically, by an individual for a specific task or by an institution for general use)
2. Using a collection of training documents as the source of knowledge about the polarity of terms within a specific domain (i.e., inducing predictive models from opinionated textual documents)
Using a Lexicon
A lexicon is essentially the catalog of words, their synonyms, and their meanings for a given language. In addition to lexicons for many other languages, there are several general-purpose lexicons created for English. Often general-purpose lexicons are used to create a variety of special-purpose lexicons for use in sentiment analysis projects. Perhaps the most popular general-purpose lexicon is WordNet, created at Princeton University, which has been extended and used by many researchers and practitioners for sentiment analysis purposes. As described on the WordNet Web site (wordnet.princeton.edu), it is a large lexical database of English, including nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms (i.e., synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
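For readers who want to inspect WordNet directly, NLTK bundles an interface to it. The short sketch below assumes the wordnet corpus has been downloaded and simply lists a few synsets, each grouping the cognitive synonyms for one concept.

from nltk.corpus import wordnet as wn

for synset in wn.synsets("nice")[:3]:
    print(synset.name(), "->", synset.definition())
    print("  lemmas:", [lemma.name() for lemma in synset.lemmas()])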
An interesting extension of WordNet was created by Esuli and Sebastiani (2006), who added polarity (Positive-Negative) and objectivity (Subjective-Objective) labels for each term in the lexicon. To label each term, they classify the synset (a group of synonyms) to which the term belongs using a set of ternary classifiers (a measure that attaches to each object exactly one out of three labels), each of them capable of deciding whether a synset is Positive, Negative, or Objective. The resulting scores range from 0.0 to 1.0, giving a graded evaluation of opinion-related properties of the terms. These can be summed up visually as in Figure 7.11. The edges of the triangle represent one of the three classifications (positive, negative, and objective). A term can be located in this space as a point, representing the extent to which it belongs to each of the classifications.
A similar extension methodology is used to create SentiWordNet, a publicly available lexicon specifically developed for opinion mining (sentiment analysis) purposes.
[Figure: a triangle relating Positive (P), Negative (N), Subjective (S), and Objective (O); a term's position within the triangle represents the extent to which it is positive or negative (P-N polarity) and subjective or objective (S-O polarity).]
FIGURE 7.11 A Graphical Representation of the P-N Polarity and S-O Polarity Relationship.
SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity. More about SentiWordNet can be found at sentiwordnet.isti.cnr.it.
Another extension to WordNet is WordNet-Affect, developed by Strapparava and Valitutti (Strapparava and Valitutti, 2004). They label WordNet synsets using affective labels representing different affective categories like emotion, cognitive state, attitude, feeling, and so on. WordNet has also been directly used in sentiment analysis. For example, Kim and Hovy (Kim and Hovy, 2004) and Hu and Liu (Hu and Liu, 2005) generate lexicons of positive and negative terms by starting with a small list of "seed" terms of known polarities (e.g., love, like, nice) and then using the antonymy and synonymy properties of terms to group them into either of the polarity categories.
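SentiWordNet itself is also exposed through NLTK. The sketch below, which assumes the sentiwordnet and wordnet corpora have been downloaded, reads the three scores for one synset.

from nltk.corpus import sentiwordnet as swn

# Each synset carries the positivity, negativity, and objectivity scores
# described above; the three always sum to 1.0.
synset = swn.senti_synset("good.a.01")
print("positive:", synset.pos_score())
print("negative:", synset.neg_score())
print("objective:", synset.obj_score())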
Using a Collection of Training Documents
It is possible to perform sentiment classification using statistical analysis and machine-learning tools that take advantage of the vast resources of labeled documents available (labeled manually by annotators or via a star/point system). Product review Web sites like Amazon, C-NET, eBay, RottenTomatoes, and the Internet Movie Database (IMDB) have all been extensively used as sources of annotated data. The star (or tomato, as it were) system provides an explicit label of the overall polarity of the review, and it is often taken as a gold standard in algorithm evaluation.
A variety of manually labeled textual data is available through evaluation efforts such as the Text REtrieval Conference (TREC), NII Test Collection for IR Systems (NTCIR), and Cross Language Evaluation Forum (CLEF). The data sets these efforts produce often serve as a standard in the text mining community, including for sentiment analysis researchers. Individual researchers and research groups have also produced many interesting data sets. Technology Insights 7.2 lists some of the most popular ones. Once an already labeled textual data set is obtained, a variety of predictive modeling and other machine-learning algorithms can be used to train sentiment classifiers. Some of the most popular algorithms used for this task include artificial neural networks, support vector machines, k-nearest neighbor, Naive Bayes, decision trees, and expectation maximization-based clustering.
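As a minimal illustration of this corpus-based route, the sketch below trains a Naive Bayes classifier with scikit-learn (one assumed tool choice among the algorithms listed above); the tiny inline corpus is a placeholder for real annotated collections such as those in Technology Insights 7.2.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["a wonderful, moving film",
               "crisp writing and great acting",
               "dull, predictable, and far too long",
               "a complete waste of time"]
train_labels = ["positive", "positive", "negative", "negative"]

# Vectorize the documents, then fit the classifier, in one pipeline
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["great film with wonderful acting"]))  # ['positive']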
Identifying Semantic Orientation of Sentences and Phrases
Once the semantic orientation of individual words has been determined, it is often desirable to extend this to the phrase or sentence the word appears in. The simplest way to accomplish such aggregation is to use some type of averaging for the polarities of words in the phrases or sentences. Though rarely applied, such aggregation can be as complex as using one or more machine-learning techniques to create a predictive relationship between the words (and their polarity values) and phrases or sentences.
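Stated in code, the averaging scheme is only a few lines; the word scores below are illustrative stand-ins for a real polarity lexicon.

# Toy word-polarity lexicon; the scores are illustrative only
word_polarity = {"wonderful": 0.9, "nice": 0.6,
                 "breaks": -0.5, "easily": -0.1}

def sentence_polarity(sentence):
    scores = [word_polarity.get(w, 0.0) for w in sentence.lower().split()]
    return sum(scores) / len(scores) if scores else 0.0

print(sentence_polarity("What a wonderful work"))         # > 0: positive
print(sentence_polarity("The handle breaks too easily"))  # < 0: negative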
Identifying Semantic Orientation of the Document
Even though the vast majority of the work in this area is done in determining semantic orientation of words and phrases/sentences, some tasks like summarization and information retrieval may require semantic labeling of the whole document (REF). Similar to the case of aggregating sentiment polarity from word level to phrase or sentence level, aggregation to document level is also accomplished by some type of averaging. Sentiment orientation of the document may not make sense for very large documents; therefore, it is often used on small to medium-sized documents posted on the Internet.
TECHNOLOGY INSIGHTS 7.2 Large Textual Data Sets for Predictive Text Mining and Sentiment Analysis
Congressional Floor-Debate Transcripts: Published by Thomas et al. (Thomas and Pang, 2006); contains political speeches that are labeled to indicate whether the speaker supported or opposed the legislation discussed.
Economining: Published by the Stern School at New York University; consists of feedback postings for merchants at Amazon.com.
Cornell Movie-Review Data Sets: Introduced by Pang and Lee (Pang and Lee, 2008); contains 1,000 positive and 1,000 negative automatically derived document-level labels, and 5,331 positive and 5,331 negative sentences/snippets.
Stanford Large Movie Review Data Set: A set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag-of-words formats are provided. (See http://ai.stanford.edu/~amaas/data/sentiment.)
MPQA Corpus: Corpus and Opinion Recognition System corpus; contains 535 manually annotated news articles from a variety of news sources containing labels for opinions and private states (beliefs, emotions, speculations, etc.).
Multiple-Aspect Restaurant Reviews: Introduced by Snyder and Barzilay (Snyder and Barzilay, 2007); contains 4,488 reviews with an explicit 1-to-5 rating for five different aspects: food, ambiance, service, value, and overall experience.
SECTION 7.9 REVIEW QUESTIONS
1. What are the main steps in carrying out sentiment analysis projects?
2. What are the two common methods for polarity identification? What is the main difference between the two?
3. Describe how special lexicons are used in identification of sentiment polarity.
7.10 SENTIMENT ANALYSIS AND SPEECH ANALYTICS
Speech analytics is a growing field of science that allows users to analyze and extract information from both live and recorded conversations. It is being used effectively to gather intelligence for security purposes, to enhance the presentation and utility of rich media applications, and, perhaps most significantly, to deliver meaningful and quantitative business intelligence through the analysis of the millions of recorded calls that occur in customer contact centers around the world.
Sentiment analysis, as it applies to speech analytics, focuses specifically on assessing the emotional states expressed in a conversation and on measuring the presence and strength of positive and negative feelings that are exhibited by the participants. One common use of sentiment analysis within contact centers is to provide insight into a customer's feelings about an organization, its products, services, and customer service processes, as well as an individual agent's behavior. Sentiment analysis data can be used across an organization to aid in customer relationship management, agent training, and in identifying and resolving troubling issues as they emerge.
How Is It Done?
The core of automated sentiment analysis centers around creating a model to describe how certain features and content in the audio relate to the sentiments being felt and expressed by the participants in the conversation. Two primary methods have been deployed to predict sentiment within audio: acoustic/phonetic and linguistic modeling.
THE ACOUSTIC APPROACH The acoustic approach to sentiment analysis relies on extracting and measuring a specific set of features (e.g., tone of voice, pitch or volume, intensity and rate of speech) of the audio. These features can in some circumstances provide basic indicators of sentiment. For example, the speech of a surprised speaker tends to become somewhat faster, louder, and higher in pitch. Sadness and depression are presented as slower, softer, and lower in pitch (see Moore et al., 2008). An angry caller may speak much faster, much louder, and will increase the pitch of stressed vowels. There is a wide variety of audio features that can be measured. The most common ones are as follows:
• Intensity: energy, sound pressure level
• Pitch: variation of fundamental frequency
• Jitter: variation in frequency of vocal fold movements
• Shimmer: variation in amplitude of vocal fold movements
• Glottal pulse: glottal-source spectral characteristics
• HNR: harmonics-to-noise ratio
• Speaking rate: number of phonemes, vowels, syllables, or words per unit of time
When developing an acoustic analysis tool, the system must be built on a model that defines the sentiments being measured. The model is based on a database of the audio features (some of which are listed here) and how their presence may indicate each of the sentiments (as simple as positive, negative, neutral, or refined, such as fear, anger, sadness, hurt, surprise, relief, etc.) that are being measured. To create this database, each single-emotion example is preselected from an original set of recordings, manually reviewed, and annotated to identify which sentiment it represents. The final acoustic analysis tools are then trained (using data mining techniques), and a predictive model is tested and validated using a different set of the same annotated recordings.
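As an illustration of the feature-extraction half of this pipeline, the sketch below uses the librosa library (our assumed choice; it is not named in the text) to compute intensity- and pitch-related features from a recording. Here call.wav is a placeholder path, and a full system would feed such features into a classifier trained on emotion-annotated recordings.

import numpy as np
import librosa  # assumed library choice for audio feature extraction

y, sr = librosa.load("call.wav", sr=16000)  # placeholder file path

rms = librosa.feature.rms(y=y)[0]  # frame-level energy: an intensity proxy
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=65.0, fmax=400.0, sr=sr)  # pitch track; NaN on unvoiced frames

features = {
    "mean_intensity": float(np.mean(rms)),
    "pitch_mean_hz": float(np.nanmean(f0)),
    "pitch_std_hz": float(np.nanstd(f0)),  # crude pitch-variation cue
}
print(features)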
As sophisticated as it sounds, the acoustic approach has its deficiencies. First, because acoustic analysis relies on identifying the audio characteristics of a call, the quality of the audio can significantly impact the ability to identify these features. Second, speakers often express blended emotions, such as both empathy and annoyance (as in "I do understand, madam, but I have no miracle solution"), which are extremely difficult to classify based solely on their acoustic features. Third, acoustic analysis is often incapable of recognizing and adjusting for the variety of ways that different callers may express the same sentiment. Finally, its time-demanding and laborious process makes it impractical for use with live audio streams.
THE LINGUISTIC APPROACH Conversely, the linguistic approach focuses on the explicit indications of sentiment and context of the spoken content within the audio; linguistic models acknowledge that, when in a charged state, the speaker has a higher probability of using specific words, exclamations, or phrases in a particular order. The features that are most often analyzed in a linguistic model include:
• Lexical: words, phrases, and other linguistic patterns
• Disfluencies: filled pauses, hesitation, restarts, and nonverbals such as laughter or breathing
• Higher semantics: taxonomy/ontology, dialogue history, and pragmatics
The simplest method, in the linguistic approach, is to catch within the audio a limited number of specific keywords (a specific lexicon) that have domain-specific sentiment significance (a simple sketch of this keyword method is shown after this paragraph). This approach is perhaps the least popular due to its limited applicability and less-than-desired prediction accuracy. Alternatively, as with the acoustic approach, a model is built based on understanding which linguistic elements are predictors of particular sentiments, and this model is then run against a series of recordings to determine the sentiments that are contained therein. The challenge with this approach is in
collecting the linguistic information contained in any corpus of audio. This has traditionally been done using a large-vocabulary continuous speech recognition (LVCSR) system, often referred to as speech-to-text. However, LVCSR systems are prone to creating significant errors in the textual indexes they create. In addition, the level of computational effort they require, that is, the amount of computer processing power needed to analyze large amounts of audio content, has made them very expensive to deploy for mass audio analysis.
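The keyword-lexicon method is easy to state once audio has been transcribed; in the sketch below, the phrases echo the "confusion" lexicon of Application Case 7.8, and the transcript is a made-up example.

# Illustrative domain lexicon: phrases that signal caller confusion
CONFUSION_PHRASES = ["i'm a little confused", "i don't understand",
                     "i don't get it", "doesn't make sense"]

def flag_confusion(transcript):
    text = transcript.lower()
    return [phrase for phrase in CONFUSION_PHRASES if phrase in text]

call = ("Well, I don't understand why the claim was denied; "
        "it doesn't make sense to me.")
print(flag_confusion(call))  # ["i don't understand", "doesn't make sense"]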
Yet another approach to linguistic analysis is that of phonetic indexing and search. Among the significant advantages associated with this approach to linguistic modeling are the method's ability to maintain a high degree of accuracy no matter the quality of the audio source, and its incorporation of conversational context through the use of structured queries during analysis (Nexidia, 2009).
Application Case 7.8 is a great example of how analytically savvy companies find ways to better "listen" and improve their customers' experience.
Application Case 7.8
Cutting Through the Confusion: Blue Cross Blue Shield of North Carolina Uses Nexidia’s Speech
Analytics to Ease Member Experience in Healthcare
Introduction
With the passage of the healthcare law, many health plan members were perplexed by new rules and regulations and concerned about the effects mandates would have on their benefits, copays, and providers. In an attempt to ease concerns, health plans such as Blue Cross Blue Shield of North Carolina (BCBSNC) published literature, updated Web sites, and sent various forms of communication to members to further educate them on the changes. However, members continued to reach out via the contact center, seeking answers regarding current claims and benefits and how their health insurance coverage might be affected in the future. As the law moves forward, members will be more engaged in making their own decisions about healthcare plans and about where to seek care, thus becoming better consumers. The transformation to healthcare consumerism has made it crucial for health plan contact centers to diligently work to optimize the customer experience.
BCBSNC became concerned that despite its best efforts to communicate changes, confusion remained among its nearly 4 million members, which was driving unnecessary calls into its contact center and could lead to a decrease in member satisfaction. Also, like all plans, BCBSNC was looking to trim costs associated with its contact center, as the health reform law mandates health plans spend a minimum of 80 percent of all premium payments on healthcare. This rule leaves less money for administrative expenses, like the contact center.
However, BCBSNC saw an opportunity to leverage its partnership with Nexidia, a leading provider of customer interaction analytics, and use speech analytics to better understand the cause and depth of member confusion. The use of speech analytics was a more attractive option for BCBSNC than asking its customer service professionals to more thoroughly document the nature of the calls within the contact center desktop application, which would have decreased efficiency and increased contact center administrative expenses. By identifying the specific root cause of the interactions when members called the contact center, BCBSNC would be able to take corrective actions to reduce call volumes and costs and improve the members' experience.
Alleviating the Confusion
BCBSNC has been ahead of the curve on engaging and educating its members and providing exemplary customer service. The health plan knew it needed to work vigorously to maintain its customer
service track record as the healthcare mandates began. The first step was to better understand how members perceived the value they received from BCBSNC and their overall opinion of the company. To accomplish this, BCBSNC elected to conduct sentiment analysis to get richer insights into members' opinions and interactions.
When conducting sentiment analysis, two strategies can be used to garner results. The acoustic model relies on measuring specific characteristics of the audio, such as sound, tone of voice, pitch, volume, intensity, and rate of speech. The other strategy, used by Nexidia, is linguistic modeling, which focuses directly on spoken sentiment. Acoustic modeling results in inaccurate data because of poor recording quality, background noise, and a person's inability to change tone or cadence to reflect his or her emotion. The linguistic approach, which focuses directly on words or phrases used to convey a feeling, has proven to be most effective.
Since BCBSNC suspected its members may perceive their health coverage as confusing, BCBSNC utilized Nexidia to put together structured searches for words or phrases used by callers to express confusion: "I'm a little confused," "I don't understand," "I don't get it," and "Doesn't make sense." The results were the exact percentage of calls containing this sentiment and helped BCBSNC specifically isolate those circumstances and coverage instances where callers were more likely to be confused with a benefit or claim. BCBSNC filtered their "confusion calls" from their overall call volume so these calls were available for further analysis.
The next step was to use speech analytics to get to the root cause of what was driving the disconnection and develop strategies to alleviate the confusion. BCBSNC used Nexidia's dictionary-independent phonetic indexing and search solution, allowing for all processed audio to be searched for any word or phrase, to create additional structured searches. These searches further classified the call drivers, and when combined with targeted listening, BCBSNC pinpointed the problems.
The findings revealed that literature created by BCBSNC used industry terms that members were unfamiliar with and didn't clearly explain their benefits, claims processes, and deductibles. Additionally, information on the Web site was neither easily located nor understood, and members were unable to "self-serve," resulting in unnecessary contact center interaction. Further, adding to BCBSNC's troubles, when Nexidia's speech analytics combined the unstructured call data with the structured data associated with the call, it showed "confusion calls" had a significantly higher average talk time (ATT), resulting in a higher cost to serve for BCBSNC.
The Results
By listening to, and more specifically understanding, the confusion of its members regarding benefits, BCBSNC began implementing strategies to improve member communication and customer experience. The health plan has developed more reader-friendly literature and simplified the layout to highlight pertinent information. BCBSNC also has implemented Web site redesigns to support easier navigation and education. As a result of the modifications, BCBSNC projects a 10 to 25 percent drop in "confusion calls," resulting in a better customer service experience and a lower cost to serve. Utilizing Nexidia's analytic solution to continuously monitor and track changes will be paramount to BCBSNC's continued success as a leading health plan.
"Because there is so much to do in healthcare today and because of the changes under way in the industry, you really want to invest in the consumer experience so that customers can get the most out of their healthcare coverage," says Gretchen Gray, director of Customer and Consumer Experience at BCBSNC. "I believe that unless you use [Nexidia's] approach, I don't know how you pick your priorities and focus. Speech analytics is one of the main tools we have where we can say, 'here is where we can have the most impact and here's what I need to do better or differently to assist my customers.'"
QUESTIONS FOR DISCUSSION
1. For a large company like BCBSNC with a lot of customers, what does "listening to customers" mean?
2. What were the challenges, the proposed solution, and the obtained results for BCBSNC?
Source: Used with permission from Nexidia.com.
SECTION 7.10 REVIEW QUESTIONS
1. What is speech analytics? How does it relate to sentiment analysis?
2. Describe the acoustic approach to speech analytics.
3. Describe the linguistic approach to speech analytics.
Chapter Highlights
• Text mining is the discovery of knowledge from unstructured (mostly text-based) data sources. Given that a great deal of information is in text form, text mining is one of the fastest growing branches of the business intelligence field.
• Companies use text mining and Web mining to better understand their customers by analyzing their feedback left on Web forms, blogs, and wikis.
• Text mining applications are in virtually every area of business and government, including marketing, finance, healthcare, medicine, and homeland security.
• Text mining uses natural language processing to induce structure into the text collection and then uses data mining algorithms such as classification, clustering, association, and sequence discovery to extract knowledge from it.
• Successful application of text mining requires a structured methodology similar to the CRISP-DM methodology in data mining.
• Text mining is closely related to information extraction, natural language processing, and document summarization.
• Text mining entails creating numeric indices from unstructured text and then applying data mining algorithms to these indices.
• Sentiment can be defined as a settled opinion reflective of one's feelings.
• Sentiment classification usually deals with differentiating between two classes, positive and negative.
• As a field of research, sentiment analysis is closely related to computational linguistics, natural
language processing, and text mining. It may be used to enhance search results produced by search engines.
• Sentiment analysis is trying to answer the question of "What do people feel about a certain topic?" by digging into opinions of many using a variety of automated tools.
• Voice of the customer is an integral part of analytic CRM and customer experience management systems, and is often powered by sentiment analysis.
• Voice of the market is about understanding aggregate opinions and trends at the market level.
• Brand management focuses on listening to social media where anyone can post opinions that can damage or boost your reputation.
• Polarity identification in sentiment analysis is accomplished either by using a lexicon as a reference library or by using a collection of training documents.
• WordNet is a popular general-purpose lexicon created at Princeton University.
• SentiWordNet is an extension of WordNet to be used for sentiment identification.
• Speech analytics is a growing field of science that allows users to analyze and extract information from both live and recorded conversations.
• The acoustic approach to sentiment analysis relies on extracting and measuring a specific set of features (e.g., tone of voice, pitch or volume, intensity and rate of speech) of the audio.
Key Terms
association, classification, clustering, corpus, customer experience management (CEM), deception detection, inverse document frequency, natural language processing (NLP), part-of-speech tagging, polarity identification, polyseme, sentiment, sentiment analysis, SentiWordNet, sequence discovery, singular value decomposition (SVD), speech analytics, stemming, stop words, term-document matrix (TDM), text mining, tokenizing, trend analysis, unstructured data, voice of customer (VOC), voice of the market, WordNet
Questions for Discussion
1. Explain the relationships among data mining, text mining, and sentiment analysis.
2. What should an organization consider before making a decision to purchase text mining software?
3. Discuss the differences and commonalities between text mining and sentiment analysis.
4. In your own words, define text mining and discuss its most popular applications.
5. Discuss the similarities and differences between the data mining process (e.g., CRISP-DM) and the three-step, high-level text mining process explained in this chapter.
6. What does it mean to introduce structure into text-based data? Discuss the alternative ways of introducing structure into text-based data.
7. What is the role of natural language processing in text mining? Discuss the capabilities and limitations of NLP in the context of text mining.
8. List and discuss three prominent application areas for text mining. What is the common theme among the three application areas you chose?
9. What is sentiment analysis? How does it relate to text mining?
10. What are the sources of data for sentiment analysis?
11. What are the common challenges that sentiment analysis has to deal with?
12. What are the most popular application areas for sentiment analysis? Why?
13. How can sentiment analysis be used for brand management?
14. What would be the expected benefits and beneficiaries of sentiment analysis in politics?
15. How can sentiment analysis be used in predicting financial markets?
16. What are the main steps in carrying out sentiment analysis projects?
17. What are the two common methods for polarity identification? What is the main difference between the two?
18. Describe how special lexicons are used in identification of sentiment polarity.
19. What is speech analytics? How does it relate to sentiment analysis?
20. Describe the acoustic approach to speech analytics.
21. Describe the linguistic approach to speech analytics.
Exercises
Teradata University Network (TUN) and Other Hands-On Exercises
1. Visit teradatauniversitynetwork.com. Identify cases about text mining. Describe recent developments in the field. If you cannot find enough cases at the Teradata University Network Web site, broaden your search to other Web-based resources.
2. Go to teradatauniversitynetwork.com to locate white papers, Web seminars, and other materials related to text mining. Synthesize your findings into a short written report.
3. Browse the Web and your library's digital databases to identify articles that make the natural linkage between text/Web mining and contemporary business intelligence systems.
4. Go to teradatauniversitynetwork.com and find a case study named "eBay Analytics." Read the case carefully, extend your understanding of the case by searching the Internet for additional information, and answer the case questions.
5. Go to teradatauniversitynetwork.com and find a sentiment analysis case named "How Do We Fix an App Like That!" Read the description and follow the directions to download the data and the tool to carry out the exercise.
Team Assignments and Role-Playing Projects
1. Examine how textual data can be captured automatically using Web-based technologies. Once captured, what are the potential patterns that you can extract from these unstructured data sources?
2. Interview administrators in your college or executives in your organization to determine how text mining and Web mining could assist them in their work. Write a proposal describing your findings. Include a preliminary cost-benefit analysis in your report.
3. Go to your library's online resources. Learn how to download attributes of a collection of literature (journal articles) in a specific topic. Download and process the data using a methodology similar to the one explained in Application Case 7.5.
4. Find a readily available sentiment text data set (see Technology Insights 7.2 for a list of popular data sets) and
Chapte r 7 • Text Analytics, Text Mining , and Sentiment Analysis 335
download it into your computer. If you have an analytics
too l that is capable of text mining, use that; if no t, download
RapidMine r (rapid-i.com) and install it. Also install the text
analytics add-o n for RapidMiner. Process the downloaded
data using your text mining tool (i.e., conve1t the data into
a structured fom1). Build models and assess the sentiment
detectio n accuracy of seve ral classificatio n models (e.g .,
support vecto r machines, decisio n u·ees, neural ne tworks ,
logistic regression , etc.). Write a de tailed re port w he re you
explain your finings and your expe rie nces.
Internet Exercises
1. Survey some text mining tools and vendors. Start with clearforest.com and megaputer.com. Also consult with dmreview.com and identify some text mining products and service providers that are not mentioned in this chapter.
2. Find recent cases of successful text mining and Web mining applications. Try text and Web mining software vendors and consultancy firms and look for cases or success stories. Prepare a report summarizing five new case studies.
3. Go to statsoft.com. Select Downloads and download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
4. Go to sas.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
5. Go to ibm.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
6. Go to teradata.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
7. Go to fairisaac.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
8. Go to salfordsystems.com. Download at least three white papers on applications. Which of these applications may have used the data/text/Web mining techniques discussed in this chapter?
9. Go to clarabridge.com. Download at least three white papers on applications. Which of these applications may have used text mining in a creative way?
10. Go to kdnuggets.com. Explore the sections on applications as well as software. Find names of at least three additional packages for data mining and text mining.

End-of-Chapter Application Case
BBVA Seamlessly Monitors and Improves its Online Reputation
BBVA is a global group that offers individual and corporate customers a comprehensive range of financial and non-financial products and services. It enjoys a solid leadership position in the Spanish market, where it first began its activities over 150 years ago. It also has a leading franchise in South America; it is the largest financial institution in Mexico, one of the 15 largest U.S. commercial banks, and one of the few large international groups operating in China and Turkey. BBVA employs approximately 104,000 people in over 30 countries around the world, and has more than 47 million customers and 900,000 shareholders.
Looking for tools to reduce reputational risks
BBVA is interested in knowing what existing clients, and possible new ones, think about it through social media. Therefore, the bank has implemented an automated consumer insight solution to monitor and measure the impact of brand perception online, whether this be customer comments on social media sites (Twitter, Facebook, forums, blogs, etc.), the voices of experts in online articles about BBVA and its competitors, or references to BBVA on news sites, in order to detect possible risks to its reputation or possible business opportunities.

Insights derived from this analytical tool give BBVA the opportunity to address reputational challenges and continue to build on positive opinions. For example, the bank can now respond to negative (or positive) brand perception by focusing its communication strategies on particular Internet sites, countering (or backing up) the most outspoken authors on Twitter, boards, and blogs.
Finding a way forward
In 2009, BBVA began monitoring the Web with an IBM social media research asset called Corporate Brand Reputation Analysis (COBRA), as a pilot between IBM and the bank's Innovation department. This pilot proved highly successful for different areas of the bank, including the Communications, Brand & Reputation, Corporate Social Responsibility, Consumer Insight, and Online Banking departments.

The BBVA Communication department then decided to tackle a new project, deploying a single tool that would enable the entire group to analyze online mentions of BBVA and monitor the bank's brand perception in various online communities.

The bank decided to implement IBM Cognos Consumer Insight to unify all its branches worldwide and allow them to use the same samples, models, and taxonomies. IBM Global Business Services is currently helping the bank to implement the solution, as well as design the focus of the analysis adapted to each country's requirements.
IBM Cognos Consumer Insight will allow BBVA to monitor the voices of current and potential clients on social media Web sites such as Twitter, Facebook, and message boards; identify expert opinions about BBVA and its competitors on blogs; and control the presence of the bank in news channels to gain insights and detect possible reputational risks. All this new information will be distributed among the business departments of BBVA, enabling the bank to take a holistic view across all areas of its business.
Seamless focus on online reputation
The solution has now been rolled out in Spain, and BBVA's Online Communications team is already seeing its benefits.

"Huge amounts of data are being posted on Twitter every day, which makes it a great source of information for us," states the Online Communications Department of the bank. "To make effective use of this resource, we needed to find a way to capture, store, and analyze the data in a better, faster, and more detailed fashion. We believe that IBM Cognos Consumer Insight will help us to differentiate and categorize all the data we collect according to pre-established criteria, such as author, date, country, and subject. This enables us to focus only on comments and news items that are actually relevant, whether in a positive, negative, or neutral sense."

The content of the comments is subsequently analyzed using custom Spanish and English dictionaries, in order to identify whether the sentiments expressed are positive or negative. "What is great about this solution is that it helps us to focus our actions on the most important topics of online discussions and immediately plan the correct and most suitable reaction," adds the Department. "By building on what we accomplished in the initial COBRA project, the new solution enables BBVA to seamlessly monitor comments and postings, improve its decision-making processes, and thereby strengthen its online reputation."
"When BBVA detects a negative comment, a reputational risk arises," explains Miguel Iza Moreno, Business Analytics and Optimization Consultant at IBM Global Business Services. "Cognos Consumer Insight provides a reporting system which identifies the origin of a negative statement, and BBVA sets up an internal protocol to decide how to react. This can happen through press releases or direct communication with users; in some cases, no action is deemed to be required, and the solution also highlights those cases in which the negative comment is considered 'irrelevant' or 'harmless'. The same procedure applies to positive comments: the solution allows the bank to follow a standard and structured process which, based on positive insights, enables it to strengthen its reputation."

"Following the successful deployment in Spain, BBVA will be able to easily replicate the Cognos Consumer Insight solution in other countries, providing a single solution that will help to consolidate and reaffirm the bank's reputation management strategy," says the Department.

Tangible Results
Starting with the COBRA pilot project, the solution delivered visible benefits during the first half of 2011. Positive feedback about the company increased by more than one percent, while negative feedback was reduced by 1.5 percent, suggesting that hundreds of customers and stakeholders across Spain are already enjoying a more satisfying experience with BBVA. Moreover, global monitoring improved, providing greater reliability when comparing results between branches and countries. Similar benefits are expected from the Cognos Consumer Insight project, and the initial results are expected shortly.

"BBVA is already seeing a remarkable improvement in the way that information is gathered and analyzed, which we are sure will translate into the same kind of tangible benefits we saw from the COBRA pilot project," states the bank. "For the time being, we have already achieved what we needed the most: a single tool which unifies the online measuring of our business strategies, enabling more detailed, structured, and controlled online data analysis."

QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE
1. How did BBVA use text mining?
2. What were BBVA's challenges? How did BBVA overcome them with text mining and social media analysis?
3. In what other areas, in your opinion, can BBVA use text mining?

Source: IBM Customer Success Story, "BBVA seamlessly monitors and improves its online reputation," http://www-01.ibm.com/software/success/cssdb.nsf/CS/STRD-8NUD29?OpenDocument&Site=corp&cty=en_us (accessed August 2013).

References
Chun, H. W., Y. Tsuruoka, J. D. Kim, R. Shiba, N. Nagata, and T. Hishiki. (2006). "Extraction of Gene-Disease Relations from Medline Using Domain Dictionaries and Machine Learning." Proceedings of the 11th Pacific Symposium on Biocomputing, pp. 4-15.
Cohen, K. B., and L. Hunter. (2008). "Getting Started in Text Mining." PLoS Computational Biology, Vol. 4, No. 1, pp. 1-10.
Coussement, K., and D. Van Den Poel. (2008). "Improving Customer Complaint Management by Automatic Email Classification Using Linguistic Style Features as Predictors." Decision Support Systems, Vol. 44, No. 4, pp. 870-882.
Coussement, K., and D. Van Den Poel. (2009). "Improving Customer Attrition Prediction by Integrating Emotions from Client/Company Interaction Emails and Evaluating Multiple Classifiers." Expert Systems with Applications, Vol. 36, No. 3, pp. 6127-6134.
Delen, D., and M. Crossland. (2008). "Seeding the Survey and Analysis of Research Literature with Text Mining." Expert Systems with Applications, Vol. 34, No. 3, pp. 1707-1720.
Etzioni, O. (1996). "The World Wide Web: Quagmire or Gold Mine?" Communications of the ACM, Vol. 39, No. 11, pp. 65-68.
EUROPOL. (2007). "EUROPOL Work Program 2007." statewatch.org/news/2006/apr/europol-work-programme-2007 (accessed October 2008).
Feldman, R., and J. Sanger. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Boston: ABS Ventures.
Fuller, C. M., D. Biros, and D. Delen. (2008). "Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection." Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80-99.
Ghani, R., K. Probst, Y. Liu, M. Krema, and A. Fano. (2006). "Text Mining for Product Attribute Extraction." SIGKDD Explorations, Vol. 8, No. 1, pp. 41-48.
Grimes, S. (2011, February 17). "Seven Breakthrough Sentiment Analysis Scenarios." InformationWeek.
Han, J., and M. Kamber. (2006). Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Morgan Kaufmann.
Kanayama, H., and T. Nasukawa. (2006). "Fully Automatic Lexicon Expanding for Domain-Oriented Sentiment Analysis." EMNLP: Empirical Methods in Natural Language Processing. trl.ibm.com/projects/textmining/takmi/sentiment_analysis_e.htm.
Kleinberg, J. (1999). "Authoritative Sources in a Hyperlinked Environment." Journal of the ACM, Vol. 46, No. 5, pp. 604-632.
Lin, J., and D. Demner-Fushman. (2005). "'Bag of Words' Is Not Enough for Strength of Evidence Classification." AMIA Annual Symposium Proceedings, pp. 1031-1032. pubmedcentral.nih.gov/articlerender.fcgi?artid=1560897.
Mahgoub, H., D. Rosner, N. Ismail, and F. Torkey. (2008). "A Text Mining Technique Using Association Rules Extraction." International Journal of Computational Intelligence, Vol. 4, No. 1, pp. 21-28.
Manning, C. D., and H. Schutze. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
Masand, B. M., M. Spiliopoulou, J. Srivastava, and O. R. Zaïane. (2002). "Web Mining for Usage Patterns and Profiles." SIGKDD Explorations, Vol. 4, No. 2, pp. 125-132.
McKnight, W. (2005, January 1). "Text Data Mining in Business Intelligence." Information Management Magazine. information-management.com/issues/20050101/1016487-1.html (accessed May 22, 2009).
Mejova, Y. (2009). "Sentiment Analysis: An Overview." Comprehensive exam paper. www.cs.uiowa.edu/~ymejova/publications/CompsYelenaMejova.pdf (accessed February 2013).
Miller, T. W. (2005). Data and Text Mining: A Business Applications Approach. Upper Saddle River, NJ: Prentice Hall.
Nakov, P., A. Schwartz, B. Wolf, and M. A. Hearst. (2005). "Supporting Annotation Layers for Natural Language Processing." Proceedings of the ACL, interactive poster and demonstration sessions, Ann Arbor, MI. Association for Computational Linguistics, pp. 65-68.
Nasraoui, O., M. Spiliopoulou, J. Srivastava, B. Mobasher, and B. Masand. (2006). "WebKDD 2006: Web Mining and Web Usage Analysis Post-Workshop Report." ACM SIGKDD Explorations Newsletter, Vol. 8, No. 2, pp. 84-89.
Nexidia. (2009). "State of the Art: Sentiment Analysis." Nexidia White Paper, http://nexidia.com/files/resource_files/nexidia_sentiment_analysis_wp_8269.pdf (accessed February 2013).
Pang, B., and L. Lee. (2008). "Opinion Mining and Sentiment Analysis." Now Publishers. http://books.google.com.
Peterson, E. T. (2008). "The Voice of Customer: Qualitative Data as a Critical Input to Web Site Optimization." foreseeresults.com/Form_Epeterson_WebAnalytics.html (accessed May 22, 2009).
Shatkay, H., A. Hoglund, S. Brady, T. Blum, P. Donnes, and O. Kohlbacher. (2007). "SherLoc: High-Accuracy Prediction of Protein Subcellular Localization by Integrating Text and Protein Sequence Data." Bioinformatics, Vol. 23, No. 11, pp. 1410-1417.
SPSS. "Merck Sharp & Dohme." spss.com/success/template_view.cfm?Story_ID=185 (accessed May 15, 2009).
StatSoft. (2009). Statistica Data and Text Miner User Manual. Tulsa, OK: StatSoft, Inc.
Turetken, O., and R. Sharda. (2004). "Development of a Fisheye-Based Information Search Processing Aid (FISPA) for Managing Information Overload in the Web Environment." Decision Support Systems, Vol. 37, No. 3, pp. 415-434.
Weng, S. S., and C. K. Liu. (2004). "Using Text Classification and Multiple Concepts to Answer E-Mails." Expert Systems with Applications, Vol. 26, No. 4, pp. 529-543.
Zhou, Y., E. Reid, J. Qin, H. Chen, and G. Lai. (2005). "U.S. Domestic Extremist Groups on the Web: Link and Content Analysis." IEEE Intelligent Systems, Vol. 20, No. 5, pp. 44-51.
CHAPTER 8
Web Analytics, Web Mining, and Social Analytics
LEARNING OBJECTIVES
• Define Web mining and understand its
taxonomy and its application areas
• Differentiate between Web content
mining and Web structure mining
• Understand the internals of Web search
engines
• Learn the details about search engine
optimization
• Define Web usage mining and learn its
business application
• Describe the Web analytics maturity
model and its use cases
• Understand social networks and social
analytics and their practical applications
• Define social network analysis and
become familiar with its application
areas
• Understand social media analytics and
its use for better customer engagement
This chapter is all about Web mining and its application areas. As you will see, Web mining is one of the fastest growing technologies in business intelligence and business analytics. Under the umbrella of Web mining, in this chapter, we will cover Web analytics, search engines, social analytics, and their enabling methods, algorithms, and technologies.
8.1 OPENING VIGNETTE: Security First Insurance Deepens Connection with Policyholders 339
8.2 Web Mining Overview 341
8.3 Web Content and Web Structure Mining 344
8.4 Search Engines 347
8.5 Search Engine Optimization 354
8.6 Web Usage Mining (Web Analytics) 358
8.7 Web Analytics Maturity Model and Web Analytics Tools 366
8.8 Social Analytics and Social Network Analysis 373
8.9 Social Media Definitions and Concepts 377
8.10 Social Media Analytics 380
8.1 OPENING VIGNETTE: Security First Insurance
Deepens Connection with Policyholders
Security First Insurance is one of the largest homeowners’ insurance companies in Florida.
Headquartered in Ormond Beach, it employs more than 80 insurance professionals to
serve its nearly 190,000 customers.
CHALLENGE
Being There for Customers Storm After Storm, Year After Year
Florida has more property and people exposed to hurricanes than any state in the country. Each year, the Atlantic Ocean averages 12 named storms and nine named hurricanes.
Security First is one of a few Florida homeowners’ insurance companies that has the
financial strength to withstand multiple natural disasters. “One of our promises is to be
there for our customers, storm after storm, year after year,” says Werner Kruck, chief
operating officer for Security First.
During a typical month, Security First processes 700 claims. However, in the after-
math of a hurricane, that number can swell to tens of thousands within days. It can be a
challenge for the company to quickly scale up to handle the influx of customers trying
to file post-storm insurance claims for damaged property and possessions. In the past,
customers submitted claims primarily by phone and sometimes email. Today, policyhold-
ers use any means available to connect with an agent or claims representative, including
posting a question or comment on the company’s Facebook page or Twitter account.
Although Security First provides ongoing monitoring of its Facebook and Twitter
accounts, as well as its multiple email addresses and call centers, the company knew that
the communication volume after a major storm required a more aggressive approach. “We
were concerned that if a massive number of customers contacted us through email or social
media after a hurricane, we would be unable to respond quickly and appropriately,” Kruck
says. "We need to be available to our customers in whatever way they want to contact us."
In addition, Security First recognized the need to integrate its social media responses into
the claims process and document those responses to comply with industry regulations.
SOLUTION
Providing Responsive Service No Matter How Customers Get in Touch
Security First contacted IBM Business Partner Integritie for help with harnessing social media to improve the customer experience. Integritie configured a solution built on key IBM Enterprise Content Management software components, featuring IBM Content Analytics with Enterprise Search, IBM Content Collector for Email, and IBM® FileNet® Content Manager software. Called Social Media Capture (SMC4), the Integritie solution offers four critical capabilities for managing social media platforms: capture, control, compliance, and communication. For example, the SMC4 solution logs all social networking interaction for Security First, captures content, monitors incoming and outgoing messages, and archives all communication for compliance review.

Because the solution uses open IBM Enterprise Content Management software, Security First can easily link it to critical company applications, databases, and processes. For example, Content Collector for Email software automatically captures email content and attachments and sends an email back to the policyholder acknowledging receipt. In addition, Content Analytics with Enterprise Search software sifts through and analyzes the content of customers' posts and emails. The software then captures information gleaned from this analysis directly into claims documents to begin the claims process. Virtually all incoming communication from the company's web, the Internet, and emails is pulled into a central FileNet Content Manager software repository to maintain, control, and link to the appropriate workflow. "We can bring the customer conversation and any pictures and attachments into our policy and claims management system and use it to trigger our claims process and add to our documentation," says Kruck.
Prioritizing Communications with Access to Smarter Content
People whose homes have been damaged or destroyed by a hurricane are often displaced quickly, with little more than the clothes on their backs. Grabbing an insurance policy on the way out the door is often an afterthought. They're relying on their insurance companies to have the information they need to help them get their lives back in order as quickly as possible. When tens of thousands of policyholders require assistance within a short period of time, Security First must triage requests quickly. The Content Analytics with Enterprise Search software that anchors the SMC4 solution provides the information necessary to help the company identify and address the most urgent cases first. The software automatically sifts through data in email and social media posts, tweets, and comments using text mining, text analytics, natural language processing, and sentiment analytics to detect words and tones that identify significant property damage or that convey distress. Security First can then prioritize the messages and route them to the proper personnel to provide reassurance, handle complaints, or process a claim. "With access to smarter content, we can respond to our customers in a more rapid, efficient and personalized way," says Kruck. "When customers are having a bad experience, it's really important to get to them quickly with the level of assistance appropriate to their particular situations."
RESULTS
Successfully Addressing Potential Compliance Issues
Companies in all industries must stay compliant with new and emerging regulatory requirements regarding social media. The text analysis capabilities provided in the IBM software help Security First filter inappropriate incoming communications and audit outbound communications, avoiding potential issues with message content. The company can be confident that the responses its employees provide are compliant and controlled based on both Security First policies and industry regulations.

Security First can designate people or roles in the organization that are authorized to create and submit responses. The system automatically verifies these designations and analyzes outgoing message content, stopping any ineffective or questionable communications for further review. "Everything is recorded for compliance, so we can effectively track and maintain the process. We have the ability to control which employees respond, their level of authority, and the content of their responses," says Kruck.

These capabilities give Security First the confidence to expand its use of social media. Because compliance is covered, the company can focus on additional opportunities for direct dialog with customers. Before this solution, Security First filtered customer communications through agents. Now it can reach out to customers directly and proactively as a company.

"We're one of the first insurance companies in Florida to make ourselves available to customers whenever, wherever, and however they choose to communicate. We're also managing internal processes more effectively and proactively, reaching out to customers in a controlled and compliant manner," says Kruck.
Some of the prevailing business benefits of creative use of Web and social analytics
include:
• Turns social media into an actionable communications channel during a major
disaster
• Speeds claims processes by initiating claims with information from email and social
media posts
• Facilitates prioritizing urgent cases by analyzing social media content for sentiments
• Helps ensure compliance by automatically documenting social media communications
QUESTIONS FOR THE OPENING VIGNETTE
1. What does Security First do?
2. What were the main challenges Security First was facing?
3. What was the proposed solution approach? What types of analytics were integrated in the solution?
4. Based on what you learn from the vignette, what do you think are the relationships
between Web analytics, text mining, and sentiment analysis?
5. What were the results Security First obtained? Were any surprising benefits realized?
WHAT WE CAN LEARN FROM THIS VIGNETTE
Web analytics is becoming a way of life for many businesses, especially the ones that are directly facing the consumers. Companies are expected to find new and innovative ways to connect with their customers, understand their needs, wants, and opinions, and proactively develop products and services that fit well with them. In this day and age, asking customers to tell you exactly what they like and dislike is not a viable option. Instead, businesses are expected to deduce that information by applying advanced analytics tools to invaluable data generated on the Internet and social media sites (along with corporate databases). Security First realized the need to revolutionize their business processes to be more effective and efficient in the way that they deal with their customers and customer claims. They not only used what the Internet and social media have to offer, but also tapped into the customer call records/recordings and other relevant transaction databases. This vignette illustrates the fact that analytics technologies are advanced enough to bring together many different data sources to create a holistic view of the customer. And that is perhaps the greatest success criterion for today's businesses. In the following sections, you will learn about many of the Web-based analytical techniques that make it all happen.
Source: IBM Customer Success Story, "Security First Insurance deepens connection with policyholders," http://www-01.ibm.com/software/success/cssdb.nsf/CS/SAKG-975H4N?OpenDocument&Site=default&cty=en_us (accessed August 2013).
8.2 WEB MINING OVERVIEW
The Internet has forever changed the landscape of business as we know it. Because of the highly connected, flattened world and broadened competitive field, today's companies are increasingly facing greater opportunities (being able to reach customers and markets that they may have never thought possible) and bigger challenges (a globalized and ever-changing competitive marketplace). The ones with the vision and capabilities to deal with such a volatile environment are greatly benefiting from it, while others who resist are having a hard time surviving. Having an engaged presence on the Internet is not a choice anymore: It is a business requirement. Customers are expecting companies to offer their products and/or services over the Internet. They are not only buying products and services but also talking about companies and sharing their transactional and usage experiences with others over the Internet.
The growth of the Internet and its enabling technologies has made data creation, data collection, and data/information/opinion exchange easier. Delays in service, manufacturing, shipping, delivery, and customer inquiries are no longer private incidents that are accepted as necessary evils. Now, thanks to social media tools and technologies on the Internet, everybody knows everything. Successful companies are the ones who embrace these Internet technologies and use them for the betterment of their business processes so that they can better communicate with their customers, understanding their needs and wants and serving them thoroughly and expeditiously. Being customer focused and keeping customers happy have never been as important a concept for businesses as they are now, in this age of the Internet and social media.
The World Wide Web (or, for short, the Web) serves as an enormous repository of data and information on virtually everything one can conceive: business, personal, you name it; an abundant amount of it is there. The Web is perhaps the world's largest data and text repository, and the amount of information on the Web is growing rapidly. A lot of interesting information can be found online: whose homepage is linked to which other pages, how many people have links to a specific Web page, and how a particular site is organized. In addition, each visitor to a Web site, each search on a search engine, each click on a link, and each transaction on an e-commerce site create additional data. Although unstructured textual data in the form of Web pages coded in HTML or XML is the dominant content of the Web, the Web infrastructure also contains hyperlink information (connections to other Web pages) and usage information (logs of visitors' interactions with Web sites), all of which provide rich data for knowledge discovery. Analysis of this information can help us make better use of Web sites and also aid us in enhancing relationships and value for the visitors to our own Web sites.
Because of its sheer size and complexity, mining the Web is not an easy undertaking by any means. The Web also poses great challenges for effective and efficient knowledge discovery (Han and Kamber, 2006):
• The Web is too big for effective data mining. The Web is so large and growing so rapidly that it is difficult to even quantify its size. Because of the sheer size of the Web, it is not feasible to set up a data warehouse to replicate, store, and integrate all of the data on the Web, making data collection and integration a challenge.
• The Web is too complex. The complexity of a Web page is far greater than a page in a traditional text document collection. Web pages lack a unified structure. They contain far more authoring style and content variation than any set of books, articles, or other traditional text-based documents.
• The Web is too dynamic. The Web is a highly dynamic information source. Not only does the Web grow rapidly, but its content is constantly being updated. Blogs, news stories, stock market results, weather reports, sports scores, prices, company advertisements, and numerous other types of information are updated regularly on the Web.
• The Web is not specific to a domain. The Web serves a broad diversity of communities and connects billions of workstations. Web users have very different backgrounds, interests, and usage purposes. Most users may not have good knowledge of the structure of the information network and may not be aware of the heavy cost of a particular search that they perform.
• The Web has everything. Only a small portion of the information on the Web is truly relevant or useful to someone (or some task). It is said that 99 percent of the information on the Web is useless to 99 percent of Web users. Although this may not seem obvious, it is true that a particular person is generally interested in only a tiny portion of the Web, whereas the rest of the Web contains information that is uninteresting to the user and may swamp desired results. Finding the portion of the Web that is truly relevant to a person and the task being performed is a prominent issue in Web-related research.
These challenges have prompted many research efforts to enhance the effectiveness and efficiency of discovering and using data assets on the Web. A number of index-based Web search engines constantly search the Web and index Web pages under certain keywords. Using these search engines, an experienced user may be able to locate documents by providing a set of tightly constrained keywords or phrases. However, a simple keyword-based search engine suffers from several deficiencies. First, a topic of any breadth can easily contain hundreds or thousands of documents. This can lead to a large number of document entries returned by the search engine, many of which are marginally relevant to the topic. Second, many documents that are highly relevant to a topic may not contain the exact keywords defining them. As we will cover in more detail later in this chapter, compared to keyword-based Web search, Web mining is a prominent (and more challenging) approach that can be used to substantially enhance the power of Web search engines because Web mining can identify authoritative Web pages, classify Web documents, and resolve many ambiguities and subtleties raised in keyword-based Web search engines.
Web mining (or Web data mining) is the process of discovering intrinsic relationships (i.e., interesting and useful information) from Web data, which are expressed in the form of textual, linkage, or usage information. The term Web mining was first used by Etzioni (1996); today, many conferences, journals, and books focus on Web data mining. It is a continually evolving area of technology and business practice. Web mining is essentially the same as data mining that uses data generated over the Web. The goal is to turn vast repositories of business transactions, customer interactions, and Web site usage data into actionable information (i.e., knowledge) to promote better decision making throughout the enterprise. Because of the increased popularity of the term analytics, nowadays many have started to call Web mining Web analytics. However, these two terms are not the same. Although Web analytics is primarily Web site usage data focused, Web mining is inclusive of all data generated via the Internet, including transaction, social, and usage data. While Web analytics aims to describe what has happened on the Web site (employing a predefined, metrics-driven descriptive analytics methodology), Web mining aims to discover previously unknown patterns and relationships (employing a novel predictive or prescriptive analytics methodology). From a big-picture perspective, Web analytics can be considered a part of Web mining.
Figure 8.1 presents a simple taxonomy of Web mining, where it is divided into three main areas: Web content mining, Web structure mining, and Web usage mining. In the figure, the data sources used in these three main areas are also specified. Although these three areas are shown separately, as you will see in the following section, they are often used collectively and synergistically to address business problems and opportunities.

As Figure 8.1 indicates, Web mining relies heavily on data mining and text mining and their enabling tools and techniques, which we have covered in detail in the previous two chapters (Chapters 6 and 7). The figure also indicates that these three generic areas are further extended into several very well-known application areas. Some of these areas were explained in the previous chapters, and some of the others will be covered in detail in this chapter.
SECTION 8.2 REVIEW QUESTIONS
1. What are some of the main challenges the Web poses for knowledge discovery?
2. What is Web mining? How does it differ from regular data mining or text mining?
3. What are the three main areas of Web mining?
4. Identify three application areas for Web mining (at the bottom of Figure 8.1). Based
on your own experiences, comment on their use cases in business settings.
[Figure: a taxonomy diagram in which WEB MINING, built on data mining and text mining, divides into Web Content Mining (source: the unstructured textual content of Web pages, usually in HTML format), Web Structure Mining (source: the uniform resource locator (URL) links contained in the Web pages), and Web Usage Mining (source: the detailed description of a Web site's visits, as sequences of clicks by sessions). These areas extend into application areas such as Sentiment Analysis, Information Retrieval, Graph Mining, Web Analytics, Social Analytics, Log Analysis, Customer Analytics, and 360 Customer View.]

FIGURE 8.1 A Simple Taxonomy of Web Mining.
8.3 WEB CONTENT AND WEB STRUCTURE MINING
Web content mining refers to the extraction of useful information from Web pages. The documents may be extracted in some machine-readable format so that automated techniques can extract some information from these Web pages. Web crawlers (also called spiders) are used to read through the content of a Web site automatically. The information gathered may include document characteristics similar to what are used in text mining, but it may also include additional concepts, such as the document hierarchy. Such an automated (or semiautomated) process of collecting and mining Web content can be used for competitive intelligence (collecting intelligence about competitors' products, services, and customers). It can also be used for information/news/opinion collection and summarization, sentiment analysis, automated data collection, and structuring for predictive modeling. As an illustrative example of using Web content mining as an automated data collection tool, consider the following. For more than 10 years, two of the three authors of this book (Drs. Sharda and Delen) have been developing models to predict the financial success of Hollywood movies before their theatrical release. The data that they use for training the models come from several Web sites, each of which has a different hierarchical page structure. Collecting a large set of variables on thousands of movies (from the past several years) from these Web sites is a time-demanding, error-prone process. Therefore, they use Web content mining and spiders as an enabling technology to automatically collect, verify, validate (if the specific data item is available on more than one Web site, then the values are validated against each other and anomalies are captured and recorded), and store these values in a relational database. That way, they ensure the quality of the data while saving valuable time (days or weeks) in the process.
In addition to text, Web pages also contain hyperlinks pointing from one page to another. Hyperlinks contain a significant amount of hidden human annotation that can potentially help to automatically infer the notion of centrality or authority. When a Web page developer includes a link pointing to another Web page, this may be regarded as the developer's endorsement of the other page. The collective endorsement of a given page by different developers on the Web may indicate the importance of the page and may naturally lead to the discovery of authoritative Web pages (Miller, 2005). Therefore, the vast amount of Web linkage information provides a rich collection of information about the relevance, quality, and structure of the Web's contents, and thus is a rich source for Web mining.
Web content mining can also be used to enhance the results produced by search engines. In fact, search is perhaps the most prevailing application of Web content mining and Web structure mining. A search on the Web to obtain information on a specific topic (presented as a collection of keywords or a sentence) usually returns a few relevant, high-quality Web pages and a larger number of unusable Web pages. Use of a relevance index based on keywords and authoritative pages (or some measure of it) will improve the search results and ranking of relevant pages. The idea of authority (or authoritative pages) stems from earlier information retrieval work using citations among journal articles to evaluate the impact of research papers (Miller, 2005). Though that was the origin of the idea, there are significant differences between the citations in research articles and hyperlinks on Web pages. First, not every hyperlink represents an endorsement (some links are created for navigation purposes and some are for paid advertisement). While this is true, if the majority of the hyperlinks are of the endorsement type, then the collective opinion will still prevail. Second, for commercial and competitive interests, one authority will rarely have its Web page point to rival authorities in the same domain. For example, Microsoft may prefer not to include links on its Web pages to Apple's Web sites, because this may be regarded as an endorsement of its competitor's authority. Third, authoritative pages are seldom particularly descriptive. For example, the main Web page of Yahoo! may not contain the explicit self-description that it is in fact a Web search engine.
The structure of Web hyperlinks has led to another important category of Web pages called a hub. A hub is one or more Web pages that provide a collection of links to authoritative pages. Hub pages may not be prominent, and only a few links may point to them; however, they provide links to a collection of prominent sites on a specific topic of interest. A hub could be a list of recommended links on an individual's home page, recommended reference sites on a course Web page, or a professionally assembled resource list on a specific topic. Hub pages play the role of implicitly conferring authority on a narrow field. In essence, a close symbiotic relationship exists between good hubs and authoritative pages; a good hub is good because it points to many good authorities, and a good authority is good because it is being pointed to by many good hubs. Such relationships between hubs and authorities make it possible to automatically retrieve high-quality content from the Web.
The most popular publicly known and referenced algorithm used to calculate hubs and authorities is hyperlink-induced topic search (HITS). It was originally developed by Kleinberg (1999) and has since been improved on by many researchers. HITS is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them. In the context of Web search, the HITS algorithm collects a base document set for a specific query. It then recursively calculates the hub and authority values for each document. To gather the base document set, a root set that matches the query is fetched from a search engine. For each document retrieved, a set of documents that points to the original document and another set of documents that is pointed to by the original document are added to the set as the original document's neighborhood. A recursive process of document identification and link analysis continues until the hub and authority values converge. These values are then used to index and prioritize the document collection generated for a specific query.
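To make the recursive hub/authority calculation concrete, the following short Python sketch runs the core HITS iteration on a toy link graph. The graph, the node labels, and the fixed iteration count are illustrative assumptions; a real implementation would operate on the query's base document set and test for convergence explicitly.

```python
# A minimal sketch of the HITS iteration on a toy Web graph.
# The nodes and links here are hypothetical; a real system would build
# this graph from the query's base document set.
import math

links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):            # iterate until the values (roughly) converge
    # Authority of a page = sum of hub scores of the pages linking to it.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub score of a page = sum of authority scores of the pages it links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores do not grow without bound.
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # best authorities first
```

In this toy graph, page C ends up with the highest authority score because three different pages (acting as hubs) point to it, which mirrors the mutual reinforcement between hubs and authorities described above.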
Web structure mining is the process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs, which are the cornerstones of the contemporary page-rank algorithms that are central to popular search engines such as Google and Yahoo!. Just as links going to a Web page may indicate a site's popularity (or authority), links within the Web page (or the complete Web site) may indicate the depth of coverage of a specific topic. Analysis of links is very important in understanding the interrelationships among large numbers of Web pages, leading to a better understanding of a specific Web community, clan, or clique. Application Case 8.1 describes a project that used both Web content mining and Web structure mining to better understand how U.S. extremist groups are connected.
SECTION 8.3 REVIEW QUESTIONS
1. What is Web content mining? How can it be used for competitive advantage?
2. What is an "authoritative page"? What is a "hub"? What is the difference between the two?
3. What is Web structure mining? How does it differ from Web content mining?
Application Case 8.1
Identifying Extremist Groups with Web Link and Content Analysis
We normally search for answers to our problems outside of our immediate environment. Often, however, the trouble stems from within. In taking action against global terrorism, domestic extremist groups often go unnoticed. However, domestic extremists pose a significant threat to U.S. security because of the information they possess, as well as their increasing ability, through the use of the Internet, to reach out to extremist groups around the world.

Keeping tabs on the content available on the Internet is difficult. Researchers and authorities need superior tools to analyze and monitor the activities of extremist groups. Researchers at the University of Arizona, with support from the Department of Homeland Security and other agencies, have developed a Web mining methodology to find and analyze Web sites operated by domestic extremists in order to learn about these groups through their use of the Internet. Extremist groups use the Internet to communicate, to access private messages, and to raise money online.

The research methodology begins by gathering a superior-quality collection of relevant extremist and terrorist Web sites. Hyperlink analysis is performed, which leads to other extremist and terrorist Web sites. The interconnectedness with other Web sites is crucial in estimating the similarity of the objectives of various groups. The next step is content analysis, which further codifies these Web sites based on various attributes, such as communications, fundraising, and ideology sharing, to name a few.

Based on link analysis and content analysis, researchers have identified 97 Web sites of U.S. extremist and hate groups. Often, the links between these communities do not necessarily represent any cooperation between them. However, finding numerous links between common interest groups helps in clustering the communities under a common banner. Further research using data mining to automate the process has a global aim, with the goal of identifying links between international hate and extremist groups and their U.S. counterparts.
QUESTIONS FOR DISCUSSION
1. How can Web link/content analysis be used to identify extremist groups?
2. What do you think are the challenges and the potential solutions to such intelligence-gathering activities?

Source: Y. Zhou, E. Reid, J. Qin, H. Chen, and G. Lai, "U.S. Domestic Extremist Groups on the Web: Link and Content Analysis," IEEE Intelligent Systems, Vol. 20, No. 5, September/October 2005, pp. 44-51.
8.4 SEARCH ENGINES
In this day and age, there is no denying the importance of Internet search engines. As the size and complexity of the World Wide Web increase, finding what you want is becoming a complex and laborious process. People use search engines for a variety of reasons. We use them to learn about a product or a service before committing to buy (including who else is selling it, what the prices are at different locations/sellers, the common issues people are discussing about it, how satisfied previous buyers are, what other products or services might be better, etc.) and to search for places to go, people to meet, and things to do. In a sense, search engines have become the centerpiece of most Internet-based transactions and other activities. The incredible success and popularity of Google, the most popular search engine company, is a good testament to this claim. What is somewhat a mystery to many is how a search engine actually does what it is meant to do. In simplest terms, a search engine is a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multi-word terms, or a complete sentence) that users have provided that have to do with the subject of their inquiry. Search engines are the workhorses of the Internet, responding to billions of queries in hundreds of different languages every day.

Technically speaking, search engine is the popular term for information retrieval system. Although Web search engines are the most popular, search engines are often used in a context other than the Web, such as desktop search engines or document search engines. As you will see in this section, many of the concepts and techniques that we covered in the text analytics and text mining chapter (Chapter 7) also apply here. The overall goal of a search engine is to return one or more documents/pages (if more than one document/page applies, a rank-ordered list is often provided) that best match the user's query. The two metrics that are often used to evaluate search engines are effectiveness (or quality: finding the right documents/pages) and efficiency (or speed: returning a response quickly). These two metrics tend to work in opposite directions; improving one tends to worsen the other. Often, based on user expectations, search engines focus on one at the expense of the other. Better search engines are the ones that excel in both at the same time. Because search engines not only search but, in fact, find and return the documents/pages, perhaps a more appropriate name for them would be "finding engines."
Anatomy of a Search Engine
Now let us dissect a search engine and look inside it. At the highest level, a search engine system is composed of two main cycles: a development cycle and a responding cycle (see the structure of a typical Internet search engine in Figure 8.2). While one is interfacing with the World Wide Web, the other is interfacing with the user. One can think of the development cycle as a production process (manufacturing and inventorying documents/pages) and the responding cycle as a retailing process (providing customers/users with what they want). In the following sections these two cycles are explained in more detail.

[Figure: in the development cycle, a Web crawler, driven by a scheduler, fetches pages from the World Wide Web and a document indexer places them into a cached/indexed documents database; in the responding cycle, a query analyzer and a document matcher/ranker work against that database to answer user queries.]

FIGURE 8.2 Structure of a Typical Internet Search Engine.
1. Development Cycle
The two main components of the development cycle are the Web crawler and the document indexer. The purpose of this cycle is to create a huge database of documents/pages organized and indexed based on their content and information value. The reason for developing such a repository of documents/pages is quite obvious: Due to its sheer size and complexity, searching the Web to find pages in response to a user query is not practical (or feasible within a reasonable time frame); therefore, search engines "cache the Web" into their database and use the cached version of the Web for searching and finding. Once created, this database allows search engines to rapidly and accurately respond to user queries.
Web Crawler
A Web crawler (also called a spider or a Web spider) is a piece of software that systematically browses (crawls through) the World Wide Web for the purpose of finding and fetching Web pages. Often Web crawlers copy all the pages they visit for later processing by other functions of a search engine.

A Web crawler starts with a list of URLs to visit, which are listed in the scheduler and often are called the seeds. These URLs may come from submissions made by Webmasters or, more often, they come from the internal hyperlinks of previously crawled documents/pages. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit (i.e., the scheduler). URLs in the scheduler are recursively visited according to a set of policies determined by the specific search engine. Because there are large volumes of Web pages, the crawler can only download a limited number of them within a given time; therefore, it may need to prioritize its downloads.
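As a rough illustration of this seed-and-scheduler loop, here is a minimal breadth-first crawler sketch in Python. The seed URL, the page limit, and the use of the requests and BeautifulSoup libraries are assumptions for illustration only; a production crawler would add politeness delays, robots.txt handling, URL deduplication, and the download prioritization policies mentioned above.

```python
# A minimal breadth-first Web crawler sketch (illustrative only).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 20) -> dict:
    """Fetch up to max_pages pages reachable from seed_url; return {url: html}."""
    scheduler = deque([seed_url])   # the "scheduler": URLs waiting to be visited
    visited, fetched = set(), {}
    while scheduler and len(fetched) < max_pages:
        url = scheduler.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                 # skip unreachable pages
        fetched[url] = html          # store the page for the document indexer
        # Extract the hyperlinks and add them to the scheduler.
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            scheduler.append(urljoin(url, anchor["href"]))
    return fetched

# Hypothetical usage: pages = crawl("https://example.com", max_pages=5)
```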
Document Indexer
As the documents are found and fetched by the crawler, they are stored in a temporary staging area for the document indexer to grab and process. The document indexer is responsible for processing the documents (Web pages or document files) and placing them into the document database. In order to convert the documents/pages into the desired, easily searchable format, the document indexer performs the following tasks.
STEP 1: PREPROCESSING THE DOCUMENTS Because the documents fetched by the crawler may all be in different formats, for the ease of processing them further, in this step they all are converted to some type of standard representation. For instance, different content types (text, hyperlink, image, etc.) may be separated from each other, formatted (if necessary), and stored in a place for further processing.
STEP 2: PARSING THE DOCUMENTS This step is essentially the application of text mining (i.e., computational linguistics, natural language processing) tools and techniques to a collection of documents/pages. In this step, first the standardized documents are parsed into their components to identify index-worthy words/terms. Then, using a set of rules, the words/terms are indexed. More specifically, using tokenization rules, the words/terms/entities are extracted from the sentences in these documents. Using proper lexicons, the spelling errors and other anomalies in these words/terms are corrected. Not all the terms are discriminators. The nondiscriminating words/terms (also known as stop words) are eliminated from the list of index-worthy words/terms. Because the same word/term can be in many different forms, stemming is applied to reduce the words/terms to their root forms. Again, using lexicons and other language-specific resources (e.g., WordNet), synonyms and homonyms are identified and the word/term collection is processed before moving into the indexing phase.
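The following minimal Python sketch shows the flavor of this parsing pipeline: tokenization, stop-word removal, and a deliberately naive suffix-stripping stemmer. The tiny stop-word list and the stemming rules are illustrative assumptions; real indexers rely on full lexicons and established stemmers such as Porter's algorithm.

```python
# A toy document-parsing pipeline: tokenize, drop stop words, stem.
import re

STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}  # tiny sample list

def naive_stem(word: str) -> str:
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def parse(document: str) -> list[str]:
    """Return the index-worthy terms of a document."""
    tokens = re.findall(r"[a-z]+", document.lower())     # tokenization
    terms = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [naive_stem(t) for t in terms]                # stemming

print(parse("The crawlers are fetching and indexing pages"))
# -> ['crawler', 'fetch', 'index', 'page']
```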
STEP 3: CREATING THE TERM-BY-DOCUMENT MATRIX In this step, the relationships between the words/terms and documents/pages are identified. The weight can be as simple as assigning 1 for presence or 0 for absence of the word/term in the document/page. Usually more sophisticated weight schemas are used. For instance, as opposed to binary, one may choose to assign the frequency of occurrence (the number of times the same word/term is found in a document) as a weight. As we have seen in Chapter 7, text mining research and practice have clearly indicated that the best weighting may come from the use of term frequency-inverse document frequency (TF-IDF). This weighting measures the frequency of occurrence of each word/term within a document, and then compares that frequency against the frequency of occurrence in the whole document collection. As we all know, not all high-frequency words/terms are good document discriminators, and a good document discriminator in one domain may not be one in another domain. Once the weighting schema is determined, the weights are calculated and the term-by-document index file is created.
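The following Python sketch builds such a term-by-document matrix with TF-IDF weights. The log-based IDF used here is one common formulation (the chapter leaves the exact weighting schema open), and the three toy "documents" are assumed for illustration.

# Building a TF-IDF weighted term-by-document matrix. The log-based IDF
# is one common formulation; real engines tune their own schemas.
import math
from collections import Counter

docs = {
    "d1": ["web", "crawler", "fetch", "page", "page"],
    "d2": ["index", "page", "word", "term"],
    "d3": ["web", "analytics", "page", "metric"],
}

n_docs = len(docs)
doc_freq = Counter()                 # in how many documents a term occurs
for terms in docs.values():
    doc_freq.update(set(terms))

matrix = {}                          # (term, doc) -> weight
for doc_id, terms in docs.items():
    term_freq = Counter(terms)       # occurrences within this document
    for term, tf in term_freq.items():
        idf = math.log(n_docs / doc_freq[term])
        matrix[(term, doc_id)] = tf * idf

# "page" appears in every document, so idf = log(3/3) = 0:
# a high-frequency term that is a poor discriminator gets no weight.
print(matrix[("page", "d1")])     # 0.0
print(matrix[("crawler", "d1")])  # log(3/1), about 1.10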
2. Response Cycle
The two main components of the response cycle are the query analyzer and the document matcher/ranker.
Query Analyzer
The query analyzer is responsible for receiving a search request from the user (via the search engine's Web server interface) and converting it into a standardized data structure, so that it can be easily queried/matched against the entries in the document database. How the query analyzer does what it is supposed to do is quite similar to what the document indexer does (as we have just explained). The query analyzer parses the search string into individual words/terms using a series of tasks that include tokenization, removal of stop words, stemming, and word/term disambiguation (identification of spelling errors, synonyms, and homonyms). The close similarity between the query analyzer and the document indexer is not coincidental. In fact, it is quite logical, because both work off of the document database; one puts in documents/pages using a specific index structure, and the other converts a query string into the same structure so that it can be used to quickly locate the most relevant documents/pages.
Document Matcher/Ranker
This is where the structured query data is matched against the document database to find the most relevant documents/pages and rank them in the order of relevance/importance. The proficiency of this step is perhaps the most important differentiator when search engines are compared to one another. Every search engine has its own (often proprietary) algorithm that it uses to carry out this important step.
The early search engines used a simple keyword match against the document database and returned a list of ordered documents/pages, where the determinant of the order was a function of the number of words/terms matched between the query and the document along with the weights of those words/terms. The quality and the usefulness of the search results were not all that good. Then, in 1997, the creators of Google came up with a new algorithm, called PageRank. As the name implies, PageRank is an algorithmic way to rank-order documents/pages based on their relevance and value/importance. Technology Insights 8.1 provides a high-level description of this patented algorithm. Even
TECHNOLOGY INSIGHTS 8.1 PageRank Algorithm
PageRank, named after Larry Page, one of the two inventors of Google (which started as a research project at Stanford University in 1996), is a link analysis algorithm used by the Google Web search engine. PageRank assigns a numerical weight to each element of a hyperlinked set of documents, such as the ones found on the World Wide Web, with the purpose of measuring its relative importance within a given collection.
It is believed that PageRank has been influenced by citation analysis, in which citations in scholarly works are examined to discover relationships among researchers and their research topics. The applications of citation analysis range from the identification of prominent experts in a given field of study to providing invaluable information for a transparent review of academic achievements, which can be used for merit review, tenure, and promotion decisions. The PageRank algorithm aims to do the same thing: identifying reputable/important/valuable documents/pages that are highly regarded by other documents/pages. A graphical illustration of PageRank is shown in Figure 8.3.
How Does PageRank Work?
Computationally speaking, PageRank extends the citation analysis idea by not counting links from all pages equally and by normalizing by the number of links on a page. PageRank is defined as follows:
Assume page A has pages P1 through Pn pointing to it (with hyperlinks, which are similar to citations in citation analysis). The parameter d is a damping/smoothing factor that can assume values between 0 and 1. Also, C(A) is defined as the number of links going out of page A. The simple formula for the PageRank of page A can be written as follows:
PageRank(A) = (1 - d) + d * Σ_{i=1}^{n} PageRank(P_i) / C(P_i)
[Figure 8.3 shows a small Web graph in which each node's size reflects its PageRank; for example, node B carries 38.4 percent and node C 34.3 percent of the total weight.]
FIGURE 8.3 A Graphical Example for the PageRank Algorithm.
Note that the PageRanks form a probability distribution over Web pages, so the sum of all Web pages' PageRanks will be 1. PageRank(A) can be calculated using a simple iterative algorithm and corresponds to the principal eigenvector of the normalized link matrix of the Web. The algorithm is so computationally efficient that a PageRank for 26 million Web pages can be computed in a few hours on a medium-size workstation (Brin and Page, 2012). Of course, there are more details to the actual calculation of PageRank in Google. Most of those details are either not publicly available or are beyond the scope of this simple explanation.
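The simple iterative algorithm mentioned above can be sketched directly from the formula. The three-page link graph and the damping value d = 0.85 (the value commonly cited for the original algorithm) are illustrative assumptions, not part of the text above.

# Iterative PageRank computation following the formula above.
def pagerank(out_links, d=0.85, iterations=50):
    # out_links maps each page to the list of pages it links to.
    pages = list(out_links)
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    pr = {p: 1.0 for p in pages}                  # initial guess
    for _ in range(iterations):
        # PageRank(A) = (1 - d) + d * sum over pages Pi pointing to A
        # of PageRank(Pi) / C(Pi), exactly as in the formula above
        pr = {p: (1 - d) + d * sum(pr[q] / len(out_links[q])
                                   for q in in_links[p])
              for p in pages}
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))   # C, cited by both A and B, ranks highest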
Justification of the Formulation
PageRank can be thought of as a model of user behavior. It assumes there is a random surfer who is given a Web page at random and keeps clicking on hyperlinks, never hitting back, but eventually getting bored and starting on another random page. The probability that the random surfer visits a page is its PageRank. And the damping factor d is the probability, at each page, that the random surfer will get bored and request another random page. One important variation is to add the damping factor d only to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking.
Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the Web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo!'s homepage would not link to it. The formulation of PageRank handles both of these cases and everything in between by recursively propagating weights through the link structure of the Web.
though PageRank is an innovative way to rank documents/pages, it is an augmentation to the process of retrieving relevant documents from the database and ranking them based on the weights of the words/terms. Google does all of these collectively, and more, to come up with the most relevant list of documents/pages for a given search request. Once an ordered list of documents/pages is created, it is pushed back to the user in an easily digestible format. At this point, users may choose to click on any of the documents in the list, and it may not be the one at the top. If they click on a document/page link that is not at the top of the list, can we then assume that the search engine did not do a good job of ranking them? Perhaps, yes. Leading search engines like Google monitor the performance of their search results by capturing, recording, and analyzing postdelivery user actions and experiences. These analyses often lead to more and more rules to further refine the ranking of the documents/pages so that the links at the top are more preferable to the end users.
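As a small illustration of the early weighted keyword-match ranking described above, the Python sketch below scores documents by summing the index weights of the query terms they contain and sorts the result. The inverted-index dictionary is a made-up example; a modern engine layers PageRank and many other signals on top of such a base score.

# Early-style keyword-match ranking: score each document by the summed
# weights of the matched query terms, then sort by score.
# `index` maps term -> {doc_id: weight}, as a document indexer might produce.
def rank(query_terms, index):
    scores = {}
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

index = {
    "web":     {"d1": 1.1, "d3": 1.1},
    "crawler": {"d1": 1.6},
    "metric":  {"d3": 0.9},
}
print(rank(["web", "crawler"], index))  # d1 first (score ~2.7), then d3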
How Does Google Do It?
Even though the complex low-level computational details are trade secrets and are not known to the public, the high-level structure of the Google search system is well known and quite simple. From the infrastructure standpoint, the Google search system runs on a distributed network of tens of thousands of computers/servers and can, therefore, carry out its heavy workload effectively and efficiently using sophisticated parallel processing algorithms (a method of computation in which many calculations can be distributed to many servers and performed simultaneously, significantly speeding up data processing). At the highest level, the Google search system has three distinct parts (googleguide.com):
1. Googlebot, a Web crawler that roams the Internet to find and fetch Web pages
2. The indexer, which sorts every word on every page and stores the resulting index of words in a huge database
3. The query processor, which compares your search query to the index and recommends the documents that it considers most relevant
1. Googlebot Googlebot is Google's Web crawling robot, which finds and retrieves pages on the Web and hands them off to the Google indexer. It's easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn't traverse the Web at all. It functions, much like your Web browser, by sending a request to a Web server for a Web page, downloading the entire page, and then handing it off to Google's indexer. Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your Web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming Web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual Web server more slowly than it is capable of doing.
When Googlebot fetches a page, it extracts all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most Web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the Web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page on the Web. To keep the index current, Google continuously recrawls popular, frequently changing Web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
2. Google Indexer Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google's index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms. To improve search performance, Google ignores common words, called stop words (such as the, is, on, or, of, a, an, as, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, and converts all letters to lowercase, to improve Google's performance.
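A toy version of this index structure can be sketched in Python under the simplifications named above: a tiny illustrative stop-word list, lowercase tokens, and word positions counted over the raw token stream of each document.

# A sketch of the index structure described: for each term, the documents
# it appears in and the positions within each text.
import re
from collections import defaultdict

STOP_WORDS = {"the", "is", "on", "or", "of", "a", "an", "as"}

def build_index(documents):
    index = defaultdict(lambda: defaultdict(list))  # term -> doc -> positions
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            if word not in STOP_WORDS:              # stop words are discarded
                index[word][doc_id].append(position)
    return index

docs = {"d1": "The indexer sorts every word on every page.",
        "d2": "Stop words do little to narrow a search."}
index = build_index(docs)
print(dict(index["every"]))   # {'d1': [3, 6]}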
3. Google Query Processor The query processor has several parts, including the user interface (search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.
Google uses a proprietary algorithm, called PageRank, to calculate the relative rank order of a given collection of Web pages. PageRank is Google's system for ranking Web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank. Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page.
Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google
closely guards the formulas it uses to calculate relevance; they're tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.
Indexing the full text of the Web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have the search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Because Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear (e.g., in the title, in the URL, in the body, or in links to the page); these options are offered by Google's Advanced Search Form and its search operators.
Understanding the internals of popular search engines helps companies that rely on search engine traffic better design their e-commerce sites to improve their chances of getting indexed and highly ranked by search providers. Application Case 8.2 gives an illustrative example of such a phenomenon, where an entertainment company increased its search-originated customer traffic by 1500 percent.
Application Case 8.2
IGN Increases Search Traffic by 1500 Percent
IGN Entertainment operates the Internet's largest network of destinations for video gaming, entertainment, and community geared toward teens and 18- to 34-year-old males. The company's properties include IGN.com, GameSpy, AskMen.com, RottenTomatoes, FilePlanet, TeamXbox, 3D Gamers, VE3D, and Direct2Drive, comprising more than 70 community sites and a vast array of online forums. IGN Entertainment is also a leading provider of technology for online game play in video games.
The Challenge
When this company contacted SEO Inc. in summer 2003, the site was an established and well-known site in the gaming community. The site also had some good search engine rankings and was getting approximately 2.5 million unique visitors per month. At the time, IGN used a proprietary in-house content management system and a team of content writers. The pages that were generated when new game reviews and information were added to the site were not very well optimized. In addition, there were serious architectural issues with the site, which prevented search engine spiders from thoroughly and consistently crawling the site.
IGN's goals were to "dominate the search rankings for keywords related to any video games and gaming systems reviewed on the site." IGN wanted to rank high in the search engines, and most specifically Google, for any and all game titles and variants on those game titles' phrases. IGN's revenue is generated from advertising sales, so more traffic leads to more inventory for ad sales, more ads being sold, and therefore more revenue. In order to generate more traffic, IGN knew that it needed to be much more visible when people used the search engines.
The Strategy
After several conversations with the IGN team, SEO Inc. created a customized optimization package that was designed to achieve IGN's ranking goals and also fit the client's budget. Because IGN.com had architectural problems and a proprietary CMS (content management system), it was decided that SEO Inc. would work with IGN's IT and Web development team at their location. This allowed SEO Inc. to send its team to the IGN location for several days to learn how the system worked and partner with the in-house programmers to improve the system and, hence, improve search engine optimization. In addition, SEO Inc. created customized SEO best practices and architected these into the proprietary CMS. SEO Inc. also trained IGN's content writers and page developers on SEO best practices. When new games and pages are added to the site, they typically get ranked within weeks, if not days.
(Continued)
Application Case 8.2 (Continued)
The Results
This was a true and quick success story. Organic search engine rankings skyrocketed, and thousands of previously unindexed pages were now being crawled regularly by search engine spiders. Some of the specific results were as follows:
• Unique visitors to the site doubled within the first 2 months after the optimization was completed.
• There was a 1500 percent increase in organic search engine traffic.
• Massive growth in traffic and revenues enabled the acquisition of additional Web properties, including Rottentomatoes.com and Askmen.com.
IGN was acquired by News Corp in September 2005 for $650 million.
QUESTIONS FOR DISCUSSION
1. How did IGN dramatically increase search traffic to its Web portals?
2. What were the challenges, the proposed solution, and the obtained results?
Source: SEO Inc., Customer Case Study, seoinc.com/seo/case-studies/ign (accessed March 2013).
SECTION 8.4 REVIEW QUESTIONS
1. What is a search engine? Why are search engines important for today's businesses?
2. What is the relationship between search engines and text mining?
3. What are the two main cycles in search engines? Describe the steps in each cycle.
4. What is a Web crawler? What is it used for? How does it work?
5. How does a query analyzer work? What is the PageRank algorithm, and how does it work?
8.5 SEARCH ENGINE OPTIMIZATION
Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine's natural (unpaid or organic) search results. In general, the higher ranked on the search results page, and the more frequently a site appears in the search results list, the more visitors it will receive from the search engine's users. As an Internet marketing strategy, SEO considers how search engines work, what people search for, the actual search terms or keywords typed into search engines, and which search engines are preferred by the targeted audience. Optimizing a Web site may involve editing its content, HTML, and associated coding to both increase its relevance to specific keywords and remove barriers to the indexing activities of search engines. Promoting a site to increase the number of backlinks, or inbound links, is another SEO tactic.
In the early days, in order to be indexed, all Webmasters needed to do was submit the address of a page, or URL, to the various engines, which would then send a "spider" to "crawl" that page, extract links to other pages from it, and return information found on the page to the server for indexing. The process, as explained before, involves a search engine spider downloading a page and storing it on the search engine's own server, where a second program, known as an indexer, extracts various information about the page, such as the words it contains and where they are located, as well as any weight for specific words and all the links the page contains, which are then placed into a scheduler for crawling at a later date. Nowadays search engines no longer rely on Webmasters submitting URLs (even though they still can); instead, they proactively and continuously crawl the Web, finding, fetching, and indexing everything about it.
Being indexed by search engines like Google, Bing, and Yahoo! is not good enough for businesses. Getting ranked on the most widely used search engines (see Technology Insights 8.2 for a list of the most widely used search engines) and getting ranked higher than your competitors are what make the difference. A variety of methods can increase the ranking of a Web page within the search results. Cross-linking between pages of the same Web site to provide more links to the most important pages may improve its visibility. Writing content that includes frequently searched keyword phrases, so as to be relevant to a wide variety of search queries, will tend to increase traffic. Updating content so as to keep search engines crawling back frequently can give additional weight to a site. Adding relevant keywords to a Web page's metadata, including the title tag and meta description, will tend to improve the relevancy of a site's search listings, thus increasing traffic. For Web pages that are accessible via multiple URLs, normalizing the URLs by using the canonical link element or redirects can help make sure links to the different versions of the URL all count toward the page's link popularity score.
Methods for Search Engine Optimization
In general, SEO techniques can be classified into two broad categories: techniques that search engines recommend as part of good site design, and techniques of which search engines do not approve. The search engines attempt to minimize the effect of the latter, which is often called spamdexing (also known as search spam, search engine spam, or search engine poisoning). Industry commentators have classified these methods, and the practitioners who employ them, as either white-hat SEO or black-hat SEO
TECHNOLOGY INSIGHTS 8.2 Top 15 Most Popular Search Engines (March 2013)
Here are the 15 most popular search engines as derived from the eBizMBA Rank (ebizmba.com/articles/search-engines), which is a constantly updated average of each Web site's Alexa Global Traffic Rank and U.S. Traffic Rank from both Compete and Quantcast.

Rank  Name            Estimated Unique Monthly Visitors
1     Google          900,000,000
2     Bing            165,000,000
3     Yahoo! Search   160,000,000
4     Ask             125,000,000
5     AOL Search       33,000,000
6     MyWebSearch      19,000,000
7     blekko            9,000,000
8     Lycos             4,300,000
9     Dogpile           2,900,000
10    WebCrawler        2,700,000
11    Info              2,600,000
12    Infospace         2,000,000
13    Search            1,450,000
14    Excite            1,150,000
15    GoodSearch        1,000,000
(Goodman, 2005). White hats tend to produce results that last a long time, whereas black hats anticipate that their sites may eventually be banned, either temporarily or permanently, once the search engines discover what they are doing.
An SEO technique is considered white hat if it conforms to the search engines' guidelines and involves no deception. Because search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White-hat SEO is not just about following guidelines, but about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see. White-hat advice is generally summed up as creating content for users, not for search engines, and then making that content easily accessible to the spiders, rather than attempting to trick the algorithm away from its intended purpose. White-hat SEO is in many ways similar to Web development that promotes accessibility, although the two are not identical.
Black-hat SEO attempts to improve rankings in ways that are disapproved of by the search engines, or that involve deception. One black-hat technique uses text that is hidden, either as text colored similarly to the background, in an invisible div, or positioned off-screen. Another method serves a different page depending on whether the page is being requested by a human visitor or a search engine, a technique known as cloaking. Search engines may penalize sites they discover using black-hat methods, either by reducing their rankings or by eliminating their listings from their databases altogether. Such penalties can be applied either automatically by the search engines' algorithms or through a manual site review. One example was the February 2006 Google removal of both BMW Germany and Ricoh Germany for the use of unapproved practices (Cutts, 2006). Both companies, however, quickly apologized, fixed their practices, and were restored to Google's list.
For some businesses, SEO may generate significant return on investment. However, one should keep in mind that search engines are not paid for organic search traffic, their algorithms change constantly, and there are no guarantees of continued referrals. Due to this lack of certainty and stability, a business that relies heavily on search engine traffic can suffer major losses if the search engine decides to change its algorithms and stops sending visitors. According to Google's CEO, Eric Schmidt, in 2010 Google made over 500 algorithm changes, almost 1.5 per day. Because of the difficulty of keeping up with changing search engine rules, companies that rely on search traffic practice one or more of the following: (1) hire a company that specializes in search engine optimization (there seems to be an abundance of those nowadays) to continuously improve the site's appeal to the changing practices of the search engines; (2) pay the search engine providers to be listed in the paid sponsor sections; and (3) reduce their dependence on search engine traffic altogether.
Whether visitors originate from a search engine (organically or otherwise) or come from other sites and places, what is most important for an e-commerce site is to maximize the likelihood of customer transactions. Having a lot of visitors without sales is not what a typical e-commerce site is built for. Application Case 8.3 is about a large Internet-based shopping mall where detailed analysis of customer behavior (using clickstreams and other data sources) is used to significantly improve the conversion rate.
SECTION 8.5 REVIEW QUESTIONS
1. What is "search engine optimization"? Who benefits from it?
2. Describe the old and new ways of indexing performed by search engines.
3. What are the things that help Web pages rank higher in the search engine results?
4. What are the most commonly used methods for search engine optimization?
Application Case 8.3
Understanding Why Customers Abandon Shopping Carts Results in a $10 Million Sales Increase
Lotte.com, the leading Internet shopping mall in Korea with 13 million customers, has developed an integrated Web traffic analysis system using SAS for Customer Experience Analytics. As a result, Lotte.com has been able to improve the online experience for its customers, as well as generate better returns from its marketing campaigns. Now, Lotte.com executives can confirm results anywhere, anytime, as well as make immediate changes.
With almost 1 million Web site visitors each day, Lotte.com needed to know how many visitors were making purchases and which channels were bringing the most valuable traffic. After reviewing many diverse solutions and approaches, Lotte.com introduced its integrated Web traffic analysis system using the SAS for Customer Experience Analytics solution. This is the first online behavioral analysis system applied in Korea.
With this system, Lotte.com can accurately measure and analyze Web site visitor numbers (UV), the page view (PV) status of site visitors and purchasers, the popularity of each product category and product, clicking preferences for each page, the effectiveness of campaigns, and much more. This information enables Lotte.com to better understand customers and their behavior online, and to conduct sophisticated, cost-effective targeted marketing.
Commenting on the system, Assistant General Manager Jung Hyo-boon of the Marketing Planning Team for Lotte.com said, "As a result of introducing the SAS system of analysis, many 'new truths' were uncovered around customer behavior, and some of them were 'inconvenient truths.'" He added, "Some site-planning activities that had been undertaken with the expectation of certain results actually had a low reaction from customers, and the site planners had a difficult time recognizing these results."
Benefits
Introducing the SAS for Customer Experience Analytics solution fully transformed the Lotte.com Web site. As a result, Lotte.com has been able to improve the online experience for its customers as well as generate better returns from its marketing campaigns. Now, Lotte.com executives can confirm results anywhere, anytime, as well as make immediate changes.
Since implementing SAS for Customer Experience Analytics, Lotte.com has seen many benefits:
A Jump in Customer Loyalty
A large amount of sophisticated activity information can be collected under a visitor environment, including the quality of traffic. Deputy Assistant General Manager Jung said that "by analyzing actual valid traffic and looking only at one to two pages, we can carry out campaigns to heighten the level of loyalty, and determine a certain range of effect, accordingly." He added, "In addition, it is possible to classify and confirm the order rate for each channel and see which channels have the most visitors."
Optimized Marketing Efficiency Analysis
Rather than analyzing visitor numbers only, the system is capable of analyzing the conversion rate (shopping cart, immediate purchase, wish list, purchase completion) compared to actual visitors for each campaign type (affiliation or e-mail, banner, keywords, and others), so detailed analysis of channel effectiveness is possible. Additionally, it can confirm the most popular search words used by visitors for each campaign type, location, and purchased products. The page overlay function can measure the number of clicks and the number of visitors for each item on a page, to gauge the value of each location on the page. This capability enables Lotte.com to promptly replace or renew low-traffic items.
Enhanced Customer Satisfaction and Customer Experience Lead to Higher Sales
Lotte.com built a customer behavior analysis database that measures each visitor, what pages are visited, how visitors navigate the site, and what activities are undertaken, to enable diverse analysis and improve site efficiency. In addition, the database captures customer demographic information, shopping cart size and conversion rate, number of orders, and number of attempts.
By analyzing which stage of the ordering process deters the most customers and fixing those stages, conversion rates can be increased. Previously, analysis was done only on placed orders. By analyzing the movement pattern of visitors before ordering and at the point where breakaway occurs, customer
(Continued)
Application Case 8.3 (Continued)
behavior can be forecast, and sophisticated marketing activities can be undertaken. Through a pattern analysis of visitors, purchases can be more effectively influenced and customer demand can be reflected in real time to ensure quicker responses. Customer satisfaction has also improved, as Lotte.com has better insight into each customer's behaviors, needs, and interests.
Evaluating the system, Jung commented, "By finding out how each customer group moves on the basis of the data, it is possible to determine customer service improvements and target marketing subjects, and this has aided the success of a number of campaigns." However, the most significant benefit of the system is gaining insight into individual customers and various customer groups. By understanding when customers will make purchases and the manner in which they navigate throughout the Web page, targeted channel marketing and a better customer experience can now be achieved.
Plus, when SAS for Customer Experience Analytics was implemented by Lotte.com's largest overseas distributor, it resulted in a first-year sales increase of 8 million euros (US$10 million) by identifying the causes of shopping-cart abandonment.
Source: SAS, Customer Success Stories, sas.com/success/lotte.html (accessed March 2013).
8.6 WEB USAGE MINING (WEB ANALYTICS)
Web usage mining (also called Web analytics) is the extraction of useful information from data generated through Web page visits and transactions. Masand et al. (2002) state that at least three types of data are generated through Web page visits:
1. Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies
2. User profiles
3. Metadata, such as page attributes, content attributes, and usage data
Analysis of the information collected by Web servers can help us better understand user behavior. Analysis of this data is often called clickstream analysis. By using data and text mining techniques, a company might be able to discern interesting patterns from the clickstreams. For example, it might learn that 60 percent of visitors who searched for "hotels in Maui" had searched earlier for "airfares to Maui." Such information could be useful in determining where to place online advertisements. Clickstream analysis might also be useful for knowing when visitors access a site. For example, if a company knew that 70 percent of software downloads from its Web site occurred between 7 and 11 P.M., it could plan for better customer support and network bandwidth during those hours. Figure 8.4 shows the process of extracting knowledge from clickstream data and how the generated knowledge is used to improve the process, improve the Web site, and, most important, increase the customer value.
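A pattern question like the Maui example above can be answered with a few lines of Python once clickstream events are available. The (visitor, time, action) record layout is a minimal assumed schema, not the format of any particular analytics product.

# Of the visitors who performed one action, what share performed another
# action earlier in their clickstream?
def preceded_share(events, earlier_action, later_action):
    history = {}                                  # visitor -> actions so far
    matched, total = set(), set()
    for visitor, _time, action in sorted(events, key=lambda e: e[1]):
        if action == later_action:
            total.add(visitor)
            if earlier_action in history.get(visitor, set()):
                matched.add(visitor)
        history.setdefault(visitor, set()).add(action)
    return len(matched) / len(total) if total else 0.0

events = [
    ("v1", 1, "search:airfares to maui"), ("v1", 5, "search:hotels in maui"),
    ("v2", 2, "search:hotels in maui"),
    ("v3", 3, "search:airfares to maui"), ("v3", 7, "search:hotels in maui"),
]
print(preceded_share(events, "search:airfares to maui",
                     "search:hotels in maui"))   # prints 0.666...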
Web mining has a wide range of business applications. For instance, Nasraoui (2006) listed the following six most common applications:
1. Determine the lifetime value of clients.
2. Design cross-marketing strategies across products.
3. Evaluate promotional campaigns.
4. Target electronic ads and coupons at user groups based on user access patterns.
5. Predict user behavior based on previously learned rules and users' profiles.
6. Present dynamic information to users based on their interests and profiles.
[Figure 8.4 depicts the Web usage mining process: user/customer interactions captured in Web server logs are preprocessed (collecting, merging, cleaning, and structuring the data; identifying users, sessions, page views, and visits) and then mined to extract knowledge (usage patterns, user profiles, page profiles, visit profiles, customer value), which feeds back into bettering the data, improving the Web site, and increasing the customer value.]
FIGURE 8.4 Extraction of Knowledge from Web Usage Data.
Amazon.com provides an excellent example of how Web usage history can be leveraged dynamically. A registered user who revisits Amazon.com is greeted by name. This is a simple task that involves recognizing the user by reading a cookie (i.e., a small text file written by a Web site on the visitor's computer). Amazon.com also presents the user with a choice of products in a personalized store, based on previous purchases and an association analysis of similar users. It also makes special "Gold Box" offers that are good for a short amount of time. All these recommendations involve a detailed analysis of the visitor as well as the user's peer group, developed through the use of clustering, sequence pattern discovery, association, and other data and text mining techniques.
Web Analytics Technologies
There are numerous tools and technologies for Web analytics in the marketplace. Because of their power to measure, collect, and analyze Internet data to better understand and optimize Web usage, the popularity of Web analytics tools is increasing. Web analytics holds the promise to revolutionize how business is done on the Web. Web analytics is not just a tool for measuring Web traffic; it can also be used as a tool for e-business and market research, and to assess and improve the effectiveness of an e-commerce Web site. Web analytics applications can also help companies measure the results of traditional print or broadcast advertising campaigns. They can help estimate how traffic to a Web site changes after the launch of a new advertising campaign. Web analytics provides information about the number of visitors to a Web site and the number of page views. It helps gauge traffic and popularity trends, which can be used for market research.
There are two main categories of Web analytics: off-site and on-site. Off-site Web analytics refers to Web measurement and analysis about you and your products that takes place outside your Web site. It includes the measurement of a Web site's potential audience (prospects or opportunities), share of voice (visibility or word of mouth), and buzz (comments or opinions) happening on the Internet.
What is more mainstream is on-site Web analytics. Historically, Web analytics has referred to on-site visitor measurement. However, in recent years this distinction has blurred, mainly because vendors are producing tools that span both categories. On-site Web analytics measure visitors' behavior once they are on your Web site. This includes its drivers and conversions (for example, the degree to which different landing pages are associated with
online purchases). On-site Web analytics measure the performance of your Web site in a commercial context. The data collected on the Web site is then compared against key performance indicators and used to improve the Web site's or a marketing campaign's audience response. Even though Google Analytics is the most widely used on-site Web analytics service, there are others provided by Yahoo! and Microsoft, and newer and better tools are emerging constantly that provide additional layers of information.
For on-site Web analytics, there are two technical ways of collecting the data. The first and more traditional method is server log file analysis, where the Web server records file requests made by browsers. The second method is page tagging, which uses JavaScript embedded in the site page code to make image requests to a third-party analytics-dedicated server whenever a page is rendered by a Web browser (or when a mouse click occurs). Both collect data that can be processed to produce Web traffic reports. In addition to these two main streams, other data sources may also be added to augment Web site behavior data. These other sources may include e-mail, direct-mail campaign data, sales and lead history, or social media-originated data. Application Case 8.4 shows how Allegro improved Web site performance by 500 percent with analysis of Web traffic data.
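Whichever collection method is used, the resulting hit records can be aggregated into a basic traffic report. The Python sketch below assumes a simplified (visitor, page, referrer) record layout standing in for real server-log or page-tag data.

# Aggregating collected hits into page views, unique visitors, and
# traffic sources. The record layout is a simplified assumption.
from collections import Counter

hits = [  # (visitor_id, page, referrer)
    ("v1", "/home", "google.com"), ("v1", "/product", "google.com"),
    ("v2", "/home", "(direct)"),
    ("v3", "/home", "partner-site.com"), ("v3", "/product", "partner-site.com"),
]

page_views = Counter(page for _v, page, _ref in hits)
first_source = {}
for visitor, _page, referrer in hits:
    first_source.setdefault(visitor, referrer)  # attribute visit to first referrer

print("page views per URL:", dict(page_views))
print("unique visitors:", len(first_source))
print("visits by source:", dict(Counter(first_source.values())))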
Application Case 8.4
Allegro Boosts Online Click-Through Rates by 500 Percent with Web Analysis
The Allegro Group is headquartered in Poznan, Poland, and is considered the largest non-eBay online marketplace in the world. Allegro, which currently operates over 75 proprietary Web sites in 11 European countries, hosts over 15 million products and generates over 500 million page views per day. The challenge it faced was how to match the right offer to the right customer while still being able to support the extraordinary amount of data it held.
Problem
In today's marketplace, buyers have a wide variety of retail, catalog, and online options for buying their goods and services. Allegro is an e-marketplace with over 20 million customers, who themselves buy from a network of over 30 thousand professional retail sellers using the Allegro network of e-commerce and auction sites. Allegro had been supporting its internal recommendation engine solely by applying rules provided by its re-sellers.
The challenge was for Allegro to increase its income and gross merchandise volume from its current network, as measured by two key performance indicators:
• Click-Through Rate (CTR): The number of clicks on a product ad divided by the number of times the product is displayed.
• Conversion Rate: The number of completed sales transactions of a product divided by the number of customers receiving the product ad.
Solution
The online retail industry has evolved into the premier channel for personalized product recommendations. To succeed in this increasingly competitive e-commerce environment, Allegro realized that it needed to create a new, highly personalized solution integrating predictive analytics and campaign management into a real-time recommendation system.
Allegro decided to apply social network analysis (SNA) as the analytic methodology underlying its product recommendation system. SNA focuses on the relationships or links between nodes (individuals or products) in a network, rather than on the nodes' attributes, as in traditional statistical methods. SNA was used to group similar products into communities based on their commonalities; then, communities were weighted based on visitor click paths, items placed in shopping carts, and purchases to create predictive attributes. The graph in Figure 8.5 displays a few of the product communities generated by Allegro using KXEN's InfiniteInsight Social product for social network analysis (SNA).
[Figure 8.5 shows a network graph of numbered product nodes grouped into communities, with labeled clusters for nuts; spices such as pepper, garlic, cumin, and cinnamon; and vanilla derivatives.]
FIGURE 8.5 The Product Communities Generated by Allegro Using KXEN's InfiniteInsight. Source: KXEN.
Statistical classification models were then built using KXEN InfiniteInsight Modeler to predict conversion propensity for each product based on these SNA product communities and individual customer attributes. These conversion propensity scores are then used by Allegro to define personalized offers presented to millions of Web site visitors in real time.
Some of the challenges Allegro faced in applying social network analysis included:
• The need to build multiple networks, depending on the product group categories
• Very large differences in the frequency distribution of particular products and their popularity (clicks, transactions)
• Automatic setting of optimal parameters, such as the minimum number of occurrences of items (support)
• Automation through scripting
• Overconnected products (best-sellers, mega-hub communities)
Implementing this solution also presented its own challenges, including:
• Different rule sets are produced per Web page placement
• Business owners decide the appropriate weightings of rule sets for each type of placement/business strategy
• Building 160k rules every week
• Automatic conversion of social network analyses into rules and table-ization of rules
Results
As a result of implementing social network analysis in its automated real-time recommendation process, Allegro has seen a marked improvement in all areas.
Today Allegro offers 80 million personalized product recommendations daily, and its page views have increased by over 30 percent. But it is in the
(Continued)
Application Case 8.4 (Continued)
numbers delivered by Allegro's two most critical KPIs that the results are most obvious:
• Click-through rate (CTR) has increased by more than 500 percent as compared to the 'best seller' rules.
• Conversion rates are up by a factor of over 40X.

Rule ID  Antecedent product  Consequent product  Rule support  Rule confidence  Rule KI  Same product community?
1        DIGITAL CAMERA      LENS                21213         20%              0.76     YES
2        DIGITAL CAMERA      MEMORY CARD         3145          18%              0.64     NO
3        PINK SHOES          PINK DRESS          4343          38%              0.55     NO

QUESTIONS FOR DISCUSSION
1. How did Allegro significantly improve click-through rates with Web analytics?
2. What were the challenges, the proposed solution, and the obtained results?
Source: kxen.com/customers/allegro (accessed July 2013).

Web Analytics Metrics
Using a variety of data sources, Web analytics programs provide access to a lot of valuable marketing data, which can be leveraged for better insights to grow your business and better document your ROI. The insight and intelligence gained from Web analytics can be used to effectively manage the marketing efforts of an organization and its various products or services. Web analytics programs provide nearly real-time data, which can document your marketing campaign successes or empower you to make timely adjustments to your current marketing strategies.
While Web analytics provides a broad range of metrics, there are four categories of metrics that are generally actionable and can directly impact your business objectives (TWG, 2013). These categories include:
• Web site usability: How were they using my Web site?
• Traffic sources: Where did they come from?
• Visitor profiles: What do my visitors look like?
• Conversion statistics: What does all this mean for the business?
Web Site Usability
Beginning with your Web site, let's take a look at how well it works for your visitors. This is where you can learn how "user-friendly" it really is and whether or not you are providing the right content.
1. Page views. The most basic of measurements, this metric is usually presented as the "average page views per visitor." If people come to your Web site and don't view many pages, then your Web site may have issues with its design or structure. Another explanation for low page views is a disconnect between the marketing messages that brought them to the site and the content that is actually available.
2. Time on site. Similar to page views, this is a fundamental measurement of a visitor's interaction with your Web site. Generally, the longer a person spends on your Web site, the better it is. That could mean they're carefully reviewing your content, utilizing the interactive components you have available, and building toward an informed decision to buy,
respond, or take the next step you've provided. On the contrary, the time on site also needs to be examined against the number of pages viewed to make sure the visitor isn't spending his or her time trying to locate content that should be more readily accessible.
3. Downloads. This includes PDFs, videos, and other resources you make available to your visitors. Consider how accessible these items are as well as how well they're promoted. If your Web statistics, for example, reveal that 60 percent of the individuals who watch a demo video also make a purchase, then you'll want to strategize to increase viewership of that video.
4. Click map. Most analytics programs can show you the percentage of clicks each item on your Web page received. This includes clickable photos, text links in your copy, downloads, and, of course, any navigation you may have on the page. Are they clicking the most important items?
5. Click paths. Although an assessment of click paths is more involved, it can quickly reveal where you might be losing visitors in a specific process. A well-designed Web site uses a combination of graphics and information architecture to encourage visitors to follow "predefined" paths through your Web site. These are not rigid pathways but rather intuitive steps that align with the various processes you've built into the Web site. One process might be that of "educating" a visitor who has a minimal understanding of your product or service. Another might be a process of "motivating" a returning visitor to consider an upgrade or repurchase. A third process might be structured around items you market online. You'll have as many process pathways in your Web site as you have target audiences, products, and services. Each can be measured through Web analytics to determine how effective it is.
Traffic Sources
Your Web analytics program is an incredible tool for identifying where your Web traffic originates. Basic categories such as search engines, referral Web sites, and visits from bookmarked pages (i.e., direct) are compiled with little involvement by the marketer. With a little effort, however, you can also identify Web traffic that was generated by your various offline or online advertising campaigns.
1. Referral Web sites. Other Web sites that contain links that send visitors directly to your Web site are considered referral Web sites. Your analytics program will identify each referral site your traffic comes from, and a deeper analysis will help you determine which referrals produce the greatest volume, the highest conversions, the most new visitors, etc.
2. Search engines. Data in the search engine category is divided between paid search and organic (or natural) search. You can review the top keywords that generated Web traffic to your site and see if they are representative of your products and services. Depending upon your business, you might want to have hundreds (or thousands) of keywords that draw potential customers. Even the simplest product search can have multiple variations based on how the individual phrases the search query.
3. Direct. Direct searches are attributed to two sources. An individual who bookmarks one of your Web pages in their favorites and clicks that link will be recorded as a direct search. Another source occurs when someone types your URL directly into their browser. This happens when someone retrieves your URL from a business card, brochure, print ad, radio commercial, etc. That's why it's a good strategy to use coded URLs.
4. Offline campaigns. If you utilize advertising options other than Web-based campaigns, your Web analytics program can capture performance data if you include a mechanism for sending those visitors to your Web site. Typically, this is a dedicated URL that you include in your advertisement (i.e., "www.mycompany.com/offer50") that delivers those visitors to a specific landing page. You now have data on how many responded to that ad by visiting your Web site.
5. Online campaigns. If you are running a banner ad campaign, a search engine advertising campaign, or even e-mail campaigns, you can measure individual campaign effectiveness by simply using a dedicated URL, similar to the offline campaign strategy.
Visitor Profiles
One of the ways you can leverage your Web analytics into a really powerful marketing tool is through segmentation. By blending data from different analytics reports, you'll begin to see a variety of user profiles emerge.
1. Keywords. Within your analytics report, you can see what keywords visitors used in search engines to locate your Web site. If you aggregate your keywords by similar attributes, you'll begin to see distinct visitor groups that are using your Web site. For example, the particular search phrase that was used can indicate how well visitors understand your product or its benefits. If they use words that mirror your own product or service descriptions, then they probably are already aware of your offerings from effective advertisements, brochures, etc. If the terms are more general in nature, then your visitor is seeking a solution for a problem and has happened upon your Web site. If this second group of searchers is sizable, then you'll want to ensure that your site has a strong educational component to convince them they've found their answer and then move them into your sales channel.
2. Content groupings. Depending upon how you group your content, you may be able to analyze sections of your Web site that correspond with specific products, services, campaigns, and other marketing tactics. If you conduct a lot of trade shows and drive traffic to your Web site for specific product literature, then your Web analytics will highlight the activity in that section.
3. Geography. Analytics permits you to see where your traffic geographically originates, including country, state, and city locations. This can be especially useful if you use geo-targeted campaigns or want to measure your visibility across a region.
4. Time of day. Web traffic generally has peaks at the beginning of the workday, during lunch, and toward the end of the workday. It's not unusual, however, to find strong Web traffic entering your Web site up until the late evening. You can analyze this data to determine when people browse versus buy and also make decisions on what hours you should offer customer service.
5. Landing page profiles. If you structure your various advertising campaigns properly, you can drive each of your targeted groups to a different landing page, which your Web analytics will capture and measure. By combining these numbers with the demographics of your campaign media, you can know what percentage of your visitors fit each demographic.
Conversion Statistics
Each organization will define a "conversion" according to its specific marketing objectives. Some Web analytics programs use the term "goal" to benchmark certain Web site objectives, whether that be a certain number of visitors to a page, a completed registration form, or an online purchase.
1. New visitors. If you're working to increase visibility, you'll want to study the trends in your new visitors data. Analytics identifies all visitors as either new or returning.
2. Returning visitors. If you're involved in loyalty programs or offer a product that has a long purchase cycle, then your returning visitors data will help you measure progress in this area.
3. Leads. Once a form is submitted and a thank-you page is generated, you have created a lead. Web analytics will permit you to calculate a completion rate (or abandonment rate) by dividing the number of completed forms by the number of Web visitors that came to your page (see the short calculation sketch after this list). A low completion percentage would indicate a page that needs attention.
4. Sales/conversions. Depending upon the intent of your Web site, you can define a "sale" as an online purchase, a completed registration, an online submission, or any number of other Web activities. Monitoring these figures will alert you to any changes (or successes!) that occur further upstream.
5. Abandonment/exit rates. Just as important as those moving through your Web site are those who began a process and quit or came to your Web site and left after a page or two. In the first case, you'll want to analyze where the visitor terminated the process and whether there are a number of visitors quitting at the same place, and then investigate the situation for resolution. In the latter case, a high exit rate on a Web site or a specific page generally indicates an issue with expectations. Visitors click to your Web site based on some message contained in an advertisement, a presentation, etc., and expect some continuity in that message. Make sure you're advertising a message that your Web site can reinforce and deliver.
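As referenced in the "Leads" item above, the completion-rate and abandonment-rate arithmetic is straightforward; the counts in this Python sketch are illustrative.

# Completion rate = completed forms / page visitors; its complement is
# the abandonment rate. The counts below are made-up examples.
def completion_rate(completed_forms, page_visitors):
    if page_visitors == 0:
        return 0.0
    return completed_forms / page_visitors

visitors, completed = 1200, 84
rate = completion_rate(completed, visitors)
print(f"completion rate: {rate:.1%}")       # 7.0%
print(f"abandonment rate: {1 - rate:.1%}")  # 93.0%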
Within each of these items are metrics that can be established for your specific organization. You can create a weekly dashboard that includes specific numbers or percentages that will indicate where you're succeeding, or highlight a marketing challenge that should be addressed. When these metrics are evaluated consistently and used in conjunction with other available marketing data, they can lead you to a highly quantified marketing program. Figure 8.6 shows a Web analytics dashboard created with the freely available Google Analytics tools.
[Figure 8.6 shows a sample Google Analytics dashboard for an online store, with panels for top metrics (revenue, visits), a traffic timeline, top countries by visits and bounce rate, average time on site, and a browser breakdown (Internet Explorer, Chrome, Firefox, Safari, Opera).]
FIGURE 8.6 A Sample Web Analytics Dashboard.
SECTION 8.6 REVIEW QUESTIONS
1. What are the three types of data generated through Web page visits?
2. What is clickstream analysis? What is it used for?
3. What are the main applications of Web mining?
4. What are commonly used Web analytics metrics? What is the importance of metrics?
8.7 WEB ANALYTICS MATURITY MODEL AND WEB ANALYTICS TOOLS
The term "maturity" relates to the degree of proficiency, formality, and optimization of
business models, m oving "ad h oc" practices to formally defined steps and optimal bu si-
ness processes. A maturity model is a formal depiction of critical dimensions and their
competency levels of a business practice. Collectively, these dimensions and levels define
the maturity level of an organizatio n in that area of practice. It often describes an evolu-
tionary improvement path from ad hoc, immature practices to disciplined, mature pro-
cesses with improved quality a nd efficiency.
A good example of maturity models is the BI Maturity Model developed by The Data Warehousing Institute (TDWI). The main purpose of the TDWI BI Maturity Model was to gauge where an organization's data warehousing initiative stands at a point in time and where it should go next. It is represented as a six-stage framework (Management Reporting → Spreadmarts → Data Marts → Data Warehouse → Enterprise Data Warehouse → BI Services). Another related example is the simple business analytics maturity model, moving from simple descriptive measures to predicting future outcomes, to obtaining sophisticated decision systems (i.e., Descriptive Analytics → Predictive Analytics → Prescriptive Analytics).
For Web analytics, perhaps the most comprehensive model was proposed by Stephane Hamel (2009). In this model, Hamel used six dimensions: (1) Management, Governance and Adoption; (2) Objectives Definition; (3) Scoping; (4) The Analytics Team and Expertise; (5) The Continuous Improvement Process and Analysis Methodology; and (6) Tools, Technology and Data Integration. For each dimension he used six levels of proficiency/competence. Figure 8.7 shows Hamel's six dimensions and the respective proficiency levels.
The proficiency/competence levels have different terms/labels for each of the six dimensions, describing specifically what each level means. Essentially, the six levels are indications of analytical maturity ranging from "0-Analytically Impaired" to "5-Analytical Competitor." A short description of each of the six levels of competencies is given here (Hamel, 2009):
1. Impaired: Characterized by the use of out-of-the-box tools and reports; limited resources lacking formal training (hands-on skills) and education (knowledge). Web analytics is used on an ad hoc basis and is of limited value and scope. Some tactical objectives are defined, but results are not well communicated and there are multiple versions of the truth.
2. Initiated: Works with metrics to optimize specific areas of the business (such as marketing or the e-commerce catalogue). Resources are still limited, but the process is getting streamlined. Results are communicated to various business stakeholders (often director level). However, Web analytics might be supporting obsolete business processes and, thus, be limited in the ability to push for optimization beyond the online channel. Success is mostly anecdotal.
3. Operational: Key performance indicators and dashboards are defined and aligned with strategic business objectives. A multidisciplinary team is in place and uses various sources of information such as competitive data, voice of customer, and social media or mobile analysis. Metrics are exploited and explored through segmentation and multivariate testing. The Internet channel is being optimized; personas are being defined.
Results start to appear and be considered at the executive level. Results are centrally
driven, but broadly distributed.
4. Integrated: Analysts can now correlate online and offline data from vari-
ous sources to provide a near 360-degree view of the whole value chain. Optimization
encompasses complete processes, including back-end and front-end. Online activities are
defined from the user perspective and persuasion scenarios are defined. A continuous
improvement process and problem-solving methodologies are prevalent. Insight and rec-
ommendations reach the CXO level.
5. Competitor: This level is characterized by several attributes of companies with a strong analytical culture (Davenport and Harris, 2007):
a. One or more senior executives strongly advocate fact-based decision making and
analytics
b. Widespread use of not just descriptive statistics, but predictive modeling and
complex optimization techniques
c. Substantial use of analytics across multiple business functions or processes
d. Movement toward an enterprise-level approach to managing analytical tools, data, and organizational skills and capabilities.
6. Addicted: This level matches Davenport's "Analytical Competitor" characteristics: deep strategic insight, continuous improvement, integrated, skilled resources, top management commitment, fact-based culture, continuous testing, learning, and, most important, reach far beyond the boundaries of the online channel.
In Figure 8.7, one can mark the level of proficiency in each of the six dimensions to create the organization's maturity model (which would look like a spider diagram).
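As a rough illustration of such a spider diagram, the sketch below plots one organization's scores on Hamel's six dimensions using matplotlib; the dimension labels are abbreviated and the scores are invented for illustration:

```python
import math
import matplotlib.pyplot as plt

dims = ["Management", "Objectives", "Scoping",
        "Team/Expertise", "Methodology", "Tools/Data"]
scores = [3, 2, 3, 1, 2, 4]  # hypothetical proficiency levels (0-5)

# One angle per dimension; repeat the first point to close the polygon.
angles = [2 * math.pi * i / len(dims) for i in range(len(dims))]
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values, marker="o")
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dims)
ax.set_ylim(0, 5)
ax.set_title("Web Analytics Maturity (spider diagram)")
plt.show()
```

A lopsided polygon makes it immediately visible which dimensions lag behind the others.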
[Figure 8.7: a spider diagram with one axis per dimension, each graded from level 0 to level 5. For example, the Resources axis runs from (0) No dedicated resource through (2) Single analyst to (5) Experienced/multidisciplinary; the Methodology axis from (0) No methodology to (5) Agile approach; the Objectives axis from (0) Undefined to (5) Competing on analytics; and the Scope axis from (0) Improvisation to (5) Competing on analytics.]
FIGURE 8.7 A Framework for Web Analytics Maturity Model.
Such an assessment can help organizations better understand on which dimensions they are lagging behind, and take corrective actions to close those gaps.
Web Analytics Tools
There are plenty of Web analytics applications (downloadable software tools and Web-based/on-demand service platforms) in the market. Companies (large, medium, or small) are creating products and services to grab their fair share of the emerging Web analytics marketplace. Most interesting is that many of the most popular Web analytics tools are free: yes, free to download and use for whatever reason, commercial or nonprofit. The following are among the most popular free (or almost free) Web analytics tools:
GOOGLE WEB ANALYTICS (GOOGLE.COM/ANALYTICS) This is a service offered by Google that generates detailed statistics about a Web site's traffic and traffic sources and measures conversions and sales. The product is aimed at marketers as opposed to the Webmasters and technologists from which the industry of Web analytics originally grew. It is the most widely used Web analytics service. Even though the basic service is free of charge, a premium version is available for a fee.
YAHOO! WEB ANALYTICS (WEB.ANALYTICS.YAHOO.COM) Yahoo! Web Analytics is Yahoo!'s alternative to the dominant Google Analytics. It's an enterprise-level, robust Web-based third-party solution that makes accessing data easy, especially for multiple-user groups. It's got all the things you'd expect from a comprehensive Web analytics tool, such as pretty graphs, custom-designed (and printable) reports, and real-time data tracking.
OPEN WEB ANALYTICS (OPENWEBANALYTICS.COM) Open Web Analytics (OWA) is a popular open source Web analytics software package that anyone can use to track and analyze how people use Web sites and applications. OWA is licensed under GPL and provides Web site owners and developers with easy ways to add Web analytics to their sites using simple JavaScript, PHP, or REST-based APIs. OWA also comes with built-in support for tracking Web sites made with popular content management frameworks such as WordPress and MediaWiki.
PIWIK (PIWIK.ORG) Piwik is one of the leading self-hosted, decentralized, open source Web analytics platforms, used by 460,000 Web sites in 150 countries. Piwik was founded by Matthieu Aubry in 2007. Over the last 6 years, more talented and passionate members of the community have joined the team. As is the case in many open source initiatives, they are actively looking for new developers, designers, datavis architects, and sponsors to join them.
FIRESTATS (FIRESTATS.CC) FireStats is a simple and straightforward Web analytics application written in PHP/MySQL. It supports numerous platforms and setups including C# sites, Django sites, Drupal, Joomla!, WordPress, and several others. FireStats has an intuitive API that assists developers in creating their own custom apps or publishing platform components.
SITE METER (SITEMETER.COM) Site Meter is a service that provides counter and tracking information for Web sites. By logging IP addresses and using JavaScript or HTML to track visitor information, Site Meter provides Web site owners with information about their visitors, including how they reached the site, the date and time of their visit, and more.
WOOPRA (WOOPRA.COM) Woopra is a real-time customer analytics service that provides solutions for sales, service, marketing, and product teams. The platform is designed to help organizations optimize the customer life cycle by delivering live, granular behavioral data for individual Web site visitors and customers. It ties this individual-level data to aggregate analytics reports for a full life-cycle view that bridges departmental gaps.
AWSTATS (AWSTATS.ORG) AWStats is an open source Web analytics reporting tool, suitable for analyzing data from Internet services such as Web, streaming media, mail, and FTP servers. AWStats parses and analyzes server log files, producing HTML reports. Data is visually presented within reports by tables and bar graphs. Static reports can be created through a command-line interface, and on-demand reporting is supported through a Web browser CGI program.
SNOOP (REINVIGORATE.NET) Snoop is a desktop-based application that runs on the Mac OS X and Windows XP/Vista platforms. It sits nicely on your system status bar/system tray, notifying you with audible sounds whenever something happens. Another outstanding Snoop feature is the Name Tags option, which allows you to "tag" visitors for easier identification. So when Joe over at the accounting department visits your site, you'll instantly know.
MOCHIBOT (MOCHIBOT.COM) MochiBot is a free Web analytics/tracking tool especially designed for Flash assets. With MochiBot, you can see who's sharing your Flash content and how many times people view your content, as well as track where your Flash content is to prevent piracy and content theft. Installing MochiBot is a breeze; you simply copy a few lines of ActionScript code into the .FLA files you want to monitor.
In addition to these free Web analytics tools, Table 8.1 provides a list of commer-
cially available Web analytics tools.
TABLE 8.1 Commercial Web Analytics Software Tools

Product Name | Description | URL
Angoss Knowledge WebMiner | Combines ANGOSS KnowledgeSTUDIO and clickstream analysis | angoss.com
ClickTracks | Visitor patterns can be shown on Web site | clicktracks.com, now at Lyris.com
LiveStats from DeepMetrix | Real-time log analysis, live demo on site | deepmetrix.com
Megaputer WebAnalyst | Data and text mining capabilities | megaputer.com/site/textanalyst.php
MicroStrategy Web Traffic Analysis Module | Traffic highlights, content analysis, and Web visitor analysis reports | microstrategy.com/Solutions/Applications/WTAM
SAS Web Analytics | Analyzes Web site traffic | sas.com/solutions/webanalytics
SPSS Web Mining for Clementine | Extraction of Web events | www-01.ibm.com/software/analytics/spss/
Web Trends | Data mining of Web traffic information | webtrends.com
XML Miner | A system and class library for mining data and text expressed in XML, using fuzzy logic expert system rules | scientio.com
Putting It All Together-A Web Site Optimization Ecosystem
It seems that just about everything on the Web can be measured: every click can be recorded, every view can be captured, and every visit can be analyzed, all in an effort to continually and automatically optimize the online experience. Unfortunately, the notions of "infinite measurability" and "automatic optimization" in the online channel are far more complex than most realize. The assumption that any single application of Web mining techniques will provide the necessary range of insights required to understand Web site visitor behavior is deceptive and potentially risky. Ideally, a holistic view of the customer experience is needed, one that can only be captured using both quantitative and qualitative data. Forward-thinking companies have already taken steps toward capturing and analyzing a holistic view of the customer experience, which has led to significant gains, both in terms of incremental financial growth and increasing customer loyalty and satisfaction.
According to Peterson (2008), the inputs for Web site optimization efforts can be classified along two axes describing the nature of the data and how that data can be used. On one axis are data and information, data being primarily quantitative and information being primarily qualitative. On the other axis are measures and actions, measures being reports, analysis, and recommendations all designed to drive actions, the actual changes being made in the ongoing process of site and marketing optimization. Each quadrant created by these dimensions leverages different technologies and creates different outputs, but much like a biological ecosystem, each technological niche interacts with the others to support the entire online environment (see Figure 8.8).
Most believe that the Web site optimization ecosystem is defined by the ability to log, parse, and report on the clickstream behavior of site visitors. The underlying technology of this ability is generally referred to as Web analytics.
[Figure 8.8: a 2 x 2 grid. The horizontal axis is the nature of the data, from quantitative (data) to qualitative (information); the vertical axis is how the data is used, from actions (actual changes) to measures (reports/analyses). The four quadrants are Testing and Targeting; Personalization and Content Management; Web Analytics; and Voice of the Customer and Customer Experience Management.]
FIGURE 8.8 Two-Dimensional View of the Inputs for Web Site Optimization.
Although Web analytics tools provide invaluable insights, understanding visitor behavior is as much a function of qualitatively determining interests and intent as it is of quantifying clicks from page to page. Fortunately, there are two other classes of applications designed to provide a more qualitative view of online visitor behavior, reporting on the overall user experience and on direct feedback given by visitors and customers: customer experience management (CEM) and voice of customer (VOC):
• Web analytics applications focus on "where and when" questions by aggregating, mining, and visualizing large volumes of data, by reporting on online marketing and visitor acquisition efforts, by summarizing page-level visitor interaction data, and by summarizing visitor flow through defined multistep processes.
• Voice of customer applications focus on "who and how" questions by gathering and reporting direct feedback from site visitors, by benchmarking against other sites and offline channels, and by supporting predictive modeling of future visitor behavior.
• Customer experience management applications focus on "what and why" questions by detecting Web application issues and problems, by tracking and resolving business process and usability obstacles, by reporting on site performance and availability, by enabling real-time alerting and monitoring, and by supporting deep diagnosis of observed visitor behavior.
All three applications are needed to have a complete view of visitor behavior, where each application plays a distinct and valuable role. Web analytics, CEM, and VOC applications form the foundation of the Web site optimization ecosystem that supports the online business's ability to positively influence desired outcomes (a pictorial representation of this process view of the Web site optimization ecosystem is given in Figure 8.9). These similar-yet-distinct applications each contribute to a site operator's ability to recognize, react, and respond to the ongoing challenges faced by every Web site owner. Fundamental to the optimization process is measurement: gathering data and information that can then be transformed into tangible analysis and recommendations for improvement using Web mining tools and techniques. When used properly, these applications allow for convergent validation, combining different sets of data collected for the same audience to provide a richer and deeper understanding of audience behavior.
[Figure 8.9: customer interaction on the Web feeds three parallel analyses of interactions (Web analytics, voice of customer, and customer experience management), which together produce knowledge about the holistic view of the customer.]
FIGURE 8.9 A Process View of the Web Site Optimization Ecosystem.
The convergent validation model, in which multiple sources of data describing the same population are integrated to increase the depth and richness of the resulting analysis, forms the framework of the Web site optimization ecosystem. On one side of the spectrum are the primarily qualitative inputs from VOC applications; on the other side are the primarily quantitative inputs from Web analytics, with CEM bridging the gap by supporting key elements of data discovery. When properly implemented, all three systems sample data from the same audience. The combination of these data, either through data integration projects or simply via the process of conducting good analysis, supports far more actionable insights than any of the ecosystem members individually.
A Framework for Voice of the Customer Strategy
Voice of the customer (VOC) is a term usually used to describe the analytic process of capturing a customer's expectations, preferences, and aversions. It is essentially a market research technique that produces a detailed set of customer wants and needs, organized into a hierarchical structure, and then prioritized in terms of relative importance and satisfaction with current alternatives. Attensity, one of the innovative service providers in the analytics marketplace, developed an intuitive framework for VOC strategy that it calls LARA, which stands for Listen, Analyze, Relate, and Act. It is a methodology that outlines a process by which organizations can take user-generated content (UGC), whether generated by consumers talking in Web forums, on micro-blogging sites like Twitter and social networks like Facebook, or in feedback surveys, e-mails, documents, research, etc., and use it as a business asset in a business process. Figure 8.10 shows a pictorial depiction of this framework.
LISTEN To "listen" is actually a process in itself that encompasses both the capability to
liste n to the open Web (forum s, blogs, tweets, you name it) and the capability to seam-
lessly access e nterprise information (eRM notes, documents, e-mails, etc.). It takes a
listening post, deep federated search capabilities , scraping and enterprise class data inte-
gratio n , and a strategy to determine w h o and what you want to listen to.
ANALYZE This is the hard part. How can you take this mass of unstructured data and make sense of it? This is where the "secret sauce" of text analytics comes into play.
[Figure 8.10: the LARA cycle (Listen, Analyze, Relate, Act) routing insights to business functions such as product and operations.]
FIGURE 8.10 Voice of the Customer Strategy Framework. Source: Attensity.com. Used with permission.
Look for solutions that include keyword, statistical, and natural language approaches that allow you to essentially tag or barcode every word and the relationships among words, making it data that can be accessed, searched, routed, counted, analyzed, charted, reported on, and even reused. Keep in mind that, in addition to technical capabilities, the solution has to be easy to use, so that your business users can focus on the insights, not the technology. It should have an engine that doesn't require the user to define keywords or terms that they want the system to look for or include in a rule base. Rather, it should automatically identify terms ("facts," people, places, things, etc.) and their relationships with other terms or combinations of terms, making it easier to use and maintain and also more accurate, so you can rely on the insights as actionable.
RELATE Now that you have found the insights and can analyze the unstructured data, the real value comes when you can connect those insights to your "structured" data: your customers (which customer segment is complaining about your product most?); your products (which product is having the issue?); your parts (is there a problem with a specific part manufactured by a specific partner?); your locations (is the customer who is tweeting about wanting a sandwich near your nearest restaurant?); and so on. Now you can ask questions of your data and get deep, actionable insights.
ACT Here is where it gets exciting, and your business strategy and rules are critical. What do you do with the new customer insight you've obtained? How do you leverage the problem resolution content created by a customer that you just identified? How do you connect with a customer who is uncovering issues that are important to your business or who is asking for help? How do you route the insights to the right people? And how do you engage with customers, partners, and influencers once you understand what they are saying? You understand it; now you've got to act.
SECTION 8.7 REVIEW QUESTIONS
1. What is a maturity model?
2. List and comment on the six stages of TDWI's BI maturity framework.
3. What are the six dimensions used in Hamel's Web analytics maturity model?
4. Describe Attensity's framework for VOC strategy. List and describe the four stages.
8.8 SOCIAL ANALYTICS AND SOCIAL NETWORK ANALYSIS
Social analytics may mean different things to different people, based on their worldview and field of study. For instance, the dictionary definition of social analytics refers to a philosophical perspective developed by the Danish historian and philosopher Lars-Henrik Schmidt in the 1980s. The theoretical object of the perspective is socius, a kind of "commonness" that is neither a universal account nor a communality shared by every member of a body (Schmidt, 1996). Thus, social analytics differs from traditional philosophy as well as sociology. It might be viewed as a perspective that attempts to articulate the contentions between philosophy and sociology.
Our definition of social analytics is somewhat different; as opposed to focusing on the "social" part (as is the case in its philosophical definition), we are more interested in the "analytics" part of the term. Gartner defined social analytics as "monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content." Social analytics includes mining the textual content created in social media (e.g., sentiment analysis, natural language processing) and analyzing socially established networks (e.g., influencer identification, profiling, prediction) for the purpose of gaining insight about existing and potential customers' current and future behaviors, and about the likes and dislikes toward a firm's products and services. Based on this definition and
the current practices, social analytics can be classified into two different, but not neces-
sarily mutually exclusive, branches: social network analysis and social media analytics.
Social Network Analysis
A social network is a social structure composed of individuals/people (or groups of individuals or organizations) linked to one another with some type of connections/relationships. The social network perspective provides a holistic approach to analyzing the structure and dynamics of social entities. The study of these structures uses social network analysis to identify local and global patterns, locate influential entities, and examine network dynamics. Social networks and the analysis of them is essentially an interdisciplinary field that emerged from social psychology, sociology, statistics, and graph theory. Development and formalization of the mathematical extent of social network analysis dates back to the 1950s; the development of foundational theories and methods of social networks dates back to the 1980s (Scott and Davis, 2003). Social network analysis is now one of the major paradigms in business analytics, consumer intelligence, and contemporary sociology, and is also employed in a number of other social and formal sciences.
A social network is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies (social units). The term is used to describe a social structure determined by such interactions. The ties through which any given social unit connects represent the convergence of the various social contacts of that unit. In general, social networks are self-organizing, emergent, and complex, such that a globally coherent pattern appears from the local interaction of the elements (individuals and groups of individuals) that make up the system. Following are a few typical social network types that are relevant to business activities.
COMMUNICATION NETWORKS Communication studies are often considered a part of both the social sciences and the humanities, drawing heavily on fields such as sociology, psychology, anthropology, information science, biology, political science, and economics. Many communications concepts describe the transfer of information from one source to another, and thus can be represented as a social network. Telecommunication companies are tapping into this rich information source to optimize their business practices and to improve customer relationships.
COMMUNITY NETWORKS Traditionally, community referred to a specific geographic location, and studies of community ties had to do with who talked, associated, traded, and attended social activities with whom. Today, however, there are extended "online" communities developed through social networking tools and telecommunications devices. Such tools and devices continuously generate large amounts of data, which can be used by companies to discover invaluable, actionable information.
CRIMINAL NETWORKS In criminology and urban sociology, much attention has been paid to the social networks among criminal actors. For example, studying gang murders and other illegal activities as a series of exchanges between gangs can lead to better understanding and prevention of such criminal activities. Now that we live in a highly connected world (thanks to the Internet), many of the criminal networks' formations and their activities are being watched/pursued by security agencies using state-of-the-art Internet tools and tactics. Even though the Internet has changed the landscape for criminal networks and law enforcement agencies, the traditional social and philosophical theories still apply to a large extent.
INNOVATION NETWORKS Business studies on the diffusion of ideas and innovations in a network environment focus on the spread and use of ideas among the members of the social network. The idea is to understand why some networks are more innovative, and why some communities are early adopters of ideas and innovations (i.e., examining the impact of social network structure on influencing the spread of an innovation and innovative behavior).
Social Network Analysis Metrics
Social network analysis (SNA) is the systematic examination of social networks. Social network analysis views social relationships in terms of network theory, consisting of nodes (representing individuals or organizations within the network) and ties/connections (which represent relationships between the individuals or organizations, such as friendship, kinship, organizational position, etc.). These networks are often represented using social network diagrams, where nodes are represented as points and ties are represented as lines. Application Case 8.5 gets into the details of how SNA can be used to help telecommunication companies.
Over the years, various metrics (or measurements) have been developed to analyze social network structures from different perspectives. These metrics are often grouped into three categories: connections, distributions, and segmentation (a short code sketch following the Segmentation list below illustrates several of them).
Application Case 8.5
Social Network Analysis Helps Telecommunication Firms
Because of the widespread use of free Internet tools and techniques (VoIP, video conferencing tools such as Skype, free phone calls within the United States by Google Voice, etc.), the telecommunication industry is going through a tough time. In order to stay viable and competitive, telecom companies need to make the right decisions and utilize their limited resources optimally. One of the key success factors for telecom companies is to maximize their profitability by listening to and understanding the needs and wants of their customers, offering the communication plans, prices, and features that customers want at the prices that they are willing to pay.
These market pressures force telecommunication companies to be more innovative. As we all know, "necessity is the mother of invention." Therefore, many of the most promising use cases for social network analysis (SNA) are coming from the telecommunication companies. Using detailed call records that are already in their databases, they are trying to identify social networks and influencers. In order to identify the social networks, they are asking questions like "Who contacts whom?" "How often?" "How long?" "Both directions?" "On Net, off Net?" They are also trying to answer questions that lead to identification of influencers, such as "Who influenced whom how much on purchases?" "Who influences whom how much on churn?" and "Who will acquire others?" SNA metrics like degree (how many people are directly in a person's social network), density (how dense is the calling pattern within the calling circle), betweenness (how essential you are to facilitating communication within your calling circle), and centrality (how "important" you are in the social network) are often used to answer these questions.
Here are some of the benefits that can be obtained from SNA:
• Manage customer churn
- Reactive (reduce collateral churn): Identify subscribers whose loyalty is threatened by churn around them.
- Preventive (reduce influential churn): Identify subscribers who, should they churn, would take a few friends with them.
• Improve cross-sell and technology transfer
- Reactive (leverage collateral adoption): Identify subscribers whose affinity for products is increased due to adoption around them, and stimulate them.
- Proactive (identify influencers for this adoption): Identify subscribers who, should they adopt, would push a few friends to do the same.
(Continued)
Application Case 8.5 (Continued)
• Manage viral campaigns: Understand what leads to high-scale spread of messages about products and services, and use this information to your benefit.
• Improve acquisition: Identify who is most likely to recommend an (off-Net) friend to become a new subscriber of the operator. The recommendation itself, as well as the subscription, is incentivized for both the subscriber and the recommending person.
• Identify households, communities, and close groups to better manage your relationships with them.
• Identify customer life-stages: Identify social network changes and, from there, life-stage changes such as moving, changing a job, going to a university, starting a relationship, getting married, etc.
• Identify pre-churners: Detect potential churners during the process of leaving and motivate them to stay with you.
• Gain competitor insights: Track dynamic changes in social networks based on competitors' marketing activities.
• Others, including identifying rotational churners (switching between operators), facilitating pre- to post-migration, and tracking the customer's network dynamics over his/her life cycle.
Actual cases indicate that proper implementation of SNA can significantly lower churn, improve cross-sell, boost new customer acquisition, optimize pricing and, hence, maximize profit and improve overall competitiveness.
QUESTIONS FOR DISCUSSION
1. How can social network analysis be used in the telecommunications industry?
2. What do you think are the key challenges, potential solutions, and probable results in applying SNA in telecommunications firms?
Source: Compiled from "More Things We Love About SNA: Return of the Magnificent 10," February 2013, presentation by Judy Bayer and Fawad Qureshi, Teradata.
Connections
Homophily: The extent to which actors form ties with similar versus dissimilar others. Similarity can be defined by gender, race, age, occupation, educational achievement, status, values, or any other salient characteristic.
Multiplexity: The number of content-forms contained in a tie. For example, two people who are friends and also work together would have a multiplexity of 2. Multiplexity has been associated with relationship strength.
Mutuality/reciprocity: The extent to which two actors reciprocate each other's friendship or other interaction.
Network closure: A measure of the completeness of relational triads. An individual's assumption of network closure (i.e., that their friends are also friends) is called transitivity. Transitivity is an outcome of the individual or situational trait of need for cognitive closure.
Propinquity: The tendency for actors to have more ties with geographically close others.
Distributions
Bridge: An individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters. It also includes the shortest route when a longer one is unfeasible due to a high risk of message distortion or delivery failure.
Centrality: Refers to a group of metrics that aim to quantify the importance or influence (in a variety of senses) of a particular node (or group) within a network. Examples of common methods of measuring centrality include betweenness centrality, closeness centrality, eigenvector centrality, alpha centrality, and degree centrality.
Density: The proportion of direct ties in a network relative to the total number possible.
Distance: The minimum number of ties required to connect two particular actors.
Structural holes: The absence of ties between two parts of a network. Finding and exploiting a structural hole can give an entrepreneur a competitive advantage. This concept was developed by sociologist Ronald Burt and is sometimes referred to as an alternate conception of social capital.
Tie strength: Defined by the linear combination of time, emotional intensity, intimacy, and reciprocity (i.e., mutuality). Strong ties are associated with homophily, propinquity, and transitivity, while weak ties are associated with bridges.
Segmentation
Cliques and social circles: Groups are identified as cliques if every individual is directly tied to every other individual, as social circles if there is less stringency of direct contact (which is imprecise), or as structurally cohesive blocks if precision is wanted.
Clustering coefficient: A measure of the likelihood that two associates of a node are themselves associates. A higher clustering coefficient indicates greater cliquishness.
Cohesion: The degree to which actors are connected directly to each other by cohesive bonds. Structural cohesion refers to the minimum number of members who, if removed from a group, would disconnect the group.
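Several of the metrics defined above are straightforward to compute in practice. The following minimal sketch, using the open source networkx library on a made-up six-person friendship network, illustrates density, degree and betweenness centrality, clustering, and distance:

```python
import networkx as nx

# A small, hypothetical friendship network.
G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Carl"), ("Bob", "Carl"),  # a closed triad
    ("Carl", "Dee"),                                   # the only bridge
    ("Dee", "Ed"), ("Dee", "Fay"), ("Ed", "Fay"),      # a second triad
])

print(nx.density(G))                 # proportion of possible ties present
print(nx.degree_centrality(G))       # normalized count of direct ties
print(nx.betweenness_centrality(G))  # Carl and Dee score highest (bridge ends)
print(nx.clustering(G))              # cliquishness around each node
print(nx.shortest_path_length(G, "Ann", "Fay"))  # distance: 3 ties
```

In this toy network the Carl-Dee tie is a bridge spanning a structural hole: remove it and the two triads disconnect.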
SECTION 8.8 REVIEW QUESTIONS
1. What is meant by social analytics? Why is it an important business topic?
2. What is a social network? What is social network analysis?
3. List and briefly describe the most common social network types.
4. List and briefly describe the social network analysis metrics.
8.9 SOCIAL MEDIA DEFINITIONS AND CONCEPTS
Social media refers to the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks. It is a group of Internet-based software applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content (Kaplan and Haenlein, 2010). Social media depends on mobile and other Web-based technologies to create highly interactive platforms for individuals and communities to share, co-create, discuss, and modify user-generated content. It introduces substantial changes to communication between organizations, communities, and individuals.
Since their emergence in the early 1990s, Web-based social media technologies have seen a significant improvement in both quality and quantity. These technologies take on many different forms, including online magazines, Internet forums, Web logs, social blogs, microblogging, wikis, social networks, podcasts, pictures, video, and
product/service evaluations/ratings. By applying a set of theories from the field of media research (social presence, media richness) and social processes (self-presentation, self-disclosure), Kaplan and Haenlein (2010) created a classification scheme with six different types of social media: collaborative projects (e.g., Wikipedia), blogs and microblogs (e.g., Twitter), content communities (e.g., YouTube), social networking sites (e.g., Facebook), virtual game worlds (e.g., World of Warcraft), and virtual social worlds (e.g., Second Life).
Web-based social media are different from traditional/industrial media, such as newspapers, television, and film, as they are comparatively inexpensive and accessible enough to enable anyone (even private individuals) to publish or access/consume information. Industrial media generally require significant resources to publish information, as in most cases the articles (or books) go through many revisions before being published (as was the case in the publication of this very book). Here are some of the most prevailing characteristics that help differentiate between social and industrial media (Morgan et al., 2010):
Quality: In industrial publishing, mediated by a publisher, the typical range of quality is substantially narrower than in niche, unmediated markets. The main challenge posed by content in social media sites is the fact that the distribution of quality has high variance: from very high-quality items to low-quality, sometimes abusive, content.
Reach: Both industrial and social media technologies provide scale and are capable of reaching a global audience. Industrial media, however, typically use a centralized framework for organization, production, and dissemination, whereas social media are by their very nature more decentralized, less hierarchical, and distinguished by multiple points of production and utility.
Frequency: Compared to industrial media, updating and reposting on social media platforms is easier, faster, and cheaper, and therefore practiced more frequently, resulting in fresher content.
Accessibility: The means of production for industrial media are typically government and/or corporate (privately owned), and are costly, whereas social media tools are generally available to the public at little or no cost.
Usability: Industrial media production typically requires specialized skills and training. Conversely, most social media production requires only modest reinterpretation of existing skills; in theory, anyone with access can operate the means of social media production.
Immediacy: The time lag between communications produced by industrial media can be long (weeks, months, or even years) compared to social media (which can be capable of virtually instantaneous responses).
Updatability: Industrial media, once created, cannot be altered (once a magazine article is printed and distributed, changes cannot be made to that same article), whereas social media can be altered almost instantaneously by comments or editing.
How Do People Use Social Media?
Not only are the numbers on social networking sites growing, but so is the degree to which users are engaged with the channel. Brogan and Bastone (2011) presented research results that stratify users according to how actively they use social media, and tracked the evolution of these user segments over time. They listed six different engagement levels (Figure 8.11).
According to the research results, the online user community has been steadily migrating upward on this engagement hierarchy. The most notable change is among Inactives; 44 percent of the online population fell into this category.
[Figure 8.11: an engagement ladder rising over time, from Inactives at the bottom through Spectators, Collectors, Joiners, and Critics, to Creators at the top.]
FIGURE 8.11 Evolution of Social Media User Engagement.
Two years later, more than half of those Inactives had jumped into social media in some form or another. "Now roughly 82 percent of the adult population online is in one of the upper categories," said Bastone. "Social media has truly reached a state of mass adoption." Application Case 8.6 shows the positive impact of social media at Lollapalooza.
Application Case 8.6
Measuring the Impact of Social Media at Lollapalooza
C3 Presents creates, books, markets, and produces live
experiences, concerts, events, and just about anything
that makes people stand up and cheer. Among oth-
ers, they produce the Austin City Limits Music Festival,
Lollapalooza, as well as more than 800 shows nation-
wide. They hope to see you up in front sometime.
An early adopter of social media as a way to
drive event attendance, Lollapalooza organizer C3
Presents needed to know the impact of its social
media efforts. They came to Cardinal Path for a
social media measurement strategy and ended up
with some startling insights.
The Challenge
When the Lollapalooza music festival decided to incorporate social media into their online marketing strategy, they did it with a bang. Using Facebook, MySpace, Twitter, and more, the Lollapalooza Web site was a first mover in allowing its users to engage and share through social channels that were integrated into the site itself.
After investing the time and resources in building out these integrations and their functionality, C3 wanted to know one simple thing: "Did it work?" To answer this, C3 Presents needed a measurement strategy that would provide a wealth of information about their social media implementation, such as:
• Which fans are using social media and sharing
content?
• What social media is being used the most, and
how?
• Are visitors that interact with social media
more likely to buy a ticket?
• Is social media driving more traffic to the site?
Is that traffic buying tickets?
The Solution
Cardinal Path was asked to architect and implement a solution, based on an existing Google Analytics implementation, that would answer these questions. A combination of customized event tracking, campaign tagging, custom variables, and a complex implementation and configuration was deployed to include the tracking of each social media outlet on the site.
(Continued)
Application Case 8.6 (Continued)
The Results
As a result of this measurement solution, it was easy to surface some impressive insights that helped C3 quantify the return on their social media investment:
• Fan engagement metrics such as time on site, bounce rate, page views per visit, and interaction goals improved significantly across the board as a result of social media applications.
• Users of the social media applications on Lollapalooza.com spent twice as much as non-users.
• Over 66 percent of the traffic referred from Facebook, MySpace, and Twitter was a result of sharing applications and Lollapalooza's messaging to its fans on those platforms.
QUESTIONS FOR DISCUSSION
1. How did C3 Presents use social media analytics to improve its business?
2. What were the challenges, the proposed solution, and the obtained results?
Source: www.cardinalpath.com/case-study/social-media-measurement (accessed March 2013).
SECTION 8.9 REVIEW QUESTIONS
1. What is social media? How does it relate to Web 2.0?
2. What are the differences and commonalities between Web-based social media and traditional/industrial media?
3. How do people use social media? What are the evolutionary levels of engagement?
8.10 SOCIAL MEDIA ANALYTICS
Social media analytics refers to the systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness. Social media analytics is rapidly becoming a new force in organizations around the world, allowing them to reach out to and understand consumers as never before. In many companies, it is becoming the tool for integrated marketing and communications strategies.
The exponential growth of social media outlets, from blogs, Facebook, and Twitter to LinkedIn and YouTube, and analytics tools that tap into these rich data sources offer organizations the chance to join a conversation with millions of customers around the globe every day. This aptitude is why nearly two-thirds of the 2,100 companies who participated in a recent survey by Harvard Business Review (HBR) Analytic Services said they are either currently using social media channels or have social media plans in the works (HBR, 2010). But many still say social media is an experiment, as they try to understand how to best use the different channels, gauge their effectiveness, and integrate social media into their strategy.
Despite the vast potential social media analytics brings, many companies seem focused on social media activity primarily as a one-way promotional channel and have yet to capitalize on the ability to not only listen to, but also analyze, consumer conversations and turn the information into insights that impact the bottom line. Here are some of the results from the HBR Analytic Services survey (HBR, 2010):
• Three-quarters (75%) of the companies in the survey said they did not know where their most valuable customers were talking about them.
• Nearly one-third (31%) do not measure the effectiveness of social media.
• Less than one-quarter (23%) are using social media analytic tools.
• A fraction (7%) of participating companies are able to integrate social media into their marketing activities.
While still searching for best practices and measurements, two-thirds of the companies surveyed are convinced their use of social media will grow, and many anticipate investing more in it next year, even as spending in traditional media declines. So what is it specifically that the companies are interested in measuring in social media?
Measuring the Social Media Impact
For organizations, small or large, there is valuable insight hidden in all the user-gen-
erated content o n social media sites. But how do you dig it out of dozen s of review
sites, thousands of blogs, millions of Facebook posts, a nd billio ns of tweets? Once
you do that, how do you measure the impact of your efforts? These questions can
be addressed by the analytics exte nsion of th e social media techn o logies. Once you
decide on your goal fo r social media (wh at it is that you want to accomplish) , th e re is
a multitude of tools to help you get there . These analysis tools usu ally fall into three
broad categories :
• Descriptive analytics: Uses simple statistics to identify activity characteristics
and trends , such as how many followers you have, how many reviews were gener-
ated o n Facebook, and which channels are being used most often.
• Social network analysis: Follows the links between friends, fans, and followers
to identify connection s of influence as well as the b iggest sources of influence.
• Advanced analytics: Includes predictive an alytics and text an alytics that exam-
ine the content in online conversations to ide ntify the mes, sen time nts, and connec-
tions that would not be revealed by casual surveillance.
Sophisticated tools and solutions to social media analytics use a ll three categories
of a nalytics (i.e., descriptive, pre dictive, and prescriptive) in a somewhat progressive
fashion .
Best Practices in Social Media Analytics
As an emerging tool, social media analytics is practiced by companies in a somewhat haphazard fashion. Because there are no well-established methodologies, everybody is trying to create their own by trial and error. What follows are some of the field-tested best practices for social media analytics proposed by Paine and Chaves (2012).
THINK OF MEASUREMENT AS A GUIDANCE SYSTEM, NOT A RATING SYSTEM Measurements are often used for punishment or rewards; they should not be. They should be about figuring out what the most effective tools and practices are, what needs to be discontinued because it doesn't work, and what needs to be done more because it does work very well. A good analytics system should tell you where you need to focus. Maybe all that emphasis on Facebook doesn't really matter, because that is not where your audience is. Maybe they are all on Twitter, or vice versa. According to Paine and Chaves, channel preference won't necessarily be intuitive: "We just worked with a hotel that had virtually no activity on Twitter for one brand but lots of Twitter activity for one of their higher brands." Without an accurate measurement tool, you would not know.
TRACK THE ELUSIVE SENTIMENT Customers want to take what they are hearing and learning from online conversations and act on it. The key is to be precise in extracting and tagging their intentions by measuring their sentiments. As we have seen in Chapter 7, text analytic tools can categorize online content, uncover linked concepts, and reveal the sentiment in a conversation as "positive," "negative," or "neutral," based on the words people use. Ideally, you would like to be able to attribute sentiment to a specific product, service, and business unit. The more precise you can get in understanding the tone and perception that people express, the more actionable the information becomes, because you are mitigating concerns about mixed polarity. A mixed-polarity phrase, such as "hotel in great location but bathroom was smelly," should not be tagged as "neutral" because you have positives and negatives offsetting each other. To be actionable, these types of phrases are to be treated separately; "bathroom was smelly" is something someone can own and improve upon. One can classify and categorize these sentiments, look at trends over time, and see significant differences in the way people speak either positively or negatively about you. Furthermore, you can compare the sentiment about your brand to that about your competitors.
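To make the mixed-polarity point concrete, here is a deliberately simplified sketch; the tiny hand-built lexicon stands in for a real text analytics engine, and the splitting rule is far cruder than what commercial tools do:

```python
import re

# Toy sentiment lexicon; a real package would be far richer and learned.
POSITIVE = {"great", "clean", "friendly"}
NEGATIVE = {"smelly", "dirty", "noisy"}

def clause_sentiments(review):
    """Score each clause separately instead of averaging to 'neutral'."""
    tagged = []
    for clause in re.split(r"\bbut\b|\band\b", review.lower()):
        words = set(re.findall(r"[a-z]+", clause))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        tagged.append((clause.strip(), label))
    return tagged

for clause, label in clause_sentiments("Hotel in great location but bathroom was smelly"):
    print(f"{label:8s} -> {clause}")
# positive -> hotel in great location
# negative -> bathroom was smelly
```

Treated this way, "bathroom was smelly" becomes a separately owned, actionable item rather than noise canceled out by the positive clause.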
CONTINUOUSLY IMPROVE THE ACCURACY OF TEXT ANALYSIS An industry-specific text analytics package will already know the vocabulary of your business. The system will have linguistic rules built into it, but it learns over time and gets better and better. Much as you would tune a statistical model as you get more data, better parameters, or new techniques to deliver better results, you would do the same thing with the natural language processing that goes into sentiment analysis. You set up rules, taxonomies, categorization, and meanings of words; watch what the results look like; and then go back and do it again.
LOOK AT THE RIPPLE EFFECT It is one thing to get a great hit on a high-profile site, but that's only the start. There's a difference between a great hit that just sits there and goes away versus a great hit that is tweeted, retweeted, and picked up by influential bloggers. Analysis should show you which social media activities go "viral" and which quickly go dormant, and why.
LOOK BEYOND THE BRAND One of the biggest mistakes people make is to be concerned only about their brand. To successfully analyze and act on social media, you need to understand not just what is being said about your brand, but the broader conversation about the spectrum of issues surrounding your product or service as well. Customers don't usually care about a firm's message or its brand; they care about themselves. Therefore, you should pay attention to what they are talking about, where they are talking, and where their interests are.
IDENTIFY YOUR MOST POWERFUL INFLUENCERS Organizations struggle to identify who has the most power in shaping public opinion. It turns out that your most important influencers are not necessarily the ones who advocate specifically for your brand; they are the ones who influence the whole realm of conversation about your topic. You need to understand whether they are saying nice things, expressing support, or simply making observations or critiquing. What is the nature of their conversations? How is my brand being positioned relative to the competition in that space?
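Influence of this broader kind can be approximated from conversation data. The sketch below assumes a hypothetical mention/retweet graph, where an edge from A to B means A amplifies B, and uses networkx's PageRank so that being amplified by well-amplified accounts counts for more than raw mention volume:

```python
import networkx as nx

# Hypothetical mention/retweet graph: edge (a, b) means a amplifies b.
mentions = nx.DiGraph()
mentions.add_edges_from([
    ("u1", "blogger"), ("u2", "blogger"), ("u3", "blogger"),
    ("blogger", "brand"), ("u4", "brand"), ("u2", "u3"),
])

scores = nx.pagerank(mentions)  # higher score = more influential node
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{user:8s} {score:.3f}")
```

The top-ranked accounts are the ones worth watching, whether or not they ever mention your brand directly.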
LOOK CLOSELY AT THE ACCURACY OF YOUR ANALYTIC TOOL Until recently, computer-based automated tools were not as accurate as humans for sifting through online content. Even now, accuracy varies depending on the media. For product review sites, hotel review sites, and Twitter, it can reach anywhere between 80 and 90 percent accuracy, because the context is more boxed in. When you start looking at blogs and discussion forums, where the conversation is more wide-ranging, the software can deliver 60 to 70 percent accuracy (Paine and Chaves, 2012). These figures will increase over time, because the analytics tools are continually upgraded with new rules and improved algorithms to reflect field experience, new products, changing market conditions, and emerging patterns of speech.
INCORPORATE SOCIAL MEDIA INTELLIGENCE INTO PLANNING Once you have big-picture perspective and detailed insight, you can begin to incorporate this information into your planning cycle. But that is easier said than done. A quick audience poll revealed that very few people currently incorporate learning from online conversations into their planning cycles (Paine and Chaves, 2012). One way to achieve this is to find time-linked associations between social media metrics and other business activities or market events. Social media is typically either organically invoked or invoked by something your organization does; therefore, if you see a spike in activity at some point in time, you want to know what was behind it.
Application Case 8.7 shows an interesting case where eHarmony, one of the most popular online relationship service providers, uses social media analytics to better listen to, understand, and serve its customers.
Application Case 8.7
eHarmony Uses Social Media to Help Take the Mystery Out of Online Dating
eHarmony launched in the United States in 2000 and is now the number-one trusted relationship services provider in the United States. Millions of people have used eHarmony's Compatibility Matching System to find compatible long-term relationships; an average of 542 eHarmony members marry every day in the United States as a result of being matched on the site.
The Challenge
Online dating has continued to increase in popularity, and with the adoption of social media the social media team at eHarmony saw an even greater opportunity to connect with both current and future members. The team at eHarmony saw social media as a chance to dispel any myths and preconceived notions about online dating and, more importantly, have some fun with their social media presence. "For us it's about being human, and sharing great content that will help our members and our social media followers," says Grant Langston, director of social media at eHarmony. "We believe that if there are conversations happening around our brand, we need to be there and be a part of that dialogue."
The Approach
eHarmony started using Salesforce Marketing Cloud to listen to conversations around the brand and around keywords like "bad date" or "first date." They also took to Facebook and Twitter to connect with members, share success stories (including engagement and wedding videos), and answer questions from those looking for dating advice.
"We wanted to ensure our team felt comfortable using social media to connect with our community, so we set up guidelines for how to respond and proceed," explains Grant Langston. "We try to use humor and have some fun when we reach out to people through Twitter or Facebook. We think it makes a huge difference and helps make people feel more comfortable."
The Results
By using social media to help educate and create awareness around the benefits of online dating, eHarmony has built a strong and loyal community. The social media team now has eight staff members working to respond to social interactions and posts, helping them reach out to clients and respond to hundreds of posts a week. They plan to start creating Facebook apps that celebrate their members' success, and they are looking to create some new videos around common dating mistakes. The social team at eHarmony is making all the right moves, and their hard work is paying off for their millions of happy members.
QUESTIONS FOR DISCUSSION
1. How did eHarmony use social media to enhance online dating?
2. What were the challenges, the proposed solution, and the obtained results?
Source: Salesforce Marketing Cloud, Case Study, salesforcemarketingcloud.com; eharmony.com.
Social Media Analytics Tools and Vendors
Monitoring social media, identifying interesting conversations among potential customers, and inferring what they are saying about your company, products, and services is an essential yet challenging task for many organizations. Generally speaking, there are two main paths an organization can take to attain social media analytics (SMA) capabilities: in-house development or outsourcing. Because the SMA field is still evolving and maturing, and because building an effective SMA system requires extensive knowledge in several related fields (e.g., Web and text mining, predictive analytics, reporting, visualization, performance management), most organizations, with the exception of very large enterprises, choose the easier path: outsourcing.
Because of the strong emphasis given to SMA, in the last few years we have witnessed an incredible emergence of start-up companies claiming to provide practical, cost-effective SMA solutions to organizations of all sizes and types. Many of them did not succeed, because what they offered was not much more than monitoring a few keywords about brands, products, or services in social media. While there is still a lot of uncertainty and churn in the marketplace, a significant number of them have survived and evolved to provide services that go beyond basic monitoring of a few brand names and keywords; they provide an integrated approach that helps many parts of the business, including product development, customer support, public outreach, lead generation, market research, and campaign management.
In the following section, we list and briefly describe 10 SMA tools/vendors. This list is not meant to be "the absolute top 10" or the complete set of top-tier leaders in the market; it simply presents 10 of the many successful SMA vendors, and their respective tools/services, with which we have some familiarity.
ATTENSITY360 Attensity360 operates on four key principles: listen, analyze, relate, and act. Attensity360 helps monitor trending topics, influencers, and the reach of your brand while recommending ways to join the conversation. Attensity Analyze applies text analytics to unstructured text to extract meaning and uncover trends. Attensity Respond helps automate the routing of incoming social media mentions into user-defined queues. Clients include Whirlpool, Vodafone, Versatel, T-Mobile, Oracle, and Wiley.
RADIAN6/SALESFORCE CLOUD Radian6, purchased by Salesforce in 2011, works with brands to help them listen more intelligently to their consumers, competitors, and influencers, with the goal of growing their business via detailed, real-time insights. Beyond its monitoring dashboard, which tracks mentions on more than 100 million social media sites, Radian6 offers an engagement console that allows you to coordinate your internal responses to external activity by immediately updating your blog, Twitter, and Facebook accounts all in one spot. Clients include Red Cross, Adobe, AAA, Cirque du Soleil, H&R Block, March of Dimes, Microsoft, Pepsi, and Southwest Airlines.
SYSOMOS Managing conversations in real time, Sysomos's Heartbeat is a real-time monitoring and measurement tool that provides constantly updated snapshots of social media conversations delivered using a variety of user-friendly graphics. Heartbeat organizes conversations, manages workflow, facilitates collaboration, and provides ways to engage with key influencers. Clients include IBM, HSBC, Roche, Ketchum, Sony Ericsson, Philips, ConAgra, Edelman, Shell Oil, Nokia, Sapient, Citi, and Interbrand. Owner: Marketwire.
COLLECTIVE INTELLECT Boulder, Colorado-based Collective Intellect, which started out by providing monitoring to financial firms, has evolved into a top-tier player in the marketplace of social media intelligence gathering. Using a combination of self-serve client dashboards and human analysis, Collective Intellect offers a robust monitoring and measurement tool suited to mid-size to large companies with its Social CRM Insights platform. Clients include General Mills, NBC Universal, Pepsi, Walmart, Unilever, MillerCoors, Paramount, and Siemens.
WEBTRENDS Webtrends offers services geared toward monitoring, measuring, analyzing, profiling, and targeting audiences for a brand. The partner-based platform allows for crowd-sourced improvements and problem solving, creating transparency for their products and services. Clients include CBS, NBC Universal, 20th Century Fox, AOL, Electronic Arts, Lifetime, and Nestle.
CRIMSON HEXAGON Cambridge, Massachusetts-based Crimson Hexagon taps into billions of conversations taking place in online media and turns them into actionable data for better brand understanding and improvement. Based on a technology licensed from Harvard, its VoxTrot Opinion is able to analyze vast amounts of qualitative information and determine the quantitative proportions of opinion. Clients include CNN, Hanes, AT&T, HP, Johnson & Johnson, Mashable, Microsoft, Monster, Thomson Reuters, Rubbermaid, Sybase, and The Wall Street Journal.
CONVERSEON New York-based social media consulting firm Converseon, named a leader in the social media monitoring sector by Forrester Research, builds tailored dashboards for its enterprise installations and offers professional services around every step of the social business intelligence process. Converseon starts with the technology and adds human analysis, resulting in high-quality data and impressive functionality. Clients include Dow, Amway, Graco, and other major brands.
SPIRAL16 Spiral16 takes an in-depth look at who is saying what about a brand and compares the results with those of top competitors. The goal is to help you monitor the effectiveness of your social media strategy, understand the sentiment behind online conversations, and mine large amounts of data. It uses impressive 3D displays and a standard dashboard. Clients include Toyota, Lee, and Cadbury.
BUZZLOGIC BuzzLogic uses its technology platform to identify and organize the conversation universe, combining both conversation topic and audience to help brands reach audiences who are passionate about everything from the latest tech craze and cloud computing to parenthood and politics. Clients include Starbucks, American Express, HBO, and HP.
SPROUTSOCIAL Founded in 2010 in Chicago, Illinois, SproutSocial is an innovative social media analytics company that provides social analytics services to many well-known firms and organizations. Clients include Yahoo!, Nokia, Pepsi, St. Jude Children's Research Center, Hyatt Regency, McDonald's, and AMD. A sample screenshot of their social media solution dashboard is shown in Figure 8.12.
SECTION 8.10 REVIEW QUESTIONS
1. What is social media analytics? What type of data is analyzed with it?
2. What are the reasons/motivations behind the exponential growth of social media analytics?
3. How can you measure the impact of social media analytics?
4. List and briefly describe the best practices in social media analytics.
5. Why do you think social media analytics tools are usually offered as a service and not as a tool?
FIGURE 8.12 A Sample SproutSocial Social Media Dashboard (panels include Twitter follower demographics, Twitter stats, and daily engagement counts for mentions, retweets, and outbound tweets).
FIGURE 9.4 Excel Spreadsheet Dynamic Model Example of a Simple Loan Calculation of Monthly Payments and the Effects of Prepayment.
9.6 MATHEMATICAL PROGRAMMING OPTIMIZATION
The basic idea of optimization was introduced in Chapter 2. Linear programming (LP) is the best-known technique in a family of optimization tools called mathematical programming; in LP, all relationships among the variables are linear. It is used extensively in DSS (see Application Case 9.5). LP models have many important applications in practice, including supply chain management, product mix decisions, routing, and so on. Special forms of the models can be used for specific applications. For example, Application Case 9.5 describes a spreadsheet model that was used to create a schedule for medical interns.
Application Case 9.5
Spreadsheet Model Helps Assign Medical Residents
Fletcher Allen Health Care (FAHC) is a teaching hospital that works with the University of Vermont's College of Medicine. In this particular case, FAHC employs 15 residents, with hopes of adding 5 more, in the diagnostic radiology program. Each year the chief radiology resident is required to make a yearlong schedule for all of the residents in radiology. This is a time-consuming process to do manually because there are many limitations on when each resident is and is not allowed to work. During the weekday working hours, the residents work with certified radiologists, but nights, weekends, and holidays are all staffed by residents only. The residents are also required to take the "emergency rotations," which involve taking care of the radiology needs of the emergency room, which is often busiest on weekends. The radiology program is a 4-year program, and there are different rules for the work schedules of the residents for each year they are there. For example, first- and fourth-year residents cannot be on call on holidays; second-year residents cannot be on call or assigned ER shifts during 13-week blocks when they are assigned to work in Boston; and third-year residents must work one ER rotation during only one of the major winter holidays (Thanksgiving or Christmas/New Year's). Also, first-year residents cannot be on call until after January 1, fourth-year residents cannot be on call after December 31, and so on. The goal that the various chief residents have each year is to give each person the maximum possible number of days between on-call days. Manually, 3 days between on-call days was the most a chief resident had been able to accomplish.
To create a more efficient method of building the schedule, the chief resident worked with an MS class of MBA students to develop a spreadsheet model. To solve this multiple-objective decision-making problem, the class used a constraint method made up of two stages. The first stage was to use the spreadsheet created in Excel as a calculator, not as an optimizer. This allowed the creators "to measure the key metrics of the residents' assignments, such as the number of days worked in each category." The second stage was an optimization model, which was layered on the calculator spreadsheet. Assignment constraints and the objective were added, and the Solver engine in Excel was then invoked to find a feasible solution. The developers used Premium Solver by Frontline and the Xpress MP Solver engine by Dash Optimization to solve the yearlong model. Finally, using Excel functions, the developers converted the solution for a yearlong schedule from zeros and ones to an easy-to-read format for the residents. In the end, the program could instantly solve the scheduling problem with 3 to 4 days between on-call days, and even with 5 days between on-call days (which had never been accomplished manually).
Source: Based on A. Ovchinnikov and J. Milner, "Spreadsheet Model Helps to Assign Medical Residents at the University of Vermont's College of Medicine," Interfaces, Vol. 38, No. 4, July/August 2008, pp. 311-323.
Mathematical Programming
Mathematical programming is a family of tools designed to help solve managerial problems in which the decision maker must allocate scarce resources among competing activities to optimize a measurable goal. For example, the distribution of machine time (the resource) among various products (the activities) is a typical allocation problem. LP allocation problems usually display the following characteristics:
• A limited quantity of economic resources is available for allocation.
• The resources are used in the production of products or services.
• There are two or more ways in which the resources can be used. Each is called a solution or a program.
• Each activity (product or service) in which the resources are used yields a return in terms of the stated goal.
• The allocation is usually restricted by several limitations and requirements, called constraints.
The LP allocation model is based on the following rational economic assumptions:
• Returns from different allocations can be compared; that is, they can be measured by a common unit (e.g., dollars, utility).
• The return from any allocation is independent of other allocations.
• The total return is the sum of the returns yielded by the different activities.
• All data are known with certainty.
• The resources are to be used in the most economical manner.
Allocation problems typically have a large number of possible solutions. Depending on the underlying assumptions, the number of solutions can be either infinite or finite. Usually, different solutions yield different rewards. Of the available solutions, at least one is the best, in the sense that the degree of goal attainment associated with it is the highest (i.e., the total reward is maximized). This is called an optimal solution, and it can be found by using a special algorithm.
Linear Programming
Every LP problem is composed of:
• Decision variables, whose values are unknown and are searched for
• An objective function, a linear mathematical function that relates the decision variables to the goal, measures goal attainment, and is to be optimized
• Objective function coefficients, unit profit or cost coefficients indicating the contribution to the objective of one unit of a decision variable
• Constraints, expressed as linear inequalities or equalities that limit resources and/or requirements; these relate the variables through linear relationships
• Capacities, which describe the upper and sometimes lower limits on the constraints and variables
• Input/output (technology) coefficients, which indicate resource utilization for a decision variable
Let us look at an example. MBI Corporation, which manufactures special-purpose computers, needs to make a decision: How many computers should it produce next month at the Boston plant? MBI is considering two types of computers: the CC-7, which requires 300 days of labor and $10,000 in materials, and the CC-8, which requires 500 days of labor and $15,000 in materials. The profit contribution of each CC-7 is $8,000, whereas that of each CC-8 is $12,000. The plant has a capacity of 200,000 working days per month, and the material budget is $8 million per month. Marketing requires that at least 100 units of the CC-7 and at least 200 units of the CC-8 be produced each month. The problem is to maximize the company's profits by determining how many units of the CC-7 and how many units of the CC-8 should be produced each month. Note that in a real-world environment, it could possibly take months to obtain the data in the problem statement, and while gathering the data the decision maker would no doubt uncover facts about how to structure the model to be solved. Web-based tools for gathering data can help.
Modeling in LP: An Example
A standard LP model can be developed for the MBI Corporation problem just described. As discussed in Technology Insights 9.1, the LP model has three components: decision variables, result variables, and uncontrollable variables (constraints).
The decision variables are as follows:
X1 = units of CC-7 to be produced
X2 = units of CC-8 to be produced
The result variable is as follows:
Total profit = Z
The objective is to maximize total profit:
Z = 8,000X1 + 12,000X2
The uncontrollable variables (constraints) are as follows:
Labor constraint: 300X1 + 500X2 ≤ 200,000 (in days)
Budget constraint: 10,000X1 + 15,000X2 ≤ 8,000,000 (in dollars)
Marketing requirement for CC-7: X1 ≥ 100 (in units)
Marketing requirement for CC-8: X2 ≥ 200 (in units)
This information is summarized in Figure 9.5.
The model also has a fourth, hidden component. Every LP model has some internal intermediate variables that are not explicitly stated. The labor and budget constraints may each have some slack in them when the left-hand side is strictly less than the right-hand side. This slack is represented internally by slack variables that indicate excess resources available. The marketing requirement constraints may each have some surplus in them when the left-hand side is strictly greater than the right-hand side. This surplus is represented internally by surplus variables indicating that there is some room to adjust the right-hand sides of these constraints. These slack and surplus variables are intermediate. They can be of great value to a decision maker because LP solution methods use them in establishing sensitivity parameters for economic what-if analyses.
TECHNOLOGY INSIGHTS 9.1 Linear Programming
LP is perhaps the best-known optimization model. It deals with the optimal allocation of resources among competing activities. The allocation problem is represented by the model described here.
The problem is to find the values of the decision variables X1, X2, and so on, such that the value of the result variable Z is maximized, subject to a set of linear constraints that express the technology, market conditions, and other uncontrollable variables. The mathematical relationships are all linear equations and inequalities. Theoretically, any allocation problem of this type has an infinite number of possible solutions. Using special mathematical procedures, the LP approach applies a unique computerized search procedure that finds the best solution(s) in a matter of seconds. Furthermore, the solution approach provides automatic sensitivity analysis.
Decision variables: X1 = units of CC-7; X2 = units of CC-8
Result variable: Total profit = Z
Mathematical (logical) relationships: Maximize Z (profit), where Z = 8,000X1 + 12,000X2, subject to the constraints (uncontrollable):
300X1 + 500X2 ≤ 200,000
10,000X1 + 15,000X2 ≤ 8,000,000
X1 ≥ 100
X2 ≥ 200
FIGURE 9.5 Mathematical Model of a Product-Mix Example.
The product-mix model has an infinite number of possible solutions. Assuming that a production plan is not restricted to whole numbers, which is a reasonable assumption in a monthly production plan, we want a solution that maximizes total profit: an optimal solution. Fortunately, Excel comes with the Solver add-in, which can readily obtain an optimal (best) solution to this problem. Although the location of the Solver add-in has moved from one version of Excel to another, it is still available as a free add-in. Look for it on the Data tab, in the Analysis group. If it is not there, you should be able to enable it by going to Excel's Options menu and selecting Add-Ins.
We enter these data directly into an Excel spreadsheet, activate Solver, and identify the goal (by setting Target Cell equal to Max), the decision variables (by setting By Changing Cells), and the constraints (by ensuring that the Total Consumed elements are less than or equal to Limit for the first two rows and greater than or equal to Limit for the third and fourth rows). Cells C7 and D7 constitute the decision variable cells; results in these cells are filled in after running the Solver add-in. Target Cell is Cell E7, which is also the result variable, representing the product of the decision variable cells and their per-unit profit coefficients. Note that all the numbers have been divided by 1,000 to make them easier to type (except the decision variables). Rows 9-12 describe the constraints of the problem: the constraints on labor capacity, budget, and the desired minimum production of the two products X1 and X2. Columns C and D define the coefficients of these constraints. Column E includes the formulas that multiply the decision variables (Cells C7 and D7) by their respective coefficients in each row. Column F defines the right-hand-side value of these constraints. Excel's matrix multiplication capabilities (e.g., the SUMPRODUCT function) can be used to develop such row and column multiplications easily.
After the model's calculations have been set up in Excel, it is time to invoke the Solver add-in. Clicking on the Solver add-in (again, in the Analysis group on the Data tab) opens a dialog box that lets you specify the cells or ranges that define the objective function cell, the decision/changing variables (cells), and the constraints. Also, in Options, we select the solution method (usually Simplex LP), and then we solve the problem. Next, we select all three reports (Answer, Sensitivity, and Limits) to obtain an optimal solution of X1 = 333.33, X2 = 200, and Profit = $5,066,667, as shown in Figure 9.6. Solver produces three useful reports about the solution. Try it. Solver now also includes the ability to solve nonlinear programming and integer programming problems by using other solution methods available within it.
FIGURE 9.6 Excel Solver Solution to the Product-Mix Example.
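For readers who want to check such a model outside Excel, the same product-mix LP can be stated in a few lines of Python. The following is a minimal sketch, not part of the original example, using the open-source SciPy library; because linprog minimizes by default, the profit coefficients are negated.

from scipy.optimize import linprog

# Maximize 8,000*X1 + 12,000*X2 by minimizing the negated objective
c = [-8000, -12000]
A_ub = [[300, 500],        # labor days used per unit of CC-7, CC-8
        [10000, 15000]]    # material dollars used per unit
b_ub = [200000, 8000000]   # labor capacity, material budget
bounds = [(100, None),     # marketing minimum for CC-7
          (200, None)]     # marketing minimum for CC-8

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, -res.fun)     # approx. [333.33, 200.0] and 5,066,666.67

The result matches the Solver solution reported above.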
The following example was created by Prof. Rick Wilson of Oklahoma State University to further illustrate the power of spreadsheet modeling for decision support.
The table in Figure 9.7 describes some estimated data and attributes of nine "swing states" for the 2012 election. The attributes of the nine states include their number of electoral votes, two regional descriptors (note that three states are classified as neither North nor South), and an estimated "influence function," which relates increased candidate support to each unit of campaign financial investment in that state.
For instance, influence function F1 shows that for every financial unit invested in a state, there will be a total 10-unit increase in voter support (units will stay general here), made up of an increase in young men's support by 3 units, old men's support by 1 unit, and young and old women's support by 3 units each.
The campaign has 1,050 financial units to invest in the 9 states. It must invest in each state at least 5 percent of the total overall invested, but no more than 25 percent of the overall total invested can go to any one state. All 1,050 units do not have to be invested (the model must correctly deal with this).
The campaign has some other restrictions as well. From a financial investment standpoint, the West states (in total) must have campaign investment at levels that are at least 60 percent of the total invested in East states. In terms of people influenced, the decision to allocate financial investments to states must lead to at least 9,200 total people influenced. Overall, the total number of females influenced must be greater than or equal to the total number of males influenced. Also, at least 46 percent of all people influenced must be "old."
State  Electoral Votes  W/E   N/S    Influence Function
NV     6                West  -      F1
CO     9                West  -      F2
IA     6                West  North  F3
WI     10               West  North  F1
OH     18               East  North  F2
VA     13               East  South  F2
NC     15               East  South  F1
FL     29               East  South  F3
NH     4                East  -      F3

F1     Young  Old  Total
Men    3      1    4
Women  3      3    6
Total  6      4    10

F2     Young  Old  Total
Men    1.5    2.5  4
Women  2.5    1    3.5
Total  4      3.5  7.5

F3     Young  Old  Total
Men    2.5    2.5  5
Women  1      2    3
Total  3.5    4.5  8

FIGURE 9.7 Data for Election Resource Allocation Example.
Our task is to create an appropriate integer programming model that determines the optimal integer (i.e., whole-number) allocation of financial units to states that maximizes the sum of the products of the electoral votes times the units invested, subject to the other aforementioned restrictions. (Thus, indirectly, this model gives preference to states with higher numbers of electoral votes.) Note that for ease of implementation by the campaign staff, all allocation decisions in the model should lead to integer values.
The three aspects of the model can be categorized based on the following questions that they answer:
1. What do we control? The amount invested in advertisements across the nine states, Nevada, Colorado, Iowa, Wisconsin, Ohio, Virginia, North Carolina, Florida, and New Hampshire, which are represented by the nine decision variables NV, CO, IA, WI, OH, VA, NC, FL, and NH.
2. What do we want to achieve? We want to maximize the total electoral vote gain. We know the number of electoral votes in each state (EV), so this amounts to EV*Investment aggregated over the nine states, i.e.,
Max(6NV + 9CO + 6IA + 10WI + 18OH + 13VA + 15NC + 29FL + 4NH)
3. What constrains us? Following are the constraints as given in the problem description:
a. No more than 1,050 financial units to invest, i.e., NV + CO + IA + WI + OH + VA + NC + FL + NH <= 1050.
b. Invest at least 5 percent of the total in each state, i.e.,
NV >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
CO >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
IA >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
WI >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
OH >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
VA >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
NC >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
FL >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
NH >= 0.05(NV + CO + IA + WI + OH + VA + NC + FL + NH)
We can implement these nine constraints in a variety of ways using Excel.
c. Invest no more than 25 percent of the total in each state. As with (b), we need nine individual constraints, again because we do not know how much of the 1,050 financial units we will invest; we must write the constraints in "general" terms.
NV <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
CO <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
IA <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
WI <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
OH <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
VA <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
NC <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
FL <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
NH <= 0.25(NV + CO + IA + WI + OH + VA + NC + FL + NH)
d. Western states must have investment levels that are at least 60 percent of the Eastern states.
West states = NV + CO + IA + WI
East states = OH + VA + NC + FL + NH
So, (NV + CO + IA + WI) >= 0.60(OH + VA + NC + FL + NH). Again, we can implement this constraint in a variety of ways using Excel.
e. Influence at least 9,200 total people.
(10NV + 7.5CO + 8IA + 10WI + 7.5OH + 7.5VA + 10NC + 8FL + 8NH) >= 9200
f. Influence at least as many females as males. This requires translating the influence functions into the numbers of women and men influenced per unit invested:
F1 = 6 women influenced, F2 = 3.5 women, F3 = 3 women influenced
F1 = 4 men influenced, F2 = 4 men, F3 = 5 men influenced
So, implementing females >= males, we get:
(6NV + 3.5CO + 3IA + 6WI + 3.5OH + 3.5VA + 6NC + 3FL + 3NH) >= (4NV + 4CO + 5IA + 4WI + 4OH + 4VA + 4NC + 5FL + 5NH)
As before, we can implement this in Excel in a couple of different ways.
g. At least 46 percent of all people influenced must be old. All people influenced was the left-hand side of constraint (e), so old people influenced would be:
(4NV + 3.5CO + 4.5IA + 4WI + 3.5OH + 3.5VA + 4NC + 4.5FL + 4.5NH)
This would be set >= 0.46 times the left-hand side of constraint (e), (10NV + 7.5CO + 8IA + 10WI + 7.5OH + 7.5VA + 10NC + 8FL + 8NH), which works out to a right-hand side of 4.6NV + 3.45CO + 3.68IA + 4.6WI + 3.45OH + 3.45VA + 4.6NC + 3.68FL + 3.68NH.
This is the last constraint, other than forcing all variables to be integers. All told, in algebraic terms this integer programming model has 9 decision variables and 24 constraints (counting one constraint for the integer requirements).
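As a cross-check outside the spreadsheet, the model translates almost line for line into Python. The sketch below is our addition rather than part of the original example; it uses the open-source PuLP library, the coefficient dictionaries are read off Figure 9.7, and the comment labels (a) through (g) match the derivation above.

from pulp import LpProblem, LpMaximize, LpVariable, lpSum, value

states = ["NV", "CO", "IA", "WI", "OH", "VA", "NC", "FL", "NH"]
ev    = {"NV": 6, "CO": 9, "IA": 6, "WI": 10, "OH": 18, "VA": 13, "NC": 15, "FL": 29, "NH": 4}
total = {"NV": 10, "CO": 7.5, "IA": 8, "WI": 10, "OH": 7.5, "VA": 7.5, "NC": 10, "FL": 8, "NH": 8}
women = {"NV": 6, "CO": 3.5, "IA": 3, "WI": 6, "OH": 3.5, "VA": 3.5, "NC": 6, "FL": 3, "NH": 3}
men   = {"NV": 4, "CO": 4, "IA": 5, "WI": 4, "OH": 4, "VA": 4, "NC": 4, "FL": 5, "NH": 5}
old   = {"NV": 4, "CO": 3.5, "IA": 4.5, "WI": 4, "OH": 3.5, "VA": 3.5, "NC": 4, "FL": 4.5, "NH": 4.5}
west, east = ["NV", "CO", "IA", "WI"], ["OH", "VA", "NC", "FL", "NH"]

prob = LpProblem("election_allocation", LpMaximize)
x = {s: LpVariable(s, lowBound=0, cat="Integer") for s in states}
spent = lpSum(x[s] for s in states)

prob += lpSum(ev[s] * x[s] for s in states)            # objective: EV-weighted investment
prob += spent <= 1050                                  # (a) at most 1,050 units invested
for s in states:
    prob += x[s] >= 0.05 * spent                       # (b) at least 5% of total per state
    prob += x[s] <= 0.25 * spent                       # (c) at most 25% of total per state
prob += lpSum(x[s] for s in west) >= 0.60 * lpSum(x[s] for s in east)  # (d) West vs. East
prob += lpSum(total[s] * x[s] for s in states) >= 9200                 # (e) people influenced
prob += lpSum((women[s] - men[s]) * x[s] for s in states) >= 0         # (f) females >= males
prob += lpSum((old[s] - 0.46 * total[s]) * x[s] for s in states) >= 0  # (g) at least 46% old

prob.solve()
print({s: int(value(x[s])) for s in states}, "objective =", value(prob.objective))

Constraints (f) and (g) are written with all variables moved to the left-hand side, which is equivalent to the forms derived above.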
Implementation
One approach would be to implement the model in strict "standard form," or row-column form, where all constraints are written with the decision variables on the left-hand side and a number on the right-hand side. Figure 9.8 shows such an implementation and displays the solved model.
Alternatively, we could use the spreadsheet to calculate different parts of the model in a less rigid manner, as well as uniquely implementing the repetitive constraints (b) and (c), giving a much more concise (but not as transparent) spreadsheet. This is shown in Figure 9.9.
LP models (and their specializations and generalizations) can also be specified directly in a number of other user-friendly modeling systems. Two of the best known are Lindo and Lingo (Lindo Systems, Inc., lindo.com; demos are available). Lindo is an LP and integer programming system. Models are specified in essentially the same way that they are defined algebraically. Based on the success of Lindo, the company developed Lingo, a modeling language that includes the powerful Lindo optimizer and extensions for solving nonlinear problems. Many other modeling languages, such as AMPL, AIMMS, MPL, and XPRESS, are also available.
The most common optimization models can be solved by a variety of mathematical programming methods, including the following:
• Assignment (best matching of objects)
• Dynamic programming
• Goal programming
• Investment (maximizing rate of return)
• Linear and integer programming
• Network models for planning and scheduling
• Nonlinear programming
• Replacement (capital budgeting)
• Simple inventory models (e.g., economic order quantity)
• Transportation (minimizing cost of shipments)
FIGURE 9.8 Model for Election Resource Allocation-Standard Version.
SECTION 9.6 REVIEW QUESTIONS
1. List and explain the assumptions involved in LP.
2. List and explain the characteristics of LP.
3. Describe an allocation problem.
4. Define the product-mix problem.
5. Define the blending problem.
6. List several common optimization models.
” ? . ….. ” ·1 –“!”‘~ ……
I l M N
262 531
29 • 166)9 MAX
I I 1050 IOSO LT
-0.05 -0.05 o.s 0 GT
-0.05 -0.05 0.5 0 GT
005 -0.05 OS 0 GT
·DOS -ODS 182 S D GT
-005 -0.05 66.5 O GT
-0.0S -0.05 0.5 0 GT
0.05 -0.05 116.S 0 GT
09S -0.05 1095 D GT
-0.05 0 .95 OS 0 GT
-0.25 -0.25 -209.S 0 LT
-0 25 -0.25 ·209 5 0 LT
0 25 -0.25 20H OlT
0 25 -0.25 .175 0 ll
-0 2S -0.25 ·10S OlT
-0.25 -0.25 .2095 0 LT
·0.25 -0.25 93.5 0 LT
075 0.25 OS 0 lT
·OH 075 209 5 OH
-0.6 -0.6 0 .4 OGT
8 9201S 9200 GT
· 2 · 2 65.S 0 GT
082 0.82 3881 0 GT
FIGURE 9.9 A Compact Formulation for Election Resource Allocation.
9.7 MULTIPLE GOALS, SENSITIVITY ANALYSIS, WHAT-IF ANALYSIS, AND GOAL SEEKING
The search process described earlier in this chapter is coupled with evaluation. Evaluation is the final step that leads to a recommended solution.
Multiple Goals
The analysis of management decisions aims at evaluating, to the greatest possible extent, how far each alternative advances managers toward their goals. Unfortunately, managerial problems are seldom evaluated against a single simple goal, such as profit maximization. Today's management systems are much more complex, and one with a single goal is rare. Instead, managers want to attain simultaneous goals, some of which may conflict, and different stakeholders have different goals. Therefore, it is often necessary to analyze each alternative in light of how well it advances each of several goals (see Koksalan and Zionts, 2001).
For example, consider a profit-making firm. In addition to earning money, the company wants to grow, develop its products and employees, provide job security to its workers, and serve the community. Managers want to satisfy the shareholders and at the same time enjoy high salaries and expense accounts, and employees want to increase their take-home pay and benefits. When a decision is to be made, say, about an investment project, some of these goals complement each other, whereas others conflict. Kearns (2004) described how the analytic hierarchy process (AHP), which we introduce in Section 9.9, combined with integer programming, addressed multiple goals in evaluating IT investments.
Many quantitative models of decision theory are based on comparing a single measure of effectiveness, generally some form of utility to the decision maker. Therefore, it is usually necessary to transform a multiple-goal problem into a single-measure-of-effectiveness problem before comparing the effects of the solutions. This is a common method for handling multiple goals in an LP model.
Certain difficulties may arise when analyzing multiple goals:
• It is usually difficult to obtain an explicit statement of the organization's goals.
• The decision maker may change the importance assigned to specific goals over time or for different decision scenarios.
• Goals and subgoals are viewed differently at various levels of the organization and within different departments.
• Goals change in response to changes in the organization and its environment.
• The relationship between alternatives and their role in determining goals may be difficult to quantify.
• Complex problems are solved by groups of decision makers, each of whom has a personal agenda.
• Participants assess the importance (priorities) of the various goals differently.
Several methods of handling multiple goals can be used when working with MSS. The most common ones are:
• Utility theory
• Goal programming
• Expression of goals as constraints, using LP
• A points system
Sensitivity Analysis
A model builder makes predictions and assumptions regarding input data, many of which deal with the assessment of uncertain futures. When the model is solved, the results depend on these data. Sensitivity analysis attempts to assess the impact of a change in the input data or parameters on the proposed solution (i.e., the result variable).
Sensitivity analysis is extremely important in MSS because it allows flexibility and adaptation to changing conditions and to the requirements of different decision-making situations, provides a better understanding of the model and the decision-making situation it attempts to describe, and permits the manager to input data in order to increase confidence in the model. Sensitivity analysis tests relationships such as the following:
• The impact of changes in external (uncontrollable) variables and parameters on the outcome variable(s)
• The impact of changes in decision variables on the outcome variable(s)
• The effect of uncertainty in estimating external variables
• The effects of different dependent interactions among variables
• The robustness of decisions under changing conditions
Sensitivity analyses are used for:
• Revising models to eliminate too-large sensitivities
• Adding details about sensitive variables or scenarios
• Obtaining better estimates of sensitive external variables
• Altering a real-world system to reduce actual sensitivities
• Accepting and using the sensitive (and hence vulnerable) real world, leading to the continuous and close monitoring of actual results
The two types of sensitivity analyses are automatic and trial-and-error.
AUTOMATIC SENSITIVITY ANALYSIS Automatic sensitivity analysis is performed in standard quantitative model implementations such as LP. For example, it reports the range within which a certain input variable or parameter value (e.g., unit cost) can vary without having any significant impact on the proposed solution. Automatic sensitivity analysis is usually limited to one change at a time, and only for certain variables. However, it is very powerful because of its ability to establish ranges and limits very fast (and with little or no additional computational effort). For example, automatic sensitivity analysis is part of the LP solution report for the MBI Corporation product-mix problem described earlier. Sensitivity analysis is provided by both Solver and Lindo. Sensitivity analysis could be used to determine that if the right-hand side of the marketing constraint on CC-8 could be decreased by one unit, the net profit would increase by $1,333.33. This holds as the right-hand side decreases all the way to zero. For details, see Hillier and Lieberman (2005) and Taha (2006) or later editions of these textbooks.
TRIAL-AND-ERROR SENSITIVITY ANALYSIS The impact of changes in any variable, or in several variables, can be determined through a simple trial-and-error approach: You change some input data and solve the problem again. When the changes are repeated several times, better and better solutions may be discovered. Such experimentation, which is easy to conduct when using appropriate modeling software such as Excel, has two approaches: what-if analysis and goal seeking.
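To make the trial-and-error idea concrete, the following Python sketch (our illustration, not part of the text) re-solves the MBI product-mix LP with the CC-8 marketing minimum lowered by one unit and compares profits; it reproduces the $1,333.33-per-unit figure quoted above for automatic sensitivity analysis.

from scipy.optimize import linprog

def mbi_profit(cc8_min):
    # Re-solve the MBI LP for a given CC-8 marketing minimum (trial-and-error).
    res = linprog(c=[-8000, -12000],                   # negated profit (maximize)
                  A_ub=[[300, 500], [10000, 15000]],   # labor, budget usage
                  b_ub=[200000, 8000000],              # labor days, budget dollars
                  bounds=[(100, None), (cc8_min, None)],
                  method="highs")
    return -res.fun

print(mbi_profit(199) - mbi_profit(200))   # approx. 1333.33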
What-If Analysis
What-if analysis is structured as follows: What will happen to the solution if an input variable, an assumption, or a parameter value is changed? Here are some examples:
• What will happen to the total inventory cost if the cost of carrying inventories increases by 10 percent?
• What will be the market share if the advertising budget increases by 5 percent?
With the appropriate user interface, it is easy for managers to ask a computer model these types of questions and get immediate answers. Furthermore, they can perform multiple cases and thereby change the percentage, or any other data in the question, as desired. The decision maker does all this directly, without a computer programmer.
Figure 9.10 shows a spreadsheet example of a what-if query for a cash flow problem. When the user changes the cells containing the initial sales (from 100 to 120) and the sales growth rate (from 3% to 4% per quarter), the program immediately recomputes the value of the annual net profit cell (from $127 to $182). At first, initial sales were 100, growing at 3 percent per quarter, yielding an annual net profit of $127. Changing the initial sales cell to 120 and the sales growth rate to 4 percent causes the annual net profit to rise to $182. What-if analysis is common in expert systems: Users are given the opportunity to change their answers to some of the system's questions, and a revised recommendation is found.
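The same what-if query is easy to replicate in a few lines of Python. In this sketch (our addition), the unit revenue of $1.20, unit cost of $0.60, and total annual fixed cost of $124 are assumptions read off Figure 9.10; changing the two inputs reproduces the $127 to $182 jump described above.

def annual_net_profit(initial_sales, growth_rate,
                      unit_revenue=1.20, unit_cost=0.60, fixed_cost=124.0):
    # Four quarters of contribution margin, minus total fixed cost for the year.
    # The per-unit figures and fixed cost are assumptions taken from Figure 9.10.
    sales, profit = initial_sales, -fixed_cost
    for _ in range(4):
        profit += sales * (unit_revenue - unit_cost)
        sales *= 1 + growth_rate
    return profit

print(round(annual_net_profit(100, 0.03)))   # base case: 127
print(round(annual_net_profit(120, 0.04)))   # what-if case: 182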
Goal Seeking
Goal seeking calculates the values of the inputs necessary to achieve a desired level of
an output (goal). It represents a backward solution approach. The following are some
examples of goal seeking:
• What annual R&D budget is needed for an annual growth rate of 15 percent
by 2018?
• How many nurses are needed to reduce the average waiting time of a patient in the
emergency room to less than 10 minutes?
FIGURE 9.10 Example of a What-If Analysis Done in an Excel Worksheet.
An example of goal seeking is shown in Figure 9.11. In a financial planning model in Excel, the internal rate of return (IRR) is the interest rate that produces a net present value (NPV) of zero. Given a stream of annual returns in Column E, we can compute the NPV of the planned investment. By applying goal seeking, we can determine the interest rate at which the NPV is zero: the goal to be achieved is NPV equal to zero, which determines the IRR of this cash flow, including the investment. We set the NPV cell to the value 0 by changing the interest rate cell. The answer is 38.77059 percent.
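Goal seeking is, at bottom, a numerical root search, as the following Python sketch shows. It is our illustration, not from the text, and the cash flow stream is hypothetical because the exact figures in Figure 9.11 are not reproduced here; bisection plays the role of Excel's Goal Seek, driving the NPV to zero by changing the rate.

def npv(rate, investment, returns):
    # Discounted annual returns minus the up-front investment.
    return sum(r / (1 + rate) ** t
               for t, r in enumerate(returns, start=1)) - investment

def goal_seek_irr(investment, returns, lo=0.0, hi=2.0, tol=1e-9):
    # Bisection on the rate; assumes NPV is positive at lo and negative at hi.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, investment, returns) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

cash_flows = [120, 130, 140, 150, 160, 152, 144, 138, 130, 124]  # hypothetical
print(f"IRR = {goal_seek_irr(1000, cash_flows):.4%}")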
COMPUTING A BREAK-EVEN POINT BY USING GOAL SEEKING Some modeling software packages can directly compute break-even points, an important application of goal seeking. This involves determining the value of the decision variables (e.g., quantity to produce) that generates zero profit.
In many general application programs, it can be difficult to conduct sensitivity analysis because the prewritten routines usually present only a limited opportunity for asking what-if questions. In a DSS, the what-if and goal-seeking options must be easy to perform.
SECTION 9.7 REVIEW QUESTIONS
1. List some difficulties that may arise when analyzing multiple goals.
2. List the reasons for performing sensitivity analysis.
3. Explain why a manager might perform what-if analysis.
4. Explain why a manager might use goal seeking.
FIGURE 9.11 Goal-Seeking Analysis.
9.8 DECISION ANALYSIS WITH DECISION TABLES AND DECISION TREES
Decision situations that involve a finite and usually not too large number of alternatives are modeled through an approach called decision analysis (see Arsham, 2006a, 2006b; and Decision Analysis Society, decision-analysis.society.informs.org). Using this approach, the alternatives are listed in a table or a graph, with their forecasted contributions to the goal(s) and the probability of obtaining the contribution. These can be evaluated to select the best alternative.
Single-goal situations can be modeled with decision tables or decision trees. Multiple goals (criteria) can be modeled with several other techniques, described later in this chapter.
Decision Tables
Decision tables conveniently organize information and knowledge in a systematic, tabular manner to prepare it for analysis. For example, say that an investment company is considering investing in one of three alternatives: bonds, stocks, or certificates of deposit (CDs). The company is interested in one goal: maximizing the yield on the investment after one year. If it were interested in other goals, such as safety or liquidity, the problem would be classified as one of multi-criteria decision analysis (see Koksalan and Zionts, 2001).
The yield depends on the state of the economy sometime in the future (often called the state of nature), which can be solid growth, stagnation, or inflation. Experts estimated the following annual yields:
• If there is solid growth in the economy, bonds will yield 12 percent, stocks 15 percent, and time deposits 6.5 percent.
• If stagnation prevails, bonds will yield 6 percent, stocks 3 percent, and time deposits 6.5 percent.
• If inflation prevails, bonds will yield 3 percent, stocks will bring a loss of 2 percent, and time deposits will yield 6.5 percent.
The problem is to select the one best investment alternative. These are assumed to be discrete alternatives. Combinations, such as investing 50 percent in bonds and 50 percent in stocks, must be treated as new alternatives.
The investment decision-making problem can be viewed as a two-person game (see Kelly, 2002). The investor makes a choice (i.e., a move), and then a state of nature occurs (i.e., makes a move). Table 9.3 shows the payoff table of the mathematical model. The table includes decision variables (the alternatives), uncontrollable variables (the states of the economy, i.e., the environment), and result variables (the projected yields, i.e., the outcomes). All the models in this section are structured in a spreadsheet framework.
If this were a decision-making problem under certainty, we would know what the economy will be and could easily choose the best investment. But that is not the case, so we must consider the two situations of uncertainty and risk. Under uncertainty, we do not know the probabilities of each state of nature. Under risk, we assume that we know the probabilities with which each state of nature will occur.
TREATING UNCERTAINTY Several methods are available for handling uncertainty. For example, the optimistic approach assumes that the best possible outcome of each alternative will occur and then selects the best of the best (i.e., stocks). The pessimistic approach assumes that the worst possible outcome for each alternative will occur and selects the best of these (i.e., CDs). Another approach simply assumes that all states of nature are equally possible. (See Clemen and Reilly, 2000; Goodwin and Wright, 2000; and Kontoghiorghes et al., 2002.) Every approach for handling uncertainty has serious problems. Whenever possible, the analyst should attempt to gather enough information so that the problem can be treated under assumed certainty or risk.
TREATING RISK The most common method for solving this risk analysis problem is to select the alternative with the greatest expected value. Assume that experts estimate the chance of solid growth at 50 percent, the chance of stagnation at 30 percent, and the chance of inflation at 20 percent. The decision table is then rewritten with the known probabilities (see Table 9.4). An expected value is computed by multiplying the results (i.e., outcomes) by their respective probabilities and adding them. For example, investing in bonds yields an expected return of 12(0.5) + 6(0.3) + 3(0.2) = 8.4 percent.
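The same expected-value arithmetic for all three alternatives can be laid out in a few lines of Python; this small sketch is our addition, using the yields from Table 9.3 and the probabilities just assumed.

probs = {"growth": 0.5, "stagnation": 0.3, "inflation": 0.2}
yields = {
    "Bonds":  {"growth": 12.0, "stagnation": 6.0, "inflation": 3.0},
    "Stocks": {"growth": 15.0, "stagnation": 3.0, "inflation": -2.0},
    "CDs":    {"growth": 6.5,  "stagnation": 6.5, "inflation": 6.5},
}
for alt, y in yields.items():
    ev = sum(probs[s] * y[s] for s in probs)   # expected yield, in percent
    print(f"{alt}: {ev:.2f}%")                 # Bonds 8.40%, Stocks 8.00%, CDs 6.50%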
This approach can sometimes be a dangerous strategy, because the utility of each potential outcome may be different from its value. Even if there is an infinitesimal chance of a catastrophic loss, the expected value may seem reasonable, while the investor may not be willing to cover the loss. For example, suppose a financial advisor presents you with an "almost sure" investment of $1,000 that can double your money in one day, and then
TABLE 9.3 Investment Problem Decision Table Model

             State of Nature (Uncontrollable Variables)
Alternative  Solid Growth (%)  Stagnation (%)  Inflation (%)
Bonds        12.0              6.0             3.0
Stocks       15.0              3.0             -2.0
CDs          6.5               6.5             6.5
TABLE 9.4 Multiple Goals

Alternative  Yield (%)  Safety     Liquidity
Bonds        8.4        High       High
Stocks       8.0        Low        High
CDs          6.5        Very high  High
the advisor says, "Well, there is a .9999 probability that you will double your money, but unfortunately there is a .0001 probability that you will be liable for a $500,000 out-of-pocket loss." The expected value of this investment is as follows:
0.9999($2,000 - $1,000) + 0.0001(-$500,000 - $1,000) = $999.90 - $50.10 = $949.80
The potential loss could be catastrophic for any investor who is not a billionaire. Depending on the investor's ability to cover the loss, an investment has different expected utilities. Remember that the investor makes the decision only once.
Decision Trees
An alternative representation of the decision table is a decision tree (for examples, see
Mind Tools Ltd. , mindtools.com) . A decision tree shows the re la tio nships of the
proble m g raphically and can handle complex situatio ns in a compact form. However, a
decision tree can be cumbe rso me if there are ma ny alternatives or states of n ature. TreeAge
Pro (TreeAge Software Inc., treeage.com) and PrecisionTree (Palisade Corp. , palisade.
com) include powerful , intuitive, a nd sophisticated decision tree an alysis systems . These
vendo rs also provide excelle nt examples of decision trees used in practice. Note that
the phrase decision tree has been used to describe two diffe re nt types o f models a nd
algorithms. In the current context, decisio n trees refer to scenario an alysis. On the o ther
hand, som e classification algorithms in predictive an alysis (see Ch apters 5 and 6) also are
called decision tree algorithms.
A simplified investment case of multiple goals (a decision situation in which alternatives are evaluated with several, sometimes conflicting, goals) is shown in Table 9.4. The three goals (criteria) are yield, safety, and liquidity. This situation is under assumed certainty; that is, only one possible consequence is projected for each alternative (the more complex cases of risk or uncertainty could also be considered). Some of the results are qualitative (e.g., low, high) rather than numeric.

See Clemen and Reilly (2000), Goodwin and Wright (2000), and the Decision Analysis Society (faculty.fuqua.duke.edu/daweb) for more on decision analysis.
Although doing so is quite complex, it is possible to apply mathematical programming directly to decision-making situations under risk. We discuss several other methods of treating risk in the next few chapters. These include simulation and certainty factors.
SECTION 9.8 REVIEW QUESTIONS
1. What is a decision table?
2. What is a decision tree?
3. How can a decision tree be used in decision making?
4. Describe what it means to have multiple goals.
9.9 MULTI-CRITERIA DECISION MAKING WITH PAIRWISE
COMPARISONS
Multi-criteria (goal) decision making was introduced in Chapter 2. One of the most
effective approaches is to use weights based on decision-making priorities. However,
soliciting weights (or priorities) from managers is a complex task, as is calculation of the
weighted averages needed to choose the best alternative. The process is complicated
further by the presence of qualitative variables. One method of multi-criteria decision
making is the analytic hierarchy process developed by Saaty.
The Analytic Hierarchy Process
The analytic hierarchy process (AHP), developed by Thomas Saaty (1995, 1996), is an excellent modeling structure for representing multi-criteria (multiple goals, multiple objectives) problems, with sets of criteria and alternatives (choices), commonly found in business environments. The decision maker uses AHP to decompose a decision-making problem into relevant criteria and alternatives. The AHP separates the analysis of the criteria from the analysis of the alternatives, which helps the decision maker focus on small, manageable portions of the problem. The AHP manipulates quantitative and qualitative decision-making criteria in a fairly structured manner, allowing a decision maker to make trade-offs quickly and "expertly." Application Case 9.6 gives an example of an application of AHP in the selection of IT projects.
Application Case 9.6
U.S. HUD Saves the House by Using AHP for Selecting IT Projects
The U.S. Department of Housing and Urban Development's (HUD) mission is to increase homeownership, support community development, and increase access to affordable housing free from discrimination. HUD's total annual budget is $32 billion, with roughly $400 million allocated to IT spending each year. HUD was annually besieged by requests for IT projects by its program areas but had no rational process that allowed management to select and monitor the best projects within its budgetary constraints. Like most federal agencies, HUD was required by congressional act to hire a CIO and develop an IT capital planning process. However, it wasn't until the Office of Management and Budget (OMB) threatened to cut agency budgets in 1999 that an IT planning process was actually developed and implemented at HUD. There had been a great deal of wasted money and manpower in the duplication of efforts by program areas, a lack of a sound project prioritization process, and no standards or guidelines for the program areas to follow.

For example, in 1999 there were requests for over $600 million in HUD IT projects against an IT budget of less than $400 million. There were over 200 approved projects but no process for selecting, monitoring, and evaluating these projects. HUD could not determine whether its selected IT projects were properly aligned with the agency's mission and objectives and were thus the most effective projects.

The agency determined from best practices and industry research that it needed both a rational process and a tool to support this process to meet OMB's requirements. Using the results from this research, HUD recommended that a process and guidelines be developed that would allow senior HUD management to select and prioritize the objectives and selection criteria while allowing the program teams to score specific project requests. HUD now uses the analytic hierarchy process, through Expert Choice software, with its capital planning process to select, manage, and evaluate its IT portfolio in real time, while the selected IT programs are being implemented.

The results have been staggering: With the new methodology and Expert Choice, HUD has reduced the preparation and meeting time for the annual selection and prioritization of IT projects from months to mere weeks, saving time and management hours. Program area requests in recent IT budgets dropped from the 1999 level of over $600 million to less than $450 million as managers recognized that the selection criteria for IT projects were going to be fairly and stringently applied by senior management, and the number of projects funded dropped from 204 to 135. In the first year of implementation, HUD reallocated $55 million of its IT budget to more effective projects that were better aligned with the agency's objectives.

In addition to saving time, the fair and transparent process has increased buy-in at all levels of management. There are few opportunities or incentives, if any, for an "end run" around the process. HUD now requires that each assistant secretary for the program areas sign off on the weighted selection criteria, and managers now know that special requests are likely fruitless if they cannot be supported by the selection criteria.

Source: http://expertchoice.com/xres/uploads/resource-center-documents/HUD_casestudy (accessed February 2013).
Expert Choice (expertchoice.com; a demo is available directly on its Web site) is an excellent commercial implementation of AHP. A problem is represented as an inverted tree with a goal node at the top. All the weight of the decision is in the goal (1.000). Directly beneath and attached to the goal node are the criteria nodes. These are the factors that are important to the decision maker. The goal is decomposed into criteria, to which 100 percent of the weight of the decision from the goal is distributed. To distribute the weight, the decision maker conducts pairwise comparisons of the criteria: first criterion to second, first to third, ..., first to last; then, second to third, ..., second to last; ...; and then the next-to-last criterion to the last one. This establishes the importance of each criterion; that is, how much of the goal's weight is distributed to each criterion (how important each criterion is). This objective method is performed by internally manipulating matrices mathematically. The manipulations are transparent to the user because the operational details of the method are not important to the decision maker. Finally, an inconsistency index indicates how consistent the comparisons were, thus identifying inconsistencies, errors in judgment, or simply errors. The AHP method is consistent with decision theory.
The decision maker can make comparisons verbally (e.g., one criterion is moderately more important than another), graphically (with bar and pie charts), or numerically (with a matrix; comparisons are scaled from 1 to 9). Students and business professionals generally prefer the graphical and verbal approaches over matrices (based on an informal sample).
Beneath each criterion are the same sets of choices (alternatives) in the simple case described here. Like the goal, the criteria decompose their weight into the choices, which capture 100 percent of the weight of each criterion. The decision maker performs a pairwise comparison of choices in terms of preferences, as they relate to the specific criterion under consideration. Each set of choices must be pairwise compared as they relate to each criterion. Again, all three modes of comparison are available, and an inconsistency index is derived for each set and reported.
Finally, the results are synthesized and displayed on a bar graph. The choice with the most weight is the correct choice. However, under some conditions the correct decision may not be the right one. For example, if there are two "identical" choices (e.g., if you are selecting a car for purchase and you have two identical cars), they may split the weight and neither will have the most weight. Also, if the top few choices are very close, there may be a missing criterion that could be used to differentiate among these choices.
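The weight-distribution step just described can be approximated in a few lines of code. The sketch below, in Python with NumPy, derives priorities from a pairwise-comparison matrix by normalizing each column and averaging across rows (the same approximation the Web-HIPRE tutorial later in this section describes); the 3-criterion matrix is a made-up illustration, not data from Expert Choice:

import numpy as np

def ahp_weights(M):
    """Approximate AHP priorities: normalize each column of the
    pairwise-comparison matrix so it sums to 1, then average the rows."""
    M = np.asarray(M, dtype=float)
    return (M / M.sum(axis=0)).mean(axis=1)

# Hypothetical judgments: criterion A is 3x as important as B and
# 5x as important as C; B is 2x as important as C.
M = [[1.0, 3.0, 5.0],
     [1/3, 1.0, 2.0],
     [1/5, 1/2, 1.0]]
print(ahp_weights(M).round(3))        # -> [0.648 0.23  0.122]

# Rough consistency check: for a perfectly consistent matrix the
# principal eigenvalue equals n, so (lambda_max - n) / (n - 1) near 0 is good.
lam = max(np.linalg.eigvals(np.asarray(M)).real)
print(round((lam - 3) / (3 - 1), 3))  # close to 0 for this matrix

Commercial tools such as Expert Choice use more elaborate eigenvector computations, but the normalized-column average is a standard classroom approximation of the same idea.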
Expert Choice also has a sensitivity analysis module. A newer version of the product, called Comparion, also synthesizes the results of a group of decision makers using the same model. This version can work on the Web. Overall, AHP as implemented in Expert Choice attempts to derive a decision maker's preference (utility) structure in terms of the criteria and choices and help him or her make an expert choice.

In addition to Expert Choice, other software packages allow for weighting of pairwise choices. For example, Web-HIPRE (hipre.aalto.fi), an adaptation of AHP and several other weighting schemes, enables a decision maker to create a decision model, enter pairwise preferences, and analyze the optimal choice. These weightings can be computed using AHP as well as other techniques. It is available as a Java applet on the Web, so it can be easily located and run online, free for noncommercial use. To run Web-HIPRE, one has to access the site and leave a Java applet window running. The user can enter a problem by providing the general labels for the decision tree at each node level and then entering the problem components. After the model has been specified, the user can enter pairwise preferences at each node level for criteria/subcriteria/alternatives. Once that is done, the appropriate analysis algorithm can be used to determine the model's final recommendation. The software can also perform sensitivity analysis to determine which criteria/subcriteria play a dominant role in the decision process. Finally, Web-HIPRE can also be employed in group mode. In the following paragraphs, we provide a tutorial on using AHP through Web-HIPRE.
Tutorial on Applying Analytic Hierarchy Process Using Web-HIPRE
The following paragraphs give an example of applying the analytic hierarchy process to the decision of selecting a movie that suits an individual's interest. Phrasing the decision problem in AHP terminology:

1. The goal is to select the most appropriate movie of interest.
2. Let us identify some criteria for making this decision. To get started, let us agree that the main criteria for movie selection are genre, language, day of release, and user/critics rating.
3. The subcriteria for each of the main criteria are listed here:
   a. Genre: Action, Comedy, Sci-Fi, Romance
   b. Language: English, Hindi
   c. Day of Release: weekday, weekend
   d. User/Critics Rating: High, Average, Low
4. Let us assume that the alternatives are the following current movies: SkyFall, The Dark Knight Rises, The Dictator, Dabaang, Alien, and DDLJ.
The following steps enable setting up the AHP using Web-HIPRE. The same can be done using commercial-strength software such as Expert Choice/Comparion and many other tools. As mentioned earlier, Web-HIPRE can be accessed online at hipre.aalto.fi.
Step 1 Web-HIPRE allows the users to create the goal, associated main criteria, subcriteria, and alternatives, and to establish appropriate relationships among each of them. Once the application is opened, double-clicking on the diagram space allows users to create all the elements, which are renamed as the goal, criteria, and alternatives. Selecting an element and right-clicking on the desired element will create a relationship between these two elements.

Figure 9.12 shows the entire view of the sample decision problem of selecting a movie: a sequence of goal, main criteria, subcriteria, and the alternatives.
Step 2 All of the main criteria related to the goal are then ranked with their relative importance over each other using a comparative ranking scale from 1 to 9, in ascending order of importance. To begin entering your pairwise priorities for any element's children nodes, you click on the Priorities menu and then select AHP as the method of ranking. Again, note that each comparison is made between just two competing criteria/subcriteria or alternatives with respect to the parent node. For example, in the current problem, the rating of the movie was considered to be the most important criterion, followed by genre, release day, and language. The criteria are ranked or rated in a pairwise mode with respect to the parent node, the goal of selecting a movie. The tool readily normalizes the rankings of each of the main criteria over one another to a scale ranging from 0 to 1 and then calculates the row averages to arrive at an overall importance rating between 0 and 1 (this normalize-and-average step is the one illustrated in the code sketch after Step 5).

Figure 9.13 shows the main criteria ranked over one another and the final ranking of each of the main criteria.
FIGURE 9.12 Main AHP Diagram. (Goal: Movie Selection. Criteria and subcriteria: Genre (Action, Comedy, Sci-Fi, Romance), Rating (High, Average, Low), Release day (Week Day, Weekend), Language (English, Hindi). Alternatives: SkyFall, The Dark Knight Rises, The Dictator, Dabaang, Alien, DDLJ (Hindi).)
Step 3 All of the subcriteria related to each of the main criteria are then ranked with their relative importance over one another. In the current example, under one of the main criteria, Genre, the subcriterion Comedy is ranked with the highest importance, followed by Action, Romance, and Sci-Fi. The ranking is normalized and averaged to yield a final score between 0 and 1. Likewise, for each of the main criteria, all subcriteria are relatively ranked over one another.
FIGURE 9.13 Ranking Main Criteria. (Pairwise comparisons of the main criteria on the 1-9 scale, with the resulting normalized priorities; consistency measure CM: 0.351.)

                 Genre    Rating    Release day    Language    Priority
Genre             1.0      0.18         6.4           5.8        0.253
Rating            5.7      1.0          6.4           6.4        0.609
Release day       0.16     0.16         1.0           5.9        0.098
Language          0.17     0.16         0.17          1.0        0.041
Figure 9.14 shows the subcriteria ranked over one another and the final ranking of each of the subcriteria with respect to the main criterion, Genre.
Step 4 Each alternative is ranked with respect to all of the subcriteria that are linked with the alternatives, in a similar fashion, using the relative scale of 1 to 9. Then the overall importance of each alternative is calculated using normalization and row averages of the rankings of each of the alternatives.

Figure 9.15 shows the alternatives specific to the Comedy subcriterion being ranked over each other.
Step 5 The final result, the relative importance of each of the alternatives with respect to the weighted scores of the subcriteria as well as the main criteria, is obtained from the composite priority analysis involving all the subcriteria and main criteria associated with each of the alternatives. The alternative with the highest composite score, in this case the movie The Dark Knight Rises, is then selected as the right choice for the main goal. A sketch of this composite calculation appears after these steps.

Figure 9.16 shows the composite priority analysis.

Note that this example follows a top-down approach of choosing alternatives by first setting up priorities among the main criteria and subcriteria and eventually evaluating the relative importance of the alternatives. Similarly, a bottom-up approach of first evaluating the alternatives with respect to the subcriteria and then setting up priorities among subcriteria and main criteria can also be followed in choosing a particular alternative.
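To illustrate Step 5's synthesis numerically, the sketch below combines the main-criteria priorities reported in Figure 9.13 with per-criterion alternative scores, collapsing the subcriteria layer for brevity. The alternative scores here are hypothetical stand-ins (the figures do not report a full set), so take the mechanics, not the numbers, from this example:

# Composite priority synthesis (Step 5): weight each alternative's
# per-criterion score by the criterion's priority and sum.
criteria_weights = {"Genre": 0.253, "Rating": 0.609,
                    "Release day": 0.098, "Language": 0.041}  # from Figure 9.13

# Hypothetical local priorities of two alternatives under each criterion.
alt_scores = {
    "The Dark Knight Rises": {"Genre": 0.30, "Rating": 0.45,
                              "Release day": 0.50, "Language": 0.40},
    "SkyFall":               {"Genre": 0.25, "Rating": 0.30,
                              "Release day": 0.50, "Language": 0.40},
}

for alt, scores in alt_scores.items():
    composite = sum(criteria_weights[c] * scores[c] for c in criteria_weights)
    print(f"{alt}: {composite:.3f}")
# -> The Dark Knight Rises: 0.415, SkyFall: 0.311
# The alternative with the highest composite score is selected.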
FIGURE 9.14 Ranking Subcriteria. (Pairwise comparisons of Action, Comedy, Sci-Fi, and Romance with respect to Genre, on the 1-9 scale.)
FIGURE 9.15 Ranking Alternatives. (Pairwise comparisons of the alternatives, e.g., The Dictator versus Dabaang, with respect to the Comedy subcriterion.)
FIGURE 11.4 Structure/Architecture of an Expert System. (The user interacts through the user interface; the inference engine reasons over the knowledge base and the blackboard (workspace), whose facts are held in short-term working memory; an explanation facility justifies conclusions; a knowledge refinement component produces refined rules; external data sources supply data/information via the WWW.)
program for constructing or expanding the knowledge base. Potential sources of knowledge include human experts, textbooks, multimedia documents, databases (public and private), special research reports, and information available on the Web.

Currently, most organizations have collected a large volume of data, but the organization and management of organizational knowledge are limited. Knowledge acquisition deals with issues such as making tacit knowledge explicit and integrating knowledge from multiple sources.

Acquiring knowledge from experts is a complex task that often creates a bottleneck in ES construction. In building large systems, a knowledge engineer, or knowledge elicitation expert, needs to interact with one or more human experts in building the knowledge base. Typically, the knowledge engineer helps the expert structure the problem area by interpreting and integrating human answers to questions, drawing analogies, posing counterexamples, and bringing conceptual difficulties to light.
Knowledge Base

The knowledge base is the foundation of an ES. It contains the relevant knowledge necessary for understanding, formulating, and solving problems. A typical knowledge base may include two basic elements: (1) facts that describe the characteristics of a specific problem situation (the fact base) and the theory of the problem area and (2) special heuristics or rules (or knowledge nuggets) that represent the deep expert knowledge needed to solve specific problems in a particular domain. Additionally, the inference engine can include general-purpose problem-solving and decision-making rules (meta-rules: rules about how to process production rules).

It is important to differentiate between the knowledge base of an ES and the knowledge base of an organization. The knowledge stored in the knowledge base of an ES is often represented in a special format so that it can be used by a software program (i.e., an expert system shell) to help users solve a particular problem. The organizational knowledge base, however, contains various kinds of knowledge in different formats (most of which is represented in a way that it can be consumed by people) and may be stored in different places. The knowledge base of an ES is a special case and only a very small subset of an organization's knowledge base.
Inference Engine

The "brain" of an ES is the inference engine, also known as the control structure or the rule interpreter (in rule-based ES). This component is essentially a computer program that provides a methodology for reasoning about information in the knowledge base and on the blackboard to formulate appropriate conclusions. The inference engine provides directions about how to use the system's knowledge by developing the agenda that organizes and controls the steps taken to solve problems whenever a consultation takes place. It is further discussed in Section 11.7.
User Interface

An ES contains a language processor for friendly, problem-oriented communication between the user and the computer, known as the user interface. This communication could best be carried out in a natural language. Due to technological constraints, most existing systems use a graphical or textual question-and-answer approach to interact with the user.
Blackboard (Workplace)

The blackboard is an area of working memory set aside as a database for describing the current problem, as characterized by the input data. It is also used for recording intermediate results, hypotheses, and decisions. Three types of decisions can be recorded on the blackboard: a plan (i.e., how to attack the problem), an agenda (i.e., potential actions awaiting execution), and a solution (i.e., candidate hypotheses and alternative courses of action that the system has generated thus far).
Consider this example. When your car fails to start, you can enter the symptoms of the failure into a computer for storage in the blackboard. As the result of an intermediate hypothesis developed in the blackboard, the computer may then suggest that you do some additional checks (e.g., see whether your battery is connected properly) and ask you to report the results. This information is also recorded in the blackboard. Such an iterative process of populating the blackboard with values of hypotheses and facts continues until the reason for the failure is identified.
Explanation Subsystem (Justifier)

The ability to trace responsibility for conclusions to their sources is crucial both in the transfer of expertise and in problem solving. The explanation subsystem can trace such responsibility and explain the ES behavior by interactively answering questions such as these:

• Why was a certain question asked by the ES?
• How was a certain conclusion reached?
• Why was a certain alternative rejected?
• What is the complete plan of decisions to be made in reaching the conclusion? For example, what remains to be known before a final diagnosis can be determined?

In most ES, the first two questions (why and how) are answered by showing the rule that required asking a specific question and by showing the sequence of rules that were used (fired) to derive the specific recommendations, respectively.
Knowledge-Refining System

Human experts have a knowledge-refining system; that is, they can analyze their own knowledge and its effectiveness, learn from it, and improve on it for future consultations. Similarly, such evaluation is necessary in expert systems so that a program can analyze the reasons for its success or failure, which could lead to improvements resulting in a more accurate knowledge base and more effective reasoning.

The critical component of a knowledge-refining system is the self-learning mechanism that allows it to adjust its knowledge base and its processing of knowledge based on the evaluation of its recent past performance. Such an intelligent component is not yet mature enough to appear in many commercial ES tools. Application Case 11.4 illustrates another application of expert systems, in healthcare.
Application Case 11.4
Diagnosing Heart Diseases by Signal Processing
Auscultation is the science of listening to the sounds of internal body organs, in this case the heart. Skilled experts can make diagnoses using this technique. It is a noninvasive screening method that provides valuable information about the condition of the heart and its valves, but it is highly subjective and depends on the skills and experience of the listener. Researchers from the Department of Electrical & Electronic Engineering at Universiti Teknologi Petronas have developed an Exsys Corvid expert system, SIPMES (Signal Processing Module Integrated Expert System), to analyze digitally processed heart sounds.

The system utilizes digitized heart sound algorithms to diagnose various conditions of the heart. Heart sounds are acquired using a digital electronic stethoscope. The heart sounds were collected from the Institut Jantung Negara (National Heart Institute) in Kuala Lumpur and the Fatimah Ipoh Hospital in Malaysia. A total of 40 patients, ages 16 to 79 years, with various pathologies served as the control group to test the validity of the system, using their abnormal heart sound samples and other patient medical data.

The heart sounds are transmitted over a wireless link to a nearby workstation that hosts the Signal Processing Module (SPM). The SPM can segment the stored heart sounds into individual cycles and identify the important cardiac events.

The SPM data was then integrated with the Exsys Corvid knowledge automation expert system. The rules in the system use expert physician reasoning knowledge, combined with information acquired from medical journals, medical textbooks, and other noted publications on cardiovascular diseases (CVD). The system provides the diagnosis and generates a list of diseases arranged in descending order of their probability of occurrence.

SIPMES was designed to diagnose all types of cardiovascular heart diseases. The system can help general physicians diagnose heart diseases at the earliest possible stages in emergency situations where expert cardiologists and advanced medical facilities are not readily available.
The diagnosis made by the system has been counterchecked by senior cardiologists, and the results coincide with those of the heart experts. A high coincidence factor of 74 percent has been achieved using SIPMES.
QUESTIONS FOR DISCUSSION
1. List the major components involved in building
SIPMES and briefly comment on them.
2. Do expert systems like SIPMES eliminate the
need for human decision making?
3. How often do you think that the existing expert
systems, once built, should be changed?
What We Can Learn from This Application Case

Many expert systems are prominently used in the field of medicine. Many traditional diagnostic procedures are now being built into logical rule-based systems, which can readily assist the medical staff in quickly diagnosing a patient's condition. These expert systems can help save the valuable time of the medical staff and increase the number of patients being served.

Source: www.exsys.com, "Diagnosing Heart Diseases," http://www.exsyssoftware.com/CaseStudySelector/casestudies.html (accessed February 2013).
SECTION 11.6 REVIEW QUESTIONS
1. Describe the ES development environment.
2. List and define the major components of an ES.
3. What are the major activities performed in the ES blackboard (workplace)?
4. What are the major roles of the explanation subsystem?
5. Describe the difference between a knowledge base of an ES and an organizational knowledge base.
11.7 KNOWLEDGE ENGINEERING

The collection of intensive activities encompassing the acquisition of knowledge from human experts (and other information sources) and its conversion into a repository (commonly called a knowledge base) is called knowledge engineering. The term knowledge engineering was first defined in the pioneering work of Feigenbaum and McCorduck (1983) as the art of bringing the principles and tools of artificial intelligence research to bear on difficult application problems requiring the knowledge of experts for their solutions. Knowledge engineering requires cooperation and close communication between the human experts and the knowledge engineer to successfully codify and explicitly represent the rules (or other knowledge-based procedures) that a human expert uses to solve problems within a specific application domain. The knowledge possessed by human experts is often unstructured and not explicitly expressed. A major goal of knowledge engineering is to help experts articulate how they do what they do and to document this knowledge in a reusable form.

Knowledge engineering can be viewed from two perspectives: narrow and broad. According to the narrow perspective, knowledge engineering deals with the steps necessary to build expert systems (i.e., knowledge acquisition, knowledge representation, knowledge validation, inferencing, and explanation/justification). Alternatively, according to the broad perspective, the term describes the entire process of developing and maintaining any intelligent system. In this book, we use the narrow definition. Following are the five major activities in knowledge engineering:
• Knowledge acquisition. Knowledge acquisition involves the acquisition of knowledge from human experts, books, documents, sensors, or computer files. The knowledge may be specific to the problem domain or to the problem-solving procedures, it may be general knowledge (e.g., knowledge about business), or it may be metaknowledge (knowledge about knowledge). (By metaknowledge, we mean information about how experts use their knowledge to solve problems and about problem-solving procedures in general.)
• Knowledge representation. Acquired knowledge is organized so that it will be ready for use, in an activity called knowledge representation. This activity involves preparation of a knowledge map and encoding of the knowledge in the knowledge base.
• Knowledge validation. Knowledge validation (or verification) involves validating and verifying the knowledge (e.g., by using test cases) until its quality is acceptable. Test results are usually shown to a domain expert to verify the accuracy of the ES.
• Inferencing. Inferencing (or reasoning) is the use of the represented knowledge, together with the known facts of a specific problem, to draw conclusions; it is carried out by the inference engine and is described later in this section.
• Explanation and justification. This step involves the design and programming of an explanation capability (e.g., programming the ability to answer questions such as why a specific piece of information is needed by the computer or how a certain conclusion was derived by the computer).
Figure 11.5 shows the process of knowledge engineering and the relationships among the knowledge engineering activities. Knowledge engineers interact with human experts or collect documented knowledge from other sources in the knowledge acquisition stage. The acquired knowledge is then coded into a representation scheme to create a knowledge base. The knowledge engineer can collaborate with human experts or use test cases to verify and validate the knowledge base. The validated knowledge can be used in a knowledge-based system to solve new problems via machine inference and to explain the generated recommendations. Details of these activities are discussed in the following sections.
Knowledge Acquisition

Knowledge is a collection of specialized facts, procedures, and judgments, usually expressed as rules. Knowledge can come from one or from many sources, such as books, films, computer databases, pictures, maps, stories, news articles, and sensors, as well as from human experts. Acquisition of knowledge from human experts (often called knowledge elicitation) is arguably the most valuable and most challenging task in knowledge acquisition. Technology Insights 11.1 lists some of the difficulties of knowledge acquisition. The classical knowledge elicitation methods, which are also called manual methods, include interviewing (i.e., structured, semistructured, unstructured), tracking the reasoning process, and observing. Because these manual methods are slow, expensive, and sometimes inaccurate, the ES community has been developing semiautomated and fully automated means to acquire knowledge. These techniques, which rely on computers and AI techniques, aim to minimize the involvement of the knowledge engineer and the human experts in the process. Despite their disadvantages, in real-world ES projects the traditional knowledge elicitation techniques still dominate.
FIGURE 11.5 The Process of Knowledge Engineering. (A problem or opportunity drives knowledge acquisition, which yields raw knowledge; knowledge representation and knowledge validation turn it into validated knowledge and metaknowledge; inferencing (reasoning) and explanation and justification then produce a solution, with a feedback loop returning corrections and refinements to the earlier stages.)
TECHNOLOGY INSIGHTS 11.1 Difficulties in Knowledge Acquisition

Acquiring knowledge from experts is not an easy task. The following are some factors that add to the complexity of knowledge acquisition from experts and its transfer to a computer:

• Experts may not know how to articulate their knowledge or may be unable to do so.
• Experts may lack time or may be unwilling to cooperate.
• Testing and refining knowledge are complicated.
• Methods for knowledge elicitation may be poorly defined.
• System builders tend to collect knowledge from one source, but the relevant knowledge may be scattered across several sources.
• System builders may attempt to collect documented knowledge rather than use experts. The knowledge collected may be incomplete.
• It is difficult to recognize specific knowledge when it is mixed up with irrelevant data.
• Experts may change their behavior when they are observed or interviewed.
• Problematic interpersonal communication factors may affect the knowledge engineer and the expert.
A critical element in the development of an ES is the identification of experts. The usual approach to mitigate this problem is to build an ES for a very narrow application domain in which expertise is more clearly defined. Even then, there is a very good chance that one might find more than one expert with different (sometimes conflicting) expertise. In such situations, one might choose to use multiple experts in the knowledge elicitation process.
Knowledge Verification and Validation

Knowledge acquired from experts needs to be evaluated for quality, including evaluation, validation, and verification. These terms are often used interchangeably. We use the definitions provided by O'Keefe et al. (1987):

• Evaluation is a broad concept. Its objective is to assess an ES's overall value. In addition to assessing acceptable performance levels, it analyzes whether the system would be usable, efficient, and cost-effective.
• Validation is the part of evaluation that deals with the performance of the system (e.g., as it compares to the expert's). Simply stated, validation is building the right system (i.e., substantiating that a system performs with an acceptable level of accuracy).
• Verification is building the system right, or substantiating that the system is correctly implemented to its specifications.

In the realm of ES, these activities are dynamic because they must be repeated each time the prototype is changed. In terms of the knowledge base, it is necessary to ensure that the right knowledge base (i.e., that the knowledge is valid) is used. It is also essential to ensure that the knowledge base has been constructed properly (i.e., verification).
Knowledge Representation

Once validated, the knowledge acquired from experts or induced from a set of data must be represented in a format that is both understandable by humans and executable on computers. A variety of knowledge representation methods is available: production rules, semantic networks, frames, objects, decision tables, decision trees, and predicate logic. Next, we explain the most popular method: production rules.
PRODUCTION RULES Production rules are the most popular form of knowledge representation for expert systems. Knowledge is represented in the form of condition/action pairs: IF this condition (or premise or antecedent) occurs, THEN some action (or result or conclusion or consequence) will (or should) occur. Consider the following two examples:

• IF the stoplight is red AND you have stopped, THEN a right turn is okay.
• IF the client uses purchase requisition forms AND the purchase orders are approved AND purchasing is separate from receiving AND accounts payable AND inventory records, THEN there is strongly suggestive evidence (90 percent probability) that controls to prevent unauthorized purchases are adequate. (This example from an internal control procedure includes a probability.)
Each production rule in a knowledge base implements an autonomous chunk of expertise that can be developed and modified independently of other rules. When combined and fed to the inference engine, the set of rules behaves synergistically, yielding better results than the sum of the results of the individual rules. In some sense, rules can be viewed as a simulation of the cognitive behavior of human experts. According to this view, rules are not just a neat formalism to represent knowledge in a computer; rather, they represent a model of actual human behavior.
KNOWLEDGE AND INFERENCE RULES Two types of rules are common in artificial intelligence: knowledge rules and inference rules. Knowledge rules, or declarative rules, state all the facts and relationships about a problem. Inference rules, or procedural rules, offer advice on how to solve a problem, given that certain facts are known. The knowledge engineer separates the two types of rules: Knowledge rules go to the knowledge base, whereas inference rules become part of the inference engine that was introduced earlier as a component of an expert system. For example, assume that you are in the business of buying and selling gold. The knowledge rules might look like this:

Rule 1: IF an international conflict begins, THEN the price of gold goes up.
Rule 2: IF the inflation rate declines, THEN the price of gold goes down.
Rule 3: IF the international conflict lasts more than 7 days and IF it is in the Middle East, THEN buy gold.

Inference rules contain rules about rules and thus are also called meta-rules. They pertain to other rules (or even to themselves). Inference (procedural) rules may look like this:

Rule 1: IF the data needed are not in the system, THEN request them from the user.
Rule 2: IF more than one rule applies, THEN deactivate any rules that add no new data.
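To make the separation concrete, here is a minimal sketch in Python (not the API of any particular ES shell) in which the gold-trading knowledge rules live in a data structure, while inference-rule behavior, including the meta-rule that skips rules adding no new data, lives in the engine code:

# Declarative knowledge rules: (premises, conclusion) pairs in the knowledge base.
knowledge_rules = [
    ({"international conflict begins"}, "price of gold goes up"),
    ({"inflation rate declines"}, "price of gold goes down"),
    ({"conflict lasts more than 7 days", "conflict is in the Middle East"},
     "buy gold"),
]

def applicable(rules, facts):
    """Procedural (inference-rule) behavior belongs to the engine, not
    the rule base. The `c not in facts` test implements the meta-rule:
    deactivate any rule whose conclusion adds no new data."""
    return [(p, c) for p, c in rules if p <= facts and c not in facts]

facts = {"conflict lasts more than 7 days", "conflict is in the Middle East"}
for premises, conclusion in applicable(knowledge_rules, facts):
    print("fire:", conclusion)        # -> fire: buy gold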
Inferencing

Inferencing (or reasoning) is the process of using the rules in the knowledge base along with the known facts to draw conclusions. Inferencing requires some logic embedded in a computer program to access and manipulate the stored knowledge. This program is an algorithm that, with the guidance of the inferencing rules, controls the reasoning process and is usually called the inference engine. In rule-based systems, it is also called the rule interpreter.

The inference engine directs the search through the collection of rules in the knowledge base, a process commonly called pattern matching. In inferencing, when all of the hypotheses (the "IF" parts) of a rule are satisfied, the rule is said to be fired. Once a rule is fired, the new knowledge generated by the rule (the conclusion, or the validation of the THEN part) is inserted into the memory as a new fact. The inference engine checks every rule in the knowledge base to identify those that can be fired based on what is known at that point in time (the collection of known facts), and it keeps doing so until the goal is achieved. The most popular inferencing mechanisms for rule-based systems are forward and backward chaining:
• Backward chaining is a goal-driven approach in which you start from an expectation of what is going to happen (i.e., a hypothesis) and then seek evidence that supports (or contradicts) your expectation. Often, this entails formulating and testing intermediate hypotheses (or subhypotheses).
• Forward chaining is a data-driven approach. We start from available information as it becomes available, or from a basic idea, and then we try to draw conclusions. The ES analyzes the problem by looking for the facts that match the IF part of its IF-THEN rules. For example, if a certain machine is not working, the computer checks the electricity flow to the machine. As each rule is tested, the program works its way toward one or more conclusions.
FORWARD AND BACKWARD CHAINING EXAMPLE Here we discuss an example involving an investment decision about whether to invest in IBM stock. The following variables are used:

A = Have $10,000
B = Younger than 30
C = Education at college level
D = Annual income of at least $40,000
E = Invest in securities
F = Invest in growth stocks
G = Invest in IBM stock (the potential goal)

Each of these variables can be answered as true (yes) or false (no).

We assume that an investor has $10,000 (i.e., that A is true) and that she is 25 years old (i.e., that B is true). She would like advice on investing in IBM stock (yes or no for the goal).
Our knowledge base includes the following five rules:

R1: IF a person has $10,000 to invest and she has a college degree, THEN she should invest in securities.
R2: IF a person's annual income is at least $40,000 and she has a college degree, THEN she should invest in growth stocks.
R3: IF a person is younger than 30 and she is investing in securities, THEN she should invest in growth stocks.
R4: IF a person is younger than 30, THEN she has a college degree.
R5: IF a person wants to invest in a growth stock, THEN the stock should be IBM.

These rules can be written as follows:

R1: IF A and C, THEN E.
R2: IF D and C, THEN F.
R3: IF B and E, THEN F.
R4: IF B, THEN C.
R5: IF F, THEN G.
Backward Chaining Our goal is to determine whether to invest in IBM stock. With backward chaining, we start by looking for a rule that includes the goal (G) in its conclusion (THEN) part. Because R5 is the only one that qualifies, we start with it. If several rules contain G, then the inference engine dictates a procedure for handling the situation. This is what we do:

1. Try to accept or reject G. The ES goes to the assertion base to see whether G is there. At present, all we have in the assertion base is that A is true and B is true. Therefore, the ES proceeds to step 2.
2. R5 says that if it is true that we invest in growth stocks (F), then we should invest in IBM (G). If we can conclude that the premise of R5 is either true or false, then we have solved the problem. However, we do not know whether F is true. What shall we do now? Note that F, which is the premise of R5, is also the conclusion of R2 and R3. Therefore, to find out whether F is true, we must check either of these two rules.
3. We try R2 first (arbitrarily); if both D and C are true, then F is true. Now we have a problem. D is not a conclusion of any rule, nor is it a fact. The computer can either move to another rule or try to find out whether D is true by asking the investor for whom the consultation is given whether her annual income is above $40,000. What the ES does depends on the search procedures used by the inference engine. Usually, a user is asked for additional information only if the information is not available or cannot be deduced. We abandon R2 and return to the other rule, R3. This action is called backtracking (i.e., knowing that we are at a dead end, we try something else; the computer must be preprogrammed to handle backtracking).
FIGURE 11.6 A Graphical Depiction of Backward Chaining. (Legend: A, B, C, D, E, F, G are facts; 1, 2, 3, 4 is the sequence of rule firings; R1, R2, R3, R4, R5 are rules.)
4. Go to R3; test B and E. We know that B is true because it is a given fact. To prove E, we go to R1, where E is the conclusion.
5. Examine R1. It is necessary to determine whether A and C are true.
6. A is true because it is a given fact. To test C, it is necessary to test R4 (where C is the conclusion).
7. R4 tells us that C is true (because B is true). Therefore, C becomes a fact (and is added to the assertion base). Now E is true, which validates F, which validates our goal (i.e., the advice is to invest in IBM).

Note that during the search, the ES moved from the THEN part to the IF part, back to the THEN part, and so on (see Figure 11.6 for a graphical depiction of the backward chaining).
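The recursive structure of this search is easy to see in code. The following is a minimal backward-chaining sketch of the R1-R5 example in Python; it omits the user-questioning step and keeps backtracking implicit in the loop over candidate rules:

# Rules indexed by conclusion: each conclusion maps to a list of
# premise sets (one per rule that can establish it).
rules = {
    "E": [{"A", "C"}],           # R1
    "F": [{"D", "C"},            # R2
          {"B", "E"}],           # R3
    "C": [{"B"}],                # R4
    "G": [{"F"}],                # R5
}
facts = {"A", "B"}               # the given assertions

def prove(goal):
    """Try to establish `goal`: either it is a known fact, or some rule
    concluding it has premises that can all be proven recursively."""
    if goal in facts:
        return True
    for premises in rules.get(goal, []):
        if all(prove(p) for p in premises):
            facts.add(goal)      # record the newly derived fact
            return True          # otherwise fall through to the next rule
    return False

print(prove("G"))                # True -> advice: invest in IBM

Tracing prove("G") reproduces the narrative above: the {"D", "C"} premise set (R2) fails because D can be neither found nor derived, so the search backtracks to R3, and C is established through R4.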
Forward Chaining Let us use the same example we examined in backward chaining to illustrate the process of forward chaining. In forward chaining, we start with the known facts and derive new facts by using rules that have known facts on the IF side. The specific steps that forward chaining would follow in this example are as follows (see also Figure 11.7 for a graphical depiction of this process):

1. Because it is known that A and B are true, the ES starts deriving new facts by using rules that have A and B on the IF side. Using R4, the ES derives a new fact C and adds it to the assertion base as true.
2. R1 fires (because A and C are true) and asserts E as true in the assertion base.
3. Because B and E are both known to be true (they are in the assertion base), R3 fires and establishes F as true in the assertion base.
4. R5 fires (because F is on its IF side), which establishes G as true. So the ES recommends an investment in IBM stock. If there is more than one conclusion, more rules may fire, depending on the inferencing procedure.
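The same four firings fall out of a simple loop: keep scanning the rule set and firing any rule whose premises are all known until a full pass adds nothing new. A minimal sketch of this forward-chaining loop for the R1-R5 example:

# Forward chaining: fire rules whose IF side is fully satisfied until
# no rule can add a new fact to the assertion base.
rules = [                        # (name, premises, conclusion)
    ("R1", {"A", "C"}, "E"),
    ("R2", {"D", "C"}, "F"),
    ("R3", {"B", "E"}, "F"),
    ("R4", {"B"}, "C"),
    ("R5", {"F"}, "G"),
]
facts = {"A", "B"}

changed = True
while changed:
    changed = False
    for name, premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            print(name, "fires ->", conclusion)   # R4, R1, R3, R5, as in steps 1-4
            changed = True

print("Invest in IBM" if "G" in facts else "No recommendation")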
INFERENCING WITH UNCERTAINTY Although uncertainty is widespread in the real world, its treatment in the practical world of artificial intelligence is very limited. One could argue that because the knowledge provided by experts is often inexact, an ES that mimics the reasoning process of experts should represent such uncertainty.
FIGURE 11.7 A Graphical Depiction of Forward Chaining. (Legend: A, B, C, D, E, F, G are facts; 1, 2, 3, 4 is the sequence of rule firings; R1, R2, R3, R4, R5 are rules.)
ES researchers have proposed several methods to incorporate uncertainty into the reasoning process, including probability ratios, the Bayesian approach, fuzzy logic, the Dempster-Shafer theory of evidence, and the theory of certainty factors. Following is a brief description of the theory of certainty factors, which is the most commonly used method to accommodate uncertainty in ES.
The theory of certainty factors is based on the concepts of belief and disbelief. Standard statistical methods are based on the assumption that an uncertainty is the probability that an event (or fact) is true or false, whereas certainty theory is based on the degrees of belief (not the calculated probability) that an event (or fact) is true or false.

Certainty theory relies on the use of certainty factors. Certainty factors (CF) express belief in an event (or a fact or a hypothesis) based on the expert's assessment. Certainty factors can be represented by values ranging from 0 to 100; the smaller the value, the lower the perceived likelihood that the event (or fact) is true or false. Because certainty factors are not probabilities, when we say that there is a certainty value of 90 for rain, we do not mean (or imply) any opinion about no rain (which is not necessarily 10). Thus, certainty factors do not have to sum up to 100.
Combining Certainty Factors Certainty factors can be used to combine estimates by different experts in several ways. Before using any ES shell, you need to make sure that you understand how certainty factors are combined. The most accepted way of combining them in rule-based systems is the method used in EMYCIN. In this approach, we distinguish between two cases, described next.

Combining Several Certainty Factors in One Rule Consider the following rule with an AND operator:

IF inflation is high, CF = 50 (A),
AND unemployment rate is above 7 percent, CF = 70 (B),
AND bond prices decline, CF = 100 (C),
THEN stock prices decline.

For this type of rule, all IFs must be true for the conclusion to be true. However, in some cases there is uncertainty as to what is happening. Then the CF of the conclusion is the minimum CF on the IF side:

CF(A, B, C) = minimum [CF(A), CF(B), CF(C)]
Thus, in our case, the CF for stock prices to decline is 50 percent. In other words, the chain is as strong as its weakest link.

Now look at this rule with an OR operator:

IF inflation is low, CF = 70,
OR bond prices are high, CF = 85,
THEN stock prices will be high.

In this case, it is sufficient that only one of the IFs is true for the conclusion to be true. Thus, if both IFs are believed to be true (at their certainty factors), then the conclusion will have a CF equal to the maximum of the two:

CF(A or B) = maximum [CF(A), CF(B)]

In our case, the CF is 85 for stock prices to be high. Note that both cases hold for any number of IFs.
Combining Two or More Rules Why might rules be combined? There may be several ways to reach the same goal, each with different certainty factors for a given set of facts. When we have a knowledge-based system with several interrelated rules, each of which makes the same conclusion but with a different certainty factor, each rule can be viewed as a piece of evidence that supports the joint conclusion. To calculate the certainty factor (or the confidence) of the conclusion, it is necessary to combine the evidence. For example, let us assume that there are two rules:

R1: IF the inflation rate is less than 5 percent, THEN stock market prices go up (CF = 0.7).
R2: IF the unemployment level is less than 7 percent, THEN stock market prices go up (CF = 0.6).

Now let us assume a prediction that during the next year, the inflation rate will be 4 percent and the unemployment level will be 6.5 percent (i.e., we assume that the premises of the two rules are true). The combined effect is computed as follows:

CF(R1, R2) = CF(R1) + CF(R2) x [1 - CF(R1)]
           = CF(R1) + CF(R2) - [CF(R1) x CF(R2)]

In this example, given CF(R1) = 0.7 and CF(R2) = 0.6:

CF(R1, R2) = 0.7 + 0.6 - [(0.7) x (0.6)] = 0.88
If we add a third rule, we can use the following formula:

CF(R1, R2, R3) = CF(R1, R2) + CF(R3) x [1 - CF(R1, R2)]
               = CF(R1, R2) + CF(R3) - [CF(R1, R2) x CF(R3)]

In our example:

R3: IF bond prices increase, THEN stock prices go up (CF = 0.85).

CF(R1, R2, R3) = 0.88 + 0.85 - [(0.88) x (0.85)] = 0.982

Note that CF(R1, R2) was computed earlier as 0.88. For a situation with more rules, we can apply the same formula incrementally.
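These combination rules are one-liners in code. A minimal sketch (the function names are ours, not EMYCIN's) that reproduces the numbers above, with the within-rule CFs rescaled from the 0-100 scale to 0-1:

# EMYCIN-style certainty-factor combinations described in the text.
def cf_and(*cfs):               # AND within one rule: the weakest link
    return min(cfs)

def cf_or(*cfs):                # OR within one rule: the strongest disjunct
    return max(cfs)

def cf_combine(cf1, cf2):       # two rules reaching the same conclusion
    return cf1 + cf2 * (1 - cf1)

print(cf_and(0.50, 0.70, 1.00))                 # 0.5  (stock prices decline)
print(cf_or(0.70, 0.85))                        # 0.85 (stock prices high)
print(cf_combine(0.7, 0.6))                     # 0.88 (R1 and R2)
print(cf_combine(cf_combine(0.7, 0.6), 0.85))   # 0.982 (adding R3 incrementally)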
Explanation and Justification

A final feature of expert systems is their interactivity with users and their capacity to provide an explanation consisting of the sequence of inferences that were made by the system in arriving at a conclusion. This feature offers a means of evaluating the integrity of the system when it is to be used by the experts themselves. Two basic types of explanations are the why and the how. Metaknowledge is knowledge about knowledge; it is a structure within the system that uses the domain knowledge to accomplish the system's problem-solving strategy. This section deals with the different methods used in ES for generating explanations.

Human experts are often asked to explain their views, recommendations, or decisions. If ES are to mimic humans in performing highly specialized tasks, they, too, need to justify and explain their actions. An explanation is an attempt by an ES to clarify its reasoning, recommendations, or other actions (e.g., asking a question). The part of an ES that provides explanations is called an explanation facility (or justifier). The explanation facility has several purposes:

• Make the system more intelligible to the user.
• Uncover the shortcomings of the rules and knowledge base (i.e., debugging of the systems by the knowledge engineer).
• Explain situations that were unanticipated by the user.
• Satisfy psychological and social needs by helping the user feel more assured about the actions of the ES.
• Clarify the assumptions underlying the system's operations to both the user and the builder.
• Conduct sensitivity analyses. (Using the explanation facility as a guide, the user can predict and test the effects of changes on the system.)

Explanation in rule-based ES is usually associated with a way of tracing the rules that are fired during the course of a problem-solving session. This is about the closest to a real explanation that today's systems come, given that their knowledge is usually represented almost exclusively as rules that do not include the basic principles necessary for a human-type explanation. Though many kinds of explanations have been proposed, in the following section we provide brief descriptions of the two most common types: the why and the how explanations.
WHY EXPLANATIONS A user poses a typical "why?" question to a computer after the computer asks the user to provide some information. For example, using our earlier investment example, we might have the following dialogue:

COMPUTER: What is your annual income?
CLIENT: Why? (Why do you need to know?)
COMPUTER: R2; meaning "In checking R2, I need to know whether your income is above $40,000. If this is true, I will conclude that, because you have a college degree, you should invest in growth stocks."
HOW EXPLANATIONS A user poses a typical "how?" question when he or she would like to know how a certain conclusion or recommendation was reached. Simple systems are limited to the final conclusion. More complex systems can handle intermediate conclusions as well. The system can explain why a certain rule was fired; that is, it shows the chain of rules used to reach the conclusion. In the investment problem, the how explanation might look like this:

COMPUTER: Invest in IBM stocks.
CLIENT: How? (How was the conclusion reached?)
COMPUTER: R5-R3-R1-R4; meaning "Given that you have $10,000 to invest and you are younger than 30, then according to R4, you have a college degree. If this is the case, then according to R1, you should invest in securities. For a young investor like you, according to R3, you should invest in growth stocks if you are going to invest at all. Finally, according to R5, if you need to invest in growth stocks, then IBM is your best bet."

The why and how explanations often show the rules as they were programmed and not in a natural language. However, some systems have the capability to present these rules in natural language.
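Producing such a rule chain mechanically only requires remembering which rule fired each derived fact. Here is a sketch that extends the earlier forward-chaining loop with a firing record and a recursive how() trace (the function names are ours, not those of any particular shell):

# Forward chaining with a firing record, so "how?" can be answered
# by replaying the chain of rules behind a conclusion.
rules = [("R1", {"A", "C"}, "E"), ("R2", {"D", "C"}, "F"),
         ("R3", {"B", "E"}, "F"), ("R4", {"B"}, "C"),
         ("R5", {"F"}, "G")]
facts, fired_by = {"A", "B"}, {}

changed = True
while changed:
    changed = False
    for name, premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            fired_by[conclusion] = name      # remember the justification
            changed = True

def how(fact):
    """Trace the chain of rules that derived `fact`."""
    if fact not in fired_by:
        return [f"{fact} was given"]
    name = fired_by[fact]
    premises = next(p for n, p, c in rules if n == name)
    trace = [f"{fact} by {name} from {sorted(premises)}"]
    for p in premises:
        trace += how(p)
    return trace

print("\n".join(how("G")))   # G by R5 from ['F'], F by R3 ..., down to the givens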
SECTION 11.7 REVIEW QUESTIONS
1. State two production rules that can represent the knowledge of repairing your car.
2. Describe how ES perform inference.
3. Describe the reasoning procedures of forward chaining and backward chaining.
4. List the three most popular methods to deal with uncertainty in ES.
5. Why do we need to incorporate uncertainty in ES solutions?
6. What are the ways by which ES justify their knowledge?
11.8 PROBLEM AREAS SUITABLE FOR EXPERT SYSTEMS

ES can be classified in several ways. One way is by the general problem areas they address. For example, diagnosis can be defined as "inferring system malfunctions from observations." Diagnosis is a generic activity performed in medicine, organizational studies, computer operations, and so on. The generic categories of ES are listed in Table 11.3. Some ES belong to two or more of these categories. A brief description of each category follows:
• Interpretation systems. Systems that infer situation descriptions from observations. This category includes surveillance, speech understanding, image analysis, signal interpretation, and many kinds of intelligence analyses. An interpretation system explains observed data by assigning them symbolic meanings that describe the situation.
TABLE 11.3 Generic Categories of Expert Systems

Category          Problem Addressed
Interpretation    Inferring situation descriptions from observations
Prediction        Inferring likely consequences of given situations
Diagnosis         Inferring system malfunctions from observations
Design            Configuring objects under constraints
Planning          Developing plans to achieve goals
Monitoring        Comparing observations to plans and flagging exceptions
Debugging         Prescribing remedies for malfunctions
Repair            Executing a plan to administer a prescribed remedy
Instruction       Diagnosing, debugging, and correcting student performance
Control           Interpreting, predicting, repairing, and monitoring system behaviors
• Prediction systems. These systems include weather forecasting; demographic predictions; economic forecasting; traffic predictions; crop estimates; and military, marketing, and financial forecasting.
• Diagnostic systems. These systems include medical, electronic, mechanical, and software diagnoses. Diagnostic systems typically relate observed behavioral irregularities to underlying causes.
• Design systems. These systems develop configurations of objects that satisfy the constraints of the design problem. Such problems include circuit layout, building design, and plant layout. Design systems construct descriptions of objects in various relationships with one another and verify that these configurations conform to stated constraints.
• Planning systems. These systems specialize in planning problems, such as automatic programming. They also deal with short- and long-term planning in areas such as project management, routing, communications, product development, military applications, and financial planning.
• Monitoring systems. These systems compare observations of system behavior with standards that seem crucial for successful goal attainment. These crucial features correspond to potential flaws in the plan. There are many computer-aided monitoring systems for topics ranging from air traffic control to fiscal management tasks.
• Debugging systems. These systems rely on planning, design, and prediction capabilities for creating specifications or recommendations to correct a diagnosed problem.
• Repair systems. These systems develop and execute plans to administer a remedy for certain diagnosed problems. Such systems incorporate debugging, planning, and execution capabilities.
• Instruction systems. Systems that incorporate diagnosis and debugging subsystems that specifically address students' needs. Typically, these systems begin by constructing a hypothetical description of the student's knowledge that interprets her or his behavior. They then diagnose weaknesses in the student's knowledge and identify appropriate remedies to overcome the deficiencies. Finally, they plan a tutorial interaction intended to deliver remedial knowledge to the student.
• Control systems. Systems that adaptively govern the overall behavior of a system. To do this, a control system must repeatedly interpret the current situation, predict the future, diagnose the causes of anticipated problems, formulate a remedial plan, and monitor its execution to ensure success.
Not all the tasks usually found in each of these categories are suitable for ES. However, thousands of decisions do fit into these categories.
SECTION 11.8 REVIEW QUESTIONS
1. Describe a sample ES application for prediction.
2. Describe a sample ES application for diagnosis.
3. Describe a sample ES application for the rest of the generic ES categories.
11.9 DEVELOPMENT OF EXPERT SYSTEMS
The development of ES is a tedious process and typically includes defining the nature and scope of the problem, identifying proper experts, acquiring knowledge, selecting the building tools, coding the system, and evaluating the system.
Defining the Nature and Scope of the Problem
The first step in developing an ES is to identify the nature of the problem and to define its scope. Some domains may not be appropriate for the application of ES. For example, a problem that can be solved by using mathematical optimization algorithms is often inappropriate for ES. In general, rule-based ES are appropriate when the nature of the problem is qualitative, knowledge is explicit, and experts are available to solve the problem effectively and provide their knowledge.
Another important factor is to define a feasible scope. The current technology is still very limited and is capable of solving relatively simple problems. Therefore, the scope of the problem should be specific and reasonably narrow. For example, it may be possible to develop an ES for detecting abnormal trading behavior and possible money laundering, but it is not possible to use an ES to determine whether a particular transaction is criminal.
Identifying Proper Experts
After the nature and scope of the problem have been clearly defined, the next step is to find proper experts who have the knowledge and are willing to assist in developing the knowledge base. No ES can be designed without the strong support of knowledgeable and supportive experts. A project may identify one expert or a group of experts. A proper expert should have a thorough understanding of problem-solving knowledge, the role of ES and decision support technology, and good communication skills.
Acquiring Knowledge
After identifying helpful experts, it is necessary to start acquiring decision knowledge from them. The process of eliciting knowledge is called knowledge engineering. The person who interacts with experts to document the knowledge is called a knowledge engineer.
Knowledge acquisition is a time-consuming and risky process. Experts may be unwilling to provide their knowledge for various reasons. First, their knowledge may be proprietary and valuable. Experts may not be willing to share their knowledge without a reasonable payoff. Second, even though an expert is willing to share, certain knowledge is tacit, and the expert may not have the skill to clearly dictate the decision rules and considerations. Third, experts may be too busy to have enough time to communicate with the knowledge engineer. Fourth, certain knowledge may be confusing or contradictory in nature. Finally, the knowledge engineer may misunderstand the expert and inaccurately document knowledge.
The result of knowledge acquisition is a knowledge base that can be represented in different formats. The most popular one is if-then rules. The knowledge may also be represented as decision trees or decision tables. The knowledge in the knowledge base must be evaluated for its consistency and applicability.
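As a small illustration of these interchangeable formats, the sketch below (a made-up toy example in Python, not drawn from an actual knowledge base) encodes the same advice once as an if-then rule and once as a decision table, and confirms that the two representations agree.

```
# The same toy knowledge expressed as an if-then rule and as a decision table.
# The age/income thresholds and the advice are illustrative assumptions.

def rule_form(age, income):
    # If-then rule form
    if age < 30 and income > 40000:
        return "growth_stocks"
    return "bonds"

# Decision-table form: every combination of conditions maps to an action.
DECISION_TABLE = {
    ("under_30", "above_40k"):   "growth_stocks",
    ("under_30", "at_most_40k"): "bonds",
    ("30_plus",  "above_40k"):   "bonds",
    ("30_plus",  "at_most_40k"): "bonds",
}

def table_form(age, income):
    key = ("under_30" if age < 30 else "30_plus",
           "above_40k" if income > 40000 else "at_most_40k")
    return DECISION_TABLE[key]

assert rule_form(25, 50000) == table_form(25, 50000) == "growth_stocks"
```

The table form makes completeness easy to check (every condition combination has an entry), which is one reason knowledge engineers use it when evaluating a rule base for consistency.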
Selecting the Building Tools
After the knowledge base is built, the next step is to choose a proper tool for implementing the system. There are three different kinds of development tools, as described in the following sections.
GENERAL-PURPOSE DEVELOPMENT ENVIRONMENT The first type of tool is general-purpose computer languages, such as C++, Prolog, and LISP. Most computer programming languages support the if-then statement. Therefore, it is possible to use C++ to develop an ES for a particular problem domain (e.g., disease diagnosis). Because these
programming languages do not have built-in inference capabilities, using them in this way is often very costly and time-consuming. Prolog and LISP are two languages for developing intelligent systems. They are easier to use than C++, but they are still designed for professional programmers and are not very friendly. For recent Web-based applications, Java and computer languages that support Web services (such as the Microsoft .NET platform) are also useful. Companies such as Logic Programming Associates (www.lpa.co.uk) offer Prolog-based tools.
ES SHELLS The second type of development tool, the expert system (ES) shell, is specifically designed for ES development. An ES shell has built-in inference capabilities and a user interface, but the knowledge base is empty. System development is therefore a process of feeding the knowledge base with rules elicited from the expert.
A popular ES shell is the Corvid system developed by Exsys (exsys.com). The system is an object-oriented development platform that is composed of three types of components: variables, logic blocks, and command blocks. Variables define the major factors considered in problem solving. Logic blocks are the decision rules acquired from experts. Command blocks determine how the system interacts with the user, including the order of execution and the user interface. Figure 11.8 shows a screenshot of a logic block containing decision rules in Exsys Corvid. More products are available from business rules management vendors, such as LPA's VisiRule (www.lpa.co.uk/vsr.htm), which is based on a general-purpose tool called Micro-Prolog.
FIGURE 11.8 A Screenshot from Corvid Expert System Shell. [The screenshot shows a logic block from the restaurant selection sample, in which Type_of_Date values (e.g., party_with_friends, getting_acquainted, romantic) set Noise, Privacy, and Formal variables.]
TAILORED TURN-KEY SOLUTIONS The third tool, a tailored turn-key tool, is tailored to
a specific domain and can be adapted to a similar application very quickly. Basically, a
tailored turn-key tool contains specific features often required for developing applications
in a particular domain. This tool must adjust or modify the base system by tailoring the
user interface or a relatively small portion of the system to meet the unique needs of an
organization.
CHOOSING AN ES DEVELOPMENT TOOL Choosing among these tools for ES development
depends on a few criteria. First, you need to consider the cost benefits. Tailored turn-key
solutions are the most expensive option. However, you need to consider the total cost,
not just the cost of the tool. Second, you need to consider the technical functionality and
flexibility of the tool; that is, you need to determine whether the tool provides the functions you need and how easily it allows the development team to make necessary changes.
Third, you need to consider the tool’s compatibility with the existing information infra-
structure in the organization. Most organizations have many existing applications, and the
tool must be compatible with those applications and needs to be able to be integrated as
part of the entire information infrastructure. Finally, you need to consider the reliability
of the tool and vendor support. The vendor’s experiences in similar domains and training
programs are critical to the success of an ES project.
Coding the System
After choosing a proper tool, the development team can focus on coding the knowledge based on the tool's syntactic requirements. The major concern at this stage is whether the coding process is efficient and properly managed to avoid errors. Skilled programmers are helpful and important.
Evaluating the System
After an ES is built, it must be evaluated. Evaluation includes both verification and validation. Verification ensures that the resulting knowledge base contains knowledge exactly the same as that acquired from the expert. In other words, verification ensures that no error occurred at the coding stage. Validation ensures that the system can solve the problem correctly. In other words, validation checks whether the knowledge acquired from the expert can indeed solve the problem effectively. Application Case 11.5 illustrates a case where evaluation played a major role.
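The distinction can be made concrete with a small, hypothetical sketch: verification compares the coded rule base against the rules as elicited from the expert, whereas validation runs the system on expert-labeled cases. The rules, cases, and inference function below are invented for illustration.

```
# A hedged sketch of verification vs. validation for a tiny rule base.
# All rules and test cases are made up for illustration.

EXPERT_RULES = {("income_above_40k",): "check_degree"}   # rules as elicited
CODED_RULES  = {("income_above_40k",): "check_degree"}   # rules as implemented

def verify():
    """Verification: no rule was lost or altered during coding."""
    return CODED_RULES == EXPERT_RULES

EXPERT_CASES = [({"income_above_40k"}, "check_degree")]  # expert-approved outcomes

def infer(facts):
    """A trivial one-step inference over the coded rules."""
    for premises, conclusion in CODED_RULES.items():
        if set(premises) <= facts:
            return conclusion
    return None

def validate():
    """Validation: the running system reproduces expert-approved conclusions."""
    return all(infer(facts) == expected for facts, expected in EXPERT_CASES)

print(verify(), validate())  # True True
```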
Application Case 11.5
Clinical Decision Support System for Tendon Injuries
Flexor tendon injuries in the hand continue to be one of the greatest challenges in hand surgery and hand therapy. Despite the advances in surgical techniques, better understanding is needed of tendon anatomy, the healing process, suture strength, edema, scarring, and stiffness. The Clinical Decision Support System (CDSS) described here focuses on flexor tendon injuries in Zone II, which is technically the most demanding zone in both the surgical and rehabilitation areas. This zone is considered a "No Man's Land" in which few surgeons feel comfortable performing repairs. It is very difficult and time-consuming for both the hand surgeon and the hand therapist working with tendon injury patients to keep up with the ongoing advances in this field. However, it is essential to be aware of all the information that would be
(Continued)
intelligence, Web development technologies, and analytical methods makes it possible
for systems to be deployed that collectively give an organization a major competitive
advantage or allow for better social welfare.
Chapter Highlights
• Artificial intelligence (AI) is a discipline that investigates how to build computer systems to perform tasks that can be characterized as intelligent.
• The major characteristics of AI are symbolic processing, the use of heuristics instead of algorithms, and the application of inference techniques.
• Knowledge, rather than data or information, is the major focus of AI.
• Major areas of AI include expert systems, natural language processing, speech understanding, intelligent robotics, computer vision, fuzzy logic, intelligent agents, intelligent computer-aided instruction, automatic programming, neural computing, game playing, and language translation.
• Expert systems (ES) are the most often applied AI technology. ES attempt to imitate the work of experts. They capture human expertise and apply it to problem solving.
• For an ES to be effective, it must be applied to a narrow domain, and the knowledge must include qualitative factors.
• The power of an ES is derived from the specific knowledge it possesses, not from the particular knowledge representation and inference schemes it uses.
• Expertise is task-specific knowledge acquired through training, reading, and experience.
• ES technology can transfer knowledge from experts and documented sources to the computer and make it available for use by nonexperts.
• The major components of an ES are the knowledge acquisition subsystem, knowledge base, inference engine, user interface, blackboard, explanation subsystem, and knowledge-refinement subsystem.
• The inference engine provides reasoning capability for an ES.
• ES inference can be done by using forward chaining or backward chaining.
• Knowledge engineers are professionals who know how to capture the knowledge from an expert and structure it in a form that can be processed by the computer-based ES.
• The ES development process includes defining the nature and scope of the problem, identifying proper experts, acquiring knowledge, selecting the building tools, coding the system, and evaluating the system.
• ES are popular in a number of generic categories: interpretation, prediction, diagnosis, design, planning, monitoring, debugging, repair, instruction, and control.
• The ES shell is an ES development tool that has the inference engine and building blocks for the knowledge base and the user interface. Knowledge engineers can easily develop a prototype system by entering rules into the knowledge base.
Key Terms

artificial intelligence (AI)
automated decision systems (ADS)
backward chaining
blackboard
certainty factors (CF)
consultation environment
decision automation systems
development environment
expert
expert system (ES)
expert system (ES) shell
expertise
explanation subsystem
forward chaining
inference engine
inference rules
knowledge acquisition
knowledge base
knowledge engineer
knowledge engineering
knowledge rules
knowledge-based system (KBS)
knowledge-refining system
production rules
revenue management systems
rule-based systems
theory of certainty factors
user interface
Questions for Discussion
1. Why are automated decision systems so important for business applications?
2. It is said that powerful computers, inference capabilities, and problem-solving heuristics are necessary but not sufficient for solving real problems. Explain.
3. Explain the relationship between the development environment and the consultation (i.e., runtime) environment.
4. Explain the difference between forward chaining and backward chaining and describe when each is most appropriate.
5. What kinds of mistakes might ES make and why? Why is it easier to correct mistakes in ES than in conventional computer programs?
6. An ES for stock investment is developed and licensed for $1,000 per year. The system can help identify the most undervalued securities on the market and the best timing for buying and selling the securities. Would you order a copy as your investment advisor? Explain why or why not.
Exercises

Teradata UNIVERSITY NETWORK (TUN) and Other Hands-on Exercises
1. Go to teradatauniversitynetwork.com and search for stories about Chinatrust Commercial Bank's (CTCB's) use of the Teradata Relationship Manager and its reported benefits. Study the functional demo of the Teradata Relationship Manager to answer the following questions:
a. What functions in the Teradata Relationship Manager are useful for supporting the automation of business rules? In CTCB's case, identify a potential application that can be supported by rule-based ES and solicit potential business rules for the knowledge base.
b. Access Haley and compare the Teradata Relationship Manager and Haley's Business Rule Management System. Which tool is more suitable for the application identified in the previous question?
2. We list 10 categories of ES applications in the chapter. Find 20 sample applications, 2 in each category, from the various functional areas in an organization (i.e., accounting, finance, production, marketing, and human resources).
3. Download Exsys' Corvid tool for evaluation. Identify an expert (or use one of your teammates) in an area where experience-based knowledge is needed to solve problems, such as buying a used car, selecting a school and major, selecting a job from many offers, buying a computer, or diagnosing and fixing computer problems. Go through the knowledge-engineering process to acquire the necessary knowledge. Using the evaluation version of the Corvid tool, develop a simple expert system application in the expertise area of your choice. Report on your experiences in a written document; use screenshots from the software as necessary.
4. Search to find applications of artificial intelligence and ES. Identify an organization with which at least one member of your group has a good contact and that has a decision-making problem requiring some expertise (but not too complicated). Understand the nature of its business and identify the problems that are supported or can potentially be supported by rule-based systems. Some examples include selection of suppliers, selection of a new employee, job assignment, computer selection, market contact method selection, and determination of admission into graduate school.
5. Identify and interview an expert who knows the domain of your choice. Ask the expert to write down his or her knowledge. Choose an ES shell and build a prototype system to see how it works.
6. Go to exsys.com to play with the restaurant selection example in its demo systems. Analyze the variables and rules contained in the example's knowledge base.
7. Access the Web site of the American Association for Artificial Intelligence (aaai.org). Examine the workshops it has offered over the past year and list the major topics related to intelligent systems.
End-of-Chapter Application Case
Tax Collections Optimization for New York State

Introduction
Tax collection in the State of New York is under the mandate of the New York State Department of Taxation and Finance's Collections and Civil Enforcement Division (CCED). Between 1995 and 2005, CCED changed and improved its operations in order to make tax collection more efficient. Even though the division's staff decreased from over 1,000 employees in 1995 to about 700, its tax collection revenue increased from $500 million to over $1 billion within the same period as a result of the improved systems and procedures it used.
Presentation of Problem
The State of New York found it a challenge to reverse its growing budget deficit, partly due to the unfavorable economic conditions prior to 2009. A key part of the state's budget is revenue from tax collection, which forms about 40 percent of its yearly revenue. The tax collection mechanism was therefore seen as one key area that, if improved, would help decrease the state's budget deficit. The goal was to optimize tax collection in a very efficient way. The existing rigid and manual rules took too long to implement and also required too many personnel and resources. This was not going to be feasible any longer because the resources allocated to the CCED for tax collection were in line to be reduced. This meant the tax collection division had to find ways of doing more with fewer resources.
Methodology/Solution
Out of all the improvements CCED made to its work processes between 1995 and 2005, one area that remained unchanged was the process of collecting delinquent taxes. The existing method for tax collection employed a linear approach to identify, initiate, and collect delinquent taxes. This approach emphasized what should be done, rather than what could be done, by tax collection officers. A "one-size-fits-all" procedure for collection was used within the constraints of allowable laws. However, the challenge of a complex legal tax system, and the less than optimal results produced by the existing scoring system, made the approach deficient. When 70 percent of delinquent cases relate to individuals and 30 percent relate to businesses, it is difficult to operate at an optimal level by taking on delinquent cases based merely on whether doing so is allowable. Better processes that would allow smarter decisions about which delinquent cases to pursue had to be developed within a constrained Markov Decision Process (MDP) framework.
Analytics and optimization processes were coupled with a Constrained Reinforcement Learning (C-RL) method. This method helped develop rules for tax collection based on taxpayer characteristics. That is, the analysts determined that the past behavior of a taxpayer was a major predictor of a taxpayer's future behavior, and this discovery was leveraged by the method used. Basically, the data analytics and optimization process was performed based on the following inputs: a list of business rules for collecting taxes, the state of the tax collection process, and the resources available. These inputs produced rules for allocating actions to be taken in each tax delinquency situation.
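Although the published system is far more sophisticated, a toy sketch can convey the flavor of resource-constrained action allocation: score each delinquent case by predicted recovery and greedily assign allowable actions within a staff-hours budget. The case data, the single-action simplification, and the greedy heuristic are all illustrative assumptions, not the C-RL method the state actually used.

```
# A toy sketch of allocating scarce collection resources to delinquent cases.
# All names and numbers are invented; the real system learned its allocation
# rules with constrained reinforcement learning over an MDP.

from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    expected_recovery: float  # from a predictive model of past taxpayer behavior
    action: str               # best legally allowable action for this case
    hours_needed: float

def allocate(cases, budget_hours):
    """Greedily assign actions by expected recovery per staff-hour, within budget."""
    plan = []
    for c in sorted(cases, key=lambda c: c.expected_recovery / c.hours_needed,
                    reverse=True):
        if c.hours_needed <= budget_hours:
            plan.append((c.case_id, c.action))
            budget_hours -= c.hours_needed
    return plan

cases = [
    Case("A-101", 12000.0, "field_visit", 6.0),
    Case("B-202", 3000.0, "call_center_contact", 0.5),
    Case("C-303", 800.0, "reminder_letter", 0.1),
]
print(allocate(cases, budget_hours=5.0))
# [('C-303', 'reminder_letter'), ('B-202', 'call_center_contact')]
```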
Results/Benefits
The new system, implemented in 2009, enabled the tax agency to collect delinquent tax only when needed as opposed to whenever allowed within the constraints of the law. The year-to-year increase in revenue between 2007 and 2010 was 8.22 percent ($83 million). As a result of more efficient tax collection rules, fewer personnel were needed both at the contact center and in the field. The average age of cases, even with fewer employees, dropped by 9.3 percent, and the amount of dollars collected per field agent increased by about 15 percent. Overall, there was a 7 percent increase in revenue from 2009 to 2010. As a result, more revenue was generated to support state programs.
QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE
1. What is the key difference between the former tax collection system and the new system?
2. List at least three benefits that were derived from implementing the new system.
3. In what ways do analytics and optimization support the generation of an efficient tax collection system?
4. Why was tax collection a target for decreasing the budget deficit in the State of New York?
What We Can Learn from This End-of-Chapter Application Case
This case presents a scenario that depicts the dual use of predictive analytics and optimization in solving real-world problems. Predictive analytics was used to determine the future tax behavior of taxpayers based on their past behavior. Based on the delinquency status of the individual's or corporate body's tax situation, different courses of action are followed based on established rules. Hence, the tax agency was able to sidestep the "one-size-fits-all" policy and initiate tax collection procedures based on what it should do to increase tax revenue, and not just what it could lawfully do. The rule-based system was implemented using the information derived from optimization models.
Source: Gerard Miller, Melissa Weatherwax, Timothy Gardinier, Naoki Abe, Prem Melville, Cezar Pendus, David Jensen, et al., "Tax Collections Optimization for New York State," Interfaces, Vol. 42, No. 1, 2012, pp. 74-84.
References
DemandTec. "Giant Food Stores Prices the Entire Store with DemandTec." https://mydt.demandtec.com/mydemandtec/c/document_library/get_file?uuid=3151a5e4-f3e1-413e-9cd7-333289eeb3d5&groupId=264319 (accessed February 2013).
Exsyssoftware.com. "Advanced Clinical Advice for Tendon Injuries." exsyssoftware.com/CaseStudySelector/AppDetailPages/ExsysCaseStudiesMedical.html (accessed February 2013).
Exsyssoftware.com. "Diagnosing Heart Diseases." exsyssoftware.com/CaseStudySelector/AppDetailPages/ExsysCaseStudiesMedical.html (accessed February 2013).
Exsyssoftware.com. "Identification of Chemical, Biological and Radiological Agents." exsyssoftware.com/CaseStudySelector/AppDetailPages/ExsysCaseStudiesMedical.html (accessed February 2013).
Koushik, Dev, Jon A. Higbie, and Craig Eister. (2012). "Retail Price Optimization at InterContinental Hotels Group." Interfaces, Vol. 42, No. 1, pp. 45-57.
Miller, G., M. Weatherwax, T. Gardinier, N. Abe, P. Melville, C. Pendus, D. Jensen, et al. (2012). "Tax Collections Optimization for New York State." Interfaces, Vol. 42, No. 1, pp. 74-84.
Papic, V., N. Rogulj, and V. Plestina. (2009). "Identification of Sport Talents Using a Web-Oriented Expert System with a Fuzzy Module." Expert Systems with Applications, Vol. 36, pp. 8830-8838.
CHAPTER 12
Knowledge Management and Collaborative Systems
LEARNING OBJECTIVES
• Define knowledge and describe the
different types of knowledge
• Describe the characteristics of
knowledge management
• Describe the knowledge management
cycle
• Describe the technologies that can be
used in a knowledge management
system (KMS)
• Describe different approaches to
knowledge management
• Understand the basic concepts
and processes of groupwork,
communication, and collaboration
• Describe how computer systems facilitate communication and collaboration in an enterprise
• Explain the concepts and importance of the time/place framework
• Explain the underlying principles and
capabilities of groupware, such as group
support systems (GSS)
• Understand how the Web enables
collaborative computing and group
support of virtual meetings
• Describe the role of emerging
technologies in supporting
collaboration
In this chapter, we study two major IT initiatives related to decision support. First,
we describe the characteristics and concepts of knowledge management. We explain
how firms use information technology (IT) to implement knowledge management
(KM) systems and how these systems are transforming modern organizations. Knowledge
management, although conceptually ancient, is a relatively new business philosophy.
The goal of knowledge management is to identify, capture, store, maintain, and deliver
useful knowledge in a meaningful form to anyone who needs it, anyplace and anytime,
within an organization. Knowledge management is about sharing and collaborating at the
organization level. People work together, and groups make most of the complex decisions
in organizations. The increase in organizational decision-making complexity increases the need for meetings and groupwork. Supporting groupwork, where team members may be in different locations and working at different times, emphasizes the important aspects of communications, computer-mediated collaboration, and work methodologies.
Group support is a critical aspect of decision support systems (DSS). Effective computer-supported group support systems have evolved to increase gains and decrease losses in task performance and underlying processes. So this chapter covers both knowledge management and collaborative systems. It consists of the following sections:
12.1 Opening Vignette: Expertise Transfer System to Train Future Army Personnel 508
12.2 Introduction to Knowledge Management 512
12.3 Approaches to Knowledge Management 516
12.4 Information Technology (IT) in Knowledge Management 520
12.5 Making Decisions in Groups: Characteristics, Process, Benefits, and Dysfunctions 523
12.6 Supporting Groupwork with Computerized Systems 526
12.7 Tools for Indirect Support of Decision Making 528
12.8 Direct Computerized Support for Decision Making: From Group Decision Support Systems to Group Support Systems 532
12.1 OPENING VIGNETTE: Expertise Transfer System to Train
Future Army Personnel
A major problem for organizations implementing knowledge management systems such as lessons-learned capabilities is the lack of success of such systems or poor service of the systems to their intended goal of promoting knowledge reuse and sharing. Lessons-learned systems are part of the broad organizational and knowledge management systems that have been well studied by IS researchers. The objective of lessons-learned systems is to support the capture, codification, presentation, and application of expertise in organizations. Lessons-learned systems have failed mainly for two reasons: inadequate representation and lack of integration into an organization's decision-making process.
The expertise transfer system (ETS) is a knowledge transfer system developed by the Spears School of Business at Oklahoma State University as a prototype for the Defense Ammunition Center (DAC) in McAlester, Oklahoma, for use in Army ammunition career fields. The ETS is designed to capture the knowledge of experienced ammunition personnel leaving the Army (i.e., retirements, separations, etc.) and those who have been recently deployed to the field. This knowledge is captured on video, converted into units of actionable knowledge called "nuggets," and presented to the user in a number of learning-friendly views.
ETS begins with an audio/video-recorded (A/V) interview between an interviewee and a "knowledge harvester." Typically, the recording lasts between 60 and 90 minutes. Faculty from the Oklahoma State University College of Education trained DAC knowledge harvesters on effective interviewing techniques, methods of eliciting tacit information from the interviewees, and ways to improve recorded audio quality in the interview process. Once the videos have been recorded, the meat of the ETS process takes place, as depicted in Figure 12.1. First, the digital A/V files are converted to text. Currently, this is accomplished with human transcriptionists, but we have had promising results using voice recognition (VR) technologies for transcription and foresee a day when most of the transcription will be automated. Second, the transcriptions are parsed into small units and organized into knowledge nuggets (KN). Simply put, a knowledge nugget is a significant experience the interviewee had during his/her career that is worth sharing. Then these
FIGURE 12.1 Development Process for Expertise Transfer System. [Diagram: A/V interviews with interviewees are converted to text (protocols), the parsed text is organized via text mining into knowledge nugget objects and knowledge maps under a taxonomy, validated, and incorporated into the ETS, which makes the KNs available.]
KNs are incorporated into the expertise transfer system. Finally, additional features are added to the KNs to make them easy to find, more user friendly, and more effective in the classroom.
KNOWLEDGE NUGGETS
We chose to call the harvested knowledge assets knowledge nuggets (KN). Of the many definitions or explanations provided by a thesaurus for nugget, two stand out: (1) a lump of precious metal, and (2) anything of great value or significance. A knowledge nugget assumes even more importance because knowledge already is of great value. A KN can be just one piece of knowledge, like a video or text. However, a KN can also be a combination of video, text, documents, figures, maps, and so forth. The tools used to transfer knowledge have a central theme, which is the knowledge itself. In our DAC repository, we have a combination of knowledge statements, videos, corresponding transcripts, causal maps, and photographs. The knowledge nugget is a specific lesson learned on a particular topic that has been developed for future use. It consists of several components. Figure 12.2 displays a sample knowledge nugget. A summary page provides the user with the title or "punchline" of the KN, the name and deployment information of the interviewee, and a bulleted summary of the KN. Clicking on the video link brings the user to the KN video clip, whereas clicking on the transcript link provides a complete transcript of the nugget. The KN text is linked back to the portion of the A/V interview from which it was harvested. The result is a searchable 30- to 60-second video clip (with captions) of the KN. A causal map function gives the user an opportunity to see and understand the thought process of the interviewee as he or she describes the situation captured by the nugget. The related links feature provides users with a list of regulatory guidance associated with the KN, and the related nuggets link lists all KNs within the same knowledge domain. Also provided is information about the interviewee, recognized subject matter experts (SMEs) in the KN domain, and supporting images related to the nugget.
I ‘ • I •• • •• ••
~ CATALOG SEARCH DAC ABOUT US COMTACT US LOGOUT –
SAFETY CONSIDERATIOMS ARE A PRIMARY COMCERN FOR AMMO TRANSPORT
Back
SUMMARY:
PEGGY DEAN
Logged in as: ramesh.sharda
Summary
Videos
Transrnpt
[QI llllill
{Deployment Periods: Aug 04 – Feb OS: Apr 07 – Oct 07)
1. Safety should be a primary concern in your decisions to transport ammunition.
causal Map/
Workflow D~ gram
2. Road conditions and distance to ASP were considered before making a decision.
Related Links
Related Nuggets
Tags:
Course Topics
Nugget Manager
Interviewee
.!@_g_(25) Ammunition Safety (1) Transportation (10) Course ID AMM0-37 (2)
SME
ADD A TAG SEARCH TAG
Images
Current Average Rating: 3.5 **** OTHERS
RATE THIS NUGGET My URL
Review Status
Comments:
ADD A NEW COMMENT – MORE •..
Vetted
2010-09-01
FIGURE 12.2 A Sample Knowledge Nugget.
One of the primary objectives of the ETS is to quickly capture knowledge from the field and incorporate it into the training curriculum. This is accomplished with the My URL feature. This function allows course developers and instructors to use ETS to identify a specific nugget for sharing, and then generate a URL that can be passed directly into a course curriculum and lesson plans. When an instructor clicks on the URL, it brings him/her directly to the KN. As such, the "war story" captured in the nugget becomes the course instructor's war story and provides a real-world decision-making or problem-solving scenario right in the classroom.
The summary page also includes capabilities for users to rate the KN and make any comments about its accuracy. This makes the knowledge nugget a live and continuously updated piece of knowledge. These nuggets can then be sorted on the basis of higher ratings, if so desired. Each nugget initially includes keywords created by the nugget developer. These are presented as tags. A user can also suggest his or her own tags. These user-specified tags make future searching faster and easier. This brings Web 2.0 concepts of user participation to knowledge management.
In its initial conceptualization, the ETS was supposed to capture the "lessons learned" of the interviews. However, we quickly learned that the ETS process often
captures "lessons to be learned." That is, the interviewees often found themselves in situations where they had to improvise and be innovative while deployed. Many of their approaches and solutions are quite admirable, but sometimes they may not be appropriate or suitable for everyone. In light of that finding, a vetting process was developed for the KNs. Each KN is reviewed by recognized subject matter experts (SMEs). If the SMEs find the approach acceptable, it is noted as "vetted." If guidance for the KN situation already exists, it is identified and added to the related links. The KN is then noted as "doctrine." If the KN has yet to be reviewed, it is noted as "not reviewed." In this way, the user always has an idea of the quality of each KN viewed. Additionally, if the site must be brought down for any reason, the alerts feature is used to relay that information.
The ETS is designed for two primary types of users: DAC instructors and ammunition personnel. As such, a "push/pull" capability was developed. Tech training instructors do not have the time to search the ETS to find those KNs that are related to the courses they teach. To provide some relief to instructors, the KNs are linked to DAC courses and topics, and can be pushed to instructors' e-mail accounts as the KNs come online. Instructors can opt in or opt out of courses and topics at will, and they can arrange for new KNs to be pushed as often as they like. Ammunition personnel are the other primary users of the ETS. These users need the ability to quickly locate and pull KNs related to their immediate knowledge needs. To aid them, the ETS organizes the nuggets in various views and has a robust search engine. These views include courses and topics; interviewee names; chronological order; and user-created tags. The goal of the ETS is to provide users with the KNs they need in 5 minutes or less.
KNOWLEDGE HARVESTING PROCESS
The knowledge harvesting process began with videotaping interviews with DAC employees regarding their deployment experience. Speech in the interviews, in some cases, was converted manually to text. In other cases, the knowledge harvesting team (hereinafter referred to as the "team") employed voice recognition technologies to convert the speech to text. The text was checked for accuracy and then passed through the text mining division of the team. The text mining group read through the transcript and employed text mining software to extract some preliminary knowledge from the transcript. The text mining process provided a one-sentence summary for the knowledge nugget, which became the knowledge statement, commonly known among the team as the "punchline." The punchline created from the transcripts, along with the excerpts, relevant video from the interview, and causal maps, makes up the entire knowledge nugget. The knowledge nugget is further refined by checking for quality of general appearance, errors in text, and so on.
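A crude, hypothetical approximation of the punchline step is extractive summarization: score each sentence of a transcript by the frequency of its content words and keep the best one. The sketch below illustrates that idea only; it is not the team's actual text mining software, and the transcript text is invented.

```
# An illustrative one-sentence "punchline" extractor for an interview transcript.
# Frequency-based extractive summarization; the transcript text is made up.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "we", "was",
              "it", "that", "for", "on", "i", "before", "every"}

def punchline(transcript: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    words = [w for w in re.findall(r"[a-z']+", transcript.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOPWORDS)

    return max(sentences, key=score)

text = ("Road conditions were terrible that week. "
        "We always checked road conditions and distance to the ASP "
        "before moving ammunition. "
        "Safety drove every transport decision we made.")
print(punchline(text))  # likely the sentence about checking road conditions
```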
IMPLEMENTATION AND RESULTS
The ETS system was built as a prototype to demonstrate its potential use at the Defense Ammunition Center. It was built using a MySQL database for the collection of knowledge nuggets and the related content, and PHP and JavaScript as the Web language platform. The system also incorporated necessary security and access control precautions. It was made available to several groups of trainees, who really liked this type of tacit knowledge presentation. The feedback was very positive. However, some internal issues, as well as the challenge of having tacit knowledge be shared as official knowledge, resulted in the system being discontinued. Nevertheless, the application was developed to be a general knowledge-sharing system as opposed to just this specific use. The authors are exploring other potential users for this platform.
QUESTIONS FOR THE OPENING VIGNETTE
1. What are the key impediments to the use of knowledge in a knowledge
management system?
2. What features are incorporated in a knowledge nugget in this implementation?
3. Where else could such a system be implemented?
WHAT WE CAN LEARN FROM THIS VIGNETTE
Knowledge management initiatives in many organizations have not succeeded. Although many studies have been conducted on this issue and we will learn more about this topic in future sections, two major issues seem to be critical. First, merely compiling a large amount of user-generated information on the Web does not by itself present the needed information in the right format to the user, nor does it make it easy to find the right knowledge at the right time. Developing a friendly knowledge presentation format that includes audio, video, text summary, and Web 2.0 features such as tagging, sharing, comments, and ratings makes it more likely that users will actually use the KM content. Second, organizing the knowledge to be visible in specific taxonomies as well as through search, and enabling users to tag the content, allows this knowledge to be more easily discovered within a knowledge management system.
Sources: Based on our own documents and S. Iyer, R. Sharda, D. Biros, J. Lucca, and U. Shimp, "Organization of Lessons Learned Knowledge: A Taxonomy of Implementation," International Journal of Knowledge Management, Vol. 5, No. 3 (2009).
12.2 INTRODUCTION TO KNOWLEDGE MANAGEMENT
Humans learn effectively through stories, analogies, and examples. Davenport and Prusak (1998) argue that knowledge is communicated effectively when it is conveyed with a convincing narrative. Family-run businesses transfer the secrets of business learned through experience to the next generation. Knowledge gained through experience does not necessarily reside in any business textbook, but the transfer of such knowledge facilitates its profitable use. Nonaka (1991) used the term tacit knowledge for the knowledge that exists in the head but not on paper. Tacit knowledge is difficult to capture, manage, and share. He also observed that organizations that use tacit knowledge as a strategic weapon are innovators and leaders in their respective business domains. There is no substitute for the substantial value that tacit knowledge can provide. Therefore, it is necessary to capture and codify tacit knowledge to the greatest extent possible.
In the 2000s, knowledge management was considered to be one of the cornerstones of business success. Despite billions of dollars spent on knowledge management by both industry and government, success has been mixed. Usually it is the successful projects that see the limelight. Much research has focused on successful knowledge management initiatives as well as factors that could lead to a successful knowledge management project (Davenport et al., 1998). But a few researchers have presented case studies of knowledge management failures (Chua and Lam, 2005). One of the causes of such failures is that the prospective users of such knowledge cannot easily locate relevant information. Knowledge compiled in a knowledge management system is no good to the organization if it cannot be easily found by the likely end users. On the other hand, although their worth is difficult to measure, organizations recognize the value of their intellectual assets. Fierce global competition drives companies to better use their intellectual assets by transforming themselves into organizations that foster the development and sharing of knowledge.
In the next few sections we cover the basic concepts of knowledge management.
Knowledge Management Concepts and Definitions
With roots in organizational learning and innovation, the idea of KM is not new (see Ponzi, 2004; and Schwartz, 2006). However, the application of IT tools to facilitate the creation, storage, transfer, and application of previously uncodifiable organizational knowledge is a new and major initiative in many organizations. Successful managers have long used intellectual assets and recognized their value. But these efforts were not systematic, nor did they ensure that knowledge gained was shared and dispersed appropriately for maximum organizational benefit. Knowledge management is a process that helps organizations identify, select, organize, disseminate, and transfer important information and expertise that are part of the organization's memory and that typically reside within the organization in an unstructured manner. Knowledge management (KM) is the systematic and active management of ideas, information, and knowledge residing in an organization's employees. The structuring of knowledge enables effective and efficient problem solving, dynamic learning, strategic planning, and decision making. KM initiatives focus on identifying knowledge, explicating it in such a way that it can be shared in a formal manner, and leveraging its value through reuse. The information technologies that make KM available throughout an organization are referred to as KM systems.
Through a supportive organizational climate and modern IT, an organization can bring its entire organizational memory and knowledge to bear on any problem, anywhere in the world, and at any time (see Bock et al., 2005). For organizational success, knowledge, as a form of capital, must be exchangeable among persons, and it must be able to grow. Knowledge about how problems are solved can be captured so that KM can promote organizational learning, leading to further knowledge creation.
Knowledge
Knowledge is very distinct from data and information (see Figure 12.3). Data are facts, measurements, and statistics; information is organized or processed data that is timely (i.e., inferences from the data are drawn within the time frame of applicability) and accurate (i.e., with regard to the original data) (Kankanhalli et al., 2005). Knowledge is information that is contextual, relevant, and actionable. For example, a map that gives detailed driving directions from one location to another could be considered data. An up-to-the-minute traffic bulletin along the freeway that indicates a traffic slowdown due to construction several miles ahead could be considered information. Awareness of an alternative, back-road route could be considered knowledge. In this case, the map is considered data because it does not contain current relevant information that affects the driving time and conditions from one location to the other. However, having the current conditions as information is useful only if you have knowledge that enables you to avert the construction zone. The implication is that knowledge has strong experiential and reflective elements that distinguish it from information in a given context.
Having knowledge implies that it can be exercised to solve a problem, whereas having information does not carry the same connotation. An ability to act is an integral part of being knowledgeable. For example, two people in the same context with the same information may not have the same ability to use the information to the same degree of success. Hence, there is a difference in the human capability to add value. The differences in ability may be due to different experiences, different training, different perspectives, and other factors. Whereas data, information, and knowledge may all be viewed as assets of an organization, knowledge provides a higher level of meaning about data and information. It conveys meaning and hence tends to be much more valuable, yet more ephemeral.
FIGURE 12.3 Relationship Among Data, Information, and Knowledge. [Diagram: data, once processed, becomes information; information that is relevant and actionable becomes knowledge (and, ultimately, wisdom).]
Unlike other organizational assets, knowledge has the following characteristics (see Gray, 1999):
• Extraordinary leverage and increasing returns. Knowledge is not subject to diminishing returns. When it is used, it is not decreased (or depleted); rather, it is increased (or improved). Its consumers can add to it, thus increasing its value.
• Fragmentation, leakage, and the need to refresh. As knowledge grows, it branches and fragments. Knowledge is dynamic; it is information in action. Thus, an organization must continually refresh its knowledge base to maintain it as a source of competitive advantage.
• Uncertain value. It is difficult to estimate the impact of an investment in knowledge. There are too many intangible aspects that cannot be easily quantified.
• Value of sharing. It is difficult to estimate the value of sharing one's knowledge or even who will benefit most from it.
Over the past few decades, the industrialized economy has been going through a transformation from being based on natural resources to being based on intellectual assets (see Alavi, 2000; and Tseng and Goo, 2005). The knowledge-based economy is a reality (see Godin, 2006). Rapid changes in the business environment cannot be handled in traditional ways. Firms are much larger today than they used to be, and, in some areas, turnover is extremely high, fueling the need for better tools for collaboration, communication, and knowledge sharing. Firms must develop strategies to sustain competitive advantage by leveraging their intellectual assets for optimal performance. Competing in the globalized economy and markets requires quick response to customer needs and problems. To provide service, managing knowledge is critical for consulting firms spread out over wide geographical areas and for virtual organizations.
There is a vast amount of literature about what knowledge and knowing mean in epistemology (i.e., the study of the nature of knowledge), the social sciences, philosophy, and psychology. Although there is no single definition of what knowledge and KM specifically mean, the business perspective on them is fairly pragmatic. Information as a resource is not always valuable (i.e., information overload can distract from what is important); knowledge as a resource is valuable because it focuses attention back toward what is important (see Carlucci and Schiuma, 2006; and Hoffer et al., 2002). Knowledge implies an implicit understanding and experience that can discriminate between its use and misuse. Over time, information accumulates and decays, whereas knowledge evolves. Knowledge is dynamic in nature. This implies, though, that today's knowledge may well become tomorrow's ignorance if an individual or organization fails to update knowledge as environmental conditions change.
Knowledge evolves over time with experience, which puts connections among new situations and events in context. Given the breadth of the types and applications of knowledge, we adopt the simple and elegant definition that knowledge is information in action.
Explicit and Tacit Knowledge
Polanyi (1958) first conceptualized the difference between an organization's explicit and tacit knowledge. Explicit knowledge deals with more objective, rational, and technical knowledge (e.g., data, policies, procedures, software, documents). Tacit knowledge is usually in the domain of subjective, cognitive, and experiential learning; it is highly personal and difficult to formalize. Alavi and Leidner (2001) provided a taxonomy (see Table 12.1) in which they defined a spectrum of different types of knowledge, going beyond the simple binary classification of explicit versus tacit. However, most KM research has been (and still is) debating the dichotomous classification of knowledge.
Explicit knowledge comprises the policies, procedural guides, white papers, reports, designs, products, strategies, goals, mission, and core competencies of an enterprise and its IT infrastructure. It is the knowledge that has been codified (i.e., documented) in a form that can be distributed to others or transformed into a process or strategy without requiring interpersonal interaction. For example, a description of how to process a job application would be documented in a firm's human resources policy manual. Explicit knowledge has also been called leaky knowledge because of the ease with which it can leave an individual, a document, or an organization, given that it can be readily and accurately documented (see Alavi, 2000).
TABLE 12.1 Taxonomy of Knowledge

Knowledge Type | Definition | Example
Tacit | Knowledge is rooted in actions, experience, and involvement in a specific context | Best means of dealing with a specific customer
Cognitive tacit | Mental models | Individual's belief on cause-effect relationships
Technical tacit | Know-how applicable to specific work | Surgery skills
Explicit | Articulated, generalized knowledge | Knowledge of major customers in a region
Individual | Created by and inherent in the individual | Insights gained from a completed project
Social | Created by and inherent in collective actions of a group | Norms for intergroup communication
Declarative | Know-about | What drug is appropriate for an illness
Procedural | Know-how | How to administer a particular drug
Causal | Know-why | Understanding why the drug works
Conditional | Know-when | Understanding when to prescribe the drug
Relational | Know-with | Understanding how the drug interacts with other drugs
Pragmatic | Useful knowledge for an organization | Best practices, treatment protocols, case analyses, postmortems
Tacit knowledge is the cumulative store of the experiences, mental maps, insights,
acumen, expertise, know-how, trade secrets, skillsets, unde rstanding, and learning that
an organization has, as well as the organizational culture that has embedded in it the
past and present experiences of the organization’s people, processes, and values. Tacit
knowledge, also refe rred to as embedded knowledge (see Tuggle and Goldfinger, 2004),
is usually either localized within the brain of an individual or embedded in the group
interactions within a department or a branch office. Tacit knowledge typically involves
expertise or high skill levels.
Sometimes tacit knowledge could easily be documented but has remained tacit
simply because the individual housing the knowledge does not recognize its potential
value to other individuals. Other times, tacit knowledge is unstructured, without tangible
form, and therefore difficult to codify. It is difficult to put some tacit knowledge into words.
For example, an explanation of how to ride a bicycle would be difficult to document
explicitly and thus is tacit. Successful transfer or sharing of tacit knowledge usually takes
place through associations, internships, apprenticeships, conversations, other means of
social and interpersonal interactions, or even simulations (see Robin, 2000). Nonaka
and Takeuchi (1995) claimed that intangibles such as insights, intuitions, hunches, gut
feelings, values, images, metaphors, and analogies are the often-overlooked assets of
organizations. Harvesting these intangible assets can be critical to a firm's bottom line and
its ability to meet its goals. Tacit knowledge sharing requires a certain context or situation
in order to be facilitated because it is less commonly shared under normal circumstances
(see Shariq and Vendelø, 2006).
Historically, management information systems (MIS) departments have focused
on capturing, storing, managing, and reporting explicit knowledge. Organizations now
recognize the need to integrate both types of knowledge in formal information systems.
For centuries, the mentor-apprentice relationship, because of its experiential nature,
has been a slow but reliable means of transferring tacit knowledge from individual to
individual. When people leave an organization, they take their knowledge with them.
One critical goal of knowledge management is to retain the valuable know-how that
can so easily and quickly leave an organization. Knowledge management systems
(KMS) refer to the use of modern IT (e.g., the Internet, intranets, extranets, Lotus Notes,
software filters, agents, data warehouses, Web 2.0) to systematize, enhance, and expedite
intra- and interfirm KM.

KM systems are intended to help an organization cope with turnover, rapid change,
and downsizing by making the expertise of the organization's human capital widely
accessible. They are being built, in part, because of the increasing pressure to maintain a
well-informed, productive workforce. Moreover, they are built to help large organizations
provide a consistent level of customer service.
SECTION 12.2 REVIEW QUESTIONS
1. Define knowledge management and describe its purposes.
2. Distinguish between knowledge and data.
3. Describe the knowledge-based economy.
4. Define tacit knowledge and explicit knowledge.
5. Define KMS and describe the capabilities of KMS.
12.3 APPROACHES TO KNOWLEDGE MANAGEMENT
The two fundamental approaches to knowledge management are the process approach
and the practice approach (see Table 12.2). We next describe these two approaches as
well as hybrid approaches.
TABLE 12.2 The Process and Practice Approaches to Knowledge Management

Process approach
  Type of knowledge supported: explicit knowledge, codified in rules, tools, and processes.
  Means of transmission: formal controls, procedures, and standard operating procedures, with heavy emphasis on information technologies to support knowledge creation, codification, and transfer.
  Benefits: provides structure to harness generated ideas and knowledge; achieves scale in knowledge reuse.
  Disadvantages: fails to tap into tacit knowledge; may limit innovation and force participants into fixed patterns of thinking.
  Role of information technology (IT): requires heavy investment in IT to connect people with reusable codified knowledge.

Practice approach
  Type of knowledge supported: mostly tacit knowledge, unarticulated knowledge not easily captured or codified.
  Means of transmission: informal social groups that engage in storytelling and improvisation.
  Benefits: provides an environment to generate and transfer high-value tacit knowledge; provides spark for fresh ideas and responsiveness to a changing environment.
  Disadvantages: can result in inefficiency; abundance of ideas with no structure to implement them.
  Role of information technology (IT): requires moderate investment in IT to facilitate conversations and the transfer of tacit knowledge.

Source: Compiled from M. Alavi, T. R. Kayworth, and D. E. Leidner, "An Empirical Examination of the Influence of Organizational Culture on Knowledge Management Practices," Journal of Management Information Systems, Vol. 22, No. 3, 2006, pp. 191-224.
The Process Approach to Knowledge Management
The process approach to knowledge management attempts to codify organizational
knowledge through formalized controls, processes, and technologies (see Hansen
et al., 1999). Organizations that adopt the process approach may implement explicit
policies governing how knowledge is to be collected, stored, and disseminated
throughout the organization. The process approach frequently involves the use of IT,
such as intranets, data warehousing, knowledge repositories, decision support tools, and
groupware, to enhance the quality and speed of knowledge creation and distribution in
the organization. The main criticisms of the process approach are that it fails to capture
much of the tacit knowledge embedded in firms and that it forces individuals into fixed
patterns of thinking (see Kiaraka and Manning, 2005). This approach is favored by firms
that sell relatively standardized products that fill common needs. Most of the valuable
knowledge in these firms is fairly explicit because of the standardized nature of the
products and services. For example, a kazoo manufacturer has minimal product changes
or service needs over the years, and yet there is steady demand and a need to produce
the item. In such cases, the knowledge tends to be static in nature.
The Practice Approach to Knowledge Management
In contrast to the process approach, the practice approach to knowledge management
assumes that a great deal of organizational knowledge is tacit in nature and that formal
controls, processes, and technologies are not suitable for transmitting this type of
understanding. Rather than build formal systems to manage knowledge, the focus of
this approach is to build the social environments or communities of practice necessary
to facilitate the sharing of tacit understanding (see Hansen et al., 1999; Leidner et al.,
2006; and Wenger and Snyder, 2000). These communities are informal social groups
that meet regularly to share ideas, insights, and best practices. This approach is typically
adopted by companies that provide highly customized solutions to unique problems. For
these firms, knowledge is shared mostly through person-to-person contact. Collaborative
computing methods (e.g., group support systems [GSS], e-mail) help people communicate.
The valuable knowledge for these firms is tacit in nature, which is difficult to express,
capture, and manage. In this case, the environment and the nature of the problems being
encountered are extremely dynamic. Because tacit knowledge is difficult to extract, store,
and manage, the explicit knowledge that points to how to find the appropriate tacit
knowledge (i.e., people contacts, consulting reports) is made available to an appropriate
set of individuals who might need it. Consulting firms generally fall into this category.
Firms adopting the codification strategy implicitly adopt the network storage model in
their initial KMS (see Alavi, 2000).
Hybrid Approaches to Knowledge Management
Many organizations use a hybrid of the process and practice approaches. Early in the
development process, when it may not be clear how to extract tacit knowledge from its
sources, the practice approach is used so that a repository stores only explicit knowledge
that is relatively easy to document. The tacit knowledge initially stored in the repository
is contact information about experts and their areas of expertise. Such information is
listed so that people in the organization can find sources of expertise (e.g., the process
approach). From this start, best practices can eventually be captured and managed so
that the knowledge repository will contain an increasing amount of tacit knowledge
over time. Eventually, a true process approach may be attained. But if the environment
changes rapidly, only some of the best practices will prove useful. Regardless of the type
of KMS developed, a storage location for the knowledge (i.e., a knowledge repository) of
some kind is needed.

Certain highly skilled, research-oriented industries exhibit traits that require
nearly equal efforts with both approaches. For example, Koenig (2001) argued that the
pharmaceutical firms in which he has worked require about a 50/50 split. We suspect
that industries that require both a lot of engineering effort (i.e., how to create products)
and heavy-duty research effort (where a large percentage of research is unusable) would
fit the 50/50 hybrid category. Ultimately, any knowledge that is stored in a knowledge
repository must be reevaluated; otherwise, the repository will become a knowledge
landfill.
Knowledge Repositories
A knowledge repository is neither a database nor a knowledge base in the strictest sense
of the terms. Rather, a knowledge repository stores knowledge that is often text based
and has very different characteristics. It is also referred to as an organizational knowledge
base. Do not confuse a knowledge repository with the knowledge base of an expert
system. They are very different mechanisms: A knowledge base of an expert system
contains knowledge for solving a specific problem; an organizational knowledge base
contains all the organizational knowledge.

Capturing and storing knowledge are the goals for a knowledge repository. The
structure of the repository is highly dependent on the types of knowledge it stores.
The repository can range from simply a list of frequently asked (and obscure) questions
and solutions, to a listing of individuals with their expertise and contact information,
to detailed best practices for a large organization. Figure 12.4 shows a comprehensive
KM architecture designed around an all-inclusive knowledge repository (Delen and
Hawamdeh, 2009).

FIGURE 12.4 A Comprehensive View of a Knowledge Repository. The knowledge management platform (KMP) connects a Web-based knowledge portal for end users and human experts to a knowledge repository of knowledge/information/data nuggets; an intelligent broker feeds the repository by combining a Web crawler, data/text mining tools, ad hoc search, and manual entries drawn from diverse information/data sources (weather, medical, finance, agriculture, industrial). Source: D. Delen and S. S. Hawamdeh, "A Holistic Framework for Knowledge Discovery and Management," Communications of the ACM, Vol. 52, No. 6, 2009, pp. 141-145.

Most knowledge repositories are developed using several different storage mechanisms,
depending on the types and amount of knowledge to be maintained and used. Each
mechanism has strengths and weaknesses when used for different purposes within
a KMS. Developing a knowledge repository is not an easy task. The most important
and most difficult issues are making the contribution of knowledge relatively easy
for the contributor and determining a good method for cataloging the knowledge.
The users should not be involved in running the storage and retrieval mechanisms
of the knowledge repository. Typical development approaches include developing a
large-scale Internet-based system or purchasing a formal electronic document management
system or a knowledge management suite. The structure and development of the
knowledge repository are a function of the specific technology used for the KMS.
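As a thought experiment on those two issues, cheap contribution and systematic cataloging, here is a minimal Python sketch of a repository interface. It is purely illustrative (no KM suite exposes exactly this API, and all names are hypothetical): contributors make one call with free-form tags, while cataloging and retrieval happen behind the scenes, so users never touch the storage layer.

from collections import defaultdict
from datetime import date

class KnowledgeRepository:
    """A toy knowledge repository: one-call contribution plus tag cataloging."""

    def __init__(self):
        self._entries = []                # (title, text, tags, contributor, added)
        self._catalog = defaultdict(set)  # tag -> indexes of matching entries

    def contribute(self, title, text, tags, contributor):
        """Keep contribution cheap: a single call with free-form tags."""
        idx = len(self._entries)
        self._entries.append((title, text, set(tags), contributor, date.today()))
        for tag in tags:                  # cataloging happens behind the scenes
            self._catalog[tag.lower()].add(idx)
        return idx

    def search(self, tag):
        """Retrieve entries by catalog tag, hiding the storage mechanism."""
        return [self._entries[i] for i in sorted(self._catalog.get(tag.lower(), ()))]

repo = KnowledgeRepository()
repo.contribute("Processing a job application",
                "See the HR policy manual (hypothetical location).",
                ["HR", "hiring", "policy"], "j.smith")
for title, text, tags, who, when in repo.search("hiring"):
    print(f"{title} - contributed by {who} on {when}")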
SECTION 12.3 REVIEW QUESTIONS
1. Describe the process approach to knowledge management.
2. Describe the practice approach to knowledge management.
3. Why is a hybrid approach to KM desirable?
4. Define knowledge repository and describe how to create one.
12.4 INFORMATION TECHNOLOGY (IT) IN KNOWLEDGE MANAGEMENT
The two primary functions of IT in knowledge management are retrieval and
communication. IT also extends the reach and range of knowledge use and enhances the
speed of knowledge transfer. Networks facilitate collaboration in KM.
The KMS Cycle
A functioning KMS follows six steps in a cycle (see Figure 12.5). The reason for the cycle
is that knowledge is dynamically refined over time. The knowledge in a good KMS is
never finished because the environment changes over time, and the knowledge must be
updated to reflect the changes. The cycle works as follows:
1. Create knowledge. Knowledge is created as people determine new ways of
doing things or develop know-how. Sometimes external knowledge is brought in.
Some of these new ways may become best practices.
2. Capture knowledge. New knowledge must be identified as valuable and be
represented in a reasonable way.
3. Refine knowledge. New knowledge must be placed in context so that it is
actionable. This is where human insights (i.e., tacit qualities) must be captured
along with explicit facts.
4. Store knowledge. Useful knowledge must be stored in a reasonable format in a
knowledge repository so that others in the organization can access it.
5. Manage knowledge. Like a library, a repository must be kept current. It must be
reviewed to verify that it is relevant and accurate.
6. Disseminate knowledge. Knowledge must be made available in a useful format
to anyone in the organization who needs it, anywhere and anytime.
FIGURE 12.5 The Knowledge Management Cycle. The six steps form a loop: create, capture, refine, store, manage, and disseminate knowledge, then back to create.
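Because the cycle is a fixed sequence that deliberately never terminates, it can be expressed in a few lines. The sketch below is our own illustration, not part of any KMS product; it simply walks the six steps repeatedly to emphasize that a good KMS is never finished.

from itertools import cycle

# The six KMS steps in order; the loop repeats because knowledge must be
# refined continuously as the environment changes.
KMS_STEPS = ("create", "capture", "refine", "store", "manage", "disseminate")

def walk_cycle(rounds):
    """Print the KMS steps for a given number of passes around the cycle."""
    steps = cycle(KMS_STEPS)
    for i in range(rounds * len(KMS_STEPS)):
        print(f"pass {i // len(KMS_STEPS) + 1}: {next(steps)} knowledge")

walk_cycle(rounds=2)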
Components of KMS
Knowledge management is more a methodology applied to business practices than a
technology or a product. Nevertheless, IT is crucial to the success of every KMS. IT
enables knowledge management by providing the enterprise architecture on which it is
built. KMS are developed using three sets of technologies: communication, collaboration,
and storage and retrieval.

Communication technologies allow users to access needed knowledge and to
communicate with each other, especially with experts. E-mail, the Internet, corporate
intranets, and other Web-based tools provide communication capabilities. Even fax
machines and telephones are used for communication, especially when the practice
approach to knowledge management is adopted.

Collaboration technologies (covered in the next several sections) provide the means
to perform groupwork. Groups can work together on common documents at the same time
(i.e., synchronous) or at different times (i.e., asynchronous); they can work in the
same place or in different places. Collaboration technologies are especially important
for members of a community of practice working on knowledge contributions. Other
collaborative computing capabilities, such as electronic brainstorming, enhance groupwork,
especially for knowledge contribution. Additional forms of groupwork involve
experts working with individuals trying to apply their knowledge; this requires collaboration
at a fairly high level. Other collaborative computing systems allow an organization to
create a virtual space so that individuals can work online anywhere and at any time (see
Van de Ven, 2005).
Storage and retrieval technologies originally meant using a database management
system (DBMS) to store and manage knowledge. This worked reasonably well in
the early days for storing and managing most explicit knowledge, and even explicit
knowledge about tacit knowledge. However, capturing, storing, and managing tacit
knowledge usually requires a different set of tools. Electronic document management
systems and specialized storage systems that are part of collaborative computing systems
fill this void. These storage systems have come to be known as knowledge
repositories.
We describe the relationship between these knowledge management technologies
and the Web in Table 12.3.

TABLE 12.3 Knowledge Management Technologies and Web Impacts

Communication
  Web impacts: consistent, friendly graphical user interface (GUI) for client units; improved communication tools; convenient, fast access to knowledge and knowledgeable individuals; direct access to knowledge on servers.
  Impacts on the Web: knowledge captured and shared is used in improving communication, communication management, and communication technologies.

Collaboration
  Web impacts: improved collaboration tools; enables anywhere/anytime collaboration; enables collaboration between companies, customers, and vendors; enables document sharing; improved, fast collaboration and links to knowledge sources; makes audio- and videoconferencing a reality, especially for individuals not using a local area network.
  Impacts on the Web: knowledge captured and shared is used in improving collaboration, collaboration management, and collaboration technologies (e.g., SharePoint, wiki, GSS).

Storage and retrieval
  Web impacts: consistent, friendly GUI for clients; servers provide for efficient and effective storage and retrieval of knowledge.
  Impacts on the Web: knowledge captured and shared is utilized in improving data storage and retrieval systems, database management/knowledge repository management, and database and knowledge repository technologies.
Technologies That Support Knowledge Management
Several technologies have contributed to significant advances in knowledge management
tools. Artificial intelligence, intelligent agents, knowledge discovery in databases,
eXtensible Markup Language (XML), and Web 2.0 are examples of technologies that
enable advanced functionality of modern KMS and form the basis for future innovations
in the knowledge management field. Following is a brief description of how these
technologies are used in support of KMS.

ARTIFICIAL INTELLIGENCE In the definition of knowledge management, artificial
intelligence (AI) is rarely mentioned. However, practically speaking, AI methods and
tools are embedded in a number of KMS, either by vendors or by system developers.
AI methods can assist in identifying expertise, eliciting knowledge automatically and
semiautomatically, interfacing through natural language processing, and intelligently
searching through intelligent agents. AI methods, notably expert systems, neural
networks, fuzzy logic, and intelligent agents, are used in KMS to do the following:
• Assist in and enhance searching knowledge (e.g., intelligent agents in Web searches)
• Help establish knowledge profiles of individuals and groups
• Help determine the relative importance of knowledge when it is contributed to and
accessed from the knowledge repository
• Scan e-mail, documents, and databases to perform knowledge discovery, determine
meaningful relationships, glean knowledge, or induce rules for expert systems
• Identify patterns in data (usually through neural networks)
• Forecast future results by using existing knowledge
• Provide advice directly from knowledge by using neural networks or expert systems
• Provide a natural language or voice command-driven user interface for a KMS
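As a flavor of the simplest of these capabilities, building knowledge profiles, consider the sketch below. It uses nothing more than word counting, a deliberate simplification of the text mining and neural network methods the list mentions, and the authors and documents are invented for illustration.

import re
from collections import Counter

def knowledge_profile(documents, top_n=3):
    """Build a crude knowledge profile per author by counting the content
    words in that author's documents; a stand-in for real text mining."""
    stopwords = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}
    profiles = {}
    for author, texts in documents.items():
        words = re.findall(r"[a-z]+", " ".join(texts).lower())
        counts = Counter(w for w in words if w not in stopwords and len(w) > 3)
        profiles[author] = [word for word, _ in counts.most_common(top_n)]
    return profiles

docs = {
    "alice": ["Notes on data warehouse loading and ETL scheduling",
              "Data warehouse partitioning strategies"],
    "bob": ["Customer churn prediction with neural networks",
            "Neural network tuning for churn models"],
}
print(knowledge_profile(docs))
# e.g., {'alice': ['data', 'warehouse', ...], 'bob': ['churn', 'neural', ...]}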
WEB 2.0 The Web has evolved from a tool for disseminating information and
conducting business to a platform for facilitating new ways of information sharing,
collaboration, and communication in the digital age. A new vocabulary has emerged,
as mashups, social networks, media-sharing sites, RSS, blogs, and wikis have come to
characterize the genre of interactive applications collectively known as Web 2.0. These
technologies have given knowledge management a strong boost by making it easy and
natural for everyone to share knowledge. In some ways this has occurred to the point of
perhaps making the term knowledge management almost redundant. Indeed, Davenport
(2008) characterized Web 2.0 (and its reflection in the enterprise world, Enterprise
2.0) as "new, new knowledge management." One of the bottlenecks for knowledge
management practices has been the difficulty nontechnical people face in natively sharing
their knowledge. Therefore, the ultimate value of Web 2.0 is its ability to foster greater
responsiveness, better knowledge capture and sharing, and, ultimately, more effective
collective intelligence.
SECTION 12.4 REVIEW QUESTIONS
1. Describe the KMS cycle.
2. List and describe the components of KMS.
3. Describe how AI and intelligent agents support knowledge management.
4. Relate Web 2.0 to knowledge management.
Web 2.0 also engenders collaborative inputs. Whether these collaborations are for
knowledge management activities or other organizational decision making, the overall
principles are the same. We study some basic collaborative mechanisms and systems in
the next several sections.
12.5 MAKING DECISIONS IN GROUPS: CHARACTERISTICS, PROCESS,
BENEFITS, AND DYSFUNCTIONS
Managers and other knowledge workers continuously make decisions, design and
manufacture products, develop policies and strategies, create software systems, and so
on. When people work in groups (i.e., teams), they perform groupwork (i.e., teamwork).
Groupwork refers to work done by two or more people together.
Characteristics of Groupwork
The following are some of the functions and characteristics of groupwork:
• A group performs a task (sometimes decision making, sometimes not).
• Group members may be located in different places.
• Group members may work at different times.
• Group members may work for the same organization or for different organizations.
• A group can be permanent or temporary.
• A group can be at one managerial level or span several levels.
• It can create synergy (leading to process and task gains) or conflict.
• It can generate productivity gains and/or losses.
• The task may have to be accomplished very quickly.
• It may be impossible or too expensive for all the team members to meet in one
place, especially when the group is called for emergency purposes.
• Some of the needed data, information, or knowledge may be located in many
sources, some of which may be external to the organization.
• The expertise of non-team members may be needed.
• Groups perform many tasks; however, groups of managers and analysts frequently
concentrate on decision making.
• The decisions made by a group are easier to implement if supported by all (or at
least most) members.
The Group Decision-Making Process
Even in hierarchical organizations, decision making is usually a shared process. A group
may be involved in a decision or in a decision-related task, such as creating a short list
of acceptable alternatives or choosing criteria for evaluating alternatives and prioritizing
them. The following activities and processes characterize meetings:
• The decision situation is important, so it is advisable to make the decision in a
group meeting.
• A meeting is a joint activity engaged in by a group of people typically of equal or
nearly equal status.
• The outcome of a meeting depends partly on the knowledge, opinions, and
judgments of its participants and the support they give to the outcome.
• The outcome of a meeting depends on the composition of the group and on the
decision-making process the group uses.
• Differences of opinion are settled either by the ranking person present or, often,
through negotiation or arbitration.
• The members of a group can be in one place, meeting face-to-face, or they can be
a virtual team, in which case they are in different places while in a meeting.
• The process of group decision making can create benefits as well as dysfunctions.
The Benefits and Limitations of Groupwork
Some people endure meetings (the most common form of groupwork) as a necessity;
others find them to be a waste of time. Many things can go wrong in a meeting.
Participants may not clearly understand their goals, they may lack focus, or they may have
hidden agendas. Many participants may be afraid to speak up, while a few may dominate
the discussion. Misunderstandings occur through different interpretations of language,
gesture, or expression. Table 12.4 provides a comprehensive list of factors that can hinder
the effectiveness of a meeting (Nunamaker, 1997). Besides being challenging, teamwork
is also expensive. A meeting of several managers or executives may cost thousands of
dollars per hour in salary costs alone.
Groupwork may have both potential benefits (process gains) and potential
drawbacks (process losses). Process gains are the benefits of working in groups. The
unfortunate dysfunctions that may occur when people work in groups are called process
losses. Examples of each are listed in Technology Insights 12.1.
TABLE 12.4 Difficulties Associated with Groupwork
• Waiting to speak
• Dominating the discussion
• Fear of speaking
• Fear of being misunderstood
• Inattention
• Lack of focus
• Inadequate criteria
• Premature decisions
• Missing information
• Distractions
• Digressions
• Wrong composition of people
• Groupthink
• Poor grasp of problem
• Ignored alternatives
• Lack of consensus
• Poor planning
• Hidden agendas
• Conflicts of interest
• Inadequate resources
• Poorly defined goals
TECHNOLOGY INSIGHTS 12.1 Benefits of Working in Groups and Dysfunctions of the Group Process
Benefits of Working in Groups (Process Gains)
• It provides learning. Groups are better
than individuals at understanding
problems.
• People readily take ownership of
problems and their solutions. They
take responsibility.
• Group members have their egos
embedded in the decision, so they
are committed to the solution.
• Groups are better than individuals at
catching errors.
• A group has more information
(i.e., knowledge) than any one member.
Group members can combine their
knowledge to create new knowledge.
More and more creative alternatives
for problem solving can be generated,
and better solutions can be derived
(e.g., through stimulation).
• A group may produce synergy during
problem solving. The effectiveness and/
or quality of groupwork can be greater
than the sum of what is produced by
independent individuals.
• Working in a group may stimulate the
creativity of the participants and the
process.
• A group may have better and more
precise communication working together.
• Risk propensity is balanced. Groups
moderate high-risk takers and encourage
conservatives.
Dysfunctions of the Group Process (Process Losses)
• Social pressures of conformity may result in groupthink (i.e., people begin to think alike
and do not tolerate new ideas; they yield to conformance pressure).
• It is a time-consuming, slow process (i.e., only one member can speak at a time).
• There can be lack of coordination of the meeting and poor meeting planning.
• Inappropriate influences (e.g., domination of time, topic, or opinion by one or a few
individuals; fear of contributing because of the possibility of flaming).
• There can be a tendency for group members to either dominate the agenda or rely on
others to do most of the work (free-riding).
• Some members may be afraid to speak up.
• There can be a tendency to produce compromised solutions of poor quality.
• There is often nonproductive time (e.g., socializing, preparing, waiting for latecomers;
i.e., air-time fragmentation).
• There can be a tendency to repeat what has already been said (because of failure to
remember or process).
• Meeting costs can be high (e.g., travel, participation time spent).
• There can be incomplete or inappropriate use of information.
• There can be too much information (i.e., information overload).
• There can be incomplete or incorrect task analysis.
• There can be inappropriate or incomplete representation in the group.
• There can be attention blocking.
• There can be concentration blocking.

SECTION 12.5 REVIEW QUESTIONS
1. Define groupwork.
2. List five characteristics of groupwork.
3. Describe the process of a group meeting for decision making.
12.6 SUPPORTING GROUPWORK WITH COMPUTERIZED SYSTEMS
When people work in teams, especially when the members are in different locations and
may be working at different times, they need to communicate, collaborate, and access a
diverse set of information sources in multiple formats. This makes meetings, especially
virtual ones, complex, with a greater chance for process losses. It is important to follow a
certain process for conducting meetings.

Groupwork may require different levels of coordination (Nunamaker, 1997).
Sometimes a group may operate at the individual work level, with members making
individual efforts that require no coordination. As with a team of sprinters representing
a country in a 100-meter dash, group productivity is simply the best of the
individual results. Other times group members may interact at the coordinated work
level. At this level, as with a team in a relay race, the work requires careful coordination
between otherwise independent individual efforts. Sometimes a team may operate at
the concerted work level. As in a rowing race, teams working at this level must make a
continuous concerted effort to be successful. Different mechanisms support groupwork at
different levels of coordination.

It is almost trite to say that all organizations, small and large, are using some
computer-based communication and collaboration methods and tools to support
people working in teams or groups. From e-mail to mobile phones and SMS as well as
conferencing technologies, such tools are an indispensable part of one's work life today.
We next highlight some related technologies and applications.
An Overview of Group Support Systems (GSS)
For groups to collaborate effectively, appropriate communication methods and technologies
are needed. The Internet and its derivatives (i.e., intranets and extranets) are the
infrastructures on which much communication for collaboration occurs. The Web supports
intra- and interorganizational collaborative decision making through collaboration tools
and access to data, information, and knowledge from inside and outside the organization.

Intraorganizational networked decision support can be effectively supported by
an intranet. People within an organization can work with Internet tools and procedures
through enterprise information portals. Specific applications can include important
internal documents and procedures, corporate address lists, e-mail, tool access, and
software distribution.

An extranet links people in different organizations. For example, covisint.com
focuses on providing such collaborative mechanisms in diverse industries such as
manufacturing, healthcare, and energy. Other extranets are used to link teams together
to design products when several different suppliers must collaborate on design and
manufacturing techniques.

Computers have been used for several decades to facilitate groupwork and group
decision making. Lately, collaborative tools have received even greater attention due to
their increased capabilities and ability to save money (e.g., on travel cost) as well as their
ability to expedite decision making. Such computerized tools are called groupware.
Groupware
Many computerized tools have been developed to provide group support. These tools are
called groupware because their primary objective is to support groupwork. Groupware
tools can support decision making directly or indirectly, and they are described in
the remainder of this chapter. For example, generating creative solutions to problems
is direct support. Some e-mail programs, chat rooms, instant messaging (IM), and
teleconferencing provide indirect support.

Groupware provides a mechanism for team members to share opinions, data,
information, knowledge, and other resources. Different computing technologies support
groupwork in different ways, depending on the purpose of the group, the task, and the
time/place category in which the work occurs.
Time/Place Framework
The effectiveness of a collaborative computing technology depends on the location of the
group members and on the time that shared information is sent and received. DeSanctis
and Gallupe (1987) proposed a framework for classifying IT communication support
technologies. In this framework, communication is divided into four cells, which are
shown together with representative computerized support technologies in Figure 12.6.
The four cells are organized along two dimensions: time and place.

When information is sent and received almost simultaneously, the communication
is synchronous (real time). Telephones, IM, and face-to-face meetings are examples of
synchronous communication. Asynchronous communication occurs when the receiver
gets the information at a different time than it was sent, such as in e-mail. The senders
and the receivers can be in the same room or in different places.

FIGURE 12.6 The Time/Place Framework for Groupwork.
Same time/same place: GSS in a decision room; Web-based GSS; multimedia presentation system; whiteboard; document sharing.
Same time/different place: Web-based GSS; whiteboard; document sharing; videoconferencing; audioconferencing; computer conferencing; e-mail, V-mail.
Different time/same place: GSS in a decision room; Web-based GSS; workflow management system; document sharing; e-mail, V-mail; videoconferencing playback.
Different time/different place: Web-based GSS; whiteboard; document sharing; e-mail, V-mail; workflow management system; computer conferencing with memory; videoconferencing playback.
As shown in Figure 12.6, time and place combinations can be viewed as a four-cell
matrix, or framework. The four cells of the framework are as follows:
• Same time/same place. Participants meet face-to-face in one place at the same
time, as in a traditional meeting or decision room. This is still an important way to
meet, even when Web-based support is used, because it is sometimes critical for
participants to leave the office to eliminate distractions.
• Same time/different place. Participants are in different places, but they
communicate at the same time (e.g., with videoconferencing).
• Different time/same place. People work in shifts. One shift leaves information
for the next shift.
• Different time/different place (any time, any place). Participants are in
different places, and they also send and receive information at different times. This
occurs when team members are traveling, have conflicting schedules, or work in
different time zones.
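The framework is, in effect, a small lookup table, so a meeting planner could use it directly. The following Python sketch (our own illustration; the tool lists abbreviate the cells of Figure 12.6) returns representative technologies for a given time/place combination:

# Representative technologies per cell, abbreviated from Figure 12.6.
SUPPORT = {
    ("same", "same"): ["GSS in a decision room", "multimedia presentation system",
                       "whiteboard", "document sharing"],
    ("same", "different"): ["videoconferencing", "audioconferencing",
                            "Web-based GSS", "whiteboard"],
    ("different", "same"): ["workflow management system", "document sharing",
                            "e-mail", "videoconferencing playback"],
    ("different", "different"): ["Web-based GSS", "e-mail",
                                 "computer conferencing with memory"],
}

def recommend(time, place):
    """Return representative groupware for a (time, place) cell."""
    try:
        return SUPPORT[(time, place)]
    except KeyError:
        raise ValueError("time and place must each be 'same' or 'different'")

print(recommend("same", "different"))  # a distributed, synchronous meeting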
Groups and groupwork (also known as teams and teamwork) in organizations
are proliferating. Consequently, groupware continues to evolve to support effective
groupwork, mostly for communication and collaboration.
SECTION 12.6 REVIEW QUESTIONS
1. Why do companies use computers to support groupwork?
2. Describe the components of the time/place framework.
12.7 TOOLS FOR INDIRECT SUPPORT OF DECISION MAKING
A large number of tools and methodologies are available to facilitate e-collaboration,
communication, and decision support. The following sections present the major tools that
support decision making indirectly.
Groupware Tools
Groupware products provide a way for groups to share resources and opinions.
Groupware implies the use of networks to connect people, even if they are in the same
room. Many groupware products are available on the Internet or an intranet to enhance
the collaboration of a large number of people. The features of groupware products that
support communication, collaboration, and coordination are listed in Table 12.5. What
follows are brief definitions of some of those features.
SYNCHRONOUS VERSUS ASYNCHRONOUS PRODUCTS Notice that the features in
Table 12.5 may be synchronous, meaning that communication and collaboration are
done in real time, or asynchronous, meaning that communication and collaboration are
done by the participants at different times. Web conferencing and IM as well as Voice
over IP (VoIP) are associated with the synchronous mode. Methods that are associated with
asynchronous modes include e-mail, wikilogs, and online workspaces, where participants
can collaborate, for example, on joint designs or projects, but work at different times.
Google Drive (drive.google.com) and Microsoft SharePoint (http://office.microsoft.com/en-us/SharePoint/collaboration-software-SharePoint-FX103479517.aspx)
allow users to set up online workspaces for storing, sharing, and collaboratively working
on different types of documents.
TABLE 12.5 Groupware Products and Features

General (can be either synchronous or asynchronous)
• Built-in e-mail, messaging system
• Browser interface
• Joint Web-page creation
• Sharing of active hyperlinks
• File sharing (graphics, video, audio, or other)
• Built-in search functions (by topic or keyword)
• Workflow tools
• Use of corporate portals for communication, collaboration, and search
• Shared screens
• Electronic decision rooms
• Peer-to-peer networks

Synchronous (same time)
• Instant messaging (IM)
• Videoconferencing, multimedia conferencing
• Audioconferencing
• Shared whiteboard, smart whiteboard
• Instant video
• Brainstorming
• Polling (voting) and other decision support (consensus builder, scheduler)

Asynchronous (different times)
• Workspaces
• Threaded discussions
• Users can receive/send e-mail, SMS
• Users can receive activity notification alerts via e-mail or SMS
• Users can collapse/expand discussion threads
• Users can sort messages (by date, author, or read/unread)
• Auto responder
• Chat session logs
• Bulletin boards, discussion groups
• Use of blogs, wikis, and wikilogs
• Collaborative planning and/or design tools
Companies such as Dropbox.com provide an easy way to share documents. Of
course, similar systems are evolving for consumer and home use, such as photo sharing
(e.g., Picasa, Flickr, Facebook).

Groupware products are either stand-alone products that support one task (such
as videoconferencing) or integrated kits that include several tools. In general, groupware
technology products are fairly inexpensive and can easily be incorporated into existing
information systems.
VIRTUAL MEETING SYSTEMS The advancement of Web-based systems opens the door
for improved, electronically supported virtual meetings, where members are in different
locations and even in different countries. For example, online meetings and presentation
tools are provided by webex.com, gotomeeting.com, Adobe.com, Skype.com, and
many others. Microsoft Office also includes a built-in virtual meeting capability. These
systems feature Web seminars (popularly called Webinars), screen sharing, audioconferencing,
videoconferencing, polling, question-and-answer sessions, and so on. Even mobile
phones now have sufficient interaction capabilities to allow live meetings through
applications such as FaceTime.
Groupware
Although many of the technologies that enable group decision support are merging into
common office productivity software tools such as Microsoft Office, it is instructive to
learn about one specific piece of software that illustrates some unique capabilities of groupware.
GroupSystems (groupsystems.com) MeetingRoom was one of the first comprehensive
same time/same place electronic meeting packages. The follow-up product,
GroupSystems Online, offered similar capabilities, and it ran in asynchronous mode
(anytime/anyplace) over the Web (MeetingRoom ran only over a local area network
[LAN]). GroupSystems' latest product is ThinkTank, which is a suite of tools that significantly
shortens cycle time for brainstorming, strategic planning, product development,
problem solving, requirements gathering, risk assessment, team decision making, and
other collaborations. ThinkTank moves face-to-face or virtual teams through customizable
processes toward their goals faster and more effectively than its predecessors. ThinkTank
offers the following capabilities:
• ThinkTank builds in the discipline of an agenda, efficient participation, workflow,
prioritization, and decision analysis.
• ThinkTank's anonymous brainstorming for ideas and comments is an ideal way to
capture the participants' creativity and experience.
• ThinkTank Web 2.0's enhanced user interface ensures that participants do not need
prior training to join, so they can focus 100 percent on solving problems and making
decisions.
• With ThinkTank, all of the knowledge shared by participants is captured, saved
in documents and spreadsheets, automatically converted to meeting minutes,
and made available to all participants at the end of the session.

Another specialized product is eRoom (now owned by EMC/Documentum
at http://www.emc.com/enterprise-content-management/centerstage.htm). This
comprehensive Web-based suite of tools can support a variety of collaboration scenarios.
Yet another product is Team Expert Choice (Comparion), which is an add-on product for
Expert Choice (expertchoice.com). It has limited decision support capabilities, mainly
supporting one-room meetings, but focuses on developing a model and process for
decision making using the analytic hierarchy process that was covered in Chapter 9.
Collaborative Workflow
Collaborative workflow refers to software products that address project-oriented and
collaborative types of processes. They are administered centrally yet are capable of being
accessed and used by workers from different departments and even from different physical
locations. The goal of collaborative workflow tools is to empower knowledge workers.
The focus of an enterprise solution for collaborative workflow is on allowing workers to
communicate, negotiate, and collaborate within an integrated environment. Some leading
vendors of collaborative workflow applications are Lotus, EpicData, FileNet, and Action
Technologies.
Web 2.0
The term Web 2.0 refers to what is perceived to be the second generation of Web
development and Web design. It is characterized as facilitating communication, information
sharing, interoperability, user-centered design, and collaboration on the World Wide Web.
It has led to the development and evolution of Web-based communities, hosted services,
and novel Web applications. Example Web 2.0 applications include social-networking
sites (e.g., LinkedIn, Facebook), video-sharing sites (e.g., YouTube, Flickr, Vimeo), wikis,
blogs, mashups, and folksonomies.
Web 2.0 sites typically include the following features/techniques, identified by the
acronym SLATES:
• Search. The ease of finding information through keyword search.
• Links. Ad hoc guides to other relevant information.
• Authoring. The ability to create content that is constantly updated by multiple
users. In wikis, content is updated in the sense that users undo and redo each
other's work. In blogs, content is updated in that posts and comments of individuals
are accumulated over time.
• Tags. Categorization of content by creating tags. Tags are simple, one-word,
user-determined descriptions that facilitate searching and avoid rigid, premade categories.
• Extensions. Powerful algorithms that leverage the Web as an application platform as
well as a document server.
• Signals. RSS technology used to rapidly notify users of content changes.
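The Tags element is especially easy to demonstrate. The sketch below is an illustration of the folksonomy idea, not any particular product's API; it lets users attach free-form tags to items and find items later without any premade category tree:

from collections import defaultdict

class TagIndex:
    """A minimal folksonomy: free-form, user-chosen tags index the content."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, item, *tags):
        """Attach any number of one-word, user-determined tags to an item."""
        for t in tags:
            self._by_tag[t.strip().lower()].add(item)

    def find(self, tag):
        """Return every item carrying the given tag."""
        return self._by_tag.get(tag.strip().lower(), set())

index = TagIndex()
index.tag("Q3 churn analysis", "churn", "analytics", "q3")
index.tag("Churn model notes", "churn", "modeling")
print(index.find("churn"))  # both items, no rigid category needed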
Wikis
A wiki is a piece of server software available at a Web site that allows users to freely
create and edit Web page content through a Web browser. (The term wiki means "quick"
or "to hasten" in the Hawaiian language; e.g., "Wiki Wiki" is the name of the shuttle bus
at Honolulu International Airport.) A wiki supports hyperlinks and has a simple text
syntax for creating new pages and cross-links between internal pages on the fly. It is
especially suited for collaborative writing.

Wikis are unusual among group communication mechanisms in that they allow the
organization of the contributions to be edited as well as the content itself. The term wiki
also refers to the collaborative software that facilitates the operation of a wiki Web site.

A wiki enables documents to be written collectively in a very simple markup, using
a Web browser. A single page in a wiki is referred to as a "wiki page," and the entire
body of pages, which are usually highly interconnected via hyperlinks, is "the wiki"; in
effect, it is a very simple, easy-to-use database. For further details, see en.wikipedia.org/wiki/Wiki
and wiki.org.
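The "simple text syntax" is the heart of a wiki engine. Many engines mark internal cross-links with double square brackets; assuming that convention (this is a sketch, not the code of any real engine), extracting and rendering links takes only a regular expression:

import re

# Assumed convention: [[Page Name]] marks an internal cross-link.
LINK = re.compile(r"\[\[([^\]]+)\]\]")

def wiki_links(page_text):
    """List the internal pages this wiki page links to."""
    return LINK.findall(page_text)

def render(page_text):
    """Turn [[Page]] markers into HTML anchors, creating cross-links on the fly."""
    return LINK.sub(
        lambda m: f'<a href="/wiki/{m.group(1).replace(" ", "_")}">{m.group(1)}</a>',
        page_text)

page = "See [[Knowledge Repository]] and [[Group Support Systems]] for background."
print(wiki_links(page))  # ['Knowledge Repository', 'Group Support Systems']
print(render(page))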
Collaborative Networks
Traditionally, collaboration took place among supply chain members, frequently those
that were close to each other (e.g., a manufacturer and its distributor, or a distributor and
a retailer). Even if more partners were involved, the focus was on the optimization of
information and product flow between existing nodes in the traditional supply chain.
Advanced approaches, such as collaborative planning, forecasting, and replenishment, do
not change this basic structure.

Traditional collaboration results in a vertically integrated supply chain. However, Web
technologies can fundamentally change the shape of the supply chain, the number of players
in it, and their individual roles. In a collaborative network, partners at any point in the
network can interact with each other, bypassing traditional partners. Interaction may occur
among several manufacturers or distributors, as well as with new players, such as software
agents that act as aggregators, business-to-business (B2B) exchanges, or logistics providers.
SECTION 12.7 REVIEW QUESTIONS
1. List the major groupware tools and divide them into synchronous and asynchronous
types.
2. Identify specific tools for Web conferencing and their capabilities.
3. Define wiki and wikilog.
4. Define collaborative hub.
12.8 DIRECT COMPUTERIZED SUPPORT FOR DECISION MAKING:
FROM GROUP DECISION SUPPORT SYSTEMS TO GROUP
SUPPORT SYSTEMS
Decisions are made at many meetings, some of which are called in order to make one
specific decision. For example, the Federal Reserve meets periodically to decide on the
short-term interest rate. Directors may be elected at shareholder meetings, organizations
allocate budgets in meetings, a company decides which candidate to hire, and so on.
Although some of these decisions are complex, others can be controversial, as in resource
allocation by a city government. Process gains and dysfunctions can be significant
in such situations; therefore, computerized support has often been suggested to
mitigate these complexities. These computer-based support systems have appeared in
the literature under different names, including group decision support systems (GDSS),
group support systems (GSS), computer-supported collaborative work (CSCW), and
electronic meeting systems (EMS). These systems are the subject of this section.
Group Decision Support Systems (GDSS)
During the 1980s, researchers realized that computerized support for managerial decision
making needed to be expanded to groups because major organizational decisions are
made by groups such as executive committees, special task forces, and departments. The
result was the creation of group decision support systems (see Powell et al., 2004).

A group decision support system (GDSS) is an interactive computer-based
system that facilitates the solution of semistructured or unstructured problems by a group
of decision makers. The goal of GDSS is to improve the productivity of decision-making
meetings by speeding up the decision-making process and/or by improving the quality
of the resulting decisions.
The following are the major characteristics of a GDSS:
• Its goal is to support the process of group decision making by providing automation
of subprocesses, using information technology tools.
• It is a specially designed information system, not merely a configuration of already-existing
system components. It can be designed to address one type of problem or
a variety of group-level organizational decisions.
• It encourages generation of ideas, resolution of conflicts, and freedom of expression.
It contains built-in mechanisms that discourage development of negative group
behaviors, such as destructive conflict, miscommunication, and groupthink.

The first generation of GDSS was designed to support face-to-face meetings in a
decision room. Today, support is provided mostly over the Web to virtual groups. The
group can meet at the same time or at different times by using e-mail, sending documents,
and reading transaction logs. GDSS is especially useful when controversial decisions have
to be made (e.g., resource allocation, determining which individuals to lay off). GDSS
applications require a facilitator when done in one room or a coordinator or leader when
done via virtual meetings.
GDSS can improve the decision-making process in various ways. For one, GDSS
generally provide structure to the planning process, which keeps the group on track,
although some applications permit the group to use unstructured techniques and methods
for idea generation. In addition, GDSS offer rapid and easy access to external and
stored information needed for decision making. GDSS also support parallel processing
of information and idea generation by participants and allow asynchronous computer
discussion. They make possible larger meetings that would otherwise be unmanageable;
having a larger group means that more complete information, knowledge, and skills will
be represented in the meeting. Finally, voting can be anonymous, with instant results,
and all information that passes through the system can be recorded for future analysis
(producing organizational memory).

Initially, GDSS were limited to face-to-face meetings. To provide the necessary
technology, a special facility (i.e., room) was created. Also, groups usually had a clearly
defined, narrow task, such as allocation of scarce resources or prioritization of goals in a
long-range plan.
Over time, it became clear that teams' needs were broader than those
supported by GDSS. Furthermore, it became clear that what was really needed was
support for virtual teams, in both different place/same time and different place/different
time situations. Also, it became clear that teams needed indirect support in most decision-making
cases (e.g., help in searching for information or collaboration) rather than direct
support for the decision making. Although GDSS expanded to virtual team support, they
were unable to meet all the other needs. Thus, a broader term, GSS, was created. We use
the terms interchangeably in this book.
Group Support Systems
A group support system (GSS) is any combination of hardware and software that
enhances groupwork in either direct or indirect support of decision making. GSS is
a generic term that includes all forms of collaborative computing. GSS evolved after
information technology researchers recognized that technology could be developed
to support the many activities normally occurring at face-to-face meetings (e.g., idea
generation, consensus building, anonymous ranking).

A complete GSS is still considered a specially designed information system, but
since the mid-1990s many of the special capabilities of GSS have been embedded in
standard productivity tools. For example, Microsoft Office can embed the Lync tool
for Web conferences. Most GSS are easy to use because they have a Windows-based
graphical user interface (GUI) or a Web browser interface. Most GSS are fairly general
and provide support for activities such as idea generation, conflict resolution, and
voting. Also, many commercial products have been developed to support only one
or two aspects of teamwork (e.g., videoconferencing, idea generation, screen sharing,
wikis).

GSS settings range from a group meeting at a single location for solving a specific
problem to virtual meetings conducted in multiple locations and held via telecommunication
channels for the purpose of addressing a variety of problem types. Continuously
adopting new and improved methods, GSS are building up their capabilities to effectively
operate in asynchronous as well as synchronous modes.
How GDSS (or GSS) Improve Groupwork
The goal of GSS is to provide support to meeting participants to improve the productivity
and effectiveness of meetings by streamlining and speeding up the decision-making
process (i.e., efficiency) or by improving the quality of the results (i.e., effectiveness).
GSS attempt to increase process and task gains and decrease process and task losses.
Overall, GSS have been successful in doing just that (see Holt, 2002); however, some
process and task gains may decrease, and some process and task losses may increase.
Improvement is achieved by providing support to group members for the generation
and exchange of ideas, opinions, and preferences. Specific features such as parallelism
(i.e., the ability of participants in a group to work simultaneously on a task, such as
brainstorming or voting) and anonymity produce this improvement. The following are
some specific GDSS support activities:
• GDSS support parallel processing of information and idea generation (parallelism).
• GDSS enable the participation of larger groups with more complete information,
knowledge, and skills.
• GDSS permit the group to use structured or unstructured techniques and methods.
• GDSS offer rapid, easy access to external information.
• GDSS allow parallel computer discussions.
• GDSS help participants frame the big picture.
• Anonymity allows shy people to contribute to the meeting (i.e., get up and do what
needs to be done).
• Anonymity helps prevent aggressive individuals from driving a meeting.
• GDSS provide for multiple ways to participate in instant, anonymous voting.
• GDSS provide structure for the planning process to keep the group on track.
• GDSS enable several users to interact simultaneously (i.e., conferencing).
• GDSS record all information presented at a meeting (i.e., organizational memory).
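Among these activities, instant anonymous voting is the most mechanical, which makes it easy to illustrate. In the sketch below (our own illustration; the options are invented), ballots carry no voter identity by construction, and the tally is available the moment voting closes:

from collections import Counter

def anonymous_vote(ballots):
    """Tally anonymous ballots and return instant results with percentages."""
    tally = Counter(ballots)
    total = sum(tally.values())
    return {option: f"{count} ({count / total:.0%})"
            for option, count in tally.most_common()}

# Each participant submits only an option; no names are ever recorded.
ballots = ["expand east", "expand east", "hold", "expand east", "hold", "divest"]
print(anonymous_vote(ballots))
# {'expand east': '3 (50%)', 'hold': '2 (33%)', 'divest': '1 (17%)'}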
For GSS success stories, look for sample cases at vendors’ Web sites. As you will see,
in many of these cases, collaborative computing led to dramatic process improvements
and cost savings.
Facilities for GDSS
There are three options for deploying GDSS/GSS technology: (1) as a special-purpose
decision room, (2) as a multiple-use facility, and (3) as Internet- or intranet-based
groupware, with clients running wherever the group members are.

DECISION ROOMS The earliest GDSS were installed in expensive, customized,
special-purpose facilities called decision rooms (or electronic meeting rooms) with PCs
and large public screens at the front of each room. The original idea was that only
executives and high-level managers would use the facility. The software in a special-purpose
electronic meeting room usually runs over a LAN, and these rooms are fairly
plush in their furnishings. Electronic meeting rooms can be constructed in different shapes
and sizes. A common design includes a room equipped with 12 to 30 networked PCs,
usually recessed into the desktop (for better participant viewing). A server PC is attached
to a large-screen projection system and connected to the network to display the work
at individual workstations and aggregated information from the facilitator's workstation.
Breakout rooms equipped with PCs connected to the server, where small subgroups
can consult, are sometimes located adjacent to the decision room. The output from the
subgroups can also be displayed on the large public screen.
INTERNET-/INTRANET-BASED SYSTEMS Since the late 1990s, the most common approach
to GSS facilities has been to use Web- or intranet-based groupware that allows group
members to work from any location at any time (e.g., WebEx, GoToMeeting, Adobe
Connect, Microsoft Lync, GroupSystems). This groupware often includes audioconferencing
and videoconferencing. The availability of relatively inexpensive groupware (for
purchase or for subscription), combined with the power and low cost of computers and
mobile devices, makes this type of system very attractive.
SECTION 12.8 REVIEW QUESTIONS
1. Define GDSS and list the limitations of the initial GDSS software.
2. Define GSS and list its benefits.
3. List process gain improvements made by GSS.
4. Define decision room.
5. Describe Web-based GSS.
This chapter has served to provide a relatively quick overview of knowledge manage-
ment and collaborative systems, two movements that were really prominent in the past
20 years but have now been subsumed by other technologies for information sharing and
decision making. It helps to see where the roots of many of the technologies today might
have come from, although the names may have changed.
Chapter Highlights
• Knowledge is different from information and data. Knowledge is information that is
contextual, relevant, and actionable.
• Knowledge is dynamic in nature. It is information in action.
• Tacit (i.e., unstructured, sticky) knowledge is usually in the domain of subjective,
cognitive, and experiential learning; it is highly personal and difficult to formalize.
Explicit (i.e., structured, leaky) knowledge deals with more objective, rational, and
technical knowledge.
• Organizational learning is the development
of new knowledge and insights that have the
potential to influence behavior.
• The ability of an organization to learn, develop
memory, and share knowledge is dependent on
its culture. Culture is a pattern of shared basic
assumptions.
• Knowledge management is a process that helps organizations identify, select,
organize, disseminate, and transfer important information and expertise that typically
reside within the organization in an unstructured manner.
• The knowledge management model involves the following cyclical steps: create,
capture, refine, store, manage, and disseminate knowledge.
• Two knowledge management approaches are the process approach and the practice
approach.
• Standard knowledge management initiatives
involve the creation of knowledge bases, active
process management, knowledge centers,
collaborative technologies, and knowledge webs.
• A KMS is generally developed using three sets of
technologies: communication, collaboration, and
storage.
• A variety of technologies can make up a KMS, including the Internet, intranets, data
warehousing, decision support tools, and groupware. Intranets are the primary
vehicles for displaying and distributing knowledge in organizations.
• People collaborate in their work (called group-
work). Groupware (i.e., collaborative computing
software) supports groupwork.
• Group members may be in the same organization
or may span organizations; they may be in the
same location or in different locations; they may
work at the same time or at different times.
• The time/place framework is a convenient way to describe the communication and
collaboration patterns of groupwork. Different technologies can support different
time/place settings.
• Working in groups may result in many benefits, including improved decision making.
• Communication can be synchronous (i.e., same time) or asynchronous (i.e., sent and
received at different times).
• Groupware refers to software products that pro-
vide collaborative support to groups (including
conducting meetings).
• Groupware can support decision making/
problem solving directly or can provide indirect
support by improving communication between
team members.
• The Internet (Web), intranets, and extranets support decision making through
collaboration tools and access to data, information, and knowledge.
• Groupware for direct support such as GDSS typically contains capabilities for
electronic brainstorming, electronic conferencing or meeting, group scheduling,
calendaring, planning, conflict resolution, model building, videoconferencing,
electronic document sharing, stakeholder identification, topic commentator, voting,
policy formulation, and enterprise analysis.
• Groupware can support anytime/anyplace groupwork.
• A GSS is any combination of hardware and software that facilitates meetings. Its
predecessor, GDSS, provided direct support to decision meetings, usually in a
face-to-face setting.
• GDSS attempt to increase process and task gains and reduce process and task losses
of groupwork.
• Parallelism and anonymity provide several GDSS gains.
• GDSS may be assessed in terms of the common group activities of information
retrieval, information sharing, and information use.
• GDSS can be deployed in an electronic decision room environment, in a
multipurpose computer lab, or over the Web.
• Web-based groupware is the norm for anytime/anyplace collaboration.

Key Terms

asynchronous, community of practice, decision room, explicit knowledge, group decision support system (GDSS), group support system (GSS), groupthink, groupware, groupwork, idea generation, knowledge, knowledge-based economy, knowledge management (KM), knowledge management system (KMS), knowledge repository, leaky knowledge, organizational culture, organizational learning, organizational memory, parallelism, practice approach, process approach, process gain, process loss, synchronous (real-time), tacit knowledge, virtual meeting, virtual team, wiki

Questions for Discussion

1. Why is the term knowledge so difficult to define?
2. Describe and relate the different characteristics of knowledge to one another.
3. Explain why it is important to capture and manage knowledge.
4. Compare and contrast tacit knowledge and explicit knowledge.
5. Explain why organizational culture must sometimes change before knowledge management is introduced.
6. How does knowledge management attain its primary objective?
7. How can employees be motivated to contribute to and use KMS?
8. What is the role of a knowledge repository in knowledge management?
9. Explain the importance of communication and collaboration technologies to the processes of knowledge management.
10. List the three top technologies most frequently used for implementing KMS and explain their importance.
11. Explain why it is useful to describe groupwork in terms of the time/place framework.
12. Describe the kinds of support that groupware can provide to decision makers.
13. Explain why most groupware is deployed today over the Web.
14. Explain why meetings can be so inefficient. Given this, explain how effective meetings can be run.
15. Explain how GDSS can increase some of the benefits of collaboration and decision making in groups and eliminate or reduce some of the losses.
16. The original term for group support system (GSS) was group decision support system (GDSS). Why was the word decision dropped? Does this make sense? Why or why not?

Exercises

Teradata UNIVERSITY NETWORK (TUN) and Other Hands-on Exercises

1. Make a list of all the knowledge management methods you use during your day (work and personal). Which are the most effective? Which are the least effective? What kinds of work or activities does each knowledge management method enable?
2. Describe how to ride a bicycle, drive a car, or make a peanut butter and jelly sandwich. Now have someone else try to do it based solely on your explanation. How can you best convert this knowledge from tacit to explicit (or can't you)?
3. Examine the top five reasons that firms initiate KMS and investigate why they are important in a modern enterprise.
4. Read How the Irish Saved Civilization by Thomas Cahill (New York: Anchor, 1996) and describe how Ireland became a knowledge repository for Western Europe just before the fall of the Roman Empire. Explain in detail why this was important for Western civilization and history.
5. Examine your university, college, or company and describe the roles that the faculty, administration, support staff, and students have in the creation, storage, and dissemination of knowledge. Explain how the process works. Explain how technology is currently used and how it could potentially be used.
6. Search the Internet for knowledge management products and systems and create categories for them. Assign one vendor to each team. Describe the categories you created and justify them.
7. Consider a decision-making project in industry for this course or from another class or from work. Examine some typical decisions in the project. How would you extract the knowledge you need? Can you use that knowledge in practice? Why or why not?
8. How does knowledge management support decision making? Identify products or systems on the Web that help organizations accomplish knowledge management. Start with brint.com and knowledgemanagement.com. Try one out and report your findings to the class.
9. Search the Internet to identify sites that deal with knowledge management. Start with google.com, kmworld.com, kmmag.com, and km-forum.org. How many did you find? Categorize the sites based on whether they are academic, consulting firms, vendors, and so on. Sample one of each and describe the main focus of the site.
10. Make a list of all the communications methods (both work and personal) you use during your day. Which are the most effective? Which are the least effective? What kind of work or activity does each communications method enable?
11. Investigate the impact of turning off every communication system in a firm (i.e., telephone, fax, television, radio, all computer systems). How effective and efficient would the following types of firms be: airline, bank, insurance company, travel agency, department store, grocery store? What would happen? Do customers expect 100 percent uptime? (When was the last time a major airline's reservation system was down?) How long would it be before each type of firm would not be functioning at all? Investigate what organizations are doing to prevent this situation from occurring.
12. Investigate how researchers are trying to develop collaborative computer systems that portray or display nonverbal communication factors.
13. For each of the following software packages, check the trade literature and the Web for details and explain how computerized collaborative support system capabilities are included: Lync, GroupSystems, and WebEx.
14. Compare Simon's four-phase decision-making model to the steps in using GDSS.
15. A major claim in favor of wikis is that they can replace e-mail, eliminating its disadvantages (e.g., spam). Go to socialtext.com and review such claims. Find other supporters of switching to wikis. Then find counterarguments and conduct a debate on the topic.
16. Search the Internet to identify sites that describe methods for improving meetings. Investigate ways that meetings can be made more effective and efficient.
17. Go to groupsystems.com and identify its current GSS products. List the major capabilities of those products.
18. Go to the Expert Choice Web site (expertchoice.com) and find information about the company's group support products and capabilities. Team Expert Choice is related to the concept of the AHP described earlier. Evaluate this product in terms of decision support. Do you think that keypad use provides process gains or process losses? How and why? Also prepare a list of the product's analytical capabilities. Examine the free trial. How can it support groupwork?

END-OF-CHAPTER APPLICATION CASE
Solving Crimes by Sharing Digital Forensic Knowledge
Digital forensics has become an indispensable tool for law enforcement. This science is not only applied to cases of crime committed with or against digital assets, but is used in many physical crimes to gather evidence of intent or proof of prior relationships. The volume of digital devices that might be explored by a forensic analysis, however, is staggering, including anything from a home computer to a videogame console, to an engine module from a getaway vehicle. New hardware, software, and applications are being released into public use daily, and analysts must create new methods to deal with each of them.

Many law enforcement agencies have widely varying capabilities to do forensics, sometimes enlisting the aid of other agencies or outside consultants to perform analyses. As new techniques are developed, internally tested, and ultimately scrutinized by the legal system, new forensic hypotheses are born and proven. When the same techniques are applied to other cases, the new proceeding is strengthened by the precedent of a prior case. Acceptance of a methodology in multiple proceedings makes it more acceptable for future cases.

Unfortunately, new forensic discoveries are rarely formally shared, sometimes even among analysts within the
same agency. Briefings may be given to other analysts within the same agency, although caseloads often dictate immediately moving on to the next case. Even less is shared between different agencies, or even between different offices of some federal law enforcement communities. The result of this lack of sharing is duplication of significant effort to re-discover the same or similar approaches to prior cases and a failure to take consistent advantage of precedent rulings that may strengthen the admission of a certain process.

The Center for Telecommunications and Network Security (CTANS), a center of excellence that includes faculty from Oklahoma State University's Management Science and Information Systems Department, has developed, hosted, and is continuously evolving Web-based software to support law enforcement digital forensics investigators (LEDFI) via access to forensics resources and communication channels for the past 6 years. The cornerstone of this initiative has been the National Repository of Digital Forensics Information (NRDFI), a collaborative effort with the Defense Cyber Crime Center (DC3), which has evolved into the Digital Forensics Investigator Link (DFILink) over the past 2 years.
Solution
The development of the NRDFI was guided by the theory of the egocentric group and how these groups share knowledge and resources among one another in a community of practice (Jarvenpaa & Majchrzak, 2005). Within an egocentric community of practice, experts are identified through interaction, knowledge remains primarily tacit, and informal communication mechanisms are used to transfer this knowledge from one participant to the other. The informality of knowledge transfer in this context can lead to local pockets of expertise as well as redundancy of effort across the broader community as a whole. For example, a digital forensics (DF) investigator in Washington, DC, may spend 6 hours to develop a process to extract data hidden in slack space in the sectors of a hard drive. The process may be shared among his local colleagues, but other DF professionals in other cities and regions will have to develop the process on their own.

In response to these weaknesses, the NRDFI was developed as a hub for knowledge transfer between local law enforcement communities. The NRDFI site was locked down so that only members of law enforcement were able to access content, and members were provided the ability to upload knowledge documents and tools that may have been developed locally within their community, so that the broader law enforcement community of practice could utilize their contributions and reduce redundancy of efforts. The Defense Cyber Crime Center, a co-sponsor of the NRDFI initiative, provided a wealth of knowledge documents and tools in order to seed the system with content (see Figure 12.7).

FIGURE 12.7 DFI-Link Resources.
Results

Response from the LEDFI community was positive, and membership to the NRDFI site quickly jumped to over 1,000 users. However, the usage pattern for these members was almost exclusively unidirectional. LEDFI members would periodically log on, download a batch of tools and knowledge documents, and then not log on again until the knowledge content on the site was extensively refreshed. The mechanisms in place for local LEDFI communities to share their own knowledge and tools sat largely unused. From here, CTANS began to explore the literature with regard to motivating knowledge sharing, and began a redesign of NRDFI
driven by the extant literature; they focused on promoting sharing within the LEDFI community through the NRDFI.

Some additional capabilities include new applications such as a "Hash Link," which can provide DFI Link members with a repository of hash values that they would otherwise need to develop on their own, and a directory to make it easier to contact colleagues in other departments and jurisdictions. A calendar of events and a newsfeed page were integrated into the DFI Link in response to requests from the users. Increasingly, commercial software is also being hosted. Some were licensed through grants and others were provided by vendors, but all are free to vetted users of the law enforcement community.

The DFI Link has been a positive first step toward getting LEDFI to better communicate and share knowledge with colleagues in other departments. Ongoing research is helping to shape the DFI Link to better meet the needs of its customers and promote even greater knowledge sharing. Many LEDFI are inhibited from sharing such knowledge, as policies and culture in the law enforcement domain often promote the protection of information at the cost of knowledge sharing. However, by working with DC3 and the law enforcement community, researchers are beginning to knock down these barriers and create a more productive knowledge-sharing environment.
QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE

1. Why should digital forensics information be shared among law enforcement communities?
2. What does egocentric theory suggest about knowledge sharing?
3. What behavior did the developers of NRDFI observe in terms of use of the system?
4. What additional features might enhance the use and value of such a KMS?

Sources: Harrison et al., "A Lessons Learned Repository for Computer Forensics," International Journal of Digital Evidence, Vol. 1, No. 3, 2002; S. Jarvenpaa and A. Majchrzak, Developing Individuals' Transactive Memories of their Ego-Centric Networks to Mitigate Risks of Knowledge Sharing: The Case of Professionals Protecting CyberSecurity, paper presented at the Proceedings of the Twenty-Sixth International Conference on Information Systems, 2005; J. Nichols, D. P. Biros, and M. Weiser, "Toward Alignment Between Communities of Practice and Knowledge-Based Decision Support," Journal of Digital Forensics, Security, and Law, Vol. 7, No. 2, 2012; M. Weiser, D. P. Biros, and G. Mosier, "Building a National Forensics Case Repository for Forensic Intelligence," Journal of Digital Forensics, Security, and Law, Vol. 1, No. 2, May 2006. (This case was contributed by David Biros, Jason Nichols, and Mark Weiser.)

References

Alavi, M. (2000). "Managing Organizational Knowledge." Chapter 2 in W. R. Zmud (ed.), Framing the Domains of IT Management: Projecting the Future. Cincinnati, OH: Pinnaflex Educational Resources.
Alavi, M., T. Kayworth, and D. Leidner. (2005/2006). "An Empirical Examination of the Influence of Organizational Culture on Knowledge Management Practice." Journal of Management Information Systems, Vol. 22, No. 3.
Alavi, M., and D. Leidner. (2001). "Knowledge Management and Knowledge Management Systems: Conceptual Foundations and Research Issues." MIS Quarterly, Vol. 25, No. 1, pp. 107-136.
Bock, G.-W., R. Zmud, Y. Kim, and J. Lee. (2005). "Behavioural Intention Formation in Knowledge Sharing: Examining the Roles of Extrinsic Motivators, Social Psychological Forces and Organizational Climate." MIS Quarterly, Vol. 29, No. 1.
Carlucci, D., and G. Schiuma. (2006). "Knowledge Asset Value Spiral: Linking Knowledge Assets to Company's Performance." Knowledge and Process Management, Vol. 13, No. 1.
Chua, A., and W. Lam. (2005). "Why KM Projects Fail: A Multi-Case Analysis." Journal of Knowledge Management, Vol. 9, No. 3, pp. 6-17.
Davenport, D. (2008). "Enterprise 2.0: The New, New Knowledge Management?" http://blogs.hbr.org/davenport/2008/02/enterprise_20_the_new_new_know.html (accessed Sept. 2013).
Davenport, T., D. W. DeLong, and M. C. Beers. (1998, Winter). "Successful Knowledge Management Projects." Sloan Management Review, Vol. 39, No. 2.
Davenport, T., and L. Prusak. (1998). How Organizations Manage What They Know. Boston: Harvard Business School Press.
Delen, D., and S. S. Hawamdeh. (2009). "A Holistic Framework for Knowledge Discovery and Management." Communications of the ACM, Vol. 52, No. 6, pp. 141-145.
Desanctis, G., and R. B. Gallupe. (1987). "A Foundation for the Study of Group Decision Support Systems." Management Science, Vol. 33, No. 5.
Godin, B. (2006). "Knowledge-Based Economy: Conceptual Framework or Buzzword." The Journal of Technology Transfer, Vol. 31, No. 1.
Gray, P. (1999). "Tutorial on Knowledge Management." Proceedings of the Americas Conference of the Association for Information Systems, Milwaukee.
Hall, M. (2002, July 1). "Decision Support Systems." Computerworld, Vol. 36, No. 27.
Hansen, M., et al. (1999, March/April). "What's Your Strategy for Managing Knowledge?" Harvard Business Review, Vol. 77, No. 2.
Harrison et al. (2002, Fall). "A Lessons Learned Repository for Computer Forensics." International Journal of Digital Evidence, Vol. 1, No. 3.
Hoffer, J., M. Prescott, and F. McFadden. (2002). Modern Database Management, 6th ed. Upper Saddle River, NJ: Prentice Hall.
Holt, K. (2002, August 5). "Nice Concept: Two Days' Work in a Day." Meeting News, Vol. 26, No. 11.
Iyer, S., R. Sharda, D. Biros, J. Lucca, and U. Shimp. (2009). "Organization of Lessons Learned Knowledge: A Taxonomy of Implementation." International Journal of Knowledge Management, Vol. 5, No. 3.
Jarvenpaa, S., and A. Majchrzak. (2005). Developing Individuals' Transactive Memories of their Ego-Centric Networks to Mitigate Risks of Knowledge Sharing: The Case of Professionals Protecting CyberSecurity. Paper presented at the Proceedings of the Twenty-Sixth International Conference on Information Systems.
Kankanhalli, A., and B. C. Y. Tan. (2005). "Knowledge Management Metrics: A Review and Directions for Future Research." International Journal of Knowledge Management, Vol. 1, No. 2.
Kiaraka, R. N., and K. Manning. (2005). "Managing Organizations Through a Process-Based Perspective: Its Challenges and Rewards." Knowledge and Process Management, Vol. 12, No. 4.
Koenig, M. (2001, September). "Codification vs. Personalization." KMWorld.
Konicki, S. (2001, November 12). "Collaboration Is the Cornerstone of $19B Defense Contract." InformationWeek.
Leidner, D., M. Alavi, and T. Kayworth. (2006). "The Role of Culture in Knowledge Management: A Case Study of Two Global Firms." International Journal of eCollaboration, Vol. 2, No. 1.
Nichols, J., D. P. Biros, and M. Weiser. (2012). "Toward Alignment Between Communities of Practice and Knowledge-Based Decision Support." Journal of Digital Forensics, Security, and Law, Vol. 7, No. 2.
Nonaka, I. (1991). "The Knowledge-Creating Company." Harvard Business Review, Vol. 69, No. 6, pp. 96-104.
Nunamaker, J. F., R. O. Briggs, D. D. Mittleman, D. T. Vogel, and P. A. Balthazard. (1997). "Lessons from a Dozen Years of Group Support Systems Research: A Discussion of Lab and Field Findings." Journal of Management Information Systems, Vol. 13, pp. 163-207.
Polanyi, M. (1958). Personal Knowledge. Chicago: University of Chicago Press.
Ponzi, L. J. (2004). "Knowledge Management: Birth of a Discipline." In M. E. D. Koenig and T. K. Srikantaiah (eds.), Knowledge Management Lessons Learned: What Works and What Doesn't. Medford, NJ: Information Today.
Powell, A., G. Piccoli, and B. Ives. (2004, Winter). "Virtual Teams: A Review of Current Literature and Directions for Future Research." Data Base.
Robin, M. (2000, March). "Learning by Doing." Knowledge Management.
Ruggles, R. (1998). "The State of the Notion: Knowledge Management in Practice." California Management Review, Vol. 40, No. 3.
Schwartz, D. G. (ed.). (2006). Encyclopedia of Knowledge Management. Hershey, PA: Idea Group Reference.
Shariq, S. G., and M. T. Vendelø. (2006). "Tacit Knowledge Sharing." In D. G. Schwartz (ed.), Encyclopedia of Knowledge Management. Hershey, PA: Idea Group Reference.
Tseng, C., and J. Goo. (2005). "Intellectual Capital and Corporate Value in an Emerging Economy: Empirical Study of Taiwanese Manufacturers." R&D Management, Vol. 35, No. 2.
Tuggle, F. D., and W. E. Goldfinger. (2004). "A Methodology for Mining Embedded Knowledge from Process Maps." Human Systems Management, Vol. 23, No. 1.
Van de Ven, A. H. (2005, June). "Running in Packs to Develop Knowledge-Intensive Technologies." MIS Quarterly, Vol. 29, No. 2.
Weiser, M., D. P. Biros, and G. Mosier. (2006, May). "Building a National Forensics Case Repository for Forensic Intelligence." Journal of Digital Forensics, Security, and Law, Vol. 1, No. 2.
Wenger, E. C., and W. M. Snyder. (2000, January/February). "Communities of Practice: The Organizational Frontier." Harvard Business Review, pp. 139-145.
PART V

Big Data and Future Directions
for Business Analytics
LEARNING OBJECTIVES FOR PART V
• Understand the concepts, definitions, and potential use cases for Big Data and analytics
• Learn the enabling technologies, methods, and tools used to derive value from Big Data
• Explore some of the emerging technologies that offer interesting application and development opportunities for analytic systems in general and business intelligence in particular. These include geospatial data, location-based analytics, social networking, Web 2.0, reality mining, and cloud computing.
• Describe some personal, organizational, and societal impacts of analytics
• Learn about major ethical and legal issues of analytics
This part consists of two chapters. Chapter 13 introduces Big Data analytics, a hot topic in the analytics world today. It provides a detailed description of Big Data, the benefits and challenges that it brings to the world of analytics, and the methods, tools, and technologies developed to turn Big Data into immense business value. The primary purpose of Chapter 14 is to introduce several emerging technologies that will provide new opportunities for application and extension of business analytics techniques and support systems. This part also briefly explores the individual, organizational, and societal impacts of these technologies, especially the ethical and legal issues in analytics implementation. After describing many of the emerging technologies or application domains, we will focus on organizational issues.
CHAPTER 13
Big Data and Analytics
LEARNING OBJECTIVES
• Learn what Big Data is and how it is changing the world of analytics
• Understand the motivation for and business drivers of Big Data analytics
• Become familiar with the wide range of enabling technologies for Big Data analytics
• Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data analytics
• Understand the role of and the capabilities/skills for the data scientist as a new analytics profession
• Compare and contrast the complementary uses of data warehousing and Big Data
• Become familiar with the vendors of Big Data tools and services
• Understand the need for and appreciate the capabilities of stream analytics
• Learn about the applications of stream analytics
Big Data, which means many things to many people, is not a new technological fad. It is a business priority that has the potential to profoundly change the competitive landscape in today's globally integrated economy. In addition to providing innovative solutions to enduring business challenges, Big Data and analytics instigate new ways to transform processes, organizations, entire industries, and even society altogether. Yet extensive media coverage makes it hard to distinguish hype from reality. This chapter aims to provide comprehensive coverage of Big Data, its enabling technologies, and related analytics concepts to help understand the capabilities and limitations of this emerging paradigm. The chapter starts with the definition and related concepts of Big Data, followed by the technical details of the enabling technologies, including Hadoop, MapReduce, and NoSQL. After describing "data scientist" as a new, fashionable organizational role/job, we provide a comparative analysis between data warehousing and Big
Data analytics. The last part of the chapter is dedicated to stream analytics, which is one of the most promising value propositions of Big Data analytics. This chapter contains the following sections:

13.1 Opening Vignette: Big Data Meets Big Science at CERN
13.2 Definition of Big Data
13.3 Fundamentals of Big Data Analytics
13.4 Big Data Technologies
13.5 Data Scientist
13.6 Big Data and Data Warehousing
13.7 Big Data Vendors
13.8 Big Data and Stream Analytics
13.9 Applications of Stream Analytics
13.1 OPENING VIGNETTE: Big Data Meets Big Science
at CERN
The European Organization for Nuclear Research, known as CERN (which is derived from the acronym for the French "Conseil Europeen pour la Recherche Nucleaire"), is playing a leading role in fundamental studies of physics. It has been instrumental in many key global innovations and breakthrough discoveries in theoretical physics and today operates the world's largest particle physics laboratory, home to the Large Hadron Collider (LHC) nestled under the mountains between Switzerland and France. Founded in 1954, CERN, one of Europe's first joint ventures, now has 20 member European states. At the beginning, their research primarily concentrated on understanding the inside of the atom, hence the word "nuclear" in its name.
At CERN, physicists and engineers are probing the fundamental structure of the universe. They use the world's largest and most sophisticated scientific instruments to study the basic constituents of matter: the fundamental particles. These instruments include purpose-built particle accelerators and detectors. Accelerators boost the beams of particles to very high energies before the beams are forced to collide with each other or with stationary targets. Detectors observe and record the results of these collisions, which are happening at or near the speed of light. This process provides the physicists with clues about how the particles interact, and provides insights into the fundamental laws of nature. The LHC and its various experiments have received media attention following the discovery of a new particle strongly suspected to be the elusive Higgs boson, an elementary particle initially theorized in 1964 and tentatively confirmed at CERN on March 14, 2013. This discovery has been called "monumental" because it appears to confirm the existence of the Higgs field, which is pivotal to the major theories within particle physics.
THE DATA CHALLENGE
Forty million times per second, particles collide within the LHC, each collision generating particles that often decay in complex ways into even more particles. Precise electronic circuits all around the LHC record the passage of each particle via a detector as a series of electronic signals, and send the data to the CERN Data Centre (DC) for recording and digital reconstruction. The digitized summary of data is recorded as a "collision event." Physicists must sift through the 15 petabytes or so of digitized summary data produced annually to determine if the collisions have thrown up any interesting physics. Despite
the state-of-the-art instrumentation and computing infrastructure, CERN does not have the capacity to process all of the data that it generates, and therefore relies on numerous other research centers all around the world to access and process the data.

The Compact Muon Solenoid (CMS) is one of the two general-purpose particle physics detectors operated at the LHC. It is designed to explore the frontiers of physics and provide physicists with the ability to look at the conditions presented in the early stages of our universe. More than 3,000 physicists from 183 institutions representing 38 countries are involved in the design, construction, and maintenance of the experiments. An experiment of this magnitude requires an enormously complex distributed computing and data management system. CMS spans more than a hundred data centers in a three-tier model and generates around 10 petabytes (PB) of summary data each year in real data, simulated data, and metadata. This information is stored and retrieved from relational and nonrelational data sources, such as relational databases, document databases, blogs, wikis, file systems, and customized applications.

At this scale, information discovery within a heterogeneous, distributed environment becomes an important ingredient of successful data analysis. The data and associated metadata are produced in a variety of forms and digital formats. Users (within CERN and scientists all around the world) want to be able to query different services (at dispersed data servers and at different locations) and combine data/information from these varied sources. However, this vast and complex collection of data means they don't necessarily know where to find the right information or have the domain knowledge to extract and merge/combine this data.
SOLUTION
To overcome this Big Data hurdle, CMS's data management and workflow management (DMWM) created the Data Aggregation System (DAS), built on MongoDB (a Big Data management infrastructure) to provide the ability to search and aggregate information across this complex data landscape. Data and metadata for CMS come from many different sources and are distributed in a variety of digital formats. They are organized and managed by constantly evolving software using both relational and nonrelational data sources. The DAS provides a layer on top of the existing data sources that allows researchers and other staff to query data via free text-based queries, and then aggregates the results from across the distributed providers, while preserving their integrity, security policy, and data formats. The DAS then represents that data in a defined format.

"The choice of an existing relational database was ruled out for several reasons; namely, we didn't require any transactions and data persistency in DAS, and as such can't have a pre-defined schema. Also, the dynamic typing of stored metadata objects was one of the requirements. Amongst other reasons, those arguments forced us to look for alternative IT solutions," explained Valentin Kuznetsov, a research associate from Cornell University who works at CMS.

"We considered a number of different options, including file-based and in-memory caches, as well as key-value databases, but ultimately decided that a document database would best suit our needs. After evaluating several applications, we chose MongoDB, due to its support of dynamic queries, full indexes, including inner objects and embedded arrays, as well as auto-sharding."
ACCESSING THE DATA VIA FREE-FORM QUERIES
All DAS queries can be expressed in a free text-based form, either as a set of keywords or key-value pairs, where a pair can represent a condition. Users can query the system using a simple, SQL-like language, which is then transformed into the MongoDB query syntax, which is itself a JSON record. "Due to the schema-less nature of the underlying MongoDB back-end, we are able to store DAS records of any arbitrary structure, regardless of whether it's a dictionary, lists, key-value pairs, etc. Therefore, every DAS key has a set of attributes describing its JSON structure," added Kuznetsov.
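To make the idea of translating free-text queries into MongoDB query documents concrete, here is a minimal, hypothetical sketch (not the actual DAS code). The database/collection names (das.cache), the field names, and the parsing rules are all illustrative assumptions; only the pymongo calls themselves are standard.

```python
# A minimal, hypothetical sketch of a DAS-style free-text query being
# translated into a MongoDB query document (not CMS's actual code).
from pymongo import MongoClient

def parse_free_text_query(text):
    """Turn 'dataset site=T1_*' into a MongoDB filter document.
    Bare keywords require the field to exist; key=value pairs match
    values, with a trailing '*' treated as a prefix (regex) match."""
    filter_doc = {}
    for token in text.split():
        if "=" in token:
            key, value = token.split("=", 1)
            if value.endswith("*"):  # wildcard -> anchored prefix match
                filter_doc[key] = {"$regex": "^" + value[:-1]}
            else:
                filter_doc[key] = value
        else:  # bare keyword -> the record must contain this field
            filter_doc[token] = {"$exists": True}
    return filter_doc

client = MongoClient("mongodb://localhost:27017")  # assumed local server
records = client["das"]["cache"]  # hypothetical database/collection names

# 'dataset' is a keyword; 'site=T1_*' is a key-value pair with a wildcard.
query = parse_free_text_query("dataset site=T1_*")
for record in records.find(query).limit(5):
    print(record)
```

Because the stored records are schema-less JSON documents, the same filter-building logic works no matter which attributes a given record happens to carry.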
DATA AGNOSTIC
Given the number of different data sources, types, and providers that DAS connects to, it is imperative that the system itself be data agnostic, allowing users to query and aggregate the metadata information in customizable ways. The MongoDB architecture easily integrates with existing data services while preserving their access, security policy, and development cycles. This also provides a simple plug-and-play mechanism that makes it easy to add new data services as they are implemented and configure DAS to connect to specific domains.
CACHING FOR DATA PROVIDERS
As well as providing a way for users to easily access a wide range of data sources in a simple and consistent manner, DAS uses MongoDB as a dynamic cache, collating the information fed back from the data providers, which arrives in a variety of formats and file structures. "When a user enters a query, it checks if the MongoDB database has the aggregation the user is asking for and, if it does, returns it; otherwise, the system does the aggregation and saves it to MongoDB," said Kuznetsov. "If the cache does not contain the requested query, the system contacts distributed data providers that could have this information and queries them, gathering their results. It then merges all of the results, doing a sort of 'group by' operation based on predefined identifying keys and inserts the aggregated information into the cache."
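What Kuznetsov describes is essentially a cache-aside pattern with a merge-on-miss step. The following is a minimal sketch of that pattern under stated assumptions: the das.cache collection, the qhash and dataset field names, and the fetch_from_providers stub are all illustrative stand-ins, not the real DAS implementation.

```python
# Minimal cache-aside sketch of a DAS-style aggregation cache
# (illustrative only; collection and field names are assumptions).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local server
cache = client["das"]["cache"]  # hypothetical cache collection

def fetch_from_providers(query):
    """Stub standing in for calls to distributed data providers,
    each of which may return partial results in its own format."""
    return [
        {"qhash": query, "dataset": "/Example/A", "nevents": 1000},
        {"qhash": query, "dataset": "/Example/A", "nfiles": 12},
    ]

def merge_results(rows):
    """A crude 'group by' on a predefined identifying key: records
    that share the key are folded into one aggregated document."""
    merged = {}
    for row in rows:
        merged.setdefault(row["dataset"], {}).update(row)
    return list(merged.values())

def das_lookup(query):
    # 1. Check whether the cache already holds this aggregation.
    cached = list(cache.find({"qhash": query}))
    if cached:
        return cached
    # 2. On a miss, contact the providers, merge, and populate the cache.
    aggregated = merge_results(fetch_from_providers(query))
    cache.insert_many(aggregated)
    return aggregated

print(das_lookup("dataset=/Example*"))  # second call would hit the cache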
The deployment specifics are as follows:
• The CMS DAS currently runs on a single eight-core server that processes all of the
queries and caches the aggregated data.
• OS: Scientific Linux
• Server hardware configuration: 8-core CPU, 40GB RAM, 1TB storage (but the data set is usually around 50-100GB)
• Application language: Python
• Other database technologies: Aggregates data from a number of different databases, including Oracle, PostgreSQL, CouchDB, and MySQL
RESULTS
"DAS is used 24 hours a day, seven days a week, by CMS physicists, data operators, and
data managers at research facilities around the world. The average query may resolve into
thousands of documents, each a few kilobytes in size . The performance of MongoDB has
been outstanding, with a throughput of around 6,000 documents a second for raw cache
population," concluded Kuznetsov. "The ability to offer a free text query system that is
fast and scalable, with a highly d ynamic and scalable cache that is data agnostic, pro-
vides an invaluable two-way translation mechanism. DAS helps CMS users to easily find
and discover information they n eed in the ir research, and it represents one of the many
tools that physicists use on a daily basis toward great discoveries. Without h elp from
DAS, information lookup would have taken orders of magnitude longer. " As the data
collected by the various experiments grows, CMS is looking into horizontally scaling the
system with sharding (i.e., distributing a single, logical database system across a cluster of
machines) to meet demand. Similarly the team are spreading the word beyond CMS and
out to other parts of CERN.
QUESTIONS FOR THE OPENING VIGNETTE
1. What is CERN? Why is it important to the world of science?
2. How does the Large Hadron Collider work? What does it produce?
3. What is the essence of the data challenge at CERN? How significant is it?
4. What was the solution? How did Big Data address the challenges?
5. What were the results? Do you think the current solution is sufficient?
WHAT WE CAN LEARN FROM THIS VIGNETTE
Big Data is big, and much more. Thanks largely to technological advances, it is easier than ever to create, capture, store, and analyze very large quantities of data. Most Big Data is generated automatically by machines, and the opening vignette is an excellent illustration of this. As we have seen, the LHC at CERN creates very large volumes of data very fast. The Big Data comes in varied formats and is stored in distributed server systems. Analysis of such a data landscape requires new analytical tools and techniques. Regardless of its size, complexity, and velocity, data needs to be made easy to access, query, and analyze if the promised value is to be derived from it. CERN uses Big Data technologies to make it easy for scientists all over the world to analyze the vast amounts of data created by the LHC, so that the promise of understanding the fundamental building blocks of the universe is realized. As organizations like CERN hypothesize new means to leverage the value of Big Data, they will continue to invent newer technologies to create and capture even Bigger Data.
Sources: Compiled from N. Heath, "Cern: Where the Big Bang Meets Big Data," TechRepublic, 2012, techrepublic.com/blog/european-technology/cern-where-the-big-bang-meets-big-data/636 (accessed February 2013); home.web.cern.ch/about/computing; and 10gen Customer Case Study, "Big Data at the CERN Project," 10gen.com/customers/cern-cms (accessed March 2013).
13.2 DEFINITION OF BIG DATA
Using data to understand customers/clients and business operations to sustain (and foster) growth and profitability is an increasingly challenging task for today's enterprises. As more and more data becomes available in various forms and fashions, timely processing of the data with traditional means becomes impractical. This phenomenon is nowadays called Big Data, which is receiving substantial press coverage and drawing increasing interest from both business users and IT professionals. The result is that Big Data is becoming an overhyped and overused marketing buzzword.

Big Data means different things to people with different backgrounds and interests. Traditionally, the term "Big Data" has been used to describe the massive volumes of data analyzed by huge organizations like Google or research science projects at NASA. But for most businesses, it's a relative term: "Big" depends on an organization's size. The point is more about finding new value within and outside conventional data sources. Pushing the boundaries of data analytics uncovers new insights and opportunities, and "big" depends on where you start and how you proceed. Consider the popular description of Big Data: Big Data exceeds the reach of commonly used hardware environments and/or capabilities of software tools to capture, manage, and process it within a tolerable time span for its user population. Big Data has become a popular term to describe the exponential growth, availability, and use of information, both structured and unstructured. Much has been written on the Big Data trend and how it can serve as the basis for innovation, differentiation, and growth.
Where does Big Data come from? A simple answer is "everywhere." The sources of data that were ignored because of technical limitations are now being treated like gold mines. Big Data may come from Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, detailed call records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, scientific research, military surveillance, photography archives, video archives, and large-scale ecommerce practices.

Big Data is not new. What is new is that the definition and the structure of Big Data constantly change. Companies have been storing and analyzing large volumes of data since the advent of the data warehouse in the early 1990s. While terabytes used to be synonymous with Big Data warehouses, now it's petabytes, and the rate of growth in data volumes continues to escalate as organizations seek to store and analyze greater levels of transaction details, as well as Web- and machine-generated data, to gain a better understanding of customer behavior and business drivers.

Many (academics and industry analysts/leaders alike) think that "Big Data" is a misnomer. What it says and what it means are not exactly the same. That is, Big Data is not just "big." The sheer volume of the data is only one of many characteristics that are often associated with Big Data, such as variety, velocity, veracity, variability, and value proposition, among others.
The Vs That Define Big Data
Big Data is typically defined by three "V"s: volume, variety, and velocity. In addition to these three, we see some of the leading Big Data solution providers adding other Vs, such as veracity (IBM), variability (SAS), and value proposition.
VOLUME Volume is obviously the most common trait of Big Data. Many factors contributed to the exponential increase in data volume, such as transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, automatically generated RFID and GPS data, and so forth. In the past, excessive data volume created storage issues, both technical and financial. But with today's advanced technologies coupled with decreasing storage costs, these issues are no longer significant; instead, other issues emerge, including how to determine relevance amidst the large volumes of data and how to create value from data that is deemed to be relevant.

As mentioned before, big is a relative term. It changes over time and is perceived differently by different organizations. With the staggering increase in data volume, even the naming of the next Big Data echelon has been a challenge. The highest mass of data that used to be called petabytes (PB) has left its place to zettabytes (ZB), which is a trillion gigabytes (GB) or a billion terabytes (TB). Technology Insights 13.1 provides an overview of the size and naming of Big Data volumes.
TECHNOLOGY INSIGHTS 13.1 The Data Size Is Getting Big, Bigger, and Bigger

The measure of data size is having a hard time keeping up with new names. We all know the kilobyte (KB, which is 1,000 bytes), megabyte (MB, which is 1,000,000 bytes), gigabyte (GB, which is 1,000,000,000 bytes), and terabyte (TB, which is 1,000,000,000,000 bytes). Beyond that, the names given to data sizes are relatively new to most of us. The following table shows what comes after the terabyte and beyond.
548 Part V • Big Data and Future Directio ns for Business Analytics
Name         Symbol   Value
Kilobyte     kB       10^3
Megabyte     MB       10^6
Gigabyte     GB       10^9
Terabyte     TB       10^12
Petabyte     PB       10^15
Exabyte      EB       10^18
Zettabyte    ZB       10^21
Yottabyte    YB       10^24
Brontobyte*  BB       10^27
Gegobyte*    GeB      10^30

*Not an official SI (International System of Units) name/symbol, yet.
Consider that an exabyte of data is created on the Internet each day, which equates to 250 million DVDs' worth of information. And the idea of even larger amounts of data (a zettabyte) isn't too far off when it comes to the amount of info traversing the Web in any one year. In fact, industry experts are already estimating that we will see 1.3 zettabytes of traffic annually over the Internet by 2016, and soon enough, we might start talking about even bigger volumes. When referring to yottabytes, some Big Data scientists often wonder about how much data the NSA or FBI have on people altogether. Put in terms of DVDs, a yottabyte would require 250 trillion of them. A brontobyte, which is not an official SI prefix but is apparently recognized by some people in the measurement community, is a 1 followed by 27 zeros. Sizes of such magnitude can be used to describe the amount of sensor data that we will get from the Internet in the next decade, if not sooner. A gegobyte is 10 to the power of 30. With respect to where Big Data comes from, consider the following:
where the Big Data comes from, consider the following:
• The CERN La rge Hadron Collider ge nerates 1 petabyte per second.
• Sensors fro m a Boeing jet e ngine create 20 terabytes o f data every ho ur.
• 500 te rabytes of n ew d ata per day are ingeste d in Facebook databases .
• On YouTube, 72 hours of video are uploaded p er minute, translating to a terabyte every
4 minutes.
• The proposed Square Kilo mete r Array telescope (the world's proposed biggest telescope)
will generate an exabyte of data p er day.
Sources: S. Higginbotham, "As Data Gets Bigger, What Comes After a Yottabyte?" 2012, gigaom.com/2012/10/30/as-data-gets-bigger-what-comes-after-a-yottabyte (accessed March 2013); and en.wikipedia.org/wiki/Petabyte (accessed March 2013).
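The DVD equivalents quoted in the box are easy to sanity-check. A quick computation, assuming the standard 4.7 GB single-layer DVD capacity (an assumption on our part, since the box does not state which disc format it used), reproduces the same order of magnitude:

```python
# Sanity-check the DVD equivalents quoted above, assuming a standard
# single-layer DVD capacity of 4.7 GB (4.7e9 bytes).
DVD_BYTES = 4.7e9

exabyte = 10**18
yottabyte = 10**24

print(f"Exabyte   ~ {exabyte / DVD_BYTES:,.0f} DVDs")    # ~213 million
print(f"Yottabyte ~ {yottabyte / DVD_BYTES:,.0f} DVDs")  # ~213 trillion
```

Both results round to the "250 million" and "250 trillion" figures cited above.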
From a short historical perspective, in 2009 the world had about 0.8ZB of data; in 2010, it exceeded the 1ZB mark; at the end of 2011, the number was 1.8ZB. Six or seven years from now, the number is estimated to be 35ZB (IBM, 2013). Though this number is astonishing in size, so are the challenges and opportunities that come with it.
VARIETY Data today comes in all types of formats, ranging from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, e-mail, XML, meter-collected and sensor-captured data, to video, audio, and stock ticker data. By some estimates, 80 to 85 percent of all organizations' data is in some sort of unstructured or semistructured format (a format that is not suitable for traditional
database schemas). But there is no denying its value, and hence it must be included in
the analyses to support decision making.
VELOCITY According to Gartner, velocity means both how fast data is being produced and how fast the data must be processed (i.e., captured, stored, and analyzed) to meet the need or demand. RFID tags, automated sensors, GPS devices, and smart meters are driving an increasing need to deal with torrents of data in near-real time. Velocity is perhaps the most overlooked characteristic of Big Data. Reacting quickly enough to deal with velocity is a challenge to most organizations. For time-sensitive environments, the opportunity cost clock of the data starts ticking the moment the data is created. As time passes, the value proposition of the data degrades, and eventually becomes worthless. Whether the subject matter is the health of a patient, the well-being of a traffic system, or the health of an investment portfolio, accessing the data and reacting faster to the circumstances will always create more advantageous outcomes.

In the Big Data storm that we are witnessing now, almost everyone is fixated on at-rest analytics, using optimized software and hardware systems to mine large quantities of variant data sources. Although this is critically important and highly valuable, there is another class of analytics, driven from the velocity nature of Big Data, called "data stream analytics" or "in-motion analytics," which is mostly overlooked. If done correctly, data stream analytics can be as valuable, and in some business environments more valuable, than at-rest analytics. Later in this chapter we will cover this topic in more detail.
VERACITY Veracity is the term used by IBM as the fourth "V" to describe Big Data. It refers to conformity to facts: accuracy, quality, truthfulness, or trustworthiness of the data. Tools and techniques are often used to handle Big Data's veracity by transforming the data into quality and trustworthy insights.
VARIABILITY In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something big trending in the social media? Perhaps there is a high-profile IPO looming. Maybe swimming with pigs in the Bahamas is suddenly the must-do vacation activity. Daily, seasonal, and event-triggered peak data loads can be challenging to manage, especially with social media involved.
VALUE PROPOSITION The excitement around Big Data is its value proposition. A preconceived notion about "big" data is that it contains (or has a greater potential to contain) more patterns and interesting anomalies than "small" data. Thus, by analyzing large and feature-rich data, organizations can gain greater business value than they may otherwise have. While users can detect the patterns in small data sets using simple statistical and machine-learning methods or ad hoc query and reporting tools, Big Data means "big" analytics. Big analytics means greater insight and better decisions, something that every organization needs nowadays.
Since the exact definition of Big Data is still a matter of ongoing discussion in academic and industrial circles, it is likely that more characteristics (perhaps more Vs) will be added to this list. Regardless of what happens, the importance and value proposition of Big Data are here to stay. Figure 13.1 shows a conceptual architecture where Big Data (at the left side of the figure) is converted to business insight through the use of a combination of advanced analytics and delivered to a variety of different users/roles for faster/better decision making.
FIGURE 13.1 A High-Level Conceptual Architecture for Big Data Solutions. (Source: AsterData, a Teradata Company)
Application Case 13.1 shows the creative use of Big Data analytics, including social media data, in the retail industry.
Application Case 13.1

Big Data Analytics Helps Luxottica Improve Its Marketing Effectiveness

Based in Mason, Ohio, Luxottica Retail North America (Luxottica) is a wholly owned retail arm of Milan-based Luxottica Group S.p.A., the world's largest designer, manufacturer, distributor, and seller of luxury and sports eyewear. Employing more than 65,000 people worldwide, the company reported net sales of EUR6.2 billion in 2011.
Problem – Disconnected Customer Data
Nearly 100 million customers purchase eight house
brands from Luxottica through the company’s
numerous websites and retail chain stores. The
big data captured from those customer interactions
(in the form of transactions, click streams, product
reviews, and social media postings) constitutes a
massive source of business intelligence for potential
product, marketing, and sales opportunities.
Luxottica, however, outsourced both data stor-
age and promotional campaign development and
management, leading to a disconnect between data
analytics and marketing execution. The outsource
model hampered access to current, actionable data,
limiting its marketing value and the analytic value
of the IBM PureData System for Analytics appli-
ance that Luxottica used for a small segment of its
business.
Luxottica’s competitive posture and strategic
growth initiatives were compromised for lack of an
individualized view of its customers and an inabil-
ity to act decisively and consistently on the differ-
ent types of information generated by each retail
channel. Luxottica needed to be able to exploit all
data regardless of source or which internal or exter-
nal application it resided on. Likewise, the com-
pany’s marketing team wanted more control over
promotional campaigns, including the capacity to
gauge campaign effectiveness.
Solution – Fine-tuned Marketing
To integrate all data from its multiple internal and
external application sources and gain visibility into
its customers, Luxottica deployed the Customer
Intelligence Appliance (CIA) from IBM Business
Partner Aginity LLC.
CIA is an integrated set of adaptable software,
hardware, and embedded analytics built on the IBM
PureData System for Analytics solution. The com-
bined technologies help Luxottica highly segment
customer behavior and provide a platform and smart
database for marketing execution systems, such as
campaign management, e-mail services and other
forms of direct marketing.
IBM® PureData™ for Analytics, which is powered by Netezza data warehousing technology, is one of the leading data appliances for large-scale, real-time analytics. Because of its innovative data storage mechanisms and massively parallel processing capabilities, it simplifies and optimizes performance of data services for analytic applications, enabling very complex algorithms to run in minutes, not hours or days, rapidly delivering invaluable insight to decision makers when they need it.
The IBM and Aginity platform provides
Luxottica with unprecedented visibility into a class
of customer that is of particular interest to the com-
pany: the omni-channel customer. This customer
purchases merchandise both online and in-store and
tends to shop and spend more than web-only or
in-store customers.
“We’ve equipped their team with tools to gain a 360-degree view of their most profitable sales channel, the omni-channel customers, and individualize the way they market to them,” says Ted Westerheide, chief architect for Aginity. “With the Customer Intelligence Appliance and PureData System for Analytics platform, Luxottica is a learning organization, connecting to customer data across multiple channels and improving marketing initiatives from campaign to campaign.”
Benefits
Successful implementation of such an advanced big
data analytics solution brings about numerous busi-
ness benefits. In the case of Luxottica, the top three
benefits were:
• Anticipates a 10 percent improvement in mar-
keting effectiveness
• Identifies the highest-value customers out of
nearly 100 million
• Targets individual customers based on unique
preferences and histories
QUESTIONS FOR DISCUSSION
1. What does Big Data mean to Luxottica?
2. What were their main challenges?
3. What was the proposed solution, and the
obtained results?
Source: IBM Customer Case, “Luxottica anticipates 10 percent improvement in marketing effectiveness,” http://www-01.ibm.com/software/success/cssdb.nsf/CS/KPES-9BNNKV?OpenDocument&Site=default&cty=en_us (accessed October 2013).
SECTION 13.2 REVIEW QUESTIONS
1. Why is Big Data important? What has changed to put it in the center of the analytics world?
2. How do you define Big Data? Why is it difficult to define?
3. Out of the Vs that are used to define Big Data, in your opinion, which one is the most important? Why?
4. What do you think the future of Big Data will be like? Will it leave its popularity to something else? If so, what will it be?
13.3 FUNDAMENTALS OF BIG DATA ANALYTICS
Big Data by itself, regardless of the size, type, or speed, is worthless unless business users do something with it that delivers value to their organizations. That’s where “big” analytics comes into the picture. Although organizations have always run reports and
dashboards against data warehouses, most have not opened these repositories to in-
depth on-demand exploration. This is partly because analysis tools are too complex for
the average user but also because the repositories often do not contain all the data
needed by the power user. But this is about to change (and has already changed for some) in a dramatic fashion, thanks to the new Big Data analytics paradigm.
Along with its value proposition, Big Data also brought about big challenges for organizations. The traditional means for capturing, storing, and analyzing data are not capable of dealing with Big Data effectively and efficiently. Therefore, new breeds of technologies need to be developed (or purchased/hired/outsourced) to take on the Big Data challenge. Before making such an investment, organizations should be able to justify it. Here are some questions that may help shed light on this situation. If any of the following statements are true, then you need to seriously consider embarking on a Big Data journey.
• You can’t process the amount of data that you want to because of the limitations posed by your current platform or environment.
• You want to involve new/contemporary data sources (e.g., social media, RFID, sensory, Web, GPS, textual data) in your analytics platform, but you can’t, because the new data does not fit the schema-defined rows and columns of your data storage without sacrificing fidelity or the richness of the new data.
• You need to (or want to) integrate data as quickly as possible to be current in your analysis.
• You want to work with a schema-on-demand (as opposed to the predetermined schema used in RDBMSs) data storage paradigm because the nature of the new data may not be known, or there may not be enough time to determine it and develop a schema for it (see the short schema-on-read sketch after this list).
• The data is arriving so fast at your organization’s doorstep that your traditional analytics platform cannot handle it.
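To make the schema-on-demand point concrete, here is a minimal Python sketch of schema-on-read: records are stored as raw JSON lines with no table definition up front, and structure is imposed only at analysis time. The file name and field names are hypothetical.

import json
from collections import Counter

# Schema-on-read: each record carries its own structure; nothing is declared
# up front, and absent fields are simply handled at read time.
source_counts = Counter()
with open("events.jsonl") as f:        # hypothetical file of mixed event records
    for line in f:
        event = json.loads(line)
        # New fields can appear in later records without any schema migration.
        source_counts[event.get("source", "unknown")] += 1

print(source_counts.most_common(5))

In an RDBMS, adding a new attribute would require altering the table first; here the new data simply flows in and is interpreted when analyzed.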
As is the case with any other large IT investment, success in Big Data analytics depends on a number of factors. Figure 13.2 shows a graphical depiction of the most critical success factors (Watson, 2012).
FIGURE 13.2 Critical Success Factors for Big Data Analytics. (Source: AsterData-a Teradata Company)
Following are the most critical success factors for Big Data analytics (Watson et al., 2012):
1. A clear business need (alignment with the vision and the strategy). Business investments ought to be made for the good of the business, not for the sake of mere technology advancements. Therefore, the main driver for Big Data analytics should be the needs of the business at any level-strategic, tactical, or operational.
2. Strong, committed sponsorship (executive champion). It is a well-known fact that if you don’t have strong, committed executive sponsorship, it is difficult (if not impossible) to succeed. If the scope is a single or a few analytical applications, the sponsorship can be at the departmental level. However, if the target is enterprise-wide organizational transformation, which is often the case for Big Data initiatives, sponsorship needs to be at the highest levels and organization-wide.
3. Alignment between the business and IT strategy. It is essential to make sure that the analytics work always supports the business strategy, and not the other way around. Analytics should play the enabling role in the successful execution of the business strategy.
4. A fact-based decision-making culture. In a fact-based decision-making culture, the numbers rather than intuition, gut feeling, or supposition drive decision making. There is also a culture of experimentation to see what works and what doesn’t. To create a fact-based decision-making culture, senior management needs to:
• Recognize that some people can’t or won’t adjust
• Be a vocal supporter
• Stress that outdated methods must be discontinued
• Ask to see what analytics went into decisions
• Link incentives and compensation to desired behaviors
5. A strong data infrastructure. Data warehouses have provided the data infrastructure for analytics. This infrastructure is changing and being enhanced in the Big Data era with new technologies. Success requires marrying the old with the new for a holistic infrastructure that works synergistically.
As the size and complexity of data increase, the need for more efficient analytical systems also increases. To keep up with the computational needs of Big Data, a number of new and innovative computational techniques and platforms have been developed. These techniques are collectively called high-performance computing, which includes the following:
• In-memory analytics: Solves complex problems in near-real time with highly accurate insights by allowing analytical computations and Big Data to be processed in memory and distributed across a dedicated set of nodes.
• In-database analytics: Speeds time to insights and enables better data governance by performing data integration and analytic functions inside the database, so you won’t have to move or convert data repeatedly (see the sketch after this list).
• Grid computing: Promotes efficiency, lower cost, and better performance by processing jobs in a shared, centrally managed pool of IT resources.
• Appliances: Brings together hardware and software in a physical unit that is not only fast but also scalable on an as-needed basis.
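The in-database idea is easy to demonstrate in miniature. This Python sketch uses the standard-library sqlite3 module as a stand-in for an analytic database; the sales.db file and its sales table are hypothetical. The point is the pattern, not the engine: pushing the aggregation into the database moves only the small result set, not every raw row.

import sqlite3

conn = sqlite3.connect("sales.db")  # hypothetical database with a sales table

# Anti-pattern: pull every row out of the database, then aggregate in the client.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0) + amount

# In-database style: push the aggregation into the engine; only the per-region
# totals cross the wire, which is what in-database analytics does at scale.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)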
Computational requirements are just a small part of the list of challenges that Big Data imposes upon today’s enterprises. Following is a list of challenges that business executives find to have a significant impact on successful implementation of Big Data analytics. When considering Big Data projects and architecture, being mindful of these challenges can make the journey to analytics competency a less stressful one.
• Data volume: The ability to capture, store, and process the huge volume of data at an acceptable speed so that the latest information is available to decision makers when they need it.
• Data integration: The ability to combine data that is not similar in structure or source and to do so quickly and at reasonable cost.
• Processing capabilities: The ability to process the data quickly, as it is captured. The traditional way of collecting and then processing the data may not work. In many situations data needs to be analyzed as soon as it is captured to leverage the most value (this is called stream analytics, which will be covered later in this chapter).
• Data governance: The ability to keep up with the security, privacy, ownership, and quality issues of Big Data. As the volume, variety (format and source), and velocity of data change, so should the capabilities of governance practices.
• Skills availability: Big Data is being harnessed with new tools and is being looked at in different ways. There is a shortage of people (often called data scientists, covered later in this chapter) with the skills to do the job.
• Solution cost: Since Big Data has opened up a world of possible business improvements, there is a great deal of experimentation and discovery taking place to determine the patterns that matter and the insights that turn to value. To ensure a positive ROI on a Big Data project, therefore, it is crucial to reduce the cost of the solutions used to find that value.
Though the challenges are real, so is the value proposition of Big Data analytics. Anything that you can do as a business analytics leader to help prove the value of new data sources to the business will move your organization beyond experimenting and exploring Big Data into adopting and embracing it as a differentiator. There is nothing wrong with exploration, but ultimately the value comes from putting those insights into action.
Business Problems Addressed by Big Data Analytics
Overall, the top business problems addressed by Big Data are process efficiency and cost reduction, along with enhancing the customer experience, but different priorities emerge when the question is looked at by industry. Process efficiency and cost reduction are perhaps the top-ranked problems that can be addressed with Big Data analytics in the manufacturing, government, energy and utilities, communications and media, transportation, and healthcare sectors. Enhanced customer experience may be at the top of the list for insurance companies and retailers, while risk management usually tops the list for companies in banking and education. Here is a list of problems that can be addressed using Big Data analytics:
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling, and up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
Application Case 13.2 illustrates an excellent example in the banking industry, where disparate data sources are integrated into a Big Data infrastructure to achieve a single source of the truth.
Application Case 13.2
Top 5 Investment Bank Achieves Single Source of Truth
The Bank’s highly respected derivatives team is responsible for over one-third of the world’s total derivatives trades. Its derivatives practice has a global footprint with teams that support credit, interest rate, and equity derivatives in every region of the world. The Bank has earned numerous industry awards and is recognized for its product innovations.
Challenge
With its significant derivatives exposure, the Bank’s management recognized the importance of having a real-time global view of its positions. The existing system, based on a relational database, consisted of multiple installations around the world. Because of the gradual expansions made to accommodate increasing data volumes and varieties, the legacy system was not fast enough to respond to growing business needs and requirements. It was unable to deliver real-time alerts to manage market and counterparty credit positions in the desired timeframe.
Solution
The Bank built a derivatives trade store based on the MarkLogic Server (MarkLogic is a Big Data analytics solution provider), replacing the incumbent technologies.
Replacing the 20 disparate batch-processing servers with a single operational trade store enabled the Bank to know its market and credit counterparty positions in real time, providing the ability to act quickly to mitigate risk. The accuracy and completeness of the data allowed the Bank and its regulators to confidently rely on the metrics and stress test results it reports.
The selection process also considered upgrading the existing Oracle and Sybase technology. Meeting all the new regulatory requirements was also a major factor in the decision, as the Bank looked to maximize its investment. After the Bank’s careful investigation, the choice was clear: only MarkLogic could meet both needs while providing better performance, scalability, faster development for future requirements and implementation, and a much lower total cost of ownership (TCO). Figure 13.3 illustrates the transformation from the old fragmented systems to the new unified system.
[Figure: before, it was difficult to identify financial exposure across many systems (separate copies of the derivatives trade store); after, all contracts can be analyzed in a single database (MarkLogic Server eliminates the need for 20 database copies).]
FIGURE 13.3 Moving from Many Old Systems to a Unified New System. Source: MarkLogic.
Results
MarkLogic was selected because existing systems would not provide the sub-second updating and analysis response times needed to effectively
manage a derivatives trade book that represents nearly one-third of the global market. Trade data is now aggregated accurately across the Bank’s entire derivatives portfolio, allowing risk management stakeholders to know the true enterprise risk profile, to conduct predictive analyses using accurate data, and to adopt a forward-looking approach. Not only are hundreds of thousands of dollars of technology costs saved each year, but the Bank does not need to add resources to meet regulators’ escalating demands for more transparency and stress-testing frequency.
Here are the highlights from the obtained results:
• An alerting feature keeps users apprised of up-to-the-minute market and counterparty credit changes so they can take appropriate actions.
• Derivatives are stored and traded in a single MarkLogic system requiring no downtime for maintenance, a significant competitive advantage.
• Complex changes can be made in hours versus the days, weeks, or even months needed by competitors.
• Replacing Oracle and Sybase significantly reduced operations costs: one system versus 20, one database administrator instead of up to 10, and lower costs per trade.
Next Steps
The successful implementation and performance
of the n ew system resulted in the Bank’s exami-
nation of other areas w h ere it could extract more
value fro m its Big Data- structured, unstruc-
ture d , a nd/ or p oly-structure d . Two applications
are under active discussio n . Its equity research
b usiness sees an opp ortunity to sig nificantly boost
revenue with a platform that provides real-time
research , repurposing , and content delivery. The
Bank also sees the p ower of centralizing customer
data to improve onboarding, increase cross-sell
opportunities, and support know your customer
requirements.
QUESTIONS FOR DISCUSSION
1. How can Big Data benefit large-scale trading banks?
2. How did the MarkLogic infrastructure help ease the leveraging of Big Data?
3. What were the challenges, the proposed solution, and the obtained results?
Source: MarkLogic, Customer Success Story, marklogic.com/resources/top-5-derivatives-trading-bank-achieves-single-source-of-truth (accessed March 2013).
SECTION 13.3 REVIEW QUESTIONS
1. What is Big Data analytics? How does it differ from regular analytics?
2. What are the critical success factors for Big Data analytics?
3. What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?
4. What are the common business problems addressed by Big Data analytics?
13.4 BIG DATA TECHNOLOGIES
There are a number of technologies for processing and analyzing Big Data, but most have some common characteristics (Kelly, 2012). Namely, they take advantage of commodity hardware to enable scale-out, parallel processing techniques; employ nonrelational data storage capabilities in order to process unstructured and semistructured data; and apply advanced analytics and data visualization technology to Big Data to convey insights to end users. Three Big Data technologies stand out that most believe will transform the business analytics and data management markets: MapReduce, Hadoop, and NoSQL.
MapReduce
MapReduce is a technique popularized by Google that distributes the processing of very large multi-structured data files across a large cluster of machines. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds, potentially thousands, of nodes in the cluster. To quote the seminal paper on MapReduce:
“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system” (Dean and Ghemawat, 2004).
The key point to note from this quote is that MapReduce is a programming model, not a programming language; that is, it is designed to be used by programmers rather than business users. The easiest way to describe how MapReduce works is through an example: see the geometric shape counter in Figure 13.4.
The input to the MapReduce process in Figure 13.4 is a set of geometric shapes. The objective is to count the number of geometric shapes of each type (diamond, circle, square, star, and triangle). The programmer in this example is responsible for coding the map and reduce programs; the remainder of the processing is handled by the software system implementing the MapReduce programming model.
The MapReduce system first reads the input file and splits it into multiple pieces. In
this example, there are two splits, but in a real-life scenario, the number of splits would
typically be much higher. These splits are then processed by multiple map programs
running in parallel on the nodes of the cluster. The role of each map program in this case is to group the data in a split by the type of geometric shape. The MapReduce system then takes the output from each map program and merges (shuffles/sorts) the results for input to the reduce program, which calculates the sum of the number of each type of geometric shape. In this example, only one copy of the reduce program is used, but there may be more in practice. To optimize performance, programmers can provide their own shuffle/sort program and can also deploy a combiner that combines local map output files to reduce the number of output files that have to be remotely accessed across the cluster by the shuffle/sort step.
[Figure: raw data is split, passed through the map function, and then through the reduce function.]
FIGURE 13.4 A Graphical Depiction of the MapReduce Process.
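The shape counter’s map, shuffle/sort, and reduce steps can be sketched in a few lines of Python. This is a single-process illustration of the programming model only; the function names and the two hard-coded splits are ours, not from the figure, and a real MapReduce system would run the map calls on different cluster nodes.

from collections import defaultdict

# Map: emit a (shape, 1) pair for every shape found in one input split.
def map_shapes(split):
    return [(shape, 1) for shape in split]

# Shuffle/sort: group the intermediate pairs by key across all map outputs.
def shuffle(mapped_outputs):
    groups = defaultdict(list)
    for output in mapped_outputs:
        for shape, count in output:
            groups[shape].append(count)
    return groups

# Reduce: sum the counts for each shape.
def reduce_counts(groups):
    return {shape: sum(counts) for shape, counts in groups.items()}

splits = [["circle", "square", "circle"], ["star", "circle", "square"]]
mapped = [map_shapes(s) for s in splits]   # would run in parallel on the cluster
print(reduce_counts(shuffle(mapped)))      # {'circle': 3, 'square': 2, 'star': 1}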
Why Use MapReduce?
MapReduce aids organizations in processing and analyzing large volumes of multi-structured data. Application examples include indexing and search, graph analysis, text analysis, machine learning, data transformation, and so forth. These types of applications are often difficult to implement using the standard SQL employed by relational DBMSs.
The procedural nature of MapReduce makes it easily understood by skilled programmers. It also has the advantage that developers do not have to be concerned with implementing parallel computing; this is handled transparently by the system. Although MapReduce is designed for programmers, non-programmers can exploit the value of prebuilt MapReduce applications and function libraries. Both commercial and open source MapReduce libraries are available that provide a wide range of analytic capabilities. Apache Mahout, for example, is an open source machine-learning library of “algorithms for clustering, classification, and batch-based collaborative filtering” that are implemented using MapReduce.
Hadoop
Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce, a user-defined function developed by Google in the early 2000s for indexing the Web. It was designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel. Hadoop clusters run on inexpensive commodity hardware so projects can scale out without breaking the bank. Hadoop is now a project of the Apache Software Foundation, where hundreds of contributors continuously improve the core technology. The fundamental concept: rather than banging away at one huge block of data with a single machine, Hadoop breaks up Big Data into multiple parts so each part can be processed and analyzed at the same time.
How Does Hadoop Work?
A client accesses unstructured and semistructured data from sources including log files, social media feeds, and internal data stores. It breaks the data up into “parts,” which are then loaded into a file system made up of multiple nodes running on commodity hardware. The default file store in Hadoop is the Hadoop Distributed File System (HDFS). File systems such as HDFS are adept at storing large volumes of unstructured and semistructured data, as they do not require data to be organized into relational rows and columns. Each “part” is replicated multiple times and loaded into the file system, so that if a node fails, another node has a copy of the data contained on the failed node.
A Name Node acts as facilitator, communicating back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.
Once the data is loaded into the cluster, it is ready to be analyzed via the MapReduce framework. The client submits a “Map” job (usually a query written in Java) to one of the nodes in the cluster known as the Job Tracker. The Job Tracker refers to the Name Node to determine which data it needs to access to complete the job and where in the cluster that data is located. Once determined, the Job Tracker submits the query to the relevant nodes. Rather than bringing all the data back into a central location for processing, processing then occurs at each node simultaneously, or in parallel. This is an essential characteristic of Hadoop.
When each node has finished processing its given job, it stores the results. The client initiates a “Reduce” job through the Job Tracker, in which the results of the map phase stored locally on individual nodes are aggregated to determine the “answer” to the original query and then loaded onto another node in the cluster. The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis. The MapReduce job has now been completed.
Once the MapReduce phase is complete, the processed data is ready for further analysis by data scientists and others with advanced data analytics skills. Data scientists can manipulate and analyze the data using any of a number of tools for any number of uses, including searching for hidden insights and patterns or use as the foundation for building user-facing analytic applications. The data can also be modeled and transferred from Hadoop clusters into existing relational databases, data warehouses, and other traditional IT systems for further analysis and/or to support transactional processing.
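As a concrete illustration of the map and reduce phases just described, here is a classic word-count job written for Hadoop Streaming, which lets MapReduce jobs be written in any language that reads stdin and writes stdout. The script name, input/output paths, and the streaming jar location are illustrative and vary by distribution.

#!/usr/bin/env python3
# wordcount.py -- run the same script as the mapper or the reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#       -input /logs -output /counts \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#       -file wordcount.py
import sys

def mapper():
    # Emit one "word<TAB>1" line per token; Streaming uses tab-separated text.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()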
Hadoop Technical Components
A Hadoop “stack” is made up of a number of components, which include:
• Hadoop Distributed File System (HDFS): The default storage layer in any
given Hadoop cluster
• Name Node: The node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and whether any nodes fail
• Secondary Node: A backup to the Name Node, it periodically replicates and
stores data from the Name Node should it fail
• Job Tracker: The node in a Hadoop cluster that initiates and coordinates
MapReduce jobs, or the processing of the data
• Slave Nodes: The grunts of any Hadoop cluster, slave nodes store data and take
direction to process it from the Job Tracker
In addition to these components, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores like Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop. Here are the most commonly referenced sub-projects of Hadoop.
HIVE Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in an SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, and so forth.
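For flavor, here is a small Python sketch that submits a HiveQL query through the hive command-line client; the sales_events table is hypothetical, and depending on the installation the same query could also go through a driver library or through beeline. Hive compiles the SQL-like text into MapReduce jobs behind the scenes.

import subprocess

# HiveQL reads like SQL; Hive turns it into MapReduce jobs on the cluster.
query = """
SELECT brand, COUNT(*) AS orders
FROM sales_events        -- hypothetical table of raw order records
GROUP BY brand
ORDER BY orders DESC
LIMIT 10
"""

# 'hive -e' executes a query string directly from the shell.
subprocess.run(["hive", "-e", query], check=True)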
PIG Pig is a Hadoop-based query language developed by Yahoo!. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
HBASE HBase is a nonrelational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.
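The low-latency point lookup that HBase adds is easy to see from a client. The following sketch uses happybase, a third-party Python client that talks to HBase through its Thrift gateway; the host and table names are hypothetical.

import happybase  # third-party client; requires HBase's Thrift service running

connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
table = connection.table("user_profiles")               # hypothetical table

# Point write and read by row key: millisecond-scale operations, in contrast
# to the batch-oriented jobs that plain Hadoop/MapReduce supports.
table.put(b"user:42", {b"info:email": b"jane@example.com"})
print(table.row(b"user:42"))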
FLUME Flume is a framework for populating Hadoop with data. Agents are deployed throughout one’s IT infrastructure (inside Web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.
OOZIE Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig, and Hive, and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
AMBARI Ambari is a Web-based set of tools for deploying, administering, and monitoring Apache Hadoop clusters. Its development is being led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.
AVRO Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
MAHOUT Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.
SQOOP Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop. It allows users to specify the target location inside of Hadoop and instructs Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
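A typical Sqoop invocation is a one-line shell command; the sketch below wraps one in Python for readability. The JDBC URL, credentials, table, and HDFS directory are all hypothetical placeholders.

import subprocess

# Pull the ORDERS table out of a relational database and land it in HDFS,
# splitting the copy across four parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",  # hypothetical source
    "--username", "etl_user", "-P",                       # -P prompts for the password
    "--table", "ORDERS",
    "--target-dir", "/data/orders",                       # destination in HDFS
    "--num-mappers", "4",
], check=True)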
HCATALOG HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.
Hadoop: The Pros and Cons
The main benefit of Hadoop is that it allows enterprises to process and analyze large volumes of unstructured and semistructured data, heretofore inaccessible to them, in a cost- and time-effective manner. Because Hadoop clusters can scale to petabytes and even exabytes of data, enterprises no longer must rely on sample data sets but can process and analyze all relevant data. Data scientists can apply an iterative approach to analysis, continually refining and testing queries to uncover previously unknown insights. It is also inexpensive to get started with Hadoop. Developers can download the Apache Hadoop distribution for free and begin experimenting with Hadoop in less than a day.
The downside to Hadoop and its myriad components is that they are immature and still developing. As with any young, raw technology, implementing and managing
Hadoop clusters and performing advanced analytics on large volumes of unstructured data require significant expertise, skill, and training. Unfortunately, there is currently a dearth of Hadoop developers and data scientists available, making it impractical for many enterprises to maintain and take advantage of complex Hadoop clusters. Further, as Hadoop’s myriad components are improved upon by the community and new components are created, there is, as with any immature open source technology/approach, a risk of forking. Finally, Hadoop is a batch-oriented framework, meaning it does not support real-time data processing and analysis.
The good news is that some of the brightest minds in IT are contributing to the Apache Hadoop project, and a new generation of Hadoop developers and data scientists is coming of age. As a result, the technology is advancing rapidly, becoming both more powerful and easier to implement and manage. An ecosystem of vendors, both Hadoop-focused start-ups like Cloudera and Hortonworks and well-worn IT stalwarts like IBM and Microsoft, is working to offer commercial, enterprise-ready Hadoop distributions, tools, and services to make deploying and managing the technology a practical reality for the traditional enterprise. Other bleeding-edge start-ups are working to perfect NoSQL (Not Only SQL) data stores capable of delivering near-real-time insights in conjunction with Hadoop. Technology Insights 13.2 provides a few facts to clarify some misconceptions about Hadoop.
TECHNOLOGY INSIGHTS 13.2 A Few Demystifying Facts About Hadoop
Although Hadoop and related technologies have been around for more than 5 years now, most people still have several misconceptions about Hadoop and related technologies such as MapReduce and Hive. The following list of 10 facts intends to clarify what Hadoop is and does relative to BI, as well as in which business and technology situations Hadoop-based BI, data warehousing, and analytics can be useful (Russom, 2013).
Fact #1. Hadoop consists of multiple products. We talk about Hadoop as if it’s one monolithic thing, whereas it’s actually a family of open source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.)
The Apache Hadoop library includes (in BI priority order) the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, Zookeeper, Flume, Sqoop, Oozie, Hue, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack for applications in BI, DW, and analytics.
Fact #2. Hadoop is open source but available from vendors, too. Apache Hadoop’s open source software library is available from ASF at apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools and technical support.
Fact #3. Hadoop is an ecosystem, not a single product. In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
Fact #4. HDFS is a file system, not a database management system (DBMS). Hadoop is primarily a distributed file system and lacks capabilities we’d associate with a DBMS, such as indexing, random access to data, and support for SQL. That’s okay, because HDFS does things DBMSs cannot do.
Fact #5. Hive resembles SQL but is not standard SQL. Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI feels that, over time, Hadoop products will support standard SQL, so this issue will soon be moot.
Fact #6. Hadoop and MapReduce are related but don’t require each other. Developers at Google developed MapReduce before HDFS existed, and some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some DBMSs.
Fact #7. MapReduce provides control for analytics, not analytics per se. MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault tolerance for any kind of application that you can hand code, not just analytics.
Fact #8. Hadoop is about data diversity, not just data volume. Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS.
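Fact #8’s “put the data in a file and copy that file into HDFS” is literally how ingestion works from the command line; hdfs dfs is the standard HDFS shell. The sketch below wraps two such calls in Python, with hypothetical paths.

import subprocess

# Copy any local file, whatever its format, into the distributed file system.
subprocess.run(["hdfs", "dfs", "-put",
                "clickstream-2013-01.log", "/raw/clickstream/"], check=True)

# Confirm the copy by listing the target directory.
subprocess.run(["hdfs", "dfs", "-ls", "/raw/clickstream/"], check=True)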
Fact #9. Hadoop complements a DW; it’s rarely a replacement. Most organizations have designed their DW for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs can’t.
Fact #10. Hadoop enables many types of analytics, not just Web analytics. Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data, but other use cases exist. For example, consider the Big Data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples, such as customer-base segmentation, fraud detection, and risk analysis, can benefit from the additional Big Data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view.
NoSQL
A related new style of database called NoSQL (Not Only SQL) has emerged to, like Hadoop, process large volumes of multi-structured data. However, whereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (though there are some important exceptions), at serving up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications. This capability is sorely lacking from relational database technology, which simply can’t maintain needed application performance levels at Big Data scale.
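The “serving up discrete data” access pattern looks like this from a client. The sketch below uses the open source Python driver for Apache Cassandra (one of the NoSQL stores listed later in this section); the contact point, keyspace, and page_likes table are hypothetical.

from cassandra.cluster import Cluster  # pip package: cassandra-driver

cluster = Cluster(["127.0.0.1"])       # hypothetical contact point
session = cluster.connect("demo")      # hypothetical keyspace

# Write one discrete record; Cassandra spreads rows across the cluster by key.
session.execute(
    "INSERT INTO page_likes (page_id, user_id) VALUES (%s, %s)",
    ("p123", "u456"),
)

# Read it back by key: a low-latency point lookup rather than a batch scan.
rows = session.execute(
    "SELECT user_id FROM page_likes WHERE page_id = %s", ("p123",)
)
print([row.user_id for row in rows])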
In some cases, NoSQL and Hadoop work in conjunction. The aforementioned HBase, for example, is a popular NoSQL database modeled after Google BigTable that is often deployed on top of HDFS, the Hadoop Distributed File System, to provide low-latency, quick lookups in Hadoop. The downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools. Both of these shortcomings are in the process of being overcome by the open source NoSQL communities and a handful of vendors that are attempting to commercialize the various NoSQL databases. NoSQL databases currently available include HBase, Cassandra, MongoDB, Accumulo, Riak, CouchDB, and DynamoDB, among others. Application Case 13.3 shows the use of NoSQL databases at eBay.
Application Case 13.3
eBay’s Big Data Solution
eBay is the world’s largest online marketplace, enabling the buying and selling of practically anything. Founded in 1995, eBay connects a diverse and passionate community of individual buyers and sellers, as well as small businesses. eBay’s collective impact on e-commerce is staggering: in 2012, the total value of goods sold on eBay was $75.4 billion. eBay currently serves over 112 million active users and has 400+ million items for sale.
The Challenge: Supporting Data at
Extreme Scale
One of the keys to eBay’s extraordinary success is its ability to turn the enormous volumes of data it generates into useful insights that its customers can glean directly from the pages they frequent. To accommodate eBay’s explosive data growth (its data centers perform billions of reads and writes each day) and the increasing demand to process data at blistering speeds, eBay needed a solution that did not have the typical bottlenecks, scalability issues, and transactional constraints associated with common relational database approaches. The company also needed to perform rapid analysis on a broad assortment of the structured and unstructured data it captured.
The Solution: Integrated Real-Time Data
and Analytics
Its Big Data requirements brought eBay to NoSQL technologies, specifically Apache Cassandra and DataStax Enterprise. Along with Cassandra and its high-velocity data capabilities, eBay was also drawn to the integrated Apache Hadoop analytics that comes with DataStax Enterprise. The solution incorporates a scale-out architecture that enables eBay to deploy multiple DataStax Enterprise clusters across several different data centers using commodity hardware. The end result is that eBay is now able to process massive amounts of data at very high speeds far more cost effectively than it could with the higher-cost proprietary system it had been using. Currently, eBay is managing a sizable portion of its
data center needs (250+ TB of storage) in Apache Cassandra and DataStax Enterprise clusters.
Additional technical factors that played a role in eBay’s decision to deploy DataStax Enterprise so widely include the solution’s linear scalability, high availability with no single point of failure, and outstanding write performance.
Handling Diverse Use Cases
eBay employs DataStax Enterprise for many different use cases. The following examples illustrate some of the ways the company is able to meet its Big Data needs with the extremely fast data handling and analytics capabilities the solution provides. Naturally, eBay experiences huge amounts of write traffic, which the Cassandra implementation in DataStax Enterprise handles more efficiently than any other RDBMS or NoSQL solution. eBay currently sees 6 billion+ writes per day across multiple Cassandra clusters and 5 billion+ reads (mostly offline) per day as well.
One use case supported by DataStax Enterprise involves quantifying the social data eBay displays on its product pages. The Cassandra distribution in DataStax Enterprise stores all the information needed to provide counts for “like,” “own,” and “want” data on eBay product pages. It also provides the same data for the eBay “Your Favorites” page that contains all the items a user likes, owns, or wants, with Cassandra serving up the entire “Your Favorites” page. eBay provides this data through Cassandra’s scalable counters feature.
Load balancing and application availability are important aspects of this particular use case. The DataStax Enterprise solution gave eBay architects the flexibility they needed to design a system that enables any user request to go to any data center, with a single DataStax Enterprise cluster spanning those centers. This design feature helps balance the incoming user load and eliminates any possible threat of application downtime. In addition to the line-of-business data powering the Web pages its customers visit, eBay is also able to perform high-speed analysis with the ability to maintain a separate data center running
Hadoop nodes of the same DataStax Enterprise ring
(see Figure 13.5).
Another use case involves the Hunch (an eBay sister company) “taste graph” for eBay users and items, which provides custom recommendations based on user interests. eBay’s Web site is essentially a graph between all users and the items for sale. All events (bid, buy, sell, and list) are captured by eBay’s systems and stored as a graph in Cassandra. The application sees more than 200 million writes daily and holds more than 40 billion pieces of data.
eBay also uses DataStax Enterprise for many time-series use cases in which processing high-volume, real-time data is a foremost priority. These include mobile notification logging and tracking (every time eBay sends a notification to a mobile phone or device, it is logged in Cassandra), fraud detection, SOA request/response payload logging, and RedLaser (another eBay sister company) server logs and analytics.
Across all of these use cases is the common requirement of uptime. eBay is acutely aware of the need to keep its business up and running, and DataStax Enterprise plays a key part in that through its support of high-availability clusters. “We have to be ready for disaster recovery all the time. It’s really great that Cassandra allows for active-active multiple data centers where we can read and write data anywhere, anytime,” says eBay architect Jay Patel.
QUESTIONS FOR DISCUSSION
1. Why is Big Data a big deal for eBay?
2. What were the challenges, the proposed solution, and the obtained results?
3. Can you think of other e-commerce businesses that may have Big Data challenges comparable to those of eBay?
Source: DataStax, Customer Case Studies, datastax.com/resources/casestudies/eBay (accessed January 2013).
[Figure: a single Cassandra ring (topology NTS, replication factor 2:2:2) spanning Data Center 1, Data Center 2, and Data Center 3, with analytics nodes running DSE Hadoop for near-real-time analytics.]
FIGURE 13.5 eBay’s Multi-Data-Center Deployment. Source: DataStax.
SECTION 13.4 REVIEW QUESTIONS
1. What are the common characteristics of emerging Big Data technologies?
2. What is MapReduce? What does it do? How does it do it?
3. What is Hadoop? How does it work?
4. What are the main Hadoop components? What functions do they perform?
5. What is NoSQL? How does it fit into the Big Data analytics picture?
13.5 DATA SCIENTIST
Data scientist is a role or a job frequently associated with Big Data or data science. In a very short time it has become one of the most sought-after roles in the marketplace. In a recent article published in the October 2012 issue of the Harvard Business Review, authors Thomas H. Davenport and D. J. Patil called data scientist “The Sexiest Job of the 21st Century.” In that article they specified data scientists’ most basic, universal skill as the ability to write code (in the latest Big Data languages and platforms). Although this may be less true in the near future, when many more people will have the title “data scientist” on their business cards, at this time it seems to be the most fundamental skill required of data scientists. A more enduring skill will be the need for data scientists to communicate in a language that all their stakeholders understand and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or, ideally, both (Davenport and Patil, 2012).
Data scientists use a combination of their business and technical skills to investigate
Big Data looking for ways to improve current business analytics practices (from descrip-
tive to predictive and prescriptive) and hence to improve decisions for new business
opportunities. One of the biggest differences between a data scientist and a business intel-
ligence user-such as a business analyst-is that a data scientist investigates and looks
for new possibilities, while a BI user analyzes existing business situations and operations.
One of the dominant traits expected of data scientists is an intense curiosity: a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested. This often entails the associative thinking that characterizes the most creative scientists in any field. For example, we know of a data scientist studying a fraud problem who realized that it was analogous to a type of DNA sequencing problem (Davenport and Patil, 2012). By bringing together those disparate worlds, he and his team were able to craft a solution that dramatically reduced fraud losses.
Where Do Data Scientists Come From?
Although there still is disagreement about the use of “science” in the name, it is becoming
less of a controversial issue. Real scientists use tools made by other scientists, or make
them if they don’t exist, as a means to expand knowledge. That is exactly what data
scientists are expected to do. Experimental physicists, for example, have to design equip-
ment, gather data, and conduct multiple experiments to discover knowledge and com-
municate their results. Even though they may not be wearing white coats, and may not
be living in a sterile lab environment, that is exactly what data scientists do: use creative
tools and techniques to turn data into actionable information for others to use for better
decision making.
There is no consensus on what educational background a data scientist has to have.
The usual suspects like Master of Science (or Ph.D.) in Computer Science, MIS, Industrial
Engineering, or the newly popularized postgraduate analytics degrees may be necessary
but not sufficient to call someone a data scientist. One of the most sought-after characteristics of a data scientist is expertise in both technical and business application domains. In that sense, the role somewhat resembles the professional engineer (PE) or project management professional (PMP) roles, where experience is valued as much as (if not more than) the technical skills and educational background. It would not be a huge surprise to see
within the next few years a certification specifically designed for data scientists (perhaps
called “Data Science Professional” or “DSP,” for short).
Because it is a profession in a field that is still being defined, many of its practices are still experimental and far from being standardized; as a result, companies are overly sensitive about the experience dimension of data scientists. As the profession matures, and
practices are standardized, experience will be less of an issue when defining a data scientist. Nowadays, companies looking for people who have extensive experience in working with complex data have had good luck recruiting among those with educational and work backgrounds in the physical or social sciences. Some of the best and brightest data scientists have been Ph.D.s in esoteric fields like ecology and systems biology (Davenport and Patil, 2012). Even though there is no consensus on where data scientists come from, there is a common understanding of what skills and qualities they are expected to possess. Figure 13.6 shows a high-level graphical illustration of these skills.
[Figure: hexagons depicting the skills that define a data scientist: communication and interpersonal; curiosity and creativity; domain expertise, problem definition, and decision modeling; data access and management (both traditional and new data systems); programming, scripting, and hacking; and Internet and social media/social networking technologies.]
FIGURE 13.6 Skills That Define a Data Scientist.
Data scientists are expected to have soft skills such as creativity, curiosity, communication/interpersonal skills, domain expertise, problem definition, and managerial skills (shown with light background hexagons on the left side of the figure) as well as sound technical skills such as data manipulation, programming/hacking/scripting, and Internet and social media/networking technologies (shown with darker background hexagons on the right side of the figure). Technology Insights 13.3 presents a typical job advertisement for a data scientist.
TECHNOLOGY INSIGHTS 13.3 A Typical Job Post for Data Scientists
[Some company] is seeking a Data Scientist to join our Big Data Analytics team. Individuals in this role are expected to be comfortable working as a software engineer and a quantitative researcher. The ideal candidate will have a keen interest in the study of an online social network and a passion for identifying and answering questions that help us build the best products.
Responsibilities
• Work closely with a product engineering team to identify and answer important product questions
• Answer product questions by using appropriate statistical techniques on available data
• Communicate findings to product managers and engineers
• Drive the collection of new data and the refinement of existing data sources
• Analyze and interpret the results of product experiments
• Develop best practices for instrumentation and experimentation and communicate those to product engineering teams
Requirements
• M.S. or Ph.D. in a relevant technical field, or 4+ years of experience in a relevant role
• Extensive experience solving analytical problems using quantitative approaches
• Comfort with manipulating and analyzing complex, high-volume, high-dimensionality data from varying sources
• A strong passion for empirical research and for answering hard questions with data
• A flexible analytic approach that allows for results at varying levels of precision
• Ability to communicate complex quantitative analysis in a clear, precise, and actionable manner
• Fluency with at least one scripting language such as Python or PHP
• Familiarity with relational databases and SQL
• Expert knowledge of an analysis tool such as R, Matlab, or SAS
• Experience working with large data sets; experience working with distributed computing tools a plus (Map/Reduce, Hadoop, Hive, etc.)
People with this range of skills are rare, which explains why data scientists are in short supply. Because of the high demand for these relatively few individuals, starting salaries for data scientists are well above six figures, and for those with ample experience and specific domain expertise, salaries are approaching seven figures. For most organizations, rather than looking for individuals with all of these capabilities, it will be necessary instead to build a team of people that collectively have these skills. Here are some recent anecdotes about data scientists:
• Data scientists turn Big Data into big value, delivering products that delight users and insight that informs business decisions.
• A data scientist is not only proficient at working with data, but also appreciates data itself as an invaluable asset.
• By 2020 there will be 4.5 million new data scientist jobs, of which only one-third will be filled because of the lack of available personnel.
• Today’s data scientists are the quants of the financial markets of the 1980s.
Data scientists are not limited to high-tech Internet companies. Many companies that do not have much of an Internet presence are also interested in highly qualified Big Data analytics professionals. For instance, as described in the End-of-Chapter Application Case, Volvo is leveraging data scientists to turn data that comes from its corporate transaction databases and from sensors (placed in its cars) into actionable insight. An interesting area where we have seen the use of data scientists in the recent past is politics. Application Case 13.4 describes the use of Big Data analytics in the world of politics and presidential elections.
Application Case 13.4
Big Data and Analytics in Politics
One of the application areas where Big Data and analytics promise to make a big difference is arguably the field of politics. Experiences from the recent presidential elections illustrated the power of Big Data and analytics to acquire and energize millions of volunteers (in the form of a modern-era grassroots movement), not only to raise hundreds of millions of dollars for the election campaign but also to optimally organize and mobilize potential voters to get out and vote in large numbers. Clearly, the 2008 and 2012 presidential elections made a mark on the political arena with the creative use of Big Data and analytics to improve the chances of winning. Figure 13.7 illustrates a graphical depiction of the analytical process of converting a wide variety of data into the ingredients for winning an election.
[Figure 13.7 depicts a three-column flow. Input data sources: census data (population specifics such as age, race, sex, and income), election databases (party affiliations, previous election outcomes, trends, and distributions), market research (polls, recent trends, and movements), social media (Facebook, Twitter, LinkedIn, newsgroups, blogs), the Web in general (Web pages, posts and replies, search trends), and other data sources. Big Data and analytics (data mining, Web mining, text mining, multimedia mining): predicting outcomes and trends, identifying associations between events and outcomes, assessing and measuring sentiments, profiling (clustering) groups with similar behavioral patterns, and other knowledge nuggets. Output goals: raise money contributions, increase the number of volunteers, organize movements, mobilize voters to get out and vote, and other goals and objectives.]
FIGURE 13.7 Leveraging Big Data and Analytics for Political Campaigns.
As Figure 13.7 illustrates, data is the source of
information; the richer and deeper it is, the better
and more relevant the insights. The main charac-
teristics of Big Data, namely volume, variety, and
velocity (the three Vs), readily apply to the kind of
data that is used for political campaigns. In addi-
tion to the structured data (e.g., detailed records of
previous campaigns, census data, market research,
and poll data), vast volumes and a variety of social media (e.g., tweets at Twitter, Facebook wall posts, blog posts) and Web data (Web pages, news articles, newsgroups) are used to learn more about voters and obtain deeper insights to enforce or change
their opinions. Often, the search and browsing his-
tories of individuals are captured and made avail-
able to customers (political analysts) who can use
such data for better insight and behavioral target-
ing. If done correctly, Big Data and analytics can
provide invaluable information to manage political
campaigns better than ever before.
From predicting election outcomes to targeting
potential voters and donors, Big Data and analytics
have a lot to offer to modern-day election campaigns. In fact, they have changed the way presidential election campaigns are run. In the 2008 and 2012 presidential elections, the major political par-
ties (Republican and Democratic) employed social
media and data-driven analytics for a more effec-
tive and efficient campaign, but as many agree, the
Democrats clearly had the competitive advantage
(Issenberg, 2012). Obama’s 2012 data and analytics-
driven operation was far more sophisticated and
more efficient than its much-heralded 2008 process,
which was primarily social media driven. In the 2012
campaign, hundreds of analysts applied advanced
analytics on very large and diverse data sources to
pinpoint exactly who to target, for what reason, with
what message, on a continuous basis. Compared
to 2008, they had more expertise, hardware, software, data (e.g., Facebook and Twitter were orders of magnitude bigger in 2012 than they had been in 2008), and computational resources to go over and beyond what they had accomplished previously (Shen, 2013). In June 2012, months before the election, a Politico reporter claimed that Obama had a data advantage and went on to say that the depth and breadth of the campaign's digital operation, from political and demographic data mining to voter sentiment and behavioral analysis, reached beyond anything politics had ever seen (Romano, 2012).
According to Shen, the real winner of the 2012
elections was analytics (Shen, 2013). While most
people, including the so-called political experts
(who often rely on gut feelings and experiences), thought the 2012 presidential election would be very close, a number of analysts, based on their data-driven analytical models, predicted that Obama would win easily, with close to 99 percent certainty. For example, Nate Silver at FiveThirtyEight, a popular political blog published by The New York Times, predicted not only that Obama would win but also by exactly how much he would win.
Simon Jackman, professor of political science
at Stanford University, accurately predicted that
Obama would win 332 electoral votes and that
North Carolina and Indiana, the only two states that Obama won in 2008, would fall to Romney.
In short, Big Data and analytics have become a critical part of political campaigns. The usage and expertise gap between the party lines may disappear, but the importance of analytical capabilities will continue to evolve for the foreseeable future.
QUESTIONS FOR DISCUSSION
1. What is the role of analytics and Big Data in
modern-day politics?
2. Do you think Big Data Analytics could change
the outcome of an election?
3. What do you think are the challenges, the potential solutions, and the probable results of the use of Big Data analytics in politics?
Sources: Compiled from G. Shen, "Big Data, Analytics and Elections," INFORMS' Analytics Magazine, January-February 2013; L. Romano, "Obama's Data Advantage," Politico, June 9, 2012; M. Scherer, "Inside the Secret World of the Data Crunchers Who Helped Obama Win," Time, November 7, 2012; S. Issenberg, "Obama Does It Better" (from "Victory Lab: The New Science of Winning Campaigns"), Slate, October 29, 2012; and D. A. Samuelson, "Analytics: Key to Obama's Victory," INFORMS' ORMS Today, February 2013, pp. 20-24.
SECTION 13.5 REVIEW QUESTIONS
1. Who is a data scientist? What makes them so much in demand?
2. What are the common characteristics of data scientists? Which one is the most
important?
3. Where do data scientists come from? What educational backgrounds do they have?
4. What do you think is the path to becoming a great data scientist?
13.6 BIG DATA AND DATA WAREHOUSING
There is no doubt that the emergence of Big Data has changed, and will continue to change, data warehousing in a significant way. Until recently, enterprise data warehouses were the centerpiece of all decision support technologies. Now, they have to share the spotlight with the newcomer, Big Data. The question that is popping up everywhere is whether Big Data and its enabling technologies, such as Hadoop, will replace data warehousing and its core technology, relational database management systems (RDBMS). Are we witnessing a data warehouse versus Big Data challenge (or, from the technology standpoint, Hadoop versus RDBMS)? In this section we explain why these questions have no basis, and show that such an either-or choice does not reflect the reality at this point in time.
In the last decade or so, we have seen a significant improvement in the area of computer-based decision support systems, which can largely be credited to data warehousing and to technological advancements in both software and hardware to capture, store, and analyze data. As the size of the data increased, so did the capabilities of data warehouses. Some of these data warehousing advances included massively parallel processing (moving from one or a few processors to many parallel processors), storage area networks (easily scalable storage solutions), solid-state storage, in-database processing, in-memory processing, and columnar (column-oriented) databases, to name just a few. These advancements helped keep the increasing size of data under control while effectively serving the analytics needs of decision makers. What has changed the landscape in recent years is the variety and complexity of data, which made data warehouses incapable of keeping up. It is not the volume of structured data but the variety and the velocity of data that forced the world of IT to develop a new paradigm, which we now call "Big Data." Now that we have these two paradigms, data warehousing and Big Data, seemingly competing for the same job, turning data into actionable information, which one will prevail? Is this a fair question to ask? Or are we missing the big picture? In this section, we try to shed some light on this intriguing question.
As has been the case for many previous technology innovations, hype about Big Data and its enabling technologies like Hadoop and MapReduce is rampant. Both nonpractitioners and practitioners are overwhelmed by diverse opinions. According to Awadallah and Graham (2012), people are missing the point in claiming that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate, since both Hadoop and data warehouse systems can run in parallel, scale up to enormous data volumes, and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable. The reality is that they are not, and the differences between the two overwhelm the similarities. If they are not interchangeable, then how do we decide when to deploy Hadoop and when to use a data warehouse?
Use Case(s) for Hadoop
As we have covered earlier in this chapter, Hadoop is the result of new developments in computer and storage grid technologies. Using commodity hardware as a foundation, Hadoop provides a layer of software that spans the entire grid, turning it into a single system. Consequently, some major differentiators are obvious in this architecture:
• Hadoop is the repository and refinery for raw data.
• Hadoop is a powerful, economical, and active archive.
Thus, Hadoop sits at both ends of the large-scale data life cycle: first when raw data is born, and finally when data is retiring but is still occasionally needed.
1. Hadoop as the repository and refinery. As volumes of Big Data arrive from sources such as sensors, machines, social media, and clickstream interactions, the first step is to capture all the data reliably and cost-effectively. When data volumes are huge, the traditional single-server strategy does not work for long. Pouring the data into the Hadoop Distributed File System (HDFS) gives architects much-needed flexibility. Not only can they capture hundreds of terabytes in a day, but they can also adjust the Hadoop configuration up or down to meet surges and lulls in data ingestion. This is accomplished at the lowest possible cost per gigabyte, thanks to open source economics and commodity hardware.
Since the data is stored on local storage instead of SANs, Hadoop data access is often much faster, and it does not clog the network with terabytes of data movement. Once the raw data is captured, Hadoop is used to refine it. Hadoop can act as a parallel "ETL engine on steroids," leveraging handwritten or commercial data transformation technologies. Many of these raw data transformations require the unraveling of complex free-form data into structured formats. This is particularly true with clickstreams (or Web logs) and complex sensor data formats. Consequently, a programmer needs to tease the wheat from the chaff, identifying the valuable signal in the noise (a minimal sketch of this refinement step appears after this list).
2. Hadoop as the active archive. In a 2003 interview with ACM, Jim Gray claimed that hard disks can be treated as tape. While it may take many more years for magnetic tape archives to be retired, today some portions of tape workloads are already being redirected to Hadoop clusters. This shift is occurring for two fundamental reasons. First, while it may appear inexpensive to store data on tape, the true cost comes with the difficulty of retrieval. Not only is the data stored offline, requiring hours if not days to restore, but tape cartridges themselves are also prone to degradation over time, making data loss a reality and forcing companies to factor in those costs. To make matters worse, tape formats change every couple of years, requiring organizations to either perform massive data migrations to the newest tape format or risk the inability to restore data from obsolete tapes.
Second, it has been shown that there is value in keeping historical data online and accessible. As in the clickstream example, keeping raw data on a spinning disk for a longer duration makes it easy for companies to revisit data when the context changes and new constraints need to be applied. Searching thousands of disks with Hadoop is dramatically faster and easier than spinning through hundreds of magnetic tapes. Additionally, as disk densities continue to double every 18 months, it becomes economically feasible for organizations to hold many years' worth of raw or refined data in HDFS. Thus, the Hadoop storage grid is useful in both the preprocessing of raw data and the long-term storage of data. It's a true "active archive" since it not only stores and protects the data, but also enables users to quickly, easily, and perpetually derive value from it.
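To make the refinery idea concrete, here is a minimal Python sketch in the style of a Hadoop Streaming mapper and reducer that distills raw clickstream lines into structured page-view counts. The tab-separated log format and field names are illustrative assumptions, not any real system's schema; in a real cluster the mapper and reducer would run as separate processes over HDFS blocks.

import sys
from collections import defaultdict

def map_clickstream(line):
    """Parse one raw Web-log line into (url, 1) pairs.
    Assumed log format: 'timestamp<TAB>user_id<TAB>url' (illustrative only)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:          # skip malformed records (the "chaff")
        return
    _, _, url = parts
    yield url, 1

def reduce_counts(pairs):
    """Aggregate mapper output into page-view counts, as a reducer would."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

if __name__ == "__main__":
    sample = [
        "2013-01-01T10:00:00\tu1\t/home",
        "2013-01-01T10:00:05\tu1\t/product/42",
        "corrupted line",                     # noise to be filtered out
        "2013-01-01T10:00:09\tu2\t/home",
    ]
    pairs = (pair for line in sample for pair in map_clickstream(line))
    print(reduce_counts(pairs))   # {'/home': 2, '/product/42': 1}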
Use Case(s) for Data Warehousing
After nearly 30 years of investment, refinement, and growth, the list of features available
in a data warehouse is quite staggering. Built upon relational database technology using
schemas and integrating business intelligence (BI) tools, the major differences in this
architecture are:
• Data warehouse performance
• Integrated data that provides business value
• Interactive BI tools for end users
1. Data warehouse performance. Basic indexing, found in open source databases such as MySQL or Postgres, is a standard feature used to improve query response times or to enforce constraints on data. More advanced forms, such as materialized views, aggregate join indexes, cube indexes, and sparse join indexes, enable numerous performance gains in data warehouses. However, the most important performance enhancement to date is the cost-based optimizer. The optimizer examines incoming SQL and considers multiple plans for executing each query as fast as possible. It achieves this by comparing the SQL request to the database design and to extensive data statistics that help identify the best combination of execution steps. In essence, the optimizer is like having a genius programmer examine every query and tune it for the best performance (see the sketch after this list). Lacking an optimizer or data demographic statistics, a query that could run in minutes may take hours, even with many indexes.
For this reason, database vendors are constantly adding new index types, partitioning, statistics, and optimizer features. For the past 30 years, every software release has been a performance release.
2. Integrated data that provides business value. At the heart of any data warehouse is the promise to answer essential business questions. Integrated data is the unique foundation required to achieve this goal. Pulling data from multiple subject areas and numerous applications into one repository is the raison d'etre for data warehouses. Data model designers and ETL architects, armed with metadata, data-cleansing tools, and patience, must rationalize data formats, source systems, and the semantic meaning of the data to make it understandable and trustworthy. This creates a common vocabulary within the corporation so that critical concepts such as "customer," "end of month," or "price elasticity" are uniformly measured and understood. Nowhere else in the entire IT data center is data collected, cleaned, and integrated as it is in the data warehouse.
3. Interactive BI tools. BI tools such as MicroStrategy, Tableau, IBM Cognos, and others provide business users with direct access to data warehouse insights. First, business users can create reports and complex analyses quickly and easily using these tools. As a result, there is a trend in many data warehouse sites toward end-user self-service. Business users can easily demand more reports than IT has staffing to provide. More important than self-service, however, is that the users become intimately familiar with the data. They can run a report, discover they missed a metric or filter, make an adjustment, and run their report again, all within minutes. This process results in significant changes in business users' understanding of the business and in their decision-making process. First, users stop asking trivial questions and start asking more complex, strategic questions. Generally, the more complex and strategic the report, the more revenue and cost savings the user captures. This leads to some users becoming "power users" in a company. These individuals become wizards at teasing business value from the data and supplying valuable strategic information to the executive staff. Every data warehouse has anywhere from two to 20 power users.
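The cost-based optimizer described under item 1 can be observed in miniature with SQLite, which ships with Python. SQLite's planner is far simpler than a commercial warehouse optimizer, so treat this only as an illustrative sketch of how an engine switches plans once an index and statistics exist; the table and data are invented for the example.

import sqlite3

# In-memory database with a small table; the engine compares access paths
# and picks the cheapest plan it knows about.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# Without an index, the planner must scan the whole table.
for row in conn.execute("EXPLAIN QUERY PLAN "
                        "SELECT SUM(amount) FROM sales WHERE region = 'east'"):
    print(row)   # plan reports a full scan of sales

conn.execute("CREATE INDEX idx_region ON sales (region)")
conn.execute("ANALYZE")  # gather statistics the planner can use

# With the index and statistics, the planner switches to an index search.
for row in conn.execute("EXPLAIN QUERY PLAN "
                        "SELECT SUM(amount) FROM sales WHERE region = 'east'"):
    print(row)   # plan reports a search using idx_region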
The Gray Areas (Either One Would Do the Job)
Even though there are several areas that differentiate one from the other, there are also gray areas where the data warehouse and Hadoop cannot be clearly discerned. In these areas, either tool could be the right solution, doing an equally good (or equally not-so-good) job on the task at hand. Choosing one over the other depends on the requirements and the preferences of the organization. In many cases, Hadoop and the data warehouse work together in an information supply chain, and just as often, one tool is better for a specific workload (Awadallah and Graham, 2012). Table 13.1 illustrates the preferred platform (one versus the other, or equally likely) under a number of commonly observed requirements.

TABLE 13.1 When to Use Which Platform: Hadoop Versus DW

Requirement                                                 Data Warehouse   Hadoop
Low latency, interactive reports, and OLAP                        ●
ANSI 2003 SQL compliance is required                              ●             ●
Preprocessing or exploration of raw unstructured data                           ●
Online archives alternative to tape                                             ●
High-quality cleansed and consistent data                         ●             ●
100s to 1,000s of concurrent users                                ●             ●
Discover unknown relationships in the data                                      ●
Parallel complex process logic                                    ●             ●
CPU intense analysis                                                            ●
System, users, and data governance                                ●
Many flexible programming languages running in parallel                         ●
Unrestricted, ungoverned sandbox explorations                                   ●
Analysis of provisional data                                                    ●
Extensive security and regulatory compliance                      ●
Coexistence of Hadoop and Data Warehouse
There are several possible scenarios under which using a combination of Hadoop and relational DBMS-based data warehousing technologies makes more sense. Here are some of those scenarios (White, 2012):
1. Use Hadoop for storing and archiving multi-structured data. A connector to a relational DBMS can then be used to extract required data from Hadoop for analysis by the relational DBMS. If the relational DBMS supports MapReduce functions, these functions can be used to do the extraction. The Aster-Hadoop adaptor,
for example, uses SQL-MapReduce functions to provide fast, two-way data loading between HDFS and the Aster Database. Data loaded into the Aster Database can then be analyzed using both SQL and MapReduce.
2. Use Hadoop for filtering, transforming, and/or consolidating multi-structured data. A connector such as the Aster-Hadoop adaptor can be used to extract the results of Hadoop processing to the relational DBMS for analysis.
3. Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results to the traditional data warehousing environment, a shared workgroup data store, or a common user interface.
4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform. Data scientists can employ the relational DBMS (the Aster Database system, for example) to analyze a combination of structured data and multi-structured data (loaded from Hadoop) using a mixture of SQL processing and MapReduce analytic functions.
5. Use a front-end query tool to access and analyze data that is stored in both Hadoop and the relational DBMS.
These scenarios support an environment where the Hadoop and relational DBMS systems are separate from each other and connectivity software is used to exchange data between the two systems (see Figure 13.8). The direction of the industry over the next few years will likely be toward more tightly coupled Hadoop and relational DBMS-based data warehouse technologies, in software as well as hardware. Such integration provides many benefits, including eliminating the need to install and maintain multiple systems, reducing data movement, providing a single metadata store for application development, and providing a single interface for both business users and analytical tools.
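As an illustration of scenarios 1 and 2 above, the following hedged PySpark sketch consolidates Hadoop-refined data and publishes the much smaller result to a relational DBMS over JDBC. The HDFS path, JDBC URL, table name, and credentials are placeholders, and the connector choice (generic JDBC rather than the Aster-Hadoop adaptor mentioned above) is an assumption made to keep the example self-contained; running it requires a Spark installation and the matching JDBC driver on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hadoop-to-rdbms").getOrCreate()

# Read multi-structured data that Hadoop has already refined (path assumed).
clicks = spark.read.json("hdfs:///refined/clickstream/2013-01-01")

# Consolidate in Hadoop, where cheap parallel horsepower lives...
daily = (clicks.groupBy("url")
               .count()
               .withColumnRenamed("count", "page_views"))

# ...then publish the much smaller result set to the warehouse for SQL analysis.
(daily.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://warehouse.example.com/dw")  # placeholder
      .option("dbtable", "staging.daily_page_views")                # placeholder
      .option("user", "etl")
      .option("password", "secret")
      .mode("append")
      .save())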
FIGURE 13.8 Coexistence of Hadoop and Data Warehouses. Source: Teradata. The figure shows raw data streams and operational systems feeding, through extract, transform, and load processes, both Hadoop and the data warehouse, which in turn serve developer environments and business intelligence tools.
SECTION 13.6 REVIEW QUESTIONS
1. What are the challenges facing data warehousing and Big Data? Are we witnessing the end of the data warehousing era? Why or why not?
2. What are the use cases for Big Data and Hadoop?
3. What are the use cases for data warehousing a nd RDBMS?
4. In what scenarios can Hadoop and RDBMS coexist?
13.7 BIG DATA VENDORS
As a relatively new technology area, the Big Data vendor landscape is developing very rapidly. A number of vendors have developed their own Hadoop distributions, most based on the Apache open source distribution but with various levels of proprietary customization. The clear market leader in terms of distribution seems to be Cloudera (cloudera.com), a Silicon Valley start-up with an all-star lineup of Big Data experts, including Hadoop creator Doug Cutting and former Facebook data scientist Jeff Hammerbacher. In addition to its distribution, Cloudera offers paid enterprise-level training/services and proprietary Hadoop management software. MapR (mapr.com), another Valley start-up, offers its own Hadoop distribution that supplements HDFS with its proprietary NFS for improved performance. EMC Greenplum partnered with MapR to release a partly proprietary Hadoop distribution of its own in May 2011. Hortonworks (hortonworks.com), which was spun out of Yahoo! in summer 2011, released its 100 percent open source Hadoop distribution, called Hortonworks Data Platform, and related support services in November 2011. These are just a few of the many companies (established and start-ups) that are crowding the competitive landscape of tool and service providers for Hadoop technologies.
In the NoSQL world, a number of start-ups are working to deliver commercially supported versions of the various flavors of NoSQL. DataStax, for example, offers a commercial version of Cassandra that includes enterprise support and services, as well as integration with Hadoop and open source enterprise search via Lucene Solr. As mentioned, proprietary data integration vendors, including Informatica, Pervasive Software, and Syncsort, are making inroads into the Big Data market with Hadoop connectors and complementary tools aimed at making it easier for developers to move data around and within Hadoop clusters.
The analytics layer of the Big Data stack is also experiencing significant development. A start-up called Datameer, for example, is developing what it says is an "all-in-one"
business intelligence platform for Hadoop, while data visualization specialist Tableau
Software has added Hadoop and Next Generation Data Warehouse connectivity to its
product suite. EMC Greenplum, meanwhile, has Chorus, a sort of playground for data
scientists where they can mash-up, experiment with, and share large volumes of data for
analysis. Other vendors focus on specific analytic use cases, such as ClickFox with its customer experience analytics engine. A number of traditional business intelligence vendors, most notably MicroStrategy, are working to incorporate Big Data analytic and reporting capabilities into their products.
Less progress has been made in the Big Data application space, however. There are few
off-the-shelf Big Data applications currently on the market. This void leaves enterprises with
the task of developing and building custom Big Data applications with internal or outsourced
teams of application developers. There are exceptions. Namely, a start-up called Treasata
offers Big-Data-as-a-service applications for the financial services vertical market, and Google
makes its internal Big Data analytics application, called BigQuery, available as a service.
Meanwhile, the next-generation data warehouse market has experienced significant consolidation since 2010. Four leading vendors in this space, Netezza, Greenplum, Vertica, and Aster Data, were acquired by IBM, EMC, HP, and Teradata, respectively. Just a handful of niche independent players remain, among them Kognitio and ParAccel. These vendors, by and large, position their products as complementary to Hadoop and NoSQL deployments, providing real-time analytic capabilities on large volumes of structured data.
Mega-vendors Oracle and IBM also play in the Big Data space. IBM's Big Insights platform is based on Apache Hadoop, but includes numerous proprietary modules, including the Netezza database, InfoSphere Warehouse, Cognos business intelligence tools, and SPSS data mining capabilities. It also offers IBM InfoSphere Streams, a platform designed for streaming Big Data analysis. Oracle, meanwhile, has embraced the appliance approach to Big Data with its Exadata, Exalogic, and Big Data appliances. Its Big Data appliance incorporates Cloudera's Hadoop distribution with Oracle's NoSQL database and data integration tools. Application Case 13.5 provides an interesting case in which the Dublin City Council used Big Data analytics to reduce the city's traffic congestion.
Application Case 13.5
Dublin City Council Is Leveraging Big Data to Reduce Traffic Congestion
Employing 6,000 people, Dublin City Council (DCC) delivers housing, water, and transport services to 1.2 million citizens across the Irish capital. To keep the city moving, the council's traffic control center (TCC) works together with local transport operators to manage an extensive network of roads, tramways, and bus lanes. Using operational data from the TCC, the council's roads and traffic department is responsible for predicting Dublin's future transport requirements and developing effective strategies to meet them.
Like local governments in many large European cities, DCC has a wide array of technology at its disposal. Sensors such as inductive-loop traffic detectors, rain gauges, and closed-circuit television (CCTV) cameras collect data from across Dublin, and each of the city's 1,000 buses transmits a GPS update every 20 seconds.
Tackling Traffic Congestion
In the past, only a small proportion of this Big Data was available to controllers at Dublin's TCC, reducing their ability to identify, anticipate, and address the causes of traffic congestion.
As Brendan O'Brien, Head of Technical Services, Roads and Traffic Department at Dublin City Council, explains: "Previously, our TCC systems only offered a narrow window on the overall status of our transport network; for example, controllers could only view the status of individual bus routes. Our legacy systems were also unable to monitor the geospatial location of Dublin's bus fleet, which further complicated the traffic control process." He continues: "Because we couldn't see the 'health' of the whole transport network in real time, it was very difficult to identify traffic congestion in its early stages. This meant that the causes of delays had often moved on by the time our TCC operators were able to select the appropriate CCTV feed, making it hard to determine and mitigate the factors causing congestion."
DCC wanted to ease traffic congestion across Dublin. To achieve this, the council needed to find a way to integrate, process, and visualize large amounts of structured and unstructured data from its network of sensor arrays, all in real time.
Becoming a Smarter City
To help develop a smarter approach to traffic control, DCC entered into a research partnership with IBM Research-Ireland. Francesco Calabrese, Research Manager, Smarter Urban Dynamics at IBM Research, comments: "Smarter Cities are cities with the tools to extract actionable insights from massive amounts of constantly changing data, and deliver those insights instantly to decision-makers. At the IBM Smarter Cities Technology Centre in Dublin, our goal is to develop innovative solutions to enable cities like Dublin to support smarter ways of working, delivering a better quality of life for their citizens."
Today, DCC makes all of its data available to the IBM Smarter Cities Technology Centre in Dublin. Using Big Data analytics technologies, IBM Research is developing new solutions for Smarter Cities and making the deep insights it discovers available to the council's roads and traffic department.
"From our first discussion with the IBM Research team, we realized that our goals were perfectly aligned," says O'Brien. "Using our data, the IBM Smarter Cities Technology Centre can both drive its own research and deliver innovative solutions to help us visualize transport data from sensor arrays across the city."
Analyzing the Transport Network
As a first step, IBM integrated geospatial data from buses and data on bus timetables into a central geographic information system. Using IBM InfoSphere Streams and mapping software, IBM researchers created a digital map of the city, overlaid with the real-time positions of Dublin's 1,000 buses. "In the past, our TCC operators could only see the status of individual bus corridors," says O'Brien. "Now, each TCC operator gets a twin-monitor setup: one displaying a dashboard, and the other a real-time map of all buses across the city.
"Using the dashboard screen, operators can drill down to see the number of buses that are on time or delayed on each route. This information is also displayed visually on the map screen, allowing operators to see the current status of the entire bus network at a glance. Because the interface is so intuitive, our operators can rapidly home in on emerging areas of traffic congestion, and then use CCTV to identify the causes of delays before they move further downstream."
Taking Action to Ease Congestion
By enriching its data with GPS tracking, DCC can produce detailed reports on areas of the network where buses are frequently delayed, and take action to ease congestion. "The IBM Smarter Cities Technology Centre has provided us with a lot of valuable insights," says O'Brien. "For example, the IBM team created trace reports on bus journeys, which showed that at rush hour, some buses were being overtaken by buses that set off later.
"Working with the city's bus operators, we are looking at why the headways are diverging in that way, and what we can do to improve traffic flow at these peak times. Thanks to the work of the IBM team, we can now start answering questions such as: 'Are the bus lane start times correct?' and 'Where do we need to add additional bus lanes and bus-only traffic signals?'"
O'Brien continues: "Over the next two years, we are starting a project team for bus priority measures and road-infrastructure improvements. Without the ability to visualize our transport data, this would not have been possible."
Planning for the Future
Based on the success of the traffic control project for the city's bus fleet, DCC and IBM Research are working together to find ways to further augment traffic control in Dublin. "Our relationship with IBM is quite fluid; we offer them our expertise about how the city operates, and their researchers use that input to extract valuable insights from our Big Data," says O'Brien. "Currently, the IBM team is working on ways to integrate data from rain and flood gauges into the traffic control solution, alerting controllers to potential hazards presented by extreme weather conditions and allowing them to take timely action to reduce the impact on road users."
In addition to meteorological data, IBM is investigating the possibility of incorporating data from the under-road sensor network to better understand the impact of private motor vehicles on traffic congestion.
The IBM team is also developing a predictive analytics solution combining data from the city's tram network with electronic docks for the city's free bicycle scheme. This project aims to optimize the distribution of the city's free bicycles according to anticipated demand, ensuring that citizens can seamlessly continue their journey after stepping off a tram.
"Working with IBM Research has allowed us to take a fresh look at our transport strategy," concludes O'Brien. "Thanks to the continuing work of the IBM team, we can see how our transport network is working as a whole, and develop innovative ways to improve it for Dublin's citizens."
QUESTIONS FOR DISCUSSION
1. Is there a strong case to be made for large cities to use Big Data analytics and related information technologies? Identify and discuss examples of what can be done with analytics beyond what is portrayed in this application case.
2. How can Big Data analytics help ease the traffic problem in large cities?
3. What challenges was Dublin City facing? What were the proposed solution, the initial results, and the future plans?
Source: IBM Customer Story, "Dublin City Council: Leveraging the Leading Edge of IBM Smarter Cities Research to Reduce Traffic Congestion," public.dhe.ibm.com/common/ssi/ecm/en/imc14829ieen/IMC14829IEEN.PDF (accessed October 2013).
The cloud is increasingly playing a role in the Big Data market as well. Amazon and Google support Hadoop deployments in their public cloud offerings, Amazon Elastic MapReduce and Google Compute Engine, respectively, enabling users to easily scale clusters up and down as needed. Microsoft abandoned its own internal Big Data platform and will support Hortonworks' Hadoop distribution on its Azure cloud.
As part of its market-sizing efforts, Wikibon (Kelly, 2013) tracked and/or modeled the 2012 Big Data revenue of more than 60 vendors. The list included both Big Data pure-plays, vendors that derive close to if not all of their revenue from the sale of Big Data products and services, and vendors for whom Big Data sales is just one of multiple revenue streams. Table 13.2 shows the top 20 vendors in order of 2012 Big Data revenue, and Figure 13.9 shows the top 10 pure players in the Big Data marketplace.
The services side of the Big Data market is small but growing. Established services providers like Accenture and IBM are just starting to build Big Data practices, while just a few smaller providers, among them Think Big Analytics, focus strictly on Big Data. EMC is also investing heavily in Big Data training and services offerings, particularly around data science. Similarly, Hadoop distribution vendors Hortonworks and Cloudera offer a number of training classes aimed at both Hadoop administrators and data scientists.
There are also other vendors approaching Big Data from the visual analytics angle. As Gartner's latest Magic Quadrant indicated, a significant share of the growth in business intelligence and analytics is in visual exploration and visual analytics. Large companies like SAS, SAP, and IBM, along with small but stable companies like Tableau, TIBCO, and QlikView, are making a strong case for high-performance analytics built into information visualization platforms. Technology Insights 13.4 provides a few key enablers for succeeding with Big Data and visual analytics. SAS is perhaps pushing harder than any other vendor with its recently launched SAS Visual Analytics platform. Using a multitude of computational enhancements, including massively parallel processing (MPP) and in-memory computing, the SAS Visual Analytics platform can turn tens of millions of data records into informational graphics in just a few seconds. Application Case 13.6 is a customer case where the SAS Visual Analytics platform is used for accurate and timely credit decisions.
TABLE 13.2 Top 20 Vendors in Big Data Market
2012 Worldwide Big Data Revenue by Vendor ($US millions)

Vendor             Big Data   Total      Big Data Revenue   % Big Data   % Big Data   % Big Data
                   Revenue    Revenue    as % of Total      Hardware     Software     Services
                                         Revenue            Revenue      Revenue      Revenue
IBM                $1,352     $103,930   1%                 22%          33%          44%
HP                 $664       $119,895   1%                 34%          29%          38%
Teradata           $435       $2,665     16%                31%          28%          41%
Dell               $425       $59,878    1%                 83%          0%           17%
Oracle             $415       $39,463    1%                 25%          34%          41%
SAP                $368       $21,707    2%                 0%           67%          33%
EMC                $336       $23,570    1%                 24%          36%          39%
Cisco Systems      $214       $47,983    0%                 80%          0%           20%
Microsoft          $196       $71,474    0%                 0%           67%          33%
Accenture          $194       $29,770    1%                 0%           0%           100%
Fusion-io          $190       $439       43%                71%          0%           29%
PwC                $189       $31,500    1%                 0%           0%           100%
SAS Institute      $187       $2,954     6%                 0%           59%          41%
Splunk             $186       $186       100%               0%           71%          29%
Deloitte           $173       $31,300    1%                 0%           0%           100%
Amazon             $170       $56,825    0%                 0%           0%           100%
NetApp             $138       $6,454     2%                 77%          0%           23%
Hitachi            $130       $112,318   0%                 0%           0%           100%
Opera Solutions    $118       $118       100%               0%           0%           100%
Mu Sigma           $114       $114       100%               0%           0%           100%
FIGURE 13.9 Top 10 Big Data Vendors with Primary Focus on Hadoop. Source: wikibon.org.
TECHNOLOGY INSIGHTS 13.4 How to Succeed with Big Data
What a year 2012 was for Big Data! From the White House to your house, it's hard to find an organization or consumer who has less data today than a year ago. Database options proliferate, and business intelligence is evolving into a new era of organization-wide analytics. And everything's mobile. Organizations that successfully adapt their data architecture and processes to address the three characteristics of Big Data (volume, variety, and velocity) are improving operational efficiency, growing revenues, and empowering new business models. With all the attention organizations are placing on innovating around data, the rate of change will only increase. So what should companies do to succeed with Big Data? Here are some of the industry testaments:
1. Simplify. It is hard to keep track of all of the new database vendors, open source projects, and Big Data service providers, and the field will become even more crowded and complicated in the years ahead. Therefore, there is a need for simplification. It is essential to take a strategic approach: extend your relational and online transaction processing (OLTP) systems to one or more of the new on-premise, hosted, or service-based database options that best reflect the needs of your industry and your organization, and then pick a real-time business intelligence platform that supports direct connections to many databases and file formats. Choosing the best mix of solution alternatives for every project (between connecting live to fast databases and importing data extracts into an in-memory analytics engine to offset the performance of slow or overburdened databases) is critical to the success of any Big Data project. For instance, eBay's Big Data analytics architecture comprises Teradata (one of the most popular data warehousing companies), Hadoop (the most promising solution to the Big Data challenge), and Tableau (one of the prolific visual analytics solution providers). eBay employees can visualize insights from more than 52 petabytes of data. eBay uses a visual analytics solution by Tableau to analyze search relevance and quality on the eBay.com site; monitor the latest customer feedback and meter sentiment on eBay.com; and achieve operational reporting for the data warehouse systems, all of which helped an analytic culture flourish within eBay.
2. Coexist. Using the strengths of each database platform and enabling them to coexist in your organization's data architecture is essential. There is ample literature on the necessity of maintaining and nurturing the coexistence of traditional data warehouses with the capabilities of new platforms.
3. Visualize. According to leading analytics research companies like Forrester and Gartner, enterprises find advanced data visualization platforms to be essential tools that enable them to monitor business, find patterns, and take action to avoid threats and snatch opportunities. Visual analytics helps organizations uncover trends, relationships, and anomalies by visually sifting through very large quantities of data. A visual analysis experience has certain characteristics. It allows you to do two things at any moment:
• Instantly change what data you are looking at. This is important because different questions require different data.
• Instantly change the way you are looking at it. This is important because each view may answer different questions.
This combination creates the exploratory experience required for anyone to answer questions quickly. In essence, visualization becomes a natural extension of your experimental thought process.
4. Empower. Big Data and self-service business intelligence go hand in hand, according to Aberdeen Group's recently published "Maximizing the Value of Analytics and Big Data." Organizations with Big Data are over 70 percent more likely than other organizations to have BI/BA projects that are driven primarily by the business community, not by the IT group. Across a range of uses, from tackling new business problems, developing entirely new products and services, and finding actionable intelligence in less than an hour to blending data from disparate sources, Big Data has fired the imagination of what is possible through the application of analytics.
5. Integrate. Integrating and blending data from disparate sources is an essential part of Big Data analytics. Organizations that can blend different relational, semistructured, and raw data sources in real time, without expensive up-front integration costs, will be the ones that get the best value from Big Data. Once the data is integrated and blended, its structure (e.g., spreadsheets, a database, a data warehouse, an open source file system like Hadoop, or all of them at the same time) becomes unimportant; that is, you do not need to know the details of how data is stored to ask and answer questions against it. As we saw in Application Case 13.4, the Obama campaign found a way to integrate social media, technology, e-mail databases, fundraising databases, and consumer market data to create competitive advantage.
6. Govern. Data governance has always been a challenging issue in IT, and it is getting even more puzzling with the advent of Big Data. More than 80 countries have data privacy laws. The European Union (EU) defines seven "safe harbor privacy principles" for the protection of its citizens' personal data. In Singapore, the personal data protection law took effect in January 2013. In the United States, Sarbanes-Oxley affects all publicly listed companies, and HIPAA (the Health Insurance Portability and Accountability Act) sets national standards in healthcare. The right balance between control and experimentation varies depending on the organization and industry. Use of master data management (MDM) best practices seems to help manage the governance process.
7. Evangelize. With the backing of one or more executive sponsors, evangelists like yourself can get the ball rolling and instill a virtuous cycle: the more departments in your organization that realize actionable benefits, the more pervasive analytics becomes across your organization. Fast, easy-to-use visual analytics is the key that opens the door to organization-wide analytics adoption and collaboration.
Sources: Compiled from A. Lampitt, "Big Data Visualization: A Big Deal for eBay," InfoWorld, December 6, 2012, infoworld.com/d/big-data/big-data-visualization-big-deal-ebay-208589 (accessed March 2013); Tableau white paper, cdnlarge.tableausoftware.com/sites/default/files/whitepapers/7-tips-to-succeed-with-big-data-in-2013 (accessed January 2013).
Application Case 13.6
Creditreform Boosts Credit Rating Quality with Big Data Visual Analytics
Founded as a credit agency in Mainz, Germany, in 1879, Creditreform has grown to serve more than 163,000 members from 177 offices across Europe and China as one of the leading international providers of business information and receivables management services. Creditreform provides a comprehensive spectrum of integrated credit risk management solutions and services worldwide, provides members with more than 16 million commercial reports a year, and helps them recover billions in outstanding debts.
Challenge
Via its online database, Creditreform makes more than 24 million credit reports from 26 countries in Europe and from China available around the clock. Using high-performance solutions, Creditreform wants to quickly detect anomalies and relationships within those high data volumes and present the results in easy-to-read graphics. Already Germany's top provider of quality business information and debt collection services, Creditreform wants to maintain its leadership and widen its market lead through better and faster analytics.
Solution and the Results
Creditreform decided to use SAS Visual Analytics to simplify the analytics process, so that every Creditreform employee can use the software to make smart decisions without needing extensive training. The new high-performance solution, obtained from one of the business analytics leaders in the marketplace (SAS Institute), makes Creditreform better at providing the highest quality financial information and credit ratings to its client businesses.
"SAS Visual Analytics makes it faster and easier for our analysts to detect correlations in our business data," said Bernd Bütow, managing director at Creditreform. "That, in turn, improves the quality and forecasting accuracy of our credit ratings."
"Creditreform saw SAS Visual Analytics as a compelling solution," remarked Mona Beck, financial services sales director at SAS Germany. "SAS Visual Analytics advances business analytics by combining Big Data analysis with excellent usability, making it a breeze to represent data graphically. As a company known for providing top-quality information on businesses, Creditreform is a perfect match for the very latest in business analytics technology."
SAS Visual Analytics is a high-performance, in-memory solution for exploring massive amounts of data very quickly. Users can explore all data, execute analytic correlations on billions of rows of data in just minutes or seconds, and visually present the results. With SAS Visual Analytics, executives can make quicker, better decisions with instant access,
via PC or tablet, to insights based on the latest data.
By integrating corporate and consumer data, bank
executives gain real-time insights for risk manage-
ment, customer development, product marketing,
and financial management.
QUESTIONS FOR DISCUSSION
1. How did Creditreform boost credit rating quality
with Big Data and visual analytics?
2. What were the challenges, proposed solution,
and initial results?
Source: SAS, Customer Stories, "With SAS, Creditreform Boosts Credit Rating Quality, Forecasting: SAS Visual Analytics, High-Performance Analytics Speed Decisions, Increase Efficiency," sas.com/news/preleases/banking-visual-analytics.html (accessed March 2013).
SECTION 13.7 REVIEW QUESTIONS
1. What is special about the Big Data vendor landscape? Who are the big players?
2. How do you think the Big Data vendor landscape will change in the near future? Why?
3. What is the role of visual analytics in the world of Big Data?
13.8 BIG DATA AND STREAM ANALYTICS
Along with volume and variety, as we have seen earlier in this chapter, one of the key characteristics that define Big Data is velocity, the speed at which data is created and streamed into the analytics environment. Organizations are looking for new means to process this streaming data as it comes in, so they can react quickly and accurately to problems and opportunities, please their customers, and gain competitive advantage. In situations where data streams in rapidly and continuously, traditional analytics approaches that work with previously accumulated data (i.e., data at rest) often either arrive at the wrong decisions because they use too much out-of-context data, or they arrive at the correct decisions but too late to be of any use to the organization. Therefore, in many business situations it is critical to analyze the data soon after it is created and/or as soon as it is streamed into the analytics system.
The presumption that the vast majority of modern-day businesses currently live by is that it is important, even critical, to record every piece of data because it might contain valuable information now or sometime in the near future. However, as the number of data sources increases, the "store-everything" approach becomes harder and harder and, in some cases, not even feasible. In fact, despite technological advances, current total storage capacity lags far behind the digital information being generated in the world. Moreover, in a constantly changing business environment, real-time detection of meaningful changes in data, as well as of complex pattern variations within a given short time window, is essential in order to come up with actions that better fit the new environment. These facts became the main triggers for a paradigm that we call stream analytics. The stream analytics paradigm was born as an answer to these challenges: unbounded flows of data that cannot be permanently stored in order to be subsequently analyzed in a timely and efficient manner, and complex pattern variations that need to be detected and acted upon as soon as they happen.
Stream analytics (also called data-in-motion analytics and real-time data analytics, among other names) is a term commonly used for the analytic process of extracting actionable information from continuously flowing/streaming data. A stream can be defined as a continuous sequence of data elements (Zikopoulos et al., 2013). The data elements in a stream are often called tuples. In a relational database sense, a tuple is similar to a row of data (a record, an object, an instance). In the context of semistructured or unstructured data, however, a tuple is an abstraction that represents a package of data, which can be characterized as a set of attributes for a given object. If a tuple by itself is not sufficiently informative for analysis, because a correlation or other collective relationship among tuples is needed, then a window of data that includes a set of tuples is used. A window of data is a finite number/sequence of tuples, where the window is continuously updated as new data become available. The size of the window is determined based on the system being analyzed. Stream analytics is becoming increasingly popular for two reasons. First, time-to-action has become an ever-decreasing value, and second, we now have the technological means to capture and process the data while it is being created.
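The tuple-and-window vocabulary can be made concrete with a few lines of Python. The sketch below keeps a continuously updated window over the most recent tuples of a stream; the attribute names, window size, and readings are illustrative choices, not prescriptions.

from collections import deque

class SlidingWindow:
    """Keep the most recent `size` tuples of a stream; the window is
    refreshed as each new tuple arrives, mirroring the definition above."""
    def __init__(self, size):
        self.tuples = deque(maxlen=size)   # old tuples fall off automatically

    def push(self, tup):
        self.tuples.append(tup)

    def mean(self, attribute):
        vals = [t[attribute] for t in self.tuples]
        return sum(vals) / len(vals) if vals else None

# Each stream element is a tuple in the sense used here: a small package of
# attributes describing one observation (field names are illustrative).
window = SlidingWindow(size=3)
for reading in [{"meter": "m1", "kwh": 1.2},
                {"meter": "m1", "kwh": 1.5},
                {"meter": "m1", "kwh": 9.8},
                {"meter": "m1", "kwh": 1.4}]:
    window.push(reading)
    print(round(window.mean("kwh"), 2))   # mean over at most the last 3 tuples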
Some of the most impactful applications of stream analytics have been developed in the energy industry, specifically for smart grid (electric power supply chain) systems. The new smart grids are capable not only of real-time creation and processing of multiple streams of data in order to determine optimal power distribution to fulfill real customer needs, but also of generating accurate short-term predictions aimed at covering unexpected demand and renewable energy generation peaks. Figure 13.10 shows a generic use case for streaming analytics in the energy industry (a typical smart grid application). The goal is to accurately predict electricity demand and production in real time by using streaming data coming from smart meters, production system sensors, and meteorological models. The ability to predict near-future consumption/production trends and detect anomalies in real time can be used to optimize supply decisions (how much to produce, what sources of production to use, how to optimally adjust production capacities) as well as to adjust smart meters to regulate consumption and enable favorable energy pricing.

FIGURE 13.10 A Use Case of Streaming Analytics in the Energy Industry. The figure shows sensor data from the energy production system (traditional and renewable), meteorological data (wind, light, temperature, etc.), and usage data from smart meters and smart grid devices flowing through data integration and temporary staging into a streaming analytics engine that predicts usage, production, and anomalies; its outputs drive capacity decisions for the production system and pricing decisions for the consumption system (residential and commercial), with a permanent storage area retaining the data.
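A minimal sketch of the short-term prediction side of such a smart grid application follows, using simple exponential smoothing over a stream of load readings and flagging large forecast errors as possible anomalies. The smoothing constant, the anomaly threshold, and the readings are all invented for illustration; production systems use far richer models.

def exponential_smoothing(stream, alpha=0.3):
    """One-step-ahead demand forecast over a stream of load readings.
    Yields (observed, forecast) pairs after each update."""
    forecast = None
    for observed in stream:
        if forecast is None:
            forecast = observed            # seed with the first observation
        # Flag a surprise before updating: a large forecast error may signal
        # an unexpected demand peak worth acting on immediately.
        if abs(observed - forecast) > 0.5 * forecast:
            print(f"anomaly: observed {observed}, expected ~{forecast:.1f}")
        forecast = alpha * observed + (1 - alpha) * forecast
        yield observed, forecast

# Simulated megawatt readings arriving one at a time (illustrative numbers).
for obs, fc in exponential_smoothing([100, 104, 99, 180, 102]):
    print(f"observed={obs:>5} next-period forecast={fc:6.1f}")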
Stream Analytics Versus Perpetual Analytics
The terms "streaming" and "perpetual" probably sound like the same thing to most people, and in many cases they are used synonymously. However, in the context of intelligent systems, there is a difference (Jonas, 2007). Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to these observations take into account previous observations as long as they occurred within the prescribed window; these windows have some arbitrary size (e.g., the last 5 seconds, the last 10,000 observations). Perpetual analytics, on the other hand, evaluates every incoming observation against all prior observations, with no window size. Recognizing how the new observation relates to all prior observations enables the discovery of real-time insight.
Both streaming and perpetual analytics have their pros and cons, and their respective places in the business analytics world. For example, sometimes transactional volumes are high and the time-to-decision is too short, favoring nonpersistence and small window sizes, which translates into using streaming analytics. However, when the mission is critical and transaction volumes can be managed in real time, then perpetual analytics is a better answer. That way, one can answer questions such as "How does what I just learned relate to what I have known?" "Does this matter?" and "Who needs to know?"
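The distinction can be seen side by side in a small Python sketch: the same rule is applied once against a fixed-size window (streaming) and once against the full history (perpetual). The observations and the doubling rule are arbitrary illustrative choices.

from collections import deque

observations = [10, 12, 11, 45, 12, 11, 13]

# Streaming analytics: each new observation is judged only against the
# prescribed window (here, the last 3 observations).
window = deque(maxlen=3)
for x in observations:
    if window and x > 2 * (sum(window) / len(window)):
        print(f"streaming flag: {x} vs window {list(window)}")
    window.append(x)

# Perpetual analytics: each new observation is judged against *all* prior
# observations; there is no window size.
history = []
for x in observations:
    if history and x > 2 * (sum(history) / len(history)):
        print(f"perpetual flag: {x} vs mean of {len(history)} prior obs")
    history.append(x)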
Critical Event Processing
Critical event processing is a method of capturing, tracking, and analyzing streams of data to detect events (out-of-normal happenings) of certain types that are worthy of the effort. Complex event processing is an application of stream analytics that combines data from multiple sources to infer events or patterns of interest either before they actually
occur or as soon as they happen. The goal is to take rapid action either to prevent (or mitigate the negative effects of) these events (e.g., fraud or network intrusion), or, in the case of a short window of opportunity, to take full advantage of the situation within the allowed time (e.g., based on user behavior on an e-commerce site, create promotional offers that users are more likely to respond to).
These critical events may be happening across the various layers of an organization, such as sales leads, orders, or customer service calls. Or, more broadly, they may be news items, text messages, social media posts, stock market feeds, traffic reports, weather conditions, or other kinds of anomalies that may have a significant impact on the well-being of the organization. An event may also be defined generically as a "change of state," which may be detected as a measurement exceeding a predefined threshold of time, temperature, or some other value. Even though there is no denying the value proposition of critical event processing, one has to be selective about what to measure, when to measure, and how often to measure. Because of the vast amount of information available about events, sometimes referred to as the event cloud, there is a possibility of overdoing it, in which case, as opposed to helping the organization, it may hurt operational effectiveness.
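Below is a hedged sketch of the complex event processing idea: a composite "change of state" is inferred when two simple events concerning the same entity arrive from different sources within a time horizon. The event names, the entity, and the five-minute horizon are invented for illustration.

import time

class ComplexEventDetector:
    """Flag a composite event when two simple events about the same entity
    co-occur within `horizon` seconds, e.g., a card swiped in two distant
    cities. Thresholds and event names are illustrative, not from any real
    system."""
    def __init__(self, horizon=300):
        self.horizon = horizon
        self.last_seen = {}   # (entity, event_type) -> timestamp

    def observe(self, entity, event_type, timestamp):
        self.last_seen[(entity, event_type)] = timestamp
        # Composite rule: swipe_A and swipe_B close together in time.
        other = "swipe_B" if event_type == "swipe_A" else "swipe_A"
        t_other = self.last_seen.get((entity, other))
        if t_other is not None and abs(timestamp - t_other) <= self.horizon:
            print(f"critical event for {entity}: "
                  f"{other} and {event_type} within {self.horizon}s")

detector = ComplexEventDetector(horizon=300)
detector.observe("card-123", "swipe_A", time.time())
detector.observe("card-123", "swipe_B", time.time() + 60)   # triggers alert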
Data Stream Mining
Data stream mining, as an enabling technology for stream analytics, is the process of
extracting novel patterns and knowledge structures from continuous, rapid data records. As
we have seen in the data mining chapter (Chapter 5), traditional data mining methods require
the data to be collected and organized in a proper file format, and then processed in a recursive
manner to learn the underlying patterns. In contrast, a data stream is a continuous flow
of an ordered sequence of instances that, in many applications of data stream mining, can be
read/processed only once or a small number of times using limited computing and storage
capabilities. Examples of data streams include sensor data, computer network traffic, phone
conversations, ATM transactions, web searches, and financial data. Data stream mining can
be considered a subfield of data mining, machine learning, and knowledge discovery.
In many data stream mining applications, the goal is to predict the class or value of
new instances in the data stream given some knowledge about the class membership or
values of previous instances in the data stream. Specialized machine learning techniques
(mostly derivatives of traditional machine learning techniques) can be used to learn this
prediction task from labeled examples in an automated fashion. An example of such a
prediction method was developed by Delen et al. (2005), who gradually built and
refined a decision tree model by using a subset of the data at a time.
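To make one-pass learning from a labeled stream concrete, here is a minimal Python sketch using scikit-learn's incremental partial_fit interface on synthetic mini-batches. It is a stand-in illustration of the idea, not the Delen et al. (2005) decision tree algorithm:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(random_state=0)
    classes = np.array([0, 1])  # in a stream, all class labels must be declared up front

    rng = np.random.default_rng(0)
    for batch in range(10):
        # Each mini-batch is read once, used to update the model, then discarded
        X = rng.normal(size=(50, 4))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic stand-in for stream labels
        model.partial_fit(X, y, classes=classes)

    X_new = rng.normal(size=(5, 4))
    print("predicted classes for new stream instances:", model.predict(X_new))

The key constraint the sketch respects is that no batch is stored: the model state is the only memory of everything seen so far.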
SECTION 13.8 REVIEW QUESTIONS
1. What is a stream (in the Big Data world)?
2. What are the motivations for stream analytics?
3. What is stream analytics? How does it differ from regular analytics?
4. What is critical event processing? How does it relate to stream analytics?
5. Define data stream mining. What additional challenges does it pose?
13.9 APPLICATIONS OF STREAM ANALYTICS
Because of its power to create insight instantly, helping decision makers stay on top of
events as they unfold and allowing organizations to address issues before they become
problems, the use of streaming analytics is growing rapidly. Following are some of the
application areas that have already benefited from stream analytics.
e-commerce
Companies like Amazon and eBay (among many others) are trying to make the most out
of the data that they collect while the customer is on their Web sites. Every page visit,
every product looked at, every search conducted, and every click made is recorded and
analyzed to maximize the value gained from a user’s visit. If done quickly, analysis of such
a stream of data can turn browsers into buyers and buyers into shopaholics. When we visit
an e-commerce Web site, even one where we are not a member, after a few clicks here and
there we start to get very interesting product and bundle price offers. Behind the scenes,
advanced analytics crunch the real-time data coming from our clicks, and the clicks
of thousands of others, to "understand" what it is that we are interested in (in some cases,
before we know it ourselves) and make the most of that information through creative offerings.
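As a toy illustration of this mechanism (the category names and threshold are hypothetical, and real systems score far richer behavioral signals), the following Python sketch turns a single visitor's clickstream into a real-time offer decision:

    from collections import Counter

    OFFER_THRESHOLD = 3  # clicks on one category before an offer is triggered

    def offer_engine(clickstream):
        interest = Counter()
        for category in clickstream:
            interest[category] += 1
            top_category, top_clicks = interest.most_common(1)[0]
            if top_clicks == OFFER_THRESHOLD:  # fire once, when interest concentrates
                yield f"show bundle offer for '{top_category}'"

    visitor_clicks = ["laptops", "laptops", "cameras", "laptops", "headphones"]
    for action in offer_engine(visitor_clicks):
        print(action)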
Telecommunications
The volume of data that comes from call detail records (CDR) for telecommunications
companies is astounding. Although this information has been used for billing purposes
for quite some time now, there is a wealth of knowledge buried deep inside this Big Data
that the telecommunications companies are just now learning to tap. For instance, CDR
data can be analyzed to prevent churn by identifying networks of callers, influencers,
leaders, and followers within those networks and proactively acting on this information.
As we all know, influencers and leaders have the effect of changing the perception of
the followers within their network toward the service provider, either positively or nega-
tively. Using social network analysis techniques, telecommunication companies are iden-
tifying the leaders and influencers and their network participants to better manage their
customer base. In addition to churn analysis, such information can also be used to recruit
new members and maximize the value of the existing members.
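As a simple illustration of the social network analysis idea (the data are synthetic, and production systems use much richer influence measures than degree centrality), the following Python sketch builds a caller graph from CDR-style records and ranks subscribers as candidate influencers:

    import networkx as nx

    # (caller, callee) pairs distilled from call detail records
    calls = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"),
             ("E", "A"), ("F", "A"), ("D", "G")]

    G = nx.Graph()
    G.add_edges_from(calls)

    centrality = nx.degree_centrality(G)
    influencers = sorted(centrality, key=centrality.get, reverse=True)[:3]
    print("candidate influencers to prioritize for retention:", influencers)

Retaining subscriber A here protects not just one account but, potentially, the perception of the whole calling circle around A.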
The continuous stream of data that comes from CDRs can be combined with social media
data (sentiment analysis) to assess the effectiveness of marketing campaigns. Insight
gained from these data streams can be used to rapidly react to adverse effects (which may
lead to loss of customers) or boost the impact of positive effects (which may lead to maxi-
mizing purchases of existing customers and recruitment of new customers) observed in
these campaigns. Furthermore, the process of gaining insight from CDRs can be replicated
for data networks using Internet protocol detail records. Since most telecommunications
companies provide both of these service types, a holistic optimization of all offerings and
marketing campaigns could lead to extraordinary market gains. Application Case 13.7 is
an example of how telecommunication companies are using stream analytics to boost
customer satisfaction and competitive advantage.
Application Case 13.7
Turning Machine-Generated Streaming Data into Valuable Business Insights
This case study is about one of the largest U.S.
telecommunications organizations, which offers a
variety of services, including digital voice, high-speed
Internet, and cable, to more than 24 million customers.
As a subscription-based business, its success depends
on its IT infrastructure to deliver a high-quality cus-
tomer experience. When application failures or
network latencies negatively impact the customer
experience, they adversely impact company revenue
as well. That’s why this leading telecommunications
organization demands robust and timely information
from its operational telemetry to ensure data integrity,
stability, application quality, and network efficiency.
Challenges
The environment generates over a billion daily events
running on a distributed hardware/software infrastructure
supporting millions of cable, online, and interactive
media customers. It was overwhelming to even
gather and view this data in one place, much less to
perform any diagnostics, or home in on the real-time
intelligence that lives in the machine-generated data.
Using time-consuming and error-prone traditional
search methods, the company’s roster of experts
would shuffle through mountains of data to uncover
issues threatening data integrity, system stability, and
application performance, all necessary components
of delivering a quality customer experience.
Solution
In order to bolster operational intelligence, the
company chose to work with Splunk, one of
the leading analytics service providers in the area
of turning machine-generated streaming data into
valuable business insights. Here are some of the
results.
Application troubleshooting. Before Splunk,
developers had to ask the operations
team to FTP log files to them. And then they
waited, sometimes 16+ hours, to get the data
they needed, while the operations teams had to
step away from their primary duties to assist the
developers. Now, because Splunk aggregates all
relevant machine data into one place, developers
can be more proactive about troubleshooting
code and improving the user experience. When
they first deployed Splunk, they started with a
simple search for 404 errors. Splunk revealed
up to 1,600 404s per second for a particular
service. The team identified latencies in a flash
player download as the primary blocker, caus-
ing viewers to navigate away from the page
without viewing any content. Just one search
in Splunk has helped to boost video views by
3 percent over the last year. In a business where
eyes equal dollars, that’s real money to the busi-
ness. Now when the applications team sees 404s
spiking on custom dashboards they’ve built in
Splunk, they can dig in to see what’s happening
upstream and align appropriate resources to
recapture those viewers, and that revenue.
Operations. Splunk’s ability to model sys-
tems and examine patterns in real time helped
the operations team avoid critical downtime.
Using Splunk, they spotted the potential for
failure in a vendor-provided infrastructure.
Modeling the proposed architecture in Splunk,
they were able to predict system imbalance
and how it might fail based on an inability to
distribute load. “My team provides guidance
to our executives on mission-critical media
systems and strategic systems architecture,” said
Matt Stevens, director of software architecture.
“This is just one instance where Splunk paid
for itself by helping us avoid deployment of
vulnerable systems, which would inevitably
result in downtime and upset customers.” In
day-to-day operations, teams use Splunk to
identify and drill into events, spotting activity
patterns leading to outages. Once they’ve
identified signatures or patterns, they create
alerts to proactively avoid future problems.
Compliance. Once seen as a foe, compliance
mandates are now viewed by many organizations
as an opportunity to implement best
practices in log consolidation and IT systems
management. This organization is no different.
As Sarbanes-Oxley (SOX) and other compliance
mandates evolve, the company uses
Splunk to audit its systems, generate scheduled
and ad hoc reports, and share information with
business executives, auditors, and partners.
Security. When you’re a content provider,
DNS attacks simply can’t be tolerated. By consolidating
logs across data centers, the security
team has improved the effectiveness of its
threat assessments and security monitoring.
Dashboards allow analysts to detect system vulnerabilities
or attacks on both its content delivery
network and critical applications. Trend reports
spanning long timeframes also identify recurring
threats and known attackers. And alerts for bad
actors trigger immediate responses.
Conclusion
No longer does the sheer volume of machine-generated
data overwhelm the operations team.
The more data that the company’s enormous infrastructure
generates, the more lurking issues and
security threats are revealed. The team even seeks
out historical data, going back years, to identify
trends and unique patterns. As the discipline of
investigating anomalies and creating alerts based
on unmasked event signatures spreads throughout
the IT organization, the growing knowledge base
and awareness fortify the cable provider’s ability to
deliver continuous quality customer experiences.
Even more valuable than this situational
awareness has been the predictive capability gained.
When testing a new technology, the decision-making
team sees how a solution will work in production,
determining the potential for instability by observing
reactions to varying loads and traffic patterns.
Splunk’s predictive analytics capabilities help this
leading cable provider make the right decisions,
avoiding costly delays and downtime.
QUESTIONS FOR DISCUSSION
1. Why is stream analytics becoming more popular?
2. How did the telecommunication company in this case use stream analytics for better business outcomes? What additional benefits can you foresee?
3. What were the challenges, proposed solution, and initial results?
Source: Splunk, Customer Case Study, splunk.com/view/SP-CAAAFAD (accessed March 2013).
Law Enforcement and Cyber Security
Streams of Big Data provide excellent opportunities for improved crime prevention, law
enforcement, and enhanced security. They offer unmatched potential when it comes to
security applications that can be built in this space, such as real-time situational awareness,
multimodal surveillance, cyber-security detection, legal wiretapping, video surveillance,
and face recognition (Zikopoulos et al., 2013). As an application of information
assurance, enterprises can use streaming analytics to detect and prevent network intrusions,
cyber attacks, and malicious activities by streaming and analyzing network logs and
other Internet activity monitoring resources.
Power Industry
Because of the increasing use of smart meters, the amount of real-time data collected by
power utilities is increasing exponentially. Moving from one meter read a month to one
every 15 minutes (or more frequently) accumulates large quantities of invaluable data for
power utilities. These smart meters and other sensors placed all around the power grid send
information back to the control centers to be analyzed in real time. Such analyses help utility
companies optimize their supply chain decisions (e.g., capacity adjustments, distribution
network options, real-time buying or selling) based on up-to-the-minute consumer usage
and demand patterns. Additionally, utility companies can integrate weather and other natural
condition data into their analytics to optimize power generation from alternative sources
(e.g., wind, solar, etc.) and to better forecast energy demand at different geographic granularities.
Similar benefits also apply to other utilities, such as water and natural gas.
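The following Python sketch (synthetic meter reads; the peak definition is a hypothetical choice) illustrates the flavor of such analysis: rolling 15-minute smart-meter reads up to hourly demand and flagging the peak hours a utility might act on:

    import numpy as np
    import pandas as pd

    idx = pd.date_range("2013-07-01", periods=96, freq="15min")  # one day of 15-minute reads
    reads_kwh = pd.Series(np.random.default_rng(1).uniform(0.2, 1.5, size=96), index=idx)

    hourly = reads_kwh.resample("h").sum()               # hourly demand profile
    peak_hours = hourly[hourly > hourly.quantile(0.9)]   # top decile as "peak" (illustrative)
    print(peak_hours)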
Financial Services
Financial service companies are among the prime examples where analysis of Big Data
streams can provide faster and better decisions, competitive advantage, and regulatory
oversight. The ability to analyze fast-paced, high volumes of trading data at very low
latency across markets and countries offers a tremendous advantage in making the split-
second buy/sell decisions that potentially translate into big financial gains. In addition to
optimal buy/sell decisions, stream analytics can also help financial service companies in
real-time trade monitoring to detect fraud and other illegal activities.
Health Sciences
Modern-era medical devices (e.g., electrocardiograms and equipment that measures
blood pressure, blood oxygen level, blood sugar level, body temperature, and so on)
are capable of producing invaluable streaming diagnostic/sensory data at a very fast rate.
Harnessing this data and analyzing it in real time offers benefits unlike those in any other
field, the kind that we often call "life and death." In addition to helping healthcare companies
become more effective and efficient (and hence more competitive and profitable), stream
analytics is also improving patient conditions and saving lives.
Many hospital systems all around the world are developing care infrastructures and
health systems that are futuristic. These systems aim to take full advantage of what the technology
has to offer, and more. Using hardware devices that generate high-resolution data
at a very rapid rate, coupled with super-fast computers that can synergistically analyze multiple
streams of data, increases the chances of keeping patients safe by quickly detecting
anomalies. These systems are meant to help human decision makers make faster and better
decisions by being exposed to a multitude of information as soon as it becomes available.
Government
Governments all around the world are trying to find ways to be more efficient (via optimal
use of limited resources) and effective (providing the services that people need and want).
As the practices of e-government become mainstream, coupled with widespread use of and
access to social media, very large quantities of data (both structured and unstructured) are
at the disposal of government agencies. Proper and timely use of these Big Data streams
differentiates proactive and highly efficient agencies from the ones that are still using traditional
methods to react to situations as they unfold. Another way in which government
agencies can leverage real-time analytics capabilities is to manage natural disasters such
as snowstorms, hurricanes, tornados, and wildfires through surveillance of streaming data
coming from radars, sensors, and other smart detection devices. They can also use similar
approaches to monitor water quality, air quality, and consumption patterns, and detect
anomalies before they become significant problems. Yet another area where government
agencies use stream analytics is in traffic management in congested cities. By using the
data coming from traffic flow cameras, GPS data coming from commercial vehicles, and
traffic sensors embedded in roadways, agencies are able to change traffic light sequences
and traffic flow lanes to ease the pain caused by traffic congestion problems.
SECTION 13.9 REVIEW QUESTIONS
1. What are the most fruitful industries for stream analytics?
2. How can stream analytics be used in e-commerce?
3. In addition to what is listed in this section, can you think of other industries and/or
application areas where stream analytics can be used?
4. Compared to regular analytics, do you think stream analytics will have more (or
fewer) use cases in the era of Big Data analytics? Why?
Chapter Highlights
• Big Data means different things to people with
different backgrounds and interests.
• Big Data exceeds the reach of commonly used
hardware environments and/or capabilities of
software tools to capture, manage, and process it
within a tolerable time span.
• Big Data is typically defined by three ”V”s: vol-
ume, variety, velocity.
• MapReduce is a technique to distribute the pro-
cessing of very large multi-structured data files
across a large cluster of machines.
• Hadoop is an open source framework for pro-
cessing, storing, and analyzing massive amounts
of distributed, unstructured data.
• Hive is a Hadoop-based data warehousing-like
framework originally developed by Facebook.
• Pig is a Hadoop-based query language developed
by Yahoo!.
• NoSQL, which stands for Not Only SQL, is a new
paradigm to store and process large volumes of
unstructured, semistructured, and multi-structured data.
• Data scientist is a new role or a job commonly
associated with Big Data or data science.
• Big Data and data warehouses are complemen-
tary (not competing) analytics technologies.
• As a relatively new area, the Big Data vendor
landscape is developing very rapidly.
• Stream analytics is a term commonly used for
extracting actionable information from continu-
ously flowing/streaming data sources.
• Perpetual analytics evaluates every incoming
observation against all prior observations.
• Critical event processing is a method of captur-
ing, tracking, and analyzing streams of data to
detect certain events (out of normal happenings)
that are worthy of the effort.
• Data stream mining, as an enabling technology
for stream analytics, is the process of extracting
novel p atterns and knowledge structures from
continuous, rapid data records.
Key Terms
Big Data
Big Data analytics
critical event processing
data scientist
data stream mining
Hadoop
Hadoop Distributed File System (HDFS)
Hive
MapReduce
NoSQL
perpetual analytics
Pig
RFID
social media
stream analytics

Questions for Discussion
1. What is Big Data? Why is it important? Where does Big Data come from?
2. What do you think the future of Big Data will be? Will it leave its popularity to something else? If so, what will it be?
3. What is Big Data analytics? How does it differ from regu-
lar analytics?
4. What are the critical success factors for Big Data
analytics?
5. What are the big challenges that one should be mindful of when considering implementation of Big Data analytics?
6. What are the common business problems addressed by Big Data analytics?
7. Who is a data scientist? What makes them so much in demand?
8. What are the common characteristics of data scientists? Which one is the most important?
9. In the era of Big Data, are we about to witness the end of data warehousing? Why?
10. What are the use cases for Big Data/Hadoop and data warehousing/RDBMS?
11. What is stream analytics? How does it differ from regular analytics?
12. What are the most fruitful industries for stream analytics? What is common to those industries?
13. Compared to regular analytics, do you think stream analytics will have more (or fewer) use cases in the era of Big Data analytics? Why?
Exercises
Teradata University Network (TUN) and Other
Hands-On Exercises
1. Go to teradatauniversitynetwork.com and search for case studies. Read cases and white papers that talk about Big Data analytics. What is the common theme in those case studies?
2. At teradatauniversitynetwork.com, find the SAS Visual Analytics white papers, case studies, and hands-on exercises. Carry out the visual analytics exercises on large data sets and prepare a report to discuss your findings.
3. At teradatauniversitynetwork.com, go to the podcasts library. Find podcasts about Big Data analytics. Summarize your findings.
4. Go to teradatauniversitynetwork.com and search for BSI videos that talk about Big Data. Review these BSI videos and answer the case questions related to them.
5. Go to the teradata.com and/or asterdata.com Web sites. Find at least three customer case studies on Big Data, and write a report where you discuss the commonalities and differences of these cases.
6. Go to IBM.com. Find at least three customer case studies on Big Data, and write a report where you discuss the commonalities and differences of these cases.
7. Go to cloudera.com. Find at least three customer case studies on Hadoop implementation, and write a report where you discuss the commonalities and differences of these cases.
8. Go to MapR.com. Find at least three customer case studies on Hadoop implementation, and write a report where you discuss the commonalities and differences of these cases.
9. Go to hortonworks.com. Find at least three customer case studies on Hadoop implementation, and write a report where you discuss the commonalities and differences of these cases.
10. Go to marklogic.com. Find at least three customer case studies on Hadoop implementation, and write a report where you discuss the commonalities and differences of these cases.
11. Go to youtube.com. Search for videos on Big Data computing. Watch at least two. Summarize your findings.
12. Go to google.com/scholar and search for articles on stream analytics. Find at least three related articles. Read and summarize your findings.
13. Enter google.com/scholar and search for articles on data stream mining. Find at least three related articles. Read and summarize your findings.
14. Search job search sites like monster.com, careerbuilder.com, and so forth. Find at least five job postings for data scientist. Identify the key characteristics and skills expected from the applicants.
15. Enter google.com/scholar and search for articles that talk about Big Data versus data warehousing. Find at least five articles. Read and summarize your findings.

End-of-Chapter Application Case
Discovery Health Turns Big Data into Better Healthcare
Introduction: Business Context
Founded in Johannesburg more than 20 years ago, Discovery
now operates throughout the country, with offices in most
major cities to support its network of brokers. It employs
more than 5,000 people and offers a wide range of health,
life, and other insurance services.
In the health sector, Discovery prides itself on offering
the widest range of health plans in the South African market.
As one of the largest health scheme administrators in the
country, it is able to keep member contributions as low as
possible, making it more affordable to a wider cross-section
of the population. On a like-for-like basis, Discovery’s plan
contributions are as much as 15 percent lower than those of
any other South African medical scheme.
Business Challenges
When your health schemes have 2.7 million members, your
claims system generates a million new rows of data daily,
and you are using three years of historical data in your analytics
environment, how can you identify the key insights that
your business and your members’ health depend on?
This was the challenge facing Discovery Health, one
of South Africa’s leading specialist health scheme administrators.
To find the needles of vital information in the big
data haystack, the company needed not only a sophisticated
data-mining and predictive modeling solution, but also an
analytics infrastructure with the power to deliver results at
the speed of business.
Solutions: Big Data Analytics
By building a new accelerated analytics landscape, Discovery
Health is now able to unlock the true potential of its data for the
first time. This enables the company to run three years’ worth
of data for its 2.7 million members through complex statistical
models to deliver actionable insights in a matter of minutes.
Discovery is constantly developing new analytical applications,
and has already seen tangible benefits in areas such as predictive
modeling of members’ medical needs and fraud detection.
Predicting and preventing health risks
Matthew Zylstra, Actuary, Risk Intelligence Technical Development
at Discovery Health, explains: “We can now combine
data from our claims system with other sources of information,
such as pathology results and members’ questionnaires,
to gain more accurate insight into their current and possible
future health.
“For example, by looking at previous hospital admissions,
we can now predict which of our members are most
likely to require procedures such as knee surgery or lower
back surgery. By gaining a better overview of members’
needs, we can adjust our health plans to serve them more
effectively and offer better value.”
Lizelle Steenkamp, Divisional Manager, Risk Intelligence
Technical Development, adds: “Everything we do is an
attempt to lower costs for our members while maintaining
or improving the quality of care. The schemes we administer
are mutual funds (non-profit organizations), so any surpluses
in the plan go back to the members we administer, either
through increased reserves or lowered contributions.
“One of the most important ways we can simultaneously reduce costs
and improve the well-being of our members is to predict and
prevent health problems before they need treatment. We are
using the results of our predictive modeling to design preventative
programs that can help our members stay healthier.”
Identifying and eliminating fraud
Estiaan Steenberg, Actuary at Discovery Health, comments:
“From an analytical point of view, fraud is often a small intersection
between two or more very large data-sets. We now
have the tools we need to identify even the tiniest anomalies
and trace suspicious transactions back to their source.”
For example, Discovery can now compare drug prescriptions
collected by pharmacies across the country with healthcare
providers’ records. If a prescription seems to have been
issued by a provider, but the person fulfilling it has not visited
that provider recently, it is a strong indicator that the prescription
may be fraudulent. “We used to only be able to run this
kind of analysis for one pharmacy and one month at a time,”
says Estiaan Steenberg. “Now we can run 18 months of data
from all the pharmacies at once in two minutes. There is no
way we could have obtained these results with our old analytics
landscape.”
Similar techniques can be used to identify coding
errors in billing from healthcare providers, for example, if a
provider “upcodes” an item to charge Discovery for a more
expensive procedure than it actually performed, or “unbundles”
the billing for a single procedure into two or more separate
(and more expensive) lines. By comparing the billing
codes with data on hospital admissions, Discovery is alerted
to unusual patterns, and can investigate whenever mistakes
or fraudulent activity are suspected.
The Results: Transforming Performance
To achieve this transformation in its analytics capabilities,
Discovery worked with BITanium, an IBM Business Partner
with deep expertise in operational deployments of advanced
analytics technologies. “BITanium has provided fantastic
support from so many different angles,” says Matthew
Zylstra. “Product evaluation and selection, software license
management, technical support for developing new models,
performance optimization, and analyst training are just a few
of the areas they have helped us with.”
Discovery is an experienced user of IBM SPSS® predictive
analytics software, which forms the core of its data-mining
and predictive analytics capability. But the most
important factor in embedding analytics in day-to-day
operational decision-making has been the recent introduction
of the IBM PureData™ System for Analytics, powered
by Netezza® technology, an appliance that transforms the
performance of the predictive models.
“BITanium ran a proof of concept for the solution that
rapidly delivered useful results,” says Lizelle Steenkamp.
“We were impressed with how quickly it was possible to
achieve tremendous performance gains.” Matthew Zylstra
adds: “Our data warehouse is so large that some queries
used to take 18 hours or more to process, and they would
often crash before delivering results. Now, we see results in
a few minutes, which allows us to be more responsive to our
customers and thus provide better care.”
From an analytics perspective, the speed of the solution
gives Discovery more scope to experiment and optimize
its models. “We can tweak a model and re-run the analysis in
a few minutes,” says Matthew Zylstra. “This means we can do
more development cycles faster, and release new analyses to
the business in days rather than weeks.”
From a broader business perspective, the combination of
SPSS and PureData technologies gives Discovery the ability to
put actionable data in the hands of its decision-makers faster. “In
sensitive areas such as patient care and fraud investigation, the
details are everything,” concludes Lizelle Steenkamp. “With the
IBM solution, instead of inferring a ‘near enough’ answer from
high-level summaries of data, we can get the right information,
develop the right models, ask the right questions, and provide
accurate analyses that meet the precise needs of the business.”
Looking to the future, Discovery is also starting to analyze
unstructured data, such as text-based surveys and comments
from online feedback forms.
About BITanium
BITanium believes that the truth lies in data. Data does not
have its own agenda, it does not lie, and it is not influenced by
promotions or bonuses. Data contains the only accurate representation
of what has happened and is actually happening within a
business. BITanium also believes that one of the few remaining
differentiators between mediocrity and excellence is how
a company uses its data.
BITanium is passionate about using technology and
mathematics to find patterns and relationships in data. These
patterns provide insight and knowledge about problems, transforming
them into opportunities. To learn more about services
and solutions from BITanium, please visit bitanium.co.za.
About IBM Business Analytics
IBM Business Analytics software delivers data-driven insights
that help organizations work smarter and outperform their
peers. This comprehensive portfolio includes solutions
for business intelligence, predictive analytics and decision
management, performance management, and risk management.
Business Analytics solutions enable companies to
identify and visualize trends and patterns in areas, such as
customer analytics, that can have a profound effect on business
performance. They can compare scenarios; anticipate
potential threats and opportunities; better plan, budget, and
forecast resources; balance risks against expected returns; and
work to meet regulatory requirements. By making analytics
widely available, organizations can align tactical and strategic
decision-making to achieve business goals. For more information,
you may visit ibm.com/business-analytics.

QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE
1. How big is Big Data for Discovery Health?
2. What big data sources did Discovery Health use for their analytic solutions?
3. What were the main data/analytics challenges Discovery Health was facing?
4. What were the main solutions they have produced?
5. What were the initial results/benefits? What do you think will be the future of Big Data analytics at Discovery?

Source: IBM Customer Story, “Discovery Health Turns Big Data into Better Healthcare,” public.dhe.ibm.com/common/ssi/ecm/en/ytc03619zaen/YTC03619ZAEN.PDF (accessed October 2013).

References
Awadallah, A., and D. Graham. (2012). “Hadoop and the Data Warehouse: When to Use Which.” White paper by Cloudera and Teradata. teradata.com/white-papers/Hadoop-and-the-Data-Warehouse-When-to-Use-Which (accessed March 2013).
Davenport, T. H., and D. J. Patil. (2012, October). “Data Scientist.” Harvard Business Review, pp. 70–76.
Dean, J., and S. Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters.” research.google.com/archive/mapreduce.html (accessed March 2013).
Delen, D., M. Kletke, and J. Kim. (2005). “A Scalable Classification Algorithm for Very Large Datasets.” Journal of Information and Knowledge Management, Vol. 4, No. 2, pp. 83–94.
Ericsson. (2012). “Proof of Concept for Applying Stream Analytics to Utilities.” Ericsson Labs, Research Topics, labs.ericsson.com/blog/proof-of-concept-for-applying-stream-analytics-to-utilities (accessed March 2013).
Issenberg, S. (2012, October 29). “Obama Does It Better” (from “Victory Lab: The New Science of Winning Campaigns”), Slate.
Jonas, J. (2007). “Streaming Analytics vs. Perpetual Analytics (Advantages of Windowless Thinking).” jeffjonas.typepad.com/jeff_jonas/2007/04/streaming_analy.html (accessed March 2013).
Kelly, L. (2012). “Big Data: Hadoop, Business Analytics and Beyond.” wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond (accessed January 2013).
Kelly, L. (2013). “Big Data Vendor Revenue and Market Forecast 2012–2017.” wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017 (accessed March 2013).
Romano, L. (2012, June 9). “Obama’s Data Advantage.” Politico.
Russom, P. (2013). “Busting 10 Myths about Hadoop: The Big Data Explosion.” TDWI’s Best of Business Intelligence, Vol. 10, pp. 45–46.
Samuelson, D. A. (2013, February). “Analytics: Key to Obama’s Victory.” INFORMS’ ORMS Today, pp. 20–24.
Scherer, M. (2012, November 7). “Inside the Secret World of the Data Crunchers Who Helped Obama Win.” Time.
Shen, G. (2013, January–February). “Big Data, Analytics, and Elections.” INFORMS’ Analytics Magazine.
Watson, H. (2012). “The Requirements for Being an Analytics-Based Organization.” Business Intelligence Journal, Vol. 17, No. 2, pp. 42–44.
Watson, H., R. Sharda, and D. Schrader. (2012). “Big Data and How to Teach It.” Workshop at AMCIS, Seattle, WA.
White, C. (2012). “MapReduce and the Data Scientist.” Teradata Aster white paper. teradata.com/white-paper/MapReduce-and-the-Data-Scientist (accessed February 2013).
Zikopoulos, P., D. DeRoos, K. Parasuraman, T. Deutsch, D. Corrigan, and J. Giles. (2013). Harness the Power of Big Data. New York: McGraw-Hill Publishing.
CHAPTER 14
Business Analytics: Emerging Trends and Future Impacts
LEARNING OBJECTIVES
• Explore some of the emerging technologies that may impact analytics, BI, and decision support
• Describe how geospatial and location-based analytics are assisting organizations
• Describe how analytics are powering consumer applications and creating a new opportunity for entrepreneurship in analytics
• Describe the potential of cloud computing in business intelligence
• Understand Web 2.0 and its characteristics as related to analytics
• Describe organizational impacts of analytics applications
• List and describe the major ethical and legal issues of analytics implementation
• Understand the analytics ecosystem to get a sense of the various types of players in the analytics industry and how one can work in a variety of roles
This chapter introduces several emerging technologies that are likely to have major
impacts on the development and use of business intelligence applications. Many
other interesting technologies are also emerging, but we have focused on some
trends that have already been realized and others that are about to impact analytics
further. Using a crystal ball is always a risky proposition, but this chapter provides a
framework for analysis of emerging trends. We introduce and explain some emerging
technologies and explore their current applications. We then discuss the organizational,
personal, legal, ethical, and societal impacts of support systems that may affect their
implementation. We conclude with a description of the analytics ecosystem. This section
should help readers appreciate different career possibilities within the realm of analytics.
This chapter contains the following sections:
14.1 Opening Vignette: Oklahoma Gas and Electric Employs Analytics to Promote Smart Energy Use
14.2 Location-Based Analytics for Organizations
14.3 Analytics Applications for Consumers
14.4 Recommendation Engines
14.5 Web 2.0 and Online Social Networking
14.6 Cloud Computing and BI
14.7 Impacts of Analytics in Organizations: An Overview
14.8 Issues of Legality, Privacy, and Ethics
14.9 An Overview of the Analytics Ecosystem
14.1 OPENING VIGNETTE: Oklahoma Gas and Electric
Employs Analytics to Promote Smart Energy Use
Oklahoma Gas and Electric (OG&E) serves over 789,000 customers in Oklahoma and
Arkansas. OG&E has a strategic goal to delay building n ew fossil fuel generation plants
until the year 2020. OG&E forecasts a daily system demand of 5,864 megawatts in 2020, a
reduction of about 500 megawatts.
One of the ways to optimize this demand is to engage the consumers in managing
their energy usage. OG&E has completed installation of smart meters and other devices
on the electronic grid at the consumer end that enable it to capture large amounts of
data. For example, currently it receives about 52 million meter reads per day. Apart
from this, OG&E expects to receive close to 2 million event messages per day from its
advanced metering infrastructure, data networks, meter alarms, and outage management
systems. OG&E employs a three-layer information architecture involving a data warehouse,
improved and expanded integration and data management, and new analytics and pre-
sentation capabilities to support the Big Data flow.
With this data, OG&E has started working on consumer-oriented efficiency programs
to shift the customer’s usage out of peak demand cycles. OG&E is targeting customers
with its smart hours plan. This plan encourages customers to choose a variety of rate
options sent via phone, text, or e-mail. These rate options offer attractive summer rates
for all other hours apart from the peak hours of 2 P.M. to 7 P.M. OG&E is making an invest-
ment in customers by supplying a communicating thermostat that will respond to the
price signals sent by OG&E and help customers in managing their utility consumption.
OG&E also educates its customers on their usage habits by providing 5-minute interval
data every 15 minutes to the demand-responsive customers.
OG&E has developed consumer analytics and customer segmentation analytics that
will enhance their understanding about individuals’ responses to the price signals and
identify the best customers to be targeted with specific marketing campaigns. It also uses
demand-side management analytics for peak load management/load shedding. With Teradata’s
platform, OG&E has combined its smart meter data, outage data, call center data, rate data,
asset data, price signals, billing, and collections into one integrated data platform. The
platform also incorporates geospatial mapping of the integrated data using the in-database
geospatial analytics that add to OG&E’s dynamic segmentation capabilities.
Using geospatial mapping and visual analytics, OG&E now views a near-real-time
version of data about its energy-efficient prospects spread over geographic areas and
comes up with marketing initiatives that are most suitable for these customers. OG&E now
has an easy way to narrow down to the specific customers in a geographic region based
on their meter usage; OG&E can also find noncommunicating smart meters. Furthermore,
OG&E can track outages, with the deployed crews supporting outages as well as the
weather overlay of their service areas. This combination of field infrastructure, geospatial data,
enterprise data warehouse, and analytics has enabled OG&E to manage its customer
demand in such a way that it can optimize its long-term investments.
QUESTIONS FOR THE OPENING VIGNETTE
1. Why perform consumer analytics?
2. What is meant by dynamic segmentation?
3. How does geospatial mapping help OG&E?
4. What types of incentives might consumers respond to in changing their energy use?
WHAT WE CAN LEARN FROM THIS VIGNETTE
Many organizations are now integrating the data from their different internal units and
turning toward analytics to convert the integrated data into value. The ability to view the
operations/customer-specific data using in-database geospatial analytics gives organizations
a broader perspective and aids in decision making.
Sources: Teradata.com, “Utilities Analytic Summit 2012 Oklahoma Gas & Electric,” teradata.com/video/Utilities-Analytic-Summit-2012-Oklahoma-Gas-and-Electric (accessed March 2013); ogepet.com, “Smart Hours,” ogepet.com/programs/smarthours.aspx (accessed March 2013); IntelligentUtility.com, “OGE’s Three-Tiered Architecture Aids Data Analysis,” intelligentutility.com/article/12/02/oges-three-tiered-architecture-aids-data-analysis (accessed March 2013).
14.2 LOCATION-BASED ANALYTICS FOR ORGANIZATIONS
The goal of this chapter is to illustrate the potential of new technologies when innovative
uses are developed by creative minds. Most of the technologies described in this chapter
are nascent and have yet to see widespread adoption. Therein lies the opportunity to create
the next “killer” application. For example, use of RFID and sensors is growing, with
each company exploring its use in supply chains, retail stores, manufacturing, or service
operations. The chapter argues that with the right combination of ideas, networking, and
applications, it is possible to develop creative technologies that have the potential to
impact a company’s operations in multiple ways, or to create entirely new markets and
make a major difference to the world. We also study the analytics ecosystem to better
understand which companies are the players in this industry.
Thus far, we have seen many examples of organizations employing analytical
techniques to gain insights into their existing processes through informative reporting,
predictive analytics, forecasting, and optimization techniques. In this section, we learn
about a critical emerging trend: incorporation of location data in analytics. Figure 14.1
gives our classification of location-based analytic applications. We first review applications
that make use of static location data, which is usually called geospatial data. We then
examine the explosive growth of applications that take advantage of all the location
data being generated by today’s devices. This section focuses on analytics applications
that are being developed by organizations to make better decisions in managing operations
(as was illustrated in the opening vignette), targeting customers, promotions, and
so forth. In the following section we will explore analytics applications that are being
developed to be used directly by a consumer, some of which also take advantage of the
location data.
Geospatial Analytics
A consolidated view of the overall performance of an organization is usually represented
through visualization tools that provide actionable information. The information may
include current and forecasted values of various business factors and key performance
indicators (KPIs). Looking at the key performance indicators as overall numbers via
various graphs and charts can be overwhelming. There is a high risk of missing potential
growth opportunities or not identifying the problematic areas. As an alternative to simply
viewing reports, organizations employ visual maps that are geographically mapped and
based on traditional location data, usually grouped by postal codes. These map-based
visualizations have been used by organizations to view the aggregated data and get
more meaningful location-based insights. Although this approach has advantages, the use
of postal codes to represent the data is a static approach suitable for achieving a
higher-level view of things.

[Figure 14.1 (diagram not reproduced): location-based analytics divides into organization-oriented and consumer-oriented applications, each with a geospatial (static) approach and a location-based (dynamic) approach. Organization-oriented examples: examining geographic site locations (static); live location feeds and real-time marketing promotions (dynamic). Consumer-oriented examples: GPS navigation and data analysis (static); historic and current location demand analysis, predictive parking, and health-social networks (dynamic).]
FIGURE 14.1 Classification of Location-Based Analytics Applications.
Traditional location-based analytic techniques using geocoding of organizational
locations and consumers hamper organizations in understanding “true location-based”
impacts. Locations based on postal codes offer an aggregate view of a large geographic
area. This poor granularity may not be able to pinpoint the growth opportunities within a
region, and the location of the target customers can change rapidly, so an organization’s
promotional campaigns might not target the right customers. To address these concerns,
organizations are embracing location and spatial extensions to analytics (Gnau, 2010). Adding
location components based on latitudinal and longitudinal attributes to traditional
analytical techniques enables organizations to add a new dimension of “where” to their
traditional business analyses, which currently answer questions of “who,” “what,” “when,”
and “how much.”
Location-based data are now readily available from geographic information systems
(GIS). These are used to capture, store, analyze, and manage data linked to a location
using integrated sensor technologies, global positioning systems installed in smartphones,
or radio-frequency identification deployments in the retail and healthcare industries.
By integrating information about the location with other critical business data,
organizations are now creating location intelligence (LI) (Krivda, 2010). LI is enabling
organizations to gain critical insights and make better decisions by optimizing important
processes and applications. Organizations now create interactive maps that drill
down to details about any location, offering analysts the ability to investigate new trends
and correlate location-specific factors across multiple KPIs. Analysts in the organizations
can now pinpoint trends and patterns in revenues, sales, and profitability across geographical
areas.
By incorporating demographic details into locations, retailers can determine how
sales vary by population level and proximity to other competitors; they can assess the
demand and efficiency of supply chain operations. Consumer product companies can
identify the specific needs of the customers and customer complaint locations, and easily
trace them back to the products. Sales reps can better target their prospects by analyzing
their geography (Krivda, 2010).
Integrating detailed global intelligence, real-time location information, and logistics
data in a visual, easy-to-access format, the U.S. Transportation Command (USTRANSCOM)
can easily track information about the type of aircraft, maintenance history, complete
list of crew, the equipment and supplies on the aircraft, and the location of the aircraft.
Having this information enables it to make well-informed decisions and coordinate
global operations, as noted in Westholder (2010).
Additionally, with location intelligence, organizations can quickly overlay weather
and environmental effects and forecast the level of impact on critical business operations.
With technology advancements, geospatial data is now being directly incorporated in
enterprise data warehouses. Location-based in-database analytics enable organizations
to perform complex calculations with increased efficiency and get a single view of
all the spatially oriented data, revealing hidden trends and new opportunities. For
example, Teradata’s data warehouse supports the geospatial data feature based on the
SQL/MM standard. The geospatial feature is captured as a new geometric data type called
ST_GEOMETRY. It supports a large spectrum of shapes, from simple points, lines, and
curves to complex polygons, in representing geographic areas. Organizations are converting
the nonspatial data of their operating business locations by incorporating the latitude and
longitude coordinates. This process of geocoding is readily supported by service companies
like NAVTEQ and Tele Atlas, which maintain worldwide databases of addresses with
geospatial features, and makes use of address-cleansing tools like Informatica and Trillium,
which support mapping of spatial coordinates to addresses as part of extract, transform,
and load functions.
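A basic primitive beneath many of these geospatial analyses is the distance between latitude/longitude points. The following minimal Python sketch (all coordinates are illustrative) computes great-circle (haversine) distances from a candidate site to existing locations, the kind of simple proximity screen that richer drive-time analyses refine:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points, in kilometers."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius of about 6371 km

    stores = {"Store A": (36.12, -97.07), "Store B": (35.47, -97.52), "Store C": (36.15, -95.99)}
    candidate_site = (36.13, -97.06)

    for name, (lat, lon) in stores.items():
        d = haversine_km(candidate_site[0], candidate_site[1], lat, lon)
        print(f"{name}: {d:.1f} km from candidate site")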
Organizations across a variety of business sectors are employing geospatial analyt-
ics. We will review some examples next. Sabre Airline Solutions’ application, Traveler
Security, uses a geospatial-enabled dashboard that alerts users to assess the current
risks across global hotspots displayed in interactive maps. Using this, airline personnel
can easily find current travelers and respond quickly in the event of any travel disruption.
Application Case 14.1 provides an example of how location-based information was used
in making site selection decisions in expanding a company’s footprint.
Application Case 14.1
Great Clips Employs Spatial Analytics to Shave Time in Location Decisions
Great Clips, the world’s largest and fastest growing
salon, has more than 3,000 salons throughout
the United States and Canada. Great Clips’ franchise
success depends on a growth strategy that is
driven by rapidly opening new stores in the right
locations and markets. The company needed to
analyze locations based on the requirements
for a potential customer base, demographic trends,
and sales impact on existing franchises in the target
location. Choosing a good site is of utmost
importance. The current process took a long
time to analyze a single site, and a great deal of
labor-intensive analyst effort was
needed to manually assess the data from multiple
data sources.
With thousands of locations analyzed each
year, the delay was risking the loss of prime sites
to competitors and was proving expensive: Great
Clips employed external contractors to cope with
the delay. Great Clips created a site-selection
workflow application to evaluate the new salon site
locations by using the geospatial analytical capabilities
of Alteryx. A new site location was evaluated
by its drive-time proximity and convenience
for serving all the existing customers of the Great
Clips salon network in the area. The Alteryx-based
solution also enabled evaluation of each new location
based on demographics and consumer behavior
data, aligning with existing Great Clips customer
profiles and the potential revenue impact of the
new site on the existing sites. As a result of using
location-based analytic techniques, Great Clips was
able to reduce the time to assess new locations by
nearly 95 percent. The labor-intensive analysis was
automated and developed into a data collection,
analysis, mapping, and reporting application that
could be easily used by nontechnical real estate
managers. Furthermore, it enabled the company to
implement proactive predictive analytics for a new
franchise location because the whole process now
took just a few minutes.
QUESTIONS FOR DISCUSSION
1. How is geospatial analytics employed at Great
Clips?
2. What criteria should a company consider in eval-
uating sites for future locations?
3. Can you think of other applications where such
geospatial data might be useful?
Source: alteryx.com, “Great Clips,” alteryx.com/sites/default/files/resources/files/case-study-great-clips.pdf (accessed March 2013).
In addition to the retail transaction analysis applications highlighted here, there
are many other applications of combining geographic information with other data
being generated by an organization. The opening vignette described a use of such
location information in understanding location-based energy usage as well as outages.
Similarly, network operations and communication companies often generate massive
amounts of data every day. The ability to analyze the data quickly with a high level
of location-specific granularity can better identify customer churn and help in formulating
location-specific strategies for increasing operational efficiency, quality of
service, and revenue.
Geospatial analysis can enable communication companies to capture daily transactions
from a network to identify the geographic areas experiencing a large number
of failed connection attempts for voice, data, text, or Internet. Analytics can help determine
the exact causes based on location and drill down to an individual customer to
provide better customer service. You can see this in action by completing the following
multimedia exercise.
A Multimedia Exercise in Analytics Employing Geospatial Analytics
Teradata University Network includes a BSI video on the case of dropped mobile calls. Please
watch the video that appears on YouTube at the following link: teradatauniversitynetwork.com/teach-and-learn/library-item/?LibraryItemId=893
A telecommunication company launches a new line of smartphones and faces problems
of dropped calls. The new rollout is in trouble, and the northeast region is the
worst-hit region when the effects of dropped calls on profit are compared across geographic
regions. The company hires BSI to analyze the problems arising due to defects in
smartphone handsets, tower coverage, and software glitches. The entire northeast region's
data is divided into geographic clusters, and the problem is solved by drilling down to
individual customer data. The BSI team employs geospatial analytics to identify the locations
where network coverage was leading to the dropped calls and suggests installing
a few additional towers where the unhappy customers are located. They also work with
the companies involved on various actions that ensure that the problem is addressed.
After the video is complete, you can see how the analysis was prepared on a slide
set at: slideshare.net/teradata/bsi-teradata-the-case-of-the-dropped-mobile-calls
This multimedia excursion provides an example of a combination of geospatial analytics
along with Big Data analytics that assists in better decision making.
Real-Time Location Intelligence
Many devices in use by consumers and professionals are constantly sending out their location information. Cars, buses, taxis, mobile phones, cameras, and personal navigation devices all transmit their locations thanks to network-connected positioning technologies such as GPS, Wi-Fi, and cell tower triangulation. Millions of consumers and businesses use location-enabled devices for finding nearby services, locating friends and family, navigating, tracking assets and pets, dispatching, and engaging in sports, games, and hobbies. This surge in location-enabled services has resulted in a massive database of historical and real-time streaming location information. It is, of course, scattered and by itself not very useful. Indeed, a new name has been given to this type of data mining: reality mining. Eagle and Pentland (2006) appear to have been the first to use this term. Reality mining builds on the idea that these location-enabled data sets could provide remarkable real-time insight into aggregate human activity trends. For example, a British company called Path Intelligence (pathintelligence.com) has developed a system called Footpath that ascertains how people move within a city or even within a store. All of this is done by automatically tracking movement without any cameras recording the movement visually. Such analysis can help determine the best layout for products or even public transportation options. The automated data collection enabled through capture of cell phone and Wi-Fi hotspot access points presents an interesting new dimension in nonintrusive market research data collection and, of course, microanalysis of such massive data sets.
By analyzing and learning from these large-scale patterns of movement, it is possible to identify distinct classes of behaviors in specific contexts. This approach allows a business to better understand its customer patterns and also to make more informed decisions about promotions, pricing, and so on. By applying algorithms that reduce the dimensionality of location data, one can characterize places according to the activity and movement between them. From massive amounts of high-dimensional location data, these algorithms uncover trends, meaning, and relationships to eventually produce human-understandable representations. It then becomes possible to use such data to automatically make intelligent predictions and find important matches and similarities between places and people.
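To make this concrete, the following simplified Python sketch clusters raw location fixes into candidate "places" with k-means and uses cluster size as a crude activity measure; the coordinates, the choice of three clusters, and the use of scikit-learn are illustrative assumptions, not a description of any vendor's actual pipeline.

# A minimal sketch of characterizing places from location data.
# Assumptions: location fixes as (lat, lon) pairs; k-means from
# scikit-learn as the clustering algorithm; k chosen arbitrarily.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical location fixes (latitude, longitude) from many devices
fixes = np.array([
    [40.7580, -73.9855], [40.7582, -73.9851],   # around Times Square
    [40.7484, -73.9857], [40.7486, -73.9860],   # around the Empire State Building
    [40.7527, -73.9772], [40.7530, -73.9770],   # around Grand Central
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(fixes)
for label in range(3):
    members = fixes[kmeans.labels_ == label]
    center = kmeans.cluster_centers_[label]
    # Each cluster is a candidate "place"; its size proxies activity level
    print(f"place {label}: center={center.round(4)}, visits={len(members)}")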
Location-based analytics finds its application in consumer-oriented marketing applications. Quiznos, a quick-service restaurant, used Sense Networks' platform to analyze the location trails of mobile users based on geospatial data obtained from GPS and to target tech-savvy customers with coupons. See Application Case 14.2. This case illustrates an emerging trend in the retail space where companies are looking to improve the efficiency of marketing campaigns, not just by targeting every customer based on real-time location, but by employing more sophisticated predictive analytics in real time on consumer behavioral profiles and finding the right set of consumers for the advertising campaigns.
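A highly simplified sketch of the audience-selection logic behind such a campaign (see the criteria described in Application Case 14.2) might look like the following; the profile fields, thresholds, and coordinates are invented for illustration.

# Hedged sketch of location-based behavioral targeting, loosely modeled
# on the campaign criteria in Application Case 14.2. All fields and
# numbers are illustrative assumptions.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Profile:
    age: int
    days_since_qsr_visit: int  # days since last quick-service restaurant visit
    lat: float
    lon: float

def miles_between(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in miles
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))

def eligible(p, store_lat, store_lon):
    # Campaign rule: ages 18-34, visited a QSR in the past 30 days,
    # currently within a 3-mile radius of the store
    return (18 <= p.age <= 34
            and p.days_since_qsr_visit <= 30
            and miles_between(p.lat, p.lon, store_lat, store_lon) <= 3.0)

profiles = [Profile(25, 12, 45.52, -122.68), Profile(40, 5, 45.53, -122.67)]
audience = [p for p in profiles if eligible(p, 45.5231, -122.6765)]
print(len(audience))  # -> 1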
Many mobile applications now enable organizations to target the right customer by building a profile of customers' behavior over geographic locations. For example, the Radii app takes the customer experience to a whole new level. The Radii app collects
Application Case 14.2
Quiznos Targets Customers for Its Sandwiches
Quiznos, a franchised, quick-service restaurant, implemented a location-based mobile targeting campaign that targeted the tech-savvy and busy consumers of Portland, Oregon. It made use of Sense Networks' platform, which analyzed the location trails of mobile users over detailed time periods and built anonymous profiles based on the behavioral attributes of shopping habits.
With the application of predictive analytics on the user profiles, Quiznos employed location-based behavioral targeting to narrow the characteristics of users who are most likely to eat at a quick-service restaurant. Its advertising campaign ran for 2 months, November and December 2012, and targeted only potential customers who had been to quick-service restaurants over the past 30 days, within a 3-mile radius of Quiznos, and between the ages of 18 and 34. It used relevant mobile advertisements of local coupons based on the customer's location. The campaign resulted in over 3.7 million new customers and a 20 percent increase in coupon redemptions within the Portland area.
QUESTIONS FOR DISCUSSION
1. How can location-based analytics help retailers in targeting customers?
2. Research similar applications of location-based analytics in the retail domain.
Source: Mobilemarketer.com, "Quiznos Sees 20pc Boost in Coupon Redemption via Location-Based Mobile Ad Campaign," mobilemarketer.com/cms/news/advertising/14738.html (accessed February 2013).
information about the user's habits, interests, spending patterns, and favorite locations to understand their personality. Radii uses the Gimbal Context Awareness SDK to gather location and geospatial information. Gimbal SDK's geofencing functionality enables Radii to pick up the user's interests and habits based on the time they spend at a location and how often they visit it. Depending on the number of users who visit a particular location, and based on their preferences, Radii assigns a personality to that location, which changes based on which type of user visits the location and their preferences. New users are given recommendations that are closer to their personality, making this process highly dynamic.
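The general mechanics of geofence-based dwell-time profiling can be sketched as follows; the event format and the interest rule below are hypothetical reconstructions, not the Gimbal SDK's actual API.

# Hypothetical dwell-time profiling from geofence enter/exit events.
# Event format and the interest rule are illustrative assumptions.
from collections import defaultdict
from datetime import datetime

events = [  # (user, place, enter_time, exit_time)
    ("u1", "coffee_shop", datetime(2013, 3, 1, 8, 0), datetime(2013, 3, 1, 8, 40)),
    ("u1", "coffee_shop", datetime(2013, 3, 2, 8, 5), datetime(2013, 3, 2, 8, 35)),
    ("u1", "gym", datetime(2013, 3, 1, 18, 0), datetime(2013, 3, 1, 18, 5)),
]

dwell = defaultdict(lambda: {"visits": 0, "minutes": 0.0})
for user, place, enter, exit_ in events:
    key = (user, place)
    dwell[key]["visits"] += 1
    dwell[key]["minutes"] += (exit_ - enter).total_seconds() / 60

# Flag a place as an "interest" if the user visits often and lingers
for (user, place), stats in dwell.items():
    if stats["visits"] >= 2 and stats["minutes"] / stats["visits"] >= 20:
        print(f"{user} appears interested in {place}")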
Users who sign up for Radii receive 10 "Radii," which is their currency. Users can use this currency at select locations to get discounts and special offers. They can also get more Radii by inviting their friends to use the app. Businesses that offer these discounts pay Radii for bringing customers to their location, as this in turn translates into more business. For every Radii exchanged between users, Radii is paid a certain amount. Radii thus creates a new direct marketing platform for businesses and enhances the customer experience by providing recommendations, discounts, and coupons.
Yet another extension of location-based analytics is to use augmented reality. Cachetown has introduced a location-sensing, augmented reality-based game to encourage users to claim offers from select geographic locations. The user can start anywhere in a city and follow markers on the Cachetown app to reach a coupon, discount, or offer from a business. Virtual items are visible through the Cachetown app when the user points a phone's camera toward the virtual item. The user can then claim this item by clicking on it through the Cachetown app. On claiming the item, the user is given a certain free good, discount, or offer from a nearby business, which he or she can use just by walking into the store.
Cachetown's business-facing app allows businesses to place these virtual items on a map using Google Maps. The placement of an item can be fine-tuned by using Google's Street View. Once all virtual items have been configured with information
and location, the business can submit the items, after which they are visible to the user in real time. Cachetown also provides usage analytics to the business to enable better targeting of virtual items. The augmented reality aspect of this app improves the experience of users, providing them with a "gaming"-type environment in real life. At the same time, it provides a powerful marketing platform for businesses to reach their customers better. More information on Cachetown is at candylab.com/augmented-reality/.
As is evident from this section, location-based analytics and ensuing applications are perhaps the most important front in the near future for organizations. A common theme in this section was the use of operational or marketing data by organizations. We will next explore analytics applications that are directly targeted at the users and sometimes take advantage of location information.
SECTION 14.2 REVIEW QUESTIONS
1. How does traditional analytics make use of location-based data?
2. How can geocoded locations assist in better decision making?
3. What is the value provided by geospatial analytics?
4. Explore the use of geospatial analytics further by investigating its use across various sectors like government census tracking, consumer marketing, and so forth.
14.3 ANALYTICS APPLICATIONS FOR CONSUMERS
The explosive growth of the apps industry for smartphone platforms (iOS, Android, Windows, BlackBerry, Amazon, and so forth) and the use of analytics are also creating tremendous opportunities for developing apps that consumers can use directly. These apps differ from the previous category in that they are meant for direct use by a consumer rather than by an organization that is trying to mine a consumer's usage/purchase data to create a profile for marketing specific products or services to them. Predictably, these apps are meant to enable consumers to do their jobs better. We highlight two of these in the following examples.
Sense Networks has built a mobile application called CabSense that analyzes large amounts of data from the New York City Taxi and Limousine Commission and helps New Yorkers and visitors find the best corners for hailing a taxi based on the person's location, day of the week, and time. CabSense rates the street corners on a 5-point scale by applying machine-learning algorithms to the vast amounts of historical location points obtained from the pickups and drop-offs of all New York City cabs. Although the app does not give the exact location of cabs in real time, its data-crunching predictions enable people to get to a street corner that has the highest probability of finding a cab.
CabSense provides an interactive map based on the current user location obtained from the mobile phone's GPS locator to find the best street corners for finding an open cab. It also provides a radar view that automatically points in the right direction toward the best street corner. The application also allows users to plan in advance, set up the date and time of travel, and view the best corners for finding a taxi. Furthermore, CabSense distinguishes New York's Yellow Cab services from the for-hire vehicles and readily prompts the users with relevant details of private service providers that can be used in case no Yellow Cabs are available.
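A toy version of the underlying idea, scoring corners from historical pickup counts bucketed by weekday and hour, appears below; the data layout and the min-max mapping onto a 1-to-5 scale are illustrative stand-ins for CabSense's proprietary machine-learning model.

# Toy corner scoring from historical pickups, bucketed by (corner, weekday, hour).
# The linear min-max 1-5 rating is an illustrative stand-in for CabSense's model.
from collections import Counter

# Hypothetical pickup records: (corner_id, weekday, hour)
pickups = [
    ("5th&42nd", "Mon", 8), ("5th&42nd", "Mon", 8), ("5th&42nd", "Mon", 8),
    ("7th&34th", "Mon", 8), ("7th&34th", "Mon", 8),
    ("Bway&96th", "Mon", 8),
]

counts = Counter(pickups)

def rate_corners(weekday, hour):
    bucket = {c: n for (c, d, h), n in counts.items() if d == weekday and h == hour}
    if not bucket:
        return {}
    lo, hi = min(bucket.values()), max(bucket.values())
    span = max(hi - lo, 1)
    # Map historical pickup volume linearly onto a 1-5 scale
    return {c: 1 + round(4 * (n - lo) / span) for c, n in bucket.items()}

print(rate_corners("Mon", 8))  # {'5th&42nd': 5, '7th&34th': 3, 'Bway&96th': 1}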
Another transportation-related app that uses predictive analytics has been deployed in Pittsburgh, Pennsylvania. Developed in collaboration with Carnegie Mellon University, this app includes predictive capabilities to estimate parking availability. ParkPGH directs drivers to parking lots in the area where parking is available. It tracks the number of parking spaces available in 10 lots, over 5,300 spaces and 25 percent of the garage parking in downtown Pittsburgh. Available spaces are updated every 30 seconds, keeping the driver as close to the current availability as possible. The app is also capable of predicting parking availability for the time when the driver reaches the destination. Based on historical demand and current events, the app is able to provide information on which lots will have free space by the time the driver gets to the destination. The app's underlying algorithm uses data on current events around the area, for example, a basketball game, to predict an increase in demand for parking spaces later that day, thus saving commuters valuable time searching for parking in the busy city. Both of these examples show consumer-oriented applications of location-based analytics in transportation.
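A deliberately simple sketch of the kind of prediction such an app performs is shown below; the baseline-plus-event-adjustment model and all of its figures are assumed for illustration and are not ParkPGH's published algorithm.

# Simplified parking-availability forecast: historical baseline occupancy
# adjusted for scheduled events. Capacities and multipliers are
# illustrative assumptions, not ParkPGH's actual parameters.

capacity = {"Grant_St_Garage": 700, "Smithfield_Garage": 530}

# Hypothetical average fraction of spaces occupied, by (lot, hour)
baseline_occupancy = {
    ("Grant_St_Garage", 19): 0.65,
    ("Smithfield_Garage", 19): 0.80,
}

# Demand multipliers for events near each lot (e.g., a basketball game)
event_boost = {("Grant_St_Garage", "basketball_game"): 1.4,
               ("Smithfield_Garage", "basketball_game"): 1.1}

def predicted_free_spaces(lot, hour, events):
    occ = baseline_occupancy.get((lot, hour), 0.5)
    for ev in events:
        occ *= event_boost.get((lot, ev), 1.0)
    occ = min(occ, 1.0)  # occupancy cannot exceed capacity
    return round(capacity[lot] * (1 - occ))

print(predicted_free_spaces("Grant_St_Garage", 19, ["basketball_game"]))  # ~63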
Application Case 14.3 illustrates another consumer-oriented application, but in the health domain. There are many more health-related apps.
Application Case 14.3
A Life Coach in Your Pocket
Most people today are finding ways to stay active and healthy. Although everyone knows it's best to follow a healthy lifestyle, people often lack the motivation needed to keep them on track. 100Plus, a start-up company, has developed a personalized, mobile prediction platform called Outside that keeps users active. The application is based on the quantified-self approach, which makes use of technology to self-track data on a person's habits, analyze it, and make personalized recommendations.
100Plus posited that people are most likely to succeed in changing their lifestyles when they are given small, micro goals that are easier to achieve. They built Outside as a personalized product that engages people in these activities and enables them to understand the long-term impacts of short-term activities.
After the user enters basic data such as gender, age, weight, height, and the location where he or she lives, a behavior profile is built and compared with data from Practice Fusion and CDC records. A life score is calculated using predictive analytics. This score gives the estimated life expectancy of the user. Once registered, users can begin discovering health opportunities, which are categorized as "missions" on the mobile interface. These missions are specific to places based on the user's location. Users can track activities, complete them, and get a score that is credited back to the life score. Outside also enables its users to create diverse, personalized suggestions by keeping track of photographs of them doing each activity. These can be used as suggestions to others, based on their location and preferences. A leaderboard allows a particular user to see how other people with similar characteristics are completing their missions and inspires the current user to pursue healthier living. In that sense it also combines social media with predictive analytics.
Today, most smartphones are equipped with accelerometers and gyroscopes that measure jerk and orientation and sense motion. Many applications use these data to make the user's experience on the smartphone better. Data on accelerometer and gyroscope readings are publicly available and can be used to classify various activities like walking, running, lying down, and climbing. Kaggle (kaggle.com), a platform that hosts competitions and research for predictive modeling and analytics, recently hosted a competition aimed at identifying muscle motions that may be used to predict the progression of Parkinson's disease. Parkinson's disease is caused by a failure in the central nervous system, which leads to tremors, rigidity, slowness of movement, and postural instability. The objective of the competition is to best identify markers that can lead
to predicting the progression of the disease. This particular application of advanced technology and analytics is an example of how the two can come together to generate extremely useful and relevant information.
QUESTIONS FOR DISCUSSION
1. Search online for other examples of consumer-oriented analytical applications.
2. How can location-based analytics help individual consumers?
3. How can smartphone data be used to predict medical conditions?
4. How is ParkPGH different from a "parking space-reporting" app?
Source: Institute of Medicine of the National Academies, "Health Data Initiative Forum III: The Health Datapalooza," iom.edu/Activities/PublicHealth/HealthData/2012-JUN-05/Afternoon-Apps-Demos/outside-100plus.aspx (accessed March 2013).
Analytics-based applications are emerging not just for fun and health, but also to enhance one's productivity. For example, Cloze is an app that manages in-boxes from multiple e-mail accounts in one place. It integrates social networks with e-mail contacts to learn which contacts are important and assigns each a score, with a higher score for more important contacts. E-mails with a higher score are shown first, thus filtering less important and irrelevant e-mails out of the way. Cloze stores the context of each conversation to save time when catching up with a pending conversation. Contacts are organized into groups based on how frequently they interact, helping users keep in touch with people with whom they may be losing contact. Users are able to set a Cloze score for people they want to get in touch with and work on improving that score. Cloze marks up the score whenever an attempt at connecting is made.
On opening an e-mail, Cloze provides several options, such as now, today, tomorrow, and next week, which automatically remind the user to initiate contact at the scheduled time. This serves as a reminder to get back to e-mails at a later point without simply forgetting about them or marking them as "unread," which often leads to a cluttered in-box.
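The gist of such contact scoring can be captured in a few lines of Python; the recency and frequency weighting below is a guess at the general approach, not Cloze's actual (proprietary) formula.

# Illustrative contact-importance score combining how often and how
# recently two people interact. All weights are assumptions.
import math

def contact_score(emails_90d, days_since_last, shared_networks):
    frequency = math.log1p(emails_90d)          # diminishing returns on volume
    recency = math.exp(-days_since_last / 30)   # decays with silence
    social = 0.5 * shared_networks              # overlap on social networks
    return round(10 * frequency * recency + social, 1)

print(contact_score(emails_90d=40, days_since_last=3, shared_networks=2))  # high
print(contact_score(emails_90d=2, days_since_last=60, shared_networks=0))  # low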
As is evident from these examples of consumer-centric apps, predictive analytics is beginning to enable the development of software that is directly used by a consumer. The Wall Street Journal (wsj.com/apps) estimates that the app industry has already become a $25 billion industry, with more growth expected. We believe that consumer-oriented analytic applications will continue to grow and create many entrepreneurial opportunities for the readers of this book.
One key concern in employing these technologies is the loss of privacy. If someone can track the movement of a cell phone, the privacy of that customer is a big issue. Some app developers claim that they only need to gather aggregate flow information, not individually identifiable information. But many stories appear in the media that highlight violations of this general principle. Both users and developers of such apps have to be very aware of the deleterious effects of giving out private information as well as collecting such information. We discuss this issue a little further in Section 14.8.
SECTION 14.3 REVIEW QUESTIONS
1. What are the various options that CabSense provides to users?
2. Explore more transportation applications that may employ location-based analytics.
3. Briefly describe how the data are used to create profiles of users.
4. What other applications can you imagine if you were able to access cell phone location data? Do a search on location-enabled services.
14.4 RECOMMENDATION ENGINES
In most decision situations, people rely on recommendations gathered either directly from other people or indirectly through the aggregated recommendations made by others in the form of reviews and ratings posted in newspapers, product guides, or online. Such information sharing is considered one of the major reasons for the success of online retailers such as Amazon.com. In this section we briefly review the common terms and technologies of such systems, as these are becoming key components of any analytic application.
The term recommender systems refers to Web-based information filtering systems that take inputs from users and then aggregate those inputs to provide recommendations to other users for their product or service selection choices. Some recommender systems now even try to predict the rating or preference that a user would give for a particular product or service.
The data necessary to build a recommendation system are collected by Web-based systems in which each user is asked to rate items on a rating scale, rank items from most favorite to least favorite, and/or list the attributes of the items the user likes. Other information, such as the user's textual comments, feedback reviews, the amount of time the user spends viewing an item, and details of the user's social networking activity, provides behavioral information about the product choices made by the user.
Two basic approaches are employed in the development of recommendation systems: collaborative filtering and content filtering. In collaborative filtering, the recommendation system is built based on individual users' past behavior by keeping track of the previous history of all purchased items. This includes products, items that are viewed most often, and the ratings users give to the items they purchase. These individual profile histories with item preferences are grouped with other, similar user-item profile histories to build a comprehensive set of relations between users and items, which are then used to predict what a user will like and to recommend items accordingly.
Collaborative filtering involves aggregating the user-item profiles. This is usually done by building a user-item ratings matrix, where each row represents a unique user and each column an item, with each cell holding the rating given by that user to that item. The resultant matrix is a dynamic, sparse matrix with huge dimensionality; it gets updated every time an existing user purchases a new item or a new user makes purchases. The recommendation task is then to predict what rating a user would give to a previously unrated item. Items with the highest predicted ratings are then presented as recommendations to the users. The user-item-based approach employs techniques like matrix factorization and low-rank matrix approximation to reduce the dimensionality of the sparse matrix in generating the recommendations.
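To make the matrix-factorization idea concrete, here is a minimal stochastic gradient descent sketch that learns low-dimensional user and item factors from a few known ratings and then predicts a missing cell; the tiny data set, factor size, and learning rate are all illustrative assumptions.

# Minimal matrix factorization by stochastic gradient descent.
# Known (user, item, rating) triples stand in for the sparse ratings
# matrix; hyperparameters are arbitrary illustrative choices.
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user factors
V = rng.normal(scale=0.1, size=(n_items, k))   # item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]
        u_row = U[u].copy()                    # use pre-update user factors below
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_row - reg * V[i])

# Predict the unobserved rating of item 2 by user 0
print(round(float(U[0] @ V[2]), 2))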
Collaborative filtering can also take a user-based approach, in which users play the main role. Similar users sharing the same preferences are combined into a group, and recommendations of items to a particular user are based on the evaluation of items by other users in the same group. If a particular item is ranked high by the entire community, then it is recommended to the user. Another collaborative filtering approach is based on item-set similarity, which groups items based on the ratings provided by various users. Both of these collaborative filtering approaches employ algorithms such as k-nearest neighbor (k-NN) and Pearson correlation to measure the similarity of users and of rating behavior among items.
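For instance, a user-based neighborhood method might compute Pearson correlations over co-rated items and form a weighted average of similar users' ratings, as in the sketch below (the ratings data and the fallback rule are made up for illustration).

# User-based collaborative filtering with Pearson correlation.
# The ratings dictionary and all rules are illustrative assumptions.
from statistics import mean

ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 1, "C": 5, "D": 2},
    "carol": {"A": 2, "B": 5, "D": 5},
}

def pearson(u, v):
    common = ratings[u].keys() & ratings[v].keys()
    if len(common) < 2:
        return 0.0
    mu, mv = mean(ratings[u][i] for i in common), mean(ratings[v][i] for i in common)
    du = [ratings[u][i] - mu for i in common]
    dv = [ratings[v][i] - mv for i in common]
    num = sum(a * b for a, b in zip(du, dv))
    den = (sum(a * a for a in du) * sum(b * b for b in dv)) ** 0.5
    return num / den if den else 0.0

def predict(u, item):
    # Weighted average of positively correlated neighbors' ratings
    neighbors = [(pearson(u, v), v) for v in ratings if v != u and item in ratings[v]]
    neighbors = [(w, v) for w, v in neighbors if w > 0]
    if not neighbors:
        return mean(ratings[u].values())   # fall back to the user's own mean
    return sum(w * ratings[v][item] for w, v in neighbors) / sum(w for w, _ in neighbors)

print(round(predict("alice", "D"), 2))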
Collaborative filtering approaches often require huge amounts of existing data on user-item preferences to make appropriate recommendations; the difficulty of recommending anything for new users or new items, for which such data do not yet exist, is most often referred to as the cold start problem. Also, in the
typical Web-based environment, tapping each individual's ratings and purchase behavior generates large amounts of data, and applying collaborative filtering algorithms requires separate high-end computational power to make the recommendations.
Collaborative filtering is widely employed in e-commerce. Customers can rate books, songs, or movies and then get recommendations on similar items in the future. It is also utilized in browsing documents, articles, other scientific papers, and magazines. Some of the companies using this type of recommender system are Amazon.com and social networking Web sites like Facebook and LinkedIn.
Content-based recommender systems overcome one of the disadvantages of collaborative filtering recommender systems, which rely completely on the user ratings matrix, by considering the specifications and characteristics of items. In the content-based filtering approach, the characteristics of an item are profiled first, and then content-based individual user profiles are built to store information about the characteristics of the specific items that the user has rated in the past. In the recommendation process, the system takes from the user profile the characteristics of items the user has rated positively and compares them with products the user has not rated yet. Recommendations are made when similarities are found in the item characteristics.
Content-based filtering involves using info rmatio n tags o r keywords in fetching
detailed information about item characteristics and restricts this process to a single u ser,
unlike collaborative filtering, w hich looks for similarities between various user profiles.
This approach makes use of machine-learning and classification techniques like Bayesian
classifiers, cluster analysis, decisio n trees , and artificial neural n etworks in o rder to
estimate the probability of recommending similar ite ms to the u sers that m atch the user’s
existing ratings for an item.
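A bare-bones content-based recommender can be sketched with keyword-count vectors and cosine similarity, as follows; the item tags and the liked-item profile are invented, and real systems would use richer features and the classifiers mentioned above.

# Content-based filtering sketch: items as keyword-count vectors, the user
# profile as the sum of liked items' tags, cosine similarity for ranking.
# All tags are invented examples.
from collections import Counter
import math

items = {
    "news_1": ["politics", "election", "debate"],
    "news_2": ["sports", "football", "score"],
    "news_3": ["election", "politics", "poll"],
}
liked = ["news_1"]  # items the user rated positively

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

profile = Counter()
for item in liked:
    profile.update(items[item])

candidates = {i: cosine(profile, Counter(tags))
              for i, tags in items.items() if i not in liked}
print(max(candidates, key=candidates.get))  # -> news_3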
Content-based filtering approaches are widely used in recommending textual content such as news items and related Web pages. They are also used in recommending similar movies and music based on an existing individual profile. One of the companies employing this technique is Pandora, which builds a user profile based on the musicians/stations that a particular user likes and recommends other musicians in the genres contained in the individual's profile. Another example is PatientsLikeMe, which builds individual patient profiles and recommends that registered patients contact other patients suffering from similar diseases.
SECTION 14.4 REVIEW QUESTIONS
1. List the types of approaches used in recommendation engines.
2. How do the two approaches differ?
3. Can you identify specific sites that may use one or the other type of recommendation system?
14.5 WEB 2.0 AND ONLINE SOCIAL NETWORKING
Web 2.0 is the popular term for describing advanced Web technologies and applications, including blogs, wikis, RSS, mashups, user-generated content, and social networks. A major objective of Web 2.0 is to enhance creativity, information sharing, and collaboration.
One of the most significant differences between Web 2.0 and the traditional Web is the greater collaboration among Internet users, content providers, and enterprises. As an umbrella term for an emerging core of technologies, trends, and principles, Web 2.0 is not only changing what is on the Web, but also how it works. Web 2.0 concepts have led to the evolution of Web-based virtual communities and their hosting services, such as social networking sites, video-sharing sites, and more. Many believe
that companies that understand these new applications and technologies, and apply the capabilities early on, stand to greatly improve internal business processes and marketing. Among the biggest advantages is better collaboration with customers, partners, and suppliers, as well as among internal users.
Representative Characteristics of Web 2.0
The following are representative characteristics of the Web 2.0 environment:
• Web 2.0 has the ability to tap into the collective intelligence of users. The more users contribute, the more popular and valuable a Web 2.0 site becomes.
• Data is made available in new or never-intended ways. Web 2.0 data can be remixed
or “mashed up,” often through Web service interfaces, much the way a dance-club
DJ mixes music.
• Web 2.0 relies on user-generated and user-controlled content and data.
• Lightweight programming techniques and tools let nearly anyone act as a Web site
developer.
• The virtual elimination of software-upgrade cycles makes everything a perpetual
beta or work-in-progress and allows rapid prototyping, using the Web as an appli-
cation development platform.
• Users can access applications entirely through a browser.
• An architecture of participation and digital democracy encourages users to add
value to the application as they use it.
• A major emphasis is on social networks and computing.
• There is strong support for information sharing and collaboration.
• Web 2.0 fosters rapid and continuous creation of new business models.
Other important features of Web 2.0 are its dynamic content, rich user experience,
metadata, scalability, open source basis, and freedom (net neutrality).
Most Web 2.0 applications have a rich, interactive, user-friendly interface based
on Ajax or a similar framework. Ajax (Asynchronous JavaScript and XML) is an effective
and efficient Web development technique for creating interactive Web applications. The
intent is to make Web pages feel more responsive by exchanging small amounts of
data with the server behind the scenes so that the entire Web page does not have to be
reloaded each time the user makes a change. This is meant to increase the Web page’s
interactivity, loading speed, and usability.
A major characteristic of Web 2.0 is the global spread of innovative Web sites
and start-up companies. As soon as a successful idea is deployed as a Web site in one
country, other sites appear around the globe. This section presents some of these sites.
For example, approximately 120 companies specialize in providing Twitter-like services
in dozens of countries. An excellent source for material on Web 2.0 is SearchCIO's Executive Guide: Web 2.0 (see searchcio.techtarget.com/general/0,295582,sid19_gci1244339,00.html#glossary).
Social Networking
Social networking is built on the idea that there is structure to how people know each
other and interact. The basic premise is that social networking gives people the power to share, making the world more open and connected. Although social networking is usually practiced in social networks such as LinkedIn, Facebook, or Google+, aspects of it
are also found in Wikipedia and YouTube.
We first briefly define social networks and then look at some of the services they
provide and their capabilities.
A Definition and Basic Information
A social network is a place where people create their own space, or homepage, on which they write blogs (Web logs); post pictures, videos, or music; share ideas; and link to other Web locations they find interesting. In addition, members of social networks can tag the content they create and post it with keywords they choose themselves, which makes the
content searchable. The mass adoption of social networking Web sites points to an evolu-
tion in human social interaction.
Mobile social networking refers to social networking where members converse
and connect with one another using cell phones or other mobile devices. Virtually all
major social networking sites offer mobile services or apps on smartphones to access
their services. The explosion of mobile Web 2.0 services and companies means that many
social networks can be based from cell phones and other portable devices, extending
the reach of such networks to the millions of people who lack regular or easy access to
computers.
Facebook (facebook.com), which was launched in 2004 by former Harvard student Mark Zuckerberg, is the largest social network service in the world, with almost 1 billion users worldwide as of February 2013. A primary reason why Facebook has expanded so rapidly is the network effect: more users means more value. As more users become involved in the social space, more people are available to connect with. Initially, Facebook was an online social space for college and high school students that automatically connected students to other students at the same school. Expanding to a global audience has enabled Facebook to become the dominant social network.
Today, Facebook has a number of applications that support photos, groups, events, marketplaces, posted items, games, and notes. A special feature on Facebook is the News Feed, which enables users to track the activities of friends in their social circles. For example, when a user changes his or her profile, the updates are broadcast to others who subscribe to the feed. Users can also develop their own applications or use any of the millions of Facebook applications that have been developed by other users.
Orkut (orkut.com) was the brainchild of a Turkish Google programmer of the same name. Orkut was to be Google's homegrown answer to Facebook. Orkut follows a format similar to that of other major social networking sites: a homepage where users can display every facet of their personal life they desire using various multimedia applications. It is more popular in countries such as Brazil than in the United States. Google has introduced another social network called Google+ that takes advantage of Google's popular e-mail service, Gmail, but it is still a much smaller competitor of Facebook.
Implications of Business and Enterprise Social Networks
Although advertising and sales are the major EC activities in public social networks, there
are emerging possibilities for commercial activities in business-oriented networks such as LinkedIn and in enterprise social networks.
USING TWITTER TO GET A PULSE OF THE MARKET Twitter is a popular social networking site that enables friends to keep in touch and follow what others are saying. An analysis of "tweets" can be used to determine how well a product/service is doing in the market. Previous chapters on Web analytics included significant coverage of social media analytics. This continues to grow in popularity and business use. Analysis of posts on social media sites such as Facebook and Twitter has become a major business. Many companies provide services to monitor and manage such posts on behalf of companies and individuals. One good example is reputation.com.
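As a toy illustration of this kind of market-pulse analysis, the sketch below tallies tweet sentiment with a tiny hand-built word list; commercial social media analytics use far richer language models, so every detail here is a simplifying assumption.

# Naive lexicon-based sentiment tally over tweets mentioning a product.
# The word lists and tweets are invented for illustration.
POSITIVE = {"love", "great", "awesome", "fast"}
NEGATIVE = {"hate", "slow", "broken", "awful"}

tweets = [
    "I love the new sandwich, great value",
    "service was slow and the app is broken",
    "awesome deal today",
]

def polarity(text):
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

scores = [polarity(t) for t in tweets]
pos = sum(s > 0 for s in scores)
neg = sum(s < 0 for s in scores)
print(f"positive: {pos}, negative: {neg}, net sentiment: {pos - neg}")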
SECTION 14.5 REVIEW QUESTIONS
1. Define Web 2.0.
2. List the major characteristics of Web 2.0.
3. What new business model has emerged from Web 2.0?
4. Define social network.
5. List some major social network sites.
14.6 CLOUD COMPUTING AND BI
Another emerging technology trend that business intelligence users should be aware
of is cloud computing. Wikipedia (en.wikipedia.org/wiki/cloud_computing) defines
cloud computing as “a style of computing in which dynamically scalable and often
virtualized resources are provided over the Internet. Users need not have knowledge of,
experience in, or control over the technology infrastructures in the cloud that supports
them." This definition is broad and comprehensive. In some ways, cloud computing is a new name for many previous, related trends: utility computing, application service provider, grid computing, on-demand computing, software as a service (SaaS), and even older, centralized computing with dumb terminals. But the term cloud computing originates from a reference to the Internet as a "cloud" and represents an evolution of all of the previously shared/centralized computing trends. The Wikipedia entry also recognizes that cloud computing is a combination of several information technology components provided as services. For example, infrastructure as a service (IaaS) provides the basic computing resources; platform as a service (PaaS) adds platform provisioning, such as management, administration, security, and so on; and software as a service (SaaS) delivers applications through a Web browser while the data and the application programs reside on some other server.
Although we do not typically look at Web-based e-mail as an example of cloud computing, it can be considered a basic cloud application. Typically, the e-mail provider stores the data (e-mail messages) and the software (e-mail programs that let us process and manage e-mails). The provider also supplies the hardware/software and all of the basic infrastructure. As long as the Internet is available, one can access the e-mail application from anywhere in the Internet cloud. When the application is updated by the e-mail provider (e.g., when Gmail updates its e-mail application), it becomes available to all the customers without them having to download any new programs. Thus, any Web-based general application is in a way an example of a cloud application. Another example of a general cloud application is Google Docs and Spreadsheets. This application allows a user to create text documents or spreadsheets that are stored on Google's servers and are available to the users anywhere they have access to the Internet. Again, no programs need to be installed; "the application is in the cloud." The storage space is also "in the cloud."
A very good general business example of cloud computing is Amazon.com's Web services. Amazon.com has developed an impressive technology infrastructure for e-commerce as well as for business intelligence, customer relationship management, and supply chain management. It has built major data centers to manage its own operations. However, through Amazon.com's cloud services, many other companies can employ these very same facilities to gain advantages of these technologies without having to make a similar investment. Like other cloud-computing services, a user can subscribe to any of the facilities on a pay-as-you-go basis. This model of letting someone else own the hardware and software but making use of the facilities on a pay-per-use basis is the cornerstone of cloud computing. A number of companies offer cloud-computing services, including Salesforce.com, IBM, Sun Microsystems, Microsoft (Azure), Google, and Yahoo!
Cloud computing, like many other IT trends, has resulted in new offerings in business intelligence. White (2008) and Trajman (2009) provided examples of BI offerings related to cloud computing. Trajman identified several companies offering cloud-based data warehouse options. These options permit an organization to scale up its data warehouse and pay only for what it uses. Companies offering such services include 1010data, LogiXML, and LucidEra. These companies offer extract, transform, and load (ETL) capabilities as well as advanced data analysis tools. These are examples of SaaS as well as data as a service (DaaS) offerings. Other companies, such as Elastra and RightScale, offer dashboard and data management tools that follow the SaaS and DaaS models, but they also employ IaaS from other providers, such as Amazon.com or GoGrid. Thus, the end user of a cloud-based BI service may use one organization for analysis applications that, in turn, uses another firm for the platform or infrastructure.
The next several paragraphs summarize the latest trends at the interface of cloud computing and business intelligence/decision support systems. They are excerpted from a paper written by Haluk Demirkan and one of the co-authors of this book (Demirkan and Delen, 2013).
Service-oriented thinking is one of the fastest growing paradigms in today's economy. Most organizations have already built (or are in the process of building) decision support systems that provide agile data, information, and analytics capabilities as services. Let's look at the implications of service orientation for DSS. One of the main premises of service orientation is that service-oriented decision support systems will be developed with a component-based approach characterized by reusability (services can be reused in many workflows), substitutability (alternative services can be used), extensibility and scalability (the ability to extend services and scale them up, increasing the capabilities of individual services), customizability (the ability to customize generic features), composability (easy construction of more complex functional solutions using basic services), reliability, low cost of ownership, economy of scale, and so on.
In a service-oriented DSS environment, most services are provided through distributed collaboration. Various DSS services are produced by many partners and consumed by end users for decision making; a given partner may play the roles of both producer and consumer at the same time.
Service-Oriented DSS
In a service-oriented DSS (SODSS) environment, there are four major components: information technology as enabler, process as beneficiary, people as users, and organization as facilitator. Figure 14.2 illustrates a conceptual architecture of service-oriented DSS.
In service-oriented DSS solutions, operational systems (1), data warehouses (2), online analytic processing (3), and end-user components (4) can be provided to users as services, individually or bundled. Some of these components and their brief descriptions are listed in Table 14-1.
In the following subsections we provide brief descriptions of the three service models (i.e., data-as-a-service, information-as-a-service, and analytics-as-a-service) that underlie the service-oriented DSS as its foundational enablers.
Data-as-a-Service (DaaS)
In the service-oriented DSS environment (such as a cloud environment), the concept of data-as-a-service basically advocates the view that, with the emergence of service-oriented business processes, architecture, and infrastructure, which include standardized processes for accessing data "where it lives," the actual platform on which the data resides doesn't matter (Dyche, 2011). Data can reside in a local computer or in a server at a server farm inside a cloud-computing environment. With data-as-a-service, any business
[Figure 14.2 appears here. Recoverable labels from the diagram: information sources (OLTP/Web and other systems); data and information management (replication service); routine business reporting; OLAP; dashboards; intranet search for content; operations management; optimization; data mining; text mining; simulation; automated decision system.]
FIGURE 14.2 Conceptual Architecture of Service-Oriented DSS. Source: Haluk Demirkan and Dursun Delen, "Leveraging the Capabilities of Service-Oriented Decision Support Systems: Putting Analytics and Big Data in Cloud," Decision Support Systems, Vol. 55, No. 1, April 2013, pp. 412-421.
process can access data wherever it resides. Data-as-a-service began with the notion that data quality could happen in a centralized place, cleansing and enriching data and offering it to different systems, applications, or users, irrespective of where they were in the organization, computers, or on the network. This has now been replaced with master data management (MDM) and customer data integration (CDI) solutions, where the record of the customer (or product, or asset, etc.) may reside anywhere and is available as a service to any application that has the services allowing access to it. By applying a standard set of transformations to the various sources of data (for example, ensuring that gender fields containing different notation styles [e.g., M/F, Mr./Ms.] are all translated into male/female) and then enabling applications to access the data via open standards such as SQL, XQuery, and XML, service requestors can access the data regardless of vendor or system.
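Such a standardized transformation can be illustrated with a few lines of Python; the mapping table below is an assumption for illustration, not a prescribed DaaS standard.

# Sketch of a standard DaaS-style transformation: normalizing gender
# fields from heterogeneous source notations before exposing the data
# as a service. The mapping table is an illustrative assumption.
GENDER_MAP = {
    "m": "male", "mr.": "male", "mr": "male", "male": "male",
    "f": "female", "ms.": "female", "ms": "female", "mrs.": "female",
    "female": "female",
}

def normalize_gender(raw):
    return GENDER_MAP.get(raw.strip().lower(), "unknown")

records = [{"name": "A. Smith", "gender": "M"},
           {"name": "B. Jones", "gender": "Ms."},
           {"name": "C. Lee", "gender": "female"}]

for rec in records:
    rec["gender"] = normalize_gender(rec["gender"])

print(records)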
With DaaS, customers can move quickly thanks to the simplicity of the data access and the fact that they don't need extensive knowledge of the underlying data. If customers require a slightly different data structure or have location-specific requirements, the implementation is easy because the changes are minimal (agility). Second, providers can build the base with data experts and outsource the presentation layer, which allows for very cost-effective user interfaces and makes change requests at the presentation layer much more feasible (cost-effectiveness). Access to the data is controlled through the data services, which tends to improve data quality because there is a single point for updates. Once those services are tested thoroughly, they need only be regression tested if they remain unchanged for the next deployment (better data quality). Another important point is that DaaS platforms often use NoSQL (sometimes expanded to "not only SQL"), a broad class of database management systems that differ from classic relational database management systems (RDBMSs) in some significant ways. These data
TABLE 14-1 Major Components of Service-Oriented DSS

Data sources / Application programming interface: Mechanism to populate source systems with raw data and to pull operational reports.
Data sources / Operational transaction systems: Systems that run day-to-day business operations and provide source data for the data warehouse and DSS environment.
Data sources / Enterprise application integration/staging area: Provides an integrated common data interface and interchange mechanism for real-time and source systems.
Data management / Extract, transform, load (ETL): The processes to extract, transform, cleanse, reengineer, and load source data into the data warehouse, and move data from one location to another.
Data services / Metadata management: Data that describes the meaning and structure of business data, as well as how it is created, accessed, and used.
Data services / Data warehouse: Subject-oriented, integrated, time-variant, and nonvolatile collection of summary and detailed data used to support the strategic decision-making process for the organization; also used for ad hoc and exploratory processing of very large data sets.
Data services / Data marts: Subsets of the data warehouse that support specific decision and analytical needs and provide business units more flexibility, control, and responsibility.
Information services / Information: Such as ad hoc query, reporting, OLAP, dashboards, intra- and Internet search for content, data, and information mashups.
Analytics services / Analytics: Such as optimization, data mining, text mining, simulation, and automated decision systems.
Information delivery to end users / Information delivery portals: Such as desktop, Web browser, portal, mobile devices, e-mail.
Information management / Information services with library and administrator: Optimizes the use of the DSS environment by organizing its capabilities and knowledge and assimilating them into the business processes; also includes search engines, index crawlers, content servers, categorization servers, application/content integration servers, application servers, etc.
Data management / Ongoing data management: Ongoing management of data within and across the environment (such as backup, aggregation, and retrieval of data from near-line and off-line storage).
Operations management / Operations and administration: Activities to ensure daily operations and optimize for manageable growth (systems management, data acquisition management, service management, change management, scheduling, monitoring, security, etc.).
Information sources / Internal and external databases: Databases and files.
Servers / Operations: Database, application, Web, network, security, etc.
Software / Operations: Applications, integration, analytics, portals, ETL, etc.
stores may not require fixed table schemas, usually avoid join operations, and typically scale horizontally (Stonebraker, 2010). Amazon offers such a service, called SimpleDB (aws.amazon.com/simpledb). Google's App Engine (code.google.com/appengine) provides its DataStore API around BigTable. But apart from these two proprietary offerings, the current landscape is still open for prospective service providers.
Information-as-a-Service (Information on Demand) (IaaS)
The overall idea of IaaS is making information available quickly to people, processes, and applications across the business (agility). Such a system promises to eliminate the silos of data that exist in systems and infrastructure today, to enable the sharing of real-time information for emerging apps, to hide complexity, and to increase availability through virtualization. The main idea is to bring together diverse sources, provide a "single version of the truth," make it available 24/7, and, by doing so, reduce the proliferation of redundant data and the time it takes to build and deploy new information services. The IaaS paradigm aims to implement and sustain predictable qualities of service around information delivery at runtime and to leverage and extend legacy information resources and infrastructure immediately through data and runtime virtualization, thereby reducing ongoing development efforts. IaaS is a comprehensive strategy for the delivery of information obtained from information services, following a consistent approach using SOA infrastructure and/or Internet standards. Unlike enterprise information integration (EII), enterprise application integration (EAI), and extract, transform, and load (ETL) technologies, IaaS offers a flexible data integration platform based on a newer generation of service-oriented standards that enables ubiquitous access to any type of data, on any platform, using a wide range of interface and data access standards (Yuhanna, Gilpin, and Knoll, The Forrester Wave: Information-as-a-Service, Q1 2010, Forrester Research, 2010). Forrester Research names IaaS "Information Fabric" and proposes a new, logical view to better characterize it. Two examples of such products are IBM's WebSphere Information Integration and BEA's AquaLogic Data Services. These products can take the messy underlying data and present them as elemental services, for example, a service that presents a single view of a customer from the underlying data. They can be used to enable real-time, integrated access to business information regardless of location or format by means of semantic integration. They also provide models-as-a-service (MaaS), a collection of industry-specific business processes, reports, dashboards, and other service models for key industries (e.g., banking, insurance, and financial markets) that accelerate enterprise business initiatives for business process optimization and multi-channel transformation. Finally, they provide master data management (MDM) services to enable the creation and management of multiform master data, provided as a service, for customer information across heterogeneous environments, as well as content management services and business intelligence services to perform powerful analysis on integrated data.
Analytics-as-a-Service (AaaS)
Analytics and data-based managerial solutions, the applications that query data for use in business planning, problem solving, and decision support, are evolving rapidly and being used by almost every organization. Gartner predicts that by 2013, 33 percent of BI functionality will be consumed via handheld devices; by 2014, 30 percent of analytic applications will use in-memory functions to add scale and computational speed and will use proactive, predictive, and forecasting capabilities; and by 2014, 40 percent of spending on business analytics will go to system integrators, not software vendors (Tudor and Pettey, 2011).
The concept of analytics-as-a-service (AaaS), referred to by some as Agile Analytics, is turning utility computing into a service model for analytics. AaaS is not limited to a single database or software package; rather, it has the ability to turn a general-purpose analytical platform into a shared utility for an enterprise, with a focus on the virtualization of analytical services (Ratzesberger, 2011). With the needs of enterprise analytics growing rapidly, it is evident that traditional hub-and-spoke architectures are not able to satisfy the demands driven by increasingly complex business analysis and analytics. New and improved architectures are needed to process very large amounts of structured
and unstructured data in a very short time to produce accurate and actionable results. The "analytics-as-a-service" model is already being facilitated by Amazon, MapReduce, Hadoop, Microsoft's Dryad/SCOPE, Opera Solutions, eBay, and others. For example, eBay employees access a virtual slice of the main data warehouse server where they can store and analyze their own data sets. eBay's virtual private data marts have been quite successful: hundreds have been created, with 50 to 100 in operation at any one time. They have eliminated the company's need for new physical data marts that cost an estimated $1 million apiece and require the full-time attention of several skilled employees to provision (Winter, 2008).
AaaS in the cloud offers economies of scale and scope by providing many virtual analytical applications with better scalability and higher cost savings. With growing data volumes and dozens of virtual analytical applications, chances are that more of them can share processing across different times, usage patterns, and frequencies (Kalakota, 2011). A number of database companies, such as Teradata, Netezza, Greenplum, Oracle, IBM DB2, DATAllegro, Vertica, and AsterData, provide shared-nothing (scalable) database management applications that are well suited for AaaS in cloud deployments.
Data and text mining is another very promising application of AaaS. But the capabilities that a service orientation (along with cloud computing, pooled resources, and parallel processing) brings to the analytic world are not limited to data/text mining. They can also be used for large-scale optimization, highly complex multi-criteria decision problems, and distributed simulation models. Such prescriptive analytics require highly capable systems that can only be realized using service-based collaborative systems that can utilize large-scale computational resources.
We also expect that there will be significant interest in conducting service science research on cloud computing in Big Data analysis. With Web 2.0, more than enough data has been collected by organizations. We are entering the "petabyte age," and traditional data and analytics approaches are beginning to show their limits. Cloud analytics is an emerging alternative solution for large-scale data analysis. Data-oriented cloud systems include storage and computing in a distributed and virtualized environment. These solutions also come with many challenges, such as security, service level, and data governance. Research is still limited in this area. As a result, there is ample opportunity to bring analytical, computational, and conceptual modeling into the context of service science, service orientation, and cloud intelligence.
These types of cloud-based offerings are continuing to grow in popularity. A major advantage of these offerings is the rapid diffusion of advanced analysis tools among users without significant investment in technology acquisition. However, a number of concerns have been raised about cloud computing, including loss of control and privacy, legal liabilities, cross-border political issues, and so on. Nonetheless, cloud computing is an important initiative for a BI professional to watch.
SECTION 14.6 REVIEW QUESTIONS
1. Define cloud computing. How does it relate to PaaS, SaaS, and IaaS?
2. Give examples of companies offering cloud services.
3. How does cloud computing affect business intelligence?
4. What are the three service models that provide the foundation for service-oriented DSS?
5. How does DaaS change the way data is handled?
6. What is MaaS? What does it offer to businesses?
7. Why is AaaS cost-effective?
8. Why is MapReduce mentioned in the context of AaaS?
14.7 IMPACTS OF ANALYTICS IN ORGANIZATIONS: AN OVERVIEW
Analytic systems are important factors in the information, Web, and knowledge
revolution. This is a cultural transformation with which most people are only now coming to terms. Unlike the slower revolutions of the past, such as the Industrial Revolution, this revolution is taking place very quickly and affecting every facet of our lives. Inherent in this rapid transformation is a host of managerial, economic, and
social issues.
Separating the impact of analytics from that of other computerized systems is a difficult task, especially because of the trend toward integrating, or even embedding, analytics with other computer-based information systems. Analytics can have both micro and macro implications. Such systems can affect particular individuals and jobs, and they can also affect the work structures of departments and units within an organization. They can also have significant long-term effects on total organizational structures, entire industries, communities, and society as a whole (i.e., a macro impact).
The impact of computers and analytics can be divided into three general categories: organizational, individual, and societal. In each of these, computers have had many impacts. We cannot possibly consider all of them in this section, so in the next paragraphs we touch upon the topics we feel are most relevant to analytics.
New Organizational Units
One change in organizational structure is the possibility of creating an analytics department, a BI department, or a knowledge management department in which analytics play a major role. This special unit can be combined with or replace a quantitative analysis unit, or it can be a completely new entity. Some large corporations have separate decision support units or departments. For example, many major banks have such departments in their financial services divisions. Many companies have small decision support or BI/data warehouse units. These types of departments are usually involved in training in addition to consulting and application development activities. Others have empowered a chief technology officer over BI, intelligent systems, and e-commerce applications. Companies such as Target and Walmart have major investments in such units, which are constantly analyzing their data to determine the efficiency of marketing and supply chain management by understanding their customer and supplier interactions.
Growth of the BI industry has resulted in the formation of new units within IT provider companies as well. For example, a few years back IBM formed a new business unit focused on analytics. This group includes units in business intelligence, optimization models, data mining, and business performance. As noted in Sections 14.2 and 14.3, the enormous growth of the app industry has created many opportunities for new companies that can employ analytics and deliver innovative applications in any specific domain.
There is also consolidation through acquisition of specialized software companies by major IT providers. For example, IBM acquired DemandTec, a revenue and promotion optimization software company, to build its offerings after having acquired SPSS for predictive analytics and ILOG to build its prescriptive analytics capabilities. Oracle acquired Hyperion some time back. Finally, there are also collaborations that enable companies to work cooperatively in some cases while competing elsewhere. For example, SAS and Teradata announced a collaboration to let Teradata users develop BI applications using SAS analytical modeling capabilities. Teradata acquired Aster to enhance its Big Data offerings and Aprimo to add to its customer campaign management capabilities.
Section 14.9 describes the ecosystem of the analytics industry and recognizes the career paths available to analytics practitioners. It introduces many of the industry clusters, including those in user organizations.
Restructuring Business Processes and Virtual Teams
In many cases, it is necessary to restructure business processes before introducing new information technologies. For example, before IBM introduced e-procurement, it restructured all related business processes, including decision making, searching inventories, reordering, and shipping. When a company introduces a data warehouse and BI, the information flows and related business processes (e.g., order fulfillment) are likely to change. Such changes are often necessary for profitability, or even survival. Restructuring is especially necessary when major IT projects such as ERP or BI are undertaken. Sometimes an organization-wide, major restructuring is needed; then it is referred to as reengineering. Reengineering involves changes in structure, organizational culture, and processes. In a case in which an entire (or most of an) organization is involved, the process is referred to as business process reengineering (BPR).
The Impacts of ADS Systems
As indicated in Chapter 1 and other chapters, ADS systems, such as those for pricing, scheduling, and inventory management, are spreading rapidly, especially in industries such as airlines, retailing, transportation, and banking. These systems will probably have the following impacts:
• Reduction of middle management
• Empowerment of customers and business partners
• Improved customer service (e.g., faster reply to requests)
• Increased productivity of help desks and call centers
The impact goes beyond one company or one supply chain, however. Entire industries are affected. The use of profitability models and optimization is reshaping retailing, real estate, banking, transportation, airlines, and car rental agencies, among other industries.
Job Satisfaction
Although many jobs may be substantially enriched by analytics, other jobs may become more routine and less satisfying. For example, more than 40 years ago, Argyris (1971) predicted that computer-based information systems would reduce managerial discretion in decision making and lead to managers being dissatisfied. In their study about ADS, Davenport and Harris (2005) found that employees using ADS systems, especially those who are empowered by the systems, were more satisfied with their jobs. If routine and mundane work can be done using an analytic system, it should free up managers and knowledge workers to do more challenging tasks.
Job Stress and Anxiety
An increase in workload and/or responsibilities can trigger job stress. Although computerization has benefited organizations by increasing productivity, it has also created an ever-increasing and changing workload for some employees, many times brought on by downsizing and redistributing the entire workload of one employee to another. Some workers feel overwhelmed and begin to feel anxious about their jobs and their performance. These feelings of anxiety can adversely affect their productivity. Management must alleviate these feelings by redistributing the workload among workers or conducting appropriate training.
One of the negative impacts of the information age is information anxiety. This disquiet can take several forms, such as frustration with the inability to keep up with the amount of data present in our lives. Constant connectivity afforded through mobile
devices, e-mail, and instant messaging creates its own challenges and stress. Research on e-mail response strategies (iris.okstate.edu/REMS) includes many examples of studies conducted to recognize such stress. Constant alerts about incoming e-mails lead to interruptions, which eventually result in loss of productivity (and then an increase in stress). Systems have been developed to provide decision support to determine how often a person should check his or her e-mail (see Gupta and Sharda, 2009).
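To make this trade-off concrete, the following is a minimal, hypothetical sketch in Python. The cost model and every parameter value are our own illustrative assumptions, not the actual model from Gupta and Sharda (2009): each check incurs a fixed context-switch penalty, while checking rarely increases the average delay before messages are answered.

def daily_cost(checks_per_day, switch_cost_min=4.0,
               delay_cost_per_min=0.5, workday_hours=8.0):
    # Total daily "productivity cost" in minutes: interruption cost
    # grows with the number of checks; response-delay cost shrinks.
    interruption = checks_per_day * switch_cost_min
    avg_delay_min = (workday_hours * 60) / checks_per_day / 2  # mean wait
    return interruption + delay_cost_per_min * avg_delay_min

# Evaluate 1..48 checks per day and pick the cheapest schedule.
best = min(range(1, 49), key=daily_cost)
print(best, round(daily_cost(best), 1))

With these invented numbers the optimum comes out to about five checks per day; the point is only that checking frequency can be framed as a computable trade-off rather than a habit.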
Analytics’ Impact on Managers’ Activities and Their Performance
The most important task of managers is making decisions. Analytics can change the
manner in which many decisions are made and can consequently change managers’ jobs.
Some of the most common areas are discussed next.
According to Perez-Cascante et al. (2002), an ES/DSS was found to improve the performance of both existing and new managers as well as other employees. It helped managers gain more knowledge, experience, and expertise, and it consequently enhanced the quality of their decision making. Many managers report that computers have finally given them time to get out of the office and into the field. (BI can save an hour a day for every user.) They have also found that they can spend more time planning activities instead of putting out fires because they can be alerted to potential problems well in advance, thanks to intelligent agents, ES, and other analytical tools.
Another aspect of the managerial challenge lies in the ability of analytics to support the decision-making process in general and strategic planning and control decisions in particular. Analytics could change the decision-making process and even decision-making styles. For example, information gathering for decision making is completed much more quickly when analytics are in use. Enterprise information systems are extremely useful in supporting strategic management (see Liu et al., 2002). Data, text, and Web mining technologies are now used to improve external environmental scanning of information. As a result, managers can change their approach to problem solving and improve on their decisions quickly. It is reported that Starbucks recently introduced a new coffee beverage and made the decision on pricing by trying several different prices and monitoring the social media feedback throughout the day. This implies that data collection methods for a manager could be drastically different now than in the past.
Research indicates that most managers tend to work on a large number of problems simultaneously, moving from one to another as they wait for more information on their current problem (see Mintzberg et al., 2002). Analytics technologies tend to reduce the time required to complete tasks in the decision-making process and eliminate some of the nonproductive waiting time by providing knowledge and information. Therefore, managers work on fewer tasks during each day but complete more of them. The reduction in start-up time associated with moving from task to task could be the most important source of increased managerial productivity.
Another possible impact of analytics on the manager’s job could be a change in leadership requirements. What are now generally considered good leadership qualities may be significantly altered by the use of analytics. For example, face-to-face communication is frequently replaced by e-mail, wikis, and computerized conferencing; thus, leadership qualities attributed to physical appearance could become less important.
The following are some potential impacts of analytics on managers’ jobs:
• Less expertise (experience) is required for making many decisions.
• Faster decision making is possible because of the availability of information and the automation of some phases in the decision-making process.
• Less reliance on experts and analysts is required to provide support to top executives; managers can do it by themselves with the help of intelligent systems.
• Power is being redistributed among managers. (The more information and analysis capability they possess, the more power they have.)
• Support for complex decisions makes them faster to make and of better quality.
• Information needed for high-level decision making is expedited or even self-generated.
• Automation of routine decisions or phases in the decision-making process (e.g., for frontline decision making and using ADS) may eliminate some managers.
In general, it has been found that the job of middle managers is the most likely job to be automated. Midlevel managers make fairly routine decisions, which can be fully automated. Managers at lower levels do not spend much time on decision making. Instead, they supervise, train, and motivate nonmanagers. Some of their routine decisions, such as scheduling, can be automated; other decisions that involve behavioral aspects cannot. However, even if we completely automate their decisional role, we could not automate their jobs. The Web provides an opportunity to automate certain tasks done by frontline employees; this empowers them, thus reducing the workload of approving managers. The job of top managers is the least routine and therefore the most difficult to automate.
SECTION 14.7 REVIEW QUESTIONS
1. List the impacts of analytics on decision making.
2. List the impacts of analytics on other managerial tasks.
3. Describe new organizational units that are created because of analytics.
4. How can analytics affect restructuring of business processes?
5. Describe the impacts of ADS systems.
6. How can analytics affect job satisfaction?
14.8 ISSUES OF LEGALITY, PRIVACY, AND ETHICS
Several important legal, privacy, and ethical issues are related to analytics. Here we
provide only representative examples and sources.
Legal Issues
The introduction of analytics may compound a host of legal issues already relevant to computer systems. For example, questions concerning liability for the actions or advice provided by intelligent machines are just beginning to be considered.
In addition to resolving disputes about the unexpected and possibly damaging results of some analytics, other complex issues may surface. For example, who is liable if an enterprise finds itself bankrupt as a result of using the advice of an analytic application? Will the enterprise itself be held responsible for not testing the system adequately before entrusting it with sensitive issues? Will auditing and accounting firms share the liability for failing to apply adequate auditing tests? Will the software developers of intelligent systems be jointly liable? Consider the following specific legal issues:
• What is the value of an expert opinion in court when the expertise is encoded in a computer?
• Who is liable for wrong advice (or information) provided by an intelligent application? For example, what happens if a physician accepts an incorrect diagnosis made by a computer and performs an act that results in the death of a patient?
• What happens if a manager enters an incorrect judgment value into an analytic application and the result is damage or a disaster?
• Who owns the knowledge in a knowledge base?
• Can management force experts to contribute their expertise?
Privacy
Privacy means different things to different people. In general, privacy is the right to be
left alone and the right to be free from unreasonable personal intrusions. Privacy has long
been a legal, ethical, and social issue in many countries. The right to privacy is recognized
today in every state of the United States and by the federal government, either by statute
or by common law. The definition of privacy can be interpreted quite broadly. However,
the following two rules have been followed fairly closely in past court decisions: (1) The
right of privacy is not absolute. Privacy must be balanced against the needs of society.
(2) The public’s right to know is superior to the individual’s right to privacy. These two
rules show why it is difficult, in some cases, to determine and enforce privacy regulations
(see Peslak, 2005). Privacy issues online have specific characteristics and policies. One
area where privacy may be jeopardized is discussed next. For privacy and security issues
in the data warehouse environment, see Elson and Leclerc (2005).
COLLECTING INFORMATION ABOUT INDIVIDUALS The complexity of collecting, sorting,
filing, and accessing information manually from numerous government agencies was, in
many cases, a built-in protection against misuse of private information. It was simply too
expensive, cumbersome, and complex to invade a person’s privacy. The Internet, in combi-
nation with large-scale databases, has created an entirely new dimension of accessing and
using data. The inherent power in systems that can access vast amounts of data can be used for the good of society. For example, by matching records with the aid of a computer, it is possible to eliminate or reduce fraud, crime, government mismanagement, tax evasion, welfare cheating, family-support filching, employment of illegal workers, and so on. However,
what price must the individual pay in terms of loss of privacy so that the government can
better apprehend criminals? The same is true on the corporate level. Private information
about employees may aid in better decision making, but the employees’ privacy may be
affected. Similar issues are related to information about customers.
The implications for online privacy are significant. The USA PATRIOT Act also
broadens the government’s ability to access student information and personal financial
information without any suspicion of wrongdoing by attesting that the information likely
to be found is pertinent to an ongoing criminal investigation (see Electronic Privacy
Information Center, 2005). Location information from devices has been used to locate
victims as well as perpetrators in some cases, but at what point is the information not the
property of the individual?
Two effective tools for collecting information about individuals are cookies and spyware. Single-sign-on facilities that let a user access various services from a provider are beginning to raise some of the same concerns as cookies. Such services (Google, Yahoo!, MSN) let consumers permanently enter a profile of information along with a password and use this information and password repeatedly to access services at multiple sites. Critics say that such services create the same opportunities as cookies to invade an individual’s privacy.
The use of artificial intelligence technologies in the administration and enforcement of laws and regulations may increase public concern regarding privacy of information. These fears, generated by the perceived abilities of artificial intelligence, will have to be addressed at the outset of almost any artificial intelligence development effort.
MOBILE USER PRIVACY Many users are unaware of the private information being tracked through mobile PDA or cell phone use. For example, Sense Networks’ models are built using data from cell phone companies that track each phone as it moves from one cell tower to another, from GPS-enabled devices that transmit users’ locations, and from PDAs transmitting information at Wi-Fi hotspots. Sense Networks claims that it is extremely careful and protective of users’ privacy, but it is interesting to note how much information is available through just the use of a single device.
HOMELAND SECURITY AND INDIVIDUAL PRIVACY Analytics technologies such as mining and interpreting the content of telephone calls, taking photos of people in certain places and identifying them, and using scanners to view personal belongings are considered by many to be an invasion of privacy. However, many people recognize that analytic tools are effective and efficient means to increase security, even though the privacy of many innocent people is compromised.
The U.S. government applies analytical technologies on a global scale in the war on terrorism. In the first year and a half after September 11, 2001, supermarket chains, home improvement stores, and other retailers voluntarily handed over massive amounts of customer records to federal law enforcement agencies, almost always in violation of their stated privacy policies. Many others responded to court orders for information, as required by law. The U.S. government has a right to gather corporate data under legislation passed after September 11, 2001. The FBI now mines enormous amounts of data, looking for activity that could indicate a terrorist plot or crime.
Privacy issues abound. Because the government is acquiring personal data to detect suspicious patterns of activity, there is the prospect of improper or illegal use of the data. Many see such gathering of data as a violation of citizens’ freedoms and rights. They see the need for an oversight organization to “watch the watchers,” to ensure that the Department of Homeland Security does not mindlessly acquire data. Instead, it should acquire only pertinent data and information that can be mined to identify patterns that potentially could lead to stopping terrorists’ activities. This is not an easy task.
Recent Technology Issues in Privacy and Analytics
Most providers of Internet services such as Google, Facebook, Twitter, and others depend upon monetizing their users’ actions. They do so in many different ways, but all of these approaches in the end amount to understanding a user’s profile or preferences on the basis of their usage. With the growth of Internet users in general and mobile device users in particular, many companies have been founded to employ advanced analytics to develop profiles of users on the basis of their device usage, movement, and the contacts of the users. The Wall Street Journal has an excellent collection of articles titled “What They Know” (wsj.com/wtk). These articles are constantly updated to highlight the latest technology and privacy/ethical issues. Some of the companies that have been mentioned in this series include Rapleaf (rapleaf.com). Rapleaf claims to be able to provide a profile of a user by just knowing their e-mail address. Clearly, their technology enables them to gather significant information. Similar technology is also marketed by X+1 (xplusone.com). Another company that aims to identify devices on the basis of their usage is Bluecava (bluecava.com). All of these companies employ technologies such as clustering and association mining to develop profiles of users. Such analytics applications definitely raise thorny questions of privacy violation for the users. Of course, many of the analytics start-ups in this space claim to honor user privacy, but violations are often reported. For example, a recent story reported that Rapleaf was collecting unauthorized user information from Facebook users and was banned from Facebook. A column in Time Magazine by Joel Stein (2011) reports that an hour after he gave his e-mail address to a company that specializes in user information monitoring (reputation.com), they had already been able to discover his Social Security number. This number is a key to accessing much private information about a user and could lead to identity theft. So, violations of privacy create fears of criminal conduct based on user information. This area is a big concern overall and needs careful study. The book’s Web site will constantly update new developments. The Wall Street Journal site “What They Know” is a resource that ought to be consulted periodically.
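As an illustration of the clustering step mentioned above, here is a minimal sketch using scikit-learn's k-means. The usage features and values are entirely invented; the named companies' actual pipelines are proprietary and certainly far richer.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user features: [late-night hours online,
# distinct locations visited per week, distinct apps used]
usage = np.array([
    [0.5,  2, 12],
    [4.0, 15, 40],
    [0.7,  3, 10],
    [3.5, 12, 35],
    [0.4,  2,  9],
])

X = StandardScaler().fit_transform(usage)   # put features on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., [0 1 0 1 0]: two crude behavioral profiles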
Another application area that combines organizational IT impact, Big Data, sensors, and privacy concerns is analyzing employee behaviors on the basis of data collected from sensors that the employees wear in a badge. One company, Sociometric Solutions (sociometricsolutions.com), has reported several such applications of their sensor-embedded badges. These sensors track all movement of an employee. Sociometric Solutions has reportedly been able to assist companies in predicting which types of employees are likely to stay with the company or leave on the basis of these employees’ interactions with other employees. For example, those employees who stay in their own cubicles are less likely to progress up the corporate ladder than those who move about and interact with other employees extensively. Similar data collection and analysis have helped other companies determine the size of conference rooms needed or even the office layout that maximizes efficiency. This area is growing very fast and has resulted in another term: people analytics. Of course, this creates major privacy issues. Should companies be able to monitor their employees this intrusively? Sociometric has reported that its analytics are reported to clients only on an aggregate basis; no individual user data is shared. It has noted that some employers want individual employee data, but its contracts explicitly prohibit this type of sharing. In any case, sensors are leading to another level of surveillance and analytics, which poses interesting privacy, legal, and ethical questions.
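A minimal sketch of the aggregate-only reporting idea, using pandas with made-up badge readings (the column names and values below are hypothetical, not Sociometric's actual schema):

import pandas as pd

badges = pd.DataFrame({
    "employee": ["e1", "e2", "e3", "e4"],
    "team": ["sales", "sales", "eng", "eng"],
    "daily_interactions": [34, 28, 12, 15],
    "minutes_away_from_desk": [95, 80, 30, 42],
})

# Drop the identifier and report only team-level averages, so no
# individual employee's readings are ever exposed to the client.
report = badges.drop(columns="employee").groupby("team").mean()
print(report)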
Ethics in Decision Making and Support
Several ethical issues are related to analytics. Representative ethical issues that could be
of interest in analytics implementations include the following:
• Electronic surveillance
• Ethics in DSS design (see Chae et al., 2005)
• Software piracy
• Invasion of individuals’ privacy
• Use of proprietary databases
• Use of intellectual property such as knowledge and expertise
• Exposure of employees to unsafe environments related to computers
• Computer accessibility for workers with disabilities
• Accuracy of data, information, and knowledge
• Protection of the rights of users
• Accessibility to information
• Use of corporate computers for non-work-related purposes
• How much decision making to delegate to computers
Personal values constitute a major factor in the issue of ethical decision making. The study of ethical issues is complex because of its multi-dimensionality. Therefore, it makes sense to develop frameworks to describe ethics processes and systems. Mason et al. (1995) explained how technology and innovation expand the size of the domain of ethics and described a model for ethical reasoning that involves four fundamental focusing questions: Who is the agent? What action was actually taken or is being contemplated? What are the results or consequences of the act? Is the result fair or just for all stakeholders? They also described a hierarchy of ethical reasoning in which each ethical judgment or action is based on rules and codes of ethics, which are based on principles, which in turn are grounded in ethical theory. For more on ethics in decision making, see Murali (2004).
SECTION 14.8 REVIEW QUESTIONS
1. List some legal issues of analytics.
2. Describe privacy concerns in analytics.
3. Explain privacy concerns on the Web.
4. List ethical issues in analytics.
14.9 AN OVERVIEW OF THE ANALYTICS ECOSYSTEM
So, you are excited about the potential of analytics and want to join this growing industry. Who are the current players, and what do they do? Where might you fit in? The objective of this section is to identify the various sectors of the analytics industry, provide a classification of different types of industry participants, and illustrate the types of opportunities that exist for analytics professionals. The section (indeed the book) concludes with some observations about the opportunities for professionals to move across these clusters.
First, we want to remind the reader of the three types of analytics introduced in Chapter 1 and described in detail in the intervening chapters: descriptive or reporting analytics, predictive analytics, and prescriptive or decision analytics. In the following sections we will assume that you already know these three categories of analytics.
Analytics Industry Clusters
This section is aimed at identifying various analytics industry players by grouping them into sectors. We note that the list of company names included is not exhaustive. These names merely reflect our own awareness and mapping of companies’ offerings in this space. Additionally, the mention of a company’s name or its capability in one specific group does not mean that this is the only activity/offering of that organization. We use these names simply to illustrate our descriptions of the sectors. Many other organizations exist in this industry. Our goal is not to create a directory of players or their capabilities in each space, but to illustrate to students that many different options exist for playing in the analytics industry. One can start in one sector and move to another role altogether. We will also see that many companies play in multiple sectors within the analytics industry and, thus, offer opportunities for movement within the field both horizontally and vertically.
Figure 14.3 illustrates our view of the analytics ecosystem. It includes nine key sectors or clusters in the analytics space. The first five clusters can be broadly termed technology providers. Their primary revenue comes from developing technology, solutions, and training that enable user organizations to employ these technologies in the most effective and efficient manner. The accelerators include academics and industry organizations whose goal is to assist both technology providers and users. We describe each of these next, briefly, and give some examples of players in each sector.
Data Infrastructure Providers
This group includes all of the major players in the data hardware and software industry. These organizations provide hardware and software targeted at providing the basic foundation for all data management solutions. Obvious examples would include all major hardware players that provide the infrastructure for database computing: IBM, Dell, HP, Oracle, and so forth. We would also include storage solution providers such as EMC and NetApp in this sector. Many companies provide both hardware and software platforms of their own (e.g., IBM, Oracle, and Teradata). On the other hand, many data solution providers offer database management systems that are hardware independent and can run on many platforms. Perhaps Microsoft’s SQL Server family is the most common example of this. Specialized integrated software providers such as SAP are also in this family of companies.
[Figure 14.3 depicts the analytics ecosystem as nine clusters: Data Infrastructure Providers; Data Warehouse Industry; Middleware Industry; Data Aggregators/Distributors; Analytics-Focused Software Developers; Application Developers: Industry Specific or General; Analytics User Organizations; Analytics Industry Analysts and Influencers; and Academic Providers and Certification Agencies.]
FIGURE 14.3 Analytic Industry Clusters.
Because this group of companies is well known and represents a massive overall economic activity, we believe it is sufficient to recognize the key roles all these companies play. By inference, we also include all the other organizations that support each of these companies’ ecosystems, including database appliance providers, service providers, integrators, and developers.
Several other companies are emerging as major players in a related space, thanks to the network infrastructure enabling cloud computing. Companies such as Amazon and Salesforce.com pioneered offering full data storage (and more) solutions through the cloud. This model has now been adopted by several of the players already identified.
Another group of companies that can be included here is the recent crop of companies in the Big Data space. Companies such as Cloudera, Hortonworks, and many others do not necessarily offer their own hardware but provide infrastructure services and training to create the Big Data platform. This would include Hadoop clusters, MapReduce, NoSQL, and other related technologies for analytics. Thus, they could also be grouped under industry consultants or trainers. We include them here because their role is aimed at enabling the basic infrastructure.
Bottom line, this group of companies provides the basic data and computing infrastructure that we take for granted in the practice of any analytics.
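Since MapReduce comes up repeatedly in this chapter, a toy, single-machine sketch of the idea may help; real Hadoop jobs distribute exactly these map, shuffle, and reduce steps across a cluster.

from collections import defaultdict

docs = ["big data needs big infrastructure",
        "analytics needs data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum each group to get word counts.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'needs': 2, ...}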
Data Warehouse Industry
We distinguish between this group and the preceding group mainly due to differences in
their focus. Companies with data warehousing capabilities focus on providing integrated
data from multiple sources so an organization can derive and deliver value from its data
assets. Many companies in this space include their own hardware to provide efficient
data storage, retrieval, and processing. Recent developments in this space include per-
forming analytics on the data directly in memory. Companies such as IBM, Oracle, and
Teradata are major players in this arena. Because this book includes links to Teradata
University Network (TUN), we note that their platform software is available to TUN par-
ticipants to explore data warehousing concepts (Chapter 3). In addition, all major players
(EMC, IBM, Microsoft, Oracle, SAP, Teradata) have their own academic alliance programs
through which much data warehousing software can be obtained so that students can
develop familiarity and experience with the software. These companies clearly work with
all the other sector players to provide data warehouse solutions and services within their
ecosystem. Because players in this industry are covered extensively by technology media as well as textbooks and have their own ecosystems in many cases, we will just recognize them as a backbone of the analytics industry and move on to other clusters.
Middleware Industry
Data warehousing began with the focus on bringing all the data stores into an enterprise-wide platform. Making sense of that data has become an industry in itself. The general goal of this industry is to provide easy-to-use tools for reporting and analytics. Examples of companies in this space include MicroStrategy, Plum, and many others. A few of the major players that were independent middleware players have been acquired by companies in the first two groups. For example, Hyperion became a part of Oracle, SAP acquired Business Objects, and IBM acquired Cognos. This segment is thus merging with other players or at least partnering with many others. In many ways, the focus of these companies has been to provide descriptive analytics and reports, identified as a core part of BI or analytics.
Data Aggregators/Distributors
Several companies realized the opportunity to develop specialized data collection, aggregation, and distribution mechanisms. These companies typically focus on a specific industry sector and build upon their existing relationships. For example, Nielsen provides data sources to their clients on retail purchase behavior. Another example is Experian, which includes data on each household in the United States. (Similar companies exist outside the United States, as well.) Omniture has developed technology to collect Web clicks and share such data with their clients. Comscore is another major company in this space. Google compiles data for individual Web sites and makes a summary available through Google Analytics services. There are hundreds of other companies that are developing niche platforms and services to collect, aggregate, and share such data with their clients.
Analytics-Focused Software Developers
Companies in this category have developed analytics software for general use with data that has been collected in a data warehouse or is available through one of the platforms identified earlier (including Big Data). This category can also include inventors and researchers in universities and other organizations that have developed algorithms for specific types of analytics applications. We can identify major industry players in this space along the same lines as the three types of analytics outlined in Chapter 1.
Reporting/Analytics
As seen in Chapters 1 and 4, the focus of reporting analytics is on developing various types of reports, queries, and visualizations. These include general visualizations of data or dashboards presenting multiple performance reports in an easy-to-follow style. These are made possible by the tools available from the middleware industry players or unique capabilities offered by focused providers. For example, Microsoft’s SQL Server BI toolkit includes reporting as well as predictive analytics capabilities. On the other hand, specialized software is available from companies such as Tableau for visualization. SAS also offers a visual analytics tool with similar capabilities. Both are linked through TUN. There are many open source visualization tools as well. Literally hundreds of data visualization tools have been developed around the world. Many such tools focus on visualization of data from a specific industry or domain. A Google search will show the latest list of such software providers and tools.
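For a flavor of what these reporting tools produce, here is a minimal dashboard-style chart in Python's matplotlib; the regions and sales figures are invented.

import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]  # hypothetical monthly sales, $K

fig, ax = plt.subplots()
ax.bar(regions, sales)                 # one bar per region
ax.set_ylabel("Sales ($K)")
ax.set_title("Monthly Sales by Region")
plt.show()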
Predictive Analytics
Perhaps the biggest recent growth in analytics has been in this category. Many statistical
software companies such as SAS and SPSS embraced predictive analytics early on and
developed the software capabilities as well as industry practices to employ data mining
techniques, as well as classical statistical techniques, for analytics. SPSS was purchased
by IBM and now sells IBM SPSS Modeler. SAS sells its software called Enterprise Miner.
Other players in this space include KXEN, Statsoft, Salford Systems, and scores of other
companies that may sell their software broadly or use it for their own consulting practices
(next group of companies).
Two open source platforms (R and RapidMiner) have also emerged as popular industrial-strength software tools for predictive analytics and have companies that support training and implementation of these open source tools. A company called Alteryx uses R extensions for reporting and predictive analytics, but its strength is in delivery of analytics solution processes to customers and other users. By sharing the analytics process stream in a gallery where other users can see what data processing and analytic steps were used to arrive at a result from multiple data sources, other users can understand the logic of the analysis, even change it, and share the updated process with other users if they so choose.
In addition, many companies have developed specialized software around a specific technique of data mining. A good example is a company called Rulequest, which sells proprietary variants of decision tree software. Many neural network software companies such as NeuroDimensions would also fall under this category. It is important to note that such specific software implementations may also be part of the capability offered by general predictive analytics tools identified earlier. The number of companies focused on predictive analytics is so large that it would take several pages to identify even a partial set.
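For a flavor of the decision tree technique such vendors package, here is a minimal sketch using scikit-learn's CART implementation (not Rulequest's proprietary algorithm); the toy churn data is invented.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customers: [monthly_spend, support_calls]; 1 = churned.
X = [[20, 5], [80, 0], [25, 4], [90, 1], [30, 6], [70, 0]]
y = [1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["monthly_spend", "support_calls"]))
print(tree.predict([[40, 3]]))  # classify a new customer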
Prescriptive Analytics
Software providers in this category offer modeling tools and algorithms for optimization of operations. Such software is typically available as management science/operations research (MS/OR) software. The best source of information about such providers is OR/MS Today, a publication of INFORMS. Online directories of software in various categories are available on their Web site at orms-today.org. This field has had its own set of major software providers. IBM, for example, has classic linear and mixed-integer programming software. IBM also acquired a company (ILOG) that provides prescriptive analysis software and services to complement its other offerings. Analytics providers such as SAS have their own OR/MS tools (SAS/OR). FICO acquired another company, Xpress, that offers optimization software. Other major players in this domain include companies such as AIMMS, AMPL, Frontline, GAMS, Gurobi, Lindo Systems, Maximal, and many others. A detailed delineation and description of these companies’ offerings is beyond the scope of our goals here. Suffice it to note that this industry sector has seen much growth recently.
Of course, many techniques fall under the category of prescriptive analytics. Each group has its own set of providers. For example, simulation software is a category in its own right. Major companies in this space include Rockwell (ARENA) and Simio, among others. Palisade provides tools that span many software categories. Similarly, Frontline offers tools for optimization with Excel spreadsheets as well as predictive analytics. Decision analysis in multiobjective settings can be performed using tools such as Expert Choice. There are also tools from companies such as Exsys and XpertRule for generating rules directly from data or expert inputs.
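To illustrate the kind of problem these optimization tools solve, here is a tiny product-mix linear program in Python using scipy; all coefficients are invented.

from scipy.optimize import linprog

# Maximize profit 3x + 5y; linprog minimizes, so negate the objective.
c = [-3, -5]
# Machine hours: x + 2y <= 14; labor hours: 3x + y <= 18; x, y >= 0.
A_ub = [[1, 2], [3, 1]]
b_ub = [14, 18]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)  # optimal mix (4.4, 4.8) and profit 37.2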
Some new companies are evolving to combine multiple analytics models in the Big Data space. For example, Teradata Aster includes its own predictive and prescriptive analytics capabilities in processing Big Data streams. We believe there will be more opportunities for companies to develop specific applications that combine Big Data and optimization techniques.
As noted earlier, all three categories of analytics have a rich set of providers, offering the user a wide set of choices and capabilities. It is worthwhile to note again that these groups are not mutually exclusive. In most cases a provider can play in multiple components of analytics.
Application Developers or System Integrators: Industry
Specific or General
The organizations in this group focus on using solutions available from the data infrastructure, data warehouse, middleware, data aggregators, and analytics software providers to develop custom solutions for a specific industry. They also use their analytics expertise to develop specific applications for a user. Thus, this industry group makes it possible for the analytics technology to be truly useful. Of course, such groups may also exist in specific user organizations. We discuss those next, but distinguish between the two because the latter group is responsible for analytics within an organization whereas these application developers work with a larger client base. This sector presents excellent opportunities for someone interested in broadening their analytics implementation experience across industries. Predictably, it also represents a large group, too numerous to identify. Most major analytics technology providers clearly recognize the opportunity to connect to a specific industry or client. Virtually every provider in any of the groups identified earlier includes a consulting practice to help their clients employ their tools. In many cases, revenue from such engagements may far exceed the technology license revenue. Companies such as IBM, SAS, Teradata, and most others identified earlier have significant consulting practices. They hire graduates of analytics programs to work on different client projects. In many cases the larger technology providers also run their own certification programs to ensure that the graduates and consultants are able to claim a certain amount of expertise in using their specific tools.
Companies that have traditionally provided application/data solutions to specific sectors have recognized the potential for the use of analytics and are developing industry-specific analytics offerings. For example, Cerner provides electronic medical records (EMR) solutions to medical providers. Their offerings now include many analytics reports and visualizations. This has now extended to providing athletic injury reports and management services to sports programs in college and professional sports. Similarly, IBM offers a fraud detection engine for the health insurance industry and is working with an insurance company to employ their famous Watson analytics platform (which is known to have won against humans in the popular TV game show Jeopardy!) in assisting medical providers and insurance companies with diagnosis and disease management. Another example of a vertical application provider is Sabre Technologies, which provides analytical solutions to the travel industry, including fare pricing for revenue optimization, dispatch planning, and so forth.
This group also includes companies that have developed their own domain-specific analytics solutions and market them broadly to a client base. For example, Acxiom has developed clusters for virtually all households in the United States based upon all the data they collect about households from many different sources. These cluster labels allow a client organization to target a marketing campaign more precisely. Several companies provide this type of service. Credit score and classification reporting companies (such as FICO and Experian) also belong in this group. DemandTec (a company now owned
by IBM) provides pricing optimization solutions in the retail industry. They employ predictive analytics to forecast price-demand sensitivity and then recommend prices for thousands of products for retailers. Such analytics consultants and application providers are emerging to meet the needs of specific industries and represent an entrepreneurial opportunity to develop industry-specific applications. One area with many emerging start-ups is Web/social media/location analytics. By analyzing data available from Web clicks, smartphones, and app use, companies are trying to profile users and their interests to be better able to target promotional campaigns in real time. Examples of such companies and their activities include Sense Networks, which employs location data to develop user/group profiles; X+1 and Rapleaf, which profile users on the basis of e-mail usage; Bluecava, which aims to identify users through all device usage; and Simulmedia, which targets advertisements on TV on the basis of analysis of a user’s TV-watching habits.
Another group of analytics application start-ups focuses on very specific analytics applications. For example, a popular smartphone app called Shazam is able to identify a song on the basis of the first few notes and then let the user select it from their song base to play, download, or purchase. Voice-recognition tools such as Siri on iPhone and Google Now on Android are likely to create many more specialized analytics applications for very specific purposes in analytics applied to images, videos, audio, and other data that can be captured through smartphones and/or connected sensors.
This start-up activity and space is growing and in major transition due to technology/venture funding and security/privacy issues. Nevertheless, the application developer sector is perhaps the biggest growth industry within analytics at this point.
Analytics User Organizations
Clearly, this is the economic engine of the whole analytics industry. If there were no users, there would be no analytics industry. Organizations of every industry, size, shape, and location are using analytics or exploring the use of analytics in their operations. These include the private sector, government, education, and the military, in organizations around the world. Examples of uses of analytics in different industries abound. Others are exploring similar opportunities to try to gain or retain a competitive advantage. We will not identify specific companies in this section. Rather, the goal here is to see what types of roles analytics professionals can play within a user organization.
Of course, the top leadership of an organization is critically important in applying analytics to its operations. Reportedly, Forrest Mars of the Mars chocolate empire said that all management boiled down to applying mathematics to a company’s operations and economics. Although not enough senior managers seem to subscribe to this view, the awareness of applying analytics within an organization is growing everywhere. Certainly the top leadership of information technology groups within a company (such as the chief information officer) needs to see this potential. For example, a health insurance company executive once told me that his boss (the CEO) viewed the company as an IT-enabled organization that collected money from insured members and distributed it to the providers. Thus, efficiency in this process was the premium they could earn over a competitor. This view led the company to develop several analytics applications to reduce fraud and overpayment to providers and to promote wellness among those insured so they would use the providers less often. Virtually all major organizations in every industry we are aware of are considering hiring analytics professionals. Titles of these professionals vary across industries. Table 14-2 includes selected titles of the MS graduates of our MIS program as well as graduates of our SAS Data Mining Certificate program (courtesy of Dr. G. Chakraborty). This list indicates that most titles are indeed related to analytics. A “word cloud” of all of the titles of our analytics graduates, shown in Figure 14-4, confirms this pattern: analytics is already a popular word in the titles at organizations hiring graduates of such programs.
TABLE 14-2 Selected Titles of Analytics Program Graduates
Advanced Analytics Math Modeler
Analytics Software Tester
Application Developer/Analyst
Associate Director, Strategy and Analytics
Associate Innovation Leader
Bio Statistical Research Analyst
Business Analysis Manager
Business Analyst
Business Analytics Consultant
Business Data Analyst
Business Intelligence Analyst
Business Intelligence Developer
Consultant Business Analytics
Credit Policy and Risk Analyst
Customer Analyst
Data Analyst
Data Mining Analyst
Data Mining Consultant
Data Scientist
Decision Science Analyst
Decision Support Consultant
ERP Business Analyst
Financial/Business Analyst
Healthcare Analyst
Inventory Analyst
IT Business Analyst
Lead Analyst-Management Consulting Services
Manager of Business Analytics
Manager Risk Management
Manager, Client Analytics
Manager, Decision Support Analysis
Manager, Global Customer Strategy and
Analytics
Manager, Modeling and Analytics
Manager, Process Improvement, Global
Operations
Manager, Reporting and Analysis
Managing Consultant
Marketing Analyst
Marketing Analytics Specialist
Media Performance Analyst
Operations Research Analyst
Operations Analyst
Predictive Modeler
Principal Business Analyst
Principal Statistical Programmer
Procurement Analyst
Project Analyst
Project Manager
Quantitative Analyst
Research Analyst
Retail Analytics
Risk Analyst-Client Risk and Collections
SAS Business Analyst
SAS Data Analyst
SAS Marketing Analyst
SAS Predictive Modeler
Senior Business Intelligence Analyst
Senior Customer Intelligence Analyst
Senior Data Analyst
Senior Director of Analytics and Data Quality
Senior Manager of Data Warehouse, BI, and Analytics
Senior Quantitative Marketing Analyst
Senior Strategic Marketing Analyst
Senior Strategic Project Marketing Analyst
Senior Marketing Database Analyst
Senior Data Mining Analyst
Senior Operations Analyst
Senior Pricing Analyst
Senior Strategy and Analytics Analyst
Statistical Analyst
Strategic Business Analyst
Strategic Database Analyst
Supply Chain Analyst
Supply Chain Planning Analyst
Technical Analyst
[Word cloud image: “Analyst” dominates, with terms such as Analytics, Senior, Data, Business, Marketing, Consultant, Manager, Intelligence, and Research also prominent.]
FIGURE 14.4 Word Cloud of Titles of Analytics Program Graduates.
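A figure like this can be generated in a few lines; the sketch below assumes the third-party wordcloud package (pip install wordcloud matplotlib) and a short, made-up list of titles.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

titles = ["Data Mining Analyst", "Business Intelligence Analyst",
          "Senior Data Analyst", "Marketing Analytics Specialist",
          "Predictive Modeler", "Decision Science Analyst"]

cloud = WordCloud(width=600, height=300,
                  background_color="white").generate(" ".join(titles))
plt.imshow(cloud, interpolation="bilinear")  # frequent words render larger
plt.axis("off")
plt.show()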
Of course, user organizations include career paths for analytics professionals moving into management positions. These titles include project manager, senior manager, and director, all the way up to chief information officer or chief executive officer. Our goal here is to recognize that user organizations exist as a key cluster in the analytics ecosystem.
Analytics Industry Analysts and Influencers
The next cluster includes three types of organizations or professionals. The first group is the set of professional organizations that provide advice to analytics industry providers and users. Their services include marketing analyses, coverage of new developments, evaluation of specific technologies, development of training/white papers, and so forth. Examples of such players include organizations such as the Gartner Group, The Data Warehousing Institute, and many of the general and technical publications and Web sites that cover the analytics industry. The second group includes professional societies or organizations that provide some of the same services but are membership based and organized. For example, INFORMS, a professional organization, has now focused on promoting analytics. The Special Interest Group on Decision Support Systems (SIGDSS), a subgroup of the Association for Information Systems, also focuses on analytics.
Most of the major vendors (e.g., Teradata and SAS) also have their own membership-based user groups. These entities promote the use of analytics and enable sharing of the lessons learned through their publications and conferences. They may also provide placement services.
A third group of analytics industry analysts is what we call analytics ambassadors, influencers, or evangelists. These folks have presented their enthusiasm for analytics through their seminars, books, and other publications. Illustrative examples include Steve Baker, Tom Davenport, Charles Duhigg, Wayne Eckerson, Bill Franks, Malcolm
Gladwell, Claudia Imhoff, Bill Inmon, and many others. Again, the list is not inclusive. All of these ambassadors have written books (some of them bestsellers!) and/or given presentations to promote analytics applications. Perhaps another group of evangelists to include here is the authors of textbooks on business intelligence/analytics (such as us, humbly) who aim to assist the next cluster in producing professionals for the analytics industry.
Academic Providers and Certification Agencies
In any knowledge-intensive industry such as analytics, the fundamental strength comes from having students who are interested in the technology and choose that industry as their profession. Universities play a key role in making this possible. This cluster, then, represents the academic programs that prepare professionals for the industry. It includes various components of business schools such as information systems, marketing, and management sciences. It also extends far beyond business schools to include computer science, statistics, mathematics, and industrial engineering departments across the world. The cluster also includes graphics developers who design new ways of visualizing information. Universities are offering undergraduate and graduate programs in analytics in all of these disciplines, though they may be labeled differently. A major growth frontier has been certificate programs in analytics that enable current professionals to retrain and retool themselves for analytics careers. Certificate programs enable practicing analysts to gain basic proficiency in specific software by taking a few critical courses. Power (2012) published a partial list of the graduate programs in analytics, but there are likely many more such programs, with new ones being added daily.
Another group of players assists with developing competency in analytics. These are certification programs to award a certificate of expertise in specific software. Virtually every major technology provider (IBM, Microsoft, MicroStrategy, Oracle, SAS, Teradata) has its own certification programs. These certificates ensure that potential new hires have a certain level of tool skills. On the other hand, INFORMS has just introduced a Certified Analytics Professional (CAP) certificate program that is aimed at testing an individual's general analytics competency. Any of these certifications give a college student additional marketable skills.
The growth of academic programs in analytics is staggering. Only time will tell if this cluster is overbuilding the capacity that can be consumed by the other eight clusters, but at this point the demand appears to outstrip the supply of qualified analytics graduates.
The purpose of this section has been to create a map of the landscape of the analytics industry. We identified nine different groups that play a key role in building and fostering this industry. It is possible for professionals to move from one industry cluster to another to take advantage of their skills. For example, expert professionals from providers can sometimes move to consulting positions, or directly to user organizations. Academics have provided consulting or have moved to industry. Overall, there is much to be excited about in the analytics industry at this point.
SECTION 14.9 REVIEW QUESTIONS
1. Identify the nine clusters in the analytics ecosystem.
2. Which clusters represent technology developers?
3. Which clusters represent technology users?
4. Give examples of an analytics professional moving from one cluster to another.
Chapter Highlights
• Geospatial data can enhance analytics applications by incorporating location information.
• Real-time location information of users can be mined to develop promotion campaigns that are targeted at a specific user in real time.
• Location information from mobile phones and PDAs can be used to create profiles of user behavior and movement. Such location information can enable users to find other people with similar interests and advertisers to customize their promotions.
• Location-based analytics can also benefit consumers directly rather than just businesses. Mobile apps are being developed to enable such innovative analytics applications.
• Web 2.0 is about the innovative application of existing technologies. Web 2.0 has brought together the contributions of millions of people and has made their work, opinions, and identity matter.
• User-created content is a major characteristic of Web 2.0, as is the emergence of social networking.
• Large Internet communities enable the sharing of content, including text, videos, and photos, and promote online socialization and interaction.
• Business-oriented social networks concentrate on business issues both in one country and around the world (e.g., recruiting, finding business partners). Business-oriented social networks include LinkedIn and Xing.
• Cloud computing offers the possibility of using software, hardware, platform, and infrastructure, all on a service-subscription basis. Cloud computing enables a more scalable investment on the part of a user.
• Cloud-computing-based BI services offer organizations the latest technologies without significant upfront investment.
• Analytics can affect organizations in many ways, as stand-alone systems or integrated among themselves, or with other computer-based information systems.
• The impact of analytics on individuals varies; it can be positive, neutral, or negative.
• Serious legal issues may develop with the introduction of intelligent systems; liability and privacy are the dominant problem areas.
• Many positive social implications can be expected from analytics. These range from providing opportunities to disabled people to leading the fight against terrorism. Quality of life, both at work and at home, is likely to improve as a result of analytics. Of course, there are also negative issues to be concerned about.
• The analytics industry consists of many different types of stakeholders.
Key Terms
business process reengineering (BPR)
cloud computing
mobile social networking
privacy
reality mining
Web 2.0
Questions for Discussion
1. What are the potential benefits of using geospatial data in analytics? Give examples.
2. What type of new applications can emerge from knowing locations of users in real time? What if you also knew what they have in their shopping cart, for example?
3. How can consumers benefit from using analytics, especially based on location information?
4. "Location-tracking-based profiling (reality mining) is powerful but also poses privacy threats." Comment.
5. Is cloud computing "just an old wine in a new bottle"? How is it similar to other initiatives? How is it different?
6. Discuss the relationship between mobile devices and social networking.
7. Some say that analytics in general, and ES in particular, dehumanize managerial activities, and others say they do not. Discuss arguments for both points of view.
8. Diagnosing infections and prescribing pharmaceuticals are the weak points of many practicing physicians (according to E. H. Shortliffe, one of the developers of MYCIN). It seems, therefore, that society would be better served if MYCIN (and other ES, see Chapter 11) were used extensively, but few physicians use ES. Answer the following questions:
a. Why do you think such systems are little used by physicians?
b. Assume that you are a hospital administrator whose physicians are salaried and report to you. What would you do to persuade them to use ES?
c. If the potential benefits to society are so great, can society do something that will increase doctors' use of such analytic systems?
9. Discuss the potential impacts of ADS systems on various types of employees and managers.
10. What are some of the major privacy concerns in employing analytics on mobile data?
11. How can one move from a technology provider cluster to a user cluster?
Exercises
Teradata University Network (TUN) and Other Hands-on Exercises
1. Go to teradatauniversitynetwork.com and search for case studies. Read the Continental Airlines cases written by Hugh Watson and his colleagues. What new applications can you imagine with the level of detailed data an airline can capture today?
2. Also review the Mycin case at teradatauniversitynetwork.com. What other similar applications can you envision?
3. At teradatauniversitynetwork.com, go to the podcasts library. Find podcasts of pervasive BI submitted by Hugh Watson. Summarize the points made by the speaker.
4. Go to teradatauniversitynetwork.com and search for BSI videos. Review these BSI videos and answer case questions related to them.
5. Location-tracking-based clustering provides the potential for personalized services but challenges for privacy. Divide the class in two parts to argue for and against such applications.
6. Identify ethical issues related to managerial decision making. Search the Internet, join chat rooms, and read articles from the Internet. Prepare a report on your findings.
7. Search the Internet to find examples of how intelligent systems (especially ES and intelligent agents) facilitate activities such as empowerment, mass customization, and teamwork.
8. Investigate the American Bar Association's Technology Resource Center (abanet.org/tech/ltrc/techethics.html) and nolo.com. What are the major legal and societal concerns and advances addressed there? How are they being dealt with?
9. Explore several sites related to healthcare (e.g., WebMD.com, who.int). Find issues related to analytics and privacy. Write a report on how these sites improve healthcare.
10. Go to computerworld.com and find five legal issues related to BI and analytics.
11. Enter youtube.com. Search for videos on cloud computing. Watch at least two. Summarize your findings.
12. Enter pandora.com. Find out how you can create and share music with friends. Why is this a Web 2.0 application?
13. Enter mashable.com and review the latest news regarding social networks and network strategy. Write a report.
14. Enter sociometricsolutions.com. Review various case studies and summarize one interesting application of sensors in understanding social exchanges in organizations.
15. The objective of the exercise is to familiarize you with the capabilities of smartphones to identify human activity. The data set is available at archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones. It contains accelerometer and gyroscope readings on 30 subjects who had the smartphone on their waist. The data is available in a raw format and involves some data preparation effort. Your objective is to identify and classify these readings into activities like walking, running, climbing, and such. More information on the data set is available on the download page. You may use clustering for initial exploration to gain an understanding of the data. You may use tools like R to prepare and analyze this data.
End-of-Chapter Application Case
Southern States Cooperative Optimizes Its Catalog Campaign
Southern States Cooperative is one of the largest farmer-owned cooperatives in the United States, with over 300,000 farmer members being served at over 1,200 retail locations across 23 states. It manufactures and purchases farm supplies like feed, seed, and fertilizer and distributes the products to farmers and other rural American customers.
Southern States Cooperative wanted to maintain and extend its success by better targeting the right customers in its direct-marketing campaigns. It realized the need to continually optimize marketing activities by gaining insights into its customers. Southern States employed Alteryx modeling tools, which enabled the company to solve the main business challenges of determining the right set of customers to be targeted for mailing the catalogs, choosing the right combination of stock-keeping units (SKUs) to be included in the catalog, cutting down mailing costs, and increasing customer response, resulting in increased revenue generation, ultimately enabling it to provide better services to its customers.
SSC first built a predictive model to determine which catalogs the customer was most likely to prefer. The data for the analysis included Southern States' historical customer transaction data; the catalog data including the SKU information; farm-level data corresponding to the customers; and geocoded customer locations, as well as Southern States outlets. In performing the analysis, data from one year was analyzed on the basis of recency, frequency, and monetary value of customer transactions. In marketing, this type of analysis is commonly known as RFM analysis. The number of unique combinations of catalog SKUs and the customer purchase history of particular items in SKUs were used to predict the customers who were most likely to use the catalogs and the SKUs that ought to be included for the customers to respond to the catalogs. Preliminary exploratory analysis revealed that all the RFM measures and the measure of previous catalog SKU purchases had a diminishing marginal effect. As a result, these variables were natural-log transformed for logistic regression models. In addition to the logistic regression models, both a decision tree (based on a recursive partitioning algorithm) and a random forest model were also estimated using an estimation sample. The four different models (a "full" logistic regression model, a reduced version of the "full" logistic regression model based on the application of both forward and backward stepwise variable selection, the decision tree model, and the random forest model) were then compared using a validation sample via a gains (cumulative captured) response chart. A model using logistic regression was selected in which the most significant predictive factor was customers' past purchase of items contained in the catalog.
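To make the model-comparison step concrete, here is a minimal sketch in Python with scikit-learn. Everything about the data below is a hypothetical stand-in for Southern States' actual inputs (the synthetic RFM-style features and response), and validation AUC is used as a simple stand-in for the gains (cumulative captured response) chart described above; only the general technique, log-transformed inputs feeding logistic regression, a decision tree, and a random forest compared on a held-out validation sample, follows the case.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical RFM-style inputs; in the case these came from one year of
# customer transaction, catalog SKU, and geocoded location data.
rng = np.random.default_rng(0)
n = 5000
X = rng.gamma(2.0, 2.0, size=(n, 4))  # recency, frequency, monetary, prior SKU purchases
y = (rng.random(n) < 1 / (1 + np.exp(-(0.3 * np.log1p(X[:, 3]) - 1.5)))).astype(int)

# Diminishing marginal effects motivated a natural-log transform of the inputs.
X_log = np.log1p(X)
X_est, X_val, y_est, y_val = train_test_split(X_log, y, test_size=0.3, random_state=0)

models = {
    "logistic (full)": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X_est, y_est)
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: validation AUC = {score:.3f}")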
Based on the predictive modeling results, an incremental revenue model was built to estimate the effect of a customer's catalog use and the percentage revenues generated from the customer who used a particular catalog in a particular catalog period. Linear regression was the main technique applied in estimating the revenue per customer responding to the catalog. The model indicated that there was an additional 30 percent revenue per individual who used the catalog as compared to the non-catalog customers.
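The incremental revenue idea can be sketched the same way: regress revenue on a 0/1 catalog-use indicator plus a control variable and read the uplift off the indicator's coefficient. All numbers below are fabricated for illustration; only the technique (ordinary least squares, as in the case's linear regression model) follows the description above.

import numpy as np

# Hypothetical data: revenue per customer, a 0/1 catalog-use flag, and one
# control variable (prior-year spend, invented here for illustration).
rng = np.random.default_rng(1)
n = 1000
used_catalog = rng.integers(0, 2, n)
prior_spend = rng.gamma(3.0, 100.0, n)
revenue = 50 + 0.3 * prior_spend + 40 * used_catalog + rng.normal(0, 20, n)

# Ordinary least squares: revenue ~ intercept + used_catalog + prior_spend
A = np.column_stack([np.ones(n), used_catalog, prior_spend])
coef, *_ = np.linalg.lstsq(A, revenue, rcond=None)
print(f"estimated incremental revenue per catalog user: {coef[1]:.2f}")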
Furthermore, based on the results of the predictive model and the incremental revenue model, an optimization model was developed to maximize the total income from mailing the catalogs to customers. The optimization problem jointly maximizes the selection of catalog SKUs and customers to be sent the catalog, taking into account the expected response rate from mailing the catalog to specific customers and the expected profit margin in percentage from the purchases by that customer. It also considers the mailing cost. This formulation represents a constrained non-linear programming problem. This model was solved using genetic algorithms, aiming to maximize the combined selection of the catalog SKUs and the customers to whom the catalog should be sent to result in increased response, at the same time increasing the revenues and cutting down the mailing costs.
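A minimal genetic-algorithm sketch of the customer-selection half of this problem follows. The response probabilities, margins, and mailing cost are invented, and the case's real formulation was a constrained non-linear program over SKUs and customers jointly; this toy version only illustrates the encode/select/crossover/mutate loop a GA solver runs.

import numpy as np

# Hypothetical inputs: per-customer expected response probability and expected
# profit margin from a purchase, plus a fixed mailing cost per catalog.
rng = np.random.default_rng(2)
n_customers = 200
p_response = rng.uniform(0.01, 0.30, n_customers)
margin = rng.uniform(5.0, 60.0, n_customers)
mail_cost = 1.5

def fitness(mask):
    # Expected profit of a mailing plan: sum of p * margin minus mailing cost.
    return np.sum(mask * (p_response * margin - mail_cost))

pop_size, n_gen, mut_rate = 60, 100, 0.01
pop = rng.integers(0, 2, (pop_size, n_customers))  # chromosome = 0/1 mailing plan

for _ in range(n_gen):
    scores = np.array([fitness(ind) for ind in pop])
    # Elitist truncation selection: keep the top half as parents.
    parents = pop[np.argsort(scores)[-pop_size // 2:]]
    children = []
    while len(children) < pop_size - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_customers)          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_customers) < mut_rate   # bit-flip mutation
        child = np.where(flip, 1 - child, child)
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"mail to {best.sum()} of {n_customers} customers, "
      f"expected profit {fitness(best):.2f}")

With this simplified, unconstrained fitness the optimum could be read off directly (mail wherever expected margin exceeds cost); the GA earns its keep when, as in the case, budget constraints and SKU-mix interactions make the problem non-linear.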
The Alteryx-based solution involved application of predictive analytics as well as prescriptive analytics techniques. The predictive model aimed to determine the customer's catalog use in purchasing selected items, and then prescriptive analytics was applied to the results generated by predictive models to help the marketing department prepare the customized catalogs containing the SKUs that suited the targeted customer needs, resulting in better revenue generation.
From the model-based counterfactual analysis of the 2010 catalogs, the models quantified that the people who responded to the catalogs spent more in purchasing goods than those who had not used a catalog. The models indicated that in the year 2010, by targeting the right customers with catalogs containing customized SKUs, Southern States Cooperative would have been able to reduce the number of catalogs sent by 63 percent, while improving the response rate by 34 percent, for an estimated incremental gross margin, less mailing cost, of $193,604 (a 24 percent increase). The models were also applied toward the analysis of 2011 catalogs, and they estimated that with the right combination and targeting of the 2011 catalogs, the total incremental gross margin would have been $206,812. With the insights derived from results of the historical data analysis, Southern States Cooperative is now planning to make use of these models in its future direct-mail marketing campaigns to target the right customers.
QUESTIONS FOR THE END-OF-CHAPTER APPLICATION CASE
1. What is the main business problem faced by Southern States Cooperative?
2. How was predictive analytics applied in the application case?
3. What problems were solved by the optimization techniques employed by Southern States Cooperative?
What We Can Learn from This End-of-Chapter Application Case
Predictive models built on historical data can be used to help quantify the effects of new techniques employed, as part of a retrospective assessment that otherwise cannot be quantified. The quantified values are estimates, not hard numbers, but obtaining hard numbers simply isn't possible. Often in a real-world scenario, many business problems require application of more than one type of analytics solution. There is often a chain of actions associated in solving problems where each stage relies on the outputs of the previous stages. Valuable insights can be derived by application of each type of analytic technique, which can be further applied to reach the optimal solution. This application case illustrates a combination of predictive and prescriptive analytics where geospatial data also played a role in developing the initial model.
Sources: Alteryx.com, "Southern States Cooperative Case Study," and direct communication with Dr. Dan Putler, alteryx.com/sites/default/files/resources/files/case-study-southern-states.pdf (accessed February 2013).
References
Alteryx.com. "Great Clips." alteryx.com/sites/default/files/resources/files/case-study-great-chips.pdf (accessed March 2013).
Alteryx.com. "Southern States Cooperative Case Study." Direct communication with Dr. Dan Putler. alteryx.com/sites/default/files/resources/files/case-study-southern-states.pdf (accessed February 2013).
Anandarajan, M. (2002). "Internet Abuse in the Workplace." Communications of the ACM, Vol. 45, No. 1, pp. 53-54.
Argyris, C. (1971). "Management Information Systems: The Challenge to Rationality and Emotionality." Management Science, Vol. 17, No. 6, pp. B-275.
Chae, B., D. B. Paradice, J. F. Courtney, and C. J. Cagle. (2005). "Incorporating an Ethical Perspective into Problem Formulation." Decision Support Systems, Vol. 40, No. 2, pp. 197-212.
Davenport, T. H., and J. G. Harris. (2005). "Automated Decision Making Comes of Age." MIT Sloan Management Review, Vol. 46, No. 4, p. 83.
Delen, D., B. Hardgrave, and R. Sharda. (2007). "RFID for Better Supply-Chain Management Through Enhanced Information Visibility." Production and Operations Management, Vol. 16, No. 5, pp. 613-624.
Dyche, J. (2011). "Data-as-a-Service, Explained and Defined." searchdatamanagement.techtarget.com/answer/Data-as-a-service-explained-and-defined (accessed March 2013).
Eagle, N., and A. Pentland. (2006). "Reality Mining: Sensing Complex Social Systems." Personal and Ubiquitous Computing, Vol. 10, No. 4, pp. 255-268.
Electronic Privacy Information Center. (2005). "USA PATRIOT Act." epic.org/privacy/terrorism/usapatriot (accessed March 2013).
Elson, R. J., and R. LeClerc. (2005). "Security and Privacy Concerns in the Data Warehouse Environment." Business Intelligence Journal, Vol. 10, No. 3, p. 51.
Emc.com. "Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field." emc.com/collateral/about/news/emc-data-science-study-wp (accessed February 2013).
Fritzsche, D. (1995, November). "Personal Values: Potential Keys to Ethical Decision Making." Journal of Business Ethics, Vol. 14, No. 11.
Gnau, S. (2010). "Find Your Edge." Teradata Magazine Special Edition Location Intelligence. teradata.com/articles/Teradata-Magazine-Special-Edition-Location-Intelligence-AR6270/?type=ART (accessed March 2013).
Gupta, A., and R. Sharda. (2009). "SIMONE: A Simulator for Interruptions and Message Overload in Network Environments." International Journal of Simulation and Process Modeling, Vol. 4, Nos. 3/4, pp. 237-247.
Institute of Medicine of the National Academies. "Health Data Initiative Forum III: The Health Datapalooza." iom.edu/Activities/PublicHealth/HealthData/2012-JUN-05/Afternoon-Apps-Demos/outside-100plus.aspx (accessed February 2013).
IntelligentUtility.com. "OGE's Three-Tiered Architecture Aids Data Analysis." intelligentutility.com/article/12/02/oges-three-tiered-architecture-aids-data-analysis&utm_medium=eNL&utm_campaign=IU_DAILY2&utm_term=Original-Magazine (accessed March 2013).
Kalakota, R. (2011). "Analytics-as-a-Service: Understanding How Amazon.com Is Changing the Rules." practicalanalytics.wordpress.com/2011/08/13/analytics-as-a-service-understanding-how-amazon-com-is-changing-the-rules (accessed March 2013).
Krivda, C. D. (2010). "Pinpoint Opportunity." Teradata Magazine Special Edition Location Intelligence. teradata.com/articles/Teradata-Magazine-Special-Edition-Location-Intelligence-AR6270/?type=ART (accessed March 2013).
Liu, S., J. Carlsson, and S. Nummila. (2002, July). "Mobile E-Services: Creating Added Value for Working Mothers." Proceedings DSI AGE 2002, Cork, Ireland.
Mason, R. O., F. M. Mason, and M. J. Culnan. (1995). Ethics of Information Management. Thousand Oaks, CA: Sage.
Mintzberg, H., et al. (2002). The Strategy Process, 4th ed. Upper Saddle River, NJ: Prentice Hall.
Mobilemarketer.com. "Quiznos Sees 20pc Boost in Coupon Redemption via Location-Based Mobile Ad Campaign." mobilemarketer.com/cms/news/advertising/14738.html (accessed February 2013).
Murali, D. (2004). "Ethical Dilemmas in Decision Making." BusinessLine.
Ogepet.com. "Smart Hours." ogepet.com/programs/smarthours.aspx (accessed March 2013).
Perez-Cascante, L. P., M. Plaisent, L. Maguiraga, and P. Bernard. (2002). "The Impact of Expert Decision Support Systems on the Performance of New Employees." Information Resources Management Journal.
Peslak, A. P. (2005). "Internet Privacy Policies." Information Resources Management Journal.
Power, D. P. (2012). "What Universities Offer Master's Degrees in Analytics and Data Science?" dssresources.com/faq/index.php?action=artikel&id=250 (accessed February 2013).
Ratzesberger, O. (2011). "Analytics as a Service." xlmpp.com/articles/16-articles/39-analytics-as-a-service (accessed September 2011).
Sensenetworks.com. "CabSense New York: The Smartest Way to Find a Cab." sensenetworks.com/products/macrosense-technology-platform/cabsense (accessed February 2013).
Stein, J. "Data Mining: How Companies Now Know Everything About You." Time Magazine. time.com/time/magazine/article/0,9171,2058205,00.html (accessed March 2013).
Stonebraker, M. (2010). "SQL Databases V. NoSQL Databases." Communications of the ACM, Vol. 53, No. 4, pp. 10-11.
Teradata.com. "Sabre Airline Solutions." teradata.com/t/case-studies/Sabre-Airline-Solutions-EB6281 (accessed March 2013).
Teradata.com. "Utilities Analytic Summit 2012 Oklahoma Gas & Electric." teradata.com/video/Utilities-Analytic-Summit-2012-Oklahoma-Gas-and-Electric (accessed March 2013).
Trajman, O. (2009, March). "Business Intelligence in the Clouds." InfoManagement Direct. information-management.com/infodirect/2009_111/10015046-1.html (accessed July 2009).
Tudor, B., and C. Pettey. (2011, January 6). "Gartner Says New Relationships Will Change Business Intelligence and Analytics." Gartner Research.
Tynan, D. (2002, June). "How to Take Back Your Privacy (34 Steps)." PC World.
WallStreetJournal.com. (2010). "What They Know." online.wsj.com/public/page/what-they-know-2010.html (accessed March 2013).
Westholder, M. (2010). "Pinpoint Opportunity." Teradata Magazine Special Edition Location Intelligence. teradata.com/articles/Teradata-Magazine-Special-Edition-Location-Intelligence-AR6270/?type=ART (accessed March 2013).
White, C. (2008, July 30). "Business Intelligence in the Cloud: Sorting Out the Terminology." BeyeNetwork. b-eye-network.com/channels/1138/view/8122 (accessed March 2013).
Winter, R. (2008). "E-Bay Turns to Analytics as a Service." informationweek.com/news/software/info_management/210800736 (accessed March 2013).
Yuhanna, N., M. Gilpin, and A. Knoll. (2010). "The Forrester Wave: Information-as-a-Service, Q1 2010." forrester.com/rb/Research/wave%26trade%3B_information-as-a-service%2C_q1_2010/q/id/55204/t/2 (accessed March 2013).
GLOSSARY
active data warehousing See real-time data warehousing.
ad hoc DSS A DSS that deals with specific problems that are usually neither anticipated nor recurring.
ad hoc query A query that cannot be determined prior to the moment the query is issued.
agency The degree of autonomy vested in a software agent.
agent-based models A simulation modeling technique to support complex decision systems where a system or network is modeled as a set of autonomous decision-making units called agents that individually evaluate their situation and make decisions on the basis of a set of predefined behavior and interaction rules.
algorithm A step-by-step search in which improvement is made at every step until the best solution is found.
analog model An abstract, symbolic model of a system that behaves like the system but looks different.
analogical reasoning The process of determining the outcome of a problem by using analogies. It is a procedure for drawing conclusions about a problem by using past experience.
analytic hierarchy process (AHP) A modeling structure for representing multi-criteria (multiple goals, multiple objectives) problems, with sets of criteria and alternatives (choices), commonly found in business environments.
analytical models Mathematical models into which data are loaded for analysis.
analytical techniques Methods that use mathematical formulas to derive an optimal solution directly or to predict a certain result, mainly in solving structured problems.
analytics The science of analysis: the use of data for decision making.
application service provider (ASP) A software vendor that offers leased software applications to organizations.
Apriori algorithm The most commonly used algorithm to discover association rules by recursively identifying frequent itemsets.
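A minimal sketch of the idea in Python (the transactions and the support threshold are made up for illustration): count single items, keep those meeting minimum support, then extend frequent itemsets one item at a time.

from itertools import combinations

transactions = [
    {"milk", "bread"}, {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"}, {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]
min_support = 3  # minimum number of transactions containing the itemset

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", [set(s) for s in frequent])
    # Candidate generation: unions of frequent (k-1)-itemsets that have size k.
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1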
area under the ROC curve A graphical assessment technique for binary classification models where the true positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis.
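A small Python illustration of an equivalent way to compute this area: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (the scores and labels below are made up).

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
# Count positive-over-negative "wins", giving ties half credit.
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
print(f"AUC = {wins / (len(pos) * len(neg)):.3f}")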
artificial intelligence (AI) The subfield of computer science concerned with symbolic reasoning and problem solving.
artificial neural network (ANN) Computer technology that attempts to build computers that operate like a human brain. The machines possess simultaneous memory storage and work with ambiguous information. Sometimes called, simply, a neural network. See neural computing.
association A category of data mining algorithm that establishes relationships about items that occur together in a given record.
asynchronous Occurring at different times.
authoritative pages Web pages that are identified as particularly popular based on links by other Web pages and directories.
automated decision support (ADS) A rule-based system that provides a solution to a repetitive managerial problem. Also known as enterprise decision management (EDM).
automated decision system (ADS) A business rule-based system that uses intelligence to recommend solutions to repetitive decisions (such as pricing).
autonomy The capability of a software agent acting on its own or being empowered.
axon An outgoing connection (i.e., terminal) from a biological neuron.
backpropagation The best-known learning algorithm in neural computing where the learning is done by comparing computed outputs to desired outputs of training cases.
backward chaining A search technique (based on if-then rules) used in production systems that begins with the action clause of a rule and works backward through a chain of rules in an attempt to find a verifiable set of condition clauses.
balanced scorecard (BSC) A performance measurement and management methodology that helps translate an organization's financial, customer, internal process, and learning and growth objectives and targets into a set of actionable initiatives.
best practices In an organization, the best methods for solving problems. These are often stored in the knowledge repository of a knowledge management system.
Big Data Data that exceeds the reach of commonly used hardware environments and/or capabilities of software tools to capture, manage, and process it within a tolerable time span.
blackboard An area of working memory set aside for the description of a current problem and for recording intermediate results in an expert system.
black-box testing Testing that involves comparing test results to actual results.
bootstrapping A sampling technique where a fixed number of instances from the original data is sampled (with replacement) for training and the rest of the data set is used for testing.
bot An intelligent software agent. Bot is an abbreviation of robot and is usually used as part of another term, such as knowbot, softbot, or shopbot.
business (or system) analyst An individual whose job is to analyze business processes and the support they receive (or need) from information technology.
business analytics (BA) The application of models directly to business data. Business analytics involves using DSS tools, especially models, in assisting decision makers. See also business intelligence (BI).
business intelligence (BI) A conceptual framework for decision support. It combines architecture, databases (or data warehouses), analytical tools, and applications.
business network A group of people who have some kind of commercial relationship; for example, sellers and buyers, buyers among themselves, buyers and suppliers, and colleagues and other colleagues.
business performance management (BPM) An advanced performance measurement and analysis approach that embraces planning and strategy.
business process reengineering (BPR) A methodology for introducing a fundamental change in specific business processes. BPR is usually supported by an information system.
case library The knowledge base of a case-based reasoning system.
case-based reasoning (CBR) A methodology in which knowledge or inferences are derived from historical cases.
categorical data Data that represent the labels of multiple classes used to divide a variable into specific groups.
causal loops A way for relating different factors in a system dynamics model to define evolution of relationships over time.
certainty A condition under which it is assumed that future values are known for sure and only one result is associated with an action.
certainty factors (CF) A popular technique for representing uncertainty in expert systems where the belief in an event (or a fact or a hypothesis) is expressed using the expert's unique assessment.
chief knowledge officer (CKO) The leader typically responsible for knowledge management activities and operations in an organization.
choice phase The third phase in decision making, in which an alternative is selected.
chromosome A candidate solution for a genetic algorithm.
classification Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior.
clickstream analysis The analysis of data that occur in the Web environment.
clickstream data Data that provide a trail of the user's activities and show the user's browsing patterns (e.g., which sites are visited, which pages, how long).
cloud computing Information technology infrastructure (hardware, software, applications, platform) that is available as a service, usually as virtualized resources.
clustering Partitioning a database into segments in which the members of a segment share similar qualities.
cognitive limits The limitations of the human mind related to processing information.
collaboration hub The central point of control for an e-market. A single collaboration hub (c-hub), representing one e-market owner, can host multiple collaboration spaces (c-spaces) in which trading partners use c-enablers to exchange data with the c-hub.
collaborative filtering A method for generating recommendations from user profiles. It uses preferences of other users with similar behavior to predict the preferences of a particular user.
collaborative planning, forecasting, and replenishment (CPFR) A project in which suppliers and retailers collaborate in their planning and demand forecasting to optimize the flow of materials along the supply chain.
community of practice (COP) A group of people in an organization with a common professional interest, often self-organized, for managing knowledge in a knowledge management system.
complexity A measure of how difficult a problem is in terms of its formulation for optimization, its required optimization effort, or its stochastic nature.
confidence In association rules, the conditional probability of finding the RHS of the rule present in a list of transactions where the LHS of the rule already exists.
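In symbols, using the standard definition:
\[ \mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} = P(Y \mid X) \]
For example, if X appears in 40 of 100 transactions and X together with Y appears in 30, the confidence of X => Y is 30/40 = 0.75.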
connection weight The weight associated with each link in a neural network model. Neural network learning algorithms assess connection weights.
consultation environment The part of an expert system that a non-expert uses to obtain expert knowledge and advice. It includes the workplace, inference engine, explanation facility, recommended action, and user interface.
content management system (CMS) An electronic document management system that produces dynamic versions of documents and automatically maintains the current set for use at the enterprise level.
content-based filtering A type of filtering that recommends items for a user based on the description of previously evaluated items and information available from the content (e.g., keywords).
corporate (enterprise) portal A gateway for entering a corporate Web site. A corporate portal enables communication, collaboration, and access to company information.
corpus In linguistics, a large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.
CRISP-DM A cross-industry standardized process of conducting data mining projects, which is a sequence of six steps that starts with a good understanding of the business and the need for the data mining project (i.e., the application domain) and ends with the deployment of the solution that satisfied the specific business need.
critical event processing (CEP) A method of capturing, tracking, and analyzing streams of data to detect certain events (out of normal happenings) that are worthy of the effort.
critical success factors (CSF) Key factors that delineate the things that an organization must excel at to be successful in its market space.
crossover The combination of parts of two superior solutions by a genetic algorithm in an attempt to produce an even better solution.
cube A subset of highly interrelated data that is organized to allow users to combine any attributes in a cube (e.g., stores, products, customers, suppliers) with any metrics in the cube (e.g., sales, profit, units, age) to create various two-dimensional views, or slices, that can be displayed on a computer screen.
customer experience management (CEM) Applications designed to report on the overall user experience by detecting Web application issues and problems, by tracking and resolving business process and usability obstacles, by reporting on site performance and availability, by enabling real-time alerting and monitoring, and by supporting deep-diagnosis of observed visitor behavior.
dashboard A visual presentation of critical data for executives to view. It allows executives to see hot spots in seconds and explore the situation.
data Raw facts that are meaningless by themselves (e.g., names, numbers).
data cube A two-dimensional, three-dimensional, or higher-dimensional object in which each dimension of the data represents a measure of interest.
data integration Integration that comprises three major processes: data access, data federation, and change capture. When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL, analysis tools, and data warehousing environments.
data integrity A part of data quality where the accuracy of the data (as a whole) is maintained during any operation (such as transfer, storage, or retrieval).
data mart A departmental data warehouse that stores only relevant data.
data mining A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases.
data quality (DQ) The holistic quality of data, including their accuracy, precision, completeness, and relevance.
data scientist A new role or a job commonly associated with Big Data or data science.
data stream mining The process of extracting novel patterns and knowledge structures from continuously streaming data records. See stream analytics.
data visualization A graphical, animation, or video presentation of data and the results of data analysis.
data warehouse (DW) A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format.
data warehouse administrator (DWA) A person responsible for the administration and management of a data warehouse.
database A collection of files that are viewed as a single storage concept. The data are then available to a wide range of users.
database management system (DBMS) Software for establishing, updating, and querying (e.g., managing) a database.
deception detection A way of identifying deception (intentionally propagating beliefs that are not true) in voice, text, and/or body language of humans.
decision analysis Methods for determining the solution to a problem, typically when it is inappropriate to use iterative algorithms.
decision automation systems Computer systems that are aimed at building rule-oriented decision modules.
decision making The action of selecting among alternatives.
decision room An arrangement for a group support system in which PCs are available to some or all participants. The objective is to enhance groupwork.
decision style The manner in which a decision maker thinks and reacts to problems. It includes perceptions, cognitive responses, values, and beliefs.
decision support systems (DSS) A conceptual framework for a process of supporting managerial decision making, usually by modeling problems and employing quantitative models for solution analysis.
decision tables Information and knowledge conveniently organized in a systematic, tabular manner, often prepared for further analysis.
decision tree A graphical presentation of a sequence of interrelated decisions to be made under assumed risk. This technique classifies specific entities into particular classes based upon the features of the entities; a root is followed by internal nodes, each node (including root) is labeled with a question, and arcs associated with each node cover all possible responses.
decision variable A variable in a model that can be changed and manipulated by the decision maker. Decision variables correspond to the decisions to be made, such as quantity to produce, amounts of resources to allocate, etc.
defuzzification The process of creating a crisp solution from a fuzzy logic solution.
Delphi method A qualitative forecasting methodology that uses anonymous questionnaires. It is effective for technological forecasting and for forecasting involving sensitive issues.
demographic filtering A type of filtering that uses the demographic data of a user to determine which items may be appropriate for recommendation.
dendrite The part of a biological neuron that provides inputs to the cell.
dependent data mart A subset that is created directly from a data warehouse.
descriptive model A model that describes things as they are.
design phase The second decision-making phase, which involves finding possible alternatives in decision making and assessing their contributions.
development environment The part of an expert system that a builder uses. It includes the knowledge base and the inference engine, and it involves knowledge acquisition and improvement of reasoning capability. The knowledge engineer and the expert are considered part of the environment.
diagnostic control system A cybernetic system that has inputs, a process for transforming the inputs into outputs, a standard or benchmark against which to compare the outputs, and a feedback channel to allow information on variances between the outputs and the standard to be communicated and acted on.
dimensional modeling A retrieval-based system that supports high-volume query access.
directory A catalog of all the data in a database or all the models in a model base.
discovery-driven data mining A form of data mining that finds patterns, associations, and relationships among data in order to uncover facts that were previously unknown or not even contemplated by an organization.
discrete event simulation Building a model of a system where the interaction between different entities is studied. The simplest example of this is a shop consisting of a server and customers.
distance measure A method used to calculate the closeness between pairs of items in most cluster analysis methods. Popular distance measures include Euclidian distance (the ordinary distance between two points that one would measure with a ruler) and Manhattan distance (also called the rectilinear distance, or taxicab distance, between two points).
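In symbols, for n-dimensional points x and y:
\[ d_{\mathrm{Euclidean}}(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}, \qquad d_{\mathrm{Manhattan}}(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n}\lvert x_i - y_i\rvert \]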
distributed artificial intelligence (DAI) A multiple-agent system for problem solving. DAI involves splitting a problem into multiple cooperating systems to derive a solution.
DMAIC A closed-loop business improvement model that includes these steps: defining, measuring, analyzing, improving, and controlling a process.
document management systems (DMS) Information systems (e.g., hardware, software) that allow the flow, storage, retrieval, and use of digitized documents.
drill-down The investigation of information in detail (e.g., finding not only total sales but also sales by region, by product, or by salesperson). Finding the detailed sources.
DSS application A DSS program built for a specific purpose (e.g., a scheduling system for a specific company).
dynamic models Models whose input data are changed over time (e.g., a 5-year profit or loss projection).
effectiveness The degree of goal attainment. Doing the right things.
efficiency The ratio of output to input. Appropriate use of resources. Doing things right.
electronic brainstorming A computer-supported methodology of idea generation by association. This group process uses analogy and synergy.
electronic document management (EDM) A method for processing documents electronically, including capture, storage, retrieval, manipulation, and presentation.
electronic meeting systems (EMS) An information technology-based environment that supports group meetings (groupware), which may be distributed geographically and temporally.
elitism A concept in genetic algorithms where some of the better solutions are migrated to the next generation in order to preserve the best solution.
end-user computing Development of one's own information system. Also known as end-user development.
Enterprise 2.0 Technologies and business practices that free the workforce from the constraints of legacy communication and productivity tools such as e-mail. Provides business managers with access to the right information at the right time through a Web of interconnected applications, services, and devices.
enterprise application integration (EAI) A technology that provides a vehicle for pushing data from source systems into a data warehouse.
enterprise data warehouse (EDW) An organizational-level data warehouse developed for analytical purposes.
enterprise decision management (EDM) See automated decision support (ADS).
enterprise information integration (EII) An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
enterprise knowledge portal (EKP) An electronic doorway into a knowledge management system.
enterprise-wide collaboration system A group support system that supports an entire enterprise.
entropy A metric that measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, then there is no uncertainty or randomness in that data set, and therefore the entropy is zero.
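In symbols, for a data set S with c classes occurring in proportions p_1, ..., p_c:
\[ H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \]
A pure subset (one class, p = 1) gives H = 0, as the entry states; a two-class 50/50 split gives H = 1 bit, the maximum for two classes.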
environmental scanning and analysis A process that involves conducting a search for and an analysis of information in external databases and flows of information.
evolutionary algorithm A class of heuristic-based optimization algorithms modeled after the natural process of biological evolution, such as genetic algorithms and genetic programming.
expert A human being who has developed a high level of proficiency in making judgments in a specific, usually narrow, domain.
expert location system An interactive computerized system that helps employees find and connect with colleagues who have expertise required for specific problems, whether they are across the country or across the room, in order to solve specific, critical business problems in seconds.
expert system (ES) A computer system that applies reasoning methodologies to knowledge in a specific domain to render advice or recommendations, much like a human expert. An ES is a computer system that achieves a high level of performance in task areas that, for human beings, require years of special education and training.
expert system (ES) shell A computer program that facilitates relatively easy implementation of a specific expert system. Analogous to a DSS generator.
expert tool user A person who is skilled in the application of one or more types of specialized problem-solving tools.
expertise The set of capabilities that underlines the performance of human experts, including extensive domain knowledge, heuristic rules that simplify and improve approaches to problem solving, metaknowledge and metacognition, and compiled forms of behavior that afford great economy in a skilled performance.
explanation subsystem The component of an expert system that can explain the system's reasoning and justify its conclusions.
explanation-based learning A machine-learning approach that assumes that there is enough existing theory to rationalize why one instance is or is not a prototypical member of a class.
explicit knowledge Knowledge that deals with objective, rational, and technical material (e.g., data, policies, procedures, software, documents). Also known as leaky knowledge.
extraction The process of capturing data from several sources, synthesizing them, summarizing them, determining which of them are relevant, and organizing them, resulting in their effective integration.
extraction, transformation, and load (ETL) A data warehousing process that consists of extraction (i.e., reading data from a database), transformation (i.e., converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse).
facilitator (in a GSS) A person who plans, organizes, and electronically controls a group in a collaborative computing environment.
forecasting Predicting the future.
forward chaining A data-driven search in a rule-based system.
functional integration The provision of different support functions as a single system through a single, consistent interface.
fuzzification A process that converts an accurate number into a fuzzy description, such as converting from an exact age into categories such as young and old.
fuzzy logic A logically consistent way of reasoning that can cope with uncertain or partial information. Fuzzy logic is characteristic of human thinking and expert systems.
fuzzy set A set theory approach in which set membership is less precise than having objects strictly in or out of the set.
genetic algorithm A software program that learns in an evolutionary manner, similar to the way biological systems evolve.
geographic information system (GIS) An information system capable of integrating, editing, analyzing, sharing, and displaying geographically referenced information.
Gini index A metric that is used in economics to measure the diversity of the population. The same concept can be used to determine the purity of a specific class as a result of a decision to branch along a particular attribute/variable.
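In symbols, for a node with c classes in proportions p_1, ..., p_c:
\[ \mathrm{Gini} = 1 - \sum_{i=1}^{c} p_i^2 \]
A pure node gives 0; a two-class 50/50 split gives 0.5, so decision-tree splits are chosen to reduce this value.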
global positioning systems (GPS) Wireless devices that use satellites to enable users to detect the position on earth of items (e.g., cars or people) the devices are attached to, with reasonable precision.
goal seeking Analyzing a model (usually in spreadsheets) to determine the level an independent variable should take in order to achieve a specific level/value of a goal variable.
grain A definition of the highest level of detail that is supported in a data warehouse.
graphical user interface (GUI) An interactive, user-friendly interface in which, by using icons and similar objects, the user can control communication with a computer.
group decision support system (GDSS) An interactive computer-based system that facilitates the solution of semistructured and unstructured problems by a group of decision makers.
group support system (GSS) Information system, specifically DSS, that supports the collaborative work of groups.
group work Any work being performed by more than one person.
groupthink In a meeting, continual reinforcement of an idea by group members.
groupware Computerized technologies and methods that aim to support the work of people working in groups.
groupwork Any work being performed by more than one person.
Hadoop An open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data.
heuristic programming The use of heuristics in problem solving.
heuristics Informal, judgmental knowledge of an application area that constitutes the rules of good judgment in the field. Heuristics also encompasses the knowledge of how to solve problems efficiently and effectively, how to plan steps in solving a complex problem, how to improve performance, and so forth.
hidden layer The middle layer of an artificial neural network that has three or more layers.
Hive A Hadoop-based data warehousing-like framework originally developed by Facebook.
hub One or more Web pages that provide a collection of links to authoritative pages.
hybrid (integrated) computer system Different but integrated computer support systems used together in one decision-making situation.
hyperlink-induced topic search (HITS) The most popular publicly known and referenced algorithm in Web mining used to discover hubs and authorities.
hyperplane A geometric concept commonly used to describe the separation surface between different classes of things within a multidimensional space.
hypothesis-driven data mining A form of data mining that begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition.
IBM SPSS Modeler A very popular, commercially available, comprehensive data, text, and Web mining software suite developed by SPSS (formerly Clementine).
iconic model A scaled physical replica.
idea generation The process by which people generate ideas, usually supported by software (e.g., developing alternative solutions to a problem). Also known as brainstorming.
implementation phase The fourth decision-making phase, involving actually putting a recommended solution to work.
independent data mart A small data warehouse designed for a strategic business unit or a department.
inductive learning A machine-learning approach in which rules are inferred from facts or data.
inference engine The part of an expert system that actually performs the reasoning function.
influence diagram A diagram that shows the various types of variables in a problem (e.g., decision, independent, result) and how they are related to each other.
influences rules In expert systems, a collection of if-then rules that govern the processing of knowledge rules, acting as a critical part of the inferencing mechanism.
information Data organized in a meaningful way.
information gain The splitting mechanism used in ID3 (a popular decision-tree algorithm).
information overload An excessive amount of information being provided, making processing and absorbing tasks very difficult for the individual.
institutional DSS A DSS that is a permanent fixture in an organization and has continuing financial support. It deals with decisions of a recurring nature.
integrated intelligent systems A synergistic combination (or hybridization) of two or more systems to solve complex decision problems.
intellectual capital The know-how of an organization. Intellectual capital often includes the knowledge that employees possess.
intelligence A degree of reasoning and learned behavior, usually task or problem-solving oriented.
intelligence phase The initial phase of problem definition in decision making.
intelligent agent (IA) An expert or knowledge-based system embedded in computer-based information systems (or their components) to make them smarter.
intelligent computer-aided instruction (ICAI) The use of AI techniques for training or teaching with a computer.
intelligent database A database management system exhibiting artificial intelligence features that assist the user or designer; often includes ES and intelligent agents.
intelligent tutoring system (ITS) Self-tutoring systems that can guide learners in how best to proceed with the learning process.
interactivity A characteristic of software agents that allows them to interact (communicate and/or collaborate) with each other without having to rely on human intervention.
intermediary A person who uses a computer to fulfill requests made by other people (e.g., a financial analyst who uses a computer to answer questions for top management).
intermediate result variable A variable that contains the values of intermediate outcomes in mathematical models.
Internet telephony See Voice over IP (VoIP).
interval data Variables that can be measured on interval scales.
inverse document frequency A common and very useful transformation of indices in a term-by-document matrix that reflects both the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (term frequencies).
iterative design A systematic process for system development that is used in management support systems (MSS). Iterative design involves producing a first version of MSS, revising it, producing a second design version, and so on.
kernel methods A class of algorithms for pattern analysis that approaches the problem by mapping highly nonlinear data into a high dimensional feature space, where the data items are transformed into a set of points in a Euclidean space for better modeling.
kernel trick In machine learning, a method for using a linear classifier algorithm to solve a nonlinear problem by mapping the original nonlinear observations onto a higher-dimensional space, where the linear classifier is subsequently used; this makes a linear classification in the new space equivalent to nonlinear classification in the original space.
kernel type In the kernel trick, the type of transformation algorithm used to represent data items in a Euclidean space. The most commonly used kernel type is the radial basis function.
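The radial basis function mentioned here is usually written with a width parameter \gamma (a detail assumed here, not given in the entry):

K(\mathbf{x}, \mathbf{x}') = \exp\left( -\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2 \right)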
key performance indicator (KPI) A measure of performance against a strategic objective and goal.
k-fold cross-validation A popular accuracy assessment technique for prediction models in which the complete data set is randomly split into k mutually exclusive subsets of approximately equal size. The classification model is trained and tested k times: each time it is trained on all but one fold and then tested on the remaining single fold. The cross-validation estimate of the overall accuracy of a model is calculated by simply averaging the k individual accuracy measures.
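A minimal sketch of the procedure in Python, assuming scikit-learn is available (the iris data set and the decision-tree learner are illustrative choices, not part of the definition):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)  # k = 10 folds

accuracies = []
for train_idx, test_idx in kf.split(X):
    # Train on all folds but one; test on the remaining single fold.
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

# The cross-validation estimate is the average of the k accuracy measures.
print(sum(accuracies) / len(accuracies))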
k-nearest neighbor (k-NN) A prediction method for classification as well as regression type prediction problems in which the prediction is made based on the similarity to k neighbors.
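A from-scratch Python sketch of the idea (the two-dimensional data, the labels, and k = 3 are all made up for illustration):

from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    # Rank training points by Euclidean distance to the query.
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    # Classification: majority vote among the k nearest neighbors.
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

train = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
labels = ["low", "low", "high", "high"]
print(knn_predict(train, labels, (1.1, 0.9)))  # -> "low"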
knowledge Understanding, awareness, or familiarity acquired through education or experience; anything that has been learned, perceived, discovered, inferred, or understood; the ability to use information. In a knowledge management system, knowledge is information in action.
knowledge acquisition The extraction and formulation of knowledge derived from various sources, especially from experts.
knowledge audit The process of identifying the knowledge an organization has, who has it, and how it flows (or does not) through the enterprise.
knowledge base A collection of facts, rules, and procedures organized into schemas. A knowledge base is the assembly of all the information and knowledge about a specific field of interest.
knowledge discovery in databases (KDD) A machine-learning process that performs rule induction or a related procedure to establish knowledge from large databases.
knowledge engineer An artificial intelligence specialist responsible for the technical side of developing an expert system. The knowledge engineer works closely with the domain expert to capture the expert's knowledge in a knowledge base.
knowledge engineering The engineering discipline in which knowledge is integrated into computer systems to solve complex problems that normally require a high level of human expertise.
knowledge management The active management of the expertise in an organization. It involves collecting, categorizing, and disseminating knowledge.
knowledge management system (KMS) A system that facilitates knowledge management by ensuring knowledge flow from the person(s) who knows to the person(s) who needs to know throughout the organization; knowledge evolves and grows during the process.
knowledge repository The actual storage location of knowledge in a knowledge management system. A knowledge repository is similar in nature to a database but is generally text oriented.
knowledge rules A collection of if-then rules that represents the deep knowledge about a specific problem.
knowledge-based economy The modern, global economy, which is driven by what people and organizations know rather than only by capital and labor. An economy based on intellectual assets.
knowledge-based system (KBS) Typically, a rule-based system for providing expertise. A KBS is identical to an expert system, except that the source of expertise may include documented knowledge.
knowledge-refining system A system that is capable of analyzing its own performance, learning, and improving itself for future consultations.
knowware Technology tools that support knowledge management.
Kohonen self-organizing feature map A type of neural network model for machine learning.
leaky knowledge See explicit knowledge.
lean manufacturing A production methodology focused on the elimination of waste or non-value-added features in a process.
learning A process of self-improvement in which new knowledge is obtained by using what is already known.
learning algorithm The training procedure used by an artificial neural network.
learning organization An organization that is capable of learning from its past experience, implying the existence of an organizational memory and a means to save, represent, and share it through its personnel.
learning rate A parameter for learning in neural networks. It determines the portion of the existing discrepancy that must be offset.
linear programming (LP) A mathematical model for the optimal solution of resource allocation problems. All the relationships among the variables in this type of model are linear.
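A minimal sketch with SciPy's linprog (the product-mix numbers are illustrative; because linprog minimizes, the profit coefficients are negated to maximize 3x + 5y):

from scipy.optimize import linprog

c = [-3, -5]                     # objective: maximize profit 3x + 5y
A_ub = [[1, 0], [0, 2], [3, 2]]  # linear resource constraints (<=)
b_ub = [4, 12, 18]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

print(res.x)     # optimal values of the decision variables
print(-res.fun)  # optimal profit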
linguistic cues A collection of numerical measures extracted from textual content using linguistic rules and theories.
link analysis The automatic discovery of linkages among many objects of interest, such as the links between Web pages and the referential relationships among groups of academic publication authors.
literature mining A popular application area for text mining in which a large collection of literature (articles, abstracts, book excerpts, and commentaries) in a specific area is processed using semiautomated methods in order to discover novel patterns.
machine learning The process by which a computer learns from experience (e.g., using programs that can learn from historical cases).
management science (MS) The application of a scientific approach and mathematical models to the analysis and solution of managerial decision situations (e.g., problems, opportunities). Also known as operations research (OR).
management support system (MSS) A system that applies any type of decision support tool or technique to managerial decision making.
MapReduce A technique to distribute the processing of very large multi-structured data files across a large cluster of machines.
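The idea can be mimicked on a single machine in a few lines of Python (a toy word-count sketch; a real MapReduce engine such as Hadoop distributes the map and reduce phases across a cluster):

from itertools import groupby

docs = ["big data", "big analytics", "data data"]

# Map phase: emit (key, 1) pairs.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/sort phase: group the pairs by key.
pairs.sort(key=lambda kv: kv[0])

# Reduce phase: aggregate the values for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(counts)  # {'analytics': 1, 'big': 2, 'data': 3}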
mathematical (quantitative) model A system of symbols and expressions that represent a real situation.
mathematical programming An optimization technique for the allocation of resources, subject to constraints.
maturity model A formal depiction of the critical dimensions of a business practice and their competency levels.
mental model The mechanisms or images through which a human mind performs sense-making in decision making.
metadata Data about data. In a data warehouse, metadata describe the contents of a data warehouse and the manner of its use.
metasearch engine A search engine that combines results from several different search engines.
middleware Software that links application modules from different computer languages and platforms.
mobile agent An intelligent software agent that moves across different system architectures and platforms or from one Internet site to another, retrieving and sending information.
mobile social networking Social networking in which members converse and connect with one another using cell phones or other mobile devices.
mobility The degree to which agents travel through a computer network.
model base A collection of preprogrammed quantitative models (e.g., statistical, financial, optimization) organized as a single unit.
model base management system (MBMS) Software for establishing, updating, combining, and so on (e.g., managing) a DSS model base.
model building blocks Preprogrammed software elements that can be used to build computerized models. For example, a random-number generator can be employed in the construction of a simulation model.
model mart A small, generally departmental repository of knowledge created by using knowledge-discovery techniques on past decision instances. Model marts are similar to data marts. See model warehouse.
model warehouse A large, generally enterprise-wide repository of knowledge created by using knowledge-discovery techniques on past decision instances. Model warehouses are similar to data warehouses. See model mart.
momentum A learning parameter in backpropagation neural networks.
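Together with the learning rate \eta (see learning rate), the momentum term \alpha carries part of the previous weight change into the current one; a commonly used form of the update (assumed here, not given in the entry) is:

\Delta w_{ij}(t) = \eta \, \delta_j x_i + \alpha \, \Delta w_{ij}(t-1)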
Monte Carlo simulation A method of simulation whereby a model is built and then sampling experiments are run to collect and analyze the performance of the variables of interest.
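A minimal Python sketch, assuming a toy profit model with one uncertain input (all numbers are illustrative):

import random

random.seed(1)
profits = []
for _ in range(10_000):
    demand = random.gauss(1_000, 150)       # sampled uncertain input
    profits.append(5.0 * demand - 2_500.0)  # simple profit model

print(sum(profits) / len(profits))  # estimated expected profit
print(min(profits), max(profits))   # spread of simulated outcomes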
MSS architecture A plan for organizing the underlying infrastructure and applications of an MSS project.
MSS suite An integrated collection of a large number of MSS tools that work together for applications development.
MSS tool A software element (e.g., a language) that facilitates the development of an MSS or an MSS generator.
multiagent system A system with multiple cooperating software agents.
multidimensional analysis (modeling) A modeling method that involves data analysis in several dimensions.
multidimensional database A database in which the data are organized specifically to support easy and quick multidimensional analysis.
multidimensional OLAP (MOLAP) OLAP implemented via a specialized multidimensional database (or data store) that summarizes transactions into multidimensional views ahead of time.
multidimensionality The ability to organize, present, and analyze data by several dimensions, such as sales by region, by product, by salesperson, and by time (four dimensions).
multiple goals Refers to a decision situation in which alternatives are evaluated with several, sometimes conflicting, goals.
mutation A genetic operator that causes a random change in a potential solution.
natural language processing (NLP) Using a natural language processor to interface with a computer-based system.
neural computing An experimental computer design aimed at building intelligent computers that operate in a manner modeled on the functioning of the human brain. See artificial neural network (ANN).
neural (computing) networks A computer design aimed at building intelligent computers that operate in a manner modeled on the functioning of the human brain.
neural network See artificial neural network (ANN).
neuron A cell (i.e., processing element) of a biological or artificial neural network.
nominal data A type of data that contains measurements of simple codes assigned to objects as labels, which are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced.
nominal group technique (NGT) A simple brainstorming process for nonelectronic meetings.
normative model A model that prescribes how a system should operate.
NoSQL (which stands for Not Only SQL) A new paradigm to store and process large volumes of unstructured, semistructured, and multi-structured data.
nucleus The central processing portion of a neuron.
numeric data A type of data that represent the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (in U.S. dollars), travel distance (in miles), and temperature (in Fahrenheit degrees).
object A person, place, or thing about which information is collected, processed, or stored.
object-oriented model base management system (OOMBMS) An MBMS constructed in an object-oriented environment.
online analytical processing (OLAP) An information system that enables the user, while at a PC, to query the system, conduct an analysis, and so on. The result is generated in seconds.
online (electronic) workspace Online screens that allow people to share documents, files, project plans, calendars, and so on in the same online place, though not necessarily at the same time.
oper mart An operational data mart. An oper mart is a small-scale data mart typically used by a single department or functional area in an organization.
operational data store (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files.
operational models Models that represent problems for the operational level of management.
operational plan A plan that translates an organization's strategic objectives and goals into a set of well-defined tactics and initiatives, resource requirements, and expected results.
optimal solution A best possible solution to a modeled problem.
optimization The process of identifying the best possible solution to a problem.
ordinal data Data that contains codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, and (3) high.
organizational agent An agent that executes tasks on behalf of a business process or computer application.
organizational culture The aggregate attitudes in an organization concerning a certain issue (e.g., technology, computers, DSS).
organizational knowledge base An organization's knowledge repository.
organizational learning The process of capturing knowledge and making it available enterprise-wide.
organizational memory That which an organization knows.
ossified case A case that has been analyzed and has no further value.
PageRank A link analysis algorithm named after Larry Page, one of the two founders of Google. It began as a research project at Stanford University in 1996 and is used by the Google Web search engine.
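The defining recurrence, as commonly stated (with damping factor d, N pages in total, B(p) the set of pages linking to p, and L(q) the number of outbound links on page q):

PR(p) = \frac{1 - d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}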
paradigmatic case A unique case that can be maintained to derive new knowledge for the future.
parallel processing An advanced computer processing technique that allows a computer to perform multiple processes at once, in parallel.
parallelism In a group support system, a process gain in which everyone in a group can work simultaneously (e.g., in brainstorming, voting, ranking).
parameter See uncontrollable variable (parameter).
part-of-speech tagging The process of marking up the words in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, and adverbs) based on a word's definition and the context of its use.
pattern recognition A technique of matching an external pattern to a pattern stored in a computer's memory (i.e., the process of classifying data into predetermined categories). Pattern recognition is used in inference engines, image processing, neural computing, and speech recognition.
perceptron An early neural network structure that uses no hidden layer.
performance measurement system A system that assists managers in tracking the implementation of business strategy by comparing actual results against strategic goals and objectives.
personal agent An agent that performs tasks on behalf of individual users.
physical integration The seamless integration of several systems into one functioning system.
Pig A Hadoop-based query language developed by Yahoo!.
polysemes Words (also called homonyms) that are syntactically identical (i.e., spelled exactly the same) but have different meanings (e.g., bow can mean "to bend forward," "the front of the ship," "the weapon that shoots arrows," or "a kind of tied ribbon").
portal A gateway to Web sites. Portals can be public (e.g., Yahoo!) or private (e.g., corporate portals).
practice approach An approach toward knowledge management that focuses on building the social environments or communities of practice necessary to facilitate the sharing of tacit understanding.
prediction The act of telling about the future.
predictive analysis Use of tools that help determine the probable future outcome for an event or the likelihood of a situation occurring. These tools also identify relationships and patterns.
predictive analytics A business analytical approach toward forecasting (e.g., demand, problems, opportunities) that is used instead of simply reporting data as they occur.
principle of choice The criterion for making a choice among alternatives.
privacy In general, the right to be left alone and the right to be free of unreasonable personal intrusions. Information privacy is the right to determine when, and to what extent, information about oneself can be communicated to others.
private agent An agent that works for only one person.
problem ownership The jurisdiction (authority) to solve a problem.
problem solving A process in which one starts from an initial state and proceeds to search through a problem space to identify a desired goal.
process approach An approach to knowledge management that attempts to codify organizational knowledge through formalized controls, processes, and technologies.
process gain In a group support system, improvements in the effectiveness of the activities of a meeting.
process loss In a group support system, degradation in the effectiveness of the activities of a meeting.
processing element (PE) A neuron in a neural network.
production rules The most popular form of knowledge representation for expert systems, in which atomic pieces of knowledge are represented using simple if-then structures.
prototyping In system development, a strategy in which a scaled-down system or portion of a system is constructed in a short time, tested, and improved in several iterations.
public agent An agent that serves any user.
quantitative software package A preprogrammed (sometimes called ready-made) model or optimization system. These packages sometimes serve as building blocks for other quantitative models.
query facility The (database) mechanism that accepts requests for data, accesses them, manipulates them, and queries them.
rapid application development (RAD) A development methodology that adjusts a system development life cycle so that parts of the system can be developed quickly, thereby enabling users to obtain some functionality as soon as possible. RAD includes methods of phased development, prototyping, and throwaway prototyping.
RapidMiner A popular, open source, free-of-charge data mining software suite that employs a graphically enhanced user interface, a rather large number of algorithms, and a variety of data visualization features.
ratio data Continuous data where both differences and ratios are interpretable. The distinguishing feature of a ratio scale is the possession of a nonarbitrary zero value.
real-time data warehousing The process of loading and providing data via a data warehouse as they become available.
real-time expert system An expert system designed for online dynamic decision support. It has a strict limit on response time; in other words, the system always produces a response by the time it is needed.
reality mining Data mining of location-based data.
recommendation system (agent) A computer system that can suggest new items to a user based on his or her revealed preference. It may be content based or use collaborative filtering to suggest items that match the preference of the user. An example is Amazon.com's "Customers who bought this item also bought ..." feature.
regression A data mining method for real-world prediction problems where the predicted values (i.e., the output variable or dependent variable) are numeric (e.g., predicting the temperature for tomorrow as 68°F).
reinforcement learning A subarea of machine learning that is concerned with learning-by-doing-and-measuring to maximize some notion of long-term reward. Reinforcement learning differs from supervised learning in that correct input/output pairs are never presented to the algorithm.
relational database A database whose records are organized into tables that can be processed by either relational algebra or relational calculus.
relational model base management system (RMBMS) A relational approach (as in relational databases) to the design and development of a model base management system.
relational OLAP (ROLAP) The implementation of an OLAP database on top of an existing relational database.
report Any communication artifact prepared with the specific intention of conveying information in a presentable form.
reproduction The creation of new generations of improved solutions with the use of a genetic algorithm.
result (outcome) variable A variable that expresses the result of a decision (e.g., one concerning profit), usually one of the goals of a decision-making problem.
revenue management systems Decision-making systems used to make optimal price decisions in order to maximize revenue, based upon previous demand history as well as forecasts of demand at various pricing levels and other considerations.
RFID A generic technology that refers to the use of radio-frequency waves to identify objects.
risk A probabilistic or stochastic decision situation.
risk analysis A decision-making method that analyzes the risk (based on assumed known probabilities) associated with different alternatives.
robot A machine that has the capability of performing manual functions without human intervention.
rule-based system A system in which knowledge is represented completely in terms of rules (e.g., a system based on production rules).
SAS Enterprise Miner A comprehensive, commercial data mining software tool developed by SAS Institute.
satisficing A process by which one seeks a solution that will satisfy a set of constraints. In contrast to optimization, which seeks the best possible solution, satisficing simply seeks a solution that will work well enough.
scenario A statement of assumptions and configurations concerning the operating environment of a particular system at a particular time.
scorecard A visual display that is used to chart progress against strategic and tactical goals and targets.
screen sharing Software that enables group members, even in different locations, to work on the same document, which is shown on the PC screen of each participant.
search engine A program that finds and lists Web sites or pages (designated by URLs) that match some user-selected criteria.
search engine optimization (SEO) The intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine's natural (unpaid or organic) search results.
self-organizing A neural network architecture that uses unsupervised learning.
semantic Web An extension of the current Web, in which information is given well-defined meanings, better enabling computers and people to work in cooperation.
semantic Web services An XML-based technology that allows semantic information to be represented in Web services.
semistructured problem A category of decision problems where the decision process has some structure to it but still requires subjective analysis and an iterative approach.
SEMMA An alternative process for data mining projects proposed by the SAS Institute. The acronym "SEMMA" stands for "sample, explore, modify, model, and assess."
sensitivity analysis A study of the effect of a change in one or more input variables on a proposed solution.
sentiment A settled opinion reflective of one's feelings.
sentiment analysis The technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (e.g., customer feedback in the form of Web postings).
SentiWordNet An extension of WordNet to be used for sentiment identification. See WordNet.
sequence discovery The identification of associations over time.
sequence mining A pattern discovery method where relationships among things are examined in terms of their order of occurrence to identify associations over time.
sigmoid (logical activation) function An S-shaped transfer function in the range of 0 to 1.
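The most common form is the logistic function

f(x) = \frac{1}{1 + e^{-x}}

which maps any input smoothly into the interval (0, 1).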
simple split Data is partitioned into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set.
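A minimal Python sketch of the two-thirds/one-third convention (the integer records stand in for labeled cases):

import random

data = list(range(30))        # stand-in for labeled records
random.seed(0)
random.shuffle(data)          # randomize before partitioning

cut = (2 * len(data)) // 3    # two-thirds for training
train_set, test_set = data[:cut], data[cut:]
print(len(train_set), len(test_set))  # 20 10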
simulation An imitation of reality in computers.
singular value decomposition (SVD) A dimensionality reduction method closely related to principal components analysis. It reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents).
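In matrix notation, SVD factors the input matrix A and is truncated to the k largest singular values to obtain the lower-dimensional representation:

A = U \Sigma V^{\mathsf{T}}, \qquad A_k = U_k \Sigma_k V_k^{\mathsf{T}}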
Six Sigma A performance management methodology aimed at reducing the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible.
social analytics The monitoring, analyzing, measuring, and interpreting of digital interactions and relationships of people, topics, ideas, and content.
social media The online platforms and tools that people use to share opinions, experiences, insights, perceptions, and various media, including photos, videos, and music, with each other; the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks.
social media analytics The systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization's competitiveness.
social network analysis (SNA) The mapping and measuring of relationships and information flows among people, groups, organizations, computers, and other information- or knowledge-processing entities. The nodes in the network are the people and groups, whereas the links show relationships or flows between the nodes.
software agent A piece of autonomous software that persists to accomplish the task it is designed for (by its owner).
software-as-a-service (SaaS) Software that is rented instead of sold.
speech analytics A growing field of science that allows users to analyze and extract information from both live and recorded conversations.
speech (voice) understanding An area of artificial intelligence research that attempts to allow computers to recognize words or phrases of human speech.
staff assistant An individual who acts as an assistant to a manager.
static models Models that describe a single interval of a situation.
status report A report that provides the most current information on the status of an item (e.g., orders, expenses, production quantity).
stemming A process of reducing words to their respective root forms in order to better represent them in a text mining project.
stop words Words that are filtered out prior to or after processing of natural language data (i.e., text).
story A case with rich information and episodes. Lessons may be derived from this kind of case in a case base.
strategic goal A quantified objective that has a designated time period.
strategic models Models that represent problems for the strategic level (i.e., executive level) of management.
strategic objective A broad statement or general course of action that prescribes targeted directions for an organization.
strategic theme A collection of related strategic objectives, used to simplify the construction of a strategic map.
strategic vision A picture or mental image of what the organization should look like in the future.
strategy map A visual display that delineates the relationships among the key organizational objectives for all four balanced scorecard perspectives.
stream analytics A term commonly used for extracting actionable information from continuously flowing/streaming data sources.
structured problem A decision situation where a specific set of steps can be followed to make a straightforward decision.
Structured Query Language (SQL) A data definition and management language for relational databases. SQL front-ends most relational DBMSs.
suboptimization An optimization-based procedure that does not consider all the alternatives for or impacts on an organization.
summation function A mechanism to add all the inputs coming into a particular neuron.
supervised learning A method of training artificial neural networks in which sample cases are shown to the network as input, and the weights are adjusted to minimize the error in the outputs.
support The measure of how often products and/or services appear together in the same transaction; that is, the proportion of transactions in the data set that contain all of the products and/or services mentioned in a specific rule.
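For a rule X => Y over a set of transactions T, support and the closely related confidence measure are usually written as:

\mathrm{support}(X \Rightarrow Y) = \frac{|\{\, t \in T : X \cup Y \subseteq t \,\}|}{|T|}, \qquad \mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}

For example, if 20 of 100 transactions contain both milk and bread, the rule milk => bread has support 0.20.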
support vector machines (SVM) A family of generalized linear models, which achieve a classification or regression decision based on the value of the linear combination of input features.
synapse The connection (where the weights are) between processing elements in a neural network.
synchronous (real time) Occurring at the same time.
system architecture The logical and physical design of a system.
system development lifecycle (SDLC) A systematic process for the effective construction of large information systems.
systems dynamics Macro-level simulation models in which aggregate values and trends are considered. The objective is to study the overall behavior of a system over time, rather than the behavior of each individual participant or player in the system.
tacit knowledge Knowledge that is usually in the domain of subjective, cognitive, and experiential learning. It is highly personal and difficult to formalize.
tactical models Models that represent problems for the tactical level (i.e., midlevel) of management.
teleconferencing The use of electronic communication that allows two or more people at different locations to have a simultaneous conference.
term-document matrix (TDM) A frequency matrix created from digitized and organized documents (the corpus) where the columns represent the terms while rows represent the individual documents.
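A minimal sketch using scikit-learn's CountVectorizer (an illustrative tool choice; the two-document corpus is made up):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["data mining finds patterns",
          "text mining mines text"]
vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(corpus)  # rows: documents, columns: terms

print(vectorizer.get_feature_names_out())  # the extracted terms
print(tdm.toarray())                       # the frequency matrix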
text analytics A broader concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining.
text mining The application of data mining to nonstructured or less structured text files. It entails the generation of meaningful numeric indices from the unstructured text and then processing those indices using various data mining algorithms.
theory of certainty factors A theory designed to help incorporate uncertainty into the representation of knowledge (in terms of production rules) for expert systems.
threshold value A hurdle value for the output of a neuron to trigger the next level of neurons. If an output value is smaller than the threshold value, it will not be passed to the next level of neurons.
tokenizing Categorizing a block of text (token) according to the function it performs.
topology The way in which neurons are organized in a neural network.
transformation (transfer) function In a neural network, the function that sums and transforms inputs before a neuron fires. It shows the relationship between the internal activation level and the output of a neuron.
trend analysis The collecting of information and attempting to spot a pattern, or trend, in the information.
Turing test A test designed to measure the "intelligence" of a computer.
uncertainty In expert systems, a value that cannot be determined during a consultation. Many expert systems can accommodate uncertainty; that is, they allow the user to indicate whether he or she does not know the answer.
uncontrollable variable (parameter) A factor that affects the result of a decision but is not under the control of the decision maker. These variables can be internal (e.g., related to technology or to policies) or external (e.g., related to legal issues or to climate).
unstructured data Data that does not have a predetermined format and is stored in the form of textual documents.
unstructured problem A decision setting where the steps are not entirely fixed or structured, but may require subjective considerations.
unsupervised learning A method of training artificial neural networks in which only input stimuli are shown to the network, which is self-organizing.
user interface The component of a computer system that allows bidirectional communication between the system and its user.
user interface management system (UIMS) The DSS component that handles all interaction between users and the system.
user-developed MSS An MSS developed by one user or by a few users in one department, including decision makers and professionals (i.e., knowledge workers, e.g., financial analysts, tax analysts, engineers) who build or use computers to solve problems or enhance their productivity.
utility (on-demand) computing Unlimited computing power and storage capacity that, like electricity, water, and telephone services, can be obtained on demand, used, and reallocated for any application and that are billed on a pay-per-use basis.
vendor-managed inventory (VMI) The practice of retailers making suppliers responsible for determining when to order and how much to order.
video teleconferencing (videoconferencing) A virtual meeting in which participants in one location can see participants at other locations on a large screen or a desktop computer.
virtual (Internet) community A group of people with similar interests who interact with one another using the Internet.
virtual meeting An online meeting whose members are in different locations, possibly even in different countries.
virtual team A team whose members are in different places while in a meeting together.
virtual worlds Artificial worlds created by computer systems in which the user has the impression of being immersed.
visual analytics The combination of visualization and predictive analytics.
visual interactive modeling (VIM) See visual interactive simulation (VIS).
visual interactive simulation (VIS) A simulation approach used in the decision-making process that shows graphical animation in which systems and processes are presented dynamically to the decision maker. It enables visualization of the results of different potential actions.
visual recognition The addition of some form of computer intelligence and decision making to digitized visual information received from a machine sensor such as a camera.
voice of customer (VOC) Applications that focus on "who and how" questions by gathering and reporting direct feedback from site visitors, by benchmarking against other sites and offline channels, and by supporting predictive modeling of future visitor behavior.
voice (speech) recognition Translation of the human voice into individual words and sentences that are understandable by a computer.
Voice over IP (VoIP) Communication systems that transmit voice calls over Internet Protocol (IP)-based networks. Also known as Internet telephony.
voice portal A Web site, usually a portal, that has an audio interface.
voice synthesis The technology by which computers convert text to voice (i.e., speak).
Web 2.0 The popular term for advanced Internet technology and applications, including blogs, wikis, RSS, and social bookmarking. One of the most significant differences between Web 2.0 and the traditional World Wide Web is greater collaboration among Internet users and other users, content providers, and enterprises.
Web analytics The application of business analytics activities to Web-based processes, including e-commerce.
Web content mining The extraction of useful information from Web pages.
Web crawlers An application used to read through the content of a Web site automatically.
Web mining The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools.
Web services An architecture that enables assembly of distributed applications from software services and ties them together.
Web structure mining The development of useful information from the links included in Web documents.
Web usage mining The extraction of useful information from the data being generated through Web page visits, transactions, and so on.
Weka A popular, free-of-charge, open source suite of machine-learning software written in Java, developed at the University of Waikato.
what-if analysis A process that involves asking a computer what the effect of changing some of the input data or parameters would be.
wiki A piece of server software available on a Web site that allows users to freely create and edit Web page content, using any Web browser.
wikilog A Web log (blog) that allows people to participate as peers; anyone can add, delete, or change content.
WordNet A popular general-purpose lexicon created at Princeton University.
work system A system in which humans and/or machines perform a business process, using resources to produce products or services for internal or external customers.
INDEX
Note: 'A', 'f', 'n', and 't' refer to application cases, figures, notes, and tables, respectively.
A
Academic providers, 628
Accenture, 577, 578t
Accuracy metrics for classification models, 215t, 216t
Acxiom, 235
Ad hoc DSS, 63
Agent-based models, 461-462, 463A-464A
Agility support, 611
AI. See Artificial intelligence (AI)
AIS SIGDSS classification (DSS), 61-63
  communication driven, 62
  compound, 63
  data driven, 62
  document driven, 62
  group, 62
  knowledge driven, 62
  model driven, 62-63
AJAX, 605
Algorithms
  analytical technique, 438-439
  data mining, 275
  decision tree, 422
  evolutionary, 277, 441
  genetic, 441-446
  HITS, 345
  kNN algorithm, 277
  learning, 198, 260, 261
  linear, 272
  neighbor, 275-277
  popular, 328
  proprietary, 352
  search, 278
Alter's DSS classification, 63
Amazon.com
  analytical decision making, 189-190, 603
  apps, for customer support, 69
  cloud computing, 607-608
  collaborative filtering, 604
  SimpleDB at, 610
  Web usage, 359
American Airlines, 403A
Analytic ecosystem
  aggregators and distributors, data, 622
  data infrastructure providers, 620-621
  data warehouse, 621-622
  industry clusters, 620
  middleware industry, 622
  software developers, 622
Analytical techniques, 3, 55, 341, 438
  algorithms, 438-439
  blind searching, 439
  heuristic searching, 439, 439A-440A
  See also Big Data analytics
Analytic hierarchy process (AHP)
  application, 423A-424A
  alternative ranking, 428f
  application, 425-429
  description, 423-425
  diagram, 426f
  final composite, 429f
  ranking criteria, 427f
  subcriteria, 428f
Analytics
  business process restructure, 614
  impact in organization, 613
  industry analysts and influencers, 627
  job satisfaction, 614
  legal issues, 616
  manager's activities, impact on, 615-616
  mobile user privacy, 617
  privacy issues, 617
  stress and anxiety, job, 614-615
  technology issues, 618-619
  user organization, 625
Analytics-as-a-Service (AaaS), 611-612
ANN. See Artificial neural network (ANN)
Apple, 69, 345
Apriori, 200, 209, 226-227
Area under the ROC curve, 217
Artificial intelligence (AI)
  advanced analytics, 126
  automated decision system, 472, 473
  BI systems, 15
  data-mining, 190, 192-193
  ES feature, symbolic reasoning, 478
  field applications, 475-477
  genetic algorithm, 441
  knowledge-driven DSS, 62, 469
  knowledge-based management subsystem, 70
  knowledge engineering tools, 487
  in natural language processing (NLP), 297
  rule-based system, 475, 491
  text-analytics, 292
  visual interactive models (VIM), 457
Artificial neural network (ANN)
  application, 264A-265A
  architectures, 254-255
  backpropagation, 260-261
  biological and artificial, 248-250
  developing (See Neural network-based systems, developing)
  elements, 251-254
  Hopfield networks, 255-256
  Kohonen's self-organizing feature maps (SOM), 255
  learning process, 259-260
  neural computing, 247
  predictive models, 244
  result predictions, 246f
  simple split, exception to, 215, 216f
Artificial neurons, 248-251
Association rule mining, 200, 209, 224-227
Associations
  in data mining, 197, 200
  defined, 200
  See also Association rule mining
Asynchronous communication, 527
Asynchronous products, 528
Attensity360, 384
Attributes, 218
Auctions, e-commerce network, 360
Audio, acoustic approach, 329
Authoritative pages, 345
Automated decision-making (ADM), 62
Automated decision system (ADS), 62, 471
  application, 472A
  architecture, 472f
  concept, 471-475
  for revenue management system, 474
  rule-based system, 475
  vignette, 470-471
Automated help desks, 483
Automatic programming in AI field, 476f, 498
Automatic sensitivity analysis, 418
Automatic summarization, 300
AWStats (awstats.org), 369
Axons, 248
B
Back-error propagation, 260, 261f
Backpropagation, 251, 255, 260-261, 261f
Backtracking, 204, 493
Backward chaining, 491-493, 493f
Bag-of-words used in text mining, 296-297
Balanced scorecards
  application, 178A-179A
  concept, 172
  dashboard vs, 174-175
  DMAIC performance model, 176
  four perspectives, 173-174
  meaning of balance, 174
  Six Sigma vs, 176
Banking, 201
Basic Information, 606
Bayesian classifiers, 218
Best Buy, 69
BI. See Business intelligence (BI)
Big Data analytics
  applications, 550A-551A, 555A-556A, 563A-564A, 568A-569A, 576A-577A, 580A-581A
  business problems, addressing, 554-556
  data scientist's role in, 565-569
  data warehousing and, 569-572
  defined, 27, 546-547
  fundamentals, 551-554
  gray areas, 572
  Hadoop, 558-562, 570-574
  industry testaments, 579-580
  MapReduce, 557-558
  NoSQL, 562-564
  stream analytics, 581-588
  value proposition, 550
  variability, 549
  variety, 548-549
  velocity, 549
  vendors, 574, 578t
  veracity, 549
  volume, 547-548
  vignette, 543-546
Bing, 69, 355
Biomedical text mining applications, 304
Blackboard (workplace), 484-486
Black-box syndrome, 263
Black box testing by using sensitivity analysis, 262-263
Blending problem, 415
Blind searching, 439
Blogs, 320, 377, 522
bluecava.com, 618
Bootstrapping, 217
Bounded rationality, 52
Brainstorming, electronic, 57-58, 534
Branch, 218
Break-even point, goal seeking analysis used to compute, 419
Bridge, 376-377
Budgeting, 171
Building, process of, 168
Business activity monitoring (BAM), 57, 85
Business analytics (BA)
  application, 596A-597A, 599A, 601A-602A
  athletic injuries, 24A
  by Seattle Children's Hospital, 21A
  consumer applications, 600
  data science vs, 26
  descriptive analytics, 20
  for smart energy use, 593
  geo-spatial analytics, 594-598
  Industrial and Commercial Bank of China's network, 25A
  legal and ethical issues, 616
  location-based analytics, 594
  moneyball in sports and movies, 23A
  organizations, impact on, 613
  overview, 19
  recommendation engines, 603-604
  speed of thought, 22A
  vignette, 593-594
  See also Analytic ecosystem
Business intelligence (BI)
  architecture, 15
  brief history, 14
  definitions, 14
  DSS and, 18
  multimedia exercise, 16
  origin and drivers, 16
  Sabre's dashboard and analytical technique, 17A
  styles, 15
Business intelligence service provider (BISP), 108
Business Objects, 112
Business Performance Improvement Resource, 126
Business performance management (BPM)
  application, 169A-170A
  closed-loop cycle, 167-169
  defined, 166
  key performance indicator (KPI), 171
  measurement of performance, 170-172
Business Pressures-Responses-Support model, 5-7, 6t
Business process management (BPM), 57, 58, 69
Business process reengineering (BPR), 56, 614
Business reporting
  analytic reporting, 622
  application, 141A-143A, 144A-145A, 146A, 149A-150A, 161A, 163A-164A, 169A-170A, 178A-179A
  charts and graphs, 150-153
  components, 143-144
  data visualization, 145, 146A, 147, 147A, 149A, 154-156
  definitions and concept, 136-140
  vignette, 136
BuzzLogic, 385
C
Candidate generation, 226
Capacities, 52
Capital One, 190
Carnegie Mellon University, 601
CART, 219, 228, 229, 244
Case-based reasoning (CBR), 218
Catalog design, 225
Catalyst, data warehousing, 117
Categorical data, 194, 206
Categorization in text mining applications, 293
Causal loops, 459
Cell phones
  mobile social networking and, 606
  mobile user privacy and, 618
Centrality, 376
Centroid, 223
Certainty, decision making under, 401, 402
Certainty factors (CF), 422
Certification agencies, 628
Certified Analytics Professional (CAP), 628
Channel optimization, 17t
Chief executive officer (CEO), 58, 627
Chief information officer (CIO), 625, 627
Chief operating officer (COO), 339
Chi-squared automatic interaction detector (CHAID), 219
Choice phase of decision-making process, 55, 58
Chromosome, 202, 443
Cisco, 578t
Classification, 201-202
  AIS SIGDSS classification for DSS, 61-62
  in data mining, 199, 214-219
  in decision support system (DSS), 61
  non-linear, 272
  N-P polarity, 326
  of problem (intelligence phase), 46
  in text mining, 312
Class label, 218
Clementine, 228, 232, 262
Clickstream analysis, 358
Cliques, social circles, 377
Cloud computing, 123, 607-608
Cluster analysis for data mining, 220-221
Clustering
  in data mining, 200
  defined, 200
  K-Means, 223
  optimal number, 222
  in text mining, 293, 312-313
  See also Cluster analysis for data mining
Clustering coefficient, 377
Clusters, 197
Coca-Cola, 102, 103A
Cognitive limits, 10
Cognos, 64, 79, 83
Cohesion, 377
Collaborative networks, 531
Collaborative planning, forecasting, and replenishment (CPFR). See CPFR
Collaborative planning along supply chain, 531
Collaborative workflow, 530
Collective Intellect, 384-385
Collective intelligence (CI), 523, 605
Communication networks, 374
Community networks, 374
Community of practice (COP), 521, 538
Complete enumeration, 439
Complexity in simulation, 441
Compound DSS, 62, 63
Comprehensive database in data warehousing process, 89
Computer-based information system (CBIS), 59, 477
Computer hardware and software. See Hardware; Software
Computerized decision support system
  decision making and analytics, 5
  Gorry and Scott-Morton classical framework, 11-12
  reasons for using, 11-13
  for semistructured problems, 13
  for structured decisions, 12 (See also Automated decision system (ADS))
  for unstructured decisions, 13
Computer-supported collaboration tools
  collaborative networks, 531
  collaborative workflow, 530
  corporate (enterprise) portals, 526
  for decision making (See Computerized decision support system)
  virtual meeting systems, 529
  Voice over IP (VoIP), 453-454
  Web 2.0, 530-531
  wikis, 531
Computer-supported collaborative work (CSCW), 532
Computer vision, 476f
Concepts, defined, 290
Conceptual methodology, 13
Condition-based maintenance, 202
Confidence gap, 453
Confidence metric, 225-226
Confusion matrix, 215, 215f
Connection weights, 253
Constraints, 96, 211, 255
Consultation environment used in ES, 484, 484f
Continental Airlines, 120, 127-128, 131A
Contingency table, 215, 215f
Continuous probability distributions, 452t
Control system, 498
Converseon, 385
Corporate intranets and extranets, 531
Corporate performance management (CPM), 166
Corporate portals, 529t
Corpus, 294, 308-309
CPFR, 531
Creativity, 48
Credibility assessment (deception detection), 302
Credit analysis system, 482-483
Crimson Hexagon, 385
CRISP-DM, 204-206, 212-213, 213f, 307
Cross-Industry Standard Process for Data Mining. See CRISP-DM
Crossover, genetic algorithm, 443
Cross-marketing, 225
Cross-selling, 225
Cross-validation accuracy (CVA), 217
Cube analysis, 15
Customer attrition, 17t, 201
Customer experience management (CEM), 323
Customer profitability, 17t
Customer relationship management (CRM), 34, 42, 104t, 189, 201, 298, 301, 313, 320, 329, 483
Customer segmentation, 17t
Custom-made DSS system, 63-64
D
Dashboards, 160
  application, 161A, 163A-164A
  vs balanced scorecards, 174
  best practices, 165
  characteristics, 164
  design, 162, 166
  guided analytics, 166
  information level, 166
  key performance indicators, 165
  metrics, 165
  rank alerts, 165
  user comments, 165
  validation methods, 165
Data as a service (DaaS), 575, 608-609
Database management system (DBMS), 86, 88t, 92, 337, W3.1.16
  architecture, 92
  column-oriented, 123
  data storage, 144, 521
  defined, 65
  NoSQL, 609
  relational, 83, 124-125
  Teradata, 82
Data directory, 65
Data dredging, 192
Data extraction, 89
Data integration
  application, 98A
  description, 98
  extraction, transformation, and load (ETL) processes, 97
Data loading in data warehousing process, 89
Data management subsystem, 65
Data marts, 84
Data migration tools, 93
Data mining
  applications, 191A-192A, 196A-197A, 201-204, 203A-204A, 221A-222A, 231A-234A, 235A-236A
  artificial neural network (ANN) for (See Artificial neural network (ANN))
  associations used in, 197
  as blend of multiple disciplines, 193f
  in cancer research, 210A-211A
  characteristics of, 192-197, 195f
  classification of tasks used in, 195, 198f, 199, 214
  clustering used in, 199-200
  commercial uses of, 228
  concepts and application, 189-192
  data extraction capabilities of, 89
  data in, 193-194, 193f
  definitions of, 192
  DSS templates provided by, 64
  law enforcement's use of, 196A-197A
  methods, 214-227
  myths and blunders, 234-237
  names associated with, 192
  patterns identified by, 198
  prediction used in, 193, 198
  process, 204-212
  recent popularity of, 199
  standardized process and methodologies, 212-213
  vs statistics, 200-201
  software tools, 228-234
  term origin, 192
  time-series forecasting used in, 200
  using predictive analytics tools, 23
  vignette, 187-189
  visualization used in, 203
  working, 197-200
Data modeling vs. analytical models, 47n
Data organization, 549
Data-oriented DSS, 19
Data preparation, 206-208, 207f, 209t
Data processing, 28, 82, 190, 197-198, 483, 561, 623
Data quality, 100-101
Data sources in data warehousing process, 16f
Data visualization, 145, 147A, 149A
Data warehouse (DW)
  defined, 81
  development, 106-108
  drill down in, 111
  hosted, 108
  See also Data warehousing
Data warehouse administrator (DWA), 122
Data Warehouse Institute, 366
Data warehouse vendors, 103, 124
Data warehousing
  administration, 121
  application, 79A, 85A, 88A, 103A, 106A, 115A, 118A
  architecture, 90-96
  characteristics, 83
  data analysis, 109
  data representation, 108
  definition and concept, 81
  development, 102-103, 107
  future trends, 121, 123
  historical perspective, 81
  implementation issues, 113
  real-time, 117
  scalability, 116
  security issues, 121
The Data Warehousing Institute (TDWI), 16~, 31, 10~, 113, 627
DB2 Information Integrator, 92
Debugging system, 498
Deception detection, 301-303, 302A-303A, 303f
Decisional managerial roles, 8-9
Decision analysis, 420-422
Decision automation system (DAS), 471
Decision makers, 49
Decision making
  characteristics, 40
  disciplines, 41
  ethical issues, 619
  at HP, 38
  implementation phase, 55, 58
  phases, 42
  style and makers, 41
  working definition, 41
Decision making and analytics
  business pressures-responses-support model, 7
  computerized decision support, 5
  information systems support, 9
  overview, 1
  vignette, 3
Decision-making models
  components of, 66-67
  defined, 47
  design variables in, 47-48
  Kepner-Tregoe method, 44
  mathematical (quantitative), 47
  Simon's four-phase model, 42-44, 44t
Decision-making process, phases of, 42-44, 44f
Decision modeling, using spreadsheets, 38-39
Decision rooms, 534
Decision style, 41-42
Decision support in decision making, 56-59, 56f
Decision support system (DSS)
  AIS SIGDSS, 61
  applications, 59-61, 66A, 68A, 70A, 74A, 395A-396A
  in business intelligence, 14
  capabilities, 59
  characteristics and capabilities of, 59-61, 60f
  classifications of, 61-64
  components of, 64-72
  custom-made system vs ready-made system, 63
  definition and concept of, 13
  description of, 59-61
  DSS-BI connection, 16 (See also DSS/BI)
  resources and links, 31
  singular and plural of, 2n
  spreadsheet-based (See Spreadsheets)
  as umbrella term, 13-14
  vendors, products, and demos, 31
  See also Computerized decision support system
Decision tables, 420-422
Decision trees, 219, 420-422
Decision variables, 198, 399, 400
Deep Blue (chess program), 289
Dell, 104t, 578t, 620
DeltaMaster, 229t
DENDRAL, 481
Dendrites, 249
Density, in network distribution, 377
Dependent data mart, 84
Dependent variables, 400
Descriptive models, 50-51
Design phase of decision-making process, 43, 44t, 47-54
  alternatives, developing (generating), 52-53
  decision support for, 57-58
  descriptive models used in, 50-51
  design variables used in, 47
  errors in decision making, 54
  normative models used in, 49
  outcomes, measuring, 53
  principle of choice, selecting, 48
  risk, 53
  satisficing, 51-52
  scenarios, 54
  suboptimization approach to, 49-50
Design system, 498
Development environment used in ES, 484
Diagnostic system, 498
Dictionary, 294
Digital cockpits, 15f
Dimensional modeling, 108
Dimensional reduction, 208
Dimension tables, 108
Directory, data, 65
Discrete event simulation, 453
Discrete probability distributions, 452t
Disseminator, 8t
Distance measure entropy, 223
Distance, in network distribution, 377
Disturbance handler, 8t
DNA microarray analysis, 304
Document management systems (DMS), 57, 521
Drill down, data warehouse, 111
DSS. See Decision support system (DSS)
DSS/BI
  defined, 18
  hybrid support system, 63
DSS/ES, 72
DSS Resources, 31, 126
Dynamic model, 405
E
ECHELON surveillance system, 301
Eclat algorithm, 200, 226
E-commerce site design, 225
Economic order quantity (EOQ), 50
Electroencephalography, 71
Electronic brainstorming, 52, 521
Electronic document management (EDM), 57, 519, 521
Electronic meeting system (EMS), 530, 532, 534
Electronic teleconferencing, 527
Elitism, 443
E-mail, 296, 323, 372
Embedded knowledge, 516
Emotiv, 71
End-user modeling tool, 404
Enterprise application integration (EAI), 99, 611
Enterprise data warehouse (EDW), 85-87, 88t
Enterprise information integration (EII), 87, 99, 611
Enterprise information system (EIS), 15, 42, 615
Enterprise Miner, 228, 229, 229t, 230
Enterprise reporting, 15, 140
Enterprise resource management (ERM), 42
Enterprise resource planning (ERP), 42, 313
Enterprise 2.0, 523
Entertainment industry, 203
Entity extraction, 293
Entity-relationship diagrams (ERD), 104, 107t
Entrepreneur, 8t
Entropy, 219
Environmental scanning and analysis, 396
ERoom server, 530
Ethics in decision making and support, 619
Evolutionary algorithm, 441
Executive information system (EIS), 14
Expert, defined, 477
Expert Choice (EC11), 423, 424-425, 530, 623
Expertise, 477
Expertise Transfer System (ETS), 509
Expert system (ES), 28, 80, 81, 82, 105-106, 108, 542-546
  applications of, 480A, 481A, 482-483, 486A-487A, 501A-502A
  vs. conventional system, 479t
  development of, 498-502
  expertise and, 478
  experts in, 477-478
  features of, 478-479
  generic categories of, 497t
  problem areas suitable for, 497-498
  rule-based, 482, 483
  structure of, 484-485
  used in identifying sport talent, 480A
Expert system (ES) shell, 500
Explanation and justification, 488
Explanation facility (or justifier), 496
Explanations, why and how, 496
Explanation subsystem in ES, 486
Explicit knowledge, 515-516
Exsys, 126, 500
Exsys Corvid, 486-487
eXtensible Markup Language (XML), 95-96, 521
External data, 609f
Extraction, transformation, and load (ETL), 89, 90f, 97-102
Extraction of data, 100
Extranets, 91, 526
F
Facebook, 28, 320, 323, 372, 559
Facilitate, 528
Facilitators (in GDSS), 532, 534
Fair Isaac Business Science, 229t
Feedforward-backpropagation paradigm, 251
  See also Backpropagation
Figurehead, 8t
FireStat (firestats.cc), 368
Folksonomies, 530
FootPath, 598
Forecasting (predictive analytics), 396-397
Foreign language reading/writing, 300
Forward chaining, 491, 493
Forward-thinking companies, 370
FP-Growth algorithm, 200, 226
Frame, 51
Fraud detection/prevention, 17t, 624
Fuzzy logic, 223, 369, 494, 521
G
Gambling referenda predicted by using ANN, 397A
Game playing in AI field, 476f
Gartner Group, 14, 627
GATE, 317
Genera l-purpose develo pme nt e nvironment,
499-500
Genetic a lgorithms, 218, 258, 441-446
Geographical info rmatio n system (GIS), 57,
61, 152, 454
Gini index, 219
Goal seeking analysis, 420f
Google Docs & Spreadsheets, 607
Google Web Analytics (google .com/
analytics), 368
Gorry a nd Scott-Morto n classical framework ,
11-13
Government a nd defense, 202
GPS, 31, 57, 547
Graphical user interface (GUI), 68A
Group decision support system (GDSS)
characteristics of, 532
defined, 532
idea generation methods, 532
groupwork improved by, 532-533
limitations of, 533
support activities in, 533
Group support system (GSS), 28, 43, 55, 56, 60, 62, 419, 442-445
collaboration capabilities of, 102, 105
in crime prevention, 427-428
defined, 533
decision rooms, 534
facilities in, 534
Internet/Intranet based system, 534
support activities, 534
GroupSystems, 530, 534
Groupware
defined, 527
products and features, 529t
tools, 528
GroupSystems, 437, 439
Lotus Notes (IBM collaboration software), 194, 516
Team Expert Choice (EC11), 530
WebEx.com, 529
Groupwork
benefits and limitations of, 524
characteristics of, 523
computerized system used to support, 526-528
defined, 523
difficulties associated with, 524t
group decision-making process, 524
overview, 526-527
Web 2.0, 522-523
H
Hadoop
pros and cons, 560-561
technical components, 559-560
working, 558-559
Hardware
for data mart, 105
data mining used in, 202
data warehousing, 123
for DSS, 72
Heuristic programming, 435
Heuristics, defined, 439
Heuristic searching, 439
Hewlett-Packard Company (HP), 38-39,
385, 465-466, 575, 578t, 620
Hidden layer, 252, 252f, 254
Hive, in data mining, 200
Holdout set, 215
Homeland security, 203, 483, 618
Homonyms, 294
Homophily, 376
Hopfield networks, 255-256
Hosted data warehouse (DW), 108
How explanations, 496-497
HP. See Hewlett-Packard Company (HP)
Hub, 345
Hybrid approaches to KMS, 517
Hyperion Solutions, 104t
Hyperlink-induced topic search (HITS), 345
Hyperplane, 265
I
IBM
Cognos, 64, 79, 88
DB2, 92
Deep Blue (chess program), 289
Watson's story, 289-291
ILOG acquisition, 623
InfoSphere Warehouse, 575
Intelligent Miner, 229t
Lotus Notes, 194, 516, 530
WebSphere portal, 183
Iconic (scale) model, 241
iData Analyzer, 229t
Idea generation, 532, 533
Idea generation in GSS process, 532
ID3, 196, 219
Implementation
defined, 55
phase of decision-making process, 43, 43f, 55, 58-59
Independent data mart, 93
Indices represented in TDM, 309-310
Individual DSS, 73
Individual privacy, 618
Inference, 496
Inference engine, 485, 491
Inferencing, 491
backward chaining, 491
combining two or more rules, 491-492
forward chaining, 491
with uncertainty, 493-494
Influence diagrams, 392, 396
Informational managerial roles, 7, 8t
Information-as-a-Service (Information on Demand) (IaaS), 611
Information Builders, 126, 136, 139
Information extraction in text mining applications, 293
Information gain, 221-222
Information harvesting, 192
Information overload, 40
Information retrieval, 292
Information system, integrating with KMS, 516
Information technology (IT) in knowledge management, 520-523
Information warfare, 203
Infrastructure as a service (IaaS), 607
Inmon, Bill, 83, 96, 120
Inmon model (EDW approach), 104
Innovation networks, 374-375
Input/output (technology) coefficients, 250, 408
Insightful Miner, 229t
Instant messaging (IM), 527, 615
Instant video, 529t
Institutional DSS, 63
Instruction system, 498
Insurance, 225
Integrated data warehousing, 115
Intelligence phase, decision making
application, 45A
classification of problems, 46
decomposition of problems, 46
identification of problems, 45
problem ownership, 46
supporting, 56
Intelligent agents (IA) in AI field, 521
Intelligent decision support system (IDSS), 469
Intelligent DSS, 62
Intelligent Miner, 229t
Interactive Financial Planning System (IFPS), 67
Intermediate result variables, 399, 400, 401
Internal data, 65
Internet
GDSS facilities, 434
non-work-re lated use of, 619
virtual communities, 377
Interpersonal managerial roles, 7, 8t
Interpretation system, 497
Interval data, 195
Intranets, 194, 516, 521
Inverse document frequency, 311
J
Jackknifing, 211
Java, 27, 66, 91, 101, 229, 262
K
KDD (knowledge discovery in databases), 213
Key performance indicato rs, 143, 168
K-fold cross-validation, 216-217
Kimball model (data mart approach), 104-105
K-means clustering algorithm, 223
k-nearest neighbor, 275
Knowledge
acquisition, 488-489
characteristics of, 514-515
data, information, and, 513, 514f
defined, 513
explicit, 515-516
leaky, 515
tacit, 515-516
taxonomy of, 515t
Knowledge acquisition in ES, 484-485
Knowledge and inference rules, 491
Knowledge-based decision support system (KBDSS), 469
Knowledge-based DSS, 70
Knowledge-based economy, 514
Knowledge-based modeling, 396, 398
Knowledge-based system (KBS), 488
Knowledge base in ES, 485
Knowledge discovery in databases (KDD), 213, 521
Knowledge discovery in textual databases. See Text mining
Knowledge elicitation, 488
Knowledge engineer, 485
Knowledge engineering, 487-497
Knowledge extraction, 192
See also Data mining
Knowledge extraction methods, 312
Knowledge harvesting tools, 511
Knowledge management consulting firms, 518
Knowledge management system (KMS), 10, 42, 44t, 57-58, 62, 516
approaches to, 516-518
artificial intelligence (AI) in, 521-522, 522t
components, 521
concepts and definitions, 513-515
cycle, 520, 520f
explicit, 515-516
harvesting process, 511
information technology, role in, 520
knowledge repository in, 518-519
nuggets, 509-511
organizational culture in, 516
organizational learning in, 513
organizational memory in, 513
overview, 512
successes, 511-524
tacit, 515-516
traps, 518
vignette, 508-509
Web 2.0, 522-523
See also Knowledge
Knowledge Miner, 229t
Knowledge nuggets (KN), 485
Knowledge-refining system in ES, 486
Knowledge repository, 518, 519f
Knowledge representation, 488, 490
KnowledgeSeeker, 228
Knowledge-sharing system, 511
Knowledge validation, 488
Kohonen's self-organizing feature maps (SOM), 255
KXEN (Knowledge eXtraction ENgines), 229t
L
Language translation, 300
Laptop computers, 225
Latent semantic indexing, 295
Law enforcement, 374
Leader, 8t
Leaf node, 153
Leaky knowledge, 515
Learning algorithms, 198, 248, 260, 275
Learning in artificial neural network (ANN), 253-259
algorithms, 260, 261
backpropagation, 260-261, 261f
how a network learns, 258-260
learning rate, 258
process of, 259, 259f
supervised and unsupervised, 259-260
Learning organization, 551A
Learning rate, 258
Leave-one-out, 217
Left-hand side (LHS), 414
Legal issues in business analytics, 616
Liaison, 8t
Lift metric, 225
Lindo Systems, Inc., 404, 414, 623
Linear programming (LP)
allocation problems, 412f
application, 407A
decision modeling with spreadsheets, 404
implementation, 414
mathematical programming and, 414
modeling in, example of, 409-412
modeling system, 409-411
product-mix problem, 411f
Link analysis, 200
LinkedIn, 380, 530, 604, 606
Linux, 105t
Loan application approvals, 201, 252-253
Location-based analytics for organizations, 594-596
Lockheed aircraft, 131
Logistics, 201
Lotus Notes. See IBM
Lotus Software. See IBM
M
Machine-learning techniques, 274, 277, 328
Machine translation, 300
Management control, 63
Management information system (MIS), 516
Management science (MS), 628
Management support systems (MSS), 41
Managerial performance, 7
Manual methods, 81
Manufacturing, 202
Market-basket analysis. See Association rule
mining
Marketing text mining applications, 293
MARS, 228, 229t
Mashups, 522, 530, 604
Massively parallel processing (MPP), 124
Mathematical (quantitative) model, 47
Mathematical programming, 408, 414, 422
Megaputer, 228, 262, 306A
PolyAnalyst, 228, 229t, 306A
TextAnalyst, 317
WebAnalyst, 369
Mental model, 515t
Message feature mining, 302
Metadata in data warehousing, 84, 85-87, 312
Microsoft
Enterprise Consortium, 228
Excel, 151-152, 151f, 152f, 230 (See also Spreadsheets)
PowerPoint, 30, 128
SharePoint, 522t, 528
SQL Server, 228
Windows, 105t, 600
Windows-based GUI, 533
Windows XP, 369
See also Groove Networks; SQL
MicroStrategy Corp., 15
Middleware tools in data warehousing process, 90
Mind-reading platforms, 71
Mintzberg's managerial roles, 7, 8t
MochiBot (mochibot.com), 369A
Mobile social networking, 606
Model base, 66
Model base management system (MBMS), 65-66, 67f, 398
Modeling and analysis
certainty, uncertainty, and risk, 401-403
decision analysis, 420-422
goal seeking analysis, 418-419, 420f
management support system (MSS) modeling, 41
mathematical models for decision support, 399-401
mathematical program optimization, 407-415 (See also Linear programming (LP))
model base management, 398
multicriteria decision making with pairwise comparisons, 423-429
of multiple goals, 422t
problem-solving search methods, 437-440, 438f
sensitivity analysis, 418
simulation, 446-453
with spreadsheets, 404A-405A (See also under Spreadsheets)
what-if analysis, 418
See also individual headings
Model libraries, 398
Model management subsystem, 65-68
categories of models, 398t
components of, 66-67
languages, 66
model base, 66, 67f
model base management system (MBMS), 66
model directory, 66
model execution, integration, and command, 66
model integration, 66
modeling language, 66
Models/modeling issues
environmental scanning and analysis, 396
forecasting (predictive analytics), 396-397
knowledge-based modeling, 398
model categories, 398
model management, 398
trends in, current, 398-399
variable identification, 396-397
Monitor, 8t
Monitoring system, 497
Morphology, 294
MSN, 617
Multicriteria decision making with pairwise comparisons, 423-425
Multidimensional analysis (modeling), 399
Multidimensional cube presentation, 179A
Multiple goals, analysis of, 416-417
Multiple goals, defined, 422
Multiplexity, 376
Multiprocessor clusters, 110t
Mutation, 443
Mutuality/reciprocity, 376
MySpace, 379
MySQL query response, 571
N
Narrative, 51
NASA, 546
Natural language generation, 300
Natural language processing (NLP)
in AI field, 521
aggregation technique, 326
bag-of-words interpretation, 296, 297
defined, 292-293
goal of, 292
morphology, 294
QA technologies, 289
sentiment analysis and, 320, 323, 382
social analytics, 373
stop words, 294
text analytics and text mining, 292, 297, 348
Natural language understanding, 300
Negotiator, 8t
NEOS Server for Optimization, 398
.NET Framework, 66
Network closure, 376
Network information processing, 252
Networks, 255-256
Neural computing, 247, 249
Neural network
application, 250A-251A, 256A-257A
architectures, 254-255
concepts, 247-248
information processing, 252-254
See also Artificial neural network (ANN)
Neural network-based systems
backpropagation, 260-261
developing, 258-259
implementation of ANN, 259-260
learning algorithm selection, 260
Neurodes, 249
Neurons, 248, 249
Nominal data , 194
Nonvolatile data warehousing, 84
Normative models, 49
Nucleus, 248
Numeric data, 195
O
Objective function, 408
Objective function coefficients, 408
Object-orie nted databases, 500
Objects, defined, 101
OLAP. See O nline a nalytica l processing
(OLAP)
OLTP system, 99, 110, 119, 579
1-of-N pseudo variables, 196
Online advertising, 225
Online analytical processing (OLAP)
business analytics, 19
data-driven DSS, 62
data warehouse, 81, 109
data storage, 144
decision making, 9
DSS templates provided by, 64
mathematical (quantitative) models embedded in, 140
middleware tools, 90
multidimensional analysis, 399
vs. OLTP, 110
operations, 110-111
Simon's four phases (decision making), 44t
variations, 111-112
Online collaboration, implementation issues for, 389
Online transaction processes. See OLTP system
Open Web Analytics (openwebanalytics.com), 368
Operational control, 12
Operational data store (ODS), 84-85
Operations research (OR), 39
Optical character recognition, 300
Optimal solution, 408
Optimization
algorithms, 623
in compound DSS, 63
marketing application, 301
nonlinear, 261
in mathematical programming, 407
normative models, 49
model, spreadsheet-based, 405
of online advertising, 225
quadratic modeling, 265
search engine, 354-356
vignette, 393-394
Web site, 370-372
Oracle
Attensity360, 384
Big Data analytics, 575, 578t, 612
Business Inte lligence Suite, 99
Data Mining (ODM), 229t
Endeca, 155
enterprise performance management (EPM), 166
Hyperion, 83, 228
RDBMS, 92
Orange Data Mining Tool, 229t
Ordinal data, 194
Ordinal multiple logistic regression, 195
Organizational culture in KMS, 516
Organizational knowledge base, 70, 485, 518
Organizational learning in KMS, 513
Organizational memory, 513, 533, 534
Organizational performance, 10
Organizational support, 63
orms-today.org, 623
Overall Analysis System for Intelligence Support (OASIS), 301
Overall classifier accuracy, 215
P
Piwik (piwik.org)
Prediction method
application, 278A-279A
distance metric, 276-277
k-nearest neighbor algorithm (KNN), 275-276
parameter selection, 277-278
Predictive modeling
vignette, 244-246
Processing element (PE), 251
Process losses, 524
Procter & Gamble (P&G), 63, 394, 452
Product design, applications for, 483
Production, 50-51, 102, 202, 408, 490
Product life-cycle management (PLM), 57, 85
Product-mix model formulation, 410
Product pricing, 225
Profitability analysis, 170
Project management, 56, 176, 498
Propensity to buy, 17t
Propinquity, 376
Q
Qualitative data, 206, 370
Quality, information, 13
Quantitative data, 206, 479t
Quantitative models, decision theory, 417
Query facility, 65
Query-specific clustering, 313
Question answering in text mining applications, 300
R
Radian6/Salesforce Cloud, 384
Radio frequency identification (RFID)
Big Data analytics, 547, 552, 562
business process management (BPM), 69
location-based data, 594-595
simulation-based assessment, 454A-457A
tags, 461, 549
VIM in DSS, 454
Rank order, 195, 347, 349, 352
RapidMiner, 229, 230
rapleaf.com, 618
Ratio data, 195
Ready-made DSS system, 64
Reality mining, 30, 598
Revenue management systems, 397A, 474
S
Search engine
application, 353A-354A, 357A-358A
optimization methods, 354-356
Sentiment analysis, 319, 321A-322A, 331A-332A
September 11, 2001, 203A-204A, 618
Sequence mining, 200
Sequence pattern discovery, 359
Sequential relationship patterns, 197
Serial analysis of gene expression (SAGE), 304
Service-oriented architectures (SOA), 87
Service-oriented DSS, components, 608, 610t
Sharing in collective intelligence (CI), 523, 605
Short message service (SMS), 69, 526
Sigmoid (logical activation) function, 254
Sigmoid transfer function, 254
Simon's four phases of decision making. See Decision-making process, phases of
Simple split, 215-216, 216f
Simulation, 54, 108, 171-177
advantages of, 449
applications, 446A-448A
characteristics of, 448-449
defined, 446
disadvantages of, 450
examples of, 454A-457A
inadequacies in, conventional, 453
methodology of, 450-451, 450f
software, 457-458
types, 451-452, 452t
vignette, 436-437
visual interactive models and DSS, 454
visual interactive simulation (VIS), 453-454
Simultaneous goals, 416
Singular-value decomposition (SVD), 295, 311-312
Site Meter (sitemeter.com), 368
SLATES (Web 2.0 acronym), 531
Slice-and-dice analysis, 15
SMS. See Short message service (SMS)
Snoop (reinvigorate.net), 369
Snowflake schema, 109
Social media
application, 379A-380A
definition and concept, 377-378
users, 378
Social media analytics
application, 383A
best practices, 381-382
concept, 380-381
tools and vendors, 384-385
Social network analysis (SNA), 374
application, 375A-376A
connections, 376
distributions, 376-377
metrics, 375
segmentation, 377
Social networking, online
business enterprises, implications on, 606
defined, 606
Twitter, 606
Web 2.0, 604-605
Social-networking sites, 530
Sociometricsolutions.com, 619
Software
in artificial neural networks (ANN), 262
data mining used in, 201
in text mining, 317
used in data mining, 228-231
Software as a service (SaaS), 123, 607
Solution technique libraries, 398, 399
SOM. See Kohonen's self-organizing feature maps (SOM)
Speech recognition, 300
Speech synthesis, 300
Spiders, 348
Spiral16, 385
Spreadsheets
data storage, 144
for decision tables, 421
for DSS model, 398
as end-user model, 404-405
as ETL tool, 100
of goal seeking analysis, 410
in LP models, 414
management support system (MSS) modeling with, 67
model used to create schedules for medical interns, 407A
in multidimensional analysis, 399
in prescriptive analytics, 406f
simulation packages, 449, 452
in ThinkTank, 530
user interface subsystem, 68
what-if query, 418
SPRINT, 219
SproutSocial, 385
SPSS
Clementine, 228, 232A, 262, 369t
PASW Modeler, 228
Starbucks, 385, 615
Star schema, 108-109, 109f
State of nature, 420
Static model, 454
Statistica Data Miner, 125, 228, 262
Statistica Text Mining, 317
Statistics
data mining vs., 200-201
predictive analytics, conversion, 364-365
Server, 545
Server Data Mining, 229t
StatSoft, Inc., 310, 623
Stemming, 294
Stop terms, 310
Stop words, 294
Store design, 225
Story, 157
Strategic planning, 513
Stream analytics
application, 585A-586A
critical event processing, 582-583
cyber security, 586
data stream mining, 583
defined, 581-582
e-commerce, 584
financial services, 587
government, 587-588
health sciences, 587
law enforcement, 586
versus perpetual analytics, 582
power industry, 586
telecommunications, 584
Structural holes, 377
Structured problems, 12
Structured processes, 12
Subject matter expert (SME), 509
Subject-oriented data warehousing, 81
Suboptimization, 50
Summarization in text mining applications, 293
Summation function, 253
Sun Microsystems, 118, 607
Supervised learning, 259-260
Supply-chain management (SCM), 13, 42, 56f, 58, 720
Support metric, 225-226
Support vector machines (SVM)
applications, 266A-270A
vs. artificial neural network (ANN), 274-275
formulations, 270-272
kernel trick, 272-273
non-linear classification, 272
process-based approach, 273-274
Sybase, 104t, 125, 385
Symbiotic intelligence. See Collective intelligence (CI)
Symmetric kernel function, 206
Synapse, 249, 527
Synchronous communication, 429
Synchronous products, 528-529
Synonyms, 294
Sysomos, 384
System dynamics modeling, 458-460
T
Tablet computers, 68, 158
Tacit knowledge, 515-516
Tags, RFID, 461, 531
Tailored turn-key solutions, 501
Team Expert Choice (EC11), 530
Teamwork. See Groupwork
Technology insight
active data warehousing, 120
Ambeo's auditing solution, 122
ANN software, 262
on Big Data, 579-580
biological and artificial neural network, compared, 250
business reporting, stories, 156-157
criterion and a constraint, compared, 48
data scientists, 566
data size, 547-548
Gartner, Inc.'s business intelligence platform, 154-155
group process, dysfunctions, 525-526
on Hadoop, 561-562
hosted data warehouses, 108
knowledge acquisition difficulties, 489
linear programming, 409
MicroStrategy's data warehousing, 112-113
PageRank algorithm, 350-351
popular search engine, 355
taxonomy of data, 194-196
text mining lingo, 294-295
textual data sets, 329
Teleconferencing, 527
Teradata Corp., 95, 118f, 119f
Teradata University Network (TUN), 16, 32,
127-128
Term-by-document matrix (occurrence matrix), 294-295
Term dictionary, 294
Term-document matrix (TDM), 309-310, 309f
Terms, defined, 294
Test set, 215
Text categorization, 312
Text data mining. See Text mining
Text mining
academic application, 305
applications, 295A-296A, 298A-299A, 300-307, 314A-316A
biomedical application, 304-305
bag-of-words used in, 296
commercial software tools, 317
concepts and definitions, 291-295
corpus establishment, 308-309
free software tools, 317
knowledge extraction, 312-316
marketing applications, 301
natural language processing (NLP) in, 296-300
for patent analysis, 295A-296A
process, 307-317
research literature survey with, 314A-316A
security application, 301-303
term document matrix, 309-312
three-step process, 307-308, 308f
tools, 317-319
Text proofing, 300
Text-to-speech, 300
Theory of certainty factors, 494
ThinkTank, 530
Three-tier architecture, 90, 91, 92
Threshold value, 254
Tie strength, 377
Time compression, 449
Time-dependent simulation, 452
Time-independent simulation, 452
Time/place framework, 527-528
Time pressure, 40
Time-series forecasting, 200
Time variant (time series) data warehousing, 84
Tokenizing, 294
Topic tracking in text mining applications,
293
Topologies, 252
Toyota , 385
Training set, 215
Transaction-processing system (TPS), 114
Transformation (transfer) function , 254
Travel industry, 202
Trend analysis in text mining, 313
Trial-and-error sensitivity analysis, 418
Tribes, 157
Turing test, 477
Twitter, 339, 378, 381, 605
Two-tier architecture, 91
U
Uncertainty, decision making under, 401, 402, 403A, 421
Uncontrollable variables, 400, 421
U.S. Department of Homeland Security (DHS), 203A-204A, 346, 618
Unstructured data (vs. structured data) , 294
Unstructured decisions, 12
Unstructured problems, 12, 13
Unstructured processes, 12
Unstructured text data, 301
Unsupervised learning, 198, 214
USA PATRIOT Act, 203A, 617
User-centered design (UCD), 530
User interface
in ES, 485
subsystem, 68-69
Utility theory, 417
V
Variables
decision, 400, 408, 421
de pendent, 400
identification of, 396
intermediate result, 401
result (outcome), 400, 421
uncontrollable, 400, 421
Videogames, 537
Video-sharing sites, 530
Virtual communities, 604-605
Virtual meeting system, 529
Virtual reality, 51, 149, 454, 600
VisSim (Visual Solutions, Inc.), 461
Visual analytics
high-powered environment, 158-159
story structure, 157
Visual interactive modeling (VIM), 453-454
Visual interactive models and DSS, 454
Visual interactive problem solving, 453
Visual interactive simulation (VIS), 453-454, 454A-457A
Visualization, 14, 145-150, 154
Visual recognition, 278A-279A
Visual simulation, 453
Vivisimo/Clusty, 317
Vodafone New Zealand Ltd., 6
Voice input (speech recognition) , 69
Voice of customer (VOC), 366, 371, 372-373
Voice over IP (VoIP), 528
Voice recognition, 265
W
Walmart, 10, 69, 116, 224, 385, 613
Web analytics
application, 360A-362A
conversion statistics, 364-365
knowledge extraction, 359A
maturity model, 366-368
metrics, 362
usability, 362-363
technologies, 359-360
tools, 368-370
traffic sources, 363-364
vignette, 339-341
visitor profiles, 364
Web-based data warehousing, 91, 92f
Web conferencing, 528
Web content mining, 344-346, 346A-
Web crawlers, 344, 348
WebEx.com, 529, 534
Web-HIPRE application, 425
Webhousing, 115
Webinars, 529
Web mining, 341-343
Web site optimization ecosystem, 370-372, 370f, 371f
Web structure mining, 344-346
Webtrends, 385
Web 2.0, 522-523, 530-531, 604-605
characteristics of, 605
defined, 522-523
features/techniques in, 530-531
used in online social networking, 604-605
Web usage mining, 341-343, 358-359
Weka, 228
What-if analysis, 418, 419f
Why explanations, 496
WiFi ho tspot access points, 598
Wikilog, 528, 529, 531
Wikipedia, 378, 461, 531, 605, 607
Wikis, 377, 522-523, 530, 531, 544, 615
Woopra (woopra.com), 369
Word counting, 297
Word frequency, 294
WordStat analysis, 317
Workflow, 530, 544, 560
wsj.com/wtk, 618
X
XCON, 481, 482, 482t
XLMiner, 229t
XML Miner, 369t
xplusone.com, 618
Y
Yahoo!, 69, 262, 345, 351, 368, 558, 607, 617
Yahoo! Web Analytics (web.analytics.yahoo.com), 368
Yield management, 202, 473
YouTube, 16, 378, 380, 530, 548, 597, 605