In Praise of Computer Architecture: A Quantitative Approach
Fifth Edition
“The 5th edition of Computer Architecture: A Quantitative Approach continues
the legacy, providing students of computer architecture with the most up-to-date
information on current computing platforms, and architectural insights to help
them design future systems. A highlight of the new edition is the significantly
revised chapter on data-level parallelism, which demystifies GPU architectures
with clear explanations using traditional computer architecture terminology.”
—Krste Asanović, University of California, Berkeley
“Computer Architecture: A Quantitative Approach is a classic that, like fine
wine, just keeps getting better. I bought my first copy as I finished up my undergraduate degree and it remains one of my most frequently referenced texts today.
When the fourth edition came out, there was so much new material that I needed
to get it to stay current in the field. And, as I review the fifth edition, I realize that
Hennessy and Patterson have done it again. The entire text is heavily updated and
Chapter 6 alone makes this new edition required reading for those wanting to
really understand cloud and warehouse-scale computing. Only Hennessy and
Patterson have access to the insiders at Google, Amazon, Microsoft, and other
cloud computing and internet-scale application providers and there is no better
coverage of this important area anywhere in the industry.”
—James Hamilton, Amazon Web Services
“Hennessy and Patterson wrote the first edition of this book when graduate students built computers with 50,000 transistors. Today, warehouse-size computers
contain that many servers, each consisting of dozens of independent processors
and billions of transistors. The evolution of computer architecture has been rapid
and relentless, but Computer Architecture: A Quantitative Approach has kept
pace, with each edition accurately explaining and analyzing the important emerging ideas that make this field so exciting.”
—James Larus, Microsoft Research
“This new edition adds a superb new chapter on data-level parallelism in vector,
SIMD, and GPU architectures. It explains key architecture concepts inside mass-market GPUs, maps them to traditional terms, and compares them with vector
and SIMD architectures. It’s timely and relevant with the widespread shift to
GPU parallel computing. Computer Architecture: A Quantitative Approach furthers its string of firsts in presenting comprehensive architecture coverage of significant new developments!”
—John Nickolls, NVIDIA
“The new edition of this now classic textbook highlights the ascendance of
explicit parallelism (data, thread, request) by devoting a whole chapter to each
type. The chapter on data parallelism is particularly illuminating: the comparison
and contrast between Vector SIMD, instruction level SIMD, and GPU cuts
through the jargon associated with each architecture and exposes the similarities
and differences between these architectures.”
—Kunle Olukotun, Stanford University
“The fifth edition of Computer Architecture: A Quantitative Approach explores
the various parallel concepts and their respective tradeoffs. As with the previous
editions, this new edition covers the latest technology trends. Two highlighted are
the explosive growth of Personal Mobile Devices (PMD) and Warehouse Scale
Computing (WSC)—where the focus has shifted towards a more sophisticated
balance of performance and energy efficiency as compared with raw performance. These trends are fueling our demand for ever more processing capability
which in turn is moving us further down the parallel path.”
—Andrew N. Sloss, Consultant Engineer, ARM
Author of ARM System Developer’s Guide
Computer Architecture
A Quantitative Approach
Fifth Edition
John L. Hennessy is the tenth president of Stanford University, where he has been a member
of the faculty since 1977 in the departments of electrical engineering and computer science.
Hennessy is a Fellow of the IEEE and ACM; a member of the National Academy of Engineering,
the National Academy of Science, and the American Philosophical Society; and a Fellow of
the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer
Engineering Award, and the 2000 John von Neumann Award, which he shared with David
Patterson. He has also received seven honorary doctorates.
In 1981, he started the MIPS project at Stanford with a handful of graduate students. After
completing the project in 1984, he took a leave from the university to cofound MIPS Computer
Systems (now MIPS Technologies), which developed one of the first commercial RISC
microprocessors. As of 2006, over 2 billion MIPS microprocessors have been shipped in devices
ranging from video games and palmtop computers to laser printers and network switches.
Hennessy subsequently led the DASH (Directory Architecture for Shared Memory) project, which
prototyped the first scalable cache coherent multiprocessor; many of the key ideas have been
adopted in modern multiprocessors. In addition to his technical activities and university
responsibilities, he has continued to work with numerous start-ups both as an early-stage
advisor and an investor.
David A. Patterson has been teaching computer architecture at the University of California,
Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer
Science. His teaching has been honored by the Distinguished Teaching Award from the
University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and
Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement
Award and the ACM Eckert-Mauchly Award for contributions to RISC, and he shared the IEEE
Johnson Information Storage Award for contributions to RAID. He also shared the IEEE John von
Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a
Fellow of the American Academy of Arts and Sciences, the Computer History Museum, ACM,
and IEEE, and he was elected to the National Academy of Engineering, the National Academy
of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information
Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley
EECS department, as chair of the Computing Research Association, and as President of ACM.
This record led to Distinguished Service Awards from ACM and CRA.
At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced
instruction set computer, and the foundation of the commercial SPARC architecture. He was a
leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable
storage systems from many companies. He was also involved in the Network of Workstations
(NOW) project, which led to cluster technology used by Internet companies and later to cloud
computing. These projects earned three dissertation awards from ACM. His current research
projects are Algorithm-Machine-People Laboratory and the Parallel Computing Laboratory,
where he is director. The goal of the AMP Lab is to develop scalable machine learning algorithms,
warehouse-scale-computer-friendly programming models, and crowd-sourcing tools to gain
valuable insights quickly from big data in the cloud. The goal of the Par Lab is to develop technologies to deliver scalable, portable, efficient, and productive software for parallel personal
mobile devices.
Computer Architecture
A Quantitative Approach
Fifth Edition
John L. Hennessy
Stanford University
David A. Patterson
University of California, Berkeley
With Contributions by
Krste Asanović
University of California, Berkeley
Jason D. Bakos
University of South Carolina
Robert P. Colwell
R&E Colwell & Assoc. Inc.
Thomas M. Conte
North Carolina State University
José Duato
Universitat Politècnica de València and Simula
Diana Franklin
University of California, Santa Barbara
David Goldberg
The Scripps Research Institute
Norman P. Jouppi
HP Labs
Sheng Li
HP Labs
Naveen Muralimanohar
HP Labs
Gregory D. Peterson
University of Tennessee
Timothy M. Pinkston
University of Southern California
Parthasarathy Ranganathan
HP Labs
David A. Wood
University of Wisconsin–Madison
Amr Zaky
University of Santa Clara
Amsterdam • Boston • Heidelberg • London
New York • Oxford • Paris • San Diego
San Francisco • Singapore • Sydney • Tokyo
Acquiring Editor: Todd Green
Development Editor: Nate McFadden
Project Manager: Paul Gottehrer
Designer: Joanne Blank
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
© 2012 Elsevier, Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic
or mechanical, including photocopying, recording, or any information storage and retrieval system,
without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the
Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website:
www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the
Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience
broaden our understanding, changes in research methods or professional practices may become
necessary. Practitioners and researchers must always rely on their own experience and knowledge in
evaluating and using any information or methods described herein. In using such information or
methods they should be mindful of their own safety and the safety of others, including parties for
whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume
any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas
contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Application submitted
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-0-12-383872-8
For information on all MK publications
visit our website at www.mkp.com
Printed in the United States of America
11 12 13 14 15 10 9 8 7 6 5 4 3 2 1
Typeset by: diacriTech, Chennai, India
To Andrea, Linda, and our four sons
Foreword
by Luiz André Barroso, Google Inc.
The first edition of Hennessy and Patterson’s Computer Architecture: A Quantitative Approach was released during my first year in graduate school. I belong,
therefore, to that first wave of professionals who learned about our discipline
using this book as a compass. Perspective being a fundamental ingredient to a
useful Foreword, I find myself at a disadvantage given how much of my own
views have been colored by the previous four editions of this book. Another
obstacle to clear perspective is that the student-grade reverence for these two
superstars of Computer Science has not yet left me, despite (or perhaps because
of) having had the chance to get to know them in the years since. These disadvantages are mitigated by my having practiced this trade continuously since this
book’s first edition, which has given me a chance to enjoy its evolution and
enduring relevance.
The last edition arrived just two years after the rampant industrial race for
higher CPU clock frequency had come to its official end, with Intel cancelling its
4 GHz single-core developments and embracing multicore CPUs. Two years was
plenty of time for John and Dave to present this story not as a random product
line update, but as a defining computing technology inflection point of the last
decade. That fourth edition had a reduced emphasis on instruction-level parallelism (ILP) in favor of added material on thread-level parallelism, something the
current edition takes even further by devoting two chapters to thread- and data-level parallelism while limiting ILP discussion to a single chapter. Readers who
are being introduced to new graphics processing engines will benefit especially
from the new Chapter 4 which focuses on data parallelism, explaining the
different but slowly converging solutions offered by multimedia extensions in
general-purpose processors and increasingly programmable graphics processing
units. Of notable practical relevance: If you have ever struggled with CUDA
terminology, check out Figure 4.24 (teaser: “Shared Memory” is really local,
while “Global Memory” is closer to what you’d consider shared memory).
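A minimal CUDA sketch (a hypothetical kernel of my own, assuming 256-thread blocks, not an example from the book) shows why the names trip people up: the __shared__ array is a scratchpad private to one thread block, while the pointer arguments refer to global memory, the space that every thread on the device actually shares.

// Hypothetical kernel illustrating CUDA's memory-space names.
__global__ void scale(const float *in, float *out, float alpha, int n)
{
    __shared__ float tile[256];             // "shared memory": scratchpad visible only within this thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];          // stage a value from global memory
    __syncthreads();                        // barrier over this block only, not the whole device
    if (i < n)
        out[i] = alpha * tile[threadIdx.x]; // result goes back to global memory
}
// Example launch with 256-thread blocks: scale<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);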
Even though we are still in the middle of that multicore technology shift, this
edition embraces what appears to be the next major one: cloud computing. In this
case, the ubiquity of Internet connectivity and the evolution of compelling Web
services are bringing to the spotlight very small devices (smart phones, tablets)
and very large ones (warehouse-scale computing systems). The ARM Cortex A8,
a popular CPU for smart phones, appears in Chapter 3’s “Putting It All Together”
section, and a whole new Chapter 6 is devoted to request- and data-level parallelism in the context of warehouse-scale computing systems. In this new chapter,
John and Dave present these new massive clusters as a distinctively new class of
computers—an open invitation for computer architects to help shape this emerging field. Readers will appreciate how this area has evolved in the last decade by
comparing the Google cluster architecture described in the third edition with the
more modern incarnation presented in this version’s Chapter 6.
Return customers of this book will appreciate once again the work of two outstanding
computer scientists who over their careers have perfected the art of combining an
academic’s principled treatment of ideas with a deep understanding of leading-edge
industrial products and technologies. The authors’ success in industrial interactions
won’t be a surprise to those who have witnessed how Dave conducts his biannual project retreats, forums meticulously crafted to extract the most out of academic–industrial
collaborations. Those who recall John’s entrepreneurial success with MIPS or bump into
him in a Google hallway (as I occasionally do) won’t be surprised by it either.
Perhaps most importantly, return and new readers alike will get their money’s
worth. What has made this book an enduring classic is that each edition is not an
update but an extensive revision that presents the most current information and
unparalleled insight into this fascinating and quickly changing field. For me, after
over twenty years in this profession, it is also another opportunity to experience
that student-grade admiration for two remarkable teachers.
Contents

Foreword ix
Preface xv
Acknowledgments xxiii

Chapter 1 Fundamentals of Quantitative Design and Analysis
1.1 Introduction 2
1.2 Classes of Computers 5
1.3 Defining Computer Architecture 11
1.4 Trends in Technology 17
1.5 Trends in Power and Energy in Integrated Circuits 21
1.6 Trends in Cost 27
1.7 Dependability 33
1.8 Measuring, Reporting, and Summarizing Performance 36
1.9 Quantitative Principles of Computer Design 44
1.10 Putting It All Together: Performance, Price, and Power 52
1.11 Fallacies and Pitfalls 55
1.12 Concluding Remarks 59
1.13 Historical Perspectives and References 61
Case Studies and Exercises by Diana Franklin 61

Chapter 2 Memory Hierarchy Design
2.1 Introduction 72
2.2 Ten Advanced Optimizations of Cache Performance 78
2.3 Memory Technology and Optimizations 96
2.4 Protection: Virtual Memory and Virtual Machines 105
2.5 Crosscutting Issues: The Design of Memory Hierarchies 112
2.6 Putting It All Together: Memory Hierarchies in the ARM Cortex-A8 and Intel Core i7 113
2.7 Fallacies and Pitfalls 125
2.8 Concluding Remarks: Looking Ahead 129
2.9 Historical Perspective and References 131
Case Studies and Exercises by Norman P. Jouppi, Naveen Muralimanohar, and Sheng Li 131

Chapter 3 Instruction-Level Parallelism and Its Exploitation
3.1 Instruction-Level Parallelism: Concepts and Challenges 148
3.2 Basic Compiler Techniques for Exposing ILP 156
3.3 Reducing Branch Costs with Advanced Branch Prediction 162
3.4 Overcoming Data Hazards with Dynamic Scheduling 167
3.5 Dynamic Scheduling: Examples and the Algorithm 176
3.6 Hardware-Based Speculation 183
3.7 Exploiting ILP Using Multiple Issue and Static Scheduling 192
3.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation 197
3.9 Advanced Techniques for Instruction Delivery and Speculation 202
3.10 Studies of the Limitations of ILP 213
3.11 Cross-Cutting Issues: ILP Approaches and the Memory System 221
3.12 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput 223
3.13 Putting It All Together: The Intel Core i7 and ARM Cortex-A8 233
3.14 Fallacies and Pitfalls 241
3.15 Concluding Remarks: What’s Ahead? 245
3.16 Historical Perspective and References 247
Case Studies and Exercises by Jason D. Bakos and Robert P. Colwell 247

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
4.1 Introduction 262
4.2 Vector Architecture 264
4.3 SIMD Instruction Set Extensions for Multimedia 282
4.4 Graphics Processing Units 288
4.5 Detecting and Enhancing Loop-Level Parallelism 315
4.6 Crosscutting Issues 322
4.7 Putting It All Together: Mobile versus Server GPUs and Tesla versus Core i7 323
4.8 Fallacies and Pitfalls 330
4.9 Concluding Remarks 332
4.10 Historical Perspective and References 334
Case Study and Exercises by Jason D. Bakos 334

Chapter 5 Thread-Level Parallelism
5.1 Introduction 344
5.2 Centralized Shared-Memory Architectures 351
5.3 Performance of Symmetric Shared-Memory Multiprocessors 366
5.4 Distributed Shared-Memory and Directory-Based Coherence 378
5.5 Synchronization: The Basics 386
5.6 Models of Memory Consistency: An Introduction 392
5.7 Crosscutting Issues 395
5.8 Putting It All Together: Multicore Processors and Their Performance 400
5.9 Fallacies and Pitfalls 405
5.10 Concluding Remarks 409
5.11 Historical Perspectives and References 412
Case Studies and Exercises by Amr Zaky and David A. Wood 412

Chapter 6 Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
6.1 Introduction 432
6.2 Programming Models and Workloads for Warehouse-Scale Computers 436
6.3 Computer Architecture of Warehouse-Scale Computers 441
6.4 Physical Infrastructure and Costs of Warehouse-Scale Computers 446
6.5 Cloud Computing: The Return of Utility Computing 455
6.6 Crosscutting Issues 461
6.7 Putting It All Together: A Google Warehouse-Scale Computer 464
6.8 Fallacies and Pitfalls 471
6.9 Concluding Remarks 475
6.10 Historical Perspectives and References 476
Case Studies and Exercises by Parthasarathy Ranganathan 476

Appendix A Instruction Set Principles
A.1 Introduction A-2
A.2 Classifying Instruction Set Architectures A-3
A.3 Memory Addressing A-7
A.4 Type and Size of Operands A-13
A.5 Operations in the Instruction Set A-14
A.6 Instructions for Control Flow A-16
A.7 Encoding an Instruction Set A-21
A.8 Crosscutting Issues: The Role of Compilers A-24
A.9 Putting It All Together: The MIPS Architecture A-32
A.10 Fallacies and Pitfalls A-39
A.11 Concluding Remarks A-45
A.12 Historical Perspective and References A-47
Exercises by Gregory D. Peterson A-47

Appendix B Review of Memory Hierarchy
B.1 Introduction B-2
B.2 Cache Performance B-16
B.3 Six Basic Cache Optimizations B-22
B.4 Virtual Memory B-40
B.5 Protection and Examples of Virtual Memory B-49
B.6 Fallacies and Pitfalls B-57
B.7 Concluding Remarks B-59
B.8 Historical Perspective and References B-59
Exercises by Amr Zaky B-60

Appendix C Pipelining: Basic and Intermediate Concepts
C.1 Introduction C-2
C.2 The Major Hurdle of Pipelining—Pipeline Hazards C-11
C.3 How Is Pipelining Implemented? C-30
C.4 What Makes Pipelining Hard to Implement? C-43
C.5 Extending the MIPS Pipeline to Handle Multicycle Operations C-51
C.6 Putting It All Together: The MIPS R4000 Pipeline C-61
C.7 Crosscutting Issues C-70
C.8 Fallacies and Pitfalls C-80
C.9 Concluding Remarks C-81
C.10 Historical Perspective and References C-81
Updated Exercises by Diana Franklin C-82

Online Appendices
Appendix D Storage Systems
Appendix E Embedded Systems, by Thomas M. Conte
Appendix F Interconnection Networks, revised by Timothy M. Pinkston and José Duato
Appendix G Vector Processors in More Depth, revised by Krste Asanović
Appendix H Hardware and Software for VLIW and EPIC
Appendix I Large-Scale Multiprocessors and Scientific Applications
Appendix J Computer Arithmetic, by David Goldberg
Appendix K Survey of Instruction Set Architectures
Appendix L Historical Perspectives and References

References R-1
Index I-1
Preface
Why We Wrote This Book
Through five editions of this book, our goal has been to describe the basic principles underlying what will be tomorrow’s technological developments. Our excitement about the opportunities in computer architecture has not abated, and we
echo what we said about the field in the first edition: “It is not a dreary science of
paper machines that will never work. No! It’s a discipline of keen intellectual
interest, requiring the balance of marketplace forces to cost-performance-power,
leading to glorious failures and some notable successes.”
Our primary objective in writing our first book was to change the way people
learn and think about computer architecture. We feel this goal is still valid and
important. The field is changing daily and must be studied with real examples
and measurements on real computers, rather than simply as a collection of definitions and designs that will never need to be realized. We offer an enthusiastic
welcome to anyone who came along with us in the past, as well as to those who
are joining us now. Either way, we can promise the same quantitative approach
to, and analysis of, real systems.
As with earlier versions, we have strived to produce a new edition that will
continue to be as relevant for professional engineers and architects as it is for
those involved in advanced computer architecture and design courses. Like the
first edition, this edition has a sharp focus on new platforms—personal mobile
devices and warehouse-scale computers—and new architectures—multicore and
GPUs. As much as its predecessors, this edition aims to demystify computer
architecture through an emphasis on cost-performance-energy trade-offs and
good engineering design. We believe that the field has continued to mature and
move toward the rigorous quantitative foundation of long-established scientific
and engineering disciplines.
This Edition
We said the fourth edition of Computer Architecture: A Quantitative Approach
may have been the most significant since the first edition due to the switch to
multicore chips. The feedback we received this time was that the book had lost
the sharp focus of the first edition, covering everything equally but without emphasis and context. We’re pretty sure that won’t be said about the fifth edition.
We believe most of the excitement is at the extremes in size of computing,
with personal mobile devices (PMDs) such as cell phones and tablets as the clients and warehouse-scale computers offering cloud computing as the server.
(Observant readers may have seen the hint for cloud computing on the cover.) We are
struck by the common theme of these two extremes in cost, performance, and
energy efficiency despite their difference in size. As a result, the running context
through each chapter is computing for PMDs and for warehouse-scale computers,
and Chapter 6 is a brand-new chapter on the latter topic.
The other theme is parallelism in all its forms. We first identify the two types of
application-level parallelism in Chapter 1: data-level parallelism (DLP), which
arises because there are many data items that can be operated on at the same time,
and task-level parallelism (TLP), which arises because tasks of work are created
that can operate independently and largely in parallel. We then explain the four
architectural styles that exploit DLP and TLP: instruction-level parallelism (ILP)
in Chapter 3; vector architectures and graphic processor units (GPUs) in Chapter
4, which is a brand-new chapter for this edition; thread-level parallelism in
Chapter 5; and request-level parallelism (RLP) via warehouse-scale computers in
Chapter 6, which is also a brand-new chapter for this edition. We moved memory
hierarchy earlier in the book to Chapter 2, and we moved the storage systems
chapter to Appendix D. We are particularly proud about Chapter 4, which contains the most detailed and clearest explanation of GPUs yet, and Chapter 6,
which is the first publication of the most recent details of a Google Warehouse-scale computer.
As before, the first three appendices in the book give basics on the MIPS
instruction set, memory hierarchy, and pipelining for readers who have not read a
book like Computer Organization and Design. To keep costs down but still supply supplemental material that is of interest to some readers, available online at
http://booksite.mkp.com/9780123838728/ are nine more appendices. There are
more pages in these appendices than there are in this book!
This edition continues the tradition of using real-world examples to demonstrate the ideas, and the “Putting It All Together” sections are brand new. The
“Putting It All Together” sections of this edition include the pipeline organizations and memory hierarchies of the ARM Cortex A8 processor, the Intel Core i7
processor, the NVIDIA GTX-280 and GTX-480 GPUs, and one of the Google
warehouse-scale computers.
Topic Selection and Organization
As before, we have taken a conservative approach to topic selection, for there are
many more interesting ideas in the field than can reasonably be covered in a treatment of basic principles. We have steered away from a comprehensive survey of
every architecture a reader might encounter. Instead, our presentation focuses on
core concepts likely to be found in any new machine. The key criterion remains
that of selecting ideas that have been examined and utilized successfully enough
to permit their discussion in quantitative terms.
Our intent has always been to focus on material that is not available in equivalent form from other sources, so we continue to emphasize advanced content
wherever possible. Indeed, there are several systems here whose descriptions
cannot be found in the literature. (Readers interested strictly in a more basic
introduction to computer architecture should read Computer Organization and
Design: The Hardware/Software Interface.)
An Overview of the Content
Chapter 1 has been beefed up in this edition. It includes formulas for energy,
static power, dynamic power, integrated circuit costs, reliability, and availability.
(These formulas are also found on the front inside cover.) Our hope is that these
topics can be used through the rest of the book. In addition to the classic quantitative principles of computer design and performance measurement, the PIAT section has been upgraded to use the new SPECPower benchmark.
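As a reminder of the general shape of those formulas (a first-order sketch; Chapter 1 derives them in context):

\[
\begin{aligned}
\text{Energy}_{\text{dynamic}} &\propto \text{Capacitive load} \times \text{Voltage}^{2} \\
\text{Power}_{\text{dynamic}} &\propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency switched} \\
\text{Power}_{\text{static}} &\propto \text{Current}_{\text{static}} \times \text{Voltage}
\end{aligned}
\]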
Our view is that the instruction set architecture is playing less of a role today
than in 1990, so we moved this material to Appendix A. It still uses the MIPS64
architecture. (For quick review, a summary of the MIPS ISA can be found on the
back inside cover.) For fans of ISAs, Appendix K covers 10 RISC architectures,
the 80×86, the DEC VAX, and the IBM 360/370.
We then move onto memory hierarchy in Chapter 2, since it is easy to apply
the cost-performance-energy principles to this material and memory is a critical
resource for the rest of the chapters. As in the past edition, Appendix B contains
an introductory review of cache principles, which is available in case you need it.
Chapter 2 discusses 10 advanced optimizations of caches. The chapter includes
virtual machines, which offer advantages in protection, software management,
and hardware management and play an important role in cloud computing. In
addition to covering SRAM and DRAM technologies, the chapter includes new
material on Flash memory. The PIAT examples are the ARM Cortex A8, which is
used in PMDs, and the Intel Core i7, which is used in servers.
Chapter 3 covers the exploitation of instruction-level parallelism in high-performance processors, including superscalar execution, branch prediction,
speculation, dynamic scheduling, and multithreading. As mentioned earlier,
Appendix C is a review of pipelining in case you need it. Chapter 3 also surveys the limits of ILP. Like Chapter 2, the PIAT examples are again the ARM
Cortex A8 and the Intel Core i7. While the third edition contained a great deal
on Itanium and VLIW, this material is now in Appendix H, indicating our view
that this architecture did not live up to the earlier claims.
The increasing importance of multimedia applications such as games and video
processing has also increased the importance of architectures that can exploit data-level parallelism. In particular, there is a rising interest in computing using graphical processing units (GPUs), yet few architects understand how GPUs really work.
We decided to write a new chapter in large part to unveil this new style of computer architecture. Chapter 4 starts with an introduction to vector architectures,
which acts as a foundation on which to build explanations of multimedia SIMD
instruction set extensions and GPUs. (Appendix G goes into even more depth on
vector architectures.) The section on GPUs was the most difficult to write in this
book, in that it took many iterations to get an accurate description that was also
easy to understand. A significant challenge was the terminology. We decided to go
with our own terms and then provide a translation between our terms and the official NVIDIA terms. (A copy of that table can be found in the back inside cover
pages.) This chapter introduces the Roofline performance model and then uses it
to compare the Intel Core i7 and the NVIDIA GTX 280 and GTX 480 GPUs. The
chapter also describes the Tegra 2 GPU for PMDs.
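For readers who have not met it, the Roofline model bounds attainable throughput by the lesser of the compute and memory limits; in rough form (a sketch of the idea, which Chapter 4 develops properly):

\[
\text{Attainable performance} = \min\left(\text{Peak floating-point performance},\ \text{Peak memory bandwidth} \times \text{Arithmetic intensity}\right)
\]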
Chapter 5 describes multicore processors. It explores symmetric and
distributed-memory architectures, examining both organizational principles and
performance. Topics in synchronization and memory consistency models are
next. The example is the Intel Core i7. Readers interested in interconnection networks on a chip should read Appendix F, and those interested in larger scale multiprocessors and scientific applications should read Appendix I.
As mentioned earlier, Chapter 6 describes the newest topic in computer architecture, warehouse-scale computers (WSCs). Based on help from engineers at
Amazon Web Services and Google, this chapter integrates details on design, cost,
and performance of WSCs that few architects are aware of. It starts with the popular MapReduce programming model before describing the architecture and
physical implementation of WSCs, including cost. The costs allow us to explain
the emergence of cloud computing, whereby it can be cheaper to compute using
WSCs in the cloud than in your local datacenter. The PIAT example is a description of a Google WSC that includes information published for the first time in
this book.
This brings us to Appendices A through L. Appendix A covers principles of
ISAs, including MIPS64, and Appendix K describes 64-bit versions of Alpha,
MIPS, PowerPC, and SPARC and their multimedia extensions. It also includes
some classic architectures (80×86, VAX, and IBM 360/370) and popular embedded
instruction sets (ARM, Thumb, SuperH, MIPS16, and Mitsubishi M32R). Appendix H is related, in that it covers architectures and compilers for VLIW ISAs.
As mentioned earlier, Appendices B and C are tutorials on basic caching and
pipelining concepts. Readers relatively new to caching should read Appendix B
before Chapter 2 and those new to pipelining should read Appendix C before
Chapter 3.
Appendix D, “Storage Systems,” has an expanded discussion of reliability and
availability, a tutorial on RAID with a description of RAID 6 schemes, and rarely
found failure statistics of real systems. It continues to provide an introduction to
queuing theory and I/O performance benchmarks. We evaluate the cost, performance, and reliability of a real cluster: the Internet Archive. The “Putting It All
Together” example is the NetApp FAS6000 filer.
Appendix E, by Thomas M. Conte, consolidates the embedded material in one
place.
Appendix F, on interconnection networks, has been revised by Timothy M.
Pinkston and José Duato. Appendix G, written originally by Krste Asanović, includes
a description of vector processors. We think these two appendices are some of the
best material we know of on each topic.
Appendix H describes VLIW and EPIC, the architecture of Itanium.
Appendix I describes parallel processing applications and coherence protocols
for larger-scale, shared-memory multiprocessing. Appendix J, by David Goldberg, describes computer arithmetic.
Appendix L collects the “Historical Perspective and References” from each
chapter into a single appendix. It attempts to give proper credit for the ideas in
each chapter and a sense of the history surrounding the inventions. We like to
think of this as presenting the human drama of computer design. It also supplies
references that the student of architecture may want to pursue. If you have time,
we recommend reading some of the classic papers in the field that are mentioned
in these sections. It is both enjoyable and educational to hear the ideas directly
from the creators. “Historical Perspective” was one of the most popular sections
of prior editions.
Navigating the Text
There is no single best order in which to approach these chapters and appendices,
except that all readers should start with Chapter 1. If you don’t want to read
everything, here are some suggested sequences:
■ Memory Hierarchy: Appendix B, Chapter 2, and Appendix D
■ Instruction-Level Parallelism: Appendix C, Chapter 3, and Appendix H
■ Data-Level Parallelism: Chapters 4 and 6, Appendix G
■ Thread-Level Parallelism: Chapter 5, Appendices F and I
■ Request-Level Parallelism: Chapter 6
■ ISA: Appendices A and K
Appendix E can be read at any time, but it might work best if read after the ISA
and cache sequences. Appendix J can be read whenever arithmetic moves you.
You should read the corresponding portion of Appendix L after you complete
each chapter.
Chapter Structure
The material we have selected has been stretched upon a consistent framework
that is followed in each chapter. We start by explaining the ideas of a chapter.
These ideas are followed by a “Crosscutting Issues” section, a feature that shows
how the ideas covered in one chapter interact with those given in other chapters.
This is followed by a “Putting It All Together” section that ties these ideas
together by showing how they are used in a real machine.
Next in the sequence is “Fallacies and Pitfalls,” which lets readers learn from
the mistakes of others. We show examples of common misunderstandings and
architectural traps that are difficult to avoid even when you know they are lying
in wait for you. The “Fallacies and Pitfalls” section is one of the most popular
sections of the book. Each chapter ends with a “Concluding Remarks” section.
Case Studies with Exercises
Each chapter ends with case studies and accompanying exercises. Authored by
experts in industry and academia, the case studies explore key chapter concepts
and verify understanding through increasingly challenging exercises. Instructors
should find the case studies sufficiently detailed and robust to allow them to create their own additional exercises.
Brackets for each exercise (<chapter.section>) indicate the text sections of primary relevance to completing the exercise. We hope this helps readers to avoid
exercises for which they haven’t read the corresponding section, in addition to
providing the source for review. Exercises are rated, to give the reader a sense of
the amount of time required to complete an exercise:
[10] Less than 5 minutes (to read and understand)
[15] 5–15 minutes for a full answer
[20] 15–20 minutes for a full answer
[25] 1 hour for a full written answer
[30] Short programming project: less than 1 full day of programming
[40] Significant programming project: 2 weeks of elapsed time
[Discussion] Topic for discussion with others
Solutions to the case studies and exercises are available for instructors who
register at textbooks.elsevier.com.
Supplemental Materials
A variety of resources are available online at http://booksite.mkp.com/9780123838728/,
including the following:
■ Reference appendices—some guest authored by subject experts—covering a range of advanced topics
■ Historical Perspectives material that explores the development of the key ideas presented in each of the chapters in the text
■ Instructor slides in PowerPoint
■ Figures from the book in PDF, EPS, and PPT formats
■ Links to related material on the Web
■ List of errata
New materials and links to other resources available on the Web will be
added on a regular basis.
Helping Improve This Book
Finally, it is possible to make money while reading this book. (Talk about cost-performance!) If you read the Acknowledgments that follow, you will see that we
went to great lengths to correct mistakes. Since a book goes through many printings, we have the opportunity to make even more corrections. If you uncover any
remaining resilient bugs, please contact the publisher by electronic mail
(ca5bugs@mkp.com).
We welcome general comments to the text and invite you to send them to a
separate email address at ca5comments@mkp.com.
Concluding Remarks
Once again this book is a true co-authorship, with each of us writing half the
chapters and an equal share of the appendices. We can’t imagine how long it
would have taken without someone else doing half the work, offering inspiration
when the task seemed hopeless, providing the key insight to explain a difficult
concept, supplying reviews over the weekend of chapters, and commiserating
when the weight of our other obligations made it hard to pick up the pen. (These
obligations have escalated exponentially with the number of editions, as the biographies attest.) Thus, once again we share equally the blame for what you are
about to read.
John Hennessy ■ David Patterson
Acknowledgments
Although this is only the fifth edition of this book, we have actually created ten
different versions of the text: three versions of the first edition (alpha, beta, and
final) and two versions of the second, third, and fourth editions (beta and final).
Along the way, we have received help from hundreds of reviewers and users.
Each of these people has helped make this book better. Thus, we have chosen to
list all of the people who have made contributions to some version of this book.
Contributors to the Fifth Edition
Like prior editions, this is a community effort that involves scores of volunteers.
Without their help, this edition would not be nearly as polished.
Reviewers
Jason D. Bakos, University of South Carolina; Diana Franklin, The University of
California, Santa Barbara; Norman P. Jouppi, HP Labs; Gregory Peterson, University of Tennessee; Parthasarathy Ranganathan, HP Labs; Mark Smotherman,
Clemson University; Gurindar Sohi, University of Wisconsin–Madison; Mateo
Valero, Universidad Politécnica de Cataluña; Sotirios G. Ziavras, New Jersey
Institute of Technology
Members of the University of California–Berkeley Par Lab and RAD Lab who
gave frequent reviews of Chapters 1, 4, and 6 and shaped the explanation of
GPUs and WSCs: Krste Asanović, Michael Armbrust, Scott Beamer, Sarah Bird,
Bryan Catanzaro, Jike Chong, Henry Cook, Derrick Coetzee, Randy Katz, Yunsup Lee, Leo Meyervich, Mark Murphy, Zhangxi Tan, Vasily Volkov, and Andrew
Waterman
Advisory Panel
Luiz André Barroso, Google Inc.; Robert P. Colwell, R&E Colwell & Assoc.
Inc.; Krisztian Flautner, VP of R&D at ARM Ltd.; Mary Jane Irwin, Penn State;
David Kirk, NVIDIA; Grant Martin, Chief Scientist, Tensilica; Gurindar Sohi,
University of Wisconsin–Madison; Mateo Valero, Universidad Politécnica de
Cataluña
Appendices
Krste Asanović, University of California, Berkeley (Appendix G); Thomas M.
Conte, North Carolina State University (Appendix E); José Duato, Universitat
Politècnica de València and Simula (Appendix F); David Goldberg, Xerox PARC
(Appendix J); Timothy M. Pinkston, University of Southern California (Appendix F)
José Flich of the Universidad Politécnica de Valencia provided significant contributions to the updating of Appendix F.
Case Studies with Exercises
Jason D. Bakos, University of South Carolina (Chapters 3 and 4); Diana Franklin,
University of California, Santa Barbara (Chapter 1 and Appendix C); Norman P.
Jouppi, HP Labs (Chapter 2); Naveen Muralimanohar, HP Labs (Chapter 2);
Gregory Peterson, University of Tennessee (Appendix A); Parthasarathy Ranganathan, HP Labs (Chapter 6); Amr Zaky, University of Santa Clara (Chapter 5 and
Appendix B)
Jichuan Chang, Kevin Lim, and Justin Meza assisted in the development and testing of the case studies and exercises for Chapter 6.
Additional Material
John Nickolls, Steve Keckler, and Michael Toksvig of NVIDIA (Chapter 4
NVIDIA GPUs); Victor Lee, Intel (Chapter 4 comparison of Core i7 and GPU);
John Shalf, LBNL (Chapter 4 recent vector architectures); Sam Williams, LBNL
(Roofline model for computers in Chapter 4); Steve Blackburn of Australian
National University and Kathryn McKinley of University of Texas at Austin
(Intel performance and power measurements in Chapter 5); Luiz Barroso, Urs
Hölzle, Jimmy Clidaris, Bob Felderman, and Chris Johnson of Google (the
Google WSC in Chapter 6); James Hamilton of Amazon Web Services (power
distribution and cost model in Chapter 6)
Jason D. Bakos of the University of South Carolina developed the new
lecture slides for this edition.
Finally, a special thanks once again to Mark Smotherman of Clemson University, who gave a final technical reading of our manuscript. Mark found numerous
bugs and ambiguities, and the book is much cleaner as a result.
This book could not have been published without a publisher, of course. We
wish to thank all the Morgan Kaufmann/Elsevier staff for their efforts and support.
For this fifth edition, we particularly want to thank our editors Nate McFadden
and Todd Green, who coordinated surveys, the advisory panel, development of the
case studies and exercises, focus groups, manuscript reviews, and the updating of
the appendices.
We must also thank our university staff, Margaret Rowland and Roxana
Infante, for countless express mailings, as well as for holding down the fort at
Stanford and Berkeley while we worked on the book.
Our final thanks go to our wives for their suffering through increasingly early
mornings of reading, thinking, and writing.
Contributors to Previous Editions
Reviewers
George Adams, Purdue University; Sarita Adve, University of Illinois at Urbana–
Champaign; Jim Archibald, Brigham Young University; Krste Asanović, Massachusetts Institute of Technology; Jean-Loup Baer, University of Washington; Paul
Barr, Northeastern University; Rajendra V. Boppana, University of Texas, San
Antonio; Mark Brehob, University of Michigan; Doug Burger, University of Texas,
Austin; John Burger, SGI; Michael Butler; Thomas Casavant; Rohit Chandra; Peter
Chen, University of Michigan; the classes at SUNY Stony Brook, Carnegie Mellon, Stanford, Clemson, and Wisconsin; Tim Coe, Vitesse Semiconductor; Robert
P. Colwell; David Cummings; Bill Dally; David Douglas; José Duato, Universitat
Politècnica de València and Simula; Anthony Duben, Southeast Missouri State
University; Susan Eggers, University of Washington; Joel Emer; Barry Fagin, Dartmouth; Joel Ferguson, University of California, Santa Cruz; Carl Feynman; David
Filo; Josh Fisher, Hewlett-Packard Laboratories; Rob Fowler, DIKU; Mark Franklin, Washington University (St. Louis); Kourosh Gharachorloo; Nikolas Gloy, Harvard University; David Goldberg, Xerox Palo Alto Research Center; Antonio
González, Intel and Universitat Politècnica de Catalunya; James Goodman, University of Wisconsin–Madison; Sudhanva Gurumurthi, University of Virginia; David
Harris, Harvey Mudd College; John Heinlein; Mark Heinrich, Stanford; Daniel
Helman, University of California, Santa Cruz; Mark D. Hill, University of Wisconsin–Madison; Martin Hopkins, IBM; Jerry Huck, Hewlett-Packard Laboratories;
Wen-mei Hwu, University of Illinois at Urbana–Champaign; Mary Jane Irwin,
Pennsylvania State University; Truman Joe; Norm Jouppi; David Kaeli, Northeastern University; Roger Kieckhafer, University of Nebraska; Lev G. Kirischian,
Ryerson University; Earl Killian; Allan Knies, Purdue University; Don Knuth; Jeff
Kuskin, Stanford; James R. Larus, Microsoft Research; Corinna Lee, University of
Toronto; Hank Levy; Kai Li, Princeton University; Lori Liebrock, University of
Alaska, Fairbanks; Mikko Lipasti, University of Wisconsin–Madison; Gyula A.
Mago, University of North Carolina, Chapel Hill; Bryan Martin; Norman Matloff;
David Meyer; William Michalson, Worcester Polytechnic Institute; James Mooney;
Trevor Mudge, University of Michigan; Ramadass Nagarajan, University of Texas
at Austin; David Nagle, Carnegie Mellon University; Todd Narter; Victor Nelson;
Vojin Oklobdzija, University of California, Berkeley; Kunle Olukotun, Stanford
University; Bob Owens, Pennsylvania State University; Greg Papadapoulous, Sun
Microsystems; Joseph Pfeiffer; Keshav Pingali, Cornell University; Timothy M.
Pinkston, University of Southern California; Bruno Preiss, University of Waterloo;
Steven Przybylski; Jim Quinlan; Andras Radics; Kishore Ramachandran, Georgia
Institute of Technology; Joseph Rameh, University of Texas, Austin; Anthony
Reeves, Cornell University; Richard Reid, Michigan State University; Steve Reinhardt, University of Michigan; David Rennels, University of California, Los Angeles; Arnold L. Rosenberg, University of Massachusetts, Amherst; Kaushik Roy,
Purdue University; Emilio Salgueiro, Unysis; Karthikeyan Sankaralingam, University of Texas at Austin; Peter Schnorf; Margo Seltzer; Behrooz Shirazi, Southern
Methodist University; Daniel Siewiorek, Carnegie Mellon University; J. P. Singh,
Princeton; Ashok Singhal; Jim Smith, University of Wisconsin–Madison; Mike
Smith, Harvard University; Mark Smotherman, Clemson University; Gurindar
Sohi, University of Wisconsin–Madison; Arun Somani, University of Washington;
Gene Tagliarin, Clemson University; Shyamkumar Thoziyoor, University of Notre
Dame; Evan Tick, University of Oregon; Akhilesh Tyagi, University of North Carolina, Chapel Hill; Dan Upton, University of Virginia; Mateo Valero, Universidad
Politécnica de Cataluña, Barcelona; Anujan Varma, University of California, Santa
Cruz; Thorsten von Eicken, Cornell University; Hank Walker, Texas A&M; Roy
Want, Xerox Palo Alto Research Center; David Weaver, Sun Microsystems;
Shlomo Weiss, Tel Aviv University; David Wells; Mike Westall, Clemson University; Maurice Wilkes; Eric Williams; Thomas Willis, Purdue University; Malcolm
Wing; Larry Wittie, SUNY Stony Brook; Ellen Witte Zegura, Georgia Institute of
Technology; Sotirios G. Ziavras, New Jersey Institute of Technology
Appendices
The vector appendix was revised by Krste Asanović of the Massachusetts Institute of Technology. The floating-point appendix was written originally by David
Goldberg of Xerox PARC.
Exercises
George Adams, Purdue University; Todd M. Bezenek, University of Wisconsin–
Madison (in remembrance of his grandmother Ethel Eshom); Susan Eggers; Anoop
Gupta; David Hayes; Mark Hill; Allan Knies; Ethan L. Miller, University of California, Santa Cruz; Parthasarathy Ranganathan, Compaq Western Research Laboratory; Brandon Schwartz, University of Wisconsin–Madison; Michael Scott; Dan
Siewiorek; Mike Smith; Mark Smotherman; Evan Tick; Thomas Willis
Case Studies with Exercises
Andrea C. Arpaci-Dusseau, University of Wisconsin–Madison; Remzi H. ArpaciDusseau, University of Wisconsin–Madison; Robert P. Colwell, R&E Colwell &
Assoc., Inc.; Diana Franklin, California Polytechnic State University, San Luis
Obispo; Wen-mei W. Hwu, University of Illinois at Urbana–Champaign; Norman
P. Jouppi, HP Labs; John W. Sias, University of Illinois at Urbana–Champaign;
David A. Wood, University of Wisconsin–Madison
Special Thanks
Duane Adams, Defense Advanced Research Projects Agency; Tom Adams; Sarita
Adve, University of Illinois at Urbana–Champaign; Anant Agarwal; Dave Albonesi,
University of Rochester; Mitch Alsup; Howard Alt; Dave Anderson; Peter Ashenden;
David Bailey; Bill Bandy, Defense Advanced Research Projects Agency; Luiz
Barroso, Compaq’s Western Research Lab; Andy Bechtolsheim; C. Gordon Bell;
Fred Berkowitz; John Best, IBM; Dileep Bhandarkar; Jeff Bier, BDTI; Mark Birman;
David Black; David Boggs; Jim Brady; Forrest Brewer; Aaron Brown, University of
California, Berkeley; E. Bugnion, Compaq’s Western Research Lab; Alper Buyuktosunoglu, University of Rochester; Mark Callaghan; Jason F. Cantin; Paul Carrick;
Chen-Chung Chang; Lei Chen, University of Rochester; Pete Chen; Nhan Chu;
Doug Clark, Princeton University; Bob Cmelik; John Crawford; Zarka Cvetanovic;
Mike Dahlin, University of Texas, Austin; Merrick Darley; the staff of the DEC
Western Research Laboratory; John DeRosa; Lloyd Dickman; J. Ding; Susan Eggers, University of Washington; Wael El-Essawy, University of Rochester; Patty
Enriquez, Mills; Milos Ercegovac; Robert Garner; K. Gharachorloo, Compaq’s
Western Research Lab; Garth Gibson; Ronald Greenberg; Ben Hao; John Henning,
Compaq; Mark Hill, University of Wisconsin–Madison; Danny Hillis; David
Hodges; Urs Hölzle, Google; David Hough; Ed Hudson; Chris Hughes, University
of Illinois at Urbana–Champaign; Mark Johnson; Lewis Jordan; Norm Jouppi; William Kahan; Randy Katz; Ed Kelly; Richard Kessler; Les Kohn; John Kowaleski,
Compaq Computer Corp; Dan Lambright; Gary Lauterbach, Sun Microsystems;
Corinna Lee; Ruby Lee; Don Lewine; Chao-Huang Lin; Paul Losleben, Defense
Advanced Research Projects Agency; Yung-Hsiang Lu; Bob Lucas, Defense
Advanced Research Projects Agency; Ken Lutz; Alan Mainwaring, Intel Berkeley
Research Labs; Al Marston; Rich Martin, Rutgers; John Mashey; Luke McDowell;
Sebastian Mirolo, Trimedia Corporation; Ravi Murthy; Biswadeep Nag; Lisa
Noordergraaf, Sun Microsystems; Bob Parker, Defense Advanced Research Projects Agency; Vern Paxson, Center for Internet Research; Lawrence Prince; Steven
Przybylski; Mark Pullen, Defense Advanced Research Projects Agency; Chris
Rowen; Margaret Rowland; Greg Semeraro, University of Rochester; Bill Shannon; Behrooz Shirazi; Robert Shomler; Jim Slager; Mark Smotherman, Clemson
University; the SMT research group at the University of Washington; Steve
Squires, Defense Advanced Research Projects Agency; Ajay Sreekanth; Darren
Staples; Charles Stapper; Jorge Stolfi; Peter Stoll; the students at Stanford and
Berkeley who endured our first attempts at creating this book; Bob Supnik; Steve
Swanson; Paul Taysom; Shreekant Thakkar; Alexander Thomasian, New Jersey
Institute of Technology; John Toole, Defense Advanced Research Projects Agency;
Kees A. Vissers, Trimedia Corporation; Willa Walker; David Weaver; Ric Wheeler,
EMC; Maurice Wilkes; Richard Zimmerman.
John Hennessy ■ David Patterson
1 Fundamentals of Quantitative Design and Analysis
I think it’s fair to say that personal computers have become the most
empowering tool we’ve ever created. They’re tools of communication,
they’re tools of creativity, and they can be shaped by their user.
—Bill Gates, February 24, 2004
1.1 Introduction
Computer technology has made incredible progress in the roughly 65 years since
the first general-purpose electronic computer was created. Today, less than $500
will purchase a mobile computer that has more performance, more main memory,
and more disk storage than a computer bought in 1985 for $1 million. This rapid
improvement has come both from advances in the technology used to build computers and from innovations in computer design.
Although technological improvements have been fairly steady, progress arising from better computer architectures has been much less consistent. During the
first 25 years of electronic computers, both forces made a major contribution,
delivering performance improvement of about 25% per year. The late 1970s saw
the emergence of the microprocessor. The ability of the microprocessor to ride
the improvements in integrated circuit technology led to a higher rate of performance improvement—roughly 35% growth per year.
This growth rate, combined with the cost advantages of a mass-produced
microprocessor, led to an increasing fraction of the computer business being
based on microprocessors. In addition, two significant changes in the computer
marketplace made it easier than ever before to succeed commercially with a new
architecture. First, the virtual elimination of assembly language programming
reduced the need for object-code compatibility. Second, the creation of standardized, vendor-independent operating systems, such as UNIX and its clone, Linux,
lowered the cost and risk of bringing out a new architecture.
These changes made it possible to develop successfully a new set of architectures with simpler instructions, called RISC (Reduced Instruction Set Computer)
architectures, in the early 1980s. The RISC-based machines focused the attention
of designers on two critical performance techniques, the exploitation of instruction-level parallelism (initially through pipelining and later through multiple instruction
issue) and the use of caches (initially in simple forms and later using more sophisticated organizations and optimizations).
The RISC-based computers raised the performance bar, forcing prior architectures to keep up or disappear. The Digital Equipment Vax could not, and so it
was replaced by a RISC architecture. Intel rose to the challenge, primarily by
translating 80×86 instructions into RISC-like instructions internally, allowing it
to adopt many of the innovations first pioneered in the RISC designs. As transistor counts soared in the late 1990s, the hardware overhead of translating the more
complex x86 architecture became negligible. In low-end applications, such as
cell phones, the cost in power and silicon area of the x86-translation overhead
helped lead to a RISC architecture, ARM, becoming dominant.
Figure 1.1 shows that the combination of architectural and organizational
enhancements led to 17 years of sustained growth in performance at an annual
rate of over 50%—a rate that is unprecedented in the computer industry.
The effect of this dramatic growth rate in the 20th century has been fourfold.
First, it has significantly enhanced the capability available to computer users. For
many applications, the highest-performance microprocessors of today outperform the supercomputer of less than 10 years ago.
[Figure 1.1 appears here: a semilog plot of performance relative to the VAX-11/780 (1 to 100,000) versus year, 1978–2012, tracing machines from the VAX-11/780 (5 MHz) up through the Intel Xeon 6 cores, 3.3 GHz at 24,129×, annotated with growth eras of 25%/year, 52%/year, and 22%/year.]
Figure 1.1 Growth in processor performance since the late 1970s. This chart plots performance relative to the VAX
11/780 as measured by the SPEC benchmarks (see Section 1.8). Prior to the mid-1980s, processor performance
growth was largely technology driven and averaged about 25% per year. The increase in growth to about 52% since
then is attributable to more advanced architectural and organizational ideas. By 2003, this growth led to a difference
in performance of about a factor of 25 versus if we had continued at the 25% rate. Performance for floating-point-oriented calculations has increased even faster. Since 2003, the limits of power and available instruction-level parallelism have slowed uniprocessor performance, to no more than 22% per year, or about 5 times slower than had we
continued at 52% per year. (The fastest SPEC performance since 2007 has had automatic parallelization turned on
with increasing number of cores per chip each year, so uniprocessor speed is harder to gauge. These results are limited to single-socket systems to reduce the impact of automatic parallelization.) Figure 1.11 on page 24 shows the
improvement in clock rates for these same three eras. Since SPEC has changed over the years, performance of newer
machines is estimated by a scaling factor that relates the performance for two different versions of SPEC (e.g.,
SPEC89, SPEC92, SPEC95, SPEC2000, and SPEC2006).
Second, this dramatic improvement in cost-performance leads to new classes
of computers. Personal computers and workstations emerged in the 1980s with
the availability of the microprocessor. The last decade saw the rise of smart cell
phones and tablet computers, which many people are using as their primary computing platforms instead of PCs. These mobile client devices are increasingly
using the Internet to access warehouses containing tens of thousands of servers,
which are being designed as if they were a single gigantic computer.
Third, continuing improvement of semiconductor manufacturing as predicted by Moore’s law has led to the dominance of microprocessor-based computers across the entire range of computer design. Minicomputers, which were
traditionally made from off-the-shelf logic or from gate arrays, were replaced by
servers made using microprocessors. Even mainframe computers and high-performance supercomputers are all collections of microprocessors.
The hardware innovations above led to a renaissance in computer design,
which emphasized both architectural innovation and efficient use of technology
improvements. This rate of growth has compounded so that by 2003, high-performance microprocessors were 7.5 times faster than what would have been
obtained by relying solely on technology, including improved circuit design; that
is, 52% per year versus 35% per year.
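The compounding behind these numbers is easy to verify. The short C program below is our own back-of-the-envelope sketch, not from the text: it compares roughly 17 years of 52%-per-year growth against the 35%-per-year technology-only baseline, and against the pre-1986 25%-per-year trend cited in the Figure 1.1 caption.

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Relative speedup of 52%/year architectural growth over a
       35%-per-year technology-only baseline, 1986..2003 (~17 years). */
    double ratio = pow(1.52 / 1.35, 17.0);
    printf("52%% vs. 35%% over 17 years: %.1fx\n", ratio);  /* ~7.5x */

    /* Gap versus the pre-1986 25%-per-year trend by 2003. */
    double gap = pow(1.52 / 1.25, 17.0);
    printf("52%% vs. 25%% over 17 years: %.0fx\n", gap);
    return 0;
}

Compiled with cc growth.c -lm, this reproduces the 7.5-times figure exactly and a gap on the order of the "factor of 25" from the caption (the exact value depends on which endpoint years one assumes).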
This hardware renaissance led to the fourth impact, which is on software
development. This 25,000-fold performance improvement since 1978 (see
Figure 1.1) allowed programmers today to trade performance for productivity. In
place of performance-oriented languages like C and C++, much more programming today is done in managed programming languages like Java and C#. Moreover, scripting languages like Python and Ruby, which are even more productive,
are gaining in popularity along with programming frameworks like Ruby on
Rails. To maintain productivity and try to close the performance gap, interpreters
with just-in-time compilers and trace-based compiling are replacing the traditional compiler and linker of the past. Software deployment is changing as well,
with Software as a Service (SaaS) used over the Internet replacing shrink-wrapped software that must be installed and run on a local computer.
The nature of applications is also changing. Speech, sound, images, and video
are becoming increasingly important, along with predictable response time that is
so critical to the user experience. An inspiring example is Google Goggles. This
application lets you hold up your cell phone to point its camera at an object, and
the image is sent wirelessly over the Internet to a warehouse-scale computer that
recognizes the object and tells you interesting information about it. It might
translate text on the object to another language; read the bar code on a book cover
to tell you if a book is available online and its price; or, if you pan the phone camera, tell you what businesses are nearby along with their websites, phone numbers, and directions.
Alas, Figure 1.1 also shows that this 17-year hardware renaissance is over.
Since 2003, single-processor performance improvement has dropped to less than
22% per year due to the twin hurdles of maximum power dissipation of air-cooled chips and the lack of more instruction-level parallelism to exploit efficiently. Indeed, in 2004 Intel canceled its high-performance uniprocessor projects
and joined others in declaring that the road to higher performance would be via
multiple processors per chip rather than via faster uniprocessors.
This milestone signals a historic switch from relying solely on instruction-level parallelism (ILP), the primary focus of the first three editions of this book,
to data-level parallelism (DLP) and thread-level parallelism (TLP), which were
featured in the fourth edition and expanded in this edition. This edition also adds
warehouse-scale computers and request-level parallelism (RLP). Whereas
the compiler and hardware conspire to exploit ILP implicitly without the programmer’s attention, DLP, TLP, and RLP are explicitly parallel, requiring the
restructuring of the application so that it can exploit explicit parallelism. In some
instances, this is easy; in many, it is a major new burden for programmers.
This text is about the architectural ideas and accompanying compiler
improvements that made the incredible growth rate possible in the last century,
the reasons for the dramatic change, and the challenges and initial promising
approaches to architectural ideas, compilers, and interpreters for the 21st century.
At the core is a quantitative approach to computer design and analysis that uses
empirical observations of programs, experimentation, and simulation as its tools.
It is this style and approach to computer design that is reflected in this text. The
purpose of this chapter is to lay the quantitative foundation on which the following chapters and appendices are based.
This book was written not only to explain this design style but also to stimulate you to contribute to this progress. We believe this approach will work for
explicitly parallel computers of the future just as it worked for the implicitly parallel computers of the past.
1.2 Classes of Computers
These changes have set the stage for a dramatic change in how we view computing, computing applications, and the computer markets in this new century. Not
since the creation of the personal computer have we seen such dramatic changes
in the way computers appear and in how they are used. These changes in computer use have led to five different computing markets, each characterized by different applications, requirements, and computing technologies. Figure 1.2
summarizes these mainstream classes of computing environments and their
important characteristics.
Feature | Personal mobile device (PMD) | Desktop | Server | Clusters/warehouse-scale computer | Embedded
Price of system | $100–$1000 | $300–$2500 | $5000–$10,000,000 | $100,000–$200,000,000 | $10–$100,000
Price of microprocessor | $10–$100 | $50–$500 | $200–$2000 | $50–$250 | $0.01–$100
Critical system design issues | Cost, energy, media performance, responsiveness | Price-performance, energy, graphics performance | Throughput, availability, scalability, energy | Price-performance, throughput, energy proportionality | Price, energy, application-specific performance
Figure 1.2 A summary of the five mainstream computing classes and their system characteristics. Sales in 2010
included about 1.8 billion PMDs (90% cell phones), 350 million desktop PCs, and 20 million servers. The total number
of embedded processors sold was nearly 19 billion. In total, 6.1 billion ARM-technology based chips were shipped in
2010. Note the wide range in system price for servers and embedded systems, which go from USB keys to network
routers. For servers, this range arises from the need for very large-scale multiprocessor systems for high-end
transaction processing.
Personal Mobile Device (PMD)
Personal mobile device (PMD) is the term we apply to a collection of wireless
devices with multimedia user interfaces such as cell phones, tablet computers,
and so on. Cost is a prime concern given the consumer price for the whole product is a few hundred dollars. Although the emphasis on energy efficiency is frequently driven by the use of batteries, the need to use less expensive packaging—
plastic versus ceramic—and the absence of a fan for cooling also limit total
power consumption. We examine the issue of energy and power in more detail in
Section 1.5. Applications on PMDs are often Web-based and media-oriented, like
the Google Goggles example above. Energy and size requirements lead to use of
Flash memory for storage (Chapter 2) instead of magnetic disks.
Responsiveness and predictability are key characteristics for media applications. A real-time performance requirement means a segment of the application
has an absolute maximum execution time. For example, in playing a video on a
PMD, the time to process each video frame is limited, since the processor must
accept and process the next frame shortly. In some applications, a more nuanced
requirement exists: the average time for a particular task is constrained as well
as the number of instances when some maximum time is exceeded. Such
approaches—sometimes called soft real-time—arise when it is possible to occasionally miss the time constraint on an event, as long as not too many are missed.
Real-time performance tends to be highly application dependent.
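To make the soft real-time distinction concrete, the following C sketch (ours, using a hypothetical 33 ms frame budget that approximates 30 frames-per-second video, and made-up frame times) tracks both constraints named above: the average processing time and the count of frames that exceed the maximum.

#include <stdio.h>

#define FRAME_BUDGET_MS 33.0   /* hypothetical per-frame deadline (~30 fps) */

int main(void) {
    /* Hypothetical measured per-frame processing times, in ms. */
    double frame_ms[] = {21.0, 25.0, 36.0, 22.0, 30.0, 41.0, 24.0, 27.0};
    int n = sizeof(frame_ms) / sizeof(frame_ms[0]);
    double total = 0.0;
    int misses = 0;

    for (int i = 0; i < n; i++) {
        total += frame_ms[i];
        if (frame_ms[i] > FRAME_BUDGET_MS)  /* deadline missed */
            misses++;
    }
    /* Soft real-time: constrain the average AND the miss rate. */
    printf("average = %.1f ms, misses = %d of %d frames\n",
           total / n, misses, n);
    return 0;
}

A hard real-time system would treat any miss as a failure; a soft real-time system like the video player accepts occasional misses as long as the miss count stays small.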
Other key characteristics in many PMD applications are the need to minimize
memory and the need to use energy efficiently. Energy efficiency is driven by
both battery power and heat dissipation. The memory can be a substantial portion
of the system cost, and it is important to optimize memory size in such cases. The
importance of memory size translates to an emphasis on code size, since data size
is dictated by the application.
Desktop Computing
The first, and probably still the largest market in dollar terms, is desktop computing. Desktop computing spans from low-end netbooks that sell for under $300 to
high-end, heavily configured workstations that may sell for $2500. Since 2008,
more than half of the desktop computers made each year have been battery-operated laptop computers.
Throughout this range in price and capability, the desktop market tends to be
driven to optimize price-performance. This combination of performance (measured primarily in terms of compute performance and graphics performance) and
price of a system is what matters most to customers in this market, and hence to
computer designers. As a result, the newest, highest-performance microprocessors and cost-reduced microprocessors often appear first in desktop systems (see
Section 1.6 for a discussion of the issues affecting the cost of computers).
Desktop computing also tends to be reasonably well characterized in terms of
applications and benchmarking, though the increasing use of Web-centric, interactive applications poses new challenges in performance evaluation.
Servers
As the shift to desktop computing occurred in the 1980s, the role of servers grew
to provide larger-scale and more reliable file and computing services. Such servers have become the backbone of large-scale enterprise computing, replacing the
traditional mainframe.
For servers, different characteristics are important. First, availability is critical. (We discuss availability in Section 1.7.) Consider the servers running ATM
machines for banks or airline reservation systems. Failure of such server systems
is far more catastrophic than failure of a single desktop, since these servers must
operate seven days a week, 24 hours a day. Figure 1.3 estimates revenue costs of
downtime for server applications.
A second key feature of server systems is scalability. Server systems often
grow in response to an increasing demand for the services they support or an
increase in functional requirements. Thus, the ability to scale up the computing
capacity, the memory, the storage, and the I/O bandwidth of a server is crucial.
Finally, servers are designed for efficient throughput. That is, the overall performance of the server—in terms of transactions per minute or Web pages served
per second—is what is crucial. Responsiveness to an individual request remains
important, but overall efficiency and cost-effectiveness, as determined by how
many requests can be handled in a unit time, are the key metrics for most servers.
We return to the issue of assessing performance for different types of computing
environments in Section 1.8.
Application | Cost of downtime per hour | Annual loss at 1% downtime (87.6 hrs/yr) | At 0.5% (43.8 hrs/yr) | At 0.1% (8.8 hrs/yr)
Brokerage operations | $6,450,000 | $565,000,000 | $283,000,000 | $56,500,000
Credit card authorization | $2,600,000 | $228,000,000 | $114,000,000 | $22,800,000
Package shipping services | $150,000 | $13,000,000 | $6,600,000 | $1,300,000
Home shopping channel | $113,000 | $9,900,000 | $4,900,000 | $1,000,000
Catalog sales center | $90,000 | $7,900,000 | $3,900,000 | $800,000
Airline reservation center | $89,000 | $7,900,000 | $3,900,000 | $800,000
Cellular service activation | $41,000 | $3,600,000 | $1,800,000 | $400,000
Online network fees | $25,000 | $2,200,000 | $1,100,000 | $200,000
ATM service fees | $14,000 | $1,200,000 | $600,000 | $100,000
Figure 1.3 Costs rounded to nearest $100,000 of an unavailable system are shown by analyzing the cost of
downtime (in terms of immediately lost revenue), assuming three different levels of availability and that downtime is distributed uniformly. These data are from Kembel [2000] and were collected and analyzed by Contingency
Planning Research.
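The annual-loss columns in Figure 1.3 are simply the hourly cost multiplied by the downtime hours implied by each availability level. The brief C sketch below (our illustration, not part of the original analysis) reproduces the brokerage-operations row.

#include <stdio.h>

int main(void) {
    double hours_per_year = 24.0 * 365.0;            /* 8760 hours */
    double downtime_pct[] = {0.01, 0.005, 0.001};    /* 1%, 0.5%, 0.1% */
    double cost_per_hour = 6450000.0;                /* brokerage operations */

    for (int i = 0; i < 3; i++) {
        double down_hours = hours_per_year * downtime_pct[i];
        printf("%.1f hrs/yr down -> $%.0fM lost\n",
               down_hours, cost_per_hour * down_hours / 1e6);
    }
    return 0;
}

Running it yields 87.6 hours and $565M, 43.8 hours and $283M, and 8.8 hours and $57M, matching the first row of the figure to rounding.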
Clusters/Warehouse-Scale Computers
The growth of Software as a Service (SaaS) for applications like search, social
networking, video sharing, multiplayer games, online shopping, and so on has led
to the growth of a class of computers called clusters. Clusters are collections of
desktop computers or servers connected by local area networks to act as a single
larger computer. Each node runs its own operating system, and nodes communicate using a networking protocol. The largest of the clusters are called
warehouse-scale computers (WSCs), in that they are designed so that tens of
thousands of servers can act as one. Chapter 6 describes this class of extremely large computers.
Price-performance and power are critical to WSCs since they are so large. As
Chapter 6 explains, 80% of the cost of a $90M warehouse is associated with
power and cooling of the computers inside. The computers themselves and networking gear cost another $70M and they must be replaced every few years.
When you are buying that much computing, you need to buy wisely, as a 10%
improvement in price-performance means a savings of $7M (10% of $70M).
WSCs are related to servers, in that availability is critical. For example, Amazon.com had $13 billion in sales in the fourth quarter of 2010. As there are about
2200 hours in a quarter, the average revenue per hour was almost $6M. During a
peak hour for Christmas shopping, the potential loss would be many times higher.
As Chapter 6 explains, the difference from servers is that WSCs use redundant
inexpensive components as the building blocks, relying on a software layer to
catch and isolate the many failures that will happen with computing at this scale.
Note that scalability for a WSC is handled by the local area network connecting
the computers and not by integrated computer hardware, as in the case of servers.
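Both back-of-the-envelope calculations above are equally simple to check. The C sketch below (ours, using only the numbers quoted in the text) computes the Amazon revenue-per-hour estimate and the $7M price-performance savings.

#include <stdio.h>

int main(void) {
    /* Revenue exposure per hour (Amazon.com, Q4 2010). */
    double quarterly_sales = 13e9;
    double hours_per_quarter = 2200.0;   /* ~13 weeks * 7 days * 24 hours */
    printf("average revenue: $%.1fM/hour\n",
           quarterly_sales / hours_per_quarter / 1e6);   /* ~$5.9M */

    /* Savings from a 10% improvement in price-performance on the
       $70M spent on computers and networking gear. */
    printf("10%% of $70M = $%.0fM\n", 0.10 * 70e6 / 1e6);
    return 0;
}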
Supercomputers are related to WSCs in that they are equally expensive, costing hundreds of millions of dollars, but supercomputers differ by emphasizing
floating-point performance and by running large, communication-intensive batch
programs that can run for weeks at a time. This tight coupling leads to use of
much faster internal networks. In contrast, WSCs emphasize interactive applications, large-scale storage, dependability, and high Internet bandwidth.
Embedded Computers
Embedded computers are found in everyday machines: microwaves, washing
machines, most printers, most networking switches, and all cars contain simple
embedded microprocessors.
The processors in a PMD are often considered embedded computers, but we
are keeping them as a separate category because PMDs are platforms that can run
externally developed software and they share many of the characteristics of desktop computers. Other embedded devices are more limited in hardware and software sophistication. We use the ability to run third-party software as the dividing
line between non-embedded and embedded computers.
Embedded computers have the widest spread of processing power and cost.
They include 8-bit and 16-bit processors that may cost less than a dime, 32-bit
microprocessors that execute 100 million instructions per second and cost under
$5, and high-end processors for network switches that cost $100 and can execute
billions of instructions per second. Although the range of computing power in the
embedded computing market is very large, price is a key factor in the design of
computers for this space. Performance requirements do exist, of course, but the
primary goal is often meeting the performance need at a minimum price, rather
than achieving higher performance at a higher price.
Most of this book applies to the design, use, and performance of embedded
processors, whether they are off-the-shelf microprocessors or microprocessor
cores that will be assembled with other special-purpose hardware. Indeed, the
third edition of this book included examples from embedded computing to illustrate the ideas in every chapter.
Alas, most readers found these examples unsatisfactory, as the data that drive
the quantitative design and evaluation of other classes of computers have not yet
been extended well to embedded computing (see the challenges with EEMBC,
for example, in Section 1.8). Hence, we are left for now with qualitative descriptions, which do not fit well with the rest of the book. As a result, in this and the
prior edition we consolidated the embedded material into Appendix E. We
believe a separate appendix improves the flow of ideas in the text while allowing
readers to see how the differing requirements affect embedded computing.
Classes of Parallelism and Parallel Architectures
Parallelism at multiple levels is now the driving force of computer design across all of these classes of computers, with energy and cost being the primary constraints.
There are basically two kinds of parallelism in applications:
1. Data-Level Parallelism (DLP) arises because there are many data items that
can be operated on at the same time.
2. Task-Level Parallelism (TLP) arises because tasks of work are created that
can operate independently and largely in parallel.
Computer hardware in turn can exploit these two kinds of application parallelism
in four major ways:
1. Instruction-Level Parallelism exploits data-level parallelism at modest levels
with compiler help using ideas like pipelining and at medium levels using
ideas like speculative execution.
2. Vector Architectures and Graphic Processor Units (GPUs) exploit data-level
parallelism by applying a single instruction to a collection of data in parallel.
3. Thread-Level Parallelism exploits either data-level parallelism or task-level
parallelism in a tightly coupled hardware model that allows for interaction
among parallel threads.
4. Request-Level Parallelism exploits parallelism among largely decoupled
tasks specified by the programmer or the operating system.
These four ways for hardware to support the data-level parallelism and
task-level parallelism go back 50 years. When Michael Flynn [1966] studied
the parallel computing efforts in the 1960s, he found a simple classification
whose abbreviations we still use today. He looked at the parallelism in the
instruction and data streams called for by the instructions at the most constrained component of the multiprocessor, and placed all computers into one of
four categories:
1. Single instruction stream, single data stream (SISD)—This category is the
uniprocessor. The programmer thinks of it as the standard sequential computer, but it can exploit instruction-level parallelism. Chapter 3 covers SISD
architectures that use ILP techniques such as superscalar and speculative execution.
2. Single instruction stream, multiple data streams (SIMD)—The same
instruction is executed by multiple processors using different data streams.
SIMD computers exploit data-level parallelism by applying the same
operations to multiple items of data in parallel. Each processor has its own
data memory (hence the MD of SIMD), but there is a single instruction
memory and control processor, which fetches and dispatches instructions.
Chapter 4 covers DLP and three different architectures that exploit it:
vector architectures, multimedia extensions to standard instruction sets,
and GPUs.
3. Multiple instruction streams, single data stream (MISD)—No commercial
multiprocessor of this type has been built to date, but it rounds out this simple
classification.
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data, and it targets
task-level parallelism. In general, MIMD is more flexible than SIMD and
thus more generally applicable, but it is inherently more expensive than
SIMD. For example, MIMD computers can also exploit data-level parallelism, although the overhead is likely to be higher than would be seen in an
SIMD computer. This overhead means that grain size must be sufficiently
large to exploit the parallelism efficiently. Chapter 5 covers tightly coupled
MIMD architectures, which exploit thread-level parallelism since multiple
cooperating threads operate in parallel. Chapter 6 covers loosely coupled
MIMD architectures—specifically, clusters and warehouse-scale computers—that exploit request-level parallelism, where many independent tasks
can proceed in parallel naturally with little need for communication or
synchronization.
This taxonomy is a coarse model, as many parallel processors are hybrids of the
SISD, SIMD, and MIMD classes. Nonetheless, it is useful to put a framework on
the design space for the computers we will see in this book.
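As a rough illustration of what the taxonomy classifies (our sketch, not the authors' example), the C fragment below shows the two kinds of application parallelism in source form: a loop whose iterations are independent, which a SIMD machine could execute as one instruction stream over multiple data streams, and two independent tasks that an MIMD machine could run as separate instruction streams.

#include <stdio.h>

#define N 8

/* Data-level parallelism: the same operation applied to many data
   items, i.e., one instruction stream, multiple data streams (SIMD). */
static void scale(double *x, double a, int n) {
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];        /* every iteration is independent */
}

/* Task-level parallelism: independent units of work, i.e., multiple
   instruction streams, multiple data streams (MIMD). */
static double sum(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    scale(a, 2.0, N);           /* a DLP candidate */
    /* sum(a, N) and sum(b, N) share no data, so an MIMD machine could
       run them as two threads in parallel. */
    printf("%f %f\n", sum(a, N), sum(b, N));
    return 0;
}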
1.3 Defining Computer Architecture
The task the computer designer faces is a complex one: Determine what
attributes are important for a new computer, then design a computer to maximize
performance and energy efficiency while staying within cost, power, and availability constraints. This task has many aspects, including instruction set design,
functional organization, logic design, and implementation. The implementation
may encompass integrated circuit design, packaging, power, and cooling. Optimizing the design requires familiarity with a very wide range of technologies,
from compilers and operating systems to logic design and packaging.
Several years ago, the term computer architecture often referred only to
instruction set design. Other aspects of computer design were called implementation, often insinuating that implementation is uninteresting or less challenging.
We believe this view is incorrect. The architect’s or designer’s job is much
more than instruction set design, and the technical hurdles in the other aspects of
the project are likely more challenging than those encountered in instruction set
design. We’ll quickly review instruction set architecture before describing the
larger challenges for the computer architect.
Instruction Set Architecture: The Myopic View of Computer Architecture
We use the term instruction set architecture (ISA) to refer to the actual programmer-visible instruction set in this book. The ISA serves as the boundary between the
software and hardware. This quick review of ISA will use examples from 80×86,
ARM, and MIPS to illustrate the seven dimensions of an ISA. Appendices A and
K give more details on the three ISAs.
1. Class of ISA—Nearly all ISAs today are classified as general-purpose register
architectures, where the operands are either registers or memory locations.
The 80×86 has 16 general-purpose registers and 16 that can hold floating-point data, while MIPS has 32 general-purpose and 32 floating-point registers
(see Figure 1.4). The two popular versions of this class are register-memory
ISAs, such as the 80×86, which can access memory as part of many instructions, and load-store ISAs, such as ARM and MIPS, which can access memory only with load or store instructions. All recent ISAs are load-store.
2. Memory addressing—Virtually all desktop and server computers, including
the 80×86, ARM, and MIPS, use byte addressing to access memory operands.
Some architectures, like ARM and MIPS, require that objects must be
aligned. An access to an object of size s bytes at byte address A is aligned if A mod s = 0. (See Figure A.5 on page A-8; a small C sketch of this test appears after Figure 1.6.) The 80×86 does not require alignment, but accesses are generally faster if operands are aligned.
3. Addressing modes—In addition to specifying registers and constant operands, addressing modes specify the address of a memory object. MIPS addressing modes are Register, Immediate (for constants), and Displacement, where a constant offset is added to a register to form the memory address. The 80×86 supports those three plus three variations of displacement: no register (absolute), two registers (based indexed with displacement), and two registers where one register is multiplied by the size of the operand in bytes (based with scaled index and displacement). It has more like the last three, minus the displacement field, plus register indirect, indexed, and based with scaled index. ARM has the three MIPS addressing modes plus PC-relative addressing, the sum of two registers, and the sum of two registers where one register is multiplied by the size of the operand in bytes. It also has autoincrement and autodecrement addressing, where the calculated address replaces the contents of one of the registers used in forming the address. (A short C sketch of this address arithmetic follows Figure 1.4.)

Name | Number | Use | Preserved across a call?
$zero | 0 | The constant value 0 | N.A.
$at | 1 | Assembler temporary | No
$v0–$v1 | 2–3 | Values for function results and expression evaluation | No
$a0–$a3 | 4–7 | Arguments | No
$t0–$t7 | 8–15 | Temporaries | No
$s0–$s7 | 16–23 | Saved temporaries | Yes
$t8–$t9 | 24–25 | Temporaries | No
$k0–$k1 | 26–27 | Reserved for OS kernel | No
$gp | 28 | Global pointer | Yes
$sp | 29 | Stack pointer | Yes
$fp | 30 | Frame pointer | Yes
$ra | 31 | Return address | Yes

Figure 1.4 MIPS registers and usage conventions. In addition to the 32 general-purpose registers (R0–R31), MIPS has 32 floating-point registers (F0–F31) that can hold either a 32-bit single-precision number or a 64-bit double-precision number.
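The address arithmetic behind these modes is just adds and a multiply. The C sketch below (ours; the register contents and displacement are made up for illustration) computes the effective address for MIPS displacement addressing and for a scaled-index mode like those of the 80×86 and ARM.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t base  = 0x10008000;   /* contents of a base register */
    uint64_t index = 6;            /* contents of an index register */
    int64_t  disp  = 16;           /* constant displacement */
    uint64_t size  = 8;            /* operand size in bytes */

    /* MIPS Displacement mode: register + constant offset. */
    uint64_t ea_disp = base + disp;

    /* Scaled index (80x86, ARM): one register is multiplied by the
       operand size before being added. */
    uint64_t ea_scaled = base + index * size + disp;

    printf("displacement: 0x%llx\n", (unsigned long long)ea_disp);
    printf("scaled index: 0x%llx\n", (unsigned long long)ea_scaled);
    return 0;
}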
4. Types and sizes of operands—Like most ISAs, 80×86, ARM, and MIPS
support operand sizes of 8-bit (ASCII character), 16-bit (Unicode character
or half word), 32-bit (integer or word), 64-bit (double word or long integer), and IEEE 754 floating point in 32-bit (single precision) and 64-bit
(double precision). The 80×86 also supports 80-bit floating point (extended
double precision).
5. Operations—The general categories of operations are data transfer, arithmetic logical, control (discussed next), and floating point. MIPS is a simple and
easy-to-pipeline instruction set architecture, and it is representative of the RISC
architectures being used in 2011. Figure 1.5 summarizes the MIPS ISA. The
80×86 has a much richer and larger set of operations (see Appendix K).
Instruction type/opcode | Instruction meaning
Data transfers | Move data between registers and memory, or between the integer and FP or special registers; only memory address mode is 16-bit displacement + contents of a GPR
LB, LBU, SB | Load byte, load byte unsigned, store byte (to/from integer registers)
LH, LHU, SH | Load half word, load half word unsigned, store half word (to/from integer registers)
LW, LWU, SW | Load word, load word unsigned, store word (to/from integer registers)
LD, SD | Load double word, store double word (to/from integer registers)
L.S, L.D, S.S, S.D | Load SP float, load DP float, store SP float, store DP float
MFC0, MTC0 | Copy from/to GPR to/from a special register
MOV.S, MOV.D | Copy one SP or DP FP register to another FP register
MFC1, MTC1 | Copy 32 bits to/from FP registers from/to integer registers
Arithmetic/logical | Operations on integer or logical data in GPRs; signed arithmetic trap on overflow
DADD, DADDI, DADDU, DADDIU | Add, add immediate (all immediates are 16 bits); signed and unsigned
DSUB, DSUBU | Subtract, signed and unsigned
DMUL, DMULU, DDIV, DDIVU, MADD | Multiply and divide, signed and unsigned; multiply-add; all operations take and yield 64-bit values
AND, ANDI | And, and immediate
OR, ORI, XOR, XORI | Or, or immediate, exclusive or, exclusive or immediate
LUI | Load upper immediate; loads bits 32 to 47 of register with immediate, then sign-extends
DSLL, DSRL, DSRA, DSLLV, DSRLV, DSRAV | Shifts: both immediate (DS__) and variable form (DS__V); shifts are shift left logical, right logical, right arithmetic
SLT, SLTI, SLTU, SLTIU | Set less than, set less than immediate, signed and unsigned
Control | Conditional branches and jumps; PC-relative or through register
BEQZ, BNEZ | Branch GPRs equal/not equal to zero; 16-bit offset from PC + 4
BEQ, BNE | Branch GPR equal/not equal; 16-bit offset from PC + 4
BC1T, BC1F | Test comparison bit in the FP status register and branch; 16-bit offset from PC + 4
MOVN, MOVZ | Copy GPR to another GPR if third GPR is negative, zero
J, JR | Jumps: 26-bit offset from PC + 4 (J) or target in register (JR)
JAL, JALR | Jump and link: save PC + 4 in R31, target is PC-relative (JAL) or a register (JALR)
TRAP | Transfer to operating system at a vectored address
ERET | Return to user code from an exception; restore user mode
Floating point | FP operations on DP and SP formats
ADD.D, ADD.S, ADD.PS | Add DP, SP numbers, and pairs of SP numbers
SUB.D, SUB.S, SUB.PS | Subtract DP, SP numbers, and pairs of SP numbers
MUL.D, MUL.S, MUL.PS | Multiply DP, SP floating point, and pairs of SP numbers
MADD.D, MADD.S, MADD.PS | Multiply-add DP, SP numbers, and pairs of SP numbers
DIV.D, DIV.S, DIV.PS | Divide DP, SP floating point, and pairs of SP numbers
CVT._._ | Convert instructions: CVT.x.y converts from type x to type y, where x and y are L (64-bit integer), W (32-bit integer), D (DP), or S (SP). Both operands are FPRs.
C.__.D, C.__.S | DP and SP compares: "__" = LT, GT, LE, GE, EQ, NE; sets bit in FP status register
Figure 1.5 Subset of the instructions in MIPS64. SP = single precision; DP = double precision. Appendix A gives
much more detail on MIPS64. For data, the most significant bit number is 0; least is 63.
6. Control flow instructions—Virtually all ISAs, including these three, support
conditional branches, unconditional jumps, procedure calls, and returns. All
three use PC-relative addressing, where the branch address is specified by an
address field that is added to the PC. There are some small differences. MIPS
conditional branches (BEQ, BNE, etc.) test the contents of registers, while the
80×86 and ARM branches test condition code bits set as side effects of arithmetic/logic operations. The ARM and MIPS procedure call places the return
address in a register, while the 80×86 call (CALLF) places the return address
on a stack in memory.
7. Encoding an ISA—There are two basic choices on encoding: fixed length and
variable length. All ARM and MIPS instructions are 32 bits long, which simplifies instruction decoding. Figure 1.6 shows the MIPS instruction formats.
The 80×86 encoding is variable length, ranging from 1 to 18 bytes. Variable-length instructions can take less space than fixed-length instructions, so a
program compiled for the 80×86 is usually smaller than the same program
compiled for MIPS. Note that choices mentioned above will affect how the
instructions are encoded into a binary representation. For example, the number of registers and the number of addressing modes both have a significant
impact on the size of instructions, as the register field and addressing mode
field can appear many times in a single instruction. (Note that ARM and
MIPS later offered extensions to offer 16-bit length instructions so as to
reduce program size, called Thumb or Thumb-2 and MIPS16, respectively.)
Basic instruction formats:
R:  opcode (31–26) | rs (25–21) | rt (20–16) | rd (15–11) | shamt (10–6) | funct (5–0)
I:  opcode (31–26) | rs (25–21) | rt (20–16) | immediate (15–0)
J:  opcode (31–26) | address (25–0)

Floating-point instruction formats:
FR: opcode (31–26) | fmt (25–21) | ft (20–16) | fs (15–11) | fd (10–6) | funct (5–0)
FI: opcode (31–26) | fmt (25–21) | ft (20–16) | immediate (15–0)
Figure 1.6 MIPS64 instruction set architecture formats. All instructions are 32 bits
long. The R format is for integer register-to-register operations, such as DADDU, DSUBU,
and so on. The I format is for data transfers, branches, and immediate instructions, such
as LD, SD, BEQZ, and DADDIs. The J format is for jumps, the FR format for floating-point
operations, and the FI format for floating-point branches.
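Two of the points above reduce to a few lines of C. The sketch below (ours, with a hypothetical sample instruction word) implements the alignment test from dimension 2, A mod s = 0, and extracts the fixed-position fields of an R-format instruction as laid out in Figure 1.6.

#include <stdio.h>
#include <stdint.h>

/* Dimension 2: an access of size s at byte address A is aligned
   if A mod s == 0. */
static int is_aligned(uint64_t addr, uint64_t size) {
    return addr % size == 0;
}

/* Figure 1.6, R format: opcode | rs | rt | rd | shamt | funct. */
static void decode_r_format(uint32_t inst) {
    uint32_t opcode = (inst >> 26) & 0x3f;  /* bits 31-26 */
    uint32_t rs     = (inst >> 21) & 0x1f;  /* bits 25-21 */
    uint32_t rt     = (inst >> 16) & 0x1f;  /* bits 20-16 */
    uint32_t rd     = (inst >> 11) & 0x1f;  /* bits 15-11 */
    uint32_t shamt  = (inst >>  6) & 0x1f;  /* bits 10-6  */
    uint32_t funct  =  inst        & 0x3f;  /* bits 5-0   */
    printf("op=%u rs=%u rt=%u rd=%u shamt=%u funct=%u\n",
           opcode, rs, rt, rd, shamt, funct);
}

int main(void) {
    printf("0x1004 aligned for 8 bytes? %d\n", is_aligned(0x1004, 8)); /* 0 */
    printf("0x1008 aligned for 8 bytes? %d\n", is_aligned(0x1008, 8)); /* 1 */
    decode_r_format(0x012A4020u);  /* a sample 32-bit instruction word */
    return 0;
}

The fixed field positions are what make decoding simple in hardware: the same shift-and-mask network serves every 32-bit instruction, which is exactly the decoding advantage the text attributes to fixed-length encodings.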
The other challenges facing the computer architect beyond ISA design are
particularly acute at the present, when the differences among instruction sets are
small and when there are distinct application areas. Therefore, starting with the
last edition, the bulk of instruction set material beyond this quick review is found
in the appendices (see Appendices A and K).
We use a subset of MIPS64 as the example ISA in this book because it is both the dominant ISA for networking and an elegant example of the RISC architectures mentioned earlier, of which ARM (Advanced RISC Machine) is the most popular example. ARM processors were in 6.1 billion chips shipped in 2010, or roughly 20 times as many chips as shipped with 80×86 processors.
Genuine Computer Architecture: Designing the Organization and Hardware to Meet Goals and Functional Requirements
The implementation of a computer has two components: organization and
hardware. The term organization includes the high-level aspects of a computer’s
design, such as the memory system, the memory interconnect, and the design of
the internal processor or CPU (central processing unit—where arithmetic, logic,
branching, and data transfer are implemented). The term microarchitecture is
also used instead of organization. For example, two processors with the same
instruction set architectures but different organizations are the AMD Opteron and
the Intel Core i7. Both processors implement the x86 instruction set, but they
have very different pipeline and cache organizations.
The switch to multiple processors per microprocessor led to the term core to
also be used for processor. Instead of saying multiprocessor microprocessor, the
term multicore has caught on. Given that virtually all chips have multiple processors, the term central processing unit, or CPU, is fading in popularity.
Hardware refers to the specifics of a computer, including the detailed logic
design and the packaging technology of the computer. Often a line of computers
contains computers with identical instruction set architectures and nearly identical
organizations, but they differ in the detailed hardware implementation. For example, the Intel Core i7 (see Chapter 3) and the Intel Xeon 7560 (see Chapter 5) are
nearly identical but offer different clock rates and different memory systems,
making the Xeon 7560 more effective for server computers.
In this book, the word architecture covers all three aspects of computer
design—instruction set architecture, organization or microarchitecture, and
hardware.
Computer architects must design a computer to meet functional requirements as well as price, power, performance, and availability goals. Figure 1.7 summarizes requirements to consider in designing a new computer. Often, architects also must determine what the functional requirements are, which can be a major task. The requirements may be specific features inspired by the market. Application software often drives the choice of certain functional requirements by determining how the computer will be used. If a large body of software exists for a certain instruction set architecture, the architect may decide that a new computer should implement an existing instruction set. The presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the computer competitive in that market. Later chapters examine many of these requirements and features in depth.

Functional requirements | Typical features required or supported
Application area | Target of computer
Personal mobile device | Real-time performance for a range of tasks, including interactive performance for graphics, video, and audio; energy efficiency (Ch. 2, 3, 4, 5; App. A)
General-purpose desktop | Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio (Ch. 2, 3, 4, 5; App. A)
Servers | Support for databases and transaction processing; enhancements for reliability and availability; support for scalability (Ch. 2, 5; App. A, D, F)
Clusters/warehouse-scale computers | Throughput performance for many independent tasks; error correction for memory; energy proportionality (Ch. 2, 6; App. F)
Embedded computing | Often requires special support for graphics or video (or other application-specific extension); power limitations and power control may be required; real-time constraints (Ch. 2, 3, 5; App. A, E)
Level of software compatibility | Determines amount of existing software for computer
At programming language | Most flexible for designer; need new compiler (Ch. 3, 5; App. A)
Object code or binary compatible | Instruction set architecture is completely defined—little flexibility—but no investment needed in software or porting programs (App. A)
Operating system requirements | Necessary features to support chosen OS (Ch. 2; App. B)
Size of address space | Very important feature (Ch. 2); may limit applications
Memory management | Required for modern OS; may be paged or segmented (Ch. 2)
Protection | Different OS and application needs: page vs. segment; virtual machines (Ch. 2)
Standards | Certain standards may be required by marketplace
Floating point | Format and arithmetic: IEEE 754 standard (App. J), special arithmetic for graphics or signal processing
I/O interfaces | For I/O devices: Serial ATA, Serial Attached SCSI, PCI Express (App. D, F)
Operating systems | UNIX, Windows, Linux, CISCO IOS
Networks | Support required for different networks: Ethernet, Infiniband (App. F)
Programming languages | Languages (ANSI C, C++, Java, Fortran) affect instruction set (App. A)

Figure 1.7 Summary of some of the most important functional requirements an architect faces. The left-hand column describes the class of requirement, while the right-hand column gives specific examples. The right-hand column also contains references to chapters and appendices that deal with the specific issues.
Architects must also be aware of important trends in both the technology and
the use of computers, as such trends affect not only the future cost but also the
longevity of an architecture.
1.4 Trends in Technology
If an instruction set architecture is to be successful, it must be designed to survive
rapid changes in computer technology. After all, a successful new instruction set
architecture may last decades—for example, the core of the IBM mainframe has
been in use for nearly 50 years. An architect must plan for technology changes
that can increase the lifetime of a successful computer.
To plan for the evolution of a computer, the designer must be aware of rapid
changes in implementation technology. Five implementation technologies, which
change at a dramatic pace, are critical to modern implementations:
■ Integrated circuit logic technology—Transistor density increases by about 35% per year, quadrupling somewhat over four years. Increases in die size are less predictable and slower, ranging from 10% to 20% per year. The combined effect is a growth rate in transistor count on a chip of about 40% to 55% per year, or doubling every 18 to 24 months. This trend is popularly known as Moore's law. Device speed scales more slowly, as we discuss below.
■ Semiconductor DRAM (dynamic random-access memory)—Now that most DRAM chips …