Welcome!

Microservices Expo Authors: Gordon Haff, Pat Romanski, Liz McMillan, Elizabeth White, Carmen Gonzalez

Related Topics: @CloudExpo, Java IoT, Microservices Expo, Containers Expo Blog, @BigDataExpo, SDN Journal

@CloudExpo: Book Excerpt

Book Excerpt: Systems Performance: Enterprise and the Cloud | Part 1

CPUs drive all software and are often the first target for systems performance analysis

"This excerpt is from the book, "Systems Performance: Enterprise and the Cloud", authored by Brendan Gregg, published by Prentice Hall Professional, Oct. 2013, ISBN 9780133390094, Copyright © 2014 Pearson Education, Inc. For more info, please visit the publisher site:

CPUs drive all software and are often the first target for systems performance analysis. Modern systems typically have many CPUs, which are shared among all running software by the kernel scheduler. When there is more demand for CPU resources than there are resources available, process threads (or tasks) will queue, waiting their turn. Waiting can add significant latency during the runtime of applications, degrading performance.

The usage of the CPUs can be examined in detail to look for performance improvements, including eliminating unnecessary work. At a high level, CPU usage by process, thread, or task can be examined. At a lower level, the code path within applications and the kernel can be profiled and studied. At the lowest level, CPU instruction execution and cycle behavior can be studied.

This chapter consists of five parts:

  • Background introduces CPU-related terminology, basic models of CPUs, and key CPU performance concepts.
  • Architecture introduces processor and kernel scheduler architecture.
  • Methodology describes performance analysis methodologies, both observa- tional and experimental.
  • Analysis describes CPU performance analysis tools on Linux- and Solaris- based systems, including profiling, tracing, and visualizations.
  • Tuning includes examples of tunable parameters.

The first three sections provide the basis for CPU analysis, and the last two show its practical application to Linux- and Solaris-based systems.

The effects of memory I/O on CPU performance are covered, including CPU cycles stalled on memory and the performance of CPU caches. Chapter 7, Memory, continues the discussion of memory I/O, including MMU, NUMA/UMA, system interconnects, and memory busses.

Terminology
For reference, CPU-related terminology used in this chapter includes the following:

  • Processor: the physical chip that plugs into a socket on the system or pro- cessor board and contains one or more CPUs implemented as cores or hard- ware threads.
  • Core: an independent CPU instance on a multicore processor. The use of cores is a way to scale processors, called chip-level multiprocessing (CMP).
  • Hardware thread: a CPU architecture that supports executing multiple threads in parallel on a single core (including Intel's Hyper-Threading Tech- nology), where each thread is an independent CPU instance. One name for this scaling approach is multithreading.
  • CPU instruction: a single CPU operation, from its instruction set. There are instructions for arithmetic operations, memory I/O, and control logic.
  • Logical CPU: also called a virtual processor,1 an operating system CPU instance (a schedulable CPU entity). This may be implemented by the processor as a hardware thread (in which case it may also be called a virtual core), a core, or a single-core processor.
  • Scheduler: the kernel subsystem that assigns threads to run on CPUs.
  • Run queue: a queue of runnable threads that are waiting to be serviced by
  • CPUs. For Solaris, it is often called a dispatcher queue.

Other terms are introduced throughout this chapter. The Glossary includes basic terminology for reference, including CPU, CPU cycle, and stack. Also see the terminology sections in Chapters 2 and 3.

Models
The following simple models illustrate some basic principles of CPUs and CPU per- formance. Section 6.4, Architecture, digs much deeper and includes implementation- specific details.

CPU Architecture
Figure 1 shows an example CPU architecture, for a single processor with four cores and eight hardware threads in total. The physical architecture is pictured, along with how it is seen by the operating system.

Figure 1: CPU architecture

Each hardware thread is addressable as a logical CPU, so this processor appears as eight CPUs. The operating system may have some additional knowledge of topology, such as which CPUs are on the same core, to improve its scheduling decisions.

CPU Memory Caches
Processors provide various hardware caches for improving memory I/O perfor- mance. Figure 2 shows the relationship of cache sizes, which become smaller and faster (a trade-off) the closer they are to the CPU.

The caches that are present, and whether they are on the processor (integrated) or external to the processor, depend on the processor type. Earlier processors pro- vided fewer levels of integrated cache.

Figure 2: CPU cache sizes

CPU Run Queues
Figure 3 shows a CPU run queue, which is managed by the kernel scheduler.

Figure 3: CPU run queue

The thread states shown in the figure, ready to run and on-CPU, are covered in Figure 3.7 in Chapter 3, Operating Systems.

The number of software threads that are queued and ready to run is an impor- tant performance metric indicating CPU saturation. In this figure (at this instant) there are four, with an additional thread running on-CPU. The time spent waiting on a CPU run queue is sometimes called run-queue latency or dispatcher-queue latency. In this book, the term scheduler latency is used instead, as it is appropri- ate for all dispatcher types, including those that do not use queues (see the discus- sion of CFS in Section 6.4.2, Software).

For multiprocessor systems, the kernel typically provides a run queue for each CPU and aims to keep threads on the same run queue. This means that threads are more likely to keep running on the same CPUs, where the CPU caches have cached their data. (These caches are described as having cache warmth, and the approach to favor CPUs is called CPU affinity.) On NUMA systems, memory locality may also be improved, which also improves performance (this is described in Chapter 7, Memory).

It also avoids the cost of thread synchronization (mutex locks) for queue operations, which would hurt scalability if the run queue was global and shared among all CPUs.

Concepts
The following are a selection of important concepts regarding CPU performance, beginning with a summary of processor internals: the CPU clock rate and how instructions are executed. This is background for later performance analysis, particularly for understanding the cycles-per-instruction (CPI) metric.

Clock Rate
The clock is a digital signal that drives all processor logic. Each CPU instruction may take one or more cycles of the clock (called CPU cycles) to execute. CPUs exe- cute at a particular clock rate; for example, a 5 GHz CPU performs 5 billion clock cycles per second.

Some processors are able to vary their clock rate, increasing it to improve performance or decreasing it to reduce power consumption. The rate may be varied on request by the operating system, or dynamically by the processor itself. The ker- nel idle thread, for example, can request the CPU to throttle down to save power.

Clock rate is often marketed as the primary feature of the processor, but this can be a little misleading. Even if the CPU in your system appears to be fully utilized (a bottleneck), a faster clock rate may not speed up performance-it depends on what those fast CPU cycles are actually doing. If they are mostly stall cycles while waiting on memory access, executing them more quickly doesn't actually increase the CPU instruction rate or workload throughput.

Instruction
CPUs execute instructions chosen from their instruction set. An instruction includes the following steps, each processed by a component of the CPU called a functional unit:

  1. Instruction fetch
  2. Instruction decode
  3. Execute
  4. Memory access
  5. Register write-back

The last two steps are optional, depending on the instruction. Many instructions operate on registers only and do not require the memory access step.

Each of these steps takes at least a single clock cycle to be executed. Memory access is often the slowest, as it may take dozens of clock cycles to read or write to main memory, during which instruction execution has stalled (and these cycles while stalled are called stall cycles). This is why CPU caching is important, as described in Section 6.4: it can dramatically reduce the number of cycles needed for memory access.

Instruction Pipeline
The instruction pipeline is a CPU architecture that can execute multiple instructions in parallel, by executing different components of different instructions at the same time. It is similar to a factory assembly line, where stages of production can be executed in parallel, increasing throughput.

Consider the instruction steps previously listed. If each were to take a single clock cycle, it would take five cycles to complete the instruction. At each step of this instruction, only one functional unit is active and four are idle. By use of pipe- lining, multiple functional units can be active at the same time, processing differ- ent instructions in the pipeline. Ideally, the processor can then complete one instruction with every clock cycle.

Instruction Width
But we can go faster still. Multiple functional units can be included of the same type, so that even more instructions can make forward progress with each clock cycle. This CPU architecture is called superscalar and is typically used with pipe- lining to achieve a high instruction throughput.

The instruction width describes the target number of instructions to process in parallel. Modern processors are 3-wide or 4-wide, meaning they can complete up to three or four instructions per cycle. How this works depends on the processor, as there may be different numbers of functional units for each stage.

CPI, IPC
Cycles per instruction (CPI) is an important high-level metric for describing where a CPU is spending its clock cycles and for understanding the nature of CPU utilization. This metric may also be expressed as instructions per cycle (IPC), the inverse of CPI.

A high CPI indicates that CPUs are often stalled, typically for memory access. A low CPI indicates that CPUs are often not stalled and have a high instruction throughput. These metrics suggest where performance tuning efforts may be best spent.

Memory-intensive workloads, for example, may be improved by installing faster memory (DRAM), improving memory locality (software configuration), or reducing the amount of memory I/O. Installing CPUs with a higher clock rate may not improve performance to the degree expected, as the CPUs may need to wait the same amount of time for memory I/O to complete. Put differently, a faster CPU may mean more stall cycles but the same rate of completed instructions.

The actual values for high or low CPI are dependent on the processor and processor features and can be determined experimentally by running known work- loads. As an example, you may find that high-CPI workloads run with a CPI at ten or higher, and low CPI workloads run with a CPI at less than one (which is possi- ble due to instruction pipelining and width, described earlier).

It should be noted that CPI shows the efficiency of instruction processing, but not of the instructions themselves. Consider a software change that added an inefficient software loop, which operates mostly on CPU registers (no stall cycles): such a change may result in a lower overall CPI, but higher CPU usage and utilization.

Utilization
CPU utilization is measured by the time a CPU instance is busy performing work during an interval, expressed as a percentage. It can be measured as the time a CPU is not running the kernel idle thread but is instead running user-level application threads or other kernel threads, or processing interrupts.

High CPU utilization may not necessarily be a problem, but rather a sign that the system is doing work. Some people also consider this an ROI indicator: a highly utilized system is considered to have good ROI, whereas an idle system is considered wasted. Unlike with other resource types (disks), performance does not degrade steeply under high utilization, as the kernel supports priorities, preemption, and time sharing. These together allow the kernel to understand what has higher priority, and to ensure that it runs first.

The measure of CPU utilization spans all clock cycles for eligible activities, including memory stall cycles. It may seem a little counterintuitive, but a CPU may be highly utilized because it is often stalled waiting for memory I/O, not just executing instructions, as described in the previous section.

CPU utilization is often split into separate kernel- and user-time metrics.

More Stories By Brendan Gregg

Brendan Gregg, Lead Performance Engineer at Joyent, analyzes performance and scalability throughout the software stack. As Performance Lead and Kernel Engineer at Sun Microsystems (and later Oracle), his work included developing the ZFS L2ARC, a pioneering file system technology for improving performance using flash memory. He has invented and developed many performance tools, including some that ship with Mac OS X and Oracle® Solaris™ 11. His recent work has included performance visualizations for Linux and illumos kernel analysis. He is coauthor of DTrace (Prentice Hall, 2011) and Solaris™ Performance and Tools (Prentice Hall, 2007).

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@MicroservicesExpo Stories
We call it DevOps but much of the time there’s a lot more discussion about the needs and concerns of developers than there is about other groups. There’s a focus on improved and less isolated developer workflows. There are many discussions around collaboration, continuous integration and delivery, issue tracking, source code control, code review, IDEs, and xPaaS – and all the tools that enable those things. Changes in developer practices may come up – such as developers taking ownership of code ...
Containers have changed the mind of IT in DevOps. They enable developers to work with dev, test, stage and production environments identically. Containers provide the right abstraction for microservices and many cloud platforms have integrated them into deployment pipelines. DevOps and containers together help companies achieve their business goals faster and more effectively. In his session at DevOps Summit, Ruslan Synytsky, CEO and Co-founder of Jelastic, reviewed the current landscape of Dev...
In his session at 20th Cloud Expo, Mike Johnston, an infrastructure engineer at Supergiant.io, will discuss how to use Kubernetes to setup a SaaS infrastructure for your business. Mike Johnston is an infrastructure engineer at Supergiant.io with over 12 years of experience designing, deploying, and maintaining server and workstation infrastructure at all scales. He has experience with brick and mortar data centers as well as cloud providers like Digital Ocean, Amazon Web Services, and Rackspace....
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
SYS-CON Events announced today that Outlyer, a monitoring service for DevOps and operations teams, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Outlyer is a monitoring service for DevOps and Operations teams running Cloud, SaaS, Microservices and IoT deployments. Designed for today's dynamic environments that need beyond cloud-scale monitoring, we make monitoring effortless so you...
Cloud Expo, Inc. has announced today that Andi Mann and Aruna Ravichandran have been named Co-Chairs of @DevOpsSummit at Cloud Expo 2017. The @DevOpsSummit at Cloud Expo New York will take place on June 6-8, 2017, at the Javits Center in New York City, New York, and @DevOpsSummit at Cloud Expo Silicon Valley will take place Oct. 31-Nov. 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
DevOps and microservices are permeating software engineering teams broadly, whether these teams are in pure software shops but happen to run a business, such Uber and Airbnb, or in companies that rely heavily on software to run more traditional business, such as financial firms or high-end manufacturers. Microservices and DevOps have created software development and therefore business speed and agility benefits, but they have also created problems; specifically, they have created software securi...
With 10 simultaneous tracks, keynotes, general sessions and targeted breakout classes, Cloud Expo and @ThingsExpo are two of the most important technology events of the year. Since its launch over eight years ago, Cloud Expo and @ThingsExpo have presented a rock star faculty as well as showcased hundreds of sponsors and exhibitors! In this blog post, I provide 7 tips on how, as part of our world-class faculty, you can deliver one of the most popular sessions at our events. But before reading the...
@DevOpsSummit at Cloud taking place June 6-8, 2017, at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long developm...
In their general session at 16th Cloud Expo, Michael Piccininni, Global Account Manager - Cloud SP at EMC Corporation, and Mike Dietze, Regional Director at Windstream Hosted Solutions, reviewed next generation cloud services, including the Windstream-EMC Tier Storage solutions, and discussed how to increase efficiencies, improve service delivery and enhance corporate cloud solution development. Michael Piccininni is Global Account Manager – Cloud SP at EMC Corporation. He has been engaged in t...
TechTarget storage websites are the best online information resource for news, tips and expert advice for the storage, backup and disaster recovery markets. By creating abundant, high-quality editorial content across more than 140 highly targeted technology-specific websites, TechTarget attracts and nurtures communities of technology buyers researching their companies' information technology needs. By understanding these buyers' content consumption behaviors, TechTarget creates the purchase inte...
Software development is a moving target. You have to keep your eye on trends in the tech space that haven’t even happened yet just to stay current. Consider what’s happened with augmented reality (AR) in this year alone. If you said you were working on an AR app in 2015, you might have gotten a lot of blank stares or jokes about Google Glass. Then Pokémon GO happened. Like AR, the trends listed below have been building steam for some time, but they’ll be taking off in surprising new directions b...
The Internet of Things is clearly many things: data collection and analytics, wearables, Smart Grids and Smart Cities, the Industrial Internet, and more. Cool platforms like Arduino, Raspberry Pi, Intel's Galileo and Edison, and a diverse world of sensors are making the IoT a great toy box for developers in all these areas. In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists discussed what things are the most important, which will have the most profound e...
"We're bringing out a new application monitoring system to the DevOps space. It manages large enterprise applications that are distributed throughout a node in many enterprises and we manage them as one collective," explained Kevin Barnes, President of eCube Systems, in this SYS-CON.tv interview at DevOps at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
As organizations realize the scope of the Internet of Things, gaining key insights from Big Data, through the use of advanced analytics, becomes crucial. However, IoT also creates the need for petabyte scale storage of data from millions of devices. A new type of Storage is required which seamlessly integrates robust data analytics with massive scale. These storage systems will act as “smart systems” provide in-place analytics that speed discovery and enable businesses to quickly derive meaningf...
Docker containers have brought great opportunities to shorten the deployment process through continuous integration and the delivery of applications and microservices. This applies equally to enterprise data centers as well as the cloud. In his session at 20th Cloud Expo, Jari Kolehmainen, founder and CTO of Kontena, will discuss solutions and benefits of a deeply integrated deployment pipeline using technologies such as container management platforms, Docker containers, and the drone.io Cl tool...
In 2014, Amazon announced a new form of compute called Lambda. We didn't know it at the time, but this represented a fundamental shift in what we expect from cloud computing. Now, all of the major cloud computing vendors want to take part in this disruptive technology. In his session at 20th Cloud Expo, John Jelinek IV, a web developer at Linux Academy, will discuss why major players like AWS, Microsoft Azure, IBM Bluemix, and Google Cloud Platform are all trying to sidestep VMs and containers...
DevOps has often been described in terms of CAMS: Culture, Automation, Measuring, Sharing. While we’ve seen a lot of focus on the “A” and even on the “M”, there are very few examples of why the “C" is equally important in the DevOps equation. In her session at @DevOps Summit, Lori MacVittie, of F5 Networks, explored HTTP/1 and HTTP/2 along with Microservices to illustrate why a collaborative culture between Dev, Ops, and the Network is critical to ensuring success.
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain. In this power panel at @...
In his General Session at 16th Cloud Expo, David Shacochis, host of The Hybrid IT Files podcast and Vice President at CenturyLink, investigated three key trends of the “gigabit economy" though the story of a Fortune 500 communications company in transformation. Narrating how multi-modal hybrid IT, service automation, and agile delivery all intersect, he will cover the role of storytelling and empathy in achieving strategic alignment between the enterprise and its information technology.