By Dan Kuebrich
April 19, 2013 11:00 AM EDT
You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts - which started coming in about 2.5 hours before AWS started reporting problems - our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our pipeline as smoothly as before. Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency. Here’s what it looked like:
Figure 1: At 10 AM, we saw elevated DB load and disk latency.
12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair
1:30 PM EST: Frontend offline, AWS incident report
Our workload is very write-heavy, so we first noticed performance problems there, but pretty soon reads made by our frontend also suffered as the mounting EBS problems reached a growing fraction of our customers’ data. At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang. Here’s a view of the impact on our query sharding service:
Figure 2: Impact on our query sharding service.
6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster
9 PM EST: Back online
AWS started bringing volumes back online that evening. During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we got the frontend servers back online and spun up more pipeline workers to plow through the queued trace backlog as we replayed it from S3.
Latency: the functional test of performance metrics
You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It’s the functional test of the system: if any of the gears in the system being monitored start getting jammed, it will likely show up as increased latency. However, latency can be noisy. How can we make this measurement more controlled, or, to extend my testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or on particular URLs/controllers. The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.
Figure 3: Use alerting to detect aberrant database performance.
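To make the idea of a less noisy, per-layer latency alert concrete, here is a minimal sketch in Python. It is not TraceView's API or its alerting logic; the function name `latency_alert`, the window sizes, and the 2x factor are all assumptions chosen for illustration. It compares the median latency of the most recent intervals against the median of the intervals just before them, so a single noisy sample does not fire the alert but a sustained shift does.

```python
from statistics import median


def latency_alert(samples_ms, baseline_window=30, recent_window=5, factor=2.0):
    """Return True when recent latency is well above the earlier baseline.

    samples_ms: per-interval latencies (ms) for ONE layer, oldest first.
    All names and default values here are hypothetical, for illustration only.
    """
    if len(samples_ms) < baseline_window + recent_window:
        return False  # not enough history to judge

    # Baseline: the intervals immediately before the recent window.
    baseline = median(samples_ms[-(baseline_window + recent_window):-recent_window])
    # Recent: the last few intervals; the median ignores isolated spikes.
    recent = median(samples_ms[-recent_window:])
    return recent > factor * baseline


steady = [10.0] * 35
one_spike = [10.0] * 30 + [10.0, 10.0, 200.0, 10.0, 10.0]
degraded = [10.0] * 30 + [50.0] * 5

print(latency_alert(steady), latency_alert(one_spike), latency_alert(degraded))
```

Feeding the function latencies for a single layer (say, only database queries) rather than whole-request latency is what moves the check from a functional test toward a unit test.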
Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics are where the buck stops. Configuration is usually a pain: install agents on each machine and set thresholds. We think the best alerts are at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover all hosts in an app.
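As a rough illustration of one rule covering every host, the sketch below applies a single threshold to the latest metrics from all hosts in an app and reports which ones breach it. This is an assumed, simplified model and not how TraceView implements alerts; `hosts_breaching`, the metric name `disk_latency_ms`, and the host names are hypothetical.

```python
def hosts_breaching(host_metrics, metric, threshold):
    """Return the sorted names of hosts whose latest `metric` exceeds `threshold`.

    host_metrics: {host_name: {metric_name: latest_value}} for one app.
    Hosts that do not report the metric are skipped rather than flagged.
    """
    return sorted(
        host
        for host, metrics in host_metrics.items()
        if metric in metrics and metrics[metric] > threshold
    )


# Hypothetical snapshot of one app's hosts.
app_hosts = {
    "db1": {"disk_latency_ms": 4.0},
    "db2": {"disk_latency_ms": 120.0},
    "web1": {"load": 1.0},
}

print(hosts_breaching(app_hosts, "disk_latency_ms", 50.0))
```

The rule is defined once and evaluated against whatever hosts are currently reporting, so adding a machine to the app does not require configuring a new threshold.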