|By Dan Kuebrich||
|April 19, 2013 11:00 AM EDT||
You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts - which started coming in about 2.5 hours before AWS started reporting problems - our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our pipeline as smoothly as before. Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency. Here’s what it looked like:
Figure 1: At 10 AM, we saw elevated DB load and disk latency.
12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair
1:30 PM EST: Frontend offline, AWS incident report
Our workload is very write-heavy, so we first noticed performance problems there, but pretty soon reads made by our frontend were also affected, as a growing fraction of our customer’s data became affected by the mounting EBS problems. At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang. Here’s a view of the impact on our query sharding service:
Figure 2: Impact on our query sharding service.
6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster
9 PM EST: Back online
AWS started bringing volumes back online that evening. During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we were able to get frontend servers back online, and spun up more pipeline workers to plow through the queued trace backlog as we replayed it from S3. Latency: the functional test of performance metrics You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It’s the functional test of system: if any of the gears in the system being monitored start getting jammed, it will likely manifest in increased latency. However, latency can be noisy—how can we make this measurement more controlled, or to extend my testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or particular URLs/controllers. The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.
Figure 3: Use alerting to detect aberrant database performance.
Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics is where the buck stops. Configuration is usually a pain: install agents on each machine and set thresholds. We think the best alerts are at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover all hosts in an app.
Ten years ago, there may have been only a single application that talked directly to the database and spit out HTML; customer service, sales - most of the organizations I work with have been moving toward a design philosophy more like unix, where each application consists of a series of small tools stitched together. In web example above, that likely means a login service combines with webpages that call other services - like enter and update record. That allows the customer service team to writ...
Oct. 10, 2015 02:45 AM EDT Reads: 443
As we increasingly rely on technology to improve the quality and efficiency of our personal and professional lives, software has become the key business differentiator. Organizations must release software faster, as well as ensure the safety, security, and reliability of their applications. The option to make trade-offs between time and quality no longer exists—software teams must deliver quality and speed. To meet these expectations, businesses have shifted from more traditional approaches of d...
Oct. 10, 2015 02:15 AM EDT Reads: 240
Between the compelling mockups and specs produced by analysts, and resulting applications built by developers, there exists a gulf where projects fail, costs spiral, and applications disappoint. Methodologies like Agile attempt to address this with intensified communication, with partial success but many limitations. In his session at DevOps Summit, Charles Kendrick, CTO and Chief Architect at Isomorphic Software, will present a revolutionary model enabled by new technologies. Learn how busine...
Oct. 10, 2015 02:00 AM EDT Reads: 306
If you are new to Python, you might be confused about the different versions that are available. Although Python 3 is the latest generation of the language, many programmers still use Python 2.7, the final update to Python 2, which was released in 2010. There is currently no clear-cut answer to the question of which version of Python you should use; the decision depends on what you want to achieve. While Python 3 is clearly the future of the language, some programmers choose to remain with Py...
Oct. 10, 2015 02:00 AM EDT Reads: 268
SYS-CON Events announced today that Dyn, the worldwide leader in Internet Performance, will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Dyn is a cloud-based Internet Performance company. Dyn helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Through a world-class network and unrivaled, objective intelligence into Internet condit...
Oct. 10, 2015 02:00 AM EDT Reads: 652
Achim Weiss is Chief Executive Officer and co-founder of ProfitBricks. In 1995, he broke off his studies to co-found the web hosting company "Schlund+Partner." The company "Schlund+Partner" later became the 1&1 web hosting product line. From 1995 to 2008, he was the technical director for several important projects: the largest web hosting platform in the world, the second largest DSL platform, a video on-demand delivery network, the largest eMail backend in Europe, and a universal billing syste...
Oct. 10, 2015 01:00 AM EDT Reads: 241
Containers have changed the mind of IT in DevOps. They enable developers to work with dev, test, stage and production environments identically. Containers provide the right abstraction for microservices and many cloud platforms have integrated them into deployment pipelines. DevOps and Containers together help companies to achieve their business goals faster and more effectively.
Oct. 10, 2015 12:00 AM EDT Reads: 221
Somebody call the buzzword police: we have a serious case of microservices-washing in progress. The term “microservices-washing” is derived from “whitewashing,” meaning to hide some inconvenient truth with bluster and nonsense. We saw plenty of cloudwashing a few years ago, as vendors and enterprises alike pretended what they were doing was cloud, even though it wasn’t. Today, the hype around microservices has led to the same kind of obfuscation, as vendors and enterprise technologists alike ar...
Oct. 10, 2015 12:00 AM EDT Reads: 494
Opinions on how best to package and deliver applications are legion and, like many other aspects of the software world, are subject to recurring trend cycles. On the server-side, the current favorite is container delivery: a “full stack” approach in which your application and everything it needs to run are specified in a container definition. That definition is then “compiled” down to a container image and deployed by retrieving the image and passing it to a container runtime to create a running...
Oct. 10, 2015 12:00 AM EDT Reads: 268
Containers are revolutionizing the way we deploy and maintain our infrastructures, but monitoring and troubleshooting in a containerized environment can still be painful and impractical. Understanding even basic resource usage is difficult - let alone tracking network connections or malicious activity. In his session at DevOps Summit, Gianluca Borello, Sr. Software Engineer at Sysdig, will cover the current state of the art for container monitoring and visibility, including pros / cons and li...
Oct. 10, 2015 12:00 AM EDT Reads: 266
The web app is agile. The REST API is agile. The testing and planning are agile. But alas, data infrastructures certainly are not. Once an application matures, changing the shape or indexing scheme of data often forces at best a top down planning exercise and at worst includes schema changes that force downtime. The time has come for a new approach that fundamentally advances the agility of distributed data infrastructures. Come learn about a new solution to the problems faced by software organ...
Oct. 9, 2015 08:00 PM EDT Reads: 937
Saviynt Inc. has announced the availability of the next release of Saviynt for AWS. The comprehensive security and compliance solution provides a Command-and-Control center to gain visibility into risks in AWS, enforce real-time protection of critical workloads as well as data and automate access life-cycle governance. The solution enables AWS customers to meet their compliance mandates such as ITAR, SOX, PCI, etc. by including an extensive risk and controls library to detect known threats and b...
Oct. 9, 2015 03:00 PM EDT Reads: 241
Docker is hot. However, as Docker container use spreads into more mature production pipelines, there can be issues about control of Docker images to ensure they are production-ready. Is a promotion-based model appropriate to control and track the flow of Docker images from development to production? In his session at DevOps Summit, Fred Simon, Co-founder and Chief Architect of JFrog, will demonstrate how to implement a promotion model for Docker images using a binary repository, and then show h...
Oct. 9, 2015 02:15 PM EDT Reads: 191
DevOps has often been described in terms of CAMS: Culture, Automation, Measuring, Sharing. While we’ve seen a lot of focus on the “A” and even on the “M”, there are very few examples of why the “C" is equally important in the DevOps equation. In her session at @DevOps Summit, Lori MacVittie, of F5 Networks, will explore HTTP/1 and HTTP/2 along with Microservices to illustrate why a collaborative culture between Dev, Ops, and the Network is critical to ensuring success.
Oct. 9, 2015 01:30 PM EDT Reads: 167
Overgrown applications have given way to modular applications, driven by the need to break larger problems into smaller problems. Similarly large monolithic development processes have been forced to be broken into smaller agile development cycles. Looking at trends in software development, microservices architectures meet the same demands. Additional benefits of microservices architectures are compartmentalization and a limited impact of service failure versus a complete software malfunction....
Oct. 9, 2015 01:15 PM EDT Reads: 263
Despite all the talk about public cloud services and DevOps, you would think the move to cloud for enterprises is clear and simple. But in a survey of almost 1,600 IT decision makers across the USA and Europe, the state of the cloud in enterprise today is still fraught with considerable frustration. The business case for apps in the real world cloud is hybrid, bimodal, multi-platform, and difficult. Download this report commissioned by NTT Communications to see the insightful findings – registra...
Oct. 9, 2015 01:00 PM EDT Reads: 304
Manufacturing has widely adopted standardized and automated processes to create designs, build them, and maintain them through their life cycle. However, many modern manufacturing systems go beyond mechanized workflows to introduce empowered workers, flexible collaboration, and rapid iteration. Such behaviors also characterize open source software development and are at the heart of DevOps culture, processes, and tooling.
Oct. 9, 2015 12:30 PM EDT Reads: 1,108
In a report titled “Forecast Analysis: Enterprise Application Software, Worldwide, 2Q15 Update,” Gartner analysts highlighted the increasing trend of application modernization among enterprises. According to a recent survey, 45% of respondents stated that modernization of installed on-premises core enterprise applications is one of the top five priorities. Gartner also predicted that by 2020, 75% of
Oct. 9, 2015 12:00 PM EDT Reads: 345
DevOps Summit at Cloud Expo 2014 Silicon Valley was a terrific event for us. The Qubell booth was crowded on all three days. We ran demos every 30 minutes with folks lining up to get a seat and usually standing around. It was great to meet and talk to over 500 people! My keynote was well received and so was Stan's joint presentation with RingCentral on Devops for BigData. I also participated in two Power Panels – ‘Women in Technology’ and ‘Why DevOps Is Even More Important than You Think,’ both ...
Oct. 9, 2015 12:00 PM EDT Reads: 8,679
As the world moves towards more DevOps and microservices, application deployment to the cloud ought to become a lot simpler. The microservices architecture, which is the basis of many new age distributed systems such as OpenStack, NetFlix and so on, is at the heart of Cloud Foundry - a complete developer-oriented Platform as a Service (PaaS) that is IaaS agnostic and supports vCloud, OpenStack and AWS. In his session at 17th Cloud Expo, Raghavan "Rags" Srinivas, an Architect/Developer Evangeli...
Oct. 9, 2015 11:45 AM EDT Reads: 185