|By Dan Kuebrich||
|April 19, 2013 11:00 AM EDT||
You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts - which started coming in about 2.5 hours before AWS started reporting problems - our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our pipeline as smoothly as before. Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency. Here’s what it looked like:
Figure 1: At 10 AM, we saw elevated DB load and disk latency.
12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair
1:30 PM EST: Frontend offline, AWS incident report
Our workload is very write-heavy, so we first noticed performance problems there, but pretty soon reads made by our frontend were also affected, as a growing fraction of our customer’s data became affected by the mounting EBS problems. At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang. Here’s a view of the impact on our query sharding service:
Figure 2: Impact on our query sharding service.
6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster
9 PM EST: Back online
AWS started bringing volumes back online that evening. During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we were able to get frontend servers back online, and spun up more pipeline workers to plow through the queued trace backlog as we replayed it from S3. Latency: the functional test of performance metrics You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It’s the functional test of system: if any of the gears in the system being monitored start getting jammed, it will likely manifest in increased latency. However, latency can be noisy—how can we make this measurement more controlled, or to extend my testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or particular URLs/controllers. The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.
Figure 3: Use alerting to detect aberrant database performance.
Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics is where the buck stops. Configuration is usually a pain: install agents on each machine and set thresholds. We think the best alerts are at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover all hosts in an app.
In this strange new world where more and more power is drawn from business technology, companies are effectively straddling two paths on the road to innovation and transformation into digital enterprises. The first path is the heritage trail – with “legacy” technology forming the background. Here, extant technologies are transformed by core IT teams to provide more API-driven approaches. Legacy systems can restrict companies that are transitioning into digital enterprises. To truly become a lea...
Oct. 1, 2016 01:30 PM EDT Reads: 863
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 19th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - comp...
Oct. 1, 2016 01:00 PM EDT Reads: 5,188
About a year ago we tuned into “the need for speed” and how a concept like "serverless computing” was increasingly catering to this. We are now a year further and the term “serverless” is taking on unexpected proportions. With some even seeing it as the successor to cloud in general or at least as a successor to the clouds’ poorer cousin in terms of revenue, hype and adoption: PaaS. The question we need to ask is whether this constitutes an example of Hype Hopping: to effortlessly pivot to the ...
Oct. 1, 2016 12:45 PM EDT Reads: 1,244
SYS-CON Events announced today the Enterprise IoT Bootcamp, being held November 1-2, 2016, in conjunction with 19th Cloud Expo | @ThingsExpo at the Santa Clara Convention Center in Santa Clara, CA. Combined with real-world scenarios and use cases, the Enterprise IoT Bootcamp is not just based on presentations but with hands-on demos and detailed walkthroughs. We will introduce you to a variety of real world use cases prototyped using Arduino, Raspberry Pi, BeagleBone, Spark, and Intel Edison. Y...
Oct. 1, 2016 12:30 PM EDT Reads: 3,107
Just over a week ago I received a long and loud sustained applause for a presentation I delivered at this year’s Cloud Expo in Santa Clara. I was extremely pleased with the turnout and had some very good conversations with many of the attendees. Over the next few days I had many more meaningful conversations and was not only happy with the results but also learned a few new things. Here is everything I learned in those three days distilled into three short points.
Oct. 1, 2016 12:30 PM EDT Reads: 5,538
SYS-CON Events announced today that Sheng Liang to Keynote at SYS-CON's 19th Cloud Expo, which will take place on November 1-3, 2016 at the Santa Clara Convention Center in Santa Clara, California.
Oct. 1, 2016 11:45 AM EDT Reads: 318
24Notion is full-service global creative digital marketing, technology and lifestyle agency that combines strategic ideas with customized tactical execution. With a broad understand of the art of traditional marketing, new media, communications and social influence, 24Notion uniquely understands how to connect your brand strategy with the right consumer. 24Notion ranked #12 on Corporate Social Responsibility - Book of List.
Oct. 1, 2016 10:45 AM EDT Reads: 641
With the rise of Docker, Kubernetes, and other container technologies, the growth of microservices has skyrocketed among dev teams looking to innovate on a faster release cycle. This has enabled teams to finally realize their DevOps goals to ship and iterate quickly in a continuous delivery model. Why containers are growing in popularity is no surprise — they’re extremely easy to spin up or down, but come with an unforeseen issue. However, without the right foresight, DevOps and IT teams may lo...
Oct. 1, 2016 10:30 AM EDT Reads: 1,261
DevOps at Cloud Expo – being held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's largest enterprises – and delivering real results. Am...
Oct. 1, 2016 10:00 AM EDT Reads: 4,684
Much of the value of DevOps comes from a (renewed) focus on measurement, sharing, and continuous feedback loops. In increasingly complex DevOps workflows and environments, and especially in larger, regulated, or more crystallized organizations, these core concepts become even more critical. In his session at @DevOpsSummit at 18th Cloud Expo, Andi Mann, Chief Technology Advocate at Splunk, showed how, by focusing on 'metrics that matter,' you can provide objective, transparent, and meaningful f...
Oct. 1, 2016 09:00 AM EDT Reads: 2,954
Digitization is driving a fundamental change in society that is transforming the way businesses work with their customers, their supply chains and their people. Digital transformation leverages DevOps best practices, such as Agile Parallel Development, Continuous Delivery and Agile Operations to capitalize on opportunities and create competitive differentiation in the application economy. However, information security has been notably absent from the DevOps movement. Speed doesn’t have to negat...
Oct. 1, 2016 07:00 AM EDT Reads: 2,464
With online viewership and sales growing rapidly, enterprises are interested in understanding how they analyze performance to positively impact business metrics. Deeper insight into the user experience is needed to understand why conversions are dropping and/or bounce rates are increasing or, preferably, to understand what has been helping these metrics improve. The digital performance management industry has evolved as application performance management companies have broadened their scope beyo...
Oct. 1, 2016 07:00 AM EDT Reads: 1,502
While DevOps promises a better and tighter integration among an organization’s development and operation teams and transforms an application life cycle into a continual deployment, Chef and Azure together provides a speedy, cost-effective and highly scalable vehicle for realizing the business values of this transformation. In his session at @DevOpsSummit at 19th Cloud Expo, Yung Chou, a Technology Evangelist at Microsoft, will present a unique opportunity to witness how Chef and Azure work tog...
Oct. 1, 2016 06:30 AM EDT Reads: 1,950
Your business relies on your applications and your employees to stay in business. Whether you develop apps or manage business critical apps that help fuel your business, what happens when users experience sluggish performance? You and all technical teams across the organization – application, network, operations, among others, as well as, those outside the organization, like ISPs and third-party providers – are called in to solve the problem.
Oct. 1, 2016 06:00 AM EDT Reads: 2,820
Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterprises are using some form of XaaS - software, platform, and infrastructure as a service.
Oct. 1, 2016 05:30 AM EDT Reads: 1,074
As applications are promoted from the development environment to the CI or the QA environment and then into the production environment, it is very common for the configuration settings to be changed as the code is promoted. For example, the settings for the database connection pools are typically lower in development environment than the QA/Load Testing environment. The primary reason for the existence of the configuration setting differences is to enhance application performance. However, occas...
Oct. 1, 2016 05:15 AM EDT Reads: 1,012
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 19th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo Silicon Valley Call for Papers is now open.
Oct. 1, 2016 05:00 AM EDT Reads: 4,774
If you’re responsible for an application that depends on the data or functionality of various IoT endpoints – either sensors or devices – your brand reputation depends on the security, reliability, and compliance of its many integrated parts. If your application fails to deliver the expected business results, your customers and partners won't care if that failure stems from the code you developed or from a component that you integrated. What can you do to ensure that the endpoints work as expect...
Oct. 1, 2016 04:30 AM EDT Reads: 1,849
Apache Hadoop is a key technology for gaining business insights from your Big Data, but the penetration into enterprises is shockingly low. In fact, Apache Hadoop and Big Data proponents recognize that this technology has not yet achieved its game-changing business potential. In his session at 19th Cloud Expo, John Mertic, director of program management for ODPi at The Linux Foundation, will explain why this is, how we can work together as an open data community to increase adoption, and the i...
Oct. 1, 2016 03:30 AM EDT Reads: 1,245
SYS-CON Events announced today that Tintri Inc., a leading producer of VM-aware storage (VAS) for virtualization and cloud environments, will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Tintri VM-aware storage is the simplest for virtualized applications and cloud. Organizations including GE, Toyota, United Healthcare, NASA and 6 of the Fortune 15 have said “No to LUNs.” With Tintri they mana...
Oct. 1, 2016 02:30 AM EDT Reads: 3,065