By Dan Kuebrich
April 19, 2013 11:00 AM EDT
You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts, which started coming in about 2.5 hours before AWS started reporting problems, our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into which metrics to watch and alert on in your environment, and how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 AM, we started to notice that writes weren’t moving through our pipeline as smoothly as before. Sure enough, we soon started seeing alerts about elevated DB load and disk latency. Here’s what it looked like:
Figure 1: At 10 AM, we saw elevated DB load and disk latency.
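The post doesn’t say how TraceView collects disk latency, so here is only a minimal sketch under an assumption: Linux hosts exposing /proc/diskstats. Taking two snapshots a few seconds apart and dividing the change in time spent on I/O by the change in completed I/Os gives the average per-I/O latency, much like iostat’s await column. The function names and the 50 ms threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DiskSample:
    ios: int    # reads completed + writes completed
    io_ms: int  # milliseconds spent reading + milliseconds spent writing


def parse_diskstats(text: str) -> Dict[str, DiskSample]:
    """Parse text in /proc/diskstats format into one sample per device."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        # fields: major, minor, name, reads, reads merged, sectors read, ms reading,
        #         writes, writes merged, sectors written, ms writing, ...
        reads, ms_reading = int(fields[3]), int(fields[6])
        writes, ms_writing = int(fields[7]), int(fields[10])
        samples[fields[2]] = DiskSample(ios=reads + writes, io_ms=ms_reading + ms_writing)
    return samples


def avg_io_latency_ms(before: DiskSample, after: DiskSample) -> float:
    """Average milliseconds per completed I/O between two samples (like iostat's await)."""
    completed = after.ios - before.ios
    if completed <= 0:
        return 0.0  # no I/O finished in the interval, so there is nothing to average
    return (after.io_ms - before.io_ms) / completed


def slow_disks(before_text: str, after_text: str, threshold_ms: float) -> List[str]:
    """Devices whose average I/O latency over the interval exceeds threshold_ms."""
    before = parse_diskstats(before_text)
    after = parse_diskstats(after_text)
    return sorted(
        name
        for name, sample in after.items()
        if name in before and avg_io_latency_ms(before[name], sample) > threshold_ms
    )
```

On a real host you would read /proc/diskstats twice, a few seconds apart, and pass both strings to `slow_disks(first, second, threshold_ms=50.0)`. Because the per-I/O time includes time spent queued, a volume that has stopped making progress shows up here quickly.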
12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair
1:30 PM EST: Frontend offline, AWS incident report
Our workload is very write-heavy, so we noticed performance problems there first, but reads made by our frontend were soon affected as well, as the mounting EBS problems reached a growing fraction of our customers’ data. At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang. Here’s a view of the impact on our query sharding service:
Figure 2: Impact on our query sharding service.
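Processes stuck in uninterruptible sleep show up as state `D` in Linux’s /proc/&lt;pid&gt;/stat, so a rising count of them is one host-level signal of the kind of hang described above. This is a sketch assuming Linux’s /proc layout, not something the post says TraceView does; the function names are hypothetical.

```python
import glob
from typing import Iterable, List


def proc_state(stat_line: str) -> str:
    """One-letter process state from a /proc/<pid>/stat line.

    Field 2 is the command name in parentheses, and it can itself contain spaces
    or ')', so the state is taken from just after the *last* ')'.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_uninterruptible(stat_lines: Iterable[str]) -> int:
    """Count processes in state 'D' (uninterruptible sleep, typically blocked on I/O)."""
    return sum(1 for line in stat_lines if proc_state(line) == "D")


def read_stat_lines() -> List[str]:
    """Read /proc/<pid>/stat for every process currently visible (Linux only)."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                lines.append(f.read())
        except OSError:
            continue  # the process exited between listing and reading
    return lines
```

On a Linux host, `count_uninterruptible(read_stat_lines())` gives the current count. A few D-state processes are normal under heavy I/O; it is a count that keeps climbing and never drains that points at a stuck volume.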
6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster
9 PM EST: Back online
AWS started bringing volumes back online that evening. During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we were able to bring the frontend servers back online and spin up more pipeline workers to plow through the queued trace backlog as we replayed it from S3.
Latency: the functional test of performance metrics
You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It’s the functional test of the system: if any of the gears in the monitored system start getting jammed, the jam will likely show up as increased latency. However, latency can be noisy. How can we make this measurement more controlled or, to extend my testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application as a whole, but also on individual layers of the stack or on particular URLs/controllers. Tracking the latency of a predictable query load over time, for instance, is a great way to detect aberrant database performance.
Figure 3: Use alerting to detect aberrant database performance.
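TraceView alerts are configured in its UI, and the post doesn’t describe how they are evaluated. As a rough sketch of one way to tame noisy latency, assuming you have latency samples in milliseconds for a single layer (say, the database layer): summarize each window by its median, compare against a baseline built from the preceding windows, and fire only after several consecutive bad windows. The names and default values are hypothetical.

```python
import statistics
from typing import List, Sequence


def window_medians(samples_ms: Sequence[float], window: int) -> List[float]:
    """Median latency of each consecutive, non-overlapping window of samples.

    A trailing partial window is dropped.
    """
    return [
        statistics.median(samples_ms[i:i + window])
        for i in range(0, len(samples_ms) - window + 1, window)
    ]


def latency_alert(
    medians: Sequence[float],
    baseline_windows: int = 5,
    factor: float = 2.0,
    consecutive: int = 2,
) -> bool:
    """True if the last `consecutive` window medians all exceed `factor` times
    the median of the `baseline_windows` windows just before them."""
    needed = baseline_windows + consecutive
    if len(medians) < needed:
        return False  # not enough history to judge
    baseline = statistics.median(medians[-needed:-consecutive])
    return all(m > factor * baseline for m in medians[-consecutive:])
```

The median keeps a handful of slow requests from moving a window, and requiring consecutive bad windows suppresses one-off blips. Both choices trade a little detection delay for fewer false alarms, which matters most on a predictable load like the query example above.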
Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics are where the buck stops. Configuring alerts on them is usually a pain: you have to install agents on each machine and set thresholds for each one. We think the best alerts are at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover all hosts in an app.
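The post doesn’t describe how TraceView implements app-wide alerts, but the idea can be illustrated with a small sketch. It assumes hosts are grouped by app and report metrics as name/value pairs; one rule is then evaluated against whichever hosts belong to the app at evaluation time, instead of being configured host by host. `AlertRule`, `evaluate`, and the metric names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str       # e.g. "disk_latency_ms" (hypothetical metric name)
    threshold: float  # fire when the metric is above this value
    app: str          # the rule applies to every host in this app


def evaluate(
    rule: AlertRule,
    host_apps: Dict[str, str],
    host_metrics: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the hosts in rule.app whose metric exceeds the threshold.

    Hosts are looked up in host_apps at evaluation time, so a host added to
    the app is covered without any per-host configuration.
    """
    firing = []
    for host, app in sorted(host_apps.items()):
        if app != rule.app:
            continue
        value = host_metrics.get(host, {}).get(rule.metric)
        if value is not None and value > rule.threshold:
            firing.append(host)
    return firing
```

Hosts that don’t report the metric are skipped rather than treated as alerts; whether missing data should itself raise an alert is a separate design decision.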