Cloud Expo: Blog Post

Amazon Outage

Alerting to catch infrastructure problems

You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring.  We had quite a day Monday during the EC2 EBS availability incident.  Thanks to some early alerts - which started coming in about 2.5 hours before AWS started reporting problems - our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.

10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our pipeline as smoothly as before.  Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency.  Here’s what it looked like:

Figure 1: At 10 AM, we saw elevated DB load and disk latency.

12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair

1:30 PM EST: Frontend offline, AWS incident report

Our workload is very write-heavy, so we first noticed performance problems there, but pretty soon reads made by our frontend were also suffering, as a growing fraction of our customers’ data was hit by the mounting EBS problems. At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang. Here’s a view of the impact on our query sharding service:

Figure 2: Impact on our query sharding service.

6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster

9 PM EST: Back online

AWS started bringing volumes back online that evening. During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we were able to get frontend servers back online, and spun up more pipeline workers to plow through the queued trace backlog as we replayed it from S3.

Latency: the functional test of performance metrics
You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It’s the functional test of the system: if any of the gears in the system being monitored start getting jammed, it will likely manifest as increased latency. However, latency can be noisy. How can we make this measurement more controlled or, to extend my testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or particular URLs/controllers. The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.

Figure 3: Use alerting to detect aberrant database performance.
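
To make the idea of a per-layer latency alert concrete, here is a minimal Python sketch. It is not TraceView's API or implementation: the `LayerLatencyAlert` class, its parameters, and the sample readings are all hypothetical, chosen only to illustrate the technique. It assumes you already know a baseline latency for a layer with a predictable load, and it damps the noise by comparing the median of a small sliding window, rather than any single sample, against a multiple of that baseline.

```python
from collections import deque
from statistics import median


class LayerLatencyAlert:
    """Hypothetical alert: fires when a layer's recent median latency
    exceeds a multiple of its known baseline."""

    def __init__(self, layer, baseline_ms, factor=2.0, window=5):
        self.layer = layer              # e.g. "mysql" (illustrative name)
        self.baseline_ms = baseline_ms  # assumed known from normal operation
        self.factor = factor            # how far above baseline counts as bad
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms):
        """Record one sample; return True if the alert condition holds."""
        self.samples.append(latency_ms)
        # Wait for a full window so one slow request cannot fire the alert.
        if len(self.samples) < self.samples.maxlen:
            return False
        return median(self.samples) > self.baseline_ms * self.factor


# Made-up readings: a steady ~10 ms layer that degrades partway through.
alert = LayerLatencyAlert(layer="mysql", baseline_ms=10.0, factor=2.0, window=5)
readings = [9, 11, 10, 12, 10, 35, 40, 38, 42, 41]
fired_at = [i for i, ms in enumerate(readings) if alert.observe(ms)]
print(fired_at)
```

Using the median means the alert fires only once most of the window is slow, which trades a few samples of delay for fewer false alarms on a noisy signal.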

Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics are where the buck stops. Configuration is usually a pain: install agents on each machine and set thresholds. We think the best alerts are at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover all hosts in an app.
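
As a rough illustration of one alert covering every host in an app, the Python sketch below applies a single threshold to a snapshot of per-host metrics. The function name, the metric name `disk_await_ms`, the host names, and the values are all assumptions made for the example; none of them come from TraceView.

```python
def hosts_in_alert(host_metrics, metric, threshold):
    """Return the sorted names of hosts whose `metric` exceeds `threshold`.

    Hosts that do not report the metric are skipped rather than flagged.
    """
    return sorted(
        host
        for host, metrics in host_metrics.items()
        if metric in metrics and metrics[metric] > threshold
    )


# Hypothetical snapshot of host-level metrics for one app.
snapshot = {
    "db-1": {"disk_await_ms": 4.2, "load": 1.1},
    "db-2": {"disk_await_ms": 180.0, "load": 9.7},
    "web-1": {"load": 0.8},  # no disk metric reported
}

# One rule, evaluated against every host in the app.
breaching = hosts_in_alert(snapshot, "disk_await_ms", 50.0)
print(breaching)
```

Because the rule is defined once and evaluated per host, a newly added host is covered as soon as it starts reporting the metric, with no per-machine threshold to configure.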

More Stories By Dan Kuebrich

Dan Kuebrich is a web performance geek, currently working on Application Performance Management at AppNeta. He was previously a founder of Tracelytics (acquired by AppNeta), and before that worked on AmieStreet/Songza.com.
