Welcome!

SOA & WOA Authors: Liz McMillan, Elizabeth White, Rich Waidmann, Pat Romanski, Plutora Blog

Related Topics: Cloud Expo, XML, SOA & WOA, Web 2.0, Security

Cloud Expo: Blog Post

Amazon Outage

Alerting to catch infrastructure problems

You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring.  We had quite a day Monday during the EC2 EBS availability incident.  Thanks to some early alerts - which started coming in about 2.5 hours before AWS started reporting problems - our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.

10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our pipeline as smoothly as before.  Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency.  Here’s what it looked like:

Figure 1: At 10 AM, we saw elevated DB load and disk latency.

12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair

1:30 PM EST: Frontend offline, AWS incident report

Our workload is very write-heavy, so we first noticed performance problems there, but pretty soon reads made by our frontend were also affected, as a growing fraction of our customer’s data became affected by the mounting EBS problems.  At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang.  Here’s a view of the impact on our query sharding service:

Figure 2: Impact on our query sharding service.

6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster

9 PM EST: Back online

AWS started bringing volumes back online that evening.  During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we were able to get frontend servers back online, and spun up more pipeline workers to plow through the queued trace backlog as we replayed it from S3. Latency: the functional test of performance metrics You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems.  It’s the functional test of system: if any of the gears in the system being monitored start getting jammed, it will likely manifest in increased latency. However, latency can be noisy—how can we make this measurement more controlled, or to extend my testing metaphor, closer to a unit test?  Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or particular URLs/controllers.  The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.

Figure 3: Use alerting to detect aberrant database performance.

Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics is where the buck stops.  Configuration is usually a pain: install agents on each machine and set thresholds.  We think the best alerts are at the intersection of easy and actionable.  With TraceView, you can set up a single alert and have it cover all hosts in an app.

Related Articles

The 5 Critical Things You Need to Know to Assure Optimal Performance in the Cloud

Verifying Network Performance SLAs

Who’s Managing the Performance of Your Cloud Applications?

More Stories By Dan Kuebrich

Dan Kuebrich is a web performance geek, currently working on Application Performance Management at AppNeta. He was previously a founder of Tracelytics (acquired by AppNeta), and before that worked on AmieStreet/Songza.com.

@ThingsExpo Stories
The Internet of Things will greatly expand the opportunities for data collection and new business models driven off of that data. In her session at @ThingsExpo, Esmeralda Swartz, CMO of MetraTech, discussed how for this to be effective you not only need to have infrastructure and operational models capable of utilizing this new phenomenon, but increasingly service providers will need to convince a skeptical public to participate. Get ready to show them the money!
DevOps Summit 2015 New York, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that it is now accepting Keynote Proposals. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential.
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective storage designed to handle the massive surge in back-end data in a world where timely analytics is e...
"There is a natural synchronization between the business models, the IoT is there to support ,” explained Brendan O'Brien, Co-founder and Chief Architect of Aria Systems, in this SYS-CON.tv interview at the 15th International Cloud Expo®, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
The Internet of Things will put IT to its ultimate test by creating infinite new opportunities to digitize products and services, generate and analyze new data to improve customer satisfaction, and discover new ways to gain a competitive advantage across nearly every industry. In order to help corporate business units to capitalize on the rapidly evolving IoT opportunities, IT must stand up to a new set of challenges. In his session at @ThingsExpo, Jeff Kaplan, Managing Director of THINKstrategies, will examine why IT must finally fulfill its role in support of its SBUs or face a new round of...
The BPM world is going through some evolution or changes where traditional business process management solutions really have nowhere to go in terms of development of the road map. In this demo at 15th Cloud Expo, Kyle Hansen, Director of Professional Services at AgilePoint, shows AgilePoint’s unique approach to dealing with this market circumstance by developing a rapid application composition or development framework.
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.

ARMONK, N.Y., Nov. 20, 2014 /PRNewswire/ --  IBM (NYSE: IBM) today announced that it is bringing a greater level of control, security and flexibility to cloud-based application development and delivery with a single-tenant version of Bluemix, IBM's platform-as-a-service. The new platform enables developers to build ap...

“The age of the Internet of Things is upon us,” stated Thomas Svensson, senior vice-president and general manager EMEA, ThingWorx, “and working with forward-thinking companies, such as Elisa, enables us to deploy our leading technology so that customers can profit from complete, end-to-end solutions.” ThingWorx, a PTC® (Nasdaq: PTC) business and Internet of Things (IoT) platform provider, announced on Monday that Elisa, Finnish provider of mobile and fixed broadband subscriptions, will deploy ThingWorx® platform technology to enable a new Elisa IoT service in Finland and Estonia.
Advanced Persistent Threats (APTs) are increasing at an unprecedented rate. The threat landscape of today is drastically different than just a few years ago. Attacks are much more organized and sophisticated. They are harder to detect and even harder to anticipate. In the foreseeable future it's going to get a whole lot harder. Everything you know today will change. Keeping up with this changing landscape is already a daunting task. Your organization needs to use the latest tools, methods and expertise to guard against those threats. But will that be enough? In the foreseeable future attacks w...
Building low-cost wearable devices can enhance the quality of our lives. In his session at Internet of @ThingsExpo, Sai Yamanoor, Embedded Software Engineer at Altschool, provided an example of putting together a small keychain within a $50 budget that educates the user about the air quality in their surroundings. He also provided examples such as building a wearable device that provides transit or recreational information. He then reviewed the resources available to build wearable devices at home including open source hardware, the raw materials required and the options available to power s...
The Internet of Things is not new. Historically, smart businesses have used its basic concept of leveraging data to drive better decision making and have capitalized on those insights to realize additional revenue opportunities. So, what has changed to make the Internet of Things one of the hottest topics in tech? In his session at @ThingsExpo, Chris Gray, Director, Embedded and Internet of Things, discussed the underlying factors that are driving the economics of intelligent systems. Discover how hardware commoditization, the ubiquitous nature of connectivity, and the emergence of Big Data a...
From telemedicine to smart cars, digital homes and industrial monitoring, the explosive growth of IoT has created exciting new business opportunities for real time calls and messaging. In his session at @ThingsExpo, Ivelin Ivanov, CEO and Co-Founder of Telestax, shared some of the new revenue sources that IoT created for Restcomm – the open source telephony platform from Telestax. Ivelin Ivanov is a technology entrepreneur who founded Mobicents, an Open Source VoIP Platform, to help create, deploy, and manage applications integrating voice, video and data. He is the co-founder of TeleStax, a...
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core of our infrastructures. At the same time, we have old approaches made new again like micro-services...
The Internet of Things is a misnomer. That implies that everything is on the Internet, and that simply should not be - especially for things that are blurring the line between medical devices that stimulate like a pacemaker and quantified self-sensors like a pedometer or pulse tracker. The mesh of things that we manage must be segmented into zones of trust for sensing data, transmitting data, receiving command and control administrative changes, and peer-to-peer mesh messaging. In his session at @ThingsExpo, Ryan Bagnulo, Solution Architect / Software Engineer at SOA Software, focused on desi...
Today’s enterprise is being driven by disruptive competitive and human capital requirements to provide enterprise application access through not only desktops, but also mobile devices. To retrofit existing programs across all these devices using traditional programming methods is very costly and time consuming – often prohibitively so. In his session at @ThingsExpo, Jesse Shiah, CEO, President, and Co-Founder of AgilePoint Inc., discussed how you can create applications that run on all mobile devices as well as laptops and desktops using a visual drag-and-drop application – and eForms-buildi...
Disruptive macro trends in technology are impacting and dramatically changing the "art of the possible" relative to supply chain management practices through the innovative use of IoT, cloud, machine learning and Big Data to enable connected ecosystems of engagement. Enterprise informatics can now move beyond point solutions that merely monitor the past and implement integrated enterprise fabrics that enable end-to-end supply chain visibility to improve customer service delivery and optimize supplier management. Learn about enterprise architecture strategies for designing connected systems tha...
"For over 25 years we have been working with a lot of enterprise customers and we have seen how companies create applications. And now that we have moved to cloud computing, mobile, social and the Internet of Things, we see that the market needs a new way of creating applications," stated Jesse Shiah, CEO, President and Co-Founder of AgilePoint Inc., in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Recurring revenue models are great for driving new business in every market sector, but they are complex and need to be effectively managed to maximize profits. How you handle the range of options for pricing, co-terming and proration will ultimately determine the fate of your bottom line. In his session at 15th Cloud Expo, Brendan O'Brien, Co-founder at Aria Systems, session examined: How time impacts recurring revenue How to effectively handle customer plan changes The range of pricing and packaging options to consider