|By Dan Kuebrich||
|April 19, 2013 11:00 AM EDT||
You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts - which started coming in about 2.5 hours before AWS started reporting problems - our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our pipeline as smoothly as before. Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency. Here’s what it looked like:
Figure 1: At 10 AM, we saw elevated DB load and disk latency.
12:30 PM EST: Diverting pipeline to S3 instead of EBS, pulling out hair
1:30 PM EST: Frontend offline, AWS incident report
Our workload is very write-heavy, so we first noticed performance problems there, but pretty soon reads made by our frontend were also affected, as a growing fraction of our customer’s data became affected by the mounting EBS problems. At a certain point, any file I/O to affected EBS volumes would cause processes to enter an uninterruptible state, causing our MySQL servers to hang. Here’s a view of the impact on our query sharding service:
Figure 2: Impact on our query sharding service.
6 PM EST: Debate whether backup restore or AWS EBS recovery will finish faster
9 PM EST: Back online
AWS started bringing volumes back online that evening. During the downtime, we continued to collect customer performance data, diverting the pipeline to S3 until our databases came back online. Once the disks were back, we were able to get frontend servers back online, and spun up more pipeline workers to plow through the queued trace backlog as we replayed it from S3. Latency: the functional test of performance metrics You might be surprised by this, but monitoring latency is often the easiest and surest way to catch serious problems. It’s the functional test of system: if any of the gears in the system being monitored start getting jammed, it will likely manifest in increased latency. However, latency can be noisy—how can we make this measurement more controlled, or to extend my testing metaphor, closer to a unit test? Using TraceView, you can set alerts not only on the latency of your application, but also on individual layers of the stack, or particular URLs/controllers. The performance of a predictable query load over time is a great way to detect aberrant database performance, for instance.
Figure 3: Use alerting to detect aberrant database performance.
Alerting on All of the Metrics
When looking at cases of infrastructure degradation, host-level metrics is where the buck stops. Configuration is usually a pain: install agents on each machine and set thresholds. We think the best alerts are at the intersection of easy and actionable. With TraceView, you can set up a single alert and have it cover all hosts in an app.
Continuous processes around the development and deployment of applications are both impacted by -- and a benefit to -- the Internet of Things trend. To help better understand the relationship between DevOps and a plethora of new end-devices and data please welcome Gary Gruver, consultant, author and a former IT executive who has led many large-scale IT transformation projects, and John Jeremiah, Technology Evangelist at Hewlett Packard Enterprise (HPE), on Twitter at @j_jeremiah. The discussion...
Nov. 27, 2015 04:15 AM EST Reads: 705
In his General Session at DevOps Summit, Asaf Yigal, Co-Founder & VP of Product at Logz.io, explored the value of Kibana 4 for log analysis and provided a hands-on tutorial on how to set up Kibana 4 and get the most out of Apache log files. He examined three use cases: IT operations, business intelligence, and security and compliance. Asaf Yigal is co-founder and VP of Product at log analytics software company Logz.io. In the past, he was co-founder of social-trading platform Currensee, which...
Nov. 27, 2015 04:00 AM EST Reads: 184
The Internet of Things is clearly many things: data collection and analytics, wearables, Smart Grids and Smart Cities, the Industrial Internet, and more. Cool platforms like Arduino, Raspberry Pi, Intel's Galileo and Edison, and a diverse world of sensors are making the IoT a great toy box for developers in all these areas. In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists discussed what things are the most important, which will have the most profound...
Nov. 27, 2015 02:30 AM EST Reads: 450
Discussions of cloud computing have evolved in recent years from a focus on specific types of cloud, to a world of hybrid cloud, and to a world dominated by the APIs that make today's multi-cloud environments and hybrid clouds possible. In this Power Panel at 17th Cloud Expo, moderated by Conference Chair Roger Strukhoff, panelists addressed the importance of customers being able to use the specific technologies they need, through environments and ecosystems that expose their APIs to make true ...
Nov. 27, 2015 02:00 AM EST Reads: 514
In today's enterprise, digital transformation represents organizational change even more so than technology change, as customer preferences and behavior drive end-to-end transformation across lines of business as well as IT. To capitalize on the ubiquitous disruption driving this transformation, companies must be able to innovate at an increasingly rapid pace. Traditional approaches for driving innovation are now woefully inadequate for keeping up with the breadth of disruption and change facin...
Nov. 27, 2015 01:30 AM EST Reads: 461
Microservices are a very exciting architectural approach that many organizations are looking to as a way to accelerate innovation. Microservices promise to allow teams to move away from monolithic "ball of mud" systems, but the reality is that, in the vast majority of organizations, different projects and technologies will continue to be developed at different speeds. How to handle the dependencies between these disparate systems with different iteration cycles? Consider the "canoncial problem"...
Nov. 27, 2015 01:00 AM EST Reads: 421
PubNub has announced the release of BLOCKS, a set of customizable microservices that give developers a simple way to add code and deploy features for realtime apps.PubNub BLOCKS executes business logic directly on the data streaming through PubNub’s network without splitting it off to an intermediary server controlled by the customer. This revolutionary approach streamlines app development, reduces endpoint-to-endpoint latency, and allows apps to better leverage the enormous scalability of PubNu...
Nov. 27, 2015 01:00 AM EST Reads: 307
Growth hacking is common for startups to make unheard-of progress in building their business. Career Hacks can help Geek Girls and those who support them (yes, that's you too, Dad!) to excel in this typically male-dominated world. Get ready to learn the facts: Is there a bias against women in the tech / developer communities? Why are women 50% of the workforce, but hold only 24% of the STEM or IT positions? Some beginnings of what to do about it! In her Day 2 Keynote at 17th Cloud Expo, San...
Nov. 27, 2015 01:00 AM EST Reads: 555
I recently attended and was a speaker at the 4th International Internet of @ThingsExpo at the Santa Clara Convention Center. I also had the opportunity to attend this event last year and I wrote a blog from that show talking about how the “Enterprise Impact of IoT” was a key theme of last year’s show. I was curious to see if the same theme would still resonate 365 days later and what, if any, changes I would see in the content presented.
Nov. 26, 2015 10:00 PM EST Reads: 397
It's been a busy time for tech's ongoing infatuation with containers. Amazon just announced EC2 Container Registry to simply container management. The new Azure container service taps into Microsoft's partnership with Docker and Mesosphere. You know when there's a standard for containers on the table there's money on the table, too. Everyone is talking containers because they reduce a ton of development-related challenges and make it much easier to move across production and testing environm...
Nov. 26, 2015 06:00 PM EST Reads: 578
Internet of @ThingsExpo, taking place June 7-9, 2016 at Javits Center, New York City and Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 18th International @CloudExpo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo New York Call for Papers is now open.
Nov. 26, 2015 03:30 PM EST Reads: 526
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York and Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty ...
Nov. 26, 2015 03:15 PM EST Reads: 524
We are rapidly moving to a brave new world of interconnected smart homes, cars, offices and factories known as the Internet of Things (IoT). Sensors and monitoring devices will touch every part of our lives. Let's take a closer look at the Internet of Things. The Internet of Things is a worldwide network of objects and devices connected to the Internet. They are electronics, sensors, software and more. These objects connect to the Internet and can be controlled remotely via apps and programs. ...
Nov. 26, 2015 02:15 PM EST Reads: 500
The Internet of Things (IoT) is growing rapidly by extending current technologies, products and networks. By 2020, Cisco estimates there will be 50 billion connected devices. Gartner has forecast revenues of over $300 billion, just to IoT suppliers. Now is the time to figure out how you’ll make money – not just create innovative products. With hundreds of new products and companies jumping into the IoT fray every month, there’s no shortage of innovation. Despite this, McKinsey/VisionMobile data...
Nov. 26, 2015 11:00 AM EST Reads: 446
Just over a week ago I received a long and loud sustained applause for a presentation I delivered at this year’s Cloud Expo in Santa Clara. I was extremely pleased with the turnout and had some very good conversations with many of the attendees. Over the next few days I had many more meaningful conversations and was not only happy with the results but also learned a few new things. Here is everything I learned in those three days distilled into three short points.
Nov. 26, 2015 10:00 AM EST Reads: 292
One of the most important tenets of digital transformation is that it’s customer-driven. In fact, the only reason technology is involved at all is because today’s customers demand technology-based interactions with the companies they do business with. It’s no surprise, therefore, that we at Intellyx agree with Patrick Maes, CTO, ANZ Bank, when he said, “the fundamental element in digital transformation is extreme customer centricity.” So true – but note the insightful twist that Maes adde...
Nov. 26, 2015 10:00 AM EST Reads: 417
DevOps is about increasing efficiency, but nothing is more inefficient than building the same application twice. However, this is a routine occurrence with enterprise applications that need both a rich desktop web interface and strong mobile support. With recent technological advances from Isomorphic Software and others, rich desktop and tuned mobile experiences can now be created with a single codebase – without compromising functionality, performance or usability. In his session at DevOps Su...
Nov. 26, 2015 09:45 AM EST Reads: 372
Nov. 26, 2015 09:30 AM EST Reads: 173
As organizations realize the scope of the Internet of Things, gaining key insights from Big Data, through the use of advanced analytics, becomes crucial. However, IoT also creates the need for petabyte scale storage of data from millions of devices. A new type of Storage is required which seamlessly integrates robust data analytics with massive scale. These storage systems will act as “smart systems” provide in-place analytics that speed discovery and enable businesses to quickly derive meaningf...
Nov. 26, 2015 09:30 AM EST Reads: 380
You may have heard about the pets vs. cattle discussion – a reference to the way application servers are deployed in the cloud native world. If an application server goes down it can simply be dropped from the mix and a new server added in its place. The practice so far has mostly been applied to application deployments. Management software on the other hand is treated in a very special manner. Dedicated resources are set aside to run the management software components and several alerting syst...
Nov. 26, 2015 08:00 AM EST Reads: 155