Welcome!

Microservices Expo Authors: Pat Romanski, Dalibor Siroky, Stackify Blog, Elizabeth White, Liz McMillan

Related Topics: Java IoT, Microservices Expo, Machine Learning , Agile Computing, Release Management , @CloudExpo

Java IoT: Article

Case Study: It Takes More than a Tool

Swarovski’s 10 Requirements for Creating an APM Culture

Swarovski - the leading producer of cut crystal in the world - relies on its eCommerce store as much like other companies in the highly competitive eCommerce environment. Swarovski's story is no different from others in this space: They started with "Let's build a website to sell our products online" a couple of years ago and quickly progressed to "We sell to 60 million annual visitors across 23 countries in six languages." There were bumps along the road and they realized that it takes more than just a bunch of servers and tools to keep the site running.

Why APM and why you don't just need a tool
Swarovski relies on Intershop's eCommerce platform and faced several challenges as they rapidly grew. Their challenges required them to apply Application Performance Management (APM) practices to ensure they could fulfill the business requirements to keep pace with customer growth while maintaining an excellent user experience. The most insightful comment I heard was from René Neubacher, Senior eBusiness Technology Consultant at Swarovski: "APM is not just about software. APM is a culture, a mindset and a set of business processes. APM software supports that."

René recently discussed their Journey to APM, what their initial problems were and what requirements they ended up having on APM and the tools they needed to support their APM strategy. By now they reached the next level of maturity by establishing a Performance Center of Excellence. This allows them to tackle application performance proactively throughout the organization instead of putting out fires reactively in production.

This article describes the challenges they faced, the questions that arose and the new generation APM requirements that paved the way forward in their performance journey:

The Challenge!
Swarvoski had traditional system monitoring in place on all their systems across their delivery chain including web servers, application servers, SAP, database servers, external systems and the network. Knowing that each individual component is up and running 99.99% of the time is great but no longer sufficient. How might these individual component outages impact the user experience of their online shoppers? Who is actually responsible for the end user experience and how should you monitor the complete delivery chain and not just the individual components? These and other questions came up when the eCommerce site attracted more customers which was quickly followed by more complaints about their user experience:

APM includes getting a holistic view of the complete delivery chain and requires someone to be responsible for end user experience.

Questions that had no answers
In addition to "Who is responsible in case users complain?" the other questions that needed to be urgently addressed included:

  • How often is the service desk called before IT knows that there is a problem?
  • How much time is spent in searching for system errors versus building new features?
  • Do we have a process to find the root-cause when a customer reports a problem?
  • How do we visualize our services from the customer‘s point of view?
  • How much revenue, brand image and productivity are at risk or lost while IT is searching for the problem?
  • What to do when someone says "it‘s slow"?

The Ten Requirements
These unanswered questions triggered the need to move away from traditional system monitoring and develop the requirements for new generation APM and user experience management.

1: Support State-of-the-Art Architecture
Based on their current system architecture it was clear that Swarovski needed an approach that was able to work in their architecture, now and in the future. The rise of more interactive Web 2.0 and mobile applications had to be factored in to allow monitoring end users from many different devices and regardless of whether they used a web application or mobile native application as their access point.

Transactions need to be followed from the browser all the way back to the database. It is important to support distributed transactions. This approach also helps to spot architectural and deployment problems immediately

2: 100% transactions and clicks - No Averages
Based on their experience, Swarovski knew that looking at average values or sampled data would not be helpful when customers complained about bad performance. Responding to a customer complaint with "Our average user has no problem right now - sorry for your inconvenience" is not what you want your helpdesk engineers to use as a standard phrase. Averages or sampling also hides the real problems you have in your system. Check out the blog post Why Averages Suck by Michael Kopp for more detail.

Measuring end user performance of every customer interaction allows for quick identification of regional problems with CDNs, Third Parties or Latency.

Having 100% user interactions and transactions available makes it easy to identify the root cause for individual users

3: Business Visibility
As the business had a growing interest in the success of the eCommerce platform, IT had to demonstrate to the business what it took to fulfill their requirements and how business requirements are impacted by the availability or the lack of investment in the application delivery chain.

Correlating the number of Visits with Performance on incoming Orders illustrates the measurable impact of performance on revenue and what it takes to support business requirements.

4: Impact of 3rd Parties and CDNs
It was important to not only track transactions involving their own Data Center but all user interactions with their web site - even those delivered through CDNs or third parties. All of these interactions make up the user experience and therefore all of it needs to be analyzed.

Seeing the actual load impact of third-party components or content delivered from CDNs enables IT to pinpoint user experience problems that originate outside their own data center.

5: Across the life cycle - supporting collaboration and tearing down silos
The APM initiative was started because Swarovski reacted to problems happening in production. Fixing these problems in production is only the first step. Their ultimate goal is to become pro-active by finding and fixing problems in development or testing-before they spill over into production. Instead of relying on different sets of tools with different capabilities, the requirement is to use one single solution that is designed to be used across the application lifecycle (Developer Workstation, Continuous Integration, Testing, Staging and Production). It will make it easier to share application performance data between lifecycle stages allowing individuals to not only easily look at data from other stages but also compare data to verify impact and behavior of code changes between version updates.

Continuously catching regressions in Development by analyzing unit and performance tests allows application teams to become more proactive.

Pinpointing integration and scalability issues, continuously, in acceptance and load testing makes testing more efficient and prevents problems from reaching production.

6: Down to the source code
In order to speed up problem resolution Swarovski's operations and development teams require as much code-level insight as possible - not only for their own engineers who are extending the Intershop eCommerce Platform but also for Intershop to improve their product. Knowing what part of the application code is not performing well with which input parameters or under which specific load on the system eliminates tedious reproduction of the problem. The requirement is to lower the Mean Time To Repair (MTTR) from as much as several days down to only a couple of hours.

The SAP Connector turned out to have a performance problem. This method-level detailed information was captured without changing any code.

7: Zero/Acceptable overhead
"Who are we kidding? There is nothing like zero overhead especially when you need 100% coverage!" - Just the words from René when you explained that requirement. And he is right: once you start collecting information from a production system you add a certain amount of overhead. A better term for this would be "imperceptible overhead" - overhead that's so small, you don't notice it.

What is the exact number? It depends on your business and your users. The number should be worked out from the impact on the end user experience, rather than additional CPU, memory or network bandwidth required in the data center. Swarovski knew they had to achieve less than 2% overhead on page load times in production, as anything more would have hurt their business; and that's what they achieved.

8: Centralized data collection and administration
Running a distributed eCommerce application that gets potentially extended to additional geographical locations requires an APM system with a centralized data collection and administration option. It is not feasible to collect different types of performance information from different systems, servers or even data centers. It would either require multiple different analysis tools or data transformation to a single format to use it for proper analysis.

Instead of this approach, a single unified APM system was required by Swarovski. Central administration is equally important as they need to eliminate the need to rely on remote IT administrators to make changes to the monitored system, for example, simple tasks such as changing the level of captured data or upgrading to a new version.

By storing and accessing performance data from a single, centralized repository, enables fast and powerful analytic and visualization. For example, system metrics such as CPU utilization can be correlated with end-user response time or database execution time - all displayed on one single dashboard.

9: Auto-Adapting Instrumentation without digging through code
As the majority of the application code is not developed in-house but provided by Intershop, it is mandatory to get insight into the application without doing any manual code changes. The APM system must auto-adapt to changes so that no manual configuration change is necessary when a new version of the application is deployed.

This means Swarovski can focus on making their applications positively contribute to business outcomes; rather than spend time maintaining IT systems.

10: Ability to extend
Their application is an always growing an ever-changing IT environment. Where everything might have been deployed on physical boxes it might be moved to virtualized environments or even into a public cloud environment.

Whatever the extension may be - the APM solution must be able to adapt to these changes and also be extensible to consume new types of data sources, e.g., performance metrics from Amazon Cloud Services or VMware, Cassandra or other Big Data Solutions or even extend to legacy Mainframe applications and then bring these metrics into the centralized data repository and provide new insights into the application's performance.

Extending the application monitoring capabilities to Amazon EC2, Microsoft Windows Azure, a public or private cloud enables the analysis of the performance impact of these virtualized environments on end user experience.

The Solution and the Way Forward
Needless to say that Swarovski took the first step in implementing APM as a new process and mindset in their organization. They are now in the next phase of implementing a Performance Center of Excellence. This allows them moving from Reactive Performance Troubleshooting to Proactive Performance Prevention.

Stay tuned for more blog posts on the Performance Center of Excellence and how you can build one in your own organization. The key message is that it is not about just using a bunch of tools. It is about living and breathing performance throughout the organization. If you are interested in this check out the blogs by Steve Wilson: Proactive vs Reactive: How to prevent problems instead of fixing them faster and Performance in Development is the Chief Cornerstone.

More Stories By Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@MicroservicesExpo Stories
While some developers care passionately about how data centers and clouds are architected, for most, it is only the end result that matters. To the majority of companies, technology exists to solve a business problem, and only delivers value when it is solving that problem. 2017 brings the mainstream adoption of containers for production workloads. In his session at 21st Cloud Expo, Ben McCormack, VP of Operations at Evernote, discussed how data centers of the future will be managed, how the p...
The nature of test environments is inherently temporary—you set up an environment, run through an automated test suite, and then tear down the environment. If you can reduce the cycle time for this process down to hours or minutes, then you may be able to cut your test environment budgets considerably. The impact of cloud adoption on test environments is a valuable advancement in both cost savings and agility. The on-demand model takes advantage of public cloud APIs requiring only payment for t...
It has never been a better time to be a developer! Thanks to cloud computing, deploying our applications is much easier than it used to be. How we deploy our apps continues to evolve thanks to cloud hosting, Platform-as-a-Service (PaaS), and now Function-as-a-Service. FaaS is the concept of serverless computing via serverless architectures. Software developers can leverage this to deploy an individual "function", action, or piece of business logic. They are expected to start within milliseconds...
As DevOps methodologies expand their reach across the enterprise, organizations face the daunting challenge of adapting related cloud strategies to ensure optimal alignment, from managing complexity to ensuring proper governance. How can culture, automation, legacy apps and even budget be reexamined to enable this ongoing shift within the modern software factory? In her Day 2 Keynote at @DevOpsSummit at 21st Cloud Expo, Aruna Ravichandran, VP, DevOps Solutions Marketing, CA Technologies, was jo...
You know you need the cloud, but you’re hesitant to simply dump everything at Amazon since you know that not all workloads are suitable for cloud. You know that you want the kind of ease of use and scalability that you get with public cloud, but your applications are architected in a way that makes the public cloud a non-starter. You’re looking at private cloud solutions based on hyperconverged infrastructure, but you’re concerned with the limits inherent in those technologies.
Is advanced scheduling in Kubernetes achievable?Yes, however, how do you properly accommodate every real-life scenario that a Kubernetes user might encounter? How do you leverage advanced scheduling techniques to shape and describe each scenario in easy-to-use rules and configurations? In his session at @DevOpsSummit at 21st Cloud Expo, Oleg Chunikhin, CTO at Kublr, answered these questions and demonstrated techniques for implementing advanced scheduling. For example, using spot instances and co...
The cloud era has reached the stage where it is no longer a question of whether a company should migrate, but when. Enterprises have embraced the outsourcing of where their various applications are stored and who manages them, saving significant investment along the way. Plus, the cloud has become a defining competitive edge. Companies that fail to successfully adapt risk failure. The media, of course, continues to extol the virtues of the cloud, including how easy it is to get there. Migrating...
For DevOps teams, the concepts behind service-oriented architecture (SOA) are nothing new. A style of software design initially made popular in the 1990s, SOA was an alternative to a monolithic application; essentially a collection of coarse-grained components that communicated with each other. Communication would involve either simple data passing or two or more services coordinating some activity. SOA served as a valid approach to solving many architectural problems faced by businesses, as app...
Some journey to cloud on a mission, others, a deadline. Change management is useful when migrating to public, private or hybrid cloud environments in either case. For most, stakeholder engagement peaks during the planning and post migration phases of a project. Legacy engagements are fairly direct: projects follow a linear progression of activities (the “waterfall” approach) – change managers and application coders work from the same functional and technical requirements. Enablement and develo...
Gone are the days when application development was the daunting task of the highly skilled developers backed with strong IT skills, low code application development has democratized app development and empowered a new generation of citizen developers. There was a time when app development was in the domain of people with complex coding and technical skills. We called these people by various names like programmers, coders, techies, and they usually worked in a world oblivious of the everyday pri...
From manual human effort the world is slowly paving its way to a new space where most process are getting replaced with tools and systems to improve efficiency and bring down operational costs. Automation is the next big thing and low code platforms are fueling it in a significant way. The Automation era is here. We are in the fast pace of replacing manual human efforts with machines and processes. In the world of Information Technology too, we are linking disparate systems, softwares and tool...
DevOps is good for organizations. According to the soon to be released State of DevOps Report high-performing IT organizations are 2X more likely to exceed profitability, market share, and productivity goals. But how do they do it? How do they use DevOps to drive value and differentiate their companies? We recently sat down with Nicole Forsgren, CEO and Chief Scientist at DORA (DevOps Research and Assessment) and lead investigator for the State of DevOps Report, to discuss the role of measure...
DevOps is under attack because developers don’t want to mess with infrastructure. They will happily own their code into production, but want to use platforms instead of raw automation. That’s changing the landscape that we understand as DevOps with both architecture concepts (CloudNative) and process redefinition (SRE). Rob Hirschfeld’s recent work in Kubernetes operations has led to the conclusion that containers and related platforms have changed the way we should be thinking about DevOps and...
"As we've gone out into the public cloud we've seen that over time we may have lost a few things - we've lost control, we've given up cost to a certain extent, and then security, flexibility," explained Steve Conner, VP of Sales at Cloudistics,in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
These days, APIs have become an integral part of the digital transformation journey for all enterprises. Every digital innovation story is connected to APIs . But have you ever pondered over to know what are the source of these APIs? Let me explain - APIs sources can be varied, internal or external, solving different purposes, but mostly categorized into the following two categories. Data lakes is a term used to represent disconnected but relevant data that are used by various business units wit...
With continuous delivery (CD) almost always in the spotlight, continuous integration (CI) is often left out in the cold. Indeed, it's been in use for so long and so widely, we often take the model for granted. So what is CI and how can you make the most of it? This blog is intended to answer those questions. Before we step into examining CI, we need to look back. Software developers often work in small teams and modularity, and need to integrate their changes with the rest of the project code b...
"I focus on what we are calling CAST Highlight, which is our SaaS application portfolio analysis tool. It is an extremely lightweight tool that can integrate with pretty much any build process right now," explained Andrew Siegmund, Application Migration Specialist for CAST, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"Cloud4U builds software services that help people build DevOps platforms for cloud-based software and using our platform people can draw a picture of the system, network, software," explained Kihyeon Kim, CEO and Head of R&D at Cloud4U, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications. Kubernetes was originally built by Google, leveraging years of experience with managing container workloads, and is now a Cloud Native Compute Foundation (CNCF) project. Kubernetes has been widely adopted by the community, supported on all major public and private cloud providers, and is gaining rapid adoption in enterprises. However, Kubernetes may seem intimidating and complex ...
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm. In their Day 3 Keynote at 20th Cloud Expo, Chris Brown, a Solutions Marketing Manager at Nutanix, and Mark Lav...