Get advice from top hands-on experts
How have monitoring tools adapted to meet the challenge of the various changes in technologies, practices and business needs? A demonstration of observability tools will provide some answers.
This conference, organised during the Alenia Production Tour 2022, gathered multiple CIOs, monitoring leaders and operational experts, including Wilfried Villisek Faucounau, Principal Engineer (DevOps, Ops and SRE Lead) at Société Générale, Dimitris Finas, Senior Technical Advisor at Lightstep, and Matthieu Chavaroc, IT and Business Operations Specialist at Alenia.
We touched on topics such as the history of monitoring, observability, RED metrics, the Christmas tree problem, SLOs illustrated with Zalando use cases, and much more. We hope you enjoy this conversation.
Matthieu Chavaroc, Alenia - Thank you for joining us today. We really appreciate your presence. We couldn't have had a conference on application production without involving monitoring. Monitoring is key to avoiding incidents and to detecting and resolving them as quickly and effectively as possible should they occur.
To develop this topic further, let me introduce you to Wilfried Villisek Faucounau. He is a DevOps, Ops and SRE professional at Société Générale, and he will be discussing the evolution of monitoring and introducing the concept of observability. Dimitris Finas, a Senior Technical Advisor at Lightstep, will then provide an example of the implementation of observability.
Thank you once again for joining us today, and we hope you find this conference informative and useful.
Wilfried Villisek Faucounau, Société Générale - Thank you, Matthieu. I'm very happy to be here, and we're going to talk about monitoring. I'll give a brief history of monitoring to show that humans have been doing it since the dawn of time. We have been measuring time, speed, weight, distances, and so on for a long time, and therefore we have been monitoring all along. We all wear watches to monitor time and check whether we are on schedule.
Let's take a simple example: when you take your car on a journey, the first thing you do is check the fuel. There are different signals on the car dashboard that tell you the status of your system and whether you can go or not. This is a step in checking if everything is up and running. It is actually very advanced monitoring, and I would love to have that in my own application. Then, you have the speedometer to monitor the speed of your vehicle and so on. Monitoring is everywhere, and everyone is doing it.
But what does it mean in IT? It means collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes. That is the definition from Google's SRE book. But let's see where the monitoring we have today comes from.
Let's go back to the 80s and 90s. What were the computers like at that time? They were local, in your room, on your desk. They were not connected to each other, and you usually had a very simple system with only one user. You did not have to monitor much at the time: you had to monitor your drive, your memory, maybe your CPU if your application was too slow. But it was very simple. We had some very simple tools for Linux and Windows, just three ugly bars.
Then we started to connect all these systems together within a company, for example. Different people in the company had their systems connected over a network. That's when we had to start having more advanced monitoring, because if you were using an application hosted on another system, how would you know if the application was not working? Was the problem coming from your system, or from the other system? We also had to monitor the hardware, system resources, and so on. That's when we started using tools like nmon, MRTG and Big Brother.
Now that we have computers connected together, what is the next step? In the 21st century, the Internet era, we started connecting to the World Wide Web. But what really changed at that time is that the Internet became a real place to do business. New protocols and new technologies came with it. Websites started selling goods, and if the website was down, it was as if your store was closed. You had no cash coming in, customers could do nothing but turn to a competitor, and you could go out of business very fast. That was the first big change.
With the Internet, your application is exposed to the outside world and runs on servers that are not in the same room. How do you know that your application is running properly? It's very complicated, and if it's not running properly, you are not doing business. That's when monitoring moved from being something very technical to something more business oriented. That's when we started checking application logs, adding health checks, pinging the application externally to ensure it was working, and that's also when metrics like geographic usage came into play.
The 21st century brought other big changes: the cloud and virtualization arrived. We were operating systems and applications, but we did not have to manage the infrastructure anymore. We were paying someone to provide us with a service. We had service level agreements with them, and if there was ever something wrong with the hardware, it was not actually our problem at all. That's why we could shift even more towards monitoring something application-oriented rather than something very technical and hardware-oriented. Containerization was an even bigger change, since a container is supposed to run on any kind of infrastructure.
Another very big cultural change was agility, continuous integration, continuous delivery, and DevOps. Maybe 10 or 15 years ago, we had some very big applications that were released once a year, twice if we were lucky. Now, we have applications that are released monthly, weekly, daily, or even several times a day. The system that we must monitor is now moving constantly. Before, when your packages and infrastructure were not moving, monitoring was not that complicated and we didn't need very sophisticated monitoring systems. But now that everything is moving all the time, it's not the same. We have increased complexity, an increased amount of data, and applications that are changing all the time. If a microservice in your application goes down, that's maybe not an issue; it's maybe just a glitch. So it's hard to understand and to know what is normal or not.
"We now have a highly complex, interconnected system that is continuously moving and being updated. We also have almost zero tolerance for downtime or degradation since all the applications are highly interconnected. If we are not providing the service, we might not get paid for the service that we're providing. And, we have an extremely large amount of data to process. That's why monitoring is moving towards observability". W. Villisek Faucouno, Société Générale
To sum up, here are the three big steps of monitoring:
So, what is observability? I'll take a very simple example. How do you think I am feeling right now? Maybe a bit stressed? Well, maybe. I speak fast, it's true, but I'm not stumbling, I'm walking more or less normally, and I can look at my hands to see that I'm okay. We do the same with applications. When you go to a doctor, they don't put you into a scanner right away; they check your basic external signals to make a quick diagnosis. We do the same with applications, using key signals observed from outside the application, because we cannot go into the details anymore. Dimitris is going to demonstrate that now.
Dimitris Finas, Lightstep - The first thing about observability is that it's a change in what you measure. When you're doing observability, you're looking at external signals, and when doing so, you change the signals that you monitor, meaning they are less linked to the infrastructure and more linked to the application and the customer experience.
Typically, you need more than the USE metrics: Utilization, Saturation, and Errors. What does that mean? It means you measure the utilization of your CPU; you measure the saturation of your disk or of your memory. Those are typical USE metrics. When you go to observability, you use RED metrics in addition to USE. You still measure your infrastructure saturation, but you add Rate, Errors, and Duration (i.e. latency). Why is that? Because the customer perceives two things when using your application. One is the latency: is it slow or is it responsive? The other is the error rate: do I see an error page? Those are the two most important perceptions for user experience. That's why Google's SRE book defines very similar golden signals (latency, traffic, errors, and saturation) as the preferred things to monitor.
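To make the RED idea concrete, here is a minimal, illustrative sketch (not taken from the speakers' own tooling) of a Python decorator that records Rate, Errors and Duration for a request handler. The endpoint name and the in-memory counters are hypothetical; in practice these values would be exported to a metrics backend rather than kept in a dictionary.

```python
import time
from collections import defaultdict

# Illustrative in-memory RED store; a real system would export these
# to a metrics backend (Prometheus, a vendor, etc.).
red = {
    "rate": defaultdict(int),        # request count per endpoint
    "errors": defaultdict(int),      # error count per endpoint
    "durations": defaultdict(list),  # latencies in seconds per endpoint
}

def observe_red(endpoint):
    """Decorator recording Rate, Errors and Duration for a handler."""
    def wrap(handler):
        def inner(*args, **kwargs):
            red["rate"][endpoint] += 1
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                red["errors"][endpoint] += 1
                raise
            finally:
                red["durations"][endpoint].append(time.perf_counter() - start)
        return inner
    return wrap

@observe_red("checkout")  # hypothetical endpoint name
def checkout(order):
    ...  # business logic goes here
```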
"Remember that there are three important things that you could use as a tool for observability, which are the traces, the metrics, and the logs. We call them the three pillars of observability." D. Finas, Lightstep
I think everybody knows about metrics and logs, so I will give some details about traces: a trace is the way to follow a single transaction going through all your services from top to bottom, so it's an end-to-end view of your transaction. And there is a revolution with observability, and this revolution has a name. That name is OpenTelemetry.
So, what is OpenTelemetry? It's an open-source project that you use to collect, with a single technology, all your traces, metrics, and logs. The important point in OpenTelemetry is that it should be reliable, it should be performant, and it should be vendor-agnostic. And that's why it is open source. This project has been created by my company, Lightstep, and Google together. It is today used by Google for all their cloud platforms, AWS for all their CloudWatch monitoring, and it is also used by Microsoft on Azure, meaning that all major public cloud providers are using this open-source technology to provide their metrics to their customers. One reason for this is that it is vendor-agnostic and performant.
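As a rough illustration of what OpenTelemetry instrumentation looks like in application code, the sketch below uses the OpenTelemetry Python SDK to create a couple of nested spans. The tracer name, span names and attribute are invented for the example, and finished spans are simply printed to the console rather than sent to any vendor.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("shop.checkout")  # hypothetical instrumentation name

# A parent span for the business operation, with a nested span for a downstream call.
with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("order.items", 3)
    with tracer.start_as_current_span("reserve-stock"):
        ...  # call the downstream service here
```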
By the way, to show you the importance of this project: it is the second most active project in the CNCF. The CNCF is the Cloud Native Computing Foundation, which hosts the open-source projects that have become critical components of the global technology infrastructure. The most important project for the CNCF is Kubernetes, followed by OpenTelemetry.
Look at Kubernetes five years ago and compare it to today: microservices platforms now overwhelmingly run on Kubernetes. I would say that it will be the same for OpenTelemetry. We can already see it used by all public cloud providers. Even monitoring tools such as Splunk, Dynatrace and Datadog are now contributing to this project because they see it as a big change in the future, particularly for being vendor-agnostic.
To put it in a simple way, OpenTelemetry can be seen as the one agent to rule them all. Initially, the purpose was to collect everything in a single way so that users could own their data and not be reliant on proprietary monitoring solutions.
In terms of architecture, it means that when you collect data on your application or its surrounding infrastructure, you use OpenTelemetry agents. This works for modern applications using Kubernetes, but it also works with virtual machines, vCenter and other technologies. At some point, you send all this data to a proxy, the collector, and from there you can forward it to any vendor you want: Lightstep, Prometheus, Splunk, Honeycomb or any other vendor that is compatible with OpenTelemetry.
This means that if you need to switch from one vendor to another, or add a new vendor because you want to send your traces to Lightstep and your metrics to Prometheus, you can do so with only three lines of a YAML file that change the configuration and send the data to the new target. You don't have to change anything else; you don't need to add new agents to your infrastructure, you only have to define a new target in your collector proxy. This is why so many vendors and users are already adopting it.
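As a sketch of the application side of this architecture (assuming the Python SDK, its OTLP exporter, and a collector listening on its default gRPC port), the code below only ever talks to the local collector; pointing the data at a different vendor is then a matter of editing the exporters and pipelines in the collector's own YAML configuration, not of touching this code.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The application exports everything to the local OpenTelemetry Collector;
# which vendor(s) the data finally reaches is decided in the collector's YAML,
# not here. The endpoint is the collector's default OTLP/gRPC port.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```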
To give you an example, Zalando selected OpenTelemetry and Lightstep three years ago, at the beginning of the project. They used it for three main reasons:
This presentation is based on the testimony of Heinrich Hartmann, who is Head of SRE at Zalando and allowed us to share this content with you.
Let's zoom in on the Zalando architecture. Zalando is a purely digital-native company that sells shoes, clothes, and fashion, with 5,000 microservices connected to each other. You may say that you're not a digital-native company, but even if you only want to do some of what they do, it's possible. I was recently at a historic bank in Scotland, and they told me that 95% of their IT is legacy and that they mostly rely on mainframes for their most critical operations. The remaining 5% are microservices, representing more than 3,000 containers running in production. Do you think it's possible to monitor 3,000 containers and a critical mobile application without the same tools as Zalando? That's why, even though they're not digital natives, they were really interested in acquiring this capability.
When Zalando started, they acquired a capability they were missing: distributed tracing. Whenever they have a transaction, they want to trace the customer's journey from the website down to the APIs that are called behind it. Why is that? It's to monitor the customer's experience and the performance of their services. This is typically achieved through distributed traces. Why is it interesting? It's simple: with distributed traces, you can quickly identify errors and latency. You just have to look at the red line or the longest line, and you can quickly find where the issue is.
Zalando implemented distributed traces during their first year using the product. Then they implemented the RED metrics, which show the error rate, the latency (the response time perceived by the customer on the front end), and the throughput, meaning the usage of the application over time. Whenever there is a spike in errors, you can click on the red dot representing the failing transaction, display the trace, and see the red line showing where the issue is.
Behind this kind of dashboard, you can have one service or one application, but you can also have several applications or several services.
Now let's look at how Zalando defined their SLOs. Initially, Zalando had 200 product teams, so they defined SLOs for each product, but this generated too many errors and was too far from the business. So they changed their approach and implemented what is called a "critical business operation SLO".
What is a critical business operation? Examples include “browse catalog”, “view product details”, “add to wish list”, and “payment”. These are very visible from the customer's point of view, are essential to the business, and are restricted to a small number.
"Zalando only has 11 critical business operations to monitor the full company. Beyond each of these critical business operations, they have one VP of Zalando who is the business owner of the SLO." D. Finas, Lightstep
As simple as it may sound, to define the business SLOs they simply look at what the customer is looking for.
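To illustrate what such an SLO means in practice, here is the usual error-budget arithmetic for one hypothetical critical business operation; the figures below are invented for the example, not Zalando's.

```python
# Hypothetical numbers for one critical business operation, e.g. "payment".
slo_target = 0.999          # 99.9% of requests should succeed over the window
total_requests = 1_200_000  # requests observed during the 28-day window
failed_requests = 900       # requests that ended in an error

availability = 1 - failed_requests / total_requests   # 0.99925 -> the SLO is met
error_budget = (1 - slo_target) * total_requests      # 1,200 failures allowed
budget_consumed = failed_requests / error_budget      # 0.75 -> 75% of budget spent

print(f"availability={availability:.5f}, error budget consumed={budget_consumed:.0%}")
```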
Now Zalando is implementing the most recent feature, which they called adaptive paging.
Adaptive paging is a solution to the "Christmas tree problem": whenever an alert occurs, it can trigger a cascade of other alerts across multiple services in the chain. This can lead to pulling hundreds of people into crisis calls and spending a great amount of time investigating each service in the chain.
For example, say the “stock reservation” service throws an exception, and this is critical for the business SLO as it is used for “placing orders”. The issue is that this exception will lead to an exception in “logistics”, which will generate an exception in the “order” service, which will in turn impact the “checkout” service. If you call all the teams for these different services, you will have 100 people on board, and nobody wants that.
Another example comes from a bank in Scotland a few weeks ago, where a similar problem occurred: they were alerted on Twitter because their mobile application was not working. For this kind of incident, they had up to 200 people on board; these are real figures. Ultimately, they discovered that the source of the problem was an identity service far down the chain.
Zalando had this kind of issue because they have 200 product teams, and no single team is responsible for the whole chain. What did they do? They pull examples of traces linked to the alert.
The red line displayed is actually just a parsed JSON file, with a specific attribute that tells you the error is here. That's how Lightstep displays it in red. So they just look for this attribute, they know exactly which service is on the lowest line, at the lowest level, and they send the alert to that team only. For more than 90% of their use cases it works very well, and all the teams really like this feature, simply because fewer people are put on alert and it's more efficient.
To conclude, I would say that it's not magic, and it doesn't happen in a week. The journey that I showed you took Zalando three years. They started simply with traces, then the metrics, then they went for adaptive paging and business SLOs. As you can see, even for digital natives, it takes time. It was also a cultural change for their teams, who were used to simply using logs.
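Lightstep's actual mechanism is not detailed here, but as a simplified sketch of the idea, the following treats a trace as a list of spans (each with a parent reference, a service name and an error flag standing in for the attribute mentioned above), finds the deepest erroring span, and pages only the team owning that service. The trace data and routing table are entirely made up.

```python
# Hypothetical trace: each span has an id, a parent id, the emitting service,
# and an error flag (the "attribute" the speaker refers to).
trace = [
    {"id": "a", "parent": None, "service": "checkout", "error": True},
    {"id": "b", "parent": "a", "service": "order", "error": True},
    {"id": "c", "parent": "b", "service": "logistics", "error": True},
    {"id": "d", "parent": "c", "service": "stock-reservation", "error": True},
]

on_call = {"stock-reservation": "team-stock@example.com"}  # made-up routing table

def root_cause_service(spans):
    """Return the service of the deepest erroring span in the trace."""
    depth = {}
    def span_depth(span_id):
        if span_id not in depth:
            parent = next(s["parent"] for s in spans if s["id"] == span_id)
            depth[span_id] = 0 if parent is None else 1 + span_depth(parent)
        return depth[span_id]
    errored = [s for s in spans if s["error"]]
    return max(errored, key=lambda s: span_depth(s["id"]))["service"]

service = root_cause_service(trace)
print(f"Paging {on_call.get(service, 'default on-call')} for {service}")
# -> Paging team-stock@example.com for stock-reservation
```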
Two last comments:
"The best SLOs are most likely the ones that are visible to your customers." D. Finas, Lightstep
MA – Thank you Dimitris! At Alenia, we are agnostic about the technical solution to use for monitoring. What matters here is to understand the value of having observability for complex systems, and particularly for complex chains. Whether you are using Dynatrace, Lightstep, or Splunk, it's important to get this visibility and, as Wilfried said, to "observe" your system and base your findings on the outputs of your systems.
WVF - It's also important to involve the development team when implementing observability tools in your application. If you don't involve the developers in the production process, you will never succeed in achieving these kinds of results. And it takes time. In my application at Société Générale, we are using the Elastic Stack. We have that kind of observability. When we have an issue, we have the trace, and we know directly where the issue comes from. But it took us three or four years to get this running, with a lot of involvement from the developers.
MA – We'll take questions now. "What would be the first steps into observability?"
DF - First, involve the business. Try to make it simple. Don't have too many SLOs. Instead, try to select the simplest SLO, even if there are more than 50 different services for each SLO. That's not the problem because technically, we are able to solve this issue later on.
WVF - You have to explain to your business that investing in this kind of process is essential. In my job, I have given a lot of presentations to the business to explain the importance of monitoring in DevOps, and of investing in it even when the application is working. That's how you shift from technical monitoring to business monitoring. You really must cooperate and work with your business as closely as possible to demystify a lot of things and make the topic accessible…
DF - … and also changing the metrics, as we said, moving to RED to measure Rate, Errors, and Duration.
MA – Time for a new question: “What is observability?”
DF - I would say observability is just monitoring plus plus. And by the way, this also shows the difference between traces and metrics in monitoring. Metrics are typical of monitoring; you monitor status. You want to know the status of your system: is it red or green?
"With observability, you've got everything that you do with monitoring, but in addition to this, you want to understand what is happening and what is the root cause of the problem or what is the behavior of your application." D. Finas, Lightstep
That's why, for observability, you also need to add traces (or logs, because a trace in the end is just structured logs: a file with timestamps, context and everything; traces are just more efficient).
I see a new question: "What is Prometheus?" It's an open-source tool that stores metrics in a time-series database and lets you query and display them. It is used by nearly all customers who are starting with Kubernetes monitoring, so it's powerful and probably one of the most well-known open-source projects.
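As a small, hedged example of the usual model (an application exposes a /metrics endpoint and a Prometheus server scrapes it periodically), the sketch below uses the prometheus_client Python library; the metric and endpoint names are purely illustrative.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metrics exposed on http://localhost:8000/metrics for Prometheus to scrape.
REQUESTS = Counter("http_requests_total", "Requests served", ["endpoint"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])

start_http_server(8000)

# Record one request to a hypothetical endpoint.
with LATENCY.labels(endpoint="/checkout").time():
    REQUESTS.labels(endpoint="/checkout").inc()
    ...  # handle the request here
```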
MA – Question: “Who should lead the effort of transformation?”
DF - I can say that for Lightstep projects, most of the time the Ops are leading. The first people that we meet are SREs, but the biggest users of the solution are the developers. This might seem strange, but that's really because of the ratio of people. You've probably got one SRE for 10 developers in many companies, and what we see at Zalando is that they've got a team of SREs that owns this solution, and several hundred developers who connect to the application every day.
MA – Here is a question I’d like to address: "What would be 3 success factors to consider when shifting from monitoring to observability?". The first success factor is being more efficient in incident detection and resolution.
"Using the Christmas tree example that you presented, implementing observability would allow you to go directly to the root cause of the issue instead of involving 200 people in the same incident." M. Chavaroc, Alenia
This is particularly powerful when you have a follow-the-sun or on-duty setup. With “legacy” monitoring, you can be called in the middle of the night for a file system that is getting full or for a machine that is not even used anymore.
"When you implement observability, you are called based on a SLO breach. So, when an alert pops up, it means that there is a real direct business or client impact." M. Chavaroc, Alenia
You're not going to be woken up for something useless. And that makes sense, especially for people working on production jobs.
DF - Involving developers is also crucial, as without developers you cannot succeed in this topic. So you should involve them in the process. At first, they may not be happy because they may see it as a production thing, but to win them over you can run a simple workshop using a war-game scenario, where you simulate incidents and let them solve them with and without observability. They will see the difference and they will like it.
Another factor to consider is starting with your most critical business services and applications. This is not a project you do for less critical apps. For example, Zalando has 5,000 microservices, but only 3,000 of them are connected to distributed tracing, because the others are not critical for their business. You don't need to see the distributed trace of everything.
WVF - One thing that our team is doing is providing all the non-production environments with the same monitoring as production. We consider the non-production environments to be almost as critical as production. For example, if I have a bug fix to deliver and my UAT is not working, I cannot test my bug fix, which will impact my production. In our team, we have the exact same monitoring stack in every environment, and I think it's very important. All monitoring improvements go through acceptance and UAT environments before being delivered as a release to production. So, to me, a key to success is to consider your monitoring as important as any other piece of software in your system.
Question from the floor: “Is it possible to implement OpenTelemetry on mainframe-based applications?”
DF - Technically, it's possible; one of our customers has done it. But most customers don't, because they don't have to. Instrumenting legacy layers is a lot of effort. In this case, OpenTelemetry offers a notion called an "inferred service": a service whose behavior you infer from the clients connecting to it. For example, when you've got a mainframe transaction in your SLO-based chain, you will see it as an inferred service. If the inferred service creates an error, you will still see it, meaning that you've got the error rate of the service (return codes), its latency (response time), and the number of transactions (each time the service is called).
Most of the time, we don't install any agent on this kind of technology, and that's the same for external APIs that you don't control such as a SWIFT API or an external payment API. You know the SLA of the API, you know if it returns an error, and that's sufficient to be part of your observability platform.
MA – Thank you very much to all of you. If you want to dig deeper into these topics, you can get in touch with us here, we’ll be happy to keep this conversation going.