DevOps Archives - Lightrun
https://lightrun.com/tag/devops/

Observability vs. Monitoring
https://lightrun.com/observability-vs-monitoring/ | Sat, 21 May 2022

Although all code is bound to have at least some bugs, they are more than just a minor issue. Bugs in your application can severely impact its efficiency and frustrate users. To ensure that software is free of bugs and vulnerabilities before applications are released, DevOps teams need to work collaboratively and effectively bridge the gap between the operations, development, and quality assurance teams.

But there is more to ensuring a bug-free product than a strong team. DevOps teams need the right methods and tools in place to better manage bugs in the system.

Two of the most effective methods are monitoring and observability. Although they may seem like the same process at a glance, there are important differences beneath the surface. In this article, we look at what monitoring and observability mean, explore their differences, and examine how they complement each other.

What is monitoring in DevOps?

In DevOps, monitoring refers to the supervision of specific metrics throughout the whole development process, from planning all the way to deployment and quality assurance. By being able to detect problems in the process, DevOps personnel can mitigate potential issues and avoid disrupting the software’s functionality.   

DevOps monitoring aims to give teams the information to respond to bugs or vulnerabilities as quickly as possible. 

DevOps Monitoring Metrics

To correctly implement the monitoring method, developers need to supervise a variety of metrics, including:

  • Lead time or change lead time
  • Mean time to detection
  • Change failure rate
  • Mean time to recovery
  • Deployment frequency
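To make these metrics concrete, here is a minimal sketch (with entirely hypothetical deployment records, not a prescription for any particular tool) of how deployment frequency, change failure rate, and mean time to recovery could be computed from a simple deployment log:

```python
from datetime import datetime

# Hypothetical deployment records: (timestamp, caused_failure, minutes_to_recover)
deployments = [
    (datetime(2022, 5, 1), False, 0),
    (datetime(2022, 5, 3), True, 45),
    (datetime(2022, 5, 5), False, 0),
    (datetime(2022, 5, 8), True, 15),
]

period_days = (deployments[-1][0] - deployments[0][0]).days or 1

# Deployment frequency: deployments per day over the observed period
deployment_frequency = len(deployments) / period_days

# Change failure rate: share of deployments that caused a failure
change_failure_rate = sum(1 for _, failed, _ in deployments if failed) / len(deployments)

# Mean time to recovery: average recovery time across failed deployments
recovery_times = [mins for _, failed, mins in deployments if failed]
mttr = sum(recovery_times) / len(recovery_times)

print(deployment_frequency, change_failure_rate, mttr)
```

In practice these numbers would come from your CI/CD system and incident tracker rather than a hand-written list.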

What is Observability in DevOps?

Observability is a measure of how well developers can determine a system’s internal state from its external outputs. It allows teams to understand the system’s problems by revealing where, how, and why the application is not functioning as it should, so they can address issues at their source rather than relying on band-aid solutions. Moreover, developers can assess the condition of a system without interacting with its complex inner workings or affecting the user experience. There are a number of observability tools available to assist you throughout the software development lifecycle.

The Three Pillars of Observability

Observability requires gathering and analyzing the data exposed by the application’s outputs. While this flood of data can become overwhelming, it can be broken down into three fundamental pillars developers need to focus on:

1. Logs

Logs are the structured and unstructured lines of text an application produces when it runs certain lines of code. A log records events within the application and can be used to uncover bugs or system anomalies, providing a wide variety of details from almost every system component. Logs make observability possible by creating the output that lets developers troubleshoot code: by analyzing the logs, they can identify the source of an error or security alert.
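As an illustration of the structured flavor of logging, here is a minimal sketch using only the Python standard library; the `checkout` logger name, the message, and the field names are illustrative assumptions, not a prescribed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so logs can be
    parsed and indexed downstream instead of grepped as free text."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
record = logger.makeRecord("checkout", logging.WARNING, __file__, 0,
                           "payment failed for order %s", ("A-1001",), None)
line = JsonFormatter().format(record)
print(line)  # one machine-parseable event
```

Structured lines like this are what log-ingestion pipelines filter and index at scale.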

2. Metrics

Metrics are numerical representations of data that illustrate the application’s behavior over time. Each metric consists of a set of attributes, such as a name, a label, a value, and a timestamp, that reveal information about the system’s overall performance and any incidents that may have occurred. Unlike logs, metrics don’t record specific incidents but return values representing the application’s overall performance. In DevOps, metrics can be used to assess the performance of a product throughout the development process and identify potential problems. Metrics are also ideal for observability, as patterns gathered from various data points are easy to identify and combine into a complete picture of the application’s performance.
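The attributes named above (name, label, value, timestamp) can be modeled as a small data structure. A hedged sketch, with hypothetical metric names and values:

```python
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MetricPoint:
    """One metric sample: the four attributes it typically carries."""
    name: str                                    # e.g. "http_request_duration_ms"
    labels: dict = field(default_factory=dict)   # dimensions, e.g. {"route": "/login"}
    value: float = 0.0
    timestamp: float = field(default_factory=time.time)

# Aggregating points reveals overall performance, not individual incidents
points = [
    MetricPoint("http_request_duration_ms", {"route": "/login"}, 120.0, 1_653_000_000.0),
    MetricPoint("http_request_duration_ms", {"route": "/login"}, 80.0, 1_653_000_060.0),
]
avg = sum(p.value for p in points) / len(points)
print(avg)
```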


3. Traces

While logs and metrics provide enough information to understand a single system’s behavior, they rarely provide enough information to clarify the lifetime of a request in a distributed system. That’s where tracing comes in. A trace represents the passage of a request as it travels through all of the distributed system’s nodes.

Implementing traces makes it easier to profile and observe systems. By analyzing the data a trace provides, your team can assess the general health of the entire system, locate and resolve issues, discover bottlenecks, and identify high-value areas to prioritize for optimization.
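Real systems typically use a standard such as OpenTelemetry for this, but the core idea can be sketched by hand: every span of a request shares one trace ID, and each child span records its parent. The service names below are hypothetical:

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One hop of a request; all spans of a request share a trace_id."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None

def start_span(name: str, parent: Optional[Span] = None) -> Span:
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, uuid.uuid4().hex, parent_id)

# A request crossing three nodes of a distributed system
gateway = start_span("api-gateway")
orders = start_span("orders-service", parent=gateway)
billing = start_span("billing-service", parent=orders)

# The shared trace_id lets us reassemble the request's full path
print([s.name for s in (gateway, orders, billing)])
```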

     

Monitoring vs. Observability: What’s the Difference?

We’ve compiled the table below to better distinguish between these two essential DevOps methods:

Monitoring | Observability
Practically any system can be monitored | The system has to be designed for observability
Asks if your system is working | Asks what your system is doing
Includes metrics, events, and logs | Includes traces
Active (pulls and collects data) | Passive (pushes and publishes data)
Capable of providing raw data | Heavily relies on sampling
Enables rapid response to outages | Reduces outage duration
Collects metrics | Generates metrics
Monitors predefined data | Observes general metrics and performance
Provides system information | Provides actionable insights
Identifies the state of the system | Identifies why the system failed

Observability vs. Monitoring: What do they have in common?

While we’ve established that observability and monitoring are distinct methods, this doesn’t make them incompatible. On the contrary, monitoring and observability are generally used together, as both are essential for DevOps. Despite their differences, their commonalities allow the two methods to co-exist and even complement each other.

Monitoring allows developers to identify when there is an anomaly, while observability gives insight into the source of the issue. Monitoring is effectively a subset of, and therefore key to, observability: developers can only monitor systems that are already observable. And while monitoring only provides answers for previously identified problems, observability simplifies the DevOps process by allowing developers to submit new queries, which can be used to solve an already identified issue or to gain insight into the system as it is being developed.

Why are both essential?

Monitoring and observability are both critical to identifying and mitigating bugs or discrepancies within a system. But to fully utilize the advantages of each approach, developers must do both thoroughly. Manually implementing and maintaining these approaches is an enormous task. Luckily, automated tools like Lightrun allow developers to focus their valuable time and skills on coding. The tool enables developers to add logs, metrics, and traces to their code in real time, without restarting or redeploying the software, preventing delays and guaranteeing fast deployment.

Mastering Complex Progressive Delivery Challenges with Lightrun
https://lightrun.com/mastering-complex-progressive-delivery-challenges-with-lightrun/ | Sun, 21 May 2023

Introduction

Progressive delivery is a modification of continuous delivery that allows developers to release new features to users in a gradual, controlled fashion. 

It does this in two ways. 

Firstly, by using feature flags to turn specific features ‘on’ or ‘off’ in production, based on certain conditions, such as specific subsets of users. This lets developers deploy rapidly to production and perform testing there before turning a feature on. 

Secondly, they roll out new features gradually using canary releases, which involve making certain features available to only a small percentage of the user base for testing before releasing them to all users.

Practices like these allow developers to get incredibly granular about how they release new features to their user base, minimizing risk.
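One common way to implement the “small percentage of the user base” part of a canary release is to bucket users by a stable hash of their ID, so each user consistently lands in or out of the canary across requests. A minimal sketch (the `new-checkout` feature name is hypothetical):

```python
import hashlib

def in_canary(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket 0-99; stable per feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Roll the hypothetical "new-checkout" feature out to 10% of users
users = [f"user-{i}" for i in range(1000)]
canary_users = [u for u in users if in_canary(u, "new-checkout", 10)]
print(len(canary_users))  # roughly 100 of the 1000 users
```

Because the bucket is derived from the user ID rather than a random draw, a user who sees the new feature keeps seeing it, and the rollout percentage can be raised without reshuffling everyone.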

But there are downsides to progressive delivery: it can create very complex code that is challenging to troubleshoot. 

Troubleshooting Complex Code 

Code written for progressive delivery is highly conditional. It can contain many feature flag branches that respond differently to different subsets of users based on their data profile or specific configuration. You can easily end up with hard-to-follow code composed of complex flows with many conditional statements.

This means that your code becomes very unpredictable. It’s not always clear which code path will be invoked as this is highly dependent on which user is active and in what circumstances. 

The difficulty comes when you discover an issue, vulnerability or bug that is related to one of these complex branches and only occurs under certain very specific conditions. It becomes very difficult to determine which code path contains the bug and what information you need to gather to fix it. 

This becomes a major barrier to identifying any problems and resolving them effectively. 

The Barriers To Resolving Issues In Progressive Delivery

When a problem arises in this complex progressive delivery context, your developers can spend a huge amount of time trying to discern the location and nature of the actual problem amidst all the complexity. 

There are three main ways this barrier manifests:

  • Parsing conditional statements in the code path

Developers have to determine the actual code path being executed when the problem arises – a non-trivial task when many different feature flags are being conditionally triggered by different users in unpredictable ways.

Among all these different possibilities it is very hard to determine which conditional statements will run and therefore to statically analyze the code path that will be executed. 

Developers have to add new logs to track the flow of code, forcing them to redeploy the application. Sometimes many rounds of logging and redeployment are required before they get the information they need, which is incredibly time-consuming.

  • Emulating the production environment locally

Secondly, once the right code path has been isolated, they have to replicate that complex, conditional code on their local machine to test potential fixes. 

But if there are many feature flags and conditional statements, it is very hard to emulate that locally to reproduce and assess the problem given the complexity of the production environment. 

A huge amount of time and energy is needed to do this, with no guarantee that you will be able to perfectly replicate the production environment.

  • Generating synthetic traffic that matches the user profiles

Thirdly, when the code path that is executed is highly dependent on specific data (e.g. user data), it is hard to simulate workloads synthetically in order to properly test the solution in a way that accurately mirrors the production environment.

Yet more time and energy must be expended to trigger the issue in the test environment in a way that gives developers the information they need to properly resolve the issue.

Using Lightrun to Troubleshoot Progressive Delivery

Developer time is extremely valuable, and developers can waste a lot of it on these niggling hurdles to remediation – time that could be spent creating valuable new features.

But there is a new approach that can overcome these barriers: dynamic observability. 

Lightrun is a dynamic observability platform that enables developers to add logs, metrics and snapshots to live applications—without having to release a new version or even stop the running process.

In the context of progressive delivery, Lightrun enables you to use real-time dynamic instrumentation to:

  • Identify the actual workflow affected by the issue
  • Capture the relevant information from that workflow

This means that you can identify and understand your bug or vulnerability without having to redeploy the application or recreate the issue locally, regardless of the complexity of the code.

There are two features of Lightrun that are particularly potent in this regard: Lightrun dynamic logs and snapshots. 

Dynamic Logs

You can deploy dynamic logs within each feature flag branch in real-time, providing full visibility into the progressive delivery process without having to redeploy the application.

Unlike regular logging statements, which are printed for all requests served by the system, dynamic logs can target specific users or user segments using conditions, making them more precise and much less noisy.

If there’s a new issue you want to track or a new feature flag branch you want to start logging, you can just add it on the fly. Then you can flow through the execution and watch your application’s state change with every flag flip using real user data, right from the IDE, without having to add endless ‘if/else’ statements in the process.
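Conceptually, a conditional dynamic log behaves like a log statement guarded by a per-request condition, except that it is injected at runtime instead of being written into the source and redeployed. A sketch of the static equivalent (the user IDs, plan names, and flag are all hypothetical):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rollout")

def handle_request(user_id: str, plan: str, flag_on: bool) -> str:
    # Static equivalent of a conditional dynamic log: it fires only for
    # the segment under investigation, so it stays precise and low-noise.
    if plan == "enterprise" and flag_on:
        log.info("new-checkout path taken for user=%s", user_id)
    return "new-checkout" if flag_on else "legacy-checkout"

handle_request("u-42", "enterprise", True)
```

The point of the dynamic version is that this guard is attached to a live process on the fly, with no `if/else` ever committed to the codebase.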

Granular Snapshots

Similarly, you can place Snapshots – essentially virtual breakpoints – inside any flag-created branch, giving you debugger-grade visibility into each rollout. This gives your developers on-demand access to whatever information they need about the code flows that are affected by your issue. 

All the classic features you know from traditional debugging tools, plus many more, are available in every snapshot, which:

  • Can be added to multiple instances, simultaneously
  • Can be added conditionally, without writing new code
  • Provides full-syntax expression evaluation
  • Is safe, read-only and performant
  • Can be placed, viewed and edited right from the IDE

Enabling your developers to track issues through complex code and gather intelligence on demand – all without having to redeploy the app or even leave the IDE – makes troubleshooting progressive delivery codebases much easier.

Developer Benefits Of Using Lightrun

  • Determine which workflow is being executed during a given issue

Developers can identify exactly which workflow is relevant. This means they no longer have the hassle of troubleshooting sections of code that are not vulnerable because they are not being executed or redeploying the application to insert log messages to track the code flow. 

  • No need to locally reproduce issues

By dynamically observing the application pathways at runtime, you avoid the need to invest significant time and energy into reproducing your production environment locally along with all the complexity of feature flags and config options. 

  • No need to create highly-specific synthetic traffic

Similarly, there is no need to emulate customer workloads by creating highly conditional synthetic traffic to trigger the particular code path in question.

Overall, developers can save a huge amount of time and energy that was previously being sunk into investigating complexity in different ways. 

Final Thoughts

Dynamic observability gives you much deeper and faster insights into what’s going on in your code. 

With Lightrun’s revolutionary developer-first approach, we enable engineering teams to connect to their live applications and continuously identify critical issues without hotfixes, redeployments or restarts. 

If you need a hand troubleshooting complex code flows or dealing with highly conditional progressive delivery scenarios, get in touch.



Effectively Bridging the DevOps – R&D Gap without Sacrificing Reliability
https://lightrun.com/bridge-devops-dev-gap-without-sacrificing-reliability/ | Mon, 07 Mar 2022

DevOps culture revolutionized our industry. Continuous Delivery and Continuous Integration made six-sigma reliability commonplace. 20 years ago we would kick the production servers and listen to the hard drives spin; that was observability. Today’s DevOps teams deploy monitoring tools that provide development teams with deep insight into the production environment.

“O brave new world, That has such people in’t!” – William Shakespeare

Before DevOps practices were commonplace, production used to fail. A lot. We don’t want to go back to the time before DevOps tools were commonplace…

The Twitter Fail Whale demonstrated the need for DevOps

Everything’s Perfect in our Development Process, Right?

Well… no. Software is hard, especially at the fast pace of continuous delivery cycles. We will always write some bugs, and unfortunately some will make it into production. That’s unavoidable.

The problem is that the bugs reaching production made it past our continuous integration pipeline and past the testing environment. They are typically tough-to-detect, tough-to-reproduce bugs – uber-bugs. The DevOps practices we worked so hard to establish suddenly turned against us.

That Thin (possibly blue) Line

DevOps teams are typically siloed from the dev teams. There’s a line that separates them. This isn’t too bad and fits well with agile development processes. But it falls flat when the development team needs to debug. This noticeably affects software quality. The DevOps approach indeed raised uptime significantly, but bugs in production are still abundant and they take longer to fix.

The Continuous Integration Churn

The first reason for this degradation is the continuous integration churn. When we have a bug in production, developers need to add logging/information and go through the continuous delivery pipeline to see the new logs. If they got something wrong or missed some information, it’s “rinse-repeat” all over again.

In a world of agile teams that move fast, this is a tedious and painfully slow process that puts the development cycle on-hold. While the continuous delivery pipelines are churning, we still have a production bug that we still don’t understand.

Access Limits and Security Teams

The second reason is more about the siloed teams, one of the core DevOps practices. I would like to emphasize that we obviously have and need a culture of collaboration. That’s obvious. But DevOps also has a responsibility of keeping the development environment separate from staging and production.

That line that separates DevOps engineers from R&D is a good line. It’s an important line. It’s a line that enables high-quality software by vetting everything that goes into production through an organized process.

Developers just want to connect a “debugger” and step over the code. This obviously doesn’t scale and would crash production systems. Then there are the obvious security issues involved… That’s why we have DevOps workflow and the silos are important.

Collaboration Between Development and Operations Teams

This isn’t a fresh problem. Rapid delivery and reliability engineering work great under normal conditions, but fall flat when we need to track an error. At that point, we have two options: logs and observability tools. Before I proceed, I would like to stress that we use both and love them. They are crucial pieces of the software development lifecycle!

Logs

Today’s logs are not the logs of our predecessors. DevOps pipe, filter and index them at a huge scale. In fact, corporations spend millions on log ingestion cloud infrastructure!

Working with logs has some limits:

  • Cost – over-logging is a major problem. It degrades application performance and can be quite expensive
  • It’s Static – developers aren’t clairvoyant. They don’t know what to log; that’s why they over-log. Still, some information is often missing, which sends us back to the continuous deployment cycle mentioned above

Observability Tools

There are many observability tools in production, but most of them have one thing in common: they were designed as part of the DevOps toolchain. They weren’t designed for R&D and don’t provide the type of information developers often seek.

Most of these tools are focused on Metrics and Errors. That makes sense for a DevOps practitioner, but a production bug is oftentimes expressed in application logic/UI.

Finally, the performance of applications can be affected by such observability tools. These tools work by monitoring widely and receiving application events. Their overhead is often noticeable in intense production environments.

Continuous Observability Tools to Save the Day

These problems aren’t new. As a result, the market has grown to offer a tool for developers that respects DevOps processes: a debugger that respects security practices and reliability engineers.

Continuous Observability tools are the new generation of tooling for Cloud-Native development. They let developers query the running system at the code level without deploying a new version.

Let’s go over the issues above:

  • Log cost – Continuous Observability tools let us inject logs dynamically. That means developers can reduce the amount of logs (developers can inject more as needed)
  • CI/CD cycle for updates – Since logs can be injected, developers don’t need to go through the continuous deployment pipeline
  • These tools were designed for developers and integrate with development tools such as IDEs. They provide the type of information developers need directly in the source code
  • Performance overhead is low. Since these tools query a specific area of the code and not the full application, the impact is low. The best tools throttle features to keep the application performant

The Line Preserved

Continuous Observability tools are deployed through the DevOps environment. That means developers don’t circumvent the operations teams, and we maintain the separation that protects system reliability. Software updates and all maintenance still propagate through a single DevOps team, as they did before.

This is great news if you’re as passionate about reliability engineering and cloud-native development as I am. The capability and reliability of these tools let us keep the pace of releases and literally debug production at scale without compromising security.

In Practical Terms – How does it Work?

Common tool usage in this field follows a flow similar to using a debugger in normal application development. A problem is reported in production, and the application development team makes assumptions about the application. These assumptions can be verified using Snapshots, Logs or Metrics (AKA actions).

Snapshots are the workhorse of continuous observability tools; they provide a deep view into the underlying infrastructure. Snapshots work very much like breakpoints: they provide a stack trace with the values of the variables in each stack frame’s context. As a result, they even look like IDE breakpoints within the IDE. But they have one distinction: they don’t break. The current thread doesn’t stop, and other threads aren’t affected.

This means there’s no “step-over,” which is understandable. But there are conditional actions that let us place a snapshot (or any action) based on a condition, similar to a conditional breakpoint. For example, a snapshot can be defined so that it’s triggered only for a specific user, to track an issue experienced by that user alone. We can place it on a group of servers using a tag, so we can track an issue across distributed servers.
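The behavior of a conditional snapshot can be approximated in plain Python to illustrate the concept: capture the caller’s stack location and local variables when a condition holds, without pausing the thread. This is only a conceptual sketch, not how any real tool is implemented; the `user-1337` condition and `apply_discount` function are hypothetical:

```python
import inspect

snapshots = []

def snapshot(condition: bool) -> None:
    """Record the caller's location and locals when the condition holds,
    without stopping the thread (unlike a breakpoint)."""
    if not condition:
        return
    frame = inspect.currentframe().f_back
    snapshots.append({
        "function": frame.f_code.co_name,
        "line": frame.f_lineno,
        "locals": dict(frame.f_locals),
    })

def apply_discount(user_id: str, total: float) -> float:
    discounted = total * 0.9
    # Fires only for the specific user we are investigating
    snapshot(condition=(user_id == "user-1337"))
    return discounted

apply_discount("user-1", 100.0)
apply_discount("user-1337", 100.0)
print(len(snapshots))
```

The request for "user-1" passes through untouched; only the targeted user’s request leaves behind a record of the stack frame’s state.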

Logs let us add new log lines and integrate them seamlessly with existing ones. That’s a key capability, since logs are best read in context and in order. They also get ingested with the rest of the logs, based on the definitions made by the DevOps team.

Metrics let us measure small blocks of code or individual methods. These are very fine-grained measurements; even something as simple as a counter can be very useful.
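As a rough illustration of such fine-grained measurements, here is a sketch of a counter plus a timer wrapped around a single method. The decorator, function name, and payload are all hypothetical, and this is not any tool’s actual API:

```python
import time
from collections import defaultdict
from functools import wraps

counters = defaultdict(int)
durations = defaultdict(list)

def measured(fn):
    """Count invocations of a code block and record how long each took."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            counters[fn.__name__] += 1
            durations[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@measured
def parse_order(payload: dict) -> int:
    return len(payload)

for _ in range(3):
    parse_order({"id": 1, "items": 2})

print(counters["parse_order"], len(durations["parse_order"]))
```

A continuous observability tool attaches this kind of measurement to a running process dynamically, whereas the decorator here has to be written into the source.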

TL;DR Applying Continuous Observability into your Agile Practices

Modern cloud environments are remarkably complex. As we’re all adopting cloud-native development, we can’t give developers the level of access to production they used to enjoy. That’s just not tenable. Everything must follow key practices through the DevOps lifecycle.

This leaves us with a system that’s robust in most cases but much harder to debug and troubleshoot. Organizational culture helps, but it isn’t enough. The bugs that quality assurance didn’t catch are the hardest bugs, and analyzing them in production based on customer feedback is difficult: it’s time-consuming, it’s expensive, and it affects release frequency and code quality.

Existing tools are great, but they were designed for DevOps teams, not for developers. Bugs should be the responsibility of the development team, but we can’t expect developers to address bugs without tools that provide insight.

This is the idea behind continuous observability tools. These tools are not part of a DevOps platform, but they’re deployed by DevOps. In that sense, they maintain the separation. Developers don’t have access to production – but they can debug it, securely and at scale.
