Observability vs. Monitoring

Although all code is bound to contain at least some bugs, they are more than just a minor issue. Bugs in your application can severely impact its efficiency and frustrate users. To ensure that the software is free of bugs and vulnerabilities before applications are released, DevOps teams need to work collaboratively and effectively bridge the gap between the operations, development and quality assurance teams.

But there is more to ensuring a bug-free product than a strong team. DevOps teams need to have the right methods and tools in place to manage bugs in the system.

Two of the most effective methods are monitoring and observability. Although they may seem like the same process at a glance, they have some apparent differences beneath the surface. In this article, we look at the meaning of monitoring and observability, explore their differences and examine how they complement each other.

What is monitoring in DevOps?

In DevOps, monitoring refers to the supervision of specific metrics throughout the whole development process, from planning all the way to deployment and quality assurance. By being able to detect problems in the process, DevOps personnel can mitigate potential issues and avoid disrupting the software’s functionality.   

DevOps monitoring aims to give teams the information to respond to bugs or vulnerabilities as quickly as possible. 

DevOps Monitoring Metrics

To correctly implement the monitoring method, developers need to supervise a variety of metrics, including:

  • Lead time or change lead time
  • Mean time to detection
  • Change failure rate
  • Mean time to recovery
  • Deployment frequency

    What is Observability in DevOps?

    Observability is a measure of how well a system's internal state can be inferred from its external outputs. An observable system lets teams understand problems by revealing where, how, and why the application is not functioning as it should, so they can address issues at their source rather than relying on band-aid solutions. Moreover, developers can assess the condition of a system without interacting with its complex inner workings or affecting the user experience. There are a number of observability tools available to assist you throughout the software development lifecycle.

    The Three Pillars of Observability

    Observability requires the gathering and analysis of data released by the application’s output. While this flood of data can become overwhelming, it can be broken down into three fundamental data pillars developers need to focus on:

    1. Logs

    Logs are the structured and unstructured lines of text an application produces when certain lines of code run. A log records events within the application and can be used to uncover bugs or system anomalies, providing a wide variety of details from almost every system component. Logs make observability possible by producing the output developers analyze to troubleshoot code and identify the source of an error or security alert.
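    To make that concrete, here is a minimal sketch of emitting log events from a JVM service using the SLF4J logging facade; the class name, logger fields and message format are illustrative, not taken from any particular codebase:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheckoutService {
    private static final Logger log = LoggerFactory.getLogger(CheckoutService.class);

    void processOrder(String orderId, double total) {
        // Emitted as a timestamped event that a log pipeline can index and search later
        log.info("order processed orderId={} total={}", orderId, total);

        if (total <= 0) {
            // Anomalies surface as WARN/ERROR entries that observability tooling can alert on
            log.warn("suspicious order total orderId={} total={}", orderId, total);
        }
    }
}
```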

    2. Metrics

    Metrics are numeric representations of the application's behavior over time. Each metric consists of a set of attributes, such as a name, labels, a value, and a timestamp, that together reveal information about the system's overall performance and any incidents that may have occurred. Unlike logs, metrics don't record specific incidents; they return values that represent the application's overall performance. In DevOps, metrics can be used to assess the performance of a product throughout the development process and identify potential problems. Metrics also lend themselves well to observability, because patterns gathered from many data points are easy to identify and combine into a complete picture of the application's performance.
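    As a rough sketch of what such a metric looks like in code, here is a counter registered with Micrometer, a common JVM metrics facade; the metric name, tag and registry choice are illustrative assumptions:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class MetricsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // A metric has a name, labels (tags), and a numeric value sampled over time
        Counter failedLogins = Counter.builder("auth.logins.failed")
                .tag("service", "auth")
                .register(registry);

        failedLogins.increment();

        // Unlike a log line, this is a time series: the backend stores
        // (name, tags, value, timestamp) samples and charts trends over time
        System.out.println(failedLogins.count());
    }
}
```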

    Logging, Tracing, Metrics

    3. Trace

    While logs and metrics provide enough information to understand a single system's behavior, they rarely provide enough to reconstruct the lifetime of a request that crosses a distributed system. That's where tracing comes in. A trace represents the path of a request as it travels through all of the distributed system's nodes.

    Implementing traces makes it easier to profile and observe systems. By analyzing the data a trace provides, your team can assess the general health of the entire system, locate and resolve issues, discover bottlenecks, and identify which areas are high-value targets for optimization and how to prioritize them.
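    As a toy illustration of the idea (not any particular tracing library's API), the sketch below models the trace context every hop would share; a real system would use something like OpenTelemetry, the operation name is made up, and Java 16+ records are used for brevity:

```java
import java.util.UUID;

public class TracingSketch {
    // A span records one unit of work inside a larger trace
    record Span(String traceId, String spanId, String parentSpanId,
                String operation, long startNanos, long endNanos) {}

    public static void main(String[] args) {
        // Shared by every service the request touches
        String traceId = UUID.randomUUID().toString();

        long start = System.nanoTime();
        String rootSpanId = UUID.randomUUID().toString();
        // ... handle the incoming request, then call downstream services,
        // propagating traceId + rootSpanId (e.g. via HTTP headers) so the
        // downstream spans can point back to this parent ...
        long end = System.nanoTime();

        Span root = new Span(traceId, rootSpanId, null, "GET /checkout", start, end);
        System.out.println(root);
    }
}
```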

     

    Monitoring vs. Observability: What’s the Difference?

    We’ve compiled the below table to better distinguish between these two essential DevOps methods:

    Monitoring | Observability
    Practically any system can be monitored | The system has to be designed for observation
    Asks if your system is working | Asks what your system is doing
    Includes metrics, events, and logs | Includes traces
    Active (pulls and collects data) | Passive (pushes and publishes data)
    Capable of providing raw data | Heavily relies on sampling
    Enables rapid response to outages | Reduces outage duration
    Collects metrics | Generates metrics
    Monitors predefined data | Observes general metrics and performance
    Provides system information | Provides actionable insights
    Identifies the state of the system | Identifies why the system failed

    Observability vs. Monitoring: What do they have in common?

    While we’ve established that observability and monitoring are entirely different methods, this doesn’t make them incomparable. On the contrary, monitoring and observability are generally used together, as both are essential for DevOps. Despite their differences, their commonalities allow the two methods to co-exist and even complement each other. 

    Monitoring allows developers to identify when there is an anomaly, while observability gives insights into the source of the issue. Monitoring is almost a subset of, and therefore key to, observability. Developers can only monitor systems that are already observable. Although monitoring only provides solutions for previously identified problems, observability simplifies the DevOps process by allowing developers to submit new queries that can be used to solve an already identified issue or gain insight into the system as it is being developed.

    Why Are Both Essential?

    Monitoring and observability are both critical to identifying and mitigating bugs or discrepancies within a system. But to fully utilize the advantages of each approach, developers must do both thoroughly. Manually implementing and maintaining these approaches is an enormous task. Luckily, automated tools like Lightrun allow developers to focus their valuable time and skills on coding. The tool enables developers to add logs, metrics, and traces to their code in real time, without restarting or redeploying the software, preventing delays and guaranteeing fast deployment.

Testing in Production: Recommended Tools

Testing in production has a bad reputation. The same kind of reputation “git push --force origin master” has. Burning houses and Chuck Norris represent testing in production in memes, and that says it all. When done poorly, testing in production very much deserves the sarcasm and negativity. But that’s true for any methodology or technique.

This blog post aims to shed some light on the testing in production paradigm. I will explain why giants like Google, Facebook and Netflix see it as a legitimate and very beneficial instrument in their CI/CD pipelines. So much so, in fact, that you might consider starting to use it as well. I will also provide recommendations for testing in production tools, based on my team’s experience.

Testing In Production – Why?

Before we proceed, let’s make it clear: testing in production is not applicable to every kind of software. Embedded software, on-prem high-touch installation solutions or any type of critical system should not be tested this way. The risks (and as we’ll see further, it’s all about risk management) are too high. But do you have a SaaS solution with a backend that leverages a microservices architecture, or even just a monolith that can be easily scaled out? Or any other solution whose deployment and configuration the company’s engineers fully control? Ding ding ding – those are the ideal candidates.

So let’s say you are building your SaaS product and have already invested a lot of time and resources to implement both unit and integration tests. You have also built your staging environment and run a bunch of pre-release tests on it. Why on earth would you bother your R&D team with tests in production? There are multiple reasons: let’s take a deep dive into each of them.

Staging environments are bad copies of production environments

Yes, they are. Your staging environment is never as big as your production environment – in terms of server instances, load balancers, DB shards, message queues and so on. It never handles the load and the network traffic production does. So it will never have the same number of open TCP/IP connections, HTTP sessions, open file descriptors and parallel DB write queries. There are stress testing tools that can emulate that load, but when you scale, this stops being sufficient very quickly.

Besides the size, the staging environment never matches production in terms of configuration and state. It is often configured to start a fresh copy of the app upon every release, security configurations are relaxed, ACLs and service discovery never face real-life production scenarios, and the databases are emulated by recreating them from scratch with automation scripts (copying production data is often impossible, even legally, due to privacy regulations such as GDPR). Well, after all, we all try our best.

At best we can create a bad copy of our production environment. This means our testing will be unreliable and our service susceptible to errors in the real life production environment.

Chasing after maximum reliability before the release costs. A lot.

Let’s just cite Google engineers

“It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.

Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be.”

Let’s emphasize the point: “Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear”. No unit/integration/staging env tests will ever make your release 100% error-free. In fact they shouldn’t (well, unless you are a Boeing engineer). After a certain point, investing more and more in tests and attempting to build a better staging environment will just cost you more compute/storage/traffic resources and will significantly slow you down.

Doing more of the same is not the solution. You shouldn’t spend your engineers’ valuable work hours chasing the dragon trying to diminish the risks. So what should you be doing instead?

Embracing the Risk

Again, citing the great Google SRE Book:

“…we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos…. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.”

So it is not just about when and how you run your tests. It’s about how you manage risks and costs of your application failures. No company can afford its product downtime because of some failed test (which is totally OK in staging). Therefore, it is crucial to ensure that your application handles failures right. “Right”, quoting the great post by Cindy Sridharan, means:

“Opting in to the model of embracing failure entails designing our services to behave gracefully in the face of failure.”

The design of fault tolerant and resilient apps is out of the scope of this post (Netflix Hystrix is still worth a look though). So let’s assume that’s how your architecture is built. In such a case, you can fearlessly roll-out a new version that has been tested just enough internally.

And then, the way to bridge the gap so as to get as close as possible to 100% error-free, is by testing in production. This means testing how our product really behaves and fixing the problems that arise. To do that, you can use a long list of dedicated tools and also expose it to real-life production use cases.

So the next question is – how to do it right?

Testing In Production – How?

Cindy Sridharan wrote a great series of blog posts that discusses the subject in great depth. Her recent Testing in Production, the safe way blog post depicts a table of test types you can run in pre-production and in production.

One should definitely read carefully through this post. We’ll just take a brief look and review some of the techniques she offers. We will also recommend various tools from each category. I hope you find our recommendations useful.

Load Testing in Production

As simple as it sounds. Depending on the application, it makes sense to stress its ability to handle a huge amount of network traffic, I/O operations (often distributed), database queries, various forms of message queue storms, and so on. Some severe bugs appear clearly only under load (hi, memory overwrite). Even if not – your system is always capable of handling only a limited amount of load. So here failure tolerance and graceful handling of dropped connections become really crucial.

Obviously, performing a load test in the production environment stresses your app as it is configured for real-life use cases, and thus provides far more useful insights than load testing in staging.
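The dedicated tools below do this properly, but at its core a load test is just many concurrent requests plus a careful eye on error rates and latency. A minimal, illustrative sketch (the target URL and concurrency numbers are hypothetical, and you would point it at production only deliberately and with safeguards):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MiniLoadTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/health")).build();

        int concurrency = 50, requestsPerWorker = 100;
        AtomicInteger failures = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                for (int j = 0; j < requestsPerWorker; j++) {
                    try {
                        HttpResponse<Void> resp =
                                client.send(request, HttpResponse.BodyHandlers.discarding());
                        if (resp.statusCode() >= 500) failures.incrementAndGet();
                    } catch (Exception e) {
                        failures.incrementAndGet(); // timeouts and dropped connections count as failures
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        System.out.println("failed requests: " + failures.get());
    }
}
```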

There are a bunch of software tools for load testing that we recommend, many of them open source. To name a few:

mzbench

mzbench supports MySQL, PostgreSQL, MongoDB and Cassandra out of the box, and more protocols can be added easily. It was a very popular tool in the past, but its developer abandoned it about two years ago.

HammerDB

HammerDB supports Oracle Database, SQL Server, IBM Db2, MySQL, MariaDB, PostgreSQL and Redis. Unlike mzbench, it is under active development as of May 2020.

Apache JMeter

Apache JMeter focuses more on web services (DB protocols are supported via JDBC). This is the old-fashioned (though somewhat cumbersome) Java tool I was using ten years ago for fun and profit.

BlazeMeter

BlazeMeter is a proprietary tool. It runs JMeter, Gatling, Locust, Selenium (and more) open source scripts in the cloud to enable simulation of more users from more locations. 

Spirent Avalanche Hardware

If you are into heavy guns, meaning you are developing solutions like WAFs, SDNs, routers, and so on, then this testing tool is for you. Spirent Avalanche is capable of generating up to 100 Gbps, performing vulnerability assessments, QoS and QoE tests and much more. I have to admit – it was my first load testing tool as a fresh graduate working at Check Point, and I still remember how amazed I was to see its power.

Shadowing/Mirroring in Production

Send a portion of your production traffic to your newly deployed service and see how it’s handled in terms of performance and possible regressions. Did something go wrong? Just stop the shadowing and take your new service down – with zero impact on production. This technique is also known as “Dark Launch” and is described in detail in Google’s blog post CRE life lessons: What is a dark launch, and what does it do for me?

A proper configuration of load balancers/proxies/message queues will do the trick. If you are developing a cloud native application (Kubernetes / Microservices) you can use solutions like:

HAProxy

HAProxy is an open source, easy-to-configure proxy server.

Envoy proxy 

Envoy proxy is open source and a bit more advanced than HAProxy. Built for the microservices world, it offers service discovery, shadowing, circuit breaking and dynamic configuration via API.

Istio

Istio is a full open-source service mesh solution. Under the hood it uses the Envoy proxy as a sidecar container in every pod; this sidecar is responsible for incoming and outgoing communication. Istio controls service access, security, routing and more.
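The proxies above implement mirroring where it belongs, at the infrastructure layer. Conceptually, though, shadowing boils down to duplicating each live request to the new version and discarding the duplicate's response. A toy in-application sketch of that idea (both service URLs are hypothetical):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ShadowingSketch {
    private static final HttpClient client = HttpClient.newHttpClient();
    private static final URI LIVE   = URI.create("https://service-v1.internal/api/orders");
    private static final URI SHADOW = URI.create("https://service-v2.internal/api/orders");

    static String handle(String body) throws Exception {
        // Fire-and-forget copy to the new version; its response (or failure) never reaches the user
        client.sendAsync(
                HttpRequest.newBuilder(SHADOW)
                        .POST(HttpRequest.BodyPublishers.ofString(body)).build(),
                HttpResponse.BodyHandlers.discarding())
              .exceptionally(err -> null);

        // The user only ever sees the response from the current production version
        HttpResponse<String> live = client.send(
                HttpRequest.newBuilder(LIVE)
                        .POST(HttpRequest.BodyPublishers.ofString(body)).build(),
                HttpResponse.BodyHandlers.ofString());
        return live.body();
    }
}
```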

Canarying in Production

Google SRE Book defines “canarying” as the following:

To conduct a canary test, a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion. Should anything go awry, the modified servers can be quickly reverted to a known good state.

This technique, as well as the similar (but not identical!) Blue-Green deployment and A/B testing techniques, is discussed in this Christian Posta blog post, while the caveats and cons of canarying are reviewed here. As for recommended tools:

Spinnaker

Spinnaker, the CD platform Netflix open-sourced, leverages the aforementioned and many other deployment best practices (like everything Netflix builds, it was designed with microservices in mind).

ElasticBeanstalk

AWS supports Blue/Green deployment with its Elastic Beanstalk PaaS solution.

Azure App Services

Azure App Services has its own staging slots capability that allows you to apply the prior techniques with zero downtime.

LaunchDarkly

LaunchDarkly is a feature flagging solution for canary releases – it enables gradual capacity testing of new features and safe rollback if issues are found.
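Whatever tool you pick, the mechanism behind a percentage-based canary or feature flag is simple: deterministically route a small slice of traffic to the new path and keep the rollback one configuration change away. A hand-rolled sketch (the flag name and rollout percentage are illustrative):

```java
public class CanaryFlag {
    private final int rolloutPercent; // raise gradually: 1 -> 5 -> 25 -> 100, or drop to 0 to roll back

    public CanaryFlag(int rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    // Deterministic: the same user always lands in the same bucket,
    // so their experience doesn't flip between versions on every request
    public boolean isEnabledFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < rolloutPercent;
    }
}

// Usage:
//   CanaryFlag newCheckout = new CanaryFlag(5); // 5% of users
//   if (newCheckout.isEnabledFor(userId)) { /* new code path */ } else { /* old code path */ }
```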

Chaos Engineering in Production

First introduced by Netflix’s ChaosMonkey, chaos engineering has emerged as a separate and very popular discipline. It is not about “simple” load testing; it is about bringing down service nodes, reducing DB shards, misconfiguring load balancers, causing timeouts – in other words, messing up your production environment as badly as possible.
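At its smallest, a chaos experiment just injects failures you expect the system to survive and verifies that it actually does. Here is a toy latency-and-error injector wrapped around a downstream call, purely illustrative; real experiments use the tools below, with a controlled blast radius:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class ChaosWrapper {
    private final double failureRate;      // e.g. 0.05 = fail 5% of calls
    private final long maxExtraLatencyMs;  // upper bound on injected delay

    public ChaosWrapper(double failureRate, long maxExtraLatencyMs) {
        this.failureRate = failureRate;
        this.maxExtraLatencyMs = maxExtraLatencyMs;
    }

    public <T> T call(Supplier<T> downstream) throws InterruptedException {
        // Inject random latency so timeouts and retry policies get exercised
        Thread.sleep(ThreadLocalRandom.current().nextLong(maxExtraLatencyMs + 1));

        // Inject random failures so circuit breakers and fallbacks get exercised
        if (ThreadLocalRandom.current().nextDouble() < failureRate) {
            throw new RuntimeException("chaos: injected failure");
        }
        return downstream.get();
    }
}
```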

Winning tools in that area are tools I like to call “Chaos as a service”:

ChaosMonkey

ChaosMonkey is an open source tool by Netflix. It randomly terminates services in your production system, making sure your application is resilient to these kinds of failures.

Gremlin

Gremlin is another great tool for chaos engineering. It allows DevOps (or a chaos engineer) to define simulations and see how the application will react in different scenarios: unavailable resources (CPU / Mem),  state changes (change systime / kill some of the processes), and network failures (packet drops / DNS failures).

Here are some others 

Debugging and Monitoring in Production

The last (but not least) toolset to review briefly is monitoring and debugging tools. Debugging and monitoring are the natural next steps after testing: testing in production provides real product data that we can then use for debugging. Therefore, we need the right tools to monitor and debug the test results in production.

There are some acknowledged leaders, each addressing the three pillars of observability (logs, metrics, and traces) in its own way:

DataDog

DataDog is a comprehensive monitoring tool with amazing tracing capabilities. This helps a lot in debugging with a very low overhead.

Logz.io

Logz.io is all about centralized log management – combining it with DataDog can create a powerful toolset.

New Relic

New Relic is a very strong APM tool that offers log management, AIOps, monitoring and more.

Prometheus

Prometheus is an open source monitoring solution that includes metrics scraping, querying, visualization and alerting.
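As a sketch of how an application typically feeds Prometheus, here is a counter exposed with the classic Prometheus Java simpleclient; the metric name and port are illustrative, and Prometheus would be configured separately to scrape the /metrics endpoint:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class PrometheusExample {
    // A counter only ever goes up; Prometheus computes rates from the scraped samples
    static final Counter ordersProcessed = Counter.build()
            .name("orders_processed_total")
            .help("Total number of processed orders.")
            .register();

    public static void main(String[] args) throws Exception {
        // Exposes /metrics on port 9400 for Prometheus to scrape
        HTTPServer server = new HTTPServer(9400);

        ordersProcessed.inc();
        // ... application keeps running; Prometheus scrapes the endpoint periodically ...
    }
}
```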

Lightrun

Lightrun is a powerful production debugger. It enables adding logs, performance metrics and traces to production and staging in real time, on demand. Lightrun enables developers to securely add instrumentation without having to redeploy or restart. Request a demo to see how it works.

To sum up, testing in production is a technique you should pursue and experiment with if you are ready for a paradigm shift from diminishing risks in pre-production to managing risks in production.

Testing in production complements the testing you are used to doing, and adds important benefits such as speeding up the release cycles and saving resources. I covered some different types of production testing techniques and recommended some tools to use. If you want to read more, check out the resources I cited throughout the blog post. Let us know how it goes!

Learn more about Lightrun and let’s chat.

How Cool? Very Cool! Lightrun named a Cool Vendor by Gartner in Monitoring, Observability, and Cloud Operations

 

So… this is what it feels like to be COOL! 🙂 

We are thrilled to announce that Lightrun — the world’s first dev-native continuous observability and debugging platform — has been recognized by Gartner as a Cool Vendor, based on its April 28 report titled, “Cool Vendors in Monitoring, Observability and Cloud Operations” by Padraig Byrne, Pankaj Prasad, Hassan Ennaciri, Venkat Rayapudi, and Gregg Siegfried.

 

Why Does Gartner Think Lightrun is Cool? 

“Lightrun helps reduce mean time to repair (MTTR) by enabling continuous debugging capabilities. 

Lightrun provides the ability to dynamically instrument code at runtime without the need to release a hot fix or restart the services. 

Whether it is a monolithic application or a microservices architecture, reproducing issues in nonproduction environments can be time-consuming or impossible, especially in complex systems. 

Lightrun bridges the gap between development and production environments, allowing developers to gain observability by adding logs, traces and performance counters in real time.”

 

True Shift-left Dev-native Observability — Made Simple

This quote, from the Gartner Cool Vendor report, drives home the main ideas we’ve built Lightrun around.

But actually, when we say this product is cool, we usually mean it in a much simpler way. We think it’s awesome that this thing we built allows you to add logs and metrics to production applications with a (right-)click of a button.

Think about it for a second.

Doing the exact same thing today, without Lightrun, requires you to:

  1. Add new line(s) of code with the information you want to learn
  2. Push it to your repository on GitHub, GitLab, etc…
  3. Wait for the CI/CD pipeline to finish doing its thing
  4. Wait some more for the CI/CD pipeline to finish doing its thing
  5. Open the APM / logging tool
  6. Filter for the relevant information
  7. Profit!
  8. Once you’re done observing the information, remove the code (so as not to spam the logs)

With Lightrun, you can simply:

  1. Add a new Lightrun Action (logs, snapshots, etc…)
  2. Profit!

 Smooth.

 

Add a Lightrun Log!
Real-time logs in production, baby!

 

Software is Complicated, Y’all

As distributed systems, microservices, and CI/CD are becoming the new norm in the software industry, a new set of challenges and issues arises for the developers who build and maintain software in the wild. 

Multiple pieces of the same application, replicated across tens, hundreds or sometimes even thousands of different servers, make issues and bugs harder to locate and fix. The quick feedback loops agile teams require are hindered by increasingly long debugging cycles, and the need for (oftentimes multiple) hotfixes and redeployments to locate, pinpoint, and gather information about issues in live applications is now a painful day-to-day experience for the modern developer.

Lightrun’s vision is to break down those barriers by allowing developers to gain code-level insights into live applications, without violating the integrity and security of the services running in production.

With this in mind, Lightrun developed the first IDE-native observability platform that enables developers to securely add logs, metrics and traces to production and staging environments in real time, on demand. No hotfixes, redeployments or restarts are required. 

Developers use Lightrun for multiple code-level observability needs, including:

  • Code-level alerts – learn in real-time that a specific line in your code was reached in production
  • Feature verification – battle-test your features in a QA/staging/progressive delivery setting before releasing them to your entire customer base
  • Testing/ debugging in production – shine a light on difficult production issues without redeploying, restarting or even stopping the running application

 

Let’s Talk About Developer-Native Observability and Debugging

Lightrun’s vision goes far beyond faster debugging — it’s about empowering developers to understand and work on live applications, in real time, with a true shift-left, developer-native observability platform that has as little impact as possible on running production systems.

Lightrun also allows developers to truly understand features and expected behavior across the application’s entire lifecycle, whichever stage the application is in at any given time: development, QA/staging or production. This also allows for end-to-end ownership of debugging, troubleshooting, alerting and incident resolution in real time by the people who wrote the software — without sacrificing security or compliance.

Another upside of using Lightrun is substantially improving developer experience. By maximizing productivity, reducing time spent debugging, and allowing developers to detect, locate, and fix issues directly within the context of the tools they are comfortable with — the IDE or the CLI — Lightrun eliminates time-consuming context switching.

 

Slashing MTTR and Shaving Hours off of Frustrating Debugging Sessions

Currently, Lightrun is trusted by leading global companies, such as Ad-Tech giant Taboola, security leaders WhiteSource and Tufin, Data Analytics powerhouse Sisense, and e-Commerce platform Yotpo, just to name a few. 

The common ground among our customers is the recognition that the challenges currently faced by development teams cause longer debugging cycles and redundant redeployments, which negatively impact the ability (and agility) to quickly solve issues in production, meet SLAs and crack difficult customer-facing usability bugs.

Those challenges are only going to increase in upcoming years, as more and more applications, products and services are moving from a monolith structure to distributed microservices and/ or serverless architectures.

With the ability to add real-time logs, metrics and traces to live applications, companies can now leverage Lightrun to reduce MTTR, save expensive developer hours spent debugging instead of developing, and guarantee faster, smoother release cycles, with fewer issues in production and speedy incident resolution when issues do occur.

 

So What’s the Bottom Line?

Summing it up, this is what Gartner had to say about the value Lightrun’s solution brings to the table:  

“Lightrun should be of interest to development teams, SREs and operations teams that are accountable for improving observability of complex software products to improve MTTR and resiliency. 

Having a nonintrusive way to add logs and metrics while the service is running in production gives engineers faster access and insight to production issues, and helps enhance their diagnosis and root cause analysis capabilities.

SREs can leverage the feedback from the added instrumentation in production to build better nonfunctional requirements (NFRs) to improve system resiliency and recoverability. 

They also can improve detection to help resolve issues before impacting users or set up autohealing scripts to restore service more quickly.”

 

… And to End It On a Cool Note:

Last month, we released a cloud-based, self-service version of our dev-native observability platform — you can sign up right here and get all the cool features at the right-click of a button.

About Lightrun

Lightrun is a Tel Aviv-based startup that is transforming the developer experience with a developer-native observability platform. The company is the first to introduce “shift left” observability, empowering developers to gain deeper insights into their live, running applications by allowing them to insert logs, metrics and traces during runtime. Boasting the richest set of observability pillar tools for observing applications directly from within the IDE, Lightrun simplifies every aspect of incident resolution. Lightrun is ISO-27001 certified and is proud to have some of the most innovative technology companies in the world as customers, including Taboola, Sisense, Yotpo, Tufin, WhiteSource, and more. In April 2021, Gartner named Lightrun a Cool Vendor in Monitoring, Observability and Cloud Operations. For more information, visit our website or follow Lightrun on Twitter and LinkedIn.

 

Disclaimer: This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Gartner. The GARTNER COOL VENDOR badge is a trademark and service mark of Gartner, Inc. and/or its affiliates and is used herein with permission. All rights reserved.

Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

 

 
