Continuous Debugging and Observability: the Next Agility Best Practices

A Message from Lightrun’s Founders

Ever since the publication of the “Manifesto for Agile Software Development” approximately two decades ago, software development has become more and more agile. Companies all over the globe have adopted new work processes, built new R&D team structures and incorporated new developer tools and methods. Agility has become the “right” way to do R&D, and organizations are willing to spend a lot of resources to become agile and improve their product delivery and customer experience.

Software development agility is the flexible management and execution of the development process. However, because the term is vague, companies usually equate “agility” with CI/CD. The two are connected – CI/CD is part of agility – but CI/CD does not encompass the full breadth of agility and what it offers R&D departments.

CI/CD means rapid code change, fast delivery and quality. It is a set of principles that enables pushing smaller code changes more frequently, allowing quick advancement of the product and fast error resolution. CI/CD is implemented by automating building, testing and deployment with dedicated tools like Jenkins, CircleCI, and Atlassian Bamboo. Teams also change their structure as they break monoliths into microservices: larger teams built around a development specialty (e.g., backend) turn into smaller, mission- and feature-focused teams.

But there’s more to agility than CI/CD. What happens after deployment, when the service is running?

Agility for Live Running Services 

In recent years, the gap between development and production environments has widened. Microservices, Kubernetes, Docker Swarm, ECS, Big Data workers and serverless require complex maintenance in production. These architectures have made production issues hard to anticipate in development and even harder to reproduce there.

Simulating production data, scale and user behavior in a dev environment is next to impossible. As if that’s not enough, bugs are difficult to reproduce in staging and development. To top it all off, modern architectures (microservices, serverless) tend to fail more often – as is common for distributed applications.

Yet in today’s world, the only way to understand your application’s behavior is through log lines and metrics that were defined during the development stage, way before production. We know how frustrating and disheartening it is to discover you’re missing a log line just where you need it. Developers find themselves going through numerous iterations just to add another log line or metric.

Engineering teams have advanced tools to analyze production, like logging tools and APMs. But each time they need new visibility into their code, they have to go through the full development and release cycle again (and again) and recreate the preconditions for the issue. This has created the observability gap.

The observability gap limits R&D’s ability to react fast and troubleshoot production issues, which, in turn, obstructs agility both in live applications and when delivering software. In other words, the observability gap restricts developers’ and DevOps’ ability to deliver. MTTR rises, site reliability declines, and a great deal of time is wasted.

We needed a solution for bridging this observability gap. That’s why we came up with Lightrun.

Bridging the Observability Gap with Continuous Debugging and Continuous Observability

We started Lightrun to enable development agility for solving problems in live environments. Developers have a lot of power and agility in development environments. They have debuggers, profilers, and entire pipelines. But in the most vital environment, the meat-and-potatoes of R&D departments, the environment that serves the customers, the process is still stuck and iterative, and requires a lot of time, effort and resources.

Lightrun is the first complete Continuous Debugging and Continuous Observability platform. This means the ability to instrument the three pillars of observability in staging or production environments, i.e., while the service is live. Lightrun enables developers to securely add log lines, performance metrics and traces to production, on demand and while the app is running. True agility and 100% code-level observability are now achieved!

Based on the same principles as CI/CD, Continuous Debugging and Observability (CDB/CO) empowers development teams to shorten iterations and boost product quality and reliability. Just like CI/CD shortens the release cycle, CDB/CO shortens the RCA (Root Cause Analysis) cycle.

We also wanted to make sure that agility was at the heart of our processes, so it was important to us to incorporate ourselves into existing developer workflows, end-to-end. We started in the development environment, by integrating into developers’ existing IDEs, and ended in production, with their APM and log aggregation tools.

Announcing Lightrun

Lightrun was established a year ago with the vision of leading the CDB/CO revolution. Lightrun was founded by developers, for developers. We both have a heavy technical background from the elite 8200 IDF unit, public enterprise companies and successful startups that were acquired. Our goal is to deeply transform developers’ lives all over the world by disrupting the way they collect data from live apps and the way they debug and troubleshoot live applications.

In the past year we raised funding from some of the most prominent investors reshaping the development ecosystem (including senior development executives in Fortune 500 companies). We built an A-player team of accomplished people from the enterprise developer-tool sphere, and we built a product that already serves satisfied customers who are expanding their use on a daily basis. It is now time to launch our product and company to the public. Lightrun is emerging from stealth to become the market leader in the Continuous Debugging and Observability revolution. Contact us to learn more.

The Cost of Production Blindness

When I speak at conferences, I often fall back to the fact that just a couple of decades ago we’d observe production by kicking the server. This is obviously no longer practical. We can’t see our production. It’s an amorphous cloud that we can’t touch or feel. A power that we read about but don’t fully grasp.


A part of this major shift in our industry is a change to our fundamental roles as engineers. DevOps and SRE are roles that didn’t exist back then, yet today they’re often essential for major businesses. These roles brought tremendous advancements to the reliability of production, but they also carried a cost: distance.

Production is in the amorphous cloud, which is accessible everywhere. Yet it’s never been further away from the people who wrote the software powering it. We no longer have the fundamental insight we took for granted a bit over a decade ago.

Is That So Bad?

Yes, and no. We gave up some insight and control and got a lot in return:

  • Stability
  • Simplicity
  • Security

These are pretty incredible benefits, and we don’t want to give them up. But we also lost some insight: debugging became harder and complexity rose. We’ve discussed these problems before, but today I want to talk about one impact only…

Cost

This is a form of blindness.

I wrote a lot about the impact of this situation on the reliability of our cloud deployments. But today I want to talk about the financial and environmental costs. Initially, the cloud was billed as a cost saving measure and there was some truth to that. The agility of deployment let us cut down on hardware costs, consolidate and simplify.

But as we got used to the cloud, our appetite for scale/reliability grew. We ended up simplifying deployment to such an extent that launching a container can be accomplished seamlessly, with no interaction on our part. This is enormous progress but also troubling. We slowly lose grip on our costs and end up paying more for less.

So what’s the solution?

APMs are a category that rose to prominence specifically around this problem. Today, they are more important than ever. They help us get a sense of the Pareto principle (80/20 rule) so we can focus optimizations on the specific areas that cost the most.

This is a powerful and important tool that DevOps use every day, but it’s also a very limited one.

Before we proceed, I’d like to take a moment to discuss the concept of cost. The most obvious impact is on our monthly cloud provider bill – money that might otherwise fund a whole department. But there’s a more important cost, in my humble opinion: the environmental cost. We tend to ignore the electricity spend because it’s so amorphous. Yet this cost is severe; for example, the cost of a single cloud instance over one year can be the equivalent of a transatlantic flight.

We don’t see the underlying hardware, but it’s there, and it carries a carbon footprint. By optimizing, we can affect both costs significantly.

Observing Production Effectively

APMs are great for measuring performance at a high level. But they provide very little detail about the dynamic inner workings of the application and the cost-cutting measures we can take inside. I often liken them to the bat signal or a check engine light. They notify us of a problem but leave us without the tools to inspect the details.

That’s where developer observability tools fill the gap. They provide low-level, actionable insights into the application, let us verify assumptions, and give developers the means to truly understand production.

Instead of discussing the theory, let’s give some examples of actions you can take today with developer observability tools to reduce your production costs.

Reduce Logs

Log ingestion is probably the most expensive feature in your application. Removing a single line of log code can end up saving thousands of dollars in ingestion and storage costs. We tend to overlog since the alternative is production issues that we can’t trace to their root cause.

We need a middle ground. We want the ability to follow an issue through without overlogging. Developer observability lets you add logs into production dynamically, as needed. This frees you from the need to overlog and lets you focus on logging a reasonable amount. You can also raise the log level to keep log volume down. I wrote about this in depth here.
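
To make this concrete, here’s a minimal sketch (assuming SLF4J and a hypothetical OrderService) of what “logging a reasonable amount” can look like: verbose detail stays behind a DEBUG guard, so raising the production log level keeps ingestion costs down, while the rare extra detail can be added on demand as a dynamic log instead of being baked in.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void processOrder(String orderId) {
        // Always-on, low-volume log line
        log.info("Processing order {}", orderId);

        // Expensive detail is only built when DEBUG is actually enabled,
        // so raising the level to INFO in production cuts both CPU and ingestion
        if (log.isDebugEnabled()) {
            log.debug("Order payload: {}", loadFullPayload(orderId));
        }
    }

    private String loadFullPayload(String orderId) {
        return "..."; // placeholder for an expensive lookup
    }
}
```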

Caching

My top three tips for performance have always been:

  1. Caching
  2. Caching
  3. Caching

There’s really nothing else. It all boils down to that. Unfortunately, cache misses are notoriously hard to tune and detect. This is an even bigger issue in production, where we need to account for a changing landscape. For example, we cache up to 10 friends of a user on a social network, but in production the growth team encourages friendships and users end up with more friends…

You’d have cache misses more often and you wouldn’t even know.

Placing conditional breakpoints or temporary conditional logs on cache misses and inspecting them can go a long way toward detecting subtle issues like that. This can make an order-of-magnitude difference to performance when done right.
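
As an illustration, here’s a minimal sketch of the kind of branch such a conditional breakpoint or log would target – the cache-miss path – with a hand-rolled miss counter standing in for what the tooling would give you without a code change. The FriendsCache class and its database lookup are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class FriendsCache {
    private final Map<Long, List<Long>> cache = new ConcurrentHashMap<>();
    private final AtomicLong misses = new AtomicLong();

    public List<Long> friendsOf(long userId) {
        List<Long> friends = cache.get(userId);
        if (friends == null) {
            // The cache-miss branch: this is the line you'd target with a
            // conditional log or breakpoint in production
            misses.incrementAndGet();
            friends = loadFriendsFromDatabase(userId);
            cache.put(userId, friends);
        }
        return friends;
    }

    public long missCount() {
        return misses.get();
    }

    private List<Long> loadFriendsFromDatabase(long userId) {
        return List.of(); // placeholder for the real lookup
    }
}
```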

However, there’s a bigger payout here. Many developers ignore L2 caches entirely. This is understandable: they are hard to maintain and debug, especially in production. A single corrupted or out-of-sync value and you end up with a major bug. The problem is that debugging these things has to happen in production, because the cache behaves radically differently there due to its distributed nature.

We built developer observability solutions to debug exactly these types of problems. By placing snapshots and logs over cache population and invalidation, we can narrow down the point of corruption and fix cache-related issues. Once the cache works properly in production, overhead can be reduced significantly!
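
Here’s a rough sketch of what instrumenting the population and invalidation points looks like. The InstrumentedCache wrapper is purely illustrative, not a specific library API, and in practice the logs would be added dynamically at these lines rather than written into the code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class InstrumentedCache<K, V> {
    private static final Logger log = LoggerFactory.getLogger(InstrumentedCache.class);
    private final Map<K, V> delegate = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        // Population point: a snapshot or log here shows who wrote which value
        log.debug("cache put key={} valueHash={}", key, value.hashCode());
        delegate.put(key, value);
    }

    public void invalidate(K key) {
        // Invalidation point: stale or out-of-sync entries usually trace back here
        log.debug("cache invalidate key={}", key);
        delegate.remove(key);
    }

    public V get(K key) {
        return delegate.get(key);
    }
}
```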

Micro Benchmarks

APMs provide us with high-level numbers on performance and a general direction. They don’t point at the lines of code we need to address; that’s left to guesswork. If the system behaved identically when running locally, this would be fine. Unfortunately, that’s rarely the case. For example, a database query can have a significantly different impact when running in production. Based on local profiling results, you might waste your energy on the wrong optimization.

Developer observability tools provide the ability to narrow down the performance overhead of a code snippet. This lets us follow through the web service stack and narrow down the actual lines that are taking the most CPU time. We can accomplish this by adding a tictoc metric that measures the time between the tic line and the toc line.

We can mark a block of code and get statistics about its execution time. As in the common case of a specific query taking longer in production, we can quickly prove that this is the cause of the performance problem using this tool. The impact of many “small” issues like this can be significant in a large system and can easily mean the difference between scaling and a bottleneck.
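
For illustration, here’s a minimal hand-rolled equivalent of a tictoc measurement in Java. The runQuery() method is a hypothetical stand-in for the block you suspect is slow; with a developer observability tool the same measurement is attached dynamically between a tic line and a toc line, without touching the code.

```java
public class QueryTiming {
    public static void main(String[] args) {
        long tic = System.nanoTime();            // "tic" line
        runQuery();                              // block under measurement
        long toc = System.nanoTime();            // "toc" line
        System.out.printf("query took %.2f ms%n", (toc - tic) / 1_000_000.0);
    }

    private static void runQuery() {
        // Placeholder for the production query whose cost we want to prove
        try {
            Thread.sleep(50);
        } catch (InterruptedException ignored) {
        }
    }
}
```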

Verification and Dead Code

A common problem is underutilized resources. APMs expose some of those problems, but not all of them. Dead code can have a significant impact on our bottom line.

How many times did you refactor code or stop yourself from refactoring because of a legacy mess you didn’t want to touch?

Yes, that legacy mess is used in your code, so you don’t want to “risk it”. If you end up changing the code, you need to walk on eggshells, and the entire operation can take an order of magnitude longer. This maps directly to cost, since developer time is valuable and could be spent optimizing. More often than not, it also blocks major optimizations.

But what if that block of code isn’t used by anyone in production?

What if it’s used by very few people?

That’s exactly what the counter metric does. It counts the number of times a line was reached. It can tell us which methods are important to us and how frequently they’re reached. You wouldn’t be as concerned about a refactor if only three people reach that line of code…
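
Here’s a minimal sketch of the hand-rolled equivalent of a counter metric. The legacyBranch() method is a hypothetical stand-in for the code you’re afraid to refactor; the tooling attaches the counter dynamically to a line in production instead of requiring a change like this.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ReachCounter {
    private static final AtomicLong legacyHits = new AtomicLong();

    public static void legacyBranch() {
        // Counts how often this path is actually taken in production
        legacyHits.incrementAndGet();
        // ... the legacy logic you are reluctant to touch ...
    }

    public static long legacyHitCount() {
        return legacyHits.get();
    }
}
```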

Finally

I could carry on with the discussion of these techniques, but the gist is simple: we need to “see” what’s going on. As developers, we’re given a task to build a product. But the tools that let us peer into production aren’t as capable as our local tools. The results we get from production can be very misleading.

As we scale production deployments, we need to use a new class of tools that exposes our code in this way. I can classify modern production with one word:

DREAD.

It’s that deep, binding fear we all feel when we push a major change into production. People have lost their jobs by pushing bad code to production. That’s scary!

What do we do when facing such dread?

We keep going, but carefully. We step lightly and don’t take big risks. Is our code wasteful?

Maybe, but the risk of bringing down production is far scarier than the benefit of shaving some of the company’s expenses.

Developer observability is the light within this darkness. When you shine a light in the dark, you take away some of the fear and make production more approachable. We can measure, test, and move fast. We also have a better sense of the risks we’ll be facing with upcoming changes. The tooling also gives us a sense of the upside: how much can we save? Imagine saving the cost of your entire department in cloud expenses. That’s job security right there… and the best way to fight that fear of risky changes.
