Testing in Production: Recommended Tools
Testing in production has a bad reputation – the same kind that “git push --force origin master” has. Burning houses and Chuck Norris represent testing in production in memes, and that says it all. When done poorly, testing in production very much deserves the sarcasm and negativity. But that’s true for any methodology or technique.

This blog post aims to shed some light on the testing in production paradigm. I will explain why giants like Google, Facebook and Netflix see it as a legitimate and very beneficial instrument in their CI/CD pipelines – so much so, in fact, that you might consider adopting it as well. I will also provide recommendations for testing in production tools, based on my team’s experience.

Testing In Production – Why?

Before we proceed, let’s make it clear: testing in production is not applicable to every kind of software. Embedded software, on-prem high-touch installation solutions or any type of critical system should not be tested this way. The risks (and as we’ll see further on, it’s all about risk management) are too high. But do you have a SaaS solution with a backend that leverages a microservices architecture, or even just a monolith that can be easily scaled out? Or any other solution whose deployment and configuration your engineers fully control? Ding ding ding – those are the ideal candidates.

So let’s say you are building your SaaS product and have already invested a lot of time and resources to implement both unit and integration tests. You have also built your staging environment and run a bunch of pre-release tests on it. Why on earth would you bother your R&D team with tests in production? There are multiple reasons: let’s take a deep dive into each of them.

Staging environments are bad copies of production environments

Yes, they are. Your staging environment is never as big as your production environment – in terms of server instances, load balancers, DB shards, message queues and so on. It never handles the load and network traffic that production does. So it will never have the same number of open TCP/IP connections, HTTP sessions, open file descriptors or parallel write queries hitting the database. There are stress testing tools that can emulate that load, but when you scale, this stops being sufficient very quickly.

Besides the size, the staging environment never matches production in terms of configuration and state. It is often configured to start a fresh copy of the app upon every release, security configurations are relaxed, ACLs and service discovery will never handle real-life production scenarios, and the databases are emulated by recreating them from scratch with automation scripts (copying production data is often impossible, even legally, due to privacy regulations such as GDPR). Well, after all, we all try our best.

At best we can create a bad copy of our production environment. This means our testing will be unreliable and our service will remain susceptible to errors in the real-life production environment.

Chasing after maximum reliability before the release costs. A lot.

Let’s just cite Google’s SRE engineers:

“It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.

Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be.”

Let’s emphasize the point: “Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear”. No unit/integration/staging-environment tests will ever make your release 100% error-free. In fact, they shouldn’t (well, unless you are a Boeing engineer). After a certain point, investing more and more in tests and attempting to build a better staging environment will just cost you more compute/storage/traffic resources and will significantly slow you down.

Doing more of the same is not the solution. You shouldn’t spend your engineers’ valuable work hours chasing the dragon trying to diminish the risks. So what should you be doing instead?

Embracing the Risk

Again, citing the great Google SRE Book:

“…we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos…. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.”

So it is not just about when and how you run your tests. It’s about how you manage the risks and costs of your application’s failures. No company can afford downtime of its product because of a failed test (which is totally OK in staging). Therefore, it is crucial to ensure that your application handles failures right. “Right”, quoting the great post by Cindy Sridharan, means:

“Opting in to the model of embracing failure entails designing our services to behave gracefully in the face of failure.”

The design of fault tolerant and resilient apps is out of the scope of this post (Netflix Hystrix is still worth a look though). So let’s assume that’s how your architecture is built. In such a case, you can fearlessly roll-out a new version that has been tested just enough internally.
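To give a flavor of what “behaving gracefully in the face of failure” looks like in code, here is a minimal sketch using Netflix Hystrix – the command, service and fallback payload are made up for illustration, and newer projects may prefer alternatives such as Resilience4j:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// A command wrapping a (pretend) call to a downstream recommendations service.
public class RecommendationsCommand extends HystrixCommand<String> {
    private final String userId;

    public RecommendationsCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("recommendations"));
        this.userId = userId;
    }

    @Override
    protected String run() {
        // Imagine an HTTP call to another service here; it can fail or time out.
        throw new RuntimeException("downstream service unavailable for " + userId);
    }

    @Override
    protected String getFallback() {
        // Degrade gracefully instead of propagating the failure to the user.
        return "[]";
    }

    public static void main(String[] args) {
        // The failure in run() is absorbed and the fallback is returned: prints "[]".
        System.out.println(new RecommendationsCommand("user-42").execute());
    }
}
```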

And then, the way to bridge the gap so as to get as close as possible to 100% error-free, is by testing in production. This means testing how our product really behaves and fixing the problems that arise. To do that, you can use a long list of dedicated tools and also expose it to real-life production use cases.

So the next question is – how to do it right?

Testing In Production – How?

Cindy Sridharan wrote a great series of blog posts that discusses the subject in great depth. Her recent Testing in Production, the safe way blog post includes a table of the test types you can run in pre-production and in production.

One should definitely read carefully through this post. We’ll just take a brief look and review some of the techniques she offers. We will also recommend various tools from each category. I hope you find our recommendations useful.

Load Testing in Production

As simple as it sounds. Depending on the application, it makes sense to stress its ability to handle huge amounts of network traffic, I/O operations (often distributed), database queries, various forms of message-queue storms and so on. Some severe bugs appear clearly only under load (hi, memory overwrite). Even if not – your system can only handle a limited amount of load, so failure tolerance and graceful handling of dropped connections become really crucial here.

Obviously, performing a load test in the production environment will stress your app as it is configured for real-life use cases, and thus it will provide far more useful insights than load testing in staging.
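Before reaching for a dedicated tool, it helps to remember that a load test is, at its core, just many concurrent requests and a count of failures. A throwaway sketch in plain Java (Java 11+; the target URL and numbers are placeholders) might look like this:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Fires N concurrent GET requests at a target URL and counts non-2xx responses.
public class MiniLoadTest {
    public static void main(String[] args) throws Exception {
        String target = args.length > 0 ? args[0] : "https://staging.example.com/health";
        int requests = 1_000, concurrency = 50;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(target)).GET().build();
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        AtomicInteger failures = new AtomicInteger();

        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    HttpResponse<Void> resp =
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                    if (resp.statusCode() >= 300) failures.incrementAndGet();
                } catch (Exception e) {
                    failures.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        System.out.printf("sent=%d failed=%d%n", requests, failures.get());
    }
}
```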

There are a bunch of software tools for load testing that we recommend, many of them open source. To name a few:

mzbench

mzbench supports MySQL, PostgreSQL, MongoDB and Cassandra out of the box, and more protocols can be easily added. It was a very popular tool in the past, but was abandoned by its developer about two years ago.

HammerDB

HammerDB supports Oracle Database, SQL Server, IBM Db2, MySQL, MariaDB, PostgreSQL and Redis. Unlike mzbench, it is under active development as of May 2020.

Apache JMeter

Apache JMeter focuses more on web services (DB protocols are supported via JDBC). This is the old-fashioned (though somewhat cumbersome) Java tool I was using ten years ago for fun and profit.

BlazeMeter

BlazeMeter is a proprietary tool. It runs JMeter, Gatling, Locust, Selenium (and more) open source scripts in the cloud to enable simulation of more users from more locations. 

Spirent Avalanche Hardware

If you are into heavy guns – meaning you are developing solutions like WAFs, SDNs, routers, and so on – then this testing tool is for you. Spirent Avalanche is capable of generating up to 100 Gbps of traffic, performing vulnerability assessments, QoS and QoE tests and much more. I have to admit – it was my first load testing tool as a fresh graduate working at Check Point, and I still remember how amazed I was to see its power.

Shadowing/Mirroring in Production

Send a portion of your production traffic to your newly deployed service and see how it is handled in terms of performance and possible regressions. Did something go wrong? Just stop the shadowing and take your new service down – with zero impact on production. This technique is also known as “dark launching” and is described in detail in Google’s CRE life lessons: What is a dark launch, and what does it do for me? blog post.
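Conceptually, shadowing is nothing more than duplicating a request, sending the copy to the new version and discarding its response. Here is an illustrative, application-level sketch in plain Java (the shadow host name is hypothetical); in practice you would normally let a proxy or service mesh do this for you, as described below:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative only: forwards a copy of an incoming request's path to a shadow
// deployment and ignores the response, so the primary flow is never affected.
public class ShadowForwarder {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Hypothetical host of the newly deployed candidate version.
    private static final String SHADOW_HOST = "https://checkout-v2.internal.example.com";

    public static void mirror(String pathAndQuery) {
        HttpRequest copy = HttpRequest.newBuilder(URI.create(SHADOW_HOST + pathAndQuery))
                .GET()
                .build();
        // Fire and forget: errors in the shadow service are logged, never propagated.
        CLIENT.sendAsync(copy, HttpResponse.BodyHandlers.discarding())
              .exceptionally(err -> {
                  System.err.println("shadow call failed: " + err);
                  return null;
              });
    }
}
```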

A proper configuration of load balancers/proxies/message queues will do the trick. If you are developing a cloud native application (Kubernetes / Microservices) you can use solutions like:

HAProxy

HAProxy is an open-source, easy-to-configure proxy server.

Envoy proxy 

Envoy proxy is open source and a bit more advanced than HAProxy. Built for the microservices world, it offers service discovery, traffic shadowing, circuit breaking and dynamic configuration via API.

Istio

Istio is a full open-source service mesh solution. Under the hood it uses the Envoy proxy as a sidecar container in every pod; this sidecar is responsible for all incoming and outgoing communication. Istio controls service access, security, routing and more.

Canarying in Production

The Google SRE Book defines “canarying” as follows:

To conduct a canary test, a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion. Should anything go awry, the modified servers can be quickly reverted to a known good state.

This technique, as well as the similar (but not identical!) Blue-Green deployment and A/B testing techniques, is discussed in this Christian Posta blog post, while the caveats and cons of canarying are reviewed here. As for recommended tools:

Spinnaker

Spinnaker is the CD platform that Netflix open-sourced. It leverages the aforementioned technique and many other deployment best practices (and, as with everything Netflix builds, it was designed with microservices in mind).

ElasticBeanstalk

AWS supports Blue/Green deployments with its ElasticBeanstalk PaaS solution.

Azure App Services

Azure App Services has its own staging slots capability that allows you to apply the prior techniques with zero downtime.

LaunchDarkly

LaunchDarkly is a feature flagging solution for canary releases – it enables gradual capacity testing of new features and a safe rollback if issues are found.
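As a hedged sketch of what feature-flag-driven canarying looks like from the application side – assuming a recent version of the LaunchDarkly server-side Java SDK, with the flag name and SDK key as placeholders:

```java
import com.launchdarkly.sdk.LDContext;
import com.launchdarkly.sdk.server.LDClient;

// Gate the canary code path behind a flag so it can be rolled out gradually
// and switched off instantly if production metrics degrade.
public class CheckoutHandler {
    // Placeholder SDK key; in real code it would come from configuration.
    private final LDClient ldClient = new LDClient("sdk-key-placeholder");

    public String handle(String userKey) {
        LDContext context = LDContext.builder(userKey).build();
        // "new-checkout-flow" is a hypothetical flag name; false is the safe default.
        boolean useNewFlow = ldClient.boolVariation("new-checkout-flow", context, false);
        return useNewFlow ? newCheckoutFlow() : legacyCheckoutFlow();
    }

    private String newCheckoutFlow() { return "checkout-v2"; }
    private String legacyCheckoutFlow() { return "checkout-v1"; }
}
```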

Chaos Engineering in Production

First introduced by Netflix’s ChaosMonkey, Chaos Engineering has emerged as a separate and very popular discipline. It is not about “simple” load testing; it is about bringing down service nodes, reducing DB shards, misconfiguring load balancers, causing timeouts – in other words, messing up your production environment as badly as possible.
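The spirit of the discipline can be sketched in a few lines of Java – this is purely illustrative of the idea of fault injection, not how any of the tools below actually work:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Purely illustrative fault injector: wrap a downstream call so that, with a small
// probability, it gains extra latency or fails outright. Running with this enabled
// verifies that callers time out, retry and degrade gracefully.
public class ChaosWrapper {
    public static <T> T call(Supplier<T> realCall) {
        int roll = ThreadLocalRandom.current().nextInt(100);
        if (roll < 5) {                        // ~5% of calls: inject two seconds of latency
            try {
                Thread.sleep(2_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } else if (roll < 7) {                 // ~2% of calls: inject a failure
            throw new RuntimeException("chaos: injected failure");
        }
        return realCall.get();
    }

    public static void main(String[] args) {
        // Example usage: wrap any call site, e.g. a (pretend) inventory lookup.
        System.out.println(call(() -> "42 items in stock"));
    }
}
```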

The winning tools in this area are what I like to call “Chaos as a service”:

ChaosMonkey

ChaosMonkey is an open-source tool by Netflix. It randomly terminates service instances in your production system, making sure your application is resilient to these kinds of failures.

Gremlin

Gremlin is another great tool for chaos engineering. It allows DevOps engineers (or a chaos engineer) to define simulations and see how the application reacts in different scenarios: unavailable resources (CPU/memory), state changes (changing the system time, killing some of the processes), and network failures (packet drops, DNS failures).

Here are some others.

Debugging and Monitoring in Production

Last but not least, let’s briefly review monitoring and debugging tools. Debugging and monitoring are the natural next steps after testing: testing in production provides us with real product data, which we can then use for debugging. Therefore, we need the right tools to monitor and debug the test results in production.

There are some acknowledged leaders, each of them addressing the three pillars of observability – logs, metrics, and traces – in its own way:

DataDog

DataDog is a comprehensive monitoring tool with amazing tracing capabilities. This helps a lot in debugging with a very low overhead.

Logz.io

Logz.io is all about centralized logs management – its combination with DataDog can create a powerful toolset. 

New Relic

New Relic is a very strong APM tool that offers log management, AIOps, monitoring and more.

Prometheus

Prometheus is an open-source monitoring solution that includes metrics scraping, querying, visualization and alerting.

Lightrun

Lightrun is a powerful production debugger. It enables adding logs, performance metrics and traces to production and staging environments in real time, on demand. Lightrun lets developers securely add instrumentation without having to redeploy or restart. Request a demo to see how it works.

To sum up, testing in production is a technique you should pursue and experiment with if you are ready for a paradigm shift from diminishing risks in pre-production to managing risks in production.

Testing in production complements the testing you are used to doing, and adds important benefits such as speeding up the release cycles and saving resources. I covered some different types of production testing techniques and recommended some tools to use. If you want to read more, check out the resources I cited throughout the blog post. Let us know how it goes!

Learn more about Lightrun and let’s chat.

Top 8 IntelliJ Debug Shortcuts
Let’s get real – as developers, we spend a significant amount of time staring at a screen and trying to figure out why our code isn’t working. According to Coralogix, there is an average of 70 bugs per 1,000 lines of code. That’s a solid 7% worth of blips, bumps, and bugs. In addition, fixing a bug can take 30 times longer than writing an actual line of code. But it doesn’t have to be this way. If you’re using IntelliJ (or are thinking about making the switch to it), the built-in debugger and its shortcuts can help speed up the process. But first, what is IntelliJ?

What is IntelliJ?

If you’re looking for a great Java IDE, you should check out IntelliJ IDEA. It’s a robust, feature-rich IDE that is perfect for developing Java applications. While VSCode is excellent in many situations, IntelliJ is designed specifically for Java. Here’s a quick overview of IntelliJ IDEA and why it’s so great.

IntelliJ IDEA is a Java IDE developed by JetBrains. It’s a commercial product, but a free community edition is available. Some of the features include:

  • Intelligent code completion
  • Refactoring
  • Code analysis
  • Support for various frameworks and libraries
  • Great debugger

Debugging code in IntelliJ

If you’re a developer, you will have to debug code sooner or later. But what exactly is debugging? And why do we do it?

Debugging is the process of identifying and removing errors from a computer program. Errors can be caused by incorrect code, hardware faults, or software bugs. When you find a bug, the first thing you need to do is try to reproduce it so you can narrow down the problem and identify the root cause. Once you’ve reproduced the bug, you can start to debug the code.

Debugging is typically done by running a program in a debugger, which is a tool that allows the programmer to step through the code, line by line. The debugger will show the values of variables and allow programmers to change them so they can find errors and fix them.

The general process of debugging follows this flow:

  • identify the bug
  • reproduce the bug
  • narrow down where in the code the bug is occurring
  • understand why the bug exists
  • fix the bug

Debugging process

More often than not, we spend our time on the second and third steps. Statistically, we spend approximately 75% of our time just debugging code. In the US, $113B is spent on developers trying to figure out the what, where, why, and how of existing bugs. Leveraging the IDE’s built-in features will allow you to condense the debugging process.

Sure, using a debugger will slow down the execution of the code, but most of the time you don’t need it to run at the same snail’s pace through the entire process. The shortcut controls let you observe the inner workings of your code at whatever pace you need.

Without further ado – here are the top 8 IntelliJ debug shortcuts, what they do and how they can help speed up the debugging process.

Top 8 IntelliJ Debug Shortcuts

1. Step Over (F8)

Stepping is the process of executing a program one line at a time. It helps with debugging by allowing the programmer to see the effects of each line of code as it is executed. Stepping can be done manually, by setting breakpoints and advancing the program one line at a time, or automatically, by using a debugger tool that executes the program line by line.

Step over (F8) takes you to the following line without stepping into the method call, if there is one on the current line. This can be helpful if you need to quickly pass through the code, hunt down specific variables, and figure out at what point the code exhibits undesired behavior.

Step Over (F8)

2. Step into (F7)

Step into (F7) will take the debugger inside the method to demonstrate what gets executed and how variables change throughout the process.

This functionality is helpful if you want to narrow down where in the method a value gets transformed incorrectly.

Step into (F7)
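To make the difference between stepping over and stepping into concrete, consider this tiny, made-up example:

```java
public class InvoiceService {
    public static void main(String[] args) {
        double net = 100.0;
        // Breakpoint here: Step Over (F8) executes addVat and stops on the next line;
        // Step Into (F7) jumps inside addVat so you can watch it line by line.
        double gross = addVat(net);
        System.out.println(gross);
    }

    static double addVat(double amount) {
        return amount * 1.17; // step through this only if you suspect the bug is in here
    }
}
```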

3. Smart step into (Shift + F7)

Sometimes multiple methods are called on the same line. Smart step into (Shift + F7) lets you decide which one to step into, which is helpful as it enables you to target potentially problematic methods or go through a clean process of elimination.

Smart step into (Shift + F7)

4. Step out (Shift + F8)

At some point, you will want to exit the method. The step out (Shift + F8) functionality takes you back to the calling method, one level up the call hierarchy of your code.

Step out (Shift + F8)

5. Run to cursor (Alt + F9)

As an alternative to setting manual breakpoints, you can use your cursor as a marker for the debugger.

Run to cursor (Alt + F9) will let the debugger run until it reaches where your cursor is pointing. This step can be helpful when you are scrolling through code and want to quickly pinpoint issues without the need to set a manual breakpoint.

6. Evaluate expression (Alt + F8)

It’s one thing to run your code at the speed you need; it’s another to see what’s happening at each step. Under normal circumstances, hovering your cursor over the expression will give you a tooltip.

But sometimes, you just need more details. Using the evaluate expression shortcut (Alt + F8) will reveal the child elements of the object, which can help obtain state transparency.

Evaluate expression (Alt + F8)

7. Resume program (F9)

Debugging is a constant stop-and-start process. You can resume a paused session with F9: this shortcut kicks the debugger back into gear and runs the program until the next breakpoint.

On a Mac, the key chord Cmd + Alt + R is required to resume the program.

8. Toggle (Ctrl + F8) & view breakpoints (Ctrl + Shift + F8)

Breakpoints can get nested inside methods – which can be a hassle to look at if you want to step out and see the bigger picture. This is where the ability to toggle breakpoints comes in.

You can toggle line breakpoints with Ctrl+F8. Alternatively, if you want to view and set exception breakpoints, you can use Ctrl+Shift+F8.

For Mac OS, the key chords are:

  • Toggle – Cmd + F8
  • View breakpoints – Cmd + Shift + F8

Toggle (Ctrl + F8) & view breakpoints (Ctrl + Shift + F8)

Improving the debugging process

If you’re a software engineer, you know that debugging is essential for the development process. It can be time-consuming and frustrating, but it’s necessary to ensure that your code is working correctly.

Fortunately, there are ways to improve the debugging process, and one of them is by using Lightrun. Lightrun is a cloud-based debugging platform you can use to debug code in real-time. It is designed to make the debugging process easier and more efficient, and it can be used with any programming language.

One of the great things about Lightrun is that you can use it to debug code in production, which means that you can find and fix bugs in your code before your users do. Lightrun can also provide a visual representation of the code being debugged, which helps you understand what is going on and identify the root cause of the problem. Start using Lightrun today!

Debugging Microservices: The Ultimate Guide
Microservices have come a long way from being a shiny, new cool toy for hypesters to a legitimate architecture that transforms the way modern applications are built. Microservices are loosely coupled, independently deployable and scalable, allow a highly diverse technology stack – and these are just some of their biggest advantages. Also, these are some of their biggest disadvantages, especially when it comes to debugging microservices.

That’s because all the world’s a trade-off. And all those great advantages come with a price tag attached. For a long time the tag has been too high for many teams. In this blog post I am going to discuss the issue that was (and still is, in some cases) a very significant part of the aforementioned price – difficulties in debugging microservices. Then, I will recommend tools (one of which is our own production debugger Lightrun) and platforms that can help overcome these problems, because microservices aren’t going anywhere.

What is Microservices Architecture

Before we start, though, let’s clearly define a few things. First of all, “microservices” and “serverless” are two different things. True, microservices are often built using a serverless architecture, and the serverless architecture is often used with microservices in mind. And yet, the main goal of serverless is to reduce the total cost of ownership of an application – i.e. to reduce the cost of managing servers and the usage bill – and it has nothing to do with microservices. It is still possible to build a monolithic web application running entirely on AWS ElasticBeanstalk or Azure App Service, or to deploy a microservice on top of an nginx server running on EC2.

After this subtle but legally important distinction, note that I will still address serverless debugging issues alongside microservices debugging issues, since the two overlap very often.

Another important thing to mention is that the microservices architecture is just a subclass of the broader and more comprehensive cloud-native paradigm, which introduces even more challenges for development teams (out of the scope of this post). But whether you deploy microservices in the public cloud or totally on-prem, you will face the same difficulties. (I don’t assume your application’s gender, age, religion or cloud. And, of course, neither the language it is written in.)

Microservices Architecture Creates Microservices Debugging Challenges

Imagine that your huge, cumbersome monolith full of shitty legacy code and written in some old-fashioned, boring dinosaur language starts falling apart into beautiful, tiny microservices. What can go wrong?

A lot of things. And when they do go wrong you will find out that, all of a sudden, you can’t put a breakpoint in that new tiny beautiful microservice! You can’t make your favorite IDE debugger just stop there, you can’t see the stack, the values of variables, the process memory, you can’t pause threads and step through the code line by line (well, I do assume language(s) here, sorry for that).

You can’t do all this because the suspicious code is no longer just a class instance running, at worst, as another thread in the process your IDE is attached to. It now runs in a dedicated Docker container or Kubernetes pod, possibly written in another language: stateless, asynchronous, lonely.

Or even worse, it is now a Lambda function, which is born and dies hundreds of times in a second somewhere in a distant cloud, throwing NPEs every time it starts. How in the world am I supposed to debug a microservice like that? What have I done?

debugging microservices is scary, learn how from this guide

This post comes to the rescue. There are a lot of techniques, tools, and even startup companies that have emerged to address this problem. It is a vibrant and constantly evolving (which is another way to say poor and incomplete) ecosystem that I will review in two steps: debugging microservices locally (this blog post) and debugging microservices in production.

How to Debug Microservices Locally

Let’s see how it is possible to debug microservices when you either develop them locally or try to reproduce and fix a bug. Before I get into solutions, let’s outline the challenges you will face doing that.

Debugging Microservices Locally: The Challenges

Fragmentation

In a good old-fashioned monolith, the functionality you were trying to debug (e.g. adding an item to a shopping cart) was implemented by a couple of classes, which made it easy to gain a holistic view.

Now, the same functionality is implemented by a couple of separate microservices, and each can be either a Docker container, a Kubernetes pod, or a serverless function. You are supposed to run all of them simultaneously in order to reproduce a bug and then, after a fix, perform an impact analysis.

To top it all off, to recreate the exact picture, each one of these services must be of the same version they are in production – either all together with the same version, or, even worse, the version each service was running in the production environment where the bug was reported. Creating this environment is a huge challenge, and if you don’t do it right you won’t be able to properly debug your microservices.

Asynchronous Calls

Direct synchronous method invocations (or, at worst, message queues between threads) are replaced in microservices with either synchronous REST or gRPC API calls. Even worse, sometimes they are replaced with an asynchronous event-driven architecture based on plenty of available message queues (async gRPC is also an option).

Too bad that the issues you hit with in-process message queues are nothing like the ones you face with distributed message queues: the configuration is complicated and full of nuances, latency and performance are not always predictable, operational costs are very high (yes, Kafka, I am looking at you) and you may run out of budget very quickly if you are using managed solutions.
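To make the shift concrete, here is a minimal sketch of the event-driven version of “add an item to the cart” using the Kafka Java client – the broker address, topic name and payload are all made up:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// What used to be a plain method call is now an event published to a topic;
// the consumer lives in another process and handles it at some later time.
public class CartEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // hypothetical broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> event =
                    new ProducerRecord<>("cart-events", "user-42", "{\"action\":\"ADD_ITEM\"}");
            // Propagate a correlation id so the logs of both services can be stitched together.
            event.headers().add("correlationId",
                    "a1b2c3".getBytes(StandardCharsets.UTF_8));
            producer.send(event);                       // asynchronous: returns a Future
        }
    }
}
```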

Distributed State

Forget about stack traces, forget about logs. Actually no, don’t forget about logs – just forget about understanding anything by digging into those of a single microservice. The magic ERROR lines you are looking for may be printed into the logs of some other microservice at an undefined time offset, mixed up with totally unrelated ERROR lines that were printed while handling a different HTTP request. In other words, recreating the application state that led to a bug is often mission impossible.

Different Languages

Back in the day it was one language to write them all, now it is a Noah’s ark of languages and you might have no idea WTF is going on with this “undefined has no properties” error that some weakly dynamically typed language loves to throw (who let this become a backend language, for crying out loud?).

Technical Difficulties in Running Microservices Locally

Well, that’s what Docker was invented for in the first place, right? Docker-compose up and we are done. OK, but what about a Kubernetes cluster? A Kafka cluster? A bunch of Lambda functions? And then your laptop ~melted~ needs more RAM and CPU.

Now it is easy to see why until recently many teams just gave up. For some it cost days, for some it was weeks of frustration, anger and suppressed aggression – and I didn’t even get to production debugging of microservices. The industry reacted quickly to this mess and came up with plenty of solutions addressing these issues. Granted, these are still not even close to providing the speed and convenience of debugging a monolith with an IDE debugger, but the gap is slowly closing. Let’s take a close look at what you can do.

How You Can Debug Microservices

So what is in our microservices debugging kit as of July 2020? Let’s look at the main tools and platforms out there, and how they can help you.

Cloud Infrastructure-as-a-Code Tools

There are plenty of configuration orchestration tools – among others, Terraform and AWS CloudFormation – as well as configuration management tools like Ansible or Puppet, which automate the deployment and configuration of complex applications. Using these tools lets you create a quick and seamless debugging environment – subject to your budget constraints, of course. To optimize costs, you can offload only some of the services to a remote cloud and run the rest locally on your machine.

Centralized Logging

All microservices should send logs to a centralized, preferably external, service. This way you can investigate, trace and find the root cause of a bug much more easily than by switching between multiple log files in your local text editor. You can choose from plenty of managed services like Logz.io and Datadog, deploy your own ELK stack, or just send the logs to ~/dev/null~ cold S3 storage. In case you do not know when you will need the logs, this is a much cheaper option and you can always fetch them later. The most important thing is to implement a Correlation Identifier, and there are more best practices you should definitely read about.
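Here is a minimal sketch of what a correlation identifier can look like in a Java service using SLF4J’s MDC – the id is generated if missing, attached to every log line of the request, and forwarded downstream (the field name and header are conventions, not requirements):

```java
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Attach a correlation id to every log line of a request, and forward it to
// downstream services (e.g. as an X-Correlation-Id header) so one id can be
// traced across all microservices in the centralized logging system.
public class CorrelationIdExample {
    private static final Logger log = LoggerFactory.getLogger(CorrelationIdExample.class);

    public static void handleRequest(String incomingCorrelationId) {
        String correlationId = incomingCorrelationId != null
                ? incomingCorrelationId
                : UUID.randomUUID().toString();
        MDC.put("correlationId", correlationId);
        try {
            log.info("adding item to cart");   // logging pattern can include %X{correlationId}
            // ... call the next service, passing correlationId as a header ...
        } finally {
            MDC.remove("correlationId");       // don't leak the id to the next request on this thread
        }
    }
}
```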

Serverless Frameworks IaC

Some of your microservices might be implemented using serverless solutions like FaaS and/or other managed services like API Gateway. There are two main players that provide Infrastructure-as-Code frameworks for serverless: the cloud agnostic Serverless and AWS SAM, which is just an abstraction layer over CloudFormation. Back in the day, it was a real mess to develop and debug FaaS, but these days both allow local debugging, while SAM even allows using a local debugger in popular IDEs (Visual Studio, IntelliJ IDEA) with its handy AWS Toolkit. A real time saver!

Local Containers

Running Docker Compose locally is trivial unless you’re using a sophisticated architecture, such as a Kafka cluster alongside your Docker containers. Then things start getting complicated, though they are still feasible – take a look.

When it comes to Kubernetes though, it is much more difficult. There are some tools that try to simplify local Kubernetes deployment, such as Microk8s and Minikube, but both require a lot of effort to be invested – well, you should not expect your life to be easy when dealing with Kubernetes anyway.

Dedicated “Debuggers for Microservices”

Not very convincing until now, right? I mean, after a lot of effort you can (barely) create the microservices debugging environment and see logs in a manner which makes sense – things you hardly bother about when debugging a monolith. But what about the debugging capabilities that really matter – setting breakpoints throughout the application, following variable values on the fly, stepping through the code, and changing values during run time? 

If your microservices leverage the Kubernetes platform, you can get all of these, at least to an extent. There are two powerful open-source tools, Squash and Telepresence, which allow you to use your local IDE debugger features when debugging the Kubernetes environment, preventing your laptop from melting down when running Minikube.

Squash builds a bridge between some of the popular IDEs and debuggers (here’s the full list) and uses a sidecar approach to deploy its client on every Kubernetes node (the authors claim very low performance and resource consumption overhead). This allows you to use all the powerful features of the local debugger such as live debugging, setting breakpoints, stepping through code, viewing the values of variables, modifying them for troubleshooting, and more. You can find a thorough guide here.

Telepresence operates quite differently: it runs a service you want to debug locally, while connecting it to a remote Kubernetes cluster, so you can develop/test it locally and use any of your favorite local IDE debuggers seamlessly. A bunch of tutorials, FAQs and docs can be found here.

Unless I missed something (let me know in the comments), that’s what you have in your hands in mid-2020 when it comes to debugging microservices locally. Far from ideal, it is still much better than just a couple of years ago, and it is constantly getting better.

In the next blog post I will discuss the tools and best practices for debugging microservices in production!

Spoiler: a great tool to debug microservices in production is Lightrun. You can add on-demand logs, performance metrics and snapshots (breakpoints that don’t stop your application) in real time without having to issue hotfixes or reproduce the bug locally – all of which makes life much easier when debugging microservices. You can start using Lightrun today, or request a demo to learn more.

Extending CI/CD with Continuous Observability & Debugging
This post is a recap of a joint webinar we held with JFrog – you can watch the on-demand recording here.

In The Phoenix Project – a whimsical journey through the day-to-day life of an imaginary corporation’s IT leader – a large production issue is encountered during the deployment of the company’s flagship Phoenix project. This project, already behind schedule and over budget, is one of the focal points for the company’s CEO – who is very adamant that the catastrophe be mitigated and the deployment finished promptly.

The narrator, Bill, is obviously under a tremendous amount of pressure to get the show on the road. But – being a seasoned veteran with a penchant for proper processes – he spends the majority of the rest of the book attempting to implement a saner operational schedule for the department.

This book is a great story-like representation of what it’s like to build software nowadays. The tension at multiple levels of the company’s hierarchy, an assortment of project leaders, methodologies and deployment environments, external customer pressure – all of these and more are part of a day’s work for any company with a large-enough software development organization (the number of which is increasing rapidly as more and more processes start revolving around software).

Software engineering organization

It’s interesting to explore what we – as engineers and engineering leaders – have been doing in order to ensure that catastrophes like the one depicted earlier do not repeat themselves with every new version launch. As a discipline, software engineering has made leaps and bounds – infrastructure-wise – since the days of extremely costly dedicated servers in on-prem data centers. It’s cheaper and faster than ever to get your application into production by signing up to a cloud provider, defining your required resources and topology, and clicking a button or running a command to deploy.

This increased agility is not limited to the hardware that we run our software on and to the basic software components we use to scaffold our applications – the process side of things has drastically improved as well. We now have full-fledged, mostly automated processes, around getting software from development to production – “The CI/CD Pipeline” – that makes sure that every piece of software is cared for on every front before reaching the customer’s eyes. In addition, these processes ensure that the entire workflow is faster, more reliable and less fragmented.

But, the value our software brings is only measured when it reaches the hands of our users – i.e. when it hits production. Software can only truly be evaluated by the effect it leaves on the daily lives of our customers, and as such – we need to make sure that the same agility is extended to the “right”-hand side of the SDLC – into the frontier of production environments.

The last stage of CI/CD

Before we dive into how to make sure the processes in our production environments are as streamlined as the ones earlier in the process, let’s first take a closer look at the earlier steps our software takes before it’s released into the wild.

Early Stage Agility

Getting a piece of software from development to production entails, roughly, the following stages:

 

Dev, build, test

 

    1. Development – Source code is written into a source code repository – usually using a VCS (Version Control System) like Git. The main codebase is often hosted on a centralised repository on either a managed service or an on-prem solution, which acts as a single source of truth (SSOT) for the application versions and as a remote backup. This allows for multiple participants to simultaneously develop, with the final codebase always synchronized from one place.
    2. Build – In order to make sure the software functions properly, a build process that sets up the environment, fetches all the necessary dependencies and builds the final application is performed.
    3. Testing – Before code can actually be added into the centralised location (known as “integration” of the new code), a test suite is executed against it to ensure compatibility with the existing codebase.
      This continuous addition of new code to the codebase following the build and test phases is often referred to as CI – Continuous Integration – and the machines that host the endeavour are often colloquially referred to as “CI Servers” (or, in some places, simply as “Build Servers”).
      Continuous integration (CI)
      Once the code is “confirmed working” following the test suites, it needs to be prepared for deployment and then deployed to production:

Code working

  1. Creating Artifacts – In a world with hundreds of different deployment targets including VMs, containers, Kubernetes pods, serverless functions, bare metal servers and others, creating the artifacts and the related configuration and metadata can be quite an ordeal. This endeavor gets even more tiresome if you are not working with a monolith application but in a distributed microservices architecture, where every release is composed of dozens (and sometimes hundreds) of different types of artifacts – one for each service.
  2. Deploying Artifacts – Once the artifacts are created, we also need to get them to the production machines. This usually entails communicating with the target platforms to announce the arrival of the new version, and waiting for confirmation of a successful deployment.

This latter part of the process, when conducted automatically, is often referred to as Continuous Deployment – or CD.

CI/CD

There are, however, quite a lot of other concerns that we’ve glossed over so far. These issues are integral to the process and the software shouldn’t be deployed without them in place, including (but not limited to):

  1. Dependencies – relevant dependencies are not hosted on the source code repository next to our application, and as such need to be fetched during the build process from external sources (MavenCentral is an example for JVM development). These external sources must host the correct version and deliver it reliably – both of which are notorious pitfalls that often stop a build from finishing successfully.
  2. Security – the vulnerabilities of the underlying dependencies – and of the application itself – must be verified before deploying to production. This is usually done by scanning the dependencies of the application against an external vulnerability database (both on the binary and metadata levels) and the various components of the application itself.

Continuous Deployment (CD)

These steps used to be carried out by a member of the engineering team, or – in companies with more resources – by a dedicated systems or production engineer. These processes are repetitive, time-intensive, and prone to error when done manually, making them all prime candidates for automation.

And, indeed, the automation of these processes constitutes a large portion of the tooling of most modern software organizations. The sequential execution of them on the dedicated infrastructure is often referred to as the “CI/CD Pipeline” mentioned earlier, due to the incremental nature of the “flow” of the application from source code to a deployed artifact.

Verification

JFrog Artifactory is one example of a solution that can handle the process of artifact management for you. Instead of relying on external artifact repositories and manual transfer of the software from one host to the other, JFrog Artifactory allows for automated promotion of artifacts between so-called “Artifactory Repositories”.

As your artifacts “mature” throughout the pipeline – passing more and more of the processes mentioned above, and moving between various environments – Artifactory can automatically promote them to the next level. This allows for the pipeline to really “flow” based on triggers from previous stages, instead of relying on human input for things that can be checked by a machine.

If you’re looking for an overarching solution – something that ties all the pieces together under a single roof – you can use JFrog Pipelines, which allows you to orchestrate and automate every single step your application needs to take before being deployed into production.

JFrog Pipelines coordinates all the existing tools that automate the processes mentioned above – including security scanning using JFrog Xray and artifact caching using JFrog Artifactory – to provide a more holistic experience for software organizations that need to keep on shipping.

The diagram below gives a great overview of how the different JFrog services fit together.

JFrog Pipelines

The Need For Production Agility

By now, we’ve established that there’s a need for agility on the way from development to production. We’ve also looked at a couple of solutions that offer automation and orchestration for significant pieces of the puzzle.

These advancements – along with a growing appreciation for the complexity of the software development process and the care it takes to do it right – have created a world in which getting our software to production is a fast and streamlined process, enabling quicker updates and a better user experience.

And, when something inevitably breaks, it’s usually very easy to track the specific point of failure, make a change immediately, and re-trigger the exact step that failed without going through the full pipeline because of that single problem. Every step along the way is triggered automatically based on the previous step, is extremely visible and thus easy to audit from all angles and can be modified using the command line, user interfaces and APIs.

But what goes on after the pipeline ends? What happens when our software is released into production?

CI/CD Pipeline

Going back to The Phoenix Project for a second, the narrator says the following line to the other stakeholders in his department during a particularly difficult portion of the project deployment:

The Phoenix Project

The issue is, unfortunately, that it is almost always impossible to match the environments exactly. There are just too many factors to take into consideration – for example:

Live application

  • Users with different characteristics – The sheer amount of technology out there today creates an insurmountable number of different user configurations that your application might face. In fact, an entire industry – online advertising – is predicated on this very fact for its existence (by profiling users based on the information their setup provides about them).
  • External Failures – Most, if not all, modern applications rely heavily on external vendors for various utilities. The core one is of course cloud vendors – if a GCP or Azure service you rely on goes down, so will your application.
  • Unexpected usage – Logical and functional testing can only get you so far – there will always be users who misuse (or outright abuse) your software, and you must be able to deal with them appropriately.
  • Infrastructure bottlenecks – Your own infrastructure might fail you as well. If a resource inside your topology is pertinent to more than one entity – a database is a good example – latencies caused by actual communication overhead as well as unexpectedly long-running queries might cause timeouts down the road.

These, and more, can all be categorized as things you can either test for in only a very limited fashion, or can’t test for at all. Recall, as I mentioned at the beginning of the article, the obvious (yet easy to forget) fact that your live application is the most important part of your software development process. There are just too many unknowns to consider when trying to better understand issues with your app.

Live application 2

When a production issue stems from one of these concerns, it’s often hard to understand its origin and to identify which thing exactly broke along the way.

It might not seem apparent at first glance, but the problem is exacerbated by the current set of processes we use to understand, debug and resolve production incidents today.

The Current Production Observability Toolbox

Generally speaking, when a production issue occurs we have a few tools we can use to get better visibility into the issue:

Production observability toolbox

  • Passive Observation – A better understanding of the issue is extracted from the existing infrastructure, such as application management (APM) or other tools that either aggregate or enhance your existing application logging and metrics collection.
  • Replication & Reproduction – The service is replicated locally or on a similar piece of infrastructure; the bug is reproduced on the replicated service.
  • Hotfixing for more information – A “patch” with additional logging is created and deployed to the running service, which now emits more granular information.
  • Remote Debugging – A special type of agent is attached to the running service ad-hoc, imitating a local debugger, and allows for breakpoint-by-breakpoint analysis of the service (including stopping it at each breakpoint).
  • Alerting – A set of pre-configured cases – usually based on a specific event occurring or a certain metric reaching a certain threshold – triggers alerts to ensure all stakeholders are aware that something is wrong.
  • Self-Healing – When a piece of software fails, a certain actor is in charge of triggering a process (usually a reboot or an instance swap) to account for the failure automatically.

While each of these approaches has its own advantages and disadvantages, they all have one thing in common – none is an extension of the existing pipeline.

Looking at the production environment with the same lens that you used for the stages leading up to it, and especially considering its importance in comparison to those other stages, it’s clear that we should have tools that allow us to streamline as many of the processes related to incident resolution as possible. Self-healing can steer us a bit in that direction, but it’s incredibly difficult to account for all failure scenarios when creating our software, and thus it is as difficult to create self-healing mechanisms that deal with all of those scenarios.

We’ve been extremely good at cleaning up and automating processes before the release, but, until recent years, there hasn’t been as much attention focused on the process of troubleshooting the application once it’s live in production. Debugging and incident resolution have always been defined by a diverse set of processes and tooling (a large portion of which is mentioned above), but there has never been a concrete umbrella term for the people in charge of facilitating and automating these flows, nor the discipline that they follow.

CI/CD Pipeline

When Google’s Site Reliability Engineering book hit the web, and the SRE profession was introduced, a better definition for both became widespread. There is now a profession inside software organisations whose sole goal is to ensure that engineering practices are applied to the right-hand, production side of the cycle as well.

SRE

But we would be remiss to so easily define it as an entirely new endeavor. Instead, it’s better to look at it as a continuation of the hard work that came before it – and refer to it as an extension of the existing pipeline:

Site reliability engineering

And with this fresh perspective on the process, we can now talk about how to infuse the same agility we experienced in the earlier stages of the pipeline into the art of production troubleshooting as well.

Continuous Debugging & Continuous Observability

The concept of applying the same principles as the pipeline into the production world can be referred to as Continuous Observability.

If observability can be defined as the ability to understand how your systems are working on the inside just by asking questions from the outside, then Continuous Observability can be defined as a streamlined process for asking new questions and getting immediate responses.

But being conscious of our production systems shouldn’t stop there – we also must be able to answer these new questions without causing any damage to the business. That means that outages must be minimized, no customer data should be corrupted and any disruption to the user experience of our products must be mitigated.

To complement the practice of continuous observability, agile teams can also implement Continuous Debugging processes – ways to actively break down tough bugs by getting more and more visibility into the running service, without stopping it or degrading the customer’s experience.

Lightrun was built from the ground up to empower these exact processes.

Continuous debugging

Lightrun works inside your IDE, allowing you to add logs, metrics, traces and more to your running application without ever breaking the process. Instead of having to edit the source code to add more visibility, compile, test, create artifacts, deploy and then inspect the information on the other end, Lightrun skips the process and allows you to add more visibility to production services in real-time, and get the answers you need immediately.

To contrast the Lightrun approach with the current production observability toolbox, let’s look at a couple of examples:

With hotfixing, you have to go through the entire pipeline just to get an additional log line into production. This is a long process that can take many precious minutes for something that should be as simple for production services as it is locally.

Hotfixing

With remote debugging, in order to ask any new questions you have to stop the process, causing an outage. This is an expensive price to pay for a peek at what’s going on inside your service. Since this addition of information happens repeatedly during debugging, it can put a hefty dent in the overall uptime of your service.

Remote debugger

With Lightrun, you can add as much information as you want ad-hoc, without stopping the process, and get all the information immediately in your IDE.

Lightrun

By enabling a real-time, on-demand debugging process and enriching the information the application reveals about itself without stopping the process, Lightrun offers a streamlined experience for what is currently an objectively difficult and manual process. By doing so, it facilitates a speedy incident resolution process, resulting in lower mean-time-to-resolution (MTTR) and a better overall developer experience when handling incidents.
