Dynamic Observability Tools for API Live Debugging

Intro

Application Programming Interfaces (APIs) are a crucial building block in modern software development, allowing applications to communicate with each other and share data consistently. APIs are used to exchange data inside and between organizations, and the widespread adoption of microservices and asynchronous patterns boosted API adoption inside the application itself.

The central role of APIs is also evident with the emergence of the API-first approach, where the application’s design and implementation start with the API, thus treating APIs as first-class citizens and developing reusable and consistent APIs.

In the last decade, Representational State Transfer (REST) APIs have come to dominate the scene, becoming the predominant API technology on the web. REST is more of an architectural approach than a strict specification, and this free-formedness is probably the key to REST's success: it has been essential in making REST popular and is one of the critical enablers of loose coupling between API providers and consumers. However, it sometimes bites back as a lack of consistency in API behavior and interfaces, which specification frameworks like OpenAPI or JSON Schema can help alleviate.

It is also worth pointing out the role developers play in designing and consuming APIs: because the role of an API is to integrate different applications and systems, developing one frequently requires close collaboration between backend, frontend, and mobile developers.

Challenges in API integration

Despite being central to modern application development, API integration remains challenging. Those challenges mainly originate from the fact that the systems connected by APIs form a distributed system, with the usual complexities involved in distributed computing. Also, the connected systems are mostly heterogeneous (different tech stacks, data models, ownership, hosting, etc.), leading to integration challenges. Here are the most common ones:

  • Incorrect data. Improper data formatting or conversion errors (due to inaccurate data type or incompatible data structures) can cause issues with the exchanged data. This often results in malformed JSON, errors in deserialization, and type casting errors.
  • Lack of proper documentation. Poorly documented endpoints may require extensive debugging to infer data format or API behavior. This is particularly problematic when dealing with third-party services without access to the source code or the architecture.
  • Incorrect or unexpected logic or behavior. The loosely defined REST model does not allow the callee's behavior to be specified formally, and such behavior can be undocumented or implemented incorrectly for some edge cases.
  • Poor query parameter handling. Query parameters are the way for the caller to shape the results the API returns. Often, edge cases arise where parameters are not handled correctly, requiring a trial-and-error debugging process.
  • Error handling. Even if HTTP provides the basic mechanism of response codes for error handling, each API implementation tends to customize it, either using custom codes or adding JSON error messages. Error handling is not always coherent, even between different endpoints on the same system, and it may be undocumented.
  • Authentication and authorization errors. The way authorization is handled on the API producer can generate errors and unexpected behavior, sometimes manifesting as inconsistencies between different endpoints on the same system.

Errors can be present on the provider side or the consumer side. On the provider side, we often cannot intervene in the implementation, which necessitates implementing workarounds on the consumer side.

For errors on the consumer side (incorrect deserialization, mishandled pagination or state, etc.), troubleshooting usually involves examining logs for request/response patterns and adding logs to inspect parameters and payloads.
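
When the consumer code is under our control, defensive deserialization plus payload logging goes a long way. Below is a minimal Java sketch, assuming Jackson on the classpath; the UserDto model is hypothetical:

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ApiClient {

        // Hypothetical response model, kept minimal for the sketch
        public static class UserDto {
            public String id;
            public String name;
        }

        private static final ObjectMapper MAPPER = new ObjectMapper()
                // Fail loudly on unknown fields so schema drift surfaces early
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, true);

        public UserDto parseUser(String rawJson) {
            try {
                return MAPPER.readValue(rawJson, UserDto.class);
            } catch (JsonProcessingException e) {
                // Log the raw payload next to the error so the malformed field is visible
                System.err.printf("Failed to deserialize response: %s%npayload=%s%n",
                        e.getMessage(), rawJson);
                throw new IllegalStateException("API contract violation", e);
            }
        }
    }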

Lightrun Dynamic Observability for API debugging

Lightrun‘s Developer Observability Platform implements a new approach to observability by overcoming the difficulties of troubleshooting applications in a live setting. It enables developers to dynamically instrument applications that run remotely on a production server by adding logs, metrics, and virtual breakpoints, without code changes, redeployment, or application restarts.

In the context of API debugging, the ability to debug in the production environment provides significant advantages, as developers do not need to reproduce the entire API ecosystem surrounding the application locally, which can be difficult: think, for example, of the need to authenticate against third-party APIs, or to provide a realistic database to run the application locally. Also, it is not always possible to reproduce realistic API calls locally, as the local development environment tends to be simplified with respect to the production one.

Lightrun allows debugging API-providing and consuming applications directly on the live environment, in real-time and on-demand, regardless of the application execution environment. In particular, Lightrun makes it possible to:

  • Add dynamic logs. Adding new logs without stopping the application makes it possible to obtain the relevant information for the API exchange (request/response/state) without leaving the IDE and without losing state (for example, authentication tokens, complex API interactions, pagination, and real query parameters). It's also possible to log conditionally, only when a specific code-level condition is true, for example to isolate a particular API edge case from a high volume of API requests (see the sketch after this list).
  • Take snapshots. Virtual breakpoints that trigger on a specific code condition capture how request parameters and response payloads change over time.
  • Add Lightrun metrics for method duration and other insights. This makes it possible to measure the execution times of APIs and count the number of times a specific endpoint is called.
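
To make the conditional-log idea concrete, here is a hypothetical example of the condition and log text a developer might enter when creating a dynamic log; the expression is plain Java and all names are illustrative, not part of any Lightrun API:

    // Entered in the Lightrun action form, not in the source code:
    // Condition: response.getStatus() == 500 && request.getPath().startsWith("/api/v2/orders")
    // Log text:  "Edge case hit: params={request.getQueryParams()}, body={response.getBody()}"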

Lightrun is integrated with developer IDEs, making it ideal for developers, as it allows them to stay focused on their local environment. In this way, Lightrun acts as a debugger that works wherever the application is deployed, allowing for a faster feedback loop during the API development and debugging phases.

Bottom Line

Troubleshooting APIs returning incorrect data or behaving erratically is essential to ensure reliable communication between systems and applications. By understanding the common causes of this issue and using the right tools and techniques, developers can quickly identify and fix API problems, delivering a better user experience and ensuring smooth software operations. Lightrun is a developer observability platform giving backend and frontend developers the ability to add telemetry to live API applications, thus representing an excellent resolution to API integration challenges. Try it now on the playground, or book a demo!

Lightrun’s Product Updates – Q3 2023

Throughout the third quarter of this year, Lightrun continued to develop a multitude of solutions and improvements focused on enhancing developer productivity. Its primary objectives were to improve troubleshooting for distributed workload applications, reduce mean time to resolution (MTTR) for complex issues, and optimize costs in the realm of cloud computing.

Read on for the main new features as well as the key product enhancements that were released in Q3 of 2023!

📢 NEW! Lightrun Support for Action Creation Across Multiple Sources!

Lightrun is excited to announce that developers can now select multiple agents and tags as a single source when creating an action directly from their IDEs. This option lets them simultaneously apply an action to a custom group of agents and tags, which improves their plugin experience and makes it easier to debug with multiple agents and tags. To learn more, see selecting multiple sources in VSCode and selecting multiple sources in JetBrains.

📢 New! Enhanced Capability for Capturing High-Value Snapshot Actions

We’ve taken snapshot capturing to the next level by enabling you to now capture large values for Python and Node.js agents. As part of this enhancement, we’ve raised the default settings to accommodate larger string values. You can also define maximum limits in the agent.config file through the introduction of the max_snapshot_buffer_size, max_variable_size, and max_watchlist_variable_size fields. For more information, refer to the relevant Agent documentation: Python Agent Configuration and Node.js Agent Configuration.
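
As an illustration only (the exact syntax and defaults belong to the agent documentation linked above), such limits might look like this in a properties-style agent.config; the values here are made up:

    max_snapshot_buffer_size=65536
    max_variable_size=32768
    max_watchlist_variable_size=32768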

📢 NEW! Duplication of Actions from Within the IDE Plugins 🎉

Lightrun now offers an easier and more efficient way to insert Lightrun actions using ‘Copy and Paste’ within the JetBrains IDE, allowing developers to easily reuse existing actions in multiple locations in their code. This new functionality applies to all Lightrun action types, including Lightrun snapshots, metrics, and logs. It simplifies the task of reviving expired actions or duplicating actions that have non-trivial conditions and/or watch expressions.

Similarly, we’ve added a new Duplicate action within the VSCode IDE, which allows developers to easily reuse existing actions in multiple locations within their code. This new functionality applies to all Lightrun action types including Lightrun snapshots, metrics, and logs, simplifying the task of creating non-trivial conditions and/or watch expressions.


📢 NEW! PII Redaction per Agent Pool 

With the introduction of PII Redaction Templates, Lightrun now supports additional granularity for utilizing PII Redaction effectively. You can either establish a single default PII Redaction template to be applied to all your agents or create and assign distinct PII Redaction templates for different agent pools. This is useful, for example, if you would like to apply PII Redaction only in a Production environment and not in Development or Staging.

To help you get started with configuring your PII redaction on Agent Pools, we provide a single Default template on the PII Redaction page which serves as a starting point for creating your templates. Note that it does not contain any predefined patterns and is not assigned to any agent pools. For more information, see Assigning PII Redaction templates to Agent Pools.

Feel free to visit Lightrun’s website to learn more or if you’re a newcomer, try it for free!

8 Debugging Tips for IntelliJ IDEA Users You Never Knew Existed

As developers, we’re all familiar with debuggers. We use debugging tools on a daily basis – they’re an essential part of programming. But let’s be honest. Usually, we only use the breakpoint option. If we’re feeling frisky, we might use a conditional breakpoint.

But guess what, the IntelliJ IDEA debugger has many powerful and cutting-edge features that are useful for debugging more easily and efficiently. To help, we’ve compiled a list of tips and tricks from our very own developers here at Lightrun. We hope these tips will help you find and resolve bugs faster.

Let’s get started.

1. Use an Exception Breakpoint

Breakpoints are places in the code where the program is suspended to enable debugging. They allow you to inspect the program's state and behavior in order to identify the error. IntelliJ offers a wide variety of breakpoints, including line breakpoints, method breakpoints and exception breakpoints.

We recommend using the exception breakpoint. This breakpoint type suspends the program according to an exception type, and not at a pre-defined place. We especially recommend the IntelliJ Exception breakpoint because you can also filter the class or package the exceptions are a part of.

So you can define a breakpoint that will stop on a line that throws NullPointerException and ignore the exceptions that are thrown from files that belong to other libraries. All you have to do is define the package that has your project’s files. This will help you focus the analysis of your code behavior.

Exception breakpoint in IntelliJ IDEA

Lightrun offers snapshots – breakpoints that do not stop the program from running. Learn more here.

2. Use Conditions in Your Breakpoints

This is one of the most under-utilized tools in debuggers and possibly one of the most effective ones. Use conditions to narrow down issues far more easily, saving time and the work of hunting for issues. For example, in a loop you can define a breakpoint that will only stop on the actual bug, relieving you from manually stepping through iterations until you run into the issue!

In the loop below, you can see the breakpoint will stop the service when the agent id value is null. So instead of throwing a null pointer exception we’ll be able to inspect the current state of the VM (virtual machine) before it does.
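
A sketch of the kind of loop in question (the Agent type and the registry are illustrative); the breakpoint carries the condition agent.getId() == null:

    // Illustrative context: agents is a List<Agent>, registry stores them by id
    for (Agent agent : agents) {
        // Conditional breakpoint on the next line, condition: agent.getId() == null
        // The debugger suspends only for the offending element, skipping healthy iterations.
        registry.register(agent.getId(), agent);
    }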

Notice that a condition can be very elaborate and even invoke methods as part of the condition.

Breakpoint condition in IntelliJ IDEA

Lightrun offers conditions for all its actions: snapshots, logs etc. Learn more here.

3. Enable the “Internal Actions” Menu for Custom Plugin Development  

If you’re writing a custom IntelliJ/IDEA plugin, enable Internal Actions (Tools -> Internal Actions) for easy debugging. This feature includes a lot of convenient options, like a component inspector and a UI debugger. It’s always handy to have a wide set of tools at your disposal, providing you with options you may have never thought of yourself.

To enable Internal Actions select Help -> Edit Custom Properties. Then type in

idea.is.internal=true

and save. Upon restart you should see the new option under the Tools menu.

Internal Actions menu for custom plugin development in IntelliJ IDEA

4. Use the “Analyze Thread Dump” Feature

A thread dump is a snapshot that shows what each thread is doing at a specific time. Thread dumps are used to diagnose system and performance issues. Analyzing thread dumps will enable you to identify deadlocks or contention issues.

We recommend using IntelliJ’s “Analyze Thread Dump” feature because of its convenient browsing capabilities that make the dump easy to analyze. “Analyze Thread Dump” automatically detects a stack trace in the clipboard and instantly presents it with links to your source code. This capability is very useful when traversing stack dumps from server logs, because you can instantly jump to the relevant files like you can with a local stack trace.

To access the feature, go to the Analyze menu. The IDE can also activate this feature automatically when it detects a stack trace in the clipboard.
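
To produce a dump to analyze, the JDK's own command-line tools are enough; for example (the PID is illustrative):

    jps -l                         # list running JVMs and their process IDs
    jstack -l 12345 > threads.txt  # capture a thread dump of PID 12345

Copy the contents of threads.txt to the clipboard, and the IDE will offer to analyze it.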

5. Use the Stream Debugger

Java 8 streams are very cool to use but notoriously hard to debug. Streams condense multiple functions into a single statement, so simply stepping over the statements with a debugger is impractical. Instead, you need a tool that can help you analyze what’s going on inside the stream.

IntelliJ has a cool new tool, the stream debugger. You can use it to inspect the results of stream operations visually. When you hit a breakpoint on a stream, press the stream debugger icon in the debugger. The UI maps the values of the stream elements at each stage/function of the stream, so each step is visualized and you can follow the operations in the stream and detect the problem.

Stream debugger in IntelliJ IDEA (1)

Stream debugger in IntelliJ IDEA (2)

Stream debugger in IntelliJ IDEA (3)
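
For instance, a pipeline like the following condenses filtering, mapping, and collecting into a single statement; a plain line breakpoint gives no visibility into the intermediate values, which is exactly where the stream debugger helps (the data is illustrative):

    import java.util.List;
    import java.util.stream.Collectors;

    public class StreamExample {
        public static void main(String[] args) {
            List<Integer> lengths = List.of("api", "debugging", "", "stream")
                    .stream()
                    .filter(s -> !s.isEmpty())      // which elements survived the filter?
                    .map(String::length)            // what did each element map to?
                    .collect(Collectors.toList());  // a breakpoint here shows only the final result
            System.out.println(lengths);            // [3, 9, 6]
        }
    }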

6. Use Field Watchpoints

The Field Watchpoint is a type of breakpoint that suspends the program when the defined field is accessed or modified. This can be very helpful when you investigate and find out that a field has a wrong value and you don't know why. Watching this field could help you find the origin of the fault.

To set this breakpoint, simply add it at the line of the desired field. The program will suspend when, for example, the field is modified:
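
A minimal sketch of where the watchpoint goes (the class is illustrative):

    public class Counter {
        private int count;   // set the field watchpoint on this line

        public void increment() {
            count++;         // the debugger suspends here, where the field is modified
        }
    }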

Field watchpoints in IntelliJ IDEA

7. Debug Microservices with the Lightrun Plugin

Lightrun’s IntelliJ plugin enables adding logs, snapshots and performance metrics, even while the service is running. Meaning, you can add instrumentation to the production and staging environments. You can debug monoliths and microservices running on Kubernetes (K8s), Docker Swarm, ECS, Big Data workers, serverless, and more. Multi-instance support is available through a tagging mechanism.

The Lightrun plugin is useful for saving time: instead of going through multiple iterations of local environment reproduction, restarts, and redeployments, you can debug straight in production.

Lightrun plugin for IntelliJ IDEA

Want to learn more? Request a demo.

8. Use a Friend – Real or Imaginary

When it comes to brainstorming, 1+1=3. And when it comes to dealing with complex debugging issues, you are going to need all the brainpower you can get. Working with someone provides a fresh set of eyes that views the problem in a different manner and might identify details you missed. Or you both complement each other until you reach the solution. Just by asking each other questions and undermining some of each other’s assumptions, you will reach new conclusions that will help you find the problem. You can also use each other for “Rubber Duck Debugging”, or as we like to call it, “Cheetah debugging”.

Cheetah debugging

We hope these tips by our own developers will help you with your debugging needs. Feel free to share your debugging tips and best practices with us and to share this blog post to help others.

As we mentioned in tip no. 7, Lightrun’s IntelliJ plugin enables developers to debug live microservices without interrupting them. You can securely add logs and performance metrics to production and staging in real-time, on-demand. Start using Lightrun today, or request a demo to learn more.

4 Tools Every Java Programmer Should Know

Java is one of the most popular programming languages. As such, it's no surprise there are many tools whose primary function is to assist the day-to-day work of Java programmers. In this blog post I will introduce some open-source (and free!) tools and platforms that I've used personally for years, and can warmly recommend.

The following tools immensely improve the coding, building, testing and profiling of Java applications, and I think every Java programmer will benefit from a familiarity with them. I built this list based on my experience as a professional Java developer in recent years, and I hope you find them as helpful as I did.

Mockito

Mockito is an open-source Java mock library with a simple API and large community support, often regarded as the foremost tool in its space. Mockito helps you create, verify and stub mocks – which are objects that simulate (i.e. mock) complex production objects, without fully creating them in practice. 

Therefore, mocks in general (and Mockito specifically) are very useful in the context of unit testing – they allow for proper, isolated, single-unit tests: we can mock the dependencies of each unit and focus on the behavior we want to test.

You have two choices when working with Mockito – you can either use the library’s API methods manually, or you can use the annotations the library provides.

A lot of developers choose to use annotations since they reduce a significant amount of repetitive, boilerplate code you have to write to use the library. The following is a short list of annotations I use the most:

  • @Spy – Spy the real object. Note that you can use @SpyBean for resolving Spring-specific dependencies.
  • @Mock – Create and inject mocked instances. In Spring, the corresponding annotation is @MockBean.
  • @InjectMocks – Automatically inject instances of your mocks and spies into the annotated class.

You can see more available annotations and usage examples here.
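
As a hedged illustration, here is how @Mock and @InjectMocks typically fit together in a JUnit 5 test; the OrderService and PaymentGateway types are hypothetical stand-ins:

    import static org.junit.jupiter.api.Assertions.assertTrue;
    import static org.mockito.Mockito.verify;
    import static org.mockito.Mockito.when;

    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.extension.ExtendWith;
    import org.mockito.InjectMocks;
    import org.mockito.Mock;
    import org.mockito.junit.jupiter.MockitoExtension;

    // Hypothetical production types, kept minimal for the sketch
    interface PaymentGateway { boolean charge(int amount); }
    class OrderService {
        private final PaymentGateway gateway;
        OrderService(PaymentGateway gateway) { this.gateway = gateway; }
        boolean placeOrder(int amount) { return gateway.charge(amount); }
    }

    @ExtendWith(MockitoExtension.class)
    class OrderServiceTest {
        @Mock PaymentGateway gateway;       // mocked dependency - no real payment calls
        @InjectMocks OrderService service;  // real object, mock injected via constructor

        @Test
        void chargesThroughGateway() {
            when(gateway.charge(100)).thenReturn(true);  // stub the mock's behavior
            assertTrue(service.placeOrder(100));
            verify(gateway).charge(100);                 // verify the interaction happened
        }
    }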

One comment before we go to the next part – note that mocking should not be overused.

A test that includes many mocks – 10 for example – usually indicates that your class has too many dependencies (i.e. it is responsible for too much). In addition, the more mocks you use, the less you’re testing the real environment – so use them wisely!

Sonar

Sonar is the leading automated service for detecting bugs, code smells and security vulnerabilities in your pull requests. By letting Sonar analyze your code, you can identify vulnerabilities and issues before deployment.

Sonar integrates with GitHub, Bitbucket, Azure DevOps and GitLab and fits snugly into most teams' development processes. That means that every time you open a pull request, you can also get a review from Sonar.

Consider the following example of a sonar report result in a Github PR:

SonarCloud in action

As you can see, Sonar found 26 code smells in this PR's code and 52.3% coverage. We can drill down further by clicking on one of the issues, which redirects us to Sonar for a breakdown of the selected issue.

In Sonar, you can choose between SonarCloud and SonarQube. SonarQube is meant to be integrated with on-premise solutions, like GitHub Enterprise or Bitbucket Server, and SonarCloud is for cloud solutions like GitHub or Bitbucket Cloud. SonarQube is open-source and free, and SonarCloud is free for public projects.

I recommend adding a Sonar analysis step every time your code is ready to be merged or released, to ensure maximum code quality. You can also integrate Sonar directly into your favorite CI/CD integration (Jenkins, Azure DevOps, etc.) to have it triggered automatically.

By the way, you can still use Sonar even if you’re not a Java developer. Sonar also supports C#, C/C++, Objective-C, TypeScript, Python, PLSQL and a host of other languages.

IntelliJ IDEA

If you’re a Java developer and you’re one of the 38% who haven’t experienced IntelliJ IDEA – it’s about time you do!

With smart code completion, refactoring and debugging, open-source IntelliJ IDEA leads developers to efficient and convenient coding. It is your armchair assistant, there to aid and redirect you to the correct path at every step of the way.

Compared to other Java IDEs, IntelliJ IDEA's strength is its intelligence. IntelliJ IDEA understands what you want to do and does it. For example, let's say there is a method that accepts a User and a property of that User, and you want to simplify it by passing just the User.

IntelliJ in action!

With the inline shortcut (CTRL+ALT+N) on the “activated” parameter, IntelliJ IDEA understands you want to use user.getActivated() and removes the parameter from the method and its usages! No manual changes are needed. The result:

IntelliJ in action - again!
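
A hedged reconstruction of the before/after from the screenshots; class and method names are illustrative:

    class NotificationService {
        // Before: the caller must extract the property and pass it alongside the object
        void sendIfActivated(User user, boolean activated) {
            if (activated) {
                send(user);
            }
        }

        // After Inline (Ctrl+Alt+N) on the "activated" parameter: IntelliJ IDEA rewrites
        // the body to use user.getActivated() and updates every call site automatically
        void sendIfActivated(User user) {
            if (user.getActivated()) {
                send(user);
            }
        }

        private void send(User user) { /* deliver the notification */ }
    }

    // Minimal User type for the sketch
    class User {
        private boolean activated;
        boolean getActivated() { return activated; }
    }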

For more refactoring tips with IntelliJ IDEA, take a look at the refactoring guide right here, or learn some new debugging tricks with IntelliJ IDEA.

And a quick plug – if you’re already an IntelliJ IDEA user (or a soon-to-be-convert!), we offer a plugin for real-time production debugging, right from inside your IDE. Schedule a demo!

VisualVM

VisualVM is a powerful open-source profiling tool for Java applications. It supports local and remote profiling, memory and CPU profiling, thread monitoring, thread dumps and heap dumps. VisualVM displays monitoring and performance analysis of your app, so you can fix your code before it crashes in production.

VisualVM displays the running Java applications on the left pane:

All running Java apps in VisualVM

After selecting a Java application, we can see CPU usage, heap space, classes and threads in the monitor tab:

VisualVM Monitoring Window

This enables you to understand, for example, if your application takes too much CPU or memory. Moreover, you can detect memory leaks using a heap dump – a snapshot of the current Java objects and classes in the heap. See all the possible options and some usage examples here.

In Summary 

There are many Java tools and platforms available that can help you code faster and smarter, by streamlining various coding-adjacent activities – like testing, analyzing, profiling, building and releasing. 

These tools will make your coding more productive and efficient, and reduce many of the mundane activities we deal with when we program. There are also many Java communities that can provide assistance and consultation. Let me know in the comments section if you have any other favorite tools you like to use, and I’ll try to incorporate them into the next article! 

When Disaster Strikes: Production Troubleshooting

Tom Granot and I have had the privilege of Vlad Mihalcea's online company for a while now. As a result, we decided to do a workshop together, talking about a lot of the things we learned in the process. This workshop will be pretty informal and ad hoc, just a bunch of guys chatting and showing off what we can do with tooling.

In celebration of that, I thought I'd write about some of the tricks we've discussed amongst ourselves in the past, both to give you a sense of what to expect when joining us for the workshop and as a useful resource in its own right.

The Problem

Before we begin I’d like to take a moment to talk about production and the role of developers within a production environment. As a hacker I often do everything. That’s OK for a small company but as companies grow we add processes.

Production doesn’t go down in flames as much. Thanks to staging, QA, CI/CD and DevOps who rein in people like me…

So we have all of these things in place. We passed QA, staging and everything’s perfect. Right?

All good, right? Right???

Well… Not exactly.

Sure. Modern DevOps made a huge difference to production quality, monitoring and performance. No doubt. But bugs are inevitable. The ones that slither through are the worst types of vermin. They’re hard to detect and often only happen on scale.

Some problems, like performance issues, are only noticeable in production, against a production database. Staging or dev environments can't completely replicate modern complex deployments. Infrastructure as Code (IaC) helps a lot with that, but even with such solutions, production is at a different scale.

It’s the One Place that REALLY Matters

Everything that isn’t production is in place to facilitate production. That's it. We can have the best and most extensive tests, with 100% coverage in our local environments. But when our system is running in production, behavior is different. We can't control it completely.

A knee-jerk reaction is “more testing”. I see that a lot. If only we had a test for that… The solution is to somehow think of every possible mistake we can make and build a test for that. That's insane. If we know the mistake, we can just avoid it. The idea that a different team member will have that insight is again wrong. People make similar mistakes, and while we can eliminate some bugs in this way, more tests create more problems… CI/CD becomes MUCH slower and results in longer deploy times to production.

That means that when we do have a production bug, it will take much longer to fix because of redundant tests. It means that the whole CI quality process we need to go through will take longer. It also means we'll need to spend more on CI resources…

Logging

Logging solves some of the problems. It’s an important part of any server infrastructure. But the problems are similar to the ones we run into with testing.

We don’t know what will be important when we write a log. Then in production we might find it’s missing. Overlogging is a huge problem in the opposite direction. It can:

  • Demolish performance & caching
  • Incur huge costs due to log retention
  • Make debugging harder due to hard-to-wade-through verbosity

It might still be missing the information we need…

I recently posted to a reddit thread where this comment was also present:

“A team at my company accidentally blew ~100k on Azure Log Analytics during the span of a few days. They set the logging verbosity to a hitherto untested level and threw in some extra replicas as well. When they announced their mistake on Slack, I learned that yes, there is such a thing as too much logging.”  – full thread here.

Again, logging is great. But it doesn’t solve the core problem.

Agility

Our development team needs to be fast and responsive. We need to respond quickly to issues. Sure, we need to try and prevent them in the first place… But like most things in life the law of diminishing returns is in effect here too. There are limits to tests, logs, etc.

For that we need to fully understand the bug fast. Going through the process of reproducing something locally based on hunches is problematic at best. We need a way to observe the problem.

This isn’t new. There are plenty of solutions for looking at issues in production. APM tools, for example, provide invaluable insight into our performance in production. They don't replace profilers. They provide the one data point that matters: how fast is the application that our customers are using!

But most of these tools are geared towards DevOps. It makes sense. DevOps are the people responsible for production, so naturally the monitoring tools were built for them. But DevOps shouldn’t be responsible for fixing R&D bugs or even understanding them… There’s a disconnect here.

Enter Developer Observability

Developer observability is a pillar of observability targeted at developers instead of DevOps. With tools in this field we can instantly get feedback that's tailored for our needs and reduce the churn of discovering the problem. Before these tools, if a log didn't exist in production and we didn't understand the problem, we had to redeploy our product with “more logs” and cross our fingers…

In Practice and The Workshop…

I got a bit ahead of myself explaining the problem longer than I will explain the solution. I tend to think that’s because the solution is so darn obvious once we “get it”. It’s mostly a matter of details.

Like we all know: the devil is in the details…

Developer observability tools can be very familiar to developers who are used to working with debuggers and IDEs. But they are still pretty different. One example is breakpoints.

It’s Snapshots Now

We all know this drill. Set a breakpoint in the code that doesn’t work and step over until you find the problem. This is so ingrained into our process that we rarely stop to think about this at all.

But if we do this in a production environment, the server will be stuck while waiting for us to step over. This might impact all users on the server, and I won't even discuss the security/stability implications (you might as well take a hammer and demolish the server. It's that bad).

Snapshots do everything a breakpoint does. They can be conditional, like a conditional breakpoint. They contain the stack trace and you can click on elements in the stack. Each frame includes the value of the variables in this specific frame. But here’s the thing: they don’t stop.

So you don’t have “step over” as an option. That part is unavoidable since we don’t stop. You need to rethink the process of debugging errors.

currentTimeMillis()

I love profilers. But when I need to really understand the cost of a method I go to my trusted old currentTimeMillis() call. There’s just no other way to get accurate/consistent performance metrics on small blocks of code.
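
The classic pattern, for reference (the timings and the measured method are illustrative):

    public class TicToc {
        public static void main(String[] args) {
            long tic = System.currentTimeMillis();   // start of the measured block
            expensiveOperation();
            long toc = System.currentTimeMillis();   // end of the measured block
            System.out.println("expensiveOperation took " + (toc - tic) + " ms");
        }

        // Stand-in for the method under measurement
        private static void expensiveOperation() {
            try { Thread.sleep(250); } catch (InterruptedException ignored) { }
        }
    }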

But as I said before, production is where it's at. I can't just stick micro measurements all over the code and review later.

So developer observability tools added the ability to measure things. Count the number of times a line of code was reached. Or literally perform a tictoc measurement which is equivalent to that currentTimeMillis approach.

See You There

“Only when the tide goes out do you discover who’s been swimming naked.” –   Warren Buffett

I love that quote. We need to be prepared at all times. We need to move fast and be ready for the worst. But we also need practicality. We aren't original; there are common bugs that we run into left and right. We might notice them faster, but mistakes aren't original.

In the workshop we’ll focus on some of the most common mistakes and demonstrate how we can track them using developer observability. We’ll give real world examples of failures and problems we ran into in the past and as part of our work. I’m very excited about this and hope to see you all there!


Lightrun’s Product Updates – Q2 2023

During the second quarter of this year, Lightrun continued producing a wealth of developer productivity solutions and enhancements, aiming at better troubleshooting of distributed workload applications, reduced MTTR for complex issues, and cost optimization within cloud computing.

Read on for the main new features as well as the key product enhancements that were released in Q2 of 2023!

📢 NEW! Lightrun Support for Troubleshooting .NET Applications!

Lightrun extended its runtime support with the addition of .NET application debugging. With this new capability, developers can troubleshoot live applications directly from the VSCode, VSCode.dev, and JetBrains Rider IDE plugins and resolve issues quickly. The support for the .NET runtime also enhances the depth of troubleshooting through custom expressions, top-notch security, support for the latest framework (.NET 7), stability, and multiple deployment options (on-premise, SaaS).

The Lightrun dedicated package for .NET is available through the NuGet gallery. To learn more about this new runtime support and get started with it, please read this blog.

Lightrun supports .NET

📢 New! RBAC Support for Enhanced Enterprise-Grade Security and Governance

Lightrun enhances its enterprise-grade platform with the addition of RBAC (role-based access control) support to ensure that only authorized users have access to sensitive information and resources as they troubleshoot their live applications. By using Lightrun's RBAC solution, organizations can create a centralized system for managing user permissions and access rights, making it easier to enforce security policies and prevent security breaches.

This solution offers 3 main components: User groups, User Roles, and Agent Pools. To learn more about the benefits and settings of RBAC, read our documentation.

RBAC

This feature helps improve an organization's data security and streamlines its workflows. For more information and help with setting this up, please reach out to our support team.

📢 NEW! Lightrun Metrics for Java Runtime Support 🎉

Lightrun has also launched an upgraded version of Metrics integrated with the IntelliJ IDE plugin for Java runtime troubleshooting. The enhanced Lightrun Metrics solution is now available as a separate tab in the plugin, providing developers with a dedicated tool to address application performance-related problems. Previously, all outputs from Metrics were directed to the Lightrun console, but with the latest implementation, these outputs are sent to the newly introduced tab within the plugin.

With the enhanced implementation, Lightrun Metrics helps developers tackle the following common challenges:

  1. Complexity around the contributing factors to performance issues (infrastructure-related issues, third-party service performance, concurrency, etc.)
  2. Reproducing performance issues locally is a huge challenge; with Lightrun Metrics, developers can avoid the hassle of setting up a production-like environment with the right data as they debug performance issues.
  3. Performance issues are often specific to users or segments, so pinpointing them to a specific client or segment using profilers and APM tools is hard. In addition, such tools (APMs) do not provide code-level information. With Lightrun Metrics, developers can debug such issues intuitively, directly from the dedicated IDE plugin tab.

The solution currently supports the collection of 4 different KPIs:

  • Tic-Toc (Block Duration) – Measures the elapsed time for executing a specified block of code, within the same function
  • Method Duration – Measures the elapsed time for executing a given method
  • Custom Metric – Enables developers to design their own custom metrics, using custom expressions that evaluate to a numeric result of type long. Custom metrics can be created using the configuration form in the Lightrun IDE plugin or from the CLI
  • Counters (coming soon!) – Counts the number of times the code line is reached

Each of the above supported metrics is collected and calculated on a 24-hour timeline and can be analyzed within a specific time range, allowing developers to zoom in on specific issues. In addition, developers can gain metrics visibility per single runtime agent or across multiple ones (see Lightrun Metrics for Java in action in this demo video).

To get started with Lightrun Metrics and learn more, please refer to our documentation.

📢 NEW! LogOptimizer™ Support for .NET Runtime 

In addition to the previously supported runtimes, including Java, JavaScript, and Python, Lightrun introduced support for .NET to allow developers to optimize the overall cost of logging within their .NET applications. With the LogOptimizer™ solution, developers can receive a report showing the main log lines that are candidates for replacement with Dynamic Logs.

To learn more about the LogOptimizer, please read more here.

Feel free to visit Lightrun’s website to learn more or if you’re a newcomer, try it for free!

Remote Debugging in a WFH era

While remote work isn't new for developers, an overwhelming number of tech companies still had onsite developers when the pandemic hit. COVID-19 threw the software development world into a spiral and, led by Microsoft and other tech giants, remote development seems to have become the new norm.

Working remotely introduced multiple changes to the development process with a focus on the way developers communicate and collaborate with each other. Existing solutions that were not designed to be used in a remote setup increased organizations’ security concerns and they started to look for ways to mitigate the risk while addressing their developers’ needs.  

An existing solution – the remote debugger – turned out to be suitable for WFH and quickly gained popularity. 

What’s a remote debugger?

The debugger is a fundamental development tool that can connect to a running application and enable developers to explore its state and behavior, one breakpoint at a time. Remote debuggers are a newer variant, capable of doing the same for applications that are running on different servers (or in the cloud).
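
For a JVM, for example, remote debugging is typically enabled by starting the application with the standard JDWP agent flag, which opens a debug port the IDE can then attach to (the port and jar name are illustrative):

    java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar app.jar

Keep that open port in mind: as discussed below, leaving it reachable is also at the root of the security concerns.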

Using an old tool in a new world

With the increasing popularity of remote debuggers, new challenges arise in a distributed production setting:

  1. Breakpoints pause the active application – while step-by-step analysis and variable inspection are great for local debugging, pausing a live application disrupts the customer experience. Imagine not being able to perform a Google search because someone placed a breakpoint in production code. In addition, when a breakpoint pauses the application's execution, it reduces our chances of reproducing a bug, and useful data that could be helpful down the road (such as method execution time or the exact exit time) is not recorded.
  2. Remote debuggers are insecure by design, since they use a known port to communicate with client tools (such as the IDE). Combined with the fact that remote debuggers have neither an authentication-authorization scheme nor a built-in audit trail, we end up with a pretty major security backdoor that hackers could take advantage of. They could retrieve code-level information or, worse, control the service's behavior. This is obviously unacceptable.
  3. Remote debuggers violate the SOC 2 compliance standard (and similar ones), since they rely on direct communication with production systems. Furthermore, remote debuggers disrupt the segregation of duties between production and development, which conflicts with common enterprise security policy.
  4. During the development phase, a single application instance on a single server (usually the developer machine) is all that's involved. In a cloud-native era where microservices are the new norm and container orchestrators such as Kubernetes (K8s) are all the rage, more often than not, multiple instances of the application are at play. This surfaces two concerns – identification and concurrency. High-scale, microservices-based production applications utilize numerous threads running inside multiple instances to digest application data concurrently, while load balancers keep the entire system in equilibrium. With such a complex setup, it's often hard to identify (and attach to) the offending instance. In addition, when troubleshooting code in a multi-tenant environment, debugging a specific request must not affect other users. Adding a breakpoint to a live, busy instance is not only unreasonable but most likely forbidden.
  5. Legacy debugging practices are a non-collaborative process that contradicts agile development. Since multiple engineers are often required to resolve an incident, this can be a serious barrier and will surely extend the incident response time. This challenge is magnified for remotely distributed teams.
  6. Remote debuggers are relatively slow due to their design and the underlying protocol. Since debugging protocols were designed to be highly interactive they are chatty (i.e. use large numbers of requests/responses). Chatty protocols tend to be super slow over high latency networks.

Introducing a new concept – Production Debugger

Given the rapid changes to the modern development environment, rather than solving old tool issues, it might be worth taking a fresh look and designing a new solution. The ideal solution would take into consideration current production environments, the speed at which developers are expected to resolve issues, enterprise security constraints, and regulations.

For lack of a better term, let’s name the solution: Production Debugger. The main goal is to improve the debugging process and reduce the average time to resolve production issues. Production debuggers should provide contextual, code-level information without pausing the service. 

The ideal Production Debugger should: 

  1. Not disrupt the user experience by pausing the application, but rather allow developers to (non-intrusively) define the information to be collected from the application during execution.
  2. Provide granular information regarding all variables and support the different observability pillars, such as logs, metrics, traces, expression evaluation, and variable exploration. Take, for example, the case of performance analysis in conjunction with other common production issues like race conditions and other forms of infrastructure-based latency. Inspecting local variable values and the current stack trace is useful for understanding the application context, but insufficient for proper performance analysis. Knowing the time it takes to execute a function, or the size of an in-memory data structure, would help resolve production performance issues significantly faster.
  3. “Think distributed” – Modern software environments involve constant container restarts and active applications that are decoupled from the server's bare metal. Detecting an offending application instance inside a Kubernetes cluster and sending a request for information to every container in every pod can get complicated. The ideal debugger should treat all of the underlying application code as a single entity and allow debugging all instances simultaneously, instead of demanding that a specific instance be selected. In other words, eliminating the prerequisite of selecting application instances to debug would greatly simplify troubleshooting.
  4. Support secured, managed, and “read-only” access – production-grade applications are precious, have access to highly sensitive data, and as such should be handled with great care. In order to comply with enterprise-level security standards and controls, developers accessing production systems must be authenticated and restricted to read-only access. In addition, in order to ensure application stability, the debugger must have a negligible footprint that does not affect the application's performance over time.
  5. Be very fast – the ideal protocol should minimize the number of requests and their size. In other words, fewer requests containing richer information would reduce the number of (client-server) round trips and significantly speed up the process. 
  6. Be designed for agile development and collaboration, enabling multiple developers to debug simultaneously while making all actions immediately visible to the involved development team, regardless of location. A tool that empowers developers to better collaborate on incident response can reduce at least some of the stress inherent to this ongoing transition.

Disclaimer: Lightrun was built from the ground up to support the principles above. We offer an innovative approach to remote debugging with better and faster visibility into running applications to enable the quickest incident resolution. Request a demo to learn more.

Testing in Production: Recommended Tools

Testing in production has a bad reputation. The same kind "git push --force origin master" has. Burning houses and Chuck Norris represent testing in production in memes, and that says it all. When done poorly, testing in production very much deserves the sarcasm and negativity. But that's true for any methodology or technique.

This blog post aims to shed some light on the testing in production paradigm. I will explain why giants like Google, Facebook and Netflix see it as a legitimate and very beneficial instrument in their CI/CD pipelines. So much, in fact, that you might consider using it as well. I will also provide recommendations for testing in production tools, based on my team's experience.

Testing In Production – Why?

Before we proceed, let's make it clear: testing in production is not applicable for every software. Embedded software, on-prem high-touch installation solutions or any type of critical systems should not be tested this way. The risks (and as we'll see further, it's all about risk management) are too high. But do you have a SaaS solution with a backend that leverages microservices architecture, or even just a monolith that can be easily scaled out? Or any other solution whose deployment and configuration the company's engineers fully control? Ding ding ding – those are the ideal candidates.

So let’s say you are building your SaaS product and have already invested a lot of time and resources to implement both unit and integration tests. You have also built your staging environment and run a bunch of pre-release tests on it. Why on earth would you bother your R&D team with tests in production? There are multiple reasons: let’s take a deep dive into each of them.

Staging environments are bad copies of production environments

Yes, they are. Your staging environment is never as big as your production environment – in terms of server instances, load balancers, DB shards, message queues and so on. It never handles the load and the network traffic production does. So, it will never have the same number of open TCP/IP connections, HTTP sessions, open file descriptors, and parallel writes that DB queries perform. There are stress testing tools that can emulate that load, but when you scale, this stops being sufficient very quickly.

Besides the size, the staging environment is never the production one in terms of configuration and state. It is often configured to start a fresh copy of the app upon every release, security configurations are relaxed, ACLs and service discovery will never handle real-life production scenarios, and the databases are emulated by recreating them from scratch with automation scripts (copying production data is often impossible even legally, due to privacy regulations such as GDPR). Well, after all, we all try our best.

At best we can create a bad copy of our production environment. This means our testing will be unreliable and our service susceptible to errors in the real-life production environment.

Chasing after maximum reliability before the release costs. A lot.

Let’s just cite Google engineers:

“It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer.

Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be.”

Let’s emphasize the point: “Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear”. No unit/integration/staging-env tests will ever make your release 100% error-free. In fact, they shouldn't (well, unless you are a Boeing engineer). After a certain point, investing more and more in tests and attempting to build a better staging environment will just cost you more compute/storage/traffic resources and will significantly slow you down.

Doing more of the same is not the solution. You shouldn’t spend your engineers’ valuable work hours chasing the dragon trying to diminish the risks. So what should you be doing instead?

Embracing the Risk

Again, citing the great Google SRE Book:

“…we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos…. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.”

So it is not just about when and how you run your tests. It's about how you manage the risks and costs of your application's failures. No company can afford product downtime because of some failed test (which is totally OK in staging). Therefore, it is crucial to ensure that your application handles failures right. “Right”, quoting the great post by Cindy Sridharan, means:

“Opting in to the model of embracing failure entails designing our services to behave gracefully in the face of failure.”

The design of fault-tolerant and resilient apps is out of the scope of this post (Netflix Hystrix is still worth a look though). So let's assume that's how your architecture is built. In such a case, you can fearlessly roll out a new version that has been tested just enough internally.

The way to bridge the gap, so as to get as close as possible to 100% error-free, is by testing in production. This means testing how our product really behaves and fixing the problems that arise. To do that, you can use a long list of dedicated tools and also expose it to real-life production use cases.

So the next question is – how to do it right?

Testing In Production – How?

Cindy Sridharan wrote a great series of blog posts that discusses the subject in great depth. Her recent Testing in Production, the safe way blog post depicts a table of test types you can run in pre-production and in production.

One should definitely read carefully through this post. We’ll just take a brief look and review some of the techniques she offers. We will also recommend various tools from each category. I hope you find our recommendations useful.

Load Testing in Production

As simple as it sounds. Depending on the application, it makes sense to stress its ability to handle huge amounts of network traffic, I/O operations (often distributed), database queries, message queue storms, and so on. Some severe bugs show up clearly only under load (hi, memory overwrite). And even when they don’t, any system can handle only a limited amount of load, so failure tolerance and the graceful handling of dropped connections become really crucial here.

Obviously, performing a load test in the production environment stresses your app as it is configured for real-life use, and thus provides far more useful insights than load testing in staging.

There are a bunch of load testing tools that we recommend, many of them open source. To name a few:

mzbench

mzbench supports MySQL, PostgreSQL, MongoDB, and Cassandra out of the box, and more protocols can easily be added. It was a very popular tool in the past, but it was abandoned by its developer about two years ago.

HammerDB

HammerDB supports Oracle Database, SQL Server, IBM Db2, MySQL, MariaDB, PostgreSQL, and Redis. Unlike mzbench, it is under active development as of May 2020.

Apache JMeter

Apache JMeter focuses more on web services (DB protocols are supported via JDBC). This is the old-school (though somewhat cumbersome) Java tool I was using ten years ago for fun and profit.
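
For a taste of how JMeter runs in practice, here is a minimal non-GUI sketch (the test plan and file names are illustrative placeholders):

# Run a test plan in non-GUI mode and record raw results
jmeter -n -t my_test_plan.jmx -l results.jtl

# Generate an HTML dashboard report from the recorded results
jmeter -g results.jtl -o ./report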

BlazeMeter

BlazeMeter is a proprietary tool. It runs JMeter, Gatling, Locust, Selenium (and more) open-source scripts in the cloud, enabling the simulation of more users from more locations.

Spirent Avalanche Hardware

If you are into heavy guns, meaning you are developing solutions like WAFs, SDNs, or routers, then this testing tool is for you. Spirent Avalanche is capable of generating up to 100 Gbps of traffic, performing vulnerability assessments, QoS and QoE tests, and much more. I have to admit: it was my first load testing tool as a fresh graduate working at Check Point, and I still remember how amazed I was to see its power.

Shadowing/Mirroring in Production

Send a portion of your production traffic to your newly deployed service and see how it’s handled in terms of performance and possible regressions. Did something go wrong? Just stop the shadowing and take your new service down, with zero impact on production. This technique is also known as a “dark launch” and is described in detail in Google’s blog post CRE life lessons: What is a dark launch, and what does it do for me?

A proper configuration of load balancers/proxies/message queues will do the trick. If you are developing a cloud-native application (Kubernetes/microservices), you can use solutions like:

HAProxy

HAProxy is an open-source, easy-to-configure proxy server.

Envoy proxy 

Envoy proxy is open source and a bit more advanced than HAProxy. Built for the microservices world, it offers service discovery, traffic shadowing, circuit breaking, and dynamic configuration via API.

Istio

Istio is a full open-source service mesh solution. Under the hood, it uses the Envoy proxy as a sidecar container in every pod; this sidecar is responsible for all incoming and outgoing communication. Istio controls service access, security, routing, and more.
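
As a rough sketch of what shadowing looks like with Istio, the following VirtualService mirrors a slice of live traffic to a second version of a service (the service names and the 10% sampling rate are illustrative assumptions, not taken from any specific setup):

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - route:
        - destination:
            host: my-service-v1   # all real responses come from v1
      mirror:
        host: my-service-v2       # v2 receives a fire-and-forget copy
      mirrorPercentage:
        value: 10.0               # shadow only 10% of requests
EOF

Responses from the mirrored service are discarded, so a crash in the new version never reaches your users.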

Canarying in Production

Google SRE Book defines “canarying” as the following:

To conduct a canary test, a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion. Should anything go awry, the modified servers can be quickly reverted to a known good state.
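
On Kubernetes, a rough manual approximation of this incubation-period flow can be driven with the built-in rollout commands (the deployment and container names below are placeholders):

# Ship the new image, then pause the rollout after the first canary pods come up
kubectl set image deployment/my-app app=my-app:v2
kubectl rollout pause deployment/my-app

# After the incubation period, continue the progressive rollout
kubectl rollout resume deployment/my-app
kubectl rollout status deployment/my-app

# Should anything go awry, revert to the known good state
kubectl rollout undo deployment/my-app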

This technique, as well as the similar (but not identical!) Blue-Green deployment and A/B testing techniques, is discussed in this Christian Posta blog post, while the caveats and cons of canarying are reviewed here. As for recommended tools:

Spinnaker

Spinnaker, the CD platform open-sourced by Netflix, leverages the aforementioned and many other deployment best practices (and, as with everything from Netflix, it was built with microservices in mind).

ElasticBeanstalk

AWS supports Blue/Green deployment with its PaaS solution, Elastic Beanstalk.

Azure App Services

Azure App Services has its own staging slots capability that allows you to apply the prior techniques with zero downtime.

LaunchDarkly

LaunchDarkly is a feature flagging solution for canary releases, enabling gradual capacity testing of new features and safe rollbacks if issues are found.

Chaos Engineering in Production

First introduced by Netflix’s ChaosMonkey, Chaos Engineering has grown into a separate and very popular discipline. It is not about “simple” load testing; it is about bringing down service nodes, reducing DB shards, misconfiguring load balancers, causing timeouts: in other words, messing up your production environment as badly as possible.
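
The spirit of the discipline fits in a one-liner. As a minimal sketch (the "production" namespace is an assumption), you could terminate one random pod and watch whether the system heals itself:

# Pick one random pod in the namespace, delete it, and observe recovery
kubectl get pods -n production -o name | shuf -n 1 | xargs kubectl delete -n production

Purpose-built chaos tools do this, and far nastier things, in a controlled and repeatable way.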

The winning tools in this area are what I like to call “Chaos as a service”:

ChaosMonkey

ChaosMonkey is an open-source tool by Netflix. It randomly terminates services in your production system, helping you make sure your application is resilient to these kinds of failures.

Gremlin

Gremlin is another great tool for chaos engineering. It allows a DevOps or chaos engineer to define simulations and see how the application reacts in different scenarios: unavailable resources (CPU/memory), state changes (changed system time, killed processes), and network failures (packet drops, DNS failures).

And there are plenty of others worth exploring.

Debugging and Monitoring in Production

The last, but not least, toolset to briefly review is monitoring and debugging tools. Debugging and monitoring are the natural next steps after testing: testing in production provides us with real product data that we can then use for debugging. Therefore, we need the right tools to monitor and debug the test results in production.

There are some acknowledged leaders, each one addressing the three pillars of observability (logs, metrics, and traces) in its own way:

DataDog

DataDog is a comprehensive monitoring tool with amazing tracing capabilities that help a lot in debugging, all with very low overhead.

Logz.io

Logz.io is all about centralized log management; combining it with DataDog can create a powerful toolset.

New Relic

New Relic is a very strong APM tool that offers log management, AIOps, monitoring, and more.

Prometheus

Prometheus is an open-source monitoring solution that includes metrics scraping, querying, visualization, and alerting.

Lightrun

Lightrun is a powerful production debugger. It enables adding logs, performance metrics, and traces to production and staging in real time, on demand. With Lightrun, developers can securely add instrumentation without having to redeploy or restart. Request a demo to see how it works.

To sum up, testing in production is a technique you should pursue and experiment with if you are ready for a paradigm shift: from diminishing risks in pre-production to managing risks in production.

Testing in production complements the testing you are used to doing, and it adds important benefits such as speeding up release cycles and saving resources. I covered several types of production testing techniques and recommended some tools for each. If you want to read more, check out the resources cited throughout this blog post. Let us know how it goes!

Learn more about Lightrun and let’s chat.

The post Testing in Production: Recommended Tools appeared first on Lightrun.

Top 5 Debugging Tips for Kubernetes DaemonSet https://lightrun.com/top-5-debugging-tips-for-kubernetes-daemonset/ Tue, 23 Aug 2022 17:11:33 +0000

Kubernetes is the most popular container orchestration tool for cloud-based web development. According to Statista, more than 50% of organizations used Kubernetes in 2021. This may not surprise you, as the orchestration tool provides some fantastic features to attract developers. DaemonSet is one of the highlighted features of Kubernetes, and it helps developers to improve cluster performance and reliability. Although it is widely used, debugging DaemonSet can be challenging since it is in the application layer. So, this article will discuss five essential tips to help you debug Kubernetes DaemonSet.

What is a Kubernetes DaemonSet? 

Kubernetes DaemonSet is a Kubernetes object that ensures all nodes (or a selected subset) in a cluster each run a single copy of a pod.

When you add a new node to a cluster, the DaemonSet controller automatically adds a pod to it. Similarly, the pod is removed when a node is deleted from the cluster.

Most importantly, DaemonSet improves the performance and reliability of your Kubernetes cluster while distributing tasks across all nodes. Some developers argue that we do not need to consider where pods run on a Kubernetes cluster. But DaemonSet is efficient for long-running services like log collection, node monitoring, and cluster storage. Also, you can create multiple DaemonSets for a single type of daemon using different flags, memory capacities, and CPU requests.


Taints and tolerations for DaemonSet

Taints and tolerations work together to stop pods from being scheduled onto inappropriate nodes. You can apply one or more taints to a node, and the node will not accept pods that do not tolerate those taints. Tolerations, in turn, enable the scheduler to schedule pods onto nodes with matching taints; note, however, that a toleration permits scheduling but does not guarantee it.
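
As a small sketch of how the two sides fit together (the node name, DaemonSet name, and taint key below are made up for illustration), you taint the node from the command line and declare the matching toleration in the DaemonSet's pod spec:

# Only pods tolerating "dedicated=logging:NoSchedule" may land on this node
kubectl taint nodes node-1 dedicated=logging:NoSchedule

# A minimal DaemonSet whose pods tolerate that taint
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "logging"
          effect: "NoSchedule"
      containers:
        - name: agent
          image: busybox
          command: ["sh", "-c", "sleep infinity"]
EOF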

Top 5 debugging tips for Kubernetes DaemonSet 

Now that we have a broad understanding of Kubernetes DaemonSet, let’s discuss a few tips you can use to ease the Kubernetes DaemonSet debugging process.

1. Find unhealthy pods

A DaemonSet is considered unhealthy when it does not have one pod running in each node. Unhealthy DaemonSets are caused mainly by pending pods or pods stuck in a crash loop.

You can easily find unhealthy pods in a Kubernetes cluster by listing all the available pods. The below command will list all the pods in the cluster with their statuses.

kubectl get pod -l app=[label]

You can identify the unhealthy pods by their status once they are listed. Pods with CrashLoopBackOff, Pending, and Evicted statuses are considered unhealthy. Once you identify the unhealthy pods, you can use the below commands to get more details and logs for the pod.

# Get more information about the pod
kubectl describe pod [pod-name]

# Get pod logs
kubectl logs [pod-name]

Finally, you can use the pod information and logs to determine the issue in the DaemonSet. This approach saves you a lot of time since you do not need to debug all the pods in the cluster to find the problem. You can prioritize the unhealthy pods first.
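
If the cluster runs many pods, a field selector can narrow the list down. One caveat (from general kubectl behavior, not specific to DaemonSets): crash-looping pods usually still report the Running phase, so you should still scan the STATUS column for CrashLoopBackOff:

# List only pods whose phase is not Running (Pending, Failed, etc.)
kubectl get pods --field-selector=status.phase!=Running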


2. Resolve the nodes that don’t have enough resources 

As mentioned, pods with CrashLoopBackOff status are considered unhealthy. This error is often caused by a lack of resources available to run the pod. You can follow the steps below to quickly troubleshoot pods with CrashLoopBackOff status.

First, you need to find the node that runs the unhealthy pod:

kubectl get pod [pod-name] -o wide

Then, you can use the node name from the above command's output to monitor the available node resources:

kubectl top node [node-name]

If you notice a lack of resources in the node, you can resolve it by:

  • Decreasing the memory and CPU requests of the DaemonSet (see the sketch after this list).
  • Upgrading nodes to accommodate more pods.
  • Moving affected pods to another node.
  • Using taints and tolerations to prevent pods from running on nodes with lower resources.
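
A quick way to apply the first and last of these remedies from the command line (the DaemonSet and node names are placeholders):

# Lower the DaemonSet's requests so its pods fit on the starved node
kubectl set resources daemonset/[daemonset-name] --requests=cpu=100m,memory=128Mi

# Or keep new pods off the starved node until it is upgraded
kubectl taint nodes [node-name] low-resources=true:NoSchedule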

However, if you don’t notice a lack of resources in the node, you will have to check node logs and investigate the pod command to find the issue.

3. Identify container issues 

If you can’t find any issues in the pods, the error might be caused by a container within a pod. Using the wrong image is the main reason for container issues. So, first, you need to find the image name from the DaemonSet manifest and verify that you have used the correct image.

If the image is correct, you will have to investigate the container itself for application or configuration issues. One way is to run the container's image locally with an interactive shell and poke around:

docker run -ti --rm ${image} /bin/bash
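
Alternatively, if the pod is up (even if unstable), you can open a shell straight inside the running container; the pod name and shell path here are placeholders:

# Open an interactive shell inside the running container
kubectl exec -it [pod-name] -- /bin/sh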

4. Use Kubectl commands for troubleshooting

Using Kubectl commands is another excellent approach to debugging Kubernetes DaemonSets. Kubectl is a command line tool provided by Kubernetes to communicate easily with Kubernetes clusters. You can use it to perform any action on a cluster, including deploying apps and managing cluster resources. Most importantly, you can use Kubectl on Windows, macOS, and multiple varieties of Linux.

Here are some of the most popular Kubectl commands you can use to debug DaemonSets (example invocations follow the list):

  • Kubectl describe – Provides detailed information on deployments, services, and pods. When debugging, you can use this command to fetch details on nodes to identify memory and disk space issues.
  • Kubectl logs – Used to display logs from a Kubernetes resource. These logs can be a lifesaver when you need more information to determine an error’s root cause.
  • Kubectl exec – You can execute commands in a running container using this command. You can use this command to view configuration, startup scripts, and permissions when debugging.
  • Kubectl auth – This is another essential command for debugging. It allows you to verify that a selected user or a group can perform a particular action.
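
As a reference sketch, here is what those commands typically look like in a DaemonSet debugging session (all resource names and paths are placeholders):

# Inspect a node for memory and disk pressure conditions
kubectl describe node [node-name]

# Stream logs from a specific container of a DaemonSet pod
kubectl logs -f [pod-name] -c [container-name]

# Peek at configuration inside a running container
kubectl exec [pod-name] -- cat [path-to-config-file]

# Verify that you are allowed to perform a given action
kubectl auth can-i delete pods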


5. Invest in an observability platform

Logs are an essential part of application debugging. It is no different for Kubernetes, and you can add logs as you see fit to make the debugging process more straightforward. However, manually adding logs is not an easy task. It takes a lot of time, and there can be human errors.

The best way to add logs to your application is by using a specialized observability tool like Lightrun. Such tools help developers monitor their applications in real time, identify issues, and fix them quickly. Using a specialized tool makes the debugging process significantly faster and more efficient.

Next steps

The five tips we discussed to debug Kubernetes DaemonSet should make the debugging process easier for you. However, debugging DaemonSets is naturally challenging since daemons are placed in the application layer of the workload. It is always more beneficial to use an observability tool like Lightrun to automate some of your work. Lightrun enables you to add logs, metrics, and traces to your Kubernetes clusters and monitor these in real-time while your app is running. You can find more details on how Lightrun works by requesting a demo.

The post Top 5 Debugging Tips for Kubernetes DaemonSet appeared first on Lightrun.

7 Must-Have Steps for Production Debugging in Any Language https://lightrun.com/7-must-have-steps-for-production-debugging-in-any-language/ Tue, 04 Oct 2022 09:01:00 +0000

Debugging is an unavoidable part of software development, especially in production. You can often find yourself in “debugging hell,” where an enormous amount of debugging consumes all your time and keeps the project from progressing.

According to a report by the University of Cambridge, programmers spend almost 50% of their time debugging. So how can we make production debugging more effective and less time-consuming? This article will guide you through seven essential steps to optimize your production debugging process. 

What is Production Debugging?

Production debugging is the process of identifying the underlying cause of issues in an application running in a production environment. It is often done remotely, since it might not be practical to debug the program locally during the production phase. Production bugs are also harder to fix, because developers usually cannot reproduce in a local environment the exact conditions under which the issues arise.

Production debugging starts with diagnosing the type of production bug and instrumenting the application with logs. The application's logging mechanism is then configured to send this information to a secure server for further inspection.

Classical Debugging vs. Remote Debugging

In classical debugging, the function you wish to debug runs on the same system as the debugger. That system can be your workstation or a network-accessible machine. Remote debugging, in contrast, is the process of troubleshooting a function running on a system that is only reachable via a network connection.

The idea behind remote debugging is to simplify the debugging of distributed system components. Essentially, it is the same as connecting directly to the server and starting a debugging session there. If you are a VS Code user, you will know how much of a life-saver its extensions are, and the VS Code remote debugging extensions are no exception. In IntelliJ IDEA, remote debugging is built in.
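
As a concrete illustration (assuming a JVM service; the port and jar name are placeholders), remote debugging typically means starting the app with a debug agent and attaching the IDE to the exposed port:

# Start the app with a JDWP agent listening on port 5005, without pausing startup
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005 -jar app.jar

Your IDE (IntelliJ IDEA, or VS Code with the Java extension) then connects to host:5005 as a remote debug session.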


Modern infrastructure challenges for Production Debugging

Modern infrastructure is more dispersed and consists of various mobile elements, making it more challenging to identify the issue and track the bug’s origin. The more complex the program, the more challenging it becomes to find a bug.

For example, let's consider serverless computing. The application is decomposed at its base level into specialized, managed functions hosted on infrastructure you do not control. Thus, it is nearly impossible for a developer to debug it under typical circumstances, since the program does not execute in a local environment at all.

Why debug in production?

If developers followed the best programming practices precisely, an application would have no flaws by the time it was released. In such an ideal situation, there would be no need for debugging at the production level. However, this is frequently not the case, because there are always minor problems and bugs that need fixing, making production debugging a continual and time-consuming process.

There are many reasons why we can't handle these issues locally. Some of them won't even occur in a local setup, and even when an issue can be reproduced locally, doing so is time-consuming and challenging. Production issues also have to be solved quickly, as customers are constantly engaging with the system. Therefore, the recommended approach is to debug in production.

Production debugging poses its own challenges, such as the risk of disturbing the app's performance while troubleshooting it. Moreover, making changes to a program while it is running might lead to unanticipated outcomes for users and interfere with their overall experience. You can overcome these potential troubleshooting issues with a debugging tool like Lightrun.

Don’t give up on production debugging just yet! You can take some approaches to make this process a lot easier.

7 essential tips for production debugging in any language

1. Stress test your code

Stress testing involves putting an application under extreme conditions to understand what will happen in the worst-case scenario. Testers create a variety of stress test scenarios to assess the resilience of software and fix any bugs. For example, stress testing shows how the system behaves when many users access the application simultaneously. Furthermore, it examines how the app manages simultaneous logins and monitors system performance while there is a lot of user traffic.

Stress testing can surface these problems before you make the program available to consumers, ensuring a positive user experience even during periods of high user traffic.
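
As a minimal sketch of a stress scenario (the URL and the numbers are placeholders), a classic tool like Apache Bench can hammer an endpoint with concurrent traffic and report latency percentiles:

# Fire 10,000 requests at the endpoint, 100 at a time
ab -n 10000 -c 100 https://staging.example.com/api/health

Ramping the concurrency up until errors or timeouts appear tells you where the worst-case ceiling actually is.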


2. Document external dependencies

The "README" file in the source repository must include a detailed description of each external dependency. This document will be helpful to anyone who works with the program in the future and needs to understand the resources required to operate it efficiently.

3. Define your debugging scope

This step attempts to pinpoint precisely where in the app the error occurred. By determining your scope beforehand, you avoid wasting time reviewing every single line of code of an application or troubleshooting irrelevant services.

Instead, you focus on a specific part of the app where the bug may be located. Finding a minor bug in 10,000 lines of code isn't feasible, so you should aim to hunt for bugs in the smallest possible scope.

4. Ensure all software components are running

Ensure that software components, including web servers, message servers, plugins, and database drivers, are functioning well and running the most recent updates before starting the debugging process. This ensures that no software elements are to blame for the errors. If all software components are functioning correctly, you may begin to investigate the problem by using logs.
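
As a rough sketch (the service name, port, and health endpoint are placeholders), a couple of quick liveness checks can rule out infrastructure problems before you dive into the code:

# Verify that the web server process is up
systemctl status nginx

# Verify that the application answers on its health endpoint
curl -sf http://localhost:8080/health || echo "app is not responding"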

5. Add a balanced number of logs

Logs can be inserted into code lines, making it simpler for developers to retrieve the information they need. They highlight the relevant context and statistical data that help developers anticipate and resolve issues quickly. They are especially beneficial if there is a large amount of code to read.


The entire codebase should have a suitable number of logs at all levels. The more logs there are, the more data developers get, and the easier it is to detect errors. However, there should be a balance, since excessive logging might overwhelm engineers with irrelevant data. Try to log the smallest portion of production that still gives you the context you need.

6. Invest in a robust debugging tool

By using instruction-set simulators rather than running a program directly on the processor, debuggers gain a greater degree of control over how it executes. This enables them to pause or terminate the program in response to particular conditions, and to display the exact location of an error when the target application crashes.

Tools like Lightrun eliminate the need for troubleshooting, redeploying, or redeveloping the app, resulting in faster debugging. No time is wasted as developers can add logs, analytics, and traces to the app in real-time while it is running. Most importantly, there will be no downtime. 

7. Avoid adding extra code

The ideal piece of code to add to a live application is none at all. Adding extra code can have even more significant repercussions than the bug you were trying to resolve in the first place, as you are modifying the app while customers are using it. Therefore, it should be treated as a last resort, and any added code should be carefully considered and strategically written for debugging purposes.

Production debugging doesn’t have to be a pain

It is neither possible nor feasible to deliver a bug-free program to users. However, developers should try to avoid these issues while being ready to handle them if necessary. Lightrun makes production debugging as easy as it can get by enabling you to add logs, metrics, and traces to your app in real-time, with no impact on performance. You can reduce MTTR by 60% and save time debugging to focus on what really matters: your code. Excited to try it out? Request a free demo!

The post 7 Must-Have Steps for Production Debugging in Any Language appeared first on Lightrun.
