7 Must-Have Steps for Production Debugging in Any Language

Debugging is an unavoidable part of software development, especially in production. You can often find yourself in “debugging hell,” where an enormous amount of debugging consumes all your time and keeps the project from progressing.

According to a report by the University of Cambridge, programmers spend almost 50% of their time debugging. So how can we make production debugging more effective and less time-consuming? This article will guide you through seven essential steps to optimize your production debugging process. 

What is Production Debugging?

Production debugging is the process of identifying the underlying cause of issues in an application running in a production environment. This type of debugging is often done remotely, as it might not be practical to debug the program locally during the production phase. Production bugs are more difficult to fix, as developers might not be able to reproduce the issue in a local environment when it arises.

Production debugging starts with diagnosing the type of production bug and instrumenting the application with logging. The application’s logging mechanism is configured to send information to a secure server for further inspection.

Classical Debugging vs. Remote Debugging

In classical debugging, the function you wish to debug runs on the same system as the debugger. That system can be your workstation or a network-accessible machine. Remote debugging is the process of troubleshooting an active function on a system that is reachable via a network connection.

The idea behind remote debugging is to simplify the debugging of distributed system components. Essentially, it is the same as connecting directly to the server and starting a debugging session there. If you are a VS Code user, you will know how much of a life-saver its extensions are. VS Code remote debugging extensions are no exception. In contrast, remote debugging in IntelliJ IDEA is built in.

Modern infrastructure challenges for Production Debugging

Modern infrastructure is more dispersed and consists of many moving parts, making it more challenging to identify an issue and track a bug’s origin. The more complex the program, the more challenging it becomes to find a bug.

For example, let’s consider serverless computing. The application is decomposed at the base level into specialized functions hosted on managed infrastructure. Thus, it is nearly impossible for a developer to perform a typical debugging procedure, since the program does not execute in a local environment.

Why debug in production?

If developers followed the best programming practices precisely, an application would have no flaws by the time it was released. In such an ideal situation, there would be no need for debugging at the production level. However, this is frequently not the case: there are always minor problems and bugs that need fixing, making production debugging a continual and time-consuming process.

There are many reasons why we can’t handle these issues locally. Some of them won’t even occur in a local setup, and even when an issue can be reproduced locally, doing so is time-consuming and challenging. Production issues also have to be solved quickly, because customers are actively engaging with the system. Therefore, the recommended approach is to debug in production.

Production debugging poses its own challenges, such as troubleshooting the app without degrading its performance. Moreover, making changes to a program while it is running might lead to unanticipated outcomes for users and interfere with their overall experience. You can overcome these potential troubleshooting issues with a debugging tool like Lightrun.

Don’t give up on production debugging just yet! You can take some approaches to make this process a lot easier.

7 essential tips for production debugging in any language

1. Stress test your code

Stress testing involves putting an application under extreme conditions to understand what will happen in the worst-case scenario. Testers create a variety of stress test scenarios to assess the resilience of software and fix any bugs. For example, stress testing shows how the system behaves when many users access the application simultaneously. Furthermore, it examines how the app manages simultaneous logins and monitors system performance while there is a lot of user traffic.

Stress testing may resolve these problems before you make the program available to consumers. It ensures a positive user experience even during periods of high user traffic.
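
Dedicated load-testing tools like JMeter or Gatling are the usual choice here, but as a minimal sketch of the idea, here is what a crude concurrency test might look like in plain Java (the endpoint URL and user count are hypothetical placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SimpleStressTest {
    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical staging endpoint -- never point a stress test at production
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://staging.example.com/login")).build();

        int concurrentUsers = 200; // simulate simultaneous logins
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(concurrentUsers);

        for (int i = 0; i < concurrentUsers; i++) {
            pool.submit(() -> {
                try {
                    HttpResponse<Void> response =
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                    if (response.statusCode() >= 500) {
                        failures.incrementAndGet();
                    }
                } catch (Exception e) {
                    failures.incrementAndGet(); // timeouts and connection errors count too
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("Failures under load: " + failures.get());
    }
}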

2. Document external dependencies

The README file in the source repository must include a detailed description of each external dependency. This document will be helpful to anyone who works with the program in the future and needs to understand the resources required to operate it efficiently.

3. Define your debugging scope

This step attempts to pinpoint precisely where in the app the error occurred. By determining your scope beforehand, you avoid wasting time reviewing every single line of code of an application or troubleshooting irrelevant services.

Instead, you focus on the specific part of the app where the bug may be located. Finding a minor bug in 10,000 lines of code isn’t feasible, so you should aim to hunt for bugs in the smallest possible scope.

4. Ensure all software components are running

Ensure that software components, including web servers, message servers, plugins, and database drivers, are functioning well and running the most recent updates before starting the debugging process. This ensures that no software elements are to blame for the errors. If all software components are functioning correctly, you may begin to investigate the problem by using logs.

5. Add a balanced number of logs

Logs can be inserted into code lines, making it simpler for developers to retrieve the information they need. They highlight the relevant context and statistical data that help developers anticipate and resolve issues quickly. They are especially beneficial if there is a large amount of code to read.

The entire code base should have a suitable number of logs at all levels. The more logs there are, the more data developers get, and the easier it is to detect errors. However, there should be a balance, since excessive logging might overwhelm engineers with irrelevant data. Aim to log the smallest portion of production activity that still provides the context you need.
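
As a minimal sketch of what that balance can look like (assuming SLF4J; the PaymentService class and its fields are hypothetical), log one concise, contextual line per business event, push detail down to DEBUG, and reserve ERROR for failures with full context:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void charge(String orderId, long amountCents) {
        // One concise, contextual log per business event -- not one per line of code
        log.info("charge started orderId={} amountCents={}", orderId, amountCents);
        try {
            // ... call the payment gateway ...
            log.debug("gateway accepted orderId={}", orderId); // detail only at DEBUG
        } catch (RuntimeException e) {
            // Errors carry the exception and enough context to reproduce
            log.error("charge failed orderId={}", orderId, e);
            throw e;
        }
    }
}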

6. Invest in a robust debugging tool

By using instruction-set simulators instead of running a program directly on the processor, debuggers gain a greater degree of control over how it executes. This enables them to pause or terminate the program in response to particular conditions. When the target application crashes, the debugger displays the location of the error.

Tools like Lightrun eliminate the need to redeploy or redevelop the app, resulting in faster debugging. No time is wasted, as developers can add logs, metrics, and traces to the app in real time while it is running. Most importantly, there will be no downtime.

7. Avoid adding extra code

The ideal piece of code to add to a live application is none at all. Adding extra code can have even more significant repercussions than the bug you were trying to resolve in the first place, as you are modifying the app while customers are using it. Therefore, it should be treated as a last resort, and any added code should be carefully considered and strategically written for debugging purposes.

Production debugging doesn’t have to be a pain

It is neither possible nor feasible to deliver a bug-free program to users. However, developers should try to avoid these issues while being ready to handle them if necessary. Lightrun makes production debugging as easy as it can get by enabling you to add logs, metrics, and traces to your app in real-time, with no impact on performance. You can reduce MTTR by 60% and save time debugging to focus on what really matters: your code. Excited to try it out? Request a free demo!

Debugging Java Equals and Hashcode Performance in Production

I wrote a lot about the performance of the equals and hashCode methods in this article. There are many nuances that can lead to performance problems in those methods. The problem is that some of those things can be well hidden.

To summarize the core problem: the hashCode method is central to the Java collections API, specifically to the performance of hash tables (such as the hash table behind the Map interface). The same is true for the equals method. If we have anything more complex than a String object or a primitive, the overhead can quickly grow.

But the main problem is nuanced behavior. The example given in the article is the Java SE URL class. The API for that class specifies that both of the following comparisons of distinct objects evaluate to true:

new URL("http://127.0.0.1/").equals(new URL("http://localhost/")); 

new URL("http://127.0.0.1/").hashCode() == new URL("http://localhost/").hashCode(); 

This is a bug in the specification. Notice that this applies to all domains, so a DNS lookup is necessary to compute a hash or test equality. That can be very expensive.

TIP: equals/hashCode must be very efficient for use in key-value containers such as maps and other hash-based collections
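
As a minimal sketch of what an efficient pair looks like (UserKey is a hypothetical class for illustration): compare only cheap, locally available fields, fail fast on the cheapest one, and never touch I/O or external state the way URL does:

import java.util.Objects;

public final class UserKey {
    private final long tenantId;
    private final String username;

    public UserKey(long tenantId, String username) {
        this.tenantId = tenantId;
        this.username = username;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserKey)) return false;
        UserKey other = (UserKey) o;
        // Cheapest comparison first so mismatches fail fast
        return tenantId == other.tenantId
                && Objects.equals(username, other.username);
    }

    @Override
    public int hashCode() {
        return Objects.hash(tenantId, username); // no I/O, no external lookups
    }
}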

There are so many pitfalls with these methods. Some of them only become obvious at scale. E.g. a friend showed me a method that compared objects and depended on an external list of potential values. It performed great locally but slowly in production, where the list had more elements.

How can you tell if a hash function is slow in production?

How do you even find out that it’s the fault of the hash function?

Measuring Performance

For most intents and purposes, we wouldn’t know the problem is in the equals method or hash code. We would need to narrow the problem down. It’s likely that a server process would take longer than we would expect and possibly show up on the APM.

What we would see on the APM is the slow performance of a web service. We can narrow that down by using the metrics tools provided by Lightrun.

Note: Before we proceed, I assume you’re familiar with the basics of Lightrun and have it installed. If not, please check out this introduction.

[Image: Measuring performance with Lightrun's metrics]

Lightrun includes the ability to set several metric types:

  • A counter, which counts the number of times a specific line of code is reached
  • A time measure (tictoc), which measures the performance of a specific code block
  • A method duration, which is the same as a tictoc for a full method
  • A custom metric, a measurement based on a custom expression

Notice that you can use conditions on all metrics. If performance overhead impacts a specific user, you can limit the measurement only to that specific user.

We can now use these tools to narrow down performance problems and find the root cause, e.g. here I can check if these two lines in the method are at fault:

[Image: Configuring a Lightrun metric]

Adding this tictoc provides us with periodic printouts like this:

INFO: 13 Feb 2022, 14:50:06 TicToc Stats::
{
  "VetListBlock": {
    "breakpointId": "fc27d745-b394-400e-83ee-70d7644272f3",
    "count": 33,
    "max": 32,
    "mean": 4.971277332041485,
    "min": 1,
    "name": "VetListBlock",
    "stddev": 5.908043099655046,
    "timestamp": 1644756606939
  }
}

You can review these printouts to get a sense of the overhead incurred by these lines. You can also use the counter to see the frequency at which we invoke a method.

NOTE: You can pipe these results to Prometheus/Grafana for better visualization, but that requires some configuration that’s outside of the scope of this tutorial.

If you see a collection or map as the main performance penalty in the application, it’s very possible that a wayward hash code or equals method are at fault. At this point, you can use metrics in the method itself to gauge its overhead.

This is very similar to the way we would often debug these things locally: surround a suspect area with measurements and rerun the test. Unfortunately, that approach is slow, as it requires recompiling and redeploying the app, and it’s impractical in production. With this approach, we can review all the “suspicious” areas and narrow them down quickly.

Furthermore, we can do that on a set of servers using the tag feature. In this way we can scale our measurements as we scale our servers.

Checking Thread Safety

Mutable objects can be changed from multiple threads while we try to debug them. This might trigger problems that appear to be performance issues. By verifying that we have single thread access, we can also reduce synchronization in critical sections.

E.g. in a key-value store, if a separate thread mutates the key, the store might get corrupted.

The simplest way to do this is to log the current thread using the expression Current thread is: {Thread.currentThread().getName()}:

[Image: Conditional logging with Lightrun]

The problem is that a log like this can produce output that’s hard to follow; you might see hundreds of printouts. So once we find out the name of the thread, we can add a condition:

!Thread.currentThread().getName().equals("threadName")

This will only log access from different threads. This is something I discussed in my previous post here.
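
For reference, here’s a rough local analogue of that conditional log (the ThreadGuard class is hypothetical): record the first thread that touches an object and flag access from any other thread.

public class ThreadGuard {
    private volatile String ownerThread;

    public void checkAccess() {
        String current = Thread.currentThread().getName();
        if (ownerThread == null) {
            ownerThread = current; // benign race: first access wins, good enough for diagnostics
        } else if (!ownerThread.equals(current)) {
            System.err.println("Cross-thread access detected: owner="
                    + ownerThread + " current=" + current);
        }
    }
}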

TL;DR

The performance of the equals and hashCode methods in Java SE is crucial. They have a wide-reaching impact on the Java collections API, especially in key-value related calls. Objects must implement them efficiently, but it’s often hard to determine the Java class that’s at fault.

We can use Lightrun metrics to time arbitrary methods in production; sign up for Lightrun if you want to try it. It’s important to measure class performance in the “real world” environment, which might differ from our local test cases. Objects can behave radically differently with production-scale data, and minor changes to a class can make a big difference.

We can narrow down the overhead of hashes, use logs to determine threading issues, and use counters to determine usage of an API.

Modernize Legacy Code in Production – Rebuild your Airplane Midflight without Crashing

I spent over a decade as a consultant working for dozens of companies in many fields and pursuits. The diversity of each code base is tremendous. This article will try to define general rules for modernizing legacy code that would hopefully apply to all. But it comes from the angle of a Java developer.

When writing this, my primary focus is on updating old Java 6-era J2EE code to more modern Spring Boot/Jakarta EE code. However, I don’t want to go into specific code; I’ll try to keep this generic. I discuss COBOL and similar legacy systems too. Most of the overarching guidelines should work for migrating any other type of codebase as well.

Rewriting a project mostly isn’t an immense challenge. However, doing it while users are actively banging against the existing system, without service disruption? That requires a lot of planning and coordination.

Why Modernize?

I don’t think we should update projects for the sake of the “latest and greatest”. There’s a reason common legacy systems like COBOL are still used. Valuable code doesn’t lose its shine just because of age. There’s a lot to be said for “code that works”. Especially if it was built by hundreds of developers decades ago. There’s a lot of hidden business logic knowledge in there…

However, maintenance can often become the bottleneck. You might need to add features making the process untenable. It’s hard to find something in millions of lines of code. The ability to leverage newer capabilities might be the final deciding factor. It might be possible to create a similar project without the same complexities, thanks to newer frameworks and tools.

We shouldn’t take the decision to overhaul existing production code lightly. You need to create a plan, evaluate the risks, and have a way to back out.

Other reasons include security, scalability, end of life to systems we rely on, lack of skilled engineers, etc.

You usually shouldn’t migrate for better tooling alone, but better observability, orchestration, etc., are a tremendous benefit.

Modernization gives you the opportunity to rethink the original system design. However, this is a risky proposition, as it makes it pretty easy to introduce subtle behavioral differences.

Challenges

Before we head to preparations, there are several deep challenges we need to review and mitigate.

Access to Legacy Source Code

Sometimes, the source code of the legacy codebase is no longer workable. This might mean we can’t add even basic features/functionality to the original project. This can happen because of many reasons (legal or technical) and would make migration harder. Unfamiliar code is an immense problem and would make the migration challenging, although possible.

It’s very common to expose internal calls in the legacy system to enable smooth migration. E.g. we can provide fallback capabilities by checking against the legacy system. An old product I worked on had custom in-house authentication. To keep compatibility during migration, we used a dedicated web service. If user authentication failed on the current server, the system checked against the old server to provide a “seamless” experience.

This is important during the migration phase but can’t always work. If we don’t have access to the legacy code, tools such as scraping might be the only recourse to get perfect backwards compatibility during the migration period.

Sometimes, the source is no longer available or was lost. This makes preparation harder.

Inability to Isolate Legacy System

In order to analyze the legacy system, we need the ability to run it in isolation so we can test it and verify its behaviors. This is a common and important practice, but isn’t always possible.

E.g. a COBOL codebase running on dedicated hardware or operating system. It might be difficult to isolate such an environment.

This is probably the biggest problem/challenge you can face. Sometimes an external contractor with domain expertise can help here. If this is the case, it’s worth every penny!

Another workaround is to set up a tenant for testing. E.g. if a system manages payroll, set up a fake employee and perform the tasks discussed below against production. This is dangerous and far from ideal; take this route only if no other option exists.

Odd Formats and Custom Stores

Some legacy systems might rely on deeply historical approaches to coding. A great example is COBOL, where numbers are stored in a decimal form close to BCD (Java’s BigDecimal is the closest analogue). This isn’t bad; for financial systems, it is actually the right way to go. But it might introduce incompatibilities when processing numeric data that prevent the systems from running in parallel.

Worse, COBOL has a complex file storage solution that isn’t a standard SQL database. Moving away from something like that (or even some niche newer systems) can be challenging. Thankfully, there are solutions, but they might limit the practicality of running both the legacy and new product in parallel.

Preparation

Before we even consider an endeavor of this type, we need to evaluate and prepare for the migration. The migration will be painful regardless of what you do, but this stage lets you shrink the size of the band-aid you need to pull off.

There are many general rules and setups you need to follow before undergoing a code migration. Each one of these is something you need to be deeply familiar with.

Feature Extraction

When we have a long-running legacy system, it’s almost impossible to keep track of every feature it has and the role each one plays in the final product. There are documents, but they are hard to read and review. Issue trackers are great for follow-up, but they aren’t great maps.

Discovering the features in the system and the ones that are “actually used” is problematic. Especially when we want to focus on minutiae. We want every small detail. This isn’t always possible but if you can use observability tools to indicate what’s used it would help very much. Migrating something that isn’t used is frustrating and we’d want to avoid it, if possible.

This isn’t always practical, as most observability tools that provide very fine-grained details are designed for newer platforms (e.g. Java, Python, Node, etc.). But if you have such a platform, such as an old J2EE project, using a tool like Lightrun and placing a counter on a specific line can tell you what’s used and what probably isn’t. I discuss this further below.

I often use a spreadsheet where we list each feature and minor behavior. These spreadsheets can be huge and we might divide them based on sub modules. This is a process that can take weeks. Going over the code, documentation and usage. Then iterating with users of the application to verify that we didn’t miss an important feature.

Cutting corners is easy at this stage. You might pay for them later. There were times I assigned this requirement to a junior software developer without properly reviewing the output. I ended up regretting that, as there were cases where we missed nuances within the documentation or code.

Compliance Tests

This is the most important aspect of a migration. While unit tests are good, compliance and integration tests are crucial for the migration process.

We need feature extraction for compliance. We need to go over every feature and behavior of the legacy system and write a generic test that verifies this behavior. This is important both to verify our understanding of the code and that the documentation is indeed correct.

Once we have compliance tests that verify the existing legacy system, we can use them to test the compatibility of the new codebase.

The fundamental challenge is writing code that you can run against two completely different systems. E.g. If you intend to change the user interface, adapting these tests would be challenging.

I would recommend writing the tests using an external tool, maybe even using different programming languages. This encourages you to think of external interfaces instead of language/platform specific issues. It also helps in discovering “weird” issues. E.g. We had some incompatibilities because of minute differences in the HTTP protocol implementation between a new and legacy system.

I also suggest using a completely separate “thin” adapter for the UI differences. The tests themselves must be identical when running against the legacy and the current codebase.
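
Whatever language you pick for the tests, the shape is the same. As a minimal sketch (the endpoint, expected payload, and URLs are hypothetical), parameterize the test by a base URL so the identical assertions run against both systems:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ComplianceCheck {
    public static void main(String[] args) throws Exception {
        // The target system is selected only by a base URL; the assertions never change
        String baseUrl = System.getProperty("target.url", "http://legacy.example.com");
        HttpClient client = HttpClient.newHttpClient();

        HttpRequest request = HttpRequest.newBuilder(
                URI.create(baseUrl + "/api/orders/42")).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Identical expectations for the legacy and the new system
        if (response.statusCode() != 200 || !response.body().contains("\"orderId\":42")) {
            throw new AssertionError("Behavior differs on " + baseUrl + ": " + response.body());
        }
        System.out.println("Compliance check passed against " + baseUrl);
    }
}

Running it with -Dtarget.url pointed first at the legacy deployment and then at the new one verifies both against the same expectations.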

The process we take for test authoring is to open an issue within the issue tracker for every feature/behavior in the spreadsheet from the previous step. Once this is done, we color the spreadsheet row in yellow.

Once we integrate a test and the issue is closed, we color the row green.

Notice that we still need to test elements in isolation with unit tests. The compliance tests help verify compatibility. Unit tests check quality and also complete much faster, which is important to productivity.

Code Coverage

Code coverage tools might not be available for your legacy system. However, if they are, you need to use them.

One of the best ways to verify that your compliance tests are extensive enough is through these tools. You need to do code reviews on every coverage report. We should validate every line or statement that isn’t covered to make sure there’s no hidden functionality that we missed.

Recording and Backup

If it’s possible, record network requests to the current server for testing. You can use a backup of the current database and the recorded requests to create an integration test of “real world usage” for the new version. Use live data as much as possible during development to prevent surprises in production.

This might not be tenable. Your live database might be access restricted or it might be too big for usage during development. There’s obviously privacy and security issues related to recording network traffic, so this is only applicable when it can actually be done.

Scale

One of the great things about migrating an existing project is that we have a perfect sense of scale. We know the traffic. We know the amount of data and we understand the business constraints.

What we don’t know is whether the new system can handle the peak load throughput we require. We need to extract these details and create stress tests for the critical portions of the system. We need to verify performance, ideally comparing it to the legacy system to make sure we aren’t going backwards in terms of performance.

Targets

Which parts should we migrate and in what way?

What should we target first and how should we prioritize this work?

Authentication and Authorization

Many older systems embed the authorization modules as part of a monolith process. This will make your migration challenging regardless of the strategy you take. Migration is also a great opportunity to refresh these old concepts and introduce a more secure/scalable approach for authorization.

A common strategy in cases like this is to send a user to “sign up again” or “migrate their accounts” when they need to use the new system. This is a tedious process for users and will trigger a lot of support issues e.g. “I tried password reset and it didn’t work”. These sorts of failures can happen when a user in the old system didn’t perform the migration and tries to reset the password on the new system. There are workarounds such as explicitly detecting a specific case such as this and redirecting to the “migration process” seamlessly. But friction is to be expected at this point.

However, the benefit of separating authentication and authorization will help with future migrations and modularity. User details in the shared database are normally one of the hardest things to migrate.

Database

When dealing with the legacy system, we can implement the new version on top of the existing database. This is a common approach and has some advantages:

  • Instant migration – this is probably the biggest advantage. All the data is already in the new system with zero downtime
  • Simple – this is probably one of the easiest approaches to migration and you can use existing “real world” data to test the new system before going live

There are also a few serious disadvantages:

  • Data pollution – the new system might insert problematic data and break the legacy system, making reverting impossible. If you intend to provide a staged migration where both the old and new systems are running in parallel, this might be an issue
  • Cache issues – if both systems run in parallel, caching might cause them to behave inconsistently
  • Persisting limits – this carries over limitations of the old system into the new system

If the storage system is modern enough and powerful enough, the approach of migrating the data in this way makes sense. It removes, or at least postpones, a problematic part of the migration process.

Caching

The following three tips are at the root of application performance. If you get them right, your apps will be fast:

  1. Caching
  2. Caching
  3. Caching

That’s it. Yet very few developers use enough caching. That’s because proper caching can be very complicated and can break the single source of knowledge principle. It also makes migrations challenging, as mentioned in the above section.

Disabling caching during migration might not be a realistic option, but reducing retention might mitigate some issues.

Strategy

There are several ways we can address a large-scale migration. We can look at the “big picture” in a migration e.g. Monolith to Microservices. But more often than not, there are more nuanced distinctions during the process.

I’ll skip the obvious “complete rewrite” where we instantly replace the old product with the new one. I think it’s pretty obvious and we all understand the risks/implications.

Module by Module

If you can pick this strategy and slowly replace individual pieces of the legacy code with new modules, then this is the ideal way to go. This is also one of the biggest selling points behind microservices.

This approach can work well if there’s still a team that manages and updates the legacy code. If one doesn’t exist, you might have a serious problem with this approach.

Concurrent Deployment

This can work for a shared database deployment. We can deploy the new product to a separate server, with both products using the same database as mentioned above. There are many challenges with this approach, but I picked it often, as it’s probably the simplest one to start with.

Since the old product is still available, there’s a mitigation workaround for existing users. It’s often recommended to plan downtime for the legacy servers to force existing users to migrate. Otherwise, in this scenario, you might end up with users who refuse to move to the new product.

Hidden Deployment

In this strategy, we hide the existing product from the public and set up the new system in its place. In order to ease migration, the new product queries the old product for missing information.

E.g. if a user tried to login and didn’t register in the system, the code can query the legacy system to migrate the user seamlessly. This is challenging and ideally requires some changes to the legacy code.

The enormous benefit is that we can migrate the database while keeping compatibility and without migrating all the data in one fell swoop.

A major downside is that this might perpetuate the legacy code’s existence and, as a result, work against our development goals.

Implementation

You finished writing the code and are ready to pull the trigger on the migration… Now we need to notify the users that the migration is going to take place. You don’t want an angry customer complaining that something suddenly stopped working.

Rehearsal

If possible, perform a dry run and prepare a script for the migration process. When I say script, I don’t mean code. I mean a script of responsibilities and tasks that need to be performed.

We need to verify that everything works as the migration completes. If something is broken, there needs to be a script to undo everything. You’re better off retreating and redeploying another day. I’d rather have a migration that fails early, which we can “walk back” from, than something “half-baked” in production.

Who?

In my opinion you should use a smaller team for the actual deployment of the migrated software. Too many people can create confusion. You need the following personnel on board:

  • IT/OPS – to handle the deployment and reverting if necessary
  • Support – to field user questions and issues. Raise flags in case a user reports a critical error
  • Developers – to figure out if there are deployment issues related to the code
  • Manager – we need someone with instant decision-making authority. No one wants to pull a deployment. We need someone who understands what’s at stake for the company

There’s a tendency to make a code fix on the spot to push the migration through. This works OK for smaller startups, and I’m pretty guilty of that myself. But if you’re working at scale, there’s no way to do it. A code change made “on the spot” hasn’t passed the tests and might introduce terrible problems. It’s probably a bad idea.

When?

The axiom “don’t deploy on a Friday” might be a mistake for this case. I find Fridays are a great migration period when I’m willing to sacrifice a weekend. Obviously I’m not advocating forcing people to work the weekend. But if there’s interest in doing this (in exchange for vacation time) then low traffic days are the ideal time for making major changes.

If you work in multiple time zones, developers in the least active time zone might be best to handle the migration. I would suggest having teams in all time zones to keep track of any possible fallout.

Agility in these situations is crucial. Responding to changes quickly can make the difference between reverting a deployment and soldiering on.

Staged Rollout

With small updates, we can stage our releases and push the update to a subset of users. Unfortunately, when we do a major change, I find it more of a hindrance. The source of errors becomes harder to distinguish if you have both systems running. Both systems need to run concurrently, and it might cause additional friction.

Post Migration

A couple of weeks had passed, things calmed down, and the migration worked. Eventually.

Now what?

Retirement Plan

As part of the migration, we brought with us a large set of features from legacy. We probably need some of them, while some others might not be necessary. After finishing the deployment, we need to decide on the retirement plan. Which features that came from legacy should be retired and how?

We can easily see if we use a specific method or if it’s unused in the code. But are the users using a specific line of code? A specific feature?

For that, we have observability.

We can go back to the feature extraction spreadsheet and review every potential feature, then use observability systems to see how many users invoke each one. We can easily do that with tools like Lightrun by placing a counter metric in the code (you can create a Lightrun account here). Based on that information, we can start narrowing the scope of features used by the product. I discussed this before, so this step might be redundant if you already collected this data while the legacy system was running.

Even more important is the retirement of the running legacy. If you chose a migration path in which the legacy implementation is still running, this is the time to decide when to pull the plug. Besides costs, the security and maintenance problems make this impractical for the long run. A common strategy is to shut down the legacy system periodically for an hour to detect dependencies/usage we might not be aware of.

Tools such as network monitors can also help gauge the level of usage. If you have the ability to edit the legacy or a proxy into the legacy this is the time to collect data about the usage. Detect the users that still depend on that and plan the email campaign/process for moving them on.

Use Tooling to Avoid Future Legacy

A modern system can enjoy many of the newer capabilities at our disposal. CI/CD processes include sophisticated linters that detect security issues, bugs and perform reviews that are far superior to their human counterparts. A code quality tool can make a vast difference to the maintainability of a project.

Your product needs to leverage these new tools so it won’t deteriorate back to legacy code status. Security patches get delivered “seamlessly” as pull requests. Changes get implicit reviews to eliminate common mistakes. This enables easier long-term maintenance.

Maintaining the Compliance Testing

After the migration process, people often discard the compliance tests. It makes sense to convert them to integration tests if possible and necessary, but if you already have integration tests, they might be redundant and harder to maintain than your standard tests.

The same is true for the feature extraction spreadsheet. It’s not something that’s maintainable and is only a tool for the migration period. Once we’re done with that, we should discard it and we shouldn’t consider it as authoritative.

Finally

Migrating old code is always a challenge, and agile practices are crucial when taking on this endeavor. There are so many pitfalls in the process and so many points of failure. This is especially true when the system is in production and the migration is significant. I hope this list of tips and approaches will help guide your development efforts.

I think pain in this process is unavoidable. So is some failure. Our engineering teams must be agile and responsive to such cases: detect potential issues and address them quickly during the process. There’s a lot more I could say about this, but I want to keep it general enough so it will apply to wider cases. If you have thoughts on this, reach out to me on Twitter – @debugagent – to let me know.

16 Top Java Communities, Forums and Groups: The Ultimate Guide

Developers didn’t need the transformation to WFH to understand the value of online communities and forums for their professional development and work. Many developer groups and forums offer supporting professional communities that assist with questions, dilemmas and advice. In fact, there are so many developer groups, it might be hard to choose. We’ve prepared a guide, which maps out the most active and helpful Java groups, Java forums and Java communities.

Java is a strong, popular and reliable programming language, used widely around the world by some 9,000,000 developers. Java supports multiple platforms, new architectures as well as older legacy systems, and enables developing many strong features. Whether you’re a Java newbie or a seasoned professional with 20 years of experience, you can benefit from Java user groups and forums.

We created a whitepaper with the top 16 Java forums, communities, and groups. Most of them cover a wide array of topics: from programming questions to discussions to job opportunities. Engaging in them can help your professional advancement, both by asking questions and by giving advice. They are also a place for networking and building your own personal brand.

Before posting in each, please make sure you check the appropriate type of content and questions on each community. Follow each community’s guidelines when joining them.

Here are the first three communities. You can get the full list by getting the whitepaper, here.

1. Java Tagged Questions on Stack Overflow

Stack Overflow is still the undisputed leader in developer forums and help. For anything Java related search for the questions tagged as Java, or just go here. You can ask and answer anything that has to do with learning and understanding Java itself. This is the place for inserting snippets of code and asking deep-tech questions.

We highly recommend searching for answers before posting and really making an effort to solve your problem on your own; otherwise, community members will not hesitate to downvote you. Don’t get discouraged and don’t take offense if they do; try to read the feedback and interact with the other developers. When wielded correctly, Stack Overflow can be a very powerful tool in your arsenal.

2. Java News, Tech and Discussions – Reddit

This Reddit community hosts dozens of questions, comments and discussions every day, and the members are engaged in the conversations. The purpose of this Java community is to discuss Java and share technical content. It’s not the place for beginner tutorials or programming questions, but rather a mature discussion and the sharing of knowledge and insights. Are you looking to learn Java? Take a look at the next community.

3. Java Help, Tutorials and Questions – Reddit

Now this Java community is the one for asking for help with your Java code. With multiple posts and responses, this vibrant community is the one to join if you need help, and, just as important, for you to provide help to your fellow developers.

Get more recommended communities and groups on Slack, meetup.com, online forums, dev.to and more, from our whitepaper. There are 13 more communities for you to discover. Get the whitepaper here.

The Cost of Production Blindness

When I speak at conferences, I often fall back to the fact that just a couple of decades ago we’d observe production by kicking the server. This is obviously no longer practical. We can’t see our production. It’s an amorphous cloud that we can’t touch or feel. A power that we read about but don’t fully grasp.

[Image: In this case, we have physical evidence that the cloud is there.]

A part of this major shift in our industry is a change to our fundamental roles as engineers. DevOps and SRE are roles that didn’t exist back then. Yet today, they’re often essential for major businesses. They brought with them tremendous advancements to the reliability of production, but they also brought with them a cost: distance.

Production is in the amorphous cloud, which is accessible everywhere. Yet it’s never been further away from the people who wrote the software powering it. We no longer have the fundamental insight we took for granted a bit over a decade ago.

Is That So Bad?

Yes, and no. We gave up some insight and control and got a lot in return:

  • Stability
  • Simplicity
  • Security

These are pretty incredible benefits, and we don’t want to give them up. But we also lost some insight: debugging became harder and complexity rose. We discussed these problems before, but today I want to talk about one impact only…

Cost

This is a form of blindness.

I wrote a lot about the impact of this situation on the reliability of our cloud deployments. But today I want to talk about the financial and environmental costs. Initially, the cloud was billed as a cost-saving measure, and there was some truth to that. The agility of deployment let us cut down on hardware costs, consolidate, and simplify.

But as we got used to the cloud, our appetite for scale/reliability grew. We ended up simplifying deployment to such an extent that launching a container can be accomplished seamlessly, with no interaction on our part. This is enormous progress but also troubling. We slowly lose grip on our costs and end up paying more for less.

So what’s the solution?

APMs are a category that rose to prominence specifically around this problem. Today, they are more important than ever. They help us get a sense of the Pareto principle (80/20 rule) so we can focus optimizations on the specific areas that cost the most.

This is a powerful and important tool that DevOps use every day, but it’s also a very limited one.

Before we proceed, I’d like to take a moment to discuss the concept of cost. The most obvious impact is on our monthly cloud provider bill; that’s money that might fund a department. But there’s a more important cost in my humble opinion: the environmental cost. We tend to ignore the electricity spend because it’s very amorphous. But this cost is severe, e.g. the cost of a single cloud instance over one year can be the equivalent of a transatlantic flight.

We don’t see the underlying hardware, but it’s there, and it carries a carbon footprint. By optimizing, we can affect both costs significantly.

Observing Production Effectively

APMs are great for measuring performance at a high level. But they provide very little detail about the dynamic inner workings of the application and the cost-cutting measures we can take inside. I often liken them to the bat signal or check engine light. They notify us of a problem but leave us without the tool to inspect the details.

That’s where developer observability tools can fill the gap. These tools provide low-level, actionable insights into the application. They verify assumptions and provide developers with the means to truly understand production.

Instead of discussing the theory, let’s give some examples of actions you can take today with developer observability tools to reduce the costs of your production.

Reduce Logs

Log ingestion is probably the most expensive feature in your application. Removing a single line of log code can end up saving thousands of dollars in ingestion and storage costs. We tend to overlog since the alternative is production issues that we can’t trace to their root cause.

We need a middle ground. We want the ability to follow an issue through without overlogging. Developer observability lets you add logs dynamically as needed into production. This frees you from the need to overlog and lets you focus on logging a reasonable amount. You can also raise the log level to keep the logs down. I wrote about this in depth here.
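
As a small illustration with java.util.logging (the logger name is a hypothetical placeholder), raising the logger’s level suppresses chatty records before they are ever written, let alone ingested:

import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelDemo {
    private static final Logger log = Logger.getLogger("com.example.checkout");

    public static void main(String[] args) {
        log.setLevel(Level.WARNING);   // production default: warnings and errors only
        log.fine("cart recalculated"); // suppressed -- never ingested or stored
        log.warning("payment retry");  // still recorded
    }
}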

Caching

My top three tips for performance have always been:

  1. Caching
  2. Caching
  3. Caching

There’s really nothing else. It all boils down to that. Unfortunately, cache misses are notoriously hard to tune and detect. This is an even bigger issue in production where we need to account for the changing landscape. E.g. we cache up to 10 friends of a user on a social network but in production the growth team encourages friendships and users have more friends…

You’d have cache misses more often and you wouldn’t even know.

Placing conditional breakpoints or temporary conditional logs on cache misses and inspecting them can go a long way to detect subtle issues like that. This can make an order-of-magnitude difference to performance when done right.
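
To make this concrete, here’s a rough sketch of the friends-cache example, assuming a simple Map-based cache (the FriendCache class and the DB call are hypothetical; with a tool like Lightrun, the equivalent conditional log is added at runtime without redeploying):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FriendCache {
    private final Map<Long, List<Long>> cache = new ConcurrentHashMap<>();

    public List<Long> friendsOf(long userId) {
        List<Long> friends = cache.get(userId);
        if (friends == null) {
            // Log only on the miss path, so normal traffic stays quiet
            System.err.println("cache miss userId=" + userId);
            friends = loadFriendsFromDb(userId);
            cache.put(userId, friends);
        }
        return friends;
    }

    private List<Long> loadFriendsFromDb(long userId) {
        return List.of(); // stand-in for the real database call
    }
}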

However, there’s a bigger payout here. Many developers just ignore L2 caches entirely. This is understandable. They are hard to maintain and debug. Especially in production. A single cache corruption or a value that’s out of sync and you end up with a major bug. The problem is that debugging these things in production environments is essential. Cache behaves radically differently in production because of its distributed nature.

We built developer observability solutions to debug these exact types of problems. By placing snapshots and logs over cache population/invalidation, we can narrow down the point of corruption and fix cache relation issues. By deploying these solutions to your production server, overhead can be reduced significantly!

Micro Benchmarks

APMs provide us with high-level numbers on performance and a general direction. They don’t provide the lines of code we need to address. That’s left up to our guesswork. If the system behaves identically when it’s running locally, this should be fine. Unfortunately, this is rarely the case. E.g. a database query can have a significantly different impact when running in production. Based on local profiling results, you might waste your energy on the wrong optimization.

Developer observability tools provide the ability to narrow down the performance overhead of a code snippet. This lets us follow through the web service stack and narrow down the actual lines that are taking the most CPU time. We can accomplish this by adding a tictoc metric that measures the time between the tic line and the toc line.

We can mark a block of code and get statistics about its execution time. As in the common case of a specific query taking longer in production, we can quickly prove that this is the cause of the performance problem using this tool. The impact of many “small” issues like this can be significant in a large system and can easily mean the difference between scaling and a bottleneck.
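
As a rough local analogue of that tictoc metric (the expensiveQuery method is a hypothetical stand-in for the real production call):

import java.util.List;

public class QueryTiming {
    public static void main(String[] args) {
        long tic = System.nanoTime();            // "tic": start of the measured block
        List<Integer> result = expensiveQuery(); // the suspect lines
        long tocMillis = (System.nanoTime() - tic) / 1_000_000; // "toc": end of the block
        System.out.println("expensiveQuery returned " + result.size()
                + " rows in " + tocMillis + " ms");
    }

    private static List<Integer> expensiveQuery() {
        return List.of(1, 2, 3); // stand-in for the real production query
    }
}

The difference with a developer observability tool is that the measurement is added to the running process and aggregated (count, min, max, mean, stddev) without touching the code.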

Verification and Dead Code

A common problem is under-utilized resources. APMs expose some of those problems, but not all of them. When we have dead code, its impact on our bottom line can be significant.

How many times did you refactor code or stop yourself from refactoring because of a legacy mess you didn’t want to touch?

Yes, that legacy mess is used in your code, so you don’t want to “risk it”. If you end up changing the code, you need to walk on eggshells, and the entire operation can take an order of magnitude longer. This maps to cost, since our time is valuable and could be spent optimizing. It also often blocks major optimizations.

But what if that block of code isn’t used by anyone in production?

What if it’s used by very few people?

That’s exactly what the counter metric does. It counts the number of times a line was reached. It can tell us which methods are important to us and how frequently they’re reached. You wouldn’t be as concerned about a refactor if only three people reach that line of code…
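
A hand-rolled equivalent of that counter might look like the sketch below (the legacyDiscountPath method is hypothetical); the point of a tool like Lightrun is that you get this count without modifying or redeploying the code:

import java.util.concurrent.atomic.AtomicLong;

public class UsageCounter {
    private static final AtomicLong LEGACY_PATH_HITS = new AtomicLong();

    static void legacyDiscountPath() {
        LEGACY_PATH_HITS.incrementAndGet(); // count how often the suspect line runs
        // ... existing legacy logic ...
    }

    public static long hits() {
        return LEGACY_PATH_HITS.get();
    }
}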

Finally

I could carry on with the discussion of these techniques, but the gist is simple: we need to “see” what’s going on. As developers, we’re given a task to build a product. But the tools that let us peer into production aren’t as capable as our local tools. The results we get from production can be very misleading.

As we scale production deployments, we need to use a new class of tools that exposes our code in this way. I can classify modern production with one word:

DREAD.

This deep binding fear that we all feel when we push a major change into production. People lose their jobs by pushing bad stuff to production. That’s scary!

What do we do when facing such dread?

We keep going, but carefully. We step lightly and don’t take big risks. Is our code wasteful?

Maybe, but the risk of bringing down production is far scarier than the benefit of shaving some expenses to the company.

Developer observability is the light within this darkness. When you shine a light in the dark, you take away some of the fear and make production more approachable. We can measure, test, and move fast. We also have a better sense of the risks we’ll face with upcoming changes. The tooling also gives us a sense of the upside: how much can we save? Imagine saving the cost of your entire department in cloud expenses. That’s job security right there… the best way to fight that fear of risky changes.
