101 Open-source Troubleshooting Steps

Sep 25, 2023

A detailed walkthrough of how to find the root cause of exceptions and resolve basic to advanced issues encountered while working with open-source projects such as Hive, Spark, Kafka, Hadoop…

Resolving issues with proper root cause analysis has always been part of the daily firefighting routine for SREs, developers, platform admins, data engineers, and architects. It is usually a straightforward exercise, but in the classic cases it turns into a new discovery or an unplanned hackathon before a stable solution is reached. I always wondered whether there could be a designated SOP that everyone could follow to reach some plot points and move the investigation forward, since the work always comes down to finding evidence and clues, much like solving a murder mystery to find the lawbreaker. From here on, we’ll follow the approach in that fashion.

1st Step — Identify Scene dimensions

Initial steps to take once the crime/error is encountered:

What are the dimensions of the crime scene, what is the impact/severity of the issue, and can it be contained immediately with a known workaround?
List the suspects: has anything changed recently, and has any unknown activity been observed?
Note down the sequence of steps that led to the error.
Collect as much evidence as you can, such as error traces, log messages, heap dumps, thread dumps, and GC dumps, to help further investigation.

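# Heap dump
jmap -dump:format=b,file=heap.hprof <pid>

# Thread (stack) dump; -F and -m are JDK 8 options and may not exist on newer JDKs
jstack [-F] [-l] [-m] <pid>

# Java Flight Recorder (JFR) recording for 300 seconds; compress it once the recording has finished
jcmd <pid> JFR.start duration=300s filename=<file_name>.jfr
tar -czvf <file_name>.jfr.tar.gz <file_name>.jfr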
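
When the same evidence has to be collected on several hosts, the commands above can be wrapped in a small script. The following is a minimal Python sketch, not a definitive tool: the function names, the output file names (heap.hprof, threads.txt, recording.jfr) and the output directory are assumptions made for illustration, and it assumes jmap, jstack and jcmd are on the PATH.

```python
import subprocess


def build_evidence_commands(pid, out_dir, jfr_seconds=300):
    """Build (but do not run) the JDK commands for each kind of evidence.

    File names under out_dir are illustrative assumptions.
    """
    return {
        "heap_dump": ["jmap", f"-dump:format=b,file={out_dir}/heap.hprof", str(pid)],
        "thread_dump": ["jstack", "-l", str(pid)],
        "jfr": ["jcmd", str(pid), "JFR.start",
                f"duration={jfr_seconds}s", f"filename={out_dir}/recording.jfr"],
    }


def collect_evidence(pid, out_dir):
    """Run each command and return its exit code.

    jstack prints the thread dump to stdout, so that output is saved to a file.
    Raises FileNotFoundError if a JDK tool is not installed.
    """
    exit_codes = {}
    for name, argv in build_evidence_commands(pid, out_dir).items():
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        if name == "thread_dump":
            with open(f"{out_dir}/threads.txt", "w") as fh:
                fh.write(result.stdout)
        exit_codes[name] = result.returncode
    return exit_codes
```

Calling collect_evidence(<pid>, "/tmp/evidence") would run all three steps in order. Keeping command construction separate from execution makes it easy to review the exact commands before touching a production JVM.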

Have you tried searching online, or asking ChatGPT, about the collected traces or the gist of the issue?

2nd Step — Categorise crime/error

Classification is important for choosing the right direction in the hunt for the culprit. Here are a few categories with their descriptions:

Configurational: Basic to advanced configuration changes, depending on the complexity of the issue; usually requires a component restart.
Performance Tuning: Everything works at the functional level, but performance is degraded, or faster execution needs further tuning at the platform level or an upgrade to newer open-source project features. This is an iterative strategy and requires multiple test runs.
Functional: The objective is not achieved, possibly because of a change in design approach, such as datatype conversion restrictions introduced in some Hive versions or earlier functions being deprecated…
Source Code: ClassNotFoundException, NullPointerException, NoSuchMethodError… and other exception traces pointing to Java classes or function calls.
Packaging: Corrupted or empty packages that block installation, packages that were installed with incorrect access permissions for service accounts, or jars that are missing here but available in other project sub-modules.
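
A first guess at the category can be automated with simple keyword matching on the collected trace. The sketch below is only illustrative: the keyword patterns are assumptions chosen for this example, not an exhaustive or authoritative mapping, and the result should be treated as a starting point (a NoClassDefFoundError, for instance, can turn out to be a packaging problem after investigation).

```python
import re

# Hypothetical keyword patterns per category; checked in order, first match wins.
CATEGORY_PATTERNS = [
    ("Source Code", re.compile(
        r"NoClassDefFoundError|ClassNotFoundException|NoSuchMethodError|NullPointerException")),
    ("Packaging", re.compile(r"permission denied|corrupt", re.IGNORECASE)),
    ("Configurational", re.compile(r"invalid configuration|unknown property", re.IGNORECASE)),
]


def categorise(trace):
    """Return a first-guess category for an error trace, or 'Uncategorised'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(trace):
            return category
    return "Uncategorised"
```

For example, categorise("java.lang.NullPointerException: null") returns "Source Code", while a message with none of the keywords falls through to "Uncategorised" and needs manual classification.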

I would love to know if there are further categories out there! Please post them in the comments section.

3rd Step — DNA analysis

DNA fingerprinting helps most with identifying the key problem in performance, configuration, and source code-related issues. In some cases the error message itself points the way forward and suggests hit-and-trial approaches, but complex issues take time, because the crime scene has to be recreated in a local lab to collect further evidence. If the issue is easily reproducible, it has a higher chance of being resolved quickly, since that opens up many approaches to debugging. First, try running the service in DEBUG mode to get more details about the problem. If the source code can’t be compiled locally because of heavy dependencies, try the second approach: add your own console log statements, such as print statements in the right classes, to learn which statements or clauses are being skipped, which values are being assigned to member variables, which functions are being called, class descriptions, etc. With interpreted languages such as Python/Bash, changes can be tested directly without compiling, which makes this easier. Even if you don’t write code often, a quick search will show how to add print statements in different programming languages.
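
As an illustration of the print-statement approach, here is a minimal Python sketch using the standard logging module. The function resolve_queue, its conf dictionary and the "queue" key are hypothetical names invented for this example; the point is only to show how DEBUG statements reveal which branch ran and which values were assigned.

```python
import logging

logger = logging.getLogger("troubleshoot.demo")


def resolve_queue(conf):
    """Hypothetical function under investigation: pick a queue name from a config dict."""
    logger.debug("resolve_queue called with conf=%r", conf)
    queue = conf.get("queue")
    if queue is None:
        # This branch being logged tells us the value was never set upstream.
        logger.debug("'queue' not set, falling back to 'default'")
        return "default"
    logger.debug("using configured queue %r", queue)
    return queue


if __name__ == "__main__":
    # Switch the logger to DEBUG only while investigating.
    logging.basicConfig(level=logging.DEBUG)
    resolve_queue({})
    resolve_queue({"queue": "etl"})
```

Using a logger rather than bare print calls means the extra statements can stay in the code and be switched off again by raising the log level.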

4th Step — Earlier Track Records

To find out whether a similar issue has already been triaged and resolved, or whether a similar approach has been used to fix the problem, here are a few ways to start backtracking:

Open-source Jira project search: Enter the text or error message captured in the 1st step to see whether similar traces exist across Apache open-source projects.
Clone the Apache project from GitHub into your local code editor and search the commit history with git log, for example: git log --all -i --grep='HIVE-12012'
If you’ve identified the exact line that was modified, or need to know which commit modified a given line, open the file on GitHub and switch to blame mode instead of code mode, for example: https://github.com/apache/hive/blame/master/bin/hive
Always capture the message thrown with the exception and backtrack through the method calls to learn which class is actually being used at runtime.
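
The Jira and git log steps above can be chained: pull the Jira keys out of an error report or release note, then build one git log search per key. This is a rough Python sketch under stated assumptions: the helper names are made up for illustration, and the project prefix list is a small sample of Apache Jira project keys that you would extend as needed.

```python
import re

# Sample of Apache Jira project keys; extend for the projects you work with.
PROJECT_KEYS = ("HIVE", "HADOOP", "SPARK", "KAFKA", "TEZ")
JIRA_KEY = re.compile(r"\b(?:" + "|".join(PROJECT_KEYS) + r")-\d+\b")


def find_jira_keys(text):
    """Return the Jira keys mentioned in text, de-duplicated, in order of appearance."""
    return list(dict.fromkeys(JIRA_KEY.findall(text)))


def git_log_command(jira_key):
    """Build the argv for searching commit messages in all branches, case-insensitively."""
    return ["git", "log", "--all", "-i", f"--grep={jira_key}"]
```

Inside a cloned repository, each command can be passed to subprocess.run(git_log_command(key)) to list the commits that mention the ticket.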

5th Step — Case Studies

Here are some illustrative examples of common issues and the approaches used to capture exact messages, gather evidence, and reach conclusions:

Case 1 — Class Not Found Exceptions

In big, community-driven open-source projects, NoClassDefFoundError is like larceny among the issues encountered. Here are the steps followed for this error trace:

[2023-04-12 12:49:12,027] ERROR [KafkaServer id=1005] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.SecurityUtil
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:316)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:304)

Solution Route

Investigation approach

The quick approach is to run a grep (grep -r "org.apache.hadoop.security.SecurityUtil") across the associated projects built on a local server, or to search online, to find out whether the project and its corresponding jar are missing from the service classpath. In some cases the class is located in another directory that is not part of the runtime Java classpath.
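
An alternative to grep is to look inside the jars directly, since a jar is a zip archive whose entries are class files. Below is a minimal Python sketch, assuming the jars sit somewhere under a directory you pass in; find_class_in_jars and demo are illustrative names, and the jar created in demo is a throwaway file that exists only to show the lookup working.

```python
import tempfile
import zipfile
from pathlib import Path


def find_class_in_jars(root, class_name):
    """Return paths of all jars under root that contain the given fully qualified class."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(root).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(str(jar))
        except zipfile.BadZipFile:
            # A corrupted jar is itself a clue, but it is skipped here.
            continue
    return hits


def demo(class_name):
    """Build a throwaway jar with one (empty) class entry and search for class_name."""
    with tempfile.TemporaryDirectory() as root:
        jar = Path(root) / "lib" / "hadoop-common-demo.jar"
        jar.parent.mkdir()
        with zipfile.ZipFile(jar, "w") as zf:
            zf.writestr("org/apache/hadoop/security/SecurityUtil.class", b"")
        return [Path(p).name for p in find_class_in_jars(root, class_name)]
```

An empty result for a directory on the service classpath suggests the jar is missing or installed elsewhere; more than one hit is worth noting too, because duplicate copies can differ in version.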

Problem Identified

As soon as the above command was executed, we found that this was a missing dependency issue: dnsjava-2.1.7.jar had been left out in the distro packaging steps.

Resolution

The simple fix is to add the missing dependency to the packaging assembly file so that it is included in future build cycles (making sure to avoid any conflicts).

Case 2 — No Such method error

NoSuchMethodError is another common case, where the class exists but the method cannot be resolved, for one of the following reasons:

Method not defined in class
Method is protected/private and not accessible to the invoker (the JVM usually reports this case as IllegalAccessError instead)
Method argument count is different
Method argument type is different
Mismatch on return type of method definition

22/08/30 07:50:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(adm_gsatya1); groups with view permissions: Set(); users  with modify permissions: Set(adm_gsatya1); groups with modify permissions: Set()
22/08/30 07:50:32 WARN ChannelInitializer: Failed to initialize a channel. Closing: [id: 0xea0e9ef5]
java.lang.NoSuchMethodError: io.netty.util.internal.ReferenceCountUpdater.setInitialValue(Lio/netty/util/ReferenceCounted;)V

The NoSuchMethodError above caused further class initialization issues:

2023-04-19T16:08:27,066 WARN  [ShuffleHandler Netty Worker #4 ()] io.netty.channel.ChannelInitializer: Failed to initialize a channel. Closing: [id: 0xc66f4148, L:/138.201.249.49:15551 - R:/138.201.81.125:47246]
java.lang.NoClassDefFoundError: Could not initialize class io.netty.handler.codec.http.HttpResponseEncoder
 at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$3.initChannel(ShuffleHandler.java:395) ~[hive-llap-server-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$3.initChannel(ShuffleHandler.java:386) ~[hive-llap-server-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:129) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
 at io.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:112) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]

Investigation approach

The issue is a method missing from the io.netty.util.internal.ReferenceCountUpdater class. We can search online for the source of this class or run grep -r "io.netty.util.internal.ReferenceCountUpdater" on the compiled project.

Problem Identified

On checking, the jar and the class existed in all Netty project jars, but jars with the same name differed in version, which raised the suspicion of a mismatch in project dependency versions.
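
This kind of mismatch can be spotted mechanically by grouping jar file names by artifact and flagging any artifact that appears with more than one version. The Python sketch below assumes the common name-version.jar naming convention, which not every jar follows; the function name and the second Netty version in the usage note are made up for illustration.

```python
import re
from collections import defaultdict

# Assumes "<artifact>-<version>.jar" where the version starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_paths):
    """Map each artifact that appears with several versions to its sorted version list."""
    versions = defaultdict(set)
    for path in jar_paths:
        name = path.rsplit("/", 1)[-1]
        match = JAR_NAME.match(name)
        if match:
            versions[match["artifact"]].add(match["version"])
    return {artifact: sorted(found)
            for artifact, found in versions.items() if len(found) > 1}
```

Given a classpath listing that contains both netty-transport-4.1.77.Final.jar and a hypothetical netty-transport-4.1.45.Final.jar, the function reports netty-transport with both versions, which is the cue to check which one the failing component was compiled against.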

Resolution

The issue was traced to the Tez project, which had been compiled with an incorrect Netty version in which the method didn’t exist.

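For some NoSuchMethodError traces, it helps to understand JNI types and data structures, because the method signature in the message uses that notation. In the trace above, (Lio/netty/util/ReferenceCounted;)V describes a method that takes one io.netty.util.ReferenceCounted argument and returns void.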
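
To make such signatures easier to read, the descriptor can be decoded into plain type names. The following Python sketch handles the standard JVM descriptor letters (primitives, L...; object types and [ for arrays); the function names are illustrative, and malformed input is not handled beyond a basic check.

```python
# JVM type descriptor letters for primitives and void.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _parse_type(desc, pos):
    """Parse one type starting at pos; return (readable_name, next_pos)."""
    dims = 0
    while desc[pos] == "[":          # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":             # object type: L<binary/class/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def parse_method_descriptor(desc):
    """Return (parameter_types, return_type) for a descriptor like '(I)V'."""
    if not desc.startswith("("):
        raise ValueError(f"not a method descriptor: {desc!r}")
    pos, params = 1, []
    while desc[pos] != ")":
        type_name, pos = _parse_type(desc, pos)
        params.append(type_name)
    return_type, _ = _parse_type(desc, pos + 1)
    return params, return_type
```

Comparing the decoded parameter and return types with the method declared in the jar that is actually on the classpath shows quickly whether the mismatch is in the argument count, an argument type, or the return type.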

Case 3 — Null Pointer Exception

A null pointer exception (NPE) occurs when null checks aren’t handled properly or when values aren’t initialized as expected at runtime.

2023-04-10T17:43:44,226 ERROR [Hive Hook Proto Log Writer 0]: hooks.HiveHookEventProtoPartialBuilder (:()) - Unexpected exception while serializing json.
java.lang.NullPointerException: null
 at org.apache.hadoop.hive.ql.exec.ExplainTask.outputPlan(ExplainTask.java:986) ~[hive-exec-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at org.apache.hadoop.hive.ql.exec.ExplainTask.outputPlan(ExplainTask.java:908) ~[hive-exec-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at org.apache.hadoop.hive.ql.exec.ExplainTask.outputPlan(ExplainTask.java:1263) ~[hive-exec-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at org.apache.hadoop.hive.ql.exec.ExplainTask.outputStagePlans(ExplainTask.java:1408) ~[hive-exec-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at org.apache.hadoop.hive.ql.exec.ExplainTask.getJSONPlan(ExplainTask.java:367) ~[hive-exec-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]
 at org.apache.hadoop.hive.ql.exec.ExplainTask.getJSONPlan(ExplainTask.java:268) ~[hive-exec-3.1.4.3.2.2.0-1.jar:3.1.4.3.2.2.0-1]

Investigation approach

The issue appears to be around line 986 of the ExplainTask class (ExplainTask.java:986) in the Hive project, so a few console print statements were added to find out which argument or object was not being initialized properly.
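
When traces are long, the frame to start from can be extracted automatically: the first "at ..." line is where the exception was thrown. Here is a small Python sketch, assuming the usual Java stack frame layout of class.method(File.java:line); the regular expression and the function name are illustrative, and frames without a line number (such as native methods) are skipped.

```python
import re

# Matches frames such as: at pkg.Class.method(Class.java:123)
FRAME = re.compile(
    r"^\s*at\s+(?P<cls>[\w.$]+)\.(?P<method>[\w$<>]+)"
    r"\((?P<file>[\w$]+\.java):(?P<line>\d+)\)"
)


def top_frame(trace):
    """Return (class, method, file, line) of the first stack frame, or None."""
    for text_line in trace.splitlines():
        match = FRAME.match(text_line)
        if match:
            return match["cls"], match["method"], match["file"], int(match["line"])
    return None
```

For the trace above, this yields the class org.apache.hadoop.hive.ql.exec.ExplainTask, the method outputPlan, and line 986, which is where the print statements belong.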

Problem Identified

After a rerun, the issue was traced to the queryState object, which is called to get configuration values.

Resolution

A null check was added to avoid the NPE above.

6th Step — Filing charges

Raise the issue in the respective community’s Jira project, describing what you encountered, and verify the attached patch with test cases. This helps improve collaboration with developers working on the project globally.
Make sure to test the solution with defined test cases and functional checks.
With version upgrades in community-driven projects, there is a high chance of impacting other dependent projects at build time or at runtime.

Written by Datum Scientia