Evolution of our Hadoop Architecture — Chapter 2: The LLAP Saga

Datum Scientia
Jan 11, 2021 · 4 min read

As you’ve probably realized from the title, there are chapters before this one, so make sure you check them out for the overall context before reading on.

“With great volume come greater unforeseen issues”

Live Long And Prosper? LONG LIVE AND PROCESS

While we were dealing with the hiccup from the last chapter, we had another pressing issue staring at us, one directly impacting the production systems on a daily basis. Since moving to the cloud we had also begun ingesting a large amount of data, volumes we hadn’t seen earlier, and as always, with great volume come greater unforeseen issues.

We were ingesting all our data via LLAP alongside reporting, and the way we ingested data meant analyzing it repeatedly to gather statistics and store them in the Hive Metastore, enabling much better query plans and faster reads. All of these queries ran through a multi-tenant LLAP that crashed almost every day, and the only way to recover was the age-old solution in the computer world, yep, you guessed it: RESTARTS. Imagine an intermittent 30-minute downtime every day or every other day in a production environment. It not only hampers the user experience but also requires multiple teams to coordinate and pause all running applications; the chain of emails, approvals and acknowledgements was an operational black spot.
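To make the ingestion pattern concrete, the statistics gathering we are referring to is Hive’s ANALYZE TABLE statement, which scans the data and writes table- and column-level stats into the Metastore for the optimizer to use. A minimal sketch is below; the table and partition names are made up purely for illustration.

```sql
-- Illustration only: table/partition names are hypothetical, not our schema.

-- Gather basic table/partition statistics (row counts, sizes, file counts).
ANALYZE TABLE sales_events PARTITION (ingest_date='2021-01-10')
COMPUTE STATISTICS;

-- Gather column-level statistics (distinct values, min/max, null counts)
-- that feed the cost-based optimizer's query plans.
ANALYZE TABLE sales_events PARTITION (ingest_date='2021-01-10')
COMPUTE STATISTICS FOR COLUMNS;
```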

“How do you identify the issue-causing query when your system is being hit by tens of queries per minute?”

The primary problem was that with so many queries running in parallel, it was nearly impossible to narrow down the query that caused the issue. We eventually observed that the offending query did not necessarily take down the system immediately, but infected it like a virus with a delayed onset. We came to this conclusion by observing all the queries that ran within roughly two hours on either side of the first failed query, combined with mental estimates of the average run time for most kinds of queries (based on historical runs and a good understanding of the data volume). We even got lucky when the volume of running queries was low enough for us to funnel in on the culprit.
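Most of that funnelling was done by hand, but the logic is easy to express. The sketch below assumes a hypothetical query_history table (query id, user, start time, duration, status); Hive does not give you such a table out of the box, so treat the names and columns purely as an illustration of the approach.

```sql
-- Hypothetical query_history(query_id, user_name, start_time, duration_sec, status).
-- Look at everything within +/- 2 hours of the first failure and flag queries
-- that ran far longer than their historical average.
WITH first_failure AS (
  SELECT MIN(start_time) AS failed_at
  FROM query_history
  WHERE status = 'FAILED'
),
window_queries AS (
  SELECT q.*
  FROM query_history q
  CROSS JOIN first_failure f
  WHERE q.start_time BETWEEN f.failed_at - INTERVAL '2' HOUR
                         AND f.failed_at + INTERVAL '2' HOUR
)
SELECT w.query_id, w.user_name, w.duration_sec, h.avg_duration_sec
FROM window_queries w
JOIN (
  SELECT user_name, AVG(duration_sec) AS avg_duration_sec
  FROM query_history
  GROUP BY user_name
) h ON w.user_name = h.user_name
WHERE w.duration_sec > 3 * h.avg_duration_sec   -- "much slower than usual"
ORDER BY w.duration_sec DESC;
```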

“Logs, logs and more logs, an ocean of them”

We spent hours collecting gigantic logs and trying to analyze the situation, looking at multiple monitoring pages, but everything led to a dead-end. We finally got a glimmer of hope once the engineers who developed LLAP got on a call with us and tried to analyze the issue. In their experience no one had been using LLAP to perform inserts, and even if they had, definitely not at the scale we were at.

LLAP comes enabled from Hive 2.x, so there was no way to just move to Hive 2 and ditch LLAP, and we simply couldn’t go back to Hive 1 due to the various issues it had, along with the performance repercussions. Rolling back the version wasn’t an option; it is hard to justify to users why reports that had started displaying results in about 10 seconds now take 30 seconds, and would for the foreseeable future. The choice was between a performance degradation across the board, or a few unfortunate users whose reports happened to run right around a restart. And the issue was not just the restart itself; it was the slowness that ensued right before the ultimate doom, when the daemons stopped responding completely.

“Java’s nemesis, The Garbage Collector”

Anyway, coming back to the glimmer of hope I mentioned earlier. The issue seemed to be coming from the analyze queries (gathering table/column stats) we were running, which opened up an uncontrolled number of mappers, causing massive memory pressure and pushing the daemons into stop-the-world Garbage Collector (GC) pauses. So technically the daemons weren’t dying; they were simply too busy cleaning up old references, which piled up too quickly for them to be able to respond. And since our daemons were monstrous in terms of RAM (about 300G each), this took minutes, sometimes hours. We never really got to the point where it took longer, because we would have restarted the system by then and caused an automatic flush. There were tons of jmaps, jstacks, log files and histograms from various nodes, and potentially hundreds of man-hours (both from our side and from the vendor support engineers, the creators of LLAP), all working in tandem to finally reach the above conclusion.
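For context on the “uncontrolled number of mappers” part: in Hive on Tez the mapper count for a scan is driven by how input splits are grouped, and there are standard knobs for bounding it. The settings below are shown only to illustrate the kind of control that was missing; the values are arbitrary examples, and this is not the hot-fix we eventually received.

```sql
-- Illustration only: standard Tez split-grouping knobs that cap how many
-- mappers a single (analyze) query can fan out into. Example values, not our fix.
SET tez.grouping.min-size=268435456;    -- don't create a mapper for < ~256 MB of input
SET tez.grouping.max-size=1073741824;   -- cap each grouped split at ~1 GB
-- Or pin the number of grouped splits (and hence mappers) directly:
-- SET tez.grouping.split-count=200;
```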

We were really under the pump to resolve this, and after months of effort we had finally made some progress: the fundamental first step of identifying the root cause itself. We were glad to have finally found it, and the sleepless nights were about to end (at least that’s what we thought). We soon got a hot-fix, and after applying it we observed no issues for a week; life seemed to be going back to normal after what had been an arduous stretch of fire-fighting. Come the second week, and we were staring at exactly the same alert we had been witnessing for what felt like ages.

Stay tuned for what came next and how we eventually got around it. Don’t forget to follow us to live our journey vicariously.

Links to Check:
