Evolution of our Hadoop Architecture — Chapter 3: The LLAP Saga Continues

Datum Scientia
Jan 25, 2021

Here is the third part of this quartet; to get the overall context, make sure you check out the earlier chapters here.

After applying the hot-fix, things went back to normal for some time, and we marked analyze statements as the only culprit bringing LLAP daemons down. Then came the next week, and we saw the same alerts and emails from teams regarding the health of the LLAP service: queries failing, ETL jobs breaching SLAs.

What are we running on LLAP?

There are three kinds of queries that usually run on the Hive engine:

  • DQL (Data Query Language) - select statements by business/analytics teams
  • DML (Data Manipulation Language) - inserts
  • DDL (Data Definition Language) - create, analyze by data engineers and architects.
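
For illustration, the three categories look roughly like this in Hive. These are standalone sketches with hypothetical table and column names, not our actual production statements:

    -- DDL: create a partitioned ORC table and gather stats on it
    CREATE TABLE IF NOT EXISTS sales_orc (
      order_id BIGINT, customer_id BIGINT, amount DOUBLE
    ) PARTITIONED BY (event_date STRING) STORED AS ORC;
    ANALYZE TABLE sales_orc PARTITION (event_date) COMPUTE STATISTICS;

    -- DML: an ETL insert loading a single partition
    INSERT INTO TABLE sales_orc PARTITION (event_date = '2021-01-25')
    VALUES (1, 42, 99.9);

    -- DQL: an analyst's report query
    SELECT event_date, SUM(amount) FROM sales_orc GROUP BY event_date;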

We had already dealt with the third one, so to narrow down the issue we targeted the remaining two individually.

Time to RCA

We stopped the ETL jobs and requested the BI teams to run hard-hitting queries against LLAP concurrently; the reports executed without any failures or bringing daemons down.

The next action was to test the ETL inserts (responsible for loading partitioned tables). We started this process in phases: small (<100GB), medium (100–200GB), and large (>500GB) tables.

As we reached the third round, there was a bittersweet smile on our faces as a familiar alert got raised (No LLAP daemons are running).

“Why was loading a large partitioned table causing LLAP failures?”

Loading data from an external unpartitioned table into ORC managed partitioned tables involves Hive’s dynamic partition feature. It launches a record writer for each partition value, which increases memory pressure in the reducer stage; after crossing a particular OOM-exhausting limit (for us, > 600 partitions), the LLAP daemons are unable to withstand such a surge in requests and start falling over.
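
To make that concrete, the kind of load that triggered the issue looks roughly like the sketch below (table and column names are hypothetical). Every distinct partition value coming out of the staging table opens its own ORC record writer on the reducers, so a load spanning hundreds of partitions multiplies the memory footprint:

    -- Dynamic partition load from an external staging table into an
    -- ORC managed, partitioned table (hypothetical names).
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- The partition column (event_date) comes last in the SELECT;
    -- Hive routes each distinct value to its own partition writer.
    INSERT OVERWRITE TABLE sales_orc PARTITION (event_date)
    SELECT order_id, customer_id, amount, event_date
    FROM sales_staging_ext;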

Next course of action

As soon as we encountered this issue, we shared it with the developers and they provided a hot-fix: set hive.optimize.sort.dynamic.partition to true, and gather stats on the external table’s to-be-partitioned columns for precise estimation. The modified property sorts on the partition columns before feeding rows to the reducers, so each reducer keeps only one record writer open at any time, thereby reducing the memory pressure on the reducers.

The provided patch only plans with correct estimates once the stats are gathered; if we skip this step, we face the bug of a lower reducer count.
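
In practice the hot-fix boiled down to something like the following sketch (table and column names are hypothetical):

    -- Sort rows on the dynamic partition column before they reach the
    -- reducers, so each reducer keeps only one record writer open.
    SET hive.optimize.sort.dynamic.partition = true;

    -- Gather column statistics on the external staging table so the
    -- planner can estimate the reducer count correctly; skipping this
    -- step is what exposes the lower-reducer-count bug mentioned above.
    ANALYZE TABLE sales_staging_ext COMPUTE STATISTICS FOR COLUMNS event_date;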

“Hmmm… sounds good, but what about the extra time to gather such stats? And isn’t this property bad, at a macro level, for tables with fewer partition values?”

Well, the questions sounded interesting. We performed basic testing, as before, prior to finalizing the patch and found a 30–40% rise in query execution time (the kind of rise no one likes), not a convincing idea to pitch to leadership for production.

We shared the same analysis with the LLAP developers and got another patch to handle the induced slowness. With the patch, the Hive engine plans execution to either skip or sort the dynamic partition values depending on a configured threshold. This avoids the mandatory stats-gathering requirement for tables with fewer partitions and enforces it only for the tables with larger partition counts, as defined by the threshold.
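
Conceptually, the patched behaviour can be sketched as a single threshold setting. The property name (hive.optimize.sort.dynamic.partition.threshold) and the value below are an illustration of the idea rather than the exact setting from our patch:

    -- Assumed illustration: apply the sorted dynamic partition
    -- optimization only when the estimated number of partition writers
    -- crosses the threshold, and skip the sort (and the mandatory stats
    -- gathering) for smaller tables.
    SET hive.optimize.sort.dynamic.partition.threshold = 100;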

We didn’t have much time to evaluate beyond our basic testing, as it was a race against time to apply the fix and get the machine running again. With this exercise, we learned a lot about other LLAP issues:

  • ORC reported a bug where the memory manager was unaware of LLAP’s per-executor memory bound
  • Tasks running on LLAP executors getting marked as missing or queued for long periods hinted at a priority inversion bug in LLAP; since ETL jobs are known to run much longer than interactive queries, the chances of task preemption are higher
  • Lower cache hit rates for BI queries, as the ETL inserts caused higher cache eviction

We knew the applied patch wasn’t going to help us in the longer run, as service availability was still compromised along with degraded performance, and a question kept being asked

“Dude, were there any changes to the platform? This report was running fine last week; now it’s running too slow”

With a never-give-up attitude, we were able to find the best solution to this problem, one that helped us scale. Watch out for the finale of this series and how our journey achieved a new milestone!!!
