Evolution of our Hadoop Architecture — Chapter 4: Divide and Conquer

Datum Scientia
4 min read · Jan 29, 2021

Finally, the most awaited day has come: here is the finale of the series. You’re the real MVP if you followed our journey from the pilot episode.

The previous chapters covered the LLAP saga: encountering issues and applying patches provided by the LLAP authors, which let our ETL and BI cars continue on the same road, but with one slowing down the other. With such interruptions, the only logical solution was to isolate them and run them on separate queues and servers.

“You mean adding one more Hive interactive server? Okay, but why and how?”

Climbing the scaling ladder with an ever-increasing ingestion workload is bound to challenge the performance consistency of LLAP. Keeping ETL and BI in separate rooms helps deal with their diverse downstream sub-problems more effectively: interactive query optimization on one side, and dynamically scaling ETL ingestion jobs for larger partition values on the other. The following were the fundamental steps to redesign our Hadoop architecture, along with the runbook mentioned in the published appendix:

  1. Node labelling for the ETL and BI workloads
  2. A second HiveServer2 Interactive for the ETL workload
  3. Containerized mode of LLAP

Node Labelling: ETL & BI

To isolate the workloads, label the nodes and assign them to only run containers designated for the specified jobs. Here is how we performed capacity planning for segregating the nodes:

  Nodemanager max memory = 160GB

  Before separation, LLAP (ETL + BI):
  Total LLAP Heap Size = 10 (Num of Daemons) * 54GB (Daemon Heap Size) = 540GB
  LLAP Resource Allocation -> 1TB (Total LLAP Heap Size = 540GB, Total Cache = 360GB,
    Total Headroom = 100GB, Nodes for Daemons = 10)

  ETL after separation:
  # To maintain an equivalent heap size at the container level
  ETL -> 4 nodes = 160GB * 4 = 640GB ~ 540GB (Old Total LLAP Heap Size)

  BI (LLAP) after separation:
  BI (LLAP) -> 900GB (Total LLAP Heap Size = 540GB (Old Total LLAP Heap Size),
    Total Cache = 340GB, Total Headroom = 30GB, Nodes for Daemons = 6)

Thus the heap size in the new configuration stays equivalent to the old one.

From here, we can later increase the number of nodes based on resource requirements.
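For reference, here is a minimal sketch of the YARN node-labelling steps. The label names (etl, bi), queue names, and hostnames below are illustrative assumptions, not our exact runbook:

  # Create exclusive labels on the ResourceManager (hypothetical label names)
  yarn rmadmin -addToClusterNodeLabels "etl(exclusive=true),bi(exclusive=true)"

  # Pin nodes to a label: 4 ETL nodes and 6 BI nodes, per the sizing above
  yarn rmadmin -replaceLabelsOnNode "etl-node-1=etl etl-node-2=etl etl-node-3=etl etl-node-4=etl"
  yarn rmadmin -replaceLabelsOnNode "bi-node-1=bi bi-node-2=bi bi-node-3=bi bi-node-4=bi bi-node-5=bi bi-node-6=bi"

  # capacity-scheduler.xml: tie each queue to its label (queue names are assumptions)
  yarn.scheduler.capacity.root.etl.accessible-node-labels=etl
  yarn.scheduler.capacity.root.etl.accessible-node-labels.etl.capacity=100
  yarn.scheduler.capacity.root.etl.default-node-label-expression=etl
  yarn.scheduler.capacity.root.llap.accessible-node-labels=bi
  yarn.scheduler.capacity.root.llap.accessible-node-labels.bi.capacity=100
  yarn.scheduler.capacity.root.llap.default-node-label-expression=bi

With something like this in place, anything submitted to the etl queue can only land on ETL-labelled nodes, while the LLAP daemons stay pinned to the BI nodes.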

Add HiveServer2 Interactive: ETL workload

We needed another interactive server for ETL because, on the distribution version available to us (HDP 2.6.5), Hive 2.x was only available with LLAP, which enables ORC headers while writing data. Recent distributions no longer have this dependency on LLAP, and provisioning multiple interactive servers is one click away. But our journey was filled with challenges, so we needed a runbook to hack together multiple interactive servers.
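Our exact steps live in the runbook, but conceptually a second HiveServer2 Interactive is a clone of the first with its own port, ZooKeeper namespace, and queue. A hedged sketch of the overrides the ETL instance would need (the property names are standard Hive settings; the values are assumptions, not our actual configuration):

  # Second HiveServer2 Interactive (ETL): overrides relative to the BI instance
  hive.server2.thrift.port=10501                     # avoid clashing with the BI instance
  hive.server2.zookeeper.namespace=hiveserver2-etl   # separate service-discovery namespace
  hive.server2.tez.default.queues=etl                # submit work to the ETL-labelled queue
  hive.llap.daemon.queue.name=etl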

Containerized mode of LLAP

Another daemon-backed LLAP service for ETL could have led to the same issues mentioned in the third chapter. So, in order to keep using Hive 2.x, we enabled LLAP in container mode, running a pseudo-single daemon instance (just to keep the LLAP service live) along with the additional config changes discussed in our runbook.
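As a rough illustration of what container mode means in configuration terms (the actual changes are in the runbook; the values below are assumptions), the ETL instance keeps one minimal daemon alive purely so the interactive server can start, while queries run as plain Tez containers:

  # ETL HiveServer2 Interactive: run queries outside the LLAP daemons
  hive.execution.mode=container       # plain Tez containers instead of LLAP executors
  hive.llap.execution.mode=none       # never push operators into the LLAP daemons
  num_llap_nodes=1                    # pseudo-single daemon, just to keep the service live
  hive.llap.daemon.num.executors=1    # minimal footprint for that one daemon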

Impact Analysis: Pros & Cons

Pros:

  1. A record-breaking ZERO incidents reported from the LLAP application
  2. Increased performance due to a higher cache hit rate and lower cache eviction
  3. Dynamic increase of ETL container allocation based on the ingested table size

Cons:

  1. Lower data locality, as data now moves from ETL nodes to BI nodes
  2. Losing the LLAP cache benefit on ETL workloads
  3. No live-running ETL containers, which could be mitigated with prewarm containers (see the sketch below)
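On the third con: Hive and Tez expose knobs to keep warm containers around between queries. A minimal sketch, assuming default container sizing (the counts are illustrative):

  # Mitigating cold starts on the ETL side
  hive.prewarm.enabled=true               # spin up Tez containers when a session opens
  hive.prewarm.numcontainers=10           # how many containers to prewarm
  tez.am.session.min.held-containers=10   # keep idle containers attached to the session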

This brings us to the end of our journey through “Evolution of our Hadoop Architecture”. It took us four months to reach this milestone. We hope we answered most of your queries and can steer you towards the right decisions when planning your own evaluation of Hive LLAP for your workloads.

Links to check:

Appendix:

Chapters:
