Introduction: Evolution of our Hadoop Architecture

Datum Scientia
2 min readJan 4, 2021

Set in circa 2018

It started with one cluster that migrated data from an RDBMS (Greenplum) onto Hadoop with few modifications, at near-real-time frequency, pseudo-ETL as you might call it, with the help of our beloved in-house data migration framework, ACCIO (built on a combination of Spark and Hive; we can only divulge so much about this framework because, apparently, we had created something special here at the time).

The same cluster then hosted applications expected to return sub-second query responses for a plethora of web-based financial reporting products, while also allowing massive data extracts that could potentially run for 30+ minutes (don't ask us how the users were planning to use that data). And it doesn't end there: multiple Spark batch jobs also ran on the cluster, some performing housekeeping work, others generating downstream-specific data extracts. Sounds like a task for a Swiss Army knife? This is how we began our journey of building our multi-tenant platform. Much of what we were trying to achieve was, at the macro-architecture level, in some ways unprecedented for a Hadoop cluster.

What the world expects out of a single Hadoop cluster

Most of us running this machinery were innocent freshers with no background in the infrastructure, or in how each parameter could have a multi-fold impact. And we had recently lost our architect to the Hadoop vendor itself. Sounds fun, right?

Over the next few chapters, we are going to take you through the evolution of our infrastructure over the course of two years: all the mistakes we made, the lessons we learnt, and everything in between. We initially began with a small subset of data on our on-prem cluster, but since every company was eventually moving to the cloud, so did ours, and we spent months planning and migrating our nascent platform onto AWS. It was probably the best time for us to make the move, because we were just getting started on the Hadoop cluster, and the migration was a massive effort even without the thousands of production-level users we have today.

Follow the series as the story unfolds, and don't forget to drop a note if you like (or dislike) the journey!
