


It is often said that 90% of today’s data was generated in the last two years, which means the Big Data world is changing every single day! So, when I got the chance to see how big firms are tackling their data, I took it immediately. I was overwhelmed by all the information, but finally shortlisted two interesting talks to share today!

Uber’s Geospatial Analysis

This session had me thinking of my project in school, where I was exposed to the complexity of working with geospatial data. We all know Uber as the most used ride-sharing service, operating in 70+ countries with millions of rides every single day. Naturally, dealing with such a volume of data can get complex. I was super excited to see Uber’s insights on how they transformed their data architecture to improve the performance of geospatial data analysis!

Here is a look at their analytics infrastructure:

*Image reference from Uber’s blog

All the data from multiple input sources such as Kafka, Schemaless and MySQL is eventually fed into the central data lake via batch or stream processes. The central data lake is built on Hadoop. Multiple teams perform analysis and ad-hoc processing on the data lake.

It was no shocker to see multiple HDFS and Presto clusters, on top of which Hive, Presto and Spark queries are executed! That is a lot of querying and processing.

Their challenge was to accommodate rapidly growing demand and add new services onto the current infrastructure. Since this is a multi-tenant shared infrastructure, making changes means correctly allocating resources for both the HDFS and Presto clusters.

Geospatial data is an essential ingredient for Uber to function. So, how is this data stored and analysed? It is stored on Hadoop either as a point with a latitude and longitude, or as a polygon, that is, a set of points.

They also gave a few examples of how they join tables with a Hive query to check whether a point lies within a particular location, and how they optimized that query. Pretty interesting stuff, I must say!
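To make the point-in-polygon join a little more concrete, here is a minimal pure Python sketch. It is not Uber’s Hive or Presto query; the function names, the ray-casting test and the sample data are all my own assumptions for illustration. It treats coordinates as flat (x, y) pairs, which is only reasonable for small areas.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as flat x/y


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray going
    right from the point crosses. An odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_points_to_zones(
    trips: Dict[str, Point], zones: Dict[str, List[Point]]
) -> List[Tuple[str, str]]:
    """Naive nested-loop join: pair every trip with every zone containing it."""
    matches = []
    for trip_id, point in trips.items():
        for zone_name, polygon in zones.items():
            if point_in_polygon(point, polygon):
                matches.append((trip_id, zone_name))
    return matches


# Hypothetical sample data: one square zone and two trip locations.
zones = {"downtown": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]}
trips = {"t1": (2.0, 2.0), "t2": (5.0, 1.0)}
print(join_points_to_zones(trips, zones))  # [('t1', 'downtown')]
```

The nested loop compares every point with every polygon, which is exactly the kind of cost that becomes painful at Uber’s scale and that query optimization is meant to reduce.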

They weighed the pros and cons of running such queries on Hive and Presto and settled on Presto. Here are the top reasons why:

  • Queries run faster on Presto, speeding up the process 60x
  • Since there is a Geospatial plugin within Presto, external dependencies are eliminated
  • The Presto optimizer rewrites user queries for better performance
  • It has built-in connectors to Kafka, HDFS and MySQL

After the talk, I started digging into Uber’s engineering page. There is definitely a whole lot to learn from their blog, and it is well worth checking out.

Building an Enterprise Data Lake at FINRA

It is interesting to learn how a financial regulatory body like FINRA adopted a cloud platform at the enterprise level, with a ton of data to be processed every day and all of it subject to regulatory compliance.

What is FINRA’s definition of a managed data lake? They defined it as taking the elements of a traditional database system (storage, compute, query and catalog) and transforming them on the cloud to form a virtual database. FINRA moved from on-premise data marts to a managed data lake on the cloud.

Operating on the data lake has helped them:

  • Do much more regulatory reporting on larger datasets, which are now available online.
  • Give data scientists a catalog that makes it easier to understand all the data in the organization and how it can be combined, and to query it through a common interface with no IT involvement.
  • Manage records policies on datasets, such as where the data resides and its lifetime policy.
  • From an ETL perspective, process data flexibly when required and meet SLAs.
  • Consolidate multiple data repositories into a single repository, which is easier from a user management perspective.

Overall, they have been able to reduce costs compared to running all processes on premise, while maintaining security and regulatory compliance.

Before the data lake:

  • They had difficulty doing all the processing because market volumes have unpredictable spikes. During a spike, they had to constantly re-prioritize workloads to meet their SLAs, since the capacity of the data warehouse was fixed.
  • There were dependencies between teams, since the data resided in silos. This slowed down a lot of their processes.

Their main motivation for moving to the cloud was to start treating data as an enterprise asset and make all of it available for analytics. They also wanted a service-based architecture, where it is easier to stand up tools and combine them as they see fit. This had a positive effect on the ETL processes, which are now more available, and it became easier to handle the workflow bottlenecks they saw earlier. Finally, they wanted to embrace open source, take advantage of its flexibility and avoid being locked in to any vendor.

*Image reference from FINRA’s presentation

In the cloud, they have been able to design three layers, namely compute, storage and catalog (on AWS), and eliminate the silos created around subject areas.

Compute layer : They use open source tools (Hive, Spark and Presto), each on separate EMR clusters, to run queries on the data in the data lake, that is, S3. These tools can access S3 data as external tables, which can be joined using standard SQL queries. Simple to analyse, right! And being on separate clusters makes them efficient and independent.

Storage : S3 and Glacier are used as the data lake where all files are stored. Most of the data here is kept in text format for faster processing, while the .orc format is used only for performance-sensitive queries.

Catalog : Acts as a central data management unit. It is a unified catalog containing schemas, encryption types, versioning and storage policies, and it tracks data lineage and where data is being used. It also serves as a shared metastore for table partitions that can be used with Spark, Hive and Presto.
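To picture what such a catalog might hold, here is a small Python sketch. It is purely illustrative and is not FINRA’s implementation; the class names, fields and dataset names are hypothetical, chosen only to mirror the items listed above (schema, encryption type, version, storage policy, partitions and lineage).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One dataset registered in the (hypothetical) catalog."""
    name: str
    schema: Dict[str, str]          # column name -> type
    version: int
    encryption: str                 # e.g. "AES256" (illustrative value)
    storage_policy: str             # e.g. "archive after 90 days"
    partitions: List[str] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)  # direct parent datasets


class Catalog:
    """In-memory stand-in for a central catalog shared by all query engines."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def get(self, name: str) -> CatalogEntry:
        return self._entries[name]

    def lineage(self, name: str) -> List[str]:
        """Return every upstream dataset, direct or indirect, of `name`."""
        seen: List[str] = []
        stack = list(self._entries[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                stack.extend(self._entries[current].upstream)
        return seen


# Hypothetical datasets forming a simple chain: raw -> cleaned -> report.
catalog = Catalog()
catalog.register(CatalogEntry("raw_trades", {"id": "string"}, 1, "AES256", "keep 7 years"))
catalog.register(CatalogEntry("cleaned_trades", {"id": "string"}, 1, "AES256",
                              "keep 7 years", upstream=["raw_trades"]))
catalog.register(CatalogEntry("daily_report", {"id": "string"}, 1, "AES256",
                              "keep 1 year", upstream=["cleaned_trades"]))
print(catalog.lineage("daily_report"))  # ['cleaned_trades', 'raw_trades']
```

Because every engine would read the same entries, a schema or policy change made once in the catalog is visible to Spark, Hive and Presto alike, which is the appeal of keeping it as a separate layer.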

The most interesting outcome is that FINRA has built an open source big data governance platform called Herd, which they use as part of the Catalog layer.
