Here’s your guide to understanding the work of a data engineer and how you can jump on the data engineering bandwagon.
What do data engineers do?
In simple terms, a data engineer is someone who builds the infrastructure for consuming data, processing it, and gaining insights from it. But in reality, the role is much more complex.
Today, the work of a data engineer is often overshadowed by the emergence of data science. In most cases, the work of data scientists relies on the structured, aggregated data generated by data engineering workflows. Any experienced data scientist will also have some of the skillset of a data engineer and will know how to work alongside data engineers to get the best results from data analysis.
Since the role of data engineers is evolving every day, they need to learn multiple programming languages, integrate a variety of systems, and keep up with emerging technologies and tools. It is not easy to box in the skillset for the role. That may sound like a lot, so let’s simplify it and take small steps to unravel the path to data engineering. We will start with the basic requirements to enter the field.
How to get started?
1. SQL, your first step
Structured Query Language (SQL) has made it possible for us to interact with and make sense of our data. Every part of the infrastructure uses SQL in some capacity.
In the data ingestion process, if your source systems are relational databases like SQL Server or MySQL, or NoSQL databases like HBase, MongoDB, or Cassandra, you will need SQL (or a SQL-like query layer) to query the source system and make sense of the data.
If you are designing a data warehouse, you need SQL to build the transformations that modify the data, and also to query the data stored in the warehouse for analysis.
Even with distributed systems, tools like Hive and Spark use SQL to query data on distributed clusters. Different products may use different dialects of SQL (T-SQL, PL/SQL, or Spark SQL), but the underlying functionality is the same.
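To make this concrete, here is a minimal sketch of the kind of aggregation query you will write constantly as a data engineer, run here against an in-memory SQLite database. The `orders` table and its contents are made up for illustration; the same `GROUP BY` pattern carries over to T-SQL, PL/SQL, or Spark SQL with only minor syntax differences.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# A typical aggregation query: total spend per customer, highest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(rows)  # [('alice', 150.0), ('bob', 75.5)]
```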
2. Data Warehousing Principles
If you walked down the data engineering path a decade ago, it was all about building efficient relational data warehouses, ETL pipelines, and reports for analyzing metrics.
The principles of traditional data modeling are still relevant even though the process and structure of data warehousing have changed. Without going into the nitty-gritty of design changes, I want to talk about the principles followed in today’s data warehouse development process.
Data warehouses are central repositories of integrated data from one or more disparate sources. They are considered a core component of data analysis.
The fundamental principles for designing dimension and fact tables are still a basic requirement for anyone aspiring to be a data engineer. This is the most strategic part of the job, and designing a truly efficient data warehouse comes only with experience. I say strategic because you need a good understanding of the business requirements and of how the end users (who could be data scientists or business executives) would like to see their data.
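The dimension/fact split can be sketched in a few lines of SQL. Below is a minimal star schema built in SQLite; the table names, the surrogate `date_key`, and the sample rows are all illustrative, not from any real system. The point is the shape: facts hold measures and foreign keys, dimensions hold descriptive attributes, and analysis joins the two.

```python
import sqlite3

# A minimal star schema: one fact table keyed to a date dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,  -- surrogate key, e.g. 20240115
    full_date TEXT,
    month     TEXT
);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product  TEXT,
    amount   REAL                    -- the measure being analyzed
);
""")
conn.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 'January')")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(20240115, "widget", 10.0), (20240115, "gadget", 20.0)])

# Analytical queries join the fact to its dimensions and aggregate.
row = conn.execute("""
    SELECT d.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.month
""").fetchone()
print(row)  # ('January', 30.0)
```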
The Data Warehouse Toolkit is a great book to learn about traditional dimensional modeling principles.
For a practical course to follow along with, you can start with this Coursera course.
3. Coding Languages
With the data engineer role evolving so quickly, proficiency in programming languages has become a necessity for building robust infrastructure. Let me give you a few examples of where I have written code:
I worked on building a pipeline to ingest log files and process the data on a Spark cluster, extracting credential attributes from each file and identifying whether they mapped to a registered user in the system. The processing script was written in PySpark, which is a Python API for Spark. I did not rely on additional services to do the validation.
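The core of that validation can be sketched in plain Python. The log format, the `user=` attribute, and the registered-user list below are all invented for illustration; in the actual PySpark job the same extraction function would be applied across a distributed DataFrame or RDD of log lines rather than a local list.

```python
import re

# Hypothetical inputs: a known-user set and a "user=<name>" log attribute.
REGISTERED_USERS = {"alice", "bob"}
LOG_PATTERN = re.compile(r"user=(\w+)")

def extract_user(line):
    """Pull the credential attribute out of one log line, if present."""
    match = LOG_PATTERN.search(line)
    return match.group(1) if match else None

def known_users(lines):
    """Return the users found in the logs that map to registered accounts."""
    users = {extract_user(line) for line in lines}
    return (users - {None}) & REGISTERED_USERS

logs = ["ts=1 user=alice action=login", "ts=2 user=mallory action=login"]
print(known_users(logs))  # {'alice'}
```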
Another instance was when I was working on the AWS cloud platform and wanted to process files as soon as they were dropped into S3 buckets. So I wrote custom code on the Lambda service to trigger a function that processed files when they appeared in a specific bucket. Since Lambda is serverless, I skipped the whole step of setting up infrastructure to run my code.
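A handler for that kind of trigger looks roughly like this. The event shape follows AWS’s documented S3 notification format; the bucket name, key, and processing step are placeholders, and real code would fetch each object with boto3 before processing it.

```python
# Sketch of a Lambda handler fired by S3 "ObjectCreated" notifications.
def handler(event, context):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would download the object with boto3 and process it here.
        processed.append(f"{bucket}/{key}")
    return {"processed": processed}

# Exercising the handler locally with a minimal sample event:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "incoming/file1.csv"}}}
    ]
}
result = handler(sample_event, None)
print(result)  # {'processed': ['my-bucket/incoming/file1.csv']}
```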
These are just a few instances, but with programming languages you can work your way through multiple big data systems and customize them to save time and cost.
4. Distributed System Paradigms
Distributed systems have significantly changed the landscape of data engineering pipelines. Whether for data processing or storage, they have made it possible for these processes to run faster and scale to hold large datasets. If you don’t have a basic understanding of distributed systems concepts, using the tools will only get more complicated.
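One paradigm worth internalizing early is map/shuffle/reduce, which underpins much of the Hadoop-era tooling. The sketch below runs the three phases sequentially on a tiny made-up corpus; on a real cluster, the map and reduce phases each run in parallel across many nodes, and the shuffle moves data between them by key.

```python
from collections import defaultdict

# Toy input: in practice this would be file splits spread across HDFS.
docs = ["big data data", "data engineering"]

# Map phase: each node emits (word, 1) pairs for its slice of the input.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: pairs are grouped by key (routed to the same node).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each node sums the counts for the keys it owns.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 1, 'data': 3, 'engineering': 1}
```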
(Figure: the CAP theorem — consistency, availability, and partition tolerance.)
5. Big Data Tools
Big data solutions typically involve a large amount of mostly non-relational data, such as key-value data, JSON files, or time-series data, that cannot be processed by traditional relational database systems.
All of these big data tools use a distributed architecture to process and store data. Here is a list of some of the most commonly used and popular systems.
Hadoop Distributed File System (HDFS) is a distributed file system created as a way to horizontally scale out storage by adding commodity hardware. Now there are numerous services that provide similar functionality, such as MapR FS, AWS S3, Azure Blob Storage, and GCP Cloud Storage.
Apache Spark is a good option for a distributed processing framework. It has the potential to replace the traditional ETL process while handling both structured and unstructured data.
Apache Hive, an important part of the Hadoop ecosystem, provides a way to project structure onto distributed data. It uses SQL to generate results and can be helpful for generating insights.
Apache Kafka is a distributed stream-processing platform, useful if you are building real-time analytics pipelines or dashboards.
Here are two books that I highly recommend everyone read: Designing Data-Intensive Applications and Big Data: Principles and Best Practices of Scalable Realtime Data Systems.
6. Additional skills
There are a few other skills that will come in handy but are not a requirement to get started. Knowing how to write UNIX scripts or use the command line will be useful when working on big data infrastructure. All cloud platforms have command-line tools that let you set up and work with their services if you don’t want to use the GUI.
It’s becoming increasingly common to have a shared environment for storing infrastructure code or code related to any part of the development process, so data teams now also rely on version control systems like Git or Microsoft TFS (Team Foundation Server). This was not a requirement in the past, when ETL tools took care of version control, but with teams building data infrastructure that spans multiple systems and clouds, it is now.
Data engineering is a vast field that takes care of the data from the moment it leaves the software system until it reaches the hands of data scientists. To start building efficient and scalable systems to handle data, here is what you will need to master:
- How to write complex SQL queries to read and aggregate data.
- How to design a data warehouse based on the business requirements and on the source systems that the data needs to be extracted from.
- The basics of at least one programming language. I would suggest starting with either Java or Python; both are used in most systems and tools.
- How distributed systems can help scale and handle large amounts of data, along with the different architectures built on them, such as linear scaling and the Lambda architecture.
- A working knowledge of big data systems to start with. Most of these are open source, so start building pipelines using any of the frameworks to see how data can be transformed.
- If you have additional bandwidth, I would suggest learning UNIX scripting and the principles of version control with a tool like Git.