A Beginner’s Guide to Data Engineering
So you want to break into data engineering? Start today by learning more about data engineering and the fundamental concepts.
By Bala Priya C,
KDnuggets
Image by Author
With the influx of huge amounts of data from a multitude of sources, data engineering has become essential to the data ecosystem, and organizations are looking to build and expand their teams of data engineers.
Some data roles such as that of an analyst do not necessarily require prior experience in the field so long as you have strong SQL and programming skills. To break into data engineering, however, previous experience in data analytics or software engineering is generally helpful.
So if you’re looking to pursue a career in data engineering, this guide is for you to:
learn more about data engineering and the role of a data engineer, and
gain familiarity with the essential data engineering concepts.
What Is Data Engineering?
Before we discuss what data engineering is all about, it's helpful to review the need for it. If you have been in the data space for a while, you’ll be skilled at querying relational databases with SQL and NoSQL databases with SQL-like languages.
But how did the data get there, ready for further analysis and reporting? Enter data engineering.
We know that data comes from various sources in several forms: from legacy databases to user conversations and IoT devices. The raw data has to be pulled into a data repository. To expand: data from the various sources should be extracted and processed before being made available in ready-to-use form in data repositories.
Data engineering encompasses the set of all processes that collect and integrate raw data from various sources into a unified and accessible data repository that can be used for analytics and other applications.
What Does a Data Engineer Do?
Understanding what data engineering is should give you a good idea of what data engineers do on a day-to-day basis. The responsibilities of a data engineer include but are not limited to the following:
Extracting and integrating data from a variety of sources (data collection).
Preparing the data for analysis and other downstream tasks by cleaning, validating, and transforming it.
Designing, building, and maintaining data pipelines that encompass the flow of data from source to destination.
Designing and maintaining infrastructure for data collection, processing, and storage (infrastructure management).
Data Engineering Concepts
Now that we understand the importance of data engineering and the role of data engineers in an organization, it's time to review some fundamental concepts.
Data Sources and Types
As mentioned, we have incoming data from sources across the spectrum: from relational databases and web scraping to news feeds and user chats. The data coming from these sources can be classified into one of three broad categories:
Structured data
Semi-structured data
Unstructured data
Here’s an overview:
| Type | Characteristics | Examples |
| --- | --- | --- |
| Structured data | Has a well-defined schema. | Data in relational databases, spreadsheets, and the like |
| Semi-structured data | Has some structure but no rigid schema. Typically has metadata tags that provide additional information. | JSON and XML data, emails, zip files, and more |
| Unstructured data | Lacks a well-defined schema. | Images, videos and other multimedia files, website data |
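To make the three categories concrete, here is a minimal Python sketch. The sample records, field names, and values are all made up for illustration; the point is only how differently each kind of data can be read.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (the CSV header).
structured = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n")
rows = list(csv.DictReader(structured))

# Semi-structured: records describe themselves with keys,
# but the set of fields can differ from one record to the next.
semi_structured = [
    json.loads('{"user": "a", "tags": ["new"]}'),
    json.loads('{"user": "b", "address": {"city": "Pune"}}'),
]

# Unstructured: raw bytes or free text with no schema to parse against.
# These are the first eight bytes of a PNG image file.
unstructured = b"\x89PNG\r\n\x1a\n"

structured_columns = sorted(rows[0].keys())
semi_structured_keys = [sorted(rec.keys()) for rec in semi_structured]

print(structured_columns)
print(semi_structured_keys)
print(len(unstructured))
```

Notice that both CSV rows share the same columns, while the two JSON records carry different keys. The image bytes have no fields at all until some downstream tool interprets them.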
Data Repositories: Data Warehouses, Data Lakes, and Data Marts
The raw data collected from various sources should be staged in a suitable repository. You should already be familiar with databases—both relational and non-relational. But there are other data repositories, too.
Before we go over them, it'll help to learn about two data processing systems, namely, OLTP and OLAP systems:
OLTP or Online Transaction Processing systems are used to store day-to-day operational data for applications such as inventory management. OLTP systems include relational databases, whose data can later be used for analysis and deriving business insights.
OLAP or Online Analytical Processing systems are used to store large volumes of historical data for carrying out complex analytics. In addition to databases, OLAP systems also include data warehouses and data lakes (more on these shortly).
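The difference between the two is easiest to see in the shape of the queries. The sketch below uses Python's built-in sqlite3 module with a hypothetical orders table. SQLite is not an OLAP system, so treat this only as an illustration of the two workload styles, not of how either kind of system is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, qty INTEGER)"
)

# OLTP-style workload: many small writes, each in its own transaction.
for product, qty in [("widget", 2), ("gadget", 1), ("widget", 5)]:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (product, qty) VALUES (?, ?)", (product, qty)
        )

# OLAP-style workload: a read-only aggregate that scans many rows at once.
totals = conn.execute(
    "SELECT product, SUM(qty) FROM orders GROUP BY product ORDER BY product"
).fetchall()

print(totals)
```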
The choice of data repository is often determined by the source and type of data. Let’s go over the common data repositories:
Data warehouses: A data warehouse is a single, comprehensive storehouse of incoming data.
Data lakes: Data lakes allow you to store all data types, including semi-structured and unstructured data, in their raw format without processing them. Data lakes are often the destination for ELT processes (which we’ll discuss shortly).
Data marts: You can think of a data mart as a smaller subsection of a data warehouse, tailored for a specific business use case.
Data lakehouses: Recently, data lakehouses have also become popular, as they offer the flexibility of data lakes along with the structure and organization of data warehouses.
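As a rough illustration of the warehouse-to-mart relationship, the sketch below models a "warehouse" as one sqlite3 table and a "mart" as a filtered view over it. The table, view, and column names are hypothetical. In practice a data mart is often a separate physical store, so this shows the idea rather than a typical deployment.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "widget", 100.0), ("US", "widget", 250.0), ("EU", "gadget", 40.0)],
)

# A "mart" for a hypothetical EU sales team: only the rows and columns
# that this one business use case needs.
warehouse.execute(
    "CREATE VIEW eu_sales_mart AS "
    "SELECT product, revenue FROM sales WHERE region = 'EU'"
)

mart_rows = warehouse.execute(
    "SELECT product, revenue FROM eu_sales_mart ORDER BY product"
).fetchall()

print(mart_rows)
```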
Data Pipelines: ETL and ELT Processes
Data pipelines encompass the journey of data from the source to the destination systems through ETL and ELT processes.
The ETL (Extract, Transform, and Load) process includes the following steps:
Extract data from various sources
Transform the data—clean, validate, and standardize data
Load the data into a data repository or a destination application
ETL processes often have a data warehouse as the destination.
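The three ETL steps can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the inline CSV stands in for a real source, the "contains an @" check stands in for real validation, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with messy whitespace, casing, and a bad email.
RAW_CSV = """name,email,signup_date
 Ada ,ADA@EXAMPLE.COM,2023-01-05
Bob,not-an-email,2023-02-10
Cy,cy@example.com,2023-03-15
"""

def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean, validate, and standardize each record."""
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        if "@" not in email:  # deliberately simplistic validation
            continue
        cleaned.append((row["name"].strip(), email, row["signup_date"]))
    return cleaned

def load(rows, conn):
    """Load: write the transformed records to the destination."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (name TEXT, email TEXT, signup_date TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

etl_conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), etl_conn)

loaded = etl_conn.execute("SELECT name, email FROM users ORDER BY name").fetchall()
print(loaded)
```

Only clean records reach the destination here: the invalid row is dropped before loading, which is the defining trait of ETL.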
ELT (Extract, Load, and Transform) is a variation of the ETL process in which the last two steps are swapped: the data is extracted, loaded, and only then transformed.
This means the raw data collected from the source is loaded into the data repository before any transformation is applied. This allows us to apply transformations specific to a particular application. ELT processes often have data lakes as their destination.
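For contrast, here is a minimal ELT sketch on similar made-up records. The raw data is loaded untouched first, and the transformation runs later as SQL inside the repository. The in-memory SQLite database and the table names are stand-ins chosen for illustration, not a real data lake.

```python
import sqlite3

lake = sqlite3.connect(":memory:")

# Extract + Load: land the records exactly as they arrived.
raw_records = [
    (" Ada ", "ADA@EXAMPLE.COM"),
    ("Bob", "not-an-email"),
    ("Cy", "cy@example.com"),
]
lake.execute("CREATE TABLE raw_users (name TEXT, email TEXT)")
with lake:
    lake.executemany("INSERT INTO raw_users VALUES (?, ?)", raw_records)

# Transform: applied afterwards, inside the repository, for one application.
lake.execute(
    "CREATE TABLE clean_users AS "
    "SELECT TRIM(name) AS name, LOWER(TRIM(email)) AS email "
    "FROM raw_users WHERE email LIKE '%@%'"
)

raw_count = lake.execute("SELECT COUNT(*) FROM raw_users").fetchone()[0]
clean = lake.execute("SELECT name, email FROM clean_users ORDER BY name").fetchall()

print(raw_count)
print(clean)
```

Because the raw table keeps every record, including the invalid one, a different application can later apply its own transformation to the same data.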
Tools Data Engineers Should Know
The list of tools data engineers should know can be overwhelming.
But don’t worry, you do not need to be an expert at all of them to land a job as a data engineer. Before we go ahead with listing the various tools data engineers should know, it’s important to note that data engineering requires a broad set of foundational skills including the following:
Programming language: Intermediate to advanced proficiency in a programming language, preferably one of Python, Scala, and Java
Databases and SQL: Good understanding of database design and the ability to work with both relational databases such as MySQL and PostgreSQL and non-relational databases such as MongoDB
Command-line fundamentals: Familiarity with shell scripting and data processing at the command line
Knowledge of operating systems and networking
Data warehousing fundamentals
Fundamentals of distributed systems
Even as you are learning the fundamental skills, be sure to build projects that demonstrate your proficiency. There’s nothing as effective as learning, applying what you’ve learned in a project, and learning more as you work on it!
In addition, data engineering also requires strong software engineering skills including version control, logging, and application monitoring. You should also know how to use containerization tools like Docker and container orchestration tools like Kubernetes.
Though the actual tools you use may vary depending on your organization, it's helpful to learn:
dbt (data build tool) for analytics engineering
Apache Spark for big data analysis and distributed data processing
Airflow for data pipeline orchestration
Fundamentals of cloud computing and working with at least one cloud provider such as AWS or Microsoft Azure.
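To get a feel for what pipeline orchestration means before picking up a tool, here is a small sketch using only Python's standard library (graphlib, Python 3.9+). It is not Airflow's API. The task names are hypothetical, and the only idea it shows is the core one: tasks declare their dependencies, and the orchestrator works out a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform"},
}

def run(task, log):
    """Stand-in for real work: just record that the task ran."""
    log.append(task)

run_order = []
for task in TopologicalSorter(pipeline).static_order():
    run(task, run_order)

print(run_order)
```

The two extract tasks may run in either order, but both always finish before the transform step, and loading always comes last. Real orchestrators add scheduling, retries, and monitoring on top of this dependency ordering.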
To learn more about engineering tools including tools for data warehousing and stream processing, read: 10 Modern Data Engineering Tools.
Wrapping Up
Hope you found this introduction to data engineering informative! If designing, building, and maintaining data systems at scale excites you, definitely give data engineering a go.
The Data Engineering Zoomcamp is a great place to start if you are looking for a project-based curriculum to learn data engineering. You can also read through the list of commonly asked data engineer interview questions to get an idea of what you need to know.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.