Designing an Effective Data Pipeline: Essential Questions I Always Ask

Image credit — https://www.linkedin.com/pulse/art-asking-questions-bala-pitchandi/

When embarking on a data engineering project, it’s natural to focus on designing the data pipeline around the requirements presented by the consumer. However, it’s equally, if not more, crucial to dig deeper and ask questions that may not be explicitly stated in the requirements document. By proactively seeking this additional information, you can ensure that your data pipeline is not only aligned with immediate needs but also scalable and adaptable to future growth. My work is primarily in batch processing, and below are the questions I ask my data consumers before designing a data solution for them.

Database or Data Warehouse?

Understanding the purpose of the data storage is crucial. Rather than assuming the consumer is familiar with concepts like OLTP and OLAP, I like to ask questions such as:

  1. What’s the purpose of the data storage? Are you looking for a data solution that serves operational needs, or a solution that serves complex analytical needs and reporting? (A small illustration of the difference follows this list.)
  2. Are you looking for a real-time data processing solution or a solution that caters to historical analysis and creating reports as time progresses?
  3. Are you looking to integrate and consolidate data from multiple sources? Will there be transformations and/or aggregations made on the dataset?
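
To make the operational-versus-analytical (OLTP versus OLAP) distinction concrete, here is a minimal sketch; the table and column names are invented purely for illustration, and the same contrast applies whichever database or warehouse you end up choosing.

```python
import sqlite3

# Hypothetical orders table, used only to contrast the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 101, 25.0, "2024-01-01"), (2, 102, 40.0, "2024-01-01"), (3, 101, 15.0, "2024-01-02")],
)

# OLTP-style (operational): fetch a single record to serve an application request.
print(conn.execute("SELECT * FROM orders WHERE order_id = 2").fetchone())

# OLAP-style (analytical): scan and aggregate history for reporting.
print(conn.execute(
    "SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue "
    "FROM orders GROUP BY order_date"
).fetchall())
```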

Nature of data load

Understanding how you need to load the data governs your overall pipeline logic. I like to ask the following questions on this topic —

  1. Imagine we have a live data solution that runs on a schedule. With each completed cycle, do you need the data team to overwrite all the previous data in the system with the incoming load?

I like to ask this question because, based on the answers I receive, I can delve into sub-questions. For example, if a customer answers that they would like to keep a history of data changes, I ask the below follow-up questions —

a. Will the load be delta in nature?
b. Would you need to see the history of all the changes made to the data? Hint — SCD Type 2 (a minimal merge sketch follows this list).
c. Would you need to see the most recent change to the data but not a history of all the changes? Hint — SCD Type 3.

  2. If a record is no longer present at the data source, do you require the data team to remove that record from the final data store?
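
If the consumer wants a full change history, a delta load typically feeds an SCD Type 2 table: the previous version of a changed record is expired and a new version is appended. Below is a minimal sketch of that idea in pandas; the dimension, the column names (city, valid_from, valid_to, is_current), and the dates are all invented for illustration, and a production pipeline would more likely express this as a warehouse MERGE or a Delta Lake merge.

```python
import pandas as pd

# Hypothetical current dimension: one row per customer version; is_current flags the active one.
dim = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Berlin", "Paris"],
    "valid_from": ["2024-01-01", "2024-01-01"],
    "valid_to": [None, None],
    "is_current": [True, True],
})

# Incoming delta load: customer 2 has changed, customer 3 is new.
incoming = pd.DataFrame({"customer_id": [2, 3], "city": ["Lyon", "Madrid"]})
load_date = "2024-02-01"

merged = incoming.merge(dim[dim["is_current"]], on="customer_id", how="left", suffixes=("", "_old"))
changed = merged[merged["city_old"].notna() & (merged["city"] != merged["city_old"])]
new = merged[merged["city_old"].isna()]

# Expire the current version of changed records ...
dim.loc[dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"],
        ["valid_to", "is_current"]] = [load_date, False]

# ... and append a fresh version for changed and brand-new records.
additions = pd.concat([changed, new])[["customer_id", "city"]].assign(
    valid_from=load_date, valid_to=None, is_current=True
)
dim = pd.concat([dim, additions], ignore_index=True)
print(dim)
```

An SCD Type 3 answer, by contrast, would simply keep a previous_city column next to city, so only the most recent change is visible.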

Data Quality Checks

Ensuring data quality is essential for end-user satisfaction. If your data isn’t at the quality expected by the end-user, your work is incomplete. Here, I like to ask the below questions —

  1. How do you want me to handle missing or incomplete values? Would you like me to remove those records, or would you like to keep them with the missing values replaced by NULL?
  2. What are the data elements that are critical for your reporting?
  3. If there are any missing values in the critical data elements, how do you want me to handle them?
  4. If the data elements themselves are not present in the dataset in the first place, would you like me to terminate the pipeline and notify the users to rectify the dataset? (See the validation sketch after this list.)
  5. Are there any metrics or measures I need to implement during data processing to check whether the data being fetched from the source is of the right quality?
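
To make these choices concrete, here is a minimal, hedged validation sketch in pandas; the critical column names and the quarantine/notification behaviour are assumptions for illustration, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical critical columns agreed with the consumer (question 2).
CRITICAL_COLUMNS = ["order_id", "order_date", "amount"]

def validate(batch: pd.DataFrame) -> pd.DataFrame:
    """Apply the quality rules from the questions above to one incoming batch."""
    # Question 4: if a critical column is absent from the dataset altogether,
    # fail fast so the pipeline terminates and users can be notified to rectify the source.
    missing_cols = [c for c in CRITICAL_COLUMNS if c not in batch.columns]
    if missing_cols:
        raise ValueError(f"Critical columns missing from source: {missing_cols}")

    # Questions 1 and 3: rows with gaps in non-critical fields are kept (the gaps stay NULL),
    # while rows missing critical values are quarantined and removed.
    bad = batch[batch[CRITICAL_COLUMNS].isna().any(axis=1)]
    if not bad.empty:
        print(f"Quarantining {len(bad)} rows with missing critical values")
    return batch.dropna(subset=CRITICAL_COLUMNS)

# Question 5: a simple metric comparing row counts before and after validation.
raw = pd.DataFrame({"order_id": [1, 2, None], "order_date": ["2024-01-01"] * 3, "amount": [10.0, None, 5.0]})
clean = validate(raw)
print(f"Rows in: {len(raw)}, rows out: {len(clean)}")
```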

Schedule of Pipeline

Once the pipeline is developed and migrated to production, it needs to run on a schedule. Here, it’s important to understand the user’s expectations about timelines.

Why?

  1. In many cases, business users would like to check refreshed data before 9 AM. This means your pipeline needs to finish loading data before 9 AM, or it should notify users about a disruption during the data load.
  2. Conversely, a dataset may not always be available at the scheduled time. In this case, the data pipeline needs to handle event-based execution. This matters because a business user may want to trigger the pipeline at any given time and check the updated result (a minimal sketch of both ideas follows this list).
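
As an illustration only, here is a minimal sketch in plain Python of the two behaviours above: alerting when a load is disrupted or finishes after the agreed deadline, and triggering the same load from an event. In practice the schedule, SLA alerting, and event triggers would live in an orchestrator (for example Azure Data Factory, Databricks Workflows, or Airflow); the 9 AM deadline and the notify stub are assumptions taken from the discussion above.

```python
from datetime import datetime, time

DEADLINE = time(9, 0)  # Business users expect refreshed data before 9 AM.

def notify(message: str) -> None:
    # Stub: in practice this could be email, Teams/Slack, or the orchestrator's own alerting.
    print(f"[ALERT] {message}")

def run_scheduled_load(run_load) -> None:
    """Run the batch load and alert users if it is disrupted or finishes late."""
    try:
        run_load()
    except Exception as exc:
        notify(f"Data load failed: {exc}")  # Disruption during the load.
        raise
    if datetime.now().time() > DEADLINE:
        notify("Data load completed after the 9 AM deadline; reports may be stale.")

def on_new_data(path: str, run_load) -> None:
    """Event-based execution: trigger the same load when a new dataset lands."""
    notify(f"New dataset detected at {path}; triggering an ad hoc refresh.")
    run_load()
```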

Scalability and Future Growth

Consider the future of the data source and its impact on the pipeline. Before designing one, I like to understand what the data source is going to look like down the road, so I ask the below questions —

  • Will additional datasets be added to the data source? Will data volume expand?
  • Are there plans to integrate new data sources into the pipeline?
  • How can the pipeline be designed to accommodate evolving requirements without incurring additional costs or frustrating end-users? (One common answer, a configuration-driven design, is sketched below.)
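
One pattern I lean on, sketched below purely as an illustration, is to make the pipeline configuration-driven: adding a dataset or a new source then becomes a configuration change rather than a code rewrite. The source names, paths, and the loader stub are all hypothetical.

```python
# Hypothetical configuration: onboarding a new dataset or source later means
# adding an entry here, not rewriting the pipeline code.
SOURCES = [
    {"name": "orders", "path": "landing/orders/", "format": "parquet"},
    {"name": "customers", "path": "landing/customers/", "format": "csv"},
]

def ingest(source: dict) -> None:
    # Stub loader: a real pipeline would read source["path"] with the right reader
    # and write the result to the warehouse.
    print(f"Loading {source['name']} ({source['format']}) from {source['path']}")

def run_pipeline() -> None:
    for source in SOURCES:
        ingest(source)

run_pipeline()
```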

What’s the point of the above questions?

Questions like these help build a robust pipeline that can weather evolving requirements. Designing a data pipeline without forethought about how requirements might evolve often results in additional expenditure: revamping the data logic, adding new compute resources to handle the growing data volume, and leaving end users frustrated.

Conclusion

By asking these key questions before designing a data pipeline, you can ensure that your solution is not only aligned with immediate needs but also scalable and adaptable to future growth. Consider the purpose of the data storage, the nature of data load, data quality checks, pipeline scheduling, and scalability for future expansion. By addressing these aspects, you can create a robust and future-proof data pipeline that meets your organization’s evolving data needs.
