It’s been fun spending time re-exploring the data platform this week. After months away from the world of big data, I’ve been amazed at how far the tools have come. Two years ago, the support for DevOps tools and practices was very limited. Now, it’s substantially more robust. Don’t get me wrong — you could create workable solutions. It was just harder than necessary.
Azure Data Factory (ADF) was a surprising entry into the big data space. Microsoft actually started the project as code-first, so there was a level of support for DevOps from the beginning. It generates code that users can edit directly, and the environment can be integrated with Git. It wasn’t a perfect solution, but it was great to see them thinking about the problem.
There are a few things to understand about ADF from this perspective. It only supports a single Git repository, which is associated with the complete environment. This makes sense – ADF is designed assuming that everyone with access to a given data factory is on the same team and has the same privileges. Microsoft highlights this in the docs:
The Azure Data Factory team doesn’t recommend assigning Azure RBAC controls to individual entities (pipelines, datasets, etc.) in a data factory. For example, if a developer has access to a pipeline or a dataset, they should be able to access all pipelines or datasets in the data factory.
The product also intends all changes to be deployed at the same time. It does not support partial or selective deployments. In many ways, it takes an extreme view of infrastructure-as-code and atomic, idempotent deployments. This encourages teams to manage any requirements for more granular deployments by creating more data factory instances. Generally, this is not a problem on a technical level or from a pricing perspective. You pay for what you use, so running two pipelines in one factory costs roughly the same as running one pipeline in each of two factories. On a practical level, however, it restricts how many different pipelines and datasets you might include in a given environment.
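The pricing point can be made concrete with a little arithmetic. This is an illustrative sketch only – the flat per-run rate below is an invented placeholder, not a real ADF price – but it shows why splitting pipelines across factories doesn’t change the bill:

```python
# Hypothetical flat price per activity run; real ADF pricing is more granular.
PRICE_PER_ACTIVITY_RUN = 0.001

def monthly_cost(pipelines_per_factory: list[int], runs_per_pipeline: int) -> float:
    """Total spend across all factories. ADF bills per run, not per factory,
    so only the total number of runs matters."""
    total_runs = sum(pipelines_per_factory) * runs_per_pipeline
    return total_runs * PRICE_PER_ACTIVITY_RUN

# Two pipelines in one factory...
one_factory = monthly_cost([2], 1000)
# ...versus one pipeline in each of two factories.
two_factories = monthly_cost([1, 1], 1000)
assert one_factory == two_factories  # same total runs, same spend
```

Because the cost model is per-run, the number of factory instances is free to follow your team and deployment boundaries rather than your budget.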
From a data platform architecture perspective, teams are sometimes driven to group pipelines together so that they can share processes. A better approach is to share the data. Here, ADF is guiding you toward a best practice: storage is often less expensive than the compute time required to reprocess the data. By persisting data once and letting multiple pipelines consume it, you reduce costs and encourage reuse. This also creates an interface. Pipelines have inputs and outputs, and these outputs can become the inputs for other pipelines. As long as the data structure – the interface – remains consistent, other pipelines can consume the values reliably.
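The “share the data, not the process” idea can be sketched in a few lines. This is not ADF code – the schema and field names below are invented for illustration – but it shows how an agreed data structure acts as the contract between an upstream pipeline and its downstream consumers:

```python
from typing import Any

# The agreed interface: downstream pipelines depend on exactly these fields.
CUSTOMER_SCHEMA = {"customer_id", "region", "total_spend"}

def validate(rows: list[dict[str, Any]], schema: set[str]) -> list[dict[str, Any]]:
    """Refuse to publish rows that would break the interface contract."""
    for row in rows:
        if set(row) != schema:
            raise ValueError(f"schema drift on fields: {set(row) ^ schema}")
    return rows

# The upstream pipeline produces and publishes its output once...
published = validate(
    [{"customer_id": 1, "region": "emea", "total_spend": 120.0}],
    CUSTOMER_SCHEMA,
)

# ...and any downstream pipeline can consume it without reprocessing,
# as long as the schema (the interface) stays consistent.
high_value = [row for row in published if row["total_spend"] > 100]
```

The validation step is the key design choice: checking the contract at the point of publication means a schema change fails loudly in one place, instead of silently breaking every pipeline that reads the data.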
That’s enough for the theory. Next up, we’ll dive into something more practical. Implementation!