Data - Ken Muse

Doing DevOps With Databricks

Thu, 03 Nov 2022 00:00:00 -0400

Databricks is an exciting and powerful platform for creating solutions that can process big data into actionable content. Originally, it lacked several features needed to fully automate it, which limited the ability to integrate it into a holistic DevOps practice. Thankfully, those days are long past. Those limitations have been removed, making it possible to utilize the platform more fully.

There are three parts to the DevOps story with this platform: infrastructure, notebooks, and jobs. Over the last few years, the Databricks team has worked hard to build up the DevOps story around these aspects. Today, we’ll explore some of those features!

Implementing DevOps for Azure Data Factory

Thu, 27 Oct 2022 00:00:00 -0400

There’s a lot of documentation around using Azure Data Factory, but surprisingly little on implementing DevOps practices. There’s even less when it comes to implementing an automated workflow. Typically, the system expects you to design and build the pipelines within the provided user interface, then press Publish.

The Native CI/CD Process

Under the covers, edits in ADF create a series of JSON files which store the configuration. Changes made in the portal are saved to these files. When the configuration is published, those JSON documents are used to generate ARM templates. If ADF is configured to use a Git repository, a copy of the published templates is pushed to the publish branch (typically, adf_publish). This branch can then be reused for automation and deployment to other environments.
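
As a rough illustration of reusing the published templates for another environment, here is a minimal Python sketch that overrides values in an ARM parameters document before deployment. The function name `override_parameters`, the factory names, and the sample document are all hypothetical. In practice the input would be the parameters file generated on the publish branch (assumed here to be something like ARMTemplateParametersForFactory.json), and the result would be handed to whatever deployment tooling you use.

```python
import copy
import json


def override_parameters(parameters_doc: dict, overrides: dict) -> dict:
    """Return a copy of an ARM parameters document with selected values replaced.

    Hypothetical helper: fails loudly if an override names a parameter that
    the published document does not define, to catch typos early.
    """
    result = copy.deepcopy(parameters_doc)
    params = result.setdefault("parameters", {})
    for name, value in overrides.items():
        if name not in params:
            raise KeyError(f"Parameter not present in published parameters: {name}")
        params[name]["value"] = value
    return result


# Illustrative stand-in for the parameters file found on the publish branch.
published = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {"value": "adf-dev"},  # assumed development factory name
    },
}

# Point the same published template at a (hypothetical) test factory.
test_parameters = override_parameters(published, {"factoryName": "adf-test"})
print(json.dumps(test_parameters, indent=2))
```

Copying the document rather than editing it in place keeps the published parameters untouched, so the same source can be reused to produce one parameter set per target environment.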

Azure Data Factory DevOps

Thu, 20 Oct 2022 00:00:00 -0400

It’s been fun spending time re-exploring the data platform this week. After months away from the world of big data, I’ve been amazed at how far the tools have come. Two years ago, the support for DevOps tools and practices was very limited. Don’t get me wrong: you could create workable solutions. It was just harder than necessary. Now, that support is substantially more robust.

Azure Data Factory (ADF) was a surprising entry into the big data space. Microsoft actually started the project as code-first, so there was a level of support for DevOps from the beginning. It generates code that can be edited and modified by users, and the environment could be integrated with Git. It wasn’t a perfect solution, but it was great to see them thinking about the problem.

Azure SQL Database Ledger

Wed, 23 Mar 2022 00:00:00 -0400

Need to prove that your database was not tampered with? Azure SQL Database Ledger can help! Learn how a database blockchain makes that possible.

Intro to Data Lake Storage

Wed, 06 Oct 2021 00:00:00 -0400

Learn how Azure Data Lake Storage Gen2 enhances the experience and creates a foundation for big data.

Understanding Modern Data Warehouse Storage

Mon, 04 Oct 2021 00:00:00 -0400

Learn the options and considerations for storing data in a modern data warehouse.

Modern Data Warehouse Ingestion

Mon, 06 Sep 2021 00:00:00 -0400

Data ingestion is the first step in the journey of creating a modern data warehouse. Essentially, this is the process of transporting data from one or more sources to storage, allowing it to be accessed and analyzed. The ingestion process is sometimes split into two aspects: the service responsible for receiving the content and the storage solution. We’ll explore both aspects in this series.

Ingestion is the foundation of data processing and analytics. Although it is often the least discussed component, it is frequently the one most directly responsible for determining the performance and cost of the overall system. The reason is that dealing with big data is a tradeoff between compute, storage, and latency. These tradeoffs require careful balancing, which means selecting the tool or technology that is optimized for the problem being solved.

Introduction to the Modern Data Warehouse

Mon, 09 Nov 2020 00:00:00 -0500

In the past, the traditional data storage mechanisms were often cleanly divided between file storage, NoSQL and relational transactions, and data warehouses. The data warehouse was often a monolithic system, servicing the needs of both customers and internal stakeholders. With the explosion of data, the days of the single-system approach have come to an end. For the modern data practitioner, it’s critical to consider the advantages of a cloud-hosted environment to dynamically support the growing data storage needs. As a result, you often find yourself having to rely on the strengths of multiple components rather than any single system. Over time, patterns have emerged which optimize this approach and ensure it remains manageable. The dominant approach is the Modern Data Warehouse (MDW).