Intro to Data Lake Storage

Category: Azure, Data

Tags: #Azure #Storage

Published: October 6, 2021 Reading Time: 5 min

This is a post in the series Intro to MDW. The posts in this series include:

Nov 9, 2020 - Introduction to the Modern Data Warehouse
Sep 6, 2021 - Modern Data Warehouse Ingestion
Oct 4, 2021 - Understanding Modern Data Warehouse Storage
Oct 6, 2021 - Intro to Data Lake Storage

Sometimes Azure storage is all we need to successfully hold our data. Other times, we need a little something extra. This brings us to another option: Azure Data Lake Storage Gen2, also called ADLSg2. ADLSg2 is very similar to traditional Azure blob storage. In fact, it’s built on top of that platform. It adds a few features, including POSIX permissions and support for additional APIs (such as Azure Blob File System, or ABFS). Microsoft has documented that the HDFS-compatible file system supports both atomic operations and 1 TB/s throughput. In short, it can handle files well. You can read more about the best practices for maximizing the performance of ADLS.

It’s worth a quick mention that ADLSg2 and ADLS Gen1 are two very different products with different behaviors. ADLSg1 promised unlimited scale and storage (while ADLSg2 has limitations based on Azure Storage). ADLS Gen1 is being retired in February 2024, so I don’t recommend it. Stick to Gen2.

A key feature is the introduction of hierarchical namespace support. In simple terms, we can now have folders. This is a key difference that is often overlooked with traditional Azure storage. Most people mistakenly think that traditional Azure blob storage supports folders. In truth, it is more like a simple key/value storage.

As an example, I want to create a folder (“myfolder”) with a single file in it (“myfile.txt”). In traditional Azure Storage, the file name is actually myfolder/myfile.txt. There is no folder; it’s just a name and a convention. The SDKs and tools split on the “/” to make it seem like folders exist. While this works for most cases, it has a limitation. If I want to list every file in “myfolder”, that means I need to read the name of every file in the container, parse the file name, and then determine which files share a common prefix, myfolder/. That’s a lot of reading and parsing! It also makes deleting a “folder” quite challenging in practice, since every matching file has to be found and removed individually.
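To make the prefix convention concrete, here is a minimal sketch in plain Python. It does not call any Azure SDK; the list of names and the helpers `list_folder` and `delete_folder` are hypothetical stand-ins that only illustrate how a flat key/value namespace behaves.

```python
# Hypothetical blob names in a single container. In a flat namespace the
# "/" is just a character in the name, not a real directory separator.
blob_names = [
    "myfolder/myfile.txt",
    "myfolder/notes.txt",
    "otherfolder/data.csv",
    "readme.txt",
]

def list_folder(names, folder):
    """Scan every name and keep the ones that share the prefix '<folder>/'."""
    prefix = folder.rstrip("/") + "/"
    return [name for name in names if name.startswith(prefix)]

def delete_folder(names, folder):
    """'Deleting a folder' means finding and dropping each matching name."""
    doomed = set(list_folder(names, folder))
    return [name for name in names if name not in doomed]

print(list_folder(blob_names, "myfolder"))
print(delete_folder(blob_names, "myfolder"))
```

Both helpers walk the full list of names, which is the cost described above: the work grows with the number of files in the container, not with the size of the "folder".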

With hierarchical namespaces, we gain the ability to have real folders. The file myfile.txt is located in the namespace (folder) myfolder. I can easily list the contents of the folder to retrieve the files and folders within it. In fact, I can perform most of the traditional operations associated with folders, such as delete, rename, and move. The tradeoff for this extra flexibility is a slight additional charge for operations.
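As a rough illustration of why folder operations become cheap, the sketch below models a hierarchical namespace as nested dictionaries. This is a toy model and not how ADLSg2 is implemented; `root`, `list_dir`, and `rename_dir` are hypothetical names.

```python
# A toy hierarchical namespace: folders are dicts, files are byte strings.
root = {
    "myfolder": {"myfile.txt": b"hello", "notes.txt": b"world"},
    "otherfolder": {"data.csv": b"1,2,3"},
}

def list_dir(tree, path):
    """List the entries directly inside the folder at 'path'."""
    node = tree
    for part in [p for p in path.split("/") if p]:
        node = node[part]  # walk down one folder at a time
    return sorted(node)

def rename_dir(tree, old, new):
    """Rename a top-level folder by moving one entry; its files are untouched."""
    tree[new] = tree.pop(old)

rename_dir(root, "myfolder", "archive")
print(list_dir(root, "archive"))
print(list_dir(root, ""))
```

In this model, listing a folder only touches that folder, and a rename is a single operation no matter how many files the folder holds.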

As another example, I have 10,000 files organized in folders by year and month and want to find all of the files in 2019/10. With traditional blob storage, I would need to download the names of the 10,000 files and filter on the ones that start with “2019/10/”. With ADLSg2, I can request the list of files in the folder 2019/10. Combined with the various big data tools, such as Azure Databricks or Azure Synapse Analytics, I can quickly retrieve a partitioned subset of files. I can gain a performance boost by being able to filter on the folders before I read the files. This benefit also applies to Azure Data Factory. Under the covers, all of these tools rely on being able to read folder structures to maximize performance. As a result, they work better when they don’t need to start by listing every file being stored.
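For the year/month layout above, a tool only needs the path of the one folder it cares about. The sketch below builds that path using the `abfss://` URI form that the ABFS driver accepts. The container and account names are placeholders, and `partition_path` is a hypothetical helper, not part of any SDK.

```python
def partition_path(container, account, year, month):
    """Build an abfss:// URI that points at a single year/month folder."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{year:04d}/{month:02d}/"
    )

# "mycontainer" and "myaccount" are placeholder names for illustration.
path = partition_path("mycontainer", "myaccount", 2019, 10)
print(path)
```

A Spark-based tool could then be pointed at that one folder instead of the whole container, so only the October 2019 files are listed and read.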

Knowing that, why would we consider regular blobs? Nearly every Azure service supports traditional Azure Storage blobs. As a result, blob storage is often a solution for receiving or ingesting files directly. If the files don’t need the query support or are accessed directly by name, then the lower-cost blob storage can be ideal. Another use is integration with Azure Functions. Functions can reliably process events for files in blob storage (although you may need to switch to Event Grid triggers when dealing with high-scale storage that holds more than 100,000 blobs or receives more than 100 updates per second). Functions can also directly bind to blob sinks. While ADLS is still not completely supported, many events are available via Event Grid.

If you’re doing additional data processing that relies on using folder structures to organize data, then ADLS is essential. This storage is also ideal for implementing a medallion architecture and for data segregation. While I won’t dive into that much today, the medallion model is a practice of organizing the data based on its level of maturity and processing. Traditionally, bronze represents raw, ingested data. Silver represents data which has been refined, cleaned, and augmented. Finally, gold represents aggregated data stores and features which are ready to be loaded for querying. This provides us with a separation of reads and writes, enabling data to be prepared, staged, and loaded into queryable storage solutions. In Spark systems (including Databricks), this is often implemented using Delta Lake. The pattern itself is a variation of the zone model. These approaches isolate partitioning that may be optimal for writes from the partitioning required to optimize for querying, analytics, and data loading.
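Here is a small sketch of the bronze, silver, and gold stages using plain Python lists and dictionaries. The sample rows, the field names, and the helpers `to_silver` and `to_gold` are all made up for illustration; a real pipeline would read and write files in the lake, typically with Spark.

```python
from collections import defaultdict

# Bronze: raw, ingested records kept exactly as they arrived (made-up data).
bronze = [
    "2019-10-01, widget ,3",
    "2019-10-01,gadget,2",
    "bad row",
    "2019-10-02,widget,5",
]

def to_silver(raw_rows):
    """Refine and clean: parse fields, trim whitespace, drop malformed rows."""
    silver = []
    for row in raw_rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip rows that do not match date,item,quantity
        silver.append({"date": parts[0], "item": parts[1], "qty": int(parts[2])})
    return silver

def to_gold(silver_rows):
    """Aggregate: total quantity per item, ready to load for querying."""
    totals = defaultdict(int)
    for row in silver_rows:
        totals[row["item"]] += row["qty"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each stage reads only from the one before it, so the raw data stays untouched and the cleaned or aggregated layers can be rebuilt at any time.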

In both cases, accessing and reading files is a relatively intensive operation. The system is optimized to be read in larger blocks. In fact, best throughput is achieved when blob and block sizes are greater than 4 MiB. Reading lots of small files is typically slower. Azure does support some smaller files; if the file sizes are at least 256 KiB, the premium storage tier can improve the performance (see the comparison of storage tiers). Generally speaking, when files are smaller, the throughput and performance can drop significantly. This is why most ingestion systems will capture and aggregate data into larger files before committing it to storage. Event Hubs Capture does this. Behind the scenes, data may be stored in-memory or to unoptimized storage. When enough data has arrived (or enough time has passed), the data is bulk-persisted into a larger file in the lake. This is another benefit of zone architectures. They allow you to periodically process raw data to create another zone with optimized storage. All of that said, zones are a topic for another time.
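The "enough data or enough time" rule can be sketched as a small in-memory buffer. This is only an illustration of the idea, not how Event Hubs Capture works internally. `BatchBuffer` is a hypothetical class, the 4 MiB threshold comes from the guidance above, and the 300-second window is an arbitrary assumption.

```python
import time

class BatchBuffer:
    """Collect small records in memory and flush them as one larger chunk."""

    def __init__(self, max_bytes, max_seconds, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.records = []
        self.size = 0
        self.started = clock()
        self.flushed = []  # stands in for larger files written to the lake

    def add(self, record: bytes):
        if not self.records:
            self.started = self.clock()  # the window starts with the first record
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.started >= self.max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        """Persist the buffered records as a single chunk and reset the buffer."""
        if self.records:
            self.flushed.append(b"".join(self.records))
        self.records = []
        self.size = 0
        self.started = self.clock()

buffer = BatchBuffer(max_bytes=4 * 1024 * 1024, max_seconds=300)
for _ in range(5):
    buffer.add(b"x" * (1024 * 1024))  # five 1 MiB records
buffer.flush()  # persist whatever is left over
print([len(chunk) for chunk in buffer.flushed])
```

The first four records are combined into a single 4 MiB chunk, and the final `flush()` writes the remaining record. In a real system, the flush would write a file to the lake rather than append to a list.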

Until then, hope you have fun exploring storage!