Ken Muse

Understanding Long-Running Cloud Tasks

Did you know that long-running processes can be challenging? If you’re coming from an on-premises world, then you may not be used to thinking about one of the core tenets of the cloud world: failure is inevitable. The truth is, failure is one of the few constants in development. Systems will fail or restart, code will fail, power will fail, and networks will fail. Especially when running in the cloud, services can be restarted or reprovisioned at any time. Even if the downtime is only a fraction of a second, it can still result in a running process being terminated.

What does this have to do with anything?

Because failure is guaranteed, every long-running process will eventually encounter one. While we can retry requests to external resources, there’s very little we can do if the code is running on a system that restarts or fails. This is why a single, long-running process is something we try to avoid in cloud-native designs.

This situation is the root of an essential practice: try to minimize the amount of time that code is executing. By breaking up a process into multiple short steps, we gain several immediate advantages. First, we minimize the chances a critical piece of code is executing during a failure. Second, we decompose tasks into smaller tasks with inputs and outputs. This makes recovery from failure easier. By breaking a process up into multiple, idempotent activities, we can minimize the processing time and maximize recoverability. The program itself can be thought of as orchestrating a series of activities rather than creating a long-running process. This is the principle behind Azure Durable Functions. At a high level, when we make an asynchronous call, the state is preserved. Each time a call returns, the state is recorded and used as the input to the next step in the process.
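To make the idea of recording state between steps concrete, here is a minimal sketch in Python. It is not how Durable Functions is implemented; it is a simplified illustration that assumes a local JSON file as the state store, and the names `run_with_checkpoints` and `checkpoint_path` are hypothetical.

```python
import json
import os


def run_with_checkpoints(steps, initial_input, checkpoint_path):
    """Run short steps in order, recording each output so a restart can resume."""
    # Load whatever state survived a previous (possibly interrupted) run.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "value": initial_input}

    for index in range(state["next_step"], len(steps)):
        # The recorded output of the previous step is the input to this one.
        state["value"] = steps[index](state["value"])
        state["next_step"] = index + 1

        # Write to a temporary file and then rename it, so a crash in the
        # middle of a write never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["value"]
```

If the process dies partway through, calling the function again with the same `checkpoint_path` skips the steps that already completed. A step that was interrupted before its checkpoint was written runs again, which is why each step needs to be idempotent.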

There is often an additional benefit to this approach. In many cases, we find that iterations within loops could be processed as multiple parallel activities. This can dramatically speed up execution.
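As a small illustration of turning loop iterations into parallel activities, the sketch below uses Python's standard `concurrent.futures` module. The function `process_item` is a hypothetical stand-in for the body of the loop, and the approach assumes the iterations do not depend on each other.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    # Hypothetical stand-in for the work done in one loop iteration.
    return item * item


def process_sequentially(items):
    # The original shape: one iteration after another.
    return [process_item(item) for item in items]


def process_in_parallel(items, max_workers=4):
    # The same iterations, handed to a pool of workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(process_item, items))
```

Threads mainly help when each iteration waits on I/O, such as reading a file or calling a service. For CPU-bound work in Python, a process pool or separate cloud workers would be a better fit.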

As a practical example, let’s consider a long-running process that reads every file in a folder. For each file, it writes a database record. First, we make it idempotent. No matter how many times the code is executed, we should always get the same database entries. This makes sure that we can recover from failure. Next, we break out the steps in order to create smaller, faster tasks:

  • Get list of files from folder
  • For each file, parse the data it contains
  • For each file’s parsed data, make an appropriate database entry
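The three steps above can be sketched as small Python functions. This is only an illustration under stated assumptions: SQLite stands in for the database, the table name `files` and its columns are invented, and `parse_file` is a hypothetical parser that simply counts lines. Idempotency comes from keying each record on the file name, so re-running the code overwrites rows instead of duplicating them.

```python
import os
import sqlite3


def list_files(folder):
    # Step 1: get the list of files (sorted so the order is deterministic).
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


def parse_file(path):
    # Step 2: hypothetical parser; here it only records the line count.
    with open(path, encoding="utf-8") as f:
        return {"name": os.path.basename(path), "line_count": sum(1 for _ in f)}


def save_record(conn, record):
    # Step 3: the primary key on name makes this write idempotent.
    conn.execute(
        "INSERT OR REPLACE INTO files (name, line_count) VALUES (?, ?)",
        (record["name"], record["line_count"]),
    )
    conn.commit()


def import_folder(conn, folder):
    # Assumed schema for this sketch: one row per file name.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(name TEXT PRIMARY KEY, line_count INTEGER)"
    )
    for path in list_files(folder):
        save_record(conn, parse_file(path))
```

Running `import_folder` twice against the same folder leaves the table with one row per file, which is the property that makes a simple re-run a valid recovery strategy.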

These steps now outline the orchestration. The list of files could be sent to a queue (such as a Storage Queue). An Azure Function could then process each queue item to parse the data. If the database entries are fast enough, we may create those in the same Function. This approach is often called a fan-out, since it sends the data to multiple, parallel processors. As an alternative to these steps, we could use a Durable Function as the orchestrator for these tasks. Under the covers, it already utilizes a queue to support this kind of parallelization. Microsoft provides an example of a Durable fan-out. Fan-out operations enable the steps to occur in parallel, providing faster processing and higher scale.
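The fan-out pattern can be imitated in a single process to show its shape. The sketch below is not the Azure SDK: an in-memory `queue.Queue` stands in for the Storage Queue, threads stand in for Function instances, and `fan_out` and `handler` are hypothetical names. It assumes each item is hashable (for example, a file path string).

```python
import queue
import threading


def fan_out(items, handler, worker_count=3):
    """Send items to a queue and let several workers process them in parallel."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # Queue drained; this worker is done.
            try:
                outcome = handler(item)
                with lock:
                    results[item] = outcome
            except Exception as error:
                # Record the failure instead of stopping the other items.
                with lock:
                    failures[item] = error

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, failures
```

Because failures are collected per item, a later run can pass only the keys of `failures` back into `fan_out` rather than repeating the whole batch.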

With just a little planning, our code is now faster and more resilient to failure. In a worst-case scenario, we re-run the code to automatically recover. This may even be as simple as only processing the failed files. In a best-case scenario, the code is now substantially faster and lighter, enabling the process to complete sooner while consuming fewer resources.