Ken Muse

Unexpected Optimizations

Sometimes, what we think we know about coding can work against us. I want to explore that concept just a bit, because it’s an important concept to understand when working with cloud-scale systems and big data. I saw a coding challenge that asked candidates to write code to do the following:

Write a function that calculates the sum of all numbers between zero and a specified value which are evenly divisible by 3 or 5.

This is a fairly simple programming task, which is normally solved with a simple loop and the use of a modulus (%) operator. In C#, a typical solution might look like this:

 1public static long SumNumbers(int max)
 3    long total = 0;
 4    for (int i = 3; i <= max; i++)
 5    {
 6        if (i % 3 == 0 || i % 5 == 0)
 7        {
 8            total += i;
 9        }
10    }
12    return total;

Traditionally, we try to minimize the number of loops and iterations to get better performance. As we move through the data, we select and process the parts we care about. This process assumes that the real cost of the operation is the expense of the iterations. Yes, I’m ignoring the fact that there is an advantage in creating code that is easy to read and maintain.

What if I told you that it might be faster to iterate through the numbers twice? In the first loop, the code evaluates all of the numbers divisible by 3. In the second loop, we add all of the numbers divisible by 5 (except those which are also divisible by 3). This may seem like an odd assertion, but it’s true!

The first and most obvious improvement occurs because we can choose to increment the loops by 3 and 5 (respectively) instead of iterating through all of the numbers. This would mean processing only 53% of the numbers. Reducing the total work will always improve the performance. This can also avoid the need for a modulus operation, which is slightly more expensive than simple addition operations.

The less obvious improvement comes from an additional option – parallelism. Having two separate loops means that we can run each task separately, taking advantage of the multiple cores in modern processors. This allows both activities to occur simultaneously. We’re getting multiple advantages from the processor.

To illustrate, examine this simple benchmarking of the various approaches:

SumNumbers1,576.7 μs1.00Original
SumNumbersParallel1,371.2 μs0.93Parallel loops incremeting through every number
SumNumbersSerial462.0 μs0.29Two loops, optimized to count by 3 or 5
SumNumbersParallelWithMod410.5 μs0.26Parallel incrementing loops, with modulus
SumNumbersParallelOptimal363.0 μs0.25Parallel incrementing loops, without modulus

Notice that even using unoptimized loops in parallel gave a 14% performance improvement. Optimizing the code to increment by 3’s and 5’s in two separate loops (reducing the number of iterations) was nearly 3x faster.

Why is this important? Because it illustrates that the basic rules we tend to follow as programmers have exceptions which can significantly impact the performance of the code. Cloud environments are optimized to be highly scalable and expect parallel operations. As a result, code will benefit even more from parallelism. At the same time, breaking down a task in this way can lead to considering ways to optimize the code performance. This also illustrates an aspect of the cloud and big data: given enough parallelism, it is sometimes better to compute results in parallel, then discard any results that are no longer needed.

As an example, consider a situation where I want to offer a 20 percent discount on all keyboards that were ordered if I receive at least 50,000 orders. I want to write some code to calculate the total sale. In simple C#:

1if (CalculateNumberOfOrders() > 50,000)
3    return CalculateTotalSalesWithDiscounts();
7    return CalculateTotalSalesWithoutDiscount();   

Traditionally, we follow this approach to avoid running expensive calculations. We execute a method to return a result which allow us to select another method optimized for the result. With modern systems, it may actually be faster to run all three calculations in parallel. Based on the calculated number of orders, we could then return the value from the appropriate function and discard the results from the other. At the same time the application is counting the orders, the two results are also being calculated. We eliminate sequential processing to get a faster result. While this does require more total processing time, it reduces the time needed to retrieve the correct results.

At the lowest levels, this is actually how modern CPUs work. A CPU utilizes branch prediction to guess which code will execute next. It then executes the selected code in parallel. If the CPU predicts correctly, the results are available sooner from the parallel execution. If the prediction is wrong, then the CPU starts the correct code in parallel and cancels the other task. The overall performance in this case is identical to the sequential operation. The end result is that on average, the overall application appears to run faster.

At higher levels, this logic is part of how tools like Cosmos DB and Databricks optimize their queries. By executing calculations in parallel and discarding any unneeded results, these systems maximize the utilization of the nodes and improves the overall response time. In truth, they go further and utilize both node-level and hardware-level parallelism to get the best performance.

This is also one reason why Azure Functions are optimal when they are used for short, discrete tasks. It forces you to think in terms of smaller operations and potential parallelism. When performance matters, multiple Function invocations can be utilized which each handle portions of the problem. As calculations complete, the correct set of results can be collected by other Functions. The other results are ignored or discarded.

In short, sometimes the best optimizations for perceived performance involve running the code in a less-than-optimal fashion. We execute code knowing that while it may be wasteful in one way, it is beneficial in another. Ironically, the best way to improve the performance of a system can be to process code that we intend to ignore.

Fun food for thought! :-)