isaiah - Stacks Dev Journal - THREAD ALL OF THE THINGS

Visually stacks are organized like containers. Each container contains other stacks, some of which might be containers themselves. Unsurprisingly this is modeled internally as a tree. Each tree has many branches, some of which might have branches themselves.

There are dozens of reasons to traverse the tree: to search it for specific objects or properties, to modify it in a dozen different ways, or to collect data from it.

Recently I rewrote one of the slow data-collection traversals. The goal is to collect a bit of data from each node of the tree. But in this case the data from each parent node is just the concatenation of all of the child-node data. When the data of the parent relies on the child a depth-first traversal is required.

Until recently this particular traversal was infrequent and not much thought was given to optimizing it. However recent changes make it part of a critical path. It’s slowing other parts of the user experience and that means it’s time to optimize!

The current method of traversal is a vanilla for loop. It accumulates the collects the data from its children and concatenates before returning the result to its parent. At the leaf nodes there’s a computationally expensive task.

This cries out for concurrency. The computationally expensive tasks are being run one at a time. I have a 12-thread Mac Pro on my desk and even on older hardware most users can run 4 threads. Most of those CPUs are sitting idle while just one is running all those expensive tasks. Why not use that hardware?!

warning: Obj-C ahead. Stacks is a 3rd-party plugin to RapidWeaver. Stacks can’t predict which version of the Swift runtime will be included in RapidWeaver, or if it will be included at all. Until a stable Swift runtime ships in macOS, plugins like Stacks can’t be written in Swift.

So, as a starting place I converted all the single-thread for loops into enumerateObjectsAtIndexes:Options: calls with Options set to NSEnumerationConcurrent. Returned data comes back in random order now, but for this particular task the collected data can easily be reordered later without any extra work.

First test: Measure the throughput of the old vs. the new methods. I made a relatively small test case and repeated it 1000 times to get a reliable average measurement. Results: the speedup is about 8x. Not quite 12x (the number of threads available on this machine), but substantial. Even on minimal hardware the speed improvement should be obvious.

Second test: Try it out. Actually open a slow file in RapidWeaver and see how fast it is. Uh oh, the speedup’s gone. In fact the slow file is now slower than ever. Noticeably slower. Maybe 2x-3x slower. Ugh! What happened?!

After a day of analysis the results are that the overheads of constructing and managing operation queues for each node scale geometrically with both the width and depth of the tree. In other words a large real world tree with depth of about 15 and width over 30-50 has so many operation queues that the the CPUs are hardly ever working on the original problem. Most of the effort is spent allocating, managing, and deallocating queues.

The takeaway is that blindly converting a complex task to concurrent operations is a bad idea. And more specifically using NSEnumerationConcurrent in a recursive task is a bad idea. It’s best to call enumerateObjectsAtIndexes only once (or just a few times) on all the data. The operations themselves are very efficient but creating, managing, and deallocating queues is expensive.

As a final note my eventual use is a hybrid. I gather up the expensive leaf node tasks in a simple linear for loop. Then run all the tasks in a single concurrent enumeration. The result is a bit less elegant in code, but is actually faster than the first try. The real world speedup is significant, but not huge: about 3x on my 12-thread machine. And there is some memory overhead as well. It should be noticeable on slower hardware, so I’ll keep it, but probably not a huge speedup for most users. I’ll have to keep hunting for more optimizations to make this plugin as fast as it should be.