Performant Code Is Not Always Idiomatic
When implementing an algorithm in a modern language like Crystal[^1] (or Gleam[^2], for that matter), using the built-in idioms lets me get to a working solution quickly. Such code tends to be compact and, once you’re familiar with the idioms, easy to understand.
However, naively implementing low-level algorithms under functional-programming constraints often doesn’t lead to good performance. Such high-performance code tends to be written bottom-up and to be verbose.
As someone who learned to program with x86 assembly and C[^3], I have a strong affinity for low-level programming that is almost always imperative. Nonetheless, over time I’ve learned that well-defined abstractions and idioms are key to being effective, especially when taking into account how much computers and compilers have advanced. Functional idioms are great, and I love that Crystal (following on Ruby’s heels) lets me write imperative programs with functional characteristics.
Still, I like to try and see how close I can get a functional implementation to the performance of an optimal imperative one. It’s fun to see what we have to give up and what we can gain.
RLE
We’ll use run-length encoding to “compress” a text string, replacing repeating characters with a character and a count according to the following rules:

- If a character C occurs once, write it once.
  - Except if C is `{`: write `{{1}` as a special case, essentially encoding `{` with a count of 1.
- If C repeats N times, encode it as `{`, C, N, `}`.
  - Except if N < 4: consider just repeating C, since the minimum encoding of repeating characters takes up 4 characters.
  - Except if C is `{`: always use the encoded form.
Here are some examples of the encoding:

- `{q9}` means `q` repeats 9 times
- `{413}` means `4` repeats 13 times
- `x` means `x` occurs once
- `{{1}` means `{` occurs once
While a more practical compression scheme would produce a binary encoding, for the purposes of this discussion that doesn’t matter.
Chunking
Crystal’s standard library supports a `chunk` method in the `Iterator` module that “enumerates over the items, chunking them together based on the return value of” a caller-supplied block. Using this with `String#each_char`, `Iterator#map`, `Iterator#flatten`, and `Iterator#join`, we have the following compact and idiomatic implementation.
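A sketch of the idea (the method name `rle_chunk` and the exact branching are my reconstruction, not necessarily the original code):

```crystal
# `chunk` groups consecutive equal characters, `map` turns each run into
# the characters to emit, and `flatten.join` stitches the result back
# into a String.
def rle_chunk(input : String) : String
  input.each_char.chunk { |c| c }.map do |c, reps|
    n = reps.size
    if c == '{' || n >= 4
      "{#{c}#{n}}".chars # encoded run, e.g. "{q9}"; also covers "{{1}"
    else
      reps               # short run of a plain character: emit as-is
    end
  end.flatten.join
end
```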
This however allocates a lot of memory that never gets used: `chunk` identifies the repeating character `c` and yields an array of the characters in `reps`. While we have an edge case where we gather up some of the chunks, many are ignored beyond measuring their lengths.
Faster with reuse
We can do better by reusing memory: calling `chunk` with `reuse: true` retains and re-uses the same `reps` array for each yield. This however means we can’t just return `reps` in the edge case, which in turn gives us an opportunity to remove `flatten`.
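A sketch of the reuse variant (again, the method name is mine):

```crystal
# With `reuse: true` the same backing array is reused for every run, so
# we must not hold on to `reps` — short runs are materialized as Strings
# instead, which also lets us drop `flatten`.
def rle_reuse(input : String) : String
  input.each_char.chunk(reuse: true) { |c| c }.map do |c, reps|
    n = reps.size
    if c == '{' || n >= 4
      "{#{c}#{n}}"
    else
      c.to_s * n # cannot return `reps`: it is overwritten on the next yield
    end
  end.join
end
```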
Using `Benchmark.ips` to compare these two functions, we can observe a 50% improvement in speed and a slight improvement in memory use.
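A harness along these lines can drive the comparison; the inline one-liners are only stand-ins to keep the snippet self-contained — the real comparison pits the plain-`chunk` encoder against the `reuse: true` one.

```crystal
require "benchmark"

# Stand-in workloads exercising the two chunking modes.
plain = ->(s : String) { s.each_char.chunk { |c| c }.map { |_, r| r }.flatten.join }
reuse = ->(s : String) { s.each_char.chunk(reuse: true) { |c| c }.map { |_, r| r.join }.join }

input = "aaaaabcdddddefg" * 500

Benchmark.ips do |x|
  x.report("chunk")       { plain.call(input) }
  x.report("chunk reuse") { reuse.call(input) }
end
```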
Faster with side-effects
Using Crystal’s `String::Builder` (via `String.build`) and accepting side-effects right from the start of the implementation, we can remove `#map` and `#join`.
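A sketch of this step (method name mine):

```crystal
# Append straight to a `String.build` builder as a side effect, so
# `map`, `flatten`, and `join` all disappear.
def rle_build(input : String) : String
  String.build do |io|
    input.each_char.chunk(reuse: true) { |c| c }.each do |c, reps|
      n = reps.size
      if c == '{' || n >= 4
        io << '{' << c << n << '}'
      else
        n.times { io << c }
      end
    end
  end
end
```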
Now we’re not only doing a lot less work, we’re allocating much less memory; this implementation runs over 2.5x faster than our original attempt, and uses only ~6% of the memory.
|
|
Faster bottom up
While `chunk` helps us identify repeating sequences, and Crystal’s compiler does a great job optimizing the final code, we’re still doing extra work we don’t need. For example, we really don’t need a collection of the repeating characters, just the count.

To do better we now have to replace `chunk` with something hand-crafted.
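A hand-crafted sketch (names are mine) that tracks just the current character and a count, with no per-run array at all:

```crystal
def rle_manual(input : String) : String
  return "" if input.empty?

  String.build do |io|
    # write one run, applying the '{' and N < 4 special cases
    emit = ->(c : Char, n : Int32) do
      if c == '{' || n >= 4
        io << '{' << c << n << '}'
      else
        n.times { io << c }
      end
    end

    chars = input.each_char
    prev = chars.next.as(Char)
    count = 1
    chars.each do |c|
      if c == prev
        count += 1
      else
        emit.call(prev, count)
        prev = c
        count = 1
      end
    end
    emit.call(prev, count) # flush the final run
  end
end
```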
This specialization, albeit verbose, makes a significant difference. There isn’t much more to save in terms of memory; we’re using just what we need to produce the encoded string. Performance, however, is over 5x that of the original implementation.
Faster by going lower
The implementations so far copy characters one by one when the input has a sequence of non-repeating characters. We could make this faster by keeping track of non-repeating sequences and copying them in bulk.

In Crystal, strings are immutable. We can go deeper with `Char::Reader` and `Slice`, which take advantage of that immutability to avoid copying the source string, to track the start and end of each non-repeating sequence and write it to the output more efficiently.
It does however make the implementation even longer.
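A sketch of the byte-level approach (names are mine; it relies on `Char::Reader#pos` being a byte offset into the string):

```crystal
# Non-repeating characters are left in place and copied to the output in
# bulk — one `Slice` write per literal span — instead of one by one.
def rle_spans(input : String) : String
  String.build do |io|
    bytes = input.to_slice
    reader = Char::Reader.new(input)
    lit_start = 0 # byte offset where the pending literal span begins

    while reader.has_next?
      c = reader.current_char
      run_start = reader.pos # byte offset of this run
      n = 0
      while reader.has_next? && reader.current_char == c
        reader.next_char
        n += 1
      end
      if c == '{' || n >= 4
        # flush everything literal up to this run in a single write,
        # then append the encoded run
        io.write(bytes[lit_start, run_start - lit_start])
        io << '{' << c << n << '}'
        lit_start = reader.pos
      end
      # short runs of plain characters simply stay in the literal span
    end
    io.write(bytes[lit_start, bytes.size - lit_start])
  end
end
```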
Nonetheless, we now have an implementation that runs over 7x faster than our first one.
Conclusion
When implementing algorithms that process a lot of data, the bottom-up imperative implementation performs better both in terms of memory and execution time.
However, the idiomatic implementation was much shorter and simpler, so I was pleased to be able to keep that brevity with the third, still somewhat idiomatic, implementation, even though it runs a little under 3x slower than the fastest one. In situations where the volume of data being processed is small, this may be more than sufficient.
Just as I am sure we could write an implementation with fewer lines[^4], there’s likely more we can do to the lower-level imperative implementations to eke out some more time.
Acknowledgements
I’d like to thank the community in the Crystal Forum, whose great feedback continues to help make this article better.
[^1]: Crystal is a compiled, statically-typed language with Ruby-inspired syntax.

[^2]: Gleam is a type-safe language that compiles to Erlang for building scalable systems.

[^3]: Actually, I started with BASIC, which led me to x86 assembly, which led me to C, and that to Pascal and Prolog and Java and so on.

[^4]: Using regular-expression substitution with `String#gsub` I have an implementation with fewer lines of code, but it ran 8x slower than the bottom-up implementation and used a lot of memory.