Performant Code Is Not Always Idiomatic

Crystal Optimisation

Updated 2024-01-07. Look for revision callouts below.

When implementing an algorithm in a modern language like Crystal¹ (or Gleam², for that matter), using the built-in idioms lets me get to a working solution quickly. Such code tends to be compact and, once you’re familiar with the idioms, easy to understand.

However, naively implementing low-level algorithms with functional programming constructs often doesn’t lead to good performance. High-performance code, by contrast, tends to be written bottom-up and to be verbose.

As someone who learned to program with x86 assembly and C³, I have a strong affinity for low-level programming, which is almost always imperative. Nonetheless, over time I’ve learned that well-defined abstractions and idioms are key to being effective, especially considering how much computers and compilers have advanced. Functional idioms are great, and I love that Crystal (following on Ruby’s heels) lets me write imperative programs with functional characteristics.

Still, I like to try and see how close I can get a functional implementation to the performance of an optimal imperative one. It’s fun to see what we have to give up and what we can gain.

RLE

We’ll use run-length encoding to “compress” a text string, replacing runs of repeating characters with a character and a count, using the following rules:

- A run of four or more identical characters is encoded as {cN}, where c is the character and N is the length of the run.
- Runs shorter than four characters are copied through unchanged, since encoding them wouldn’t save space.
- The character { is always encoded, even in a run of one (e.g. {{1}), so a literal brace can’t be confused with an encoding marker.

Here are some examples of the encoding:

aaaaabcc  →  {a5}bcc
x{y       →  x{{1}y
wwwwwwww  →  {w8}

While a more practical compressor would produce a binary encoding, for the purpose of this discussion that doesn’t matter.

Chunking

Crystal’s standard library provides a chunk method in the Iterator module that “enumerates over the items, chunking them together based on the return value of” a caller-supplied block. Combining it with String#each_char, Iterator#map, Iterator#flatten, and Iterator#join gives us the following compact and idiomatic implementation.
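
To get a feel for what chunk yields (the key and an array of the run’s items), here’s a quick example:

"aaab".each_char.chunk(&.itself).to_a
# => [{'a', ['a', 'a', 'a']}, {'b', ['b']}]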

module RLE
  def self.encode(input : String) : String
    input.each_char
      .chunk(&.itself)
      .map { |c, reps|
        if (n = reps.size) == 1
          c == '{' ? "{{1}" : c
        elsif n < 4 && c != '{'
          reps # not worth encoding
        else
          "{#{c}#{n}}"
        end
      }
      .flatten
      .join
  end
end
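
A quick sanity check against the examples above:

RLE.encode("aaaaabcc") # => "{a5}bcc"
RLE.encode("x{y")      # => "x{{1}y"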

This, however, allocates a lot of memory that is mostly thrown away: chunk identifies the repeating character c and yields a freshly allocated array of the run’s characters in reps. Apart from the short-run edge case where we keep the chunk’s contents, most of these arrays are used only to measure their length.

Faster with reuse

We can do better by passing reuse: true to chunk, which retains and re-uses the same reps array for each yield. This however means we can’t just return reps in the edge case, which in turn gives us an opportunity to remove flatten.
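
To see why we can no longer return reps directly, note that with reuse: true every yield hands us the same array, mutated in place; a small demonstration:

arrays = [] of Array(Char)
"aabb".each_char.chunk(reuse: true, &.itself).each { |_, reps| arrays << reps }
arrays[0].same?(arrays[1]) # => true: both entries are the one reused array
arrays[0]                  # => ['b', 'b'], the contents of the last chunk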

module RLE
  def self.encode(input : String) : String
    input.each_char
      .chunk(reuse: true, &.itself)
      .map { |c, reps|
        if (n = reps.size) == 1
          c == '{' ? "{{1}" : c
        elsif n < 4 && c != '{'
          reps.join # return string, not array
        else
          "{#{c}#{n}}"
        end
      }.join  # no need to flatten
  end
end

Using Benchmark.ips to compare these two functions (the harness is sketched after the results), we can observe a 50% improvement in speed and a slight improvement in memory use.

RLE #chunk #map #flatten #join   6.26k (159.75µs) (± 2.53%)  199kB/op   1.50× slower
RLE #chunk reuse: #map #join     9.39k (106.47µs) (± 2.08%)  123kB/op         fastest
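
For reference, the comparison harness looks roughly like the following sketch. The module names and the sample input are stand-ins (the actual test string isn’t shown here), assuming each variant lives in its own module:

require "benchmark"

# Stand-in input; substitute the string you want to measure.
input = (("a" * 10) + "bc" + "{") * 1_000

Benchmark.ips do |x|
  # RLEChunk and RLEReuse stand in for the two variants above
  x.report("RLE #chunk #map #flatten #join") { RLEChunk.encode(input) }
  x.report("RLE #chunk reuse: #map #join") { RLEReuse.encode(input) }
end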

Faster with side-effects

Using Crystal’s String.build, which yields a String::Builder, and accepting side-effects right from the start of the implementation, we can remove #map and #join.

Updated the code with a simpler encoding generator based on feedback here.

module RLE
  def self.encode(input : String) : String
    String.build do |sb|
      input.each_char
        .chunk(reuse: true, &.itself)
        .each do |c, reps|
          count = reps.size
          if count >= 4 || c == '{'
            # long runs, and any run of '{', get the {cN} encoding
            sb << '{' << c << count << '}'
          else
            # short runs are cheaper written out verbatim
            count.times { sb << c }
          end
        end
    end
  end
end

Now we’re not only doing a lot less work, we’re allocating much less memory; this implementation runs over 2.5x faster than our original attempt, and uses only ~6% of the memory.

RLE #chunk #map #flatten #join   6.61k (151.18µs) (± 3.03%)   199kB/op   2.51× slower
RLE #chunk reuse: #map #join     9.74k (102.71µs) (± 1.47%)   123kB/op   1.71× slower
RLE #chunk reuse: StringBuilder  16.63k ( 60.12µs) (± 2.60%)  12.5kB/op        fastest

Faster bottom-up

While chunk helps us identify repeating sequences, and Crystal’s compiler does a great job optimizing the final code, we’re still doing extra work we don’t need. For example, we don’t actually need a collection of the repeating characters, just their count.

To do better we now have to replace chunk with something hand-crafted.

Updated the code with a simpler encoding generator based on feedback here.

module RLE
  def self.encode(input : String) : String
    String.build do |output|
      n = 1
      last = nil # no previous char yet; nil prints as "", so the first encode_rep emits nothing
      input.each_char do |c|
        if last == c
          n += 1
        else
          encode_rep c: last, count: n, into: output
          n = 1
          last = c
        end
      end
      # capture the last sequence
      encode_rep c: last, count: n, into: output
    end
  end

  # Encode one repeating sequence
  def self.encode_rep(c, count, into : IO)
    if count >= 4 || c == '{'
      into << '{' << c << count << '}'
    else
      count.times { into << c }
    end
  end
end

This specialization, albeit verbose, makes a significant difference. There isn’t much more to save in terms of memory; we’re allocating only what we need to produce the encoded string. Performance, however, is over 5x that of the original implementation.

RLE #chunk #map #flatten #join   6.47k (154.56µs) (± 3.62%)   199kB/op   5.47× slower
RLE #chunk reuse: #map #join     9.63k (103.84µs) (± 1.51%)   123kB/op   3.68× slower
RLE #chunk reuse: StringBuilder  16.19k ( 61.75µs) (± 2.36%)  12.5kB/op  2.19× slower
RLE bottom up                    35.40k ( 28.25µs) (± 5.74%)  11.4kB/op        fastest

Faster by going lower

The implementations so far copy characters one by one when the input has a sequence of non-repeating characters. We could make it faster if we keep track of non-repeating sequences and copy them more efficiently.

In Crystal, strings are immutable. We can go deeper with Char::Reader and Slice, which take advantage of that immutability to read the source string without copying it. With these we can track the start and end of a non-repeating sequence and write it to the output in one operation.
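
Two details make this work: String#to_slice gives a read-only view of the string’s UTF-8 bytes without copying, and Char::Reader#pos is a byte offset into that same buffer, so reader positions can index the slice directly. A quick illustration:

s = "héllo"
bytes = s.to_slice # zero-copy view of the UTF-8 bytes
reader = Char::Reader.new(s)
reader.pos         # => 0: byte offset of 'h'
reader.next_char   # advance past 'h' (one byte)
reader.pos         # => 1
reader.next_char   # advance past 'é' (two bytes in UTF-8)
reader.pos         # => 3
bytes[1, 2]        # the two bytes of 'é', again without copying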

It does however make the implementation even longer.

Added this implementation based on an optimization recommendation here.

module RLE
  def self.encode(input : String) : String
    in_read = Char::Reader.new(input)
    in_slice = input.to_slice

    String.build do |output|
      acc_start = in_read.pos     # accum non-repeating seq
      acc_end = acc_start
      n = 1
      last = nil

      in_read.each do |c|
        if last == c
          n += 1
        else
          if n >= 4 || last == '{'
            # copy the accum slice
            unless (acc_n = acc_end - acc_start).zero?
              output.write(in_slice[acc_start, acc_n])
            end
            # encode the rep sequence
            output << '{' << last << n << '}'
            # reset accumulation
            acc_start = in_read.pos
          end
          n = 1
          last = c
          acc_end = in_read.pos
        end
      end
      # copy the accum slice if any
      unless (acc_n = acc_end - acc_start).zero?
        output.write(in_slice[acc_start, acc_n])
      end
      # capture the remaining sequence
      if n >= 4 || last == '{'
        output << '{' << last << n << '}'
      else
        n.times { output << last }
      end
    end
  end
end

Nonetheless, we now have an implementation that runs over 7x faster than our first one.

RLE #chunk #map #flatten #join   6.61k  (151.36µs) (± 3.18%)   199kB/op   7.12× slower
RLE #chunk reuse: #map #join     9.59k  (104.31µs) (± 3.51%)   123kB/op   4.91× slower
RLE #chunk reuse: StringBuilder  16.57k ( 60.33µs) (± 1.72%)  12.5kB/op   2.84× slower
RLE bottom up                    37.98k ( 26.33µs) (± 1.98%)  11.4kB/op   1.24× slower
RLE even lower level             47.07k ( 21.25µs) (± 2.57%)  11.4kB/op         fastest

Conclusion

When implementing algorithms that process a lot of data, the bottom-up imperative implementation performs better both in terms of memory and execution time.

However, the idiomatic implementation was much shorter and simpler. So I was pleased to be able to keep most of that brevity with the third, still somewhat idiomatic, implementation, even though it runs a little under 3x slower than the fastest one. In situations where the volume of data being processed is small, that may be more than sufficient.

Just as I am sure we could write an implementation with fewer lines⁴, there’s likely more we can do to the lower-level imperative implementations to eke out some more time.

Acknowledgements

I’d like to thank the community in the Crystal Forum whose great feedback here continues to help make this article better.


  1. Crystal is a compiled, statically-typed language with Ruby-inspired syntax.

  2. Gleam is a type-safe language that compiles to Erlang for building scalable systems.

  3. Actually, I started with BASIC, which led me to x86 assembly, which led me to C, and that to Pascal and Prolog and Java and so on.

  4. Using regular-expression substitution with String#gsub I have an implementation with fewer lines of code, but it ran 8x slower than the bottom-up implementation and used a lot of memory.
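
     A variant along those lines might look like the following sketch (the name encode_gsub is just for illustration, and this isn’t necessarily the version I measured):

     # Encode runs of 4+ of any character, and any run of '{', as {cN}
     def encode_gsub(input : String) : String
       input.gsub(/(.)\1{3,}|\{+/) { |run| "{#{run[0]}#{run.size}}" }
     end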