Counting Lines, Fast and Slow
It’s fun to take a simple problem and see how far we can push its performance. Almost always, just when you think you’ve exhausted all the options, someone points out one more trick, one that is all the more esoteric the less there seems left to squeeze. Inevitably, that optimization is harder to grok by just reading the code.
Still, that “A ha!” moment is always worth digging through to understand what is going on.
The dawning comprehension comes with understanding how something really works under the hood. Then, maybe while you’re in a meeting or taking a shower, comes the realisation that you can use this with something else you were working on. That in turn rejuvenates your interest in something you had parked or thought you were done with; maybe it sparks your interest in something you never knew you cared about.
Scope
We’ll write a command line program using Crystal[^1] that

- takes a file path as its only argument,
- assumes all files are UTF-8 encoded[^2],
- counts the number of lines in the file by counting the occurrences of byte `0x0a` (aka `'\n'`), and
- prints out the number of lines and nothing else.
We’ll start as simple as possible and work our way towards getting it to run as fast as possible.
A simple line counter
When writing a quick and dirty script to count lines, the easiest, and perhaps most idiomatic, approach is to open the file and iterate through all the lines while keeping a counter.
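Spelled out, a minimal sketch of that version looks something like this (the details and the file name are my own, echoing the binary names in the benchmark summary later, and not necessarily the post’s exact listing):

```crystal
# lc_string_lines: count lines by iterating over them with IO#each_line.
path = ARGV[0]
count = 0_i64

File.open(path) do |file|
  file.each_line { count += 1 }
end

puts count
```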
Benchmarking[^3] this first version gives us a number to compare against.
Baseline

Unix platforms (and others) have long had the `wc` command that prints word, line and byte counts for text files. I’ll be using `wc -l`, the option that counts only lines, as a baseline for comparison with the program versions in this article.
While I compare every version of the evolving line counter with `wc -l`, my intent is not to denigrate `wc` but instead to pay tribute to a venerable word and line counting workhorse in the command line tools arsenal.
`wc -l` is ~2.2x faster than our naive version, which is not bad for the simplest possible first attempt.
Characters vs lines
Since each line is a `String` made up of `Char` values (characters), let’s try counting the newline characters by iterating over the characters from the file using `IO#each_char`.
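A sketch of the character-scanning version, with the same shape as before but `IO#each_char` doing the iteration (again, the details are assumed):

```crystal
# lc_chars: count newline characters instead of lines.
path = ARGV[0]
count = 0_i64

File.open(path) do |file|
  file.each_char do |char|
    count += 1 if char == '\n'
  end
end

puts count
```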
Benchmarking this version brings bad news: it is roughly three times slower than the line-by-line version.
Ouch. Clearly we need to dig into how `#each_line` is implemented to figure out why iterating through characters is so much slower.
Bytes vs characters
Digging into the standard library’s source code, Crystal’s `#gets` method, used by `#each_line` to read and yield each line, has optimizations for when the delimiter (newline or `'\n'` in our case) is an ASCII byte. Since we are counting lines in UTF-8 files, we know that a newline is a solo byte that we can look for in UTF-8 text without worrying about multi-byte processing.
Let’s try scanning bytes instead of characters using `#each_byte`, skipping the overhead of looking for and handling potential multi-byte encoding that happens behind the scenes when we iterate over `Char` values.
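A sketch of the byte-scanning version (details assumed):

```crystal
# lc_bytes: count newline bytes, skipping Char decoding entirely.
NEWLINE = 0x0a_u8

path = ARGV[0]
count = 0_i64

File.open(path) do |file|
  file.each_byte do |byte|
    count += 1 if byte == NEWLINE
  end
end

puts count
```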
Benchmarking this version, we are back on track.
The code is still just as simple, we’re doing better than iterating over lines (both in time and space), and now `wc -l` is only ~1.8x faster.
Byte slices
Looking deeper into Crystal’s standard library, `IO#each_byte` calls `IO#read_byte` for every byte that it yields, in turn dealing with buffering and other book-keeping, adding overhead to every byte we inspect. Buffering the reads certainly makes sense, but we don’t need anything more than to inspect the bytes, so let’s try managing our own buffering by reading chunks of bytes into a buffer that we allocate and then iterating over the bytes in each chunk.
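A sketch of this version (the constant names and details are mine):

```crystal
# lc_byte_slices: read the file in 512KB chunks into a reusable buffer
# and scan the bytes of each chunk ourselves.
BUF_SIZE = 512 * 1024
NEWLINE  = 0x0a_u8

path = ARGV[0]
count = 0_i64
buffer = Bytes.new(BUF_SIZE)

File.open(path) do |file|
  while (read = file.read(buffer)) > 0
    buffer[0, read].each do |byte|
      count += 1 if byte == NEWLINE
    end
  end
end

puts count
```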
Nice. With the 512KB buffer[^5] into which we load portions of the file to count the newline bytes, the benchmark shows we’re now running slightly faster than `wc -l` on the same input. Even though we stopped thinking of “lines” back when we started scanning characters and bytes, this version is starting to look less intuitive at first glance. It’s still small enough, though, that we can understand it if we stare at it a little.
Long “words” vs bytes
At this point we’re not all that different from how `wc -l` works. If we want to continue with our sequential approach, to get more speed we will need to change how we find and count the newline bytes in the buffer.
Current mainstream processors are capable of at least 32-bit operations, and most support 64-bit operations. Can we make the counting run faster by using “long word” maths on 32 bits (or 64 bits) of data at a time?
The Linux kernel has an optimized implementation of `memchr()`[^6] that uses a combination of long integer maths and bitwise operations to find a single byte in a long word. This approach counts trailing zero bits to identify the “first” occurrence of a byte after performing some bitwise transformations that leave behind a `1` bit for each byte that matches the byte `memchr` is looking for.
We can, however, use `#popcount`, a CPU instruction which Crystal exposes for integers, to count all the occurrences of the `1` bit.
To get a sense of how this works: the XOR zeroes out the byte we’re looking for, and we then transform those zero bytes into a mask with a single `1` bit for every matching newline, which `#popcount` can add up (the function below walks through an example on one 32-bit word).
Since we are reading byte buffers, and we are working with Crystal, we will need to use some “unsafe” operations to iterate over long words, in effect “casting” a byte pointer to a pointer to a buffer of longer integers. After processing the long words, we also need to account for any bytes remaining at the end of the buffer, which we have to check individually.
I’ve isolated the optimization into a function that uses (and hides) unsafe pointers to read 32-bit “words” and perform bitwise maths to count the number of newline bytes in a buffer of n bytes.
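Here is a sketch of such a function; I’m calling it `count_newlines_by_word`, and the exact bit manipulation in the original may differ. This variant uses a carry-free form of the classic “find the zero bytes” trick so that `popcount` counts each match exactly once:

```crystal
# Count '\n' bytes in the first `size` bytes of `buffer`, 32 bits at a time.
#
# Worked example for one word holding the bytes 0x0a 0x41 0x0a 0x42
# (little-endian, so the word reads 0x420a410a):
#   x    = 0x420a410a ^ 0x0a0a0a0a = 0x48004b00   (newline bytes become 0x00)
#   mask                           = 0x00800080   (one bit set per zero byte)
#   mask.popcount                  = 2            (two newlines in this word)
def count_newlines_by_word(buffer : Bytes, size : Int32) : Int64
  count = 0_i64
  words = size // 4
  ptr = buffer.to_unsafe.as(UInt32*) # unsafe view of the buffer as 32-bit words

  words.times do |i|
    x = ptr[i] ^ 0x0a0a0a0a_u32               # bytes equal to '\n' become zero
    y = (x & 0x7f7f7f7f_u32) + 0x7f7f7f7f_u32 # bit 7 set where low 7 bits are non-zero
    mask = ~(y | x) & 0x80808080_u32          # exactly one bit left per zero byte
    count += mask.popcount
  end

  # The last 0..3 bytes of the chunk don't fill a word; check them one by one.
  (words * 4).upto(size - 1) do |i|
    count += 1 if buffer[i] == 0x0a_u8
  end

  count
end
```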
This will return the number of newline bytes in a given byte buffer; the main loop then just calls it for each chunk we read.
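A sketch of that main loop, assuming the function above lives in the same file:

```crystal
# lc_byte_slice_wide: chunked reads, counted a word at a time.
BUF_SIZE = 512 * 1024

path = ARGV[0]
count = 0_i64
buffer = Bytes.new(BUF_SIZE)

File.open(path) do |file|
  while (read = file.read(buffer)) > 0
    count += count_newlines_by_word(buffer, read)
  end
end

puts count
```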
Boom. With 32-bit wide maths counting newlines four bytes at a time, and with the CPU able to perform 32-bit maths at least as fast as 8-bit maths, the benchmark shows we can now count lines 4.3x faster than `wc -l`.
Longer “words”
What about using 64-bit maths? After all, most if not all of our personal computers today use 64-bit processors, and I wanted to know what improvement I could get by processing 8 bytes at a time. So I modified the line counter to use 64-bit maths.
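A sketch of the widened function: it is the 32-bit version with the constants and the pointer type stretched to 64 bits.

```crystal
# Count '\n' bytes in the first `size` bytes of `buffer`, 64 bits at a time.
def count_newlines_by_word8(buffer : Bytes, size : Int32) : Int64
  count = 0_i64
  words = size // 8
  ptr = buffer.to_unsafe.as(UInt64*) # unsafe view of the buffer as 64-bit words

  words.times do |i|
    x = ptr[i] ^ 0x0a0a0a0a0a0a0a0a_u64
    y = (x & 0x7f7f7f7f7f7f7f7f_u64) + 0x7f7f7f7f7f7f7f7f_u64
    mask = ~(y | x) & 0x8080808080808080_u64 # one bit per newline byte
    count += mask.popcount
  end

  # The last 0..7 bytes of the chunk don't fill a word; check them one by one.
  (words * 8).upto(size - 1) do |i|
    count += 1 if buffer[i] == 0x0a_u8
  end

  count
end
```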
Modifying the main loop to use the `#count_newlines_by_word8` function instead, the benchmark shows it is slightly faster on my laptop with its 64-bit processor, though not by as much as you might expect after the 4x improvement earlier. We are now 4.7x faster than `wc -l`.
Intermediate conclusion
The line counter now counts half a billion lines in a file of over 6GB in well under two seconds, almost 5x faster than `wc -l`, which took a little over 7 seconds.
It’s sequential, utilizing just one of the 6 cores on my benchmarking machine, and it uses a fixed amount of memory (512KB) regardless of file size.
While there are a few more avenues we could explore (e.g. file loading using memory mapping, SIMD intrinsics) to improve sequential performance, at this point I am more curious about the benefits of parallelizing the line counter.
Concurrent IO and Compute
With the most recent optimizations we have reached a point where reading from the file dominates the time spent. Using different chunk sizes has no meaningful impact since the corresponding increase in I/O time continues to overwhelm the compute time.
To go faster we need to consider decoupling the compute and the I/O. Since reading from a file is a blocking operation in Crystal’s standard library, running the reading and the counting concurrently to take advantage of the multiple cores in our CPUs should give us a boost. Knowing that reading from the file takes as much or more time than counting the lines in a chunk, the next attempt uses two threads: one for reading chunks and one for counting. To ensure we keep both threads busy we do the following:
- Set up three channels:
  - Let’s call the first one “Ready for IO”, which is used by the “Reader” to detect when a buffer is not in use so that the “Reader” can read from the file to fill it.
  - The next one is “Ready for Counting”, which is used by the “Counter” to detect when a buffer is available for counting lines.
  - The last one is “Done Counting”, which is used by the “Counter” to send the line count when it has no more buffers to count.
- We pre-populate the “Ready for IO” channel with two empty 512KB buffers to ensure that after one buffer has been read, we can read the second one while the first is being counted.
- Use `spawn` to set up the “Counter” fiber, which can run in its own thread:
  - The “Counter” repeatedly waits on the “Ready for Counting” channel until either it receives a buffer to count or it receives a `nil` to indicate there is no more work.
    - If there’s a buffer, we count the lines and update the line counter.
    - Otherwise we send the line count to the “Done Counting” channel.
- The main body runs in its own thread and is the “Reader” which, after opening the file,
  - iterates over the size of the file in buffer-sized steps,
  - waits on the “Ready for IO” channel to retrieve a buffer to fill, and
  - after reading from the file, sends the buffer (with the number of bytes read) to the “Ready for Counting” channel.
- When the whole file has been read, the “Reader” waits on the “Done Counting” channel to receive the line count to print.
The source code for this version is noticeably more complicated than our previous attempts.
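Here is a condensed sketch of that structure. The channel types, the buffer hand-off, and the `require` of the counting function from the previous section are my assumptions rather than the post’s exact code:

```crystal
# lc_byte_slice_wide8_bgio: one fiber reads chunks, another counts them.
# Assumes count_newlines_by_word8 from the previous section is saved in
# "count_newlines_by_word8.cr". Compile with -Dpreview_mt.
require "./count_newlines_by_word8"

IO_CAP      = 2
IO_BUFSZ_KB = 512

path = ARGV[0]
bufsz = IO_BUFSZ_KB * 1024

ready_for_io       = Channel(Bytes).new(IO_CAP)                # empty buffers
ready_for_counting = Channel(Tuple(Bytes, Int32)?).new(IO_CAP) # filled buffers, or nil when done
done_counting      = Channel(Int64).new(1)                     # final line count

# Pre-populate the pool with IO_CAP empty buffers.
IO_CAP.times { ready_for_io.send(Bytes.new(bufsz)) }

# The "Counter" fiber: count each chunk, then return the buffer for reuse.
spawn do
  count = 0_i64
  while work = ready_for_counting.receive
    buffer, read = work
    count += count_newlines_by_word8(buffer, read)
    ready_for_io.send(buffer)
  end
  done_counting.send(count)
end

# The main body is the "Reader": fill buffers and pass them on.
File.open(path) do |file|
  loop do
    buffer = ready_for_io.receive
    read = file.read(buffer)
    break if read == 0
    ready_for_counting.send({buffer, read})
  end
end
ready_for_counting.send(nil) # signal that there is no more work

puts done_counting.receive
```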
This program needs to be compiled with the `-Dpreview_mt` option to ensure multi-threading is enabled at runtime.
Running this version, with one thread reading chunks from the file while another thread counts the lines in each chunk, the benchmark shows the program is now ~6.9x faster than `wc -l`, and ~1.49x faster than our fastest sequential version.
More bigger buffers
The above implementation has two constants: `IO_CAP` for the number of buffers we pre-load, and `IO_BUFSZ_KB` for the size of each buffer in KB. While the above version uses the values 2 and 512 respectively, I wanted to know if different values for those constants would result in better performance. To do this I modified the constant declarations to look for environment variables, if available.
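Something along these lines (a sketch; the post’s exact declarations may differ):

```crystal
# Fall back to the defaults unless overridden from the environment.
IO_CAP      = ENV.fetch("IO_CAP", "2").to_i
IO_BUFSZ_KB = ENV.fetch("IO_BUFSZ_KB", "512").to_i
```

The wrapper script then only needs to export those variables before invoking the binary.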
After setting up a wrapper script to which I could provide the values as parameters, I ran a benchmark parameterizing `IO_CAP` over `2,3,4` and `IO_BUFSZ_KB` over `256,512,1024,2048`. The table below shows the performance (mean time) for the various combinations:
| IO_CAP | IO_BUFSZ_KB | mean (s) | stddev |
|---|---|---|---|
| 2 | 256 | 1.107 | 0.029 |
| 3 | 256 | 1.143 | 0.049 |
| 4 | 256 | 1.113 | 0.010 |
| 2 | 512 | 1.011 | 0.024 |
| 3 | 512 | 0.983 | 0.009 |
| 4 | 512 | 1.057 | 0.143 |
| 2 | 1024 | 0.991 | 0.050 |
| 3 | 1024 | 1.061 | 0.039 |
| 4 | 1024 | 1.100 | 0.113 |
| 2 | 2048 | 1.084 | 0.029 |
| 3 | 2048 | 0.972 | 0.006 |
| 4 | 2048 | 1.016 | 0.008 |
Using two threads, one for I/O and one for counting, the best run this time used 3 x 2MB buffers, with a mean time of 0.972s.
Still, it isn’t much of a range: with an average of 1.0535s across the different configurations, we’re running ~7x faster than `wc -l`.
More cores, more threads
Since my test machine has 6 cores, doubling to 12 with hyper-threading enabled, the next optimization is to try and leverage all those cores. Counting lines is inherently parallelizable, by which I mean that I can split the file and count the pieces in any order, and the sum of the lines in each buffer will add up to the total number in the file.
If you were wondering earlier why I even bothered with the two threads separating I/O and counting, it was because of simplicity. Concurrent data processing is fraught with many potential perils, not the least of which are race conditions due to shared resources. Keeping I/O in one thread and counting in the other meant we were running two sequential tasks that didn’t have much to worry about on their own.
The next version, however, runs `NUM_PAR` fibers (based on the number of `CRYSTAL_WORKERS` threads set up by Crystal’s multi-threading runtime), where each one reads and counts `IO_BUFSZ`-byte chunks from the file. The reading offset is calculated so that each fiber skips ahead by `NUM_PAR` chunks from its initial offset, essentially moving through the file like a wave.
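Here is a sketch of that structure, reusing `count_newlines_by_word8` from earlier; the offset arithmetic and the per-fiber `File.open` are my reading of the description above, and the post’s actual code may differ in detail:

```crystal
# lc_byte_slice_wide8_par: NUM_PAR fibers leapfrog through the file, each
# counting every NUM_PAR-th chunk, and the partial counts are summed.
# Compile with -Dpreview_mt; CRYSTAL_WORKERS sets the number of threads.
require "./count_newlines_by_word8"

IO_BUFSZ = 512 * 1024
NUM_PAR  = (ENV["CRYSTAL_WORKERS"]? || "4").to_i

path = ARGV[0]
file_size = File.size(path).to_i64
done = Channel(Int64).new(NUM_PAR)

NUM_PAR.times do |worker|
  spawn do
    count = 0_i64
    buffer = Bytes.new(IO_BUFSZ)
    File.open(path) do |file|
      offset = worker.to_i64 * IO_BUFSZ
      while offset < file_size
        to_read = Math.min(IO_BUFSZ.to_i64, file_size - offset).to_i
        file.seek(offset)
        file.read_fully(buffer[0, to_read])
        count += count_newlines_by_word8(buffer, to_read)
        offset += NUM_PAR.to_i64 * IO_BUFSZ # skip the chunks the other fibers handle
      end
    end
    done.send(count)
  end
end

total = 0_i64
NUM_PAR.times { total += done.receive }
puts total
```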
Using 12 threads and a 512KB buffer per thread (6MB of memory in total), we get our best benchmark result yet counting half a billion lines. Wham! Using multiple threads and harnessing all the cores, we are now ~3.3x faster than when we just separated I/O from counting, and compared to `wc -l` the parallel line counter is ~23x faster!
Summary of benchmarks
Environment variables were set prior to running the benchmarks to ensure that the two multi-threaded versions ran with specific concurrency configurations.
The multi-threaded parallel line counter, `bin/lc_byte_slice_wide8_par`, compared to all the other versions, ran

| times | faster than |
|---|---|
| 3.37 ± 0.05 x | `bin/lc_byte_slice_wide8_bgio` |
| 5.03 ± 0.03 x | `bin/lc_byte_slice_wide8` |
| 5.38 ± 0.03 x | `bin/lc_byte_slice_wide` |
| 21.67 ± 0.12 x | `bin/lc_byte_slices` |
| 23.19 ± 0.14 x | `wc -l` |
| 42.25 ± 0.19 x | `bin/lc_bytes` |
| 52.50 ± 0.43 x | `bin/lc_string_lines` |
| 154.99 ± 0.64 x | `bin/lc_chars` |
Conclusion
It should be no surprise by now that the simplest and most intuitive programs are not also the fastest. The increasing code complexity comes from understanding how things work and unwrapping the abstractions that, on one hand, made it so easy to get the job done simply while, on the other hand, paying a price in performance.
Of course, if most of your files are small, just use the simpler program and instead use the GNU `parallel`[^7] command to run it in parallel across many files.
With very large data files, however, it is really satisfying to dig into the details of how things work and figure out a way to optimize a program until it is ~155 times faster than the first one that got the job done. After all, isn’t it amazing that a 6GB file with 500 million lines can be counted in less than a third of a second?
Feedback and thoughtful discussion welcome here in the Crystal Forum
Furthermore
Here I address some questions and topics that didn’t neatly fit into the content above.
- It may seem surprising, given its pedigree and vintage, that `wc` hasn’t already made the improvements outlined above. As pointed out in the community feedback, `wc` does indeed have various optimizations (see here and here), including the use of `memchr` as well as AVX/SIMD operations. Since these appear to apply only when certain criteria are met (e.g. line length), they appear not to have kicked in, perhaps because the 500 million lines in my test file didn’t meet the criteria, or for some other reason.
- Nonetheless, I want to reiterate that this post is not intended to denigrate `wc` but rather to use it as a well known baseline for the exercise of showing how more and more optimisation, which is a fun activity in itself, often leads away from intuitive code towards more complicated programs.
[^1]: Crystal is a compiled, statically-typed language with Ruby-inspired syntax.

[^2]: Per Unicode 15’s chapter on “Encoding Forms”: The UTF-8 encoding form maintains transparency for all of the ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are converted to single bytes 0x00..0x7F in UTF-8 and are thus indistinguishable from ASCII itself. Furthermore, the values 0x00..0x7F do not appear in any byte for the representation of any other Unicode code point, so that there can be no ambiguity.

[^3]: All performance testing was done with `hyperfine` using 2 warm-up runs followed by 5 measured runs, each separated by `sleep 5`, with a 6.4GB data file containing 500 million lines.

[^5]: I tried a number of buffer sizes and found that after 1MB it started to take longer, due to blocking file I/O taking longer than the time to count the newlines in each chunk.

[^6]: Here’s the link to the patches on the Linux kernel mailing list: https://lore.kernel.org/lkml/[email protected]/T/#u

[^7]: “GNU parallel is a shell tool for executing jobs in parallel using one or more computers.” See here.