So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare a few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a block of 1024 bytes until we get 1MB of data. This is very similar to what one might do when reading a large file into memory, so this test is pretty realistic.
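For instance, a naive chunked reader might look like this sketch (the function name and default chunk size are illustrative, not from the benchmark itself):

import io

def read_file_naive(f, chunk_size=1024):
    # Accumulate fixed-size chunks by repeated bytes concatenation --
    # the pattern the benchmark below models.
    ret = b''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return ret
        ret += chunk

# Works with any binary file-like object, e.g. an in-memory one:
data = read_file_naive(io.BytesIO(b'a' * 4096))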
The first implementation is the naive one.
def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
It is well known that the naive implementation is very slow: bytes is an immutable type in Python, so every concatenation must allocate a new object and copy the data into it. Just how slow is it? About 330 times slower than the append-and-join pattern, which was a popular (and efficient) way to concatenate strings in older Python versions.
def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)
It relies on the fact that appending to a list is cheap, and b''.join can then preallocate the entire needed buffer and perform a single copy. As you can see below, it is much more efficient than the naive implementation.
Python 2.6 introduced the bytearray as an efficient mutable bytes sequence. Being mutable allows one to "naively" concatenate the bytearray and achieve great performance, more than 30% faster than the join pattern above.
def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
[Figure: comparing the naive, join, and bytearray implementations; time is for 64 iterations.]
[Figure: comparing the join, bytearray, preallocated bytearray, and memoryview implementations; time is for 8192 iterations.]
What about preallocating the memory?
def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
While this sounds like a good idea, slice assignment into a bytearray turns out to be surprisingly slow in Python, and this version ran about 5 times slower than the simple bytearray one. Python also offers memoryview:
memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.
The idea of accessing the internal data without unnecessary copying sounds great.
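As a quick sanity check that no copy is involved, here is a minimal sketch (variable names are mine): writing through a memoryview mutates the underlying bytearray directly.

buf = bytearray(b'hello')
view = memoryview(buf)
view[0:2] = b'HE'  # no copy of buf is made; the write goes to buf itself
print(buf)  # bytearray(b'HEllo')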
def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
And it does run almost twice as fast as the preallocated bytearray implementation, but still about 2.5 times slower than the simple bytearray implementation.
I ran the benchmark using the timeit module, taking the best run out of five for each. The CPU was an Intel i7-8550U.
import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))
for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))
Conclusion
The simple bytearray implementation was the fastest method, and it is just as simple as the naive one. Preallocating doesn't help either, since Python's slice-assignment copies are slow.
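Applied back to the file-reading scenario from the introduction, the winning pattern might look like this sketch (the function name and default chunk size are illustrative):

import io

def read_file(f, chunk_size=1024):
    # Accumulate chunks in a mutable bytearray -- the fastest pattern
    # in the benchmark above -- and convert to bytes once at the end.
    ret = bytearray()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return bytes(ret)
        ret += chunk

# Works with any binary file-like object, e.g. an in-memory one:
data = read_file(io.BytesIO(b'x' * 3000))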