Fast bytes concatenation in Python

So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a block of 1024 bytes until we get 1MB of data. This is very similar to what one might do when reading a large file to memory, so this test is pretty realistic.

The first implementation is the naive one.

def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

It is known that the naive implementation is very slow, as bytes in Python are immutable type, hence we need to realloc the bytes and copy them after each concatenation. Just how slow is it? about 330 times slower than the append-and-join pattern. The append-and-join pattern was a popular (and efficient) way to concatenate strings in old Python versions

def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)

It relies on the fact that appending to lists is efficient and then ''.join can preallocate the entire needed memory and perform the copy efficiently. As you can see below it is much more efficient than the naive implementation.

Python 2.6 introduced the bytearray as an efficient mutable bytes sequence. Being mutable allows one to "naively" concatenate the bytearray and achieve great performance more than 30% faster than the join pattern above.

def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
Comparing the naive, join and bytearray implementation. Time is for 64 iterations.
Comparing the join, bytearray, preallocated bytearray and memoryview implementation. Time is for 8196 iterations.

What about perallocating the memory?

def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

While this sounds like a good idea, Pythons copy semantics turn out to be very slow. This resulted in 5 times slower run times. Python also offers memeoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The idea of access to the internal data without unnecessary copying sounds great.

def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

And it does run almost twice as fast as preallocated bytearray implementation, but still about 2.5 times slower than the simple bytearray implementation.

I ran the benchmark using the timeit module, taking the best run out of five for each. CPU was Intel i7-8550U.

import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))

for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))

Conclusion

The simple bytearray implementation was the fastest method, and also as simple as the naive implementation. Also preallocating doesn’t help, because python it looks like python can’t copy efficiently.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.