Fast bytes concatenation in Python

So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare a few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a block of 1024 bytes until we get 1MB of data. This is very similar to what one might do when reading a large file into memory, so this test is pretty realistic.

The first implementation is the naive one.

def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

It is known that the naive implementation is very slow, as bytes in Python are an immutable type, hence we need to realloc the bytes and copy them after each concatenation. Just how slow is it? About 330 times slower than the append-and-join pattern. The append-and-join pattern was a popular (and efficient) way to concatenate strings in old Python versions.

def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)

It relies on the fact that appending to lists is efficient and then ''.join can preallocate the entire needed memory and perform the copy efficiently. As you can see below, it is much more efficient than the naive implementation.

Python 2.6 introduced the bytearray as an efficient mutable bytes sequence. Being mutable allows one to "naively" concatenate the bytearray and achieve great performance, more than 30% faster than the join pattern above.

def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10

Comparing the naive, join, and bytearray implementations. Time is for 64 iterations.

Comparing the join, bytearray, preallocated bytearray, and memoryview implementations. Time is for 8192 iterations.

What about preallocating the memory?

def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

While this sounds like a good idea, Python’s copy semantics turn out to be very slow. This resulted in run times 5 times slower. Python also offers memoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The idea of accessing the internal data without unnecessary copying sounds great.

def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

And it does run almost twice as fast as the preallocated bytearray implementation, but still about 2.5 times slower than the simple bytearray implementation.

I ran the benchmark using the timeit module, taking the best run out of five for each. The CPU was an Intel i7-8550U.

import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))

for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))

Conclusion

The simple bytearray implementation was the fastest method, and it is also as simple as the naive implementation. Also, preallocating doesn’t help, because it looks like Python can’t copy efficiently.

Optimizing `for` Loops: Reverse Loops

for loops are basic language constructs in many languages. One of the first things to look at when optimizing code is loops, as they do considerable amounts of work (like going through a very large amount of data) in very little code.

If you use a for loop, but you don’t really care about the order in which the loop is executed – to be more precise, if you can afford reversing the loop – you can save quite some time. By reversing the loop, I mean that instead of giving the index values from 0 to 10, for example, you go from 10 downward to zero. This doesn’t seem like a big change, but when carefully implemented, this can easily improve the performance of your for loops.
Continue reading Optimizing for Loops: Reverse Loops

The Revised String Iteration Benchmark

In this post I’m going to discuss again the string benchmark I did before to find out what is the fastest way to iterate over an std::string. If you haven’t read the previous post on this subject, go ahead and read it, as it covers the basic idea behind this benchmark. As I did the last time I ran the benchmark, I checked 5 ways of iteration:
Continue reading The Revised String Iteration Benchmark

Profiling Code Using `clock_gettime`

After raising the issue of the low-resolution problem of the timer provided by clock() in Resolution Problems in clock(), I ended the post by mentioning two more functions that should provide high-resolution timers suitable for profiling code. In this post, I will discuss one of them, clock_gettime().
Continue reading Profiling Code Using clock_gettime

What Is the Fastest Method to Iterate Over a String?

A few days ago, I decided to check what really is the fastest method to iterate over strings in C++. As a string class, I chose the string class from STL, as it is very popular and provides a couple of ways to iterate over it. So how can one iterate over an std::string?

By using indexes, e.g. str[i], and running i from zero to the length of the string.
By using the at method. string::at(size_t pos) provides an interface similar to indexes, with the exception that it checks whether the given position is past the end of the string and then throws an exception. One may see it as the safe version of the regular index.
Treating the string as a sequence of characters and iterating over it using iterators.
Using string::c_str() to get a pointer to a regular C string representation of the string stored in the std::string and treating it as an array, e.g. using indexes to go over it.
The last way to iterate over the string is to get a pointer to a C string representation using string::c_str() and advance the pointer itself to iterate over the string.

The third method is the native method of iterating over objects in STL, and like the last two, it can’t be used if the iteration changes the string itself (e.g. inserting or deleting characters). The first and second methods are similar to the fourth (treating the pointer to the C string as an array), except that they aren’t as problematic as the latter when changing the string. The second method is the safest, as it’s the only one that does range checks and throws an exception when trying to access positions that are outside the string.

To benchmark and find out which method is the fastest way to iterate over a string, I’ve created a huge string of random characters ranging from ‘a’ to ‘z’ and five executables, each one implementing one of the above iteration methods to do a simple task
(count the number of occurrences of each letter). The string is fifty million characters long, because the longer the string, the less important the overhead becomes.

The executables for the benchmark of every version were compiled with the default settings of g++ (without optimization, as the compiler might change the iteration methods when optimizing). The benchmark executables were timed by using the time command and redirecting the executables’ output to /dev/null. The tests were run both on 64-bit Gentoo (with 1 GB RAM) and on 32-bit Kubuntu (with 512 MB RAM), to make sure the overall results (which method is better, not the runtime itself) aren’t system-dependent.