Fast bytes concatenation in Python

So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a block of 1024 bytes until we get 1MB of data. This is very similar to what one might do when reading a large file to memory, so this test is pretty realistic.

The first implementation is the naive one.

def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

It is known that the naive implementation is very slow, as bytes in Python are immutable type, hence we need to realloc the bytes and copy them after each concatenation. Just how slow is it? about 330 times slower than the append-and-join pattern. The append-and-join pattern was a popular (and efficient) way to concatenate strings in old Python versions

def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)

It relies on the fact that appending to lists is efficient and then ''.join can preallocate the entire needed memory and perform the copy efficiently. As you can see below it is much more efficient than the naive implementation.

Python 2.6 introduced the bytearray as an efficient mutable bytes sequence. Being mutable allows one to "naively" concatenate the bytearray and achieve great performance more than 30% faster than the join pattern above.

def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10

Comparing the naive, join and bytearray implementation. Time is for 64 iterations.

Comparing the join, bytearray, preallocated bytearray and memoryview implementation. Time is for 8196 iterations.

What about perallocating the memory?

def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

While this sounds like a good idea, Pythons copy semantics turn out to be very slow. This resulted in 5 times slower run times. Python also offers memeoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The idea of access to the internal data without unnecessary copying sounds great.

def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

And it does run almost twice as fast as preallocated bytearray implementation, but still about 2.5 times slower than the simple bytearray implementation.

I ran the benchmark using the timeit module, taking the best run out of five for each. CPU was Intel i7-8550U.

import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))

for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))

Conclusion

The simple bytearray implementation was the fastest method, and also as simple as the naive implementation. Also preallocating doesn’t help, because python it looks like python can’t copy efficiently.

Optimizing `for` Loops: Reverse Loops

for loops are basic language constructs in many languages. One of the first thing to look at when optimizing code is the loops, as they do considerable amounts of work (like going through a very large amount of data), in very little code.

If you go use for loop, but you don’t really care about the order in which the loop is executed, to be more precise, if you can afford reversing to loop, you can save quite some time. By reversing the loop I mean instead of giving the index values from 0 to 10 for example, you go from 10 downward to zero. This doesn’t seem like a big change, but when being carefully implemented this can easily upgrade the performance of your for loops.
Continue reading Optimizing for Loops: Reverse Loops

The Revised String Iteration Benchmark

In this post I’m going to discuss again the string benchmark I did before to find out what is the fastest way to iterate over an std::string. If you haven’t read the previous post on this subject go a head and read it as it covers the basic idea behind this benchmark. As I did the last time I did the benchmark, I check 5 ways of iteration:
Continue reading The Revised String Iteration Benchmark

Profiling Code Using `clock_gettime`

After raising the issue of the low resolution problem of the timer provided by clock() in Resolution Problems in clock(), I’ve ended the post by mentioning to two more functions that should provide high-resolution timers suitable for profiling code. In this post I will discuss one of them, clock_gettime().
Continue reading Profiling Code Using clock_gettime

What is the Fastest Method to Iterate Over a String?

Few days ago I decided to check what is really the fastest method to iterate over strings in C++. As a string class I chose string class from STL as it is very popular and provides a couple of ways to iterate it. So how can one iterate over an std::string?

By using indexes. E.g. str[i] and running i from zero to the length of the string.
By using the at method. string::at(size_t pos) provides similar interface to indexes with the exceptions that it checks whether the given position is past the end of the string and then throws an exception. One may see it as the safe version of the regular index.
Treating the string as a sequence of characters and and iterate over it using iterators.
Using string::c_str() to get a pointer to a regular C string representation of the string stored in the std::string and treating it as array, e.g. using indexes to go over it.
The last way to iterate over the string is to get a pointer to a C string representation using string::c_str() and advancing the pointer itself to iterate over the string.

The third method is the native method of iterating over objects in STL, and like the last two it can’t be used if the iteration changes the string itself (e.g. inserting or deleting characters). The first and second method are similar to the fourth (treating the pointer to the C string as an array), except that they aren’t so problematic as the latter when changing the string. The second method is the safest as it’s the only one that does range checks and throws exception if trying to access positions which are outside the string.

To benchmark and find out which method is the fastest method to iterate over a string I’ve created a huge string of random characters ranging from ‘a’ to ‘z’ and five executables, each one implementing one of the above iteration methods to do a simple task
(count the number of occurrences of each letter). The string is fifty million characters long which, as the longer the string the less important the overhead becomes.

The executables for the benchmark of every version were compiled with the default setting of g++ (without optimization as the compiler might change the iteration methods when optimizing). The benchmark executables where timed by using the time command and redirecting the executables output to /dev/null. The tests were run both on 64bit Gentoo (with 1 GB RAM) and on 32bit Kubuntu (with 512 MB RAM), to make sure the overall results (which method it better not the runtime itself) isn’t system depended.

Continue reading What is the Fastest Method to Iterate Over a String?