Fast bytes concatenation in Python

So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a block of 1024 bytes until we get 1MB of data. This is very similar to what one might do when reading a large file to memory, so this test is pretty realistic.

The first implementation is the naive one.

def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

It is known that the naive implementation is very slow, as bytes in Python are immutable type, hence we need to realloc the bytes and copy them after each concatenation. Just how slow is it? about 330 times slower than the append-and-join pattern. The append-and-join pattern was a popular (and efficient) way to concatenate strings in old Python versions

def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)

It relies on the fact that appending to lists is efficient and then ''.join can preallocate the entire needed memory and perform the copy efficiently. As you can see below it is much more efficient than the naive implementation.

Python 2.6 introduced the bytearray as an efficient mutable bytes sequence. Being mutable allows one to "naively" concatenate the bytearray and achieve great performance more than 30% faster than the join pattern above.

def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
Comparing the naive, join and bytearray implementation. Time is for 64 iterations.
Comparing the join, bytearray, preallocated bytearray and memoryview implementation. Time is for 8196 iterations.

What about perallocating the memory?

def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

While this sounds like a good idea, Pythons copy semantics turn out to be very slow. This resulted in 5 times slower run times. Python also offers memeoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The idea of access to the internal data without unnecessary copying sounds great.

def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

And it does run almost twice as fast as preallocated bytearray implementation, but still about 2.5 times slower than the simple bytearray implementation.

I ran the benchmark using the timeit module, taking the best run out of five for each. CPU was Intel i7-8550U.

import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))

for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))

Conclusion

The simple bytearray implementation was the fastest method, and also as simple as the naive implementation. Also preallocating doesn’t help, because python it looks like python can’t copy efficiently.

One Line Implementations of GCD and Extended GCD in Python

Sometimes you need quick and dirty implementation of greatest common divisor and its extended variant. Actually, because the fractions module comes gcd function, probably the implementation of the exteded greatest common divisor algorithm is more useful. Both implementations are recursive, but work nonetheless even for comparatively large integers.

# simply returns the gcd
gcd = lambda a,b: gcd(b, a % b) if b else a

# egcd(a,b) returns (d,x,y) where d == a*x + b*y
egcd = lambda a,b: (lambda d,x,y: (d, y, x - (a // b) * y))(*egcd(b, a % b)) if b else (a, 1, 0)

The code above is Python 2 and 3 compatible.

Preventing Directory Traversal in Python

Consider the following use case:

PREFIX = '/home/user/files/'
full_path = os.path.join(PREFIX, filepath)
read(full_path, 'rb')
...

Assuming that filepath is user-controlled, a malicious user user might attempt a directory traversal (like setting filepath to ../../../etc/passwd). How can we make sure that filepath cannot traverse “above” our prefix? There are of course numerous solutions to sanitizing input against directory traversalthat. The easiest way (that I came up with) to do so in python is:

filepath = os.normpath('/' + filepath).lstrip('/')

It works because it turns the path into an absolute path, normalizes it and makes it relative again. As one cannot traverse above /, it effectively ensures that the filepath cannot go outside of PREFIX.

Post updated: see the comments below for explanation of the changes.

name-taken – Check if your project name is taken

Every time I want to start a new open-source project I come across this small “problem”: Making sure that the name for the project isn’t already taken. Today I decided to solve it by creating a simple script that queries different open-source repositories to check if there exists a project with the desired name.

Usage is quite simple:

$ name_taken.py enlightenment
Debian: Name not taken :-)
SourceForge: Name taken :-(

Currently the script is in early stage, and can search for projects in Debian’s list of packages and in SourceForge. The code is available is hosted in GitHub: https://github.com/guyru/name_taken, and licensed under GPL2 or higher. Suggestions on how to make this tool more useful (and of course patches) are really welcomed.

Fixing virtualenv after Upgrading Your Distribution/Python

After you upgrade your python/distribution (specifically this happened to me after upgrading from Ubuntu 11.10 to 12.04), your existing virtualenv environments may stop working. This manifests itself by reporting that some modules are missing. For example when I tried to open a Django shell, it complained that urandom was missing from the os module. I guess almost any module will be broken.

Apparently, the solution is dead simple. Just re-create the virtualenv environment:

virtualenv /PATH/TO/EXISTING/ENVIRONMENT

or

virtualenv --system-site-packages /PATH/TO/EXISTING/ENVIRONMENT

(depending on how you created it in the same place). All the modules you’ve already installed should keep working as before (at least it was that way for me).

mechanize – Writing Bots in Python Made Simple

I’ve been using python to write various bots and crawler for a long time. Few days ago I needed to write some simple bot to remove some 400+ spam pages in Sikumuna, I took an old script of mine (from 2006) in order to modify it. The script used ClientForm, a python module that allows you to easily parse and fill html forms using python. I quickly found that ClientForm is now deprecated in favor of mechanize. In the beginning I was partly set back by the change, as ClientForm was pretty easy to use, and mechanize‘s documentation could use some improvement. However, I quickly changed my mind about mechanize. The basic interface for mechanize is a simple browser object, that litteraly allows you to browse using python. It takes care of handling cookies and such and it got similar form-filling abilities to ClientForm, but this time they are integrated into the browser object.

For future reference for myself, and as another code example to mechanizes sparse documentation I’m giving below the gist of the simple bot I wrote:

Continue reading mechanize – Writing Bots in Python Made Simple

Audio Based True Random Number Generator POC

Few days ago I came up with an idea to create a true random number generator based on noise gathered from a cheap microphone attached to my computer. Tests showed that when sampling the microphone, the least significant bit behaves pretty randomly. This lead me to think it might be good source for gathering entropy for a true random number generator.
Continue reading Audio Based True Random Number Generator POC

An Early Release of the New cssrtl.py-2.0

It has been three years since I’ve released the original version of cssrtl.py (and two since it’s re-release). The old version did a nice job, but experience gained during that time led me to write from scratch a new version. I’ve detailed more than a month ago, the basic principles and ideas that guided me to design a better tool to help adapting CSS files from left-to-right to right-to-left.

The guidelines weren’t just empty words, they were written while working on the Hebrew adaptation to the Fusion theme and in the same time writing a new proof-of-concept version of cssrtl.py. The original intent was to release a more mature version of that code when it will be completed. However, due to the apparent shortage of time in the present and foreseeable future, I can’t see myself complete the project any time soon. So following the “release early” mantra, I’ve decided to release the code as-is. As I said, the code is in working state, but not polished, so it may be of benefit but may contain bugs. If you find any bugs or have any suggestions, I would be glad to hear.
Continue reading An Early Release of the New cssrtl.py-2.0

tarsum-0.2 – A read only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, this was my original intention. When I decided I want the script in bash for simplicity, I forfeited the idea and settled for extracting the files and then going over all the files to calculate their checksum value.

So when Jon Flowers asked in the comments of the original tarsum post about the possibility of getting the checksums of files in the tar file without extracting all the archive, I’ve decided to re-tackle the problem.

Continue reading tarsum-0.2 – A read only version of tarsum

Damerau-Levenshtein Distance in Python

Damerau-Levenshtein Distance is a metric for measuring how far two given strings are, in terms of 4 basic operations:

  • deletion
  • insertion
  • substitution
  • transposition

The distance of two strings are the minimal number of such operations needed to transform the first string to the second. The algorithm can be used to create spelling correction suggestions, by finding the closest word from a given list to the users input. See Damerau–Levenshtein distance (Wikipedia) for more info on the subject.

Here is an of the algorithm (restricted edit distance version) in Python. While this implementation isn’t perfect (performance wise) it is well suited for many applications.

"""
Compute the Damerau-Levenshtein distance between two given
strings (s1 and s2)
"""
def damerau_levenshtein_distance(s1, s2):
    d = {}
    lenstr1 = len(s1)
    lenstr2 = len(s2)
    for i in xrange(-1,lenstr1+1):
        d[(i,-1)] = i+1
    for j in xrange(-1,lenstr2+1):
        d[(-1,j)] = j+1

    for i in xrange(lenstr1):
        for j in xrange(lenstr2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i,j)] = min(
                           d[(i-1,j)] + 1, # deletion
                           d[(i,j-1)] + 1, # insertion
                           d[(i-1,j-1)] + cost, # substitution
                          )
            if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition

    return d[lenstr1-1,lenstr2-1]

Update 24 Mar, 2012: Fixed the error in computing transposition at the beginning of the strings.