Fast bytes concatenation in Python

So what is the fastest way to concatenate bytes in Python? I decided to benchmark and compare a few common patterns to see how they hold up. The scenario I tested is iterative concatenation of a 1024-byte block until we get 1 MB of data. This is very similar to what one might do when reading a large file into memory, so the test is fairly realistic.

The first implementation is the naive one.

def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

It is well known that the naive implementation is very slow: bytes is an immutable type in Python, so after each concatenation we need to reallocate the bytes and copy them. Just how slow is it? About 330 times slower than the append-and-join pattern, which was a popular (and efficient) way to concatenate strings in old Python versions.

def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)

It relies on the fact that appending to a list is efficient, and that b''.join can then preallocate all the memory it needs and perform the copy efficiently. As the results below show, it is much more efficient than the naive implementation.

Python 2.6 introduced bytearray, an efficient mutable sequence of bytes. Being mutable allows one to "naively" concatenate onto the bytearray and still achieve great performance: more than 30% faster than the join pattern above.

def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret

[Figure: Comparing the naive, join and bytearray implementations. Time is for 64 iterations.]
[Figure: Comparing the join, bytearray, preallocated bytearray and memoryview implementations. Time is for 8192 iterations.]

What about preallocating the memory?

def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

While this sounds like a good idea, Python’s copy semantics turn out to be very slow: this version ran about 5 times slower than the simple bytearray implementation. Python also offers memoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The idea of access to the internal data without unnecessary copying sounds great.

def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret

And it does run almost twice as fast as the preallocated bytearray implementation, but it is still about 2.5 times slower than the simple bytearray implementation.
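To make the "without copying" part concrete, here is a minimal sketch (the variable names are only for illustration) contrasting a slice of a memoryview, which shares memory with the underlying bytearray, with a slice of the bytearray itself, which is a copy:

```python
buf = bytearray(b"hello world")
view = memoryview(buf)

tail = view[6:]        # slicing a memoryview copies no bytes
tail[:] = b"WORLD"     # the write lands directly in buf

copy = buf[6:]         # slicing the bytearray itself creates a new object
copy[:] = b"xxxxx"     # so this write does not touch buf

print(buf)
```

Note that assigning through a memoryview slice requires the new data to have exactly the same length as the slice.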

I ran the benchmark using the timeit module, taking the best run out of five for each implementation. The CPU was an Intel i7-8550U.

import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))

for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))
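Before trusting the timings, it is worth confirming that every variant builds the same 1 MB of data. A minimal sketch of such a check, with the five functions restated compactly; the BLOCK, COUNT and VARIANTS names are introduced here for illustration only:

```python
BLOCK = b'a' * 2**10   # the 1024-byte block added on every step
COUNT = 2**10          # 1024 blocks, 1 MB in total

def naive():
    ret = b''
    for _ in range(COUNT):
        ret += BLOCK
    return ret

def join():
    parts = []
    for _ in range(COUNT):
        parts.append(BLOCK)
    return b''.join(parts)

def simple_bytearray():
    ret = bytearray()
    for _ in range(COUNT):
        ret += BLOCK
    return ret

def preallocated(wrap=lambda buf: buf):
    # wrap=memoryview gives the memoryview variant
    ret = wrap(bytearray(len(BLOCK) * COUNT))
    for i in range(COUNT):
        ret[i * len(BLOCK):(i + 1) * len(BLOCK)] = BLOCK
    return ret

VARIANTS = {
    'naive': naive,
    'join': join,
    'bytearray': simple_bytearray,
    'preallocated': preallocated,
    'memoryview': lambda: preallocated(memoryview),
}

expected = BLOCK * COUNT
for name, func in VARIANTS.items():
    # bytes() normalizes bytearray and memoryview results for comparison
    assert bytes(func()) == expected, name
```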

Conclusion

The simple bytearray implementation was the fastest method, and it is just as simple as the naive implementation. Preallocating doesn’t help, because it looks like Python can’t copy efficiently.

SQL Dump for MS Access databases (.mdb files) on Linux

I recently had to work with some data that came in a huge Microsoft Access database. Because I like SQLite (and despise Access), I decided to export the data to an SQLite file. The first thing I needed to do was to somehow get all the data out of the db. Being a Linux user complicates things a bit, but thanks to mdb-tools it’s possible to process .mdb files without resorting to Windows and buying Access. Using mdb-tools directly can be tedious if you want to export a large db with multiple tables, so when I looked for a way to automate it, I came across Liberating data from Microsoft Access “.mdb” files. That post shows a nice script that dumps every table in a .mdb file to a separate CSV file.

While that is useful, I wanted something that I could easily import into SQLite, so I modified their script to generate an SQL dump of the db. Given a db file, it writes to stdout SQL statements describing the schema of the db, followed by INSERTs for each table. Actually, because mdb-tools doesn’t support SQLite as a backend, the dump uses the MySQL dialect, but it should be fine with SQLite as well (SQLite will mostly ignore the parts it can’t process, such as COMMENTs). The easiest way to use the script is

$ python AccessDump.py access.mdb | sqlite3 new.db

If the original db contains non-ASCII characters and isn’t encoded in UTF-8, you should set the MDB_JET3_CHARSET environment variable to the correct charset. The dump itself will be UTF-8 encoded.

$ MDB_JET3_CHARSET="cp1255" python AccessDump.py access.mdb | sqlite3 new.db
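The script itself is in the full post. As a rough sketch of the approach described above (not the actual AccessDump.py), the dump can be assembled from three mdb-tools commands. This assumes an mdb-tools version where `mdb-tables -1` lists one table per line, `mdb-schema <db> mysql` prints the schema, and `mdb-export -I mysql <db> <table>` prints INSERT statements; the function names are hypothetical.

```python
import subprocess
import sys

def dump_commands(db_path, tables, backend="mysql"):
    """Return the mdb-tools command lines for a full SQL dump."""
    # Schema first, then one INSERT-producing export per table.
    commands = [["mdb-schema", db_path, backend]]
    for table in tables:
        commands.append(["mdb-export", "-I", backend, db_path, table])
    return commands

def list_tables(db_path):
    """Ask mdb-tables for the table names, one per line."""
    out = subprocess.check_output(["mdb-tables", "-1", db_path])
    return [name for name in out.decode("utf-8").splitlines() if name]

def dump_sql(db_path, out=sys.stdout):
    """Write the schema followed by the INSERTs of every table to out."""
    for cmd in dump_commands(db_path, list_tables(db_path)):
        out.write(subprocess.check_output(cmd).decode("utf-8"))

# Usage (requires mdb-tools to be installed):
#     dump_sql("access.mdb")
```

Since the child processes inherit the environment, MDB_JET3_CHARSET set on the command line still takes effect.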


Fixing virtualenv after Upgrading Your Distribution/Python

After you upgrade your Python or your distribution (specifically, this happened to me after upgrading from Ubuntu 11.10 to 12.04), your existing virtualenv environments may stop working. This manifests itself as errors about missing modules. For example, when I tried to open a Django shell, it complained that urandom was missing from the os module. I guess almost any module will be broken.

Apparently, the solution is dead simple. Just re-create the virtualenv environment:

virtualenv /PATH/TO/EXISTING/ENVIRONMENT

or

virtualenv --system-site-packages /PATH/TO/EXISTING/ENVIRONMENT

(depending on how you originally created it), using the same path as the existing environment. All the modules you’ve already installed should keep working as before (at least it was that way for me).
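To check whether an environment is affected, before or after re-creating it, one can ask its interpreter to perform the import that failed. A small sketch; the helper name and the example path are hypothetical:

```python
import subprocess

def env_can_import(python_path, statement="from os import urandom"):
    """Return True if the interpreter at python_path runs statement without error."""
    status = subprocess.call(
        [python_path, "-c", statement],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return status == 0

# Usage, pointing at the interpreter inside the environment:
#     env_can_import("/PATH/TO/EXISTING/ENVIRONMENT/bin/python")
```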

Debugging File Type (MIME) Associations

I’m having less and less time to blog and write stuff lately, so it’s a good opportunity to catch up with old things I did. Back in the happy days when I used Gentoo, one of the irritating issues I faced was messed-up file type associations. The MIME type of some files was recognized incorrectly, and as a result KDE offered to open files with unsuitable applications. To debug it, I wrote a small Python script that shows how KDE applications are associated with MIME types and what MIME type is inferred from each file.

The script does so by querying KMimeType and KMimeTypeTrader. It does 3 things:

  • Given a MIME type, show its hierarchy and a list of applications associated with it.
  • Given an application, list all the MIME types it’s associated with.
  • Given a file, show its MIME type (and also the accuracy, which allows one to know why that MIME type was selected, although I admit that in the two years since I wrote it, I forgot how it works :))

The script is pasted below. I hope someone who still fiddles with less-than-standard installations will find it helpful.
Continue reading Debugging File Type (MIME) Associations

Building CookieJar out of Firefox’s cookies.sqlite

Firefox 3 started to store its cookies in an SQLite database instead of the old plain-text cookie.txt. While Python’s cookielib module could read the old cookie.txt file, it doesn’t handle the new format. The following Python snippet takes a CookieJar object and the path to Firefox’s cookies.sqlite (or a copy of it) and fills the CookieJar with the cookies from cookies.sqlite.

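import sqlite3
import cookielib

def get_cookies(cj, ff_cookies):
    # Fill the CookieJar cj with the cookies stored in the cookies.sqlite file.
    con = sqlite3.connect(ff_cookies)
    try:
        cur = con.cursor()
        cur.execute("SELECT host, path, isSecure, expiry, name, value FROM moz_cookies")
        for host, path, is_secure, expiry, name, value in cur.fetchall():
            c = cookielib.Cookie(0, name, value,   # version, name, value
                None, False,                       # port, port_specified
                host, host.startswith('.'), host.startswith('.'),  # domain, domain_specified, domain_initial_dot
                path, False,                       # path, path_specified
                is_secure,                         # secure
                expiry, expiry == "",              # expires, discard
                None, None, {})                    # comment, comment_url, rest
            print(c)
            cj.set_cookie(c)
    finally:
        con.close()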

It works well for me, except that apparently Firefox doesn’t save session cookies to disk at all.
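On Python 3 the cookielib module is named http.cookiejar. A sketch of the same idea for Python 3, assuming the moz_cookies columns used above are still present; the function name is hypothetical and the debug print is left out:

```python
import http.cookiejar
import sqlite3

def load_firefox_cookies(cj, ff_cookies):
    """Fill the CookieJar cj from a Firefox cookies.sqlite file."""
    con = sqlite3.connect(ff_cookies)
    try:
        rows = con.execute(
            "SELECT host, path, isSecure, expiry, name, value FROM moz_cookies"
        ).fetchall()
    finally:
        con.close()
    for host, path, is_secure, expiry, name, value in rows:
        cj.set_cookie(http.cookiejar.Cookie(
            0, name, value,                 # version, name, value
            None, False,                    # port, port_specified
            host, host.startswith('.'), host.startswith('.'),
            path, False,                    # path, path_specified
            bool(is_secure),                # secure
            expiry, expiry == "",           # expires, discard
            None, None, {}))                # comment, comment_url, rest

# Usage, working on a copy of the file:
#     jar = http.cookiejar.CookieJar()
#     load_firefox_cookies(jar, "cookies.sqlite")
```

Working on a copy is the safer choice, since Firefox may keep the original file locked while it is running.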

Audio Based True Random Number Generator POC

A few days ago I came up with an idea to create a true random number generator based on noise gathered from a cheap microphone attached to my computer. Tests showed that when sampling the microphone, the least significant bit behaves pretty randomly. This led me to think it might be a good source of entropy for a true random number generator.

Python’s base64 Module Fails to Decode Unicode Strings

If you’ve got a base64 string as a unicode object and you try to use Python’s base64 module with altchars set, it fails with the following error:

TypeError: character mapping must return integer, None or unicode

This pretty unhelpful error message also occurs if you call any function that uses altchars indirectly. For example:

base64.urlsafe_b64decode(unicode('aass'))
base64.b64decode(unicode('aass'),'-_')

both fail while the following works:

base64.urlsafe_b64decode('aass')
base64.b64decode(unicode('aass'))

While it’s not complicated to fix (just convert any unicode string to an ASCII byte string first), it’s still annoying.
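The workaround can be wrapped in a small helper. A minimal sketch, with a hypothetical function name, that encodes text to ASCII before handing it to the base64 module:

```python
import base64

def urlsafe_b64decode_text(data):
    """Decode URL-safe base64 given either a byte string or a unicode string."""
    if not isinstance(data, bytes):
        # Base64 text is plain ASCII, so this encode is lossless for valid input.
        data = data.encode('ascii')
    return base64.urlsafe_b64decode(data)

print(urlsafe_b64decode_text(u'aass'))
```

Input containing non-ASCII characters raises UnicodeEncodeError here, which is arguably a clearer failure than the TypeError above.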

URL-Safe Timestamps using Base64

Passing around timestamps in URLs is a common task, and we usually want our URLs to be as short as possible. I’ve found that Base64 gives the shortest URL-safe representation, just 6 chars. This compares with the 12 chars of the naive way, and 8 chars when using a hex representation.

The following Python functions allow you to build and read these 6-char URL-safe timestamps:
Continue reading URL-Safe Timestamps using Base64
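The original functions are in the full post. As an illustration of where the 6 chars come from, here is one possible sketch: it assumes the timestamp fits in an unsigned 32-bit integer, packs it into 4 bytes, and drops the two "=" padding characters that Base64 would append. The function names are hypothetical.

```python
import base64
import struct

def encode_timestamp(ts):
    """Encode a Unix timestamp (0 <= ts < 2**32) as 6 URL-safe chars."""
    packed = struct.pack(">I", int(ts))            # 4 bytes, big-endian
    token = base64.urlsafe_b64encode(packed)       # 8 chars, ending in "=="
    return token.rstrip(b"=").decode("ascii")      # padding is redundant here

def decode_timestamp(token):
    """Inverse of encode_timestamp."""
    padded = token.encode("ascii") + b"=="         # restore the padding
    return struct.unpack(">I", base64.urlsafe_b64decode(padded))[0]

token = encode_timestamp(1300000000)
print(token, decode_timestamp(token))
```

Four bytes are 32 bits and each Base64 char carries 6 bits, so 6 chars are enough, while hex needs 2 chars per byte, hence 8.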

An Early Release of the New cssrtl.py-2.0

It has been three years since I released the original version of cssrtl.py (and two since its re-release). The old version did a nice job, but the experience gained during that time led me to write a new version from scratch. More than a month ago I detailed the basic principles and ideas that guided me in designing a better tool for adapting CSS files from left-to-right to right-to-left.

The guidelines weren’t just empty words: they were written while I was working on the Hebrew adaptation of the Fusion theme and at the same time writing a new proof-of-concept version of cssrtl.py. The original intent was to release a more mature version of that code once it was completed. However, due to the apparent shortage of time in the present and foreseeable future, I can’t see myself completing the project any time soon. So, following the “release early” mantra, I’ve decided to release the code as-is. As I said, the code is in a working state but not polished, so it may be of benefit but may also contain bugs. If you find any bugs or have any suggestions, I would be glad to hear about them.

Scanning Documents Written in Blue Ink – biscan

After writing the post on converting PNMs to DjVu, I ran into some trouble scanning documents written in blue ink. The problem: XSane didn’t let me set the threshold for converting the scanned image to line art (B&W). So I tried scanning the document in grayscale and in color and converting it to bitonal afterwards using ImageMagick. I tried two approaches. With the -monochrome command-line switch the conversion looked good, but it used halftones (dithering), and the resulting DjVu document was twice as large as a normal B&W one. With the -threshold switch the compressed DjVu document was much smaller, but it looked awful: either it was too dark, or some of the text disappeared. After giving it some thought, I knew I could find a better solution.