pip list --user --outdated lists all user-installed packages that are outdated; adding --format=json makes the output machine-readable. We use jq to extract the package names from the JSON output and feed them to xargs.
So what is the fastest way to concatenate bytes in Python? I decided to benchmark a few common patterns and see how they hold up. The scenario I tested is iterative concatenation of 1024-byte blocks until 1MB of data is accumulated. This is very similar to what one might do when reading a large file into memory, so the test is fairly realistic.
The first implementation is the naive one.
def f():
    ret = b''
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
It is well known that the naive implementation is very slow: bytes is an immutable type in Python, so every concatenation has to allocate a new bytes object and copy both operands into it. Just how slow is it? About 330 times slower than the append-and-join pattern, which was a popular (and efficient) way to concatenate strings in old Python versions.
def g():
    ret = list()
    for i in range(2**10):
        ret.append(b'a' * 2**10)
    return b''.join(ret)
It relies on the fact that appending to a list is efficient, and b''.join can then preallocate the entire needed buffer and perform the copy in one pass. As you can see below, it is much more efficient than the naive implementation.
Python 2.6 introduced the bytearray as an efficient mutable bytes sequence. Being mutable allows one to "naively" concatenate the bytearray and achieve great performance, more than 30% faster than the join pattern above.
def h():
    ret = bytearray()
    for i in range(2**10):
        ret += b'a' * 2**10
    return ret
[Figure: Comparing the naive, join, and bytearray implementations. Time is for 64 iterations.]
[Figure: Comparing the join, bytearray, preallocated bytearray, and memoryview implementations. Time is for 8192 iterations.]
What about preallocating the memory?
def j():
    ret = bytearray(2**20)
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
While this sounds like a good idea, Python’s copy semantics turn out to be very slow: this version ran about 5 times slower than the plain bytearray implementation. Python also offers memoryview:
memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.
The idea of accessing the internal data without unnecessary copying sounds great.
def k():
    ret = memoryview(bytearray(2**20))
    for i in range(2**10):
        ret[i*2**10:(i+1)*2**10] = b'a' * 2**10
    return ret
And it does run almost twice as fast as the preallocated bytearray implementation, but still about 2.5 times slower than the simple bytearray implementation.
I ran the benchmark using the timeit module, taking the best run out of five for each. The CPU was an Intel i7-8550U.
import timeit

for m in [f, g, h]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**6)))
for m in [g, h, j, k]:
    print(m, min(timeit.repeat(m, repeat=5, number=2**13)))
Conclusion
The simple bytearray implementation was the fastest method, and it is just as simple as the naive one. Preallocating doesn’t help, apparently because Python cannot perform the slice copies efficiently.
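Tying this back to the file-reading scenario mentioned at the start, the winning pattern can be sketched as follows (read_all and its chunk size are illustrative names of my own, not part of the benchmark):

```python
import io

def read_all(f, chunk_size=2**10):
    # Accumulate chunks in a mutable bytearray -- the fastest pattern above --
    # and convert to immutable bytes only once, at the end.
    buf = bytearray()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
    return bytes(buf)

# Example with an in-memory file standing in for a real one:
data = read_all(io.BytesIO(b'a' * 2**20))
```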
Sometimes you need a quick-and-dirty implementation of the greatest common divisor and its extended variant. Actually, because the fractions module already comes with a gcd function, the implementation of the extended greatest common divisor algorithm is probably the more useful of the two. Both implementations are recursive, but because the recursion depth of Euclid’s algorithm grows very slowly, they work fine even for comparatively large integers.
# simply returns the gcd
gcd = lambda a,b: gcd(b, a % b) if b else a
# egcd(a,b) returns (d,x,y) where d == a*x + b*y
egcd = lambda a,b: (lambda d,x,y: (d, y, x - (a // b) * y))(*egcd(b, a % b)) if b else (a, 1, 0)
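As a quick sanity check, and as a typical application, egcd gives you modular inverses almost for free (the modinv helper below is my own addition, not part of the original snippet; the lambdas are repeated so the block is self-contained):

```python
# Definitions repeated from above for self-containedness.
gcd = lambda a, b: gcd(b, a % b) if b else a
egcd = lambda a, b: (lambda d, x, y: (d, y, x - (a // b) * y))(*egcd(b, a % b)) if b else (a, 1, 0)

# Sanity check: d == a*x + b*y
d, x, y = egcd(240, 46)
assert d == 240 * x + 46 * y == 2

# Typical application: the modular inverse of a mod m exists iff gcd(a, m) == 1
def modinv(a, m):
    d, x, _ = egcd(a, m)
    if d != 1:
        raise ValueError("a is not invertible modulo m")
    return x % m

modinv(3, 7)  # 3 * 5 == 15 == 1 (mod 7), so the inverse is 5
```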
Assuming that filepath is user-controlled, a malicious user might attempt a directory traversal (like setting filepath to ../../../etc/passwd). How can we make sure that filepath cannot traverse “above” our prefix? There are, of course, numerous solutions to sanitizing input against directory traversal. The easiest way (that I came up with) to do so in Python is:
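A sketch of the idea (PREFIX and the sanitize wrapper are illustrative names of my own choosing):

```python
import os.path

PREFIX = '/var/www/uploads'  # hypothetical upload directory

def sanitize(filepath):
    # Join to '/' to make the path absolute, normalize it (collapsing any
    # '..' components against the root, which they cannot climb above),
    # then make it relative again before joining it to the prefix.
    safe = os.path.relpath(os.path.join('/', filepath), '/')
    return os.path.join(PREFIX, safe)

sanitize('../../../etc/passwd')  # stays inside PREFIX
```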
It works because it turns the path into an absolute path, normalizes it, and makes it relative again. As one cannot traverse above /, it effectively ensures that filepath cannot go outside of PREFIX.
Post updated: See the comments below for an explanation of the changes.
Every time I want to start a new open-source project, I come across this small “problem”: making sure that the name for the project isn’t already taken. Today I decided to solve it by creating a simple script that queries different open-source repositories to check if a project with the desired name exists.
Usage is quite simple:
$ name_taken.py enlightenment
Debian: Name not taken :-)
SourceForge: Name taken :-(
Currently, the script is in an early stage and can search for projects in Debian’s list of packages and on SourceForge. The code is hosted on GitHub: https://github.com/guyru/name_taken, and licensed under GPL2 or higher. Suggestions on how to make this tool more useful (and, of course, patches) are really welcome.
I recently had to work with some data that came in a huge Microsoft Access database. Because I like SQLite (and despise Access), I decided to export the data to an SQLite file. The first thing I needed to do was somehow get all the data out of the db. Being a Linux user complicates things a bit, but thanks to mdb-tools, it’s possible to process .mdb files without resorting to Windows and buying Access. Using mdb-tools directly can be tedious if you want to export a large db with multiple tables, so when I looked for a way to automate it, I came across Liberating data from Microsoft Access “.mdb” files. This post shows a nice script that dumps every table in a .mdb file to a separate CSV file.
While useful, I wanted something that I could easily import into SQLite. So I modified their script to generate an SQL dump of the db. Given a db file, it writes SQL statements describing the schema of the DB to stdout, followed by INSERTs for each table. Actually, because mdb-tools doesn’t support SQLite as a backend, the dump uses a MySQL dialect, but it should work fine with SQLite as well (SQLite will mostly ignore the parts it can’t process, such as COMMENTs). The easiest way to use the script is to pipe its output straight into the sqlite3 command-line client.
If the original db contains non-ASCII characters and isn’t encoded in UTF-8, you should set the MDB_JET3_CHARSET environment variable to the correct charset. The dump itself will be UTF-8 encoded.
After you upgrade your Python/distribution (specifically, this happened to me after upgrading from Ubuntu 11.10 to 12.04), your existing virtualenv environments may stop working. This manifests itself as reports that some modules are missing. For example, when I tried to open a Django shell, it complained that urandom was missing from the os module. I guess almost any module can break.
Apparently, the solution is dead simple: just re-create the virtualenv environment in the same place, using the same command you originally created it with. All the modules you’ve already installed should keep working as before (at least they did for me).
I’m having less and less time to blog and write stuff lately, so it’s a good opportunity to catch up with old things I did. Back in the happy days when I used Gentoo, one of the irritating issues I faced was messed-up file type associations. The MIME type for some files was recognized incorrectly, and as a result, KDE offered to open files with unsuitable applications. In order to debug it, I wrote a small Python script that would help me debug the way KDE applications are associated with MIME types and what MIME type is inferred from each file.
The script does so by querying KMimeType and KMimeTypeTrader. It does three things:
1. Given a MIME type, show its hierarchy and the list of applications associated with it.
2. Given an application, list all the MIME types it’s associated with.
3. Given a file, show its MIME type, along with the accuracy of the match, which lets one see why that MIME type was selected (although I admit that in the two years since I wrote the script, I’ve forgotten exactly how it works :)).
I’ve been using Python to write various bots and crawlers for a long time. A few days ago I needed to write a simple bot to remove some 400+ spam pages in Sikumuna, so I took an old script of mine (from 2006) and modified it. The script used ClientForm, a Python module that makes it easy to parse and fill HTML forms. I quickly found that ClientForm is now deprecated in favor of mechanize. At first I was somewhat put off by the change, as ClientForm was pretty easy to use and mechanize’s documentation could use some improvement. However, I quickly changed my mind about mechanize. Its basic interface is a simple browser object that literally lets you browse using Python. It takes care of handling cookies and the like, and it offers form-filling abilities similar to ClientForm’s, but this time integrated into the browser object.
For future reference for myself, and as another code example for mechanize‘s sparse documentation, I’m giving below the gist of the simple bot I wrote:
Firefox 3 started storing its cookies in a SQLite database instead of the old plain-text cookie.txt. While Python’s cookielib module could read the old cookie.txt file, it doesn’t handle the new format. The following Python snippet takes a CookieJar object and the path to Firefox cookies.sqlite (or a copy of it) and fills the CookieJar with the cookies from cookies.sqlite.
import sqlite3
import cookielib

def get_cookies(cj, ff_cookies):
    con = sqlite3.connect(ff_cookies)
    cur = con.cursor()
    cur.execute("SELECT host, path, isSecure, expiry, name, value FROM moz_cookies")
    for item in cur.fetchall():
        c = cookielib.Cookie(0, item[4], item[5],
            None, False,
            item[0], item[0].startswith('.'), item[0].startswith('.'),
            item[1], False,
            item[2],
            item[3], item[3] == "",
            None, None, {})
        print c
        cj.set_cookie(c)
It works well for me, except that apparently Firefox doesn’t save session cookies to disk at all.
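The snippet above is written for Python 2 (cookielib and the print statement). Under Python 3 the module was renamed http.cookiejar; a sketch of the same approach, with the tuple unpacked into named variables for readability:

```python
import sqlite3
import http.cookiejar  # cookielib was renamed in Python 3

def get_cookies(cj, ff_cookies):
    con = sqlite3.connect(ff_cookies)
    cur = con.cursor()
    cur.execute("SELECT host, path, isSecure, expiry, name, value"
                " FROM moz_cookies")
    for host, path, is_secure, expiry, name, value in cur.fetchall():
        c = http.cookiejar.Cookie(
            0, name, value,
            None, False,                     # port, port_specified
            host, host.startswith('.'), host.startswith('.'),
            path, False,                     # path, path_specified
            bool(is_secure),
            expiry, False,                   # expires, discard
            None, None, {})                  # comment, comment_url, rest
        cj.set_cookie(c)
    con.close()
```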