Category Archives: Python

One Line Implementations of GCD and Extended GCD in Python

Sometimes you need quick and dirty implementation of greatest common divisor and its extended variant. Actually, because the fractions module comes gcd function, probably the implementation of the exteded greatest common divisor algorithm is more useful. Both implementations are recursive, but work nonetheless even for comparatively large integers.

# simply returns the gcd
gcd = lambda a,b: gcd(b, a % b) if b else a

# egcd(a,b) returns (d,x,y) where d == a*x + b*y
egcd = lambda a,b: (lambda d,x,y: (d, y, x - (a // b) * y))(*egcd(b, a % b)) if b else (a, 1, 0)

The code above is Python 2 and 3 compatible.

Preventing Directory Traversal in Python

Consider the following use case:

PREFIX = '/home/user/files/'
full_path = os.path.join(PREFIX, filepath)
read(full_path, 'rb')
...

Assuming that filepath is user-controlled, a malicious user user might attempt a directory traversal (like setting filepath to ../../../etc/passwd). How can we make sure that filepath cannot traverse “above” our prefix? There are of course numerous solutions to sanitizing input against directory traversalthat. The easiest way (that I came up with) to do so in python is:

filepath = os.normpath('/' + filepath).lstrip('/')

It works because it turns the path into an absolute path, normalizes it and makes it relative again. As one cannot traverse above /, it effectively ensures that the filepath cannot go outside of PREFIX.

Post updated: see the comments below for explanation of the changes.

name-taken – Check if your project name is taken

Every time I want to start a new open-source project I come across this small “problem”: Making sure that the name for the project isn’t already taken. Today I decided to solve it by creating a simple script that queries different open-source repositories to check if there exists a project with the desired name.

Usage is quite simple:

$ name_taken.py enlightenment
Debian: Name not taken :-)
SourceForge: Name taken :-(

Currently the script is in early stage, and can search for projects in Debian’s list of packages and in SourceForge. The code is available is hosted in GitHub: https://github.com/guyru/name_taken, and licensed under GPL2 or higher. Suggestions on how to make this tool more useful (and of course patches) are really welcomed.

Fixing virtualenv after Upgrading Your Distribution/Python

After you upgrade your python/distribution (specifically this happened to me after upgrading from Ubuntu 11.10 to 12.04), your existing virtualenv environments may stop working. This manifests itself by reporting that some modules are missing. For example when I tried to open a Django shell, it complained that urandom was missing from the os module. I guess almost any module will be broken.

Apparently, the solution is dead simple. Just re-create the virtualenv environment:

virtualenv /PATH/TO/EXISTING/ENVIRONMENT

or

virtualenv --system-site-packages /PATH/TO/EXISTING/ENVIRONMENT

(depending on how you created it in the same place). All the modules you’ve already installed should keep working as before (at least it was that way for me).

mechanize – Writing Bots in Python Made Simple

I’ve been using python to write various bots and crawler for a long time. Few days ago I needed to write some simple bot to remove some 400+ spam pages in Sikumuna, I took an old script of mine (from 2006) in order to modify it. The script used ClientForm, a python module that allows you to easily parse and fill html forms using python. I quickly found that ClientForm is now deprecated in favor of mechanize. In the beginning I was partly set back by the change, as ClientForm was pretty easy to use, and mechanize‘s documentation could use some improvement. However, I quickly changed my mind about mechanize. The basic interface for mechanize is a simple browser object, that litteraly allows you to browse using python. It takes care of handling cookies and such and it got similar form-filling abilities to ClientForm, but this time they are integrated into the browser object.

For future reference for myself, and as another code example to mechanizes sparse documentation I’m giving below the gist of the simple bot I wrote:

Continue reading

Audio Based True Random Number Generator POC

Few days ago I came up with an idea to create a true random number generator based on noise gathered from a cheap microphone attached to my computer. Tests showed that when sampling the microphone, the least significant bit behaves pretty randomly. This lead me to think it might be good source for gathering entropy for a true random number generator.
Continue reading

An Early Release of the New cssrtl.py-2.0

It has been three years since I’ve released the original version of cssrtl.py (and two since it’s re-release). The old version did a nice job, but experience gained during that time led me to write from scratch a new version. I’ve detailed more than a month ago, the basic principles and ideas that guided me to design a better tool to help adapting CSS files from left-to-right to right-to-left.

The guidelines weren’t just empty words, they were written while working on the Hebrew adaptation to the Fusion theme and in the same time writing a new proof-of-concept version of cssrtl.py. The original intent was to release a more mature version of that code when it will be completed. However, due to the apparent shortage of time in the present and foreseeable future, I can’t see myself complete the project any time soon. So following the “release early” mantra, I’ve decided to release the code as-is. As I said, the code is in working state, but not polished, so it may be of benefit but may contain bugs. If you find any bugs or have any suggestions, I would be glad to hear.
Continue reading

tarsum-0.2 – A read only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, this was my original intention. When I decided I want the script in bash for simplicity, I forfeited the idea and settled for extracting the files and then going over all the files to calculate their checksum value.

So when Jon Flowers asked in the comments of the original tarsum post about the possibility of getting the checksums of files in the tar file without extracting all the archive, I’ve decided to re-tackle the problem.

Continue reading

Damerau-Levenshtein Distance in Python

Damerau-Levenshtein Distance is a metric for measuring how far two given strings are, in terms of 4 basic operations:

  • deletion
  • insertion
  • substitution
  • transposition

The distance of two strings are the minimal number of such operations needed to transform the first string to the second. The algorithm can be used to create spelling correction suggestions, by finding the closest word from a given list to the users input. See Damerau–Levenshtein distance (Wikipedia) for more info on the subject.

Here is an of the algorithm (restricted edit distance version) in Python. While this implementation isn’t perfect (performance wise) it is well suited for many applications.

"""
Compute the Damerau-Levenshtein distance between two given
strings (s1 and s2)
"""
def damerau_levenshtein_distance(s1, s2):
    d = {}
    lenstr1 = len(s1)
    lenstr2 = len(s2)
    for i in xrange(-1,lenstr1+1):
        d[(i,-1)] = i+1
    for j in xrange(-1,lenstr2+1):
        d[(-1,j)] = j+1
 
    for i in xrange(lenstr1):
        for j in xrange(lenstr2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i,j)] = min(
                           d[(i-1,j)] + 1, # deletion
                           d[(i,j-1)] + 1, # insertion
                           d[(i-1,j-1)] + cost, # substitution
                          )
            if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition
 
    return d[lenstr1-1,lenstr2-1]

Update 24 Mar, 2012: Fixed the error in computing transposition at the beginning of the strings.

Retrieving Google’s Cache for a Whole Website

Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so some thing had to be done in case the hard disk couldn’t be recovered. My efforts turned to Google’s cache. Google keeps a copy of the text of the web page in it’s cache, something that is usually useful when the website is temporarily unavailable. The basic idea is to retrieve a copy of all the pages of a certain site that Google has a cache of.
Continue reading