Audio-Based True Random Number Generator POC

A few days ago, I came up with an idea to create a true random number generator based on noise gathered from a cheap microphone attached to my computer. Tests showed that when sampling the microphone, the least significant bit behaves pretty randomly. This led me to think it might be a good source for gathering entropy for a true random number generator.
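
The idea of harvesting the least significant bit can be sketched roughly as follows. This is a minimal illustration, not the proof-of-concept code itself: it assumes the microphone samples have already been read as a sequence of integers, all function names are hypothetical, and the von Neumann debiasing step is an assumption added for illustration rather than something the post describes.

```python
def lsb_bits(samples):
    """Yield the least significant bit of each integer PCM sample."""
    for sample in samples:
        yield sample & 1


def von_neumann(bits):
    """Debias a bit stream: the pair 01 gives 0, 10 gives 1, 00 and 11 are dropped."""
    it = iter(bits)
    for first, second in zip(it, it):
        if first != second:
            yield first


def pack_bytes(bits):
    """Pack bits into bytes, most significant bit first; a trailing partial byte is dropped."""
    out = bytearray()
    acc = 0
    count = 0
    for bit in bits:
        acc = (acc << 1) | bit
        count += 1
        if count == 8:
            out.append(acc)
            acc = 0
            count = 0
    return bytes(out)
```

Chaining the three, as in `pack_bytes(von_neumann(lsb_bits(samples)))`, turns raw samples into candidate random bytes. Whether the output is actually random still has to be checked with statistical tests.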

Python’s base64 Module Fails to Decode Unicode Strings

If you’ve got a base64 string as a unicode object and you try to use Python’s base64 module with altchars set, it fails with the following error:

TypeError: character mapping must return integer, None or unicode

This is a pretty unhelpful error message. It also occurs if you try any method that indirectly uses altchars. For example:

base64.urlsafe_b64decode(unicode('aass'))
base64.b64decode(unicode('aass'),'-_')

Both fail, while the following works:

base64.urlsafe_b64decode('aass')
base64.b64decode(unicode('aass'))

While it’s not complicated to fix (just convert any unicode string to an ASCII byte string before decoding), it’s still annoying.
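
The workaround can be wrapped in a small helper. This is a sketch only; `safe_urlsafe_b64decode` is a hypothetical name, not part of the base64 module. It relies on the fact that valid base64 text contains only ASCII characters, so encoding it to ASCII loses nothing.

```python
import base64


def safe_urlsafe_b64decode(data):
    """Decode URL-safe base64 given either a text or a byte string."""
    if not isinstance(data, bytes):
        # Text (unicode) input: base64 uses only ASCII characters, so
        # encoding to ASCII is lossless and raises on anything else.
        data = data.encode('ascii')
    return base64.urlsafe_b64decode(data)
```

The same `isinstance` check runs on both Python 2, where `bytes` is an alias for `str`, and Python 3.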

URL-Safe Timestamps Using Base64

Passing around timestamps in URLs is a common task. We usually want our URLs to be as short as possible. I’ve found Base64 to result in the shortest URL-safe representation: just 6 chars. This compares with the 12 chars of the naive way, and 8 chars when using a hex representation.

Two short Python functions are enough to build and read these 6-char URL-safe timestamps.
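
The functions from the full post are not reproduced in this excerpt. A minimal sketch of the technique, assuming the timestamp fits in a 32-bit unsigned integer and is packed big-endian (the names `encode_timestamp` and `decode_timestamp` are hypothetical), might look like this:

```python
import base64
import struct


def encode_timestamp(ts):
    """Encode a 32-bit unsigned Unix timestamp as 6 URL-safe characters."""
    raw = struct.pack('>I', ts)               # 4 bytes, big-endian
    text = base64.urlsafe_b64encode(raw)      # 8 chars, the last two are '=='
    return text.rstrip(b'=').decode('ascii')  # drop the padding -> 6 chars


def decode_timestamp(token):
    """Reverse encode_timestamp() by restoring the stripped padding."""
    raw = base64.urlsafe_b64decode(token.encode('ascii') + b'==')
    return struct.unpack('>I', raw)[0]
```

Four bytes always encode to six significant base64 characters plus two padding characters, which is why the padding can be dropped on the way out and restored unconditionally on the way in.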

An Early Release of the New cssrtl.py-2.0

It has been three years since I released the original version of cssrtl.py (and two since its re-release). The old version did a nice job, but experience gained during that time led me to write a new version from scratch. More than a month ago, I detailed the basic principles and ideas that guided me in designing a better tool to help adapt CSS files from left-to-right to right-to-left.

The guidelines weren’t just empty words; they were written while I was working on the Hebrew adaptation of the Fusion theme and, at the same time, writing a new proof-of-concept version of cssrtl.py. The original intent was to release a more mature version of that code once it was completed. However, given the shortage of time now and in the foreseeable future, I can’t see myself completing the project any time soon. So, following the “release early” mantra, I’ve decided to release the code as-is. The code is in a working state but not polished, so while it should be useful, it may contain bugs. If you find any bugs or have any suggestions, I would be glad to hear about them.

tarsum-0.2 – A read-only version of tarsum

When I first scratched the itch of calculating checksums for every file in a tar archive, a read-only version was my original intention. When I decided I wanted the script in bash for simplicity, I gave up on that idea and settled for extracting the files and then going over all of them to calculate their checksums.

So when Jon Flowers asked in the comments on the original tarsum post about the possibility of getting the checksums of files in the tar file without extracting the whole archive, I decided to re-tackle the problem.
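
To illustrate what reading checksums without extraction can look like, here is a minimal Python sketch built on the standard tarfile and hashlib modules. It is not the tarsum script itself; the function name `tar_checksums` and its parameters are hypothetical.

```python
import hashlib
import tarfile


def tar_checksums(fileobj, algorithm='md5', chunk_size=64 * 1024):
    """Yield (member name, hex digest) for each regular file in a tar stream."""
    # Mode 'r|*' reads the archive as a forward-only stream with
    # transparent compression detection; nothing is written to disk.
    with tarfile.open(fileobj=fileobj, mode='r|*') as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and devices
            digest = hashlib.new(algorithm)
            data = archive.extractfile(member)
            # Hash in chunks so large members never sit fully in memory.
            for chunk in iter(lambda: data.read(chunk_size), b''):
                digest.update(chunk)
            yield member.name, digest.hexdigest()
```

Because the archive is read as a stream, each member has to be hashed before moving on to the next one, which the generator does naturally.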


Damerau-Levenshtein Distance in Python

Damerau-Levenshtein distance is a metric for measuring how far apart two given strings are, in terms of 4 basic operations:

  • deletion
  • insertion
  • substitution
  • transposition

The distance between two strings is the minimal number of such operations needed to transform the first string into the second. The algorithm can be used to create spelling correction suggestions by finding the closest word from a given list to the user’s input. See Damerau–Levenshtein distance (Wikipedia) for more info on the subject.

Here is an implementation of the algorithm (restricted edit distance version) in Python. While this implementation isn’t perfect (performance-wise), it is well suited for many applications.

def damerau_levenshtein_distance(s1, s2):
    """
    Compute the Damerau-Levenshtein distance (restricted edit distance
    version) between two given strings (s1 and s2).
    """
    # d[(i, j)] holds the distance between s1[:i+1] and s2[:j+1];
    # index -1 stands for the empty prefix.
    d = {}
    lenstr1 = len(s1)
    lenstr2 = len(s2)
    for i in range(-1, lenstr1 + 1):
        d[(i, -1)] = i + 1
    for j in range(-1, lenstr2 + 1):
        d[(-1, j)] = j + 1

    for i in range(lenstr1):
        for j in range(lenstr2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i, j)] = min(
                d[(i - 1, j)] + 1,         # deletion
                d[(i, j - 1)] + 1,         # insertion
                d[(i - 1, j - 1)] + cost,  # substitution
            )
            if i and j and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
                d[(i, j)] = min(d[(i, j)], d[(i - 2, j - 2)] + cost)  # transposition

    return d[(lenstr1 - 1, lenstr2 - 1)]

Update 24 Mar, 2012: Fixed the error in computing transposition at the beginning of the strings.
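
The spelling-suggestion use mentioned above can be sketched as follows. The distance routine is a condensed, list-based restatement of the same restricted edit distance, repeated only so that the snippet runs on its own; `osa_distance`, `suggest`, and the `max_distance` cutoff are hypothetical names and choices made for illustration.

```python
def osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1), len(s2)
    # d[i][j] is the distance between the prefixes s1[:i] and s2[:j].
    d = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        d[i][0] = i
    for j in range(cols + 1):
        d[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[rows][cols]


def suggest(word, dictionary, max_distance=2):
    """Return the dictionary words within max_distance of word, nearest first."""
    scored = [(osa_distance(word, candidate), candidate) for candidate in dictionary]
    return [candidate for dist, candidate in sorted(scored) if dist <= max_distance]
```

Scoring every dictionary word is linear in the dictionary size, which is fine for short word lists; larger dictionaries would likely need an index or a pre-filter on word length.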

Retrieving Google’s Cache for an Entire Website

Some time ago, as some of you noticed, the web server that hosts my blog went down. Unfortunately, some of the sites had no proper backup, so something had to be done in case the hard disk couldn’t be recovered. My efforts turned to Google’s cache. Google keeps a copy of the text of the web page in its cache, something that is usually useful when the website is temporarily unavailable. The basic idea is to retrieve a copy of all the pages of a certain site that Google has cached.

Start Trac on Startup – Init.d Script for tracd

As part of a server move, I went on to reinstall Trac. I tried to install it as FastCGI, but I failed to configure the clean URLs properly. The clean URLs worked when a user requested them directly, but Trac insisted on adding trac.fcgi to the beginning of every link it generated. So I decided to use the Trac standalone server, tracd.

The next problem I faced was how to start Trac automatically upon startup. The solution was to use an init.d script for starting Trac. After some searching, I didn’t find an init.d script for tracd that was satisfactory (most were poorly written). So I went on and wrote my own init.d script for tracd.

Scanning Documents Written in Blue Ink – biscan

After writing the post on converting PNMs to DjVu, I ran into some trouble scanning documents written in blue ink. The problem: XSane didn’t allow me to set the threshold for converting the scanned image to line-art (B&W). So I tried scanning the document in grayscale and in color and converting it afterward to bitonal using ImageMagick. Neither of the two approaches I tried was satisfactory. With the -monochrome command-line switch, the conversion looked good, but it used halftones (dithering), and converting the result to DjVu produced a document twice as large as a normal B&W scan would. With the -threshold switch, the DjVu-compressed document was much smaller, but it looked awful: either it was too dark, or some of the text disappeared. After giving it some thought, I knew I could find a better solution.

Convert CSS layout to RTL – cssrtl.py

This is a re-release of a script of mine that helps convert CSS layouts to RTL. I originally released it about a year ago, but it was lost when I moved to the new blog. The script, cssrtl.py, utilizes a bunch of regular expressions to translate a given CSS layout to RTL.
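
To give a flavor of the regular-expression approach, here is a deliberately tiny sketch that swaps `left` and `right` tokens. It is not taken from cssrtl.py; `flip_sides` is a hypothetical name, and a real conversion also has to deal with things this ignores, such as four-value shorthands like `margin: 1px 2px 3px 4px` and selectors that happen to contain the word `left` or `right`.

```python
import re

# Match "left" or "right" as a whole word. A hyphen is not a word
# character, so the last part of "margin-left" matches as well.
_SIDE = re.compile(r'\b(left|right)\b')
_SWAP = {'left': 'right', 'right': 'left'}


def flip_sides(css):
    """Swap every standalone 'left'/'right' token in a CSS string."""
    return _SIDE.sub(lambda match: _SWAP[match.group(1)], css)
```

Doing the swap in a single pass with a callback avoids the classic mistake of replacing `left` with `right` and then turning everything back in a second replacement.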
