I’ve been using Python to write various bots and crawlers for a long time. A few days ago I needed to write a simple bot to remove some 400+ spam pages in Sikumuna, so I took an old script of mine (from 2006) in order to modify it. The script used ClientForm, a Python module that allows you to easily parse and fill HTML forms. I quickly found that ClientForm is now deprecated in favor of mechanize. In the beginning I was somewhat put off by the change, as ClientForm was pretty easy to use, and mechanize’s documentation could use some improvement. However, I quickly changed my mind about mechanize. The basic interface for mechanize is a simple browser object that literally allows you to browse using Python. It takes care of handling cookies and such, and it has form-filling abilities similar to ClientForm’s, but this time they are integrated into the browser object.
For future reference for myself, and as another code example to supplement mechanize’s sparse documentation, here is the gist of the simple bot I wrote:
        self.browser = mechanize.Browser()
        self.browser.set_handle_robots(False)

    def login(self):
        self.browser.open(self.login_url)
        self.browser.select_form(name="userlogin")
        self.browser["wpName"] = self.username
        self.browser["wpPassword"] = self.password
        res = self.browser.submit()

    def find_pages(self, prefix):
        self.browser.open(self.find_pages_url)
        self.browser.select_form(nr=0)
        self.browser["from"] = prefix
        res = self.browser.submit()
        data = res.read()
        link_regex = re.compile('<td><a href="([^"]*)"[^<]*</a></td>')
        return link_regex.findall(data)

    def delete_page(self, page_url):
        self.browser.open(page_url + "&action=delete")
        if "Kindle" not in self.browser.title():
            print self.browser.title()
            if raw_input("Confirm: ") != "y":
                return
        self.browser.select_form(nr=0)
        self.browser["wpReason"] = "Spam"
        self.browser.submit()

    def run(self, prefix):
        self.login()
        pages = self.find_pages(prefix)
        print "Found %d page" % len(pages)
        for i,page in enumerate(pages):
            print "Deleting", i
            self.delete_page(page)
This isn’t a complete code example, as the rest of the code is just mundane, but you can clearly see how simple it is to use mechanize.
The interesting parts are:
- Initializing the browser object using mechanize.Browser().
- Opening pages: browser.open(url).
- Selecting forms:
  - browser.select_form(name="userlogin") (selecting forms by name)
  - browser.select_form(nr=0) (selecting forms by their sequential number in the page).
- Filling forms is done by assigning values to the form fields on the browser object: browser["wpName"] = self.username
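One step the list does not cover is how find_pages pulls the page links out of the response. The sketch below tries the same regular expression on a hand-written HTML snippet, so it runs without mechanize or a live wiki. The extract_links helper and the sample markup are made up for illustration; the real page listing may look different, and the bytes handling assumes UTF-8.

```python
import re

# Same pattern as in find_pages: a table cell that holds a single link.
LINK_REGEX = re.compile(r'<td><a href="([^"]*)"[^<]*</a></td>')


def extract_links(data):
    """Return the href of every <td><a ...>...</a></td> cell in the page."""
    if isinstance(data, bytes):
        # Assumption: the page is UTF-8 encoded.
        data = data.decode("utf-8")
    return LINK_REGEX.findall(data)


# Made-up markup standing in for the wiki's page listing.
SAMPLE_HTML = """
<table>
<tr><td><a href="/index.php?title=Kindle_offer_1" title="Kindle offer 1">Kindle offer 1</a></td></tr>
<tr><td><a href="/index.php?title=Kindle_offer_2" title="Kindle offer 2">Kindle offer 2</a></td></tr>
<tr><td>No link in this cell</td></tr>
</table>
"""

links = extract_links(SAMPLE_HTML)
print("Found %d pages" % len(links))
for link in links:
    print(link)
```

Because the pattern only allows non-"<" characters between the href and the closing tag, a cell whose link text contains nested markup (for example bold text) would be skipped. For a one-off cleanup bot that is probably acceptable, but an HTML parser would be more robust.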