HTML Form Submission via Python (1/X)

Some tasks are tedious, and writing a short Python script is often a nice way to automate them and save some time. But most people who have actually tried this can probably tell you that the script often gets more complicated than anticipated. Here is the relevant xkcd comic:

xkcd comic 1319

Anyway, this story begins with a web form (on a site that shall remain unnamed), which I need to submit about 80 times with similar content. Since I have used this technique before, I thought it should not be too hard to automate the task with Python’s urllib module. Here is the main class I wrote for this:

import urllib.parse
import urllib.request
import http.cookiejar

class form_poster:
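    def __init__(self):
        # the cookie jar keeps the session cookies between requests
        self.cj = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cj) )
        # extra HTTP headers sent with every request (none for now)
        self.headers = {}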

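    def clear_cookies(self, cookie_url):
        # note: CookieJar.clear() expects a cookie domain and raises
        # KeyError if no matching cookies exist
        try:
            self.cj.clear(cookie_url)
        except KeyError:
            pass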

    def submit_form( self, form_url, form_field_dict ):
        data = urllib.parse.urlencode( form_field_dict ).encode( 'ascii' )
        form_submit_request = urllib.request.Request( form_url, data=data, headers=self.headers )
        response = self.opener.open( form_submit_request )
        if response.getcode() == 200:
            print("Upload successful: {}".format(form_field_dict))
        else:
            print("Upload failed: {}".format(form_field_dict))
        return response

    def open_page(self, url):
        request = urllib.request.Request( url, data=None, headers=self.headers )
        # use our opener (not urlopen) so the stored cookies are sent along
        return self.opener.open(request)

The idea is as follows:

  1. We initialize this class and give it a cookie jar (since we need to log in to said website before submitting the form).
  2. We can now submit forms using the submit_form function, which takes a URL and a dictionary as arguments. Both can be obtained by looking at the HTML source code of the page that contains the web form.

The first thing we want to do is log in to the website. To figure out how to do that, we open the log-in page in our web browser and find the log-in form in the source code. In Firefox this can be done (for example) by right-clicking in the user field and selecting the “Inspect Element” option. There is probably a lot going on in the source code, but something similar to this should be present:

<form id="login" action="/login.php" method="post">
    <input name="username" id="login_name">
    <input type="password" name="password" id="login_pass">
    <input type="submit" value="Sign In">
</form>

If the website is located at https://www.website.com, this tells us that the form will be submitted to https://www.website.com/login.php via the HTTP method POST and needs to contain the fields username and password. So it should be fairly simple to log in using the form_poster class defined above and our login credentials:

fp = form_poster()
fp.submit_form( "https://www.website.com/login.php",
    { "username": "myUserName",
      "password": "EL9L*FWdij" } )
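
To make the step from the HTML to the request a bit more concrete, here is a small sketch of how the form’s action attribute resolves against the page URL and what the encoded POST body looks like. The page URL and the password value are placeholders, not the real ones:

```python
from urllib.parse import urljoin, urlencode

# Placeholders: the page that contains the form and its action attribute
page_url = "https://www.website.com/"
action = "/login.php"

# A relative action is resolved against the URL of the page
submit_url = urljoin(page_url, action)

# The form fields are sent as a URL-encoded byte string in the POST body
body = urlencode({"username": "myUserName", "password": "secret"}).encode("ascii")

print(submit_url)  # https://www.website.com/login.php
print(body)        # b'username=myUserName&password=secret'
```

This is the same encoding that submit_form applies to the dictionary before sending it.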

As a side note: please be careful where you store/type your credentials when using them in scripts! There are people out there who automatically scan every public GitHub repository for things like this.
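
One way to keep credentials out of the script itself is to read them from environment variables and prompt for anything that is missing. This is only a sketch: the function name load_credentials and the variable names FORM_USERNAME and FORM_PASSWORD are made up for this example.

```python
import getpass
import os

def load_credentials(env=os.environ):
    """Read credentials from the environment; prompt for whatever is missing.

    FORM_USERNAME and FORM_PASSWORD are hypothetical variable names.
    """
    username = env.get("FORM_USERNAME") or input("Username: ")
    # getpass does not echo the password to the terminal
    password = env.get("FORM_PASSWORD") or getpass.getpass("Password: ")
    return {"username": username, "password": password}
```

The returned dictionary has the same keys as the log-in form above, so it could be passed directly to submit_form.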

This seemed to work fine for me and should now have set the login cookies that I need to submit the actual form I’m interested in. For this, I basically repeat the previous step of analysing the HTML of the web form and extracting all <input> fields (at least all that have a name attribute). Don’t forget that other HTML elements can also contribute to a <form>, for example <textarea> or <select>.
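
Instead of reading the HTML by hand, the named fields can also be collected with the standard library’s html.parser. The following is a rough sketch, not what I actually used: the class name FormFieldCollector is made up, the sample HTML is the log-in form from above with a <textarea> added for illustration, and it only looks at the value attribute, so the text inside a <textarea> or the selected <option> of a <select> is not picked up.

```python
from html.parser import HTMLParser

class FormFieldCollector(HTMLParser):
    """Collect the names (and value attributes) of submittable form controls."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        # Controls without a name attribute are not submitted
        if tag in ("input", "textarea", "select") and name:
            self.fields[name] = attrs.get("value") or ""

html = """
<form id="login" action="/login.php" method="post">
    <input name="username" id="login_name">
    <input type="password" name="password" id="login_pass">
    <textarea name="comment"></textarea>
    <input type="submit" value="Sign In">
</form>
"""

collector = FormFieldCollector()
collector.feed(html)
print(collector.fields)  # {'username': '', 'password': '', 'comment': ''}
```

The submit button is skipped because it has no name attribute.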

After this analysis and deciding how to fill the fields, I tried to submit this form in the same way:

fp = form_poster()
# get log-in cookies
fp.submit_form( "https://www.website.com/login.php",
    { "username": "myUserName",
      "password": "EL9L*FWdij" } )
# submit actual form for all wanted values
for value_dict in all_value_dicts_I_want_to_submit:
    fp.submit_form( "https://www.website.com/other_webform.php",
        value_dict )

But instead of a success message, urllib throws an exception:

urllib.error.HTTPError: HTTP Error 403: Forbidden
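
When debugging a failure like this, it can help to catch the exception and look at the status code and response body instead of letting the script crash. A minimal sketch, assuming an opener like the one in form_poster (the helper name try_open is made up):

```python
import urllib.error
import urllib.request

def try_open(opener, request):
    """Open the request; return (status, body) instead of raising on HTTP errors."""
    try:
        with opener.open(request) as response:
            return response.getcode(), response.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object: it carries the status
        # code, the headers and often a body explaining the refusal
        return err.code, err.read()
```

The body of an error page sometimes contains a hint about why the request was rejected.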

Apparently I am not allowed to access the form at all? Why would that be? The login seemed to work… It turned out that the website checks which kind of “browser” is trying to access the form and blocks urllib in order to prevent spam. Since I don’t want to spam and have no other malicious intent, I’m sure they won’t mind me circumventing this test… Luckily, urllib allows us to set custom header fields, so we just need to change our requests like this:

request = urllib.request.Request( ..., # other arguments
    headers = {'User-Agent':
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'} )
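
As an alternative sketch, the header can also be set once on the opener, so that every request sent through it carries the same User-Agent. The names USER_AGENT, cookie_jar and opener are chosen for this example; nothing here touches the network.

```python
import http.cookiejar
import urllib.request

USER_AGENT = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) "
              "Gecko/20100101 Firefox/82.0")

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))
# Replaces the opener's default "Python-urllib/3.x" User-agent entry
opener.addheaders = [("User-Agent", USER_AGENT)]

# Setting the header on a single request works as well; urllib stores
# header names capitalized, hence "User-agent" in the lookup
request = urllib.request.Request(
    "https://www.website.com/other_webform.php",
    headers={"User-Agent": USER_AGENT})
print(request.get_header("User-agent"))
```

Headers given on the Request itself take precedence over the ones in addheaders.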

And suddenly the 403 error is gone! But this is only where the real problem starts, since now we get:

urllib.error.HTTPError: HTTP Error 500: Internal Server Error

But that’s a story for another post…