Mining data from a search engine

Is the government doing its homework?

Australia is a big country (spatially), but most of it is desert, and the bulk of the population lives in coastal areas in the east and south. So the Australian Bureau of Statistics (ABS) created a remoteness classification that grades locations from Major City all the way to Very Remote. Better still, the ABS makes it easy for us to use by publishing correspondence tables between postcodes and the Remoteness Area classification, freely available online.

The ABS also claims that all government bodies use the same classification, including the Department of Health and Ageing (DoHA). While DoHA uses a different number coding (1-5 for DoHA versus 0-4 for ABS), the Remoteness Area (RA) categories these numbers refer to should be the same. But are they? Our job is to check this assumption.
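
To make the comparison concrete, here is a minimal sketch of what that one-step offset would look like, assuming both agencies label the five categories the same way. The category names below are taken from the ABS remoteness structure and are illustrative; the DoHA coding is the 1-5 variant described above:

```python
# The five Remoteness Area categories, from least to most remote.
# Names assumed from the ABS remoteness structure; treat as illustrative.
ra_names = ['Major Cities', 'Inner Regional', 'Outer Regional',
            'Remote', 'Very Remote']

abs_codes = {i: name for i, name in enumerate(ra_names)}        # ABS: 0-4
doha_codes = {i + 1: name for i, name in enumerate(ra_names)}   # DoHA: 1-5

# If both bodies really use the same classification, the two codings
# should line up after shifting by one:
for code, name in abs_codes.items():
    assert doha_codes[code + 1] == name
```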

Strategy

DoHA makes its RA data available too (well, sort of) on the DoctorConnect website. The site lets anyone enter one postcode at a time, and returns its corresponding RA. According to the ABS correspondence file, Australia has 2,507 postcodes. I'm not entering all of them by hand, one by one, to find out whether DoHA's and ABS's RAs match! But I can let Python do that for me…

Prepare data for search

So the first task is to find a list of Australian postcodes and feed it to DoHA's search engine. The ABS made that step easy for us: we just need to download the ABS file and use it. As is often the case, though, it has duplicates (a postcode may be split between two RAs), and the RA code is concatenated with a numeric state code. So the first part of our script, as is de rigueur, is devoted to cleaning up the data:


import pandas as pd
from pandas import DataFrame, Series

# First digit of RA_2006_CODE encodes the state/territory
state = {'1': 'NSW', '2': 'VIC', '3': 'QLD', '4': 'SA', '5': 'WA', '6': 'TAS',
         '7': 'NT', '8': 'ACT', '9': 'OT'}

# Read the ABS correspondence file; the explicit line terminator works
# around stray carriage returns, and a second read recovers a clean header
df = pd.read_csv('CP2006RA_2006POA.txt', lineterminator='\n')
df.columns = pd.read_csv('CP2006RA_2006POA.txt').columns

# Split the concatenated code into its state digit and RA digit
df['RA_2006_STR'] = [str(a)[1] for a in df['RA_2006_CODE']]
df['RA_2006_STATE'] = [str(a)[0] for a in df['RA_2006_CODE']]
df['POA'] = [str(a) for a in df['POA_2006.1']]
df['STATE'] = [state[a] for a in df['RA_2006_STATE']]
df['STRING'] = df['STATE'] + ' ' + df['POA'] + ' Australia'

# Where a postcode is split between RAs, keep the row with the largest share
df.sort_values(['PERCENT'], ascending=False, inplace=True)
df = df.drop_duplicates(['POA_2006.1'])
df.sort_values(['RA_2006_STATE'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
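
The two string slices above rely on RA_2006_CODE packing the state digit and the RA digit into a single two-digit number. A quick sketch with a hypothetical code value:

```python
# Hypothetical RA_2006_CODE: first digit is the state (1 = NSW),
# second digit is the remoteness code (4 = Very Remote)
code = 14
state_digit, ra_digit = str(code)[0], str(code)[1]
print(state_digit, ra_digit)  # → 1 4
```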

Now we have a tidy pandas DataFrame:

DataFrame after munging

The search

And we are ready to start feeding DoctorConnect with postcodes (the ‘STRING’ column) and collecting the RAs. Notice that, since DoctorConnect relies on Google for its address search, it could match addresses anywhere in the world. Hence I prevent any confusion by giving it not only the postcode, but also the state and the country. Likewise, I save the address it returns alongside the RA. I choose the returned address, rather than the address I submitted, as the source of the postcodes I will later use to join my DataFrames.

We will use the superb selenium engine to send our Python incantations to a running Firefox instance:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()  # get a local session of Firefox
browser.set_page_load_timeout(30)
browser.get("http://www.doctorconnect.gov.au/internet/otd/Publishing.nsf/Content/locator")  # load page
assert "Locator" in browser.title
search = browser.find_element_by_name("Search")  # the query box
ret = browser.find_element_by_id("searchButton")

doha_ra = {}

for i in df['STRING']:
    switch = True
    while switch:  # retry until the result table appears
        search.send_keys(i)
        ret.send_keys(Keys.RETURN)
        try:
            addr = browser.find_element_by_xpath("//table/tbody/tr[2]/td[2]")
            ra = browser.find_element_by_xpath("//table/tbody/tr[4]/td[2]")
            doha_ra[addr.text] = ra.text
            switch = False
        except NoSuchElementException:
            pass  # page not ready yet; clear the box and try again
        search.clear()
browser.close()

doha_ra = Series(doha_ra)
doha_ra.to_csv('doha_ra.csv')
df.to_csv('abs_data.csv')

And a couple of hours later (give or take), we have the complete DoHA postcode/RA correspondence file!

DoHA postcode / RA correspondence file

Wrapping up

The rest of the work mirrors the beginning of the script, in reverse order:

  1. We extract the postcode from the address string (with the handy help of a regular expression);
  2. We merge the ABS and DoHA postcode / RA tables.

import re
import numpy as np

doha_ra = pd.read_csv('doha_ra.csv', header=None)
doha_ra.columns = ['str', 'ra']

# The 3- or 4-digit postcode immediately preceding ", Australia"
poa = re.compile(r'\d{3,4}(?=, Australia)')

def reg_ex(x, regex=poa):
    try:
        return re.search(regex, x).group()
    except (AttributeError, TypeError):  # no match, or x is not a string
        return np.nan

doha_ra['POA'] = [reg_ex(i) for i in doha_ra['str']]
doha_ra['POA'] = doha_ra['POA'].astype(float)
doha_ra = doha_ra.drop_duplicates(['POA'])

abs_ra = df[['POA', 'RA_2006_STR']].copy()
abs_ra['POA'] = abs_ra['POA'].astype(float)
abs_ra = abs_ra.drop_duplicates(['POA'])

res = pd.merge(abs_ra, doha_ra, on='POA')
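
Note that pd.merge defaults to an inner join, so only postcodes present in both tables make it into res. A toy sketch (the values are made up):

```python
import pandas as pd

# Made-up stand-ins for the ABS and DoHA postcode/RA tables
abs_toy = pd.DataFrame({'POA': [2000.0, 3000.0], 'RA_2006_STR': ['0', '0']})
doha_toy = pd.DataFrame({'POA': [2000.0], 'ra': ['1']})

merged = pd.merge(abs_toy, doha_toy, on='POA')  # inner join by default
# Only POA 2000.0 survives; 3000.0 had no DoHA match
```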

Final dataframe with Australian postcodes and corresponding ABS and DoHA RAs

And finally we create a table cross-tabulating the postcode counts of ABS RAs against DoHA RAs:

def summary_table(df, columns, value, func='count'):
    df_grouped = df.groupby(columns)
    return df_grouped[value].agg(func).unstack([columns[-1]]).fillna(0)

table = summary_table(res, ['RA_2006_STR', 'ra'], value='POA')

# Sanity check: the returned address should end with the word "Australia"
au = re.compile(r'(?<=\d{4}, )\w{9}')
res['chk'] = [reg_ex(i, regex=au) for i in res['str']]

res.columns = ['postcode', 'ABS RA', 'str', 'DoHA RA', 'chk']
res = res.drop(['str', 'chk'], axis=1)

table.to_csv('result.csv')
res.to_csv('POA_DoHA_RA_201301.csv')
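
As an aside, the summary_table helper above does by hand what pandas' built-in pd.crosstab does directly. With the renamed columns it would read (toy data, not the real result):

```python
import pandas as pd

# Toy stand-in for res after the column rename above
toy = pd.DataFrame({'ABS RA': ['0', '0', '1'],
                    'DoHA RA': ['1', '2', '2']})

table = pd.crosstab(toy['ABS RA'], toy['DoHA RA'])
# Cells off the expected one-step diagonal would flag postcodes where
# the two agencies disagree
```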

And the verdict is (28 January 2013)…

Final dataframe with ABS and DoHA codes corresponding to each Australian postcode


About dmvianna

Past neuroscientist, now data analyst.
This entry was posted in ABS data, Australia, Data Analysis, Python.
