Mining data from a search engine

Is the government doing its homework?

Australia is a big country (spatially). But most of it is desert, and the bulk of the population lives in coastal areas in the East and South. So the Australian Bureau of Statistics (ABS) created a remoteness definition that classifies locations as anything from a Major City to a Very Remote place. Better still, ABS makes correspondence tables between postcodes and the Remoteness Area classification freely available online.

ABS also claims that all government bodies use the same classification, including the Department of Health and Ageing (DoHA). While DoHA uses a different number coding (1-5 for DoHA versus 0-4 for ABS), the Remoteness Area (RA) classification these numbers refer to should remain the same. But does it? Our job is to check this assumption.
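If both agencies really use the same classification, the check amounts to verifying a simple shift in coding. Here is a tiny sketch of the hypothesis; the intermediate category names are filled in from the standard ABS Remoteness Structure as an assumption (the text above only names the two extremes):

```python
# Hypothesized ABS coding (0-4); intermediate category names are assumed
# from the standard ABS Remoteness Structure
abs_ra = {0: 'Major Cities', 1: 'Inner Regional', 2: 'Outer Regional',
          3: 'Remote', 4: 'Very Remote'}

# If DoHA merely shifts the coding by one (1-5), the categories must line up
doha_ra = {code + 1: name for code, name in abs_ra.items()}
assert doha_ra[1] == 'Major Cities' and doha_ra[5] == 'Very Remote'
```

This is exactly the correspondence the scraping below will either confirm or refute, postcode by postcode.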


DoHA makes its RA data available too (well, sorta) on the DoctorConnect website. The site lets anyone enter ONE postcode, and then returns its corresponding RA. According to the ABS correspondence file, Australia has 2,507 postcodes. I'm not entering all of them by hand, one by one, to find out whether DoHA's and ABS's RAs are the same! But I can let Python do that for me…

Prepare data for search

So the first task is to find a list of Australian postcodes and feed it to DoHA's web search engine. ABS made that step easy for us: we just need to download the ABS file and use it. As is often the case, though, it has duplicates (a postcode might be split between two RAs), and the RA code is concatenated with a state numeric code. So the first part of our script, as is de rigueur, is devoted to cleaning up the data:

from pandas import DataFrame, Series
import pandas as pd

state = {'1':'NSW', '2':'VIC', '3':'QLD', '4':'SA', '5':'WA', '6':'TAS',
         '7':'NT', '8':'ACT', '9':'OT'}

# The file has awkward line endings: read the data with an explicit line
# terminator, then take the clean column names from a default read
df = pd.read_csv('CP2006RA_2006POA.txt', lineterminator='\n')
df.columns = pd.read_csv('CP2006RA_2006POA.txt').columns

# RA_2006_CODE concatenates a state digit with the RA digit: split them apart
df['RA_2006_STR'] = [str(a)[1] for a in df['RA_2006_CODE']]
df['RA_2006_STATE'] = [str(a)[0] for a in df['RA_2006_CODE']]
df['POA'] = [str(a) for a in df['POA_2006.1']]
df['STATE'] = [state[a] for a in df['RA_2006_STATE']]
df['STRING'] = df['STATE'] + ' ' + df['POA'] + ' Australia'

# For postcodes split between RAs, keep the RA holding the largest share
df.sort_values(['PERCENT'], ascending=False, inplace=True)
df = df.drop_duplicates(['POA_2006.1'])
df.sort_values(['RA_2006_STATE'], ascending=True, inplace=True)
df.index = range(len(df.index))

Now we have a tidy pandas DataFrame:

DataFrame after munging

The search

And we are ready to start feeding DoctorConnect postcodes (as in the 'STRING' column) and collecting the RAs. Notice that, since DoctorConnect relies on Google for its address search, it could match addresses anywhere in the world. Hence, I am preventing any confusion by giving it not only the postcode, but also the state and the country. Likewise, I will be saving the address it gives me back along with the RAs. I am choosing the address it returns, rather than the address I gave it, as the source of the postcodes I'll later use to join my dataframes.

We will use the superb selenium engine to send our Python incantations to a running Firefox instance:

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()  # Get local session of firefox
browser.get("")  # Load page (DoctorConnect URL goes here)
assert "Locator" in browser.title
search = browser.find_element_by_name("Search")  # Find the query box
ret = browser.find_element_by_id("searchButton")

doha_ra = {}

for i in df['STRING']:
    search.clear()
    search.send_keys(i)  # Feed the postcode string to the query box
    ret.click()
    switch = True
    while switch:
        # Keep polling until the results table has rendered
        try:
            addr = browser.find_element_by_xpath("//table/tbody/tr[2]/td[2]")
            ra = browser.find_element_by_xpath("//table/tbody/tr[4]/td[2]")
            doha_ra[addr.text] = ra.text
            switch = False
        except NoSuchElementException:
            time.sleep(1)

doha_ra = Series(doha_ra)
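The wrap-up later reads the results back from doha_ra.csv, so at this point the Series presumably gets written to disk. The original doesn't show that step, but it would be something along these lines (a sketch with stand-in data; the real Series holds the scraped addresses and RAs):

```python
from pandas import Series

# Stand-in for the scraped address -> RA mapping
doha_ra = Series({'Canberra ACT 2600, Australia': '1'})

# Write it out headerless, matching how the wrap-up reads it back in
doha_ra.to_csv('doha_ra.csv', header=False)
```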

And a couple of hours later (give or take), we have the complete DoHA postcode/RA correspondence file!

DoHA postcode / RA correspondence file

Wrapping up

The rest of the work mirrors the beginning of the script, in reverse order:

  1. We extract the postcode from the address string (with the handy help of a regular expression);
  2. We merge the ABS and DoHA postcode / RA tables.

import re
import numpy as np

doha_ra = pd.read_csv('doha_ra.csv', header=None)
doha_ra.columns = ['str','ra']
poa = re.compile(r'\d{3,4}(?=, Australia)')  # postcode preceding ', Australia'

def reg_ex(x, regex=poa):
    try:
        return regex.search(x).group()
    except AttributeError:  # no match: search() returned None
        return np.nan

doha_ra['POA'] = [reg_ex(i) for i in doha_ra['str']]
doha_ra['POA'] = doha_ra['POA'].astype(float)
doha_ra = doha_ra.drop_duplicates(['POA'])
abs_ra = df[['POA','RA_2006_STR']].copy()
abs_ra['POA'] = abs_ra['POA'].astype(float)
abs_ra = abs_ra.drop_duplicates(['POA'])
res = pd.merge(abs_ra, doha_ra, on='POA')

Final dataframe with Australian postcodes and corresponding ABS and DoHA RAs

And finally we create a table crosstabulating postcode counts by ABS RA and DoHA RA:

def summary_table(df, columns, value, func='count'):
    # Group by the two RA columns and pivot the second one into columns
    df_grouped = df.groupby(columns)
    return df_grouped[value].agg(func).unstack([columns[-1]]).fillna(0)

table = summary_table(res, ['RA_2006_STR', 'ra'], value='POA')

# Sanity check: confirm each returned address actually ends in 'Australia'
au = re.compile(r'(?<=\d{4}, )\w{9}')
doha_ra['chk'] = [reg_ex(i, regex=au) for i in doha_ra['str']]

# Re-merge so the check column comes along, then tidy up
res = pd.merge(abs_ra, doha_ra, on='POA')
res.columns = ['postcode', 'ABS RA', 'str', 'DoHA RA', 'chk']
res = res.drop(['str', 'chk'], axis=1)

And the verdict is (28 January 2013)…

Final dataframe with ABS and DoHA codes corresponding to each Australian postcode

Posted in ABS data, Australia, Data Analysis, Python

pandas to excel (with xlwt styles!)

Why use Python, pandas, xlwt?

While using R is mandatory when doing just about anything statistical, Python code is usually faster, more readable, and not limited to the analysis stage. R is a statistical language; Python is a complete programming language. The Python pandas library provides much the same functionality found in R's data.frames, which means one could do one's data wrangling in Python in a fraction of the time spent in R, and with better code.

Python also offers the xlwt library, which enables one to write just about anything to XLS files, including setting borders, colors, number format and functions. Taken together, having pandas and xlwt would mean one could automate complex reporting through Python scripting, right?

Missing functionality: Excel cell styles

If you're a heavy pandas user who has to provide reports in Excel format, you will know that pandas' DataFrame.to_excel method lacks support for xlwt styles. This is a major limitation, but it is also easily solved. While I would gladly contribute to pandas if I could reach GitHub from work, the second-best solution, and a much more time-efficient one, is simply to subclass DataFrame and Series to include that support.

How can that be so simple? Well, pandas does support xlwt styles. The support is there in the code, but hardcoded to one standard style. pandas uses xlwt to write XLS files, and xlwt requires a style argument, so the trick is just to create an xlwt style object and pass it as an extra argument to the subclassed pandas object.
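To make the idea concrete, here is a stdlib-only sketch of the kind of plumbing involved: turning a nested style description into the string format accepted by xlwt's real style factory, easyxf(). The helper name and the exact flattening rules here are illustrative, not xlwt's actual internals:

```python
# Hypothetical helper: flatten a nested style dict into an easyxf-style
# string such as 'font: bold on; align: horiz center'
def style_dict_to_str(style):
    parts = []
    for section, opts in style.items():
        settings = ', '.join('%s %s' % (k, v) for k, v in opts.items())
        parts.append('%s: %s' % (section, settings))
    return '; '.join(parts)

s = style_dict_to_str({'font': {'bold': 'on'},
                       'align': {'horiz': 'center'}})
# s is 'font: bold on; align: horiz center'; in real code this string
# would be handed to xlwt.easyxf() to build the style object
```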

Missing functionality: Column MultiIndex in to_excel

It is a matter of taste that I would like my column MultiIndexes to work as expected (one row per level) and not as the standard implementation (appended strings).
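As a stdlib-only sketch of what "one row per level" means (the variable names are hypothetical): for a two-level column index, the writer should emit the level-0 labels on one spreadsheet row and the level-1 labels on the next, instead of joining each pair into a single string:

```python
from itertools import product

# Hypothetical two-level column index: years crossed with quarters
years, quarters = ['2011', '2012'], ['Q1', 'Q2']
columns = list(product(years, quarters))  # [('2011', 'Q1'), ('2011', 'Q2'), ...]

# One spreadsheet row per index level, not one appended string per column
header_rows = [[col[level] for col in columns] for level in range(2)]
# header_rows[0] holds ['2011', '2011', '2012', '2012']
# header_rows[1] holds ['Q1', 'Q2', 'Q1', 'Q2']
```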

Hidden functionality: xlwt style string as dict

I didn't see it described anywhere in the pandas documentation, but the pandas source has a very neat parser that creates xlwt styles from dicts. It doesn't cover number formats, but most everything else can be stored this way, and it goes without saying that it makes the code much more readable. The code below

conv = pd.core.format.CellStyleConverter  # pandas' internal dict-to-xlwt parser (internal API; location may vary by pandas version)

hstyle_dict = {"font": {"bold": True},
               "border": {"top": "thin",
                          "right": "thin",
                          "bottom": "thin",
                          "left": "thin"},
               "pattern": {"pattern": "solid",
                           "fore_colour": 26},
               "align": {"horiz": "center"}}
hstyle = conv.to_xls(hstyle_dict)

dstyle_dict = {"border": {"top": "hair",
                          "right": "hair",
                          "bottom": "hair",
                          "left": "hair"}}
dstyle = conv.to_xls(dstyle_dict)
dstyle.num_format_str = '$#,##0.00'

And an object of the new XLtable class

t = XLtable(your_pandas_DataFrame_object)
t.place_table(ws=ws_1, row=5, col=5, dstyle=dstyle, rstyle=hstyle, cstyle=hstyle)

Would give you

result in Excel

Methods for Series, and for placing data, column and row indexes separately, are also available. What are you waiting for? Grab the code from my Gist repository!

Posted in Data Analysis, Python

Speed up your map plotting in R

I have forever been annoyed at how long it takes to plot data on a large shapefile. And in this domain it doesn't matter whether you're working with MapInfo or R: just zooming the figure takes ages. But a couple of days ago I bumped into an excellent blog post by John Myles White that solved the problem in the most Daoistic way, reminiscent of Steve Jobs: do away with the shapefile!

To get a quick overview of one's data, there is no need for a shapefile. Just plot latitude and longitude directly as a scatterplot.

That’s it.

Here's how it is done. For our example, let's plot a map showing the ratio of foreign-born vs Australian-born residents, by postcode, for all of Australia. As a first step, a quick search found a ready-to-use file with all Australian postcodes and their geographical coordinates. Next, I got some Census data from the Australian Bureau of Statistics, specifically counts of Australian-born and foreign-born residents by postcode. A quick munge and plot (code below), and voilà!

Admittedly this is a quick fix: one has to guess the shape of Australia and the location of its major cities from their relative place on the map. Even with this limitation, though, I find this solution really appealing. Instead of waiting many minutes each time I redraw or resize a graph, I get a complete overview of my dataset instantaneously(!).


library(ggplot2)

poa <- read.csv('data/2011_BCP_ALL_for_AUST_short-header/2011 Census BCP All Geographies for AUST/POA/AUST/2011Census_B01_AUST_POA_short.csv')
poa$poa <- as.integer(sub('[a-zA-Z]{3}', '', x=poa$region_id))  # strip the 'POA' prefix

# Ratio Aus born vs elsewhere born
ratio <- data.frame('poa'=poa$poa, 'Freq'=poa$Birthplace_Australia_P / poa$Birthplace_Elsewhere_P)

zipcodes <- read.csv('data/pc_full_lat_long.csv')
zipcodes <- zipcodes[!duplicated(zipcodes$Pcode),]
zipcodes <- zipcodes[!zipcodes$Lat == 0,]   # drop postcodes with missing coordinates
zipcodes <- zipcodes[!zipcodes$Long == 0,]
poacoord <- data.frame('poa'=zipcodes$Pcode, 'lat'=zipcodes$Lat, 'long'=zipcodes$Long)

popcount <- na.omit(merge(ratio, poacoord, all=FALSE))
popcount <- popcount[rev(order(popcount$Freq)),]  # Place the small bubbles on top
popcount$cat <- sapply(popcount$Freq, function(x) if (x < 6) {'> 1:6'} else {'< 1:6'})

ggplot(popcount, aes(x=long, y=lat, colour=cat)) +
    geom_point() +
    scale_size(range=c(1, 20), name='Population') +
    scale_colour_discrete(name='ratio of Foreign vs\nAustralian-born')

And once we get the data presentation right, we can add the shapefile and go get some tea.

# must have gpclib installed
require("rgdal")  # requires sp, will use proj.4 if installed
require("plyr")   # for join()
gpclibPermit()    # required for the fortify method
amap <- readOGR(dsn=".", layer="aust_cd66states")
amap@data$id <- rownames(amap@data)
amap.points <- fortify(amap, region='id')
amap.df <- join(amap.points, amap@data, by='id')

ggplot() +
    geom_polygon(data=amap.df, aes(long, lat, group=group)) +
    geom_path(data=amap.df, aes(long, lat, group=group), color="grey") +
    geom_point(data=popcount, aes(x=long, y=lat, colour=cat)) +
    scale_size(range=c(1, 20), name="Population") +
    scale_colour_discrete(name="ratio of Foreign vs\nAustralian-born")

Posted in ABS data, Australia, Data Analysis, ggplot2, Mapping, R