Mining data from a search engine

Is the government doing its homework?

Australia is a big country (spatially). But most of it is desert, and the bulk of the population lives in coastal areas in the East and South. So the Australian Bureau of Statistics (ABS) created a remoteness definition that classifies locations as anything from a Major City to a Very Remote place. Better still, ABS makes correspondence tables between postcodes and the Remoteness Area classification freely available online.

ABS also claims that all government bodies use the same classification, including the Department of Health and Ageing (DoHA). While DoHA uses a different number coding (1-5 for DoHA versus 0-4 for ABS), the Remoteness Area (RA) classification these numbers refer to should remain the same. But does it? Our job is to check this assumption.
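If both agencies really use the same classification, the check amounts to verifying a simple shift in coding. Here is a tiny sketch of the hypothesis; the intermediate category names are filled in from the standard ABS Remoteness Structure as an assumption (the text above only names the two extremes):

```python
# Hypothesized ABS coding (0-4); intermediate category names are assumed
# from the standard ABS Remoteness Structure
abs_ra = {0: 'Major Cities', 1: 'Inner Regional', 2: 'Outer Regional',
          3: 'Remote', 4: 'Very Remote'}

# If DoHA merely shifts the coding by one (1-5), the categories must line up
doha_ra = {code + 1: name for code, name in abs_ra.items()}
assert doha_ra[1] == 'Major Cities' and doha_ra[5] == 'Very Remote'
```

This is exactly the correspondence the scraping below will either confirm or refute, postcode by postcode.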


DoHA makes its RA data available too (well, sorta) on the DoctorConnect website. The site lets anyone enter ONE postcode, and then returns its corresponding RA. According to the ABS correspondence file, Australia has 2,507 postcodes. I'm not entering all of them by hand, one by one, to find out whether DoHA's and ABS's RAs are the same! But I can let Python do that for me…

Prepare data for search

So the first task is to find a list of Australian postcodes and feed it to DoHA's web search engine. ABS made that step easy for us: we just need to download the ABS file and use it. As is often the case, though, it has duplicates (a postcode might be split between two RAs), and the RA code is concatenated with a state numeric code. So the first part of our script, as is de rigueur, is devoted to cleaning up the data:

from pandas import DataFrame, Series
import pandas as pd

state = {'1':'NSW', '2':'VIC', '3':'QLD', '4':'SA', '5':'WA', '6':'TAS',
         '7':'NT', '8':'ACT', '9':'OT'}

# The file has awkward line endings: read the data with an explicit line
# terminator, then take the clean column names from a default read
df = pd.read_csv('CP2006RA_2006POA.txt', lineterminator='\n')
df.columns = pd.read_csv('CP2006RA_2006POA.txt').columns

# RA_2006_CODE concatenates a state digit with the RA digit: split them apart
df['RA_2006_STR'] = [str(a)[1] for a in df['RA_2006_CODE']]
df['RA_2006_STATE'] = [str(a)[0] for a in df['RA_2006_CODE']]
df['POA'] = [str(a) for a in df['POA_2006.1']]
df['STATE'] = [state[a] for a in df['RA_2006_STATE']]
df['STRING'] = df['STATE'] + ' ' + df['POA'] + ' Australia'

# For postcodes split between RAs, keep the RA holding the largest share
df.sort_values(['PERCENT'], ascending=False, inplace=True)
df = df.drop_duplicates(['POA_2006.1'])
df.sort_values(['RA_2006_STATE'], ascending=True, inplace=True)
df.index = range(len(df.index))

Now we have a tidy pandas DataFrame:

DataFrame after munging

The search

And we are ready to start feeding DoctorConnect postcodes (as in the 'STRING' column) and collecting the RAs. Notice that, since DoctorConnect relies on Google for its address search, it could match addresses anywhere in the world. Hence, I am preventing any confusion by giving it not only the postcode, but also the state and the country. Likewise, I will be saving the address it gives me back along with the RAs. I am choosing the address it returns, rather than the address I gave it, as the source of the postcodes I'll later use to join my dataframes.

We will use the superb selenium engine to send our Python incantations to a running Firefox instance:

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()  # Get local session of firefox
browser.get("")  # Load page (DoctorConnect URL goes here)
assert "Locator" in browser.title
search = browser.find_element_by_name("Search")  # Find the query box
ret = browser.find_element_by_id("searchButton")

doha_ra = {}

for i in df['STRING']:
    search.clear()
    search.send_keys(i)  # Feed the postcode string to the query box
    ret.click()
    switch = True
    while switch:
        # Keep polling until the results table has rendered
        try:
            addr = browser.find_element_by_xpath("//table/tbody/tr[2]/td[2]")
            ra = browser.find_element_by_xpath("//table/tbody/tr[4]/td[2]")
            doha_ra[addr.text] = ra.text
            switch = False
        except NoSuchElementException:
            time.sleep(1)

doha_ra = Series(doha_ra)
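The wrap-up later reads the results back from doha_ra.csv, so at this point the Series presumably gets written to disk. The original doesn't show that step, but it would be something along these lines (a sketch with stand-in data; the real Series holds the scraped addresses and RAs):

```python
from pandas import Series

# Stand-in for the scraped address -> RA mapping
doha_ra = Series({'Canberra ACT 2600, Australia': '1'})

# Write it out headerless, matching how the wrap-up reads it back in
doha_ra.to_csv('doha_ra.csv', header=False)
```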

And a couple of hours later (give or take), we have the complete DoHA postcode/RA correspondence file!

DoHA postcode / RA correspondence file

Wrapping up

The rest of the work mirrors the beginning of the script, in reverse order:

  1. We extract the postcode from the address string (with the handy help of a regular expression);
  2. We merge the ABS and DoHA postcode / RA tables.

import re
import numpy as np

doha_ra = pd.read_csv('doha_ra.csv', header=None)
doha_ra.columns = ['str','ra']
poa = re.compile(r'\d{3,4}(?=, Australia)')  # postcode preceding ', Australia'

def reg_ex(x, regex=poa):
    try:
        return regex.search(x).group()
    except AttributeError:  # no match: search() returned None
        return np.nan

doha_ra['POA'] = [reg_ex(i) for i in doha_ra['str']]
doha_ra['POA'] = doha_ra['POA'].astype(float)
doha_ra = doha_ra.drop_duplicates(['POA'])
abs_ra = df[['POA','RA_2006_STR']].copy()
abs_ra['POA'] = abs_ra['POA'].astype(float)
abs_ra = abs_ra.drop_duplicates(['POA'])
res = pd.merge(abs_ra, doha_ra, on='POA')

Final dataframe with Australian postcodes and corresponding ABS and DoHA RAs

And finally we create a table crosstabulating postcode counts by ABS RA and DoHA RA:

def summary_table(df, columns, value, func='count'):
    # Group by the two RA columns and pivot the second one into columns
    df_grouped = df.groupby(columns)
    return df_grouped[value].agg(func).unstack([columns[-1]]).fillna(0)

table = summary_table(res, ['RA_2006_STR', 'ra'], value='POA')

# Sanity check: confirm each returned address actually ends in 'Australia'
au = re.compile(r'(?<=\d{4}, )\w{9}')
doha_ra['chk'] = [reg_ex(i, regex=au) for i in doha_ra['str']]

# Re-merge so the check column comes along, then tidy up
res = pd.merge(abs_ra, doha_ra, on='POA')
res.columns = ['postcode', 'ABS RA', 'str', 'DoHA RA', 'chk']
res = res.drop(['str', 'chk'], axis=1)

And the verdict is (28 January 2013)…

Final dataframe with ABS and DoHA codes corresponding to each Australian postcode

Posted in ABS data, Australia, Data Analysis, Python

pandas to excel (with xlwt styles!)

Why use Python, pandas, xlwt?

While using R is mandatory when doing just about anything statistical, Python code is usually faster, more readable, and not limited to the analysis stage. R is a statistical language; Python is a complete programming language. The Python pandas library provides much the same functionality found in R's data.frames, which means one could do one's data wrangling in Python in a fraction of the time spent in R, and with better code.

Python also offers the xlwt library, which enables one to write just about anything to XLS files, including setting borders, colors, number format and functions. Taken together, having pandas and xlwt would mean one could automate complex reporting through Python scripting, right?

Missing functionality: Excel cell styles

If you're a heavy pandas user who has to provide reports in Excel format, you will know that pandas' DataFrame.to_excel method lacks support for xlwt styles. This is a major limitation, but it is also easily solved. While I would gladly contribute to pandas if I could reach GitHub from work, the second-best solution, and a much more time-efficient one, is simply to subclass DataFrame and Series to include that support.

How can that be so simple? Well, pandas does support xlwt styles. The support is there in the code, but hardcoded to one standard style. pandas uses xlwt to write XLS files, and xlwt requires a style argument, so the trick is just to create an xlwt style object and pass it as an extra argument to the subclassed pandas object.
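To make the idea concrete, here is a stdlib-only sketch of the kind of plumbing involved: turning a nested style description into the string format accepted by xlwt's real style factory, easyxf(). The helper name and the exact flattening rules here are illustrative, not xlwt's actual internals:

```python
# Hypothetical helper: flatten a nested style dict into an easyxf-style
# string such as 'font: bold on; align: horiz center'
def style_dict_to_str(style):
    parts = []
    for section, opts in style.items():
        settings = ', '.join('%s %s' % (k, v) for k, v in opts.items())
        parts.append('%s: %s' % (section, settings))
    return '; '.join(parts)

s = style_dict_to_str({'font': {'bold': 'on'},
                       'align': {'horiz': 'center'}})
# s is 'font: bold on; align: horiz center'; in real code this string
# would be handed to xlwt.easyxf() to build the style object
```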

Missing functionality: Column MultiIndex in to_excel

It is a matter of taste that I would like my column MultiIndexes to work as expected (one row per level) and not as the standard implementation (appended strings).
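As a stdlib-only sketch of what "one row per level" means (the variable names are hypothetical): for a two-level column index, the writer should emit the level-0 labels on one spreadsheet row and the level-1 labels on the next, instead of joining each pair into a single string:

```python
from itertools import product

# Hypothetical two-level column index: years crossed with quarters
years, quarters = ['2011', '2012'], ['Q1', 'Q2']
columns = list(product(years, quarters))  # [('2011', 'Q1'), ('2011', 'Q2'), ...]

# One spreadsheet row per index level, not one appended string per column
header_rows = [[col[level] for col in columns] for level in range(2)]
# header_rows[0] holds ['2011', '2011', '2012', '2012']
# header_rows[1] holds ['Q1', 'Q2', 'Q1', 'Q2']
```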

Hidden functionality: xlwt style string as dict

I didn't see it described anywhere in the pandas documentation, but the pandas source has a very neat parser that creates xlwt styles from dicts. It doesn't cover number formats, but most everything else can be stored this way, and it goes without saying that it makes the code much more readable. The code below

conv = pd.core.format.CellStyleConverter  # pandas' internal dict-to-xlwt parser (internal API; location may vary by pandas version)

hstyle_dict = {"font": {"bold": True},
               "border": {"top": "thin",
                          "right": "thin",
                          "bottom": "thin",
                          "left": "thin"},
               "pattern": {"pattern": "solid",
                           "fore_colour": 26},
               "align": {"horiz": "center"}}
hstyle = conv.to_xls(hstyle_dict)

dstyle_dict = {"border": {"top": "hair",
                          "right": "hair",
                          "bottom": "hair",
                          "left": "hair"}}
dstyle = conv.to_xls(dstyle_dict)
dstyle.num_format_str = '$#,##0.00'

And an object of the new XLtable class

t = XLtable(your_pandas_DataFrame_object)
t.place_table(ws=ws_1, row=5, col=5, dstyle=dstyle, rstyle=hstyle, cstyle=hstyle)

Would give you

result in Excel

Methods for Series, and for placing data, column and row indexes separately, are also available. What are you waiting for? Grab the code from my Gist repository!

Posted in Data Analysis, Python

Speed up your map plotting in R

I have forever been annoyed at how long it takes to plot data on a large shapefile. And in this domain it doesn't matter whether you're working with MapInfo or R: just zooming the figure takes ages. But a couple of days ago I bumped into an excellent blog post by John Myles White that solved the problem in the most Daoistic way, reminiscent of Steve Jobs: do away with the shapefile!

To get a quick overview of one's data, there is no need for a shapefile. Just plot latitude and longitude directly as a scatterplot.

That’s it.

Here's how it is done. For our example, let's plot a map showing the ratio of foreign-born vs Australian-born residents, by postcode, for all of Australia. As a first step, a quick search found a ready-to-use file with all Australian postcodes and their geographical coordinates. Next, I got some Census data from the Australian Bureau of Statistics, specifically counts of Australian-born and foreign-born residents by postcode. A quick munge and plot (code below), and voilà!

Admittedly this is a quick fix: one has to guess the shape of Australia and the location of its major cities from their relative place on the map. Even with this limitation, though, I find this solution really appealing. Instead of waiting many minutes each time I redraw or resize a graph, I get a complete overview of my dataset instantaneously(!).


library(ggplot2)

poa <- read.csv('data/2011_BCP_ALL_for_AUST_short-header/2011 Census BCP All Geographies for AUST/POA/AUST/2011Census_B01_AUST_POA_short.csv')
poa$poa <- as.integer(sub('[a-zA-Z]{3}', '', x=poa$region_id))  # strip the 'POA' prefix

# Ratio Aus born vs elsewhere born
ratio <- data.frame('poa'=poa$poa, 'Freq'=poa$Birthplace_Australia_P / poa$Birthplace_Elsewhere_P)

zipcodes <- read.csv('data/pc_full_lat_long.csv')
zipcodes <- zipcodes[!duplicated(zipcodes$Pcode),]
zipcodes <- zipcodes[!zipcodes$Lat == 0,]   # drop postcodes with missing coordinates
zipcodes <- zipcodes[!zipcodes$Long == 0,]
poacoord <- data.frame('poa'=zipcodes$Pcode, 'lat'=zipcodes$Lat, 'long'=zipcodes$Long)

popcount <- na.omit(merge(ratio, poacoord, all=FALSE))
popcount <- popcount[rev(order(popcount$Freq)),]  # Place the small bubbles on top
popcount$cat <- sapply(popcount$Freq, function(x) if (x < 6) {'> 1:6'} else {'< 1:6'})

ggplot(popcount, aes(x=long, y=lat, colour=cat)) +
    geom_point() +
    scale_size(range=c(1, 20), name='Population') +
    scale_colour_discrete(name='ratio of Foreign vs\nAustralian-born')

And once we get the data presentation right, we can add the shapefile and go get some tea.

# must have gpclib installed
require("rgdal")  # requires sp, will use proj.4 if installed
require("plyr")   # for join()
gpclibPermit()    # required for the fortify method
amap <- readOGR(dsn=".", layer="aust_cd66states")
amap@data$id <- rownames(amap@data)
amap.points <- fortify(amap, region='id')
amap.df <- join(amap.points, amap@data, by='id')

ggplot() +
    geom_polygon(data=amap.df, aes(long, lat, group=group)) +
    geom_path(data=amap.df, aes(long, lat, group=group), color="grey") +
    geom_point(data=popcount, aes(x=long, y=lat, colour=cat)) +
    scale_size(range=c(1, 20), name="Population") +
    scale_colour_discrete(name="ratio of Foreign vs\nAustralian-born")

Posted in ABS data, Australia, Data Analysis, ggplot2, Mapping, R