Example Application

I recently wanted to get some data to select program committee members for a conference. I decided to estimate which authors had the most impact with papers published at the conference in the past. After some experimentation, here is what I wanted to do:

Use Google to estimate how many pages refer to each paper published at the conference, then analyze this data to find the most cited authors.

Here's what I did:

1) First I needed a list of papers. My initial idea was to scrape HTML from DBLP. Then I noticed that DBLP contains bibtex items, so I used WinHTTrack Website Copier to download bibtex files from DBLP for that conference. This was not too hard, but I did have to play with the configuration a fair bit to get it to work, because the bibtex files are stored on a different server.

2) I had to get the bibtex out of the HTML pages. I decided to write an XSLT script to do this, but ran into a problem because the HTML was not well-formed XML. After a few false starts, I was able to download, compile and run the .NET Html Agility Pack to clean up the HTML. Then I concatenated all the files and ran this XSLT script over them:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML -->
  <xsl:output encoding="ascii" method="text"/>
  <xsl:template match="/">
    <!-- Copy the text content of every pre element -->
    <xsl:for-each select="//pre">
      <xsl:value-of select="."/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
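
For comparison, the same "pull the text out of every pre element" step can be sketched without the clean-up pass, using a lenient HTML parser. This is only a minimal sketch in Python, not what I ran: it relies on the standard library's html.parser, and the names PreTextExtractor and extract_pre_text are made up for illustration.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside every <pre> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0       # how many <pre> elements we are currently inside
        self.blocks = []     # text of each finished <pre> element
        self._current = []   # text pieces of the <pre> being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._current))
                self._current = []

    def handle_data(self, data):
        # Text inside nested tags (for example links) is still collected.
        if self.depth > 0:
            self._current.append(data)


def extract_pre_text(html):
    """Return a list with the text of each <pre> element in the given HTML."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Because html.parser tolerates unclosed tags, this works on HTML that an XSLT processor would reject, at the cost of giving up XPath.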

3) I then had a bunch of bibtex files. I needed to extract the titles and authors, so I decided to convert the bibtex to XML. After a few false tries, I found bib2XML and converted my files to XML. However, I later found that bib2XML does not properly translate special characters (accented characters, superscripts, trademark symbols). Rather than fix the code, I fixed the files by hand using a few regular expression substitutions.
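
If only titles and authors are needed, the trip through XML can in principle be skipped. The following is a rough Python sketch of that alternative, not the bib2XML route described above. It assumes each entry writes its fields as name = {value} or name = "value", and the function names read_field and title_and_authors are hypothetical. It does not translate special characters either.

```python
import re

# Matches the start of an author or title field, up to its opening delimiter.
# The \b keeps "booktitle" from being mistaken for "title".
FIELD_START = re.compile(r'\b(author|title)\s*=\s*([{"])', re.IGNORECASE)


def read_field(entry, name):
    """Return the raw value of one field of a bibtex entry, or None if missing."""
    for m in FIELD_START.finditer(entry):
        if m.group(1).lower() != name:
            continue
        opener = m.group(2)
        start = i = m.end()
        depth = 0  # nesting level of braces inside the value
        while i < len(entry):
            ch = entry[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                if depth == 0 and opener == "{":
                    return entry[start:i]
                depth -= 1
            elif ch == '"' and opener == '"' and depth == 0:
                return entry[start:i]
            i += 1
    return None


def clean(text):
    """Drop protective braces and collapse runs of whitespace."""
    return " ".join(text.replace("{", "").replace("}", "").split())


def title_and_authors(entry):
    """Return (title, [authors]) for one bibtex entry."""
    title = read_field(entry, "title")
    author = read_field(entry, "author")
    # bibtex separates author names with the word "and".
    authors = re.split(r"\s+and\s+", author.strip()) if author else []
    return (clean(title) if title else None), [clean(a) for a in authors]
```

The brace counting is there because bibtex titles often protect capitalization with nested braces, which a single non-greedy regular expression would cut short.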

4) Then I wrote a .NET application to drive the Google web services API. This was fairly straightforward, except that the Google server for registering to use the API was down for a few days. Also, when I passed null for a default parameter, the call failed without a useful explanation. I tried using empty strings instead and that worked.

5) My program created a tab-delimited file (TSV) with a line for each author of a paper, which I then loaded into Excel for analysis. I used the data analysis wizard to count and sum the number of hits for publications. But now I want to do some more sophisticated analysis.
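
The count-and-sum step can also be done outside Excel. Here is a minimal Python sketch of it, under an assumed column layout of author, title, hit count (the layout of my actual file is not shown here, so treat the column order and the name summarize_hits as hypothetical).

```python
import csv
from collections import defaultdict


def summarize_hits(tsv_file):
    """Return (author, paper count, total hits) tuples, most hits first.

    Assumes each row is: author <TAB> title <TAB> hits.
    """
    totals = defaultdict(lambda: [0, 0])  # author -> [paper count, total hits]
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        author, hits = row[0], row[2]
        totals[author][0] += 1
        totals[author][1] += int(hits)
    # Sort by total hits descending, then by name for a stable order.
    return sorted(
        ((author, count, hits) for author, (count, hits) in totals.items()),
        key=lambda t: (-t[2], t[0]),
    )
```

It takes any open text file, for example the result of open("hits.tsv", newline=""), where the file name is again just a placeholder.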

You might say this is overkill for the purpose I had in mind (or even that it is irrelevant). But I sometimes try to automate processes like this just to find out how hard it is. This one was pretty difficult.
