A bit of a hiatus from posting at the end of the year. There wasn't one particular reason for it, more a combination of being busy with other things (getting presents sorted for Christmas, etc.) and not having much going on that was worth putting a post together for.
On the negative side, I sat and failed the Google Cloud Architect exam. I’m not especially used to failing tests, so it was definitely a confidence knock. Still, there’s no point sitting around and feeling disappointed about it. Given that December was busy I haven’t gone about re-studying for the exam – I’ve decided to have a look at the Professional Data Engineer instead.
I have also since found out that I can access the official Google courses on Coursera for free, as my employer is a Google Partner. Nice.
I have, though, found time to play around with my (dubious) Python skills. As noted elsewhere (I think), I play hockey for a team in Hong Kong, and the official Hong Kong hockey website leaves a lot to be desired (in my opinion, anyway). The website displays results and match details from the current season only; it's not particularly well presented or searchable, and there is certainly no way of querying the data. Welcome to the 1990s … A bit of a headscratcher, though, is that the data for previous seasons is actually still present, it's just not publicly linked to. Weird.
HKHA hockey data …
What do I mean by the above? A sample match from the start of this season saw NBC A play HKFC D, with a 3-0 win for HKFC D, and the match details made available online at http://www.hockey.org.hk/MenMatchCard.asp?Uid=22381
Every match played should be recorded and presented like this on the website, with the Uid value representing a unique identifier for the match. Inspecting the source for the web page shows only one table present, containing the fixture date, time, and match officials' details. The actual player details, although presented in a table format, do not use an HTML table. Instead they are displayed using <div> and <span> HTML tags.
I am absolutely not a web designer/architect (as will become apparent I’m sure), however this strikes me as horribly inefficient. Every single match record seems to be created as a static ASP page. It’s possible that the server at the backend is dynamically creating this ASP page depending upon the match Uid selected, but that seems unlikely. And we’ll say nothing about the fact that the whole site is served using HTTP only with no HTTPS in sight. sigh…
The website doesn’t link to any previous seasons and matches from previous seasons are not immediately findable. They’re still on the website though! For example, let’s have a look here: http://www.hockey.org.hk/MenMatchCard.asp?Uid=19473 and we find the opening match from the 2017/18 season for HKFC D (against Aquila A).
Getting my own copy
This is promising – I can access match records going back a good few seasons. I’m not exactly sure what I want to accomplish by having these records, but my initial thoughts were to put them into a database, make them searchable, and build my own web front end. As much for my own entertainment as for any other reason. I started off by putting together a bash script that would download the ‘MenMatchCard.asp?Uid=<counter>’ page using a loop that would increment the counter variable up to about 24,000. What I discovered was that not every number linked to a match card, and that the Uid values also linked to the Ladies hockey results (even though the asp page was titled MenMatchCard).
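The original loop was a bash script I haven't reproduced here; a rough Python equivalent looks something like the below. The URL pattern is taken from the site itself, but the output directory, timeout, and upper Uid bound are my own assumptions:

```python
import os
import urllib.request

BASE = "http://www.hockey.org.hk/MenMatchCard.asp"

def match_url(uid):
    """Build the match-card URL for a given Uid."""
    return f"{BASE}?Uid={uid}"

def download_all(last_uid=24000, out_dir="cards"):
    """Fetch every candidate match card; not every Uid resolves to a match."""
    os.makedirs(out_dir, exist_ok=True)
    for uid in range(1, last_uid + 1):
        try:
            with urllib.request.urlopen(match_url(uid), timeout=10) as resp:
                html = resp.read()
        except OSError:
            continue  # missing or failed pages are simply skipped
        with open(os.path.join(out_dir, f"{uid}.asp"), "wb") as f:
            f.write(html)
```

Note the site is HTTP-only, so plain `urllib.request` is enough; there is no certificate handling to worry about.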
In the end I downloaded about 22,000 asp pages from the web site. Each asp file contained a lot of text that I simply wasn’t interested in. What I wanted to do was extract the match date (possibly the pass-back time), match officials, line ups, scores, scorers, and possibly any other information that might be useful.
Extracting data from a table is pretty straightforward – unfortunately most of the data I want isn’t in a table! Some horrible bash scripting (which I am not going to share, as it’s so horrible) allowed me to split the file into two parts, one with the home team details and the other with the away team details. My script then removes some whitespace and commas in names before I run it through a Python scraper script that extracts data in a <div> class named ‘team’. More horrible bash scripting removes the HTML tags and then removes other bits of formatting not already removed.
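For illustration, the ‘team’ div extraction step can be done with just the standard library. This is my guess at an equivalent of that scraper (the class name ‘team’ is from the post; everything else is assumed), using `html.parser` rather than the original bash/Python pipeline:

```python
from html.parser import HTMLParser

class TeamDivScraper(HTMLParser):
    """Collects the text found inside any <div class="team"> ... </div>."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth while inside a team div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside the team block
            elif dict(attrs).get("class") == "team":
                self.depth = 1           # entering the team block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def scrape_team(html):
    """Return the text fragments from the team div as a list of strings."""
    parser = TeamDivScraper()
    parser.feed(html)
    return parser.chunks
```

This skips the separate “strip the HTML tags afterwards” step, since the parser only ever keeps the text nodes in the first place.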
After all this nastiness I’m left with a CSV file with contents like:
,4,GLOVER William Robin,,1
,32,MAK Ho Long Gareth,,1
,46,MAYO Stuart William John,,1
,60,HO Yin Cheuk Adrian,,1
,61,ERVINE Jonathan Desmond,,1
,70,SHEPHERDSON James Andrew,,1
,99,CHAPMAN Simon Geoffrey,,1
,104,O'SHEA James Michael,,1
,115,MA Matthew Tsun Tik,,1
,130,MAYO Findlay Jack,,1
,134,DAVIDSON Loughlin WIlliam,,1
,140,BOULTON Andrew David,,1
,174,BOTY Taha Muffadal,,1
The first field (empty in this particular example) determines who the captain is, signified by an asterisk (*). The next field is the player’s playing number, then their name, the number of goals they scored in the fixture, and finally a flag that they played. The final ‘1’ is probably unnecessary, as it is always present.
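Given that layout, each line can be split into named fields with the `csv` module. The field names here are my own labels for the five positions described above:

```python
import csv
import io

# Column labels are mine; the scraped CSV has no header row.
FIELDS = ["captain", "number", "name", "goals", "played"]

def parse_row(line):
    """Split one line of the scraped CSV into a dict keyed by field name."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(FIELDS, values))
```

Using the `csv` module rather than a plain `str.split(",")` keeps names like O'SHEA safe should any quoting ever appear in the data.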
At least this is data that I can do something with. The CSV file can be uploaded into a spreadsheet and manipulated as needed. That’s not very interesting though. Instead I chose to upload it to a MySQL database. I have created a table named _2018_mensResults which tracks the match number (Uid) and the actual match details (date, officials, score). An additional table is then created for each team that has played, and the player list is uploaded to it. Extra columns are added for each match that is played, and extra rows are added as more players from the pool of available players play matches. Over the course of a season a playing record should build up. I did this manually last season, tracking players for my own club in a spreadsheet. With this method I can track players across all clubs across the entire season.
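A minimal sketch of the match-level table, using sqlite3 as a self-contained stand-in for MySQL. The table name _2018_mensResults and the Uid/teams/score for the sample match are from the post; the column names and the remaining values are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the real MySQL database
cur = conn.cursor()

# Match-level table; name from the post, columns are my guess.
cur.execute("""
    CREATE TABLE _2018_mensResults (
        uid         INTEGER PRIMARY KEY,  -- match Uid from the website
        match_date  TEXT,
        officials   TEXT,
        home_team   TEXT,
        away_team   TEXT,
        home_score  INTEGER,
        away_score  INTEGER
    )
""")

# Sample match from the post (Uid 22381, NBC A 0 - 3 HKFC D);
# date and officials left NULL here as illustrative placeholders.
cur.execute(
    "INSERT INTO _2018_mensResults VALUES (?, ?, ?, ?, ?, ?, ?)",
    (22381, None, None, "NBC A", "HKFC D", 0, 3),
)
conn.commit()

row = cur.execute(
    "SELECT home_team, away_team, away_score "
    "FROM _2018_mensResults WHERE uid = 22381"
).fetchone()
```

The per-team tables with a column added per match would follow the same pattern, though a single players/appearances table keyed on player and Uid would query more easily.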
What to do with this data?
Having the data available in a database isn’t by itself very interesting or helpful. It only becomes useful if the data is exposed and can be viewed. The next step will therefore be to build a web application around the database.