I recently completed the “Microsoft Professional Program Data Science” certification, which culminated in a Capstone project in the form of a competition to predict the Prevalence of Undernourishment (PoU).
I am happy to have finished in 5th position out of 441 participants, and this post is a summary of what I learnt during the competition.
The competition format was standard. The mission was to predict the PoU for a test dataset whose actual PoU values were withheld. You get training data (features and the PoU value) to train a prediction model. Your score is the Root Mean Squared Error (RMSE) of your predicted values against the actual values. Lowest score wins.
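For reference, the scoring metric can be sketched in a few lines of plain Python. This is only an illustration of the standard RMSE formula, not the competition's actual scoring code; the function name `rmse` and the sample values are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: errors of 3 and 4 give sqrt((9 + 16) / 2), about 3.54.
print(rmse([10.0, 20.0], [13.0, 16.0]))
```

Because the errors are squared before averaging, a few large misses cost more than many small ones.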
If you are after the details and my code, you can find them at PoU DAT102x 201810, a GitHub repo.
My insights from this competition are:
- It is easy to see why Python, Pandas, scikit-learn, and Jupyter are so popular for Data Science. Together they form a fully functional ecosystem with a simple yet compelling feature set.
- Getting to a prediction model is easy; getting to a good prediction model is difficult. It took me two weeks of toil to get from the middle of the pack to the top tier.
- My first breakthrough came after deciding to replace missing data with a linear regression estimate by country, rather than merely deleting rows or substituting the mean or median. This relatively simple change improved my RMSE to below ten, which was still way behind the leaders though.
- The adage that 80% of Data Science is in the feature preparation was entirely right in my case. From a conversation with a fellow competitor after the competition, it turns out it isn’t apparent to everyone that whatever changes you make during feature preparation also need to be made to the test data.
- The second breakthrough came when I realised that stratifying by country when selecting training data was wrong. Removing this step brought my score to 7.7688.
- AzureML is excellent for building and testing models. It is less suitable for data visualisation, though. I found it much easier to perform data exploration and cleansing in Python and then use AzureML for the modelling. One big drawback of AzureML is that, at the time of writing this, the ability to see the model parameters was limited.
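To make the missing-data idea above more concrete, here is a minimal sketch of filling gaps with a per-country linear regression. It is not the code from the repo. It assumes each row is a dictionary with hypothetical keys `country` and `year`, that missing values are stored as `None`, and that the regression is fitted against the year, which is my assumption for illustration. The helper names `fit_line` and `impute_by_country` are likewise made up. It uses plain Python so it runs without any dependencies.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def impute_by_country(rows, feature):
    """Return copies of rows with missing `feature` values estimated per country."""
    # Collect the known (year, value) points for each country.
    known = defaultdict(list)
    for row in rows:
        if row[feature] is not None:
            known[row["country"]].append((row["year"], row[feature]))

    # Fit one line per country; a fit needs at least two distinct years.
    fits = {}
    for country, points in known.items():
        years, values = zip(*points)
        if len(set(years)) >= 2:
            fits[country] = fit_line(years, values)

    # Fill gaps where a fit exists; leave the rest as None for a fallback strategy.
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[feature] is None and row["country"] in fits:
            slope, intercept = fits[row["country"]]
            new_row[feature] = slope * row["year"] + intercept
        filled.append(new_row)
    return filled


# Hypothetical data: country "A" has a gap in 2001, country "B" has too few points.
rows = [
    {"country": "A", "year": 2000, "gdp": 10.0},
    {"country": "A", "year": 2001, "gdp": None},
    {"country": "A", "year": 2002, "gdp": 14.0},
    {"country": "B", "year": 2000, "gdp": 5.0},
    {"country": "B", "year": 2001, "gdp": None},
]
for row in impute_by_country(rows, "gdp"):
    print(row)
```

Calling the same function on both the training and the test data keeps the two sets prepared identically, which is the point about the test data made above. Countries with fewer than two known points stay unfilled here and would still need a fallback such as a median.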
The GitHub repo contains the full details if you want to continue reading.