I recently participated in a Kaggle competition (final ranking 103 / 2488). I had intended to play with the data for a bit and build a prototype / baseline model, but ended up getting addicted and followed through to the end of the competition. It was such a fun experience that I thought I’d share what I learned.
About the competition
The goal of this competition is to predict how popular an apartment rental listing is based on the listing content (text description, photos, #bedrooms, #bathrooms, price, latitude, longitude, etc). Popularity is categorised into three levels (high, medium, and low), based on the number of inquiries a listing has in the duration that the listing was live on the site. The data comes from renthop.com, an apartment listing website. These apartments are located in New York City.
One draw of Kaggle competitions is that you can work with real data sets, which are guaranteed to be ‘dirty’. Some data issues for this competition include:
- Addresses: it seems that when listing a property, managers or owners can enter free text, so there are quite a few variations of the same address. For example, 42nd 10 East Avenue can be entered as 42nd 10 e av, 42 10 east avenue, 42nd 10 e av., etc. These variations are regular enough that they can be processed with, well, regular expressions. What would’ve been ideal though is a database of streets to choose from so that the data is consistent.
- Outliers: there are properties with a listed rent of over 100k, and properties with a latitude and longitude of 0.0.
- Missing data: there was also quite a lot of missing data, for example building IDs, which uniquely identify buildings (e.g. apartment blocks).
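To make the address clean-up concrete, here is a minimal sketch of regex-based normalisation. The function name and the abbreviation table are my own illustrative assumptions, not the code used in the competition, and the table would need extending after looking at the real data (for instance, "st" can also mean "saint").

```python
import re

# Hypothetical abbreviation table; extend it after inspecting the real data.
ABBREVIATIONS = {
    "e": "east", "w": "west", "n": "north", "s": "south",
    "av": "avenue", "ave": "avenue", "st": "street", "blvd": "boulevard",
}

def normalise_address(raw: str) -> str:
    """Lower-case, strip punctuation and ordinal suffixes, expand abbreviations."""
    text = raw.lower()
    text = re.sub(r"[.,]", " ", text)                      # drop punctuation
    text = re.sub(r"\b(\d+)(st|nd|rd|th)\b", r"\1", text)  # 42nd -> 42
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

variants = ["42nd 10 East Avenue", "42nd 10 e av", "42 10 east avenue", "42nd 10 e av."]
print({normalise_address(v) for v in variants})
```

All four variants from the example above collapse to a single key, which can then be used for grouping listings.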
How to deal with outliers and missing data? For this dataset, there were opportunities to impute missing features or outliers based on other features. For example, a listing’s missing latitude and longitude can be inferred from its building ID and display address when other listings with the same building ID or address do provide them.
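The imputation idea can be sketched roughly as follows. This is an illustration only: the dictionary keys (`building_id`, `latitude`, `longitude`) and the choice of the per-building median are assumptions of mine, and the same pattern would apply to grouping by a normalised display address.

```python
from collections import defaultdict
from statistics import median

def has_coordinates(row):
    """Treat None or the (0.0, 0.0) placeholder as missing."""
    lat, lon = row.get("latitude"), row.get("longitude")
    return lat is not None and lon is not None and (lat, lon) != (0.0, 0.0)

def impute_coordinates(listings):
    """Return copies of the listings with missing coordinates filled per building."""
    by_building = defaultdict(list)
    for row in listings:
        if row.get("building_id") and has_coordinates(row):
            by_building[row["building_id"]].append((row["latitude"], row["longitude"]))

    filled = []
    for row in listings:
        row = dict(row)  # do not mutate the caller's data
        donors = by_building.get(row.get("building_id"))
        if not has_coordinates(row) and donors:
            row["latitude"] = median(lat for lat, _ in donors)
            row["longitude"] = median(lon for _, lon in donors)
        filled.append(row)
    return filled
```

Listings without a building ID, or whose building has no listing with valid coordinates, are left unchanged so they can be handled by another rule.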
When building supervised models, label leakage should be avoided at all costs. On Kaggle however, if there is leaky information then it is almost impossible to win a competition without using it. And that is what happened in this competition.
What happened (I think) was due to the way the data was prepared. Each listing comes with a set of images, and it appears that the images of each class / interest level were stored in their own folder. These were then copied for distribution to Kagglers, and due to the disproportionate numbers of instances per class, the images were written at different timestamps (much later than the actual listing creation date). So there is a strong correlation between image creation timestamp and interest level.
The right thing to do would have been to prohibit the use of such a feature. Regrettably though, Kaggle only encourages publicising a leak when one is found. So most people ended up using this feature in their models, which overstates the performance these models would have if put into practice.
What affects listing popularity?
Feature engineering was definitely a fun exercise in this competition, and there was a lot of room for creativity. I approached this by relating to my own experiences when searching for a rental property and coming up with criteria that are important to me or other renters. Of course I also borrowed other features discussed on the forum which were impactful.
Below are some of the features that contribute heavily to the model. Some are naturally expected while others require some reflection but also make a lot of sense.
- The property’s features: bedrooms, bathrooms, price
- Value for money: along with the above features comes value for money, which can be measured in a number of ways. For example, by comparing an estimated price for the listing against the actual price, or by comparing the price with that of properties in the same building, street, or neighbourhood.
- Location is the key: location is encoded in variables such as latitude, longitude, building id, display address, and street address. These can be used directly or in combination with other information, thus creating second-order interactions.
- Skills of the managers: This came as a surprise at first, as one would expect the desirability of a property to have very little to do with who is managing it. But thinking more deeply, good managers can i) acquire quality properties, e.g. in prime locations or in good condition, and ii) know the required ingredients to attract viewers, e.g. setting the right price, and bring other potential benefits. So even though causation is not so strong, correlation is.
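As one illustration of a value-for-money feature, the sketch below computes each listing’s price relative to the median price of its group (building by default). The function name and the dictionary keys are assumptions for the example, it is not the exact feature from my model, and it assumes prices are positive.

```python
from collections import defaultdict
from statistics import median

def price_ratio_to_group_median(listings, group_key="building_id"):
    """Price divided by the median price of listings sharing the same group_key.

    A ratio below 1.0 suggests the listing is cheap for its building,
    street, or neighbourhood; above 1.0 suggests it is expensive.
    """
    prices = defaultdict(list)
    for row in listings:
        prices[row[group_key]].append(row["price"])
    medians = {group: median(values) for group, values in prices.items()}
    return [row["price"] / medians[row[group_key]] for row in listings]

listings = [
    {"building_id": "a", "price": 2000},
    {"building_id": "a", "price": 3000},
    {"building_id": "a", "price": 4000},
    {"building_id": "b", "price": 2500},
]
ratios = price_ratio_to_group_median(listings)
```

Passing a different `group_key` (for example a normalised street name) gives the street-level or neighbourhood-level version of the same comparison. Groups with a single listing always get a ratio of 1.0, so in practice they carry no signal.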
The unexpected bits
Interestingly, neither the textual description of a listing nor its accompanying photos contributed much to model performance. I did not have the time and resources to deal with over 80GB of image data, but others reported little performance gain from incorporating this information.
Solutions to this competition cannot readily be deployed into production, as there are a few things to consider. First, only three months’ worth of data was given, and the training and testing data were randomly split. It would be better to use at least two years of data and out-of-time sampling to account for seasonality effects. Second, we need to check whether all features used in the winning solution are available at the time of making a prediction, which can be when the listing is first created or in real time. Third, similar to the Netflix Prize competition, the winning solutions are based on stacking and contain quite a lot of models. Depending on how deployment is done, it may not be possible for the engineers / scientists to use all of the models in their system due to complexity or computational constraints.
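For reference, out-of-time sampling simply means validating on listings created after everything the model was trained on, rather than on a random subset. A minimal sketch, assuming each listing carries a creation timestamp under a hypothetical `created` key:

```python
from datetime import datetime

def out_of_time_split(listings, cutoff, date_key="created"):
    """Train on listings created before `cutoff`, validate on the rest."""
    train = [row for row in listings if row[date_key] < cutoff]
    valid = [row for row in listings if row[date_key] >= cutoff]
    return train, valid

listings = [
    {"id": 1, "created": datetime(2016, 4, 10)},
    {"id": 2, "created": datetime(2016, 5, 20)},
    {"id": 3, "created": datetime(2016, 6, 15)},
]
train, valid = out_of_time_split(listings, cutoff=datetime(2016, 6, 1))
```

The dates here are made up for the example. With only a few months of data such a split still cannot capture yearly seasonality, which is why a longer history matters.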
What wins competitions?
It’s been well publicised that feature engineering wins Kaggle competitions, which is still true. But model stacking or ensembling almost always increases performance if done properly.
- Feature engineering: Big gains in performance always come from good feature engineering
- Stacking or ensembling: While FE is crucial to getting a very good model, stacking and ensembling models can deliver the final boost to win a competition
- Experience: Competing on Kaggle is like a sport, and so experience is vital. Expert Kagglers know the best practices, have scripts that can be re-used, and have prior knowledge from other competitions that gives them an edge in any competition.
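To give a flavour of ensembling, the sketch below shows its simplest form: a weighted average of the class probabilities predicted by several models (often called blending). It is not the stacking used by the winning solutions, which train a second-level model on out-of-fold predictions. The function name and the numbers are illustrative assumptions.

```python
def blend_probabilities(predictions, weights=None):
    """Weighted average of per-model class probabilities.

    predictions: one entry per model; each entry is a list of rows, and each
    row is a list of class probabilities, e.g. [high, medium, low].
    """
    n_models = len(predictions)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    blended = []
    for rows in zip(*predictions):           # the same listing across all models
        blended.append([
            sum(w * p for w, p in zip(weights, probs)) / total
            for probs in zip(*rows)           # the same class across all models
        ])
    return blended

model_a = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
model_b = [[0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
blended = blend_probabilities([model_a, model_b])
```

Because the weights are normalised by their sum, each blended row still sums to one as long as every model’s row does.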
Actions for RentHop
- Upgrade their listing service to better capture clean data
- Rethink how to best leverage their data and incorporate predictive modelling in their business. This competition has demonstrated that it’s possible to predict the interest level well, but the more important question is what to do with it. Can interest level be pushed up easily operationally? Recall from above that manager skills and property features such as location, price, number of bedrooms and bathrooms are the key model attributes. But these are hard-to-move needles for RentHop as it does not have much control over them. On the other hand, other metrics such as the time it takes to find and lease a property may be easier to optimise for.
Overall it has been a thrilling experience which provided many valuable lessons. Thank you Kaggle and fellow Kagglers! You can find my (uncleaned) code at https://github.com/trungngv/kaggle/tree/master/renthop if interested.