Predicting rental listing interest – Kaggle competition

So I recently participated in a Kaggle competition (final ranking 103 / 2488). I had intended to play with the data for a bit and build a prototype / baseline model, but ended up getting addicted and following through to the end of the competition. It was such a fun experience that I thought I’d share my learnings with you.

About the competition

The goal of this competition was to predict how popular an apartment rental listing is based on the listing content (text description, photos, number of bedrooms, number of bathrooms, price, latitude, longitude, etc.). Popularity is categorised into three levels (high, medium, and low) based on the number of inquiries a listing received while it was live on the site. The data comes from renthop.com, an apartment listing website, and the apartments are located in New York City.

Dirty data

One draw of Kaggle competitions is that you can work with real data sets, which are guaranteed to be ‘dirty’. Some data issues for this competition include:

  • Addresses: it seems that when listing a property, managers or owners enter the address as free text, so there are quite a few variations of the same thing. For example, 42nd 10 East Avenue can be entered as 42nd 10 e av, 42 10 east avenue, 42nd 10 e av., etc. These variations are regular enough that they can be processed with, well, regular expressions. What would have been ideal, though, is a database of streets to choose from so that the data stays consistent.
  • Outliers: there are properties with a listed rent of over 100k, and properties with a latitude and longitude of 0.0.
  • Missing data: there was also quite a lot of missing data, for example the building IDs, which uniquely identify buildings (e.g. apartment blocks).

How to deal with outliers and missing data? For this dataset, there were opportunities to impute missing or outlying values from other features. For example, a listing’s missing latitude and longitude can be inferred from other listings that share the same building id or display address and do have coordinates.
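As a concrete sketch of such imputation, assuming the training data has been loaded into a pandas DataFrame with latitude, longitude, and building_id columns (names based on the competition’s JSON fields), the zero coordinates can be treated as missing and filled from other listings in the same building:

import numpy as np
import pandas as pd

df = pd.read_json("train.json")  # competition training file

# Treat (0, 0) coordinates as missing.
df.loc[(df["latitude"] == 0) | (df["longitude"] == 0),
       ["latitude", "longitude"]] = np.nan

# Borrow the median location of other listings in the same building.
for col in ["latitude", "longitude"]:
    df[col] = df[col].fillna(
        df.groupby("building_id")[col].transform("median"))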

Label leakage

When building supervised models, label leakage should be avoided at all costs. On Kaggle, however, if leaky information exists it is almost impossible to win a competition without using it. And that is what happened in this competition.

What happened (I think) comes down to the way the data was prepared. Each listing comes with a set of images, and it appears that the images for each class / interest level were stored in their own folder. These were then copied for distribution to Kagglers, and because of the disproportionate number of instances per class, the images were written at different timestamps (much later than the actual listing creation date). As a result, there is a strong correlation between the image creation timestamp and the interest level.

The right thing to do would have been to prohibit the use of such a feature. Regrettably, Kaggle only encourages publicising a leak once it is found. So most people ended up using this feature in their models, which distorts the actual performance were such a model to be put into practice.

What affects listing popularity?

Feature engineering was definitely a fun exercise in this competition, and there was a lot of room for creativity. I approached it by drawing on my own experiences searching for a rental property and coming up with criteria that matter to me or other renters. Of course, I also borrowed impactful features discussed on the forum.

Below are some of the features that contributed heavily to the model. Some are naturally expected, while others require some reflection but also make a lot of sense.

  • The property’s features: bedrooms, bathrooms, price
  • Value for money: along with the above features comes value for money, which can be determined in a number of ways. For example, it can be computed by comparing an estimated price for the listing against the actual price, or by comparing the price with that of properties in the same building, street, or neighbourhood (see the sketch after this list).
  • Location is the key: location is encoded in variables such as latitude, longitude, building id, display address, and street address. These can be used directly or in combination with other information, thus creating second-order interactions.
  • Skills of the managers: this came as a surprise at first, as one would expect the desirability of a property to have very little to do with who manages it. But on reflection, good managers can i) acquire quality properties, e.g. in prime locations or good condition, and ii) know the ingredients needed to attract viewers, e.g. setting the right price, along with other potential benefits. So even though the causation is weak, the correlation is strong.
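To illustrate the value-for-money idea, a couple of such features can be derived in pandas; the column names (price, bedrooms, building_id) are assumptions based on the competition data, not the exact features I used.

import pandas as pd

# df is assumed to hold the listings with 'price', 'bedrooms', 'building_id' columns.
# Price per bedroom, guarding against studio listings with zero bedrooms.
df["price_per_bed"] = df["price"] / df["bedrooms"].clip(lower=1)

# Ratio of a listing's price to the median price in the same building:
# values well above 1 suggest poor value for money.
building_median = df.groupby("building_id")["price"].transform("median")
df["price_vs_building"] = df["price"] / building_median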

The unexpected bits

Interestingly, neither the textual description of a listing nor its accompanying photos contributed much to model performance. I did not have the time or resources to deal with over 80GB of image data, but others reported little performance gain from incorporating this information.

Deployment challenges

The solutions to this competition cannot readily be deployed into production, as there are a few things to consider. First, only three months’ worth of data was given, and the training and testing data were randomly split. It would be better to use at least two years of data and out-of-time sampling to account for seasonality effects. Second, we need to check that all features used in the winning solution are available at the time of making a prediction, which could be when the listing is first created or in real time. Third, as in the Netflix Prize competition, the winning solutions are based on stacking and use quite a lot of models. Depending on how deployment is done, it may not be possible for the engineers / scientists to use all of the models in their system due to complexity or computational constraints.

What wins competitions?

It’s been well publicised that feature engineering wins Kaggle competitions, and that is still true. But model stacking or ensembling almost always increases performance as well, if done properly.

  • Feature engineering: big gains in performance always come from good feature engineering
  • Stacking or ensembling: while feature engineering is crucial to building a very good model, stacking and ensembling can deliver the final boost needed to win a competition (a minimal sketch follows this list)
  • Experience: Competing in Kaggle is like a sport, and so experience is vital. Expert Kagglers know the best practices, have scripts that can be re-used, and prior knowledge from other competitions that give them an edge in any competition.
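To make the stacking idea concrete, here is a minimal sketch using scikit-learn’s StackingClassifier. X and y stand for the engineered feature matrix and the interest labels; the base and meta learners are illustrative choices rather than the ones used by the winning teams.

from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y are assumed to be the engineered features and interest labels.
stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=300)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                        # out-of-fold predictions feed the meta-model
    stack_method="predict_proba",
)
scores = cross_val_score(stack, X, y, cv=3, scoring="neg_log_loss")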

Actions for RentHop

  • Upgrade their listing service to better capture clean data
  • Re-think how to best leverage their data and incorporate predictive modelling into their business. This competition has demonstrated that it’s possible to predict the level of interest well, but the more important question is what to do with it. Can interest levels be pushed up easily in operation? Recall from above that manager skills and property features such as location, price, and the number of bedrooms and bathrooms are the key model attributes. But these are hard-to-move needles for RentHop, as it does not have much control over them. On the other hand, other metrics, such as minimising the time it takes to find and lease a property, would be easier to optimise for.

Overall it has been a thrilling experience which provided many valuable lessons. Thank you Kaggle and fellow Kagglers! You can find my (uncleaned) code at https://github.com/trungngv/kaggle/tree/master/renthop if interested.

Google Analytics in BigQuery, explained in one query

Google Analytics (GA) is a popular suite of analytic tools used by many companies to track customer interactions on their digital channels. Although it offers plenty of built-in capabilities for insights discovery, there are times when you want to deep dive and run your own analyses. This post will help you understand the Google Analytics data that is exported to BigQuery and how to extract the key pieces of information that you need.

Understanding the data structure

  • BigQuery stores the exported data by date, with each day in its own table. For instance, the data for 15 March 2017 would be stored in 1300.ga_sessions_20170315, where 1300 is the project id and 20170315 is the date in yyyymmdd format. Data for the current date is stored in an intraday table, e.g. 1300.ga_sessions_intraday_20170315.
  • Each table contains all sessions by users, one row per user session. A session is simply a sequence of pages viewed by the user (or in GA terminology, page hits).

For analytical tasks, we want to be able to identify users and sessions.

Identifying unique users

Users can be divided into two categories: logged in and not logged in (guests), of which only the former can be reliably identified. Logged-in users can be associated with customers if you set and send their identifiers programmatically, either via the userId field or custom dimensions that you define. Guests can be identified via fullVisitorId, but this is reset whenever users clear their cookies or use multiple devices. In fact, the mapping between userIds and fullVisitorIds is N-to-N, so they can’t be reliably linked.

Take-away message: set, send, and use userId to uniquely identify customers.

Identifying unique sessions

The GA documentation recommends using fullVisitorId + visitId to get a globally unique session identifier (within your GA data source). But for logged-in users, we should actually use userId + visitStartTime to identify each user’s sessions, where visitStartTime is the start time of a session. Let me illustrate with a toy example:

visitId | fullVisitorId | userId | visitStartTime
v1      | f1            | u1     | 1000000
v1      | f2            | u1     | 1500000

Here we have one user, u1, who is mapped to two different visitor ids in two different sessions. The visitIds happen to be the same in both sessions, so using userId + visitId we would get only one session where in fact there are two. Using userId with visitStartTime is the right combination, as a user can’t have two sessions that start at exactly the same time. If we want to be 100% certain that sessions are unique, we can use userId + visitId + visitStartTime.
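A quick pandas check on the toy example confirms the point; the DataFrame below simply reproduces the two rows of the table above.

import pandas as pd

sessions = pd.DataFrame({
    "visitId":        ["v1", "v1"],
    "fullVisitorId":  ["f1", "f2"],
    "userId":         ["u1", "u1"],
    "visitStartTime": [1000000, 1500000],
})

# userId + visitId collapses the two real sessions into one.
print(sessions.groupby(["userId", "visitId"]).ngroups)          # -> 1

# userId + visitStartTime keeps them distinct.
print(sessions.groupby(["userId", "visitStartTime"]).ngroups)   # -> 2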

Example reference query

Now that we know how to identify users and sessions, let me give you one reference query that covers the main concepts you need to know to work with this data.

Note that BigQuery’s Standard SQL dialect is compliant with the SQL 2011 standard and supports complex types such as arrays and structs. Legacy SQL is also supported, but I encourage the use of Standard SQL, as in the query below.

select
  b.account_number
  , concat(cast(visitId as string), '_', cast(visitStartTime as string)) as session_id
  , hits.type as hit_type
  , hits.hitNumber as hit_number
  -- drop any query string from the second path level
  , concat(hits.page.pagePathLevel1,
      regexp_replace(hits.page.pagePathLevel2, '\\?.*$', '')) as page_level12
  , hits.appInfo.screenName
from `project_id.ga_sessions_201702*` s, s.customDimensions as custdim,
     s.hits, `project_id.account_numbers` b
where custdim.index = 1
  and custdim.value = cast(b.account_number as string)
  and timestamp_diff(timestamp_seconds(visitStartTime), b.first_online, HOUR) < 24
  and hits.type in ('APPVIEW', 'PAGE');

The query extracts all pages visited by each user on apps and websites within their first day online. Here are the main points:

  • account_number is used in place of userId for logged-in users. It comes from an external data source, for example your customer database table, and is set and sent to GA via the first custom dimension, which we retrieve with the condition custdim.index = 1.
  • a unique session identifier is obtained by concatenating visitId and visitStartTime, as discussed above
  • hits is an array of structs containing information about each hit / page view. There are several different hit types, but here we limit them to ‘APPVIEW’ for app interactions and ‘PAGE’ for website interactions
  • the sessions table is implicitly joined with its hits column to flatten the table (one row per hit)
  • hits.hitNumber gives us the order of page views within a session
  • a wildcard is used to select the tables (hence dates) to query; here we are looking at data from February 2017 only
  • hits.page.pagePathLevel{1 to 4} gives the web page, and hits.appInfo.screenName gives the app page
  • timestamp_diff, timestamp_seconds are date time functions
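If you prefer to run such queries from Python rather than the web interface, a minimal sketch using the google-cloud-bigquery client library might look like this. The project and table names are the same placeholders as above, and to_dataframe() additionally requires pandas to be installed.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="project_id")  # placeholder project id

sql = """
select concat(cast(visitId as string), '_', cast(visitStartTime as string)) as session_id,
       hits.type as hit_type,
       hits.page.pagePathLevel1 as page_level1
from `project_id.ga_sessions_201702*` s, s.hits
where hits.type in ('APPVIEW', 'PAGE')
"""

df = client.query(sql).to_dataframe()  # runs the job and loads the results into pandas
print(df.head())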

Side note

The BigQuery web interface is not yet fully fledged. In fact, I found it quite limiting at first, as it does not allow creating tables directly from a query and has only one window for writing queries. But auto-completion (which works for table names, columns, and functions) and pop-up documentation are two absolute killer features. In this respect I like it much better than the boring SQLWorkbench/J client that I’ve been using.

Recommended books for data scientists

In this post I’d like to share some of my recommended books for learning data science and machine learning, in both theory and practice. Fellow practitioners, let me know your favourite books or any other related resources; I’d be keen to check out some new books and add them to my list.

Theory

These are all foundational textbooks in machine learning. I recommend studying at least one of them in depth, by which I mean formulating the models, deriving and implementing the main inference algorithms, and doing the exercises. The books can be quite technical if you’re new to machine learning, but once you work through one, you’ll find the others quite accessible.

The Elements of Statistical Learning (ESL), by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
One of the classics; there’s also an online course and a newer textbook accompanied by R code.

Pattern recognition and machine learning (PRML), by Christopher Bishop

Similar to ESL, this highly regarded book is another must-read.

Machine Learning: A Probabilistic Perspective, by Kevin P. Murphy

If you study PRML thoroughly, you’ll be familiar with most of the content in Murphy’s book. Nevertheless, it is a fun and comprehensive book with a strong focus on a principled, probabilistic approach to modelling. It also comes with Matlab code.

Probabilistic Graphical Models, by Daphne Koller and Nir Friedman

Graphical models provide a framework for the representation, inference, and learning of probabilistic models. This powerful framework gives a unifying view of many ML models that might otherwise be seen as a bunch of disparate approaches. There’s also an online course on Coursera.

Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto

Although still a draft, the second edition is well written and motivates the concepts and applications of RL really well.

Neural networks and deep learning, by Michael Nielsen
Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Michael Nielsen’s book is more hands-on and contains some cool interactive content to aid understanding, while Goodfellow et al. is more comprehensive. I recommend reading them in that order.

Gaussian processes for machine learning, by Carl E. Rasmussen and Christopher K. I. Williams

This last book is on Gaussian processes, my PhD research topic. You don’t really need it for practical data science, but it is still a good reference, and the first few chapters present the Bayesian approach to modelling and are worth reading.

Practice

Data science for business, by Foster Provost and Tom Fawcett

This book is accessible to a non-technical audience, such as business managers. It also provides some sound principles on how to execute data science projects. Highly recommended.

Applied predictive modelling, by Kjell Johnson and Max Kuhn

Co-written by the author of the popular R package caret, this is a must-read for anyone practising data science. It contains many practical tricks and much useful advice.

 

Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, by Gordon S. Linoff and Michael J. A. Berry

Don’t let the title mislead you; this is a good read on data science techniques in general, not just in the CRM space.

Data preparation for data mining, by Dorian Pyle

Published in 1999 but still very relevant today, this book serves as a good checklist of things to inspect when preparing data for analysis.

Bandit algorithms for website optimization, by John Myles White

This book presents standard multi-armed bandit algorithms and comes with implementations in several languages.

Practical data science with R, by John Mount and Nina Zumel

Not as polished as Johnson and Kuhn’s book, but it has a few neat techniques worth knowing.

Visualising thousands of customer journeys

Exploratory analysis of a dataset is a critical step at the beginning of any data science project. It often involves visualising the data, for example with histograms or box plots (for individual dimensions / features), scatter plots (for pairs of features), or the correlation matrix. (Side note: in R, the corrplot package is great for a graphical display of the correlation matrix, which is more effective than inspecting a matrix of numbers.)

These visualisations are a great starting point, but they are designed for viewing static data. What if your dataset is more dynamic? For example, each data point is a snapshot of a customer at some point in time, and there are multiple data points forming a ‘journey’ for each customer. You probably want to see how customer behaviours evolve over time. This is easy for a single customer, as you can just plot a time series along the customer dimension you care about. But most of the time you also want to look at the entire population to discern any trends or patterns that may allow you to serve your customers better. Enter the heat map.

A heat map is a great way to visualise multiple customer journeys simultaneously on the same plot. In general, you use a heat map when you want to plot the values of a matrix, where the two axes correspond to the rows and columns and the values are mapped to a continuous colour scale. In our context of customer journeys, each row visualises a customer, while the columns show the customer’s behaviour (e.g. transaction amount, number of transactions, etc.) in time order. The colour of a cell indicates the strength of the behaviour.
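As a minimal sketch with pandas and matplotlib, assuming a long-format DataFrame journeys with one row per (customer_id, week, spend) observation, the journey matrix and heat map can be produced as follows:

import matplotlib.pyplot as plt
import pandas as pd

# Pivot into a customer x week matrix; weeks with no purchase become zero spend.
mat = journeys.pivot_table(index="customer_id", columns="week",
                           values="spend", aggfunc="sum", fill_value=0)

# Order the rows by total spend over the analysis period.
mat = mat.loc[mat.sum(axis=1).sort_values().index]

plt.figure(figsize=(8, 6))
plt.imshow(mat.values, aspect="auto", cmap="viridis")
plt.colorbar(label="weekly spend")
plt.xlabel("weeks since joining")
plt.ylabel("customers (ordered by total spend)")
plt.show()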

[Figure: heat map of customers’ weekly spend since joining, rows ordered by total spend]

The above plot shows a subset of customers and their spend for each week since they first joined. The matrix rows are ordered by their total values, i.e. by each customer’s total spend over the analysis period. It’s immediately clear that there is a group of customers who only purchased in the first month and were gone after that. Note that the y-axis represents row IDs and carries no meaning in this plot. Upon seeing this, one may decide to dig deeper into this set of early churners to understand why they left and whether any actions could improve their retention.

Hopefully this post has given you another tool for understanding your customers beyond the standard visualisation toolkit. The great thing about heat maps is that they scale easily: once you organise your data in some order, you can sweep through as much of it as the graphical capabilities of your tool or machine allow and still discern patterns in your data.

 

The objectives of customer segmentation

Customer segmentation is a practice widely used by companies to divide their customer base into sub-groups that share similar characteristics, and then deliver targeted, relevant messages to each group. Segmentation is done by looking at customer attributes such as demographics (e.g. age, gender, income, residential address) and / or transactional patterns (e.g. RFM: the recency, frequency, and monetary value of their transactions). One key challenge often encountered when doing this is how to measure the goodness of your segmentation.

Qualitative and mathematical objectives

A commonly agreed, qualitative objective for a good segmentation (or clustering, as it is referred to in machine learning) is that similar customers should be in the same group and different customers should be in separate groups. This criterion can be inspected visually if your data has low dimensionality (typically fewer than 4 dimensions), as in the figure below (image source: http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/mvoget/cluster/kmeans_diagram.png). There we see two distinct coloured clusters, each with a point at its centre called the cluster centroid. If each data point corresponds to a customer, the centroid can be thought of as the most representative member of the group.

If we know the representative of each group, then a natural segmentation mechanism is to find the representative most similar to a customer and assign him or her to that group. This idea is used in the popular k-means clustering method, whose objective is to minimise the total within-group difference between customers, summed across all groups. So one convenient way to evaluate the quality of a segmentation, for example when choosing the number of segments to use (let’s call it k), is to compute this objective for different values of k and choose the one with the smallest total difference. The disadvantage of this approach, though, is that the mathematical objective may not align with your business strategy, and the solution may look like a black box, making the resulting segmentation not actionable.
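For reference, the "vary k and compare objectives" approach takes only a few lines with scikit-learn. X is assumed to be a customer-by-attribute matrix (e.g. RFM features), and the objective reported by KMeans is the within-cluster sum of squared distances.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # X: customers x attributes

inertia = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertia[k] = km.inertia_  # within-cluster sum of squared distances

# Pick the k where the curve starts to flatten out (the 'elbow').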

Segmentation with a business objective

Instead of approaching customer segmentation purely from an optimisation perspective, it is important to tie it to your business objective and make sure that you have designed a specific strategy for each of the final segments. For example, in one project we looked at customer behaviour in a short period after acquisition and divided customers into two segments: high-value and mass. The high-value segment contained only 15% of the new customers but accounted for 70% of future lifetime value. This allowed the business to create two bespoke customer journeys and allocate more resources to retaining the more valuable customers. In this case the segmentation was determined by maximising the number of high-value customers that could be served, given the available budget for this segment. This is still a constrained optimisation, but it is driven by a business objective and is therefore easier to execute with marketing campaigns.

The 20% guide to a good single view of customer

All businesses revolve around customers and the products / services offered to them. These days companies compete on the ability to accurately predict customer intents with respect to their products in order to serve them best. Examples of intents are the potential to purchase an item, cancel a monthly subscription to a service, or close an account. Such an ability relies heavily on how well a firm knows its customers, and most firms will benefit from having the so-called single view of customer. This view contains hundreds to thousands of attributes that enable detailed customer insights. In this post, I will provide some guidance on how to quickly generate a good view of customer. My hope is that by following the recommendations here, it may take you only 20% of your total (feature engineering) effort to attain 80% of the optimal modelling performance.

The importance of feature engineering

It’s widely believed that feature engineering is one of the most effective techniques for improving the performance of machine learning models. Kaggle founder and CEO Anthony Goldbloom claimed that “feature engineering wins competitions”, based on the history of thousands of competitions on the site. While this task depends on the specific application and is often considered more of an art than a science, there are a few guiding practices that are broadly applicable across a wide range of use cases. These are motivated by the fact that there are three main entities (with respect to the goal of modelling) in most business contexts, namely the customer, the product, and the company itself. As predictive models are often customer-centric, the single view should capture information about an individual customer and how he or she interacts with the two remaining entities.

Based on the above motivation, the single view can be divided into the following three main categories.

Descriptive features

Descriptive features are customer characteristics that are typically captured at a fixed point in time. They include information such as demographics (age, gender, marital status, employment status, and residential address such as suburb or post code). Features related to how the customer was acquired can also be valuable, for example which channel they signed up through (online vs. offline), whether they received any promotion, and on which day of the week they joined.

Behavioural / transactional features

Unlike descriptive features, behavioural or transactional features are more dynamic and are typically computed over repeated transactions within some period of time. Recurring transactions occur in telecommunications, banking, insurance, entertainment, and many other retail businesses. A transaction provides granular information, for example the transaction amount or price, the time, and the product / item itself and its category, which is invaluable in helping us understand customer preferences. Transactional features are often created along three dimensions (a small sketch follows the list below):

  1. Recency: When was the first time or last time a transaction took place?
  2. Frequency: How often does a customer purchase? How does that break down into different product categories?
  3. Monetary: What was the value of the transaction? Because this is a continuous (real-valued) measurement, and there are multiple transactions per customer, these features are aggregated. In other words, you compute the min, max, sum, average, median, standard deviation, and other summary statistics of all transaction amounts during the period under consideration.
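Here is the pandas sketch referred to above; transactions is an assumed DataFrame with one row per purchase and customer_id, timestamp, and amount columns.

import pandas as pd

snapshot = transactions["timestamp"].max()  # reference date for recency

rfm = transactions.groupby("customer_id").agg(
    first_purchase=("timestamp", "min"),
    last_purchase=("timestamp", "max"),
    frequency=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    max_spend=("amount", "max"),
)
rfm["recency_days"] = (snapshot - rfm["last_purchase"]).dt.days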

It may be worth emphasising again that behavioural and transactional features are computed over a period of time, so customer behaviour at any point in the lifecycle can be analysed by varying the time window. For example, you may look at these features over the first day, first week, first month, first 3 months, last 3 months, last month, or last week of a customer’s tenure, depending on the nature of the purchasing cycle in your business. You can also compare recent behaviour to the initial behaviour when customers first joined to see how they have evolved. Have they become more valuable, or are they spending less with your firm?

Interactional features

Interactional features are similar to transactional features in that they are both recurring, although the former do not involve financial transactions. We consider two-way interactions initiated by either the company or the customer. Interactions can be direct, such as email marketing, SMS, and customer calls or complaints. They can also be indirect, such as a customer visiting the company’s web pages. Each interaction can be thought of as an event, which can be categorised according to business activities. Because they are computed over a period of time, like behavioural features, we end up with attributes representing event counts: for example, how many times a customer visited a particular web page in the last 30 days, how many times a customer called to complain about a product or service, or how many times the company successfully reached the customer.
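A sketch of such event-count features in pandas, assuming an events DataFrame with customer_id, event_type, and timestamp columns:

import pandas as pd

# Keep the last 30 days of interactions.
cutoff = events["timestamp"].max() - pd.Timedelta(days=30)
recent = events[events["timestamp"] >= cutoff]

# One column per event type, counting occurrences per customer.
event_counts = (recent.pivot_table(index="customer_id", columns="event_type",
                                   values="timestamp", aggfunc="count",
                                   fill_value=0)
                      .add_prefix("n_30d_"))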

As before, these features allow us to measure the changes in the levels of interactions at different time periods. Such measurements indicate the degree of attachment and responsiveness of a customer, which can be a very useful feature when predicting future customer intents.

In my personal experience, these three sets of features can be implemented quickly if you are already familiar with your business’s data. Coupling them with robust modelling methods like random forests or boosted trees often results in a reasonably good initial model. For binary classification, I usually get an AUC above 0.70 in the first run, which surpasses the accuracy level required for some practical applications.
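A baseline along these lines might look as follows; X combines the three feature categories and y is a binary intent label, both assumed to exist already.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {auc:.3f}")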

Signing off note: in this post I’ve actually only described feature generation, a precursor to the whole feature engineering process. Further processing or transformation of the generated features may be needed, for example normalising, scaling, or discretising continuous variables, especially when using models that are sensitive to the magnitude of feature values.
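For completeness, a minimal scikit-learn sketch of such transformations, with X_num assumed to hold the continuous columns of the single view:

from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

X_scaled = StandardScaler().fit_transform(X_num)  # zero mean, unit variance

# Alternatively, bucket a skewed feature (e.g. total spend) into quintiles.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X_num)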

 

A brief study of hotels in Vietnam

This is a brief study I did over the weekend of hotels in three popular travel destinations in Vietnam, namely Hanoi, Nha Trang, and Phu Quoc.

Key observations:

– Hanoi ranks first in total hotel capacity, followed by Nha Trang, with Phu Quoc quite far behind.
– 2-to-3-star hotels dominate the market in Hanoi, whereas 1-star and 4-to-5-star hotels are more prevalent in the other two regions.
– Hanoi hotels are more established and have higher average review scores than those in Nha Trang and Phu Quoc. This may also indicate that Hanoi is more competitive.
– Most visitors come from Australia, the US, and Europe, with some from our neighbours (Singapore, Thailand, Malaysia).

1. Total hotel capacities
[Figure: total hotel capacities by city]

2. Distribution of hotel capacities
[Figure: distribution of hotel capacities (Hanoi)]

3. Hotel star-ratings
[Figure: hotel star ratings (Hanoi)]

4. Review scores
[Figure: review scores (Phu Quoc)]

5. Guest nationalities
Hanoi
[Figure: guest nationalities (Hanoi)]

Nha Trang
[Figure: guest nationalities (Nha Trang)]

Phu Quoc
[Figure: guest nationalities (Phu Quoc)]

6. Hotel size, locations, and ratings

Each solid circle corresponds to a hotel; its area is proportional to the capacity of the hotel, and the color is mapped to its review score.

[Figures: hotel size, location, and review score for Hanoi, Nha Trang, and Phu Quoc]