Predicting rental listing interest – Kaggle competition

So I recently participated in a Kaggle competition (final ranking 103 / 2488). I had intended to play with the data for a bit and build a prototype / baseline model, but I ended up getting addicted and followed through till the end of the competition. It was such a fun experience that I thought I’d share my learnings with you.

About the competition

The goal of this competition is to predict how popular an apartment rental listing is based on the listing content (text description, photos, #bedrooms, #bathrooms, price, latitude, longitude, etc). Popularity is categorised into three levels (high, medium, and low), based on the number of inquiries a listing received while it was live on the site. The data comes from renthop.com, an apartment listing website, and the listed apartments are located in New York City.

Dirty data

One draw of Kaggle competitions is that you can work with real data sets, which are guaranteed to be ‘dirty’. Some data issues for this competition include:

  • Addresses: it seems that when listing a property, managers or owners enter the address as free text, so there are quite a few variations of the same thing. For example, 42nd 10 East Avenue can be entered as 42nd 10 e av, 42 10 east avenue, 42nd 10 e av., etc. These variations are regular enough that they can be processed with, well, regular expressions (see the sketch after this list). What would have been ideal, though, is a database of streets to choose from so that the data stays consistent.
  • Outliers: there are properties with a listed rent of over 100k, and properties with latitude and longitude of 0.0.
  • Missing data: there was also quite a lot of missing data, for example the building IDs, which uniquely identify buildings (e.g. apartment blocks).
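
As a minimal sketch of the regular-expression clean-up mentioned above (the abbreviation rules and the normaliseAddress function are my own illustration, not the exact rules used in the competition):

def normaliseAddress(address: String): String =
  address.toLowerCase
    .replaceAll("""\be\b""", "east")          // expand a standalone compass abbreviation
    .replaceAll("""\bave?\b\.?""", "avenue")  // av, av., ave, ave. -> avenue
    .replaceAll("""\bst\b\.?""", "street")    // st, st. -> street
    .replaceAll("""\s+""", " ")
    .trim

// e.g. normaliseAddress("42nd 10 e av.") == "42nd 10 east avenue"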

How do we deal with outliers and missing data? For this dataset, there were opportunities to impute missing or outlying values from other features. For example, the missing latitude and longitude of a listing can be inferred from other listings that share the same building ID or display address and do have valid coordinates.
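
A minimal sketch of such an imputation, assuming a simplified listing record (the field names here are my own, not the competition schema): fill in missing coordinates with the average coordinates of other listings in the same building.

case class Listing(id: String, buildingId: String, lat: Option[Double], lng: Option[Double])

def imputeCoordinates(listings: Seq[Listing]): Seq[Listing] = {
  // Average the known coordinates per building.
  val byBuilding: Map[String, (Double, Double)] =
    listings
      .filter(l => l.lat.isDefined && l.lng.isDefined)
      .groupBy(_.buildingId)
      .map { case (b, ls) =>
        (b, (ls.flatMap(_.lat).sum / ls.size, ls.flatMap(_.lng).sum / ls.size))
      }
  // Fill in listings with missing coordinates when their building is known.
  listings.map { l =>
    if (l.lat.isEmpty || l.lng.isEmpty)
      byBuilding.get(l.buildingId)
        .map { case (lat, lng) => l.copy(lat = Some(lat), lng = Some(lng)) }
        .getOrElse(l)
    else l
  }
}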

Label leakage

When building supervised models, label leakage should be avoided at all costs. On Kaggle, however, if there is leaky information then it is almost impossible to win a competition without using it. And that is what happened in this competition.

What happened (I think) is due to the way the data was prepared. Each listing comes with a set of images, and it appears that the images of each class / interest level were stored in their own folder. These were then copied for distribution to Kagglers, and because of the disproportionate numbers of instances per class, the images were written at different timestamps (much later than the actual listing creation date). As a result, there is a strong correlation between the image creation timestamp and the interest level.

The right thing to do would have been to prohibit the use of such a feature. Regrettably, Kaggle only encourages publicising a leak once it is found. So most people ended up using this feature in their models, which distorts the actual performance were the model to be put into practice.

What affects listing popularity?

Feature engineering was definitely a fun exercise in this competition, and there was a lot of room for creativity. I approached it by drawing on my own experiences searching for a rental property and coming up with criteria that matter to me or other renters. Of course I also borrowed impactful features discussed on the forum.

Below are some of the features that contribute heavily to the model. Some are naturally expected while others require some reflection but also make a lot of sense.

  • The property’s features: bedrooms, bathrooms, price
  • Value for money: along with the above features comes value for money, which can be determined in a number of ways. For example, it can be computed by comparing an estimated price for the listing against the actual price, or by comparing the price with that of properties in the same building, street, or neighbourhood (see the sketch after this list).
  • Location is the key: location is encoded in variables such as latitude, longitude, building id, display address, and street address. These can be used directly or in combination with other information, thus creating second-order interactions.
  • Skills of the managers: this came as a surprise at first, as one would expect the desirability of a property to have very little to do with who is managing it. But thinking more deeply, good managers can i) acquire quality properties, e.g. in prime locations or good condition, and ii) know the ingredients needed to attract viewers, e.g. setting the right price, and bring other potential benefits. So even though the causation is not so strong, the correlation is.
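
As a sketch of the value-for-money idea above (the field names and the neighbourhood grouping are my own simplifications): the ratio of a listing’s price to the median price in its neighbourhood, where a value well above 1.0 suggests a relatively expensive listing.

case class PricedListing(id: String, neighbourhood: String, price: Double)

def median(xs: Seq[Double]): Double = {
  val sorted = xs.sorted
  val n = sorted.size
  if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}

// Price relative to the neighbourhood median, keyed by listing id.
def priceToMedianRatio(listings: Seq[PricedListing]): Map[String, Double] = {
  val medianByHood = listings.groupBy(_.neighbourhood).map {
    case (hood, ls) => (hood, median(ls.map(_.price)))
  }
  listings.map(l => l.id -> l.price / medianByHood(l.neighbourhood)).toMap
}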

The unexpected bits

Interestingly, both the textual description of a listing and its accompanying photos did not contribute much to model performance. I did not have the time and resources to deal with over 80GB of image data, but others reported little performance gain from incorporating this information.

Deployment challenges

Solutions to this competition cannot readily be deployed into production, as there are a few things to consider. First, only three months’ worth of data was given, and the training and testing data were randomly split. It would be better to use at least two years of data and out-of-time sampling to account for seasonality effects. Second, we need to check whether all features used in the winning solution are available at the time of making a prediction, which can be when the listing is first created or in real time. Third, similar to the Netflix prize competition, the winning solutions are based on stacking and use quite a lot of models. Depending on how deployment is done, it may not be possible for the engineers / scientists to use all of the models in their system due to complexity or computational constraints.
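
A minimal sketch of the out-of-time sampling mentioned above (the created accessor is an assumption about where the listing creation date would live): train on older listings and validate on the most recent ones, so that validation mimics predicting for future listings.

import java.time.LocalDate

def outOfTimeSplit[A](rows: Seq[A], created: A => LocalDate, cutoff: LocalDate): (Seq[A], Seq[A]) =
  rows.partition(row => created(row).isBefore(cutoff))

// Hypothetical usage: listings created before the cutoff go to training, the rest to validation.
// val (train, valid) = outOfTimeSplit(listings, (l: Listing) => l.created, LocalDate.of(2016, 6, 1))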

What wins competitions?

It’s been well publicised that feature engineering wins Kaggle competitions, which is still true. But model stacking or ensembling almost always increases performance if done properly.

  • Feature engineering: big gains in performance always come from good feature engineering.
  • Stacking or ensembling: while feature engineering is crucial to getting a very good model, stacking and ensembling models can deliver the final boost to win a competition (see the sketch after this list).
  • Experience: competing on Kaggle is like a sport, and so experience is vital. Expert Kagglers know the best practices, have scripts that can be re-used, and carry prior knowledge from other competitions, all of which give them an edge in any competition.
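
As a sketch of the ensembling idea referenced above (the Model trait and predictProba are my own placeholders, not any particular library’s API): average the class probabilities predicted by several models.

trait Model {
  // class label -> predicted probability
  def predictProba(features: Map[String, Double]): Map[String, Double]
}

def averageEnsemble(models: Seq[Model], features: Map[String, Double]): Map[String, Double] =
  models.map(_.predictProba(features))
    .flatten
    .groupBy { case (label, _) => label }
    .map { case (label, probs) => (label, probs.map(_._2).sum / models.size) }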

Actions for Rent Hop

  • Upgrade their listing service to better capture clean data
  • Re-think how to best leverage their data and incorporate predictive modelling into their business. This competition has demonstrated that it’s possible to predict the level of interest well, but the more important question is what to do with it. Can interest level be pushed up easily in operations? Recall from above that manager skills and property features such as location, price, and number of bedrooms and bathrooms are the key model attributes. But these are hard-to-move needles for RentHop as it does not have much control over them. On the other hand, other metrics, such as minimising the time it takes to find and lease a property, would be easier to optimise for.

Overall it has been a thrilling experience which provided many valuable lessons. Thank you Kaggle and fellow Kagglers! You can find my (uncleaned) code at https://github.com/trungngv/kaggle/tree/master/renthop if interested.


Data transformation, Scala collections, and SQL

Data transformation is one of the 3 steps in ETL (extract, transform, load) — a process for getting raw data from heterogeneous sources (e.g. databases, text files), processing or transforming it, then loading it into the final destination (e.g. in a format ready for further modelling or analysis). While there exists a plethora of languages for this task, this post describes a minimal set of operations that can be used for the purpose. Concrete examples are given using HiveQL, a variant of the popular query language SQL, and Scala collections.

Our running example will be a simple dataset containing student records.

1. Data representation

With SQL, it’s obvious that each student record will be represented as a row in a table. Suppose we know the student ID, name, year of birth, and enrolment year; we define the following Student table:

create table Student(
  id string,
  name string,
  birth_year int,
  enrol_year int
);

With Scala, each student record can simply be represented as a tuple or a case class. We’ll use a case class in this example because later on we want to access each field of a record by name.

case class Student(id: String, name: String, birthYear: Int, enrolYear: Int)

The entire set of student records is stored as a table in SQL, and as a collection in Scala. Note also that the type of each field is available in both representations.

2. Example data

Let’s assume our data has 4 students as created by the below code.

val students = List[Student](
  Student("1", "Alice", 1990, 2015),
  Student("2", "Bob", 1991, 2016),
  Student("3", "Cathy", 1990, 2015),
  Student("4", "David", 1991, 2014))

The SQL code is omitted for convenience as it’s not part of the operations we’re interested in.

3. Selection

The first operation is Select. Let’s say we want to find all students with ID less than “3”.

In SQL:

select * from Student where id < "3";

In Scala:

students.filter(s => s.id < "3")


The underlying implementation of this operation may simply traverse the list (in Scala) or read each row of the table (in SQL), then check whether each element satisfies the condition. In a specialised database such as MySQL, there can be optimisations such as indexing (based on the ID field) to allow more efficient search. Nevertheless, the high-level abstractions in both languages are very similar.

4. Projection

To select a subset of fields in SQL, we just need to specify which column names to keep. Let’s say for the previous query we only want the student name.

select name from Student where id < "3";


In Scala:

students.filter(s => s.id < "3").map(s => s.name)


If you’re new to Scala, it may not be easy to understand the above code. Let’s break it down into two expressions:


val studentsWithSmallIds: List[Student] = students.filter(s => s.id < "3")
val studentNames: List[String] = studentsWithSmallIds.map(s => s.name)

The first line filters the list based on the condition (s.id < "3"), applied to each element s of the list. The second line then applies a function that returns s.name for each element s, hence the result is of type List[String].

5. Group By / Aggregations

Another common operation is to group the data based on some fields and perform aggregations such as counting the number of elements in each group.

Let’s say we want to count the number of students that were born in each year. This can be done easily in SQL:

select birth_year, count(*) as cnt from Student group by birth_year;

In Scala:

scala> val groups: Map[Int, List[Student]] = students.groupBy(s => s.birthYear)
groups: Map[Int,List[Student]] = Map(1991 -> List(Student(2,Bob,1991,2016), Student(4,David,1991,2014)), 1990 -> List(Student(1,Alice,1990,2015), Student(3,Cathy,1990,2015)))
scala> val countByYear: Map[Int, Int] = groups.map{ case (birthYear, xs) => (birthYear, xs.length) }
countByYear: Map[Int,Int] = Map(1991 -> 2, 1990 -> 2)


The first line transforms the list into a map whose keys are the values of the field we group by, and whose values are the lists of Students sharing the same key.
Then we apply the function case (birthYear, xs) => (birthYear, xs.length) to each element of the resulting map (groups). The function returns a tuple of two elements, and the tuples are collected into the map countByYear.
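
The same pattern works for other aggregations. For instance, a quick sketch (using the same students list) that computes the average enrolment year per birth year:

val avgEnrolYearByBirthYear: Map[Int, Double] =
  students
    .groupBy(_.birthYear)
    .map { case (birthYear, xs) => (birthYear, xs.map(_.enrolYear).sum.toDouble / xs.length) }
// Map(1991 -> 2015.0, 1990 -> 2015.0)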

6. Join

The last operation we consider is join. Let’s introduce another type of record that some students loathe but others absolutely love — the GPA record. Each record contains a student ID and a GPA, like so:

case class GPA(id: String, gpa: Float)
val gpas = List(GPA("1", 1.0f), GPA("2", 2.0f), GPA("3", 3.0f), GPA("4", 4.0f))

Join is supported natively in SQL, so if we want to join Student and GPA we can simply write


select t1.*, t2.gpa from Student t1 join GPA t2 on t1.id = t2.id;

There is no native join operation in Scala, so we’ll implement one ourselves. It’s easy to do for this particular example: we can iterate through the Student list and the GPA list, and keep the pairs that match on ID:

scala> for (s <- students; g <- gpas; if (s.id == g.id)) yield (s.id, s.name, s.birthYear, s.enrolYear, g.gpa)

res55: List[(String, String, Int, Int, Float)] = List((1,Alice,1990,2015,1.0), (2,Bob,1991,2016,2.0), (3,Cathy,1990,2015,3.0), (4,David,1991,2014,4.0))

7. Generic Join in Scala

The code in the previous section for joining students and their GPAs is all well and good, except that it is not general enough. If we were to join two different collections, we would have to repeat the above code with some modifications to match on the right key. In this section we’ll study how to implement a generic join in Scala as a fun exercise.

First we will implement join for two maps: m1 of type Map[K, List[V1]] and m2 of type Map[K, List[V2]]. Note that the maps share the same key type but can have different value types.

We can define the join function as:

def join[K, V1, V2](m1: Map[K, List[V1]], m2: Map[K, List[V2]]): Map[K, List[(V1, V2)]] = {
  for ((k1, v1) <- m1; (k2, v2) <- m2; if (k1 == k2))
    yield (k1, v1.flatMap(x => v2.map(y => (x, y))))
}


A slightly more complicated but perhaps more functional way of implementing join (without using the for expression) is:


def join[K, V1, V2](m1: Map[K, List[V1]], m2: Map[K, List[V2]]): Map[K, List[(V1, V2)]] = {
  m1.flatMap {
    case (k, v1) => m2.get(k) match {
      case Some(v2) => Some((k, v1.flatMap(x => v2.map(y => (x, y)))))
      case None => None
    }
  }
}

Now, to join two lists, we just need to specify the key of each element and convert both lists to maps, where each map entry binds a key to all elements of the list that have that key. Coming back to our Student example, the Scala code to use the above join would be:

scala> val studentMap = students.groupBy(s => s.id)
scala> val gpaMap = gpas.groupBy(g => g.id)
scala> val studentWithGPA = join(studentMap, gpaMap)
studentWithGPA: scala.collection.immutable.Map[String,List[(Student, GPA)]] = Map(2 -> List((Student(2,Bob,1991,2016),GPA(2,2.0))), 1 -> List((Student(1,Alice,1990,2015),GPA(1,1.0))), 4 -> List((Student(4,David,1991,2014),GPA(4,4.0))), 3 -> List((Student(3,Cathy,1990,2015),GPA(3,3.0))))

Conclusion

We have seen through this post how Scala collections can be used to implement common data transformation operations, much like SQL. Scala also makes it convenient to write a DSL (domain-specific language). This is the case with Scalding, a data ETL framework built on top of Cascading and Hadoop MapReduce. In fact, the main difference between the operations you see in this post and those in Scalding is that the actual Scalding implementation targets distributed systems such as a Hadoop cluster.

Linux for Data Scientists

The first step towards becoming a data scientist is to become familiar with Linux. EdX offered a great introductory course by the Linux Foundation, which covers basic to intermediate material.

Important topics include:

  • Linux philosophy and concepts
  • Command line operations (basic operations and working with files)
  • File operations
  • User environment
  • Text editors (vi/vim and emacs)
  • Text manipulation (cat, echo, sed, awk, grep, tr, wc, cut)
  • Bash scripting
  • Security, networks, processes

Completing the course gives you a decent command of most command line utilities that are used on an almost daily basis.

Basic commands: man, ls, mkdir, rmdir, rm -rf, file, ln, echo

Working with text files:

  • cat – concatenate and print content of file e.g. cat filename
  • head – print the first lines of a file (10 by default) e.g. head -n x filename, where x = # lines to show
  • tail – print the last lines of a file (10 by default) e.g. tail -n x filename, where x = # lines to show
  • less, more – inspect file content without printing out to standard output
  • wc – word, line, character, and byte count e.g. wc -l filename
  • grep – search for a pattern in a text file (regexp is supported) e.g. grep pattern filename; common options are -i (ignore case), -F (search for a fixed string), -m n (show max n results), -c (count only, do not print matching text), -C n (print n leading and trailing lines surrounding each match)
  • tr – translate (replace or substitute characters from standard input) e.g. tr '01' ',' replaces the characters 0 and 1 with a comma
  • sed – stream editor to transform text e.g. sed 's/apple/orange/g; s/orange/pear/g' — this first replaces all (g for global) occurrences of apple with orange, then replaces all occurrences of orange with pear.
  • cut – extract a field (column) of a file with table structure (i.e. each line contains a record and each record consists of multiple fields) e.g. cut -d : -f 2 extracts the second column of a file using : as the delimiter
  • paste – putting files together (horizontally or vertically) e.g. paste file1 file2
  • split – split a big file into smaller parts
  • sort – sort a file line by line (can also sort by field with -k) e.g. sort filename, cat filename | sort
  • uniq – remove all but one of each run of duplicate lines from an already sorted file e.g. sort filename | uniq, with option -c to also print the count of each instance