Visualising thousands of customer journeys

Exploratory analysis of a dataset is a critical step at the beginning of any data science project. This often involves visualising the data, for example with histograms or box plots (for individual dimensions / features), scatter plots (for pairs of features), or the correlation matrix. (Side note: in R, the corrplot package is great for displaying the correlation matrix graphically, which is far more effective than inspecting a matrix of numbers.)

These visualisations are a great starting point, but they are designed for static data. What if your dataset is more dynamic? For example, each data point may be a snapshot of a customer at some point in time, with multiple data points forming a ‘journey’ for each customer. You probably want to see how customer behaviours evolve over time. This is easy for a single customer: just plot a time series along the customer dimension that you care about. But most of the time you also want to look at the entire population to discern trends or patterns that may allow you to serve your customers better. Enter the heat map.

A heat map is a great way to visualise multiple customer journeys simultaneously on the same plot. In general, you use a heat map to plot the values of a matrix: the two axes correspond to the rows and columns, and the values are mapped to a continuous colour scale. In our context of customer journeys, each row represents a customer and the columns show a customer behaviour (e.g. transaction amount, number of transactions) in time order. The colour of a cell indicates the strength of the behaviour.


The above plot visualises a sample of customers and their spend in each week since they first joined. The matrix is ordered by the total value in each row, i.e. by the total spend of a customer over the period of analysis. It’s immediately clear that there is a group of customers who only purchased in the first month and were gone after that. Note that the y-axis represents the row IDs and carries no meaning in this plot. Looking at this, one may decide to dig deeper into this set of early churners to understand why they left and whether any action could improve their retention.
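To make the construction concrete, here is a minimal sketch of how the matrix behind such a heat map might be built. The Transaction type and its field names are hypothetical, introduced only for illustration: pivot transactions into a customer-by-week matrix, then order the rows by total spend.

```scala
// A minimal sketch: the Transaction type and its fields are illustrative.
case class Transaction(customerId: String, week: Int, amount: Double)

// Pivot transactions into one row per customer, one column per week,
// then order the rows by total spend over the whole period.
def spendMatrix(txns: List[Transaction], numWeeks: Int): List[(String, Vector[Double])] =
  txns.groupBy(_.customerId)
      .map { case (id, ts) =>
        val cells = Vector.tabulate(numWeeks)(w => ts.filter(_.week == w).map(_.amount).sum)
        (id, cells)
      }
      .toList
      .sortBy { case (_, cells) => -cells.sum }

val txns = List(
  Transaction("a", 0, 10.0), Transaction("a", 1, 5.0),
  Transaction("b", 0, 60.0), // an early churner: one burst of spend, then gone
  Transaction("c", 0, 20.0), Transaction("c", 2, 30.0))

val matrix = spendMatrix(txns, numWeeks = 3)
```

Each row of `matrix` is then one horizontal strip of the heat map, with the cell value mapped to colour; the early churner "b" sits at the top with a single bright cell followed by empty weeks.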

Hopefully this post has given you another tool for understanding your customers beyond the standard visualisation toolkit. The great thing about heat maps is that they scale easily: once you organise your data in some order, you can sweep through as large a portion of it as the graphical capability of your tool or machine allows, and still discern patterns in your data.



The objectives of customer segmentation

Customer segmentation is a practice widely used by companies to divide their customer base into sub-groups that share similar characteristics, and then deliver targeted, relevant messages to each group. Segmentation is done by looking at customer attributes such as demographics (e.g. age, gender, income, residential address) and / or transactional patterns (e.g. RFM: the recency, frequency, and monetary value of their transactions). One key challenge often encountered is how to measure the goodness of a segmentation.

Qualitative and mathematical objectives

A commonly agreed qualitative objective for a good segmentation (or clustering, as it is referred to in machine learning) is that similar customers should be in the same group and different customers should be in separate groups. This criterion can be inspected visually if your data has low dimensionality (typically fewer than four dimensions), as in the figure below, where we see two distinctly coloured clusters, each with a point at the centre called the cluster centroid. If each data point corresponds to a customer, the centroid can be thought of as the most representative member of a group.

If we know the representative of each group, then a natural segmentation mechanism is to find the representative most similar to a customer and assign him or her to that group. This idea is used in the popular k-means clustering method, whose objective is to minimise the sum of differences between customers within a group, across all groups. So one convenient way to evaluate the quality of a segmentation, for example when choosing the number of segments to use (let’s call this k), is to compute the objective for varying k and choose the value with the smallest total difference. The disadvantage of this approach, though, is that the mathematical objective may not align with your business strategy, and the solution may look like a black box, making the resulting segmentation less actionable.
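To make the "compute the objective for varying k" procedure concrete, here is a toy sketch on one-dimensional spend data. The kmeansWcss function below is a minimal Lloyd's-algorithm implementation written only for illustration, with invented numbers; it is a sketch, not a production clustering routine.

```scala
// Toy illustration of choosing k by comparing the k-means objective
// (within-cluster sum of squares, WCSS) for several values of k.
// Minimal one-dimensional Lloyd's algorithm, for illustration only.
def kmeansWcss(points: Vector[Double], k: Int, iters: Int = 20): Double = {
  var centroids = points.distinct.take(k)
  for (_ <- 1 to iters) {
    // assign each point to its nearest centroid, then recompute the centroids
    val clusters = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
    centroids = { case (_, ps) => ps.sum / ps.size }.toVector
  }
  // total squared distance from each point to its nearest centroid
  points.map(p => math.pow(p - centroids.minBy(c => math.abs(c - p)), 2)).sum
}

val spend = Vector(1.0, 1.2, 0.9, 10.0, 10.5, 9.8) // two obvious groups
val wcssByK = (1 to 3).map(k => k -> kmeansWcss(spend, k)).toMap
// the objective drops sharply from k=1 to k=2, then flattens: an "elbow" at k=2
```

Plotting the objective against k and picking the point where the curve flattens is the classic elbow heuristic for choosing the number of segments.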

Segmentation with a business objective

Instead of approaching customer segmentation purely from an optimisation perspective, it is important to tie it to your business objective and make sure that you have designed a specific strategy for each of the final segments. For example, in one project we looked at customer behaviours in a short period after acquisition and divided customers into two segments: high-value and mass. The high-value segment contained only 15% of the new customers but accounted for 70% of the future life-time value. This allowed the business to create two bespoke customer journeys and allocate more resources to retaining the more valuable customers. In this case the segmentation was determined by maximising the number of high-value customers that could be served, given the available budget for this segment. This is indeed still a constrained optimisation, but it is driven by a business objective and is therefore easier to execute with marketing campaigns.
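The budget-constrained selection described above can be sketched in a few lines. The Cust record, the predicted values, and the cost model below are invented assumptions for illustration, not details from the actual project:

```scala
// Illustrative sketch: rank customers by predicted future value and
// serve as many as the segment budget allows.
case class Cust(id: String, predictedValue: Double)

def highValueSegment(custs: List[Cust], budget: Double, costPerCust: Double): List[Cust] =
  custs.sortBy(c => -c.predictedValue)      // most valuable first
       .take((budget / costPerCust).toInt)  // as many as the budget can serve

val custs = List(Cust("a", 500.0), Cust("b", 5000.0), Cust("c", 1200.0), Cust("d", 300.0))
val segment = highValueSegment(custs, budget = 200.0, costPerCust = 100.0)
// with a budget covering two customers, "b" and "c" form the high-value segment
```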

The 20% guide to a good single view of customer

All businesses revolve around customers and the products / services offered to them. These days companies compete on the ability to accurately predict customer intents with respect to their products in order to serve them best. Examples of intents are the potential to purchase an item, cancel a monthly subscription to a service, or close an account. Such ability relies heavily on how well a firm knows its customers, and most firms benefit from having a so-called single view of customer. This view contains hundreds to thousands of attributes which enable detailed customer insights. In this post, I will provide some guidance on how to quickly generate a good view of customer. My hope is that by following the recommendations here, it may take you only 20% of your total feature engineering effort to attain 80% of the optimal performance when it comes to modeling.

The importance of feature engineering

It’s widely believed that feature engineering is one of the most effective techniques for improving the performance of machine learning models. Kaggle founder and CEO Anthony Goldbloom claimed that “feature engineering wins competitions”, based on the history of thousands of competitions on their site. While this task depends on the specific application and is often considered more of an art than a science, a few guiding practices are broadly applicable across a wide range of use-cases. The motivation is that most business contexts involve three main entities (with respect to the goal of modeling): the customer, the product, and the company itself. As predictive models are often customer-centric, the single view should capture information about an individual customer and how he or she interacts with the two remaining entities.

Based on the above motivation, the single view can be divided into the following three main categories.

Descriptive features

Descriptive features capture customer characteristics at a fixed point in time. These include demographics such as age, gender, marital status, employment status, and residential address (e.g. suburb or post code). Features related to how the customer was acquired can also be valuable: for example, which channel they signed up through (online vs. offline), whether they received any promotion, and on which day of the week they joined.

Behavioural / transactional features

Unlike descriptive features, behavioural or transactional features are more dynamic and are typically computed over repeated transactions within some period of time. Recurring transactions occur in telecommunications, banking, insurance, entertainment, and many other retail businesses. A transaction provides granular information, for example the transaction amount or price, the time, and the product / item itself and its category, which is invaluable in helping us understand customer preferences. Transactional features are often created along three dimensions:

  1. Recency: When was the first time or last time a transaction took place?
  2. Frequency: How often does a customer purchase? How does that break down across different product categories?
  3. Monetary: What was the value of the transaction? Because this is a continuous (real-valued) measurement, and there are multiple transactions per customer, these features are aggregated. In other words, you compute the min, max, sum, average, median, standard deviation, and other summary statistics of all transaction amounts during the period under consideration.
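The three dimensions above can be sketched as a single aggregation over each customer's transactions. The Txn record and its field names below are illustrative assumptions:

```scala
// Sketch of RFM-style aggregation; the Txn record and fields are illustrative.
case class Txn(customerId: String, day: Int, amount: Double)
case class RFM(firstDay: Int, lastDay: Int,  // recency
               count: Int,                   // frequency
               total: Double, min: Double, max: Double, mean: Double) // monetary

def rfmFeatures(txns: List[Txn]): Map[String, RFM] =
  txns.groupBy(_.customerId).map { case (id, ts) =>
    val days    =
    val amounts =
    id -> RFM(days.min, days.max, ts.size,
              amounts.sum, amounts.min, amounts.max, amounts.sum / ts.size)
  }

val txns = List(Txn("1", 3, 20.0), Txn("1", 10, 40.0), Txn("2", 5, 15.0))
val features = rfmFeatures(txns)
```

In practice the same aggregation would be restricted to a chosen time window and extended with the other summary statistics (median, standard deviation, and so on).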

It may be worth emphasising again that behavioural and transactional features are computed over a period of time, and customer behaviours at any point in their lifecycle can be analysed by varying the time window. For example, you may look at these features in the first day, first week, first month, first 3 months, last 3 months, last month, or last week of a customer’s tenure, depending on the nature of the purchasing cycle in your business. You can also compare recent behaviours to initial behaviours when customers first joined to see how they have evolved. Have they become more valuable, or are they spending less with your firm?
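As a sketch of this window-varying idea (with an invented Purchase record and invented numbers), the same aggregate computed over two different windows supports exactly this early-versus-recent comparison:

```scala
// Sketch: compute the same aggregate over two time windows to compare
// early and recent behaviour; the Purchase record is illustrative.
case class Purchase(customerId: String, day: Int, amount: Double)

def spendInWindow(ps: List[Purchase], fromDay: Int, toDay: Int): Map[String, Double] =
  ps.filter(p => >= fromDay && <= toDay)
    .groupBy(_.customerId)
    .map { case (id, xs) => id -> }

val purchases = List(
  Purchase("1", 2, 30.0), Purchase("1", 80, 5.0),   // spending less over time
  Purchase("2", 3, 10.0), Purchase("2", 85, 40.0))  // becoming more valuable

val earlySpend  = spendInWindow(purchases, 0, 30)   // e.g. the first month
val recentSpend = spendInWindow(purchases, 60, 90)  // e.g. the most recent month
```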

Interactional features

Interactional features are similar to transactional features in that both are recurring, although the former do not involve financial matters. We consider two-way interactions initiated by either the company or the customer. Interactions can be direct, such as email marketing, SMS, customer calls, or complaints. They can also be indirect, such as a customer visiting the company’s web pages. Each interaction can be thought of as an event, which can be categorised according to business activities. Because they are computed over a period of time like behavioural features, we end up with attributes representing event counts: for example, how many times a customer visited a particular web page in the last 30 days, how many times a customer called to complain about a product or service, or how many times the company successfully reached the customer.
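Such event-count features can be sketched as a count per customer and category within a time window; the Event record and the category names below are illustrative assumptions:

```scala
// Sketch of interactional features as event counts per (customer, category)
// within a time window; record and category names are illustrative.
case class Event(customerId: String, category: String, day: Int)

def eventCounts(events: List[Event], fromDay: Int, toDay: Int): Map[(String, String), Int] =
  events.filter(e => >= fromDay && <= toDay)
        .groupBy(e => (e.customerId, e.category))
        .map { case (key, es) => key -> es.size }

val events = List(
  Event("1", "web_visit", 1),  Event("1", "web_visit", 25), Event("1", "web_visit", 30),
  Event("1", "complaint_call", 40), Event("2", "web_visit", 35))

// counts within days 20 to 45, e.g. "the last 30 days"
val counts = eventCounts(events, 20, 45)
```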

As before, these features allow us to measure the changes in the levels of interactions at different time periods. Such measurements indicate the degree of attachment and responsiveness of a customer, which can be a very useful feature when predicting future customer intents.

Based on my personal experience, these three sets of features can be implemented quickly if you are already familiar with the data of your business. Coupling these features with robust modeling methods like random forests or boosted trees often results in a reasonably good initial model. For binary classification, I usually get an AUC above 0.70 in the first run, which surpasses the accuracy level required for some practical applications.

Signing-off note: in this post I have actually only described feature generation, a precursor to the full feature engineering process. Further processing or transformation of the generated features may be needed, for example normalising, scaling, or discretising continuous variables, especially when using models that are sensitive to the magnitude of feature values.


A brief study of hotels in Vietnam

This is a brief study I did over the weekend about hotels in three popular travel destinations in Vietnam namely Hanoi, Nha Trang, and Phu Quoc.

Key observations:

– Hanoi ranks first in total hotel capacity, followed by Nha Trang, with Phu Quoc quite far behind.
– 2-to-3-star hotels dominate the market in Hanoi, whereas 1-star and 4-to-5-star hotels are more prevalent in the other two regions.
– Hanoi hotels are more established and average higher review scores than those in Nha Trang and Phu Quoc. This may also indicate that Hanoi is more competitive.
– Most visitors come from Australia, the US, and Europe, with some from neighbouring countries (Singapore, Thailand, Malaysia).

1. Total hotel capacities

2. Distribution of hotel capacities

3. Hotel star-ratings

4. Review scores

5. Guest nationalities

Nha Trang

Phu Quoc

6. Hotel size, locations, and ratings

Each solid circle corresponds to a hotel; its area is proportional to the capacity of the hotel, and its colour is mapped to the review score.


Data transformation, Scala collections, and SQL

Data transformation is one of the three steps in ETL (extract, transform, load): a process for extracting raw data from heterogeneous sources (e.g. databases, text files), transforming it, and then loading it into a final destination (e.g. in a format ready for further modelling or analysis). While a plethora of languages exist for this task, this post describes a minimal set of operations that can be used for the purpose. Concrete examples are given in HiveQL, a variant of the popular query language SQL, and in Scala collections.

Our running example will be a simple dataset containing student records.

1. Data representation

With SQL, it’s obvious that each student record will be represented as a row in a table. Supposing we know the student ID, name, year of birth, and enrolment year, we define the following Student table:

create table Student(
  id string,
  name string,
  birth_year int,
  enrol_year int
);

With Scala, each student record can simply be represented as a tuple or a case class. We’ll use a case class in this example because later on we want to access the fields of a record by name.

case class Student(id: String, name: String, birthYear: Int, enrolYear: Int)

The entire set of student records is stored as a table in SQL, and as a collection in Scala. Note also that the type of each field is available in both representations.

2. Example data

Let’s assume our data has 4 students, created by the code below.

val students = List[Student](Student("1", "Alice", 1990, 2015),
    Student("2", "Bob", 1991, 2016), 
    Student("3", "Cathy", 1990, 2015), 
    Student("4", "David", 1991, 2014))

The SQL code is omitted for convenience as it’s not part of the operations we’re interested in.

3. Selection

The first operation is Select. Let’s say we want to find all students with ID less than “3”.



select * from Student where id < "3";

In Scala:

students.filter(s => < "3")

The underlying implementation of this operation may simply traverse the list (in Scala) or read each row of the table (in SQL), checking whether each element satisfies the condition. In a specialised database such as MySQL, optimisations such as indexing (on the ID field) allow more efficient search. However, the high-level abstractions in the two languages are quite similar.

4. Projection

To select a subset of fields in SQL, we just need to specify which column names to keep. Let’s say for the previous query we only want the student name.

select name from Student where id < "3";

In Scala:

students.filter(s => < "3").map(s =>

If you’re new to Scala, it may not be easy to understand the above code. Let’s break it down into two expressions:

val studentsWithSmallIds: List[Student] = students.filter(s => < "3")
val studentNames: List[String] = =>

The first line filters the list based on the condition < "3", checked for each element s of the list. The second line applies the function s =>, which returns for each element s; hence the result is of type List[String].

5. Group By / Aggregations

Another common operation is to group the data based on some fields and perform aggregations, such as counting the number of elements in each group.

Let’s say we want to count the number of students that were born in each year. This can be done easily in SQL:

select birth_year, count(*) as cnt from Student group by birth_year;

In Scala:

scala> val groups: Map[Int, List[Student]] = students.groupBy(s => s.birthYear)
groups: scala.collection.immutable.Map[Int,List[Student]] = Map(1991 -> List(Student(2,Bob,1991,2016), Student(4,David,1991,2014)), 1990 -> List(Student(1,Alice,1990,2015), Student(3,Cathy,1990,2015)))
scala> val countByYear: Map[Int, Int] = { case (birthYear, xs) => (birthYear, xs.length) }
countByYear: scala.collection.immutable.Map[Int,Int] = Map(1991 -> 2, 1990 -> 2)

The first line transforms the list into a map whose key is the field we group by and whose value is the list of Students sharing that key.
Then we apply the function case (birthYear, xs) => (birthYear, xs.length) to each element of the resulting map (groups). The function returns a pair for each entry, and the pairs are implicitly converted back into the map countByYear.

6. Join

The last operation we consider is join. Let’s introduce another type of record that some students loathe but some absolutely love: the GPA record. Each record contains a student ID and a GPA, like so:

case class GPA(id: String, gpa: Float)
val gpas = List(GPA("1", 1.0f), GPA("2", 2.0f), GPA("3", 3.0f), GPA("4", 4.0f))

Join is supported natively in SQL, so if we want to join Student and GPA we can simply write

select t1.*, t2.gpa from Student t1 join GPA t2 on =;

There is no native join operation on Scala collections, so we’ll implement one ourselves. It’s easy to do for this particular example: we can iterate through the Student list and the GPA list, and select the pairs that match on ID:

scala> for (s <- students; g <- gpas; if ( == yield (,, s.birthYear, s.enrolYear, g.gpa)

res55: List[(String, String, Int, Int, Float)] = List((1,Alice,1990,2015,1.0), (2,Bob,1991,2016,2.0), (3,Cathy,1990,2015,3.0), (4,David,1991,2014,4.0))

7. Generic Join in Scala

The code in the previous section for joining students with their GPAs is all well and good, except that it is not general enough. To join two different collections, we would have to repeat the code with modifications to match on the right key. In this section we’ll implement a generic join in Scala as a fun exercise.

First we will implement join for two maps: m1 of type Map[K, List[V1]] and m2 of type Map[K, List[V2]]. Note that the maps share the same key type but can have different value types.

We can define the join function as:

def join[K, V1, V2](m1: Map[K, List[V1]], m2: Map[K, List[V2]]): Map[K, List[(V1, V2)]] =
  for ((k1, v1) <- m1; (k2, v2) <- m2; if k1 == k2)
    yield (k1, v1.flatMap(x => => (x, y))))

A slightly more complicated, but perhaps more functional, way of implementing join (without a for expression) is:

def join[K, V1, V2](m1: Map[K, List[V1]], m2: Map[K, List[V2]]): Map[K, List[(V1, V2)]] =
  m1.flatMap {
    case (k, v1) => m2.get(k) match {
      case Some(v2) => Some(k -> v1.flatMap(x => => (x, y))))
      case None     => None
    }
  }



Now, to join two lists, we just need to specify the key of each element and convert both lists to maps, where each map entry binds a key to all elements of the list with that same key. Coming back to our Student example, the Scala code using the above join would be:

scala> val studentMap = students.groupBy(s =>
scala> val gpaMap = gpas.groupBy(g =>
scala> val studentWithGPA = join(studentMap, gpaMap)
studentWithGPA: scala.collection.immutable.Map[String,List[(Student, GPA)]] = Map(2 -> List((Student(2,Bob,1991,2016),GPA(2,2.0))), 1 -> List((Student(1,Alice,1990,2015),GPA(1,1.0))), 4 -> List((Student(4,David,1991,2014),GPA(4,4.0))), 3 -> List((Student(3,Cathy,1990,2015),GPA(3,3.0))))


We have seen through this post how Scala collections can be used to implement common data transformation operations, much like SQL. Scala also makes it convenient to write a DSL (domain-specific language). This is the case with Scalding, a data ETL framework built on top of Cascading and Hadoop MapReduce. In fact, the main difference between the operations in this post and those in Scalding is that the Scalding implementation targets distributed systems such as a Hadoop cluster.

With data comes responsibility

I’ve been reading quite a lot on the web about how to become a great data scientist. The most comprehensive resource on this question can perhaps be found in a Quora post on the topic. Most of the answers focus on specific theoretical background (e.g. machine learning) or technical know-how (e.g. data munging, feature engineering). I agree that these are all essential skills for day-to-day work, but there are other important qualities that we should acquire.

One example that came to mind today is the need to be aware that “with data comes (big) responsibility”. This has to do with the fact that most analytical work is done independently and in private. This starkly contrasts with software development, where collaboration puts an extra pair (or several pairs) of eyes on every single line of code that goes into production, and rigorous testing or verification can further reduce the number of bugs. Data analysis is a different animal. The goal of analysis is to find meaningful but unknown signals and patterns in the data. Because the signals are unknown, it is hard to verify the findings of analytical work. But since the data scientist is perceived as the person most knowledgeable about the data, his or her conclusions will often be taken as the truth. Those conclusions then become “actionable insights” used to improve the business, e.g. to re-design the interface of a website or an app. At big companies this can have a strong impact on business operations and affect millions of customers. As such, an effective data scientist must become the owner of his or her data, question and verify every hypothesis raised about the data, and think hard about the implications of every conclusion drawn from the analysis.

The worst thing

What is the worst thing? This is perhaps the one question with a single answer shared by everyone: death. That sounds reasonable, yet not everyone agrees with this answer: many people choose death as a release when their lives hold more tragedy than they can bear. In this piece, let us set such cases aside and assume that everyone agrees death is the worst thing that will happen to us.

No one can escape death, from thieves and murderers to mighty kings and billionaires. Perhaps death is the only true equality in this world: the people around us, however strange they may be to us, whether we love them or despise them, will all meet the same end as we do. When we cannot find empathy for a stranger, might the thought that one day they, too, will have to leave this life, just as we will, help us think of them more kindly?