16 productivity rules

Rule #1: Establish and follow a system that enforces positive habits (like this set of rules). Big goals are achieved by winning small goals consistently.
Rule #2: Write down all to-dos in an in-bucket list (the GTD system) and review it weekly. Many trivial tasks can be eliminated after the review. It is also a good place to record creative ideas.
Rule #3: Know the purpose by asking: why should I do this? Am I really creating value, or am I just inventing things to do or trying to seem busy? Say no to anything that is not high-value.
Rule #4: Set 3 wins for each day, week, month, and year (The Agile way). Plan every Monday. Review every Friday.
Rule #5: Make time for important things first and fit everything else into the remaining slots, like Stephen Covey’s example of filling a jar with big rocks, then small rocks, sand, and water.
Rule #6: Apply the 80/20 principle whenever possible: roughly 80% of outcomes can be achieved with 20% of the time and effort. 20% of customers generate 80% of the profit, while 20% of customers create 80% of the problems.
Rule #7: No meetings in the morning. Reserve the entire morning for productive work.
Rule #8: Do the most important thing first in the morning. This follows the 20% effort practice and leads to 80% outcome.
Rule #9: Of those tasks befitting rule #8, work on the most difficult task first.
Rule #10: Maintain focus and flow. Do not multitask (context switching is expensive, and the brain can only work on a single demanding task at a time). Block out time for hyper-focus, then reward yourself with a 5-minute break (the Pomodoro technique).
Rule #11: Set deadlines. Limiting the time available to finish a task increases efficiency (Parkinson’s law) and creates positive pressure.
Rule #12: Choose tasks based on context (mental and physical energy, time, and the availability of other resources).
Rule #13: Batch processing – group tasks that can be processed in a batch.
Rule #14: If a task takes less than 2 minutes, do it (when reviewing tasks).
Rule #15: Check email twice a day: once before lunch, once before leaving work.
Rule #16: For effective learning of a new subject, immerse yourself in it. Study all the great works on the topic and apply it to daily life. Focus on a few subjects and master them before moving on to new ones.

Predicting rental listing interest – Kaggle competition

So I recently participated in a Kaggle competition (final ranking 103 / 2488). I had intended to play with the data for a bit and build a prototype / baseline model, but I ended up getting addicted and followed through to the end of the competition. It was such a fun experience that I thought I’d share my learnings with you.

About the competition

The goal of this competition was to predict how popular an apartment rental listing is based on the listing content (text description, photos, #bedrooms, #bathrooms, price, latitude, longitude, etc.). Popularity is categorised into three levels (high, medium, and low), based on the number of inquiries a listing received while it was live on the site. The data comes from renthop.com, an apartment listing website, and the apartments are located in New York City.

Dirty data

One draw of Kaggle competitions is that you can work with real data sets, which are guaranteed to be ‘dirty’. Some data issues for this competition include:

  • Addresses: it seems that when listing a property, managers or owners can enter free text, so there are quite a few variations of the same thing. For example, 42nd 10 East Avenue can be entered as 42nd 10 e av, 42 10 east avenue, 42nd 10 e av., etc. These variations are regular enough that they can be processed with, well, regular expressions (see the sketch after this list). What would have been ideal, though, is a database of streets to choose from so that the data is consistent.
  • Outliers: there are properties with listed rents of over 100k, and properties with latitude and longitude of 0.0.
  • Missing data: there was also quite a lot of missing data, for example building IDs, which uniquely identify buildings (e.g. apartment blocks).
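
Here is the sketch referred to above: a minimal Python example of this kind of regex normalisation. The abbreviation list and the normalise_address helper are illustrative assumptions, not the exact rules used in my solution.

import re

# Illustrative abbreviation map (not the exact rules from my solution).
ABBREVIATIONS = [
    (r"\bst\b", "street"),
    (r"\bav(e)?\b", "avenue"),
    (r"\be\b", "east"),
    (r"\bw\b", "west"),
]

def normalise_address(raw: str) -> str:
    addr = raw.lower()
    addr = re.sub(r"[^\w\s]", " ", addr)        # drop punctuation such as trailing dots
    for pattern, replacement in ABBREVIATIONS:
        addr = re.sub(pattern, replacement, addr)
    return re.sub(r"\s+", " ", addr).strip()    # collapse repeated whitespace

print(normalise_address("42nd 10 E Av."))       # -> 42nd 10 east avenue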

How to deal with outliers and missing data? For this dataset, there were opportunities to impute missing or outlying values based on other features. For example, the missing latitude and longitude of a listing can be inferred from other listings that share the same building id or display address and do have valid coordinates.
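
As a sketch of this kind of imputation (assuming the competition’s column names and pandas; the exact recipe is illustrative rather than the one I used):

import numpy as np
import pandas as pd

# Treat (0.0, 0.0) coordinates as missing, then borrow the median coordinates
# of other listings in the same building.
def impute_coordinates(listings: pd.DataFrame) -> pd.DataFrame:
    df = listings.copy()
    bad = (df["latitude"].abs() < 1e-6) & (df["longitude"].abs() < 1e-6)
    df.loc[bad, ["latitude", "longitude"]] = np.nan
    for col in ["latitude", "longitude"]:
        df[col] = df[col].fillna(df.groupby("building_id")[col].transform("median"))
    return df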

Label leakage

When building supervised models, label leakage should be avoided at all costs. On Kaggle, however, if there is leaky information then it is almost impossible to win a competition without using it. And that is what happened in this competition.

What happened (I think) is due to the way the data was prepared. Each listing comes with a set of images, and it appears that the images for each class / interest level were stored in their own folder. These were then copied for distribution to Kagglers, and because of the disproportionate numbers of instances per class, the images were written at different timestamps (well after the actual listing creation dates). As a result, there is a strong correlation between the image creation timestamp and the interest level.

The right thing to do would have been to prohibit the use of such a feature. Regrettably, Kaggle only encourages publicising leakage when it is found. So most people ended up using this feature in their models, which distorts the performance such a model would actually achieve if put into practice.

What affects listing popularity?

Feature engineering was definitely a fun exercise in this competition, and there was a lot of room for creativity. I approached it by relating to my own experiences when searching for a rental property and coming up with criteria that are important to me or other renters. Of course, I also borrowed impactful features discussed on the forum.

Below are some of the features that contribute heavily to the model. Some are naturally expected while others require some reflection but also make a lot of sense.

  • The property’s features: bedrooms, bathrooms, price
  • Value for money: along with the above features comes value for money, which can be determined in a number of ways. For example, comparing an estimated price for the listing against the actual price, or comparing the price with that of properties in the same building, street, or neighbourhood (a small sketch follows this list).
  • Location is the key: location is encoded in variables such as latitude, longitude, building id, display address, and street address. These can be used directly or in combination with other information, thus creating second-order interactions.
  • Skills of the managers: this came as a surprise at first, as one would expect the desirability of a property to have very little to do with who is managing it. But thinking about it more deeply, good managers can i) acquire quality properties, e.g. in prime locations or in good condition, and ii) know the ingredients required to attract viewers, e.g. setting the right price, along with other potential benefits. So even though the causation is not so strong, the correlation is.
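
Here is the sketch mentioned above: a hypothetical pandas snippet for ‘value for money’ features, comparing each listing’s price with the median price of comparable listings. The column names mirror the competition data, but the rounded lat/long ‘neighbourhood’ key and the ratio features are my own simplifications, not the exact features in my solution.

import pandas as pd

def add_value_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Crude neighbourhood key: listings within ~1km share rounded coordinates.
    df["neighbourhood"] = (
        df["latitude"].round(2).astype(str) + "_" + df["longitude"].round(2).astype(str)
    )
    for key in ["building_id", "neighbourhood"]:
        peer_median = df.groupby([key, "bedrooms"])["price"].transform("median")
        df[f"price_vs_{key}"] = df["price"] / peer_median   # > 1 means pricier than peers
    return df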

The unexpected bits

Interestingly, neither the textual description of a listing nor its accompanying photos contributed much to model performance. I did not have the time or resources to deal with over 80GB of image data, but others reported little performance gain from incorporating this information.

Deployment challenges

A solution to this competition cannot readily be deployed into production, as there are a few things to consider. First, only three months’ worth of data was given, and the training and testing data were randomly split. It would be better to use at least two years of data and out-of-time sampling to account for seasonality effects. Second, we need to check whether all features used in the winning solution are available at the time of making a prediction, which could be when the listing is first created or in real time. Third, similar to the Netflix Prize competition, the winning solutions are based on stacking and comprise quite a lot of models. Depending on how deployment is done, it may not be possible for the engineers / scientists to use all of the models in their system due to complexity or computational constraints.

What wins competitions?

It’s been well publicised that feature engineering wins Kaggle competitions, which is still true. But model stacking or ensembling almost always increases performance if done properly.

  • Feature engineering: big gains in performance always come from good feature engineering.
  • Stacking or ensembling: while feature engineering is crucial to getting a very good model, stacking and ensembling models can deliver the final boost needed to win a competition (a minimal sketch follows this list).
  • Experience: competing on Kaggle is like a sport, and so experience is vital. Expert Kagglers know the best practices, have scripts that can be re-used, and carry prior knowledge from other competitions that gives them an edge.
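
For illustration, below is a minimal stacking sketch using scikit-learn’s StackingClassifier on synthetic three-class data. The winning solutions used far larger, hand-tuned ensembles; this only demonstrates the mechanic of training a meta-model on out-of-fold base-model predictions.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem standing in for the high/medium/low interest levels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Base models are fitted on cross-validation folds; the meta-model (logistic
# regression) learns from their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=3, scoring="neg_log_loss").mean())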

Actions for RentHop

  • Upgrade their listing service to better capture clean data.
  • Re-think how to best leverage their data and incorporate predictive modelling in their business. This competition has demonstrated that it’s possible to predict the level of interest well, but the more important question is what to do with it. Can interest level be pushed up easily through operations? Recall from above that manager skills and property features such as location, price, and the number of bedrooms and bathrooms are the key model attributes. But these are hard-to-move needles for RentHop, as it does not have much control over them. On the other hand, other metrics, such as the time it takes to find and lease a property, would be easier to optimise for.

Overall it has been a thrilling experience which provided many valuable lessons. Thank you Kaggle and fellow Kagglers! You can find my (uncleaned) code at https://github.com/trungngv/kaggle/tree/master/renthop if interested.

Google Analytics in BigQuery, explained in one query

Google Analytics (GA) is a popular suite of analytic tools used by many companies to track customer interactions on their digital channels. Although it offers plenty of built-in capabilities for insights discovery, there are times when you want to deep dive and run your own analyses. This post will help you understand the Google Analytics data that is exported to BigQuery and how to extract the key pieces of information that you need.

Understanding the data structure

  • BigQuery stores the exported data by date, and each day is stored in its own table. For instance, data for 15 March 2017 would be stored in 1300.ga_sessions_20170315, where 1300 is the project id and 20170315 is the date in yyyymmdd format. Data for the current (incomplete) day is stored in an intraday table, e.g. 1300.ga_sessions_intraday_20170315.
  • Each table contains all sessions by users, one row per user session. A session is simply a sequence of pages viewed by the user (or in GA terminology, page hits).

For analytical tasks, we want to be able to identify users and sessions.

Identifying unique users

Users can be divided into two categories: logged in and not logged in (guests), of which only the former can be reliably identified. Logged-in users can be associated with customers if you set and send identifiers programmatically, either via the userId field or some custom dimensions that you define. Guests can be identified via fullVisitorIds, but these are reset if users clear their cookies or use multiple devices. In fact, the mapping between userIds and fullVisitorIds is N-to-N, so they can’t be reliably linked.

Take-away message: set, send, and use userId to uniquely identify customers.

Identifying unique sessions

The documentation from GA recommends using fullVisitorId + visitId to get a globally unique session identifier (within your GA data source). But for logged-in users, we should actually use userId + visitStartTime to identify the sessions of each user, where visitStartTime is the start time of a session. Let me illustrate with a toy example:

visitId | fullVisitorId | userId | visitStartTime
v1      | f1            | u1     | 1000000
v1      | f2            | u1     | 1500000

Here we have one user, u1, who is mapped to two different visitor ids in two different sessions. The visitIds happen to be the same in both sessions, so using userId + visitId we would get only one session when in fact there are two. Using userId with visitStartTime is the right combination, as a user can’t have two sessions that start at exactly the same time. If we want to be 100% certain that sessions are unique, we can use userId + visitId + visitStartTime.
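
To make this concrete, here is a tiny pandas sketch (purely illustrative; not GA export code) that reproduces the toy example and counts sessions under both keys:

import pandas as pd

sessions = pd.DataFrame({
    "visitId":        ["v1", "v1"],
    "fullVisitorId":  ["f1", "f2"],
    "userId":         ["u1", "u1"],
    "visitStartTime": [1000000, 1500000],
})

print(sessions.groupby(["userId", "visitId"]).ngroups)          # 1 -- undercounts
print(sessions.groupby(["userId", "visitStartTime"]).ngroups)   # 2 -- correct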

Example reference query

Now that we know how to identify users and sessions, let me give you one reference query that covers the main concepts you need to know to work with this data.

Note that BigQuery is compliant with the SQL 2011 standard and supports complex types like Arrays and Structs. It also supports legacy SQL, but I encourage the use of the Standard SQL dialect, as in the query given below.

select 
  b.account_number
  , concat(cast(visitId as string), '_', cast(visitStartTime as string)) session_id
  , hits.type as hit_type
  , hits.hitNumber as hit_number
  , concat(hits.page.pagePathLevel1,
      regexp_replace(hits.page.pagePathLevel2, '\\?.*$', '')
    ) as page_level12
  , hits.appInfo.screenName
 from `project_id.ga_sessions_201702*` s, s.customdimensions as custdim,
       s.hits, `project_id.account_numbers` b
 where custdim.index = 1 and 
   custdim.value = cast(b.account_number as string) and
   timestamp_diff(timestamp_seconds(visitStartTime), b.first_online, HOUR) < 24
   and hits.type in ('APPVIEW', 'PAGE');

The query extracts all pages visited by each user on apps and websites within their first day online. Here are the main points:

  • account_number is used in place of userId for logged-in users. It comes from an external data source, for example your customer database table, and is set and sent to GA via the first custom dimension, which we retrieve with the condition custdim.index = 1.
  • The unique session identifier is given by concatenating visitId and visitStartTime, as discussed above.
  • hits is an array of structs containing information about each hit / page view. There are several different hit types, but here we limit them to ‘APPVIEW’ for app interactions and ‘PAGE’ for website interactions.
  • The sessions table is implicitly joined with its hits column (and similarly with customdimensions) to flatten the table, giving one row per hit.
  • hits.hitNumber gives the order of page views within a session.
  • A wildcard is used to select the tables (and hence dates) to query; here we are looking at data from February 2017 only.
  • hits.page.pagePathLevel{1 to 4} gives the web page, and hits.appInfo.screenName gives the app page.
  • timestamp_diff and timestamp_seconds are date/time functions.
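
If you prefer to run such queries from code rather than the web interface, here is a minimal sketch using the official google-cloud-bigquery Python client. The project id and the simplified hits-per-session query are placeholders, not part of the original analysis.

from google.cloud import bigquery

client = bigquery.Client(project="project_id")   # placeholder project id

# Simplified query: count hits per session for February 2017.
query = """
select
  concat(cast(visitId as string), '_', cast(visitStartTime as string)) as session_id,
  count(*) as hit_count
from `project_id.ga_sessions_201702*` s, s.hits
group by session_id
order by hit_count desc
limit 10
"""

for row in client.query(query).result():
    print(row["session_id"], row["hit_count"])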

Side note

The BigQuery web interface is not yet fully fledged. In fact, I found it quite limiting at first, as it does not allow creating tables directly from a query and has only one window for writing queries. But auto-completion (which works for table names, columns, and functions) and pop-up documentation are two absolute killer features. In this respect I like it much better than the boring SQLWorkbench/J client that I’ve been using.

Recommended books for data scientists

In this post I’d like to share some of my recommended books for learning data science and machine learning, both in theory and in practice. Fellow practitioners, let me know your favourite books or any other related resources; I’d be keen to check out some new books and add them to my list.

Theory

These are all foundational textbooks in machine learning. I recommend studying at least one of them in depth, by which I mean formulating the models, deriving and implementing the main inference algorithms, and doing the exercises. The books can be quite technical if you’re new to machine learning, but once you stick through one, you’ll find the others quite accessible.

The Elements of Statistical Learning (ESL), by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman

One of the classics. There’s also an online course and a newer textbook accompanied by R code.

Pattern recognition and machine learning (PRML), by Christopher Bishop

Similar to ESL, this highly regarded book is another must-read.

Machine Learning: A Probabilistic Perspective, by Kevin P. Murphy

If you study PRML thoroughly, you’ll be familiar with most of the content in Murphy’s book. Nevertheless, it is a fun and comprehensive book with a strong focus on a principled, probabilistic approach to modelling. It also comes with Matlab code.

Probabilistic Graphical Models, by Daphne Koller and Nir Friedman

Graphical models provide a framework for the representation, inference, and learning of probabilistic models. This powerful framework offers a unifying view of many ML models which might otherwise appear to be a bunch of disparate methods. There’s also an online course on Coursera.

Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto

Although still a draft, the second edition is well written and motivates the concepts and applications of RL really well.

Neural Networks and Deep Learning, by Michael Nielsen
Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Michael Nielsen’s book is more hands-on and contains some cool interactive content to aid understanding, while Goodfellow et al.’s is more comprehensive. I recommend reading them in the given order.

Gaussian Processes for Machine Learning, by Carl E. Rasmussen and Christopher K. I. Williams

This last book is on Gaussian processes, my PhD research topic. You don’t really need it for practical data science, but it’s still a good reference. The first few chapters present the Bayesian approach to modelling and are worth reading.

Practice

Data science for business, by Foster Provost and Tom Fawcett

This book is accessible to a non-technical audience such as business managers. It also provides some sound principles on how to execute data science projects. Highly recommended.

Applied Predictive Modeling, by Max Kuhn and Kjell Johnson

Written by the author of the popular R package caret, this is a must-read for those practising data science. It contains many practical tricks and plenty of advice.


Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, by Gordon S. Linoff and Michael J. A. Berry

Don’t let the title mislead you: this is a good read on data science techniques in general, not just in the CRM space.

Data preparation for data mining, by Dorian Pyle

Published in 1999 but still very relevant today, this book can serve as a good checklist of things to inspect when preparing data for analysis.

Bandit algorithms for website optimization, by John Myles White

This book presents standard multi-armed bandit algorithms and comes with implementations in several languages.

Practical Data Science with R, by John Mount and Nina Zumel

Not as polished as Kuhn and Johnson’s book, but it has a few neat techniques worth knowing.

A Blended System for Productivity

I first became interested in time management and productivity after reading The 7 Habits of Highly Effective People by Stephen Covey and Getting Things Done (GTD) by David Allen back in college. Since then I have adopted a few simple techniques for managing work, such as doing the most important things first and keeping a daily or weekly to-do list. That was working alright, but recently I decided to design a better system that can be (almost) effortlessly incorporated into my daily routine. After studying most of the best-selling books on the topic, I came up with the following blended system (see the end of the post for a list of references). I implemented it in Evernote, which is accessible from the web, mobile phone, and desktop app. My system consists of five notebooks, explained below:

  1. Foundation – the big picture
    • Note 1 – Who am I? Used to capture the essence of me, for example my principles and values. It serves as the first filtering layer, reminding me to ask if there’s a convincing reason for including something in my to-do list.
    • Note 2 – Hot spots: Used to keep track of active projects pertaining to crucial aspects of life — physical, intellectual, spiritual, financial, social, recreational, etc. To give a high-level overview of my resource allocation, each project is summarised in a single sentence describing its goal.
  2. Backlog / ideas – repertoire of short and long term projects and ideas
    • Note 3 – In-bucket list: Used for brain-dumping any ideas (with almost no filtering), to keep them out of my head so they don’t interfere with whatever I’m working on. This practice is heavily emphasised in GTD. I schedule time to process and empty this list weekly, using the triage recommended in GTD.
    • Notes 4, 5, 6, 7 – Personal development, professional development, asset creation, misc: Used for storing the project ideas that have been processed from the in-bucket list. I find it easier to split these into a few prioritised categories (which may change over time), as a big project idea typically warrants further elaboration and expansion. Processing the bucket list weekly ensures more thorough consideration of which projects or tasks to act on. I often cross out many tasks as I review the list because, by the time I get to them, their true importance has dropped significantly.
  3. Outcomes
    • Note 8 – Monthly challenges
    • Note 9 – Monday visions
    • Note 10 – Friday reflections
    • Daily (optional) – Years ago I used to write down my to-do list, one note per day, but now I just use a pen and a notebook that I can carry around to meetings to take notes more easily.
  4. Projects – the actual execution
  5. References

The principles

The resources

  • The 7 habits of highly effective people, Stephen Covey
  • Getting things done, David Allen
  • Eat that frog, Brian Tracy
  • The 4-hour work week, Tim Ferriss
  • Getting results the agile way, J.D. Meier

There are more books on the topic, but their content is mostly a subset of these.

The power of habits

So last year I decided to build a habit of learning Korean for 15–20 minutes per day. In reality I could only do it on weekdays on the train back home. I think this habit has worked marvellously and proved the power of habit. Through the practice of daily learning in small chunks, I can now write basic Korean sentences much faster, like the letter below that I wrote to my Korean teacher.
선생님, 새 해 복 많이 받으세요!
혹시 이 메시지 받을수 있는지 없는지 궁금해요 ㅎ
선생님 facebook 자주 안 사용하니까
선생님과 가족 모두 잘 지내죠?
나는 한국어 매일 이십분정도 공부해라서 아마도 선생님 한테 더 자주 통화 할수있어요
우리는 시드니에 생활 좋아요.
날씨가 좋고 음식도 베트남 동네들을 가까워서 좋아요
작년에 우리는 집을 샀어요. 나의 사무실 부터 좀 멀지만 괜찮아요. 그리고 이번 4월의 우리는 다음 아이가 기대하고 있어요.
시간이되면 호주 여행 한번 하세요.

(Rough English translation: Teacher, Happy New Year! I’m wondering whether or not you’ll get this message, haha, since you don’t use Facebook often. Are you and your family all doing well? I study Korean for about twenty minutes every day, so maybe I can talk with you more often. We like life in Sydney; the weather is nice and it’s close to Vietnamese neighbourhoods with good food. Last year we bought a house. It’s a bit far from my office, but that’s okay. And this April we are expecting our next child. If you have time, please visit Australia sometime.)

I’m sure there are mistakes somewhere, but blame Google Translate 😀 This technology is so good now that it can tolerate misspellings when doing the translation. I suspect the letter might even have looked better had I just written it in English and sent the translated version, but my teacher will be happier to see the product of my own composition.

P.S. I use Duolingo for lack of a better alternative; it’s now too basic for non-beginners and takes some willpower to get through the lessons.

Finance and Investing in Under an Hour

I just finished watching this video (link at the end of the post) by William Ackman on finance and investing. Despite the exaggerated title, the content is reasonably informative and serves as a good introduction to, as well as a refresher on, the topic. Some of his recommendations are also worth keeping in mind when you’re investing.

I took some screenshots from the video for reference later:

So his keys to successful investing:

  • Invest in public companies
  • Understand how the company makes money
  • Invest at a reasonable price (e.g. avoid paying too much in trading or management fees)
  • Invest in a company that could last forever (e.g. Coca-Cola, McDonald’s)
  • Find a company with limited debt
  • Look for high barriers to entry
  • Invest in a company immune to extrinsic factors
  • Invest in a company with low reinvestment costs
  • Avoid businesses with controlling shareholders