Cohort Analysis in Python with Pandas

Powerful marketing analytics technique with Pandas in just few lines of code.

Published in

Analytics Vidhya

5 min readMar 1, 2021

The Cohort Analysis is a potent marketing practice that is not used very often yet leads to strong conclusions. This technique can be used to assess different metrics of groups of users through time. While it is available in most analytics platforms (Google Analytics, for example), it is not as flexible as using Python to perform it. We will start with the simple introduction of the method and then go to the actual implementation using Python.

What is the Cohort Analysis?

Cohort Analysis is the behavioral analytical technique used to track specific metrics based on the groups of users created by the time of their first action within the program or website (it can be either first appearance on the website, registration, or purchase).

It is useful because if we would just look at the population of users as a whole, it can cause problems due to the heterogeneity. By looking at the behavior of different segments of the user base over time, we can create more reliable and robust conclusions about our business’s efficiency.

Usually, analytics use weekly or monthly cohorts segmentation. However, sometimes marketers can use daily segmentation for liquid businesses. We can calculate values such as retention rate, Customer Life Time value (LTV), or Customer Acquisition Cost (CAC) for each cohort. For even more in-depth analysis, other segmentation metrics beyond time can be used (source of traffic, device, country, etc.).

Now, as we know what Cohort Analysis is, we can go to the actual implementation.

Implementation

Firstly, we need to import data. We will use the data from Yandex open data. It can be accessed from this link (Sorry, this link expired and I was not able to find the data). In the following code cell, we read the data from CSV format, change the type of event_date column to DateTime, and extract the week from it into a new column event_date_week.

Now, we need to identify the registration week for each user_id. To do so, we assume (and we can check if we want) that we have only one registration per user.

From the original dataframe, we bring the rows where registration was performed, and from there, we extract rows with user_id and event_date_week.

As we know the registration week of each user, we can merge it to the original dataframe. To do so, we use python function merge on two dataframes and specify that we merge on user_id as well as we merge by the original dataframe (to put the registration date even if the user appears twice or more in the dataframe). We also find the cohort lifetime at the time of operation (specific row) by subtracting the week of registration from the week of operation.

The next step is to create a pivot table using my favorite pandas pivot_table function. To become more proficient with this powerful tool, check my YouTube video about pivot tables or my Medium article about them. As indices, we use registration_week and cohort_lifetime; as values, we calculate the number of unique user ids for every specific week for every cohort.

Looks great! We have the first iteration of useful information from our analysis. We can see how the number of people changes for each cohort during its lifetime.

However, to calculate the required metrics, we need slightly more effort. For example, we can compute the retention rate and draw a beautiful graph.

Retention Rate

Retention rate is one of the most important metrics to assess when we perform a new advertisement or campaign and introduce a new product or service. It is the measure of customer retention or the percentage of people who remain customers after some time. It is calculated as:

For our cohort analysis, we will divide the number of customers remaining from a specific cohort by the number of all users in this cohort (registered in the same week).

Therefore, we need to find the initial number of people in each cohort. We can bring only the rows, where the cohort lifetime is 0 (cohort is only formed in this period), and extract just the registration date and number of users.

We need to merge it with the cohorts dataframe to get the number of initial customers in each cohort. Then, we basically find the percent of users that remained in each period of time from the initial number of customers in the cohort.

Retention Rate for each cohort during its lifetime

We obtain this beautiful table, and the last step is to make it more appealing by drawing a graph using seaborn. We will use a heatmap to show how numbers change (Please, don’t forget that putting imports in the middle of the code is a bad practice, relocate them at the beginning of your analysis).

What does this graph show? We can see what percentage of initial customers registered at the specific week stayed until some cohort lifetime. For example, we can see that in a month from the registration week 31 (cohort lifetime = 4), only 15.04% of people, who registered during week 31, continued to be active users of the app.

Conclusion

As you could see from this article, it is relatively easy to perform cohort analysis in Python. Also, you can customize the process a lot. You can choose your metrics to look at or identify cohorts with different time frames. I encourage you to try cohort analysis in your company to see the patterns in other groups of customers.

Thanks for reading!

If you have any questions, ask in the comments! I am more than happy to share my knowledge.