Recommendation Systems and How to Create a Simple One with Python

Dec 4, 2018

I used to sell luxury watches in a retail environment. My job was to use my knowledge of customers to convince them to splurge. In retrospect, I was constantly filtering and sorting customer information, then drawing conclusions about how to construct my next sales approach. Today, many companies use algorithms to do the same thing with recommendation systems. It’s fascinating that companies are able to tell you what you want before you know it, without ever seeing your face. In this blog post I’m going to briefly talk about how companies like YouTube and Ant Financial use their recommendation systems to great effect, and then show you a simple movie recommendation system using numpy and pandas.

YouTube has been an exceptional resource for learning and watching anything under the sun, but it’s also a place where you can lose too many hours if you’re not careful. There have been many nights where I went to watch one video and emerged an hour later in a daze, wondering what just happened. Needless to say, YouTube has done a great job of keeping viewers on its site for longer and longer. What recommendation systems has it implemented to keep us watching? Google Brain, its parent company’s artificial intelligence division, turned to unsupervised learning. YouTube uses Google Brain’s machine learning algorithms to find relationships and patterns that software engineers would not have seen, producing more useful recommendations on their own and at an accelerated rate. The patterns they’ve picked up have significantly improved viewership.

In the past, YouTube relied on algorithms that put the videos with the most clicks in front of more viewers. However, with Google Brain’s unsupervised learning algorithms, they realized that the amount of time a user spent watching a video was a much better indicator of video quality. Another change was recommending longer videos on YouTube’s TV app and shorter videos on its mobile app. This helped boost view duration, since people on their phones tend to be on the go while people on the TV app stay in one place. In the same way, ads were shorter on mobile and longer on the TV app.

We should be careful not to look too closely at the most obvious things because I think we can get attached to one way of thinking and not see another pattern. It’s great that machine learning algorithms are able to detect these relationships and help us make impactful changes. Another company that is using recommendation systems is Ant Financial.

Ant Financial is the financial affiliate of Alibaba Group Holding. I found this company very interesting because, while researching it, I realized that I had used its product every single day when I was in China. They’re behind Alipay, a third-party mobile and online payment platform in China. Alipay is so popular that people have stopped using cash. I never needed to bring a wallet out. All you do is put money in your e-wallet and scan the terminal’s QR code, and you’re good to go. Convenience is king.

You can basically buy anything on Alipay’s platform. Shop owners, restaurant owners, online stores, and rice paddy grandmas are all using the app. When over 500 million people are using the same platform, you have a huge amount of transactional data. What do they do with this data to create a very good recommendation system? Traditional recommendation systems rely on collaborative filtering, where the algorithm deduces that since you love pineapples, you might like mangos because someone else who loves pineapples also likes mangos, or on content-based filtering, where it recommends you other tropical fruits. Instead, Ant Financial provides a two-way recommendation system that connects you to the businesses. Michael Jordan, professor at UC Berkeley and a prominent leader in machine learning, goes into more detail about this. Say you’ve just come out of a movie and you’re looking for a nearby restaurant to eat at.

Your app would show some nearby restaurants that you might like and some you’ve been to. On the restaurant’s side of the app, the owners might be holding a wedding and still have 50 empty seats, so they note this in their preferences. Since you’re near this particular restaurant, they can send you discounts and special offers to entice you to eat there tonight. Once those 50 seats are filled, great! Now you’re happy, you’re full, and you saved some money, AND the restaurant filled to capacity. This creates a market, because when that restaurant is full, other restaurants in the area see this and can make special offers to fill their own seats. It’s a win-win for you, the businesses and the economy. Jordan says that this is “democratizing financial services” and “it’s all about providing to the right people at the right time.”

Now for a simple Python recommendation system using numpy and pandas, which returns similar movies based on the correlation between their ratings. First, we show that the number of ratings a movie receives is related to its average rating, since more people tend to watch well-regarded movies. Then we use rating correlation, which is stronger when users rate two movies similarly, to find similar movies. We will be using the MovieLens dataset, which consists of 100,000 ratings of 9,000 movies by 600 users and was last updated 9/2018. This part follows the walkthrough by Usman Malik in his Stack Abuse post (listed under Sources).

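Following the author, we first explore the MovieLens data. We load the dataset’s ratings and movies .csv files with pandas (for example with pd.read_csv) into DataFrames named ‘ratings_data’ and ‘movie_names’. Then we merge the two on their shared movieId column.

import pandas as pd  # ratings_data and movie_names are the two DataFrames described above
movie_data = pd.merge(ratings_data, movie_names, on='movieId')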
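To make the merge step concrete, here is a minimal sketch on made-up data. The three toy movies and their ratings are assumptions for illustration only and are not taken from MovieLens; only the column names mirror the ones used in this post.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (all values are made up).
ratings_data = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating": [4.0, 3.5, 5.0, 2.0],
})
movie_names = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# pd.merge defaults to an inner join on movieId: every rating row gains its title.
# Movie C has no ratings, so it does not appear in the merged result.
movie_data = pd.merge(ratings_data, movie_names, on="movieId")
print(movie_data)
```

One consequence of the inner join is worth noting: a movie with no ratings at all drops out of the merged table, so every title that remains has at least one rating.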

Then we compute the mean rating of each movie and sort the movies from highest to lowest average.

movie_data.groupby('title')['rating'].mean().sort_values(ascending=False).head()

Movies can reach the top of this list with only 1 review, so we should also look at the number of ratings per movie. The author’s main point was that “really good movies get higher ratings because it is rated by a large number of users.”

movie_data.groupby('title')['rating'].count().sort_values(ascending=False).head()

The really well-known movies are now at the top of the list, and it may be safe to say that these are also very good movies. Next, we’ll combine the average rating and the rating count for each movie into one table.

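ratings_mean_count = pd.DataFrame(movie_data.groupby('title')['rating'].mean())  # average rating per title
ratings_mean_count['rating_counts'] = movie_data.groupby('title')['rating'].count()  # number of ratings per title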
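As an alternative sketch, the mean and the count can be computed in a single groupby pass with named aggregation. This is not the approach used in the walkthrough, just an equivalent way to build the same two columns; the toy ratings are made up.

```python
import pandas as pd

# Toy merged table (values are made up for illustration).
movie_data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C", "Movie C"],
    "rating": [4.0, 5.0, 3.0, 5.0, 2.0, 3.0],
})

# Named aggregation: each keyword becomes a column in the result.
ratings_mean_count = movie_data.groupby("title")["rating"].agg(
    rating="mean",
    rating_counts="count",
)
print(ratings_mean_count.sort_values("rating", ascending=False))
```

In this toy table, Movie B has the highest average but only one rating, which is exactly the situation the rating count is meant to expose.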

The ratings_mean_count table now holds the average rating and the number of ratings for each movie title. To see how the number of ratings is distributed across movies, we’ll plot a histogram.

import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
sns.set_style('dark')
# histogram: number of movies (y) that received a given number of ratings (x)
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
plt.title('Movies and Ratings Counts')
plt.ylabel("Number of Movies")
plt.xlabel("Number of Ratings per Movie")
ratings_mean_count['rating_counts'].hist(bins=50)

This shows that most movies have only a handful of ratings and very few have over 100 ratings (you can’t even see them on the graph).

plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
plt.title('Movies and Average Rating')
plt.xlabel("Average Rating per Movie")
plt.ylabel("Number of Movies")
ratings_mean_count['rating'].hist(bins=50)

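This histogram shows how the average ratings are distributed across movies. To see whether movies with more ratings also tend to have higher average ratings, we plot the two quantities against each other.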

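# jointplot is a figure-level function and creates its own figure, so no plt.figure() call is needed
plt.rcParams['patch.force_edgecolor'] = True
# scatter of average rating (x) against number of ratings (y), with marginal histograms
sns.jointplot(x='rating', y='rating_counts', data=ratings_mean_count, alpha=0.4)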

This puts the two graphs together and suggests that better-known movies, which more people watch and rate, also tend to have higher average ratings. Now to find the similarities between the movies.

user_movie_rating = movie_data.pivot_table(index='userId', columns='title', values='rating')

This is a matrix with one row per user and one column per movie, where each cell holds that user’s rating of that movie. Most users have not rated most movies, so the matrix is full of NaN values. We will choose Forrest Gump as our base movie because it has the highest number of ratings, and then find other movies whose ratings are highly correlated with its ratings. We will clean this data by removing the NaN values in the next steps.
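To make the shape of this matrix concrete, here is a small sketch on made-up data with three users and three hypothetical movies. Any user and movie pair without a rating becomes NaN after the pivot.

```python
import pandas as pd

# Toy long-format ratings (values are made up for illustration).
movie_data = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie C", "Movie B"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Rows become users, columns become titles, cells hold the rating.
# pivot_table averages duplicates by default; there are none in this toy data.
user_movie_rating = movie_data.pivot_table(index="userId", columns="title", values="rating")
print(user_movie_rating)

# Number of missing ratings per movie.
print(user_movie_rating.isna().sum())
```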

forrest_gump_ratings = user_movie_rating['Forrest Gump (1994)']

These are the user ratings for the movie Forrest Gump. Using the corrwith() method, we can compute how strongly each other movie’s ratings correlate with Forrest Gump’s ratings, based on the users who rated both.

# correlate every movie's ratings column with Forrest Gump's ratings
movies_like_forrest_gump = user_movie_rating.corrwith(forrest_gump_ratings)
corr_forrest_gump = pd.DataFrame(movies_like_forrest_gump, columns=['Correlation'])
# drop movies whose correlation is undefined (NaN)
corr_forrest_gump.dropna(inplace=True)
corr_forrest_gump.head()
corr_forrest_gump.sort_values('Correlation', ascending=False).head(10)

A lot of these movies are unheard of. Although their correlation is 1.0, a movie with only one or two ratings is not as good an indicator as one with many ratings. So we filter on the number of ratings and keep only the movies with more than 50.

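# attach each movie's rating count so that we can filter on it
corr_forrest_gump = corr_forrest_gump.join(ratings_mean_count['rating_counts'])
corr_forrest_gump[corr_forrest_gump['rating_counts'] > 50].sort_values('Correlation', ascending=False).head(10)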
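To see why the rating-count filter matters, consider this small sketch on made-up data. The movie names and ratings are hypothetical. A movie that shares only two raters with the base movie gets a correlation of 1.0, because a Pearson correlation over two points is always +1 or -1 when both values vary.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix (values are made up for illustration).
user_movie_rating = pd.DataFrame(
    {
        "Base Movie":   [5.0, 4.0, 3.0, 2.0, 1.0],
        "Widely Rated": [4.0, 5.0, 3.0, 1.0, 2.0],
        "Rarely Rated": [np.nan, np.nan, np.nan, 3.0, 2.0],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

base = user_movie_rating["Base Movie"]

# Correlation of each column with the base movie, using only users who rated both.
corr = user_movie_rating.corrwith(base)

# How many users each correlation is actually based on.
overlap = user_movie_rating[base.notna()].count()

print(pd.DataFrame({"Correlation": corr, "shared_raters": overlap}))
```

Here "Rarely Rated" outranks "Widely Rated" even though its score rests on just two users, which is the same effect that pushes obscure titles to the top of the unfiltered list.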

The final results are other big Hollywood movies whose ratings are fairly well correlated with Forrest Gump’s, now that the rating count is taken into account alongside the correlation.
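The steps above can be gathered into one function. This is a sketch under stated assumptions, not the walkthrough’s own code: the helper name similar_movies and its min_ratings and top_n parameters are hypothetical, the toy data is made up, and unlike the steps above it also drops the base movie from its own results.

```python
import pandas as pd

def similar_movies(movie_data, title, min_ratings=50, top_n=10):
    """Hypothetical helper: rank movies by rating correlation with `title`."""
    # Number of ratings per movie.
    counts = movie_data.groupby("title")["rating"].count().rename("rating_counts")
    # User-by-movie rating matrix (NaN where a user has not rated a movie).
    matrix = movie_data.pivot_table(index="userId", columns="title", values="rating")
    # Correlation of every movie with the base movie.
    corr = matrix.corrwith(matrix[title]).rename("Correlation")
    result = pd.concat([corr, counts], axis=1).dropna(subset=["Correlation"])
    # Keep movies with enough ratings and drop the base movie itself.
    result = result[result["rating_counts"] > min_ratings].drop(index=title, errors="ignore")
    return result.sort_values("Correlation", ascending=False).head(top_n)

# Toy data: four users, four movies (values are made up for illustration).
rows = [
    (1, "Movie A", 5.0), (2, "Movie A", 4.0), (3, "Movie A", 2.0), (4, "Movie A", 1.0),
    (1, "Movie B", 4.0), (2, "Movie B", 5.0), (3, "Movie B", 1.0), (4, "Movie B", 2.0),
    (1, "Movie C", 1.0), (2, "Movie C", 2.0), (3, "Movie C", 4.0), (4, "Movie C", 5.0),
    (1, "Movie D", 3.0), (2, "Movie D", 2.0),
]
movie_data = pd.DataFrame(rows, columns=["userId", "title", "rating"])

result = similar_movies(movie_data, "Movie A", min_ratings=2)
print(result)
```

With the real MovieLens table, a call such as similar_movies(movie_data, 'Forrest Gump (1994)') would apply the same threshold of more than 50 ratings used above.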

Sources:

https://www.forbes.com/sites/kristinwestcottgrant/2017/12/10/how-to-think-about-artificial-intelligence-in-the-music-industry/#2938bfc37d4a

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133–151. July 2001.

https://stackabuse.com/creating-a-simple-recommender-system-in-python-using-pandas/

https://www.forbes.com/sites/bernardmarr/2017/08/08/the-amazing-ways-how-google-uses-deep-learning-ai/#2aaa6e713204

https://www.theverge.com/2017/8/30/16222850/youtube-google-brain-algorithm-video-recommendation-personalized-feed

The 'terrifying' moment in 2012 when YouTube changed its entire philosophy. www.businessinsider.com

https://grouplens.org/datasets/movielens/latest/

Written by Chris Chung