In this post we are going to develop a movie recommender java application using Collaborative Filtering Algorithm. Full code and working application are provided together with results.Generally no deep knowledge on the field is required and topics are kept as high level as possible.
Recommender system are probably one of the most widely used applications in Machine Learning. A recommender system suggests to users information based on their preference trend on the data. The main advantage of this systems is that it learns automatically as it knows more from users preferences. So basically more the users interact with the system(users likes/clicks particular books, movies) better suggestion(more close to user interest) is the system going to make them.
So we have a very big list of movies of different genres and our user has rated few of them. Now we want to predict what ratings would this user would have give for movies he haven’t see or rate and suggest the user only top ten highest predicted rates.
Lets suppose that we have how much of a certain genre a movie is(70% action and 30% romance) and the rates(1-5) given by our users. Given users rates and how much of genre a movies is we want to know how much our users like the movies genres or what is the trend of users preferences towards genres.
For example since Bob has given 4 to Transformers and Transformers is Action movie we can say he like Action movies. Similarly Alice has given 5 to Beauty and the Beast(Romance) so she tends to like Romance movies.
The reason we want to know what are user preferences is because it will help us to predict if user will like or not a specific genre of unrated movie.
Rates are from 0-5
Lets suppose now that we have ratings from users per movie and how much users like movie genres. Now we want to find what type of a genre a movie is.
For example we know that Bob like Action movies and has given 4 to Transformers so most probably Transformers is a Action movie. Similarly we know Alice likes Romance movies and she rated Beauty and the Beast 5 so most probably Beauty and the Beast is Romantic movie.
Rates are from 0-5
Mos of the time is hard to figure out features like Action,Romance,Family because there can be more like Sci-Fi, Animated,Adult… and who knows what more features will help us to get better suggestion. In order to figure out the features we can use users ratings and their preferences towards features.
For example lets say we have 4000 users and lets say 10.000 of movies. A feature can be born when 300 users with similar preferences like (4 to 5 stars) a group of 1000 of movies and this feature can be something we could never come up logically(because one or two specific actors play there maybe).
Lets suppose we magically have how much of a Genre(horror,family,action) a specific movie is. What we want to predict is ratings.
More specifically we want to find out how much a certain Genre is contributing to User ratings. In few words we want to know the weights(θ) of each Genre to particular user preferences. The weights can greatly differ from user to user , for example Action Genre can have grater impact on Bob rather than on Alice or Romance Genre can have greater impact on Alice rather than Bob. So for Bob we can have :
and for Alice :
n-> number of examples
k-> number of Genres(features)
θj -> weight for genre j(want to have)
Xji -> amount of genre j of the i-th movie(known)
Ru-> Rating for User u
So once we have weights(θ) it will be straight forward to find user ratings for all movies not yet rated by him, after all now we have user preferences through weights. We need just to apply weights(θ) found by Algorithm and genres types(X) to the simple equation above. So to say if movie is action we will rate high for Bob and low if Romance.
Here the problem is reversed. Lets suppose that magically we have the weights(θ or user preferences) but we are missing the genres or missing description/classification of movies. Still what we want to predict is ratings. Now the role of the weights(θ) is simply replaced by genres. The equation is exactly the same as above but with one difference weights(θ) are not variable but they are known:
n-> number of examples
k-> number of Genres(features)
θj -> weight for genre j(known)
Xji -> amount of genre j of the i-th movie(want to have)
Ru-> Rating for User u
So once we have how much of certain genre(X) a movie is it will be straight forward to find user ratings for all movies not yet rated by him. Now beside user preferences which is given we have also detailed information(genres) about movies. So we can easily say if particular movie is close to user preferences or not. We need just to apply weights(θ) and genres types(X) found by Algorithm to the simple equation above.
Once we understand the problem in isolation is time to figure out how the magic is done. In reality we do not have weights(θ) or genres types(X) so basically we have an equation with two variables. For simplicity lets take simple equations for each of our problems:
1 * θ + 1* X = 5 (Users Preferences)
2 * θ + 3 * X = 5 (Genres)
- We cannot solve this equations just like that so what we do is randomly
choose some value for X like X=2 for the first equation(Users Preferences). Now is much easier we can easily find θ=3. This is a solution that satisfies the equation but is it the best one? Are wights(θ) the best ones describing our user preference? Are this genres X the best describing our movies? We may by lucky but most of the times we will not be. In order for us to change θ and X towards to a better solution we need a Method that magically tells us how fare from best solution we are. So lets say that methods exists and told us that we need to increase θ and decrease X by certain degree like the below:
1.5 * θ + 0.5* X = 5 (Users Preferences)(2)
- It is still the second equation(Genres) to solve. What we do is use there the value of θ we found by previous equation, θ=3. Replacing θ=3 to the equation gives us a new value of X=-1/3. Same again we use the magic method to change θ and X towards to a better solution. Lets suppose we got signal to lower θ and increase X as below:
1.5 * θ + 5 * X = 5 (Genres)(2)
- At this point what we do is simply repeat what we did above but with new equations(2) and new X value -1/3 instead of a random one. So how this goes is both of the problems are collaborating together towards the best solution. Genres are helping to find out user preferences and use preferences are helping to find genres until together they find the best relation between them and the ratings. θ->X->θ->X->θ->….
If you are disappointed because there is still some magic feel free to read next section about this two. Fortunately we have that magic method and is called cost function and we have the way to change θ and X is called gradient descent.
This is the method that magically tells us how fare from the best solution we are. It is very similar to Logistic Regression post , even more simpler. We want the best user preferences looking at the rates and genres and similarly the best genres looking at user preferences and rates so not just some values.
So what we need is to compare how well our current solution is doing. Lets call the current solution hypothesis. Hypothesis are basically what we say at Insight section so our way to predict rates:
Replacing X and θ values gives us rate for particular user.
Once we have the hypothesis rate(Ru) calculated on current X and θ we can simply compare it with the real user rate we have. So to say if hypothesis rate gives us 2 and in reality user rated 5 than this tells us that 5-2=3 units away from wanted value. Similarly if hypothesis rate gives us 5 and in reality user rated 1 than this tells us that 1-5=-4 units away from wanted value. What we want ideally is that our hypothesis is exactly like the real value so the difference will be 0. In a few words we want to minimize the cost function.
In reality cost function calculates the average of the squared difference of our prediction(Ru hypothesis rate) with real data(user rates) like below:
where yi is the real value like rate and h(x) is the hypothesis. We want cost function J to be zero ideally or very small because this tells us that there is no or little difference between hypothesis and real prediction.
Minimize Cost Function
We have our hypothesis which gives the current prediction of our algorithm and cost function which tells us how well performs. What is missing is a way to react after cost function calculates the performance of hypothesis. Reaction is basically changing values of θ and X by decreasing or increasing so cost function will be minimized.
Fortunately there is already build in algorithm to minimize the cost function Gradient Descent. GD is an iterative algorithm which iteration by iteration changes values of θ and X until cost function goes almost 0 or converges. It uses derivative of cost function to decide if to lower or increase θ and X values. Beside the derivative which just giving a direction to lower or to increase the value it also uses a coefficient α to define how much to change the θ and X values.
Changing θ and X values to much(big α) can make Gradient Descent fail optimizing cost function to zero, since a big increase may overcome the real value and also a big decrease may go far from wanted value. While having small change of θ and X(small α) means we are safe, but the algorithm needs a lot of time to go to the minimum value of cost function(almost zero) since we are progressing too slow towards the wanted or real value.
More formally the function looks exactly like Logistic Regression post. Just in our case we have two equations: one for finding user preferences(θ) and one for finding movies genres(X)(Partial derivative changes for user preferences we multiply by X and for genres we multiply by θ) . In reality both equation are merged together to optimize the execution cost but since the equation becomes too messy we would not show it in here, feel free to find more formal view here.
Insight Gradient Descent
Plotting cost function J with θ and X values can help to understand what Gradient Descent is doing. For simplicity we take as example one dimensional data. Looking at the cost function equation :
we can simply the equation with this one Y=X2
Plotting Y with X values from -10 to 10 it will look like below:
- Lets suppose that we come up with some values of X or θ that locates us at point A(replacing X and θ we have to cost function J). What Gradient Descent will do is first do the derivative which is the green line. The green line gives us a direction and that direction is point us to greater values of X or θ. More specifically from -8(A) to -7(A’).
- Lets suppose that we come up with some values of X or θ that locates us at point B. Similar Gradient Descent will do the derivative which gives us the green line pointing to lower value of X or θ. More specifically from 7(B) to 6(B’).
- Notice how regardless of where we are A or B direction is pointing to 0 in the center. Remember that we want our cost function to be zero(hypothesis same as real value) so Gradient Descent is taking us to wanted destination iteration by iteration. In a few words what Gradient Descent is doing is just step by step calculating derivative and changing X or θ until it reaches the zero or very near by.
- The hope -8 to -7 or 7 to 6 is controlled by some coefficient α, meaning that if α will be bigger we may end up at A” or B” which have more drastic change comparing to original point A or B.
- Having bigger value of α as we can see can help to reach zero faster. But we cannot abuse to much as greater values can cause algorithm to not reach zero but rather go far from it. On the other hand having too small values is safe but it slows down algorithm to go to zero. In order to check that Gradient Descent is doing fine we need to check for each step if the value is going to zero or not so if negative we expect an increase and if positive we expect a decrease.
Data used by application where taken from MovieLens. More specifically the small data set used for education and development. Data contains 9125 movies, 671 users and from there 100.004 rating from this users in total.
As we mention in algorithm details we are not using any genres from the data movies but rather let the algorithm figure out genres and rating as two problems helping each other step by step. Said that we still use some basic genres just for making application GUI more friendly. So to say you are asked to rate movies and for simplicity the movies are categorized on some basic genres like below:
* (no genres listed)
But again the algorithm does not use them. For the matter of fact algorithm is using 50 features as minimum and letting you to choose more. More feature we choose better the algorithm is performing but notice that in same time is becoming slower to train.
Application can be downloaded and executed without any knowledge of java beside JAVA has to be installed on your computer. You can try it by your self by rating some movies and than wait for suggestions. Feel free to play around(50 features do not take much to train) and notice how algorithm adapts in accordance with your preference changes.
We can run the application from source by simply executing the RUN class or if you do not fill to open it with IDE just run mvn clean install exec:java. Application was build using Swing as GUI and Spark MLib for the Collaboration Filtering Algorithm.
After running feel free to rate some movies by choosing genres and than hit the suggest movies button and for 50 genres it may take to 30 seconds maximum to give you suggestion with predicted rating and the error of the algorithm or Mean Squared Error.
After that you should be able to see something like this:
Genres Size field is giving the flexibility to choose how much of features(genres) you want the algorithm to know and figure out. More genres better the algorithm performs and predicts ratings but notice that slows down the training time. Below you can find how the error would like after training(0.029, close to 0 better it is, Mean Squared Error):