In this post we are going to develop an autocomplete component for large data sets using Tries Data Structure and Collaborating Filtering to choose best book titles suggestions to users. It is interesting to notice that both “Algorithm&Data Structure” and “Machine Learning” are working together towards the final solution. Full code and working application are provided together with results.
Problem Formulation
What we want to build from high level perspective is an autocomplete field that when we type some characters it suggests book titles that start with those characters.
- From GUI perspective what is required is a TextField or ComboBox that displays a list of options which are provided by some service like findTitlesThatStartWith(chars [] ch). Fortunately there are already existing GUI components out there in SWING(also java script or jQuery). For this post building GUI Autocomplete Components is not the focus although it may be sometimes be great challenge to build them.
- On the other hand implementing findTitlesThatStartWith(chars [] ch) is of more interest as it gives an opportunity to optimize what is offered to the customers from data perspective. There could a long list starting with particular characters and we can return only a limited number of titles. And that short list need to make sense from a user perspective as much as possible.
There are various options:
-
-
- Sort the list by some criteria(alphabetic order) and return only Top 10(or any number that make sense).
- Keep a counter on how many times words are chosen by users. And only show Top 10 titles with highest counter.
- Show Top 10 most rated titles by users.
- Show Top 10 that are the most interest to the current user preferences.
-
- Once we clarified on high level what service will return is time to explore how it will search for the titles in fairly large collections of titles.
Again there are various options:
- We search all the list/array and for each title we see if starts with those characters or not:
for (String title : allTitles) { if(title.startsWith(charsEnteredByUser)){ options.add(title); } }
If N is the size of the list and k the length of the words we need θ(N*k) time to search. Inserting a new title takes constant time(θ(1)) although adding new films happens fairly rarely.
- Since this is a search problem HashTable may come as option because of their very fast constant time access and insert(θ(1)). Unfortunately HashTables can only look up for entirely word match and not a prefix match(titles that start with…).
- Similarly we may think of Binary Tree(well balanced) as it give us θ(log(N), N size of all titles time complexity in searching and inserting. Again Binary Tree are not helpful because they cannot find a prefix match but rather exact match.
- Fortunately there is already exiting a Data Structure ready for finding prefix matches : Prefix Tree or Tries. The great thing about this data structures is that with a small modification it gives you search time complexity θ(k) where k is the length of the prefix. Yes there is a catch, you may need a lot more storage.
Tries
In this section we will explore how Tries can help with searching for a prefix match in a list of titles(words). Tries are fairly easy to understand once you get how the words are inserted:
So basically we insert word’s characters in a separate nodes when characters are not already existing and re using existing ones. We mark also the end of each word with a special sign so later on we know when a full word is reached.
Lets see how we can search with titles starting with “te”:
When searching we first start from the root and look up on immediate children’s for our first character(t) match. When node matching character is found we treat it as the root so we continue to look up in direct children’s for next character(e) match. This logic continues until there are no more characters left on prefix. If that is the case than all the suggestion list is the sub-tree below our last node match. So we simply traverse all the sub-tree and add words when the end of the word sign is reached.
You may think : Not so fast the complexity there is not θ(k) where k is the length of the prefix! Indeed the complexity is rather θ(k+M) where k is the length of the prefix and M is the size of the suggestion list or the sub – tree under the last node match(immediate children are kept on HashTable so constant time is need to look up for character match). Anyway we need to traverse the sub – tree to collect the suggestion words /titles and therefore if this list results big it can considerably slow down the algorithm. Of course is better than θ(k*N) here k is the length of the prefix and N the size of the all list but still can we do better?
Well we can slightly augment nodes to have store more information than just the character as below:
As we notice by now we store in each node beside the character also the word we are inserting(in practice a reference to the word).Step by step each node will have a list of words that passed on the path.
This modification can greatly help to avoid going down all the sub-tree under the last matched node since the node now already have the list of the words which the sub- tree contains. Lets see below how the search would look now:
The difference with this solution is that when we reached the last node that node already has ready the list of words starting with the prefix. So there is no need for sub-tree traversal therefore the complexity is now θ(k) where k is the length of the prefix.
Final Change
There is a final small trick and the algorithm is ready to be implemented. Titles usually are sentences rather than a single word. It will not be very useful if we search only the beginning of the title because for example a lot of title start with : “The …”(The walking dead) therefore we will miss those suggestion if user search with something more meaningful than “The”.
The solution is easy we just insert each of the words separately in the tree but also save all Sentence of Title to the node suggestion list. In this way we can search with middle words(walking) and in same time be able to suggest all title.
The code is fairly easy(50 lines) so please feel free to have a look Trie and TrieTest.
Recommender System
We have only a limit number of suggestion so when it comes to what suggestions to show to the user I think the best answer is: what is more relevant or more close to user interests. This leads us to 4(was underlined anyway :)) and therefore to Recommender Systems.
A recommender system suggests to users information based on their preference trend on the data. The main advantage of this systems is that it learns automatically as it knows more from users preferences. So basically more the users interact with the system(users likes/clicks particular books, movies) better suggestion(more close to user interest) is the system going to make them. On previous post we explained in details how this is achieved using Collaborative Filtering Algorithm. Also an Application was build for suggesting Movies based on user ratings. In this post we are going to implement same algorithm for suggesting books instead of movies.
Data
Thanks to this source for providing enough data to build a meaniful algorithm:
“Improving Recommendation Lists Through Topic Diversification,
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. To appear.
Download: [ PDF Pre-Print ]
“
The data set is quite big approx. around 271.000 books, 1.1 million of ratings and 279.000 of users..
Application
Application can be downloaded and executed without any knowledge of java beside JAVA has to be installed on your computer.We can run the application from source by simply executing the RUN class or if you do not fill to open it with IDE just run mvn clean install exec:java.
You can try it by rating first some books(!please notice that if books are not rated first no suggestion are made) and than search in the field for autocomplete suggestions.Feel free to play around(50 features do not take much to train) and notice how algorithm adapts in accordance with your preference changes. For 50 genres it may take to 30 seconds maximum to give you suggestion with predicted rating and the error of the algorithm or Mean Squared Error. You can as well increase the ratings data size till 1.149.000but please be aware of the slow down of the training process
Application was build using Swing as GUI and Spark MLib for the Collaboration Filtering Algorithm and after running you the below screen will show up: