Python Unsupervised Learning - Building a Book Recommender Engine
Description
Through an invitation from DataCamp I was able to take and review their Python course on unsupervised learning. From this I have created an example of how to build a recommender engine. In this example we will use a list of books from the Gutenberg library and look for the books most similar to Moby Dick.
The first thing we have to do is load all the necessary libraries. This includes numpy, pandas, some functions from sklearn, and of course the Gutenberg library and its associated functions. I am on a Windows machine, so this was quite a difficult process. The Gutenberg library depends on a Berkeley DB connection, which required a wheel install on Windows instead of the usual “pip install”. I had some problems with the metadata connection to the Gutenberg library, so I pulled a random list of ten books. Each book is a single text string, and these strings are the data we will work with. Finally, I created a list of the books and a list of the associated titles.
NMF Engine Method
For this engine we will use Non-negative Matrix Factorization (NMF). NMF is a dimension reduction technique similar to PCA, but its output is much more interpretable. For books, the reduced dimensions correspond to common themes. To use NMF we first have to create a CSR (compressed sparse row) matrix of the word frequencies in each book. We build a TF-IDF matrix in which the rows are the books and the columns are the words used in them. Each entry is the TF-IDF weight of a word in a book: its frequency in that book, scaled down for words that appear in many of the books.
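A minimal sketch of this step, assuming scikit-learn's TfidfVectorizer and a tiny made-up corpus standing in for the Gutenberg texts. The names books, titles, and word_matrix are illustrative and not necessarily the ones used in the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus: in the post, each string would be
# the full text of one Gutenberg book.
titles = ["Whale Book", "Vampire Book", "Ship Book"]
books = [
    "the whale hunted the ship across the sea",
    "the king chased the vampire through the forest",
    "the ship sailed the sea in search of the whale",
]

# Learn the vocabulary and compute TF-IDF weights in one pass.
tfidf = TfidfVectorizer()
word_matrix = tfidf.fit_transform(books)  # sparse matrix in CSR format

# One row per book, one column per distinct word in the corpus.
print(word_matrix.shape)
print(sorted(tfidf.vocabulary_))
```

The result stays sparse, which matters for full-length books: most words do not appear in most books, so storing only the non-zero entries saves a lot of memory.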
Next we fit an NMF model to the word frequency matrix. NMF requires a fixed number of components; in this case I chose 6. In an ideal world, I would be comparing the entire Gutenberg library of 53,000 books. There are 21 subcategories of books, so in that case I would choose 21 components. Once the NMF model is fit to the data we can extract the feature set. The feature matrix has one row per book and one column per component.
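The fitting step might look roughly like the sketch below, assuming scikit-learn's NMF class. The corpus is again a made-up stand-in, and n_components is set to 2 only because the toy corpus has four short documents; the post uses 6 components for its ten books. The init, random_state, and max_iter settings are my own choices for reproducibility, not taken from the original code.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus (the post uses ten full Gutenberg books).
books = [
    "the whale hunted the ship across the sea",
    "the ship sailed the sea in search of the whale",
    "the king chased the vampire through the forest",
    "the vampire escaped the king in the dark forest",
]
word_matrix = TfidfVectorizer().fit_transform(books)

# Factor the book-by-word matrix into book-by-theme and theme-by-word parts.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
nmf_features = nmf.fit_transform(word_matrix)

print(nmf_features.shape)     # (number of books, n_components)
print(nmf.components_.shape)  # (n_components, number of words)
```

Each row of nmf_features describes how strongly one book expresses each theme, while each row of nmf.components_ describes which words make up a theme.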
Recommending Similar Books
Now we have a set of NMF features that represent the themes of the books in our list. Next we need a way to compare the features of the books. This is done using cosine similarity. The basic idea is to treat each book's theme features as a vector and measure the angle between two vectors: the smaller the angle, the higher the cosine, and the more similar the books.
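To make the idea concrete, here is a small sketch of cosine similarity computed by hand with numpy. The function name and the example vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two vectors pointing the same way: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Two vectors with no theme in common: similarity 0.
no_overlap = cosine_similarity([1.0, 0.0], [0.0, 3.0])

print(same_direction, no_overlap)
```

Because NMF features are non-negative, the similarity between two books always falls between 0 and 1. Only the direction of the vectors matters, not their length, so a long book and a short book with the same mix of themes still score as similar.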
First, we normalize the features. Then we create a pandas DataFrame using these normalized features, indexed by the titles of the books. Next we need a book to compare against. By setting article equal to the row for Moby Dick, we pick the book we want recommendations for. Finally, the .dot method on our pandas DataFrame computes the cosine similarity between that row and every book. We can view the 5 books most similar to Moby Dick: the first is obviously Moby Dick itself, but the second closest is Vikram and the Vampire.
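These steps can be sketched as follows, assuming scikit-learn's normalize function and pandas. The feature values below are invented purely so the example runs on its own; they are not the NMF features from the post, and "Book B" through "Book D" are placeholder titles.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import normalize

# Made-up stand-in for the NMF feature matrix (rows: books, columns: themes).
titles = ["Moby Dick", "Book B", "Book C", "Book D"]
nmf_features = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.1, 0.9],
    [0.0, 0.5],
])

# Scale every row to unit length so a dot product equals the cosine.
norm_features = normalize(nmf_features)

# Index the rows by title so a book can be looked up by name.
df = pd.DataFrame(norm_features, index=titles)

# Select the book we want recommendations for.
article = df.loc["Moby Dick"]

# Dot product of every row with the selected row: one cosine per book.
similarities = df.dot(article)

# The highest scores are the closest matches; the book itself comes first.
top = similarities.nlargest(3)
print(top)
```

Normalizing first is what lets a plain dot product stand in for the full cosine formula, since dividing by the vector lengths has already been done.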
Conclusion
So, how good is the engine we created? It says that the book in our list most similar to Moby Dick is Vikram and the Vampire. According to Wikipedia, this book is an ancient Indian tale about a king named Vikram who vows to capture a vampire. Each time he tries to capture the vampire, it escapes. This seems like a decent comparison to Moby Dick: a main character chasing after a mythical creature.
This is just one example where NMF can be used to create a recommender engine. Any array that contains only non-negative values is suitable for NMF. It could also be applied to image recognition or to lists of e-commerce purchases.