Drupal RecommenderAPI: Roadmap for 2012-2013 and beyond
I've been working on RecommenderAPI, a general purpose recommendation engine for Drupal, for a few years now. In the meantime, I'm doing my graduate work in recommender systems, social computing, and machine learning. In this article, I'd like to discuss what to look forward to in the next major release of RecommenderAPI.
A brief history
The RecommenderAPI module (release 6.x-2.x) started as a Google Summer of Code 2009 project. It was written entirely in PHP, and was simple and easy to use. However, it soon ran into a performance bottleneck: PHP is not really designed for complex computation that uses lots of CPU/RAM, and such computation shouldn't run on the Drupal production server anyway.
Therefore, in a GSoC 2011 project, I completely rewrote RecommenderAPI in Java (release 7.x-4.x and the D6 backport), using Apache Mahout as the underlying computational library. This solves the performance problem: the engine uses Java multi-threading, can allocate a large amount of memory with the "java -Xmx" option, and can run on a separate machine or potentially on a Hadoop cluster. I have also started a cloud service (currently in alpha testing) for people to try out RecAPI for free.
However, the current version of RecommenderAPI still has some limitations, which will be addressed in the next major release.
Plans for the next major release of RecommenderAPI
Step 1: Code refactoring, re-branding the dependent module, and code-readability improvements.
RecommenderAPI uses the "async command" dependent module to facilitate integration with Drupal. The "async command" module has been misleadingly branded as an asynchronous work queue module. In fact, its main purpose is to provide code building blocks that help developers build 3rd-party applications (especially big data analysis programs written in Java/Python/R/etc.) on top of Drupal. I plan to rename it to the "Drupal Computing" module to make that point clear, and to make other refactoring changes (renaming classes, etc.) to improve code readability.
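To make the "building blocks" idea concrete, here is a minimal sketch (in Python, against an in-memory SQLite table) of the command-queue pattern between Drupal and an external program. All table and field names here are hypothetical illustrations, not the module's actual schema:

```python
import json
import sqlite3

# Simulated command queue between Drupal and an external program.
# The table and field names are illustrative only -- they are NOT
# the actual schema used by the "async command" module.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE command_queue (
    id INTEGER PRIMARY KEY,
    command TEXT,   -- e.g. 'RunRecommender'
    input TEXT,     -- JSON-encoded parameters written by Drupal
    status TEXT,    -- 'pending' -> 'done'
    output TEXT)""")

# Drupal side: enqueue a command for the external program to execute.
db.execute(
    "INSERT INTO command_queue (command, input, status) VALUES (?, ?, 'pending')",
    ("RunRecommender", json.dumps({"algorithm": "item-item"})))

# External side (a Java/Python/R program): poll for pending commands,
# do the heavy computation off the web server, and write results back
# for Drupal to pick up later.
def poll_and_run(conn):
    row = conn.execute(
        "SELECT id, command, input FROM command_queue WHERE status = 'pending'"
    ).fetchone()
    if row is None:
        return None  # nothing to do
    cmd_id, command, params = row[0], row[1], json.loads(row[2])
    result = {"ran": command, "algorithm": params["algorithm"]}  # placeholder work
    conn.execute(
        "UPDATE command_queue SET status = 'done', output = ? WHERE id = ?",
        (json.dumps(result), cmd_id))
    return result

result = poll_and_run(db)
```

The key design point is that the web server and the analysis program never call each other directly; they only exchange records through shared storage, so the heavy computation can live on a different machine.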
Step 2: Search API integration, and/or other ways to push data to external destination.
The current version of RecommenderAPI integrates with Drupal through direct database access. Many people have raised concerns about this approach, and suggested that the right way is for Drupal to "push" data to an external recommendation engine. "Search API" is the only approach I'm aware of that "pushes" data: the Feeds module only imports data, and the Services module allows other programs to "pull" data, but it doesn't push data to other programs. I wrote a hack that uploads data to the recommender cloud server through HTTP, but it's not done in a systematic way. In the next major release of RecAPI, I'll explore different ways (including Search API) to push data from Drupal to other external programs.
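As a rough illustration of what "pushing" means here, the sketch below serializes a node into a JSON payload that could be POSTed to an external engine. The field names are illustrative assumptions, not the actual Search API item structure or my HTTP hack:

```python
import json

def node_to_payload(node):
    """Flatten a Drupal node (represented as a plain dict here) into the
    JSON document we would push to an external recommendation engine.
    Field names are illustrative only."""
    return json.dumps({
        "id": node["nid"],
        "type": node["type"],
        "title": node["title"],
        "body": node.get("body", ""),
    })

payload = node_to_payload({"nid": 42, "type": "article", "title": "Hello"})
# An actual push would be a single HTTP POST of `payload` to the engine's
# endpoint (e.g. with urllib.request); omitted here.
```

The difference from the current design is the direction of data flow: the site decides when to send updates, rather than the engine reaching into the site's database.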
Step 3: Content-boosted algorithm, and other state-of-the-art recommendation algorithms.
As of now, the RecAPI module and Apache Mahout only support collaborative filtering algorithms. In other words, they only take into account user-item preference information and make recommendations accordingly; they don't use any textual content features. This is a serious limitation for many content-rich or knowledge-based Drupal sites. Also, neither RecAPI nor Apache Mahout supports the "ensemble method", the kind of algorithm that won the Netflix Prize competition. In the next release of RecAPI, I'll implement both the content-boosted algorithm and the ensemble method, and contribute them either as a patch to Apache Mahout or directly to RecAPI itself.
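For readers unfamiliar with collaborative filtering, here is a minimal, self-contained sketch of the item-item flavor in Python (toy data; the real engine delegates this to Apache Mahout). Note that it scores an unseen item purely from the user-item preference matrix, with no content features, which is exactly the limitation described above:

```python
from math import sqrt

# ratings[user][item] = preference value (e.g. 1-5 stars); toy data
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 2, "c": 5},
    "carol": {"b": 4, "c": 3, "d": 5},
}

def item_vector(item):
    # The item's "column": user -> rating, over users who rated it.
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(i, j):
    # Cosine similarity between two items' rating vectors.
    vi, vj = item_vector(i), item_vector(j)
    common = set(vi) & set(vj)
    if not common:
        return 0.0
    dot = sum(vi[u] * vj[u] for u in common)
    norm_i = sqrt(sum(v * v for v in vi.values()))
    norm_j = sqrt(sum(v * v for v in vj.values()))
    return dot / (norm_i * norm_j)

def predict(user, item):
    # Weighted average of the user's own ratings, weighted by how similar
    # each rated item is to the target item.
    seen = ratings[user]
    num = sum(cosine(item, j) * r for j, r in seen.items())
    den = sum(abs(cosine(item, j)) for j in seen)
    return num / den if den else 0.0

score = predict("alice", "d")  # alice hasn't rated "d" yet
```

A content-boosted variant would additionally compare the items' textual features (body text, tags, etc.) instead of relying on rating overlap alone, and an ensemble would blend the predictions of several such models.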
Further reading on recommendation algorithms:
- README file in RecommenderAPI
- Adomavicius, G., & Tuzhilin, A. (2005). Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734-749.
- Koren, Y., Bell, R. M., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5197422
- Bell, R. M., & Koren, Y. (2007). The BellKor solution to the Netflix Prize. KorBell team's report to Netflix.
Step 4: Hadoop cluster support, and other performance improvements.
RecAPI supports multi-threading, but that is still quite limited: I tested the RecAPI engine on a 6-CPU server, computing recommendations for a dataset of 10 million user-item ratings (72,000 users, 10,000 items), and a full run took 11 days and 20 hours. Computation at this scale may require hundreds of CPUs running simultaneously in a Hadoop cluster. Apache Mahout supports Hadoop, but that support still needs to be integrated with RecAPI.
There could be other performance improvements too: 1) trading recommendation quality for performance gains; 2) supporting incremental updates in addition to full runs (currently only the item-item algorithm supports incremental updates). I'll consider all these performance issues in the next major release.
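Item-item lends itself to incremental updates because its core statistic, an item-item co-occurrence (or similarity) matrix, can be adjusted for one new event without recomputing everything. A toy sketch of that idea (co-occurrence counts only; this is not the actual RecAPI implementation):

```python
from collections import defaultdict

# cooccur[(i, j)] = number of users who interacted with both items i and j
cooccur = defaultdict(int)
user_items = defaultdict(set)  # items each user has interacted with so far

def add_event(user, item):
    """Incrementally fold one new user-item event into the model:
    only the pairs involving `item` for this user need updating,
    instead of rebuilding the whole matrix from scratch."""
    for other in user_items[user]:
        cooccur[(item, other)] += 1
        cooccur[(other, item)] += 1
    user_items[user].add(item)

# A full run would rebuild `cooccur` over the entire history; incrementally
# we just replay events as they arrive.
add_event("alice", "a")
add_event("alice", "b")   # touches only the (a, b)/(b, a) counts
add_event("bob", "b")
add_event("bob", "a")     # touches only the (a, b)/(b, a) counts again
```

User-based and model-based algorithms are harder to update this way, because one new rating can shift many users' neighborhoods or the whole factorization, which is why full runs remain necessary there.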
Step 5: Better integration with web analytics.
In order to make good recommendations, we need to track site users' preference behaviors, such as their browsing history, ratings, or purchases. Currently, RecAPI reads this data directly from the Drupal database, but a more efficient, and perhaps more powerful, approach is to track the data with web analytics and have RecAPI read the analytics data. In the next major release, I'll explore how to integrate RecAPI with the Piwik module, the Google Analytics module, or both.
Step 6: Full cloud service support.
The RecommenderAPI cloud service is running at http://recommenderapi.com in its alpha testing phase. Together with the next release of RecAPI, I hope to provide a fully supported cloud service as well. It'll have all the features mentioned above and will follow a freemium model.
Step 7: Better integration with the Drupal Commerce module and Acquia Commons distribution.
RecAPI is designed to be a general purpose API framework that takes care of complex recommendation computation, and it relies on helper modules to apply its computational power to real-world scenarios. There are multiple ways to optimize RecAPI for eCommerce sites with Drupal Commerce, or social network sites with Acquia Commons. For example, we can combine product features, customers' browsing history, purchases, ratings, comments, and demographic information together in order to make the best product recommendations. I'll spend some time optimizing these helper modules too.
Timeline
I'm very excited about these planned items, but unfortunately I can't work on them right now. Most of my time nowadays is spent on my PhD dissertation, which is also exciting.
I expect to graduate in December 2012 or January 2013. Before graduation, I'll find some spare time to maintain the current version of RecAPI (fixing bugs and issues, etc.), and hopefully work on Step 1, Step 2, and part of Step 3.
After graduation, I'll work full time on all these items as a startup company. It'll take 6-9 months of full-time work to implement these features, so I'm quite optimistic about delivering them by the end of 2013. In the meantime, please stay with me and have a little patience.
Vision
It is my personal commitment to build a state-of-the-art recommendation engine for Drupal, so that any Drupal website can use the power of a recommendation engine out of the box.
A few last words about the benefits of using RecommenderAPI:
- It integrates with Drupal seamlessly.
- It's open source: you can use it for free, and customize it if needed.
- It's backed by top talent in the field (such as my advisor Paul Resnick), so you can be sure the recommendation quality is good.
- We'll offer cloud service and premium support for those who don't want to directly deal with technical details.
- We contribute to Drupal and we want to grow with the Drupal community: support us and support Drupal!