Datamining in C — Building a Recommendation System

The looming question: Why C?
C has become my favorite language, so initially this started as a fun toy project to improve and practice my C before attempting more difficult projects.

Development Speed:
However, while building in C I came across the work of Ben Klemens (http://modelingwithdata.org). Klemens, who also wrote “21st Century C”, argues that statistical systems can and perhaps should be built in C. The argument that C is slow to program in is no longer quite valid, as there are great statistical and mathematical libraries, such as Apophenia, that can be put to use quickly without an extraordinary investment of development time. In fact, the role of a C developer has in some ways become a matter of combining great libraries with some custom work to accomplish a task, as opposed to the early days of C, when writing code was mostly custom work and reinventing the wheel.

Performance Speed:
I became more fascinated by this idea when I ran into articles showcasing C implementations of NumPy/SciPy algorithms in which the C implementation was at least 14x faster than both the SciPy suite and pure Python implementations. C’s spectacular runtime speed and its reasonable development speed could make for a very fast, high-performing recommendation system.

My Implementation:
The time spent on this project allowed me to use some C features I hadn’t used extensively, most of which I had never used at all. Most notable were dynamic programming using macro function expansion and a void function with void parameters that acts on function pointers. “lib.h” contains the generalized code, and the examples and dataset are in the examples folder. The repository provides a dataset of about 100,000 movie ratings from a little over 600 users. The goal of the project is to read a ratings CSV file describing all the ratings in the system and to recommend movies to a user.
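
To give a feel for the two techniques mentioned above, here is a minimal sketch of how macro expansion and function pointers can work together in C. It is not the code from “lib.h”; the names DEFINE_SUM_FN, sum_int, sum_double and for_each are hypothetical and exist only for this illustration. A function-like macro stamps out one typed callback per element type, and a void function taking void pointers applies whichever callback it is handed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: expands into a callback that adds one element of
 * `type` to a running double total passed through `ctx`. */
#define DEFINE_SUM_FN(name, type)                     \
    void name(void *elem, void *ctx)                  \
    {                                                 \
        *(double *)ctx += (double)*(type *)elem;      \
    }

DEFINE_SUM_FN(sum_int, int)       /* generates: void sum_int(void *, void *)    */
DEFINE_SUM_FN(sum_double, double) /* generates: void sum_double(void *, void *) */

/* Generic visitor: walks `count` elements of `elem_size` bytes starting at
 * `base` and calls `fn` on each one. It knows nothing about the element
 * type; the callback does the casting. */
void for_each(void *base, size_t count, size_t elem_size,
              void (*fn)(void *elem, void *ctx), void *ctx)
{
    assert(base != NULL && fn != NULL);
    char *bytes = base; /* char * allows byte-wise pointer arithmetic */
    for (size_t i = 0; i < count; i++)
        fn(bytes + i * elem_size, ctx);
}
```

Calling for_each(ratings, 3, sizeof ratings[0], sum_int, &total) on an int array accumulates its sum into total; swapping in sum_double handles a double array with the same loop.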

Here’s a sample version implementing Collaborative Filtering using the Euclidean Distance:
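
What follows is a minimal sketch of the idea rather than the repository's code. It assumes the ratings have already been loaded into a row-major users-by-movies array in which 0 means "not rated"; that layout and the function names squared_distance, nearest_neighbor and recommend are assumptions made for this illustration. It compares squared Euclidean distances, which rank neighbours exactly as the distance itself does, so the square root can be skipped.

```c
#include <assert.h>
#include <stddef.h>

#define UNRATED 0.0 /* assumption: a rating of 0 means "not rated" */

/* Squared Euclidean distance between two users over the movies both have
 * rated. Returns -1.0 when they share no rated movie. */
double squared_distance(const double *a, const double *b, size_t n_movies)
{
    assert(a != NULL && b != NULL);
    double sum_sq = 0.0;
    size_t shared = 0;
    for (size_t m = 0; m < n_movies; m++) {
        if (a[m] == UNRATED || b[m] == UNRATED)
            continue; /* only co-rated movies count */
        double diff = a[m] - b[m];
        sum_sq += diff * diff;
        shared++;
    }
    return shared > 0 ? sum_sq : -1.0;
}

/* Index of the user closest to `target`, or -1 if nobody shares a movie.
 * `ratings` is a row-major n_users x n_movies matrix. */
int nearest_neighbor(const double *ratings, size_t n_users,
                     size_t n_movies, size_t target)
{
    int best = -1;
    double best_dist = 0.0;
    for (size_t u = 0; u < n_users; u++) {
        if (u == target)
            continue;
        double d = squared_distance(ratings + u * n_movies,
                                    ratings + target * n_movies, n_movies);
        if (d < 0.0)
            continue; /* no overlap with the target user */
        if (best < 0 || d < best_dist) {
            best_dist = d;
            best = (int)u;
        }
    }
    return best;
}

/* Recommend the movie the nearest neighbour rated highest among those the
 * target has not rated. Returns the movie index, or -1 if there is none. */
int recommend(const double *ratings, size_t n_users,
              size_t n_movies, size_t target)
{
    int nn = nearest_neighbor(ratings, n_users, n_movies, target);
    if (nn < 0)
        return -1;
    const double *mine = ratings + target * n_movies;
    const double *theirs = ratings + (size_t)nn * n_movies;
    int best = -1;
    for (size_t m = 0; m < n_movies; m++) {
        if (mine[m] != UNRATED || theirs[m] == UNRATED)
            continue; /* skip movies already seen or unrated by neighbour */
        if (best < 0 || theirs[m] > theirs[best])
            best = (int)m;
    }
    return best;
}
```

A single nearest neighbour keeps the sketch short. A fuller version would likely weight ratings from several neighbours and account for how many movies two users actually share.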

Goals:
To add more effective ranking algorithms piecemeal as the weeks go by.
My progress can be followed at http://github.com/sabzo/datamining.
