For this project, the first of my full-scale endeavors, I attempted to build a linear regression predictor for movie ratings using data queried from The Movie Database’s (TMDb) API. After some data cleaning and EDA, I used scikit-learn’s linear regression estimator to build a model to predict ratings. I then used 4-fold cross-validation and a stochastic algorithm that randomly chose feature subsets and tested them against each other in order to minimize the root-mean-squared error. (Because this search requires many iterations to complete, I used the ray module to parallelize the feature selection process.)
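The search above can be sketched roughly as follows. This is a minimal illustration with made-up stand-in data, not the project's actual code: random feature subsets are scored by 4-fold cross-validated RMSE, keeping the best one. (In the real project, each `cv_rmse` call could be dispatched as a ray remote task to run subsets in parallel.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# stand-in for the cleaned TMDb feature matrix and ratings
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=200)

def cv_rmse(cols):
    """4-fold cross-validated RMSE for a given feature subset."""
    scores = cross_val_score(LinearRegression(), X[:, cols], y,
                             cv=4, scoring="neg_root_mean_squared_error")
    return -scores.mean()

best_cols, best_rmse = None, np.inf
for _ in range(30):  # stochastic search over random feature subsets
    k = int(rng.integers(1, X.shape[1] + 1))
    cols = sorted(rng.choice(X.shape[1], size=k, replace=False).tolist())
    rmse = cv_rmse(cols)
    if rmse < best_rmse:
        best_cols, best_rmse = cols, rmse
```

A grid over all subsets is exponential in the number of features, which is why random sampling (plus parallel evaluation) is an attractive shortcut here.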

Because the RMSE of this initial model was high (around 2 on a 10-point scale), I turned to the presence of certain words and word roots in synopses. I built an IPython widget that plotted the overlaid distributions of ratings for movies whose synopses did and did not contain a specific string, which I used to identify words that would make good features.
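The core of that comparison can be sketched like this, with hypothetical stand-in synopses and ratings (the real widget would wrap a function like `split_by_keyword` in something like `ipywidgets.interact` with a text box, and overlay the two histograms):

```python
import numpy as np

# hypothetical stand-in data, not actual TMDb records
synopses = ["a haunted house terrifies a family",
            "a detective hunts a serial killer",
            "two friends open a haunted hotel",
            "a romance blooms in paris"]
ratings = np.array([6.1, 7.4, 5.2, 6.8])

def split_by_keyword(keyword, synopses, ratings):
    """Return ratings of movies whose synopses do / do not contain keyword."""
    mask = np.array([keyword in s for s in synopses])
    return ratings[mask], ratings[~mask]

with_kw, without_kw = split_by_keyword("haunted", synopses, ratings)
```

If the two rating distributions differ noticeably, the keyword's presence is a promising binary feature for the regression.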

Using a similar stochastic algorithm, I reran the predictor with these word features, again in parallel, and got a slightly lower RMSE than before, but it remained too high to be useful.

Finally, in a last attempt to improve the accuracy of the predictor, I employed L2 regularization to penalize large model weights. I then used a gradient descent algorithm to find the optimal value of the hyperparameter *α*. This had no appreciable impact on the RMSE.
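A rough sketch of that tuning step, again on made-up data: scikit-learn's `Ridge` implements the L2 penalty, and *α* is nudged along a finite-difference estimate of the gradient of the cross-validated RMSE (one simple way to realize the gradient descent described above; the project's actual update rule may differ).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# stand-in feature matrix and ratings
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=1.0, size=150)

def cv_rmse(alpha):
    """4-fold cross-validated RMSE of ridge regression at a given alpha."""
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=4,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

alpha, lr, eps = 1.0, 0.5, 1e-2
for _ in range(25):
    # central-difference estimate of d(RMSE)/d(alpha)
    grad = (cv_rmse(alpha + eps) - cv_rmse(alpha - eps)) / (2 * eps)
    alpha = max(1e-6, alpha - lr * grad)  # keep alpha positive
```

When the unregularized model is not badly overfitting, the optimal *α* is small and regularization barely moves the RMSE, which is consistent with what I observed.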

This project was great for me, as it was the first time I worked all the way through the data science life cycle, from obtaining the data to testing models. While I couldn’t build a very accurate predictor, it is important that I share the results for the sake of reproducibility; too often, people publish only when they have some “significant” result, when they should publish everything, even if the result is inconclusive. Sharing results is an important part of data science, and if we want science to evolve beyond the problems it experiences today, we all need to share knowledge.