UPDATE 9/13/15: The author replied to this article in the comments and clarified in the source article the nature of the changes, and also gave more explicit credit to the original code authors.
Over the summer, I took a couple of online classes on edX: CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph, and the follow-up course CS190.1x Scalable Machine Learning by Ameet Talwalkar. Both courses focused on Apache Spark, an up-and-coming technology in the data science world which promises incredible analytical performance. I strongly recommend taking both courses if you have an interest in data science, although both require strong knowledge of Python.
A Hacker News submission linked to an article titled “Building a Movie Recommendation Service with Apache Spark and Flask - Part 1” on Codementor, which was relevant to my interests. However, while reading the article, I felt a sense of déjà vu.
As it turns out, the concept and the majority of the code for the recommendation service are taken directly from CS100.1x Assignment #4, without any explicit attribution to the professors.
It is unusual to see someone not only plagiarize code, but do so both very blatantly and very poorly.
The June 27th 2015 edX assignment uses Spark’s MLlib for parallelized machine learning on the MovieLens dataset of millions of ratings of movies by users.
The July 7th 2015 Codementor article, in fairness, has nonplagiarized code to parse this data into RDDs (resilient distributed datasets), which is Spark’s data structure that enables high scalability and parallelization. That’s because the edX assignment has the data with a slightly different schema (comma-delimited vs. colon-delimited data).
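To make the schema difference concrete, here is a minimal sketch of a parser that handles either delimiter. It is not code from the assignment or the article: the function name parse_rating, the assumed field order (user ID, movie ID, rating, timestamp), and the two sample lines are illustrative assumptions.

```python
def parse_rating(line, delimiter):
    """Turn one raw ratings line into a (user_id, movie_id, rating) tuple."""
    fields = line.strip().split(delimiter)
    # Assumed layout: user ID, movie ID, rating, then a timestamp we drop.
    user_id, movie_id, rating = fields[0], fields[1], fields[2]
    return int(user_id), int(movie_id), float(rating)


# Invented sample lines, one per delimiter style.
colon_line = "1::1193::5::978300760"
comma_line = "1,1193,5.0,978300760"

colon_parsed = parse_rating(colon_line, "::")
comma_parsed = parse_rating(comma_line, ",")
```

In Spark, a function like this would typically be passed to map on an RDD of text lines, so only the delimiter argument needs to change between the two schemas.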
The collaborative filtering algorithm using Alternating Least Squares (ALS) is what caused me to raise an eyebrow. As with any good statistical methodology, the data should be split into independent training, test, and validation datasets, where the training set is used to construct the model, the validation set is used to select the best model when model parameters vary, and the test dataset is used for the actual statistical analysis. The edX assignment explains this well:
The [6, 2, 2] parameter says that 60% of the data is split for training, 20% for validation, and 20% for testing. The seed=0L ensures that the results are the same each time; this is important for reproducible results in the case of the assignment, but not good for general statistical analysis.
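As a rough illustration of what a weighted, seeded split does, here is a plain-Python sketch. It is not Spark's implementation and not the assignment's code; the helper name random_split and the stand-in data are assumptions. The weights are normalized, so [6, 2, 2] behaves like [0.6, 0.2, 0.2].

```python
import random


def random_split(items, weights, seed):
    """Split items into len(weights) lists, sized roughly in proportion to weights."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [6, 2, 2] -> roughly [0.6, 0.8, 1.0].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)  # fixed seed -> identical split on every run
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        # Place the item in the first bucket whose upper bound exceeds the draw.
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(item)
                break
    return splits


ratings = list(range(10000))  # stand-in for rating records
training, validation, test = random_split(ratings, [6, 2, 2], seed=0)
```

Because each record is assigned by an independent random draw, the split sizes are only approximately 60/20/20, and a different seed produces a different partition of the same data.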
So what does the Codementor article say?
Not much reasoning behind the split, but an interesting choice of split percentages and seed there. What’s curious, though, is the variable names: the edX assignment names its variables in CamelCase, while the Codementor article uses underscore separators.
One of the recommended ways to select a predictive model is to optimize the parameters of the predictive function. The edX assignment changes the rank (number of latent factors) of the Alternating Least Squares model while keeping the other parameters constant:
The Codementor article has a similar approach.
To find the “best” model, the edX assignment runs the model and computes the Root-Mean-Square Error, a popular metric for assessing model quality, between the predictive results of the model and the actual values for each rank, and selects the rank which results in the model with the lowest RMSE.
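For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. The sketch below is a plain-Python stand-in, not the assignment's computeError function: the name compute_rmse, the dict-based inputs keyed by (user ID, movie ID), and the sample ratings are all assumptions for illustration.

```python
import math


def compute_rmse(predicted, actual):
    """RMSE over the (user_id, movie_id) keys present in both dicts."""
    shared_keys = predicted.keys() & actual.keys()
    if not shared_keys:
        raise ValueError("no overlapping (user, movie) pairs to compare")
    squared_errors = [(predicted[k] - actual[k]) ** 2 for k in shared_keys]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented ratings; the (2, 30) prediction has no actual value and is ignored.
actual = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 3.0, (2, 10): 4.0, (2, 30): 2.0}
rmse = compute_rmse(predicted, actual)
```

Restricting the calculation to shared keys plays the same role as joining the predicted and actual ratings on the (user, movie) pair before comparing them.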
The Codementor approach is similar, except that since the computeError function was defined earlier in the edX assignment, the Codementor article has to reimplement it inline.
We can rule out “coincidence” at this point. What’s concerning is that these workarounds make the code far less legible and more difficult to parse. (The results in both are the same, of course: the Rank 4 model had an RMSE of 0.8927, the Rank 8 model had an RMSE of 0.8901, and the Rank 12 model had an RMSE of 0.8902, making the Rank 8 model the winner.)
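Using only the three RMSE figures reported above, the selection step can be sketched in a few lines. The variable names are assumptions, and this is not the loop from either source.

```python
# RMSE per rank, as reported in the text above.
rmse_by_rank = {4: 0.8927, 8: 0.8901, 12: 0.8902}

# Pick the rank whose model has the smallest error.
best_rank = min(rmse_by_rank, key=rmse_by_rank.get)
best_rmse = rmse_by_rank[best_rank]

# How far the next-best rank trails the winner.
runner_up_gap = min(
    error for rank, error in rmse_by_rank.items() if rank != best_rank
) - best_rmse
```

Note how small the margin between rank 8 and rank 12 is with these figures.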
The Codementor article repeats the predictive assessment with the full, 21x larger dataset and gets an RMSE of 0.8218 with the Rank 8 model, which in fairness is new material. (However, the improvement behind the conclusion, “we can see how we got a more accurate recommender when using a much larger dataset,” is not necessarily caused by the larger dataset, due to the nature of random splits; that’s what cross-validation is for.)
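For readers unfamiliar with the term, k-fold cross-validation holds out each slice of the data in turn and averages the error across the runs, so the result depends less on any single random split. The following is a minimal index-generating sketch in plain Python; the helper name k_fold_indices and its parameters are assumptions, and neither source uses it.

```python
import random


def k_fold_indices(n_items, k, seed):
    """Yield (train_indices, holdout_indices) pairs, one per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal slices
    for held_out in range(k):
        # Training set is every fold except the one being held out.
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


folds = list(k_fold_indices(n_items=10, k=5, seed=0))
```

Each record ends up in the holdout set exactly once, and a model would be trained and scored once per fold before averaging the scores.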
The edX assignment has a fun section where the user can test the predictive model they created by giving their own ratings for movies, in the format of (myUserID, movie ID, your rating). For example, here’s what I put:
I may have felt like being lazy that day.
The Codementor article has an interesting approach.
i.e. the same approach. But instead of “myUser”, it’s “new_user”, and the “0” id is hard-coded in each rating. And more underscore variables? This use of “code paraphrasing” is a new concept to me.
The edX assignment ended by showing your top movies based on your recommendations and your model! (The code for the ratingsWithNamesRDD object is mine.)
Of course, the recommended movies are limited to those with more than 75 reviews, in order to prevent movies which have only a few ratings from skewing the results.
The same thing…except, for some reason, the filter is limited to movies with at least 25 reviews instead of more than 75, the top 25 movies are output instead of the top 20 (the first parameter in takeOrdered), and the initial title string was changed to the third-person voice? (Notably, the “more than 25” text is incorrect, since it’s an “at least” statement.)
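The difference between “more than” and “at least” is easy to see in a small sketch. This is plain Python with invented records, not the Spark code from either source: heapq.nlargest stands in for Spark's takeOrdered, and the helper name top_movies and its keep parameter are assumptions.

```python
import heapq

# (average_rating, title, number_of_ratings); all values are invented.
movies = [
    (4.6, "Movie A", 120),
    (4.9, "Movie B", 25),
    (4.2, "Movie C", 76),
    (4.8, "Movie D", 75),
    (3.9, "Movie E", 300),
]


def top_movies(records, n, keep):
    """Return the n highest-rated records whose rating count passes keep()."""
    eligible = [r for r in records if keep(r[2])]
    return heapq.nlargest(n, eligible, key=lambda r: r[0])


more_than_75 = top_movies(movies, 20, keep=lambda count: count > 75)
at_least_25 = top_movies(movies, 25, keep=lambda count: count >= 25)
more_than_25 = top_movies(movies, 25, keep=lambda count: count > 25)
```

With this data, "Movie B" (exactly 25 ratings) appears under the “at least 25” filter but not under “more than 25”, which is why the wording of the printed title matters.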
The Point of Plagiarizing Code
The follow-up article to the Codementor article incorporates the ALS recommender algorithm with Flask, a popular Python framework for creating microservices. This is genuinely interesting new code not covered by the original edX assignment. So what was the point of the weak plagiarism in Part 1? A piece of content marketing for a personal portfolio?
If Part 1 was entirely “Hey, I took the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX which included a good Collaborative Filtering algorithm, here’s a link to the assignment if you want to know more, and now I will use the code I wrote for that assignment to create a Flask app,” I would have been perfectly OK with that. (after the course was completed to avoid cheating, of course)
Coding is an industry that’s generally more lax about reuse. I would not expect someone to cite every instance they used code from a Stack Overflow question. Forking code repositories on GitHub is actively encouraged, as it often leads to new insights.
But the willful and unnecessary paraphrasing of code, which replaces all the CamelCase variables with underscored variables and produces code which is worse than the original, is hard to overlook. “I recommend you for example this edX course” does not give sufficient credit to the teachers and staff of CS100.1x.