CS 91 Lab 1

Writing MapReduce Jobs

Warm-up due 11:59pm Thursday, September 11

Movie Similarities due 11:59pm Thursday, September 18

Note: Run with --conf-path mrjob.conf to avoid filling up your local /tmp directory!

Handy References

Lab 1 Goals:

Write programs using the MapReduce paradigm.
Process (moderately) large amounts of data in parallel.
Gain experience using the Amazon Web Services platform.

Lab Description

For this lab, you will use the MRJob library to write MapReduce programs in Python. MRJob allows you to run jobs locally, on a Hadoop cluster, or on Amazon's Elastic MapReduce platform, making it easy to test and deploy MapReduce tasks.

This will be a two-week lab with two parts: a warm-up, requiring a few small applications that can be run locally, and larger application that will process significantly more data on Amazon's cloud.

Environment

Rather than cluttering up the department's file system with lots of Python packages, we'll be using a python virtual environment that I've configured for MRJob. To use it, execute the following commands:

source /home/kwebb/public/setup-91.sh
workon mrjob

After doing so, you should see (mrjob) at the beginning of your shell prompt. If all went well, you should now be able to "import mrjob" from within python. Note: You'll need to set up your environment this way every time you plan to work with the mrjob library.

Part 1 (Warm-up): Grep, Inverted Indices, Jumbles.

These exercises will help you to think about dividing problems into mappers and reducers. The solutions to these will likely be only a handful of lines of code. Use the same mrjob.conf file that we used in the word count example.

Grep

Your solution to this part should be in MRGrep.py

If you're not already familiar with grep, it's a Unix command line application that searches the contents of a file for a specified text expression. You can invoke it with:

grep [text to search for] [name(s) of files to search]

For example, if you wanted to find all instances of the word "map" in MRGrep.py, or all instances of "map" in every python file in this directory, you could execute these commands, respectively. (Try them out...)

grep map MRGrep.py

grep map *.py

Our MRGrep will behave slightly differently. It will take one or more strings to search for, and it should group the final output according by the matching expression. Strings to search for are specified with the -e flag, which can be passed multiple times:

python MRGrep.py -e [string1] -e [string2] -e [string3] [name(s) of files to search]

For example, if you wanted to search for instances of "agreeable", "Darcy", and "pride" in the text of Pride and Prejudice, you could run:

python MRGrep.py -e agreeable -e Darcy -e pride PrideAndPrejudice.txt > grep-output

You should get a result that looks like this example output.

Note: for your version of grep, you don't need to worry about reporting the file name like the command line version of grep does. We'll look at incorporating file names in the next part.

Inverted Index

Your solution to this part should be in MRIndex.py

An inverted index maps words or strings of interest to the files they can be found in. Inverted indices are a key component of search engines.

For our MRIndex, we'll be looking for email addresses and web URLs. Given a list of files, your code should output the file location(s) of every email address or web URL it can find, as in this example output. Your mapper can read the name of the file it's currently processing by querying the environment (see the MRIndex.py starter code).

Jumble

Your solution to this part should be in MRJumble.py

A Jumble is a word puzzle commonly found in newspapers. We're going to use a MapReduce job to make finding anagrams, and thus solving jumbles, easier. You're given a list of words from the Scrabble dictionary (dictionary.txt). Using the list as your input, generate an output file containing: a sequence of sorted letters, the number of words that those letters can produce, and the words themselves. A sequence of sorted letters should appear only once in the output.

For example, the sorted sequence of letters "ACEF" can produce two words, CAFE and FACE, so you should see a line in the output that looks like:

"ACEF" 2 CAFE FACE

Another (much less obvious) example:

"ACEIIMNORST" 3 CREATIONISM MISCREATION ROMANTICISE

Note: your output does NOT have to match this format exactly. It's fine if you have brackets, commas, quote marks, or other additional symbols as long as the order is correct (letters, # of words, words) and it's human-readable enough that I can grade it.

Part 2: Movie similarities.

Your solution to this part should be in MRMovies.py

MRJob makes it easy to chain multiple map/reduce tasks together. For this part, we'll make use of this capability to process a large corpus of movie ratings for the purpose of providing recommendations. When you're done, your program will help you decide what to watch on Netflix tonight. For each pair of movies in the data set, you will compute their statistical correlation and cosine similarity (see this blog for a discussion of these and other potential similarity metrics). Since this isn't a statistics class, I've implemented the similarity metrics for you, but you need to provide them with the correct inputs.

For this section of the assignment, we have two input data sets: a small set (details) for testing on your local machine and a large set (details) for running on Amazon's cloud. You can access the data sets at /home/kwebb/public/cs91/lab1/

For both data sets, you will find two input files:

movies.dat contains identification and genre information for each movie.
Lines are of the form:

MovieID::Movie Title::Genre

For example:

0068646::The Godfather (1972)::Crime|Drama
0181875::Almost Famous (2000)::Drama|Music

ratings.dat contains a series of individual user movie ratings.
Lines are of the form:

UserID::MovieID::Rating::Timestamp

For example:

120::0068646::10::1365448727
374::0181875::9::1374863640

In addition to these two input files, your program should take a few additional arguments:

-m [movie title]: The title of a movie for which we'd like to see similar titles. Like MRGrep, you should be able to accept multiple movie titles with more than one -m argument.

-k [number of items]: For each of the movies specified using -m, this specifies how many of the top matches we'd like to see. In other words, running with "-m The Godfather (1972) -k 10" would be asking for "the top ten movie recommendations if you liked The Godfather." (Default 25)

-l [similarity lower bound]: When computing movie similarity metrics, we'll produce a floating-point value in the range [-1, 1]. This input says to ignore any movie parings whose similarity metric is below this value. (Default 0.4)

-p [minimum rating pairs]: When computing similarity metrics, ignore any pair of movies that don't have at least this many shared ratings. (Default 3)

Please don't attempt to filter down to the movies specified via -m until the final step. Yes, it would be more efficient to do so earlier. I want you to compute the similarities for all movies. The -m argument is there to reduce the output size and make reading (and grading) the output easier. For the others arguments (-k, -l, and -p), you may filter whenever you want.

Output

Since we're computing two similarity metrics, we'll need to combine them into a single similarity value somehow. For your submission, you should blend the values together, using 50% of each. That is, your final value for a pair of movies is 0.5 * the correlation value + 0.5 * the cosine value for the pair. (Feel free to experiment with different weights or similarity metrics, but you should do 50/50 between the two for your submission.)

For each movie selected (-m), sort them from largest to smallest by their blended similarity metric, outputting only the top K (-k) most similar movies that are have at least the minimum blended similarity score (-l). For movies meeting this criteria, you should output:

The name of the movie for which we want similar titles (specified via -m).

The name of a similar movie.

The blended similarity metric that these two movies share.

The correlation metric that these two movies share.

The cosine metric that these two movies share.

The number of ratings this pair of movies had in common.

You may format your output however you like, as long as the values are in the correct order and I can reasonably make sense of it by looking at it briefly.

You may want to take a look at my output from running on the large data set.

Steps

You may structure your sequence of map/reduce tasks however you want to solve the problem. I recommend the following sequence of steps:

Join the input files: Initially, you have two input files (ratings.dat and movies.dat). You'll get most of the important info from ratings.dat, but it only has movie IDs rather than movie names. For the first step, you can assign names to the rated movies and drop the movie ID. This way, you can refer to movies by their name going forward. You probably want your reducer's output to be key: user id, value: (movie title, rating).

Produce movie rating pairs: Next, you want to organize the movies into pairs, recording the ratings of each when a user has rated both movies. This gives you vectors to use for the similarity metrics. For example, suppose we have three users, Alice, Bob, and Charlie:

Alice has rated Almost Famous a 10, The Godfather a 9, and Anchorman a 4.
Bob has rated Almost Famous a 7 and Anchorman a 10.
Charlie has rated The Godfather a 10 and Anchorman an 8.

You would end up with records that look like:

Key: (Almost Famous, The Godfather)
Values: (10, 9)

Key: (Almost Famous, Anchorman)
Values: (10, 4), (7, 10)

Key: (Anchorman, The Godfather)
Values: (4, 9), (8, 10)

Protip: You'll want to ensure that the keys you output are consistent for a pair of movies, for example, by putting them in alphabetical order. Otherwise, you run the risk of having two keys for a pair of movies, e.g., (Anchorman, The Godfather) and (The Godfather, Anchorman). Having more than one key for a pair is bad, as they will be treated independently (and probably sent to different machines for processing).

Compute the similarity scores: Given keys that tell you a pair of movie names and values that contain a sequence of pairs, each corresponding to how a user rated that pair, you now have the information you need to compute similarity scores for that pair of movies.

Filter and format the output: Filter out the movies that weren't specified with the -m flag and sort the output by similarity metric so that it conforms to the desired output format.

Running it

For the small data set, you can run it locally, just as you did for the warm up exercises. It will probably take a few minutes (5-10) to complete. You can run over the large data set locally too, if you want, but it will take at least an hour (probably closer to two hours). Instead, let's farm it out to Amazon's Elastic MapReduce (EMR) compute platform. There, it'll only take about 20-30 minutes, which is much more reasonable.

To run on Amazon, you'll need to tell MRJob to use "emr" as its runner. You'll also need to give it some basic configuration information. Edit your mrjob.conf with the following contents:

runners:
  inline:
    base_tmp_dir: /local
  emr:
    ec2_instance_type: m1.medium
    num_ec2_instances: 10
    aws_access_key_id: [your access key]
    aws_secret_access_key: [your secret key]
    aws_region: us-east-1

Now, you should be able to invoke MRJob with "--conf-path mrjob.conf -r emr" to do your processing in the cloud. My sample output has an example of a full command line.

Submitting

Once you are satisfied with your solution, you can hand it in using the handin91 command from a terminal on any of the CS lab machines.

Only one of you or your partner should run handin91 to submit your joint solutions. If you accidentally both run it, send me email right away letting me know which of the two solutions I should keep and which I should discard.

You may run handin91 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize, after handing in some programs, that you'd like to make a few more changes to them.