Data Structures and Algorithms

Lab 7: Plagiarism Detector

Due on Sunday, November 13th at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

This lab consists of two parts. The first part is a written assignment involving algorithms for AVL trees and heaps. The second part involves writing a program to detect plagiarism in essays. Both of these parts involve some amount of material on the heap data structure; if you encounter that part of the work before heaps have been covered in class, set it aside and move on to the other part. (If you get to the heap material on both parts of this lab before heaps are covered in lecture, then well done; you deserve a break!)

Part I: Written Lab

Source: OpenClipArt.org

For this part of your assignment, you will give written answers much in the same way as you did in previous labs. Your submission must be in a typeset PDF format; please see that link for instructions and see your Lab 3 template for instructions on how to use LaTeX.

`textree.py`

This part of the lab involves drawing trees. This is difficult in LaTeX (and in most diagram software), so we’ve given you a Makefile rule that will make things much easier. This rule calls the script textree.py, which has been provided for you. Between this and the provided WrittenLab.tex, all you need to do is draw some ASCII art trees in text files and your PDF will be built for you.

In the written-trees directory, there are a series of files named e.g. problem1.1.tree. The textree.py script turns these tree files into .tex code which will draw those trees in the PDF. To complete the written part of the assignment, you just need edit the .tree files to contain the right trees. The following rules apply in these .tree files:

Blank lines and lines starting with a # are a comment (like in Python) and are ignored.
Lines starting with a : are treated as LaTeX text. You can write your explanation of what’s happening on those lines.
Lines containing only numbers, letters, and spaces are taken to be a binary tree. Each line is a level of the tree.
Lines containing only symbols are ignored. This allows you to include / and \ symbols if you want to draw your tree in the text file.

For instance, this example diagram is produced by the following .tree file.

Example `.tree` File

: This is a sample BST.

 2
1 3

# This line is a comment.

: I can put slashes into my tree if I want; they don't do anything.

  2
 / \
1   3

: Each node decides its parent by the text that is closest to it.  For instance,
: the 3 below is the left child of 4 (and not the right child of 1) because the
: 4 is closer to the three.  The blank lines are ignored just like lines with
: symbols are.

  2

1   4

   3

AVL Tree Questions

Show the left rotation of the following tree. In this rotation, what are the X, Y, and Z subtrees?
Show the right rotation of the subtree with the root node 6. In this rotation, what are the X, Y, and Z subtrees?
Using the AVL tree algorithm, insert the value 2 into the following tree. Show the tree before and after rebalancing.
Using the AVL tree algorithm, remove the value 12 from the following tree. Show the tree before and after each rebalancing.

Heap Questions

Note that these questions rely upon the discussion of heaps in lecture.

Show the addition of element 5 to max-heap below. First, show the addition of 5 to the tree; then, show each bubbling step.
Show the removal of the top element of this max-heap. First, show the swap of the root node; then, show each bubbling step.
Consider the sequence of elements [7,1,5,2,8,3,4]. Using the representation discussed in class, show the tree to which this sequence corresponds. Then, show the heapification of this tree; that is, show how this tree is transformed into a heap. Demonstrate each bubbling step.

Part II: The Plagiarism Detector

Plagiarism Detector

In this part of the lab, you will write a program for detecting plagiarism in written essays. To do this, we will make one fundamental assumption: plagiarized documents will have a significant number of phrases in common. Our plagiarism detector will then proceed according to the following high-level algorithm:

Determine which phrases appear in each document and how often.
Compare each pair of documents to determine how similar their phrase counts are.
Report the most similar pairs of documents.

You will not have to write tests for your program; they have been provided. To approach this problem, we will break the work into a few parts. We’ll start by writing code to represent a single document. Next, we’ll write code to perform the comparison between documents. Finally, we’ll put it all together into a main function.

The `Document` Class

Source: OpenClipArt.org

We will begin by splitting the text files that we are checking into “phrases”. Here, a phrase will be a sequence of three words that appear next to each other in the document. For instance, the text “hedgehog in a balloon factory” contains the phrases “hedgehog in a”, “in a balloon”, and “a balloon factory”.

The files documentUtils.h and documentUtils.cpp already contain code that will read the files for you and split them into phrases; you must write the Document class that this code uses to remember the phrases for a particular document. Look at document.h for a description of how the Document class should work; write your implementations in document.cpp. Note that we do not have a document__definitions.h file; we only need to put definitions in .h files for templated classes.

When writing your code, note that each Document will need a Dictionary to keep track of how many times it has seen each phrase. You can find a balanced BST implementation of the Dictionary interface in adts/impls/stlBST.h.

Note that you must provide the private field declarations for the classes in this lab. Destructors are declared for each class, but how they should operate will depend upon the particular fields you pick. It’s possible, for instance, that the right destructor implementation involves no lines of code at all.

Testing the `Document` Class

Tests for this lab have been provided for you, but they have been split into several groups to allow you to run them separately. If you run tests as normal…

    make tests
    ./tests

…then you will receive a message requesting which tests to run. Your choices are all, document, detectorShort, detectorMedium, and detectorLong. You should avoid all until you have completed your lab. For now, just run

    make tests
    ./tests document

You should make sure your Document class tests succeed before you continue.

The `PlagiarismDetector` Class

Source: OpenClipArt.org

Once you have implemented the Document class, you are able to load documents, count their phrases, and determine how similar they are to each other. Next, we need to write code that uses this class to analyze numerous documents. We’ll write this code in the PlagiarismDetector class to keep it organized; when we’ve finished, we’ll be able to think of a PlagiarismDetector object as a device for detecting similar documents without focusing on the details of e.g. AVL trees.

The first important method of the PlagiarismDetector class is the addDocument method. Given a filename, this method loads that file using the documentUtils.h functions and keeps the resulting Document* for later analysis. You will need to add an appropriate field to store this information but, since both the loadDocument function and the Document class have already been written, there aren’t that many lines of code involved.

The second important method of PlagiarismDetector is the findSuspiciousDocuments method. This method must compare each added document to each other, producing a list of SimilarityResult objects. It must then produce some number of the top matches based upon a parameter; for instance, the caller may want the top ten matches. The findSuspiciousDocuments method is required to use a heap to find these most suspicious documents. Sorting the entire list of results isn’t an option.

For large sets of documents, findSuspiciousDocuments takes a very long time. Feel free to add a progress indicator to help you keep track of which files you are comparing or how much work you have left to do.

`SimilarityResult` and Copying

The findSuspiciousDocuments method of the PlagiarismDetector class allows you to find the most suspicious documents that the detector has loaded. This method returns a vector containing objects of type SimilarityResult, a class which has been written for you. SimilarityResult is similar to pair or string in that it knows how to copy itself; that is, the following code works:

SimilarityResult myResult("file1.txt", "file2.txt", 20);
SimilarityResult otherResult = myResult; // this doesn't usually work for classes we write

This means that you don’t need to create dynamically-allocated SimilarityResult objects; you can statically allocate them like you have done with pair objects in previous labs.

Testing the `PlagiarismDetector` Class

As mentioned above, the tests for this lab have been split into groups. For the PlagiarismDetector class, there are three groups: detectorShort, detectorMedium, and detectorLong. These groups reflect how long the tests will take to run. Because the plagiarism detection process is computationally intensive, you’ll want to start by debugging the short tests and move on to the longer tests when they’re the only ones left that fail. Make sure you read the following section on performance before moving on to the longer tests!

Performance

The plagiarism detection process is fairly slow. Fortunately, there’s one key optimization you can introduce which improves the wall time of your program considerably. This optimization could make the difference between your unit tests running for five minutes or five hours, so make sure to include it before running detectorMedium or detectorLong.

The optimization only involves changing a few lines of code. In your Document class where the similarity method is defined, your algorithm probably works something like this:

Create a vector to contain the phrases.
Set that vector to contain all phrases from this document.
For each phrase in that vector, accumulate either the phrase count from this document or the phrase count from the other document, whichever is smaller.

This algorithm works well on documents of similar size. If, however, one of the documents is much larger than the other, we run the risk of calling getPhraseCount many, many times only to get a lot of zeroes. To fix this problem, use the following algorithm instead:

Create a vector to contain the phrases.
Set that vector to contain all phrases from either this document or the other document, based upon which document has fewer phrases.
For each phrase in that vector, accumulate either the phrase count from this document or the phrase count from the other document, whichever is smaller.

With this algorithm in place, the tests should take roughly the following times on the CS network machines:

detectorShort: less than a second
detectorMedium: less than a minute
detectorLong: about five or six minutes

Your `main` Function

Once you’ve written the PlagiarismDetector class, you should be prepared to write your main method. The first argument of your program should be the number of results to display. Every remaining argument is the name of a document to use when running your program. For instance, a user might write

./plagiarismDetector 4 essay1.txt essay2.txt essay3.txt essay4.txt essay5.txt

to get the top four examples of plagiarism among the provided five files. In bash (the language used by the terminals on the CS network machines), you can accomplish the same thing by writing

./plagiarismDetector 4 essay*.txt

The terminal will interpret the “*” in the filename to mean “any amount of text goes here”; it will then find all files matching that description and use each of their names as a different argument. For instance, you can try

./plagiarismDetector 10 test_data/small/*

to test the small data set. For instance, the above command might return the following:

test_data/small/bgt61.txt test_data/small/sra31.txt 1639
test_data/small/bgt157.txt test_data/small/bgt160.txt 754
test_data/small/bmu392.txt test_data/small/mul30.txt 24
test_data/small/esv276.txt test_data/small/toi94.txt 23
test_data/small/bah105.txt test_data/small/esv276.txt 22
test_data/small/bah27.txt test_data/small/esv276.txt 21
test_data/small/bgt61.txt test_data/small/toi94.txt 20
test_data/small/sra31.txt test_data/small/toi94.txt 19
test_data/small/esv276.txt test_data/small/toi141.txt 18
test_data/small/ecu179.txt test_data/small/esv276.txt 18

The specific format of the output is up to you.

Invalid Input

Your program is expected to run without crashing or throwing an exception from main regardless of how it is run; in case of bad input, you should provide an appropriate error message. In particular, you must handle all of the following bad cases:

Too few command-line arguments were provided.
The first command-line argument is not a number.
The first command-line argument is a number but is the only argument; no files are specified.
The number of results requested is greater than the number of comparisons performed by the PlagiarismDetector.

Memory

Your program must run without any memory errors; make sure to use valgrind to find them. Memory leaks, on the other hand, are acceptable in small amounts and will not significantly impact your grade. In particular, “still reachable” memory is acceptable; “lost” memory should not occur in large amounts (e.g. millions of bytes).

Coding Style Requirements

You are required to observe some good coding practices:

Source: OpenClipArt.org

You should pick meaningful variable names.

// Good
int* pixels = new int[size];
// Bad
int* p = new int[size];

You should use correct and consistent indentation. Lines of code within a block (that is, surrounded by { and }) should be indented four spaces further than the lines surrounding them.
```
// Good
if (condition) {
    cout << "Test" << endl;
}
// Bad
if (condition) {
cout << "Test" << endl;
    }
```
You should use a block whenever possible, even if it’s not necessary. This helps you avoid subtle or messy bugs in the future.
```
// Good
if (condition) {
    cout << "Something" << endl;
}
// Bad
if (condition)
    cout << "Something" << endl;
```
Any new methods or fields in your header files should have comments explaining their purpose and behavior. You are permitted to omit documentation for methods that are inherited from other classes; that is, if your class has a foo method because its superclass has a foo method, you don’t need to document that method.
```
// Good
public:
    /**
     * Saves the image represented by this object to a file.
     * @param filename The name of the file to save.
     * @return The number of pixels in the saved image.
     * @throws runtime_error If the image could not be saved due to an I/O error.
     */
    int save(std::string filename);

// Bad
public:
    int save(std::string filename);
```
Your method/field documentation does not have to be in the format above, but you must describe the method’s behavior, its parameters, its return value, and any exceptions that it may throw. (If you’re indifferent, the above syntax is a good one to know; it’s a de facto standard used by Javadoc, Doxygen, and other tools that automatically process source code comments into other formats like searchable webpages.)

Peer Review

As this is a team assignment, you are required to complete a peer review of your lab partner or partners. You must do this even if everything is fine. You should complete this peer review after you have turned in your assignment. The peer review comes in the form of a simple Google Form and can be accessed here; in most cases, it will likely take less than a minute to complete. If you have trouble accessing this peer review, please make sure that you are logged into Google using your Swarthmore credentials.

Your peer review will be kept private; only course staff will have access to this information. You can also update your peer review after you have submitted it. However, if you do not submit a peer review, you will lose participation grade points. Please don’t forget!

Summary of Requirements

When you are finished, you should have

A working plagiarism detector!
Passing unit tests
No memory errors
Code which conforms to the style requirements above
A completed peer review

Lab 7: Plagiarism Detector

Overview

Part I: Written Lab

textree.py

AVL Tree Questions

Heap Questions

Part II: The Plagiarism Detector

The Document Class

Testing the Document Class

The PlagiarismDetector Class

SimilarityResult and Copying

Testing the PlagiarismDetector Class

Performance

Your main Function

Invalid Input

Memory

Coding Style Requirements

Peer Review

Summary of Requirements

`textree.py`

The `Document` Class

Testing the `Document` Class

The `PlagiarismDetector` Class

`SimilarityResult` and Copying

Testing the `PlagiarismDetector` Class

Your `main` Function