Swarthmore College Department of Computer Science

running long jobs

When running long experiments on the CS department machines, it is important that you do not hog all of the resources. Here are some things to be aware of and a few rules to follow:

  1. Do not run your job on allspice.

    Allspice is our email and file server. If you put a heavy load on allspice, everyone will suffer. Besides, the lab machines have faster CPUs, so your job should run faster on a lab machine (especially if you use the /local directories -- see below).

  2. Use /local and /scratch for large data files.

    Your home directory is really on allspice, so running a job on a lab machine that writes data to your home dir means sending data (over the network) to allspice. Also, your home dir has a disk quota, which limits the amount of data you can store in your home dir.

    Anyone can make a directory in /scratch or the /local dirs (e.g., mkdir /local/yourusername). /scratch is still on allspice, but accessible from any lab machine, so use /scratch if you need to access the file from any machine. Use the /local dirs if you can always use the same machine (i.e., /local on lime is not the same as /local on lemon). If your program writes tons of data to files, it will be faster to use a /local directory.
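
    As a rough sketch of the advice above, the following shows one way a job might send its output to a per-user directory under /local. The job name "myrun", the use of seq as a stand-in program, and the fallback to a temporary directory when /local is missing are all assumptions made for illustration.

```shell
# Prefer /local for heavy output. The fallback to a temp directory is an
# assumption for machines where /local is missing or not writable.
if [ -d /local ] && [ -w /local ]; then
    base=/local
else
    base=${TMPDIR:-/tmp}
fi

# One directory per user keeps everyone's files apart.
user_name=$(id -un 2>/dev/null || id -u)
outdir="$base/$user_name/myrun"      # "myrun" is a hypothetical job name
mkdir -p "$outdir"

# Stand-in for a real program: write results to local disk, not $HOME.
seq 1 1000 > "$outdir/results.txt"

# Check how much space the output takes.
du -sh "$outdir"
```

    Files written this way exist only on the machine where the job ran, so copy anything you need elsewhere to /scratch or your home directory once the job finishes.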

  3. Be nice.

    All programs that will run for an extended amount of time (more than 15 minutes) should be "niced" to a lower priority. If you are about to start a long program, try nice -n 19 ./a.out (in csh or tcsh, the built-in form is nice +19 ./a.out). If you have already started your program, use renice +19 -p pid, where pid is the process ID of your program (found by using the ps or top command). On Linux, 19 is the lowest priority. If nobody else is using the computer your job is running on, your job still gets essentially all of the CPU. If someone is using that computer, your job runs at a lower priority so the console won't be slow.
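
    A minimal sketch of both cases, using the portable -n form of nice and renice. The sleep commands are stand-ins for a real program, and the variable names are illustrative only.

```shell
# Start a new long job at the lowest priority (niceness 19).
# "sleep 2" is a stand-in for a real program such as ./a.out.
nice -n 19 sleep 2 &
job_pid=$!

# Running "nice" with no arguments prints the current niceness, so this
# shows the niceness a command started with "nice -n 19" runs at.
job_niceness=$(nice -n 19 nice)
echo "a niced command runs at niceness $job_niceness"

# Lower the priority of a job that is already running (a second stand-in).
sleep 2 &
running_pid=$!
renice -n 19 -p "$running_pid"

# Wait for both background jobs to finish.
wait
```

    An ordinary user can lower the priority of their own jobs this way but generally cannot raise it again afterwards.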

  4. If possible, use screen.

    If you don't need the graphics console, try running your long simulation in a screen session. screen allows you to detach from a session and then reattach later. For example, you might start your program in the lab, detach and log out (your program keeps running!), and then reattach to the same session from your dorm room. Here's how to do it:

    • In any window, type screen. The window should clear, and you get your prompt back, but you are now inside a screen session.
    • Run your program (in the background or not).
    • Type Ctrl-a d (control-a, then the d key) to detach.
    • At this point you could log out and your program keeps running.
    • In another window (on the same machine, but you could be logged in from somewhere else), type screen -r to reattach.
    • When done with a screen session, just type exit.
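
    The steps above are interactive. As a non-interactive variant, screen can also start a session that is already detached. This is only a sketch: the session name "longjob" and the sleep command are hypothetical stand-ins for a real job.

```shell
# "longjob" is a hypothetical session name; "sleep 5" stands in for a real program.
session_name="longjob_$$"

if command -v screen >/dev/null 2>&1; then
    # -d -m starts the session already detached; -S gives it a name.
    screen -dmS "$session_name" sleep 5

    # List sessions on this machine; the new one should show as detached.
    screen -ls

    # To reattach later from an interactive terminal:
    #   screen -r "$session_name"

    # Clean up the demo session (a real job would exit on its own).
    screen -S "$session_name" -X quit
else
    echo "screen is not installed on this machine"
fi
```

    Naming the session makes it easier to pick the right one with screen -r when several sessions are running on the same machine.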

  5. Try not to use the machines in the main teaching lab.

    Use the overflow or robot lab machines first (especially if you have to xlock). The files /usr/swat/db/hosts.* list which machines are in which labs.

  6. If you have to xlock, leave a message.

    If you have to leave yourself logged on for a long time, it's a good idea to xlock the console. Try one of these to xlock and leave a helpful message:

    xlock -mode flag -message "Please do not log me off!"
    xlock -mode marquee -messagefile file_with_msg_in_it
    (or add a note to your .plan file)

    And don't xlock more than one machine (use ssh with X forwarding if you are running GUI applications on multiple machines).

  7. If possible, write your program to allow restarts.

    This seems like common sense, but many programs don't do it. If you're going to run a job that takes 24 hours, what happens if the power goes out (or Jeff has to reboot the machines) after 23 hours? Ideally your program has been writing data files every N timesteps and can be restarted from any of these data files.
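
    One possible shape for this, as a sketch only: the loop below writes a checkpoint every few timesteps and resumes from the last checkpoint when it is run again. The file name checkpoint.txt, the step counts, and the arithmetic standing in for real work are all hypothetical.

```shell
# Hypothetical names and values, chosen only for illustration.
workdir=$(mktemp -d)
checkpoint="$workdir/checkpoint.txt"
total_steps=20
checkpoint_every=5

run_simulation() {
    # $1 = step to stop at (a value below total_steps simulates a crash)
    stop_at=$1
    step=0
    state=0
    # Resume from the last checkpoint if one exists.
    if [ -f "$checkpoint" ]; then
        read -r step state < "$checkpoint"
    fi
    while [ "$step" -lt "$stop_at" ]; do
        step=$((step + 1))
        state=$((state + step))      # stand-in for one timestep of real work
        if [ $((step % checkpoint_every)) -eq 0 ]; then
            # Write to a temp file, then rename, so a crash mid-write
            # never leaves a half-written checkpoint.
            printf '%s %s\n' "$step" "$state" > "$checkpoint.tmp"
            mv "$checkpoint.tmp" "$checkpoint"
        fi
    done
}

run_simulation 13                 # "crash" after step 13
read -r saved_step saved_state < "$checkpoint"
echo "last checkpoint: step $saved_step"

run_simulation "$total_steps"     # restart from the checkpoint, not step 0
final_state=$state
echo "finished with state $final_state"
```

    Only the work done since the last checkpoint is lost in a crash, so the checkpoint interval is a trade-off between time spent writing files and time spent redoing work.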