First Steps

Setup

The fastest and easiest way to make sure you have a working setup (as described below) is to go through these tutorials on Binder.

The second-easiest way is to run the tutorials in a Docker container on your computer. Run

docker run -p 8888:8888 htcondor/htmap-tutorials

and follow the instructions it gives you to get into the Jupyter environment. Then go to tutorials/first-steps.ipynb in the file browser and open it to get back to this point.

Alternatively, you might want to immediately start running HTMap on your HTCondor pool. This tutorial assumes that you’ve already installed HTMap on your HTCondor pool’s submit node, or have access to HTMap through a JupyterHub server connected to an HTCondor pool or similar. See How do I install HTMap? for details!

This tutorial also assumes that you’re working in a Jupyter Notebook. It will work just as well in the Python REPL. Later, once you get the hang of things, you’ll be ready to use HTMap in scripts as well. Either way, you’ll need to be on a computer that can submit jobs to an HTCondor pool.

This tutorial assumes that you have already set up your dependency management, as described in Dependency Management. If your HTCondor pool supports Docker, you’ll be good to go with the default settings.

The tutorials in this series are written inside Jupyter Notebooks. If you click the “View page source” link in the upper right corner, you’ll be able to grab the raw .ipynb file yourself and step through it along with the tutorial.

The Problem

Suppose you’ve been given the task of writing a function that doubles numbers, like this:

[1]:
def double(x):
    return 2 * x

If you want to double a list of numbers, you might do something like

[2]:
doubled = [double(x) for x in range(10)]
print(doubled)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

or you can use the built-in function map(), which applies a function to each element of an iterable (like a list):

[3]:
mapped = map(double, range(10))
print(mapped)
doubled = list(mapped)
print(doubled)
<map object at 0x7f7ae8393390>
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In both cases, doubled is the list [0, 2, 4, ...]. The reason we need the list call is that map actually returns an iterator over the results, not the results themselves. So you need to iterate over it to get the output, which is what list does: iterate over its input and put the elements in a list.
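
To see the iterator behavior directly, here is a minimal sketch using only built-in Python. The names first, rest, and again are illustrative and not part of the original tutorial.

```python
def double(x):
    return 2 * x

# map() does no work up front; it returns a lazy iterator.
mapped = map(double, range(3))

# Each result is computed only when it is requested.
first = next(mapped)   # double(0)

# list() consumes whatever is left in the iterator.
rest = list(mapped)    # double(1), double(2)

# The iterator is now exhausted, so a second pass yields nothing.
again = list(mapped)

print(first, rest, again)  # 0 [2, 4] []
```

This is why the earlier cell stores list(mapped) in doubled: once the iterator has been consumed, the results are gone unless you kept them.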

Now suppose that, for some reason, you want to double a lot of numbers. So many numbers that you can’t bear to do all the work on your own computer. It takes days to multiply all the numbers, and if your program crashes halfway through, you lose all of your progress and have to start over. You’re losing sleep, and your boss is breathing down your neck because they need those numbers doubled now.

Luckily, you remember that you have access to an HTCondor high-throughput computing pool. Since each of your function calls is isolated from all the others, the computers in the pool don’t need to talk to each other at all, and you can achieve a huge speedup. The pool can run your code on hundreds or thousands of computers simultaneously, storing the inputs and outputs for you and recovering from individual errors gracefully. It’s the perfect solution.
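
As a rough local analogy for why isolated calls parallelize so easily, the sketch below fans the same calls out to a few worker threads with Python's standard concurrent.futures module. This is not how HTCondor or HTMap runs your code; the worker count and variable names are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return 2 * x

# No call depends on another, so the calls can run in any order on any worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map returns results in input order, regardless of completion order.
    doubled = list(pool.map(double, range(10)))

print(doubled)
```

A pool applies the same idea across many machines instead of a handful of threads on one machine.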

The problem is: how do you get your code running in the pool?

The Solution

With HTMap, it’s like this:

[4]:
import htmap

mapped = htmap.map(double, range(10))
print(mapped)
doubled = list(mapped)
print(doubled)
Created map super-busy-dog with 10 components
Map(tag = super-busy-dog)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

It may take some time before the second print runs, because list(mapped) has to wait for the results. During that time, the individual components of your map are running on execute nodes in the pool. Once they all finish, you’ll get the list of numbers back. As you can see, the output is identical to what you would get from running the function locally.


In the next tutorial we’ll start digging into the extra features that HTMap provides on top of this basic functionality.