Working with Files

High-throughput computing often involves analyzing data stored in files. For many simple cases, HTMap can automatically work with files that you specify as arguments of your function without (much) special treatment.

Let’s start with a “Hello world!” example:

[1]:
from pathlib import Path

def read_file(path: Path):
    return path.read_text()

This function takes in a pathlib.Path, reads it, and returns its contents. Let’s make a file and see how it works:

[2]:
hi_path = Path.cwd() / 'hi.txt'
print(hi_path)
hi_path.write_text('Hello world!')
/home/jovyan/tutorials/hi.txt
[2]:
12
[3]:
print(read_file(hi_path))
Hello world!

(pathlib has a steeper learning curve than os.path, but it’s well worth the effort!)
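For readers coming from os.path, here is a small side-by-side sketch of a few common operations. The directory and file names are only illustrative, and nothing is read from or written to disk.

```python
import os.path
from pathlib import Path

# Build the same relative path both ways.
p = Path('tutorials') / 'hi.txt'         # pathlib: the / operator joins parts
s = os.path.join('tutorials', 'hi.txt')  # os.path: a function that joins strings

# Pull the path apart again.
print(p.name, p.stem, p.suffix)          # attributes on the Path object
print(os.path.basename(s), os.path.splitext(s)[1])  # functions on the string

# The two spellings describe the same location.
print(str(p) == s)
```

The pathlib version keeps the path as an object with its own methods (such as read_text and write_text, used above), which is what lets htmap.TransferPath stand in for it later in this tutorial.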

Now, let’s start mapping. In this case, the map call is barely different from the original function call, but we need to set up the inputs correctly. The trick is that, instead of a pathlib.Path, we need to use an htmap.TransferPath. htmap.TransferPath is a drop-in replacement for pathlib.Path in every way, except that HTMap gives it special treatment.

HTMap will detect that we used an htmap.TransferPath in a map as long as it is a positional or keyword argument of the function, or is stored in a primitive container (list, dict, set, or tuple), and it will automatically transfer the named file to wherever the function executes.
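To make “stored in a primitive container” concrete, here is a minimal sketch of how such detection could work. This is not HTMap’s actual implementation: find_paths is a hypothetical helper, and pathlib.PurePath stands in for htmap.TransferPath so that the example runs without HTMap installed.

```python
from pathlib import Path, PurePath

def find_paths(obj):
    """Recursively collect path objects from obj (hypothetical helper, not part of HTMap)."""
    if isinstance(obj, PurePath):
        return [obj]
    if isinstance(obj, dict):
        # Check both the keys and the values of a dict.
        items = list(obj.keys()) + list(obj.values())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        items = list(obj)
    else:
        return []  # neither a path nor a primitive container
    found = []
    for item in items:
        found.extend(find_paths(item))
    return found

# Paths passed directly, inside a list, and nested inside a dict of tuples.
args = (Path('a.txt'), [Path('b.txt'), 'not a path'])
kwargs = {'extra': {'inputs': (Path('c.txt'),)}}

found = find_paths(args) + find_paths(kwargs)
print(sorted(p.name for p in found))
```

The sketch also suggests why a path hidden inside a custom object would be missed: a walker like this only knows how to look inside the built-in container types.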

[4]:
import htmap

bye_path = htmap.TransferPath.cwd() / 'bye.txt'
bye_path.write_text('Have a nice day!')
[4]:
16
[5]:
m = htmap.map(read_file, [bye_path])  # named m so the built-in map is not shadowed
print(m.get(0))  # m.get will wait until the result is ready
Created map puny-thin-echo with 1 components
Have a nice day!

Multiple Files

To see how we can transfer a container full of files, let’s write a simple clone of the Unix cat program, which concatenates files. It takes a single argument, a list of files to concatenate, and returns their concatenated contents as a string.

[6]:
def cat(files):
    file_contents = (file.read_text() for file in files)
    return ''.join(file_contents)

Let’s write some test files…

[7]:
cwd = htmap.TransferPath.cwd()
paths = [
    cwd / 'start.txt',
    cwd / 'middle.txt',
    cwd / 'end.txt',
]
parts = [
    'The quick brown ',
    'fox jumps over ',
    'the lazy dog!',
]
for path, part in zip(paths, parts):
    path.write_text(part)

… and run a map!

[8]:
m = htmap.map(cat, [paths])  # this creates a single map component with the list of paths as the argument
print(m.get(0))
Created map red-bland-tub with 1 components
The quick brown fox jumps over the lazy dog!
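The extra brackets in [paths] matter: each element of the list passed to the map becomes the argument of one component. The sketch below illustrates that with plain list comprehensions as a local stand-in for the map. It assumes only pathlib and a temporary directory, does not use HTMap, and repeats cat so that it is self-contained.

```python
import tempfile
from pathlib import Path

def cat(files):
    # Same logic as the cat function above.
    return ''.join(file.read_text() for file in files)

with tempfile.TemporaryDirectory() as tmp:
    parts = {
        'start.txt': 'The quick brown ',
        'middle.txt': 'fox jumps over ',
        'end.txt': 'the lazy dog!',
    }
    paths = []
    for name, text in parts.items():
        path = Path(tmp) / name
        path.write_text(text)
        paths.append(path)

    # One "component": the whole list of paths is a single argument.
    one_component = [cat(arg) for arg in [paths]]

    # Three "components": each path is its own argument.
    per_file = [path.read_text() for path in paths]

print(len(one_component), one_component[0])
print(len(per_file))
```

Running a function locally like this is also a cheap way to catch mistakes before submitting a map.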

If the “output” of your map function needs to be a file instead of a Python object (or you produce files that you need back submit-side for whatever reason), you’ll want to look at the Output Files recipe once you’re done with the tutorials.

In the next tutorial we’ll learn how to tell HTCondor what resources our map components require, as well as any other HTCondor configuration they need.