Using HTCondor with HTMap

HTMap is a Python wrapper over the underlying HTCondor API. That means the vast majority of the HTCondor functionality is available. This page is a brief overview of how HTMap uses HTCondor to run your maps. It may be helpful for debugging, or for cross-referencing your HTMap and HTCondor knowledge.

Component and Job States

Each HTMap map component is represented by an HTCondor job. Map components will usually be in one of four HTCondor job states:

  • Idle: the job/component has not started running yet; it is waiting to be assigned resources to execute on.

  • Running: the job/component is running on an execute machine.

  • Held: HTCondor has decided that it can’t run the job/component, but that you (the user) might be able to fix the problem. The job will try to run again if it is released.

  • Completed: the job/component has finished running, and HTMap has collected its output. These jobs will likely leave the HTCondor queue soon.
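When cross-referencing with raw HTCondor output, it can help to know that HTCondor stores a job's state as an integer in the JobStatus job attribute. The sketch below maps those integer codes to names; the `JobStatus` enum, `COMMON_STATES`, and `describe` are illustrative names for this page, not part of HTMap's or HTCondor's Python API.

```python
from enum import IntEnum


class JobStatus(IntEnum):
    """Integer codes HTCondor uses for a job's JobStatus attribute."""

    IDLE = 1
    RUNNING = 2
    REMOVED = 3
    COMPLETED = 4
    HELD = 5
    TRANSFERRING_OUTPUT = 6
    SUSPENDED = 7


# The four states a map component is usually in, per the list above.
COMMON_STATES = {
    JobStatus.IDLE,
    JobStatus.RUNNING,
    JobStatus.HELD,
    JobStatus.COMPLETED,
}


def describe(code: int) -> str:
    """Return a readable name for a JobStatus code (hypothetical helper)."""
    status = JobStatus(code)  # raises ValueError for unknown codes
    name = status.name.replace("_", " ").title()
    if status not in COMMON_STATES:
        name += " (less common for map components)"
    return name


for code in (1, 2, 5, 4):
    print(code, describe(code))
```

The other codes (removed, transferring output, suspended) exist in HTCondor but are outside the four states described above.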

For more detail, see the relevant HTCondor documentation.

Requesting Resources

The default resources provisioned for your map component can be limiting – what if your job requires more memory or more disk space? HTCondor jobs can request resources, and HTMap supports those requests via MapOptions.

MapOptions accepts many of the same keys that condor_submit accepts. Some of the more commonly requested resources are:

  • request_memory. Possible values are like "1MB" for 1MB, or "2GB" for 2GB of memory.

  • request_cpus. Possible values are like "1" for 1 CPU, or "2" for 2 CPUs.

  • request_disk to request an amount of disk space. Possible values are like "10GB" for 10GB, or "1TB" for 1 terabyte.

If any of the resource requests are not set, the default values set by your HTCondor cluster administrator will be used.

Set these through MapOptions. For example:

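import htmap

# Request 1 CPU, 10GB of disk, and 4GB of memory for each map component.
options = htmap.MapOptions(
    request_cpus="1",
    request_disk="10GB",
    request_memory="4GB",
)
# The "..." stands for the function and inputs you are mapping.
htmap.map(..., map_options=options)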

When HTCondor documentation says that “the option foo needs to be set” in a submit file, this corresponds to setting that option in MapOptions, in the same way as the request_* options above.
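To make the correspondence concrete, here is a minimal sketch of how option names and values line up with the "key = value" lines of a submit file. The helper `to_submit_lines` is hypothetical and exists only for illustration; it is not how HTMap builds its submit descriptions internally.

```python
def to_submit_lines(options: dict) -> str:
    """Render option names and values as 'key = value' submit-file lines."""
    return "\n".join(f"{key} = {value}" for key, value in options.items())


# The same keys and values as the MapOptions example above.
options = {
    "request_cpus": "1",
    "request_disk": "10GB",
    "request_memory": "4GB",
}

print(to_submit_lines(options))
```

Each printed line is what you would otherwise write by hand in a submit file passed to condor_submit.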

GPUs

Some options are site-specific. For example, CHTC’s guide “Jobs that use GPUs” describes the options needed to run jobs on their GPU Lab. Check your site’s documentation to see whether it has similar GPU guidance.

Command Line Tools

HTMap tries to expose a complete interface for submitting and managing jobs, but not for examining the state of your HTCondor pool itself. Here are some HTCondor commands that you may find useful:

The links go to an HTML version of the man pages; the same pages are also available with man (e.g., man condor_q). Here’s a list of possibly useful commands:

## See the jobs user foobar has submitted, and their status
condor_q -submitter foobar

## See how many machines have GPUs, and how many are available
condor_status -constraint "CUDADriverVersion>=10.1" -total

## See the stats on GPU machines (including GPU name)
condor_status -compact -constraint 'TotalGpus > 0' -af Machine TotalGpus CUDADeviceName CUDACapability

## See how much CUDA memory is on each machine (and how many are available)
condor_status -constraint "CUDADriverVersion>=10.1" -attributes CUDAGlobalMemoryMb -json
# See which machines have that much memory
# Also write a JSON file so it is readable by Pandas read_json
condor_status -constraint "CUDADriverVersion>=10.1" -attributes CUDAGlobalMemoryMb,Machine -json > stats.json

## See how many GPUs are available
condor_status -constraint "CUDADriverVersion>=10.1" -total

CUDAGlobalMemoryMb is not the only attribute that can be displayed; a more complete list is at https://htcondor.readthedocs.io/en/latest/classad-attributes/machine-classad-attributes.html.
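Once the JSON output is saved, it can be filtered in Python. The sketch below uses only the standard library and assumes the output is a JSON array with one object per machine ad, each carrying the Machine and CUDAGlobalMemoryMb attributes requested above. The function name `machines_with_gpu_memory`, the host names, and the memory values in `SAMPLE` are made up for illustration.

```python
import json


def machines_with_gpu_memory(json_text: str, min_mb: int) -> list:
    """Return sorted machine names whose CUDAGlobalMemoryMb is >= min_mb."""
    ads = json.loads(json_text)
    # A machine can appear once per slot, so collect names in a set.
    names = {
        ad["Machine"]
        for ad in ads
        if ad.get("CUDAGlobalMemoryMb", 0) >= min_mb
    }
    return sorted(names)


# Hypothetical sample in the assumed shape of the -json output.
SAMPLE = """
[
  {"Machine": "gpu-a.example.org", "CUDAGlobalMemoryMb": 16000},
  {"Machine": "gpu-b.example.org", "CUDAGlobalMemoryMb": 40000},
  {"Machine": "gpu-b.example.org", "CUDAGlobalMemoryMb": 40000}
]
"""

print(machines_with_gpu_memory(SAMPLE, 32000))
```

With a real file, pass the contents of stats.json instead of `SAMPLE`, for example `machines_with_gpu_memory(open("stats.json").read(), 32000)`.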