February 20, 2024

Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs

While the methodologies found in this article were developed while working with multi-node multi-gpu PyTorch-based training, they can, of course, help with any multi-process multi-node Python programs.

Helper tools

Try to use the following script torch-distributed-gpu-test.py to diagnose the situation.

This will help primarily with discovering network-related issues, and also with quickly understanding how multi-gpu communications work.
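For example, on a single node with 4 gpus one might launch it like so (adjust --nproc_per_node to the number of gpus you have):

python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py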

For code-related issues read the rest of this document.

Approaches to diagnosing multi-gpu hanging / deadlocks

py-spy

First do pip install py-spy.

Now you can attach to each process with:

py-spy dump -n -p PID

and it will tell you where the process hangs (very often it's an NCCL collective function or a barrier).

  • PID is the process id of the hanging python process.
  • -n is useful if you want to see stack traces from python extensions written in C, C++, etc., as the program may hang in one of the extensions
  • you may need to add sudo before the command - for more details see this note.

If you have no sudo access your sysadmin might be able to perform this for you:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

which will allow you to run py-spy (and strace) without needing sudo. Beware of the possible security implications - but typically if your compute node is inaccessible from the Internet it's less likely to be a risk.

To make this change permanent edit /etc/sysctl.d/10-ptrace.conf and set:

kernel.yama.ptrace_scope = 0
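and then reload the sysctl settings so the change takes effect without a reboot (a typical invocation on most Linux distros; yours may differ):

sudo sysctl --system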

Here is an example of a py-spy dump python stack trace:

Thread 835995 (active): "MainThread"
    broadcast (torch/distributed/distributed_c10d.py:1191)
    _aggregate_total_loss (deepspeed/runtime/pipe/engine.py:540)
    train_batch (deepspeed/runtime/pipe/engine.py:330)
    train_step (megatron/training.py:436)
    train (megatron/training.py:851)
    pretrain (megatron/training.py:187)
    <module> (pretrain_gpt.py:239)

The very first line is where the program is stuck.

If the hanging happens inside a C++ extension, add --native to the py-spy command and it'll also show any non-python code.

multi-process py-spy

Now, how do you do it for multiple processes? Doing it one-by-one is too slow, so let's do them all at once.

If the launch command was python, what you do is:

pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}

if deepspeed:

pgrep -P $(pgrep -o deepspeed) | xargs -I {} py-spy dump --pid {}

for accelerate:

pgrep -P $(pgrep -o accelerate) | xargs -I {} py-spy dump --pid {}

you get the idea.

This particular approach will only analyse the main processes and not various other sub-processes/threads spawned by these processes. So if you have 8 gpus and 8 processes, the above will generate 8 stack traces.

If you want all processes and their subprocesses, then you’d just run:

pgrep -f python | xargs -I {} py-spy dump --pid {}

(and as before replace python with the name of the launcher program if it’s not python)
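If you'd rather have each dump saved to its own file for easier side-by-side comparison, a simple bash loop works too (a sketch - adjust the pgrep pattern as discussed above):

for pid in $(pgrep -f python); do py-spy dump --pid $pid > py-spy-$pid.txt; done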

multi-node py-spy via srun

What if you have multiple nodes?

You can of course ssh to each node interactively and dump the stack traces.

If you’re using the SLURM environment you can use srun to do it on all nodes for you.

Now in another console get the SLURM_JOBID (or get it from salloc log):

squeue -u `whoami` -o "%.16i %9P %26j %.8T %.10M %.8l %.6D %.20S %R"

Now use the following srun command, after adjusting the jobid to the SLURM_JOBID reported by the command above:

srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed"

Notes:

  • One must use --gres=gpu:0 for the monitor srun, or otherwise it will block until the main srun (the one running the training) exits.
  • Each node will generate its unique log file named trace-nodename.out - so this helps to identify which node(s) are problematic. You can remove --output=trace-%N.out if you want it all dumped to stdout.
  • In some SLURM versions you may also need to add --overlap.
  • In some SLURM versions the jobid might not match the one reported in squeue, so you have to get the correct SLURM_JOB_ID from the logs of the job you’re trying to “attach” to - i.e. your srun job that allocated the GPUs.
  • Sometimes bash doesn’t work, but sh does. I think it has to do with which dot files get sourced.
  • You might need to also activate a custom python environment, which you can do like so:

srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'conda activate myenvname; ps auxc | ... ' || echo "failed"

or you can do it inside ~/.bashrc or whatever shell’s rc file you decide to use.

As mentioned before if you want just the main processes you’d use this instead:

srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}' || echo "failed"

Adjust python if need be as explained in the multi-gpu section above.

The previous longer command will deliver traces for all python processes.

If you’re not getting anything, start with the basic debug like:

srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'date'

Once you know you’re talking to all the nodes, you can progressively unravel the depth of calls, as in:

srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'date'
srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -o python'
srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) '
srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 sh -c 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}'

and at each stage check that the output makes sense - e.g. for the 2nd and 3rd calls you should be getting the PIDs of the processes.

multi-node py-spy via pdsh

pdsh seems to be a good, easy tool for accomplishing remote work on multiple nodes. Say you’re running on 2 nodes with hostnames nodename-5 and nodename-8; then you can quickly test that remote execution is working by getting the date on all of these hosts with just:

$ PDSH_RCMD_TYPE=ssh pdsh -w nodename-[5,8] "date"
nodename-5: Wed Oct 25 04:32:43 UTC 2023
nodename-8: Wed Oct 25 04:32:45 UTC 2023

footnote: pdsh should be available via a normal OS package installer

Once you’ve tested that date works, it’s time to move to py-spy.

To run py-spy on all python processes that are sub-processes, it’d be:

PDSH_RCMD_TYPE=ssh pdsh -w nodename-[5,8] 'pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}'

but as you’re likely to need your ~/.bashrc to run, you will need to clone it into ~/.pdshrc, reduce that clone to only what needs to be run (e.g. modifying PATH, activating a conda environment) and then source it, like:

PDSH_RCMD_TYPE=ssh pdsh -w nodename-[5,8] 'source ~/.pdshrc; pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}'

The reason you need a startup script is because usually ~/.bashrc starts with:

# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac

so when you run such non-interactive workflows Bash won’t process its ~/.bashrc normally (it exits early) and thus anything relying on this startup script won’t work. So you can either remove the non-interactive early-exit code above or fork ~/.bashrc into a startup file that only contains what’s needed for the remote command to succeed.

footnote: there is nothing special about ~/.pdshrc - any other name would do, since you’re manually sourcing it.
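For example, a minimal ~/.pdshrc might contain nothing more than something like this (the conda paths and env name are placeholders - adjust to your setup):

export PATH=$HOME/miniconda3/bin:$PATH
source $HOME/miniconda3/etc/profile.d/conda.sh
conda activate myenvname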

And if your system isn’t set up to run py-spy w/o sudo as explained a few sections up, you’d need something like this:

PDSH_RCMD_TYPE=ssh pdsh -w nodename-[5,8] 'sudo bash -c "source ~/.pdshrc; pgrep -P $(pgrep -o python) | xargs -I {} py-spy dump --pid {}"'

Of course, you may need to edit the pgrep section to narrow down which processes you want to watch.

Additionally, to avoid being prompted with:

Are you sure you want to continue connecting (yes/no/[fingerprint])?

for every new node you haven’t logged into yet, you can disable this check with:

echo "Host *" >> ~/.ssh/config
echo "  StrictHostKeyChecking no" >> ~/.ssh/config

Here I assume you’re on an isolated cluster so you don’t need to worry about security issues and thus bypassing such a check is most likely OK.

multi-node py-spy via ds_ssh

This is yet another way, but please make sure to read the pdsh section above first.

The following notes require pip install deepspeed.

In one SLURM environment I also attempted using pdsh via ds_ssh, but somehow I wasn’t able to run py-spy remotely - the main issue was that the remote ssh command wasn’t getting the same environment as when I was logged in interactively via ssh. But if you have sudo access on the compute nodes then you could do:

First prepare hostfile:

function makehostfile() {
perl -e '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"};
$slots=8 if $slots==0; # workaround 8 gpu machines
@nodes = split /\n/, qx[scontrol show hostnames $ENV{"SLURM_JOB_NODELIST"}];
print map { "$_ slots=$slots\n" } @nodes'
}
makehostfile > hostfile

Adapt $slots to the number of gpus per node. You may have to adapt this script if your scontrol produces a different output.
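The resulting hostfile should look something like this (the hostnames here are just placeholders):

nodename-5 slots=8
nodename-8 slots=8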

Now run the py-spy extraction command over all participating nodes:

ds_ssh -f hostfile "source ~/.pdshrc; ps aux | grep python | grep -v grep | grep `whoami` | awk '{print \$2}' | xargs -I {} sudo py-spy dump --pid {} "

Notes:

  • Put inside ~/.pdshrc whatever init code you may need to run. If you don’t need any, you can remove source ~/.pdshrc; from the command line.
  • If you don’t have it already, ds_ssh is installed when you do pip install deepspeed.
  • You might need to export PDSH_RCMD_TYPE=ssh if you get a rcmd: socket: Permission denied error.

Network-level hanging

The hanging could be happening at the network level. NCCL_DEBUG=INFO can help here.

Run the script with the NCCL_DEBUG=INFO env var and try to study the outcome for obvious errors. It will tell you which device it’s using, e.g.:

DeepWhite:21288:21288 [0] NCCL INFO NET/Socket : Using [0]enp67s0:192.168.50.21<0>

So it’s using the interface enp67s0 over 192.168.50.21.

Is your 192.168.50.21 firewalled? Or is it somehow a misconfigured network device?

Does it work if you use the loopback device 127.0.0.1?

NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py

If not, see what other local network devices you have via ifconfig and try one of those instead of lo.

It’s currently using enp67s0 in the above example.
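If another interface looks healthier, you can force NCCL onto it explicitly via NCCL_SOCKET_IFNAME, e.g. (eth1 here is just a placeholder - substitute your actual interface name):

NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth1 python -m torch.distributed.run --nproc_per_node 4 --nnodes 1 torch-distributed-gpu-test.py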

Isolate problematic GPUs

You can also try to see if only some GPUs fail.

For example, does it work if you use the first 2 or the last 2 gpus:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

then the 2nd pair:

CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
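If you have more gpus you can sweep through all the pairs with a small bash loop (a sketch assuming 8 gpus per node; timeout is used so that a hanging pair doesn't stall the whole sweep):

for pair in 0,1 2,3 4,5 6,7; do
    echo "testing gpus $pair"
    # kill the test after 2 minutes if this pair hangs, then continue the sweep
    CUDA_VISIBLE_DEVICES=$pair timeout 120 python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
done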

python trace

Now what happens when the training doesn’t just hang, but the hanging process stops responding? E.g. this happens when there is a serious hardware issue. But what if it is recurrent? py-spy won’t help here, since it won’t be able to attach to a process that is not responding.

So next came the idea of tracing all calls, like one does with strace(1). I researched python call-tracing facilities and discovered that python has a trace sub-system.

The following code will trace all python calls and log them to the console and into a dedicated per-process log file, via a custom Tee module I added.

This then can help to understand where some processes stopped responding, since we will have the log of the last call and all the previous calls before it went unresponsive.

$ cat train.py
[...]

def main():
    # [...]
    train()

import re
class Tee:
    """
    A helper class to tee print's output into a file.
    Usage:
    sys.stdout = Tee(filename)
    """

    def __init__(self, filename):
        self.stdout = sys.stdout
        self.file = open(filename, "a")

    def __getattr__(self, attr):
        return getattr(self.stdout, attr)

    def write(self, msg):
        self.stdout.write(msg)
        self.file.write(msg)
        self.file.flush()

    def flush(self):
        self.stdout.flush()
        self.file.flush()

if __name__ == "__main__":

    import sys
    import trace
    import socket
    import os

    # enable the trace
    if 0:
        cwd = os.path.realpath('.')
        pid = os.getpid()
        hostname = socket.gethostname()
        local_rank = int(os.environ["LOCAL_RANK"])
        trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt"

        # create a Trace object, telling it what to ignore, and whether to
        # do tracing or line-counting or both.
        tracer = trace.Trace(
            ignoredirs=[sys.prefix, sys.exec_prefix],
            trace=1,
            count=1,
            timing=True,
        )

        # run the new command using the given tracer
        sys.stdout = Tee(trace_output_file)
        tracer.run('main()')
    else:
        main()

This code doesn’t require any special handling other than enabling the trace by changing if 0 to if 1.

If you don’t set ignoredirs, this will now dump all python calls, which means you should expect many GBs of data logged, especially if you have hundreds of GPUs.

Of course, you don’t have to start tracing from main - if you suspect a specific area you can start tracing there instead and it’ll be much faster with less data to save.
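For example, with the setup above you could replace the call to tracer.run('main()') with one that traces only a suspect function (suspect_function and its arguments are made-up names here):

# instead of tracing the whole program via tracer.run('main()'),
# trace just the code area you suspect:
tracer.runfunc(suspect_function, arg1, arg2)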

I wish I could tell trace which packages to follow, but alas it only supports dirs to ignore, which is much more difficult to set, and thus you end up with a lot more data than needed. But still this is a super useful tool for debugging hanging processes.

Also, your code will now run much much slower and the more packages you trace the slower it will become.

NicerTrace

As Trace proved to provide very limited usability when debugging a complex multi-node multi-hour run crash, I have started working on a better version of the trace python module.

You can find it here: NicerTrace

I added multiple additional flags to the constructor and made the output much more useful. You will find a full working example in that same file; just run:

python trace/NicerTrace.py

and you should see:

        trace/NicerTrace.py:1 <module>
0:00:00 <string>:     1:         trace/NicerTrace.py:185 main
0:00:00 NicerTrace.py:   186:     img = Image.new("RGB", (4, 4))
        PIL.Image:2896 new
0:00:00 Image.py:  2912:     _check_size(size)
        PIL.Image:2875 _check_size
0:00:00 Image.py:  2883:     if not isinstance(size, (list, tuple)):
0:00:00 Image.py:  2886:     if len(size) != 2:
0:00:00 Image.py:  2889:     if size[0] < 0 or size[1] < 0:

As you will see, in the example I set:

            packages_to_include=["PIL"],

so it’ll trace PIL plus anything that is not under site-packages. If you need to trace another package, just add it to that list.

This is a very fresh work-in-progress package, so it’s evolving as we are trying to make it help us resolve a very complex crashing situation.

Working with generated trace files

When the per-node-rank trace files have been generated, the following might be helpful to quickly analyse the situation:

  • grep for a specific match and also print the file and line number where it was found:
grep -n "backward" trace*
  • show the last few lines of each trace file together with the name of that file:
find . -name "trace*" -exec sh -c 'echo "$1: $(tail -3 "$1")"' _ {} \;
  • or similar to the above, but print the last 5 lines with the leading filename and some vertical white space for easier reading:
find . -name "trace*" -exec sh -c 'echo; echo $1; echo "$(tail -5 "$1")"' _ {} \;
  • count how many times grep matched a given pattern in each file and print the matched file (in this example matching the pattern backward):
find . -name "trace*" -exec sh -c 'echo "$1: $(grep "backward" $1 | wc -l)"' _ {} \;

good old print

Now, once you’ve discovered where the hanging happens, to further understand why it is happening you would ideally use a debugger, but more often than not debugging multi-process (multi-node) issues can be very difficult.

In such situations the good old print works. You just need to add some debug prints before the calls where things hang - things that would help understand what led to the deadlock. For example, perhaps a barrier was missing and one or a few processes skipped some code, while the rest of the processes are still blocked waiting for everybody to send some data (for example in NCCL collective functions like gather or reduce).
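Here is a contrived sketch of the kind of bug described above (assuming the process group is already initialized):

import torch
import torch.distributed as dist

tensor = torch.ones(1).cuda()
if dist.get_rank() != 0:
    # bug: rank 0 never enters this branch, so it never joins the collective
    # and all the other ranks block here forever waiting for it
    dist.all_reduce(tensor)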

You, of course, want to prefix each print with the rank of the process so that you can tell which is which. For example:

import torch.distributed as dist
print(f"{dist.get_rank()}: passed stage 0")

What you will quickly discover is that if you have multiple GPUs these prints will be badly interleaved and you will have a hard time making sense of the debug data. So let’s fix this. We are going to override print with a custom version of it which uses flock to ensure that only one process can write to stdout at the same time.

The helper module printflock.py is included here. To activate it just run this at the top of the module you’re debugging:

from printflock import printflock as print

and now all your print calls in that module will magically be non-interleaved. You can, of course, just use printflock directly:

from printflock import printflock
import torch.distributed as dist
printflock(f"{dist.get_rank()}: passed stage 0")

core files

If the hanging happens inside non-python code, and py-spy --native isn’t enough for some reason, you can make the hanging program dump a core file, which is done with one of these approaches:

gcore <pid>
kill -ABRT <pid>

and then you can introspect the core file as explained here.
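For example, a typical gdb session over such a core file might look like this (a sketch - the python binary path and the core file name will vary on your system, see the core_pattern discussion below):

gdb $(which python) /tmp/core-python.12345.nodename-5.1697582345
(gdb) bt                    # backtrace of the current thread
(gdb) thread apply all bt   # backtraces of all threads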

If you don’t get the core file dumped, you need to configure your system to allow it and also specify where the core files should be saved.

To ensure the file is dumped in bash run (other shells may use a different command):

ulimit -c unlimited

To make this persistent run:

echo '* soft core unlimited' >> /etc/security/limits.conf

On some systems like Ubuntu the core files are hijacked by apport, check the contents of /proc/sys/kernel/core_pattern to see where they are sent. You can override where they are sent with:

sudo sysctl -w kernel.core_pattern=/tmp/core-%e.%p.%h.%t

Change the directory if you want to, but make sure that the user the program is running under can write to that directory. To make this change permanent edit /etc/sysctl.conf and add kernel.core_pattern=/tmp/core-%e.%p.%h.%t (or modify if it’s already there).

footnote: see man core for all the different templates available

On Ubuntu, by default core files are sent to apport, which may save the core to /var/lib/apport/coredump or /var/crash. But you can change that as explained above.

A quick way to test if your setup can generate a core file is:

sleep 10 &
killall -SIGSEGV sleep

Normally SIGSEGV isn’t recommended for a real situation of diagnosing a hanging program, because SIGSEGV is likely to launch a sighandler, but for this test it’s good enough.

Code loops

Code loops can be tricky to debug in hanging scenarios. If you have code like the following:

for i, d in enumerate(data):
    some_hanging_call(d)

it’s possible that one process hangs in the first iteration and another process in the second iteration, which makes things very confusing. But the stack trace won’t give such an indication, as the line numbers would be the same, even though the processes aren’t in the same place code progression-wise.

In such situations unroll the loop to be:

d_iter = iter(data)
some_hanging_call(next(d_iter))
some_hanging_call(next(d_iter))

and now when you run py-spy the line numbers will be correct. The processes hanging in the first iteration will report the first some_hanging_call and those in the second iteration in the second call - as each now has its own line.

Hardware-specific issues

AMD/ROCm hangs or slow with IOMMU enabled

AMD Instinct users may need to either disable IOMMU or set it to:

GRUB_CMDLINE_LINUX_DEFAULT="iommu=soft"

in /etc/default/grub (the grub config file could be elsewhere depending on the OS).

To disable it instead, use GRUB_CMDLINE_LINUX="amd_iommu=off".
