ethan-s
4/9/2016 - 10:09 PM

What happens when you cancel a Jenkins job

Unfinished draft; do not use until this notice is removed.

We were seeing unexpected behavior in the processes that Jenkins launches when a user clicks "cancel" on a job, such as:

  • apparently stale lockfiles and pidfiles
  • overlapping processes
  • jobs apparently ending without performing cleanup tasks
  • jobs continuing to run after being reported "aborted"

This is an investigation into exactly what happens when Jenkins cancels a job, and a set of best practices for writing Jenkins jobs that behave the way you expect.

First, recall the name and purpose of some Unix process signals:

  • 1, HUP, "hangup"; the controlling terminal disconnected, or the controlling process died.
  • 2, INT, "interrupt"; the user typed CTRL+C.
  • 9, KILL, "kill"; terminate immediately, no cleanup. Can't be trapped.
  • 15, TERM, "terminate"; exit, but clean up first. May be trapped. The default signal sent by the kill command.
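As a quick check of the list above, here is a minimal Python sketch that prints the numbers of these four signals and tests whether a handler can be installed for each. The helper name can_trap is made up for this illustration.

```python
import signal

# Numbers and names of the four signals discussed above.
for sig in (signal.SIGHUP, signal.SIGINT, signal.SIGKILL, signal.SIGTERM):
    print(int(sig), sig.name)


def can_trap(sig):
    """Return True if a handler can be installed for sig (hypothetical helper)."""
    try:
        previous = signal.signal(sig, lambda signum, frame: None)
    except (OSError, ValueError):
        # The operating system refuses to install a handler (as for KILL).
        return False
    if previous is not None:
        signal.signal(sig, previous)  # put the earlier handler back
    return True
```

Calling can_trap(signal.SIGTERM) should succeed, while can_trap(signal.SIGKILL) should report that KILL cannot be trapped.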

Jenkins

When a Jenkins job is cancelled, Jenkins sends TERM to the process group of the process it spawned and immediately disconnects, reporting "Finished: ABORTED" regardless of the state of the job.

This causes the spawned process and all its subprocesses (unless they are spawned in new process groups) to receive TERM.

This has the same effect as running the job in your terminal and pressing CTRL+C, except that CTRL+C sends INT, not TERM.
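To make "sending TERM to a process group" concrete, here is a small Python sketch. It is not Jenkins's actual code; terminate_group and demo are hypothetical names. It starts a child in its own process group and signals the whole group with os.killpg.

```python
import os
import signal
import subprocess
import sys


def terminate_group(proc):
    """Send TERM to the child's entire process group and return its exit status."""
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    return proc.wait(timeout=10)


def demo():
    # start_new_session=True gives the child its own process group, so
    # signaling that group does not also hit this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    return terminate_group(child)
```

A negative exit status from Popen.wait() means the child was killed by that signal number, so demo() returning -15 indicates the child died from TERM.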

Since Jenkins disconnects immediately from the child process and reports no further output:

  • it can misleadingly appear that signal handlers are not being invoked (which has misled many into thinking that Jenkins has KILLed the process).

  • it can misleadingly appear that spawned processes have completed or exited, while they continue to run.

Bash

When a bash script has trapped a signal, it waits for the foreground command it is currently running to complete before running the handler. If a subprocess is stuck or ignoring signals, the controlling script may therefore take a very long time to exit, despite the existence of a signal handler. For the same reason, if one of your subprocesses does not properly exit upon being signaled, it is not enough to add a signal handler in the calling shell script that KILLs it more forcefully: bash will not run that handler until the subprocess has completed.
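For contrast, a wrapper written in Python can react while the child is still running. The following is a minimal sketch, not a hardened implementation; run_with_escalation, Interrupted, and the five-second grace period are assumptions made for illustration. The handler raises an exception to break out of the blocking wait, and the wrapper then sends TERM to the child and falls back to KILL if the child does not exit in time.

```python
import signal
import subprocess

SIGNALS = (signal.SIGHUP, signal.SIGINT, signal.SIGTERM)


class Interrupted(Exception):
    """Raised by the signal handler to break out of the blocking wait."""


def _interrupt(signum, frame):
    raise Interrupted(signum)


def run_with_escalation(argv, grace_seconds=5.0):
    """Run argv; if signaled, send the child TERM, then KILL after a grace period."""
    child = subprocess.Popen(argv)
    previous = {s: signal.signal(s, _interrupt) for s in SIGNALS}
    try:
        try:
            return child.wait()
        except Interrupted:
            # Ignore repeat signals while shutting the child down.
            for s in SIGNALS:
                signal.signal(s, signal.SIG_IGN)
            child.terminate()  # polite request: TERM
            try:
                return child.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                child.kill()  # KILL cannot be trapped or ignored
                return child.wait()
    finally:
        # Restore whatever handlers were installed before the call.
        for s, handler in previous.items():
            signal.signal(s, handler if handler is not None else signal.SIG_DFL)
```

The function must be called from the main thread, because Python only allows signal handlers to be installed there.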

If no hashbang is given on the first line of Jenkins' shell command specification, Jenkins defaults to /bin/sh -xe to interpret the script. When -e is active, any spawned process that exits with a nonzero status causes the script to abort. As a result, it can misleadingly appear that sh signals its subprocesses and automatically exits when signaled, even if a trap is set. The same job may then behave differently when it is written in another language, or when someone pastes in a #!/bin/sh line with no -e.
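The effect of -e can be seen in isolation with a short Python sketch that runs the same two-line script under /bin/sh with and without the flag. SCRIPT and run_sh are names invented for this example, and it assumes /bin/sh exists on the machine.

```python
import subprocess

# A failing command followed by a command that should only run without -e.
SCRIPT = "false\necho reached"


def run_sh(flags):
    """Run SCRIPT under /bin/sh with the given flags; return (exit status, stdout)."""
    result = subprocess.run(
        ["/bin/sh", *flags, "-c", SCRIPT],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout
```

With the -xe flags that Jenkins uses by default, the script stops at the failing command and never prints "reached"; with no flags it carries on to the echo.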

SSH

In some Jenkins jobs, we have a process (script) that invokes another on a different machine through ssh.

OpenSSH, unlike RSH, does not pass received signals to the remote process, nor does it provide a mechanism to manually send a signal to the remote process, even though this capability is specified in RFC 4254. It will pass a CTRL+C character, but only when a pseudoterminal is allocated (i.e. only when invoked with -tt). It will send HUP to the remote process group when it disconnects, but again only when a pseudoterminal is allocated.
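If the ssh call is made from a Python script, forcing pseudoterminal allocation is just a matter of adding -tt to the argument list. A small sketch follows; ssh_argv, the host name, and the remote script are hypothetical and only illustrate the shape of the command.

```python
import shlex


def ssh_argv(host, remote_command, force_tty=True):
    """Build an ssh argument list, optionally forcing a pseudoterminal with -tt."""
    argv = ["ssh"]
    if force_tty:
        argv.append("-tt")  # allocate a pty even when stdin is not a terminal
    argv += [host, remote_command]
    return argv


# Hypothetical host and script, shown as a shell-quoted command line.
print(shlex.join(ssh_argv("build-host.example", "./deploy.sh")))
```

The resulting list can be passed directly to subprocess.Popen without going through a shell.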

Cleanup actions in Bash vs. Python

Recall that scripts may receive signals TERM (from Jenkins), HUP (from ssh), or INT (from you typing CTRL+C while testing).

The default action for shell scripts upon receiving HUP, INT, or TERM is to abort, which triggers the EXIT handler just before execution ends. So for shell scripts, the EXIT trap is a good place for cleanup code. (EXIT is not really a signal; it is a POSIX shell mechanism to trigger a handler just before execution ends, regardless of how it ends.)

In contrast, the default action for Python upon receiving INT is to raise a KeyboardInterrupt exception. If unhandled, that exception causes Python to exit, which triggers any handlers registered with atexit.register() just before execution ends. However, atexit handlers are not invoked by default when Python receives HUP or TERM; instead, the process is killed immediately. So, for your atexit handler to fire, you must also explicitly trap those signals with signal.signal() and cause execution to end normally.
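The difference can be observed with a small experiment. This is a sketch under stated assumptions: CHILD and cleanup_ran are names made up for illustration, and the marker file is just a way to see whether the atexit handler ran. It starts a child Python process that registers an atexit handler, sends it TERM, and checks whether the handler left its marker behind, once with the default TERM behavior and once with a handler that calls sys.exit().

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child program: registers an atexit handler that creates a marker file,
# optionally traps TERM, then announces readiness and sleeps.
CHILD = textwrap.dedent('''
    import atexit, signal, sys, time
    marker, mode = sys.argv[1], sys.argv[2]
    atexit.register(lambda: open(marker, "w").write("cleaned up"))
    if mode == "trap":
        signal.signal(signal.SIGTERM, lambda n, _: sys.exit(1))
    print("ready", flush=True)
    time.sleep(60)
''')


def cleanup_ran(trap_term):
    """Send TERM to the child; return True if its atexit handler ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "marker")
        mode = "trap" if trap_term else "default"
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker, mode],
            stdout=subprocess.PIPE,
        )
        child.stdout.readline()  # wait until the child has set itself up
        child.send_signal(signal.SIGTERM)
        child.wait(timeout=10)
        child.stdout.close()
        return os.path.exists(marker)
```

cleanup_ran(False) should come back False, because TERM kills the child before atexit gets a chance to run, while cleanup_ran(True) should come back True.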

Takeaways

  1. If you need to write a wrapper for a process that does not exit when signaled, write it in Python, or use a shell trick (run the subprocess as a background task and call wait).

  2. If you're doing remote orchestration through ssh, always invoke it with -tt so that the remote processes will receive HUP if the ssh connection goes away. If a pseudoterminal causes problems, there are other workarounds like an 'EOF to SIGHUP' wrapper.

  3. For shell scripts, trap EXIT and place cleanup code there. For Python, use the atexit module and register a cleanup handler, and also use the signal module to either raise an exception or call sys.exit() upon HUP and TERM, like this:

     import signal, sys
     # Turn HUP and TERM into SystemExit so that atexit handlers run.
     for s in [signal.SIGHUP, signal.SIGTERM]:
         signal.signal(s, lambda n, _: sys.exit("Received signal %d" % n))
    

References

#FIXME

  • Verify that Jenkins signals the process group. It doesn't appear so from Jenkins source code. Maybe the shell is reissuing the signal to its own process group?
  • Identify Bash-isms; will a system with a different /bin/sh (like dash) behave differently than I have described here?
  • Clean up and include my small scripts which demonstrate each of the assertions above.
  • What negative side effects could forcing ssh to perform pty allocation have on a non-interactive script?
  • Will signals propagate through sudo?