What happens when you cancel a Jenkins job
Unfinished draft; do not use until this notice is removed.
We were seeing some unexpected behavior in the processes that Jenkins launches when the Jenkins user clicks "cancel" on their job. Unexpected behaviors like:
This is an investigation into what exactly happens when Jenkins cancels a job, and a set of best-practices for writing Jenkins jobs that behave the way you expect.
First, recall the name and purpose of some Unix process signals:
HUP, "hangup"; the controlling terminal disconnected, or the controlling process died.
INT, "interrupt"; the user typed CTRL+C.
KILL, "kill"; terminate immediately, no cleanup. Can't be trapped.
TERM, "terminate," but clean up first. May be trapped. The default when using the kill command.
When a Jenkins job is cancelled, Jenkins sends TERM to the process group of the process it spawned and immediately disconnects, reporting "Finished: ABORTED," regardless of the state of the job.
This causes the spawned process and all its subprocesses (unless they are spawned in new process groups) to receive TERM.
This is the same effect as running the job in your terminal and pressing CTRL+C, except that in the latter situation INT is sent, not TERM.
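As a rough illustration of the cancellation described above, this Python sketch starts a child in its own process group and sends TERM to that group. It only imitates the signal delivery; it is not how Jenkins is implemented, and the helper name cancel_like_jenkins is made up for this example.

```python
import os
import signal
import subprocess
import sys


def cancel_like_jenkins(argv):
    """Start argv in a new process group, send TERM to the group, return the exit status."""
    # start_new_session=True makes the child a session and process-group
    # leader, so its process-group id is the same as its pid.
    child = subprocess.Popen(argv, start_new_session=True)
    # Signal the whole group: the child and any subprocesses that stayed in it.
    os.killpg(child.pid, signal.SIGTERM)
    # A negative return value -N means the child was killed by signal N.
    return child.wait()
```

Calling it with a command that does not trap TERM, for example `[sys.executable, "-c", "import time; time.sleep(60)"]`, should return `-signal.SIGTERM` almost immediately.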
Since Jenkins disconnects immediately from the child process and reports no further output:
it can misleadingly appear that signal handlers are not being invoked (which has misled many to think that Jenkins has KILLed the process);
it can misleadingly appear that spawned processes have completed or exited, while they continue to run.
When a bash script has trapped a signal, it waits for the command it is currently running to complete before running the handler. If a subprocess is stuck or ignoring signals, the controlling script may take a very long time to exit, despite the existence of a signal handler. If one of your subprocesses does not properly exit upon being signaled, it's not enough to create a signal handler in the calling shell script that more forcefully KILLs it, because bash will wait for the process to complete before attempting that action.
If no hashbang is given on the first line of Jenkins' shell command specification, Jenkins defaults to /bin/sh -xe to interpret the script. When -e is active, any spawned process that exits with a nonzero status causes the script to abort. So it can misleadingly appear that sh signals its subprocesses and automatically exits when signaled even if a trap is set, but the job behaves differently with other languages or when someone pastes in a #!/bin/sh with no -e.
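To see the effect of -e in isolation, here is a small Python sketch that runs the same two-line script under /bin/sh with and without the flag. The script text and the helper name run_sh are invented for the example, and passing the script with -c is an assumption made for brevity, not a description of how Jenkins hands the script to the shell.

```python
import subprocess

# Hypothetical two-line job script: the first command is killed by TERM, so
# its exit status is nonzero; the second line shows whether sh kept going.
SCRIPT = "sh -c 'kill -TERM $$'\necho still-running\n"


def run_sh(flags):
    """Run SCRIPT under /bin/sh with the given flags; return (exit status, stdout)."""
    result = subprocess.run(
        ["/bin/sh", *flags, "-c", SCRIPT],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # hide the -x trace and any job messages
        text=True,
    )
    return result.returncode, result.stdout
```

With `run_sh(["-xe"])` the shell should stop after the first command and never print still-running; with `run_sh([])` it should carry on to the echo.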
In some Jenkins jobs, we have a process (script) that invokes another on a different machine through ssh.
OpenSSH, unlike RSH, does not pass received signals to the remote process, nor does it provide a mechanism to manually send a signal to the remote process, even though this capability is specified in RFC 4254. It will pass a CTRL+C character, but only when a pseudoterminal is allocated (i.e. only when invoked with -tt). It will send HUP to the remote process group when it disconnects, but only when a pseudoterminal is allocated.
Recall that scripts may receive signals TERM (from Jenkins), HUP (from ssh), or INT (from you typing CTRL+C while testing).
The default action for shell scripts upon receiving HUP, INT, and TERM is to abort, which triggers the EXIT handler just before execution ends. So for shell scripts, the EXIT trap is a good place for cleanup code. (EXIT is not really a signal; it is a POSIX shell mechanism to trigger a handler just before execution ends, regardless of how it ends.)
In contrast, the default action for Python upon receiving INT is to raise a KeyboardInterrupt exception, which if unhandled causes Python to abort, which triggers any handlers registered with atexit.register() just before execution ends. However, that handler is by default not invoked upon Python receiving HUP or TERM; instead, Python aborts immediately. So, for your atexit handler to fire you must also explicitly trap those signals with signal.signal() and cause execution to end.
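The default behavior described above can be checked with a short experiment. This sketch starts a child Python process that registers an atexit handler and traps nothing, signals it, and returns whatever the child printed afterwards. The child script and the name output_after_signal are made up for the illustration.

```python
import subprocess
import sys

# Hypothetical child: registers an atexit handler, traps no signals.
CHILD = """
import atexit, time
atexit.register(lambda: print("cleanup ran", flush=True))
print("ready", flush=True)
time.sleep(30)
"""


def output_after_signal(signum):
    """Signal the child once it is ready; return (later output, exit status)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # block until the child has printed "ready"
    child.send_signal(signum)
    rest, _ = child.communicate()  # everything printed after "ready"
    return rest, child.returncode
```

Calling `output_after_signal(signal.SIGTERM)` should show that the child died from the signal without ever printing "cleanup ran".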
If you need to write a wrapper for a process that does not exit when signalled, write it in Python, or use a shell trick (run the subprocess as a background task and call wait).
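A Python wrapper along those lines might look like the sketch below. It assumes a grace period is acceptable before escalating; the names run_with_forced_exit, Signalled and grace_seconds are invented here, and the five-second default is arbitrary. A second signal arriving during the grace period is not handled.

```python
import signal
import subprocess


class Signalled(Exception):
    """Raised by the handler so that a blocking wait() is interrupted."""


def _raise_signalled(signum, _frame):
    raise Signalled(signum)


def run_with_forced_exit(argv, grace_seconds=5.0):
    """Run argv; on HUP or TERM, pass TERM on, then KILL after a grace period."""
    watched = (signal.SIGHUP, signal.SIGTERM)
    # Install the handlers and remember the old ones so they can be restored.
    previous = {s: signal.signal(s, _raise_signalled) for s in watched}
    child = subprocess.Popen(argv)
    try:
        try:
            return child.wait()
        except Signalled:
            child.terminate()  # polite request first (TERM)
            try:
                return child.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                child.kill()  # the child ignored TERM, so send KILL
                return child.wait()
    finally:
        for s, handler in previous.items():
            if handler is not None:
                signal.signal(s, handler)
```

Unlike a trap in bash, the Python handler runs while the child is still running, so the wrapper gets the chance to escalate to KILL.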
If you're doing remote orchestration through ssh, always invoke it with -tt so that the remote processes will receive HUP if the ssh connection goes away. If a pseudoterminal causes problems, there are other workarounds like an 'EOF to SIGHUP' wrapper.
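If the orchestration is driven from Python, one way to make sure -tt is never forgotten is to build the ssh command line in a single helper. This is only a sketch: the helper name ssh_argv and the host and command in the usage note are placeholders.

```python
import shlex


def ssh_argv(host, remote_argv):
    """Build an ssh command line that always forces a pseudoterminal (-tt)."""
    # ssh hands the remote command to the remote shell as one string,
    # so quote each argument before joining them.
    return ["ssh", "-tt", host, shlex.join(remote_argv)]
```

The result can be passed to subprocess.Popen or subprocess.run, for example `subprocess.run(ssh_argv("build-host.example", ["./deploy.sh"]))`.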
For shell scripts, trap EXIT and place cleanup code there. For Python, use the atexit module and register a cleanup handler, and also use the signal module to either raise an exception or call sys.exit() upon HUP and TERM, like this:
import signal
import sys
for s in [signal.SIGHUP, signal.SIGTERM]:
    signal.signal(s, lambda n, _: sys.exit("Received signal %d" % n))
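Putting the atexit and signal halves of that advice together, a helper might look like the following sketch. The name install_cleanup is hypothetical, and it must be called from the main thread, because signal.signal() only works there.

```python
import atexit
import signal
import sys


def install_cleanup(cleanup):
    """Arrange for cleanup() to run at normal exit and after HUP or TERM."""
    atexit.register(cleanup)

    def exit_on_signal(signum, _frame):
        # sys.exit() raises SystemExit, so the interpreter shuts down
        # normally and the atexit handlers get their chance to run.
        sys.exit("Received signal %d" % signum)

    for s in (signal.SIGHUP, signal.SIGTERM):
        signal.signal(s, exit_on_signal)
```

INT needs no extra handler here: the resulting KeyboardInterrupt, if unhandled, already ends execution in a way that runs atexit handlers.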
man 7 signal
KILL.)
atexit module documentation
signal module documentation
#FIXME sudo?