Docker pstree: From The Inside

I recently posted about my docker-pstree tool and in the post mentioned that at some point I might port the tool to be 100% “in-container.” Well I couldn’t help myself and figured out how.

Saturday morning I was tinkering with making docker-pstree run from within the container itself. Recall that the intermediate goal is to find all “root” processes in the tree. I started off trying to port docker-root-pids to POSIX sh. This is difficult because the it relies on a hashmap, and POSIX sh doesn’t even have arrays. I considered actually doing math to generate a hash (as in numeric digest) of the keys and then using that to do a seek within a null terminated string, but that seems crazy and because you are doing a seek you might as well just seek using the string anyway (because it’s O(n).)

🔗 Environment based solution

At some point I realized I could use the environment has a sort of hashmap, and came up with the following:

eval `ps -eo pid= -o ppid= | \
  awk '{
    list=list "A" $2 " ";
    print "A" $1 "=A" $2 "; export A" $1
  };
  END { print "export LIST=\"" list "\""}'`

Before I go further, let me break that down:

ps -e says to grab all processes, ps -eo pid= -o ppid= basically says we want the process id, the parent process id, and no header. The awk code tokenizes the input (so $1 is the process id and $2 is the parent process id.) The awk code is two blocks. The first block concatenates A, the parent process id, and a space, to itself every time, so it’s a space separated list of parent process ids. Then the main block prints A$pid=A$ppid; export A$pid. A space concatenates in awk, just like in Python and MySQL. Finally the END block prints the built up list as export LIST="$ppids". The output looks something like this:

A1=A0; export A1
A7=A1; export A7
A8=A1; export A8
A9=A1; export A9
A10=A1; export A10
A11=A1; export A11
A12=A7; export A12
A14=A11; export A14
A15=A10; export A15
A16=A9; export A16
A35094=A8; export A35094
A36488=A0; export A36488
A38783=A36488; export A38783
A38784=A36488; export A38784
export LIST="A0 A1 A1 A1 A1 A1 A7 A11 A10 A9 A8 A0 A36488 A36488 "

When this is evaluated with eval you can then iterate over the parent process ids like this:

for ppid in $LIST; do; echo $ppid; done

Or resolve the parent process ids like this:

for ppid in $LIST; do eval "echo \$$ppid"; done

🔗 `ps` based solution

Then I went on a walk to the farmer’s market (where I saw Ted Danson) and realized, after looking at the output above, that it can be easier:

ps -eopid= -oppid= | \
  grep '\s0$' | \
  awk '{ print $1 }'

So this prints all processes with a parent process id of zero, and then we can do this, like in the original docker-pstree:

ps -eopid= -oppid= | \
  grep '\s0$' | \
  awk '{ print $1 }' | \
  xargs -n1 pstree

🔗 `pstree` based solution

But then my pattern seeking brain realized I could do this:

pstree 0

This is a useful command to run even on a full Linux machine because the kernel threads do not run under init, and on my system the kernel threads are actually process id 2 with a parent process id of zero.

🔗 Tradeoffs

The nice thing about this is that it should work in any setup. If you have a way to run the docker client, the following should work reliably to get you complete and accurate pstree output:

docker exec -it my-container watch -n 1 pstree 0

The bummer is that on a really basic system, watch is limited to integers, and pstree can’t show arguments (and can’t do pretty unicode output.) To overcome the former, assuming you are not on windows, you can probably do this:

watch -n 0.3 docker exec -it my-container pstree 0

The other problem can only be solved by adding a more powerful pstree to your container. If you are using a traditional base image, like Ubuntu or Debian, your pstree probably came with psmisc and is already good enough. For something like Alpine which all my containers are based on, you just need to install psmisc yourself.

🔗 Container Tooling

For a long time I have felt that containers should be as minimal as possible and should not be “tooled up” to help debugging. This is motivated by the desire to create the smallest possible artifact with the lowest attack surface possible. I have colleagues who I respect who think it’s fine and good to put strace, psmisc, bash, etc into their containers for simpler debugging. I think that is definitely the pragmatic path forward, but I have a few ideas that I think would be superior.

🔗 `rkt`’s Stage 1

In rkt there is the concept of a “Stage 1“ which is basically the supervisor and log aggregator for your container. By default, this is systemd. I think it would be interesting to add more tooling to that image, so if you need to debug a container, you just run it differently and suddenly have all this extra tooling. I looked into doing this yesterday but gave up.

I think that the rkt architecture is superior to Docker’s because there is no central daemon. If the Docker daemon crashes, all the containers go down with it (by default,) with rkt this is not even a problem because there is no central daemon. There are other reasons I like rkt better but this is the core reason why.

On the other hand, I can add a user to the docker group and the user is thus fully able to control the docker daemon. In rkt this is impossible without setuiding the rkt binary, which people are reasonably not doing. So all that together means that I’m unlikely to switch my stuff to rkt just yet. I would consider making a super tiny script that just does rkt run commands and setuid‘ing that. We’ll see.

🔗 `nsenter`

What I would love would be the ability to run a command like this:

sudo nsenter --pid --mount --target $pid pstree -Ua 0

The idea being that I would be in the pid namespace, and have access to the container’s /proc, but sadly the --mount namespace is basically chroot. It probably doesn’t have to be, but that’s how it works with nsenter. Because it’s a chroot the pstree implementation is the one inside of the container. Sad.

🔗 Docker Volumes

Docker allows you to mount volumes in a container. Volumes could be data (or binaries) from the host, that end up in a specific directory in the container. For super basic tools this could work, but how many tools work without a boatload of dependencies? At some point, going down this route, you’d end up back in the dependency hell that Docker is supposed to help solve.

🔗 The Magical Overlay

What I really want is a way to bolt on additional tools to a container while it’s running. Today that means using the host, but I think that container systems like Docker and rkt could provide ways to do this while still using a managed container. The rkt fly Stage 1 sorta works, as you get to use a container you made, but running with full host privileges, but it just sounds so difficult and frustrating to use fly to run gdb to debug a container.

I think all that I would need would be a fresh container that has the “to-debug” container mounted in /to-debug (or something, it could be configurable) and has complete access to all of the processes in that container. I suspect that to allow this containers would need to be more nested than they are today in Docker and rkt, more like how lmctfy had child containers.

I definitely don’t think we are there yet when it comes to container debugging, and I think that there is a pattern yet to be realized that will make container debugging more effective than even current systems where you simply have root on a VM.

Posted Mon, Aug 15, 2016

If you're interested in being notified when new posts are published, you can subscribe here; you'll get an email once a week at the most.