[H-GEN] a process that won't die

Thu Oct 5 21:09:22 EDT 2000

On Fri, Oct 06, 2000 at 10:26:49AM +1000, Daniel Quinlan wrote:
> hi,
> 
>    I upgraded one of our machines to Debian 2.2 the other night and the upgrade 
>    of pidentd attempted to create a new user called ident
> 
>    for some reason useradd hung and I had to manually create the user.
> 
>    it was only today when I went to add a new user that I found the process
>    was still running and nothing seems to be able to kill it.
> 
> host:~# ps aux | grep useradd | grep -v grep
> root     23021  0.0  0.7  1432  672 ?        D    Oct03   0:00 useradd -d /var/run/identd -g nogroup -s /bin/false -u 100 identd
> host:~# kill 23021
> host:~# ps aux | grep useradd | grep -v grep
> root     23021  0.0  0.7  1432  672 ?        D    Oct03   0:00 useradd -d /var/run/identd -g nogroup -s /bin/false -u 100 identd
> host:~# kill -9 23021
> host:~# ps aux | grep useradd | grep -v grep
> root     23021  0.0  0.7  1432  672 ?        D    Oct03   0:00 useradd -d /var/run/identd -g nogroup -s /bin/false -u 100 identd
> host:~# kill -15 23021
> host:~# ps aux | grep useradd | grep -v grep
> root     23021  0.0  0.7  1432  672 ?        D    Oct03   0:00 useradd -d /var/run/identd -g nogroup -s /bin/false -u 100 identd
> host:~# 
> 
>    I'm going to reboot the box for a kernel upgrade anyway, but I was wondering
>    if anyone knew what could cause this.

After refreshing my memory from tridge and sfr:

`D' means `disk wait', which is an anachronistic name for
`uninterruptible wait'.  So this means that the process is inside the
kernel, waiting for some resource, and the kernel doesn't have a clean
way to terminate it.  Basically the kernel programmer didn't write
code to do an abnormal exit from this point.  Processes in this state
do not act on signals, not even SIGKILL; the signal just stays pending
until the wait ends.  Normally processes pass through this state for
only a very short time.
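
To pick out just the processes sitting in that state, something like
the following might do.  It is only a sketch: filter_dstate is a name
made up for this example, and it assumes ps-style input with the STAT
column second.

```shell
# Keep the header line plus any row whose STAT column starts with "D".
# filter_dstate is a hypothetical helper; it expects ps-style input
# with the state in the second column.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use, assuming a procps-style ps on Linux:
#   ps -eo pid,stat,wchan,args | filter_dstate
```

A process that stays in this list across several runs is the one to
worry about; anything that only shows up briefly is just doing IO.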

A common way to see processes here is for an NFS server to be
unreachable as the kernel tries to page in from it.  Paging IO is
uninterruptible.  The same goes for other IO errors, so you might like
to check /var/log/messages or /var/log/kern.log to see if you're
getting disk IO errors.
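
A rough way to scan the logs for that kind of trouble is sketched
below.  The function name and the two patterns are assumptions chosen
for illustration; the exact wording of kernel messages varies between
versions and drivers, so treat a clean result as inconclusive.

```shell
# Print log lines that look like block-device or NFS trouble.
# scan_io_errors is a hypothetical helper; the patterns are guesses at
# common kernel wording and will not match every driver's messages.
# -s keeps grep quiet about log files that do not exist.
scan_io_errors() {
    grep -s -i -E 'I/O error|not responding' "$@"
}

# Typical use:
#   scan_io_errors /var/log/messages /var/log/kern.log
```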

Another way to get into it is for some other task to have left a
semaphore set inside the kernel, so that 23021 can never get the
resource it's waiting for.  This should never happen, and if it has it
means there's probably a bug in the error-handling code somewhere in
the kernel.  The fact that the whole machine hasn't locked up shows
the semaphore must not be too important.

If the machine is still up, you can track this down by working out
which semaphore is held.  Please run this command to show the wait
channel for all the processes:

  ps awwx -eo pid,tt,user,fname,tmout,f,wchan

You might also try 

  strace -p 23021

which might show which system call it's waiting in.  (Or it might
hang, we're not quite sure.)
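
If you would rather not attach anything to the process, /proc can be
read directly.  This is a sketch with made-up function names; it
assumes a Linux /proc, and the wchan file only exists on kernels that
provide /proc/PID/wchan, so the second helper falls back to a
placeholder.

```shell
# Print the one-letter state of a PID from /proc/PID/status (Linux).
# A stuck process like the one above would print "D".
# proc_state and proc_wchan are hypothetical helper names.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Print the kernel wait channel if this kernel exposes it.
proc_wchan() {
    if [ -r "/proc/$1/wchan" ]; then
        cat "/proc/$1/wchan"
        echo    # the file has no trailing newline
    else
        echo "(wchan not available)"
    fi
}

# Typical use:
#   proc_state 23021
#   proc_wchan 23021
```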

Hope that helps (or is at least interesting!),
-- 
Martin Pool, Linuxcare, Inc.
+61 2 6262 8990
mbp at linuxcare.com, http://www.linuxcare.com/
Linuxcare. Support for the revolution.