Tuesday, 3 March 2020

Ansible is stuck! ssh command is stuck!

1 What happened

I use Ansible to monitor lots of servers, and everything went well until today. For some managed nodes, every ansible command got stuck.

$ ansible -i servers server1 -a 'whoami' -vvvv
ansible 2.9.5
  config file = /etc/ansible/ansible.cfg
  .....................................................
<server1> ESTABLISH SSH CONNECTION FOR USER: user01
<server1> SSH: EXEC sshpass -d9 ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'User="user01"' -o ConnectTimeout=10 -o ControlPath=/home/user01/.ansible/cp/0f5a762f79 server1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /tmp/ansible-tmp-1583257267.1866546-111049028129670 `" && echo ansible-tmp-1583257267.1866546-111049028129670="` echo /tmp/ansible-tmp-1583257267.1866546-111049028129670 `" ) && sleep 0'"'"''

It hangs forever.

Let's get more details by setting ANSIBLE_DEBUG=1.

$ ANSIBLE_DEBUG=1 ansible -i servers server1 -a 'whoami' -vvvv
...................................
<server1> SSH: EXEC sshpass -d9 ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'User="user01"' -o ConnectTimeout=10 -o ControlPath=/home/user01/.ansible/cp/0f5a762f79 server1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /tmp/ansible-tmp-1583257473.2961693-131219090034240 `" && echo ansible-tmp-1583257473.2961693-131219090034240="` echo /tmp/ansible-tmp-1583257473.2961693-131219090034240 `" ) && sleep 0'"'"''
  4141 1583257473.33138: stderr chunk (state=2):
...................................
  4141 1583257473.37015: stderr chunk (state=3):
>>>debug1: mux_client_request_session: master session id: 2
<<<

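The output stops right after the SSH multiplexing client requests a session, so we can infer that the issue is related to SSH rather than to Ansible itself.

Let's test the ssh command directly.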

$ ssh user01@server1 'whoami'
pwd
/home/user01

Instead of returning the output of "whoami", it actually started a shell that waited for input (I typed "pwd" and it answered).
Under the hood, Ansible also uses "ssh command" to do the real work, so the same problem caused Ansible to hang.
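A quick way to spot this kind of hang without waiting forever is to put a time limit on the command. This is only a minimal sketch: the helper name probe is made up for illustration, and it assumes the timeout utility from GNU coreutils is installed on the control node.

```shell
# Run COMMAND with a time limit and report whether it finished in time.
# Usage: probe SECONDS COMMAND [ARGS...]
# "probe" is a hypothetical helper name; "timeout" is assumed to be GNU coreutils.
probe() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo "finished"
    else
        echo "timed out or failed"
    fi
}

probe 5 true       # a command that returns at once
probe 1 sleep 3    # a command that outlives its 1 second limit

# Against a real host (server1 and user01 are the names used in this post):
# probe 15 ssh -o ConnectTimeout=10 user01@server1 whoami
```

The helper cannot tell a hang from an ordinary failure, so it is only a first check before digging in with -vvv.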

2 The root cause

So why didn't "ssh command" run as expected? Let's investigate more.

First, how does sshd run a command?
To give users the same experience as a local login, sshd runs a command like this:

$SHELL -c COMMAND

SHELL is the user's default login shell, taken from the user's passwd entry, which can be looked up with "getent".
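To make that concrete, here is a rough sketch of the two steps: pick the login shell out of a passwd entry, then hand the command to that shell with -c. The helper names shell_from_passwd_line and run_via_login_shell are my own, and real sshd does much more (environment, privileges, and so on), so treat this as an approximation only.

```shell
# Extract the login shell (7th colon-separated field) from one passwd entry.
# Hypothetical helper, for illustration only.
shell_from_passwd_line() {
    printf '%s\n' "$1" | cut -d: -f7
}

# Approximate what happens for "ssh host COMMAND": run COMMAND as "<login shell> -c COMMAND".
# $1 is the path of the login shell, $2 is the command string.
run_via_login_shell() {
    "$1" -c "$2"
}

# The passwd entry shown in this post:
entry='user01:x:27093:5000:User01:/home/user01:/bin/csh'
shell_from_passwd_line "$entry"

# /bin/sh stands in for the login shell here so the sketch runs anywhere.
run_via_login_shell /bin/sh 'whoami'
```

On a live system the entry would come from "getent passwd user01" instead of a hard-coded string.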

On server1:

[user01@server1]$ getent passwd user01
user01:x:27093:5000:User01:/home/user01:/bin/csh

[user01@server1]$ /bin/csh -c whoami
[user01@server1]$

Nothing was printed.

[user01@server1]$ echo $0
/bin/bash

Wow, we are now actually in an interactive bash.

[user01@server1]$ exit
logout
user01

Finally, after exiting bash, the username was printed.

So "/bin/csh -c whoami" actually started an interactive bash first.
Looking through csh's configuration files, I found the cause in '/home/user01/.cshrc'.

[user01@server1]$ cat .cshrc
/bin/bash -l
 
At this point I remembered that a few days earlier I had changed this file so that user01 gets a bash shell on every login.

Because user01 is an AD user and I don't have permission to change its default shell, starting bash automatically from .cshrc was my workaround. What's worse, user01's home directory, where .cshrc lives, is on NFS, so every server mounting that home was affected.

As a result, any "csh -c xxx" command only returns after the bash exits.
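The same ordering can be reproduced without csh. The sketch below is only an analogy and assumes bash is installed: a non-interactive bash sources the file named by BASH_ENV before it runs its -c command, much like csh reads .cshrc. The function name demo_startup_file and the temporary file are made up for illustration.

```shell
# Show that a startup file runs before the "-c" command in a non-interactive bash.
# "demo_startup_file" is a hypothetical name used only for this demo.
demo_startup_file() {
    rc=$(mktemp)                                    # throwaway startup file
    printf '%s\n' 'echo "startup file ran first"' > "$rc"
    BASH_ENV="$rc" bash -c 'echo user01'            # bash sources $BASH_ENV, then runs the command
    rm -f "$rc"
}

demo_startup_file
```

If the startup file launched a shell that waits for input instead of a simple echo, the command would not run until that shell exits, which is what happened with .cshrc.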

3 The lesson learned

Be careful when running commands from a shell's configuration file, because the shell may also be used in non-interactive mode.
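For comparison, in sh or bash startup files the usual guard is to check whether the option string $- contains "i". This is a sketch for sh-style shells only, not csh syntax, and the function name is_interactive is my own.

```shell
# Succeed only when the current shell is interactive ("i" appears in $-).
# "is_interactive" is a hypothetical helper name.
is_interactive() {
    case $- in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

if is_interactive; then
    echo "interactive: fine to start another shell or print a banner"
else
    echo "non-interactive: keep startup quiet and return right away"
fi
```

Run as a script or through "sh -c", this takes the non-interactive branch, which is exactly the case a startup file must not block.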

I changed .cshrc so that only a login csh (the one whose $0 is "-csh") starts an interactive bash.

[user01@server1]$ cat .cshrc
if (x$0 == 'x-csh') then
        /bin/bash -l
endif

Now, everything works again.

$ ansible -i servers server1 -a 'whoami'
server1 | CHANGED | rc=0 >>
user01

$ ssh server1 whoami
user01


