1 What happened
I use Ansible to monitor lots of servers, and everything went well until today. On some managed nodes, every Ansible command got stuck.
$ ansible -i servers server1 -a 'whoami' -vvvv
ansible 2.9.5
config file = /etc/ansible/ansible.cfg
.....................................................
<server1> ESTABLISH SSH CONNECTION FOR USER: user01
<server1> SSH: EXEC sshpass -d9 ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'User="user01"' -o ConnectTimeout=10 -o ControlPath=/home/user01/.ansible/cp/0f5a762f79 server1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /tmp/ansible-tmp-1583257267.1866546-111049028129670 `" && echo ansible-tmp-1583257267.1866546-111049028129670="` echo /tmp/ansible-tmp-1583257267.1866546-111049028129670 `" ) && sleep 0'"'"''
It hangs forever.
Let's get more details by setting ANSIBLE_DEBUG=1.
$ ANSIBLE_DEBUG=1 ansible -i servers server1 -a 'whoami' -vvvv
...................................
<server1> SSH: EXEC sshpass -d9 ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'User="user01"' -o ConnectTimeout=10 -o ControlPath=/home/user01/.ansible/cp/0f5a762f79 server1 '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo /tmp/ansible-tmp-1583257473.2961693-131219090034240 `" && echo ansible-tmp-1583257473.2961693-131219090034240="` echo /tmp/ansible-tmp-1583257473.2961693-131219090034240 `" ) && sleep 0'"'"''
4141 1583257473.33138: stderr chunk (state=2):
...................................
4141 1583257473.37015: stderr chunk (state=3):
>>>debug1: mux_client_request_session: master session id: 2
<<<
Now we can infer that the issue is related to SSH.
Let's test the ssh command directly.
$ ssh user01@server1 'whoami'
pwd
/home/user01
Instead of returning the output of "whoami", the command started a shell that waited for input (the "pwd" above was typed into it).
Under the hood, Ansible also uses the ssh command to do the real work, so the same problem caused Ansible to hang.
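A hang like this is easier to spot from a script if the command runs under a time limit. The following is a minimal sketch, assuming GNU coreutils "timeout" is installed (it exits with status 124 when the limit is reached); the helper name "probe" is hypothetical and not part of Ansible or OpenSSH.

```shell
# Hypothetical helper: run a command with a time limit and say whether it
# finished, failed, or had to be killed because it never returned.
# Assumes GNU coreutils `timeout`, which exits with 124 on timeout.
probe() {
    probe_limit=$1
    shift
    timeout "$probe_limit" "$@" >/dev/null 2>&1
    probe_rc=$?
    if [ "$probe_rc" -eq 124 ]; then
        echo "HUNG"
    elif [ "$probe_rc" -eq 0 ]; then
        echo "OK"
    else
        echo "FAILED rc=$probe_rc"
    fi
}

# Possible use from a non-interactive script (names taken from this post):
#   probe 30 ansible -i servers server1 -a 'whoami'
```

Stdin is deliberately left untouched: in this incident the unexpected shell was waiting on stdin, so closing stdin could hide the hang instead of detecting it.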
2 The root cause
So why didn't the ssh command run as expected? Let's investigate further.
First, how does sshd run a command?
To give users the same environment as a local login, sshd runs a command like this:
$SHELL -c COMMAND
SHELL is the user's default login shell, which can be looked up with "getent".
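The two steps just described (find the login shell, then run "SHELL -c COMMAND") can be reproduced by hand. This is a minimal sketch, not sshd's actual implementation; the function names "shell_field" and "run_with_shell" are hypothetical.

```shell
# Extract the login shell from one passwd entry.
# The shell is the 7th colon-separated field.
shell_field() {
    printf '%s\n' "$1" | cut -d: -f7
}

# Run a command the way this post describes sshd doing it:
#   LOGIN_SHELL -c COMMAND
run_with_shell() {
    "$1" -c "$2"
}

# Possible use on a managed node (user name taken from this post):
#   entry=$(getent passwd user01)
#   run_with_shell "$(shell_field "$entry")" whoami
```

Running the command through the user's own login shell is what makes that shell's startup files matter, even for a one-line remote command.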
On server1:
[user01@server1]$ getent passwd user01
user01:x:27093:5000:User01:/home/user01:/bin/csh
[user01@server1]$ /bin/csh -c whoami
[user01@server1]$
Nothing was printed.
[user01@server1]$ echo $0
/bin/bash
Wow, we are now actually in an interactive bash.
[user01@server1]$ exit
logout
user01
Finally, after exiting bash, the username is printed.
This shows that "/bin/csh -c whoami" actually started an interactive bash first.
Going through csh's configuration files, I found the cause in '/home/user01/.cshrc'.
[user01@server1]$ cat .cshrc
/bin/bash -l
At this point I remembered that a few days earlier I had changed this file so that user01 gets a bash shell every time it logs in.
Because user01 is an AD user and I don't have permission to change its default shell, starting bash automatically from .cshrc was a workaround. What's worse, user01 has an NFS home directory where .cshrc lives, so every server mounting that home was affected.
As a result, any "csh -c xxx" command only returns after the bash exits.
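The same effect can be reproduced locally without csh. The sketch below is only an analogy under stated assumptions: bash stands in for csh, and the file named by BASH_ENV (which a non-interactive bash reads at startup) stands in for .cshrc. The file and function names are hypothetical.

```shell
# Local analogy only: bash stands in for csh, BASH_ENV stands in for .cshrc.
demo_dir=$(mktemp -d)

# A startup file that launches another shell, like "/bin/bash -l" in .cshrc.
# BASH_ENV is unset first so the nested shell cannot re-read this file.
printf '%s\n' 'unset BASH_ENV' 'sh' > "$demo_dir/rc"

run_demo() {
    # The pipe plays the role of the person typing into the unexpected shell.
    printf '%s\n' 'echo typed-into-nested-shell' |
        BASH_ENV="$demo_dir/rc" bash -c 'echo output-of-command'
}

run_demo
```

The line fed through the pipe is handled by the nested shell first, and the command given with -c runs only after that shell reaches end of input and exits. With a stdin that never closes, as in the Ansible case, the command never runs.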
3 The lesson learned
Be careful when running commands from a shell's configuration file, because the shell may also be used in non-interactive mode.
I changed .cshrc so that bash is started only when csh runs as a login shell (its $0 is "-csh"), not when sshd or Ansible runs "csh -c COMMAND".
[user01@server1]$ cat .cshrc
if (x$0 == 'x-csh') then
/bin/bash -l
endif
Now, everything works again.
$ ansible -i servers server1 -a 'whoami'
server1 | CHANGED | rc=0 >>
user01
$ ssh server1 whoami
user01
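The same lesson applies to sh-family startup files. As a hedged sketch (the function name "shell_mode" is hypothetical), a POSIX shell can check whether it is interactive by looking for "i" in the special parameter $-, and a startup file can use the same test to guard anything that should only happen in a terminal session.

```shell
# Report whether the current shell is interactive.
# POSIX shells include "i" in $- when running interactively.
shell_mode() {
    case $- in
        *i*) echo interactive ;;
        *)   echo non-interactive ;;
    esac
}

# In a startup file, the guard would look like this:
#   case $- in
#       *i*) exec /bin/bash -l ;;   # only for interactive sessions
#   esac
shell_mode
```

When run from a script or through "sh -c", this takes the non-interactive branch, so a guarded command would be skipped for ssh and Ansible.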