In my line of work, we tend to deal with a lot of cron scripts. From time to time, and for various reasons they will hang for no apparent reason. Here are a few reasons I have found for them hanging:
- When making cURL requests to an external URL, that dependency will sometimes stop responding and leave the request hanging. This can usually be mitigated by setting the CURLOPT_CONNECTTIMEOUT option before connecting, but sometimes doesn’t seem like it works correctly.
- I built a system that captures STDOUT and STDERR for cron scripts in order to display statuses for scripts after they complete including any potential errors. In doing so, when a script echoes a lot of output, the buffer will get full and cause the script to hang. Lesson being that if you want the ability to display output, write a script that accepts command line flags such as -v or -verbose and only then does it display the output…otherwise remains silent. This can be done by checking the $argc and $argv variables.
So anyway, a script is hung, what do you do?
- Find out it’s process ID. In the command line, type ps ax | grep .php to see any currently executing PHP scripts. You might see something like this being output:
1220371 ? S 0:00 /usr/bin/php -q /var/shared/scripts/MyScript.php20383 ? S 0:05 /usr/bin/php -q /var/shared/scripts/MyOtherScript.php
The first column is the process ID. The second is the TTY or “controlling terminal”. It’s ok if this is a question mark. The third column, where you see the S in my example, is the process state code which can be any one of the following:
- D: uninterruptable sleep
- N: low priority
- R: runnable (on run queue)
- S: sleeping
- T: traced or stopped
- Z: defunct (zombie)
The fourth column is the runtime which is important to be able to know how long your script has been executing and is an indicator of a hung process. If your script normally takes 2 minutes to run, and it’s been currently running for 10 minutes — its hung. The final column is the command that was executed which in the case of PHP cron scripts will usually be the path to your PHP executable followed by the file being run.
Another cool tool for getting the pid (Process ID) of a script is to use pgrep followed by a string to match. If you run pgrep -l .php It will just spit out a list of matching files currently running and their pid which is pretty handy.
- So now we know a little more about the process that is hung. Now lets check it out with the bread and butter of cron script debugging tools: strace. Execute the following for the pid you want to inspect where 12345 is your Process ID.:
1strace -p 12345
Ok, you will see one of two scenarios (typically) when you do this. The first is that you will see a bunch of Linux/Unix system commands scrolling by. This means your script is actually doing stuff. It’s not hung, just busy. There is an exception to this, you may see the poll() command scrolling by with (Timeout) at the end of each line. We will address that case in Step 3. The other scenario is you will see a single command and it’s just stopped on it like this:12Process 22770 attached - interrupt to quitwrite(3, "--->8707215n", 12
Another common system call you will see is read(). In the above example, we can deduce that the script is attempting to write data and is unable to or hasn’t completed. The first parameter to the write() call is the file descriptor. Remember in *nix, everything is a file which includes your TCP connections. The second parameter is what it is trying to write, and the third is how many bytes the data is. Note: The read() system call has the same parameters in case you see that. Our final conclusion from this step is that data is unable to be output to file descriptor 3 for some unknown reason.
- Since we know the file descriptor number, we need to find out what the file is. Remember before I mentioned if you see the poll() system call followed by (Timeout) we will address it here. Just like read() and write(), the first parameter to poll() is a file descriptor. It’s function is to wait for a file descriptor to be ready for I/O meaning if you see it repeatedly timing out, it’s because the file descriptor is never being ready! Ok, so lets look at how to see just what that file descriptor is:
1lsof -p 12345
We pass it our process and it will return a list of file descriptors. The fd column in the results is what you should be paying attention to to find your descriptor.123456789101112131415161718COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAMEmyScript 22770 myServer cwd DIR 253,0 4096 1693953 /var/shared/scriptsmyScript 22770 myServer rtd DIR 253,0 4096 2 /myScript 22770 myServer txt REG 253,0 3224944 664842 /usr/bin/phpmyScript 22770 myServer mem REG 253,0 27424 521733 /lib64/libnss_dns-2.12.somyScript 22770 myServer mem REG 253,0 65928 521766 /lib64/libnss_files-2.12.somyScript 22770 myServer mem REG 253,0 82960 664597 /usr/lib64/php/modules/zip.somyScript 22770 myServer mem REG 253,0 14288 521327 /lib64/libgpg-error.so.0.5.0myScript 22770 myServer mem REG 253,0 478496 521442 /lib64/libgcrypt.so.11.5.3myScript 22770 myServer mem REG 253,0 248072 666407 /usr/lib64/libxslt.so.1.1.26myScript 22770 myServer mem REG 253,0 81688 666404 /usr/lib64/libexslt.so.0.8.15myScript 22770 myServer mem REG 253,0 34056 666413 /usr/lib64/php/modules/xsl.somyScript 22770 myServer 0r FIFO 0,8 0t0 357242409 pipemyScript 22770 myServer 1w FIFO 0,8 0t0 357242410 pipemyScript 22770 myServer 2w FIFO 0,8 0t0 357242411 pipemyScript 22770 myServer 3u IPv4 357242380 0t0 TCP myserver.com:12345->otherserver.com:mysql (ESTABLISHED)myScript 22770 myServer 4u IPv4 357242586 0t0 TCP myserver.com:54321->yetanotherserver.com:mysql (ESTABLISHED)myScript 22770 myServer 6u IPv4 357242693 0t0 TCP myserver.com:13579->lastserver.com:mysql (ESTABLISHED)
Your list of .so PHP modules will probably be much longer (I cut a lot of them off for brevity). The important part is where you see the numbers in the FD column such as 1w or 4u. The letter following the number means: w = write, r = read, u = read/write. As you can see, file descriptor 3 is a TCP connection to MySQL on otherserver.com. That gives me a starting place for looking into what is going wrong. Is there a firewall issue? Is the MySQL server not responding? Does the server name resolve in DNS?
Using these tools and researching some linux system calls will help you identify why your script is hanging. If you have any questions or need clarification, please comment below and I will try to address an issues.