How can I find the error reason when queues/jobs go into error state?

Grid Engine administrators sometimes face the problem that a bunch of queues suddenly switch into error state, or users complain about failing jobs without being able to find the error reason.
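
Before digging deeper, it is worth asking Grid Engine itself: the standard client commands already report the recorded error reason for queue instances and jobs. The queue instance name and job ID below are just placeholders:

> qstat -f -explain E
# lists queue instances in error state together with the recorded error reason
> qstat -j 4711
# prints scheduling and error information for a specific job
> qmod -cq all.q@host1
# clears the error state of a queue instance once the cause has been fixed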

Finding the root cause can be tricky because the startup process of a job is itself complicated. Many different parts of a real-life UNIX environment are involved in this phase, and they all have to work hand in hand. So what can be done to help Grid Engine users and administrators in this situation?

In the past, users were directed to a trace file located on the host where the job startup failed. To use it, you had to know in advance that a job would fail and on which host it would fail, and an administrator had to set a specific parameter in the Grid Engine configuration so that the files containing the error reason were not deleted when the job terminated. Then you still had to log in to the execution host to read the trace files.
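
For reference, this is roughly what that old workflow looked like, assuming the classic KEEP_ACTIVE=true switch and a default local spool directory layout; the host name and job/task IDs are placeholders:

> qconf -mconf
execd_params KEEP_ACTIVE=true
…
# after the job has failed, log in to the execution host and read the trace file
> ssh host1
> more $SGE_ROOT/default/spool/host1/active_jobs/4711.1/trace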

The engineering team I am working with had a long discussion about what could be done to make this easier for our customers. We came up with a solution that we implemented in Univa Grid Engine; it will be available with UGE 8.1.

After the installation of UGE, an administrator can decide whether the new simplified debugging should be enabled by default. To do so, the KEEP_ACTIVE parameter has to be set to ERROR in the execd_params of the Univa Grid Engine configuration:

> qconf -mconf
execd_params KEEP_ACTIVE=ERROR, …
…
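
Whether the parameter is active can be verified by displaying the current global configuration:

> qconf -sconf | grep execd_params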

When this is enabled, the following will be available after a job has failed:

  • the spool directory of the job
  • the job script
  • a file that includes all job-related messages from the execution daemon
  • a list of all files located in the job's temp directory

All this is available only for failing jobs. There is no need to know in advance that a job will fail, and the execution location does not matter because no access to the execution host is required: the files are transferred automatically to the qmaster host, where they can be found in $SGE_ROOT/$SGE_CELL/faulty_jobs.
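
Each failed job gets its own subdirectory there, named after the job ID and the task ID (as the example below shows), so everything can be listed and read directly on the qmaster host:

> ls $SGE_ROOT/$SGE_CELL/faulty_jobs
# one subdirectory per failed job, e.g. 12/1 for job 12, task 1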

Let's provoke an error to see how this works. Below you can see the submission of a job that requests a shell that does not exist, and further below an excerpt of the trace file containing the error reason. That's it!

> qsub -S /bin/DOES_NOT_EXIT -b y /bin/sleep 10
Your job 12 ("sleep") has been submitted

> more $SGE_ROOT/default/faulty_jobs/12/1/active_jobs_dir/trace
07/01/2012 21:11:03 [501:16789]: shepherd called with uid = 0, euid = 501
07/01/2012 21:11:03 [501:16789]: starting up 8.1.0alpha2
…
07/01/2012 21:11:03 [501:16790]: unable to find shell "/bin/DOES_NOT_EXIT"
07/01/2012 21:11:03 [501:16789]: wait3 returned 16790 (status: 6912; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 27)
07/01/2012 21:11:03 [501:16789]: job exited with exit status 27
...
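
The exit status from the trace file can also be cross-checked against the accounting record once the job has left the system; qacct reports it in the failed and exit_status fields:

> qacct -j 12
# shows, among other things, the failed and exit_status fields for job 12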