How can I find out why queues or jobs go into error state?

Grid Engine administrators sometimes find that a whole group of queues has switched into error state, or they hear from users whose jobs fail without any obvious reason.

Finding the root cause can be tricky because starting a job is itself a complicated process. Many different parts of a real-life UNIX environment are involved in this phase, and they all have to work together. So what can be done to help Grid Engine users and administrators in this situation?