[Linux-ivv4] problem with the superdome

David Vernazobres dv at uni-muenster.de
Tue Oct 16 01:19:03 CEST 2007


Dear all,

I am having the following problem on the NWZSuperdome.

Some of my jobs, but not all (same binary), fail with strange errors.
Here are the mails I am getting from the batch system:

----- Forwarded message from adm <adm at nwzsuperdome.uni-muenster.de> -----

Date: Tue, 16 Oct 2007 00:02:03 +0200
From: adm <adm at nwzsuperdome.uni-muenster.de>
To: dv at uni-muenster.de
Subject: PBS JOB 4714.nwzsuperdome.nwznet.uni-muenster.de

PBS Job Id: 4714.nwzsuperdome.nwznet.uni-muenster.de
Job Name:   trna_ECK12_31_ev
An error has occurred processing your job, see below.
Post job file processing error; job 4714.nwzsuperdome.nwznet.uni-muenster.de on host nwzsupertest/1Unknown resource type  REJHOST=nwzsupertest.nwznet.uni-muenster.de MSG=invalid home directory '/ddfs/user/data/d/dv' specified, errno=2 (No such file or directory)

----- End forwarded message -----
----- Forwarded message from adm <adm at nwzsuperdome.uni-muenster.de> -----

Date: Tue, 16 Oct 2007 00:02:03 +0200
From: adm <adm at nwzsuperdome.uni-muenster.de>
To: dv at uni-muenster.de
Subject: PBS JOB 4714.nwzsuperdome.nwznet.uni-muenster.de

PBS Job Id: 4714.nwzsuperdome.nwznet.uni-muenster.de
Job Name:   trna_ECK12_31_ev
Begun execution

----- End forwarded message -----
----- Forwarded message from adm <adm at nwzsuperdome.uni-muenster.de> -----

Date: Tue, 16 Oct 2007 00:02:03 +0200
From: adm <adm at nwzsuperdome.uni-muenster.de>
To: dv at uni-muenster.de
Subject: PBS JOB 4714.nwzsuperdome.nwznet.uni-muenster.de

PBS Job Id: 4714.nwzsuperdome.nwznet.uni-muenster.de
Job Name:   trna_ECK12_31_ev
Aborted by PBS Server 
Job cannot be executed
See job standard error file

----- End forwarded message -----

My standard PBS submission file:

#!/bin/sh
#PBS -q batch
#PBS -N trna_ECK12_31_ev
#PBS -e /work/dv/trna_ECK12_31_error.txt
#PBS -o /work/dv/trna_ECK12_31_output.txt
#PBS -m bae
#PBS -M dv at uni-muenster.de
#PBS -S /bin/sh
#PBS -l nodes=1:ppn=1
source /opt/intel/cc/9.1.042/bin/iccvars.sh
export OMP_NUM_THREADS=1
cd /work/dv
./Evolution_trna -b 31

$ tracejob 4714.nwzsuperdome
/var/spool/torque/server_priv/accounting/20071016: Permission denied
/var/spool/torque/mom_logs/20071016: No matching job records located

Job: 4714.nwzsuperdome.nwznet.uni-muenster.de

10/16/2007 00:02:02  S    enqueuing into batch, state 1 hop 1
10/16/2007 00:02:02  S    Job Queued at request of dv at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE, owner =
                          dv at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE, job name = trna_ECK12_31_ev, queue = batch
10/16/2007 00:02:03  S    Job Modified at request of Scheduler at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE
10/16/2007 00:02:03  S    Job Run at request of Scheduler at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE
10/16/2007 00:02:03  S    Exit_status=-2 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
                          resources_used.walltime=00:00:00
10/16/2007 00:02:03  L    Job Run
10/16/2007 00:02:03  S    Post job file processing error
10/16/2007 00:02:03  S    dequeuing from batch, state COMPLETE


I do not see why I am getting these errors from time to time.
The error mentions nwzsupertest, but I am using the batch queue, so
shouldn't it be nwzsuperbatch?
Also, I am not accessing my home directory during the execution of the
job.

Any clues or explanations?

Many thanks,
David