[Linux-ivv4] problem with the superdome
David Vernazobres
dv at uni-muenster.de
Tue Oct 16 01:19:03 CEST 2007
Dear all,
I am having the following problem on the NWZSuperdome.
On some of my jobs, but not all (same binary), I am getting strange
errors. Here are the mails that I am getting from the batch system:
----- Forwarded message from adm <adm at nwzsuperdome.uni-muenster.de> -----
Date: Tue, 16 Oct 2007 00:02:03 +0200
From: adm <adm at nwzsuperdome.uni-muenster.de>
To: dv at uni-muenster.de
Subject: PBS JOB 4714.nwzsuperdome.nwznet.uni-muenster.de
PBS Job Id: 4714.nwzsuperdome.nwznet.uni-muenster.de
Job Name: trna_ECK12_31_ev
An error has occurred processing your job, see below.
Post job file processing error; job 4714.nwzsuperdome.nwznet.uni-muenster.de on host nwzsupertest/1Unknown resource type REJHOST=nwzsupertest.nwznet.uni-muenster.de MSG=invalid home directory '/ddfs/user/data/d/dv' specified, errno=2 (No such file or directory)
----- End forwarded message -----
----- Forwarded message from adm <adm at nwzsuperdome.uni-muenster.de> -----
Date: Tue, 16 Oct 2007 00:02:03 +0200
From: adm <adm at nwzsuperdome.uni-muenster.de>
To: dv at uni-muenster.de
Subject: PBS JOB 4714.nwzsuperdome.nwznet.uni-muenster.de
PBS Job Id: 4714.nwzsuperdome.nwznet.uni-muenster.de
Job Name: trna_ECK12_31_ev
Begun execution
----- End forwarded message -----
----- Forwarded message from adm <adm at nwzsuperdome.uni-muenster.de> -----
Date: Tue, 16 Oct 2007 00:02:03 +0200
From: adm <adm at nwzsuperdome.uni-muenster.de>
To: dv at uni-muenster.de
Subject: PBS JOB 4714.nwzsuperdome.nwznet.uni-muenster.de
PBS Job Id: 4714.nwzsuperdome.nwznet.uni-muenster.de
Job Name: trna_ECK12_31_ev
Aborted by PBS Server
Job cannot be executed
See job standard error file
----- End forwarded message -----
My standard PBS submission file:
#!/bin/sh
#PBS -q batch
#PBS -N trna_ECK12_31_ev
#PBS -e /work/dv/trna_ECK12_31_error.txt
#PBS -o /work/dv/trna_ECK12_31_output.txt
#PBS -m bae
#PBS -M dv at uni-muenster.de
#PBS -S /bin/sh
#PBS -l nodes=1:ppn=1
source /opt/intel/cc/9.1.042/bin/iccvars.sh
export OMP_NUM_THREADS=1
cd /work/dv
./Evolution_trna -b 31
$ tracejob 4714.nwzsuperdome
/var/spool/torque/server_priv/accounting/20071016: Permission denied
/var/spool/torque/mom_logs/20071016: No matching job records located
Job: 4714.nwzsuperdome.nwznet.uni-muenster.de
10/16/2007 00:02:02 S enqueuing into batch, state 1 hop 1
10/16/2007 00:02:02 S Job Queued at request of
dv at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE, owner =
dv at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE, job
name = trna_ECK12_31_ev, queue = batch
10/16/2007 00:02:03 S Job Modified at request of
Scheduler at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE
10/16/2007 00:02:03 S Job Run at request of
Scheduler at NWZSUPERDOME.NWZNET.UNI-MUENSTER.DE
10/16/2007 00:02:03 S Exit_status=-2 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
10/16/2007 00:02:03 L Job Run
10/16/2007 00:02:03 S Post job file processing error
10/16/2007 00:02:03 S dequeuing from batch, state COMPLETE
I do not see why I am getting these errors from time to time.
The error mentions nwzsupertest, but I am using the batch queue, so
shouldn't it be nwzsuperbatch?
Also, I am not accessing my home directory during the execution of the
job.
Any clues or explanations?
Many thanks,
david