Job getting killed on HPC cluster, why?


I am trying to solve a nonlinear optimization problem in AMPL. It is quite large, but not ridiculously so. I solved a similar problem on my home PC, though one about an order of magnitude smaller.

I am submitting to the HPC at my university (Grid Engine based). I used this job script:

#!/bin/bash
#$ -cwd
#$ -l h_rt=24:0:0   # Request 24 hours
#$ -pe smp 1 -l h_vmem=1500G
#$ -l mem=1300G

source $HOME/ampl_venv/bin/activate

cd $AMPL_HOME || exit 1

echo "include s5.run;" | ./ampl > amploutput.txt

This is the report from qacct -j 1724385

qname Bran
hostname node-d00a-222.myriad.ucl.ac.uk
group ucbvx0
owner ucbvapk
project AllUsers
department defaultdepartment
jobname jet16
jobnumber 1724385
taskid undefined
account policyjsv;F=0;J=0;D=8;I=0;B=0;E=0;L=0
priority 0
qsub_time Tue Mar 4 15:09:11 2025
start_time Tue Mar 4 15:17:02 2025
end_time Tue Mar 4 15:32:01 2025
granted_pe smp-D
slots 8
failed 0
exit_status 137 (Killed)
ru_wallclock 899s
ru_utime 25.314s
ru_stime 55.071s
ru_maxrss 8.000MB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 4841336
ru_majflt 122161
ru_nswap 0
ru_inblock 5370096
ru_oublock 88
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 321341
ru_nivcsw 381
cpu 80.385s
mem 677.044GBs
io 1.615MB
iow 0.000s
maxvmem 11.735GB
arid undefined
ar_sub_time undefined
category -l batch=true,h_rt=86400,h_vmem=1500G,memory=1G,snx=1 -pe smp-[D]* 8
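
If I am reading these stats correctly (an assumption on my part: in SGE accounting, mem is the integral of memory usage in GB·CPU-seconds), the job's actual usage was nowhere near the limits I requested:

# Average resident usage over the job's CPU time (my own arithmetic):
echo "scale=2; 677.044 / 80.385" | bc    # ~8.42 GB on average
# maxvmem peaked at 11.735 GB, far below the 1500G h_vmem request.

That makes the kill all the more puzzling to me.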

How can it have run out of memory? Is there anything in these stats that would help diagnose what's going on?

Here is my AMPL run file s5.run, in case it helps:

model model5.md;
data  data75.dat;

option solver conopt;
option conopt_options 'outlev=3 maxftime=36000 opttol=0.1 rgmax=1e-15';
option solver_msg 1;
option times 1;
option gentimes 1;
option show_stats 1;

solve;
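
To narrow down where the memory goes, I may also try a variant of the run file (a sketch: write g<stub>; is standard AMPL for dumping the generated problem to an .nl file without invoking a solver):

# in place of "solve;" above:
write gmodel5;    # generates the problem and writes model5.nl, no solver run

If that version gets killed too, the memory is being consumed during AMPL's model generation rather than inside CONOPT.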
