Serial Test notes:

Compiler versions & options:
  EKO2.0, PGI5.2-4, and gcc 3.4.3

  g95 version info:
    gcc version 3.5.0 20040824 (experimental) (g95!) Dec 18 2004

  f77 and f90 flags:
    pathf90 -O3 -WOPT:aggstr=off
    pgf90 -O3 -fastsse
    g77 -O4

  f90 flags:
    pathf90 -O3 -WOPT:aggstr=off -OPT:unroll_size=256
    pgf90 -O3 -fastsse
    g95 -O4

  C flags:
    pathcc -Ofast -LNO:prefetch=0 -OPT:unroll_size=256 -m32
    pgcc -O3 -Munroll=c:256 -tp k8-32
    gcc -O4 -funroll-loops -m32

System Under Test:
  1 x Opteron 248 2.2 GHz CPU
  4x512, DDR400, PC3200, Corsair CL2 Memory
  SuSE Linux 9.1 64 bit
  Kernel 2.6.4-52-default

Source code:
  Serial F77, F90 and C versions are the 'M'edium or 'dynamic allocate'
  versions from this page:
  http://w3cic.riken.go.jp/HPC/HimenoBMT/program1.htm

=================================================

OpenMP Parallel test notes:

Compiler versions & options:
  EKO 2.0, PGI5.2-4

  F77/OpenMP Flags:
    pathf90 -O3 -WOPT:aggstr=off -mp
    pgf90 -O2 -Munroll -Mnoframe -Mscalarsse -Mvect=sse -Mcache_align -Mflushz -mp

System Under Test:
  4 x 2.2 GHz Opteron 848 CPUs
  16x1024, DDR400, PC3200 Memory
  Fedora Core 2 64 bit
  Kernel 2.6.8-52

Source Code:
  Original F77 OpenMP version (M data size) is from this page:
  http://w3cic.riken.go.jp/HPC/HimenoBMT/program2.htm

  Original Himeno source -- himenobmtxp_m_omp.f
  PathScale modified source -- himenobmtxp_m_omp.1st.touch.f

PathScale's parallel first touch data initialization changes to improve
scalability on NUMA machines:

$ diff himenobmtxp_m_omp.f himenobmtxp_m_omp.1st.touch.f
137a138,140
> C$OMP PARALLEL SHARED (kmax,jmax,imax,nn,a,p,b,c,bnd,wrk1,wrk2,
> C$OMP& omega,gosa) PRIVATE (k,j,i)
> C$OMP DO
156a160
> C$OMP END DO NOWAIT
157a162
> C$OMP DO
176a182,183
> C$OMP END DO
> C$OMP END PARALLEL

Differences to change timers:

$ diff himenobmtxp_m_omp.f.orig himenobmtxp_m_omp.f
77c77
< cpu0=dtime(time0)
---
> cpu0=etime(time0)
82c82
< cpu1= dtime(time1)
---
> cpu1= etime(time1)
96c96
< cpu0=dtime(time0)
---
> cpu0=etime(time0)
99c99
< cpu1= dtime(time1)
---
> cpu1= etime(time1)
112c112
< pause
---
> c pause
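
Illustration of the first touch idea (not part of the benchmark source):

The first-touch diff above wraps the initialization loops in an OpenMP parallel region. The sketch below shows the same idea in C instead of Fortran. It assumes the operating system places each memory page on the NUMA node of the thread that first writes to it, which is the default policy on Linux. The function names alloc_first_touch and sum_scaled are hypothetical and do not appear in the Himeno source. If the compiler is run without its OpenMP flag, the pragmas are ignored and the code runs serially with the same result.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate n floats and initialize them inside a
 * parallel loop, so each thread is the first to write (touch) the part
 * of the array it is assigned by the static schedule. */
float *alloc_first_touch(size_t n, float value)
{
    float *a = malloc(n * sizeof *a);
    long i;

    if (a == NULL)
        return NULL;

#pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; i++)
        a[i] = value;

    return a;
}

/* Hypothetical compute loop using the same static schedule, so each
 * thread works on the index range it initialized above. */
double sum_scaled(const float *a, size_t n, float scale)
{
    double s = 0.0;
    long i;

#pragma omp parallel for schedule(static) reduction(+:s)
    for (i = 0; i < (long)n; i++)
        s += (double)a[i] * scale;

    return s;
}
```

A caller would allocate with alloc_first_touch(n, 0.0f), run loops shaped like sum_scaled over the result, and release it with free(). The key design point is that the initialization and compute loops use the same schedule; if they differ, threads may end up working mostly on remote memory and the benefit is likely lost.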