New ("old" as of 2010) f90 compiler for desktop superscalar/superpipelined CPUs


Performance of basic BLAS1 codes
Compares GNU Fortran, several hand-coded assembler libraries, and output from my f90 compiler (provisionally named rkhf90) running with no assists and using only the basic x87 FPU. At best, rkhf90 output is about 2x faster than the hand-coded efforts. Some codes run a little slower than GNU Fortran. (NOTE: The AMD f90 compiler uses the SSE1 and SSE2 instruction sets.)
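For readers unfamiliar with the term, BLAS1 covers simple vector-vector operations such as scaling a vector or taking a dot product. The sketch below shows two such kernels in C, purely for illustration. It is not the benchmarked Fortran code; the function names `scal_sketch` and `dot_sketch` are made up here, and unit stride is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* x := alpha * x  (the BLAS1 "scal" operation), unit stride assumed. */
void scal_sketch(size_t n, double alpha, double *x)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] *= alpha;
}

/* Returns sum of x[i] * y[i]  (the BLAS1 "dot" operation), unit stride assumed. */
double dot_sketch(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Each kernel is a single short loop, so timings of this kind mostly measure how well the generated code keeps the floating-point pipeline busy.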
Performance of selected BLAS1 codes using 3DNow!
The f90 compiler can generate any mixture of 3DNow!, SSE1, SSE2 and x87 FPU instructions to maximise performance. Although 3DNow! has no overflow handling and uses non-standard rounding, it generally performs better than SSE1 on platforms that offer both.
Performance of selected BLAS1 codes using SSE1
SSE1 can be overlapped with execution in the x87 FPU.
Performance of selected BLAS1 codes using SSE2
SSE2 can be overlapped with execution in the x87 FPU.
Sample codes output from the rkhf90 compiler.
The output more or less follows the style of GNU C. The compiler automatically unrolls loops to optimise pipelining, but it can't re-roll loop code that has already been unrolled. Ergo, the BLAS1 examples compiled by rkhf90 are written as plain loops and are much simpler than the Jack Dongarra code used for the GNU Fortran compiler.
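The difference between a plain loop and a hand-unrolled one can be sketched as follows. This is an illustrative C version, not output from rkhf90 and not the actual reference Fortran; the function names are hypothetical, and the unroll factor of 4 with a leading clean-up loop is an assumption modelled on the general style of the reference BLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Plain ("rolled") form: y := a*x + y. A compiler that unrolls
   automatically only needs this version as input. */
void axpy_rolled(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hand-unrolled form: a clean-up loop handles the first n % 4
   elements, then the main loop processes four elements per pass. */
void axpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    size_t m = n % 4;
    size_t i;
    for (i = 0; i < m; i++)          /* clean-up for the remainder */
        y[i] += a * x[i];
    for (i = m; i < n; i += 4) {     /* n - m is a multiple of 4 */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both functions perform the same arithmetic on each element, so they should produce identical results. The second simply fixes the unrolling decision in the source, which is the kind of code an automatically unrolling compiler gains nothing from.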
Sample 3DNow! codes output from f90.
The f90 compiler handles any combination of instruction sets, and can even cope with partial implementations (e.g. PADDD, which isn't wired up -- but doesn't raise an instruction fault -- when used with SSE registers on XP/MPs).
SSE1 codes output from f90.
SSE2 codes output from f90.
Performance comparison of the different instruction sets

Kym Horsell /
Kym@KymHorsell.COM

ADVISORY: Email to these sites is filtered.