New ("old" as of 2010) f90 compiler for desktop superscalar/superpipelined CPUs
- Performance of basic BLAS1 codes
- Compares GNU Fortran, several hand-coded assembler libraries, and
output from my f90 compiler (provisionally, rkhf90) with no assists
and only the basic x87 FPU. The best performance is about 2x that of
the hand-coded efforts. Some codes run a little slower than GNU
Fortran. (NOTE: The AMD f90 compiler uses the SSE1 and SSE2
instruction sets.)
- Performance of selected BLAS1 codes using 3DNow!
- F90 can generate any mixture of 3DNow!/SSE1/SSE2 and x87 FPU
instructions to maximise performance. While 3DNow! has no overflow
handling and non-standard roundoff, it generally outperforms SSE1 on
platforms that offer both.
- Performance of selected BLAS1 codes using SSE1
- SSE1 can be overlapped with execution in the x87 FPU.
- Performance of selected BLAS1 codes using SSE2
- SSE2 can be overlapped with execution in the x87 FPU.
- Sample codes output from the rkhf90 compiler.
- The output more or less follows the style of GNU C. The compiler
automatically unrolls loops to optimise pipelining, but it can't
re-roll loop code that is already unrolled. Ergo, the BLAS1 examples
compiled by rkhf90 are much simpler than the Jack Dongarra code used
for the GNU Fortran compiler.
- Sample 3DNow! codes output from f90.
- F90 handles any combination of instruction set offerings, and can
even handle partial implementations (e.g. PADDD, which isn't wired but
doesn't raise an instruction fault when using SSE regs on XP/MPs).
- SSE1 codes output from f90.
- SSE2 codes output from f90.
- Performance comparison of the different instruction sets
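
For readers unfamiliar with the term, the BLAS1 codes benchmarked above are vector-vector operations such as the dot product and vector scaling. The following is only a rough C sketch of two such kernels with unit stride. The names ddot_sketch and dscal_sketch are made up for illustration, and none of this is taken from rkhf90 or from the benchmarked libraries.

```c
#include <stddef.h>

/* Illustrative only: hypothetical names, unit stride assumed. */

/* Dot product: returns the sum of x[i] * y[i] for i in [0, n). */
double ddot_sketch(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Scaling: replaces each x[i] with a * x[i]. */
void dscal_sketch(size_t n, double a, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}
```

For x = (1, 2, 3) and y = (4, 5, 6), ddot_sketch returns 32. Kernels like these are one short loop each, so the benchmark results mostly reflect how well the inner loop is scheduled.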
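
The remark above about loop unrolling can be made concrete with a sketch, assuming the usual hand-unrolled style of reference BLAS1 routines: a cleanup loop for the leftover elements, followed by a main loop that handles four elements per pass. The C below contrasts that style with a plain rolled loop, which is the form a self-unrolling compiler would want as input. Both function names are hypothetical, and neither function is the actual reference source or rkhf90 input.

```c
#include <stddef.h>

/* Rolled form: y := a*x + y, one element per iteration. */
void daxpy_rolled(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Hand-unrolled form (illustrative): same result, four elements per pass. */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t m = n % 4;   /* leftover elements that do not fill a group of 4 */
    size_t i;

    /* Cleanup loop: handle the first m elements one at a time. */
    for (i = 0; i < m; i++)
        y[i] += a * x[i];

    /* Main loop: the remaining count is a multiple of 4. */
    for (; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

The two functions compute the same values. The second one bakes an unroll factor into the source, which a compiler that cannot re-roll loops has no way to undo.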
Kym Horsell / Kym@KymHorsell.COM
ADVISORY: Email to these sites is filtered.