Re: Need help on Athlon optimized gemm kernel!



Hi Julian,

I would love to look at this. Unfortunately, I am going on a week's
vacation, but if it is still relevant when I get back I will try to
look at it, and maybe port it to gasm if that makes any sense.

Cheers,

Peter.

On Thu, 11 Oct 2001, Julian Ruhe wrote:

> Hello all,
>
> I am currently working on a new Athlon-optimized gemm kernel, but I have
> run into a problem.
> The kernel computes 6 dot products simultaneously. Of course, after that
> operation I must store the six resulting elements of the matrix (block) C
> and load the next elements onto the stack. And this is exactly the problem.
> As long as I leave out the exchange of elements of C (which means that the
> results of all dot products in the matrix multiplication are accumulated
> in only 6 stack registers), the matrix multiplication runs at a stellar
> 1.93 FLOPS/cycle on my Athlon 600 classic/Win2000. When I insert the
> exchange part (I have tried a few dozen variations of it), performance
> drops enormously, which I cannot explain.
> Currently I am trying to modify the routine for MSVC++ in order to run the
> AMD Code Analyzer, but I do not think that this will shed light on the
> problem.
> So I ask everybody who feels able to help me to program the register
> exchange part of my kernel. I have prepared a NASM .asm file (and C test
> program) that is ready for modifications. This code is the one that runs
> at 1.93 FLOPS/cycle, so a direct comparison is possible.
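
For anyone trying to picture the structure described above, here is a
minimal C sketch of a six-accumulator update. It is not Julian's NASM code:
the function name dot6_update, the column-major layout of B with leading
dimension ldb, and the use of double precision are all assumptions made
for illustration. In the real kernel the six accumulators would sit in x87
stack registers; here they are plain locals and the compiler decides where
they live. The loads before the loop and the stores after it are the
"exchange part" in question.

```c
#include <stddef.h>

/*
 * Hypothetical illustration only (name, layout and precision are assumed).
 * Computes c[j] += sum_p a[p] * b[j*ldb + p] for j = 0..5, i.e. one row of A
 * against six columns of a column-major B, using six scalar accumulators.
 */
void dot6_update(const double *a, const double *b, size_t ldb,
                 size_t k, double *c)
{
    /* Exchange, part 1: load the six current elements of C. */
    double c0 = c[0], c1 = c[1], c2 = c[2];
    double c3 = c[3], c4 = c[4], c5 = c[5];

    /* Inner loop: six independent multiply-add chains share one load of A. */
    for (size_t p = 0; p < k; p++) {
        double ap = a[p];
        c0 += ap * b[0 * ldb + p];
        c1 += ap * b[1 * ldb + p];
        c2 += ap * b[2 * ldb + p];
        c3 += ap * b[3 * ldb + p];
        c4 += ap * b[4 * ldb + p];
        c5 += ap * b[5 * ldb + p];
    }

    /* Exchange, part 2: store the six results back to C. */
    c[0] = c0; c[1] = c1; c[2] = c2;
    c[3] = c3; c[4] = c4; c[5] = c5;
}
```

With a large k the loads and stores of C are amortized over many
multiply-adds, so in a sketch like this the choice of k decides how much
the exchange can cost relative to the inner loop.
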
> Requirements:
> - Cygwin installed
> - NASM installed
> - Skills in Assembly
>
> The person who finds a fast solution will win a golden cake and much
> honor!
> If anybody from AMD reads this posting, please help us. Frank S., what
> about you?
>
> Regards
>
> Julian
>
> R Clint Whaley wrote:
>
> >Guys,
> >
> >I include below some timings on a 733 Mhz G4e (access courtesy of SourceForge
> >compile farm).  For quite a while now, Apple's "half the Mhz, half again the
> >price" strategy has eluded me, but this machine ought to at least reduce the
> >screaming fits of its laugh-test failure to at most a few furtive chuckles.
> >
> >Essentially, it is still not going heads up against either the Athlon or
> >P4 (and if anyone hits me with the clock-for-clock crap, I will point out that
> >clock for clock the original Power chip is still the champ), but I think
> >it is cleaning the floor with the PIII, for instance (let's not mention
> >price, though, eh?).
> >
> >In single precision, its results are roughly 75% of a P4 clocked at twice
> >its speed (before you sneer with the "easy to be fast at low Mhz", I'll remind
> >you it is doing this with good ol' SDRAM, so that's pretty impressive), and it
> >almost doubles the performance of a 933Mhz PIII . . .
> >
> >These results are much crappier on an original G4.  Obviously, the extra level
> >of cache can't be hurting, but perhaps the greater instruction bandwidth,
> >etc., are helping as well.
> >
> >I found it interesting to compare these timings to the ones I have previously
> >posted for the P4 and PIII.  Note that gemm timings can be compared pretty
> >directly (no real change from 3.3.0 till 3.3.7), but the LU timings cannot
> >(3.3.7 has some speedups over 3.3.0) . . .
> >
> >Cheers,
> >Clint
> >
> >ATLAS 3.3.7 on 733Mhz G4e, 256K L2, 1MB L3
> >
> >             100    200    300    400    500    600    700    800    900   1000
> >          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> >ATL dLU    386.8  480.7  513.0  580.7  594.3  684.9  671.8  668.7  703.8  724.1
> >ATL dMM    416.7  687.7  771.4  914.3  757.6  919.1  879.5  922.5  928.7  943.4
> >
> >ATL sLU    437.3  631.0  897.8  982.8 1109.4 1307.5 1343.7 1482.7 1566.4 1586.1
> >ATL sMM   1428.6 1600.0 1800.0 2560.0 2500.0 2400.0 2450.0 3011.8 2803.8 2631.6
> >
> >            1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
> >          ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> >ATL dLU    733.3  758.7  786.6  799.7  809.0  819.4  833.0  838.5  840.4  837.0
> >
> >ATL sLU   1744.4 1846.8 1922.1 1993.0 2058.4 2118.3 2167.8 2206.0 2261.3 2275.0
> >ATL sMM   2953.8 2814.4 3022.9 2858.8 3053.4 2937.4 3061.8 2936.7 3081.0 2995.0
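
Since this thread mixes per-cycle figures with raw rates, a small
conversion helper may make the numbers easier to compare. This is only a
sketch: it assumes the table entries are MFLOPS (the table itself does not
state a unit), and the function names are made up for illustration.

```c
/* Hypothetical helpers; assumes rates are in MFLOPS and clocks in MHz. */

/* Floating-point operations per clock cycle for a given rate and clock. */
double flops_per_cycle(double mflops, double clock_mhz)
{
    return mflops / clock_mhz;
}

/* Rate in MFLOPS for a given per-cycle figure and clock. */
double mflops_from_cycles(double flops_cycle, double clock_mhz)
{
    return flops_cycle * clock_mhz;
}
```

Under that assumption, the best sMM entry above (3081.0 at size 2800 on
the 733 MHz G4e) works out to roughly 4.2 per cycle, and 1.93 FLOPS/cycle
on a 600 MHz Athlon corresponds to about 1158 MFLOPS.
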
> >
> >_______________________________________________
> >Math-atlas-results mailing list
> >[email protected]
> >http://lists.sourceforge.net/lists/listinfo/math-atlas-results
> >