[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Math-atlas-results] SSE warnings, Band matrix request feature
Hi Clint!
R Clint Whaley <[email protected]> writes:
> >1) so I take it the level 3 proposal was for an extension of the blas
> > spec?
>
> Yep, if you have banded and packed routines take a leading dimension and some
> starting/stopping criteria, you can write level-3 based kernels (which are
> slightly modified dense kernels for packed), and then use the same recursive
> algorithms as in dense. This gives major speedups.
>
Interesting. Too bad about the proposal. I'm sure it is a lot of
work, yes?
> >2) My comment was that the existing kernels would of course not work.
> > Why can't (different) kernels be used with narrow band cases?
>
> What I meant by this is that narrow-band guys are essentially Level 1 ops,
> which means optimization is not as good as 2 or 3, and that reusing kernels
> is difficult, because there are no low-order costs you can ignore (think
> matrix copy for Level 3, vector copy Level 2) . . .
>
True. Is there any way for the banded l2 for example to kick over to
small loops over level 1 routines with narrow matrices, kind of like
the small case test in gemm?
> >a) a[i]*=b[i]; (should be a ?sbmv with k=0)
> >b) a[i]+=const.
>
> These would be additional Level 1 ops, not banded or packed, surely? Adding
> them as additional Level 1 ops would not be hard with the templates already
> in place . . .
OK, but a) is also sbmv with k=0, no? Seems to work, anyway.
You'd mentioned before that atlas treats these as ref blas. Then why
this? (sse1 850MHz p3)
/usr/lib/atlas/xsl2blastst -R tbmv -n 1000 -F 200
----------------------------- TBMV ------------------------------
TST# UPLO TRAN DIAG N K LDA INCX TIME MFLOP SpUp TEST
==== ==== ==== ==== ==== ==== ==== ==== ====== ====== ===== =====
0 L N N 1000 1 2 1 0.00 158.7 1.00 -----
0 L N N 1000 1 2 1 0.00 36.6 0.23 PASS
1 tests run, 1 passed
/usr/lib/atlas/xsl2blastst -R gemv -n 1000 -F 200
------------------------------- GEMV --------------------------------
TST# TR M N ALPHA LDA INCX BETA INCY TIME MFLOP SpUp TEST
==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
0 N 1000 1000 1.0 1000 1 1.0 1 0.03 67.4 1.00 -----
0 N 1000 1000 1.0 1000 1 1.0 1 0.01 238.7 3.54 PASS
1 tests run, 1 passed
>
> >c) a[i][j]-=b[i]+b[j]-const
>
> What the hell kind of operation is this?
>
Actually, a syr2k with 1 vector all ones. In fact, it seems many of
these routines get invoked on a vector of ones somewhere, or is that
just me?
One other item regarding possible precision. I've written a
quadratic program using basically a loop over gemv and ger. I've run
with both double and single precision, using both atlas and ref blas.
And it seems that atlas is losing a lot more precision than ref blas
for the single float case. Here are my results (all these vectors
should be the same to within rounding):
=============================================================================
Ref blas, double precision:
=============================================================================
$3 = {0, 0, 0, 0.018940107898226873, 0.0066021159141143645, 0,
0.0098790949108835055, 0.002352625132218658, 0, 0, 0.043024871496419696, 0,
0, 0.003431704885730046, 0, 0, 0, 0.0028504331146380468, 0, 0,
0.0072203088464554126, 9.88852015697785e-05, 0, 0.0023821475938574058, 0, 0,
0, 0, 0, 0.013236189261078805, 0.020158668979830381, 0,
0.0061762775312050426, 0.0037328201000592612, 0.0089936656569935425,
0.0069173895768482232, 0, 0.0021318399813026172, 0, 0.014128699897621055,
0.004155492234810312, 0, 0, 0, 0.0053944443352520266, 0.011233760977118805,
0, 0.015775110574074071, 0.039780460298061364, 0, 0, 0, 0,
0.013780690365307539, 0.00013085617793068323, 0, 0.0056143347174945415,
0.0030021166895419344, 0.00017110725612916425, 0, 0, 0,
0.0022233377673914652, 0.0078112045315271994, 0.008062977599026782, 0,
0.0047622593146473503, 0, 0.0092151662037889982, 0, 0.013641357424147373,
0 <repeats 11 times>, 0.0023007066920399419, 0, 0, 0, 0.019783023745609019,
0, 0.0030098138926147877, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.004376432858798472}
=============================================================================
Ref blas. single precision
=============================================================================
$4 = {0, 0, 0, 0.0193224791, 0.00681497343, 0, 0.00986788794, 0.00191707187,
0, 0, 0.0434913486, 0, 0, 0.00334887533, 0, 0, 0, 0.00334869302, 0, 0,
0.00728768529, 1.79629933e-05, 0, 0.00284249382, 0, 0, 0, 0, 0,
0.0132361548, 0.0201619491, 0, 0.00567072676, 0.00403881213, 0.00907292403,
0.00605385005, 0, 0.00213182555, 0, 0.0145751694, 0.00393279782, 0, 0, 0,
0.00539445179, 0.0113568529, 0, 0.0154621108, 0.0397804491, 0, 0, 0, 0,
0.0135530541, 0.000135249895, 0, 0.00546122249, 0.00274914224,
0.000139613199, 0, 0, 0, 0.00207114615, 0.0076302751, 0.00818008091, 0,
0.00498831039, 0, 0.00935398508, 0, 0.0133951297, 0 <repeats 11 times>,
0.00230074138, 0, 0, 0, 0.0200425033, 0, 0.00290917116, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0.00403231615}
=============================================================================
atlas double precision
=============================================================================
$5 = {0, 0, 0, 0.018940107895699877, 0.0066021227977551469, 0,
0.0098791040760392978, 0.0023526251311276414, 0, 0, 0.043024871498807404, 0,
0, 0.0034317022092371905, 0, 0, 0, 0.0028504331110518455, 0, 0,
0.0072202984477313667, 9.888495348369022e-05, 0, 0.0023821475909385102, 0,
0, 0, 0, 0, 0.013236189261078857, 0.020158668979602598, 0,
0.0061762775267177909, 0.0037328174594609014, 0.0089936670203928423,
0.0069173895745572051, 0, 0.0021318399813026138, 0, 0.014128699899851334,
0.0041554912282962958, 0, 0, 0, 0.00539444433525209, 0.011233760973877238,
0, 0.015775110571483737, 0.039780460298061412, 0, 0, 0, 0,
0.013780692105380066, 0.00013085769806731604, 0, 0.0056143347156559715,
0.0030021166882750992, 0.00017110455923383634, 0, 0, 0,
0.002223337767391506, 0.0078112045288354984, 0.0080629767943949, 0,
0.0047622593115835016, 0, 0.0092151583019473293, 0, 0.013641357421016581,
0 <repeats 11 times>, 0.0023007066920400221, 0, 0, 0, 0.01978302374266707,
0, 0.0030098096180152987, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0.0043764328584068165}
=============================================================================
atlas single precision
=============================================================================
$6 = {0, 0, 0, 0.0198229793, 0.00542137539, 0, 0.00876359735, 0.00156316732,
0, 0, 0.0425283462, 0, 0, 0.00367816561, 0, 0, 0, 0.00305529614, 0, 0,
0.00622634217, 0, 0, 0.000910904317, 0, 0, 0, 0, 0, 0.0132361939,
0.021575097, 0, 0.00552129187, 0.0016044986, 0.00860607717, 0.00690866355,
0, 0.00213182229, 0, 0.0151669253, 0.00489541329, 0, 0, 0, 0.00539442617,
0.014026829, 0, 0.0175433327, 0.0397804528, 0, 0, 0, 0, 0.0139623322, 0, 0,
0.00566139678, 0.00161994447, 0.00017111942, 0, 0, 0, 0.00235463725,
0.00996533688, 0.00697260955, 0, 0.00575390365, 0, 0.0080448091, 0,
0.0121865831, 0 <repeats 11 times>, 0.00230072462, 0, 0, 0, 0.020749582, 0,
0.00329418154, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00432143174}
=============================================================================
Does this look nomal? All libs pass all testers/timers. Is it
possible that the sse1 fpu has fewer bits internally? The normal fpu
does everything at 80 bits if I recall, right?
> >d) ffts
> Kind of a big topic. What about fftw (http://www.fftw.org/)? I've never
> used it myself (I can't spell fft), but have heard good things about it.
> Uses some of the same kinds of ideas as ATLAS, as I understand it . . .
>
Thanks for the tip!
> By the way, you mentioned optimization of of non-unit stride vectors. the
> speedups to be had are pretty meager, even in the best case (read only).
> My feeling is that 5% would be heroic. Probably not worth thinking about
> except in exceptional cases. For level 1 ops, memory bandwidth is the big
> constraint most of the time, and prefetch is the only real anelgesic. From
> my limited experience, writing to the vector tends to kill the advantage of
> prefetch a great deal (I guess the bus is to busy to prefetch), so your
> big wins come on scalar-output routines like nrm2, iamax, ddot, etc.
>
Agreed. Have you looked into writing behind the reading pt at various
distances, and the different flavours of prefetch?
Take care,
> Cheers,
> Clint
>
>
--
Camm Maguire [email protected]
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah