Re: ATLAS
Hello again!
R Clint Whaley <[email protected]> writes:
> Camm,
>
> I don't know what specific requirements SSE forces on you, but the read/write
> pattern is much better for DDOT-based codes than DAXPY, because DAXPY
> does additional writes, whereas DDOT does additional reads, which are
> cheaper than writes . . .
>
> Cheers,
> Clint
>
You are of course right here, and I think, contrary to my earlier
guess, the complex case shows this even more clearly. But I'm
a bit confused:
routine  prefetch      MFLOPS  behaviour
sgemvT   prefetcht0    208     stable
sgemvT   prefetchnta   238     stable
sgemvN   prefetcht0    217     fluctuates a bit
sgemvN   prefetchnta   242     fluctuates a bit
cgemvT   prefetcht0    370     stable
cgemvT   prefetchnta   383     stable
cgemvN   prefetcht0    330     stable
cgemvN   prefetchnta   250     fluctuates a lot
What appears to be going on here is that the extra writes in the N
case pollute the L1 cache in an erratic fashion. Apparently prefetchnta
doesn't guarantee that the data is in all levels of cache, which makes
this disruption more evident. The single-precision case appears to be
entirely RAM-bandwidth limited, but then why does the axpy in the N
case do *better*? At least this seems to indicate that the complex N
case could profit from a ddot implementation, no?
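
To make the comparison concrete, here is a minimal sketch in plain C of
the two loop orderings for the N case. It is not the ATLAS or SSE
kernel: the function names, the column-major storage with leading
dimension lda, and the use of real single precision instead of complex
are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y += A*x for an m x n matrix A.
 * Assumption: A is column-major with leading dimension lda.
 * AXPY form: every column of A triggers a full read+write pass over y,
 * so each y[i] is loaded and stored n times. */
void gemvN_axpy(size_t m, size_t n, const float *A, size_t lda,
                const float *x, float *y)
{
    assert(lda >= m);
    for (size_t j = 0; j < n; j++) {
        const float xj = x[j];
        const float *col = A + j * lda;
        for (size_t i = 0; i < m; i++)
            y[i] += col[i] * xj;
    }
}

/* Same result in DOT form: each y[i] is accumulated in a scalar and
 * written exactly once, at the cost of re-reading x for every row and
 * walking A with stride lda. */
void gemvN_dot(size_t m, size_t n, const float *A, size_t lda,
               const float *x, float *y)
{
    assert(lda >= m);
    for (size_t i = 0; i < m; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)
            acc += A[i + j * lda] * x[j];
        y[i] += acc;
    }
}
```

The dot form trades the repeated stores to y for extra reads, which is
the pattern Clint describes as cheaper. The strided walk through A is
the obvious cost in this naive version, and a real kernel would
presumably block over several rows and columns to limit it.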
Take care,
--
Camm Maguire [email protected]
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah