Re: SSE-enabled level 2
>1) There still seems to be some noticeable hit for symv wrt gemv, most
> probably due to the very different data access patterns, as I
> understand it. Is there a way around this?
There is a way around this, but it needs to be applied to the ATLAS install,
not your code. The symmetric routines (SYMV, and to a lesser extent SYR2)
are special in that they can reuse $A$, the dominant cost of the algorithm,
in L1. Taking SYMV as an example, SYMV is built by calling GEMV twice: once
with Notrans, and once with Trans. ATLAS blocks the operation so that in the
second call, A comes from L1.
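To make the reuse concrete, here is a minimal sketch of a blocked SYMV on the lower triangle, in plain Python rather than anything resembling the real ATLAS kernels. The names gemv_t, gemv_n, symv_diag, symv_lower and the block size nb are made up for illustration, and running Trans first and Notrans second is my reading of the description above, not a statement about the actual ATLAS source. The point is only that both calls walk the same block of A back to back:

```python
def gemv_t(A, r0, r1, c0, c1, x, y):
    """y[c0:c1] += A[r0:r1, c0:c1]^T * x[r0:r1]  (Trans)."""
    for j in range(c0, c1):
        acc = 0.0
        for i in range(r0, r1):
            acc += A[i][j] * x[i]
        y[j] += acc


def gemv_n(A, r0, r1, c0, c1, x, y):
    """y[r0:r1] += A[r0:r1, c0:c1] * x[c0:c1]  (Notrans)."""
    for i in range(r0, r1):
        acc = 0.0
        for j in range(c0, c1):
            acc += A[i][j] * x[j]
        y[i] += acc


def symv_diag(A, b0, b1, x, y):
    """Diagonal block: read only the lower triangle, update both halves."""
    for i in range(b0, b1):
        for j in range(b0, i):
            y[i] += A[i][j] * x[j]
            y[j] += A[i][j] * x[i]
        y[i] += A[i][i] * x[i]


def symv_lower(A, x, nb=2):
    """y = A*x for symmetric A, touching only entries with row >= column.

    nb is a hypothetical block size; in a real library it would be chosen
    so that an nb-by-nb block of A fits in L1.
    """
    n = len(x)
    y = [0.0] * n
    for c0 in range(0, n, nb):
        c1 = min(c0 + nb, n)
        symv_diag(A, c0, c1, x, y)
        for r0 in range(c1, n, nb):
            r1 = min(r0 + nb, n)
            # First pass brings the block in from L2 or main memory ...
            gemv_t(A, r0, r1, c0, c1, x, y)
            # ... second pass reads the same block again, ideally from L1.
            gemv_n(A, r0, r1, c0, c1, x, y)
    return y
```

Python will not show any cache effect, of course; the sketch only shows the access pattern that makes the second GEMV call see A in L1.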
Now, your code is optimized for A coming from L2 or main memory. When A is already
in L1, I'm guessing the prefetch becomes pure overhead, and slows you down.
Right now, ATLAS uses the fastest individual gemvN and gemvT for SYMV. What
we *should* do is take the fastest gemvT, and then retime all the gemvN's
as used by SYMV, and use that in SYMV. We've known about this for a long
time, it's just a question of finding the time to modify the install process
appropriately. I guess if I had any users begging for faster SYMV times,
it would be up higher on the do-it queue. Note that this would be a general
speedup for all SYMV; your SSE-stuff would just get more benefit than usual.
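The selection step described above could look roughly like the following. This is a hypothetical sketch, not the ATLAS install code: pick_fastest, pick_gemvn_for_symv, and the symv(gemvt, gemvn) calling convention are all invented here. The only idea it is meant to show is that the gemvN candidates get timed inside SYMV, with the gemvT already fixed, instead of standalone:

```python
import time


def pick_fastest(candidates, run, repeats=3, timer=time.perf_counter):
    """Return the name of the candidate whose best-of-`repeats` time is lowest.

    candidates: dict mapping a name to a kernel (any object `run` accepts).
    run:        callable that exercises one kernel in the context being timed.
    """
    best_name, best_time = None, float("inf")
    for name, kernel in candidates.items():
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = timer()
            run(kernel)
            elapsed = min(elapsed, timer() - t0)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


def pick_gemvn_for_symv(gemvn_candidates, gemvt, symv, repeats=3,
                        timer=time.perf_counter):
    """Time each gemvN candidate as used by SYMV, with gemvT held fixed.

    symv(gemvt, gemvn) is an assumed driver that runs one full SYMV using
    the two given kernels, so gemvN sees A the way it would in practice.
    """
    return pick_fastest(gemvn_candidates,
                        lambda gemvn: symv(gemvt, gemvn),
                        repeats, timer)
```

The standalone timing would be pick_fastest(gemvn_candidates, run_alone) with some run_alone that streams A from memory, and the two can disagree: a kernel that wins standalone because of its prefetch may lose once A is already in L1.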
This does bring up an interesting, if probably unusable, point. The main or
L2-cache optimized L2BLAS are *worse* for a guy keeping things in L1. So
users with tight loops and memory access will not thank us for adding prefetch.
>2) Any idea of what a new SSE sgemm based sgemv would do? Gemm based
> routines won out in the original atlas, if memory serves.
Won't help. Could only be used when M = KB. You obviously can't afford a
data copy for an N^2 algorithm like GEMV. The reason GEMM used to win is
not because GEMM is a good way to do GEMV, but because the generated GEMM was
so much better than the hand-implementations we had...
Cheers,
Clint