
Re: [Math-atlas-results] SSE warnings, Band matrix request feature



R Clint Whaley <[email protected]> writes:

> Camm,
> 
> First off, since this message doesn't have any timings in it, it is probably
> not wise to post it to the results list :)
> 

Oops :-)

> 
> Redefining the same macro name with the same definition is allowed by ANSI/
> ISO C, but several compilers nonetheless issue warnings about it.  Elsewhere
> in ATLAS, we never do it.  If any macro is going to be redefined, #undef
> is first applied.  I think that would be best for this as well, even though
> your stuff is only meant to be compiled by gcc (most atlas routines can be
> compiled by any compiler).  This will guarantee that all present/future gccs
> don't issue the pages of warnings . . .
> 

OK
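
For reference, the #undef-before-redefine convention described above can be sketched as follows. The macro name VEC_LEN and the helper vec_len() are made up for illustration and are not taken from the ATLAS sources.

```c
/* First definition, e.g. pulled in from one kernel header
 * (VEC_LEN is a hypothetical name, not an ATLAS macro). */
#define VEC_LEN 4

/* A later header wants to define the same name again.  An identical
 * redefinition is legal ISO C, but undefining first keeps compilers
 * that warn about it quiet, and also covers a changed definition. */
#ifdef VEC_LEN
#undef VEC_LEN
#endif
#define VEC_LEN 4

/* Small helper so the final value of the macro can be inspected. */
int vec_len(void)
{
    return VEC_LEN;
}
```

The #ifdef guard is optional, since #undef on an undefined name is harmless, but it makes the intent explicit.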

> >2) I've gotten interested in band matrices recently, and am wondering
> >   how atlas handles these.  Take the extreme case of a diagonal
> >   matrix, 'band packed' so that the diagonal elements are contiguous in
> >   memory.   For s{tsg}bmv, there seems to be no way the basic atlas
> >   code can hand this off to a kernel without moving the memory
> >   around.  But this would be an easily vectorizable operation.
> >   Should we have a 4th l2 kernel to deal with band matrices?
> 
> ATLAS handles banded and packed as, essentially, reference BLAS.  In our NSF
> proposal (rejected), Antoine and myself laid out how to handle these guys,
> including extending them to Level 3 operations, giving you order of magnitude
> performance improvements.  You can indeed base them on kernels, but not, as
> you point out, the very narrow band cases.
> 

1) so I take it the level 3 proposal was for an extension of the blas
   spec?
2) My comment was that the existing kernels would of course not work.
   Why can't (different) kernels be used with narrow band cases?

> It was going to be a year or two of work by our full-time team to do this very
> thorough solution we proposed, so it's pretty clear it won't happen now.  As
> far as things that are within the realm of the possible, if you examine 
> Antoine's Level 2 packed and banded routines, you will see they are like our
> dense Level 2: mixed recursive/kernel-based solutions.  This means that if
> someone were to write efficient versions of Antoine's reference kernel, you
> would speed up the entire Level 2 packed/banded, just as with dense.  
> 

Great!  I'll see if I can take a look.  The testers should not be an
obstacle.  

But before I do, maybe I should clarify my thinking a bit -- it could
be these band routines won't be efficient for the case of, say, a
single diagonal.  Here are the linear algebra routines I write
frequently, and would love to off-load to an optimized blas library,
if for no other reason than ease of maintenance, not to mention the
isolation of the ISA extensions, etc.

a) a[i]*=b[i]; (should be a ?sbmv with k=0)
b) a[i]+=const.
c) a[i][j]-=b[i]+b[j]-const
d) ffts
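
To make a) concrete, here is a minimal sketch of a reference symmetric band matrix-vector product, assuming column-major upper band storage in the usual BLAS layout. The names ref_ssbmv_upper and diag_scale are stand-ins written for this note, not ATLAS or BLAS entry points. With k=0 and lda=1 the general routine collapses to the one-line loop in diag_scale.

```c
#include <stddef.h>

/* y := alpha*A*x + beta*y, where A is symmetric of order n with k
 * superdiagonals, stored column-major in upper band format:
 * A(i,j) sits at ab[(k + i - j) + j*lda] for max(0, j-k) <= i <= j. */
void ref_ssbmv_upper(int n, int k, float alpha, const float *ab, int lda,
                     const float *x, float beta, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = (beta == 0.0f) ? 0.0f : beta * y[i];

    for (int j = 0; j < n; j++) {
        int i0 = (j - k > 0) ? j - k : 0;
        for (int i = i0; i <= j; i++) {
            float aij = ab[(size_t)(k + i - j) + (size_t)j * (size_t)lda];
            y[i] += alpha * aij * x[j];
            if (i != j)
                y[j] += alpha * aij * x[i];  /* symmetric counterpart */
        }
    }
}

/* What the k = 0, lda = 1, alpha = 1, beta = 0 case reduces to:
 * an elementwise product over contiguous memory. */
void diag_scale(int n, const float *d, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = d[i] * x[i];
}
```

Note that both routines write to a separate output vector y. Since the BLAS interface assumes x and y do not overlap, the in-place update a[i]*=b[i] would need a copy of a to serve as x.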


While a) should be covered in the current spec, (?sbmv), will this be
efficient in this no-superdiagonal case?  b) and c) can be attained by
making a dummy vector x filled with 1, and using axpy/syr2
respectively, but this increases the operation count considerably by
all the unnecessary multiplications by 1.  The standard algorithms for
d) can certainly be blas-ized, but all the memory accesses are
non-contiguous, and I'm not sure if any performance can be gained.
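
The dummy-vector trick for b) and c) can be sketched as below. All function names are hypothetical reference loops written for this note, not library calls. The choice y[i] = b[i] - const/2 for folding the constant into the rank-2 update is one possible arrangement and is an assumption, since the text above does not spell it out. Unlike a real ?syr2, the stand-in here updates the full square array rather than one triangle.

```c
/* Reference axpy: y := alpha*x + y. */
void ref_saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Reference rank-2 update on a full row-major n-by-n array:
 * A := alpha*(x*y^T + y*x^T) + A. */
void ref_ssyr2_full(int n, float alpha, const float *x, const float *y,
                    float *a)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] += alpha * (x[i] * y[j] + y[i] * x[j]);
}

/* b) written directly: a[i] += c, with no multiplications. */
void add_const(int n, float c, float *a)
{
    for (int i = 0; i < n; i++)
        a[i] += c;
}

/* c) written directly: a[i][j] -= b[i] + b[j] - c. */
void sub_pair_sum(int n, const float *b, float c, float *a)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] -= b[i] + b[j] - c;
}
```

With ones[i] = 1, b) corresponds to ref_saxpy(n, c, ones, a), and c) corresponds to ref_ssyr2_full(n, -1, ones, y, a) with y[i] = b[i] - c/2. The direct loops do the same updates without the multiplications by 1, which is the extra cost mentioned above.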

In general, you mentioned a while ago the possibility of going beyond
blas.  What are your thoughts here?

take care,

> However, there are no packed/banded kernel testers/timers as there are with
> dense, so this is more problematic.  I will not have time to produce such
> tester/timers myself . . .
> 
> Cheers,
> Clint
> 
> 

-- 
Camm Maguire			     			[email protected]
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah