[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: A bug and possible fix in atlas3.3.0



Shi,

>	The problem was traced to:
>        ATLAS/src/blas/level3/kernel/ATL_CtrsmKL.c

AAAARGH!  Now we see the difference between a developer release and the
real thing ;>  

Thank you very much for catching this; the p0 that is causing the error and
puzzling you is a manual prefetch of the next iteration's data.  I just added
these routines in the developer release, and I now remember making a fatal
mistake:  with the manual prefetch, you must stop your iteration 1 cycle before
the end, and then finish the last cycle by hand without prefetch.  However,
the additional memory ref. will not usually (as we have seen) cause probs,
so I decided to develop the things with the full N-loop, and once everything
was debugged, go back in and unroll the last iteration once I was sure it
worked.  Guess which step I forgot . . .

Anyway, the fix you suggest of moving the memory reference to this iteration
will also work, but it may cost you in performance . . .

I may take a bit to get the next developer release out; right now I have
support for user contribution of about 1/2 of the level 1 BLAS.  I'd like
to finish that before issuing the next dev release . . .

Thanks,
Clint
From [email protected] Sun Jun  3 15:04:07 2001
Return-Path: <[email protected]>
Received: from shi.phy.okstate.edu (marvin@localhost) 
        by cs.utk.edu with ESMTP (cf v2.9s-UTK)
	id PAA15339; Sun, 3 Jun 2001 15:04:05 -0400 (EDT)
Received: from shi.phy.okstate.edu (139.78.124.240 -> shi.phy.okstate.edu)
 by cs.utk.edu (smtpshim v1.0); Sun, 3 Jun 2001 15:04:05 -0400
Received: from localhost (shi@localhost)
	by shi.phy.okstate.edu (8.11.0/8.11.0) with ESMTP id f53J43627509
	for <[email protected]>; Sun, 3 Jun 2001 14:04:03 -0500
Date: Sun, 3 Jun 2001 14:04:03 -0500 (CDT)
From: Junren Shi <[email protected]>
X-X-Sender:  <shi@xie0>
To: <[email protected]>
Subject: A bug and possible fix in atlas3.3.0
Message-ID: <Pine.LNX.4.33.0106031328580.27488-100000@xie0>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: R

Hi,

	I found a bug in atlas3.3.0. At certain situation, it causes
segment fault. In my case, I built a dynimic library for atlas3.3.0 for
Linux_P4SSE2, and link it with matlab 6.0. Under matlab:

   >> A \ B; (A and B are complex, matlab will call zgesv)

will cause segment fault. The dimensions of the A and B which cause the
error is not repeatable. Sometimes working, sometimes not.

	The problem was traced to:

        ATLAS/src/blas/level3/kernel/ATL_CtrsmKL.c

In functions: trsm_LU2, trsm_LU3, trsm_LU4.
For instance, for trsm_LU2, the codes read:

   TYPE *bn=B+ldb2;
   int j;

   p0 = B[2];
   for (j=N; j; j--)
  {
      xr2 = p0  ; xi2 = B[3];
      xr1 = *B  ; xi1 = B[1];

      t0 = xr2;
      xr2 = ar22*xr2 - ai22*xi2;
      xi2 = ar22*xi2 + ai22*t0;   p0 = bn[2];
      ...

      B = bn;
      bn += ldb2;
   }

   apparently, p0 = bn[2] will make a invalid memory reference at the
final iteration.

   I changed trsm_LU2 as following, the similar changes are also in
in trsm_LU3 and trsm_LU4. Basicly it changes for(j=N; j; ...) to
for(j=N-1; j; ...). And manually adds the codes for Nth interation. It
seems it is working now.

   By the way, I don't really undestand the codes. It seems to me that the
codes could become simpler if written as:

    for(j=N; j; ...){
       p0 = B[2];
       ....;

       xi2 = ar22*xi2 + ai22*t0;

       ...;
     }

   Am I correct? Or there is some special consideration here?

   Thanks

Shi

P.S. Atlas3.3.0 is much faster than intel MKL in P4 now. The double
precision gemm can reach 1862 Mflops. Good job!

------------------------------------------------------------------------
static void trsmLU_2(const int N, const TYPE *A, TYPE *B, const int ldb)
/*
 * 'Left, 'Upper', with 1 col prefetch, written with all dependencies
shown,
 * so that compiler can optimize.
 * A is known to be 2x2, with 1/alpha already applied, diagonals already
 * inverted
 */
{
   const TYPE ar11=*A, ai11=A[1], ar12=A[4], ai12=A[5];
   const TYPE ar22=A[6], ai22=A[7];
   TYPE xr1, xi1, xr2, xi2;
   TYPE t0, p0;
   const int ldb2=ldb+ldb;
   TYPE *bn=B+ldb2;
   int j;

   p0 = B[2];

   for (j=N-1; j; j--)
  {
      xr2 = p0  ; xi2 = B[3];
      xr1 = *B  ; xi1 = B[1];

      t0 = xr2;
      xr2 = ar22*xr2 - ai22*xi2;
      xi2 = ar22*xi2 + ai22*t0;   p0 = bn[2];

      xr1 -= ar12*xr2 - ai12*xi2;
      xi1 -= ar12*xi2 + ai12*xr2;
      t0 = xr1;
      xr1 = ar11*xr1 - ai11*xi1;
      xi1 = ar11*xi1 + ai11*t0;

      *B   = xr1; B[1] = xi1;
      B[2] = xr2; B[3] = xi2;
      B = bn;
      bn += ldb2;
   }

      xr2 = p0  ; xi2 = B[3];
      xr1 = *B  ; xi1 = B[1];

      t0 = xr2;
      xr2 = ar22*xr2 - ai22*xi2;
      xi2 = ar22*xi2 + ai22*t0;

      xr1 -= ar12*xr2 - ai12*xi2;
      xi1 -= ar12*xi2 + ai12*xr2;
      t0 = xr1;
      xr1 = ar11*xr1 - ai11*xi1;
      xi1 = ar11*xi1 + ai11*t0;

      *B   = xr1; B[1] = xi1;
      B[2] = xr2; B[3] = xi2;
      B = bn;
      bn += ldb2;

}