*******************************************************************************
All Level 2 BLAS are written in terms of 3 primitive routines:
  (1) NoTranspose GEMV
  (2) Transpose GEMV
  (3) GER

However, for this beta release ATLAS does not have code generators for 
these primitive operations.  Instead, ATLAS employs a number of hand-coded
primitive routines, and a search script tries them all and chooses the best
for a given architecture.  While the primitives supplied by the ATLAS team
achieve decent coverage, there are certainly cases where they are far from
optimal on some architectures.  With the ATLAS infrastructure, however, it is
relatively easy for an interested user to supply a primitive to ATLAS.  ATLAS
will time it against the cases supplied by the ATLAS team and, if it is
superior, use it to speed up all of the Level 2 BLAS.  This file explains how
the user may do this if the performance ATLAS achieves for the Level 2 BLAS
is not adequate.

We make this information available because users have asked in the past how
they can speed things up using their particular knowledge of their
architecture, not because we expect the typical user to try it.

Users may write new primitive routines from scratch, or modify an existing
one (e.g., adding prefetch instructions to the provided codes).

*******************************************************************************
           SECTION 1: SPEEDING UP GEMV, HEMV, SYMV, TRMV AND TRSV
*******************************************************************************
These routines are all based on GEMV.  Therefore, to speed them up, the user
needs to supply a more efficient GEMV primitive.  The hand-coded GEMV
primitives may be found in ATLAS/tune/blas/gemv/CASES.  

   ----------------------------------------------------------------------------
                 1.1 : The primitive description file
   ----------------------------------------------------------------------------
The most important file in ATLAS/tune/blas/gemv/CASES is the primitive description
file, <pre>cases.dsc.  Each precision has its own description file (as
indicated by <pre>), and this file describes all of the routines to time in
order to find the best.  For instance, for double precision, we see:

--------------------------------------------------
speedy. cat dcases.dsc 
9
1 0 0 ATL_gemvN_mm.c
0 1 1 ATL_gemvN_1x1_1.c
2 32 1 ATL_gemvN_1x1_1a.c
0 4 2 ATL_gemvN_4x2_0.c
0 4 4 ATL_gemvN_4x4_1.c
0 8 4 ATL_gemvN_8x4_1.c
0 16 2 ATL_gemvN_16x2_1.c
0 16 4 ATL_gemvN_16x4_1.c
2 32 4 ATL_gemvN_32x4_1.c
6
1 0 0 ATL_gemvT_mm.c
0 2 8 ATL_gemvT_2x8_0.c
0 4 8 ATL_gemvT_4x8_1.c
0 4 16 ATL_gemvT_4x16_1.c
0 2 16 ATL_gemvT_2x16_1.c
0 1 1 ATL_gemvT_1x1_1.c
--------------------------------------------------

The first number (in this case 9) is the number of NoTranspose primitives
to time.  This is followed by that number of primitive lines describing those 
NoTrans primitives, and then we supply the number of Transpose primitives to
time (in this example, 6), followed by that number of primitive lines
describing the Transpose primitives.

As you can see, each line supplies three integers and a filename to the
search routine.  The filename is the filename of the primitive to time.
The three integers supply information necessary in order for the higher
level routines to do blocking.

This is the first piece of important information about these primitive
routines: no blocking should be done in them.  The appropriate blocking
is done by higher level ATLAS routines.  Most primitives
employ some kind of loop unrolling, and when these higher level routines
block in order to reuse vectors or matrices, it is important that this 
blocking does not conflict with the primitives' unrolling factors (for instance,
if the primitive unrolls a given dimension by 8, but ATLAS blocks that
dimension to 3, ATLAS would always call the cleanup code).  So this is the
information conveyed by these three integers.
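
The arithmetic behind this conflict can be sketched as follows.  This is only
an illustration: split_block and split_t are hypothetical names, not part of
ATLAS, and the sketch simply divides a blocked dimension into the part the
unrolled loop can handle and the remainder left to cleanup code.

```c
#include <assert.h>

/* Hypothetical helper (not an ATLAS routine): how much of a blocked
 * dimension the unrolled loop covers, and how much falls to cleanup. */
typedef struct {
   int unrolled;   /* elements handled by the unrolled loop */
   int cleanup;    /* elements left for the cleanup code    */
} split_t;

static split_t split_block(int block, int unroll)
{
   split_t s;
   assert(block >= 0 && unroll > 0);
   s.unrolled = (block / unroll) * unroll;
   s.cleanup  = block - s.unrolled;
   return s;
}
```

With an unrolling of 8 and a block of 3, split_block(3, 8) reports 0 unrolled
elements and 3 cleanup elements, which is the case described above.  A block
of 32 would leave no cleanup at all.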

The form of a GEMV primitive line is:

<flag> <Yunroll> <Xunroll> <filename>

As mentioned previously, <filename> is the primitive source file.  <Yunroll>
is the unrolling used for the loop that loops over the Y vector, and 
<Xunroll> is the unrolling used for the loop that loops over the X vector.
<flag> is a less obvious parameter which is used to tell the search script
about special properties of a primitive.
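
To make the line format concrete, here is a minimal sketch of reading one
GEMV primitive line.  It is not the ATLAS search code; gemv_case_t and
parse_gemv_case are hypothetical names, and the 255-character filename limit
is an assumption made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line: <flag> <Yunroll> <Xunroll> <filename> */
typedef struct {
   int  flag;
   int  yunroll;
   int  xunroll;
   char filename[256];
} gemv_case_t;

/* Returns 1 if all four fields were read, 0 otherwise. */
static int parse_gemv_case(const char *line, gemv_case_t *out)
{
   assert(line != NULL && out != NULL);
   return sscanf(line, "%d %d %d %255s", &out->flag, &out->yunroll,
                 &out->xunroll, out->filename) == 4;
}
```

Given the line "2 32 4 ATL_gemvN_32x4_1.c" from the listing above, this yields
flag 2, Yunroll 32 and Xunroll 4.  A count line such as "9" is rejected, since
it has only one field.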

It is assumed that the user has supplied an "inner-product" based GEMV
implementation (i.e., an implementation which basically does <Yunroll>
simultaneous dot products).  This default state is expressed to the search
by a <flag> value of 0.  However, since the inner product formulation of
NoTranspose GEMV loops across the non-contiguous dimension of the matrix,
some architectures need to employ an "outer-product" based NoTranspose GEMV
(i.e., a GEMV which is performed by doing <Xunroll> simultaneous axpy's).
This is indicated by a <flag> value of 2.  Finally, since ATLAS's GEMM has
a code generator which allows it to achieve very good portable performance,
it is always worth seeing how good a GEMV can be obtained by simply
making the appropriate call to GEMM.  <flag> of 1 indicates that this is what
the primitive is doing.

In summary:

FLAG   MEANING
====   ========================================================================
0      Inner-product or dot-based primitive
1      GEMM-based primitive
2      Outer-product or AXPY-based primitive (only valid for NoTranspose GEMV)
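
The two loop orderings for NoTranspose GEMV can be sketched without any
unrolling as follows.  These are plain reference loops for y = A * x, written
for illustration only.  The names gemvN_dot and gemvN_axpy and the shortened
argument lists are hypothetical, not the ATLAS API.

```c
#include <assert.h>

/* Inner-product ordering (flag 0): each y[i] is a dot product of row i
 * of A with x.  The inner loop strides through A by lda. */
static void gemvN_dot(int M, int N, const double *A, int lda,
                      const double *x, double *y)
{
   int i, j;
   assert(lda >= M);
   for (i = 0; i < M; i++) {
      double dot = 0.0;
      for (j = 0; j < N; j++)
         dot += A[i + j*lda] * x[j];
      y[i] = dot;
   }
}

/* Outer-product ordering (flag 2): y accumulates one scaled column of A
 * at a time (an axpy).  The inner loop walks A contiguously. */
static void gemvN_axpy(int M, int N, const double *A, int lda,
                       const double *x, double *y)
{
   int i, j;
   assert(lda >= M);
   for (i = 0; i < M; i++)
      y[i] = 0.0;
   for (j = 0; j < N; j++) {
      const double xj = x[j];
      for (i = 0; i < M; i++)
         y[i] += A[i + j*lda] * xj;
   }
}
```

Both routines produce the same result.  They differ only in which dimension
of the column-major matrix the inner loop traverses, which is the property
the <flag> value communicates to the search.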

   ----------------------------------------------------------------------------
                 1.2 : Writing a primitive GEMV
   ----------------------------------------------------------------------------
There are several assumptions that need to hold true for a user-supplied GEMV
primitive.  First, the loop ordering must be that implied by the <flag> setting
the user supplies in the primitive description file, as discussed in
Section 1.1.  Each primitive makes assumptions about the arguments it handles,
and these assumptions are reflected in the routine name.  The function name of
a GEMV primitive is:
   ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1
where:
 <pre> : Replaced by the precision prefix:  s, d, c, or z.
 <Trans> : Transpose specifier: 
           N : NoTranspose
           T : Transpose
           C : Conjugate Transpose
           Nc: No transpose, with conjugation
 <betanam> : The beta value this primitive supplies.  All primitives must supply
             the following:

             b0 : beta=0
             b1 : beta=1
             bX : beta != 0 && beta != 1
             bXi0 : complex only, for when beta != 0 && beta != 1, but the
                    imaginary component is zero

For a given gemv primitive (either NoTranspose or Transpose), if the cpp macro
   Conj_ 
is defined we want the conjugate form of that transpose setting (i.e., Nc or C).

Each file is further compiled with differing cpp settings to generate the
various beta cases.  The beta macro settings and their meanings are:

CPP MAC   MEANING
=======   ====================================================================
BETA0     Primitive should provide y = A * x
BETA1     Primitive should provide y = y + A * x
BETAX     Primitive should provide y = beta * y + A * x
BETAXI0   For complex only, primitive should provide y = beta * y + A * x,
          where the imaginary component of beta is zero.
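
One way a single source file might serve several beta cases is sketched
below.  This is an assumption about style, not a copy of the provided CASES
files: SCALE_Y and gemvT_sketch are hypothetical names, the argument list is
shortened, and the fallback to BETAX when no macro is defined exists only so
the sketch compiles on its own.

```c
#include <assert.h>

/* For this sketch only: default to the general case if no beta macro
 * was supplied on the compile line (e.g. cc -DBETA0 ...). */
#if !defined(BETA0) && !defined(BETA1) && !defined(BETAX)
   #define BETAX
#endif

#if defined(BETA0)
   #define SCALE_Y(beta_, y_) 0.0               /* y is never read  */
#elif defined(BETA1)
   #define SCALE_Y(beta_, y_) (y_)              /* y used unscaled  */
#else
   #define SCALE_Y(beta_, y_) ((beta_) * (y_))  /* general beta     */
#endif

/* Hypothetical Transpose kernel: y = beta*y + A^T * x, where y has M
 * elements, x has N elements, and A is stored column-major as N x M. */
static void gemvT_sketch(int M, int N, const double *A, int lda,
                         const double *x, double beta, double *y)
{
   int i, j;
   assert(lda >= N);
   (void) beta;   /* unused when BETA0 or BETA1 is selected */
   for (j = 0; j < M; j++) {
      double dot = 0.0;
      for (i = 0; i < N; i++)
         dot += A[i + j*lda] * x[i];
      y[j] = SCALE_Y(beta, y[j]) + dot;
   }
}
```

Compiling the same file once per macro yields one specialized routine per
beta case, so the BETA0 and BETA1 variants avoid a multiply they do not need.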

All primitives additionally assume:
   alpha == 1.0
   incX == 1
   incY == 1
   column-major storage of A

Higher level ATLAS routines ensure these assumptions are true before calling
the primitive.

Therefore, the routine:
   ATL_dgemvN_a1_x1_b0_y1
supplies a primitive doing NoTranspose GEMV on a column-major array with
alpha=1, beta=0, incX=1 and incY=1, while:
   ATL_cgemvNc_a1_x1_bXi0_y1
supplies a primitive doing NoTranspose GEMV on a column-major array whose
elements should be conjugated before the multiplication, with
alpha=1, incX=1 and incY=1, and beta whose real component is unknown, but
whose imaginary component is known to be zero.
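
As a small illustration of the naming scheme, the following sketch assembles
a primitive name from its three variable parts.  The helper gemv_prim_name is
hypothetical and exists only to show how the pieces fit together.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1.
 * Returns 1 if the name fit in buf, 0 otherwise. */
static int gemv_prim_name(char *buf, size_t len, char pre,
                          const char *trans, const char *betanam)
{
   int n;
   assert(buf != NULL && trans != NULL && betanam != NULL);
   assert(strlen(trans) > 0 && strlen(betanam) > 0);
   n = snprintf(buf, len, "ATL_%cgemv%s_a1_x1_%s_y1", pre, trans, betanam);
   return n > 0 && (size_t) n < len;
}
```

Calling it with 'd', "N" and "b0" gives ATL_dgemvN_a1_x1_b0_y1, and with 'c',
"Nc" and "bXi0" gives ATL_cgemvNc_a1_x1_bXi0_y1, matching the two examples
above.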

For greater understanding of how these CPP macros are used to compile multiple
primitives from one file, examine the provided CASES files.

The API of the primitive is:
   ATL_<pre>gemv<Trans>_a1_x1_<betanam>_y1
   (
      const int M,       /* length of Y vector */
      const int N,       /* length of X vector */
      const SCALAR alpha,/* ignored, assumed to be one */
      const TYPE *A,     /* pointer to column-major matrix */
      const int lda,     /* leading dimension of A, or row-stride */
      const TYPE *X,     /* vector to multiply A by */
      const int incX,    /* ignored, assumed to be one */
      const SCALAR beta, /* value of beta */
      TYPE *Y,           /* output vector */
      const int incY     /* ignored, assumed to be one */
   );
where
    <pre> :        s          d         c         z
   =======  ========  =========  ========  ========
   SCALAR :    float     double    float*   double*
   TYPE   :    float     double     float    double

Note that the meanings of M & N differ slightly from those used by the
Fortran77 API, in that they give the vector lengths, not array dimensions.
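
A minimal, unoptimized reference for one primitive, written against the API
above, might look like the sketch below.  The TYPE and SCALAR definitions are
local stand-ins for double precision so that the example compiles on its own.
We also assume that, for the Transpose case, the stored matrix is N x M
(since Y has M elements and X has N), which is our reading of the note on
vector lengths.

```c
#include <assert.h>

#define TYPE   double   /* local stand-ins for double precision */
#define SCALAR double

/* Reference Transpose primitive with beta = 1:  y = y + A^T * x */
void ATL_dgemvT_a1_x1_b1_y1(const int M, const int N, const SCALAR alpha,
                            const TYPE *A, const int lda,
                            const TYPE *X, const int incX,
                            const SCALAR beta, TYPE *Y, const int incY)
{
   int i, j;
   /* Conditions the higher level routines are said to guarantee. */
   assert(incX == 1 && incY == 1);
   (void) alpha;   /* assumed to be one             */
   (void) beta;    /* fixed at one by the b1 suffix */
   for (j = 0; j < M; j++) {        /* M = length of Y */
      TYPE dot = 0.0;
      for (i = 0; i < N; i++)       /* N = length of X */
         dot += A[i + j*lda] * X[i];
      Y[j] += dot;
   }
}
```

A real primitive would unroll these loops by the <Yunroll> and <Xunroll>
values given in its description line, and would handle any remainder in
cleanup code.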

One final note: close examination will reveal that scases.dsc is a logical link
to dcases.dsc, and the same holds true for zcases.dsc & ccases.dsc.  This is
because hand-written implementations in different precisions of the same type
(real or complex) do not differ from each other except in their declarations,
so ATLAS uses CPP macros to compile both single and double precision from the
same file.  If you write code that does not use CPP in this way, you will
have to split up the description files, since your implementation will exist
in only one precision.
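
The idea of compiling two precisions from one file can be sketched as
follows.  The switch USE_SINGLE and the names PRE, PASTE and DOT_NAME are
hypothetical and chosen for this example.  The macros ATLAS itself uses
should be taken from the provided CASES files.

```c
#include <assert.h>

#ifdef USE_SINGLE          /* hypothetical switch, e.g. cc -DUSE_SINGLE */
   #define TYPE float
   #define PRE  s
#else
   #define TYPE double
   #define PRE  d
#endif

/* Two-level paste so that PRE is expanded before ## is applied. */
#define PASTE_(a, b) a ## b
#define PASTE(a, b)  PASTE_(a, b)
#define DOT_NAME     PASTE(PRE, dot_sketch)  /* sdot_sketch or ddot_sketch */

/* The body is identical for both precisions; only TYPE and the
 * generated function name change. */
static TYPE DOT_NAME(const int n, const TYPE *x, const TYPE *y)
{
   TYPE sum = 0;
   int i;
   assert(n >= 0);
   for (i = 0; i < n; i++)
      sum += x[i] * y[i];
   return sum;
}
```

Code written for a single precision, such as assembly or intrinsics tied to
one data width, cannot be shared this way, which is why it needs its own
description file.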

*******************************************************************************
       SECTION 2: SPEEDING UP GER, GERU, GERC, HER, HER2, SYR AND SYR2
*******************************************************************************
All of these routines rely on the GER primitive for their performance.  The
hand-written primitives tried by ATLAS may be found in
   ATLAS/tune/blas/ger/CASES.

Most of the discussion of the GEMV primitives applies to the GER primitives
as well, so we assume you have read and are familiar with the concepts
discussed in Section 1.  As before, the routines to be timed are given
in a primitive description file, <pre>cases.dsc.  GER does not have a 
transpose case, so this file first lists the number of GER primitives to search,
followed by that many primitive lines describing them.

GER primitive lines are of the form:
<flag> <Xunroll> <Yunroll> <filename>

<flag> is ignored at the moment
<Xunroll> is the unrolling of the loop over the X vector (i.e. the M-loop)
<Yunroll> is the unrolling of the loop over the Y vector (i.e. the N-loop)
<filename> is the name of the C source file for the primitive.

The API for the ger primitive is:

   ATL_<pre>ger1_a1_x1_yX
   (
      const int M,       /* length of X vector */
      const int N,       /* length of Y vector */
      const SCALAR alpha,/* ignored, assumed to be one */
      const TYPE *X,     /* pointer to X vector */
      const int incX,    /* ignored, assumed to be one */
      const TYPE *Y,     /* pointer to Y vector */
      const int incY,    /* increment of Y vector; NOTE: NOT IGNORED */
      TYPE *A,           /* pointer to column-major matrix */
      const int lda      /* leading dimension of A, or row-stride */
   );

Note that this primitive assumes:
   A is a column-major array
   alpha == 1.0
   incX == 1
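
Under these assumptions, an unoptimized rank-1 update can be sketched as
below.  The name ger1_sketch and its shortened argument list are hypothetical
and are not the ATLAS API.  The sketch also assumes incY > 0, and it does not
address any conjugation needed by the complex variants.

```c
#include <assert.h>

/* Hypothetical reference rank-1 update:  A = A + x * y^T
 * X has M elements with unit stride; Y has N elements with stride incY;
 * A is an M x N column-major matrix with leading dimension lda. */
static void ger1_sketch(int M, int N, const double *X, int incX,
                        const double *Y, int incY, double *A, int lda)
{
   int i, j;
   assert(incX == 1 && incY > 0 && lda >= M);
   for (j = 0; j < N; j++) {
      const double yj = Y[j * incY];   /* incY is honored, not ignored */
      for (i = 0; i < M; i++)
         A[i + j*lda] += X[i] * yj;    /* contiguous walk down column j */
   }
}
```

The inner loop runs over X (the M-loop) and the outer loop over Y (the
N-loop), which are the two loops the <Xunroll> and <Yunroll> values describe.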

