I frankly admit that I'm out of my depth on this one.
I have a C++ (PP, Carbon, *not* Mach-O) app with some highly optimized
matrix manipulations in it (because the runtime can be long). I noticed
recently that the runtime had increased by nearly 1/3 for no apparent
reason.
Using the profiler, I tracked down what seemed to be the bottleneck but
that function was unchanged in an old/fast vs. new/slow compilation of
the app. Even the assembly was identical.
The function in question has the following prototype:
void transposeProduct4(double *G, int Nrows, double *C);
Here, G and C reference the beginning of fixed-length arrays defined in
the private portion of the class which is instantiated just once.
After considerable floundering, I was finally able to decrease the
runtime back to what it had been formerly by removing these two arrays
from the object and declaring them as static arrays at the top of the
module where they are used (and, thus, local to the module). Otherwise,
the new/fast vs. new/slow code is identical.
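For readers trying to picture the change, here is a minimal sketch of the two layouts described above. The class names, array sizes, and the body of transposeProduct4 are all assumptions made for illustration (the thread shows only the prototype); the point is only that member arrays are reached through the object pointer, while file-scope statics sit at addresses that do not depend on any object.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sizes: the original post does not give the array lengths.
const std::size_t kRows = 8;
const std::size_t kCols = 4;

// Placeholder body (an assumption): C = G^T * G for a row-major Nrows x 4 G.
// The real transposeProduct4 is not shown in the thread.
void transposeProduct4(double *G, int Nrows, double *C)
{
    assert(G != 0 && C != 0 && Nrows >= 0);
    for (std::size_t i = 0; i < kCols; ++i)
        for (std::size_t j = 0; j < kCols; ++j) {
            double sum = 0.0;
            for (int r = 0; r < Nrows; ++r)
                sum += G[r * kCols + i] * G[r * kCols + j];
            C[i * kCols + j] = sum;
        }
}

// Layout 1 (old): the arrays are private members, addressed via `this`.
class SolverWithMembers {
public:
    void fill(double v) { for (std::size_t i = 0; i < kRows * kCols; ++i) G_[i] = v; }
    void run() { transposeProduct4(G_, static_cast<int>(kRows), C_); }
    double c(std::size_t i) const { return C_[i]; }
private:
    double G_[kRows * kCols];
    double C_[kCols * kCols];
};

// Layout 2 (new): the arrays are file-scope statics, local to this module.
static double sG[kRows * kCols];
static double sC[kCols * kCols];

class SolverWithStatics {
public:
    void fill(double v) { for (std::size_t i = 0; i < kRows * kCols; ++i) sG[i] = v; }
    void run() { transposeProduct4(sG, static_cast<int>(kRows), sC); }
    double c(std::size_t i) const { return sC[i]; }
};
```

Both classes call the same function with the same arguments; only where the arrays live differs, which also changes their alignment and their position relative to other data.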
Apparently, the mysterious runtime increase was somehow due to the
accessing of the addresses of these arrays, not to the subsequent matrix
computation.
Can anyone clue me in as to why this should be? Is there some basic
fact that I am missing here?
TIA.
Note:
a) In the old/fast code, the G and C arrays are still in the object.
b) The executables are of different sizes: old/fast = 1,074,706 bytes
and new/fast = 1,112,900 bytes due to other additions.
--
Mike McLaughlin
MW Ron - 22 Nov 2004 20:49 GMT
Do you have any idea when the change in runtime happened (i.e., when
did things start to get worse, and with what release of the compiler)?
Perhaps this is an optimization issue or something similar.
Also, marking the input pointers as "restrict" could greatly improve the
compiler's ability to know that a write through C cannot modify anything
read through G. When the variables are global (static or not), the
compiler can tell whether a write to one array can modify the other.
I think the prototype would be:
void transposeProduct4(double * restrict G, int Nrows, double * restrict C);
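A hedged sketch of what that could look like in practice. Note that `restrict` is a C99 keyword and not part of standard C++; GCC, Clang, and MSVC accept `__restrict` in C++ code, so the macro below (RESTRICT_PTR, a made-up name) falls back to nothing elsewhere. Check your own compiler's documentation for the spelling it accepts. The function body is an assumption, written so that C is stored to inside the inner loop, which is exactly the case where the compiler must otherwise assume a store to C might change G.

```cpp
#include <cassert>

// Hypothetical helper macro: map to the compiler's restrict spelling if known.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#  define RESTRICT_PTR __restrict
#else
#  define RESTRICT_PTR /* no known spelling: compile without the hint */
#endif

// Assumed body: accumulates C = G^T * G for a row-major Nrows x 4 matrix G.
// The qualifiers promise the compiler that G and C never overlap, so values
// loaded from G may stay in registers across the stores to C.
void transposeProduct4(double * RESTRICT_PTR G, int Nrows,
                       double * RESTRICT_PTR C)
{
    assert(G != 0 && C != 0 && G != C);
    for (int k = 0; k < 16; ++k)
        C[k] = 0.0;
    for (int r = 0; r < Nrows; ++r)
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                C[i * 4 + j] += G[r * 4 + i] * G[r * 4 + j];
}
```

The promise is the caller's responsibility: passing overlapping arrays to a function declared this way is undefined behavior.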
Ron

Ron Liechty - MWRon@metrowerks.com - http://www.metrowerks.com
Thomas Engelmeier - 22 Nov 2004 21:00 GMT
> Apparently, the mysterious runtime increase was somehow due to the
> accessing of the addresses of these arrays, not to the subsequent matrix
> computation.
>
> Can anyone clue me in as to why this should be? Is there some basic
> fact that I am missing here?
Cache stalls? On a 1 GHz machine with a 333 MHz bus clock, a [BusWidth
wide] value can be fetched from RAM only every 3 CPU cycles, versus
every cycle from the CPU cache.
CHUD should give you more details.
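If you want to see the effect outside a profiler, a rough sketch like the following can help. It is not taken from the original app: the function names are made up, and it simply sums the same n x n array twice, once in memory order and once with a stride of n, so that the second walk touches a new cache line on almost every access once the array is larger than the cache.

```cpp
#include <cstddef>
#include <ctime>
#include <vector>

// Visit elements in memory order (stride 1): each cache line is fully used.
double sumSequential(const std::vector<double>& a, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += a[i * n + j];
    return s;
}

// Visit the same elements with stride n: consecutive accesses are far apart.
double sumStrided(const std::vector<double>& a, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += a[j * n + i];
    return s;
}

typedef double (*SumFn)(const std::vector<double>&, std::size_t);

// Run f once, store its result in *result, and return elapsed CPU seconds.
double secondsFor(SumFn f, const std::vector<double>& a, std::size_t n,
                  double* result)
{
    std::clock_t t0 = std::clock();
    *result = f(a, n);
    return static_cast<double>(std::clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling secondsFor with each function on a large array (say n in the low thousands) and comparing the two times gives a crude picture of how much memory access order alone can cost; the actual ratio depends on the machine.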
Regards,
Tom_E

Chris Cox - 23 Nov 2004 21:21 GMT
What CPU are you testing this on?
The G5 CPUs have a dispatch dependency on the instruction alignment --
so moving code slightly can slow things down quite a bit (half the
speed in my worst case test to date, but it's not a very common sort of
thing).
For the details, see the (finally released) 970FX processor manual, or
the older Power4 processor manual (which has a VERY similar issue).
Chris