[H-GEN] Which is better?
Andrae Muys
andrae at tucanatech.com
Fri Apr 30 04:10:42 EDT 2004
Russell Stuart wrote:
> Egads! What machine did you run it on? When I ran it under 1.4.1, and
> under 1.5.0 on RH 8.9 I got this:
> rstuart at ras:~$ /usr/java/j2sdk1.5.0/bin/java -cp . Looping
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> All very odd. It looks like there is not more than 30Mb being allocated
> on a machine that has 512Mb of physical memory. On my machine, a
> malloc(2Gb) succeeds.
By default Java generally caps the maximum heap at 64Mb. If you
want to use more, you just pass the appropriate command-line argument
(-Xmx<size> under Linux).
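For example (a rough sketch only; 'HeapCheck' is just a name I've made
up, and the exact default depends on your JVM version and platform),
you can confirm what ceiling the JVM thinks it has and then raise it:

    // HeapCheck.java -- prints the maximum heap the JVM is prepared to use.
    public class HeapCheck {
        public static void main(String[] args) {
            long max = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (max / (1024 * 1024)) + "Mb");
        }
    }

    $ java HeapCheck            # around 64Mb with the default settings
    $ java -Xmx256m HeapCheck   # around 256Mb (the JVM keeps a little back for itself)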
> Still, with a slightly modified version of your program I get similar
> results. Which is to say, memory allocation seems to be faster than
> copying 16 bytes in Java. In fact not just faster, but 5 times faster!
> To say that I find that amazing is an understatement - even if the GC is
> not doing anything. It almost defies belief.
>
Don't worry, there's no need to believe it completely. It is likely
that the 16 bytes are being copied 3 or 4 times, so you are really
talking about ~64 bytes. Of course the pattern of usage should be rather
cache friendly. Remembering the optimisation hierarchy, once you are
reduced to micro-optimisations a likely target is IO --- and it is
important to realise that this is not restricted to disk I/O, but
includes primary and secondary cache utilisation. That is yet another
reason to avoid cluttering the instruction path with
micro-pre-optimisations of instruction counts, which can make
restructuring data accesses to improve cache performance difficult or
even impossible.
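To make the cache point concrete, here is a rough sketch in Java (the
class and method names are mine, and the numbers will vary wildly
between machines): two loops that perform exactly the same additions,
but walk memory in different orders.

    // CacheDemo.java -- same arithmetic, different memory-access patterns.
    public class CacheDemo {
        static final int N = 2048;
        static final double[][] a = new double[N][N];

        // Walks each row array sequentially: consecutive accesses touch
        // consecutive memory, so each cache line is fully used.
        static double sumRowMajor() {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

        // The same additions, but each access lands in a different row
        // array, so far more loads miss the cache.
        static double sumColumnMajor() {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    sum += a[i][j];
            return sum;
        }

        public static void main(String[] args) {
            long t0 = System.currentTimeMillis();
            sumRowMajor();
            long t1 = System.currentTimeMillis();
            sumColumnMajor();
            long t2 = System.currentTimeMillis();
            System.out.println("row-major:    " + (t1 - t0) + "ms");
            System.out.println("column-major: " + (t2 - t1) + "ms");
        }
    }

The instruction counts are identical; the only difference is whether
the data-access pattern cooperates with the cache, and that sort of
restructuring is exactly what a prematurely 'clever' instruction-count
optimisation can make difficult.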
I was very fortunate to have the opportunity to do a one-week
optimisation workshop taken by one of SGI's compiler team. The guy who
takes it literally does micro-optimisation as his full-time job --- his
job is to show potential SGI-supercomputer purchasers how much faster
their code will run if they spend $$$$ buying a new SGI supercomputer.
As part of the workshop I watched him take some 'cpu-bound' simulation
code from one of the other participants and produce 10% improvements
just by tweaking environment variables.
However, the most interesting part was watching him profile various
pieces of code, and in every case bar one the 'cpu-bound' code actually
proved to be cache-bound. In one case I remember, using the
cache-optimisation techniques he taught us, we were able to improve the
performance of a problematic app by 3x. Of course, by fixing the
algorithmic complexity of the central loop we got a 100x improvement!
If Harry's still reading this he should note those numbers: 3x is a
remarkable speed improvement for micro-optimisation, and it is very rare
to see one that high; 1.1-1.5x is generally considered *good*.
OTOH 100x is *not* unusual when fixing a pathological algorithm problem.
E.g. compare the following two ways of calculating fib(n), written in a
hypothetical C with LCO (last-call optimisation) and lexical scope.
It is important to realise that the first takes ~3600 seconds to find
fib(50) on my machine; the second takes 0.026 seconds. This is why
no-one in this thread has even tried to suggest that the biggest speed
improvements come from anywhere other than the algorithm.
int fib(int n) {
    switch (n) {
    case 0: case 1:
        return n;
    default:
        return fib(n - 1) + fib(n - 2);
    }
}
vs.
int fib(int n) {
    int iter(int f1, int f2, int count) {
        if (count == n) {
            return f2;
        } else {
            return iter(f2, f1 + f2, count + 1);
        }
    }
    return iter(1, 0, 0);
}
Which, converted into register form, becomes:
int fib(int n) {
    int f1 = 1;
    int f2 = 0;
    int count = 0;
    int temp;
    while (1) {
        if (count == n) {
            return f2;
        } else {
            temp = f1 + f2;
            f1 = f2;
            f2 = temp;
            count = count + 1;
        }
    }
}
Andrae Muys