[H-GEN] Which is better?

Andrae Muys andrae at tucanatech.com
Fri Apr 30 04:10:42 EDT 2004


Russell Stuart wrote:
> Egads!  What machine did you run it on?  When I ran it under 1.4.1, and
> under 1.5.0 on RH 8.9 I got this:
>   rstuart at ras:~$ /usr/java/j2sdk1.5.0/bin/java -cp . Looping
>   Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> All very odd.  It looks like there is not more than 30Mb being allocated
> on a machine that has 512Mb of physical memory.  On my machine, a
> malloc(2Gb) succeeds.

By default the Sun JVM caps the maximum heap at 64Mb.  If you want to 
use more you just pass the appropriate command-line arg (-Xmx<size>, 
eg. -Xmx256m; the flag isn't Linux-specific).
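
If you want to check the cap from inside the program you can ask the 
Runtime (a quick sketch of mine, not part of Russell's test; the class 
name is made up):

public class MaxHeap {
   public static void main(String[] args) {
     // maxMemory() reports the most heap the JVM will ever try to use.
     long maxBytes = Runtime.getRuntime().maxMemory();
     System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + "Mb");
   }
}

Run it with and without -Xmx256m and the difference shows up 
immediately.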

> Still, with a slightly modified version of your program I get similar
> results.  Which is to say, memory allocation seems to be faster than
> copying 16 bytes in Java.  In fact not just faster, but 5 times faster! 
> To say that I find that amazing is an understatement - even if the GC is
> not doing anything.  It almost defies belief.
> 

Don't worry, there's no need to believe it completely.  It is likely 
that the 16 bytes are being copied 3 or 4 times, so you are really 
talking ~64 bytes.  Of course the pattern of usage should be rather 
cache friendly.  Remember the optimisation hierarchy: once you are 
reduced to micro-optimisations, a likely target is IO, and it is 
important to realise that this is not restricted to disk IO; it 
includes primary and secondary cache utilisation.  That is yet another 
reason to avoid cluttering the instruction path with 
micro-pre-optimisations of instruction counts, which can make 
restructuring data accesses to improve cache performance difficult or 
even impossible.
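
To make that concrete, here is a small Java sketch of mine (not from 
Russell's code; the size and timings are purely illustrative).  It 
sums the same 2D array twice, once along the rows and once along the 
columns.  The instruction counts are identical, but the cache 
behaviour is not.

public class CacheDemo {
   public static void main(String[] args) {
     int n = 2048;
     int[][] a = new int[n][n];  // ~16Mb of ints, fits the default heap

     // Row-order traversal: consecutive elements, cache friendly.
     long t0 = System.currentTimeMillis();
     long sum = 0;
     for (int i = 0; i < n; i++)
       for (int j = 0; j < n; j++)
         sum += a[i][j];
     long rowMs = System.currentTimeMillis() - t0;

     // Column-order traversal: identical work, but every access jumps
     // to a different row's array, defeating the cache.
     t0 = System.currentTimeMillis();
     for (int j = 0; j < n; j++)
       for (int i = 0; i < n; i++)
         sum += a[i][j];
     long colMs = System.currentTimeMillis() - t0;

     System.out.println("rows: " + rowMs + "ms, columns: " + colMs
                        + "ms (sum=" + sum + ")");
   }
}

On most machines the column walk is several times slower, which is 
exactly the sort of 'cpu-bound' code that turns out to be cache-bound.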

I was very fortunate to have the opportunity to do a one week 
optimisation workshop taken by one of SGI's compiler team.  The guy who 
takes it literally does micro-optimisation as his full-time job --- his 
job is to show potential SGI-supercomputer purchasers how much faster 
their code will run if they spend $$$$ buying a new SGI supercomputer.

As a part of the workshop I watched him take some 'cpu-bound' simulation 
code from one of the other participants and produce 10% improvements 
just by tweaking environment variables.

However the most interesting part was watching him profile various 
pieces of code, and in every case bar one, the 'cpu-bound' code actually 
proved to be cache-bound.  In one case I remember, using the 
cache-optimisation techniques he taught us, we were able to improve the 
performance of a problematic app by 3x.  Of course, by fixing the 
algorithmic complexity of the central loop we got a 100x improvement!

If Harry's still reading this he should note those numbers: 3x is a 
remarkable speed improvement for micro-optimisation.  It is very rare 
to see such a high improvement; 1.1-1.5x is generally considered 
*good*.  OTOH 100x is *not* unusual when fixing a pathological 
algorithm problem.

eg. compare the following two ways of calculating fib(n), written in a 
hypothetical C with LCO (last-call optimisation) and lexical scope:

It is important to realise that the first takes ~3600 seconds to find 
fib(50) on my machine.  The second takes 0.026 seconds.  This is why 
no-one in this thread has even tried to suggest that the biggest speed 
improvements come from anything other than improving the algorithm.

int fib(int n) {
   /* Naive doubly-recursive definition: exponential running time. */
   switch (n) {
     case 0:
     case 1:
       return n;
     default:
       return fib(n - 1) + fib(n - 2);
   }
}

vs.

int fib(int n) {
   /* Tail-recursive accumulator; with LCO this compiles to a loop. */
   int iter(int f1, int f2, int count) {
     if (count == n) {
       return f2;
     } else {
       return iter(f2, f1 + f2, count + 1);
     }
   }
   return iter(1, 0, 0);
}

Which, converted into register form, becomes:

int fib(int n) {
   int f1 = 1;
   int f2 = 0;
   int count = 0;
   int temp;

   while (1) {
     if (count == n) {
       return f2;
     } else {
       temp = f1 + f2;
       f1 = f2;
       f2 = temp;
       count = count + 1;
     }
   }
}
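
For what it's worth, here is the same comparison as a runnable Java 
harness (my own sketch, not anything from the thread; it uses fib(40) 
and longs so the naive version finishes in seconds rather than hours):

public class FibTiming {
   // Naive doubly-recursive version: exponential number of calls.
   static long fibNaive(int n) {
     return n < 2 ? n : fibNaive(n - 1) + fibNaive(n - 2);
   }

   // Iterative register-form version: n additions.
   static long fibIter(int n) {
     long f1 = 1, f2 = 0;
     for (int count = 0; count < n; count++) {
       long temp = f1 + f2;
       f1 = f2;
       f2 = temp;
     }
     return f2;
   }

   public static void main(String[] args) {
     int n = 40;
     long t0 = System.currentTimeMillis();
     long a = fibNaive(n);
     long naiveMs = System.currentTimeMillis() - t0;

     t0 = System.currentTimeMillis();
     long b = fibIter(n);
     long iterMs = System.currentTimeMillis() - t0;

     System.out.println("fib(" + n + ") = " + a + " = " + b);
     System.out.println("naive: " + naiveMs + "ms, iterative: "
                        + iterMs + "ms");
   }
}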

Andrae Muys




