Following are two examples taken straight from my own development work:
* GStrings in FontMagick:
FontMagick uses a fairly complex way of post-processing generated GStrings to
fix a number of bugs in the drawing engine and printer drivers of Geos and to
apply non-linear distortions (i.e. effects that cannot be described using simple
matrix transformations) to an object. Afterwards, the transformed outline of the
object is replicated a number of times to create the desired effect.
I assumed that these computations were relatively time-consuming compared to the
time required for actually drawing the object, so I decided to cache all the
drawing commands necessary to reproduce the effect in a VM-based GString.
Therefore, redrawing the contents of the main window would only require playing
back the GString without any further computation. In addition, I used a simple
GrGetGStringBounds() on the entire GString to get the dimensions of the
resulting object.
As beta testers kept complaining about the time the program needed to compute
the shape of a new object every time one of the settings was changed, I went
back to the drawing board and had a look at the "real" timing using Swat and the
Profiler (there is one! - see below). After some timing of the various
components, I found out what the real time-killers were:
- GrGetGStringBounds() alone took about as much time as all the rest of the
computation, preparation and drawing of the object. Looking back, it was
probably stupid to assume that the accurate size of an object with a large
number of splines could be computed really fast, but I was surprised when I saw
*how* much time it took.
- Drawing to a VM GString actually took *longer* than drawing the same object
to the screen directly, even though the object involved filling complex,
spline-bounded areas, while one would assume that creating a GString would do
little more than putting one command record after the other... The profiler
revealed that a single routine, LMemInsertAt(), accounted for a considerable
amount of total CPU time spent. It seems that the LMem operations involved in
creating the internal structure of a GString are not implemented in an optimal
fashion...
- To make things worse, the object would still have to be drawn on screen
anyway, regardless of how much time had been spent already for just putting
together the command sequence...
- The actual computations, which I had always blamed for the delay when
formulating my excuses to the beta testers, used up only about a tenth of the
time needed for a single screen refresh. :-)
Once I had gathered this data, the solution was fairly obvious: I reduced
caching to a minimum, using it only to store one copy of the outline to which
all the math had been applied. Replication, scaling and attribute manipulation
of the outline are now done completely "on the fly" whenever the view is exposed.
As a side effect, I could now remove the entire handling of the VM cache file,
because the "basic" outline would be small enough to fit into an LMem GString.
To avoid having to call GrGetGStringBounds() on the (non-existent) total object
GString, I only use it on the "fundamental" outline now and then infer the total
object size based on that by "rules of thumb" and "play-it-safe" calculations.
This may give me a narrow white margin around complex objects (or, in rare
cases, lead to parts of the object extending slightly beyond the bounds, because
inaccuracies in GrGetGStringBounds() get multiplied), but it means that almost
no additional time is required for size calculations.
The only area which still suffers from the slow speed of VM GStrings now is the
"copy to clipboard" routine, which has no choice but to use them to create the
transfer format...
* Math speed in GeoRaycast:
When I had the first working port of the rendering routine in GeoRaycast (which
I adapted from an original Mac version published in a German computing
magazine), it would require about 7 seconds to compute and draw a single image
on my machine. The original C routine made heavy use of float and double
variables, which initially forced me to work around a couple of bugs in the
Borland C x87 emulator library (btw, these seem to be fixed on the OGo -
finally!).
After a day or so spent thinking nasty things about what would have happened to
Intel if one of their CPUs had shown similar problems, I decided to make the
big switch and remove all the float arithmetic in favor of WWFixed integer
math. This was easier than I had thought, and the result was unexpectedly
striking: the redraw time was suddenly down to about 200 milliseconds per frame,
a factor-35 improvement over the first version! As an added benefit, I could
once again remove most of the variables holding intermediate results that I had
introduced to avoid emulator bugs...
All further changes, like using double-buffering techniques for drawing,
converting the timing-critical texture scaling routine to assembly language and
using a new "divide-free" algorithm for scaling, only yielded another factor of
4 or so in total...
btw: A completely non-representative test using a loop of Geos-based
floating-point divisions yielded the following results (P75, Geos 2.01 under
OS/2, all within a 10-20% margin of error):
float:   9944 operations/second
double:  5825 operations/second
Geos80:  17821 operations/second
WWFixed: 120000 operations/second
This shows how much speed can be gained, even without using integer arithmetic,
by avoiding the float and double data types of GOC - it should be noted that the
Geos80 type is actually the widest data type in the list with 10 bytes per
number, while float and double only use 4 and 8 bytes, respectively. Of course,
the comparison to WWFixed is not really fair, because it only covers a very
limited range of numbers (0 to 65535 or signed equivalent) and has a very
limited accuracy, but it may often be worth checking whether it can be used
anyway...
Finally (I have already been writing too much... :-)) - the profiler: I have
found a "profiling kernel" of Geos in the NC target of the old 2.01 SDK. You
will have to use the Debug application to switch to this kernel; afterwards, you
can use the profiling commands (use "apropos profile" to look for them) of Swat
to count the number of times the instruction pointer gets "caught" in certain
routines over a period of about 30 seconds - the current position is apparently
only sampled 60 times a second (whenever a scheduling timer tick occurs), so the
resolution is quite coarse, but if there is any individual time-killer, it will
probably be detected using this method.
The only disadvantage seems to be that only the current position of the IP is
taken into account, without any attempt to backtrace the stack to find, for
example, the routine in a given patient that ultimately caused the call to the
current location - in other words, you will get a lot of hits in internal system
routines, but you can often only guess as to why the system has to call them in
the first place.
The profiling kernel doesn't seem to be present in the OG SDK any more, and I'm
not sure whether one exists for the Nokia 9000.
ciao marcus