Libraries
The Fastcode library consists of three or four units per challenge. Each unit contains the winner functions for one challenge. Each challenge spans at least these three units: direct calling, CPU ID based function selection, and conditional compilation. If the challenge function exists in the RTL or VCL, there is also a library unit that supports the patching principle.
Presently the challenges target the following architectures:
Pentium 4 Prescott
Pentium 4 Northwood
Pentium M Dothan
Pentium M Banias
Direct calling
All functions can be called directly via these function interfaces.
Sometimes the same function is reachable through two or more interfaces, if it is the optimal one for more than one target.
Conditional compilation is also supported. One of the nine functions mentioned above is compiled in as the implementation behind this function.
The compiler directives are named:
P4N, P4P, PMD, PMB, ATHLONXP, AMD64, BLENDED, PASCAL, RTLREPLACEMENT
Only one of these can be defined at a time.
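As a rough illustration of how one directive can select the implementation behind a generic interface, here is a minimal sketch using the BLENDED and PASCAL directive names from the list above. The function names (MaxIntSketch, MaxInt_Blended, MaxInt_Pascal) and their bodies are hypothetical stand-ins, not the actual Fastcode winners, and the real units may organise their directives differently.

```pascal
program CondCompileSketch;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Assumption: exactly one target symbol is defined, here or with -D on the
// command line. PASCAL is chosen only for this illustration.
{$DEFINE PASCAL}

// Hypothetical stand-in for a blended-target winner.
function MaxInt_Blended(const A, B: Integer): Integer;
begin
  Result := B;
  if A > B then
    Result := A;
end;

// Hypothetical stand-in for a pure Pascal winner.
function MaxInt_Pascal(const A, B: Integer): Integer;
begin
  if A > B then
    Result := A
  else
    Result := B;
end;

// Generic interface: the directive decides which implementation is compiled in.
function MaxIntSketch(const A, B: Integer): Integer;
begin
{$IFDEF BLENDED}
  Result := MaxInt_Blended(A, B);
{$ELSE}
  {$IFDEF PASCAL}
  Result := MaxInt_Pascal(A, B);
  {$ELSE}
  {$MESSAGE ERROR 'Define exactly one target directive'}
  {$ENDIF}
{$ENDIF}
end;

begin
  Writeln(MaxIntSketch(3, 7)); // prints 7
end.
```

Because the choice is made at compile time, the call carries no selection overhead at run time. The sketch stops the build if no known directive is defined, but it does not check that only one is set.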
CPU ID based function selection
On library initialization, a function pointer is set to point at the fastest function for the detected processor. Calls made through the function pointer are redirected to one of the functions.
If the processor is none of these, this function is called, but only if the processor supports the IA32 extensions and MMX. The IA32 extensions consist of instructions such as CMOVcc, FCMOVcc and FCOMI.
Otherwise this function is called.
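The selection logic described above might look roughly like the following sketch. Everything named here is hypothetical: the TSketchCpu classification, the DetectCpuSketch placeholder (a real implementation would query CPUID), and the three stand-in functions. Which implementations serve as the two fallbacks is also an assumption, since the text does not name them.

```pascal
program CpuDispatchSketch;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

type
  // Hypothetical classification; the real detection unit defines its own types.
  TSketchCpu = (scKnownTarget, scIA32ExtAndMMX, scOther);
  TMaxIntFunc = function(const A, B: Integer): Integer;

// Hypothetical stand-ins; identical bodies keep the sketch short.
function MaxInt_Target(const A, B: Integer): Integer;
begin
  if A > B then Result := A else Result := B;
end;

function MaxInt_Blended(const A, B: Integer): Integer;
begin
  if A > B then Result := A else Result := B;
end;

function MaxInt_Pascal(const A, B: Integer): Integer;
begin
  if A > B then Result := A else Result := B;
end;

// Placeholder: a real implementation would execute CPUID and inspect the
// vendor, family, model and feature flags.
function DetectCpuSketch: TSketchCpu;
begin
  Result := scOther;
end;

var
  MaxIntPtr: TMaxIntFunc;

// In a library unit this would run from the initialization section.
procedure InitMaxIntPtr;
begin
  case DetectCpuSketch of
    scKnownTarget:   MaxIntPtr := MaxInt_Target;
    scIA32ExtAndMMX: MaxIntPtr := MaxInt_Blended;
  else
    MaxIntPtr := MaxInt_Pascal;
  end;
end;

begin
  InitMaxIntPtr;
  Writeln(MaxIntPtr(3, 7)); // prints 7
end.
```

The cost of this approach is one indirect call per invocation. In exchange, a single executable can run the best available function on each processor.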
The Patching Principle
Each unit contains the winner functions, just like the direct calling unit, but it also contains patching code. This code iterates through the executable image and patches all calls to the RTL function such that said calls are redirected to the Fastcode versions.
Patching, in its simplest form, is relatively straightforward. It is just a matter of finding the address of the system function to be patched, and inserting a new jump instruction at that address to jump to the replacement function. There are, however, a few important things to take into account.
- If the function being patched is less than 5 bytes in size, a jump cannot be inserted without possibly overwriting another system function.
- If the system function to be patched is already small or fast, then unless packages are being used, inserting a jump to a replacement function is very unlikely to produce any performance gain (from the calling program's viewpoint, we would be unnecessarily calling a jump to another function). For this reason, Fastcode functions such as MaxInt and Round are unlikely to see any improvement from patching.
- When packages are being used, the inserted jump is simply a replacement for an existing jump.
- A few API calls (VirtualProtect, FlushInstructionCache) are needed while performing the actual patching.
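The simplest form of patching described above can be sketched as follows, assuming 32-bit Delphi on Windows. SlowSum and FastSum are hypothetical stand-ins for a system function and its replacement, and PatchJump is an illustrative helper, not the code used in the Fastcode units. The sketch does not check the size of the patched function and does not handle the package case, where the address found is that of an existing jump.

```pascal
program PatchSketch;
{$APPTYPE CONSOLE}

uses
  Windows;

type
  // Near jump: opcode $E9 followed by a 32-bit displacement relative to the
  // address of the instruction after the jump.
  PJump = ^TJump;
  TJump = packed record
    OpCode: Byte;
    Distance: Integer;
  end;

// Hypothetical "system" function; the loop keeps it well over 5 bytes long.
function SlowSum(N: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to N do
    Inc(Result, I);
end;

// Hypothetical replacement with the same signature and calling convention.
function FastSum(N: Integer): Integer;
begin
  Result := N * (N + 1) div 2;
end;

// Overwrites the first 5 bytes of OldProc with a jump to NewProc.
function PatchJump(OldProc, NewProc: Pointer): Boolean;
var
  OldProtect, Dummy: DWORD;
begin
  // Code pages are normally not writable, so lift the protection first.
  Result := VirtualProtect(OldProc, SizeOf(TJump), PAGE_EXECUTE_READWRITE,
    OldProtect);
  if Result then
  begin
    PJump(OldProc)^.OpCode := $E9;
    PJump(OldProc)^.Distance :=
      Integer(NewProc) - Integer(OldProc) - SizeOf(TJump);
    // Restore the original protection and discard stale cached instructions.
    VirtualProtect(OldProc, SizeOf(TJump), OldProtect, Dummy);
    FlushInstructionCache(GetCurrentProcess, OldProc, SizeOf(TJump));
  end;
end;

begin
  if PatchJump(@SlowSum, @FastSum) then
    Writeln(SlowSum(100)); // the call now lands in FastSum; prints 5050
end.
```

Existing callers need no changes because the call address stays the same. Each call pays for one extra jump, which is why patching functions that are already small or fast rarely helps.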
The unofficial FastMove unit by John O’Harrow uses patching to select the IA32, MMX or SSE replacement, but with an additional performance tweak. When not using packages, rather than just inserting a jump at the original System.Move location, John actually patches 58 bytes (of the original 64 bytes used by Move). Within these 58 bytes, he can handle all small moves (<36 bytes) more efficiently.
How to modify and recompile an RTL/VCL unit
Directly inserting the RTL replacement function in the Delphi/C++ Builder library is probably the best option.
Recompilation of the RTL units (apart from SYSTEM.PAS) is also very straightforward.
- Edit the source code (\program files\borland\delphiX\source\RTL\sys directory).
- Ensure that MAKE.EXE (make utility) and DCC32.EXE (command line compiler) are in the search path.
- In a DOS shell, go to the RTL directory (\program files\borland\delphiX\source\RTL) and type MAKE. This will create new DCU files in a subdirectory called LIB (you may need to create this directory).
- Copy the required DCU files to the real LIB directory.
- Run or Restart Delphi.
Patching SYSTEM.PAS can get more complicated. If you are just directly replacing a function in SYSTEM.PAS with another, then no problems should occur. If, however, you are adding code to detect the CPU type and assign a function pointer to replace a system function, then virtually all of the DCUs in the RTL (and in most cases, also the VCL) will need to be replaced.
ANSI StringReplace Library
Compare Text Library
Floor Library Preliminary version 0.1
Int64Div Library version 1.0
MaxInt Library Preliminary version 0.1
MaxFP Library Preliminary version 0.1
MinFP Library Preliminary version 0.1
MinInt Library Preliminary version 0.1
Power Library Preliminary version 0.1
RGB To BGR
Round Library Preliminary version 0.1
CPU ID based function selection
This unit is used by all the library units in this section
CPU ID Detection Unit