random.h is a header only random number generation library for games, image processing and procedural content generation.

A few years back I wrote some handy floating point random number generation routines for a raytracer.  Time and again, I have needed these routines for other programs I have been working on.  It seems like a good idea to wrap them up in a small header and release them in the public domain for others to use too.

Each RNG function takes an 64bit integer as a seed value and permutes it in place.  The resulting bits are used for the generation of the result.  I like this design as it keeps the RNG functions stateless and thread safe.  You just have to keep track of the uint64_t you have been using as your RNG stream.

Included are various 1, 2 and 3 dimentional random number generators with a range of different distributions.

Triangular random noise is very usefull when dithering audio and images during the process of bitdepth conversion.  Typicaly, triangular noise is added to the signal immediately before type casting (at the same magnitude as the LSB of the destination bit depth).

The 2d and 3d vector RNG functions generate statisticaly unbiased vectors inside and on the unit circle or sphere (Generating vectors by randomizing the components and normalizing would bias the vector to the corners of unit cube).  I have used these functions successfully in the past for generating ambient occlusion maps.

I wrote a little program to generate some images to display the output distribution of each function.  A 3d point is generated by filling its component from the RNG function under test, and the resulting point is projected in space, as well as projected on each component axis (the red, blue and green planes).  Hopefully the images help a little to convey the distribution of each function.

Again, you can find the library here:

randfu() - unsigned random 
range [0,1]


randfs() - signed random 
range [-1,+1]


trandfs() - triangular random 
range[-1,+1], tends to 0


grandfs() - gaussian random 
range ~[-1,+1], tends to 0


vrand3d() - 3d vector random 
magnitude range: [0, 1]


nvrand3d() - normalized 3d vector random 
magnitude = 1


vrand2d() - 2d vector random
magnitude range: [0,1]


nvrand2d() - normalized 2d vector random
magnitude = 1



Binary Heap

Despite lack of activity here I have been very busy programming cool stuff which I hope to get around documenting and sharing with you all.

I need to implement the A* algorythm for one of my projects. Its a fairly common to speed up of the implemention to use a binary heap for the open list because of its O(log(n)) insertion and O(1) removal properties.  As I code it up I thought it would be nice to share my binary heap implementation, something which can be a bit fiddly to get right in practice.  I wrote a little fuzz tester to nuke it, so that I can be fairly sure it works correctly.

Gist: BinaryHeap

I chose to make the size fixed at compile time for simplicity, and becaues I am happy to terminate a pathfinding search that exceeds a fixed open list size, as old school games used to do.  It should however not require much work at all to grow the heap size when a node is pushed when currently at full capacity.

I will also leave you with this picture from my latest project:


Coroutines, x64 and Visual Studio

I really like co-routines, finding them really useful for programming game AI and scripting.  Since visual studio has dropped support for inline assembly when compiling x64 code I thought I would be out of luck trying to implement co-routines for this target.  Fortunately visual studio can still compile assembly files using masm in x64 mode.  In order to implement co-routines I had to familiarities myself with x64 calling conventions, something that has changed quite a bit from x86.  The first 4 arguments are passed on the stack, there are many more callee save registers, and more obviously all registers and pointers have been extended to 64bits.

Below is a small demo showing how to implement co-routines in visual studio for x64 targets.

#include <stdio.h>

typedef unsigned long long u64;
typedef void * user_t;
typedef void *stack_t;

typedef void (*cofunc_t)( stack_t *token, user_t arg );

/* coroutine functions */
extern "C" void yield_( stack_t *token );
extern "C" void enter_( stack_t *token, user_t arg );

/* artificial stack */
const int nSize = 1024 * 1024;
static char stack[ nSize ] = { 0 };

/* prepare a coroutine stack */
void prepare( stack_t *token, void *stack, u64 size, cofunc_t func )
    u64 *s64     = (u64*)( (char*)stack + size);
         s64    -= 10;                 // 10 items exist on stack
         s64[0]  = 0;                  // R15
         s64[1]  = 0;                  // R14
         s64[2]  = 0;                  // R13
         s64[3]  = 0;                  // R12
         s64[4]  = 0;                  // RSI
         s64[5]  = 0;                  // RDI
         s64[6]  = (u64) s64 + 64;     // RBP
         s64[7]  = 0;                  // RBX
         s64[8]  = (u64) func;         // return address
         s64[9]  = (u64) yield_;       // coroutine return address
          *token = (stack_t) s64;      // save the stack for yield

/* coroutine function */
void threadFunc( stack_t *token, user_t arg )
    for ( int i=0; i<10; i++ )
        printf( "  coroutine %d\n", i );
        yield_( token );

/* program entry point */
int main( )
    stack_t token = nullptr;

    /* prepare the stack */
    prepare( &token, stack, nSize, threadFunc );
    /* enter the coroutine */
    enter_( &token, (void*)0x12345678 );
    /* simple test loop */
    for ( int i=0; i<10; i++ )
        printf( "main thread %d\n", i );
        yield_( &token );
    /* program done */
    printf( "program exit\n" );
    getchar( );



;---- ---- ---- ---- ---- ---- ---- ----
; coroutine yield function
;   : void yield_( void * token );
;   'token' -&amp;gt; RCX
yield_ proc

    push RBX
    push RBP
    push RDI
    push RSI
    push R12
    push R13
    push R14
    push R15

    mov  RAX ,  RSP
    mov  RSP , [RCX]
    mov [RCX],  RAX

    pop R15
    pop R14
    pop R13
    pop R12
    pop RSI
    pop RDI
    pop RBP
    pop RBX


yield_ endp

;---- ---- ---- ---- ---- ---- ---- ----
; enter a co-routine
;   : void enter_( void * token, void * arg1, ... );
;   'token'     -&amp;gt; RCX
;   'arg1, ...' -&amp;gt; RDX, R8, and R9
enter_ proc

    jmp yield_

enter_ endp


Fuzzy Compression


I recenty programmed a very strange and almost worthless image compressor for fun.

The image on the left is 230400 bytes and when reduced to greyscale takes up 76800 bytes.  The image on the right was reconstructed from 3072 bytes… thats 4% of the original size.  The compression time was somewhere in the region of an hour running on a single core.  It is very very lossy.

The process is fairly cool however.  The compressed image is reconstructed by performing 512 random walks over the screen, each leaving a pixel trail behind.  If the start locations and random number seeds for each random walk is computed correctly then the final result can be trained to produce any desired image.  The training I used for this demo was a very basic genetic algorythm / progressive refinement technique.  I imagine if left to its own devices to learn for a few days, the output image could look much cleaner, but I dont have that kind of patience.

The concept has been done before and much better:

see here: http://www.pouet.net/prod.php?which=62917

It seems the creator of this demo managed to fit that into 256 bytes including the unpacker.  Here are some of the things he says about his demo:

There are 64 pseudo random brush strokes of cycling color. Each stroke is shorter than previous to increase details. Intro code generates set of random numbers starting from initial 32bit seed with Galois LFSRs as a PRNG. This set is used to control direction of brush movement and and starting position. Each stroke has initial 16bit value that influences the seed. Therefore data is 4 bytes (initial seed) + 64*2 bytes of stroke PRNG seeds. Picture of Mona Lisa (3072 bytes) was compressed to those random brush strokes by external optimization program (a few days of work for modern CPU/GPU combo). The process can be seen as lossy compression (23x) of picture into random brush movements.

The seeds were found with iterative brute-force approach. Each brush stroke is a layer which covers the previous one, therefore is mostly independent in calculations. For each layer there is a 16bit seed to be found that generates the best picture possible. Therefore there are 64 layers * 65536 possible values per layer and the program iterates through all of them (4 million of combinations – acceptable). This should be done for each initial 32bit seed (means *4 billion combinations) but can be performed in parallel. I’ve checked only a few thousands of randomly chosen initial seeds to find a nice looking picture. Details mask had to be used to multiply difference error on critical parts of the picture (eyes, nose, mouth). Otherwise too many details were put in the background.

Minecraft 4K C++ Port

A while ago I discovered Minecraft4k, a java demo written by Notch fitting into a 4kb object file.  He used software raycasting to rasterize a large 3d array of voxels. The original demo can be played here as an applet:


At the same time I found that Notch had made a Javascript port of the same basic engine, minus user input.


One awesome thing about these two demos is that all of the textures are synthesized proceduraly for space saving reasons.

Seeing his work, I got a strong urge to port the code to C++ and SDL.

Grab the source code here:


Playing with MicroThreads

I recently bought Game Programming Gems 2 and was captivated by an article titled ‘Micro-Threads for Game Object AI’.

A really neat technique is describes that can performing very fast cooperative multitasking, also sometimes referred to as co routines.  The heart of this technique involves manipulating the processor stack pointer to switch between multiple contexts of execution.  Its very similar in concept to true multitasking, however each thread decides when to share the processor.  The concept is simple enough, but really so powerful and radically different from many other techniques for AI I have used.  I knew I would have to program up a text to play with this idea.  Using this method, each object in my game can pretend it is running in its own thread and I am sure this can find application elsewhere too.

The article provides a nice description, but doesn’t provide enough information to directly copy an implementation, and my copy arrived without the CD, but that is fine however and I am having fun of working it out my own.  I have made a few projects in the past which manipulated the CPU registers to perform various tricks, so I can build upon that knowledge.

My first implementation appeared to operate correctly until exit where my call to delete[] threw up an assert.  This is because the fake stack was getting sprayed with bad data somehow from within my virtual thread. I will be spending lots of time in the debugger until I find out the cause.

Update 1:

I believe that the debug run time version of printf() is writing past the end of the 1kb stack, causing a buffer overflow. Hmm….

update 2:

I went back and read the article all over again and there were many hints that I simply didn’t have enough stack space at all.  The author also proposed a very nice improvement, where we effectively extend the stack size by copying each micro thread into much larger stack before execution.  Thus the Micro Thread code can make use of a large stack space, but individually they only take up a small amount of space.  After execution, we extract the smaller part of the stack again.  This places the limitation that _yield() cant be called deep in some code in the thread, but that is hardly a restriction at all.

Here is my demo code for anyone that want to have a look at its implementation.  I wrapped it up with a little API as well so it should be somewhat readable.


update 3:

My brain must have been thinking of improvements overnight since I woke up with three ideas I had to experiment with. In each case it was to address the stability and protect against stack overflows. The most major improvement is the use of VirtualProtect(), which is used here to set the bottom of the shared stack region to a guard page. This wastes 4k of stack space, but it is well worth it since now a nice exception is thrown if we write to the stack anywhere in the bottom 4k.

The second improvement, is to check the value of the stack pointer upon return from a micro thread. If it falls within the space allocated to each thread then everything is fine, otherwise we will need to allocate more stack space for that thread. If this check fails, and esp lies outside of our thread space, then we can be sure that we wont be saving all of that threads state during the task switch. That would lead to very very bad things.

A final improvement is to address what happen when the micro thread function returns. Currently, there is no return address on the stack, so execution could jump literally anywhere.
The solution is to ensure that we have a valid address for this function to return to. We can set the return address to a custom function that can handle the situation on one of two ways. We can setup the stack and processor state so that it will execute the micro thread function once more, thus it will feel like our micro thread function is permanently called. Alternatively we can mark that thread as finished, switch back to our main thread, and then kill the micro thread cleanly. Since both options have merit, I will implement both of then and let the user specify the handling during thread creation.

update 4:

I am now playing with Vectored Exception handlers to see if I can recover from stack overflows, and automatically terminate a malfunctioning thread.  I have confirmed that an exception is being thrown when I intentionally overflow the stack from within a micro thread.  It shouldn’t be too hard to switch the context back from within a vectored exception handler.  I remember having played around with something similar, having self modifying code from an exception handler if a debugger is detected.  According to the original article, it states that structured exceptions no longer work correctly from micro threads since no handlers are on the stack, perhaps this is something I can look into addressing also.

update 5:

Here is the latest version of the library provided as a source release.  The vectored exception handler really works amazingly well, terminating any micro thread that attempts to overwrite its stack.  If a micro thread returns from its function, the thread will terminate as well.  There are many more stability improvements and validity check making this quite a viable option for use in any project now.  I have included a small test program also so that the functionality can be demonstrated.