Saturday, May 30, 2009

Link post

Some assorted interesting sounding tech:

Thrust is a STL-like tech for CUDA programming. Could be interesting, certainly looks like it is easier to use than CUDA, but just from my feelings it probably won't be worth anything until nVidia let multiple CUDA functions execute in parallel..

5 optimization tips for CUDA, a nice succinct roundup of some good performance tips for CUDA, including Arithmetic Instruction optimization.

DANCE framework for Dynamic Animation and Control, definately something I want to check out for potential synergy with the Physics engine Abstraction Layer. Proper animation controllers is something PAL lacks.

ReactPhysics3d a new open source physics engine, very much in the early stages of development, I've seen a number of engines grow in time, I hope this one succeeds too, but it will need to find a niche if it is to survive..

Predictive-Corrective Incompressible SPH is a paper on estimating the SPH state without explicitly evaluating it, thereby saving some CPU cycles. I actually had a similar idea, so maybe this is a path for further research. Also on the SPH front, Co-rotated SPH for Deformable Solids, the idea is great, not convinced by their particular implementation though..

Perhaps these ideas might make it into the SPH routines for PAL..

and on the lighter side;

2D boy's prototyping framework for the world of goo.

LLVM for Beginners (Windows)

Low Level Virtual Machine (LLVM) is, officially, "Compiler Infrastructure". In my own words, it's a platform independent optimizing assembler. That is, you write high level assembly code for a virtual machine, and then LLVM goes through it, optimizes it, and pumps out the native assembly for your target platform (e.g. x86). The best part is that it can do it Just-In-Time (JIT) too!

LLVM and Clang (the llvm c compiler frontend) first gained my interest when I was looking into OpenCL, and heard Apple's technology of choice was LLVM. After that, LLVM replaced GCC for compiling FreeBSD, and more recently, Rapidmind have sung LLVM's praise. Could LLVM be the future for high performance languages? (Goodbye CUDA/Open64?)

So, how do you get started? First, download LLVM, I just grabbed the MinGW builds for Windows. Now, we need a compiler (LLVM is just the 'assembler', Clang is the C front end). LLVM provide an online compiler. Copy past your code in there, for example:

int main() {
printf("hello world\n");
return 0;

The online compiler will then spit out the human readable LLVM assembly code, or 'll' code. Copy and paste this to a file, e.g. "hello.ll".
Now we can turn this into a binary bytecode representation using the LLVM assembler, llvm-as:
llvm-as hello.ll
This will generate hello.bc, the byte code version of the file.

We can either run this JIT with lli, (just type lli hello.bc) or create an assembly file with llc the LLVM static compiler. If you wish to compile the program for windows, you need to change the target provided by the online compiler to a windows compatible compiler. Do this by replacing the 'target triple = "i386-pc-linux-gnu"' line in your ll file to 'target triple = "i586-mingw32msvc"'. (Thanks Eli!) Re-run the assmber, Now your ready to compile:
llc hello.bc
Which generates hello.s

At this point, you need to use your platforms compiler, for example MinGW's gcc:
C:\MinGW\bin>gcc hello.s


hello world

Cool! The LLVM community is very open and friendly so I'm looking foward to trying some nifty things with LLVM!

Crosshatching in VBA for Microsoft Office 2007

The new versions of Microsoft Office don't natively support cross hatching anymore, not in Word or Powerpoint anyway.

So if you want to get lovely crosshatched diagrams and different patterns you can use this script I wrote. Just press view, macro, create a new one, paste in the code, then run it on your selected object. A great way to make venn diagrams with overlapping crosshatched regions in Powerpoint 2007!

Sub crosshatch()

Dim Sh As Shape
Dim oldFColor As Long

If Not ActiveWindow.Selection.Type = ppSelectionShapes Then
MsgBox "Select something, then try again"
Exit Sub
End If

Set Sh = ActiveWindow.Selection.ShapeRange(1)

strMenu = "Select pattern: 1. Horizontal, 2. Vertical, 3. Upward Diagonal, 4. Downward Diagonal "
rc = InputBox(strMenu, "Menu", 1)
If IsNumeric(rc) Then
Select Case rc
Case 1
p = msoPatternHorizontal
Case 2
p = msoPatternVertical
Case 3
p = msoPatternUpwardDiagonal
Case 4
p = msoPatternDownwardDiagonal
End Select
MsgBox "Invalid Selection"
Exit Sub
End If

With Sh
With .Fill
.Transparency = 0.5
.Visible = msoTrue
.ForeColor.RGB = RGB(0, 0, 0)
.BackColor.RGB = RGB(255, 255, 255)
.Patterned p
End With
End With

End Sub

Wednesday, May 27, 2009

Fast division

This is a trick I wish I'd known back in 2000:


int exactDiv3(int x) { // given that x is a multiple of 3
return x/3;


int exactDiv3(int x) {
return x * 0xaaaaaaab;


Generate your own optimized magic division numbers!
There is a great blog post detailing how this works here

Wednesday, May 20, 2009


Designer Mike Burton has developed a wall of water, called ‘the Waterboard’. Is is in interactive whiteboard that allows users to manipulate the flow of water. Nifty!

Check out the video on youtube, and some more info here.

Wolfram Alpha

Wolfram Alpha claims to be a computation engine, so it can tell us things about maths, apparently.

I heard it could 'compute' things, based on a huge database it had, so I needed to know the probability of something occuring, I thought I'd try out Wolfram Alpha. Needless to say, it returned nothing. I wasn't really supprised.

I though, ok, thats perhaps a bit obscure, perhaps I'll try a few maths-y terms from computer graphics and see how we go.
spherical harmonics, returned nothing. OK,.. perhaps a bit too obscure? (But Spherical Harmonics are described on their own website..)

How about the bitangent vector.. again, nothing. Wierd, again, there is information on the bitangent vector on the Wolfram website!

Clearly I'm asking for too much, so I tried the simple cross product. OK, I'm pretty sure this was covered in high school..

Exactly how stupid do your questions have to be before Wolfram Alpha gives you some information? dot product? No joy. vector at last! It knows something about maths... .. .

I'm sure it won't be long know before it achieves singularity.

Saturday, May 16, 2009

Zero Moment Point

The Zero tipping Moment Point (ZMP) is useful in determining if a bipedal robot is in a stable configuration or not. It has been used extensively in a number of biped control algorithms, most notably at Honda. The ZMP is the point on the ground where the sum of all the moments of the active forces is equal to zero. It helps to clarify this with equations and diagrams:
Given a robot we can write:
F = m.g - m.a

Where m is the robots mass, g is gravity and a is the acceleration of the center of mass.

The moment on any point is:
Mp = p.COM x m.g - p.COM x m.a - H

Where p is the point in question, COM is the center of mass, and H is the rate of angular momentum at the center of mass.

If we can find a point on the ground that balances the motion of the robot then we have the ZMP.

P.zmp = n x MP / F.n

Where P is the projection of the ankle to the ground, and n is the grounds normal.

For more information refer to Honda's ASIMO history which gives a few diagrams and an overview, and Philippe Sardain and Guy Bessonnet article Forces Acting on a Biped Robot: Center of Pressure, Zero Moment Point.

Friday, May 15, 2009

Very Sleepy Profiler

I stumbled across Richard Mitton's awesome free profiler for MSVC, Very Sleepy.

Install, and compile your program with debug information, run the profiler, select your process and thread, and sample away!

Amazing! The best tool ever, couldn't be easier and does exactly what I want.

As you can see from the screenshot, it will even take you straight to the code you need to spend some time optimizing. Nifty.

Wednesday, May 13, 2009

Point Sprites and Billboards

Point sprites are awesome. I love them. Simple, fast, efficient and beautifully.
The main thing that first got me hooked on DirectX, since OpenGL was taking forever to come with a point sprite extension.

Lighthouse 3D's tutorial on billboards,CodeSampler's tutorial on ARB Point Sprites and an explanation of the ARB extensions should come in handy for anyone trying to implement a particle system.

Some use of alpha blending will obviously help too. (See this description of OpenGL Blending modes). I'm not a fan of the OpenGL blending modes either (DirectX gives far more control), so using pixel shaders is probably for the best.

Ofcourse, that will look dated (unless you push into the millions of particles, which you easily can with the wonderfully fast point sprite extensions). So to make them look good and blend soft particles will really help. The nVidia website has some examples, and there is a presentation from Petter Börjesson and Mattias Thell Chalmers Advanced Computer Graphics course on Soft Particles. Volumetric particles will take it up another notch, but requires non-trivial effort.. (well, actually you just raytrace a sphere which is pretty easy, but does take a bit to set up)

Of course to do that you need access to the Z buffer, which is actually quite straight forward in OpenGL using the ARB_depth_texture extension:
zSize, zSize, 0,
Then just copy the buffer to your zbuffer texture:
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, zSize, zSize);
(Disable color writes and shading for speed: glShadeModel(GL_FLAT); glColorMask(0, 0, 0, 0);)
See this tutorial on shadow mapping with the zbuffer.

Here are some resources on implementing rain effects, an amazingly wonderfully convenient database of rain streaks, and some rain code to go with it.

Enjoy your particle effects!


Since I've been sucked into the buzz about Larrabee, and after some discussions with some people at nVidia convinced me that it might not be a bad GPGPU afterall. I wanted to share some detective work over at beyond3d:

Looks like Larrabee will have 32 cores (as opposed to the original 24 suggestions) and, apparently 8 texture units. This means even at the low clock speed Intel originally quoted it will definitely be pushing around more than 1 teraflop peak. Will it actually be able to give us anything near peak though? I guess time will tell.

According to an 8U CUDA GPU/CPU cluster can achieve 1 tereaflop on the LINPACK benchmark. Now you just need a couple of hundred GPUs at your disposal and your in the top500.

Sunday, May 10, 2009

Word 2007 references

Word 2007 passed my initial tests on creating large documents, so I've been using it ever since. This has allowed me to continue not-learning LaTeX. This hasn't been a great plan. Word 2007 still crashes a lot, but its recovery is near-perfect. Master documents actually seem to work now as well, so you can keep sections of your document separated, which is very helpful in reducing the amount of work word needs to do.

One thing that has irritated me is the lack of citation styles, but I came across the BibWord website, which adds a IEEE and ACM citation style to Word 2007. Finally!

The word team also have a post on how to create your own citation styles in word 2007.

Matali Physics Engine - XNA

I came across this new physics engine, Matali Physics. It says it is designed for .NET/XNA, etc. and the features page state that it comes with a few niceties like camera configurations, controllers/AI, etc.

The developers are polish, and seem to have a bit of an oldschool background, since their other product is a Atari emulator. Wierd.

Saturday, May 09, 2009

VBA for Excel

Extracting a number from text in excel is.. amazingly complex.

Here is the Microsoft way of extracting a number from a string in excel.

You know your in trouble in that they provide an 'algorithm' for doing this.
This is their way:

Logical right? or not..

Why on earth they didn't provide a "scanf" variant is beyond me.

Luckily, VB script to the rescue again: (to get the trailing number on the right)

Function GetNumber(MyEntry As String)
Do Until Not IsNumeric(MyNumber)
i = i + 1
MyNumber = Right(MyEntry, i)
GetNumber = Right(MyNumber, i - 1)
End Function

And in excel:
and off you go!

Selling out

Saw this great advice on selling out slashdot, of all places. I often am unsure whether to go with a deal or not.
Think about it. If they were offering you one billion dollars, would you still be unsure?
If your answer is "I may still be in two minds about selling", then don't sell.
If your answer is "I would sell in a heartbeat", then lower that limit, i.e. would you sell for a hundred million? 10 million? 5 million?

Put a price on what you would be willing to sell for without hesitation. Then evaluate the difference between that amount and the offer. If it is less than 25%, sell. If it is over 50%, don't. In between 25% and 50%, I don't know - but may you don't even fall in that window.

(I've somewhat modified the quote)

VBA for Powerpoint

I like to create figures for presentations or papers in Powerpoint 2007, it's just such a simple tool to use, and since I have no design talent the default templates are a god-send.

Often it is difficult to create the effect you need, but thats where the hidden power of macros comes in.

Here are two Powerpoint 2007 macros, one for creating a regular grid of rectangles, another for creating random dots. Both work inside a given shape for all your spotted object needs be it a dalmatian, to a chessboard, to creating diagrams of Eulerian and Lagrangian fluid representations.

(NOTE: I don't actually understand VBA script, I just hobbled it together from examples on the web.)

Sub grid()

Dim oSh As Shape
Dim oSld As Slide
Dim sngWidth As Single ' width/height of a grid rect
Dim sngHeight As Single
Dim lCols As Long
Dim lRows As Long
Dim x As Long ' which col across are we making
Dim y As Long ' which row down are we making
Dim sngLeft As Single ' where to draw current rectangle
Dim sngTop As Single
Dim sTemp As String

If Not ActiveWindow.Selection.Type = ppSelectionShapes Then
MsgBox "Select something, then try again"
Exit Sub
End If

' get rows/cols from user
sTemp = InputBox("How many columns?", "Columns")
If CLng(sTemp) > 0 Then
lCols = CLng(sTemp)
sTemp = InputBox("How many rows?", "Rows")
If CLng(sTemp) > 0 Then
lRows = CLng(sTemp)
Exit Sub
End If
Exit Sub
End If

Set oSh = ActiveWindow.Selection.ShapeRange(1)
Set oSld = oSh.Parent

sngWidth = oSh.Width / lCols
sngHeight = oSh.Height / lRows

For x = 0 To lCols - 1
For y = 0 To lRows - 1
' with osld.Shapes.AddShape(msoShapeRectangle, left, top, width, height)
With oSld.Shapes.AddShape(msoShapeRectangle, oSh.Left + x * sngWidth, oSh.Top + y * sngHeight, sngWidth, sngHeight)
Call .Tags.Add("Grid", "YES")
End With

End Sub

Sub spots()
Dim oSh As Shape
Dim oSld As Slide

If Not ActiveWindow.Selection.Type = ppSelectionShapes Then
MsgBox "Select something, then try again"
Exit Sub
End If

Set oSh = ActiveWindow.Selection.ShapeRange(1)
Set oSld = oSh.Parent
For x = 0 To 50
With oSld.Shapes.AddShape(msoShapeOval, oSh.Left + oSh.Width * Rnd, oSh.Top + oSh.Height * Rnd, 10, 10)
Call .Tags.Add("Grid", "YES")
End With
End Sub

Friday, May 08, 2009

Particle Systems

When implementing a large particle system, chances are that will have to interact with something. Often, with other particles (as in the case of SPH).

The most obvious approach is a uniform grid. As soon as you think you might want a large range, the idea of a hashed grid will cross your mind. Turns out, these are great data structures for a GPU. If your new to the topic, Christer Ericson's Real-time collision detection book offers a nice introduction to the topic and some optimizations.

My first implementation for my particle system was a straight froward uniform grid. To avoid dynamic memory allocation I just had a static maximum sized array, (which turned out only to need to be able to store 8 elements max), so this isn't too bad. If you start having larger cell sizes, or start hashing the cells then you might end up with a problem if you limit how many cells you can store. Otherwise from information I've gathered in the CUDA forums, this approach should be relatively fast on the GPU. (Trading a lot of GPU memory for speed)

So this approach basically goes like:
1. Clear, then populate a grid
(calculate the particles grid position, add it to the 'list' for this cell in the grid)
2. For each particle, look up its grid position, and pull out a list of other particles in this cell, do the same for all neighboring cells.
3. Run through this list, and do your comparisons with your current particle.

You might need to sort this list and eliminate duplicates, depending on how you handle corner/border cases. (Since this list is usually <8 elements, you don't need a fancy sort routine.)

Some nice websites on sorting routines, animated sorting and not-animated sorting. (ie : Bubble/Insert should do just fine.)

If you start hashing cells and having larger numbers of particles occupying one cell, (or from now on, 'bucket') you might want a different approach. This is where the technique nVidia uses shines.

The best resource on this I've found is Simon Green's GDC08 presentation. Also, Mark Harris' CUDA fluids presentation is worthwhile as it goes into a bit more depth on the CUDA/sorting front. (well, actually the best resource was Mark Harris himself, but I didn't actually know he worked on this specifically (PhysX Fluids) until after asking him stupid questions. I probably should have googled a bit more first to save the embarrassment).

Anyway, the approach goes like this:
First you build the data structures you need:
1. Populate a list with a pair of values, corresponding to the particle ID and the cell it occupies.
2. Sort this list (so that you can figure out which particles are in the same cell)
3. Create a grid structure which contains the index to this list where the first particle entry appears

Then to find neighbors for each particle:
1. For each neighboring cell
2. Calculate the cell id, look it up in the grid to get the start index in the list
3. Step through the list doing your particle-particle calculations until a new cell id occurs.

Again, with pictures:

For each particle, we can figure out a cell id.
eg: Particle 0, has a cell id of 9.

Now we can add this to a list, which contains the cell id, and particle pairs.

Then we sort this list, and populate our grid structure with -1 if it is empty, or the earliest index into this list if it contains a particle.

Great, again, how to use this?

For each particle, we look it up into the grid. This gives us the starting index in our list. Then, all the particles in this cell can be referenced from the pairs in the list.

For example, particle 4 is in cell 6, so we look up cell 6 in the grid. This gives us a list index of 2.

Looking into our list at index 2, we see particle 1, and following until we no longer have a cell id of 6, we get particles 1,2,4. (Not particle 0, it has cell index 9!)

So thats how we have a nice efficiently growing grid on the GPU with one big hash bucket. All we need to do is have a GPU friendly sorting routine.

How convenient, that we already have a fast radix GPU sort in the CUDA sdk:
CUDA Radix sort (again, from Mark Harris & co.). (No comparison to GPU QuickSort, who knows which is faster..)

Buying a Car

I always pay attention to the safety record of a car before a purchase. Cars are definitely the most dangerous things we interact with in our lives. Its worth spending an extra $1k to not-die.

SafeCarGuide has a set of crash test reports, from a range of places and a long history. EURO NCAP have a nice website with nicely presented reports.

ANCAP, the Australian version isn't too bad, but it doesn't test many cars.

CUDA Tips and Tricks

I attended the GPGPU presentation by Mark Harris, with lots of tips and tricks. It was really great value. I'm just going to write down what is essentially my notes on the event, so this won't be much use to you unless you already know a bit about the GPU arch and GPU programming.

First thing, I wasn't aware of, all of nVidia's tech targets 'PTX' an abstract assembly language. So Cg compiles to PTX, CUDA compiles to PTX, OpenCL compiles to PTX, etc.

PTX can be then just-in-time compiled and optimized for a target platform, or you can combine it with a targeted compile into what is called a 'Fat' binary. Neato. Apparently RapidMind are building a full C++ GPU compiler, but judging from their history I guess it would be an all-round stream-processing compiler (SPU,etc as well).

On the hardware front, the GPU is basically structured as a bunch of PE's grouped into multiprocessors (SMs) which in turn either have a 1-1 or 2-1 relationship to Cooperative Thread Arrays (CTA). Instructions are scheduled in 'Warps' which are 32 threads at a time (ie, the instruction unit synchronously decodes 32 instructions, one for each thread at a time, AFAIK). The instruction cache on current chips is ~2MB.

The hardware has special/super function units, which are used for rsqrt, etc. and they have 1/4 the throughput of normal ALU instructions, that is, if all your threads do a sqrt at once, they will need to wait on each other to issue it. Likewise double precision instructions get serialized, so you loose your 8x performance boost.

The texture units are also accessible under CUDA, these do the usual graphics texture unit things, let you interpolate between values, address them in wrap-around etc, all for nearly free. They are also cached in a 2D spatial coherency style, that could be interesting for image processing algorithms.

Memory is accessed in half-warps (16) at a time, and depending on your arch type 1.2 or above then this will cause memory coalescing. (For 1.0,etc this means if you read memory in a non in-order continues way it will cause stalls, or, in nVidia talk 'coalesced') 1.2 style hardware was very forgiving and would be smart enough to mask out different read types to optimize its read request. This was actually pretty damn impressive. Reading memory in aligned boundaries can be optimal (eg: prefer float4 over float3)

Speaking of memory, local thread memory doesn't really exist (it's just mapped to global memory, and you get no control over it - lame!), instead each thread has some allowed number of 32bit registers it can use, as specified by a compile flag. So if you want some super-fast local storage, you might want to up the reg's per thread. Ofcourse, if you do that, then not so many threads can be run at once, limiting how much you can mask synch/mem/sqrt latencies etc.

Your next bit of fast memory is of course shared memory. (Shared between blocks. Something special happens with working with shared memory if its <32 threads, I can't recall exactly, but I remember thinking it was cool. Something like not needing synch's or more likely, really fast synchs). Global memory is persistent between CPU threads and applications, which could be a nifty feature for some larger scale applications. (Lots of GPU RAM could be very handy in a future iMagic style separated rendering engine/content loading system)

When copying memory to the GPU, you can malloc from DMA bound memory, to stop CUDA from needing to do an additional memcopy, and future hardware will have 'Zero'Memory which means you can write directly into DMAed system memory. Nifty. Theres also asynch mem copies, and streaming mem stuff. Not really stuff I need right now so I wasn't all to interested in it. Didn't look too complicated though.

A fair bit of time was spent on talking about bank conflicts, but I didn't really get it. Basically having multiple different bits of your code read from the same spot is bad. This happens more than you might thing due to some magic modulo arithmetic on the GPU.

Some optimization rules of thumb: at least 64 threads per block, and you want 6 warps or more to hide instruction latency . CUDA intrinsics are specified with __sin, or use_fast_math compiler flag. At least 2 block per physical processor, so that sync calls will let the other block execute instead of just busy-waiting.

Well, thats the end of my brain dump of the day. HTH

UPDATE: No sooner had I pressed 'publish' and then:
CUDA 2.2 is out.
Visual Profiler works on Vista! Yay!

Wednesday, May 06, 2009

Blender Dome

Paul Bourke has released a blender dome plug-in that should make it into the main branch for the next version. Now all I need is a dome.

GPU Fluids - iVEC report

iVEC have released their student project reports, including the report by Niel Osborne on our work on GPU fluid simulation.