
Sunday, August 05, 2012

Programming links

Well overdue for a catchup post on the non-graphics programming side of things, so here we go:

Thursday, June 07, 2012

GPU Technology Conference 2012

nVidia's GPU Technology Conference is over, and a number of presentation slides have been uploaded. There were quite a few interesting talks relating to graphics, robotics and simulation:
  • Simon Green from nVidia and Christopher Horvath from Pixar presented 'Flame On: Real-Time Fire Simulation for Video Games'. It starts with a recent history of research on CG fluid systems, and gives five tips on better-looking fire: 1. Get the colors right (e.g. a radiation model); 2. Use high-quality advection (not just bilinear filtering); 3. Post-process with glow and motion blur; 4. Add noise; 5. Add light scattering and embers. They then go into more detail on Tip #1, looking at the physics behind the black-body radiation in a fire, and the color spectrum.
  • Elmar Westphal of PGI/JCNS-TA Scientific IT-Systems presented 'Multiparticle Collision Dynamics on one or more GPUs', about multiparticle collision dynamics GPU code. He starts by explaining the overall algorithm, then steps through what performs well on the GPU. Specific GPU optimisations explained include spatial subdivision lists, reordering particles in memory, hash collisions, and finally dividing the workload between multiple GPUs. An interesting read.
  • Michal Januszewski from the University of Silesia in Katowice introduces 'Sailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python'. He explains Lattice Boltzmann fluid simulation and some of the different configurations of lattice connectivity and collision operators, moves into code generation examples, and gives a brief explanation of how the GPU implementation works.
  • Nikos Sismanis, Nikos Pitsianis and Xiaobai Sun (Aristotle University, Duke University) cover 'Efficient k-NN Search Algorithms on GPUs'. It starts with an overview of sorting and k-Nearest Neighbour (kNN) search algorithm solutions, including ANN (approximate NN) and lshkit, and moves into results, including a comparison of thrust::sort with Truncated Bitonic sort. Software is available at http://autogpu.ee.auth.gr/.
  • Thomas True of nVidia explains 'Best Practices in GPU-Based Video Processing' and covers overlapping copy-to-host and copy-to-device operations, and an example of processing Bayer-pattern images.
  • Scott Rostrup, Shweta Srivastava, and Kishore Singhal from Synopsys Inc. explain 'Tree Accumulations on GPU' using parallel scatter, parallel reduce and parallel scan algorithms.
  • Wil Braithwaite from nVidia presents an interesting talk on 'Interacting with Huge Particle Simulations in Maya using the GPU'. He begins with a brief run-through of the workings of the CUDA SPH example, and then moves on to the particle system, including Maya's body forces (uniform, radial, vortex), shape representations (implicit, convex hull, signed distance fields, displacement maps), collision response, SPH equations, and finally data transfer. He ends with a brief overview of rendering the particles in screen space. Neat.
  • David McAllister and James Bigler (nVidia) cover the OptiX internals in 'OptiX Out-of-Core and CPU Rendering', including PTX code generation and optimisation, and converting the OptiX backend to support CPUs via Ocelot and LLVM. An interesting result: LLVM does better at optimising "megafunctions" than small functions - not entirely unexpected given how LLVM works. The presentation finishes with an overview of paging and a tip on bounding volume hierarchies. Good to see Ocelot in the mainstream.
  • Eric Enderton and Morgan McGuire from nVidia explain 'Stochastic Rasterization' (a la 'screen door transparency' rendering) via MSAA for motion blur, depth of field and order-independent transparency, by using a geometry shader to bound the shape and motion of each triangle in screen space, and setting up the MSAA masks. Nice.
  • Cliff Woolley presents 'Profiling and Tuning OpenACC Code' (adding pragmas to C/Fortran code, a la OpenMP) using an example of Jacobi iteration; there were a number of other talks on the topic.
  • Christopher Bergström introduced 'PathScale ENZO', their alternative to CUDA and OpenCL.
  • Phillip Miller from nVidia gave a broad overview of 'GPU Ray Tracing'. He starts with myths and claimed facts about GPU ray tracing, highlights some commercial GPU ray tracers (and the open-source OpenCL LuxRender), and goes into some details that are better explained in the OptiX Out-of-Core presentation.
  • Phillip Miller follows with 'Advanced Rendering Solutions', where he takes a look at nVidia's iray, and where they believe they can introduce new capabilities for design studios and find a middle ground with re-lighting and physically based rendering.
  • Peter Messmer presents 'CUDA Libraries and Ecosystem Overview', where he provides an overview of the performance of the cuBLAS and cuSPARSE linear algebra libraries, moves on to signal processing with cuFFT and NPP/VSIP for image processing, then random numbers via cuRAND, and finally ties things up with Thrust.
  • Jeremie Papon and Alexey Abramov discuss the 'Oculus real-time modular cognitive visual system', including GPU-accelerated stereo disparity matching, likelihood maps and image segmentation with a parallel Metropolis algorithm.
  • Jérôme Graindorge and Julien Houssay from Alyotech present 'Real Time GPU-Based Maritime Scenes Simulation', beginning with ocean simulation and rendering from FFT-based wave simulation using HF and LF heightmap components. They then cover rendering the mesh, scene illumination and tone mapping, and a sneak peek at boat interaction. The ocean simulation video is neat.
  • Dan Negrut from the Simulation-Based Engineering Lab at the University of Wisconsin–Madison gives an overview of the lab's multibody dynamics work in 'From Sand Dynamics to Tank Dynamics', including friction, compliant bodies, multi-physics (fluid/solid interactions), SPH, a GPU solution to the cone complementarity problem, ellipsoid-ellipsoid CCD, multi-CPU simulation, and finally vehicle track simulation in sand. Wow. Code is available on the Simulation-Based Engineering Lab website.
  • Max Rietmann of USI Lugano looks at seismology (earthquake simulation) in 'Faster Finite Elements for Wave Propagation Codes' and describes parallelising FEM methods for GPUs in SPECFEM3D.
  • Dustin Franklin from GE introduces GE's MilSpec ruggedised Kepler-based GPU solutions and Concurrent Redhawk6 in 'Sensor Processing with Rugged Kepler GPUs'. He looks at some example applications, including hyperspectral imaging, mosaicing, 360-degree vision, synthetic aperture radar processing, and space-time adaptive processing for moving target identification.
  • Graham Sanborn of FunctionBay presents 'Particle Dynamics with MBD and FEA Using CUDA', giving a brief overview of their combined CPU/GPU multi-body FEA system and describing the contact, contact force, and integration steps.
  • Ritesh Patel and Jason Mak of University of California-Davis cover the Burrows-Wheeler Transform, Move-to-Front Transform and Huffman Coding in 'Lossless Data Compression on GPUs'. They find merge sort for BWT performs best on the GPU, explain the parallel MTF transform and Huffman coding in illustrative detail, and tie things up with benchmarks; unfortunately the GPU is 2.78x slower than the CPU.
  • Nikolai Sakharnykh and Nikolay Markovskiy from NVIDIA provide an in-depth explanation of their GPU implementation of solving ADI with tridiagonal systems in '3D ADI Method for Fluid Simulation on Multiple GPUs'.
  • Enrico Mastrostefano, Massimo Bernaschi, and Massimiliano Fatica investigate breadth-first search in 'Large Graph on multi-GPUs' and describe how best to parallelise it across multiple GPUs by using adjacency lists and level frontiers to minimise the data exchange.
  • Bob Zigon from Beckman Coulter presents '1024 bit Parallel Rational Arithmetic Operators for the GPU', covering exact 1024-bit rational arithmetic (add, sub, mul, div) for the GPU. Get the 1024-bit arithmetic code here.
  • Roman Sokolov and Andrei Tchouprakov of D4D Technologies discuss 'Warped parallel nearest neighbor searches using kd-trees', where they take a SIMD-style approach by grouping tree searches via voting (ballot).
  • David Luebke from nVidia takes a broad look at CG in 'Computational Graphics: An Overview of Graphics Research @ NVIDIA' and provides an overview of research featured in a number of previous talks and other GTC talks, including edge-aware shading, ambient occlusion via volumes and raycasting, stochastic rendering, improved image sampling and reconstruction, global illumination, and CUDA-based rasterization.
  • Johanna Beyer and Markus Hadwiger from King Abdullah University of Science and Technology discuss 'Terascale Volume Visualization in Neuroscience', where each cubic mm of the brain scanned with an electron microscope generates 800 terabytes of data. The idea here is to leverage the virtual memory manager to do all the intelligent caching work, rather than a specialised spatial data structure for the volume rendering.
  • Mark Kilgard introduces the NV_path_rendering extension in 'GPU-Accelerated Path Rendering', and demonstrates using the GPU to render PDF, flash, clipart, etc. Contains some sample code.
  • Janusz Będkowski from the Warsaw University of Technology presented 'Parallel Computing In Mobile Robotics For RISE', a full GPGPU solution for processing mobile robot laser scan data through to navigation. He starts with data registration into a decomposed grid, which is then used for scan matching with point-to-point Iterative Closest Point. Next is estimating surface normals using principal component analysis, demonstrated on Velodyne datasets. This is used to achieve point-to-plane ICP, and he demonstrates a 6D SLAM loop closure. He finishes it all off with a simple gradient-based GPU path planner.
Note that in recent days more presentation PDFs have been uploaded, so there is still plenty to look through - take a look yourself! I'll leave you with a video from the GTC 2012 keynote on rendering colliding galaxies:

Friday, July 29, 2011

Half year catchup on Graphics, GPUs, Compilers, etc.

Another slow month on the blog. More than half way through the year, so it's time to catch up on the backlog of news. Only covering graphics, games, physics, GPGPU and compilers. Expect a number of posts on robotics soon!
Finally, here is the SIGGRAPH 2011 technical papers highlights video, it contains a number of interesting advances in physics simulations and modeling.

Thursday, March 31, 2011

March update

I didn't want to have a month without a post, so here are some links for the month:
The pick of the month has to be the new Festo production, the SmartBird ornithopter. (Official site has a PDF spec-sheet). Video below:

Wednesday, January 26, 2011

Executable compression

Packing an executable to a smaller size is a common task, and the compression algorithms have a lot in common with good text compression systems. Most good executable compressors have some method of interpreting, filtering or transforming the program instructions into a more compressible format. The most well-known compression algorithms are:
  • Run Length Encoding - where runs of repeated data are stored in a compressed form. For example, the word "beekeeper" may be represented as "b(2x e)k(2x e)per", which with an appropriate coding results in a reduction in size. This kind of compression works well on 8-bit images and is found in image formats such as BMP. (A minimal encoder sketch appears after this list.)
  • Huffman compression - each code is represented with a number of bits inversely proportional to its frequency of occurrence, so frequent symbols get short codes. For example, we could represent vowels with shorter bit sequences and consonants with longer ones; if "e" is represented by "1" and every other character by a "0" followed by the usual 8 bits, then "beekeeper" becomes "b11k11p1r" - 4x 9 bits plus 5x 1 bit, totalling 41 bits rather than 72 bits.
  • Lempel-Ziv (LZ) - a dictionary of all previous data is kept and references are made back into the history of the data. For example, "beekeeper" could be represented as "be(-1,1)k(-3,2)p(-2,1)r", which translates to "be", then the character one back (-1) copied for one (1) byte, giving "bee"; then "k"; then the sequence three characters back (-3) copied for two (2) bytes, giving "beekee"; and so on.
  • Arithmetic coding - similar to Huffman coding (and explained very concisely by Charles Bloom); in essence it also assigns bits according to the probability of occurrence (as a "floating point" number), and then re-scales/encodes the probabilities to fixed-point. Arturo ("Dario Phong") Campos has a concise explanation.
  • PAQ - this is generally considered the state-of-the-art compression algorithm. It combines the predictions of many conditional probabilistic models to approach optimal compression.
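As a toy illustration of the first of these, here is a minimal run-length encoder in C++ (a sketch only - real RLE formats need an escape mechanism so literal digits in the input don't collide with the run counts):

#include <cstdio>
#include <string>

std::string rle_encode(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size();) {
        size_t run = 1;                  // measure the current run of in[i]
        while (i + run < in.size() && in[i + run] == in[i]) ++run;
        if (run > 1) out += std::to_string(run);
        out += in[i];                    // emit (count)(symbol)
        i += run;
    }
    return out;
}

int main() {
    printf("%s\n", rle_encode("beekeeper").c_str()); // prints "b2ek2eper"
    return 0;
}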
There is plenty of source code around the web for these algorithms, but the Basic Compression Library provides routines for RLE (Run Length Encoding), Huffman, Rice, Lempel-Ziv (LZ77) and Shannon-Fano compression, and Matt Mahoney maintains the PAQ algorithms. There are also quite a few general executable compressors available, including UPX and Mpress, and both zip and 7zip can generate self-extracting executables. Microsoft CAB also performs very well (see trimmit for Mac). But the state of the art in self-extracting executable compressors for small programs are Crinkler and kkrunchy, for <4kb and <128kb programs respectively. (A good resource for various tools is available at in4k.)

These compressors use a standard compression algorithm but also transform the input to a more compressible format. The most common transform is the Burrows-Wheeler Transform (BWT), which attempts to re-order data so that similar codes are close to each other, increasing the data's compressibility.
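Ignoring the end-of-string marker and the suffix-array construction a real implementation would use, the transform itself is tiny - a naive C++ sketch:

#include <algorithm>
#include <string>
#include <vector>

// Naive O(n^2 log n) BWT: sort all rotations of the input and keep the
// last column, which clusters characters with similar contexts together.
std::string bwt_naive(const std::string& s) {
    std::vector<std::string> rotations;
    for (size_t i = 0; i < s.size(); ++i)
        rotations.push_back(s.substr(i) + s.substr(0, i));
    std::sort(rotations.begin(), rotations.end());
    std::string last_column;
    for (const std::string& r : rotations)
        last_column += r.back();
    return last_column;
}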

For executable compression, patterns in the instruction stream can be leveraged to increase the compression ratio. For example, instructions often address similar areas of memory, so simple delta-encoding of the address values can greatly increase the compression ratio.
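As a sketch of the delta-encoding idea (not how any particular packer implements it):

#include <cstdint>
#include <vector>

// Replace each address with the difference from its predecessor; nearby
// addresses become small, repetitive values that compress much better.
void delta_encode(std::vector<uint32_t>& v) {
    for (size_t i = v.size(); i-- > 1;)
        v[i] -= v[i - 1];
}

// The inverse transform, applied after decompression.
void delta_decode(std::vector<uint32_t>& v) {
    for (size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
}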

Recently, both kkrunchy and Crinkler were described in detail. The author of kkrunchy, Fabian "ryg" Giesen, wrote a great blog post describing how kkrunchy works; since kkrunchy works with larger file sizes, some more intelligent tricks can be applied. Alexandre Mutel decomposed the Crinkler compressor in his blog post and posted some fabulous pictures of before-vs-after entropy of a compressed executable. Unfortunately, both of these programs target Windows PE files, and so don't work on other platforms (see here, here, here and here for tips on PE).

The usual approach to exe compression on Linux/OSX is just to do the modern-day version of com/cab dropping: compress your program with gzip and have a script extract and execute it. This lacks a bit of flair, but works. Articles describing ELF and Mach-O are available by Brian Raiter and Amit Singh respectively, but unfortunately they are a little out of date.

A few tools have been written for Linux and OSX, including the byte-optimized linker (x86/64) and obj stripping for Linux, and iPakk and muncho for OSX (again, a bit dated). An older ELF stripper is also around, and Timo Wiren gives a good overview too. Unfortunately, a similar page on OSX is no longer available online (maybe one day). All in all, I've found your best bet on a platform other than Windows is to follow Timo Wiren's first step of advice and do a simple dropper:
  1. Write your program, and compile it: g++ helloworld.cpp
    #include <stdio.h>

    int main() {
        printf("hello world!\n");
        return 0;
    }
    
  2. Optional: Strip it
    strip a.out
  3. Compress it:
    gzip a.out
  4. Create the unpack script (unpack.header):
    a=/tmp/I;tail -n+2 $0|zcat>$a;chmod +x $a;$a;rm $a; exit
    
  5. Combine the script and zip into a final executable:
    cat unpack.header a.out.gz > program
    chmod +x program
    
  6. Enjoy!

My a.out went from 8768 to 1153 bytes. Nice. (Using strip saves me 160 bytes on the original and 136 bytes on the final, but I'm sure you can save more by being clever about use of the gcc crt, gcc flags and throwing a few tools at this process).

Tuesday, September 14, 2010

Intel Compiler on Linux

  • Make sure you have access to root. (eg: sudo su root, passwd)
  • Make sure you have your build environment set up. eg:

    sudo apt-get install gcc
    sudo apt-get install build-essential
    sudo apt-get install g++
    sudo apt-get install rpm
    
  • ICC requires an older C++ standard library, libstdc++5; you can get it for Ubuntu from:
    Package: libstdc++5. You just need to download the i386 package (eg from the Australian mirror, using wget [url]), and then install it, eg: "sudo dpkg -i libstdc++5_3.3.6-18_i386.deb".
    If you don't do this you may get the following error messages:

    Missing critical pre-requisite
    -- missing system commands
    
    The following required for installation commands are missing:
    libstdc++.so.5 ( library)
    
  • Download ICC
  • Unzip (tar -xvf filename)
  • ./install.sh
  • Choose to install for all (1)
  • Read the "welcome" message and follow the instructions for the licence, etc.
  • Activate your software (1) and provide the serial number that came in your email. You should see "Activation Completed Successfully"
  • Install the software (requires 3.35 GB). You should see a message "Checking the prerequisites. It can take several minutes. Please wait…".
  • You might see:

    Missing optional pre-requisite
    -- No compatible Java* Runtime Environment (JRE) found
    -- operating system type is not supported.
    -- system glibc or kernel version not supported or not detectable
    -- binutils version not supported or not detectable
    
    The JRE is only needed for the visual debugger; otherwise you can safely continue.
  • The installer then asks which components to install, eg "Intel(R) C++ Compiler Professional Edition for Linux*", just press "1. Install" to continue. It should state "component installed successfully."
  • Set up the paths: you can use iccvars.sh in /opt/intel/Compiler/11.1/073/bin to set everything up for you (eg: source /opt/intel/Compiler/11.1/073/bin/iccvars.sh ia32). You may wish to put this in your .bashrc file.
  • That's it! Type "icc" to invoke for C files or "icpc" for C++ files. For standard makefiles use "make CXX=icpc"
On a 1GHz VIA Esther processor, GCC 4.4.1, with -O2 -march=c3-2:
real 0m16.855s
user 0m16.849s
sys 0m0.004s
And, with ICC 11.1, with -O2:
real 0m11.369s
user 0m11.361s
sys 0m0.008s

An instant 48% speedup! Unfortunately not all code shows such a big change; some other compute-intensive code I tested only got a 4% speedup. In any case, I'd say it's worth it!

Monday, August 02, 2010

C++ debugging tools: Gdb, CppCheck, Valgrind

One of the most important parts of programming is the debugging phase and, during development, the testing phase. Some of the tools that can be used under OSX are the standard Linux tools - but of course, there are always a few tweaks.

CppCheck is a static analysis tool; it can find formatting issues, unused or sub-optimally used variables, pointer/memory issues, etc.

The install procedure for CppCheck is:
git clone git://github.com/danmar/cppcheck.git
cd cppcheck
make
sudo make install

For the optional HTML report you also need Pygments (http://pypi.python.org/pypi/Pygments):
sudo easy_install Pygments-1.3.1-py2.6.egg
cd htmlreport
sudo ./setup.py install


Note: I had to check out v1.44 for it to work with OSX, so after the git clone command add the following:
git checkout 1.44
git checkout -b 144

Here is an example shell script to run cpp-check (thanks John):
#!/bin/bash
cppcheck --enable=all --xml --verbose -f -I math -I boost/boost -I src/framework src 2> err.xml
cppcheck-htmlreport --file err.xml --report-dir="CodeAnalysisReport" --title="My Project"
Just alter it to match your own include directories and source.

If you need to do any runtime debugging, GDB is the standard tool. It can be quite complex, but here is its basic usage:
  • gdb (program name) [this will start GDB with your program loaded]
  • set args (arguments) [this will set the command line parameters to your program]
  • break (function or file:line) [eg: break lib/preprocessor.cpp:610]
  • r [run]
  • bt [backtrace]
  • q [quit]
Not as good as the MS debug tools, but it's still far better than not having a debugger at all! Finally, Valgrind is a tool for memory checking. It normally works out of the box, but I had to make some changes for OSX. Luckily Greg Parker already wrote a patch, and Cristian Draghici wrote up a blog post on compiling and installing Valgrind on OSX.

Thursday, April 29, 2010

Build Systems

Managing and building software packages is a big task. There are a few options on what build system you can use. Windows users prefer Visual Studio 'solutions' and 'projects', *nix users prefer Makefiles. Ideally, cross-platform tools are what everyone wants.

Some popular choices include:
  • CMake - a 'cross-platform' make, popular with a lot of projects. (We use it in PAL thanks to Benoit Neil's contributions). My gripe with CMake is its complexity. The fact that there is a complete book to describe it is a scary thought.
  • Premake - a less popular choice (a few large projects use it, eg: ODE; PAL previously used it). It uses Lua script, and it is my personal favourite due to its simplicity whilst still being very powerful. Premake is a bit Windows-oriented, which helped me as I came from a Windows development background.
  • bjam, the boost build system. I've not used this one too much, but it is apparently decent.
  • scons, a Python-based build system. I've not used it much, but it seems quite complex. (Evan Drumwright developed a scons build system for a PAL benchmark tool.)
Premake (4) is quite easy to use; below is a very simple template:
solution "NAME"
 configurations {"Debug", "Release"}
 targetdir "bin"
 language "C++"
 location "build"
 includedirs {"src", "."}

project "NAME"
 kind "ConsoleApp"
 files {"src/*.cpp"}
 links {"lib"}

Simply invoke with "premake4 vs2008" for Windows, or "premake4 gmake" under *nix.
The Lua script lets you easily extend the build to copy DLLs, set up custom include paths for each user, etc.

This should help you get started with cross-platform build solutions!

Friday, February 19, 2010

Profiling on Mac OSX with Saturn

It is always advisable to profile your code before trying to optimize it. It is often most helpful to roll your own profiling (using timers & counters), but a quick and simple tool will often help a lot.
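A "roll your own" scoped timer only takes a few lines of C++ (a sketch using the standard <chrono> clock; the names are my own):

#include <chrono>
#include <cstdio>

// Prints how long its enclosing scope took when the object is destroyed.
struct ScopedTimer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit ScopedTimer(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        printf("%s: %lld us\n", name, static_cast<long long>(us));
    }
};

// usage: { ScopedTimer t("physics step"); step_physics(); }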

For Windows I like the Very Sleepy profiler. On Mac OSX, Apple provides the Saturn profiler (see the Saturn profiler user guide).

To demonstrate we can try compiling a small program:
#include <math.h> 
#include <stdio.h> 

#include <saturn.h> //include the saturn profiler

double func1(double x) { //do some maths
    return sin(x)*cos(x); 
} 

double func2(double x) { //do some more maths
    return pow(x,3); 
} 

int main() { 
    double x=0; 
    double z=0; 
    startSaturn(); //begin profiling
    for (x=0;x<100;x+=0.01) {
        z+=func1(x)+func2(x)*tan(x); 
    } 
    stopSaturn(); //end profiling 
    printf("z:%f\n",z); //make sure compiler doesn't throw our computations away
    return 0; 
} 

Now we need to compile it with profiling support:
g++ x.c -finstrument-functions -lSaturn -m32 -O2

Some common problems include:
  • Undefined symbols: ___cyg_profile_func_enter
    This comes from the -finstrument-functions option: it requires a special hook for each function, which is provided by libSaturn, so you must link with it.
  • ld: warning: in /usr/lib/libSaturn.dylib, file is not of required architecture
    Undefined symbols: _startSaturn
    Again, either you forgot -lSaturn, or you have remembered it but are using a 64-bit OS/chip. Specify '-m32' to force 32-bit mode.

If it all compiled successfully, start Saturn and choose 'Saturn', 'Launch Process'. Then select the executable (eg: a.out) in the dialog box.

Press OK, and Saturn will run and profile your program, and generate a folder with the profiler output data. (eg: Saturn_profile_a.out.000)

You can then select the data file to view the output, and Saturn will display a call graph and the amount of time spent in each function. Now the fun of optimizing can begin. Enjoy!

Monday, February 08, 2010

4K programs

One of my favourite programming tasks is to create a program in under 4k. This is something the demoscene excels at (see the scene.org awards).
Some of my favourites are:
Of course, achieving this seems impossible, but a few tools make it easier. I used many of these tools and tricks to create my 3kb entry to the Global Game Jam.

First of all, you want a dropper/compression tool. Crinkler has eliminated the need for com/cab droppers and gives excellent compression; this little tool has made most of my 4k productions possible. IN4K has plenty of tools and code examples to help you learn the trade, and Iñigo Quilez has an excellent set of beginner projects (not the most optimized ones out there, but still a fantastic starting point). FIT have an excellent set of demoscene resources, including source code to some fantastic 4k intros (plus synthesizers!), and Ryg/Farbrausch has some interesting reads as well.

The next thing you will find when making 4K intros is the lack of maths functions, which you can get around by using intrinsics (/QIfist in MSVC, or something like that) or by writing them yourself in assembly. Then there is getting rid of the C standard library, in particular rand(). This is where linear congruential generators (LCGs) come in handy, and IQ comes in handy again with a 'better, smaller and faster random number generator'. It isn't the fastest or most optimal, but it will do (a minimal LCG sketch is below).

Finally, you will probably want to allocate memory dynamically; on Windows you can simply use GlobalAlloc instead (feel free to override operator new and keep your standard C++ coding style). If you're really looking to crunch size, stick to values that will compress well (ie: powers of 2), but the latest Crinkler can drop floating-point precision for you anyway, so I'm not sure how much you save with this trick these days...
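For reference, a minimal LCG looks like this (a sketch; the constants are the classic Numerical Recipes pair, not IQ's - any full-period constants will do):

// rand() replacement in a handful of bytes: state = state * a + c.
static unsigned int lcg_state = 1;

unsigned int lcg_rand() {
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return lcg_state;
}

// A float in [0,1), built from the high bits (the low bits of an LCG are weak).
float lcg_randf() {
    return (lcg_rand() >> 8) * (1.0f / 16777216.0f);
}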

Thursday, February 04, 2010

Catchup Post

A short collection of interesting links/articles recently:
Finally, a video of the top Tron AI from the Google AI Challenge:

Wednesday, November 25, 2009

Catchup post

It's been a while since I updated; reason being it's exam/assignment marking period, and I had two GPU industry projects due (a 3D groundwater flow fluid simulation, and a pathfinding/travelling salesman with profits project).

The biggest news item was that the ACM decided to start (over)enforcing its rules saying that you cannot link to preprint and author pages. Thankfully, it started a call-to-arms, and prominent pages like Ke-Sen Huang's SIGGRAPH links have been restored. I wonder how many less public pages have silently slipped away. Frankly, I can't wait until the concept of conferences and journals disappears. My websites have always had far more impact than my publications, and it can't be long until the same can be said universally.

A short update with some interesting things in the last while:


Tuesday, October 20, 2009

Timing square root on the GPU

Inspired by the post by Elan Ruskin (Valve) on x86 SQRT routines, I thought I would visit this for my supercomputing platform of choice, the GPU. This kind of low-level trickery I left behind after finishing with RMM/Pixel-Juice some time around 2000, having decided that 3DNow! reciprocal square root routines were more than good enough...

Anyway, a brief overview of how we can do square roots:
  1. Calculate it with the FPU (however that was implemented by the chip manufacturer).
  2. Calculate it with Newton-Raphson iteration. This allows you to control the accuracy of the sqrt (or, typically, the rsqrt). These approaches typically approximate the inverse square root, so this means we need to:
  3. Calculate it from the inverse. This comes in two flavours:
    • Calculate the reciprocal, then invert it (1/rsqrt(x)); this gives you correct results.
    • Multiply it by the input value (x*rsqrt(x)); this gives you faulty results around 0, but saves you a costly divide.
      Note:
      1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f
      0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN
Elan's results indicated the x86 SSE unit's rsqrtss instruction was the fastest (no surprise - it is also a rough estimate), followed by SSE rsqrt with a Newton-Raphson iteration for improvement, then Carmack's Magic Number rsqrt, and finally the x86 FPU's sqrt. Note that many people don't actually get to the slowest point on Elan's scale, since they don't enable intrinsics when compiling, meaning that the C compiler will use the C library sqrt routine, and not the FPU.
I decided to test three routines for the GPU:
  • native sqrt
  • native rsqrt
  • Carmack's rsqrt (sketched below)
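For reference, the "Carmack" routine is the well-known Quake III fast inverse square root: a magic-number bit hack for the initial guess plus one Newton-Raphson step. A C++ sketch (the GPU version in the benchmark follows the same idea):

#include <cstring>

float carmack_rsqrt(float x) {
    float xhalf = 0.5f * x;
    unsigned int i;
    std::memcpy(&i, &x, sizeof(i));      // reinterpret the float's bits
    i = 0x5f3759df - (i >> 1);           // magic initial estimate
    std::memcpy(&x, &i, sizeof(x));
    return x * (1.5f - xhalf * x * x);   // one Newton-Raphson refinement
}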
Now, benchmarking and timing at the lowest level has always been somewhat of a black art (see Ryan Geiss's article on timing), but that is even more true on the GPU - you need to worry about block sizes, as well as the type of timer, etc.
I did my best at generating reliable results by testing block sizes from 2..256 and performing 2.5 million sqrt operations. Here are the results from my nVidia 9800GX2:
Method                  | Total time | Max. ticks per float | Avg. ticks per float | Std. Dev. | Avg. Error
GPU intrinsic SQRT      | 1.285ms    | 5.99                 | 3.99                 | 0.00      | 0.00%
GPU intrinsic RSQRT * x | 1.281ms    | 5.99                 | 3.99                 | 0.00      | 0.00%
Carmack RSQRT * x       | 2.759ms    | 6.28                 | 4.26                 | 0.01      | 0.09%
Total time is the total time measured by the CPU for the GPU to launch the kernel and calculate the results. The clock ticks are meant to be a more accurate measurement using the GPU's internal clock, but I find them to be dubious.
The conclusions to take from these results are simple: Carmack's inverse and other trickery isn't going to help, and using the GPU RSQRT function as opposed to the built-in SQRT function saves you about a clock tick or two (probably because nVidia's SQRT is implemented as 1/RSQRT, as opposed to X*RSQRT).
I'm happy to say, low level optimization tricks are still safely a thing of the past.
You can get the code for the CUDA benchmark here: GPU SQRT snippet.

Thursday, October 15, 2009

AMD OpenCL for GPU

Not one to be left behind by nVidia's news, AMD/ATI have released the AMD OpenCL beta v4, which now supports OpenCL for AMD GPUs! Some highlights:

  • First beta release of ATI Stream SDK with OpenCL GPU support.
  • ATI Stream SDK v2.0 OpenCL is certified OpenCL 1.0 conformant by Khronos.
  • Microsoft Windows 7 and native Microsoft Windows® 64-bit support


Fabulous news! Now you can do OpenCL on OSX and Windows (32 & 64-bit), for nVidia and ATI GPUs and AMD CPUs. It doesn't get any better than this - well, at least until next year when Intel enters the fray.

It's not all good news though: it seems some of AMD's GPUs don't support double precision.

Still, it's better than nVidia's lot, and I'm happy to see AMD finally making a serious effort in this space. (Not that the previous efforts weren't impressive, just not so focused...)

nVidia has also released a new version of Cg, v2.2. I wonder how much OpenCL will replace the use of Cg...

Thursday, October 08, 2009

nVidia: OpenCL , Nexus, Fermi

There's been a fair bit of news flowing out of nVidia, biggest first:
nVidia has released a GPU OpenCL implementation compatible with all devices that support CUDA (no surprise).
You can get the nVidia OpenCL from the nVidia OpenCL Developer Website.

Next in nVidia news, Nexus has been released. I haven't had a chance to try it, but apparently it allows you to debug GPU programs via Microsoft Visual Studio in the 'normal' way - this would certainly make GPU programming a little easier.

Finally, nVidia has released information on Fermi, their next-generation architecture. Basically it seems to be more of the same (which is always good) without all the bad bits for GPGPU programming (even better!). The biggest changes are allowing multiple kernels to execute in parallel, and decent double-precision support. This should really open up scientific & engineering computing to GPGPU, and will probably do good things for getting accelerated raytracing happening. AnandTech has a good write-up on Fermi, although it looks like we will see Larrabee before Fermi...

I had a chance to play with 3x C1060 Tesla boards for the GPU fluid simulation project on a 64-bit machine. This threw up a whole bunch of problems, since I was using MSVC Express Edition, which (apparently) does not support 64-bit. The problem was solved by using the 32-bit CUDA wizard, redirecting the CUDA libraries to the 32-bit versions (ie: c:\CUDA\lib, not C:\CUDA\lib64), and some other tweaks.

Thursday, August 20, 2009

Intel Buys RapidMind


Intel has just bought RapidMind. RapidMind provided a parallel language that could target CPUs (AMD & Intel), GPUs (nVidia & ATI/AMD), the Cell processor (Sony, Toshiba, IBM), and various DSPs.

Now that it becomes an Intel-owned product, I wonder if all that wonderful support for other devices will just disappear. If they primarily want it to target their own Larrabee, then probably yes; but if they want to use it to establish a software platform for the next PlayStation, or as their OpenCL interface, then it might have a good chance of keeping good support for the Cell and perhaps still a few GPUs. I don't see why Intel would want to provide support for their competitors' DSPs, and if Larrabee is meant to be a serious competitor to Cell and GPUs then this might end up being a very disappointing purchase...

Goodbye, RapidMind.

Wednesday, August 05, 2009

ATI Stream SDK includes OpenCL

The new ATI Stream SDK includes a CPU implementation of OpenCL.

AMD released a demo of a 2D fluid simulation using 24 cores.


I have to say, I'm not very impressed. I'd be keen to see how this holds up against an OpenCL GPU implementation; it certainly seems like CUDA has a strong position for now. They do have a good OpenCL introductory tutorial though.

In other news, nVidia released OptiX, a realtime raytracing "engine", whatever that is supposed to mean...

Thursday, July 30, 2009

GPGPU & LLVM

The LLVM group has just got a new logo, a modernized version of the 'dragon book' dragon. But the really exciting news I saw on GPGPU.org is a project called 'GPUocelot'. This program translates compiled PTX programs (produced by nVidia's CUDA) via the just-in-time LLVM compiler to any targeted backend, for example the PS3's Cell processor. All you need to do is add an AMD backend to LLVM and, hey presto, instant CUDA-for-ATI. That could potentially put a dent in OpenCL's plans...

The papers from the High Performance Graphics conference are also out. One that caught my eye was 'Understanding the Efficiency of Ray Traversal on GPUs' - not because I thought the paper was fundamentally groundbreaking, but because it explains a few neat tricks on nVidia's part, in particular a good explanation of persistent threads in CUDA for breaking down non-uniform workloads from a global pool (eg: they used it on a per-ray basis, so that fast and slow rays don't "block" each other).

Tuesday, July 07, 2009

Simple bootloader

Writing your own OS is something every computing person should do at some point, and the first step is of course writing a boot loader. This is actually very easy to do under Linux.

With a fresh install of Ubuntu, make sure you have GCC and related goodies; the other things you will probably want are nasm and a virtual machine like QEMU. Installing these in Ubuntu is as simple as:

sudo apt-get install nasm
sudo apt-get install qemu

(you may need to modify your /etc/apt/sources.list - any core ubuntu mirror will do, I used au.archive.ubuntu.com/ubuntu)

Now you probably want to try QEMU out before going further, so grab a FreeDOS ISO from http://www.freedos.org/freedos/files/ - I got fdbasecd.iso.

Now we create a virtual hard drive to install FreeDOS onto (do this in the same dir as the iso):

qemu-img create -f raw freedos.img 100M


And then we just boot up qemu:

qemu -localtime freedos.img -cdrom fdbasecd.iso -boot d

The FreeDOS install is a bit obtuse - you need to format the drive to FAT16, exit the formatter, then again to FAT32 - but I pretty much just went with the defaults for everything; after all, this is just for testing. At the end, you should have a working DOS prompt.

Now that we have QEMU working, let's try our own boot loader:

[BITS 16] ; 16 bit code generation
[ORG 0x7C00] ; ORGin location is 7C00

;Main program
main: ; Main program label

mov ah,0x0E ; This number is the number of the function in the BIOS to run.
; This is the 'put character on screen' function
mov bh,0x00 ; Page number (I'm not 100% sure of this myself but it is best
; to leave it as zero for most of the work we will be doing)
mov bl,0x07 ; Text attribute (Controls the background and foreground colour
; and possibly some other options)
; 07 = White text, black background.
; (Feel free to play with this value as it shouldn't harm
; anything)
mov al,65 ; This should (in theory) put an ASCII value into al to be
; displayed. (This is not the normal way to do this)
int 0x10 ; Call the BIOS video interrupt.

jmp $ ; Put it into a continuous loop to stop it running off into
; the memory running any junk it may find there.

; End matter
times 510-($-$$) db 0 ; Fill the rest of the sector with zeros
dw 0xAA55 ; Boot signature


Save the code to loader.asm, and assemble it with:

nasm loader.asm


Then, we can run our wonderful loader with:

qemu loader

Thursday, July 02, 2009

Getting started on the PS3

It's been a while since I set up my PS3 with Linux, but I remember it being a lengthy process. Installing Yellow Dog Linux on the PS3 is relatively straightforward, but getting the compiler toolchain up and running took a while, mostly in just figuring out what to do. If you only have RCA out on the PS3 you need to install Linux in text mode, which I remember being a bit dramatic (all the instructions assume you have the GUI environment).

Yellow Dog has a package manager called 'yum' (Yellow Dog Updater, Modified).

After the install do a 'yum update'.
(Note, if you are behind a proxy you will need to set the http_proxy variable. For other networking issues 'ifconfig' up/down and 'dhclient' are your friends.)
export http_proxy="http://username:password@proxy.host.com:port"


Now do a search in yum for any SPU/PPU packages, ie: 'yum search spu', and install the relevant ones (I can't recall which ones; probably libspe2, spu-binutils, spu-gcc, spu-newlib, ppu-binutils). At the end of this you should have spu-gcc, ppu-embedspu, ppu-ar and the usual suspects (gcc/g++).

You have these different compilers because the SPE and the PPE are completely different processors - it's like having two different computers in one box. The spu-gcc compiles code only for the SPE, and the 'normal' gcc compiles code for the PowerPC.

Obviously the first thing to try is 'hello world' for the PPE, but after that a little SPU/PPU program is the thing to try. You should be able to find some code on the web (try the GATECH STI website, or Jeremy's SPE hello world examples), so then you just need to build it:
spu-gcc spu-program.cpp -o spu-program
ppu-embedspu -m32 symbol_name binary_name output-embedded.o
ppu-ar -qcs spulib.a output-embedded.o
g++ ppu-program.cpp -lspe spulib.a -o output

Or, as a concrete example:
spu-gcc spumain.cpp -o spumain
ppu-embedspu -m32 hello_spu spumain hello_spu-embed32.o
ppu-ar -qcs hello_spu.a hello_spu-embed32.o
g++ ppu-program.cpp -lspe hello_spu.a -o helloworld

So in Step 1 we compile the SPE program; in Step 2 we embed it into an object file by providing the symbol name (eg: extern spe_program_handle_t hello_spu), the binary name (the result from spu-gcc), and the final object file.
In Step 3 we create a library from the object, and in Step 4 we link it all together.

The entire 'embedding' thing can be a bit confusing, but Alex Chow provides a great overview.

(Note: if you are working from Windows, PuTTY and WinSCP will make your life a lot easier. If you're new to Linux, try the 'nano' editor (or 'pico'), as opposed to the more powerful, and difficult to master, vim. You can make something executable with 'chmod +x filename'. If you stuff up the console, typing 'reset' will save the day.)

Good luck. You might find this Cell Cheat Sheet useful.