
Sunday, August 05, 2012

Programming links

Well overdue for a catchup post on the non-graphics programming side of things, so here we go:

Thursday, June 07, 2012

GPU Technology Conference 2012

nVidia's GPU Technology Conference is over, and a number of presentation slides have been uploaded. There were quite a few interesting talks relating to graphics, robotics and simulation:
  • Simon Green from nVidia and Christopher Horvath from Pixar presented 'Flame On: Real-Time Fire Simulation for Video Games'. It starts with a recent history of research on CG fluid systems, and gives five tips on better-looking fire: 1. Get the colors right (e.g. a radiation model); 2. Use high-quality advection (not just bilinear filtering); 3. Post-process with glow and motion blur; 4. Add noise; 5. Add light scattering and embers. They then go into more detail on Tip #1, looking at the physics behind the black-body radiation in a fire, and the color spectrum.
  • Elmar Westphal of PGI/JCNS-TA Scientific IT-Systems presented 'Multiparticle Collision Dynamics on one or more GPUs', about multiparticle collision dynamics GPU code. He starts by explaining the overall algorithm, then steps through what performs well on the GPU. Specific GPU optimisations explained include spatial subdivision lists, reordering particles in memory, hash collisions, and finally dividing the workload between multiple GPUs. An interesting read.
  • Michal Januszewski from the University of Silesia in Katowice introduces 'Sailfish: Lattice Boltzmann Fluid Simulations with GPUs and Python'. He explains Lattice Boltzmann fluid simulation and some of the different configurations of lattice connectivity and collision operators, moves into code generation examples, and gives a brief explanation of how the GPU implementation works.
  • Nikos Sismanis, Nikos Pitsianis and Xiaobai Sun (Aristotle University, Duke University) cover 'Efficient k-NN Search Algorithms on GPUs'. It starts with an overview of sorting and k-Nearest Neighbour (kNN) search algorithm solutions, including ANN (approximate NN) and lshkit, and moves into results, including a comparison of thrust::sort with Truncated Bitonic sort. Software is available at http://autogpu.ee.auth.gr/.
  • Thomas True of nVidia explains 'Best Practices in GPU-Based Video Processing' and covers overlapping copy-to-host and copy-to-device operations, and an example of processing Bayer-pattern images.
  • Scott Rostrup, Shweta Srivastava, and Kishore Singhal from Synopsys Inc. explain 'Tree Accumulations on GPU' using parallel scatter, parallel reduce and parallel scan algorithms.
  • Wil Braithwaite from nVidia presents an interesting talk on 'Interacting with Huge Particle Simulations in Maya using the GPU'. He begins with a brief run-through of the workings of the CUDA SPH example, and then moves on to the particle system, including Maya's body forces (uniform, radial, vortex), shape representations (implicit, convex hull, signed distance fields, displacement maps), collision response, SPH equations, and finally data transfer. He ends with a brief overview of rendering the particles in screen space. Neat.
  • David McAllister and James Bigler (nVidia) cover the OptiX internals in 'OptiX Out-of-Core and CPU Rendering', including PTX code generation and optimisation, and converting the OptiX backend to support CPUs via Ocelot and LLVM. An interesting result: LLVM does better at optimising "megafunctions" than small functions - not entirely unexpected given how LLVM works. The presentation finishes with an overview of paging and a tip on bounding volume hierarchies. Good to see Ocelot in the mainstream.
  • Eric Enderton and Morgan McGuire from nVidia explain 'Stochastic Rasterization' (a la 'screen door transparency' rendering) via MSAA for motion blur, depth of field and order-independent transparency, by using a geometry shader to bound the shape and motion of each triangle in screen space, and setting up the MSAA masks. Nice.
  • Cliff Woolley presents 'Profiling and Tuning OpenACC Code' (adding pragmas to C/Fortran code, a la OpenMP) using an example of Jacobi iteration; there were a number of other talks on the topic.
  • Christopher Bergström introduced 'PathScale ENZO', their alternative to CUDA and OpenCL.
  • Phillip Miller from nVidia gave a broad overview of 'GPU Ray Tracing'. He starts with myths and claimed facts about GPU ray tracing, highlights some commercial GPU ray tracers (and the open-source OpenCL LuxRender), and goes into some details that are better explained in the OptiX Out-of-Core presentation.
  • Phillip Miller follows with 'Advanced Rendering Solutions', where he takes a look at nVidia's iray, and where they believe they can introduce new capabilities for design studios and find a middle ground with re-lighting and physically based rendering.
  • Peter Messmer presents 'CUDA Libraries and Ecosystem Overview', where he provides an overview of the performance of the cuBLAS and cuSPARSE linear algebra libraries, moves on to signal processing with cuFFT and NPP/VSIP for image processing, then random numbers via cuRAND, and finally ties things up with Thrust.
  • Jeremie Papon and Alexey Abramov discuss the 'Oculus real-time modular cognitive visual system', including GPU-accelerated stereo disparity matching, likelihood maps and image segmentation with a parallel Metropolis algorithm.
  • Jérôme Graindorge and Julien Houssay from Alyotech present 'Real Time GPU-Based Maritime Scenes Simulation', beginning with ocean simulation and rendering from FFT-based wave simulation using HF and LF heightmap components. They then cover rendering the mesh, scene illumination and tone mapping, and a sneak peek at boat interaction. The ocean simulation video is neat.
  • Dan Negrut from the Simulation-Based Engineering Lab at the University of Wisconsin–Madison gives an overview of the lab's multibody dynamics work in 'From Sand Dynamics to Tank Dynamics', including friction, compliant bodies, multi-physics (fluid/solid interactions), SPH, a GPU solution to the cone complementarity problem, ellipsoid-ellipsoid CCD, multi-CPU simulation, and finally vehicle track simulation in sand. Wow. Code is available on the Simulation-Based Engineering Lab website.
  • Max Rietmann of USI Lugano looks at seismology (earthquake simulation) in 'Faster Finite Elements for Wave Propagation Codes' and describes parallelising FEM methods for GPUs in SPECFEM3D.
  • Dustin Franklin from GE introduces GE's MilSpec ruggedised Kepler-based GPU solutions and Concurrent Redhawk6 in 'Sensor Processing with Rugged Kepler GPUs'. He looks at some example applications, including hyperspectral imaging, mosaicing, 360-degree vision, synthetic aperture radar processing, and space-time adaptive processing for moving target identification.
  • Graham Sanborn of FunctionBay presents 'Particle Dynamics with MBD and FEA Using CUDA', giving a brief overview of their combined CPU/GPU multi-body FEA system and describing the contact, contact force, and integration steps.
  • Ritesh Patel and Jason Mak of University of California-Davis cover the Burrows-Wheeler Transform, Move-to-Front Transform and Huffman Coding in 'Lossless Data Compression on GPUs'. They find merge sort for BWT performs best on the GPU, explain the parallel MTF transform and Huffman coding in illustrative detail, and tie things up with benchmarks; unfortunately the GPU is 2.78x slower than the CPU.
  • Nikolai Sakharnykh and Nikolay Markovskiy from NVIDIA provide an in-depth explanation of their GPU implementation of solving ADI with tridiagonal systems in '3D ADI Method for Fluid Simulation on Multiple GPUs'.
  • Enrico Mastrostefano, Massimo Bernaschi, and Massimiliano Fatica investigate breadth-first search in 'Large Graph on multi-GPUs' and describe how best to parallelise it across multiple GPUs by using adjacency lists and level frontiers to minimise the data exchange.
  • Bob Zigon from Beckman Coulter presents '1024 bit Parallel Rational Arithmetic Operators for the GPU', covering exact 1024-bit rational arithmetic (add, sub, mul, div) for the GPU. Get the 1024-bit arithmetic code here.
  • Roman Sokolov and Andrei Tchouprakov of D4D Technologies discuss 'Warped parallel nearest neighbor searches using kd-trees', where they take a SIMD-style approach by grouping tree searches via voting (ballot).
  • David Luebke from nVidia takes a broad look at CG in 'Computational Graphics: An Overview of Graphics Research @ NVIDIA' and provides an overview of research featured in a number of previous talks and other GTC talks, including edge-aware shading, ambient occlusion via volumes and raycasting, stochastic rendering, improved image sampling and reconstruction, global illumination, and CUDA-based rasterization.
  • Johanna Beyer and Markus Hadwiger from King Abdullah University of Science and Technology discuss 'Terascale Volume Visualization in Neuroscience', where each cubic mm of the brain scanned with an electron microscope generates 800 terabytes of data. The idea here is to leverage the virtual memory manager to do all the intelligent caching work, rather than a specialised spatial data structure for the volume rendering.
  • Mark Kilgard introduces the NV_path_rendering extension in 'GPU-Accelerated Path Rendering', and demonstrates using the GPU to render PDF, flash, clipart, etc. Contains some sample code.
  • Janusz Będkowski from the Warsaw University of Technology presented 'Parallel Computing In Mobile Robotics For RISE', a full GPGPU solution for processing mobile robot laser scan data through to navigation. He starts with data registration into a decomposed grid, which is then used for scan matching with point-to-point Iterative Closest Point. Next is estimating surface normals using principal component analysis, demonstrated on Velodyne datasets. This is used to achieve point-to-plane ICP, and he demonstrates a 6D SLAM loop closure. He finishes it all off with a simple gradient-based GPU path planner.
Note that in recent days more presentation PDFs have been uploaded, so there is still plenty to look through - take a look yourself! I'll leave you with a video from the GTC 2012 keynote on rendering colliding galaxies:

Friday, July 29, 2011

Half year catchup on Graphics, GPUs, Compilers, etc.

Another slow month on the blog. More than half way through the year, so it's time to catch up on the backlog of news. Only covering graphics, games, physics, GPGPU and compilers. Expect a number of posts on robotics soon!
Finally, here is the SIGGRAPH 2011 technical papers highlights video, it contains a number of interesting advances in physics simulations and modeling.

Thursday, March 31, 2011

March update

I didn't want to have a month without a post, so here are some links for the month:
The pick of the month has to be the new Festo production, the SmartBird ornithopter. (Official site has a PDF spec-sheet). Video below:

Wednesday, January 26, 2011

Executable compression

Packing an executable to a smaller size is a common task, and the compression algorithms have a lot in common with good text compression systems. Most good executable compressors have some method of interpreting, filtering or transforming the program instructions into a more compressible format. The most well-known compression algorithms are:
  • Run Length Encoding - where runs of repeated data are stored in a compressed form. For example, the word "beekeeper" may be represented as "b(2x e)k(2x e)per", which with an appropriate coding results in a reduction in size. This kind of compression works well on 8-bit images and is found in image formats such as BMP. (A minimal encoder sketch appears after this list.)
  • Huffman compression - each code is represented with a number of bits inversely proportional to its frequency of occurrence, so frequent symbols get short codes. For example, we could represent vowels with shorter bit sequences and consonants with longer ones; if "e" is represented by "1" and every other character by a "0" followed by the usual 8 bits, then "beekeeper" becomes "b11k11p1r" - 4x 9 bits plus 5x 1 bit, totalling 41 bits rather than 72 bits.
  • Lempel-Ziv (LZ) - a dictionary of all previous data is kept and references are made back into the history of the data. For example, "beekeeper" could be represented as "be(-1,1)k(-3,2)p(-2,1)r", which translates to "be", then the character one back (-1) copied for one (1) byte, giving "bee"; then "k"; then the sequence three characters back (-3) copied for two (2) bytes, giving "beekee"; and so on.
  • Arithmetic coding - similar to Huffman coding (and explained very concisely by Charles Bloom); in essence it also assigns bits according to the probability of occurrence (as a "floating point" number), and then re-scales/encodes the probabilities to fixed-point. Arturo ("Dario Phong") Campos has a concise explanation.
  • PAQ - this is generally considered the state-of-the-art compression algorithm. It combines the predictions of many conditional probabilistic models to approach optimal compression.
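As a toy illustration of the first of these, here is a minimal run-length encoder in C++ (a sketch only - real RLE formats need an escape mechanism so literal digits in the input don't collide with the run counts):

#include <cstdio>
#include <string>

std::string rle_encode(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size();) {
        size_t run = 1;                  // measure the current run of in[i]
        while (i + run < in.size() && in[i + run] == in[i]) ++run;
        if (run > 1) out += std::to_string(run);
        out += in[i];                    // emit (count)(symbol)
        i += run;
    }
    return out;
}

int main() {
    printf("%s\n", rle_encode("beekeeper").c_str()); // prints "b2ek2eper"
    return 0;
}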
There is plenty of source code around the web for these algorithms, but the Basic Compression Library provides routines for RLE (Run Length Encoding), Huffman, Rice, Lempel-Ziv (LZ77) and Shannon-Fano compression, and Matt Mahoney maintains the PAQ algorithms. There are also quite a few general executable compressors available, including UPX and Mpress, and both zip and 7zip can generate self-extracting executables. Microsoft CAB also performs very well (see trimmit for Mac). But the state of the art in self-extracting executable compressors for small programs are Crinkler and kkrunchy, for <4kb and <128kb programs respectively. (A good resource for various tools is available at in4k.)

These compressors use a standard compression algorithm but also transform the input to a more compressible format. The most common transform is the Burrows-Wheeler Transform (BWT), which attempts to re-order data so that similar codes are close to each other, increasing the data's compressibility.
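Ignoring the end-of-string marker and the suffix-array construction a real implementation would use, the transform itself is tiny - a naive C++ sketch:

#include <algorithm>
#include <string>
#include <vector>

// Naive O(n^2 log n) BWT: sort all rotations of the input and keep the
// last column, which clusters characters with similar contexts together.
std::string bwt_naive(const std::string& s) {
    std::vector<std::string> rotations;
    for (size_t i = 0; i < s.size(); ++i)
        rotations.push_back(s.substr(i) + s.substr(0, i));
    std::sort(rotations.begin(), rotations.end());
    std::string last_column;
    for (const std::string& r : rotations)
        last_column += r.back();
    return last_column;
}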

For executable compression, patterns in the instruction stream can be leveraged to increase the compression ratio. For example, instructions often address similar areas of memory, so simple delta-encoding of the address values can greatly increase the compression ratio.
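As a sketch of the delta-encoding idea (not how any particular packer implements it):

#include <cstdint>
#include <vector>

// Replace each address with the difference from its predecessor; nearby
// addresses become small, repetitive values that compress much better.
void delta_encode(std::vector<uint32_t>& v) {
    for (size_t i = v.size(); i-- > 1;)
        v[i] -= v[i - 1];
}

// The inverse transform, applied after decompression.
void delta_decode(std::vector<uint32_t>& v) {
    for (size_t i = 1; i < v.size(); ++i)
        v[i] += v[i - 1];
}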

Recently, both kkrunchy and Crinkler were described in detail. The author of kkrunchy, Fabian "ryg" Giesen, wrote a great blog post describing how kkrunchy works; since kkrunchy works with larger file sizes, some more intelligent tricks can be applied. Alexandre Mutel decomposed the Crinkler compressor in his blog post and posted some fabulous pictures of before-vs-after entropy of a compressed executable. Unfortunately, both of these programs target Windows PE files, and so don't work on other platforms (see here, here, here and here for tips on PE).

The usual approach to exe compression on Linux/OSX is just to do the modern-day version of com/cab dropping: compress your program with gzip and have a script extract and execute it. This lacks a bit of flair, but works. Articles describing ELF and Mach-O are available by Brian Raiter and Amit Singh respectively, but unfortunately they are a little out of date.

A few tools have been written for Linux and OSX, including the byte-optimized linker (x86/64) and obj stripping for Linux, and iPakk and muncho for OSX (again, a bit dated). An older ELF stripper is also around, and Timo Wiren gives a good overview too. Unfortunately, a similar page on OSX is no longer available online (maybe one day). All in all, I've found your best bet on a platform other than Windows is to follow Timo Wiren's first step of advice and do a simple dropper:
  1. Write your program, and compile it: g++ helloworld.cpp
    #include <stdio.h>

    int main() {
        printf("hello world!\n");
        return 0;
    }
    
  2. Optional: Strip it
    strip a.out
  3. Compress it:
    gzip a.out
  4. Create the unpack script (unpack.header):
    a=/tmp/I;tail -n+2 $0|zcat>$a;chmod +x $a;$a;rm $a; exit
    
  5. Combine the script and zip into a final executable:
    cat unpack.header a.out.gz > program
    chmod +x program
    
  6. Enjoy!

My a.out went from 8768 to 1153 bytes. Nice. (Using strip saves me 160 bytes on the original and 136 bytes on the final, but I'm sure you can save more by being clever about use of the gcc crt, gcc flags and throwing a few tools at this process).

Tuesday, September 14, 2010

Intel Compiler on Linux

  • Make sure you have access to root. (eg: sudo su root, passwd)
  • Make sure you have your build environment set up. eg:

    sudo apt-get install gcc
    sudo apt-get install build-essential
    sudo apt-get install g++
    sudo apt-get install rpm
    
  • ICC requires an older C++ standard library, libstdc++5; you can get it for Ubuntu from:
    Package: libstdc++5. You just need to download the i386 package (eg from the Australian mirror, using wget [url]), and then install it, eg: "sudo dpkg -i libstdc++5_3.3.6-18_i386.deb".
    If you don't do this you may get the following error messages:

    Missing critical pre-requisite
    -- missing system commands
    
    The following required for installation commands are missing:
    libstdc++.so.5 ( library)
    
  • Download ICC
  • Unzip (tar -xvf filename)
  • ./install.sh
  • Choose to install for all (1)
  • Read the "welcome" message and follow the instructions for the licence, etc.
  • Activate your software (1) and provide the serial number that came in your email. You should see "Activation Completed Successfully"
  • Install the software (requires 3.35 GB). You should see a message "Checking the prerequisites. It can take several minutes. Please wait…".
  • You might see:

    Missing optional pre-requisite
    -- No compatible Java* Runtime Environment (JRE) found
    -- operating system type is not supported.
    -- system glibc or kernel version not supported or not detectable
    -- binutils version not supported or not detectable
    
    The JRE is only needed for the visual debugger; otherwise you can safely continue.
  • The installer then asks which components to install, eg "Intel(R) C++ Compiler Professional Edition for Linux*", just press "1. Install" to continue. It should state "component installed successfully."
  • Set up the paths: you can use iccvars.sh in /opt/intel/Compiler/11.1/073/bin to set everything up for you (eg: source /opt/intel/Compiler/11.1/073/bin/iccvars.sh ia32). You may wish to put this in your .bashrc file.
  • That's it! Type "icc" to invoke for C files or "icpc" for C++ files. For standard makefiles use "make CXX=icpc"
On a 1GHz VIA Esther processor, GCC 4.4.1, with -O2 -march=c3-2:
real 0m16.855s
user 0m16.849s
sys 0m0.004s
And, with ICC 11.1, with -O2:
real 0m11.369s
user 0m11.361s
sys 0m0.008s

An instant 48% speedup! Unfortunately not all code shows such a big change; some other compute-intensive code I tested only got a 4% speedup. In any case, I'd say it's worth it!

Monday, August 02, 2010

C++ debugging tools: Gdb, CppCheck, Valgrind

One of the most important parts of programming is the debugging phase and, during development, the testing phase. Some of the tools that can be used under OSX are the standard Linux tools - but of course, there are always a few tweaks.

CppCheck is a static analysis tool; it can find formatting issues, unused or sub-optimally used variables, pointer/memory issues, etc.

The install procedure for CppCheck is:
git clone git://github.com/danmar/cppcheck.git
cd cppcheck
make
sudo make install

For the optional HTML report you also need Pygments (http://pypi.python.org/pypi/Pygments):
sudo easy_install Pygments-1.3.1-py2.6.egg
cd htmlreport
sudo ./setup.py install


Note: I had to check out v1.44 for it to work with OSX, so after the git clone command add the following:
git checkout 1.44
git checkout -b 144

Here is an example shell script to run cpp-check (thanks John):
#!/bin/bash
cppcheck --enable=all --xml --verbose -f -I math -I boost/boost -I src/framework src 2> err.xml
cppcheck-htmlreport --file err.xml --report-dir="CodeAnalysisReport" --title="My Project"
Just alter it to match your own include directories and source.

If you need to do any runtime debugging, GDB is the standard tool. It can be quite complex, but here is its basic usage:
  • gdb (program name) [this will start GDB with your program loaded]
  • set args (arguments) [this will set the command line parameters to your program]
  • break (function or file:line) [eg: break lib/preprocessor.cpp:610]
  • r [run]
  • bt [backtrace]
  • q [quit]
Not as good as the MS debug tools, but it's still far better than not having a debugger at all! Finally, Valgrind is a tool for memory checking. It normally works out of the box, but I had to make some changes for OSX. Luckily Greg Parker already wrote a patch, and Cristian Draghici wrote up a blog post on compiling and installing Valgrind on OSX.

Thursday, April 29, 2010

Build Systems

Managing and building software packages is a big task. There are a few options on what build system you can use. Windows users prefer Visual Studio 'solutions' and 'projects', *nix users prefer Makefiles. Ideally, cross-platform tools are what everyone wants.

Some popular choices include:
  • CMake - a 'cross-platform' make, popular with a lot of projects. (We use it in PAL thanks to Benoit Neil's contributions). My gripe with CMake is its complexity. The fact that there is a complete book to describe it is a scary thought.
  • Premake - a less popular choice (a few large projects use it, eg: ODE; PAL previously used it). It uses Lua script, and it is my personal favourite due to its simplicity whilst still being very powerful. Premake is a bit Windows-oriented, which helped me as I came from a Windows development background.
  • bjam, the boost build system. I've not used this one too much, but it is apparently decent.
  • scons, a Python-based build system. I've not used it much, but it seems quite complex. (Evan Drumwright developed a scons build system for a PAL benchmark tool.)
Premake (4) is quite easy to use; below is a very simple template:
solution "NAME"
 configurations {"Debug", "Release"}
 targetdir "bin"
 language "C++"
 location "build"
 includedirs {"src", "."}

project "NAME"
 kind "ConsoleApp"
 files {"src/*.cpp"}
 links {"lib"}

Simply invoke with "premake4 vs2008" for Windows, or "premake4 gmake" under *nix.
The Lua script lets you easily extend the build to copy DLLs, set up custom include paths for each user, etc.

This should help you get started with cross-platform build solutions!

Friday, February 19, 2010

Profiling on Mac OSX with Saturn

It is always advisable to profile your code before trying to optimize it. It is often most helpful to roll your own profiling (using timers & counters), but a quick and simple tool will often help a lot.
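A "roll your own" scoped timer only takes a few lines of C++ (a sketch using the standard <chrono> clock; the names are my own):

#include <chrono>
#include <cstdio>

// Prints how long its enclosing scope took when the object is destroyed.
struct ScopedTimer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit ScopedTimer(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        printf("%s: %lld us\n", name, static_cast<long long>(us));
    }
};

// usage: { ScopedTimer t("physics step"); step_physics(); }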

For Windows I like the Very Sleepy profiler. On Mac OSX, Apple provides the Saturn profiler (see the Saturn profiler user guide).

To demonstrate we can try compiling a small program:
#include <math.h> 
#include <stdio.h> 

#include <saturn.h> //include the saturn profiler

double func1(double x) { //do some maths
    return sin(x)*cos(x); 
} 

double func2(double x) { //do some more maths
    return pow(x,3); 
} 

int main() { 
    double x=0; 
    double z=0; 
    startSaturn(); //begin profiling
    for (x=0;x<100;x+=0.01) {
        z+=func1(x)+func2(x)*tan(x); 
    } 
    stopSaturn(); //end profiling 
    printf("z:%f\n",z); //make sure compiler doesn't throw our computations away
    return 0; 
} 

Now we need to compile it with profiling support:
g++ x.c -finstrument-functions -lSaturn -m32 -O2

Some common problems include:
  • Undefined symbols: ___cyg_profile_func_enter
    This comes from the -finstrument-functions option: it requires a special hook for each function, which is provided by libSaturn, so you must link with it.
  • ld: warning: in /usr/lib/libSaturn.dylib, file is not of required architecture
    Undefined symbols: _startSaturn
    Again, either you forgot -lSaturn, or you have remembered it but are using a 64-bit OS/chip. Specify '-m32' to force 32-bit mode.

If it all compiled successfully, start Saturn and choose 'Saturn', 'Launch Process'. Then select the executable (eg: a.out) in the dialog box.

Press OK, and Saturn will run and profile your program, and generate a folder with the profiler output data. (eg: Saturn_profile_a.out.000)

You can then select the data file to view the output, and Saturn will display a call graph and the amount of time spent in each function. Now the fun of optimizing can begin. Enjoy!

Monday, February 08, 2010

4K programs

One of my favourite programming tasks is to create a program in under 4k. This is something the demoscene excels at (see the scene.org awards).
Some of my favourites are:
Of course, achieving this seems impossible, but a few tools make it easier. I used many of these tools and tricks to create my 3kb entry to the Global Game Jam.

First of all, you want a dropper/compression tool. Crinkler has eliminated the need for com/cab droppers and gives excellent compression; this little tool has made most of my 4k productions possible. IN4K has plenty of tools and code examples to help you learn the trade, and Iñigo Quilez has an excellent set of beginner projects (not the most optimized ones out there, but still a fantastic starting point). FIT have an excellent set of demoscene resources, including source code to some fantastic 4k intros (plus synthesizers!), and Ryg/Farbrausch has some interesting reads as well.

The next thing you will find when making 4K intros is the lack of maths functions, which you can get around by using intrinsics (/QIfist in MSVC, or something like that) or by writing them yourself in assembly. Then there is getting rid of the C standard library, in particular rand(). This is where linear congruential generators (LCGs) come in handy, and IQ comes in handy again with a 'better, smaller and faster random number generator'. It isn't the fastest or most optimal, but it will do (a minimal LCG sketch is below).

Finally, you will probably want to allocate memory dynamically; on Windows you can simply use GlobalAlloc instead (feel free to override operator new and keep your standard C++ coding style). If you're really looking to crunch size, stick to values that will compress well (ie: powers of 2), but the latest Crinkler can drop floating-point precision for you anyway, so I'm not sure how much you save with this trick these days...
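For reference, a minimal LCG looks like this (a sketch; the constants are the classic Numerical Recipes pair, not IQ's - any full-period constants will do):

// rand() replacement in a handful of bytes: state = state * a + c.
static unsigned int lcg_state = 1;

unsigned int lcg_rand() {
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return lcg_state;
}

// A float in [0,1), built from the high bits (the low bits of an LCG are weak).
float lcg_randf() {
    return (lcg_rand() >> 8) * (1.0f / 16777216.0f);
}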

Thursday, February 04, 2010

Catchup Post

A short collection of interesting links/articles recently:
Finally, a video of the top Tron AI from the Google AI Challenge:

Wednesday, November 25, 2009

Catchup post

It's been a while since I updated; reason being it's exam/assignment marking period, and I had two GPU industry projects due (a 3D groundwater flow fluid simulation, and a pathfinding/travelling salesman with profits project).

The biggest news item was that the ACM decided to start (over)enforcing its rules saying that you cannot link to preprint and author pages. Thankfully, it started a call-to-arms, and prominent pages like Ke-Sen Huang's SIGGRAPH links have been restored. I wonder how many less public pages have silently slipped away. Frankly, I can't wait until the concept of conferences and journals disappears. My websites have always had far more impact than my publications, and it can't be long until the same can be said universally.

A short update with some interesting things in the last while:


Tuesday, October 20, 2009

Timing square root on the GPU

Inspired by the post by Elan Ruskin (Valve) on x86 SQRT routines, I thought I would visit this for my supercomputing platform of choice, the GPU. This kind of low-level trickery I left behind after finishing with RMM/Pixel-Juice some time around 2000, having decided that 3DNow! reciprocal square root routines were more than good enough...

Anyway, a brief overview of how we can do square roots:
  1. Calculate it with the FPU (however that was implemented by the chip manufacturer).
  2. Calculate it with Newton-Raphson iteration. This allows you to control the accuracy of the sqrt (or, typically, the rsqrt). These approaches typically approximate the inverse square root, so this means we need to:
  3. Calculate it from the inverse. This comes in two flavours:
    • Calculate the reciprocal, then invert it (1/rsqrt(x)); this gives you correct results.
    • Multiply it by the input value (x*rsqrt(x)); this gives you faulty results around 0, but saves you a costly divide.
      Note:
      1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f
      0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN
Elan's results indicated the x86 SSE unit's rsqrtss instruction was the fastest (no surprise - it is also a rough estimate), followed by SSE rsqrt with a Newton-Raphson iteration for improvement, then Carmack's Magic Number rsqrt, and finally the x86 FPU's sqrt. Note that many people don't actually get to the slowest point on Elan's scale, since they don't enable intrinsics when compiling, meaning that the C compiler will use the C library sqrt routine, and not the FPU.
I decided to test three routines for the GPU:
  • native sqrt
  • native rsqrt
  • Carmack's rsqrt (sketched below)
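For reference, the "Carmack" routine is the well-known Quake III fast inverse square root: a magic-number bit hack for the initial guess plus one Newton-Raphson step. A C++ sketch (the GPU version in the benchmark follows the same idea):

#include <cstring>

float carmack_rsqrt(float x) {
    float xhalf = 0.5f * x;
    unsigned int i;
    std::memcpy(&i, &x, sizeof(i));      // reinterpret the float's bits
    i = 0x5f3759df - (i >> 1);           // magic initial estimate
    std::memcpy(&x, &i, sizeof(x));
    return x * (1.5f - xhalf * x * x);   // one Newton-Raphson refinement
}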
Now, benchmarking and timing at the lowest level has always been somewhat of a black art (see Ryan Geiss's article on timing), but that is even more true on the GPU - you need to worry about block sizes, as well as the type of timer, etc.
I did my best at generating reliable results by testing block sizes from 2..256 and performing 2.5 million sqrt operations. Here are the results from my nVidia 9800GX2:
Method                  | Total time | Max. ticks per float | Avg. ticks per float | Std. Dev. | Avg. Error
GPU intrinsic SQRT      | 1.285ms    | 5.99                 | 3.99                 | 0.00      | 0.00%
GPU intrinsic RSQRT * x | 1.281ms    | 5.99                 | 3.99                 | 0.00      | 0.00%
Carmack RSQRT * x       | 2.759ms    | 6.28                 | 4.26                 | 0.01      | 0.09%
Total time is the total time measured by the CPU for the GPU to launch the kernel and calculate the results. The clock ticks are meant to be a more accurate measurement using the GPU's internal clock, but I find them to be dubious.
The conclusions to take from these results are simple: Carmack's inverse and other trickery isn't going to help, and using the GPU RSQRT function as opposed to the built-in SQRT function saves you about a clock tick or two (probably because nVidia's SQRT is implemented as 1/RSQRT, as opposed to X*RSQRT).
I'm happy to say, low level optimization tricks are still safely a thing of the past.
You can get the code for the CUDA benchmark here: GPU SQRT snippet.

Thursday, October 15, 2009

AMD OpenCL for GPU

Not one to be left behind by nVidia's news, AMD/ATI have released the AMD OpenCL beta v4, which now supports OpenCL for AMD GPUs! Some highlights:

  • First beta release of ATI Stream SDK with OpenCL GPU support.
  • ATI Stream SDK v2.0 OpenCL is certified OpenCL 1.0 conformant by Khronos.
  • Microsoft Windows 7 and native Microsoft Windows® 64-bit support


Fabulous news! Now you can do OpenCL on OSX and Windows (32 & 64-bit), for nVidia and ATI GPUs and AMD CPUs. It doesn't get any better than this - well, at least until next year when Intel enters the fray.

It's not all good news though: it seems some of AMD's GPUs don't support double precision.

Still, it's better than nVidia's lot, and I'm happy to see AMD finally making a serious effort in this space. (Not that the previous efforts weren't impressive, just not so focused...)

nVidia has also released a new version of Cg, v2.2. I wonder how much OpenCL will replace the use of Cg...

Thursday, October 08, 2009

nVidia: OpenCL , Nexus, Fermi

There's been a fair bit of news flowing out of nVidia, biggest first:
nVidia has released a GPU OpenCL implementation compatible with all devices that support CUDA (no surprise).
You can get the nVidia OpenCL from the nVidia OpenCL Developer Website.

Next in nVidia news, Nexus has been released. I haven't had a chance to try it, but apparently it allows you to debug GPU programs via Microsoft Visual Studio in the 'normal' way - this would certainly make GPU programming a little easier.

Finally, nVidia has released information on Fermi, their next-generation architecture. Basically it seems to be more of the same (which is always good) without all the bad bits for GPGPU programming (even better!). The biggest changes are allowing multiple kernels to execute in parallel, and decent double-precision support. This should really open up scientific & engineering computing to GPGPU, and will probably do good things for getting accelerated raytracing happening. AnandTech has a good write-up on Fermi, although it looks like we will see Larrabee before Fermi...

I had a chance to play with 3x C1060 Tesla boards for the GPU fluid simulation project on a 64-bit machine. This threw up a whole bunch of problems, since I was using MSVC Express Edition, which (apparently) does not support 64-bit. The problem was solved by using the 32-bit CUDA wizard, redirecting the CUDA libraries to the 32-bit versions (ie: c:\CUDA\lib, not C:\CUDA\lib64), and some other tweaks.

Thursday, August 20, 2009

Intel Buys RapidMind


Intel has just bought RapidMind. RapidMind provided a parallel language that could target CPUs (AMD & Intel), GPUs (nVidia & ATI/AMD), the Cell processor (Sony, Toshiba, IBM), and various DSPs.

Now that it becomes an Intel-owned product, I wonder if all that wonderful support for other devices will just disappear. If they primarily want it to target their own Larrabee, then probably yes; but if they want to use it to establish a software platform for the next PlayStation, or as their OpenCL interface, then it might have a good chance of keeping good support for the Cell and perhaps still a few GPUs. I don't see why Intel would want to provide support for their competitors' DSPs, and if Larrabee is meant to be a serious competitor to Cell and GPUs then this might end up being a very disappointing purchase...

Goodbye, RapidMind.

Wednesday, August 05, 2009

ATI Stream SDK includes OpenCL

The new ATI Stream SDK includes a CPU implementation of OpenCL.

AMD released a demo of a 2D fluid simulation using 24 cores.


I have to say, I'm not very impressed. I'd be keen to see how this holds up against an OpenCL GPU implementation; it certainly seems like CUDA has a strong position for now. They do have a good OpenCL introductory tutorial though.

In other news, nVidia released OptiX, a realtime raytracing "engine", whatever that is supposed to mean...

Thursday, July 30, 2009

GPGPU & LLVM

The LLVM group has just got a new logo, a modernized version of the 'dragon book' dragon. But the really exciting news I saw on GPGPU.org is a project called 'GPUocelot'. This program translates compiled PTX programs (produced by nVidia's CUDA) via the just-in-time LLVM compiler to any targeted backend, for example the PS3's Cell processor. All you need to do is add an AMD backend to LLVM and, hey presto, instant CUDA-for-ATI. That could potentially put a dent in OpenCL's plans...

The papers from the High Performance Graphics conference are also out. One that caught my eye was 'Understanding the Efficiency of Ray Traversal on GPUs' - not because I thought the paper was fundamentally groundbreaking, but because it explains a few neat tricks on nVidia's part, in particular a good explanation of persistent threads in CUDA for breaking down non-uniform workloads from a global pool (eg: they used it on a per-ray basis, so that fast and slow rays don't "block" each other).

Tuesday, July 07, 2009

Simple bootloader

Writing your own OS is something every computing person should do at some point, and the first step is of course writing a boot loader. This is actually very easy to do under Linux.

With a fresh install of Ubuntu, make sure you have GCC and related goodies; the other things you will probably want are nasm and a virtual machine like QEMU. Installing these in Ubuntu is as simple as:

sudo apt-get install nasm
sudo apt-get install qemu

(you may need to modify your /etc/apt/sources.list - any core ubuntu mirror will do, I used au.archive.ubuntu.com/ubuntu)

Now you probably want to try QEMU out before going further, so grab a FreeDOS ISO from http://www.freedos.org/freedos/files/ - I got fdbasecd.iso.

Now we create a virtual hard drive to install FreeDOS onto (do this in the same dir as the iso):

qemu-img create -f raw freedos.img 100M


And then we just boot up qemu:

qemu -localtime freedos.img -cdrom fdbasecd.iso -boot d

The FreeDOS install is a bit obtuse - you need to format the drive to FAT16, exit the formatter, then again to FAT32 - but I pretty much just went with the defaults for everything; after all, this is just for testing. At the end, you should have a working DOS prompt.

Now that we have QEMU working, let's try our own boot loader:

[BITS 16] ; 16 bit code generation
[ORG 0x7C00] ; ORGin location is 7C00

;Main program
main: ; Main program label

mov ah,0x0E ; This number is the number of the function in the BIOS to run.
; This is the 'put character on screen' function
mov bh,0x00 ; Page number (I'm not 100% sure of this myself but it is best
; to leave it as zero for most of the work we will be doing)
mov bl,0x07 ; Text attribute (Controls the background and foreground colour
; and possibly some other options)
; 07 = White text, black background.
; (Feel free to play with this value as it shouldn't harm
; anything)
mov al,65 ; This should (in theory) put an ASCII value into al to be
; displayed. (This is not the normal way to do this)
int 0x10 ; Call the BIOS video interrupt.

jmp $ ; Put it into a continuous loop to stop it running off into
; the memory running any junk it may find there.

; End matter
times 510-($-$$) db 0 ; Fill the rest of the sector with zeros
dw 0xAA55 ; Boot signature


Save the code to loader.asm, and assemble it with:

nasm loader.asm


Then, we can run our wonderful loader with:

qemu loader

Thursday, July 02, 2009

Getting started on the PS3

It's been a while since I set up my PS3 with Linux, but I remember it being a lengthy process. Installing Yellow Dog Linux on the PS3 is relatively straightforward, but getting the compiler toolchain up and running took a while, mostly in just figuring out what to do. If you only have RCA out on the PS3 you need to install Linux in text mode, which I remember being a bit dramatic (all the instructions assume you have the GUI environment).

Yellow Dog has a package manager called 'yum' (Yellow Dog Updater, Modified).

After the install do a 'yum update'.
(Note, if you are behind a proxy you will need to set the http_proxy variable. For other networking issues 'ifconfig' up/down and 'dhclient' are your friends.)
export http_proxy="http://username:password@proxy.host.com:port"


Now do a search in yum for any SPU/PPU packages, ie: 'yum search spu', and install the relevant ones (I can't recall which ones; probably libspe2, spu-binutils, spu-gcc, spu-newlib, ppu-binutils). At the end of this you should have spu-gcc, ppu-embedspu, ppu-ar and the usual suspects (gcc/g++).

You have these different compilers because the SPE and the PPE are completely different processors - it's like having two different computers in one box. The spu-gcc compiles code only for the SPE, and the 'normal' gcc compiles code for the PowerPC.

Obviously the first thing to try is 'hello world' for the PPE, but after that a little SPU/PPU program is the thing to try. You should be able to find some code on the web (try the GATECH STI website, or Jeremy's SPE hello world examples), so then you just need to build it:
spu-gcc spu-program.cpp -o spu-program
ppu-embedspu -m32 symbol_name binary_name output-embedded.o
ppu-ar -qcs spulib.a output-embedded.o
g++ ppu-program.cpp -lspe spulib.a -o output

Or, as a concrete example:
spu-gcc spumain.cpp -o spumain
ppu-embedspu -m32 hello_spu spumain hello_spu-embed32.o
ppu-ar -qcs hello_spu.a hello_spu-embed32.o
g++ ppu-program.cpp -lspe hello_spu.a -o helloworld

So in Step 1 we compile the SPE program; in Step 2 we embed it into an object file by providing the symbol name (eg: extern spe_program_handle_t hello_spu), the binary name (the result from spu-gcc), and the final object file.
In Step 3 we create a library from the object, and in Step 4 we link it all together.

The entire 'embedding' thing can be a bit confusing, but Alex Chow provides a great overview.

(Note: if you are working from Windows, PuTTY and WinSCP will make your life a lot easier. If you're new to Linux, try the 'nano' editor (or 'pico'), as opposed to the more powerful, and difficult to master, vim. You can make something executable with 'chmod +x filename'. If you stuff up the console, typing 'reset' will save the day.)

Good luck. You might find this Cell Cheat Sheet useful.