In order to provide the CUDA developer with a low level programming
language without exposing any of the underlying instruction set, NVIDIA
have given us, the developers, the option to program in PTX ( Parallel
Thread eXecution ). The PTX being somewhat similar to “assembly code” in
structure opens up a new set of features to the developer, which in
certain cases might be useful to take advantage of. One case which I use
a lot in my daily work is the ability to internally time blocks of code
within a thread using the %clock register( Somewhat like the Time Stamp
Counter on x86 ), which is not exposed through the CUDA high level
language.
Although useful, the documentation is rather poor. Let me rephrase that.
The PTX code itself is pretty well documented in the Nvidia SDK
documentation, in the CUDA/docs/ptx_1.x.pdf file, with everything you
need to know about the instruction format. However, its application is
poorly documented in the documentation of the nvidia cuda compiler (
nvcc.pdf ), and thus I thought I could be as kind as to provide you with
a small hands on tutorial.
First, what I’ve found works best is to do some cheating, and let the
compiler itself create a skeleton framework for me. This allows me to
rapidly start developing the PTX code, without the boring part where I
have to create all the auxiliary files by hand. What I usually do is to
write a small skeleton .cu file, where I just create an empty __global__
function with the correct parameters. Hence my initial skeleton file
would look something like:
/* Cu-code */
#include <cuda.h>
__global__ void zeroKernel(int *in, int *out){
out[threadIdx.x] = 0;
}
int main(){
//** Set up **/
<<<threads, grid>>> zeroKernel(foo,bar);
/** Tear down **/
}
I would then run the nvcc with the command “nvcc main.cu –ext=all
–dir=a.out.devcode” in order to have it create the necessary files for me.
Some explanation is needed though. One very useful feature of the CUDA runtime
library is the support for what they call code repositories. During execution,
the CUDA binary will check its current directory for a sub directory and look
for child directories, containing a cubin file. If the executable file finds a
file matching his kernel, he will use the one from the code repository instead
of the one found embedded in his binary file. The matching cubin file for the
kernel can be seen here:
// cubin
architecture {sm_10}
abiversion {1}
modname {cubin}
code {
name = _Z4testPiS_
lmem = 0
smem = 24
reg = 3
bar = 0
const {
segname = const
segnum = 1
offset = 0
bytes = 4
mem {
0x00000004
}
}
bincode {
0x00000005 0x60004780 0x30010209 0xc4100780
0x1000ca05 0x0423c780 0x60040005 0x00000003
0xd00e0209 0xa0c00781
}
}
The cubin file, is the executable file, and keeps all information
needed by the binary application in order to execute. It also contains
the kernel code in the CODE section of the cubin file itself. Quite
nifty. For those of you especially interested in the binary format
itself, Wladimir J. van der Laan has created an assembler / disassembler
for the G80 architecture[1], and which can be read if you want to learn
more about the true instruction set of the nvidia G80.
Besides the .cubin file, it should be a couple of files named comp_10 or
comp_12, depending on which architecture you tried to compile the
original .cu file to. This file will contain the PTX code for you to
start code in, although with some extra directives such as debug
statements, and various other lines of unneeded code. The following
figure shows how the PTX code for the zeroKernel looks when compiled
into PTX, minus the crud:
/**
PTX code
**/
.version 1.3
.entry _Z4testPiS_
{
.reg .u16 %rh<3>;
.reg .u32 %r<6>;
.param .u32 __cudaparm__Z4testPiS__in;
.param .u32 __cudaparm__Z4testPiS__out;
.loc 14 5 0
$LBB1__Z4testPiS_:
.loc 14 6 0
mov.u32 %r1, 0;
ld.param.u32 %r2, [__cudaparm__Z4testPiS__out];
mov.u16 %rh1, %tid.x;
mul.wide.u16 %r3, %rh1, 4;
add.u32 %r4, %r2, %r3;
st.global.u32 [%r4+0], %r1;
.loc 14 7 0
exit;
$LDWend__Z4testPiS_:
}
The given PTX code is the one that you can modify for your own purpose.
Hence an easy check to make sure that the tool chain works is to change
the “mov.u32 %r1, 0;” to “mov.u32 %r1, 0xDEADBEEF;”, which should give a
different output from your main kernel. When done modifying the kernel,
you can run “ptxas -o sm10″ which will give you an updated of the cubin
file itself. Careful though, ptxas will output by default to sm10
architecture, so if your GPU/Tesla supports a different architecture you
have to set this with the -arch sm_XX option.
Links:
[1] Decuda