Rant on Programming.

As both an avid programmer and a computer engineer, I see more and more informatics students finishing their bachelor’s or master’s degree without having the faintest idea about how a very basic CPU works. By this I don’t mean that they don’t know how to write HDL, or how caches work. It’s that the closest thing to hardware that these students encounter is the Java Virtual Machine, or the CLI from Microsoft. Now, I’m not going to start the entire “Which language should we teach the Students(tm)” discussion all over again, although I agree with Patterson on this matter ( Down-Up ).

However, independent of which programming language you learn in your freshman Computer Science track, I see no reason not to combine it with a basic introduction to computer architecture. Independent of your future specialization, having a basic idea of how a computer works is really not a silly thing at all, at least on the level where you know that you have registers, memory and special flags. Furthermore, the modelling of a CPU lends itself very well to implementation in a high level language, at least if you are interested in a high level functional simulator. Such a piece of code can be as easy as you want and is naturally modular, and due to the nature of a processor or a virtual machine, you can introduce all the introductory concepts of programming along the way. Your different functional units would be classes, the logic needs your basic control structures, and parsing the input program needs basic string processing.

Now, as always, to check if my hypothesis was correct, I took 5 minutes to write a simple model of a virtual machine / processor. Currently it supports 4 instructions ( add, put/load imm, prt, end ), so not all that much, but I think it illustrates the point. More importantly, the code base itself is more or less 200 lines of Java ( lending itself nicely to an exercise ), and I have tried to use some different concepts, such as basic control structures and OOP, all of which you could find in your introduction to programming course. All without getting overly complex.
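
To give an idea of the scale involved, here is a minimal sketch of such a functional simulator. It is not the code from the repository: the textual syntax ( "put r0 2", "add r2 r0 r1" ), the four registers and the class name TinyCpu are all assumptions made for this illustration. Only the four instruction names come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical functional simulator sketch; syntax and register count are assumptions. */
class TinyCpu {
    private final int[] regs = new int[4];                  // r0..r3 (assumed register file size)
    private final List<Integer> output = new ArrayList<>(); // everything printed by prt

    /** Turns a register name such as "r2" into its index 2. */
    private static int reg(String name) {
        return Integer.parseInt(name.substring(1));
    }

    /** Executes the program line by line until "end" or the last line. */
    List<Integer> execute(String program) {
        for (String raw : program.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;                                   // skip blank lines
            }
            String[] f = line.split("\\s+");                // basic string processing
            switch (f[0]) {
                case "put":                                 // put rD imm : load immediate
                    regs[reg(f[1])] = Integer.parseInt(f[2]);
                    break;
                case "add":                                 // add rD rA rB : rD = rA + rB
                    regs[reg(f[1])] = regs[reg(f[2])] + regs[reg(f[3])];
                    break;
                case "prt":                                 // prt rA : print a register
                    output.add(regs[reg(f[1])]);
                    System.out.println(regs[reg(f[1])]);
                    break;
                case "end":                                 // end : stop execution
                    return output;
                default:
                    throw new IllegalArgumentException("Unknown instruction: " + line);
            }
        }
        return output;
    }

    /** Convenience wrapper: runs a program on a fresh machine. */
    static List<Integer> run(String program) {
        return new TinyCpu().execute(program);
    }

    public static void main(String[] args) {
        run("put r0 2\nput r1 3\nadd r2 r0 r1\nprt r2\nend");
    }
}
```

Running main prints 5. Even at this size the pieces from the argument above show up: a class holding the register state, a switch as the decode logic, and string splitting as the parser.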

Check the code out from git using: http://git.langly.org/java-cpu, or just point your browser to it.

On GPU versus CPU cores.

In the confusion surrounding the number of cores on a GPU versus the number of cores on a CPU, I want to contribute my part and do some clearing up.

Ever since modern GPUs gained support for general purpose calculations, it has been pushed that they contain several hundred “Cores”, an order of magnitude or more above the number of “Cores” you find in a regular CPU, which these days is about 2-4 depending on your CPU model. This is due to marketing only, and it has bothered me for a while to see how academics and computer engineers have picked up the term core and use it relentlessly.

The problem arises because what the CPU manufacturers ( Intel, AMD etc. ) call a “Core” and what the GPU manufacturers ( Nvidia, AMD/ATI ) call a core are in fact two similar, but different things. A modern GPU consists of several such “Cores”, as seen in the following illustration of the new Fermi architecture:

[Figure: Fermi architecture]

What you can see in this illustration is that the GPU core is a scaled down version of what the CPU manufacturers would call an ALU, or a functional unit. Now, compare this to the microarchitecture of a Core2 chip, and check out the yellow boxes. If you want to compare the number of real cores, these are the ones you have to look at. Furthermore, it’s important to remember that the Core2 even has a vector unit, which can multiply / add several operands at once.

[Figure: Core2 microarchitecture]

Thus, a fairer comparison might be the number of multiprocessors in the GPU versus the number of cores on a CPU. In the newest Fermi architecture this is 16, versus the 4 cores on a quad core processor.

However, this is still an unfair comparison, because of the type of applications the different architectures are optimized for. Needless to say, the GPU is optimized for graphics processing and stream processing, which in turn means churning through data with fairly regular behaviour and memory accesses. Thus, the complexity of the GPU has been scaled down compared to that of a CPU, which has to perform well on a much wider range of applications. Hence, the CPU has to spend a lot more resources / gates on control structures, leaving its ratio of control gates to calculation gates much higher than that found in a GPU. This, again, leads to the huge difference in the number of “cores” between the CPU and the GPU.

As a sidenote, there are still a lot of applications where the CPU outperforms the GPU ;)

Zsh – Skipping words

A couple of weeks ago I installed zsh on all of my shell accounts, and I’ve started to grow fond of it. However, one thing that annoyed me was that by default I couldn’t press ctrl+arrows to jump back and forth between words like I could in bash.

However, the solution was quite easy as soon as I read the manual. First, in your terminal press ctrl+arrow, and copy the code that appears on your terminal. In my case it was “;5D” and “;5C”.

Then in your .zshrc file put:

bindkey ";5D" backward-word
bindkey ";5C" forward-word

That should do it

Fermi Architecture

Some new and interesting articles about the new Fermi architecture from nvidia:

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1

and

http://techreport.com/articles.x/17670

Quite interesting to see how they are turning back to a more G80ish architecture again.

New blog

As the observant reader may have noticed, I’ve chickened out and replaced my old blog system, pyblosxom, with wordpress.

Just me being waay too lazy :)

Sixty Symbols

Just came across this one: Sixty Symbols from the Uni. of Nottingham. Basically, it’s a collection of five-minute YouTube videos where scientists talk about a mathematical symbol / constant of their choice.

And it’s really not as boring as it sounds. They even have rockets under η :)

CUDA: Hacking PTX code.

In order to provide the CUDA developer with a low level programming
language without exposing any of the underlying instruction set, NVIDIA
has given us, the developers, the option to program in PTX ( Parallel
Thread eXecution ). PTX, being somewhat similar to “assembly code” in
structure, opens up a new set of features to the developer, which in
certain cases might be useful to take advantage of. One case which I use
a lot in my daily work is the ability to internally time blocks of code
within a thread using the %clock register ( somewhat like the Time Stamp
Counter on x86 ), which is not exposed through the CUDA high level
language.

Although useful, the documentation is rather poor. Let me rephrase that.
The PTX code itself is pretty well documented in the Nvidia SDK
documentation, in the CUDA/docs/ptx_1.x.pdf file, with everything you
need to know about the instruction format. However, its application is
poorly documented in the documentation of the nvidia cuda compiler (
nvcc.pdf ), and thus I thought I could be so kind as to provide you with
a small hands-on tutorial.

First, what I’ve found works best is to do some cheating, and let the
compiler itself create a skeleton framework for me. This allows me to
rapidly start developing the PTX code, without the boring part where I
have to create all the auxiliary files by hand. What I usually do is to
write a small skeleton .cu file, where I just create an empty __global__
function with the correct parameters. Hence my initial skeleton file
would look something like:

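/* Cu-code */
#include <cuda.h>

__global__ void zeroKernel(int *in, int *out){
        out[threadIdx.x] = 0;
}

int main(){
        /** Set up: one int per thread in each device buffer **/
        const int threads = 64, grid = 1;
        int *foo, *bar;
        cudaMalloc((void **)&foo, threads * sizeof(int));
        cudaMalloc((void **)&bar, threads * sizeof(int));

        /* Launch syntax is kernel<<<grid, threads>>>(arguments) */
        zeroKernel<<<grid, threads>>>(foo, bar);

        /** Tear down **/
        cudaFree(foo);
        cudaFree(bar);
        return 0;
}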

I would then run nvcc with the command “nvcc main.cu -ext=all
-dir=a.out.devcode” in order to have it create the necessary files for me.
Some explanation is needed though. One very useful feature of the CUDA runtime
library is the support for what they call code repositories. During execution,
the CUDA binary will look in its current directory for a repository sub
directory, and in that for child directories containing a cubin file. If the
executable finds a file matching its kernel, it will use the one from the code
repository instead of the one embedded in its own binary file. The matching
cubin file for the kernel can be seen here ( the mangled name _Z4testPiS_
shows that this dump was taken from a kernel named test with the same
signature ):

// cubin
architecture {sm_10}
abiversion   {1}
modname      {cubin}
code {
    name = _Z4testPiS_
    lmem = 0
    smem = 24
    reg  = 3
    bar  = 0
    const {
            segname = const
            segnum  = 1
            offset  = 0
            bytes   = 4
        mem {
            0x00000004
        }
    }
    bincode {
        0x00000005 0x60004780 0x30010209 0xc4100780
        0x1000ca05 0x0423c780 0x60040005 0x00000003
        0xd00e0209 0xa0c00781
    }
}

The cubin file is the executable file, and keeps all the information
needed by the binary application in order to execute. It also contains
the kernel code in the code section of the cubin file itself. Quite
nifty. For those of you especially interested in the binary format
itself, Wladimir J. van der Laan has created an assembler / disassembler
for the G80 architecture[1], which is worth reading if you want to learn
more about the true instruction set of the nvidia G80.

Besides the .cubin file, there should be a couple of files named comp_10
or comp_12, depending on which architecture you compiled the original
.cu file for. This file contains the PTX code for you to start coding
in, although with some extra directives such as debug statements, and
various other lines of unneeded code. The following listing shows how
the zeroKernel looks when compiled into PTX, minus the crud:

/**
	PTX code
**/
.version 1.3

.entry _Z4testPiS_
{
    .reg .u16 %rh<3>;
    .reg .u32 %r<6>;
    .param .u32 __cudaparm__Z4testPiS__in;
    .param .u32 __cudaparm__Z4testPiS__out;
    .loc    14  5   0
	$LBB1__Z4testPiS_:
    .loc    14  6   0

    mov.u32     %r1, 0;

    ld.param.u32    %r2, [__cudaparm__Z4testPiS__out];
    mov.u16     %rh1, %tid.x;
    mul.wide.u16    %r3, %rh1, 4;
    add.u32     %r4, %r2, %r3;

    st.global.u32   [%r4+0], %r1;
    .loc    14  7   0
    exit;
$LDWend__Z4testPiS_:
}

The given PTX code is the one that you can modify for your own purpose.
Hence an easy check to make sure that the tool chain works is to change
the “mov.u32 %r1, 0;” to “mov.u32 %r1, 0xDEADBEEF;”, which should give a
different output from your main kernel. When done modifying the kernel,
you can run “ptxas -o sm10” which will give you an updated version of
the cubin file itself. Careful though, ptxas will by default output for
the sm_10 architecture, so if your GPU/Tesla supports a different
architecture you have to set this with the -arch sm_XX option.

Links:

[1] Decuda

Perian — Codex for QuickTime

So, I’m going to kick off my new mac category by recommending a nice app
for OS X. Usually, I’m a fond user of VLC when it comes to playing my media.
However, it seems like the newest version has some bugs in it, notably
problems with some files. Thus I was trying to play my media with QuickTime,
only to end up in the eternal codec hell.

However, I came across a small app ( not so much an app as a collection of codecs )
named Perian. I installed it the usual way you install stuff in OS X, and it allowed me to play all my files through QuickTime by installing all the needed codecs. Check it out.

Returned from Acaces08

So, last week I went with some of my fellow Ph.D students to Acaces, which is a
summer school for computer architects and compiler designers. Here we could
meet up with a lot of other students and industry people from the rest of the
world, to network and learn stuff in classes.

Needless to say, I had a great time and got to meet a lot of new people. I
hope I can get back next year as well.

There is even a pretty picture, stolen from Andreas Popp. If my prof. asks
what we were doing in that picture, the answer is that we were discussing
computers, not sitting in the sun relaxing ;)

Update on the UTF-8

So, a couple of weeks ago I posted an entry
( http://blog.langly.org/hacking/05022008-mutt-htmlmail ) on how to make your
favorite mail client ( MUTT ) read HTML mails, using links. Now for most mails
that works just fine, but I seemed to have some problems with certain character
sequences. I did some playing around, only to figure out that the easiest thing
was to switch the embedded reader to w3m instead. ( It seems like elinks has
some problems with chars that look like this: \123 )

So, the quick fix is to put this in your .mailcap file instead:

text/html; w3m -I %{charset} -dump %s; nametemplate=%s.html; copiousoutput

and it will display everything correctly :)