Sunday, November 30, 2008

iP* programming tip #1

This is the first of a series of iPhone / iPod programming tips.
Starting iPhone development requires first the knowledge of the underlying hardware and what it can do for you. Here are the latest hardware specs I am aware of (a rumour was talking about iPods that run the CPU with 532 MHz ... I haven't found any evidence for this).
  • GPU: PowerVR MBXLite with VGPLite with 103 Mhz
  • ~DX8 hardware with vs_1_1 and ps_1_1 functionality
  • Vertex shader is not exposed
  • Pixel shader is programmed with texture combiners
  • 16 MB VRAM – not mentioned anywhere
  • CPU: ARM 1176 with 412 Mhz (can do 600 Mhz)
  • VFP unit 128-bit Multimedia unit ~= SIMD unit
  • 128 MB RAM; only 24 MB for apps allowed
  • 320x480 px at 163 ppi screen
  • LIS302DL, a 3-axis accelerometer with 412 Mhz (?) update rate
  • Multi-Touch: up to five fingers
  • PVRTC texture compression: color map 2-bit per pixel and normal map 4-bit per-pixel
The interesting part is that the CPU can do up to 600 Mhz, so it would be possible to increase the performance here in the future.
I wonder how the 16 MB VRAM are handled. I assume that this is the place where the VBO and textures are stored. Regarding the max size of apps of 24 MB; I wonder what happens if an application generates geometry and textures dynamically ... when does the sandbox of the iPhone / iPod touch stop the application. I did not find any evidence for this.

WARP - Running DX10 and DX11 Games on CPUs

As a MVP I was involved into testing this new Windows Advanced Rasterization Platform. They just published the first numbers

Running Crysis on a 8 core CPU with a resolution of 800x600 at 7.2 fps is an achievement. If this would be hand-optimized very well, it would be the best way to write code for. 4 - 8 cores will be a common target platform in the next two years. Because it can be switched off if there is a GPU, this is a perfect target for game developers. What this means is that you can write a game with the DirectX 10 API and not only target all the GPUs out there but also machines without GPU ... this is one of the best developments for the PC market since a long time. I am excited!

The other interesting consequence from this development is: if INTELs "Bread & Butter" chips run games with the most important game API, it would be a good idea if INTEL would put a bunch of engineers behind this and optimize WARP (in case they haven't already done so). This is the big game market consisting of games like "The Sims" and "World of Warcraft" and similar games that we are talking about here. The high-end PC gaming market is much smaller.

Thursday, November 6, 2008

iPhone ARM VFP code

The iPhone has a kind of SIMD unit. It is called VFP unit and it is pretty hard to figure out how to program it. Here is a place where you can find soon lots of VFP asm code.

With help from Matthias Grundmann I wrote my first piece of VFP code. Here it is:

void MatrixMultiplyF(
MATRIXf &mOut,
const MATRIXf &mA,
const MATRIXf &mB)
#if 0

/* Perform calculation on a dummy matrix (mRet) */
mRet.f[ 0] = mA.f[ 0]*mB.f[ 0] + mA.f[ 1]*mB.f[ 4] + mA.f[ 2]*mB.f[ 8] + mA.f[ 3]*mB.f[12];
mRet.f[ 1] = mA.f[ 0]*mB.f[ 1] + mA.f[ 1]*mB.f[ 5] + mA.f[ 2]*mB.f[ 9] + mA.f[ 3]*mB.f[13];
mRet.f[ 2] = mA.f[ 0]*mB.f[ 2] + mA.f[ 1]*mB.f[ 6] + mA.f[ 2]*mB.f[10] + mA.f[ 3]*mB.f[14];
mRet.f[ 3] = mA.f[ 0]*mB.f[ 3] + mA.f[ 1]*mB.f[ 7] + mA.f[ 2]*mB.f[11] + mA.f[ 3]*mB.f[15];

mRet.f[ 4] = mA.f[ 4]*mB.f[ 0] + mA.f[ 5]*mB.f[ 4] + mA.f[ 6]*mB.f[ 8] + mA.f[ 7]*mB.f[12];
mRet.f[ 5] = mA.f[ 4]*mB.f[ 1] + mA.f[ 5]*mB.f[ 5] + mA.f[ 6]*mB.f[ 9] + mA.f[ 7]*mB.f[13];
mRet.f[ 6] = mA.f[ 4]*mB.f[ 2] + mA.f[ 5]*mB.f[ 6] + mA.f[ 6]*mB.f[10] + mA.f[ 7]*mB.f[14];
mRet.f[ 7] = mA.f[ 4]*mB.f[ 3] + mA.f[ 5]*mB.f[ 7] + mA.f[ 6]*mB.f[11] + mA.f[ 7]*mB.f[15];

mRet.f[ 8] = mA.f[ 8]*mB.f[ 0] + mA.f[ 9]*mB.f[ 4] + mA.f[10]*mB.f[ 8] + mA.f[11]*mB.f[12];
mRet.f[ 9] = mA.f[ 8]*mB.f[ 1] + mA.f[ 9]*mB.f[ 5] + mA.f[10]*mB.f[ 9] + mA.f[11]*mB.f[13];
mRet.f[10] = mA.f[ 8]*mB.f[ 2] + mA.f[ 9]*mB.f[ 6] + mA.f[10]*mB.f[10] + mA.f[11]*mB.f[14];
mRet.f[11] = mA.f[ 8]*mB.f[ 3] + mA.f[ 9]*mB.f[ 7] + mA.f[10]*mB.f[11] + mA.f[11]*mB.f[15];

mRet.f[12] = mA.f[12]*mB.f[ 0] + mA.f[13]*mB.f[ 4] + mA.f[14]*mB.f[ 8] + mA.f[15]*mB.f[12];
mRet.f[13] = mA.f[12]*mB.f[ 1] + mA.f[13]*mB.f[ 5] + mA.f[14]*mB.f[ 9] + mA.f[15]*mB.f[13];
mRet.f[14] = mA.f[12]*mB.f[ 2] + mA.f[13]*mB.f[ 6] + mA.f[14]*mB.f[10] + mA.f[15]*mB.f[14];
mRet.f[15] = mA.f[12]*mB.f[ 3] + mA.f[13]*mB.f[ 7] + mA.f[14]*mB.f[11] + mA.f[15]*mB.f[15];

/* Copy result in pResultMatrix */
mOut = mRet;
const float* src_ptr1 = &mA.f[0];
const float* src_ptr2 = &mB.f[0];
float* dst_ptr = &mOut.f[0];

asm volatile(
// switch on ARM mode
// involves uncoditional jump and mode switch (opcode bx)
// the lowest bit in the address signals whether are (bit cleared)
// or tumb should be selected (bit set)
".align 4 \n\t"
"mov r0, pc \n\t"
"bx r0 \n\t"
".arm \n\t"

// set vector length to 4
// example fadds s8, s8, s16 means that the content s8 - s11
// is added to s16 - s19 and stored in s8 - s11
"fmrx r0, fpscr \n\t" // loads fpscr status reg to r4
"bic r0, r0, #0x00370000 \n\t" // bit clear stride and length
"orr r0, r0, #0x00030000 \n\t" // set length to 4 (11)
"fmxr fpscr, r0 \n\t" // upload r4 to fpscr
// Note: this stalls the FPU

// result[0][1][2][3] = mA.f[0][0][0][0] * mB.f[0][1][2][3]
// result[0][1][2][3] = result + mA.f[1][1][1][1] * mB.f[4][5][6][7]
// result[0][1][2][3] = result + mA.f[2][2][2][2] * mB.f[8][9][10][11]
// result[0][1][2][3] = result + mA.f[3][3][3][3] * mB.f[12][13][14][15]
// s0 - s31
// if Fd == s0 - s7 -> treated as scalar all the other treated like vector
// load the whole matrix into memory - transposed -> second operand first
"fldmias %2, {s8-s23} \n\t"
// load first column to scalar bank
"fldmias %1!, {s0 - s3} \n\t"
// first column times matrix
"fmuls s24, s8, s0 \n\t"
"fmacs s24, s12, s1 \n\t"
"fmacs s24, s16, s2 \n\t"
"fmacs s24, s20, s3 \n\t"
// save first column
"fstmias %0!, {s24-s27} \n\t"

// load second column to scalar bank
"fldmias %1!, {s4-s7} \n\t"
// second column times matrix
"fmuls s28, s8, s4 \n\t"
"fmacs s28, s12, s5 \n\t"
"fmacs s28, s16, s6 \n\t"
"fmacs s28, s20, s7 \n\t"
// save second column
"fstmias %0!, {s28-s31) \n\t"

// load third column to scalar bank
"fldmias %1!, {s0-s3} \n\t"
// third column times matrix
"fmuls s24, s8, s0 \n\t"
"fmacs s24, s12, s1 \n\t"
"fmacs s24, s16, s2 \n\t"
"fmacs s24, s20, s3 \n\t"
// save third column
"fstmias %0!, {s24-s27} \n\t"

// load fourth column to scalar bank
"fldmias %1!, {s4-s7} \n\t"
// fourth column times matrix
"fmuls s28, s8, s4 \n\t"
"fmacs s28, s12, s5 \n\t"
"fmacs s28, s16, s6 \n\t"
"fmacs s28, s20, s7 \n\t"
// save fourth column
"fstmias %0!, {s28-s31} \n\t"

// reset vector length to 1
"fmrx r0, fpscr \n\t" // loads fpscr status reg to r4
"bic r0, r0, #0x00370000 \n\t" // bit clear stride and length
"fmxr fpscr, r0 \n\t" // upload r4 to fpscr

// switch to tumb mode
// lower bit of destination is set to 1
"add r0, pc, #1 \n\t"
"bx r0 \n\t"
".thumb \n\t"

// binds variables to registers
: "=r" (dst_ptr), "=r" (src_ptr1), "=r" (src_ptr2)
: "0" (dst_ptr), "1" (src_ptr1), "2" (src_ptr2)
: "r0"