## Thursday, November 6, 2008

### iPhone ARM VFP code

The iPhone has a kind of SIMD unit. It is called the VFP unit, and it is pretty hard to figure out how to program it. This is a place where you will soon find lots of VFP asm code.

With help from Matthias Grundmann I wrote my first piece of VFP code. Here it is:

```cpp
void MatrixMultiplyF(MATRIXf &mOut, const MATRIXf &mA, const MATRIXf &mB)
{
#if 0
    MATRIXf mRet;

    /* Perform the calculation on a dummy matrix (mRet) */
    mRet.f[ 0] = mA.f[ 0]*mB.f[ 0] + mA.f[ 1]*mB.f[ 4] + mA.f[ 2]*mB.f[ 8] + mA.f[ 3]*mB.f[12];
    mRet.f[ 1] = mA.f[ 0]*mB.f[ 1] + mA.f[ 1]*mB.f[ 5] + mA.f[ 2]*mB.f[ 9] + mA.f[ 3]*mB.f[13];
    mRet.f[ 2] = mA.f[ 0]*mB.f[ 2] + mA.f[ 1]*mB.f[ 6] + mA.f[ 2]*mB.f[10] + mA.f[ 3]*mB.f[14];
    mRet.f[ 3] = mA.f[ 0]*mB.f[ 3] + mA.f[ 1]*mB.f[ 7] + mA.f[ 2]*mB.f[11] + mA.f[ 3]*mB.f[15];
    mRet.f[ 4] = mA.f[ 4]*mB.f[ 0] + mA.f[ 5]*mB.f[ 4] + mA.f[ 6]*mB.f[ 8] + mA.f[ 7]*mB.f[12];
    mRet.f[ 5] = mA.f[ 4]*mB.f[ 1] + mA.f[ 5]*mB.f[ 5] + mA.f[ 6]*mB.f[ 9] + mA.f[ 7]*mB.f[13];
    mRet.f[ 6] = mA.f[ 4]*mB.f[ 2] + mA.f[ 5]*mB.f[ 6] + mA.f[ 6]*mB.f[10] + mA.f[ 7]*mB.f[14];
    mRet.f[ 7] = mA.f[ 4]*mB.f[ 3] + mA.f[ 5]*mB.f[ 7] + mA.f[ 6]*mB.f[11] + mA.f[ 7]*mB.f[15];
    mRet.f[ 8] = mA.f[ 8]*mB.f[ 0] + mA.f[ 9]*mB.f[ 4] + mA.f[10]*mB.f[ 8] + mA.f[11]*mB.f[12];
    mRet.f[ 9] = mA.f[ 8]*mB.f[ 1] + mA.f[ 9]*mB.f[ 5] + mA.f[10]*mB.f[ 9] + mA.f[11]*mB.f[13];
    mRet.f[10] = mA.f[ 8]*mB.f[ 2] + mA.f[ 9]*mB.f[ 6] + mA.f[10]*mB.f[10] + mA.f[11]*mB.f[14];
    mRet.f[11] = mA.f[ 8]*mB.f[ 3] + mA.f[ 9]*mB.f[ 7] + mA.f[10]*mB.f[11] + mA.f[11]*mB.f[15];
    mRet.f[12] = mA.f[12]*mB.f[ 0] + mA.f[13]*mB.f[ 4] + mA.f[14]*mB.f[ 8] + mA.f[15]*mB.f[12];
    mRet.f[13] = mA.f[12]*mB.f[ 1] + mA.f[13]*mB.f[ 5] + mA.f[14]*mB.f[ 9] + mA.f[15]*mB.f[13];
    mRet.f[14] = mA.f[12]*mB.f[ 2] + mA.f[13]*mB.f[ 6] + mA.f[14]*mB.f[10] + mA.f[15]*mB.f[14];
    mRet.f[15] = mA.f[12]*mB.f[ 3] + mA.f[13]*mB.f[ 7] + mA.f[14]*mB.f[11] + mA.f[15]*mB.f[15];

    /* Copy the result into mOut */
    mOut = mRet;
#else
#if (TARGET_CPU_ARM)
    const float* src_ptr1 = &mA.f[0];
    const float* src_ptr2 = &mB.f[0];
    float* dst_ptr = &mOut.f[0];

    asm volatile (
        // Switch to ARM mode. This involves an unconditional jump plus a
        // mode switch (opcode bx): the lowest bit of the target address
        // selects whether ARM (bit clear) or Thumb (bit set) is used.
        ".align 4                \n\t"
        "mov r0, pc              \n\t"
        "bx r0                   \n\t"
        ".arm                    \n\t"

        // Set the vector length to 4.
        // Example: fadds s8, s8, s16 then means that the contents of s8-s11
        // are added to s16-s19 and stored in s8-s11.
        "fmrx r0, fpscr          \n\t" // load the FPSCR status reg into r0
        "bic r0, r0, #0x00370000 \n\t" // clear the stride and length bits
        "orr r0, r0, #0x00030000 \n\t" // set length to 4 (binary 11)
        "fmxr fpscr, r0          \n\t" // write r0 back to the FPSCR
        // Note: this stalls the FPU.

        // result[0..3]  = mA.f[0] * mB.f[0..3]
        // result[0..3] += mA.f[1] * mB.f[4..7]
        // result[0..3] += mA.f[2] * mB.f[8..11]
        // result[0..3] += mA.f[3] * mB.f[12..15]
        // Registers s0-s31: if Fd is in s0-s7 the operation is treated as
        // scalar, all others are treated as vectors.

        // Load the whole matrix into registers - transposed, so the second
        // operand comes first.
        "fldmias %2, {s8-s23}    \n\t"

        // Load the first column into the scalar bank.
        "fldmias %1!, {s0-s3}    \n\t"
        // First column times matrix.
        "fmuls s24, s8, s0       \n\t"
        "fmacs s24, s12, s1      \n\t"
        "fmacs s24, s16, s2      \n\t"
        "fmacs s24, s20, s3      \n\t"
        // Save the first column.
        "fstmias %0!, {s24-s27}  \n\t"

        // Load the second column into the scalar bank.
        "fldmias %1!, {s4-s7}    \n\t"
        // Second column times matrix.
        "fmuls s28, s8, s4       \n\t"
        "fmacs s28, s12, s5      \n\t"
        "fmacs s28, s16, s6      \n\t"
        "fmacs s28, s20, s7      \n\t"
        // Save the second column.
        "fstmias %0!, {s28-s31}  \n\t"

        // Load the third column into the scalar bank.
        "fldmias %1!, {s0-s3}    \n\t"
        // Third column times matrix.
        "fmuls s24, s8, s0       \n\t"
        "fmacs s24, s12, s1      \n\t"
        "fmacs s24, s16, s2      \n\t"
        "fmacs s24, s20, s3      \n\t"
        // Save the third column.
        "fstmias %0!, {s24-s27}  \n\t"

        // Load the fourth column into the scalar bank.
        "fldmias %1!, {s4-s7}    \n\t"
        // Fourth column times matrix.
        "fmuls s28, s8, s4       \n\t"
        "fmacs s28, s12, s5      \n\t"
        "fmacs s28, s16, s6      \n\t"
        "fmacs s28, s20, s7      \n\t"
        // Save the fourth column.
        "fstmias %0!, {s28-s31}  \n\t"

        // Reset the vector length to 1.
        "fmrx r0, fpscr          \n\t" // load the FPSCR status reg into r0
        "bic r0, r0, #0x00370000 \n\t" // clear the stride and length bits
        "fmxr fpscr, r0          \n\t" // write r0 back to the FPSCR

        // Switch back to Thumb mode:
        // the lowest bit of the destination address is set to 1.
        "add r0, pc, #1          \n\t"
        "bx r0                   \n\t"
        ".thumb                  \n\t"

        // Bind the variables to registers.
        : "=r" (dst_ptr), "=r" (src_ptr1), "=r" (src_ptr2)
        : "0" (dst_ptr), "1" (src_ptr1), "2" (src_ptr2)
        : "r0"
    );
#endif
#endif
}
```

noxa said...

Very cool! Does this mean we'll be seeing some stuff go up on the google code site? I've been checking daily ;)
With the bit of help you've given here and some of the ARM docs I've found, I'd love to help contribute to the project.

Wolfgang Engel said...

check again :-)

noxa said...

I was wondering if you had noticed any perf differences between compiling to ARM or Thumb. For my floating-point-heavy code, not using -mthumb *seems* to make things faster, although I have no hard numbers. My guess is that it's not having to switch between ARM/Thumb at all, and that's a good thing, even though the code size has increased. I'm not sure if there is a define set when compiling to Thumb, because that would save the switch between modes here. Worst case, it could mean having the macros VFP_SWITCH_TO_ARM/VFP_SWITCH_TO_THUMB set to nop if the user has some -DVFP_ALWAYS_ARM or something.

Wolfgang Engel said...

Matthias made performance measurements with a rather small amount of data. It still made a difference there, but you could see that switching to ARM mode and then switching the VFP into vector mode costs some time. So you want to do this when you have lots of data, nicely aligned to whatever you want to squeeze through at once; data alignment plays a huge role. I believe that with the right data you can easily reach 4-5x speed-ups.

jeff said...

Very cool guys. I'd love to share performance results with you guys once I get up and running with VFP, and I could even contribute to the project if you're interested.

Wolfgang Engel said...

Jeff, if you would contribute, that would be great. As soon as you want to be added to the site, give me a sign.

Ed said...

I tried out the VFP code, but I see no difference at all in performance. I'm testing on a skinned mesh animation.
It seems strange that you have the "switch to thumb/arm" thing. You should have thumb switched off all the time if you are doing floating point stuff.

Wolfgang Engel said...

I measured the matrix multiply function in the VFP library and it was 20x faster than the C function in Oolong. I ran both a thousand times to make it a fair comparison, and that included switching the ARM/Thumb state on and off.
Can you share the details of how you measured performance? Just saying it did not make a difference does not really inspire much of a response here :-)

Ed said...

I have two big skinned meshes (4650 triangles), which render at about 32 ms per frame. I have Thumb disabled for the whole project. (If Thumb is enabled then the render time jumps to 53 ms.)
I use my own code to render.
The joints in the skinned mesh use a lot of matrix multiply operations, so I used your VFP code that multiplies two 4x4 matrices.
My matrix class is similar to the one in Irrlicht.

Wolfgang Engel said...

Interesting. Is it possible that you are CPU or GPU limited somehow in the scene? I can see this in the San Angeles example. The number of draw-calls is so high that the CPU just stalls. Using the VFP code made only a marginal difference.

Ed said...

That makes sense. I think rendering the scene takes much more time than the skinning, so optimisations on that side have little effect.

Max said...

If you set VFP_SWITCH_TO_ARM/VFP_SWITCH_TO_THUMB to nop, make sure you mark all s0-s31 registers as used in the asm clobber list.

Mike Riegger said...

Very cool, thanks for posting this.

I've written a VFP version of unsharp_mask and it works great... with optimization level -O3. When I use -Os, I get

Try as I might, I can't seem to force -O3 for just that one file. Anyone encountered something similar?

Jeffrey Lim said...

I recently started writing my iPhone vfp code and searching for what others had done, I stumbled upon this site.

I had a peek at the code, and with just a few minutes of looking at it and relatively small modifications, I can make the matrix * matrix routine ~30% faster (measured on iPhone 2G), and the vector * matrix ~10% faster.

I'm more than happy to share the code/knowledge.. but there's always the fun and challenge to figuring it out...

So with the knowledge that it can be done, give it a try. If you really just want the code and answers, EMail me.

Cheers.

jeff at lim dot com dot au.

Wolfgang Engel said...

You seem to be smart. Are you talking about my first attempt at ARM VFP code featured above, or the code that is part of the VFP math library on the Google Code site? The latter I would expect to be 2-3x faster than the code above.
If you want to contribute to the VFP math library, give me a sign and stop posing :-) ... if you post messages saying that you can do better, my answer will always be: show me :-) ... talk is cheap because there is an unlimited supply of it :-) ... update the VFP math library and show me :-)
You can go to the project page, download the code and see if you can improve it. I will happily make you a project owner ...
BTW: if I had time I would look into NEON asm code now ... assuming the compiler can't generate NEON code well enough :-)

Jeffrey Lim said...

I went to http://code.google.com/p/vfpmathlibrary/ and I made a few small changes and got dramatically increased performance.

The key is minimizing pipeline stalls.

While NEON is going to be relevant from now on, making things as fast as possible on the old iPhones isn't going to hurt :)

You're right. Talk is cheap. So I'll show you an example of what can be done:

Matrix4Mul:

Originally:

"fmuls s24, s8, s0 \n\t"
"fmacs s24, s12, s1 \n\t"

etc.

The problem is that between the first and the second line there is a large number of stall cycles! fmuls has an 8-cycle latency according to the CPU docs. I've tried to keep the pipeline filled the entire time, eliminating RAW dependencies where possible.

Try the following and measure the performance difference.

void Matrix4Mul(const float* src_mat_1, const float* src_mat_2, float* dst_mat) {
asm volatile (VFP_SWITCH_TO_ARM
VFP_VECTOR_LENGTH(3)

// Let A:=src_ptr_1, B:=src_ptr_2, then
// function computes A*B as (B^T * A^T)^T.

// Load first two columns to scalar bank.
"fldmias %1!, {s0-s7} \n\t"
// Load the whole matrix into registers.
"fldmias %2, {s16-s31} \n\t"

// First two result columns, interleaved to hide the fmuls/fmacs latency.
"fmuls s8, s16, s0 \n\t"
"fmuls s12, s16, s4 \n\t"
"fmacs s8, s20, s1 \n\t"
"fmacs s12, s20, s5 \n\t"
"fmacs s8, s24, s2 \n\t"
"fmacs s12, s24, s6 \n\t"
"fmacs s8, s28, s3 \n\t"
"fmacs s12, s28, s7 \n\t"

// Load the next two columns into the scalar bank.
"fldmias %1!, {s0-s7} \n\t"

// Save the first two result columns.
"fstmias %0!, {s8-s15} \n\t"

"fmuls s8, s16, s0 \n\t"
"fmuls s12, s16, s4 \n\t"
"fmacs s8, s20, s1 \n\t"
"fmacs s12, s20, s5 \n\t"
"fmacs s8, s24, s2 \n\t"
"fmacs s12, s24, s6 \n\t"
"fmacs s8, s28, s3 \n\t"
"fmacs s12, s28, s7 \n\t"

"fstmias %0!, {s8-s15} \n\t"

VFP_VECTOR_LENGTH_ZERO
VFP_SWITCH_TO_THUMB
: "=r" (dst_mat), "=r" (src_mat_2)
: "r" (src_mat_1), "0" (dst_mat), "1" (src_mat_2)
: "r0", "cc", "memory", VFP_CLOBBER_S0_S31
);
}

On my tests, this took ~40% less time, in other words, I could do 60% more matrix*matrix calculations in the same time.

After my original posting on your thread, I've measured a vector * matrix considerably faster than the one in the package...

Wolfgang Engel said...

Jeffrey: very cool!! I like that. So now the obvious question: do you want to contribute to the VFP math library :-)

Jeffrey Lim said...

I'll unfortunately pass on contributing to the library for several personal reasons... but I do enjoy discussing code/optimization.

Some further interesting results:

Compiled C matrix code runs faster on the iPhone 3GS than on the iPhone 2G (as expected).

However, VFP asm code runs SLOWER on the iPhone 3GS than on the iPhone 2G. The 3GS does not seem to have a pipelined VFP unit at all (!).

Here are some timings I made (I have some extra overhead, e.g. loop control, and I copy the result elsewhere, so take the timings as relative):

iPhone 2G C: 15401 microseconds
iPhone 2G VFP: 4094 microseconds

iPhone 3GS C: 9444 microseconds
iPhone 3GS VFP: 19485 microseconds
iPhone 3GS Neon: 2134 microseconds.

(Note that the C timings with the iPhone 3.1 SDK are FAR FAR better than the previous SDKs!)

Here's the Neon code:

r0 = output float*
r1 = matrix a*
r2 = matrix b*

vldmia r1, { q4-q7 }
vldmia r2, { q8-q11 }

vmul.f32 q0, q8, d8[0]
vmul.f32 q1, q8, d10[0]
vmul.f32 q2, q8, d12[0]
vmul.f32 q3, q8, d14[0]

vmla.f32 q0, q9, d8[1]
vmla.f32 q1, q9, d10[1]
vmla.f32 q2, q9, d12[1]
vmla.f32 q3, q9, d14[1]

vmla.f32 q0, q10, d9[0]
vmla.f32 q1, q10, d11[0]
vmla.f32 q2, q10, d13[0]
vmla.f32 q3, q10, d15[0]

vmla.f32 q0, q11, d9[1]
vmla.f32 q1, q11, d11[1]
vmla.f32 q2, q11, d13[1]
vmla.f32 q3, q11, d15[1]

vstmia r0, { q0-q3 }

Wolfgang Engel said...

Hey Jeffrey,
this is very good to know. So the VFP math library loses its importance on the 3GS.

Thanks for contributing this!
- Wolf

Yotto said...

Thanks for the examples.

I tried my hand at speeding up the matrix*vector implementation, and the best I could come up with is the following. I think it should save 11 cycles, but that doesn't seem to match up when I actually measure the performance (I get about a 5% speed-up versus the original implementation when I expected around 25%). Note, this is my first attempt at VFP coding, so I'm kind of fudging about, and the docs are a bit sparse ...

Note, the return vector is dim-3 vs dim-4.

If Jeffrey or Wolfgang could comment that would be most helpful.

thanks

yotto

void Matrix4Vector3Mul(const float* src_mat, const float* src_vec, float* dst_vec) {
asm volatile (VFP_SWITCH_TO_ARM

// Load first 2 columns of the matrix.
"fldmias %0!, {s8-s15} \n\t"

// Load vector to scalar bank.
"fldmias %1, {s0-s2} \n\t"

VFP_VECTOR_LENGTH(3)

"fmuls s24, s8, s0 \n\t"
"fmuls s28, s12, s1 \n\t"

// Load next 2 columns of the matrix.
"fldmias %0!, {s16-s23} \n\t"
"fmacs s24, s16, s2 \n\t"

// Save vector.
"fstmias %2, {s24-s26} \n\t"

VFP_VECTOR_LENGTH_ZERO
VFP_SWITCH_TO_THUMB
:
: "r" (src_mat), "r" (src_vec), "r" (dst_vec)
: "r0", "cc", "memory", VFP_CLOBBER_S0_S2, VFP_CLOBBER_S8_S31
);
}

Jeffrey Lim said...

The best thing to read is the ARM1176JZF-S Technical Reference Manual, section 21. It explains how to predict performance of vfp code (it doesn't take into account absolutely everything, but it's a good starting point).

Some pointers:

"fldmias %0!, {s16-s23} \n\t"
"fmacs s24, s16, s2 \n\t"

The first instruction says to load s16-s23. The next instruction uses s16. There's a 4 cycle latency on load, so there's some potential dead time to save here.

For ARMv6, fadds, fmuls, and fmacs all have the same cycle latency (8), which means that fmacs should be used in preference to an fmuls followed by an fadds.

Easy win: Your clobber list is:

VFP_CLOBBER_S8_S31

But you only use up to s26. s16-s31 need to be preserved across function calls, so you're specifying an extra 5 registers to push/pop that don't need it.

You also don't use s4-s7, which are registers that don't need to be preserved across function calls, so you could move s24-s26 to s4-s6 and save pushing/popping another 3 registers; i.e. your clobber list should be:

VFP_CLOBBER_S8_S23

Good to see someone taking the challenge :) My routine is more than double the speed of the one in the current VFP package (and will be contributed to it at some stage).

I'm wondering if someone can come up with something better though.

Yotto said...

Thanks for the pointers Jeffrey.

You must be doing something clever to get the 2X+ speed up.

"fldmias %0!, {s16-s23} \n\t"
"fmacs s24, s16, s2 \n\t"

A few lines before the load I call

"fmuls s24, s8, s0 \n\t"

which has an 8-cycle latency, so my thought was that the 4-cycle latency of fldmias wouldn't matter, since fmacs s24 would need to wait for the fmuls s24 to finish anyway. Have I missed something?

Also, I was wondering: since Thumb is turned off as a compiler option, can we remove "r0" and "cc" from the clobber list (since VFP_SWITCH_TO_ARM and VFP_SWITCH_TO_THUMB are empty)?

BTW, for others new to VFP like myself: while searching for the ARM1176JZF-S Technical Reference Manual, I came across this site as well, which was helpful:

http://aleiby.blogspot.com/2008/12/iphone-vfp-for-n00bs.html

Jeffrey Lim said...

From my understanding of the ARM documentation (which could be wrong), when you do

"fmuls s24, s8, s0"

This has an 8-cycle latency on each element that's produced. Since you have vector mode set to operate on 4 elements at once, this is equivalent to:

Cycle 1: "fmuls s24, s8, s0"
Cycle 2: "fmuls s25, s9, s0"
Cycle 3: "fmuls s26, s10, s0"
Cycle 4: "fmuls s27, s11, s0"

Then you do:

Cycle 5: "fmuls s28, s12, s1"
Cycle 6: "fmuls s29, s13, s1"
Cycle 7: "fmuls s30, s14, s1"
Cycle 8: "fmuls s31, s15, s1"

i.e. from what I read in your code, all of those 8 cycles are used up before the load instruction occurs.