ARM asm timing framework for iOS

Aug 7, 2013

Hello Hackers!

Do you have a thirst for knowledge? Perhaps you have some low level code that needs extra optimization for iOS devices? Or you might just desire to understand what is going on at the lowest level once your Objective-C or C code is compiled. To really get to the next level in programming skills, one should learn to write code in assembler. These days, the only game in town in terms of assembler is ARM. Assembler takes a little time to learn, but ARM is not that complex, and almost every phone or embedded device sold today uses an ARM processor.

How does an iOS developer actually get started writing ARM asm? All the tools you need are already included in Xcode; it is just a matter of learning how to use them. It is also important to know when to use assembler and when to stick with C. Many audio and video codecs make use of assembly code because of the performance benefits. A project that does a lot of the same operations on sets of data can also benefit from NEON SIMD instructions, which are most easily accessed via assembler. But not everything should be coded in assembler, because actually writing and debugging assembler code is complex and time consuming.

Here is a quick example of some C code and the ARM assembler that does the same thing:

C:

// Add 2 variables then subtract 1 from the result
a = b + c;
a = a - 1;

ARM asm:

// Add registers r1 and r2, save in r0, then subtract
add r0, r1, r2
sub r0, r0, #1

For more introductory information about ARM assembler, please have a look at these useful links:

A developer will find a lot of conflicting and dated information about ARM asm online. After many, many hours of implementing different approaches using both gcc and clang, I have found that the best way to work with assembler is to add a .s assembler file to your Xcode project. Do not bother trying to learn how to use inline asm statements in C code; inline asm is just a mess and is very difficult to debug, and you basically cannot rely on the compiler for anything when it comes to integrating C and asm. Just write your functions completely in asm and handle pushing saved registers and adjusting the stack yourself. Take care not to use ARM register r7 as a scratch register in your asm code, since it holds the frame pointer on iOS and clobbering it makes debugging impossible. When you add an assembly file to your Xcode project, you can set breakpoints in the debugger and step through ARM code one asm instruction at a time. Xcode used to support a split mode where C source and asm instructions could be stepped through together, but that feature no longer works with recent versions of Xcode.
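To make that concrete, here is a rough sketch of the C side of this setup. The function name my_asm_sum and the file name are placeholders, not code from the example project; the point is that the .s file defines a global symbol and the C code only needs an ordinary prototype, because the ARM AAPCS calling convention passes the first four integer arguments in registers r0 through r3 and returns the result in r0:

// caller.c -- calling a routine implemented in a separate .s file.
// The asm file would define the symbol as _my_asm_sum, since the C
// compiler adds a leading underscore to symbol names.
#include <stdint.h>

extern uint32_t my_asm_sum(uint32_t *wordPtr, uint32_t numWords);

uint32_t call_asm_sum(uint32_t *buffer, uint32_t count)
{
  // AAPCS: buffer arrives in r0, count in r1, the sum comes back in r0.
  return my_asm_sum(buffer, count);
}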

The goal of this blog post is not to teach you how to write ARM asm. The goal of this post is to provide information and tools that make it easier to learn how to write fast ARM asm. Suppose you have two implementations and you want to know which one will execute more quickly. A developer needs a tool that makes it easy to compare the execution time of approach A to approach B and determine which one is actually more efficient. The only way to find out which one is faster is to run each implementation hundreds of times and accurately measure the execution time on ARM hardware.

What is presented here is an example iOS app named ARMTimingTests that includes low level system timing code that will run 3 different implementations and then display how long each one took to execute. By default, the project compares a C code implementation to two different ARM asm implementations, but you can configure and modify the code to do whatever you want. This is not an app that will ever appear in the iTunes store. Instead, a developer should download the example Xcode project and run it on an iOS device.

This Xcode project file contains everything an iOS developer needs to get going with ARM asm. The specific value of this timing framework is that it makes it easier to get A/B testing results that compare different ARM asm implementations. A development process based on hard timing facts is required when optimizing ARM asm code. The bottom line is that there just is no way to know if one implementation will work better than another without actually testing on real hardware. This timing framework makes it easier to gather real performance numbers that can guide a developer when choosing between implementation paths.

Inside ARMTimingTests:

The default test module that comes included in the ARMTimingTests project file is named TestTimeSimple, and the source code for this module is located in the file test_time_simple.m. This default module is the simplest example of using the timing framework I could come up with. It consists of C and ARM implementations of a function that loops over a series of integers and adds them together, like so:

uint32_t simple_add_result1(
    uint32_t* wordPtr,
    uint32_t numWords)
{
  uint32_t sum = 0;
  do {
    uint32_t tmp = *wordPtr++;
    sum += tmp;
  } while (--numWords != 0);
  return sum;
}

If the Xcode menu command Product->Generate Output->Assembly File is executed, one would see Thumb2 asm output like:

_simple_add_result1:
  mov     r2, r0
  movs    r0, #0
LBB9_1:
  ldr     r3, [r2], #4
  subs    r1, #1
  add     r0, r3
  bne     LBB9_1
  bx      lr

This asm is very simple: it reads a word from the address held in r2 into register r3, adds that value to a running sum held in r0, and loops until the counter in r1 counts down to zero, at which point the result is returned in r0. While the implementation is very simple, it is also going to be very slow. It should be easy to create a much faster implementation with hand crafted ARM asm. Here is the ARM asm for the function simple_add_result2(), defined in test_time_simple.s.

_simple_add_result2:
  push {r4, r5, r6, r7, lr}
  push {r8, r10, r11}
  @ r0 = wordPtr
  @ r1 = numWords
  @ r2 = sum
  
  mov r2, #0
  
  cmp r1, #4
  blt 2f
1:
  ldm r0!, {r3, r4, r5, r6}
  sub r1, r1, #4
  @ r2 = r2 + r3 + r4 + r5 + r6
  @ with minimal interlock
  add r2, r2, r3
  add r4, r4, r5
  add r2, r2, r6
  cmp r1, #4
  add r2, r2, r4
  bge 1b
  
2:
  cmp r1, #0
  ldrgt r3, [r0], #4
  subgt r1, r1, #1
  addgt r2, r2, r3
  bgt 2b
  
  @ cleanup and return
  
  mov r0, r2
  pop {r8, r10, r11}
  pop {r4, r5, r6, r7, pc}

In addition to the ARM asm implementation, the simple example module also contains a NEON implementation of the same logic named simple_add_result3(). The interested reader is invited to take a look at test_time_simple.s to see the NEON asm code. With three implementations, the simple example can be run to compare the execution time results.

iPhone 4 (Cortex-A8)
simple_add_result1               0.0092 seconds
simple_add_result2               0.0021 seconds
simple_add_result3               0.0018 seconds

iPad 2 (Cortex-A9)
simple_add_result1               0.0073 seconds
simple_add_result2               0.0015 seconds
simple_add_result3               0.0013 seconds

The first thing to note about these results is that the ARM asm implementation is roughly 4 to 5 times faster than the simple C loop. The main reason for this is that the ARM asm implementation makes use of the ldm instruction to read four words at a time into registers in the main loop; in normal C code, there is no direct way to read into 4 registers at the same time. The second thing to note is that the NEON implementation is only a tiny bit faster than the ARM asm. One might be tempted to think that NEON SIMD instructions could really speed things up, but these results show that the add operations account for an insignificant portion of the execution time. Most of the execution time in this logic is spent waiting for reads to complete, so doing multiple vector add operations with NEON does not actually speed things up by much.
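The closest a developer can get in plain C is to unroll the loop by hand, something like the sketch below, but the compiler is still free to emit four separate ldr instructions instead of a single ldm, so the hand written asm keeps its edge. This sketch is not part of the project and assumes numWords is a multiple of 4:

uint32_t simple_add_unrolled(const uint32_t* wordPtr, uint32_t numWords)
{
  uint32_t sum = 0;
  // Four reads per iteration, mirroring the ldm in the asm version.
  // Whether these become one ldm or four ldr instructions is up to
  // the compiler.
  for (uint32_t i = 0; i < numWords; i += 4) {
    sum += wordPtr[i] + wordPtr[i+1] + wordPtr[i+2] + wordPtr[i+3];
  }
  return sum;
}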

The sum loop is a trivial example, and your own code and data structures will have completely different execution timing results. The purpose of the execution timing framework is to provide a way to gather timing results that a developer can depend on. This is important, so I will say it again: a developer should make implementation decisions based on timing results, and those timing results need to be reliable.

The timing loop logic makes use of the mach_absolute_time() API to gather very precise timing results. A specific test is run for N iterations and then the total runtime is divided by the number of iterations N. The timing results are then examined, and all the results that fall outside of one half of a standard deviation are tossed out before the average time for N loops is determined. This approach is the result of countless hours of experimentation and fine tuning of asm on iOS devices. The implementation logic for these timing results can all be found in the file test_time.h; the entire test time module is implemented as a set of inline functions to make profiling easier.
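As a rough illustration of the basic idea (this is a simplified sketch, not the actual code in test_time.h, which also does the outlier filtering described above), a minimal mach_absolute_time() based measurement might look like this:

#include <mach/mach_time.h>
#include <stdint.h>

// Run one implementation N times and return the average number of
// seconds per call. Hypothetical helper, not part of the project.
static double average_seconds(uint32_t (*impl)(uint32_t*, uint32_t),
                              uint32_t *words, uint32_t numWords, int N)
{
  mach_timebase_info_data_t info;
  mach_timebase_info(&info);

  uint64_t start = mach_absolute_time();
  for (int i = 0; i < N; i++) {
    (void) impl(words, numWords);
  }
  uint64_t elapsed = mach_absolute_time() - start;

  // Convert mach time units to nanoseconds, then to seconds per call.
  double nanos = (double) elapsed * info.numer / info.denom;
  return (nanos / 1.0e9) / N;
}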



The memcpy and fill timing module:

The simple test module provides an easy to understand example, but if you want to see a more complex one then have a look at test_time_asm.m. This module is disabled by default, but it can be enabled by editing the method runTestsInSecondaryThreadEntryPoint in SpeedTest.m. The memcpy and fill tests run a large number of variations to determine what code produces the most efficient memcpy and memory fill. The memcpy and fill test module was inspired by the following ARM tech note:

While a developer can find a lot of ARM information online, not all of it is useful, even when it comes from arm.com. If you run the memcpy and fill timing tests on both A8 and A9 processors, you will find that while the suggested approach is optimal for the Cortex-A8 processor, it is actually worse than a pure ARM ldm/stm based copy on the Cortex-A9. The point is that you cannot always find all the answers to your programming problems online. When it comes to really optimizing your software, you need to roll up your sleeves and get busy testing the actual execution times of your code. This timing framework makes that job a little easier.
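To give a flavor of the kind of candidate that can be fed into the timing framework (this is a trivial sketch, not code from test_time_asm.m), a plain word at a time copy like the one below could be raced against memcpy() and against ldm/stm or NEON based variants on a specific device:

#include <stdint.h>
#include <stddef.h>

// Deliberately naive baseline: copy numWords 32-bit words one at a time.
void word_copy_sketch(uint32_t *dst, const uint32_t *src, size_t numWords)
{
  for (size_t i = 0; i < numWords; i++) {
    dst[i] = src[i];
  }
}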