This post will start off with an important question. Look at Listing 1 below: after executing the instruction located at main+12, what values will be stored in r0 and r1? Take a moment to consider this.
My first (albeit incorrect) answer was that r0 would hold 0x000083bc (main+8) and that r1 would hold 0x000083c8 (main+12+8, that is, the address of the instruction plus the immediate value from the add instruction). I thought this because I made a few assumptions about the state of the processor while the instructions execute. First, I assumed that while executing the instruction located at main+8, “mov r0, pc”, the pc register would hold the address main+8, and therefore that address would be moved into r0. I also assumed that while executing the instruction at main+12, “add r1, pc, #8”, the pc register would hold the address main+12, and that this address plus 8 would be moved into r1. Looking at Listing 2, I felt that GDB supported my assumptions, because it showed the pc register holding the address of the currently executing instruction.
By examining r0 and r1 while executing the instruction located at main+16 (Listing 3), it became obvious that my two predictions were wrong. The r0 register held 0x000083c4 and the r1 register held 0x000083d0. Perplexed, I needed to understand the mechanism at work here.
After a few minutes of thinking, I started to remember a topic covered in my NYU:Poly computer architecture class: pipelining. I then noticed that both of the values were exactly 0x8 higher than I expected. After a quick Google search, I learned that the ARM processor executing my code has a 3-stage pipeline and a fixed 4-byte instruction size.
To understand this problem, we now have to get into some details of ARM processor architecture. Pipelining, as it relates to a processor, is an optimization that overlaps the execution of consecutive instructions. When a processor executes one instruction, there are normally a few distinct steps required to complete it. For example, ARM's pipeline stages include fetch, decode, and execute* (see note at end of post). The processor must first load the instruction from memory (fetch), decipher what the instruction must accomplish (decode), and perform the operations necessary to complete the instruction (execute). Each step usually corresponds to a set of components inside the processor and requires a certain amount of time to complete. In addition, the steps must be performed in a strictly serial manner for each instruction. In a non-pipelined processor, only one phase of one instruction is performed at a time, which leaves the hardware responsible for the other stages idle. A pipelined processor is designed to keep every stage busy all of the time, with each stage working on a different instruction.
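To make the benefit concrete, here is a minimal sketch that counts clock cycles for a run of instructions with and without pipelining. It assumes an idealized 3-stage pipeline where every stage takes exactly one cycle and there are no stalls or branches; the function names are made up for illustration and real hardware will not be this tidy.

```python
STAGES = ("fetch", "decode", "execute")


def cycles_non_pipelined(n_instructions, n_stages=len(STAGES)):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=len(STAGES)):
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle (idealized: no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


for n in (1, 3, 100):
    print(n, cycles_non_pipelined(n), cycles_pipelined(n))
```

Under these assumptions, 100 instructions take 300 cycles without pipelining and 102 cycles with it, because all three stages are kept busy once the pipeline has filled.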
Let us work through an example (Image 2): during one clock cycle, Instruction 1 will be in the execute phase, Instruction 2 will be in the decode phase, and Instruction 3 will be in the fetch phase. So why is this important? From a high-level point of view, the pc register points to the currently executing instruction. This is the convention that GDB employs. However, from an ARM pipeline point of view, the physical pc register always points to the instruction currently being fetched. The reason for this resides in the deepest levels of the ARM processor architecture. The pc register is used as a direct input into the address register, and the address register supplies the memory address from which the next instruction is fetched. See Image 1 below, from “ARM System-on-Chip Architecture, Second Edition”, for a good diagram of this.
Our example image below shows a time-based view of this processor over the course of 5 clock cycles. Looking carefully at Image 2 during clock cycle 3, we see that the instruction being fetched is 2 instructions after the instruction being executed. Therefore, within the processor, pc must point 2 instructions past the currently executing instruction, or 8 bytes ahead (each instruction is 4 bytes long). Instructions that use the value stored in the pc register will see this actual value of pc. When we see an instruction such as “mov r0, pc”, we can think of it as r0 = pc + 8, where pc is the address of the currently executing instruction as reported by GDB.
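The time-based view described above can be sketched in a few lines of Python. This is only an illustrative model, assuming an ideal 3-stage pipeline with no stalls or branches; the helper names are hypothetical, and the base address 0x83b4 is simply main as implied by main+8 = 0x000083bc.

```python
INSTRUCTION_SIZE = 4  # bytes; fixed-size instructions, as stated in the post
STAGES = ("fetch", "decode", "execute")


def pipeline_state(cycle, base_address):
    """Return {stage: address} for a 1-based clock cycle.

    Assumes an ideal 3-stage pipeline with no stalls or branches.
    A stage with nothing to work on yet maps to None.
    """
    state = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - 1 - depth  # which instruction occupies this stage
        if index >= 0:
            state[stage] = base_address + index * INSTRUCTION_SIZE
        else:
            state[stage] = None
    return state


def pc_offset(state):
    """Distance from the executing instruction to the one being fetched."""
    return state["fetch"] - state["execute"]


base = 0x83B4  # main, derived from main+8 = 0x83bc
for cycle in range(1, 6):
    state = pipeline_state(cycle, base)
    row = {s: (hex(a) if a is not None else "-") for s, a in state.items()}
    print(cycle, row)

print(pc_offset(pipeline_state(3, base)))  # 8
```

From cycle 3 onward the fetch address is always 8 bytes past the execute address, which is the offset an instruction sees when it reads pc in this model.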
With this in mind, the correct answers to the initial question are:
r0 = 0x000083c4 = (main+8) + 8 = 0x000083bc + 8
r1 = 0x000083d0 = (main+12) + 8 + 8 = 0x000083c0 + 8 + 8
As you can see, these solutions match what was observed in GDB. Yay!
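The same arithmetic can be double-checked with a tiny model of how the two instructions read pc. This is a sketch rather than an emulator: read_pc is a hypothetical helper, and the address of main (0x83b4) is derived from main+8 = 0x000083bc.

```python
PC_READ_OFFSET = 8  # pc reads as the executing instruction's address + 8
MAIN = 0x83B4       # derived from main+8 = 0x000083bc


def read_pc(executing_address):
    """Value an instruction sees when it reads pc, under the pc + 8 rule."""
    return executing_address + PC_READ_OFFSET


r0 = read_pc(MAIN + 8)        # mov r0, pc      at main+8
r1 = read_pc(MAIN + 12) + 8   # add r1, pc, #8  at main+12

print(hex(r0), hex(r1))  # 0x83c4 0x83d0
```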
So what are the key lessons learned? Depending on the number of pipeline stages or the specific hardware, the difference between the address of the currently executing instruction and the value stored in the physical program counter register (eip, pc, rip, etc.) may vary. It is important to research this behavior for any processor architecture you are going to be reverse engineering on, writing shellcode for, or simply writing assembly for.
*Note: There are many different ARM processors and pipeline architectures. However, the 3-stage description above is good enough to understand the general mechanism at work.
Published date:  22 September 2011
Written by: 0xD1AB10