Pre-increment vs. Post-increment: Which is Faster for 3D Software Development?

Fri Feb 11, 2011

By Kevin Tatterson

It was fun writing my first two blog posts – because I like pushing people’s buttons with controversial, philosophical subjects.  But for this article, I’d like to get technical.

Pre-increment vs. Post-increment in 3D Software Development

I was asked recently, “Did you know that pre-increment was faster than post-increment?”  I answered, “No”.  The claim:

       ++d is faster than d++

The explanation:  the implementation of post-increment (d++) must create a temporary storage for housing the original value of d, increment/store d, and return the original temporary value.  In contrast, pre-increment (++d) only has to increment/store the value and return it – no temporary required. 

To put things in programming speak, suppose we implement our own integer class and implement the overloads for pre and post-increment:

class myint
{
       int n;

public:
       myint( int _n ): n(_n) {}

       myint( const myint & m )
       {
              n = m.n;
       }

       myint& operator++()  // prefix, e.g., ++n
       {
              n = n + 1;
              return *this;
       }

       myint operator++(int) // postfix, e.g., n++
       {
              myint m(*this);      // create the temporary
              n = n + 1;
              return m;
       }
};

 

On the surface – it is clear that post-increment is doing more work – because it has to “create the temporary”.  But is the difference measurable?  Let’s look at this through a couple of different lenses.

 

 

Lens #1:  basic types

 

More than one person has studied this, and all reach the same conclusion:  for basic types (int, double, etc.), ++d has the same performance as d++.  I’ve seen a couple of good studies that break things down right to the exact assembly instructions.

 

Lens #2:  really simple objects

 

The “myint” class above is a good example of a “really simple object”.  Essentially, the goal of the design is to mimic an integer type – yet allow control over the precise implementation of each operator overload – in particular, the ones we want to measure:  pre-increment and post-increment.

 

Suppose we put myint’s pre-increment and post-increment under the empirical microscope – and try to measure the performance difference.  Here’s the program:

 

clock_t start;
const int iters = 800000000;
{
       myint m(0);
       start = clock();
       for ( int i = 0; i < iters; i++ )
              m++;
       printf( "post incr time: %d ms\n", clock() - start );
}

{
       myint d(0);
       start = clock();
       for ( int i = 0; i < iters; i++ )
              ++d;
       printf( "pre incr time: %d ms\n", clock() - start );
}

 

The first problem we run into is the compiler optimizer.  As written above, the Microsoft VS 2005 compiler optimizes the loop away entirely!  This defeats our goal of measuring the overhead of pre- vs. post-increment.

 

One low-cost trick to “calm” the optimizer is to add an __asm nop block to the implementations of our pre- and post-increment methods.  The biggest downside to this approach is that the __asm directive prevents the optimizer from inlining – and for our “really simple object”, this can make a big difference.  Let’s get back to the inlining issue later – for now, here’s what the __asm nop code looks like:

 

       myint& operator++()  // prefix, e.g., ++n
       {
              __asm { nop }
              n = n + 1;
              return *this;
       }

       myint operator++(int) // postfix, e.g., n++
       {
              __asm { nop }
              myint m(*this);      // create the temporary
              n = n + 1;
              return m;
       }

 

With the __asm nop blocks in place, we can start to run some benchmarks.  Repeated runs converge on the following times:

 

post incr time: 1781 ms

pre incr time: 1515 ms

 

This means pre-increment is 15% faster than post-increment.  But is it believable?  Let’s look at the generated assembly and compare clock cycles:

 

Pre-increment                    Clocks   Post-increment                   Clocks
lea ecx,[esp+10h]                     1   push 0                                1
call myint::operator++                3   lea eax,[esp+18h]                     1
    mov eax,ecx                       1   push eax                              1
    nop                               1   lea ecx,[esp+18h]                     1
    add dword ptr [eax],1             3   call myint::operator++                3
    ret                               5       nop                               1
                                              mov edx,dword ptr [ecx]           1
                                              mov dword ptr [eax],edx           1
                                              add edx,1                         1
                                              mov dword ptr [ecx],edx           1
                                              ret                               5
sub ebx,1                             1   sub ebx,1                             1
jne class_int_bench+61h               3   jne class_int_bench+61h               3
TOTAL clocks                         18   TOTAL clocks                         21

 

The assembly instructions shown above are the exact instructions being timed – and hence, represent the cost of each iteration of the loop.  The performance improvement based on total clocks:  14.3%.  This compares nicely to the empirical data, 15%.

Next Post:

I will push this a little into the theoretical (with respect to inlining), then tear it all down with a just a small dose of reality.
