Pre-increment vs. Post-increment: Which is Faster for 3D Software Development?

Fri Feb 11, 2011

By Kevin Tatterson

It was fun writing my first two blog posts – because I like pushing people’s buttons with controversial, philosophical subjects.  But for this article, I’d like to get technical.

Pre-increment vs. Post-increment in 3D Software Development

I was asked recently, “Did you know that pre-increment was faster than post-increment?”  I answered, “No”.  The claim:

       ++d is faster than d++

The explanation:  the implementation of post-increment (d++) must create a temporary storage for housing the original value of d, increment/store d, and return the original temporary value.  In contrast, pre-increment (++d) only has to increment/store the value and return it – no temporary required. 

To put things in programming speak, suppose we implement our own integer class and implement the overloads for pre and post-increment:

class myint
{
       int n;

public:
       myint( int _n ): n(_n) {}

       myint( const myint & m )
       {
              n = m.n;
       }

       myint& operator++()  // prefix, e.g., ++n
       {
              n = n + 1;
              return *this;
       }

       myint operator++(int) // postfix, e.g., n++
       {
              myint m(*this);      // create the temporary
              n = n + 1;
              return m;
       }
};

 

On the surface – it is clear that post-increment is doing more work – because it has to “create the temporary”.  But is the difference measurable?  Let’s look at this through a couple of different lenses.

 

 

Lens #1:  basic types

 

More than one person has studied this, and all reach the same conclusion:  for basic types (int, double, etc.), ++d has the same performance as d++.  I’ve seen a couple of good studies that break things down right to the exact assembly instructions.

 

Lens #2:  really simple objects

 

The “myint” class above is a good example of a “really simple object”.  Essentially, the goal of the design is to mimic an integer type – yet allow control over the precise implementation of each operator overload – in particular, the ones we want to measure:  pre-increment and post-increment.

 

Suppose we put myint’s pre-increment and post-increment under the empirical microscope – and try to measure the performance difference.  Here’s the program:

 

clock_t start;
const int iters = 800000000;
{
       myint m(0);
       start = clock();
       for ( int i = 0; i < iters; i++ )
              m++;
       printf( "post incr time: %d ms\n", clock() - start );
}

{
       myint d(0);
       start = clock();
       for ( int i = 0; i < iters; i++ )
              ++d;
       printf( "pre incr time: %d ms\n", clock() - start );
}

 

The first problem we run into is the compiler optimizer.  As written above, the Microsoft VS 2005 compiler optimizes the loop away entirely!  This defeats our goal of measuring the overhead of pre- vs. post-increment.

 

One low-cost trick to “calm” the optimizer is to add an __asm nop block to the implementations of our pre- and post-increment methods.  The biggest downside to this approach is that the __asm directive prevents the optimizer from inlining – and for our “really simple object”, this can make a big difference.  Let’s get back to the inlining issue later – for now, here’s what the __asm nop code looks like:

 

       myint& operator++()  // prefix, e.g., ++n
       {
              __asm { nop }
              n = n + 1;
              return *this;
       }

       myint operator++(int) // postfix, e.g., n++
       {
              __asm { nop }
              myint m(*this);      // create the temporary
              n = n + 1;
              return m;
       }

 

With the __asm nop blocks in place, we can start to run some benchmarks.  Repeated runs converge on the following times:

 

post incr time: 1781 ms

pre incr time: 1515 ms

 

This means pre-increment is 15% faster than post-increment.  But is it believable?  Let’s look at the generated assembly and compare clock cycles:

 

Pre-increment                    Clocks   Post-increment                   Clocks
lea ecx,[esp+10h]                     1   push 0                                1
call myint::operator++                3   lea eax,[esp+18h]                     1
    mov eax,ecx                       1   push eax                              1
    nop                               1   lea ecx,[esp+18h]                     1
    add dword ptr [eax],1             3   call myint::operator++                3
    ret                               5       nop                               1
                                              mov edx,dword ptr [ecx]           1
                                              mov dword ptr [eax],edx           1
                                              add edx,1                         1
                                              mov dword ptr [ecx],edx           1
                                              ret                               5
sub ebx,1                             1   sub ebx,1                             1
jne class_int_bench+61h               3   jne class_int_bench+61h               3
TOTAL clocks                         18   TOTAL clocks                         21

 

The assembly instructions shown above are the exact instructions being timed – and hence, represent the cost of each iteration of the loop.  The performance improvement based on total clocks:  14.3%.  This compares nicely to the empirical data, 15%.

Next Post:

I will push this a little into the theoretical (with respect to inlining), then tear it all down with a just a small dose of reality.
