Optimizing for Loops: Reverse Loops

for loops are basic language constructs in many languages. One of the first thing to look at when optimizing code is the loops, as they do considerable amounts of work (like going through a very large amount of data), in very little code.

If you go use for loop, but you don’t really care about the order in which the loop is executed, to be more precise, if you can afford reversing to loop, you can save quite some time. By reversing the loop I mean instead of giving the index values from 0 to 10 for example, you go from 10 downward to zero. This doesn’t seem like a big change, but when being carefully implemented this can easily upgrade the performance of your for loops.

Take a look at the following for loop implementation:

for (int i=0; i<10; i++) {
	//some work
}

This loop goes from 0 to 10 and does some work. If the kind of work done allows one to go from 9 to 0, a for loop with the same functionality can be implemented like this:

for (int i=9; i--;) {
	//some work
}

Notice the difference between the two implementations. The first one, needs to compare the index to the stopping value, and then increases the index for the next iteration. On the other hand, the second implementation just checks that the index isn’t zero and increases it in the same statement.

To check if the theory behind this optimization is right, I’ve put a short piece of code to check it.

// fortest.cpp
#include <iostream>
#include <time.h>
using namespace std;
 
timespec diff(timespec start, timespec end);
 
int main()
{
	timespec time1, time2;
	unsigned int i, temp = 1;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
	for (i = 0; i <= 2420000000; i++)
		temp+=temp;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time2);
	cout<<diff(time1,time2).tv_sec<<":"<<diff(time1,time2).tv_nsec<<endl;
	temp = 1;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
	for (i = 2420000000; i--; )
		temp+=temp;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time2);
	cout<<diff(time1,time2).tv_sec<<":"<<diff(time1,time2).tv_nsec<<endl;
	return 0;
}
 
timespec diff(timespec start, timespec end)
{
	timespec temp;
	if ((end.tv_nsec-start.tv_nsec)<0) {
		temp.tv_sec = end.tv_sec-start.tv_sec-1;
		temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
	} else {
		temp.tv_sec = end.tv_sec-start.tv_sec;
		temp.tv_nsec = end.tv_nsec-start.tv_nsec;
	}
	return temp;
}

To compile it use g++ -lrt fortest.cpp -o fortest (don’t turn on, yet, any kind of compiler optimization). The program prints two line, one for every kind of for loop. Each line states the time it took for the for loop to complete in a seconds:nanoseconds format.

A typical run of the program on my machine resulted in:

10:338986608
9:728866372

The difference is about 0.5 seconds, which isn’t very small. On the other hand, it’s pretty small if taking into account that we did very little work in every iteration. But nonetheless it a speed gain you can easily achieve by just reversing the for loop code.

By the way, if you do use optimization, the speed gain is smaller, but compared to the runtime of the loop it can improve runtime by up to 70% (all my test showed an improvement of at least 50%).

For example one the same code compiled with the “-O2” optimization flag, I got the following output:

0:873
0:255

Which is a big improvement when considering the the total runtime of the traditional loop.

8 thoughts on “Optimizing for Loops: Reverse Loops”

  1. I don’t see any loops generated when your program is compiled the with -O2. So the the compiler must be smart enough to calculate the final result.
    You’d better move the value of loop limit out of compiler’s visibility (get it from argv, for example). What’s more if I put the second loop first, it’s still slower, so your benchmark methodology seems wrong.

  2. The basic idea behind it which also appears when compiling without the optimization (optimization greatly depends on the compiler) is that x86 based CPUs have the zero-flag in their ALU. For every arithmetic operation this zerfo-flag is check so by changing the order of the loop we can utilize this zero-flag to omit additional comparison operation for each loop. This should work like the loop with CX in assembly.

  3. Thanks for taking the time to discuss this, I feel strongly about it and love learning more on this topic. If possible, as you gain expertise, would you mind updating your blog with more information? It is enormously helpful and beneficial to your readers.

  4. I noticed this niffy for loop and checked assemble compilation for it. I must say

    // is still fastest with no cmp done. due to –i order of index there will be more clear how the for loop works

    for(int i=x-1; i>=0; –i)
    {
    printf() //causes a call..
    }
    x = 5 => loop 4,3,2,1,0
    in the case of
    for(i=x; i–; )
    {
    printf() //casues a call! but will cause the i to break and result in a cmp/test due to its order and thus not being the fastest allternative.
    }

    x = 5 => loop 4,3,2,1,0

    looks good in code though…
    i– => cmp i and then dec…
    –i => dec and cmp in this case branch directly due to zero flag is set default. hence one instr less..

  5. seems that web page removes a – in the text..

    I did meant minus minus ‘i’
    —i
    i—

  6. This trick might need an update.

    Try this code under VS 2013 Pro, inside a Windows 7 VM on Win 7 Host with Xeon E3-1270,

    I get the following results:

    First run:
    x = 12499999997500000000 ->Execution 1 took 4.53651
    x = 12499999997500000000 ->Execution 2 took 3.81237

    Second run:
    x = 12499999997500000000 ->Execution 1 took 4.5459
    x = 12499999997500000000 ->Execution 2 took 3.83222

    Apparently, the trick is counter productive now.

    void forlooptest()
    {
    LARGE_INTEGER start, finish, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    unsigned long long x = 0;
    unsigned long long y = 5000000000;
    for (unsigned long long i = y; --i;)
    {
    x += i;
    }

    QueryPerformanceCounter(&amp;finish);
    std::cout &lt;&lt; &quot;x = &quot; &lt;&lt; x &lt;";
    std::cout &lt;&lt; &quot;Execution 1 took &quot;
    &lt;&lt; ((finish.QuadPart - start.QuadPart) / (double)freq.QuadPart) &lt;&lt; std::endl;

    x = 0;
    QueryPerformanceCounter(&amp;start);
    for (unsigned long long i = 0; i &lt; y; ++i)
    {
    x += i;
    }
    QueryPerformanceCounter(&amp;finish);
    std::cout &lt;&lt; &quot;x = &quot; &lt;&lt; x &lt;";
    std::cout &lt;&lt; &quot;Execution 2 took &quot;
    &lt;&lt; ((finish.QuadPart - start.QuadPart) / (double)freq.QuadPart) &lt;&lt; std::endl;

    }

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.