Tail call elimination is good in C too

2008-04-09 (permalink tags: , , )

All recursive algorithms can be converted into an iterative version. They teach that in school. They teach it as something that should be done, and it used to be true. However, we a living in a wonderful time and old wisdoms sometimes cease to be true.

Some algorithms are much more elegant when written recursively; the recursive version is also easier to prove. On the other hand, the iterative version will avoid many stack manipulation operations. Theoretically, any tail recursive function can be converted into an iteration transparently by the compiler without performance penalty compared to the hand written iterative version. Theoretically.

When I wanted to prove that Common Lisp had really fast implementations, I did my best to use the iterative version of the Mandelbrot formula. Unfortunately, there was no way around it, the iterative version would always run a bit faster. "So be it," I said to myself in resignation.

update: Thanks to the many readers who pointed out that is had nothing to do with side effects. Indeed, compiling and dissassembling a version without GCC attributes show that the tail call was eliminated and it runs just as fast.

Tail call elimination is often implemented by functional language compilers. This is because the compiler can easily determine if a function has side effects. Indeed, the automatic tail call elimination won't work when the function does something else beside returning a value. The ubiquity of side effects in C is often cited as the main problem for implementing tail call elimination this language. So the compiler can't determine that a function has no side effects. What is you tell the compiler that there are none? The result is pleasantly surprising.

Check for your self. Take a C program with a tight loop, rewrite the loop body as a tail recursive function and watch it run faster. No mirror tricks, it just happens. C hackers are probably not going to annotate all their pure functions are such but the great implementation of tail call elimination in GCC will certainly bring a major performance boost to all the high level languages implemented on top of it. Beside, it's always fun to point out when teachers are wrong.

Comments

2008-04-09 05:13:42 by Adrien (direct link | reply)

afair, "gcc -S" should compile to assembly text. Might be interesting to compare the generated code on simpler, yet improvement exhibiting code samples. I don't have time for that, but you made me want to hack some C code now ;)

2008-04-09 08:36:38 by Yannick Gingras (direct link | reply)

You can always open the executable with gdb and disassemble from there. There is indeed a jump at the end of the function to rewind its execution (jbe 0x... iterate+34). Also interesting is the fact that param passing is with the XMM registers instead of the stack.

2008-04-09 09:11:25 by Tac-Tics (direct link | reply)

What do side effects have to do with TCO? Even in side-effectful languages, TCO works just great (Scheme being the most obvious example). The only important point is that functions can only be optimized when the function call is in tail position, which is effectively the last statement in an if-branch or a lambda body.

2008-04-09 10:05:39 by Jesse Tov (direct link | reply)

I'm glad you're writing about tail recursion, but this post is quite confused in a few places. Starting with. . . .

> All recursive algorithms can be converted into an iterative version.

Not true. Some recursive algorithms can, but others are inherently recursive. You can write a tree traversal with a for loop, but you'll have to maintain your own stack if you do.

> The ubiquity of side effects in C is often cited as the main problem for implementing tail call elimination this language.

Side effects have nothing to do with tail call elimination. Nothing at all. The C calling convention (on many platforms) is the real culprit. The other culprit is just that idiomatic C avoids recursion, so for most C compilers, it hasn't been worth the trouble.

> Take a C program with a tight loop, rewrite the loop body as a tail recursive function and watch it run faster.

Sorry, no. A good compiler might generate the same code for both, but there's no reason it should be faster.

> the great implementation of tail call elimination in GCC will certainly bring a major performance boost to all the high level languages implemented on top of it.

Is this new? GCC's support for proper tail calls was pretty abysmal last I checked, which is why functional languages with C backends tend to use dirty strategies like trampolines to get space safety.

2008-04-10 05:55:43 by Yannick (direct link | reply)

Thank you for your comment. I will probably write something less confused in a followup post. Can you develop on why the calling convention prevents tail call elimination?

2008-04-10 14:40:30 by guest (direct link | reply)

> You can write a tree traversal with a for loop, but you'll have to maintain your own stack if you do.

You can write iterative in-order tree traversal without using additional memory.

2008-07-03 17:17:09 by guest2 (direct link | reply)

Prove it.

2008-04-09 14:59:55 by Kostas Rastapopoulis (direct link | reply)

Don't change the divisions to multiplications, and you'll get the same speed. But I've always considered recursive functions smoke and mirrors. Confused yet?

2008-04-09 16:28:54 by John Stracke (direct link | reply)

I just tried this with a simple example, under GCC 4.1.2, and found that the iterative form was faster. Here's the code:

#include <stdio.h>

static int f(int j) { int i; for (i=0; i<1000000000; i++) j=j*2; return j; }

int main(int argc, const char* argv[]) { printf("%dn",f(1)); }

In tail-recursive form, this became:

#include <stdio.h>

static int f(int i, int j) { if (i<1000000000) return f(i+1,j*2); else return j; }

int main(int argc, const char* argv[]) { printf("%dn",f(0,1)); }

And here's the results. This was on a 2.4GHz Core 2 Quad under Fedora:

tail.c, -O3 -m32: 739ms

tail.c, -O3 -m64: 745ms

tail.c, -O3 -m32 -mtune=core2: 737ms

tail.c, -O3 -m64 -mtune=core2: 761ms

iterative.c, -O3 -m32: 418ms

iterative.c, -O3 -m64: 634ms

iterative.c, -O3 -m32 -mtune=core2: 418ms

iterative.c, -O3 -m64 -mtune=core2: 596ms

Note that the GCC docs say that -O3 turns on tail call elimination. Without -O3, tail.c crashes with a stack overflow (not surprising, with 100 million stack frames), so apparently TCE is being done; it's just not efficient.

Admittedly, I didn't run this multiple times, as I would've done if I'd been doing this for work. But a speedup of 17-76% seems pretty conclusive.

2008-04-09 16:30:41 by John Stracke (direct link | reply)

OK, -S may have shown part of the reason: in the iterative version, f() is eliminated altogether. I'll change it to two calls to f().

2008-04-09 16:52:13 by John Stracke (direct link | reply)

OK, changing the printf() to call f() twice reverses the results:

tail, -O3 -m32 -march=core2: 738ms

tail, -O3 -m32 -march=i686: 738ms

tail, -O3 -m32 -march=i386: 689ms

tail, -O3 -m32: 739ms

tail, -O3 -m64 -march=core2: 762ms

tail, -O3 -m64: 745ms

iterate, -O3 -m32 -march=core2: 835ms

iterate, -O3 -m32 -march=i686: 836ms

iterate, -O3 -m32 -march=i386: 837ms

iterate, -O3 -m32: 835ms

iterate, -O3 -m64 -march=core2: 1183ms

iterate, -O3 -m64: 1138ms

So now the tail-recursive form has an advantage of 12-35%. That's more reasonable.

The real shocker is that telling GCC to tune for the chip I'm actually running on makes the tail recursive form slower. In fact, it runs the fastest when I tell it to generate for i386.

2008-04-10 06:01:41 by Yannick (direct link | reply)

Someone pointed out (mailman has broken archiving) on the MSLUG mailing list that GCC unrolled the loop in the recursive version but not one the iterative one. What does your final test program looks like? Do you see the same unrolling pattern?

2008-05-03 06:52:02 by bryane (direct link | reply)

I do see unrolling in gcc 4.2.2, using the code shown above (but using "%d %d", and two calls to f). gcc 3.2.3 does not unroll. Strangely, the best results were the older version of gcc with iteration:

  • tail.c, with gcc-4.2.2: 0.400s
  • iter.c, with gcc-4.2.2: 1.795s
  • tail.c, with gcc-3.2.3: 2.153s
  • iter.c, with gcc-3.2.3: 0.215s

It looks like someone implemented an optimization for tail recursion between the two versions; however, I don't understand how iter.c could be that much worse in the newer version of the compiler, so it's possible I made an error somewhere.

Leave a comment