Computers Counting Part 3: Compiler Optimizations
How fast can computers count?
This is the second post in a series of writing programs to see how fast computers can count. The series will start with naive methods, then will make use of optimization and later parallelism within the CPU, then will utilize a GPU, and the finale will use a dedicated FPGA for counting.
Previous posts:
Hardware specs:
- CPU: AMD Ryzen 9 3900X (12x Zen 2 cores)
- GPU: Nvidia GeForce RTX 5060
- RAM: 32GB DDR4
- FPGA: PYNQ-Z2
What C code are we looking at?
The same C count program as the end of the last post. Here it is written out:
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
int main()
{
uint64_t count = 0;
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
while(count < 100'000'000)
{
count++;
}
gettimeofday(&endTime, NULL);
long long elapsedTime = \
(endTime.tv_sec - startTime.tv_sec) * 1000000 + \
(endTime.tv_usec - startTime.tv_usec);
printf("Counted to 100 million in %lld microseconds.", elapsedTime);
return 0;
}Compiler optimizations
When I compiled this code, I ran gcc count.c -o count. This implies compiling without any optimization, which means I was using -O0 (no optimizations). There are multiple levels of optimization that I can use:
- -O1 (optimize). This causes the compiler to try to reduce code size and execution time, by making the compilation take somewhat more time.
- -O2 (optimize more). This causes the compiler to perform all optimizations that do not involve a space-speed tradeoff, increasing the compilation time even more.
- -O3 (optimize yet more). This causes the compiler to perform all optimizations for speed, with a greater program memory usage and a greatly increased compilation time.
To compile with optimization level 1, I run gcc count.c -O1 -o count-o1 and similar for higher optimization levels. The code is the same across all compilations.
Let’s see how much these optimizations do! I will record the microseconds it takes to count to 100,000,000 in all optimization levels:
| Optimization level | Count time (microseconds) |
|---|---|
| -O0 | 49957 |
| -O1 | 28032 |
| -O2 | 0 |
| -O3 | 0 |
You will notice that -O1 sped up the counting considerably, and -O2 and -O3 somehow made it count so incredibly fast that it took no time! Wow, this compiler optimization thing is incredible! Let’s try again with -O2 counting to 1 quintillion (10^18).
Counted to 1 quintillion in 0 microseconds.
So -O2 is able to make it count at over 10^24 increments a second! This gotta be a world record!
What if I told you it was all a lie?
Exposing my cheating programs
To see what is really going on, we have to look at the assembly again, between the two calls to gettimeofday to see the full scope of what is running in the timed region (between the two calls to gettimeofday).
-O1:
call gettimeofday
movl $100000000, %eax
.p2align 3
.L2:
subq $1, %rax
jne .L2
movq %rsp, %rdi
movl $0, %esi
call gettimeofday-O2:
call gettimeofday
leaq 16(%rsp), %rdi
xorl %esi, %esi
call gettimeofday-O3:
call gettimeofday
leaq 16(%rsp), %rdi
xorl %esi, %esi
call gettimeofdayAs you can see, the -O1 program is legitimate, as it puts the number 100,000,000 into register %rax, and continually subtracts 1 until the register’s number reaches 0. The computer is counting down, but it’s still legitimately counting. Additionally, it did 3.57B increments (sorry, decrements) per second. The Ryzen 9 3900X has a base clock speed of 3.8GHz, so the counting program reached 1 increment/decrement per cycle!
The -O2 and -O3 programs are cheating in the same way. They never count! They call gettimeofday, then do something, then call gettimeofday again, 3 instructions after the first call. The two calls in between pertain to calling the second gettimeofday correctly. Why does it do this? Because the compiler is smart enough to figure out that the count loop is unnecessary for later in the program, and omits the count loop.
Preventing the compiler from cheating
To prevent the compiler from cheating, I must modify the source code such that we need the counting to happen later in the program.
Unfortunately, the compiler is quite smart at making this difficult. I tried using the count variable after the for loop. I tried using the count variable in the final printf. I tried declaring a new variable based on the value of the count variable. All optimized away with -O2 and -O3.
I need to somehow indicate to the compiler that the count variable cannot be optimized away. One way to do this is the volatile keyword, indicating that the variable’s value might change at any time and must be fetched from main memory every fetch to account for this. For a benchmark, this is a problem! The latency of data fetched from main memory is orders of magnitude worse than data from a low-level cache or CPU registers.
But it certainly doesn’t hurt to test. So I compiled my program with the count variable as a volatile uint64_t, and got this output: Counted to 100 million in 33307 microseconds. Wait, what? This is faster than anything done before! This is about 1 to 1.5 clock cycles per increment! The assembly, on -O2 and -O3, looks like this:
call gettimeofday
movq 8(%rsp), %rax
cmpq $99999999, %rax
ja .L2
.L3:
movq 8(%rsp), %rax
addq $1, %rax
movq %rax, 8(%rsp)
movq 8(%rsp), %rax
cmpq $99999999, %rax
jbe .L3
.L2:
leaq 32(%rsp), %rdi
xorl %esi, %esi
call gettimeofdayThis assembly does in fact load the count variable from main memory (in the stack), adds 1 to it, then stores it in main memory again, before loading to compare it to potentially escape the loop. It still counts up from 0 to 100,000,000, one increment at a time.
I am flabbergasted that this is faster than the optimized -O0 assembly. It’s several more instructions per loop, and more expensive loads and stores, supposedly to main memory. This also experiences the same variability as my other compiled C counting programs. This leaves me with two major questions that I don’t know the answer to.
Conclusion
Still, I feel like this can go faster without needing to resort to parallelism yet. What if I just write the assembly by hand? Stay tuned for the next part, as this will be exactly what I do.
Until next time!
Summary table of counting speed
| Method | Increments / second |
|---|---|
| Very naive JavaScript | 12.6M |
| Naive JavaScript | 369M |
| Naive C | 1.55B |
| Optimized C (-O0) | 2.02B |
| Optimized C (-O3) | 3.00B |