Noah Kochavi's blog

Computers Counting Part 2: Digging into the assembly

How fast can computers count?

This is the second post in a series of writing programs to see how fast computers can count. The series will start with naive methods, then will make use of optimization and later parallelism within the CPU, then will utilize a GPU, and the finale will use a dedicated FPGA for counting.

Previous posts:

  1. Naive counting

Hardware specs: AMD Ryzen 9 3900X (Zen 2 microarchitecture).

What C code are we looking at?

The same naive C count program as the last post. Here it is written out:

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

int main()
{
	uint64_t count = 0;
	struct timeval startTime, endTime;
	gettimeofday(&startTime, NULL);
	while(count < 1e8)
	{
		count++;
	}
	gettimeofday(&endTime, NULL);
	long long elapsedTime = \
		(endTime.tv_sec - startTime.tv_sec) * 1000000 + \
		(endTime.tv_usec - startTime.tv_usec);
	printf("Counted to 100 million in %lld microseconds.", elapsedTime);
	return 0;
}

How do I dig into the assembly?

There are two ways that I can look at the assembly:

  1. Run gcc -S naive-count.c to generate a naive-count.s file. This file contains the assembly that runs on the CPU. It is the surefire source of truth.
  2. Copy and paste my code into Godbolt and view the assembly from there. I have to be careful to select the right compiler and architecture. This provides a better formatted view with easy ways to look up the instructions.

Note that from here on, we focus on x86_64 assembly, as this is the Instruction Set Architecture that runs on Zen 2, the Ryzen 9 3900X’s microarchitecture.

What does the assembly look like?

This. Note that I have truncated the assembly to only the part where the CPU is counting.

	jmp	.L2
.L5:
	addq	$1, -8(%rbp)
.L2:
	movq	-8(%rbp), %rax
	testq	%rax, %rax
	js	.L3
	pxor	%xmm0, %xmm0
	cvtsi2sdq	%rax, %xmm0
	jmp	.L4
.L3:
	movq	%rax, %rdx
	shrq	%rdx
	andl	$1, %eax
	orq	%rax, %rdx
	pxor	%xmm0, %xmm0
	cvtsi2sdq	%rdx, %xmm0
	addsd	%xmm0, %xmm0
.L4:
	movsd	.LC0(%rip), %xmm1
	comisd	%xmm0, %xmm1
	ja	.L5
...
.LC0:
	.long	0
	.long	1100470148

This is a lot to understand! Let me break down what is going on. First, a note on syntax: GCC emits AT&T syntax by default, where registers are prefixed with %, immediate constants with $, and the destination operand comes last.

What is the assembly code doing?

Buckle up, since I’m gonna go over every line of assembly posted above.

The actual counting occurs on a single instruction: addq $1, -8(%rbp). Everything else pertains to the while loop enclosing the incrementing.

The adept may notice that we are adding to a memory location without putting it in a register. This is inefficient, as reading from RAM can take hundreds of CPU cycles. Luckily, the CPU cache helps speed things up, as it will cache this memory location into its lowest-level cache, reducing the read time to only a few cycles. This might be the cause of the inconsistent results from last post, but let’s keep studying the assembly.

Execution starts at the top, where it jumps to .L2. The counter is first loaded into a register with movq, then testq checks whether the number is negative and sets the appropriate CPU flags. Note that %rax appears twice because testq sets flags based on the bitwise AND of its two operands; ANDing a register with itself preserves its sign bit.

“But wait, isn’t the counter a uint64_t? That’s unsigned!” Correct, but soon we run cvtsi2sd, which converts a 64-bit signed integer into a double-precision floating-point number. Since that instruction would treat our unsigned counter as signed, the compiler emits two different paths for when the counter looks positive or “negative” (that is, when the most significant bit is 0 or 1, respectively). The js instruction jumps to .L3 if the counter is “negative”.

Otherwise, we continue on as normal. The pxor instruction is a common zeroing idiom, meaning that %xmm0 becomes 0. Then, the cvtsi2sd instruction converts the 64-bit integer in %rax to a double-precision floating-point number in %xmm0. Next, this path jumps to .L4.

A note on %xmm0: it is a 128-bit register. Most x86 CPUs since the Pentium III have Streaming SIMD Extensions (SSE; double precision arrived with SSE2), where the floating-point registers are wide vector registers. A 128-bit register can hold two 64-bit floating-point numbers, though the sd (scalar double) instructions here only touch the low 64 bits. The reason CPUs have these vector registers is to perform multiple floating-point operations with a single instruction, one on each element of the vector. I will exploit this fact in a later post; here they can just be thought of as holding individual 64-bit floating-point numbers.

The “negative” path starting at .L3 is a more complicated version of the “positive” path. The count is copied to %rdx with movq, then shifted right by 1 bit with shrq to divide it by 2. Then, %rax is ANDed with 1 to isolate its lowest bit; the syntax uses %eax because writing to a 32-bit register implicitly zeroes the upper 32 bits of the corresponding 64-bit register. Next, %rax and %rdx are ORed to fold that lowest bit into %rdx, which keeps the rounding correct. After that, %rdx is converted into a 64-bit floating-point value in %xmm0, and addsd %xmm0, %xmm0 doubles it to undo the earlier halving. Frankly, I find this path weird, confusing, and irrelevant, because the count will never exceed 2^63: that is over 1e18, while my tests count to 1e8 usually, or as high as 5e9 if I want something longer.

Both paths converge on .L4. The next instruction, movsd, gets the bytes stored at .LC0 and puts them in %xmm1. The two .longs located just after .LC0 are loaded as one contiguous 64-bit value: the 2nd .long is the high 32 bits (1100470148 = 0x4197D784) and the 1st .long is the low 32 bits. That bit pattern is exactly the double 1e8. Then, %xmm0 and %xmm1 are compared with the comisd instruction, and the results are put into the appropriate flags.

Finally, the ja instruction jumps to .L5 if %xmm1 > %xmm0. In other words, we jump to .L5 only while the count is less than 1e8. Otherwise, the branch is not taken, meaning that the counting is done!

Inefficiency of the naive C program

I notice a few areas where the assembly is very inefficient. Yes, I know that we are using -O0 (no optimizations) for now, but this helps me see what will be improved in the future.

  1. The counting happens on a memory address rather than in a register, even though a memory access has higher latency. Worse, the counter gets loaded into a register every iteration anyway, just to do the comparison!
  2. The constant floating-point number is loaded from memory every iteration.
  3. The counter is converted into a floating-point number every iteration to compare against a constant floating-point number. This is unnecessary, and I know how to fix it.

The performance hiccup

The culprit is a single line: while(count < 1e8).

The e notation that I used to quickly type a large number is the reason the program has to go through all this floating-point nonsense: C defines 1e8 as a double-precision floating-point constant, so the counter is converted to a double for every comparison.

By changing while(count < 1e8) to while(count < 100'000'000) (the digit separators need C23; a plain 100000000 works everywhere), the important assembly code now looks like this:

	jmp	.L2
.L3:
	addq	$1, -8(%rbp)
.L2:
	cmpq	$99999999, -8(%rbp)
	jbe	.L3

This is a lot simpler! Now all that happens is that the counter gets compared to 99,999,999, and jbe (jump if below or equal, the unsigned comparison, matching our uint64_t) jumps to .L3 if the counter is at most 99,999,999. Unfortunately, this still operates on a memory address rather than a register, but it should still result in a considerable speedup.

Counted to 100 million in 50005 microseconds.

And occasionally, a value as high as 170710 microseconds…

When testing counting to 1 billion, the rate was similar: Counted to 1 billion in 496116 microseconds.

For now, let’s take a typical “good” run as the counting rate, since I am running this on a timesharing operating system and another process may occasionally steal CPU time from the count program.

If somebody knows why it’s occasionally worse, please let me know in a comment section on Lemmy or Hacker News.

Conclusion

As much of an improvement as this is, it’s a bit surprising that greatly simplifying the loop only helps this much. Yes, the CPU is probably achieving 1 increment every 2 cycles, but can we go faster? We have not explored the compiler optimization options yet, and I want to see what the limits of increment speed are on a single CPU core. That will be the topic of the next post in the series.

Until next time!

Summary table of counting speed

Method                  Increments / second
Very naive JavaScript   12.6M
Naive JavaScript        369M
Naive C                 1.55B
Optimized C (-O0)       2.02B