the buddy allocator that finally works
summary: a u64 bitmap, a mask, a count-trailing-zeroes. the smallest fitting order in two instructions.
Core idea: the allocator became simple enough to trust when block selection collapsed to a bitmap mask and a hardware trailing-zero count.
The allocator went through three rewrites this year. The trick that finally made it fast is two instructions long.
heap->avail is a u64. Bit i is set when there is at least one free block at order i. To find the smallest free block that fits a request of order order:
i = (AVAIL_MASK << order) & heap->avail;
i = __builtin_ctzll(i);
That is the entire "find a block" path. The mask drops orders below the requested one. The count-trailing-zeroes pulls the smallest set bit out of what remains. No tree walk, no list scan, no priority queue; the bitmap is the priority queue, and the hardware does the priority extraction.
Everything else in the allocator exists to keep that bitmap honest. Freelists per order are intrusive doubly-linked lists; popping the head clears the bit when the list goes empty, allocating splits a larger block down to size and lights up bits along the way, freeing walks up coalescing with buddies and lights up the bit at the merged order. None of that is novel - it is textbook buddy. What is novel, for me, is that the entire find-a-block decision is a mask and a bit-count, and the slab cache I bolted on for small frequent allocations is a fast path in front of an already fast path.
If I had figured out the bitmap on rewrite one I would not have needed rewrites two and three.