morsel-driven, finally
summary: rayforce2 with the teide pipeline underneath rayfall. morsels make cancellation, fusion, and profiling structurally simpler.
Core idea: morsel-driven execution moved rayforce2 from isolated vector kernels to schedulable, cache-sized units of query work.
In rayforce, cancelling a long query depended on a fragile thread-state machine. In rayforce2 it is a five-line check. The reason is morsels.
A morsel is a fixed-size slice of a column, 1024 elements, and every kernel takes morsels in and produces morsels out. There is no scalar fallback path; the partial last morsel is just its own morsel with fewer elements. The pipeline is one long chain of morsels.
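The slicing rule is small enough to sketch in C. This is an illustration, not rayforce2's actual code: it assumes 64-bit integer columns, and the names morsel_t, morsel_count, morsel_at and add_const_kernel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MORSEL_SIZE ((size_t)1024)  /* fixed morsel width, per the text */

/* Hypothetical view type: a morsel borrows a slice of a column. */
typedef struct {
    const int64_t *data;
    size_t len;  /* MORSEL_SIZE, or fewer for the final morsel */
} morsel_t;

/* Number of morsels needed to cover n elements (ceiling division). */
size_t morsel_count(size_t n) {
    return (n + MORSEL_SIZE - 1) / MORSEL_SIZE;
}

/* Return the i-th morsel of a column holding n elements. */
morsel_t morsel_at(const int64_t *col, size_t n, size_t i) {
    size_t start = i * MORSEL_SIZE;
    assert(start < n);  /* i must be below morsel_count(n) */
    size_t rest = n - start;
    morsel_t m = { col + start, rest < MORSEL_SIZE ? rest : MORSEL_SIZE };
    return m;
}

/* Example kernel: morsel in, morsel out. The same loop serves full and
   partial morsels, so no scalar fallback is needed. The caller supplies
   an output buffer with room for at least in.len elements. */
morsel_t add_const_kernel(morsel_t in, int64_t k, int64_t *out) {
    for (size_t j = 0; j < in.len; j++) {
        out[j] = in.data[j] + k;
    }
    morsel_t m = { out, in.len };
    return m;
}
```

For a 2500-element column this yields three morsels of 1024, 1024 and 452 elements, and the kernel treats the 452-element tail exactly like the other two.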
Once the state of a query lives between morsels rather than inside any single kernel call, cancelling the query stops being a problem. There is nothing in flight to unwind. The check if (cancelled) return RAY_ERR_CANCELLED; runs at every morsel boundary, the next morsel never gets requested, and the query is over. No thread state. No partial commits. No timer-vs-handler race.
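Here is roughly what that boundary check looks like inside a driver loop. It is a sketch under assumptions: only RAY_ERR_CANCELLED comes from the text, while query_t, query_init, query_cancel, run_query, RAY_OK, the enum values and the doubling kernel are invented for illustration. The flag is a C11 atomic so another thread can set it safely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MORSEL_SIZE ((size_t)1024)

/* Only RAY_ERR_CANCELLED is named in the text; the values are assumed. */
typedef enum { RAY_OK = 0, RAY_ERR_CANCELLED = 1 } ray_status_t;

/* Hypothetical per-query state. */
typedef struct {
    atomic_bool cancelled;  /* set by any thread to request cancellation */
    size_t morsels_done;    /* morsels fully processed so far */
} query_t;

void query_init(query_t *q) {
    atomic_init(&q->cancelled, false);
    q->morsels_done = 0;
}

void query_cancel(query_t *q) {
    atomic_store(&q->cancelled, true);
}

/* Drive a stand-in kernel (doubling) over a column, one morsel at a time. */
ray_status_t run_query(query_t *q, const int64_t *col, size_t n, int64_t *out) {
    assert(q != NULL);
    for (size_t start = 0; start < n; start += MORSEL_SIZE) {
        /* Morsel boundary: no kernel is mid-flight, so there is
           nothing to unwind. Just stop requesting morsels. */
        if (atomic_load(&q->cancelled)) return RAY_ERR_CANCELLED;

        size_t rest = n - start;
        size_t len = rest < MORSEL_SIZE ? rest : MORSEL_SIZE;
        for (size_t j = 0; j < len; j++) {
            out[start + j] = col[start + j] * 2;
        }
        q->morsels_done++;
    }
    return RAY_OK;
}
```

Cancellation latency in this shape is bounded by the time one morsel takes, since the flag is read once per morsel and never inside the kernel loop.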
Fusion and profiling fall out of the same shape: two kernels that produce and consume the same morsel can fuse into one loop that stays in L1 cache, and flame graphs become honest because time is attributed to morsels rather than function calls. The cancellation gain, though, is the one I had not predicted. It is the kind of decision whose absence is invisible at month one and unfixable at year three. Starting fresh with the rule baked in took six weeks. Retrofitting it into rayforce would have taken a year.
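To make the fusion point concrete, here is a sketch of two kernels run back to back versus the same work fused into one loop. The kernels (scale, then offset), the function names and the 64-bit element type are all hypothetical; the sketch shows the shape of the transformation, not how rayforce2 decides what to fuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MORSEL_SIZE ((size_t)1024)

/* Kernel 1: multiply every element of a morsel by k. */
void scale_morsel(const int64_t *in, size_t len, int64_t k, int64_t *out) {
    for (size_t j = 0; j < len; j++) out[j] = in[j] * k;
}

/* Kernel 2: add b to every element of a morsel. */
void offset_morsel(const int64_t *in, size_t len, int64_t b, int64_t *out) {
    for (size_t j = 0; j < len; j++) out[j] = in[j] + b;
}

/* Unfused: the intermediate morsel is written out and read back.
   At 1024 x 8 bytes it is 8 KiB. */
void scale_then_offset(const int64_t *in, size_t len,
                       int64_t k, int64_t b, int64_t *out) {
    assert(len <= MORSEL_SIZE);
    int64_t tmp[1024];  /* one morsel of scratch space */
    scale_morsel(in, len, k, tmp);
    offset_morsel(tmp, len, b, out);
}

/* Fused: one pass over the morsel. The intermediate value lives
   only in a register and never touches memory. */
void scale_offset_fused(const int64_t *in, size_t len,
                        int64_t k, int64_t b, int64_t *out) {
    for (size_t j = 0; j < len; j++) out[j] = in[j] * k + b;
}
```

Both paths produce identical output. The fused one removes the intermediate buffer and one of the two passes over the data.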