# Combiner Explanation
## Talk by ctiller, notes by vjpai

The typical way of doing a critical section is

```
mu.lock()
do_stuff()
mu.unlock()
```

An alternative way of doing it is

```
class combiner {
  run(f) {
    mu.lock()
    f()
    mu.unlock()
  }
  mutex mu;
}

combiner.run(do_stuff)
```

If two threads call into the combiner at the same time, there will be
some kind of queuing in place: the second caller waits on `mu` until the
first is done. It's called a `combiner` because you can pass in more
than one `do_stuff` at once, and they will all run under a common `mu`.
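
For concreteness, here is a minimal C++ sketch of the lock-based combiner
above (the names are illustrative, not gRPC code): two threads submit
closures, and the closures run one at a time because both calls serialize on
the same mutex.

```
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

class Combiner {
 public:
  void Run(std::function<void()> f) {
    std::lock_guard<std::mutex> lock(mu_);
    f();  // runs on the caller's thread, serialized against other callers
  }

 private:
  std::mutex mu_;
};

int main() {
  Combiner combiner;
  std::thread t1([&] { combiner.Run([] { std::cout << "work from t1\n"; }); });
  std::thread t2([&] { combiner.Run([] { std::cout << "work from t2\n"; }); });
  t1.join();
  t2.join();
  return 0;
}
```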

The implementation described above has the problem that it blocks a thread
for a period of time, and this is considered harmful because the thread
being blocked is an application thread.

Instead, we want a combiner with a new set of properties:
* Keep things running in serial execution
* Don't ever sleep the thread
* But maybe allow things to end up running on a different thread from where they were started
* This means that `do_stuff` doesn't necessarily run to completion by the time `combiner.run` returns

```
class combiner {
  mpscq q; // multi-producer single-consumer queue can be made non-blocking
  state s; // is it empty or executing

  run(f) {
    if (q.push(f)) {
      // q.push returns true if it's the first thing
      while (q.pop(&f)) { // modulo some extra work to avoid races
        f();
      }
    }
  }
}
```

The basic idea is that the first one to push onto the combiner
executes the work and then keeps executing functions from the queue
until the combiner is drained.
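
The "extra work to avoid races" is mostly about deciding who owns the drain
loop. Here is a minimal C++ sketch of that idea (my own simplification, not
the gRPC implementation: a mutex-protected queue plus a `draining_` flag
stands in for the lock-free MPSC queue): the first pusher becomes the
drainer, later pushers return immediately, and closures still execute
strictly one at a time.

```
#include <functional>
#include <mutex>
#include <queue>

class Combiner {
 public:
  void Run(std::function<void()> f) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      q_.push(std::move(f));
      if (draining_) return;  // another thread already owns the drain loop
      draining_ = true;       // the first pusher becomes the drainer
    }
    std::function<void()> item;
    for (;;) {
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty()) {
          draining_ = false;  // give up ownership only when truly empty
          return;
        }
        item = std::move(q_.front());
        q_.pop();
      }
      item();  // executed outside the lock, but still strictly serialized
    }
  }

 private:
  std::mutex mu_;
  std::queue<std::function<void()>> q_;
  bool draining_ = false;
};
```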

Our combiner does some additional work, with the motivation of write-batching.

We have a second tier of `run` called `run_finally`. Anything queued
onto `run_finally` runs after we have drained the queue. That means
that there is essentially a finally-queue. This is not guaranteed to
be final, but it's best-effort. In the process of running the finally
item, we might put something onto the main combiner queue and so we'll
need to re-enter.

`chttp2` runs all ops through the regular `run`, except that when it sees a
write it schedules that write with `run_finally`. That way, anything else that
gets put into the combiner while it is draining can add to that write (see the
usage sketch after the code below).

```
class combiner {
  mpscq q; // multi-producer single-consumer queue can be made non-blocking
  state s; // is it empty or executing
  queue finally; // you can only do run_finally when you are already running something from the combiner

  run(f) {
    if (q.push(f)) {
      // q.push returns true if it's the first thing
      loop:
      while (q.pop(&f)) { // modulo some extra work to avoid races
        f();
      }
      while (finally.pop(&f)) {
        f();
      }
      if (!q.empty() || !finally.empty()) goto loop; // a finally item may have queued more work
    }
  }
}
```
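
To make the write-batching use concrete, here is a hedged C++ usage sketch
(the `Combiner`/`RunFinally` class and the buffering scheme are my own
illustrative stand-ins, not chttp2 code): each op scheduled through `Run`
appends to a shared write buffer, and a single flush is deferred with
`RunFinally`, so any op that lands in the combiner before the drain finishes
joins the same write.

```
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>

class Combiner {
 public:
  void Run(std::function<void()> f) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      q_.push(std::move(f));
      if (draining_) return;  // someone else already owns the drain loop
      draining_ = true;
    }
    Drain();
  }

  // Only valid while the combiner is already running one of your closures.
  void RunFinally(std::function<void()> f) {
    std::lock_guard<std::mutex> lock(mu_);
    finally_.push(std::move(f));
  }

 private:
  void Drain() {
    std::function<void()> item;
    for (;;) {
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty() && finally_.empty()) {
          draining_ = false;  // nothing left; give up ownership
          return;
        }
        if (!q_.empty()) {  // main queue first ...
          item = std::move(q_.front());
          q_.pop();
        } else {            // ... finally items only when it is empty
          item = std::move(finally_.front());
          finally_.pop();
        }
      }
      item();  // run outside the lock
    }
  }

  std::mutex mu_;
  std::queue<std::function<void()>> q_;
  std::queue<std::function<void()>> finally_;
  bool draining_ = false;
};

int main() {
  Combiner combiner;
  std::string write_buffer;
  bool flush_scheduled = false;

  auto schedule_op = [&](const std::string& data) {
    combiner.Run([&, data] {
      write_buffer += data;    // every op just appends to the pending write
      if (!flush_scheduled) {  // the first op schedules one deferred flush
        flush_scheduled = true;
        combiner.RunFinally([&] {
          std::cout << "flush: " << write_buffer << "\n";  // one batched write
          write_buffer.clear();
          flush_scheduled = false;
        });
      }
    });
  };

  schedule_op("op1 ");
  schedule_op("op2 ");  // if this landed while the combiner was still draining
                        // (e.g. from another thread), it would join the same
                        // pending write instead of triggering a second flush
  return 0;
}
```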

So that explains how combiners work in general. In gRPC, there is
`start_batch(..., tag)` and then work only gets activated by somebody
calling `cq::next` which returns a tag. This gives an API-level
guarantee that there will be a thread doing polling to actually make
work happen. However, some operations are not covered by a poller
thread, such as a cancellation that doesn't have a completion. Other
callbacks that don't have a completion are the internal pieces of work
that get done before the batch gets completed. We need a condition
called `covered_by_poller` that means that the item will definitely
need some thread at some point to call `cq::next`. This includes those
callbacks that directly cause a completion, but also those that are
indirectly required before getting a completion. If we can't tell for
sure for a specific path, we have to assume it is not covered by a
poller.
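
A hedged sketch of how such a condition could be tracked (the two-argument
`Run`, the counter, and the helper name are assumptions for illustration, not
the actual gRPC API): callers pass a `covered_by_poller` hint when scheduling
work, and the combiner keeps a count of still-queued covered items that the
drain loop can consult.

```
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

class Combiner {
 public:
  // covered_by_poller: the caller promises that some thread will call
  // cq::next for this work, either directly or for a completion it leads to.
  void Run(std::function<void()> f, bool covered_by_poller) {
    if (covered_by_poller) covered_.fetch_add(1);
    {
      std::lock_guard<std::mutex> lock(mu_);
      q_.push({std::move(f), covered_by_poller});
      if (draining_) return;
      draining_ = true;
    }
    Drain();
  }

 private:
  void Drain() {
    std::pair<std::function<void()>, bool> item;
    for (;;) {
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty()) {
          draining_ = false;
          return;
        }
        item = std::move(q_.front());
        q_.pop();
      }
      if (item.second) covered_.fetch_sub(1);
      item.first();
      // The refinement described next would check
      // SomeQueuedItemIsCoveredByPoller() here to decide whether the rest of
      // the queue can be handed off to a thread that is polling anyway.
    }
  }

  // True if at least one still-queued closure is covered by a poller.
  bool SomeQueuedItemIsCoveredByPoller() const { return covered_.load() > 0; }

  std::atomic<int> covered_{0};
  std::mutex mu_;
  std::queue<std::pair<std::function<void()>, bool>> q_;
  bool draining_ = false;
};
```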

The above combiner has the problem that it keeps draining for a
potentially infinite amount of time and that can lead to a huge tail
latency for some operations. So we can tweak it by returning to the application
if we know that it is valid to do so:

```
while (q.pop(&f)) {
  f();
  if (control_can_be_returned && some_still_queued_thing_is_covered_by_poller) {
    offload_combiner_work_to_some_other_thread();
  }
}
```

`offload` is more than `break`; it does `break`, but it also causes some
other thread that is currently waiting on a poll to break out of its
poll. This is done by setting up a per-polling-island work-queue
(a distributor) with a wakeup FD. The work-queue is the converse of the
combiner: it tries to spray events onto as many threads as possible to get
as much concurrency as possible.

So `offload` really does:

```
workqueue.run(continue_from_while_loop);
break;
```

This requires adding another class variable for a `workqueue`
(which is really conceptually a distributor).

```
workqueue::run(f) {
  q.push(f)
  eventfd.wakeup()
}

workqueue::readable() {
  eventfd.consume();
  q.pop(&f);
  f();
  if (!q.empty()) {
    eventfd.wakeup(); // spray across as many threads as are waiting on this workqueue
  }
}
```
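
For concreteness, here is a minimal Linux-only sketch of that wakeup
mechanism (my own simplification, not the gRPC implementation: an `eventfd`
plus a mutex-protected queue standing in for the lock-free one). A poller
thread that sees `wakeup_fd()` become readable calls `Readable()`.

```
#include <sys/eventfd.h>
#include <unistd.h>

#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>

class Workqueue {
 public:
  Workqueue() : wakeup_fd_(eventfd(0, 0)) {}
  ~Workqueue() { close(wakeup_fd_); }

  // Producer side: enqueue work and make wakeup_fd_ readable so that a
  // thread waiting in poll breaks out of its poll.
  void Run(std::function<void()> f) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      q_.push(std::move(f));
    }
    Wakeup();
  }

  // Called by a poller thread when wakeup_fd_ becomes readable.
  void Readable() {
    uint64_t n;
    ssize_t r = read(wakeup_fd_, &n, sizeof(n));  // consume the wakeup
    (void)r;
    std::function<void()> f;
    bool more = false;
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (q_.empty()) return;
      f = std::move(q_.front());
      q_.pop();
      more = !q_.empty();
    }
    if (more) Wakeup();  // spray remaining items across other waiting threads
    f();
  }

  int wakeup_fd() const { return wakeup_fd_; }  // registered with the poller

 private:
  void Wakeup() {
    uint64_t one = 1;
    ssize_t r = write(wakeup_fd_, &one, sizeof(one));
    (void)r;
  }

  int wakeup_fd_;
  std::mutex mu_;
  std::queue<std::function<void()>> q_;
};
```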

In principle, `run_finally` could get starved, but this hasn't
happened in practice. If we were concerned about this, we could put a
limit on how many things come off the regular `q` before the `finally`
queue gets processed.
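
A hedged sketch of that mitigation (the free function, the queue types, and
the constant are hypothetical stand-ins, not real gRPC code): bound how many
main-queue closures run before the finally queue gets a turn.

```
#include <functional>
#include <queue>

// kMaxMainItemsBeforeFinally is an illustrative constant, not a real tunable.
constexpr int kMaxMainItemsBeforeFinally = 32;

void DrainWithFinallyFairness(std::queue<std::function<void()>>& main_q,
                              std::queue<std::function<void()>>& finally_q) {
  while (!main_q.empty() || !finally_q.empty()) {
    int ran = 0;
    // Bound how much of the main queue runs before `finally` gets a turn.
    while (ran < kMaxMainItemsBeforeFinally && !main_q.empty()) {
      auto f = std::move(main_q.front());
      main_q.pop();
      f();  // may push more work onto either queue
      ++ran;
    }
    while (!finally_q.empty()) {
      auto f = std::move(finally_q.front());
      finally_q.pop();
      f();
    }
  }
}
```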