Zephyr provides a library of common general purpose data structures used within the kernel, but useful by application code in general. These include list and balanced tree structures for storing ordered data, and a ring buffer for managing “byte stream” data in a clean way.
Note that in general, the collections are implemented as “intrusive” data structures. The “node” data is the only struct used by the library code, and it does not store a pointer or other metadata to indicate what user data is “owned” by that node. Instead, the expectation is that the node will be itself embedded within a user-defined struct. Macros are provided to retrieve a user struct address from the embedded node pointer in a clean way. The purpose behind this design is to allow the collections to be used in contexts where dynamic allocation is disallowed (i.e. there is no need to allocate node objects because the memory is provided by the user).
Note also that these libraries are uniformly unsynchronized; access to them is not threadsafe by default. These are data structures, not synchronization primitives. The expectation is that any locking needed will be provided by the user.
Balanced Red/Black Tree¶
For circumstances where sorted containers may become large at runtime, a list becomes problematic due to algorithmic costs of searching it. For these situations, Zephyr provides a balanced tree implementation which has runtimes on search and removal operations bounded at O(log2(N)) for a tree of size N. This is implemented using a conventional red/black tree as described by multiple academic sources.
struct rbtree tracking struct for a rbtree may be initialized
anywhere in user accessible memory. It should contain only zero bits
before first use. No specific initialization API is needed or
Unlike a list, where position is explicit, the ordering of nodes
within an rbtree must be provided as a predicate function by the user.
A function of type
rb_lessthan_t() should be assigned to the
lessthan_fn field of the
struct rbtree before any tree
operations are attempted. This function should, as its name suggests,
return a boolean True value if the first node argument is “less than”
the second in the ordering desired by the tree. Note that “equal” is
not allowed, nodes within a tree must have a single fixed order for
the algorithm to work correctly.
As with the slist and dlist containers, nodes within an rbtree are
represented as a
struct rbnode structure which exists in
user-managed memory, typically embedded within the the data structure
being tracked in the tree. Unlike the list code, the data within an
rbnode is entirely opaque. It is not possible for the user to extract
the binary tree topology and “manually” traverse the tree as it is for
Nodes can be inserted into a tree with
rb_insert() and removed
rb_remove(). Access to the “first” and “last” nodes within a
tree (in the sense of the order defined by the comparison function) is
rb_get_max(). There is also a
rb_contains(), which returns a boolean True if the
provided node pointer exists as an element within the tree. As
described above, all of these routines are guaranteed to have at most
log time complexity in the size of the tree.
There are two mechanisms provided for enumerating all elements in an
rbtree. The first,
rb_walk(), is a simple callback implementation
where the caller specifies a C function pointer and an untyped
argument to be passed to it, and the tree code calls that function for
each node in order. This has the advantage of a very simple
implementation, at the cost of a somewhat more cumbersome API for the
user (not unlike ISO C’s
bsearch() routine). It is a recursive
implementation, however, and is thus not always available in
environments that forbid the use of unbounded stack techniques like
There is also a
RB_FOR_EACH() iterator provided, which, like the
similar APIs for the lists, works to iterate over a list in a more
natural way, using a nested code block instead of a callback. It is
also nonrecursive, though it requires log-sized space on the stack by
default (however, this can be configured to use a fixed/maximally size
buffer instead where needed to avoid the dynamic allocation). As with
the lists, this is also available in a
variant which enumerates using a pointer to a container field and not
the raw node pointer.
As described, the Zephyr rbtree implementation is a conventional red/black tree as described pervasively in academic sources. Low level details about the algorithm are out of scope for this document, as they match existing conventions. This discussion will be limited to details notable or specific to the Zephyr implementation.
The core invariant guaranteed by the tree is that the path from the root of the tree to any leaf is no more than twice as long as the path to any other leaf. This is achieved by associating one bit of “color” with each node, either red or black, and enforcing a rule that no red child can be a child of another red child (i.e. that the number of black nodes on any path to the root must be the same, and that no more than that number of “extra” red nodes may be present). This rule is enforced by a set of rotation rules used to “fix” trees following modification.
These rotations are conceptually implemented on top of a primitive
that “swaps” the position of one node with another in the list.
Typical implementations effect this by simply swapping the nodes
internal “data” pointers, but because the Zephyr
struct rbnode is
intrusive, that cannot work. Zephyr must include somewhat more
elaborate code to handle the edge cases (for example, one swapped node
can be the root, or the two may already be parent/child).
struct rbnode struct for a Zephyr rbtree contains only two
pointers, representing the “left”, and “right” children of a node
within the binary tree. Traversal of a tree for rebalancing following
modification, however, routinely requires the ability to iterate
“upwards” from a node as well. It is very common for red/black trees
in the industry to store a third “parent” pointer for this purpose.
Zephyr avoids this requirement by building a “stack” of node pointers
locally as it traverses downward thorugh the tree and updating it
appropriately as modifications are made. So a Zephyr rbtree can be
implemented with no more runtime storage overhead than a dlist.
These properties, of a balanced tree data structure that works with only two pointers of data per node and that works without any need for a memory allocation API, are quite rare in the industry and are somewhat unique to Zephyr.
For circumstances where an application needs to implement asynchronous
“streaming” copying of data, Zephyr provides a
abstraction to manage copies of such data in and out of a shared
buffer of memory. Ring buffers may be used in either “bytes” mode,
where the data to be streamed is an uninterpreted array of bytes, or
“items” mode where the data much be an integral number of 32 bit
words. While the underlying data structure is the same, it is not
legal to mix these two modes on a single ring buffer instance. A ring
buffer initialized with a byte count must be used only with the
“bytes” API, one initialized with a word count must use the “items”
struct ring_buf may be placed anywhere in user-accessible
memory, and must be initialized with
ring_buf_init() before use.
This must be provided a region of user-controlled memory for use as
the buffer itself. Note carefully that the units of the size of the
buffer passed change (either bytes or words) depending on how the ring
buffer will be used later. Macros for combining these steps in a
single static declaration exist for convenience.
RING_BUF_DECLARE() will declare and statically initialize a ring
buffer with a specified byte count, where
RING_BUF_ITEM_DECLARE_SIZE() will declare and statically
initialize a buffer with a given count of 32 bit words.
RING_BUF_ITEM_DECLARE_POW2() can be used to initialize an
items-mode buffer with a memory region guaranteed to be a power of
two, which enables various optimizations internal to the
implementation. No power-of-two initialization is available for
bytes-mode ring buffers.
“Bytes” data may be copied into the ring buffer using
ring_buf_put(), passing a data pointer and byte count. These
bytes will be copied into the buffer in order, as many as will fit in
the allocated buffer. The total number of bytes copied (which may be
fewer than provided) will be returned. Likewise
will copy bytes out of the ring buffer in the order that they were
written, into a user-provided buffer, returning the number of bytes
that were transferred.
To avoid multiply-copied-data situations, a “claim” API exists for
ring_buf_put_claim() takes a byte size value from the
user and returns a pointer to memory internal to the ring buffer that
can be used to receive those bytes, along with a size of the
contiguous internal region (which may be smaller than requested). The
user can then copy data into that region at a later time without
assembling all the bytes in a single region first. When complete,
ring_buf_put_finish() can be used to signal the buffer that the
transfer is complete, passing the number of bytes actually
transferred. At this point a new transfer can be initiated.
ring_buf_get_claim() returns a pointer to internal ring
buffer data from which the user can read without making a verbatim
ring_buf_get_finish() signals the buffer with how many
bytes have been consumed and allows for a new transfer to begin.
“Items” mode works similarly to bytes mode, except that all transfers
are in units of 32 bit words and all memory is assumed to be aligned
on 32 bit boundaries. The write and read operations are
ring_buf_item_get(), and work
otherwise identically to the bytes mode APIs. There no “claim” API
provided for items mode. One important difference is that unlike
ring_buf_item_put() will not do a partial
transfer; it will return an error in the case where the provided data
does not fit in its entirety.
The user can manage the capacity of a ring buffer without modifying it
ring_buf_space_get() call (which returns a value of
either bytes or items depending on how the ring buffer has been used),
or by testing the
ring_buf_reset() call exists to immediately empty a
ring buffer, discarding the tracking of any bytes or items already
written to the buffer. It does not modify the memory contents of the
buffer itself, however.
Ring Buffer Internals¶
Data streamed through a ring buffer is always written to the next byte
within the buffer, wrapping around to the first element after reaching
the end, thus the “ring” structure. Internally, the
ring_buf contains its own buffer pointer and its size, and also a
“head” and “tail” index representing where the next read and write
This boundary is invisible to the user using the normal put/get APIs, but becomes a barrier to the “claim” API, because obviously no contiguous region can be returned that crosses the end of the buffer. This can be surprising to application code, and produce performance artifacts when transfers need to alias closely to the size of the buffer, as the number of calls to claim/finish need to double for such transfers.
When running in items mode (only), the ring buffer contains two implementations for the modular arithmetic required to compute “next element” offsets. One is used for arbitrary sized buffers, but the other is optimized for power of two sizes and can replace the compare and subtract steps with a simple bitmask in several places, at the cost of testing the “mask” value for each call.