SPDK Threads, Pollers & Buffer Architecture
A deep dive into SPDK's userspace threading model, polling infrastructure, and memory buffer management — with code references to the actual implementation and real-world scenarios.
The Core Distinction
SPDK uses a run-to-completion model. There are no kernel context switches, no blocking syscalls, and no mutexes in the I/O hot path. Everything runs on lightweight userspace threads that are polled by the application's reactor loop.
spdk_thread
Execution Context: a stackless, lightweight thread (NOT a POSIX thread). It's a logical execution unit containing:
- Active pollers — zero-period callbacks that run every poll cycle
- Timed pollers — callbacks on a periodic timer (RB-tree sorted)
- Paused pollers — temporarily suspended pollers
- Message ring — lockless MP/SC ring for cross-thread messages (many producer threads, one consumer)
- I/O channels — per-thread connections to I/O devices
- CPU affinity mask — which cores this thread can run on
struct spdk_thread {
uint64_t tsc_last;
struct spdk_thread_stats stats;
TAILQ_HEAD(, spdk_poller) active_pollers;
RB_HEAD(, spdk_poller) timed_pollers;
TAILQ_HEAD(, spdk_poller) paused_pollers;
struct spdk_ring *messages;
RB_HEAD(, spdk_io_channel) io_channels;
struct spdk_cpuset cpumask;
bool is_bound;
bool in_interrupt;
// ...
};
spdk_poller
Repeated Callback: a function repeatedly invoked on the same spdk_thread. Pollers are the workhorses — they check for I/O completions, process admin commands, handle network events, etc.
- Active poller (period=0) — runs every poll cycle, round-robin
- Timed poller (period>0) — runs at intervals, sorted in an RB-tree by next_run_tick
- Returns SPDK_POLLER_BUSY (did work) or SPDK_POLLER_IDLE (no work)
- Can be paused/resumed dynamically
struct spdk_poller {
TAILQ_ENTRY(spdk_poller) tailq;
RB_ENTRY(spdk_poller) node;
uint64_t period_ticks;
uint64_t next_run_tick;
uint64_t run_count;
uint64_t busy_count;
spdk_poller_fn fn;
void *arg;
struct spdk_thread *thread;
char name[SPDK_MAX_POLLER_NAME_LEN + 1];
};
Key Relationship: One Thread, Many Pollers (1:N)
One spdk_thread owns many pollers. When spdk_thread_poll() is called by the reactor, it iterates through all active pollers (round-robin), then checks timed pollers whose deadline has arrived. Each poller function runs to completion before the next one is invoked — no preemption.
The reactor (SPDK's app/event framework, built on top of DPDK's environment layer) creates one POSIX thread per core, and each of those POSIX threads drives one or more spdk_threads in a tight loop.
What happens inside spdk_thread_poll()
lib/thread/thread.c:1120 — thread_poll()
lib/thread/thread.c:1223 — spdk_thread_poll()
This is the heart of SPDK's execution model. Every call to spdk_thread_poll() does the following, in order:
flowchart TD
A["spdk_thread_poll(thread, max_msgs, now)"] --> B["1. Process critical_msg
One-shot emergency callback"]
B --> C["2. msg_queue_run_batch()
Drain up to max_msgs from ring buffer
Cross-thread messages land here"]
C --> D["3. Active Pollers (round-robin)
TAILQ_FOREACH_REVERSE_SAFE
Each poller: fn(arg) → BUSY|IDLE"]
D --> E["4. Timed Pollers (deadline check)
RB-tree sorted by next_run_tick
Only run if now >= next_run_tick"]
E --> F["5. thread_update_stats()
Update busy/idle tsc counters"]
F --> G{"Thread exiting?"}
G -->|Yes| H["thread_exit() — clean up"]
G -->|No| I["Return to reactor loop"]
I --> A
style A fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style B fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style C fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style D fill:#0f766e,color:#fff,stroke:#0f766e
style E fill:#0f766e,color:#fff,stroke:#0f766e
style F fill:#3f6212,color:#fff,stroke:#3f6212
Actual code from thread_poll() — lib/thread/thread.c:1120
static int
thread_poll(struct spdk_thread *thread, uint32_t max_msgs, uint64_t now)
{
uint32_t msg_count;
struct spdk_poller *poller, *tmp;
spdk_msg_fn critical_msg;
int rc = 0;
thread->tsc_last = now;
// Step 1: Process critical message (single, high-priority)
critical_msg = thread->critical_msg;
if (spdk_unlikely(critical_msg != NULL)) {
critical_msg(NULL);
thread->critical_msg = NULL;
rc = 1;
}
// Step 2: Drain message queue (cross-thread msgs)
msg_count = msg_queue_run_batch(thread, max_msgs);
if (msg_count) { rc = 1; }
// Step 3: Execute ALL active pollers (period_ticks == 0)
TAILQ_FOREACH_REVERSE_SAFE(poller, &thread->active_pollers,
active_pollers_head, tailq, tmp) {
int poller_rc = thread_execute_poller(thread, poller);
if (poller_rc > rc) { rc = poller_rc; }
}
// Step 4: Execute timed pollers whose deadline arrived
poller = thread->first_timed_poller;
while (poller != NULL) {
if (now < poller->next_run_tick) break; // sorted, so stop early
tmp = RB_NEXT(...);
RB_REMOVE(...);
int timer_rc = thread_execute_timed_poller(thread, poller, now);
if (timer_rc > rc) { rc = timer_rc; }
poller = tmp;
}
return rc;
}
Scenario: bdev_nvme_attach_controller RPC
When you issue bdev_nvme_attach_controller, SPDK creates an NVMe controller, registers pollers for I/O and admin queues, and creates bdevs for each namespace. Here's the thread/poller dance:
sequenceDiagram
participant RPC as JSON-RPC Thread (app thread)
participant NVMe as NVMe Driver
participant BdevMod as bdev_nvme module
participant Thread as SPDK Thread (I/O core)
RPC->>NVMe: spdk_nvme_connect_async(trid)
Note over RPC: Probe poller registered<br/>on app thread
NVMe-->>BdevMod: connect_attach_cb()
BdevMod->>BdevMod: nvme_ctrlr_create()
Note over BdevMod: Allocates nvme_ctrlr struct<br/>on app thread
BdevMod->>BdevMod: SPDK_POLLER_REGISTER(bdev_nvme_poll_adminq, period=1000μs)
Note over BdevMod: Admin queue poller<br/>timed poller on app thread
BdevMod->>BdevMod: spdk_io_device_register(nvme_ctrlr)
Note over BdevMod: Enables I/O channel<br/>creation on any thread
BdevMod->>BdevMod: nvme_ctrlr_create_done()
BdevMod->>BdevMod: Register bdevs for each NVMe namespace
Note over BdevMod: bdev_auto_examine triggers<br/>if enabled
RPC-->>Thread: When I/O channel opened:
Thread->>Thread: bdev_nvme_create_poll_group_cb()
Thread->>Thread: spdk_nvme_poll_group_create()
Thread->>Thread: SPDK_POLLER_REGISTER(bdev_nvme_poll, period=0)
Note over Thread: I/O completion poller<br/>active poller (period=0)<br/>runs EVERY poll cycle
Pollers Created During NVMe Attach
| Poller | Type | Thread | Purpose |
|---|---|---|---|
| bdev_nvme_poll_adminq | Timed (1000μs) | App thread | Poll admin queue for completions (identify, set features, etc.) |
| bdev_nvme_poll | Active (0μs) | I/O thread | Poll NVMe I/O queue for completions — the hot path |
Buffer Usage During NVMe I/O
When an I/O arrives at the NVMe bdev:
- spdk_bdev_io is pulled from the per-thread cache (fast) or the global pool (slower)
- If the I/O needs a data buffer, it requests an iobuf — small (≤8K) or large (≤132K)
- The bdev_nvme_poll active poller calls spdk_nvme_poll_group_process_completions()
- On completion, the bdev_io goes back to the cache/pool and the iobuf is released
bdev_nvme_poll() — the NVMe I/O completion poller
static int
bdev_nvme_poll(void *arg)
{
struct nvme_poll_group *group = arg;
int64_t num_completions;
num_completions = spdk_nvme_poll_group_process_completions(
group->group, 0, bdev_nvme_disconnected_qpair_cb);
// Returns BUSY if completions processed, IDLE otherwise
// This tells the thread whether this cycle did useful work
return num_completions > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
}
Scenario: bdev_raid_create RPC
Creating a RAID over NVMe bdevs adds another layer. The RAID bdev stacks on top of base bdevs, and each I/O thread gets its own RAID channel that fans out to the underlying NVMe channels.
flowchart TD
subgraph AppThread["App Thread (thread_poll loop)"]
RPC["bdev_raid_create RPC"] --> RC["raid_bdev_create()"]
RC --> IOD["spdk_io_device_register(raid_bdev)"]
IOD --> ABB["raid_bdev_add_base_bdev() × N"]
ABB --> CFG["raid_bdev_configure()"]
CFG --> REG["spdk_bdev_register(raid_bdev)"]
REG --> EXAM{"bdev_auto_examine?"}
EXAM -->|Yes| AE["Notify all modules
of new bdev"]
EXAM -->|No| SKIP["Skip — manual examine later"]
end
subgraph IOThread["I/O Thread (when channel opened)"]
OPEN["spdk_get_io_channel(raid_bdev)"] --> CB["raid_bdev_create_cb()"]
CB --> ALLOC["Allocate raid_bdev_io_channel"]
ALLOC --> BASE["Get io_channel for each
base NVMe bdev"]
BASE --> READY["RAID channel ready
for I/O submission"]
end
REG -.->|"I/O channel created
on first I/O"| OPEN
style RPC fill:#9a3412,color:#fff,stroke:#9a3412
style RC fill:#9a3412,color:#fff,stroke:#9a3412
style OPEN fill:#0f766e,color:#fff,stroke:#0f766e
style CB fill:#0f766e,color:#fff,stroke:#0f766e
RAID + Pollers + Buffers: How They Interact
RAID itself does not register its own pollers — it relies on the underlying bdev modules' pollers. Here's what happens during a RAID I/O:
- Application submits I/O to RAID bdev → spdk_bdev_io allocated from the per-thread cache (bdev_io_cache_size)
- RAID splits/stripes the I/O across base bdevs → may need additional bdev_io structs from the pool
- Each sub-I/O may request an iobuf (small or large) for data transfer
- Sub-I/Os submitted to NVMe bdevs → they go to the NVMe queue pair
- The bdev_nvme_poll active poller picks up completions
- RAID completion callback aggregates results
- bdev_io structs returned to per-thread cache (or pool if cache full)
Scenario: nvmf_create_subsystem + nvmf_subsystem_add_ns
Exposing the RAID bdev over NVMe-oF adds the NVMf transport layer. This creates new pollers for network I/O and chains through the bdev layer to RAID and ultimately NVMe.
flowchart TD
subgraph Host["Remote NVMe-oF Host"]
HI["NVMe-oF Initiator"]
end
subgraph NVMfTarget["SPDK NVMf Target"]
subgraph TPG["Transport Poll Group
(per I/O thread)"]
TP["nvmf_tgroup_poll
Active Poller (period=0)"]
TP --> RECV["Receive NVMe commands
from network"]
RECV --> BDEV["Submit to bdev layer"]
end
subgraph BdevLayer["Bdev Layer"]
BDEV --> BIO["Allocate spdk_bdev_io
from per-thread cache"]
BIO --> IBUF["Request iobuf
(small or large)"]
IBUF --> RAID["RAID bdev submit_request"]
end
subgraph RaidLayer["RAID Layer"]
RAID --> SPLIT["Split across base bdevs"]
SPLIT --> NB1["NVMe bdev 1"]
SPLIT --> NB2["NVMe bdev 2"]
end
subgraph NVMeLayer["NVMe Driver"]
NB1 --> NQP1["NVMe QP 1"]
NB2 --> NQP2["NVMe QP 2"]
NVPOLL["bdev_nvme_poll
Active Poller"] --> NQP1
NVPOLL --> NQP2
end
end
HI -->|"NVMe-oF TCP/RDMA"| TP
NQP1 -->|"Completion"| NVPOLL
NVPOLL -->|"Completion chain"| TP
TP -->|"Send response"| HI
style TP fill:#3f6212,color:#fff,stroke:#3f6212
style NVPOLL fill:#0f766e,color:#fff,stroke:#0f766e
style BIO fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style IBUF fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
NVMf Pollers
| Poller | Type | Purpose |
|---|---|---|
| nvmf_tgroup_poll | Active (0μs) | Process incoming NVMe-oF commands from the network transport |
| accept_poller | Varies by transport | Accept new connections (TCP/RDMA/vfio-user) |
iobuf Usage in NVMf
NVMf transport validates io_unit_size against iobuf pool sizes:
// lib/nvmf/transport.c:295
if (ctx->opts.io_unit_size > opts_iobuf.large_bufsize) {
SPDK_ERRLOG("io_unit_size %d > large_bufsize %d\n",
ctx->opts.io_unit_size, opts_iobuf.large_bufsize);
}
if (ctx->opts.io_unit_size <= opts_iobuf.small_bufsize) {
count = opts_iobuf.small_pool_count;
} else {
count = spdk_min(opts_iobuf.small_pool_count,
opts_iobuf.large_pool_count);
}
lib/nvmf/transport.c:295-305
Two-Level Allocation: Global Pool → Per-Thread Cache
Every bdev I/O operation requires a spdk_bdev_io structure. SPDK uses a two-level scheme to minimize contention: a global mempool (shared, uses atomic ops) and per-thread caches (lockless STAILQ).
flowchart TD
subgraph Global["Global bdev_io Pool
(spdk_mempool, shared across all threads)"]
POOL["bdev_io_pool_size = 65535
Lock-free mempool (DPDK ring)"]
end
subgraph T1["Thread 1 — bdev_mgmt_channel"]
C1["per_thread_cache
STAILQ of bdev_io
cache_size = 256"]
IWQ1["io_wait_queue
Waiters when pool exhausted"]
end
subgraph T2["Thread 2 — bdev_mgmt_channel"]
C2["per_thread_cache
STAILQ of bdev_io
cache_size = 256"]
IWQ2["io_wait_queue"]
end
subgraph T3["Thread N — bdev_mgmt_channel"]
C3["per_thread_cache
STAILQ of bdev_io
cache_size = 256"]
IWQ3["io_wait_queue"]
end
POOL -->|"Pre-populate at
channel creation"| C1
POOL -->|"Pre-populate"| C2
POOL -->|"Pre-populate"| C3
C1 -->|"Return when
cache full"| POOL
C2 -->|"Return when
cache full"| POOL
C3 -->|"Return when
cache full"| POOL
style POOL fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style C1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style C2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style C3 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
bdev_io_pool_size
Global Pool. Default: 65535 (64K - 1, optimal for a DPDK ring). Total spdk_bdev_io structures allocated at startup in a shared mempool.
Constraint: Must be ≥ bdev_io_cache_size × (thread_count + 1) — because each thread pre-populates its cache at channel creation.
// lib/bdev/bdev.c:519
min_pool_size = opts->bdev_io_cache_size
* (spdk_thread_get_count() + 1);
if (opts->bdev_io_pool_size < min_pool_size) {
SPDK_ERRLOG("bdev_io_pool_size %" PRIu32
" is not compatible with "
"bdev_io_cache_size %" PRIu32
" and %" PRIu32 " threads\n", ...);
}
lib/bdev/bdev.c:38 — #define SPDK_BDEV_IO_POOL_SIZE (64 * 1024 - 1)
lib/bdev/bdev.c:2350 — spdk_mempool_create()
bdev_io_cache_size
Per-Thread Cache. Default: 256. Maximum spdk_bdev_io structs cached per thread in a lockless STAILQ.
Hot path: I/O allocation first checks the local cache (no atomics!). Only falls back to the global pool if cache is empty.
// lib/bdev/bdev.c:2157 — Pre-populate cache
remaining = ch->bdev_io_cache_size
= g_bdev_opts.bdev_io_cache_size;
while (remaining > 0) {
spdk_mempool_get_bulk(g_bdev_mgr.bdev_io_pool,
bdev_ios, count);
for (i = 0; i < count; i++) {
STAILQ_INSERT_HEAD(&ch->per_thread_cache,
bdev_ios[i], ...);
ch->per_thread_cache_count++;
}
}
// lib/bdev/bdev.c:2644 — Return to cache
if (ch->per_thread_cache_count < ch->bdev_io_cache_size) {
ch->per_thread_cache_count++;
STAILQ_INSERT_HEAD(&ch->per_thread_cache, bdev_io, ...);
// Also wake any waiters on io_wait_queue
} else {
// Cache full — return to global pool
spdk_mempool_put(g_bdev_mgr.bdev_io_pool, bdev_io);
}
lib/bdev/bdev.c:39 — #define SPDK_BDEV_IO_CACHE_SIZE 256
bdev_auto_examine
Discovery. Default: true. When a new bdev is registered (e.g., NVMe namespace discovered), the bdev layer automatically notifies all bdev modules to examine it. Modules like lvol, raid, crypto, etc. get a chance to claim or layer on top of it.
// lib/bdev/bdev.c:718
if (g_bdev_opts.bdev_auto_examine) {
bdev_examine(bdev); // Notify all modules
}
// lib/bdev/bdev.c:839
if (g_bdev_opts.bdev_auto_examine) {
bdev_examine(bdev); // Also on hot-plug
}
When to disable: In production systems where you want explicit control over which bdevs are examined, preventing unwanted modules from claiming devices. Set to false and use bdev_examine RPC manually.
Three-Level Buffer System: Global Pool → Per-Thread Cache → I/O Consumer
SPDK's iobuf subsystem manages DMA-capable data buffers separately from bdev_io control structures. It uses two size classes (small and large) with both global pools (per NUMA node) and per-thread caches.
flowchart TD
subgraph NUMA["Per-NUMA Global Pools
(hugepage-backed, DMA-capable)"]
SP["Small Pool
small_pool_count = 8192
small_bufsize = 8KB
spdk_ring (MP/MC)"]
LP["Large Pool
large_pool_count = 1024
large_bufsize = 132KB
spdk_ring (MP/MC)"]
end
subgraph TCH1["Thread 1 — spdk_iobuf_channel"]
SC1["Small Cache
iobuf_small_cache_size = 128"]
LC1["Large Cache
iobuf_large_cache_size = 16"]
end
subgraph TCH2["Thread 2 — spdk_iobuf_channel"]
SC2["Small Cache = 128"]
LC2["Large Cache = 16"]
end
subgraph Consumer["I/O Consumers"]
BDEV["bdev layer"]
NVMF["NVMf transport"]
ACCEL["accel framework"]
end
SP -->|"Refill cache"| SC1
SP -->|"Refill cache"| SC2
LP -->|"Refill cache"| LC1
LP -->|"Refill cache"| LC2
SC1 -->|"Return on free"| SP
LC1 -->|"Return on free"| LP
SC1 --> BDEV
LC1 --> BDEV
SC1 --> NVMF
LC1 --> NVMF
SC2 --> ACCEL
style SP fill:#6b21a8,color:#fff,stroke:#6b21a8
style LP fill:#6b21a8,color:#fff,stroke:#6b21a8
style SC1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style LC1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style SC2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style LC2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
small_pool_count / large_pool_count
Global Pools. Defaults: small = 8192, large = 1024
Number of buffers in the global per-NUMA ring. These are allocated from hugepages (spdk_malloc with SPDK_MALLOC_DMA flag) at startup.
Minimums: small ≥ 64, large ≥ 8
// lib/thread/iobuf.c:13-14, 72-73
#define IOBUF_MIN_SMALL_POOL_SIZE 64
#define IOBUF_MIN_LARGE_POOL_SIZE 8
#define IOBUF_DEFAULT_SMALL_POOL_SIZE 8192
#define IOBUF_DEFAULT_LARGE_POOL_SIZE 1024
small_bufsize / large_bufsize
Buffer Sizes. Defaults: small = 8KB, large = 132KB
Size of each buffer. Aligned to 4096 bytes. The 132KB large buffer size accommodates the default max I/O size (128K) plus interleaved metadata.
// lib/thread/iobuf.c:17-25, 74-75
#define IOBUF_MIN_SMALL_BUFSIZE 4096
#define IOBUF_MIN_LARGE_BUFSIZE 8192
#define IOBUF_DEFAULT_SMALL_BUFSIZE (8 * 1024)
// 132k = 128k data + metadata
#define IOBUF_DEFAULT_LARGE_BUFSIZE (132 * 1024)
iobuf_small_cache_size / iobuf_large_cache_size
Per-Thread. Defaults: small = 128, large = 16
Each bdev mgmt channel (per thread) creates an iobuf channel with these cache sizes. Reduces trips to the global pool.
// lib/bdev/bdev.c:42-43
#define BUF_SMALL_CACHE_SIZE 128
#define BUF_LARGE_CACHE_SIZE 16
// lib/bdev/bdev.c:2149-2150
spdk_iobuf_channel_init(&ch->iobuf,
"bdev",
g_bdev_opts.iobuf_small_cache_size,
g_bdev_opts.iobuf_large_cache_size);
lib/bdev/bdev.c:2149
iobuf pool initialization — how buffers are allocated from hugepages
// lib/thread/iobuf.c:118 — iobuf_node_initialize()
static int iobuf_node_initialize(struct iobuf_node *node, uint32_t numa_id) {
struct spdk_iobuf_opts *opts = &g_iobuf.opts;
// Small pool: MP/MC ring for thread-safe access
node->small_pool = spdk_ring_create(SPDK_RING_TYPE_MP_MC,
opts->small_pool_count, numa_id);
// Allocate contiguous hugepage memory for all small buffers
node->small_pool_base = spdk_malloc(
opts->small_bufsize * opts->small_pool_count,
IOBUF_ALIGNMENT, // 4096
NULL, numa_id, SPDK_MALLOC_DMA);
// Same for large pool
node->large_pool = spdk_ring_create(SPDK_RING_TYPE_MP_MC,
opts->large_pool_count, numa_id);
node->large_pool_base = spdk_malloc(
opts->large_bufsize * opts->large_pool_count,
IOBUF_ALIGNMENT, NULL, numa_id, SPDK_MALLOC_DMA);
// Populate rings with buffer pointers
for (i = 0; i < opts->small_pool_count; i++) {
buf = node->small_pool_base + i * opts->small_bufsize;
spdk_ring_enqueue(node->small_pool, (void **)&buf, 1, NULL);
}
for (i = 0; i < opts->large_pool_count; i++) {
buf = node->large_pool_base + i * opts->large_bufsize;
spdk_ring_enqueue(node->large_pool, (void **)&buf, 1, NULL);
}
}
Complete I/O Walk-Through
Here's exactly what happens when a remote NVMe-oF host sends a 64KB write command to a RAID-1 volume backed by two NVMe SSDs. Every thread, poller, bdev_io, and iobuf interaction is shown.
sequenceDiagram
participant Host as NVMe-oF Host
participant NVMfPoll as nvmf_tgroup_poll (Active Poller)
participant BdevIO as bdev_io allocation
participant IOBuf as iobuf subsystem
participant RAID as RAID-1 module
participant NVMe1 as NVMe SSD 1
participant NVMe2 as NVMe SSD 2
participant NVMePoll as bdev_nvme_poll (Active Poller)
Note over NVMfPoll: Thread poll cycle starts<br/>thread_poll() called
Host->>NVMfPoll: 64KB Write Command (TCP/RDMA)
NVMfPoll->>BdevIO: Get bdev_io from per-thread cache
Note over BdevIO: cache_count-- (256→255)<br/>No atomics needed!
BdevIO->>IOBuf: Request large iobuf (64KB > 8KB)
Note over IOBuf: Check per-thread large cache (16 bufs)<br/>If empty → global large_pool ring
IOBuf-->>BdevIO: 132KB DMA buffer
BdevIO->>RAID: spdk_bdev_write(raid_bdev, buf, 64KB)
Note over RAID: RAID-1: mirror to both drives
RAID->>NVMe1: Submit write (same buf, same thread)
RAID->>NVMe2: Submit write (same buf, same thread)
Note over NVMePoll: Same thread, next poll cycle
NVMePoll->>NVMe1: spdk_nvme_poll_group_process_completions()
NVMe1-->>NVMePoll: Write complete
NVMePoll->>NVMe2: process_completions()
NVMe2-->>NVMePoll: Write complete
NVMePoll->>RAID: Both mirrors complete
RAID->>IOBuf: Release large iobuf
Note over IOBuf: Return to per-thread cache<br/>or global pool if cache full
RAID->>BdevIO: spdk_bdev_free_io()
Note over BdevIO: Return bdev_io to cache<br/>cache_count++ (255→256)
BdevIO->>NVMfPoll: I/O complete callback
NVMfPoll->>Host: Write Response (success)
Key Insight: Everything Happens on ONE Thread
In the above flow, the entire I/O path — from receiving the NVMe-oF command, through RAID mirroring, to NVMe submission and completion — happens on a single SPDK thread without any context switch or mutex. The pollers (nvmf_tgroup_poll and bdev_nvme_poll) are both registered on the same thread. The per-thread bdev_io cache and iobuf caches eliminate all contention with other threads.
This is the fundamental design principle: pin everything to one core, avoid sharing, eliminate locks.
All Parameters at a Glance
| Parameter | RPC | Default | Scope | What It Controls | Sizing Rule |
|---|---|---|---|---|---|
| bdev_io_pool_size | bdev_set_options | 65535 | Global (one mempool) | Total spdk_bdev_io structs. Used for I/O control metadata, NOT data. | ≥ cache_size × (threads + 1). Use power-of-2 minus 1 for DPDK ring efficiency. |
| bdev_io_cache_size | bdev_set_options | 256 | Per thread | Lock-free bdev_io cache per thread. Pre-populated from global pool. | ≥ expected max concurrent I/Os per thread. Higher = less global pool contention. |
| bdev_auto_examine | bdev_set_options | true | Global | Auto-notify modules when new bdevs appear (lvol, raid, etc. can claim them). | Set false in production for explicit control. |
| iobuf_small_cache_size | bdev_set_options | 128 | Per thread (bdev module) | Per-thread cache of small (≤8KB) DMA buffers from the global iobuf pool. | ≥ concurrent small I/Os per thread. Higher = less ring contention. |
| iobuf_large_cache_size | bdev_set_options | 16 | Per thread (bdev module) | Per-thread cache of large (≤132KB) DMA buffers. | Lower default because large buffers are expensive (132KB each). |
| small_pool_count | iobuf_set_options | 8192 | Global (per NUMA) | Total small DMA buffers in the MP/MC ring. Backing store for per-thread caches. | ≥ 64. Scale with thread count × iobuf_small_cache_size. |
| large_pool_count | iobuf_set_options | 1024 | Global (per NUMA) | Total large DMA buffers. Each is 132KB by default — 1024 × 132KB = 132MB. | ≥ 8. Scale with thread count × iobuf_large_cache_size + headroom. |
| small_bufsize | iobuf_set_options | 8192 | Global | Size of each small buffer. Aligned to 4096. Used for I/Os ≤ this size. | ≥ 4096. Match SPDK_BDEV_SMALL_BUF_MAX_SIZE. |
| large_bufsize | iobuf_set_options | 135168 (132KB) | Global | Size of each large buffer. Must accommodate max I/O size + metadata. | ≥ 8192. Default = 128KB + metadata headroom. NVMf io_unit_size must not exceed this. |
Memory Footprint Calculation
For a system with 4 I/O threads at default settings, the iobuf pools dominate: 8192 small buffers × 8KB = 64MB plus 1024 large buffers × 132KB = 132MB, per NUMA node. The bdev_io pool adds 65535 control structures on top (a few MB, depending on the struct size in your SPDK version).
All allocated from hugepages. The per-thread caches don't allocate separate memory — they just hold pointers into the global pool.
Source Files Reference
Threading & Pollers: include/spdk/thread.h, lib/thread/thread.c, lib/thread/iobuf.c
Bdev Layer: include/spdk/bdev.h, include/spdk/bdev_module.h, lib/bdev/bdev.c, lib/bdev/bdev_rpc.c
NVMe Bdev Module: module/bdev/nvme/bdev_nvme.c
RAID & NVMf: module/bdev/raid/bdev_raid.c, lib/nvmf/transport.c, lib/nvmf/nvmf_rpc.c