SPDK Threads, Pollers & Buffer Architecture

A deep dive into SPDK's userspace threading model, polling infrastructure, and memory buffer management — with code references to the actual implementation and real-world scenarios.

1 — Threads vs Pollers

The Core Distinction

SPDK uses a run-to-completion model. There are no kernel context switches, no blocking syscalls, and no mutexes in the I/O hot path. Everything runs on lightweight userspace threads that are polled by the application's reactor loop.

spdk_thread

Execution Context

A stackless, lightweight thread — NOT a POSIX thread. It's a logical execution unit containing:

  • Active pollers — zero-period callbacks that run every poll cycle
  • Timed pollers — callbacks on a periodic timer (RB-tree sorted)
  • Paused pollers — temporarily suspended pollers
  • Message ring — lockless MPSC ring for cross-thread messages (many producing threads, one consuming thread)
  • I/O channels — per-thread connections to I/O devices
  • CPU affinity mask — which cores this thread can run on
lib/thread/thread.c:114 — struct spdk_thread include/spdk/thread.h:234 — spdk_thread_create()
struct spdk_thread {
    uint64_t tsc_last;
    struct spdk_thread_stats stats;
    TAILQ_HEAD(, spdk_poller) active_pollers;
    RB_HEAD(, spdk_poller)    timed_pollers;
    TAILQ_HEAD(, spdk_poller) paused_pollers;
    struct spdk_ring  *messages;
    RB_HEAD(, spdk_io_channel) io_channels;
    struct spdk_cpuset cpumask;
    bool is_bound;
    bool in_interrupt;
    // ...
};

spdk_poller

Repeated Callback

A function repeatedly called on the same spdk_thread. Pollers are the workhorses — they check for I/O completions, process admin commands, handle network events, etc.

  • Active poller (period=0) — runs every poll cycle, round-robin
  • Timed poller (period>0) — runs at intervals, sorted in an RB-tree by next_run_tick
  • Returns SPDK_POLLER_BUSY (did work) or SPDK_POLLER_IDLE (no work)
  • Can be paused/resumed dynamically
lib/thread/thread.c:70 — struct spdk_poller include/spdk/thread.h:605 — spdk_poller_register()
struct spdk_poller {
    TAILQ_ENTRY(spdk_poller) tailq;
    RB_ENTRY(spdk_poller)    node;
    uint64_t period_ticks;
    uint64_t next_run_tick;
    uint64_t run_count;
    uint64_t busy_count;
    spdk_poller_fn fn;
    void *arg;
    struct spdk_thread *thread;
    char name[SPDK_MAX_POLLER_NAME_LEN + 1];
};

Key Relationship: Thread ← 1:N → Pollers

One spdk_thread owns many pollers. When spdk_thread_poll() is called by the reactor, it iterates through all active pollers (round-robin), then checks timed pollers whose deadline has arrived. Each poller function runs to completion before the next one is invoked — no preemption.

The reactor (SPDK's event framework in lib/event, or an equivalent custom application framework) creates one POSIX thread per core, and each POSIX thread drives one or more spdk_threads in a tight loop.
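The dispatch rules described above can be modeled in a few lines of standalone C. The sketch below is illustrative only (the struct and function names are invented, not SPDK's): each cycle runs every zero-period poller, and runs a timed poller only once its deadline has passed.

```c
/* Minimal model of SPDK's run-to-completion dispatch. Invented names;
 * real SPDK keeps active pollers in a TAILQ and timed pollers in an
 * RB-tree sorted by next_run_tick. */
#include <stdint.h>
#include <stddef.h>

#define SPDK_POLLER_IDLE 0
#define SPDK_POLLER_BUSY 1

typedef int (*poller_fn)(void *arg);

struct poller {
    poller_fn fn;
    void *arg;
    uint64_t period_ticks;   /* 0 = active poller, runs every cycle */
    uint64_t next_run_tick;  /* deadline for timed pollers */
};

/* One poll cycle: run every active poller, then any timed poller whose
 * deadline has arrived. Returns 1 if any poller reported BUSY. */
static int thread_poll_model(struct poller *pollers, size_t n, uint64_t now)
{
    int rc = 0;
    for (size_t i = 0; i < n; i++) {
        struct poller *p = &pollers[i];
        if (p->period_ticks == 0) {
            if (p->fn(p->arg) == SPDK_POLLER_BUSY) rc = 1;
        } else if (now >= p->next_run_tick) {
            if (p->fn(p->arg) == SPDK_POLLER_BUSY) rc = 1;
            p->next_run_tick = now + p->period_ticks;   /* re-arm timer */
        }
    }
    return rc;
}

static int count_calls(void *arg) { (*(int *)arg)++; return SPDK_POLLER_BUSY; }
```

With one active and one timed poller (period 10 ticks), 30 cycles invoke the active poller 30 times but the timed poller only on ticks 10 and 20, mirroring the round-robin-plus-deadline behavior of thread_poll().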

2 — The Thread Poll Loop

What happens inside spdk_thread_poll()

lib/thread/thread.c:1120 — thread_poll() lib/thread/thread.c:1223 — spdk_thread_poll()

This is the heart of SPDK's execution model. Every call to spdk_thread_poll() does the following in order:

flowchart TD
    A["spdk_thread_poll(thread, max_msgs, now)"] --> B["1. Process critical_msg<br/>One-shot emergency callback"]
    B --> C["2. msg_queue_run_batch()<br/>Drain up to max_msgs from ring buffer<br/>Cross-thread messages land here"]
    C --> D["3. Active Pollers (round-robin)<br/>TAILQ_FOREACH_REVERSE_SAFE<br/>Each poller: fn(arg) → BUSY|IDLE"]
    D --> E["4. Timed Pollers (deadline check)<br/>RB-tree sorted by next_run_tick<br/>Only run if now >= next_run_tick"]
    E --> F["5. thread_update_stats()<br/>Update busy/idle tsc counters"]
    F --> G{"Thread exiting?"}
    G -->|Yes| H["thread_exit() — clean up"]
    G -->|No| I["Return to reactor loop"]
    I --> A
    style A fill:#1e3a5f,color:#fff,stroke:#1e3a5f
    style B fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
    style C fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
    style D fill:#0f766e,color:#fff,stroke:#0f766e
    style E fill:#0f766e,color:#fff,stroke:#0f766e
    style F fill:#3f6212,color:#fff,stroke:#3f6212
Actual code from thread_poll() — lib/thread/thread.c:1120
static int
thread_poll(struct spdk_thread *thread, uint32_t max_msgs, uint64_t now)
{
    uint32_t msg_count;
    struct spdk_poller *poller, *tmp;
    spdk_msg_fn critical_msg;
    int rc = 0;

    thread->tsc_last = now;

    // Step 1: Process critical message (single, high-priority)
    critical_msg = thread->critical_msg;
    if (spdk_unlikely(critical_msg != NULL)) {
        critical_msg(NULL);
        thread->critical_msg = NULL;
        rc = 1;
    }

    // Step 2: Drain message queue (cross-thread msgs)
    msg_count = msg_queue_run_batch(thread, max_msgs);
    if (msg_count) { rc = 1; }

    // Step 3: Execute ALL active pollers (period_ticks == 0)
    TAILQ_FOREACH_REVERSE_SAFE(poller, &thread->active_pollers,
                               active_pollers_head, tailq, tmp) {
        int poller_rc = thread_execute_poller(thread, poller);
        if (poller_rc > rc) { rc = poller_rc; }
    }

    // Step 4: Execute timed pollers whose deadline arrived
    poller = thread->first_timed_poller;
    while (poller != NULL) {
        if (now < poller->next_run_tick) break; // sorted, so stop early
        tmp = RB_NEXT(...);
        RB_REMOVE(...);
        int timer_rc = thread_execute_timed_poller(thread, poller, now);
        if (timer_rc > rc) { rc = timer_rc; }
        poller = tmp;
    }

    return rc;
}
3 — NVMe Controller Attach: What Happens

Scenario: bdev_nvme_attach_controller RPC

When you issue bdev_nvme_attach_controller, SPDK creates an NVMe controller, registers pollers for I/O and admin queues, and creates bdevs for each namespace. Here's the thread/poller dance:

sequenceDiagram
    participant RPC as JSON-RPC Thread<br/>(app thread)
    participant NVMe as NVMe Driver
    participant BdevMod as bdev_nvme module
    participant Thread as SPDK Thread<br/>(I/O core)
    RPC->>NVMe: spdk_nvme_connect_async(trid)
    Note over RPC: Probe poller registered<br/>on app thread
    NVMe-->>BdevMod: connect_attach_cb()
    BdevMod->>BdevMod: nvme_ctrlr_create()
    Note over BdevMod: Allocates nvme_ctrlr struct<br/>on app thread
    BdevMod->>BdevMod: SPDK_POLLER_REGISTER(bdev_nvme_poll_adminq, period=1000μs)
    Note over BdevMod: Admin queue poller<br/>— timed poller on app thread
    BdevMod->>BdevMod: spdk_io_device_register(nvme_ctrlr)
    Note over BdevMod: Enables I/O channel<br/>creation on any thread
    BdevMod->>BdevMod: nvme_ctrlr_create_done()
    BdevMod->>BdevMod: Register bdevs for<br/>each NVMe namespace
    Note over BdevMod: bdev_auto_examine triggers<br/>if enabled
    RPC-->>Thread: When I/O channel opened:
    Thread->>Thread: bdev_nvme_create_poll_group_cb()
    Thread->>Thread: spdk_nvme_poll_group_create()
    Thread->>Thread: SPDK_POLLER_REGISTER(bdev_nvme_poll, period=0)
    Note over Thread: I/O completion poller<br/>— active poller (period=0)<br/>runs EVERY poll cycle

Pollers Created During NVMe Attach

Poller                 Type             Thread       Purpose
bdev_nvme_poll_adminq  Timed (1000μs)   App thread   Poll admin queue for completions (identify, set features, etc.)
bdev_nvme_poll         Active (0μs)     I/O thread   Poll NVMe I/O queue for completions — the hot path
module/bdev/nvme/bdev_nvme.c:3925 — I/O poller module/bdev/nvme/bdev_nvme.c:6113 — Admin poller

Buffer Usage During NVMe I/O

When an I/O arrives at the NVMe bdev:

  • spdk_bdev_io is pulled from the per-thread cache (fast) or the global pool (slower)
  • If the I/O needs a data buffer, it requests an iobuf — small (≤8K) or large (≤132K)
  • The bdev_nvme_poll active poller calls spdk_nvme_poll_group_process_completions()
  • On completion, bdev_io goes back to cache/pool and iobuf is released
module/bdev/nvme/bdev_nvme.c:1924 — bdev_nvme_poll()
bdev_nvme_poll() — the NVMe I/O completion poller
static int
bdev_nvme_poll(void *arg)
{
    struct nvme_poll_group *group = arg;
    int64_t num_completions;

    num_completions = spdk_nvme_poll_group_process_completions(
        group->group, 0, bdev_nvme_disconnected_qpair_cb);

    // Returns BUSY if completions processed, IDLE otherwise
    // This tells the thread whether this cycle did useful work
    return num_completions > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
}
4 — RAID bdev Creation

Scenario: bdev_raid_create RPC

Creating a RAID over NVMe bdevs adds another layer. The RAID bdev stacks on top of base bdevs, and each I/O thread gets its own RAID channel that fans out to the underlying NVMe channels.

flowchart TD
    subgraph AppThread["App Thread (thread_poll loop)"]
        RPC["bdev_raid_create RPC"] --> RC["raid_bdev_create()"]
        RC --> IOD["spdk_io_device_register(raid_bdev)"]
        IOD --> ABB["raid_bdev_add_base_bdev() × N"]
        ABB --> CFG["raid_bdev_configure()"]
        CFG --> REG["spdk_bdev_register(raid_bdev)"]
        REG --> EXAM{"bdev_auto_examine?"}
        EXAM -->|Yes| AE["Notify all modules<br/>of new bdev"]
        EXAM -->|No| SKIP["Skip — manual examine later"]
    end
    subgraph IOThread["I/O Thread (when channel opened)"]
        OPEN["spdk_get_io_channel(raid_bdev)"] --> CB["raid_bdev_create_cb()"]
        CB --> ALLOC["Allocate raid_bdev_io_channel"]
        ALLOC --> BASE["Get io_channel for each<br/>base NVMe bdev"]
        BASE --> READY["RAID channel ready<br/>for I/O submission"]
    end
    REG -.->|"I/O channel created<br/>on first I/O"| OPEN
    style RPC fill:#9a3412,color:#fff,stroke:#9a3412
    style RC fill:#9a3412,color:#fff,stroke:#9a3412
    style OPEN fill:#0f766e,color:#fff,stroke:#0f766e
    style CB fill:#0f766e,color:#fff,stroke:#0f766e

RAID + Pollers + Buffers: How They Interact

RAID itself does not register its own pollers — it relies on the underlying bdev modules' pollers. Here's what happens during a RAID I/O:

  1. Application submits I/O to RAID bdev → spdk_bdev_io allocated from per-thread cache (bdev_io_cache_size)
  2. RAID splits/stripes the I/O across base bdevs → may need additional bdev_io structs from the pool
  3. Each sub-I/O may request an iobuf (small or large) for data transfer
  4. Sub-I/Os submitted to NVMe bdevs → they go to the NVMe queue pair
  5. NVMe bdev_nvme_poll active poller picks up completions
  6. RAID completion callback aggregates results
  7. bdev_io structs returned to per-thread cache (or pool if cache full)
module/bdev/raid/bdev_raid.c:258 — raid_bdev_create_cb() module/bdev/raid/bdev_raid.c:1653 — raid_bdev_create()
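Step 6 above (completion aggregation) is the heart of the RAID fan-out. A minimal sketch with invented names, not the real bdev_raid structures: the parent I/O completes only when the last sub-I/O completes, and any sub-I/O failure is sticky.

```c
/* Sketch of RAID completion aggregation. Because all sub-I/O
 * completions arrive on the same spdk_thread in the run-to-completion
 * model, a plain counter suffices -- no atomics, no locks. */
#include <stdbool.h>

struct raid_parent_io {
    int remaining;     /* outstanding sub-I/Os */
    bool failed;       /* sticky failure flag */
    bool completed;    /* parent completion has fired */
};

static void raid_parent_init(struct raid_parent_io *io, int num_base_bdevs)
{
    io->remaining = num_base_bdevs;
    io->failed = false;
    io->completed = false;
}

/* Called once per sub-I/O completion, always on the owning thread. */
static void raid_sub_io_done(struct raid_parent_io *io, bool success)
{
    if (!success) {
        io->failed = true;          /* any failure fails the parent */
    }
    if (--io->remaining == 0) {
        io->completed = true;       /* last mirror done: fire parent cb */
    }
}
```

For RAID-1 with two mirrors, `remaining` starts at 2 and the parent callback fires on the second `raid_sub_io_done()`.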
5 — NVMf Subsystem (NQN) Creation on RAID

Scenario: nvmf_create_subsystem + nvmf_subsystem_add_ns

Exposing the RAID bdev over NVMe-oF adds the NVMf transport layer. This creates new pollers for network I/O and chains through the bdev layer to RAID and ultimately NVMe.

flowchart TD
    subgraph Host["Remote NVMe-oF Host"]
        HI["NVMe-oF Initiator"]
    end

    subgraph NVMfTarget["SPDK NVMf Target"]
        subgraph TPG["Transport Poll Group<br/>(per I/O thread)"]
            TP["nvmf_tgroup_poll<br/>Active Poller (period=0)"]
            TP --> RECV["Receive NVMe commands<br/>from network"]
            RECV --> BDEV["Submit to bdev layer"]
        end
        subgraph BdevLayer["Bdev Layer"]
            BDEV --> BIO["Allocate spdk_bdev_io<br/>from per-thread cache"]
            BIO --> IBUF["Request iobuf<br/>(small or large)"]
            IBUF --> RAID["RAID bdev submit_request"]
        end
        subgraph RaidLayer["RAID Layer"]
            RAID --> SPLIT["Split across base bdevs"]
            SPLIT --> NB1["NVMe bdev 1"]
            SPLIT --> NB2["NVMe bdev 2"]
        end
        subgraph NVMeLayer["NVMe Driver"]
            NB1 --> NQP1["NVMe QP 1"]
            NB2 --> NQP2["NVMe QP 2"]
            NVPOLL["bdev_nvme_poll<br/>Active Poller"] --> NQP1
            NVPOLL --> NQP2
        end
    end
    HI -->|"NVMe-oF TCP/RDMA"| TP
    NQP1 -->|"Completion"| NVPOLL
    NVPOLL -->|"Completion chain"| TP
    TP -->|"Send response"| HI
    style TP fill:#3f6212,color:#fff,stroke:#3f6212
    style NVPOLL fill:#0f766e,color:#fff,stroke:#0f766e
    style BIO fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
    style IBUF fill:#d4a73a,color:#1e1b16,stroke:#d4a73a

NVMf Pollers

Poller            Type                 Purpose
nvmf_tgroup_poll  Active (0μs)         Process incoming NVMe-oF commands from the network transport
accept_poller     Varies by transport  Accept new connections (TCP/RDMA/vfio-user)
lib/nvmf/transport.c:591 — nvmf_tgroup_poll() lib/nvmf/transport.c:606 — poller registration

iobuf Usage in NVMf

NVMf transport validates io_unit_size against iobuf pool sizes:

// lib/nvmf/transport.c:295
if (ctx->opts.io_unit_size > opts_iobuf.large_bufsize) {
    SPDK_ERRLOG("io_unit_size %d > large_bufsize %d\n",
        ctx->opts.io_unit_size, opts_iobuf.large_bufsize);
}
if (ctx->opts.io_unit_size <= opts_iobuf.small_bufsize) {
    count = opts_iobuf.small_pool_count;
} else {
    count = spdk_min(small_pool_count, large_pool_count);
}
lib/nvmf/transport.c:295-305
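The selection logic above reduces to a small pure function. A sketch with an invented name (not an actual SPDK helper): if every I/O unit fits in a small buffer, the small pool bounds the transport's buffer count; otherwise any request may need a large buffer, so the smaller of the two pools is the limit.

```c
#include <stdint.h>

/* Models the lib/nvmf/transport.c pool-selection logic above.
 * Illustrative name; not part of the SPDK API. */
static uint64_t nvmf_effective_buf_count(uint32_t io_unit_size,
                                         uint32_t small_bufsize,
                                         uint64_t small_pool_count,
                                         uint64_t large_pool_count)
{
    if (io_unit_size <= small_bufsize) {
        /* every I/O unit fits a small buffer */
        return small_pool_count;
    }
    /* large buffers may be needed: bounded by the scarcer pool */
    return small_pool_count < large_pool_count ? small_pool_count
                                               : large_pool_count;
}
```

With the defaults (small_bufsize = 8192, small = 8192 buffers, large = 1024 buffers), a 4K io_unit_size yields 8192 usable buffers, while a 128K io_unit_size is capped at 1024.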
6 — bdev_io Pool & Per-Thread Cache

Two-Level Allocation: Global Pool → Per-Thread Cache

Every bdev I/O operation requires a spdk_bdev_io structure. SPDK uses a two-level scheme to minimize contention: a global mempool (shared, uses atomic ops) and per-thread caches (lockless STAILQ).

flowchart TD
    subgraph Global["Global bdev_io Pool<br/>(spdk_mempool, shared across all threads)"]
        POOL["bdev_io_pool_size = 65535<br/>Lock-free mempool (DPDK ring)"]
    end
    subgraph T1["Thread 1 — bdev_mgmt_channel"]
        C1["per_thread_cache<br/>STAILQ of bdev_io<br/>cache_size = 256"]
        IWQ1["io_wait_queue<br/>Waiters when pool exhausted"]
    end
    subgraph T2["Thread 2 — bdev_mgmt_channel"]
        C2["per_thread_cache<br/>STAILQ of bdev_io<br/>cache_size = 256"]
        IWQ2["io_wait_queue"]
    end
    subgraph T3["Thread N — bdev_mgmt_channel"]
        C3["per_thread_cache<br/>STAILQ of bdev_io<br/>cache_size = 256"]
        IWQ3["io_wait_queue"]
    end
    POOL -->|"Pre-populate at<br/>channel creation"| C1
    POOL -->|"Pre-populate"| C2
    POOL -->|"Pre-populate"| C3
    C1 -->|"Return when<br/>cache full"| POOL
    C2 -->|"Return when<br/>cache full"| POOL
    C3 -->|"Return when<br/>cache full"| POOL
    style POOL fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
    style C1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
    style C2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
    style C3 fill:#1e3a5f,color:#fff,stroke:#1e3a5f

bdev_io_pool_size

Global Pool

Default: 65535 (64K - 1, optimal for DPDK ring). Total spdk_bdev_io structures allocated at startup in a shared mempool.

Constraint: Must be ≥ bdev_io_cache_size × (thread_count + 1) — because each thread pre-populates its cache at channel creation.

// lib/bdev/bdev.c:519
min_pool_size = opts->bdev_io_cache_size
              * (spdk_thread_get_count() + 1);
if (opts->bdev_io_pool_size < min_pool_size) {
    SPDK_ERRLOG("bdev_io_pool_size %" PRIu32
        " is not compatible with "
        "bdev_io_cache_size %" PRIu32
        " and %" PRIu32 " threads\n", ...);
}
lib/bdev/bdev.c:38 — #define SPDK_BDEV_IO_POOL_SIZE (64 * 1024 - 1) lib/bdev/bdev.c:2350 — spdk_mempool_create()

bdev_io_cache_size

Per-Thread Cache

Default: 256. Maximum spdk_bdev_io structs cached per thread in a lockless STAILQ.

Hot path: I/O allocation first checks the local cache (no atomics!). Only falls back to the global pool if cache is empty.

// lib/bdev/bdev.c:2157 — Pre-populate cache
remaining = ch->bdev_io_cache_size
          = g_bdev_opts.bdev_io_cache_size;
while (remaining > 0) {
    spdk_mempool_get_bulk(g_bdev_mgr.bdev_io_pool,
                          bdev_ios, count);
    for (i = 0; i < count; i++) {
        STAILQ_INSERT_HEAD(&ch->per_thread_cache,
                           bdev_ios[i], ...);
        ch->per_thread_cache_count++;
    }
}

// lib/bdev/bdev.c:2644 — Return to cache
if (ch->per_thread_cache_count < ch->bdev_io_cache_size) {
    ch->per_thread_cache_count++;
    STAILQ_INSERT_HEAD(&ch->per_thread_cache, bdev_io, ...);
    // Also wake any waiters on io_wait_queue
} else {
    // Cache full — return to global pool
    spdk_mempool_put(g_bdev_mgr.bdev_io_pool, bdev_io);
}
lib/bdev/bdev.c:39 — #define SPDK_BDEV_IO_CACHE_SIZE 256
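The get/put pair above can be modeled without SPDK at all. In this sketch (invented names; a plain array and a counter stand in for the STAILQ and the DPDK mempool), the fast path touches only thread-local state:

```c
#include <stddef.h>

#define CACHE_SIZE 4   /* stands in for bdev_io_cache_size */

struct io_channel_model {
    void *cache[CACHE_SIZE];     /* per-thread cache: no atomics */
    size_t cache_count;
    size_t *global_pool_count;   /* shared pool (real code: mempool) */
};

static void *bdev_io_get(struct io_channel_model *ch)
{
    if (ch->cache_count > 0) {           /* hot path: local cache */
        return ch->cache[--ch->cache_count];
    }
    if (*ch->global_pool_count > 0) {    /* slow path: global pool */
        (*ch->global_pool_count)--;
        return (void *)1;                /* stand-in for a bdev_io */
    }
    return NULL;   /* exhausted: caller queues on io_wait_queue */
}

static void bdev_io_put(struct io_channel_model *ch, void *io)
{
    if (ch->cache_count < CACHE_SIZE) {  /* prefer the local cache */
        ch->cache[ch->cache_count++] = io;
    } else {
        (*ch->global_pool_count)++;      /* cache full: back to pool */
    }
}
```

Note the asymmetry: frees land in the local cache first, so a thread doing steady-state I/O stops touching the global pool entirely.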

bdev_auto_examine

Discovery

Default: true. When a new bdev is registered (e.g., NVMe namespace discovered), the bdev layer automatically notifies all bdev modules to examine it. Modules like lvol, raid, crypto, etc. get a chance to claim or layer on top of it.

// lib/bdev/bdev.c:718
if (g_bdev_opts.bdev_auto_examine) {
    bdev_examine(bdev);  // Notify all modules
}

// lib/bdev/bdev.c:839
if (g_bdev_opts.bdev_auto_examine) {
    bdev_examine(bdev);  // Also on hot-plug
}

When to disable: In production systems where you want explicit control over which bdevs are examined, preventing unwanted modules from claiming devices. Set to false and use bdev_examine RPC manually.

lib/bdev/bdev.c:41 — #define SPDK_BDEV_AUTO_EXAMINE true
7 — iobuf Architecture: Global Pools & Per-Thread Caches

Three-Level Buffer System: Global Pool → Per-Thread Cache → I/O Consumer

SPDK's iobuf subsystem manages DMA-capable data buffers separately from bdev_io control structures. It uses two size classes (small and large) with both global pools (per NUMA node) and per-thread caches.

flowchart TD
    subgraph NUMA["Per-NUMA Global Pools<br/>(hugepage-backed, DMA-capable)"]
        SP["Small Pool<br/>small_pool_count = 8192<br/>small_bufsize = 8KB<br/>spdk_ring (MP/MC)"]
        LP["Large Pool<br/>large_pool_count = 1024<br/>large_bufsize = 132KB<br/>spdk_ring (MP/MC)"]
    end
    subgraph TCH1["Thread 1 — spdk_iobuf_channel"]
        SC1["Small Cache<br/>iobuf_small_cache_size = 128"]
        LC1["Large Cache<br/>iobuf_large_cache_size = 16"]
    end
    subgraph TCH2["Thread 2 — spdk_iobuf_channel"]
        SC2["Small Cache = 128"]
        LC2["Large Cache = 16"]
    end
    subgraph Consumer["I/O Consumers"]
        BDEV["bdev layer"]
        NVMF["NVMf transport"]
        ACCEL["accel framework"]
    end
    SP -->|"Refill cache"| SC1
    SP -->|"Refill cache"| SC2
    LP -->|"Refill cache"| LC1
    LP -->|"Refill cache"| LC2
    SC1 -->|"Return on free"| SP
    LC1 -->|"Return on free"| LP
    SC1 --> BDEV
    LC1 --> BDEV
    SC1 --> NVMF
    LC1 --> NVMF
    SC2 --> ACCEL
    style SP fill:#6b21a8,color:#fff,stroke:#6b21a8
    style LP fill:#6b21a8,color:#fff,stroke:#6b21a8
    style SC1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
    style LC1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
    style SC2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
    style LC2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f

small_pool_count / large_pool_count

Global Pools

Defaults: small = 8192, large = 1024

Number of buffers in the global per-NUMA ring. These are allocated from hugepages (spdk_malloc with SPDK_MALLOC_DMA flag) at startup.

Minimums: small ≥ 64, large ≥ 8

// lib/thread/iobuf.c:13-14
#define IOBUF_MIN_SMALL_POOL_SIZE 64
#define IOBUF_MIN_LARGE_POOL_SIZE 8
#define IOBUF_DEFAULT_SMALL_POOL_SIZE 8192
#define IOBUF_DEFAULT_LARGE_POOL_SIZE 1024
lib/thread/iobuf.c:72-73

small_bufsize / large_bufsize

Buffer Sizes

Defaults: small = 8KB, large = 132KB

Size of each buffer. Aligned to 4096 bytes. The 132KB large buffer size accommodates the default max I/O size (128K) plus interleaved metadata.

// lib/thread/iobuf.c:17-25
#define IOBUF_MIN_SMALL_BUFSIZE  4096
#define IOBUF_MIN_LARGE_BUFSIZE  8192
#define IOBUF_DEFAULT_SMALL_BUFSIZE (8 * 1024)
// 132k = 128k data + metadata
#define IOBUF_DEFAULT_LARGE_BUFSIZE (132 * 1024)
lib/thread/iobuf.c:74-75
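The 132KB figure falls out of interleaved-metadata arithmetic. Assuming 512-byte sectors carrying 16 bytes of metadata each (a common DIF layout; an assumption for illustration, not something these defines fix):

```c
#include <stdint.h>

/* A 128 KiB I/O on a 512-byte-sector device with interleaved
 * per-sector metadata needs sectors * (sector_size + md) bytes.
 * Sector and metadata sizes here are illustrative assumptions. */
static uint32_t interleaved_buf_size(uint32_t data_bytes,
                                     uint32_t sector_size,
                                     uint32_t md_per_sector)
{
    uint32_t sectors = data_bytes / sector_size;
    return data_bytes + sectors * md_per_sector;
}
```

128K of data is 256 sectors; 256 × 16B of metadata adds 4KB, giving exactly 132KB, which is why the default large buffer covers the default 128K max I/O with metadata headroom.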

iobuf_small_cache_size / iobuf_large_cache_size

Per-Thread

Defaults: small = 128, large = 16

Each bdev mgmt channel (per thread) creates an iobuf channel with these cache sizes. Reduces trips to the global pool.

// lib/bdev/bdev.c:42-43
#define BUF_SMALL_CACHE_SIZE 128
#define BUF_LARGE_CACHE_SIZE 16

// lib/bdev/bdev.c:2149-2150
spdk_iobuf_channel_init(&ch->iobuf,
    "bdev",
    g_bdev_opts.iobuf_small_cache_size,
    g_bdev_opts.iobuf_large_cache_size);
lib/bdev/bdev.c:2149
iobuf pool initialization — how buffers are allocated from hugepages
// lib/thread/iobuf.c:118 — iobuf_node_initialize()
static int iobuf_node_initialize(struct iobuf_node *node, uint32_t numa_id) {
    struct spdk_iobuf_opts *opts = &g_iobuf.opts;
    char *buf;
    uint32_t i;

    // Small pool: MP/MC ring for thread-safe access
    node->small_pool = spdk_ring_create(SPDK_RING_TYPE_MP_MC,
                                        opts->small_pool_count, numa_id);

    // Allocate contiguous hugepage memory for all small buffers
    node->small_pool_base = spdk_malloc(
        opts->small_bufsize * opts->small_pool_count,
        IOBUF_ALIGNMENT,  // 4096
        NULL, numa_id, SPDK_MALLOC_DMA);

    // Same for large pool
    node->large_pool = spdk_ring_create(SPDK_RING_TYPE_MP_MC,
                                        opts->large_pool_count, numa_id);
    node->large_pool_base = spdk_malloc(
        opts->large_bufsize * opts->large_pool_count,
        IOBUF_ALIGNMENT, NULL, numa_id, SPDK_MALLOC_DMA);

    // Populate rings with buffer pointers
    for (i = 0; i < opts->small_pool_count; i++) {
        buf = node->small_pool_base + i * opts->small_bufsize;
        spdk_ring_enqueue(node->small_pool, (void **)&buf, 1, NULL);
    }
    for (i = 0; i < opts->large_pool_count; i++) {
        buf = node->large_pool_base + i * opts->large_bufsize;
        spdk_ring_enqueue(node->large_pool, (void **)&buf, 1, NULL);
    }
    return 0;
}
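The carve-up itself is plain pointer arithmetic. A standalone sketch in which a `void *` array stands in for the spdk_ring (invented names):

```c
#include <stdlib.h>
#include <stddef.h>

/* Slice one contiguous allocation into `count` fixed-size buffers and
 * record each buffer's start address, the way iobuf_node_initialize()
 * seeds its free rings. Returns the number of buffers produced. */
static size_t carve_pool(char *base, size_t bufsize, size_t count,
                         void **ring)
{
    for (size_t i = 0; i < count; i++) {
        ring[i] = base + i * bufsize;   /* i-th buffer at base + i*bufsize */
    }
    return count;
}
```

Because the buffers are slices of one hugepage-backed region, freeing a buffer never returns memory to the allocator; its pointer just goes back on the ring.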
8 — Full Data Path: NVMf → RAID → NVMe with All Buffers

Complete I/O Walk-Through

Here's exactly what happens when a remote NVMe-oF host sends a 64KB write command to a RAID-1 volume backed by two NVMe SSDs. Every thread, poller, bdev_io, and iobuf interaction is shown.

sequenceDiagram
    participant Host as NVMe-oF Host
    participant NVMfPoll as nvmf_tgroup_poll<br/>(Active Poller)
    participant BdevIO as bdev_io allocation
    participant IOBuf as iobuf subsystem
    participant RAID as RAID-1 module
    participant NVMe1 as NVMe SSD 1
    participant NVMe2 as NVMe SSD 2
    participant NVMePoll as bdev_nvme_poll<br/>(Active Poller)
    Note over NVMfPoll: Thread poll cycle starts<br/>thread_poll() called
    Host->>NVMfPoll: 64KB Write Command (TCP/RDMA)
    NVMfPoll->>BdevIO: Get bdev_io from per-thread cache
    Note over BdevIO: cache_count-- (256→255)<br/>No atomics needed!
    BdevIO->>IOBuf: Request large iobuf (64KB > 8KB)
    Note over IOBuf: Check per-thread large cache (16 bufs)<br/>If empty → global large_pool ring
    IOBuf-->>BdevIO: 132KB DMA buffer
    BdevIO->>RAID: spdk_bdev_write(raid_bdev, buf, 64KB)
    Note over RAID: RAID-1: mirror to both drives
    RAID->>NVMe1: Submit write (same buf, same thread)
    RAID->>NVMe2: Submit write (same buf, same thread)
    Note over NVMePoll: Same thread, next poll cycle
    NVMePoll->>NVMe1: spdk_nvme_poll_group_process_completions()
    NVMe1-->>NVMePoll: Write complete
    NVMePoll->>NVMe2: process_completions()
    NVMe2-->>NVMePoll: Write complete
    NVMePoll->>RAID: Both mirrors complete
    RAID->>IOBuf: Release large iobuf
    Note over IOBuf: Return to per-thread cache<br/>or global pool if cache full
    RAID->>BdevIO: spdk_bdev_free_io()
    Note over BdevIO: Return bdev_io to cache<br/>cache_count++ (255→256)
    BdevIO->>NVMfPoll: I/O complete callback
    NVMfPoll->>Host: Write Response (success)

Key Insight: Everything Happens on ONE Thread

In the above flow, the entire I/O path — from receiving the NVMe-oF command, through RAID mirroring, to NVMe submission and completion — happens on a single SPDK thread without any context switch or mutex. The pollers (nvmf_tgroup_poll and bdev_nvme_poll) are both registered on the same thread. The per-thread bdev_io cache and iobuf caches eliminate all contention with other threads.

This is the fundamental design principle: pin everything to one core, avoid sharing, eliminate locks.

9 — Configuration Reference

All Parameters at a Glance

bdev_io_pool_size (bdev_set_options)
    Default: 65535 · Scope: Global (one mempool)
    Controls: Total spdk_bdev_io structs. Used for I/O control metadata, NOT data.
    Sizing: ≥ cache_size × (threads + 1). Use power-of-2 minus 1 for DPDK ring efficiency.

bdev_io_cache_size (bdev_set_options)
    Default: 256 · Scope: Per thread
    Controls: Lock-free bdev_io cache per thread. Pre-populated from global pool.
    Sizing: ≥ expected max concurrent I/Os per thread. Higher = less global pool contention.

bdev_auto_examine (bdev_set_options)
    Default: true · Scope: Global
    Controls: Auto-notify modules when new bdevs appear (lvol, raid, etc. can claim them).
    Sizing: Set false in production for explicit control.

iobuf_small_cache_size (bdev_set_options)
    Default: 128 · Scope: Per thread (bdev module)
    Controls: Per-thread cache of small (≤8KB) DMA buffers from the global iobuf pool.
    Sizing: ≥ concurrent small I/Os per thread. Higher = less ring contention.

iobuf_large_cache_size (bdev_set_options)
    Default: 16 · Scope: Per thread (bdev module)
    Controls: Per-thread cache of large (≤132KB) DMA buffers.
    Sizing: Lower default because large buffers are expensive (132KB each).

small_pool_count (iobuf_set_options)
    Default: 8192 · Scope: Global (per NUMA)
    Controls: Total small DMA buffers in the MP/MC ring. Backing store for per-thread caches.
    Sizing: ≥ 64. Scale with thread count × iobuf_small_cache_size.

large_pool_count (iobuf_set_options)
    Default: 1024 · Scope: Global (per NUMA)
    Controls: Total large DMA buffers. Each is 132KB by default — 1024 × 132KB ≈ 132MB.
    Sizing: ≥ 8. Scale with thread count × iobuf_large_cache_size + headroom.

small_bufsize (iobuf_set_options)
    Default: 8192 · Scope: Global
    Controls: Size of each small buffer. Aligned to 4096. Used for I/Os ≤ this size.
    Sizing: ≥ 4096. Match SPDK_BDEV_SMALL_BUF_MAX_SIZE.

large_bufsize (iobuf_set_options)
    Default: 135168 (132KB) · Scope: Global
    Controls: Size of each large buffer. Must accommodate max I/O size + metadata.
    Sizing: ≥ 8192. Default = 128KB + metadata headroom. NVMf io_unit_size must not exceed this.

Memory Footprint Calculation

For a system with 4 I/O threads, default settings:

  • bdev_io pool (65535 × ~320B): ~20 MB
  • Small iobuf pool (8192 × 8KB): ~64 MB
  • Large iobuf pool (1024 × 132KB): ~132 MB
  • Total buffer memory: ~216 MB

All allocated from hugepages. The per-thread caches don't allocate separate memory — they just hold pointers into the global pool.
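The arithmetic behind those figures is simple enough to check. The ~320-byte spdk_bdev_io size is this document's approximation, not a fixed constant (the real size depends on the build):

```c
#include <stdint.h>

/* Bytes consumed by a pool of `count` fixed-size items. */
static uint64_t pool_bytes(uint64_t count, uint64_t item_size)
{
    return count * item_size;
}
```

With the defaults: 65535 × 320B ≈ 20MB of bdev_io structs, 8192 × 8KB = exactly 64MB of small iobufs, and 1024 × 132KB = exactly 132MB of large iobufs.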

Source Files Reference

Threading & Pollers

include/spdk/thread.h lib/thread/thread.c lib/thread/iobuf.c

Bdev Layer

include/spdk/bdev.h include/spdk/bdev_module.h lib/bdev/bdev.c lib/bdev/bdev_rpc.c

NVMe Bdev Module

module/bdev/nvme/bdev_nvme.c

RAID & NVMf

module/bdev/raid/bdev_raid.c lib/nvmf/transport.c lib/nvmf/nvmf_rpc.c