SPDK Threads, Pollers & Buffer Architecture
A deep dive into SPDK's userspace threading model, polling infrastructure, and memory buffer management — with code references to the actual implementation and real-world scenarios.
The Core Distinction
SPDK uses a run-to-completion model. There are no kernel context switches, no blocking syscalls, and no mutexes in the I/O hot path. Everything runs on lightweight userspace threads that are polled by the application's reactor loop.
spdk_thread
Execution Context: a stackless, lightweight thread (NOT a POSIX thread). It's a logical execution unit containing:
- Active pollers — zero-period callbacks that run every poll cycle
- Timed pollers — callbacks on a periodic timer (RB-tree sorted)
- Paused pollers — temporarily suspended pollers
- Message ring — lockless MP/SC ring for cross-thread messages (many producer threads, one consumer)
- I/O channels — per-thread connections to I/O devices
- CPU affinity mask — which cores this thread can run on
struct spdk_thread {
uint64_t tsc_last;
struct spdk_thread_stats stats;
TAILQ_HEAD(, spdk_poller) active_pollers;
RB_HEAD(, spdk_poller) timed_pollers;
TAILQ_HEAD(, spdk_poller) paused_pollers;
struct spdk_ring *messages;
RB_HEAD(, spdk_io_channel) io_channels;
struct spdk_cpuset cpumask;
bool is_bound;
bool in_interrupt;
// ...
};
spdk_poller
Repeated Callback: a function repeatedly invoked on the same spdk_thread. Pollers are the workhorses — they check for I/O completions, process admin commands, handle network events, etc.
- Active poller (period=0) — runs every poll cycle, round-robin
- Timed poller (period>0) — runs at intervals, sorted in an RB-tree by next_run_tick
- Returns SPDK_POLLER_BUSY (did work) or SPDK_POLLER_IDLE (no work)
- Can be paused/resumed dynamically
struct spdk_poller {
TAILQ_ENTRY(spdk_poller) tailq;
RB_ENTRY(spdk_poller) node;
uint64_t period_ticks;
uint64_t next_run_tick;
uint64_t run_count;
uint64_t busy_count;
spdk_poller_fn fn;
void *arg;
struct spdk_thread *thread;
char name[SPDK_MAX_POLLER_NAME_LEN + 1];
};
Key Relationship: One Thread, Many Pollers (1:N)
One spdk_thread owns many pollers. When spdk_thread_poll() is called by the reactor, it iterates through all active pollers (round-robin), then checks timed pollers whose deadline has arrived. Each poller function runs to completion before the next one is invoked — no preemption.
The reactor (SPDK's app/event framework, built on top of DPDK's environment layer) creates one POSIX thread per core, and each of those POSIX threads drives one or more spdk_threads in a tight loop.
What happens inside spdk_thread_poll()
lib/thread/thread.c:1120 — thread_poll()
lib/thread/thread.c:1223 — spdk_thread_poll()
This is the heart of SPDK's execution model. Every call to spdk_thread_poll() does the following, in order:
flowchart TD
A["spdk_thread_poll(thread, max_msgs, now)"] --> B["1. Process critical_msg
One-shot emergency callback"]
B --> C["2. msg_queue_run_batch()
Drain up to max_msgs from ring buffer
Cross-thread messages land here"]
C --> D["3. Active Pollers (round-robin)
TAILQ_FOREACH_REVERSE_SAFE
Each poller: fn(arg) → BUSY|IDLE"]
D --> E["4. Timed Pollers (deadline check)
RB-tree sorted by next_run_tick
Only run if now >= next_run_tick"]
E --> F["5. thread_update_stats()
Update busy/idle tsc counters"]
F --> G{"Thread exiting?"}
G -->|Yes| H["thread_exit() — clean up"]
G -->|No| I["Return to reactor loop"]
I --> A
style A fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style B fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style C fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style D fill:#0f766e,color:#fff,stroke:#0f766e
style E fill:#0f766e,color:#fff,stroke:#0f766e
style F fill:#3f6212,color:#fff,stroke:#3f6212
Actual code from thread_poll() — lib/thread/thread.c:1120
static int
thread_poll(struct spdk_thread *thread, uint32_t max_msgs, uint64_t now)
{
uint32_t msg_count;
struct spdk_poller *poller, *tmp;
spdk_msg_fn critical_msg;
int rc = 0;
thread->tsc_last = now;
// Step 1: Process critical message (single, high-priority)
critical_msg = thread->critical_msg;
if (spdk_unlikely(critical_msg != NULL)) {
critical_msg(NULL);
thread->critical_msg = NULL;
rc = 1;
}
// Step 2: Drain message queue (cross-thread msgs)
msg_count = msg_queue_run_batch(thread, max_msgs);
if (msg_count) { rc = 1; }
// Step 3: Execute ALL active pollers (period_ticks == 0)
TAILQ_FOREACH_REVERSE_SAFE(poller, &thread->active_pollers,
active_pollers_head, tailq, tmp) {
int poller_rc = thread_execute_poller(thread, poller);
if (poller_rc > rc) { rc = poller_rc; }
}
// Step 4: Execute timed pollers whose deadline arrived
poller = thread->first_timed_poller;
while (poller != NULL) {
if (now < poller->next_run_tick) break; // sorted, so stop early
tmp = RB_NEXT(...);
RB_REMOVE(...);
int timer_rc = thread_execute_timed_poller(thread, poller, now);
if (timer_rc > rc) { rc = timer_rc; }
poller = tmp;
}
return rc;
}
Scenario: bdev_nvme_attach_controller RPC
When you issue bdev_nvme_attach_controller, SPDK creates an NVMe controller, registers pollers for I/O and admin queues, and creates bdevs for each namespace. Here's the thread/poller dance:
sequenceDiagram
participant RPC as JSON-RPC Thread (app thread)
participant NVMe as NVMe Driver
participant BdevMod as bdev_nvme module
participant Thread as SPDK Thread (I/O core)
RPC->>NVMe: spdk_nvme_connect_async(trid)
Note over RPC: Probe poller registered<br/>on app thread
NVMe-->>BdevMod: connect_attach_cb()
BdevMod->>BdevMod: nvme_ctrlr_create()
Note over BdevMod: Allocates nvme_ctrlr struct<br/>on app thread
BdevMod->>BdevMod: SPDK_POLLER_REGISTER(bdev_nvme_poll_adminq, period=1000μs)
Note over BdevMod: Admin queue poller<br/>timed poller on app thread
BdevMod->>BdevMod: spdk_io_device_register(nvme_ctrlr)
Note over BdevMod: Enables I/O channel<br/>creation on any thread
BdevMod->>BdevMod: nvme_ctrlr_create_done()
BdevMod->>BdevMod: Register bdevs for each NVMe namespace
Note over BdevMod: bdev_auto_examine triggers<br/>if enabled
RPC-->>Thread: When I/O channel opened:
Thread->>Thread: bdev_nvme_create_poll_group_cb()
Thread->>Thread: spdk_nvme_poll_group_create()
Thread->>Thread: SPDK_POLLER_REGISTER(bdev_nvme_poll, period=0)
Note over Thread: I/O completion poller<br/>active poller (period=0)<br/>runs EVERY poll cycle
Pollers Created During NVMe Attach
| Poller | Type | Thread | Purpose |
|---|---|---|---|
| bdev_nvme_poll_adminq | Timed (1000μs) | App thread | Poll admin queue for completions (identify, set features, etc.) |
| bdev_nvme_poll | Active (0μs) | I/O thread | Poll NVMe I/O queue for completions — the hot path |
Buffer Usage During NVMe I/O
When an I/O arrives at the NVMe bdev:
- spdk_bdev_io is pulled from the per-thread cache (fast) or the global pool (slower)
- If the I/O needs a data buffer, it requests an iobuf — small (≤8K) or large (≤132K)
- The bdev_nvme_poll active poller calls spdk_nvme_poll_group_process_completions()
- On completion, the bdev_io goes back to the cache/pool and the iobuf is released
bdev_nvme_poll() — the NVMe I/O completion poller
static int
bdev_nvme_poll(void *arg)
{
struct nvme_poll_group *group = arg;
int64_t num_completions;
num_completions = spdk_nvme_poll_group_process_completions(
group->group, 0, bdev_nvme_disconnected_qpair_cb);
// Returns BUSY if completions processed, IDLE otherwise
// This tells the thread whether this cycle did useful work
return num_completions > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
}
Scenario: bdev_raid_create RPC
Creating a RAID over NVMe bdevs adds another layer. The RAID bdev stacks on top of base bdevs, and each I/O thread gets its own RAID channel that fans out to the underlying NVMe channels.
flowchart TD
subgraph AppThread["App Thread (thread_poll loop)"]
RPC["bdev_raid_create RPC"] --> RC["raid_bdev_create()"]
RC --> IOD["spdk_io_device_register(raid_bdev)"]
IOD --> ABB["raid_bdev_add_base_bdev() × N"]
ABB --> CFG["raid_bdev_configure()"]
CFG --> REG["spdk_bdev_register(raid_bdev)"]
REG --> EXAM{"bdev_auto_examine?"}
EXAM -->|Yes| AE["Notify all modules
of new bdev"]
EXAM -->|No| SKIP["Skip — manual examine later"]
end
subgraph IOThread["I/O Thread (when channel opened)"]
OPEN["spdk_get_io_channel(raid_bdev)"] --> CB["raid_bdev_create_cb()"]
CB --> ALLOC["Allocate raid_bdev_io_channel"]
ALLOC --> BASE["Get io_channel for each
base NVMe bdev"]
BASE --> READY["RAID channel ready
for I/O submission"]
end
REG -.->|"I/O channel created
on first I/O"| OPEN
style RPC fill:#9a3412,color:#fff,stroke:#9a3412
style RC fill:#9a3412,color:#fff,stroke:#9a3412
style OPEN fill:#0f766e,color:#fff,stroke:#0f766e
style CB fill:#0f766e,color:#fff,stroke:#0f766e
RAID + Pollers + Buffers: How They Interact
RAID itself does not register its own pollers — it relies on the underlying bdev modules' pollers. Here's what happens during a RAID I/O:
- Application submits I/O to RAID bdev → spdk_bdev_io allocated from the per-thread cache (bdev_io_cache_size)
- RAID splits/stripes the I/O across base bdevs → may need additional bdev_io structs from the pool
- Each sub-I/O may request an iobuf (small or large) for data transfer
- Sub-I/Os submitted to NVMe bdevs → they go to the NVMe queue pair
- The bdev_nvme_poll active poller picks up completions
- RAID completion callback aggregates results
- bdev_io structs returned to per-thread cache (or pool if cache full)
Scenario: nvmf_create_subsystem + nvmf_subsystem_add_ns
Exposing the RAID bdev over NVMe-oF adds the NVMf transport layer. This creates new pollers for network I/O and chains through the bdev layer to RAID and ultimately NVMe.
flowchart TD
subgraph Host["Remote NVMe-oF Host"]
HI["NVMe-oF Initiator"]
end
subgraph NVMfTarget["SPDK NVMf Target"]
subgraph TPG["Transport Poll Group
(per I/O thread)"]
TP["nvmf_tgroup_poll
Active Poller (period=0)"]
TP --> RECV["Receive NVMe commands
from network"]
RECV --> BDEV["Submit to bdev layer"]
end
subgraph BdevLayer["Bdev Layer"]
BDEV --> BIO["Allocate spdk_bdev_io
from per-thread cache"]
BIO --> IBUF["Request iobuf
(small or large)"]
IBUF --> RAID["RAID bdev submit_request"]
end
subgraph RaidLayer["RAID Layer"]
RAID --> SPLIT["Split across base bdevs"]
SPLIT --> NB1["NVMe bdev 1"]
SPLIT --> NB2["NVMe bdev 2"]
end
subgraph NVMeLayer["NVMe Driver"]
NB1 --> NQP1["NVMe QP 1"]
NB2 --> NQP2["NVMe QP 2"]
NVPOLL["bdev_nvme_poll
Active Poller"] --> NQP1
NVPOLL --> NQP2
end
end
HI -->|"NVMe-oF TCP/RDMA"| TP
NQP1 -->|"Completion"| NVPOLL
NVPOLL -->|"Completion chain"| TP
TP -->|"Send response"| HI
style TP fill:#3f6212,color:#fff,stroke:#3f6212
style NVPOLL fill:#0f766e,color:#fff,stroke:#0f766e
style BIO fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style IBUF fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
NVMf Pollers
| Poller | Type | Purpose |
|---|---|---|
| nvmf_tgroup_poll | Active (0μs) | Process incoming NVMe-oF commands from the network transport |
| accept_poller | Varies by transport | Accept new connections (TCP/RDMA/vfio-user) |
iobuf Usage in NVMf
NVMf transport validates io_unit_size against iobuf pool sizes:
// lib/nvmf/transport.c:295
if (ctx->opts.io_unit_size > opts_iobuf.large_bufsize) {
SPDK_ERRLOG("io_unit_size %d > large_bufsize %d\n",
ctx->opts.io_unit_size, opts_iobuf.large_bufsize);
}
if (ctx->opts.io_unit_size <= opts_iobuf.small_bufsize) {
count = opts_iobuf.small_pool_count;
} else {
count = spdk_min(opts_iobuf.small_pool_count,
opts_iobuf.large_pool_count);
}
lib/nvmf/transport.c:295-305
Two-Level Allocation: Global Pool → Per-Thread Cache
Every bdev I/O operation requires a spdk_bdev_io structure. SPDK uses a two-level scheme to minimize contention: a global mempool (shared, uses atomic ops) and per-thread caches (lockless STAILQ).
flowchart TD
subgraph Global["Global bdev_io Pool
(spdk_mempool, shared across all threads)"]
POOL["bdev_io_pool_size = 65535
Lock-free mempool (DPDK ring)"]
end
subgraph T1["Thread 1 — bdev_mgmt_channel"]
C1["per_thread_cache
STAILQ of bdev_io
cache_size = 256"]
IWQ1["io_wait_queue
Waiters when pool exhausted"]
end
subgraph T2["Thread 2 — bdev_mgmt_channel"]
C2["per_thread_cache
STAILQ of bdev_io
cache_size = 256"]
IWQ2["io_wait_queue"]
end
subgraph T3["Thread N — bdev_mgmt_channel"]
C3["per_thread_cache
STAILQ of bdev_io
cache_size = 256"]
IWQ3["io_wait_queue"]
end
POOL -->|"Pre-populate at
channel creation"| C1
POOL -->|"Pre-populate"| C2
POOL -->|"Pre-populate"| C3
C1 -->|"Return when
cache full"| POOL
C2 -->|"Return when
cache full"| POOL
C3 -->|"Return when
cache full"| POOL
style POOL fill:#d4a73a,color:#1e1b16,stroke:#d4a73a
style C1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style C2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style C3 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
bdev_io_pool_size
Global Pool. Default: 65535 (64K - 1, optimal for a DPDK ring). Total spdk_bdev_io structures allocated at startup in a shared mempool.
Constraint: Must be ≥ bdev_io_cache_size × (thread_count + 1) — because each thread pre-populates its cache at channel creation.
// lib/bdev/bdev.c:519
min_pool_size = opts->bdev_io_cache_size
* (spdk_thread_get_count() + 1);
if (opts->bdev_io_pool_size < min_pool_size) {
SPDK_ERRLOG("bdev_io_pool_size %" PRIu32
" is not compatible with "
"bdev_io_cache_size %" PRIu32
" and %" PRIu32 " threads\n", ...);
}
lib/bdev/bdev.c:38 — #define SPDK_BDEV_IO_POOL_SIZE (64 * 1024 - 1)
lib/bdev/bdev.c:2350 — spdk_mempool_create()
bdev_io_cache_size
Per-Thread Cache. Default: 256. Maximum spdk_bdev_io structs cached per thread in a lockless STAILQ.
Hot path: I/O allocation first checks the local cache (no atomics!). Only falls back to the global pool if cache is empty.
// lib/bdev/bdev.c:2157 — Pre-populate cache
remaining = ch->bdev_io_cache_size
= g_bdev_opts.bdev_io_cache_size;
while (remaining > 0) {
spdk_mempool_get_bulk(g_bdev_mgr.bdev_io_pool,
bdev_ios, count);
for (i = 0; i < count; i++) {
STAILQ_INSERT_HEAD(&ch->per_thread_cache,
bdev_ios[i], ...);
ch->per_thread_cache_count++;
}
}
// lib/bdev/bdev.c:2644 — Return to cache
if (ch->per_thread_cache_count < ch->bdev_io_cache_size) {
ch->per_thread_cache_count++;
STAILQ_INSERT_HEAD(&ch->per_thread_cache, bdev_io, ...);
// Also wake any waiters on io_wait_queue
} else {
// Cache full — return to global pool
spdk_mempool_put(g_bdev_mgr.bdev_io_pool, bdev_io);
}
lib/bdev/bdev.c:39 — #define SPDK_BDEV_IO_CACHE_SIZE 256
bdev_auto_examine
Discovery. Default: true. When a new bdev is registered (e.g., NVMe namespace discovered), the bdev layer automatically notifies all bdev modules to examine it. Modules like lvol, raid, crypto, etc. get a chance to claim or layer on top of it.
// lib/bdev/bdev.c:718
if (g_bdev_opts.bdev_auto_examine) {
bdev_examine(bdev); // Notify all modules
}
// lib/bdev/bdev.c:839
if (g_bdev_opts.bdev_auto_examine) {
bdev_examine(bdev); // Also on hot-plug
}
When to disable: In production systems where you want explicit control over which bdevs are examined, preventing unwanted modules from claiming devices. Set to false and use bdev_examine RPC manually.
Three-Level Buffer System: Global Pool → Per-Thread Cache → I/O Consumer
SPDK's iobuf subsystem manages DMA-capable data buffers separately from bdev_io control structures. It uses two size classes (small and large) with both global pools (per NUMA node) and per-thread caches.
flowchart TD
subgraph NUMA["Per-NUMA Global Pools
(hugepage-backed, DMA-capable)"]
SP["Small Pool
small_pool_count = 8192
small_bufsize = 8KB
spdk_ring (MP/MC)"]
LP["Large Pool
large_pool_count = 1024
large_bufsize = 132KB
spdk_ring (MP/MC)"]
end
subgraph TCH1["Thread 1 — spdk_iobuf_channel"]
SC1["Small Cache
iobuf_small_cache_size = 128"]
LC1["Large Cache
iobuf_large_cache_size = 16"]
end
subgraph TCH2["Thread 2 — spdk_iobuf_channel"]
SC2["Small Cache = 128"]
LC2["Large Cache = 16"]
end
subgraph Consumer["I/O Consumers"]
BDEV["bdev layer"]
NVMF["NVMf transport"]
ACCEL["accel framework"]
end
SP -->|"Refill cache"| SC1
SP -->|"Refill cache"| SC2
LP -->|"Refill cache"| LC1
LP -->|"Refill cache"| LC2
SC1 -->|"Return on free"| SP
LC1 -->|"Return on free"| LP
SC1 --> BDEV
LC1 --> BDEV
SC1 --> NVMF
LC1 --> NVMF
SC2 --> ACCEL
style SP fill:#6b21a8,color:#fff,stroke:#6b21a8
style LP fill:#6b21a8,color:#fff,stroke:#6b21a8
style SC1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style LC1 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style SC2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
style LC2 fill:#1e3a5f,color:#fff,stroke:#1e3a5f
small_pool_count / large_pool_count
Global Pools. Defaults: small = 8192, large = 1024
Number of buffers in the global per-NUMA ring. These are allocated from hugepages (spdk_malloc with SPDK_MALLOC_DMA flag) at startup.
Minimums: small ≥ 64, large ≥ 8
// lib/thread/iobuf.c:13-14, 72-73
#define IOBUF_MIN_SMALL_POOL_SIZE 64
#define IOBUF_MIN_LARGE_POOL_SIZE 8
#define IOBUF_DEFAULT_SMALL_POOL_SIZE 8192
#define IOBUF_DEFAULT_LARGE_POOL_SIZE 1024
small_bufsize / large_bufsize
Buffer Sizes. Defaults: small = 8KB, large = 132KB
Size of each buffer. Aligned to 4096 bytes. The 132KB large buffer size accommodates the default max I/O size (128K) plus interleaved metadata.
// lib/thread/iobuf.c:17-25, 74-75
#define IOBUF_MIN_SMALL_BUFSIZE 4096
#define IOBUF_MIN_LARGE_BUFSIZE 8192
#define IOBUF_DEFAULT_SMALL_BUFSIZE (8 * 1024)
// 132k = 128k data + metadata
#define IOBUF_DEFAULT_LARGE_BUFSIZE (132 * 1024)
iobuf_small_cache_size / iobuf_large_cache_size
Per-Thread. Defaults: small = 128, large = 16
Each bdev mgmt channel (per thread) creates an iobuf channel with these cache sizes. Reduces trips to the global pool.
// lib/bdev/bdev.c:42-43
#define BUF_SMALL_CACHE_SIZE 128
#define BUF_LARGE_CACHE_SIZE 16
// lib/bdev/bdev.c:2149-2150
spdk_iobuf_channel_init(&ch->iobuf,
"bdev",
g_bdev_opts.iobuf_small_cache_size,
g_bdev_opts.iobuf_large_cache_size);
lib/bdev/bdev.c:2149
iobuf pool initialization — how buffers are allocated from hugepages
// lib/thread/iobuf.c:118 — iobuf_node_initialize()
static int iobuf_node_initialize(struct iobuf_node *node, uint32_t numa_id) {
struct spdk_iobuf_opts *opts = &g_iobuf.opts;
// Small pool: MP/MC ring for thread-safe access
node->small_pool = spdk_ring_create(SPDK_RING_TYPE_MP_MC,
opts->small_pool_count, numa_id);
// Allocate contiguous hugepage memory for all small buffers
node->small_pool_base = spdk_malloc(
opts->small_bufsize * opts->small_pool_count,
IOBUF_ALIGNMENT, // 4096
NULL, numa_id, SPDK_MALLOC_DMA);
// Same for large pool
node->large_pool = spdk_ring_create(SPDK_RING_TYPE_MP_MC,
opts->large_pool_count, numa_id);
node->large_pool_base = spdk_malloc(
opts->large_bufsize * opts->large_pool_count,
IOBUF_ALIGNMENT, NULL, numa_id, SPDK_MALLOC_DMA);
// Populate rings with buffer pointers
for (i = 0; i < opts->small_pool_count; i++) {
buf = node->small_pool_base + i * opts->small_bufsize;
spdk_ring_enqueue(node->small_pool, (void **)&buf, 1, NULL);
}
for (i = 0; i < opts->large_pool_count; i++) {
buf = node->large_pool_base + i * opts->large_bufsize;
spdk_ring_enqueue(node->large_pool, (void **)&buf, 1, NULL);
}
}
Complete I/O Walk-Through
Here's exactly what happens when a remote NVMe-oF host sends a 64KB write command to a RAID-1 volume backed by two NVMe SSDs. Every thread, poller, bdev_io, and iobuf interaction is shown.
sequenceDiagram
participant Host as NVMe-oF Host
participant NVMfPoll as nvmf_tgroup_poll (Active Poller)
participant BdevIO as bdev_io allocation
participant IOBuf as iobuf subsystem
participant RAID as RAID-1 module
participant NVMe1 as NVMe SSD 1
participant NVMe2 as NVMe SSD 2
participant NVMePoll as bdev_nvme_poll (Active Poller)
Note over NVMfPoll: Thread poll cycle starts<br/>thread_poll() called
Host->>NVMfPoll: 64KB Write Command (TCP/RDMA)
NVMfPoll->>BdevIO: Get bdev_io from per-thread cache
Note over BdevIO: cache_count-- (256→255)<br/>No atomics needed!
BdevIO->>IOBuf: Request large iobuf (64KB > 8KB)
Note over IOBuf: Check per-thread large cache (16 bufs)<br/>If empty → global large_pool ring
IOBuf-->>BdevIO: 132KB DMA buffer
BdevIO->>RAID: spdk_bdev_write(raid_bdev, buf, 64KB)
Note over RAID: RAID-1: mirror to both drives
RAID->>NVMe1: Submit write (same buf, same thread)
RAID->>NVMe2: Submit write (same buf, same thread)
Note over NVMePoll: Same thread, next poll cycle
NVMePoll->>NVMe1: spdk_nvme_poll_group_process_completions()
NVMe1-->>NVMePoll: Write complete
NVMePoll->>NVMe2: process_completions()
NVMe2-->>NVMePoll: Write complete
NVMePoll->>RAID: Both mirrors complete
RAID->>IOBuf: Release large iobuf
Note over IOBuf: Return to per-thread cache<br/>or global pool if cache full
RAID->>BdevIO: spdk_bdev_free_io()
Note over BdevIO: Return bdev_io to cache<br/>cache_count++ (255→256)
BdevIO->>NVMfPoll: I/O complete callback
NVMfPoll->>Host: Write Response (success)
Key Insight: Everything Happens on ONE Thread
In the above flow, the entire I/O path — from receiving the NVMe-oF command, through RAID mirroring, to NVMe submission and completion — happens on a single SPDK thread without any context switch or mutex. The pollers (nvmf_tgroup_poll and bdev_nvme_poll) are both registered on the same thread. The per-thread bdev_io cache and iobuf caches eliminate all contention with other threads.
This is the fundamental design principle: pin everything to one core, avoid sharing, eliminate locks.
All Parameters at a Glance
| Parameter | RPC | Default | Scope | What It Controls | Sizing Rule |
|---|---|---|---|---|---|
| bdev_io_pool_size | bdev_set_options | 65535 | Global (one mempool) | Total spdk_bdev_io structs. Used for I/O control metadata, NOT data. | ≥ cache_size × (threads + 1). Use power-of-2 minus 1 for DPDK ring efficiency. |
| bdev_io_cache_size | bdev_set_options | 256 | Per thread | Lock-free bdev_io cache per thread. Pre-populated from global pool. | ≥ expected max concurrent I/Os per thread. Higher = less global pool contention. |
| bdev_auto_examine | bdev_set_options | true | Global | Auto-notify modules when new bdevs appear (lvol, raid, etc. can claim them). | Set false in production for explicit control. |
| iobuf_small_cache_size | bdev_set_options | 128 | Per thread (bdev module) | Per-thread cache of small (≤8KB) DMA buffers from the global iobuf pool. | ≥ concurrent small I/Os per thread. Higher = less ring contention. |
| iobuf_large_cache_size | bdev_set_options | 16 | Per thread (bdev module) | Per-thread cache of large (≤132KB) DMA buffers. | Lower default because large buffers are expensive (132KB each). |
| small_pool_count | iobuf_set_options | 8192 | Global (per NUMA) | Total small DMA buffers in the MP/MC ring. Backing store for per-thread caches. | ≥ 64. Scale with thread count × iobuf_small_cache_size. |
| large_pool_count | iobuf_set_options | 1024 | Global (per NUMA) | Total large DMA buffers. Each is 132KB by default — 1024 × 132KB = 132MB. | ≥ 8. Scale with thread count × iobuf_large_cache_size + headroom. |
| small_bufsize | iobuf_set_options | 8192 | Global | Size of each small buffer. Aligned to 4096. Used for I/Os ≤ this size. | ≥ 4096. Match SPDK_BDEV_SMALL_BUF_MAX_SIZE. |
| large_bufsize | iobuf_set_options | 135168 (132KB) | Global | Size of each large buffer. Must accommodate max I/O size + metadata. | ≥ 8192. Default = 128KB + metadata headroom. NVMf io_unit_size must not exceed this. |
Memory Footprint Calculation
For a system with 4 I/O threads at default settings, the iobuf pools dominate: 8192 small buffers × 8KB = 64MB plus 1024 large buffers × 132KB = 132MB, per NUMA node. The bdev_io pool adds 65535 control structures on top (a few MB, depending on the struct size in your SPDK version).
All allocated from hugepages. The per-thread caches don't allocate separate memory — they just hold pointers into the global pool.
Source Files Reference
Threading & Pollers: include/spdk/thread.h, lib/thread/thread.c, lib/thread/iobuf.c
Bdev Layer: include/spdk/bdev.h, include/spdk/bdev_module.h, lib/bdev/bdev.c, lib/bdev/bdev_rpc.c
NVMe Bdev Module: module/bdev/nvme/bdev_nvme.c
RAID & NVMf: module/bdev/raid/bdev_raid.c, lib/nvmf/transport.c, lib/nvmf/nvmf_rpc.c