SPDK (part 1, tutorial)

Concepts to know in advance

Linux kernel drivers:

UIO:

The official DPDK documentation (http://doc.dpdk.org/guides/linux_gsg/linux_drivers.html#UIO) explains this clearly; excerpt below:

A small kernel module to set up the device, map device memory to user-space and register interrupts. In many cases, the standard uio_pci_generic module included in the Linux kernel can provide the uio capability.

For some devices which lack support for legacy interrupts, e.g. virtual function (VF) devices, the igb_uio module may be needed in place of uio_pci_generic.

UIO consists of two parts:

UIO Driver

- The device tree node for the device can use whatever you want in the compatible property as it only has to match what is used in the kernel space driver as with any platform device driver

UIO Platform Device Driver

- The device tree node for the device needs to use "generic-uio" in its compatible property

The basic framework is as follows:

(Figure: UIO framework)

User-space driver workflow (a minimal C sketch follows the list):

1. Load the kernel-space UIO device driver before starting the user-space driver;

2. Start the user-space application and open the corresponding UIO device (/dev/uioX); from user space, the UIO device is a device node in the file system just like any other device;

3. Find the device memory address information in the corresponding sysfs directory, e.g. the mapping size in /sys/class/uio/uio0/maps/map0/size;

4. Map the device memory into the process address space by calling mmap() on the UIO device;

5. The application accesses the device hardware to control the device;

6. Remove the device memory mapping by calling munmap();

7. Close the UIO device file.
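As a concrete illustration of steps 2 through 7, here is a minimal C sketch; the device name uio0 and the 4096-byte mapping size are placeholders, and a real program would first read the size from /sys/class/uio/uio0/maps/map0/size.

/* Minimal UIO access sketch: open /dev/uio0, mmap its first memory region
 * (map0), read a register, then unmap and close. 4096 is a placeholder for
 * the size reported in /sys/class/uio/uio0/maps/map0/size. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    /* Offset N * getpagesize() selects mapN of the UIO device. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first register: 0x%08x\n", regs[0]);   /* access device memory */

    /* A blocking read() on fd would return the interrupt count (not shown). */

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}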

(Figure: mapping flow from virtual memory addresses to physical memory addresses)

For more details on UIO, see: https://www.cnblogs.com/vlhn/p/7761869.html

VFIO:

VFIO exposes the IOMMU interface to user space: through ioctl calls it programs the IOMMU so that the DMA address space is mapped into, and restricted to, the process's virtual address space. See:

1) https://www.kernel.org/doc/Documentation/vfio.txt

2) https://www.ibm.com/developerworks/community/blogs/5144904d-5d75-45ed-9d2b-cf1754ee936a/entry/vfio?lang=en_us

VFIO requires BIOS and kernel support, and I/O virtualization (Intel VT-d) must be enabled.
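Following the example flow in vfio.txt, the essential user-space steps are: open the container and the device's IOMMU group, attach the group to the container, select the type1 IOMMU backend, map some process memory for DMA, and obtain a device fd. A trimmed-down C sketch, in which the group number 26 and BDF 0000:00:04.0 are placeholders and status/error checks are omitted:

/* Trimmed-down VFIO type1 flow based on Documentation/vfio.txt. The device
 * must already be bound to vfio-pci. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);       /* the device's IOMMU group */

    /* Attach the group to the container, then pick the type1 IOMMU backend. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Map 1 MB of process memory at IOVA 0: DMA from the device is confined
     * to regions mapped this way. */
    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
    dma_map.vaddr = (unsigned long)mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dma_map.size  = 1 << 20;
    dma_map.iova  = 0;
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

    /* Get a device fd for access to its regions (BARs, config) and interrupts. */
    int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:04.0");
    return device < 0;
}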

IOMMU:

See https://nanxiao.me/iommu-introduction/. The IOMMU provides a mechanism for I/O devices to access actual physical memory; in virtualization it internally implements the translation between guest VM memory addresses and host memory addresses.

(Figures: typical physical view; comparison with the MMU; summary from AMD)

PCI BAR (base address register):

Put simply, this is the PCI configuration mechanism, covering the configuration register header, the device numbering (B/D/F), and the corresponding hardware/software implementation, which together provide addressing of PCI devices.

The following passage, quoted from https://en.wikipedia.org/wiki/PCI_configuration_space, briefly explains how B/D/F is split up and used for addressing.

One of the major improvements the PCI Local Bus had over other I/O architectures was its configuration mechanism. In addition to the normal memory-mapped and I/O port spaces, each device function on the bus has a configuration space, which is 256 bytes long, addressable by knowing the eight-bit PCI bus, five-bit device, and three-bit function numbers for the device (commonly referred to as the BDF or B/D/F, as abbreviated from bus/device/function). This allows up to 256 buses, each with up to 32 devices, each supporting eight functions. A single PCI expansion card can respond as a device and must implement at least function number zero. The first 64 bytes of configuration space are standardized; the remainder are available for vendor-defined purposes.
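To see B/D/F addressing from user space, each function's 256-byte configuration space is exposed by sysfs; the small C sketch below (the BDF 0000:00:04.0 is only an example) reads a few of the standardized header fields.

/* Read the standardized header of a PCI function's config space via sysfs. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/bus/pci/devices/0000:00:04.0/config", "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t cfg[64];                     /* first 64 bytes are standardized */
    if (fread(cfg, 1, sizeof(cfg), f) != sizeof(cfg)) { fclose(f); return 1; }
    fclose(f);

    uint16_t vendor = cfg[0] | (cfg[1] << 8);     /* offset 0x00: vendor ID */
    uint16_t device = cfg[2] | (cfg[3] << 8);     /* offset 0x02: device ID */
    uint32_t bar0   = cfg[0x10] | (cfg[0x11] << 8) |
                      (cfg[0x12] << 16) | ((uint32_t)cfg[0x13] << 24); /* BAR0 */

    printf("vendor=0x%04x device=0x%04x bar0=0x%08x\n", vendor, device, bar0);
    return 0;
}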

Below is the system information shown by the script shipped with SPDK. The drivers currently supported by SPDK include NVMe, I/OAT (Intel's I/O Acceleration Technology), and virtio (a paravirtualized device abstraction interface specification whose defined transports are PCI, MMIO, and Channel I/O).

NVMe devices

BDF             Numa Node       Driver name             Device name

I/OAT DMA

BDF             Numa Node       Driver Name

0000:00:04.0    0               vfio-pci

0000:80:04.0    1               vfio-pci

...

virtio

BDF             Numa Node       Driver Name             Device Name

MMIO (memory-mapped I/O)

MMIO and PMIO (port-mapped I/O) are complementary approaches to I/O between the CPU and peripheral devices. With MMIO, I/O and memory share the same address space: an address in a CPU instruction may refer either to memory or to a particular I/O device. Each I/O device monitors the CPU's address bus and responds to accesses to its addresses, connecting the data bus to the hardware register of the selected device, so CPU instructions can access an I/O device the same way they access memory. By analogy with DMA as a memory-to-device technique, MMIO can be thought of as a CPU-to-device technique.

See https://en.wikipedia.org/wiki/Memory-mapped_I/O
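From user space, the same idea can be observed by mmap()-ing a device BAR exposed through sysfs; a short C sketch (the BDF and the 4 KB size are placeholders) follows the same pattern as the UIO example above:

/* Map BAR0 of a PCI device through sysfs and perform an MMIO read. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource0", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* This load is decoded by the device, not by RAM: a CPU-to-device access. */
    printf("reg@0x0 = 0x%08x\n", bar[0]);

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}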

NVMe (non-volatile memory express)

An optimized, high-performance, scalable host controller interface that leverages PCIe-based SSDs to meet the needs of enterprise and client systems. See www.nvmexpress.org

It supports up to 64K queues with up to 64K commands per queue.

The officially recommended threading model is CPU : thread : NVMe queue pair = 1 : 1 : 1:

The recommended threading model for an application using SPDK is to spawn a fixed number of threads in a pool and dedicate a single NVMe queue pair to each thread. A further improvement would be to pin each thread to a separate CPU core, and often the SPDK documentation will use "CPU core" and "thread" interchangeably because we have this threading model in mind.
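A rough C sketch of that model using the SPDK NVMe driver API; controller probing, error handling, and thread creation are elided, and the namespace, LBA, and buffer size are placeholder values.

/* One I/O qpair per thread, each thread pinned to its own core
 * (CPU : thread : qpair = 1 : 1 : 1). Sketch only: spdk_env_init() and
 * spdk_nvme_probe() are assumed to have attached the controller already;
 * spawn one worker per core with pthread_create(). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

struct worker {
    struct spdk_nvme_ctrlr *ctrlr;
    struct spdk_nvme_ns    *ns;
    int                     core;
};

static void io_done(void *arg, const struct spdk_nvme_cpl *cpl) { *(int *)arg = 1; }

static void *worker_fn(void *arg)
{
    struct worker *w = arg;

    /* Pin this thread to its dedicated CPU core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Each thread owns exactly one qpair, so no locking is needed on it. */
    struct spdk_nvme_qpair *qp = spdk_nvme_ctrlr_alloc_io_qpair(w->ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);    /* DMA-able buffer */

    int done = 0;
    spdk_nvme_ns_cmd_read(w->ns, qp, buf, 0 /* LBA */, 1 /* blocks */,
                          io_done, &done, 0);
    while (!done)
        spdk_nvme_qpair_process_completions(qp, 0);     /* poll; no interrupts */

    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qp);
    return NULL;
}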

SPDK basic architecture

(Figure: SPDK 18.07 architecture)

Storage protocol layer:

iSCSI target: Implementation of the established specification for block traffic over Ethernet; about twice as efficient as kernel LIO. Current version uses the kernel TCP/IP stack by default.

NVMe-oF target: Implements the new NVMe-oF specification. Though it depends on RDMA hardware, the NVMe-oF target can serve up to 40 Gbps of traffic per CPU core.

vhost-scsi target (not shown in the figure above; already available as of version 18.04): A feature for KVM/QEMU that utilizes the SPDK NVMe driver, giving guest VMs lower latency access to the storage media and reducing the overall CPU load for I/O intensive workloads.

Storage services layer:

Block device abstraction layer (bdev): This generic block device abstraction is the glue that connects the storage protocols to the various device drivers and block devices. Also provides flexible APIs for additional customer functionality (RAID, compression, dedup, and so on) in the block layer.

Blobstore: Implements a highly streamlined file-like semantic (non-POSIX*) for SPDK. This can provide high-performance underpinnings for databases, containers, virtual machines (VMs), or other workloads that do not depend on much of a POSIX file system’s feature set, such as user access control.

Hardware driver layer:

NVMe driver: The foundational component for SPDK, this highly optimized, lockless driver provides unparalleled scalability, efficiency, and performance.

Intel® QuickData Technology: Also known as Intel® I/O Acceleration Technology (Intel® IOAT), this is a copy offload engine built into the Intel® Xeon® processor-based platform. By providing user space access, the threshold for DMA data movement is reduced, allowing greater utilization for small-size I/Os or NTB.


Installation and build (see https://github.com/spdk/spdk)

# git clone https://github.com/spdk/spdk

# cd spdk

# git submodule update --init

# git submodule (you can see that DPDK is pulled in as one of the submodules)

b6ae5bcff6ca09a7e1536eaa449aa6f4e704a6d9 dpdk (v18.05-12-gb6ae5bc)

134c90c912ea9376460e9d949bb1319a83a9d839 intel-ipsec-mb (v0.49-1-g134c90c)

# ./scripts/pkgdep.sh (install the dependency packages)

# ./configure (this generates CONFIG.local in the current directory; by default it only specifies the DPDK directory path. Adding other options, e.g. --with-rdma, writes the corresponding configuration item CONFIG_RDMA=y into CONFIG.local. Run ./configure -h to see all options)

# make (make also accepts similar options that end up in the final CONFIG.local)

Before running an SPDK application, hugepages must be allocated and the NVMe, I/OAT, and Virtio devices must be bound to the appropriate driver; this is done with the setup.sh script:

# HUGEMEM=8192 scripts/setup.sh

# ./scripts/setup.sh status

Hugepages

node    hugesize      free /   total

node0   1048576kB        4 /       8

node0      2048kB     1024 /    1024

node1   1048576kB        4 /       8

node1      2048kB     1024 /    1024

NVMe devices

BDF             Numa Node       Driver name             Device name

I/OAT DMA

BDF             Numa Node       Driver Name

0000:00:04.0    0               ioatdma

0000:80:04.0    1               ioatdma

...

virtio

BDF             Numa Node       Driver Name             Device Name

The official GitHub source tree includes:

--NVMe driver

--I/OAT (DMA engine) driver

--NVMe over Fabrics target

--iSCSI target

--vhost target

--Virtio-SCSI driver

NVMe driver

The project provides a Vagrant-based virtualized environment with an NVMe device attached for hands-on practice; more on this later in the article.

I/OAT (DMA engine) driver

The linked page only provides the API reference.
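For orientation, here is a rough sketch, not taken from the SPDK docs, of how the copy-offload API in spdk/ioat.h is typically driven, assuming a channel has already been obtained through spdk_ioat_probe():

/* Offload a 4 KB memory copy to an I/OAT channel and poll for completion.
 * Sketch only: `chan` is assumed to have been attached via spdk_ioat_probe(). */
#include "spdk/env.h"
#include "spdk/ioat.h"

static void copy_done(void *arg) { *(int *)arg = 1; }

static void do_copy(struct spdk_ioat_chan *chan)
{
    void *src = spdk_dma_zmalloc(4096, 64, NULL);
    void *dst = spdk_dma_zmalloc(4096, 64, NULL);
    int done = 0;

    /* Queue the copy descriptor on the channel and kick the engine. */
    spdk_ioat_submit_copy(chan, &done, copy_done, dst, src, 4096);

    /* The DMA engine runs asynchronously; poll until the callback fires. */
    while (!done)
        spdk_ioat_process_events(chan);

    spdk_dma_free(src);
    spdk_dma_free(dst);
}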

NVMe over Fabrics target

# apt-get install libibverbs-dev librdmacm-dev (or: yum install libibverbs-devel librdmacm-devel)

# ./configure --with-rdma

# make

After the build completes, check the resulting binary:

[root@localhost spdk]# cd app/nvmf_tgt/

[root@localhost nvmf_tgt]# ls

Makefile  nvmf_main.c  nvmf_main.d  nvmf_main.o  nvmf_tgt

Based on the example configuration file, add the corresponding PCIe NVMe device:

[vagrant@localhost spdk]$ cp ./etc/spdk/nvmf.conf.in app/nvmf_tgt/nvmf.conf

[vagrant@localhost spdk]$ sudo app/nvmf_tgt/nvmf_tgt -c app/nvmf_tgt/nvmf.conf

Starting SPDK v18.10-pre / DPDK 18.05.0 initialization...

[ DPDK EAL parameters: nvmf -c 0x1 --legacy-mem --file-prefix=spdk_pid25254 ]

EAL: Detected 2 lcore(s)

EAL: Detected 1 NUMA nodes

EAL: Multi-process socket /var/run/dpdk/spdk_pid25254/mp_socket

EAL: Probing VFIO support...

app.c: 530:spdk_app_start: *NOTICE*: Total cores available: 1

reactor.c: 718:spdk_reactors_init: *NOTICE*: Occupied cpu socket mask is 0x1

reactor.c: 492:_spdk_reactor_run: *NOTICE*: Reactor started on core 0 on socket 0

EAL: PCI device 0000:00:0e.0 on NUMA socket 0

EAL:   probe driver: 80ee:4e56 spdk_nvme

The example shows how to pin the target to specific CPU cores (cores 24, 25, 26 and 27):

# app/nvmf_tgt/nvmf_tgt -m 0xF000000

Use nvme-cli for discovery and configuration:

# modprobe nvme-rdma

# apt-get install nvme-cli (or: yum install nvme-cli)

# nvme list

Basic example operations:

# nvme discover -t rdma -a 192.168.100.8 -s 4420

# nvme connect -t rdma -n "nqn.2016-06.io.spdk:cnode1" -a 192.168.100.8 -s 4420

# nvme disconnect -n "nqn.2016-06.io.spdk:cnode1"

iSCSI target

Basic configuration and tuning are described in detail at http://www.spdk.io/doc/iscsi.html. That page also mentions VPP, a high-performance packet-processing framework open-sourced by Cisco that can act as a vswitch, vrouter, or even a vfirewall; a short digression on it follows.

VPP installation

# touch /etc/apt/sources.list.d/99fd.io.list

# echo deb [trusted=yes] https://nexus.fd.io/content/repositories/fd.io.ubuntu.xenial.main/ ./ >> /etc/apt/sources.list.d/99fd.io.list

# apt-get update

# apt-get install -y vpp-lib vpp vpp-plugins

# vppctl

vpp# set interface state tapcli-0 up

vpp# set interface ip address tapcli-0 10.0.0.1/24

vpp# show int

Name                        Idx   State  MTU (L3/IP4/IP6/MPLS)    Counter          Count

GigabitEthernet1/0/0          1   down         9000/0/0/0

GigabitEthernet1/0/1          2   down         9000/0/0/0

GigabitEthernet1/0/2          3   down         9000/0/0/0

GigabitEthernet1/0/3          4   down         9000/0/0/0

TenGigabitEthernet6/0/0       5   down         9000/0/0/0

TenGigabitEthernet6/0/1       6   down         9000/0/0/0

TenGigabitEthernet82/0/1      7   down         9000/0/0/0

local0                        0   down            0/0/0/0

tapcli-0                      8   up           9000/0/0/0        drops                          8

vpp# show hardware tapcli-0

Name                Idx   Link  Hardware

tapcli-0              8   up    tapcli-0

vpp# show int addr

GigabitEthernet1/0/0 (dn):

GigabitEthernet1/0/1 (dn):

GigabitEthernet1/0/2 (dn):

GigabitEthernet1/0/3 (dn):

TenGigabitEthernet6/0/0 (dn):

TenGigabitEthernet6/0/1 (dn):

TenGigabitEthernet82/0/1 (dn):

local0 (dn):

tapcli-0 (up):

  L3 10.0.0.1/24

On the kernel side, configure the corresponding tap interface and run a ping test:

root@ONAP-Test-Temp:/home/set# ip addr add 10.0.0.2/24 dev tap0

root@ONAP-Test-Temp:/home/set# ip link set tap0 up

root@ONAP-Test-Temp:/home/set# ping 10.0.0.1

PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.

64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.327 ms

64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.033 ms

Check the results.

vhost target

For the protocol, see the vhost-user protocol specification.

Start the vhost target from SPDK (http://www.spdk.io/doc/vhost.html):

# qemu-system-x86_64 -version

# qemu-system-x86_64 -device vhost-user-scsi-pci,help

# echo 4 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages

# echo 4 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

# grep Huge /proc/meminfo

AnonHugePages:    477184 kB

HugePages_Total:       8

HugePages_Free:        3

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:    1048576 kB

# app/vhost/vhost -S /var/tmp/ -m 0x30

Starting SPDK v18.07-pre / DPDK 18.05.0 initialization...

[ DPDK EAL parameters: vhost -c 0x30 -m 1024 --legacy-mem --file-prefix=spdk_pid16756 ]

EAL: Detected 48 lcore(s)

EAL: Detected 2 NUMA nodes

EAL: Multi-process socket /var/run/dpdk/spdk_pid16756/mp_socket

EAL: Probing VFIO support...

EAL: VFIO support initialized

app.c: 530:spdk_app_start: *NOTICE*: Total cores available: 2

reactor.c: 718:spdk_reactors_init: *NOTICE*: Occupied cpu socket mask is 0x1

reactor.c: 492:_spdk_reactor_run: *NOTICE*: Reactor started on core 5 on socket 0

reactor.c: 492:_spdk_reactor_run: *NOTICE*: Reactor started on core 4 on socket 0

# ls -al /var/tmp/

drwxrwxrwt 5 root root 4096 Aug 17 01:32 .

drwxr-xr-x 12 root root 4096 Dec 11 2017 ..

srwxr-xr-x 1 root root 0 Aug 17 01:32 spdk.sock

-rw------- 1 root root 0 Aug 17 01:32 spdk.sock.lock

Create a bdev:

# scripts/rpc.py construct_malloc_bdev 64 512 -b Malloc0

Malloc0

Create the vhost devices

Following http://www.spdk.io/doc/vhost.html: the SPDK vhost application is started on CPU cores 0 and 1, and QEMU on cores 2 and 3.

First create the bdevs:

host:~# ./scripts/rpc.py construct_nvme_bdev -b Nvme0 -t pcie -a 0000:01:00.0

EAL: PCI device 0000:01:00.0 on NUMA socket 0

EAL:   probe driver: 8086:953 spdk_nvme

EAL:   using IOMMU type 1 (Type 1)

host:~# ./scripts/rpc.py construct_malloc_bdev 128 4096 Malloc0

Malloc0

host:~# ./scripts/rpc.py construct_malloc_bdev 64 512 -b Malloc1

Malloc1

Create the vhost-SCSI controller:

host:~# ./scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0

VHOST_CONFIG: vhost-user server: socket created, fd: 21

VHOST_CONFIG: bind to /var/tmp/vhost.0

vhost.c: 596:spdk_vhost_dev_construct: *NOTICE*: Controller vhost.0: new controller added

host:~# ./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Nvme0n1

vhost_scsi.c: 840:spdk_vhost_scsi_dev_add_tgt: *NOTICE*: Controller vhost.0: defined target 'Target 0' using lun 'Nvme0'

host:~# ./scripts/rpc.py add_vhost_scsi_lun vhost.0 1 Malloc0

vhost_scsi.c: 840:spdk_vhost_scsi_dev_add_tgt: *NOTICE*: Controller vhost.0: defined target 'Target 1' using lun 'Malloc0'

Create the vhost-BLK controller:

host:~# ./scripts/rpc.py construct_vhost_blk_controller --cpumask 0x2 vhost.1 Malloc1

vhost_blk.c: 719:spdk_vhost_blk_construct: *NOTICE*: Controller vhost.1: using bdev 'Malloc1'

Vhost-NVMe (experimental)

rpc_py construct_vhost_nvme_controller --cpumask 0x1 vhost.2 16   /* create the vhost-NVMe controller */

rpc_py add_vhost_nvme_ns vhost.2 Malloc0   /* attach bdev Malloc0 to the controller */

At the same time, pass the corresponding boot parameters to QEMU and start the guest VM:

Vhost-SCSI

-chardev socket,id=char0,path=/var/tmp/vhost.0

-device vhost-user-scsi-pci,id=scsi0,chardev=char0

Vhost-BLK

-chardev socket,id=char1,path=/var/tmp/vhost.1

-device vhost-user-blk-pci,id=blk0,chardev=char1

Vhost-NVMe (experimental)

-chardev socket,id=char2,path=/var/tmp/vhost.2

-device vhost-user-nvme,id=nvme0,chardev=char2,num_io_queues=4

host:~# taskset -c 2,3 qemu-system-x86_64 \
  --enable-kvm \
  -cpu host -smp 2 \
  -m 1G -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem0 \
  -drive file=guest_os_image.qcow2,if=none,id=disk \
  -device ide-hd,drive=disk,bootindex=0 \
  -chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 \
  -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0,num_queues=4 \
  -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.1 \
  -device vhost-user-blk-pci,chardev=spdk_vhost_blk0,num-queues=4

(Figure: topology)

Virtio-SCSI driver

TBD

