This is the 79th installment of Ceph Development Weekly, covering community development from June 12 to June 19, 2017. The author's series of Ceph technical module analyses, begun two years ago, came to a close in the middle of this year, and quite a few readers are presumably waiting for the next deep technical analysis. In that time Ceph has grown from almost nothing into a thriving project, yet many people remain out of touch with the community's direction and day-to-day reality, which is why the author decided to start this Ceph Development Weekly series. Each installment summarizes the previous week's technical updates and analyzes a few hot topics in depth; when there is relevant industry news it is interpreted as well, and if readers have sent in questions worth discussing, a Q&A section is appended at the end.
One-Line News
The Ceph L release (Luminous) will be delayed until autumn.
LMDB
• Xinxin previously implemented an LMDB-based KV database, intended as a replacement for RocksDB as the primary metadata store: https://github.com/ceph/ceph/pull/4403
• The main concern so far has been that LMDB is weak at writes, far behind RocksDB, but faster at reads because of its B-tree design: a read is a single descent of a memory-mapped tree, with none of the multi-level merging an LSM tree requires (see the sketch below).
• The main problem currently observed with RocksDB is that poor read performance drags down writes: for example, writing a 512GB RBD volume generates roughly 200GB of database writes, plus even more read bandwidth.
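To make the read-path argument concrete: an LMDB read happens inside a lock-free, read-only transaction against a memory-mapped copy-on-write B+tree, so it is one tree descent with no compaction or level merging behind it. Below is a minimal sketch using LMDB's C API; the environment path and key are placeholders, not anything from the PR.

// Minimal LMDB read-path sketch: fetch one key inside a read-only
// transaction. The environment path and key name are placeholders.
#include <lmdb.h>
#include <cstdio>
#include <cstring>

int main() {
  MDB_env* env = nullptr;
  mdb_env_create(&env);
  mdb_env_set_mapsize(env, 1UL << 30);            // 1GB map; all reads go through this mmap
  if (mdb_env_open(env, "/tmp/lmdb-demo", 0, 0644) != 0) return 1;

  MDB_txn* txn = nullptr;
  mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn);  // readers never block the writer (MVCC)
  MDB_dbi dbi;
  mdb_dbi_open(txn, nullptr, 0, &dbi);            // default (unnamed) database

  const char* k = "some-key";
  MDB_val key{std::strlen(k), const_cast<char*>(k)}, val{};
  if (mdb_get(txn, dbi, &key, &val) == 0)         // single B-tree lookup, zero-copy result
    std::printf("%.*s\n", (int)val.mv_size, (const char*)val.mv_data);

  mdb_txn_abort(txn);                             // read-only txns are simply aborted
  mdb_env_close(env);
  return 0;
}

Back to the RocksDB side: the BlueStore OSD partition layout and per-device disk statistics below show where those 200GB of writes and the extra read bandwidth land.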
lrwxrwxrwx 1 root root 15 Jun 7 09:07 osd-device-0-data -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Jun 7 09:07 osd-device-0-wal -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 15 Jun 7 09:07 osd-device-0-db -> ../../nvme0n1p3
lrwxrwxrwx 1 root root 15 Jun 7 09:07 osd-device-0-block -> ../../nvme0n1p4
# DISK STATISTICS (/sec)
#                    <---------reads---------><---------writes---------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
09:26:30 nvme0n1p1        0      0    0    0     0       0      0    0    0     0       0     0     0      0    0
09:26:30 nvme0n1p2        0      0    0    0     0   60176      0 4856   12     0      12     1     0      0   36
09:26:30 nvme0n1p3   341176      0  18K   19     0  200844      0 1570  128    11      27    26     1      0   97
09:26:30 nvme0n1p4        0      0    0    0     0   29972      0 7493    4     0       4     7     0      0    8
With the working set fully cached, 4K random writes currently reach about 30K IOPS; at a 20% cache hit rate, they drop to roughly 10K IOPS.
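A simple latency model shows why the hit rate hurts so much. The sketch below is illustrative only: the two per-operation latencies are assumptions back-solved from the 30K and 10K figures above, not measurements; the point is the shape of the curve, not the exact values.

// Illustrative model: effective 4K random-write IOPS when each write must
// first read metadata from RocksDB and a block-cache miss costs an extra
// device read. Both latencies are assumed values, not measurements.
#include <cstdio>

int main() {
  const double hit_us  = 33.0;   // assumed cost (us) when metadata is in the block cache
  const double miss_us = 117.0;  // assumed cost (us) when a miss adds an NVMe read
  for (double hit_rate : {1.0, 0.8, 0.5, 0.2}) {
    double avg_us = hit_rate * hit_us + (1.0 - hit_rate) * miss_us;
    std::printf("hit rate %3.0f%% -> ~%5.0f IOPS\n", hit_rate * 100.0, 1e6 / avg_us);
  }
  return 0;
}

With these assumed numbers the model reproduces both quoted endpoints (about 30K IOPS fully cached, about 10K at 20% hits), consistent with reads, not writes, being the bottleneck.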
Beyond that, RocksDB compaction causes serious problems of its own. The following statistics come from a running instance:
** DB Stats **
Uptime(secs): 1678.0 total, 725.5 interval
Cumulative writes: 5587K writes, 27M keys, 5587K commit groups, 1.0 writes per commit group, ingest: 20.35 GB, 12.42 MB/s
Cumulative WAL: 5587K writes, 1973K syncs, 2.83 writes per sync, written: 20.35 GB, 12.42 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 5088K writes, 23M keys, 5088K commit groups, 1.0 writes per commit group, ingest: 18426.42 MB, 25.40 MB/s
Interval WAL: 5088K writes, 1723K syncs, 2.95 writes per sync, written: 17.99 GB, 25.40 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
** Compaction Stats [default] **
Level  Files  Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0     4/0    195.37 MB  1.0    0.0       0.0     0.0       4.4        4.4       0.0        1.0    0.0       208.9     21         87         0.246     0      0
L1     4/0    250.43 MB  1.0    8.6       4.2     4.5       6.9        2.5       0.0        1.7    157.3     126.3     56         21         2.681     43M    4866K
L2     19/0   764.11 MB  0.3    6.0       1.9     4.1       4.5        0.4       0.3        2.3    217.0     162.9     28         27         1.044     18M    6905K
Sum    27/0   1.18 GB    0.0    14.6      6.1     8.5       15.8       7.3       0.3        3.6    141.3     152.8     106        135        0.785     61M    11M
Int    0/0    0.00 KB    0.0    14.0      5.7     8.4       14.5       6.1       0.1        4.0    145.4     150.0     99         122        0.811     59M    11M
Uptime(secs): 1678.0 total, 1678.0 interval
Flush(GB): cumulative 4.371, interval 3.636
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 15.80 GB write, 9.64 MB/s write, 14.62 GB read, 8.92 MB/s read, 105.9 seconds
Interval compaction: 14.49 GB write, 8.84 MB/s write, 14.04 GB read, 8.57 MB/s read, 98.9 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
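The W-Amp column above already quantifies the overhead: compaction wrote 15.8 GB to persist 4.37 GB of flushed data, a write amplification of roughly 3.6x. For reference, a dump like this can be pulled from a live RocksDB instance via the GetProperty() interface; a minimal sketch, assuming a standalone database at a placeholder path rather than the one embedded in BlueStore:

// Minimal sketch: print RocksDB's "DB Stats" / "Compaction Stats" dump.
// The database path is a placeholder.
#include <rocksdb/db.h>
#include <iostream>
#include <string>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb-demo", &db);
  if (!s.ok()) {
    std::cerr << s.ToString() << std::endl;
    return 1;
  }
  std::string stats;
  // The "rocksdb.stats" property aggregates the DB Stats and the
  // per-level Compaction Stats sections shown above.
  if (db->GetProperty("rocksdb.stats", &stats))
    std::cout << stats << std::endl;
  delete db;
  return 0;
}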
Here are some of the statements from the community discussion:
Proposal:
Primary Goals:
- Clean up Xinxin's old PR to work with current master.
  - Current Status: https://github.com/markhpc/ceph/tree/wip-lmdb-retry
- Move from automake to cmake (Done)
- Implement missing features (merge, rm_range_keys) - (WIP)
- Fix observed racy behavior (noticed when using it with the mon)
- Fix other potential problems (noticed as DB for bluestore)
- Profile and optimize (unnecessary memory copies, RMW during merge?, etc)
Secondary Goals:
- WAL for LMDB?
  - WAL better for records smaller than 2KB?
    https://news.ycombinator.com/item?id=8979517
  - Potential 30X speed up for sync writes?
    https://twitter.com/hyc_symas/status/644634283047567361?lang=en
- More background:
  https://www.spinics.net/lists/ceph-devel/msg26948.html
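On the "RMW during merge" item in the proposal above: LMDB has no counterpart to RocksDB's merge operator, so a merge has to be emulated as a read-modify-write inside a single write transaction. A hedged sketch of what that might look like for a u64 counter follows; it assumes the environment and DBI are already open, and the function itself is hypothetical, not code from the PR.

// Hypothetical sketch: emulate a RocksDB-style "add" merge in LMDB as
// read-modify-write inside one write transaction.
#include <lmdb.h>
#include <cstdint>
#include <cstring>

int merge_add_u64(MDB_env* env, MDB_dbi dbi, const char* key_str, uint64_t delta) {
  MDB_txn* txn = nullptr;
  int rc = mdb_txn_begin(env, nullptr, 0, &txn);   // write txn; LMDB allows one writer at a time
  if (rc != 0) return rc;

  MDB_val key{std::strlen(key_str), const_cast<char*>(key_str)};
  MDB_val old_val{};
  uint64_t value = 0;
  rc = mdb_get(txn, dbi, &key, &old_val);          // read current value (absent counts as 0)
  if (rc == 0 && old_val.mv_size == sizeof(value))
    std::memcpy(&value, old_val.mv_data, sizeof(value));
  else if (rc != 0 && rc != MDB_NOTFOUND) {
    mdb_txn_abort(txn);
    return rc;
  }

  value += delta;                                  // modify
  MDB_val new_val{sizeof(value), &value};
  rc = mdb_put(txn, dbi, &key, &new_val, 0);       // write back
  if (rc != 0) {
    mdb_txn_abort(txn);
    return rc;
  }
  return mdb_txn_commit(txn);                      // atomic, durable on commit
}

The transaction makes the read-modify-write atomic, but since LMDB serializes writers, heavy merge traffic would queue behind the single write lock; that is one reason the secondary goals above ask whether a WAL in front of LMDB would help, especially for records under 2KB.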