XINFINI Technology | XSEA Architecture
Helping enterprises through the all-data, all-flash transformation
Redefining distributed storage architecture to drive the adoption of all-flash data centers
Customer pain points

The era of all-flash data centers has arrived, bringing both opportunities and challenges

NVMe SSD performance and capacity keep improving while prices keep falling; high-performance lossless 25Gb/100Gb networks are now widespread and 400Gb is emerging; and NVMe over RoCE lets any location in the data center access storage with only microsecond-level latency. Together, these trends have accelerated the arrival of the all-flash data center. An all-flash architecture offers enterprise customers significant performance gains, cost savings, and business flexibility, helping them gain an edge in a highly competitive market. However, existing all-flash storage suffers from the following pain points:


Problems with local NVMe DAS
  • TCO remains high: capacity and performance utilization are uneven across servers, wasting resources
  • Operations are difficult: servers hold large numbers of local disks with no unified disk-management tooling and no automatic alerting or handling of media failures and sub-health issues, driving up operating costs
  • Reliability is low: local disks carry no data redundancy protection
Problems with traditional centralized all-flash array
  • Difficult to scale horizontally; software and hardware cannot be decoupled
  • Poor openness; cannot meet the requirements of the cloud trend
  • Poor economics; cannot deliver high performance, large scale, and low cost at the same time, so customers cannot realize the value of SSDs
Problems with Shared-Nothing storage
  • Cannot cover all business scenarios: it relies on three-way replication and lacks data-reduction capability, so the cost per TB of usable capacity is too high and customers cannot obtain the capacity advantages of all-flash at a reasonable cost
  • When an SSD slows down or a node fails, failover takes 5 seconds or more with large performance fluctuations, falling short of the service-quality demands of critical OLTP workloads
Architecture overview

XSKY eXtreme Shared-Everything Architecture

Drawing on the Shared-Everything architecture of high-end storage and its years of experience in distributed storage, XSKY designed the revolutionary eXtreme Shared-Everything Architecture (XSEA). Built on standard storage protocols and network technologies, it upends the data center's storage hierarchy, replaces some NVMe DAS and hybrid-flash storage, and resolves through breakthrough methods the performance, reliability, scale, and cost compromises that have constrained the Shared-Nothing architecture for the past 20 years.

Core advantages

Provide all-flash storage products with high reliability, high performance, and high efficiency capabilities

Through three technical innovations (Shared-Everything, single-layer flash media, and end-to-end NVMe), the XSEA architecture achieves three "100s":

  • Shared-Everything: fully shared data storage, 100 ms failover time
  • Single-layer flash media: built for TLC NVMe SSDs, 100% effective storage ratio
  • End-to-end NVMe: maximal hardware offloading, 100 µs ultra-low latency

Shared-Everything data storage for extreme reliability

The XSEA architecture adopts the Shared-Everything model to achieve fully shared data storage: every node can directly access all SSDs, improving data-access speed and flexibility. In slow-disk and sub-health scenarios, failover completes within 100 ms.
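As a minimal sketch of why shared access makes failover fast (all names and the ownership scheme here are invented for illustration, not XSKY's implementation): when every node already has a path to every SSD, failing over a node only requires updating an ownership map, with no data movement.

```python
# Sketch: in a shared-everything cluster every node can reach every SSD,
# so failing over a node's disks is a metadata update, not a data copy.
# All names here are illustrative, not XSKY's actual implementation.

class Cluster:
    def __init__(self, nodes, ssds):
        self.nodes = set(nodes)
        # Round-robin initial ownership; every node still sees every SSD.
        self.owner = {ssd: nodes[i % len(nodes)] for i, ssd in enumerate(ssds)}

    def fail_node(self, node):
        """Reassign ownership of the failed node's SSDs to survivors."""
        self.nodes.discard(node)
        survivors = sorted(self.nodes)
        for i, (ssd, owner) in enumerate(sorted(self.owner.items())):
            if owner == node:
                # No rebuild or copy: the new owner already sees the SSD.
                self.owner[ssd] = survivors[i % len(survivors)]

cluster = Cluster(["node-a", "node-b", "node-c"],
                  ["ssd-%d" % i for i in range(6)])
cluster.fail_node("node-a")
assert all(owner != "node-a" for owner in cluster.owner.values())
```

In a Shared-Nothing cluster, by contrast, the failed node's data must be rebuilt from replicas before full service resumes, which is why its failover is measured in seconds rather than milliseconds.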

Shared-Everything vs Shared-Nothing: Performance scales linearly
Shared Nothing
  • Each node processes its own data independently
  • As nodes are added, the benefit of expansion diminishes
Shared Everything
  • Nodes need no cross-node service communication
  • As nodes are added, capacity and performance scale linearly


Shared-Everything vs Shared-Nothing: Flexible resource allocation
Shared Nothing
  • Resources on each node cannot be pooled
  • Large resource reserves must be planned up front, causing waste
Shared Everything
  • Storage capacity and performance are decoupled from CPU and memory resources
  • Resources are allocated on demand to match actual business needs


Shared-Everything vs Shared-Nothing: Global scheduling perspective
Shared Nothing
  • Each node is carved into independent units
  • A partial view of the cluster undermines business stability
Shared Everything
  • Every node can read and write data globally
  • Global flow control greatly improves space-utilization efficiency


Shared-Everything vs Shared-Nothing: Higher quality of service
Shared Nothing
  • Insensitive to business priority levels
  • In sub-health conditions, failover takes 10 seconds
Shared Everything
  • Business continuity is guaranteed
  • In sub-health conditions, failover takes 100 ms


Optimized for TLC NVMe SSDs, with an effective storage ratio above 100%

The XSEA architecture builds its storage pool from a single tier of TLC NVMe SSDs, simplifying the cluster's storage hardware. Data is written append-only, which reduces write amplification, and a carefully designed space layout lets each SSD serve as both cache and persistent storage. Together these techniques deliver stable performance without a dedicated cache medium.

In typical mixed read/write workloads, single-tier flash cuts media cost by more than 20% compared with a tiered-cache design. Combined with the global EC and compression enabled by the Shared-Everything model, the cluster's effective storage ratio exceeds 100%.
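To make the ">100% effective ratio" claim concrete, here is a back-of-the-envelope calculation under assumed parameters: the EC 4+2 layout and 2:1 compression ratio below are illustrative examples, not published XSKY figures.

```python
# Effective storage ratio = usable logical capacity / raw capacity.
# Assumed, illustrative parameters: EC 4+2 layout, 2:1 compression.
data_strips, parity_strips = 4, 2
ec_efficiency = data_strips / (data_strips + parity_strips)   # = 4/6 ≈ 0.667
compression_ratio = 2.0                                       # 2:1, assumed

effective_ratio = ec_efficiency * compression_ratio           # ≈ 1.33
print(f"effective storage ratio ≈ {effective_ratio:.0%}")     # prints "≈ 133%"

# Contrast with three-way replication and no data reduction:
replication_ratio = 1 / 3                                     # ≈ 33% usable
assert effective_ratio > 1.0 > replication_ratio
```

Under these assumptions, each TB of raw flash yields about 1.33 TB of usable logical capacity, whereas three-way replication without data reduction yields only about 0.33 TB.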



End-to-end NVMe design of the I/O path, achieving 100 µs ultra-low latency

The XSEA architecture builds the entire I/O path on the standard NVMe over Fabrics protocol: clients access storage via NVMe over Fabrics, and the storage-internal interconnect uses NVMe over Fabrics as well. This complete end-to-end NVMe design lets every storage node reach every NVMe SSD efficiently, avoiding the overhead of storage-protocol conversion. Along this path, XSEA processes each I/O request in an efficient polling mode and uses NUMA binding to optimize the memory-access efficiency of different services, achieving end-to-end latency as low as 100 microseconds. Any location in the data center can therefore access XINFINI all-flash storage with only microsecond-level latency.
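To illustrate the polling idea in the abstract (real polled-mode drivers such as SPDK's run busy loops in C on pinned cores; this toy sketch is not XSEA's implementation), the key point is that completions are reaped by repeatedly checking a queue rather than sleeping on an interrupt:

```python
# Toy sketch of polling-mode I/O completion handling (illustrative only).
from collections import deque

completion_queue = deque()          # stands in for an NVMe completion queue

def submit(request):
    # Model instantaneous device completion; in polling mode the caller
    # never blocks on an interrupt to learn about it.
    completion_queue.append(f"done:{request}")

def poll_once():
    """Reap all currently visible completions without ever blocking."""
    reaped = []
    while completion_queue:
        reaped.append(completion_queue.popleft())
    return reaped

submit("read-4k")
submit("write-4k")
assert poll_once() == ["done:read-4k", "done:write-4k"]
assert poll_once() == []            # empty queue: returns immediately, no sleep
```

Polling trades a dedicated busy core for the elimination of interrupt and context-switch latency, which is what makes sub-100 µs tail latency achievable on fast NVMe media.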

Data persistence layer

The Persistent Layer provides data-persistence services to the layers above it, with three core designs:

  • It exposes AppendLog-semantic calls to the upper layer;
  • It delivers extremely low write latency;
  • Chunk metadata is centrally managed.
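A minimal sketch of what AppendLog semantics on a chunk look like (the interface names here are hypothetical, not the Persistent Layer's real API): writes may only append at the tail of a chunk, and the returned offset becomes the record's permanent address.

```python
# Hypothetical sketch of AppendLog-style chunk semantics: append-only writes,
# random reads by (offset, length). Not the Persistent Layer's actual API.

class Chunk:
    def __init__(self):
        self._buf = bytearray()

    def append(self, data: bytes) -> int:
        """Append-only write; returns the offset where the data landed."""
        off = len(self._buf)
        self._buf += data
        return off

    def read(self, off: int, length: int) -> bytes:
        return bytes(self._buf[off:off + length])

chunk = Chunk()
o1 = chunk.append(b"alpha")
o2 = chunk.append(b"beta")
assert (o1, o2) == (0, 5)
assert chunk.read(o2, 4) == b"beta"
```

Because writes never land at arbitrary offsets, the device sees large sequential writes, which is exactly the access pattern that minimizes write amplification on TLC NVMe SSDs.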
Data service layer

The Service Layer provides block storage and file storage services to the upper layer through two components, BlockServer and FileServer.

BlockServer is the storage engine for block storage. Built on the high-performance chunk read/write interface of the Persistent Layer, it organizes data in a log-structured manner and abstracts a virtual block layer. Supported storage access protocols are NVMe/RoCE, NVMe/TCP, iSCSI, and KVM vhost-blk.
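A log-structured virtual block layer can be sketched as follows (an illustrative toy, not BlockServer's actual design): a logical overwrite appends a new record to the log and repoints the block mapping, so the backing chunks are still written strictly append-only.

```python
# Toy sketch of a log-structured virtual block layer (illustrative only):
# logical overwrites become physical appends plus a mapping update.

class VirtualBlockDevice:
    def __init__(self):
        self._log = []        # append-only record log
        self._map = {}        # logical block address -> latest log index

    def write_block(self, lba: int, data: bytes):
        self._log.append(data)              # never overwrite in place
        self._map[lba] = len(self._log) - 1

    def read_block(self, lba: int) -> bytes:
        return self._log[self._map[lba]]

dev = VirtualBlockDevice()
dev.write_block(7, b"v1")
dev.write_block(7, b"v2")                   # logical overwrite, physical append
assert dev.read_block(7) == b"v2"
assert len(dev._log) == 2                   # both versions remain in the log
```

In a real engine, stale versions left behind in the log are reclaimed by garbage collection; this sketch omits that step.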

Protocol access layer

BlockServer exposes an NVMe over RoCE/TCP target externally for client access.

BlockDataClient is a private client deployed on KVM compute nodes; it provides the vhost-blk block-storage protocol interface to KVM.

XINFINI The name of XSKY’s new generation of all-flash technology
XSEA The name of XSKY’s new generation all-flash architecture
eXtreme Shared-Everything Architecture Abbreviated "XSEA"; the extremely fast fully shared architecture, i.e., the "XSEA Architecture"
Shared-Everything Architecture Fully shared architecture, a type of distributed system architecture
Shared-Nothing Architecture Shared nothing architecture, a type of distributed system architecture
Persistent Layer (data) persistence layer
Service Layer (data) service layer
Access Layer (Protocol) Access Layer
AppendLog Append-only log writing, an important technique for ensuring data consistency and fault tolerance in distributed systems
QAT Intel Xeon Scalable processor built-in hardware accelerator for compression and decompression operations
NVMe DAS NVMe direct attached storage means that the server uses a local NVMe disk
NVMe Non-Volatile Memory Express, a storage protocol over PCIe
RDMA Remote Direct Memory Access, remote direct memory access network protocol
CE Converged Ethernet, lossless Ethernet network
NVMe-oF Target NVMe over Fabrics storage side
NVMe-oF Initiator NVMe over Fabrics client
RoCE RDMA over Converged Ethernet, which allows Remote Direct Memory Access (RDMA) over Ethernet networks