Engineering Reliable Persistence @ ACM SIGARCH Blog

Integrating non-volatile main memories (NVMMs) into the storage/memory hierarchy makes data integrity a critical design consideration.  Protecting data in NVMM is a complex problem:  media errors and software bugs can corrupt data, and the reliability of each memory cell degrades as it is used, potentially leading to premature failure.  Hardware and software both have a role to play, but trying to solve problems in the wrong place can needlessly complicate the system, leave it open to data corruption, and/or sacrifice performance.

I wrote this originally for the SIGARCH Blog.

Media Errors and Scribbles

Data corruption is an unfortunate fact of life in storage systems, and hardware, system software, and applications all have a role to play in dealing with it in NVMM.

The first line of defense is the error correcting code (ECC) hardware built into the NVMM DIMMs and/or the CPU’s memory controller.  DRAM controllers provide single-error correction/double-error detection (SECDED) using Hamming codes.  NVMMs will probably use something stronger to correct or detect more errors.

ECC can correct some errors, but not all.  Detectable but uncorrectable errors cause an unmaskable exception on Intel processors.  If the exception happens in kernel mode, the default response is to crash the system.  In user mode, the application gets a SIGBUS and crashes.  In either case, software can try to handle the error rather than crash, and critical components like the file system must do so to be reliable.  Undetectable errors (i.e., silent data corruption) are more insidious:  The memory controller just returns incorrect data.
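On Linux, an uncorrectable error that an application consumes arrives as a SIGBUS whose si_code is BUS_MCEERR_AR, with the poisoned address in si_addr.  The sketch below shows one way a program might catch that signal and unwind to a recovery point instead of crashing; the recovery policy itself (re-reading from a replica, rebuilding from parity, reporting the loss to the caller) is application-specific and only hinted at here.

```c
#define _GNU_SOURCE             /* for BUS_MCEERR_AR on glibc */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static sigjmp_buf recovery_point;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    /* On Linux, BUS_MCEERR_AR means the kernel found poisoned (uncorrectable)
     * memory while this thread was consuming it; si_addr is the bad address. */
    if (info->si_code == BUS_MCEERR_AR)
        siglongjmp(recovery_point, 1);   /* unwind to a known-safe point */
    abort();                             /* any other SIGBUS is a plain bug */
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(recovery_point, 1)) {
        /* Application-specific recovery: re-read the data from a replica,
         * rebuild it from parity, or report the loss to the caller. */
        fprintf(stderr, "caught uncorrectable memory error\n");
        return 1;
    }

    /* ... load from NVMM-backed memory here ... */
    return 0;
}
```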

Reliable storage software must deal with these two classes of errors and one more:  Corruption due to software accidentally writing data to the wrong place.  We call these errors “scribbles,” and they are common enough that some conventional file systems try to protect their data structures from them.  Scribbles and undetected errors appear the same to software:  Future reads return corrupted data.

File systems are software’s first line of defense against media errors and scribbles. Conventional block-based file systems handle them by applying checksums or RAID-style parity to disk blocks, file contents, and/or metadata structures.  We have applied these techniques to the NOVA NVMM file system and find that they work well and have limited impact on performance.
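As a rough illustration of the checksum half of that approach, here is a per-block protection sketch in C.  The 4 KB block size, the CRC-32 function, and the verify-on-every-read policy are illustrative choices for this post, not NOVA’s actual on-media format.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Per-block protection record: the checksum lives in metadata, apart from
 * the data it covers, so a single scribble is unlikely to hit both. */
struct block_csum { uint32_t crc; };

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_buf(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
    }
    return ~crc;
}

/* On every write through the file system, refresh the checksum... */
void block_write(void *block, const void *src, struct block_csum *meta)
{
    memcpy(block, src, BLOCK_SIZE);
    meta->crc = crc32_buf(block, BLOCK_SIZE);
}

/* ...and on every read, verify it.  A mismatch means a media error or a
 * scribble corrupted the block, and the caller must recover (e.g., from
 * parity or a replica) or return an error. */
int block_read(void *dst, const void *block, const struct block_csum *meta)
{
    if (crc32_buf(block, BLOCK_SIZE) != meta->crc)
        return -1;              /* corruption detected */
    memcpy(dst, block, BLOCK_SIZE);
    return 0;
}
```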

For block-based storage, protections in the file system are all you need, but fully utilizing NVMM and providing fault tolerance requires the application’s help.  The problem arises when we try to combine file system-based data protection with DAX-style mmap().  DAX mmap() allows an application to map NVMM directly into its address space and then access that data using load and store instructions.  By design, the file system is not involved in these memory-like accesses.  This is great for performance, but it means that file contents can change without the file system’s knowledge, something that is impossible in conventional file systems.
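Here is roughly what DAX access looks like from the application’s side, assuming a Linux system with an NVMM-aware file system mounted with DAX.  MAP_SYNC (added in Linux 4.15, used together with MAP_SHARED_VALIDATE) asks the kernel to refuse the mapping unless stores to it can be made durable without further help from the file system; the path and sizes below are made up.

```c
#define _GNU_SOURCE     /* for MAP_SHARED_VALIDATE / MAP_SYNC (glibc 2.28+) */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path: an NVMM-backed file system mounted with -o dax. */
    int fd = open("/mnt/pmem/data", O_RDWR);
    if (fd < 0)
        return 1;

    size_t len = 4096;
    /* MAP_SYNC makes the kernel fail the mapping unless stores to it can be
     * made durable without further file system involvement. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    /* The file system sees none of these accesses: they are ordinary stores,
     * so it cannot update checksums or parity to match the new contents. */
    memcpy(p, "hello, pmem", 12);

    /* Make the stores durable.  On real DAX mappings this is a cache-line
     * flush plus a fence (e.g., libpmem's pmem_persist()); msync() is shown
     * here as a portable stand-in. */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}
```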

If the file system doesn’t know the file’s contents changed, it cannot update error correction information.  For example, NOVA uses RAID-4 style parity to protect file data, so each time the file contents change, the parity needs to change, too.  If parity doesn’t change, then the data will appear to be corrupted.
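The coupling is easy to see in the arithmetic: RAID-4 style parity is just the XOR of the data blocks in a stripe, so every change to a data block implies a matching change to the parity block.  The sketch below uses a made-up stripe geometry, not NOVA’s.

```c
#include <stddef.h>
#include <stdint.h>

#define STRIPE_BLOCKS 8          /* data blocks per parity block (illustrative) */
#define BLOCK_WORDS   (4096 / sizeof(uint64_t))

/* Parity is the XOR of every data block in the stripe, so it can be updated
 * incrementally: parity ^= old_data ^ new_data.  If a store changes a data
 * block but this update never runs, a later parity check will wrongly
 * conclude the block is corrupt, which is why NOVA disengages parity for
 * mmap()'d pages. */
void parity_update(uint64_t *parity,
                   const uint64_t *old_data,
                   const uint64_t *new_data)
{
    for (size_t i = 0; i < BLOCK_WORDS; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

/* Rebuild one lost block from the parity block and the surviving blocks. */
void parity_rebuild(uint64_t *lost,
                    const uint64_t *parity,
                    const uint64_t *survivors[STRIPE_BLOCKS - 1])
{
    for (size_t i = 0; i < BLOCK_WORDS; i++) {
        uint64_t x = parity[i];
        for (size_t b = 0; b < STRIPE_BLOCKS - 1; b++)
            x ^= survivors[b][i];
        lost[i] = x;
    }
}
```

The same XOR that keeps parity current is what rebuilds a block once ECC gives up on it, which is what makes the bookkeeping worthwhile.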

To avoid this problem, NOVA disengages RAID protection when pages in a file are mmap()’d for writing and re-engages it when the mapping is removed (or when the system restarts).  To maintain protection against media errors and scribbles, the application must implement its own error protection in userspace to detect and recover from data corruption.
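What that userspace protection might look like is sketched below: an application-level record stored in the mmap()’d region that carries its own checksum, verified on every load.  The record layout and the FNV-1a checksum are hypothetical choices for illustration; the point is that the application, not the file system, owns the check while the mapping exists.

```c
#include <stdint.h>
#include <string.h>

/* An application-level record stored directly in an mmap()'d NVMM region.
 * Because the file system's parity is disengaged while the mapping exists,
 * the record carries its own checksum (FNV-1a here, purely illustrative). */
struct pm_record {
    uint64_t csum;
    uint64_t len;
    char     payload[240];
};

static uint64_t fnv1a(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint64_t h = 0xcbf29ce484222325ull;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ull;
    }
    return h;
}

/* Update: write payload and length, then the checksum over both.
 * Caller guarantees len <= sizeof(r->payload). */
void record_store(struct pm_record *r, const void *data, uint64_t len)
{
    memcpy(r->payload, data, len);
    r->len = len;
    r->csum = fnv1a(&r->len, sizeof(r->len) + len);
    /* In real code, flush the affected cache lines and fence here so the
     * record and its checksum reach the NVMM media. */
}

/* Load: recompute and compare before trusting the payload. */
int record_load(void *dst, const struct pm_record *r)
{
    if (r->len > sizeof(r->payload) ||
        fnv1a(&r->len, sizeof(r->len) + r->len) != r->csum)
        return -1;              /* media error or scribble detected */
    memcpy(dst, r->payload, r->len);
    return 0;
}
```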

A seemingly attractive alternative to implementing RAID in software is to add stronger error correction in hardware.  Technologies like Chipkill and RAIM provide RAID-like protection and reduce (but do not eliminate) media errors.  However, these techniques are no help with scribbles — software causes scribbles and only software can detect and correct them.

NVMM Wear Leveling

The media error rate for all non-volatile memories (including the NAND flash in SSDs) increases as the memory is used — that is, the memory cells wear out.  If the cells in a device wear out unevenly, then some will fail sooner than others, reducing the effective capacity of the device.  Effective “wear-leveling” can spread writes out over the entire device so that all the memory cells are available for the device’s warranted lifetime.

Wear-out is a well-known bogeyman of NAND-based SSDs, but the wear problem with NVMMs is quite different.  In SSDs, wear-leveling is one of two tasks that the flash translation layer (FTL) in the SSD’s firmware must perform to provide a block-based interface to decidedly non-block-like NAND flash memory.  The FTL’s second task is to manage flash’s inability to modify data in place and the fact that program operations operate on pages (4-32 KB) while erase operations operate on blocks (many tens to a few hundred pages).  FTLs use log-structured writes, complicated mapping schemes, and performance-stealing garbage collection to make it work.  Doing all this while maintaining consistent, high performance is a fundamental challenge in SSD design and one of the main points of innovation in SSDs over the last 10 years.  It is also the motivation for moving FTL functionality further up the software stack where application-specific interfaces or knowledge of access patterns can simplify the task.
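To make the contrast concrete, the sketch below shows the skeleton of a page-mapped, log-structured FTL write path, with garbage collection, wear statistics, and error handling all omitted.  It is a deliberately naive illustration of why the mapping and cleanup machinery exists, not a description of any real SSD’s firmware.

```c
#include <stdint.h>

#define PAGES_PER_BLOCK 256u                  /* one erase block = many program pages */
#define NUM_BLOCKS      1024u
#define NUM_PAGES       (PAGES_PER_BLOCK * NUM_BLOCKS)

static uint32_t l2p[NUM_PAGES];               /* logical page -> physical page map */
static uint8_t  mapped[NUM_PAGES];            /* has this logical page been written? */
static uint8_t  page_valid[NUM_PAGES];
static uint32_t write_block;                  /* block currently being filled */
static uint32_t write_page;                   /* next free page within that block */

/* Flash cannot overwrite a programmed page in place, so every write lands in
 * the next free page of the current block (log-structured), the map is
 * updated, and the old copy is merely marked invalid.  Reclaiming those
 * invalid pages later means copying the still-valid pages out of a block and
 * erasing it: that is the garbage collection this sketch omits. */
void ftl_write(uint32_t logical_page /*, const void *data */)
{
    if (write_page == PAGES_PER_BLOCK) {                  /* block is full */
        write_block = (write_block + 1) % NUM_BLOCKS;     /* naive allocation */
        write_page = 0;
    }
    uint32_t phys = write_block * PAGES_PER_BLOCK + write_page++;
    /* ... program the data into physical page 'phys' ... */
    if (mapped[logical_page])
        page_valid[l2p[logical_page]] = 0;    /* old copy becomes garbage */
    l2p[logical_page] = phys;
    mapped[logical_page] = 1;
    page_valid[phys] = 1;
}
```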

NVMM does not suffer from the page/block, program/erase mismatch, so it does not need complex maps or garbage collection.  As a result, NVMMs can, should, and do dispense with most of the FTL, but they do need wear-leveling, and the only reasonable place to implement wear-leveling for NVMMs is in hardware.

To see why the alternative is a non-starter, consider the challenges of implementing wear-leveling in software.  First, which software layer should perform wear-leveling?  The file system must be involved because it implements normal file IO and decides where data and metadata are stored.  Indeed, techniques like copy-on-write and log-structuring seem well suited to wear-leveling file data and some metadata.  However, there are likely to be some “hot spots” (e.g., the super block) that may wear out anyway.

The file system designer could try to ensure that all NVMM data structures move around in NVMM, but the software engineering challenges are daunting:  How should we test the file system to make sure that the migration system doesn’t introduce new bugs?  How do you ensure that it prevents hot spots?  How do you ensure that new hot spots don’t appear as the file system evolves?  These problems are not unsolvable, but file systems are complex and buggy enough as it is.  Further, the details of how wear-leveling should work depend on the particular media, requiring the file system to understand those details, a further source of bugs and complexity.

It gets worse, though, since DAX mmap() means that solving these problems is not enough.  Applications that access NVMM via loads and stores will need to implement wear-leveling too.  The same set of testing and verification challenges arise, but in the absence of the rigor that file system design and testing (hopefully) enjoys.  It is more than likely that buggy application wear-leveling techniques will cause both application bugs and hot spots.  Bugs are the application’s problem, but hot spots physically damage the system’s hardware, something software should not be able to do.

Implementing wear-leveling in hardware, by contrast, is simple and efficient.  Start-gap wear-leveling is an elegant scheme for wear-leveling NVMMs developed by Moin Qureshi’s group while he was working at IBM (he’s now at Georgia Tech).  In essence, it slowly rotates the physical address space across the NVMM on a single DIMM, and keeps track of the current offset so it is easy to find data after it has been written.  We implemented start-gap wear leveling in an FPGA for a PCM-based SSD we built in 2011.  It took about 100 lines of Verilog, added two cycles to memory latency, and had no measurable impact on performance.
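The sketch below paraphrases the scheme in C.  The region size and the gap-movement interval (one move every PSI writes) are illustrative parameters rather than the values from the paper or from our FPGA implementation, and the line copy is a stub standing in for work the memory controller does.

```c
#include <stdint.h>

#define NUM_LINES 1048576u    /* N wear-leveled lines, plus one spare gap line (illustrative) */
#define PSI       100u        /* move the gap once every PSI writes */

static uint32_t start_reg;                /* how far the address space has rotated */
static uint32_t gap_reg = NUM_LINES;      /* physical index of the spare, empty line */
static uint32_t write_count;

/* Stand-in for the memory controller copying one memory line. */
static void line_copy(uint32_t dst_line, uint32_t src_line)
{
    (void)dst_line;
    (void)src_line;
}

/* Map a logical line onto the N+1 physical lines: rotate by start_reg,
 * then skip over wherever the gap currently sits. */
uint32_t startgap_map(uint32_t logical)
{
    uint32_t phys = logical + start_reg;
    if (phys >= NUM_LINES)
        phys -= NUM_LINES;
    if (phys >= gap_reg)
        phys += 1;
    return phys;
}

/* Call on every write.  Every PSI writes, the line just before the gap is
 * copied into the gap and the gap slides down one; when the gap wraps from
 * line 0 back to line N, the whole address space has rotated by one line,
 * so start_reg advances.  Over time every physical line takes a turn holding
 * every logical address, spreading writes across the device. */
void startgap_on_write(void)
{
    if (++write_count < PSI)
        return;
    write_count = 0;

    if (gap_reg != 0) {
        line_copy(gap_reg, gap_reg - 1);
        gap_reg -= 1;
    } else {
        line_copy(0, NUM_LINES);
        gap_reg = NUM_LINES;
        start_reg = (start_reg + 1) % NUM_LINES;
    }
}
```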

Start-gap applies to all operations equally, whether they operate on application data or file system metadata.  Everything is covered all the time, and software does not need to do a thing.  Even better, the memory controller can adjust start-gap’s parameters to match the needs of the media at hand.

Cross-Layer Data Protection

Protecting data effectively in an NVMM-based storage system requires careful thought about how media errors, software errors, and wear-out interact with hardware and software.  The differences between NVMM and more familiar persistent memories (e.g., flash) require us to rethink the role each layer of the system should play and when it should stay out of the way.