Inside the Pure Storage Flash Array: Building a High Performance, Data Reducing Storage System from Commodity SSDs

Ethan Miller
Pure Storage/UCSC
The storage industry is currently in the midst of a flash revolution. Today’s smartphones, cameras, and many laptops all use flash storage, but the $30 billion a year enterprise storage market is still dominated by spinning disk. Flash has large advantages in speed and power consumption, but its disadvantages (cost, limited overwrites, large erase block size) have prevented it from being a drop-in replacement for disk in enterprise storage environments. This talk will describe the techniques that we’ve developed at Pure Storage to overcome these obstacles in creating a high-performance flash storage array using commodity SSDs. We’ll describe the design of the Pure FlashArray, an enterprise storage array built from the ground up from relatively inexpensive consumer flash storage. The array and its software, Purity, leverage the advantages of flash while minimizing the downsides. Purity performs all writes to flash in multiples of the SSD erase block size, and keeps data in a key-value store that persists approximate answers to further reduce writes at the cost of extra (cheap) reads. Our key-value store, which includes a medium-grained identifiers to enable large numbers of snapshots and a key range invalidation table, provides other advantages, such as the ability to take nearly instantaneous, zero-overhead snapshots and the ability to bound the size of our metadata structures despite using monotonically-increasing unique identifiers for many purposes. Purity also reduces the amount of user data stored on flash through a range of techniques, including compression, deduplication, and thin provisioning. The system relies upon RAID both for reliability and for performance consistency: by avoiding reads to devices that are being written, it ensures more efficient writes and eliminates long-latency reads. The net result is a flash array that delivers sustained read-write performance of over 500,000 8KB I/O requests per second while maintaining uniform sub-millisecond latency and providing an average data reduction rate of 6x, averaged across installed systems.
Speaker Biography: 
Ethan L. Miller is the Symantec Presidential Chair for Storage and Security and a Professor of Computer Science at the University of California, Santa Cruz, where he is the Director of the NSF I/UCRC Center for Research in Storage Systems (CRSS) and Associate Director of the Storage Systems Research Center (SSRC). He received his ScB from Brown in 1987 and his PhD from UC Berkeley in 1995, and has been on the UC Santa Cruz faculty since 2000. He has written over 125 papers covering topics such as archival storage, file systems for high-end computing, metadata and information retrieval, file systems performance, secure file systems, and distributed systems. He was a member of the team that developed Ceph, a scalable high-performance distributed file system for scientific computing that is now being adopted by several high-end computing organizations. His work on reliability and security for scalable and distributed storage is also widely recognized, as is his work on secure, efficient long-term archival storage and scalable metadata systems. His current research projects, which are funded by the National Science Foundation, Department of Energy, and industry support for the CRSS and SSRC, include long-term archival storage systems, scalable metadata and indexing structures, high performance petabyte-scale storage systems, and file systems for non-volatile memory technologies. Prof. Miller’s broader interests include file systems, parallel and distributed systems, operating systems, and computer security. In addition to research and teaching in storage systems and operating systems, Prof. Miller has worked with Pure Storage since its founding in 2009 to help develop affordable all-flash storage based on commodity SSDs for enterprise environments.