How to Estimate SSD Endurance in Embedded Systems
If you are asking “How long will an SSD last in my system?”, “What is SSD endurance?”, or “What is the best method for estimating SSD endurance in an embedded system?”, then this blog post is for you.
The endurance, or total life span, of an SSD depends on the workload. Characterizing the workload directly would mean capturing disk access transactions with special software or equipment, which is rarely practical.
In this post, we describe a method for estimating SSD life span in the system without capturing transactions or using special software or equipment. The estimate can be made during system operation using standard software.
We explain the fundamental concepts and the method for calculating SSD life span, and we provide links to tools and resources (popular industry tools and Virtium tools).
SMART (Self-Monitoring, Analysis, and Reporting Technology) is implemented in all SSDs. We can use the SMART information to calculate the estimated endurance.
Here are some key concepts before starting:
- Capture the SMART statistics as often as possible during the endurance test (approximately every six (6) hours).
- Run the test long enough that a meaningful amount of data is written to the SSD. This will exercise the firmware features, such as garbage collection, wear leveling, read disturb management, and ECC. Please see the following sections for more information.
- The estimate will be accurate provided the workload remains steady for the remainder of the SSD life. If there is a change in the workload, then it is recommended to recalculate the estimate. A discussion of the methods appears below.
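As an illustration of periodic SMART capture, here is a minimal Python sketch that assumes an NVMe drive and the smartmontools `smartctl` utility with JSON output (`--json`). The JSON key names used below (`nvme_smart_health_information_log`, `percentage_used`, `data_units_written`) are assumptions about how your `smartctl` version reports the NVMe health log; SATA drives expose vendor-specific attributes instead, so check your own drive's output before relying on it. The function names are hypothetical.

```python
import json
import subprocess
import time


def parse_nvme_smart(smartctl_json_text):
    """Extract the two fields used for endurance estimates from smartctl JSON text.

    Assumption: the report contains an 'nvme_smart_health_information_log' object.
    """
    report = json.loads(smartctl_json_text)
    log = report["nvme_smart_health_information_log"]
    return {
        "percentage_used": log["percentage_used"],
        "data_units_written": log["data_units_written"],
    }


def read_nvme_smart(device):
    """Run smartctl on the device (usually needs root) and parse its JSON report."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_nvme_smart(result.stdout)


def capture_snapshots(device, count, interval_hours=6):
    """Collect `count` snapshots, one every `interval_hours` hours."""
    snapshots = []
    for i in range(count):
        snapshot = read_nvme_smart(device)
        snapshot["timestamp"] = time.time()
        snapshots.append(snapshot)
        if i < count - 1:
            time.sleep(interval_hours * 3600)
    return snapshots
```

For example, `capture_snapshots("/dev/nvme0", count=17)` would cover roughly 100 hours at six-hour intervals; the device path is a placeholder for your own system.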
Using JEDEC Workloads and Methods
The JEDEC (www.jedec.org) JESD218A Client and JESD219A Enterprise specifications provide instructions on how to test and estimate TBW for SSDs. These two specifications contain instructions and information for the test durations and workloads.
The tests can be performed using a standard PC running special software. One available software package for running the JESD219A Enterprise tests is Vdbench, which is provided by Oracle and can be downloaded here: https://www.oracle.com/technetwork/server-storage/vdbench-downloads-1901681.html
Calculate the Write Amplification (WA)
Most of us have heard about WA in SSDs and use WA to estimate endurance and TBW. The SMART data attributes of NAND Written and Data Written are used to calculate the estimates: WA = NAND Written / Data Written. You then use the formula provided by the SSD vendors to calculate the TBW.
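The WA formula above can be sketched in a few lines of Python. The function names are hypothetical, and the second helper reflects an assumption on our part: that you want the WA for a specific test interval, computed from the difference between two SMART snapshots, rather than the lifetime average of the drive.

```python
def write_amplification(nand_written, data_written):
    """Return WA = NAND Written / Data Written (both values in the same unit)."""
    if data_written <= 0:
        raise ValueError("Data Written must be greater than zero")
    return nand_written / data_written


def interval_write_amplification(start, end):
    """WA over a test interval.

    `start` and `end` are (nand_written, data_written) tuples taken from
    SMART snapshots at the beginning and end of the interval.
    """
    return write_amplification(end[0] - start[0], end[1] - start[1])


# 300 GB written to NAND for 100 GB written by the host
print(write_amplification(300.0, 100.0))  # 3.0

# Counters moved from (500, 200) to (800, 300) during the test
print(interval_write_amplification((500.0, 200.0), (800.0, 300.0)))  # 3.0
```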
Virtium does not recommend that WA alone be used to calculate TBW. Instead, think about “cause and effect.” For example, with the current workload, the drive will last 5 years. If improvements are made to the software code, then the drive may last 7 years. vtView® helps you visualize the effects of your software on the SSD and its endurance. Please see the vtView® section later in this blog post.
Real Life Workload in an Embedded System
We can use the “percentage used” of the drive to estimate the TBW. We can run the tests to discover the amount of TBW per 1% life, and then extrapolate to 100%. This way, the system workload does not matter, as you will be able to estimate how long the SSD will last within the current system conditions (temperature, workload, writes per day, etc.).
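The extrapolation described above can be sketched as follows. This is a minimal illustration, not a vendor formula: the function name and argument names are hypothetical, and it assumes the workload stays steady so that TB written per 1% of life is constant.

```python
def estimate_tbw(tb_written_start, tb_written_end, pct_used_start, pct_used_end):
    """Extrapolate TBW from the data written while 'percentage used' advanced.

    Computes TB written per 1% of life over the test interval, then
    scales that figure to 100%.
    """
    pct_delta = pct_used_end - pct_used_start
    if pct_delta <= 0:
        raise ValueError("'percentage used' did not advance; run the test longer")
    tb_per_percent = (tb_written_end - tb_written_start) / pct_delta
    return tb_per_percent * 100


# 1.5 TB written while 'percentage used' moved from 2% to 4%:
# 0.75 TB per 1% of life, so about 75 TB at 100%.
print(estimate_tbw(10.0, 11.5, 2, 4))  # 75.0
```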
Running the Test
- Set up the test to capture SMART data every 6 hours.
- Run the system with your software for 100 hours.
- If the “percentage life” value has changed by a few steps within the first 100 hours, there is enough information to estimate the SSD endurance.
- In some cases, there are minimal writes, so that even after 100 hours, there is not enough information to estimate endurance. We suggest the following:
- For NVMe SSDs, we will be able to calculate an estimate after approximately 40 days of continuous testing.
- For SATA SSDs, we will be able to calculate an estimate after approximately 80 days of continuous testing.
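The steps above can be sketched as a small check plus a time-based projection. This is an illustrative sketch under stated assumptions: samples are hypothetical `(hours_elapsed, percentage_used)` pairs taken from SMART captures, the threshold of 3 steps is our own reading of "a few steps", and the projection assumes the workload stays steady.

```python
def has_enough_data(samples, min_steps=3):
    """True if 'percentage used' advanced by at least `min_steps` over the test.

    `samples` is a list of (hours_elapsed, percentage_used) pairs in
    chronological order. `min_steps` is an assumed threshold.
    """
    return samples[-1][1] - samples[0][1] >= min_steps


def estimate_remaining_years(samples):
    """Project the remaining life from the first and last samples."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0 or p1 <= p0:
        raise ValueError("not enough change between samples to extrapolate")
    hours_per_percent = (t1 - t0) / (p1 - p0)
    remaining_hours = (100 - p1) * hours_per_percent
    return remaining_hours / (24 * 365)


# Captures every 6 hours; only the first and last are shown here.
samples = [(0, 1), (96, 4)]
if has_enough_data(samples):
    print(f"{estimate_remaining_years(samples):.2f} years remaining")
else:
    print("Keep the test running")
```

If `has_enough_data` returns False after 100 hours, keep capturing and rerun the check later, as suggested in the list above.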
Definitions and Terms
The three popular workloads are the JESD219 Client workload, the JESD219A Enterprise workload, and the JESD218A workload. The JESD219A Enterprise workload has the greatest adverse effect on SSD life span.
The type of workload affects the SSD life span. The more random the workload, the shorter the SSD life span. Random workloads make the SSD work harder at garbage collection and wear leveling, so it consumes its NAND faster and the TBW is lower.
TBW (Terabytes Written)
TBW is the amount of data that can be written to an SSD until its useful life percentage (%) reaches 100%. The unit is TB, with 1TB = 1000GB.
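When the written data comes from an NVMe SMART log, the raw counter has to be converted to TB first. The sketch below assumes the standard NVMe "Data Units Written" field, where one data unit is 1000 units of 512 bytes; SATA drives report written data in vendor-specific attributes and units, so this conversion does not apply to them. The function name is hypothetical.

```python
# One NVMe "data unit" is 1000 x 512 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000
# Decimal terabyte, matching 1TB = 1000GB.
BYTES_PER_TB = 1000 ** 4


def data_units_to_tb(data_units_written):
    """Convert the NVMe 'Data Units Written' counter to decimal terabytes."""
    return data_units_written * NVME_DATA_UNIT_BYTES / BYTES_PER_TB


print(data_units_to_tb(1_953_125))  # 1.0
```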
TBW is Not a Fixed Value
The TBW will vary and is dependent on how the SSD is used. The concept is similar to how far a car can travel with a full tank of gas; the distance is dependent on the acceleration. For example, the TBW on a 128GB SSD can range from 25TB to 74TB, depending on the situation.
Real Life Workload
How do we know what kind of workload our system and software generate (sequential, client workload, or enterprise workload)? To know this, we would need special software or expensive equipment to capture read and write transactions, which is impractical. However, there is a way to know how “our software workload impacts SSD life span” without the need for special software or equipment. This is described in the “Real Life Workload in an Embedded System” section.
Temperature also affects SSD endurance through the error correction (ECC) required on the NAND. More errors need correcting at higher temperatures than at lower temperatures. Writing at one temperature and then reading at a different temperature also increases the number of errors that must be corrected. The more correction is needed, the harder the SSD controller has to work, which affects both performance and endurance.
Write Amplification (WA)
We have already discussed WA in this article, but it is important to know that WA is not the same for every workload, every firmware version, or every hardware configuration (number of channels and number of NAND per channel). Therefore, Virtium does not recommend using one WA to generalize the endurance for an SSD. Rather, we recommend using the “cause and effect” method as previously described.