$USER@bit-of-a-byte ~ $ cat /var/log/thothbackup-part-2.log

ThothBackup - Part 2

Hello! It’s that time of the week again, when I update everyone on my latest work. This episode is far less technical and focuses more on the concept of a “One and Done” backup solution, aka the holy grail of data maintenance.

It fucking sucks.

### Introduction {: #thoth-2-introduction }

This entry is a bit one-sided. The concept of a simple, easy-to-implement, catch-everything-you-might-ever-need solution is quite literally the holy grail, yet it has never honestly been implemented. Sure, user data is generally scooped up, but in this day and age of game mods, and with some development projects living outside the User directory, it seemed prudent to at least attempt a full backup. Well, I’ve been attempting it for seven days. Here’s what I’ve found.

### Focus

We will not be focusing on the space impact of a complete backup, which is actually fairly negligible. With out-of-band deduplication, only one set of operating system files would ever be stored, so server-side storage would reach a weird kind of equilibrium fairly quickly (there’s a minimal sketch of that idea just after the list). Instead, I’ll talk about three things:

- Metadata Overhead
- Metadata Processing
- Initial Synchronization
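
To illustrate why duplicate OS files are nearly free on the server side, here is a minimal sketch of content-addressed storage under out-of-band deduplication. The `ingest` helper and the in-memory `store` dict are hypothetical and purely illustrative, not how the actual server-side deduplication works.

```python
import hashlib

# Content-addressed store keyed by SHA-256: identical files (e.g. Windows
# system files every client ships) only ever cost one stored copy.
store = {}

def ingest(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)   # second, third, ... copies are no-ops
    return digest

# Two clients backing up the same OS file add exactly one entry.
key_a = ingest(b"identical system file contents")
key_b = ingest(b"identical system file contents")
assert key_a == key_b and len(store) == 1
```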

There may be another post tonight talking about additional things, but this deserves its own little deal.

### Metadata Overhead

A fully updated Windows 10 partition of your average gamer, aka my fiancé, is composed of 479,641 files and 70,005 directories which comprise a total data size of ~216 GiB. This is actually just the C drive and typical programs. If you factor in the actual game drive in use by our test case, that drive contains 354,315 files and 29,111 directories which comprise a total of ~385 GiB of space.

In summation, an initial synchronization of what is typically considered a “full system backup” comprises 833,956 files and 99,116 directories totaling ~601 GiB, which works out to an average file size of ~755 KiB and an average of roughly eight files per directory.
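
The napkin math behind those averages, as a quick sanity check (plain Python, numbers copied from above):

```python
# Combined "full system backup" numbers from the two drives above.
files = 479_641 + 354_315              # 833,956 files
dirs = 70_005 + 29_111                 # 99,116 directories
total_kib = (216 + 385) * 1024 * 1024  # ~601 GiB expressed in KiB

print(f"avg file size: {total_kib / files:.1f} KiB")   # ~755.7 KiB
print(f"avg files per directory: {files / dirs:.1f}")  # ~8.4
```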

SyncThing creates a block store composed of, by default, 128 KiB blocks. This means that for our system, assuming the data is contiguous, we need 4,923,392 metadata entries. Assuming the files are NOT contiguous, this is probably closer to about 5 million metadata entries. As of right now, the server-side metadata storage for the testing pool is at 1.7 GiB and the initial synchronization is not yet complete. Extrapolating a bit, 2.0 GiB would not be an unreasonable size for the final server-side data store.
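
The same estimate in code, assuming the ~601 GiB figure and the 128 KiB default block size from above:

```python
# Best case: data is perfectly contiguous and block-aligned.
total_bytes = 601 * 1024**3
block_size = 128 * 1024
print(total_bytes // block_size)   # 4,923,392 metadata entries

# In reality each of the ~834k files ends with a partial block, so padding
# the estimate toward ~5 million entries is reasonable.
```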

The client-side store, at the time of writing, is approximately 1 GiB and may grow slightly larger; I will call it 1 GiB. That gives a plausible total of 3 GiB of metadata overhead, or roughly 0.5% across the pool. Scaling up, ten clients with 1 TB of data each would require 51.2 GB of metadata.
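
Working out the overhead ratio and the scaled-up figure, using my extrapolated 2.0 GiB server store and the 1 GiB client store:

```python
metadata_gib = 2.0 + 1.0   # extrapolated server store + client store
data_gib = 601

overhead = metadata_gib / data_gib
print(f"overhead: {overhead:.2%}")               # ~0.50%

# Ten clients with 1 TB (1024 GB) each at the same ratio:
print(f"scaled: {overhead * 10 * 1024:.1f} GB")  # ~51 GB of metadata
```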

Should anything happen to the metadata store, it would need to be rebuilt by reprocessing the data. This introduces a potentially massive liability, as scanning frequency would need to be reduced so as not to impact the rebuild operation.

### Metadata Processing

The server is capable of a hash rate of 107 MB/s. I am using the server’s hash rate because it is both the slowest in the pool and the node that would have the most metadata to rebuild.

For a complete rebuild of our current cluster’s data, it would take the server ~96 minutes, during which no data synchronization could occur. This equates to a minimum of one missed hourly update, and could mean two missed hourly updates if the timing were unfortunate enough.

For a complete rebuild of our theoretical cluster’s data, we will allow for a hash rate of 300 MB/s against 10 TB of data. This works out to a database rebuild time of ~10 hours, which could result in up to 11 missed synchronization attempts.
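
The rebuild-time arithmetic for both clusters, under the assumption that the quoted hash rates behave like MiB/s (which reproduces the figures above); the `rebuild_minutes` helper is mine, not part of any tool:

```python
def rebuild_minutes(data_gib: float, hash_rate_mib_s: float) -> float:
    # Whole-pool rehash time; no synchronization can run while this happens.
    return data_gib * 1024 / hash_rate_mib_s / 60

# Current cluster: ~601 GiB at the server's 107 MB/s.
print(f"{rebuild_minutes(601, 107):.0f} min")           # ~96 min -> 1-2 missed hourly syncs

# Theoretical cluster: 10 TiB at an assumed 300 MB/s.
print(f"{rebuild_minutes(10 * 1024, 300) / 60:.1f} h")  # ~9.7 h -> up to 11 missed syncs
```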

### Initial Synchronization

The initial synchronization is composed of three primary parts. First, the client and host must agree on which folders to synchronize. Second, the client must build a database of the content hosted locally. Finally, using a rolling hash algorithm, data is entered into the metadata cache and transmitted to the server.
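
To make those last two steps concrete, here is a minimal sketch of the local scan: chunking a file into the 128 KiB blocks discussed earlier and hashing each one into a metadata entry. It uses a plain SHA-256 per block rather than the rolling hash mentioned above, and the function name is mine, so treat it as an illustration of the idea rather than SyncThing’s real implementation.

```python
import hashlib
from typing import Iterator, Tuple

BLOCK_SIZE = 128 * 1024  # the default block size discussed earlier

def scan_file(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (offset, block hash) metadata entries for one local file."""
    with open(path, "rb") as f:
        offset = 0
        while block := f.read(BLOCK_SIZE):
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)

# Each (offset, hash) pair lands in the local metadata cache and is announced
# to the server, which then pulls only the blocks it does not already have.
```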

Per the developer of SyncThing, millions of small files are the worst-case scenario for the backup system. In my own independent, albeit anecdotal, testing, the synchronization process is still running after seven days. This represents a very poor user experience and would not be ideal for a widespread rollout.

### Conclusion {: #thoth-2-conclusion }

The primary goal of a backup utility is to synchronize files and achieve cross-system consistency as quickly as possible. While it is true that eventually consistent systems are used in large-scale operations, that type of consistency is, in my opinion, only allowable at data sizes over 10 TB. The current testing set is approximately 1 TB at most, so this is unacceptable.

Either the backup paradigm must change, or the utility used to implement it must change. While I do not expect to find any faster utilities for performing the backup process, I do plan to continue experimenting. At this time, however, it seems the most likely way to make the process as friendly as possible would be to implement a default backup subset, with additional data added upon user request, and only after the high-priority synchronization has completed.