350M files on a 1TB disk

2015-12-20

Summary

Assume we were to append empty files to a parent file on a 1TB disk, one at a time.

As of Boomla v0.0.4, we could create about 250.000 such files before the disk was full.
As of Boomla v0.0.5, we can create about 350.000.000 of them.

No breaking changes were introduced. Even though v0.0.5 restructured the filesystem layout, Boomla remained backwards compatible, no data conversion is required.

Why

Filesystems are simple. Databases are not only hard, they are walled gardens. They are designed so that my sister has little chance to become a DB native. Yet working with files is totally natural to her.

One of the goals of Boomla is to provide a simple, unified storage for all our general needs.

Oddly, for programming, it is the other way around. If you are programming, databases are simple and files are hard.

So, while it may seem that databases are complex and need to be fixed to be as simple as files, really files are hard and need to be fixed to be as simple as databases! Or, maybe the two worlds are just different and there is no reason why we couldn’t build a storage solution that takes the best of both worlds. A filesystem that doubles as a database at the same time.

The Boomla FS is designed to be that storage solution.

Databases let one store lots of entries in a table. To be a useful alternative, we would require the Boomla FS to have similar characteristics.

How

Up to v0.0.4, the list of files in a parent file (“directory”) were stored as a simple list. Thus inserting a new file would require us to make a copy of the entire list, insert our entry and store the new list. Note that the Boomla FS is also version controlled, so we are not allowed to modify the old file listing.

As of v0.0.5, the file listing is stored using B+ trees. This way we can efficiently store incremental changes to the filesystem, instead of storing the entire listing.

Note that the 350M files are created one transaction at a time. This means that your 1TB disk would also hold the previous 350M states of your website. Without the history, you could store over 4B files.

Why now

Although storing millions of files in a single parent was not a pressing issue just yet, it was a required by feature I’m working on. I just decided to release more often.

More numbers

I have benchmarked creating an empty filesystem and then appending N files to the root.

Results for v0.0.4

     N | all space required | average space per file
     1 |              579 B |                  579 B
    10 |            5 043 B |                  504 B
   100 |          178 608 B |                1 786 B
 1 000 |       15 240 108 B |               15 240 B
10 000 |    1 542 868 608 B |              154 286 B

Results for v0.0.5

     N | all space required | average space per file
     1 |              573 B |                  573 B
    10 |            4 955 B |                  495 B
   100 |           73 655 B |                  736 B
 1 000 |        1 042 988 B |                1 042 B
10 000 |       13 617 689 B |                1 361 B

The chart below shows how much extra space is required (in bytes) for adding an empty file on the old vs. the new implementation.

Note that both axes are logarithmic.

boomla-v0.0.5-storage-space-improvement-chart.png


Cheers,

you can follow me on Twitter