Friday, May 15, 2009

Pruning File Servers

One of our main file servers filled up its disk last night.

This happens quite regularly to us. Our nightly build and test process produces something like 20-40 GB of output each day, so we are always fighting a losing battle with disk space.

Modern file servers can easily hold several terabytes of disk at quite a reasonable price, so we have about 5 terabytes of shared master file server space available.

But since we produce about a terabyte a month, that's only 4-5 months' worth of data that we can keep online.
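To make that arithmetic concrete, here is a small back-of-the-envelope calculation. It is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 TB = 1000 GB) and a 30-day month, neither of which comes from our actual accounting.

```python
def months_of_retention(capacity_tb, gb_per_day_low, gb_per_day_high, days_per_month=30):
    """Return (shortest, longest) number of months of output that fit on disk.

    Assumes 1 TB = 1000 GB and a fixed-length month; both are simplifications.
    """
    tb_per_month_low = gb_per_day_low * days_per_month / 1000
    tb_per_month_high = gb_per_day_high * days_per_month / 1000
    # The heaviest output rate gives the shortest retention, and vice versa.
    return capacity_tb / tb_per_month_high, capacity_tb / tb_per_month_low


shortest, longest = months_of_retention(capacity_tb=5, gb_per_day_low=20, gb_per_day_high=40)
print(f"{shortest:.1f} to {longest:.1f} months of data fit online")
```

At the heavy end of the range (40 GB a day, or about 1.2 TB a month), 5 terabytes lasts a little over 4 months, which matches the figure above.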

So, given that we can't keep everything online forever, how do we handle it?

Well, there are a few basic principles that we go by:
  • Don't clean until we need to, even to the point of allowing the disk to go 100% full. Modern file systems seem to handle this without going corrupt or doing other stupid things.
  • When cleaning, clean minimally and conservatively.
  • When cleaning, consider that not all files are alike. Some files are more important than others.
That is, we try to ensure that we keep more important data online in preference to less important data, and we try to ensure that we keep as much data online as possible.
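One way to picture "clean minimally, and delete the least important data first" is as a selection problem. The sketch below is illustrative only: the `Candidate` record, the numeric `importance` ranking, and the function name are all hypothetical, and it only chooses files; it never deletes anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    size_bytes: int
    age_days: float
    importance: int  # hypothetical ranking: higher means more worth keeping


def choose_minimal_deletions(candidates, bytes_needed):
    """Pick the least important, then oldest, files until enough space is freed.

    Returns (chosen, freed_bytes). Stops as soon as the target is met, so
    nothing beyond what is needed is selected.
    """
    ordered = sorted(candidates, key=lambda c: (c.importance, -c.age_days))
    chosen, freed = [], 0
    for candidate in ordered:
        if freed >= bytes_needed:
            break
        chosen.append(candidate)
        freed += candidate.size_bytes
    return chosen, freed
```

Given a list of candidates and a target such as "free 50 GB", the function returns the smallest prefix of the least-important, oldest files that reaches the target. If the candidates cannot cover the target, it returns all of them along with the smaller total, so the caller can tell the goal was not met.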

At times in the past, we've used automatic disk pruning tools that I've written, and there are still a few places on our file servers where we do this. But writing and running automatic disk pruning tools is a very scary activity, because a bug in the tool could easily cause much more damage than I'm willing to risk. When writing those tools, I tried to be very careful:
  • Testing the tool extensively
  • Auditing the actions of the tool, and mailing myself a full report when the tool took action
  • Providing the tool with extensive rules so that it could decide which files to delete based on age, size, type, name, and other characteristics
  • Controlling the tool so that it didn't just rummage over the entire hard disk, but instead was an "opt in" system, where we had to explicitly instruct the tool to access a certain area on the file system.
  • Never running the tool as a highly-privileged user.
Even so, I was always worried whenever the tool ran, nervous that it would do something terrible to the master file server.
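
The precautions listed above can be sketched in a few dozen lines. This is not the tool I actually wrote. It is a minimal illustration under stated assumptions: the function names, the rule set (just minimum age and file suffix), and the dry-run default are all invented for the example, and the audit report is simply returned as a list of lines rather than mailed anywhere.

```python
import os
import time
from pathlib import Path


def plan_pruning(opt_in_roots, min_age_days, suffixes, now=None):
    """Return (path, size_bytes, age_days) for files the rules would delete.

    Only directories listed in opt_in_roots are ever visited (opt-in).
    """
    now = time.time() if now is None else now
    plan = []
    for root in opt_in_roots:
        # os.walk does not follow symlinked directories by default.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = Path(dirpath) / name
                if path.is_symlink() or path.suffix not in suffixes:
                    continue
                info = path.stat()
                age_days = (now - info.st_mtime) / 86400
                if age_days >= min_age_days:
                    plan.append((str(path), info.st_size, age_days))
    return plan


def prune(opt_in_roots, min_age_days, suffixes, dry_run=True, now=None):
    """Apply the plan and return an audit report, one line per file."""
    # Refuse to run with root privileges (the check is skipped where
    # os.geteuid does not exist, for example on Windows).
    if hasattr(os, "geteuid") and os.geteuid() == 0:
        raise PermissionError("refusing to prune as a highly-privileged user")
    report = []
    for path, size, age_days in plan_pruning(opt_in_roots, min_age_days, suffixes, now):
        if not dry_run:
            os.remove(path)
        action = "would delete" if dry_run else "deleted"
        report.append(f"{action}: {path} ({size} bytes, {age_days:.0f} days old)")
    return report
```

Making dry-run the default means that a careless invocation only produces a report. Actually deleting anything requires passing `dry_run=False` explicitly, which is the same "opt in" spirit applied to the destructive step itself.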

Each time the file server fills up, there is a certain amount of inconvenience, and a certain amount of downtime, while we scramble to find and free up some space to keep running. And each time this happens, I think that I should return to my automated pruning tools, and get them back into operation, so that this part of our operations runs more smoothly.

But for the time being, this part of our production system remains a manual process.
