
I have been hosting Nextcloud for more than 8 years (and previously Owncloud), both for myself and for customers. During that time, backups have saved me countless times and I have continuously optimized my approach to them. In this series, I'll share what I've learned and the resulting backup solution for Nextcloud which, at least for me, leaves nothing to be desired.

This first part covers the considerations behind my backup design. Stay tuned for the next article to learn how to implement a backup process that fulfills the requirements discussed here.

What actually makes a great backup?

Before jumping into setting up our backup process, let's take a second to think about what backups are actually supposed to achieve.

The main purpose of backups

Backups are our last line of defense against data loss. Therefore, in order to figure out whether they serve their purpose, we have to think about the causes of data loss that we need to take into account.

Data loss by accident

8 years of hosting experience definitely means one thing: I have fucked up more than once. Be it a corrupted copy during an update, a misconfigured Nextcloud client or a lost encryption key. It might be embarrassing, but pretending that those things don't happen would just be plain dangerous. It might not even be us - users have the capability to accidentally delete their own important data as well. Luckily, if we play our cards right and have strong backups, we can use those to get away with just a black eye in such situations.

Having any sort of backup will save us here. We just have to make sure that we back up frequently enough for our needs.

Data loss from hardware failure

A manufacturing issue, a power surge or just normal disk degradation - hardware failures are just a matter of time when running a server. And once they affect our disks, we are at risk of data loss. While we can (and definitely should!) protect ourselves from some cases of disk failure through disk health monitoring and RAID setups, we can't rule them out entirely. A power supply issue, especially, can cause failures in multiple disks at once. So we really want backups that are not at risk of being affected by hardware failures in our server.

The key here is to decouple our backups from the hardware our service is running on. A separate backup disk is better than nothing, a separate server is better still. And of course, we need to take care to use reliable storage in the first place, with built-in redundancy (e.g. RAID 5) and error correction.

Data loss from natural disaster or theft

Wherever our server is located - there's probably a nonzero chance of "natural" disasters like fire, flood, or similar. And then, depending on how much we're able to protect our server physically, there's also the chance for vandalism or theft. Mine is sitting in my flat, for example, so that's definitely a scenario I need to prepare for. Backups are part of the solution here (of course accompanied by strong disk encryption and information security).

To accommodate this, we need to go one step further: Our backup storage should be in a different physical location than our application server. Common recommendations are a minimum distance of 200km, but if you are running a private cloud, having a backup at a friend's place (ideally in a different city) is probably fine.

Data loss caused by malicious actors

So, now we're talking about the other kind of malicious actor, not the kind that breaks down your front door and leaves with your server under their arm. Instead, this is about our server being compromised by attackers exploiting security vulnerabilities in our setup. As demonstrated by the xz utils backdoor, the log4shell vulnerability in a popular Java logging library or the regreSSHion vulnerability in OpenSSH, it's not possible to reliably rule out that our server will be vulnerable at some point. A compromised server can be abused by attackers in many ways: harvested credentials can be used to impersonate users, send spam mail, or collect and leak valuable data, and all of those are bad.

However, there's one specific attack that caused an estimated 40-50 billion dollars of cost in 2024 and can be prevented by backups. I am, of course, talking about ransomware, where attackers encrypt our data and try to blackmail us into paying a ransom to get it back. If we set up our backups with this scenario in mind, we can simply purge our server, set it up from scratch and restore a backup, which will be annoying but by far better than paying the ransom. For completeness's sake, though, I want to mention that ransomware attacks often also include the risk of leaking sensitive data, which backups can't protect us from.

The key here is to set up our backup storage in a way that allows our server to write backups, but not to delete or modify them (at least not before a specified retention time has passed). One way to achieve this is by using advanced storage servers like object storage (e.g. Amazon AWS S3, Google Cloud Storage or open source alternatives like Ceph or MinIO). Alternatively, you could automate a privileged process on the backup location that changes a backup's permissions so the writing client can't modify it anymore, or moves it to another location the client does not have access to. Or you trigger the entire backup transfer from the backup server in the first place.
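To make the object storage variant a bit more concrete, here is a minimal sketch in Python using boto3 against an S3-compatible bucket with Object Lock enabled. The bucket name, key and retention period are placeholders of my own, not recommendations, and the credentials used by the server should only allow uploads.

```python
# Sketch: upload a backup to an S3-compatible bucket with Object Lock, so the
# application server can write the object but cannot delete or shorten its
# retention. Assumes the bucket was created with Object Lock enabled and the
# credentials only grant upload permissions (no delete).
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")  # pass endpoint_url=... for MinIO/Ceph instead of AWS

RETENTION_DAYS = 30  # placeholder retention period


def upload_backup(local_path: str, bucket: str, key: str) -> None:
    retain_until = datetime.now(timezone.utc) + timedelta(days=RETENTION_DAYS)
    with open(local_path, "rb") as f:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=f,
            ObjectLockMode="COMPLIANCE",  # cannot be removed before the date below
            ObjectLockRetainUntilDate=retain_until,
        )


# upload_backup("/tmp/nextcloud-backup.tar.gz", "my-backup-bucket",
#               "nextcloud/2024-06-01.tar.gz")
```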

As you can see, this can be a bit tricky to set up. In the next part of this series I will give a detailed example of solving this with object storage.

Reliable Recovery Process

Of course, any backup is only as good as our ability to restore it. So knowing exactly how to do that is mandatory when establishing a backup process, as is ensuring we will still have that knowledge when we need it.

We will be really happy to have good documentation on the restore process when we need our backups. Even better (if applicable) would be a script that can restore a backup in an automated fashion. Don't forget to test it, though! :)
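To illustrate what such a script could look like, here is a heavily simplified sketch. Every path, database name and command in it is a placeholder for whatever your actual setup uses; a real restore script would also need error handling and a maintenance-mode step around it.

```python
# Sketch of an automated restore: unpack the backup archive, restore the
# database dump and the data directory. All paths and names are placeholders.
import subprocess
import tarfile

BACKUP_ARCHIVE = "/restore/nextcloud-backup.tar.gz"  # placeholder
DB_DUMP = "/restore/extracted/db.sql"                # placeholder
DATA_TARGET = "/var/www/nextcloud-data"              # placeholder


def restore() -> None:
    # 1. Unpack the backup archive
    with tarfile.open(BACKUP_ARCHIVE) as tar:
        tar.extractall("/restore/extracted")
    # 2. Restore the database dump (MySQL/MariaDB as an example)
    with open(DB_DUMP, "rb") as dump:
        subprocess.run(["mysql", "nextcloud"], stdin=dump, check=True)
    # 3. Restore the data directory
    subprocess.run(
        ["rsync", "-a", "--delete", "/restore/extracted/data/", DATA_TARGET + "/"],
        check=True,
    )


if __name__ == "__main__":
    restore()
```

Even a sketch like this is only worth something if it gets tested against a real backup from time to time.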

Regular backups + retention

We want to achieve a satisfactory frequency of backup creation as well as a satisfactory retention period. The former defines the maximum time span of data we can lose (the time since the last backup) and the latter defines how much time we have to realize that there's an issue with our data before losing the chance to correct it. Taken together, they define the total number of backups we want to keep. In a naive approach, where disk space scales linearly with the number of backups, this quickly becomes a tradeoff between expensive storage and recoverability. However, I will present options to keep many backups with minimal storage costs.

This one is very straightforward: "Just do it". The caveat here is that regular and long-retained backups will result in a large total number of backups and therefore potentially rising storage costs. So this is about striking a good compromise between storage cost and backup count - or about very storage-efficient backups.
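A common way to strike that compromise is a tiered retention policy, e.g. keeping the last few daily, weekly and monthly backups. The sketch below shows the idea with arbitrary example numbers; backup tools like restic or borg ship this kind of pruning policy out of the box, so treat this as an illustration rather than something you need to write yourself.

```python
# Sketch: decide which backups to keep under a simple daily/weekly/monthly
# retention policy. Input is a list of backup timestamps; output is the set
# of timestamps that survive pruning. The limits are arbitrary examples.
from datetime import datetime


def backups_to_keep(timestamps: list[datetime],
                    daily: int = 7, weekly: int = 4,
                    monthly: int = 12) -> set[datetime]:
    keep: set[datetime] = set()
    newest_first = sorted(timestamps, reverse=True)

    def keep_latest_per_bucket(bucket_key, limit):
        seen = set()
        for ts in newest_first:
            key = bucket_key(ts)
            if key not in seen:
                seen.add(key)
                keep.add(ts)  # newest backup of this day/week/month
            if len(seen) >= limit:
                break

    keep_latest_per_bucket(lambda ts: ts.date(), daily)              # daily tier
    keep_latest_per_bucket(lambda ts: ts.isocalendar()[:2], weekly)  # weekly tier
    keep_latest_per_bucket(lambda ts: (ts.year, ts.month), monthly)  # monthly tier
    return keep
```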

Low storage costs

As just mentioned, we ideally want to keep many backups, so we need to keep an eye on our storage costs.

Our two best approaches to reducing storage requirements are deduplication (or incremental backups) and compression. This can be solved at the storage level or by your backup utility/process.
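For intuition, here is the core idea behind deduplicated, compressed backups boiled down to a few lines. Real tools (restic, borg and friends) use content-defined chunking, encryption and proper indexes; this sketch with fixed-size chunks and a placeholder chunk store only shows the concept.

```python
# Sketch: the basic idea behind deduplicated, compressed backups. Fixed-size
# chunks are hashed; a chunk is only stored (compressed) if its hash has not
# been seen before, so unchanged data costs no additional space.
import hashlib
import zlib
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024    # 4 MiB chunks, arbitrary example size
STORE = Path("/backup/chunks")  # placeholder chunk store


def store_file(path: str) -> list[str]:
    """Store a file as deduplicated chunks, return the list of chunk hashes."""
    STORE.mkdir(parents=True, exist_ok=True)
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            target = STORE / digest
            if not target.exists():                        # deduplication
                target.write_bytes(zlib.compress(chunk))   # compression
            hashes.append(digest)
    return hashes
```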

Zero downtime backups

As we want regular backups, we need backups without service downtime (if we back up more than once per day, we might have a hard time explaining to our users why our service is interrupted that often).

I'm afraid there's no one-size-fits-all here. Whether and how zero downtime backups are feasible largely depends on the service's infrastructure, specifically how its state is managed. If the application state is mostly in traditional relational databases, you will probably be able to use single-transaction dumps or snapshot features to perform zero downtime backups; if it's mostly files, you can use filesystem snapshot features; and in other cases you might want to rely on virtual machine/hypervisor snapshots.
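As one example of the database route, a consistent MySQL/MariaDB dump can be taken while the service keeps running. The database name and output path below are placeholders, and credential handling is left out of the sketch.

```python
# Sketch: a consistent, online dump of a MySQL/MariaDB database using
# --single-transaction, so the service does not need to be stopped.
# Database name and output path are placeholders.
import subprocess


def dump_database(db_name: str = "nextcloud",
                  out_path: str = "/backup/db.sql") -> None:
    with open(out_path, "wb") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction", "--quick", db_name],
            stdout=out,
            check=True,
        )


# For PostgreSQL, pg_dump produces a consistent snapshot by default; for file
# data, filesystem snapshots (LVM, ZFS, btrfs) play the analogous role.
```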

In the next part of this series I will give a detailed example for solving this for a Nextcloud server.

Functional backups

When we need our backups, they should be present, complete and working, so we should find ways to ensure that. This might sound trivial, but right after data loss is an inconvenient time to notice that your backup process has had an issue - and that is exactly when we will notice it if we don't take precautions.

The key to ensuring the integrity and functionality of our backups is robust backup monitoring. There are a number of things to take into consideration here:

  1. What to monitor: Depending on the checks we perform on our backups we get vastly different guarantees:
    • presence monitoring: We know there actually is a backup, but we don't know if it works
    • integrity monitoring: We know the backup is still the same as when we wrote it. This can be helpful to detect bit rot or disk failures (if we didn't detect them by other means). A combined sketch of these first two checks follows after this list.
    • restore tests: The gold standard is, of course, to attempt restoring the backup and see if everything is there and working. This may not be feasible though, depending on the complexity of our service and the size of our backup.
  2. Where the monitoring is running: Ideally, we would run it separately from both our application and our backup server. Alternatively, we need some way to get notified if our monitoring itself is down (e.g. by a mobile app polling its status endpoint).
  3. How we will notice if something isn't right: Monitoring is only half as valuable without good alerting. Our monitoring solution should have a way to notify us if our backups don't look right.
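As promised above, here is a rough sketch of the first two levels (presence and integrity): a small check script that verifies the newest backup in the bucket is recent enough and that its checksum matches a stored value. The bucket name, key prefix, maximum age and the "<key>.sha256" convention are all assumptions of mine, not a standard.

```python
# Sketch: presence + integrity check for backups in an S3-compatible bucket.
# Assumes each backup object has a sibling "<key>.sha256" object containing
# the expected checksum (that convention is an assumption, not a standard).
from datetime import datetime, timedelta, timezone
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"    # placeholder
PREFIX = "nextcloud/"          # placeholder
MAX_AGE = timedelta(hours=26)  # alert if the newest backup is older than this


def check_backups() -> None:
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    backups = [o for o in objects if not o["Key"].endswith(".sha256")]
    assert backups, "no backups found at all"                  # presence

    newest = max(backups, key=lambda o: o["LastModified"])
    age = datetime.now(timezone.utc) - newest["LastModified"]
    assert age < MAX_AGE, f"newest backup is {age} old"        # freshness

    # Downloading the whole object is fine for a sketch; stream for big backups.
    body = s3.get_object(Bucket=BUCKET, Key=newest["Key"])["Body"].read()
    expected = s3.get_object(Bucket=BUCKET, Key=newest["Key"] + ".sha256")["Body"].read()
    assert hashlib.sha256(body).hexdigest() == expected.decode().strip(), \
        "checksum mismatch"                                    # integrity


if __name__ == "__main__":
    check_backups()  # wire the exit code into your alerting
```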

There's a lot to talk about here, which is why I will dedicate the third and final part of this series to backup monitoring.

Don't leave the porch open

Having backups should not increase our risk of data leaks. Therefore, we need to make sure that the backups are at least as secure as the service data itself.

We basically have two options here: either we use backup storage that we trust (both in terms of security and compliance), or we securely encrypt our backups when writing them (in both cases, we still need to trust our provider's capability and willingness to provide storage integrity and availability).
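If you go the encryption route, the idea is simply to encrypt locally before anything leaves the server. Here is a minimal sketch using the Python cryptography library's Fernet construction; the key file path is a placeholder, and the key itself must of course be stored somewhere that is not part of the backup.

```python
# Sketch: encrypt a backup archive locally before uploading it to untrusted
# storage. Uses symmetric encryption (Fernet from the "cryptography" package);
# the key file location is a placeholder and must NOT end up in the backup.
from cryptography.fernet import Fernet

KEY_FILE = "/etc/backup/backup.key"  # placeholder, keep a copy off-server!


def load_or_create_key() -> bytes:
    try:
        with open(KEY_FILE, "rb") as f:
            return f.read()
    except FileNotFoundError:
        key = Fernet.generate_key()
        with open(KEY_FILE, "wb") as f:
            f.write(key)
        return key


def encrypt_file(src: str, dst: str) -> None:
    fernet = Fernet(load_or_create_key())
    with open(src, "rb") as f:
        ciphertext = fernet.encrypt(f.read())  # fine for moderately sized archives
    with open(dst, "wb") as f:
        f.write(ciphertext)


# encrypt_file("/backup/nextcloud.tar.gz", "/backup/nextcloud.tar.gz.enc")
```

Remember that losing the key means losing the backups, so the key needs its own (offline) backup.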

Summary

Alright, let's wrap up our goals:

  • backups decoupled from our server's hardware and stored in a different physical location
  • backup storage our server can write to, but not delete from or modify
  • a documented (and ideally automated, tested) restore process
  • regular backups with a sensible retention policy
  • low storage costs through deduplication/incremental backups and compression
  • backup creation without service downtime
  • monitoring and alerting that ensure our backups actually work
  • backups that are at least as secure as the service data itself

That's certainly a lot of boxes to tick. A detailed walk through my approach to accomplishing this will be the topic of the next part in this series.

You are welcome to follow me on Mastodon or subscribe your feed reader to the RSS feed to make sure you won't miss it. :)