I also back up multiple hosts to the same repository, which actually results in insane storage space savings. One thing I'm missing though is being able to specify multiple repositories for one snapshot such that I have consistency across the multiple backup locations. For now the snapshots just have different ids.
I haven't tried that recently (~3 years ago): does that work with concurrency, or do you need to ensure only one backup runs at a time? Back when I tried it I got the sense that it wasn't really meant to have many machines accessing the repo at once, and decided it was probably worth wasting some space in exchange for potentially more robust backups, especially for my home use case where I only have a couple of machines to back up. But it'd be pretty cool if I could replace my main backup servers (using rsync --inplace and zfs snapshots) with restic and get deduplication.
Locks are created e.g. when you want to forget/prune data or when running a check. The way I handle this is with systemd timers for my backup jobs: before I run e.g. a check command, I use an ansible ad-hoc command to pause the systemd units on all hosts and then wait until their in-flight operations are done. After doing my modifications to the repos I enable the units again.
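That coordination step is roughly the following (a minimal sketch; the host group backup_hosts and the unit names restic-backup.timer/.service are made up):

    # stop the timers everywhere so no new backups start
    ansible backup_hosts -m systemd -a "name=restic-backup.timer state=stopped"

    # wait for any in-flight backup run to finish
    ansible backup_hosts -m shell \
      -a "while systemctl is-active --quiet restic-backup.service; do sleep 10; done"

    # ... run restic forget/prune/check against the repos ...

    # start the timers again
    ansible backup_hosts -m systemd -a "name=restic-backup.timer state=started"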
Another tip is that you can create individual keys for your hosts for the same repository. Each host gets its own key, so a host compromise only compromises that key, which can then be revoked after the breach. And as I said, I use rest-servers in append-only mode, so an attacker can only "waste storage" in case of a breach. And I also back up to multiple different locations (sequentially), so if a backup location is compromised I could recover from that.
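The commands for that are roughly these (a sketch; the repository URL, path, and key ID are placeholders):

    # on the backup server: serve the repository in append-only mode
    rest-server --path /srv/restic --append-only

    # on a new host: add its own key (run with an already-known key)
    restic -r rest:https://backup.example.com/myrepo key add

    # after a breach: list the keys and remove the compromised one
    restic -r rest:https://backup.example.com/myrepo key list
    restic -r rest:https://backup.example.com/myrepo key remove <key-id>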
I don't back up the full hosts, mainly application data. I use tags to tag by application, backup type, etc. One pain point is, as I mentioned, that the snapshot IDs in the different repositories/locations are different. Also, because I back up sequentially, data may have already changed between writing to the different locations. But this is still better than syncing them with another tool as that would be bad in case one of the backup locations was compromised. The tag combinations help me deal with this issue.
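The tagging itself is just restic's --tag flag, something like this (application name, tag names, and paths are invented for illustration):

    restic -r rest:https://backup.example.com/myrepo backup /srv/nextcloud \
      --tag nextcloud --tag appdata --tag daily

    # later, find the matching snapshots in each repository by tag combination
    restic -r rest:https://backup.example.com/myrepo snapshots --tag nextcloud,daily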
Restic really is an insanely powerful tool and can do almost everything other backup tools can!
The only major downside to me is that it is not available in library form to be used in a Go program. But that may change in the future.
Also, what would be even cooler for the multiple backup locations, is if the encrypted data could be distributed using e.g. something like shamir secret sharing where you'd need access to k of n backup locations to recreate the secret data. That would also mean that you wouldn't have to trust whatever provider you use to back up to (e.g. if it's amazon s3 or something).
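As far as I know restic can't do anything like that today. A poor man's approximation is to split only the repository password with Shamir secret sharing (e.g. the ssss tool), so no single provider plus a single share is enough to read the data; the encrypted blobs themselves are still fully replicated rather than split. A sketch:

    # split the repository password into 5 shares, any 3 of which reconstruct it
    echo -n "$RESTIC_PASSWORD" | ssss-split -t 3 -n 5 -q

    # later: recombine by pasting any 3 shares when prompted
    ssss-combine -t 3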
Backups are append-only and each host gets its own key; the keys can be individually revoked.
Edit: I have to correct myself. After further research, it seems that append-only != write-only. So you are correct that a single host could possibly access/read data backed up by another host. I suppose it depends on the use case whether that is a problem.
I believe that using non-ECC RAM is a potential cause of silent disk errors. If you read a sector without error and then a cosmic ray flips a bit in the RAM holding that sector, you now have a bad copy of the sector with no error indication. Even if the backup software hashes the bad data and records the hash alongside it, it's too late: the hash is of bad data. If you are lucky and the hash is computed before the RAM bit flip, at least the hash won't match the bad data, so if you try to restore the file you'll get an error at restore time. It's impossible to recover the correct data, but at least you'll know that.
The good news is that if you back up the bad data again, it will be read correctly and will differ from the previous backup. The bad news is that most backup software skips files based on metadata such as ctime and mtime, so until the file changes, it won't be re-saved.
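Most tools do have an escape hatch for this, though. restic, for example, has a flag that forces it to re-read file contents even when the metadata says nothing changed (repo path and source path are placeholders):

    # re-read all source files instead of trusting ctime/mtime against the parent snapshot
    restic -r /path/to/repo backup /data --force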
We are so dependent on computers these days that it's a real shame all computers don't come standard with ECC RAM. The real reason for that is that server manufacturers want to charge data centers higher prices for "real" servers with ECC.
Also, not sure why this was posted; was a new version released or something?
And that's what I did myself. Organically it grew to ~200 lines, but it sits in the background (I created a systemd unit for it, too) and does its job. I also use rclone to store the encrypted backups in an AWS S3 bucket.
I forget about it so completely that sometimes I have to remind myself to test whether it still works (it does).
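The moving parts boil down to little more than this (a sketch; paths, the remote name, and unit names are made up):

    # run-backup.sh, started by a oneshot backup.service + backup.timer unit pair
    set -euo pipefail

    # 1. create/update the local encrypted backup here (borg, restic, tar+gpg, ...)

    # 2. mirror the repository to S3
    rclone sync /var/backups/repo s3remote:my-backup-bucket/repo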
                   Original size   Compressed size   Deduplicated size
    All archives:        2.20 TB           1.49 TB            52.97 GB
The last time I used restic, a few years ago, it choked on a not-so-large data set with high memory usage. I read that Borg doesn't choke like that.
The files are on an HDD and the machine doesn't have a lot of RAM; given the high I/O wait times and low overall CPU load, I'm pretty sure the bottleneck is loading filesystem metadata off disk.
I wouldn't back up billions of files or petabytes of data with either restic or borg; stick to ZFS for anything at that scale.
I don't remember what the initial scan time was (it was many years ago), but it wasn't unreasonable — pretty sure the bottleneck also was in disk I/O.
Cheap, reliable, and almost trouble-free.
Not affiliated, just a happy user.
I switched over from Duplicati a long while back, when my laptop's sole HDD failed and Duplicati was giving me 143-year estimates for the restore to complete. This was true whether I tried to restore the whole drive or just a single file.
Fair point though, both have enough of a user base that they could be considered safe at this point.
"Baqpaq takes snapshots of files and folders on your system, and syncs them to another machine, or uploads it to your Google Drive or Dropbox account. Set up any schedule you prefer and Baqpaq will create, prune, sync, and upload snapshots at the scheduled time.
"Baqpaq is a tool for personal data backups on Linux systems. Powered by BorgBackup, RSync, and RClone it is designed to run on Linux distributions based on Debian, Ubuntu, Fedora, and Arch Linux."
At: https://store.teejeetech.com/product/baqpaq/
Though personally I use Borg, Rsync, and some scripts I wrote based on Tar.
What are the current recommendations here for doing periodic backups of a NAS with lower (not lowest) costs for about 1 TB of data (mostly personal photos and videos), ease of use, and robustness one can depend on? (I know this sounds like a “pick two” situation.) I also want the backup to be completely private.
I've been mostly using restic over the past five years to back up two dozen servers plus several desktops (one of them Windows), no problems so far, and it's been very stable in both senses of the word (absence of bugs and an unchanging API, both "technical" and "user-facing").
https://github.com/restic/restic
The important thing is to run periodic scrubs with a full data read to check that your data can actually be restored (I do it once a week; once a month is probably the upper limit).
restic check --read-data ...
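If a full weekly read is too heavy for your storage, recent restic versions can also verify a rotating subset, and the whole thing is easy to put on a schedule (a sketch with a placeholder repo path):

    # read and verify all pack files
    restic -r /srv/restic-repo check --read-data

    # or spread the load: verify a random 10% per run
    restic -r /srv/restic-repo check --read-data-subset=10%

    # e.g. from cron, every Sunday at 03:00
    # 0 3 * * 0  restic -r /srv/restic-repo check --read-data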
Some suggestions for the receiver, unless you want to go for your own hardware: https://www.rsync.net/signup/order.html?code=experts
(the code is NOT a referral, it's their own internal thingy that cuts the price in half)
There are plenty of storage server providers where you can get ssh access and 1-2TB for a few dollars per TB per month. You can run multiple repositories from a single server.
As the data is encrypted, even if the storage server is compromised, your data can't be read by others without the key.
We are doing our best to complement existing solutions :)
If files named like this:

    DSC009847.JPG

were actually named like this:

    DSC009847-b3-73ea2364d158.JPG

where "-b3-" means "what comes before the extension is the first x bits (choose as many hex digits as you want) of the Blake3 cryptographic hash of the file"... we'd be living in a better world.
I do that for many of my files. Notably family pictures and family movies, but also .iso files, tar/gzip'ed files, etc.
This makes detecting bitflips trivial.
I've created little shell scripts for verification, backups, etc. that work with files having such a naming scheme.
It's bliss.
My world is a better place now. I moved to such a scheme after I had a series of 20 pictures from vacation with old friends that were corrupted (thankfully I had backups, but the concept of "determining which one is the correct file" programmatically is not that easy).
And, yes, it detected one bitflip since I'm using it.
I don't always verify all the checksums, but I've got a script that does random sampling: it picks x% of the files with such a naming scheme and verifies the checksums of just those randomly picked files.
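The verification can be as simple as this (a sketch; it samples a fixed number of files rather than a percentage, and assumes lowercase hex in the names):

    # pick 200 random files carrying a -b3-<hex> tag and re-check them
    find . -type f -name '*-b3-*' | shuf -n 200 | while read -r f; do
        want=$(printf '%s\n' "$f" | sed -E 's/.*-b3-([0-9a-f]+)\.[^.]+$/\1/')
        got=$(b3sum --no-names "$f" | cut -c1-"${#want}")
        [ "$want" = "$got" ] || echo "CHECKSUM MISMATCH: $f"
    done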
It's not incompatible with ZFS: I still run ZFS on my Proxmox server. It's not incompatible with restic/borg/etc. either.
This solves so many issues, including "How do you know your data is correct?" (answer: "Because I already watched that family movie after the cryptographic hash was added to its name").
Not a panacea but doesn't hurt and it's really not much work.
I prefer sidecar files [1], like

    DSC009847.JPG.b3sum

or per-directory checksum files like B3SUMS, because they can be verified with standard tools.
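"Standard tools" here means b3sum itself; the sidecar is just its normal output format (a sketch):

    # create the sidecar, then verify it later
    b3sum DSC009847.JPG > DSC009847.JPG.b3sum
    b3sum --check DSC009847.JPG.b3sum

    # or one B3SUMS file per directory
    b3sum ./*.JPG > B3SUMS && b3sum --check B3SUMS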
This scheme also allows you to checksum files whose names you can't or don't want to change.
(Though in that situation you have the alternative of using a symlink for either the original name or the name with the checksum.)
I have used the scheme less since I adopted ZFS. I do use a very similar scheme, example.com/foo/bar/b3-abcd0123.html for https://example.com/foo/bar, in the archival tool for outgoing links on my website. It avoids the need for a date prefix like in the Wayback Machine while preventing duplication.
Speaking of .iso files. A recent PR [2] to my favorite Linux USB-disk-image burning tool Caligula has added support for detecting and verifying sidecar files like foo.iso.sha256 (albeit not Blake).
It doesn't really make much sense for BitTorrent uploads (BitTorrent already provides its own, much stronger hashes); it's a holdover from the era of IRC bots.