Git clone --depth 2 is vastly better than --depth 1 if you want to git push later
137 points | 27 days ago | 13 comments | stackoverflow.com
rafaelcosta
27 days ago
I'm wondering what the "because when we read it in, we mangle it" part really means... does this mean that there's no way to reference the commit (signaling that it's just a reference with no actual data) without actually reading its contents?

-- Update: just realized why it wouldn't make sense: `git push` would send only the delta from the previous commit, and the previous commit is... non-existent (we only know its ID), so we'd be back at square one (sending everything).

kruador
27 days ago
See my top-level response, but basically nothing is mangled. Instead Git internally treats it as a 'graft' and knows not to look for parents of the prior commit.

I started that comment as a reply to you but I realised that a) it may just have been a bug that might already be fixed and b) it looks like the Stack Overflow answer was speculative and not tested!

kruador
27 days ago
It isn't mangled. The commit is there as-is. Instead the repository has a file, ".git/shallow", which tells it not to look for the parents of any commit listed there. If you do a '--depth 1' clone, the file will list the single commit that was retrieved.

This is similar to the 'grafts' feature. Indeed 'git log' says 'grafted'.

You can test this using "git cat-file -p" with the commit that got retrieved, to print the raw object.

> git clone --depth 1 https://github.com/git/git

> git log

commit 388218fac77d0405a5083cd4b4ee20f6694609c3 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Junio C Hamano <gitster@pobox.com>
Date:   Mon Feb 10 10:18:17 2025 -0800

    The ninth batch

    Signed-off-by: Junio C Hamano <gitster@pobox.com>
> git cat-file -p 388218fac77d0405a5083cd4b4ee20f6694609c3

tree fc620998515e75437810cb1ba80e9b5173458d1c
parent 50e1821529fd0a096fe03f137eab143b31e8ef55
author Junio C Hamano <gitster@pobox.com> 1739211497 -0800
committer Junio C Hamano <gitster@pobox.com> 1739211512 -0800

The ninth batch

Signed-off-by: Junio C Hamano <gitster@pobox.com>
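
Note that there is still a "parent" line, even though the parent object was never downloaded. The graft point itself is just a plain-text list of commit IDs in ".git/shallow" (here, the single commit that was cloned):

> cat .git/shallow

388218fac77d0405a5083cd4b4ee20f6694609c3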

I can't reproduce the problem pushing to Bitbucket, using the most recent Git for Windows (2.47.1.windows.2). It only sent 3 objects (which would be the blob of the new file, the tree object containing the new file, and the commit object describing the tree), not the 6000+ in the repository I tested it on.

It may be that there was a bug that has now been fixed. Or it may be something that only happens/happened with GitHub (i.e. a bug at the receiving end, not the sending one!)

I note that the Stack Overflow user who wrote the answer left a comment underneath saying

"worth noting: I haven't tested this; it's just some simple applied math. One clone-and-push will tell you if I was right. :-)"

mg
27 days ago
Reading this again reminds me of how beautifully git uses the file system as a database, with everything laid out nicely in directories and files.

Except for performance, is there any downside to this?

In other words: When you store data in an application that only reads and writes data occasionally, is it a good idea to use the git approach and store it in files?

remram
27 days ago
Performance is one problem; concurrency is another (you'll need a separate locking and logging system to make it concurrent-safe and atomic). It can also be unwieldy to move around, and it will be broken by Dropbox-like apps that mark individual files as conflicted (rather than your whole database).
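
You can get atomicity for a single file with the write-then-rename trick git itself uses for lock files; a minimal sketch (file names made up for illustration), though it doesn't compose across multiple files:

  # write the new version to a temp file on the same filesystem,
  # then atomically replace the old one via rename(2)
  tmp=$(mktemp mydb/table.XXXXXX)
  printf '%s\n' "new contents" > "$tmp"
  mv -f "$tmp" mydb/table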
Ferret7446
26 days ago
Concurrency is not a big problem. The only concurrency issue with Git is that refs (and state like rebase/merge conflicts) are stored in "loose" files. This can be solved easily and elegantly the way jj does it: by putting the repo metadata into the object store too.

You can use a jj repo concurrently, e.g., over Dropbox with coworkers, and all it requires is a minor modification on top of the existing Git data model.

bennofs
27 days ago
One major downside is that it becomes really hard to do transactions, especially across multiple files. If, like git, you store mostly immutable data (every object except the refs is immutable; mutating creates a new object), it can work nicely.
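
You can see that model directly with git's plumbing: each write of new content stores a new, independent object, and the old one is never touched. For example:

  # store a blob; its ID is a hash of its content
  echo 'v1' | git hash-object -w --stdin
  # "mutating" it just stores a second blob under a different ID;
  # the 'v1' object still exists, unchanged
  echo 'v2' | git hash-object -w --stdin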
mg
27 days ago
Hmm... is the mutability of data really enough to create a need for transactions?

For example, here on HN (which afaik also stores its data in files) you can change a comment you wrote. But that type of mutability does not call for transactions, right?

t0mas88
27 days ago
Depends on the requirements. If you need concurrent access and things like durability across a crash, you're going to end up implementing something that looks a lot like a database.
mg
27 days ago
Do files really prevent concurrent access?

And does git not need crash durability?

chithanh
27 days ago
It's called flat file storage and it has a number of advantages and disadvantages. I prefer it because it is more robust and performs well, even for large amounts of data, as long as your filesystem is reasonably fast and reliable.

Think maildir vs. mbox/PST/etc. for message storage. I stopped counting the number of times that I have seen Outlook mangle its message database and require a rebuild.

Generally it is not so popular, in part because OSes like Windows and macOS have somewhat lacking filesystem implementations. Git also has performance issues with large repositories on Windows, which need to be worked around by various methods.

Transactions are another limitation, as mentioned by the other reply (they are possible to implement on top of "normal" filesystems, but not in a portable way).

necovek
27 days ago
I like the fact that none of this was tested, even if described with such authority :)

Anyone try it out yet?

(Not that I don't trust it, but I usually fetch the full history locally anyway)

kruador
27 days ago
I can't replicate the initial problem, at least pushing to Bitbucket. I'm using Windows, so I didn't use `touch` - instead I used 'echo' to create a new file in a shallow clone of my repo. That repo is 126 MB on Bitbucket, and the shallow clone downloaded 6395 objects taking 40.68 MB.

I've tried with a new file both having content ('Test shallow clone push'), and again with an empty file. In both cases it pushed 3 objects, and in the empty file case it reused one (it turns out my repo already has some empty files in it).
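
For reference, the steps were essentially these (repository URL and file name are placeholders):

  git clone --depth 1 https://bitbucket.org/example/repo.git
  cd repo
  echo Test shallow clone push > test-file.txt
  git add test-file.txt
  git commit -m "Test pushing from a shallow clone"
  git push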

It's always possible that this is (or was) a GitHub bug - I haven't tried it there.

jbreckmckye
27 days ago
Why can't git push, when it encounters a `.git/shallow`, just ask the git server to fill in the remaining history by verifying the parent hashes the client can send?
TachyonicBytes
27 days ago
It can, but that's another type of "shallow" (or, more exactly, "not-deep") cloning, called blobless cloning [1]. There is also treeless cloning, with other tradeoffs, but to much the same effect.
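
Concretely, the commands look like this (URL is a placeholder):

  # blobless: all commits and trees up front; file contents fetched on demand
  git clone --filter=blob:none https://example.com/repo.git

  # treeless: all commits up front; trees and blobs fetched on demand
  git clone --filter=tree:0 https://example.com/repo.git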

I found this[2] very enlightening.

[1] https://github.blog/open-source/git/get-up-to-speed-with-par...

[2] https://www.howtogeek.com/devops/how-to-use-git-shallow-clon...

Timwi
27 days ago
This seems like a bug to me. Even if the previous commit is “mangled” as they call it, there's no reason why you can't diff against it and only send the diff.
gbin
27 days ago
I believe this is because the generated unique id of that node is derived from the link to its parent: https://gist.github.com/masak/2415865

So the "where to attach it to the tree" info is effectively lost.
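
You can check that the parent is part of what gets hashed: re-hashing a commit's raw bytes (which include the "parent" line) reproduces the commit's own ID.

  # prints the ID of HEAD itself, because a commit's ID is just a hash
  # over its content - and that content includes the parent pointer
  git cat-file commit HEAD | git hash-object -t commit --stdin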

nopurpose
27 days ago
`git clone --filter blob:none` FTW
ksynwa
27 days ago
What does this do?
ChocolateGod
27 days ago
Have a look at this blog post; it explains that option really well, as well as other alternatives to shallow clones.

https://github.blog/open-source/git/get-up-to-speed-with-par...

pabs3
27 days ago
I use tree:0 instead.
edflsafoiewq
27 days ago
Do blobless clones suffer from this?
jakub_g
27 days ago
From my experience, I strongly recommend against blobless/treeless clones for local dev. They should only be used in CI, for throwaway build-and-forget scenarios.

When you have a blobless/treeless clone locally, it will fetch missing blobs on demand during random `git` operations, at the least expected moments, and it does this very inefficiently, one by one, which is super slow. (I also haven't found a way to go back from a blobless/treeless clone to a normal clone, i.e. to force-fetch all missing blobs efficiently.)

It's especially tricky when those `git` operations happen in the background, e.g. from a GUI (IDE extension etc.), and you don't get any feedback about what is happening.
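
(One possible escape hatch that I haven't verified: Git 2.36+ has `git fetch --refetch`, which fetches all objects the way a fresh clone would, so a sketch like this might backfill a partial clone:)

  # untested: drop the partial-clone filter, then refetch every object
  git config --unset remote.origin.partialclonefilter
  git fetch --refetch origin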

Some further reading:

- https://github.blog/open-source/git/get-up-to-speed-with-par...

- https://github.blog/open-source/git/git-clone-a-data-driven-...

nopurpose
27 days ago
Blobless is still better than shallow, because at least the commit history is preserved.
jakub_g
27 days ago
It might be fine for small repos, but for massive repos, blobless/treeless clones become unmanageable because many git operations become very slow. I added some further links.

From my side: with a non-trivially sized repo, on a local machine one should use either a shallow or a full clone (or, if possible, a sparse checkout, but that requires the repo to be compatible).

nopurpose
27 days ago
You are comparing something that is possible but can be slow (blobless) with something that is impossible (shallow), which is unfair.
cluckindan
27 days ago
Given that remote git repos are fundamentally incremental in nature, the limit of impracticality is infinite and eventually must equal impossibility.

There is a Moore's law analogue buried in there somewhere, in how fast repos grow relative to network and computing resources (an increase in which, of course, also makes repos grow faster).

wvh
27 days ago
That's a beautiful answer. Sometimes people explain something you already know, but different parts of your brain light up. This doesn't just explain git once more, but also plants some seeds related to hashed state optimisations in other, future challenges.
bradley13
27 days ago
Ok, I'm a simplistic Git user, but: I always do a full clone. Maybe (probably) I will never need all that history, but...maybe I will. Disk space is cheap.
emmelaich
27 days ago
Often I just want to have a look locally. It saves time, and I'm not very concerned about disk space. It's easy enough to deepen the clone later if wanted.
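
For example:

  # fetch 100 more commits' worth of history
  git fetch --deepen=100
  # or download the rest and convert to a full clone
  git fetch --unshallow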
koiueo
27 days ago
That is, until you have to clone a 10 GiB repo over shaky 5G while commuting on a train.
diggan
27 days ago
Yeah, or any project that is more than a decade old, where you don't want 10,000 commits from before 2023 on disk for almost no gain.
pabs3
27 days ago
For the use-case of downloading as little data as possible, I use this:

git clone --depth=1 --filter tree:0

zeristor
27 days ago
Should have the (2021) suffix
haunter
27 days ago
Wait, is this actually a bug?