1104 lines
46 KiB
Plaintext
1104 lines
46 KiB
Plaintext
Table of contents:
|
|
|
|
* Terminology
|
|
* Purpose of sparse-checkouts
|
|
* Usecases of primary concern
|
|
* Oversimplified mental models ("Cliff Notes" for this document!)
|
|
* Desired behavior
|
|
* Behavior classes
|
|
* Subcommand-dependent defaults
|
|
* Sparse specification vs. sparsity patterns
|
|
* Implementation Questions
|
|
* Implementation Goals/Plans
|
|
* Known bugs
|
|
* Reference Emails
|
|
|
|
|
|
=== Terminology ===
|
|
|
|
cone mode: one of two modes for specifying the desired subset of files
|
|
in a sparse-checkout. In cone-mode, the user specifies
|
|
directories (getting both everything under that directory as
|
|
well as everything in leading directories), while in non-cone
|
|
mode, the user specifies gitignore-style patterns. Controlled
|
|
by the --[no-]cone option to sparse-checkout init|set.
|
|
|
|
SKIP_WORKTREE: When tracked files do not match the sparse specification and
|
|
are removed from the working tree, the file in the index is marked
|
|
with a SKIP_WORKTREE bit. Note that if a tracked file has the
|
|
SKIP_WORKTREE bit set but the file is later written by the user to
|
|
the working tree anyway, the SKIP_WORKTREE bit will be cleared at
|
|
the beginning of any subsequent Git operation.
|
|
|
|
Most sparse checkout users are unaware of this implementation
|
|
detail, and the term should generally be avoided in user-facing
|
|
descriptions and command flags. Unfortunately, prior to the
|
|
`sparse-checkout` subcommand this low-level detail was exposed,
|
|
and as of time of writing, is still exposed in various places.
|
|
|
|
sparse-checkout: a subcommand in git used to reduce the files present in
|
|
the working tree to a subset of all tracked files. Also, the
|
|
name of the file in the $GIT_DIR/info directory used to track
|
|
the sparsity patterns corresponding to the user's desired
|
|
subset.
|
|
|
|
sparse cone: see cone mode
|
|
|
|
sparse directory: An entry in the index corresponding to a directory, which
|
|
appears in the index instead of all the files under that directory
|
|
that would normally appear. See also sparse-index. Something that
|
|
can cause confusion is that the "sparse directory" does NOT match
|
|
the sparse specification, i.e. the directory is NOT present in the
|
|
working tree. May be renamed in the future (e.g. to "skipped
|
|
directory").
|
|
|
|
sparse index: A special mode for sparse-checkout that also makes the
|
|
index sparse by recording a directory entry in lieu of all the
|
|
files underneath that directory (thus making that a "skipped
|
|
directory" which unfortunately has also been called a "sparse
|
|
directory"), and does this for potentially multiple
|
|
directories. Controlled by the --[no-]sparse-index option to
|
|
init|set|reapply.
|
|
|
|
sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
|
|
define the set of files of interest. A warning: It is easy to
|
|
over-use this term (or the shortened "patterns" term), for two
|
|
reasons: (1) users in cone mode specify directories rather than
|
|
patterns (their directories are transformed into patterns, but
|
|
users may think you are talking about non-cone mode if you use the
|
|
word "patterns"), and (b) the sparse specification might
|
|
transiently differ in the working tree or index from the sparsity
|
|
patterns (see "Sparse specification vs. sparsity patterns").
|
|
|
|
sparse specification: The set of paths in the user's area of focus. This
|
|
is typically just the tracked files that match the sparsity
|
|
patterns, but the sparse specification can temporarily differ and
|
|
include additional files. (See also "Sparse specification
|
|
vs. sparsity patterns")
|
|
|
|
* When working with history, the sparse specification is exactly
|
|
the set of files matching the sparsity patterns.
|
|
* When interacting with the working tree, the sparse specification
|
|
is the set of tracked files with a clear SKIP_WORKTREE bit or
|
|
tracked files present in the working copy.
|
|
* When modifying or showing results from the index, the sparse
|
|
specification is the set of files with a clear SKIP_WORKTREE bit
|
|
or that differ in the index from HEAD.
|
|
* If working with the index and the working copy, the sparse
|
|
specification is the union of the paths from above.
|
|
|
|
vivifying: When a command restores a tracked file to the working tree (and
|
|
hopefully also clears the SKIP_WORKTREE bit in the index for that
|
|
file), this is referred to as "vivifying" the file.
|
|
|
|
|
|
=== Purpose of sparse-checkouts ===
|
|
|
|
sparse-checkouts exist to allow users to work with a subset of their
|
|
files.
|
|
|
|
You can think of sparse-checkouts as subdividing "tracked" files into two
|
|
categories -- a sparse subset, and all the rest. Implementationally, we
|
|
mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
|
|
out of the working tree. The SKIP_WORKTREE files are still tracked, just
|
|
not present in the working tree.
|
|
|
|
In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
|
|
is missing from the working tree but pretend the file contents match HEAD".
|
|
That was not only bogus (it actually meant the file missing from the
|
|
working tree matched the index rather than HEAD), but it was also a
|
|
low-level detail which only provided decent behavior for a few commands.
|
|
There were a surprising number of ways in which that guiding principle gave
|
|
command results that violated user expectations, and as such was a bad
|
|
mental model. However, it persisted for many years and may still be found
|
|
in some corners of the code base.
|
|
|
|
Anyway, the idea of "working with a subset of files" is simple enough, but
|
|
there are multiple different high-level usecases which affect how some Git
|
|
subcommands should behave. Further, even if we only considered one of
|
|
those usecases, sparse-checkouts can modify different subcommands in over a
|
|
half dozen different ways. Let's start by considering the high level
|
|
usecases:
|
|
|
|
A) Users are _only_ interested in the sparse portion of the repo
|
|
|
|
A*) Users are _only_ interested in the sparse portion of the repo
|
|
that they have downloaded so far
|
|
|
|
B) Users want a sparse working tree, but are working in a larger whole
|
|
|
|
C) sparse-checkout is a behind-the-scenes implementation detail allowing
|
|
Git to work with a specially crafted in-house virtual file system;
|
|
users are actually working with a "full" working tree that is
|
|
lazily populated, and sparse-checkout helps with the lazy population
|
|
piece.
|
|
|
|
It may be worth explaining each of these in a bit more detail:
|
|
|
|
|
|
(Behavior A) Users are _only_ interested in the sparse portion of the repo
|
|
|
|
These folks might know there are other things in the repository, but
|
|
don't care. They are uninterested in other parts of the repository, and
|
|
only want to know about changes within their area of interest. Showing
|
|
them other files from history (e.g. from diff/log/grep/etc.) is a
|
|
usability annoyance, potentially a huge one since other changes in
|
|
history may dwarf the changes they are interested in.
|
|
|
|
Some of these users also arrive at this usecase from wanting to use partial
|
|
clones together with sparse checkouts (in a way where they have downloaded
|
|
blobs within the sparse specification) and do disconnected development.
|
|
Not only do these users generally not care about other parts of the
|
|
repository, but consider it a blocker for Git commands to try to operate on
|
|
those. If commands attempt to access paths in history outside the sparsity
|
|
specification, then the partial clone will attempt to download additional
|
|
blobs on demand, fail, and then fail the user's command. (This may be
|
|
unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
|
|
reconcile outside the sparse specification, but we should limit how often
|
|
users are forced to connect to the network.)
|
|
|
|
Also, even for users using partial clones that do not mind being
|
|
always connected to the network, the need to download blobs as
|
|
side-effects of various other commands (such as the printed diffstat
|
|
after a merge or pull) can lead to worries about local repository size
|
|
growing unnecessarily[10].
|
|
|
|
(Behavior A*) Users are _only_ interested in the sparse portion of the repo
|
|
that they have downloaded so far (a variant on the first usecase)
|
|
|
|
This variant is driven by folks who using partial clones together with
|
|
sparse checkouts and do disconnected development (so far sounding like a
|
|
subset of behavior A users) and doing so on very large repositories. The
|
|
reason for yet another variant is that downloading even just the blobs
|
|
through history within their sparse specification may be too much, so they
|
|
only download some. They would still like operations to succeed without
|
|
network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
|
|
or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
|
|
partial results that depend on what happens to have been downloaded.
|
|
|
|
This variant could be viewed as Behavior A with the sparse specification
|
|
for history querying operations modified from "sparsity patterns" to
|
|
"sparsity patterns limited to the blobs we have already downloaded".
|
|
|
|
(Behavior B) Users want a sparse working tree, but are working in a
|
|
larger whole
|
|
|
|
Stolee described this usecase this way[11]:
|
|
|
|
"I'm also focused on users that know that they are a part of a larger
|
|
whole. They know they are operating on a large repository but focus on
|
|
what they need to contribute their part. I expect multiple "roles" to
|
|
use very different, almost disjoint parts of the codebase. Some other
|
|
"architect" users operate across the entire tree or hop between different
|
|
sections of the codebase as necessary. In this situation, I'm wary of
|
|
scoping too many features to the sparse-checkout definition, especially
|
|
"git log," as it can be too confusing to have their view of the codebase
|
|
depend on your "point of view."
|
|
|
|
People might also end up wanting behavior B due to complex inter-project
|
|
dependencies. The initial attempts to use sparse-checkouts usually involve
|
|
the directories you are directly interested in plus what those directories
|
|
depend upon within your repository. But there's a monkey wrench here: if
|
|
you have integration tests, they invert the hierarchy: to run integration
|
|
tests, you need not only what you are interested in and its in-tree
|
|
dependencies, you also need everything that depends upon what you are
|
|
interested in or that depends upon one of your dependencies...AND you need
|
|
all the in-tree dependencies of that expanded group. That can easily
|
|
change your sparse-checkout into a nearly dense one.
|
|
|
|
Naturally, that tends to kill the benefits of sparse-checkouts. There are
|
|
a couple solutions to this conundrum: either avoid grabbing in-repo
|
|
dependencies (maybe have built versions of your in-repo dependencies pulled
|
|
from a CI cache somewhere), or say that users shouldn't run integration
|
|
tests directly and instead do it on the CI server when they submit a code
|
|
review. Or do both. Regardless of whether you stub out your in-repo
|
|
dependencies or stub out the things that depend upon you, there is
|
|
certainly a reason to want to query and be aware of those other stubbed-out
|
|
parts of the repository, particularly when the dependencies are complex or
|
|
change relatively frequently. Thus, for such uses, sparse-checkouts can be
|
|
used to limit what you directly build and modify, but these users do not
|
|
necessarily want their sparse checkout paths to limit their queries of
|
|
versions in history.
|
|
|
|
Some people may also be interested in behavior B over behavior A simply as
|
|
a performance workaround: if they are using non-cone mode, then they have
|
|
to deal with its inherent quadratic performance problems. In that mode,
|
|
every operation that checks whether paths match the sparsity specification
|
|
can be expensive. As such, these users may only be willing to pay for
|
|
those expensive checks when interacting with the working copy, and may
|
|
prefer getting "unrelated" results from their history queries over having
|
|
slow commands.
|
|
|
|
(Behavior C) sparse-checkout is an implementational detail supporting a
|
|
special VFS.
|
|
|
|
This usecase goes slightly against the traditional definition of
|
|
sparse-checkout in that it actually tries to present a full or dense
|
|
checkout to the user. However, this usecase utilizes the same underlying
|
|
technical underpinnings in a new way which does provide some performance
|
|
advantages to users. The basic idea is that a company can have an in-house
|
|
Git-aware Virtual File System which pretends all files are present in the
|
|
working tree, by intercepting all file system accesses and using those to
|
|
fetch and write accessed files on demand via partial clones. The VFS uses
|
|
sparse-checkout to prevent Git from writing or paying attention to many
|
|
files, and manually updates the sparse checkout patterns itself based on
|
|
user access and modification of files in the working tree. See commit
|
|
ecc7c8841d ("repo_read_index: add config to expect files outside sparse
|
|
patterns", 2022-02-25) and the link at [17] for a more detailed description
|
|
of such a VFS.
|
|
|
|
The biggest difference here is that users are completely unaware that the
|
|
sparse-checkout machinery is even in use. The sparse patterns are not
|
|
specified by the user but rather are under the complete control of the VFS
|
|
(and the patterns are updated frequently and dynamically by it). The user
|
|
will perceive the checkout as dense, and commands should thus behave as if
|
|
all files are present.
|
|
|
|
|
|
=== Usecases of primary concern ===
|
|
|
|
Most of the rest of this document will focus on Behavior A and Behavior
|
|
B. Some notes about the other two cases and why we are not focusing on
|
|
them:
|
|
|
|
(Behavior A*)
|
|
|
|
Supporting this usecase is estimated to be difficult and a lot of work.
|
|
There are no plans to implement it currently, but it may be a potential
|
|
future alternative. Knowing about the existence of additional alternatives
|
|
may affect our choice of command line flags (e.g. if we need tri-state or
|
|
quad-state flags rather than just binary flags), so it was still important
|
|
to at least note.
|
|
|
|
Further, I believe the descriptions below for Behavior A are probably still
|
|
valid for this usecase, with the only exception being that it redefines the
|
|
sparse specification to restrict it to already-downloaded blobs. The hard
|
|
part is in making commands capable of respecting that modified definition.
|
|
|
|
(Behavior C)
|
|
|
|
This usecase violates some of the early sparse-checkout documented
|
|
assumptions (since files marked as SKIP_WORKTREE will be displayed to users
|
|
as present in the working tree). That violation may mean various
|
|
sparse-checkout related behaviors are not well suited to this usecase and
|
|
we may need tweaks -- to both documentation and code -- to handle it.
|
|
However, this usecase is also perhaps the simplest model to support in that
|
|
everything behaves like a dense checkout with a few exceptions (e.g. branch
|
|
checkouts and switches write fewer things, knowing the VFS will lazily
|
|
write the rest on an as-needed basis).
|
|
|
|
Since there is no publicly available VFS-related code for folks to try,
|
|
the number of folks who can test such a usecase is limited.
|
|
|
|
The primary reason to note the Behavior C usecase is that as we fix things
|
|
to better support Behaviors A and B, there may be additional places where
|
|
we need to make tweaks allowing folks in this usecase to get the original
|
|
non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
|
|
add config to expect files outside sparse patterns", 2022-02-25). The
|
|
secondary reason to note Behavior C, is so that folks taking advantage of
|
|
Behavior C do not assume they are part of the Behavior B camp and propose
|
|
patches that break things for the real Behavior B folks.
|
|
|
|
|
|
=== Oversimplified mental models ===
|
|
|
|
An oversimplification of the differences in the above behaviors is:
|
|
|
|
Behavior A: Restrict worktree and history operations to sparse specification
|
|
Behavior B: Restrict worktree operations to sparse specification; have any
|
|
history operations work across all files
|
|
Behavior C: Do not restrict either worktree or history operations to the
|
|
sparse specification...with the exception of branch checkouts or
|
|
switches which avoid writing files that will match the index so
|
|
they can later lazily be populated instead.
|
|
|
|
|
|
=== Desired behavior ===
|
|
|
|
As noted previously, despite the simple idea of just working with a subset
|
|
of files, there are a range of different behavioral changes that need to be
|
|
made to different subcommands to work well with such a feature. See
|
|
[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
|
|
that mere composition of other commands that individually worked correctly
|
|
in a sparse-checkout context did not imply that the higher level command
|
|
would work correctly; it sometimes requires further tweaks. So,
|
|
understanding these differences can be beneficial.
|
|
|
|
* Commands behaving the same regardless of high-level use-case
|
|
|
|
* commands that only look at files within the sparsity specification
|
|
|
|
* diff (without --cached or REVISION arguments)
|
|
* grep (without --cached or REVISION arguments)
|
|
* diff-files
|
|
|
|
* commands that restore files to the working tree that match sparsity
|
|
patterns, and remove unmodified files that don't match those
|
|
patterns:
|
|
|
|
* switch
|
|
* checkout (the switch-like half)
|
|
* read-tree
|
|
* reset --hard
|
|
|
|
* commands that write conflicted files to the working tree, but otherwise
|
|
will omit writing files to the working tree that do not match the
|
|
sparsity patterns:
|
|
|
|
* merge
|
|
* rebase
|
|
* cherry-pick
|
|
* revert
|
|
|
|
* `am` and `apply --cached` should probably be in this section but
|
|
are buggy (see the "Known bugs" section below)
|
|
|
|
The behavior for these commands somewhat depends upon the merge
|
|
strategy being used:
|
|
* `ort` behaves as described above
|
|
* `recursive` tries to not vivify files unnecessarily, but does sometimes
|
|
vivify files without conflicts.
|
|
* `octopus` and `resolve` will always vivify any file changed in the merge
|
|
relative to the first parent, which is rather suboptimal.
|
|
|
|
It is also important to note that these commands WILL update the index
|
|
outside the sparse specification relative to when the operation began,
|
|
BUT these commands often make a commit just before or after such that
|
|
by the end of the operation there is no change to the index outside the
|
|
sparse specification. Of course, if the operation hits conflicts or
|
|
does not make a commit, then these operations clearly can modify the
|
|
index outside the sparse specification.
|
|
|
|
Finally, it is important to note that at least the first four of these
|
|
commands also try to remove differences between the sparse
|
|
specification and the sparsity patterns (much like the commands in the
|
|
previous section).
|
|
|
|
* commands that always ignore sparsity since commits must be full-tree
|
|
|
|
* archive
|
|
* bundle
|
|
* commit
|
|
* format-patch
|
|
* fast-export
|
|
* fast-import
|
|
* commit-tree
|
|
|
|
* commands that write any modified file to the working tree (conflicted
|
|
or not, and whether those paths match sparsity patterns or not):
|
|
|
|
* stash
|
|
* apply (without `--index` or `--cached`)
|
|
|
|
* Commands that may slightly differ for behavior A vs. behavior B:
|
|
|
|
Commands in this category behave mostly the same between the two
|
|
behaviors, but may differ in verbosity and types of warning and error
|
|
messages.
|
|
|
|
* commands that make modifications to which files are tracked:
|
|
* add
|
|
* rm
|
|
* mv
|
|
* update-index
|
|
|
|
The fact that files can move between the 'tracked' and 'untracked'
|
|
categories means some commands will have to treat untracked files
|
|
differently. But if we have to treat untracked files differently,
|
|
then additional commands may also need changes:
|
|
|
|
* status
|
|
* clean
|
|
|
|
In particular, `status` may need to report any untracked files outside
|
|
the sparsity specification as an erroneous condition (especially to
|
|
avoid the user trying to `git add` them, forcing `git add` to display
|
|
an error).
|
|
|
|
It's not clear to me exactly how (or even if) `clean` would change,
|
|
but it's the other command that also affects untracked files.
|
|
|
|
`update-index` may be slightly special. Its --[no-]skip-worktree flag
|
|
may need to ignore the sparse specification by its nature. Also, its
|
|
current --[no-]ignore-skip-worktree-entries default is totally bogus.
|
|
|
|
* commands for manually tweaking paths in both the index and the working tree
|
|
* `restore`
|
|
* the restore-like half of `checkout`
|
|
|
|
These commands should be similar to add/rm/mv in that they should
|
|
only operate on the sparse specification by default, and require a
|
|
special flag to operate on all files.
|
|
|
|
Also, note that these commands currently have a number of issues (see
|
|
the "Known bugs" section below)
|
|
|
|
* Commands that significantly differ for behavior A vs. behavior B:
|
|
|
|
* commands that query history
|
|
* diff (with --cached or REVISION arguments)
|
|
* grep (with --cached or REVISION arguments)
|
|
* show (when given commit arguments)
|
|
* blame (only matters when one or more -C flags are passed)
|
|
* and annotate
|
|
* log
|
|
* whatchanged
|
|
* ls-files
|
|
* diff-index
|
|
* diff-tree
|
|
* ls-tree
|
|
|
|
Note: for log and whatchanged, revision walking logic is unaffected
|
|
but displaying of patches is affected by scoping the command to the
|
|
sparse-checkout. (The fact that revision walking is unaffected is
|
|
why rev-list, shortlog, show-branch, and bisect are not in this
|
|
list.)
|
|
|
|
ls-files may be slightly special in that e.g. `git ls-files -t` is
|
|
often used to see what is sparse and what is not. Perhaps -t should
|
|
always work on the full tree?
|
|
|
|
* Commands I don't know how to classify
|
|
|
|
* range-diff
|
|
|
|
Is this like `log` or `format-patch`?
|
|
|
|
* cherry
|
|
|
|
See range-diff
|
|
|
|
* Commands unaffected by sparse-checkouts
|
|
|
|
* shortlog
|
|
* show-branch
|
|
* rev-list
|
|
* bisect
|
|
|
|
* branch
|
|
* describe
|
|
* fetch
|
|
* gc
|
|
* init
|
|
* maintenance
|
|
* notes
|
|
* pull (merge & rebase have the necessary changes)
|
|
* push
|
|
* submodule
|
|
* tag
|
|
|
|
* config
|
|
* filter-branch (works in separate checkout without sparse-checkout setup)
|
|
* pack-refs
|
|
* prune
|
|
* remote
|
|
* repack
|
|
* replace
|
|
|
|
* bugreport
|
|
* count-objects
|
|
* fsck
|
|
* gitweb
|
|
* help
|
|
* instaweb
|
|
* merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
|
|
* rerere
|
|
* verify-commit
|
|
* verify-tag
|
|
|
|
* commit-graph
|
|
* hash-object
|
|
* index-pack
|
|
* mktag
|
|
* mktree
|
|
* multi-pack-index
|
|
* pack-objects
|
|
* prune-packed
|
|
* symbolic-ref
|
|
* unpack-objects
|
|
* update-ref
|
|
* write-tree (operates on index, possibly optimized to use sparse dir entries)
|
|
|
|
* for-each-ref
|
|
* get-tar-commit-id
|
|
* ls-remote
|
|
* merge-base (merges are computed full tree, so merge base should be too)
|
|
* name-rev
|
|
* pack-redundant
|
|
* rev-parse
|
|
* show-index
|
|
* show-ref
|
|
* unpack-file
|
|
* var
|
|
* verify-pack
|
|
|
|
* <Everything under 'Interacting with Others' in 'git help --all'>
|
|
* <Everything under 'Low-level...Syncing' in 'git help --all'>
|
|
* <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
|
|
* <Everything under 'External commands' in 'git help --all'>
|
|
|
|
* Commands that might be affected, but who cares?
|
|
|
|
* merge-file
|
|
* merge-index
|
|
* gitk?
|
|
|
|
|
|
=== Behavior classes ===
|
|
|
|
From the above there are a few classes of behavior:
|
|
|
|
* "restrict"
|
|
|
|
Commands in this class only read or write files in the working tree
|
|
within the sparse specification.
|
|
|
|
When moving to a new commit (e.g. switch, reset --hard), these commands
|
|
may update index files outside the sparse specification as of the start
|
|
of the operation, but by the end of the operation those index files
|
|
will match HEAD again and thus those files will again be outside the
|
|
sparse specification.
|
|
|
|
When paths are explicitly specified, these paths are intersected with
|
|
the sparse specification and will only operate on such paths.
|
|
(e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
|
|
|
|
Some of these commands may also attempt, at the end of their operation,
|
|
to cull transient differences between the sparse specification and the
|
|
sparsity patterns (see "Sparse specification vs. sparsity patterns" for
|
|
details, but this basically means either removing unmodified files not
|
|
matching the sparsity patterns and marking those files as
|
|
SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
|
|
marking those files as !SKIP_WORKTREE).
|
|
|
|
* "restrict modulo conflicts"
|
|
|
|
Commands in this class generally behave like the "restrict" class,
|
|
except that:
|
|
(1) they will ignore the sparse specification and write files with
|
|
conflicts to the working tree (thus temporarily expanding the
|
|
sparse specification to include such files.)
|
|
(2) they are grouped with commands which move to a new commit, since
|
|
they often create a commit and then move to it, even though we
|
|
know there are many exceptions to moving to the new commit. (For
|
|
example, the user may rebase a commit that becomes empty, or have
|
|
a cherry-pick which conflicts, or a user could run `merge
|
|
--no-commit`, and we also view `apply --index` kind of like `am
|
|
--no-commit`.) As such, these commands can make changes to index
|
|
files outside the sparse specification, though they'll mark such
|
|
files with SKIP_WORKTREE.
|
|
|
|
* "restrict also specially applied to untracked files"
|
|
|
|
Commands in this class generally behave like the "restrict" class,
|
|
except that they have to handle untracked files differently too, often
|
|
because these commands are dealing with files changing state between
|
|
'tracked' and 'untracked'. Often, this may mean printing an error
|
|
message if the command had nothing to do, but the arguments may have
|
|
referred to files whose tracked-ness state could have changed were it
|
|
not for the sparsity patterns excluding them.
|
|
|
|
* "no restrict"
|
|
|
|
Commands in this class ignore the sparse specification entirely.
|
|
|
|
* "restrict or no restrict dependent upon behavior A vs. behavior B"
|
|
|
|
Commands in this class behave like "no restrict" for folks in the
|
|
behavior B camp, and like "restrict" for folks in the behavior A camp.
|
|
However, when behaving like "restrict" a warning of some sort might be
|
|
provided that history queries have been limited by the sparse-checkout
|
|
specification.
|
|
|
|
|
|
=== Subcommand-dependent defaults ===
|
|
|
|
Note that we have different defaults depending on the command for the
|
|
desired behavior :
|
|
|
|
* Commands defaulting to "restrict":
|
|
* diff-files
|
|
* diff (without --cached or REVISION arguments)
|
|
* grep (without --cached or REVISION arguments)
|
|
* switch
|
|
* checkout (the switch-like half)
|
|
* reset (<commit>)
|
|
|
|
* restore
|
|
* checkout (the restore-like half)
|
|
* checkout-index
|
|
* reset (with pathspec)
|
|
|
|
This behavior makes sense; these interact with the working tree.
|
|
|
|
* Commands defaulting to "restrict modulo conflicts":
|
|
* merge
|
|
* rebase
|
|
* cherry-pick
|
|
* revert
|
|
|
|
* am
|
|
* apply --index (which is kind of like an `am --no-commit`)
|
|
|
|
* read-tree (especially with -m or -u; is kind of like a --no-commit merge)
|
|
* reset (<tree-ish>, due to similarity to read-tree)
|
|
|
|
These also interact with the working tree, but require slightly
|
|
different behavior either so that (a) conflicts can be resolved or (b)
|
|
because they are kind of like a merge-without-commit operation.
|
|
|
|
(See also the "Known bugs" section below regarding `am` and `apply`)
|
|
|
|
* Commands defaulting to "no restrict":
|
|
* archive
|
|
* bundle
|
|
* commit
|
|
* format-patch
|
|
* fast-export
|
|
* fast-import
|
|
* commit-tree
|
|
|
|
* stash
|
|
* apply (without `--index`)
|
|
|
|
These have completely different defaults and perhaps deserve the most
|
|
detailed explanation:
|
|
|
|
In the case of commands in the first group (format-patch,
|
|
fast-export, bundle, archive, etc.), these are commands for
|
|
communicating history, which will be broken if they restrict to a
|
|
subset of the repository. As such, they operate on full paths and
|
|
have no `--restrict` option for overriding. Some of these commands may
|
|
take paths for manually restricting what is exported, but it needs to
|
|
be very explicit.
|
|
|
|
In the case of stash, it needs to vivify files to avoid losing the
|
|
user's changes.
|
|
|
|
In the case of apply without `--index`, that command needs to update
|
|
the working tree without the index (or the index without the working
|
|
tree if `--cached` is passed), and if we restrict those updates to the
|
|
sparse specification then we'll lose changes from the user.
|
|
|
|
* Commands defaulting to "restrict also specially applied to untracked files":
|
|
* add
|
|
* rm
|
|
* mv
|
|
* update-index
|
|
* status
|
|
* clean (?)
|
|
|
|
Our original implementation for the first three of these commands was
|
|
"no restrict", but it had some severe usability issues:
|
|
* `git add <somefile>` if honored and outside the sparse
|
|
specification, can result in the file randomly disappearing later
|
|
when some subsequent command is run (since various commands
|
|
automatically clean up unmodified files outside the sparse
|
|
specification).
|
|
* `git rm '*.jpg'` could very negatively surprise users if it deletes
|
|
files outside the range of the user's interest.
|
|
* `git mv` has similar surprises when moving into or out of the cone,
|
|
so best to restrict by default
|
|
|
|
So, we switched `add` and `rm` to default to "restrict", which made
|
|
usability problems much less severe and less frequent, but we still got
|
|
complaints because commands like:
|
|
git add <file-outside-sparse-specification>
|
|
git rm <file-outside-sparse-specification>
|
|
would silently do nothing. We should instead print an error in those
|
|
cases to get usability right.
|
|
|
|
update-index needs to be updated to match, and status and maybe clean
|
|
also need to be updated to specially handle untracked paths.
|
|
|
|
There may be a difference in here between behavior A and behavior B in
|
|
terms of verboseness of errors or additional warnings.
|
|
|
|
* Commands falling under "restrict or no restrict dependent upon behavior
|
|
A vs. behavior B"
|
|
|
|
* diff (with --cached or REVISION arguments)
|
|
* grep (with --cached or REVISION arguments)
|
|
* show (when given commit arguments)
|
|
* blame (only matters when one or more -C flags passed)
|
|
* and annotate
|
|
* log
|
|
* and variants: shortlog, gitk, show-branch, whatchanged, rev-list
|
|
* ls-files
|
|
* diff-index
|
|
* diff-tree
|
|
* ls-tree
|
|
|
|
For now, we default to behavior B for these, which want a default of
|
|
"no restrict".
|
|
|
|
Note that two of these commands -- diff and grep -- also appeared in a
|
|
different list with a default of "restrict", but only when limited to
|
|
searching the working tree. The working tree vs. history distinction
|
|
is fundamental in how behavior B operates, so this is expected. Note,
|
|
though, that for diff and grep with --cached, when doing "restrict"
|
|
behavior, the difference between sparse specification and sparsity
|
|
patterns is important to handle.
|
|
|
|
"restrict" may make more sense as the long term default for these[12].
|
|
Also, supporting "restrict" for these commands might be a fair amount
|
|
of work to implement, meaning it might be implemented over multiple
|
|
releases. If that behavior were the default in the commands that
|
|
supported it, that would force behavior B users to need to learn to
|
|
slowly add additional flags to their commands, depending on git
|
|
version, to get the behavior they want. That gradual switchover would
|
|
be painful, so we should avoid it at least until it's fully
|
|
implemented.
|
|
|
|
|
|
=== Sparse specification vs. sparsity patterns ===
|
|
|
|
In a well-behaved situation, the sparse specification is given directly
|
|
by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
|
|
diverge for a few reasons:
|
|
|
|
* needing to resolve conflicts (merging will vivify conflicted files)
|
|
* running Git commands that implicitly vivify files (e.g. "git stash apply")
|
|
* running Git commands that explicitly vivify files (e.g. "git checkout
|
|
--ignore-skip-worktree-bits FILENAME")
|
|
* other commands that write to these files (perhaps a user copies it
|
|
from elsewhere)
|
|
|
|
For the last item, note that we do automatically clear the SKIP_WORKTREE
|
|
bit for files that are present in the working tree. This has been true
|
|
since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
|
|
2022-03-09)
|
|
|
|
However, such a situation is transient because:
|
|
|
|
* Such transient differences can and will be automatically removed as
|
|
a side-effect of commands which call unpack_trees() (checkout,
|
|
merge, reset, etc.).
|
|
* Users can also request such transient differences be corrected via
|
|
running `git sparse-checkout reapply`. Various places recommend
|
|
running that command.
|
|
* Additional commands are also welcome to implicitly fix these
|
|
differences; we may add more in the future.
|
|
|
|
While we avoid dropping unstaged changes or files which have conflicts,
|
|
we otherwise aggressively try to fix these transient differences. If
|
|
users want these differences to persist, they should run the `set` or
|
|
`add` subcommands of `git sparse-checkout` to reflect their intended
|
|
sparse specification.
|
|
|
|
However, when we need to do a query on history restricted to the
|
|
"relevant subset of files" such a transiently expanded sparse
|
|
specification is ignored. There are a couple reasons for this:
|
|
|
|
* The behavior wanted when doing something like
|
|
git grep expression REVISION
|
|
is roughly what the users would expect from
|
|
git checkout REVISION && git grep expression
|
|
(modulo a "REVISION:" prefix), which has a couple ramifications:
|
|
|
|
* REVISION may have paths not in the current index, so there is no
|
|
path we can consult for a SKIP_WORKTREE setting for those paths.
|
|
|
|
* Since `checkout` is one of those commands that tries to remove
|
|
transient differences in the sparse specification, it makes sense
|
|
to use the corrected sparse specification
|
|
(i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
|
|
consult SKIP_WORKTREE anyway.
|
|
|
|
So, a transiently expanded (or restricted) sparse specification applies to
|
|
the working tree, but not to history queries where we always use the
|
|
sparsity patterns. (See [16] for an early discussion of this.)
|
|
|
|
Similar to a transiently expanded sparse specification of the working tree
|
|
based on additional files being present in the working tree, we also need
|
|
to consider additional files being modified in the index. In particular,
|
|
if the user has staged changes to files (relative to HEAD) that do not
|
|
match the sparsity patterns, and the file is not present in the working
|
|
tree, we still want to consider the file part of the sparse specification
|
|
if we are specifically performing a query related to the index (e.g. git
|
|
diff --cached [REVISION], git diff-index [REVISION], git restore --staged
|
|
--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse
|
|
specification for the index usually only matters under behavior A, since
|
|
under behavior B index operations are lumped with history and tend to
|
|
operate full-tree.
|
|
|
|
|
|
=== Implementation Questions ===
|
|
|
|
* Do the options --scope={sparse,all} sound good to others? Are there better
|
|
options?
|
|
* Names in use, or appearing in patches, or previously suggested:
|
|
* --sparse/--dense
|
|
* --ignore-skip-worktree-bits
|
|
* --ignore-skip-worktree-entries
|
|
* --ignore-sparsity
|
|
* --[no-]restrict-to-sparse-paths
|
|
* --full-tree/--sparse-tree
|
|
* --[no-]restrict
|
|
* --scope={sparse,all}
|
|
* --focus/--unfocus
|
|
* --limit/--unlimited
|
|
* Rationale making me lean slightly towards --scope={sparse,all}:
|
|
* We want a name that works for many commands, so we need a name that
|
|
does not conflict
|
|
* We know that we have more than two possible usecases, so it is best
|
|
to avoid a flag that appears to be binary.
|
|
* --scope={sparse,all} isn't overly long and seems relatively
|
|
explanatory
|
|
* `--sparse`, as used in add/rm/mv, is totally backwards for
|
|
grep/log/etc. Changing the meaning of `--sparse` for these
|
|
commands would fix the backwardness, but possibly break existing
|
|
scripts. Using a new name pairing would allow us to treat
|
|
`--sparse` in these commands as a deprecated alias.
|
|
* There is a different `--sparse`/`--dense` pair for commands using
|
|
revision machinery, so using that naming might cause confusion
|
|
* There is also a `--sparse` in both pack-objects and show-branch, which
|
|
don't conflict but do suggest that `--sparse` is overloaded
|
|
* The name --ignore-skip-worktree-bits is a double negative, is
|
|
quite a mouthful, refers to an implementation detail that many
|
|
users may not be familiar with, and we'd need a negation for it
|
|
which would probably be even more ridiculously long. (But we
|
|
can make --ignore-skip-worktree-bits a deprecated alias for
|
|
--no-restrict.)
|
|
|
|
* If a config option is added (sparse.scope?) what should the values and
|
|
description be? "sparse" (behavior A), "worktree-sparse-history-dense"
|
|
(behavior B), "dense" (behavior C)? There's a risk of confusion,
|
|
because even for Behaviors A and B we want some commands to be
|
|
full-tree and others to operate sparsely, so the wording may need to be
|
|
more tied to the usecases and somehow explain that. Also, right now,
|
|
the primary difference we are focusing is just the history-querying
|
|
commands (log/diff/grep). Previous config suggestion here: [13]
|
|
|
|
* Is `--no-expand` a good alias for ls-files's `--sparse` option?
|
|
(`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
|
|
because in non-cone mode it does nothing and in cone-mode it shows the
|
|
sparse directory entries which are technically outside the sparse
|
|
specification)
|
|
|
|
* Under Behavior A:
|
|
* Does ls-files' `--no-expand` override the default `--scope=all`, or
|
|
does it need an extra flag?
|
|
* Does ls-files' `-t` option imply `--scope=all`?
|
|
* Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
|
|
|
|
* sparse-checkout: once behavior A is fully implemented, should we take
|
|
an interim measure to ease people into switching the default? Namely,
|
|
if folks are not already in a sparse checkout, then require
|
|
`sparse-checkout init/set` to take a
|
|
`--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
|
|
would set sparse.scope according to the setting given), and throw an
|
|
error if the flag is not provided? That error would be a great place
|
|
to warn folks that the default may change in the future, and get them
|
|
used to specifying what they want so that the eventual default switch
|
|
is seamless for them.
|
|
|
|
|
|
=== Implementation Goals/Plans ===
|
|
|
|
* Get buy-in on this document in general.
|
|
|
|
* Figure out answers to the 'Implementation Questions' sections (above)
|
|
|
|
* Fix bugs in the 'Known bugs' section (below)
|
|
|
|
* Provide some kind of method for backfilling the blobs within the sparse
|
|
specification in a partial clone
|
|
|
|
[Below here is kind of spitballing since the first two haven't been resolved]
|
|
|
|
* update-index: flip the default to --no-ignore-skip-worktree-entries,
|
|
nuke this stupid "Oh, there's a bug? Let me add a flag to let users
|
|
request that they not trigger this bug." flag
|
|
|
|
* Flags & Config
|
|
* Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
|
|
* Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
|
|
a deprecated aliases for `--scope=all`
|
|
* Create config option (sparse.scope?), tie it to the "Cliff notes"
|
|
overview
|
|
|
|
* Add --scope=sparse (and --scope=all) flag to each of the history querying
|
|
commands. IMPORTANT: make sure diff machinery changes don't mess with
|
|
format-patch, fast-export, etc.
|
|
|
|
=== Known bugs ===
|
|
|
|
This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
|
|
been working on it.
|
|
|
|
0. Behavior A is not well supported in Git. (Behavior B didn't used to
|
|
be either, but was the easier of the two to implement.)
|
|
|
|
1. am and apply:
|
|
|
|
apply, without `--index` or `--cached`, relies on files being present
|
|
in the working copy, and also writes to them unconditionally. As
|
|
such, it should first check for the files' presence, and if found to
|
|
be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
|
|
its work. Currently, it just throws an error.
|
|
|
|
apply, with either `--cached` or `--index`, will not preserve the
|
|
SKIP_WORKTREE bit. This is fine if the file has conflicts, but
|
|
otherwise SKIP_WORKTREE bits should be preserved for --cached and
|
|
probably also for --index.
|
|
|
|
am, if there are no conflicts, will vivify files and fail to preserve
|
|
the SKIP_WORKTREE bit. If there are conflicts and `-3` is not
|
|
specified, it will vivify files and then complain the patch doesn't
|
|
apply. If there are conflicts and `-3` is specified, it will vivify
|
|
files and then complain that those vivified files would be
|
|
overwritten by merge.
|
|
|
|
2. reset --hard:
|
|
|
|
reset --hard provides confusing error message (works correctly, but
|
|
misleads the user into believing it didn't):
|
|
|
|
$ touch addme
|
|
$ git add addme
|
|
$ git ls-files -t
|
|
H addme
|
|
H tracked
|
|
S tracked-but-maybe-skipped
|
|
$ git reset --hard # usually works great
|
|
error: Path 'addme' not uptodate; will not remove from working tree.
|
|
HEAD is now at bdbbb6f third
|
|
$ git ls-files -t
|
|
H tracked
|
|
S tracked-but-maybe-skipped
|
|
$ ls -1
|
|
tracked
|
|
|
|
`git reset --hard` DID remove addme from the index and the working tree, contrary
|
|
to the error message, but in line with how reset --hard should behave.
|
|
|
|
3. read-tree
|
|
|
|
`read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
|
|
entries it reads into the index, resulting in all your files suddenly
|
|
appearing to be "deleted".
|
|
|
|
4. Checkout, restore:
|
|
|
|
These command do not handle path & revision arguments appropriately:
|
|
|
|
$ ls
|
|
tracked
|
|
$ git ls-files -t
|
|
H tracked
|
|
S tracked-but-maybe-skipped
|
|
$ git status --porcelain
|
|
$ git checkout -- '*skipped'
|
|
error: pathspec '*skipped' did not match any file(s) known to git
|
|
$ git ls-files -- '*skipped'
|
|
tracked-but-maybe-skipped
|
|
$ git checkout HEAD -- '*skipped'
|
|
error: pathspec '*skipped' did not match any file(s) known to git
|
|
$ git ls-tree HEAD | grep skipped
|
|
100644 blob 276f5a64354b791b13840f02047738c77ad0584f tracked-but-maybe-skipped
|
|
$ git status --porcelain
|
|
$ git checkout HEAD~1 -- '*skipped'
|
|
$ git ls-files -t
|
|
H tracked
|
|
H tracked-but-maybe-skipped
|
|
$ git status --porcelain
|
|
M tracked-but-maybe-skipped
|
|
$ git checkout HEAD -- '*skipped'
|
|
$ git status --porcelain
|
|
$
|
|
|
|
Note that checkout without a revision (or restore --staged) fails to
|
|
find a file to restore from the index, even though ls-files shows
|
|
such a file certainly exists.
|
|
|
|
Similar issues occur with HEAD (--source=HEAD in restore's case),
|
|
but suddenly works when HEAD~1 is specified. And then after that it
|
|
will work with HEAD specified, even though it didn't before.
|
|
|
|
Directories are also an issue:
|
|
|
|
$ git sparse-checkout set nomatches
|
|
$ git status
|
|
On branch main
|
|
You are in a sparse checkout with 0% of tracked files present.
|
|
|
|
nothing to commit, working tree clean
|
|
$ git checkout .
|
|
error: pathspec '.' did not match any file(s) known to git
|
|
$ git checkout HEAD~1 .
|
|
Updated 1 path from 58916d9
|
|
$ git ls-files -t
|
|
S tracked
|
|
H tracked-but-maybe-skipped
|
|
|
|
5. checkout and restore --staged, continued:
|
|
|
|
These commands do not correctly scope operations to the sparse
|
|
specification, and make it worse by not setting important SKIP_WORKTREE
|
|
bits:
|
|
|
|
$ git restore --source OLDREV --staged outside-sparse-cone/
|
|
$ git status --porcelain
|
|
MD outside-sparse-cone/file1
|
|
MD outside-sparse-cone/file2
|
|
MD outside-sparse-cone/file3
|
|
|
|
We can add a --scope=all mode to `git restore` to let it operate outside
|
|
the sparse specification, but then it will be important to set the
|
|
SKIP_WORKTREE bits appropriately.
|
|
|
|
6. Performance issues; see:
|
|
https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
|
|
|
|
|
|
=== Reference Emails ===
|
|
|
|
Emails that detail various bugs we've had in sparse-checkout:
|
|
|
|
[1] (Original descriptions of behavior A & behavior B)
|
|
https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
|
|
[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
|
|
https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
|
|
[3] (Present-despite-skipped entries)
|
|
https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
|
|
[4] (Clone --no-checkout interaction)
|
|
https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
|
|
[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
|
|
https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
|
|
[6] (SKIP_WORKTREE is advisory, not mandatory)
|
|
https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
|
|
[7] (`worktree add` should copy sparsity settings from current worktree)
|
|
https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
|
|
[8] (Avoid negative surprises in add, rm, and mv)
|
|
https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
|
|
https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
|
|
[9] (Move from out-of-cone to in-cone)
|
|
https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
|
|
https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
|
|
[10] (Unnecessarily downloading objects outside sparse specification)
|
|
https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
|
|
|
|
[11] (Stolee's comments on high-level usecases)
|
|
https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
|
|
|
|
[12] Others commenting on eventually switching default to behavior A:
|
|
* https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
|
|
* https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
|
|
* https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
|
|
|
|
[13] Previous config name suggestion and description
|
|
* https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
|
|
|
|
[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
|
|
https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
|
|
|
|
[15] Lengthy email on grep behavior, covering what should be searched:
|
|
* https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
|
|
|
|
[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
|
|
search for the parenthetical comment starting "We do not check".
|
|
https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
|
|
|
|
[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/
|