Commits · e5efc4179b6d759cec37e5c9712e9ab8fbf05af3 · lvzhengyang / git2

13 Jul, 2020 1 commit

config_entries: Avoid excessive map operations · f2400a9c

When appending config entries, we currently always first get the
currently existing map entry and then afterwards update the map to
contain the current config value. In the common scenario where keys
aren't being overridden, this is the best we can do. But in case a key
gets set multiple times, then we'll also perform these two map
operations. In extreme cases, hashing the map keys will thus start to
dominate performance.

Let's optimize the pattern by using a separately allocated map entry.
Currently, we always put the current list entry into the map and update
it to get any overridden multivar. As these list entries are also used
to iterate config entries, we cannot update them in-place in the map and
are thus forced to always set the map to contain the new entry. But with
a separately allocated map entry, we can now create one once per config
key and insert it into the map. Whenever appending a new config value
with the same key, we can now just update the map entry in-place instead
of having to replace the map entry completely.

This reduces calls to the hashing function by half and trades the
improved runtime for one more allocation per unique config key. Given
that the refactoring arguably improves code readability by splitting
concerns of the `config_entry_list` type and not having to track it in
two different structures, this alone would already be reason enough to
take the trade.

Given a pathological case of a gitconfig with 100.000 repeated keys and
a section of length 10.000 characters, this reduces runtime by half from
approximately 14 seconds to 7 seconds as expected.

committed 4 years ago

f2400a9c Browse File

05 Nov, 2019 3 commits

config_entries: micro-optimize storage of multivars · b7dcea04

Multivars are configuration entries that have many values for the same
name; we can thus micro-optimize this case by just retaining the name of
the first configuration entry and freeing all the others, letting them
point to the string of the first entry.

The attached test case is an extreme example that demonstrates this. It
contains a section name that is approximately 500kB in size with 20.000
entries "a=b". Without the optimization, this would require at least
20000*500kB bytes, which is around 10GB. With this patch, it only
requires 500kB+20000*1B=20500kB.

The obvious culprit here is the section header, which we repeatedly
include in each of the configuration entry's names. This makes it very
easier for an adversary to provide a small configuration file that
disproportionally blows up in memory during processing and is thus a
feasible way for a denial-of-service attack. Unfortunately, we cannot
fix the root cause by e.g. having a separate "section" field that may
easily be deduplicated due to the `git_config_entry` structure being
part of our public API. So this micro-optimization is the best we can do
for now.

committed 5 years ago

b7dcea04 Browse File

config_entries: only keep track of a single entry list · 62320860

Whenever adding a configuration entry to the config entries structure,
we allocate two list heads:

    - The first list head is added to the global list of config entries
      in order to be able to iterate over configuration entries in the
      order they were originally added.

    - The second list head is added to the map of entries in order to
      efficiently look up an entry by its name. If no entry with the
      same name exists in the map, then we add the new entry to the map
      directly. Otherwise, we append the new entry's list head to the
      pre-existing entry's list in order to keep track of multivars.

While the former usecase is perfectly sound, the second usecase can be
optimized. The only reason why we keep track of multivar entries in
another separate list is to be able to determine whether an entry is
unique or not by seeing whether its `next` pointer is set. So we keep
track of a complete list of multivar entries just to have a single bit
of information of whether it has other multivar entries with the same
entry name.

We can completely get rid of this secondary list by just adding a
`first` field to the list structure itself. When executing
`git_config_entries_append`, we will then simply check whether the
configuration map already has an entry with the same name -- if so, we
will set the `first` to zero to indicate that it is not the initial
entry anymore. Instead of a second list head in the map, we can thus now
directly store the list head of the first global list inside of the map
and just refer to that bit.

Note that the more obvious solution would be to store a `unique` field
instead of a `first` field. But as we will only ever inspect the `first`
field of the _last_ entry that has been moved into the map, these are
semantically equivalent in that case.

Having a `first` field also allows for a minor optimization: for
multivar values, we can free the `name` field of all entries that are
_not_ first and have them point to the name of the first entry instead.

committed 5 years ago

62320860 Browse File

config_entries: mark local functions as static · 8a88701e

Some functions which are only used in "config_entries.c" are not marked
as static, which is being fixed by this very commit.

committed 5 years ago

8a88701e Browse File

26 Jul, 2019 1 commit

config_entries: fix possible segfault when duplicating entries · 76182e84

When duplicating a configuration entry, we allocate a new entry but do
not verify that we get a valid pointer back. As we're dereferencing the
pointer afterwards, we might thus run into a segfault in out-of-memory
situations.

Extract a new function `git_config_entries_dup_entry` that handles the
complete entry duplication. Fix the error by using
`GIT_ERROR_CHECK_ALLOC`.

committed 5 years ago

76182e84 Browse File

15 Feb, 2019 3 commits

strmap: introduce high-level setter for key/value pairs · 03555830

Currently, one would use the function `git_strmap_insert` to insert key/value
pairs into a map. This function has historically been a macro, which is why its
syntax is kind of weird: instead of returning an error code directly, it instead
has to be passed a pointer to where the return value shall be stored. This does
not match libgit2's common idiom of directly returning error codes.

Introduce a new function `git_strmap_set`, which takes as parameters the map,
key and value and directly returns an error code. Convert all callers of
`git_strmap_insert` to make use of it.

committed 5 years ago

03555830 Browse File

strmap: introduce `git_strmap_get` and use it throughout the tree · ef507bc7

The current way of looking up an entry from a map is tightly coupled with the
map implementation, as one first has to look up the index of the key and then
retrieve the associated value by using the index. As a caller, you usually do
not care about any indices at all, though, so this is more complicated than
really necessary. Furthermore, it invites for errors to happen if the correct
error checking sequence is not being followed.

Introduce a new high-level function `git_strmap_get` that takes a map and a key
and returns a pointer to the associated value if such a key exists. Otherwise,
a `NULL` pointer is returned. Adjust all callers that can trivially be
converted.

committed 5 years ago

ef507bc7 Browse File

maps: use uniform lifecycle management functions · 351eeff3

Currently, the lifecycle functions for maps (allocation, deallocation, resize)
are not named in a uniform way and do not have a uniform function signature.
Rename the functions to fix that, and stick to libgit2's naming scheme of saying
`git_foo_new`. This results in the following new interface for allocation:

- `int git_<t>map_new(git_<t>map **out)` to allocate a new map, returning an
  error code if we ran out of memory

- `void git_<t>map_free(git_<t>map *map)` to free a map

- `void git_<t>map_clear(git<t>map *map)` to remove all entries from a map

This commit also fixes all existing callers.

committed 5 years ago

351eeff3 Browse File

22 Jan, 2019 1 commit
- git_error: use new names in internal APIs and usage · f673e232
```
Move to the `git_error` name in the internal API for error-related
functions.
```
  Edward Thomson committed 5 years ago
  f673e232 Browse File
28 Nov, 2018 1 commit

khash: remove intricate knowledge of khash types · 852bc9f4

Instead of using the `khiter_t`, `git_strmap_iter` and `khint_t` types,
simply use `size_t` instead. This decouples code from the khash stuff
and makes it possible to move the khash includes into the implementation
files.

committed 6 years ago

852bc9f4 Browse File

28 Sep, 2018 7 commits

config_entries: refactor entries iterator memory ownership · b78f4ab0

Right now, the config file code requires us to pass in its backend to
the config entry iterator. This is required with the current code, as
the config file backend will first create a read-only snapshot which is
then passed to the iterator just for that purpose. So after the iterator
is getting free'd, the code needs to make sure that the snapshot gets
free'd, as well.

By now, though, we can easily refactor the code to be more efficient and
remove the reverse dependency from iterator to backend. Instead of
creating a read-only snapshot (which also requires us to re-parse the
complete configuration file), we can simply duplicate the config entries
and pass those to the iterator. Like that, the iterator only needs to
make sure to free the duplicated config entries, which is trivial to do
and clears up memory ownership by a lot.

committed 6 years ago

b78f4ab0 Browse File

config_entries: internalize structure declarations · d49b1365

Access to the config entries is now completely done via the modules
function interface and no caller messes with the struct's internals. We
can thus completely move the structure declarations into the
implementation file so that nobody even has a chance to mess with the
members.

committed 6 years ago

d49b1365 Browse File

config_entries: abstract away reference counting · 123e5963

Instead of directly calling `git_atomic_inc` in users of the config
entries store, provide a `git_config_entries_incref` function to further
decouple the interfaces. Convert the refcount to a `git_refcount`
structure while at it.

committed 6 years ago

123e5963 Browse File

config_entries: abstract away iteration over entries · 5a7e0b3c

The nice thing about our `git_config_iterator` interfaces is that nobody
needs to know anything about the implementation details. All that is
required is to obtain the iterator via any backend and then use it by
executing generic functions. We can thus completely internalize all the
implementation details of how to iterate over entries into the config
entries store and simply create such an iterator in our config file
backend when we want to iterate its entries. This further decouples the
config file backend from the config entries store.

committed 6 years ago

5a7e0b3c Browse File

config_entries: abstract away retrieval of config entries · 60ebc137

The code accessing config entries in the `git_config_entries` structure
is still much too intimate with implementation details, directly
accessing the maps and handling indices. Provide two new functions to
get config entries from the internal map structure to decouple the
interfaces and use them in the config file code.

The function `git_config_entries_get` will simply look up the entry by
name and, in the case of a multi-value, return the last occurrence of
that entry. The second function, `git_config_entries_get_unique`, will
only return an entry if it is unique and not included via another
configuration file. This one is required to properly implement write
operations for single entries, as we refuse to write to or delete a
single entry if it is not clear which one was meant.

committed 6 years ago

60ebc137 Browse File

config_entries: rename functions and structure · fb8a87da

The previous commit simply moved all code that is required to handle
config entries to a new module without yet adjusting any of the function
and structure names to help readability. We now rename things
accordingly to have a common "git_config_entries" entries instead of the
old "diskfile_entries" one.

committed 6 years ago

fb8a87da Browse File

config_entries: pull out implementation of entry store · 04f57d51

The configuration entry store that is used for configuration files needs
to keep track of all entries in two different structures:

- a singly linked list is being used to be able to iterate through
  configuration files in the order they have been found

- a string map is being used to efficiently look up configuration
  entries by their key

This store is thus something that may be used by other, future backends
as well to abstract away implementation details and iteration over the
entries.

Pull out the necessary functions from "config_file.c" and moves them
into their own "config_entries.c" module. For now, this is simply moving
over code without any renames and/or refactorings to help reviewing.

committed 6 years ago

04f57d51 Browse File