Memory optimizations for config entries #5243

pks-t · 2019-09-26T13:22:18Z

This provides two optimizations for our config entries code.

The first optimization is rather straight forward and an obvious win, most importantly because it decreases code complexity by not having to track two separate lists of config entries anymore.

The second commit is a micro-optimization for how we store multivar entries. As all entry names are the same for a multivar, we can just deduplicate these and thus greatly decrease memory consumption (e.g. the attached test case goes down from 10GB RAM to a few hundred kilobytes, only). In fact, we have a deeper problem here because it is actually trivial for a comparatively tiny configuration file to blow up in size during parsing, and we cannot fix that without introducing a major interface break. So I'm split whether that optimization is worth it or not, but at least the code complexity doesn't really increase by a lot.

More details are in the commit messages.

tiennou

This is so cool 👏 ! And I feel so bad for having only commit messages to dissect 🤣.

tiennou · 2019-11-05T11:43:06Z

src/config_entries.c

@@ -11,6 +11,7 @@ typedef struct config_entry_list {
 	struct config_entry_list *next;
 	struct config_entry_list *last;
 	git_config_entry *entry;
+	bool first;


Sorry, I'm having a hard time understanding your good-looking commit message, so I'll try to rephrase it :

Whenever adding a configuration entry to the config entries structure, we allocate two list heads. - The first list head is added to the global list of config entries, to be able to iterate over configuration entries while preserving their order. - The second list head is added to the map of entries. When iterating, an already existing entry would be added to this head, while a new one would be directly added to the map.

I'm confused by "its" in your version, so I might be misreading, and the difference is meaningful ^^.

We can completely get rid of this secondary list by just adding a `first` field to the first configuration entry list.

"first field to the list structure itself." I think reads clearer, "first configuration entry list" sounds like "first list head" above.

I'd split the "solution" and "optimization" ¶'s as well. Thanks for the read 😉.

tiennou · 2019-11-05T11:46:03Z

src/config_entries.c

-		 * last one at the time of adding it, which is
-		 * why we set `last` here to itself. Otherwise we
-		 * do not have to set `last` and leave it set to
-		 * `NULL`.


"Thank you for the read" wasn't enough 🤣.

tiennou · 2019-11-05T11:52:24Z

src/config_entries.c

@@ -108,7 +108,8 @@ static void config_entries_free(git_config_entries *entries)
 	list = entries->list;
 	while (list != NULL) {
 		next = list->next;
-		git__free((char *) list->entry->name);


IMHO the usage of "key" in the commitmsg makes it less clear that name and key are the same thing.

Multivars are configuration entries that have many values for the same name; we can thus micro-optimize this case by just retaining the name of the first configuration entry and freeing all the others, letting them point to string of the first entry.

Some functions which are only used in "config_entries.c" are not marked as static, which is being fixed by this very commit.

Whenever adding a configuration entry to the config entries structure, we allocate two list heads: - The first list head is added to the global list of config entries in order to be able to iterate over configuration entries in the order they were originally added. - The second list head is added to the map of entries in order to efficiently look up an entry by its name. If no entry with the same name exists in the map, then we add the new entry to the map directly. Otherwise, we append the new entry's list head to the pre-existing entry's list in order to keep track of multivars. While the former usecase is perfectly sound, the second usecase can be optimized. The only reason why we keep track of multivar entries in another separate list is to be able to determine whether an entry is unique or not by seeing whether its `next` pointer is set. So we keep track of a complete list of multivar entries just to have a single bit of information of whether it has other multivar entries with the same entry name. We can completely get rid of this secondary list by just adding a `first` field to the list structure itself. When executing `git_config_entries_append`, we will then simply check whether the configuration map already has an entry with the same name -- if so, we will set the `first` to zero to indicate that it is not the initial entry anymore. Instead of a second list head in the map, we can thus now directly store the list head of the first global list inside of the map and just refer to that bit. Note that the more obvious solution would be to store a `unique` field instead of a `first` field. But as we will only ever inspect the `first` field of the _last_ entry that has been moved into the map, these are semantically equivalent in that case. Having a `first` field also allows for a minor optimization: for multivar values, we can free the `name` field of all entries that are _not_ first and have them point to the name of the first entry instead.

Multivars are configuration entries that have many values for the same name; we can thus micro-optimize this case by just retaining the name of the first configuration entry and freeing all the others, letting them point to the string of the first entry. The attached test case is an extreme example that demonstrates this. It contains a section name that is approximately 500kB in size with 20.000 entries "a=b". Without the optimization, this would require at least 20000*500kB bytes, which is around 10GB. With this patch, it only requires 500kB+20000*1B=20500kB. The obvious culprit here is the section header, which we repeatedly include in each of the configuration entry's names. This makes it very easier for an adversary to provide a small configuration file that disproportionally blows up in memory during processing and is thus a feasible way for a denial-of-service attack. Unfortunately, we cannot fix the root cause by e.g. having a separate "section" field that may easily be deduplicated due to the `git_config_entry` structure being part of our public API. So this micro-optimization is the best we can do for now.

pks-t · 2019-11-05T12:26:27Z

Thanks @tiennou, I've amended the commit messages to make them (hopefully) easier to understand :)

pks-t mentioned this pull request Nov 5, 2019

Release 0.99 #5291

Merged

2 tasks

tiennou approved these changes Nov 5, 2019

View reviewed changes

pks-t added 3 commits November 5, 2019 13:14

config_entries: mark local functions as static

8a88701

Some functions which are only used in "config_entries.c" are not marked as static, which is being fixed by this very commit.

pks-t force-pushed the pks/config-optimize-mem branch from 36e3622 to b7dcea0 Compare November 5, 2019 12:26

pks-t merged commit 61176a9 into libgit2:master Nov 28, 2019

pks-t deleted the pks/config-optimize-mem branch November 28, 2019 13:31

snyk-bot mentioned this pull request Feb 23, 2020

[Snyk] Upgrade nodegit from 0.4.1 to 0.26.4 saurabharch/Breezeblocks#1

Open

snyk-bot mentioned this pull request Mar 12, 2020

[Snyk] Upgrade nodegit from 0.26.0 to 0.26.4 thisconnect/nodegit-kit#95

Merged

snyk-bot mentioned this pull request Apr 22, 2020

[Snyk] Upgrade nodegit from 0.24.3 to 0.26.5 aminatakonate000/Graviton-App#4

Open

snyk-bot mentioned this pull request May 5, 2020

[Snyk] Upgrade nodegit from 0.24.3 to 0.26.5 Barnstorm-Online/ngp-openapi-generator#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory optimizations for config entries #5243

Memory optimizations for config entries #5243

pks-t commented Sep 26, 2019

tiennou left a comment

tiennou Nov 5, 2019

tiennou Nov 5, 2019

tiennou Nov 5, 2019

pks-t commented Nov 5, 2019

Memory optimizations for config entries #5243

Memory optimizations for config entries #5243

Conversation

pks-t commented Sep 26, 2019

tiennou left a comment

Choose a reason for hiding this comment

tiennou Nov 5, 2019

Choose a reason for hiding this comment

tiennou Nov 5, 2019

Choose a reason for hiding this comment

tiennou Nov 5, 2019

Choose a reason for hiding this comment

pks-t commented Nov 5, 2019