There is a technical explanation, having to do with toggle-refs, a GObject concept which we use to interface the JS engine’s garbage collector with GObject’s reference counting system. Georges has already provided a fantastic introduction to the technical details so I will not do another one here. This post will be more about social issues, future plans, and answers to some myths I’ve seen in various comments recently. To read this post, you only need to know that the problem has to do with toggle-refs and that toggle-refs are difficult to reason about.
Not a Memory Leak
I really don’t want to call this a memory leak, much less “the GNOME memory leak” that’s become common in the press coverage lately. I find that that sets the wrong expectations for users suffering from these memory problems. You might say that for the end user it makes no difference, their computer’s memory is being occupied by GNOME Shell, so what’s the point in not calling it a memory leak? And you would be partially right. The effect is no different. The expectations are different though, especially for users who have some technical knowledge. A memory leak is a simple problem to fix. When you have one, you run your software under Valgrind or ASAN, you get a backtrace that shows where the memory was allocated that you didn’t free, and you free it. Problem solved. You can even run Valgrind in your automatic tests to prevent new leaks. That’s not the case here, and if we refer to it as a memory leak then it can only cause frustration on the part of users who are aware how simple it is to fix a memory leak.
This problem is different. It’s not a leak in the traditional sense. The memory does eventually get freed, but GNOME Shell holds onto it for too long; long enough to cause problems on some systems. As GJS contributor Andy Holmes put it, it’s a “tardy GC sweep.” I think that has a catchy ring to it, so I’ll call it the “tardy sweep problem” from now on.
Meme by Andy Holmes, used with kind permission
To be honest, I found that the OMG!Ubuntu article about “the memory leak” attracted a lot of comments that don’t sit well with me, and I think that the wrong expectations set by calling it a “memory leak” are partly to blame. With this post, I hope to give a better idea of what GNOME users can expect.
On the bright side, due to the recent publicity and especially the OMG!Ubuntu article, more GNOME developers are talking about the memory problems and suggesting things, which is causing an exciting confluence of ideas that I couldn’t have come up with on my own.
Edit: I want to be absolutely clear that with the above I’m not blaming bug reporters for not knowing whether something is a “proper” memory leak or not. This is intended to bring some attention to the wrong expectations that arise, especially among technically savvy users, when GNOME developers and the tech press use the term “memory leak,” and illustrate why we ourselves should not use the term here.
The Big Hammer
CC0 licensed image, by stevepb
I’ve been calling this patch the “Big Hammer” because it’s a drastic measure: starting a whole new garbage collection cycle in order to clean up some objects that we already know should be cleaned up.
The tardy sweep problem has now been mitigated with the Big Hammer, but reducing GNOME Shell’s memory usage has been a battle for years, and it has very little to do with memory leaks.
heapgraph tool is useful when tracking these down, but throughout most of the life of GNOME Shell this tool didn’t exist.) There’s also memory fragmentation, which can look like a memory leak in a resource monitor.1 In addition, when diagnosing reports from users, configurations vary wildly. Memory usage simply differs from system to system. Finally, people have different configurations of Shell extensions, some of which leak memory as well.
The Sad Lifecycle of a GNOME Shell Memory Leak Bug Report
- User reports “I have a memory leak”
- Developer runs Valgrind, sifts through Valgrind trace, finds a small leak and fixes it
- Problem isn’t fixed
- Repeat 2 and 3 until no leaks shown in the Valgrind trace
- Problem still isn’t fixed
- User eagerly awaits each point release hoping for relief, and is disappointed each time
- Bug report gains popularity, accretes followers like a katamari, some of whom vent about unrelated bugs, hound the developer, or become abusive, until the original point of the bug report is lost
- Developer can’t do anything productive with the bug report at this point. They know there’s still a memory problem and it’s not a traditional leak, but the bug report is not helping them find it
- Users don’t accept that answer
- Developer closes the bug report and makes users even angrier. (Or, developer ignores the bug report until it fizzles out, and makes users sad.)
Here are some examples of long-running bug reports where you can see this dynamic in action. It’s quite sad to observe, because everybody involved is doing what makes perfect sense from their perspective (except for a few people behaving badly), yet the result is a mess.
I hope this illustrates why it’s important to assume that people are acting in good faith.
I know some people will argue that the developer mustn’t close the bug report until the bug is “fixed”, meaning that there is no more unnecessary memory usage. But in my opinion that’s just not a useful way to think of bug reports. GNOME Shell developers know (and so do I, from the GJS side) that GNOME Shell uses a lot of memory. I agree it’s nice to keep a bug report open so that users know that we’re aware of it and it’s on our to-do lists somewhere, but very soon the time we spend dealing with the noise on the bug report eclipses whatever benefit it might bring to the community.
I hope GitLab will improve things a bit here, since if you feel strongly about an issue in the bugtracker, you can upvote it or add an emoji reaction to it. This is a good way for users to show that an issue is important to them, and if enough people use it then it’s a good indicator for me to see which issues are prioritized highest by users and contributors.
5 Myths About GNOME Shell’s Memory Problems (Paraphrased)
Carlos Garnacho, of GNOME Shell fame, pointed out to me that in a more global perspective there has been a very active hunt for actual memory leaks across many of GNOME Shell’s dependencies for quite some time, and he has personally patched leaks in IBus, AccountsService, libgweather, gnome-desktop, and more.
It seems the tardy sweep problem has gotten worse in recent versions of GNOME, although it’s hard to measure between different systems with different configurations. I don’t know why it’s gotten worse.2
It seems to have been known for a long time, though: for history buffs, it was alluded to in a comment in commit ae34ec49, back in 2011. That knowledge was apparently lost when GJS was without a maintainer for a couple of years. To be honest, it has taken me well over a year to get familiar enough with the toggle-ref code (which integrates the JS garbage collector with GObject’s refcounting system), that I feel even remotely comfortable making or reviewing changes to it. I only fully realized the implications of the tardy sweep problem after talking to Georges and seeing his memory graph.
It seems that a previous GJS maintainer, Giovanni Campagna, was trying to mitigate the tardy sweep problem already five years ago, with a patch that allowed objects to escape the tardy sweep by opting out of the whole toggle-ref system in some cases. Unfortunately, as far as I can tell from my bug tracker archaeology, his patch went through a few reviews and the answer was always “Wow, this is really complicated, I need to study it some more in order to understand it.” Then it fell by the wayside when he stepped down from GJS maintainership.
I picked the patch up again late last year and fixed up most of the bit-rot. It still had a few problems with it. I never found the time to fix it up completely, which Georges kindly took over for me. I initially preferred Giovanni’s patch above the Big Hammer, but unfortunately for me, Georges proved that it wasn’t as effective as we thought it would be, only clearing up about 5% of the tardy sweep memory.3
I don’t call it the Big Hammer for nothing. We were concerned about performance regressions too, so that’s reasonable. However, as you can read in the bug tracker, we did actually do some testing on lower-end hardware before merging the Big Hammer, and it was not as bad as I expected. Carlos has been doing some measurements and found that garbage collection accounts for about 2–3% of the time that GNOME Shell occupies the CPU.
However, it’s exactly because we want to be cautious that the Big Hammer has only been committed to master, which will first be released in the unstable GNOME 3.29.2 snapshot. I don’t plan to release it on a stable branch until we’ve run it some more.
Ubuntu has already put the Big Hammer in their LTS version. That’s more of a risk than I would have recommended, but it’s not my decision to make, and I am grateful that we will be getting some testing through that avenue. Endless is also considering putting the Big Hammer in their stable version.
(And alas, I don’t have a top-of-the-line machine. Feel free to donate me one if that’s what’s required to make me conform to some stereotype of GNOME developers. 😇)
“The problem should be fixed now. GNOME Shell will run smooth from now on.”
No, it’s not. GNOME Shell still isn’t that great with memory.
Carlos is working on some merge requests which are approaching being ready to merge, which should make things a bit more memory-efficient. He’s also had some success with experiments trying to reduce memory fragmentation, taking better advantage of SpiderMonkey’s compacting garbage collector.
We are also bouncing around some ideas for making the Big Hammer into a smaller hammer. In particular, we’re trying to see if the extra garbage collections can be restricted to only the JS objects that represent GObjects, since those are the only objects that are affected by the tardy sweep problem. We’re also trying to see if there’s a way to return black-marked (reachable) objects to their original white-marked (eligible for collection) state when a GObject is toggled down in the middle of a garbage collection.
Another approach to investigate is to make better use of incremental garbage collection. SpiderMonkey offers this facility but we don’t use it yet. The idea is, instead of pausing and doing a big garbage collection, we do a slice of a few milliseconds whenever we have time. I don’t know yet whether this will have a large or small effect, or even render the Big Hammer unnecessary.
We’re also going to update to SpiderMonkey 60 in GNOME 3.30 which will hopefully bring in another year’s worth of Mozilla’s garbage collector research and optimization.
Finally, I’m gradually working on another unfinished merge request left over from Giovanni’s tenure as GJS maintainer, that should drastically increase the performance of GNOME Shell’s animations (though not necessarily help with memory.)
GNOME has a fixed release schedule, so they release new versions on the release dates, with whatever is aboard the train at that time. That’s not going to change.
Of course, I want whatever version of GNOME ships with any Linux distribution to be as good as possible. But as the upstream GJS maintainer, I have no say over what a downstream Linux distribution chooses to ship. The best way for a Linux distro to make sure their release is shipshape, is to contribute resources towards fixing whatever they consider a blocker.
That sounds a bit callous, as if I refuse to fix any bugs that Ubuntu wants fixed; that’s not what I mean at all. But my free time is limited. I’m paid for a part of my GJS maintainer work, but only for specific features. I can’t work to anyone’s external deadlines in my free time, because otherwise I’ll burn out and that’s not good for anyone with any interest in GJS either. Sometimes I have other priorities besides sitting at the computer; sometimes I do have time but no ideas about a particular problem; sometimes my brain isn’t up to fixing a difficult memory problem and I choose to work on something easier.4 Bugfixing work isn’t fungible.
I picked Ubuntu to illustrate this example, because contributing is exactly what the Ubuntu team has done; Ubuntu contributors fixed stability bugs in GJS, as well as GNOME Shell and Mutter, for GNOME 3.28. To say nothing of contributors from other downstreams, as well. That’s great and I’m looking forward to more of it! Some commenters seem to see downstreams fixing bugs as something that GNOME developers should be ashamed of, but I believe everyone is better off for it when that happens!
Thanks to Carlos Garnacho and Andy Holmes, who commented on a draft version of this blog post. Thanks in addition to Andy who coined the term “tardy sweep” and provided Scruffy as the mascot; Heartbleed has branding, why shouldn’t we? And of course, thanks to Georges who kicked off the whole research in the first place!
and has often made people angry in bug reports when told it’s not a memory leak ↩
I have a hunch, though. When I updated SpiderMonkey to version 38 in GNOME 3.24, we went from a conservative collector to an exact-rooted, moving one — see this Wikipedia article for definitions of those terms. It may be that the old garbage collector, though generally considered inferior, did actually mitigate the tardy sweeps a little, because I think back then it would have been possible for more objects to make their way into an ongoing sweep. It’s also possible that it was made worse earlier than that, by some adjustments in GNOME Shell that adjusted how often the garbage collector was called. ↩
Technical explanation: Tweener, which is the animation framework used by GNOME Shell, renders many objects ineligible to opt out of the toggle-ref system. I would like to see Tweener replaced with Clutter implicit animations in GNOME Shell, which would make Giovanni’s patch much more effective, but that’s a big project. ↩
Like writing a blog post about a difficult memory problem. Joke’s on me, that’s actually really hard ↩