More Memory, More Problems

In GJS we recently committed a patch that has been making waves. Thanks to GJS contributor Georges Basile “Feaneron” Stavracas Neto, some infamous memory problems with GNOME Shell 3.28 have been mitigated. (What’s the link between GNOME Shell and GJS? GNOME Shell uses GJS as its internal JavaScript engine, in which some of the UI and all of the extensions are implemented.)

There is a technical explanation, having to do with toggle-refs, a GObject concept which we use to interface the JS engine’s garbage collector with GObject’s reference counting system. Georges has already provided a fantastic introduction to the technical details so I will not do another one here. This post will be more about social issues, future plans, and answers to some myths I’ve seen in various comments recently. To read this post, you only need to know that the problem has to do with toggle-refs and that toggle-refs are difficult to reason about.

Not a Memory Leak

I really don’t want to call this a memory leak, much less “the GNOME memory leak” that’s become common in the press coverage lately. I find that that sets the wrong expectations for users suffering from these memory problems. You might say that for the end user it makes no difference: their computer’s memory is being occupied by GNOME Shell, so what’s the point in not calling it a memory leak? And you would be partially right. The effect is no different. The expectations are different though, especially for users who have some technical knowledge. A memory leak is a simple problem to fix. When you have one, you run your software under Valgrind or ASAN, you get a backtrace that shows where the memory was allocated that you didn’t free, and you free it. Problem solved. You can even run Valgrind in your automatic tests to prevent new leaks. That’s not the case here, and if we refer to it as a memory leak then it can only cause frustration on the part of users who are aware of how simple it is to fix a memory leak.
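To make that “simple to fix” workflow concrete, here is a minimal sketch of it. It is written in Python rather than C, with the standard library’s tracemalloc module standing in for the role that Valgrind or ASAN plays for native code. The names `handle_request`, `_history` and `find_biggest_growth` are made up for illustration and have nothing to do with GJS or GNOME Shell.

```python
import tracemalloc

_history = []  # hypothetical module-level list that nobody ever trims


def handle_request(payload: bytes) -> int:
    """Pretend to process a request, but accidentally keep every payload alive."""
    _history.append(payload)  # the leak: entries are added and never removed
    return len(payload)


def find_biggest_growth(n_requests: int = 200, payload_size: int = 10_000):
    """Return (filename, lineno, bytes_grown) for the allocation site that grew most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(n_requests):
        handle_request(bytes(payload_size))  # this line allocates the leaked memory
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to() sorts the differences, largest growth first
    top = after.compare_to(before, "lineno")[0]
    frame = top.traceback[0]
    return frame.filename, frame.lineno, top.size_diff
```

The tool points at the line where the surviving memory was allocated; from there it is usually a short step to the container that keeps it alive, and then to the fix.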

This problem is different. It’s not a leak in the traditional sense. The memory does eventually get freed, but GNOME Shell holds onto it for too long; long enough to cause problems on some systems. As GJS contributor Andy Holmes put it, it’s a “tardy GC sweep.” I think that has a catchy ring to it, so I’ll call it the “tardy sweep problem” from now on.

Scruffy the janitor from Futurama, relaxing in a chair, captioned

Meme by Andy Holmes, used with kind permission

To be honest, I found that the OMG!Ubuntu article about “the memory leak” attracted a lot of comments that don’t sit well with me, and I think that the wrong expectations set by calling it a “memory leak” are partly to blame. With this post, I hope to give a better idea of what GNOME users can expect.

On the bright side, due to the recent publicity and especially the OMG!Ubuntu article, more GNOME developers are talking about the memory problems and suggesting things, which is causing an exciting confluence of ideas that I couldn’t have come up with on my own.

Edit: I want to be absolutely clear that with the above I’m not blaming bug reporters for not knowing whether something is a “proper” memory leak or not. This is intended to bring some attention to the wrong expectations that arise, especially among technically savvy users, when GNOME developers and the tech press use the term “memory leak,” and illustrate why we ourselves should not use the term here.

 

The Big Hammer

Hammer smashing a peanut

CC0 licensed image, by stevepb

I’ve been calling this patch the “Big Hammer” because it’s a drastic measure: starting a whole new garbage collection cycle in order to clean up some objects that we already know should be cleaned up.
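The following is a toy model of that idea, not GJS code. It assumes, as a deliberate simplification, that an object only becomes eligible for collection once the object holding it has been finalized, so a chain of nested objects is released one layer per garbage collection cycle. The function names, the tick-based “main loop” and the `big_hammer` flag are all invented for this sketch.

```python
def sweep(alive, parent_of):
    """Run one GC cycle and return (survivors, number_freed).

    Collectability is decided from a snapshot taken when the cycle starts,
    so objects released by this cycle have to wait for the next one.
    """
    collectable = {obj for obj in alive if parent_of[obj] not in alive}
    return alive - collectable, len(collectable)


def ticks_until_empty(depth, gc_interval, big_hammer):
    """Count main-loop ticks until a chain of `depth` nested objects is gone."""
    # Object 0 is the top of the chain; object i is held alive by object i - 1.
    parent_of = {i: (i - 1 if i > 0 else None) for i in range(depth)}
    alive = set(range(depth))
    tick = 0
    next_gc = gc_interval
    while alive:
        tick += 1
        if tick >= next_gc:
            alive, freed = sweep(alive, parent_of)
            # Big Hammer: if this cycle released anything, queue another
            # cycle right away instead of waiting for the regular schedule.
            if big_hammer and freed:
                next_gc = tick + 1
            else:
                next_gc = tick + gc_interval
    return tick
```

In this model, `ticks_until_empty(5, 10, big_hammer=False)` takes 50 ticks to release a chain of five objects, while `ticks_until_empty(5, 10, big_hammer=True)` takes 14. The memory is freed in both cases; the difference is how long it lingers, and the price is four extra full collection cycles.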

The tardy sweep problem has now been mitigated with the Big Hammer, but reducing GNOME Shell’s memory usage has been a battle for years, and it has very little to do with memory leaks.

There are many other causes of high memory usage in GNOME Shell. Some are real memory leaks, which generally get fixed before too long. GNOME Shell developers have had their suspicions about NVIDIA drivers for years. Another cause is JS memory leaks in GNOME Shell. (Contrary to popular belief, it is possible to leak memory in pure JavaScript code. Andy’s new heapgraph tool is useful when tracking these down, but throughout most of the life of GNOME Shell this tool didn’t exist.) There’s also memory fragmentation, which can look like a memory leak in a resource monitor.[1] In addition, when diagnosing reports from users, configurations vary wildly. Memory usage simply differs from system to system. Finally, people have different configurations of Shell extensions, some of which leak memory as well.
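For readers surprised that a garbage-collected language can leak at all, here is a small sketch of the usual pattern: a long-lived object keeps a callback, and the callback keeps everything it refers to alive. It is written in Python as an analogy, not in JavaScript, and `EventBus`, `Panel` and `panel_survives` are hypothetical names, not GNOME Shell or GJS APIs.

```python
import gc
import weakref


class Panel:
    """Stand-in for a UI object that should disappear once it is closed."""

    def refresh(self):
        pass


class EventBus:
    """Hypothetical long-lived object that stores callbacks."""

    def __init__(self):
        self.handlers = []

    def connect(self, handler):
        self.handlers.append(handler)

    def disconnect(self, handler):
        self.handlers.remove(handler)


def panel_survives(disconnect_on_close: bool) -> bool:
    """Return True if the panel is still in memory after we stop using it."""
    bus = EventBus()
    panel = Panel()
    bus.connect(panel.refresh)      # the bound method holds a reference to panel
    probe = weakref.ref(panel)      # lets us observe the panel without keeping it alive
    if disconnect_on_close:
        bus.disconnect(panel.refresh)
    del panel                       # our own reference is gone ...
    gc.collect()
    return probe() is not None      # ... but the bus may still hold one
```

Here `panel_survives(False)` returns `True` and `panel_survives(True)` returns `False`. The collector is working correctly in both cases; the object is simply still reachable, which is why this kind of leak does not show up as a missing `free()` in a Valgrind trace.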

The Sad Lifecycle of a GNOME Shell Memory Leak Bug Report

  1. User reports “I have a memory leak”
  2. Developer runs Valgrind, sifts through Valgrind trace, finds a small leak and fixes it
  3. Problem isn’t fixed
  4. Repeat 2 and 3 until no leaks shown in the Valgrind trace
  5. Problem still isn’t fixed
  6. User eagerly awaits each point release hoping for relief, and is disappointed each time
  7. Bug report gains popularity, accretes followers like a katamari, some of whom vent about unrelated bugs, hound the developer, or become abusive, until the original point of the bug report is lost
  8. Developer can’t do anything productive with the bug report at this point. They know there’s still a memory problem and it’s not a traditional leak, but the bug report is not helping them find it
  9. Users don’t accept that answer
  10. Developer closes the bug report and makes users even angrier. (Or, developer ignores the bug report until it fizzles out, and makes users sad.)

Here are some examples of long-running bug reports where you can see this dynamic in action. It’s quite sad to observe, because everybody involved is doing what makes perfect sense from their perspective (except for a few people behaving badly), yet the result is a mess.

I hope this illustrates why it’s important to assume that people are acting in good faith.

I know some people will argue that the developer mustn’t close the bug report until the bug is “fixed”, meaning that there is no more unnecessary memory usage. But in my opinion that’s just not a useful way to think of bug reports. GNOME Shell developers know (and so do I, from the GJS side) that GNOME Shell uses a lot of memory. I agree it’s nice to keep a bug report open so that users know that we’re aware of it and it’s on our to-do lists somewhere, but very soon the time we spend dealing with the noise on the bug report eclipses whatever benefit it might bring to the community.

I hope GitLab will improve things a bit here, since if you feel strongly about an issue in the bugtracker, you can upvote it or add an emoji reaction to it. This is a good way for users to show that an issue is important to them, and if enough people use it then it’s a good indicator for me to see which issues are prioritized highest by users and contributors.

5 Myths About GNOME Shell’s Memory Problems (Paraphrased)

“GNOME developers never cared about the tardy sweep problem until OMG!Ubuntu reported on it. They don’t care about users until someone makes them.”

Carlos Garnacho, of GNOME Shell fame, pointed out to me that in a more global perspective there has been a very active hunt for actual memory leaks across many of GNOME Shell’s dependencies for quite some time, and he has personally patched leaks in IBus, AccountsService, libgweather, gnome-desktop, and more.

It seems the tardy sweep problem has gotten worse in recent versions of GNOME, although it’s hard to measure between different systems with different configurations. I don’t know why it’s gotten worse.[2]

It seems to have been known for a long time, though: for history buffs, it was alluded to in a comment in commit ae34ec49, back in 2011. That knowledge was apparently lost when GJS was without a maintainer for a couple of years. To be honest, it has taken me well over a year to get familiar enough with the toggle-ref code (which integrates the JS garbage collector with GObject’s refcounting system), that I feel even remotely comfortable making or reviewing changes to it. I only fully realized the implications of the tardy sweep problem after talking to Georges and seeing his memory graph.

It seems that a previous GJS maintainer, Giovanni Campagna, was trying to mitigate the tardy sweep problem already five years ago, with a patch that allowed objects to escape the tardy sweep by opting out of the whole toggle-ref system in some cases. Unfortunately, as far as I can tell from my bug tracker archaeology, his patch went through a few reviews and the answer was always “Wow, this is really complicated, I need to study it some more in order to understand it.” Then it fell by the wayside when he stepped down from GJS maintainership.

I picked the patch up again late last year and fixed up most of the bit-rot. It still had a few problems, and I never found the time to fix it up completely, so Georges kindly took it over for me. I initially preferred Giovanni’s patch over the Big Hammer, but unfortunately for me, Georges proved that it wasn’t as effective as we thought it would be, only clearing up about 5% of the tardy sweep memory.[3]

“GNOME fixed the problem with a shoddy solution. This will decrease performance but they won’t notice because they all have top-of-the-line machines.”

I don’t call it the Big Hammer for nothing. We were concerned about performance regressions too, so that’s reasonable. However, as you can read in the bug tracker, we did actually do some testing on lower-end hardware before merging the Big Hammer, and it was not as bad as I expected. Carlos has been doing some measurements and found that garbage collection accounts for about 2–3% of the time that GNOME Shell occupies the CPU.

However, it’s exactly because we want to be cautious that the Big Hammer has only been committed to master, which will first be released in the unstable GNOME 3.29.2 snapshot. I don’t plan to release it on a stable branch until we’ve run it some more.

Ubuntu has already put the Big Hammer in their LTS version. That’s more of a risk than I would have recommended, but it’s not my decision to make, and I am grateful that we will be getting some testing through that avenue. Endless is also considering putting the Big Hammer in their stable version.

(And alas, I don’t have a top-of-the-line machine. Feel free to donate me one if that’s what’s required to make me conform to some stereotype of GNOME developers. 😇)

“The problem should be fixed now. GNOME Shell will run smooth from now on.”

No, it’s not. GNOME Shell still isn’t that great with memory.

Carlos is working on some merge requests which are approaching being ready to merge, which should make things a bit more memory-efficient. He’s also had some success with experiments trying to reduce memory fragmentation, taking better advantage of SpiderMonkey’s compacting garbage collector.

We are also bouncing around some ideas for making the Big Hammer into a smaller hammer. In particular, we’re trying to see if the extra garbage collections can be restricted to only the JS objects that represent GObjects, since those are the only objects that are affected by the tardy sweep problem. We’re also trying to see if there’s a way to return black-marked (reachable) objects to their original white-marked (eligible for collection) state when a GObject is toggled down in the middle of a garbage collection.

Another approach to investigate is to make better use of incremental garbage collection. SpiderMonkey offers this facility but we don’t use it yet. The idea is, instead of pausing and doing a big garbage collection, we do a slice of a few milliseconds whenever we have time. I don’t know yet whether this will have a large or small effect, or even render the Big Hammer unnecessary.
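As a rough illustration of the slicing idea only, here is a toy sketch. It does not use SpiderMonkey’s API, and it budgets each slice by a number of objects rather than by milliseconds so that the behaviour is deterministic. `incremental_sweep` and `run_main_loop` are invented names.

```python
from collections import deque


def incremental_sweep(garbage, budget_per_slice):
    """Free `garbage` in slices, yielding the objects freed by each slice."""
    if budget_per_slice < 1:
        raise ValueError("budget_per_slice must be at least 1")
    queue = deque(garbage)
    while queue:
        count = min(budget_per_slice, len(queue))
        yield [queue.popleft() for _ in range(count)]


def run_main_loop(garbage, budget_per_slice):
    """Interleave one GC slice with one unit of other work per iteration."""
    log = []
    for freed in incremental_sweep(garbage, budget_per_slice):
        log.append(f"gc slice freed {len(freed)}")
        log.append("draw frame")  # stand-in for whatever else the main loop does
    return log
```

With ten pieces of garbage and a budget of four, the work is spread over three slices of 4, 4 and 2 objects, with other work running in between, instead of one long pause.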

We’re also going to update to SpiderMonkey 60 in GNOME 3.30 which will hopefully bring in another year’s worth of Mozilla’s garbage collector research and optimization.

Finally, I’m gradually working on another unfinished merge request left over from Giovanni’s tenure as GJS maintainer, that should drastically increase the performance of GNOME Shell’s animations (though not necessarily help with memory.)

“GNOME has no business releasing any new versions until this problem is fixed.”

GNOME has a fixed release schedule, so they release new versions on the release dates, with whatever is aboard the train at that time. That’s not going to change.

“This version of GNOME is going into Ubuntu LTS! GNOME needs to work harder to fix this.”

Of course, I want whatever version of GNOME ships with any Linux distribution to be as good as possible. But as the upstream GJS maintainer, I have no say over what a downstream Linux distribution chooses to ship. The best way for a Linux distro to make sure their release is shipshape, is to contribute resources towards fixing whatever they consider a blocker.

That sounds a bit callous, as if I refuse to fix any bugs that Ubuntu wants fixed; that’s not what I mean at all. But my free time is limited. I’m paid for a part of my GJS maintainer work, but only for specific features. I can’t work to anyone’s external deadlines in my free time, because otherwise I’ll burn out and that’s not good for anyone with any interest in GJS either. Sometimes I have other priorities besides sitting at the computer; sometimes I do have time but no ideas about a particular problem; sometimes my brain isn’t up to fixing a difficult memory problem and I choose to work on something easier.[4] Bugfixing work isn’t fungible.

I picked Ubuntu to illustrate this example, because contributing is exactly what the Ubuntu team has done; Ubuntu contributors fixed stability bugs in GJS, as well as GNOME Shell and Mutter, for GNOME 3.28. To say nothing of contributors from other downstreams, as well. That’s great and I’m looking forward to more of it! Some commenters seem to see downstreams fixing bugs as something that GNOME developers should be ashamed of, but I believe everyone is better off for it when that happens!

Acknowledgements

Thanks to Carlos Garnacho and Andy Holmes, who commented on a draft version of this blog post. Thanks in addition to Andy who coined the term “tardy sweep” and provided Scruffy as the mascot; Heartbleed has branding, why shouldn’t we? And of course, thanks to Georges who kicked off the whole research in the first place!


[1] and has often made people angry in bug reports when told it’s not a memory leak

[2] I have a hunch, though. When I updated SpiderMonkey to version 38 in GNOME 3.24, we went from a conservative collector to an exact-rooted, moving one — see this Wikipedia article for definitions of those terms. It may be that the old garbage collector, though generally considered inferior, did actually mitigate the tardy sweeps a little, because I think back then it would have been possible for more objects to make their way into an ongoing sweep. It’s also possible that it was made worse earlier than that, by some adjustments in GNOME Shell that adjusted how often the garbage collector was called.

[3] Technical explanation: Tweener, which is the animation framework used by GNOME Shell, renders many objects ineligible to opt out of the toggle-ref system. I would like to see Tweener replaced with Clutter implicit animations in GNOME Shell, which would make Giovanni’s patch much more effective, but that’s a big project.

[4] Like writing a blog post about a difficult memory problem. Joke’s on me, that’s actually really hard

15 thoughts on “More Memory, More Problems”

  1. Thanks for the write-up. It would seem that the core issue is that core parts of gnome shell have gone unmaintained for some years and institutional knowledge was lost.

  2. I think it’s important to realize that most users won’t have and never will have the technical knowledge to accurately describe a bug or the reasons it might happen, but this doesn’t make their bug report any less valid. When someone reports a “memory leak,” it doesn’t mean it’s an actual memory leak, and it doesn’t mean the developer should close it as invalid when it turns out not to meet the strict definition of what a memory leak is. It means the user is experiencing a problem that, to them, resembles what they think of as a memory leak. It’s a valid, valuable report, because it says something about how your users are experiencing your software, keeping in mind that the vast majority of your users won’t take the time to submit a bug report even when they experience major problems. If you don’t have enough information, ask for it. Troubleshoot. Don’t say that their problem doesn’t exist. They took the time to register and post on your bug tracker. When you close a bug report without resolving it you discourage the user from reporting their next problem, and it’ll be discouraging to others that search for a solution to that problem and find this report, too. That doesn’t make the problem go away.

    Some bug reports may be due to misunderstandings, unrealistic expectations, “user error,” and so on. Of course you can’t fix everything, but even those reports may be valuable because it still says something about how users are experiencing your software and you may be able to improve on that in other ways. Some bugs aren’t in code but in design, documentation, presentation, and so on.

    • Hi, thanks for your comment. I think that we agree on a lot of points and I think that more GNOME maintainers than you realize also agree with you. I read a bit of frustration between the lines, so I’m sorry if that’s the case. If you report bugs to GJS I hope you’ll have a better experience going forward, since I think I do place a lot of value on the things you ask for.

      First of all, I want to clarify that I specifically agree with you on the “not a memory leak” part. I’m in no way trying to penalize users for not meeting the definition of a memory leak. For end users it doesn’t matter, their memory is getting used! That explanation was aimed more at other GNOME developers and the tech press such as OMG!Ubuntu, exactly so we can set better expectations for users. I’ll go back and edit the post to make this clearer.

      That said…

      > it doesn’t mean the developer should close it as invalid when it turns out not to meet the strict definition of what a memory leak is.

      That isn’t what I said at all, and as far as I know we have never played this game of “you said the wrong word, CLOSED INVALID.” If this happened to a bug that you know of, please link me to it, and I’ll reopen and/or investigate it.

      > If you don’t have enough information, ask for it. Troubleshoot. Don’t say that their problem doesn’t exist.

      Look, I take this part of my maintainer tasks really seriously. What you are asking for is exactly what I do, and you’ll see it if you look at the bug reports that I’ve handled during my tenure as GJS maintainer. We have a “Needs Information” label in GitLab exactly for this purpose: to indicate that we don’t have enough information but the problem is acknowledged.

      I also think the memory leak bug reports I linked in the post show that former maintainers of GJS and GNOME Shell did spend a lot of time asking for more information, troubleshooting, and did not deny that a problem existed. On one of them, Jasper disagreed with one of the commenters on the interpretation of the symptoms of the problem, but that’s not the same thing. I’m not saying that dismissiveness never happens (I’ve seen an instance recently, in fact,) but it’s not the norm in the GNOME community.

      > When you close a bug report without resolving it you discourage the user from reporting their next problem, and it’ll be discouraging to others that search for a solution to that problem and find this report, too. That doesn’t make the problem go away.

      As I mentioned in the post, we do close bug reports sometimes if there isn’t anything useful we can do with them. Sometimes they still represent a problem with the software, but they attract so much abuse, which we then have to spend time dealing with, that it just doesn’t make sense to keep them open. Even the newest one (https://gitlab.gnome.org/GNOME/gnome-shell/issues/64) started attracting abusive comments the other day. I really don’t think anyone believes that the problem just goes away when the bug report is closed.

  3. To my eyes, there’s another problem here. Users, when confronted with high memory usage, call it a “memory leak” and blame the application developers.

    In the olden days that was correct: the application was directly allocating and freeing memory and unreasonably high memory usage was always the fault of the application developers. However today, with garbage collection, managed languages and lots of layers in between the person writing the code and the nitty gritty of allocating and freeing memory, that thinking is out-of-date.

    A potential way to set expectations in users and accurately track (and consolidate) bugs might be to draw a clear distinction between a potential memory leak (“memory usage of $APP drastically increases every time I do $THING”) and excess memory usage (“after a long time $APP is using 44GB of RAM”)

    • You make some good points, but I’d prefer not to place the burden of changing on users here. As I said in the post, users’ memory is getting used up, so what does it matter to them? In this case, it even was a situation where “memory usage of gnome-shell drastically increases every time I do $THING.” And anyway, these olden-days kinds of memory leaks still do occur. I’m not trying to “re-educate” users since in the end it doesn’t matter what the problem is called. But instead try to get people who publicize these things to set the right expectations so that especially the technically inclined users know what to expect.

  4. “No, it’s not. GNOME Shell still isn’t that great with memory.”

    Maybe GNOME should not write its shell in JavaScript. Firefox and Chrome heavily use JavaScript and look at their memory consumption.

    • This is off-topic here, but I’ll give you the summary in case you’re new to the discussion. We won’t continue it here, though, as it’s been done to death in many other comments.

      tl;dr: 1) That ship has sailed and it’s not going to change. 2) Not the whole shell is written in JavaScript, most of it is in C. 3) It wouldn’t be extensible without the scripting language and people wouldn’t be able to customize it as easily.

  5. Thanks.

    I’m glad to see GitLab in use, if only because of the ability to use upvotes. A bugtracker with the ability to vote for an issue is twice as useful; Flyspray and others have done that for many years. It allows measuring how important an issue is for the users, and turns many unnecessary posts into a useful +1 which helps to guide the development. I just remember how the whole thing around transparency in gnome-terminal escalated into an ugly nightmare.

    Honestly I’m with Téssio Fechines and I want to ask, was JavaScript the right tool? Years after the emergence of GNOME 3 it still seems to be a problem to tie garbage collection and GObject together. As said, the ship has sailed. But new ships are built too, like Gtk4 and maybe some kind of GNOME4. XML and GConf are gone, too.

    • Yes, there could be a new ship in GNOME 4, but to be honest it’s not a clear-cut conclusion that JavaScript was the wrong choice. Without the shell extensions system gnome-shell would be an entirely different (and, in my opinion, worse) desktop environment, and I think JS was the right language for the shell extensions. But to me it’s a similar question to “Was SpiderMonkey the right choice of JS engine for GJS?” Maybe, maybe not, but it’s what we have. There are other efforts underway to rewrite a GNOME JS platform with v8/Node and JSC, but they’re not as good as GJS yet. Similarly, maybe someone is going to start a cool new thing for gnome-shell 4, but it’s probably not going to be me, so this blog is not a good place to discuss it 🙂

      I do want to clarify that a complete rewrite without JS is not the only possible, or even most likely, solution to this garbage collection problem.

  6. Has anyone in the GNOME community looked at other large FOSS projects like Mozilla for inspiration? Specifically I’m thinking about the MemShrink effort for reducing Firefox’s memory usage. I wouldn’t want to misrepresent what that effort involved, since all I know about it were weekly glimpses from the blog of Nicholas Nethercote, but AFAIU it was focusing on adding instrumentation to account for all allocated memory bytes, and then tracking down suspicious ones.

    Here’s one example that stuck in my memory: Clownshoes available in sizes 2^10+1 and up!
