Hunting Down Memory Leaks in a Golang Service

Date: 2023/07/23
Categories: Tech
Tags: Golang

So I encountered a service that had to be restarted every few hours or so because its memory consumption trajectory looked like SpaceX’s Falcon 9 being launched toward the ISS.

Hmm. Golang is a garbage-collected language, so we just need to call *.Close() on every opened thing, right?

Turns out that is not always the case.

The characteristics of the service

The main purpose of this service is to upload files. Almost ten thousand files are uploaded in a day, or about 15GB/day. The total uploaded up to the time of writing is about 14.56TB. The service is hosted on AWS ECS. Each restart takes up to 15 minutes, and when it happens, a bombardment of complaints comes in from the users. At some point they had to upgrade the instance to one with bigger RAM just to buy more time before the system crashed again.

This is indeed an interesting problem.

pprof

Golang has pprof, a powerful tool for profiling resource consumption. Implementations are available for popular frameworks like chi and gin. This is a super easy example of how to wire pprof into a gin engine using the github.com/gin-contrib/pprof package:

import (
	...
	"github.com/gin-contrib/pprof"
	...
)

...

	// Register the pprof endpoints only when the PPROF env var is truthy.
	isProfilingEnabled, _ := strconv.ParseBool(os.Getenv("PPROF"))
	if isProfilingEnabled {
		pprof.Register(engine)
	}
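With that guard in place, profiling can be toggled per environment just by setting the PPROF environment variable, no code change needed.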

Registering pprof exposes some endpoints in the service for debugging purposes under /debug/*. Since this case is about memory allocation, I’ll focus on allocs.

I immediately planted pprof, fetched the state dump of the service, then displayed it locally using this command:

go tool pprof -http=":8080" http://foobar/debug/pprof/allocs

It’ll open a web browser to show you the data.

ParseMultipartForm

Since the service is mainly about uploading files, I proceeded to reproduce a prod-like situation by spinning up a local instance, bypassing the authentication, then uploading a file repeatedly (1.3MB, 100x).

#!/bin/sh
i=0
while [ $i -ne 100 ]
do
	i=$(($i+1))
	curl --location 'http://localhost:8000/upload' \
	--header 'Authorization: Bearer TOKEN' \
	--form 'file=@"/Users/herpiko/Downloads/foo.txt"'
done

Let’s see the pprof reports.

pprof showed that 386MB were consumed. What is holding all of that? Let’s check the FormFile documentation.

FormFile returns the first file for the provided form key. FormFile calls ParseMultipartForm and ParseForm if necessary.

And then ParseMultipartForm:

ParseMultipartForm parses a request body as multipart/form-data. The whole request body is parsed and up to a total of maxMemory bytes of its file parts are stored in memory, with the remainder stored on disk in temporary files. ParseMultipartForm calls ParseForm if necessary. If ParseForm returns an error, ParseMultipartForm returns it but also continues parsing the request body. After one call to ParseMultipartForm, subsequent calls have no effect.

So if the file size is bigger than maxMemory, the rest of it is saved to disk as temporary files. The problem is that the part kept in memory is not cleared by the GC, even after some time. Even running the GC at an interval in a goroutine does not help. Honestly, I still don’t know how to clear this up in an elegant way, but what if we pass zero as the ParseMultipartForm() parameter? Then nothing is buffered in memory and all uploaded files go straight to the filesystem/disk. Fortunately, gin exposes the Request instance, so we can tweak this easily.
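For context, the upload path goes through gin’s FormFile, roughly like the sketch below. This is my simplified reconstruction, not the actual handler; the handler name, form field, and destination path are assumptions.

import (
	"net/http"
	"path/filepath"

	"github.com/gin-gonic/gin"
)

func uploadHandler(c *gin.Context) {
	// The form has not been parsed yet, so gin parses it here with its default
	// MaxMultipartMemory of 32MB. A 1.3MB upload therefore sits entirely in memory.
	fileHeader, err := c.FormFile("file")
	if err != nil {
		c.AbortWithStatus(http.StatusBadRequest)
		return
	}
	// Persist the uploaded file to disk, then respond.
	if err := c.SaveUploadedFile(fileHeader, filepath.Join("/data/uploads", fileHeader.Filename)); err != nil {
		c.AbortWithStatus(http.StatusInternalServerError)
		return
	}
	c.Status(http.StatusOK)
}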

Because I didn’t want to mess up this system, I used a GrowthBook feature flag to control the value. If something goes wrong after the deployment, I can revert the value back to 32MB.

	// maxMemory in MB, controlled by the feature flag (default 32).
	parseMultiForm := gb.Feature("request-parse-multipart-form").GetValueWithDefault(32.0).(float64)
	// Anything beyond maxMemory is written to temporary files on disk.
	c.Request.ParseMultipartForm(int64(parseMultiForm) << 20)
	// Clean the temporary files up once the handler returns.
	defer c.Request.MultipartForm.RemoveAll()
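Note the ordering: the ParseMultipartForm call has to happen before c.FormFile, otherwise gin will already have parsed the form with its default 32MB MaxMultipartMemory and this call becomes a no-op. The deferred RemoveAll then deletes whatever temporary files ParseMultipartForm wrote to disk once the handler returns.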

Consideration in this case:

After it was deployed, we monitored the service and found that the slope of the memory consumption trajectory declined to about 12 degrees.

It improved, but still did not meet my expectations. At least there was no need to restart the system in the middle of peak hours anymore; it bought us a little time.

So what else was left in memory? We turned on pprof in production. Is it okay to do performance profiling in production? Absolutely yes.

Kafka Writer

A graph of child nodes showed up under the Kafka-related nodes.

Before this, I had only skimmed the codebase and focused on the upload-related code. I didn’t even know this service involved Kafka.

Let me repeat this question.

Golang is a garbage-collected language, so we just need to call *.Close() on every opened thing, right?

This time, you are right. For every Kafka writer that was initiated, there was no Close() being called. Let’s fix that.

writer := kafka.NewWriter(kafka.WriterConfig{
	Brokers:  messagebroker.Brokers,
	Balancer: &kafka.Hash{},
	Dialer:   messagebroker.Dialer,
	Topic:    "statusNew",
})
defer writer.Close() // This is missing
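The deferred Close is the minimal fix when a writer is created per request. A further option is to create the writer once and close it only on shutdown. The sketch below assumes the segmentio/kafka-go client (which the WriterConfig shape suggests) and an invented import path for the messagebroker package; kafka-go writers are safe for concurrent use.

import (
	"context"

	kafka "github.com/segmentio/kafka-go"

	"example.com/foobar/messagebroker" // assumed import path
)

// Created once at startup and shared across requests.
var statusWriter = kafka.NewWriter(kafka.WriterConfig{
	Brokers:  messagebroker.Brokers,
	Balancer: &kafka.Hash{},
	Dialer:   messagebroker.Dialer,
	Topic:    "statusNew",
})

// PublishStatus reuses the shared writer instead of opening a new one per request.
func PublishStatus(ctx context.Context, payload []byte) error {
	return statusWriter.WriteMessages(ctx, kafka.Message{Value: payload})
}

// CloseStatusWriter runs once during graceful shutdown.
func CloseStatusWriter() error {
	return statusWriter.Close()
}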

Pushed the Close() fix to production, and the memory consumption curve flattened near zero.

As Lieutenant General Leslie Richard Groves Jr said,


