Goroutine leak detection — be careful with unbuffered channels in web request timeout control

Jing · 3 min read · Oct 24, 2022

Recently I took over the development of a web service from another colleague. It was as simple as firing requests and handling responses. Unfortunately, a few things had been done the wrong way, and on a beautiful weekend morning the service crashed: memory leaked and the pod restarted repeatedly:

Memory Leak

It’s terrible.

So I needed to profile the program. Let's first look at the heap:

go tool pprof http://localhost:2022/debug/pprof/heap

Nothing was found.
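By the way, those /debug/pprof endpoints are served by the standard net/http/pprof package. A minimal sketch of how a service might expose them on port 2022 (the port matches the commands in this post; everything else here is an assumption, not the actual service code):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the pprof endpoints; in a real service this usually runs
    // on a separate port from the business traffic.
    go func() {
        _ = http.ListenAndServe("localhost:2022", nil)
    }()

    // ... the actual service would run here ...
    select {}
}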

How about the goroutines? They are a common cause of memory leaks. Luckily this service had Prometheus monitoring configured, and the graph showed a rapid growth in goroutines, nearly 20k of them!

go_goroutines{service='some-service'}
Goroutines growing
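If you want this kind of monitoring yourself: go_goroutines is exported by the Go collector that prometheus/client_golang registers by default. A rough sketch (the /metrics path and port 2112 are assumptions, not the service's real setup):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // The default registry already includes the Go runtime collector,
    // which exports go_goroutines among other metrics.
    http.Handle("/metrics", promhttp.Handler())
    _ = http.ListenAndServe(":2112", nil)
}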

I caught you.

Now let me show you some clues. I profiled the goroutines:

go tool pprof http://localhost:2022/debug/pprof/goroutine

This produced a file, say 1.profile. Five minutes later I profiled again, producing 2.profile. Now let's compare the two:

go tool pprof -base 1.profile 2.profile

And it showed:

Profile compare

Most of the new goroutines were parked in runtime.gopark. Let's trace who made this call:

Traces

It’s func1! So it’s very clear that we need to diagnose func1. First, let’s look at the code (a simplified version):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func request() {
    ch := make(chan struct{})
    // simulate web request
    go func() {
        time.Sleep(2 * time.Second)
        ch <- struct{}{}
    }()
    select {
    case <-ch:
        fmt.Println("request done")
        return
    case <-time.After(1 * time.Second):
        fmt.Println("timeout")
        return
    }
}

func main() {
    go func() {
        for i := 0; i < 100; i++ {
            request()
            time.Sleep(100 * time.Millisecond)
        }
    }()
    go func() {
        ticker := time.NewTicker(1 * time.Second)
        for range ticker.C {
            fmt.Println(runtime.NumGoroutine())
        }
    }()
    time.Sleep(10 * time.Second)
}

Can you guess what it would produce? Have a look:

timeout
4
5
timeout
6
timeout
7
timeout
8
timeout
9
timeout
10
timeout
11
timeout
12
timeout

The goroutine count keeps growing! That’s exactly our problem. So what’s wrong with this code?

Remember the traces we looked at earlier: many goroutines were blocked in runtime.chansend. A send on an unbuffered channel blocks until some goroutine is ready to receive from it. So, er, it’s just this code:

ch <- struct{}{}

request() returns as soon as the timeout expires, but the simulated web request finishes later and finds no goroutine waiting on ch, so the send blocks forever!
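You can see the blocking send in isolation with a stripped-down sketch (not from the original service, just an illustration of the same leak):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    ch := make(chan struct{})

    // Nobody ever receives from ch, so this send blocks forever
    // and the goroutine is leaked.
    go func() {
        ch <- struct{}{}
    }()

    time.Sleep(1 * time.Second)
    fmt.Println(runtime.NumGoroutine()) // prints 2: main plus the leaked sender
}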

How to fix it?

We should take care of this dangling web request goroutine and make sure it exits normally. The easiest way to fix the issue is simply to use a buffered channel. Let’s have a try:

// modify the first line of request()
ch := make(chan struct{}, 1)
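For clarity, here is the whole request() with that one-line change applied (same code as above, only the channel capacity differs):

func request() {
    // buffered with capacity 1, so the send below never blocks
    ch := make(chan struct{}, 1)
    // simulate web request
    go func() {
        time.Sleep(2 * time.Second)
        ch <- struct{}{} // drops the value into the buffer even if nobody is receiving
    }()
    select {
    case <-ch:
        fmt.Println("request done")
        return
    case <-time.After(1 * time.Second):
        fmt.Println("timeout")
        return
    }
}

Now, when the timeout wins, the late send simply goes into the buffer and the goroutine exits; once nothing references the channel anymore, it is garbage collected along with the buffered value.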

Let’s see the difference:

4
timeout
4
timeout
5
timeout
5
timeout
5
timeout
5
timeout
5
timeout
5
timeout
5
timeout

Great!

The goroutine count stays at 5 and never grows! Let’s see the final result, a very clear comparison:

Conclusions

  • Be careful with the timeout control of asynchronous requests.
  • Always make sure you know when your goroutines exit!

Follow me if you are interested: https://github.com/sceneryback
