Motivation and Concept
APM (Application Performance Monitoring) tools like Datadog are important for our system because they provide valuable insights into the performance, behavior, and security of our applications. They allow us to identify performance issues in real time, improve application performance, troubleshoot problems, plan capacity, and identify security vulnerabilities.
The performance tracing implementation works like this:
- Run an agent beside your service(s). One agent can serve several services; it acts as a gateway that ships data to the Datadog APIs.
- Inject the tracer inside your app.
- For a typical HTTP service, the tracer by default watches each request that comes through the router.
- For deeper performance monitoring, custom instrumentation is possible, all the way down to SQL queries.
- Inter-service performance monitoring is also possible; see Distributed Tracing below.
The yellow-colored parts are the ones that you need to set up.
Examples
First, get your DATADOG_API_KEY, then run the agent:
docker run -d --cgroupns host \
--pid host \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-p 127.0.0.1:8126:8126/tcp \
-e DD_API_KEY=<DATADOG_API_KEY> \
-e DD_APM_ENABLED=true \
gcr.io/datadoghq/agent:latest
This is a basic example of a ddtrace implementation for go-chi:
package main

import (
	"net/http"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
	ddchi "gopkg.in/DataDog/dd-trace-go.v1/contrib/go-chi/chi.v5"
	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func welcome(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("Hello world!"))
}

func main() {
	tracer.Start()      // Start the tracer (talks to the local Datadog agent)
	defer tracer.Stop() // Flush and stop the tracer on shutdown

	r := chi.NewRouter()
	r.Use(ddchi.Middleware()) // Inject the tracer: every request gets a root span
	r.Use(middleware.Logger)
	r.Get("/hello", welcome)
	http.ListenAndServe(":8000", r)
}
Then run the service, pointing it at the agent through the DD_AGENT_HOST environment variable:
$ DD_AGENT_HOST=localhost go run single/main.go
If you try to hit http://localhost:8000/hello several times, the traced data will be shown in the Datadog APM dashboard.
Custom Instrumentation
Custom instrumentation allows us to get deep insight into the performance of our service. Imagine the /hello endpoint has to call another function like this:
func functionA() {
	time.Sleep(60 * time.Millisecond)
}

func welcome(w http.ResponseWriter, r *http.Request) {
	time.Sleep(10 * time.Millisecond)
	functionA()
	time.Sleep(10 * time.Millisecond)
	w.Write([]byte("Hello world!"))
}
I intentionally added some delays to simulate a bottleneck.
Nested flamegraph
In the code below, I put tracer.StartSpanFromContext at the start of the function. You can see that r.Context() is passed to the other function. This context contains important tracing metadata (like process-id and runtime-id) that helps Datadog build the visual flamegraph of your performance profile later.
func functionA(ctx context.Context) {
	span, _ := tracer.StartSpanFromContext(ctx, "functionA", tracer.ResourceName("someParam"))
	defer span.Finish()
	time.Sleep(60 * time.Millisecond)
}

func welcome(w http.ResponseWriter, r *http.Request) {
	time.Sleep(10 * time.Millisecond)
	functionA(r.Context())
	time.Sleep(10 * time.Millisecond)
	w.Write([]byte("Hello world!"))
}
If you hit the endpoint again, you'll see that functionA() bottlenecks our /hello endpoint:
Synchronous execution
Sibling spans in the flamegraph are also possible. Assuming that functionA() and functionB() are executed synchronously one after another, and functionC() is nested under functionB(), consider this example code:
func functionA(ctx context.Context) {
	span, _ := tracer.StartSpanFromContext(ctx, "functionA", tracer.ResourceName("someParam"))
	defer span.Finish()
	time.Sleep(60 * time.Millisecond)
}

func functionB(ctx context.Context) {
	// Keep the returned ctx so functionC's span nests under functionB's span
	span, ctx := tracer.StartSpanFromContext(ctx, "functionB", tracer.ResourceName("someParam"))
	defer span.Finish()
	time.Sleep(10 * time.Millisecond)
	functionC(ctx)
}

func functionC(ctx context.Context) {
	span, _ := tracer.StartSpanFromContext(ctx, "functionC", tracer.ResourceName("someParam"))
	defer span.Finish()
	time.Sleep(60 * time.Millisecond)
}

func welcome(w http.ResponseWriter, r *http.Request) {
	time.Sleep(10 * time.Millisecond)
	functionA(r.Context())
	functionB(r.Context())
	time.Sleep(10 * time.Millisecond)
	w.Write([]byte("Hello world!"))
}
You can see that r.Context() is passed twice, and the derived context is then passed again from functionB() to functionC(). This kind of context passing builds a flamegraph like this:
Database performance
Tracing database query performance is done by wrapping the database driver, like this:
sqltrace "gopkg.in/DataDog/dd-trace-go.v1/contrib/database/sql"
...
sqltrace.Register("pq", &pq.Driver{}, sqltrace.WithServiceName("postgres"))
db, err := sqltrace.Open("postgres", args)
if err != nil {
panic(err.Error())
}
Then, instead of executing a query like this:
db.Query(statement)
you now have to append Context to the function name and pass a context.Context object:
db.QueryContext(ctx, statement)
The same applies to the other database functions (QueryRow() becomes QueryRowContext(), Exec() becomes ExecContext(), and so on).
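As a sketch of how this ties back to the handler (the listUsers handler and users table here are hypothetical, and db is assumed to be the traced *sql.DB returned by sqltrace.Open above), running the query with the request's context makes it show up as a child span in the same flamegraph:

func listUsers(w http.ResponseWriter, r *http.Request) {
	// r.Context() carries the request span created by the router middleware,
	// so this query appears as a nested span in the same trace.
	rows, err := db.QueryContext(r.Context(), "SELECT id, name FROM users")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer rows.Close()

	for rows.Next() {
		var (
			id   int
			name string
		)
		if err := rows.Scan(&id, &name); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintf(w, "%d: %s\n", id, name)
	}
}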
Distributed Tracing
Distributed tracing provides an inter-service flamegraph so you can inspect your services' performance end to end across several services. To do this, you have to pass the context along using a wrapped HTTP client and a propagated context. Take a look at this code from, let's say, serviceA:
httptrace "gopkg.in/DataDog/dd-trace-go.v1/contrib/net/http"
...
httpClient = httptrace.WrapClient(&http.Client{})
req, _ := http.NewRequest("GET", "http://service-b/hello", nil)
req = req.WithContext(ctx)
carrier := opentracing.HTTPHeadersCarrier(req.Header)
_ = tracer.Inject(span.Context(), carrier)
httpClient.Do(req)
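For a fuller picture, a handler in serviceA might look something like the sketch below (the callServiceB name and the service-b hostname are placeholders). Note that the wrapped client already starts a client span and injects the propagation headers for you whenever the outgoing request's context carries a span, for example when the handler runs behind the ddchi middleware:

func callServiceB(w http.ResponseWriter, r *http.Request) {
	httpClient := httptrace.WrapClient(&http.Client{})

	// Reuse the incoming request context so the outgoing call is traced
	// as a child of the current request span.
	req, err := http.NewRequestWithContext(r.Context(), "GET", "http://service-b/hello", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	resp, err := httpClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	io.Copy(w, resp.Body) // relay serviceB's response body
}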
Then the HTTP handler in serviceB:
func hello(w http.ResponseWriter, r *http.Request) {
	// Extract from propagated context
	spanCtx, err := tracer.Extract(opentracing.HTTPHeadersCarrier(r.Header))
	if err != nil {
		log.Println(err)
	}

	span := tracer.StartSpan("dummyhelloservice", tracer.ResourceName("/hello"), tracer.ChildOf(spanCtx))
	defer span.Finish()

	log.Printf("CONTEXT %+v", span.Context())
	w.Write([]byte("Hello World\n"))
}
This inter-service context propagation lets Datadog build a flamegraph like this:
If something sits between the services (e.g. a proxy or gateway), it must also support context propagation. Some known proxies that have this feature/plugin/extension are Nginx and Envoy Proxy.
Impact on the codebase
You have to get used to using context.Context, and each time you write a function you need to consider whether you want to trace its performance or not.
For a system-wide implementation, and to minimize the refactoring effort, you can use a helper like this:
/*
 * GetFunctionName returns the name of the function at the given
 * program counter; heavily used by the ddtrace implementation.
 * Usage:
 *
 *   pc, _, _, _ := runtime.Caller(0)
 *   log.Println(utils.GetFunctionName(pc))
 */
func GetFunctionName(pc uintptr) string {
	parts := strings.Split(runtime.FuncForPC(pc).Name(), ".")
	return parts[len(parts)-1]
}
Then put these lines at the start of any function that you want to trace:
func SomeFunction(ctx context.Context, foo string) error {
	pc, _, _, _ := runtime.Caller(0)
	span, ctx := tracer.StartSpanFromContext(
		ctx,
		utils.GetFunctionName(pc),
		tracer.ResourceName("someParam"),
	)
	defer span.Finish()

	...
}
Please remember to pass the context.Context wherever you need it. Losing the context means losing the correct flamegraph representation.
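As a small hypothetical illustration of that failure mode: if a function drops the incoming context and starts its span from context.Background(), the span has no parent and shows up as a separate, disconnected trace instead of nesting under /hello.

// Wrong: context.Background() carries no parent span, so this span
// becomes the root of a brand-new, disconnected trace.
func brokenFunction(_ context.Context) {
	span, _ := tracer.StartSpanFromContext(context.Background(), "brokenFunction")
	defer span.Finish()
}

// Right: the incoming ctx carries the parent span, so this span
// nests correctly inside the request's flamegraph.
func tracedFunction(ctx context.Context) {
	span, _ := tracer.StartSpanFromContext(ctx, "tracedFunction")
	defer span.Finish()
}

Threading ctx through every call is what keeps the whole flamegraph connected.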