Tracing (Jaeger)
HPE Machine Learning Data Management can trace requests using Jaeger, which is useful when diagnosing slow clusters.
Collecting Traces
To use tracing in HPE Machine Learning Data Management, complete the following steps:
- Run Jaeger in Kubernetes:
    kubectl apply -f https://raw.githubusercontent.com/pachyderm/pachyderm/v2.12.0/etc/deploy/tracing/jaeger-all-in-one.yaml
- Point HPE Machine Learning Data Management at Jaeger:
  - For pachctl, run:
      export JAEGER_ENDPOINT=localhost:14268
      kubectl port-forward svc/jaeger-collector 14268 & # Collector service
  - For pachd, run:
      kubectl delete po -l suite=pachyderm,app=pachd
  The port-forward command is necessary because pachctl sends traces to Jaeger (it actually initiates every trace) and reads the JAEGER_ENDPOINT environment variable for the address to which it sends the trace information.
  Restarting the pachd pod is necessary because pachd also sends trace information to Jaeger, but it reads the environment variables corresponding to the Jaeger service[1] on startup to find Jaeger (the Jaeger service is created by the jaeger-all-in-one.yaml manifest). Killing the pods restarts them, which causes them to connect to Jaeger.
- Send HPE Machine Learning Data Management a traced request by setting the PACH_TRACE environment variable to “true” before running any pachctl command (note that JAEGER_ENDPOINT must also be set/exported):
    PACH_TRACE=true pachctl list job # for example
  HPE Machine Learning Data Management does not recommend exporting PACH_TRACE, because tracing every call can slow the calls down and make interesting traces hard to find in Jaeger. Instead, consider setting this variable only for the specific calls you want to trace.
  Note that HPE Machine Learning Data Management’s client library reads this variable and implements the relevant tracing, so any binary that uses HPE Machine Learning Data Management’s Go client library can trace calls if these variables are set.
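The advice to set PACH_TRACE per call rather than exporting it can be captured in a small shell helper. The following is a minimal sketch, not part of pachctl: the function name traced is hypothetical, and it assumes only what the steps above state (PACH_TRACE must be "true" and JAEGER_ENDPOINT must already be set).

```shell
# traced: run a single command with PACH_TRACE=true, without exporting it.
# The name "traced" is a hypothetical helper, not a pachctl feature.
traced() {
  # pachctl needs JAEGER_ENDPOINT to know where to send the trace.
  if [ -z "${JAEGER_ENDPOINT:-}" ]; then
    echo "traced: JAEGER_ENDPOINT is not set; export it first" >&2
    return 1
  fi
  # env sets PACH_TRACE for this one process only; the calling shell's
  # environment is left untouched.
  env PACH_TRACE=true "$@"
}
```

With this defined, `traced pachctl list job` traces that one call, while plain `pachctl` invocations in the same shell stay untraced.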
View Traces
To view traces, run:
kubectl port-forward svc/jaeger-query 16686:80 & # UI service
Then, connect to localhost:16686 in your browser, and you should see all collected traces.
Troubleshooting
- If you see <trace-without-root-span>, this likely means that pachd has connected to Jaeger, but pachctl has not. Make sure that the JAEGER_ENDPOINT environment variable is set on your local machine, and that kubectl port-forward "po/${jaeger_pod}" 14268 is running.
- If you see a trace appear in Jaeger with no subtraces, this might mean that pachd has not connected to Jaeger, but pachctl has. Restart the pachd pods after creating the Jaeger service in Kubernetes.
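Both symptoms come down to one side not knowing where Jaeger is, so a quick local check can narrow things down. The sketch below is illustrative only and both function names are hypothetical. The first function checks that JAEGER_ENDPOINT is set and has the host:port shape used in this guide (an assumption; other endpoint formats are not handled). The second derives the environment variable prefix that Kubernetes gives pods for a Service, assuming the Jaeger service in question is jaeger-collector as in the port-forward command; which of those variables pachd actually reads is not stated here.

```shell
# check_jaeger_endpoint: client-side check for the pachctl symptom.
# Returns 0 if JAEGER_ENDPOINT looks like host:port, 1 if unset, 2 if malformed.
check_jaeger_endpoint() {
  endpoint="${JAEGER_ENDPOINT:-}"
  if [ -z "$endpoint" ]; then
    echo "JAEGER_ENDPOINT is not set" >&2
    return 1
  fi
  case "$endpoint" in
    *:*) ;;  # contains a colon, keep going
    *) echo "JAEGER_ENDPOINT '$endpoint' is not host:port" >&2; return 2 ;;
  esac
  host="${endpoint%:*}"   # everything before the last colon
  port="${endpoint##*:}"  # everything after the last colon
  case "$port" in
    ''|*[!0-9]*) echo "port '$port' is not numeric" >&2; return 2 ;;
  esac
  if [ -z "$host" ]; then
    echo "JAEGER_ENDPOINT has an empty host" >&2
    return 2
  fi
  echo "JAEGER_ENDPOINT ok: host=$host port=$port"
}

# k8s_service_env_prefix: Kubernetes exposes a Service to pods as
# <PREFIX>_SERVICE_HOST and <PREFIX>_SERVICE_PORT, where PREFIX is the
# service name upper-cased with dashes replaced by underscores.
k8s_service_env_prefix() {
  printf '%s\n' "$1" | tr 'a-z-' 'A-Z_'
}
```

For example, `k8s_service_env_prefix jaeger-collector` prints `JAEGER_COLLECTOR`. Kubernetes injects these variables only when a pod is created, which is why pachd pods that started before the Jaeger service existed need a restart.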