Tracing (Jaeger)

HPE Machine Learning Data Management has the ability to trace requests using Jaeger. This can be useful when diagnosing slow clusters.

Successful Trace

Collecting Traces

To use tracing in HPE Machine Learning Data Management, complete the following steps:

  1. Run Jaeger in Kubernetes

    kubectl apply -f https://raw.githubusercontent.com/pachyderm/pachyderm/v2.12.0/etc/deploy/tracing/jaeger-all-in-one.yaml
  2. Point HPE Machine Learning Data Management at Jaeger

    • For pachctl, run:

      export JAEGER_ENDPOINT=localhost:14268
      kubectl port-forward svc/jaeger-collector 14268 & # Collector service
    • For pachd, run:

      kubectl delete po -l suite=pachyderm,app=pachd

      The port-forward command is necessary because pachctl sends traces to Jaeger (it actually initiates every trace), and reads the JAEGER_ENDPOINT environment variable for the address to which it will send the trace info.

      Restarting the pachd pod is necessary because pachd also sends trace information to Jaeger, but it reads the environment variables corresponding to the Jaeger service[1] on startup to find Jaeger (the Jaeger service is created by the jaeger-all-in-one.yaml manifest). Killing the pods restarts them, which causes them to connect to Jaeger.

  3. Send HPE Machine Learning Data Management a traced request by setting the PACH_TRACE environment variable to “true” before running any pachctl command (note that JAEGER_ENDPOINT must also be set/exported):

    PACH_TRACE=true pachctl list job # for example

    HPE Machine Learning Data Management does not recommend exporting PACH_TRACE because tracing calls can slow them down and make interesting traces hard to find in Jaeger. Therefore, you might want to set this variable for the specific calls you want to trace.

    However, HPE Machine Learning Data Management’s client library reads this variable and implements the relevant tracing, so any binary that uses HPE Machine Learning Data Management’s go client library can trace calls if these variables are set.

View Traces

To view traces, run:

kubectl port-forward svc/jaeger-query 16686:80 & # UI service

Then, connect to localhost:16686 in your browser, and you should see all collected traces.

note icon Note

Troubleshooting

  1. If you see <trace-without-root-span>, this likely means that pachd has connected to Jaeger, but pachctl has not. Make sure that the JAEGER_ENDPOINT environment variable is set on your local machine, and that kubectl port-forward "po/${jaeger_pod}" 14268 is running.

  2. If you see a trace appear in Jaeger with no subtraces, like so:

    Trace with no children

    This might mean that pachd has not connected to Jaeger, but pachctl has. Restart the pachd pods after creating the Jaeger service in Kubernetes.