Working with Pipelines
A typical HPE Machine Learning Data Management workflow involves multiple iterations of experimenting with your code and pipeline specs.
In general, there are five steps to working with a pipeline. The stages can be summarized in the image below.
We will walk through each of the stages in detail.
Step 1: Write Your Analysis Code
Because HPE Machine Learning Data Management is completely language-agnostic, the code that is used to process data in HPE Machine Learning Data Management can be written in any language and can use any libraries of choice. Whether your code is as simple as a bash command or as complicated as a TensorFlow neural network, it needs to be built with all the required dependencies into a container that can run anywhere, including inside of HPE Machine Learning Data Management. See Examples.
Your code does not have to import any special HPE Machine Learning Data Management functionality or libraries. However, it must meet the following requirements:
- Read files from a local file system. HPE Machine Learning Data Management automatically mounts each input data repository as /pfs/<repo_name> in the running containers of your Docker image. Therefore, the code that you write needs to read input data from this directory, similar to any other file system.
  Because HPE Machine Learning Data Management automatically spreads data across parallel containers, your analysis code does not have to deal with data sharding or parallelization. For example, if you have four containers that run your Python code, HPE Machine Learning Data Management automatically supplies 1/4 of the input data to /pfs/<repo_name> in each running container. These workload balancing settings can be adjusted as needed through HPE Machine Learning Data Management tunable parameters in the pipeline specification.
- Write files into a local file system, such as saving results. Your code must write to the /pfs/out directory that HPE Machine Learning Data Management mounts in all of your running containers. Similar to reading data, your code does not have to manage parallelization or sharding.
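The two requirements above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from any HPE Machine Learning Data Management library: the function name process_dir, the input repo name data, and the line-counting step are all hypothetical placeholders for your own analysis. The directories are passed as parameters so the function can be tried on a local folder before it runs in a container.

```python
import os


def process_dir(input_dir, output_dir):
    """Count the lines of every file under input_dir and write one
    "<name>.count" file per input file under output_dir.

    The line count is placeholder logic (an assumption for this example).
    Returns the number of files processed.
    """
    processed = 0
    for root, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            source = os.path.join(root, name)
            with open(source, "r", encoding="utf-8", errors="replace") as handle:
                line_count = sum(1 for _ in handle)
            # Mirror the input layout under the output directory.
            relative = os.path.relpath(source, input_dir)
            target = os.path.join(output_dir, relative + ".count")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "w", encoding="utf-8") as handle:
                handle.write(f"{line_count}\n")
            processed += 1
    return processed


if __name__ == "__main__":
    # Inside a pipeline container, an input repo assumed to be named "data"
    # is mounted at /pfs/data, and results go to /pfs/out.
    process_dir("/pfs/data", "/pfs/out")
```

The code uses only ordinary file system calls and contains no sharding logic, because each container sees only the portion of the input that it was given.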
Step 2: Build Your Docker Image
When you create an HPE Machine Learning Data Management pipeline, you need to specify a Docker image that includes the code or binary that you want to run. Therefore, every time you modify your code, you need to build a new Docker image, push it to your image registry, and update the image tag in the pipeline spec. This section describes one way of building Docker images, but if you have your own routine, feel free to apply it.
To build an image, you need to create a Dockerfile. However, do not use the CMD field in your Dockerfile to specify the commands that you want to run. Instead, add them in the cmd field in your pipeline specification. HPE Machine Learning Data Management runs these commands inside the container during the job execution rather than relying on Docker to run them. The reason is that HPE Machine Learning Data Management cannot execute your code immediately when your container starts, so it runs a shim process in your container instead and then calls your pipeline specification’s cmd from there.
The Dockerfile example below is provided for your reference only. Your Dockerfile might look completely different.
To build a Docker image, complete the following steps:
- If you do not have a registry, create one with a preferred provider. If you decide to use Docker Hub, follow the Docker Hub Quickstart to create a repository for your project.
- Create a Dockerfile for your project. See the OpenCV example.
- Build a new image from the Dockerfile by specifying a tag:
      docker build -t <image>:<tag> .
For more information about building Docker images, see the Docker documentation.
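Because every code change needs a new image and a new tag, one possible way to avoid reusing tags is to derive the tag from the contents of the build directory. The sketch below is only an illustration of that idea: the helper names content_tag and docker_build_command are hypothetical, and hashing the source files is an assumption of this example, not a convention required by HPE Machine Learning Data Management. The only Docker invocation it assembles is the docker build -t <image>:<tag> . command shown above.

```python
import hashlib
import os


def content_tag(source_dir, length=12):
    """Return a short hex tag derived from the relative paths and contents
    of all files under source_dir (hypothetical helper)."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(source_dir):
        dirs.sort()  # visit subdirectories in a stable order
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, source_dir).encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
    return digest.hexdigest()[:length]


def docker_build_command(image, tag, context="."):
    """Assemble the docker build command as an argument list."""
    return ["docker", "build", "-t", f"{image}:{tag}", context]
```

The returned list can be passed to subprocess.run(..., check=True) from the directory that contains the Dockerfile. Identical sources produce the same tag, and any change to a file produces a different one.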
Step 3: Push Your Docker Image to a Registry
Once your image is built and tagged, you need to upload the image into a public or private image registry, such as DockerHub.
Alternatively, you can use the built-in functionality of HPE Machine Learning Data Management to tag and push images by running the pachctl update pipeline command with the --push-images flag. For more information, see Update a pipeline.
- Log in to an image registry. If you use Docker Hub, run:
      docker login --username=<dockerhub-username> --password=<dockerhub-password> <dockerhub-fqdn>
- Push your image to your image registry. If you use Docker Hub, run:
      docker push <image>:<tag>
If a tag such as latest is used, the Kubernetes cluster may become out of sync with the Docker registry, concluding it already has the latest image.
Step 4: Create/Edit the Pipeline Config
HPE Machine Learning Data Management’s pipeline specification files store the configuration information about the Docker image and code that HPE Machine Learning Data Management should run, the input repo(s) of the pipeline, parallelism settings, GPU usage, and so on. Pipeline specifications are stored in JSON or YAML format.
A standard pipeline specification must include the following parameters:
- name
- transform
- input
Check our reference pipeline specification page for a list of all available fields in a pipeline specification file.
You can store your pipeline specifications locally or in a remote location, such as a GitHub repository.
A simple pipeline specification file in JSON would look like the example below.
The pipeline takes its data from the input repo data, runs worker containers with the defined image <image>:<tag> and command, then outputs the resulting processed data in the my-pipeline output repo. During a job execution, each worker sees and reads from the local file system /pfs/data containing only matched data from the glob expression, and writes its output to /pfs/out with standard file system functions; HPE Machine Learning Data Management handles the rest.
# my-pipeline.json
{
  "pipeline": {
    "name": "my-pipeline"
  },
  "transform": {
    "image": "<image>:<tag>",
    "cmd": ["command", "/pfs/data", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/*"
    }
  }
}
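If you generate specifications from a script, the structure above can be assembled and sanity-checked before it is written to disk. The following Python sketch mirrors the example specification. The helpers build_spec and missing_fields are hypothetical names, and the check covers only the three required parameters listed in this section (with name nested under pipeline, as in the example). It is not the validation that HPE Machine Learning Data Management itself performs.

```python
import json


def build_spec(name, image, cmd, repo, glob="/*"):
    """Assemble a specification with the same shape as the example above."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": list(cmd)},
        "input": {"pfs": {"repo": repo, "glob": glob}},
    }


def missing_fields(spec):
    """Return the required fields, as dotted paths, that are absent from spec."""
    required = [("pipeline", "name"), ("transform",), ("input",)]
    missing = []
    for path in required:
        node = spec
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


spec = build_spec(
    name="my-pipeline",
    image="<image>:<tag>",  # placeholder, as in the example above
    cmd=["command", "/pfs/data", "/pfs/out"],
    repo="data",
)
spec_json = json.dumps(spec, indent=2)  # text to save as my-pipeline.json
```

An empty list from missing_fields(spec) means the three required parameters are present. It says nothing about whether their values are valid.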
Step 5: Deploy/Update the Pipeline
As soon as you create a pipeline, HPE Machine Learning Data Management spins up one or more Kubernetes pods in which the pipeline code runs. By default, after the pipeline finishes running, the pods continue to run while waiting for the new data to be committed into the HPE Machine Learning Data Management input repository. You can configure this parameter, as well as many others, in the pipeline specification.
- Create an HPE Machine Learning Data Management pipeline from the spec:
      pachctl create pipeline -f my-pipeline.json
  You can specify a local file or a file stored in a remote location, such as a GitHub repository. For example, https://raw.githubusercontent.com/pachyderm/pachyderm/2.12.x/examples/opencv/edges.json.
- If your pipeline specification changes, you can update the pipeline by running:
      pachctl update pipeline -f my-pipeline.json
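When you iterate often, the "update the image tag, then update the pipeline" part of the loop can be scripted. The sketch below is one hedged way to do it in Python. The helper names set_image and pachctl_command are hypothetical, and the script assumes a JSON specification with a transform section like the example in Step 4. It only assembles the two pachctl commands shown above and leaves running them to you.

```python
import json


def set_image(spec_path, image, tag):
    """Rewrite transform.image in the JSON spec at spec_path; return the spec."""
    with open(spec_path, "r", encoding="utf-8") as handle:
        spec = json.load(handle)
    spec.setdefault("transform", {})["image"] = f"{image}:{tag}"
    with open(spec_path, "w", encoding="utf-8") as handle:
        json.dump(spec, handle, indent=2)
        handle.write("\n")
    return spec


def pachctl_command(action, spec_path):
    """Assemble "pachctl create pipeline -f" or "pachctl update pipeline -f"."""
    if action not in ("create", "update"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["pachctl", action, "pipeline", "-f", spec_path]
```

For example, after pushing a new image you might call set_image("my-pipeline.json", "<image>", "<tag>") with your own values and then pass pachctl_command("update", "my-pipeline.json") to subprocess.run(..., check=True).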