Overview
Introduction #
How HPE Machine Learning Data Management Works #
HPE Machine Learning Data Management is deployed within a Kubernetes cluster to manage and version your data using projects, input repositories, pipelines, datums and output repositories. A project can house many repositories and pipelines, and when a pipeline runs a data transformation job it chunks your inputs into datums for processing.
The number of datums is determined by the glob pattern defined in your pipeline specification. If the glob pattern matches all inputs as a single unit, the pipeline processes one datum; if it matches each input file individually, the pipeline processes one datum per file in the input, and so on.
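To make the relationship between glob patterns and datum counts concrete, here is a minimal Python sketch that simulates the grouping on a list of file paths. It is an illustration only, not the product's matching logic: the function name datums_for_glob and the example file names are hypothetical, the matching is done segment by segment with Python's fnmatch, and recursive patterns such as ** are not handled.

```python
from fnmatch import fnmatchcase

# Hypothetical contents of an input repository (illustrative names only).
EXAMPLE_FILES = [
    "/a.png",
    "/b.png",
    "/videos/c.mp4",
    "/videos/d.mp4",
]


def datums_for_glob(files, glob):
    """Group file paths into datums for a simplified glob pattern.

    Assumption: each distinct path prefix that matches the pattern becomes
    one datum, and "/" means the whole input is a single datum.
    """
    pattern_parts = [p for p in glob.split("/") if p]
    if not pattern_parts:
        # "/" matches the repository root, so everything is one datum.
        return {"/": sorted(files)}

    datums = {}
    for path in files:
        parts = [p for p in path.split("/") if p]
        if len(parts) < len(pattern_parts):
            continue  # path is shallower than the pattern, so no match
        head = parts[: len(pattern_parts)]
        if all(fnmatchcase(seg, pat) for seg, pat in zip(head, pattern_parts)):
            key = "/" + "/".join(head)
            datums.setdefault(key, []).append(path)
    return datums


for pattern in ("/", "/*", "/*/*"):
    groups = datums_for_glob(EXAMPLE_FILES, pattern)
    print(f"{pattern!r}: {len(groups)} datum(s) -> {sorted(groups)}")
```

With these example files, "/" yields one datum containing everything, "/*" yields one datum per top-level file or directory, and "/*/*" yields one datum per file inside the videos directory.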
The end result of your data transformation should always be saved to /pfs/out. The contents of /pfs/out are automatically made accessible from the pipeline’s output repository by the same name. So all files saved to /pfs/out for a pipeline named foo are accessible from the foo output repository.
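As a rough sketch of what user code inside a pipeline might look like, the following Python function reads every file under an input directory and writes a copy under an output directory. The input path /pfs/images is an assumption made for illustration (a hypothetical input named images), and the function name transform is likewise hypothetical; only /pfs/out comes from the description above. A real transformation would replace the byte-for-byte copy with actual processing.

```python
from pathlib import Path


def transform(input_dir="/pfs/images", output_dir="/pfs/out"):
    """Copy each input file into the output directory.

    Assumption: input_dir is where the input data is mounted; the default
    "/pfs/images" is a hypothetical example. Files written to output_dir
    are what would end up in the pipeline's output repository.
    """
    in_root = Path(input_dir)
    out_root = Path(output_dir)
    written = []
    for src in sorted(in_root.rglob("*")):
        if not src.is_file():
            continue
        # Preserve the relative layout of the input under the output root.
        dest = out_root / src.relative_to(in_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(src.read_bytes())  # placeholder for real processing
        written.append(dest.relative_to(out_root).as_posix())
    return written
```

Keeping both directories as parameters means the same function can be exercised locally against temporary folders before it is run inside a pipeline.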
Pipelines combine to create directed acyclic graphs (DAGs), and a DAG can consist of just one pipeline. Don’t worry if this sounds confusing! We’ll walk you through the process step-by-step.
How to Interact with HPE Machine Learning Data Management #
You can interact with your HPE Machine Learning Data Management cluster using the PachCTL CLI or through Console, a GUI.
- PachCTL is great for users already experienced with using a CLI.
- Console is great for beginners and helps with visualizing relationships between projects, repos, and pipelines.
Before You Start #
- Complete the First-Time Setup Guide to install the necessary tools and set up your environment.
- Complete the Connect to Existing Instance Guide to connect to your HPE Machine Learning Data Management instance.
- Join our Slack Community so you can ask any questions you may have!
- Try out this Glob Tool to learn how to use glob patterns to select files and directories.
Part 1: Beginner Overview #
In this tutorial, we’ll walk you through how to use HPE Machine Learning Data Management to process images and videos using OpenCV. OpenCV is a popular open-source computer vision library that can be used to perform image processing and video analysis.
The DAG in this tutorial has six steps. It takes in raw photos and video content, draws edge-detected traces, and outputs a comparison collage of the original and processed images:
- Convert videos to MP4 format
- Extract frames from videos
- Trace the outline of each frame and standalone image
- Create .gifs from the traced video frames
- Re-shuffle the content so it is organized by “original” and “traced” images
- Build a comparison collage using a static HTML page
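The six steps above can be pictured as a dependency graph. The sketch below models them with Python's standard graphlib module and prints one valid execution order. Every name in it (raw_media, video_converter, and so on) is hypothetical, and the dependency edges are assumptions inferred from the step descriptions rather than taken from the tutorial's actual pipeline specifications.

```python
from graphlib import TopologicalSorter

# Each key depends on the names in its set. All names and edges are
# illustrative assumptions, not the tutorial's real pipeline definitions.
ASSUMED_DAG = {
    "video_converter": {"raw_media"},                     # convert videos to MP4
    "frame_extractor": {"video_converter"},               # extract frames
    "image_tracer": {"frame_extractor", "raw_media"},     # trace frames and images
    "gif_maker": {"image_tracer"},                        # gifs from traced frames
    "content_shuffler": {"gif_maker", "image_tracer"},    # original vs. traced
    "content_collager": {"content_shuffler"},             # static HTML collage
}


def execution_order(dag):
    """Return one order in which every step runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


for position, name in enumerate(execution_order(ASSUMED_DAG), start=1):
    print(position, name)
```

Because a graph of this kind must stay acyclic, TopologicalSorter raises a CycleError if an edge is added that makes a step depend on its own output.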