Input PFS PPS
Spec #
This is a top-level attribute of the pipeline spec.
{
  "pipeline": {...},
  "transform": {...},
  "input": {
    "pfs": {
      "project": string,
      "name": string,
      "repo": string,
      "repoType": string,
      "branch": string,
      "commit": string,
      "glob": string,
      "joinOn": string,
      "outerJoin": bool,
      "groupBy": string,
      "lazy": bool,
      "emptyFiles": bool,
      "s3": bool,
      "propagation_spec": {
        "never": true
      },
      "reference": bool,
      "trigger": {
        "branch": string,
        "all": bool,
        "rateLimitSpec": string,
        "size": string,
        "commits": int,
        "cronSpec": string
      }
    }
  },
  ...
}
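As a concrete illustration of the schema, the following Python sketch builds only the input attribute as a dictionary and prints it as JSON. The repo name, branch, and glob values are placeholders chosen for the example, not values taken from this page, and only a handful of the listed fields are set.

```python
import json

# Hypothetical values: the repo, branch, and glob below are placeholders.
pfs_input = {
    "name": "images",    # name of the input
    "repo": "images",
    "branch": "master",
    "glob": "/*",
    "lazy": False,
}

# Only the "input" attribute is shown; "pipeline" and "transform" are omitted.
input_attribute = {"input": {"pfs": pfs_input}}

print(json.dumps(input_attribute, indent=2))
```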
Behavior #
input.pfs.name is the name of the input. An input named XXX is visible under the path /pfs/XXX when a job runs. Input names must be unique if the inputs are crossed, but they may be duplicated between PFSInputs that are combined with the union operator. This is because when PFSInputs are combined by union, you only ever see a datum from one input at a time. Giving unioned inputs the same name lets you write simpler code, since you no longer need to consider which input directory a particular datum comes from.
If an input’s name is not specified, it defaults to the name of the repo. Therefore, if you have two crossed inputs from the same repo, you must give at least one of them a unique name.
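The naming rules for crossed inputs can be sketched in Python. This is an illustrative model only, not Pachyderm's implementation: the helpers effective_name and check_cross_names are hypothetical, and the repo names are made up for the example.

```python
from collections import Counter


def effective_name(pfs_input: dict) -> str:
    """Return the input's name, falling back to the repo name when unset."""
    return pfs_input.get("name") or pfs_input["repo"]


def check_cross_names(pfs_inputs: list[dict]) -> list[str]:
    """Return the names (sorted) used more than once among crossed inputs.

    Hypothetical helper: a non-empty result means the cross breaks the
    rule that crossed inputs need unique names.
    """
    counts = Counter(effective_name(i) for i in pfs_inputs)
    return sorted(name for name, n in counts.items() if n > 1)


# Two crossed inputs from the same repo, neither named:
# both default to the repo name "images".
unnamed = [{"repo": "images", "glob": "/*"}, {"repo": "images", "glob": "/"}]

# Naming one of them resolves the clash; it is visible at /pfs/all_images.
named = [
    {"repo": "images", "glob": "/*"},
    {"repo": "images", "name": "all_images", "glob": "/"},
]

print(check_cross_names(unnamed))  # ['images']
print(check_cross_names(named))    # []
```

The same check would not apply to inputs combined with union, where duplicate names are allowed.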
Propagation Spec #
The propagation_spec allows you to set conditions for when commits should propagate.
- If propagation_spec is not set, it defaults to always propagating.
- If propagation_spec.never is set to true, commits from this input do not trigger the pipeline.
- If propagation_spec.never is set to false, commits from this input can trigger the pipeline.
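The three cases can be summarized as a small decision function. This is a sketch, not Pachyderm code: commit_triggers_pipeline is a hypothetical helper, the repo names are placeholders, and treating a propagation_spec with no never field like never: false is an assumption.

```python
def commit_triggers_pipeline(pfs_input: dict) -> bool:
    """Model whether a commit to this input may trigger the pipeline."""
    spec = pfs_input.get("propagation_spec")
    if spec is None:
        # propagation_spec not set: always propagate.
        return True
    # Assumption: a missing "never" field behaves like never: false.
    return not spec.get("never", False)


# Placeholder inputs covering the three documented cases.
default_input = {"repo": "logs", "glob": "/*"}
muted_input = {"repo": "config", "glob": "/", "propagation_spec": {"never": True}}
explicit_input = {"repo": "logs", "glob": "/*", "propagation_spec": {"never": False}}

print(commit_triggers_pipeline(default_input))   # True
print(commit_triggers_pipeline(muted_input))     # False
print(commit_triggers_pipeline(explicit_input))  # True
```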
Reference Inputs #
You can use the reference attribute to mark inputs that do not trigger the pipeline on their own but are used, at their latest version, the next time the pipeline runs. This lets you avoid reprocessing old data unnecessarily.
This functionality is particularly important for scenarios where:
- Users have data that updates at different frequencies (e.g., hourly vs. yearly)
- There’s a need to process new data using the latest version of reference data without triggering reprocessing of all previous data
When input.pfs.reference is set to true:
- Changes to the reference input’s file content do not trigger reprocessing of datums
- Datums produced by a cross of a reference input and a non-reference input will only be reprocessed when the content of the non-reference input file(s) changes
- The reference input will be used in processing, but changes to it alone won’t cause reprocessing
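The effect on a cross of a non-reference input and a reference input can be modeled as follows. This is a simplified sketch under stated assumptions, not Pachyderm's datum logic: the function names and file hashes are hypothetical, the non-reference input is assumed to yield one datum per file, the reference input is assumed to be a single datum, and the reference=False branch assumes that a changed crossed input invalidates every datum.

```python
def changed_files(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Return paths whose content hash is new or differs from the old one."""
    return {path for path, digest in new.items() if old.get(path) != digest}


def datums_to_reprocess(data_old: dict[str, str], data_new: dict[str, str],
                        ref_old: str, ref_new: str, reference: bool) -> set[str]:
    """Return the non-reference files whose datums are reprocessed."""
    data_changed = changed_files(data_old, data_new)
    if reference:
        # Reference input: only changes to the non-reference input count.
        return data_changed
    if ref_old != ref_new:
        # Assumption: without reference, a changed crossed input
        # causes every datum of the cross to be reprocessed.
        return set(data_new)
    return data_changed


# Hypothetical content hashes for two versions of each input.
data_v1 = {"/2023.csv": "a1", "/2024.csv": "b1"}
data_v2 = {"/2023.csv": "a1", "/2024.csv": "b1", "/2025.csv": "c1"}
ref_v1, ref_v2 = "r1", "r2"

# The reference data changed and one new data file arrived.
print(sorted(datums_to_reprocess(data_v1, data_v2, ref_v1, ref_v2, reference=True)))
# ['/2025.csv']
print(sorted(datums_to_reprocess(data_v1, data_v2, ref_v1, ref_v2, reference=False)))
# ['/2023.csv', '/2024.csv', '/2025.csv']

# Only the reference data changed: nothing is reprocessed.
print(datums_to_reprocess(data_v1, data_v1, ref_v1, ref_v2, reference=True))
# set()
```

In the first case, the new datum for /2025.csv would be processed against the latest reference version, while the datums for the older files are left alone.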