Input PFS PPS

Spec

This is a top-level attribute of the pipeline spec.

{
    "pipeline": {...},
    "transform": {...},
    "input": {
        "pfs": {
            "project": string,
            "name": string,
            "repo": string,
            "repoType": string,
            "branch": string,
            "commit": string,
            "glob": string,
            "joinOn": string,
            "outerJoin": bool,
            "groupBy": string,
            "lazy": bool,
            "emptyFiles": bool,
            "s3": bool,
            "propagation_spec": {
                "never": bool
            },
            "reference": bool,
            "trigger": {
                "branch": string,
                "all": bool,
                "rateLimitSpec": string,
                "size": string,
                "commits": int,
                "cronSpec": string
            }
        }
    },
    ...
}

Behavior

input.pfs.name is the name of the input. An input named XXX is visible under the path /pfs/XXX when a job runs. Input names must be unique among inputs that are crossed, but they may be repeated across PFSInputs that are combined with the union operator. This works because a union only ever presents a datum from one input at a time, so the shared path is never ambiguous. Giving unioned inputs the same name lets you write simpler code, since you no longer need to check which input directory a particular datum comes from. If an input's name is not specified, it defaults to the name of the repo. Therefore, if you cross two inputs from the same repo, you must give at least one of them a unique name.
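The naming rules above can be illustrated with two small specs. These are sketches rather than tested pipelines: the repo names (logs-2023, logs-2024, images), the pipeline names, the image, and the commands are hypothetical, and the union and cross wrappers are assumed to be the input operators the paragraph refers to. In the first spec, both unioned inputs share the name data, so the code only ever reads /pfs/data:

```json
{
  "pipeline": {"name": "example-union"},
  "transform": {
    "image": "alpine:3.19",
    "cmd": ["sh", "-c", "ls /pfs/data"]
  },
  "input": {
    "union": [
      {"pfs": {"repo": "logs-2023", "name": "data", "glob": "/*"}},
      {"pfs": {"repo": "logs-2024", "name": "data", "glob": "/*"}}
    ]
  }
}
```

In the second spec, both crossed inputs come from the same repo. The first keeps the default name (images, the repo name), so the second is given an explicit, unique name:

```json
{
  "pipeline": {"name": "example-cross"},
  "transform": {
    "image": "alpine:3.19",
    "cmd": ["sh", "-c", "ls /pfs/images /pfs/images-other"]
  },
  "input": {
    "cross": [
      {"pfs": {"repo": "images", "glob": "/*"}},
      {"pfs": {"repo": "images", "name": "images-other", "glob": "/*"}}
    ]
  }
}
```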

Propagation Spec

The propagation_spec allows you to set conditions for when commits should propagate.

  • If propagation_spec is not set, it will default to always propagating.
  • If propagation_spec.never is set to true, it will prevent commits from this input from triggering the pipeline.
  • If propagation_spec.never is set to false, it will allow commits from this input to trigger the pipeline.
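As a rough sketch of the rules above, the spec below crosses two inputs and sets propagation_spec.never on only one of them. The repo names (events, lookup), the pipeline name, the image, and the command are hypothetical, and the cross wrapper is an assumption about how the two inputs would be combined. Under these assumptions, commits to events would trigger the pipeline as usual, while commits to lookup would not:

```json
{
  "pipeline": {"name": "example-propagation"},
  "transform": {
    "image": "alpine:3.19",
    "cmd": ["sh", "-c", "ls /pfs/events /pfs/lookup"]
  },
  "input": {
    "cross": [
      {"pfs": {"repo": "events", "glob": "/*"}},
      {
        "pfs": {
          "repo": "lookup",
          "glob": "/",
          "propagation_spec": {"never": true}
        }
      }
    ]
  }
}
```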

Reference Inputs

Set the reference attribute on an input to make it available to a pipeline without letting it trigger the pipeline. The next time the pipeline runs, it uses the latest version of the reference input, so old data is not reprocessed unnecessarily.

This functionality is particularly important for scenarios where:

  • Users have data that updates at different frequencies (e.g., hourly vs. yearly)
  • There’s a need to process new data using the latest version of reference data without triggering reprocessing of all previous data

When input.pfs.reference is set to true:

  • The pipeline will not trigger reprocessing of datums when the file content of the reference input changes
  • Datums produced by a cross of a reference input and a non-reference input will only be reprocessed when the content of the non-reference input file(s) changes
  • The reference input will be used in processing, but changes to it alone won’t cause reprocessing

Warning
Using reference inputs may affect reproducibility, as the output of a job may not be purely a function of the input.
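
The hourly versus yearly scenario described above might look like the following sketch. The repo names (readings, calibration), the pipeline name, the image, and the command are hypothetical, and the cross wrapper is assumed from the behavior list. Here readings is an ordinary input and calibration is marked as a reference input:

```json
{
  "pipeline": {"name": "example-reference"},
  "transform": {
    "image": "alpine:3.19",
    "cmd": ["sh", "-c", "ls /pfs/readings /pfs/calibration"]
  },
  "input": {
    "cross": [
      {"pfs": {"repo": "readings", "glob": "/*"}},
      {"pfs": {"repo": "calibration", "glob": "/", "reference": true}}
    ]
  }
}
```

With a spec like this, a new file in readings would be processed against the latest calibration data, while a new commit to calibration alone would not cause existing datums to be reprocessed.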