Anatomy of a Pipeline YAML

1. Fundamental Structure

Every pipeline consists of these core components:

pipeline.yaml
├── Metadata (name, version)
├── Inputs (data entry points)
├── Outputs (results delivery)
├── Nodes (processing units)
└── Flows (data highways)
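
Assembled into one file, a minimal skeleton might look like this (all names here are illustrative; each section is detailed below):

name: "Minimal Example"        # Metadata
version: "0.1.0"

inputs:
  source: { type: image }      # Data entry point

outputs:
  result: { type: image }      # Results delivery

nodes:
  processor:                   # Processing unit
    script: "process.js"
    inputs:
      data: { type: image }
    outputs:
      data: { type: image }

flows:
  source_to_processor:         # Data highway
    from: input.source
    to: processor.data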

2. Component Deep Dive

2.1 Metadata Section

name: "Face Styler"         # Pipeline identifier
version: "1.0.2" # Versioning (SemVer recommended)
description: "Transforms portraits into artistic styles"

Why it matters:

  • name appears in UI/API
  • version enables change tracking
  • description helps discovery

2.2 Inputs/Outputs System

Inputs (Pipeline's API)

inputs:
  user_photo:               # Logical name
    type: image             # Data type constraint
    required: true          # Validation rule
    title: "Your Portrait"  # UI label
    default: null           # Fallback value

Outputs (Results Interface)

outputs:
  styled_image:
    type: image
    title: "Artistic Version"
  analysis_report:
    type: json

Key Differences:

Aspect       Inputs              Outputs
------       ------              -------
Purpose      Data ingestion      Result delivery
Mutability   User-provided       Read-only
Validation   Required/optional   Always generated

2.3 Nodes Architecture

nodes:
  face_detector:                # Node ID
    category: "image_analysis"  # Functional group
    script: "detect_faces.js"   # Processing logic
    inputs:                     # Required data
      photo: { type: image }
    outputs:                    # Produced artifacts
      faces: { type: json }

Node Types:

  • Input processors: First data handlers
  • Transformers: Data modifiers
  • Terminals: Produce final outputs
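
A sketch placing all three roles in one nodes: block (face_detector and its script come from the example above; the other node and script names are illustrative):

nodes:
  photo_loader:                # Input processor: first data handler
    script: "load_photo.js"    # Illustrative script name
  face_detector:               # Transformer: derives faces from the photo
    script: "detect_faces.js"
  result_writer:               # Terminal: produces a final output
    script: "write_result.js"  # Illustrative script name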

2.4 Flow Mechanics

flows:
  detection_to_styling:
    from: face_detector.faces   # Source node.output
    to: style_applier.faces     # Target node.input
    conditions:                 # Optional rules
      - min_faces: 1

Flow Types:

  1. Linear: Sequential (A→B→C)
  2. Fan-out: One-to-many (A→B, A→C)
  3. Conditional: Branched (A→B if X else C)
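
Using the flows: schema from above, a sketch of the fan-out and conditional shapes (face_detector and style_applier come from earlier examples; report_builder and its faces port are hypothetical):

flows:
  detect_to_style:              # Fan-out branch 1, gated by a condition
    from: face_detector.faces
    to: style_applier.faces
    conditions:                 # Conditional: fires only with at least one face
      - min_faces: 1
  detect_to_report:             # Fan-out branch 2 (hypothetical target node)
    from: face_detector.faces
    to: report_builder.faces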

3. Execution Lifecycle

3.1 Startup Sequence

start:
  nodes:
    - initial_processor   # Entry point
    - parallel_starter    # Concurrent init

3.2 Environment Setup

environment:
  API_ENDPOINT:
    title: "Service URL"
    type: string
    scope: pipeline   # vs 'global' scope
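
The scope comment implies a second, global scope; a sketch contrasting the two (LOG_LEVEL and the exact sharing semantics are assumptions):

environment:
  API_ENDPOINT:
    title: "Service URL"
    type: string
    scope: pipeline     # Visible only within this pipeline
  LOG_LEVEL:            # Hypothetical variable
    title: "Log verbosity"
    type: string
    scope: global       # Assumed: shared across pipelines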

4. Real-world Example

# document_processor.yaml
name: "PDF Analyzer"

inputs:
  pdf_file:
    type: file
    formats: [pdf]

outputs:
  text_content: string
  page_count: number

nodes:
  pdf_extractor:
    script: "pdf.js"
    inputs:
      document: { from: pdf_file }
    outputs:
      raw_text: { to: text_content }
      pages: { to: page_count }

flows:
  file_processing:
    from: input.pdf_file
    to: pdf_extractor.document
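
The extractor wires its results inline with to:; assuming inline wiring and explicit flows are interchangeable, the same delivery could also be spelled out (the output. prefix mirrors the input. prefix above and is an assumption):

flows:
  text_delivery:
    from: pdf_extractor.raw_text
    to: output.text_content
  count_delivery:
    from: pdf_extractor.pages
    to: output.page_count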

What happens when run:

  1. User uploads PDF
  2. System routes file to pdf_extractor
  3. Script processes document
  4. Results populate both outputs

5. Design Principles

  1. Modularity:

    • Nodes should be single-purpose
    • Example: separate face_detector and style_applier (see the sketch after this list)
  2. Discoverability:

    description: "Extracts text from scanned PDFs using OCR"
    tags: ["documents", "text-recognition"]
  3. Error Resilience:

    inputs:
      photo:
        constraints:
          min_resolution: [512, 512]
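
The modularity sketch referenced in item 1, using the node schema from section 2.3 (apply_style.js is an illustrative script name):

nodes:
  face_detector:              # One job: find faces
    script: "detect_faces.js"
    outputs:
      faces: { type: json }
  style_applier:              # One job: apply the style
    script: "apply_style.js"  # Illustrative script name
    inputs:
      faces: { type: json }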

6. Anti-patterns to Avoid

Monolithic Nodes

# Bad: Does too much
nodes:
  mega_processor:
    script: "do_everything.js"

Preferred Approach

# Good: Separated concerns
nodes:
  preprocessor: {...}
  analyzer: {...}
  formatter: {...}

7. Debugging Tips

  1. Flow Visualization:

    graph LR
    A[Input] --> B(Processor)
    B --> C[Output]
  2. Validation Command:

    pipeline validate my_pipeline.yaml
  3. Inspection Points:

    nodes:
      debug_logger:
        script: "log_intermediate.js"
