# Config Composition & Reusable Components with ZenDag
Hydra is powerful for configuration management, especially its ability to compose configurations from smaller, reusable pieces. ZenDag leverages this: you can define common components (like loggers, trainers, data modules) as separate configurations and then include them in your main stage configs. ZenDag will still discover any `deps_path` or `outs_path` declarations within these composed parts.
## Example: A Reusable File Logger
Let’s define a configuration for a simple file logger. This logger will write to a file, and we want DVC to track this log file as an output of any stage that uses this logger.
### Defining the Logger Configuration
Create `configs/loggers_config.py`:
```python
# configs/loggers_config.py
import logging  # Standard logging
from pathlib import Path

from hydra_zen import builds, store

from zendag.config_utils import outs_path  # Logger's output file is a DVC output


# This is a simplified function. In reality, it would configure the logging system.
# For ZenDag's dvc.yaml generation, we primarily care that it defines an output path.
# The actual logging setup happens when the stage runs and Hydra instantiates this.
def setup_stage_file_logger(log_file_path_str: str, log_level: str = "INFO"):
    """
    (Mock) Sets up a file logger for a stage.

    The actual configuration of the Python logging system would happen here
    when Hydra instantiates this part of the config during stage execution.
    """
    log_file_path = Path(log_file_path_str)
    log_file_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

    # Simulate logger setup for demonstration
    print(f"[LoggerSetup] Configuring file logger at: {log_file_path} with level {log_level}")

    # In a real scenario, you might return a configured logger object or just perform side effects.
    # For ZenDag's config resolution, the important part is that `log_file_path_str` uses `outs_path`.
    return {"log_file": str(log_file_path), "level": log_level}


# Hydra-Zen config for our file logger
FileLoggerConfig = builds(
    setup_stage_file_logger,
    populate_full_signature=True,
    # The log file path is an output of the stage using this logger.
    # It will be relative to the stage's output directory.
    log_file_path_str=outs_path("logs/stage_execution.log"),
    log_level="DEBUG",  # Default log level for this config
)

# Register it in a 'logger' group
store(FileLoggerConfig, group="logger", name="default_file_logger")

# Another variant
VerboseFileLoggerConfig = builds(
    setup_stage_file_logger,
    populate_full_signature=True,
    log_file_path_str=outs_path("logs/verbose_stage_execution.log"),
    log_level="NOTSET",  # NOTSET (level 0) is lower than DEBUG, letting everything through
)
store(VerboseFileLoggerConfig, group="logger", name="verbose_file_logger")
```
### Using the Logger in a Stage
Let’s modify the `TransformConfig` from our Quickstart Notebook to include this logger, using Hydra’s `hydra_defaults`. Modify `configs/transform_config.py`:
```python
# configs/transform_config.py (modified)
from hydra_zen import builds, store

from zendag.config_utils import deps_path, outs_path

# Assume transform_data is in my_project.stages.simple_transform
from my_project.stages.simple_transform import transform_data  # Or your actual import

# Option 1: Stage function is unaware of the logger (Hydra instantiates it)
TransformConfigWithLogger = builds(
    transform_data,  # transform_data itself doesn't take a logger argument here
    populate_full_signature=True,
    input_csv_path=deps_path("data/raw/input.csv"),
    output_csv_path=outs_path("data/processed/output_with_logging.csv"),
    scale_factor=2.5,
    # --- Hydra Defaults for Composition ---
    hydra_defaults=[
        "_self_",  # Always include this first
        {"logger": "default_file_logger"},  # Load 'default_file_logger' from the 'logger' group
        # To use the other logger: {"logger": "verbose_file_logger"}
        # The key 'logger' here will create a 'logger' node in the final composed config.
    ],
)

# Ensure the original default_transform (from the Quickstart) is also available if needed
# for other examples, or update it to also use a logger if that's the new baseline.
# For this example, we create a new named config.
store(TransformConfigWithLogger, group="transform", name="logged_transform")

# If you had an original default_transform:
# OriginalTransformConfig = builds(
#     transform_data,
#     populate_full_signature=True,
#     input_csv_path=deps_path("data/raw/input.csv"),
#     output_csv_path=outs_path("data/processed/output.csv"),
#     scale_factor=1.5,
# )
# store(OriginalTransformConfig, group="transform", name="default_transform")
```
For simplicity, we’ll focus on the case where the logger is instantiated by Hydra and the stage function `transform_data` doesn’t need a `logger` argument directly. The `setup_stage_file_logger` function would typically configure a global/module logger that `transform_data` then uses via `logging.getLogger(__name__)`.
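To make that routing concrete, here is a minimal, standard-library-only sketch (the logger name and the `transform_data_logged` helper are illustrative, not part of ZenDag or the Quickstart) of how a stage module’s logger reaches a file handler installed on the root logger:

```python
import logging
import tempfile
from pathlib import Path

# Module-level logger, as a stage module would create with logging.getLogger(__name__)
log = logging.getLogger("my_project.stages.simple_transform")


def transform_data_logged() -> None:
    log.info("transform step ran")


# What a real setup_stage_file_logger would do at stage startup: install a file
# handler on the root logger, which module loggers propagate to by default.
log_file = Path(tempfile.mkdtemp()) / "logs" / "stage_execution.log"
log_file.parent.mkdir(parents=True, exist_ok=True)
handler = logging.FileHandler(log_file)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

transform_data_logged()
handler.flush()
print("transform step ran" in log_file.read_text())  # True
```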
## How ZenDag Discovers the Logger’s Output
1. **Update `configure.py`:** import `configs.loggers_config` and ensure the `transform` group (and specifically `logged_transform`) is processed.

   ```python
   # configure.py (snippet)
   import configs.transform_config  # Has logged_transform
   import configs.loggers_config    # Defines logger configs

   # ...
   # If you are also running the Quickstart's default_transform, keep its dummy input logic:
   # Path("data/raw/input.csv").parent.mkdir(parents=True, exist_ok=True)
   # pd.DataFrame({'id': [1, 2], 'value': [10, 20]}).to_csv(Path("data/raw/input.csv"), index=False)
   # os.system("dvc add data/raw/input.csv")

   STAGE_GROUPS = ["transform"]  # This will pick up all configs in the 'transform' group
   # ...
   ```
2. **Run the configuration:**

   ```bash
   python configure.py
   ```
3. **Inspect `dvc.yaml`:** look at the entry for `transform/logged_transform`:

   ```yaml
   stages:
     transform/logged_transform:
       cmd: python -m my_project.run_hydra_stage -cd artifacts/transform -cn logged_transform hydra.run.dir='artifacts/transform/logged_transform'
       deps:
         - data/raw/input.csv
       outs:
         # Output from transform_data itself
         - artifacts/transform/logged_transform/data/processed/output_with_logging.csv
         # Output from the composed logger!
         - artifacts/transform/logged_transform/logs/stage_execution.log
       params:
         - artifacts/transform/logged_transform.yaml
   ```
ZenDag’s `configure_pipeline` calls `OmegaConf.resolve(cfg)` on the fully composed configuration for `transform/logged_transform`. This composed config includes the `logger` node (because of `hydra_defaults`), which itself contains `log_file_path_str=outs_path("logs/stage_execution.log")`. The `outs:` resolver is triggered, and the log file path is added to the `outs` of the `transform/logged_transform` DVC stage.
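The mechanics can be sketched without Hydra or OmegaConf. The following simplified stand-in (an assumption about ZenDag’s internals for illustration, not its actual code) shows how resolving `${outs:...}`-style markers while walking the composed config naturally collects outputs declared at any depth, including inside the composed `logger` node:

```python
# Simplified stand-in for an OmegaConf custom resolver plus OmegaConf.resolve().
collected_outs = []


def outs_resolver(rel_path: str, stage_dir: str) -> str:
    """Mimics the 'outs:' resolver: records the path and expands it."""
    full_path = f"{stage_dir}/{rel_path}"
    collected_outs.append(full_path)
    return full_path


def resolve(node, stage_dir):
    """Walks a nested config, expanding '${outs:...}' markers wherever they occur."""
    if isinstance(node, dict):
        return {key: resolve(value, stage_dir) for key, value in node.items()}
    if isinstance(node, str) and node.startswith("${outs:") and node.endswith("}"):
        return outs_resolver(node[len("${outs:"):-1], stage_dir)
    return node


# Composed config: the 'logger' node came in via hydra_defaults.
cfg = {
    "input_csv_path": "data/raw/input.csv",
    "output_csv_path": "${outs:data/processed/output_with_logging.csv}",
    "scale_factor": 2.5,
    "logger": {"log_file_path_str": "${outs:logs/stage_execution.log}", "log_level": "DEBUG"},
}
resolve(cfg, "artifacts/transform/logged_transform")
print(collected_outs)
```

Both the stage’s own output and the logger’s log file end up in `collected_outs`, which is why both appear under `outs:` in the generated `dvc.yaml`.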
## Running the Stage
When you run `dvc exp run transform/logged_transform` (or `dvc exp run` if it’s the only changed part):
1. Hydra instantiates the `logger` part of the config, calling `setup_stage_file_logger`.
2. The (mock) `setup_stage_file_logger` prints its message. Because its `log_file_path_str` used `outs_path`, DVC now tracks this log file.
3. The `transform_data` function runs. Its own `log = logging.getLogger(__name__)` statements would go to the logger configured by `setup_stage_file_logger`, provided it configured the root logger or a relevant parent.
4. The file `artifacts/transform/logged_transform/logs/stage_execution.log` is created (even if it’s empty or has minimal content in this mock, DVC tracks its existence as an output).
## Benefits
- **Reusability:** define logger (or trainer, optimizer, etc.) configs once, use them in many stages.
- **Separation of concerns:** stage logic doesn’t need to be cluttered with detailed setup for common components.
- **Dynamic outputs:** ZenDag automatically picks up DVC outputs (`outs_path`) declared deep within composed configuration structures.
- **Flexibility:** easily swap out components by changing the `hydra_defaults` (e.g., switch to `verbose_file_logger`).
## Conclusion
Hydra’s composition, combined with ZenDag’s `deps_path` and `outs_path` discovery, allows for building sophisticated, modular MLOps pipelines where even common components have their outputs tracked by DVC without manual duplication in `dvc.yaml`. This leads to cleaner, more maintainable, and highly reproducible workflows.