Config Composition & Reusable Components with ZenDag
Hydra is powerful for configuration management, especially its ability to compose configurations from smaller, reusable pieces. ZenDag leverages this: you can define common components (like loggers, trainers, data modules) as separate configurations and then include them in your main stage configs. ZenDag will still discover any deps_path or outs_path declarations within these composed parts.
Example: A Reusable File Logger
Let’s define a configuration for a simple file logger. This logger will write to a file, and we want DVC to track this log file as an output of any stage that uses this logger.
Defining the Logger Configuration
Create configs/loggers_config.py:
# configs/loggers_config.py
from hydra_zen import builds, store
from zendag.config_utils import outs_path # Logger's output file is a DVC output
from pathlib import Path
import logging # Standard logging
# This is a simplified function. In reality, it would configure the logging system.
# For ZenDag's dvc.yaml generation, we primarily care that it defines an output path.
# The actual logging setup happens when the stage runs and Hydra instantiates this.
def setup_stage_file_logger(log_file_path_str: str, log_level: str = "INFO"):
    """
    (Mock) Sets up a file logger for a stage.
    The actual configuration of the Python logging system would happen here
    when Hydra instantiates this part of the config during stage execution.
    """
    log_file_path = Path(log_file_path_str)
    log_file_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
    # Simulate logger setup for demonstration
    print(f"[LoggerSetup] Configuring file logger at: {log_file_path} with level {log_level}")
    # In a real scenario, you might return a configured logger object or just perform side effects.
    # For ZenDag's config resolution, the important part is that `log_file_path_str` uses `outs_path`.
    return {"log_file": str(log_file_path), "level": log_level}
# Hydra-Zen config for our file logger
FileLoggerConfig = builds(
    setup_stage_file_logger,
    populate_full_signature=True,
    # The log file path is an output of the stage using this logger.
    # It will be relative to the stage's output directory.
    log_file_path_str=outs_path("logs/stage_execution.log"),
    log_level="DEBUG",  # Default log level for this config
)
# Register it in a 'logger' group
store(FileLoggerConfig, group="logger", name="default_file_logger")
# Another variant
VerboseFileLoggerConfig = builds(
    setup_stage_file_logger,
    populate_full_signature=True,
    log_file_path_str=outs_path("logs/verbose_stage_execution.log"),
    log_level="NOTSET",  # NOTSET (0) sits below DEBUG, so handlers set to it pass every record through
)
store(VerboseFileLoggerConfig, group="logger", name="verbose_file_logger")
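Before wiring this into a stage, you can sanity-check the setup function outside of Hydra by calling it directly with a concrete path, as Hydra would after resolving `outs_path`. This is a minimal stdlib-only sketch (the temporary directory stands in for the stage's run directory, and the function body is repeated so the snippet is self-contained):

```python
import tempfile
from pathlib import Path


def setup_stage_file_logger(log_file_path_str: str, log_level: str = "INFO"):
    """(Mock) logger setup, repeated here so the snippet is self-contained."""
    log_file_path = Path(log_file_path_str)
    log_file_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure the log directory exists
    return {"log_file": str(log_file_path), "level": log_level}


# Call it directly with a resolved path, mimicking what Hydra instantiation does.
with tempfile.TemporaryDirectory() as stage_dir:
    result = setup_stage_file_logger(
        str(Path(stage_dir) / "logs" / "stage_execution.log"), "DEBUG"
    )
    assert Path(result["log_file"]).parent.is_dir()  # the logs/ directory was created
```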
Using the Logger in a Stage
Let’s modify the TransformConfig from our Quickstart Notebook to include this logger using Hydra’s hydra_defaults.
Modify configs/transform_config.py:
# configs/transform_config.py (modified)
from hydra_zen import builds, store
from zendag.config_utils import deps_path, outs_path
# Assume transform_data is in my_project.stages.simple_transform
from my_project.stages.simple_transform import transform_data # Or your actual import
# Option 1: Stage function is unaware of the logger (Hydra instantiates it)
TransformConfigWithLogger = builds(
    transform_data,  # transform_data itself doesn't take a logger argument here
    populate_full_signature=True,
    input_csv_path=deps_path("data/raw/input.csv"),
    output_csv_path=outs_path("data/processed/output_with_logging.csv"),
    scale_factor=2.5,
    # --- Hydra Defaults for Composition ---
    hydra_defaults=[
        "_self_",  # Always include this first
        {"logger": "default_file_logger"},  # Load the 'default_file_logger' from the 'logger' group
        # To use the other logger: {"logger": "verbose_file_logger"}
        # The key 'logger' here will create a 'logger' node in the final composed config.
    ],
)
# Ensure the original default_transform (from quickstart) is also available if needed for other examples
# or update it to also use a logger if that's the new baseline.
# For this example, we create a new named config.
store(TransformConfigWithLogger, group="transform", name="logged_transform")
# If you had an original default_transform:
# OriginalTransformConfig = builds(
#     transform_data,
#     populate_full_signature=True,
#     input_csv_path=deps_path("data/raw/input.csv"),
#     output_csv_path=outs_path("data/processed/output.csv"),
#     scale_factor=1.5,
# )
# store(OriginalTransformConfig, group="transform", name="default_transform")
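For intuition, here is a hand-written sketch of roughly what the composed config for this stage could contain after Hydra merges in the logger group. The exact keys and the `${outs:...}`/`${deps:...}` interpolation syntax are assumptions about ZenDag's internals, not verbatim output:

```yaml
# artifacts/transform/logged_transform.yaml (illustrative sketch, not actual output)
_target_: my_project.stages.simple_transform.transform_data
input_csv_path: ${deps:data/raw/input.csv}
output_csv_path: ${outs:data/processed/output_with_logging.csv}
scale_factor: 2.5
logger:  # node contributed by hydra_defaults composition
  _target_: configs.loggers_config.setup_stage_file_logger
  log_file_path_str: ${outs:logs/stage_execution.log}
  log_level: DEBUG
```

The key point is that the `logger` node, with its output declaration, now lives inside the stage's own config tree.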
For simplicity, we’ll focus on the case where the logger is instantiated by Hydra, and the stage function transform_data doesn’t need a logger argument directly. The setup_stage_file_logger function would typically configure a global/module logger that transform_data then uses via logging.getLogger(__name__).
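This division of labor can be sketched with plain stdlib logging. Here, `configure_root_file_handler` plays the role a real `setup_stage_file_logger` might (attaching a `FileHandler` to the root logger), and `transform_data_demo` is a hypothetical stand-in for the stage function, which only ever asks for its module logger:

```python
import logging
import tempfile
from pathlib import Path


def configure_root_file_handler(log_file: str, level: str = "DEBUG") -> logging.Handler:
    # What a real setup_stage_file_logger might do: attach a FileHandler to the
    # root logger so that every module-level logger propagates into the log file.
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return handler


def transform_data_demo() -> None:
    # The stage function never sees the logger config; it just asks for its
    # module logger, which propagates up to the configured root logger.
    log = logging.getLogger(__name__)
    log.info("transforming rows")


with tempfile.TemporaryDirectory() as stage_dir:
    log_file = str(Path(stage_dir) / "logs" / "stage_execution.log")
    handler = configure_root_file_handler(log_file)
    transform_data_demo()
    handler.close()
    logging.getLogger().removeHandler(handler)
    assert "transforming rows" in Path(log_file).read_text()
```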
How ZenDag Discovers the Logger’s Output
Update configure.py: import configs.loggers_config, and ensure the transform group (and specifically logged_transform) is processed.

# configure.py (snippet)
import configs.transform_config  # Has logged_transform
import configs.loggers_config    # Defines logger configs
# ...
# If you are also running quickstart's default_transform, keep its dummy input logic:
# Path("data/raw/input.csv").parent.mkdir(parents=True, exist_ok=True)
# pd.DataFrame({'id': [1, 2], 'value': [10, 20]}).to_csv(Path("data/raw/input.csv"), index=False)
# os.system(f"dvc add data/raw/input.csv")
STAGE_GROUPS = ["transform"]  # This will pick up all configs in the 'transform' group
# ...
Run the configuration:

python configure.py

Inspect dvc.yaml and look at the entry for transform/logged_transform:

stages:
  transform/logged_transform:
    cmd: python -m my_project.run_hydra_stage -cd artifacts/transform -cn logged_transform hydra.run.dir='artifacts/transform/logged_transform'
    deps:
      - data/raw/input.csv
    outs:
      # Output from transform_data itself
      - artifacts/transform/logged_transform/data/processed/output_with_logging.csv
      # Output from the composed logger!
      - artifacts/transform/logged_transform/logs/stage_execution.log
    params:
      - artifacts/transform/logged_transform.yaml
ZenDag’s configure_pipeline calls OmegaConf.resolve(cfg) on the fully composed configuration for transform/logged_transform. This composed config includes the logger node (because of hydra_defaults), which itself contains log_file_path_str=outs_path("logs/stage_execution.log"). The outs: resolver is triggered, and the log file path is added to the outs for the transform/logged_transform DVC stage.
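Conceptually, the discovery step behaves like the following stdlib-only sketch. ZenDag's real implementation hooks into OmegaConf resolution; the `OUTS` marker and `collect_outs` function below are hypothetical stand-ins used only to illustrate how output declarations are found at any depth of the composed config:

```python
from typing import Any

OUTS = "outs://"  # hypothetical marker standing in for the real outs: resolver


def collect_outs(node: Any, found: list) -> None:
    # Walk the composed config (nested dicts/lists) and record every value
    # declared as a stage output, however deeply it is nested.
    if isinstance(node, dict):
        for value in node.values():
            collect_outs(value, found)
    elif isinstance(node, list):
        for value in node:
            collect_outs(value, found)
    elif isinstance(node, str) and node.startswith(OUTS):
        found.append(node[len(OUTS):])


composed = {
    "output_csv_path": OUTS + "data/processed/output_with_logging.csv",
    "scale_factor": 2.5,
    "logger": {  # node added by hydra_defaults composition
        "log_file_path_str": OUTS + "logs/stage_execution.log",
        "log_level": "DEBUG",
    },
}

outs = []
collect_outs(composed, outs)
print(outs)  # ['data/processed/output_with_logging.csv', 'logs/stage_execution.log']
```

The logger's log file is discovered exactly like the stage's own output, with no special-casing for composed nodes.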
Running the Stage
When you run dvc exp run transform/logged_transform (or dvc exp run if it’s the only changed part):
1. Hydra will instantiate the logger part of its config, calling setup_stage_file_logger.
2. The (mock) setup_stage_file_logger will print its message. Because its log_file_path_str used outs_path, DVC now tracks this log file.
3. The transform_data function will run. Its own log = logging.getLogger(__name__) statements would go to the logger configured by setup_stage_file_logger if it configured the root logger or a relevant parent.
4. The file artifacts/transform/logged_transform/logs/stage_execution.log will be created (even if it’s empty or has minimal content in this mock, DVC tracks its existence as an output).
Benefits
Reusability: Define logger (or trainer, optimizer, etc.) configs once, use them in many stages.
Separation of Concerns: Stage logic doesn’t need to be cluttered with detailed setup for common components.
Dynamic Outputs: ZenDag automatically picks up DVC outputs (outs_path) declared deep within composed configuration structures.
Flexibility: Easily swap out components by changing the hydra_defaults (e.g., switch to verbose_file_logger).
Conclusion
Hydra’s composition, combined with ZenDag’s deps_path and outs_path discovery, allows for building sophisticated and modular MLOps pipelines where even common components can have their outputs tracked by DVC without manual duplication in the dvc.yaml. This leads to cleaner, more maintainable, and highly reproducible workflows.