# Config Composition & Reusable Components with ZenDag
Hydra is powerful for configuration management, especially its ability to compose configurations from smaller, reusable pieces. ZenDag leverages this: you can define common components (like loggers, trainers, data modules) as separate configurations and then include them in your main stage configs. ZenDag will still discover any `deps_path` or `outs_path` declarations within these composed parts.
## Example: A Reusable File Logger
Let’s define a configuration for a simple file logger. This logger will write to a file, and we want DVC to track this log file as an output of any stage that uses this logger.
### Defining the Logger Configuration
Create `configs/loggers_config.py`:
```python
# configs/loggers_config.py
import logging  # Standard logging
from pathlib import Path

from hydra_zen import builds, store

from zendag.config_utils import outs_path  # Logger's output file is a DVC output


# This is a simplified function. In reality, it would configure the logging system.
# For ZenDag's dvc.yaml generation, we primarily care that it defines an output path.
# The actual logging setup happens when the stage runs and Hydra instantiates this.
def setup_stage_file_logger(log_file_path_str: str, log_level: str = "INFO"):
    """
    (Mock) Sets up a file logger for a stage.

    The actual configuration of the Python logging system would happen here
    when Hydra instantiates this part of the config during stage execution.
    """
    log_file_path = Path(log_file_path_str)
    log_file_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

    # Simulate logger setup for demonstration
    print(f"[LoggerSetup] Configuring file logger at: {log_file_path} with level {log_level}")

    # In a real scenario, you might return a configured logger object or just perform side effects.
    # For ZenDag's config resolution, the important part is that `log_file_path_str` uses `outs_path`.
    return {"log_file": str(log_file_path), "level": log_level}


# Hydra-Zen config for our file logger
FileLoggerConfig = builds(
    setup_stage_file_logger,
    populate_full_signature=True,
    # The log file path is an output of the stage using this logger.
    # It will be relative to the stage's output directory.
    log_file_path_str=outs_path("logs/stage_execution.log"),
    log_level="DEBUG",  # Default log level for this config
)

# Register it in a 'logger' group
store(FileLoggerConfig, group="logger", name="default_file_logger")

# Another variant
VerboseFileLoggerConfig = builds(
    setup_stage_file_logger,
    populate_full_signature=True,
    log_file_path_str=outs_path("logs/verbose_stage_execution.log"),
    log_level="NOTSET",  # NOTSET (level 0) is lower than DEBUG, letting everything through
)
store(VerboseFileLoggerConfig, group="logger", name="verbose_file_logger")
```
### Using the Logger in a Stage
Let’s modify the `TransformConfig` from our Quickstart Notebook to include this logger, using Hydra’s `hydra_defaults`. Modify `configs/transform_config.py`:
```python
# configs/transform_config.py (modified)
from hydra_zen import builds, store

from zendag.config_utils import deps_path, outs_path

# Assume transform_data is in my_project.stages.simple_transform
from my_project.stages.simple_transform import transform_data  # Or your actual import

# Option 1: Stage function is unaware of the logger (Hydra instantiates it)
TransformConfigWithLogger = builds(
    transform_data,  # transform_data itself doesn't take a logger argument here
    populate_full_signature=True,
    input_csv_path=deps_path("data/raw/input.csv"),
    output_csv_path=outs_path("data/processed/output_with_logging.csv"),
    scale_factor=2.5,
    # --- Hydra Defaults for Composition ---
    hydra_defaults=[
        "_self_",  # Always include this first
        {"logger": "default_file_logger"},  # Load 'default_file_logger' from the 'logger' group
        # To use the other logger: {"logger": "verbose_file_logger"}
        # The key 'logger' here will create a 'logger' node in the final composed config.
    ],
)

# Ensure the original default_transform (from the Quickstart) is also available if needed
# for other examples, or update it to also use a logger if that's the new baseline.
# For this example, we create a new named config.
store(TransformConfigWithLogger, group="transform", name="logged_transform")

# If you had an original default_transform:
# OriginalTransformConfig = builds(
#     transform_data,
#     populate_full_signature=True,
#     input_csv_path=deps_path("data/raw/input.csv"),
#     output_csv_path=outs_path("data/processed/output.csv"),
#     scale_factor=1.5,
# )
# store(OriginalTransformConfig, group="transform", name="default_transform")
```
For simplicity, we’ll focus on the case where the logger is instantiated by Hydra and the stage function `transform_data` doesn’t need a `logger` argument directly. The `setup_stage_file_logger` function would typically configure a global/module logger that `transform_data` then uses via `logging.getLogger(__name__)`.
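To make that routing concrete, here is a minimal, standard-library-only sketch (the logger name and the `transform_data_logged` helper are illustrative, not part of ZenDag or the Quickstart) of how a stage module’s logger reaches a file handler installed on the root logger:

```python
import logging
import tempfile
from pathlib import Path

# Module-level logger, as a stage module would create with logging.getLogger(__name__)
log = logging.getLogger("my_project.stages.simple_transform")


def transform_data_logged() -> None:
    log.info("transform step ran")


# What a real setup_stage_file_logger would do at stage startup: install a file
# handler on the root logger, which module loggers propagate to by default.
log_file = Path(tempfile.mkdtemp()) / "logs" / "stage_execution.log"
log_file.parent.mkdir(parents=True, exist_ok=True)
handler = logging.FileHandler(log_file)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

transform_data_logged()
handler.flush()
print("transform step ran" in log_file.read_text())  # True
```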
## How ZenDag Discovers the Logger’s Output
1. **Update `configure.py`:** import `configs.loggers_config` and ensure the `transform` group (and specifically `logged_transform`) is processed.

   ```python
   # configure.py (snippet)
   import configs.transform_config  # Has logged_transform
   import configs.loggers_config    # Defines logger configs

   # ...
   # If you are also running the Quickstart's default_transform, keep its dummy input logic:
   # Path("data/raw/input.csv").parent.mkdir(parents=True, exist_ok=True)
   # pd.DataFrame({'id': [1, 2], 'value': [10, 20]}).to_csv(Path("data/raw/input.csv"), index=False)
   # os.system("dvc add data/raw/input.csv")

   STAGE_GROUPS = ["transform"]  # This will pick up all configs in the 'transform' group
   # ...
   ```
2. **Run the configuration:**

   ```bash
   python configure.py
   ```
3. **Inspect `dvc.yaml`:** look at the entry for `transform/logged_transform`:

   ```yaml
   stages:
     transform/logged_transform:
       cmd: python -m my_project.run_hydra_stage -cd artifacts/transform -cn logged_transform hydra.run.dir='artifacts/transform/logged_transform'
       deps:
         - data/raw/input.csv
       outs:
         # Output from transform_data itself
         - artifacts/transform/logged_transform/data/processed/output_with_logging.csv
         # Output from the composed logger!
         - artifacts/transform/logged_transform/logs/stage_execution.log
       params:
         - artifacts/transform/logged_transform.yaml
   ```
ZenDag’s `configure_pipeline` calls `OmegaConf.resolve(cfg)` on the fully composed configuration for `transform/logged_transform`. This composed config includes the `logger` node (because of `hydra_defaults`), which itself contains `log_file_path_str=outs_path("logs/stage_execution.log")`. The `outs:` resolver is triggered, and the log file path is added to the `outs` of the `transform/logged_transform` DVC stage.
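The mechanics can be sketched without Hydra or OmegaConf. The following simplified stand-in (an assumption about ZenDag’s internals for illustration, not its actual code) shows how resolving `${outs:...}`-style markers while walking the composed config naturally collects outputs declared at any depth, including inside the composed `logger` node:

```python
# Simplified stand-in for an OmegaConf custom resolver plus OmegaConf.resolve().
collected_outs = []


def outs_resolver(rel_path: str, stage_dir: str) -> str:
    """Mimics the 'outs:' resolver: records the path and expands it."""
    full_path = f"{stage_dir}/{rel_path}"
    collected_outs.append(full_path)
    return full_path


def resolve(node, stage_dir):
    """Walks a nested config, expanding '${outs:...}' markers wherever they occur."""
    if isinstance(node, dict):
        return {key: resolve(value, stage_dir) for key, value in node.items()}
    if isinstance(node, str) and node.startswith("${outs:") and node.endswith("}"):
        return outs_resolver(node[len("${outs:"):-1], stage_dir)
    return node


# Composed config: the 'logger' node came in via hydra_defaults.
cfg = {
    "input_csv_path": "data/raw/input.csv",
    "output_csv_path": "${outs:data/processed/output_with_logging.csv}",
    "scale_factor": 2.5,
    "logger": {"log_file_path_str": "${outs:logs/stage_execution.log}", "log_level": "DEBUG"},
}
resolve(cfg, "artifacts/transform/logged_transform")
print(collected_outs)
```

Both the stage’s own output and the logger’s log file end up in `collected_outs`, which is why both appear under `outs:` in the generated `dvc.yaml`.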
## Running the Stage
When you run `dvc exp run transform/logged_transform` (or `dvc exp run` if it’s the only changed part):
1. Hydra instantiates the `logger` part of the config, calling `setup_stage_file_logger`.
2. The (mock) `setup_stage_file_logger` prints its message. Because its `log_file_path_str` used `outs_path`, DVC now tracks this log file.
3. The `transform_data` function runs. Its own `log = logging.getLogger(__name__)` statements would go to the logger configured by `setup_stage_file_logger`, provided it configured the root logger or a relevant parent.
4. The file `artifacts/transform/logged_transform/logs/stage_execution.log` is created (even if it’s empty or has minimal content in this mock, DVC tracks its existence as an output).
## Benefits
- **Reusability:** define logger (or trainer, optimizer, etc.) configs once, use them in many stages.
- **Separation of concerns:** stage logic doesn’t need to be cluttered with detailed setup for common components.
- **Dynamic outputs:** ZenDag automatically picks up DVC outputs (`outs_path`) declared deep within composed configuration structures.
- **Flexibility:** easily swap out components by changing the `hydra_defaults` (e.g., switch to `verbose_file_logger`).
## Conclusion
Hydra’s composition, combined with ZenDag’s `deps_path` and `outs_path` discovery, allows for building sophisticated, modular MLOps pipelines where even common components have their outputs tracked by DVC without manual duplication in `dvc.yaml`. This leads to cleaner, more maintainable, and highly reproducible workflows.