Sharing Data & Models: DVC Versioning with Intake Catalogs (using fsspec)
Once you’ve used ZenDag and DVC to produce versioned datasets and models, how do you easily share and consume them? Intake is a lightweight Python library for finding, investigating, loading, and disseminating data. By leveraging fsspec and its DVC filesystem implementation (dvcfs), Intake provides a modern and flexible way to access DVC-versioned assets.
Recap: DVC for Versioning Artifacts
As covered previously, ZenDag and DVC work together to version your data. When a stage produces an output like artifacts/transform/default_transform/output.csv, DVC records its metadata and content hash: in dvc.lock for pipeline stage outputs, or in a .dvc file for data added with dvc add.
Step 1: Push DVC Data to a Remote
Data can only be shared once it has been pushed to DVC remote storage (S3, GCS, SSH, etc.).
Add a remote (if not done already):
# Example for a local directory remote (for testing)
mkdir -p /tmp/my_dvc_remote
dvc remote add -d mylocalremote /tmp/my_dvc_remote
Push data to the remote:
dvc push -r mylocalremote
Step 2: Tag a Specific Version in Git
Git tags mark specific, stable versions of your DVC metadata (.dvc files and dvc.lock).
Commit all relevant .dvc files and dvc.lock.
Create and push a tag:
git tag v1.0.0-data -m "Stable processed dataset version 1.0.0 from quickstart"
git push origin v1.0.0-data
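Consumers can only resolve a tag that has actually reached the shared Git repository. As a minimal sketch, assuming git is available on PATH, the helper below asks the remote for the tag with git ls-remote. The names ls_remote_command and remote_has_tag are hypothetical and introduced only for illustration; they are not part of DVC or ZenDag.

```python
import subprocess


def ls_remote_command(repo_url: str, tag: str) -> list[str]:
    """Build the git command that lists a single tag on a remote repository."""
    return ["git", "ls-remote", "--tags", repo_url, f"refs/tags/{tag}"]


def remote_has_tag(repo_url: str, tag: str) -> bool:
    """Return True if `tag` is present in the repository at `repo_url`."""
    result = subprocess.run(
        ls_remote_command(repo_url, tag),
        capture_output=True,
        text=True,
        check=False,  # a failing git call is reported as "tag not found"
    )
    # git prints one line per matching ref; empty output means no such tag.
    return result.returncode == 0 and bool(result.stdout.strip())


# Example call (replace the URL with your own repository):
# remote_has_tag("https://github.com/<owner>/<repo>.git", "v1.0.0-data")
```

Running this check before publishing a catalog catches the common case where the tag was created locally but never pushed.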
Step 3: Install Intake and fsspec DVC support
In the environment where you’ll consume the data, ensure you have:
pip install intake pandas # For CSV reading
# DVC itself registers the dvc:// fsspec protocol (dvcfs); add the extra for your remote type if needed
pip install dvc # e.g. pip install "dvc[s3]" for an S3 remote
# If you plan to read other formats, install relevant intake plugins, e.g.:
# pip install intake-xarray intake-parquet
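Before opening a catalog, it can be useful to confirm that the consuming environment has the packages installed. The following is a small sketch using only the standard library; the list of module names is an assumption derived from the pip commands above, and missing_modules is a hypothetical helper name.

```python
from importlib.util import find_spec

# Top-level import names this tutorial relies on (an assumption based on the
# pip commands above; extend the tuple if you use other Intake plugins).
REQUIRED_MODULES = ("intake", "pandas", "dvc", "fsspec")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks up each module without importing it, so it stays fast even for heavy packages.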
Step 4: Creating an Intake Catalog (catalog.yaml) with fsspec
An Intake catalog (YAML file) describes your data sources. We’ll use the fsspec dvc:// URL scheme.
Create catalog.yaml:
# catalog.yaml (using fsspec dvc:// syntax)
sources:
  processed_dataset_v1_fsspec:
    driver: csv # Intake driver for CSV files (uses pandas by default)
    description: "Version 1.0.0 of the processed dataset (fsspec dvc access)."
    args:
      # urlpath uses the dvc:// fsspec protocol.
      # The part after dvc:// (e.g., your_username/your_project/) is often conventional
      # as target_options.url primarily defines the Git repo.
      urlpath: "dvc://your_username_placeholder/your_zendag_project_placeholder/artifacts/transform/default_transform/data/processed/output.csv" # !!! REPLACE owner/repo part !!!
      storage_options:
        # target_options specifies the Git repository details for dvcfs
        target_options:
          url: "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git" # !!! REPLACE THIS !!!
        # rev is the Git revision (tag, branch, or commit hash)
        rev: "v1.0.0-data"
        # remote: "mylocalremote" # Optional: DVC remote if not default/auto-discoverable
      # Arguments for the 'csv' driver (passed to pandas.read_csv)
      csv_kwargs:
        dtype: {"id": "int", "value": "float", "scaled_value": "float"}
  # Example for a NetCDF file, assuming intake-xarray is installed
  # weather_model_output_v2_fsspec:
  #   driver: netcdf
  #   description: "Weather model output v2 (NetCDF via fsspec dvc)."
  #   args:
  #     urlpath: "dvc://your_username_placeholder/your_zendag_project_placeholder/data/models/weather_v2.nc" # !!! REPLACE !!!
  #     chunks: {} # Argument for xarray.open_dataset
  #     storage_options:
  #       target_options:
  #         url: "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git" # !!! REPLACE !!!
  #       rev: "weather-model-v2-tag"
Important:
Replace placeholders like your_username_placeholder/your_zendag_project_placeholder with your actual Git repository owner and name.
The path part of the urlpath must be the exact path to the data file as tracked by DVC within that repository structure.
Ensure the Git rev (e.g., tag v1.0.0-data) exists in your Git repository.
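Leftover placeholders are an easy mistake to ship. As a rough sketch, the check below scans catalog text for the marker strings used in this tutorial's examples ("placeholder" and "REPLACE"), ignoring comments. The helper name find_placeholder_lines is hypothetical, and the comment stripping is deliberately naive: it assumes values do not themselves contain " #".

```python
import re

# Marker strings used by the placeholders in this tutorial's catalog examples.
PLACEHOLDER_PATTERN = re.compile(r"placeholder|REPLACE", re.IGNORECASE)


def find_placeholder_lines(catalog_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for non-comment lines holding a placeholder."""
    hits = []
    for number, line in enumerate(catalog_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # whole-line comments are ignored
        # Drop a trailing " # ..." comment so reminders such as
        # "!!! REPLACE THIS !!!" are not counted as placeholders.
        value = stripped.split(" #", 1)[0]
        if PLACEHOLDER_PATTERN.search(value):
            hits.append((number, stripped))
    return hits


# Example usage:
# with open("catalog.yaml") as f:
#     for number, line in find_placeholder_lines(f.read()):
#         print(f"line {number}: {line}")
```

An empty result means no value in the catalog still contains one of the marker strings.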
Step 5: Using the Intake Catalog (fsspec version)
In any Python script or Jupyter Notebook:
import intake
import pandas as pd # For type hint and checking
import os
# --- Create a dummy catalog.yaml for this notebook execution ---
# In a real scenario, this file would exist independently.
# !!! REPLACE with your actual Git repo URL and path for this to work beyond this notebook !!!
DUMMY_GIT_OWNER = "your_username_placeholder"
DUMMY_GIT_REPO_NAME = "your_zendag_project_placeholder"
# Construct a file:// URL if your repo is local for testing, otherwise use https://
# For this example, we'll assume a local path could be used for placeholder.
# A real remote test would require cloning this ZenDag repo and pushing it to your own GitHub.
# For simplicity in a self-contained notebook, we'll mock the access or it will fail if placeholders aren't replaced.
DUMMY_GIT_REPO_URL_FOR_FSSPEC = f"https_IS_A_PLACEHOLDER_REPLACE_ME_github.com/{DUMMY_GIT_OWNER}/{DUMMY_GIT_REPO_NAME}.git"
# If testing locally against a checked-out version of the project (that has DVC setup):
# DUMMY_GIT_REPO_URL_FOR_FSSPEC = f"file://{os.path.abspath('.')}" # Points to current dir if it's the git repo root
if "PLACEHOLDER_REPLACE_ME" in DUMMY_GIT_REPO_URL_FOR_FSSPEC:
    print(f"WARNING: DUMMY_GIT_REPO_URL_FOR_FSSPEC ('{DUMMY_GIT_REPO_URL_FOR_FSSPEC}') is a placeholder.")
    print("Replace it with your actual Git repo URL and adjust paths for this example to fully work.")
catalog_fsspec_content = f"""
sources:
  processed_dataset_v1_fsspec:
    driver: csv
    description: "Version 1.0.0 of the processed dataset (fsspec dvc access)."
    args:
      urlpath: "dvc://{DUMMY_GIT_OWNER}/{DUMMY_GIT_REPO_NAME}/artifacts/transform/default_transform/data/processed/output.csv"
      storage_options:
        target_options:
          url: "{DUMMY_GIT_REPO_URL_FOR_FSSPEC}"
        rev: "v1.0.0-data" # Ensure this tag exists in your repo, or use a valid commit/branch
      csv_kwargs:
        dtype: {{"id": "int", "value": "float", "scaled_value": "float"}}
"""
with open("temp_fsspec_catalog.yaml", "w") as f:
    f.write(catalog_fsspec_content)
# --- End dummy catalog creation ---
# Ensure your temp_fsspec_catalog.yaml is in the current directory or provide its path
catalog_fsspec = None
try:
    catalog_fsspec = intake.open_catalog("temp_fsspec_catalog.yaml")
except Exception as e:
    print(f"Error opening fsspec catalog: {e}")
    print("This example may not fully run if the Git repo URL is a placeholder or the specified rev/path doesn't exist.")
if catalog_fsspec:
    print("Available sources in fsspec catalog:", list(catalog_fsspec))
    dataset_entry_name_fsspec = 'processed_dataset_v1_fsspec'
    if dataset_entry_name_fsspec in catalog_fsspec:
        dataset_entry_fsspec = catalog_fsspec[dataset_entry_name_fsspec]
        print(f"\nDataset entry '{dataset_entry_name_fsspec}' description:", dataset_entry_fsspec.description)
        print("Attempting to read data using fsspec dvc (this might take a moment for the first time if accessing a remote repo)...")
        try:
            df_fsspec: pd.DataFrame = dataset_entry_fsspec.read() # `read()` loads the data
            print("\nFirst 5 rows of the loaded DataFrame (via fsspec dvc):")
            print(df_fsspec.head())
            print(f"\nDataFrame shape: {df_fsspec.shape}")
        except Exception as e:
            print(f"\nERROR reading data source '{dataset_entry_name_fsspec}': {e}")
            print("This often happens if:")
            print(f" - The Git repo URL ('{DUMMY_GIT_REPO_URL_FOR_FSSPEC}') is a placeholder, incorrect, or inaccessible.")
            print(f" - The Git revision (tag/commit) 'v1.0.0-data' does not exist in that repo or doesn't contain the DVC metadata for the specified path.")
            print(f" - The DVC remote (if needed) is not accessible or the data for the path has not been pushed to it.")
            print(f" - 'dvc' CLI is not installed, or 'dvcfs' (required by fsspec's dvc:// protocol) is not available.")
            print(f" - The 'driver: csv' needs pandas or an appropriate backend installed in your Python environment.")
    else:
        print(f"Data source '{dataset_entry_name_fsspec}' not found in catalog.")
# Clean up dummy catalog
if os.path.exists("temp_fsspec_catalog.yaml"):
    os.remove("temp_fsspec_catalog.yaml")
What happens when you call dataset_entry.read() with this fsspec setup:
Intake identifies the driver (e.g., csv).
It sees the urlpath starting with dvc://.
It uses fsspec with the dvcfs implementation to open this URL.
dvcfs uses the storage_options (like target_options.url for the Git repo and rev for the Git revision) to:
  Access the specified Git repository at the given revision (cloning/checking out to a temporary location if needed).
  Find the DVC metadata for the file path within the urlpath.
  Use DVC internally to make the actual data file available (pulling from a DVC remote if necessary).
dvcfs then provides a file-like object to this data.
Intake’s chosen driver (e.g., the CSV driver) then reads from this file-like object, applying any specified args (like csv_kwargs).
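When a catalog entry fails to load, it can help to take Intake out of the picture and exercise the DVC filesystem layer on its own. The sketch below does this with dvc.api.DVCFileSystem, which accepts the Git repository URL and revision directly, and then hands the opened file to pandas. read_versioned_csv is a hypothetical helper name, and the URL, tag, and path in the usage comment are this tutorial's placeholders rather than real values.

```python
def read_versioned_csv(repo_url: str, rev: str, path_in_repo: str, **read_csv_kwargs):
    """Read a DVC-tracked CSV from `repo_url` at Git revision `rev` into a DataFrame."""
    # Imported lazily so the helper can be defined before dvc/pandas are installed.
    import pandas as pd
    from dvc.api import DVCFileSystem

    # The filesystem resolves the Git revision and fetches DVC-tracked
    # file contents from the DVC remote when a file is opened.
    fs = DVCFileSystem(url=repo_url, rev=rev)
    with fs.open(path_in_repo, mode="rb") as handle:
        return pd.read_csv(handle, **read_csv_kwargs)


# Example call (placeholders must be replaced before this can work):
# df = read_versioned_csv(
#     "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git",
#     "v1.0.0-data",
#     "artifacts/transform/default_transform/data/processed/output.csv",
#     dtype={"id": "int", "value": "float", "scaled_value": "float"},
# )
```

If this direct read succeeds but the catalog entry does not, the problem most likely lies in the catalog's urlpath or storage_options rather than in Git or the DVC remote.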
Benefits of the fsspec Approach
Standardization: Uses the widely adopted fsspec interface, making it compatible with many libraries.
Flexibility: You choose the Intake driver based on your data format (CSV, Parquet, NetCDF, Zarr, etc.), and dvcfs handles getting the DVC-versioned bytes to that driver.
Ecosystem: Leverages the strengths of both Intake (cataloging, unified API) and fsspec (versatile file system access).
This fsspec-based method is the modern and generally recommended way to use Intake with DVC-versioned data, offering greater flexibility and integration with the broader Python data ecosystem.