Sharing Data & Models: DVC Versioning with Intake Catalogs (using fsspec)

Once you’ve used ZenDag and DVC to produce versioned datasets and models, how do you easily share and consume them? Intake is a lightweight Python library for finding, investigating, loading, and disseminating data. By leveraging fsspec and its DVC filesystem implementation (dvcfs), Intake provides a modern and flexible way to access DVC-versioned assets.

Recap: DVC for Versioning Artifacts

As covered previously, ZenDag and DVC work together to version your data. When a stage produces an output like artifacts/transform/default_transform/output.csv, DVC creates a .dvc file tracking its metadata and content hash.
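
For reference, a .dvc file is a small YAML stub that records the hash and size of the tracked file; a minimal example (hash and size values here are illustrative) looks roughly like this:

# artifacts/transform/default_transform/output.csv.dvc (illustrative values)
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 10240
  path: output.csv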

Step 1: Push DVC Data to a Remote

To share your data, it must first be pushed to a DVC remote (S3, GCS, SSH, a local directory, etc.).

  1. Add a remote (if not done already):

    # Example for a local directory remote (for testing)
    mkdir -p /tmp/my_dvc_remote 
    dvc remote add -d mylocalremote /tmp/my_dvc_remote
    
  2. Push data to the remote:

    dvc push -r mylocalremote
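
     You can confirm that the data landed in the remote by comparing the local cache against it:

    # -c/--cloud compares the local DVC cache against the configured remote
    dvc status -c -r mylocalremote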
    

Step 2: Tag a Specific Version in Git

Git tags mark specific, stable versions of your DVC metadata (.dvc files and dvc.lock).

  1. Commit all relevant .dvc files and dvc.lock.
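
     For example (the .dvc file paths are illustrative; use the ones your pipeline actually produced):

    # paths below are illustrative; adjust to your project layout
    git add artifacts/transform/default_transform/output.csv.dvc dvc.lock
    git commit -m "Track processed dataset for v1.0.0-data"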

  2. Create and push a tag:

    git tag v1.0.0-data -m "Stable processed dataset version 1.0.0 from quickstart"
    git push origin v1.0.0-data
    

Step 3: Install Intake and fsspec DVC support

In the environment where you’ll consume the data, ensure you have:

pip install intake pandas # For CSV reading
# The fsspec-compatible DVC filesystem (dvcfs) ships with recent DVC releases,
# so install dvc itself (plus any remote extras you need, e.g. "dvc[s3]")
pip install dvc
# If you plan to read other formats, install relevant intake plugins, e.g.:
# pip install intake-xarray intake-parquet
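
As a quick sanity check, you can confirm that fsspec resolves the dvc:// protocol in your environment (it should map to DVC's filesystem class once dvc is installed):

python -c "import fsspec; print(fsspec.get_filesystem_class('dvc'))"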

Step 4: Creating an Intake Catalog (catalog.yaml) with fsspec

An Intake catalog (YAML file) describes your data sources. We’ll use the fsspec dvc:// URL scheme.

Create catalog.yaml:

# catalog.yaml (using fsspec dvc:// syntax)
sources:
  processed_dataset_v1_fsspec:
    driver: csv # Intake driver for CSV files (uses pandas by default)
    description: "Version 1.0.0 of the processed dataset (fsspec dvc access)."
    args:
      # urlpath uses the dvc:// fsspec protocol.
      # The owner/repo prefix after dvc:// is conventional/informational here;
      # the Git repository is actually determined by storage_options.target_options.url below.
      urlpath: "dvc://your_username_placeholder/your_zendag_project_placeholder/artifacts/transform/default_transform/data/processed/output.csv" # !!! REPLACE owner/repo part !!!
      storage_options:
        # target_options specifies the Git repository details for dvcfs
        target_options:
          url: "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git" # !!! REPLACE THIS !!!
        # rev is the Git revision (tag, branch, or commit hash)
        rev: "v1.0.0-data"
        # remote: "mylocalremote" # Optional: DVC remote if not default/auto-discoverable
      # Arguments for the 'csv' driver (passed to pandas.read_csv)
      csv_kwargs: 
        dtype: {"id": "int", "value": "float", "scaled_value": "float"}

  # Example for a NetCDF file, assuming intake-xarray is installed
  # weather_model_output_v2_fsspec:
  #   driver: netcdf 
  #   description: "Weather model output v2 (NetCDF via fsspec dvc)."
  #   args:
  #     urlpath: "dvc://your_username_placeholder/your_zendag_project_placeholder/data/models/weather_v2.nc" # !!! REPLACE !!!
  #     chunks: {} # Argument for xarray.open_dataset
  #     storage_options:
  #       target_options:
  #         url: "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git" # !!! REPLACE !!!
  #       rev: "weather-model-v2-tag"

Important:

  • Replace placeholders like your_username_placeholder/your_zendag_project_placeholder with your actual Git repository owner and name.

  • The path part of the urlpath must be the exact path to the data file as tracked by DVC within that repository structure.

  • Ensure the Git rev (e.g., tag v1.0.0-data) exists in your Git repository.
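
Before wiring up the catalog, it can help to sanity-check the DVC filesystem directly. A minimal sketch, assuming DVC's fsspec-compatible filesystem is installed (the exact constructor arguments may vary slightly between DVC versions):

import fsspec

# Instantiate the DVC filesystem for the Git repo at the tagged revision.
# !!! REPLACE the repo URL with your own !!!
fs = fsspec.filesystem(
    "dvc",
    url="https://github.com/your_username_placeholder/your_zendag_project_placeholder.git",
    rev="v1.0.0-data",
)

# List DVC-tracked files to confirm the path used in catalog.yaml exists at this revision.
print(fs.ls("artifacts/transform/default_transform", detail=False))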

Step 5: Using the Intake Catalog (fsspec version)

In any Python script or Jupyter Notebook:

import intake
import pandas as pd # For type hint and checking
import os

# --- Create a dummy catalog.yaml for this notebook execution ---
# In a real scenario, this file would exist independently.
# !!! REPLACE with your actual Git repo URL and path for this to work beyond this notebook !!!
DUMMY_GIT_OWNER = "your_username_placeholder"
DUMMY_GIT_REPO_NAME = "your_zendag_project_placeholder"
# Use an https:// URL for a hosted repo, or a file:// URL if you are testing against a local clone.
# A real end-to-end test requires pushing this ZenDag project (with its DVC setup) to your own Git host.
# To keep this notebook self-contained, the placeholder below is left in place;
# the read below will simply fail with a diagnostic message if it is not replaced.
DUMMY_GIT_REPO_URL_FOR_FSSPEC = f"https_IS_A_PLACEHOLDER_REPLACE_ME_github.com/{DUMMY_GIT_OWNER}/{DUMMY_GIT_REPO_NAME}.git"
# If testing locally against a checked-out version of the project (that has DVC setup):
# DUMMY_GIT_REPO_URL_FOR_FSSPEC = f"file://{os.path.abspath('.')}" # Points to current dir if it's the git repo root


if "PLACEHOLDER_REPLACE_ME" in DUMMY_GIT_REPO_URL_FOR_FSSPEC:
    print(f"WARNING: DUMMY_GIT_REPO_URL_FOR_FSSPEC ('{DUMMY_GIT_REPO_URL_FOR_FSSPEC}') is a placeholder.")
    print("Replace it with your actual Git repo URL and adjust paths for this example to fully work.")

catalog_fsspec_content = f"""
sources:
  processed_dataset_v1_fsspec:
    driver: csv 
    description: "Version 1.0.0 of the processed dataset (fsspec dvc access)."
    args:
      urlpath: "dvc://{DUMMY_GIT_OWNER}/{DUMMY_GIT_REPO_NAME}/artifacts/transform/default_transform/data/processed/output.csv"
      storage_options:
        target_options:
          url: "{DUMMY_GIT_REPO_URL_FOR_FSSPEC}"
        rev: "v1.0.0-data" # Ensure this tag exists in your repo, or use a valid commit/branch
      csv_kwargs: 
        dtype: {{"id": "int", "value": "float", "scaled_value": "float"}}
"""
with open("temp_fsspec_catalog.yaml", "w") as f:
    f.write(catalog_fsspec_content)
# --- End dummy catalog creation ---


# Ensure your temp_fsspec_catalog.yaml is in the current directory or provide its path
catalog_fsspec = None
try:
    catalog_fsspec = intake.open_catalog("temp_fsspec_catalog.yaml")
except Exception as e:
    print(f"Error opening fsspec catalog: {e}")
    print("This example may not fully run if the Git repo URL is a placeholder or the specified rev/path doesn't exist.")

if catalog_fsspec:
    print("Available sources in fsspec catalog:", list(catalog_fsspec))
    
    dataset_entry_name_fsspec = 'processed_dataset_v1_fsspec'
    if dataset_entry_name_fsspec in catalog_fsspec:
        dataset_entry_fsspec = catalog_fsspec[dataset_entry_name_fsspec]
        print(f"\nDataset entry '{dataset_entry_name_fsspec}' description:", dataset_entry_fsspec.description)
        
        print("Attempting to read data using fsspec dvc (this might take a moment for the first time if accessing a remote repo)...")
        try:
            df_fsspec: pd.DataFrame = dataset_entry_fsspec.read() # `read()` loads the data
            
            print("\nFirst 5 rows of the loaded DataFrame (via fsspec dvc):")
            print(df_fsspec.head())
            print(f"\nDataFrame shape: {df_fsspec.shape}")
        except Exception as e:
            print(f"\nERROR reading data source '{dataset_entry_name_fsspec}': {e}")
            print("This often happens if:")
            print(f"  - The Git repo URL ('{DUMMY_GIT_REPO_URL_FOR_FSSPEC}') is a placeholder, incorrect, or inaccessible.")
            print(f"  - The Git revision (tag/commit) 'v1.0.0-data' does not exist in that repo or doesn't contain the DVC metadata for the specified path.")
            print(f"  - The DVC remote (if needed) is not accessible or the data for the path has not been pushed to it.")
            print(f"  - 'dvc' CLI is not installed, or 'dvcfs' (required by fsspec's dvc:// protocol) is not available.")
            print(f"  - The 'driver: csv' needs pandas or an appropriate backend installed in your Python environment.")

    else:
        print(f"Data source '{dataset_entry_name_fsspec}' not found in catalog.")

# Clean up dummy catalog
if os.path.exists("temp_fsspec_catalog.yaml"):
    os.remove("temp_fsspec_catalog.yaml")

What happens when you call dataset_entry.read() with this fsspec setup:

  1. Intake identifies the driver (e.g., csv).

  2. It sees the urlpath starting with dvc://.

  3. It uses fsspec with the dvcfs implementation to open this URL.

  4. dvcfs uses the storage_options (like target_options.url for the Git repo and rev for the Git revision) to:

    • Access the specified Git repository at the given revision (cloning/checking out to a temporary location if needed).

    • Find the DVC metadata for the file path within the urlpath.

    • Use DVC internally to make the actual data file available (pulling from a DVC remote if necessary).

    • dvcfs then provides a file-like object for this data.

  5. Intake’s chosen driver (e.g., the CSV driver) then reads from this file-like object, applying any specified args (like csv_kwargs).
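
For intuition, here is roughly the same sequence done by hand with DVC's filesystem and pandas (a sketch using the same placeholder repo URL; exact constructor arguments may differ slightly between DVC versions):

import pandas as pd
from dvc.api import DVCFileSystem

# 1. Build a DVC filesystem for the Git repo at the tagged revision.
#    !!! REPLACE the repo URL with your own !!!
fs = DVCFileSystem(
    "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git",
    rev="v1.0.0-data",
)

# 2. Open the DVC-tracked file; bytes are pulled from the DVC remote if they
#    are not already in the local cache.
with fs.open("artifacts/transform/default_transform/data/processed/output.csv") as f:
    # 3. Hand the file-like object to the format reader (pandas for CSV).
    df = pd.read_csv(f, dtype={"id": "int", "value": "float", "scaled_value": "float"})

print(df.head())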

Benefits of the fsspec Approach

  • Standardization: Uses the widely adopted fsspec interface, making it compatible with many libraries.

  • Flexibility: You choose the Intake driver based on your data format (CSV, Parquet, NetCDF, Zarr, etc.), and dvcfs handles getting the DVC-versioned bytes to that driver.

  • Ecosystem: Leverages the strengths of both Intake (cataloging, unified API) and fsspec (versatile file system access).

This fsspec-based method is the modern and generally recommended way to use Intake with DVC-versioned data, offering greater flexibility and integration with the broader Python data ecosystem.