Sharing Data & Models: DVC Versioning with Intake Catalogs (using fsspec)

Once you’ve used ZenDag and DVC to produce versioned datasets and models, how do you easily share and consume them? Intake is a lightweight Python library for finding, investigating, loading, and disseminating data. By leveraging fsspec and its DVC filesystem implementation (dvcfs), Intake provides a modern and flexible way to access DVC-versioned assets.

Recap: DVC for Versioning Artifacts

As covered previously, ZenDag and DVC work together to version your data. When a stage produces an output like artifacts/transform/default_transform/output.csv, DVC creates a .dvc file tracking its metadata and content hash.
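
For reference, a .dvc file is a small YAML stub that records the hash and size of the tracked file; a minimal example (hash and size values here are illustrative) looks roughly like this:

# artifacts/transform/default_transform/output.csv.dvc (illustrative values)
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 10240
  path: output.csv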

Step 1: Push DVC Data to a Remote

To share your data, it must first be pushed to a DVC remote (S3, GCS, SSH, a local directory, etc.).

  1. Add a remote (if not done already):

    # Example for a local directory remote (for testing)
    mkdir -p /tmp/my_dvc_remote 
    dvc remote add -d mylocalremote /tmp/my_dvc_remote
    
  2. Push data to the remote:

    dvc push -r mylocalremote
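
     You can confirm that the data landed in the remote by comparing the local cache against it:

    # -c/--cloud compares the local DVC cache against the configured remote
    dvc status -c -r mylocalremote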
    

Step 2: Tag a Specific Version in Git

Git tags mark specific, stable versions of your DVC metadata (.dvc files and dvc.lock).

  1. Commit all relevant .dvc files and dvc.lock.
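
     For example (the .dvc file paths are illustrative; use the ones your pipeline actually produced):

    # paths below are illustrative; adjust to your project layout
    git add artifacts/transform/default_transform/output.csv.dvc dvc.lock
    git commit -m "Track processed dataset for v1.0.0-data"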

  2. Create and push a tag:

    git tag v1.0.0-data -m "Stable processed dataset version 1.0.0 from quickstart"
    git push origin v1.0.0-data
    

Step 3: Install Intake and fsspec DVC support

In the environment where you’ll consume the data, ensure you have:

pip install intake pandas # For CSV reading
# The fsspec-compatible DVC filesystem (dvcfs) ships with recent DVC releases,
# so install dvc itself (plus any remote extras you need, e.g. "dvc[s3]")
pip install dvc
# If you plan to read other formats, install relevant intake plugins, e.g.:
# pip install intake-xarray intake-parquet
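
As a quick sanity check, you can confirm that fsspec resolves the dvc:// protocol in your environment (it should map to DVC's filesystem class once dvc is installed):

python -c "import fsspec; print(fsspec.get_filesystem_class('dvc'))"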

Step 4: Creating an Intake Catalog (catalog.yaml) with fsspec

An Intake catalog (YAML file) describes your data sources. We’ll use the fsspec dvc:// URL scheme.

Create catalog.yaml:

# catalog.yaml (using fsspec dvc:// syntax)
sources:
  processed_dataset_v1_fsspec:
    driver: csv # Intake driver for CSV files (uses pandas by default)
    description: "Version 1.0.0 of the processed dataset (fsspec dvc access)."
    args:
      # urlpath uses the dvc:// fsspec protocol.
      # The owner/repo prefix after dvc:// is conventional/informational here;
      # the Git repository is actually determined by storage_options.target_options.url below.
      urlpath: "dvc://your_username_placeholder/your_zendag_project_placeholder/artifacts/transform/default_transform/data/processed/output.csv" # !!! REPLACE owner/repo part !!!
      storage_options:
        # target_options specifies the Git repository details for dvcfs
        target_options:
          url: "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git" # !!! REPLACE THIS !!!
        # rev is the Git revision (tag, branch, or commit hash)
        rev: "v1.0.0-data"
        # remote: "mylocalremote" # Optional: DVC remote if not default/auto-discoverable
      # Arguments for the 'csv' driver (passed to pandas.read_csv)
      csv_kwargs: 
        dtype: {"id": "int", "value": "float", "scaled_value": "float"}

  # Example for a NetCDF file, assuming intake-xarray is installed
  # weather_model_output_v2_fsspec:
  #   driver: netcdf 
  #   description: "Weather model output v2 (NetCDF via fsspec dvc)."
  #   args:
  #     urlpath: "dvc://your_username_placeholder/your_zendag_project_placeholder/data/models/weather_v2.nc" # !!! REPLACE !!!
  #     chunks: {} # Argument for xarray.open_dataset
  #     storage_options:
  #       target_options:
  #         url: "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git" # !!! REPLACE !!!
  #       rev: "weather-model-v2-tag"

Important:

  • Replace placeholders like your_username_placeholder/your_zendag_project_placeholder with your actual Git repository owner and name.

  • The path part of the urlpath must be the exact path to the data file as tracked by DVC within that repository structure.

  • Ensure the Git rev (e.g., tag v1.0.0-data) exists in your Git repository.
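
Before wiring up the catalog, it can help to sanity-check the DVC filesystem directly. A minimal sketch, assuming DVC's fsspec-compatible filesystem is installed (the exact constructor arguments may vary slightly between DVC versions):

import fsspec

# Instantiate the DVC filesystem for the Git repo at the tagged revision.
# !!! REPLACE the repo URL with your own !!!
fs = fsspec.filesystem(
    "dvc",
    url="https://github.com/your_username_placeholder/your_zendag_project_placeholder.git",
    rev="v1.0.0-data",
)

# List DVC-tracked files to confirm the path used in catalog.yaml exists at this revision.
print(fs.ls("artifacts/transform/default_transform", detail=False))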

Step 5: Using the Intake Catalog (fsspec version)

In any Python script or Jupyter Notebook:

import intake
import pandas as pd # For type hint and checking
import os

# --- Create a dummy catalog.yaml for this notebook execution ---
# In a real scenario, this file would exist independently.
# !!! REPLACE with your actual Git repo URL and path for this to work beyond this notebook !!!
DUMMY_GIT_OWNER = "your_username_placeholder"
DUMMY_GIT_REPO_NAME = "your_zendag_project_placeholder"
# Use an https:// URL for a hosted repo, or a file:// URL if you are testing against a local clone.
# A real end-to-end test requires pushing this ZenDag project (with its DVC setup) to your own Git host.
# To keep this notebook self-contained, the placeholder below is left in place;
# the read below will simply fail with a diagnostic message if it is not replaced.
DUMMY_GIT_REPO_URL_FOR_FSSPEC = f"https_IS_A_PLACEHOLDER_REPLACE_ME_github.com/{DUMMY_GIT_OWNER}/{DUMMY_GIT_REPO_NAME}.git"
# If testing locally against a checked-out version of the project (that has DVC setup):
# DUMMY_GIT_REPO_URL_FOR_FSSPEC = f"file://{os.path.abspath('.')}" # Points to current dir if it's the git repo root


if "PLACEHOLDER_REPLACE_ME" in DUMMY_GIT_REPO_URL_FOR_FSSPEC:
    print(f"WARNING: DUMMY_GIT_REPO_URL_FOR_FSSPEC ('{DUMMY_GIT_REPO_URL_FOR_FSSPEC}') is a placeholder.")
    print("Replace it with your actual Git repo URL and adjust paths for this example to fully work.")

catalog_fsspec_content = f"""
sources:
  processed_dataset_v1_fsspec:
    driver: csv 
    description: "Version 1.0.0 of the processed dataset (fsspec dvc access)."
    args:
      urlpath: "dvc://{DUMMY_GIT_OWNER}/{DUMMY_GIT_REPO_NAME}/artifacts/transform/default_transform/data/processed/output.csv"
      storage_options:
        target_options:
          url: "{DUMMY_GIT_REPO_URL_FOR_FSSPEC}"
        rev: "v1.0.0-data" # Ensure this tag exists in your repo, or use a valid commit/branch
      csv_kwargs: 
        dtype: {{"id": "int", "value": "float", "scaled_value": "float"}}
"""
with open("temp_fsspec_catalog.yaml", "w") as f:
    f.write(catalog_fsspec_content)
# --- End dummy catalog creation ---


# Ensure your temp_fsspec_catalog.yaml is in the current directory or provide its path
catalog_fsspec = None
try:
    catalog_fsspec = intake.open_catalog("temp_fsspec_catalog.yaml")
except Exception as e:
    print(f"Error opening fsspec catalog: {e}")
    print("This example may not fully run if the Git repo URL is a placeholder or the specified rev/path doesn't exist.")

if catalog_fsspec:
    print("Available sources in fsspec catalog:", list(catalog_fsspec))
    
    dataset_entry_name_fsspec = 'processed_dataset_v1_fsspec'
    if dataset_entry_name_fsspec in catalog_fsspec:
        dataset_entry_fsspec = catalog_fsspec[dataset_entry_name_fsspec]
        print(f"\nDataset entry '{dataset_entry_name_fsspec}' description:", dataset_entry_fsspec.description)
        
        print("Attempting to read data using fsspec dvc (this might take a moment for the first time if accessing a remote repo)...")
        try:
            df_fsspec: pd.DataFrame = dataset_entry_fsspec.read() # `read()` loads the data
            
            print("\nFirst 5 rows of the loaded DataFrame (via fsspec dvc):")
            print(df_fsspec.head())
            print(f"\nDataFrame shape: {df_fsspec.shape}")
        except Exception as e:
            print(f"\nERROR reading data source '{dataset_entry_name_fsspec}': {e}")
            print("This often happens if:")
            print(f"  - The Git repo URL ('{DUMMY_GIT_REPO_URL_FOR_FSSPEC}') is a placeholder, incorrect, or inaccessible.")
            print(f"  - The Git revision (tag/commit) 'v1.0.0-data' does not exist in that repo or doesn't contain the DVC metadata for the specified path.")
            print(f"  - The DVC remote (if needed) is not accessible or the data for the path has not been pushed to it.")
            print(f"  - 'dvc' CLI is not installed, or 'dvcfs' (required by fsspec's dvc:// protocol) is not available.")
            print(f"  - The 'driver: csv' needs pandas or an appropriate backend installed in your Python environment.")

    else:
        print(f"Data source '{dataset_entry_name_fsspec}' not found in catalog.")

# Clean up dummy catalog
if os.path.exists("temp_fsspec_catalog.yaml"):
    os.remove("temp_fsspec_catalog.yaml")

What happens when you call dataset_entry.read() with this fsspec setup:

  1. Intake identifies the driver (e.g., csv).

  2. It sees the urlpath starting with dvc://.

  3. It uses fsspec with the dvcfs implementation to open this URL.

  4. dvcfs uses the storage_options (like target_options.url for the Git repo and rev for the Git revision) to:

    • Access the specified Git repository at the given revision (cloning/checking out to a temporary location if needed).

    • Find the DVC metadata for the file path within the urlpath.

    • Use DVC internally to make the actual data file available (pulling from a DVC remote if necessary).

    • dvcfs then provides a file-like object for this data.

  5. Intake’s chosen driver (e.g., the CSV driver) then reads from this file-like object, applying any specified args (like csv_kwargs).
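
For intuition, here is roughly the same sequence done by hand with DVC's filesystem and pandas (a sketch using the same placeholder repo URL; exact constructor arguments may differ slightly between DVC versions):

import pandas as pd
from dvc.api import DVCFileSystem

# 1. Build a DVC filesystem for the Git repo at the tagged revision.
#    !!! REPLACE the repo URL with your own !!!
fs = DVCFileSystem(
    "https://github.com/your_username_placeholder/your_zendag_project_placeholder.git",
    rev="v1.0.0-data",
)

# 2. Open the DVC-tracked file; bytes are pulled from the DVC remote if they
#    are not already in the local cache.
with fs.open("artifacts/transform/default_transform/data/processed/output.csv") as f:
    # 3. Hand the file-like object to the format reader (pandas for CSV).
    df = pd.read_csv(f, dtype={"id": "int", "value": "float", "scaled_value": "float"})

print(df.head())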

Benefits of the fsspec Approach

  • Standardization: Uses the widely adopted fsspec interface, making it compatible with many libraries.

  • Flexibility: You choose the Intake driver based on your data format (CSV, Parquet, NetCDF, Zarr, etc.), and dvcfs handles getting the DVC-versioned bytes to that driver.

  • Ecosystem: Leverages the strengths of both Intake (cataloging, unified API) and fsspec (versatile file system access).

This fsspec-based method is the modern and generally recommended way to use Intake with DVC-versioned data, offering greater flexibility and integration with the broader Python data ecosystem.