Artifacts, Logging, and Reproducible Workflows¶

1. Introduction to Snapshots¶
(a). The Challenge of Reproducibility in AI¶
Building an AI pipeline is an iterative process. You start with an initial implementation, perhaps a data preprocessing pipeline and a first model. You run experiments, evaluate results, tweak hyperparameters, try a different model architecture, retrain, evaluate again, and repeat. This cycle continues until you reach a working solution.
But as your project evolves through these iterations, keeping track of what you've done and what outputs you've generated becomes messy. Here's why:
- Messy Files Everywhere: Scripts, outputs, and data files pile up with each iteration. You lose track of what's important and what state your project is in.
- No Version Control for Results: While code is typically version-controlled using systems like Git, data outputs, models, and visualizations often lack systematic versioning, leading to ambiguity regarding which script with what configuration generated which result.
- Can't Recreate Past Experiments: Want to rerun an experiment from last week? You're now hunting through different config files, trying to remember which hyperparameters you used, which data preprocessing steps were active, and what model checkpoint you started from. Without careful tracking, it's nearly impossible to know.
- Inconsistent Logs: Everyone logs differently (or not at all), making it hard to debug issues or understand what happened during a run.
These problems get worse with each iteration. What starts as a minor inconvenience becomes a major blocker when you need to compare results across experiments, validate findings, or move from development to production.
(b). OpenCrate's Solution: The Snapshot API¶
Here's what Snapshots give you:
- Auto-organized Outputs: File paths and versions are handled automatically. Everything stays clean and traceable across iterations.
- Built-in Logging: Every run gets logged automatically. No more scattered print statements.
- Easy Artifact Management: Save and load any data type (CSVs, images, models and more...) with simple APIs. OpenCrate handles the messy details.
- Version Control for Results: Backup important files before overwriting them. Experiment freely without fear of losing work.
OpenCrate handles the boring file management stuff so you can focus on your actual work. This guide will show you how to use it to build clean, reproducible pipelines.
2. Core Concepts: Snapshots¶
(a). Understanding Snapshots¶
A Snapshot is simply a dedicated folder for one execution run of your pipeline. Think of it like a Git commit, but for your results and configurations instead of code. It sounds almost too simple, and it is, but a single folder holding the full configuration of your pipeline along with every output generated during that run solves every major pain point mentioned above.
What makes snapshots useful:
- Isolation: Each snapshot is separate. Different experiments don't interfere with each other.
- Automatic Versioning: Snapshots are numbered (v0, v1, v2...). Easy to track what changed when.
- Reproducibility: Everything from a specific run is stored together. You can always go back to see exactly what you had.
- Flexibility: Create snapshots at different stages of your workflow: data processing, model training, evaluation, and so on.
(b). Initializing a Snapshot¶
Use oc.snapshot.setup() to create or resume a snapshot. Here are the main parameters:
name (str): Give your snapshot series a unique name (e.g., "my_experiment").
start (str or int): Controls which snapshot version to use:
- "new": Create a fresh snapshot (increments version: v0 → v1 → v2...)
- "last": Resume from the most recent snapshot
- 0, 1, 2...: Resume from or create a specific version number
tag (str, optional): Add a label to the snapshot (e.g., v0:baseline, v1:feature-x). Useful for marking different experiments or configurations.
log_level (str, optional): How much detail to log. Options: "debug", "info", "warning", "error", "critical". Default is "info".
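To make the three start modes concrete, here is a minimal, hypothetical sketch in plain Python (not OpenCrate's internals) of how a start value and tag could resolve to a version number and a folder name like v0:initial-run:

```python
def resolve_version(existing_versions, start):
    """Hypothetical sketch of how `start` could pick a snapshot version.

    existing_versions: sorted version numbers already on disk, e.g. [0, 1].
    start: "new", "last", or an explicit integer version.
    """
    if start == "new":
        # Bump to the next unused version number (v0 -> v1 -> v2 ...)
        return (existing_versions[-1] + 1) if existing_versions else 0
    if start == "last":
        if not existing_versions:
            raise ValueError("No snapshot to resume from")
        return existing_versions[-1]
    return int(start)  # explicit version number

def version_name(version, tag=None):
    # Snapshot folder names look like "v0" or "v0:initial-run"
    return f"v{version}:{tag}" if tag else f"v{version}"

print(resolve_version([], "new"))       # 0
print(resolve_version([0, 1], "new"))   # 2
print(resolve_version([0, 1], "last"))  # 1
print(version_name(0, "initial-run"))   # v0:initial-run
```

This is only a mental model; the real API keeps all of this behind oc.snapshot.setup().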
(c). Demonstrating Snapshot Initialization¶
Let's create our first snapshot named snapshot_guide with the tag initial-run.
import opencrate as oc
oc.snapshot.setup(name="snapshot_guide", start="new", tag="initial-run")
oc.snapshot.reset(confirm=True)  # clear any existing snapshots from previous runs of this guide
oc.snapshot.setup(name="snapshot_guide", start="new", tag="initial-run")
oc.info(
    f"Snapshot with version `{oc.snapshot.version}` and name `{oc.snapshot.version_name}` has been set up at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)
INFO Snapshot with version `0` and name `v0:initial-run` has been set up at: `snapshots/snapshot_guide/v0:initial-run`
snapshots
└── snapshot_guide/
    └── v0:initial-run/
        └── snapshot_guide.log
As you can see, we've created the very first snapshot for our pipeline, named snapshot_guide, with version v0 and tag initial-run.
(d). Resuming an Existing Snapshot¶
Sometimes you need to continue work from where you left off. Use oc.snapshot.setup() with start="last" to pick up from the most recent snapshot, or start=<version_number> to target a specific version.
Important: If your snapshot has a tag, you must pass the same tag when resuming. Otherwise, OpenCrate creates a new snapshot without the tag instead of resuming the existing one.
In our example: use start="last" to resume the most recent snapshot, or start=0 to specifically resume v0.
# Notebook is restarted here to simulate a fresh run.
import opencrate as oc
oc.snapshot.setup(name="snapshot_guide", start="last", tag="initial-run")
# in our case we can pass start="last" since we're resuming from the most recent snapshot
# alternatively we can pass start=0 to resume version v0 specifically
oc.info(
    f"Resumed Snapshot with version `{oc.snapshot.version}` and name `{oc.snapshot.version_name}` located at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)
INFO Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`
snapshots
└── snapshot_guide/
    └── v0:initial-run/
        ├── snapshot_guide.history.log
        └── snapshot_guide.log
Notice that the log files snapshot_guide.log and snapshot_guide.history.log were created automatically. We'll talk more about logging in a bit.
(e). Creating a New Snapshot Version¶
When you hit a major milestone or want to save a stable state before making big changes, create a new snapshot version using start="new". OpenCrate will bump the version number automatically (v0 → v1 → v2...).
When to create a new version:
- Before major changes: Save a baseline before experimenting with new features or algorithms
- After important updates: Document results from significant changes (new model architecture, different hyperparameters, etc.)
- For clean history: Keep each major stage of development separate and easy to compare
# Notebook is restarted here to simulate a fresh run.
import opencrate as oc
oc.snapshot.setup(name="snapshot_guide", start="new", tag="major-update")
oc.info(
    f"New Snapshot version `{oc.snapshot.version}` with name `{oc.snapshot.version_name}` has been set up at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)
INFO New Snapshot version `1` with name `v1:major-update` has been set up at: `snapshots/snapshot_guide/v1:major-update`
snapshots
└── snapshot_guide/
    ├── v0:initial-run/
    │   ├── snapshot_guide.history.log
    │   └── snapshot_guide.log
    └── v1:major-update/
        └── snapshot_guide.log
3. Integrated Logging for Pipeline Observability¶
(a). Why Logging Matters¶
Good logging is essential for any serious project. Here's why:
- Debugging: Logs show exactly when and where things went wrong
- Monitoring: Track your pipeline's progress and performance
- Reproducibility: Document what happened in each run so you can recreate or verify results
- Status Updates: See what's happening during long-running jobs
Without good logs, you're flying blind. Debugging becomes guesswork and reproducing results becomes impossible.
(b). OpenCrate's Logging System¶
When you create a snapshot, OpenCrate automatically sets up logging. You get two log files in your snapshot directory:
<name>.log (e.g., snapshot_guide.log): Logs from the current run only. Gets overwritten each time you run your pipeline. Perfect for checking what just happened.
<name>.history.log (e.g., snapshot_guide.history.log): All logs from all runs, appended over time. Your complete history for this snapshot version. Only created after you've run the pipeline more than once.
This gives you both a clean view of your latest run and a full history when you need it.
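The two-file scheme is easy to picture with Python's standard logging module: one file handler opens in write mode (truncated each run), the other in append mode (full history). A minimal sketch of the idea, assuming nothing about OpenCrate's actual implementation:

```python
import logging
import os
import tempfile

def setup_dual_logging(snapshot_dir, name):
    # Sketch of the two-file scheme: <name>.log is truncated per run,
    # <name>.history.log accumulates across runs.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # drop handlers from any previous "run"
    fmt = logging.Formatter("%(asctime)s - %(levelname)s %(message)s")
    current = logging.FileHandler(os.path.join(snapshot_dir, f"{name}.log"), mode="w")
    history = logging.FileHandler(os.path.join(snapshot_dir, f"{name}.history.log"), mode="a")
    for h in (current, history):
        h.setFormatter(fmt)
        logger.addHandler(h)
    return logger

# Simulate two runs: the .log file keeps only the second run,
# while the .history.log file keeps both.
d = tempfile.mkdtemp()
for run in (1, 2):
    log = setup_dual_logging(d, "demo")
    log.info(f"message from run {run}")
    for h in log.handlers:
        h.close()

print(open(os.path.join(d, "demo.log")).read())          # only run 2
print(open(os.path.join(d, "demo.history.log")).read())  # runs 1 and 2
```

The mode="w" vs mode="a" distinction is the whole trick: same messages, two retention policies.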
(c). Logging Levels and Usage¶
OpenCrate provides simple logging functions for different situations:
- oc.info(): General updates about what's happening
- oc.debug(): Detailed info for troubleshooting (usually filtered out in production)
- oc.warning(): Something's off but not broken
- oc.error(): Something failed in a specific task
- oc.critical(): Major failure, pipeline might crash
- oc.success(): Confirm an important step completed
- oc.exception(): Use in try...except blocks to log full error details with traceback
Just pass a string to any of these functions and OpenCrate handles the rest.
oc.info("This is an informational message from the current run.")
oc.debug("Detailed debug information for troubleshooting.")
oc.warning("A potential issue detected, but execution continues.")
oc.error("An error occurred, affecting a part of the pipeline.")
oc.critical("Critical failure: pipeline likely to terminate.")
oc.success("Important step completed successfully!")
try:
    # Simulate an error
    result = 10 / 0
except ZeroDivisionError:
    oc.exception("Caught a division by zero error.")
oc.info("All log messages have been dispatched.")
INFO This is an informational message from the current run.
WARNING A potential issue detected, but execution continues.
ERROR An error occurred, affecting a part of the pipeline.
CRITICAL Critical failure: pipeline likely to terminate.
SUCCESS Important step completed successfully!
ERROR Caught a division by zero error.
Traceback (most recent call last):
  File "/tmp/ipykernel_351658/1530191759.py", line 10, in <module>
    result = 10 / 0
ZeroDivisionError: division by zero
INFO All log messages have been dispatched.
Note that the oc.debug() message doesn't appear: the default log level is "info", so debug messages are filtered out.
(d). Demonstrating Logging and Log File Analysis¶
Let's check the log files OpenCrate created. Notice:
- The v0:initial-run snapshot has two log files: snapshot_guide.log (latest run) and snapshot_guide.history.log (previous runs).
- The v1:major-update snapshot only has snapshot_guide.log because it's only been run once.
oc.io.show_files_in_dir("snapshots", depth=4)
snapshots
└── snapshot_guide/
    ├── v0:initial-run/
    │   ├── snapshot_guide.history.log
    │   └── snapshot_guide.log
    └── v1:major-update/
        └── snapshot_guide.log
Let's compare the logs for v0:initial-run to see the difference.
!cat snapshots/snapshot_guide/v0:initial-run/snapshot_guide.log
2025-11-16 11:44:43 - INFO Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`
!cat snapshots/snapshot_guide/v0:initial-run/snapshot_guide.history.log
2025-11-16 11:44:23 - INFO Snapshot with version `0` and name `v0:initial-run` has been set up at: `snapshots/snapshot_guide/v0:initial-run`
2025-11-16 11:44:43 - INFO Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`
As expected: snapshot_guide.history.log has logs from both our current and previous runs, while snapshot_guide.log only has the current run. Every time you resume a snapshot, .log gets overwritten with fresh logs, while .history.log keeps growing with the full timeline.
Quick check: let's look at v1:major-update/snapshot_guide.log. It should only have logs from the current run.
!cat snapshots/snapshot_guide/v1:major-update/snapshot_guide.log
2025-11-16 11:45:03 - INFO New Snapshot version `1` with name `v1:major-update` has been set up at: `snapshots/snapshot_guide/v1:major-update`
2025-11-16 11:45:07 - INFO This is an informational message from the current run.
2025-11-16 11:45:07 - WARNING A potential issue detected, but execution continues.
2025-11-16 11:45:07 - ERROR An error occurred, affecting a part of the pipeline.
2025-11-16 11:45:07 - CRITICAL Critical failure: pipeline likely to terminate.
2025-11-16 11:45:07 - SUCCESS Important step completed successfully!
2025-11-16 11:45:07 - ERROR Caught a division by zero error.
Traceback (most recent call last):
File "/tmp/ipykernel_351658/1530191759.py", line 10, in <module>
result = 10 / 0
ZeroDivisionError: division by zero
2025-11-16 11:45:07 - INFO All log messages have been dispatched.
Perfect!
4. Artifact Management: Saving and Loading Data¶
(a). What Are Artifacts?¶
An artifact is any important output from your pipeline that you want to keep. These aren't temporary files; they're the outputs that matter:
- Processed Datasets: Cleaned data, feature-engineered datasets (e.g., training_data.csv)
- Models: Trained weights, saved model files (e.g., model_v1.pth, classifier.pkl)
- Visualizations: Important plots and charts (e.g., accuracy_plot.png, confusion_matrix.jpg)
- Config Files: Settings and parameters used during training
OpenCrate handles all the annoying details: file paths, serialization, versioning. You just call .save() and .load().
(b). Built-in Artifact Handlers¶
OpenCrate has handlers for common file types. Just pick the right one for your data, give it a name, and call .save(). OpenCrate handles the rest.
Data & Configuration Handlers:¶
- oc.snapshot.json(name): Manages Python dictionaries, lists, and other JSON-serializable objects, saving them as .json files.
- oc.snapshot.yaml(name): Ideal for configuration management, handling dictionaries and similar structures as .yaml files.
- oc.snapshot.csv(name): Designed for tabular data, supporting Pandas DataFrames, lists of lists, or NumPy arrays for saving to .csv format.
- oc.snapshot.text(name): A versatile handler for saving any string data to a plain .txt file.
Media Handlers:¶
- oc.snapshot.image(name): Handles various image formats, supporting saving and loading from NumPy arrays, PIL Images, or Matplotlib figures. Offers a lib parameter for specifying the image processing library (e.g., "pil", "cv2").
- oc.snapshot.gif(name): Facilitates the creation and loading of animated GIFs from a sequence of images.
- oc.snapshot.video(name): Manages video files from diverse sources.
- oc.snapshot.audio(name): Supports audio data from libraries like Torchaudio or Librosa, with options to specify the sampling rate and library.
Machine Learning Model Handlers:¶
- oc.snapshot.checkpoint(name): A powerful handler for saving and loading machine learning model checkpoints. It supports a wide array of popular frameworks and formats, including:
  - PyTorch (.pth, .pt, .safetensors)
  - TensorFlow/Keras (.h5, .keras)
  - Scikit-learn (.joblib, .pkl)
  - And more, typically by handling a dictionary containing model state, optimizer state, and other metadata.
Let's save different file types: JSON, CSV, text, images, audio, and a PyTorch model checkpoint.
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
# First we initialize our artifacts based on their handling type
greeting_artifact = oc.snapshot.text("greeting.txt")
data_artifact = oc.snapshot.json("data.json")
config_artifact = oc.snapshot.yaml("config.yaml")
sample_data_artifact = oc.snapshot.csv("sample_data.csv")
sine_artifact = oc.snapshot.image("sine_wave_plot.png")
numpy_image_artifact = oc.snapshot.image("random_numpy_image.jpg")
audio_artifact = oc.snapshot.audio("high_pitch_sine.wav")
custom_model_ckpt_artifact = oc.snapshot.checkpoint("custom_model_checkpoint.pth")
greeting_artifact.save("Hello, OpenCrate Guide!") # saving as plain text
data_artifact.save({"array": [10, 20, 30], "message": "Sample JSON data"}) # saving as JSON
config_artifact.save({"project": "OpenCrate Guide", "version": 1.1, "settings": {"debug_mode": True}}) # saving as YAML
sample_data_artifact.save(pd.DataFrame({"col_a": [100, 200], "col_b": [300, 400]}), index=False) # saving as CSV
figure = plt.figure(figsize=(6, 4))
plt.plot(np.sin(np.linspace(0, 2 * np.pi, 50)))
plt.title("Sine Wave Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
sine_artifact.save(figure) # saving matplotlib figure image
plt.close(figure)
numpy_image = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
numpy_image_artifact.save(numpy_image) # saving numpy array as image
sr = 44100
duration = 3
frequency = 220.0
t = np.linspace(0., duration, int(sr * duration), endpoint=False)
amplitude = 0.3 * np.iinfo(np.int16).max
audio_data = (amplitude * np.sin(2. * np.pi * frequency * t)).astype(np.int16)
audio_artifact.save(audio_data, sr, lib="soundfile")
model = torch.nn.Sequential(
    torch.nn.Linear(20, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 1)
)
optimizer = torch.optim.Adam(lr=0.001, params=model.parameters())
custom_model_ckpt_artifact.save(
    {
        "epoch": 5,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": 0.015,
        "description": "A sample PyTorch model checkpoint after 5 epochs."
    }
)
(c). Visualizing Artifact Storage¶
OpenCrate automatically organizes artifacts into folders by type. Clean and easy to navigate.
oc.io.show_files_in_dir("snapshots", depth=4, verbose=True)
# neat trick - you can use verbose argument in show_files_in_dir to see file sizes and last modified times
snapshots
└── snapshot_guide/ (2025-11-16 11:45, 298.0 KB)
    ├── v0:initial-run/ (2025-11-16 11:44, 435 B)
    │   ├── snapshot_guide.history.log (2025-11-16 11:44, 290 B)
    │   └── snapshot_guide.log (2025-11-16 11:44, 145 B)
    └── v1:major-update/ (2025-11-16 11:45, 297.5 KB)
        ├── audios/ (2025-11-16 11:45, 258.4 KB)
        │   └── high_pitch_sine.wav (2025-11-16 11:45, 258.4 KB)
        ├── checkpoints/ (2025-11-16 11:45, 3.8 KB)
        │   └── custom_model_checkpoint.pth (2025-11-16 11:45, 3.8 KB)
        ├── csvs/ (2025-11-16 11:45, 28 B)
        │   └── sample_data.csv (2025-11-16 11:45, 28 B)
        ├── images/ (2025-11-16 11:45, 34.3 KB)
        │   ├── random_numpy_image.jpg (2025-11-16 11:45, 10.2 KB)
        │   └── sine_wave_plot.png (2025-11-16 11:45, 24.1 KB)
        ├── jsons/ (2025-11-16 11:45, 54 B)
        │   └── data.json (2025-11-16 11:45, 54 B)
        ├── texts/ (2025-11-16 11:45, 23 B)
        │   └── greeting.txt (2025-11-16 11:45, 23 B)
        ├── yamls/ (2025-11-16 11:45, 67 B)
        │   └── config.yaml (2025-11-16 11:45, 67 B)
        └── snapshot_guide.log (2025-11-16 11:45, 844 B)
(d). Loading Artifacts¶
Loading is just as easy as saving. Call .load() and you get your data back in its original Python format. No worrying about file paths or deserialization.
loaded_greeting = greeting_artifact.load()
oc.info(f"Loaded Text: {loaded_greeting}")
loaded_json_data = data_artifact.load()
oc.info(f"Loaded JSON: {loaded_json_data}")
loaded_config = config_artifact.load()
oc.info(f"Loaded YAML Config: {loaded_config}")
loaded_csv_data = sample_data_artifact.load()
oc.info(f"Loaded CSV Data:\n{loaded_csv_data}")
loaded_sine_wave_plot = sine_artifact.load(lib="cv2")
oc.info(f"Loaded Sine Wave Plot (shape): {loaded_sine_wave_plot.shape}")
loaded_numpy_image = numpy_image_artifact.load(lib="cv2")
oc.info(f"Loaded NumPy Image (size): {loaded_numpy_image.size}")
# For audio, you might need to specify the library used during saving if not default
# For checkpoint, it typically returns the dictionary it was saved with
loaded_checkpoint = custom_model_ckpt_artifact.load()
oc.info(f"Loaded Checkpoint Keys: {loaded_checkpoint.keys()}")
oc.info(f"Loaded Checkpoint Description: {loaded_checkpoint['description']}")
INFO Loaded Text: Hello, OpenCrate Guide!
INFO Loaded JSON: {'array': [10, 20, 30], 'message': 'Sample JSON data'}
INFO Loaded YAML Config: {'project': 'OpenCrate Guide', 'settings': {'debug_mode': True}, 'version': 1.1}
INFO Loaded CSV Data:
   col_a  col_b
0    100    300
1    200    400
INFO Loaded Sine Wave Plot (shape): (393, 557, 3)
INFO Loaded NumPy Image (size): 49152
INFO Loaded Checkpoint Keys: dict_keys(['epoch', 'model_state_dict', 'optimizer_state_dict', 'loss', 'description'])
INFO Loaded Checkpoint Description: A sample PyTorch model checkpoint after 5 epochs.
loaded_audio = audio_artifact.load(lib="soundfile")
def audio_playback_widget(audio_data, sample_rate, volume=0.1):
    import IPython.display as ipd
    import numpy as np

    audio_data = np.array(audio_data) * volume
    ipd.display(
        ipd.Audio(data=audio_data, rate=sample_rate, autoplay=False, normalize=False)
    )
audio_playback_widget(loaded_audio["data"], loaded_audio["sample_rate"])
(e). Advanced Artifact Features¶
Beyond basic save/load, artifacts have useful properties and methods:
Properties:
- .exists: Returns True if the artifact file exists. Use this for conditional logic.
- .path: The full file path where the artifact is stored. Useful when other tools need the path.
Methods:
- .backup(tag=None): Creates a backup copy before you overwrite something important. Add a tag or use automatic timestamps.
- .list_backups(): Shows all backup files for this artifact.
- .delete(confirm=False): Delete an artifact. Requires confirm=True to prevent accidents.
Let's try them out.
oc.info(f"Artifact Name: {custom_model_ckpt_artifact.name}")
oc.info(f"Artifact Type: {custom_model_ckpt_artifact.snapshot_type}")
oc.info(f"Artifact Exists: {custom_model_ckpt_artifact.exists}") # Should be True as we just saved it
oc.info(f"Artifact Path: {custom_model_ckpt_artifact.path}")
INFO Artifact Name: custom_model_checkpoint.pth
INFO Artifact Type: checkpoint
INFO Artifact Exists: True
INFO Artifact Path: snapshots/snapshot_guide/v1:major-update/checkpoints/custom_model_checkpoint.pth
Creating Backups with .backup()¶
Before modifying an important artifact, create a backup. This way you can always recover if something goes wrong.
You can tag backups for easy identification, or let OpenCrate use timestamps automatically.
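The resulting backup names follow a simple pattern: the tag (or a timestamp when no tag is given) is spliced in before the file extension. Here's a hypothetical sketch of that derivation in plain Python; the `backup_path` helper and timestamp format are assumptions for illustration, not OpenCrate's actual code:

```python
import os
from datetime import datetime

def backup_path(artifact_path, tag=None):
    # Sketch: "checkpoints/model.pth" + tag "initial-version"
    # -> "checkpoints/model.backup_initial-version.pth"
    root, ext = os.path.splitext(artifact_path)
    if tag is None:
        # Fall back to a timestamp, e.g. "11:46:37_16-Nov-2025"
        tag = datetime.now().strftime("%H:%M:%S_%d-%b-%Y")
    return f"{root}.backup_{tag}{ext}"

print(backup_path("checkpoints/custom_model_checkpoint.pth", tag="initial-version"))
# checkpoints/custom_model_checkpoint.backup_initial-version.pth
```

Keeping the original extension means a backup can be loaded by the same handler as the main artifact.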
custom_model_ckpt_artifact.backup(tag="initial-version")
oc.info("Created initial backup with tag 'initial-version'.")
# Simulate some changes and then create another backup
loaded_state = custom_model_ckpt_artifact.load()
loaded_state["loss"] = 0.012 # Simulate a better loss
custom_model_ckpt_artifact.save(loaded_state)
oc.info("Modified and re-saved the main artifact.")
custom_model_ckpt_artifact.backup(tag="improved-loss")
oc.info("Created backup with tag 'improved-loss' after modification.")
# Create a backup without a tag (timestamped)
custom_model_ckpt_artifact.backup()
oc.info("Created a timestamped backup without a specific tag.")
oc.io.show_files_in_dir(os.path.dirname(custom_model_ckpt_artifact.path), verbose=True)
INFO Created initial backup with tag 'initial-version'.
INFO Modified and re-saved the main artifact.
INFO Created backup with tag 'improved-loss' after modification.
INFO Created a timestamped backup without a specific tag.
checkpoints
├── custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth (2025-11-16 11:46, 3.8 KB)
├── custom_model_checkpoint.backup_improved-loss.pth (2025-11-16 11:46, 3.8 KB)
├── custom_model_checkpoint.backup_initial-version.pth (2025-11-16 11:46, 3.8 KB)
└── custom_model_checkpoint.pth (2025-11-16 11:46, 3.8 KB)
Listing and Loading Backups¶
Use .list_backups() to see all your saved backup versions. Then load any backup just like you'd load a regular artifact.
all_backups = "\n".join(custom_model_ckpt_artifact.list_backups())
oc.info(f"All Backups:\n{all_backups}")
INFO All Backups:
custom_model_checkpoint.backup_initial-version.pth
custom_model_checkpoint.backup_improved-loss.pth
custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth
initial_checkpoint_artifact = oc.snapshot.checkpoint("custom_model_checkpoint.backup_initial-version.pth")
if initial_checkpoint_artifact.exists:
    initial_checkpoint = initial_checkpoint_artifact.load()
    oc.info(f"Loaded Initial Version Loss: {initial_checkpoint['loss']}")
else:
    oc.warning("Initial version backup not found.")
INFO Loaded Initial Version Loss: 0.015
Deleting Artifacts¶
Delete old or unnecessary artifacts with .delete(confirm=True). The confirm=True requirement prevents accidents.
if all_backups:
    artifact_to_delete_name = all_backups.split("\n")[0]  # Let's delete the first backup
    artifact_to_delete = oc.snapshot.checkpoint(artifact_to_delete_name)
    artifact_to_delete.delete(confirm=True)
    oc.info(f"Deleted backup: {artifact_to_delete_name}")
    oc.io.show_files_in_dir(
        os.path.dirname(custom_model_ckpt_artifact.path), verbose=True
    )
else:
    oc.warning("No backups to delete.")
INFO Deleted backup: custom_model_checkpoint.backup_initial-version.pth
checkpoints
├── custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth (2025-11-16 11:46, 3.8 KB)
├── custom_model_checkpoint.backup_improved-loss.pth (2025-11-16 11:46, 3.8 KB)
└── custom_model_checkpoint.pth (2025-11-16 11:46, 3.8 KB)
5. Extending OpenCrate: Custom Artifact Handlers¶
(a). Why Custom Handlers?¶
OpenCrate has handlers for common formats (CSV, JSON, images, models, etc.), but sometimes you need something specific:
- Unique file formats in your field
- Custom data validation or preprocessing
- Special compression or storage requirements
- Proprietary data structures
Custom handlers let you save/load any data type while keeping all of OpenCrate's versioning and logging benefits.
(b). How to Create a Custom Handler¶
Create a Python class with at least two methods: save() and load(). You can add other methods too (like reset() for cleanup).
class BoundingBoxHandler:
    def save(self, bounding_boxes_list):
        # Your save logic here using self.path
        ...

    def load(self):
        # Your load logic here using self.path
        ...

bounding_box_artifact = oc.snapshot.labels(
    "bounding_boxes", handler=BoundingBoxHandler
)
OpenCrate automatically gives your handler these attributes:
- self.path: Where to save/load the file
- self.verbose: Whether to print detailed logs
- self.name: The artifact name (e.g., "bounding_boxes")
- self.snapshot_type: The handler type (e.g., "labels")
Your save() method writes data to self.path. Your load() method reads from self.path and returns the data.
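To see the contract in isolation, here is a tiny self-contained harness that injects those attributes and round-trips some data through save() and load(). The `make_artifact` helper and `JsonListHandler` are hypothetical stand-ins for illustration; OpenCrate's real wiring differs:

```python
import json
import os
import tempfile

class JsonListHandler:
    # A minimal custom handler: save() writes to self.path, load() reads it back.
    def save(self, items):
        with open(self.path, "w") as f:
            json.dump(items, f)
        if self.verbose:
            print(f"saved {len(items)} items to {self.path}")

    def load(self):
        with open(self.path) as f:
            return json.load(f)

def make_artifact(handler_cls, name, snapshot_type, base_dir, verbose=False):
    # Sketch of the wiring: the framework supplies path, name,
    # snapshot_type, and verbose before your methods are called.
    h = handler_cls()
    h.name = name
    h.snapshot_type = snapshot_type
    h.verbose = verbose
    h.path = os.path.join(base_dir, snapshot_type, name)
    os.makedirs(os.path.dirname(h.path), exist_ok=True)
    return h

artifact = make_artifact(JsonListHandler, "scores.json", "labels", tempfile.mkdtemp())
artifact.save([0.91, 0.87, 0.95])
print(artifact.load())  # [0.91, 0.87, 0.95]
```

The point is duck typing: any class with save() and load() that respects self.path can plug into the artifact system.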
Let's see two practical examples.
(c). Example 1: Bounding Box Handler¶
Say you're doing object detection and want to save bounding box coordinates. Instead of overwriting a single file, let's keep a history by saving each set of boxes as a new numbered file.
This BoundingBoxHandler creates files like bounding_boxes_0.txt, bounding_boxes_1.txt, etc. The load() method reads all of them and returns the complete history.
from shutil import rmtree
from typing import Dict, List
class BoundingBoxHandler:
    def save(self, bboxes: List[Dict[str, float]], *args, **kwargs):
        # Ensure the directory exists for storing individual bounding box files
        os.makedirs(self.path, exist_ok=True)
        idx = len(os.listdir(self.path))  # Determine the next index for the file
        file_path = os.path.join(self.path, f"bounding_boxes_{idx}.txt")
        lines = []
        for bbox in bboxes:
            # Format bounding box coordinates into a single line
            line = f"{bbox['x1']} {bbox['y1']} {bbox['x2']} {bbox['y2']}"
            lines.append(line)
        content = '\n'.join(lines)
        oc.io.text.save(content, file_path)  # Use OpenCrate's internal text handler to save the file
        # you can also plug in your own serialization logic here instead of oc.io.text.save
        if self.verbose:
            oc.success(f"Successfully saved {len(bboxes)} bounding boxes to {file_path}")

    def load(self, *args, **kwargs) -> List[List[Dict[str, float]]]:
        if self.verbose:
            oc.info(f"Loading bounding boxes from {self.path}")
        loaded_boxes_history = []  # To store list of lists of bboxes
        if not os.path.exists(self.path):
            if self.verbose:
                oc.warning(f"Bounding box directory not found at {self.path}. Returning empty list.")
            return []
        # List files and sort them numerically to maintain the order of saving
        files_in_dir = oc.io.list_files_in_dir(self.path)
        sorted_files = sorted(files_in_dir, key=lambda x: int(x.split('_')[-1].split('.')[0]))
        for file_name in sorted_files:
            file_path = os.path.join(self.path, file_name)
            content = oc.io.text.load(file_path)  # Load content of each bounding box file
            current_bboxes_list = []
            for line in content.strip().split('\n'):
                if line.strip():
                    coords = line.strip().split()
                    if len(coords) == 4:
                        bbox = {
                            'x1': float(coords[0]),
                            'y1': float(coords[1]),
                            'x2': float(coords[2]),
                            'y2': float(coords[3])
                        }
                        current_bboxes_list.append(bbox)
            loaded_boxes_history.append(current_bboxes_list)
        if self.verbose:
            oc.info(f"Successfully loaded {len(loaded_boxes_history)} sets of bounding boxes")
        return loaded_boxes_history

    def reset(self, *args, **kwargs):
        # Custom reset logic to delete the directory and recreate it
        if os.path.exists(self.path):
            rmtree(self.path)
        os.makedirs(self.path, exist_ok=True)
        if self.verbose:
            oc.success(f"Reset bounding box handler at {self.path}")
# Instantiate the custom bounding box artifact handler
bounding_box_artifact = oc.snapshot.labels(
    "bounding_boxes", handler=BoundingBoxHandler, verbose=True
)
oc.info(f"Custom Bounding Box Artifact Handler initialized at: {bounding_box_artifact.path}")
INFO Custom Bounding Box Artifact Handler initialized at: snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
boxes1 = [
    {"x1": 10.0, "y1": 20.0, "x2": 150.0, "y2": 200.0},
    {"x1": 50.0, "y1": 60.0, "x2": 180.0, "y2": 250.0},
]
boxes2 = [
    {"x1": 100.0, "y1": 110.0, "x2": 220.0, "y2": 300.0},
]
# Reset the handler to ensure a clean state before saving
bounding_box_artifact.reset()
# Save multiple sets of bounding boxes, each creating a new file
bounding_box_artifact.save(boxes1)
bounding_box_artifact.save(boxes2)
oc.info("Saved multiple sets of bounding boxes using the custom handler.")
oc.io.show_files_in_dir(bounding_box_artifact.path)
SUCCESS Reset bounding box handler at snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
SUCCESS Successfully saved 2 bounding boxes to snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes/bounding_boxes_0.txt
INFO ✓ 'bounding_boxes' of 'labels' saved successfully at 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
SUCCESS Successfully saved 1 bounding boxes to snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes/bounding_boxes_1.txt
INFO ✓ 'bounding_boxes' of 'labels' saved successfully at 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
INFO Saved multiple sets of bounding boxes using the custom handler.
bounding_boxes
├── bounding_boxes_0.txt
└── bounding_boxes_1.txt
loaded_bounding_boxes_history = bounding_box_artifact.load()
oc.info(f"Loaded Bounding Boxes History: {loaded_bounding_boxes_history}")
# You can access individual sets of bounding boxes
oc.info(f"First set of boxes: {loaded_bounding_boxes_history[0]}")
oc.info(f"Second set of boxes: {loaded_bounding_boxes_history[1]}")
INFO Loading bounding boxes from snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
INFO Successfully loaded 2 sets of bounding boxes
INFO ✓ 'bounding_boxes' of 'labels' loaded successfully from 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
INFO Loaded Bounding Boxes History: [[{'x1': 10.0, 'y1': 20.0, 'x2': 150.0, 'y2': 200.0}, {'x1': 50.0, 'y1': 60.0, 'x2': 180.0, 'y2': 250.0}], [{'x1': 100.0, 'y1': 110.0, 'x2': 220.0, 'y2': 300.0}]]
INFO First set of boxes: [{'x1': 10.0, 'y1': 20.0, 'x2': 150.0, 'y2': 200.0}, {'x1': 50.0, 'y1': 60.0, 'x2': 180.0, 'y2': 250.0}]
INFO Second set of boxes: [{'x1': 100.0, 'y1': 110.0, 'x2': 220.0, 'y2': 300.0}]
(d). Example 2: Zipped Image Dataset Handler¶
Managing hundreds of individual image files is messy. Better to bundle them into a single ZIP file.
This ImageZipHandler saves a list of NumPy arrays (images) as PNGs inside a compressed ZIP archive. When loading, it unpacks them back into NumPy arrays.
import zipfile
import cv2
class ImageZipHandler:
    def save(self, images: List[np.ndarray], *args, **kwargs):
        if self.verbose:
            oc.info(f"Saving {len(images)} images to {self.path}")
        with zipfile.ZipFile(self.path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for i, img_data in enumerate(images):
                # Encode image to PNG format before adding to zip
                is_success, buffer = cv2.imencode(".png", img_data)
                if not is_success:
                    oc.warning(f"Could not encode image at index {i}")
                    continue
                zipf.writestr(f"image_{i:04d}.png", buffer.tobytes())  # Use 4-digit padding for sorting
        if self.verbose:
            oc.success(f"Successfully saved {len(images)} images to {self.path}")

    def load(self, *args, **kwargs) -> List[np.ndarray]:
        if self.verbose:
            oc.info(f"Loading images from {self.path}")
        images = []
        if not os.path.exists(self.path):
            if self.verbose:
                oc.warning(f"Image zip file not found at {self.path}. Returning empty list.")
            return []
        with zipfile.ZipFile(self.path, 'r') as zipf:
            # Sort names to ensure consistent loading order
            for file_name in sorted(zipf.namelist()):
                with zipf.open(file_name) as img_file:
                    file_bytes = np.frombuffer(img_file.read(), np.uint8)
                    img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
                    if img is not None:
                        images.append(img)
                    else:
                        oc.warning(f"Could not decode image {file_name}")
        if self.verbose:
            oc.info(f"Loaded {len(images)} images from {self.path}")
        return images
# Instantiate the custom image dataset artifact handler
image_dataset_artifact = oc.snapshot.image_archive(
"images_archive.zip", handler=ImageZipHandler, verbose=True
)
oc.info(f"Custom Image Archive Artifact Handler initialized at: {image_dataset_artifact.path}")
INFO Custom Image Archive Artifact Handler initialized at: snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
# Generate some random images for demonstration
random_images = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(50)]
# Save the images using the custom handler
image_dataset_artifact.save(random_images)
oc.info("Saved a collection of random images into a zip archive.")
# Load the images back from the zip archive
loaded_images = image_dataset_artifact.load()
oc.info(f"Loaded {len(loaded_images)} images from the archive. First image shape: {loaded_images[0].shape}")
oc.io.show_files_in_dir(os.path.dirname(image_dataset_artifact.path), verbose=True)
INFO Saving 50 images to snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
SUCCESS Successfully saved 50 images to snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO ✓ 'images_archive.zip' of 'image_archive' saved successfully at 'snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip'.
INFO Saved a collection of random images into a zip archive.
INFO Loading images from snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO Loaded 50 images from snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO ✓ 'images_archive.zip' of 'image_archive' loaded successfully from 'snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip'.
INFO Loaded 50 images from the archive. First image shape: (64, 64, 3)
image_archive
└── images_archive.zip (2025-11-16 11:48, 612.4 KB)
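A quick aside on the `image_{i:04d}.png` naming used in the handler: `load()` sorts `zipf.namelist()` lexicographically, and unpadded indices would come back out of numeric order. A small standalone check (plain Python, no OpenCrate needed) shows why the padding matters:

```python
# Unpadded names sort lexicographically, not numerically:
unpadded = sorted(f"image_{i}.png" for i in (0, 2, 10))
# 'image_10.png' sorts before 'image_2.png'
assert unpadded == ["image_0.png", "image_10.png", "image_2.png"]

# Zero-padded names preserve numeric order under a plain sort:
padded = sorted(f"image_{i:04d}.png" for i in (0, 2, 10))
assert padded == ["image_0000.png", "image_0002.png", "image_0010.png"]
```

This is why the handler can rely on `sorted()` alone to restore the original image order.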
6. Best Practices for Artifact Management¶
(a). Choose Artifacts Wisely¶
Not every file needs to be an artifact. Save what matters and skip the rest. Too many artifacts clutter your snapshots and waste storage.
Good artifacts:
- Final cleaned datasets
- Trained model checkpoints
- Important plots and visualizations
- Config files with key parameters
Not artifacts:
- Temporary cache files
- Files you can easily regenerate
- Large raw datasets (unless you're specifically versioning them)
(b). Group Related Files¶
If your pipeline generates tons of related files, group them into one artifact instead of saving each individually.
Benefits:
- Less clutter
- Easier to manage (backup, delete, load as a unit)
- Clearer organization
How to group:
- Save the whole directory: Use a custom handler to save an entire folder as one artifact
- Compress into an archive: Bundle files into a ZIP or tar.gz (like our ImageZipHandler example)
Example: If you generate 1,300 JSON annotation files, don't create 1,300 artifacts. Either save the parent directory or compress them into annotations.zip. Much simpler.
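The archive approach can be sketched in plain Python with `zipfile` (hypothetical file names, no OpenCrate API involved): write each annotation into one compressed archive, then read the whole set back as a unit.

```python
import json
import tempfile
import zipfile
from pathlib import Path

# Hypothetical per-sample annotations that would otherwise become many small files
annotations = {f"sample_{i:04d}.json": {"id": i, "boxes": []} for i in range(3)}

archive = Path(tempfile.mkdtemp()) / "annotations.zip"

# Write every annotation into a single compressed archive
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zipf:
    for name, payload in annotations.items():
        zipf.writestr(name, json.dumps(payload))

# Read them all back as one unit
with zipfile.ZipFile(archive, "r") as zipf:
    loaded = {name: json.loads(zipf.read(name)) for name in sorted(zipf.namelist())}

assert loaded == annotations
```

The same logic dropped into a custom handler's `save`/`load` methods turns the whole annotation set into a single artifact.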
7. Conclusion¶
This guide covered everything you need to use OpenCrate effectively: snapshots, logging, artifacts, and custom handlers.
What OpenCrate Gives You¶
- Reproducibility: Version your outputs and recreate past experiments easily
- Organization: Auto-organized folders and files. No more mess.
- Easy Artifact Handling: Save and load any data type with simple commands
- Safety: Backup important files before changes. No more accidental overwrites.
- Flexibility: Extend with custom handlers for any file format
OpenCrate takes care of the boring file management stuff so you can focus on actual data science work.
Next Steps¶
- Read the docs: Check the official OpenCrate documentation for the full API reference
- Join the community: Ask questions, share ideas, contribute
- Try it yourself: Start using OpenCrate in your own projects
Thanks for reading! We hope OpenCrate makes your workflows cleaner and more reproducible.