Artifacts, Logging, and Reproducible Workflows¶

1. Introduction to Snapshots¶
(a). The Challenge of Reproducibility in AI¶
Building an AI pipeline is an iterative process. You start with an initial implementation, perhaps a data preprocessing pipeline and a first model. You run experiments, evaluate results, tweak hyperparameters, try a different model architecture, retrain, evaluate again, and repeat. This cycle continues until you reach a working solution.
But as your project evolves through these iterations, keeping track of what you've done and what outputs you've generated becomes messy. Here's why:
- Messy Files Everywhere: Scripts, outputs, and data files pile up with each iteration. You lose track of what's important and what state your project is in.
- No Version Control for Results: While code is typically version-controlled using systems like Git, data outputs, models, and visualizations often lack systematic versioning, leading to ambiguity regarding which script with what configuration generated which result.
- Can't Recreate Past Experiments: Want to rerun an experiment from last week? You're now hunting through different config files, trying to remember which hyperparameters you used, which data preprocessing steps were active, and what model checkpoint you started from. Without careful tracking, it's nearly impossible to know.
- Inconsistent Logs: Everyone logs differently (or not at all), making it hard to debug issues or understand what happened during a run.
These problems get worse with each iteration. What starts as a minor inconvenience becomes a major blocker when you need to compare results across experiments, validate findings, or move from development to production.
(b). OpenCrate's Solution: The Snapshot API¶
Here's what Snapshots give you:
- Auto-organized Outputs: File paths and versions are handled automatically. Everything stays clean and traceable across iterations.
- Built-in Logging: Every run gets logged automatically. No more scattered print statements.
- Easy Artifact Management: Save and load any data type (CSVs, images, models and more...) with simple APIs. OpenCrate handles the messy details.
- Version Control for Results: Backup important files before overwriting them. Experiment freely without fear of losing work.
OpenCrate handles the boring file management stuff so you can focus on your actual work. This guide will show you how to use it to build clean, reproducible pipelines.
2. Core Concepts: Snapshots¶
(a). Understanding Snapshots¶
A Snapshot is simply a dedicated folder for one execution run of your pipeline. Think of it like a Git commit, but for your results and configurations instead of code. It sounds almost too simple, and it is, but a single folder holding the full configuration of your pipeline along with every output generated during that run solves every major pain point mentioned above.
What makes snapshots useful:
- Isolation: Each snapshot is separate. Different experiments don't interfere with each other.
- Automatic Versioning: Snapshots are numbered (v0, v1, v2...). Easy to track what changed when.
- Reproducibility: Everything from a specific run is stored together. You can always go back to see exactly what you had.
- Flexibility: Create snapshots at different stages of your workflow: data processing, model training, evaluation, and so on.
(b). Initializing a Snapshot¶
Use oc.snapshot.setup() to create or resume a snapshot. Here are the main parameters:
name (str): Give your snapshot series a unique name (e.g., "my_experiment").
start (str or int): Controls which snapshot version to use:
- "new": Create a fresh snapshot (increments version: v0 → v1 → v2...)
- "last": Resume from the most recent snapshot
- 0, 1, 2...: Resume from or create a specific version number
tag (str, optional): Add a label to the snapshot (e.g., v0:baseline, v1:feature-x). Useful for marking different experiments or configurations.
log_level (str, optional): How much detail to log. Options: "debug", "info", "warning", "error", "critical". Default is "info".
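To make the three start modes concrete, here is a minimal, hypothetical sketch in plain Python (not OpenCrate's internals) of how a start value and tag could resolve to a version number and a folder name like v0:initial-run:

```python
def resolve_version(existing_versions, start):
    """Hypothetical sketch of how `start` could pick a snapshot version.

    existing_versions: sorted version numbers already on disk, e.g. [0, 1].
    start: "new", "last", or an explicit integer version.
    """
    if start == "new":
        # Bump to the next unused version number (v0 -> v1 -> v2 ...)
        return (existing_versions[-1] + 1) if existing_versions else 0
    if start == "last":
        if not existing_versions:
            raise ValueError("No snapshot to resume from")
        return existing_versions[-1]
    return int(start)  # explicit version number

def version_name(version, tag=None):
    # Snapshot folder names look like "v0" or "v0:initial-run"
    return f"v{version}:{tag}" if tag else f"v{version}"

print(resolve_version([], "new"))       # 0
print(resolve_version([0, 1], "new"))   # 2
print(resolve_version([0, 1], "last"))  # 1
print(version_name(0, "initial-run"))   # v0:initial-run
```

This is only a mental model; the real API keeps all of this behind oc.snapshot.setup().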
(c). Demonstrating Snapshot Initialization¶
Let's create our first snapshot named snapshot_guide with the tag initial-run.
import opencrate as oc
oc.snapshot.setup(name="snapshot_guide", start="new", tag="initial-run")
oc.snapshot.reset(confirm=True)  # clear any existing snapshots from previous runs of this guide
oc.snapshot.setup(name="snapshot_guide", start="new", tag="initial-run")
oc.info(
    f"Snapshot with version `{oc.snapshot.version}` and name `{oc.snapshot.version_name}` has been set up at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)
INFO Snapshot with version `0` and name `v0:initial-run` has been set up at: `snapshots/snapshot_guide/v0:initial-run`
snapshots
└── snapshot_guide/
    └── v0:initial-run/
        └── snapshot_guide.log
As you can see, we've created the very first snapshot for our pipeline, named snapshot_guide, with version v0 and tag initial-run.
(d). Resuming an Existing Snapshot¶
Sometimes you need to continue work from where you left off. Use oc.snapshot.setup() with start="last" to pick up from the most recent snapshot, or start=<version_number> to target a specific version.
Important: If your snapshot has a tag, you must pass the same tag when resuming. Otherwise, OpenCrate creates a new snapshot without the tag instead of resuming the existing one.
In our example: use start="last" to resume the most recent snapshot, or start=0 to specifically resume v0.
# Notebook is restarted here to simulate a fresh run.
import opencrate as oc
oc.snapshot.setup(name="snapshot_guide", start="last", tag="initial-run")
# in our case we can pass start="last" since we're resuming from the most recent snapshot
# alternatively we can pass start=0 to resume version v0 specifically
oc.info(
    f"Resumed Snapshot with version `{oc.snapshot.version}` and name `{oc.snapshot.version_name}` located at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)
INFO Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`
snapshots
└── snapshot_guide/
    └── v0:initial-run/
        ├── snapshot_guide.history.log
        └── snapshot_guide.log
Notice that the log files snapshot_guide.log and snapshot_guide.history.log were created automatically. We'll talk more about logging in a bit.
(e). Creating a New Snapshot Version¶
When you hit a major milestone or want to save a stable state before making big changes, create a new snapshot version using start="new". OpenCrate will bump the version number automatically (v0 → v1 → v2...).
When to create a new version:
- Before major changes: Save a baseline before experimenting with new features or algorithms
- After important updates: Document results from significant changes (new model architecture, different hyperparameters, etc.)
- For clean history: Keep each major stage of development separate and easy to compare
# Notebook is restarted here to simulate a fresh run.
import opencrate as oc
oc.snapshot.setup(name="snapshot_guide", start="new", tag="major-update")
oc.info(
    f"New Snapshot version `{oc.snapshot.version}` with name `{oc.snapshot.version_name}` has been set up at: `{oc.snapshot.dir_path}`"
)
oc.io.show_files_in_dir("snapshots", depth=4)
INFO New Snapshot version `1` with name `v1:major-update` has been set up at: `snapshots/snapshot_guide/v1:major-update`
snapshots
└── snapshot_guide/
    ├── v0:initial-run/
    │   ├── snapshot_guide.history.log
    │   └── snapshot_guide.log
    └── v1:major-update/
        └── snapshot_guide.log
3. Integrated Logging for Pipeline Observability¶
(a). Why Logging Matters¶
Good logging is essential for any serious project. Here's why:
- Debugging: Logs show exactly when and where things went wrong
- Monitoring: Track your pipeline's progress and performance
- Reproducibility: Document what happened in each run so you can recreate or verify results
- Status Updates: See what's happening during long-running jobs
Without good logs, you're flying blind. Debugging becomes guesswork and reproducing results becomes impossible.
(b). OpenCrate's Logging System¶
When you create a snapshot, OpenCrate automatically sets up logging. You get two log files in your snapshot directory:
<name>.log (e.g., snapshot_guide.log): Logs from the current run only. Gets overwritten each time you run your pipeline. Perfect for checking what just happened.
<name>.history.log (e.g., snapshot_guide.history.log): All logs from all runs, appended over time. Your complete history for this snapshot version. Only created after you've run the pipeline more than once.
This gives you both a clean view of your latest run and a full history when you need it.
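The two-file scheme is easy to picture with Python's standard logging module: one file handler opens in write mode (truncated each run), the other in append mode (full history). A minimal sketch of the idea, assuming nothing about OpenCrate's actual implementation:

```python
import logging
import os
import tempfile

def setup_dual_logging(snapshot_dir, name):
    # Sketch of the two-file scheme: <name>.log is truncated per run,
    # <name>.history.log accumulates across runs.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # drop handlers from any previous "run"
    fmt = logging.Formatter("%(asctime)s - %(levelname)s %(message)s")
    current = logging.FileHandler(os.path.join(snapshot_dir, f"{name}.log"), mode="w")
    history = logging.FileHandler(os.path.join(snapshot_dir, f"{name}.history.log"), mode="a")
    for h in (current, history):
        h.setFormatter(fmt)
        logger.addHandler(h)
    return logger

# Simulate two runs: the .log file keeps only the second run,
# while the .history.log file keeps both.
d = tempfile.mkdtemp()
for run in (1, 2):
    log = setup_dual_logging(d, "demo")
    log.info(f"message from run {run}")
    for h in log.handlers:
        h.close()

print(open(os.path.join(d, "demo.log")).read())          # only run 2
print(open(os.path.join(d, "demo.history.log")).read())  # runs 1 and 2
```

The mode="w" vs mode="a" distinction is the whole trick: same messages, two retention policies.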
(c). Logging Levels and Usage¶
OpenCrate provides simple logging functions for different situations:
- oc.info(): General updates about what's happening
- oc.debug(): Detailed info for troubleshooting (usually filtered out in production)
- oc.warning(): Something's off but not broken
- oc.error(): Something failed in a specific task
- oc.critical(): Major failure, pipeline might crash
- oc.success(): Confirm an important step completed
- oc.exception(): Use in try...except blocks to log full error details with traceback
Just pass a string to any of these functions and OpenCrate handles the rest.
oc.info("This is an informational message from the current run.")
oc.debug("Detailed debug information for troubleshooting.")
oc.warning("A potential issue detected, but execution continues.")
oc.error("An error occurred, affecting a part of the pipeline.")
oc.critical("Critical failure: pipeline likely to terminate.")
oc.success("Important step completed successfully!")
try:
    # Simulate an error
    result = 10 / 0
except ZeroDivisionError:
    oc.exception("Caught a division by zero error.")
oc.info("All log messages have been dispatched.")
INFO This is an informational message from the current run.
WARNING A potential issue detected, but execution continues.
ERROR An error occurred, affecting a part of the pipeline.
CRITICAL Critical failure: pipeline likely to terminate.
SUCCESS Important step completed successfully!
ERROR Caught a division by zero error.
Traceback (most recent call last):
  File "/tmp/ipykernel_351658/1530191759.py", line 10, in <module>
    result = 10 / 0
ZeroDivisionError: division by zero
INFO All log messages have been dispatched.
Note that the oc.debug() message doesn't appear: the default log level is "info", so debug messages are filtered out.
(d). Demonstrating Logging and Log File Analysis¶
Let's check the log files OpenCrate created. Notice:
- The v0:initial-run snapshot has two log files: snapshot_guide.log (latest run) and snapshot_guide.history.log (previous runs).
- The v1:major-update snapshot only has snapshot_guide.log because it's only been run once.
oc.io.show_files_in_dir("snapshots", depth=4)
snapshots
└── snapshot_guide/
    ├── v0:initial-run/
    │   ├── snapshot_guide.history.log
    │   └── snapshot_guide.log
    └── v1:major-update/
        └── snapshot_guide.log
Let's compare the logs for v0:initial-run to see the difference.
!cat snapshots/snapshot_guide/v0:initial-run/snapshot_guide.log
2025-11-16 11:44:43 - INFO Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`
!cat snapshots/snapshot_guide/v0:initial-run/snapshot_guide.history.log
2025-11-16 11:44:23 - INFO Snapshot with version `0` and name `v0:initial-run` has been set up at: `snapshots/snapshot_guide/v0:initial-run`
2025-11-16 11:44:43 - INFO Resumed Snapshot with version `0` and name `v0:initial-run` located at: `snapshots/snapshot_guide/v0:initial-run`
As expected: snapshot_guide.history.log has logs from both our current and previous runs, while snapshot_guide.log only has the current run. Every time you resume a snapshot, .log gets overwritten with fresh logs, while .history.log keeps growing with the full timeline.
Quick check: let's look at v1:major-update/snapshot_guide.log. It should only have logs from the current run.
!cat snapshots/snapshot_guide/v1:major-update/snapshot_guide.log
2025-11-16 11:45:03 - INFO New Snapshot version `1` with name `v1:major-update` has been set up at: `snapshots/snapshot_guide/v1:major-update`
2025-11-16 11:45:07 - INFO This is an informational message from the current run.
2025-11-16 11:45:07 - WARNING A potential issue detected, but execution continues.
2025-11-16 11:45:07 - ERROR An error occurred, affecting a part of the pipeline.
2025-11-16 11:45:07 - CRITICAL Critical failure: pipeline likely to terminate.
2025-11-16 11:45:07 - SUCCESS Important step completed successfully!
2025-11-16 11:45:07 - ERROR Caught a division by zero error.
Traceback (most recent call last):
File "/tmp/ipykernel_351658/1530191759.py", line 10, in <module>
result = 10 / 0
ZeroDivisionError: division by zero
2025-11-16 11:45:07 - INFO All log messages have been dispatched.
Perfect!
4. Artifact Management: Saving and Loading Data¶
(a). What Are Artifacts?¶
An artifact is any important output from your pipeline that you want to keep. These aren't temporary files; they're the outputs that matter:
- Processed Datasets: Cleaned data, feature-engineered datasets (e.g., training_data.csv)
- Models: Trained weights, saved model files (e.g., model_v1.pth, classifier.pkl)
- Visualizations: Important plots and charts (e.g., accuracy_plot.png, confusion_matrix.jpg)
- Config Files: Settings and parameters used during training
OpenCrate handles all the annoying details: file paths, serialization, versioning. You just call .save() and .load().
(b). Built-in Artifact Handlers¶
OpenCrate has handlers for common file types. Just pick the right one for your data, give it a name, and call .save(). OpenCrate handles the rest.
Data & Configuration Handlers:¶
- oc.snapshot.json(name): Manages Python dictionaries, lists, and other JSON-serializable objects, saving them as .json files.
- oc.snapshot.yaml(name): Ideal for configuration management, handling dictionaries and similar structures as .yaml files.
- oc.snapshot.csv(name): Designed for tabular data, supporting Pandas DataFrames, lists of lists, or NumPy arrays for saving to .csv format.
- oc.snapshot.text(name): A versatile handler for saving any string data to a plain .txt file.
Media Handlers:¶
- oc.snapshot.image(name): Handles various image formats, supporting saving and loading from NumPy arrays, PIL Images, or Matplotlib figures. Offers a lib parameter for specifying the image processing library (e.g., "pil", "cv2").
- oc.snapshot.gif(name): Facilitates the creation and loading of animated GIFs from a sequence of images.
- oc.snapshot.video(name): Manages video files from diverse sources.
- oc.snapshot.audio(name): Supports audio data from libraries like Torchaudio or Librosa, with options to specify the sampling rate and library.
Machine Learning Model Handlers:¶
- oc.snapshot.checkpoint(name): A powerful handler for saving and loading machine learning model checkpoints. It supports a wide array of popular frameworks and formats, including:
  - PyTorch (.pth, .pt, .safetensors)
  - TensorFlow/Keras (.h5, .keras)
  - Scikit-learn (.joblib, .pkl)
  - And more, typically by handling a dictionary containing model state, optimizer state, and other metadata.
Let's save different file types: JSON, CSV, text, images, audio, and a PyTorch model checkpoint.
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
# First we initialize our artifacts based on their handling type
greeting_artifact = oc.snapshot.text("greeting.txt")
data_artifact = oc.snapshot.json("data.json")
config_artifact = oc.snapshot.yaml("config.yaml")
sample_data_artifact = oc.snapshot.csv("sample_data.csv")
sine_artifact = oc.snapshot.image("sine_wave_plot.png")
numpy_image_artifact = oc.snapshot.image("random_numpy_image.jpg")
audio_artifact = oc.snapshot.audio("high_pitch_sine.wav")
custom_model_ckpt_artifact = oc.snapshot.checkpoint("custom_model_checkpoint.pth")
greeting_artifact.save("Hello, OpenCrate Guide!") # saving as plain text
data_artifact.save({"array": [10, 20, 30], "message": "Sample JSON data"}) # saving as JSON
config_artifact.save({"project": "OpenCrate Guide", "version": 1.1, "settings": {"debug_mode": True}}) # saving as YAML
sample_data_artifact.save(pd.DataFrame({"col_a": [100, 200], "col_b": [300, 400]}), index=False) # saving as CSV
figure = plt.figure(figsize=(6, 4))
plt.plot(np.sin(np.linspace(0, 2 * np.pi, 50)))
plt.title("Sine Wave Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
sine_artifact.save(figure) # saving matplotlib figure image
plt.close(figure)
numpy_image = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
numpy_image_artifact.save(numpy_image) # saving numpy array as image
sr = 44100
duration = 3
frequency = 220.0
t = np.linspace(0., duration, int(sr * duration), endpoint=False)
amplitude = 0.3 * np.iinfo(np.int16).max
audio_data = (amplitude * np.sin(2. * np.pi * frequency * t)).astype(np.int16)
audio_artifact.save(audio_data, sr, lib="soundfile")
model = torch.nn.Sequential(
    torch.nn.Linear(20, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 1)
)
optimizer = torch.optim.Adam(lr=0.001, params=model.parameters())
custom_model_ckpt_artifact.save(
    {
        "epoch": 5,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": 0.015,
        "description": "A sample PyTorch model checkpoint after 5 epochs."
    }
)
(c). Visualizing Artifact Storage¶
OpenCrate automatically organizes artifacts into folders by type. Clean and easy to navigate.
oc.io.show_files_in_dir("snapshots", depth=4, verbose=True)
# neat trick - you can use verbose argument in show_files_in_dir to see file sizes and last modified times
snapshots
└── snapshot_guide/ (2025-11-16 11:45, 298.0 KB)
    ├── v0:initial-run/ (2025-11-16 11:44, 435 B)
    │   ├── snapshot_guide.history.log (2025-11-16 11:44, 290 B)
    │   └── snapshot_guide.log (2025-11-16 11:44, 145 B)
    └── v1:major-update/ (2025-11-16 11:45, 297.5 KB)
        ├── audios/ (2025-11-16 11:45, 258.4 KB)
        │   └── high_pitch_sine.wav (2025-11-16 11:45, 258.4 KB)
        ├── checkpoints/ (2025-11-16 11:45, 3.8 KB)
        │   └── custom_model_checkpoint.pth (2025-11-16 11:45, 3.8 KB)
        ├── csvs/ (2025-11-16 11:45, 28 B)
        │   └── sample_data.csv (2025-11-16 11:45, 28 B)
        ├── images/ (2025-11-16 11:45, 34.3 KB)
        │   ├── random_numpy_image.jpg (2025-11-16 11:45, 10.2 KB)
        │   └── sine_wave_plot.png (2025-11-16 11:45, 24.1 KB)
        ├── jsons/ (2025-11-16 11:45, 54 B)
        │   └── data.json (2025-11-16 11:45, 54 B)
        ├── texts/ (2025-11-16 11:45, 23 B)
        │   └── greeting.txt (2025-11-16 11:45, 23 B)
        ├── yamls/ (2025-11-16 11:45, 67 B)
        │   └── config.yaml (2025-11-16 11:45, 67 B)
        └── snapshot_guide.log (2025-11-16 11:45, 844 B)
(d). Loading Artifacts¶
Loading is just as easy as saving. Call .load() and you get your data back in its original Python format. No worrying about file paths or deserialization.
loaded_greeting = greeting_artifact.load()
oc.info(f"Loaded Text: {loaded_greeting}")
loaded_json_data = data_artifact.load()
oc.info(f"Loaded JSON: {loaded_json_data}")
loaded_config = config_artifact.load()
oc.info(f"Loaded YAML Config: {loaded_config}")
loaded_csv_data = sample_data_artifact.load()
oc.info(f"Loaded CSV Data:\n{loaded_csv_data}")
loaded_sine_wave_plot = sine_artifact.load(lib="cv2")
oc.info(f"Loaded Sine Wave Plot (shape): {loaded_sine_wave_plot.shape}")
loaded_numpy_image = numpy_image_artifact.load(lib="cv2")
oc.info(f"Loaded NumPy Image (size): {loaded_numpy_image.size}")
# For audio, you might need to specify the library used during saving if not default
# For checkpoint, it typically returns the dictionary it was saved with
loaded_checkpoint = custom_model_ckpt_artifact.load()
oc.info(f"Loaded Checkpoint Keys: {loaded_checkpoint.keys()}")
oc.info(f"Loaded Checkpoint Description: {loaded_checkpoint['description']}")
INFO Loaded Text: Hello, OpenCrate Guide!
INFO Loaded JSON: {'array': [10, 20, 30], 'message': 'Sample JSON data'}
INFO Loaded YAML Config: {'project': 'OpenCrate Guide', 'settings': {'debug_mode': True}, 'version': 1.1}
INFO Loaded CSV Data:
   col_a  col_b
0    100    300
1    200    400
INFO Loaded Sine Wave Plot (shape): (393, 557, 3)
INFO Loaded NumPy Image (size): 49152
INFO Loaded Checkpoint Keys: dict_keys(['epoch', 'model_state_dict', 'optimizer_state_dict', 'loss', 'description'])
INFO Loaded Checkpoint Description: A sample PyTorch model checkpoint after 5 epochs.
loaded_audio = audio_artifact.load(lib="soundfile")
def audio_playback_widget(audio_data, sample_rate, volume=0.1):
    import IPython.display as ipd
    import numpy as np

    audio_data = np.array(audio_data) * volume
    ipd.display(
        ipd.Audio(data=audio_data, rate=sample_rate, autoplay=False, normalize=False)
    )
audio_playback_widget(loaded_audio["data"], loaded_audio["sample_rate"])
(e). Advanced Artifact Features¶
Beyond basic save/load, artifacts have useful properties and methods:
Properties:
- .exists: Returns True if the artifact file exists. Use this for conditional logic.
- .path: The full file path where the artifact is stored. Useful when other tools need the path.
Methods:
- .backup(tag=None): Creates a backup copy before you overwrite something important. Add a tag or use automatic timestamps.
- .list_backups(): Shows all backup files for this artifact.
- .delete(confirm=False): Delete an artifact. Requires confirm=True to prevent accidents.
Let's try them out.
oc.info(f"Artifact Name: {custom_model_ckpt_artifact.name}")
oc.info(f"Artifact Type: {custom_model_ckpt_artifact.snapshot_type}")
oc.info(f"Artifact Exists: {custom_model_ckpt_artifact.exists}") # Should be True as we just saved it
oc.info(f"Artifact Path: {custom_model_ckpt_artifact.path}")
INFO Artifact Name: custom_model_checkpoint.pth
INFO Artifact Type: checkpoint
INFO Artifact Exists: True
INFO Artifact Path: snapshots/snapshot_guide/v1:major-update/checkpoints/custom_model_checkpoint.pth
Creating Backups with .backup()¶
Before modifying an important artifact, create a backup. This way you can always recover if something goes wrong.
You can tag backups for easy identification, or let OpenCrate use timestamps automatically.
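The resulting backup names follow a simple pattern: the tag (or a timestamp when no tag is given) is spliced in before the file extension. Here's a hypothetical sketch of that derivation in plain Python; the `backup_path` helper and timestamp format are assumptions for illustration, not OpenCrate's actual code:

```python
import os
from datetime import datetime

def backup_path(artifact_path, tag=None):
    # Sketch: "checkpoints/model.pth" + tag "initial-version"
    # -> "checkpoints/model.backup_initial-version.pth"
    root, ext = os.path.splitext(artifact_path)
    if tag is None:
        # Fall back to a timestamp, e.g. "11:46:37_16-Nov-2025"
        tag = datetime.now().strftime("%H:%M:%S_%d-%b-%Y")
    return f"{root}.backup_{tag}{ext}"

print(backup_path("checkpoints/custom_model_checkpoint.pth", tag="initial-version"))
# checkpoints/custom_model_checkpoint.backup_initial-version.pth
```

Keeping the original extension means a backup can be loaded by the same handler as the main artifact.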
custom_model_ckpt_artifact.backup(tag="initial-version")
oc.info("Created initial backup with tag 'initial-version'.")
# Simulate some changes and then create another backup
loaded_state = custom_model_ckpt_artifact.load()
loaded_state["loss"] = 0.012 # Simulate a better loss
custom_model_ckpt_artifact.save(loaded_state)
oc.info("Modified and re-saved the main artifact.")
custom_model_ckpt_artifact.backup(tag="improved-loss")
oc.info("Created backup with tag 'improved-loss' after modification.")
# Create a backup without a tag (timestamped)
custom_model_ckpt_artifact.backup()
oc.info("Created a timestamped backup without a specific tag.")
oc.io.show_files_in_dir(os.path.dirname(custom_model_ckpt_artifact.path), verbose=True)
INFO Created initial backup with tag 'initial-version'.
INFO Modified and re-saved the main artifact.
INFO Created backup with tag 'improved-loss' after modification.
INFO Created a timestamped backup without a specific tag.
checkpoints
├── custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth (2025-11-16 11:46, 3.8 KB)
├── custom_model_checkpoint.backup_improved-loss.pth (2025-11-16 11:46, 3.8 KB)
├── custom_model_checkpoint.backup_initial-version.pth (2025-11-16 11:46, 3.8 KB)
└── custom_model_checkpoint.pth (2025-11-16 11:46, 3.8 KB)
Listing and Loading Backups¶
Use .list_backups() to see all your saved backup versions. Then load any backup just like you'd load a regular artifact.
all_backups = "\n".join(custom_model_ckpt_artifact.list_backups())
oc.info(f"All Backups:\n{all_backups}")
INFO All Backups:
custom_model_checkpoint.backup_initial-version.pth
custom_model_checkpoint.backup_improved-loss.pth
custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth
initial_checkpoint_artifact = oc.snapshot.checkpoint("custom_model_checkpoint.backup_initial-version.pth")
if initial_checkpoint_artifact.exists:
    initial_checkpoint = initial_checkpoint_artifact.load()
    oc.info(f"Loaded Initial Version Loss: {initial_checkpoint['loss']}")
else:
    oc.warning("Initial version backup not found.")
INFO Loaded Initial Version Loss: 0.015
Deleting Artifacts¶
Delete old or unnecessary artifacts with .delete(confirm=True). The confirm=True requirement prevents accidents.
if all_backups:
    artifact_to_delete_name = all_backups.split("\n")[0]  # Let's delete the first backup
    artifact_to_delete = oc.snapshot.checkpoint(artifact_to_delete_name)
    artifact_to_delete.delete(confirm=True)
    oc.info(f"Deleted backup: {artifact_to_delete_name}")
    oc.io.show_files_in_dir(
        os.path.dirname(custom_model_ckpt_artifact.path), verbose=True
    )
else:
    oc.warning("No backups to delete.")
INFO Deleted backup: custom_model_checkpoint.backup_initial-version.pth
checkpoints
├── custom_model_checkpoint.backup_11:46:37_16-Nov-2025.pth (2025-11-16 11:46, 3.8 KB)
├── custom_model_checkpoint.backup_improved-loss.pth (2025-11-16 11:46, 3.8 KB)
└── custom_model_checkpoint.pth (2025-11-16 11:46, 3.8 KB)
5. Extending OpenCrate: Custom Artifact Handlers¶
(a). Why Custom Handlers?¶
OpenCrate has handlers for common formats (CSV, JSON, images, models, etc.), but sometimes you need something specific:
- Unique file formats in your field
- Custom data validation or preprocessing
- Special compression or storage requirements
- Proprietary data structures
Custom handlers let you save/load any data type while keeping all of OpenCrate's versioning and logging benefits.
(b). How to Create a Custom Handler¶
Create a Python class with at least two methods: save() and load(). You can add other methods too (like reset() for cleanup).
class BoundingBoxHandler:
    def save(self, bounding_boxes_list):
        # Your save logic here using self.path
        ...

    def load(self):
        # Your load logic here using self.path
        ...

bounding_box_artifact = oc.snapshot.labels(
    "bounding_boxes", handler=BoundingBoxHandler
)
OpenCrate automatically gives your handler these attributes:
- self.path: Where to save/load the file
- self.verbose: Whether to print detailed logs
- self.name: The artifact name (e.g., "bounding_boxes")
- self.snapshot_type: The handler type (e.g., "labels")
Your save() method writes data to self.path. Your load() method reads from self.path and returns the data.
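To see the contract in isolation, here is a tiny self-contained harness that injects those attributes and round-trips some data through save() and load(). The `make_artifact` helper and `JsonListHandler` are hypothetical stand-ins for illustration; OpenCrate's real wiring differs:

```python
import json
import os
import tempfile

class JsonListHandler:
    # A minimal custom handler: save() writes to self.path, load() reads it back.
    def save(self, items):
        with open(self.path, "w") as f:
            json.dump(items, f)
        if self.verbose:
            print(f"saved {len(items)} items to {self.path}")

    def load(self):
        with open(self.path) as f:
            return json.load(f)

def make_artifact(handler_cls, name, snapshot_type, base_dir, verbose=False):
    # Sketch of the wiring: the framework supplies path, name,
    # snapshot_type, and verbose before your methods are called.
    h = handler_cls()
    h.name = name
    h.snapshot_type = snapshot_type
    h.verbose = verbose
    h.path = os.path.join(base_dir, snapshot_type, name)
    os.makedirs(os.path.dirname(h.path), exist_ok=True)
    return h

artifact = make_artifact(JsonListHandler, "scores.json", "labels", tempfile.mkdtemp())
artifact.save([0.91, 0.87, 0.95])
print(artifact.load())  # [0.91, 0.87, 0.95]
```

The point is duck typing: any class with save() and load() that respects self.path can plug into the artifact system.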
Let's see two practical examples.
(c). Example 1: Bounding Box Handler¶
Say you're doing object detection and want to save bounding box coordinates. Instead of overwriting a single file, let's keep a history by saving each set of boxes as a new numbered file.
This BoundingBoxHandler creates files like bounding_boxes_0.txt, bounding_boxes_1.txt, etc. The load() method reads all of them and returns the complete history.
from shutil import rmtree
from typing import Dict, List
class BoundingBoxHandler:
    def save(self, bboxes: List[Dict[str, float]], *args, **kwargs):
        # Ensure the directory exists for storing individual bounding box files
        os.makedirs(self.path, exist_ok=True)
        idx = len(os.listdir(self.path))  # Determine the next index for the file
        file_path = os.path.join(self.path, f"bounding_boxes_{idx}.txt")
        lines = []
        for bbox in bboxes:
            # Format bounding box coordinates into a single line
            line = f"{bbox['x1']} {bbox['y1']} {bbox['x2']} {bbox['y2']}"
            lines.append(line)
        content = '\n'.join(lines)
        oc.io.text.save(content, file_path)  # Use OpenCrate's internal text handler to save the file
        # you can also plug in your own serialization logic here instead of oc.io.text.save
        if self.verbose:
            oc.success(f"Successfully saved {len(bboxes)} bounding boxes to {file_path}")

    def load(self, *args, **kwargs) -> List[List[Dict[str, float]]]:
        if self.verbose:
            oc.info(f"Loading bounding boxes from {self.path}")
        loaded_boxes_history = []  # To store list of lists of bboxes
        if not os.path.exists(self.path):
            if self.verbose:
                oc.warning(f"Bounding box directory not found at {self.path}. Returning empty list.")
            return []
        # List files and sort them numerically to maintain the order of saving
        files_in_dir = oc.io.list_files_in_dir(self.path)
        sorted_files = sorted(files_in_dir, key=lambda x: int(x.split('_')[-1].split('.')[0]))
        for file_name in sorted_files:
            file_path = os.path.join(self.path, file_name)
            content = oc.io.text.load(file_path)  # Load content of each bounding box file
            current_bboxes_list = []
            for line in content.strip().split('\n'):
                if line.strip():
                    coords = line.strip().split()
                    if len(coords) == 4:
                        bbox = {
                            'x1': float(coords[0]),
                            'y1': float(coords[1]),
                            'x2': float(coords[2]),
                            'y2': float(coords[3])
                        }
                        current_bboxes_list.append(bbox)
            loaded_boxes_history.append(current_bboxes_list)
        if self.verbose:
            oc.info(f"Successfully loaded {len(loaded_boxes_history)} sets of bounding boxes")
        return loaded_boxes_history

    def reset(self, *args, **kwargs):
        # Custom reset logic to delete the directory and recreate it
        if os.path.exists(self.path):
            rmtree(self.path)
        os.makedirs(self.path, exist_ok=True)
        if self.verbose:
            oc.success(f"Reset bounding box handler at {self.path}")
# Instantiate the custom bounding box artifact handler
bounding_box_artifact = oc.snapshot.labels(
    "bounding_boxes", handler=BoundingBoxHandler, verbose=True
)
oc.info(f"Custom Bounding Box Artifact Handler initialized at: {bounding_box_artifact.path}")
INFO Custom Bounding Box Artifact Handler initialized at: snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
boxes1 = [
    {"x1": 10.0, "y1": 20.0, "x2": 150.0, "y2": 200.0},
    {"x1": 50.0, "y1": 60.0, "x2": 180.0, "y2": 250.0},
]
boxes2 = [
    {"x1": 100.0, "y1": 110.0, "x2": 220.0, "y2": 300.0},
]
# Reset the handler to ensure a clean state before saving
bounding_box_artifact.reset()
# Save multiple sets of bounding boxes, each creating a new file
bounding_box_artifact.save(boxes1)
bounding_box_artifact.save(boxes2)
oc.info("Saved multiple sets of bounding boxes using the custom handler.")
oc.io.show_files_in_dir(bounding_box_artifact.path)
SUCCESS Reset bounding box handler at snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
SUCCESS Successfully saved 2 bounding boxes to snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes/bounding_boxes_0.txt
INFO ✓ 'bounding_boxes' of 'labels' saved successfully at 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
SUCCESS Successfully saved 1 bounding boxes to snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes/bounding_boxes_1.txt
INFO ✓ 'bounding_boxes' of 'labels' saved successfully at 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
INFO Saved multiple sets of bounding boxes using the custom handler.
bounding_boxes
├── bounding_boxes_0.txt
└── bounding_boxes_1.txt
loaded_bounding_boxes_history = bounding_box_artifact.load()
oc.info(f"Loaded Bounding Boxes History: {loaded_bounding_boxes_history}")
# You can access individual sets of bounding boxes
oc.info(f"First set of boxes: {loaded_bounding_boxes_history[0]}")
oc.info(f"Second set of boxes: {loaded_bounding_boxes_history[1]}")
INFO Loading bounding boxes from snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes
INFO Successfully loaded 2 sets of bounding boxes
INFO ✓ 'bounding_boxes' of 'labels' loaded successfully from 'snapshots/snapshot_guide/v1:major-update/labels/bounding_boxes'.
INFO Loaded Bounding Boxes History: [[{'x1': 10.0, 'y1': 20.0, 'x2': 150.0, 'y2': 200.0}, {'x1': 50.0, 'y1': 60.0, 'x2': 180.0, 'y2': 250.0}], [{'x1': 100.0, 'y1': 110.0, 'x2': 220.0, 'y2': 300.0}]]
INFO First set of boxes: [{'x1': 10.0, 'y1': 20.0, 'x2': 150.0, 'y2': 200.0}, {'x1': 50.0, 'y1': 60.0, 'x2': 180.0, 'y2': 250.0}]
INFO Second set of boxes: [{'x1': 100.0, 'y1': 110.0, 'x2': 220.0, 'y2': 300.0}]
(d). Example 2: Zipped Image Dataset Handler¶
Managing hundreds of individual image files is messy. Better to bundle them into a single ZIP file.
This ImageZipHandler saves a list of NumPy arrays (images) as PNGs inside a compressed ZIP archive. When loading, it unpacks them back into NumPy arrays.
import zipfile
import cv2
class ImageZipHandler:
    def save(self, images: List[np.ndarray], *args, **kwargs):
        if self.verbose:
            oc.info(f"Saving {len(images)} images to {self.path}")
        with zipfile.ZipFile(self.path, 'w', zipfile.ZIP_DEFLATED) as zipf:
            for i, img_data in enumerate(images):
                # Encode image to PNG format before adding to zip
                is_success, buffer = cv2.imencode(".png", img_data)
                if not is_success:
                    oc.warning(f"Could not encode image at index {i}")
                    continue
                zipf.writestr(f"image_{i:04d}.png", buffer.tobytes())  # Use 4-digit padding for sorting
        if self.verbose:
            oc.success(f"Successfully saved {len(images)} images to {self.path}")

    def load(self, *args, **kwargs) -> List[np.ndarray]:
        if self.verbose:
            oc.info(f"Loading images from {self.path}")
        images = []
        if not os.path.exists(self.path):
            if self.verbose:
                oc.warning(f"Image zip file not found at {self.path}. Returning empty list.")
            return []
        with zipfile.ZipFile(self.path, 'r') as zipf:
            # Sort names to ensure consistent loading order
            for file_name in sorted(zipf.namelist()):
                with zipf.open(file_name) as img_file:
                    file_bytes = np.frombuffer(img_file.read(), np.uint8)
                    img = cv2.imdecode(file_bytes, cv2.IMREAD_COLOR)
                    if img is not None:
                        images.append(img)
                    else:
                        oc.warning(f"Could not decode image {file_name}")
        if self.verbose:
            oc.info(f"Loaded {len(images)} images from {self.path}")
        return images
# Instantiate the custom image dataset artifact handler
image_dataset_artifact = oc.snapshot.image_archive(
"images_archive.zip", handler=ImageZipHandler, verbose=True
)
oc.info(f"Custom Image Archive Artifact Handler initialized at: {image_dataset_artifact.path}")
INFO Custom Image Archive Artifact Handler initialized at: snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
# Generate some random images for demonstration
random_images = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(50)]
# Save the images using the custom handler
image_dataset_artifact.save(random_images)
oc.info("Saved a collection of random images into a zip archive.")
# Load the images back from the zip archive
loaded_images = image_dataset_artifact.load()
oc.info(f"Loaded {len(loaded_images)} images from the archive. First image shape: {loaded_images[0].shape}")
oc.io.show_files_in_dir(os.path.dirname(image_dataset_artifact.path), verbose=True)
INFO Saving 50 images to snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
SUCCESS Successfully saved 50 images to snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO ✓ 'images_archive.zip' of 'image_archive' saved successfully at 'snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip'.
INFO Saved a collection of random images into a zip archive.
INFO Loading images from snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO Loaded 50 images from snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip
INFO ✓ 'images_archive.zip' of 'image_archive' loaded successfully from 'snapshots/snapshot_guide/v1:major-update/image_archive/images_archive.zip'.
INFO Loaded 50 images from the archive. First image shape: (64, 64, 3)
image_archive
└── images_archive.zip (2025-11-16 11:48, 612.4 KB)
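A quick aside on the `image_{i:04d}.png` naming used in the handler: `load()` sorts `zipf.namelist()` lexicographically, and unpadded indices would come back out of numeric order. A small standalone check (plain Python, no OpenCrate needed) shows why the padding matters:

```python
# Unpadded names sort lexicographically, not numerically:
unpadded = sorted(f"image_{i}.png" for i in (0, 2, 10))
# 'image_10.png' sorts before 'image_2.png'
assert unpadded == ["image_0.png", "image_10.png", "image_2.png"]

# Zero-padded names preserve numeric order under a plain sort:
padded = sorted(f"image_{i:04d}.png" for i in (0, 2, 10))
assert padded == ["image_0000.png", "image_0002.png", "image_0010.png"]
```

This is why the handler can rely on `sorted()` alone to restore the original image order.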
6. Best Practices for Artifact Management¶
(a). Choose Artifacts Wisely¶
Not every file needs to be an artifact. Save what matters and skip the rest. Too many artifacts clutter your snapshots and waste storage.
Good artifacts:
- Final cleaned datasets
- Trained model checkpoints
- Important plots and visualizations
- Config files with key parameters
Not artifacts:
- Temporary cache files
- Files you can easily regenerate
- Large raw datasets (unless you're specifically versioning them)
(b). Group Related Files¶
If your pipeline generates tons of related files, group them into one artifact instead of saving each individually.
Benefits:
- Less clutter
- Easier to manage (backup, delete, load as a unit)
- Clearer organization
How to group:
- Save the whole directory: Use a custom handler to save an entire folder as one artifact
- Compress into an archive: Bundle files into a ZIP or tar.gz (like our ImageZipHandler example)
Example: If you generate 1,300 JSON annotation files, don't create 1,300 artifacts. Either save the parent directory or compress them into annotations.zip. Much simpler.
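The archive approach can be sketched in plain Python with `zipfile` (hypothetical file names, no OpenCrate API involved): write each annotation into one compressed archive, then read the whole set back as a unit.

```python
import json
import tempfile
import zipfile
from pathlib import Path

# Hypothetical per-sample annotations that would otherwise become many small files
annotations = {f"sample_{i:04d}.json": {"id": i, "boxes": []} for i in range(3)}

archive = Path(tempfile.mkdtemp()) / "annotations.zip"

# Write every annotation into a single compressed archive
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zipf:
    for name, payload in annotations.items():
        zipf.writestr(name, json.dumps(payload))

# Read them all back as one unit
with zipfile.ZipFile(archive, "r") as zipf:
    loaded = {name: json.loads(zipf.read(name)) for name in sorted(zipf.namelist())}

assert loaded == annotations
```

The same logic dropped into a custom handler's `save`/`load` methods turns the whole annotation set into a single artifact.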
7. Conclusion¶
This guide covered everything you need to use OpenCrate effectively: snapshots, logging, artifacts, and custom handlers.
What OpenCrate Gives You¶
- Reproducibility: Version your outputs and recreate past experiments easily
- Organization: Auto-organized folders and files. No more mess.
- Easy Artifact Handling: Save and load any data type with simple commands
- Safety: Backup important files before changes. No more accidental overwrites.
- Flexibility: Extend with custom handlers for any file format
OpenCrate takes care of the boring file management stuff so you can focus on actual data science work.
Next Steps¶
- Read the docs: Check the official OpenCrate documentation for the full API reference
- Join the community: Ask questions, share ideas, contribute
- Try it yourself: Start using OpenCrate in your own projects
Thanks for reading! We hope OpenCrate makes your workflows cleaner and more reproducible.