Common DVC Commands
DVC (Data Version Control) is an open-source tool for managing datasets, machine learning models, and experiments. It works alongside Git, extending version control to large files and directories while enabling reproducible pipelines. Designed for data science and ML workflows, DVC helps track data changes, share outputs, and automate processes without bloating Git repositories.
Features
- Git-like operations for data and models (e.g., `dvc add`, `dvc push`).
- Reproducible pipelines (`dvc run`, `dvc repro`).
- Metrics tracking for experiments (`dvc metrics`).
- Storage-agnostic (supports S3, GCS, SSH, etc.).
- Collaboration via shared data repositories.
Example
dvc init
dvc add dataset/                  # Track large datasets
git add dataset.dvc .gitignore    # Stage the pointer file and ignore rule created by dvc add
git commit -m "Track data with DVC"
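Once data is tracked, it can be shared through a remote. The following is a minimal sketch, assuming a local directory stands in for cloud storage. The directory `dvc-remote-demo`, the remote name `localstore`, and the sample CSV are hypothetical placeholders, and the DVC steps run only when `git` and `dvc` are installed.

```shell
# Build a throwaway project with a small sample dataset (all names are placeholders).
mkdir -p dvc-remote-demo/storage dvc-remote-demo/project/dataset
printf 'id,value\n1,10\n2,20\n' > dvc-remote-demo/project/dataset/sample.csv
storage_dir="$(pwd)/dvc-remote-demo/storage"   # absolute path used as the remote

if command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1; then
  (
    cd dvc-remote-demo/project &&
    git init -q &&
    dvc init -q &&
    dvc add dataset &&                                  # writes dataset.dvc and updates .gitignore
    dvc remote add -d localstore "$storage_dir" &&      # -d marks it as the default remote
    git add dataset.dvc .gitignore .dvc/config &&       # version the pointer file and remote config
    dvc push                                            # upload cached data to the remote
  )
else
  echo "git or dvc not found; skipping the DVC steps"
fi
```

A collaborator who clones the Git repository and has access to the same remote would then run `dvc pull` to fetch the data. For real cloud storage, the path would be replaced by a URL such as an S3 or GCS bucket address.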
Some of the common DVC commands are as follows:
Command | Description | Use Case |
---|---|---|
`dvc init` | Initialize a DVC project | Start tracking data in your Git repository |
`dvc add` | Track files/directories with DVC | Add datasets or large files to DVC tracking |
`dvc commit` | Save changes to tracked files | Update metadata after modifying data files |
`dvc push` | Upload data to remote storage | Back up or share DVC-tracked data |
`dvc pull` | Download data from remote storage | Retrieve data tracked by DVC |
`dvc run` | Create a pipeline stage (deprecated in favor of `dvc stage add` in DVC 2.0 and removed in DVC 3.0) | Define data processing steps |
`dvc repro` | Reproduce a pipeline | Rerun pipeline when dependencies change |
`dvc checkout` | Check out data files | Switch between versions of the data |
`dvc status` | Show changes in data | Check if data files differ from the cache |
`dvc metrics` | Show or compare metrics (`dvc metrics show`, `dvc metrics diff`) | Track and compare ML model performance |
`dvc remote add` | Configure remote storage | Set up cloud storage (S3, GCS, etc.) |
`dvc cache dir` | Configure cache location | Change the default cache directory |
`dvc gc` | Garbage collection | Clean unused data from the cache |
`dvc dag` | Show pipeline graph | Visualize pipeline dependencies |
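The pipeline commands in the table can be sketched end to end as follows. This is an illustrative example only: the directory `dvc-pipeline-demo`, the stage name `prepare`, and the data files are hypothetical, and the stage simply copies a file to keep the example small. The DVC steps run only when `git` and `dvc` are installed.

```shell
# Throwaway project; directory, file, and stage names are placeholders.
mkdir -p dvc-pipeline-demo/data
printf 'id,value\n1,10\n2,20\n' > dvc-pipeline-demo/data/raw.csv

# dvc.yaml describes one stage: its command, its dependencies, and its outputs.
# Running `dvc stage add -n prepare -d data/raw.csv -o data/clean.csv cp data/raw.csv data/clean.csv`
# inside the project should generate an equivalent entry.
cat > dvc-pipeline-demo/dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: cp data/raw.csv data/clean.csv
    deps:
      - data/raw.csv
    outs:
      - data/clean.csv
EOF

if command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1; then
  (
    cd dvc-pipeline-demo &&
    git init -q &&
    dvc init -q &&
    dvc repro &&    # run the stage and record file hashes in dvc.lock
    dvc dag &&      # print the stage dependency graph
    dvc status      # report whether anything changed since the last run
  )
else
  echo "git or dvc not found; wrote dvc.yaml only"
fi
```

Running `dvc repro` a second time without touching `data/raw.csv` should skip the stage, because its dependencies are unchanged; editing the input file and rerunning triggers the stage again.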