
Git clean and smudge binary files to and from a separate LFS Store, transparently


Overview

Git-LFS (Large File Storage, sometimes mis-expanded as "Large File System") is a way to store and retrieve binary files from a store separate from the Git repository, which was designed to store text files.

Examples of binary files are: PDF files, videos (.mp4), audio samples (.mp3), and graphics (.png, .jpg, etc.). Other examples are datasets, Java .war files, and .NET Intermediate Language (IL) files.

Major SaaS SCM vendors offer their own LFS store service to store binary files:

  • GitHub: https://github.blog/2015-04-08-announcing-git-large-file-storage-lfs/
  • GitLab: https://docs.gitlab.com/ee/topics/git/lfs/
  • BitBucket:

There is a DIY Git-LFS server that stores files on AWS S3 at https://github.com/meltingice/git-lfs-s3

(Gogs, a Git service that runs on your own hardware, does not have built-in Git-LFS support.)

https://www.atlassian.com/git/tutorials/git-lfs is a great introduction, with cartoons.

Those who store binary files in the same repo as their text files would see much faster push and pull times if they instead used git-lfs.

But most SCM users don't know or care about LFS because they have not created individual files larger than the 2 GB that GitHub can handle. CAUTION: Video files can easily exceed that size.

Also, due to their size, binary files should really be stored at a SaaS service (such as Cloudinary) where they can be automatically resized for various client form factors and then duplicated across a set of servers for fast retrieval worldwide.

So an alternative to Git-LFS is DVC. Unlike git-lfs, DVC doesn't require installing a dedicated server; it can be used on-premises (NAS, SSH, for example) or with any major cloud provider (S3, Google Cloud Storage, Azure).

Identify Large Files

VIDEO:

  1. https://github.com/bloomberg/repofactor (by @hashpling)

  2. Identify

    generate-larger-than 50000 | add-file-info | sort -k3nr
  3. Convert

    bfg --convert-to-git-lfs 'logo-*.png' --no-blob-protection
    
    
    

Installation

  1. Install git

  2. Download and install the Git command line extension (written in Go) from https://github.com/git-lfs/git-lfs

    On macOS, with Homebrew:

    brew install git-lfs

    Alternatively, with MacPorts:

    port install git-lfs

    Within Windows:

    choco install git-lfs -y

    Within Ubuntu:

    sudo apt install git-lfs

    Configure globally

    VIDEO: Git Large File Storage - How to Work with Big Files Jun 12, 2015 by “GitHub Training & Guides”

  3. Set up Git LFS for your user account by running (only once):

    git lfs install

    The command edits the ~/.gitconfig file to activate Git's internal clean and smudge filters for LFS:

    [filter "lfs"]
     clean = git-lfs clean %f
     smudge = git-lfs smudge %f
     required = true
    

    The clean filter is invoked upon git add to replace each tracked file with a small pointer file, storing the actual content within .git/lfs/objects.

    The smudge filter is invoked upon git checkout to find LFS objects within .git/lfs/objects or the hosted LFS store (in the back-end).
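Conceptually, the clean filter swaps a file's content for a small pointer text. A minimal Python sketch of that pointer generation (illustrative only; the real git-lfs binary also stores the content under .git/lfs/objects before committing the pointer):

```python
import hashlib

def lfs_pointer(content: bytes) -> str:
    """Build the pointer text that Git LFS's clean filter commits
    in place of a large file. Sketch only -- real git-lfs also writes
    the content to .git/lfs/objects/<oid[:2]>/<oid[2:4]>/<oid>."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )

print(lfs_pointer(b"abc"))
```

The smudge filter does the reverse: given the pointer, it looks up the OID locally or downloads it from the LFS store.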

    In each repo

    The command above changes the content of the repo's .git/config file and adds Git Hooks within the hooks folder inside the .git folder Git initializes within each repository it manages. Listed here by sequence of execution:

    1. pre-push
    2. post-checkout
    3. post-commit
    4. post-merge

    Git message

    git-LFS works by adding commands within Git Hooks on every user’s laptop:

    These hooks replace files with pointers to a different registry, such as GitHub LFS or JFrog Artifactory.

  4. Specify all file extension media types to be tracked (managed) by Git-LFS:

    git lfs track "images/**"  # all files under the folder, recursively
    git lfs track "*.zip"
    git lfs track "*.tar"
    git lfs track "*.fbx"
    git lfs track "*.stl"   # 3D models
    git lfs track "*.pdf"
    git lfs track "*.psd"   # Adobe project
    git lfs track "*.png"
    git lfs track "*.jpg"
    git lfs track "*.gif"   # animations
    git lfs track "*.mp4"   # videos
    git lfs track "*.mp3"   # audios
    git lfs track "*.aiff"
    

    PROTIP: Each of the above commands adds to the .gitattributes file a line such as:

    *.png filter=lfs diff=lfs merge=lfs -text

    The command can be run anytime to configure more file extensions.
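Each tracked pattern maps to one .gitattributes line; a small Python sketch of that mapping (a hypothetical helper for illustration, not part of git-lfs):

```python
def track_line(pattern: str) -> str:
    """Return the .gitattributes line that `git lfs track <pattern>` writes."""
    return f"{pattern} filter=lfs diff=lfs merge=lfs -text"

# Generate the lines for a few of the extensions tracked above:
for pattern in ("*.png", "*.mp4", "*.psd"):
    print(track_line(pattern))
```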

  5. List what files LFS is managing:

    git lfs ls-files
    
  6. Make sure .gitattributes is tracked:

    git add .gitattributes
    git commit -m "Add attributes for LFS"
    git push origin master
    

    Note that defining the file types Git LFS should track will not, by itself, convert any pre-existing files to Git LFS, such as files on other branches or in your prior commit history.

    Add binary file

    git add huge/* 

    PROTIP: The above command results in copies of the specified binary file types being stored within the lfs folder inside the repo's .git folder.

    Upon git push, notice:

    Uploading LFS objects:

    After the upload, on GitHub.com, an "LFS" marker is shown next to each binary file managed under LFS.

    Notice that large (binary) files remain within the repo locally.

    Migrate

  7. To convert pre-existing files, use the git lfs migrate command, which has a range of options designed to suit various potential use cases:

    git lfs migrate

    The above command creates files that contain a pointer to where the actual file contents are stored:

    version https://git-lfs.github.com/spec/v1
    oid sha256:f23b93923ac92811771c3929d2323eeab233aa93239b32323b1ac222
    size 1122323
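Because a pointer file is just key/value lines, it is easy to inspect programmatically. A minimal Python parser (illustrative; the field names follow the spec lines shown above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields,
    e.g. {'version': ..., 'oid': 'sha256:...', 'size': '1122323'}."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f23b93923ac92811771c3929d2323eeab233aa93239b32323b1ac222\n"
    "size 1122323\n"
)
print(parse_lfs_pointer(pointer))
```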
    

    Commit LFS

  8. Commit and push to GitHub as per normal:

    git add file.psd
    git commit -m "Add design file"
    git push origin master
    

    Clone/merge of lfs

    When a repo using LFS is cloned onto another folder/laptop, the post-checkout and post-merge hooks download binary files to the .git/lfs folder within the repo.

    REMEMBER: GitHub does not (cannot) identify what has changed within each file like it does within text files.

  9. To create a binary file of a specific size (for testing):

    fallocate -l 1000001 binary/some-1mb.mp4

    REMEMBER: Because intermediate versions of binary files are not tracked like git-managed versioned text files, only the most recent binary file is retrieved from LFS.

Un-installation

git lfs uninstall

Other

To configure git-lfs (GitHub's Large File Storage), commands like these are placed in hook files:

#!/bin/sh
command -v git-lfs >/dev/null 2>&1 || { echo >&2 "\nThis repository is configured for Git LFS but 'git-lfs' was not found on your path. If you no longer wish to use Git LFS, remove this hook by deleting .git/hooks/post-commit.\n"; exit 2; }
git lfs post-commit "$@"
   

The above script first checks whether git-lfs is installed, then runs the git lfs command, with "$@" forwarding the arguments passed into the hook file.

JFrog Artifactory LFS

Artifactory lets you define which users or groups of users can access your LFS repositories. A full set of permissions can be configured:

  • where developers can deploy binary assets to,
  • whether they can delete assets and more.

Artifactory integrates with the most common access protocols such as LDAP, SAML, Crowd, and others.

VIDEO: Managing huge files on the right storage with Git LFS Jul 19, 2016 [39:46] by Tim Pettersen (@kannonboy) at the JFrog conference, where Atlassian announced a solution competitive with LFS. He illustrates Git internals to show why large binary files are so expensive. LFS OIDs are generated with SHA-256, which S3 can validate automatically.

Managing huge files on the right storage with Git LFS by JFrog

https://www.jfrog.com/confluence/display/JFROG/Git+LFS+Repositories

Advantages to using Artifactory instead of GitHub’s LFS support:

  1. In Artifactory UI, click “Set Me Up”

    cat ~/.lfsconfig
    [lfs]
     url = "https://artifactory/api/lfs/my-big-objects"
    

Artifactory can also be set up as a local cache, to set watches, etc.

Virtual LFS storage

DVC (Data Version Control)

dvc is a CLI package that wraps around (extends) Git and git-lfs to store large files away from GitHub.

dvc also enables git checkout of versioned data.

  1. DVC’s installers include snap, pip, brew, choco, conda (Anaconda) or an OS-specific package. It installs/upgrades python3, grpc, awscli, azure-cli, sqlite, Qt, opencv, imagemagick, ansible, vtk, apache-arrow, thrift, zstd, and other packages it needs.

    https://dvc.org is Apache2 open sourced at https://github.com/iterative/dvc

    On MacOS:

    brew install dvc

    DVC Data Flow chart

    git-lfs-dvc

    VIDEO: Versioning Data with DVC (Hands-On Tutorial!) Sep 30, 2020 and https://dvc.org/doc/start/data-versioning:

    Initialize dvc raw data manifest md5

  2. Create a repository.
  3. Make a folder and cd into it.
  4. dvc init establishes metric collection and creates a .dvc folder within the repository, including its config file.

  5. Create a data folder and download into it a sample data.xml file referenced by the ML model:

    dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

  6. dvc add data/data.xml

    This creates a .dvc file which contains an MD5 hash value of the contents. This "pointer" enables versioning of the data.

  7. cat data/data.xml.dvc

    outs:
    - md5: a23023923ab033023f09a9bbc333d
      path: data.xml

    Under the hood, DVC uses reflinks (or hardlinks) to avoid copy operations on checkouts.
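The md5 value in the .dvc file is simply the MD5 digest of the data file's contents; a minimal Python sketch (DVC's real hashing also handles directories and reads large files in chunks):

```python
import hashlib

def dvc_md5(content: bytes) -> str:
    """MD5 digest of a file's contents -- the value DVC records
    in the .dvc pointer file for a single-file output."""
    return hashlib.md5(content).hexdigest()

print(dvc_md5(b"hello"))
```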

  8. git add data/data.xml.dvc

  9. dvc add also wrote a line in data/.gitignore so that Git ignores the actual data.xml file. Stage it:

    git add data/.gitignore

    PROTIP: This is so that when the actual (big) file is pulled in locally, it won’t be accidentally uploaded into GitHub.

  10. git commit -m "Add raw data"
  11. cd data

    Remote Files in GCP

  12. On Google Drive, get the URL to a folder:

    git-lfs-gdrive

  13. Highlight and copy the response.
  14. Define the remote:

    dvc remote add -d storage gdrive://234203409023909fabcde

    The command updates the config file in the .dvc (hidden) folder in the repository, like this:

    ['remote "storage"']
    url = gdrive://234203409023909fabcde
    
  15. Commit:

    git commit .dvc/config -m "Configure remote storage"
    dvc push

  16. Paste when you’re prompted to “Enter a Verification Code”. Download of the actual file represented by the md5 hash should now occur.

    Remote files in EC2 on AWS

  17. To share data in S3:

    dvc remote add -d myremote s3://mybucket/image_cnn
  18. To obtain actual files from remote storage:

    dvc push and underlying dvc fetch commands extend git to populate assets from remote data storage (S3, GS, Azure, SSH, etc.) based on SHA references in the Local Cache.

    “Makefiles for data and ML projects” done right.

    The ".pkl" file type defines the local workspace model pulled from GitHub or https://dagshub.com, DVC's SaaS equivalent of GitHub.com.

    They are called DVC pipelines (computational graph) because they connect code and data together to specify all steps to produce a model: input dependencies including data, commands to run, and output information to be saved. This is why DVC is advertised as a reproducible and shareable data pipeline (for Machine Learning experiments) with metrics tracking.
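Such a pipeline is declared in a dvc.yaml file; a hypothetical single-stage example (the file names here are illustrative, not from this article):

```yaml
stages:
  train:                       # stage name
    cmd: python train.py       # command to run
    deps:                      # input dependencies (code and data)
      - train.py
      - data/data.xml
    outs:                      # outputs DVC will track and cache
      - model.pkl
```

Running `dvc repro` re-executes only the stages whose dependencies changed, which is what makes the pipeline reproducible.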

Versioning of datasets by dvc enables repeatable runs and thus comparison of results from different datasets. This is described as "CML (Continuous Machine Learning)", which leverages GitHub Actions, as described in a series of YouTube videos.

Elle O’Brien, Ph.D. at Dmitry Petrov’s iterative.ai (with her stuffed rainbow owl DeeVee) created a series of videos on YouTube.

Marcel Ribeiro-Dantas is the first DVC Ambassador

Invite to DVC’s Discord chat

https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8

https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/

How to Track your Development Process with DVC Dec 11, 2019 by Mark Keinhoerster and Tim Sabsch

https://dvc.org/blog


Github.com LFS Store

GitHub provides 1 GB of free LFS storage. Prior to December 2020, GitHub had users pre-pay for storage each month ($60 for each 50 GB block). However, GitHub now has pay-as-you-go invoicing for LFS storage and bandwidth usage.

The Admin can disable LFS at the Organizational level after removing existing LFS objects. A support ticket needs to be raised with object IDs of specific LFS files to be deleted. Find each object ID using this command (replacing “path/to/file” with the path to the file in your repository):

shasum -a 256 path/to/file

References

https://git-lfs.github.com/

Pluralsight video: 045 Introduction to Git LFS (Large File Storage) Oct 6, 2019 by Dan Gitschooldude

How To setup Git with Git LFS for Unity Broken Knights Games

https://medium.com/junior-dev/how-to-use-git-lfs-large-file-storage-to-push-large-files-to-github-41c8db1e2d65

https://dzone.com/articles/git-lfs-why-and-how-to-use

GitLFS - How to handle large files in Git by Lars Schneider at FOSSASIA Summit 2017

Git LFS at Light Speed - Git Merge 2017

Tracking Huge Files with Git LFS - Atlassian Summit 2016 Atlassian
