
Git clean and smudge binary files to and from a separate LFS Store, transparently


Overview

Git-LFS (Large File Storage, sometimes mis-expanded as "Large File System") is a way to store and retrieve binary files from a store separate from the Git repository, which was designed to store text files.

Examples of binary files are: PDF files, videos (.mp4), audio samples (.mp3), and graphics (.png, .jpg, etc.). Other examples are datasets, Java .war files, and .NET Intermediate Language (IL) files.

Major SaaS SCM vendors offer their own LFS store service to store binary files:

  • GitHub: https://github.blog/2015-04-08-announcing-git-large-file-storage-lfs/
  • GitLab: https://docs.gitlab.com/ee/topics/git/lfs/
  • BitBucket:

There is a DIY Git-LFS server that stores files on AWS S3 at https://github.com/meltingice/git-lfs-s3

(Gogs, a Git service that runs on your own hardware, does not have built-in Git-LFS support.)

https://www.atlassian.com/git/tutorials/git-lfs is a great introduction, with cartoons.

Those who store binary files in the same repo as their text files would see much faster push and pull times if they instead used git-lfs.

But most SCM users don't know or care about LFS because they have not created individual files larger than the 2 GB that GitHub can handle. CAUTION: Video files can easily exceed that size.

Also, due to their size, binary files should really be stored at a SaaS service (such as Cloudinary) where they can be automatically resized for various client form factors and then duplicated across a set of servers for fast retrieval worldwide.

So an alternative to Git-LFS is DVC. Unlike git-lfs, DVC doesn't require installing a dedicated server; it can be used on-premises (NAS, SSH, for example) or with any major cloud provider (S3, Google Cloud Storage, Azure).

Identify Large Files

VIDEO:

  1. https://github.com/bloomberg/repofactor (by @hashpling)

  2. Identify

    generate-larger-than 50000 | add-file-info | sort -k3nr
  3. Convert

    bfg --convert-to-git-lfs 'logo-*.png' --no-blob-protection
    
    
    

Installation

  1. Install git

  2. Download and install the Git command line extension (written in Go) from https://github.com/git-lfs/git-lfs

    On macOS, with Homebrew:

    brew install git-lfs

    Alternatively, with MacPorts:

    port install git-lfs

    Within Windows:

    choco install git-lfs -y

    Within Ubuntu:

    sudo apt install git-lfs

    Configure globally

    VIDEO: Git Large File Storage - How to Work with Big Files Jun 12, 2015 by “GitHub Training & Guides”

  3. Set up Git LFS for your user account by running (only once):

    git lfs install

    The command edits the ~/.gitconfig file to activate Git's internal clean and smudge filters for LFS:

    [filter "lfs"]
     clean = git-lfs clean %f
     smudge = git-lfs smudge %f
     required = true
    

    The clean filter is invoked upon git add to replace each tracked file with a small pointer file, storing the actual content within .git/lfs/objects.

    The smudge filter is invoked upon git checkout to find LFS objects within .git/lfs/objects or the hosted LFS store (in the back-end).
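Conceptually, the clean filter swaps a file's content for a small pointer text. A minimal Python sketch of that pointer generation (illustrative only; the real git-lfs binary also stores the content under .git/lfs/objects before committing the pointer):

```python
import hashlib

def lfs_pointer(content: bytes) -> str:
    """Build the pointer text that Git LFS's clean filter commits
    in place of a large file. Sketch only -- real git-lfs also writes
    the content to .git/lfs/objects/<oid[:2]>/<oid[2:4]>/<oid>."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )

print(lfs_pointer(b"abc"))
```

The smudge filter does the reverse: given the pointer, it looks up the OID locally or downloads it from the LFS store.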

    In each repo

    The command above changes the content of the repo's .git/config file and adds Git Hooks within the hooks folder inside the .git folder Git initializes within each repository it manages. Listed here by sequence of execution:

    1. pre-push
    2. post-checkout
    3. post-commit
    4. post-merge

    Git message

    git-LFS works by adding commands within Git Hooks on every user’s laptop:

    These hooks replace files with pointers to a different registry, such as GitHub LFS or JFrog Artifactory.

  4. Specify all file extension media types to be tracked (managed) by Git-LFS:

    git lfs track "images/**"  # all files under the folder, recursively
    git lfs track "*.zip"
    git lfs track "*.tar"
    git lfs track "*.fbx"
    git lfs track "*.stl"   # 3D models
    git lfs track "*.pdf"
    git lfs track "*.psd"   # Adobe project
    git lfs track "*.png"
    git lfs track "*.jpg"
    git lfs track "*.gif"   # animations
    git lfs track "*.mp4"   # videos
    git lfs track "*.mp3"   # audios
    git lfs track "*.aiff"
    

    PROTIP: Each of the above commands adds to the .gitattributes file a line such as:

    *.png filter=lfs diff=lfs merge=lfs -text

    The command can be run anytime to configure more file extensions.
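Each tracked pattern maps to one .gitattributes line; a small Python sketch of that mapping (a hypothetical helper for illustration, not part of git-lfs):

```python
def track_line(pattern: str) -> str:
    """Return the .gitattributes line that `git lfs track <pattern>` writes."""
    return f"{pattern} filter=lfs diff=lfs merge=lfs -text"

# Generate the lines for a few of the extensions tracked above:
for pattern in ("*.png", "*.mp4", "*.psd"):
    print(track_line(pattern))
```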

  5. List what files LFS is managing:

    git lfs ls-files
    
  6. Make sure .gitattributes is tracked:

    git add .gitattributes
    git commit -m "Add attributes for LFS"
    git push origin master
    

    Note that defining the file types Git LFS should track will not, by itself, convert any pre-existing files to Git LFS, such as files on other branches or in your prior commit history.

    Add binary file

    git add huge/* 

    PROTIP: The above command results in copies of the specified binary file types being stored within the lfs folder inside the repo's .git folder.

    Upon git push, notice:

    Uploading LFS objects:

    After the upload, on GitHub.com, an "LFS" marker is shown next to each binary file managed under LFS.

    Notice that large (binary) files remain within the repo locally.

    Migrate

  7. To convert pre-existing files, use the git lfs migrate command, which has a range of options designed to suit various potential use cases:

    git lfs migrate

    The above command creates files that contain a pointer to where the actual file contents are stored:

    version https://git-lfs.github.com/spec/v1
    oid sha256:f23b93923ac92811771c3929d2323eeab233aa93239b32323b1ac222
    size 1122323
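Because a pointer file is just key/value lines, it is easy to inspect programmatically. A minimal Python parser (illustrative; the field names follow the spec lines shown above):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its key/value fields,
    e.g. {'version': ..., 'oid': 'sha256:...', 'size': '1122323'}."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f23b93923ac92811771c3929d2323eeab233aa93239b32323b1ac222\n"
    "size 1122323\n"
)
print(parse_lfs_pointer(pointer))
```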
    

    Commit LFS

  8. Commit and push to GitHub as per normal:

    git add file.psd
    git commit -m "Add design file"
    git push origin master
    

    Clone/merge of lfs

    When a repo using LFS is cloned onto another folder/laptop, the post-checkout and post-merge hooks download binary files to the .git/lfs folder within the repo.

    REMEMBER: GitHub does not (cannot) identify what has changed within each file like it does within text files.

  9. To create a binary file of a specific size (for testing):

    fallocate -l 1000001 binary/some-1mb.mp4

    REMEMBER: Because intermediate versions of binary files are not tracked like git-managed versioned text files, only the most recent binary file is retrieved from LFS.

Un-installation

git lfs uninstall

Other

To configure git-lfs (GitHub's Large File Storage), commands like these are placed in hook files:

#!/bin/sh
command -v git-lfs >/dev/null 2>&1 || { echo >&2 "\nThis repository is configured for Git LFS but 'git-lfs' was not found on your path. If you no longer wish to use Git LFS, remove this hook by deleting .git/hooks/post-commit.\n"; exit 2; }
git lfs post-commit "$@"
   

The above script first checks whether git-lfs is installed, then runs the git lfs command, with "$@" forwarding the arguments passed into the hook file.

JFrog Artifactory LFS

Artifactory lets you define which users or groups of users can access your LFS repositories. A full set of permissions can be configured:

  • where developers can deploy binary assets to,
  • whether they can delete assets and more.

Artifactory integrates with the most common access protocols such as LDAP, SAML, Crowd, and others.

VIDEO: Managing huge files on the right storage with Git LFS Jul 19, 2016 [39:46] by Tim Pettersen (@kannonboy) at the JFrog conference, where Atlassian announced a solution competitive with LFS. He illustrates Git internals to show why large binary files are so expensive. LFS OIDs are generated with SHA-256, which S3 can validate automatically.

Managing huge files on the right storage with Git LFS by JFrog

https://www.jfrog.com/confluence/display/JFROG/Git+LFS+Repositories

Advantages to using Artifactory instead of GitHub’s LFS support:

  1. In Artifactory UI, click “Set Me Up”

    cat ~/.lfsconfig
    [lfs]
     url = "https://artifactory/api/lfs/my-big-objects"
    

Artifactory can also be set up as a local cache, to set watches, etc.

Virtual LFS storage

DVC (Data Version Control)

dvc is a CLI package that wraps around (extends) Git and git-lfs to store large files away from GitHub.

dvc also enables git checkout of versioned data.

  1. DVC’s installers include snap, pip, brew, choco, conda (Anaconda) or an OS-specific package. It installs/upgrades python3, grpc, awscli, azure-cli, sqlite, Qt, opencv, imagemagick, ansible, vtk, apache-arrow, thrift, zstd, and other packages it needs.

    https://dvc.org is Apache2 open sourced at https://github.com/iterative/dvc

    On MacOS:

    brew install dvc

    DVC Data Flow chart

    git-lfs-dvc

    VIDEO: Versioning Data with DVC (Hands-On Tutorial!) Sep 30, 2020 and https://dvc.org/doc/start/data-versioning:

    Initialize dvc raw data manifest md5

  2. Create a repository.
  3. Make a folder and cd into it.
  4. dvc init establishes metric collection and creates a .dvc folder within the repository, including its config file.

  5. Create a data folder and download into it a sample data.xml file referenced by the ML model:

    dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

  6. dvc add data/data.xml

    This creates a .dvc file which contains an MD5 hash value of the contents. This "pointer" enables versioning of the data.

  7. cat data/data.xml.dvc

    outs:
    - md5: a23023923ab033023f09a9bbc333d
      path: data.xml

    Under the hood, DVC uses reflinks (or hardlinks) to avoid copy operations on checkouts.
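The md5 value in the .dvc file is simply the MD5 digest of the data file's contents; a minimal Python sketch (DVC's real hashing also handles directories and reads large files in chunks):

```python
import hashlib

def dvc_md5(content: bytes) -> str:
    """MD5 digest of a file's contents -- the value DVC records
    in the .dvc pointer file for a single-file output."""
    return hashlib.md5(content).hexdigest()

print(dvc_md5(b"hello"))
```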

  8. git add data/data.xml.dvc

  9. dvc add also wrote a line in data/.gitignore so that Git ignores the actual data.xml file. Stage it:

    git add data/.gitignore

    PROTIP: This is so that when the actual (big) file is pulled in locally, it won’t be accidentally uploaded into GitHub.

  10. git commit -m "Add raw data"
  11. cd data

    Remote Files in GCP

  12. On Google Drive, get the URL to a folder:

    git-lfs-gdrive

  13. Highlight and copy the response.
  14. Define the remote:

    dvc remote add -d storage gdrive://234203409023909fabcde

    The command updates the config file in the .dvc (hidden) folder in the repository, like this:

    ['remote "storage"']
    url = gdrive://234203409023909fabcde
    
  15. Commit:

    git commit .dvc/config -m "Configure remote storage"
    dvc push

  16. Paste when you’re prompted to “Enter a Verification Code”. Download of the actual file represented by the md5 hash should now occur.

    Remote files in EC2 on AWS

  17. To share data in S3:

    dvc remote add -d myremote s3://mybucket/image_cnn
  18. To obtain actual files from remote storage:

    dvc push and underlying dvc fetch commands extend git to populate assets from remote data storage (S3, GS, Azure, SSH, etc.) based on SHA references in the Local Cache.

    “Makefiles for data and ML projects” done right.

    The ".pkl" file type defines the local workspace model pulled from GitHub or https://dagshub.com, DVC's SaaS equivalent of GitHub.com.

    They are called DVC pipelines (computational graph) because they connect code and data together to specify all steps to produce a model: input dependencies including data, commands to run, and output information to be saved. This is why DVC is advertised as a reproducible and shareable data pipeline (for Machine Learning experiments) with metrics tracking.
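Such a pipeline is declared in a dvc.yaml file; a hypothetical single-stage example (the file names here are illustrative, not from this article):

```yaml
stages:
  train:                       # stage name
    cmd: python train.py       # command to run
    deps:                      # input dependencies (code and data)
      - train.py
      - data/data.xml
    outs:                      # outputs DVC will track and cache
      - model.pkl
```

Running `dvc repro` re-executes only the stages whose dependencies changed, which is what makes the pipeline reproducible.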

Versioning of datasets by dvc enables repeatable runs and thus comparison of results from different datasets. This is described as "CML (Continuous Machine Learning)", which leverages GitHub Actions, as described in a series of YouTube videos.

Elle O’Brien, Ph.D. at Dmitry Petrov’s iterative.ai (with her stuffed rainbow owl DeeVee) created a series of videos on YouTube.

Marcel Ribeiro-Dantas is the first DVC Ambassador

Invite to DVC’s Discord chat

https://towardsdatascience.com/why-git-and-git-lfs-is-not-enough-to-solve-the-machine-learning-reproducibility-crisis-f733b49e96e8

https://blog.codecentric.de/en/2020/01/remote-training-gitlab-ci-dvc/

How to Track your Development Process with DVC Dec 11, 2019 by Mark Keinhoerster and Tim Sabsch

https://dvc.org/blog


Github.com LFS Store

GitHub provides 1 GB of free LFS storage. Prior to December 2020, GitHub had users pre-pay for storage each month ($60 for each 50 GB block). However, GitHub now has pay-as-you-go invoicing for LFS storage and bandwidth usage.

The Admin can disable LFS at the Organizational level after removing existing LFS objects. A support ticket needs to be raised with object IDs of specific LFS files to be deleted. Find each object ID using this command (replacing “path/to/file” with the path to the file in your repository):

shasum -a 256 path/to/file

References

https://git-lfs.github.com/

Pluralsight video: 045 Introduction to Git LFS (Large File Storage) Oct 6, 2019 by Dan Gitschooldude

How To setup Git with Git LFS for Unity Broken Knights Games

https://medium.com/junior-dev/how-to-use-git-lfs-large-file-storage-to-push-large-files-to-github-41c8db1e2d65

https://dzone.com/articles/git-lfs-why-and-how-to-use

GitLFS - How to handle large files in Git by Lars Schneider at FOSSASIA Summit 2017

Git LFS at Light Speed - Git Merge 2017

Tracking Huge Files with Git LFS - Atlassian Summit 2016 Atlassian
