Git

Inspiration here. Official Git documentation here. Software Carpentry reproducibility course.

What is Version Control, Git and why should I use it?

We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, automated version control is much better than this situation:

If you work alone, Git is great for tracking changes and recovering previous versions of your files. You can also use a remote repository to keep a backup and share your work.

Version control system (VCS)

Let’s imagine that we have a repository already working. When you create a new file as part of the repository (or repo), that file is untracked or unversioned. This means that git will ignore the file and any change you make to it until you add it to the repo. At that point the file is staged and is ready to get into the repository. To do that you do a commit and save that version of the file to the repo.

This workflow modify --> add --> commit will repeat every time you want to save a version of the file. We don’t recommend making a commit every time you save the file or change a comma, and it’s also not a good idea to make a single commit containing a huge number of changes. With practice and depending on how you work, you will find a comfortable middle ground.
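The modify --> add --> commit cycle described above can be sketched as follows. This is a minimal illustration, assuming a throwaway repository in a temporary folder; the file name notes.txt and the identity "Jane Doe" are placeholders, not part of any real project.

```shell
# Scratch repository so the example does not touch real work
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

# 1. modify: a new file starts out untracked
echo "first draft" > notes.txt
git status --short                         # ?? notes.txt

# 2. add: stage the file
git add notes.txt
git status --short                         # A  notes.txt

# 3. commit: record the staged version in the history
git commit -q -m "Add notes"

# The same cycle repeats for every later change
echo "second draft" >> notes.txt
git add notes.txt
git commit -q -m "Extend notes"

git log --oneline                          # two commits, newest first
```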

Git is (by far) the most widely used VCS in the world. It is distributed (every clone holds the full history) and it is free software that you install on your computer. It is a command-line tool, but there are many graphical interfaces that can help you use it.

How does it work?

A Git repository (or repo for short) contains all of the project files and the entire revision history. You’ll take an ordinary folder of files, and tell Git to make it a repository. This creates a .git subfolder, which contains all of the Git metadata for tracking changes.
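As a small sketch of that step, assuming a temporary folder that stands in for an ordinary project folder (the file analysis.R is a made-up example):

```shell
project="$(mktemp -d)"        # stands in for an ordinary project folder
cd "$project"
echo "x <- 1" > analysis.R

git init -q                   # tell Git to make the folder a repository
ls -A                         # now lists .git next to analysis.R
git rev-parse --git-dir       # prints ".git"
```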

Each commit records a snapshot of your files and a reference to the commit that came before it. If a file has not changed since the previous commit, Git does not store it again; it only stores a reference to the identical version it already has. This allows Git to quickly compare the current version of a file with any previous version, so you have a record of what has been done, and you can revert to specific versions should you ever need to.
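A minimal sketch of consulting the record and going back to an earlier version, assuming a scratch repository with a hypothetical params.txt file:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

echo "alpha = 0.05" > params.txt
git add params.txt && git commit -q -m "Set alpha"
echo "alpha = 0.01" > params.txt
git add params.txt && git commit -q -m "Tighten alpha"

git log --oneline -- params.txt            # record of what has been done
git diff HEAD~1 HEAD -- params.txt         # what changed between the two versions

# Bring the file back as it was one commit ago
git checkout HEAD~1 -- params.txt
cat params.txt                             # alpha = 0.05
```

The restored content is placed in the working directory and staged; it only becomes part of the history once it is committed again.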

Collaboration

Collaboration is the way different people can work on the same project together. Git makes collaboration easier, allowing changes by different people at the same time, on the same files. This is done by creating branches, which are different versions of the same project. Each branch can be worked on independently, and then merged back into the main branch.
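The branch-then-merge idea can be sketched like this. It is only an illustration in a scratch repository: the branch name methods and the file names are hypothetical, and git init -b assumes Git 2.28 or newer.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                        # -b needs Git 2.28 or newer
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

echo "intro" > paper.txt
git add paper.txt && git commit -q -m "Start paper"

# Work independently on a separate branch
git checkout -q -b methods
echo "methods" > methods.txt
git add methods.txt && git commit -q -m "Draft methods"

# Meanwhile, main moves on too
git checkout -q main
echo "results" > results.txt
git add results.txt && git commit -q -m "Draft results"

# Merge the branch back into main
git merge -q --no-edit methods
ls                                         # methods.txt paper.txt results.txt
```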

At some point, to make this collaboration easier, you will need to use a remote repository. This is a copy of your repository stored on a server. It allows you to share your work with others and to access it from different computers, by uploading your changes and downloading changes from others. The most popular hosting service for remote repositories is GitHub (owned by Microsoft), but there are others such as GitLab or Bitbucket.
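Uploading and downloading changes can be tried without any account by letting a bare repository on the local disk play the role of the server. The sketch below makes that assumption; the collaborators "Alice" and "Bob" and the file names are invented for the example.

```shell
work="$(mktemp -d)"
git init -q --bare -b main "$work/remote.git"    # local stand-in for a hosted repository

# Alice creates the project and uploads it
git init -q -b main "$work/alice"
cd "$work/alice"
git config user.name "Alice"; git config user.email "alice@example.com"
echo "data cleaning" > clean.R
git add clean.R && git commit -q -m "Add cleaning script"
git remote add origin "$work/remote.git"
git push -q -u origin main

# Bob downloads the project, adds his work, and uploads it
git clone -q "$work/remote.git" "$work/bob"
cd "$work/bob"
git config user.name "Bob"; git config user.email "bob@example.com"
echo "model fitting" > model.R
git add model.R && git commit -q -m "Add model script"
git push -q origin main

# Alice downloads Bob's change
cd "$work/alice"
git pull -q --ff-only origin main
ls                                               # clean.R model.R
```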

Note:

Institutional remote repositories are also available and can be easily accessed using your usual institutional credentials. They have the advantage of being hosted locally and of not being prone to sudden changes in the terms of service (among others). Also, you don’t contribute to the monopoly of a single company or let it use your data. However, you won’t get the same visibility as on GitHub, nor the same quality of service.

  • IRD GitLab
  • CIRAD GitLab
  • INRAE GitLab

Backup

The use of a remote repository is also a way to backup your work. If your computer crashes, you can easily recover your work by downloading the repository from the remote repository. It is as simple as:

git clone https://github.com/username/repo.git

Not only does this give you access to your work, but also to all of its previous versions.
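To see that a clone really carries the whole history, here is a small sketch in which a local folder stands in for the remote repository (an assumption made so that it runs offline; report.txt is a made-up file):

```shell
work="$(mktemp -d)"
git init -q "$work/original"
cd "$work/original"
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

for n in 1 2 3; do
    echo "version $n" > report.txt
    git add report.txt
    git commit -q -m "Report version $n"
done

# "Recover" the project elsewhere: the clone contains every commit
git clone -q "$work/original" "$work/recovered"
cd "$work/recovered"
git log --oneline                          # three commits
git show HEAD~2:report.txt                 # version 1
```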

Warning:

Those remote repositories are not meant to host large files. They are meant to host code and text files. If you need to back up large files associated with your code or project, have a look at Git LFS (git-lfs). If you are looking for a place to archive and version a large dataset, have a look at Zenodo or your institutional Dataverse. Please check the amap-data website for complete information.

Reproducibility

VCS are also a way to ensure the reproducibility of your work. By keeping track of all the changes you made to your files, you can easily reproduce the analysis you did at a specific time. This is also a way to ensure the transparency of your work, as you can share the repository with others, and they can see all the changes you made to the files. More and more journals are asking for the code and data associated with the publication, and using a VCS is a way to ensure that you can provide this information.

By providing a reproducible workflow (including the code, data and the environment to do the analysis), you also foster the reuse of your work by others. This is a way to ensure that your work has a greater impact, and that you can contribute to the scientific community (see Ram (2013)).
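One possible way to pin down the exact state of the code behind a result is to tag the corresponding commit. Tags are not covered elsewhere on this page, so take this as a sketch only; the tag name v1.0-submission and the file analysis.R are hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

echo "threshold <- 0.5" > analysis.R
git add analysis.R && git commit -q -m "Analysis used for submission"
git tag -a v1.0-submission -m "State of the code at submission"

# Work continues after the submission
echo "threshold <- 0.7" > analysis.R
git commit -q -am "Explore a new threshold"

# The tagged version stays retrievable by name
git show v1.0-submission:analysis.R        # threshold <- 0.5
```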

Install git

You need Git, so you can use it at the command line and so RStudio or your favorite editor can call it.

If there’s any chance it’s installed already, verify that, rejoice, and skip this step. (But consider updating an existing installation.)

Otherwise, find installation instructions below for your operating system.

Git already installed?

Go to the shell. Enter which git to request the path to your Git executable:

which git
## /usr/bin/git

and git --version to see its version:

git --version
## git version 2.45.2

If you are successful, that’s great! You have Git already. No need to install! Move on.

If, instead, you see something more like git: command not found, keep reading.
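Both checks can be combined in a few lines, for example at the top of a setup script. This is just one way to write it; the variable name git_status is arbitrary.

```shell
# Report where Git is and which version it is, or say that it is missing
if command -v git >/dev/null 2>&1; then
    git_status="found: $(git --version) at $(command -v git)"
else
    git_status="git: command not found, see the installation instructions"
fi
echo "$git_status"
```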

macOS users might get an immediate offer to install command line developer tools. Yes, you should accept! Click “Install” and read more below.

Windows

Updating Git for Windows

If you already have Git for Windows, but it’s not the latest version, it’s a good idea to update. You can update like so from the command line:

git update-git-for-windows

macOS

Linux

Install Git via your distro’s package manager.

Ubuntu or Debian Linux:

sudo apt-get install git

Fedora or RedHat Linux:

sudo yum install git

A comprehensive list for various Linux and Unix package managers can be found here.

Global configuration (only once)

Necessary configuration

In the shell:

git config --global user.name "Jane Doe"
git config --global user.email "jane@example.com"
git config --global --list

substituting your name and the email associated with your GitHub account.

The usethis R package offers an alternative approach. You can set your Git user name and email from within R:

## install if needed (do this exactly once):
## install.packages("usethis")

library(usethis)
use_git_config(user.name = "Jane Doe", user.email = "jane@example.com")

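Git uses a single user.email at a time, so a second address cannot simply be added to the global configuration. If you want to use another address for one project (for example an institutional one), set it from inside that repository, without --global:

git config user.email "jane@example.org"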
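To check which address Git will actually use in a given repository, keep in mind that a repository-level setting takes precedence over the global one. A minimal sketch, assuming a scratch repository and the placeholder address jane@example.org:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# No --global: the value is written to this repository's .git/config only
git config user.email "jane@example.org"

git config --local --get user.email        # value stored for this repository
git config --get user.email                # effective value used for commits made here
```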

Optional configuration

  • Default editor. Some Git commands need to open an editor. You can set your favorite editor as the default one. For example, to use RStudio as the default editor:
git config --global core.editor "Rstudio"
  • Default branch. Every time you create a new repository, a default branch (we will see what it is later) is created. You can set the name of this default branch. main is the most common name for this branch.
git config --global init.defaultBranch main
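The effect of the default branch setting can be checked without touching the global configuration by passing it for a single command with -c. This sketch assumes Git 2.28 or newer (older versions ignore init.defaultBranch) and uses a temporary folder:

```shell
repo="$(mktemp -d)"

# -c applies the setting to this one command only
git -c init.defaultBranch=main init -q "$repo"

# Name of the branch a first commit would be created on
git -C "$repo" symbolic-ref --short HEAD   # main
```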

Personal access token

You can use a personal access token in place of a password when authenticating to your remote repository in the command line or with the API. This will help you to save some time and avoid typing your password each time you push or pull from the remote repository.

  • Go to this page and click “Generate token”.
  • From R: usethis::create_github_token()

Git client

There are many integrations and graphical interfaces for git in IDEs. We here present only the most common ones, but feel free to explore others and complete this section.

RStudio integration

RStudio is probably your main text editor if you work with R. Luckily, it integrates natively with git.

  • First indicate the path to the git executable in the Tools > Global Options > Git/SVN menu.
  • Then you need to create a Personal Access Token to use with RStudio. Go to your GitHub account, click on your profile picture, then Settings > Developer settings > Personal access tokens > Generate new token. Give it a name, and select the repo scope. Copy the token and paste it into RStudio when prompted.

You are now ready to use git with RStudio.

Go to File > New Project > Version Control > Git and paste the URL of the repository you want to clone. Then click on Create Project.

You will now be able to see a tab Git in the top right pane of RStudio, commit changes, and push them to the remote repository. You can also use the terminal pane to use git commands directly.

VSCode integration

If you are using VSCode, you can install the GitLens extension for better integration with Git.

This allows you to see the history of your project, compare files between versions, push and pull changes, and much more, all without the terminal or the git command line.

Note:

VSCode is highly customizable to fit your needs. You can also check the GitHub Pull Requests extension to review, comment and merge pull requests directly from VSCode, the GitHub Actions extension to see the status of your CI/CD pipelines, and the GitHub Copilot extension to have an AI assistant to help you write code.

GitHub Desktop

GitHub Desktop is not an editor, but a graphical interface for Git with many features for easy integration with GitHub. You can see the history of your project, compare files between versions, push and pull changes, and much more, all without the terminal or the git command line.

Create a new git repository

When you start a new project, whether it is a new analysis, a new package, or a new paper, you will need to create a new Git repository. This repository will contain all the files of your project, along with the history of the changes made by all contributors.

You have two options to create a new repository:

  • create a new repository on the git server and clone it on your computer,

  • create a new repository on your computer and push it to the git server.

Where to store your git repositories

First, you need to decide where you want to store your git repositories. You have many options, depending on your needs (features, UI, collaboration, etc…) and your preferences (visibility to other researchers, privacy, etc…).

GitHub

GitHub, launched in 2008, is (by far) the most popular Git hosting service. It is now owned by Microsoft. This is probably the best option if you have an open-source project that you want to share with the community and you are open to collaboration. It provides many collaboration features: issues, discussions, project management, pull request reviews, etc… It also has a built-in CI/CD pipeline with GitHub Actions that can help you automate your workflow (testing, building, deploying, etc…) and a web hosting service with GitHub Pages.

Institutional git servers

CIRAD, INRAE and IRD have their own Git servers based on GitLab. This means that all the repositories you create on those servers are hosted by the institutions and not by a third party. This is an excellent option for internal or institution-based projects, as GitLab offers features very similar to GitHub’s and you get local support. However, collaboration with researchers outside the institution is more complicated (RENATER logins are used to access the platform) and some services are not as good as on GitHub (particularly CI/CD).

Self-hosted git servers

Mentioned here for completeness, but not recommended for most users as this requires advanced technical skills and a dedicated server.

You can host your own git server on your own server. This is a good option if you want to have full control over your data and your repositories, but it requires some maintenance. A good comparison of the different self-hosted git servers is available here.

GitHub first

To create a new git repository starting from GitHub, you need to:

  • Create a new repository from the Repositories tab

  • Give a name to your repository (1), a description, and choose if you want it to be public or private (2). It is a (very) good practice to add a README.md file to your repository that will be displayed on the main page of your repository and will give a brief description of your project (3), a LICENSE file that will define the license of your project (4), and a .gitignore file that will define the files that should not be tracked by git (5).

  • You now have a brand new repository on GitHub. You (and any collaborator) can now clone it to have it locally on your own computer by clicking on the green Code button (1) and copying the URL of the repository (2).

  • You can now clone the repository on your computer by opening a terminal and typing:
git clone https://github.com/thomasarsouze/test.git

or using RStudio and File > New Project > Version Control > Git and pasting the URL of the repository you want to clone. Then click on Create Project;

or using GitHub Desktop and File > Clone repository and pasting the URL of the repository you want to clone;

or using VSCode and View > Command Palette > Git: Clone and pasting the URL of the repository you want to clone.

GitLab first

Creating a new repository on GitLab is very similar to GitHub. You can see the steps:

GitLab provides pre-defined, customizable actions for initializing your repository.

Local first

  • git init
  • .git folder
  • .gitignore file
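The three items above fit together roughly as follows. This is a sketch under stated assumptions: the project name myproject, the ignore patterns, and the file names are examples only, and a bare repository on the local disk stands in for the empty repository you would normally create on the Git server first.

```shell
work="$(mktemp -d)"
mkdir "$work/myproject" && cd "$work/myproject"

git init -q -b main                        # creates the .git folder (-b needs Git 2.28+)
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

# .gitignore lists patterns for files Git should never track
printf '%s\n' '.Rhistory' '.RData' 'output/' > .gitignore
mkdir output && echo "tmp" > output/fig.tmp
echo "# My project" > README.md

git add .
git status --short                         # output/ does not appear
git commit -q -m "Initial commit"

# Connect the local repository to the server and push
git init -q --bare -b main "$work/server.git"    # local stand-in for the server
git remote add origin "$work/server.git"
git push -q -u origin main
```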

Git basics

Git basic commands

  • git status
  • git add
  • links to training courses + documentation
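A short sketch of how git status and git add interact, using the compact --short output. It assumes a scratch repository; the file names are invented.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

echo "a" > tracked.txt
git add tracked.txt && git commit -q -m "Add tracked file"

echo "b" >> tracked.txt                    # modify a tracked file
echo "c" > new.txt                         # create an untracked file

git status --short
#  M tracked.txt    (modified, not staged)
# ?? new.txt        (untracked)

git add new.txt                            # stage one file by name
git status --short
# A  new.txt        (new file, staged)
#  M tracked.txt

git add -A                                 # stage every change in the repository
git status --short
# A  new.txt
# M  tracked.txt    (modification, staged)
```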

Git workflow

  • Working directory, staging area, repository
  • branches
  • push and pull
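The difference between the working directory, the staging area, and the repository can be made visible by putting a different version of the same file in each. A minimal sketch, assuming a scratch repository and a hypothetical file.txt:

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

echo "line 1" > file.txt
git add file.txt && git commit -q -m "First version"   # repository: line 1

echo "line 2" >> file.txt
git add file.txt                           # staging area: lines 1-2
echo "line 3" >> file.txt                  # working directory: lines 1-3

git diff --stat                            # working directory vs staging area
git diff --staged --stat                   # staging area vs last commit
git show :file.txt                         # content currently staged
git show HEAD:file.txt                     # content of the last commit
```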

Merge conflicts
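
A conflict appears when two branches change the same lines of the same file and Git cannot decide which version to keep. The sketch below provokes one on purpose in a scratch repository and resolves it by hand; the branch name rename and the file meta.txt are hypothetical.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main                        # -b needs Git 2.28 or newer
git config user.name "Jane Doe"            # repo-local identity for the demo
git config user.email "jane@example.com"

echo "title: draft" > meta.txt
git add meta.txt && git commit -q -m "Add metadata"

# Two branches edit the same line differently
git checkout -q -b rename
echo "title: final" > meta.txt
git commit -q -am "Rename to final"

git checkout -q main
echo "title: submitted" > meta.txt
git commit -q -am "Rename to submitted"

git merge rename || true                   # stops and reports a conflict in meta.txt
cat meta.txt                               # both versions, between conflict markers

# Resolve: write the content you want, then stage and commit
echo "title: final (submitted)" > meta.txt
git add meta.txt
git commit -q -m "Merge rename, keep both labels"
```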

FAQ

My repo is too big! How do I clean it?

If you added large files to your repository by mistake, removing them from the directory is not sufficient: you also need to remove them from the history of the repository. Here are some tips to clean your repository:

  • List files in the history by size. This command lists the 100 biggest blobs in the history that are no longer present in HEAD, and provides the blob hash, the size, and the name of each file concerned.
# For macOS only: provides gnumfmt, the GNU version of numfmt
brew install coreutils

# List every blob in the history, drop those still present in HEAD,
# sort by size (largest first) and keep the first 100
git rev-list --objects --all --missing=print |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
    sort --numeric-sort --reverse --key=2 |
    $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest |
    head -100 > big_blobs.txt

# First clone your repo using --mirror
git clone --mirror https://github.com/myorg/myrepo.git

# Use the BFG Repo-Cleaner
# First let's check all options
java -jar bfg.jar --help
# BFG expects one blob hash per line, so keep only the first column
# of the list of big blobs obtained previously
awk '{print $1}' big_blobs.txt > blob_ids.txt
# You can then remove these blobs from the history of the repository
java -jar bfg.jar --strip-blobs-with-ids blob_ids.txt myrepo.git
# Another option is to directly remove all files bigger than a given size (here 5 MB)
java -jar bfg.jar --strip-blobs-bigger-than 5M myrepo.git
# Finally, purge the unreferenced objects and update your repository online
cd myrepo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

References

Ram, Karthik. 2013. “Git Can Facilitate Greater Reproducibility and Increased Transparency in Science.” Source Code for Biology and Medicine 8 (1): 7. https://doi.org/10.1186/1751-0473-8-7.