Unveiling Git and GitHub for Data Science Students

[Image. Source: DALL·E 3]

Introduction

If you are a Data Science student working with Python, you have probably heard of Git and GitHub. These tools are essential for version control and collaboration in programming projects. In this article, we will explore step by step how to start using Git and GitHub, from installation to collaborative work.

Verifying your Git Installation

Before we begin, check whether Git is already installed on your computer. Open a terminal (Terminal on macOS or Linux; Command Prompt, PowerShell, or Git Bash on Windows) and type the following command:

git --version

If you do not have Git installed, follow the official installation instructions for your operating system on the Git website.
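
If you prefer a check that reports both outcomes instead of an error message, the following is a minimal sketch for a POSIX-style shell (macOS, Linux, or Git Bash on Windows). The variable name git_status is just an illustration.

```shell
# Check whether Git is on the PATH before running any other Git command.
if command -v git >/dev/null 2>&1; then
    git_status="installed: $(git --version)"
else
    git_status="not installed"
fi
echo "Git is $git_status"
```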

Creating a GitHub Account

If you do not already have a GitHub account, go to github.com and click “Sign up” to create one. Follow the instructions to set up your account. Remember to choose a username relevant to your field of study.

Creating a New Repository

Now that you have Git installed and a GitHub account, let’s learn how to create a new repository.

On GitHub

  1. Log in to your GitHub account.
  2. Click the “+” icon in the upper right corner and choose “New repository.”
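  3. Fill in the repository name, an optional description, and choose whether it will be public or private. If you plan to connect a repository that you create on your machine, leave the options for adding a README, .gitignore, or license unchecked so that the GitHub repository starts empty.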
  4. Click “Create repository.”

On Your Machine

  1. Open the terminal and navigate to the folder where you want to create the repository.
  2. Use the command git init to initialize a new local repository.
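
As a rough sketch, the two steps above look like this in a script. The folder name my-analysis is a placeholder, and the script works inside a temporary directory so it does not touch your real files.

```shell
set -eu
# Hypothetical project folder inside a throwaway temporary directory.
workdir="$(mktemp -d)"
mkdir "$workdir/my-analysis"
cd "$workdir/my-analysis"

# Turn the folder into a Git repository (creates a hidden .git directory).
git init -q

# Confirm that Git now recognizes the folder as a working tree.
git rev-parse --is-inside-work-tree
```

In your own project you would simply cd into the project folder and run git init there.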

Connecting the Local Repository to GitHub

After creating the local and GitHub repositories, it’s time to connect them.

On GitHub

  1. In the newly created repository, click the “Code” button and copy the repository’s URL.

In the Terminal

  1. Use the command git remote add origin [URL] to add the remote repository. Replace [URL] with the URL you copied earlier.
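
A short sketch of this step, assuming a placeholder URL (your-username and your-repo are not real names; use the URL you copied from the “Code” button). Adding a remote only saves the address, so nothing is sent to GitHub yet.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q

# Placeholder URL: replace it with the URL copied from GitHub.
repo_url="https://github.com/your-username/your-repo.git"

# Register the GitHub repository under the conventional name "origin".
git remote add origin "$repo_url"

# List the configured remotes to confirm that the URL was saved.
git remote -v
```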

Adding Files, Committing, Pulling, and Pushing

Now that your repository is set up, you can start working with files. But before proceeding, it is important to understand some key concepts: commit, pull, and push.

Commit

A commit is a kind of “snapshot” of the current state of the files in your repository. It is a way to record the changes you have made. Before committing, you choose which changes go into the snapshot by staging them with git add [file]. Each commit has a descriptive message that explains the changes made, and it is good practice to keep these messages concise but informative. For example, the command git commit -m "Added chart functionality" records the staged changes in a commit with the message “Added chart functionality”.
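
Here is a minimal sketch of staging a file and committing it. The file charts.py and the name and email are placeholders, and the script runs in a temporary directory.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q

# Identity recorded in the commit (placeholder values for this sketch).
git config user.name "Example Student"
git config user.email "student@example.com"

# Create a file (hypothetical plotting script) and stage it.
echo 'print("chart placeholder")' > charts.py
git add charts.py

# Record the staged snapshot with a descriptive message.
git commit -q -m "Added chart functionality"

# Show the history, one line per commit.
git log --oneline
```

Only staged changes are included in the commit, which is why git add comes first.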

Pull

The git pull command is used to update your local repository with changes made in the remote GitHub repository. Imagine you are working on a team project, and a team member has made some changes to the code and pushed them to GitHub. To bring your local repository up to date with their changes, you use git pull, which fetches the new commits and merges them into your current branch. This way you are working with the latest version of the code.
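
To try git pull without a real team, the sketch below simulates one on a single machine. It assumes a local bare repository (remote.git) as a stand-in for GitHub and two folders, teammate and you, as stand-ins for two people. All names, files, and the identity are placeholders.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
# Placeholder identity for every commit made by this script.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# A teammate starts the project and publishes it to the stand-in remote.
git init -q teammate
cd teammate
echo "import pandas as pd" > analysis.py
git add analysis.py
git commit -q -m "Start analysis script"
git branch -M main
cd ..
git clone -q --bare teammate remote.git
git -C teammate remote add origin "$workdir/remote.git"

# You clone the project and receive the first commit.
git clone -q remote.git you

# The teammate pushes a new commit while you are working.
cd teammate
echo "df = pd.read_csv('data.csv')" >> analysis.py
git commit -q -am "Load dataset"
git push -q origin main
cd ..

# git pull downloads the teammate's commit and updates your branch.
cd you
git pull -q origin main
git log --oneline
```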

Push

The git push command is used to send your local commits to the remote repository on GitHub. When you make changes to your files and create local commits, these changes exist only on your computer. To share them with other collaborators or securely back them up on GitHub, you use git push. This sends your commits to the remote repository, making your changes available to other people working on the same project.
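
The sketch below shows a first push. Since a script cannot log in to GitHub for you, it assumes a local bare repository (remote.git) as a stand-in for the empty GitHub repository. With a real project, origin would point to your GitHub URL instead. The file and identity are placeholders.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the empty GitHub repository: a local bare repository.
git init -q --bare remote.git

# Local repository with one commit.
git init -q project
cd project
git config user.name "Example Student"
git config user.email "student@example.com"
echo "# My Data Science project" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main

# Connect the remote, then push. The -u flag sets origin/main as the
# upstream, so later on this branch a plain "git push" or "git pull" works.
git remote add origin "$workdir/remote.git"
git push -q -u origin main

# The remote now holds the same commit as the local branch.
git ls-remote origin main
```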

In summary, the workflow typically involves making changes to files, staging those changes with git add, recording them with a descriptive message (git commit), bringing your local repository up to date with the remote repository (git pull), and then sending your commits to GitHub (git push). This way, you collaborate efficiently and keep track of every change in your Data Science project.

Working with Outdated Files

If you have edited files locally and then realize that your copy is behind the GitHub repository, you can set your changes aside, update, and then restore them:

  1. Use git stash to save your local changes.
  2. Perform a git pull to update your local repository.
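  3. Use git stash apply to reapply your saved changes. The stash entry is kept afterwards; git stash pop reapplies the changes and removes the entry in one step. If the pulled commits touched the same lines, Git will ask you to resolve the conflict.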
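
The three steps above can be sketched as follows. As an assumption for the demo, a local bare repository (remote.git) stands in for GitHub, and the teammate edits a different file from yours so that reapplying the stash is conflict-free. File names and the identity are placeholders.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
# Placeholder identity for every commit made by this script.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Setup: a shared project with two files and a stand-in remote.
git init -q teammate
cd teammate
echo "import pandas as pd" > analysis.py
echo "project notes" > notes.txt
git add analysis.py notes.txt
git commit -q -m "Initial files"
git branch -M main
cd ..
git clone -q --bare teammate remote.git
git -C teammate remote add origin "$workdir/remote.git"
git clone -q remote.git you

# The teammate pushes a change to notes.txt ...
cd teammate
echo "meeting on Friday" >> notes.txt
git commit -q -am "Update notes"
git push -q origin main
cd ..

# ... while you have an uncommitted edit to analysis.py.
cd you
echo "df = pd.read_csv('data.csv')" >> analysis.py

git stash -q              # 1. set your uncommitted edit aside
git pull -q origin main   # 2. bring in the teammate's commit
git stash apply -q        # 3. put your edit back on top

# analysis.py shows as modified; notes.txt already contains the update.
git status --short
```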

Understanding Branches

Branches are a fundamental part of Git and GitHub. They allow you to work on different versions of a project simultaneously.

  • To create a new branch, use git checkout -b [branch-name].
  • To switch between branches, use git checkout [branch-name].
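  • To merge changes from one branch into another, first switch to the branch that should receive the changes, then use git merge [branch-name], where [branch-name] is the branch you want to merge in.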
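
Here is a small sketch that uses all three commands in order. The branch name add-charts, the file, and the identity are placeholders chosen for illustration.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

# First commit on the main branch.
echo "import pandas as pd" > analysis.py
git add analysis.py
git commit -q -m "Start analysis script"
git branch -M main

# Create a new branch and switch to it, then commit a change there.
git checkout -q -b add-charts
echo "import matplotlib.pyplot as plt" >> analysis.py
git commit -q -am "Import plotting library"

# Switch back to main, then merge the branch into it.
git checkout -q main
git merge -q add-charts

git log --oneline
```

Until the merge, main does not contain the plotting import, so experiments on add-charts cannot break the main version.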

Working with Collaborators

Collaborating on GitHub projects involves using pull requests (PRs) to propose and review changes. Here’s a quick overview:

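  1. A contributor forks the main repository, which creates their own copy of it on GitHub.
  2. They clone the fork and create a new branch for their changes.
  3. After completing the changes, they push the branch to their fork and open a PR against the main repository.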
  4. Reviewers can comment, approve, or request changes in the PR.
  5. When the PR is approved, the changes are merged into the main repository.
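
The command-line part of this workflow (cloning the fork, creating a branch, and pushing it) can be sketched as below. Forking, opening the PR, and reviewing happen on the GitHub website and are not shown. As an assumption for the demo, a local bare repository (fork.git) stands in for your fork, and the branch name update-readme is a placeholder.

```shell
set -eu
workdir="$(mktemp -d)"
cd "$workdir"
# Placeholder identity for every commit made by this script.
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# Setup: fork.git stands in for your fork and already has one commit.
git init -q seed
cd seed
echo "# Shared project" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main
cd ..
git clone -q --bare seed fork.git

# Clone the fork to your machine.
git clone -q fork.git my-clone
cd my-clone

# Create a branch for the change and commit to it.
git checkout -q -b update-readme
echo "Setup instructions go here." >> README.md
git commit -q -am "Add setup section to README"

# Push the branch to the fork. The PR is then opened on GitHub.
git push -q -u origin update-readme
```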

It is important to note that in open-source projects, it is usually the project maintainers who have the power to approve PRs. In private projects or teams, the approval process may vary but typically involves designated reviewers.

Keep Your Code Updated

Remember to start your work on the code with a pull to ensure you are using the latest version of the project, and finish with a push so that everyone else can get your updated code. This helps prevent conflicts and maintains effective collaboration.

Now that you have a basic understanding of Git and GitHub, you are ready to start collaborating on Data Science projects more effectively. Remember to practice and explore more features as you progress in your programming journey. Good luck!