DSC 80 – The Practice and Application of Data Science
❓ Tech Support
Assignments in DSC 80 are mostly coding assignments, so it's important to make sure that your computing environment is set up properly. There are two ways to go about things: you can set up a local environment or use a remote environment that is largely pre-configured. On this page, we'll talk about both options.
Writing code locally, on your personal computer, is our preferred option. We won't lie – it involves a little more time to set up and a steeper learning curve. But in the long run, you'll likely find the local environment more comfortable and faster since you can customize it to your own needs. Additionally, setting up your own local Python environment is something you'll be expected to do when working as a data scientist, so it's a good idea to start now.
There has been a lot written about how to set up a Python environment, so we won't reinvent the wheel. This page will only be a summary; Google will be your main resource. But always feel free to come to a staff member's office hours if you have a question about setting up your environment, using Git, or similar — we're here to help.
Much of the content on this page is also covered in the walkthrough video below:
Working Locally
Working locally simply refers to developing code using software installed on your own machine. For this class, the software you'll need includes Python 3.8, a few specific Python packages, Git, and a text editor.
Installing Python
There are several ways of installing Python on your own computer. We recommend downloading the Anaconda Python distribution and following the instructions for installing it. You should install the standard Anaconda (not Miniconda) with Python 3.8.
As mentioned above, Anaconda is a Python distribution. This means it comes
with many useful Python packages, including, for instance, pandas
. If you
should need to install a new Python package, you can use the conda
command.
You'll need to open a terminal. On Windows, you can use the Anaconda Prompt;
on macOS or Linux, you can use the terminal app that comes with the operating
system, or install one (Alacritty is a popular choice). Then, inside the#
terminal, type conda install <package_name>
, where <package_name>
is
replaced by the name of the package you want to install, and hit enter.
Anaconda comes with pandas, numpy, and many other data science packages. You
will, however, need to install otter-grader
; this is the autograder package
that runs the tests in labs, projects, etc. You can do so by running: pip
install otter-grader
in a terminal.
Replicating the Gradescope Environment
Gradescope has a package environment using which it evaluates the submissions. It is advised to create the same environment so that there are no issues due to version changes during development vs. evaluation. Please follow the below steps to create the environment with required packages
1. Have a requirements.txt file with the following text
matplotlib==3.4.3
notebook==6.4.12
numpy==1.21.2
otter-grader==3.1.4
pandas==1.3.3
Pillow==8.3.2
pydantic==1.8.2
PyYAML==5.4.1
requests==2.26.0
tqdm==4.62.3
urllib3==1.26.7
scikit-learn==1.0
seaborn==0.11.2
beautifulsoup4==4.10.0
2. In Terminal, create a new conda environment: conda create -n dsc80 python=3.8
3. Activate the environment: conda activate dsc80
4. Install the requirements in the new env: pip install -r requirements.txt
Every time you work on DSC80, activate this environment by running conda activate
dsc80
Git
All of our course materials, including your assignments, are hosted on GitHub in a Git repository. This means that you'll need to download Git and use in order to work with the course materials.
Git is a version control system. In short, it is used to keep track of the history of a project. With Git, you can go back in time to any previous version of your project, or even work on two different versions (or "branches") in parallel and "merge" them together at some point in the future. We'll stick to using the basic features of Git in DSC 80.
There are Git GUIs, and you can use them for this class. You can also use the command-line version of Git. To get started, you'll need to "clone" the course repository. The command to do this is:
git clone https://github.com/dsc-courses/dsc80-2022-fa
This will copy the repository to a directory on your computer. To bring in
the latest version of the repository, run git pull
. This will not overwrite
your work. In fact, Git is designed to make it very difficult to lose work
(although it's still possible!).
Choosing a Text Editor or IDE
In this class, you will need to use a combination of editors for doing your assignments: The python files should be developed with an IDE (for syntax highlighting and running doctests) and the data/results should be analyzed/presented in Jupyter Notebooks. Below is an incomplete list of IDEs you might want to try. For more information about them, feel free to ask me!
-
The JupyterLab text editor: see above. A nice combination with notebooks.
-
sublime: A favorite text editor of hackers, famous for its multiple cursors. A good, general-purpose choice.
-
atom: GitHub’s editor. Pretty nice fully featured IDE. Can only work locally.
-
PyCharm (IntelliJ): Those who feel at home coding Java. Can only work locally.
-
VSCode: Microsoft Visual Studio Code -- the trendy IDE of 2021. Might be too much for the small projects we're working on, like using a sledgehammer to drive in a thumbtack.
-
nano: available on most unix commandlines (e.g. DataHub Terminal). If you use this for more than changing a word or two, you'll hate your life.
-
(neo)vim: lightweight, productive text-editor that might be the most efficient way to edit text, if you can ever learn how to use it. Beware opening vim, as you may never figure out how to quit (literally). Justin's text editor of choice.
-
emacs: A text editor for those who prefer a life of endless toil. Endlessly customizable, it promises everything, but you’re never good enough to deliver. Its keyboard shortcuts are guaranteed to give you carpal tunnel.
Working Remotely (DataHub)
Working remotely means using an environment that someone else set up for you on a computer far, far away, usually through the browser. This is the way you wrote code in DSC 10, for instance. There's nothing wrong with this, per se, and it is simpler, but you should think of this option as developing with "training wheels". Eventually, you will need to learn how to set up your own Python environment, and now is as good a time as any.
There are servers available to use at datahub.ucsd.edu. These are a lot like the jupyterhub servers that you used in DSC 10, however they are customized for this course. After logging in with your ucsd account, you will be taken the familiar juptyer landing page. The server you are logged into has ~4GB of RAM available, and has Python with all the necessary packages.
⚠️ Warning!
DataHub outages are not uncommon, and they can be expected to occur once or twice per quarter (sometimes more). Outages typically last for a few hours or less, but they can prevent you from working on your assignment.
Since we do not manage DataHub, we cannot make any guarantees about its availability. DataHub crashes that prevent you from turning in or working on your assignment near the deadline are typically handled via the usual slip day mechanism. If DataHub has been down for a long time (more than 24 hours), let us know and we'll consider a blanket extension – though this has very rarely (never?) happened.
Our advice is to use a local development environment, or to at least have one as available as a backup option. If you decide to use DataHub as your first choice, you should keep an extra slip day or two in reserve in case the server crashes.
Installing or Updating Python Packages
To update a package (e.g. pandas) on DataHub, you'll need to use the command line. To do this, open “New > Terminal” and type:
pip install --user --upgrade pandas
followed by the enter key to run the command.
One package that you'll likely need to install is otter-grader
. This package
provides the autograder that checks your answers in the labs and projects.
JupyterLab
The remote servers have a development environment installed on them, however, it’s non-intuitive how to access it. Once on the landing page, the url should read something like:
https://datahub.ucsd.edu/user/USER/tree
You can access the IDE (integrate development environment) by changing "tree" to "lab". This brings up jupyterlab. The url should look something like this:
https://datahub.ucsd.edu/user/USER/lab
For more information on this IDE, you can see read about it here. From within jupyterlab, you can:
- use a python console
- run jupyter notebooks
- use a terminal (e.g. to pull git repos)
- develop python code in .py files.
Git
Using DataHub Servers will require you to pull down HW assignments from GitHub
using the command-line (manually uploading to the servers is very cumbersome!)
To do this, open “new > terminal” and, to get the course repository for the
first time, type: git clone https://github.com/dsc-courses/dsc80-2022-fa
Then,
open up the file-tree in the original Jupyter tab, and you should see all the
course files now there. If you have already cloned the repository, and just want
to get the latest files, type git pull
and you should see the
updated files.