Data Science Curriculum

Introduction

This curriculum is designed to serve as an overview of the tools, techniques, and knowledge required to become a successful data scientist. It assumes a basic level of scientific and statistical understanding; those newer to the field may want to brush up on those baseline skills, while those with more experience may find the basic sections to be a simple refresher. Not every skill we recommend will be used on a given project, and advanced projects will sometimes require novel research or techniques and tools not listed here. However, a facility with the items in this curriculum should leave you both competitive among current data scientists and equipped to learn new skills as necessary.

Most critically, a data scientist must be able to reliably understand and manipulate data to produce meaningful insights, and to distinguish those insights from spurious ones. Opportunities to learn and practice this skill are much harder to come by than tutorials on how to perform a given analysis in a certain programming language. That said, critical thinking and problem solving, both fundamental to the data science process, aren’t exclusive to data science training. Many backgrounds, whether in the humanities or elsewhere, will already have laid that foundation. Practice in applying these thought processes to novel data, and in acquiring new knowledge through external resources and lessons, will serve any new or aspiring data scientist well.

A Note on Timelines

Timelines are expressed as three categories that indicate the following ranges of expected learning time. Unless explicitly stated otherwise, these indicate time-to-learn for immediate use as a practicing data scientist, not time-to-master. For instance, for data science use, only a short amount of time needs to be devoted to the basics of version control using GitHub, so it is denoted Short Term. Mastering the full variety of version and repository control available in a system like GitHub is a much longer-term project.

Short Term: Measured in minutes to hours
Medium Term: Measured in hours to days
Long Term: Measured in days to weeks

Baseline Knowledge

Expected to have already been acquired:

Most data science relies heavily on an effective conceptual understanding of the math involved; however, the actual math is performed via programming. In general, this means that understanding where mistakes might occur in the math is generally sufficient for processing the data correctly.

  • Simple algebra
  • Simple order of operations
  • Experience reading plots and visualizations of data

Foundational Skills

  • Scientific method (Short term)
  • Hypothesis testing (Medium term)
    • What p-values actually mean
    • Selecting the appropriate test for the data or question at hand
    • Bootstrapping and non-parametric methods
  • Research-driven problem solving (Short term)
    • Search
    • Stack Overflow
    • GitHub Issues
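
As an illustration of the bootstrapping and non-parametric methods listed under hypothesis testing, here is a minimal sketch in plain Python. It assumes two small groups of measurements and asks two questions: what range of means is plausible for one group (percentile bootstrap), and how often a difference in means this large would appear if the group labels were meaningless (permutation test). The function names and the sample values are invented for illustration and do not come from any particular library or dataset.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling simulates a world in which group labels carry no information.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[: len(a)]) - statistics.fmean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Made-up measurements, used only to exercise the functions.
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8, 5.2, 5.4]
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6]

ci = bootstrap_mean_ci(treated)
p = permutation_p_value(control, treated)
print(f"95% bootstrap CI for treated mean: {ci}")
print(f"permutation p-value: {p:.4f}")
```

The p-value here is the share of shuffled datasets whose difference is at least as large as the observed one. It is not the probability that the hypothesis is true, which is the distinction the "what p-values actually mean" item points at.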

Programming

(Medium term to learn from scratch, long term to master)

  • Primary languages
    • Python
      • Installation and environment handling
        • Pip installation
        • Virtualenv
        • Anaconda distribution
      • Critical Packages:
        • NumPy
        • Pandas
        • Matplotlib
        • Scikit-learn
    • R
      • RStudio
      • Package installation
      • Critical Packages
        • Tidyverse
        • ggplot2
        • dplyr
        • tidyr
  • Version Control (Short term)
    • GitHub
  • Notebook format (Short term)
    • Jupyter Notebook and Lab
  • Big Data concepts and applications (Medium term)
    • Hadoop / Spark
      • Purpose and function of map/reduce
      • Scala Programming
    • In-memory data handling and streaming
    • Apache Beam
  • Containerization (Medium term)
    • Primary
      • Docker
    • Secondary
      • Kubernetes, Swarm, etc.
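
To make the "purpose and function of map/reduce" item concrete, the sketch below counts words with the same two-phase shape that Hadoop and Spark jobs use, but on a single machine with the Python standard library. The helper names `map_phase` and `reduce_phase` and the sample text are hypothetical; a real cluster would run the map step on separate workers and shuffle the partial results between them.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Map: turn one chunk of text into partial word counts."""
    return Counter(chunk.lower().split())


def reduce_phase(left, right):
    """Reduce: merge two partial results into one."""
    return left + right


# Each string stands in for a block of data that lives on a different worker.
chunks = ["the quick brown fox", "the lazy dog", "the fox jumps"]

partials = map(map_phase, chunks)  # independent per chunk, so parallelizable
totals = reduce(reduce_phase, partials, Counter())

print(totals.most_common(2))
```

The pattern scales because the map step never needs to see more than one chunk, and the reduce step is associative, so partial results can be merged in any grouping.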

Data Wrangling and Cleaning

(Short term)

  • Data dimensionality
    • Common characteristics of 1D Data
      • Autocorrelation
      • Seasonality
      • Timestamp handling
      • Frequency
    • Common characteristics of 2D Data
      • Matrix multiplication
      • Curse of dimensionality
    • Handling missing data in a principled way
      • Avoiding look-ahead bias
      • Forward and back filling
      • Interpolation
      • Winsorizing
    • Common data formats
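
The missing-data techniques above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a recommended implementation: in practice these operations are usually done with a data-frame library, and the function names and sample values here are invented. Note that forward filling only uses past observations, whereas interpolation uses the next known value, so interpolating a time series can introduce look-ahead bias if the filled values feed a model that is meant to predict the future.

```python
def forward_fill(values):
    """Replace each None with the most recent earlier observation."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first observation exists
    return filled


def interpolate_linear(values):
    """Fill interior None gaps on a straight line between known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clamp values outside the given quantiles to the quantile values."""
    ordered = sorted(values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]


# Made-up series with gaps, and a made-up sample with one extreme value.
series = [1.0, None, None, 4.0, None, 6.0]
ffilled = forward_fill(series)
interp = interpolate_linear(series)
clipped = winsorize([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 0.1, 0.9)

print(ffilled)
print(interp)
print(clipped)
```

Winsorizing keeps the outlying row but limits its influence, which is often preferable to deleting it when the observation is real but extreme.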

Databases

(Medium term)

  • SQL language
    • Common commands:
      • Creating tables
      • Updating tables
      • Select
      • Delete
      • Relationships
  • Database types
    • Common varieties to expect: MySQL, PostgreSQL
    • SQL vs. NoSQL
    • Cold Storage
    • Data specific
      • Time series
      • Graph
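
The common SQL commands listed above can be tried without installing a database server by using the `sqlite3` module from the Python standard library. The sketch below assumes a made-up two-table schema (`authors` and `articles`) to show table creation, insertion, update, deletion, and a select that follows a relationship between tables. SQLite's dialect differs in places from MySQL and PostgreSQL, so treat this as a way to practice the concepts rather than as portable SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Creating tables; author_id expresses the relationship between them.
cur.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    """CREATE TABLE articles (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           author_id INTEGER NOT NULL REFERENCES authors(id)
       )"""
)

# Inserting rows (hypothetical data).
cur.executemany(
    "INSERT INTO authors (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)
cur.executemany(
    "INSERT INTO articles (title, author_id) VALUES (?, ?)",
    [("Intro to SQL", 1), ("Joins", 1), ("Draft", 2)],
)

# Updating and deleting, with placeholders rather than string formatting.
cur.execute("UPDATE articles SET title = ? WHERE title = ?", ("Joins in Practice", "Joins"))
cur.execute("DELETE FROM articles WHERE title = ?", ("Draft",))

# Select across the relationship: article count per author.
rows = cur.execute(
    """SELECT authors.name, COUNT(articles.id)
       FROM authors
       LEFT JOIN articles ON articles.author_id = authors.id
       GROUP BY authors.id
       ORDER BY authors.name"""
).fetchall()
conn.close()

print(rows)
```

The `LEFT JOIN` keeps authors who have no remaining articles, so they appear with a count of zero instead of disappearing from the result.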

Data Visualization

(Short term)

  • Interpreting data using visualization methods
    • Exploring data by visualizing it
    • Fitting kernels
    • Covariance matrices
    • Heatmaps
    • Confusion matrices
    • 2- and 3-dimensional plots

Statistics

(Long term)

  • Means
  • Variance
  • Outliers
  • Statistical moments
    • Testing for normality
  • Correlation
    • Pearson
    • Spearman
  • Statistical distributions
    • Discrete vs. continuous
    • Primary examples
  • Linear methods
    • Regression
    • Multiple and Hierarchical Regression
  • Nonlinear methods
    • Logistic regression
    • Bayesian statistics
      • Bayes rule
      • Defining priors
    • Markov Chain Monte Carlo (MCMC)
      • R: Stan
      • Python: PyMC3
    • Variational Inference
    • Hierarchical modeling
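
The difference between the two correlation measures listed above is easiest to see on data that is monotonic but not linear. The sketch below implements both from their definitions in plain Python: Pearson measures linear association, while Spearman is Pearson applied to ranks. The function names and the cubic example are chosen for illustration only; statistical libraries provide tested versions of both.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A perfectly monotonic but nonlinear relationship (illustrative values).
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because `y` always increases with `x`, the rank-based measure reports a perfect relationship, while Pearson is pulled below 1 by the curvature.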

Machine Learning

(Medium term)

  • Classification vs. regression
  • Ensembling
    • Boosting
      • XGBoost
    • Bagging
    • Decision Trees
      • Random forests
  • Neural Networks
    • Rudimentary backpropagation
    • Activation functions
    • Layering
    • Convolutions
    • Recurrent
  • Hyperparameter tuning
  • Transfer learning
  • Generative adversarial networks
  • Reinforcement learning
  • Unsupervised methods
    • Clustering
      • Hierarchical
      • K-means
    • Auto-encoders
    • Principal components analysis
    • Independent components analysis
  • Data quantity requirements
    • Resampling and data expansion methods
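
As a small example of the unsupervised methods above, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. It assumes two-dimensional points, a fixed number of iterations instead of a convergence check, and random initial centroids drawn from the data; the function name `k_means` and the toy points are hypothetical. Libraries such as scikit-learn offer more robust initialization and stopping rules.

```python
import math
import random


def k_means(points, k, n_iter=20, seed=0):
    """Cluster `points` (tuples of numbers) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Two well-separated toy blobs (made-up coordinates).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
]

centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

On data this cleanly separated the result does not depend on the starting centroids, but on real data K-means can settle in a poor local optimum, which is why it is commonly run several times with different seeds.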

Cloud Computing

(Short Term to make decisions on what to use, Medium Term to properly utilize an individual product)

  • A basic understanding of the major cloud providers available to data scientists
    • Google Cloud Platform (GCP)
    • Amazon Web Services (AWS)
    • Microsoft Azure
  • Common storage services
    • Google Cloud Storage (GCS) / Buckets
    • Amazon S3
    • Azure Blob Storage
  • Common simplified scalable functions
    • Google Cloud Functions
    • AWS Lambda
    • Azure Functions
  • Data Science specific tools
    • Google
      • Dataproc
      • Dataprep
      • AI Hub
      • Jupyter Notebook
      • Machine Learning Engine
      • BigQuery
      • AutoML (Vision, Tables, Language)
      • NLP API
    • AWS
      • EMR
      • Redshift
      • QuickSight
      • SageMaker
    • Azure
      • Azure Databricks
      • Machine Learning Service
      • Machine Learning Studio
      • HDInsight
      • Azure Notebooks
      • Data Science Virtual Machine
