(T6) Data Science Workflows Using R and Spark

E. James Harner

Abstract:
This tutorial covers the data science process using R as a programming language and Spark as a big-data platform. Powerful workflows are developed for data extraction, data transformation and tidying, data modeling, and data visualization. During the course, R-based illustrations show how data are transported using REST APIs, sockets, etc. into persistent data stores such as the Hadoop Distributed File System (HDFS) and relational databases, and in some cases sent directly to Spark's real-time compute engine. Workflows built on dplyr verbs are used for data manipulation within R, within relational databases (PostgreSQL), and within Spark using sparklyr. These data-based workflows extend into machine learning algorithms, model evaluation, and data visualization. The machine learning algorithms taught in this tutorial include supervised techniques such as linear regression, logistic regression, decision trees, gradient-boosted trees, and random forests. Feature selection is done primarily by regularization, and models are evaluated using various metrics. Unsupervised techniques include k-means clustering and dimension reduction. Big-data architectures are discussed, including the Docker containers used to build the tutorial infrastructure, called rspark.
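To make the dplyr-to-sparklyr workflow concrete, the following is a minimal sketch of the pattern described above. It assumes a local Spark connection and the built-in mtcars data for a self-contained example; the tutorial itself runs against the rspark cluster and its own datasets, so the connection details, variables, and model choice here are illustrative only.

    library(sparklyr)
    library(dplyr)

    # Connect to Spark; "local" is an assumption for a self-contained
    # sketch (the rspark containers provide a cluster connection instead).
    sc <- spark_connect(master = "local")

    # Copy a small demo dataset into Spark as a Spark DataFrame.
    mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

    # dplyr verbs are translated to Spark SQL and executed inside Spark.
    cars <- mtcars_tbl %>%
      filter(hp > 100) %>%
      mutate(wt_tons = wt / 2)

    # Split the data, then fit a supervised model (logistic regression
    # predicting transmission type, am) with Spark MLlib via sparklyr.
    splits <- sdf_random_split(cars, training = 0.7, test = 0.3, seed = 42)
    fit <- ml_logistic_regression(splits$training, am ~ wt_tons + mpg)

    # Evaluate on held-out data using area under the ROC curve.
    pred <- ml_predict(fit, splits$test)
    ml_binary_classification_evaluator(pred, label_col = "am")

    # An unsupervised example: k-means clustering on two features.
    ml_kmeans(cars, ~ mpg + wt, k = 3)

    spark_disconnect(sc)

Note that the dplyr pipeline is lazy: nothing executes in Spark until results are collected or a model is fit, which is what lets the same verbs target R data frames, PostgreSQL tables, and Spark DataFrames.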

See:
https://github.com/jharner/rspark
https://github.com/jharner/rspark-tutorial
https://github.com/jharner/rspark-docker

The Docker containers can be run on the desktop, run using Vagrant, or deployed to Amazon Web Services (AWS). As a result, students will have access to a full big-data computing platform and extensive course content.

Presenter:
E. James Harner, West Virginia University, USA