Day 1
Use data.table, an enhanced version of data.frame. The package provides a fast file reader (fread()) and lets you add / modify / delete columns by reference (i.e. without copying), optionally by group. It is especially useful for large data (>1 GB in RAM).
Use the package lubridate to make it easier to work with dates and times.
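A minimal setup sketch (the file name "transactions.csv" is illustrative):
library(data.table)
library(lubridate)
myData <- fread("transactions.csv") # fread() is data.table's fast file reader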
myData[, TransDate := dmy(TransDate)] # parse day-month-year strings into Date objects
Select
myData[c(1:3,5),] # selects rows 1, 2, 3 and 5
myData[.N, ] # .N gives the number of rows and hence selects the last one
tail(myData, 1) # also selects the last row
myData[nrow(myData),] # also selects the last row
myData[TransDate > ymd("2010-12-24") & PurchAmount > 70, ] # rows can be selected using conditions based on their values
myData[Customer %in% c(149332, 172951),] # the operator %in% can be used to include multiple results
myData[, list(Customer, TransDate, PurchAmount)] # same principle applies for columns
myData[, c(1:2,4)] # columns 1,2 and 4 are selected
# everything combined:
myData[PurchAmount > 100, list(TransDate, Cost)] # filters rows and selects columns at once
subset(myData, PurchAmount > 100, select = c(TransDate, Cost)) # function subset is applied to myData
The following logical operators are available:
- < less than; <= less than or equal to
- > greater than; >= greater than or equal to
- == equal to
- != not equal to
- ! logical negation (NOT)
- & logical AND
- | logical OR
Aggregate
An aggregation has three components:
- the aggregating function (e.g. sum)
- the variable by which to aggregate
- the dimension by which to aggregate
myData[rows, function, by=dimension]
myData[, sum(PurchAmount), by=Customer] # aggregated by single variable
myData[, list(AggPurch=sum(PurchAmount)), by=Customer] # this will create a new column called AggPurch
# short list of aggregating functions:
# mathematical operators: sum(), min(), max()
# summary statistics: mean(x), median(x), sd(x)
# rounding functions: round(x), floor(x), ceiling(x)
# apply an aggregating function to a variable by an aggregating dimension to a selection of rows:
myData[2:5, list(AggPurch=sum(PurchAmount)), by=Customer]
myData[, list(AggPurch=sum(PurchAmount))] # no changes to myData are made
myData[,AggPurch := sum(PurchAmount)] # creates new column in myData
# creating a lagged variable
myData[, CostLag := shift(Cost), by=Customer] # will show the cost in the previous period
Merge
Merging requires at least two data.tables that share a common identifier (key).
There are different ways to merge data:
- Inner Join: intersection of the two sets
- Full Outer Join: all data in both sets
- Right Outer Join: all elements of right data set
- Left Outer Join: all elements of left data set
merge(A, B, by="ID", all=TRUE) # full outer join
merge(A, B, by="ID", all.x=TRUE) # left outer join
merge(A, B, by="ID", all.y=TRUE) # right outer join
merge(A, B, by.x="ID", by.y="CustID", all=FALSE) # inner join with two different IDs
merge(A, B, by=c("ID", "TransDate"), all=FALSE) # merge using multiple IDs
Conditions
if(condition) {
  print("some other stuff")
} else if(condition2) {
  print("some second stuff")
} else {
  print("some stuff")
}
# logical operators can be used to chain multiple arguments
if(x > 0 & x < 100) {
  print("x is between 0 and 100")
}
if(x < 0 | x > 100) {
  print("x is out of the range of 0 and 100")
}
Loops
# will be executed as long as statement is true
while(x <= 7) {
  print(paste("x is set to", x))
  x <- x + 1
}
# executes part of a sequence, the value we are looping over is called an "iterable"
for(city in cities) { # cities is the "iterable", city is a variable inside the loop
  if(city == "London") {
    break # exits the loop
  }
  if(city == "Paris") {
    next # moves to the next item (skips the print for this one)
  }
  print(city)
}
# the for loop can be run on iterables directly, but we can also run it using
# an iterator (a numeric index)
cities <- c("New York", "Paris", "London", "Tokyo", "Cape Town")
for(i in 1:length(cities)) {
  print(cities[i])
}
# storing results in a loop
numChars <- c() # create a new vector to store the results
for(i in 1:length(cities)) {
  numChars[i] <- nchar(cities[i]) # nchar counts the characters
}
# substitute loops with sapply()
sapply(cities, nchar) # applies nchar to each element; the output has the same length as the input
sapply(cities, nchar, USE.NAMES=FALSE) # will remove the names
# substitute loops with vectorization
nchar(cities) # nchar also works on a vector
Loop over vectors
- PRO: easy to code and read
- CON: slower
- CON: only the element itself is available inside the loop; its position is unknown
Loop over numeric indices
- PRO: more flexible
- PRO: position of element in vector is known
- PRO: faster
- CON: more difficult to code
Why Vectorization?
Uses less code and is much faster than loops or sapply(). Loops make sense where the functions do not work with vectors or when the results of each iteration depend on the results of previous iterations.
Functions
myFun <- function(arg1, arg2=1) { # arguments
  # body
  # the last evaluated value is returned
  return(value) # or the return value can be specified explicitly
}
add <- function(x, y=1) {
  result <- x + y
  return(result)
}
add(4) # 5
add(4,5) # 9
# function can return multiple values:
add <- function(x,y,z) {
  result1 <- x + y
  result2 <- x + y + z
  return(list(result1,result2))
}
add(1,2,4) # list(3, 7)
How to write a proper function?
- Start with a simple problem (precise, with a known answer)
- Get some working code to solve simplified problem
- Rewrite the code to use temporary variables
- Finally turn your code into a function using the function template
A good function must have an appropriate name. Usually verbs are used: removeFirstPurchases() (bad: myAwesomeFunction(), fun())
Two types of arguments
- data arguments: supply data to compute the function, are listed first
- detail arguments: control the details of the computation, should have a default
Scoring Model
- Measure the recency, frequency, and monetary value of customer purchases.
- Set a score to rank customers according to their purchase recency, frequency, and monetary value.
- Calculate the overall RFM score.
- Analyze RFM group differences.
Step 1
- Recency: difference in days between the latest purchase and today
- Frequency: number of transactions
- Monetary value: average amount spent per transaction
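A minimal data.table sketch of these three measures, assuming the column names from the Day 1 examples:
rfm <- myData[, list(
  recency   = as.numeric(difftime(Sys.Date(), max(TransDate), units = "days")),
  frequency = .N,                 # number of transactions
  monetary  = mean(PurchAmount)   # average amount spent per transaction
), by = Customer]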
Step 2
- The higher the recency measure, the lower the score (i.e. last purchase was long ago)
- The higher the frequency measure, the higher the score.
- The higher the monetary value measure, the higher the score.
Step 3
RFM scores can be weighted or unweighted. The unweighted score takes the simple average over all scores. The weighted score takes a weighted average over the scores (e.g. 60-20-20)
Step 4
Put customers into groups based on their RFM score differences.
library(Hmisc)
# goal: divide the range of X into intervals and code the values in X according
# to which interval they fall.
# input: numeric vector X, number of quantile groups g.
# output: factor object with levels of the form [a,b)
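The description above matches Hmisc's cut2() function; a minimal sketch with illustrative values:
scores <- c(12, 45, 3, 78, 56, 23, 89, 34)
groups <- cut2(scores, g = 3) # split the scores into 3 quantile groups
table(groups)                 # counts per interval; levels have the form [a,b)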
Day 2
Data in a relational database is stored in tables. Interactions with the database happen through a Database Management System (DBMS), a software package designed to retrieve, manage and insert data. Several R packages provide an SQL interface to interact with a database.
Connect to a Database
install.packages("RSQLite")
library(RSQLite)
# 1. create a connection to the database
con <- dbConnect(drv=RSQLite::SQLite(), dbname="database.sqlite")
# 2. perform various actions
dbListTables(conn=con) # list all tables
dbListFields(conn=con,"demographics") # list all fields in a table
dbReadTable(conn=con, name="transactions") # read a table
dbWriteTable(conn=con, name="transactions_copy", value=myData) # create a new table
dbRemoveTable(conn=con, name="transactions_copy") # remove a table
# 3. Close the connection
dbDisconnect(conn=con)
SQL Queries
# simultaneously submits and executes queries
dbGetQuery(con, "SELECT * FROM transactions;") # selects all available columns from transactions
# filtering rows
dbGetQuery(con, "SELECT * FROM transactions WHERE PurchAmount > 100;")
# filtering columns
dbGetQuery(con, "SELECT Customer, TransDate, PurchAmount FROM transactions;")
# filtering columns and rows
dbGetQuery(con, "SELECT TransDate, Cost FROM transactions WHERE PurchAmount > 100;")
# perform aggregations
dbGetQuery(con, "SELECT Customer, SUM(PurchAmount) AS AggPurch FROM transactions GROUP BY Customer;")
# perform multiple aggregations at once
dbGetQuery(con, "SELECT Customer, SUM(PurchAmount) AS AggPurch, SUM(Quantity) AS AggQuant FROM transactions GROUP BY Customer;")
# merge: inner join
dbGetQuery(con, "SELECT * FROM A INNER JOIN B ON A.ID=B.ID;")
merge(A,B, by="ID",all=FALSE) # equivalent in R
# merge: inner join with multiple IDs
dbGetQuery(con, "SELECT * FROM A INNER JOIN B ON A.ID=B.ID AND A.TransDate=B.TransDate;")
merge(A,B, by=c("ID","TransDate"), all=FALSE) # equivalent in R
Big Data Handling with Spark
Spark is a general-purpose data processing engine. Data within Spark is distributed over multiple nodes (think of "machines"), which makes data access faster and more resilient. Users interact through an interface (e.g. sparklyr) that manages the Spark session and communicates with the Spark cluster, and can query the data with SQL.
# 0. Install Spark
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.1.0")
# 1. Connect to Spark
sc <- spark_connect(master = "local") # connects to a local cluster
spark_web(sc) # starts the Spark web interface (requires an active connection)
# 2. Perform various actions
# create a new table from a data.frame / data.table
t_on_spark_table <- copy_to(sc, df=transactions, name="t_on_spark")
# create a new table from csv file
spark_read_csv(sc, path="transactions.csv", name="t_on_spark")
# Retrieve a list of tables
dplyr::src_tbls(sc)
# Execute regular SQL queries
library(DBI)
query_result_tables <- dbGetQuery(sc, "SHOW TABLES IN default")
query_result_columns <- dbGetQuery(sc, "SHOW COLUMNS IN t_on_spark")
# export to R data.frame
t_as_regular_df <- as.data.frame(t_on_spark_table)
# export as CSV
spark_write_csv(t_on_spark_table, "table_from_spark.csv")
# 3. Disconnect from Spark
spark_disconnect(sc)
What is a Shiny App
Dashboards
Dashboard: A dashboard is a user interface that organizes and presents information in a way that is easy to read. To some extent most graphical user interfaces (GUIs) resemble a dashboard.
Dashboards allow managers to have access to up-to-date reports on latest figures (presentation). Furthermore, they are a self-service analytics tool which enables data exploration to look at the impact of contextual factors. However, dashboards are by no means a panacea. Be aware of possible traps that might lead to wrong decisions:
- Importance trap: dashboard does not show relevant metrics for business (focus on relevant metrics)
- Context trap: Figures are not used in the right context (Context implies what to focus on, e.g. absolute or relative figures)
- Causality trap: Correlation does not imply causation. (Is a relationship based on correlation or causation?)
What is Shiny?
Shiny is a platform for creating interactive R programs embedded into a webpage. A shiny app is a webpage (user interface - UI) connected to a computer running a live R session (server). Users can manipulate the UI which will cause the server to update the UI's displays (by running R code).
# minimal example
library(shiny) # load the R package
ui <- fluidPage() # create a user interface
server <- function(input,output) {} # creating the backend (server)
shinyApp(ui=ui,server=server) # running the app
runApp() # runs an app from a local directory
runGitHub() # runs an app hosted in a GitHub repository
runGist() # runs an app stored in a gist
Shiny apps can have two different formats:
- single file: has all the code in one file, better suited for very simple apps
- multi file: has code in multiple files, better suited for more complex apps
Deploy a Shiny App
What is deployment and why should you bother?
To deploy (from the French "déployer") is "to spread out or arrange strategically". In its IT context it encompasses all the processes involved in getting new software or hardware up and running in its environment, including installation, configuration, running, testing and making necessary changes. The word implementation is sometimes used to mean the same thing.
Deployment is one of the last steps before the software is distributed to its users.
Structure of a Shiny App
Input and Output Objects
Outputs can be any object that R creates and that we want to display in our app. Type specific placeholders for each output need to be added to the user interface (e.g. a table or map). The actual form of the output is defined on the server side (e.g. data shown in the table or geographic region shown on the map).
Inputs (or control widgets) introduce interactivity, i.e., they enable users to request customized outputs. Input objects are added to the user interface (e.g. a dropdown list). The state of input objects is then passed on and used on the server side (e.g. the selected field of the dropdown list).
Steps to Build an Interactive Shiny App
- Start a Shiny app with the RStudio template that sets up the UI and server part.
- In the frontend part, setup the general structure and design of the Shiny app (e.g. fluidPage())
- In the UI part, setup the input widgets
- In the UI part, setup placeholders for the outputs
- In the server part, define the outputs - in particular how the inputs shape the output
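A minimal sketch of these steps as a single-file app (the widget and output names are illustrative):
library(shiny)
ui <- fluidPage(
  sliderInput(inputId = "num", label = "Number of draws", min = 1, max = 100, value = 25), # input widget
  plotOutput(outputId = "hist") # placeholder for the output
)
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$num)) # re-runs whenever the slider changes
  })
}
shinyApp(ui = ui, server = server)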
Build a user interface for a Shiny app
In a fluid website layout also referred to as a liquid layout, the majority of the components inside have percentage widths, and thus adjust to the user's screen resolution.
Some popular layout options:
ui <- fluidPage(
  fluidRow( # create rows (horizontal splits)
    column(2, "sidebar_1"), # create columns (vertical splits)
    column(2, "sidebar_2"),
    column(8, "main")
  )
)
# sidebar layout
# window contains a small sidebar and a big main panel
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel()
  )
)
# split layout
# window is split into equally sized parts
ui <- fluidPage(
  splitLayout(
    # object 1,
    # object 2
  )
)
# flow layout
# elements will appear next to each other and break onto the next line if
# width of window is too small
ui <- fluidPage(
  flowLayout(
    # object 1,
    # object 2,
    # object 3
  )
)
# vertical layout
# elements will be stacked on top of each other
ui <- fluidPage(
  verticalLayout(
    # object 1,
    # object 2,
    # object 3
  )
)
Other Layout elements
ui <- fluidPage(
  titlePanel("title of your app"), # adds a title to the page
  sidebarLayout(position="right",
    sidebarPanel("sidebar panel"),
    mainPanel("main panel")
  )
)
# include an image
img(src="filename.jpg") # image must be in www folder
# include a css file
includeCSS("file.css")
# alternative way to include a css file
tags$head(tags$link(rel = "stylesheet", type="text/css", href="file.css"))
# include a JS file
includeScript("file.js")
# alternative way to include js file
tags$head(tags$script(src="file.js"))
Using Themes
install.packages("shinythemes")
library(shinythemes)
ui <- fluidPage(theme = shinytheme("slate"),
# ...
)
A list of available themes can be found on the shinythemes website.
Adding Control Widgets to Front and Back End of a Shiny App
Adding interactivity is done in two steps:
- Add R objects to user interface (either in the single file or in ui.R)
- Tell shiny how to build the output objects (either in the single file or in server.R). The output object will be reactive if the code that builds it relies on an input widget value.
Variables in shiny are reactive. This means that when the value of a variable x changes, then anything that relies on x gets re-evaluated.
x <- 5
y <- x + 1
x <- 10
# in regular R: y stays 6 (it is not re-evaluated)
# in shiny: y becomes 11 because it is updated when x changes
A reactive expression is an R expression that uses widget input and returns a value. The reactive expression will update this value whenever the original widget changes. Reactive expressions are a bit smarter than regular R functions. They work like this:
- The first time it is run, its result is saved.
- The next time it is run, it checks whether the saved value is out of date (i.e., whether the widgets it depends on have changed)
- if the value is out of date, the reactive object will recalculate it (and then save the new result)
- if the value is up-to-date, the reactive expression will return the saved value without doing any computation
# there are two ways in Shiny to implement reactive programming
# render* - an output will automatically update whenever an input in its render* function changes
output$z <- renderText({
  input$a
})
# reactive expression - use reactive to create objects that will be used in multiple outputs
x <- reactive({ input$a })
output$y <- renderText({ x() })
output$z <- renderText({ x() })
If an input is used multiple times, the reactive expression approach is more efficient.
Input Widgets
There are numerous widgets available in the Shiny widget gallery.
# each widget requires several arguments
# the first two arguments are:
# name - user will not see this but it is used to identify the value
# label - user will see this, can be empty (e.g. "")
actionButton(inputId="action_1", label="Click Me")
Output Widgets
Will be updated automatically when an input widget changes.
# Render a Data Table
DT::renderDataTable(expr, options, callback, escape, env, quoted) # Server
dataTableOutput(outputId, icon, ...) # UI
# Render an Image
renderImage(expr, env, quoted, deleteFile) # Server
imageOutput(outputId, width, height, click, dblclick, hover, hoverDelay, hoverDelayType, brush, clickId, hoverId, inline) # UI
# Render Plot
renderPlot(expr, width, height, res, ..., env, quoted, func) # Server
plotOutput(outputId, width, height, click, dblclick, hover, hoverDelay, hoverDelayType, brush, clickId, hoverId, inline) # UI
# Render Print
renderPrint(expr, env, quoted, func, width) # Server
verbatimTextOutput(outputId) # UI
# Render Table
renderTable(expr, ..., env, quoted, func) # Server
tableOutput(outputId) # UI
# Render Text
renderText(expr, env, quoted, func) # Server
textOutput(outputId, container, inline) # UI
# Render UI
renderUI(expr, env, quoted, func) # Server
uiOutput(outputId, inline, container, ...) # UI
htmlOutput(outputId, inline, container, ...) # UI
Commands are re-run depending on their location
Where you place your R code in server.R will determine how many times the commands are run (or re-run).
In the beginning of the file outside of the server function:
- Shiny will only run this code once, which is all you need to set your server up to run the R expressions contained in the server function.
- Source scripts, load libraries, and read data sets
Within the server function but outside of any render*() function:
- Run once each time a user visits the app. -> user specific objects.
- These would be objects that you think each user will need their own personal copy of. (e.g. an object that records the user's session information)
Inside a render* function:
- Shiny will rerun all of the code in a render* chunk each time a user changes a widget mentioned in the chunk. This can be quite often.
- It should contain code that Shiny must rerun to build an object.
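A hedged sketch of the three locations in server.R (the data file and input names are illustrative):
library(shiny)
myData <- read.csv("transactions.csv") # outside the server function: run once, when the app starts
server <- function(input, output) {
  sessionStart <- Sys.time() # inside server, outside render*: run once per user session
  output$hist <- renderPlot({
    hist(myData[[input$column]]) # inside render*: re-run whenever input$column changes
  })
}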
Shiny Apps versus Shiny Widgets
Shiny apps represent the output of an analysis. They are designed to be used by end users. They are intended to be deployed on servers (like Shiny Server or ShinyApps.io) and are accessed via a web browser.
Shiny widgets are designed to be used in the course of the analysis and are intended to be used by R users. They are only ever invoked from code (in the R console or from an R script) or from within RStudio.
Day 3
What is Git?
Teamwork in coding projects is a struggle. New files might be accidentally overwritten by old files. Someone messed up but cannot revert to the last "working version". It is not clear who edited which files and when and which files contain the latest code. And if two people worked on changes they need to be merged by hand.
Solution: Git
"Git" is not actually an acronym but stands for "unpleasant person" and is a joke by Linus Torvalds, the creator of it. Git is a version control system that allows us to:
- track changes of files and projects (long term history)
- revert to any previous stage of a file
- work on one project with several people at the same time
- work on different versions of a project at the same time
- automatically merge changes
Terminology
- Repository: a folder that is under version control; each user has a local copy
- Pull: Download all changes of the hosted repo
- Push: Upload changes to the hosted repo
- Sync: Both push and pull together
- Commit: Snapshot of your project; any file can be reverted to it; contain comments to describe what the changes do
- Merge: brings forked history back together again; integrates different branches into a single one
- GitHub: Web-based graphical interface for Git that provides additional collaboration tools (issue section to report bugs, graphical contribution tracking)
Basic Git Workflow
- Pull updates to your local repo
- Revise (i.e. develop) the code in your local repo
- Commit your changes to your local repo
- Push all your commits to the hosted repository
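In shell commands, this workflow looks roughly like this (the file name and commit message are illustrative):
git pull                          # 1. pull updates from the hosted repo
# ... edit analysis.R ...         # 2. revise the code locally
git add analysis.R
git commit -m "improve RFM plot"  # 3. commit the changes locally
git push                          # 4. push all commits to the hosted repository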
Getting Started with Git
Steps to Contribute to a GitHub Project
- Add an (existing or new) project to Git on your local computer
- Commit to your local Git repository
- Create a new repository on GitHub
- Setup credentials to connect GitHub to your local computer
- Use Password
- Generate RSA key
- Push commits from local Git repository to GitHub
# the following commands have to be run in your console / terminal / shell
# if you're not within your project folder:
cd your-project # will switch to the folder "your-project" within the current folder
# 1. Add git and tell it to track the changes in files
git init # initialize a local git repository within the project folder
git add test.R # add the test.R file to the git repository, changes will then be tracked
git add . # (mind the space!) adds all files to the git repository
# 2. Committing to your local git repository
git commit -m "add menu bar" # creates a commit with all changes so far (in all files that are in the repository)
# 3. Create a new repository on GitHub (done in the browser, not in the shell)
# 4. Setup credentials to connect GitHub to your local computer
git config --global user.email "youremail@uzh.ch"
git config --global user.name "your-github-username"
# 5a. Connect GitHub repository in the cloud to local repository (only needs to be done once)
git remote add origin https://github.com/<your-username>/<your-repository>.git
# 5b. Push local Git repository to GitHub
git push -u origin master
Advanced Git Features
Forking
Forking is similar to branching, but it is done on code that you do not own (e.g. an open-source project) when you intend to make your own version of it. Forking allows people to experiment with changes to a project without affecting the original repository.
- Fork the repository
- Make a change (could be a fix or an addition)
- Submit a pull request to the owner of the repository
Forking creates a clone of the repository on GitHub (i.e. in the cloud). However, you cannot directly push your changes back to the original repository. In order to keep it in sync the original repository needs to be fetched regularly.
Branching creates another version of the code within the same repository. You can only do this if you have access as a collaborator to the existing repository (which will not be the case for most open source projects). Changes made to a branch can be pushed back to the original repository and the branch can be merged. (Usually there is one "main" or "master" branch that holds the current stable version - the version that is running without errors - and then there are multiple branches for features that are in development).
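A hedged shell sketch of the branch workflow (the branch name is illustrative):
git checkout -b feature-login    # create a new branch and switch to it
# ... commit changes on the branch ...
git push -u origin feature-login # push the branch to the hosted repository
git checkout master              # switch back to the stable branch
git merge feature-login          # merge the feature branch into master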
A pull request basically asks the author of the original code that you've forked to review the changes you've made and pull your changes into the original repository.
Why should you create a Package?
Code can be accessed in an easy way without having to worry about the organization of single files. Packages allow you to easily distribute your code either internally (e.g. in your team / company) or publicly.
A good package fulfills several requirements:
- Easy package installation
- The basic directory structure is intuitive
- Detailed and comprehensive documentation exists
The documentation should be easily accessible through R's help system (the man/ pages, e.g. via ?functionName). It should also provide examples of how to use the package.
What is actually a package?
A package is a collection of functions, data and compiled code in a well-defined format.
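A minimal sketch for creating a package skeleton with usethis and devtools (the package name is illustrative):
install.packages(c("usethis", "devtools"))
usethis::create_package("rfmtools") # sets up the basic directory structure (R/, DESCRIPTION, NAMESPACE)
devtools::document()                # generates the man/ pages from roxygen comments
devtools::install()                 # installs the package locally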
Keep Calm and Read the Manual
Providing a detailed documentation for a package will enable others to really use it and save time when doing so. Documentation that is added in-source (i.e. directly within the code) integrates easily with your code and makes it easy to change / generate the documentation.
Adding a README file
Adding a README file to your package distribution is useful when the package is shared on GitHub. It will be shown as a description of the package.
install.packages("usethis")
install.packages("rmarkdown")
library(usethis)
use_readme_rmd() # will create a README.Rmd file
rmarkdown::render("README.Rmd") # will knit it -> create the file
The goal of the README file is to answer the following questions about your package:
- Why should I use it?
- How do I use it? -> provide an example
- How do I get it?
Document using Roxygen2
#' calculateRFM
#'
# Description
#' This function calculates the RFM score for a given dataset
#'
#' @details
#' \code{data} contains the transaction data. The data set must contain
#' a column labelled 'Customer' that allows unique customer
#' identification, a column labelled "TransDate", indicating
#' the purchase date and a column labelled "PurchAmount"
#' specifying the total spending per purchase.
#'
# Arguments
#' @param data the data set
#' @param weight_recency a number specifying the weight of the recency
#' @param weight_frequency a number specifying the weight of the frequency
#' @param weight_monetary a number specifying the weight of the monetary
#'
# Returned values
#' @return a table with the following columns:
#' \code{Customer} the identification of the customer
#' \code{recency} the recency score for that customer
#' \code{frequency} the frequency score for that customer
#' \code{monetary} the monetary score for that customer
#' \code{finalscore} the final score with weights applied for that customer
#' \code{group} the group the user belongs to
#'
# Examples
#' @examples
#' calculateRFM(transactions, 20, 20, 60)
#' calculateRFM(transactions, 30, 40, 30)
#'
#' @export
calculateRFM <- function(data, weight_recency, weight_frequency, weight_monetary) {
# ...
}
Publish Your Package
There are three ways to distribute your package:
Build a Source or Binary Package
This creates a compressed file which can be shared with others. There are two options available in RStudio:
- Source Package: Package contains all R source code which can be extracted again. Can be installed on all platforms (Windows, MacOS, Linux)
- Binary Package: Compiled code that can be executed. The R source code cannot be extracted (might be helpful for security / copy-right issues). It can only be installed on the platform for which it is produced.
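Equivalently, both can be built from the shell (the package directory name is illustrative):
R CMD build rfmtools           # creates a source package (rfmtools_x.y.z.tar.gz)
R CMD INSTALL --build rfmtools # creates a binary package for the current platform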
Upload your Code on GitHub
Publishing a package on GitHub has a number of advantages:
- GitHub issue tracking: others have the possibility to report issues
- GitHub pull request: others may actively fix or enhance your code
- Easy installation directly from GitHub using the install_github() command from the devtools package.
The following steps are required to put your package on GitHub:
Create a new empty repository. Do not add a README yet.
Type the following in the shell; make sure you are in the root directory of the package you want to upload:
git init
git remote add origin https://github.com/<your-username>/<your-repository>.git
git add .
git commit -m "initial commit"
git push -u origin master
GitHub will automatically recognize the README file and display it in the repository. It can also be found through the search function on GitHub.
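Installing directly from GitHub then takes only two lines:
install.packages("devtools")
devtools::install_github("<your-username>/<your-repository>")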
Publish your Package using CRAN
CRAN stands for Comprehensive R Archive Network and hosts only R packages, whereas GitHub is used for other languages as well. It allows very easy installation for other users and reaches a broad audience. However, solid checks are required and the submission policies are very strict.
Debugging: Find the Mistakes
When developing code one will usually make mistakes. Or the users will use the code in ways that were not thought of while writing it. There are two ways to find and prevent bugs:
- Debugging: Finds the source of a programming flaw; Helps to understand the execution of the program
- Software Testing: Quality and correctness checks to avoid future problems; can be used to specify requirements; assesses the usability of program interfaces
Rule of thumb: Debugging consumes about 2/3 of the development time.
Debugging
Debugging is recommended when the return value (e.g. of a unit test) is incorrect and the error is not obvious. It uses tools that examine the control flow and values of variables in a program. Many programming environments support line-by-line debugging, where only one line of code is executed at a time.
Debugging Strategy
- Realize that you have a bug
- Reproduce / generate input values that cause the bug
- Isolate the flawed component with a binary search (split code in half and find out if bug is in first or second half of code)
- Fix it
- Confirm its successful resolution using the previous input (when using unit testing, create an automated test)
Debugging in R
- Output variables to the screen: print() or browser()
- Using built-in commands in R: e.g. traceback()
- Use the interactive debugger inside RStudio
- Use exception handling to prevent bugs in the future
# interrupts the execution of an expression and allows you to inspect the
# environment of where it was called from
browser()
# prints the call stack of the last uncaught error
# (i.e. the sequence of calls that led to the error)
# IMPORTANT: should be placed in console and not in code!
traceback()
# can be used to wrap an expression that might fail and allow the user's code
# to handle error-recovery (won't stop execution of program)
try("a" + 5)
# starts the debugger; normally used when the source code is not
# known to the user
debug(myFunction)
undebug(myFunction) # can be called to stop the debugging
debugonce(myFunction) # can be used to debug the function only once
While debugging we can simply type "c" to continue to the end of the current code. Traceback does not tell us where the error occurred in the function. To know which line causes the error, we will have to step through the function using debug().
Unit Testing: Cover all Bases
Software testing is a method to check whether a software product matches the expected requirements. It involves automatic and manual execution of tools to evaluate one or more properties of interest. The purpose is to identify errors, gaps or missing requirements. Apart from functionality we can also test factors such as usability or reliability. In the end we want a program that not only works, but also looks neat, meets performance expectations, and is both reliable and compatible with other programs.
Some aspects of testing can be static (e.g. proofreading, reviews, verification) other steps are dynamic (e.g. automated unit tests).
Unit Tests
Unit testing is a type of software testing where individual units or components of a software are tested. A unit test is a series of expectation functions, which collectively test one small unit of functionality. They should be used to ensure that the output of the code meets our expectations and to assure others that the code works correctly.
They help us to identify bugs in the current code today and also to make the current package functionality robust to future changes.
When creating an R package, unit tests can be included. These will be run automatically whenever the package is built and checked. To do so, a framework for the tests must be created.
Unit Testing Framework
# the following command will create the testing framework
# make sure to install the package first
usethis::use_testthat()
# it will create a directory called 'testthat' -> all tests must go there
# it will also create an R script called 'testthat.R' -> will run all the tests
Test Example
# located in test_correct_output.R
# provides a brief description of what the file is testing
context("correct output")
# a test groups together expectations that are related
test_that("one is added to number", {
expect_equal(addOne(4), 5) # expectations are the atom of testing
expect_equal(addOne(-3),-2) # they describe the expected result
})
# test_that ....
When this function is run, the test code will be evaluated in its own environment. An error will be produced if any of the test code produces an error.
Expectation Functions in testthat
a <- 3 + 2
expect_equal(a, 5) # compares the value of an object a to a given reference value
expect_equal(a, 6) # produces a failure if the value does not match
val <- 43.7
expect_is(val, "numeric") # compares the type of a variable to a reference value
expect_is(val, "character") # produces an error if the type does not match
vec <- c(1,7,6,4,9)
expect_length(vec, 5) # compares the length of a vector to the specified value
expect_length(vec, 6) # produces an error if the length does not match
Overview of all expectation functions: Testthat website
Run Tests
# make sure you installed the devtools package
devtools::test() # runs the tests
# creating new test files
usethis::use_test("new_test")
When a built package is checked using R CMD check, any tests included in the package will be run automatically, alongside all other usual package checks. If any tests fail, an error message will be produced.
Continuous Integration Services
Some functions cannot be executed for testing purposes, e.g.:
- Functions that access different systems: e.g. online authentication
- Persistent manipulations of databases
- Hardware controlling functions, e.g. a robot arm
- Execution of financial transactions
- Functions that depend on code that does not exist yet
Stubs and mocks can be used in these cases.
Stubs
The underlying operation is replaced by a stub for testing. Stubs can perform primitive operations but usually return only a value. Example: Test depends on a method that takes 5 minutes to complete. This is replaced by a stub which only returns a hard-coded value and thus takes up much less time.
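A hedged sketch of a stub in a testthat test (all names are illustrative):
library(testthat)
getExchangeRate <- function() { 1.08 } # stub standing in for a slow web request
convertToUSD <- function(amount) { amount * getExchangeRate() }
test_that("conversion uses the exchange rate", {
  expect_equal(convertToUSD(100), 108)
})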
Mocks
Mocks are replacements for full objects in Object-oriented programming (OOP). They check whether methods are called as expected. Example: You are testing a user creation class, so you want to check after calling createUser() that sendConfirmationMail() is called.
Code Coverage
Code coverage shows which lines of code are tested. It helps to identify non-tested code regions. Coverage is usually measured as a ratio, e.g. 60% of all lines, functions etc. are covered by tests. However, high coverage does not guarantee thorough testing (the tests themselves might be wrong). As a recommendation, focus especially on the boundaries (so-called "edge cases") of parameter ranges (0, NA, Inf, etc.) to identify unhandled problems.
The R package covr measures coverage, but only for full packages; the workaround for standalone code is to create a dummy package as a wrapper. When covr is loaded you can use the command report() to get a report of your code coverage:
install.packages("covr")
library(covr)
report()
Continuous Integration
In software engineering, continuous integration (CI) is the practice of regularly merging all changes to source code to a central repository and building and testing the source code after every change. This allows any bugs that are introduced to be identified quickly.
If Alice and Bob are working together on a project on GitHub and Bob pushes some new changes to the code, then continuous integration means, that before the code is actually merged, all the tests are automatically run to ensure that Bob's code didn't break anything that has worked before. It might also mean, that the system will throw an error if Bob forgot to write tests for his newly written code.
Travis CI (travis-ci.org) is a continuous integration service that can be used with GitHub code repositories. It is free to use for open source projects. There is alternative CI software for GitHub or alternative version control systems and hosting services (e.g. Jenkins).
Steps to enable Automatic Test Execution & Code Coverage Calculation for an R Package
- Upload package to GitHub
- Create an account on travis-ci.com
- Activate GitHub repository containing your package
- Add a .travis.yml (note the dot in the beginning of the file) to the root directory of your package. The use_travis() function in the usethis package will do this. The use_coverage() function will help you to set up the automatic calculation of the code coverage.
- Commit this new file to your Git repository.
Now whenever any commits are pushed to your online GitHub repository, Travis will automatically build and check your package. Travis will notify you if the check results in any error, including those generated by failed unit tests.
# .travis.yml (note the dot in the beginning of the file)
language: r
Day 4
Profiling: Measuring Performance
R is not the most efficient language (compared to e.g. C++ or JavaScript). However, beyond limitations of its design and implementation, a lot of R code is slow simply because it is poorly written (e.g. unnecessary loops, inefficient structures and solutions).
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered."
"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming".
Donald Knuth, inventor of TeX
How to proceed when profiling your code?
- Analyze your code using profiling: Where do you lose the most of the time?
- Check for alternative functions: Find more efficient functions and packages for your code to address bottlenecks.
A profiler examines your whole code to determine what parts of it are taking up the largest amount of time, either because they are slow or are called many times.
Profiling with Rprof()
Rprof("path_for_output/file_to_store_output")
# -------
# your code that you want to profile
# e.g. containing for loops, data reading, manipulation etc.
# -------
Rprof(NULL) # specifies the end of the profiling
summaryRprof("path_for_output/file_to_store_output") # generates a summary
The summaryRprof() command will output the following characteristics:
- by.self: time spent in function alone
- by.total: time spent in function and callees
- sample.interval: the sampling interval (defaults to 20 ms)
- sampling.time: total time of profiling run
Use profvis() to see how R spends its time
# 1. install package
install.packages("profvis")
library(profvis)
# 2. wrap code
profvis({
  # your code that you want to profile
  plot(PurchAmount ~ Cost, data = transactions)
  m <- lm(PurchAmount ~ Cost, data = transactions)
  abline(m, col = "red")
})
Benchmarking
Benchmarking is used to determine which function is faster. It is used to compare running times. Given two functions that basically compute the same thing, benchmarking helps to choose the faster one. In contrast to profiling, benchmarking focuses on a single operation where profiling focuses on analyzing the whole code (or chunks of it).
Using system.time
# the system.time() command takes a single R expression as its argument and
# evaluates the time spent on running it.
system.time(mydata <- read.csv("demographics.csv")) # takes 1.434s
system.time(mydata <- fread("demographics.csv")) # takes 0.074s
# system time outputs three measures:
# user time: time spent by CPU by the current process (R session)
# system time: the time spent by the operating system
# elapsed: the total time spent
Using microbenchmark()
Microbenchmark is a more accurate replacement for the system.time(expr) expression. It uses sub-millisecond accurate timing functions, and all evaluations of the expressions are done in C code to minimize any overhead.
# how to use microbenchmark()
install.packages("microbenchmark")
library(microbenchmark)
microbenchmark(expression, times = 100L, ...)
# the times argument specifies how often a function is run
# NOTE: 100 is not suitable for complex functions because it will run it 100 times
Predefined functions vs. own calculations
mean1 <- function(x) { mean(x) }
mean2 <- function(x) { sum(x) / length(x) }
x <- runif(100) # example input vector
microbenchmark(mean1(x), mean2(x))
# surprisingly the built in mean() function is much slower than the custom function
# this is due to it having more checks for missing values and other code executed
How to do it better
What to do after identifying bottlenecks?
Once you know what parts of your code are slowing you down and you have a method to test speed, you need tools and options for writing more efficient code. There are two ways to improve your code:
- Find better packages and functions.
- Simplify your code and use more efficient structures.
Find better Packages
There are thousands of packages on CRAN and on GitHub, but many of them are redundant and implement the same thing. The goal is to find the best one. When searching with Google, try rseek.org or creative queries such as "r package CRAN fast <something>". On stackoverflow.com, restrict your search by including the R tag [R].
Simplify your code and use more efficient structures
There are several ways of making your code faster (e.g. vectorization, pre-allocate data structures, use more efficient structures). But remember:
"Premature optimization is the root of all evil" - Donald Knuth
Don't start rewriting your code unless the time saved is going to be worth the time invested. Other approaches, like finding a bigger machine or parallelization, could give you more bang for the buck in terms of programming time.
Vectorization
Vectorizing is about taking a "whole object" approach to a problem, thinking about vectors, not scalars. There are two key attributes of a vectorized function:
- It makes many problems simpler. Instead of having to think about the components of a vector, you only think about the entire vector.
- The loops in a vectorized function are written in C instead of R. Loops in C are much faster because they have much less overhead.
rowSums()
colSums()
rowMeans()
colMeans()
# examples:
colSums(transactions[, .(PurchAmount, Cost)]) # faster
apply(transactions[,.(PurchAmount, Cost)],2,sum) # slower
Prepare your output variables
Initialize your data structures and output variables to the required length and data type before using them in a loop for computations.
cities <- c("New York", "Paris", "London", "Tokyo", "Cape Town")
output <- vector("list", length(cities) # pre-define the object you want to get
names(output) <- cities # names() can be used to change the names of list objects
# R does not have to resize the objects but simply fill the values for each city
for(i in 1:length(cities)) {
  output[[i]] <- nchar(cities[i])
}
Use more efficient structures
In R some objects are flexible and can store multiple types of data at once. However, simpler data structures that only store one type of data can be manipulated much faster.
myMatrix <- matrix(rnorm(1e+06), 10000, 100)
myDF <- as.data.frame(myMatrix)
microbenchmark(apply(myMatrix, 2, sd), apply(myDF, 2, sd)) # the data.frame is slower
Sidenote: Housekeeping
Free up the memory when you don't use it anymore.
- Remove variables and flush memory as early as possible: use rm() to remove objects that are no longer needed, especially before going into lengthy loop operations.
- Use the gc() function to trigger garbage collection and free the memory of variables you no longer need.
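A small sketch (the object name is illustrative):
bigObject <- rnorm(1e7) # roughly 80 MB of doubles
# ... compute with bigObject ...
rm(bigObject) # remove the object from the workspace
gc()          # run the garbage collector to release the memory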
Just-in-time Compilation
R is an interpreted language (code is translated to machine instructions at run time). This can produce (slow) overhead if some code parts are executed many times (e.g. loops, custom functions that are used multiple times).
The idea behind Just-in-time compilation is that the code should only be compiled to machine-code once. The JIT compiler caches the machine code so it can be re-used. Note: JIT compilation is only effective if your code is executed many times.
Using JIT compilation: the compiler package
# no installation is necessary since the package is part of R base
library(compiler)
# your own function
slowFunction <- function() {
  # some code
}
# compiles the function
compiledFunction <- cmpfun(slowFunction)
# example
myFunction <- function() { for(i in 1:10000000) { 1*(1+1) } }
myCompiledFunction <- cmpfun(myFunction)
system.time(myFunction()) # 3.134 elapsed time
system.time(myCompiledFunction()) # 0.142 elapsed time (~22x faster!)
JIT compilation is not always a good idea
- Do not compile functions that are called only once
- Enable JIT before loops with many repetitions and disable it afterwards again
- Check whether you are actually speeding up your code
- Compiled R code is still slower than a C implementation (e.g. using Rcpp)
- Nested functions are not automatically compiled.
slowFunction <- function() { otherFunction() } # calls another function
compiledFunction <- cmpfun(slowFunction)
# no speed gain for otherFunction()
# solution:
enableJIT(3)
# 0: off
# 1: compile closures when they are called the first time
# 2: level 1 plus compile closures before duplicating
# 3: level 2 plus compile all top-level loops before they are executed
Implicit Parallelism
Implicit parallelism means the runtime decides automatically which parts of a program to run in parallel. When dealing with numeric calculations in R, one option for implicit parallelism is an optimized BLAS.
Basic Linear Algebra Subprograms (BLAS)
The standard version of BLAS delivers correct results but is not really optimized for large matrices. BLAS can be optimized to specific CPUs (e.g. Intel Math Kernel Library). Optimized BLAS take advantage of multi-core systems (multiple CPUs).
An optimized BLAS can be installed to accelerate matrix operations in R.
Link R to optimized BLAS
# terminal in /Applications/Utilities
cd /Library/Frameworks/R.framework/Versions/4.0/Resources/lib # use your version of R
mv libRblas.dylib libRblas.dylib.bup # back up the existing library
ln -s /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
Explicit Parallelism
By default R only uses one of your CPU cores. Repeated computations (e.g. loops) could benefit significantly from being distributed over multiple cores.
doParallel to use multiple cores
Compared to some alternatives, doParallel works on macOS, Windows and Linux.
install.packages("doParallel")
library(doParallel)
Tell R how many cores you want to use
# 1. detecting the maximum number of cores
detectCores() # e.g. 4
# 2. Register a cluster with the desired number of cores
cl <- makeCluster(3) # usually you use the total number of cores -1 to leave some resources for the system
registerDoParallel(cl)
Execute computations in parallel: Basic steps
- Split problem into pieces: find parts of the code that are not dependent on each other. e.g. for loops, *apply() commands
- Execute in parallel
- Collect and combine results
Execute in parallel using foreach
library(foreach)
# foreach with %dopar% executes each iteration on a parallel worker
# (parLapply() offers an apply-style alternative to the for-loop)
result <- foreach(i = 1:3) %dopar% {
  # packages used within the loop must be listed in the .packages argument,
  # e.g. .packages = c("package1", "package2")
  # code to be executed
}
# by default the results are stored as a list. if you want to change the format of
# the returned object, e.g. sum up all results, use the .combine argument
# e.g. .combine = "+"
Pitfalls of parallelizing
- Communication overhead when setting up a cluster: do not use for simple stuff
- Error handling: by default all loops stop (and you lose all results) if an error occurs in just one task
- Shut down the cluster when you are done to avoid problems for next tasks: stopCluster(cl)
Launching a VM on GCP
Your own laptop/server is rarely powerful enough to estimate an ML model. Besides, estimating an ML model will block the processors of your laptop/server for quite some time. The cloud is cheap, flexible and massively scalable. It offers as much computing power as you need.
The usual Steps for Setting Up Cloud Computing
- Find the cloud offering that best matches your needs and create an account with that cloud provider.
- Create a Virtual Machine instance
- Configure the Virtual Machine
- Upload and run your code
What is Google Cloud Platform?
The Google Cloud Platform (GCP) is a suite of cloud products. It is Google's equivalent of Amazon's AWS or Microsoft's Azure.
Compute Engine is one of the products available in GCP. It provides scalable computing capacity and lets users quickly deploy virtual machines (called instances) in the cloud.
Configuring the VM
Every cloud provider has their own set of configuration steps. Usually, these can be found on the cloud provider's website. In our case, we need to:
- Change the firewall settings: this allows us to access RStudio Server, which uses port 8787