R-toolbox: Key Points

Pre-Alpha

Reproducible Data Analysis

Use RMarkdown to enforce reproducible analysis

Reading data from file

The readr function read_csv() is preferred over base R's read.csv()
Remember that CSV files are not always actually separated by commas.
The haven package contains functions for reading common proprietary file formats.
In general, a package will exist for reading unusual file formats. Google is your friend!
Use code to read in your data
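
The points above can be sketched as follows. This is a minimal illustration, assuming the readr package is installed; the file and its contents are made up for the example.

```r
library(readr)

# Hypothetical example file: a ".csv" that is really semicolon-separated.
path <- tempfile(fileext = ".csv")
writeLines(c("name;height_cm", "Anna;165", "Bo;180"), path)

# read_csv() assumes commas, so everything lands in one column.
wrong <- read_csv(path)
ncol(wrong)

# read_csv2() expects semicolons as separators (and commas as decimal marks).
right <- read_csv2(path)
ncol(right)

# read_delim() lets us state the delimiter explicitly.
also_right <- read_delim(path, delim = ";")
```

Because the file is read by code rather than by clicking through an import dialog, the step is recorded and can be rerun by anyone.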

Descriptive Statistics

We have access to many descriptive statistics that summarise the location, spread and shape of our data.
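
As a small sketch of what such indicators look like in base R: the data below are made up, and the skewness is computed by hand from its moment definition rather than with any particular package.

```r
# Illustrative data: a small, right-skewed sample (values are made up).
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 9, 21)

# Location
mean(x)
median(x)

# Spread
sd(x)
IQR(x)
range(x)

# Shape: moment-based skewness, computed by hand.
# A positive value indicates a longer right tail.
skewness <- mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
skewness
```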

Histograms

Histograms are used for visualising the distribution of data
Many different rules exist for choosing the number of bins
Bin width and number of bins are equivalent: fixing one determines the other
Choose the number of bins that best supports your data story, without hiding inconvenient truths about your data.
Never use unequal bin widths
Consider using natural breaks as an alternative
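
The equivalence between bin width and number of bins can be sketched in base R. The data are simulated and the bin width of 0.5 is an arbitrary choice for illustration.

```r
set.seed(1)
x <- rnorm(200)

# Fix the bin width and build equally wide bins that cover the data.
binwidth <- 0.5
breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)

# Passing a vector of breaks gives full control over the bins.
# (A single number, e.g. breaks = 10, is only treated as a suggestion.)
h <- hist(x, breaks = breaks, main = "Equal-width bins")

# Choosing the bin width fixed the number of bins.
n_bins <- length(h$counts)
n_bins
```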

Table One

A Table One provides a compact description of the data we are working with
With a little bit of work we can control the content of the table.

Tidy Data

Tidy data provides a consistent way of organizing data

The normal distribution

Use .md files for episodes when you want static content
Use .Rmd files for episodes when you need to generate output
Run sandpaper::check_lesson() to identify any issues with your lesson
Run sandpaper::build_lesson() to preview your lesson locally

Testing for normality

Begin with a visual inspection to assess whether the data are normally distributed
Use a statistical test to support your conclusion
Do not fret too much about non-normality. It is quite normal.
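
A minimal sketch of this two-step approach, using simulated data and the Shapiro-Wilk test as one possible choice of statistical test:

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)  # simulated, illustrative data

# Step 1: visual inspection. Points close to the line suggest normality.
qqnorm(x)
qqline(x)

# Step 2: a statistical test to support the visual impression.
# The null hypothesis of the Shapiro-Wilk test is that the data are normal.
res <- shapiro.test(x)
res$p.value
```

A small p-value is evidence against normality; a large one does not prove that the data are normal.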

How is the data distributed?

The data generating function is not necessarily the same as the distribution that best fits the data
Choose the distribution that best describes your data - not the one that fits best

Linear regression

Linear regression shows the (linear) relationship between variables.
The assumption of normality is on the residuals, not the data!
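
As a sketch, using the built-in mtcars data (chosen here purely for illustration), the normality check is applied to the residuals of the fitted model rather than to the raw variables:

```r
# Fit a simple linear model: fuel consumption as a function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Check normality of the residuals, not of mpg or wt themselves.
res <- residuals(fit)
qqnorm(res)
qqline(res)
shapiro.test(res)
```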

Multiple Linear Regression

We can do linear regression on multiple independent variables
Be careful not to overfit - only retain variables that are significant (and sensible)
We can fit just as well on categorical variables - but make sure R treats them as categorical
Interpreting linear models with multiple variables is not trivial
Interpreting linear models with interaction terms is even less trivial
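
The point about categorical variables can be sketched with the built-in mtcars data (an illustrative choice). The number of cylinders is stored as a number, so it has to be converted to a factor before R treats it as categorical:

```r
# cyl treated as a number: one slope for "each additional cylinder".
fit_num <- lm(mpg ~ wt + cyl, data = mtcars)
names(coef(fit_num))

# cyl treated as categorical: one coefficient per non-reference level.
fit_cat <- lm(mpg ~ wt + factor(cyl), data = mtcars)
names(coef(fit_cat))

# summary() reports p-values that can guide which variables to retain.
summary(fit_cat)
```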

Logistic regression

Using the predict() function is the easier way to obtain predicted results
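
A minimal sketch, again assuming the built-in mtcars data for illustration, comparing predict() with the equivalent manual calculation:

```r
# Model the probability of a manual gearbox (am = 1) from car weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Hypothetical new observations to predict for.
new_cars <- data.frame(wt = c(2.5, 3.5))

# The easy way: type = "response" returns probabilities directly.
p <- predict(fit, newdata = new_cars, type = "response")

# The manual way: linear predictor, then the inverse logit.
eta <- coef(fit)[["(Intercept)"]] + coef(fit)[["wt"]] * new_cars$wt
p_manual <- plogis(eta)
```

Both routes give the same probabilities; predict() simply hides the bookkeeping.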

Central Limit Theorem

With a sufficiently large sample, the mean of the sample can be treated as if it is normally distributed
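
This can be illustrated with a small simulation. The exponential distribution and the sample size of 50 are arbitrary choices for the sketch:

```r
set.seed(1)

# Draw many samples from a clearly non-normal (exponential) distribution
# and keep the mean of each sample.
n <- 50
means <- replicate(5000, mean(rexp(n, rate = 1)))

# The sample means cluster around the true mean (1), with a spread
# close to sd / sqrt(n) = 1 / sqrt(50).
mean(means)
sd(means)

# Their distribution looks approximately normal.
hist(means, breaks = 40, main = "Means of exponential samples")
```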

Nicer barcharts

Relatively small changes to a bar chart can make it look much more professional

Power Calculations

k-means

k-means is an unsupervised technique that looks for hidden structure in our data that we do not know about
k-means will find the number of clusters we ask for, even if there is no structure in the data at all
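
The second point can be demonstrated with a sketch on pure noise. The data are uniformly random, so there are no real clusters to find, and the choice of three centers is arbitrary:

```r
set.seed(123)

# 200 points of uniform noise in two dimensions: no structure at all.
noise <- matrix(runif(400), ncol = 2)

# Ask for three clusters anyway.
km <- kmeans(noise, centers = 3, nstart = 10)

# kmeans() dutifully returns exactly three clusters.
table(km$cluster)
plot(noise, col = km$cluster, pch = 19, main = "Clusters found in noise")
```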

ANOVA

Cohen's Kappa

With two raters, Cohen's \(\kappa\) can be used as a measure of interrater agreement on categorical, nominal classes.
Other methods exist for ordinal classes and for more than two raters.
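
As a sketch of the calculation for two raters, \(\kappa = (p_o - p_e)/(1 - p_e)\), where \(p_o\) is the observed agreement and \(p_e\) the agreement expected by chance. The ratings below are made up, and the value is computed by hand in base R rather than with a dedicated package:

```r
# Made-up ratings from two raters on the same ten items.
lv <- c("no", "yes")
rater1 <- factor(c("yes", "yes", "no", "no", "yes",
                   "no", "yes", "no", "yes", "yes"), levels = lv)
rater2 <- factor(c("yes", "no", "no", "no", "yes",
                   "no", "yes", "yes", "yes", "yes"), levels = lv)

# Confusion table of the two raters.
tab <- table(rater1, rater2)
n <- sum(tab)

# Observed agreement: share of items on the diagonal.
p_o <- sum(diag(tab)) / n

# Agreement expected by chance, from the marginal totals.
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2

kappa_hat <- (p_o - p_e) / (1 - p_e)
kappa_hat
```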

R on Ucloud

Ucloud provides access to a multitude of advanced software that we are not able to run locally
Running RStudio on Ucloud gives us access to more compute and memory than we have on our own computer
More advanced work might require a quite complicated setup
Restarting a virtual machine means all our work and setup might be lost, unless we take certain precautions.

A deeper dive into pipes

The native base-R pipe, |>, is faster than the magrittr pipe, %>%
The magrittr package provides a selection of other useful pipes
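
A short sketch of the native pipe, which requires R 4.1 or later. It also hints at why the native pipe carries so little overhead: the parser rewrites it into an ordinary nested call before anything is evaluated, whereas %>% is a function that has to be called.

```r
# Native pipe: the left-hand side becomes the first argument.
total <- c(1, 4, 9) |> sqrt() |> sum()
total  # 6

# The parser turns the piped expression into a plain function call.
piped <- quote(x |> sqrt())
piped
identical(piped, quote(sqrt(x)))  # TRUE
```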

Setup for GIS

Working with geospatial data requires additional software

Setup for Git

RStudio, Git and a GitHub account are needed for this lesson

Practice makes perfect

Plenty of materials for learning Git and GitHub exist
Try to use Git and GitHub in your daily work

Statistical tests

Use an appropriate statistical test
Make sure you understand the assumptions underlying the test

When install.packages fails

Fences on our teaching pages

Use fences to highlight or hide content on the pages.

Make a new course

It is fairly easy to make a new course page
Have patience!