Reproducible Data Analysis


  • Use RMarkdown to enforce reproducible analysis

Reading data from file


  • The readr version of read_csv() is preferred
  • Remember that csv is not always actually separated with commas.
  • The haven package contains functions for reading common proprietary file formats.
  • In general a package will exist for reading strange datatypes. Google is your friend!
  • Use code to read in your data
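
As a minimal sketch of the points above, assuming the readr package is installed: the file below is made up for illustration and uses semicolons as separators and commas as decimal marks, a common variant of "csv".

```r
library(readr)

# Write a small, made-up semicolon-separated file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("name;height", "Anna;1,72", "Bo;1,80"), path)

# read_csv() assumes commas as separators; read_csv2() assumes
# ';' as separator and ',' as decimal mark
people <- read_csv2(path, show_col_types = FALSE)
people
```

Because the file is read with code, the import step can be rerun and inspected by anyone who has the script.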

Descriptive Statistics


  • We have access to many descriptive indicators that summarise the location, spread and shape of our data.
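
A short base-R sketch of such indicators, using a made-up vector. The skewness line is a hand-written moment-based formula, included only as an illustration; it is not a built-in R function.

```r
x <- c(2, 4, 4, 5, 7, 9, 21)  # made-up example data

# Location
mean_x   <- mean(x)
median_x <- median(x)

# Spread
sd_x  <- sd(x)
iqr_x <- IQR(x)

# Shape: moment-based skewness, m3 / m2^(3/2)
m2 <- mean((x - mean_x)^2)
m3 <- mean((x - mean_x)^3)
skew_x <- m3 / m2^1.5

c(mean = mean_x, median = median_x, sd = sd_x, IQR = iqr_x, skewness = skew_x)
```

The single large value pulls the mean above the median, which shows up as positive skewness.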

Histograms


  • Histograms are used for visualising the distribution of data
  • Many different rules exist for choosing the number of bins
  • Binwidth and number of bins are equivalent: fixing one determines the other
  • Choose the number of bins that best supports your data story, without hiding inconvenient truths about your data.
  • Never use unequal binwidths
  • Consider using natural breaks as an alternative
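
The link between binwidth and number of bins can be sketched in base R. The built-in `faithful` dataset is used here only as an example; the lesson itself may use other data or ggplot2.

```r
x <- faithful$eruptions  # built-in dataset, eruption durations in minutes

# Choosing a binwidth of 0.5 over the interval 1.5 to 5.5 ...
breaks <- seq(1.5, 5.5, by = 0.5)
h <- hist(x, breaks = breaks, main = "Eruption duration", xlab = "Minutes")

# ... is the same as asking for (5.5 - 1.5) / 0.5 = 8 equally wide bins
n_bins <- length(h$counts)
n_bins
```

Passing an explicit vector of breaks makes the binning choice visible in the code, so it is easy to rerun with a different width and compare.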

Table One


  • A Table One provides a compact description of the data we are working with
  • With a little bit of work we can control the content of the table.

Tidy Data


  • Tidy data provides a consistent way of organizing data
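
A minimal sketch of what "tidy" means in practice, assuming the tidyr package is installed. The table and its column names are made up for illustration.

```r
library(tidyr)

# A made-up "wide" table: the year is hidden in the column names
wide <- data.frame(
  country = c("Denmark", "Norway"),
  `2020` = c(10, 20),
  `2021` = c(15, 25),
  check.names = FALSE
)

# Tidy version: one row per observation, one column per variable
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "cases")
long
```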

The normal distribution


  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally

Testing for normality


  • Begin with a visual inspection to assess whether the data is normally distributed
  • Use a statistical test to support your conclusion
  • Do not fret too much about non-normality. It is quite normal.
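
The two steps above can be sketched in base R. The data is simulated, and the Shapiro-Wilk test is only one of several possible tests; the lesson may use another.

```r
set.seed(42)                          # make the simulated example reproducible
x <- rnorm(100, mean = 50, sd = 10)   # simulated data

# Step 1: visual inspection with a normal Q-Q plot
qqnorm(x)
qqline(x)

# Step 2: a statistical test; a small p-value is evidence against normality
result <- shapiro.test(x)
result$p.value
```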

How is the data distributed?


  • The data generating function is not necessarily the same as the distribution that best fits the data
  • Choose the distribution that best describes your data - not the one that fits best

Linear regression


  • Linear regression shows the (linear) relationship between variables.
  • The assumption of normality is on the residuals, not the data!
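
A small base-R sketch of checking the residuals rather than the raw data. The built-in `cars` dataset is used here only as an example.

```r
# Fit stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)

# The normality assumption concerns the residuals of the model
res <- residuals(fit)
qqnorm(res)
qqline(res)
```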

Multiple Linear Regression


  • We can do linear regression on multiple independent variables
  • Be careful not to overfit - only retain variables that are significant (and sensible)
  • We can fit just as well on categorical variables - but make sure they are categorical
  • Interpreting linear models with multiple variables is not trivial
  • Interpreting linear models with interaction terms is even less trivial
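
Why it matters that categorical variables really are categorical can be sketched with the built-in `mtcars` dataset, chosen here only as an example. The number of cylinders is stored as a number, so `lm()` treats it as numeric unless told otherwise.

```r
# cyl treated as a number: a single slope for "one more cylinder"
fit_numeric <- lm(mpg ~ wt + cyl, data = mtcars)

# cyl treated as a category: one coefficient per level, relative to 4 cylinders
fit_factor <- lm(mpg ~ wt + factor(cyl), data = mtcars)

n_numeric <- length(coef(fit_numeric))
n_factor  <- length(coef(fit_factor))
c(numeric = n_numeric, factor = n_factor)
```

The two models answer different questions, which is part of why interpretation takes care.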

Logistic regression


  • Using the predict() function is the easier way to predict results
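
A minimal sketch comparing `predict()` with the manual calculation, using the built-in `mtcars` dataset as an example. The model (transmission type as a function of weight) is chosen only for illustration.

```r
# Logistic regression: probability of manual transmission (am = 1) given weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

new_cars <- data.frame(wt = c(2.5, 3.5))  # hypothetical cars to predict for

# The easy way: type = "response" returns probabilities directly
p <- predict(fit, newdata = new_cars, type = "response")

# The manual way: linear predictor pushed through the inverse logit
p_manual <- plogis(coef(fit)[1] + coef(fit)[2] * new_cars$wt)
```

Without `type = "response"`, `predict()` returns values on the log-odds scale.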

Central Limit Theorem


  • The mean of a sample can be treated as if it is normally distributed
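
A simulation sketch of this point, under assumed settings (an exponential distribution, a sample size of 50, 2000 repetitions). The individual values are clearly skewed, yet the sample means cluster roughly symmetrically around the true mean.

```r
set.seed(1)  # make the simulation reproducible

n_samples   <- 2000
sample_size <- 50

# Draw many samples from a skewed distribution and keep each sample mean
means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

mean(means)  # close to the true mean, 1
sd(means)    # close to 1 / sqrt(sample_size)

hist(means, main = "Distribution of sample means", xlab = "Sample mean")
```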

Nicer barcharts


  • Relatively small changes to a bar chart can make it look much more professional

Power Calculations


  • Use .md files for episodes when you want static content

k-means


  • kmeans is an unsupervised technique that can find hidden structure in our data that we do not know about
  • kmeans will find the number of clusters we ask for, even if there is no structure in the data at all
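
The second point can be sketched in base R with pure noise. The data below is uniformly random, so it has no cluster structure by construction.

```r
set.seed(123)  # make the example reproducible

# 100 points scattered uniformly at random in the unit square
noise <- matrix(runif(200), ncol = 2)

# Ask for three clusters; nstart = 10 tries several random starts
km <- kmeans(noise, centers = 3, nstart = 10)

km$size     # three clusters are reported anyway
km$centers
```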

ANOVA


  • Use .md files for episodes when you want static content

Cohen's Kappa


  • With two raters, Cohen's \(\kappa\) can be used as a measure of interrater agreement on categorical, nominal classes.
  • Other methods exist for ordinal classes, and more than two raters.
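
A hand-calculated sketch in base R, using the standard definition \(\kappa = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed agreement and \(p_e\) the agreement expected by chance. The ratings are made up, and the lesson may use a package function instead.

```r
# Made-up ratings from two raters on ten items
rater1 <- c("a", "a", "b", "b", "a", "b", "a", "b", "a", "a")
rater2 <- c("a", "b", "b", "b", "a", "b", "a", "a", "a", "a")

# Shared levels keep the cross table square
lv  <- sort(unique(c(rater1, rater2)))
tab <- table(factor(rater1, levels = lv), factor(rater2, levels = lv))
n   <- sum(tab)

p_o <- sum(diag(tab)) / n                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance

kappa <- (p_o - p_e) / (1 - p_e)
kappa
```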

R on Ucloud


  • Ucloud provides access to a multitude of advanced software that we are not able to run locally
  • Running RStudio on Ucloud gives us access to more compute and memory than we have on our own computer
  • More advanced work might require a quite complicated setup
  • Restarting a virtual machine means all our work and setup might be lost, unless we take certain precautions.

A deeper dive into pipes


  • The native base-R pipe, |>, is faster than the magrittr pipe, %>%
  • The magrittr package provides a selection of other useful pipes
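
A short sketch of both kinds of pipe, assuming R 4.1 or later (needed for `|>`) and an installed magrittr package. The exposition pipe `%$%` is shown as one example of the other pipes.

```r
library(magrittr)

# Native pipe: the left-hand side becomes the first argument of the call
total <- c(1, 4, 9) |> sqrt() |> sum()

# magrittr pipe: same result for this simple chain
total_magrittr <- c(1, 4, 9) %>% sqrt() %>% sum()

# Exposition pipe: makes the columns of a data frame available by name
r <- mtcars %$% cor(mpg, wt)
```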

Setup for GIS


  • Working with geospatial data requires additional software

Setup for Git


  • RStudio, Git and a GitHub account are needed for this lesson

Practice makes perfect


  • Many materials for learning Git and GitHub exist
  • Try to use Git and GitHub in your daily work

Statistical tests


  • Use an appropriate statistical test
  • Make sure you understand the assumptions underlying the test
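
How assumptions change the test can be sketched with a two-sample t-test in base R. The built-in `mtcars` dataset and the comparison of fuel economy by transmission type are chosen only as an example.

```r
# Welch's t-test (the default in R): does not assume equal variances
welch <- t.test(mpg ~ am, data = mtcars)

# Student's t-test: assumes the two groups have equal variances
student <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The assumption shows up in the degrees of freedom and the p-value
c(welch = unname(welch$parameter), student = unname(student$parameter))
c(welch = welch$p.value, student = student$p.value)
```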

When install.packages fails


  • Use .md files for episodes when you want static content

Fences on our teaching pages


  • Use fences to highlight or hide content on the pages.

Make a new course


  • It is fairly easy to make a new course page
  • Have patience!