R-toolbox
- Use RMarkdown to enforce reproducible analysis
- The `readr` version of `read_csv()` is preferred
- Remember that CSV files are not always actually separated by commas.
- The `haven` package contains functions for reading common
proprietary file formats.
- In general, a package will exist for reading unusual data formats.
Google is your friend!
- Use code to read in your data
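
The points above about delimiters can be illustrated with a small sketch. It assumes the `readr` package is installed; the temporary file and its contents are made up for illustration. It shows a "csv" file that uses semicolons as separators and commas as decimal marks.

```r
library(readr)

# Write a small semicolon-separated file with decimal commas to a temporary path
# (hypothetical example data)
path <- tempfile(fileext = ".csv")
writeLines(c("id;weight", "1;70,5", "2;82,1"), path)

# read_csv2() assumes ';' as delimiter and ',' as decimal mark
dat <- read_csv2(path, show_col_types = FALSE)

# read_delim() lets us state the delimiter and decimal mark explicitly
dat_explicit <- read_delim(
  path,
  delim = ";",
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)

dat
```

Reading the same file with `read_csv()` would give a single column, because no commas separate the fields.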
- We have access to many descriptive statistics that summarise
the location, spread and shape of our data.
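
A minimal sketch of such indicators, using only base R and the built-in `faithful` dataset (chosen here purely for illustration). Base R has no built-in skewness or kurtosis function, so they are computed by hand from the standardised moments.

```r
x <- faithful$eruptions

# Location: where the data are centred
location <- c(mean = mean(x), median = median(x))

# Spread: how much the data vary
spread <- c(sd = sd(x), iqr = IQR(x), range = diff(range(x)))

# Shape: moment-based skewness and excess kurtosis, computed by hand
z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
shape <- c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)

location
spread
shape
```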
- Histograms are used for visualising the distribution of data
- Many different rules exist for choosing the number of bins
- Bin width and number of bins are equivalent: for a given data range,
choosing one fixes the other
- Choose the number of bins that best supports your data story, without
hiding inconvenient truths about your data.
- Never use unequal binwidths
- Consider using natural breaks as an alternative
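
As a sketch of the points on bins, the following base R code compares three of the common rules and shows how a chosen bin width fixes the number of bins. The `faithful` dataset and the bin width of 5 are arbitrary choices made for illustration.

```r
x <- faithful$waiting

# Three common rules for suggesting a number of bins
suggested <- c(
  sturges = nclass.Sturges(x),
  freedman_diaconis = nclass.FD(x),
  scott = nclass.scott(x)
)
suggested

# Choosing a bin width fixes the number of bins for a given range
binwidth <- 5
breaks <- seq(
  floor(min(x) / binwidth) * binwidth,
  ceiling(max(x) / binwidth) * binwidth,
  by = binwidth
)
h <- hist(x, breaks = breaks, main = "Waiting time", xlab = "Minutes")

# Number of bins implied by the chosen bin width
length(h$counts)
```

Passing explicit break points is used here because a single number given to `breaks` is only treated as a suggestion by `hist()`.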
- A Table One provides a compact description of the data we are
working with
- With a little bit of work we can control the content of the
table.
- Tidy data provides a consistent way of organizing data
- Use `.md` files for episodes when you want static content
- Use `.Rmd` files for episodes when you need to generate output
- Run `sandpaper::check_lesson()` to identify any issues with your lesson
- Run `sandpaper::build_lesson()` to preview your lesson locally
- Begin with a visual inspection to assess whether the data are normally
distributed
- Use a statistical test to support your conclusion
- Do not fret too much about non-normality. It is quite normal.
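
One possible workflow for the normality points above, sketched in base R. The data are simulated, so the parameters are assumptions made for illustration. The Shapiro-Wilk test is used here as one example of a statistical test; it is not necessarily the one the lesson uses.

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)  # simulated data for illustration

# Visual inspection first: histogram and normal Q-Q plot
hist(x, main = "Histogram of x")
qqnorm(x)
qqline(x)

# Then a statistical test to support the visual impression
# (null hypothesis: the data come from a normal distribution)
sw <- shapiro.test(x)
sw$p.value
```

A small p-value is evidence against normality; a large one does not prove that the data are normal.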
- The data generating function is not necessarily the same as the
distribution that best fits the data
- Choose the distribution that best describes your data, not the one
that fits best
- Linear regression shows the (linear) relationship between
variables.
- The assumption of normality is on the residuals, not the data!
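
A minimal sketch of a simple linear regression where the normality check is applied to the residuals rather than to the raw variables. The built-in `cars` dataset is used purely as an example.

```r
# Fit stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)
summary(fit)

# Check the normality assumption on the residuals, not on dist or speed
res <- residuals(fit)
qqnorm(res)
qqline(res)
res_test <- shapiro.test(res)
res_test$p.value
```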
- We can do linear regression on multiple independent variables
- Be careful not to overfit - only retain variables that are
significant (and sensible)
- We can fit just as well on categorical variables - but make sure
they are categorical
- Interpreting linear models with multiple variables is not trivial
- Interpreting linear models with interaction terms is even less
trivial
- Using the `predict()` function to predict results is the easier way
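
The points on multiple regression can be sketched as follows, using the built-in `mtcars` dataset. The choice of predictors and the values for the new car are assumptions made for illustration. Note that `cyl` is stored as a number and must be converted to a factor before it is treated as categorical.

```r
# Make sure the categorical variable really is categorical
dat <- transform(mtcars, cyl = factor(cyl))

# Multiple linear regression with two numeric and one categorical predictor
fit <- lm(mpg ~ wt + hp + cyl, data = dat)
summary(fit)

# Rather than combining coefficients by hand, let predict() do the work
new_car <- data.frame(
  wt = 3,
  hp = 110,
  cyl = factor("6", levels = levels(dat$cyl))
)
pred <- predict(fit, newdata = new_car, interval = "confidence")
pred
```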
- The mean of a sample can be treated as if it is normally
distributed
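
A small simulation can make this plausible. The sketch below draws many samples from a clearly non-normal (exponential) distribution and looks at the distribution of the sample means. The sample size of 30 and the 2000 repetitions are arbitrary choices.

```r
set.seed(1)
n <- 30

# Means of 2000 samples, each of size n, from an exponential distribution
sample_means <- replicate(2000, mean(rexp(n, rate = 1)))

# The sample means look approximately normal, although the data do not
hist(sample_means, main = "Distribution of sample means")
qqnorm(sample_means)
qqline(sample_means)

# Compare with theory: mean 1 and standard deviation 1 / sqrt(n)
c(
  mean = mean(sample_means),
  sd = sd(sample_means),
  theoretical_sd = 1 / sqrt(n)
)
```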
- Relatively small changes to a bar chart can make it look much more
professional
- kmeans is an unsupervised technique that will find hidden
structure in our data that we do not know about
- kmeans will find the number of clusters we ask for. Even if there is
no structure in the data at all
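
The second point can be demonstrated with a sketch in base R: uniformly random points have no cluster structure, yet `kmeans()` still returns the three clusters it is asked for. The data are simulated for illustration.

```r
set.seed(123)

# 200 points spread uniformly over the unit square: no real clusters
noise <- matrix(runif(400), ncol = 2)

# Ask for three clusters anyway
km <- kmeans(noise, centers = 3, nstart = 10)

# Every point is assigned to one of three clusters
table(km$cluster)
plot(noise, col = km$cluster, pch = 19, xlab = "x", ylab = "y")
```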
- With two raters, Cohen's \(\kappa\)
can be used as a measure of interrater agreement on categorical, nominal
classes.
- Other methods exist for ordinal classes and for more than two
raters.
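
To show what Cohen's \(\kappa\) measures, here is a sketch that computes it by hand in base R as \(\kappa = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed agreement and \(p_e\) the agreement expected by chance. The ratings are made up for illustration; dedicated packages also provide this calculation.

```r
# Hypothetical ratings of ten items by two raters
rater1 <- c("a", "a", "b", "b", "c", "c", "a", "b", "c", "a")
rater2 <- c("a", "b", "b", "b", "c", "a", "a", "b", "c", "a")

# Cross-tabulate using a shared set of levels so the table is square
lev <- sort(unique(c(rater1, rater2)))
tab <- table(factor(rater1, levels = lev), factor(rater2, levels = lev))
n <- sum(tab)

# Observed agreement: proportion of items on the diagonal
p_o <- sum(diag(tab)) / n

# Expected agreement by chance, from the marginal totals
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2

kappa_hat <- (p_o - p_e) / (1 - p_e)
kappa_hat
```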
- UCloud provides access to a multitude of advanced software that we
are not able to run locally
- Running RStudio on UCloud gives us access to more compute and memory
than we have on our own computer
- More advanced work might require a quite complicated setup
- Restarting a virtual machine means all our work and setup might be
lost, unless we take certain precautions.
- The native base R pipe, `|>`, is faster than the magrittr pipe, `%>%`
- The magrittr package provides a selection of other useful pipes
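
A short sketch of the pipes mentioned above. It assumes R 4.1 or later for the native pipe and that the `magrittr` package is installed. The exposition pipe `%$%` is shown as one example of the other pipes magrittr provides.

```r
library(magrittr)

# Native pipe (base R, version 4.1 and later)
native <- c(1, 4, 9) |> sqrt() |> sum()

# magrittr pipe: same result
classic <- c(1, 4, 9) %>% sqrt() %>% sum()

# Exposition pipe: exposes the columns of a data frame by name
mpg_wt_cor <- mtcars %$% cor(mpg, wt)

native
classic
mpg_wt_cor
```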
- Working with geospatial data requires additional software
- RStudio, Git and a GitHub account are needed for this lesson
- A multitude of materials exist for learning Git and GitHub
- Try to use Git and GitHub in your daily work
- Use an appropriate statistical test
- Make sure you understand the assumptions underlying the test
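
As an illustration of how assumptions guide the choice of test, the sketch below compares two groups in the built-in `sleep` dataset in two ways. The dataset and the two tests are chosen here as examples only, and the groups are treated as independent for the sake of the illustration.

```r
# Welch two-sample t-test: assumes roughly normal data in each group,
# but does not assume equal variances (the default in t.test())
t_res <- t.test(extra ~ group, data = sleep)

# Wilcoxon rank-sum test: a rank-based alternative that does not assume
# normality; exact = FALSE because the data contain ties
w_res <- wilcox.test(extra ~ group, data = sleep, exact = FALSE)

c(t_test = t_res$p.value, wilcoxon = w_res$p.value)
```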
- Use fences to highlight or hide content on the pages.
- It is fairly easy to make a new course page
- Have patience!