large-data

Last updated on 2025-02-04

Overview

Questions

  • How do you write a lesson using R Markdown and sandpaper?

Objectives

  • Explain how to use markdown with the new lesson template
  • Demonstrate how to include pieces of code, figures, and nested challenge blocks

General tips.

Optimise data types: for example, use factors instead of character strings (when the same values occur repeatedly in the text) and integer instead of numeric where the values are whole numbers.
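To illustrate why these conversions save memory, here is a minimal base R sketch on made-up vectors. The names `x_dbl`, `x_int`, `s_chr` and `s_fct`, the length `n` and the values are all assumptions chosen for illustration; they are not part of the lesson data.

```r
# Made-up example vectors; n and the values are arbitrary.
n <- 1e5

# The same whole numbers stored as double (8 bytes per element)
# and as integer (4 bytes per element).
x_dbl <- as.numeric(rep(1:10, length.out = n))
x_int <- as.integer(x_dbl)

# The same repeated labels stored as character and as factor.
# A factor keeps one integer code per element plus a short vector of levels.
s_chr <- rep(c("JFK", "LGA", "EWR"), length.out = n)
s_fct <- factor(s_chr)

object.size(x_dbl)
object.size(x_int)
object.size(s_chr)
object.size(s_fct)
```

The factor only pays off when values repeat; a column of mostly unique strings would gain little or nothing from the conversion.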

library(tidyverse)

vanilla <- nycflights13::flights

int_1 <- vanilla %>% mutate(dep_delay = as.integer(dep_delay))

int_all <- vanilla %>% mutate(across(c(dep_delay, arr_delay, air_time, flight, distance, hour, minute), as.integer))

fct_1 <- vanilla %>% mutate(carrier = fct(carrier))

fct_all <- vanilla %>% mutate(across(c(carrier, tailnum, origin, dest), as.factor))

all_conv <- vanilla %>% mutate(across(c(carrier, tailnum, origin, dest), as.factor)) %>% mutate(across(c(dep_delay, arr_delay, air_time, flight, distance, hour, minute), as.integer))

object.size(vanilla)
object.size(int_1)
object.size(int_all)
object.size(fct_1)
object.size(fct_all)
object.size(all_conv)
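Instead of reading off the byte counts and dividing by hand, the ratio can be computed directly. The following is a rough sketch on a small made-up data frame so that it runs on its own; `size_ratio`, `toy` and `toy_conv` are hypothetical names introduced here, not objects from the lesson.

```r
# Hypothetical helper: size of `converted` as a fraction of `original`.
size_ratio <- function(converted, original) {
  as.numeric(object.size(converted)) / as.numeric(object.size(original))
}

# Made-up toy data standing in for a larger table.
n <- 1e5
toy <- data.frame(
  origin = rep(c("JFK", "LGA", "EWR"), length.out = n),
  delay  = as.numeric(rep(-5:20, length.out = n)),
  stringsAsFactors = FALSE
)

# Same kind of conversions as above: text to factor, whole-number doubles to integer.
toy_conv <- transform(toy, origin = factor(origin), delay = as.integer(delay))

size_ratio(toy_conv, toy)  # a value below 1 means the converted copy is smaller
```

The same helper could be applied to the objects above, for example `size_ratio(all_conv, vanilla)`.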

27213880 / 40650040

In other words, in this specific case the conversions reduce the size by roughly one third.

duckdb

That is not bad at all.

data.table

tbl(con, "flights") %>%
  group_by(origin) %>%
  count(dest, sort = TRUE, name = "N") %>%
  slice_max(order_by = N, n = 3) %>%
  select(origin, dest)
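The query above relies on a connection object `con` that is not defined in the text. One possible setup, assuming an in-memory DuckDB database and the DBI, duckdb, dbplyr and nycflights13 packages being installed, could look like the sketch below. How the connection was really created is not stated, so treat this as an illustration only; `top_dest` is a hypothetical name.

```r
library(dplyr)

# Assumption: an in-memory DuckDB database; the original setup of `con` is not shown.
con <- DBI::dbConnect(duckdb::duckdb())

# Copy the flights data into a database table named "flights".
DBI::dbWriteTable(con, "flights", nycflights13::flights)

# tbl() returns a lazy reference: dbplyr translates the pipeline to SQL,
# and nothing is pulled into R until collect() is called.
top_dest <- tbl(con, "flights") %>%
  count(origin, dest, name = "N") %>%
  group_by(origin) %>%
  slice_max(order_by = N, n = 3) %>%
  collect()

top_dest

# Close the connection and shut down the embedded database.
DBI::dbDisconnect(con, shutdown = TRUE)
```

Because the counting and ranking run inside DuckDB, only the small result is returned to R.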

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally