Factor Analysis


Overview

Questions

  • What is factor analysis, and how does it differ from PCA?
  • How do we identify latent factors in data, and how do we confirm them?

Objectives

  • Explain the difference between exploratory and confirmatory factor analysis
  • Demonstrate how to run and interpret factor analyses in R using the psych and lavaan packages

Factor analysis is closely related to PCA, also in the mathematics. It assumes that a small number of underlying factors can explain the observations, and it is these underlying factors we try to uncover. Factor analysis explains the covariance in the data; PCA explains the variance.

A PCA component is a linear combination of the observed variables. Factor analysis turns this around: the observed variables are modelled as linear combinations of unobserved variables, the factors.

PCA reduces dimensionality. Factor analysis finds latent variables.

PCA is sometimes described as a type of factor analysis, but it is descriptive, whereas factor analysis is a modelling technique.

Factor analysis runs in two steps: exploratory factor analysis, where we identify the factors, and confirmatory factor analysis, where we confirm that we actually identified them.

It is widely used in psychology.
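
As a minimal sketch of the difference (assuming a numeric data matrix X; fa() from the psych package is loaded below):

R

# PCA: each principal component is a linear combination of the observed variables
pca <- prcomp(X, scale. = TRUE)

# factor analysis: the observed variables are modelled as linear
# combinations of a small number of latent factors
fa_fit <- psych::fa(X, nfactors = 2)

We start by loading the packages we need: lavaan for the confirmatory analysis (and the example data), tidyverse for data handling, and psych for the exploratory analysis.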

R

library(lavaan) 

OUTPUT

This is lavaan 0.6-19
lavaan is FREE software! Please report any bugs.

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     

OUTPUT

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R

library(psych)

OUTPUT


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

The following object is masked from 'package:lavaan':

    cor2cov

Let us look at some data - the HolzingerSwineford1939 dataset that ships with lavaan:

R

glimpse(HolzingerSwineford1939)

OUTPUT

Rows: 301
Columns: 15
$ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, …
$ sex    <int> 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, …
$ ageyr  <int> 13, 13, 13, 13, 12, 14, 12, 12, 13, 12, 12, 12, 12, 12, 12, 12,…
$ agemo  <int> 1, 7, 1, 2, 2, 1, 1, 2, 0, 5, 2, 11, 7, 8, 6, 1, 11, 5, 8, 3, 1…
$ school <fct> Pasteur, Pasteur, Pasteur, Pasteur, Pasteur, Pasteur, Pasteur, …
$ grade  <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …
$ x1     <dbl> 3.333333, 5.333333, 4.500000, 5.333333, 4.833333, 5.333333, 2.8…
$ x2     <dbl> 7.75, 5.25, 5.25, 7.75, 4.75, 5.00, 6.00, 6.25, 5.75, 5.25, 5.7…
$ x3     <dbl> 0.375, 2.125, 1.875, 3.000, 0.875, 2.250, 1.000, 1.875, 1.500, …
$ x4     <dbl> 2.333333, 1.666667, 1.000000, 2.666667, 2.666667, 1.000000, 3.3…
$ x5     <dbl> 5.75, 3.00, 1.75, 4.50, 4.00, 3.00, 6.00, 4.25, 5.75, 5.00, 3.5…
$ x6     <dbl> 1.2857143, 1.2857143, 0.4285714, 2.4285714, 2.5714286, 0.857142…
$ x7     <dbl> 3.391304, 3.782609, 3.260870, 3.000000, 3.695652, 4.347826, 4.6…
$ x8     <dbl> 5.75, 6.25, 3.90, 5.30, 6.30, 6.65, 6.20, 5.15, 4.65, 4.55, 5.7…
$ x9     <dbl> 6.361111, 7.916667, 4.416667, 4.861111, 5.916667, 7.500000, 4.8…

We have 301 observations: school children (with an id) of different sex (sex) and age (ageyr, agemo), attending different schools (school) and grades (grade), have been tested on their ability to solve a battery of tasks:

  • x1 Visual Perception - A test measuring visual perception abilities.
  • x2 Cubes - A test assessing the ability to mentally rotate three-dimensional objects.
  • x3 Lozenges - A test that evaluates the ability to identify shape changes.
  • x4 Paragraph Comprehension - A test of reading comprehension, measuring the ability to understand written paragraphs.
  • x5 Sentence Completion - A test that assesses the ability to complete sentences, typically reflecting verbal ability.
  • x6 Word Meaning - A test measuring the understanding of word meanings, often used as a gauge of vocabulary knowledge.
  • x7 Speeded Addition - A test of arithmetic skills, focusing on the ability to perform addition.
  • x8 Speeded Counting of Dots - A test that evaluates counting skills using dot patterns.
  • x9 Speeded Discrimination of Straight and Curved Capitals - A test measuring the ability to recognize straight and curved capital letters in text.

The hypothesis is that there are some underlying factors:

  • Spatial ability - the ability to perceive and manipulate visual and spatial information: x1, x2 and x3
  • Verbal ability: x4, x5 and x6
  • Mathematical ability: x7 and x8
  • Speed of processing: x9

The thinking is that if a student is good at math, he or she will score high on x7 and x8. That is, a student scoring high on x7 will probably also score high on x8 - or low on both.

This makes intuitive sense. But we would like to be able to actually identify these underlying factors.
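
We can give that intuition a quick sanity check (a minimal sketch):

R

# if x7 and x8 tap the same underlying ability, they should be correlated
cor(HolzingerSwineford1939$x7, HolzingerSwineford1939$x8)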

Exploratory factor analysis

We run a factor analysis and ask for 9 factors - with nine observed variables, that is the maximum number we can extract.

R

library(psych)
hs.efa <- fa(select(HolzingerSwineford1939, x1:x9), nfactors = 9, 
             rotate = "none", fm = "ml")
hs.efa

OUTPUT

Factor Analysis using method =  ml
Call: fa(r = select(HolzingerSwineford1939, x1:x9), nfactors = 9, rotate = "none",
    fm = "ml")
Standardized loadings (pattern matrix) based upon correlation matrix
    ML1   ML2   ML3 ML4 ML5 ML6 ML7 ML8 ML9   h2   u2 com
x1 0.50  0.31  0.32   0   0   0   0   0   0 0.44 0.56 2.4
x2 0.26  0.19  0.38   0   0   0   0   0   0 0.25 0.75 2.3
x3 0.29  0.40  0.38   0   0   0   0   0   0 0.38 0.62 2.8
x4 0.81 -0.17 -0.03   0   0   0   0   0   0 0.69 0.31 1.1
x5 0.81 -0.21 -0.08   0   0   0   0   0   0 0.70 0.30 1.2
x6 0.81 -0.15  0.01   0   0   0   0   0   0 0.67 0.33 1.1
x7 0.23  0.41 -0.44   0   0   0   0   0   0 0.41 0.59 2.5
x8 0.27  0.55 -0.28   0   0   0   0   0   0 0.45 0.55 2.0
x9 0.39  0.54 -0.03   0   0   0   0   0   0 0.44 0.56 1.8

                       ML1  ML2  ML3  ML4  ML5  ML6  ML7  ML8  ML9
SS loadings           2.63 1.15 0.67 0.00 0.00 0.00 0.00 0.00 0.00
Proportion Var        0.29 0.13 0.07 0.00 0.00 0.00 0.00 0.00 0.00
Cumulative Var        0.29 0.42 0.49 0.49 0.49 0.49 0.49 0.49 0.49
Proportion Explained  0.59 0.26 0.15 0.00 0.00 0.00 0.00 0.00 0.00
Cumulative Proportion 0.59 0.85 1.00 1.00 1.00 1.00 1.00 1.00 1.00

Mean item complexity =  1.9
Test of the hypothesis that 9 factors are sufficient.

df null model =  36  with the objective function =  3.05 with Chi Square =  904.1
df of  the model are -9  and the objective function was  0.11

The root mean square of the residuals (RMSR) is  0.03
The df corrected root mean square of the residuals is  NA

The harmonic n.obs is  301 with the empirical chi square  19.16  with prob <  NA
The total n.obs was  301  with Likelihood Chi Square =  33.29  with prob <  NA

Tucker Lewis Index of factoring reliability =  1.199
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
                                                   ML1  ML2   ML3 ML4 ML5 ML6
Correlation of (regression) scores with factors   0.93 0.80  0.70   0   0   0
Multiple R square of scores with factors          0.86 0.65  0.49   0   0   0
Minimum correlation of possible factor scores     0.72 0.29 -0.03  -1  -1  -1
                                                  ML7 ML8 ML9
Correlation of (regression) scores with factors     0   0   0
Multiple R square of scores with factors            0   0   0
Minimum correlation of possible factor scores      -1  -1  -1

The individual elements:

SS loadings (Sum of Squared Loadings): the sum of the squared factor loadings for each factor, showing how much variance each factor explains. For ML1 it is 2.63, which means this factor explains the most variance of the estimated factors. ML4 through ML9 explain no variance (0.00), confirming that only the first few factors are meaningful.

Proportion Var (Proportion of Variance): the share of the total variance explained by each factor. ML1 explains 29% of the variance, ML2 13% and ML3 7%; the remaining factors explain none.

Cumulative Var (Cumulative Variance): the cumulative share of the total variance explained up to and including a given factor. ML1 alone explains 29% of the variance; the first three factors together explain 49%.

Proportion Explained: the share of the explained variance attributable to each factor. ML1 accounts for 59% of the explained variance, ML2 for 26% and ML3 for 15%.

Cumulative Proportion: the cumulative share of the explained variance. The first three factors account for 100%, indicating that the remaining factors contribute nothing further.

Mean Item Complexity: the average number of factors each variable loads substantially on. A value of 1.9 indicates that most variables have substantial loadings on almost two factors.

Model fit: "df null model" is the degrees of freedom of the null model, the model that assumes all variables are uncorrelated, and "Chi Square" is its chi-square statistic. "df of the model" is the degrees of freedom of the estimated model; a negative value indicates an overparameterised model (more factors than the data can support). "Objective function" is the value of the objective function for the estimated model.

RMSR (Root Mean Square Residual): the root mean square of the residuals, the differences between the observed and the model-implied values. A low RMSR indicates a good model fit.

Chi square and fit indices: "Harmonic n.obs" is the harmonic mean of the number of observations, "Empirical Chi Square" the empirical chi-square value, and "Likelihood Chi Square" the chi-square statistic based on the likelihood method. The Tucker Lewis Index (TLI) is a fit index where values above 0.9 typically indicate a good fit; a value of 1.199 is very high. "Fit based upon off diagonal values" measures fit based on the off-diagonal values of the residual correlation matrix; 0.99 indicates an excellent fit.
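
These quantities can also be pulled directly out of the fitted object (a sketch, using the accessors of the object returned by fa()):

R

hs.efa$Vaccounted    # the variance table: SS loadings, proportion and cumulative variance
hs.efa$communality   # h2: the variance each variable shares with the factors
hs.efa$uniquenesses  # u2: the variance unique to each variable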

Often a visual representation is useful - here a scree plot of the eigenvalues:

R

plot(hs.efa$e.values)

The rule of thumb is that we reject factors with an eigenvalue lower than 1.0.
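
An alternative to the eigenvalue rule of thumb is parallel analysis, which compares the observed eigenvalues with eigenvalues obtained from random data. psych provides this as fa.parallel() (a minimal sketch):

R

fa.parallel(select(HolzingerSwineford1939, x1:x9), fm = "ml", fa = "fa")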

Three factors appear to be sufficient. We now run the factor analysis again, this time asking for three factors:

R

hs.efa <- fa(select(HolzingerSwineford1939, x1:x9), nfactors = 3, 
             rotate = "none", fm = "ml")
hs.efa

OUTPUT

Factor Analysis using method =  ml
Call: fa(r = select(HolzingerSwineford1939, x1:x9), nfactors = 3, rotate = "none",
    fm = "ml")
Standardized loadings (pattern matrix) based upon correlation matrix
    ML1   ML2   ML3   h2   u2 com
x1 0.49  0.31  0.39 0.49 0.51 2.7
x2 0.24  0.17  0.40 0.25 0.75 2.1
x3 0.27  0.41  0.47 0.46 0.54 2.6
x4 0.83 -0.15 -0.03 0.72 0.28 1.1
x5 0.84 -0.21 -0.10 0.76 0.24 1.2
x6 0.82 -0.13  0.02 0.69 0.31 1.0
x7 0.23  0.48 -0.46 0.50 0.50 2.4
x8 0.27  0.62 -0.27 0.53 0.47 1.8
x9 0.38  0.56  0.02 0.46 0.54 1.8

                       ML1  ML2  ML3
SS loadings           2.72 1.31 0.82
Proportion Var        0.30 0.15 0.09
Cumulative Var        0.30 0.45 0.54
Proportion Explained  0.56 0.27 0.17
Cumulative Proportion 0.56 0.83 1.00

Mean item complexity =  1.8
Test of the hypothesis that 3 factors are sufficient.

df null model =  36  with the objective function =  3.05 with Chi Square =  904.1
df of  the model are 12  and the objective function was  0.08

The root mean square of the residuals (RMSR) is  0.02
The df corrected root mean square of the residuals is  0.03

The harmonic n.obs is  301 with the empirical chi square  8.03  with prob <  0.78
The total n.obs was  301  with Likelihood Chi Square =  22.38  with prob <  0.034

Tucker Lewis Index of factoring reliability =  0.964
RMSEA index =  0.053  and the 90 % confidence intervals are  0.015 0.088
BIC =  -46.11
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                   ML1  ML2  ML3
Correlation of (regression) scores with factors   0.95 0.86 0.78
Multiple R square of scores with factors          0.90 0.73 0.60
Minimum correlation of possible factor scores     0.80 0.46 0.21

So which variables belong to which factors? To make the loadings easier to interpret, we rotate the solution - here with varimax, which keeps the factors uncorrelated:

R

hs.efa <- fa(select(HolzingerSwineford1939, x1:x9), nfactors = 3, 
             rotate = "varimax", fm = "ml")
hs.efa

OUTPUT

Factor Analysis using method =  ml
Call: fa(r = select(HolzingerSwineford1939, x1:x9), nfactors = 3, rotate = "varimax",
    fm = "ml")
Standardized loadings (pattern matrix) based upon correlation matrix
    ML1   ML3   ML2   h2   u2 com
x1 0.28  0.62  0.15 0.49 0.51 1.5
x2 0.10  0.49 -0.03 0.25 0.75 1.1
x3 0.03  0.66  0.13 0.46 0.54 1.1
x4 0.83  0.16  0.10 0.72 0.28 1.1
x5 0.86  0.09  0.09 0.76 0.24 1.0
x6 0.80  0.21  0.09 0.69 0.31 1.2
x7 0.09 -0.07  0.70 0.50 0.50 1.1
x8 0.05  0.16  0.71 0.53 0.47 1.1
x9 0.13  0.41  0.52 0.46 0.54 2.0

                       ML1  ML3  ML2
SS loadings           2.18 1.34 1.33
Proportion Var        0.24 0.15 0.15
Cumulative Var        0.24 0.39 0.54
Proportion Explained  0.45 0.28 0.27
Cumulative Proportion 0.45 0.73 1.00

Mean item complexity =  1.2
Test of the hypothesis that 3 factors are sufficient.

df null model =  36  with the objective function =  3.05 with Chi Square =  904.1
df of  the model are 12  and the objective function was  0.08

The root mean square of the residuals (RMSR) is  0.02
The df corrected root mean square of the residuals is  0.03

The harmonic n.obs is  301 with the empirical chi square  8.03  with prob <  0.78
The total n.obs was  301  with Likelihood Chi Square =  22.38  with prob <  0.034

Tucker Lewis Index of factoring reliability =  0.964
RMSEA index =  0.053  and the 90 % confidence intervals are  0.015 0.088
BIC =  -46.11
Fit based upon off diagonal values = 1
Measures of factor score adequacy
                                                   ML1  ML3  ML2
Correlation of (regression) scores with factors   0.93 0.81 0.84
Multiple R square of scores with factors          0.87 0.66 0.70
Minimum correlation of possible factor scores     0.74 0.33 0.40
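
The pattern is easier to see if we suppress the small loadings when printing: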

R

print(hs.efa$loadings, cutoff = 0.4)

OUTPUT


Loadings:
   ML1    ML3    ML2
x1         0.623
x2         0.489
x3         0.663
x4  0.827
x5  0.861
x6  0.801
x7                0.696
x8                0.709
x9         0.406  0.524

                 ML1   ML3   ML2
SS loadings    2.185 1.343 1.327
Proportion Var 0.243 0.149 0.147
Cumulative Var 0.243 0.392 0.539

Boom. We have now identified which manifest variables enter into which latent factors.

Confirmatory factor analysis


Now we ought to get hold of a new dataset of manifest variables and see how well the latent variables of our model describe the variation in it.

In practice, students (and many others) are lazy and jump the fence where it is lowest (and why on earth would you jump where it is highest?), so here we reuse the same data.

So how do we run the confirmatory analysis?

We need to specify a model. The syntax looks a bit special: each line defines a latent variable with the =~ operator, which reads "is measured by", with the latent factor on the left and its manifest indicators on the right.

R

HS.model <- 'visual  =~ x1 + x2 + x3
             textual =~ x4 + x5 + x6
             speed   =~ x7 + x8 + x9'

Then we fit the model:

R

fit <- cfa(HS.model, data = HolzingerSwineford1939)
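
Printing the fitted object gives a brief overview of the estimation: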

R

fit

OUTPUT

lavaan 0.6-19 ended normally after 35 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           301

Model Test User Model:

  Test statistic                                85.306
  Degrees of freedom                                24
  P-value (Chi-square)                           0.000

Isn't that nice. Of course it is: we fitted the model to the very data we derived it from, so the fit had better be good. The summary gives the details:

R

summary(fit, standardized=TRUE, fit.measures=TRUE, rsquare=TRUE)

OUTPUT

lavaan 0.6-19 ended normally after 35 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        21

  Number of observations                           301

Model Test User Model:

  Test statistic                                85.306
  Degrees of freedom                                24
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                               918.852
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.931
  Tucker-Lewis Index (TLI)                       0.896

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3737.745
  Loglikelihood unrestricted model (H1)      -3695.092

  Akaike (AIC)                                7517.490
  Bayesian (BIC)                              7595.339
  Sample-size adjusted Bayesian (SABIC)       7528.739

Root Mean Square Error of Approximation:

  RMSEA                                          0.092
  90 Percent confidence interval - lower         0.071
  90 Percent confidence interval - upper         0.114
  P-value H_0: RMSEA <= 0.050                    0.001
  P-value H_0: RMSEA >= 0.080                    0.840

Standardized Root Mean Square Residual:

  SRMR                                           0.065

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual =~
    x1                1.000                               0.900    0.772
    x2                0.554    0.100    5.554    0.000    0.498    0.424
    x3                0.729    0.109    6.685    0.000    0.656    0.581
  textual =~
    x4                1.000                               0.990    0.852
    x5                1.113    0.065   17.014    0.000    1.102    0.855
    x6                0.926    0.055   16.703    0.000    0.917    0.838
  speed =~
    x7                1.000                               0.619    0.570
    x8                1.180    0.165    7.152    0.000    0.731    0.723
    x9                1.082    0.151    7.155    0.000    0.670    0.665

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual ~~
    textual           0.408    0.074    5.552    0.000    0.459    0.459
    speed             0.262    0.056    4.660    0.000    0.471    0.471
  textual ~~
    speed             0.173    0.049    3.518    0.000    0.283    0.283

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .x1                0.549    0.114    4.833    0.000    0.549    0.404
   .x2                1.134    0.102   11.146    0.000    1.134    0.821
   .x3                0.844    0.091    9.317    0.000    0.844    0.662
   .x4                0.371    0.048    7.779    0.000    0.371    0.275
   .x5                0.446    0.058    7.642    0.000    0.446    0.269
   .x6                0.356    0.043    8.277    0.000    0.356    0.298
   .x7                0.799    0.081    9.823    0.000    0.799    0.676
   .x8                0.488    0.074    6.573    0.000    0.488    0.477
   .x9                0.566    0.071    8.003    0.000    0.566    0.558
    visual            0.809    0.145    5.564    0.000    1.000    1.000
    textual           0.979    0.112    8.737    0.000    1.000    1.000
    speed             0.384    0.086    4.451    0.000    1.000    1.000

R-Square:
                   Estimate
    x1                0.596
    x2                0.179
    x3                0.338
    x4                0.725
    x5                0.731
    x6                0.702
    x7                0.324
    x8                0.523
    x9                0.442
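
lavaan has extractor functions for the fitted model. fitMeasures() collects the fit indices into a single vector (a minimal sketch):

R

fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))

fitted() returns the model-implied covariance matrix: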

R

fitted(fit)

OUTPUT

$cov
      x1    x2    x3    x4    x5    x6    x7    x8    x9
x1 1.358
x2 0.448 1.382
x3 0.590 0.327 1.275
x4 0.408 0.226 0.298 1.351
x5 0.454 0.252 0.331 1.090 1.660
x6 0.378 0.209 0.276 0.907 1.010 1.196
x7 0.262 0.145 0.191 0.173 0.193 0.161 1.183
x8 0.309 0.171 0.226 0.205 0.228 0.190 0.453 1.022
x9 0.284 0.157 0.207 0.188 0.209 0.174 0.415 0.490 1.015
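
coef() extracts the estimated parameters: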

R

coef(fit)

OUTPUT

      visual=~x2       visual=~x3      textual=~x5      textual=~x6
           0.554            0.729            1.113            0.926
       speed=~x8        speed=~x9           x1~~x1           x2~~x2
           1.180            1.082            0.549            1.134
          x3~~x3           x4~~x4           x5~~x5           x6~~x6
           0.844            0.371            0.446            0.356
          x7~~x7           x8~~x8           x9~~x9   visual~~visual
           0.799            0.488            0.566            0.809
textual~~textual     speed~~speed  visual~~textual    visual~~speed
           0.979            0.384            0.408            0.262
  textual~~speed
           0.173 
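
Finally, the residuals - the differences between the observed and the model-implied covariances - show where the model fits poorly. A common rule of thumb is that normalized residuals larger than about 2 in absolute value deserve attention: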

R

resid(fit, type = "normalized")

OUTPUT

$type
[1] "normalized"

$cov
       x1     x2     x3     x4     x5     x6     x7     x8     x9
x1  0.000
x2 -0.493  0.000
x3 -0.125  1.539  0.000
x4  1.159 -0.214 -1.170  0.000
x5 -0.153 -0.459 -2.606  0.070  0.000
x6  0.983  0.507 -0.436 -0.130  0.048  0.000
x7 -2.423 -3.273 -1.450  0.625 -0.617 -0.240  0.000
x8 -0.655 -0.896 -0.200 -1.162 -0.624 -0.375  1.170  0.000
x9  2.405  1.249  2.420  0.808  1.126  0.958 -0.625 -0.504  0.000

CAVE! semPlot fails during installation on GitHub, so the following code is shown but not run. It draws a path diagram of the fitted model:

R

library(semPlot)
semPaths(fit, "std", layout = "tree", intercepts = F, residuals = T,
         nDigits = 2, label.cex = 1, edge.label.cex = .95, fade = F)

Key Points

  • Factor analysis explains the covariance between observed variables through a small number of latent factors; PCA explains the variance
  • Exploratory factor analysis (fa() from psych) identifies candidate factors; eigenvalues and rotation help choose and interpret them
  • Confirmatory factor analysis (cfa() from lavaan) tests a specified factor structure - ideally on a new dataset