#r #logistic-regression #glm #weighted
Question:
When running an aggregated regression with the weights argument in glm, I can add categorical predictors and the results match those from a regression on the individual data (ignoring differences in df), but when I add a continuous predictor the results no longer match.
e.g.,
summary(glm(am ~ as.factor(cyl) + carb,
            data = mtcars,
            family = binomial(link = "logit")))
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"),
##     data = mtcars)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.8699  -0.5506  -0.1869   0.6185   1.9806
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -0.6718     1.0924  -0.615  0.53854
## as.factor(cyl)6  -3.7609     1.9072  -1.972  0.04862 *
## as.factor(cyl)8  -5.5958     1.9381  -2.887  0.00389 **
## carb              1.1144     0.5918   1.883  0.05967 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 26.287 on 28 degrees of freedom
## AIC: 34.287
##
## Number of Fisher Scoring iterations: 5
The results above match the following:
library(dplyr)

mtcars_percent <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(
    n = n(),
    am = sum(am) / n
  )
summary(glm(am ~ as.factor(cyl) + carb,
            data = mtcars_percent,
            family = binomial(link = "logit"),
            weights = n))
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"),
##     data = mtcars_percent, weights = n)
##
## Deviance Residuals:
##       1        2        3        4        5        6        7        8
##  0.9179  -0.9407  -0.3772  -0.0251   0.4468  -0.3738  -0.5602   0.1789
##       9
##  0.3699
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -0.6718     1.0925  -0.615  0.53858
## as.factor(cyl)6  -3.7609     1.9074  -1.972  0.04865 *
## as.factor(cyl)8  -5.5958     1.9383  -2.887  0.00389 **
## carb              1.1144     0.5919   1.883  0.05971 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19.6356 on 8 degrees of freedom
## Residual deviance: 2.6925 on 5 degrees of freedom
## AIC: 18.485
##
## Number of Fisher Scoring iterations: 5
The coefficients and standard errors above match.
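As a quick check that the two fits really do agree, here is a minimal snippet (fit_ind and fit_agg are just names I'm using for this post):

# Refit both models and compare the estimated coefficients
fit_ind <- glm(am ~ as.factor(cyl) + carb,
               family = binomial(link = "logit"), data = mtcars)
fit_agg <- glm(am ~ as.factor(cyl) + carb,
               family = binomial(link = "logit"),
               data = mtcars_percent, weights = n)
all.equal(coef(fit_ind), coef(fit_agg))  # TRUE, up to numerical tolerance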
However, adding a continuous predictor (e.g., mpg) to this experiment produces differences. Individual data:
summary(glm(formula = am ~ as.factor(cyl) + carb + mpg,
            family = binomial,
            data = mtcars))
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial,
##     data = mtcars)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.8933  -0.4595  -0.1293   0.1475   1.6969
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -18.3024     9.3442  -1.959   0.0501 .
## as.factor(cyl)6  -1.8594     2.5963  -0.716   0.4739
## as.factor(cyl)8  -0.3029     2.8828  -0.105   0.9163
## carb              1.6959     0.9918   1.710   0.0873 .
## mpg               0.6771     0.3645   1.858   0.0632 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 18.467 on 27 degrees of freedom
## AIC: 28.467
##
## Number of Fisher Scoring iterations: 6
And now aggregating:
mtcars_percent <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(
    n = n(),
    am = sum(am) / n,
    mpg = mean(mpg)
  )
mtcars_percent
## # A tibble: 9 x 5
## # Groups:   cyl [3]
##     cyl  carb     n    am   mpg
##   <dbl> <dbl> <int> <dbl> <dbl>
## 1     4     1     5 0.8    27.6
## 2     4     2     6 0.667  25.9
## 3     6     1     2 0      19.8
## 4     6     4     4 0.5    19.8
## 5     6     6     1 1      19.7
## 6     8     2     4 0      17.2
## 7     8     3     3 0      16.3
## 8     8     4     6 0.167  13.2
## 9     8     8     1 1      15
glm(formula = am ~ as.factor(cyl) + carb + mpg,
    family = binomial,
    data = mtcars_percent,
    weights = n
) %>%
  summary()
##
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial,
##     data = mtcars_percent, weights = n)
##
## Deviance Residuals:
##        1         2         3         4         5         6         7         8
##  0.75845  -0.73755  -0.24505  -0.02649   0.34041  -0.50528  -0.74002   0.46178
##        9
##  0.17387
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -11.3593    19.9611  -0.569    0.569
## as.factor(cyl)6  -1.7932     3.7491  -0.478    0.632
## as.factor(cyl)8  -1.4419     7.3124  -0.197    0.844
## carb              1.4059     1.0718   1.312    0.190
## mpg               0.3825     0.7014   0.545    0.585
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19.6356 on 8 degrees of freedom
## Residual deviance: 2.3423 on 4 degrees of freedom
## AIC: 20.134
##
## Number of Fisher Scoring iterations: 6
The coefficients, standard errors, and p-values are now different. I would like to understand why, and what can be done to match the individual-data model.
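One thing I noticed while putting this together (I'm not sure whether it matters): unlike cyl and carb, mpg is not constant within the cyl × carb groups, so taking mean(mpg) discards within-group variation:

# Within-group spread of mpg in the individual data
mtcars %>%
  group_by(cyl, carb) %>%
  summarise(sd_mpg = sd(mpg), .groups = "drop")
# several groups have sd_mpg well above 0 (sd is NA for the n = 1 groups)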
The help page for glm() states: "weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations."
I take that to mean I can calculate mean(mpg) within each group, as I've done, and the regression should still work. Obviously I am misunderstanding something…
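For what it's worth, if I instead make mpg part of the grouping, so that it is constant within each aggregated row (a sanity check rather than a useful aggregation, since most groups then have n = 1), the weighted fit does reproduce the individual-data coefficients and standard errors:

mtcars_exact <- mtcars %>%
  group_by(cyl, carb, mpg) %>%
  summarise(n = n(), am = sum(am) / n, .groups = "drop")

glm(am ~ as.factor(cyl) + carb + mpg,
    family = binomial,
    data = mtcars_exact,
    weights = n) %>%
  summary()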
Thanks for your help