How to add a continuous predictor to an aggregated (logistic) regression using glm in R

#r #logistic-regression #glm #weighted

Question:

When I run an aggregated regression using the weights argument in glm, I can add categorical predictors and the results match the regression on the individual data (ignoring the difference in df). However, when I add a continuous predictor, the results no longer match.

e.g.,

 summary(glm(am ~ as.factor(cyl) + carb, 
            data = mtcars, 
            family = binomial(link = "logit")))
 
 ## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"), 
##     data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8699  -0.5506  -0.1869   0.6185   1.9806  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.6718     1.0924  -0.615  0.53854   
## as.factor(cyl)6  -3.7609     1.9072  -1.972  0.04862 * 
## as.factor(cyl)8  -5.5958     1.9381  -2.887  0.00389 **
## carb              1.1144     0.5918   1.883  0.05967 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 26.287  on 28  degrees of freedom
## AIC: 34.287
## 
## Number of Fisher Scoring iterations: 5
 

The results above match the following:

 library(dplyr)

 mtcars_percent <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(
    n = n(),
    am  = sum(am)/n
  )

summary(glm(am ~ as.factor(cyl) + carb, 
            data = mtcars_percent, 
            family = binomial(link = "logit"),
            weights = n
            ))
 
 ## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"), 
##     data = mtcars_percent, weights = n)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.9179  -0.9407  -0.3772  -0.0251   0.4468  -0.3738  -0.5602   0.1789  
##       9  
##  0.3699  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.6718     1.0925  -0.615  0.53858   
## as.factor(cyl)6  -3.7609     1.9074  -1.972  0.04865 * 
## as.factor(cyl)8  -5.5958     1.9383  -2.887  0.00389 **
## carb              1.1144     0.5919   1.883  0.05971 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19.6356  on 8  degrees of freedom
## Residual deviance:  2.6925  on 5  degrees of freedom
## AIC: 18.485
## 
## Number of Fisher Scoring iterations: 5
 

The coefficients and standard errors above match.
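
(To check the match programmatically, I refit both models and compare the estimates side by side; fit_individual and fit_weighted are just my own names for the two fits shown above.)

 # Refit the two models above and compare the estimates side by side.
 # fit_individual / fit_weighted are names I made up for this check.
 fit_individual <- glm(am ~ as.factor(cyl) + carb,
                       data = mtcars, family = binomial(link = "logit"))
 fit_weighted   <- glm(am ~ as.factor(cyl) + carb,
                       data = mtcars_percent, family = binomial(link = "logit"),
                       weights = n)
 cbind(individual = coef(fit_individual), aggregated = coef(fit_weighted))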

However, adding a continuous predictor (e.g., mpg) to this setup produces differences. Individual data:

 summary(glm(formula = am ~ as.factor(cyl) + carb + mpg,
      family = binomial,
      data = mtcars))
 
 ## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial, 
##     data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8933  -0.4595  -0.1293   0.1475   1.6969  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -18.3024     9.3442  -1.959   0.0501 .
## as.factor(cyl)6  -1.8594     2.5963  -0.716   0.4739  
## as.factor(cyl)8  -0.3029     2.8828  -0.105   0.9163  
## carb              1.6959     0.9918   1.710   0.0873 .
## mpg               0.6771     0.3645   1.858   0.0632 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 18.467  on 27  degrees of freedom
## AIC: 28.467
## 
## Number of Fisher Scoring iterations: 6
 

And now aggregating:

 mtcars_percent <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(
    n = n(),
    am  = sum(am)/n,
    mpg = mean(mpg)
  )

# A tibble: 9 x 5
# Groups:   cyl [3]
    cyl  carb     n    am   mpg
  <dbl> <dbl> <int> <dbl> <dbl>
1     4     1     5 0.8    27.6
2     4     2     6 0.667  25.9
3     6     1     2 0      19.8
4     6     4     4 0.5    19.8
5     6     6     1 1      19.7
6     8     2     4 0      17.2
7     8     3     3 0      16.3
8     8     4     6 0.167  13.2
9     8     8     1 1      15  
 
 glm(formula = am ~ as.factor(cyl) + carb + mpg,
    family = binomial,
    data = mtcars_percent,
    weights = n
  ) %>%
  summary()
 
 ## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial, 
##     data = mtcars_percent, weights = n)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7         8  
##  0.75845  -0.73755  -0.24505  -0.02649   0.34041  -0.50528  -0.74002   0.46178  
##        9  
##  0.17387  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -11.3593    19.9611  -0.569    0.569
## as.factor(cyl)6  -1.7932     3.7491  -0.478    0.632
## as.factor(cyl)8  -1.4419     7.3124  -0.197    0.844
## carb              1.4059     1.0718   1.312    0.190
## mpg               0.3825     0.7014   0.545    0.585
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19.6356  on 8  degrees of freedom
## Residual deviance:  2.3423  on 4  degrees of freedom
## AIC: 20.134
## 
## Number of Fisher Scoring iterations: 6
 

The coefficients, standard errors and p-values are now different. I would like to understand why, and what can be done to make the aggregated model match the individual-data model.

The help page for glm() states: "weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations."

I take that to mean I can calculate mean(mpg) for each grouping cell, as I have done above, and the regression should still work. Obviously I am misunderstanding something…
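
To illustrate how I read that passage: each aggregated row should be the mean of w_i unit-weight observations that share all of their covariate values. A sketch of what that would look like here (grouping by mpg itself instead of averaging it; mtcars_patterns is just my name for the result):

 # Sketch of my reading of the help text: aggregate over every distinct
 # covariate pattern, so each row really is the mean of n observations
 # with identical covariates. mtcars_patterns is a name I made up here.
 mtcars_patterns <- mtcars %>%
   group_by(cyl, carb, mpg) %>%
   summarise(
     n = n(),
     am = mean(am)
   )

 summary(glm(am ~ as.factor(cyl) + carb + mpg,
             family = binomial,
             data = mtcars_patterns,
             weights = n))

I would expect that version to reproduce the individual-data fit, since every covariate is constant within each group, but it hardly aggregates the data at all, which is why I averaged mpg within the cyl/carb cells instead.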

Thanks for your help