32  Logistic Regression - Extended Examples

Author

Brian Powers

Published

April 13, 2023

32.1 The Pima dataset - predicting diabetes

library(MASS)
head(Pima.te)
  npreg glu bp skin  bmi   ped age type
1     6 148 72   35 33.6 0.627  50  Yes
2     1  85 66   29 26.6 0.351  31   No
3     1  89 66   23 28.1 0.167  21   No
4     3  78 50   32 31.0 0.248  26  Yes
5     2 197 70   45 30.5 0.158  53  Yes
6     5 166 72   19 25.8 0.587  51  Yes
Pima.te$glu_bin <- (round(Pima.te$glu/10))*10      #bin glucose to the nearest 10
Pima.te$typeTF <- as.numeric(Pima.te$type=="Yes")  #code the response as 0/1
Pima.glm <- glm(typeTF ~ glu, data=Pima.te, family="binomial")

#proportion with diabetes within each glucose bin
aggr.props <- aggregate(typeTF ~ glu_bin, data=Pima.te, FUN=mean)

xs <- seq(60,200,10)
zs <- predict(Pima.glm, newdata=data.frame(glu=xs))  #predicted log odds
ps <- exp(zs)/(1+exp(zs))                            #convert log odds to probabilities

#scatter the 0/1 data, overlay the fitted curve, and annotate the binned proportions
plot(x=Pima.te$glu, y=Pima.te$typeTF, pch=16, col=rgb(0,0,1,.2), xlim=c(50,200))
#points(aggr.props, ylim=c(0,1))
lines(x=xs,y=ps, col="red")
text(x=xs-2, y=as.vector(aggr.props$typeTF*.8+.5*.2), labels=round(aggr.props$typeTF,2))
segments(x0=xs-5, aggr.props$typeTF, xs+5,aggr.props$typeTF)
abline(v=seq(55,195,10), col="gray")

Model summary

summary(Pima.glm)

Call:
glm(formula = typeTF ~ glu, family = "binomial", data = Pima.te)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2343  -0.7270  -0.4985   0.6663   2.3268  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.946808   0.659839  -9.013   <2e-16 ***
glu          0.042421   0.005165   8.213   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.30  on 331  degrees of freedom
Residual deviance: 325.99  on 330  degrees of freedom
AIC: 329.99

Number of Fisher Scoring iterations: 4

32.2 Odds Ratios

Odds Ratio = Odds(Event | Case A) / Odds(Event | Case B). We look at the odds ratio when the cases differ by an increase of 1 unit in the predictor \(x\): \(\beta_1\) is the log of the odds ratio associated with a 1 unit increase in \(x\).
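
To see why, take the ratio of the odds at \(x+1\) to the odds at \(x\); the intercept and the common \(\beta_1 x\) term cancel: \[\frac{\text{Odds}(x+1)}{\text{Odds}(x)} = \frac{e^{\beta_0+\beta_1(x+1)}}{e^{\beta_0+\beta_1 x}} = e^{\beta_1}\]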

The log(odds ratio)=0.042421

The odds ratio = exp(0.042421)

exp(0.042421)
[1] 1.043334

32.2.1 Interpret the slope

Pima.glm$coefficients[2]
       glu 
0.04242098 
exp(Pima.glm$coefficients[2])
     glu 
1.043334 

The odds ratio for a 1 unit increase in blood glucose is 1.0433, so if we hold all other variables constant and increase blood glucose by 1 unit, we expect the odds to increase by 4.33%.

32.2.2 Odds ratio vs risk ratio

Risk is the probability of the event, so the risk ratio is \(P(\text{Event} \mid \text{Case A}) / P(\text{Event} \mid \text{Case B})\), which is not the same as the odds ratio. Consider this example:

Predict probability for glu 200 vs 201

predict(Pima.glm, type="response", newdata=data.frame(glu=c(200,201)))
        1         2 
0.9267217 0.9295508 
#The risk ratio is
0.9295508 / 0.9267217
[1] 1.003053
#The Odds ratio is
(0.9295508/(1-0.9295508)) / (0.9267217/(1-0.9267217))
[1] 1.043333

Compare glu 100 to 101

predict(Pima.glm, type="response", newdata=data.frame(glu=c(100,101)))
        1         2 
0.1538511 0.1594550 
#The risk ratio is
0.1594550 / 0.1538511 
[1] 1.036424
#The Odds ratio is
(0.1594550/(1-0.1594550)) / (0.1538511 /(1-0.1538511 ))
[1] 1.043334

The risk ratios are different, but the odds ratio is a constant 1.043334.
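
A minimal sketch that makes the same point across a range of glucose values, using the fitted Pima.glm from above: the risk ratio drifts with glucose while the odds ratio stays pinned at exp(0.042421).

glu0 <- c(80, 100, 120, 140, 160, 180)
p0 <- predict(Pima.glm, newdata=data.frame(glu=glu0),   type="response")
p1 <- predict(Pima.glm, newdata=data.frame(glu=glu0+1), type="response")
data.frame(glu        = glu0,
           risk.ratio = p1/p0,                       #varies with glu
           odds.ratio = (p1/(1-p1)) / (p0/(1-p0)))   #constant, = exp(beta1)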

32.3 Log Likelihood

Log likelihood calculation: let's compare the true \(Y\) values and the predicted \(Y\) values.

y <- Pima.te$typeTF

y.hat <- predict(Pima.glm, type="response")
pY <- y * y.hat + (1-y)*(1-y.hat)
head(data.frame(y,y.hat,pY),10)
   y      y.hat         pY
1  1 0.58212363 0.58212363
2  0 0.08778183 0.91221817
3  0 0.10235379 0.89764621
4  1 0.06673426 0.06673426
5  1 0.91759616 0.91759616
6  1 0.74933615 0.74933615
7  1 0.28067169 0.28067169
8  0 0.17115736 0.82884264
9  0 0.35394014 0.64605986
10 1 0.28931541 0.28931541

The likelihood is the product of the predicted probabilities of the actual labels: for actual 1s we use y.hat, and for actual 0s we use 1-y.hat.

L <- prod(y * y.hat + (1-y)*(1-y.hat))
L
[1] 1.629075e-71

Maximizing likelihood is equivalent to maximizing the log of the likelihood, and because logarithms turn products into sums, the log likelihood is easier to work with mathematically. Because the likelihood is less than 1, the log likelihood will be negative.

log(L)
[1] -162.9955
l <- sum(log(y * y.hat + (1-y)*(1-y.hat)))
l
[1] -162.9955

The closer the log likelihood is to zero, the better; the closer to \(-\infty\), the worse.

The residual deviance is defined as -2 times the log likelihood. It turns a large negative log likelihood into a large positive number (easier to interpret).

#residual deviance
-2*l
[1] 325.9911

For each observation, we can translate \(P(Y=y \mid \text{model})\) into a log likelihood by taking the log of that probability. Furthermore, we can multiply each of these by -2 to see which data points contribute more (or less) to the total residual deviance.

ll <- log(pY)
head(data.frame(y,y.hat,pY, ll, -2*ll),10)
   y      y.hat         pY         ll  X.2...ll
1  1 0.58212363 0.58212363 -0.5410724 1.0821449
2  0 0.08778183 0.91221817 -0.0918761 0.1837522
3  0 0.10235379 0.89764621 -0.1079793 0.2159585
4  1 0.06673426 0.06673426 -2.7070368 5.4140735
5  1 0.91759616 0.91759616 -0.0859979 0.1719958
6  1 0.74933615 0.74933615 -0.2885676 0.5771352
7  1 0.28067169 0.28067169 -1.2705696 2.5411393
8  0 0.17115736 0.82884264 -0.1877250 0.3754499
9  0 0.35394014 0.64605986 -0.4368631 0.8737262
10 1 0.28931541 0.28931541 -1.2402378 2.4804756
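
Summing the per-observation contributions recovers the residual deviance:

sum(-2*ll)
[1] 325.9911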

32.4 Prediction

Suppose someone has glucose level 150; what is the estimated probability of diabetes? \[\widehat{\text{log odds}}(Y_i=1) = \hat{\beta}_0+\hat{\beta}_1 X_i\]

#Predict log odds
predict(Pima.glm, newdata=data.frame(glu=150))
        1 
0.4163392 
#Predicts probability
predict(Pima.glm, newdata=data.frame(glu=150), type="response")
        1 
0.6026069 
#calculation
log.odds.150 <- sum(Pima.glm$coefficients * c(1, 150))

#Estimated probability = e^log-odds / (1+e^log-odds)
#or
#                      = 1/ (1 + e^ -logodds)
exp(log.odds.150) / (1+ exp(log.odds.150))
[1] 0.6026069
#or
1 / (1+ exp(-log.odds.150))
[1] 0.6026069
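
R's built-in plogis() is exactly this logistic transformation \(1/(1+e^{-z})\), so it gives the same answer:

plogis(log.odds.150)
[1] 0.6026069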

32.5 The null model

The null model is what we compare our "better" model to: it is a logistic model without any predictors. Just as the null model for linear regression predicts \(\hat{y}=\bar{y}\), the null logistic model predicts \(\hat{y}=\) the proportion of 1s in the data set.

pima.glm.null <- glm(typeTF ~ 1, data=Pima.te, family="binomial")
summary(pima.glm.null)

Call:
glm(formula = typeTF ~ 1, family = "binomial", data = Pima.te)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8921  -0.8921  -0.8921   1.4925   1.4925  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.7158     0.1169  -6.125 9.07e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.3  on 331  degrees of freedom
Residual deviance: 420.3  on 331  degrees of freedom
AIC: 422.3

Number of Fisher Scoring iterations: 4
log.odds <- pima.glm.null$coefficients[1]
exp(log.odds) / (1+ exp(log.odds))
(Intercept) 
  0.3283133 
mean(Pima.te$typeTF)
[1] 0.3283133

Let's just double check that by calculating the coefficient manually.

#What is the proportion of 1s in the data?
(prop1 <- mean(Pima.te$typeTF))
[1] 0.3283133
#convert this to a log-odds
log(prop1/(1-prop1))
[1] -0.7158239

This is the same as the constant in the null model.

#null deviance calculation

#log likelihood = sum[ log( y*y.hat + (1-y)*(1-y.hat) ) ]
#for the null model, y.hat = m/n for every observation

#deviance = -2 * log likelihood

n <- nrow(Pima.te)
m <- sum(Pima.te$typeTF)
-2*(m*log(m/n) + (n-m)*log((n-m)/n))
[1] 420.2972
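
This matches the null deviance stored in the fitted null model:

pima.glm.null$deviance
[1] 420.2972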

32.6 Results of logistic regression: confusion matrix

See https://en.wikipedia.org/wiki/Confusion_matrix

pima_logit <- glm(typeTF ~ glu, data=Pima.te, family="binomial")

confusion.table <- table(Pima.te$typeTF, factor(predict(pima_logit, type="response") >= .5, levels=c(FALSE, TRUE)))

colnames(confusion.table) <- c("Predict 0","Predict 1")
rownames(confusion.table) <- c("No diabetes","Yes diabetes")

#confusion.table
TP <- confusion.table[2,2]
TN <- confusion.table[1,1]
FP <- confusion.table[1,2]
FN <- confusion.table[2,1]

t(addmargins(confusion.table))
           
            No diabetes Yes diabetes Sum
  Predict 0         200           53 253
  Predict 1          23           56  79
  Sum               223          109 332

Statistics for a classifier:

Accuracy of a classifier - the number of correct predictions divided by the total number of predictions.

(accuracy <- (TP+TN)/(TP+TN+FP+FN))  
[1] 0.7710843

Sensitivity (aka recall, hit rate, true positive rate, or power) - what proportion of actual positive cases are classified as positive.

(sensitivity <- TP / (TP+FN))
[1] 0.5137615

Specificity (selectivity, true negative rate, \(1-\alpha\)) - what proportion of actual negative cases are classified as negative

(specificity <- TN / (TN+FP))
[1] 0.896861

Precision (positive predictive value) - what proportion of positive predictions are correct

(ppv <- TP/(TP+FP))
[1] 0.7088608

Negative Predictive Value (NPV) - what proportion of negative predictions are correct

(npv <- TN/(TN+FN))
[1] 0.7905138

What was the Type 1 error rate? The false positive rate:

FP / (FP + TN)
[1] 0.103139

Let's just change the threshold from \(p=.5\) to some other number to see what happens to this matrix.

confusion.table <- table(Pima.te$typeTF, factor(predict(pima_logit, type="response") >= .6, levels=c(FALSE, TRUE)))
colnames(confusion.table) <- c("Predict 0","Predict 1")
rownames(confusion.table) <- c("No diabetes","Yes diabetes")
#confusion.table
t(addmargins(confusion.table))
           
            No diabetes Yes diabetes Sum
  Predict 0         210           61 271
  Predict 1          13           48  61
  Sum               223          109 332
#column proportions
prop.table(t(confusion.table), 2)
           
            No diabetes Yes diabetes
  Predict 0  0.94170404   0.55963303
  Predict 1  0.05829596   0.44036697

Looking at the column proportions: the bottom entry of the first column is \(\alpha\), the Type 1 error rate (i.e., the FPR), and the bottom entry of the second column gives the power.

The accuracy is (210 + 48 ) / 332

(210 + 48  ) / 332
[1] 0.7771084

By adjusting the positive prediction threshold we can improve/worsen statistics such as accuracy, specificity, sensitivity, etc.

pima_logit <- glm(typeTF ~ glu, data=Pima.te, family="binomial")
acc <- 0
thresh <- seq(0.01, 0.99, .01)
for(i in 1:length(thresh)){
  confusion.table <- table(Pima.te$typeTF, factor(predict(pima_logit, type="response")>= thresh[i], levels=c(FALSE, TRUE)))
  colnames(confusion.table) <- c("Predict 0","Predict 1")
  rownames(confusion.table) <- c("No diabetes","Yes diabetes")
  #confusion.table
  TP <- confusion.table[2,2]
  TN <- confusion.table[1,1]
  FP <- confusion.table[1,2]
  FN <- confusion.table[2,1]
  acc[i] = (TP+TN)/(TP+FP+TN+FN)
}
plot(thresh, acc, type="l", main="Accuracy by threshold", ylim=c(0,1))
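
A quick way to pull the accuracy-maximizing threshold out of this sweep:

#threshold with the highest accuracy
thresh[which.max(acc)]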

In particular, we can look at the relationship between the true positive rate and the false positive rate, the proportion of actual negatives that are incorrectly predicted to be positive: FP / (FP + TN).

TPR <- 0; FPR <- 0; 
thresh <- seq(0, 1, .001)
for(i in 1:length(thresh)){
  confusion.table <- table(Pima.te$typeTF, factor(predict(pima_logit, type="response")>= thresh[i], levels=c(FALSE, TRUE)))
  colnames(confusion.table) <- c("Predict 0","Predict 1")
  rownames(confusion.table) <- c("No diabetes","Yes diabetes")
  #confusion.table
  TP <- confusion.table[2,2]
  TN <- confusion.table[1,1]
  FP <- confusion.table[1,2]
  FN <- confusion.table[2,1]
  
  TPR[i] <- TP/(TP+FN) #power
  FPR[i] <- FP/(FP+TN) #alpha / significance / type 1 error
}
plot(thresh, TPR, type="n", main="TPR and FPR by threshold", xlim=c(0,1), ylim=c(0,1))
lines(thresh, TPR, lty=1)
lines(thresh, FPR, lty=2)
abline(h=0.05, col="red", lty=3)
legend(x=.8, y=1, legend = c("TPR", "FPR"), lty=c(1,2))

To put things into the language of hypothesis testing we could ask this: What threshold would we use to achieve a 5% significance level (false positive rate) and what would the power of the classifier be?

optimalThresh <- min(which(FPR<=0.05))
thresh[optimalThresh]
[1] 0.613
as.numeric(FPR[optimalThresh])
[1] 0.04484305
as.numeric(TPR[optimalThresh])
[1] 0.4311927
#data.frame(TPR, FPR)

A scatterplot of these rates against each other is called the ROC curve. As the threshold sweeps from 1 down to 0, the curve runs from (0,0) to (1,1).

plot(c(1,FPR,0),c(1, TPR,0), type="s", xlab="False Positive Rate", ylab="True Positive Rate")

The PRROC package produces pretty ROC curves.

#install.packages("PRROC")
library(PRROC)
Loading required package: rlang
PRROC_obj <- roc.curve(scores.class0 = predict(pima_logit, type="response"), 
                       weights.class0=Pima.te$typeTF,
                       curve=TRUE)
plot(PRROC_obj)

The area under the curve (AUC) is a common metric measuring how good your classifier is. An AUC of 0.5 (where the ROC curve is the diagonal line) is a "dumb" classifier, no better than random guessing; an AUC closer to 1 represents a discerning, "good" classifier.
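
As a rough cross-check, here is a minimal sketch that approximates the AUC by the trapezoidal rule, reusing the FPR and TPR vectors from the threshold loop above; it should land close to the AUC that PRROC reports.

ord <- order(FPR, TPR)                        #sort the ROC points left to right
x <- c(0, FPR[ord], 1)                        #pad with the (0,0) and (1,1) endpoints
y <- c(0, TPR[ord], 1)
sum(diff(x) * (head(y,-1) + tail(y,-1)) / 2)  #sum of trapezoid areas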

32.7 Comparing using AIC and BIC

As we'll see next week, the AIC (Akaike Information Criterion) is a useful statistic for comparing logistic models. \[AIC = \text{residual deviance} + 2k \text{ (}k\text{ is the number of coefficients fit)}\] \[BIC = \text{residual deviance} + \ln(n) \cdot k\] When the sample size is large, BIC gives a larger penalty per predictor: \(\ln(8) > 2\), so whenever \(n \geq 8\) BIC penalizes each coefficient in the model more heavily than AIC does and thus favors simpler models. (Here \(n = 332\), so the BIC penalty is \(\ln(332) \approx 5.8\) per coefficient, nearly triple AIC's 2.)

When using either of these comparative statistics, the lower the value the better.

pima.logit1 <- glm(typeTF ~ 1 + glu,data=Pima.te, family="binomial")
pima.logit2 <- glm(typeTF ~ 1 + glu + bmi, data=Pima.te, family="binomial")
summary(pima.logit1)

Call:
glm(formula = typeTF ~ 1 + glu, family = "binomial", data = Pima.te)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2343  -0.7270  -0.4985   0.6663   2.3268  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.946808   0.659839  -9.013   <2e-16 ***
glu          0.042421   0.005165   8.213   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.30  on 331  degrees of freedom
Residual deviance: 325.99  on 330  degrees of freedom
AIC: 329.99

Number of Fisher Scoring iterations: 4
summary(pima.logit2)

Call:
glm(formula = typeTF ~ 1 + glu + bmi, family = "binomial", data = Pima.te)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2240  -0.7216  -0.4421   0.6158   2.3544  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.025302   0.942896  -8.511  < 2e-16 ***
glu          0.039311   0.005257   7.478 7.57e-14 ***
bmi          0.072645   0.020621   3.523 0.000427 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.30  on 331  degrees of freedom
Residual deviance: 312.49  on 329  degrees of freedom
AIC: 318.49

Number of Fisher Scoring iterations: 4
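
R's built-in AIC() and BIC() functions compute both criteria for several models at once; here the two-predictor model comes out lower (better) on both.

#compare the two models on both criteria
AIC(pima.logit1, pima.logit2)
BIC(pima.logit1, pima.logit2)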

32.8 Example 2: A logistic model with a categorical predictor

#Let's use age as a predictor
#Only for age > 50
Pima.te$age50 <- as.numeric(Pima.te$age > 50)

diabetes.aggr <- aggregate(typeTF ~ age50, data=Pima.te, FUN=mean)
diabetes.aggr
  age50    typeTF
1     0 0.3129032
2     1 0.5454545

Among those 50 or younger, the proportion with diabetes is .3129032; among those 51+, it is .5454545.

#Log-odds for the two groups
odds <- diabetes.aggr$typeTF / (1-diabetes.aggr$typeTF)
odds
[1] 0.4553991 1.2000000
logodds <- log(odds)
logodds
[1] -0.7865812  0.1823216
diff(logodds)
[1] 0.9689027

Notice the log-odds are -.7865 and .1823 respectively. If we were to build a model we would say that \[ \hat{LO} = -.7865812 + .9689 \times age_{>50}\]

Pima50 <- glm(typeTF ~ age50, data=Pima.te, family="binomial")
summary(Pima50)

Call:
glm(formula = typeTF ~ age50, family = "binomial", data = Pima.te)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2557  -0.8663  -0.8663   1.5244   1.5244  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.7866     0.1225  -6.422 1.35e-10 ***
age50         0.9689     0.4454   2.176   0.0296 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 420.30  on 331  degrees of freedom
Residual deviance: 415.59  on 330  degrees of freedom
AIC: 419.59

Number of Fisher Scoring iterations: 4
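
As a quick check, the fitted model just reproduces the sample odds: exp(slope) equals the odds ratio between the two age groups that we computed by hand.

exp(Pima50$coefficients[2])  #model-based odds ratio
odds[2] / odds[1]            #sample odds ratio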

32.9 Example 3: Iris dataset

Another classic example is the iris dataset. This famous dataset gives measurements in centimeters of the sepal and petal width and length for 3 species of iris: setosa, versicolor and virginica. The dataset has 150 rows, with 50 samples of each.

Because the logistic model handles binary response variables, we will create a logistic model to predict whether an iris is virginica. Let's build a full model with all 4 predictors.

virginica.model <- glm(Species=="virginica" ~ 1 + Sepal.Length + Sepal.Width+Petal.Length+Petal.Width, data=iris, family="binomial")
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(virginica.model)

Call:
glm(formula = Species == "virginica" ~ 1 + Sepal.Length + Sepal.Width + 
    Petal.Length + Petal.Width, family = "binomial", data = iris)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.01105  -0.00065   0.00000   0.00048   1.78065  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -42.638     25.708  -1.659   0.0972 .
Sepal.Length   -2.465      2.394  -1.030   0.3032  
Sepal.Width    -6.681      4.480  -1.491   0.1359  
Petal.Length    9.429      4.737   1.990   0.0465 *
Petal.Width    18.286      9.743   1.877   0.0605 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.954  on 149  degrees of freedom
Residual deviance:  11.899  on 145  degrees of freedom
AIC: 21.899

Number of Fisher Scoring iterations: 12
plot(predict(virginica.model), iris$Species=="virginica", col=iris$Species)
rng <- range(predict(virginica.model))
xs <- seq(rng[1],rng[2], length.out=100)
ys <- 1/(1+exp(-xs))
lines(xs,ys)

Let’s just contrast this plot with a model that only uses sepal length.

virginica.model.simple <- glm(Species=="virginica" ~ 1 + Sepal.Length, data=iris, family="binomial")
summary(virginica.model.simple)

Call:
glm(formula = Species == "virginica" ~ 1 + Sepal.Length, family = "binomial", 
    data = iris)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9870  -0.5520  -0.2614   0.5832   2.7001  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -16.3198     2.6581  -6.140 8.27e-10 ***
Sepal.Length   2.5921     0.4316   6.006 1.90e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.95  on 149  degrees of freedom
Residual deviance: 117.35  on 148  degrees of freedom
AIC: 121.35

Number of Fisher Scoring iterations: 5
plot(jitter(predict(virginica.model.simple)), iris$Species=="virginica", col=iris$Species)
rng <- range(predict(virginica.model.simple))
xs <- seq(rng[1],rng[2], length.out=100)
ys <- 1/(1+exp(-xs))
lines(xs,ys)

How well did we do? TRUE represents virginica.

table(predict = predict(virginica.model, type="response")>.5, actual = iris$Species=="virginica")
       actual
predict FALSE TRUE
  FALSE    99    1
  TRUE      1   49

A column proportion table will show us the Type 1 and Type 2 error rates and the power:

prop.table(table(predict = predict(virginica.model, type="response")>.5, actual = iris$Species=="virginica"), margin=2)
       actual
predict FALSE TRUE
  FALSE  0.99 0.02
  TRUE   0.01 0.98

Can we achieve a 5% Type 1 error rate? Adjust the threshold. We can find the 95th percentile of the predicted probabilities among the non-virginica flowers:

quantile(predict(virginica.model, type="response")[iris$Species!="virginica"], .95)
       95% 
0.00504067 
prop.table(table(predict = predict(virginica.model, type="response")>0.00504067 , actual = iris$Species=="virginica"), margin=2)
       actual
predict FALSE TRUE
  FALSE  0.95 0.00
  TRUE   0.05 1.00

We require very little evidence to claim the flower is virginica: a predicted probability of 0.00504 is enough. In this case we have a 5% Type 1 error rate, but note that the power has increased to 100%.

32.9.1 When a logistic model cannot converge

If your model is so good that it correctly labels everything as 1 or 0, this is what the output looks like:

setosa.model <- glm(Species=="setosa" ~ 1 + Sepal.Length + Sepal.Width+Petal.Length+Petal.Width, data=iris, family="binomial")
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(setosa.model)

Call:
glm(formula = Species == "setosa" ~ 1 + Sepal.Length + Sepal.Width + 
    Petal.Length + Petal.Width, family = "binomial", data = iris)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-3.185e-05  -2.100e-08  -2.100e-08   2.100e-08   3.173e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -16.946 457457.106       0        1
Sepal.Length     11.759 130504.043       0        1
Sepal.Width       7.842  59415.386       0        1
Petal.Length    -20.088 107724.596       0        1
Petal.Width     -21.608 154350.619       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.9095e+02  on 149  degrees of freedom
Residual deviance: 3.2940e-09  on 145  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

It means that the coefficients separate the predicted log odds so widely that there are no misclassifications. Note above that the algorithm stopped after 25 Fisher scoring iterations. You can set the maximum number of iterations much lower (we use 2 below), and then the model summary will be far more interpretable. But the standard errors of the coefficients will still be high, so you can't say much about the significance of individual predictors.

setosa.model.maxit <- glm(Species=="setosa" ~ 1 + Sepal.Length + Sepal.Width+Petal.Length+Petal.Width, data=iris, family="binomial", control = list(maxit = 2))
Warning: glm.fit: algorithm did not converge
summary(setosa.model.maxit)

Call:
glm(formula = Species == "setosa" ~ 1 + Sepal.Length + Sepal.Width + 
    Petal.Length + Petal.Width, family = "binomial", data = iris, 
    control = list(maxit = 2))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7003  -0.2994  -0.1592   0.2188   0.7140  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)   -2.4428     3.1506  -0.775   0.4381  
Sepal.Length   0.4038     0.8871   0.455   0.6490  
Sepal.Width    1.8157     0.8944   2.030   0.0424 *
Petal.Length  -1.7249     0.8993  -1.918   0.0551 .
Petal.Width   -0.2378     1.5787  -0.151   0.8803  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 190.954  on 149  degrees of freedom
Residual deviance:  14.313  on 145  degrees of freedom
AIC: 24.313

Number of Fisher Scoring iterations: 2

How correlated are the flower measurements? (Strong correlations among the predictors, on top of the near-separation, help explain the large standard errors on the individual coefficients above.)

cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000