Another situation that calls for logistic regression, rather than an anova or t–test, is when you determine the values of the measurement variable, while the values of the nominal variable are free to vary. For example, let's say you are studying the effect of incubation temperature on sex determination in Komodo dragons. You raise 10 eggs at 30°C, 30 eggs at 32°C, 12 eggs at 34°C, etc., then determine the sex of the hatchlings. It would be silly to compare the mean incubation temperatures between male and female hatchlings, and test the difference using an anova or t–test, because the incubation temperature does not depend on the sex of the offspring; you've set the incubation temperature, and if there is a relationship, it's that the sex of the offspring depends on the temperature.
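As a rough illustration of treating sex as the dependent variable, the sketch below fits a logistic regression of the proportion of males on incubation temperature. The temperatures and egg counts follow the example above, but the numbers of male hatchlings are hypothetical, and the small Newton-Raphson fitter `fit_logistic` is an assumption made for this sketch, not a method taken from the text.

```python
from math import exp

# Temperatures and egg counts follow the text; the male counts are HYPOTHETICAL.
temps = [30.0, 32.0, 34.0]
eggs = [10, 30, 12]
males = [2, 15, 10]

def fit_logistic(x, n, k, iterations=25):
    """Fit logit(p) = a + b*x to grouped binomial data by Newton-Raphson."""
    xbar = sum(xi * ni for xi, ni in zip(x, n)) / sum(n)
    z = [xi - xbar for xi in x]  # centre x for numerical stability
    a = b = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + exp(-(a + b * zi))) for zi in z]
        # Gradient of the log-likelihood with respect to a and b
        g0 = sum(ki - ni * pi for ki, ni, pi in zip(k, n, p))
        g1 = sum((ki - ni * pi) * zi for ki, ni, pi, zi in zip(k, n, p, z))
        # Information matrix entries (2 x 2, symmetric)
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        h00 = sum(w)
        h01 = sum(wi * zi for wi, zi in zip(w, z))
        h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a - b * xbar, b  # convert back to the uncentred intercept

def predict(intercept, slope, x):
    """Predicted probability of a male hatchling at temperature x."""
    return 1.0 / (1.0 + exp(-(intercept + slope * x)))

intercept, slope = fit_logistic(temps, eggs, males)
for t in temps:
    print(f"{t:.0f} C: predicted proportion male = {predict(intercept, slope, t):.2f}")
```

Temperature is the variable you set, so it goes on the right-hand side of the model, and the sex of each hatchling is the outcome being predicted.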
We can also compute the simple correlation between Y and the predicted value of Y, that is, $r_{Y,Y'}$. For these data, that correlation is .94, which is also the correlation between X and Y (observed height and observed weight). (This is so because Y' is a linear transformation of X.) If we square .94, we get .88, which is called R-square, the squared correlation between Y and Y'. Notice that R-square is the same as the proportion of the variance due to regression: they are the same thing. We could also compute the correlation between Y and the residual, e. For our data, the resulting correlation is .35. If we square .35, we get .12, which is the squared correlation between Y and the residual, that is, $r^2_{Y,e}$. This is also the proportion of variance due to error, and it agrees with the proportions we got based upon the sums of squares and variances. We could also correlate Y' with e. The result would be zero. There are two separate, uncorrelated pieces of Y, one due to regression (Y') and the other due to error (e).
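The relationships among these correlations can be checked numerically. The sketch below uses a small set of hypothetical heights and weights (not the data behind the .94 and .35 quoted above), fits the least-squares line, and computes each correlation; the helper names `mean` and `pearson_r` are introduced only for this illustration.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

# HYPOTHETICAL heights (X) and weights (Y); not the data used in the text.
X = [61, 63, 65, 67, 69, 71, 73]
Y = [105, 120, 118, 140, 150, 148, 172]

# Least-squares slope and intercept for predicting Y from X
mx, my = mean(X), mean(Y)
slope = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
intercept = my - slope * mx

Y_pred = [intercept + slope * x for x in X]    # Y'
resid = [y - yp for y, yp in zip(Y, Y_pred)]   # e = Y - Y'

r_xy = pearson_r(X, Y)
r_y_pred = pearson_r(Y, Y_pred)
r_y_e = pearson_r(Y, resid)
r_pred_e = pearson_r(Y_pred, resid)

print(f"r(X, Y)  = {r_xy:.4f}")
print(f"r(Y, Y') = {r_y_pred:.4f}   squared: {r_y_pred ** 2:.4f}")
print(f"r(Y, e)  = {r_y_e:.4f}   squared: {r_y_e ** 2:.4f}")
print(f"r(Y', e) = {r_pred_e:.4f}")
print(f"sum of squared correlations = {r_y_pred ** 2 + r_y_e ** 2:.4f}")
```

With any data set fitted this way, the two squared correlations should sum to 1 and the correlation between Y' and e should be zero up to rounding error.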
When you attempt to fit a regression model to the observations, you are trying to explain some of the variation of the observations using this model. If the regression model is such that the resulting fitted regression line passes through all of the observations, then you would have a "perfect" model (see (a) of the figure below). In this case the model would explain all of the variability of the observations. Therefore, the model sum of squares (also referred to as the regression sum of squares and abbreviated $SS_R$) equals the total sum of squares ($SS_T$); i.e., the model explains all of the observed variance:

$$SS_R = SS_T$$
For the perfect model, the regression sum of squares, $SS_R$, equals the total sum of squares, $SS_T$, because all estimated values, $\hat{y}_i$, will equal the corresponding observations, $y_i$. $SS_R$ can be calculated using a relationship similar to the one for obtaining $SS_T$, by replacing $y_i$ with $\hat{y}_i$ in the relationship for $SS_T$. Therefore:

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
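A small numerical check may make the perfect-model case concrete. The sketch below computes the total, regression, and error sums of squares for a simple linear fit, first on hypothetical points that lie exactly on a line and then on hypothetical noisy points; the function name `sums_of_squares` is an assumption made for this example.

```python
def sums_of_squares(x, y):
    """Return (SS_T, SS_R, SS_E) for a least-squares straight-line fit of y on x."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ss_r = sum((yh - y_bar) ** 2 for yh in y_hat)             # regression (model)
    ss_e = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # error (residual)
    return ss_t, ss_r, ss_e

# HYPOTHETICAL data. The first set lies exactly on y = 2 + 3x (a "perfect" model).
x = [1, 2, 3, 4, 5]
y_perfect = [5, 8, 11, 14, 17]
y_noisy = [5.5, 7.2, 11.9, 13.1, 17.4]

perfect = sums_of_squares(x, y_perfect)
noisy = sums_of_squares(x, y_noisy)

print("perfect: SS_T=%.3f SS_R=%.3f SS_E=%.3f" % perfect)
print("noisy:   SS_T=%.3f SS_R=%.3f SS_E=%.3f" % noisy)
```

In the perfect case the error sum of squares is zero, so the regression sum of squares equals the total; in the noisy case the regression sum of squares falls short of the total by the error sum of squares.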
The test for the significance of regression, for the regression model obtained for the data in the table (see ), is illustrated in this example. The null hypothesis for the model is:
The t test is used to check the significance of individual regression coefficients in the multiple linear regression model. Adding a significant variable to a regression model makes the model more effective, while adding an unimportant variable may make the model worse. The hypothesis statements to test the significance of a particular regression coefficient, $\beta_j$, are:

$$H_0: \beta_j = 0 \qquad H_1: \beta_j \neq 0$$
If you only have one observation of the nominal variable for each value of the measurement variable, as in the spider example, it would be silly to draw a scattergraph, as each point on the graph would be at either 0 or 1 on the Y axis. If you have lots of data points, you can divide the measurement values into intervals and plot the proportion for each interval on a bar graph. Here is data on 2180 sampling sites in Maryland streams. The measurement variable is dissolved oxygen concentration, and the nominal variable is the presence or absence of the central stoneroller, Campostoma anomalum. If you use a bar graph to illustrate a logistic regression, you should explain that the grouping was for heuristic purposes only, and the logistic regression was done on the raw, ungrouped data.
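The grouping step for such a bar graph could be sketched as follows. The oxygen values and presence/absence records below are hypothetical (they are not the Maryland stream data), and `proportion_by_interval` is a helper name invented for this illustration.

```python
def proportion_by_interval(values, present, edges):
    """Group values into half-open intervals [low, high) and return, for each
    interval, (low, high, number of observations, proportion present).
    The proportion is None when an interval holds no observations."""
    rows = []
    for low, high in zip(edges[:-1], edges[1:]):
        hits = [p for v, p in zip(values, present) if low <= v < high]
        prop = sum(hits) / len(hits) if hits else None
        rows.append((low, high, len(hits), prop))
    return rows

# HYPOTHETICAL dissolved oxygen (mg/L) and presence (1) / absence (0) records.
oxygen = [1.0, 1.5, 2.2, 2.8, 3.1, 3.9]
stoneroller = [0, 0, 0, 1, 1, 1]

table = proportion_by_interval(oxygen, stoneroller, edges=[1, 2, 3, 4])
for low, high, n, prop in table:
    print(f"{low}-{high} mg/L: n={n}, proportion present={prop}")
```

The grouped proportions are only for display; the logistic regression itself would still be fitted to the ungrouped 0/1 observations.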
where $SS_T$ is the total sum of squares and $dof(SS_T)$ is the number of degrees of freedom associated with $SS_T$. In multiple linear regression, the following equation is used to calculate $SS_T$:
Since it's possible to think of multiple explanations for an association between two variables, does that mean you should cynically sneer "Correlation does not imply causation!" and dismiss any correlation studies of naturally occurring variation? No. For one thing, observing a correlation between two variables suggests that there's something interesting going on, something you may want to investigate further. For example, studies have shown a correlation between eating more fresh fruits and vegetables and lower blood pressure. It's possible that the correlation is because people with more money, who can afford fresh fruits and vegetables, have less stressful lives than poor people, and it's the difference in stress that affects blood pressure; it's also possible that people who are concerned about their health eat more fruits and vegetables and exercise more, and it's the exercise that affects blood pressure. But the correlation suggests that eating fruits and vegetables may reduce blood pressure. You'd want to test this hypothesis further, by looking for the correlation in samples of people with similar socioeconomic status and levels of exercise; by statistically controlling for possible confounding variables; by doing animal studies; or by giving human volunteers controlled diets with different amounts of fruits and vegetables. If your initial correlation study hadn't found an association of blood pressure with fruits and vegetables, you wouldn't have a reason to do these further studies. Correlation may not imply causation, but it tells you that something interesting is going on.
The test statistic for this test is based on the t distribution (and is similar to the one used in the case of simple linear regression models):
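One common form of such a statistic for an individual coefficient is the estimate divided by its standard error, with the standard error taken from the diagonal of the inverse of X'X scaled by the error mean square. The sketch below assumes that form, uses hypothetical data with two predictors, and relies on small helper functions (`transpose`, `matmul`, `inverse`, `coefficient_t_stats`) written only for this illustration.

```python
from math import sqrt

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (for small matrices)."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def coefficient_t_stats(X, y):
    """X: rows of predictor values (no intercept column); y: responses.
    Returns (coefficients, standard errors, t statistics, error dof)."""
    design = [[1.0] + list(row) for row in X]      # add the intercept column
    n, k = len(design), len(design[0])
    Xt = transpose(design)
    C = inverse(matmul(Xt, design))                # (X'X)^-1
    beta = [row[0] for row in matmul(C, matmul(Xt, [[v] for v in y]))]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mse = sse / (n - k)                            # error mean square
    se = [sqrt(mse * C[j][j]) for j in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, se, t, n - k

# HYPOTHETICAL data: y depends strongly on the first predictor, hardly on the second.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 6], [6, 2], [7, 7], [8, 4]]
y = [5.1, 7.8, 11.05, 14.15, 16.9, 20.2, 22.85, 25.95]

beta, se, t, dof = coefficient_t_stats(X, y)
for j, (b, s, tj) in enumerate(zip(beta, se, t)):
    print(f"beta_{j} = {b:.4f}, se = {s:.4f}, t = {tj:.2f} (error dof = {dof})")
```

Each t value would then be compared with critical values of the t distribution on the error degrees of freedom; a value near zero gives no evidence that the corresponding predictor belongs in the model.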
When we plot our initial results on a graph it will usually be clear whether they best fit a linear relationship or a logarithmic relationship or something else, like a sigmoid curve. We can analyse all these relationships in exactly the same way as above if we transform the x and y values as appropriate so that the relationship between x and y becomes linear. BEWARE - you MUST look at a scatter plot on graph paper to see what type of relationship you have. If you simply instruct a computer programme such as Excel to run a regression on untransformed data, it will do this by assuming that the relationship is linear!
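To illustrate the effect of transforming before fitting, the sketch below generates hypothetical data that grow roughly exponentially, then fits a straight line twice: once to the raw y values and once to log(y). The data, the growth rate of 0.5, and the helper `fit_and_r2` are all assumptions made for this example.

```python
from math import exp, log

def fit_and_r2(x, y):
    """Least-squares line of y on x; returns (intercept, slope, r-squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    return my - slope * mx, slope, sxy * sxy / (sxx * syy)

# HYPOTHETICAL data: y = 2 * exp(0.5 * x), perturbed by a few percent.
x = [1, 2, 3, 4, 5, 6, 7, 8]
wobble = [1.05, 0.97, 1.02, 0.95, 1.03, 0.98, 1.04, 0.96]
y = [2.0 * exp(0.5 * xi) * w for xi, w in zip(x, wobble)]

_, slope_raw, r2_raw = fit_and_r2(x, y)
log_y = [log(yi) for yi in y]                      # transform y so the relation is linear
intercept_log, slope_log, r2_log = fit_and_r2(x, log_y)

print(f"untransformed: r-squared = {r2_raw:.3f}")
print(f"log(y) vs x:   r-squared = {r2_log:.3f}, slope = {slope_log:.3f}")
```

The regression on untransformed values still returns an answer, which is why the scatter plot has to be inspected first: nothing in the output of a linear fit says that a straight line was the wrong shape.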
The values of R-sq, R-sq(adj) and S are indicators of how well the regression model fits the observed data. The values of PRESS and R-sq(pred) are indicators of how well the regression model predicts new observations. For example, higher values of PRESS or lower values of R-sq(pred) indicate a model that predicts poorly. The figure below shows these values for the data. The values indicate that the regression model fits the data well and also predicts well.
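For readers who want to see where these quantities come from, the sketch below computes them for a simple straight-line fit, assuming the usual leave-one-out definition of PRESS (each observation is predicted from a model fitted to all the other observations). The data are hypothetical, and `fit_line` and `fit_summary` are names chosen for this illustration rather than functions of any statistics package.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def fit_summary(x, y):
    """Return S, R-sq, R-sq(adj), PRESS and R-sq(pred) for a straight-line fit."""
    n = len(y)
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / n
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    ss_e = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    press = 0.0
    for i in range(n):
        # Refit without observation i, then predict observation i
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (c0 + c1 * x[i])) ** 2
    return {
        "S": sqrt(ss_e / (n - 2)),
        "R-sq": 1 - ss_e / ss_t,
        "R-sq(adj)": 1 - (ss_e / (n - 2)) / (ss_t / (n - 1)),
        "PRESS": press,
        "R-sq(pred)": 1 - press / ss_t,
        "SSE": ss_e,
    }

# HYPOTHETICAL data lying close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

summary = fit_summary(x, y)
for name, value in summary.items():
    print(f"{name:>10} = {value:.4f}")
```

PRESS can never be smaller than the ordinary error sum of squares, so R-sq(pred) is never larger than R-sq; a large gap between the two suggests a model that fits the sample better than it predicts new observations.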