Excellent. You have DATA! To a scientist, having new data is like having an unopened birthday present... so much excitement, so much potential! So, did we get a new toy or a sweater?
Inferential statistics are used to make inferences - to draw conclusions - from data. Scientists use inferential statistics to make objective decisions about what data tell us, rather than relying solely on their own opinion. That is, well-implemented and well-interpreted statistics provide evidence for our conclusions that others can evaluate.
As described before, we constrained the type of data you collected to that which could be analyzed with simple linear regression, or simply, regression. Regression is a technique that tries to identify a (typically) linear relationship between two numerical variables. That is, as we change the value of one variable, does the value of the other variable change in a predictable way?
Ordinarily, we have some inkling that one variable responds to the other - the dependent variable responds to changes in the independent variable. In experiments we might vary the independent variable and look for changes in the dependent variable. Or, in an observational study, we may think, for instance, that an ant's body size determines how big a leaf it can carry. We plot the independent variable on the x axis and the dependent variable on the y axis.
Look at the data plotted to the left. Here someone measured both the body length of 16 individual ants, and the area of the leaves they were carrying. Do you see a pattern here? How would you describe it in words?
We will leave the details of calculation of linear regression to another day, but it is an easy process to understand visually. The computer will try to fit a straight line to the data that minimizes the (sum of the squared) vertical distances between the points and the line (see second graph, on the right).
This is sometimes called the line of best fit, or the regression line. Imagine, just for comparison, a line on the graph at the left with a flat (i.e. zero) slope... the vertical distances from each point to the line would be much larger... it wouldn't fit the data particularly well.
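The idea above can be sketched in a few lines of Python. Note that the data here are invented for illustration (they are NOT the 16 ant measurements from the graph), and numpy is assumed to be available:

```python
import numpy as np

# Hypothetical data: ant body length in mm (x), leaf area in cm^2 (y).
x = np.array([4.0, 5.5, 6.0, 7.2, 8.1, 9.0, 10.3, 11.5])
y = np.array([0.05, 0.18, 0.22, 0.33, 0.41, 0.47, 0.60, 0.71])

# Least-squares fit: find the slope m and intercept b that minimize the
# sum of squared vertical distances between the points and the line.
m, b = np.polyfit(x, y, 1)

def sse(slope, intercept):
    """Sum of squared vertical distances from each point to a line."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))

fit_error = sse(m, b)                # error for the line of best fit
flat_error = sse(0.0, y.mean())      # error for a flat (zero-slope) line
# The best-fit line always has an error no larger than the flat line's;
# for data with a clear trend, it is much smaller.
```

Comparing `fit_error` to `flat_error` makes the "flat line" thought experiment from the paragraph above concrete: the flat line fits these trending data far worse.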
Regression analysis gives us an equation for the line of best fit - with a slope and a y-intercept. Remember y = mx + b? The magnitude of the slope is informative - it tells us HOW MUCH the y axis variable changes with a unit change in the x axis variable. Moreover, we can use the equation to make predictions: in the graph to the left, if an ant's body length is 10mm (x=10), then the leaf area we would expect it to be carrying (y) would be 0.086*(10) - 0.2922, or 0.57 cm^2.
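That prediction is just plugging x into the fitted equation. A minimal Python sketch, using the slope and intercept quoted in the text:

```python
# Fitted equation from the text: leaf area = 0.086 * body length - 0.2922
def predicted_leaf_area(body_length_mm):
    """Predict leaf area (cm^2) from ant body length (mm)."""
    return 0.086 * body_length_mm - 0.2922

# A 10 mm ant: 0.086 * 10 - 0.2922 = 0.5678, i.e. about 0.57 cm^2
area = predicted_leaf_area(10)
```

The same function works for any body length within the range of the original data; extrapolating far beyond the measured ants would be risky.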
Regression analysis also gives us a value called R^2, R squared. This tells us how much of the variation in the y axis variable's values is accounted for by the variation in the x axis variable's values.
One way to think about it is this: if you could tell EXACTLY what the y axis value would be if you knew the x axis value, there is a perfect relationship between the two, and R^2 = 1. All the data points would lie exactly on the line of best fit. If knowing the x axis value gave you no clue at all about the y axis value, the two variables are unrelated and R^2 = 0. Very roughly speaking, R^2 tells us about the strength of the relationship between the two variables.
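These two extremes can be checked numerically. A small sketch, assuming numpy and using invented data (for simple linear regression, R^2 equals the squared correlation coefficient):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def r_squared(x, y):
    """For simple linear regression, R^2 is the squared correlation."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# Perfect relationship: every point lies exactly on a line, so R^2 = 1.
y_line = 2.0 * x + 1.0

# No linear relationship: y zig-zags with no trend in x, so R^2 = 0.
y_zigzag = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
```

Real data fall between these extremes, which is where the interpretive judgment discussed next comes in.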
Well, what about an R^2 of 0.3, or 0.8... what do those mean? There is no simple answer. In physics or chemistry, when there is a single very strong cause or mechanism controlling the outcome, we typically expect very high R^2 values... like for the concentration of a chemical regressed against the absorbance in a spectrophotometer. More chemical equals more absorbance, and very little else controls or contributes to absorbance. In ecology, most of the processes or characteristics we are interested in have many contributory factors - the growth of a tree, for example, might respond to average climate, recent weather, local soils, local plant competition, etc. Now, if we were to regress tree growth against just one of those factors, we wouldn't really expect it to explain all of the variation in growth... realistically it will only explain a small part. So, there is a bit of an art to interpreting R^2 values... but for tree growth, explaining 30% of the variation (R^2 = 0.30) might be a real achievement!
Lastly, regression analysis gives us a p-value. Strictly speaking, the p-value from a regression analysis is the probability of observing a relationship as strong as (or stronger than) the one we did, if the null hypothesis - that there is no relationship between the two variables - were true. A little convoluted, yes, but the logic is this: if the p-value is very low, data like ours would be unlikely to arise by chance from two variables that are in fact unrelated. It is still possible, but rather unlikely. Thus, if the null hypothesis seems unlikely to be true, we can be confident that there is actually a relationship between the variables we measured. If the p-value is high, we have little confidence that the variables are actually related. By convention, p-values less than 0.05 (a 5% chance) are taken as evidence for a "significant" relationship - in other words, we are sufficiently confident to conclude there is a relationship. P-values larger than 0.05 suggest we should not be confident that a real relationship exists between the variables.
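In practice, one function call reports the slope, intercept, R^2, and p-value together. A sketch using scipy (assumed available), again with invented data rather than the actual ant measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical data: ant body length in mm (x), leaf area in cm^2 (y).
x = np.array([4.0, 5.5, 6.0, 7.2, 8.1, 9.0, 10.3, 11.5])
y = np.array([0.05, 0.18, 0.22, 0.33, 0.41, 0.47, 0.60, 0.71])

result = stats.linregress(x, y)
# result.slope, result.intercept : the line of best fit
# result.rvalue ** 2             : R^2, variation in y explained by x
# result.pvalue                  : p-value for the null of no relationship
r_squared = result.rvalue ** 2
```

For strongly trending data like these, `r_squared` is close to 1 and the p-value falls well below the conventional 0.05 threshold.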
So, if the p-value for a linear regression analysis of the above data is p=0.001, and the R^2 is 0.80, what would you conclude about these data?
The most often regurgitated axiom of statistics is that "correlation doesn't imply causation" - which means that simply because we see a relationship between two variables doesn't mean we have demonstrated that one factor CAUSES the other. True. However, a strong relationship is still CONSISTENT with a cause-effect relationship.

Need more background? Watch these videos: introduction to regression 1 and 2