Linear Regression Tutorial for Biology 231


Introduction
Variance and Covariance
Calculating the Slope
Calculating the y-intercept
Estimating Narrow-sense Heritability




Introduction.

It is quite common for researchers to try to determine if the value of one continuous variable depends, at least in part, on the value of another continuous variable. For example, does running speed depend on leg length? Does the percentage of dead roaches depend on the concentration of a poison? Does damage (in millions of dollars) depend on the wind speed of a storm? In all of these cases, the direction of dependence is clear. We don't make someone's legs grow by pushing them from behind. We don't increase the concentration of a poison by stomping on roaches. And we don't expect the wind to pick up if a meteor strikes the mall. In all of these cases, one variable is dependent and one variable is independent.

In many cases, the relationship between a dependent and an independent variable is complex. However, there are times when it makes sense to assume that the relationship is linear, and it is in these cases where the method of linear regression is valuable.

Here is the general problem... you measure the values of the dependent and independent variables for N data points. In this example, N = 10. By convention, you plot the dependent variable on the y-axis (the vertical axis) and the independent variable on the x-axis (the horizontal axis).



The relationship between these variables can be written in standard algebraic form:

y = bx + c,


where b is the slope of the line (Δy / Δx) and c is the y-intercept (the value of y when x = 0). We might be tempted to draw an eye-fitted line through these points. However, which of these is the "best" line?



Clearly there's too much subjectivity. The method of linear regression provides an objective solution to this problem. Basically, it chooses a line that minimizes the squared vertical deviations between the points and the line. For this reason, the method is often referred to as "least squares linear regression."
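
To make the idea concrete, here is a minimal Python sketch with made-up (purely hypothetical) data points. It defines the total squared vertical deviation for any candidate line and shows that the least-squares line gives a smaller total than a couple of eye-fitted guesses.

    # A sketch of the "least squares" idea, using made-up points. For any
    # candidate line y = b*x + c, we add up the squared vertical deviations
    # between the observed points and the line; regression picks the line
    # with the smallest possible total.
    points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # hypothetical data

    def total_squared_deviation(b, c):
        """Sum of (observed y - predicted y)^2 over all points."""
        return sum((y - (b * x + c)) ** 2 for x, y in points)

    print(total_squared_deviation(1.5, 1.0))    # one eye-fitted guess
    print(total_squared_deviation(2.0, 0.5))    # another guess
    print(total_squared_deviation(1.94, 0.15))  # the least-squares line: smallest total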






Variance and Covariance.

The method of linear regression relies on two kinds of measures: sample variance and sample covariance. Variance describes the average squared deviation of the N measures from their mean value. That is, variance increases as the spread of the values increases. If we were able to measure the given value for every individual in a population, then we would know the parametric variance of the population. However, we are usually limited to measuring the value in a subset, or sample, of the population. Hence, we are only able to estimate the parametric variance; we refer to this estimate as the sample variance. [Along the same lines, we are usually unable to measure the parametric mean of a variable, usually represented by μx. Instead, we calculate the sample mean.] Consider the following examples:

First Set of measures    Second Set of measures
2.0                      3.5
1.0                      5.0
3.0                      3.5
5.0                      6.0
2.0                      3.5
7.0                      5.0
2.0                      5.5
9.0                      3.0
6.0                      2.5
3.0                      2.5
Sample sum = 40.0        Sample sum = 40.0
Sample mean = 4.0        Sample mean = 4.0


Clearly, the spread is greater in the first set than in the second set, even though their means are the same. Sample variance is calculated using the following formula:

Varx = Σ(xi - mean x)² / (N - 1)

For the first set of values, the numerator of the formula (the summation of squared deviations from the mean) is 62.0. [We get these by subtracting the mean, 4.0, from each value, then squaring the result. Thus, we add together (-2.0)², (-3.0)², (-1.0)², (+1.0)², etc.] The sample variance for the first set of values is 62.0 / 9 = 6.9. For the second set of values, the sample variance is 14.5 / 9 = 1.6.

By definition, the standard deviation of a distribution is the square root of the variance. The parametric standard deviation is usually written as σx, while the sample standard deviation is usually written as sx. Since standard deviation is the square root of variance, parametric variance can be represented by σx², while sample variance can be represented by sx². The standard deviation of the first set is 2.6, while the standard deviation of the second set is 1.3.
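
If you want to verify these numbers yourself, here is a short Python sketch using the standard statistics module (which, like the formula above, divides by N - 1):

    import statistics

    # The two sets of measures from the table above.
    first_set = [2.0, 1.0, 3.0, 5.0, 2.0, 7.0, 2.0, 9.0, 6.0, 3.0]
    second_set = [3.5, 5.0, 3.5, 6.0, 3.5, 5.0, 5.5, 3.0, 2.5, 2.5]

    print(statistics.variance(first_set))   # 62.0 / 9, about 6.9
    print(statistics.variance(second_set))  # 14.5 / 9, about 1.6
    print(statistics.stdev(first_set))      # about 2.6
    print(statistics.stdev(second_set))     # about 1.3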

Covariance is similar to variance, but it describes how two variables change with respect to each other. If the value of one tends to go up as the value of the other goes up, covariance is positive. However, if the value of one tends to go down as the other goes up, covariance is negative. You probably think of these relationships as proportional or inversely proportional. For example, human weight is usually proportional to height, while a car's gas mileage is usually inversely proportional to weight.

Consider the following data points:

value of x    value of y
3.0           21.0
5.0           26.0
3.0           20.0
7.0           32.0
5.0           23.0
8.0           42.0
7.0           35.0
4.0           24.0
6.0           30.0
2.0           17.0
Sum = 50.0    Sum = 270.0
Mean = 5.0    Mean = 27.0
Varx = 4.0    Vary = 59.3
sx = 2.0      sy = 7.7


It appears that, in general, the value of y increases as the value of x increases. Thus, we expect the covariance to be positive. Covariance is calculated using the following formula:

Covxy = Σ(xi - mean x)(yi - mean y) / (N - 1)

This is best done by building an appropriate table:

value of x   xi - mean x   value of y   yi - mean y   (xi - mean x) x (yi - mean y)
3.0          -2.0          21.0          -6.0          12.0
5.0           0.0          26.0          -1.0           0.0
3.0          -2.0          20.0          -7.0          14.0
7.0           2.0          32.0           5.0          10.0
5.0           0.0          23.0          -4.0           0.0
8.0           3.0          42.0          15.0          45.0
7.0           2.0          35.0           8.0          16.0
4.0          -1.0          24.0          -3.0           3.0
6.0           1.0          30.0           3.0           3.0
2.0          -3.0          17.0         -10.0          30.0
Sum = 50.0                 Sum = 270.0                 Sum = 133.0
Mean = 5.0                 Mean = 27.0
Varx = 4.0                 Vary = 59.3                 Covxy = 14.8
sx = 2.0                   sy = 7.7


In this case, none of the values in the last column is negative. That isn't required for positive covariance; all that matters is that the sum of the values in the last column is greater than zero. That sum is 133.0, and we have 10 points. Therefore, the covariance between x and y is 133.0 / 9 = 14.8.
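
The same calculation can be checked with a short Python sketch (this assumes Python 3.10 or later, where statistics.covariance is available; it uses the same N - 1 denominator):

    import statistics

    # The x and y values from the table above.
    x = [3.0, 5.0, 3.0, 7.0, 5.0, 8.0, 7.0, 4.0, 6.0, 2.0]
    y = [21.0, 26.0, 20.0, 32.0, 23.0, 42.0, 35.0, 24.0, 30.0, 17.0]

    print(statistics.covariance(x, y))  # 133.0 / 9, about 14.8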




Calculating the Slope.

It happens that the slope of the line that minimizes the squared vertical deviations from that line is easily calculated from the sample variance of the independent variable (x) and the sample covariance between x and y:

slope = b = Covxy / Varx


We've already done all the work in the previous example. The sample covariance was 14.8, and the sample variance of x was 4.0. Therefore, we calculate the slope, b, as 14.8 / 4.0 = 3.7. So we now have most of the algebraic formula for the best-fit line through the points:

y = 3.7 x + c


However, the equation is incomplete without a y-intercept (c).
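
As a quick check in Python, the same division can be carried out with the unrounded covariance from the table in the previous section:

    # Slope of the least-squares line: sample covariance divided by the
    # sample variance of x.
    cov_xy = 133.0 / 9  # sample covariance, about 14.8
    var_x = 36.0 / 9    # sample variance of x, 4.0

    print(cov_xy / var_x)  # about 3.69, rounded to 3.7 in the text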




Calculating the y-Intercept.

It turns out that the best-fit line must pass through the point {sample mean x, sample mean y}. We won't get into the mathematical proof, but it's true. Thus, calculating the y-intercept is easy:

If... y = b x + c,
Then... c = y - b x.


We just have to plug in the sample means for x and y...

c = 27.0 - 3.7 x 5.0 = 8.5


Therefore, we have the complete algebraic formula for the line:

y = 3.7 x + 8.5
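
Here is a minimal Python sketch (again assuming Python 3.10 or later for statistics.covariance) that carries out both steps and also predicts a y value for an arbitrary x:

    import statistics

    # The example data used throughout this section.
    x = [3.0, 5.0, 3.0, 7.0, 5.0, 8.0, 7.0, 4.0, 6.0, 2.0]
    y = [21.0, 26.0, 20.0, 32.0, 23.0, 42.0, 35.0, 24.0, 30.0, 17.0]

    # Slope from covariance / variance; intercept from the sample means.
    slope = statistics.covariance(x, y) / statistics.variance(x)
    intercept = statistics.mean(y) - slope * statistics.mean(x)

    print(slope, intercept)         # about 3.7 and 8.5
    print(slope * 8.0 + intercept)  # predicted y at x = 8, about 38.1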


We can now show that this method works by drawing this line through a graph of the points. The actual data points are shown as blue circles in red outline. Two other points are drawn onto the graph: (1) the y-intercept at {0, 8.5}, and (2) an arbitrary point calculated from the formula for the value x = 8 ({8, 38.1}). These are shown as yellow circles in black outline. The regression line is then drawn through these points.



One final note before you try this yourself. The slope and y-intercept calculated by linear regression produce the best-fit least-squares line through the sampled data points. However, just as the means, variance and covariance used in the calculations are not the parametric values, the consequent slope and y-intercept are also not parametric values; they are still estimates of the parameters.

TEST YOUR UNDERSTANDING.

It is the year 2068. You survey supermarkets in seven cities, noting population size and the cost of a dozen eggs. What is the formula of a line that best relates the cost of eggs to population size?

City            Population Size in thousands (x)    Cost of Eggs (y)
Boston          1004                                $58.43
Chicago         1543                                $64.12
Des Moines      4012                                $200.14
Frankfort       589                                 $30.15
New York City   9050                                $455.68
San Francisco   8054                                $389.39
Toledo          7837                                $255.71


Solution


TEST YOUR UNDERSTANDING.

If we accept the formula calculated above, we can use it to predict the price of eggs in other cities. What is the expected price of eggs in Miami (population size 1,789,000)? Conversely, if eggs cost $182.14 in Fargo, estimate its population size.

Solution







Estimating Narrow-Sense Heritability.

Many phenotypic traits vary in a continuous manner among individuals in a population. Obvious examples in humans are height and weight. The sample variance measured for a particular phenotypic trait is represented by VP.

Phenotypic variance can arise even in populations that have no genetic variation. Phenotypic variation can be due to variation in the environment (including the internal environment of the organism and even its environment during gestation). Phenotypic variance due to environmental variation is called VE.

Phenotypic variance can arise in populations that have absolutely no environmental variation (though this may be very difficult to achieve). Here, the phenotypic variation would reflect genetic variation. Phenotypic variance due to genotypic variation is represented by VG.

There can be fairly complex interactions between genes and the environment. We call this genotype-environment interaction, and its component of phenotypic variance is called VGE. It's a little complicated to explain exactly what VGE is, but here's an example to help you think about it. Individuals with the AA genotype grow to an average of 8 feet in Environment J, but only to an average of 4 feet in Environment K. On the other hand, individuals with the genotype aa grow to an average of 4 feet in Environment J and to an average of 8 feet in Environment K. Let's say that half of the individuals with each genotype live in each environment. We would expect an average height of (8 feet + 4 feet)/2 = 6 feet for AA. We would expect an average height of (4 feet + 8 feet)/2 = 6 feet for aa. So, if we don't pay attention to environment, it looks like genotype has no predictable effect on height. Similarly, if half the population is AA and half is aa in each environment, we'd expect the average individual in Environment J to be 6 feet tall, and we'd expect the same for Environment K. Likewise, if we don't pay attention to genotype, environment appears to have no predictable effect. We only see the effects when we consider each genotype in each environment. This is a clear case of genotype-environment interaction.

All components of variance add together to give the total phenotypic variance. So...

VP = VE + VG + VGE.


The fraction of phenotypic variance attributable to genetic variance is called the "broad-sense" heritability (H2):

Broad-sense heritability = H2 = VG / VP


If there is no genetic variation contributing to phenotypic variation, H2 = 0 (because VG = 0). If all of the phenotypic variation is due to genetic variation, then H2 = 1.0.




VG can be divided into three components. One of these is the additive component; additive genetic variance is called VA. This basically refers to the phenotypic variation that arises from the average effects of the different alleles at the relevant genes. In the simplest scenario, every gene would show incomplete dominance, where the average phenotypic value of the heterozygote would fall exactly between the average phenotypic values of the two homozygotes. Also in this simplest scenario, the genotype at one gene will not mask the genotype at another; that is, there is no epistasis. However, in reality, heterozygotes are not necessarily exact intermediates between the two homozygotes (think about complete dominance). And in reality, genes do interact in epistatic ways. Therefore, we have to consider two components of genetic variance that mask, to varying extents, the effects of specific alleles: VD (dominance genetic variance) and VI (epistatic genetic variance). As a rule...

VG = VA + VD + VI.


If you think about this a little, you'll realize that both dominance and epistasis make natural selection on the basis of phenotype less efficient. The effects of individual alleles can be masked by alleles at the same gene (dominance) or by alleles at other genes (epistasis). Individuals with different genotypes can have the same phenotype, and natural selection acts on the phenotype. If the effects of every allele could be "unmasked," natural selection would more effectively choose among genotypes.

The fraction of the total phenotypic variance due to additive genetic variance is called the "narrow-sense" heritability (h2):

Narrow-sense heritability = h2 = VA / VP


It should be clear that this value also ranges between 0.0 and 1.0. It can also never be greater than broad-sense heritability, since VA can never exceed VG. It is the narrow-sense heritability that is most important to breeders, since it predicts how quickly selective breeding can change the average phenotype of the population. Specifically, the response to selection (R = average value at generation 1 minus the average value at generation 0) depends on h2 and the selection differential (S = average value of parents from generation 0 selected to produce generation 1 minus the average value of all individuals in generation 0). The relationship is simple:

R = h2 x S.


From this equation, you should see that, for a given selection differential, the response to selection is proportional to narrow-sense heritability. If there is no additive genetic variance for the phenotype, selection will be ineffective. [It is for this reason that adaptation does not necessarily follow natural selection in natural populations. Differential survival and reproduction on the basis of phenotype will only lead to predictable change if there is non-zero narrow-sense heritability for the trait.]
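
The arithmetic of the breeder's equation is easy to illustrate with made-up numbers (the values below are hypothetical, chosen only to show how the pieces fit together):

    # Suppose generation 0 averages 20.0 units for some trait, the parents
    # selected to breed average 22.0 units, and the narrow-sense
    # heritability of the trait is 0.4.
    h2 = 0.4         # hypothetical narrow-sense heritability
    S = 22.0 - 20.0  # selection differential
    R = h2 * S       # predicted response to selection

    print(R)         # 0.8
    print(20.0 + R)  # expected mean of generation 1: 20.8 units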




We must be careful to recognize that any estimate of heritability is only relevant to a particular population in a particular set of environments. Different populations have different gene frequencies. If you move a given population elsewhere, the set of environments faced by individuals will change. Therefore, VP, VG, VE and VGE are all influenced by the population and the set of environments.

Given this important caveat, we can note the two main ways that narrow-sense heritability is estimated: (1) response to artificial selection, and (2) parent-offspring regression. Both are sensitive to poor experimental design. Since this is a tutorial on regression, we will consider parent-offspring regression.

It should be apparent that the phenotype of offspring may be dependent, in part, on the phenotypes of the parents -- as long as the parents' genes have some influence on phenotype. It should also be clear that parent phenotype ought not depend, in any mechanistic way, on offspring phenotype. For reasons we will not get into here, it turns out that parent-offspring covariance is proportional to the additive genetic variance. If we use the parents' phenotypic variance as a measure of VP, we can estimate h2 from the linear regression formula:

Toffspring = (Covparent,offspring / Varparent) x Tparent + c


where T represents the value of the trait for which we want to estimate narrow-sense heritability. The slope of the line, as you should see, provides our estimate of h2.

The data below are for abdominal bristle number in a lab population of Drosophila mauritiana. The data were collected during the 1995-1996 academic year by Radford University undergraduate Christine Seay as part of her independent study project in my lab. Shown are the data for 38 families; the mother's score (the independent x variable) is on the left, and the mean score of her offspring (the dependent y variable) is on the right. The data are plotted below the table.

Mother   Offspring Mean
17       17.7
16       16.0
22       19.2
17       15.2
15       17.8
15       16.7
16       19.4
15       15.8
16       17.2
17       15.8
15       16.4
18       17.0
18       19.5
22       18.9
15       15.5
19       17.6
15       15.9
14       15.1
16       18.1
17       17.6
15       15.5
15       16.4
17       15.8
17       16.0
16       15.6
19       19.5
19       19.5
19       16.0
16       16.9
16       16.0
18       16.1
16       18.6
18       16.3
15       16.6
15       15.6
15       16.7
18       20.8
19       17.4




The regression of offspring score on mother's score requires calculating the variance of the mother's score and the covariance between the mother's score and the mean score of her offspring. The variance in the mother's score turns out to be 3.630, and the covariance is 1.504. Therefore, the slope of the line is 1.504 / 3.630 = 0.414. The y-intercept is easily calculated (it is 10.088) as described above. We can find y values for two arbitrary x values, plot these on the graph, and draw the regression line through them.



It should be noted that while linear regression chooses the "best-fit" line through the points, there is considerable room for statistical error. The 95% confidence interval for the slope, based on the variance in the scores, is 0.194 to 0.635. An explanation for how this is calculated is available in any good statistics book. Suffice it to say, the more points you have, the narrower will be the 95% confidence interval.
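
For readers who would like to reproduce these numbers, here is a sketch in Python (it assumes the scipy package is installed; scipy.stats.linregress is one common way to get the least-squares slope, intercept, and standard error of the slope, and the confidence interval below uses the usual t-based formula with N - 2 degrees of freedom):

    from scipy import stats

    # The 38 mother / offspring-mean pairs from the table above.
    mothers = [17, 16, 22, 17, 15, 15, 16, 15, 16, 17, 15, 18, 18, 22, 15,
               19, 15, 14, 16, 17, 15, 15, 17, 17, 16, 19, 19, 19, 16, 16,
               18, 16, 18, 15, 15, 15, 18, 19]
    offspring = [17.7, 16.0, 19.2, 15.2, 17.8, 16.7, 19.4, 15.8, 17.2, 15.8,
                 16.4, 17.0, 19.5, 18.9, 15.5, 17.6, 15.9, 15.1, 18.1, 17.6,
                 15.5, 16.4, 15.8, 16.0, 15.6, 19.5, 19.5, 16.0, 16.9, 16.0,
                 16.1, 18.6, 16.3, 16.6, 15.6, 16.7, 20.8, 17.4]

    result = stats.linregress(mothers, offspring)
    print(result.slope, result.intercept)  # about 0.414 and 10.09

    t_crit = stats.t.ppf(0.975, len(mothers) - 2)  # 36 degrees of freedom
    print(result.slope - t_crit * result.stderr)   # lower limit, about 0.19
    print(result.slope + t_crit * result.stderr)   # upper limit, about 0.64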

Christine's regression analysis was of mean offspring on mother. It should be obvious that the father's genes also affected the offspring. Mothers and fathers were paired randomly, so they may have had very different scores. Regression of offspring on only one parent underestimates narrow-sense heritability by about 50%. Therefore, Christine's analysis suggested a narrow-sense heritability of 2 x 0.414 = 0.828. This is a very high estimate. However, there is considerable statistical error associated with the estimate, and a narrow-sense heritability as low as 0.388 would fall within the 95% confidence interval.