What did that professor say? Statistics made easy

We are surrounded by numbers every day. You may not realize it, but statistics plays a large role in our daily lives as well. Weather forecasters feed numbers into weather models to make predictions about the weather. Disease models for turfgrass provide a similar service: based on numbers related to temperature, humidity and leaf wetness, these models can forecast the onset of a turfgrass disease. Pest control products are tested for their effectiveness in controlling pests. Statistics are behind every medical study and batting average you hear about. Soon we will be bombarded with political voter polls.


Statistics is a set of mathematical methods used to analyze what is happening in the world around us. It is a science of decision making. It is a science of “chance” or “probability.” It is the science of collecting, organizing, and interpreting data, whether numerical or non-numerical. We live in an information and technological age where we have everything at our fingertips. H.G. Wells, the father of science fiction, predicted that statistical thinking would be as necessary for daily living as reading and writing. Statistics may seem intimidating at first, but it is not once you develop a clear understanding of this simple subject.


Basic understanding of terms


Before we start, we need to discuss and understand some basic terms. Descriptive statistics are used to describe sets of numbers, such as plant heights achieved after applications of fertilizers. Researchers can organize these numbers into tables and graphs called frequency distributions (how often each value occurs). The following data set shows measurements of plant heights in centimeters after a fertilizer application. We will use these data to help us define some terms.


Plant Heights (cm) due to Fertilizer Applications

10    14    11    12    15
15    12    13    14    13
12     8    12     9    10
13    11    12     8    10
 9    16     7    11     9


 


As we look at this simple data set, we can determine a median, a mean, and a standard deviation. The median is the measurement that lies in the middle of the sorted data, at the 50th percentile. In this example, it is 12 (the data range from 7 to 16). At times, it is better to report the median rather than the average (also known as the mean, see below), especially if the data contain outliers. When NBA salaries are being discussed, for example, a few very large contracts pull the mean upward, so the median can be a better indicator of the true center.


 


Plant Height (cm)    Frequency    Percent    Percentile
7                    1            4          4
8                    2            8          12
9                    3            12         24
10                   3            12         36
11                   3            12         48
12                   5            20         68
13                   3            12         80
14                   2            8          88
15                   2            8          96
16                   1            4          100
Totals               25           100
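The columns of the frequency table can be rebuilt from the raw heights. The following is a minimal Python sketch, not part of the original study; the variable names are illustrative, and it assumes "Percentile" means the running (cumulative) percent, which is what the table values suggest.

```python
from collections import Counter

# Plant heights (cm) from the data set in this article.
heights_cm = [10, 14, 11, 12, 15, 15, 12, 13, 14, 13, 12, 8, 12,
              9, 10, 13, 11, 12, 8, 10, 9, 16, 7, 11, 9]

counts = Counter(heights_cm)   # how often each height occurs
n = len(heights_cm)            # total number of observations

rows = []
running_count = 0
for height in sorted(counts):
    freq = counts[height]
    running_count += freq
    percent = 100 * freq / n               # share of all observations
    percentile = 100 * running_count / n   # cumulative share up to this height
    rows.append((height, freq, percent, percentile))

print("Height  Frequency  Percent  Percentile")
for height, freq, percent, percentile in rows:
    print(f"{height:>6}  {freq:>9}  {percent:>7.0f}  {percentile:>10.0f}")
```

The median of 12 cm can be read from the output: it is the first height whose cumulative percent reaches 50.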

 


 


The mean is simply the average for the data set: the sum of each plant height multiplied by its frequency (286 cm) divided by the number of observations (25), which gives 11.44 cm. The standard deviation (SD = 2.38 cm) indicates how far individual data points typically vary from the mean, that is, how concentrated the data are around the mean. So why is this important? Without standard deviation, you cannot get a feel for how close the data are to the mean or whether the data are spread out over a wide range. Without standard deviation, you cannot compare two data sets effectively. Two data sets can have the same mean but vary greatly in how concentrated the data are around the mean, and therefore have different standard deviations.
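As a check on the arithmetic, here is a short Python sketch using the standard library `statistics` module. It assumes the reported SD is the sample standard deviation (dividing by n - 1), which is the version that reproduces 2.38. The two small data sets at the end are made up for illustration and do not come from any trial.

```python
import statistics

# Plant heights (cm) from the data set in this article.
heights_cm = [10, 14, 11, 12, 15, 15, 12, 13, 14, 13, 12, 8, 12,
              9, 10, 13, 11, 12, 8, 10, 9, 16, 7, 11, 9]

total = sum(heights_cm)                 # 286
n = len(heights_cm)                     # 25
mean = total / n                        # 286 / 25
median = statistics.median(heights_cm)  # middle value of the sorted data
sd = statistics.stdev(heights_cm)       # sample SD, divides by n - 1

print(f"mean = {mean:.2f} cm, median = {median} cm, SD = {sd:.2f} cm")
# mean = 11.44 cm, median = 12 cm, SD = 2.38 cm

# Two made-up data sets (NOT from the article) with the same mean
# but very different spread around it.
tight = [10, 11, 12, 13, 14]
wide = [2, 7, 12, 17, 22]
for name, data in (("tight", tight), ("wide", wide)):
    print(f"{name}: mean = {statistics.mean(data):.2f}, "
          f"SD = {statistics.stdev(data):.2f}")
# tight: mean = 12.00, SD = 1.58
# wide: mean = 12.00, SD = 7.91
```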


The distribution of a data set can be a graph of all values and their frequency of occurrence. One of the most common distributions is called the normal distribution or bell-shaped curve displaying numerical data in a symmetrical curve.



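The center of the bell is the mean and most of the data is usually centered on the mean.

[Figure: bell-shaped normal distribution curve with the mean at the center and colored areas marking 1, 2 and 3 standard deviations (SDev) on either side of the mean.]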


 


The red area represents the data within one standard deviation +/- from the mean, about 68% of the data (34% on either side of the average). The green area extends to two standard deviations +/- from the mean, so red plus green covers about 95% of the data under the curve. The blue area extends to three standard deviations +/- from the mean, so all three areas together cover about 99.7% of the data. Since data sets can differ in both mean and standard deviation, an infinite number of normal distribution curves exist.
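The percentages for one, two and three standard deviations come from the shape of the normal curve itself. As a rough sketch (the function name `fraction_within` is only illustrative), they can be computed with the error function in Python's standard `math` module:

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(k) * 100:.1f}%")
# within 1 SD of the mean: 68.3%
# within 2 SD of the mean: 95.4%
# within 3 SD of the mean: 99.7%
```

These values hold for any normal distribution, whatever its mean and standard deviation. Real data sets only approximate them.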


Confidence intervals (CI) express how reliable an end result is, based on some treatment, perhaps to a human being or a plant, in repeatable trials. The confidence level, usually set by the researcher, is stated as a percentage. When we say, “we are 95% confident that this herbicide application will provide 98% control of dandelion,” we mean that if the trial were repeated many times, we would expect the stated result to hold in about 95% of those trials. In practice, confidence intervals are typically stated at the 95% confidence level. However, they can be shown at other confidence levels, such as 68% or 99%. When a research trial is conducted, the confidence level is the complement of the respective level of significance, i.e., a 95% confidence interval reflects a significance level of 0.05, referred to as alpha (α). How precisely a result can be stated at a given level of confidence often depends on the number of observations, with more observations yielding a narrower interval.
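To make the idea concrete, here is a hedged Python sketch that puts a 95% confidence interval around the mean plant height from the earlier data set. This calculation is not part of the original article. It uses the common textbook t-based interval, and the critical value 2.064 is an assumption read from a standard t-table for 24 degrees of freedom.

```python
import math
import statistics

# Plant heights (cm) from the data set in this article.
heights_cm = [10, 14, 11, 12, 15, 15, 12, 13, 14, 13, 12, 8, 12,
              9, 10, 13, 11, 12, 8, 10, 9, 16, 7, 11, 9]

alpha = 0.05            # level of significance
confidence = 1 - alpha  # 0.95, the complement of alpha

n = len(heights_cm)
mean = statistics.mean(heights_cm)
sd = statistics.stdev(heights_cm)   # sample SD

# ASSUMPTION: two-sided 95% t critical value for n - 1 = 24 degrees
# of freedom, taken from a standard t-table rather than computed here.
t_crit = 2.064

margin = t_crit * sd / math.sqrt(n)  # half-width of the interval
low, high = mean - margin, mean + margin

print(f"{confidence:.0%} CI for the mean height: {low:.2f} to {high:.2f} cm")
# 95% CI for the mean height: 10.46 to 12.42 cm
```

Because the margin divides by the square root of the number of observations, collecting more data narrows the interval at the same confidence level.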


When data are collected, researchers typically look for something unusual or out of the ordinary and often ask whether it is significantly different from a norm. Would a difference this large happen only with a very small probability just by chance? Least Significant Difference (LSD) is a measure of significance: the smallest difference between two treatment averages that is considered statistically significant at a chosen level of significance, usually α = 0.05. It is denoted as LSD (α = 0.05) or LSD (0.05). We will revisit the use of this term when we show an example of a data table and bar graph.


Experimental designs


How an experiment is designed can make the difference between the collection of good data and bad data. The objective of experiments is to make comparisons of treatments that will support a thought or hypothesis about an area of interest. Treatments can include applications of fertilizers or pesticides, the incorporation of a cultural practice, the evaluation of disease-resistant turfgrass cultivars, or combinations thereof. While comparisons among treatments are important, so are comparisons to an untreated control, which shows what would have happened if nothing had been applied and so reveals the true effect of each treatment. The untreated control establishes a baseline for comparison. Collecting good data and then applying the proper data analysis is important for drawing appropriate conclusions about the experiment.


In experimental designs, data (measurements/observations) are usually subject to various uncertain external factors. Treatments and full experiments are usually repeated (replicated) to help identify sources of variation and to better estimate the true effects of the treatments, thereby strengthening the reliability and validity of the experiment. Statistically, replications help to reduce experimental error due to unknown or uncontrollable factors (e.g., variations in soils). Replicating treatments within an experiment is as important as repeating entire experiments to see if results can be repeated with confidence.


Randomization is also an important component of experimental design. One way to minimize bias in an experiment is to randomize treatments. This will become clearer as we look at some experimental designs.


Two common experimental designs that you may hear of in a seminar or conference presentation are illustrated below.


Complete Randomized Block Designs, more often called randomized complete block designs (RCBD), are among the simplest and most common experimental designs for field trials. Here, you may be looking at the effects of one type of treatment, e.g., herbicide effectiveness. Treatments can be replicated three, four or more times depending on the type of trial. Disease trials tend to have more replications due to the high variability among treatments from replication to replication. Each treatment appears once in every block.
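One way such a layout could be generated is to shuffle the treatment numbers separately for each replicate. The Python sketch below is only an illustration: the function name `randomized_block_layout`, the seed, and the choice of seven treatments and three replicates are assumptions, not a description of how any particular trial was laid out.

```python
import random

def randomized_block_layout(n_treatments, n_replicates, seed=None):
    """Return one independently shuffled treatment order per replicate (block)."""
    rng = random.Random(seed)  # a seed makes the layout reproducible
    layout = []
    for _ in range(n_replicates):
        block = list(range(1, n_treatments + 1))  # every treatment once
        rng.shuffle(block)                        # new random order per block
        layout.append(block)
    return layout

# Hypothetical example: 7 treatments in 3 replicates.
layout = randomized_block_layout(n_treatments=7, n_replicates=3, seed=1)
for number, block in enumerate(layout, start=1):
    print(f"Replicate {number}:", *block)
```

Each printed row contains every treatment number exactly once, in a different random order. Shuffling within each replicate, rather than across the whole trial, is what keeps a complete set of treatments in every block.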


Complete Randomized Block Design

Replicate 1    7    4    6    1    3    5    2
Replicate 2    6    4    1    7    5    3    2
Replicate 3    5    7    2    3    1    4    6


 


You will note that seven treatments are completely randomized in each of three replications or blocks. The treatment numbers can correspond to a treatment list.


Treatment No.    Treatments

1                Untreated control
2                Herbicide A, Rate 1
3                Herbicide A, Rate 2
4                Herbicide B, Rate 1
5                Herbicide B, Rate 2
6