Describing The Data

      Descriptive Statistics
      Central Tendency and Spread or Dispersion of Distributions
      Descriptive statistics are used to describe the basic features of the data. They provide summaries about the sample and the variables. Along with graphs, they form the basis of the quantitative analysis of data.

      Descriptive statistics are typically distinguished from inferential statistics. Descriptive statistics describe the data, while inferential statistics allow us to infer from the sample what is occurring in the population.  Inferential statistics are also used to judge how probable it is that a difference we observe between groups is dependable rather than due to chance or sampling error in the study.

      Descriptive statistics are used to simplify large amounts of data and present them in a manageable form.  Each descriptive statistic reduces the data to a summary. For instance, consider the GPA (Grade Point Average). This single number describes the general performance of a student across a potentially wide range of courses.

      When we describe a large set of observations with a single indicator, we risk distorting the original data or losing important detail. The GPA doesn't tell us how difficult the courses were or what field the courses were in.  In spite of these limitations, descriptive statistics give us a powerful summary that may enable us to compare across people, groups, or other units.

      There are three major characteristics of a single variable that we should look at:
      • The distribution is a summary of the frequency of individual values (or ranges of values) for a variable. The simplest distribution lists every value of a variable and the number of persons at each value. For example, we would describe gender by the number or percent of males and females in our sample. For variables with a large number of possible values and relatively few people at each one, we group the scores into categories according to ranges of values. For example, we might group income into four or five ranges.

        One way to describe the distribution of a single variable is with a frequency distribution. Depending on the variable, we can list every possible value, or we can first group the values into ranges and then determine the frequency of each range. Frequency distributions can be presented in a table or as a graph (histogram or bar chart).
         
      • The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major statistics used to estimate central tendency:
        • Mean
        • Median
        • Mode
      • Dispersion refers to the spread of the values around the center of the data.  Two common measures of dispersion are the range and the standard deviation. The range is the highest value minus the lowest value.
      In most situations, we would describe all three of these characteristics for each of the variables in our study.  How we define the center depends on the level of measurement of the variable we are talking about.
       
         
    Set A:  12, 34, 36, 42, 52, 54, 68, 72, 81, 93
    Set B:  152, 154, 155, 155, 156, 158, 159, 161, 163, 163
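As a minimal sketch of the frequency distributions described above, the Python below tabulates Set B value by value and Set A by ranges. The 20-point bin width and the variable names are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

set_a = [12, 34, 36, 42, 52, 54, 68, 72, 81, 93]
set_b = [152, 154, 155, 155, 156, 158, 159, 161, 163, 163]

# Ungrouped frequency distribution: every distinct value and its count.
freq_b = Counter(set_b)
for value in sorted(freq_b):
    print(value, freq_b[value])

# Grouped frequency distribution: each score is assigned to a range.
BIN_WIDTH = 20  # assumed width, chosen only for illustration

# Key each range by its lower bound, e.g. 34 falls in the range starting at 20.
grouped_a = Counter((score // BIN_WIDTH) * BIN_WIDTH for score in set_a)
for low in sorted(grouped_a):
    print(f"{low}-{low + BIN_WIDTH - 1}: {grouped_a[low]}")
```

Set A has no repeated values, so listing each value separately would say little; grouping into ranges is what makes its shape visible.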
         


      Nominal Variables

      Measures of Central Tendency

      The mode is the only measure of central tendency that we can use with nominal data. The mode is the most frequently occurring score (or value) for a particular variable. A distribution of scores does not always have a mode. For example, if we have scores of 293, 154, 167, and 52, every score occurs exactly once, so there is no mode.
      What is the mode of Set B, above?   Set B is actually bimodal: two scores, 155 and 163, occur more frequently than the other scores and with the same frequency as each other.  In some distributions there is more than one modal value. In a bimodal distribution there are two values that occur most frequently.

      The mode is an unstable measure: minor changes in the data can change it substantially. It is, however, our only choice for nominal-level data.
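The definition of the mode can be sketched in a few lines of Python. The helper name `modes` and the color data are hypothetical, and the rule that a distribution with no repeated value has no mode follows the convention used in these notes (other sources treat every value as a mode in that case).

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Convention assumed here: if no value repeats, report no mode.
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)

set_b = [152, 154, 155, 155, 156, 158, 159, 161, 163, 163]

print(modes(set_b))                             # [155, 163] (bimodal)
print(modes([293, 154, 167, 52]))               # [] (no mode)
print(modes(["red", "blue", "red", "green"]))   # ['red'] (made-up nominal data)
```

Because the function only counts occurrences, it works for nominal categories as well as for numbers.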

      Measures of Spread

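      The range measures the spread of scores found in the data. Because it is computed from the smallest and largest values, it requires values that can at least be ordered; for a purely nominal variable, whose categories have no order, spread is instead described by the frequency distribution (the number or percent of cases in each category).  Outliers or extreme values can distort the range.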
      The range is the distance from the smallest to the largest score: in Set A at the top of this page, Range = 93 - 12 = 81.
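A short Python sketch of the range calculation, including how a single extreme value distorts it. The function name `value_range` and the outlier of 500 are made up for illustration.

```python
def value_range(scores):
    """Range = highest value minus lowest value."""
    return max(scores) - min(scores)

set_a = [12, 34, 36, 42, 52, 54, 68, 72, 81, 93]

print(value_range(set_a))            # 81

# Adding one hypothetical extreme score changes the range sharply,
# even though the other ten scores are unchanged.
print(value_range(set_a + [500]))    # 488
```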
       

      Ordinal Variables

      Measures of Central Tendency

      Both the median and the mode can be used to measure central tendency when the variable is ordinal.  The median is the middle score in the distribution once the scores are arranged in order.  In the distribution 31, 33, 36, 48, 79, the median is 36.  When we have an even number of cases, such as 100, 158, 160, 195, the median is the score that would fall half-way between the two center values:
    Median = (158 + 160)/2 = 159.
      The median is a good measure of central tendency for ordinal data.  It is sometimes useful with interval/ratio data.
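The two median examples above can be checked with Python's standard `statistics` module. This is only a sketch; the use of `median_low` for ordinal codes is a suggestion of this example, not something the notes prescribe.

```python
import statistics

odd_case = [31, 33, 36, 48, 79]
even_case = [100, 158, 160, 195]

# Odd number of cases: the middle score itself.
print(statistics.median(odd_case))       # 36

# Even number of cases: half-way between the two center values.
print(statistics.median(even_case))      # 159.0

# For ordinal codes, averaging two categories may not be meaningful;
# median_low returns the lower of the two center values instead.
print(statistics.median_low(even_case))  # 158
```

`statistics.median` sorts the data itself, so the scores do not have to be supplied in order.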

      Measures of Spread

      The dispersion of ordinal variables is also measured using the range.
       

      Interval, Ratio Variables

      Measures of Central Tendency

      The mean only has substantive meaning when used with interval/ratio variables.  The mean (average or arithmetic mean) is the sum of the scores in the distribution divided by the number of scores. For example, the mean of 12, 13, 23, 43, 32 is:
      Mean = (12 + 13 + 23 + 43 + 32)/5 = 24.6
      If the distribution is "normal" (i.e., bell-shaped), the mean, median and mode are all equal to each other.

      Properties of Mean and Median

      The most important difference between the mean and the median is illustrated in the example below. What happens to a set of five scores when the largest score is increased by 100 points?

      Set A:           12, 13, 23, 32, 43       Mean = 24.6      Median = 23
      Set A Altered:   12, 13, 23, 32, 143      Mean = 44.6      Median = 23

      The mean is affected by extreme scores (known as outliers), while the median remains the same.  When extreme scores occur in your data, you should report both the mean and the median as measures of central tendency.
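The comparison can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

original = [12, 13, 23, 32, 43]
altered = [12, 13, 23, 32, 143]   # largest score raised by 100 points

for label, scores in (("original", original), ("altered", altered)):
    mean = statistics.mean(scores)      # sum of scores / number of scores
    median = statistics.median(scores)  # middle score of the sorted list
    print(label, mean, median)

# The mean moves from 24.6 to 44.6, while the median stays at 23.
```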
       

      Measures of Spread (Variability, Heterogeneity)

      The two most common measures of the variability of a distribution are the variance and the standard deviation.  The standard deviation indicates how far the scores typically lie from the mean of the sample. Normal distributions, which are important in both descriptive and inferential statistics, are completely determined by two "parameters": the mean and the variance.
      The variance is the average squared difference of the scores from the mean score; when it is computed from a sample, the sum of squared differences is divided by n - 1 rather than n. The variance describes the heterogeneity of a distribution and is calculated from a formula using every score in the distribution. It is typically symbolized as "s²". The formula is:

      $$ \text{Variance} = s^2 = \frac{\sum_{\text{all scores}} (\text{Score} - \text{Mean})^2}{n - 1} $$
      The square root of the variance is known as the "standard deviation."  It is symbolized by "s".
                          
      $$ \text{Standard Deviation} = s = \sqrt{\text{Variance}} = \sqrt{s^2} $$
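As a hedged sketch, the two formulas above can be written out step by step in Python and compared with the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1. The function name `sample_variance` and the choice of the five example scores are assumptions of this illustration.

```python
import math
import statistics

scores = [12, 13, 23, 32, 43]

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

variance = sample_variance(scores)
std_dev = math.sqrt(variance)   # standard deviation = square root of variance

print(round(variance, 2), round(std_dev, 2))   # 172.3 13.13

# The library functions should agree with the hand-written formula.
print(statistics.variance(scores), statistics.stdev(scores))
```

Note that the standard deviation is expressed in the same units as the scores, while the variance is in squared units.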
      The variance and standard deviation are measures of the variability of the distribution. They also act as measures of risk or uncertainty if we are trying to use a sample to make inferences or guesses about the population. The greater the variance, the less precisely a sample of a given size can be expected to reflect the population.

      Properties of the Standard Deviation

      As the scores in a distribution become more heterogeneous, more "spread out" and different, the value of the standard deviation grows larger.

      If the standard deviation of women's gender-role attitude scores was .90 on an 8-point scale and the standard deviation of men's scores was 1.68, you would know that men are more varied in gender-role attitudes than women are.  Can you come up with a sociologically sound explanation of why this might be so?

      Assuming that the distribution of scores is normal or bell-shaped (or close to it):

      • approximately 68% of the scores in the sample lie within one standard deviation of the mean
      • approximately 95% of the scores in the sample lie within two standard deviations of the mean
      • approximately 99.7% of the scores in the sample lie within three standard deviations of the mean
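The percentages for a normal distribution can be checked with Python's `statistics.NormalDist` (available from Python 3.8). The sketch below computes the theoretical shares and then compares them with a simulated sample; the sample size, seed, mean of 100, and standard deviation of 15 are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

standard_normal = statistics.NormalDist(mu=0, sigma=1)

def share_within(k):
    """Theoretical share of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

# Simulated check on made-up, normally distributed scores.
random.seed(0)
sample = [random.gauss(100, 15) for _ in range(100_000)]
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)

def sample_share_within(k):
    """Observed share of the sample within k sample SDs of the sample mean."""
    inside = sum(1 for x in sample if abs(x - sample_mean) <= k * sample_sd)
    return inside / len(sample)

for k in (1, 2, 3):
    print(k, round(sample_share_within(k), 4))
```

The simulated shares will differ slightly from the theoretical ones, and they apply only to the extent that the data really are close to bell-shaped.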
      NOTE: A visual examination of the frequency distribution is necessary to a good analysis. No single measure can adequately describe the distribution of an interval level variable. These summary statistics, complemented by a visual examination of the distribution of the variable, give us the ability to understand the distribution. A complete examination reduces the chance that we will make interpretive statements that can't be supported by the data.
