Descriptive Statistics

Descriptive statistics help you to accurately describe different sets of data. These often make it easier to present conclusions and make comparisons.

They include some basic statistics which you will have covered in maths, along with some you may need a reminder about:

Types of Data

Understanding what type of data you have is important when selecting the right statistical method.

Qualitative data 

Uses words to describe and observe.

Nominal data

Name-based data which has no order.

Ordinal data

Data that can be given an ascending or descending order, such as:

Quantitative data 

Uses numbers to measure and report.

Interval data

Data that has real numbers, with no true zero. This is rare – temperature (in Celsius or Fahrenheit) & dates are the only common examples. 

You can’t make statements such as “Gairloch is twice as warm as London” – because there’s not a true zero.
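A short Python sketch can show why ratios are not meaningful on an interval scale. The two temperatures below are made-up values, not real readings for Gairloch or London. Converting them to kelvin, a scale which does have a true zero, shows that 20 °C is nowhere near "twice as warm" as 10 °C.

```python
# Hypothetical temperatures in degrees Celsius (illustrative values only).
gairloch_c = 20.0
london_c = 10.0


def celsius_to_kelvin(temp_c):
    """Convert Celsius to kelvin, a scale with a true zero."""
    return temp_c + 273.15


# On the Celsius scale the ratio suggests "twice as warm"...
ratio_celsius = gairloch_c / london_c

# ...but on the kelvin scale the ratio is only slightly above 1.
ratio_kelvin = celsius_to_kelvin(gairloch_c) / celsius_to_kelvin(london_c)

print(f"Celsius ratio: {ratio_celsius:.2f}")
print(f"Kelvin ratio:  {ratio_kelvin:.3f}")
```

The difference between the two temperatures (10 degrees) is still meaningful on either scale; only the ratio is misleading.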

Ratio data

Data that has real numbers with a true zero. Most numerical data is ratio data.

Question: Would qualitative or quantitative data be more useful in these studies?

A – Understanding the perceptions of rural vs urban residents towards the installation of a new power line.

B – A study into the success of soil improvements in crofts in the NW Highlands.

C – Whether improvements to a High Street have been successful.

Activity

For each of these examples, state whether it is:

Mean, median & mode

These quick statistical calculations are all forms of average.

You should be able to choose which one is appropriate for the data you are describing.

Mean

This is a type of average which is calculated by adding together all the values in the dataset (∑x), and dividing by the total count of values (n).

We use the mean when…

Data is similar and doesn’t have any very low or high values. If you were to plot it on a scatter graph there would be few gaps.

e.g. If a student sits four geography exams the best measurement would be the mean as it would give a good picture of overall performance.
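As a rough illustration, the calculation can be sketched in Python. The four exam marks are hypothetical values chosen for this example.

```python
# Hypothetical marks (%) from four geography exams.
marks = [62, 58, 65, 71]

total = sum(marks)   # the sum of all the values
n = len(marks)       # the total count of values
mean = total / n

print(f"Sum of values: {total}")
print(f"Count of values: {n}")
print(f"Mean mark: {mean}")
```

Because the four marks are close together, the mean sits near all of them and gives a fair picture of overall performance.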

Median

This is the middle value, when all the values are placed in ascending (or descending) order.

If there are two middle numbers, find the mean of these.

We use the median when…

Data has some extreme lows or highs which may skew a mean. 

e.g. House prices, where a few homes have a very high or very low price.
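The effect of an extreme value can be seen in a short Python sketch. The house prices are invented for illustration; one very expensive home pulls the mean well above what most homes cost, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical house prices in pounds; the last one is an extreme high value.
prices = [150_000, 160_000, 165_000, 170_000, 180_000, 950_000]

mean_price = mean(prices)
# With an even count, median() returns the mean of the two middle values.
median_price = median(prices)

print(f"Mean price:   {mean_price:,.0f}")
print(f"Median price: {median_price:,.0f}")
```

Here five of the six homes cost less than the mean, so the median is the better description of a typical price.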

Mode

The most frequently occurring value.

This is the only measure of average which can be used with nominal data.

If there are two modes, the data is “bi-modal”. If there are no modes, it simply has no mode!

We use the mode when…

The data is recorded as frequencies (counts of how often each category or value occurs).

e.g. plotting the land use in a town – you can’t get a “mean” land use but you can get the modal land use.
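A minimal Python sketch of finding the mode, using the standard library's `multimode` function (Python 3.8 or later). The land-use survey and the small list of numbers are both hypothetical.

```python
from statistics import multimode

# Hypothetical land-use categories recorded at ten survey points in a town.
land_use = ["retail", "residential", "retail", "office", "residential",
            "retail", "open space", "residential", "retail", "office"]

# multimode() returns every value that shares the highest frequency.
modal_land_use = multimode(land_use)

# A hypothetical bi-modal data set: 3 and 5 both occur twice.
bimodal = multimode([2, 3, 3, 5, 5, 7])

print(f"Modal land use: {modal_land_use}")
print(f"Modes of the second list: {bimodal}")
```

Note that `multimode` lists every value when nothing repeats, so in that case you would still report that the data has no mode.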

Question: A geographer is measuring the impacts of the Great Green Wall on soil in Burkina Faso. Which would be most suitable to describe the results of:

A – Soil pH testing using a pH meter

B – Soil texture following a flow-chart

C – Soil moisture (%) using the drying method

Activity

Complete these questions in your notes.

a) A distance runner entered seven marathons. His times were (hh:mm) 3:45, 4:05, 3:55, 4:25, 4:20, 3:48, 3:55. He wants to find his overall performance.

b) The average wage of the employees in an office from managing director to the office junior.

c) In the 2022 Boston marathon, finishers were reported in two categories: male and female. There were 6562 male finishers and 1562 female finishers.

Grouped Data

The mean, median and mode can be trickier to work out when data is sorted into groups.

We use the midpoint (or median) of the group and multiply this by the frequency. 

Worked Example

We can then calculate the mean

… which is in the 5-9 group.

We can then calculate the median

… which, as there are 30 total visitors (n=30), will lie between the 15th and 16th values when they are placed in order. This will also fall into the 5-9 group. However, because the data is grouped, we don’t get any sense of where it is between 5 and 9.

We can then calculate the mode

… well actually, we can’t find the mode. We can only find the modal group. This is the one that occurs most frequently (i.e. the 5-9 group). Note that the modal group is sometimes known as the modal class.
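The three grouped-data calculations can be sketched in Python. The frequency table below is hypothetical: it is not the table from the worked example, although it has been chosen so that n = 30 and the 5-9 group comes out in each case. The function name `median_group` is also just an illustrative choice.

```python
# Hypothetical grouped data: (lowest, highest) value in each group -> frequency.
# These numbers are made up for illustration only.
groups = {(0, 4): 6, (5, 9): 14, (10, 14): 7, (15, 19): 3}

n = sum(groups.values())

# Estimated mean: add up (midpoint x frequency) for each group, then divide by n.
total = sum((low + high) / 2 * freq for (low, high), freq in groups.items())
estimated_mean = total / n


def median_group(grouped):
    """Return the first group whose cumulative frequency reaches n / 2."""
    half = sum(grouped.values()) / 2
    cumulative = 0
    for bounds, freq in grouped.items():
        cumulative += freq
        if cumulative >= half:
            return bounds


# Modal group: the group with the highest frequency.
modal_group = max(groups, key=groups.get)

print(f"n = {n}")
print(f"Estimated mean = {estimated_mean:.1f}")
print(f"Median group = {median_group(groups)}")
print(f"Modal group = {modal_group}")
```

The mean is only an estimate because every value in a group is treated as if it sat at the midpoint. The median sketch also assumes the two middle values fall in the same group.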

Activity

A survey of visitors to a new health clinic in central Chad is undertaken, to find out its sphere of influence. The results are shown in the table.

  1.  

  2.  

Range & Inter-quartile range

These are both measures of dispersion. They show how much data differs from the average.

If the data were plotted on a scatter graph, these measures would show how spread out or distributed the data is, but not where it actually sits on the axis.

Range

The range is the difference between the highest and lowest values.

Marks in a geography exam: 

10         25        45 47 49 51 51 52 52 54 56 57 58 60 62 66 68 70    75            90

The range of this data is 80. However, one disadvantage of the range is that it doesn’t tell us that most of the data is grouped around the middle.
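A short Python sketch of the same calculation, using the exam marks above. The cut-offs of 45 and 70 used to count the "middle" marks are an assumption made for this illustration.

```python
# Marks in a geography exam, as listed above.
marks = [10, 25, 45, 47, 49, 51, 51, 52, 52, 54,
         56, 57, 58, 60, 62, 66, 68, 70, 75, 90]

# Range: highest value minus lowest value.
data_range = max(marks) - min(marks)

# Count how many marks sit in the middle cluster (45 to 70 inclusive).
clustered = sum(1 for m in marks if 45 <= m <= 70)

print(f"Range: {data_range}")
print(f"Marks between 45 and 70: {clustered} out of {len(marks)}")
```

The range is set entirely by the two most extreme marks, even though most of the marks lie in a much narrower band.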

Inter-quartile range

The inter-quartile range is the range of the middle half of the values.

In this case, it would be a better measure as it omits the extremes.

Worked example

Every set of data has three quartiles:

Once we have worked out Q1 and Q3, we can calculate the inter-quartile range. 

Marks in a geography exam: 

10         25        45 47 49 51 51 52 52 54 56 57 58 60 62 66 68 70    75            90

Q1 is half way between the 49 and 51 values.  Q1 = 50

Q2 is the median – half way between 54 and 56. Q2 = 55

Q3 is half way between the 62 and 66 values. Q3 = 64

Therefore, the Interquartile range is: Q3-Q1 = 64-50 = 14
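The worked example can be checked with a small Python sketch. It assumes the method used above: order the data, split it into a lower and an upper half, and take the median of each half. The function name `quartiles` is an illustrative choice, and it needs at least two values. Spreadsheets and statistics packages sometimes use slightly different quartile rules, so their answers may not match exactly.

```python
from statistics import median

# Marks in a geography exam, as listed above.
marks = [10, 25, 45, 47, 49, 51, 51, 52, 52, 54,
         56, 57, 58, 60, 62, 66, 68, 70, 75, 90]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of each half of the ordered data.

    With an odd number of values, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # the lowest half of the values
    upper = ordered[-half:]   # the highest half of the values
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles(marks)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
print(f"Inter-quartile range = {iqr}")
```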

When comparing two sets of data, a box-plot is often used to display the interquartile range.

Activity

Answer these questions

1.  The data below shows how long on average it takes a group of cyclists to travel to work along a cycle path (in minutes):

22, 19, 24, 31, 16, 48, 29, 29, 21, 15, 22, 28, 27, 23, 37, 31, 23, 30, 26, 16, 26, 29

Calculate the median and inter-quartile range.

2.   The data shows the minimum temperatures of ten weather stations in Britain on a winter’s day. The temperatures are: 5, 9, 3, 2, 7, 9, 8, 2, 2, 3 (°C)

Calculate the median and inter-quartile range.

3.  Minimum temperatures in summer at ten weather stations: 10, 12, 11, 14, 9, 8, 15, 12, 12, 11 (°C)

Compare, using two box plots, these with the figures from question 2.

Standard Deviation

The Standard Deviation measures dispersion of values around the mean.

The standard deviation is often more useful than the inter-quartile range because it uses every value and tells us how the data is grouped around the mean, rather than only the distance between two values. It allows us to compare data sets which are very different.

The formula for standard deviation is:

σ = √( ∑(x − x̄)² ÷ n )

where x is each value, x̄ is the mean and n is the number of values.

But what does standard deviation actually mean? 

When we get a value for standard deviation, we can then use it to describe the data:

Worked example

Find the standard deviation of the minimum temperatures, measured at ten weather stations in Britain on a winter’s day.

The temperatures are:

5, 9, 3, 2, 7, 9, 8, 2, 2, 3 (°C)

Constructing a table can help calculate the standard deviation correctly.

Column 1 – list all values. Add them up to find ∑x. Calculate the mean (x̄).

Column 2 – write the mean temperature (x̄) in every row.

Column 3 – subtract the mean from each value (temperature). It does not matter if you get a negative number.

Column 4 – square all column 3 figures, to remove any negative numbers.

Once you’ve used the formula to calculate the standard deviation, you should always write the answer out in full.

The standard deviation of the minimum temperatures at the weather stations is 2.8°C. This means that, if the data is roughly normally distributed, about 68% of the values lie within 2.8°C of the mean, which is 5°C.
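The table method can be sketched in Python, with one list per column. This assumes the sum of squared differences is divided by n (the number of values), which is what gives the 2.8°C answer above; dividing by n − 1, as some textbooks and calculators do for samples, would give a slightly larger figure.

```python
from math import sqrt

# Minimum temperatures (°C) at the ten weather stations.
temps = [5, 9, 3, 2, 7, 9, 8, 2, 2, 3]

n = len(temps)
mean = sum(temps) / n                    # columns 1 and 2: sum of values, then the mean

# Column 3: difference between each value and the mean
# (the sign does not matter once it is squared).
differences = [x - mean for x in temps]

# Column 4: square each difference.
squared = [d ** 2 for d in differences]

# Standard deviation: square root of (sum of column 4 divided by n).
std_dev = sqrt(sum(squared) / n)

print(f"Mean = {mean}")
print(f"Sum of squared differences = {sum(squared)}")
print(f"Standard deviation = {std_dev:.1f}")
```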

Standard deviation measures the spread of data from the mean.

Activity

A sample of coniferous trees had the following heights: 3, 2, 1, 2, 3, 4, 3, 7, 6, 5 (m) 

Calculate the standard deviation.

Coefficient of Variation and Standard Error of the Mean

These are other ways to describe the spread of data around a mean.

Coefficient of variation

This is the spread of values around the mean, in percentage format.

The higher the coefficient of variation the wider the spread of data around the mean.

The reason we use percentages is to make it easy to compare two or more data sets that have different means.
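A minimal Python sketch, assuming the usual definition of the coefficient of variation: the standard deviation divided by the mean, multiplied by 100. The two sets of tree heights are hypothetical and were chosen so that both have the same standard deviation but different means.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    # pstdev() divides by n, matching the standard deviation used in this handout.
    return pstdev(values) / mean(values) * 100


# Hypothetical tree heights (m) at two sites.
site_a = [2, 4, 4, 4, 5, 5, 7, 9]
site_b = [12, 14, 14, 14, 15, 15, 17, 19]

cv_a = coefficient_of_variation(site_a)
cv_b = coefficient_of_variation(site_b)

print(f"Site A: standard deviation = {pstdev(site_a)}, CV = {cv_a:.1f}%")
print(f"Site B: standard deviation = {pstdev(site_b)}, CV = {cv_b:.1f}%")
```

Both sites have a standard deviation of 2 m, but the spread is much larger relative to the mean at Site A, which is what the coefficient of variation picks up. It only makes sense for ratio data, where the mean is measured from a true zero.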

Standard error of the mean

Because the data we use is often a sample (e.g. we don’t measure the height of every tree in a forest), we need to know how well the mean obtained from our sample represents the true mean. The standard error of the mean is used to estimate the limits within which the true mean lies.

Once we’ve calculated the value, we can say that the true mean has a 68% chance of lying within one standard error of our sample mean.

Therefore, for any estimate of the mean that we arrive at (from a sample), it is possible to estimate where the true population mean is.

Worked example

Instead of sampling the atmospheric pressure for all days of the year, a sample of 10% was taken.

This means that the atmospheric pressure was recorded every 10th day. This is an example of systematic sampling.

The following statistics were recorded:

Mean = 1013.6mb

Standard deviation = 11.2mb

Because another thirty-six readings taken on different days will probably produce a different mean and standard deviation, the sample is subject to error.

We use the standard error of the mean formula to calculate:

Standard error = standard deviation ÷ √n = 11.2 ÷ √36 = 11.2 ÷ 6 ≈ 1.9mb

We can therefore write that:

The true mean is 1013.6 ±1.9mb
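The same calculation as a short Python sketch, assuming the standard error is the standard deviation divided by the square root of the sample size, with the thirty-six readings mentioned above as the sample size.

```python
from math import sqrt

sample_mean = 1013.6   # mb
std_dev = 11.2         # mb
n = 36                 # number of readings in the sample

# Standard error of the mean: standard deviation divided by the square root of n.
standard_error = std_dev / sqrt(n)

# Limits one standard error either side of the sample mean.
lower_limit = sample_mean - standard_error
upper_limit = sample_mean + standard_error

print(f"Standard error = {standard_error:.1f} mb")
print(f"True mean is estimated to lie between "
      f"{lower_limit:.1f} and {upper_limit:.1f} mb")
```

A larger sample makes the standard error smaller, because n appears on the bottom of the calculation.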

Activity

Smolt trapping has been used by the Wester Ross Fisheries Trust to sample salmon smolts in the Badachro River. 

A sample of 25 smolts were trapped and their sizes measured in cm.

24, 14, 19, 20, 13, 19, 23, 15, 19, 20, 28, 11, 20, 24, 27, 17, 19, 17, 24, 25, 16, 22, 18, 21, 20