Inferential Statistics

Inferential statistics help to identify relationships, test hypotheses and draw conclusions about sets of data.

Most of these statistical methods will be new to you, but none of the maths behind them is tricky! If you follow the instructions step-by-step, they’re really worthwhile techniques to incorporate into your Geographical Study. You will also find they come up in university courses such as Geography, Psychology, Science and Humanities subjects.

This page covers the following statistical methods:

Pearson’s Product Moment Correlation Coefficient
Spearman’s Rank Correlation Coefficient
Chi² Analysis
Nearest Neighbour
Linear Regression

Hypothesis and Significance are not statistical methods, but are important foundations.

Hypothesis

A hypothesis is a statement or a hunch. It is used to set up a question which can then be tested – data gathered, processed and a conclusion drawn.

Most of the time in Geography, research is centred around testing whether a geographical model is true.

For example, understanding the process of attrition means that we would expect that the size of sediment should get smaller as you move downstream.

To test whether this is true for a particular river, a hypothesis would be written.

“There is a relationship between the size of the bedload in River X and the distance from the source at which it is found.”

However, as in a court of law, it’s “innocent until proven guilty.” It’s better to assume there’s no relationship and then find one!

A better hypothesis would be:
“There is no relationship between the size of the bedload in River X and the distance from the source at which it is found.” This is an example of a null hypothesis.

Null Hypothesis

To test a hypothesis we first write the null hypothesis (H0).

This is the “no change” position. Phrases used in a null hypothesis include…

“There is no difference between…”
“There has been no change since…”
“There is no relationship between…”

We do this because it’s good practice to assume a relationship doesn’t exist, and then look for evidence that one does.

If your test later shows that a relationship does exist, you can reject the null hypothesis and accept the alternative hypothesis (H1).

Alternative Hypothesis

“There is a difference between…”
“There has been a change since…”
“There is a relationship between…”

Question: Practise writing out more hypotheses, using the model above.

The Bradshaw Model is commonly used by Geographers to predict how river characteristics will change downstream. Practise writing out some null and alternative hypotheses for other river characteristics. You can use the one above, for sediment size, to help you.

Activity

For each of the investigations, write out the null and alternative hypotheses.

1. The James Hutton Institute want to find out if the yield of wheat on Scottish farms has changed after using brand X fertiliser instead of using brand Y fertiliser.

2. The Walk Wheel Cycle Trust found that the mean travel time to work for Glasgow residents in 2015 was 20 minutes. A transport official wants to use a sample of this year’s travel times to see if they have changed since 2015.

3. The team at Our World in Data are writing an article about improvements in healthcare. They want to establish whether a relationship exists between the GDP per capita ($) of ten low-income countries and the provision of doctors per 1,000 people in those countries.

Significance

Significance is how we decide whether to accept or reject the null hypothesis. Testing for significance helps to make sure that the relationship in your data hasn’t simply appeared by chance.

When you calculate the result of a statistical test, you will have to check it against a significance table.

We will use data tables sourced from the Island Geographer website. These should be referenced if you use them in your Geographical Study.

Chi Squared Significance Table

Spearman’s Rank Significance Table

Pearson’s Product Significance Table

Significance tables show values of probability.

This example uses the Pearson’s Product significance table.

A significance level of 0.01 says that there is only a 1 in 100 chance of getting a result this strong by chance alone. That is usually described as being 99% confident that the relationship exists.

If the value that you’ve calculated in your statistical test is more than the critical value in the table (e.g. more than 0.995) then you can say that you’re highly confident the relationship exists.

e.g. The result of 0.998 exceeds the critical value of 0.995 at the 0.01 significance level. The null hypothesis must be rejected and replaced by the alternative hypothesis – that there is a relationship between the size of sediment in River X and the distance from the source.

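The degrees of freedom relate to the number of values that have been studied to reach this conclusion. The more data the better, although larger data sets are also more likely to contain anomalies. With more data, a weaker relationship can still be significant, so for the correlation tests (Pearson’s Product and Spearman’s Rank) the critical value which needs to be reached gets lower as the degrees of freedom increase. For Chi Squared the opposite is true and the critical value gets higher. Be sure to calculate the degrees of freedom correctly – it’s done differently depending on the test conducted.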
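The decision rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_significant` is made up for this sketch, and the critical value still has to be read from the significance table by hand.

```python
def is_significant(calculated, critical):
    """Return True when the calculated value exceeds the critical value.

    The sign is ignored, because a negative correlation is checked
    against the table in the same way as a positive one.
    """
    return abs(calculated) > critical


# Values quoted in the text: a result of 0.998 against a critical value of 0.995.
reject_null = is_significant(0.998, 0.995)
print("Reject the null hypothesis" if reject_null else "Accept the null hypothesis")
```

The same check works for any of the tests on this page, as long as the critical value comes from the right table and the right degrees of freedom.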

Question: You’re studying the change in house prices as you move away from the city centre. You have done a Spearman’s Rank statistical test and got a result of 0.826 based on 8 sets of data (n).

Based on a significance level of 0.02 (98%), do you:

A – Reject the null hypothesis and accept the alternative hypothesis.

B – Accept the null hypothesis.

C – Change the level of significance you’re looking at because although it doesn’t meet it at 98%, it would at 95%.

Activity

The Scottish Geology Trust are investigating the feasibility of a new path and interpretation centre at Siccar Point. They’ve compared the occurrences of path erosion at Siccar Point to locations on other parts of the coastal path. They want to see if visitors to Siccar Point are causing an increase in path erosion.

A chi squared analysis was undertaken to statistically test the data. The result (calculated value) of this was 24.57. The degrees of freedom were 18.

Write down the null hypothesis and the alternative hypothesis.
Open the significance table for Chi squared. Is the calculated value of 24.57 high enough to pass at any of the listed significance levels? If so, which ones?
Write down whether you accept or reject the null hypothesis.
Suggest two other data gathering techniques, other than measuring path erosion, that they might want to use to collect data to inform their plans.

Pearson’s Product

Pearson’s Product Moment Correlation Coefficient, to give it its full name, is a way to examine two sets of data and find out if there is a significant relationship between them.

It gives us an idea of the strength of the relationship and what direction – i.e. is it a positive or negative relationship.

Before learning about how to do the statistical calculation, it might be worth reminding yourself about correlation. This was covered in the Graphical Data Presentation part of the course.

A Pearson’s Product calculation will give us a value which will allow us to decide what kind of correlation the data has – and if it’s strong enough for us to accept a hypothesis.

The formula for Pearson’s Product looks complicated, but it just has to be calculated in steps. Each of these steps is represented with a heading in the table.

Worked Example

A geographical researcher is examining how air pollution levels vary with distance from the CBD in Inverness. Nitrogen dioxide levels were recorded and mapped across the city at regular intervals along a transect.

The researcher believed that as the centre of the city was quite old and contained few roads wide enough for large volumes of traffic, it would not have as high a level of nitrogen dioxide as the city outskirts.

The researcher therefore formulated the following null hypothesis:

“There is no correlation between the level of nitrogen dioxide in the air and the distance from the CBD.”

Step 1: Use the data to create a scattergraph. This will allow you to determine what kind of correlation to expect.

In this case, the data shows a negative correlation. We will expect the result of our Pearson’s Product calculation to be a negative number.

Step 2: Create a table as shown. This gives you all the components you need to work out Pearson’s Product using the formula.

If you need more rows (for a higher number of data sets), then just add them!

What are all the columns for?

In the first column, put the values for your independent variable x (in this case, distance from the CBD).

In the second column, put the values for your dependent variable y (in this case, nitrogen dioxide levels).

The (x – x̅) column is how different the x values are from the mean.

The (y – y̅) column is how different the y values are from the mean.

The (x – x̅)(y – y̅) column is these two figures multiplied together, for each data set.

The (x – x̅)² and (y – y̅)² columns are the two difference columns squared.

Step 3: Calculate the mean of x (x̅) and the mean of y (y̅).

Step 4: Use the table to work through and calculate all the components you will need to use the formula.

These are:

Total (∑) of (x – x̅)(y – y̅)
Total (∑) of (x – x̅)²
Total (∑) of (y – y̅)²

The completed table is shown.

Step 5: Slot the components into the formula.

r = ∑(x – x̅)(y – y̅) ÷ √(∑(x – x̅)² × ∑(y – y̅)²)
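Steps 3 to 5 can be checked with a short Python sketch. It assumes the standard form of the Pearson’s Product formula (the total of the multiplied differences, divided by the square root of the two squared totals multiplied together). The distances and nitrogen dioxide readings below are invented for illustration and are not the data from the worked example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's Product, built from the same columns as the table."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of x
    y_bar = sum(ys) / n  # mean of y
    # Total of the (x - x_bar)(y - y_bar) column
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Totals of the two squared columns
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical readings (NOT the worked example's data)
distance_km = [0, 1, 2, 3, 4]
no2_level = [50, 44, 41, 33, 30]

r = pearson_r(distance_km, no2_level)
print(round(r, 3))
```

Working through the table by hand first, and then using a sketch like this to check the answer, is a good way to catch calculation errors.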

The negative value confirms to us that there is a negative correlation. The value of -0.908 is also close to -1, which tells us that there is a strong, negative correlation. But before we can accept or reject the hypothesis, we should check its significance.

Step 6: Check the calculated value against the critical value, using the significance table. The degrees of freedom for Pearson’s Product here is n – 1, so in this case would be 12 (some significance tables use n – 2 instead, so check which one your table expects). When checking the level of significance, ignore any negative signs.

We can see from the table that the value has to be above 0.661 to be significant at the 0.005 level. Ignoring the negative sign, our value of 0.908 is above this and we can say there is a significant relationship.

Step 7: Write out your answer in full, referring to the calculated value, critical value, significance level and hypothesis.

The Pearson’s Product result of -0.908 exceeds the critical value of 0.661 at the 0.005 (99.5%) significance level with 12 degrees of freedom. The null hypothesis must be rejected and replaced by the alternative hypothesis – that there is a significant, strong, negative correlation between the level of nitrogen dioxide in the air and the distance from the CBD.

Question: Pearson’s Product isn’t suitable for use with all types of data or hypotheses.
Which of these could it be used with?

A – To find out if there’s a relationship between the depth of soil and distance up a slope

B – To find out if two towns are seeing a similar rise in incidences of coastal erosion

C – To find out if there is a correlation between the GDP per capita and carbon dioxide emissions in South American countries

Pearson’s Product Moment Correlation Coefficient tests the strength and direction of a correlation between two sets of data.

It uses the actual values, which is more accurate than using relative (ranked) values.
It is useful when examining interval data, which means that it can be used with a mixture of negative and positive numbers.
It can also be used with ratio data (which has a true zero).
It assumes a linear relationship, and does not work well with data which might have a different relationship (e.g. logarithmic).
There are many steps to the calculation, which increases the likelihood of a calculation error.

Activity

A researcher for a health charity is looking to find out if there is a relationship between the % of people who work in agriculture and the life expectancy. She has randomly sampled 8 countries and sourced data from Gapminder.

Write out the null hypothesis for the investigation and undertake a Pearson’s Product analysis to determine whether it should be accepted or rejected.

Step 1: Scattergraph

Step 2: Table

Step 3: Calculate mean

Step 4: Complete table

Step 5: Formula

Step 6: Significance table

Step 7: Write-out in full

Spearman’s Rank

Spearman’s Rank Correlation Coefficient, to give it its full name, is another way to examine two sets of data and find out if there is a significant relationship between them.

It gives us an idea of the strength of the relationship and what direction – i.e. is it a positive or negative relationship.

It uses ranked data rather than the true values.

Spearman’s Rank has a lot of similarities to Pearson’s Product and is used with similar data sets. However, it has some key differences:

It uses the data’s rank, rather than its true value.

This is an advantage because it reduces the impact of extreme values.
However, it also makes the technique less powerful than Pearson’s Product, which gives more reliability and precision.

It has fewer steps than Pearson’s Product, which makes it an easier statistical test to complete.

The formula for Spearman’s Rank has fewer components:

rs = 1 – (6∑d² ÷ (n³ – n))

∑d² = the total of the squared differences in rank for each set of paired data
n = the number of sets of paired data

However, there are still a number of steps to follow to get the information you need.

Worked Example

A Geography class are investigating whether there is a relationship between the adult literacy of countries and their fertility rate.

The adult literacy rate is the percentage of people aged 15 and above who can both read and write, with understanding, a short simple statement about their everyday life.
Total fertility rate represents the number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with age-specific fertility rates of the specified year.

In order to see if there is a relationship, and if it’s significant, they will undertake a Spearman’s Rank calculation. Their teacher has provided them with a dataset featuring 68 countries.

Null hypothesis: There is no relationship between the adult literacy rate and the total fertility rate in selected countries.

Data sampling: As the data set is very large, the students decided to use a sampling strategy to select 14 countries each. That is because ranking very large data sets is difficult and there is a higher chance of an error occurring.

They could do this using a:

systematic sampling approach, taking every 5th country
stratified sampling approach, separating the countries into those with high, medium and low human development and then selecting 4 from each
random sampling approach, generating 14 random numbers between 1 and 68 to select their countries.

This worked example, shown in the table, used a systematic sampling method and selected countries on every 5th line, starting with line 2.
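The systematic and random approaches can be sketched in Python as below. The country names are placeholders standing in for the 68 lines of the data set, and the fixed seed is only there so the random pick can be repeated. Stratified sampling is left out because it needs the human development grouping for each country.

```python
import random

# Placeholder names standing in for lines 1 to 68 of the data set
countries = [f"Country {line}" for line in range(1, 69)]

# Systematic sampling: every 5th line, starting with line 2
# (list positions start at 0, so line 2 is position 1)
systematic = countries[1::5]

# Random sampling: 14 different countries picked without repeats
rng = random.Random(42)  # seed chosen arbitrarily for this sketch
random_sample = rng.sample(countries, 14)

print(len(systematic), systematic[0], systematic[-1])
print(random_sample)
```

Starting at line 2 and taking every 5th line gives lines 2, 7, 12 and so on up to 67, which is exactly 14 countries.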

Step 1: Use the data to create a scattergraph. This will allow you to determine what kind of correlation to expect.

In this case, the data shows a negative correlation. We will expect the result of our Spearman’s Rank calculation to be a negative number.

Step 2: Create a table as shown. This gives you all the components you need to work out Spearman’s Rank using the formula.

If you need more rows (for a higher number of data sets), then just add them!

What are all the columns for?

In the first column, put the values for your independent variable x (in this case, adult literacy rate).

In the second column, put the values for your dependent variable y (in this case, total fertility rate).

The rank x and rank y columns are the rank given to the data. There is more information about how to do this below.

The d column is the difference between the ranks for x and y (difference between column 3 and 5)

The d² column is the d column squared. We do this to remove negative numbers.

Step 3: Rank the x and y data.

You can do this in ascending (going up) or descending (going down) order. It doesn’t matter, as long as it’s the same for both x and y.
If there are two identical values (like the values in red) then you give them a rank half way between the two. For example, the value of 4.8 would have come at rank position 2 and 3, so it’s given the rank of 2.5. Importantly, you then continue ranking from rank 4. I like to keep a track of this by writing in brackets – otherwise I get in a mess!
If there are three (or more!) identical values, like those shown in blue, you give them a rank that is the average of the values. In this case, the adult literacy rate of 100 would have been ranks 1, 2 and 3 – the average of which is 2. You then continue ranking from rank 4. Again, I add the numbers in brackets just to help me keep track.

You should know that you’ve done this step correctly as your last rank will be the same as the number of data sets you have.
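The ranking rules for tied values can be sketched in Python as follows. The function name `average_ranks` and the short lists are invented for illustration. They simply mirror the two cases described above (a pair of 4.8s and three values of 100).

```python
def average_ranks(values, descending=True):
    """Rank values, giving tied values the average of the ranks they cover."""
    ordered = sorted(values, reverse=descending)
    rank_of = {}
    for value in set(values):
        first = ordered.index(value) + 1  # first rank position the value takes up
        count = ordered.count(value)      # how many times the value appears
        rank_of[value] = first + (count - 1) / 2
    return [rank_of[value] for value in values]


# Two identical values share ranks 2 and 3, so both get 2.5
print(average_ranks([5.0, 4.8, 4.8, 4.1]))

# Three identical values share ranks 1, 2 and 3, so all get 2
print(average_ranks([100, 100, 100, 95, 90]))
```

In both cases the next value continues from rank 4, just as in the worked example.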

Step 4: Calculate the difference in ranks and square, to complete the table.

Step 5: Use the calculated ∑d² value in the formula. Remember to do the 1 – part of the calculation.
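Steps 4 and 5 can be sketched in Python, assuming the standard form of the Spearman’s Rank formula: 1 – (6∑d² ÷ (n³ – n)). The completed table is not reproduced here, so the ∑d² of 780.5 used below is an assumed figure, chosen only because it is consistent with the result quoted for this worked example.

```python
def rs_from_sum_d2(sum_d2, n):
    """Spearman's Rank from the total of d squared and the number of pairs."""
    return 1 - (6 * sum_d2) / (n ** 3 - n)


def spearman_rs(rank_x, rank_y):
    """Spearman's Rank from two lists of ranks (the d and d squared columns)."""
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return rs_from_sum_d2(sum_d2, len(rank_x))


# Assumed total for illustration only, with n = 14 pairs of data
print(round(rs_from_sum_d2(780.5, 14), 3))

# Ranks that run in exactly opposite directions give a perfect negative result
print(spearman_rs([1, 2, 3, 4], [4, 3, 2, 1]))
```

Forgetting the “1 –” at the start is the most common slip, and it would turn the answer for the assumed total into a value above 1, which is impossible for a correlation.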

The negative value confirms to us that there is a negative correlation. The value is -0.715, which tells us that the relationship is there, although it isn’t as strong as a value closer to -1 would be. But before we can accept or reject the hypothesis, we should check its significance.

Step 6: Check the calculated value against the critical value, using the significance table. The degrees of freedom for Spearman’s Rank is simply n, so in this case would be 14. When checking the level of significance, ignore any negative signs.

We can see from the table that the value has to be above 0.675 to be significant at the 0.01 level. Ignoring the negative sign, our value of 0.715 is above this and we can say there is a significant relationship.

Step 7: Write out your answer in full, referring to the calculated value, critical value, significance level and hypothesis.

The Spearman’s Rank result of -0.715 exceeds the critical value of 0.675 at the 0.01 (99%) significance level with 14 degrees of freedom. The null hypothesis must be rejected and replaced by the alternative hypothesis – that there is a significant negative relationship between the adult literacy rate and total fertility rate of selected countries.

Activity

Conduct a Spearman’s Rank analysis on another 14 countries from the data set.

You could choose to sample these with random or stratified sampling.

Sampling

Step 1: Scattergraph

Step 2: Table

Step 3: Ranking

Step 4: Complete table

Step 5: Formula

Step 6: Significance table

Step 7: Write-out in full

Chi Squared

Chi Squared is a statistical method used for data which is in categories. It’s useful when we’re looking at whether there is an association between data across different sites or groups of people.

Chi Squared is really useful for lots of Advanced Higher Geography studies. That’s because lots of them will compare data across different locations.

Considerations:

Both sets of data need to be categorical, so you may have to arrange any continuous data (e.g. ages, distances) into groups. For example, you could create groups labelled 0-15 years, 16-30 years, 31-45 years, 46-60 years.
Depending on the data being analysed, you might want to combine groups so that you have around 4-5 to work with.
The calculated value should be checked against a significance table – note the different way to calculate the degrees of freedom.
Beware of categories with small expected values. One or two are fine, but if there are any under 1 – or a few under 5 – it means the test isn’t very reliable for your data. You could limit the impact of this by combining groups (e.g. making the age categories larger). In reality, it’s difficult to plan for this in AH studies but the more data you have, the easier it is to manipulate it and change categories should this be the case.

The formula for Chi Squared (χ²) works a little differently to your previous inferential statistics. You calculate the data in a table, and then total it.

χ² = ∑((O – E)² ÷ E)

O = the observed value for each category
E = the calculated, expected value for each category
∑ = the total

Worked Example

A geography pupil has investigated two areas of peatland. One has undergone peatland restoration, whilst the other has not. One of the things studied was the occurrences of indicator species in each peatland.

Hypothesis: There will be an association between species frequency and restoration status of peatlands.

In order to see if there is an association, and if it’s significant, they will undertake a Chi Squared calculation.

Null hypothesis: There will be no association between species frequency and restoration status of peatlands.

Step 1: Find out the total of each row and column in the table.

If this has been done correctly, the column totals added together and the row totals added together should give the same overall total.

Step 2: Calculate the expected values. The expected values assume that there is no difference between the two peatlands. We’re looking to see how different the observed (real) data is from this.

Expected = (column total × row total) ÷ overall total

If you’ve done this correctly, the row totals of the expected values should still be the same as the row totals of the observed values!

Step 3: One more table to make, this time the components needed to total up the Chi Squared value.
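Steps 1 to 3 can be sketched in Python as below, assuming the usual Chi Squared formula of the total of (O – E)² ÷ E. The observed counts are invented for illustration (two peatlands by four species) and are not the data from the worked example.

```python
def chi_squared(observed):
    """Return the Chi Squared value and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    overall_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected = (column total x row total) / overall total
            e = column_totals[j] * row_totals[i] / overall_total
            chi2 += (o - e) ** 2 / e

    degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, degrees_of_freedom


# Hypothetical counts: rows are the two peatlands, columns are four species
observed_counts = [
    [20, 30, 25, 25],  # restored peatland
    [30, 20, 25, 25],  # unrestored peatland
]

chi2, dof = chi_squared(observed_counts)
print(chi2, dof)
```

With these made-up counts every expected value is 25, which makes it easy to check the sketch by hand before trusting it with real data.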

Step 4: Check the calculated value against the critical value, using the significance table. The degrees of freedom for Chi Squared is:

(number of rows – 1) × (number of columns – 1)

So in this case it would be 3. Calculated values for Chi Squared should always be positive numbers.

We can see from the table that the value has to be above 7.815 to be significant at the 0.05 level. Our value of 8.789 is above this and we can say there is an association.

Step 5: Write out your answer in full, referring to the calculated value, critical value, significance level and hypothesis.

The Chi Squared result of 8.789 exceeds the critical value of 7.815 at the 0.05 (95%) significance level with 3 degrees of freedom. The null hypothesis must be rejected and replaced by the alternative hypothesis – that there is an association between species frequency and restoration status of peatlands.

Activity

Climate Hebrides wanted to collect qualitative data, based on people’s own experiences, to contribute to work on climate change adaptations. They’ve conducted a survey of residents across the Outer Hebrides, asking information about their characteristics and opinions on the impact of climate change.

Their data indicated that people may have different opinions based on their age.

Undertake a Chi Squared analysis of the data.

Extension: How would you present this data graphically?

Nearest Neighbour

Nearest Neighbour looks at whether points are clustered, randomly distributed or regularly dispersed over a spatial area.

The calculated nearest neighbour value will range from:

0 – the data is very clustered
1 – the data is randomly distributed
2.15 – the data is regularly dispersed, spread evenly across the area

The formula used is:

Rn = 2D × √(n ÷ a)

Where:

Rn = Nearest Neighbour Index
D = the average distance between each point and its nearest neighbour
n = the number of points in the study
a = the size of area in the study

Worked Example

Birch seeds are dispersed by the wind, often being carried for long distances. In a commercial coniferous forest plantation, birch trees can end up growing naturally, in amongst commercially grown trees.

Forestry and Land Scotland has visited an area of commercial forest where this has occurred. They have mapped out the location of the birch trees and wish to know how they are dispersed through the forest – in clusters, randomly or regularly dispersed.

Step 1: Write out a hypothesis for the calculation.

“The birch trees will be regularly dispersed across the commercial coniferous plantation.”

Step 2: Find out the distance from each tree to its nearest neighbour. You may have to use a map to do this.

It doesn’t matter what unit you use, as long as it’s consistent with the unit used for area.

Do you predict, looking at this map, that this is going to give a clustered, random or dispersed pattern? We’re now going to use statistics to give more than just your opinion/prediction!

Step 3: Add a column to your table with the distances. Use this to calculate the average distance to nearest neighbour.
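If you have coordinates for each point rather than a paper map, Steps 2 and 3 can be sketched in Python as below. The function name and the three points are made up for illustration. The coordinates are assumed to be in the same unit as the one used for area (e.g. metres).

```python
from math import dist


def mean_nearest_neighbour_distance(points):
    """Average distance from each point to its closest other point."""
    nearest = []
    for i, point in enumerate(points):
        # Distance to every other point, keeping only the smallest
        nearest.append(
            min(dist(point, other) for j, other in enumerate(points) if j != i)
        )
    return sum(nearest) / len(nearest)


# Hypothetical tree positions as (x, y) coordinates in metres
trees = [(0, 0), (3, 4), (3, 0)]
print(mean_nearest_neighbour_distance(trees))
```

Each point only counts its single closest neighbour, so two trees that are nearest to each other both record the same distance.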

Step 4: Calculate the Rn (Nearest Neighbour Index) using the formula.

Where:

Rn = Nearest Neighbour Index
D = 18 m
n = 10
a = 10,000 m²
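The calculation in Step 4 can be checked with a short Python sketch, assuming the formula takes the form Rn = 2D × √(n ÷ a), which reproduces the result quoted for this worked example. The function name is made up for illustration.

```python
from math import sqrt


def nearest_neighbour_index(mean_distance, n, area):
    """Rn = 2D x square root of (n / a), with D and a in matching units."""
    return 2 * mean_distance * sqrt(n / area)


# Values from the worked example: D = 18 m, n = 10, a = 10,000 square metres
rn = nearest_neighbour_index(18, 10, 10_000)
print(round(rn, 3))
```

Mixing units (for example, distances in metres with an area in km²) is the easiest way to get a wildly wrong index, so it is worth converting everything first.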

Step 5: Write your answer out in full, commenting on whether the result shows a clustered, random or dispersed distribution.

The Nearest Neighbour Index (Rn) of 1.138 shows that the birch trees are close to randomly distributed, rather than regularly dispersed as the hypothesis predicted.

How does that match with your prediction? In a Geographical Study, you could then:

Compare it to other parts of the forest, or other plantations.
Go on to explain or suggest reasons why it might be randomly dispersed.

Activity

The Metropolitan Police, who cover London, use a system of stop and search to reduce crime across the capital.

The policy is controversial, with critics highlighting that some groups (e.g. those who are from an ethnic minority) are targeted by officers.

They publish data each month about stop and searches which have been conducted.

Data from December 2025 shows 25 occurrences where a person under 10 was stopped and searched. They wish to understand if this is happening in clusters or is regularly dispersed across their 1,578 km² beat.

The data has been plotted on a Geographical Information Systems map, as linked below.

You may wish to copy the map, or open it in Google Earth, to allow access to the measuring tool.

Calculate the Nearest Neighbour Index for the data.

Linear Regression

You’re used to drawing best fit lines on scatter graphs, but how do you know where to actually draw them?

Linear regression allows us to calculate the best fit line mathematically. We can then use it to predict what further values might be, based on the line.

It uses principles that you will have learned about in maths.

You’ll know from maths that the equation for a straight line is y = mx + c

Therefore, we need to calculate m (the gradient) and c (the intercept) to get the equation for the line of best fit.

Worked Example

In the process of a Pearson’s Product Moment Correlation Coefficient calculation, you have already generated lots of the components needed to work out the formula of the best fit line.

We will, therefore, use the existing data we have from the worked example and activity you completed during the Pearson’s Product work, further up this page.

In the worked example, we were looking at how nitrogen dioxide levels change as you move out from the CBD of a city.

Step 1: The formula for a straight line is y = mx + c

First we need to calculate m. We can do this using the formula m = ∑(x – x̅)(y – y̅) ÷ ∑(x – x̅)². If you’ve already done a Pearson’s Product, you will already have calculated both the numerator and denominator.

Step 2: Substitute these values into the formula and solve for m.

m = -1008 ÷ 182

m = -5.54

If your data shows a negative correlation, this should be reflected in the value for m.

Step 3: Substitute these values into the formula and solve for c.

c = y̅ – mx̅

c = 43.69 – (-5.54 × 6)

c = 76.92 (using the unrounded value of m)

Step 4: Write out the equation for your line of best fit.

y = -5.54x + 76.92
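Steps 2 to 4 can be checked with a short Python sketch. It uses the totals and means quoted in the worked example, and assumes the gradient is the total of (x – x̅)(y – y̅) divided by the total of (x – x̅)², which matches the -1008 and 182 used above. The variable names are made up for illustration.

```python
# Totals and means taken from the worked example
sum_xy = -1008   # total of (x - x_bar)(y - y_bar)
sum_xx = 182     # total of (x - x_bar) squared
x_bar = 6        # mean of x
y_bar = 43.69    # mean of y

m = sum_xy / sum_xx    # gradient
c = y_bar - m * x_bar  # intercept

print(f"y = {m:.2f}x + {c:.2f}")
```

Keeping m unrounded until the very end is why c comes out as 76.92 rather than 76.93.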

Step 5: If you’re plotting the best fit line by hand, you’ll need to figure out a couple of points on the line to join up. You can do this by substituting values for x into your equation.

e.g.

y = (-5.54 × 2) + 76.92 = 65.84
y = (-5.54 × 6) + 76.92 = 43.68
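The substitution in Step 5 can be sketched in Python as below. The function name `predict` is made up for illustration, and the gradient and intercept are the rounded values from the equation above.

```python
def predict(x, m=-5.54, c=76.92):
    """Value of y on the best fit line for a given x."""
    return m * x + c


# Two points to plot and join up, at x = 2 and x = 6
points = [(x, round(predict(x), 2)) for x in (2, 6)]
print(points)
```

Picking two x values that are far apart makes the hand-drawn line more accurate than two that are close together.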

A digital hack: Using Google Sheets, you can have the best-fit line plotted and the equation for it displayed. You do this by customising the chart, adding a trend line and asking it to label the equation of the line.

Activity

Revisit the activity you did for Pearson’s Product. Calculate the linear regression – show that you understand the steps rather than just skipping to plotting the line on Google Sheets!

Step 1: Formula

Step 2: Solve for m

Step 3: Solve for c

Step 4: Write out the equation

Step 5: Plot the line