Applying Supervised Machine Learning to Landuse Cover in Santa Barbara, CA

using algorithms to understand how we use our land
Author

Raymond Hunter

Published

December 19, 2023

Overview

Humans have been altering the natural world for centuries through agriculture, farming, development, recreation, and more. The impacts of landuse change have become critical to understand, as evidence shows that it contributes significantly to climate change and is responsible for ecological degradation globally.[^1] Monitoring the distribution and change in landuse types can help us understand the impacts of climate change, natural disasters, deforestation, urbanization, and much more.

[^1]: Dale, V. H. (1997). The relationship between land-use change and climate change. Ecological Applications, 7(3), 753-769.

Determining land cover types over large areas is a major application of remote sensing because we are able to distinguish different materials based on their spectral reflectance. In other words, remote sensing has opened new doors to studying land use change by measuring the proportions of light that different surfaces reflect back to satellites. By utilizing remotely sensed imagery, we can classify landcover into classes or groups that allow us to understand the distribution of, and change in, landcover types over large areas. There are many approaches for performing landcover classification: supervised approaches use training data labeled by the user, whereas unsupervised approaches use algorithms to create groups which are identified by the user afterward.

Objective

In this analysis, I used a form of supervised classification, a decision tree classifier, to predict landuse cover in Santa Barbara, California. Decision trees classify pixels using a series of conditions based on values in spectral bands. These conditions (or decisions) are developed based on training data. Here, I created a land cover classification for southern Santa Barbara County based on multi-spectral imagery and data on the location of 4 land cover types:

  • green vegetation
  • dry grass or soil
  • urban
  • water
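
To make the idea concrete, here is a toy sketch of how rpart learns threshold rules on spectral bands. The reflectance values and class labels below are entirely made up for illustration; they are not the real training data used later in this post.

code
# toy example: rpart splits made-up band values into classes with simple thresholds
library(rpart)
toy <- data.frame(
  type = factor(c("water", "water", "green_vegetation", "green_vegetation",
                  "urban", "urban", "dry_grass", "dry_grass")),
  NIR  = c(2, 3, 45, 50, 20, 22, 30, 33),
  red  = c(4, 5, 6, 7, 18, 20, 25, 28)
)
toy_tree <- rpart(type ~ NIR + red, data = toy, method = "class",
                  control = rpart.control(minsplit = 2))
toy_tree  # printing the tree shows the learned split rules (e.g. "NIR < ...")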

Summary

My approach to this analysis can be summed up in 5 steps:

  • Step 1: load and process Landsat scene data
  • Step 2: crop and mask Landsat data to study area (Santa Barbara)
  • Step 3: extract spectral data at training sites (subset of parcels within Santa Barbara)
  • Step 4: train and apply decision tree classifier
  • Step 5: plot results

Data

Landsat 5 Thematic Mapper

Data were obtained from Landsat 5: one scene from September 25, 2007. The spectral bands used are 1, 2, 3, 4, 5, and 7 from the Collection 2 surface reflectance product.
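
For reference, the Landsat 5 TM bands used here correspond to the following spectral regions (band 6, the thermal band, is excluded). This mapping is what the layer names are based on later in the workflow.

code
# Landsat 5 TM band -> spectral region (bands 1-5 and 7)
tm_bands <- c("1" = "blue", "2" = "green", "3" = "red",
              "4" = "NIR", "5" = "SWIR1", "7" = "SWIR2")
tm_bands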

Study area and training data

A polygon representing southern Santa Barbara County was used as the overall study site. I also used polygons of regions within Santa Barbara as training sites, each labeled with a character string giving its land cover type. The decision tree derived from the training data was then applied to the entire southern Santa Barbara County study area to create predictions of landuse cover for the whole region.

Workflow

Process data

Load packages and set working directory

Because this project required working with both vector and raster data, I used both the sf and terra packages in my workflow. To train the classification decision tree and plot the results, I used the rpart and rpart.plot packages. I also used here and dplyr for file paths and data wrangling, and tmap for mapping.

code
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, results = FALSE, fig.align = "center")

library(sf)          # vector data
library(terra)       # raster data
library(here)        # file paths
library(dplyr)       # data wrangling
library(rpart)       # decision tree classifier
library(rpart.plot)  # plotting decision trees
library(tmap)        # mapping

# path to the local data directory (defined in a hidden setup chunk in the original source)
data <- file.path("/Users/rayhunter/Documents/Bren/Past_Courses/EDS_223/week9/data")

Load Landsat data

First, I created a raster stack from the 6 bands I worked with. Each file name ends with the band number (e.g. B1.tif). Band 6 corresponds to thermal data, which is not used in this analysis, so it was excluded. To create the raster stack, I made a list of the files I wanted to work with and read them in all at once using the rast function. I then updated the layer names to match the spectral bands and plotted a true color image to see what we're working with.

code
# list files for each band, including the full file path
filelist <- list.files(file.path(data, "landsat-data/"), full.names = TRUE)

# read in and store as a raster stack

landsat <- rast(filelist)
landsat

# update layer names to match band

names(landsat) <- c("blue", "green", "red", "NIR", "SWIR1", "SWIR2")
landsat

# plot true color image

plotRGB(landsat, r = 3, g = 2, b = 1, stretch = "lin")

Load study area

I wanted to constrain the analysis to the southern portion of the county where there was training data, so I read in a file that defines the region of interest (ROI). I then made sure the CRS of the shapefile matched that of the Landsat data.

code
# read in shapefile for southern portion of SB county
SB_county_south <- st_read(file.path(data, "SB_county_south.shp"))

# project to match the Landsat data
SB_county_south <- st_transform(SB_county_south, crs = st_crs(landsat))

# plot(SB_county_south)

Crop and mask Landsat data to study area

Next, I cropped and masked the Landsat data to the study area. This reduced the amount of data that needed to be processed, saving computation time and memory. I also removed objects I was no longer working with using the rm() function to free up space (optional).

code
# crop Landsat scene to the extent of the SB county shapefile
landsat_cropped <- crop(landsat, SB_county_south)

# mask the raster to southern portion of SB county
landsat_masked <- mask(landsat_cropped, SB_county_south)

# remove unnecessary object from environment
rm(landsat, landsat_cropped, SB_county_south)

plotRGB(landsat_masked, r = 3, g = 2, b = 1, stretch = "lin")

Convert Landsat values to reflectance

Next, I converted the values in the raster stack to reflectance values. To do so, I first removed erroneous values and applied the scaling factors used to convert to reflectance. In this case, I worked with Landsat Collection 2. The valid range of pixel values for this collection is 7,273-43,636, with a multiplicative scale factor of 0.0000275 and an additive scale factor of -0.2. I therefore reclassified any erroneous values as NA and rescaled each pixel using the scaling factors. The pixel values should now range from 0 to 100%.

code
summary(landsat_masked)
# reclassify erroneous values as NA
rcl <- matrix(c(-Inf, 7273, NA,
                43636, Inf, NA),
              ncol = 3, byrow = TRUE)


landsat <- classify(landsat_masked, rcl = rcl)

# adjust values based on scaling factor

landsat <- (landsat * 0.0000275 - 0.2)*100
summary(landsat)

# plot true color image to check results
plotRGB(landsat, r = 3, g = 2, b = 1, stretch = "lin")

code
# check values are 0 - 100
summary(landsat)
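
As a quick sanity check on the arithmetic (using a hypothetical raw value, not one taken from the scene), a digital number of 20,000 should rescale to (20,000 × 0.0000275 − 0.2) × 100 = 35% reflectance:

code
# hypothetical raw pixel value, just to verify the scaling math
dn <- 20000
(dn * 0.0000275 - 0.2) * 100   # 35 (% reflectance)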

Classify image

Extract reflectance values for training data

Here, I loaded in the shapefile identifying different locations within the study area as containing one of the 4 land cover types. I then extracted the spectral values at each site to create a data frame that relates land cover types to their spectral reflectance.

code
# read in and transform training data
training_data <- st_read(file.path(data, "trainingdata.shp")) %>% 
  st_transform(., crs = st_crs(landsat))
#plot
plot(training_data)

code
# extract reflectance values at training sites
training_data_values <- extract(landsat, training_data, df = TRUE)

# convert training data to data frame
training_data_attributes <- training_data %>% 
  st_drop_geometry()

# join training data attributes and extracted reflectance values

SB_training_data <- left_join(training_data_values, training_data_attributes,
          by = c("ID" = "id")) %>% 
  mutate(type = as.factor(type))
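
As a quick optional check (not part of the original workflow) that the join worked and that every land cover type has training pixels, the joined data can be tallied by class:

code
# number of extracted training pixels per land cover type
count(SB_training_data, type)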

Train decision tree classifier

To train the decision tree, I first needed to establish a model formula (i.e., what the response and predictor variables are). The rpart function, which implements the CART algorithm, takes the model formula and the training data to use. Because I was performing a classification, I set method = "class". I also set na.action = na.omit to remove any pixels with NAs from the analysis.

To understand how the decision tree will classify pixels, I plotted the results. The decision tree consists of a hierarchy of binary decisions; each decision rule has two outcomes based on a conditional statement on spectral band values.

code
# establish model formula

SB_formula <- type ~ red + green + blue + NIR + SWIR1 + SWIR2

# train decision tree
SB_decision_tree <- rpart(formula = SB_formula,
      data = SB_training_data, 
      method = "class", 
      na.action = na.omit)

# plot decision tree
prp(SB_decision_tree)
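
To read the splits as plain-text rules rather than a diagram, the rpart.plot package also provides rpart.rules(); this is an optional extra, not part of the original workflow:

code
# print each leaf of the tree as a human-readable rule
rpart.rules(SB_decision_tree)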

Apply decision tree

After building the decision tree, I applied it to the entire image. The terra package includes a predict() function that lets you apply a model to raster data. For this to work properly, the layer names needed to match the column names of the predictors used to train the decision tree. The predict() function returns a raster layer with integer values, and these integers correspond to the factor levels in the training data. To figure out which category each integer corresponds to, I inspected the levels of the training data.

code
# classify image based on decision tree
SB_classification <- predict(landsat, SB_decision_tree, type = "class", na.rm = TRUE)

# inspect the factor levels to understand the order of classes in the prediction
SB_classification
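
To confirm which integer maps to which class, the factor levels of the training data can be checked directly (a small optional check using the SB_training_data object from the extraction step):

code
# the order of these levels corresponds to the integer values in the prediction
levels(SB_training_data$type)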

Plot results

Finally, I plotted the results to check out the land cover map.

code
# plot results
tm_shape(SB_classification) +
  tm_raster()
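
For a map with explicit class colors, terra's base plot method can be used as an alternative to tmap. This is a minimal sketch, assuming the color order is first checked against the factor levels above:

code
# assumed class order -- verify against levels(SB_training_data$type) before relying on it
plot(SB_classification,
     col = c("tan", "darkgreen", "red", "blue"),
     main = "Predicted land cover, southern Santa Barbara County")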

Conclusion

This analysis highlights the power of supervised classification for gaining insight into land use at a large scale. While the technique has its challenges, it is a powerful tool that can be applied in the environmental field to understand not just landuse cover but a wide variety of environmental topics. It is also important that these predictions are ground truthed to ensure the model results are accurate.
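
One simple way to approximate that ground truthing is to hold out part of the labeled training data, refit the tree, and tabulate predicted versus observed classes. Below is a rough sketch using a hypothetical 70/30 split and reusing objects from the workflow above; it is illustrative, not part of the original analysis.

code
# hold out 30% of labeled pixels for a quick accuracy check (illustrative only)
set.seed(42)
train_idx <- sample(nrow(SB_training_data), size = floor(0.7 * nrow(SB_training_data)))

holdout_tree <- rpart(SB_formula, data = SB_training_data[train_idx, ],
                      method = "class", na.action = na.omit)
holdout_pred <- predict(holdout_tree, newdata = SB_training_data[-train_idx, ],
                        type = "class")

# confusion matrix: rows = observed class, columns = predicted class
table(observed = SB_training_data$type[-train_idx], predicted = holdout_pred)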
