K-Means Land Classification with Dask

K-Means is a clustering algorithm that produces a segmentation map of “clusters”. Each cluster groups pixels whose values are close to a centroid, which is the mean value of the group, so the clusters can serve as estimated, easily separable classes. The classifications should not be considered accurate and require verification; however, K-Means is a great starting point for unsupervised classification problems when determining separable classes.

For geospatial applications, we can use K-Means to create rough land-classification segmentation maps, or to generate automatically labeled data, provided supporting methods are available to verify that the classification is correct.
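To make the idea of centroids and cluster labels concrete, here is a minimal NumPy sketch of the K-Means iteration on a toy two-band "image". It is an illustration only, not the Dask-based implementation used later in this notebook. The function name `kmeans`, the toy data, and the farthest-first initialisation are assumptions chosen to keep the example deterministic.

```python
import numpy as np


def kmeans(samples, n_clusters, n_iter=10):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    samples = np.asarray(samples, dtype=float)
    # Farthest-first initialisation (an assumption for this sketch): start at
    # the first sample, then add the sample farthest from the chosen centroids.
    centroids = [samples[0]]
    for _ in range(1, n_clusters):
        dist_to_nearest = np.min(
            [np.linalg.norm(samples - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(samples[dist_to_nearest.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assignment step: label each sample with its nearest centroid.
        distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = samples[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Toy "image": 4 x 4 pixels with 2 bands, dark on the left, bright on the right.
image = np.zeros((4, 4, 2))
image[:, :2] = [0.1, 0.2]
image[:, 2:] = [0.8, 0.9]

# Flatten to (pixels, bands), cluster, then reshape labels back to the grid.
labels, centroids = kmeans(image.reshape(-1, 2), n_clusters=2)
segmentation = labels.reshape(4, 4)
print(segmentation)
```

The cluster indices themselves carry no meaning: whether "0" ends up as vegetation or water has to be decided by inspecting the result, which is why the output needs verification.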

[ ]:

# We will be using Sentinel-2 L2A imagery from Microsoft Planetary Computer STAC server:
!pip install planetary_computer

[ ]:

import os
import rasterio
import rioxarray
import pystac
import stackstac
import datetime
import planetary_computer
import dask
import json
import gcsfs

import dask_ml.cluster

import numpy as np
import xarray as xr
import rioxarray as rxr
import matplotlib.pyplot as plt
import geopandas as gpd

from skimage.exposure import rescale_intensity
from dask_gateway import Gateway
from shapely.geometry import Polygon
from pystac_client import Client

1. Initialize Dask Cluster

We will use Dask to power the computation of a K-Means model, which will be fitted and then used for predictions. Start by initializing a Dask cluster in a separate notebook and connecting to it. We then scale the cluster to 3 workers.

Remember to replace the dask cluster’s name below with the one you instantiate.

[4]:

gateway = Gateway()
cluster = gateway.connect('daskhub.81d82a23b4ea4bb2aac199856b4049f2')
client = cluster.get_client()
cluster
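The scaling step mentioned above is not shown in the cell. One possible way to do it, sketched here, is a small helper that wraps the connect, scale, and get-client calls. The helper name `connect_and_scale` is hypothetical; `connect`, `scale`, and `get_client` are the Dask Gateway methods it assumes the objects provide.

```python
def connect_and_scale(gateway, cluster_name, n_workers=3):
    """Connect to an existing Dask Gateway cluster and request n_workers workers.

    `gateway` is expected to behave like dask_gateway.Gateway; this helper is a
    hypothetical convenience wrapper, not part of the Dask Gateway API.
    """
    cluster = gateway.connect(cluster_name)  # same call as in the cell above
    cluster.scale(n_workers)                 # request a fixed number of workers
    client = cluster.get_client()            # client bound to that cluster
    return cluster, client


# Usage (commented out because it needs a running Dask Gateway):
# gateway = Gateway()
# print(gateway.list_clusters())  # shows the names of running clusters
# cluster, client = connect_and_scale(gateway, "<your-cluster-name>")
```

Listing the running clusters first is a convenient way to find the name to pass in, rather than copying it by hand from the other notebook.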

AOI

This area of interest (AOI) was generated from: https://www.keene.edu/campus/maps/tool/

For the purpose of this demonstration, we will look at the Timberlea suburb in Montreal, Quebec, Canada.

[10]:

_polygon = {
  "coordinates": [
    [
      [
        -73.8847303,
        45.4294192
      ],
      [
        -73.883357,
        45.4445361
      ],
      [
        -73.9108229,
        45.4442049
      ],
      [
        -73.9120245,
        45.4263471
      ],
      [
        -73.8847303,
        45.4294192
      ]
    ]
  ],
  "type": "Polygon"
}

[11]:

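# Split the exterior ring of the AOI into longitude and latitude lists
# (GeoJSON coordinates are in lon, lat order).
ring = _polygon['coordinates'][0]
lon_list = [lon for lon, lat in ring]
lat_list = [lat for lon, lat in ring]

# Build a shapely polygon and wrap it in a single-row GeoDataFrame.
polygon_geom = Polygon(zip(lon_list, lat_list))
crs = 'EPSG:4326'  # WGS 84, the CRS of the GeoJSON coordinates
polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[polygon_geom])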
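Spatial filters often take a bounding box instead of a full polygon. As a minimal sketch, the box can be derived directly from the GeoJSON dictionary with plain Python. The function name `geojson_bbox` and the variable `aoi` are hypothetical, and the coordinates are copied from the AOI above.

```python
def geojson_bbox(geometry):
    """Return (min_lon, min_lat, max_lon, max_lat) for a GeoJSON Polygon dict."""
    ring = geometry["coordinates"][0]  # exterior ring, in lon, lat order
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    return (min(lons), min(lats), max(lons), max(lats))


# Same coordinates as the AOI defined above.
aoi = {
    "type": "Polygon",
    "coordinates": [[
        [-73.8847303, 45.4294192],
        [-73.883357, 45.4445361],
        [-73.9108229, 45.4442049],
        [-73.9120245, 45.4263471],
        [-73.8847303, 45.4294192],
    ]],
}

bbox = geojson_bbox(aoi)
print(bbox)  # (-73.9120245, 45.4263471, -73.883357, 45.4445361)
```

With the GeoDataFrame built above, `polygon.total_bounds` should give the same four values as a NumPy array.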

[12]:

# Set up the STAC client
api = Client.open('https://planetarycomputer.microsoft.com/api/stac/v1')
api

[12]:

Client: microsoft-pc

id: microsoft-pc

title: Microsoft Planetary Computer STAC API

description: Searchable spatiotemporal metadata describing Earth science datasets hosted by the Microsoft Planetary Computer

type: Catalog

conformsTo: ['http://www.opengis.net/spec/cql2/1.0/conf/basic-cql2', 'http://www.opengis.net/spec/cql2/1.0/conf/cql2-json', 'http://www.opengis.net/spec/cql2/1.0/conf/cql2-text', 'http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/core', 'http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/geojson', 'http://www.opengis.net/spec/ogcapi-features-1/1.0/conf/oas30', 'http://www.opengis.net/spec/ogcapi-features-3/1.0/conf/filter', 'https://api.stacspec.org/v1.0.0-rc.1/collections', 'https://api.stacspec.org/v1.0.0-rc.1/core', 'https://api.stacspec.org/v1.0.0-rc.1/item-search', 'https://api.stacspec.org/v1.0.0-rc.1/item-search#fields', 'https://api.stacspec.org/v1.0.0-rc.1/item-search#filter', 'https://api.stacspec.org/v1.0.0-rc.1/item-search#query', 'https://api.stacspec.org/v1.0.0-rc.1/item-search#sort', 'https://api.stacspec.org/v1.0.0-rc.1/ogcapi-features']
