It is important to understand exactly what data is available and how to obtain that data.

We encourage you to explore the dataset on UBC’s website to further understand the dataset’s use and development by UBC (BlueSky Canada 2021).

Here we establish where the forecasts are hosted, how to download them, what each NetCDF file contains, and what holds across the full collection of files.

This information was not determined the first time we explored this dataset.

As seen in later chapters, we initially operated under misinformed assumptions and encountered issues that could only be resolved by working with the data “blindly”. The purpose of this chapter is to record the final set of information we learned about our data source after extensive exploration and manipulation.

4.1 Overview

The Weather Forecast Research Team at the University of British Columbia (UBC) generates a short-term dataset of PM2.5 smoke particulate presence in North America using their BlueSky Western Canada Wildfire Smoke Forecasting System. Four times a day over the past three years, UBC has created 2-day forecasts of ground-level PM2.5 smoke particulate for Canada and the continental United States. Each such forecast is downloadable as a NetCDF or KMZ file, and UBC provides free daily access to these predictions at their website firesmoke.ca.

4.2 UBC Smoke Forecast Files Access

4.2.1 Available Forecasts

All forecast files are uniquely identified by a forecast ID (which encodes when the meteorology forecast is initialized), a smoke forecast initialization time, and a date. The time ranges of available files by forecast ID are shown in Table 4.1. Please note, there are occasional failed forecasts or otherwise unavailable files within the date ranges specified in Table 4.1; see Section 4.4 for further details.

Table 4.1: Dates for which all forecast ID datasets are publicly available. All times are in UTC and the grid size is 12 km.
Forecast ID Meteorology Forecast Initialization (UTC) Smoke Forecast Initialization (UTC) Start Date End Date
BSC00CA12-01 00Z 08Z March 4, 2021 Present Day
BSC06CA12-01 06Z 14Z March 4, 2021 Present Day
BSC12CA12-01 12Z 20Z March 3, 2021 Present Day
BSC18CA12-01 18Z 02Z March 4, 2021 Present Day

The smoke forecasts are updated daily, including the present day, so there is no fixed end date; the latest data must therefore be downloaded on a regular basis. We have not yet implemented this process, so the latest forecast files we use run up to June 27, 2024.

There is no official source stating the earliest available date for each forecast. Knowing the project began in 2021, we inferred that the earliest available dates would fall in 2021 and found them via trial and error.
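That search can be sketched as follows. This is a hypothetical reconstruction, not our exact procedure: it assumes the URL layout described in Section 4.2.2 and that missing forecasts return an HTTP error status.

```python
# Hypothetical sketch of the trial-and-error search for the earliest file.
# Assumes the firesmoke.ca URL layout from Section 4.2.2 and that missing
# forecasts return an HTTP error status.
import datetime
import urllib.error
import urllib.request

def forecast_url(forecast_id: str, day: datetime.date, init_time: str) -> str:
    return (f"https://firesmoke.ca/forecasts/{forecast_id}/"
            f"{day:%Y%m%d}{init_time}/dispersion.nc")

def forecast_exists(forecast_id: str, day: datetime.date, init_time: str) -> bool:
    # A HEAD request checks availability without downloading the ~80 MB file.
    req = urllib.request.Request(forecast_url(forecast_id, day, init_time),
                                 method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30):
            return True
    except urllib.error.HTTPError:
        return False

# Scan forward from the start of 2021 until the first available file, e.g.:
# day = datetime.date(2021, 1, 1)
# while not forecast_exists("BSC00CA12-01", day, "08"):
#     day += datetime.timedelta(days=1)
```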

4.2.2 Download Instructions

To download the 2-day forecast for a forecast initialization date of one’s choice, follow the instructions below. The downloaded file can be either a NetCDF or a KMZ file.

Go to the URL: https://firesmoke.ca/forecasts/{Forecast ID}/{YYYYMMDD}{InitTime}/{File Type}

Where:

  • YYYYMMDD is the date of choice.
  • ForecastID and InitTime are the chosen values as described in Table 4.2.
  • File Type is either dispersion.nc or dispersion.kmz for either the NetCDF file or KMZ file, respectively.
Table 4.2: UBC Smoke Forecast Data Download Parameters.
Forecast ID Smoke Forecast Initialization (UTC)
BSC00CA12-01 08
BSC06CA12-01 14
BSC12CA12-01 20
BSC18CA12-01 02

4.2.2.1 Download Example

Let’s try downloading the forecast for March 4, 2021, where the weather forecast is initiated at 00:00:00 UTC and the smoke forecast is initialized at 08:00:00 UTC, by navigating to the corresponding URL.

Code
forecast_id = "BSC00CA12-01"
yyyymmdd = "20210304"
init_time = "08"

url = (
    f"https://firesmoke.ca/forecasts/{forecast_id}/{yyyymmdd}{init_time}/dispersion.nc"
)

print(f"Navigate to this URL in your browser: {url}")
Navigate to this URL in your browser: https://firesmoke.ca/forecasts/BSC00CA12-01/2021030408/dispersion.nc

4.3 The NetCDF File

Next, let’s look at what is within the NetCDF file located at the URL in our previous example.

4.3.1 File Preview

We load dispersion.nc using xarray, which provides a preview of the file.

Code
import xarray as xr

ds = xr.open_dataset("data_notebooks/data_source/dispersion.nc")
ds
<xarray.Dataset> Size: 81MB
Dimensions:  (TSTEP: 51, VAR: 1, DATE-TIME: 2, LAY: 1, ROW: 381, COL: 1041)
Dimensions without coordinates: TSTEP, VAR, DATE-TIME, LAY, ROW, COL
Data variables:
    TFLAG    (TSTEP, VAR, DATE-TIME) int32 408B ...
    PM25     (TSTEP, LAY, ROW, COL) float32 81MB ...
Attributes: (12/33)
    IOAPI_VERSION:  $Id: @(#) ioapi library version 3.0 $                    ...
    EXEC_ID:        ????????????????                                         ...
    FTYPE:          1
    CDATE:          2021063
    CTIME:          101914
    WDATE:          2021063
    ...             ...
    VGLVLS:         [10.  0.]
    GDNAM:          HYSPLIT CONC    
    UPNAM:          hysplit2netCDF  
    VAR-LIST:       PM25            
    FILEDESC:       Hysplit Concentration Model Output                       ...
    HISTORY:        

4.3.2 File Attributes

dispersion.nc contains the following attributes. Note that all files, across all forecast IDs, share the same dimension and variable names:

4.3.2.1 Dimensions:

The dimensions described in Table 4.3 determine the indices along which we may index our variables.

Table 4.3: Description of Dimensions for Indexing Data in NetCDF Files
Dimension Size Description
TSTEP 51 This dimension represents the number of time steps in the file. Each file has 51 hours represented.
VAR 1 This dimension is a placeholder for the variables in the file.
DATE-TIME 2 This dimension stores the date and time information for each time step.
LAY 1 This dimension represents the number of layers in the file, which is 1 in this case.
ROW 381 This dimension represents the number of rows in the spatial grid.
COL 1041 This dimension represents the number of columns in the spatial grid.

4.3.2.2 Variables:

The variables described in Table 4.4 contain the data in question that we would like to extract.

Table 4.4: Description of Variables in NetCDF Files
Variable Dimensions Data Type Description
TFLAG TSTEP, VAR, DATE-TIME int32 This variable stores the date and time of each time step.
PM25 TSTEP, LAY, ROW, COL float32 This variable contains the concentration of particulate matter (PM2.5) for each time step, layer, row, and column in the spatial grid.

4.3.2.3 Attributes

Of the 33 available attributes we use the ones shown in Table 4.5:

Table 4.5: Description of Attributes in dispersion.nc
Attribute Value Description
CDATE 2021063 The creation date of the dataset, in YYYYDDD format.
CTIME 101914 The creation time of the dataset, in HHMMSS format.
WDATE 2021063 The date for which the weather forecast is initiated, in YYYYDDD format.
WTIME 101914 The time for which the weather forecast is initiated, in HHMMSS format.
SDATE 2021063 The date for which the smoke forecast is initiated, in YYYYDDD format.
STIME 90000 The time for which the smoke forecast is initiated, in HHMMSS format.
NCOLS 1041 The number of columns in the spatial grid.
NROWS 381 The number of rows in the spatial grid.
XORIG -156.0 The origin (starting point) of the grid in the x-direction (degrees longitude).
YORIG 32.0 The origin (starting point) of the grid in the y-direction (degrees latitude).
XCELL 0.10000000149011612 The cell size in the x-direction (degrees).
YCELL 0.10000000149011612 The cell size in the y-direction (degrees).
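
The YYYYDDD day-of-year stamps are not immediately readable. As a quick illustration, they can be decoded with Python's strptime and its %j (day-of-year) directive, shown here with the CDATE and CTIME values from Table 4.5:

```python
# Decode a YYYYDDD date plus HHMMSS time into a datetime object.
# Values taken from Table 4.5 (CDATE = 2021063, CTIME = 101914).
import datetime

cdate, ctime = 2021063, 101914
# %j parses the three-digit day-of-year; zero-pad the time to six digits.
created = datetime.datetime.strptime(f"{cdate}{ctime:06d}", "%Y%j%H%M%S")
print(created)  # 2021-03-04 10:19:14
```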

Let’s look closer at what exactly is within one NetCDF file in the following demo.

4.3.3 NetCDF Visualization Demo

In this demo we load a different dispersion.nc file and explore how to visualize the data within the file.

4.3.3.1 Accessing the File

We use the forecast for March 4, 2021 where the weather forecast is initiated at 00:00:00 UTC and the smoke forecast is initialized at 08:00:00 UTC. You can download this file by navigating to the URL below.

Code
forecast_id = "BSC00CA12-01"
yyyymmdd = "20210304"
init_time = "08"

url = (
    f"https://firesmoke.ca/forecasts/{forecast_id}/{yyyymmdd}{init_time}/dispersion.nc"
)

print(f"Download this file from URL: {url}")

# import urllib.request
# urllib.request.urlretrieve(url, "dispersion.nc")
Download this file from URL: https://firesmoke.ca/forecasts/BSC00CA12-01/2021030408/dispersion.nc
4.3.3.1.1 Opening the File

We use xarray to open the NetCDF file and preview it.

Code
import xarray as xr

ds = xr.open_dataset("dispersion.nc")

ds
<xarray.Dataset>
Dimensions:  (TSTEP: 51, VAR: 1, DATE-TIME: 2, LAY: 1, ROW: 381, COL: 1041)
Dimensions without coordinates: TSTEP, VAR, DATE-TIME, LAY, ROW, COL
Data variables:
    TFLAG    (TSTEP, VAR, DATE-TIME) int32 ...
    PM25     (TSTEP, LAY, ROW, COL) float32 ...
Attributes: (12/33)
    IOAPI_VERSION:  $Id: @(#) ioapi library version 3.0 $                    ...
    EXEC_ID:        ????????????????                                         ...
    FTYPE:          1
    CDATE:          2021063
    CTIME:          101914
    WDATE:          2021063
    ...             ...
    VGLVLS:         [10.  0.]
    GDNAM:          HYSPLIT CONC    
    UPNAM:          hysplit2netCDF  
    VAR-LIST:       PM25            
    FILEDESC:       Hysplit Concentration Model Output                       ...
    HISTORY:        

4.3.3.2 Using the Data

4.3.3.2.1 Accessing Arrays

The data we are interested in is the PM2.5 values. Let’s use xarray to get the array in the PM25 variable.

Code
ds["PM25"]
<xarray.DataArray 'PM25' (TSTEP: 51, LAY: 1, ROW: 381, COL: 1041)>
[20227671 values with dtype=float32]
Dimensions without coordinates: TSTEP, LAY, ROW, COL
Attributes:
    long_name:  PM25            
    units:      ug/m^3          
    var_desc:   PM25                                                         ...

The dimensions of the PM25 data array are composed of TSTEP, LAY, ROW, and COL. We do not need the LAY dimension, so let’s use numpy to drop it.

Code
import numpy as np

ds_pm25_vals = ds["PM25"].values
print(f'The shape of the data contained in PM25 variable is: {np.shape(ds_pm25_vals)}')

ds_pm25_vals = np.squeeze(ds_pm25_vals)
print(f'After squeezing, the shape is: {np.shape(ds_pm25_vals)}')
1
Use .values to get the four dimensional array.
2
Use np.squeeze to drop the LAY axis
The shape of the data contained in PM25 variable is: (51, 1, 381, 1041)
After squeezing, the shape is: (51, 381, 1041)

We now have ds_pm25_vals. Next, let’s select a time step and visualize the data.

4.3.3.2.2 Visualize Array in matplotlib

We can index time step 10 and use matplotlib to visualize it.

Code
import matplotlib.pyplot as plt

tstep = 10
smoke_at_tstep = ds_pm25_vals[tstep, :, :]

my_fig, my_plt = plt.subplots(figsize=(15, 6))

my_norm = "log"
my_aspect = 'auto'
my_origin = 'lower'
my_cmap = 'viridis'
 
plot = my_plt.imshow(smoke_at_tstep, norm=my_norm, aspect=my_aspect, origin=my_origin, cmap=my_cmap)

my_fig.colorbar(plot, location='right', label='ug/m^3')

my_fig.suptitle('Ground level concentration of PM2.5 microns and smaller')

my_plt.set_title(f'Timestep {tstep}')

plt.show()
1
Index ds_pm25_vals at TSTEP = 10, selecting all ROWs and COLs.
2
Color PM25 values on a log scale, since values are small.
3
Ensure the aspect ratio of our plot fits all data; matplotlib can do this automatically.
4
Tell matplotlib our origin is the lower-left corner.
5
Select a colormap for our plot.
6
Create our plot using imshow.
7
Add a colorbar to our figure, based on the plot we just made above.
8
Set title of our figure.
9
Set title of our plot as the timestamp of our data.
10
Show the resulting visualization.

Notice there are no axis labels or metadata presented here. Next we will show how to use the metadata in dispersion.nc so the data is actually interpretable.

4.3.3.3 Incorporating Metadata to Visualization via Coordinates

4.3.3.3.1 Latitude and Longitude Coordinates

dispersion.nc includes attributes to generate the latitude and longitude values on the grid defined by NCOLS and NROWS. We use this grid to match each data point in the PM25 variable to a lat/lon coordinate.

Code
xorig = ds.XORIG
yorig = ds.YORIG
xcell = ds.XCELL
ycell = ds.YCELL
ncols = ds.NCOLS
nrows = ds.NROWS

longitude = np.linspace(xorig, xorig + xcell * (ncols - 1), ncols)
latitude = np.linspace(yorig, yorig + ycell * (nrows - 1), nrows)

print("Size of longitude & latitude arrays:")
print(f'np.size(longitude) = {np.size(longitude)}')
print(f'np.size(latitude) = {np.size(latitude)}\n')
print("Min & Max of longitude and latitude arrays:")
print(f'longitude: min = {np.min(longitude)}, max = {np.max(longitude)}')
print(f'latitude: min = {np.min(latitude)}, max = {np.max(latitude)}')
Size of longitude & latitude arrays:
np.size(longitude) = 1041
np.size(latitude) = 381

Min & Max of longitude and latitude arrays:
longitude: min = -156.0, max = -51.999998450279236
latitude: min = 32.0, max = 70.00000056624413

xarray allows us to create coordinates, which map variable values to values of our choice. In this case, we create coordinates mapping each PM25 value to a latitude and longitude value.

Code
ds.coords['lat'] = ('ROW', latitude)
ds.coords['lon'] = ('COL', longitude)

ds = ds.swap_dims({'COL': 'lon', 'ROW': 'lat'})

ds
1
Create coordinates for latitude and longitude.
2
Replace COL and ROW dimensions with newly calculated longitude and latitude coordinates.
<xarray.Dataset>
Dimensions:  (TSTEP: 51, VAR: 1, DATE-TIME: 2, LAY: 1, lat: 381, lon: 1041)
Coordinates:
  * lat      (lat) float64 32.0 32.1 32.2 32.3 32.4 ... 69.6 69.7 69.8 69.9 70.0
  * lon      (lon) float64 -156.0 -155.9 -155.8 -155.7 ... -52.2 -52.1 -52.0
Dimensions without coordinates: TSTEP, VAR, DATE-TIME, LAY
Data variables:
    TFLAG    (TSTEP, VAR, DATE-TIME) int32 ...
    PM25     (TSTEP, LAY, lat, lon) float32 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Attributes: (12/33)
    IOAPI_VERSION:  $Id: @(#) ioapi library version 3.0 $                    ...
    EXEC_ID:        ????????????????                                         ...
    FTYPE:          1
    CDATE:          2021063
    CTIME:          101914
    WDATE:          2021063
    ...             ...
    VGLVLS:         [10.  0.]
    GDNAM:          HYSPLIT CONC    
    UPNAM:          hysplit2netCDF  
    VAR-LIST:       PM25            
    FILEDESC:       Hysplit Concentration Model Output                       ...
    HISTORY:        

Now let’s move on to incorporating time stamp metadata.

4.3.3.3.2 Time Coordinates

Recall, there is a TFLAG variable in dispersion.nc.

Code
ds['TFLAG']
<xarray.DataArray 'TFLAG' (TSTEP: 51, VAR: 1, DATE-TIME: 2)>
[102 values with dtype=int32]
Dimensions without coordinates: TSTEP, VAR, DATE-TIME
Attributes:
    units:      <YYYYDDD,HHMMSS>
    long_name:  TFLAG           
    var_desc:   Timestep-valid flags:  (1) YYYYDDD or (2) HHMMSS             ...

The earliest and latest TFLAGs look like the following:

Code
print(f"Earliest available TFLAG is {ds['TFLAG'].values[0][0]}")
print(f"Latest available TFLAG is {ds['TFLAG'].values[-1][0]}")
Earliest available TFLAG is [2021063   90000]
Latest available TFLAG is [2021065  110000]

These time flags require processing to be immediately legible. Let’s write a function to process them accordingly, using the datetime library.

Code
import datetime

def parse_tflag(tflag):
    """
    Return the tflag as a datetime object
    :param list tflag: a list of two int32, the 1st representing date and 2nd representing time
    """
    date = int(tflag[0])
    year = date // 1000
    day_of_year = date % 1000

    final_date = datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)

    time = int(tflag[1])
    hours = time // 10000
    minutes = (time % 10000) // 100
    seconds = time % 100

    full_datetime = datetime.datetime(year, final_date.month, final_date.day, hours, minutes, seconds)
    return full_datetime
1
Obtain year and day of year from tflag[0] (date).
2
Extract the year from the first 4 digits of tflag[0].
3
Extract the day of the year from the last 3 digits of tflag[0].
4
Create a datetime object representing the date.
5
Obtain hour, minutes, and seconds from tflag[1] (time).
6
Extract hours from the first 2 digits of tflag[1].
7
Extract minutes from the 3rd and 4th digits of tflag[1].
8
Extract seconds from the last 2 digits of tflag[1].
9
Create the final datetime object with the extracted date and time components.

Now we have datetime objects that represent the time flags in a more legible and usable format.

Code
print(f"Earliest available TFLAG is {parse_tflag(ds['TFLAG'].values[0][0])}")
print(f"Latest available TFLAG is {parse_tflag(ds['TFLAG'].values[-1][0])}")
Earliest available TFLAG is 2021-03-04 09:00:00
Latest available TFLAG is 2021-03-06 11:00:00
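Just as the spatial metadata can be attached to ROW and COL as lat/lon coordinates, the parsed time flags could be attached to TSTEP as a coordinate. A minimal self-contained sketch, using a compact strptime-based parser and two illustrative TFLAG rows in place of the real 51-step array:

```python
# Sketch: parse TFLAG rows into datetimes so they can serve as a TSTEP
# coordinate. The two TFLAG rows below are illustrative stand-ins for the
# real 51-step array.
import datetime

def tflag_to_datetime(tflag):
    # tflag = [YYYYDDD, HHMMSS]; strptime's %j directive parses day-of-year.
    return datetime.datetime.strptime(
        f"{int(tflag[0])}{int(tflag[1]):06d}", "%Y%j%H%M%S")

tflags = [[2021063, 90000], [2021065, 110000]]
times = [tflag_to_datetime(t) for t in tflags]
print(times[0], times[-1])  # 2021-03-04 09:00:00 2021-03-06 11:00:00

# On the real dataset this would be, for example:
# ds.coords["time"] = ("TSTEP", [parse_tflag(t[0]) for t in ds["TFLAG"].values])
# ds = ds.swap_dims({"TSTEP": "time"})
```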
4.3.3.3.3 Visualize Array in matplotlib

Let’s visualize timestep 10 again, but now we can label the data using latitudes and longitudes, and the corresponding time flag.

Code
tstep = 10
smoke_at_tstep = ds_pm25_vals[tstep, :, :]
tstep_tflag = parse_tflag(ds['TFLAG'].values[tstep][0])

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

my_fig, my_plt = plt.subplots(figsize=(15, 6), subplot_kw=dict(projection=ccrs.PlateCarree()))

my_norm = "log"
my_extent = [np.min(longitude), np.max(longitude), np.min(latitude), np.max(latitude)]
my_aspect = 'auto'
my_origin = 'lower'
my_cmap = 'viridis'

plot = my_plt.imshow(smoke_at_tstep, norm=my_norm, extent=my_extent, 
          aspect=my_aspect, origin=my_origin, cmap=my_cmap)

my_plt.coastlines()

my_plt.gridlines(draw_labels=True)

my_fig.colorbar(plot, location='right', label='ug/m^3')

my_fig.supxlabel('Longitude')
my_fig.supylabel('Latitude')

my_fig.suptitle('Ground level concentration of PM2.5 microns and smaller')

my_plt.set_title(f'{tstep_tflag}')

plt.show()
1
Define the time step.
2
Extract the PM2.5 data for the specified time step.
3
Parse the time flag for the specified time step.
4
Initialize a figure and plot with a specific projection.
5
Set the normalization for PM2.5 values to a logarithmic scale.
6
Define the extent of the plot based on the longitude and latitude range.
7
Set the aspect ratio of the plot to fit all data automatically.
8
Specify the origin of the plot as the lower-left corner.
9
Choose a colormap for the plot.
10
Create the plot using imshow with the specified parameters.
11
Draw coastlines on the plot.
12
Draw latitude and longitude lines with labels.
13
Add a colorbar to the figure based on the plot.
14
Set the x-axis label.
15
Set the y-axis label.
16
Set the title of the figure.
17
Set the title of the plot as the timestamp of the data.
18
Display the resulting visualization.

Code
ds.close()

Now that we understand how to load the data and metadata from the file and process it for visualization, let’s establish the data and metadata available to us across all NetCDF files.

4.4 Information Across all NetCDF Files

Knowing what is within one NetCDF file as well as the date range for which we can download them, let’s establish the metadata associated with the NetCDF files as a collection.

4.4.1 Disk Size

For the time ranges we cover, Table 4.6 shows how large the set of files per forecast ID is.

Table 4.6: File Sizes and Counts for Each Forecast ID within the Specified Date Range
Forecast ID Date Range Size File Count
BSC00CA12-01 March 4, 2021 - June 27, 2024 84G 1077
BSC06CA12-01 March 4, 2021 - June 27, 2024 78G 1022
BSC12CA12-01 March 3, 2021 - June 27, 2024 79G 1022
BSC18CA12-01 March 4, 2021 - June 27, 2024 79G 1023
Total 320G 4144

4.4.2 Temporal Data Availability

4.4.2.1 Loading Files

We have downloaded all NetCDF files available up to June 27, 2024.

Let’s demonstrate the staggered nature of the forecasts by opening the files for March 2, 2022 through March 4, 2022. We use the parse_tflag function.

Code
import datetime

def parse_tflag(tflag):
    """
    Return the tflag as a datetime object
    :param list tflag: a list of two int32, the 1st representing date and 2nd representing time
    """
    date = int(tflag[0])
    year = date // 1000 # first 4 digits of tflag[0]
    day_of_year = date % 1000 # last 3 digits of tflag[0]

    final_date = datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)

    time = int(tflag[1])
    hours = time // 10000 # first 2 digits of tflag[1]
    minutes = (time % 10000) // 100 # 3rd and 4th digits of tflag[1]
    seconds = time % 100  # last 2 digits of tflag[1]

    full_datetime = datetime.datetime(year, final_date.month, final_date.day, hours, minutes, seconds)
    return full_datetime
1
Obtain year and day of year from tflag[0] (date)
2
Create datetime object representing date
3
Obtain hour, mins, and secs from tflag[1] (time)
4
Create final datetime object
Code
files_dir = 'dispersion_files'
ids = ["BSC18CA12-01", "BSC00CA12-01", "BSC06CA12-01", "BSC12CA12-01"]
file_names = ['dispersion_20220302.nc', 'dispersion_20220303.nc', 'dispersion_20220304.nc']
1
Define the directory location of our files.
2
List of the forecast IDs we use.
3
file_names is each file we will open per forecast ID next.
Code
import xarray as xr
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.lines import Line2D

fig, ax = plt.subplots()

colors = ['pink', 'orange', 'green']
num_files = len(file_names)
bar_height = 1 / (num_files + 1)  # Dynamic height for the bars

for idx, id_ in enumerate(ids):
    for i, f in enumerate(file_names):
        curr_file = f'{files_dir}/{id_[3:5]}/{f}'
        ds = xr.open_dataset(curr_file)
        tflags = ds['TFLAG'].values
        
        earliest_time = parse_tflag(tflags[0][0])
        latest_time = parse_tflag(tflags[-1][0])
        
        # Adjust y position to avoid overlap, using `idx` and `i` to space out bars
        y_position = idx + i * bar_height
        ax.barh(y_position, latest_time - earliest_time, left=earliest_time, 
                height=bar_height * 0.8, color=colors[i])

# Adjust y-axis to display ID labels in a way that matches the bar positions
ax.set_yticks([i + (num_files - 1) * bar_height / 2 for i in range(len(ids))])
ax.set_yticklabels(ids)

ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d %H:%M"))
ax.xaxis.set_major_locator(mdates.HourLocator(interval=6))
fig.autofmt_xdate(rotation=50)

ax.set_xlabel("Time")
ax.set_ylabel("Forecast ID")
# ax.set_title("Smoke Predictions per Forecast ID per File, for 03/02/22-03/05/22")

legend_elements = [Line2D([0], [0], linewidth=4, color=colors[i], label=file_names[i]) for i in range(3)]
ax.legend(handles=legend_elements, loc='center right')

# plt.show()
plt.savefig('overlaps.png', bbox_inches="tight")
1
Open each file for each forecast ID.
2
Create the path string.
3
Open the file with xarray and get the TFLAG values.
4
Get the earliest and latest available time flags.
5
Plot the time range represented with the time flags as a horizontal bar.

The 12 loaded files (three per forecast ID) cover the time ranges shown above.

Notice that for any given hour, there can be several predictions available to choose from. For example, 2022-03-04 01:00 is represented in dispersion_20220303.nc across all forecast IDs.

Given that these predictions are forecasts, we operate under the assumption that the earliest available predictions in each file are the most accurate.

Therefore, for example, if we want to choose the most accurate PM2.5 prediction for 2022-03-04 01:00 we would load the prediction with the BSC12CA12-01 forecast ID.
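This selection rule can be sketched as follows. The file labels and first time steps below are illustrative assumptions, not values read from real files: among the forecasts covering a target hour, we pick the one where the target falls earliest in the 51-hour horizon, i.e. the shortest lead time.

```python
# Hypothetical sketch of the selection rule: among forecasts covering a
# target hour, prefer the one with the shortest lead time (the target falls
# earliest within that file's 51-hour horizon). Labels and start times are
# illustrative.
import datetime

HORIZON = datetime.timedelta(hours=50)  # 51 hourly steps span 50 hours

def pick_best(target, candidates):
    """candidates: (label, first_time_step) pairs for downloaded files."""
    covering = [(label, start) for label, start in candidates
                if start <= target <= start + HORIZON]
    # Smallest offset from the file's first time step = shortest lead time.
    return min(covering, key=lambda c: target - c[1])[0]

candidates = [
    ("BSC00CA12-01/dispersion_20220303.nc", datetime.datetime(2022, 3, 3, 9)),
    ("BSC12CA12-01/dispersion_20220303.nc", datetime.datetime(2022, 3, 3, 21)),
]
best = pick_best(datetime.datetime(2022, 3, 4, 1), candidates)
print(best)  # BSC12CA12-01/dispersion_20220303.nc
```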

For further details on how we load the best prediction for every hour of the dataset, see Section 6.6.

Now that we know what exactly is within a NetCDF file and across all the NetCDF files, we will continue to describe our data repurposing process. Next we describe how we load all of the data available from UBC onto our machine.