The success of this data curation project hinges on competent data validation. Data validation is the bridge between theoretical and real-world data curation. Unfortunately, we made many assumptions about our data and the systems we used that did not hold true, leading to hard-to-explain issues throughout the data curation process.

Here, we describe where our data validation fell short and the consequences we faced. We conclude with a discussion of how data validation can best be incorporated into this project and future projects like it, so that others may avoid these issues.

7.1 Data Validation vs Data Exploration

Data exploration reveals the contents of the data and metadata one presumes to have. We performed data exploration by loading and visualizing a few files, which allowed us to understand what data UBC aims to provide.

What data exploration does not do is explain the origin of issues such as missing or seemingly corrupt data. Data exploration works under the assumption that the data is sound.

Data validation forces one to ask: where do the issues that appear in the data come from, and are they issues with the data itself or with the systems used to access and manipulate it?

7.2 Data Validation for Data Loading

Here, we describe how we discovered the consequences of failing to validate that the files we downloaded were true NetCDF files rather than HTML webpages, as described in Chapter 5.
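In hindsight, a lightweight check of each downloaded file's magic bytes would have distinguished real NetCDF files from HTML error pages before any conversion work. The following is a minimal sketch of such a check; the function name and example file name are illustrative and were not part of our pipeline.

Code
def looks_like_netcdf(path):
    """Return True if the file begins with a known NetCDF magic number."""
    with open(path, 'rb') as f:
        header = f.read(8)
    # Classic NetCDF files begin with b'CDF'; NetCDF-4 (HDF5-based) files begin with b'\x89HDF'
    return header.startswith(b'CDF') or header.startswith(b'\x89HDF')

# An HTML error page saved with a .nc extension would fail this check
assert looks_like_netcdf('dispersion_20210304.nc'), 'downloaded file is not NetCDF'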

7.2.1 The Problem

While producing rudimentary visualizations, we noticed time steps missing from the final dataset in our IDX file. We initially assumed that any missing time steps were simply unavailable from the data source. However, we decided to validate exactly which time steps were missing and why, in case the apparent gaps were our own fault.

7.2.2 Visualizing all Timesteps

To determine exactly which time steps were unavailable, we loaded and visualized every time step from March 3, 2021 to June 27, 2024 as a video. We then identified which time steps failed to load or visualize and diagnosed why. The scripts we describe below can be found in the side bar or here.


In the following scripts, we generate PNG images for every time step in our IDX file and for every time step directly loaded from the downloaded NetCDF files. The time steps we load are the same ones specified in our idx_calls array from Chapter 6.

We load from both our IDX file and the NetCDF files so that we can cross-check any issues we encounter across the two formats.

8 Create .PNG images of all timesteps in IDX firesmoke dataset

8.1 Import necessary libraries

Code
import numpy as np
import os
import requests
import xarray as xr
from openvisuspy.xarray_backend import OpenVisusBackendEntrypoint
import time
import datetime
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pickle
os.environ["VISUS_CACHE"]="./visus_cache_can_be_erased"
from tqdm import tqdm
from OpenVisus import *
1. For numerical work
2. For accessing the file system
3. For downloading the latest firesmoke NetCDF
4. For loading NetCDF files and metadata
5. For connecting the OpenVisus framework to xarray (from openvisuspy)
6. Used for processing NetCDF time data
7. Used for indexing via metadata
8. For plotting
9. For exporting the dictionary of issue files at the end of the notebook
10. Stores the OpenVisus cache in the local directory
11. Accessory, used to generate a progress bar for running for loops
12. For importing OpenVisus functions

8.1.1 In this section, we load our data using xr.open_dataset.

Code
url = 'https://github.com/sci-visus/NSDF-WIRED/raw/main/data/firesmoke_metadata_recent.nc'

response = requests.get(url)
local_netcdf = 'firesmoke_metadata.nc'
with open(local_netcdf, 'wb') as f:
    f.write(response.content)

ds = xr.open_dataset(local_netcdf, engine=OpenVisusBackendEntrypoint)
1. URL of the tiny NetCDF file
2. Download the file using requests
3. Local filename for the NetCDF file
4. Write the downloaded content to the local file system
5. Open the tiny NetCDF file with xarray and the OpenVisus backend

8.2 Calculate derived metadata using original metadata above to create coordinates

8.2.1 This is required to allow for indexing of data via metadata

8.2.1.1 Calculate latitude and longitude grid

Code
xorig = ds.XORIG
yorig = ds.YORIG
xcell = ds.XCELL
ycell = ds.YCELL
ncols = ds.NCOLS
nrows = ds.NROWS

longitude = np.linspace(xorig, xorig + xcell * (ncols - 1), ncols)
latitude = np.linspace(yorig, yorig + ycell * (nrows - 1), nrows)

8.2.1.2 Using calculated latitude and longitude, create coordinates allowing for indexing data using lat/lon

Code
ds.coords['lat'] = ('ROW', latitude)
ds.coords['lon'] = ('COL', longitude)

ds = ds.swap_dims({'COL': 'lon', 'ROW': 'lat'})
1. Create coordinates for latitude based on the ROW dimension
2. Create coordinates for longitude based on the COL dimension
3. Replace the COL and ROW dimensions with the newly calculated longitude and latitude arrays
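With lat and lon attached as dimensions, the data can now be selected by geographic location rather than by row and column index. For example, a small sketch using the ds object above (the point location is illustrative):

Code
# Select PM2.5 nearest to Vancouver (approx. 49.28 N, 123.12 W) across all time steps
pm25_at_point = ds['PM25'].sel(lat=49.28, lon=-123.12, method='nearest')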

8.3 Get timestamps to label video frames

We need the idx_calls array generated during conversion (see Chapter 6).

Code
with open('idx_calls_v4.pkl', 'rb') as f:
    idx_calls = pickle.load(f)
1. Load idx_calls from file
8.3.0.0.1 Return an array of the tflags as pandas timestamps
Code
timestamps = []

for call in idx_calls:
    timestamps.append(pd.Timestamp(call[2]))
1. Initialize an empty list to store pandas timestamps
2. Loop through the idx_calls to process each call
3. Convert the tflags to pandas Timestamps and store them in the timestamps list

8.4 Create the video

Code
data_resolution = 0
folder = "/usr/sci/scratch_nvme/arleth/dump/idx_frames"

my_norm = "log"
my_extent = [np.min(longitude), np.max(longitude), np.min(latitude), np.max(latitude)]
my_aspect = 'auto'
my_origin = 'lower'
my_cmap = 'hot'

issue_files = {}

for i in tqdm(range(len(idx_calls))):
    data_array_at_time = ds['PM25'].loc[i, :, :, data_resolution]

    try:
        my_fig, my_plt = plt.subplots(figsize=(15, 6), subplot_kw=dict(projection=ccrs.PlateCarree()))
        plot = my_plt.imshow(data_array_at_time, norm=my_norm, extent=my_extent, aspect=my_aspect, origin=my_origin, cmap=my_cmap)
        my_fig.colorbar(plot, location='right', label='ug/m^3')
        my_plt.coastlines()
        my_plt.gridlines(draw_labels=True)
        my_fig.suptitle(f'Ground level concentration of PM2.5 microns and smaller {timestamps[i]}\n')
        my_plt.text(0.5, -0.1, 'IDX Data', ha='center', va='center', transform=my_plt.transAxes)
        
        plt.savefig(folder + "/frames%05d.png" % i, dpi=280)
        plt.close(my_fig)
        matplotlib.pyplot.close()
    except Exception as e:
        print(f"issue! {timestamps[i]}: {e}")
        issue_files[timestamps[i]] = data_array_at_time
        continue
1. Set the resolution level of the PM25 data to max
2. Directory in which to save frames
3. Set parameters for creating the visualization of each timestep with matplotlib
4. Dictionary to keep track of files with ‘issues’
5. For all timesteps, create a visualization of the firesmoke at that time
6. Get the PM2.5 values; the colons select all lat and lon indices
7. Try creating the visualization or catch exceptions accordingly
8. Save the image to file
9. Print the exception if one occurs and record the issue in the issue dictionary using the timestamp as key
Code
with open('new_idx_issues.pkl', 'wb') as f:
    pickle.dump(issue_files, f)
1. Save ‘issue_files’ to review

9 Create .PNG images of all timesteps in idx_calls, loading directly from the NetCDF files

9.1 Import necessary libraries

Code
import numpy as np
import os
import xarray as xr
import time
import datetime
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import pickle
from tqdm import tqdm
1. For numerical work
2. For accessing the file system
3. For loading NetCDF files and metadata
4. Used for processing NetCDF time data
5. Used for indexing via metadata
6. For plotting
7. For exporting the dictionary of issue files at the end of the notebook and importing idx_calls.pkl
8. Accessory, used to generate a progress bar for running for loops

9.2 Get path to original firesmoke data

Code
netcdf_dir = "/usr/sci/cedmav/data/firesmoke"
Code
ids = ["BSC18CA12-01", "BSC00CA12-01", "BSC06CA12-01", "BSC12CA12-01"]
start_dates = ["20210304", "20210304", "20210304", "20210303"]
end_dates = ["20240627", "20240627", "20240627", "20240627"]

id_dates = {ids[i]: {"start_date": start_dates[i], "end_date": end_dates[i]} for i in range(len(ids))}
1. Date ranges for each smoke forecast dataset
2. Create a dictionary mapping each dataset ID to its date range, from which file names can be derived (see the sketch below)
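Although id_dates stores only date ranges, every expected NetCDF file name can be enumerated from them. A minimal sketch (the chosen ID and the dispersion_YYYYMMDD.nc naming follow the paths used later in this notebook):

Code
# Enumerate every expected NetCDF file name for one dataset ID
dates = pd.date_range(id_dates["BSC00CA12-01"]["start_date"],
                      id_dates["BSC00CA12-01"]["end_date"], freq="D")
file_names = [f"dispersion_{d.strftime('%Y%m%d')}.nc" for d in dates]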

9.2.1 In this section, we load metadata from 381x1041 and 381x1081 files using xr.open_dataset.

Code
file_s = f'{netcdf_dir}/{ids[1]}/dispersion_20210304.nc'
file_b = f'{netcdf_dir}/{ids[1]}/dispersion_20240101.nc'

ds_s = xr.open_dataset(file_s)
ds_b = xr.open_dataset(file_b)
1. Paths to a small and a big file
2. Open the metadata of each file

9.3 Calculate derived metadata using original metadata above to create coordinates

9.3.1 We’ll use this for creating our visualizations

9.3.1.1 Calculate latitude and longitude grid for each set of files’ metadata

Code
longitude_s = np.linspace(ds_s.XORIG, ds_s.XORIG + ds_s.XCELL * (ds_s.NCOLS - 1), ds_s.NCOLS)
latitude_s = np.linspace(ds_s.YORIG, ds_s.YORIG + ds_s.YCELL * (ds_s.NROWS - 1), ds_s.NROWS)

longitude_b = np.linspace(ds_b.XORIG, ds_b.XORIG + ds_b.XCELL * (ds_b.NCOLS - 1), ds_b.NCOLS)
latitude_b = np.linspace(ds_b.YORIG, ds_b.YORIG + ds_b.YCELL * (ds_b.NROWS - 1), ds_b.NROWS)

9.3.1.2 The timestamps used in the files may not be intuitive. The following utility function returns the desired pandas timestamp based on your date and time of interest.

9.3.1.2.1 When you index the data at a desired time, use this function to get the timestamp you need to index.
Code
def parse_tflag(tflag):
    """
    Return the tflag as a datetime object
    :param list tflag: a list of two int32, the 1st representing date and 2nd representing time
    """
    date = int(tflag[0])
    year = date // 1000 # first 4 digits of tflag[0]
    day_of_year = date % 1000 # last 3 digits of tflag[0]

    final_date = datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)

    time = int(tflag[1])
    hours = time // 10000 # first 2 digits of tflag[1]
    minutes = (time % 10000) // 100 # 3rd and 4th digits of tflag[1]
    seconds = time % 100  # last 2 digits of tflag[1]

    full_datetime = datetime.datetime(year, final_date.month, final_date.day, hours, minutes, seconds)
    return full_datetime
1. Obtain the year and day of year from tflag[0] (date)
2. Create a datetime object representing the date
3. Obtain hours, minutes, and seconds from tflag[1] (time)
4. Create the final datetime object
Code
def get_timestamp(year, month, day, hour):
    """
    return a pandas timestamp using the given date-time arguments
    :param int year: year
    :param int month: month
    :param int day: day
    :param int hour: hour
    """
    full_datetime = datetime.datetime(year, month, day, hour)
    
    year = full_datetime.year
    day_of_year = full_datetime.timetuple().tm_yday
    hours = full_datetime.hour
    minutes = full_datetime.minute
    seconds = full_datetime.second

    tflag0 = year * 1000 + day_of_year
    tflag1 = hours * 10000 + minutes * 100 + seconds

    return pd.Timestamp(full_datetime)
1. Convert year, month, day, and hour to a datetime object
2. Extract components from the datetime object
3. Compute tflag[0] and tflag[1]
4. Return the pandas Timestamp object
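As a concrete example of how the two functions relate (a small check, not part of the original notebook), the TFLAG pair [2021063, 120000] decodes to noon on March 4, 2021, and get_timestamp reproduces the same pandas Timestamp:

Code
example_tflag = [2021063, 120000]          # day 63 of 2021 at 12:00:00
print(parse_tflag(example_tflag))          # 2021-03-04 12:00:00
print(get_timestamp(2021, 3, 4, 12))       # 2021-03-04 12:00:00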

9.4 Import the sequence of data slices specifying which time step to load from which file

Code
with open('idx_calls_v4.pkl', 'rb') as f:
    idx_calls = pickle.load(f)
1. Load idx_calls from file

9.5 Create the video

Code
folder = "/usr/sci/scratch_nvme/arleth/dump/netcdf_frames"

my_norm = "log"
my_extent_s = [np.min(longitude_s), np.max(longitude_s), np.min(latitude_s), np.max(latitude_s)]
my_extent_b = [np.min(longitude_b), np.max(longitude_b), np.min(latitude_b), np.max(latitude_b)]
my_aspect = 'auto'
my_origin = 'lower'
my_cmap = 'hot'

issue_files = {}

frame_num = 0

for call in tqdm(idx_calls):
    curr_id = call[0]
    curr_file = call[1]
    curr_date = call[2]
    tstep_index = call[3]
    
    ds = xr.open_dataset(f'{netcdf_dir}/{curr_id}/{curr_file}')

    ds_vals = np.squeeze(ds['PM25'].values)

    data_at_time = ds_vals[tstep_index]

    t = pd.Timestamp(parse_tflag(ds['TFLAG'].values[tstep_index][0]))
    
    try:
        my_fig, my_plt = plt.subplots(figsize=(15, 6), subplot_kw=dict(projection=ccrs.PlateCarree()))
        curr_extent = my_extent_s if ds['PM25'].shape[3] == 1041 else my_extent_b
        plot = my_plt.imshow(data_at_time, norm=my_norm, extent=curr_extent, aspect=my_aspect, origin=my_origin, cmap=my_cmap)
        my_fig.colorbar(plot,location='right', label='ug/m^3')
        my_plt.coastlines()
        my_plt.gridlines(draw_labels=True)
        my_fig.suptitle(f'Ground Level Concentration of PM2.5 Microns and Smaller\n{t}')
        
        my_plt.text(0.5, -0.1, 'Original NetCDF Data', ha='center', va='center', transform=my_plt.transAxes)
        
        plt.savefig(folder + "/frames%05d.png" % frame_num, dpi=280)
        plt.close(my_fig)
        matplotlib.pyplot.close()
    except Exception as e:
        print(f"issue! {t}: {e}")
        issue_files[t] = data_at_time
    frame_num = frame_num + 1
1. Directory in which to save frames
2. Set parameters for creating the visualization of each timestep with matplotlib
3. Dictionary to keep track of files with ‘issues’
4. Track which frame we are on in the following loop
5. For all timesteps, create a visualization of the firesmoke at that time
6. Get the instructions from the call
7. Open the current file with xarray
8. Get the PM25 values and squeeze out the empty axis
9. Get the PM2.5 values at tstep_index and visualize them
10. Get the timestamp for titling our plot from the file's TFLAG
11. Catch exceptions accordingly
12. The extent is either the 381x1041 lons/lats or the 381x1081 lons/lats
13. Add a title with the time information
14. Add an additional caption for context
15. Save the visualization as a frame
16. Close the figure after saving
17. Print the exception if one occurs and record the issue in the issue dictionary using timestamp t as key
18. Whether an exception occurred or not, advance the frame count to stay aligned with the IDX script
Code
with open('new_netcdf_issues.pkl', 'wb') as f:
    pickle.dump(issue_files, f)
1. Save ‘issue_files’ to review
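With both issue dictionaries written to disk, the two formats can be cross-checked directly. The following sketch (assuming the two pickle files produced above) separates timestamps that fail in only one of the two formats:

Code
import pickle

with open('new_idx_issues.pkl', 'rb') as f:
    idx_issues = pickle.load(f)
with open('new_netcdf_issues.pkl', 'rb') as f:
    netcdf_issues = pickle.load(f)

# Timestamps that fail in only one of the two formats point to a conversion or
# access problem on our side rather than data missing at the source
only_idx = sorted(set(idx_issues) - set(netcdf_issues))
only_netcdf = sorted(set(netcdf_issues) - set(idx_issues))
print(f"{len(only_idx)} timestamps fail only in IDX, {len(only_netcdf)} fail only in NetCDF")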

Now, with images generated for every time step, we create the videos and save them to file.

10 Create videos using .PNG images generated

Code
# ref: https://stackoverflow.com/questions/43048725/python-creating-video-from-images-using-opencv
import cv2
import numpy as np
import os
import time
1. For creating the video
2. For numerical work
3. For accessing the file system
4. For timing execution
Code
def make_video(img_dir, video_dir, video_name):
    '''
    Create a video made of frames at img_dir and save to video_dir with the name video_name
    :param str img_dir: directory containing the .png frames
    :param str video_dir: directory in which to save the video
    :param str video_name: file name of the output video
    '''
    # ref: https://stackoverflow.com/questions/27593227/listing-png-files-in-folder
    images = [img for img in os.listdir(img_dir) if img.endswith(".png") and img.startswith("frames")]
    images = np.sort(images)

    frame = cv2.imread(os.path.join(img_dir, images[0]))
    height, width, layers = frame.shape

    video = cv2.VideoWriter(video_dir + video_name,cv2.VideoWriter_fourcc(*'mp4v'), 10, (width, height))

    start_time = time.time()

    for img in images:
        frame = cv2.imread(f'{img_dir}{img}')
        video.write(frame)

    end_time = time.time()
    execution_time = end_time - start_time

    print(f"Total execution time: {execution_time:.2f} seconds")

    cv2.destroyAllWindows()
    video.release()
1. Generate a list of images sorted by name; this is the chronological order of the timesteps
2. Initialize to the first frame
3. Record the start time
4. Write each image to the video
5. Record the end time
6. Print the execution time
7. Close OpenCV windows and release the video writer
Code
idx_image_folder = '/usr/sci/scratch_nvme/arleth/dump/idx_frames/'
idx_video_folder = '/usr/sci/scratch_nvme/arleth/dump/videos/'
idx_vid_name = 'idx_video_7_10_24.mp4'

netcdf_image_folder = '/usr/sci/scratch_nvme/arleth/dump/netcdf_frames/'
netcdf_video_folder = '/usr/sci/scratch_nvme/arleth/dump/videos/'
netcdf_vid_name = 'netcdf_video_7_10_24.mp4'
1. Directories of the IDX .PNGs and the video name
2. Directories of the NetCDF .PNGs and the video name
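With these paths defined, make_video is invoked once per frame set, along the lines of:

Code
make_video(idx_image_folder, idx_video_folder, idx_vid_name)
make_video(netcdf_image_folder, netcdf_video_folder, netcdf_vid_name)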

Visually inspecting these videos allowed us to identify where significant stretches of time steps were missing.