7 Data Validation

The success of this data curation project depends on competent data validation. Data validation is the bridge between theoretical and real-world data curation. Unfortunately, we made many assumptions about the data and the systems we used that did not hold true, leading to issues throughout the data curation process that at first we could not explain.

Here, we describe where our data validation fell short and the consequences we faced. We conclude with a discussion of how data validation can best be incorporated into this project and future projects like it, so that others may avoid these issues.
7.1 Data Validation vs Data Exploration
Data exploration enlightens one about the contents of the data and metadata one presumes to have. We performed data exploration by loading and visualizing a few files, which allowed us to understand what data UBC aims to provide.

What data exploration does not do is explain the origin of issues such as missing or seemingly corrupt data. Data exploration operates on the assumption that the data is perfect.

Data validation forces one to ask: where do the issues that appear in the data come from, and are they issues with the data itself or with the systems used to access and manipulate it?
7.2 Data Validation for Data Loading
Here, we describe how we discovered the consequences of failing to validate that the files we downloaded, as described in Chapter 5, were true NetCDF files rather than HTML webpages.
7.2.1 The Problem
While making rudimentary visualizations of our final dataset, we noticed time steps missing from the IDX file. We initially assumed that any missing time steps were simply unavailable from the data source. However, we decided to validate exactly which time steps were missing and why, in case the apparent gaps were our own fault.
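In hindsight, a lightweight file-signature check would have distinguished real NetCDF files from HTML error pages: classic NetCDF files begin with the bytes CDF, NetCDF-4 (HDF5-based) files begin with \x89HDF, and an HTML page begins with plain text. The following is a minimal sketch of such a check, not our original workflow; the downloads/ directory is a hypothetical placeholder.

```python
import os

def looks_like_netcdf(path):
    """Return True if the file starts with a NetCDF or HDF5 signature."""
    with open(path, 'rb') as f:
        magic = f.read(8)
    return magic.startswith(b'CDF') or magic.startswith(b'\x89HDF')

# hypothetical directory of downloaded files
for name in os.listdir('downloads'):
    path = os.path.join('downloads', name)
    if not looks_like_netcdf(path):
        print(f'{name} does not look like a NetCDF file')
```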
7.2.2 Visualizing all Timesteps
To determine exactly which time steps were unavailable, we decided to load and visualize every time step from March 3, 2021 to June 27, 2024 as a video. We could then identify which time steps failed to load or visualize and diagnose why. The scripts described below can be found in the sidebar or at https://github.com/sci-visus/NSDF-WIRED/tree/main/visualizations/make_videos.

In the following scripts, we generate PNG images for every time step in our IDX file and for every time step loaded directly from the downloaded NetCDF files. The time steps we load are the same ones specified in our idx_calls array from Chapter 6.

We load from both the IDX file and the NetCDF files so that we can crosscheck any issue we encounter in one format against the other.
IDX File PNGs

First, we generate a PNG image for every time step by reading from the IDX file. The notebook's annotated steps are as follows (a hedged sketch of the loop appears after the list):

1. Set parameters for creating the visualization of each time step with matplotlib.
2. Create a dictionary to keep track of files with 'issues'.
3. For every time step, create a visualization of the fire smoke at that time.
4. Get the PM2.5 values, indexing with four values; the colons mean select all latitude and longitude indices.
5. Try creating the visualization, catching exceptions accordingly.
6. Save the images to file.
7. If an exception occurs, print it and record the issue in the issue dictionary using the timestamp t as the key.
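The original notebook cells are not reproduced here. Below is a minimal sketch of the loop, assuming the IDX file is read through the OpenVisus Python API; the dataset URL, the PM25 field name, and the frame file naming are placeholders rather than the original code.

```python
import matplotlib.pyplot as plt
from OpenVisus import LoadDataset

db = LoadDataset('http://example.com/firesmoke.idx')  # placeholder URL
issue_files = {}  # dictionary to keep track of time steps with 'issues'

for t in db.getTimesteps():  # for all time steps
    try:
        # read the PM2.5 values for this time step ('PM25' is an assumed
        # field name); colons in the notebook's indexing select all
        # latitude and longitude indices
        data = db.read(time=t, field='PM25')
        fig, ax = plt.subplots(figsize=(10, 4))
        ax.imshow(data, origin='lower')          # try creating the visualization
        ax.set_title(f'timestep {t}')
        fig.savefig(f'frames{int(t):05d}.png')   # save the image to file
        plt.close(fig)
    except Exception as e:
        print(e)                                 # print the exception
        issue_files[t] = str(e)                  # record the issue, keyed by timestamp t
```

Afterwards, issue_files is saved for review, as in the cell below.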
```python
# save issue_files to review later
with open('new_idx_issues.pkl', 'wb') as f:
    pickle.dump(issue_files, f)
```
NetCDF Files PNGs

Next, we create PNG images of all time steps in idx_calls by loading directly from the downloaded NetCDF files. We begin by importing the necessary libraries.
```python
import numpy as np               # for numerical work
import os                        # for accessing the file system
import xarray as xr              # for loading NetCDF files and their metadata
import time                      # used (with datetime) for processing NetCDF time data
import datetime                  # used for processing NetCDF time data
import pandas as pd              # used for indexing via metadata
import matplotlib                # for plotting
import matplotlib.pyplot as plt  # for plotting
import cartopy.crs as ccrs       # for plotting map projections
import pickle                    # for exporting the dictionary of issue files and importing idx_calls.pkl
from tqdm import tqdm            # progress bar for long-running loops
```
The timestamps used in the files may not be intuitive. The following utility functions return the desired pandas timestamp based on your date and time of interest; when you index the data at a desired time, use them to get the timestamp you need.
```python
def parse_tflag(tflag):
    """
    Return the tflag as a datetime object.

    :param list tflag: a list of two int32, the 1st representing date
                       and the 2nd representing time
    """
    # obtain year and day of year from tflag[0] (date)
    date = int(tflag[0])
    year = date // 1000        # first 4 digits of tflag[0]
    day_of_year = date % 1000  # last 3 digits of tflag[0]

    # create datetime object representing the date
    final_date = datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)

    # obtain hours, minutes, and seconds from tflag[1] (time)
    time = int(tflag[1])
    hours = time // 10000            # first 2 digits of tflag[1]
    minutes = (time % 10000) // 100  # 3rd and 4th digits of tflag[1]
    seconds = time % 100             # last 2 digits of tflag[1]

    # create the final datetime object
    full_datetime = datetime.datetime(year, final_date.month, final_date.day,
                                      hours, minutes, seconds)
    return full_datetime
```
```python
def get_timestamp(year, month, day, hour):
    """
    Return a pandas timestamp using the given date-time arguments.

    :param int year: year
    :param int month: month
    :param int day: day
    :param int hour: hour
    """
    # convert year, month, day, and hour to a datetime object
    full_datetime = datetime.datetime(year, month, day, hour)

    # extract components from the datetime object
    year = full_datetime.year
    day_of_year = full_datetime.timetuple().tm_yday
    hours = full_datetime.hour
    minutes = full_datetime.minute
    seconds = full_datetime.second

    # compute tflag[0] and tflag[1] (kept for reference; not used below)
    tflag0 = year * 1000 + day_of_year
    tflag1 = hours * 10000 + minutes * 100 + seconds

    # return the pandas Timestamp object
    return pd.Timestamp(full_datetime)
```
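As a quick check of these conventions: tflag[0] = 2021062 encodes day 62 of 2021, which is March 3, and tflag[1] = 120000 encodes 12:00:00, so both helpers agree:

```python
print(parse_tflag([2021062, 120000]))  # 2021-03-03 12:00:00
print(get_timestamp(2021, 3, 3, 12))   # 2021-03-03 12:00:00
```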
Next, import the sequence of data slices that specifies which file to load and at what time step.
```python
with open('idx_calls_v4.pkl', 'rb') as f:
    idx_calls = pickle.load(f)
```
Now set up and run the main loop. Its annotated steps are as follows (a hedged sketch of the loop appears after the list):

1. Set parameters for creating the visualization of each time step with matplotlib.
2. Create a dictionary to keep track of files with 'issues'.
3. Create a counter to track what frame we are on in the following loop.
4. For every time step, create a visualization of the fire smoke at that time.
5. Get the instructions from the current call.
6. Open the current file with xarray.
7. Get the PM25 values and squeeze out the empty axis.
8. Get the PM2.5 values at tstep_index and visualize them.
9. Get the timestamp for titling our plot, using hour 'h'.
10. Catch exceptions accordingly.
11. The extent uses either the 381x1041 or the 381x1081 longitude/latitude grid.
12. Add a title with the time information.
13. Add an additional caption for context.
14. Save the visualization as a frame.
15. Close the figure after saving.
16. If an exception occurs, print it and record the issue in the issue dictionary using the timestamp t as the key.
17. Whether an exception occurred or not, advance the frame count so frames stay aligned with the IDX script.
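The notebook cells themselves are not reproduced here. Below is a minimal sketch of the loop, reusing the imports and helper functions defined above; the structure of each idx_calls entry, the map extent, and the frame file naming are assumptions, not the original code, and the sketch keys issues by file and index rather than by the timestamp t in case a file fails to open before its timestamp can be parsed.

```python
issue_files = {}  # dictionary to keep track of files with 'issues'
frame = 0         # frame counter, kept in sync with the IDX script

for call in tqdm(idx_calls):
    file_name, tstep_index = call           # assumed structure of each call
    try:
        ds = xr.open_dataset(file_name)     # open the current file with xarray
        pm25 = ds['PM25'].squeeze()         # squeeze out the empty layer axis
        values = pm25[tstep_index]          # PM2.5 values at tstep_index
        t = parse_tflag(ds['TFLAG'].values[tstep_index][0])

        fig = plt.figure(figsize=(10, 4))
        ax = plt.axes(projection=ccrs.PlateCarree())
        ax.coastlines()
        # assumed extent; the real script chooses between the 381x1041
        # and 381x1081 lon/lat grids
        ax.imshow(values, origin='lower', extent=(-160, -52, 32, 70),
                  transform=ccrs.PlateCarree())
        ax.set_title(f'PM2.5 at {t}')            # title with the time information
        fig.savefig(f'frames{frame:05d}.png')    # save the visualization as a frame
        plt.close(fig)                           # close the figure after saving
    except Exception as e:
        print(e)                                 # print the exception and record it
        issue_files[(file_name, tstep_index)] = str(e)
    frame += 1  # advance even on failure, to stay aligned with the IDX script
```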
```python
# save issue_files to review later
with open('new_netcdf_issues.pkl', 'wb') as f:
    pickle.dump(issue_files, f)
```
Now, with all images for each time step generated, we create videos and save them to file.
Create Videos from the Generated PNG Images
```python
# ref: https://stackoverflow.com/questions/43048725/python-creating-video-from-images-using-opencv
import cv2          # for creating the video
import numpy as np  # for numerical work
import os           # for accessing the file system
import time         # for timing execution
```
```python
def make_video(img_dir, video_dir, video_name):
    '''
    Create a video made of the frames at img_dir and save it to video_dir
    with the name video_name.

    :param str img_dir: directory containing the frame images
    :param str video_dir: directory to save the video to
    :param str video_name: file name of the output video
    '''
    # ref: https://stackoverflow.com/questions/27593227/listing-png-files-in-folder
    # generate the list of images sorted by name, i.e. the chronological
    # order of the time steps
    images = [img for img in os.listdir(img_dir)
              if img.endswith(".png") and img.startswith("frames")]
    images = np.sort(images)

    frame = cv2.imread(os.path.join(img_dir, images[0]))
    height, width, layers = frame.shape

    video = cv2.VideoWriter(video_dir + video_name,
                            cv2.VideoWriter_fourcc(*'mp4v'), 10, (width, height))

    start_time = time.time()
    for img in images:
        frame = cv2.imread(os.path.join(img_dir, img))
        video.write(frame)
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Total execution time: {execution_time:.2f} seconds")

    cv2.destroyAllWindows()
    video.release()
```
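A hypothetical invocation, assuming the IDX and NetCDF frames were written to separate local directories (these paths are placeholders; the notebook sets the real ones):

```python
# hypothetical paths for the two sets of frames generated above
make_video('idx_frames/', 'videos/', 'idx_all_timesteps.mp4')
make_video('netcdf_frames/', 'videos/', 'netcdf_all_timesteps.mp4')
```

The writer uses the 'mp4v' FourCC at 10 frames per second, so each second of video covers 10 time steps.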
Visually inspecting these videos allowed us to see where significant stretches of time series data were missing.