5 Data Loading

Knowing what data is available and how to obtain the data, we can proceed with loading the data onto our staging system.

5.1 Downloading Data Locally

We decided to download all the available files from the data source onto our data staging machine to process from there.

We created 4 directories for each forecast ID that UBC provides at the following directory on our machine:

/usr/sci/cedmav/data/firesmoke
├── BSC00CA12-01
├── BSC06CA12-01
├── BSC12CA12-01
├── BSC18CA12-01

The following shows our approaches to doing this and discusses how our approach evolved. Note the scripts below refer to varying directories, but through simple copying operations we stored the final downloaded files to the directories listed above.

5.1.1 First Approach

We delineate our first approach by detailing our download script, which is available in it’s entirety in the side bar.

Code

import wget
import pandas as pd

ids = ["BSC18CA12-01", "BSC00CA12-01", "BSC06CA12-01", "BSC12CA12-01"]
start_dates = ["20210304", "20210304", "20210304", "20210303"]
end_dates = ["20231016", "20240210", "20231016", "20231015"]
init_times = ["02", "08", "14", "20"]

for i in zip(start_dates, end_dates, ids, init_times):
    start_date = i[0]
    end_date = i[1]
    forecast_id = i[2]
    init_time = i[3]

    dates = pd.date_range(start=start_date, end=end_date)
    dates = dates.strftime("%Y%m%d").tolist()

    for date in dates:
      url = (
          "https://firesmoke.ca/forecasts/"
          + forecast_id
          + "/"
          + date
          + init_time
          + "/dispersion.nc"
      )
      directory = "/Users/arleth/Mount/firesmoke/" + forecast_id + "/dispersion_" + date + ".nc"
      wget.download(url, out=directory)

1: First, create 4 lists containing forecast IDs, the start and end dates we wish to index on, and the smoke forecast initiation times. We will loop through the 4 sets of parameters.
2: In a for loop, we use pandas to create a list of every date from the start date and end date of the current iteration. We will loop through these dates next.
3: For each date in the list, we create the url to download the file.
4: Finally, we use wget to download the contents at urlto directory. We append date to the file name so each file downloaded is identifiable by date.

We assumed that for all URLs, there was an available NetCDF file for download

However, we realized that we downloaded either a NetCDF file or an HTML webpage. Using wget forcibly saved the contents at the URL into a NetCDF file.

This issue was not identified until after we visualized each hour of the data, and we noticed gaps and errors in our scripts to create visualizations. See Chapter 7 for further details on identifying these issues. For now, we show our modified approach to downloading the NetCDF files.

5.1.2 Second Approach

Our second approach is similar to the first, except we use requests, an HTTP client library that allows us to see the headers returned from the URL we query. The script is available in the side bar.

Code

import requests
import pandas as pd
from datetime import datetime

ids = ["BSC18CA12-01", "BSC00CA12-01", "BSC06CA12-01", "BSC12CA12-01"]
start_dates = ["20210304", "20210304", "20210304", "20210303"]
today = datetime.now().strftime("%Y%m%d")
init_times = ["02", "08", "14", "20"]

for i in zip(start_dates, ids, init_times):
    start_date = i[0]
    forecast_id = i[1]
    init_time = i[2]

    dates = pd.date_range(start=start_date, end=today)
    dates = dates.strftime("%Y%m%d").tolist()

    for date in dates:
        url = (
            "https://firesmoke.ca/forecasts/"
            + forecast_id
            + "/"
            + date
            + init_time
            + "/dispersion.nc"
        )
        directory = (
            "/usr/sci/scratch_nvme/arleth/basura_total/"
            + forecast_id
            + "/dispersion_"
            + date
            + ".nc"
        )

        response = requests.get(url, stream=True)
        header = response.headers
        if (
            "Content-Type" in header
            and header["Content-Type"] == "application/octet-stream"
        ):
            with open(directory, mode="wb") as file:
                file.write(response.content)
                print(f"Downloaded file {directory}")
        else:
            print(header["Content-Type"])

1: First, create 3 lists containing forecast IDs, the start dates we wish to index on, and the smoke forecast initiation times . Notice we define a variable today, this allows us to run this script and query all URLs up to today’s date. Note we ran up to June 27, 2024 for now. We will loop through these sets of parameters.
2: In a for loop, we use pandas to create a list of every date from the start date and end date of the current iteration. We will loop through these dates next.
3: For each date in the list, we create the url to download the file.
4: Define directory, the directory and file name to save the file to.
5: We use requests to get the HTTP header at url. We inspect the Content-Type and if it is application/octet-stream, we download the file. We confirmed that a URL with a NetCDF file had this content type header.
6: We write the content to directory, else we print the content header out to check what it is.

This approach yielded the results we expected, we downloaded only NetCDF files. We had failed downloads which appeared during conversion as described in Chapter 6. We assumed those files were unavailable from UBC which we later confirmed as described in Chapter 7.

--- format: html: code-links: - text: First Approach icon: file-code href: https://github.com/sci-visus/NSDF-WIRED/blob/main/data_download/get_data_v0.py - text: Second Approach icon: file-code href: https://github.com/sci-visus/NSDF-WIRED/blob/main/data_download/get_data_v1.py --- # Data Loading {#sec-data-loading} Knowing what data is available and how to obtain the data, we can proceed with loading the data onto our staging system. ## Downloading Data Locally We decided to download all the available files from the [data source](data_source.qmd) onto our [data staging machine](sys_specs.qmd) to process from there. We created 4 directories for each forecast ID that UBC provides at the following directory on our machine: ``` /usr/sci/cedmav/data/firesmoke ├── BSC00CA12-01 ├── BSC06CA12-01 ├── BSC12CA12-01 ├── BSC18CA12-01 ``` The following shows our approaches to doing this and discusses how our approach evolved. Note the scripts below refer to varying directories, but through simple copying operations we stored the final downloaded files to the directories listed above. ### First Approach {#sec-loading-first-approach} We delineate our first approach by detailing our download script, which is available in it's entirety in the side bar. ```{python} #| eval: false import wget import pandas as pd ids = ["BSC18CA12-01", "BSC00CA12-01", "BSC06CA12-01", "BSC12CA12-01"]# <1> start_dates = ["20210304", "20210304", "20210304", "20210303"] end_dates = ["20231016", "20240210", "20231016", "20231015"] init_times = ["02", "08", "14", "20"]# <1> for i in zip(start_dates, end_dates, ids, init_times):# <2> start_date = i[0] end_date = i[1] forecast_id = i[2] init_time = i[3] dates = pd.date_range(start=start_date, end=end_date) dates = dates.strftime("%Y%m%d").tolist()# <2> for date in dates: # <3> url = ( "https://firesmoke.ca/forecasts/" + forecast_id + "/" + date + init_time + "/dispersion.nc" )# <3> directory = "/Users/arleth/Mount/firesmoke/" + forecast_id + "/dispersion_" + date + ".nc" # <4> wget.download(url, out=directory) # <4> ``` 1. First, create 4 lists containing forecast IDs, the start and end dates we wish to index on, and the smoke forecast initiation times. We will loop through the 4 sets of parameters. 2. In a for loop, we use `pandas` to create a list of every date from the start date and end date of the current iteration. We will loop through these dates next. 3. For each `date` in the list, we create the `url` to download the file. 4. Finally, we use `wget` to download the contents at `url`to `directory`. We append `date` to the file name so each file downloaded is identifiable by date. We assumed that for all URLs, there was an available NetCDF file for download However, we realized that we downloaded either a NetCDF file *or an HTML webpage*. Using `wget` forcibly saved the contents at the URL into a NetCDF file. This issue was not identified until *after* we visualized each hour of the data, and we noticed gaps and errors in our scripts to create visualizations. See @sec-data-validation for further details on identifying these issues. For now, we show our modified approach to downloading the NetCDF files. ### Second Approach Our second approach is similar to the first, except we use [`requests`](https://requests.readthedocs.io/en/latest/), an HTTP client library that allows us to see the headers returned from the URL we query. The script is available in the side bar. ```{python} # | eval: false import requests import pandas as pd from datetime import datetime ids = ["BSC18CA12-01", "BSC00CA12-01", "BSC06CA12-01", "BSC12CA12-01"] # <1> start_dates = ["20210304", "20210304", "20210304", "20210303"] today = datetime.now().strftime("%Y%m%d") init_times = ["02", "08", "14", "20"] # <1> for i in zip(start_dates, ids, init_times): # <2> start_date = i[0] forecast_id = i[1] init_time = i[2] dates = pd.date_range(start=start_date, end=today) dates = dates.strftime("%Y%m%d").tolist() # <2> for date in dates: # <3> url = ( "https://firesmoke.ca/forecasts/" + forecast_id + "/" + date + init_time + "/dispersion.nc" ) # <3> directory = ( # <4> "/usr/sci/scratch_nvme/arleth/basura_total/" + forecast_id + "/dispersion_" + date + ".nc" ) # <4> response = requests.get(url, stream=True) # <5> header = response.headers if ( "Content-Type" in header and header["Content-Type"] == "application/octet-stream" ): # <5> with open(directory, mode="wb") as file: # <6> file.write(response.content) print(f"Downloaded file {directory}") else: print(header["Content-Type"]) # <6> ``` 1. First, create 3 lists containing forecast IDs, the start dates we wish to index on, and the smoke forecast initiation times . Notice we define a variable `today`, this allows us to run this script and query all URLs up to today's date. Note we ran up to June 27, 2024 for now. We will loop through these sets of parameters. 2. In a for loop, we use `pandas` to create a list of every date from the start date and end date of the current iteration. We will loop through these dates next. 3. For each `date` in the list, we create the `url` to download the file. 4. Define `directory`, the directory and file name to save the file to. 5. We use `requests` to get the HTTP header at `url`. We inspect the `Content-Type` and if it is `application/octet-stream`, we download the file. We confirmed that a URL with a NetCDF file had this content type header. 6. We write the content to `directory`, else we print the content header out to check what it is. This approach yielded the results we expected, we downloaded only NetCDF files. We had failed downloads which appeared during conversion as described in @sec-data-conversion. We assumed those files were unavailable from UBC which we later confirmed as described in @sec-data-validation.