Glossary

Key Points

Introduction	Python has a number of libraries that can help us work better with big datasets. Every computer contains a Central Processing Unit or CPU. Modern CPUs tend to have multiple processing cores that can operate simultaneously. High Perfomance Computing (HPC) facilities operate clusters of computers known as nodes. Each node will typically have several multi-core CPUs. Parallel processing involves splitting a task up so that different parts of it can be performed simultaneously on different CPU cores, different CPUs or even different computers/HPC nodes. Computer memory can mean RAM which is relatively fast and loses its contents when powered off or disk storage which retains its contents but is slower. NetCDF files are commonly used for environmental science data and/or gridded datasets. They include metadata with them, store data in space efficient binary formats and libraries for reading/writing them can be found in most languages including Python.
Dataset Parallelism	GNU Parallel can apply the same command to every file in a dataset GNU Parallel works on the command line and doesn’t require Python, but it can run multiple copies of a Python script It is often the simplest way to apply parallelism It requires a problem that works independently across a set of files or a range of parameters Without invoking a more complex job scheduler, GNU Parallel only works on a single computer By default GNU Parallel will use every CPU core available to it
Parallelisation with Numpy and Numba	We can measure how long a Jupyter cell takes to run with %time or %timeit magics. We can use a profiler to measure how long each line of code in a function takes. We should measure performance before attemping to optimise code and target our optimisations at the things which take longest. Numpy can perform operations to whole arrays, this will perform faster than using for loops. Numba can replace some Numpy operations with just in time compilation that is even faster. One way numba achieves higher performance is to use vectorisation extensions of some CPUs that process multiple pieces of data in one instruction. Numba ufuncs let us write arbitary functions for Numba to use. It’s Just In Time compiler can still speed these up.
Working with data in Xarray	Xarray can load NetCDF files We can address dimensions by their name using the `.dimensionname`, `['dimensionname]` or `sel(dimensionname)` syntax. With lazy loading data is only loaded into memory when it is requested We can apply mathematical operations to the whole (or part) of the array, this is more efficient than using a for loop. We can also apply custom functions to operate on the whole or part of the array. We can plot data from Xarray invoking matplotlib. Hvplot can plot interactive graphs. Xarray has many useful built in operations it can perform such as resampling, coarsening and grouping.
Plotting Geospatial Data with Cartopy	Cartopy can plot data on maps Cartopy can use Xarray DataArrays We can apply different projections to the map We can add gridlines and country/region boundaries
Parallelising with Dask	Dask is a parallel computing framework for Python Dask creates a task graph of all the steps in the operation we request Dask can use your local computer, an HPC cluster, Kubernetes cluster or a remote system over SSH We can monitor Dask’s processing with its dashboard Xarray can use Dask to parallelise some of its operations Delayed tasks let us lazily evaluate functions, only causing them to execute when the final result is requested Futures start a task immediately but return a futures object until computation is completed
Storing and Accessing Data in Parallelism Friendly Formats	We can process faster in parallel if we can read or write data in parallel too Data storage is many times slower than accessing our computer’s memory Object stores are one way to store data that is accessible over the web/http, allows replication of data and can scale to very large quantities of data. Zarr is an object store friendly file format intended for storing large array data. Zarr files are stored in chunks and software such as Xarray can just read the chunks that it needs instead of the whole file. Xarray can be used to read in Zarr files
GPUs	GPUs are Graphics Processing Units, they have large numbers of very simple processing cores and are suited to some parallel tasks like machine learning and array operations Many laptops and desktops won’t have very powerful GPUs, instead we’ll want to use HPC or Cloud systems to access a GPU. Google’s Colab provides free access to GPUs with a Jupyter notebooks interface. Numba can use GPUs with minor modifications to the code. NVIDIA have drop in replacements for Pandas, Numpy and SciKit learn that are GPU accelerated. For a GPU to access data it must be copied into the GPU’s memory. This can sometimes be a major bottleneck to GPU operations.

The glossary would go here, formatted as:

{:auto_ids}
key word 1
:   explanation 1

key word 2
:   explanation 2

({:auto_ids} is needed at the start so that Jekyll will automatically generate a unique ID for each item to allow other pages to hyperlink to specific glossary entries.) This renders as:

key word 1: explanation 1
key word 2: explanation 2

Advanced Python for Environmental Scientists: Glossary

Key Points

Glossary