|
Introduction
|
Python has a number of libraries that can help us work more efficiently with large datasets.
Every computer contains a Central Processing Unit or CPU. Modern CPUs tend to have multiple processing cores that can operate simultaneously.
High Performance Computing (HPC) facilities operate clusters of computers known as nodes. Each node will typically have several multi-core CPUs.
Parallel processing involves splitting a task up so that different parts of it can be performed simultaneously on different CPU cores, different CPUs or even different computers/HPC nodes.
Computer memory can mean RAM, which is relatively fast but loses its contents when powered off, or disk storage, which retains its contents but is slower.
NetCDF files are commonly used for environmental science data and/or gridded datasets. They carry their own metadata, store data in space-efficient binary formats, and have libraries for reading and writing them in most languages, including Python.
|
|
Dataset Parallelism
|
GNU Parallel can apply the same command to every file in a dataset
GNU Parallel works on the command line and doesn’t require Python, but it can run multiple copies of a Python script
It is often the simplest way to apply parallelism
It requires a problem that works independently across a set of files or a range of parameters
Without invoking a more complex job scheduler, GNU Parallel only works on a single computer
By default GNU Parallel will use every CPU core available to it
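As a rough illustration of the points above, the sketch below is a small Python script written to handle one file per invocation, which is the shape of problem GNU Parallel suits. The script name `process_one.py`, the `data/*.txt` file pattern and the `summarise_file` helper are all hypothetical; the shell commands in the docstring assume GNU Parallel is installed.

```python
"""process_one.py: hypothetical worker script that handles the files named on its command line.

Example shell invocations (assuming GNU Parallel is installed):

    parallel python process_one.py {} ::: data/*.txt       # one job per file, all cores
    parallel -j 4 python process_one.py {} ::: data/*.txt  # at most 4 jobs at once
"""
import sys
from pathlib import Path


def summarise_file(path):
    """Return (count, mean) of the whitespace-separated numbers stored in `path`."""
    values = [float(token) for token in Path(path).read_text().split()]
    mean = sum(values) / len(values) if values else float("nan")
    return len(values), mean


if __name__ == "__main__":
    # GNU Parallel substitutes one filename for {} in each job, so each copy
    # of this script normally receives a single path in sys.argv.
    for path in sys.argv[1:]:
        count, mean = summarise_file(path)
        print(f"{path}: n={count} mean={mean:.3f}")
```

Because each copy of the script only ever touches its own file, the jobs are independent and need no coordination, which is what makes this approach simple.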
|
|
Parallelisation with Numpy and Numba
|
We can measure how long a Jupyter cell takes to run with %time or %timeit magics.
We can use a profiler to measure how long each line of code in a function takes.
We should measure performance before attempting to optimise code and target our optimisations at the things which take longest.
NumPy can perform operations on whole arrays; this is faster than using for loops.
Numba can replace some NumPy operations with just-in-time compiled code that is even faster.
One way Numba achieves higher performance is to use the vectorisation extensions of some CPUs, which process multiple pieces of data in one instruction.
Numba ufuncs let us write arbitrary functions for Numba to use. Its just-in-time compiler can still speed these up.
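The points above can be sketched as follows: the same calculation written as a for loop, as a whole-array NumPy expression, and as a Numba ufunc, each timed with the standard library `timeit` module (the script equivalent of the `%timeit` magic). The function names and array size are illustrative assumptions, and the fallback for a missing Numba install is only there so the sketch still runs; it gives no speed-up.

```python
import timeit

import numpy as np

try:
    from numba import vectorize  # compiles a scalar function into a ufunc
except ImportError:
    # Assumption: fall back to np.vectorize so the sketch runs without Numba.
    # This keeps the same behaviour but is NOT compiled, so it is not faster.
    def vectorize(signatures):
        return np.vectorize


def hypot_loop(a, b):
    """Pure-Python loop: handles one element at a time."""
    out = np.empty_like(a)
    for i in range(a.size):
        out[i] = (a[i] ** 2 + b[i] ** 2) ** 0.5
    return out


def hypot_numpy(a, b):
    """Whole-array NumPy expression: no explicit loop."""
    return np.sqrt(a ** 2 + b ** 2)


@vectorize(["float64(float64, float64)"])
def hypot_ufunc(x, y):
    """Scalar function that is applied element-wise across the arrays."""
    return (x ** 2 + y ** 2) ** 0.5


rng = np.random.default_rng(0)
a = rng.random(50_000)
b = rng.random(50_000)

hypot_ufunc(a, b)  # warm-up call so any one-off overhead is excluded from timing

candidates = {"loop": hypot_loop, "numpy": hypot_numpy, "ufunc": hypot_ufunc}
for label, func in candidates.items():
    seconds = timeit.timeit(lambda: func(a, b), number=3) / 3
    print(f"{label}: {seconds:.5f} s per call")
```

Timings will vary between machines, so it is worth running a comparison like this on the real workload before deciding where to optimise.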
|
|
Working with data in Xarray
|
Xarray can load NetCDF files
We can address dimensions by their name using the .dimensionname, ['dimensionname'] or .sel(dimensionname=value) syntax.
With lazy loading, data is only loaded into memory when it is requested.
We can apply mathematical operations to the whole (or part) of the array; this is more efficient than using a for loop.
We can also apply custom functions to operate on the whole or part of the array.
We can plot data directly from Xarray, which invokes matplotlib.
Hvplot can plot interactive graphs.
Xarray has many useful built-in operations, such as resampling, coarsening and grouping.
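A minimal sketch of several of the Xarray points above. It builds a small synthetic DataArray in memory rather than loading a NetCDF file, so the variable name `temperature`, the coordinate values and the grid size are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a NetCDF variable: 60 daily fields on a 4 x 6 grid.
times = pd.date_range("2020-01-01", periods=60, freq="D")
lats = np.array([50.0, 52.0, 54.0, 56.0])
lons = np.array([-6.0, -4.0, -2.0, 0.0, 2.0, 4.0])
rng = np.random.default_rng(42)
temperature = xr.DataArray(
    15 + rng.standard_normal((60, 4, 6)),
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": lats, "lon": lons},
    name="temperature",
)

# Address dimensions by name rather than by position.
point_series = temperature.sel(lat=52.0, lon=-2.0)      # exact coordinate match
nearest = temperature.sel(lat=53.1, method="nearest")   # closest grid latitude
first_week = temperature.sel(time=slice("2020-01-01", "2020-01-07"))

# Whole-array maths, with no for loop.
kelvin = temperature + 273.15
time_mean = temperature.mean(dim="time")

# Built-in grouping, resampling and coarsening.
monthly_by_group = temperature.groupby("time.month").mean()
monthly_resampled = temperature.resample(time="MS").mean()
coarse = temperature.coarsen(lat=2, lon=2).mean()

print(monthly_by_group.sizes, coarse.sizes)
```

With a real file, `xr.open_dataset` would take the place of the synthetic array, and the same selection and aggregation calls would apply.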
|
|
Plotting Geospatial Data with Cartopy
|
Cartopy can plot data on maps
Cartopy can use Xarray DataArrays
We can apply different projections to the map
We can add gridlines and country/region boundaries
|
|
Parallelising with Dask
|
Dask is a parallel computing framework for Python
Dask creates a task graph of all the steps in the operation we request
Dask can run on your local computer, an HPC cluster, a Kubernetes cluster or a remote system over SSH
We can monitor Dask’s processing with its dashboard
Xarray can use Dask to parallelise some of its operations
Delayed tasks let us lazily evaluate functions, only causing them to execute when the final result is requested
Futures start a task immediately but return a future object, which acts as a placeholder until the computation is completed
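The delayed behaviour described above can be sketched with `dask.delayed`. The `load`, `process` and `combine` functions are hypothetical stand-ins for real work; futures are not shown here because they need a `dask.distributed` client.

```python
from dask import delayed


def load(i):
    """Hypothetical stand-in for reading one chunk of data."""
    return list(range(i * 10, (i + 1) * 10))


def process(chunk):
    """Hypothetical per-chunk calculation: sum of squares."""
    return sum(x * x for x in chunk)


def combine(partials):
    """Merge the per-chunk results into one value."""
    return sum(partials)


# Nothing is computed here: each call only adds a node to the task graph.
partials = [delayed(process)(delayed(load)(i)) for i in range(4)]
total = delayed(combine)(partials)

# The graph is executed only when the final result is requested.
result = total.compute()
print(result)
```

The four `load`/`process` chains do not depend on each other, so Dask is free to run them in parallel before `combine` gathers the results.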
|
|
Storing and Accessing Data in Parallelism Friendly Formats
|
We can process faster in parallel if we can read or write data in parallel too
Accessing data storage is many times slower than accessing our computer’s memory
Object stores are one way to store data. They are accessible over the web via HTTP, allow replication of data and can scale to very large quantities of data.
Zarr is an object store friendly file format intended for storing large array data.
Zarr files are stored in chunks and software such as Xarray can just read the chunks that it needs instead of the whole file.
Xarray can be used to read in Zarr files
|
|
GPUs
|
GPUs are Graphics Processing Units. They have large numbers of very simple processing cores and are suited to some parallel tasks, such as machine learning and array operations
Many laptops and desktops won’t have very powerful GPUs, so we’ll often want to use HPC or Cloud systems to access a GPU.
Google’s Colab provides free access to GPUs with a Jupyter notebooks interface.
Numba can use GPUs with minor modifications to the code.
NVIDIA provides GPU-accelerated drop-in replacements for Pandas, NumPy and scikit-learn.
For a GPU to access data it must be copied into the GPU’s memory. This can sometimes be a major bottleneck to GPU operations.
|