The default VM setup is to use a single CPU core. In order to demonstrate the power of parallel processing, you must first determine whether your physical hardware has more than a single core.
On Linux this is done in the terminal with the ‘nproc’ command.
On Mac this is done in the terminal with the ‘sysctl -n hw.ncpu’ command.
On Windows this is done graphically using the Task Manager’s Performance tab.
We want to tune our VM to harness the power of several CPUs. Follow these steps:
- Shut down the IPython notebook server (Ctrl-C, answer yes).
- Shut down the VM (click the X button in the VM window and choose to power down the machine).
- Select the VM in the VirtualBox Manager window, then choose Machine > Settings from the menu.
- Choose the System tab, then Processor, and use the slider to set the number of processors to 2, 4, or 8, depending on your system resources.
- Click OK, then start the machine.
- Log in, use the script to start the IPython server, and start the notebook. You should now have multiple processors!
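The processor count can also be set from the host's command line with VirtualBox's `VBoxManage modifyvm` command instead of the GUI. The following is a minimal sketch that only builds that command. The helper `build_modifyvm_command` and the VM name `"my-vm"` are hypothetical and not part of the course material, and the VM has to be powered off before the command is run.

```python
import subprocess


def build_modifyvm_command(vm_name, cpus):
    """Build the VBoxManage call that sets the number of virtual CPUs."""
    if cpus < 1:
        raise ValueError("a VM needs at least one processor")
    return ["VBoxManage", "modifyvm", vm_name, "--cpus", str(cpus)]


# "my-vm" is a hypothetical name; use the name shown in the VirtualBox Manager.
command = build_modifyvm_command("my-vm", 4)
print(" ".join(command))

# Uncomment to apply the setting on a host where VirtualBox is installed
# and the VM is powered off:
# subprocess.run(command, check=True)
```

Keeping the command as a list of arguments avoids shell quoting problems when a VM name contains spaces.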
Retrieve data from S3 bucket¶
import os
import urllib.request
from pathlib import Path
# Set the URL for the cloud
URL = ""
path = "pythia/radar/erad2024/baltrad/baltrad_short_course/"
!mkdir -p data
files = [
for file in files:
    file0 = os.path.join(path, file)
    name = os.path.join("data", Path(file).name)
    # Download only the files that are not already present locally
    if not os.path.exists(name):
        print(f"downloading, {name}")
        urllib.request.urlretrieve(
            f"{URL}{file0}", os.path.join("data", Path(file).name)
        )
downloading, data/seang.h5
downloading, data/searl.h5
downloading, data/sease.h5
downloading, data/sehud.h5
downloading, data/sekkr.h5
downloading, data/selek.h5
downloading, data/selul.h5
downloading, data/seosu.h5
downloading, data/sevar.h5
downloading, data/sevil.h5
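The download-if-missing pattern used above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `fetch_if_missing` is hypothetical, and the demonstration uses a local `file://` URL with dummy content so that it runs without network access or the course's bucket.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest):
    """Download url to dest unless dest exists. Return True if downloaded."""
    dest = Path(dest)
    if dest.exists():
        return False
    # Create the target directory, like "mkdir -p"
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return True


# Demonstration with a local file:// URL and dummy content (not real radar data)
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.h5"
    source.write_bytes(b"dummy payload")
    target = Path(tmp) / "data" / "copy.h5"
    first = fetch_if_missing(source.as_uri(), target)
    second = fetch_if_missing(source.as_uri(), target)
    print(first, second)
```

The second call returns `False` because the target already exists, which is why re-running the download cell does not fetch files again.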
Verify from Python the number of CPU cores at our disposal¶
import multiprocessing
print("We have %i cores to play with!" % multiprocessing.cpu_count())
We have 4 cores to play with!
Yay! Now we’re going to set up some rudimentary functionality that will allow us to distribute a processing load among our cores.
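Before distributing work, it can be worth checking how many cores this process is actually allowed to use, which may be fewer than the total reported by `multiprocessing.cpu_count()`. The sketch below assumes a hypothetical helper `usable_cores`. It relies on `os.sched_getaffinity`, which is only available on some platforms such as Linux, and otherwise falls back to the total count.

```python
import multiprocessing
import os


def usable_cores():
    """Return the number of cores this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity restrictions where the platform supports it
        return len(os.sched_getaffinity(0))
    # Fallback: total number of logical cores
    return multiprocessing.cpu_count()


print(
    "Total cores: %i, usable cores: %i"
    % (multiprocessing.cpu_count(), usable_cores())
)
```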
Define a generator¶
import os
import _raveio, odc_polarQC
# Specify the processing chain
odc_polarQC.algorithm_ids = [
# Run processing chain on a single file. Return an output file string.
def generate(file_string):
    rio =
    pvol = rio.object
    pvol = odc_polarQC.QC(pvol)
    rio.object = pvol
    # Derive an output file name
    path, fstr = os.path.split(file_string)
    ofstr = os.path.join(path, "qc_" + fstr)
    return ofstr
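The output file name derivation in the generator can be tried in isolation, without RAVE installed. This is a minimal sketch: the function name `qc_output_name` and its `prefix` parameter are hypothetical, introduced only for illustration.

```python
import os


def qc_output_name(file_string, prefix="qc_"):
    """Return the input path with the prefix added to its file name."""
    path, fstr = os.path.split(file_string)
    return os.path.join(path, prefix + fstr)


print(qc_output_name(os.path.join("data", "seang.h5")))
```

The directory part is left untouched, so the quality-controlled file ends up next to its input.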
Feed the generator, sequentially¶
import glob, time
ifstrs = glob.glob("data/se*.h5")
before = time.time()
for fstr in ifstrs:
    print(fstr, generate(fstr))
after = time.time()
print("Processing time: %3.2f seconds" % (after - before))
data/searl.h5 data/qc_searl.h5
data/sease.h5 data/qc_sease.h5
data/seang.h5 data/qc_seang.h5
data/seosu.h5 data/qc_seosu.h5
data/sekkr.h5 data/qc_sekkr.h5
data/sekir.h5 data/qc_sekir.h5
data/selul.h5 data/qc_selul.h5
data/selek.h5 data/qc_selek.h5
data/sevar.h5 data/qc_sevar.h5
data/sehud.h5 data/qc_sehud.h5
data/sevil.h5 data/qc_sevil.h5
Processing time: 51.06 seconds
Mental note: run this cell a second time and compare the timings!
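Repeating a timed run can be sketched with a small helper that calls the same job several times and records each duration. The names `time_runs` and `stand_in_job` are hypothetical, and the stand-in job is a placeholder computation rather than the radar processing chain.

```python
import time


def time_runs(func, repeats=2):
    """Call func repeatedly and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        before = time.perf_counter()
        func()
        durations.append(time.perf_counter() - before)
    return durations


def stand_in_job():
    # Hypothetical placeholder for the sequential processing loop
    sum(i * i for i in range(100000))


for number, seconds in enumerate(time_runs(stand_in_job), start=1):
    print("Run %i: %3.4f seconds" % (number, seconds))
```

Comparing the individual durations shows whether the first run differs noticeably from later ones.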
Multiprocess the generator¶
# Both input and output are a list of file strings
def multi_generate(fstrs, procs=None):
    # Pool of processors. procs=None defaults to all available logical cores
    pool = multiprocessing.Pool(procs)
    results = []
    # chunksize=1 means feed a process a new job as soon as the process is idle.
    # In our case, this restricts the queue to one "dispatcher" which is faster.
    r = pool.map_async(generate, fstrs, chunksize=1, callback=results.append)
    r.wait()  # Block until all jobs are done, so the callback has been called
    pool.close()
    pool.join()
    return results[0]
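The same `map_async` pattern can be tried without RAVE by mapping any picklable function over a list. This is a sketch, not the course's implementation: the helper `parallel_map` and its `start_method` parameter are hypothetical, and it collects the results with `get()` instead of a callback.

```python
import multiprocessing


def parallel_map(func, items, procs=None, start_method=None):
    """Apply func to items in a pool of worker processes, preserving order."""
    # start_method=None selects the platform's default start method
    ctx = multiprocessing.get_context(start_method)
    with ctx.Pool(procs) as pool:
        # chunksize=1 hands one item at a time to whichever worker is idle
        async_result = pool.map_async(func, items, chunksize=1)
        # get() blocks until all jobs are done and re-raises worker errors
        return async_result.get()
```

Called as `parallel_map(abs, [-1, 2, -3], procs=2)`, the sketch returns `[1, 2, 3]`. In a script, such a call belongs under an `if __name__ == "__main__":` guard, so that worker processes that import the main module do not run it again.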
Feed the monster, asynchronously!¶
before = time.time()
ofstrs = multi_generate(ifstrs)
after = time.time()
print("Processing time: %3.2f seconds" % (after - before))
Processing time: 2.83 seconds
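The two processing times reported above can be compared as a speedup and as a per-core efficiency. The sketch below uses the timings and the core count printed in this notebook. The function names `speedup` and `efficiency` are hypothetical.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Return how many times faster the parallel run was."""
    return sequential_seconds / parallel_seconds


def efficiency(sequential_seconds, parallel_seconds, cores):
    """Return the speedup divided by the number of cores used."""
    return speedup(sequential_seconds, parallel_seconds) / cores


s = speedup(51.06, 2.83)
e = efficiency(51.06, 2.83, 4)
print("Speedup: %3.2f, efficiency per core: %3.2f" % (s, e))
# Prints: Speedup: 18.04, efficiency per core: 4.51
```

A speedup larger than the number of cores cannot come from parallelism alone, so part of the difference between the two timings presumably has another cause that the timings themselves do not reveal.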