
Dataset Catalog#

The repository ships with multiple HDF5 datasets stored under datasets/<name>/. Each folder contains data.hdf5 (with the train/test/val splits), optional parameter sets, and metadata such as the timesteps. Datasets are typically produced via the codes.create_dataset helper described in the guides/extending-benchmark page, which ensures a consistent layout (a quick inspection sketch follows the list):

  • train / test / val: arrays with shape (n_samples, n_timesteps, n_quantities).

  • Optional *_params: per-trajectory parameters (e.g., radiation field, metallicity).

  • timesteps: explicit timeline (logarithmic or linear).

  • Attributes such as n_quantities, n_parameters, and train/test/val counts for quick inspection (older datasets may infer these values on the fly).
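
For a quick sanity check you can open a file directly with h5py. The snippet below is a minimal sketch; the osu2008 path is only an example, and older files may lack some of the attributes or the optional *_params entries described above.

import h5py

# Hypothetical path; substitute any dataset folder under datasets/<name>/.
with h5py.File("datasets/osu2008/data.hdf5", "r") as f:
    for split in ("train", "test", "val"):
        print(split, f[split].shape)      # (n_samples, n_timesteps, n_quantities)
    print("timesteps:", f["timesteps"][:5], "...")
    print(dict(f.attrs))                  # e.g. n_quantities, n_parameters, split counts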

When you add new datasets, follow the same convention so the CLI can auto-discover everything.
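
In practice codes.create_dataset takes care of the file layout for you. If you do need to write a compatible data.hdf5 by hand, the following is a minimal sketch under the assumptions above; the dataset name, sample counts, and the log-spaced timeline are placeholders.

import numpy as np
import h5py
from pathlib import Path

# Hypothetical dataset name and sizes -- adjust to your data.
Path("datasets/my_dataset").mkdir(parents=True, exist_ok=True)
rng = np.random.default_rng(0)
n_t, n_q = 100, 10  # placeholder numbers of timesteps and quantities

with h5py.File("datasets/my_dataset/data.hdf5", "w") as f:
    f.create_dataset("train", data=rng.random((700, n_t, n_q)))
    f.create_dataset("test", data=rng.random((150, n_t, n_q)))
    f.create_dataset("val", data=rng.random((150, n_t, n_q)))
    f.create_dataset("timesteps", data=np.logspace(-3, 3, n_t))  # explicit (here logarithmic) timeline
    f.attrs["n_quantities"] = n_q
    f.attrs["n_parameters"] = 0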

Visualisations#

For each dataset we usually publish a quick set of plots (trajectories, gradients, distributions, and a random example). Generate or refresh them via:

python datasets/_data_analysis/analyse_all_datasets.py

The script iterates over the datasets list defined inside analyse_all_datasets.py. To visualise a new dataset, append its identifier there and add an entry to the dictionary defined in datasets/_data_analysis/dataset_dict.py. Each entry specifies whether to plot on a log scale, how many quantities to show per subplot (app), and the plotting tolerance that controls the axis limits; an illustrative entry is sketched below. The script then writes PNGs into each dataset folder.
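
A new entry in dataset_dict.py might look roughly like the following. The key names other than app are an assumption here, so mirror an existing entry in the file rather than copying this verbatim.

# Hypothetical entry -- copy the structure of an existing dataset instead.
"my_dataset": {
    "log": True,         # plot trajectories on a log scale
    "app": 5,            # quantities per subplot
    "tolerance": 1e-20,  # plotting tolerance controlling the axis limits
},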

Figure: example visualisation (osu2008) showing the log-scale trajectories generated by the helper script.

Download helper#

You rarely need to download anything manually: the training, tuning, and evaluation CLIs call download_data on demand, which uses the URLs defined in datasets/data_sources.yaml. The first time you reference a dataset, its data.hdf5 is fetched and cached under datasets/<name>/. Keep data_sources.yaml up to date if you mirror the data in your own storage.
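
If you mirror a dataset in your own storage, point its entry in datasets/data_sources.yaml at the new location. The exact schema is not reproduced here, so treat the line below as an illustrative sketch and follow the existing entries in the file; the dataset name and URL are placeholders.

# Hypothetical entry -- match the structure of the existing entries.
my_dataset: https://my-storage.example.org/codes/my_dataset/data.hdf5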