========================================
Calibration with Simulation Cancellation
========================================

Introduction - Calibration with libEnsemble and a Regression Model
------------------------------------------------------------------

This tutorial demonstrates libEnsemble's capability to selectively cancel pending
simulations based on instructions from a calibration Generator Function.
This capability is desirable, especially when evaluations are expensive, since
compute resources may then be more effectively applied toward critical evaluations.

For a somewhat different approach than libEnsemble's :doc:`other tutorials`, we'll
emphasize the settings, functions, and data fields within the calling script,
:ref:`persistent generator`, manager, and :ref:`sim_f` that make this capability
possible, rather than outlining a step-by-step process.

The libEnsemble regression test ``test_persistent_surmise_calib.py`` demonstrates
cancellation of pending simulations, while the ``test_persistent_surmise_killsims.py``
test demonstrates libEnsemble's capability to also kill running simulations that
have been marked as cancelled.

Overview of the Calibration Problem
-----------------------------------

The generator function featured in this tutorial can be found in
``gen_funcs/persistent_surmise_calib.py`` and uses the `surmise`_ library for its
calibration surrogate model interface. The surmise library uses the "PCGPwM"
emulation method in this example.

Say there is a computer model :math:`f(\theta, x)` to be calibrated. To calibrate
is to find some parameter :math:`\theta_0` such that :math:`f(\theta_0, x)` closely
resembles data collected from a physical experiment. For example, a (simple)
physical experiment may involve dropping a ball at different heights to study the
gravitational constant, and the corresponding computer model could be the set of
differential equations that govern the drop.

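
To make the idea of calibration concrete, the ball-drop example can be sketched
in a few lines. This is only an illustration and is not how surmise or the
tutorial's generator works: the closed-form ``drop_time`` model, the brute-force
grid search in ``calibrate``, and all numeric values are assumptions chosen for
this sketch.

.. code-block:: python

    import numpy as np


    def drop_time(theta, x):
        """Toy computer model f(theta, x): fall time from height x under gravity theta."""
        return np.sqrt(2.0 * x / theta)


    def calibrate(observed, x, candidates):
        """Return the candidate theta whose model output is closest to the observations."""
        errors = [np.sum((drop_time(theta, x) - observed) ** 2) for theta in candidates]
        return candidates[int(np.argmin(errors))]


    x = np.array([1.0, 2.0, 5.0, 10.0])        # drop heights in metres (hypothetical)
    observed = drop_time(9.81, x)              # stand-in for experimental data
    candidates = np.linspace(5.0, 15.0, 1001)  # candidate thetas, spaced 0.01 apart
    theta_0 = calibrate(observed, x, candidates)
    print(theta_0)

A grid search like this evaluates the model for every candidate, which is exactly
what becomes unaffordable when each evaluation is an expensive simulation.
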
In a case where the computation of the computer model is relatively expensive,
we employ a fast surrogate model to approximate the model and to inform good
parameters to test next. Here the computer model :math:`f(\theta, x)` is
accessible only through performing :ref:`sim_f` evaluations.

As a convenience for testing, the ``observed`` data values are modelled by calling
the ``sim_f`` for the known true theta, which in this case is the center of a unit
hypercube. These values are therefore stored at the start of libEnsemble's main
:doc:`History array<../history_output_logging>`, and have associated ``sim_id``'s.

The generator function ``gen_f`` then samples an initial batch of parameters
:math:`(\theta_1, \ldots, \theta_n)` and constructs a surrogate model.

For illustration, the initial batch of evaluations is arranged as follows:

.. math::

    \newcommand{\T}{\mathsf{T}}
    \mathbf{f} = \begin{pmatrix} f(\theta_1)^\T \\ \vdots \\ f(\theta_n)^\T \end{pmatrix}
    = \begin{pmatrix} f(\theta_1, x_1) & \ldots & f(\theta_1, x_m) \\
    \vdots & \ddots & \vdots \\
    f(\theta_n, x_1) & \ldots & f(\theta_n, x_m) \end{pmatrix}.

The surrogate then generates (suggests) new parameters for ``sim_f`` evaluations,
so the number of parameters :math:`n` grows as more evaluations are scheduled and
performed. As more evaluations are performed and received by ``gen_f``, the
surrogate evolves and suggests parameters closer to :math:`\theta_0` with
uncertainty estimates. The calibration can be terminated when either ``gen_f``
determines it has found :math:`\theta_0` with some tolerance in the surrounding
uncertainty, or computational resources run out. At termination, the generator
exits and returns, initiating the shutdown of the libEnsemble routine.

The following is a pseudocode overview of the generator. Functions directly from
the calibration library used within the generator function have the ``calib:`` prefix.
Helper functions defined to improve the data received by the calibration library by
interfacing with libEnsemble have the ``libE:`` prefix. All other statements are
workflow logic or persistent generator helper functions like ``send`` or ``receive``::

    libE: calculate observation values and first batch
    while STOP_signal not received:
        receive: evaluated points
        unpack points into 2D Theta x Point structures
        if new model condition:
            calib: construct new model
        else:
            wait to receive more points
        if some condition:
            calib: generate new thetas from model
        calib: if error threshold reached:
            exit loop - done
        send: new points to be evaluated
        if any sent points must be obviated:
            libE: mark points with cancel request
            send: points with cancel request

Point Cancellation Requests and Dedicated Fields
------------------------------------------------

While the generator loops and updates the model based on points returned from
simulations, it conditionally checks whether new thetas should be generated from
the model and, at the same time, whether any *pending* simulations ought to be
cancelled ("obviated"). If so, the generator then calls ``cancel_columns()``::

    if select_condition(pending):
        new_theta, info = select_next_theta(step_add_theta, cal, emu, pending, n_explore_theta)
        ...
        c_obviate = info["obviatesugg"]  # suggested
        if len(c_obviate) > 0:
            cancel_columns(obs_offset, c_obviate, n_x, pending, ps)

``obs_offset`` is an offset that excludes the observations when mapping points in
surmise data structures to ``sim_id``'s, ``c_obviate`` is a selection of columns
to cancel, ``n_x`` is the number of ``x`` values, and ``pending`` is used to check
that points marked for cancellation have not already returned. ``ps`` is the
instantiation of the *PersistentSupport* class that is set up for persistent
generators, and provides an interface for communication with the manager.

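
The mapping from a column to its ``sim_id``'s can be sketched on its own. The
arithmetic (``obs_offset + column * n_x + i``) follows the description above; the
helper name ``column_sim_ids`` is hypothetical, and setting ``obs_offset`` equal
to ``n_x`` is an assumption made only so the example has concrete numbers.

.. code-block:: python

    def column_sim_ids(obs_offset, column, n_x):
        """Return the sim_ids of the n_x points that make up one theta column."""
        first = obs_offset + column * n_x
        return list(range(first, first + n_x))


    n_x = 5
    obs_offset = n_x  # assumption: the observations occupy sim_ids 0..4
    ids = column_sim_ids(obs_offset, column=2, n_x=n_x)
    print(ids)  # [15, 16, 17, 18, 19]

Each column (one theta) therefore corresponds to a contiguous block of ``n_x``
``sim_id``'s that starts after the observation entries.
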
Within ``cancel_columns()``, each column in ``c_obviate`` is iterated over, and if
a point is ``pending`` and thus has not yet been evaluated by a simulation, its
``sim_id`` is appended to a list to be sent to the Manager for cancellation.
Cancellation is requested using the helper function ``request_cancel_sim_ids``
provided by the *PersistentSupport* class. Each of these helper functions is
described :ref:`here`. The entire ``cancel_columns()`` routine is listed below:

.. code-block:: python

    def cancel_columns(obs_offset, c, n_x, pending, ps):
        """Cancel columns"""
        sim_ids_to_cancel = []
        columns = np.unique(c)  # drop duplicate column indices
        for c in columns:
            col_offset = c * n_x
            for i in range(n_x):
                sim_id_cancel = obs_offset + col_offset + i
                if pending[i, c]:  # only cancel points that have not returned
                    sim_ids_to_cancel.append(sim_id_cancel)
                    pending[i, c] = 0  # no longer waiting on this point

        ps.request_cancel_sim_ids(sim_ids_to_cancel)

In future calls to the allocation function by the manager, points that would have
been distributed for simulation work but are now marked with ``cancel_requested``
will not be processed. The manager will send kill signals to workers that are
already processing cancelled points. These signals can be caught and acted on by
the user ``sim_f``; otherwise they will be ignored.

Allocation Function and Cancellation Configuration
--------------------------------------------------

The allocation function used in this example is the *only_persistent_gens* function
in the *start_only_persistent* module. The calling script passes the following
specifications:

.. code-block:: python

    libE_specs["kill_canceled_sims"] = True

    alloc_specs = {
        "alloc_f": alloc_f,
        "user": {
            "init_sample_size": init_sample_size,
            "async_return": True,
            "active_recv_gen": True,
        },
    }

**async_return** tells the allocation function to return results to the generator
as soon as they come back from evaluation (once the initial sample is complete).

**init_sample_size** gives the size of the initial sample that is batch returned to the gen.
This is calculated from other parameters in the calling script.

**active_recv_gen** allows the persistent generator to handle irregular
communications (see below).

By default, workers (including persistent workers) are only allocated work when
they're in an :ref:`idle or non-active state`. However, since this generator must
asynchronously update its model, the worker running this generator remains in an
*active receive* state until it becomes non-persistent. This means both the
manager and persistent worker (generator in this case) must be prepared for
irregular sending/receiving of data.

Calling Script - Reading Results
--------------------------------

Within the libEnsemble calling script, once the main :doc:`libE()<../libe_module>`
function call has returned, it is straightforward to view the History rows that
were marked as cancelled::

    if __name__ == "__main__":  # required by multiprocessing on macOS and Windows
        H, persis_info, flag = libE(
            sim_specs, gen_specs, exit_criteria, persis_info,
            alloc_specs=alloc_specs, libE_specs=libE_specs,
        )

        if is_manager:
            print("Cancelled sims", H["cancel_requested"])

Here's an example graph showing the relationship between scheduled, cancelled
(obviated), failed, and completed simulations requested by the ``gen_f``. Notice
that for each batch of scheduled simulations, most either complete or fail, but
the rest are successfully obviated:

.. image:: ../images/gen_v_fail_or_cancel.png
  :width: 600
  :alt: surmise_sample_graph
  :align: center

Please see the ``test_persistent_surmise_calib.py`` regression test for an example
routine using the surmise calibration generator. The associated simulation function
and allocation function are included in ``sim_funcs/surmise_test_function.py`` and
``alloc_funcs/start_only_persistent.py`` respectively.

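
Because ``H["cancel_requested"]`` is a boolean field, it can also be used as a mask
to list which simulations were cancelled. The sketch below uses a small mock
structured array in place of a real History array; only the ``sim_id`` and
``cancel_requested`` field names come from this tutorial, and the array contents
are made up for illustration.

.. code-block:: python

    import numpy as np

    # Mock stand-in for the History array returned by libE() (hypothetical data).
    H = np.zeros(6, dtype=[("sim_id", int), ("cancel_requested", bool)])
    H["sim_id"] = np.arange(6)
    H["cancel_requested"][[2, 5]] = True

    # Select the sim_ids of the rows that were marked for cancellation.
    cancelled_ids = H["sim_id"][H["cancel_requested"]]
    n_cancelled = int(np.sum(H["cancel_requested"]))

    print("Cancelled sim_ids:", cancelled_ids)
    print("Number cancelled:", n_cancelled)

The same masking pattern applies to any other boolean field in the array.
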
Using cancellations to kill running simulations
------------------------------------------------

If a generated point is cancelled by the generator before it has been given to a
worker for evaluation, then it will never be given. If it has already returned
from the simulation, then results can be returned, but the ``cancel_requested``
field remains ``True``. However, if the simulation is running when the manager
receives the cancellation request, a kill signal will be sent to the worker. This
can be caught and acted upon by a user function; otherwise it will be ignored.

To demonstrate this, the test ``test_persistent_surmise_killsims.py`` captures and
processes this signal from the manager. To do this, a compiled version of the
borehole function is launched by ``sim_funcs/borehole_kills.py`` via the
:doc:`Executor<../executor/overview>`. As the borehole application used here is
serial, we use the :doc:`Executor base class<../executor/executor>` rather than
the commonly used :doc:`MPIExecutor<../executor/mpi_executor>` class. The base
Executor submit routine simply sub-processes a serial application in-place. After
the initial sample batch of evaluations has been processed, an artificial delay is
added to the sub-processed borehole to allow time to receive the kill signal and
terminate the application. Killed simulations will be reported at the end of the
test. As this is dependent on timing, the number of killed simulations will vary
between runs. This test is added simply to demonstrate the killing of running
simulations and thus uses a reduced number of evaluations.

.. _surmise: https://github.com/mosesyhc/surmise
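
The general pattern of polling a sub-processed application and terminating it when
a kill is requested can be sketched with the Python standard library alone. This
is not libEnsemble's Executor API: ``run_until_done_or_killed`` and the
``kill_requested`` callback are hypothetical names, and the callback merely stands
in for whatever mechanism reports the manager's kill signal.

.. code-block:: python

    import subprocess
    import sys
    import time


    def run_until_done_or_killed(cmd, kill_requested, poll_interval=0.05):
        """Run cmd as a subprocess, killing it if kill_requested() becomes true."""
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:  # still running
            if kill_requested():
                proc.kill()
                proc.wait()  # reap the terminated process
                return "killed"
            time.sleep(poll_interval)
        return "finished"


    # A long-running stand-in application, killed on the first poll.
    slow_app = [sys.executable, "-c", "import time; time.sleep(30)"]
    status = run_until_done_or_killed(slow_app, kill_requested=lambda: True)
    print(status)

As in the test described above, whether a given run ends as killed or finished
depends on timing: a process that completes before the next poll is never killed.
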