# CLL Machine Learning Potential Training Workflow

```bash
ai2-kit workflow cll-mlp-training
```

## Introduction

The CLL workflow is an improved version of the DPGEN workflow, designed to meet more complex potential training requirements and to support sustainable code integration. It adopts a closed-loop learning mode to train MLP potentials automatically. In each iteration, the workflow uses labeled structures generated by first-principles methods to train multiple MLP models. These models are then used to explore new structures, which serve as training data for the next iteration. The iterations continue until the quality of the MLP model meets the predefined standard. The configuration of each iteration can be updated according to the training needs to further improve training efficiency.

![cll-mlp-diagram](../res/cll-mlp-diagram.svg)

The main improvements of the CLL workflow include:

* A more semantic configuration system that supports selecting different software and setting different software configurations for different systems.
* A more robust checkpoint mechanism to reduce the impact of execution interruptions.
* Support for remote Python method execution to avoid unnecessary data transfer and to improve the efficiency and stability of execution on HPC clusters.

Currently, the CLL workflow supports the following tools for potential training:

* Label: CP2K, VASP
* Train: DeepMD
* Explore: LAMMPS, LASP
* Select: Model deviation, Distinctive structure selection

Currently, the CLL workflow submits jobs through the HPC executor provided by `ai2-kit`, which supports completing calculations on a single HPC cluster. In the future, multi-cluster scheduling and support for different workflow engines, including `DFlow`, may be added as needed.

## Environment Requirements

* The Python version of the workflow execution environment and that of the HPC execution environment need to be consistent; otherwise remote execution will fail.
* `ai2-kit` needs to be installed in the HPC execution environment. Generally speaking, the `ai2-kit` on the HPC does not need to be exactly the same version as the local one, but a large version gap may still cause problems, so it is recommended to use the same version when possible.

## Installation

```bash
pip install -U ai2-kit
```

## Usage

The usage of the CLL workflow is demonstrated through an example.

### Data Preparation

The data required by the workflow needs to be placed on the HPC cluster in advance. Before starting the workflow, you need to prepare the following data:

* Initial structures or an initial dataset for potential training
* Initial structures for structure search

Assuming you already have a trajectory `h2o_64.aimd.xyz` generated by AIMD, you can use the [`ai2-kit tool ase`](./ase.md) command line tool to prepare these data.
```bash
mkdir -p data/explore

# Extract the training set from frames 0-900, taking every 5th frame
ai2-kit tool ase read h2o_64.aimd.xyz --index ':900:5' - set_cell "[12.42,12.42,12.42,90,90,90]" - write data/training.xyz

# Extract the validation set from frame 900 onward, taking every 5th frame
ai2-kit tool ase read h2o_64.aimd.xyz --index '900::5' - set_cell "[12.42,12.42,12.42,90,90,90]" - write data/validation.xyz

# Extract data for the initial structure search, taking every 100th frame
ai2-kit tool ase read h2o_64.aimd.xyz --index '::100' - set_cell "[12.42,12.42,12.42,90,90,90]" - write_frames "./data/explore/POSCAR-{i:04d}" --format vasp
```

### Configuration File Preparation

The configuration of the CLL workflow is written in YAML and can be split into multiple files in any way you like; `ai2-kit` will automatically merge them during execution. Moderate splitting helps with the maintenance and reuse of configuration files. In general, we can split the configuration into the following parts:

* artifact.yml: the data required by the workflow
* executor.yml: the parameters of the HPC executor
* workflow.yml: the parameters of the workflow software

Another way to build a configuration is to use the configuration files provided in the [example](../../example/config/cll-mlp-training/) as a reference to create your own workflow. For details, please refer to the documents in the example directory.

We start with `artifact.yml`, which configures the data required by the workflow. In this example, we need to configure three datasets: the training dataset, the validation dataset, and the dataset for structure search. The configuration of these three datasets is as follows:

```yaml
.base_dir: &base_dir /home/user01/data/

artifacts:
  h2o_64-train:
    url: !join [*base_dir, training.xyz]

  h2o_64-validation:
    url: !join [*base_dir, validation.xyz]
    attrs:
      deepmd:
        validation_data: true  # Specify this dataset as the validation set (optional)

  h2o_64-explore:
    url: !join [*base_dir, explore]
    includes: POSCAR*
    attrs:
      # If necessary, you can specify software configurations for specific systems here.
      # This example does not require such configuration, so it is left empty.
      # lammps:
      #   plumed_config: !load_text plumed.in
      # cp2k:
      #   input_template: !load_text cp2k.inp
```

Here we use the custom tag `!join` provided by `ai2-kit` to simplify the data configuration. For related features, please refer to the [TIPS](./tips.md) document.

Next, we configure the `executor.yml` file, which configures the HPC connection and the job submission templates of the software.
```yaml
executors:
  hpc-cluster01:
    ssh:
      host: user01@login-01  # Login node
      gateway:
        host: user01@jump-host  # Jump host (optional)
    queue_system:
      slurm: {}  # Use Slurm as the job scheduling system
    work_dir: /home/user01/ai2-kit/workdir  # Working directory
    python_cmd: /home/user01/libs/conda/env/py39/bin/python  # Remote Python interpreter

    context:
      train:
        deepmd:  # Configure the deepmd job submission template
          script_template:
            header: |
              #SBATCH -N 1
              #SBATCH --ntasks-per-node=4
              #SBATCH --job-name=deepmd
              #SBATCH --partition=gpu3
              #SBATCH --gres=gpu:1
              #SBATCH --mem=8G
            setup: |
              set -e
              module load deepmd/2.2
              set +e

      explore:
        lammps:  # Configure the lammps job submission template
          lammps_cmd: lmp_mpi
          concurrency: 5
          script_template:
            header: |
              #SBATCH -N 1
              #SBATCH --ntasks-per-node=4
              #SBATCH --job-name=lammps
              #SBATCH --partition=gpu3
              #SBATCH --gres=gpu:1
              #SBATCH --mem=24G
            setup: |
              set -e
              module load deepmd/2.2
              export OMP_NUM_THREADS=1
              set +e

      label:
        cp2k:  # Configure the cp2k job submission template
          cp2k_cmd: mpiexec.hydra cp2k.popt
          concurrency: 5
          script_template:
            header: |
              #SBATCH -N 1
              #SBATCH --ntasks-per-node=16
              #SBATCH -t 12:00:00
              #SBATCH --job-name=cp2k
              #SBATCH --partition=c52-medium
            setup: |
              set -e
              module load intel/17.5.239 mpi/intel/2017.5.239
              module load gcc/5.5.0
              module load cp2k/7.1
              set +e
```

Finally, `workflow.yml` configures the parameters of the workflow itself.

```yaml
workflow:
  general:
    type_map: [ H, O ]
    mass_map: [ 1.008, 15.999 ]
    max_iters: 2  # Specify the maximum number of iterations

  train:
    deepmd:  # deepmd parameter configuration
      model_num: 4
      # The data used in this example has not been labeled yet, so this is left empty.
      # If there is an existing labeled deepmd/npy dataset, you can specify it here.
      init_dataset: [ ]
      input_template:
        model:
          descriptor:
            type: se_a
            sel:
              - 100
              - 100
            rcut_smth: 0.5
            rcut: 5.0
            neuron:
              - 25
              - 50
              - 100
            resnet_dt: false
            axis_neuron: 16
            seed: 1
          fitting_net:
            neuron:
              - 240
              - 240
              - 240
            resnet_dt: true
            seed: 1
        learning_rate:
          type: exp
          start_lr: 0.001
          decay_steps: 2000
        loss:
          start_pref_e: 0.02
          limit_pref_e: 2
          start_pref_f: 1000
          limit_pref_f: 1
          start_pref_v: 0
          limit_pref_v: 0
        training:
          #numb_steps: 400000
          numb_steps: 5000
          seed: 1
          disp_file: lcurve.out
          disp_freq: 1000
          save_freq: 1000
          save_ckpt: model.ckpt
          disp_training: true
          time_training: true
          profiling: false
          profiling_file: timeline.json

  label:
    cp2k:  # Specify the cp2k parameter configuration
      limit: 10
      # The data used in this example has not been labeled, so it needs to be configured here.
      # If there is an existing labeled dataset, this should be left empty.
      # When this configuration is empty, the workflow will automatically skip the label phase of the first iteration
      # and start execution from the train phase.
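      # The names below are artifact keys defined in artifact.yml (see the Data Preparation section above).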
      init_system_files: [ h2o_64-train, h2o_64-validation ]
      input_template: |
        &GLOBAL
          PROJECT DPGEN
        &END
        &FORCE_EVAL
          &DFT
            BASIS_SET_FILE_NAME /home/user01/data/cp2k/BASIS/BASIS_MOLOPT
            POTENTIAL_FILE_NAME /home/user01/data/cp2k/POTENTIAL/GTH_POTENTIALS
            CHARGE 0
            UKS F
            &MGRID
              CUTOFF 600
              REL_CUTOFF 60
              NGRIDS 4
            &END
            &QS
              EPS_DEFAULT 1.0E-12
            &END
            &SCF
              SCF_GUESS RESTART
              EPS_SCF 3.0E-7
              MAX_SCF 50
              &OUTER_SCF
                EPS_SCF 3.0E-7
                MAX_SCF 10
              &END
              &OT
                MINIMIZER DIIS
                PRECONDITIONER FULL_SINGLE_INVERSE
                ENERGY_GAP 0.1
              &END
            &END
            &LOCALIZE
              METHOD CRAZY
              MAX_ITER 2000
              &PRINT
                &WANNIER_CENTERS
                  IONS+CENTERS
                  FILENAME =64water_wannier.xyz
                &END
              &END
            &END
            &XC
              &XC_FUNCTIONAL PBE
              &END
              &vdW_POTENTIAL
                DISPERSION_FUNCTIONAL PAIR_POTENTIAL
                &PAIR_POTENTIAL
                  TYPE DFTD3
                  PARAMETER_FILE_NAME dftd3.dat
                  REFERENCE_FUNCTIONAL PBE
                &END
              &END
            &END
          &END
          &SUBSYS
            @include coord_n_cell.inc
            &KIND O
              BASIS_SET DZVP-MOLOPT-SR-GTH
              POTENTIAL GTH-PBE-q6
            &END
            &KIND H
              BASIS_SET DZVP-MOLOPT-SR-GTH
              POTENTIAL GTH-PBE-q1
            &END
          &END
          &PRINT
            &FORCES ON
            &END
          &END
        &END

  explore:
    lammps:
      timestep: 0.0005
      sample_freq: 100
      nsteps: 2000
      ensemble: nvt
      template_vars:
        POST_INIT: |
          neighbor 1.0 bin
          box tilt large
        POST_READ_DATA: |
          change_box all triclinic
      system_files: [ h2o_64-explore ]
      explore_vars:
        TEMP: [ 330, 430, 530 ]
        PRES: [ 1 ]
        TAU_T: 0.1  # Optional
        TAU_P: 0.5  # Optional

  select:
    model_devi:
      f_trust_lo: 0.4
      f_trust_hi: 0.6
      # Remove structurally similar configurations using a clustering method.
      # If not needed, delete the following line.
      asap_options: {}

  update:
    walkthrough:
      # You can specify the parameter configuration to be used from the second iteration onward here.
      # The parameters configured here override the corresponding configuration in the workflow section
      # and can be adjusted according to the training strategy.
      table:
        - train:  # The training steps are 10000 in the second iteration
            deepmd:
              input_template:
                training:
                  numb_steps: 10000
        - train:  # The training steps are 20000 in the third iteration
            deepmd:
              input_template:
                training:
                  numb_steps: 20000
```

### Execute Workflow

After completing the configuration, you can start the execution of the workflow:

```bash
ai2-kit workflow cll-mlp-training *.yml --executor hpc-cluster01 --path-prefix h2o_64-run-01 --checkpoint run-01.ckpt
```

In the above command:

* `*.yml` specifies the configuration files. You can specify multiple configuration files and `ai2-kit` will automatically merge them; here the `*` wildcard is used.
* `--executor hpc-cluster01` specifies the HPC executor to use. Here, the `hpc-cluster01` executor configured in the previous section is used.
* `--path-prefix h2o_64-run-01` specifies the remote working directory: a `h2o_64-run-01` directory will be created under `work_dir` to store the execution results of the workflow.
* `--checkpoint run-01.ckpt` generates a checkpoint file locally to save the execution state of the workflow, so that execution can be resumed after an interruption.

## Special Use Cases

### Train MLP Model for FEP-Based Redox Potential Calculation

To train an MLP model for FEP-based redox potential calculation, you need to use the `fparam` feature of `deepmd` to fit the PES of the initial (ini) and final (fin) states of the reaction. You also need to configure different label settings for the ini and fin states. The key configurations you need to pay attention to are listed below.

#### Configure explore artifacts with `fep-ini` and `fep-fin` attrs

Explore artifacts are used as the initial structures for structure exploration.
For example:

```yaml
artifacts:
  explore-h2ox64:
    url: /path/to/h2ox64.xyz
    attrs:
      fep-ini:
        dp_fparam: 0
        cp2k:
          input_template: !load_text cp2k-ini.inp
      fep-fin:
        dp_fparam: 1
        cp2k:
          input_template: !load_text cp2k-fin.inp
```

#### Configure LAMMPS explore mode with `fep-redox`

The following example omits the common LAMMPS configuration and only shows the `fep-redox` specific options.

```yaml
workflow:
  explore:
    lammps:
      mode: fep-redox
      template_vars:
        # The fparam option of the deepmd pair style.
        # The value of fparam should be consistent with dp_fparam in the explore artifacts.
        # doc: https://docs.deepmodeling.com/projects/deepmd/en/latest/third-party/lammps-command.html#pair-style-deepmd
        FEP_INI_DP_OPT: fparam 0
        FEP_FIN_DP_OPT: fparam 1
```

#### Configure `numb_fparam` in the `deepmd` input template

https://docs.deepmodeling.com/projects/deepmd/en/master/train/train-input.html

TODO: example.

### Training MLP for FEP-Based pKa Calculation

TODO

### Screening Beyond Model Deviation

The configurations produced by the `Exploration` stage are selected simply based on whether their maximum model deviation of forces falls within the range defined by `f_trust_lo` and `f_trust_hi`. However, such a screening process still leaves a large number of configurations that require labeling. The screening{cite}`Guo2023checmate` is improved by adding a clustering procedure, which removes structurally similar configurations. To enable this functionality, `asap_options: {}` is added in the above `workflow.yml`. For details of the clustering method, users are referred to [the documentation](https://bingqingcheng.github.io/cluster.html) of ASAP{cite}`Cheng2020mapping`.

## Citation

If you use the clustering method in your research, please cite the following papers: {cite}`Guo2023checmate,Cheng2020mapping`

If you use LASP in your research, please cite the following paper: {cite}`Huang2019lasp`