
# Hands-on tutorial for DP-GEN (v0.10.6)

## Workflow of the DP-GEN
DeeP Potential GENerator (DP-GEN) is a package that implements a concurrent learning scheme to generate reliable DP models. Typically, the DP-GEN workflow contains three processes: init, run, and autotest. 

1. init: generate the initial training dataset by first-principles calculations.
2. run: the main process of DP-GEN, in which the training dataset is enriched and the quality of the DP models is improved automatically.
3. autotest: calculate a simple set of properties and/or perform tests for comparison with DFT and/or empirical interatomic potentials.

This tutorial aims to help you quickly get to grips with the run process, so only a brief introduction to the init and autotest processes is offered.

## Example: a gas phase methane molecule
The following introduces the basic usage of the DP-GEN, taking a gas-phase methane molecule as an example. 

### Init

The initial dataset is used to train multiple (default 4) initial DP models and it can be generated in a custom way or in the standard way provided by DP-GEN.

**Custom way**

Performing ab-initio molecular dynamics (AIMD) simulations directly is a common custom way of generating initial data. The following suggestions are given to users who generate initial data via AIMD simulation:

- Perform AIMD simulations at higher temperatures.
- Start AIMD simulations from several (as many as possible) unrelated initial configurations.
- Save snapshots from AIMD trajectories at a sufficiently large time interval to avoid sampling highly correlated configurations.
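
The last suggestion can be illustrated with a minimal sketch. The function name `subsample_frames` and the parameters `stride` and `skip` are hypothetical and are not part of DP-GEN; the sketch only shows which frame indices survive when every `stride`-th snapshot is kept after discarding an initial segment.

```python
def subsample_frames(n_frames, stride, skip=0):
    """Return the indices of the frames kept from an AIMD trajectory.

    `skip` drops the first frames (for example an equilibration segment) and
    `stride` is the saving interval. Both names are illustrative only.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(skip, n_frames, stride))


# A 1000-frame trajectory: discard the first 100 frames, keep every 50th one.
kept = subsample_frames(1000, stride=50, skip=100)
print(len(kept), kept[:3])
```

A suitable stride depends on the system and the temperature, so the numbers above are placeholders rather than recommendations.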

**Standard way of DP-GEN**

For bulk materials, the initial data can be generated using DP-GEN's init_bulk method. In the init_bulk method, the given configuration is first relaxed by ab-initio calculation and subsequently scaled or perturbed. Next, these scaled or perturbed configurations are used to start small-scale AIMD simulations, and the AIMD data is finally converted to the data format required by DeePMD-kit. Basically, init_bulk can be divided into four parts:

1.  Relax in folder  `00.place_ele`
2.  Perturb and scale in folder  `01.scale_pert`
3.  Run a short AIMD in folder  `02.md`
4.  Collect data in folder  `02.md`.

For surface systems, the initial data can be generated using DP-GEN's init_surf method. Basically, init_surf can be divided into two parts:

1.  Build a specific surface in folder  `00.place_ele`
2.  Perturb and scale in folder  `01.scale_pert`

The above steps are carried out automatically when the initial data is generated in the standard way of DP-GEN. Users only need to prepare the input files for the ab-initio calculation and for DP-GEN (param.json and machine.json).

When generating the initial data for bulk materials in the standard way, execute the following command:
```sh
$ dpgen init_bulk param.json machine.json
```
For surface systems, execute
```sh
$ dpgen init_surf param.json machine.json
```
A detailed description of preparing initial data in the standard way can be found in the 'Init' section of [DP-GEN's documentation](https://docs.deepmodeling.com/projects/dpgen/en/latest/).

**Initial data of this tutorial**

In this tutorial, we take a gas-phase methane molecule as an example. We have prepared initial data in dpgen_example/init. Now download the dpgen_example and uncompress it:
```sh
wget https://dp-public.oss-cn-beijing.aliyuncs.com/community/dpgen_example.tar.xz
tar xvf dpgen_example.tar.xz
```
Go to the dpgen_example folder and check its contents:
```sh
$ cd dpgen_example  
$ ls  
init run
```
- Folder init contains the initial data generated by AIMD simulations.
- Folder run contains input files for the run process.

First, check the init folder with the `tree` command.
```sh
$ tree init -L 2  
```
On the screen, you can see
```sh
init  
├── CH4.POSCAR  
├── CH4.POSCAR.01x01x01  
│ ├── 00.place_ele  
│ ├── 01.scale_pert
│ ├── 02.md  
│ └── param.json  
├── INCAR_methane.md  
├── INCAR_methane.rlx  
└── param.json
```
- Folder CH4.POSCAR.01x01x01 contains the files generated by the DP-GEN init_bulk process.
- INCAR_* and CH4.POSCAR are the standard INCAR and POSCAR files for VASP.
- param.json is used to specify the details of the DP-GEN init_bulk process.

Note that POTCAR and machine.json are the same for the init and run processes and can be found in the folder run.

### Run
The run process consists of a series of successive iterations carried out in a prescribed order, for example heating the system to a certain temperature in each iteration. Each iteration is composed of three steps: exploration, labeling, and training.

#### Input files
First, we introduce the input files required for the run process of DP-GEN. We have prepared the input files in dpgen_example/run.

Now go into dpgen_example/run.
```sh
$ cd dpgen_example/run
$ ls
INCAR_methane  machine.json  param.json  POTCAR_C  POTCAR_H
```
- param.json contains the DP-GEN settings for the current task.
- machine.json configures the task dispatcher: the machine environment and the resource requirements are set here.
- INCAR* and POTCAR* are the input files for the VASP package. All first-principles calculations share the same parameters, namely the ones you set in param.json.

We can perform the run process as we expect by specifying the keywords in param.json and machine.json. A description of these keywords is given below.

#### param.json

The keywords in param.json can be split into 4 parts:

- System and data: used to specify the atom types, initial data, etc.
- Training: mainly used to specify tasks in the training step;
- Exploration: mainly used to specify tasks in the exploration step;
- Labeling: mainly used to specify tasks in the labeling step.
 
Here we introduce the main keywords in param.json, taking a gas-phase methane molecule as an example. 

**System and data**

The system and data related keywords are given in the following:

```json
"type_map": ["H","C"],
"mass_map": [1,12],
"init_data_prefix": "../",
"init_data_sys": ["init/CH4.POSCAR.01x01x01/02.md/sys-0004-0001/deepmd"],
"sys_configs_prefix": "../",
"sys_configs": [
     ["init/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/00000*/POSCAR"],
     ["init/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/00001*/POSCAR"]
],
"_comment": " that's all ",
```  
Description of keywords:

| Key                   | Type | Description                                                                                                 |
|-----------------------|------|-------------------------------------------------------------------------------------------------------------|
| "type_map"            | list | Atom types                                                                                                  |
| "mass_map"            | list | Standard atomic weights                                                                                     |
| "init_data_prefix"    | str  | Prefix of initial data directories                                                                          |
| "init_data_sys"       | list | Directories of initial data. You may use either an absolute or a relative path here.                        |
| "sys_configs_prefix"  | str  | Prefix of sys_configs                                                                                       |
| "sys_configs"         | list | Containing directories of structures to be explored in iterations. Wildcard characters are supported here.  |

Description of example:

The system related keys specify the basic information about the system. "type_map" gives the atom types, i.e. "H" and "C". "mass_map" gives the standard atomic weights, i.e. "1" and "12".

The data related keys specify the init data for training the initial DP models and the structures used for model_devi calculations. "init_data_prefix" and "init_data_sys" specify the location of the init data. "sys_configs_prefix" and "sys_configs" specify the location of the structures. Here, the init data is provided at "....../init/CH4.POSCAR.01x01x01/02.md/sys-0004-0001/deepmd". The structures are divided into two groups and provided at "....../init/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/00000*/POSCAR" and "....../init/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/00001*/POSCAR".
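
To make the role of the prefix and the wildcards concrete, here is a minimal sketch of how such patterns could be expanded into one list of files per group. The helper `expand_sys_configs` is hypothetical, and the scratch directory only imitates the tutorial layout with shortened paths; how DP-GEN resolves the patterns internally may differ.

```python
import glob
import os
import tempfile


def expand_sys_configs(prefix, sys_configs):
    """Expand each group of wildcard patterns into a sorted list of files.

    One list is returned per group, so the position of a list in the result
    plays the role of an entry in "sys_idx".
    """
    groups = []
    for patterns in sys_configs:
        files = []
        for pattern in patterns:
            files.extend(glob.glob(os.path.join(prefix, pattern)))
        groups.append(sorted(files))
    return groups


# Demo on a scratch directory with made-up structure folders.
with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "init", "scale-1.000")
    for name in ("000000", "000001", "000010"):
        os.makedirs(os.path.join(base, name))
        open(os.path.join(base, name, "POSCAR"), "w").close()
    groups = expand_sys_configs(root, [
        ["init/scale-1.000/00000*/POSCAR"],
        ["init/scale-1.000/00001*/POSCAR"],
    ])
    group_sizes = [len(files) for files in groups]

print(group_sizes)  # number of structures in group 0 and group 1
```

In the demo, `00000*` matches the folders 000000 and 000001, while `00001*` matches 000010, which is why the two groups contain different structures.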

**Training**
 
The training related keywords are given in the following: 

```json
"numb_models": 4,
"default_training_param": {
     "model": {
         "type_map": ["H","C"],
         "descriptor": {
             "type": "se_a",
             "sel": [16,4],
             "rcut_smth": 0.5,
             "rcut": 5.0,
             "neuron": [120,120,120],
             "resnet_dt": true,
             "axis_neuron": 12,
             "seed": 1
        },
         "fitting_net": {
             "neuron": [25,50,100],
             "resnet_dt": false,
             "seed": 1
         }
     },
     "learning_rate": {
         "type": "exp",
         "start_lr": 0.001,
         "decay_steps": 5000
     },
     "loss": {
         "start_pref_e": 0.02,
         "limit_pref_e": 2,
         "start_pref_f": 1000,
         "limit_pref_f": 1,
         "start_pref_v": 0.0,
         "limit_pref_v": 0.0
     },
     "training": {
         "stop_batch": 400000,
         "disp_file": "lcurve.out",
         "disp_freq": 1000,
         "numb_test": 4,
         "save_freq": 1000,
         "save_ckpt": "model.ckpt",
         "disp_training": true,
         "time_training": true,
         "profiling": false,
         "profiling_file": "timeline.json",
         "_comment": "that's all"
     }
 },
```
Description of keywords:

| Key                       | Type | Description                                  |
|---------------------------|------|----------------------------------------------|
| "numb_models"             | int  | Number of models to be trained in 00.train.  |
| "default_training_param"  | dict | Training parameters for DeePMD-kit.          |

Description of example:

The training related keys specify the details of training tasks. "numb_models" specifies the number of models to be trained. "default_training_param" specifies the training parameters for  `DeePMD-kit`. Here, 4 DP models will be trained. 

The training part of DP-GEN is performed by DeePMD-kit, so the keywords here are the same as those of DeePMD-kit and will not be explained here. A detailed explanation of those keywords can be found at  [DeePMD-kit’s documentation](https://docs.deepmodeling.com/projects/deepmd/en/master/).
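
Since the models trained in one iteration share the same training parameters, a rough sketch of how "numb_models" inputs could be derived from "default_training_param" is given below. The function `make_training_inputs` is hypothetical, and drawing a fresh random seed for the descriptor and the fitting net is an assumption made for illustration; it is not a copy of DP-GEN's implementation.

```python
import copy
import random


def make_training_inputs(default_training_param, numb_models, rng=None):
    """Return `numb_models` copies of the template with different seeds."""
    if rng is None:
        rng = random.Random()
    inputs = []
    for _ in range(numb_models):
        params = copy.deepcopy(default_training_param)  # keep the template intact
        params["model"]["descriptor"]["seed"] = rng.randrange(2**32)
        params["model"]["fitting_net"]["seed"] = rng.randrange(2**32)
        inputs.append(params)
    return inputs


# Trimmed-down template with only the keys this sketch touches.
template = {
    "model": {
        "descriptor": {"type": "se_a", "neuron": [120, 120, 120], "seed": 1},
        "fitting_net": {"neuron": [25, 50, 100], "seed": 1},
    }
}
inputs = make_training_inputs(template, numb_models=4, rng=random.Random(0))
print(len(inputs), "training inputs prepared")
```

Everything except the seeds is identical across the copies, so differences between the resulting models come from the network initialization only.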

**Exploration**

The exploration related keywords are given in the following: 
```json
"model_devi_dt": 0.002,
"model_devi_skip": 0,
"model_devi_f_trust_lo": 0.05,
"model_devi_f_trust_hi": 0.15,
"model_devi_e_trust_lo": 10000000000.0,
"model_devi_e_trust_hi": 10000000000.0,
"model_devi_clean_traj": true,
"model_devi_jobs": [
     {"sys_idx": [0],"temps": [100],"press": [1.0],"trj_freq": 10,"nsteps": 300,"ensemble": "nvt","_idx": "00"},
     {"sys_idx": [1],"temps": [100],"press": [1.0],"trj_freq": 10,"nsteps": 3000,"ensemble": "nvt","_idx": "01"}
],
```
Description of keywords:

| Key                      | Type           | Description   |
|--------------------------|----------------|---------------|
| "model_devi_dt"          | float          | Timestep for MD |
| "model_devi_skip"        | int            | Number of structures skipped for fp in each MD |
| "model_devi_f_trust_lo"  | float or list  | Lower bound of forces for the selection. If a list, it should be set for each index in sys_configs, respectively. |
| "model_devi_f_trust_hi"  | float or list  | Upper bound of forces for the selection. If a list, it should be set for each index in sys_configs, respectively. |
| "model_devi_v_trust_lo"  | float or list  | Lower bound of virial for the selection. If a list, it should be set for each index in sys_configs, respectively. Should be used with DeePMD-kit v2.x. |
| "model_devi_v_trust_hi"  | float or list  | Upper bound of virial for the selection. If a list, it should be set for each index in sys_configs, respectively. Should be used with DeePMD-kit v2.x. |
| "model_devi_clean_traj"  | bool or int    | If a boolean, it denotes whether to clean the traj folders in MD, since they are too large. If an integer, the traj folders of the most recent n iterations are retained and the others are removed. |
| "model_devi_jobs"        | list           | Settings for exploration in 01.model_devi. Each dict in the list corresponds to one iteration. The index in model_devi_jobs corresponds exactly to the index of the iteration. |
| &nbsp;&nbsp;&nbsp;&nbsp;"sys_idx"   | list of int  | Systems to be selected as the initial structures of MD and explored. The index corresponds exactly to "sys_configs". |
| &nbsp;&nbsp;&nbsp;&nbsp;"temps"     | list         | Temperature (K) in MD |
| &nbsp;&nbsp;&nbsp;&nbsp;"press"     | list         | Pressure (bar) in MD |
| &nbsp;&nbsp;&nbsp;&nbsp;"trj_freq"  | int          | Frequency at which trajectory frames are saved in MD. |
| &nbsp;&nbsp;&nbsp;&nbsp;"nsteps"    | int          | Number of MD steps. |
| &nbsp;&nbsp;&nbsp;&nbsp;"ensemble"  | str          | Ensemble used in MD; options include “npt” and “nvt”. |


Description of example:

The exploration related keys specify the details of exploration tasks. Here, MD simulations are performed at a temperature of 100 K and a pressure of 1.0 bar with an integration time step of 2 fs under the NVT ensemble. Two iterations are set in "model_devi_jobs". MD simulations are run for 300 and 3000 time steps with the first and second groups of structures in "sys_configs" in iterations 00 and 01, respectively. No structures are skipped ("model_devi_skip" is 0) and `"trj_freq"` is set to 10, so about 30 and 300 structures are saved from each MD trajectory in iterations 00 and 01. If the "max_devi_f" of a saved structure falls between 0.05 and 0.15, DP-GEN will treat the structure as a candidate. We choose to clean the traj folders in MD since they are too large. If you want to keep the traj folders of the most recent n iterations, you can set "model_devi_clean_traj" to an integer.
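
The number of frames written by each MD trajectory follows from "nsteps" and "trj_freq". The short sketch below does this arithmetic for the two jobs above. It assumes that the frame at step 0 is written as well, which is suggested by the `model_devi.out` excerpt and by the 310 frames that dpgen.log reports for the 10 tasks of iteration 00 later in this tutorial; the helper name `saved_frames` is hypothetical.

```python
def saved_frames(nsteps, trj_freq, include_step_zero=True):
    """Number of frames one MD trajectory writes when saving every `trj_freq` steps."""
    frames = nsteps // trj_freq
    return frames + 1 if include_step_zero else frames


# Only the keys needed for the arithmetic are repeated here.
model_devi_jobs = [
    {"sys_idx": [0], "trj_freq": 10, "nsteps": 300, "_idx": "00"},
    {"sys_idx": [1], "trj_freq": 10, "nsteps": 3000, "_idx": "01"},
]
frames_per_job = {
    job["_idx"]: saved_frames(job["nsteps"], job["trj_freq"])
    for job in model_devi_jobs
}
print(frames_per_job)
```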

**Labeling**

The labeling related keywords are given in the following: 
```json
"fp_style": "vasp",
"shuffle_poscar": false,
"fp_task_max": 20,
"fp_task_min": 5,
"fp_pp_path": "./",
"fp_pp_files": ["POTCAR_H","POTCAR_C"],
"fp_incar": "./INCAR_methane"
```

Description of keywords:

| Key               | Type            | Description                                                                                                              |
|-------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------|
| "fp_style"        | String          | Software used for first-principles calculations. Options currently include “vasp”, “pwscf”, “siesta” and “gaussian”.     |
| "shuffle_poscar"  | Boolean         |                                                                                                                          |
| "fp_task_max"     | Integer         | Maximum number of structures to be calculated in 02.fp of each iteration.                                                |
| "fp_task_min"     | Integer         | Minimum number of structures to be calculated in 02.fp of each iteration.                                                |
| "fp_pp_path"      | String          | Directory in which the pseudopotential files to be used for 02.fp are located.                                           |
| "fp_pp_files"     | List of string  | Pseudopotential files to be used for 02.fp. Note that the order of elements should correspond to the order in type_map.  |
| "fp_incar"        | String          | Input file for VASP. INCAR must specify KSPACING and KGAMMA.                                                             |

Description of example:

The labeling related keys specify the details of labeling tasks. Here, a minimum of 5 and a maximum of 20 structures will be labeled using the VASP code with the INCAR provided at "....../INCAR_methane" and POTCAR provided at "....../methane/POTCAR" in each iteration. Note that the order of elements in POSCAR and POTCAR should correspond to the order in `type_map`.
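
The interplay of "fp_task_min" and "fp_task_max" can be sketched as follows. The function `pick_fp_tasks` is hypothetical. Capping the selection at "fp_task_max" by random choice follows the description of 02.fp later in this tutorial, whereas labeling nothing when there are fewer candidates than "fp_task_min" is an assumption of this sketch that should be checked against the DP-GEN documentation.

```python
import random


def pick_fp_tasks(candidates, fp_task_max, fp_task_min, rng=None):
    """Choose which candidate structures are sent to first-principles labeling."""
    if rng is None:
        rng = random.Random()
    candidates = list(candidates)
    if len(candidates) < fp_task_min:
        return []  # assumption: too few candidates, so nothing is labeled
    if len(candidates) <= fp_task_max:
        return candidates  # everything fits within the budget
    return rng.sample(candidates, fp_task_max)  # random subset of the allowed size


# With the tutorial settings, 12 candidates are all kept and 50 are cut to 20.
print(len(pick_fp_tasks(range(12), fp_task_max=20, fp_task_min=5)))
print(len(pick_fp_tasks(range(50), fp_task_max=20, fp_task_min=5)))
```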

#### machine.json

Each iteration in the run process of DP-GEN is composed of three steps: exploration, labeling, and training. Accordingly, machine.json is composed of three parts: train, model_devi, and fp. Each part is a list of dicts. Each dict can be considered as an independent environment for calculation. 

In this section, we will show you how to perform the training step at a local workstation, model_devi step at a local Slurm cluster, and fp step at a remote PBS cluster using the new DPDispatcher (the value of keyword "api_version" is larger than or equal to 1.0).  For each step, three types of keys are needed:
- Command: provides the command used to execute each step.
- Machine: specifies the machine environment (local workstation, local or remote cluster, or cloud server).
- Resources: specifies the number of groups, nodes, CPUs, and GPUs, and enables the virtual environment.

**Performing the training step at a local workstation**

In this example, we perform the training step on a local workstation.

```json
"train": [
    {
      "command": "dp",
      "machine": {
        "batch_type": "Shell",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/home/user1234/work_path"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 4,
        "gpu_per_node": 1,
        "group_size": 1,
        "source_list": ["/home/user1234/deepmd.env"]
      }
    }
  ],
```
Description of keywords:

| Key| Type| Description|
|------|------|------|
| "command"| String| The command to be executed for this task.|
| "machine"| dict| The definition of the machine.|
| &nbsp;&nbsp;&nbsp;&nbsp;"batch_type"  | str| The batch job system type.|
| &nbsp;&nbsp;&nbsp;&nbsp;"context_type"| str| The connection used to reach the remote machine.|
| &nbsp;&nbsp;&nbsp;&nbsp;"local_root"  | str| The dir where the tasks and related files are located.|
| &nbsp;&nbsp;&nbsp;&nbsp;"remote_root" | str| The dir where the tasks are executed on the remote machine.|
| "resources"| dict| The definition of the resources.|
| &nbsp;&nbsp;&nbsp;&nbsp;"number_node" | int| The number of nodes needed for each job.|
| &nbsp;&nbsp;&nbsp;&nbsp;"cpu_per_node"| int| The number of CPUs of each node assigned to each job.|
| &nbsp;&nbsp;&nbsp;&nbsp;"gpu_per_node"| int| The number of GPUs of each node assigned to each job.|
| &nbsp;&nbsp;&nbsp;&nbsp;"group_size"  | int| The number of tasks in a job.|
| &nbsp;&nbsp;&nbsp;&nbsp;"source_list" | list| The environment files to be sourced before the command is executed.|

Description of example:

The "command" for the training tasks in the DeePMD-kit is "dp".

In the machine parameter, "batch_type" specifies the type of job scheduling system. If there is no job scheduling system, we can use "Shell" to perform the task. "context_type" specifies the method of data transfer, and "local" means copying and moving data via the local file system (e.g. cp, mv, etc.). In DP-GEN, the paths of all tasks are automatically located and set by the software, and therefore "local_root" is always set to "./". The input files for each task will be sent to "remote_root" and the task will be performed there, so we need to make sure that the path exists.

In the resources parameter, "number_node", "cpu_per_node", and "gpu_per_node" specify the number of nodes, the number of CPUs, and the number of GPUs required for a task respectively. "group_size", which needs to be highlighted, specifies how many tasks will be packed into a group. In the training tasks, we need to train 4 models. If we only have one GPU, we can set the "group_size" to 4. If "group_size" is set to 1, 4  models will be trained on one GPU at the same time, as there is no job scheduling system. Finally, the environment variables can be activated by "source_list". In this example, "source /home/user1234/deepmd.env" is executed before "dp" to load the environment variables necessary to perform the training task.
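
The effect of "group_size" can be seen by packing a list of tasks into jobs. This is a minimal sketch with a hypothetical helper, `pack_tasks`; the task names are made up, and the real grouping is handled by the dispatcher.

```python
def pack_tasks(tasks, group_size):
    """Split `tasks` into consecutive groups of at most `group_size` tasks."""
    if group_size < 1:
        raise ValueError("group_size must be a positive integer")
    return [tasks[i:i + group_size] for i in range(0, len(tasks), group_size)]


training_tasks = ["000", "001", "002", "003"]  # the 4 models to be trained

# group_size = 4: one job that runs the four trainings one after another.
print(pack_tasks(training_tasks, 4))
# group_size = 1: four separate jobs, one training each.
print(pack_tasks(training_tasks, 1))
```

With a group size that does not divide the number of tasks, the last job is simply smaller, e.g. 12 tasks with a group size of 5 give jobs of 5, 5, and 2 tasks.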

**Perform the model_devi step at a local Slurm cluster**

In this example, we perform the model_devi step at a local Slurm cluster.

```json
"model_devi": [
    {
      "command": "lmp",
      "machine": {
       "context_type": "local",
        "batch_type": "Slurm",
        "local_root": "./",
        "remote_root": "/home/user1234/work_path"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 4,
        "gpu_per_node": 1,
        "queue_name": "QueueGPU",
        "custom_flags" : ["#SBATCH --mem=32G"],
        "group_size": 10,
        "source_list": ["/home/user1234/lammps.env"]
      }
    }
],
```
Description of keywords:

| Key  | Type | Description|
|------|------|------------|
| "queue_name"  | String| The queue name of the batch job scheduler system.|
| "custom_flags"| list| The extra lines passed to the header of the job submission script.|

Description of example:

The "command" for the model_devi tasks in the LAMMPS is "lmp".

In the machine parameter, we specify the type of job scheduling system by changing the "batch_type" to "Slurm".

In the resources parameter, we specify the name of the queue to which the task is submitted by adding "queue_name". We can add additional lines to the job script via "custom_flags". In the model_devi step, there are frequently many short tasks, so we usually pack multiple tasks (e.g. 10) into a group for submission. The other parameters are similar to those for the local workstation.

**Perform the fp step in a remote PBS cluster**

In this example, we perform the fp step at a remote PBS cluster that can be accessed via SSH.

```json
"fp": [
    {
      "command": "mpirun -n 32 vasp_std",
      "machine": {
       "context_type": "SSHContext",
        "batch_type": "PBS",
        "local_root": "./",
        "remote_root": "/home/user1234/work_path",
        "remote_profile": {
          "hostname": "39.xxx.xx.xx",
          "username": "user1234"
         }
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 32,
        "gpu_per_node": 0,
        "queue_name": "QueueCPU",
        "group_size": 5,
        "source_list": ["/home/user1234/vasp.env"]
      }
    }
],
```
Description of keywords:

| Key| Type| Description|
|----|-----|------------|
| "remote_profile"| dict| The information used to maintain the connection with remote machine.|
| &nbsp;&nbsp;&nbsp;&nbsp;"hostname"| str | hostname or ip of ssh connection.|
| &nbsp;&nbsp;&nbsp;&nbsp;"username"| str | username of target linux system.|

Description of example:

The VASP code is used for fp tasks and MPI is used for parallel computing, so "mpirun -n 32" is added to specify the number of parallel processes.

In the machine parameter, "context_type" is changed to "SSHContext" and "batch_type" is changed to "PBS". It is worth noting that "remote_root" should be set to an accessible path on the remote PBS cluster. "remote_profile" is added to specify the information used to connect to the remote cluster, including hostname, username, password, port, etc.

In the resources parameter, we set "gpu_per_node" to 0 since it is cost-effective to use the CPU for VASP calculations.
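
Before launching a run, it can be convenient to check that machine.json has the three parts described above and that every entry carries the three types of keys. The checker below is a hypothetical convenience script and is not part of DP-GEN or DPDispatcher; it inspects only the structure, not the values.

```python
import json

REQUIRED_STEPS = ("train", "model_devi", "fp")
REQUIRED_KEYS = ("command", "machine", "resources")


def check_machine(config):
    """Return a list of human-readable problems found in a machine.json dict."""
    problems = []
    for step in REQUIRED_STEPS:
        entries = config.get(step)
        if not isinstance(entries, list) or not entries:
            problems.append(f"{step}: expected a non-empty list of dicts")
            continue
        for i, entry in enumerate(entries):
            for key in REQUIRED_KEYS:
                if key not in entry:
                    problems.append(f"{step}[{i}]: missing '{key}'")
    return problems


# A deliberately incomplete example: "fp" is absent and "model_devi" lacks "resources".
example = json.loads("""
{
  "train": [{"command": "dp", "machine": {}, "resources": {}}],
  "model_devi": [{"command": "lmp", "machine": {}}]
}
""")
problems = check_machine(example)
for problem in problems:
    print(problem)
```

For a real file, the dict would come from `json.load` on the opened machine.json instead of the inline string.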

#### Start run process

Once param.json and machine.json have been prepared, we can run DP-GEN easily by:
```sh
$ dpgen run param.json machine.json
```

#### Results analysis

It helps to know the output files of the run process and the information they contain. After the above command has been executed successfully, a folder (iter.000000) and two files (dpgen.log and record.dpgen) are generated automatically in dpgen_example/run.
```sh
$ ls 
dpgen.log  INCAR_methane  iter.000000  machine.json  param.json  record.dpgen 
```

- `iter.000000` contains the main results that DP-GEN generates in the first iteration.
- `record.dpgen` records the current stage of the run process. 
- `dpgen.log` includes time and iteration information.

When the first iteration is completed, the folder structure of `iter.000000` is like this:

```sh
$ tree iter.000000/ -L 1
./iter.000000/
├── 00.train
├── 01.model_devi
└── 02.fp
```
- 00.train: several (default 4) DP models are trained on existing data.
- 01.model_devi: new configurations are generated using the DP models obtained in 00.train.
- 02.fp: first-principles calculations are performed on the selected configurations and the results are converted into training data.

**00.train**

First, we check the folder `iter.000000/00.train`.
```sh
$ tree iter.000000/00.train -L 1
./iter.000000/00.train/
├── 000
├── 001
├── 002
├── 003
├── data.init -> /root/dpgen_example
├── data.iters
├── graph.000.pb -> 000/frozen_model.pb
├── graph.001.pb -> 001/frozen_model.pb
├── graph.002.pb -> 002/frozen_model.pb
└── graph.003.pb -> 003/frozen_model.pb
```

- Folder 00x contains the input and output files of the DeePMD-kit, in which a model is trained.
- graph.00x.pb, linked to 00x/frozen_model.pb, is the model DeePMD-kit generates. The only difference between these models is the random seed for neural network initialization.

We may randomly select one of them, like 000.
```sh
$ tree iter.000000/00.train/000 -L 1
./iter.000000/00.train/000
├── checkpoint
├── frozen_model.pb
├── input.json
├── lcurve.out
├── model.ckpt-400000.data-00000-of-00001
├── model.ckpt-400000.index
├── model.ckpt-400000.meta
├── model.ckpt.data-00000-of-00001
├── model.ckpt.index
├── model.ckpt.meta
└── train.log
```

- `input.json` contains the DeePMD-kit settings for the current task.
- `checkpoint` is used for restarting training.
- `model.ckpt*` are model related files.
- `frozen_model.pb` is the frozen model. 
- `lcurve.out` records the training accuracy of energies and forces.
- `train.log` includes version, data, hardware information, time, etc.

**01.model_devi**

Then, we check the folder `iter.000000/01.model_devi`.
```sh
$ tree iter.000000/01.model_devi -L 1
./iter.000000/01.model_devi/
├── confs
├── graph.000.pb -> /root/dpgen_example/run/iter.000000/00.train/graph.000.pb
├── graph.001.pb -> /root/dpgen_example/run/iter.000000/00.train/graph.001.pb
├── graph.002.pb -> /root/dpgen_example/run/iter.000000/00.train/graph.002.pb
├── graph.003.pb -> /root/dpgen_example/run/iter.000000/00.train/graph.003.pb
├── task.000.000000
├── task.000.000001
├── task.000.000002
├── task.000.000003
├── task.000.000004
├── task.000.000005
├── task.000.000006
├── task.000.000007
├── task.000.000008
└── task.000.000009
```

- Folder confs contains the initial configurations for LAMMPS MD converted from POSCAR you set in "sys_configs" of param.json. 

- Folder task.000.00000x contains the input and output files of the LAMMPS. We may randomly select one of them, like task.000.000001. 
```sh
$ tree iter.000000/01.model_devi/task.000.000001
./iter.000000/01.model_devi/task.000.000001
├── conf.lmp -> ../confs/000.0001.lmp
├── input.lammps
├── log.lammps
├── model_devi.log
└── model_devi.out
```

- `conf.lmp`, linked to `000.0001.lmp` in folder confs, serves as the initial configuration of MD. 
- `input.lammps` is the input file for LAMMPS.
- `model_devi.out` records the model deviation of the quantities of interest in MD: the virial (columns `*_devi_v`) and the force (columns `*_devi_f`). It serves as the criterion for selecting structures for first-principles calculations.

Running `head` on `model_devi.out` shows:
```
$ head -n 5 ./iter.000000/01.model_devi/task.000.000001/model_devi.out
 #  step max_devi_v     min_devi_v     avg_devi_v     max_devi_f     min_devi_f     avg_devi_f 
 0     1.438427e-04   5.689551e-05   1.083383e-04   8.835352e-04   5.806717e-04   7.098761e-04
10     3.887636e-03   9.377374e-04   2.577191e-03   2.880724e-02   1.329747e-02   1.895448e-02
20     7.723417e-04   2.276932e-04   4.340100e-04   3.151907e-03   2.430687e-03   2.727186e-03
30     4.962806e-03   4.943687e-04   2.925484e-03   5.866077e-02   1.719157e-02   3.011857e-02
```
Now we'll concentrate on `max_devi_f`.
Recall that we've set `"trj_freq"` to 10, so a structure is saved every 10 steps. Whether a structure is selected depends on its `"max_devi_f"`. If it falls between `"model_devi_f_trust_lo"` (0.05) and `"model_devi_f_trust_hi"` (0.15), DP-GEN will treat the structure as a candidate. Among the frames shown here, only the structure at step 30 will be selected, whose `"max_devi_f"` is 5.866077e-02.
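
The selection rule can be written down as a small classifier over the lines of `model_devi.out`. This is a sketch under two assumptions: `max_devi_f` is the fifth column, as in the header shown above, and the lower bound is inclusive while the upper bound is exclusive. The function `classify_frames` is hypothetical and only mirrors the rule described in the text.

```python
def classify_frames(lines, trust_lo, trust_hi):
    """Sort MD steps into accurate, candidate, and failed by max_devi_f."""
    result = {"accurate": [], "candidate": [], "failed": []}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header
        step = int(fields[0])
        max_devi_f = float(fields[4])  # fifth column of model_devi.out
        if max_devi_f < trust_lo:
            result["accurate"].append(step)
        elif max_devi_f < trust_hi:
            result["candidate"].append(step)
        else:
            result["failed"].append(step)
    return result


# The five lines printed by `head -n 5` above.
sample = """\
 #  step max_devi_v     min_devi_v     avg_devi_v     max_devi_f     min_devi_f     avg_devi_f
 0     1.438427e-04   5.689551e-05   1.083383e-04   8.835352e-04   5.806717e-04   7.098761e-04
10     3.887636e-03   9.377374e-04   2.577191e-03   2.880724e-02   1.329747e-02   1.895448e-02
20     7.723417e-04   2.276932e-04   4.340100e-04   3.151907e-03   2.430687e-03   2.727186e-03
30     4.962806e-03   4.943687e-04   2.925484e-03   5.866077e-02   1.719157e-02   3.011857e-02
""".splitlines()
classified = classify_frames(sample, trust_lo=0.05, trust_hi=0.15)
print(classified)
```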

**02.fp**

Finally, we check the folder `iter.000000/02.fp`.
```
$ tree iter.000000/02.fp -L 1
./iter.000000/02.fp
├── data.000
├── task.000.000000
├── task.000.000001
├── task.000.000002
├── task.000.000003
├── task.000.000004
├── task.000.000005
├── task.000.000006
├── task.000.000007
├── task.000.000008
├── task.000.000009
├── task.000.000010
├── task.000.000011
├── candidate.shuffled.000.out
├── POTCAR.000
├── rest_accurate.shuffled.000.out
└── rest_failed.shuffled.000.out
```

- `POTCAR.000` is the input file for VASP generated according to `"fp_pp_files"` of param.json.
- `candidate.shuffled.000.out` records which structures are selected from the previous step, 01.model_devi. There are often far more candidates than the maximum you expect to calculate at one time. In that case, DP-GEN will randomly choose up to `"fp_task_max"` structures and create the task.* folders.
- `rest_accurate.shuffled.000.out` records the other structures for which our model is accurate ("max_devi_f" is less than `"model_devi_f_trust_lo"`, so no further calculation is needed).
- `rest_failed.shuffled.000.out` records the other structures for which our model is too inaccurate ("max_devi_f" is larger than `"model_devi_f_trust_hi"`, so there may be some error).
- `data.000`: After the first-principles calculations, DP-GEN collects the results and converts them into the format DeePMD-kit needs. In the next iteration's `00.train`, these data are used for training together with the initial data.

Filtering `candidate.shuffled.000.out` for task.000.000001 with `grep` shows:

```sh
$ cat ./iter.000000/02.fp/candidate.shuffled.000.out | grep task.000.000001
iter.000000/01.model_devi/task.000.000001 190
iter.000000/01.model_devi/task.000.000001 130
iter.000000/01.model_devi/task.000.000001 120
iter.000000/01.model_devi/task.000.000001 150
iter.000000/01.model_devi/task.000.000001 280
iter.000000/01.model_devi/task.000.000001 110
iter.000000/01.model_devi/task.000.000001 30
iter.000000/01.model_devi/task.000.000001 230
```

The entry `task.000.000001 30` is exactly the structure we found in `01.model_devi` that satisfies the criterion for first-principles calculation.
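
The same lookup can be done for all tasks at once by grouping the entries of the candidate file. The sketch below assumes that every line consists of a task directory followed by an MD step, as in the output above; `candidates_by_task` is a hypothetical helper.

```python
from collections import defaultdict


def candidates_by_task(lines):
    """Map each task directory to the sorted list of its candidate MD steps."""
    grouped = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore anything that is not "<task dir> <step>"
        task_dir, step = parts
        grouped[task_dir].append(int(step))
    return {task: sorted(steps) for task, steps in grouped.items()}


# The eight lines printed by the grep command above.
task = "iter.000000/01.model_devi/task.000.000001"
sample_lines = [f"{task} {step}" for step in (190, 130, 120, 150, 280, 110, 30, 230)]
by_task = candidates_by_task(sample_lines)
print(by_task[task])
```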
After the first iteration, we check the contents of dpgen.log and record.dpgen.

```sh
$ cat dpgen.log
2022-03-07 22:12:45,447 - INFO : start running
2022-03-07 22:12:45,447 - INFO : =============================iter.000000==============================
2022-03-07 22:12:45,447 - INFO : -------------------------iter.000000 task 00--------------------------
2022-03-07 22:12:45,451 - INFO : -------------------------iter.000000 task 01--------------------------
2022-03-08 00:53:00,179 - INFO : -------------------------iter.000000 task 02--------------------------
2022-03-08 00:53:00,179 - INFO : -------------------------iter.000000 task 03--------------------------
2022-03-08 00:53:00,187 - INFO : -------------------------iter.000000 task 04--------------------------
2022-03-08 00:57:04,113 - INFO : -------------------------iter.000000 task 05--------------------------
2022-03-08 00:57:04,113 - INFO : -------------------------iter.000000 task 06--------------------------
2022-03-08 00:57:04,123 - INFO : system 000 candidate :     12 in    310   3.87 %
2022-03-08 00:57:04,125 - INFO : system 000 failed    :      0 in    310   0.00 %
2022-03-08 00:57:04,125 - INFO : system 000 accurate  :    298 in    310  96.13 %
2022-03-08 00:57:04,126 - INFO : system 000 accurate_ratio:   0.9613    thresholds: 1.0000 and 1.0000   eff. task min and max   -1   20   number of fp tasks:     12
2022-03-08 00:57:04,154 - INFO : -------------------------iter.000000 task 07--------------------------
2022-03-08 01:02:07,925 - INFO : -------------------------iter.000000 task 08--------------------------
2022-03-08 01:02:07,926 - INFO : failed tasks:      0 in     12    0.00 % 
2022-03-08 01:02:07,949 - INFO : failed frame:      0 in     12    0.00 % 
```

The log shows that 310 structures were generated in `iter.000000`, of which 12 were collected for first-principles calculations. The contents of `record.dpgen` are:
```sh
$ cat record.dpgen
0 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
```

Each line contains two numbers: the first is the index of the iteration, and the second, ranging from 0 to 8, records which stage of that iteration is currently running.

| Index of iterations  | Stage in each iteration     | Process          |
|----------------------|-----------------------------|------------------|
| 0                    | 0                           | make_train       |
| 0                    | 1                           | run_train        |
| 0                    | 2                           | post_train       |
| 0                    | 3                           | make_model_devi  |
| 0                    | 4                           | run_model_devi   |
| 0                    | 5                           | post_model_devi  |
| 0                    | 6                           | make_fp          |
| 0                    | 7                           | run_fp           |
| 0                    | 8                           | post_fp          |

If DP-GEN stops for some reason, it will automatically resume the main process from the state recorded in `record.dpgen`. You may also edit this file manually for your purpose, for example to remove the last iterations and restart from a checkpoint.

After all iterations, we check the structure of `dpgen_example/run`:
```sh
$ tree ./ -L 2
./
├── dpgen.log
├── INCAR_methane
├── iter.000000
│   ├── 00.train
│   ├── 01.model_devi
│   └── 02.fp
├── iter.000001
│   ├── 00.train
│   ├── 01.model_devi
│   └── 02.fp
├── iter.000002
│   └── 00.train
├── machine.json
├── param.json
└── record.dpgen
```

We also check the contents of `dpgen.log`:
```sh
$ cat dpgen.log | grep system
2022-03-08 00:57:04,123 - INFO : system 000 candidate :     12 in    310   3.87 %
2022-03-08 00:57:04,125 - INFO : system 000 failed    :      0 in    310   0.00 %
2022-03-08 00:57:04,125 - INFO : system 000 accurate  :    298 in    310  96.13 %
2022-03-08 00:57:04,126 - INFO : system 000 accurate_ratio:   0.9613    thresholds: 1.0000 and 1.0000   eff. task min and max   -1   20   number of fp tasks:     12
2022-03-08 03:47:00,718 - INFO : system 001 candidate :      0 in   3010   0.00 %
2022-03-08 03:47:00,718 - INFO : system 001 failed    :      0 in   3010   0.00 %
2022-03-08 03:47:00,719 - INFO : system 001 accurate  :   3010 in   3010 100.00 %
2022-03-08 03:47:00,722 - INFO : system 001 accurate_ratio:   1.0000    thresholds: 1.0000 and 1.0000   eff. task min and max   -1    0   number of fp tasks:      0
```
The log shows that 3010 structures were generated in `iter.000001`, none of which were collected for first-principles calculations. Therefore, the final models are not updated in `iter.000002/00.train`.
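
These statistics can also be extracted programmatically. The sketch below assumes the log line layout shown above; the regular expression and the function name `summarize` are illustrative and not part of DP-GEN.

```python
import re

# Matches the candidate/failed/accurate lines but not the accurate_ratio lines.
LINE = re.compile(r"system (\d+) (candidate|failed|accurate)\s*:\s*(\d+) in\s+(\d+)")


def summarize(log_text):
    """Return {system: {category: (count, total)}} from dpgen.log text."""
    stats = {}
    for match in LINE.finditer(log_text):
        system, category, count, total = match.groups()
        stats.setdefault(system, {})[category] = (int(count), int(total))
    return stats


# Lines copied from the grep output above.
log_text = """\
2022-03-08 00:57:04,123 - INFO : system 000 candidate :     12 in    310   3.87 %
2022-03-08 00:57:04,125 - INFO : system 000 failed    :      0 in    310   0.00 %
2022-03-08 00:57:04,125 - INFO : system 000 accurate  :    298 in    310  96.13 %
2022-03-08 00:57:04,126 - INFO : system 000 accurate_ratio:   0.9613    thresholds: 1.0000 and 1.0000
2022-03-08 03:47:00,718 - INFO : system 001 candidate :      0 in   3010   0.00 %
2022-03-08 03:47:00,718 - INFO : system 001 failed    :      0 in   3010   0.00 %
2022-03-08 03:47:00,719 - INFO : system 001 accurate  :   3010 in   3010 100.00 %
"""

stats = summarize(log_text)
for system, counts in stats.items():
    candidates, total = counts["candidate"]
    print(system, f"{100 * candidates / total:.2f} % candidates")
```

A system with zero candidates, as in the second group of lines, is the situation in which no further first-principles calculations are collected.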

## Simplify
When you have a dataset containing lots of repeated data, this step will help you simplify your dataset. Since `dpgen simplify` is performed on a large dataset, only a simple demo will be provided in this part.

To learn more about simplify, you can refer to:

- [DP-GEN's documentation](https://docs.deepmodeling.com/projects/dpgen/en/latest/)
- [Documentation of dpgen simplify parameters](https://docs.deepmodeling.com/projects/dpgen/en/latest/simplify/simplify-jdata.html)
- [Documentation of dpgen simplify machine parameters](https://docs.deepmodeling.com/projects/dpgen/en/latest/simplify/simplify-mdata.html)

This demo can be downloaded from dpgen/examples/simplify-MAPbI3-scan-lebesgue. You can find more examples in [dpgen.examples](https://github.com/deepmodeling/dpgen/tree/master/examples).

In the example, `data` contains a simplified dataset based on the MAPbI3-scan case. Since it has been greatly reduced, treat it only as a demo.
`simplify_example` is the work path, which contains `INCAR` and templates for `simplify.json` and `machine.json`. You can use the command `nohup dpgen simplify simplify.json machine.json 1>log 2>err &` here to test if `dpgen simplify` can run normally. 

Kind reminders:
1. `machine.json` is supported by `dpdispatcher 0.4.15`; please check https://docs.deepmodeling.com/projects/dpdispatcher/en/latest/ and update the parameters according to your `dpdispatcher` version.
2. `POTCAR` should be prepared by the user.
3. Please check the paths and file names and make sure they are correct.

Simplify can also be used in transfer learning; see [CaseStudies: Transfer-learning](../../../CaseStudies/Transfer-learning/index.html).

## Auto-test

The `auto-test` function is intended only for alloy materials. To verify the accuracy of a DP model, users can calculate a simple set of properties and compare the results with those of DFT or a traditional empirical force field. DP-GEN's autotest module supports the calculation of a variety of properties, such as:

- 00.equi: (default task) the equilibrium state;

- 01.eos: the equation of state;

- 02.elastic: the elastic properties, such as Young's modulus;

- 03.vacancy: the vacancy formation energy;

- 04.interstitial: the interstitial formation energy;

- 05.surf: the surface formation energy.

In this part, the Al-Mg-Cu DP potential is used to illustrate how to automatically test the DP potential of alloy materials. Each `auto-test` task includes three stages:
- `make` prepares all required calculation files and input scripts automatically;
- `run` submits the calculation tasks to remote computing platforms and collects the results automatically when the tasks are completed;
- `post` returns the calculation results to the local root directory automatically.

### Structure relaxation

#### step1-`make`
Prepare the following files in a separate folder.
```sh
├── machine.json
├── relaxation.json
├── confs
│   ├── mp-3034
```
**IMPORTANT!** The ID number, mp-3034, corresponds to the Materials Project ID for Al-Mg-Cu.

To take advantage of `pymatgen` combined with the Materials Project to automatically generate files for calculation tasks from the mp-ID, you need to add your Materials Project API key to `.bashrc`.

You can do that easily by running this command.
```bash
vim ~/.bashrc
# add the following line to this file:
# export MAPI_KEY="your-api-key-for-material-projects"
```
If you do not know how to obtain an API key for the Materials Project, please refer to this [link](https://materialsproject.org/api#:~:text=API%20Key,-Your%20API%20Key&text=To%20make%20any%20request%20to,anyone%20you%20do%20not%20trust.).
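
Before running `dpgen autotest`, it can be worth confirming that the variable is actually visible to new processes. The following is a minimal sketch; the helper name `has_api_key` is hypothetical, and the only name taken from this tutorial is the `MAPI_KEY` variable.

```python
import os


def has_api_key(environ=os.environ, name="MAPI_KEY"):
    """Return True if the variable is set to a non-empty value."""
    return bool(environ.get(name, "").strip())


if not has_api_key():
    print("MAPI_KEY is not set; add the export line to ~/.bashrc and reload the shell.")
```
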

- `machine.json` is the same as the one used in `init` and `run`. For more information about it, please check this [link](https://bohrium-doc.dp.tech/#/docs/DP-GEN?id=步骤3：准备计算文件).
- `relaxation.json` is shown below. JSON does not allow comments, so two notes are given here instead: the required files and scripts are generated automatically in the `confs/mp-3034` folder by `dpgen autotest make relaxation.json`, and `type_map` must be modified to match the element types if you calculate other materials.

```json
{
    "structures":         ["confs/mp-3034"],
    "interaction": {
            "type":        "deepmd",
            "model":       "graph.pb",
            "in_lammps":   "lammps_input/in.lammps",
            "type_map":   {"Mg":0,"Al": 1,"Cu":2}
    },
    "relaxation": {
            "cal_setting":{"etol": 1e-12,
                           "ftol": 1e-6,
                           "maxiter": 5000,
                           "maximal": 500000,
                           "relax_shape":     true,
                           "relax_vol":       true}
    }
}
```

Run this command,
```bash
dpgen autotest make relaxation.json 
```
and then corresponding files and scripts used for calculation will be generated automatically.

#### step2-`run`
```bash
nohup dpgen autotest run relaxation.json machine.json &
```
After running this command, structures will be relaxed.

#### step3-`post`
```bash
dpgen autotest post relaxation.json 
```
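
The three stages can be chained from a script. The sketch below only builds and prints the command lines used above; it does not call DP-GEN, and the function name `autotest_commands` is hypothetical.

```python
import shlex


def autotest_commands(param_file, machine_file="machine.json"):
    """Build the make, run and post command lines in the order used above."""
    return [
        ["dpgen", "autotest", "make", param_file],
        ["dpgen", "autotest", "run", param_file, machine_file],
        ["dpgen", "autotest", "post", param_file],
    ]


commands = autotest_commands("relaxation.json")
for command in commands:
    # Printed rather than executed; on a machine where dpgen is installed,
    # each list could be passed to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Calling `autotest_commands("property.json")` gives the corresponding commands for the property calculation stage.
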
### Property calculation
#### step1-`make`
The parameters used for property calculations are in property.json. 

```json
{
    "structures":       ["confs/mp-3034"],
    "interaction": {
        "type":          "deepmd",
        "model":         "graph.pb",
        "deepmd_version":"2.1.0",
        "type_map":     {"Mg":0,"Al": 1,"Cu":2}
    },
    "properties": [
        {
         "type":         "eos",
         "vol_start":    0.9,
         "vol_end":      1.1,
         "vol_step":     0.01
        },
        {
         "type":         "elastic",
         "norm_deform":  2e-2,
         "shear_deform": 5e-2
        },
        {
         "type":             "vacancy",
         "supercell":        [3, 3, 3],
         "start_confs_path": "confs"
        },
        {
         "type":         "interstitial",
         "supercell":   [3, 3, 3],
         "insert_ele":  ["Mg","Al","Cu"],
         "conf_filters":{"min_dist": 1.5},
         "cal_setting": {"input_prop": "lammps_input/lammps_high"}
        },
        {
         "type":           "surface",
         "min_slab_size":  10,
         "min_vacuum_size":11,
         "max_miller":     2,
         "cal_type":       "static"
        }
        ]
}
```
Run this command
```bash
dpgen autotest make property.json
```
#### step2-`run`
Run this command
```bash
nohup dpgen autotest run property.json machine.json &
```
#### step3-`post`
```bash
dpgen autotest post property.json
```
In the `confs/mp-3034` folder, run `tree . -L 1` to check the results.

```
(base) ➜ mp-3034 tree . -L 1
.
├── dpdispatcher.log
├── dpgen.log
├── elastic_00
├── eos_00
├── eos_00.bk000
├── eos_00.bk001
├── eos_00.bk002
├── eos_00.bk003
├── eos_00.bk004
├── eos_00.bk005
├── graph_new.pb
├── interstitial_00
├── POSCAR
├── relaxation
├── surface_00
└── vacancy_00
```

- 01.eos: the equation of state;
```bash
(base) ➜ mp-3034 tree eos_00 -L 1
eos_00
├── 99c07439f6f14399e7785dc783ca5a9047e768a8_flag_if_job_task_fail
├── 99c07439f6f14399e7785dc783ca5a9047e768a8_job_tag_finished
├── 99c07439f6f14399e7785dc783ca5a9047e768a8.sub
├── backup
├── graph.pb -> ../../../graph.pb
├── result.json
├── result.out
├── run_1660558797.sh
├── task.000000
├── task.000001
├── task.000002
├── task.000003
├── task.000004
├── task.000005
├── task.000006
├── task.000007
├── task.000008
├── task.000009
├── task.000010
├── task.000011
├── task.000012
├── task.000013
├── task.000014
├── task.000015
├── task.000016
├── task.000017
├── task.000018
├── task.000019
└── tmp_log
```
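
The 20 task directories are consistent with the `eos` settings in `property.json` (`vol_start` 0.9, `vol_end` 1.1, `vol_step` 0.01). The sketch below reproduces that count under the assumption that the end point is excluded; this is inferred from the listing above rather than taken from DP-GEN's source, and `eos_scales` is a hypothetical helper.

```python
def eos_scales(vol_start, vol_end, vol_step):
    """Volume scale factors from vol_start up to, but not including, vol_end."""
    # round() absorbs the floating-point error in (1.1 - 0.9) / 0.01.
    count = round((vol_end - vol_start) / vol_step)
    return [round(vol_start + i * vol_step, 10) for i in range(count)]


scales = eos_scales(0.9, 1.1, 0.01)
print(len(scales), scales[0], scales[-1])
```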

The `EOS` calculation results are shown in the `eos_00/result.out` file:
```bash
(base) ➜ eos_00 cat result.out 
conf_dir: /root/1/confs/mp-3034/eos_00
 VpA(A^3)  EpA(eV)
 15.075   -3.2727 
 15.242   -3.2838 
 15.410   -3.2935 
 15.577   -3.3019 
 15.745   -3.3090 
 15.912   -3.3148 
 16.080   -3.3195 
 16.247   -3.3230 
 16.415   -3.3254 
 16.582   -3.3268 
 16.750   -3.3273 
 16.917   -3.3268 
 17.085   -3.3256 
 17.252   -3.3236 
 17.420   -3.3208 
 17.587   -3.3174 
 17.755   -3.3134 
 17.922   -3.3087 
 18.090   -3.3034 
 18.257   -3.2977 
```
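
To locate the equilibrium volume from this table, the two columns can be parsed and the lowest-energy point selected. This is a minimal sketch that reports only the grid minimum, not a fitted equation of state; `parse_eos` is a hypothetical helper.

```python
def parse_eos(text):
    """Return (volume, energy) pairs from the two numeric columns of result.out."""
    points = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # header lines such as 'VpA(A^3)  EpA(eV)'
    return points


# Contents of eos_00/result.out as printed above.
result_out = """\
conf_dir: /root/1/confs/mp-3034/eos_00
 VpA(A^3)  EpA(eV)
 15.075   -3.2727
 15.242   -3.2838
 15.410   -3.2935
 15.577   -3.3019
 15.745   -3.3090
 15.912   -3.3148
 16.080   -3.3195
 16.247   -3.3230
 16.415   -3.3254
 16.582   -3.3268
 16.750   -3.3273
 16.917   -3.3268
 17.085   -3.3256
 17.252   -3.3236
 17.420   -3.3208
 17.587   -3.3174
 17.755   -3.3134
 17.922   -3.3087
 18.090   -3.3034
 18.257   -3.2977
"""

points = parse_eos(result_out)
v0, e0 = min(points, key=lambda point: point[1])
print(v0, e0)
# Ratio of the first and last volumes to the lowest-energy volume.
print(round(points[0][0] / v0, 2), round(points[-1][0] / v0, 2))
```
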
- 02.elastic: the elastic properties, such as Young's modulus;
The `elastic` calculation results are shown in the `elastic_00/result.out` file:
```bash
(base) ➜ elastic_00 cat result.out 
/root/1/confs/mp-3034/elastic_00
 124.32   55.52   60.56    0.00    0.00    1.09 
  55.40  125.82   75.02    0.00    0.00   -0.17 
  60.41   75.04  132.07    0.00    0.00    7.51 
   0.00    0.00    0.00   53.17    8.44    0.00 
   0.00    0.00    0.00    8.34   37.17    0.00 
   1.06   -1.35    7.51    0.00    0.00   34.43 
# Bulk   Modulus BV = 84.91 GPa
# Shear  Modulus GV = 37.69 GPa
# Youngs Modulus EV = 98.51 GPa
# Poission Ratio uV = 0.31
```
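
The moduli at the bottom of this output can be cross-checked from the 6x6 matrix with the textbook Voigt-average formulas. The sketch below is not taken from DP-GEN's source, which may differ in detail (for example in how the matrix is symmetrized), so only agreement to within rounding should be expected; `voigt_moduli` is a hypothetical helper.

```python
def voigt_moduli(c):
    """Voigt averages from a 6x6 stiffness matrix given in GPa."""
    # Symmetrize first, because the printed matrix is only approximately symmetric.
    s = [[(c[i][j] + c[j][i]) / 2 for j in range(6)] for i in range(6)]
    diag = s[0][0] + s[1][1] + s[2][2]
    off = s[0][1] + s[0][2] + s[1][2]
    shear = s[3][3] + s[4][4] + s[5][5]
    bulk_v = (diag + 2 * off) / 9
    shear_v = (diag - off + 3 * shear) / 15
    youngs = 9 * bulk_v * shear_v / (3 * bulk_v + shear_v)
    poisson = (3 * bulk_v - 2 * shear_v) / (2 * (3 * bulk_v + shear_v))
    return bulk_v, shear_v, youngs, poisson


# Matrix copied from elastic_00/result.out above.
matrix = [
    [124.32, 55.52, 60.56, 0.00, 0.00, 1.09],
    [55.40, 125.82, 75.02, 0.00, 0.00, -0.17],
    [60.41, 75.04, 132.07, 0.00, 0.00, 7.51],
    [0.00, 0.00, 0.00, 53.17, 8.44, 0.00],
    [0.00, 0.00, 0.00, 8.34, 37.17, 0.00],
    [1.06, -1.35, 7.51, 0.00, 0.00, 34.43],
]

bulk_v, shear_v, youngs, poisson = voigt_moduli(matrix)
print(f"BV = {bulk_v:.2f} GPa, GV = {shear_v:.2f} GPa")
print(f"EV = {youngs:.2f} GPa, uV = {poisson:.2f}")
```

With the matrix above, the computed values agree with the reported `BV`, `GV`, and `EV` to within a few hundredths of a GPa.
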
- 03.vacancy: the vacancy formation energy;
The `vacancy` calculation results are shown in the `vacancy_00/result.out` file:
```bash
(base) ➜ vacancy_00 cat result.out 
/root/1/confs/mp-3034/vacancy_00
Structure:      Vac_E(eV)  E(eV) equi_E(eV)
[3, 3, 3]-task.000000: -10.489  -715.867 -705.378 
[3, 3, 3]-task.000001:   4.791  -713.896 -718.687 
[3, 3, 3]-task.000002:   4.623  -714.064 -718.687 
```
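
In the rows shown, the `Vac_E` column equals `E` minus `equi_E`. The sketch below checks this relation for the printed rows; it is an observation about this output rather than a statement of how DP-GEN computes the value internally, and `parse_vacancy` is a hypothetical helper.

```python
def parse_vacancy(text):
    """Return {label: (vac_e, e, equi_e)} for rows that end in three numbers."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.rpartition(":")
        fields = rest.split()
        if not sep or len(fields) != 3:
            continue  # lines without a label, such as the directory path
        try:
            rows[label.strip()] = tuple(float(x) for x in fields)
        except ValueError:
            continue  # the header row
    return rows


# Contents of vacancy_00/result.out as printed above.
result_out = """\
/root/1/confs/mp-3034/vacancy_00
Structure:      Vac_E(eV)  E(eV) equi_E(eV)
[3, 3, 3]-task.000000: -10.489  -715.867 -705.378
[3, 3, 3]-task.000001:   4.791  -713.896 -718.687
[3, 3, 3]-task.000002:   4.623  -714.064 -718.687
"""

rows = parse_vacancy(result_out)
for label, (vac_e, e, equi_e) in rows.items():
    print(label, round(e - equi_e, 3), vac_e)
```
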
- 04.interstitial: the interstitial formation energy;
The `interstitial` calculation results are written to the `interstitial_00/result.out` file.
- 05.surf: the surface formation energy.
The `surface` calculation results are shown in the `surface_00/result.out` file:
```bash
(base) ➜ surface_00 cat result.out  
/root/1/confs/mp-3034/surface_00
Miller_Indices:         Surf_E(J/m^2) EpA(eV) equi_EpA(eV)
[1, 1, 1]-task.000000:          1.230      -3.102   -3.327
[1, 1, 1]-task.000001:          1.148      -3.117   -3.327
[2, 2, 1]-task.000002:          1.160      -3.120   -3.327
[2, 2, 1]-task.000003:          1.118      -3.127   -3.327
[1, 1, 0]-task.000004:          1.066      -3.138   -3.327
[2, 1, 2]-task.000005:          1.223      -3.118   -3.327
[2, 1, 2]-task.000006:          1.146      -3.131   -3.327
[2, 1, 1]-task.000007:          1.204      -3.081   -3.327
[2, 1, 1]-task.000008:          1.152      -3.092   -3.327
[2, 1, 1]-task.000009:          1.144      -3.093   -3.327
[2, 1, 1]-task.000010:          1.147      -3.093   -3.327
[2, 1, 0]-task.000011:          1.114      -3.103   -3.327
[2, 1, 0]-task.000012:          1.165      -3.093   -3.327
[2, 1, 0]-task.000013:          1.137      -3.098   -3.327
[2, 1, 0]-task.000014:          1.129      -3.100   -3.327
[1, 0, 1]-task.000015:          1.262      -3.124   -3.327
[1, 0, 1]-task.000016:          1.135      -3.144   -3.327
[1, 0, 1]-task.000017:          1.113      -3.148   -3.327
[1, 0, 1]-task.000018:          1.119      -3.147   -3.327
[1, 0, 1]-task.000019:          1.193      -3.135   -3.327
[2, 0, 1]-task.000020:          1.201      -3.089   -3.327
[2, 0, 1]-task.000021:          1.189      -3.092   -3.327
[2, 0, 1]-task.000022:          1.175      -3.094   -3.327
[1, 0, 0]-task.000023:          1.180      -3.100   -3.327
[1, 0, 0]-task.000024:          1.139      -3.108   -3.327
[1, 0, 0]-task.000025:          1.278      -3.081   -3.327
[1, 0, 0]-task.000026:          1.195      -3.097   -3.327
[2, -1, 2]-task.000027:         1.201      -3.121   -3.327
[2, -1, 2]-task.000028:         1.121      -3.135   -3.327
[2, -1, 2]-task.000029:         1.048      -3.147   -3.327
[2, -1, 2]-task.000030:         1.220      -3.118   -3.327
[2, -1, 1]-task.000031:         1.047      -3.169   -3.327
[2, -1, 1]-task.000032:         1.308      -3.130   -3.327
[2, -1, 1]-task.000033:         1.042      -3.170   -3.327
[2, -1, 0]-task.000034:         1.212      -3.154   -3.327
[2, -1, 0]-task.000035:         1.137      -3.165   -3.327
[2, -1, 0]-task.000036:         0.943      -3.192   -3.327
[2, -1, 0]-task.000037:         1.278      -3.144   -3.327
[1, -1, 1]-task.000038:         1.180      -3.118   -3.327
[1, -1, 1]-task.000039:         1.252      -3.105   -3.327
[1, -1, 1]-task.000040:         1.111      -3.130   -3.327
[1, -1, 1]-task.000041:         1.032      -3.144   -3.327
[1, -1, 1]-task.000042:         1.177      -3.118   -3.327
[2, -2, 1]-task.000043:         1.130      -3.150   -3.327
[2, -2, 1]-task.000044:         1.221      -3.135   -3.327
[2, -2, 1]-task.000045:         1.001      -3.170   -3.327
[1, -1, 0]-task.000046:         0.911      -3.191   -3.327
[1, -1, 0]-task.000047:         1.062      -3.168   -3.327
[1, -1, 0]-task.000048:         1.435      -3.112   -3.327
[1, -1, 0]-task.000049:         1.233      -3.143   -3.327
[1, 1, 2]-task.000050:          1.296      -3.066   -3.327
[1, 1, 2]-task.000051:          1.146      -3.097   -3.327
[1, 0, 2]-task.000052:          1.192      -3.085   -3.327
[1, 0, 2]-task.000053:          1.363      -3.050   -3.327
[1, 0, 2]-task.000054:          0.962      -3.132   -3.327
[1, -1, 2]-task.000055:         1.288      -3.093   -3.327
[1, -1, 2]-task.000056:         1.238      -3.102   -3.327
[1, -1, 2]-task.000057:         1.129      -3.122   -3.327
[1, -1, 2]-task.000058:         1.170      -3.115   -3.327
[0, 0, 1]-task.000059:          1.205      -3.155   -3.327
[0, 0, 1]-task.000060:          1.188      -3.158   -3.327
```
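
With this many rows, it helps to reduce the table to the lowest `Surf_E` per Miller index. The following is a minimal sketch run on a few rows copied from the output above; `lowest_per_miller` is a hypothetical helper, and the label format is assumed to be exactly as printed.

```python
def lowest_per_miller(text):
    """Return {miller index: lowest Surf_E in J/m^2} from result.out text."""
    best = {}
    for line in text.splitlines():
        label, sep, rest = line.rpartition(":")
        fields = rest.split()
        if not sep or len(fields) != 3:
            continue  # lines without a label, such as the directory path
        try:
            surf_e = float(fields[0])
        except ValueError:
            continue  # the header row
        miller = label.split("-task.")[0].strip()
        if miller not in best or surf_e < best[miller]:
            best[miller] = surf_e
    return best


# A subset of rows copied from surface_00/result.out above.
sample = """\
Miller_Indices:         Surf_E(J/m^2) EpA(eV) equi_EpA(eV)
[1, 1, 1]-task.000000:          1.230      -3.102   -3.327
[1, 1, 1]-task.000001:          1.148      -3.117   -3.327
[1, -1, 0]-task.000046:         0.911      -3.191   -3.327
[1, -1, 0]-task.000047:         1.062      -3.168   -3.327
[1, -1, 0]-task.000048:         1.435      -3.112   -3.327
[1, -1, 0]-task.000049:         1.233      -3.143   -3.327
[0, 0, 1]-task.000059:          1.205      -3.155   -3.327
[0, 0, 1]-task.000060:          1.188      -3.158   -3.327
"""

best = lowest_per_miller(sample)
for miller, surf_e in sorted(best.items(), key=lambda item: item[1]):
    print(miller, surf_e)
```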


## Summary
You have now learned the basic usage of DP-GEN. For further information, please refer to the following links.

1. GitHub website: https://github.com/deepmodeling/dpgen
2. Papers: https://blogs.deepmodeling.com/papers/dpgen/

