
DeePMD-kit Installation: Legacy Version

This section was written in 2020 and applies to DeePMD-kit 1.x and TensorFlow 1.14. It may not apply to current, newer versions; please refer to the Installation Best Practices and Quick Installation tutorials instead.

Background: using the Zeus cluster as an example, install DeePMD-kit and LAMMPS with the full set of interfaces on the server.

References:

DeePMD-kit

TensorFlow

Initial Environment

The following procedure uses the Zeus cluster as an example, with CentOS 7 as the operating system and module for environment management.

  • Installed via yum:
      • CMake 3.7
      • GCC 4.8.5
      • Git 1.8.2
  • Loaded via module:
      • CUDA 10.0
      • Miniconda3 (Python 3.7)
      • GCC 4.9.4
      • Intel MPI 2017

Creating a New Environment

First, prepare the necessary dependencies.

Check the available modules and load the ones needed:

module avail
module add cuda/10.0
module add gcc/4.9.4

Note that GCC 4.9.4 is loaded here. If a lower version is used (i.e. gcc is not loaded), dp_ipi will not be compiled.
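
As a quick optional check, confirm that the intended compiler is the one now found in PATH:

which gcc        # should point to the module-provided GCC 4.9.4
gcc --version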

Then create a virtual environment; for the steps, refer to the Anaconda User Guide.

Assuming the virtual environment is named deepmd, replace <your env name> in the last step with deepmd. With the settings from that guide, the virtual environment will be created under /data/user/conda/env/deepmd (assuming the username is user).
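
For reference only, a minimal sketch of that step, assuming the environment is named deepmd and uses the Python 3.7 provided by Miniconda3 (the actual commands and path settings in the Anaconda guide may differ):

# Hypothetical sketch; follow the Anaconda guide for the cluster-specific configuration
conda create -n deepmd python=3.7
conda activate deepmd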

Since the GPU nodes have no internet access, we need to manually link the required driver library stub libcuda.so as libcuda.so.1 under some path /some/local/path and add it to the environment variables.

ln -s /share/cuda/10.0/lib64/stubs/libcuda.so /some/local/path/libcuda.so.1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/share/cuda/10.0/lib64/stubs:/some/local/path

Tip

If installing on the Zeus cluster, the administrators have already linked libcuda.so.1 under /share/cuda/10.0/lib64/stubs/, so there is no need to create an extra symlink; likewise, /some/local/path does not need to be added to the environment variables.

Installing the TensorFlow C++ Interface

The following installation assumes that all packages are downloaded to /some/workspace, and uses TensorFlow 1.14.0 and DeePMD-kit 1.2.0 as examples; adjust the steps accordingly for other versions.

Download the corresponding Bazel installer

cd /some/workspace
wget https://github.com/bazelbuild/bazel/releases/download/0.24.0/bazel-0.24.0-installer-linux-x86_64.sh
chmod +x bazel-0.24.0-installer-linux-x86_64.sh
./bazel-0.24.0-installer-linux-x86_64.sh --user
export PATH="$HOME/bin:$PATH"

Note

Pay attention to Bazel compatibility issues; for a suitable Bazel version, see the notes in the official TensorFlow documentation.
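
For example, to confirm which Bazel is picked up from PATH before configuring TensorFlow:

bazel version    # should report 0.24.0 for TensorFlow 1.14.0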

Download the TensorFlow source code

cd /some/workspace 
git clone https://github.com/tensorflow/tensorflow tensorflow -b v1.14.0 --depth=1
cd tensorflow

Build the TensorFlow C++ interface

In the tensorflow directory, run configure to set the build options.

./configure
Please specify the location of python. [Default is xxx]:

Found possible Python library paths:
  /xxx/xxx/xxx
Please input the desired Python library path to use.  Default is [xxx]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.

Found CUDA 10.0 in:
    /share/cuda/10.0/lib64
    /share/cuda/10.0/include
Found cuDNN 7 in:
    /share/cuda/10.0/lib64
    /share/cuda/10.0/include

Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]:

Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /share/apps/gcc/4.9.4/bin/gcc]:

Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
    --config=mkl             # Build with MKL support.
    --config=monolithic      # Config for mostly static monolithic build.
    --config=gdr             # Build with GDR support.
    --config=verbs           # Build with libverbs support.
    --config=ngraph          # Build with Intel nGraph support.
    --config=numa            # Build with NUMA support.
    --config=dynamic_kernels    # (Experimental) Build kernels into separate shared objects.
    --config=v2              # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
    --config=noaws           # Disable AWS S3 filesystem support.
    --config=nogcp           # Disable GCP support.
    --config=nohdfs          # Disable HDFS support.
    --config=noignite        # Disable Apache Ignite support.
    --config=nokafka         # Disable Apache Kafka support.
    --config=nonccl          # Disable NVIDIA NCCL support.
Configuration finished

Note

If using the GCC 4.9.4 loaded earlier, determine the GCC installation path from the output of which gcc. In most cases, however, the configure script detects the correct path by itself.

Then start the build. Since it takes quite a long time, consider using screen or tmux to keep the process running in the background.

bazel build -c opt --verbose_failures //tensorflow:libtensorflow_cc.so

Note

When installing a newer version of TensorFlow (such as 2.1.0), if the build complains that the git -c command is not available, upgrade Git to the latest version. You may need to build it locally and add it to the environment variables.

Tip

By default, Bazel builds under ~/.cache/bazel. Since the build requires a lot of disk space, you can, if needed, set an environment variable before running bazel to choose a different temporary build directory. Taking /data/user/.bazel as an example:

export TEST_TMPDIR=/data/user/.bazel

Assembling the Runtime Libraries and Header Files

Assuming the TensorFlow C++ interface is to be installed under /some/workspace/tensorflow_root, define the environment variable:

export tensorflow_root=/some/workspace/tensorflow_root

Create the directories above and extract the runtime libraries and header files from the build output.

mkdir -p $tensorflow_root

mkdir $tensorflow_root/lib
cp -d bazel-bin/tensorflow/libtensorflow_cc.so* $tensorflow_root/lib/
cp -d bazel-bin/tensorflow/libtensorflow_framework.so* $tensorflow_root/lib/
cp -d $tensorflow_root/lib/libtensorflow_framework.so.1 $tensorflow_root/lib/libtensorflow_framework.so

mkdir -p $tensorflow_root/include/tensorflow
cp -r bazel-genfiles/* $tensorflow_root/include/
cp -r tensorflow/cc $tensorflow_root/include/tensorflow
cp -r tensorflow/core $tensorflow_root/include/tensorflow
cp -r third_party $tensorflow_root/include
cp -r bazel-tensorflow/external/eigen_archive/Eigen/ $tensorflow_root/include
cp -r bazel-tensorflow/external/eigen_archive/unsupported/ $tensorflow_root/include
rsync -avzh --include '*/' --include '*.h' --include '*.inc' --exclude '*' bazel-tensorflow/external/protobuf_archive/src/ $tensorflow_root/include/
rsync -avzh --include '*/' --include '*.h' --include '*.inc' --exclude '*' bazel-tensorflow/external/com_google_absl/absl/ $tensorflow_root/include/absl

Clean up the redundant source files under the target directory, keeping only the compiled interface.

cd $tensorflow_root/include
find . -name "*.cc" -type f -delete
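
Optionally, check that the assembled directory contains what the copy commands above produced:

ls $tensorflow_root/lib       # expect libtensorflow_cc.so*  libtensorflow_framework.so*
ls $tensorflow_root/include   # expect tensorflow/  third_party/  Eigen/  unsupported/  absl/ ...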

Installing the DeePMD-kit Python Interface

First, install the TensorFlow Python interface:

pip install tensorflow-gpu==1.14.0

If it reports that the package is already installed, use the --upgrade option to overwrite it. If it reports insufficient permissions, use the --user option to install under the current account.
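
For example, choose whichever applies:

pip install --upgrade tensorflow-gpu==1.14.0   # overwrite an existing installation
pip install --user tensorflow-gpu==1.14.0      # install under the current account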

Then download the DeePMD-kit source code.

cd /some/workspace
git clone --recursive https://github.com/deepmodeling/deepmd-kit.git deepmd-kit

Remember to pass --recursive when running git clone so that all files are downloaded correctly; otherwise the build will fail with errors.

Tip

If you accidentally forget --recursive, you can fix it as follows:

git submodule update --init --recursive

Then install DeePMD-kit via pip:

cd deepmd-kit
pip install .
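
Optionally, verify the Python interface by checking that the dp command-line tool is available:

dp -h    # should print the DeePMD-kit command-line help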

Installing the DeePMD-kit C++ Interface

Continuing from the steps above, now build the DeePMD-kit C++ interface:

deepmd_source_dir=`pwd`
cd $deepmd_source_dir/source
mkdir build 
cd build

Assuming the DeePMD-kit C++ interface is to be installed under /some/workspace/deepmd_root, define the installation path deepmd_root:

export deepmd_root=/some/workspace/deepmd_root

Set the environment variables so that cmake uses the correct compilers:

export CC=`which gcc`
export CXX=`which g++`

Run in the build directory:

cmake -DTENSORFLOW_ROOT=$tensorflow_root -DCMAKE_INSTALL_PREFIX=$deepmd_root ..

If both CMake 2 and CMake 3 were installed via yum, replace cmake above with cmake3.

Finally, build and install:

make
make install

If there are no errors, run the following commands to check that the expected files were produced:

$ ls $deepmd_root/bin
dp_ipi
$ ls $deepmd_root/lib
libdeepmd_ipi.so  libdeepmd_op.so  libdeepmd.so

Depending on the GCC version, $deepmd_root/bin/dp_ipi may be absent.

Installing the DeePMD-kit Module for LAMMPS

Next, build the LAMMPS module:

cd $deepmd_source_dir/source/build
make lammps

The USER-DEEPMD LAMMPS extension package will now appear under $deepmd_source_dir/source/build.

Download the LAMMPS package and build LAMMPS in the usual way:

cd /some/workspace
# Download the latest LAMMPS stable release
wget -c https://lammps.sandia.gov/tars/lammps-stable.tar.gz
tar xf lammps-stable.tar.gz
cd lammps-*/src/
cp -r $deepmd_source_dir/source/build/USER-DEEPMD .

Select the packages to be compiled (to install additional packages, refer to the official LAMMPS documentation):

make yes-user-deepmd
make yes-kspace

Without make yes-kspace, the build will fail with an error about the missing pppm.h.

Load the MPI environment and build the LAMMPS executable with MPI:

module load intel/17u5 mpi/intel/17u5
make mpi -j4

Note

The GCC version used here should match the one used earlier to build the TensorFlow C++ and DeePMD-kit C++ interfaces; otherwise an error such as @GLIBCXX_3.4.XX may occur. If GCC 4.9.4 was loaded in the earlier steps, keep the same environment loaded here as well.
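
As an optional, generic diagnostic (not specific to this cluster), you can list the GLIBCXX symbol versions provided by the libstdc++ of the currently loaded GCC and compare them with the version mentioned in the error:

strings $(g++ -print-file-name=libstdc++.so.6) | grep GLIBCXX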

After the steps above, the LAMMPS executable lmp_mpi has been built, and you can run it with a trained potential to perform MD simulations.
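
For reference, a hedged sketch of a typical invocation (the input file name in.deepmd and the frozen model graph.pb are placeholders; check the DeePMD-kit documentation for the exact pair_style/pair_coeff syntax of your version):

# The trained model is referenced in the LAMMPS input via lines such as:
#   pair_style deepmd graph.pb
#   pair_coeff
# Then run the MPI executable, e.g. on 4 processes:
mpirun -np 4 ./lmp_mpi -in in.deepmd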
