DeepMD-kit安装:旧版¶
本部分写于2020年,适用于DeePMD-kit 1.x 和 TensorFlow 1.14。对目前较新的版本可能不适用,请移步安装最佳实践和快速安装教程
背景:以 Zeus 集群为例,在服务器安装DeepMD-kit和包含完整接口的LAMMPS。
参考:
初始环境说明¶
以下过程以 Zeus 集群为例,操作系统及版本为CentOS 7,采用module作为环境管理。
- 通过yum安装:
- Cmake 3.7
- GCC 4.8.5
- Git 1.8.2
- 通过module加载
- CUDA 10.0
- Miniconda3 (Python 3.7)
- GCC 4.9.4
- Intel MPI 2017
创建新的环境¶
首先准备必要的依赖。
检查可用的模块,并加载必要的模块:
module avail
module add cuda/10.0
module add gcc/4.9.4
注意这里导入的是gcc 4.9.4版本,如果采用更低的版本(不导入gcc)则dp_ipi不会被编译。
然后创建虚拟环境,步骤请参考Anaconda 使用指南。
假设创建的虚拟环境名称是 deepmd,则请将步骤最后的 <your env name> 替换为 deepmd。若采用该步骤的设置,则虚拟环境将被创建在/data/user/conda/env/deepmd下(假设用户名为user)。
由于GPU节点不能联网,故我们需要将所需的驱动程序库libcuda.so和libcuda.so.1手动链接到某个路径/some/local/path并加入环境变量。
ln -s /share/cuda/10.0/lib64/stubs/libcuda.so /some/local/path/libcuda.so.1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/share/cuda/10.0/lib64/stubs:/some/local/path
提示
若在 Zeus 集群上安装,管理员已事先把libcuda.so.1 链接在/share/cuda/10.0/lib64/stubs/下,故无需额外创建软链接,同理/some/local/path也无需加入环境变量。
安装Tensorflow的C++ 接口¶
以下安装,假设软件包下载路径均为/some/workspace, 以TensorFlow 1.14.0版本、DeePMD-kit 1.2.0 版本为例进行说明,其他版本的步骤请参照修改。
下载对应的bazel安装包¶
cd /some/workspace
wget https://github.com/bazelbuild/bazel/releases/download/0.24.0/bazel-0.24.0-installer-linux-x86_64.sh
chmod +x bazel-0.24.0-installer-linux-x86_64.sh
./bazel-0.24.0-installer-linux-x86_64.sh --user
export PATH="$HOME/bin:$PATH"
注意
注意bazel的兼容性问题,合理的bazel版本设置请参阅Tensorflow官方文档中的说明。
下载TensorFlow源代码¶
cd /some/workspace 
git clone https://github.com/tensorflow/tensorflow tensorflow -b v1.14.0 --depth=1
cd tensorflow
编译TensorFlow C++ Interface¶
在tensorflow文件夹下运行configure,设置编译参数。
./configure
Please specify the location of python. [Default is xxx]:
Found possible Python library paths:
  /xxx/xxx/xxx
Please input the desired Python library path to use.  Default is [xxx]
Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Do you wish to build TensorFlow with TensorRT support? [y/N]:
No TensorRT support will be enabled for TensorFlow.
Found CUDA 10.0 in:
    /share/cuda/10.0/lib64
    /share/cuda/10.0/include
Found cuDNN 7 in:
    /share/cuda/10.0/lib64
    /share/cuda/10.0/include
Please specify a list of comma-separated CUDA compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]:
Do you want to use clang as CUDA compiler? [y/N]:
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /share/apps/gcc/4.9.4/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
    --config=mkl             # Build with MKL support.
    --config=monolithic      # Config for mostly static monolithic build.
    --config=gdr             # Build with GDR support.
    --config=verbs           # Build with libverbs support.
    --config=ngraph          # Build with Intel nGraph support.
    --config=numa            # Build with NUMA support.
    --config=dynamic_kernels    # (Experimental) Build kernels into separate shared objects.
    --config=v2              # Build TensorFlow 2.x instead of 1.x.
Preconfigured Bazel build configs to DISABLE default on features:
    --config=noaws           # Disable AWS S3 filesystem support.
    --config=nogcp           # Disable GCP support.
    --config=nohdfs          # Disable HDFS support.
    --config=noignite        # Disable Apache Ignite support.
    --config=nokafka         # Disable Apache Kafka support.
    --config=nonccl          # Disable NVIDIA NCCL support.
Configuration finished
注意
若采用前文导入的GCC 4.9.4版本,请根据which gcc的输出判断GCC的安装路径。但一般情况下安装程序可以直接检测到正确路径。
随后进行编译,由于时间较长,可以考虑使用screen或者tmux将进程放置在后台。
bazel build -c opt --verbose_failures //tensorflow:libtensorflow_cc.so
说明
安装高版本Tensorflow(如2.1.0)时,若提示没有git -c的命令,请升级git到最新版。用户可能需要在本地进行编译并加入环境变量。
提示
一般情况下,bazel默认在~/.cache/bazel下进行编译。由于编译所需硬盘空间较大,如有需要,请在运行bazel前采用环境变量指定编译用临时文件夹,以/data/user/.bazel为例:
export TEST_TMPDIR=/data/user/.bazel整合运行库与头文件¶
假设Tensorflow C++ 接口安装在/some/workspace/tensorflow_root下,则定义环境变量:
export tensorflow_root=/some/workspace/tensorflow_root
创建上述文件夹并从编译结果中抽取运行库和头文件。
mkdir -p $tensorflow_root
mkdir $tensorflow_root/lib
cp -d bazel-bin/tensorflow/libtensorflow_cc.so* $tensorflow_root/lib/
cp -d bazel-bin/tensorflow/libtensorflow_framework.so* $tensorflow_root/lib/
cp -d $tensorflow_root/lib/libtensorflow_framework.so.1 $tensorflow_root/lib/libtensorflow_framework.so
mkdir -p $tensorflow_root/include/tensorflow
cp -r bazel-genfiles/* $tensorflow_root/include/
cp -r tensorflow/cc $tensorflow_root/include/tensorflow
cp -r tensorflow/core $tensorflow_root/include/tensorflow
cp -r third_party $tensorflow_root/include
cp -r bazel-tensorflow/external/eigen_archive/Eigen/ $tensorflow_root/include
cp -r bazel-tensorflow/external/eigen_archive/unsupported/ $tensorflow_root/include
rsync -avzh --include '*/' --include '*.h' --include '*.inc' --exclude '*' bazel-tensorflow/external/protobuf_archive/src/ $tensorflow_root/include/
rsync -avzh --include '*/' --include '*.h' --include '*.inc' --exclude '*' bazel-tensorflow/external/com_google_absl/absl/ $tensorflow_root/include/absl
清理目标目录下赘余的源代码文件,保留编译好的接口。
cd $tensorflow_root/include
find . -name "*.cc" -type f -delete
安装DeePMD-kit的Python接口¶
首先安装Tensorflow的Python接口
pip install tensorflow-gpu==1.14.0
若提示已安装,请使用--upgrade选项进行覆盖安装。若提示权限不足,请使用--user选项在当前账号下安装。
然后下载DeePMD-kit的源代码。
cd /some/workspace
git clone --recursive https://github.com/deepmodeling/deepmd-kit.git deepmd-kit
在运行git clone时记得要--recursive,这样才可以将全部文件正确下载下来,否则在编译过程中会报错。
提示
如果不慎漏了--recursive, 可以采取以下的补救方法:
git submodule update --init --recursive
" %}
随后通过pip安装DeePMD-kit:
cd deepmd-kit
pip install .
安装DeePMD-kit的C++ 接口¶
延续上面的步骤,下面开始编译DeePMD-kit C++接口:
deepmd_source_dir=`pwd`
cd $deepmd_source_dir/source
mkdir build 
cd build
假设DeePMD-kit C++ 接口安装在/some/workspace/deepmd_root下,定义安装路径deepmd_root:
export deepmd_root=/some/workspace/deepmd_root
修改环境变量以使得cmake正确指定编译器:
export CC=`which gcc`
export CXX=`which g++`
在build目录下运行:
cmake -DTENSORFLOW_ROOT=$tensorflow_root -DCMAKE_INSTALL_PREFIX=$deepmd_root ..
若通过yum同时安装了Cmake 2和Cmake 3,请将以上的cmake切换为cmake3。
最后编译并安装:
make
make install
若无报错,通过以下命令执行检查是否有正确输出:
$ ls $deepmd_root/bin
dp_ipi
$ ls $deepmd_root/lib
libdeepmd_ipi.so  libdeepmd_op.so  libdeepmd.so
因为GCC版本差别,可能没有$deepmd_root/bin/dp_ipi。
安装LAMMPS的DeePMD-kit模块¶
接下来安装
cd $deepmd_source_dir/source/build
make lammps
此时在$deepmd_source_dir/source/build下会出现USER-DEEPMD的LAMMPS拓展包。
下载LAMMPS安装包,按照常规方法编译LAMMPS:
cd /some/workspace
# Download Lammps latest release
wget -c https://lammps.sandia.gov/tars/lammps-stable.tar.gz
tar xf lammps-stable.tar.gz
cd lammps-*/src/
cp -r $deepmd_source_dir/source/build/USER-DEEPMD .
选择需要编译的包(若需要安装其他包,请参考Lammps官方文档):
make yes-user-deepmd
make yes-kspace
如果没有make yes-kspace 会因缺少pppm.h报错。
加载MPI环境,并采用MPI方式编译Lammps可执行文件:
module load intel/17u5 mpi/intel/17u5
make mpi -j4
注意
此处使用的GCC版本应与之前编译Tensorflow C++接口和DeePMD-kit C++接口一致,否则可能会报错:@GLIBCXX_3.4.XX。如果在前面的安装中已经加载了GCC 4.9.4,请在这里也保持相应环境的加载。
经过以上过程,Lammps可执行文件lmp_mpi已经编译完成,用户可以执行该程序调用训练的势函数进行MD模拟。