Caffe on Ubuntu: Installation and Troubleshooting
- Ubuntu 14.04
- CUDA 8.0
- cuDNN
Caffe Installation
- Step 1 Install CUDA
To use Caffe with an NVIDIA GPU, you need to install the CUDA Toolkit.
- Step 2 Install cuDNN
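- Download the cuDNN library for Linux. This requires registering for the Accelerated Computing Developer Program.
- After downloading, extract the archive and copy its contents into the CUDA directory. Using cuDNN v5.1 as an example:
$ tar zvxf cudnn-8.0-linux-x64-v5.1.tgz
$ cd cuda/
$ sudo cp lib64/lib* /usr/local/cuda/lib64/
$ sudo cp include/cudnn.h /usr/local/cuda/include/
- Create the symbolic links and refresh the linker cache:
$ cd /usr/local/cuda/lib64/
$ sudo rm -rf libcudnn.so libcudnn.so.5
$ sudo ln -s libcudnn.so.5.1.10 libcudnn.so.5
$ sudo ln -s libcudnn.so.5 libcudnn.so
$ sudo ldconfig -v
Note: many projects only work with particular cuDNN versions, so you may need to install a specific older release. In that case, just change the version numbers in the commands above.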
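Because only the version numbers change between cuDNN releases, the symlink step can be wrapped in a small helper. This is a minimal sketch rather than part of the original instructions: the function name link_cudnn is hypothetical, and it assumes the library file is named libcudnn.so.<full version>, as in the v5.1 example.

```shell
# link_cudnn DIR FULL_VERSION   (hypothetical helper, not part of cuDNN)
# Recreates the chain libcudnn.so -> libcudnn.so.<major> -> libcudnn.so.<full>
# inside DIR. Runs in a subshell so the caller's working directory is untouched.
link_cudnn() (
    dir="$1"
    full="$2"
    major="${full%%.*}"          # "5.1.10" -> "5"

    # Refuse versions without a dot: the major link would otherwise
    # overwrite the real library file.
    case "$full" in
        *.*) ;;
        *) echo "expected a full version such as 5.1.10" >&2; exit 1 ;;
    esac

    if [ ! -f "$dir/libcudnn.so.$full" ]; then
        echo "missing $dir/libcudnn.so.$full" >&2
        exit 1
    fi

    cd "$dir" || exit 1
    rm -f libcudnn.so "libcudnn.so.$major"
    ln -s "libcudnn.so.$full" "libcudnn.so.$major"
    ln -s "libcudnn.so.$major" libcudnn.so
)
```

For the example above this would be called from a root shell as link_cudnn /usr/local/cuda/lib64 5.1.10, followed by ldconfig.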
- Step 3 Install dependencies
$ sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev git
$ sudo apt-get install --no-install-recommends libboost-all-dev
$ sudo apt-get install libatlas-base-dev   # install ATLAS
$ sudo apt-get install libopenblas-dev     # install OpenBLAS
$ sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev
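- Step 4 Install the NCCL library
Caffe has built-in support for parallel computation across multiple GPUs. Running Caffe on multiple GPUs requires NVIDIA NCCL.
$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ sudo make install -j4
$ sudo ldconfig
The NCCL libraries and headers are installed to /usr/local/lib and /usr/local/include.
NCCL exists to speed up collective operations in multi-GPU training, such as the synchronization or reduction step that combines the results from all GPUs.
Its core idea is to cut large data blocks into small chunks during transfer and spread them over the multiple links available in the system. With PCI-E links, for example, it uses the upstream and downstream directions at the same time and tries to keep different transfers off the same upstream or downstream channel, because that contention would sharply reduce transfer efficiency.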
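- Step 5 Build Caffe
- Download Caffe
$ git clone https://github.com/BVLC/caffe.git
$ cd caffe/
$ cp Makefile.config.example Makefile.config
- Edit Makefile.config and make the following changes:
Uncomment USE_CUDNN := 1 to enable cuDNN acceleration.
Uncomment USE_NCCL := 1 to enable NCCL, which is required to run Caffe on multiple GPUs.
- Build and install Caffe
$ make all -j8
$ make test -j8
$ make pycaffe    # Python API
$ make matcaffe   # MATLAB API; requires the MATLAB path (MATLAB_DIR in Makefile.config) to be set
Once the build finishes, the Caffe binary is available at build/tools/caffe.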
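The Makefile.config edits from Step 5 can also be scripted instead of done by hand. The following is a rough sketch under two assumptions: the option is present in the file as a commented line of the form "# NAME := 1", and GNU sed is available (as it is on Ubuntu). The function name enable_option is hypothetical.

```shell
# enable_option FILE NAME   (hypothetical helper)
# Removes the leading '#' from a "# NAME := 1" line in FILE, then verifies
# that an active "NAME := 1" line exists. Lines for other options are left alone.
enable_option() {
    sed -i -E "s/^[[:space:]]*#[[:space:]]*($2[[:space:]]*:=[[:space:]]*1)/\1/" "$1" &&
        grep -Eq "^$2[[:space:]]*:=[[:space:]]*1" "$1"
}
```

From the caffe directory, the two changes described above would then be enable_option Makefile.config USE_CUDNN and enable_option Makefile.config USE_NCCL. The function returns non-zero if the option could not be found, so a typo in the name does not pass silently.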
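Caffe Example
- Step 1 – Prepare the image database
Measuring Caffe's training performance requires an image database as input. Caffe ships with several models that can be trained on images from the ILSVRC12 challenge ("ImageNet").
The original image files can be downloaded from http://image-net.org/download-images (you will need to create an account and agree to the terms).
Download and extract the original image files. The rest of this example assumes they are stored as follows:
/path/to/imagenet/train/n01440764/n01440764_10026.JPEG
/path/to/imagenet/val/ILSVRC2012_val_00000001.JPEG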
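- Step 2 – Download the auxiliary data
$ ./data/ilsvrc12/get_ilsvrc_aux.sh
- Step 3 – Create the databases
# In examples/imagenet/create_imagenet.sh, set TRAIN_DATA_ROOT and VAL_DATA_ROOT to the paths of the extracted original images
# Set RESIZE=true so that images are resized to the proper size before being added to the database
# Create the image databases
$ ./examples/imagenet/create_imagenet.sh
# Create the required image mean file
$ ./examples/imagenet/make_imagenet_mean.sh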
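The three settings in create_imagenet.sh can be changed with an editor, or with a small helper like the one below. This is only a sketch: set_shell_var is a hypothetical name, it assumes each setting appears in the script as a line of the form NAME=value, it needs GNU sed, and it does not handle values containing "|" or "&".

```shell
# set_shell_var FILE NAME VALUE   (hypothetical helper)
# Rewrites the line "NAME=..." in FILE to "NAME=VALUE".
# Fails without touching the file if no such line exists.
set_shell_var() {
    if ! grep -q "^$2=" "$1"; then
        echo "no $2= line in $1" >&2
        return 1
    fi
    # '|' is used as the sed delimiter so that paths with '/' need no escaping
    sed -i "s|^$2=.*|$2=$3|" "$1"
}
```

With the directory layout from Step 1, the calls would look like set_shell_var examples/imagenet/create_imagenet.sh TRAIN_DATA_ROOT /path/to/imagenet/train/, the same for VAL_DATA_ROOT with /path/to/imagenet/val/, and set_shell_var examples/imagenet/create_imagenet.sh RESIZE true.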
- Step 4 – Train the model
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:$LD_LIBRARY_PATH
# Edit models/bvlc_alexnet/solver.prototxt as needed
$ ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu 0
# To train on multiple GPUs, pass several device IDs (for example 0,1,2,3), or pass "-gpu all" to use every GPU available in the system.
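When the number of GPUs varies between machines, the comma-separated device list for the -gpu flag can be generated rather than typed. A minimal sketch, assuming the devices to use are simply numbered from 0 upward; gpu_ids is a hypothetical helper name.

```shell
# gpu_ids N   (hypothetical helper)
# Prints the device IDs 0..N-1 joined by commas, e.g. "0,1,2,3" for N=4.
gpu_ids() {
    _gi_n="$1"
    _gi_i=0
    _gi_ids=""
    while [ "$_gi_i" -lt "$_gi_n" ]; do
        # add a comma only when the list is already non-empty
        _gi_ids="${_gi_ids:+$_gi_ids,}$_gi_i"
        _gi_i=$((_gi_i + 1))
    done
    printf '%s\n' "$_gi_ids"
}
```

A training run on the first two GPUs could then be started with ./build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu "$(gpu_ids 2)".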
Installation Problems and Solutions
Problem 1 – Installing the Python dependencies
$ sudo apt-get install python-numpy python-scipy python-matplotlib python-sklearn python-skimage python-h5py python-protobuf python-leveldb python-networkx python-nose python-pandas python-gflags cython ipython
$ sudo apt-get install python-pip wget
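Many of the problems listed in this section come down to a Python module that cannot be imported. Before working through them one by one, it can help to list everything that is missing in a single pass. The sketch below is an illustration only; check_py_modules is a hypothetical name, and the interpreter is passed in explicitly so the same check works for the system Python and for Anaconda.

```shell
# check_py_modules INTERPRETER MODULE...   (hypothetical helper)
# Tries "import MODULE" with the given interpreter for every module,
# prints one "missing: MODULE" line per failure, and returns non-zero
# if at least one module could not be imported.
check_py_modules() (
    py="$1"
    shift
    missing=0
    for mod in "$@"; do
        if ! "$py" -c "import $mod" >/dev/null 2>&1; then
            echo "missing: $mod"
            missing=1
        fi
    done
    exit "$missing"
)
```

For the modules that appear in the problems of this section, a call could be check_py_modules python numpy lmdb cv2 skimage.io easydict google.protobuf pydot yaml.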
Problem 2 – No module named lmdb
$ sudo apt-get install liblmdb-dev   # C library and headers only
$ sudo pip install lmdb              # provides the Python module named in the error
Problem 3 – No module named cv2
$ sudo conda install opencv   # for Anaconda
or
$ sudo apt-get install python-opencv
Problem 4 – can not find module skimage.io
$ sudo apt-get update
$ sudo apt-get install python-skimage
$ make pycaffe   # rebuild the Python API
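Problem 5 – No module named _caffe
# Set the environment variables
$ sudo gedit /etc/profile
# Add the following lines (the distribute/ directory is created by make distribute):
export PYTHONPATH=${HOME}/caffe-master/distribute/python:$PYTHONPATH
export LD_LIBRARY_PATH=${HOME}/caffe-master/build/lib:$LD_LIBRARY_PATH
$ source /etc/profile   # apply the new environment variables
$ echo $PYTHONPATH      # print a variable to check its value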
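Sourcing /etc/profile repeatedly with the export lines above keeps prepending the same directory. If that matters, the lines can be replaced by a helper that adds a directory only when it is not already listed. This is a sketch with a hypothetical function name, prepend_path; it works for any colon-separated variable.

```shell
# prepend_path VAR DIR   (hypothetical helper)
# Prepends DIR to the colon-separated variable VAR and exports it,
# unless DIR is already one of its entries.
prepend_path() {
    # read the current value of the variable whose name is in $1
    eval "_pp_current=\${$1-}"
    case ":$_pp_current:" in
        *":$2:"*) ;;    # already present, nothing to do
        *) export "$1=$2${_pp_current:+:$_pp_current}" ;;
    esac
}
```

The two settings from this problem would then read prepend_path PYTHONPATH "$HOME/caffe-master/distribute/python" and prepend_path LD_LIBRARY_PATH "$HOME/caffe-master/build/lib".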
Problem 6 – make pytest fails with layer_factory.hpp:77] Check failed: registry... (the full message looks like: Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Python)
Open Makefile.config in the caffe directory, find the line WITH_PYTHON_LAYER := 1, remove the leading '#', and rebuild. It is best to rebuild from a newly opened terminal.
Problem 7 – No module named easydict
$ sudo pip install easydict
Problem 8 – ImportError: No module named google.protobuf
$ wget 'https://code.google.com/p/protobuf/wiki/Download?tm=2'
# Extract the archive and change into the extracted directory
$ ./configure
$ make
$ make check
$ make install
# Build and install the Python bindings
$ ./configure && make && cd python && python setup.py test && python setup.py install
Reference 1: http://code.google.com/p/protobuf/issues/detail?id=235
Reference 2: http://www.voidcn.com/article/p-rdngwpwe-nz.html
Problem 9 – ImportError: No module named google.protobuf.internal
$ sudo apt-get install python-protobuf
or search for "python-protobuf" in the Synaptic Package Manager and install it from there.
Problem 10 – "fatal error: hdf5.h: No such file or directory"
- Step 1: On line 85 of Makefile.config (the exact line number may differ between Caffe versions), add /usr/include/hdf5/serial/ to INCLUDE_DIRS. That is, change the first line below into the second:
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial/
- Step 2: On line 173 of Makefile, replace hdf5_hl and hdf5 with hdf5_serial_hl and hdf5_serial. That is, change the first line below into the second:
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_hl hdf5
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_serial_hl hdf5_serial
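Since the line numbers can shift between Caffe versions, the two edits can also be applied by matching the line content. The following is a sketch, not the author's method: fix_hdf5_paths is a hypothetical name, it assumes the two lines look exactly like the "before" lines above, and it relies on GNU sed.

```shell
# fix_hdf5_paths CONFIG MAKEFILE   (hypothetical helper)
# CONFIG   : path to Makefile.config
# MAKEFILE : path to Makefile
# Applies both edits by content instead of by line number. Safe to run twice.
fix_hdf5_paths() (
    config="$1"
    makefile="$2"

    # Step 1: append the serial HDF5 include directory, unless the
    # INCLUDE_DIRS line already mentions it.
    grep -q '^INCLUDE_DIRS .*/usr/include/hdf5/serial' "$config" ||
        sed -i 's|^INCLUDE_DIRS := \$(PYTHON_INCLUDE) /usr/local/include.*|& /usr/include/hdf5/serial/|' "$config"

    # Step 2: switch the linked libraries to their _serial variants.
    sed -i '/^LIBRARIES +=/ s/hdf5_hl hdf5/hdf5_serial_hl hdf5_serial/' "$makefile"
)
```

From the caffe directory this would be run as fix_hdf5_paths Makefile.config Makefile, followed by a rebuild.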
Problem 11 – nccl.hpp:5:18: fatal error: nccl.h: No such file or directory
# Running Caffe on multiple GPUs requires NVIDIA NCCL
$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ sudo make install -j4   # installs the NCCL libraries and headers to /usr/local/lib and /usr/local/include
$ sudo ldconfig           # skipping this step causes: error while loading shared libraries: libnccl.so.1: cannot open shared object file: No such file or directory
Problem 12 – No module named google.protobuf
$ sudo apt-get install python-protobuf
or download the source package and build and install it yourself; see http://blog.csdn.net/paynetiger/article/details/8197326
The first method is recommended. The key point is this: if you use Anaconda, both methods install the protobuf files into /usr/local/lib/python2.7/dist-packages, and you need to copy them into Anaconda/lib/python2.7/site-packages before they can be used.
Problem 13 – No module named pydot
$ sudo apt-get install graphviz   # install graphviz
$ sudo pip install pydot          # install pydot
If you use Anaconda, copy the relevant files from /usr/local/lib/python2.7/dist-packages into Anaconda/lib/python2.7/site-packages.
Problem 14 – ImportError: libcudart.so.7.0: cannot open shared object file: No such file or directory (CUDA 7.5)
$ sudo ldconfig /usr/local/cuda/lib64
$ sudo ldconfig /usr/local/cuda-7.5/lib64
Problem 15 – Failed to compile cuda_ndarray.cu: libcublas.so.7.5: cannot open shared object file (CUDA 7.5)
$ sudo ldconfig /usr/local/cuda-7.5/lib64
Problem 16 – ImportError: No module named caffe
import sys
sys.path.append("/(path to your caffe-master)/caffe-master/python")
sys.path.append("/(path to your caffe-master)/caffe-master/python/caffe")
import caffe
Problem 17 – ImportError: No module named google.protobuf.internal
# Download protobuf-2.5.0:
$ wget http://protobuf.googlecode.com/files/protobuf-2.5.0.zip
$ unzip protobuf-2.5.0.zip
$ cd protobuf-2.5.0
$ chmod 777 configure
$ ./configure
$ make -j4
$ make check -j4
$ make install
# Build the Python interface
$ cd ./python
$ python setup.py build
$ python setup.py test
$ python setup.py install
$ protoc --version   # verify the compiler
# Verify the Python module from a Python prompt:
>> import google.protobuf
Problem 18 – autoreconf: not found
$ sudo apt-get install automake autoconf libtool
Problem 19 – ImportError: /usr/lib/liblapack.so.3: undefined symbol: ATL_chemv
# This issue arises when you have libopenblas-base and libatlas3-base installed, but don't have liblapack3 installed. This combination of packages installs conflicting versions of libblas.so (from OpenBLAS) and liblapack.so (from ATLAS).
# Solution 1 (my favorite): You can keep both OpenBLAS and ATLAS on your machine if you also install liblapack3.
$ sudo apt-get install liblapack3
# Solution 2: Uninstall ATLAS (this will actually install liblapack3 for you automatically because of some deb package shenanigans)
$ sudo apt-get remove libatlas3-base
# Solution 3: Uninstall OpenBLAS
$ sudo apt-get remove libopenblas-base
--------------------------------------------------------------------------------
# Bad configuration
$ dpkg -l | grep 'openblas\|atlas\|lapack'
ii  libatlas3-base    3.10.1-4        amd64  Automatically Tuned Linear Algebra Software, generic shared
ii  libopenblas-base  0.2.8-6ubuntu1  amd64  Optimized BLAS (linear algebra) library based on GotoBLAS2
$ update-alternatives --get-selections | grep 'blas\|lapack'
libblas.so.3    auto  /usr/lib/openblas-base/libblas.so.3
liblapack.so.3  auto  /usr/lib/atlas-base/atlas/liblapack.so.3
$ python -c 'import numpy'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/numpy/__init__.py", line 153, in <module>
    from . import add_newdocs
  File "/usr/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/usr/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 18, in <module>
    from .polynomial import *
  File "/usr/lib/python2.7/dist-packages/numpy/lib/polynomial.py", line 19, in <module>
    from numpy.linalg import eigvals, lstsq, inv
  File "/usr/lib/python2.7/dist-packages/numpy/linalg/__init__.py", line 50, in <module>
    from .linalg import *
  File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 29, in <module>
    from numpy.linalg import lapack_lite, _umath_linalg
ImportError: /usr/lib/liblapack.so.3: undefined symbol: ATL_chemv
# Solution 1
$ dpkg -l | grep 'openblas\|atlas\|lapack'
ii  libatlas3-base    3.10.1-4        amd64  Automatically Tuned Linear Algebra Software, generic shared
ii  liblapack3        3.5.0-2ubuntu1  amd64  Library of linear algebra routines 3 - shared version
ii  libopenblas-base  0.2.8-6ubuntu1  amd64  Optimized BLAS (linear algebra) library based on GotoBLAS2
$ update-alternatives --get-selections | grep 'blas\|lapack'
libblas.so.3    auto  /usr/lib/openblas-base/libblas.so.3
liblapack.so.3  auto  /usr/lib/lapack/liblapack.so.3
$ python -c 'import numpy'
# Solution 2
$ dpkg -l | grep 'openblas\|atlas\|lapack'
ii  liblapack3        3.5.0-2ubuntu1  amd64  Library of linear algebra routines 3 - shared version
ii  libopenblas-base  0.2.8-6ubuntu1  amd64  Optimized BLAS (linear algebra) library based on GotoBLAS2
$ update-alternatives --get-selections | grep 'blas\|lapack'
libblas.so.3    auto  /usr/lib/openblas-base/libblas.so.3
liblapack.so.3  auto  /usr/lib/lapack/liblapack.so.3
$ python -c 'import numpy'
# Solution 3
$ dpkg -l | grep 'openblas\|atlas\|lapack'
ii  libatlas3-base    3.10.1-4        amd64  Automatically Tuned Linear Algebra Software, generic shared
$ update-alternatives --get-selections | grep 'blas\|lapack'
libblas.so.3    auto  /usr/lib/atlas-base/atlas/libblas.so.3
liblapack.so.3  auto  /usr/lib/atlas-base/atlas/liblapack.so.3
$ python -c 'import numpy'
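The bad configuration shown above can be recognized mechanically from the update-alternatives listing: libblas.so.3 points into OpenBLAS while liblapack.so.3 points into ATLAS. The sketch below encodes just that one rule; the function name blas_lapack_conflict is hypothetical, and it assumes the three-column output format shown in the listings above.

```shell
# blas_lapack_conflict   (hypothetical helper)
# Reads `update-alternatives --get-selections` output on stdin.
# Prints "conflict" and returns 1 when libblas.so.3 comes from OpenBLAS
# while liblapack.so.3 comes from ATLAS; otherwise prints "ok".
blas_lapack_conflict() {
    awk '
        $1 == "libblas.so.3"   { blas = $3 }      # third column is the selected path
        $1 == "liblapack.so.3" { lapack = $3 }
        END {
            if (blas ~ /openblas/ && lapack ~ /atlas/) {
                print "conflict"
                exit 1
            }
            print "ok"
        }'
}
```

On a real system the check would be update-alternatives --get-selections | blas_lapack_conflict. It only detects the specific mismatch described in this problem, not other broken BLAS setups.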
Problem 20 – libcudart.so.7.5: cannot open shared object file: No such file or directory (CUDA 7.5)
# Check the environment variables
$ echo $PATH
$ echo $LD_LIBRARY_PATH
# Copy the libraries into /usr/local/lib:
$ sudo cp /usr/local/cuda-7.5/lib64/libcudart.so.7.5 /usr/local/lib/libcudart.so.7.5 && sudo ldconfig
$ sudo cp /usr/local/cuda-7.5/lib64/libcublas.so.7.5 /usr/local/lib/libcublas.so.7.5 && sudo ldconfig
$ sudo cp /usr/local/cuda-7.5/lib64/libcurand.so.7.5 /usr/local/lib/libcurand.so.7.5 && sudo ldconfig
Problem 21 – W: GPG error: http://archive.ubuntukylin.com
$ sudo apt-get update
# This fails with:
# W: GPG error: http://archive.ubuntukylin.com:10006 xenial InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 8D5A09DC9B929006
# The public key is missing. To fix it:
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 8D5A09DC9B929006
# Use the key ID printed in your own error message in the command above; it differs from machine to machine
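Because the key ID differs per machine, it is convenient to pull it out of the error text instead of copying it by hand. A small sketch, assuming the warning has the "NO_PUBKEY <hex id>" form shown above; missing_pubkeys is a hypothetical helper name.

```shell
# missing_pubkeys   (hypothetical helper)
# Reads apt-get output on stdin and prints each distinct key ID that
# follows a NO_PUBKEY marker, one per line.
missing_pubkeys() {
    grep -oE 'NO_PUBKEY [0-9A-Fa-f]+' | awk '{ print $2 }' | sort -u
}
```

One possible use is sudo apt-get update 2>&1 | missing_pubkeys, then passing each printed ID to the apt-key command above.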
Problem 22 – ImportError: No module named yaml
$ sudo apt-get install python-yaml
or install from source, for example PyYAML 3.11:
$ wget http://pyyaml.org/download/pyyaml/PyYAML-3.11.tar.gz
$ tar -zxvf PyYAML-3.11.tar.gz
$ cd PyYAML-3.11
$ python setup.py install
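Problem 23 – Common Ubuntu error: Could not get lock
# Installing a package from the terminal with sudo apt-get install xxx fails with:
# E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
# E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
# Another program may still be running and holding the lock, or a previous install or update may not have finished cleanly and left the lock behind.
# First make sure that no other apt or dpkg process is still running. If none is, remove the stale locks:
$ sudo rm /var/cache/apt/archives/lock
$ sudo rm /var/lib/dpkg/lock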
Problem 24 – Problems caused by automatic NVIDIA driver updates on an Ubuntu server, and how to fix them
# Problem description:
# The server runs Ubuntu 14.04. The NVIDIA driver was automatically upgraded from 352.39 to 352.63, after which the GPU could not be used. Running nvidia-smi printed: "Failed to initialize NVML: GPU access blocked by the operating system"
# System: Ubuntu 14.04
# CUDA: 7.5
# Solution:
# 1. First disable all automatic system updates
$ sudo vim /etc/apt/apt.conf.d/50unattended-upgrades   # comment out the update entries in this file
# Reference: http://www.linuxdiyf.com/Linux/15997.html
# 2. Uninstall the CUDA driver and reinstall it
# (1) Remove it completely
$ sudo apt-get remove --purge 'nvidia*'
$ sudo apt-get autoremove
$ sudo apt-get clean
$ dpkg -l |grep ^rc|awk '{print $2}' |sudo xargs dpkg -P
# References:
# https://devtalk.nvidia.com/default/topic/900899/cuda-setup-and-installation/unable-to-detect-cuda-capable-device-after-automatic-forced-nvidia-updated/
# http://zhidao.baidu.com/link?url=smwXar3NPdAi1WxnZJ2_sARCEPoNcxLwB0RwmEnDPiqyrbdz64aVCoabN9azod-AQrJP0OjeiL8-y8mFRHZDma
# (2) Reinstall CUDA
# On this Ubuntu 14.04 system, gcc had earlier been downgraded from 4.8 to 4.7 in order to build the MATLAB interface after the Caffe environment was set up. Installing CUDA directly then fails with: "Unable to find the kernel source tree for the currently running kernel. Please make sure you have installed the kernel source files for your kernel and that they are properly configured; on Red Hat Linux systems, for example, be sure you have the 'kernel-source' or 'kernel-devel' RPM installed. If you know the correct kernel source files are installed, you may specify the kernel source path with the '--kernel-source-path' command line option."
# Following that hint and adding "--kernel-source-path" does not solve the problem, and going further down that route ends with a prompt to downgrade the system kernel.
# The earlier gcc downgrade is what causes the error above, so upgrade gcc back to 4.8:
$ sudo apt-get install gcc-4.8 g++-4.8
$ cd /usr/bin
$ sudo mv gcc gcc.bak
$ sudo ln -s gcc-4.8 gcc
$ sudo mv g++ g++.bak
$ sudo ln -s g++-4.8 g++
# Reference: http://www.mamicode.com/info-detail-876185.html
# After that, reinstalling the CUDA driver in the usual way resolves the problem.
# Source: http://blog.csdn.net/u012494820/article/details/52289095
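Problem 25 – Cannot boot into the system after installing the NVIDIA driver
# Power on and press E at the GRUB menu; the screen turns into an editor.
# In the last few lines, find: ro quiet splash
# Delete quiet, replace it with text, then press F10.
# The system now boots to a text console (Ctrl+Alt+F1 to F6); log in with your user name and password.
# Then run:
$ sudo add-apt-repository ppa:bumblebee/stable
$ sudo apt-get update
$ sudo apt-get install bumblebee bumblebee-nvidia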
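Problem 26 – Installing OpenCV 3.1
# Download OpenCV from the official site (http://opencv.org/downloads.html) and extract it to the install location, assumed here to be ~/opencv
# Create the build directory:
$ cd ~/opencv
$ mkdir build
$ cd build
# Configure:
$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ..
# Build:
$ make -j8   # -j8 runs eight parallel jobs; plain make also works
# Install OpenCV
$ sudo make install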
Problem 27 – "libcudart.so.8.0: cannot open shared object file: No such file or directory"
# The fix is to copy the libraries into /usr/local/lib:
# Mind the CUDA version number
$ sudo cp /usr/local/cuda-8.0/lib64/libcudart.so.8.0 /usr/local/lib/libcudart.so.8.0 && sudo ldconfig
$ sudo cp /usr/local/cuda-8.0/lib64/libcublas.so.8.0 /usr/local/lib/libcublas.so.8.0 && sudo ldconfig
$ sudo cp /usr/local/cuda-8.0/lib64/libcurand.so.8.0 /usr/local/lib/libcurand.so.8.0 && sudo ldconfig
Problem 28 – matio.h: No such file or directory / installing matio
$ sudo apt-get install libmatio-dev
or install from source:
# Download matio (https://sourceforge.net/projects/matio/)
$ tar zxf matio-X.Y.Z.tar.gz
$ cd matio-X.Y.Z
$ ./configure
$ make
$ make check
$ make install   # install
# LD_LIBRARY_PATH takes directories, so point it at the directory that contains libmatio.so.2:
$ export LD_LIBRARY_PATH=/path/to/directory/containing/libmatio.so.2:$LD_LIBRARY_PATH
# In Caffe's Makefile.config, add the matio src path to INCLUDE_DIRS and src/.libs to LIBRARY_DIRS, for example:
# INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /path/to/matio-1.5.2/src
# LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /path/to/matio-1.5.2/src/.libs
# Reference: http://blog.csdn.net/houqiqi/article/details/46469981
Problem 29 – Problems with fast-rcnn
# cython_bbox and cython_nms errors
$ cd fast_rcnn_root/lib
$ python setup.py install
# After setup.py finishes:
$ cd python_root/Lib/site-packages/utils
# This directory contains cython_bbox.so and cython_nms.so; copy both files into fast_rcnn_root/lib/utils.
# Reference: http://blog.csdn.net/happynear/article/details/46822109
Problem 30 – CUDA 8.0: atomicAdd redefinition. CUDA 8 ships its own definition of atomicAdd, which conflicts with the one in the code and breaks the build
Modify common.cuh as follows; note the final #endif:
#ifndef CAFFE_COMMON_CUH_
#define CAFFE_COMMON_CUH_

#include <cuda.h>

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
// CUDA: atomicAdd is not defined for doubles
static __inline__ __device__ double atomicAdd(double *address, double val) {
  unsigned long long int* address_as_ull = (unsigned long long int*)address;
  unsigned long long int old = *address_as_ull, assumed;
  if (val == 0.0)
    return __longlong_as_double(old);
  do {
    assumed = old;
    old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);
  return __longlong_as_double(old);
}
#endif
#endif
With this change, make should generally succeed.
Problem 31 – CUDA check: nvcc -V reports "not installed"
Problem:
$ nvcc -V
>> The program 'nvcc' is currently not installed. You can install it by typing: sudo apt-get install nvidia-cuda-toolkit