EBLearn runs faster with code optimizations provided by the following external libraries:

  • TH Tensor library: SSE Optimizations
  • Intel IPP: float optimizations
  • OpenMP: multi-core optimizations
  • GPU (CUDA): CUDA Optimizations for convolutions

Expected speedups

For all optimizations, speedups increase as inputs become larger. For convolutions, gains are largest with 9×9 and 11×11 kernels.

The expected speedup for each optimization is:

  • TH: 30% to 100% speedup for both training and detection.
  • IPP: (in float only) up to 100% speedup for detection.
  • OpenMP: 0 to n times speedup, where n is the number of cores available.
  • GPU: 0 to 120 times speedup.
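
To make the percentages concrete: reading "p% speedup" as a p% increase in speed (an interpretation, though it matches the "up to 100% increase in speed" wording used for TH), the new run time is the old one divided by 1 + p/100. A small shell sketch, where the function name time_after_speedup is made up for illustration:

```shell
#!/bin/sh
# time_after_speedup OLD_SECONDS PERCENT
# Prints the run time after a PERCENT% increase in speed.
# Assumption: "p% speedup" means speed is multiplied by (1 + p/100).
time_after_speedup() {
    awk -v t="$1" -v p="$2" 'BEGIN { printf "%.2f\n", t / (1 + p / 100) }'
}

time_after_speedup 10 30    # 30% faster
time_after_speedup 10 100   # 100% faster, i.e. twice as fast
```

Under this reading, a 100% speedup halves the run time, while a 30% speedup removes a bit less than a quarter of it.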


Using the TH tensor backend for SSE optimizations is recommended. It gives up to a 100% increase in speed on large inputs and speeds up both training and detection.

To install it, follow the instructions at https://github.com/soumith/TH, then point EBLearn to the installation:

  1. Open eblearn/tools/scripts/FindCustom.cmake
  2. At the BEGINNING of the file, add the following lines:
    SET(THC_INCLUDE_DIR "your-th-installed-dir/include/")
    SET(THC_LIBRARIES_DIR "your-th-installed-dir/lib/")
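
Before editing FindCustom.cmake, it can help to confirm that the TH install directory really contains the include/ and lib/ subdirectories that the two SET lines refer to. A minimal sketch, assuming a POSIX shell; the function name check_th_dir and the TH_DIR variable are made up for illustration and are not part of EBLearn or TH:

```shell
#!/bin/sh
# check_th_dir DIR
# Succeeds only if DIR has the include/ and lib/ subdirectories that the
# THC_INCLUDE_DIR and THC_LIBRARIES_DIR settings point at.
check_th_dir() {
    for sub in include lib; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $1/$sub" >&2
            return 1
        fi
    done
    echo "ok: $1"
}

# TH_DIR is a placeholder; replace it with your TH install prefix.
TH_DIR="${TH_DIR:-your-th-installed-dir}"
check_th_dir "$TH_DIR" || echo "fix the path before editing FindCustom.cmake"
```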


If you are using float precision, Intel IPP is recommended for detection (we still recommend that you train in double precision).

If you install IPP on your system, CMake will attempt to find it automatically. If that fails, you can enable IPP by pointing to the installed directories: edit tools/scripts/FindCustom.cmake and add the following lines at the top of the file,


where you replace the directories with your installed directories.


Multicore optimizations using OpenMP (experimental, probably unstable)

OpenMP gives speedups on convolution modules equal to the number of cores on your machine, given a large enough input size. However, these optimizations are recent, not well tested, and have not been added for all modules.

To use OpenMP optimizations, make sure that your compiler supports OpenMP and set the environment variable USEOPENMP:

export USEOPENMP=1
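
One way to check for compiler support before setting the variable is to ask the compiler whether it defines the standard _OPENMP macro. The sketch below assumes a GCC or Clang style compiler that accepts -fopenmp; the function name supports_openmp is made up for illustration, and other compilers use different flags:

```shell
#!/bin/sh
# supports_openmp [COMPILER]
# Succeeds if COMPILER (default: cc) defines the _OPENMP macro when given
# -fopenmp. The -fopenmp flag is the GCC/Clang spelling, so treat this as
# a GCC/Clang-only check.
supports_openmp() {
    cc_bin="${1:-cc}"
    command -v "$cc_bin" >/dev/null 2>&1 || return 1
    # Dump the predefined macros for an empty C input and look for _OPENMP.
    echo | "$cc_bin" -fopenmp -x c -dM -E - 2>/dev/null | grep -q '_OPENMP'
}

if supports_openmp "${CC:-cc}"; then
    export USEOPENMP=1
    echo "OpenMP available, USEOPENMP=$USEOPENMP"
else
    echo "no OpenMP support detected, leaving USEOPENMP unset"
fi
```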


UPDATE: Due to some code changes, the CUDA code might NOT compile in recent revisions. Please try trunk revision 2522 for a working CUDA build. If you have further questions, please email liao500km <at> gmail <dot> com (Qianli)

To use GPU optimizations (for now, only the convolution, lppooling, power and tanh modules are optimized):

  • Download the NVIDIA CUDA Toolkit from the NVIDIA page and install it. Make sure the “nvcc” command is in the environment path. On Linux, if you install it to a non-standard directory, you might have to add lines similar to these to your .bashrc file and reload the terminal:
    export LD_LIBRARY_PATH=/home/sc3104/nvidia_cuda_latest/cuda/lib64/:$LD_LIBRARY_PATH
    export PATH=/home/sc3104/nvidia_cuda_latest/cuda/bin:$PATH
  • Set the environment variable USECUDA=1:
    export USECUDA=1
  • Make sure CMake finds the appropriate CUDA libraries.
  • To use a particular conv module with the GPU, look at the convolution module's constructor: it takes a boolean argument called use_gpu (true runs on the GPU, false runs on the CPU).
  • To enable or disable the GPU from the conf files, you can set the following variables globally:
    use_gpu=1 #1 is enable, 0 is disable
    gpu_id=0 #which gpu to use on the system. If this var is not specified, it uses the default GPU
  • You can also configure each convolution module separately by setting its corresponding property. E.g., if use_gpu=1 is specified globally but you want to run a particular conv module conv051 on the CPU, you can set conv051_use_gpu=0 to disable the GPU for that module. Similarly, you can specify the GPU device to use for just that convolution module:
    conv051_use_gpu=0 #1 is enable, 0 is disable
    conv051_gpu_id=3 #use the GPU with id:3 on the system
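
The environment steps in the list above (nvcc on the path, USECUDA set to 1) can be checked before building. A minimal pre-flight sketch, assuming a POSIX shell; the function name cuda_env_ok is made up for illustration and is not part of EBLearn:

```shell
#!/bin/sh
# cuda_env_ok
# Succeeds if nvcc is on PATH and USECUDA is set to 1; otherwise prints
# what is missing to stderr and fails.
cuda_env_ok() {
    status=0
    if ! command -v nvcc >/dev/null 2>&1; then
        echo "nvcc not found on PATH" >&2
        status=1
    fi
    if [ "${USECUDA:-}" != "1" ]; then
        echo "USECUDA is not set to 1" >&2
        status=1
    fi
    return "$status"
}

if cuda_env_ok; then
    echo "CUDA environment looks ready"
else
    echo "CUDA environment incomplete"
fi
```

This only checks the shell environment; whether CMake actually finds the CUDA libraries still has to be confirmed from the CMake output.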

Note: GPU convolutions are supported only for float precision and for convolutions that have either a full connection table or a connection table with a fixed fan-in, i.e. each output connects to a fixed number of inputs.
Note2: CUDA supports only a full table or a table with fixed fan-in. Some example commands for making tables:
    first layer: $maketable -full 3 32
    second layer: $maketable -random 32 256 -fanin 4
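
To illustrate what "fixed fan-in" means, the sketch below counts how many inputs feed each output and checks that the count is the same everywhere. The "input output" pair-per-line text format and the function name has_fixed_fanin are assumptions made for this example; they are not EBLearn's table file format:

```shell
#!/bin/sh
# has_fixed_fanin
# Reads "input output" index pairs on stdin (a made-up text format, one
# connection per line) and succeeds if every output has the same fan-in.
has_fixed_fanin() {
    awk '
        { fanin[$2]++ }                      # count inputs per output
        END {
            first = -1
            for (o in fanin) {
                if (first < 0) first = fanin[o]
                else if (fanin[o] != first) exit 1
            }
        }'
}

# Two outputs, each connected to two inputs: fixed fan-in of 2.
printf '0 0\n1 0\n1 1\n2 1\n' | has_fixed_fanin && echo "fixed fan-in"
# Output 0 has two inputs, output 1 has one: variable fan-in.
printf '0 0\n1 0\n2 1\n' | has_fixed_fanin || echo "variable fan-in"
```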

optimizations.txt · Last modified: 2013/01/16 12:10 by qianli