detect

detect finds objects in images at multiple scales and outputs corresponding bounding boxes given a trained model defined by a configuration file. Example uses of the detect tool can be found in the face detection demo or the MNIST demo.

To use detect, call:

detect pedestrians.conf

When variable 'input_dir' is not defined and the inputs are image files (camera = directory), one can pass the image directory path as the second argument, e.g.:

detect pedestrians.conf /images/pedestrians/

When the input is a video file (camera = video), one must pass the file path as the second argument, e.g.:

detect pedestrians.conf /videos/pedestrians.flv 

Outputs

  • All outputs are saved in a directory named 'out_[timestamp]' created in the calling directory.
  • When variable 'save_detections' equals 1, each detected region of the original image is saved as an individual image file in subdirectory 'detections/originals', and the exact input to the model (usually preprocessed and scaled) is saved as a matrix file in subdirectory 'detections/preprocessed'.
  • When variable 'save_video' equals 1, original images with bounding boxes overlaid are saved in subdirectory 'video'.
  • When variable 'bbox_saving' is different from 0, bounding box coordinates are saved in file 'bbox.txt'.
  • The value of 'bbox_saving' selects the saving format:
    1. save in all formats (in different files)
    2. save in eblearn format
    3. save in caltech format
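
For illustration, with 'save_detections = 1', 'save_video = 1' and a non-zero 'bbox_saving', the output directory would look roughly as follows (the timestamp is hypothetical):

out_20121009.080500/           # 'out_[timestamp]' in the calling directory
  bbox.txt                     # bounding box coordinates
  detections/
    originals/                 # detected regions cropped from the original images
    preprocessed/              # exact model inputs saved as matrix files
  video/                       # original images with bounding boxes overlaid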

Configuration variables

The following variables configure the way detect works. Some variables are required, but most of them are optional.

Adding Pre-processing

Detect works by running a fixed-size network on different scales of the same image. For that, the architecture needs to contain a resize module that also knows how to pre-process the image. During the training phase, your training data was most likely pre-processed already, so we need to tell the resize module to use the same pre-processing that was used for the training data.

Note that the resize module does not have to be the first module of the architecture, as long as it is present. One can place other modules before it, for example to apply a transformation to the image before scaling.

Here is an example of a resize module using a YnUV normalization with kernel size 7×7:

arch = resizepp0,${my_architecture}  # resize/pre-processing module, followed by the trained network
resizepp0_pp = rgb_to_ynuv0          # pre-processing module used by the resizer
rgb_to_ynuv0_kernel = 7x7            # normalization kernel size
rgb_to_ynuv0_global_norm = 0         # disable global normalization

Required configuration variables

These variables are required in your configuration file by detect.

weights   = my_weights.mat # weights file of the trained model. 
classes   = classes.mat    # names of each output class in a matrix file. 
camera    = directory      # the type of input: directory, video, v4l2, kinect, opencv, shmem
input_dir = /data/         # image directory if camera = directory
threshold = .1             # confidence threshold in [0,1]

Optional configuration variables

The following optional variables allow tuning of detect. If not specified, detect runs with their default values.

Camera

The 'camera' variable selects the source of images as follows:

  • directory: inputs are images from a directory, whose path can be defined by variable 'input_dir' or as the second argument on the command line. One can use a custom file pattern for images and randomize their order as follows (optional):
    file_pattern = ".*[.](png|jpg|jpeg|PNG|JPG|JPEG|bmp|BMP|ppm|PPM|pnm|PNM|pgm|PGM|gif|GIF)" 
    input_random = 1  # randomize input list (only works for 'directory' camera). 
  • video: take a video file as input, whose path must be given as the second argument on the command line. The video extraction can be customized as follows (optional):
    # limit of input video duration in seconds, 0 means no limit 
    input_video_max_duration = 0 
    # step between input frames in seconds, 0 means no step 
    input_video_sstep = 0 
  • v4l2: use the v4l2 camera interface (Linux only). The user must also define variable 'device' to select the v4l2 device, e.g. 'device = /dev/video0'.
  • kinect: use Kinect's camera.
  • opencv: use OpenCV's camera interface as video input.
  • shmem: use shared memory as video feed.

One can force the images grabbed by any camera to a certain size as follows (optional):

input_height      = 240 # use -1 to use original size 
input_width       = 320 # use -1 to use original size 
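
For example, the variables above can be combined to grab 320×240 frames from a live v4l2 camera (a hypothetical setup; adjust the device path to your system):

camera       = v4l2         # use the v4l2 camera interface (Linux only)
device       = /dev/video0  # example device path, system-dependent
input_height = 240
input_width  = 320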

Misc

nthreads              = 1  # number of detection threads 
input_gain            = 1  # factor applied to input image
input_npasses         = 1  # passes on the input list (only works for 'directory' camera). 
bbox_saving           = 2  # 0 = none 1 = all styles 2 = eblearn style 3 = caltech style 
max_object_hratio     = 0  # image's height / object's height, 0 to ignore 
smoothing             = 0  # smooth network outputs 
outputs_threshold     = -1 # thresholds raw outputs before smoothing or other processing
outputs_threshold_val = -1 # replacement value when below outputs_threshold
dump_outputs          = 0  # if 1, save detect output matrices in dump/
background_name       = bg # name of background class (optional) 
#mem_optimization     = 1  # temporarily disabled
ipp_cores             = 1  # number of cores used by IPP 
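
For example, to run several detection threads in parallel while keeping their console output readable, one could combine the variables above with 'sync_outputs' from the display section (a hypothetical combination):

nthreads     = 4  # run four detection threads in parallel
sync_outputs = 1  # synchronize printed output between threads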

Multi-Scale detection

# multi-scaling type:
# 0 = manually set each scale sizes
#     also set: scales = 100x100,150x150 for all images
#     or        scales = 100x100,150x150;110x110,160x160 for each image
# 1 = manually set each scale step
# 2 = number of scales between min and max
# 3 = step factor between min and max
# 4 = 1 scale, the original image size.
scaling_type = 3     # type of scaling
scaling      = 1.1   # scaling ratio between scales 
min_scale    = .75   # min scale as factor of minimal network size 
max_scale    = 1.3   # max scale as factor of original resolution 
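
For instance, to bypass the automatic pyramid and process exactly two fixed scales, one could use the manual mode described in the comments above (a hypothetical sketch):

scaling_type = 0                 # manually set each scale size
scales       = 100x100,150x150   # the same two scales for every image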

# limiting multi-scale to minimum and maximum height or width sizes
input_min    = 0     # minimum height or width for minimum scale 
input_max    = 1200  # maximum height or width for maximum scale 

# padding (to detect partially cut objects at image boundaries)
hzpad        = .3    # height's 0-padding each side as ratio of network's min height 
wzpad        = .3    # width's 0-padding each side as ratio of network's min width 

Non-Maximum Suppression (NMS)

nms                   = 1   # 0: none, 1: regular, 2: voting, 3: voting + regular
pre_threshold         = 0   # bbox threshold before nms
post_threshold        = .5  # bbox threshold after nms
pre_hfact             = 1   # bbox height factor before nms
pre_wfact             = 1   # bbox width factor before nms
post_hfact            = 1   # bbox height factor after nms
post_wfact            = 1   # bbox width factor after nms
woverh                = 1   # set width to h * woverh if different from 1.0
max_overlap           = .5  # regular nms, bboxes match when overlap is less than this
max_hcenter_dist      = .3  # regular nms, bboxes match when hcenter dist is below this
max_wcenter_dist      = .3  # regular nms, bboxes match when wcenter dist is below this
vote_max_overlap      = .5  # voting nms, bboxes match when overlap is less than this
vote_max_hcenter_dist = .5  # voting nms, match when hcenter dist is below this
vote_max_wcenter_dist = .3  # voting nms, match when wcenter dist is below this

Detection display

The following variables are related to the display configuration of detection.

skip_frames     = 0          # skip this number of frames before each processed frame 
save_detections = 0          # enable saving of detected regions (see Outputs) 
detections_dir  = detections # override name of directory with detection results
save_bbox_period = 100 # period at which bbox file is saved (reduce disk I/O)
save_max        = 25000 # exit after this many objects have been saved 
save_max_per_frame = 10 # only save the first n objects per frame 
save_video      = 0 # save each classified frame and make a video out of it 
save_video_fps  = 5 
use_original_fps = 0 # if 1, save video at the input's original frame rate 
display         = 1 # enable/disable display altogether
display_zoom    = 1 # zooming 
display_min     = -1.7 # minimum data range to display (optional) 
display_max     = 1.7 # maximum data range to display (optional) 
display_in_min  = 0 # input image min display range (optional) 
display_in_max  = 255 # input image max display range (optional) 
display_bb_transparency = .5 # bbox transp factor (modulated by confidence) 
display_threads = 0 # each thread displays on its own 
display_states  = 0 # display internal states of 1 resolution 
show_parts      = 0 # show parts composing an object or not 
show_extracted  = 0 # show positive input patches
silent          = 0 # minimize outputs to be printed 
sync_outputs    = 1 # synchronize output between threads 
minimal_display = 1 # only show classified input 
display_sleep   = 0 # sleep in milliseconds after displaying 
ninternals      = 1 # demo display variables 
bbox_show_conf  = 1 # display confidence in each bbox or not
bbox_show_class = 1 # display class name in each bbox or not
no_gui_quit     = 1 # if 1, wait for user to escape gui to exit program
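
As a usage sketch, a headless batch run could disable the display and keep console output minimal (a hypothetical combination of the variables above):

display         = 0  # disable the display altogether
silent          = 1  # minimize printed output
save_detections = 1  # keep detected regions on disk instead (see Outputs)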

Detection evaluation

The following variables show how to call an independent evaluation program which can print a detection score.

  set = Test 
  evaluate = 1 
  evaluate_cmd = "${visiongrader} ${visiongrader_params}" 
  visiongrader = ${HOME}/visiongrader/src/main.py 
  visiongrader_params = "${input_params} ${groundtruth_params} ${compare_params} ${curve_params} ${ignore} " 
  input_params = "--input bbox.txt --input_parser eblearn --sampling 50 " 
  annotations = ${root}/../INRIAPerson/${set}/annotations/ 
  groundtruth_params = "--groundtruth ${annotations} --groundtruth_parser inria --gt_whratio .43 " 
  compare_params = "--comparator overlap50percent --comparator_param .5 " 
  curve_params = "--det --saving-file curve.pickle --show-no-curve " 
  input_dir = /home/sermanet/${machine}data/ped/inria/INRIAPerson/${set}/pos 
  ignore = "--ignore ${HOME}/visiongrader/datasets/pedestrians/inria/ignore/${set}/ " 

Debugging

To ensure that bounding boxes are correctly computed, set the 'bbox_decision' variable and visually check that all found bounding boxes are aligned with each image corner (or outside the image if padding is used). Also set 'nms' to 0 so that all boxes are kept:

# changes the type of decision accepting positive bounding boxes
# 0: accept bboxes with confidence higher than threshold
# 1: accept bboxes at each corners of each scale
# 2: accept bboxes at the bottom right corner of each scale
bbox_decision = 2
nms           = 0