Cyril Poulet (cyril.poulet@centraliens.net)

Computational and Biological Learning Lab

Courant Institute of Mathematical Sciences

New York University

How to build and train an object detector using the Eblearn library

1.An example of face detector using convolutional nets

This tutorial aims to help the reader take their first steps with the Eblearn and Eblearn_gui C++ libraries developed by the CBLL at NYU. We will use the example of a face detector called Facenet; you will need to download its source code before beginning this tutorial. We also assume that you have installed and compiled the eblearn and eblearn_gui libraries (the gui requires Qt, so if you are not familiar with Qt, we recommend using the Eclipse C++ IDE with the Qt plugin, which simplifies compilation with qmake).

Creating an object detector “from scratch” is a four-step process: determining the structure of your convolutional net, creating the data to train it, training it, and finally building the associated detector. In this tutorial we will try to explain each step clearly and accurately, and to mention a few common mistakes or bugs that are likely to occur and how to avoid or solve them.

I.Structure of your convolutional net

For this first part, you won't need to program: it is important to think first about the structure of the net you want to use. We will assume that you know what convolutional networks are (if that's not the case, you might want to look at some of Yann LeCun's articles on the subject at http://yann.lecun.com/exdb/publis/index.html).

The Eblearn library provides 3 main types of modules:

  • convolutional layer modules, which perform multiple convolutions on the input;

  • subsampling layer modules, which perform subsampling on the input;

  • full-layers, which provide full connections between the input layer and the output layer.

Each of these modules is completely tunable and all the connections can be learned (i.e. the convolution and subsampling kernels and biases, and the full connections). Let's look at the structure we will use in the facenet detector.

The facenet network is a CSCSCF network :

graphics1

The input image (42x42 pixels) is first given to a convolutional layer with 7x7 kernels, which extracts 6 feature maps (36x36 pixels); a subsampling layer (2x2 kernel) then outputs six 18x18 pixel maps. A second convolutional layer (7x7 kernels) extracts 16 new feature maps (12x12 pixels), which are then subsampled (2x2 kernel) to sixteen 6x6 pixel maps. A last convolutional layer (6x6 kernels) outputs 80 feature maps of 1x1 pixel, which are finally connected to a 2-category (face and background) layer via a full-layer.
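
To check the sizes quoted above, here is a minimal sketch in plain C++. It does not use the eblearn classes: Layer, map_sizes and facenet_layers are names made up for this example. It assumes that a convolution with an NxN kernel removes N-1 pixels per dimension and that a subsampling with an NxN kernel divides the size by N, rounding down.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one layer (not an eblearn type):
// 'C' = convolution, 'S' = subsampling, with a square kernel.
struct Layer {
    char type;
    int kernel;
};

// Returns the side length of the feature maps after each layer, for a square input.
std::vector<int> map_sizes(int input, const std::vector<Layer>& layers) {
    std::vector<int> sizes;
    int size = input;
    for (const Layer& l : layers) {
        assert(l.type == 'C' || l.type == 'S');
        assert(l.kernel > 0 && size >= l.kernel);
        // Assumption: a convolution trims kernel-1 pixels, a subsampling divides
        // the size (integer division, leftover border pixels are dropped).
        size = (l.type == 'C') ? size - l.kernel + 1 : size / l.kernel;
        sizes.push_back(size);
    }
    return sizes;
}

// The Facenet stack as described in the text: C(7) S(2) C(7) S(2) C(6).
std::vector<Layer> facenet_layers() {
    return {{'C', 7}, {'S', 2}, {'C', 7}, {'S', 2}, {'C', 6}};
}
```

Calling map_sizes(42, facenet_layers()) gives 36, 18, 12, 6 and 1, which are the sizes listed above.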

As you can see, there are multiple parameters that you can play with to customize your network: the number and type of layers, the number of feature maps of each layer (each feature map corresponds to one kernel), the tables of connections between the inputs and the feature maps, and the size of the kernels.

  • The convolutional layers learn features whose combinations will enable the detector to find the objects in images. The subsamplers make it possible to use convolutional layers at different scales, and to find “features of features”, thus enabling the detection of more complex objects. Finally the full layer is a kind of “If you find this, this and that in an image, then it's a category 1, but with these 3 other ones it's a 2”, and is used to connect the final feature maps to the output categories.

If you are detecting simple objects, 2 C-layers may suffice in your network, but for most objects 3 are recommended. With more, it is not clear whether you will improve the detection, but you will surely increase the cost of the training and detection (in time and memory).

  • As a general rule, the more complex the detection is (multiple types of object, complex objects) the more feature maps you will need, but it will cost more connections (which means more calculations to train them).

As a comparative point, Facenet uses 14571 weights (connections, biases, etc.).

When you have chosen the architecture, you have to determine the sizes of the kernels. Usually 5x5 to 7x7 kernels for c-layers and 2x2 to 4x4 kernels for s-layers are a good choice:

  • under 5x5 your c-layer won't find relevant information in the image (the features will be small, therefore very generic), and over 8x8 the information will be either too specialized or too costly to compute.

  • Over 4x4 your s-layer will subsample too much and you may miss some important information in the image.

You also have to take into account the minimum size of the object you want to detect: each c-layer or s-layer produces an output smaller than its input (a convolution “loses” the borders, a subsampling divides the size). For example, Facenet takes an input of at least 42x42 pixels, since a 42x42 pixel input is transformed into a 1x1 pixel output.

To calculate the sizes of the kernels and the minimum size of the input, keep in mind that:

  • the starting point is that the output is 1x1, and so should be the input of the f-layer

  • each convolution (c-layer) with an NxN kernel makes you lose N-1 pixels in each dimension, and each subsampling (s-layer) with an NxN kernel divides each dimension by N (this is how 36x36 maps become 18x18 maps in Facenet)

  • your object at its minimum size must still be recognizable, and whole (you have to leave some background around it). In Facenet, we decided on a 42x42 pixel minimum size, with a face 30 pixels high (8 to 10 pixels between the middle of the eyes and the center of the mouth).
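
The backward calculation can be sketched as follows. This is plain C++ written for this tutorial, not an eblearn function: minimum_input_size is a made-up name, and the rule for s-layers (the size is divided by N, as when Facenet goes from 36x36 to 18x18 maps) is taken from the sizes quoted for Facenet.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: walks the layers from the 1x1 output back to the input.
// kinds[i] is 'C' or 'S', kernels[i] is the square kernel size of layer i.
int minimum_input_size(const std::string& kinds, const std::vector<int>& kernels) {
    assert(kinds.size() == kernels.size());
    int size = 1;  // the last feature maps must be 1x1
    for (std::size_t i = kinds.size(); i-- > 0;) {
        assert(kinds[i] == 'C' || kinds[i] == 'S');
        assert(kernels[i] > 0);
        if (kinds[i] == 'C')
            size += kernels[i] - 1;  // undo the N-1 pixels lost by the convolution
        else
            size *= kernels[i];      // undo the division by N of the subsampling
    }
    return size;
}
```

For the Facenet layers, minimum_input_size("CSCSC", {7, 2, 7, 2, 6}) goes through 1, 6, 12, 18, 36 and returns 42.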

For another example of CSCSCF architecture, see http://yann.lecun.com/exdb/publis/pdf/huang-lecun-06.pdf.

II.The training data

When you have finally settled on a structure, it's time to prepare the data you will use to train your network. For that, you will need images and the corresponding data (providing the location of the objects in the images). For a satisfactory training, you should have at least 1000 to 2000 instances of each type of object you want to train your network to detect: the more you have, the better the training will be (but you should have roughly the same number of instances of each object). You then have to extract these instances from the images and rescale them to the minimum size (as calculated above).

TIP : A useful trick if you want to build a detector that will be robust to scale and angle variations is to randomly generate around 10 instances with various scales and angles for each extracted instance of object.

To illustrate this, let's use the Facenet code. We had 2 types of objects, face and background (everything else), one directory of images, and a file with the corresponding data. Each line of this file corresponded to an instance of a face in an image, with the following information:

"name_of_the_image_file x_lefteye y_lefteye x_righteye y_righteye x_nose y_nose x_leftcornermouth y_leftcornermouth x_centermouth y_centermouth x_rightcornermouth y_rightcornermouth"

1.The parser

The first thing was to extract useful information from each line of the information file. We decided that we needed the name of the image to open, the coordinates of the middle of the eyes and the coordinates of the center of the mouth : with a rather simple model, this is enough to locate the face in the opened picture and to crop and resize it.

The parser is a rather classical char* parser named lineparser (see the comments of the code for details on how to parse a char*).

2.Extracting the instances of faces

The function is called extract_faces and is based on the following algorithm:

  • for each line in the information file

    • extract the relevant information with lineparser

    • open the image and calculate the size and angle of the face

    • if the face is too big (more than twice the minimum size calculated), subsample the image by 2 until it is small enough (this is to prevent clipping if we rescale it at once) and calculate the new size and location of the face in the new image

    • do 10 times :

      • calculate random resizing and rotating coefficients

      • rotate and resize the whole image (to prevent border effects)

      • if you can extract a 42x42 pixel image centered on the middle of the eyes from the rotated and rescaled image:

        • crop the 42x42 image and add it to the matrix where these images are saved

        • save the corresponding data in a txt file

        • mirror the 42x42 image about its vertical axis (left-right flip)

        • save the new image

        • save the corresponding data

Adding the mirrored image (flipped about its vertical axis) is another trick to make sure that the detector will detect faces turned slightly to the right as well as faces turned slightly to the left.
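
As an illustration, a left-right flip can be written in a few lines. This sketch stores the image as a plain vector of rows instead of an eblearn Idx, and Image and mirror_left_right are names chosen for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical image type for this sketch: grayscale pixels stored row by row.
using Image = std::vector<std::vector<double>>;

// Returns a copy of the image mirrored about its vertical axis:
// the leftmost column becomes the rightmost one and so on.
Image mirror_left_right(const Image& img) {
    Image out = img;
    for (auto& row : out) {
        assert(row.size() == out.front().size());  // all rows have the same width
        std::reverse(row.begin(), row.end());
    }
    return out;
}
```

Each extracted 42x42 patch can then be saved twice, once as it is and once mirrored, which doubles the number of training instances.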

TIP : The random resizing is done with a coefficient between 1/sqrt(sqrt(2)) and sqrt(sqrt(2)), and the rotation angle is between -30° and 30°. You don't need to change the rescaling min and max factors (see “building the detector”), but if you are thinking about changing the maximum rotation angle (to detect objects whatever their angle), it might be a better idea to create more object categories (for example “face looking down”, “face looking up” and so forth), or your feature maps will be too generic and your detector won't provide useful results.

As you may note, the extract_faces function also provides a means to append the newly extracted images to existing ones, by providing an existing .mat file (.mat is not compulsory, you can save your matrices in any format, but this one seems somewhat more appropriate, if only as a reminder of its content...). This allows you to extract images from several directories, but be careful either not to mix different categories in the same matrix, or to carefully keep track of the corresponding labels.
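
A possible way to draw these coefficients with the C++ standard library is sketched below. The tutorial does not say which distribution is used, so this sketch assumes a scale that is uniform in log space and a uniform angle; Jitter and random_jitter are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical pair of coefficients applied to one extracted face.
struct Jitter {
    double scale;      // resizing coefficient
    double angle_deg;  // rotation angle in degrees
};

// Draws one random (scale, angle) pair.
// Assumption: the scale is uniform in log space and the angle is uniform.
Jitter random_jitter(std::mt19937& rng, double max_angle_deg = 30.0) {
    assert(max_angle_deg >= 0.0);
    // 2^(-1/4) .. 2^(1/4) is the same range as 1/sqrt(sqrt(2)) .. sqrt(sqrt(2)).
    std::uniform_real_distribution<double> exponent(-0.25, 0.25);
    std::uniform_real_distribution<double> angle(-max_angle_deg, max_angle_deg);
    return {std::pow(2.0, exponent(rng)), angle(rng)};
}
```

Drawing the exponent rather than the scale itself makes enlarging and shrinking equally likely; a uniform draw on the scale would be just as easy to write.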

Last, we haven't implemented the function to extract background 42x42 patches, but the algorithm is very similar :

  • for each line in the information file

    • extract the name of the image to open with lineparser

    • open the image

    • do 20 times :

      • calculate random coordinates for the top-left corner, leaving a 42 pixel margin on the right and at the bottom

      • crop the 42x42 image and add it to the matrix where these images are saved

      • save the corresponding data in a txt file

The chance of finding a perfect face in these random patches is extremely small, and having partial faces as background will help your detector to place any face it finds more accurately.
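
Since this function is not implemented in Facenet, here is only a sketch of the random coordinate step, in plain C++ with made-up names (Patch, random_patch). It assumes that the coordinates are those of the top-left corner of the patch.

```cpp
#include <cassert>
#include <random>

// Hypothetical result type: top-left corner of a patch in the image.
struct Patch {
    int x, y;
};

// Draws the top-left corner of a size x size patch that fits inside the image.
Patch random_patch(std::mt19937& rng, int width, int height, int size = 42) {
    assert(size > 0 && width >= size && height >= size);
    // Leaving a margin of `size` pixels on the right and at the bottom
    // guarantees that the whole patch stays inside the image.
    std::uniform_int_distribution<int> pick_x(0, width - size);
    std::uniform_int_distribution<int> pick_y(0, height - size);
    Patch p;
    p.x = pick_x(rng);
    p.y = pick_y(rng);
    return p;
}
```

Calling it 20 times per image reproduces the inner loop of the algorithm above.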

III.Training your net

For the rest of this tutorial we will mostly rely on Facenet source code, with hints as to how to handle more than 2 categories, but we also encourage you to look at the Detector2D example provided in the eblearn library.

We will decompose this phase in 3 parts : creating your network, creating the sets of data from the matrices previously built and finally training your network.

1.Creating the network

Let's have a look at the facenet class (at least what we need of it at the moment):

class facenet : public net_cscscf {
public:
    parameter *theparam;
    Idx<intg> table0, table1, table2;
    Idx<ubyte> *labels;
    Idx<double> *targets;
    Idx<double> smoothing_kernel, highpass_kernel;

    //! creates the net
    //! @param paramfile is the file containing the weights, if your net has already been trained.
    facenet(const char *paramfile = NULL);
    ~facenet();
};

The facenet class is a net_cscscf, which is a ready-to-use net provided by the eblearn library. Should you want to create your own structure, take a look at the Net.h and Net.cpp files.

The parameter is the memory space where the weights will be stored; the tables are connection tables for the c-layers and contain the complete list of which input feature map is connected to which output feature map; the labels vector is a list of the categories of objects (usually only referred to as '0', '1', etc.) and targets is a matrix with 1.5 on diagonal terms and -1.5 on all the other terms : it is used to determine the reward of the network for each proposed output during the training. Finally, the 2 kernels are used to pre-treat the training images.
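
To make the shape of targets concrete, here is a small sketch that builds such a matrix with standard containers instead of an Idx. The name make_targets is made up for this example; the values 1.5 and -1.5 are the ones given above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: builds an N x N matrix with `value` on the diagonal
// and `-value` everywhere else. Row c is the desired output for category c.
std::vector<std::vector<double>> make_targets(int categories, double value = 1.5) {
    assert(categories > 0);
    std::vector<std::vector<double>> targets(
        categories, std::vector<double>(categories, -value));
    for (int c = 0; c < categories; ++c) targets[c][c] = value;
    return targets;
}
```

With 2 categories this gives the rows (1.5, -1.5) and (-1.5, 1.5); with more categories you only have to change the argument.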

The constructor of facenet follows the following algorithm:

  • create a new parameter with enough space to store all the weights (make it large...)

  • create the labels and targets

  • create the tables needed for the c-layers

  • initialize the net (see the net_cscscf functions for the details)

  • load some weights if needed (we don't need to for training, but it is useful to avoid training each time you want to use the detector)

  • create the kernels

TIP : As you can see, the constructor is pretty straightforward, but you CANNOT switch the initialization of the net and the loading of precalculated weights. The reason is that when you initialize the net, each layer reserves some space in the parameter and fills it with zeros, even if there were weights in this space. So if you want to load weights, be sure to load them after initializing the net.

TIP : we provide 2 functions to generate tables for the c-layers : full_table and custom_table.

full_table will provide a full table (pretty obvious, isn't it ?): for full_table(1, 6), the connections are 0->0, 0->1, 0->2, 0->3, 0->4 and 0->5.

custom_table(6, 16) provides the connections :

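(x marks a connection between an input map and an output map, . marks no connection)

in\out   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0        x  .  .  .  x  x  x  .  .  x  x  x  x  .  x  x
1        x  x  .  .  .  x  x  x  .  .  x  x  x  x  x  x
2        x  x  x  .  .  .  x  x  x  .  .  x  x  x  x  x
3        .  x  x  x  .  .  .  x  x  x  .  .  x  x  x  x
4        .  .  x  x  x  .  .  x  x  x  x  .  .  x  x  x
5        .  .  .  x  x  x  .  .  x  x  x  x  .  .  .  x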

This avoids the large number of connections you would get by always using full tables, and helps to create different feature maps. You may also note that some feature maps share the same connection table. That is not really important, since they will in turn be connected to different feature maps by the next c-layer. Of course, you are also welcome to design your own connection tables.
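
To see what a connection table contains and what it costs, here is a sketch that represents a table as a list of (input map, output map) pairs. It is not the eblearn full_table function: Table, make_full_table and kernel_weight_count are names made up for this example, and the real tables are stored in Idx<intg>.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical representation: a connection table as (input map, output map) pairs.
using Table = std::vector<std::pair<int, int>>;

// Connects every input map to every output map.
Table make_full_table(int inputs, int outputs) {
    assert(inputs > 0 && outputs > 0);
    Table table;
    table.reserve(static_cast<std::size_t>(inputs) * outputs);
    for (int out = 0; out < outputs; ++out)
        for (int in = 0; in < inputs; ++in)
            table.emplace_back(in, out);
    return table;
}

// Number of kernel weights needed by a c-layer: one kernel x kernel
// convolution kernel per connection (biases are not counted here).
std::size_t kernel_weight_count(const Table& table, int kernel) {
    assert(kernel > 0);
    return table.size() * static_cast<std::size_t>(kernel) * kernel;
}
```

A full table between 6 inputs and 16 outputs has 96 connections, that is 96 x 49 = 4704 kernel weights with 7x7 kernels; every connection removed from the table saves 49 weights.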

2.Creating the sets of data

To train the network accurately and to measure how well the training went, we need to build 2 sets of data: a training set and a validation set. Each of these sets must contain images from all the categories (preferably the same number of each) and the associated labels.

Let's have a look at the facenet_datasource class, which will be used to handle the sets :

class facenet_datasource : public LabeledDataSource<double, ubyte> {
public:
    int width, height;
    double mymean, mydeviation;

    facenet_datasource(Idx<double> *inputs, Idx<ubyte> *labels, double m = 0, double d = 0, bool pretreatment = false);
};

(There are other functions, but we will let you have a look at them by yourself.)

The main interest of the LabeledDataSource class is that it stores a pointer to the data idx (N*height*width, here images stored as doubles) and a pointer to the corresponding labels idx (N*1 ubytes).

The constructor takes directly the pointers to these structures, but can pre-treat them (high-pass filtering and normalization)  if you want. You can also store the resulting idxs so that you don't have to do the pretreatment each time (it can be quite time-consuming, since you should have more than 20,000 42x42 images of each category to treat).

TIP : Pre-treating the data is very important: the high-pass filtering emphasizes the edges in the images, helping your network to focus on important features, and the normalization brings your raw images (with pixel values from -255 to 255 after the filtering) to a range much more acceptable as an input for your detector (around -2.5 to 2.5, therefore saved in doubles). Keep in mind that after each layer, a squashing function is used to keep the values of the intermediate outputs between -1.5 and 1.5. For more hints on pretreatment, have a look at http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf.

Now that we know how to handle the 2 sets of data, we need to create them. This is what facenet::build_databases does. Here is the algorithm used:
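
The normalization step can be sketched as follows. This is not the facenet_datasource code: it assumes that normalizing means subtracting the mean and dividing by the standard deviation of the pixel values, and it works on a plain vector that can hold one image or a whole set of images. The name normalize is chosen for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: shifts and scales the pixel values in place so that
// they have zero mean and unit standard deviation.
void normalize(std::vector<double>& pixels) {
    assert(!pixels.empty());
    double mean = 0.0;
    for (double p : pixels) mean += p;
    mean /= pixels.size();

    double variance = 0.0;
    for (double p : pixels) variance += (p - mean) * (p - mean);
    double deviation = std::sqrt(variance / pixels.size());
    if (deviation == 0.0) deviation = 1.0;  // constant image: only remove the mean

    for (double& p : pixels) p = (p - mean) / deviation;
}
```

If the mean and deviation are computed once on the training set, the same two values have to be reused for the validation set and later for the detector, which is presumably why the constructor above accepts m and d as arguments.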

  • load the matrices of faces and background and the corresponding labels (just a lot of ones for the faces and zeros for the background) in 4 idxs of the appropriate types

  • initialize the training and validation data and labels idxs

  • for each pair of face/background images until min(number of face images, number of background images)

    • if random < ratio

      • append the face image to the training data idx, and the corresponding label to the training labels idx

      • append the background image to the training data idx, and the corresponding label to the training labels idx

    • else (when random >= ratio)

      • append the face image to the validation data idx, and the corresponding label to the validation labels idx

      • append the background image to the validation data idx, and the corresponding label to the validation labels idx

  • create the 2 facenet_datasource with the pretreatment option on

  • save the idxs (so that you only have to do this once, as it is also time consuming)

The ratio is a means to determine how many of your images will go in each set. You won't need a huge validation set (3000 images of each category should be sufficient).

TIP : Note that we alternate images from each category. It is very important to alternate categories, at least in the training set. If you do not alternate, the first few images will overtrain the weights into producing the same output whatever the input, and the rest of the images won't be sufficient to get out of this dead end. If you fall into this case, you'll know it immediately after the first testing session (see “training”): you will have 1/NofCategories of your images well classified, while the rest won't be.
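
The splitting and alternating logic can be sketched with indices only, leaving the image data aside. This is plain C++, not facenet::build_databases; Sample, Sets and split_alternating are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical record: which image of which category goes into a set.
struct Sample {
    int label;          // 1 = face, 0 = background
    std::size_t index;  // position of the image in the matrix of its category
};

struct Sets {
    std::vector<Sample> training, validation;
};

// Sends each face/background pair to the training set with probability `ratio`,
// otherwise to the validation set. Within each set the categories alternate.
Sets split_alternating(std::size_t faces, std::size_t backgrounds,
                       double ratio, std::mt19937& rng) {
    assert(ratio >= 0.0 && ratio <= 1.0);
    std::uniform_real_distribution<double> draw(0.0, 1.0);
    Sets sets;
    const std::size_t pairs = faces < backgrounds ? faces : backgrounds;
    for (std::size_t i = 0; i < pairs; ++i) {
        std::vector<Sample>& target =
            (draw(rng) < ratio) ? sets.training : sets.validation;
        target.push_back({1, i});  // the face first...
        target.push_back({0, i});  // ...then the background
    }
    return sets;
}
```

Because a pair always goes to the same set, both sets contain exactly as many faces as backgrounds, in alternating order.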

3.Training your network

Now that we have a network and the data, we are ready to begin the training. We created 2 functions to do that : facenet::train and facenet::train_online.

The first function is the main training function : it creates the high-level structures that will be used to train and test the network, loads the training and validation databases, and finally calls train_online.

Before going deeper into train_online, let's have a look at the structures used for the training :

  • The idx3_supervised_module is a “super net”, which means it includes your net and adds to it a classifier (the max_classer) and a module that calculates the error between the output of the net and the desired output (euclidean distance).

graphics2

  • The supervised_gradient is the structure that performs the training: it computes the diagonal hessian matrix, then performs the forward and backward propagation and the corresponding update of the weights.

  • The classifier_meter is a structure that measures the performance of the set of weights on a dataset.

train_online is the function used to perform a one-pass training on the training dataset, then to measure the performance of the set of weights produced on the training dataset, then on the validation dataset, using the previous structures. Finally it saves the current weights as a .mat.

We then have to design a training scheme: usually we start with a high adaptation parameter (around 10^-4) to get some results quickly and perform a few passes with train_online, then decrease the parameter by a factor of 5 or 10 to tune the training more finely and perform some more passes, and so on. Usually the performance increases at the beginning, up to the moment where it begins to decrease: this means that we have overtrained our network. You can look at the train function for a basic example, but you should also look at the gd_param class of the eblearn library, which provides a lot of options to tune your training scheme.

TIP : the training is time consuming, so you might want to make sure that it is working before going for the full training scheme: try to run it for one pass only, to see whether you have a problem with your data (see the previous tip about alternating the categories...)
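
Such a scheme can be written down as a list of adaptation parameters, one per pass. The sketch below is plain C++ and does not use gd_param; training_schedule and best_pass are made-up names, and the choice of keeping the weights of the pass with the lowest validation error is our assumption of how to react to overtraining.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: returns the adaptation parameter to use for each pass.
// The parameter is divided by `factor` after every `passes_per_stage` passes.
std::vector<double> training_schedule(double initial, double factor,
                                      int stages, int passes_per_stage) {
    assert(initial > 0.0 && factor > 1.0 && stages > 0 && passes_per_stage > 0);
    std::vector<double> etas;
    double eta = initial;
    for (int s = 0; s < stages; ++s) {
        for (int p = 0; p < passes_per_stage; ++p) etas.push_back(eta);
        eta /= factor;
    }
    return etas;
}

// Index of the pass with the lowest validation error: the weights saved
// after that pass are the ones to keep if later passes overtrain.
std::size_t best_pass(const std::vector<double>& validation_errors) {
    assert(!validation_errors.empty());
    auto best = std::min_element(validation_errors.begin(), validation_errors.end());
    return static_cast<std::size_t>(std::distance(validation_errors.begin(), best));
}
```

For example, training_schedule(1e-4, 10, 3, 2) describes 6 passes: 2 at 10^-4, 2 at 10^-5 and 2 at 10^-6. Since train_online saves the weights after each pass, best_pass tells you which saved file to reload.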

TIP : remember that before each training (at the beginning of train ), the weights are randomly initialized, so the same training scheme will not give the same results when run twice (even if it should be of the same order). For this reason, try to run your design a second time before radically changing it because you are not satisfied with the results...

FINAL TIP ON THE TRAINING : Remember that the performance measured by the classifier_meters cannot be used to compare your network to any other network design: it is only provided to help you choose between several sets of weights for the given network. You will only know the real performance of your network after using it on full-scale images, which leads us to the creation of the detector.

Just as a hint of what you should be able to achieve, we could quite easily train a net that had less than 8% error rate on both the training and validation set.

IV.Building the detector

Now, you have a trained network, and of course you want to use it on full-scale images. It's time to build the detector.

1.Presentation of the algorithm

Let's make a list of everything the detector should do :

  • take a picture as input, whatever the size of the picture

  • be able to spot faces (or objects) everywhere on the picture, whatever the size of these objects, and be able to tell us where they are

  • give us accurate results (where we will most likely find faces or objects, and of which category)

For the first point, we do not need to change anything: the layers can deal with inputs of any size and simply change the size of their outputs. Example: if you give a 200x200 pixel input to the C0 layer, the output of this layer will be 6 feature maps of 194x194 pixels (with the 7x7 kernels described above), and the final output of the network will be 2 “images” of 40x40 containing the output scores for the 2 categories “face” and “background”, each “pixel” corresponding to a 42x42 patch in the image, separated from its neighboring patches by 4 pixels.

For the other points, however, there is much more work to do. You might remember that we trained the network with random size changes on the face pictures, to make it robust to size variations. This is useful, but it is not enough to detect any face whatever its size. For that, we have to run the detector at multiple sizes of the picture: every 1/sqrt(2) size factor, beginning at the full-scale image and finishing at the minimum size (42x42 pixels in the case of facenet). So as long as we know how to detect a face at one scale, we only have to run the same algorithm multiple times to detect faces at every scale.
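
The list of sizes can be computed as in the following sketch. It is not facenet::calc_sizes: Size and pyramid_sizes are made-up names, and rounding to the nearest pixel and stopping at the last size that is still at least 42 pixels on both sides are assumptions of this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical size type for this sketch.
struct Size {
    int width, height;
};

// Lists the image sizes to process: each one is 1/sqrt(2) times the previous,
// stopping before either side drops under the minimum input size.
std::vector<Size> pyramid_sizes(int width, int height, int min_size = 42) {
    assert(min_size > 0 && width >= min_size && height >= min_size);
    std::vector<Size> sizes;
    double factor = 1.0;
    while (true) {
        // Assumption: sizes are rounded to the nearest pixel.
        const int w = static_cast<int>(std::lround(width * factor));
        const int h = static_cast<int>(std::lround(height * factor));
        if (w < min_size || h < min_size) break;
        sizes.push_back({w, h});
        factor /= std::sqrt(2.0);
    }
    return sizes;
}
```

For a 200x200 picture this gives 5 scales: 200, 141, 100, 71 and 50 pixels. The next one (about 35 pixels) would be smaller than the 42x42 input of the network.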

The algorithm to detect faces at one scale is quite simple :

  • pre-treat the image we are going to feed the network with

  • perform a forward propagation on the image

  • find the local maxima of the output (and the corresponding category if it's relevant), corresponding to the probable faces on the picture

  • calculate the corresponding coordinates on the input image

TIP : it is vital to perform the exact same pretreatment on the input that you performed on the training dataset (in our case, high-pass filtering, then normalization). Indeed, the network is trained on a certain type of input, so if you change that type of input, you won't get good results...

Once we have fed our network with multiple-scale inputs, we have N lists of possible locations for faces, one per scale : we have to merge them, as faces are more likely to be where multiple scales had a maximum.

 As the algorithm is quite straightforward, we will now see how we implemented it for facenet.

2.Implementation

We assume that you have created your net using the constructor previously described, but this time giving it your stored weights file as argument.

The main function of the detector is facenet::detect. It first creates the input, output and result structures, which are stored as a vector of state_idxs for the inputs and outputs (one per scale), and a vector of idxs for the results. It then calls facenet::multi_res_prep, which prepares and pretreats the images at each scale. The next step is calling facenet::multi_res_fprop, which performs the forward propagation at each scale, selects the maxima at each scale and merges the results. Finally detect draws rectangles on the input picture where faces have been detected.

facenet::multi_res_prep is quite easy to understand :

  • it first resizes your image either to its own size or to the biggest size the network has to work on, whichever is bigger, and stores it in display, so that you can keep a clean version of it.

  • Then for each given size, it :

    • resizes the image to the right size

    • high-pass filters it (by a convolution with a high-pass kernel)

    • normalizes it

    • stores it in the input vector at the right place

facenet::multi_res_fprop does the following :

  • it performs a forward propagation at each scale, the outputs being stored in the output vector

  • it calls facenet::postprocess_output, which extracts the results for each scale

  • it then calls facenet::prune, which merges the results of each scale into one single list of results

  • Finally it prints the previous list

facenet::postprocess_output runs the following algorithm :

  • for each output

    • smooth it (by a convolution with a smoothing kernel)

    • call mark_maxima on it to get a list of local maxima, stored in the results vector

  • create a global result list

  • for each list of results

    • for each result in the list

      • if the score of the face (or object) category is better than a chosen threshold, put it in the global list, along with the calculated position (taking the scale into account) and the calculated size of the face

  • return the global list

facenet::mark_maxima just looks at each point: if its score for the “face” category is higher than the scores of its 8 neighbors in the output idx, it copies this score into the result idx. However, if you are building a multiple-object detector, this function should first determine which category has the highest score for the considered output pixel, then compare it to the scores of its neighbors for the same category.

Finally, facenet::prune is called on the global result list and compares each result to all the other results: if there is another result whose score is better and whose rectangle overlaps the rectangle of the result we are considering by more than a given percentage, we discard the result we were considering. This way we get rid of very close peaks, which are very likely to represent the same face: we only keep the best of each neighborhood. The type of object is not considered here: 2 close peaks are likely to be for the same object, so we consider that if there are multiple hits around it, the highest score gives the right category.
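
The test on the 8 neighbors can be sketched as follows for a single category. This is not the facenet::mark_maxima code: the score map is a plain vector of rows, Map, Peak and local_maxima are made-up names, and comparing border points only with the neighbors that exist is an assumption of this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for this sketch: one score map and one detected peak.
using Map = std::vector<std::vector<double>>;

struct Peak {
    int row, col;
    double score;
};

// Returns the points whose score is strictly higher than the score of
// all their neighbors (8 inside the map, fewer on the borders).
std::vector<Peak> local_maxima(const Map& scores) {
    const int rows = static_cast<int>(scores.size());
    const int cols = rows > 0 ? static_cast<int>(scores[0].size()) : 0;
    for (const auto& row : scores)
        assert(static_cast<int>(row.size()) == cols);  // rectangular map

    std::vector<Peak> peaks;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            bool is_max = true;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int nr = r + dr, nc = c + dc;
                    if ((dr == 0 && dc == 0) || nr < 0 || nr >= rows ||
                        nc < 0 || nc >= cols)
                        continue;  // skip the point itself and outside points
                    if (scores[nr][nc] >= scores[r][c]) is_max = false;
                }
            }
            if (is_max) peaks.push_back({r, c, scores[r][c]});
        }
    }
    return peaks;
}
```

For a multiple-object detector, the same function could be run on a map holding, for each point, the score of its best category, provided you also check that the neighbors are compared on that same category.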
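
The pruning rule can be sketched like this. It is not the facenet::prune code: Detection, overlap and prune_overlaps are made-up names, and measuring the overlap as the fraction of the considered rectangle covered by the other one is an assumption, since other definitions of the percentage are possible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: a rectangle in input-image coordinates and its score.
struct Detection {
    double x, y, width, height;
    double score;
};

// Fraction of the area of `a` that is covered by `b` (assumed overlap measure).
double overlap(const Detection& a, const Detection& b) {
    assert(a.width > 0.0 && a.height > 0.0);
    const double w = std::min(a.x + a.width, b.x + b.width) - std::max(a.x, b.x);
    const double h = std::min(a.y + a.height, b.y + b.height) - std::max(a.y, b.y);
    if (w <= 0.0 || h <= 0.0) return 0.0;
    return (w * h) / (a.width * a.height);
}

// Keeps a detection only if no better-scoring detection overlaps it
// by more than `max_overlap` (a fraction between 0 and 1).
std::vector<Detection> prune_overlaps(const std::vector<Detection>& all,
                                      double max_overlap) {
    std::vector<Detection> kept;
    for (std::size_t i = 0; i < all.size(); ++i) {
        bool discard = false;
        for (std::size_t j = 0; j < all.size() && !discard; ++j) {
            if (j != i && all[j].score > all[i].score &&
                overlap(all[i], all[j]) > max_overlap)
                discard = true;
        }
        if (!discard) kept.push_back(all[i]);
    }
    return kept;
}
```

Every result is compared to every other one, so the cost grows with the square of the number of results; this is another reason to choose the threshold of postprocess_output carefully.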

As you can see, the previous algorithm is followed very closely. Here are a few ways to tune the detector:

  • The threshold is the main parameter to control the number of results your network will consider as viable. The most radical way is to create a network that only returns the best result. This way you don't need to worry about the threshold. Otherwise, there is no way to know beforehand what threshold is the most appropriate: you will have to run the detector on several images to find which threshold suits you best. A more scientific way to proceed is to run the detector on several images and find the threshold that detects most of the faces while minimizing the false positives (a high score where there is no face).

  • If you run the detector on images with faces quite close to each other, it might be a good idea to change in prune the percentage of overlapping rectangles above which a result is discarded.

As you will see, we also provide a function, facenet::calc_sizes, to calculate the sizes of the different scales. This function does not provide the sizes of the inputs but the sizes of the outputs. From a general point of view, be extra careful when dealing with positions or sizes of the inputs and outputs. Don't forget that you “lose” some pixels due to border effects in the forward propagation:

graphics3

At a 1:1 scale (which means the face will be the same size as the ones used to train the system), a high score in the red pixel of the output matrix means that there is a face in the 42x42 red square in the input image, which is not an intuitive result. So when the scale is different, be careful when you're calculating the place of the face in the input (for our system it's (23 + (4 * j)) * scale, where j is the place of the output pixel (on either axis)).
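
The formula above can be wrapped in a small function. This is a sketch with made-up names (network_stride, output_to_input): the offset of 23 pixels is the value quoted above for Facenet, and the stride of 4 pixels is taken to be the product of the two 2x2 subsampling factors.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: distance in input pixels between two neighboring
// output pixels, i.e. the product of the subsampling factors.
int network_stride(const std::vector<int>& subsampling_kernels) {
    int stride = 1;
    for (int k : subsampling_kernels) {
        assert(k > 0);
        stride *= k;
    }
    return stride;
}

// Position in the input image (on one axis) of the face reported by output
// pixel j, following the formula (offset + stride * j) * scale.
// The default offset and stride are the values quoted for Facenet.
double output_to_input(int j, double scale, int offset = 23, int stride = 4) {
    assert(j >= 0 && scale > 0.0);
    return (offset + stride * j) * scale;
}
```

With network_stride({2, 2}) equal to 4, output_to_input(10, 2.0) gives (23 + 40) * 2 = 126. For another architecture both the offset and the stride have to be recomputed.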

TIP : the X and Y axes are not the same when displaying the image and when storing it in an idx, so be careful when drawing the rectangles (see detect ).

3.Visualizing what's going on

To visualize the result, the best way is to use the eblearn_gui library (see its readme for installation help and a tutorial on how to use it). You can very easily display the image with the rectangles around the faces by using the ebwindow class and its gray_draw_matrix method.

We also created the facenet::show_net method, which provides a display of either the different calculated kernels, or of the intermediate outputs in the network, or both. Note that if you want to visualize the outputs for a given size, you will have to run the network with this size as last given size, since the show_net method only displays the last calculated intermediate results.

graphics4

Output of the c0-layer

graphics5 graphics6

9 of the 16 outputs of the c1-layer

graphics7

9 of the 80 outputs of the c2-layer

graphics8

Outputs of the network: scores for the background (left) and for the faces (right), and corresponding rectangles on the input image (below)

V.Conclusion

You should now have a working detector for faces (or for whatever kind of objects you chose). The next step is to automate the detection over a whole set of images, and then compare the results you obtain to the best published results!

However if you're like us, you are likely to notice that your detector is returning a lot of false positives (it tells you there are objects where there aren't) or wrong categories. The solutions to these problems are:

  • use a bigger training dataset: the more different examples of each category you have, the better your detector will be able to differentiate them

  • run your detector on a set of images where you know there is none of the objects it is trained to detect, and add all the patches of images that trigger high scores to the background category (which is a category that you must have, since it is the category in which all that is not a known object will be classified). This way, you will decrease the number of false positives (which will be correctly classified as background instead of object).

We hope this tutorial has helped you in discovering how the eblearn C++ library works !