Tutorial 2: Creating and training a simple digit classifier

This tutorial explains classification in detail; if you only want a general idea, you can check the MNIST demo instead.

The Dataset

In this tutorial, we shall build a handwritten digit classifier, using the datasets from the previous tutorial as the train/test data. If you did not go through that tutorial or don't want to, the compiled dataset used in this tutorial can be downloaded here: mnist_compiled_data.zip.

You should unzip this to some folder (Example: /home/rex/eb_dataset/)
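For instance, assuming you downloaded the archive into your current directory (the destination folder is just an example; any location works):

mkdir -p /home/rex/eb_dataset
unzip mnist_compiled_data.zip -d /home/rex/eb_dataset

After extraction, the folder should contain the four .mat files that we will point the trainer to later in this tutorial (mnist_train_data.mat, mnist_train_labels.mat, mnist_test_data.mat and mnist_test_labels.mat).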

The Classifier

Convolution Networks

In this tutorial, we shall build a convolution network that distinguishes different digits from one another and gives the correct label to each input digit image in our test set.

If you do not know what a convolution network (convnet) is, we recommend that you read the ConvNets section for a quick starter on the concepts behind convnets.

In this tutorial, we are going to build a convnet with the architecture given in Figure 1. This architecture is commonly referred to as LeNet-5.

Figure 1: Architecture of LeNet-5, a Convolutional Neural Network, here for digit recognition. Each plane is a feature map, i.e. a set of units whose weights are constrained to be identical. C1 has a 5×5 convolution kernel, S2 has a 2×2 subsampling kernel, C3 has a 5×5 kernel, S4 has a 2×2 kernel, and C5 has a 5×5 kernel.

Step 1: List out what you know already

Let us first list what we already know about the convnet we want to build.

  • The input images are 32×32 in size with 1 channel (i.e. grayscale, not color).
  • From Figure 1, we can see that there are 6 layers in our convnet (a quick size check follows this list):
    • Layer C1 is a convolution layer with 6 feature maps and a 5×5 kernel for each feature map.
    • Layer S2 is a subsampling layer with 6 feature maps and a 2×2 kernel for each feature map.
    • Layer C3 is a convolution layer with 16 feature maps and a 5×5 kernel for each feature map.
    • Layer S4 is a subsampling layer with 16 feature maps and a 2×2 kernel for each feature map.
    • Layer C5 is a convolution layer with 120 feature maps and a 5×5 kernel for each feature map.
    • Layer F6 is a fully connected layer.
  • Let us use an additive bias and a tanh non-linearity for each layer (several other non-linearities can be used, like sigmoid, softmax etc.). If you do not understand why we are doing this, please look at the ConvNets section.
  • Let us assume full connections between layers, i.e. all feature maps in layer X are connected to all feature maps in layer X+1. In the next tutorial, you will see how to define more complicated connections.
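As promised, here is a quick sanity check of those numbers: a stride-1 convolution with a k×k kernel shrinks an n×n map to (n-k+1)×(n-k+1), and a 2×2 subsampling halves each dimension.

Layer   Operation   Output size
input               1 x 32 x 32
C1      conv 5x5    6 x 28 x 28    (32 - 5 + 1 = 28)
S2      pool 2x2    6 x 14 x 14
C3      conv 5x5    16 x 10 x 10   (14 - 5 + 1 = 10)
S4      pool 2x2    16 x 5 x 5
C5      conv 5x5    120 x 1 x 1    (5 - 5 + 1 = 1)
F6      linear      10 outputs     (one per digit class)

C5 ends up with 1×1 feature maps, so it effectively acts as a fully connected layer feeding the classifier.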

Step 2: Write your architecture into a configuration file that eblearn understands

In this step, we are going to write a very simple configuration file that passes several parameters to a generic training binary. The trainer reads the conf file and trains the network accordingly. We adopted this method because writing and changing a conf file is much easier and less cumbersome than writing your own C++ trainer and recompiling it every time you change your architecture or parameters. We are going to do the following next:

  1. Have a quick syntax walk-through of a conf file
  2. Add paths for our dataset

First, create a file named tutorial2.conf and open it with your favorite text editor.

Syntax walk-through of your conf file

The conf file that we are going to write has a very simple syntax, similar to Unix shell scripting.

Comments start with a #

Variables are defined without a $ but are referred to with a $. For example:

a = hi   # defining variable 'a' to be the string "hi"
b = ${a} # here b will become "hi"
b = a    # here b will become the string "a"
c = 10   # c takes the value 10

Isn't that simple :)
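One more trick that we will use below: a variable reference can be concatenated with plain text to build longer strings. For example (these are the actual names used in the architecture section that follows):

pool = l2pool   # choose a pooling type
s1   = ${pool}1 # s1 becomes the string "l2pool1"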

Describing your ConvNet Architecture

To describe the convnet architecture shown in Figure 1, we use the syntax scheme we just covered and write our configuration file.

The key to describing the architecture is the arch variable. The training utility looks for this variable and derives the architecture from it. Since we want our architecture to be cscscf, i.e. convolution, bias, non-linearity, subsampling, convolution, bias, non-linearity, subsampling, convolution, bias, non-linearity, and fully connected (as described in the image), let us write our arch variable to reflect that. In our convnet architecture, after each convolution layer we also add an “absolute” layer and a subtractive normalization layer, which usually improve performance.

  • A convolution layer is described with the prefix “conv” and a unique number.
  • An average subsampling layer is described with the prefix “subs” whereas an L2-Pooling layer is described using the prefix “l2pool”
  • We shall use the tanh non-linearity described with the prefix “tanh”
  • An additive bias layer has the prefix “addc”
  • An abs layer has the prefix “abs”
  • A subtractive normalization layer has the prefix “wstd”

Each layer is separated from the next by a comma (,).

arch=conv0,addc0,tanh,abs0,wstd0,l2pool1,addc1,tanh,conv2,addc2,tanh,abs2,wstd2,l2pool3,addc3,tanh,conv5,addc5,tanh,linear7,addc7,tanh

As you can see, this is a very long line and gets very ugly. Hence, we can use our variable magic and write cleaner-looking syntax that produces the same result.

arch = ${features},${classifier}
features = ${c0},${s1},${c2},${s3}
classifier = ${c5},${f7}
nonlin          = tanh    # type of non-linearity
pool            = l2pool  # subs is another option
# main branch layers
c0              = conv0,addc0,${nonlin},abs0,wstd0
s1              = ${pool}1,addc1,${nonlin}
c2              = conv2,addc2,${nonlin},abs2,wstd2
s3              = ${pool}3,addc3,${nonlin}
c5              = conv5,addc5,${nonlin}
f7              = linear7,addc7,${nonlin}
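When the trainer expands these variables, arch resolves to exactly the long line shown earlier (for instance, ${pool}1 becomes l2pool1 because pool is set to l2pool). A nice side effect is that switching the whole network from L2-pooling to average subsampling is now a single-variable change: pool = subs.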

Now we have to describe each layer's properties. For example, conv0, as seen in the image, has a 5×5 convolution kernel and a 1×1 stride; l2pool1 has a 2×2 subsampling kernel (i.e. it halves the size of the feature map).

Describing these properties is simple. For each layer, you define each of its properties with a variable following the naming scheme [layer name]_[property] = [value].

For our architecture, the properties are described by the following code.

# main branch parameters
classifier_hidden = 16  # number of hidden units in 2-layer classifier 
                                                                                                                                                             
conv0_kernel    = 5x5 # convolution kernel sizes (hxw)                                                                                                                               
conv0_stride    = 1x1 # convolution strides  (hxw)                                                                                                                                   
conv0_table     =     # convolution table (optional)                                                                                                                                 
conv0_table_in  = 1   # conv input max, used if table file not defined                                                                                                               
conv0_table_out = 6   # features max, used if table file not defined                                                                                                                 
conv0_weights   =     # manual loading of weights (optional)                                                                                                                         
addc0_weights   =     # manual loading of weights (optional)                                                                                                                         
wstd0_kernel    = ${conv0_kernel} # normalization kernel sizes (hxw)                                                                                                                 
subs1_kernel    = 2x2 # subsampling kernel sizes (hxw)                                                                                                                               
subs1_stride    = ${subs1_kernel} # subsampling strides (hxw)                                                                                                                        
l2pool1_kernel  = 2x2 # subsampling kernel sizes (hxw)                                                                                                                               
l2pool1_stride  = ${l2pool1_kernel} # subsampling strides (hxw)                                                                                                                      
addc1_weights   = # manual loading of weights (optional)                                                                                                                             
conv2_kernel    = 5x5 # convolution kernel sizes (hxw)                                                                                                                               
conv2_stride    = 1x1 # convolution strides (hxw)                                                                                                                                    
#conv2_table     = ${tblroot}/table_6_16_connect_60.mat # conv table (optional)                                                                                                       
conv2_table_in  = thickness # use current thickness as max table input                                                                                                               
conv2_table_out = 16 # features max, used if table file not defined                                                                                                       
conv2_weights   =     # manual loading of weights (optional) 
addc2_weights   =     # manual loading of weights (optional)                                                                                                                         
wstd2_kernel    = ${conv2_kernel} # normalization kernel sizes (hxw)                                                                                                                 
subs3_kernel    = 2x2 # subsampling kernel sizes (hxw)                                                                                                                               
subs3_stride    = ${subs3_kernel} # subsampling strides (hxw)                                                                                                                        
l2pool3_kernel  = 2x2 # pooling kernel sizes (hxw)
l2pool3_stride  = ${l2pool3_kernel} # pooling strides (hxw)
addc3_weights   =     # manual loading of weights (optional)                                                                                                                         
conv5_kernel    = 5x5 # convolution kernel sizes (hxw)
conv5_stride    = 1x1 # convolution strides (hxw)
conv5_table_in  = thickness # use current thickness as max table input
conv5_table_out = 120 # features max, used if table file not defined
linear7_in      = thickness # use current thickness
linear7_out     = noutputs # use number of classes as max table output
# only needed if you add a hidden linear6 layer to the classifier:
linear6_in      = thickness # linear module input features size
linear6_out     = ${classifier_hidden}
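Note how downstream layers use thickness instead of hard-coded sizes, so the configuration stays easy to modify. For example, to give layer C1 eight feature maps instead of six, only one line needs to change; the later layers adapt automatically:

conv0_table_out = 8 # C1 now produces 8 feature maps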

With this, we have finished describing our architecture.

Describing Training Parameters

We have to describe a few more parameters. First, we have to give the trainer the paths to our dataset files (which we compiled with dscompile in the previous tutorial).

# training #####################################################################                                                                                                     
classification  = 1 # load datasets in classification mode, regression otherwise                                                                                                     
dataset_path    = /home/rex/eb_dataset # replace this with the folder containing your dataset files
train           = ${dataset_path}/mnist_train_data.mat # training data                                                                                                                    
train_labels    = ${dataset_path}/mnist_train_labels.mat # training labels        
# train_size      = 10000                            # limit number of samples                                                                                                          
val             = ${dataset_path}/mnist_test_data.mat  # validation data                                                                                                                  
val_labels      = ${dataset_path}/mnist_test_labels.mat  # validation labels          
# val_size        = 1000                            # limit number of samples
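If you just want a quick experiment, you can uncomment train_size and val_size to train and validate on a subset of the samples.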

We also have to describe a few more advanced parameters; if you do not understand all of these at this point, that is okay.

# energies & answers ###########################################################                                                                                                     
trainer         = trainable_module1  # the trainer module                                                                                                                            
trainable_module1_energy = l2_energy # type of energy                                                                                                                                
answer          = class_answer # how to infer answers from network raw outputs   
# hyper-parameters                                                                                                                                                                   
eta             = .0001  # learning rate                                                                                                                                             
reg             = 0      # regularization                                                                                                                                            
reg_l1          = ${reg} # L1 regularization                                                                                                                                         
reg_l2          = ${reg} # L2 regularization                                                                                                                                         
reg_time        = 0      # time (in samples) after which to start regularizing                                                                                                       
inertia         = 0.0    # gradient inertia                                                                                                                                          
anneal_value    = 0.0    # learning rate decay value                                                                                                                                 
anneal_period   = 0      # period (in samples) at which to decay learning rate                                                                                                       
gradient_threshold = 0.0
iterations      = 2     # number of training iterations                                                                                                                             
ndiaghessian    = 100    # number of samples for 2nd derivatives estimation
epoch_mode      = 1      # 0: fixed number 1: show all at least once                                                                                                                 
#epoch_size = 4000    # number of training samples per epoch. comment to ignore.                                                                                                     
epoch_show_modulo = 400  # print message every n training samples                                                                                                                    
sample_probabilities = 0 # use probabilities to pick samples                                                                                                                         
hardest_focus   = 1      # 0: focus on easiest samples 1: focus on hardest ones                                                                                                      
ignore_correct  = 0      # If 1, do not train on correctly classified samples                                                                                                        
min_sample_weight = 0    # minimum probability of each sample                                                                                                                        
per_class_norm  = 1      # normalize probability by class (1) or globally (0)
shuffle_passes  = 1      # shuffle samples between passes                                                                                                                            
balanced_training = 1    # show each class the same number of samples or not
random_class_order = 0   # class order is randomized or not when balanced                                                                                                            
no_training_test = 0     # do not test on training set if 1                                                                                                                          
no_testing_test = 0      # do not test on testing set if 1                                                                                                                           
max_testing     = 0      # limit testing to this number of samples                                                                                                                   
save_pickings   = 0      # save sample picking statistics                                                                                                                            
binary_target   = 0      # use only 1 output, -1 is negative, +1 positive                                                                                                            
test_only       = 0      # if 1, just test the data and return
save_weights    = 1      # if 0, do not save weights after each iteration
keep_outputs    = 1      # keep all outputs in memory
training_precision = double # or float
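Note that iterations = 2 merely keeps this tutorial short; once everything runs, increasing the number of iterations is the first thing to try to bring the error rate down.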

# training display #############################################################                                                                                                     
display               = 1  # display results                                                                                                                                         
show_conf             = 0  # show configuration variables or not                                                                                                                     
show_train            = 1  # enable/disable all training display                                                                                                                     
show_train_ninternals = 1  # number of internal examples to display                                                                                                                  
show_train_errors     = 0  # show worst errors on training set                                                                                                                       
show_train_correct    = 0  # show worst corrects on training set                                                                                                                     
show_val_errors       = 1  # show worst errors on validation set                                                                                                                     
show_val_correct      = 1  # show worst corrects on validation set                                                                                                                   
show_hsample          = 5  # number of samples to show on height axis                                                                                                                
show_wsample          = 18 # number of samples to show on width axis
show_wait_user        = 0  # if 1, wait for user to close windows  

Now we are ready to train :)

Training the System

To train the system, open a terminal and call eblearn's train utility with the configuration file we wrote in this tutorial as its argument.

train tutorial2.conf
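You should see training progress printed in the terminal and, since display = 1 in our conf file, a window showing samples and the network's internal states (set display = 0 to train without any display).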