Model training

Here you can find information about how to train DeepCpG models.

Splitting data into training, validation, and test set

For comparing different models, it is necessary to train, select hyper-parameters, and test models on distinct data. In holdout validation, the dataset is split into a training set (~60% of the data), validation set (~20% of the data), and test set (~20% of the data). Models are trained on the training set, hyper-parameters selected on the validation set, and the selected models compared on the test set. For example, you could use chromosome 1-5, 7, 9, 11, 13 as training set, chromosome 14-19 as validation set, and chromosome 6, 8, 10, 12, 14 as test set:

train_files="$data_dir/c{1,2,3,4,5,7,9,11,13}_*.h5
val_files="$data_dir/c{14,15,16,17,18,19}_*.h5"
test_files="$data_dir/c{6,8,10,12,14}_*.h5"

dcpg_train.py
    $train_files
    --val_file $val_files
    ...

As you can see, DeepCpG allows to easily split the data by glob patterns. You do not have to split the dataset by chromosomes. For example, you could use train_files=$data_Dir/c*_[01].h5 to select all data files starting with index 0 or 1 for training, and use the remaining files for validation.

If you are not concerned about comparing DeepCpG with other models, you do not need a test set. In this case, you could, for example, leave out chromosome 14-19 as validation set, and use the remaining chromosomes for training.

If your data were generated using whole-genome scBS-seq, then the number of CpG sites on few chromosomes is usually already sufficient for training. For example, chromosome 1, 3, and 5 from Smallwood et al (2014) cover already more than 3 million CpG sites. I found about 3 million CpG sites as sufficient for training models without overfitting. However, if you are working with scRRBS-seq data, you probably need more chromosomes for training. To check how many CpG sites are stored in a set of DeepCpG data files, you can use the dcpg_data_stats.py. The following command will compute different statistics for the training set, including the number number of CpG sites:

dcpg_data_stats.py $data_dir/$train_files
#################################
dcpg_data_stats.py ./data/c19_000000-032768.h5 ./data/c19_032768-050000.h5
#################################
           output  nb_tot  nb_obs  frac_obs      mean       var
0  cpg/BS27_1_SER   50000   20621   0.41242  0.665972  0.222453
1  cpg/BS27_3_SER   50000   13488   0.26976  0.573102  0.244656
2  cpg/BS27_5_SER   50000   25748   0.51496  0.529633  0.249122
3  cpg/BS27_6_SER   50000   17618   0.35236  0.508117  0.249934
4  cpg/BS27_8_SER   50000   16998   0.33996  0.661019  0.224073

For each output cell, nb_tot is the total number of CpG sites, nb_obs the number of CpG sites with known methylation state, frac_obs the ratio between nb_obs and nb_tot, mean the mean methylation rate, and var the variance of the methylation rate.

Training DeepCpG models jointly

As described in Angermueller et al (2017), DeepCpG consists of a DNA, CpG, and Joint model. The DNA model recognizes features in the DNA sequence window that is centered on a target site, the CpG model recognizes features in observed neighboring methylation states of multiple cells, and the Joint model integrates features from the DNA and CpG model and predicts the methylation state of all cells.

The easiest way is to train all models jointly:

dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model CnnL2h128
    --cpg_model RnnL1
    --out_dir $models_dir/joint
    --nb_epoch 30

--dna_model, --cpg_model, and --joint_model specify the architecture of the DNA, CpG, and Joint model, respectively, which are described in here <./models.rst>_.

Training DeepCpG models separately

Although it is convenient to train all models jointly by running only a single command as described above, I suggest to train models separately. First, because it enables to train the DNA and CpG model in parallel on separate machines and thereby to reduce the training time. Second, it enables to compare how predictive the DNA model is relative to CpG model. If you think the CpG model is already accurate enough alone, you might not need the DNA model. Thirdly, I obtained better results by training the models separately. However, this may not be true for your particular dataset.

You can train the CpG model separately by only using the --cpg_model argument, but not --dna_model:

dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model CnnL2h128
    --out_dir $models_dir/dna
    --nb_epoch 30

You can train the DNA model separately by only using --dna_model:

dcpg_train.py
    $train_files
    --val_files $val_files
    --cpg_model RnnL1
    --out_dir $models_dir/cpg
    --nb_epoch 30

After training the CpG and DNA model, we are joining them by specifying the name of the Joint model with --joint_model:

dcpg_train.py
    $train_files
    --val_files $val_files
    --dna_model $models_dir/dna
    --cpg_model $models_dir/cpg
    --joint_model JointL2h512
    --train_models joint
    --out_dir $models_dir/joint
    --nb_epoch 10

--dna_model and --cpg_model point to the output training directory of the DNA and CpG model, respectively, which contains their specification and weights:

ls $models_dir/dna
events.out.tfevents.1488213772.lawrence model.json
lc_train.csv                            model_weights_train.h5
lc_val.csv                              model_weights_val.h5
model.h5

model.json is the specification of the trained model, model_weights_train.h5 the weights with the best performance on the training set, and model_weights_val.h5 the weights with the best performance on the validation set. --dna_model ./dna is equivalent to using --dna_model ./dna/model.json ./dna/model_weights_val.h5, i.e. the validation weights will be used. The training weights can be used by --dna_model ./dna/model.json ./dna/model_weights_train.h5

In the command above, we used --train_models joint to only train the parameters of the Joint model without training the pre-trained DNA and CpG model. Although this reduces training time, you might obtain better results by also fine-tuning the parameters of the DNA and CpG model without using --train_models.

Monitoring training progress

To check if your model is training correctly, you should monitor the training and validation loss. DeepCpG prints the loss and performance metrics for each output to the console as you can see from the previous commands. loss is the loss on the training set, val_loss the loss on the validation set, and cpg/X_acc, is, for example, the accuracy for output cell X. DeepCpG also stores these metrics in X.csv in the training output directory.

Both the training loss and validation loss should continually decrease until saturation. If at some point the validation loss starts to increase while the training loss is still decreasing, your model is overfitting the training set and you should stop training. DeepCpG will automatically stop training if the validation loss does not increase over the number of epochs that is specified by --early_stopping (by default 5). If your model is overfitting already after few epochs, your training set might be to small, and you could try to regularize your model model by choosing a higher value for --dropout or --l2_decay.

If your training loss fluctuates or increases, then you should decrease the learning rate. For more information on interpreting learning curves I recommend this tutorial.

To stop training before reaching the number of epochs specified by --nb_epoch, you can create a stop file (default name STOP) in the training output directory with touch STOP.

Watching numeric console outputs is not particular user friendly. TensorBoard provides a more convenient and visually appealing way to mointor training. You can use TensorBoard provided that you are using the Tensorflow backend. Simply go to the training output directory and run tensorboard --logdir ..

Deciding how long to train

The arguments --nb_epoch and --early_stopping control how long models are trained.

--nb_epoch defines the maximum number of training epochs (default 30). After one epoch, the model has seen the entire training set once. The time per epoch hence depends on the size of the training set, but also on the complexity of the model that you are training and the hardware of your machine. On a large dataset, you have to train for fewer epochs than on a small dataset, since your model will have seen already a lot of training samples after one epoch. For training on about 3,000,000 samples, good default values are 20 for the DNA and CpG model, and 10 for the Joint model.

Early stopping stops training if the loss on the validation set did not improve after the number of epochs that is specified by --early_stopping (default 5). If you are training without specifying a validation set with --val_files, early stopping will be deactivated.

--max_time sets the maximum training time in hours. This guarantees that training terminates after a certain amount of time regardless of the --nb_epoch or --early_stopping argument.

--stop_file defines the path of a file that, if it exists, stop training after the end of the current epoch. This is useful if you are monitoring training and want to terminate training manually as soon as the training loss starts to saturate regardless of --nb_epoch or --early_stopping. For example, when using --stop_file ./train/STOP, you can create an empty file with touch ./train/STOP to stop training at the end of the current epoch.

Optimizing hyper-parameters

DeepCpG has different hyper-parameters, such as the learning rate, dropout rate, or model architectures. Although the performance of DeepCpG is relatively robust to different hyper-parameters, you can tweak performances by trying out different parameter combinations. For example, you could train different models with different parameters on a subset of your data, select the parameters with the highest performance on the validation set, and then train the full model.

The following hyper-parameters are most important (default values shown): 1. Learning rate: --learning_rate 0.0001 2. Dropout rate: --dropout 0.0 3. DNA model architecture: --dna_model CnnL2h128 4. Joint model architecture: --joint_model JointL2h512 5. CpG model architecture: --cpg_model RnnL1 6. L2 weight decay: --l2_decay 0.0001

The learning rate defines how aggressively model parameters are updated during training. If the training loss changes only slowly, you could try increasing the learning rate. If your model is overfitting of if the training loss fluctuates, you should decrease the learning rate. Reasonable values are 0.001, 0.0005, 0.0001, 0.00001, or values in between.

The dropout rate defines how strongly your model is regularized. If you have only few data and your model is overfitting, then you should increase the dropout rate. Reasonable values are, e.g., 0.0, 0.2, 0.4.

DeepCpG provides different architectures for the DNA, CpG, and joint model. Architectures are more or less complex, depending on how many layers and neurons say have. More complex model might yield better performances, but take longer to train and might overfit your data. You can find more information about available model architecture here.

L2 weight decay is an alternative to dropout for regularizing model training. If your model is overfitting, you might try 0.001, or 0.005.

Testing training

dcpg_train.py provides different arguments that allow to briefly test training before training the full model for a about a day.

--nb_train_sample and --nb_val_sample specify the number of training and validation samples. When using --nb_train_sample 500, the training loss should briefly decay and your model should start overfitting.

--nb_output and --output_names define the maximum number and the name of model outputs. For example, --nb_output 3 will train only on the first three outputs, and --output_names cpg/.*SER.* only on outputs that include ‘SER’ in their name.

Analogously, --nb_replicate and --replicate_name define the number and name of cells that are used as input to the CpG model. --nb_replicate 3 will only use observed methylation states from the first three cells, and allows to briefly test the CpG model.

--dna_wlen specifies the size of DNA sequence windows that will be used as input to the DNA model. For example, --dna_wlen 101 will train only on windows of size 101, instead of using the full window length that was specified when creating data files with dcpg_data.py.

Analogously, --cpg_wlen specifies the sum of the number of observed CpG sites to the left and the right of the target CpG site for training the CpG model. For example, --cpg_wlen 10 will use 5 observed CpG sites to the left and to the right.

Fine-tuning and training selected components

dcpg_train.py provides different arguments that allow to selectively train only some components of a model.

With --fine_tune, only the output layer will be trained. As the name implies, this argument is useful for fine-tuning a pre-trained model.

--train_models specifies which models are trained. For example, --train_models joint will train the Joint model, but not the DNA and CpG model. --train_models cpg joint will train the CpG and Joint model, but not the DNA model.

--trainable and --not_trainable allow including and excluding certain layers. For example, --not_trainable '.*' --trainable 'dna/.*_2' will only train the second layers of the DNA model.

--freeze_filter excludes the first convolutional layer of the DNA model from training.

Configuring the Keras backend

DeepCpG use the Keras deep learning library, which supports Theano or Tensorflow as backend. While Theano has long been the dominant deep learning library, Tensorflow is more suited for parallelizing computations on multiple GPUs and CPUs, and provides TensorBoard to interactively monitor training.

You can configure the backend by setting the backend attribute in ~/.keras/keras.json to tensorflow or theano. Alternatively you can set the environemnt variable KERAS_BACKEND='tensorflow' to use Tensorflow, or KERAS_BACKEND='theano' to use Theano.

You can find more information about Keras backends here.