Model training¶
Here you can find information about how to train DeepCpG models.
Splitting data into training, validation, and test set¶
For comparing different models, it is necessary to train, select hyper-parameters, and test models on distinct data. In holdout validation, the dataset is split into a training set (~60% of the data), validation set (~20% of the data), and test set (~20% of the data). Models are trained on the training set, hyper-parameters selected on the validation set, and the selected models compared on the test set. For example, you could use chromosome 1-5, 7, 9, 11, 13 as training set, chromosome 14-19 as validation set, and chromosome 6, 8, 10, 12, 14 as test set:
train_files="$data_dir/c{1,2,3,4,5,7,9,11,13}_*.h5
val_files="$data_dir/c{14,15,16,17,18,19}_*.h5"
test_files="$data_dir/c{6,8,10,12,14}_*.h5"
dcpg_train.py
$train_files
--val_file $val_files
...
As you can see, DeepCpG allows to easily split the data by glob
patterns. You do not have to split the dataset by chromosomes. For
example, you could use train_files=$data_Dir/c*_[01].h5
to select
all data files starting with index 0 or 1 for training, and use the
remaining files for validation.
If you are not concerned about comparing DeepCpG with other models, you do not need a test set. In this case, you could, for example, leave out chromosome 14-19 as validation set, and use the remaining chromosomes for training.
If your data were generated using whole-genome scBS-seq, then the number
of CpG sites on few chromosomes is usually already sufficient for
training. For example, chromosome 1, 3, and 5 from Smallwood et al
(2014) cover already more than 3 million CpG sites. I found about 3
million CpG sites as sufficient for training models without overfitting.
However, if you are working with scRRBS-seq data, you probably need more
chromosomes for training. To check how many CpG sites are stored in a set
of DeepCpG data files, you can use the dcpg_data_stats.py
. The
following command will compute different statistics for the training
set, including the number number of CpG sites:
dcpg_data_stats.py $data_dir/$train_files
#################################
dcpg_data_stats.py ./data/c19_000000-032768.h5 ./data/c19_032768-050000.h5
#################################
output nb_tot nb_obs frac_obs mean var
0 cpg/BS27_1_SER 50000 20621 0.41242 0.665972 0.222453
1 cpg/BS27_3_SER 50000 13488 0.26976 0.573102 0.244656
2 cpg/BS27_5_SER 50000 25748 0.51496 0.529633 0.249122
3 cpg/BS27_6_SER 50000 17618 0.35236 0.508117 0.249934
4 cpg/BS27_8_SER 50000 16998 0.33996 0.661019 0.224073
For each output cell, nb_tot
is the total number of CpG sites,
nb_obs
the number of CpG sites with known methylation state,
frac_obs
the ratio between nb_obs
and nb_tot
, mean
the
mean methylation rate, and var
the variance of the methylation rate.
Training DeepCpG models jointly¶
As described in Angermueller et al (2017), DeepCpG consists of a DNA, CpG, and Joint model. The DNA model recognizes features in the DNA sequence window that is centered on a target site, the CpG model recognizes features in observed neighboring methylation states of multiple cells, and the Joint model integrates features from the DNA and CpG model and predicts the methylation state of all cells.
The easiest way is to train all models jointly:
dcpg_train.py
$train_files
--val_files $val_files
--dna_model CnnL2h128
--cpg_model RnnL1
--out_dir $models_dir/joint
--nb_epoch 30
--dna_model
, --cpg_model
, and --joint_model
specify the
architecture of the DNA, CpG, and Joint model, respectively, which are
described in here <./models.rst>_.
Training DeepCpG models separately¶
Although it is convenient to train all models jointly by running only a single command as described above, I suggest to train models separately. First, because it enables to train the DNA and CpG model in parallel on separate machines and thereby to reduce the training time. Second, it enables to compare how predictive the DNA model is relative to CpG model. If you think the CpG model is already accurate enough alone, you might not need the DNA model. Thirdly, I obtained better results by training the models separately. However, this may not be true for your particular dataset.
You can train the CpG model separately by only using the
--cpg_model
argument, but not --dna_model
:
dcpg_train.py
$train_files
--val_files $val_files
--dna_model CnnL2h128
--out_dir $models_dir/dna
--nb_epoch 30
You can train the DNA model separately by only using --dna_model
:
dcpg_train.py
$train_files
--val_files $val_files
--cpg_model RnnL1
--out_dir $models_dir/cpg
--nb_epoch 30
After training the CpG and DNA model, we are joining them by specifying
the name of the Joint model with --joint_model
:
dcpg_train.py
$train_files
--val_files $val_files
--dna_model $models_dir/dna
--cpg_model $models_dir/cpg
--joint_model JointL2h512
--train_models joint
--out_dir $models_dir/joint
--nb_epoch 10
--dna_model
and --cpg_model
point to the output training
directory of the DNA and CpG model, respectively, which contains their
specification and weights:
ls $models_dir/dna
events.out.tfevents.1488213772.lawrence model.json
lc_train.csv model_weights_train.h5
lc_val.csv model_weights_val.h5
model.h5
model.json
is the specification of the trained model,
model_weights_train.h5
the weights with the best performance on the
training set, and model_weights_val.h5
the weights with the best
performance on the validation set. --dna_model ./dna
is equivalent
to using --dna_model ./dna/model.json ./dna/model_weights_val.h5
,
i.e. the validation weights will be used. The training weights can be
used by --dna_model ./dna/model.json ./dna/model_weights_train.h5
In the command above, we used --train_models joint
to only train the
parameters of the Joint model without training the pre-trained DNA and
CpG model. Although this reduces training time, you might obtain better results
by also fine-tuning the parameters of the DNA and CpG model without using
--train_models
.
Monitoring training progress¶
To check if your model is training correctly, you should monitor the
training and validation loss. DeepCpG prints the loss and performance
metrics for each output to the console as you can see from the previous
commands. loss
is the loss on the training set, val_loss
the
loss on the validation set, and cpg/X_acc
, is, for example, the
accuracy for output cell X. DeepCpG also stores these metrics in
X.csv
in the training output directory.
Both the training loss and validation loss should continually decrease
until saturation. If at some point the validation loss starts to
increase while the training loss is still decreasing, your model is
overfitting the training set and you should stop training. DeepCpG will
automatically stop training if the validation loss does not increase
over the number of epochs that is specified by --early_stopping
(by
default 5). If your model is overfitting already after few epochs, your
training set might be to small, and you could try to regularize your
model model by choosing a higher value for --dropout
or
--l2_decay
.
If your training loss fluctuates or increases, then you should decrease the learning rate. For more information on interpreting learning curves I recommend this tutorial.
To stop training before reaching the number of epochs specified by
--nb_epoch
, you can create a stop file (default name STOP
) in
the training output directory with touch STOP
.
Watching numeric console outputs is not particular user friendly.
TensorBoard
provides a more convenient and visually appealing way to mointor
training. You can use TensorBoard provided that you are using the
Tensorflow backend. Simply go to the training output
directory and run tensorboard --logdir .
.
Deciding how long to train¶
The arguments --nb_epoch
and --early_stopping
control how long
models are trained.
--nb_epoch
defines the maximum number of training epochs (default
30). After one epoch, the model has seen the entire training set once.
The time per epoch hence depends on the size of the training set, but
also on the complexity of the model that you are training and the
hardware of your machine. On a large dataset, you have to train for
fewer epochs than on a small dataset, since your model will have seen
already a lot of training samples after one epoch. For training on about
3,000,000 samples, good default values are 20 for the DNA and CpG
model, and 10 for the Joint model.
Early stopping stops training if the loss on the validation set did not
improve after the number of epochs that is specified by
--early_stopping
(default 5). If you are training without specifying
a validation set with --val_files
, early stopping will be
deactivated.
--max_time
sets the maximum training time in hours. This guarantees
that training terminates after a certain amount of time regardless of
the --nb_epoch
or --early_stopping
argument.
--stop_file
defines the path of a file that, if it exists, stop
training after the end of the current epoch. This is useful if you are
monitoring training and want to terminate training manually as soon as
the training loss starts to saturate regardless of --nb_epoch
or
--early_stopping
. For example, when using
--stop_file ./train/STOP
, you can create an empty file with
touch ./train/STOP
to stop training at the end of the current epoch.
Optimizing hyper-parameters¶
DeepCpG has different hyper-parameters, such as the learning rate, dropout rate, or model architectures. Although the performance of DeepCpG is relatively robust to different hyper-parameters, you can tweak performances by trying out different parameter combinations. For example, you could train different models with different parameters on a subset of your data, select the parameters with the highest performance on the validation set, and then train the full model.
The following hyper-parameters are most important (default values shown):
1. Learning rate: --learning_rate 0.0001
2. Dropout rate: --dropout 0.0
3. DNA model architecture: --dna_model CnnL2h128
4. Joint model architecture: --joint_model JointL2h512
5. CpG model architecture: --cpg_model RnnL1
6. L2 weight decay: --l2_decay 0.0001
The learning rate defines how aggressively model parameters are updated during training. If the training loss changes only slowly, you could try increasing the learning rate. If your model is overfitting of if the training loss fluctuates, you should decrease the learning rate. Reasonable values are 0.001, 0.0005, 0.0001, 0.00001, or values in between.
The dropout rate defines how strongly your model is regularized. If you have only few data and your model is overfitting, then you should increase the dropout rate. Reasonable values are, e.g., 0.0, 0.2, 0.4.
DeepCpG provides different architectures for the DNA, CpG, and joint model. Architectures are more or less complex, depending on how many layers and neurons say have. More complex model might yield better performances, but take longer to train and might overfit your data. You can find more information about available model architecture here.
L2 weight decay is an alternative to dropout for regularizing model training. If your model is overfitting, you might try 0.001, or 0.005.
Testing training¶
dcpg_train.py
provides different arguments that allow to briefly
test training before training the full model for a about a day.
--nb_train_sample
and --nb_val_sample
specify the number of
training and validation samples. When using --nb_train_sample 500
,
the training loss should briefly decay and your model should start
overfitting.
--nb_output
and --output_names
define the maximum number and the
name of model outputs. For example, --nb_output 3
will train only on
the first three outputs, and --output_names cpg/.*SER.*
only on
outputs that include ‘SER’ in their name.
Analogously, --nb_replicate
and --replicate_name
define the
number and name of cells that are used as input to the CpG model.
--nb_replicate 3
will only use observed methylation states from the
first three cells, and allows to briefly test the CpG model.
--dna_wlen
specifies the size of DNA sequence windows that will be
used as input to the DNA model. For example, --dna_wlen 101
will
train only on windows of size 101, instead of using the full window
length that was specified when creating data files with
dcpg_data.py
.
Analogously, --cpg_wlen
specifies the sum of the number of observed
CpG sites to the left and the right of the target CpG site for training
the CpG model. For example, --cpg_wlen 10
will use 5 observed CpG
sites to the left and to the right.
Fine-tuning and training selected components¶
dcpg_train.py
provides different arguments that allow to selectively
train only some components of a model.
With --fine_tune
, only the output layer will be trained. As the name
implies, this argument is useful for fine-tuning a pre-trained model.
--train_models
specifies which models are trained. For example,
--train_models joint
will train the Joint model, but not the DNA
and CpG model. --train_models cpg joint
will train the CpG and
Joint model, but not the DNA model.
--trainable
and --not_trainable
allow including and excluding
certain layers. For example,
--not_trainable '.*' --trainable 'dna/.*_2'
will only train the
second layers of the DNA model.
--freeze_filter
excludes the first convolutional layer of the DNA
model from training.
Configuring the Keras backend¶
DeepCpG use the Keras deep learning library, which supports Theano or Tensorflow as backend. While Theano has long been the dominant deep learning library, Tensorflow is more suited for parallelizing computations on multiple GPUs and CPUs, and provides TensorBoard to interactively monitor training.
You can configure the backend by setting the backend
attribute in
~/.keras/keras.json
to tensorflow
or theano
. Alternatively
you can set the environemnt variable KERAS_BACKEND='tensorflow'
to
use Tensorflow, or KERAS_BACKEND='theano'
to use Theano.
You can find more information about Keras backends here.