Scripts

Documentation of DeepCpG scripts.

dcpg_data.py

Create DeepCpG input data from incomplete methylation profiles.

Takes as input incomplete CpG methylation profiles of multiple cells, extracts neighboring CpG sites and/or DNA sequence windows, and writes data chunk files to the output directory. The output data can then be used for model training with dcpg_train.py and model evaluation with dcpg_eval.py.

Examples

Create data files for training a CpG and DNA model, using 50 neighboring methylation states and DNA sequence windows of 1001 bp from the mm10 genome build:

dcpg_data.py
    --cpg_profiles ./cpg/*.tsv
    --cpg_wlen 50
    --dna_files ./mm10
    --dna_wlen 1001
    --out_dir ./data

Create data files from gzip-compressed bedGraph files for predicting the mean methylation rate and cell-to-cell variance from the DNA sequence:

dcpg_data.py
    --cpg_profiles ./cpg/*.bedGraph.gz
    --dna_files ./mm10
    --dna_wlen 1001
    --win_stats mean var
    --win_stats_wlen 1001 2001 3001 4001 5001
    --out_dir ./data
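
The statistics requested with --win_stats can be pictured with a small sketch. It assumes, as one plausible reading that this page does not confirm, that each cell's methylation rate is first averaged over the observed CpG sites in a window, and that mean and var are then taken across cells. The name window_stats, the marker CPG_NAN = -1 for missing states, and the use of the population variance are illustrative choices.

```python
from statistics import mean, pvariance

CPG_NAN = -1  # assumed marker for an unobserved methylation state


def window_stats(cell_states):
    """Summarize one window.

    cell_states holds one list per cell with the methylation states
    (0, 1, or CPG_NAN) of the CpG sites that fall into the window.
    """
    rates = []
    for states in cell_states:
        observed = [s for s in states if s != CPG_NAN]
        if observed:  # skip cells without any covered site in the window
            rates.append(mean(observed))
    if not rates:
        return None
    # Population variance is an assumption; a sample variance is equally plausible.
    return {"mean": mean(rates), "var": pvariance(rates)}


# Three cells, four CpG sites in the window.
stats = window_stats([[1, 1, -1, 0], [0, -1, -1, 0], [1, 1, 1, -1]])
```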

See Also

  • dcpg_data_stats.py: For computing statistics of data files.
  • dcpg_data_show.py: For showing the content of data files.
  • dcpg_train.py: For training a model.

scripts.dcpg_data.extract_seq_windows(seq, pos, wlen, seq_index=1, assert_cpg=False)

Extracts DNA sequence windows at positions.

Parameters:
seq: str

DNA sequence.

pos: list

Positions at which windows are extracted.

wlen: int

Window length.

seq_index: int

Offset at which positions start.

assert_cpg: bool

If True, check if positions in pos point to CpG sites.

Returns:
np.array

Array with integer-encoded sequence windows.
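
As a rough illustration of the parameters above, the following sketch extracts windows centered on each position and encodes them as integers. It is not the DeepCpG implementation: the encoding table CHAR_TO_INT, the padding with N at the sequence ends, and the use of plain lists instead of a NumPy array are assumptions made to keep the example dependency-free.

```python
# Hypothetical integer encoding; the codes used by DeepCpG itself may differ.
CHAR_TO_INT = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}


def extract_windows_sketch(seq, pos, wlen, seq_index=1, assert_cpg=False):
    """Return one integer-encoded window of length wlen per position."""
    if wlen % 2 != 1:
        raise ValueError("wlen must be odd so that the window has a center")
    delta = wlen // 2
    seq = seq.upper()
    windows = []
    for p in pos:
        center = p - seq_index  # convert to a 0-based string index
        if center < 0 or center >= len(seq):
            raise ValueError(f"position {p} lies outside the sequence")
        if assert_cpg and seq[center:center + 2] != "CG":
            raise ValueError(f"no CpG site at position {p}")
        window = []
        for i in range(center - delta, center + delta + 1):
            # Pad with N where the window runs past either end of seq.
            base = seq[i] if 0 <= i < len(seq) else "N"
            window.append(CHAR_TO_INT.get(base, CHAR_TO_INT["N"]))
        windows.append(window)
    return windows


# Positions are 1-based here because seq_index defaults to 1.
windows = extract_windows_sketch("AACGTTCGA", pos=[3, 7], wlen=5, assert_cpg=True)
```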

scripts.dcpg_data.map_cpg_tables(cpg_tables, chromo, chromo_pos)

Maps values from cpg_tables to chromo_pos.

Positions in cpg_tables for chromo must be a subset of chromo_pos. Inserts dat.CPG_NAN for uncovered positions.

scripts.dcpg_data.map_values(values, pos, target_pos, dtype=None, nan=-1)

Maps values array at positions pos to target_pos.

Inserts nan for uncovered positions.
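
The mapping can be pictured as aligning sparse observations onto a common position grid. The sketch below is a simplified stand-in that uses plain lists and a dictionary lookup; the name map_values_sketch is hypothetical, and the dtype argument is left out.

```python
def map_values_sketch(values, pos, target_pos, nan=-1):
    """Align values observed at pos onto target_pos, filling gaps with nan."""
    if len(values) != len(pos):
        raise ValueError("values and pos must have the same length")
    observed = dict(zip(pos, values))
    # Positions in pos that are absent from target_pos are silently dropped.
    return [observed.get(p, nan) for p in target_pos]


mapped = map_values_sketch([1, 0, 1], pos=[10, 20, 40],
                           target_pos=[10, 20, 30, 40, 50])
```

Here positions 30 and 50 are not covered by the input, so they receive the nan value.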

scripts.dcpg_data.prepro_pos_table(pos_tables)

Extracts unique positions and sorts them.
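
A minimal sketch of this step, assuming each table is simply a list of (chromo, pos) pairs rather than the table type used by the script; the function name is hypothetical.

```python
def unique_sorted_positions(pos_tables):
    """Merge tables of (chromo, pos) pairs into one sorted, duplicate-free list."""
    merged = set()
    for table in pos_tables:
        merged.update(table)
    # Tuples sort by chromosome first, then by position. Chromosome names
    # are compared as strings here, so "10" sorts before "2".
    return sorted(merged)


positions = unique_sorted_positions([
    [("1", 30), ("1", 10)],
    [("1", 10), ("2", 5)],
])
```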

scripts.dcpg_data.read_cpg_profiles(filenames, log=None, *args, **kwargs)

Read methylation profiles.

Input files can be gzip compressed.

Returns:
dict

dict (key, value), where key is the output name and value the CpG table.

scripts.dcpg_data.split_ext(filename)

Remove file extension from filename.
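
The description leaves open how compound extensions such as .bedGraph.gz are handled. One possible behavior, sketched below under the assumption that the directory and everything after the first dot are dropped, is:

```python
import os


def split_ext_sketch(filename):
    """Strip the directory and all extensions from filename (assumed behavior)."""
    return os.path.basename(filename).split(os.extsep)[0]


name = split_ext_sketch("./cpg/BS27_1_SER.bedGraph.gz")
```

A name derived this way would be a natural key for the dictionary returned by read_cpg_profiles, although this page does not state that the two are linked.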

dcpg_data_show.py

Show the content of DeepCpG data files.

Shows the content of dcpg_data.py output files for a selected region, for example the methylation state of the target CpG site, neighboring CpG sites, or the DNA sequence.

Examples

Show the output methylation state of CpG sites on chromosome 1 between position 3028955 and 3079682:

dcpg_data_show.py
    ./data/*.h5
    --chromo 1
    --start 3028955
    --end 3079682
    --outputs

Show output methylation states and the state as well as the distance of 10 neighboring CpG sites of cell BS27_1_SER:

dcpg_data_show.py
    ./data/*.h5
    --chromo 1
    --start 3028955
    --end 3079682
    --outputs cpg/BS27_1_SER
    --cpg BS27_1_SER
    --cpg_wlen 10
    --cpg_dist

Show output methylation states and DNA sequence windows of length 11 and store the results in HDF5 file selected.h5:

dcpg_data_show.py
    ./data/*.h5
    --chromo 1
    --start 3028955
    --end 3079682
    --outputs
    --dna_wlen 11
    --out_hdf selected.h5

dcpg_data_stats.py

Compute summary statistics of data files.

Computes summary statistics of data files such as the number of samples or the mean and variance of output variables.

Examples

dcpg_data_stats.py
    ./data/*.h5

dcpg_download.py

Download a pre-trained model from DeepCpG model zoo.

Downloads a pre-trained model from the DeepCpG model zoo by its identifier. Model descriptions can be found online.

Examples

Show available models:

dcpg_download.py --show

Download the DNA model trained on serum cells from Smallwood et al.:

dcpg_download.py
    Smallwood2014_serum_dna
    -o ./model

dcpg_eval.py

Evaluate the prediction performance of a DeepCpG model.

Imputes missing methylation states and evaluates the model on observed states. --out_report writes evaluation metrics to a TSV file. --out_data writes predicted and observed methylation states to an HDF5 file with the following structure:

  • chromo: The chromosome of the CpG site.
  • pos: The position of the CpG site on the chromosome.
  • outputs: The input methylation state of each cell and CpG site, which can be either observed or missing (-1).
  • preds: The predicted methylation state of each cell and CpG site.

Examples

dcpg_eval.py
    ./data/*.h5
    --model_files ./model
    --out_data ./eval/data.h5
    --out_report ./eval/report.tsv

dcpg_eval_export.py

Export imputed methylation profiles.

Exports imputed methylation profiles from a dcpg_eval.py output file to different data formats. For each CpG site and cell, outputs the experimentally observed methylation state if it was observed in the input file, and the predicted state otherwise. Creates one file per methylation profile in the output directory.
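
The selection rule can be sketched for a single cell as follows. The function name is hypothetical, and plain lists stand in for the outputs and preds datasets of the dcpg_eval.py output file; the value -1 for missing states follows the description of outputs given for dcpg_eval.py.

```python
CPG_NAN = -1  # marker for a missing methylation state in outputs


def merge_observed_and_predicted(outputs, preds, nan=CPG_NAN):
    """Keep observed states and fill missing ones with the prediction."""
    if len(outputs) != len(preds):
        raise ValueError("outputs and preds must have the same length")
    return [pred if obs == nan else obs for obs, pred in zip(outputs, preds)]


# Sites 2 and 4 were not observed, so their predictions are used.
profile = merge_observed_and_predicted([1, -1, 0, -1], [0.9, 0.2, 0.1, 0.7])
```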

Examples

Export profiles of all cells as HDF5 files to ./eval:

dcpg_eval_export.py
    ./eval/data.h5
    --out_dir ./eval

Export the profile of cell Ca01 for chromosomes 4 and 5 to a bedGraph file:

dcpg_eval_export.py
    ./eval/data.h5
    --output cpg/Ca01
    --chromo 4 5
    --format bedGraph
    --out_dir ./eval

dcpg_eval_perf.py

Evaluate prediction performance.

Evaluates prediction performance globally and in genomic annotations.

Examples

Evaluate prediction performance globally and in genomic contexts annotated as CGI, TSS, or gene body. Also compute precision-recall and ROC curves of individual outputs:

dcpg_eval_perf.py
    ./eval/data.h5
    --out_dir ./eval
    --curves pr roc
    --annos_files ./bed/CGI.bed ./bed/TSS.bed ./bed/gene_body.bed

scripts.dcpg_eval_perf.annotate(chromos, pos, anno)

Annotate genomic locations.

Tests if sites specified by chromos and pos are annotated by anno.

Parameters:
chromos: numpy.ndarray

numpy.ndarray with chromosome of sites.

pos: numpy.ndarray

numpy.ndarray with position on chromosome of sites.

anno: pandas.DataFrame

pandas.DataFrame with columns chromo, start, end that specify annotated regions.

Returns:
numpy.ndarray

Binary numpy.ndarray of same length as chromos indicating if positions are annotated.
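
The lookup can be sketched with a binary search over region start positions. This is an illustration, not the DeepCpG code: it uses tuples and lists instead of NumPy arrays and a DataFrame, assumes that regions on the same chromosome do not overlap, and treats both start and end as inclusive.

```python
from bisect import bisect_right
from collections import defaultdict


def annotate_sketch(chromos, pos, regions):
    """Return 1 for each site that lies inside a region, else 0.

    regions holds (chromo, start, end) tuples that do not overlap within a
    chromosome; start and end are both treated as inclusive (an assumption).
    """
    by_chromo = defaultdict(list)
    for chromo, start, end in regions:
        by_chromo[chromo].append((start, end))
    starts = {}
    for chromo, intervals in by_chromo.items():
        intervals.sort()
        starts[chromo] = [start for start, _ in intervals]

    flags = []
    for chromo, p in zip(chromos, pos):
        intervals = by_chromo.get(chromo, [])
        # Index of the last interval whose start is <= p, or -1 if none.
        i = bisect_right(starts.get(chromo, []), p) - 1
        flags.append(int(i >= 0 and p <= intervals[i][1]))
    return flags


regions = [("1", 100, 200), ("1", 300, 400), ("2", 50, 60)]
flags = annotate_sketch(["1", "1", "1", "2", "3"], [150, 250, 400, 10, 55], regions)
```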

scripts.dcpg_eval_perf.get_curve_fun(name)

Return performance curve function by its name.

scripts.dcpg_eval_perf.read_anno_file(anno_file, chromos=None, nb_sample=None)

Read annotations from BED file.

Reads annotations from a BED file and merges overlapping annotations.

Parameters:
anno_file: str

File name.

chromos: list

List of chromosomes for filtering annotations.

nb_sample: int

Maximum number of annotated regions.

Returns:
pandas.DataFrame

pandas.DataFrame with columns chromo, start, end.
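
Merging overlapping annotations is commonly done with a single pass over the sorted regions. The sketch below shows that idea on (chromo, start, end) tuples; the function name is hypothetical, and treating regions that merely touch (start equal to the previous end) as overlapping is an assumption.

```python
def merge_regions(regions):
    """Merge overlapping (chromo, start, end) regions."""
    merged = []
    for chromo, start, end in sorted(regions):
        if merged and merged[-1][0] == chromo and start <= merged[-1][2]:
            # Overlaps the previous region: extend it instead of adding a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chromo, start, end])
    return [tuple(region) for region in merged]


merged = merge_regions([
    ("1", 150, 300),
    ("1", 100, 200),
    ("1", 400, 500),
    ("2", 100, 200),
])
```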

dcpg_filter_act.py

Compute filter activations of a DeepCpG model.

Computes the activation of the filters of the first convolutional layer for a given DNA model. The resulting activations can be used to visualize and cluster motifs, or to correlate them with model outputs.

Examples

Compute activations in 25000 sequence windows and also store DNA sequences, for example to visualize motifs:

dcpg_filter_act.py
    ./data/*.h5
    --model_files ./models/dna
    --out_file ./activations.h5
    --nb_sample 25000
    --store_inputs

Compute the weighted mean activation in each sequence window and also store model predictions, for example to cluster motifs or to correlate mean motif activations with model predictions:

dcpg_filter_act.py
    ./data/*.h5
    --model_files ./models/dna
    --out_file ./activations.h5
    --act_fun wmean
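
How --act_fun wmean weights the positions of a window is not described on this page. The sketch below assumes weights that rise linearly from the window edges to the center, so that activations near the central site count most. The function names and the edge weight of 0.1 are illustrative only.

```python
def linear_weights(wlen, start=0.1):
    """Weights that rise linearly from start at the edges to 1 at the center."""
    half = (wlen + 1) // 2
    if half == 1:
        rising = [1.0]
    else:
        step = (1.0 - start) / (half - 1)
        rising = [start + i * step for i in range(half)]
    # Mirror the rising half; odd lengths share the single center weight.
    falling = rising[-2::-1] if wlen % 2 else rising[::-1]
    return rising + falling


def weighted_mean(activations, weights):
    """Weighted mean of one filter's activations along a sequence window."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    return sum(a * w for a, w in zip(activations, weights)) / sum(weights)


weights = linear_weights(5)
center_peak = weighted_mean([0.0, 0.0, 2.0, 0.0, 0.0], weights)
edge_peak = weighted_mean([2.0, 0.0, 0.0, 0.0, 0.0], weights)
```

With these weights, the same activation contributes more to the summary when it occurs at the window center than when it occurs at the edge.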

See Also

  • dcpg_filter_motifs.py: For motif visualization and analysis.

dcpg_filter_motifs.py

Visualizes and analyzes filter motifs.

Visualizes motifs as sequence logos, compares motifs to annotated motifs, clusters motifs, and computes motif summary statistics. Requires Weblogo3 for visualization and Tomtom for motif comparison.

Copyright (c) 2015 David Kelley, since parts of the code are based on the Basset script basset_motifs.py by David Kelley.

Examples

Compute filter activations and also store input DNA sequence windows:

dcpg_filter_act.py
    ./data/*.h5
    --out_file ./activations.h5
    --store_inputs
    --nb_sample 100000

Visualize and analyze motifs:

dcpg_filter_motifs.py
    ./activations.h5
    --out_dir ./motifs
    --motif_db ./motif_databases/CIS-BP/Mus_musculus.meme
    --plot_heat
    --plot_dens
    --plot_pca

dcpg_snp.py

Compute the effect of DNA mutations on methylation.

Computes the effect of DNA mutation on the mean methylation rate or cell-to-cell variance using gradient backpropagation.

Examples

Compute the effect on mean methylation rates and cell-to-cell variance:

dcpg_snp.py
    ./data/*.h5
    --model_files ./model/dna
    --out_file ./effects.h5
    --targets mean var

Compute weighted mean effects in DNA sequence windows of length 101:

dcpg_snp.py
    ./data/*.h5
    --model_files ./model/dna
    --out_file ./effects.h5
    --targets mean var
    --dna_wlen 101
    --agg_effects wmean

dcpg_train.py

Train a DeepCpG model to predict DNA methylation.

Trains a DeepCpG model on DNA (DNA model), neighboring methylation states (CpG model), or both (Joint model) to predict CpG methylation of multiple cells. Individual models can be fine-tuned or trained from scratch.

Examples

Train a DNA model on chromosomes 1, 3, and 5, and use chromosomes 13, 14, and 15 for validation:

dcpg_train.py
    ./data/c{1,3,5}_*.h5
    --val_files ./data/c{13,14,15}_*.h5
    --dna_model CnnL2h128
    --out_dir ./models/dna

Train a CpG model:

dcpg_train.py
    ./data/c{1,3,5}_*.h5
    --val_files ./data/c{13,14,15}_*.h5
    --cpg_model RnnL1
    --out_dir ./models/cpg

Train a Joint model using a pre-trained DNA and CpG model:

dcpg_train.py
    ./data/c{1,3,5}_*.h5
    --val_files ./data/c{13,14,15}_*.h5
    --dna_model ./models/dna
    --cpg_model ./models/cpg
    --joint_model JointL2h512
    --train_models joint
    --out_dir ./models/joint

See Also

  • dcpg_eval.py: For evaluating a trained model and imputing methylation
    profiles.

dcpg_train_viz.py

Visualizes learning curves of dcpg_train.py.

Visualizes training and validation learning curves from dcpg_train.py. Tensorboard is recommended for advanced visualization.

Examples

dcpg_train_viz.py
    ./model/lc_train.tsv ./model/lc_val.tsv
    --out_file ./lc.pdf