data

Package for reading, writing, and transforming data.

data.annotations

Functions for reading and matching annotations.

deepcpg.data.annotations.distance(pos, start, end)[source]

Return shortest distance between a position and a list of intervals.

Parameters:
pos: list

List of integer positions.

start: list

Start position of intervals.

end: list

End position of intervals.

Returns:
:class:`numpy.ndarray`

numpy.ndarray of same length as pos with shortest distance between each pos[i] and any interval.
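
As a rough illustration of the contract described above, the following pure-Python sketch computes the shortest distance from each position to any interval. It assumes closed intervals and a distance of 0 for positions inside an interval; `distance_sketch` is a hypothetical name, and the library function returns a numpy.ndarray rather than a list.

```python
def distance_sketch(pos, start, end):
    """Shortest distance from each position to any closed interval [start[j], end[j]]."""
    result = []
    for p in pos:
        best = None
        for s, e in zip(start, end):
            if s <= p <= e:
                d = 0        # position lies inside the interval (assumed to count as 0)
            elif p < s:
                d = s - p    # position is left of the interval
            else:
                d = p - e    # position is right of the interval
            if best is None or d < best:
                best = d
        result.append(best)
    return result


print(distance_sketch([5, 12, 30], start=[10, 20], end=[15, 25]))  # [5, 0, 5]
```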

deepcpg.data.annotations.extend_len(start, end, min_len, min_pos=1)[source]

Extend intervals to minimum length.

Extends intervals start-end with length smaller than min_len to length min_len by equally decreasing start and increasing end.

Parameters:
start: list

List of start position of intervals.

end: list

List of end position of intervals.

min_len: int

Minimum interval length.

min_pos: int

Minimum position.

Returns:
tuple

tuple with start and end positions of extended intervals.
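
One way to picture the extension is the sketch below. It assumes inclusive interval lengths (end - start + 1), moves the start left by half of the missing length without going below min_pos, and lets the end absorb whatever is still missing. These details and the name `extend_len_sketch` are assumptions for illustration, not a statement of how the library resolves odd lengths or clipping.

```python
def extend_len_sketch(start, end, min_len, min_pos=1):
    """Extend intervals shorter than min_len; longer intervals are left unchanged."""
    new_start, new_end = [], []
    for s, e in zip(start, end):
        missing = max(0, min_len - (e - s + 1))          # inclusive length assumed
        s_ext = max(min_pos, s - missing // 2)           # move start left, not below min_pos
        e_ext = e + max(0, min_len - (e - s_ext + 1))    # end absorbs the remainder
        new_start.append(s_ext)
        new_end.append(e_ext)
    return new_start, new_end


print(extend_len_sketch([10, 2, 100], [12, 3, 200], min_len=7))
# ([8, 1, 100], [14, 7, 200])
```

In this sketch the second interval cannot move its start below min_pos=1, so its end is pushed further right to reach the minimum length.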

deepcpg.data.annotations.extend_len_frame(d, min_len)[source]

Extend length of intervals in Pandas DataFrame using extend_len.

deepcpg.data.annotations.group_overlapping(s, e)[source]

Assign group indices to intervals.

Assigns group index to intervals. Overlapping intervals will be assigned to the same group.

Parameters:
s : list

list with start of interval sorted in ascending order.

e : list

list with end of interval.

Returns:
:class:`numpy.ndarray`

numpy.ndarray with group indices.
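
The grouping can be sketched with a single pass over the sorted intervals. The version below assumes closed intervals (intervals that share an endpoint count as overlapping) and group indices that start at 0; both choices and the name `group_overlapping_sketch` are assumptions.

```python
def group_overlapping_sketch(s, e):
    """Assign the same group index to overlapping intervals; s must be sorted ascending."""
    groups = []
    group = -1
    max_end = None
    for start_i, end_i in zip(s, e):
        if max_end is None or start_i > max_end:
            group += 1                       # no overlap with the running group: open a new one
            max_end = end_i
        else:
            max_end = max(max_end, end_i)    # overlap: extend the running group
        groups.append(group)
    return groups


print(group_overlapping_sketch([1, 3, 10, 11, 20], [5, 8, 12, 15, 25]))  # [0, 0, 1, 1, 2]
```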

deepcpg.data.annotations.in_which(x, ys, ye)[source]

Return index of positions in intervals.

For each position x[i], returns the index j such that ys[j] <= x[i] <= ye[j], or -1 if x[i] is not overlapped by any interval. Intervals must be non-overlapping!

Parameters:
x : list

list of positions.

ys: list

list with start of interval sorted in ascending order.

ye: list

list with end of interval.

Returns:
:class:`numpy.ndarray`

numpy.ndarray with indices of overlapping intervals or -1.
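
Because the intervals are sorted and non-overlapping, the lookup can be done with a binary search over the interval starts. The following is a minimal sketch of that idea using the standard `bisect` module; `in_which_sketch` is a hypothetical name and the result is a plain list instead of a numpy.ndarray.

```python
from bisect import bisect_right


def in_which_sketch(x, ys, ye):
    """Return for each x[i] the index j with ys[j] <= x[i] <= ye[j], or -1."""
    idx = []
    for p in x:
        j = bisect_right(ys, p) - 1    # last interval whose start is <= p
        idx.append(j if j >= 0 and p <= ye[j] else -1)
    return idx


print(in_which_sketch([1, 5, 7, 10, 20], ys=[2, 8], ye=[5, 12]))  # [-1, 0, -1, 1, -1]
```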

deepcpg.data.annotations.is_in(pos, start, end)[source]

Test if position is overlapped by at least one interval.

deepcpg.data.annotations.join_overlapping(s, e)[source]

Join overlapping intervals.

Transforms a list of possibly overlapping intervals into non-overlapping intervals.

Parameters:
s : list

List with start of interval sorted in ascending order.

e : list

List with end of interval.

Returns:
tuple

tuple (s, e) of non-overlapping intervals.
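
Joining can be sketched as a sweep that either extends the last emitted interval or opens a new one. As before, closed intervals are assumed, so intervals that share an endpoint are merged; `join_overlapping_sketch` is a hypothetical name.

```python
def join_overlapping_sketch(s, e):
    """Merge overlapping intervals; s must be sorted in ascending order."""
    joined_s, joined_e = [], []
    for start_i, end_i in zip(s, e):
        if joined_e and start_i <= joined_e[-1]:
            joined_e[-1] = max(joined_e[-1], end_i)   # overlaps the last interval: extend it
        else:
            joined_s.append(start_i)                  # disjoint: start a new interval
            joined_e.append(end_i)
    return joined_s, joined_e


print(join_overlapping_sketch([1, 3, 10, 11, 20], [5, 8, 12, 15, 25]))
# ([1, 10, 20], [8, 15, 25])
```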

deepcpg.data.annotations.join_overlapping_frame(d)[source]

Join overlapping intervals of Pandas DataFrame.

Uses join_overlapping to join overlapping intervals of pandas.DataFrame d.

deepcpg.data.annotations.read_bed(filename, sort=False, usecols=[0, 1, 2], *args, **kwargs)[source]

Read chromo,start,end from BED file without formatting chromo.

data.dna

Functions for representing DNA sequences.

deepcpg.data.dna.char_to_int(seq)[source]

Translate chars of single sequence seq to ints.

Parameters:
seq: str

DNA sequence.

Returns:
list

Integer-encoded seq.

deepcpg.data.dna.get_alphabet(special=False, reverse=False)[source]

Return char->int alphabet.

Parameters:
special: bool

If True, remove special ‘N’ character.

reverse: bool

If True, return int->char instead of char->int alphabet.

Returns:
OrderedDict

DNA alphabet.
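
To make the char->int and int->char directions concrete, here is a small sketch of encoding and decoding a sequence. The mapping A=0, T=1, G=2, C=3, N=4 is an assumption chosen for illustration, and the function names are hypothetical; consult the alphabet returned by get_alphabet() for the mapping the library actually uses.

```python
from collections import OrderedDict

# Assumed encoding, for illustration only.
CHAR_TO_INT = OrderedDict([('A', 0), ('T', 1), ('G', 2), ('C', 3), ('N', 4)])
INT_TO_CHAR = OrderedDict((value, key) for key, value in CHAR_TO_INT.items())


def char_to_int_sketch(seq):
    """Translate the characters of one sequence to integers."""
    return [CHAR_TO_INT[char] for char in seq.upper()]


def int_to_char_sketch(seq, join=True):
    """Translate integers back to characters, optionally joined into a str."""
    chars = [INT_TO_CHAR[value] for value in seq]
    return ''.join(chars) if join else chars


encoded = char_to_int_sketch('ACGTN')
print(encoded)                      # [0, 3, 2, 1, 4]
print(int_to_char_sketch(encoded))  # ACGTN
```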

deepcpg.data.dna.int_to_char(seq, join=True)[source]

Translate ints of single sequence seq to chars.

Parameters:
seq: list

Integer-encoded sequence.

join: bool

If True, join characters to str.

Returns:
If `join=True`, `str`, otherwise list of chars.

deepcpg.data.dna.int_to_onehot(seqs, dim=4)[source]

One-hot encodes array of integer sequences.

Takes array [nb_seq, seq_len] of integer sequences and encodes them one-hot. Special nucleotides (int > 4) will be encoded as [0, 0, 0, 0].

Returns:
:class:`numpy.ndarray`

[nb_seq, seq_len, dim] numpy.ndarray of one-hot encoded sequences.
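
The shape change from [nb_seq, seq_len] to [nb_seq, seq_len, dim] can be sketched with nested lists. In this illustration any integer outside 0..dim-1 becomes an all-zero vector, which is an assumption of the sketch; `int_to_onehot_sketch` is a hypothetical name and the library returns a numpy.ndarray.

```python
def int_to_onehot_sketch(seqs, dim=4):
    """One-hot encode nested lists of integers; out-of-range values map to all zeros."""
    return [
        [[1 if value == i else 0 for i in range(dim)] for value in seq]
        for seq in seqs
    ]


onehot = int_to_onehot_sketch([[0, 3, 4]])
print(onehot)  # [[[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]]
```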

deepcpg.data.dna.onehot_to_int(seqs, axis=-1)[source]

Translates one-hot sequences to integer sequences.

data.fasta

Functions for reading FASTA files.

class deepcpg.data.fasta.FastaSeq(head, seq)[source]

FASTA sequence.

deepcpg.data.fasta.parse_lines(lines)[source]

Parse FASTA sequences from list of strings.

Parameters:
lines: list

List of lines from FASTA file.

Returns:
list

List of FastaSeq objects.
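
A minimal sketch of parsing FASTA lines is shown below. `FastaRecord` is a hypothetical stand-in for FastaSeq with the same two fields (head, seq); skipping blank lines and stripping the leading '>' from the header are assumptions of this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for FastaSeq(head, seq).
FastaRecord = namedtuple('FastaRecord', ['head', 'seq'])


def parse_lines_sketch(lines):
    """Group lines into records: a '>' header followed by one or more sequence lines."""
    records = []
    head, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith('>'):
            if head is not None:
                records.append(FastaRecord(head, ''.join(chunks)))
            head, chunks = line[1:], []   # start a new record
        elif head is not None:
            chunks.append(line)           # sequence may span several lines
    if head is not None:
        records.append(FastaRecord(head, ''.join(chunks)))
    return records


records = parse_lines_sketch(['>chr1 test', 'ACGT', 'TTAA', '', '>chr2', 'GG'])
print(records[0].head, records[0].seq)  # chr1 test ACGTTTAA
```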

deepcpg.data.fasta.read_chromo(filenames, chromo)[source]

Read DNA sequence of chromosome chromo.

Parameters:
filenames: list

List of FASTA files.

chromo: str

Chromosome that is read.

Returns:
str

DNA sequence of chromosome chromo.

deepcpg.data.fasta.read_file(filename, gzip=None)[source]

Read FASTA file and return sequences.

Parameters:
filename: str

File name.

gzip: bool

If True, file is gzip compressed. If None, suffix is used to determine if file is compressed.

Returns:
List of :class:`FastaSeq` objects.

deepcpg.data.fasta.select_file_by_chromo(filenames, chromo)[source]

Select file of chromosome chromo.

Parameters:
filenames: list

List of file names or directory with FASTA files.

chromo: str

Chromosome that is selected.

Returns:
str

Filename in filenames that contains chromosome chromo.

data.feature_extractor

Feature extraction.

class deepcpg.data.feature_extractor.IntervalFeatureExtractor[source]

Check if positions are in a list of intervals (start, end).

static index_intervals(x, ys, ye)[source]

For each position x[i], return the index j such that ys[j] <= x[i] <= ye[j], or -1. Intervals must be non-overlapping!

Parameters:
x : list

List of positions.

ys: list

List with start of interval sorted in ascending order.

ye: list

List with end of interval.

Returns:
:class:`numpy.ndarray`

numpy.ndarray of same length as x with index or -1.

static join_intervals(s, e)[source]

Transform a list of possibly overlapping intervals into non-overlapping intervals.

Parameters:
s: list

List with start of interval sorted in ascending order.

e: list

List with end of interval.

Returns:
tuple

Tuple (s, e) of non-overlapping intervals.

class deepcpg.data.feature_extractor.KnnCpgFeatureExtractor(k=1)[source]

Extract k CpG sites next to target sites. Exclude CpG sites at the same position.

extract(x, y, ys)[source]

Extract state and distance of k CpG sites next to target sites. Target site is excluded.

Parameters:
x: :class:`numpy.ndarray`

numpy.ndarray with target positions sorted in ascending order.

y: :class:`numpy.ndarray`

numpy.ndarray with source positions sorted in ascending order.

ys: :class:`numpy.ndarray`

numpy.ndarray with source CpG states.

Returns:
tuple

Tuple (cpg, dist) with numpy arrays of dimension (len(x), 2k):
  • cpg: CpG states to the left (0:k) and right (k:2k).
  • dist: Distances to the left (0:k) and right (k:2k).
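
The layout of the two returned arrays can be illustrated with a pure-Python sketch. It assumes that neighbors are stored in ascending genomic order within each half and that missing neighbors at the boundaries are padded with None; the library may order or pad differently. `knn_cpg_sketch` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def knn_cpg_sketch(x, y, ys, k=1):
    """For each target in x, collect states and distances of k sources left and right."""
    cpg, dist = [], []
    for p in x:
        lo = bisect_left(y, p)     # first source index with y >= p
        hi = bisect_right(y, p)    # first source index with y > p (skips the target itself)
        left = list(range(max(0, lo - k), lo))
        right = list(range(hi, min(len(y), hi + k)))
        pad_l = [None] * (k - len(left))    # padding value is an assumption
        pad_r = [None] * (k - len(right))
        cpg.append(pad_l + [ys[j] for j in left] + [ys[j] for j in right] + pad_r)
        dist.append(pad_l + [p - y[j] for j in left] + [y[j] - p for j in right] + pad_r)
    return cpg, dist


cpg, dist = knn_cpg_sketch(x=[10, 20], y=[5, 10, 15, 30], ys=[1, 0, 1, 0], k=2)
print(cpg)   # [[None, 1, 1, 0], [0, 1, 0, None]]
print(dist)  # [[None, 5, 5, 20], [10, 5, 10, None]]
```

The source site at position 10 is skipped for the target at position 10, matching the rule that CpG sites at the same position are excluded.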

data.hdf

Functions for accessing HDF5 files.

deepcpg.data.hdf.hnames_to_names(hnames)[source]

Flattens dict hnames of hierarchical names.

Converts a hierarchical dict, e.g. hnames={‘a’: [‘a1’, ‘a2’], ‘b’}, to a flat list of keys for accessing an HDF5 file, e.g. [‘a/a1’, ‘a/a2’, ‘b’].
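
The flattening can be sketched as a short recursive function. Since a Python dict needs a value for every key, this sketch marks a leaf such as ‘b’ with None; that convention and the name `hnames_to_names_sketch` are assumptions for illustration.

```python
def hnames_to_names_sketch(hnames):
    """Flatten a hierarchical dict of names into 'group/name' keys."""
    names = []
    for key, value in hnames.items():
        if isinstance(value, dict):
            names.extend('%s/%s' % (key, name) for name in hnames_to_names_sketch(value))
        elif isinstance(value, (list, tuple)):
            names.extend('%s/%s' % (key, name) for name in value)
        elif isinstance(value, str):
            names.append('%s/%s' % (key, value))
        else:
            names.append(key)    # e.g. None: the key itself names the record
    return names


print(hnames_to_names_sketch({'a': ['a1', 'a2'], 'b': None}))  # ['a/a1', 'a/a2', 'b']
```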

deepcpg.data.hdf.ls(filename, group='/', recursive=False, groups=False, regex=None, nb_key=None, must_exist=True)[source]

List names of records in HDF5 file.

Parameters:
filename: str

Path of HDF5 file.

group: str

HDF5 group to be explored.

recursive: bool

If True, list records recursively.

groups: bool

If True, list only group names, not dataset names.

regex: str

Regex to filter listed records.

nb_key: int

Maximum number of records to be listed.

must_exist: bool

If False, return None if file or group does not exist.

Returns:
list

List with names of records in filename.

deepcpg.data.hdf.write_data(data, filename)[source]

Write data in dict data to HDF5 file.

data.stats

Functions for computing statistics about binary CpG matrix.

CpG matrix x is assumed to have shape:
  • [sites, cells] for per-CpG statistics
  • [sites, cells, context] for window-based statistics

deepcpg.data.stats.cat2_var(*args, **kwargs)[source]

Binary variance between cells.

deepcpg.data.stats.cat_var(x, nb_bin=3, *args, **kwargs)[source]

Categorical variance between cells.

Discretizes variance from var() into nb_bin equally-spaced bins.

deepcpg.data.stats.diff(x)[source]

Test if CpG site is differentially methylated.

deepcpg.data.stats.entropy(x)[source]

Entropy of single CpG sites between cells.

deepcpg.data.stats.get(name)[source]

Return object from module by its name.

deepcpg.data.stats.mean(x)[source]

Mean methylation rate.

deepcpg.data.stats.mode(x)[source]

Mode of methylation rate.

deepcpg.data.stats.var(x, *args, **kwargs)[source]

Variance between cells.
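
For a binary matrix of shape [sites, cells], the per-CpG statistics above reduce to simple functions of the methylation rate p of each site. The sketch below assumes complete data (no missing values), the population variance p(1 - p), and entropy measured in bits; the library may handle missing values, normalization, or the logarithm base differently. `site_stats_sketch` is a hypothetical name.

```python
import math


def site_stats_sketch(x):
    """Per-site mean, variance, and entropy for rows of binary values (one row per site)."""
    stats = []
    for site in x:
        p = sum(site) / len(site)    # mean methylation rate across cells
        var = p * (1 - p)            # variance of a binary variable
        if p in (0, 1):
            entropy = 0.0            # no uncertainty when all cells agree
        else:
            entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        stats.append({'mean': p, 'var': var, 'entropy': entropy})
    return stats


for row in site_stats_sketch([[1, 1, 1, 1], [1, 0, 1, 0]]):
    print(row)
# {'mean': 1.0, 'var': 0.0, 'entropy': 0.0}
# {'mean': 0.5, 'var': 0.25, 'entropy': 1.0}
```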

data.utils

General purpose IO functions.

class deepcpg.data.utils.GzipFile(filename, mode='r', *args, **kwargs)[source]

Wrapper to read and write gzip-compressed files.

If filename ends with gz, opens file with gzip package, otherwise builtin open function.

Parameters:
filename: str

Path of file.

mode: str

File access mode.

*args: list

Unnamed arguments passed to open function.

**kwargs: dict

Named arguments passed to open function.
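
The suffix-based dispatch can be sketched as a small function built on the standard `gzip` module. `open_maybe_gzip` is a hypothetical helper, not the class itself; note that `gzip.open` treats mode 'r' as binary, so the sketch passes explicit text modes ('wt', 'rt').

```python
import gzip
import os
import tempfile


def open_maybe_gzip(filename, mode='r', *args, **kwargs):
    """Open with gzip if the name ends with 'gz', otherwise with the builtin open."""
    if filename.endswith('gz'):
        return gzip.open(filename, mode, *args, **kwargs)
    return open(filename, mode, *args, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.tsv.gz')
    with open_maybe_gzip(path, 'wt') as f:     # text mode, compressed on disk
        f.write('chr1\t100\t1\n')
    with open_maybe_gzip(path, 'rt') as f:
        content = f.read()

print(repr(content))  # 'chr1\t100\t1\n'
```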

deepcpg.data.utils.add_to_dict(src, dst)[source]

Add dict src to dict dst.

Adds values in dict src to dict dst with the same keys, but values in dst are lists of added values. Lists of values in dst can be stacked with stack_dict(). Used, for example, in dcpg_eval.py to stack dicts from different batches.
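
The collect-then-stack pattern can be sketched as follows. Plain lists stand in for numpy arrays here, so stacking is shown as list concatenation; support for nested dicts and the function names are assumptions of this sketch.

```python
def add_to_dict_sketch(src, dst):
    """Append each value of src to the list stored under the same key in dst."""
    for key, value in src.items():
        if isinstance(value, dict):
            add_to_dict_sketch(value, dst.setdefault(key, {}))
        else:
            dst.setdefault(key, []).append(value)


def stack_dict_sketch(data):
    """Concatenate the per-batch lists collected by add_to_dict_sketch."""
    stacked = {}
    for key, value in data.items():
        if isinstance(value, dict):
            stacked[key] = stack_dict_sketch(value)
        else:
            stacked[key] = [item for batch in value for item in batch]
    return stacked


batches = [{'preds': [0.1, 0.9]}, {'preds': [0.4]}]
collected = {}
for batch in batches:
    add_to_dict_sketch(batch, collected)

print(collected)                     # {'preds': [[0.1, 0.9], [0.4]]}
print(stack_dict_sketch(collected))  # {'preds': [0.1, 0.9, 0.4]}
```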

deepcpg.data.utils.format_chromo(chromo)[source]

Format chromosome name.

Makes name upper case, e.g. ‘mt’ -> ‘MT’ and removes ‘chr’, e.g. ‘chr1’ -> ‘1’.
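
For a single chromosome name, the formatting rule can be sketched like this. The sketch works on one string at a time and strips the prefix case-insensitively; both are assumptions, as is the name `format_chromo_sketch`.

```python
def format_chromo_sketch(chromo):
    """Remove a leading 'chr' prefix and convert the name to upper case."""
    name = str(chromo)
    if name.lower().startswith('chr'):
        name = name[3:]
    return name.upper()


print([format_chromo_sketch(c) for c in ['mt', 'chr1', 'chrX']])  # ['MT', '1', 'X']
```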

deepcpg.data.utils.get_anno_names(data_file, *args, **kwargs)[source]

Return names of annotations stored in data_file.

deepcpg.data.utils.get_cpg_wlen(data_file, max_len=None)[source]

Return number of CpG neighbors stored in data_file.

deepcpg.data.utils.get_dna_wlen(data_file, max_len=None)[source]

Return length of DNA sequence windows stored in data_file.

deepcpg.data.utils.get_nb_sample(data_files, nb_max=None, batch_size=None)[source]

Count number of samples in all data_files.

Parameters:
data_files: list

List with file names of DeepCpG data files.

nb_max: int

If defined, stop counting if that number is reached.

batch_size: int

If defined, return the largest multiple of batch_size that is smaller than or equal to the actual number of samples.

Returns:
int

Number of samples in data_files.
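
The effect of nb_max and batch_size on the returned count can be sketched with integer arithmetic. The function below takes an already known sample count instead of reading data files, which is an assumption made to keep the example self-contained; `limit_nb_sample_sketch` is a hypothetical name.

```python
def limit_nb_sample_sketch(nb_sample, nb_max=None, batch_size=None):
    """Cap the count at nb_max, then round down to a multiple of batch_size."""
    if nb_max is not None:
        nb_sample = min(nb_sample, nb_max)
    if batch_size is not None:
        nb_sample = (nb_sample // batch_size) * batch_size
    return nb_sample


print(limit_nb_sample_sketch(1050, batch_size=128))              # 1024
print(limit_nb_sample_sketch(1050, nb_max=500, batch_size=128))  # 384
```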

deepcpg.data.utils.get_output_names(data_file, *args, **kwargs)[source]

Return names of outputs stored in data_file.

deepcpg.data.utils.get_replicate_names(data_file, *args, **kwargs)[source]

Return names of replicates stored in data_file.

deepcpg.data.utils.is_bedgraph(filename)[source]

Test if filename is a bedGraph file.

bedGraph files are assumed to start with ‘track type=bedGraph’.

deepcpg.data.utils.is_binary(values)[source]

Check if values in array values are binary, i.e. zero or one.

deepcpg.data.utils.read_cpg_profile(filename, chromos=None, nb_sample=None, round=False, sort=True, nb_sample_chromo=None)[source]

Read CpG profile from TSV or bedGraph file.

Reads CpG profile from either a tab-delimited file with columns chromo, pos, value, or a bedGraph file. The value column contains methylation states, which can be binary or continuous.

Parameters:
filename: str

Path of file.

chromos: list

List of formatted chromosomes to be read, e.g. [‘1’, ‘X’].

nb_sample: int

Maximum number of samples in total.

round: bool

If True, round methylation states in column ‘value’ to zero or one.

sort: bool

If True, sort rows by chromosome and position.

nb_sample_chromo: int

Maximum number of samples per chromosome.

Returns:
:class:`pandas.DataFrame`

pandas.DataFrame with columns chromo, pos, value.

deepcpg.data.utils.sample_from_chromo(frame, nb_sample)[source]

Randomly sample nb_sample samples from each chromosome.

Samples nb_sample records from pandas.DataFrame which must contain a column with name ‘chromo’.

deepcpg.data.utils.stack_dict(data)[source]

Stacks lists of numpy arrays in dict data.

deepcpg.data.utils.threadsafe_generator(f)[source]

A decorator that takes a generator function and makes it thread-safe.

class deepcpg.data.utils.threadsafe_iter(it)[source]

Takes an iterator/generator and makes it thread-safe by serializing calls to the next method of the given iterator/generator.
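
A common way to serialize calls to next is to guard them with a lock. The sketch below shows that pattern together with a matching decorator; the class and function names are hypothetical, and the code illustrates the general technique without claiming to reproduce the library's implementation.

```python
import functools
import threading


class ThreadsafeIterSketch:
    """Wrap an iterator so that only one thread at a time can advance it."""

    def __init__(self, it):
        self.it = iter(it)
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:          # serialize access to the underlying iterator
            return next(self.it)


def threadsafe_generator_sketch(f):
    """Decorator that wraps the generator returned by f in ThreadsafeIterSketch."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return ThreadsafeIterSketch(f(*args, **kwargs))
    return wrapper


@threadsafe_generator_sketch
def count_up(n):
    for i in range(n):
        yield i


shared = count_up(1000)
seen = []


def worker():
    for value in shared:         # all workers consume the same generator
        seen.append(value)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen))  # 1000
```

Without the lock, two threads advancing the same generator at once can raise "ValueError: generator already executing"; with it, every value is delivered to exactly one worker.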