`data`¶

Package for reading, writing, and transforming data.

`data.annotations`¶

Functions for reading and matching annotations.

deepcpg.data.annotations.distance(pos, start, end)[source]¶

Return shortest distance between a position and a list of intervals.

Parameters:	pos: list List of integer positions. start: list Start position of intervals. end: list End position of intervals.
Returns:	`numpy.ndarray` of same length as pos with shortest distance between each pos[i] and any interval.
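To make the semantics concrete, the following is a minimal pure-Python sketch of such a distance computation. It is not the library's implementation: the helper name `shortest_distance` is hypothetical, and the sketch assumes closed intervals and a distance of zero for positions that fall inside an interval.

```python
def shortest_distance(pos, start, end):
    """Distance from each position to the nearest closed interval [start[j], end[j]]."""
    result = []
    for p in pos:
        best = None
        for s, e in zip(start, end):
            if s <= p <= e:
                d = 0        # assumption: a position inside an interval has distance 0
            elif p < s:
                d = s - p    # position lies to the left of the interval
            else:
                d = p - e    # position lies to the right of the interval
            if best is None or d < best:
                best = d
        result.append(best)
    return result


distances = shortest_distance([1, 12, 25], start=[5, 20], end=[10, 22])
print(distances)
```

The brute-force double loop keeps the sketch short; a vectorized or sorted-search version would be preferable for genome-scale inputs.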

deepcpg.data.annotations.extend_len(start, end, min_len, min_pos=1)[source]¶

Extend intervals to minimum length.

Extends intervals start-end with length smaller than min_len to length min_len by equally decreasing start and increasing end.

Parameters:	start: list List of start position of intervals. end: list List of end position of intervals. min_len: int Minimum interval length. min_pos: int Minimum position.
Returns:	tuple Tuple with start and end position of extended intervals.
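One way to read this description is sketched below. The function name `extend_to_min_len` is hypothetical, and the sketch assumes closed intervals (length `end - start + 1`) and that any extension lost by clamping the start at `min_pos` is added to the end instead; the package may handle these details differently.

```python
def extend_to_min_len(start, end, min_len, min_pos=1):
    """Extend closed intervals shorter than min_len; a sketch, not the library code."""
    new_start, new_end = [], []
    for s, e in zip(start, end):
        # Number of positions missing to reach min_len (0 if already long enough).
        missing = max(0, min_len - (e - s + 1))
        # Move the start left by half of the deficit, but never below min_pos.
        s_ext = max(min_pos, s - missing // 2)
        # The end absorbs whatever is still missing after the start was moved.
        e_ext = e + max(0, min_len - (e - s_ext + 1))
        new_start.append(s_ext)
        new_end.append(e_ext)
    return new_start, new_end


ext_start, ext_end = extend_to_min_len([10, 2, 100], [12, 3, 150], min_len=7)
print(ext_start, ext_end)
```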

deepcpg.data.annotations.extend_len_frame(d, min_len)[source]¶: Extend length of intervals in Pandas DataFrame using extend_len.

deepcpg.data.annotations.group_overlapping(s, e)[source]¶

Assign group indices to intervals.

Assigns group index to intervals. Overlapping intervals will be assigned to the same group.

Parameters:	s : list list with start of interval sorted in ascending order. e : list list with end of interval.
Returns:	`numpy.ndarray` with group indices.
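The grouping can be illustrated with a single pass over the sorted intervals. This is a sketch under stated assumptions, not the package's code: `group_overlapping_sketch` is a hypothetical name, intervals are treated as closed, and intervals that merely touch at an endpoint are counted as overlapping.

```python
def group_overlapping_sketch(s, e):
    """Assign the same group index to overlapping intervals (s sorted ascending)."""
    groups = []
    group = -1
    current_end = None  # largest end position seen in the current group
    for start, end in zip(s, e):
        if current_end is None or start > current_end:
            group += 1              # gap before this interval: open a new group
            current_end = end
        else:
            current_end = max(current_end, end)
        groups.append(group)
    return groups


groups = group_overlapping_sketch([1, 3, 10, 11, 20], [5, 4, 12, 15, 21])
print(groups)
```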

deepcpg.data.annotations.in_which(x, ys, ye)[source]¶

Return index of positions in intervals.

For each position x[i], returns the index j such that ys[j] <= x[i] <= ye[j], or -1 if x[i] is not overlapped by any interval. Intervals must be non-overlapping!

Parameters:	x : list list of positions. ys: list list with start of interval sorted in ascending order. ye: list list with end of interval.
Returns:	`numpy.ndarray` with indices of overlapping intervals or -1.
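Because the intervals are sorted and non-overlapping, the lookup can be done with a binary search on the start positions. The sketch below uses the standard-library `bisect` module; the name `index_of_interval` is hypothetical and the code only illustrates the documented condition `ys[j] <= x[i] <= ye[j]`.

```python
from bisect import bisect_right


def index_of_interval(x, ys, ye):
    """Index of the interval containing each position, or -1 if there is none."""
    idx = []
    for p in x:
        j = bisect_right(ys, p) - 1  # last interval whose start is <= p
        if j >= 0 and p <= ye[j]:
            idx.append(j)
        else:
            idx.append(-1)
    return idx


hits = index_of_interval([0, 5, 7, 12, 30], ys=[5, 10, 20], ye=[6, 15, 25])
print(hits)
```

If intervals overlapped, only the interval with the largest start not exceeding the position would be checked, which is why the non-overlap requirement matters.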

deepcpg.data.annotations.is_in(pos, start, end)[source]¶: Test if position is overlapped by at least one interval.

deepcpg.data.annotations.join_overlapping(s, e)[source]¶

Join overlapping intervals.

Transforms a list of possibly overlapping intervals into non-overlapping intervals.

Parameters:	s : list List with start of interval sorted in ascending order. e : list List with end of interval.
Returns:	tuple tuple (s, e) of non-overlapping intervals.
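A common way to join sorted intervals is to keep extending the last merged interval while the next one starts inside it. The sketch below follows that approach; `merge_intervals` is a hypothetical name, and treating touching closed intervals as overlapping is an assumption.

```python
def merge_intervals(s, e):
    """Merge overlapping closed intervals; s must be sorted in ascending order."""
    merged_s, merged_e = [], []
    for start, end in zip(s, e):
        if merged_e and start <= merged_e[-1]:
            # Overlaps the last merged interval: extend its end if needed.
            merged_e[-1] = max(merged_e[-1], end)
        else:
            merged_s.append(start)
            merged_e.append(end)
    return merged_s, merged_e


joined_s, joined_e = merge_intervals([1, 3, 10, 11, 20], [5, 4, 12, 15, 21])
print(joined_s, joined_e)
```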

deepcpg.data.annotations.join_overlapping_frame(d)[source]¶

Join overlapping intervals of Pandas DataFrame.

Uses join_overlapping to join overlapping intervals of pandas.DataFrame d.

deepcpg.data.annotations.read_bed(filename, sort=False, usecols=[0, 1, 2], *args, **kwargs)[source]¶: Read chromo,start,end from BED file without formatting chromo.

`data.dna`¶

Functions for representing DNA sequences.

deepcpg.data.dna.char_to_int(seq)[source]¶

Translate chars of single sequence seq to ints.

Parameters:	seq: str DNA sequence.
Returns:	list Integer-encoded seq.

deepcpg.data.dna.get_alphabet(special=False, reverse=False)[source]¶

Return char->int alphabet.

Parameters:	special: bool If True, remove special ‘N’ character. reverse: bool If True, return int->char instead of char->int alphabet.
Returns:	OrderedDict DNA alphabet.

deepcpg.data.dna.int_to_char(seq, join=True)[source]¶

Translate ints of single sequence seq to chars.

Parameters:	seq: list Integers of sequence. join: bool If True, join characters into a str.
Returns:	If `join=True`, `str`, otherwise list of chars.
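The round trip between characters and integers might look like the sketch below. The alphabet mapping shown here is an assumption made for illustration, since this reference does not list the actual codes, and `encode` and `decode` are hypothetical stand-ins for `char_to_int` and `int_to_char`.

```python
from collections import OrderedDict

# Assumed alphabet for illustration only; the real mapping is not stated here.
CHAR_TO_INT = OrderedDict([('A', 0), ('T', 1), ('G', 2), ('C', 3), ('N', 4)])
INT_TO_CHAR = {value: key for key, value in CHAR_TO_INT.items()}


def encode(seq):
    """Translate a DNA string to a list of integers (upper-casing is an assumption)."""
    return [CHAR_TO_INT[char] for char in seq.upper()]


def decode(ints, join=True):
    """Translate integers back to characters, optionally joined into a str."""
    chars = [INT_TO_CHAR[i] for i in ints]
    return ''.join(chars) if join else chars


encoded = encode('acgtn')
decoded = decode(encoded)
print(encoded, decoded)
```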

deepcpg.data.dna.int_to_onehot(seqs, dim=4)[source]¶

One-hot encodes array of integer sequences.

Takes array [nb_seq, seq_len] of integer sequences and encodes them one-hot. Special nucleotides (int > 4) will be encoded as [0, 0, 0, 0].

Returns:	[nb_seq, seq_len, dim] `numpy.ndarray` of one-hot encoded sequences.
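The shape transformation can be sketched with nested lists in place of NumPy arrays. `onehot` is a hypothetical name, and in this sketch any integer outside `0..dim-1` maps to an all-zero vector; the exact threshold used by the package may differ.

```python
def onehot(seqs, dim=4):
    """One-hot encode a [nb_seq, seq_len] nested list into [nb_seq, seq_len, dim]."""
    return [
        [[1 if value == i else 0 for i in range(dim)] for value in seq]
        for seq in seqs
    ]


encoded_seqs = onehot([[0, 3, 5]])
print(encoded_seqs)
```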

deepcpg.data.dna.onehot_to_int(seqs, axis=-1)[source]¶: Translates one-hot sequences to integer sequences.

`data.fasta`¶

Functions for reading FASTA files.

class deepcpg.data.fasta.FastaSeq(head, seq)[source]¶: FASTA sequence.

deepcpg.data.fasta.parse_lines(lines)[source]¶

Parse FASTA sequences from list of strings.

Parameters:	lines: list List of lines from FASTA file.
Returns:	list List of `FastaSeq` objects.
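A minimal sketch of this kind of parsing is shown below. `Record` is a hypothetical stand-in for `FastaSeq`, `parse_fasta_lines` is a hypothetical name, and stripping the leading `>` from the header is an assumption about how headers are stored.

```python
from collections import namedtuple

# Hypothetical stand-in for FastaSeq with the same two fields.
Record = namedtuple('Record', ['head', 'seq'])


def parse_fasta_lines(lines):
    """Group lines into records: a '>' header followed by sequence lines."""
    records = []
    head, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if head is not None:
                records.append(Record(head, ''.join(chunks)))
            head, chunks = line[1:], []
        elif head is not None:
            chunks.append(line)
    if head is not None:
        records.append(Record(head, ''.join(chunks)))
    return records


records = parse_fasta_lines(['>chr1 test', 'ACGT', 'TTAA', '', '>chr2', 'GG'])
print(records)
```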

deepcpg.data.fasta.read_chromo(filenames, chromo)[source]¶

Read DNA sequence of chromosome chromo.

Parameters:	filenames: list List of FASTA files. chromo: str Chromosome that is read.
Returns:	str DNA sequence of chromosome chromo.

deepcpg.data.fasta.read_file(filename, gzip=None)[source]¶

Read FASTA file and return sequences.

Parameters:	filename: str File name. gzip: bool If True, file is gzip compressed. If None, suffix is used to determine if file is compressed.
Returns:	list List of `FastaSeq` objects.

deepcpg.data.fasta.select_file_by_chromo(filenames, chromo)[source]¶

Select file of chromosome chromo.

Parameters:	filenames: list List of file names or directory with FASTA files. chromo: str Chromosome that is selected.
Returns:	str Filename in filenames that contains chromosome chromo.

`data.feature_extractor`¶

Feature extraction.

class deepcpg.data.feature_extractor.IntervalFeatureExtractor[source]¶

Check if positions are in a list of intervals (start, end).

static index_intervals(x, ys, ye)[source]¶

Return for positions x[i] the index j such that ys[j] <= x[i] <= ye[j], or -1. Intervals must be non-overlapping!

Parameters:	x : list List of positions. ys: list List with start of interval sorted in ascending order. ye: list List with end of interval.
Returns:	`numpy.ndarray` of same length as x with index or -1.

static join_intervals(s, e)[source]¶

Transform a list of possibly overlapping intervals into non-overlapping intervals.

Parameters:	s: list List with start of interval sorted in ascending order. e: list List with end of interval.
Returns:	tuple Tuple (s, e) of non-overlapping intervals.

class deepcpg.data.feature_extractor.KnnCpgFeatureExtractor(k=1)[source]¶

Extract k CpG sites next to target sites. Exclude CpG sites at the same position.

extract(x, y, ys)[source]¶

Extract state and distance of k CpG sites next to target sites. Target site is excluded.

Parameters:	x: `numpy.ndarray` with target positions sorted in ascending order. y: `numpy.ndarray` with source positions sorted in ascending order. ys: `numpy.ndarray` with source CpG states.
Returns:	tuple Tuple (cpg, dist) with numpy arrays of dimension (len(x), 2k). cpg: CpG states to the left (0:k) and right (k:2k). dist: Distances to the left (0:k) and right (k:2k).
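The layout of the returned arrays can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than documented behaviour: `knn_sketch` is a hypothetical name, missing neighbours are padded with `None`, and left neighbours are ordered from farthest to nearest.

```python
from bisect import bisect_left, bisect_right


def knn_sketch(x, y, ys, k=1, fill=None):
    """For each target in x, collect k source sites to the left and k to the right."""
    cpg, dist = [], []
    for pos in x:
        left_end = bisect_left(y, pos)      # sources in y[:left_end] are < pos
        right_start = bisect_right(y, pos)  # sources in y[right_start:] are > pos
        left = range(max(0, left_end - k), left_end)
        right = range(right_start, min(len(y), right_start + k))
        pad_left = [fill] * (k - len(left))
        pad_right = [fill] * (k - len(right))
        # Columns 0:k hold left neighbours, columns k:2k hold right neighbours.
        cpg.append(pad_left + [ys[j] for j in left]
                   + [ys[j] for j in right] + pad_right)
        dist.append(pad_left + [pos - y[j] for j in left]
                    + [y[j] - pos for j in right] + pad_right)
    return cpg, dist


cpg, dist = knn_sketch(x=[10, 50], y=[5, 10, 20, 40], ys=[1, 0, 1, 0], k=2)
print(cpg)
print(dist)
```

The source site at position 10 is skipped for the target at position 10, which mirrors the rule that CpG sites at the same position are excluded.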

`data.hdf`¶

Functions for accessing HDF5 files.

deepcpg.data.hdf.hnames_to_names(hnames)[source]¶

Flattens dict hnames of hierarchical names.

Converts hierarchical dict, e.g. hnames={‘a’: [‘a1’, ‘a2’], ‘b’}, to flat list of keys for accessing HDF5 file, e.g. [‘a/a1’, ‘a/a2’, ‘b’]
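The flattening can be sketched as a short recursive function. `flatten_names` is a hypothetical name, and because the example dict above is not a valid Python literal, the sketch assumes that a key without children, such as `'b'`, maps to `None`.

```python
def flatten_names(hnames):
    """Flatten a hierarchical dict of names into 'group/name' keys."""
    names = []
    for key, value in hnames.items():
        if isinstance(value, dict):
            names.extend('%s/%s' % (key, name) for name in flatten_names(value))
        elif isinstance(value, (list, tuple)):
            names.extend('%s/%s' % (key, name) for name in value)
        elif isinstance(value, str):
            names.append('%s/%s' % (key, value))
        else:
            names.append(key)  # assumption: None marks a key without children
    return names


flat = flatten_names({'a': ['a1', 'a2'], 'b': None})
print(flat)
```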

deepcpg.data.hdf.ls(filename, group='/', recursive=False, groups=False, regex=None, nb_key=None, must_exist=True)[source]¶

List names of records in HDF5 file.

Parameters:	filename: Path of HDF5 file. group: HDF5 group to be explored. recursive: bool If True, list records recursively. groups: bool If True, only list group names but not names of datasets. regex: str Regex to filter listed records. nb_key: int Maximum number of records to be listed. must_exist: bool If False, return None if file or group does not exist.
Returns:	list List with names of records in filename.

deepcpg.data.hdf.write_data(data, filename)[source]¶: Write data in dict data to HDF5 file.

`data.stats`¶

Functions for computing statistics about a binary CpG matrix.

The CpG matrix x is assumed to have shape:

[sites, cells] for per CpG statistics
[sites, cells, context] for window-based statistics

deepcpg.data.stats.cat2_var(*args, **kwargs)[source]¶: Binary variance between cells.

deepcpg.data.stats.cat_var(x, nb_bin=3, *args, **kwargs)[source]¶

Categorical variance between cells.

Discretizes variance from var() into nb_bin equally-spaced bins.
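The discretization step can be sketched as follows. The names `site_variance` and `bin_variance` are hypothetical, missing values are ignored, and the bin range of 0 to 0.25 is an assumption based on the fact that the population variance of a binary variable cannot exceed 0.25; the bin edges used by the package may be placed differently.

```python
def site_variance(row):
    """Population variance of one CpG site across cells (no missing values)."""
    mean = sum(row) / len(row)
    return sum((value - mean) ** 2 for value in row) / len(row)


def bin_variance(variances, nb_bin=3, max_var=0.25):
    """Map variances in [0, max_var] to nb_bin equally spaced bins 0..nb_bin-1."""
    width = max_var / nb_bin
    return [min(int(v / width), nb_bin - 1) for v in variances]


# Matrix of shape [sites, cells] with three sites and eight cells.
matrix = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
variances = [site_variance(row) for row in matrix]
bins = bin_variance(variances)
print(variances, bins)
```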

deepcpg.data.stats.diff(x)[source]¶: Test if CpG site is differentially methylated.

deepcpg.data.stats.entropy(x)[source]¶: Entropy of single CpG sites between cells.

deepcpg.data.stats.get(name)[source]¶: Return object from module by its name.

deepcpg.data.stats.mean(x)[source]¶: Mean methylation rate.

deepcpg.data.stats.mode(x)[source]¶: Mode of methylation rate.

deepcpg.data.stats.var(x, *args, **kwargs)[source]¶: Variance between cells.

`data.utils`¶

General purpose IO functions.

class deepcpg.data.utils.GzipFile(filename, mode='r', *args, **kwargs)[source]¶

Wrapper to read and write gzip-compressed files.

If filename ends with gz, opens file with gzip package, otherwise with the builtin open function.

Parameters:	filename: str Path of file. mode: str File access mode. *args: list Unnamed arguments passed to open function. **kwargs: dict Named arguments passed to open function.
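The dispatch on the file suffix can be sketched with the standard library alone. `open_maybe_gzip` is a hypothetical helper, not the `GzipFile` class itself, and it only illustrates the suffix rule described above.

```python
import gzip


def open_maybe_gzip(filename, mode='r', *args, **kwargs):
    """Open with gzip.open if filename ends with 'gz', otherwise with open."""
    if filename.endswith('gz'):
        return gzip.open(filename, mode, *args, **kwargs)
    return open(filename, mode, *args, **kwargs)
```

Note that `gzip.open` treats mode `'r'` as binary, so text access to compressed files needs `'rt'` or `'wt'` in this sketch.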

deepcpg.data.utils.add_to_dict(src, dst)[source]¶

Add dict `src` to dict `dst`.

Adds values in dict src to dict dst with same keys but values are lists of added values. Lists of values in dst can be stacked with stack_dict(). Used for example in dcpg_eval.py to stack dicts from different batches.
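The collect-then-stack pattern can be sketched with plain lists standing in for numpy arrays. Both function names are hypothetical, support for nested dicts is an assumption, and list concatenation is used here in place of array stacking.

```python
def add_to_dict_sketch(src, dst):
    """Append each value of src to a list stored under the same key in dst."""
    for key, value in src.items():
        if isinstance(value, dict):
            add_to_dict_sketch(value, dst.setdefault(key, {}))
        else:
            dst.setdefault(key, []).append(value)


def stack_dict_sketch(data):
    """Concatenate the per-batch lists collected by add_to_dict_sketch."""
    stacked = {}
    for key, value in data.items():
        if isinstance(value, dict):
            stacked[key] = stack_dict_sketch(value)
        else:
            stacked[key] = [item for batch in value for item in batch]
    return stacked


batches = [
    {'pos': [1, 2], 'outputs': {'cell1': [0, 1]}},
    {'pos': [3], 'outputs': {'cell1': [1]}},
]
collected = {}
for batch in batches:
    add_to_dict_sketch(batch, collected)
stacked = stack_dict_sketch(collected)
print(stacked)
```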

deepcpg.data.utils.format_chromo(chromo)[source]¶

Format chromosome name.

Makes name upper case, e.g. ‘mt’ -> ‘MT’ and removes ‘chr’, e.g. ‘chr1’ -> ‘1’.
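For a single chromosome name, the two rules can be sketched as below. `format_chromo_sketch` is a hypothetical name, and the sketch assumes a plain string input and removes the prefix only at the start of the name.

```python
def format_chromo_sketch(chromo):
    """Upper-case a chromosome name and strip a leading 'chr' prefix."""
    name = chromo.upper()
    if name.startswith('CHR'):
        name = name[3:]
    return name


formatted = [format_chromo_sketch(name) for name in ['chr1', 'mt', 'chrX', '2']]
print(formatted)
```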

deepcpg.data.utils.get_anno_names(data_file, *args, **kwargs)[source]¶: Return name of annotations stored in data_file.

deepcpg.data.utils.get_cpg_wlen(data_file, max_len=None)[source]¶: Return number of CpG neighbors stored in data_file.

deepcpg.data.utils.get_dna_wlen(data_file, max_len=None)[source]¶: Return length of DNA sequence windows stored in data_file.

deepcpg.data.utils.get_nb_sample(data_files, nb_max=None, batch_size=None)[source]¶

Count number of samples in all data_files.

Parameters:	data_files: list List with file names of DeepCpG data files. nb_max: int If defined, stop counting if that number is reached. batch_size: int If defined, return the largest multiple of batch_size that is less than or equal to the actual number of samples.
Returns:	int Number of samples in data_files.
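The effect of `nb_max` and `batch_size` on the returned count can be illustrated with simple integer arithmetic. `limit_sample_count` is a hypothetical helper that works on an already known total instead of reading data files, and applying `nb_max` before the rounding is an assumption.

```python
def limit_sample_count(nb_sample, nb_max=None, batch_size=None):
    """Cap a sample count at nb_max and round it down to a multiple of batch_size."""
    if nb_max is not None:
        nb_sample = min(nb_sample, nb_max)
    if batch_size is not None:
        nb_sample = (nb_sample // batch_size) * batch_size
    return nb_sample


full_batches = limit_sample_count(1050, batch_size=128)
capped = limit_sample_count(1050, nb_max=1000, batch_size=128)
print(full_batches, capped)
```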

deepcpg.data.utils.get_output_names(data_file, *args, **kwargs)[source]¶: Return name of outputs stored in data_file.

deepcpg.data.utils.get_replicate_names(data_file, *args, **kwargs)[source]¶: Return name of replicates stored in data_file.

deepcpg.data.utils.is_bedgraph(filename)[source]¶

Test if filename is a bedGraph file.

bedGraph files are assumed to start with ‘track type=bedGraph’

deepcpg.data.utils.is_binary(values)[source]¶: Check if values in array values are binary, i.e. zero or one.

deepcpg.data.utils.read_cpg_profile(filename, chromos=None, nb_sample=None, round=False, sort=True, nb_sample_chromo=None)[source]¶

Read CpG profile from TSV or bedGraph file.

Reads a CpG profile from either a tab-delimited file with columns chromo, pos, value, or a bedGraph file. The value column contains methylation states, which can be binary or continuous.

Parameters:

filename: str: Path of file.
chromos: list: List of formatted chromosomes to be read, e.g. [‘1’, ‘X’].
nb_sample: int: Maximum number of samples in total.
round: bool: If True, round methylation states in column ‘value’ to zero or one.
sort: bool: If True, sort rows by chromosome and position.
nb_sample_chromo: int: Maximum number of samples per chromosome.

Returns:

`pandas.DataFrame` with columns chromo, pos, value.

deepcpg.data.utils.sample_from_chromo(frame, nb_sample)[source]¶

Randomly sample nb_sample samples from each chromosome.

Samples nb_sample records from pandas.DataFrame which must contain a column with name ‘chromo’.

deepcpg.data.utils.stack_dict(data)[source]¶: Stacks lists of numpy arrays in dict data.

deepcpg.data.utils.threadsafe_generator(f)[source]¶: A decorator that takes a generator function and makes it thread-safe.

class deepcpg.data.utils.threadsafe_iter(it)[source]¶: Takes an iterator/generator and makes it thread-safe by serializing calls to the next method of the given iterator/generator.
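A common way to implement this pattern is to guard `next` with a lock and to wrap generator functions with a decorator. The sketch below uses hypothetical names (`LockedIterator`, `locked_generator`) and shows the general technique, not necessarily the package's exact code.

```python
import threading


class LockedIterator:
    """Wrap an iterator so that only one thread at a time can advance it."""

    def __init__(self, it):
        self.it = iter(it)
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)


def locked_generator(f):
    """Decorator that makes a generator function return a LockedIterator."""
    def wrapper(*args, **kwargs):
        return LockedIterator(f(*args, **kwargs))
    return wrapper


@locked_generator
def count_to(n):
    for i in range(n):
        yield i


results = []


def worker(gen):
    for value in gen:
        results.append(value)


shared = count_to(1000)
threads = [threading.Thread(target=worker, args=(shared,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(results))
```

Without the lock, two threads calling `next` on the same running generator at once would raise `ValueError: generator already executing`.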
