data
Package for reading, writing, and transforming data.
data.annotations
Functions for reading and matching annotations.
deepcpg.data.annotations.distance(pos, start, end)
Return shortest distance between a position and a list of intervals.
Parameters: pos: list
List of integer positions.
start: list
Start position of intervals.
end: list
End position of intervals.
Returns: numpy.ndarray
Array of the same length as pos with the shortest distance between each pos[i] and any interval.
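The distance computation can be illustrated with a brute-force NumPy sketch. It is not the package's implementation: the name shortest_distance is hypothetical, intervals are assumed to be closed, and a position inside an interval is assumed to have distance 0.

```python
import numpy as np


def shortest_distance(pos, start, end):
    """Distance from each position to the nearest interval (0 if inside).

    Hypothetical helper; assumes closed intervals [start[j], end[j]].
    """
    pos = np.asarray(pos)[:, None]      # shape (n_pos, 1)
    start = np.asarray(start)[None, :]  # shape (1, n_intervals)
    end = np.asarray(end)[None, :]
    # start - pos is positive left of an interval, pos - end is positive
    # right of it, and both are <= 0 inside, so the clipped maximum is 0.
    dist = np.maximum(0, np.maximum(start - pos, pos - end))
    return dist.min(axis=1)


# Positions 5 and 30 are five bases away from an interval, 12 lies inside one.
print(shortest_distance([5, 12, 30], [10, 20], [15, 25]))
```

The sketch compares every position with every interval, which is simple to verify but quadratic in memory; sorted inputs would allow a faster search.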
deepcpg.data.annotations.extend_len(start, end, min_len, min_pos=1)
Extend intervals to minimum length.
Extends intervals start-end with length smaller than min_len to length min_len by equally decreasing start and increasing end.
Parameters: start: list
List of start position of intervals.
end: list
List of end position of intervals.
min_len: int
Minimum interval length.
min_pos: int
Minimum position.
Returns: tuple
tuple with start and end position of extended intervals.
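A minimal sketch of the extension step described above. The function name extend_intervals is hypothetical, and three details are assumptions rather than documented behaviour: intervals are closed (length end - start + 1), an odd deficit puts the extra base on the right, and intervals that would start before min_pos are shifted right.

```python
import numpy as np


def extend_intervals(start, end, min_len, min_pos=1):
    """Extend closed intervals shorter than min_len (illustrative only)."""
    start = np.array(start, dtype=int)
    end = np.array(end, dtype=int)
    length = end - start + 1
    missing = np.maximum(0, min_len - length)  # bases to add per interval
    left = missing // 2                        # half of the deficit goes left
    right = missing - left                     # the remainder goes right
    new_start = start - left
    new_end = end + right
    # Assumption: shift intervals that would begin before min_pos.
    shift = np.maximum(0, min_pos - new_start)
    return new_start + shift, new_end + shift


s, e = extend_intervals([10, 2, 5], [12, 3, 20], min_len=7)
print(s, e)
```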
deepcpg.data.annotations.extend_len_frame(d, min_len)
Extend length of intervals in Pandas DataFrame using extend_len.
deepcpg.data.annotations.group_overlapping(s, e)
Assign a group index to each interval.
Overlapping intervals are assigned to the same group.
Parameters: s : list
list with start of interval sorted in ascending order.
e : list
list with end of interval.
Returns: numpy.ndarray
with group indices.
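The grouping can be sketched as a single pass over intervals sorted by start. The name group_intervals is hypothetical, and the sketch assumes closed intervals, so two intervals that share a position count as overlapping.

```python
import numpy as np


def group_intervals(starts, ends):
    """Return one group index per interval (illustrative sketch)."""
    groups = np.zeros(len(starts), dtype=int)
    if len(starts) == 0:
        return groups
    group = 0
    current_end = ends[0]  # right-most end seen in the current group
    for i in range(1, len(starts)):
        if starts[i] > current_end:
            # Gap before this interval: open a new group.
            group += 1
            current_end = ends[i]
        else:
            current_end = max(current_end, ends[i])
        groups[i] = group
    return groups


print(group_intervals([1, 3, 10, 12, 20], [5, 8, 13, 15, 25]))
```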
deepcpg.data.annotations.in_which(x, ys, ye)
Return index of positions in intervals.
Returns for positions x[i] index j, s.t. ys[j] <= x[i] <= ye[j], or -1 if x[i] is not overlapped by any interval. Intervals must be non-overlapping!
Parameters: x : list
list of positions.
ys: list
list with start of interval sorted in ascending order.
ye: list
list with end of interval.
Returns: numpy.ndarray
with indices of overlapping intervals or -1.
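Because the intervals are sorted and non-overlapping, the lookup can be sketched with a binary search. The name interval_index is hypothetical, and this is one possible approach rather than the package's code.

```python
import numpy as np


def interval_index(x, ys, ye):
    """Index j with ys[j] <= x[i] <= ye[j], or -1 (illustrative sketch)."""
    x = np.asarray(x)
    ys = np.asarray(ys)
    ye = np.asarray(ye)
    if len(ys) == 0:
        return np.full(len(x), -1)
    # Last interval whose start is <= x[i]; -1 if x[i] precedes all starts.
    idx = np.searchsorted(ys, x, side='right') - 1
    # The candidate only counts if x[i] does not exceed its end.
    inside = (idx >= 0) & (x <= ye[idx])
    return np.where(inside, idx, -1)


print(interval_index([1, 5, 10, 15, 22, 30], [5, 20], [10, 25]))
```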
deepcpg.data.annotations.is_in(pos, start, end)
Test if position is overlapped by at least one interval.
deepcpg.data.annotations.join_overlapping(s, e)
Join overlapping intervals.
Transforms a list of possibly overlapping intervals into non-overlapping intervals.
Parameters: s : list
List with start of interval sorted in ascending order
e : list
List with end of interval.
Returns: tuple
tuple (s, e) of non-overlapping intervals.
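Joining sorted intervals can be sketched as a merge that extends the last output interval whenever the next one starts before it ends. The name join_intervals is hypothetical, and closed intervals are assumed.

```python
def join_intervals(starts, ends):
    """Merge overlapping intervals sorted by start (illustrative sketch)."""
    joined_s, joined_e = [], []
    for s, e in zip(starts, ends):
        if joined_e and s <= joined_e[-1]:
            # Overlaps the previous output interval: extend it.
            joined_e[-1] = max(joined_e[-1], e)
        else:
            joined_s.append(s)
            joined_e.append(e)
    return joined_s, joined_e


print(join_intervals([1, 3, 10, 12, 20], [5, 8, 13, 15, 25]))
```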
deepcpg.data.annotations.join_overlapping_frame(d)
Join overlapping intervals of Pandas DataFrame.
Uses join_overlapping to join overlapping intervals of pandas.DataFrame d.
data.dna
Functions for representing DNA sequences.
deepcpg.data.dna.char_to_int(seq)
Translate chars of single sequence seq to ints.
Parameters: seq: str
DNA sequence.
Returns: list
Integer-encoded seq.
deepcpg.data.dna.get_alphabet(special=False, reverse=False)
Return char->int alphabet.
Parameters: special: bool
If True, remove special ‘N’ character.
reverse: bool
If True, return int->char instead of char->int alphabet.
Returns: OrderedDict
DNA alphabet.
deepcpg.data.dna.int_to_char(seq, join=True)
Translate ints of single sequence seq to chars.
Parameters: seq: list
Integers of sequences
join: bool
If True, join characters into a str.
Returns: If join=True, str, otherwise list of chars.
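The round trip between characters and integers can be sketched as follows. The concrete mapping (A=0, T=1, G=2, C=3, N=4) is an assumption made for illustration and is not stated on this page; encode and decode are hypothetical names.

```python
from collections import OrderedDict

# Assumed alphabet, for illustration only.
CHAR_TO_INT = OrderedDict([('A', 0), ('T', 1), ('G', 2), ('C', 3), ('N', 4)])
INT_TO_CHAR = {value: key for key, value in CHAR_TO_INT.items()}


def encode(seq):
    """Translate a DNA string into a list of integers."""
    return [CHAR_TO_INT[char] for char in seq.upper()]


def decode(ints, join=True):
    """Translate integers back to characters, optionally joined to a str."""
    chars = [INT_TO_CHAR[i] for i in ints]
    return ''.join(chars) if join else chars


print(encode('ACGTN'))
print(decode(encode('acgt')))
```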
deepcpg.data.dna.int_to_onehot(seqs, dim=4)
One-hot encodes array of integer sequences.
Takes array [nb_seq, seq_len] of integer sequences and encodes them one-hot. Special nucleotides (int > 4) will be encoded as [0, 0, 0, 0].
Returns: numpy.ndarray
Array of shape [nb_seq, seq_len, dim] with one-hot encoded sequences.
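One-hot encoding can be sketched as a table lookup. The name onehot is hypothetical, and the sketch assumes that every integer outside 0..dim-1 is a special nucleotide that maps to the all-zero vector.

```python
import numpy as np


def onehot(seqs, dim=4):
    """One-hot encode an integer array [nb_seq, seq_len] (sketch)."""
    seqs = np.asarray(seqs)
    # Identity rows encode regular nucleotides; the extra last row is the
    # all-zero vector used for special nucleotides (assumption).
    table = np.vstack([np.eye(dim, dtype=np.int8),
                       np.zeros((1, dim), dtype=np.int8)])
    idx = np.where((seqs >= 0) & (seqs < dim), seqs, dim)
    return table[idx]


encoded = onehot([[0, 3, 4]])
print(encoded.shape)
```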
data.fasta
Functions for reading FASTA files.
deepcpg.data.fasta.parse_lines(lines)
Parse FASTA sequences from list of strings.
Parameters: lines: list
List of lines from FASTA file.
Returns: list
List of FastaSeq objects.
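Parsing FASTA lines can be sketched as below. The Record namedtuple stands in for FastaSeq, whose attributes are not documented here, and parse_fasta_lines is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for FastaSeq; the field names are assumptions.
Record = namedtuple('Record', ['head', 'seq'])


def parse_fasta_lines(lines):
    """Group sequence lines under the preceding '>' header line."""
    records = []
    head, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if head is not None:
                records.append(Record(head, ''.join(chunks)))
            head, chunks = line[1:], []
        elif head is not None:
            chunks.append(line)
    if head is not None:
        records.append(Record(head, ''.join(chunks)))
    return records


print(parse_fasta_lines(['>chr1 test', 'ACGT', 'TTAA', '', '>chr2', 'GGCC']))
```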
deepcpg.data.fasta.read_chromo(filenames, chromo)
Read DNA sequence of chromosome chromo.
Parameters: filenames: list
List of FASTA files.
chromo: str
Chromosome that is read.
Returns: str
DNA sequence of chromosome chromo.
data.feature_extractor
Feature extraction.
class deepcpg.data.feature_extractor.IntervalFeatureExtractor
Check if positions are in a list of intervals (start, end).
static index_intervals(x, ys, ye)
Return for positions x[i] index j, s.t. ys[j] <= x[i] <= ye[j], or -1. Intervals must be non-overlapping!
Parameters: x : list
List of positions.
ys: list
List with start of interval sorted in ascending order.
ye: list
List with end of interval.
Returns: numpy.ndarray
of the same length as x with index or -1.
class deepcpg.data.feature_extractor.KnnCpgFeatureExtractor(k=1)
Extract k CpG sites next to target sites. Exclude CpG sites at the same position.
extract(x, y, ys)
Extract state and distance of k CpG sites next to target sites. Target site is excluded.
Parameters: x: numpy.ndarray
numpy.ndarray with target positions sorted in ascending order.
y: numpy.ndarray
numpy.ndarray with source positions sorted in ascending order.
ys: numpy.ndarray
numpy.ndarray with source CpG states.
Returns: tuple
Tuple (cpg, dist) with numpy arrays of dimension (len(x), 2k):
- cpg: CpG states to the left (0:k) and right (k:2k)
- dist: Distances to the left (0:k) and right (k:2k)
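The neighbour extraction can be sketched with two binary searches per target site. The name knn_cpg is hypothetical, and two choices are assumptions not stated on this page: missing neighbours are filled with NaN, and left neighbours are right-aligned so the nearest one sits in column k - 1.

```python
import numpy as np


def knn_cpg(x, y, ys, k=1):
    """States and distances of k CpG sites left and right of each target."""
    x, y, ys = np.asarray(x), np.asarray(y), np.asarray(ys)
    cpg = np.full((len(x), 2 * k), np.nan)
    dist = np.full((len(x), 2 * k), np.nan)
    for i, pos in enumerate(x):
        # y[:left_end] < pos and y[right_start:] > pos, so a source site at
        # the target position itself is excluded.
        left_end = np.searchsorted(y, pos, side='left')
        right_start = np.searchsorted(y, pos, side='right')
        left = np.arange(max(0, left_end - k), left_end)
        right = np.arange(right_start, min(len(y), right_start + k))
        cpg[i, k - len(left):k] = ys[left]
        dist[i, k - len(left):k] = pos - y[left]
        cpg[i, k:k + len(right)] = ys[right]
        dist[i, k:k + len(right)] = y[right] - pos
    return cpg, dist


cpg, dist = knn_cpg([5, 10], [1, 5, 8, 12], [1, 0, 1, 0], k=1)
print(cpg)
print(dist)
```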
data.hdf
Functions for accessing HDF5 files.
deepcpg.data.hdf.hnames_to_names(hnames)
Flattens dict hnames of hierarchical names.
Converts hierarchical dict, e.g. hnames={‘a’: [‘a1’, ‘a2’], ‘b’}, to flat list of keys for accessing HDF5 file, e.g. [‘a/a1’, ‘a/a2’, ‘b’]
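The flattening can be sketched recursively. The name flatten_names is hypothetical. Since a bare key such as 'b' is not valid inside a Python dict literal, the sketch assumes that leaf keys carry the value None.

```python
def flatten_names(hnames):
    """Flatten a hierarchical dict of names into HDF5-style paths."""
    names = []
    for key, value in hnames.items():
        if isinstance(value, dict):
            names.extend('%s/%s' % (key, name)
                         for name in flatten_names(value))
        elif isinstance(value, (list, tuple)):
            names.extend('%s/%s' % (key, name) for name in value)
        elif isinstance(value, str):
            names.append('%s/%s' % (key, value))
        else:
            # Assumption: any other value (e.g. None) marks a leaf key.
            names.append(key)
    return names


print(flatten_names({'a': ['a1', 'a2'], 'b': None}))
```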
deepcpg.data.hdf.ls(filename, group='/', recursive=False, groups=False, regex=None, nb_key=None, must_exist=True)
List names of records in an HDF5 file.
Parameters: filename:
Path of HDF5 file.
group:
HDF5 group to be explored.
recursive: bool
If True, list records recursively.
groups: bool
If True, only list group names but not name of datasets.
regex: str
Regex to filter listed records.
nb_key: int
Maximum number of records to be listed.
must_exist: bool
If False, return None if file or group does not exist.
Returns: list
list with name of records in filename.
data.stats
Functions for computing statistics about a binary CpG matrix.
The CpG matrix x is assumed to have shape
- [sites, cells] for per-CpG statistics
- [sites, cells, context] for window-based statistics
data.utils
General purpose IO functions.
class deepcpg.data.utils.GzipFile(filename, mode='r', *args, **kwargs)
Wrapper to read and write gzip-compressed files.
If filename ends with gz, opens the file with the gzip package, otherwise with the builtin open function.
Parameters: filename: str
Path of file
mode: str
File access mode
*args: list
Unnamed arguments passed to open function.
**kwargs: dict
Named arguments passed to open function.
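The dispatch between gzip and the builtin open can be sketched with the standard library alone. The function open_maybe_gzip is hypothetical, and opening gzip files in text mode by default is an assumption of this sketch, not documented behaviour of GzipFile.

```python
import gzip


def open_maybe_gzip(filename, mode='r', *args, **kwargs):
    """Open filename with gzip if it ends with 'gz', else with open()."""
    if filename.endswith('gz'):
        # gzip.open treats 'r' and 'w' as binary; request text mode unless
        # the caller asked for binary explicitly (assumption).
        if 'b' not in mode and 't' not in mode:
            mode += 't'
        return gzip.open(filename, mode, *args, **kwargs)
    return open(filename, mode, *args, **kwargs)
```

Both branches return a file object that works as a context manager, so callers can use a with statement regardless of compression.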
deepcpg.data.utils.add_to_dict(src, dst)
Add dict src to dict dst.
Adds values in dict src to dict dst with same keys but values are lists of added values. Lists of values in dst can be stacked with stack_dict(). Used for example in dcpg_eval.py to stack dicts from different batches.
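The accumulate-then-stack pattern can be sketched for flat dicts as follows. The names add_batch and stack_batches are hypothetical, and the sketch does not handle nested dicts.

```python
import numpy as np


def add_batch(src, dst):
    """Append each value of src to the list stored under the same key."""
    for key, value in src.items():
        dst.setdefault(key, []).append(value)


def stack_batches(data):
    """Concatenate the per-batch arrays collected under each key."""
    return {key: np.concatenate(values) for key, values in data.items()}


collected = {}
add_batch({'y': np.array([1, 2])}, collected)
add_batch({'y': np.array([3])}, collected)
print(stack_batches(collected)['y'])
```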
deepcpg.data.utils.format_chromo(chromo)
Format chromosome name.
Makes name upper case, e.g. ‘mt’ -> ‘MT’ and removes ‘chr’, e.g. ‘chr1’ -> ‘1’.
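The two formatting rules can be sketched in a few lines. The name normalize_chromo is hypothetical, and stripping only a leading 'chr' prefix is an assumption.

```python
def normalize_chromo(chromo):
    """Upper-case a chromosome name and drop a leading 'chr' prefix."""
    name = chromo.upper()
    if name.startswith('CHR'):
        name = name[3:]
    return name


print(normalize_chromo('mt'), normalize_chromo('chr1'))
```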
deepcpg.data.utils.get_anno_names(data_file, *args, **kwargs)
Return name of annotations stored in data_file.
deepcpg.data.utils.get_cpg_wlen(data_file, max_len=None)
Return number of CpG neighbors stored in data_file.
deepcpg.data.utils.get_dna_wlen(data_file, max_len=None)
Return length of DNA sequence windows stored in data_file.
deepcpg.data.utils.get_nb_sample(data_files, nb_max=None, batch_size=None)
Count number of samples in all data_files.
Parameters: data_files: list
list with file name of DeepCpG data files.
nb_max: int
If defined, stop counting if that number is reached.
batch_size: int
If defined, return the largest multiple of batch_size that is less than or equal to the actual number of samples.
Returns: int
Number of samples in data_files.
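The effect of nb_max and batch_size on the returned count can be illustrated with plain arithmetic. The helper round_to_batch is hypothetical, and applying nb_max before rounding to a batch multiple is an assumption.

```python
def round_to_batch(nb_sample, batch_size=None, nb_max=None):
    """Cap a sample count and round it down to a multiple of batch_size."""
    if nb_max is not None:
        nb_sample = min(nb_sample, nb_max)
    if batch_size is not None:
        nb_sample = (nb_sample // batch_size) * batch_size
    return nb_sample


# 1000 samples with batches of 128 leave 7 full batches.
print(round_to_batch(1000, batch_size=128))
```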
deepcpg.data.utils.get_output_names(data_file, *args, **kwargs)
Return name of outputs stored in data_file.
deepcpg.data.utils.get_replicate_names(data_file, *args, **kwargs)
Return name of replicates stored in data_file.
deepcpg.data.utils.is_bedgraph(filename)
Test if filename is a bedGraph file.
bedGraph files are assumed to start with ‘track type=bedGraph’
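The header check can be sketched by reading only the first line. The name looks_like_bedgraph is hypothetical, and the sketch handles uncompressed files only.

```python
def looks_like_bedgraph(filename):
    """Return True if the first line starts with 'track type=bedGraph'."""
    with open(filename) as handle:
        return handle.readline().startswith('track type=bedGraph')
```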
deepcpg.data.utils.is_binary(values)
Check if values in array values are binary, i.e. zero or one.
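A minimal NumPy sketch of the binary check; all_binary is a hypothetical name, and how the package treats missing values is not covered here.

```python
import numpy as np


def all_binary(values):
    """Return True if every element of values equals 0 or 1."""
    values = np.asarray(values)
    return bool(np.all((values == 0) | (values == 1)))


print(all_binary([0, 1, 1.0]), all_binary([0, 0.5, 1]))
```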
deepcpg.data.utils.read_cpg_profile(filename, chromos=None, nb_sample=None, round=False, sort=True, nb_sample_chromo=None)
Read CpG profile from TSV or bedGraph file.
Reads CpG profile from either a tab-delimited file with columns chromo, pos, value, or a bedGraph file. The value column contains methylation states, which can be binary or continuous.
Parameters: filename: str
Path of file.
chromos: list
List of formatted chromosomes to be read, e.g. [‘1’, ‘X’].
nb_sample: int
Maximum number of samples in total.
round: bool
If True, round methylation states in column ‘value’ to zero or one.
sort: bool
If True, sort rows by chromosome and position.
nb_sample_chromo: int
Maximum number of samples per chromosome.
Returns: pandas.DataFrame
with columns chromo, pos, value.
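Reading the tab-delimited variant can be sketched with pandas. The function read_profile and its parameter round_values are hypothetical, the file is assumed to have no header row, and bedGraph input and sample limits are left out.

```python
import io

import pandas as pd


def read_profile(source, chromos=None, round_values=False, sort=True):
    """Read a headerless TSV with columns chromo, pos, value (sketch)."""
    frame = pd.read_csv(source, sep='\t', header=None,
                        names=['chromo', 'pos', 'value'],
                        dtype={'chromo': str, 'pos': int, 'value': float})
    # Normalise chromosome names: upper case, no 'chr' prefix.
    frame['chromo'] = (frame['chromo'].str.upper()
                       .str.replace('^CHR', '', regex=True))
    if chromos is not None:
        frame = frame[frame['chromo'].isin(chromos)].copy()
    if round_values:
        frame['value'] = frame['value'].round()
    if sort:
        frame = frame.sort_values(['chromo', 'pos'])
    return frame.reset_index(drop=True)


text = 'chr2\t10\t0.9\nchr1\t7\t0.2\nchr1\t3\t1.0\n'
print(read_profile(io.StringIO(text), round_values=True))
```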
deepcpg.data.utils.sample_from_chromo(frame, nb_sample)
Randomly sample nb_sample samples from each chromosome.
Samples nb_sample records from pandas.DataFrame which must contain a column with name ‘chromo’.
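Per-chromosome sampling can be sketched with pandas. The name sample_per_chromo and the seed parameter are hypothetical, and capping the sample size at the group size is an assumption for chromosomes with fewer than nb_sample records.

```python
import pandas as pd


def sample_per_chromo(frame, nb_sample, seed=0):
    """Draw up to nb_sample rows from each chromosome (sketch)."""
    parts = []
    for _, group in frame.groupby('chromo'):
        n = min(nb_sample, len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)


frame = pd.DataFrame({'chromo': ['1'] * 5 + ['2'] * 2, 'pos': range(7)})
print(sample_per_chromo(frame, 3))
```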