nuad documentation
nuad.search
The main export of the search module is the function search_for_sequences()
,
which is a stochastic local search for finding DNA sequences to assign to
Domain
’s in a Design
to satisfy all Constraint
’s.
Various parameters of the search can be controlled using SearchParameters
.
Instructions for using the nuad library are available at https://github.com/UC-Davis-molecular-computing/nuad#data-model
- class search.SearchParameters(constraints=<factory>, probability_of_keeping_change=None, random_seed=None, never_increase_score=None, out_directory=None, on_improved_design=<function SearchParameters.<lambda>>, restart=False, force_overwrite=False, debug_log_file=False, info_log_file=False, report_only_violations=True, max_iterations=None, target_score=None, max_domains_to_change=1, num_digits_update=None, warn_fixed_sequences=True, save_report_for_all_updates=False, save_design_for_all_updates=False, save_sequences_for_all_updates=False, log_time=False, scrolling_output=True)[source]
This class describes various parameters to give to the search algorithm
search_for_sequences()
.- Parameters:
constraints (List[Constraint])
probability_of_keeping_change (Callable[[float], float] | None)
random_seed (int | None)
never_increase_score (bool | None)
out_directory (str | None)
on_improved_design (Callable[[int], None])
restart (bool)
force_overwrite (bool)
debug_log_file (bool)
info_log_file (bool)
report_only_violations (bool)
max_iterations (int | None)
target_score (float | None)
max_domains_to_change (int)
num_digits_update (int | None)
warn_fixed_sequences (bool)
save_report_for_all_updates (bool)
save_design_for_all_updates (bool)
save_sequences_for_all_updates (bool)
log_time (bool)
scrolling_output (bool)
- constraints: List[Constraint]
List of
constraints.Constraint
’s to apply to theDesign
.
- probability_of_keeping_change: Callable[[float], float] | None = None
Function giving the probability of keeping a change in one
Domain
’s DNA sequence, if the new sequence affects the total score of all violatedConstraint
’s by score_delta, the input to probability_of_keeping_change. Seedefault_probability_of_keeping_change_function()
for a description of the default behavior if this parameter is not specified.
- random_seed: int | None = None
Integer given as a random seed to the numpy random number generator, used for all random choices in the algorithm. Set this to a fixed value to allow reproducibility.
- never_increase_score: bool | None = None
If specified and True, then it is assumed that the function probability_of_keeping_change returns 0 for any negative value of score_delta (i.e., the search never goes “uphill”), and the search for violations is optimized to quit as soon as the total score of violations exceeds that of the current optimal solution. This vastly speeds up the search in later stages, when the current optimal solution is low score. If both probability_of_keeping_change and never_increase_score are left unspecified, then probability_of_keeping_change uses the default, which never goes uphill, and never_increase_score is set to True. If probability_of_keeping_change is specified and never_increase_score is not, then never_increase_score is set to False. If both are specified and never_increase_score is set to True, then take caution that probability_of_keeping_change really has the property that it never goes uphill; the optimization will essentially prevent most uphill climbs from occurring.
- out_directory: str | None = None
Directory in which to write output files (report on constraint violations and DNA sequences) whenever a new optimal sequence assignment is found.
- on_improved_design()
Function to call whenever the design improves. Takes an integer as input indicating the number of times the design has improved.
- restart: bool = False
If this function was previously called and placed files in out_directory, calling with this parameter True will re-start the search at that point.
- force_overwrite: bool = False
If restart is False and there are files/subdirectories in out_directory, then the user will be prompted to confirm that they want to delete these, UNLESS force_overwrite is True.
- debug_log_file: bool = False
If True, a very detailed log of events is written to the file debug.log in the directory out_directory. If run for several hours, this file can grow to hundreds of megabytes.
- info_log_file: bool = False
If True, the text written to the screen through logger.info (on the logger instance used in dsd.constraints) is written to the file log_info.log in the directory out_directory.
- report_only_violations: bool = True
If True, does not give report on each constraint that was satisfied; only reports violations and summary of all constraint checks of a certain type (e.g., how many constraint checks there were).
- max_iterations: int | None = None
Maximum number of iterations of search to perform.
- target_score: float | None = None
Total violation score to attempt to obtain. A score of 0.0 represents that all constraints are satisfied. Often a search can get very close to score 0.0 quickly, but take a very long time to reach a score of 0.0. Set this to a small positive value to allow the search to quit before all constraints are satisfied, but they are “mostly” satisfied.
- max_domains_to_change: int = 1
Maximum number of
constraints.Domain
’s to change at a time. A number between 1 and max_domains_to_change is selected uniformly at random, and then that manyconstraints.Domain
’s are selected proportional to the score ofconstraints.Constraint
’s that they violated.
- num_digits_update: int | None = None
Number of digits to use when writing update number in filenames. By default, they will be written using just enough digits for each integer, (for example, for sequences) sequences-0.txt, sequences-1.txt, …, sequences-9.txt, sequences-10.txt, … If num_digits_update=3 is specified, for instance, they will be written sequences-000.txt, sequences-001.txt, …, sequences-009.txt, sequences-010.txt, …, sequences-099.txt, sequences-100.txt, …, sequences-999.txt, sequences-1000.txt, …, i.e., using leading zeros to have exactly 3 digits, until the integers are sufficiently large that more digits are required.
- warn_fixed_sequences: bool = True
Log warning about sequences that are fixed, indicating they will not be re-assigned during the search.
- save_report_for_all_updates: bool = False
A report on the most recently updated
Design
is always written to a file current-best-report.txt. If this is True, then in the folder reports, a file unique to that update is also written. Set to False to use less space on disk.
- save_design_for_all_updates: bool = False
A serialized (JSON) description of the most recently updated
Design
is always written to a file current-best-design.json. If this is True, then in the folder designs, a file unique to that update is also written. Set to False to use less space on disk.
- save_sequences_for_all_updates: bool = False
A list of sequences for each
Strand
of most recently updatedDesign
is always written to a file current-best-sequences.txt. If this is True, then in the folder sequences, a file unique to that update is also written. Set to False to use less space on disk.
- log_time: bool = False
Whether to log the time taken per iteration to the screen.
- scrolling_output: bool = True
If True, then screen output “scrolls” on the screen, i.e., a newline is printed after each iteration, e.g.,
$ python sst_canvas.py using random seed of 1; use this same seed to reproduce this run number of processes in system: 4 |-----------|--------|-----------|-----------|----------|---------------| | iteration | update | opt score | new score | StrandSS | StrandPairRNA | | 0 | 0 | 2555.9 | 2545.9 | 118.2 | 2437.8 | |-----------|--------|-----------|-----------|----------|---------------| | iteration | update | opt score | new score | StrandSS | StrandPairRNA | | 1 | 1 | 2545.9 | 2593.0 | 120.2 | 2425.6 | |-----------|--------|-----------|-----------|----------|---------------| | iteration | update | opt score | new score | StrandSS | StrandPairRNA | | 2 | 1 | 2545.9 | 2563.1 | 120.2 | 2425.6 | |-----------|--------|-----------|-----------|----------|---------------| | iteration | update | opt score | new score | StrandSS | StrandPairRNA | | 3 | 1 | 2545.9 | 2545.0 | 120.2 | 2425.6 | |-----------|--------|-----------|-----------|----------|---------------| | iteration | update | opt score | new score | StrandSS | StrandPairRNA | | 4 | 2 | 2545.0 | 2510.1 | 121.0 | 2423.9 |
If False, then the screen output is updated in place:
$ python sst_canvas.py using random seed of 1; use this same seed to reproduce this run number of processes in system: 4 |-----------|--------|-----------|-----------|----------|---------------| | iteration | update | opt score | new score | StrandSS | StrandPairRNA | | 27 | 14 | 2340.5 | 2320.5 | 109.6 | 2230.9 |
This is done by printing the symbol ‘r’ (carriage return), which sets the print position back to the start of the line. The terminal screen must be wide enough to handle the output or this won’t work.
The search also occassionally logs other things to the screen that may disrupt this a bit.
- __init__(constraints=<factory>, probability_of_keeping_change=None, random_seed=None, never_increase_score=None, out_directory=None, on_improved_design=<function SearchParameters.<lambda>>, restart=False, force_overwrite=False, debug_log_file=False, info_log_file=False, report_only_violations=True, max_iterations=None, target_score=None, max_domains_to_change=1, num_digits_update=None, warn_fixed_sequences=True, save_report_for_all_updates=False, save_design_for_all_updates=False, save_sequences_for_all_updates=False, log_time=False, scrolling_output=True)
- Parameters:
constraints (List[Constraint])
probability_of_keeping_change (Callable[[float], float] | None)
random_seed (int | None)
never_increase_score (bool | None)
out_directory (str | None)
on_improved_design (Callable[[int], None])
restart (bool)
force_overwrite (bool)
debug_log_file (bool)
info_log_file (bool)
report_only_violations (bool)
max_iterations (int | None)
target_score (float | None)
max_domains_to_change (int)
num_digits_update (int | None)
warn_fixed_sequences (bool)
save_report_for_all_updates (bool)
save_design_for_all_updates (bool)
save_sequences_for_all_updates (bool)
log_time (bool)
scrolling_output (bool)
- Return type:
None
- search.search_for_sequences(design, params)[source]
Search for DNA sequences to assign to each
Domain
in design, satisfying the variousConstraint
’s inSearchParameters.constraints
.Search algorithm: This is a stochastic local search. It determines which
Constraint
’s are violated. More precisely, it adds the total score of all violated constraints (sum ofconstraints.Constraint.weight
* score_of_violation over all violatedConstraint
’s). The goal is to reduce this total score until it is 0 (i.e., no violated constraints). AnyDomain
“involved” in the violatedConstraint
is noted as being one of theDomain
’s responsible for the violation, i.e., is “blamed”. For example, if aDomainConstraint
is violated, only oneDomain
is blamed, whereas if aStrandConstraint
is violated, everyDomain
in theStrand
is blamed. However, fixed domains (those withconstraints.Domain.fixed
= True) are never blamed, since their DNA sequences cannot be changed.While any
Constraint
’s are violated, aDomain
is picked at random, with probability proportional to the total score of all theConstraint
’s for which theDomain
was blamed (so probability 0 to pick aDomain
that is fixed or that was involved in no violations). A new DNA sequence is assigned to thisDomain
by callingconstraints.DomainPool.generate_sequence()
on theDomainPool
of thatDomain
.The way to decide whether to keep the changed sequence, or revert to the old sequence, can be configured, but the default is to keep the change if and only if it does not increase the total score of violations. More generally, we calculate the total score of all violated constraints in the original and changed
Design
, calling their difference score_delta = new_total_score - old_total_score. The valueprobability_of_keeping_change(score_delta)
is the probability that the change is kept. The default function computing this probability is returned bydefault_probability_of_keeping_change_function()
, which simply assigns probability 0 to keep the change if score_delta is positive (i.e., the score went up) and probability 1 otherwise. In particular, the change is kept if the score is identical (though this would happen only rarely). One reason to favor this default is that it allows an optimization that speeds up the search significantly in practice: When evaluating constraints, once the total score of violations exceeds that of the best design so far, no further constraints need to be evaluated, since we can decide immediately that the new design change will not be kept.The
Design
is modified in place; eachDomain
is modified to have a DNA sequence.If no DNA sequences are assigned to the
Domain
’s initially, they are picked at random from theDomainPool
associated to eachDomain
by callingconstraints.DomainPool.generate_sequence()
.Otherwise, if DNA sequences are already assigned to the
Domain
’s initially, these sequences are used as a starting point for finding sequences that satisfy allConstraint
’s. (In this case, those sequences are not checked against anyNumpyFilter
’s orSequenceFilter
’s in theDesign
, since those checks are applied prior to assigning DNA sequences to anyDomain
.)The function has some side effects. It writes a report on the optimal sequence assignment found so far every time a new improve assignment is found.
Whenever a new optimal sequence assignment is found, the following are also be written to files:
DNA sequences of each strand are written to a text file .
the whole design itself
a report on the DNA sequences indicating how well they do on constraints.
- Parameters:
design (Design) – The
Design
containing theDomain
’s to which to assign DNA sequences and theConstraint
’s that apply to themparams (SearchParameters) – A
SearchParameters
object with attributes that can be used to specify options for the search.
- Return type:
None
- search.script_name_no_ext()[source]
- Returns:
Name of the Python script currently running, without the .py extension.
- Return type:
str
- search.assign_sequences_to_domains_randomly_from_pools(design, warn_fixed_sequences, rng=Generator(PCG64) at 0x7FEF8A3E4BA0, overwrite_existing_sequences=False)[source]
Assigns to each
Domain
in thisDesign
a random DNA sequence from itsDomainPool
, callingconstraints.DomainPool.generate_sequence()
to get the sequence.This is step #1 in the search algorithm.
- Parameters:
design (Design) – Design to which to assign DNA sequences.
warn_fixed_sequences (bool) – Whether to log warning that each
Domain
withconstraints.Domain.fixed
= True is not being assigned.rng (Generator) – numpy random number generator (type returned by numpy.random.default_rng()).
overwrite_existing_sequences (bool) – Whether to overwrite in this initial assignment any existing sequences for
Domain
’s that already have a DNA sequence. The DNA sequence of aDomain
withconstraints.Domain.fixed
= True are never overwritten, neither here nor later in the search. Non-fixed sequences can be skipped for overwriting on this initial assignment, but they are subject to change by the subsequent search algorithm.
- Return type:
None
- search.default_probability_of_keeping_change_function(params)[source]
Returns a function that takes a float input score_delta representing a change in score of violated constraint, which returns a probability of keeping the change in the DNA sequence assignment. The probability is 1 if the change it is at least as good as the previous (roughly, the score change is not positive), and the probability is 0 otherwise.
To mitigate floating-point rounding errors, the actual condition checked is that score_delta <
epsilon
, on the assumption that if the same score of constraints are violated, rounding errors in calculating score_delta could actually make it slightly above than 0 and result in reverting to the old assignment when we really want to keep the change. If all values ofConstraint.score
are significantly aboutepsilon
(e.g., 1.0 or higher), then this should be is equivalent to keeping a change in the DNA sequence assignment if and only if it is no worse than the previous.- Parameters:
params (SearchParameters) –
SearchParameters
to apply this rule for; params is required because the score ofConstraint
’s in theSearchParameters
are used to calculate an appropriate epsilon value for determining when a score change is too small to be significant (i.e., is due to rounding error)- Returns:
the “keep change” function f: \(\mathbb{R} \to [0,1]\), where \(f(w_\delta) = 1\) if \(w_\delta \leq \epsilon\) (where \(\epsilon\) is chosen to be 1,000,000 times smaller than the smallest
Constraint.weight
for anyConstraint
in design), and \(f(w_\delta) = 0\) otherwise.- Return type:
Callable[[float], float]
- class search.EvaluationSet(constraints: 'Iterable[Constraint]', never_increase_score: 'bool')[source]
- Parameters:
constraints (List[Constraint])
never_increase_score (bool)
- class search.Evaluation(constraint: 'Constraint', violated: 'bool', part: 'DesignPart', domains: 'Iterable[Domain]', score: 'float', summary: 'str', result: 'nc.Result')[source]
- Parameters:
constraint (Constraint)
violated (bool)
part (DesignPart)
domains (FrozenSet[Domain])
score (float)
summary (str)
result (Result)
- search.create_constraints_report(design, constraints, report_only_violations=False, include_only_with_values=False)[source]
Returns
ConstraintsReport
, where itsConstraintsReport.reports
field has oneConstraintReport
for eachConstraint
in constraints, indicating how well design does according to constraints, assuming design has sequences assigned to it, for example, if it was read usingconstraints.Design.from_design_file()
from a design.json file written as part of a call tosearch_for_sequences()
.The report contains the same information as written in the return value of
summary_of_constraints()
, but in a more structred format usingConstraintReport
objects rather than just text. (summary_of_constraints()
calls the methodConstraintReport.content()
on eachConstraintReport
object).- Parameters:
design (Design) – the
constraints.Design
, with sequences assigned to allDomain
’sconstraints (Iterable[Constraint]) – the list of
constraints.Constraint
’s to evaluate in the reportreport_only_violations (bool) – if True, lists only violations of constraints
include_only_with_values (bool) – if True, lists only constraints with “values” specified in the Results
- Returns:
ConstraintsReport
describing a report of how well design does according to constraints- Return type:
- search.create_text_report(design, constraints, report_only_violations=False, include_scores=False)[source]
Returns text string containing report of how well design does according to constraints, assuming design has sequences assigned to it, for example, if it was read using
constraints.Design.from_design_file()
from a design.json file written as part of a call tosearch_for_sequences()
.The report is the same format as written to the reports generated when calling
search_for_sequences()
.- Parameters:
design (Design) – the
constraints.Design
, with sequences assigned to allDomain
’sconstraints (Iterable[Constraint]) – the list of
constraints.Constraint
’s to evaluate in the reportreport_only_violations (bool) – if True, lists only violations of constraints
include_scores (bool) – whether to include the “scores” for each evaluation, which are numbers the search in
search_for_sequences()
is trying to minimize
- Returns:
string describing a report of how well design does according to constraints
- Return type:
str
- search.display_report(design, constraints, report_only_violations=False, layout='vert', xlims=None, ylims=None, yscales='linear', bins=10)[source]
When run in a Jupyter notebook cell, creates a
ConstraintsReport
(the one returned fromcreate_constraints_report()
) and displays its data graphically in the notebook using matplotlib.This is a histogram showing how many “design parts” (e.g., strands, pairs of strands, etc.) had various values for each
Constraint
. The x-axis is the value measured by theConstraint
(more precisely, the number in the fieldResult.value
generated when evaluating theConstraint
), and the y-axis is the number of design parts with that value.- Parameters:
design (nc.Design) – the
constraints.Design
, with sequences assigned to allDomain
’sconstraints (Iterable[Constraint]) – the list of
constraints.Constraint
’s to evaluate in the reportreport_only_violations (bool) – if True, lists only violations of constraints
layout (Literal['horz', 'vert']) – layout of plots. If ‘horz’, they will be laid out horizontally, which is smaller if you have several constraints, but might make it more useful for compare results of different choices of sequences if you call
display_report()
repeatedly for different sequences assigned to designxlims (None | Tuple[float, float] | Dict[str | Constraint, None | Tuple[float, float]]) – If specified, is either a single pair of floats to use as the argument to matplotlib.pyplot.xlim: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlim.html or to specify different xlim values for different constraints, can be a dict mapping
Constraint
(or for conveience, string in the fieldsConstraint.short_description
orConstraint.description
) to an xlim pair of floats.ylims (None | Tuple[float, float] | Dict[str | Constraint, None | Tuple[float, float]]) – Same as parameter xlims but for argument to matplotlib.pyplot.ylim: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlim.html Can either be a single value to apply to all charts, or a dict mapping a constraint to a value, with the same rules as xlims. Also, rather than being a pair of floats, it can be a single float (or the dict can map constraints to single floats), which is then given as the parameter top of matplotlib.pyplot.ylim.
yscales (Literal['log', 'linear', 'symlog'] | Dict[str | Constraint, Literal['log', 'linear', 'symlog']]) – same as argument value to matplotlib.pyplot.yscale: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.yscale.html Can either be a single value to apply to all charts, or a dict mapping a constraint to a value, with the same rules as xlims
bins (int | Dict[str | Constraint, int]) – same as argument bins to matplotlib.pyplot.hist: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.yscale.html Can either be a single value to apply to all charts, or a dict mapping a constraint to a value, with the same rules as xlims
- Return type:
None
- class search.ConstraintsReport(reports, num_evaluations, num_violations, total_score, total_score_nonfixed, total_score_fixed)[source]
Represents a report on how well a design did on all constraints.
- Parameters:
reports (List[ConstraintReport])
num_evaluations (int)
num_violations (int)
total_score (float)
total_score_nonfixed (float)
total_score_fixed (float)
- __init__(reports, num_evaluations, num_violations, total_score, total_score_nonfixed, total_score_fixed)
- Parameters:
reports (List[ConstraintReport])
num_evaluations (int)
num_violations (int)
total_score (float)
total_score_nonfixed (float)
total_score_fixed (float)
- Return type:
None
- reports: List[ConstraintReport]
Has one
ConstraintReport
perConstraint
evaluated.
- num_evaluations: int
Total number of evaluations of all
Constraint
’s.
- num_violations: int
Total number of evaluations of all
Constraint
’s that were violated.
- total_score: float
Total “score” of evaluations of all
Constraint
’s (the score is what the search done bysearch_for_sequences()
is trying to minimize).
- total_score_nonfixed: float
Total “score” of evaluations of all
Constraint
’s that are not “fixed”.It should obey
ConstraintsReport.total_score
==ConstraintsReport.total_score_nonfixed
+ConstraintsReport.total_score_fixed
- total_score_fixed: float
Total “score” of evaluations of all
Constraint
’s that are “fixed”, meaning that all domains involved in the evaluation of the constraint haveDomain.fixed
== True, so the search will not change the sequences of those domains (so it is not possible to satisfy the constraint for that evaluation).It should obey
ConstraintsReport.total_score
==ConstraintsReport.total_score_nonfixed
+ConstraintsReport.total_score_fixed
- class search.ConstraintReport(constraint, evaluation_set, report_only_violations)[source]
Represents a report on how well a design did on a constraint.
- Parameters:
constraint (Constraint[DesignPart])
evaluation_set (EvaluationSet)
report_only_violations (bool)
- __init__(constraint, evaluation_set, report_only_violations)[source]
- Parameters:
constraint (Constraint[DesignPart])
evaluation_set (EvaluationSet)
report_only_violations (bool)
- Return type:
None
- constraint: Constraint[DesignPart]
The
Constraint
to report on. This can be None if theConstraint
object is not available at the time theConstraint.generate_summary()
function is defined. If so it will be automatically inserted by the report generating code.
nuad.constraints
This module defines types for helping to define DNA sequence design constraints.
The key classes are Design
, Strand
, Domain
to define a DNA design,
and various subclasses of Constraint
, such as StrandConstraint
or StrandPairConstraint
,
to define constraints on the sequences assigned to each Domain
when calling
search.search_for_dna_sequences()
.
Also important are two other types of constraints
(not subclasses of Constraint
), which are used prior to the search to determine if it is even
legal to use a DNA sequence: subclasses of the abstract base class NumpyFilter
,
and SequenceFilter
, an alias for a function taking a string as input and returning a bool.
See the README on the GitHub page for more detailed explaination of these classes: https://github.com/UC-Davis-molecular-computing/dsd#data-model
- constraints.all_dna_bases = {'A', 'C', 'G', 'T'}
Set of all DNA bases.
- class constraints.M13Variant(value)[source]
Variants of M13mp18 viral genome. “Standard” variant is p7249. Other variants are longer.
- p7249 = 'p7249'
“Standard” variant of M13mp18; 7249 bases long, available from, for example
https://www.tilibit.com/collections/scaffold-dna/products/single-stranded-scaffold-dna-type-p7249
https://www.neb.com/products/n4040-m13mp18-single-stranded-dna
- p7560 = 'p7560'
Variant of M13mp18 that is 7560 bases long. Available from, for example
https://www.tilibit.com/collections/scaffold-dna/products/single-stranded-scaffold-dna-type-p7560
- p8064 = 'p8064'
Variant of M13mp18 that is 8064 bases long. Available from, for example
https://www.tilibit.com/collections/scaffold-dna/products/single-stranded-scaffold-dna-type-p8064
- p8634 = 'p8634'
Variant of M13mp18 that is 8634 bases long. At the time of this writing, not listed as available from any biotech vender, but Tilibit will make it for you if you ask. (https://www.tilibit.com/pages/contact-us)
- length()[source]
- Returns:
length of this variant of M13 (e.g., 7249 for variant
M13Variant.p7249
)- Return type:
int
- constraints.m13(rotation=5587, variant=M13Variant.p7249)[source]
The M13mp18 DNA sequence (commonly called simply M13).
By default, starts from cyclic rotation 5587 (with 0-based indexing; commonly this is called rotation 5588, which assumes that indexing begins at 1), as defined in GenBank.
By default, returns the “standard” variant of consisting of 7249 bases, sold by companies such as Tilibit and New England Biolabs.
For a more detailed discussion of why the default rotation 5587 of M13 is used, see Supplementary Note S8 in [Folding DNA to create nanoscale shapes and patterns. Paul W. K. Rothemund, Nature 440:297-302 (2006)].
- Parameters:
rotation (int) – rotation of circular strand. Valid values are 0 through length-1.
variant (M13Variant) – variant of M13 strand to use
- Returns:
M13 strand sequence
- Return type:
str
- constraints.m13_substrings_of_length(length, except_indices=(5514, 5515, 5516, 5517, 5518, 5519, 5520, 5521, 5522, 5523, 5524, 5525, 5526, 5527, 5528, 5529, 5530, 5531, 5532, 5533, 5534, 5535, 5536, 5537, 5538, 5539, 5540, 5541, 5542, 5543, 5544, 5545, 5546, 5547, 5548, 5549, 5550, 5551, 5552, 5553, 5554, 5555, 5556), variant=M13Variant.p7249)[source]
WARNING: This function was previously recommended to use with
DomainPool.possible_sequences
to specify possible rotations of M13 to use. However, it creates a large file size to write all those sequences to disk on every update in the search. A better method now exists to specify this, which is to specify aSubstringSampler
object as the value forDomainPool.possible_sequences
instead of calling this function.Return all substrings of the M13mp18 DNA sequence of length length, except those overlapping indices in except_start_indices.
This is useful with the field
DomainPool.possible_sequences
, when one strand in theDesign
represents a small portion of the full M13 sequence, and part of the sequence design process is to choose a rotation of M13 to use. One can set that strand to have a singleDomain
, which contains dependent subdomains (those withDomain.dependent
set to True). These subdomains are the smaller domains where M13 attaches to otherStrand
’s in theDesign
. Then, give the parentDomain
aDomainPool
withDomainPool.possible_sequences
set to the return value of this function, to allow the search to explore different rotations of M13.For example, suppose m13_subdomains is a list containing
Domain
’s from theDesign
, which are consecutive subdomains of M13 from 5’ to 3’ (all withDomain.dependent
set to True), and m13_length is the sum of their lengths (note this needs to be calculated manually since the following code assumes noDomain
in m13_subdomains has aDomainPool
yet, thus none yet have a length). Then the following code creates aStrand
representing the M13 portion that binds to otherStrand
’s in theDesign
.m13_subdomains = # subdomains of M13 used in the design m13_length = # sum of lengths of domains in m13_subdomains m13_substrings = dc.m13_substrings_of_length(m13_length) m13_domain_pool = dc.DomainPool(name='m13 domain pool', possible_sequences=m13_substrings) m13_domain = dc.Domain(name='m13', subdomains=m13_subdomains, pool=m13_domain_pool) m13_strand = dc.Strand(name='m13', domains=[m13_domain])
- Parameters:
length (int) – length of substrings to return
except_indices (Iterable[int]) – Indices of M13 to avoid in any part of the substring. If not specified, based on length, indices 5514-5556 are avoided, which are known to contain a long hairpin. (When using 1-based indexing, these are indices 5515-5557.) For example, if length = 10, then the starting indices of substrings will avoid the list [5505, 5506, …, 5556]
variant (M13Variant) –
M13Variant
to use
- Returns:
All substrings of the M13mp18 DNA sequence, except those that overlap any index in except_start_indices.
- Return type:
List[str]
- constraints.default_score_transfer_function(x)[source]
A cubic transfer function.
- Returns:
max(0.0, x^3)
- Parameters:
x (float)
- Return type:
float
- constraints.logger = <Logger dsd (DEBUG)>
Global logger instance used throughout dsd.
Call
logger.removeHandler(logger.handlers[0])
to stop screen output (assuming that you haven’t added or removed any handlers to the dsd logger instance already; by default there is one StreamHandler, and removing it will stop screen output).Call
logger.addHandler(logging.FileHandler(filename))
to direct to a file.
- constraints.all_pairs(values, with_replacement=True, where=<function <lambda>>)[source]
Strongly typed function to get list of all pairs from iterable. (for using with mypy)
- Parameters:
values (Iterable[T]) – Iterable of values.
with_replacement (bool) – Whether to include self pairs, i.e., pairs (a,a)
where (Callable[[T, T], bool]) – Predicate indicating whether to include a specific pair. Must take two parameters, each of type T, and return a bool.
- Returns:
List of all pairs of values from iterable.
- Return type:
List[Tuple[T, T]]
- constraints.all_pairs_iterator(values, with_replacement=True, where=<function <lambda>>)[source]
Strongly typed function to get iterator of all pairs from iterable. (for using with mypy)
This is WITH replacement; to specify without replacement, set with_replacement = False
- Parameters:
values (Iterable[T]) – Iterable of values.
with_replacement (bool) – Whether to include self pairs, i.e., pairs (a,a)
where (Callable[[Tuple[T, T]], bool]) – Predicate indicating whether to include a specific pair.
- Returns:
Iterator of all pairs of values from iterable. Unlike
all_pairs()
, which returns a list, the iterator returned may be iterated over only ONCE.- Return type:
Iterator[Tuple[T, T]]
- constraints.SequenceFilter = typing.Callable[[str], bool]
Filter (see description of
NumpyFilter
for explanation of the term “filter”) that applies to a DNA sequence; the difference between this an aDomainConstraint
is that these are applied before a sequence is assigned to aDomain
, so the constraint can only be based on the DNA sequence, and not, for instance, on theDomain
’sDomainPool
.Consequently
SequenceFilter
’s, likeNumpyFilter
’s, are treated differently than subtypes ofConstraint
, since a DNA sequence failing anySequenceFilter
’s orNumpyFilter
’s is never allowed to be assigned into anyDomain
.The difference with
NumpyFilter
is that aNumpyFilter
requires one to express the constraint in a way that is efficient for the linear algebra operations of numpy. If you cannot figure out how to do this, aSequenceFilter
can be expressed in pure Python, but typically will be much slower to apply than aNumpyFilter
.alias of
Callable
[[str
],bool
]
- class constraints.NumpyFilter[source]
Abstract base class for numpy filters. A “filter” is a hard constraint applied to sequences for a
Domain
; a sequence not passing the filter is never allowed to be assigned to aDomain
. This constrasts with the various subclasses ofConstraint
, which are different in two ways: 1) they can apply to large parts of the design than just a domain, e.g., aStrand
or a pair ofDomain
’s, and 2) they are “soft” constraints that are allowed to be violated during the course of the search.A
NumpyFilter
is one that can be efficiently encoded as numpy operations on 2D arrays of bytes representing DNA sequences, through the classnp.DNASeqList
(which uses such a 2D array as the fieldnp.DNASeqList.seqarr
).Subclasses should set the value
NumpyFilter.name
, inherited from this class.Pre-made subclasses of
NumpyFilter
provided in this library, such asRestrictBasesFilter
orNearestNeighborEnergyFilter
, are dataclasses (https://docs.python.org/3/library/dataclasses.html). There is no requirement that custom subclasses be dataclasses, but since the subclasses will inherit the fieldNumpyFilter.name
, you can easily make them dataclasses to get, for example, freerepr
andstr
implementations. See the source code for examples.The related type
SequenceFilter
(which is just an alias for a Python function with a certain signature) has a similar purpose, but is used for filters that cannot be encoded as numpy operations. Since they are applied by running a Python loop, they are much slower to evaluate than aNumpyFilter
.- name: str = 'TODO: give a concrete name to this NumpyFilter'
Name of this
NumpyFilter
.
- abstract remove_violating_sequences(seqs)[source]
Subclasses should override this method.
Since these are filters that use numpy, generally they will access the numpy ndarray instance seqs.seqarr, operate on it, and then create a new
np.DNASeqList
instance via the constructornp.DNASeqList
taking an numpy ndarray as input.See the source code of included constraints for examples, such as
NearestNeighborEnergyFilter.remove_violating_sequences()
orBaseCountFilter.remove_violating_sequences()
. These are usually quite tricky to write, requiring one to think in terms of linear algebra operations. The code tends not to be easy to read. But when a constraint can be expressed in this way, it is typically very fast to apply; many millions of sequences can be processed in a few seconds.- Parameters:
seqs (DNASeqList) –
np.DNASeqList
object representing DNA sequences- Returns:
a new
np.DNASeqList
object representing the DNA sequences in seqs that satisfy the constraint- Return type:
DNASeqList
- __init__()
- Return type:
None
- class constraints.RestrictBasesFilter(bases)[source]
Restricts the sequence to use only a subset of bases. This can be used to implement a so-called “three-letter code”, for instance, in which a certain subset of
Strand
uses only the bases A, T, C (andStrand
’s with complementaryDomain
use only A, T, G), to help reduce secondary structure of thoseStrand
’s. See for example Supplementary Section S1.1 of “Scaling Up Digital Circuit Computation with DNA Strand Displacement Cascades”, Qian and Winfree, Science 332:1196–1201, 2011. DOI: 10.1126/science.1200520, https://science.sciencemag.org/content/332/6034/1196, http://www.qianlab.caltech.edu/seesaw_digital_circuits2011_SI.pdfNote, however, that this is a filter for
Domain
’s, not wholeStrand
’s, so for a three-letter code to work, you must take care not to mixedDomain
’s on aStrand
that will use different alphabets.- Parameters:
bases (Collection[str])
- bases: Collection[str]
Bases to use. Must be a strict subset of {‘A’, ‘C’, ‘G’, ‘T’} with at least two bases.
- remove_violating_sequences(seqs)[source]
Should never be called directly; it is handled specially by the library when initially generating sequences.
- Parameters:
seqs (DNASeqList)
- Return type:
DNASeqList
- __init__(bases)
- Parameters:
bases (Collection[str])
- Return type:
None
- class constraints.NearestNeighborEnergyFilter(low_energy, high_energy, temperature=37.0)[source]
This constraint calculates the nearest-neighbor binding energy of a domain with its perfect complement (summing over all length-2 substrings of the domain’s sequence), using parameters from the 2004 Santa-Lucia and Hicks paper (https://www.annualreviews.org/doi/abs/10.1146/annurev.biophys.32.110601.141800, see Table 1, and example on page 419). It rejects any sequences whose energy according to this sum is outside the range [
NearestNeighborEnergyFilter.low_energy
,NearestNeighborEnergyFilter.high_energy
].- Parameters:
low_energy (float)
high_energy (float)
temperature (float)
- low_energy: float
Low threshold for nearest-neighbor energy.
- high_energy: float
High threshold for nearest-neighbor energy.
- temperature: float = 37.0
Temperature in Celsius at which to calculate nearest-neighbor energy.
- remove_violating_sequences(seqs)[source]
Remove sequences with nearest-neighbor energies outside of an interval.
- Parameters:
seqs (DNASeqList)
- Return type:
DNASeqList
- __init__(low_energy, high_energy, temperature=37.0)
- Parameters:
low_energy (float)
high_energy (float)
temperature (float)
- Return type:
None
- class constraints.BaseCountFilter(base, high_count=None, low_count=None)[source]
Restricts the sequence to contain a certain number of occurences of a given base.
- Parameters:
base (str)
high_count (int | None)
low_count (int | None)
- base: str
Base to count.
- high_count: int | None = None
Count of
BaseCountFilter.base
must be at mostBaseCountFilter.high_count
.
- low_count: int | None = None
Count of
BaseCountFilter.base
must be at leastBaseCountFilter.low_count
.
- remove_violating_sequences(seqs)[source]
Remove sequences whose counts of a certain base are outside of an interval.
- Parameters:
seqs (DNASeqList)
- Return type:
DNASeqList
- __init__(base, high_count=None, low_count=None)
- Parameters:
base (str)
high_count (int | None)
low_count (int | None)
- Return type:
None
- class constraints.BaseEndFilter(bases, distance_from_end=0, five_prime=True, three_prime=True)[source]
Restricts the sequence to contain only certain bases on (or near, if
BaseEndFilter.distance
> 0) each end.- Parameters:
bases (Collection[str])
distance_from_end (int)
five_prime (bool)
three_prime (bool)
- bases: Collection[str]
Bases to require on ends.
- distance_from_end: int = 0
Distance from end.
- five_prime: bool = True
Whether to apply to 5’ end of sequence (left end of DNA sequence, lowest index).
- three_prime: bool = True
Whether to apply to 3’ end of sequence (right end of DNA sequence, highest index).
- remove_violating_sequences(seqs)[source]
Keeps sequences with the given bases at given distance from the 5’ or 3’ end.
- Parameters:
seqs (DNASeqList)
- Return type:
DNASeqList
- __init__(bases, distance_from_end=0, five_prime=True, three_prime=True)
- Parameters:
bases (Collection[str])
distance_from_end (int)
five_prime (bool)
three_prime (bool)
- Return type:
None
- class constraints.BaseAtPositionFilter(bases, position)[source]
Restricts the sequence to contain only certain base(s) on at a particular position.
One use case is that many internal modifications (e.g., biotin or fluorophore) can only be placed on an T.
- Parameters:
bases (str | Collection[str])
position (int)
- bases: str | Collection[str]
Base(s) to require at position
BasePositionConstraint.position
.Can either be a single base, or a collection (e.g., list, tuple, set). If several bases are specified, the base at
BasePositionConstraint.position
must be one of the bases inBasePositionConstraint.bases
.
- position: int
Position of base to check.
- remove_violating_sequences(seqs)[source]
Remove sequences that don’t have one of the given bases at the given position.
- Parameters:
seqs (DNASeqList)
- Return type:
DNASeqList
- __init__(bases, position)
- Parameters:
bases (str | Collection[str])
position (int)
- Return type:
None
- class constraints.ForbiddenSubstringFilter(substrings, indices=None)[source]
Restricts the sequence not to contain a certain substring(s), e.g., GGGG.
- Parameters:
substrings (str | Collection[str])
indices (Sequence[int] | None)
- substrings: str | Collection[str]
Substring(s) to forbid.
Can either be a single substring, or a collection (e.g., list, tuple, set). If a collection, all substrings must have the same length.
- indices: Sequence[int] | None = None
Indices at which to check for each substring in
ForbiddenSubstringFilter.substrings
. If not specified, all appropriate indices are checked.
- remove_violating_sequences(seqs)[source]
Remove sequences that have a string in
ForbiddenSubstringFilter.substrings
as a substring.- Parameters:
seqs (DNASeqList)
- Return type:
DNASeqList
- __init__(substrings, indices=None)
- Parameters:
substrings (str | Collection[str])
indices (Sequence[int] | None)
- Return type:
None
- class constraints.RunsOfBasesFilter(bases, length)[source]
Restricts the sequence not to contain runs of a certain length from a certain subset of bases, (e.g., forbidding any substring in {C,G}^3; no four bases can appear in a row that are either C or G)
This works by simply generating all strings representing the runs of bases, and then using a
ForbiddenSubstringFilter
with those strings. So this will not be efficient for forbidding, for example {A,C,T}^20 (i.e., all runs of A’s, C’s, or T’s of length 20), which would generate all 3^20 = 3,486,784,401 strings of length 20 from the alphabet {A,C,T}^20. Hopefully such a constraint would not be used in practice.- Parameters:
bases (Collection[str])
length (int)
- __init__(bases, length)[source]
- Parameters:
bases (str | Collection[str]) – Can either be a single base, or a collection (e.g., list, tuple, set).
length (int) – length of run to forbid
- Return type:
None
- bases: Collection[str]
Bases to forbid in runs of length
RunsOfBasesFilter.length
.
- length: int
Length of run to forbid.
- class constraints.SubstringSampler(supersequence, substring_length, except_start_indices=None, except_overlapping_indices=None, circular=False)[source]
A
SubstringSampler
is an object for specifying a common case for the fieldDomainPool.possible_sequences
, namely where we want the set of possible sequences to be all (or many) substrings of a single longer sequence.For example, this can be used to choose a rotation of the M13mp18 strand in sequence design. If for example 300 consecutive bases of M13 will be used in the design, and we want to choose the rotation, but disallow the substring of length 300 to overlap the hairpin at indices 5514-5556, then one would do the following
possible_sequences = SubstringSampler( supersequence=m13(), substring_length=300, except_overlapping_indices=range(5514, 5557), circular=True) pool = DomainPool('M13 rotations', possible_sequences=possible_sequences)
For this example, using a
SubstringSampler
is much more efficient than explicitly listing all length-300 substrings of M13 in the parameterDomainPool.possible_sequences
. This is because the latter approach, whenever theDesign
improves and is written to a file, will write out all the length-300 substrings of M13, taking much more space than just writing the full M13 sequence once.- Parameters:
supersequence (str)
substring_length (int)
except_start_indices (Tuple[int, ...])
except_overlapping_indices (Iterable[int] | None)
circular (bool)
- __init__(supersequence, substring_length, except_start_indices=None, except_overlapping_indices=None, circular=False)[source]
- Parameters:
supersequence (str)
substring_length (int)
except_start_indices (Iterable[int] | None)
except_overlapping_indices (Iterable[int] | None)
circular (bool)
- Return type:
None
- supersequence: str
The longer sequence from which to sample substrings.
- substring_length: int
Length of substrings to sample.
- circular: bool
Whether
SubstringSampler.supersequence
is circular. If so, then we can sample indices near the end and the substrings will start at the end and wrap around to the start.
- except_start_indices: Tuple[int, ...]
Start indices in
SubstringSampler.supersequence
to avoid. In the constructor this can be specified directly. Another option (mutually exclusive with the parameter except_start_indices) is to specify the parameter except_overlapping_indices, which setsSubstringSampler.except_start_indices
so that substrings will not intersect any indices in except_overlapping_indices.
- extended_supersequence: str
If
SubstringSampler.circular
is True, then this isSubstringSampler.supersequence
extended by its own prefix of lengthSubstringSampler.substring_length - 1
, to make sampling easier. Otherwise it is simply identical toSubstringSampler.supersequence
. Computed in constructor from other arguments.
- start_indices: Tuple[int, ...]
List of start indices from which to sample when calling
SubstringSampler.sample_substring()
. Computed in constructor from other arguments.
- sample_substring(rng)[source]
- Returns:
a random substring of
SubstringSampler.supersequence
of lengthSubstringSampler.substring_length
.- Parameters:
rng (Generator)
- Return type:
str
- class constraints.DomainPool(name, length=None, possible_sequences=None, replace_with_close_sequences=True, hamming_probability=<factory>, numpy_filters=<factory>, sequence_filters=<factory>)[source]
Represents a group of related
Domain
’s that share common properties in their sequence design, such as length of DNA sequence, or bounds on nearest-neighbor duplex energy.Also serves as a “source” of DNA sequences for
Domain
’s in thisDomainPool
. By callingDomainPool.generate_sequence()
repeatedly, we can produce DNA sequences satisfying the constraints defining thisDomainPool
.- Parameters:
name (str)
length (int | None)
possible_sequences (List[str] | SubstringSampler | None)
replace_with_close_sequences (bool)
hamming_probability (Dict[int, float])
numpy_filters (List[NumpyFilter])
sequence_filters (List[SequenceFilter])
- name: str
Name of this
DomainPool
. Must be unique.
- length: int | None = None
Length of DNA sequences generated by this
DomainPool
.Should be None if
DomainPool.possible_sequences
is specified.
- possible_sequences: List[str] | SubstringSampler | None = None
If specified, all other fields except
DomainPool.name
andDomainPool.length
are ignored. This is an explicit list of sequences to consider forDomain
’s using thisDomainPool
. During the search, if a domain with thisDomainPool
is picked to have its sequence changed, then a sequence will be picked uniformly at random from this list. Note that noNumpyFilter
’s orSequenceFilter
’s will be applied.Alternatively, the field can be an instance of
SubstringSampler
for the common case that the set of possible sequences is the set of substrings of some length of a single longer sequence. For example, this can be used to choose a rotation of the M13mp18 strand in sequence design. (This is advantageous because the files saving theDesign
each time the design improves will be much shorter, since it takes much less space to write the M13 sequence than to write all of its length-300 substrings.)Should be None if
DomainPool.length
is specified.
- replace_with_close_sequences: bool = True
If True, instead of picking a sequence uniformly at random from all those satisfying the filters when returning a sequence from
DomainPool.generate_sequence()
, one is picked “close” in Hamming distance to the previous sequence of theDomain
. The fieldDomainPool.hamming_probability
is used to pick a distance at random, after which a sequence that distance from the previous sequence is selected to return.
- hamming_probability: Dict[int, float]
Dictionary that specifies probability of taking a new sequence from the pool that is some integer number of bases different from the previous sequence (Hamming distance).
- numpy_filters: List[NumpyFilter]
NumpyFilter
’s shared by allDomain
’s in thisDomainPool
. This is used to choose potential sequences to assign to theDomain
’s in thisDomainPool
in the methodDomainPool.generate_sequence()
.The difference with
DomainPool.sequence_filters
is that these constraints can be applied efficiently to many sequences at once, represented as a numpy 2D array of bytes (via the classnp.DNASeqList
), so they are done in large batches in advance. In contrast, the constraints inDomainPool.sequence_filters
are done on Python strings representing DNA sequences, and they are called one at a time when a new sequence is requested inDomainPool.generate_sequence()
.Optional; default is empty.
- sequence_filters: List[SequenceFilter]
SequenceFilter
’s shared by allDomain
’s in thisDomainPool
. This is used to choose potential sequences to assign to theDomain
’s in thisDomainPool
in the methodDomainPool.generate()
.See
DomainPool.numpy_filters
for an explanation of the difference between them.See
DomainPool.domain_constraints
for an explanation of the difference between them.Optional; default is empty.
- satisfies_sequence_constraints(sequence)[source]
- Parameters:
sequence (str) – DNA sequence to check
- Returns:
whether sequence satisfies all constraints in
DomainPool.sequence_filters
- Return type:
bool
- generate_sequence(rng, previous_sequence=None)[source]
Returns a DNA sequence of given length satisfying
DomainPool.numpy_filters
andDomainPool.sequence_filters
Note: By default, there is no check that the sequence returned is unequal to one already assigned somewhere in the design, since both
DomainPool.numpy_filters
andDomainPool.sequence_filters
do not have access to the wholeDesign
. But theDomainPairConstraint
returned bydomains_not_substrings_of_each_other_constraint()
can be used to specify thisDesign
-wide constraint.Note that if
DomainPool.possible_sequences
is specified, then all constraints are ignored, and instead a sequence is chosen randomly to be returned from that list.- Parameters:
rng (np.random.Generator) – numpy random number generator to use. To use a default, pass
np.default_rng
.previous_sequence (str | None) – previously generated sequence to be replaced by a new sequence; None if no previous sequence exists. Used to choose a new sequence “close” to itself in Hamming distance, if the field
DomainPool.replace_with_close_sequences
is True and previous_sequence is not None. The number of differences between previous_sequence and its neighbors is determined by randomly picking a Hamming distance fromDomainPool.hamming_probability
with weighted probabilities of choosing each distance.
- Returns:
DNA sequence of given length satisfying
DomainPool.numpy_filters
andDomainPool.sequence_filters
- Return type:
str
- __init__(name, length=None, possible_sequences=None, replace_with_close_sequences=True, hamming_probability=<factory>, numpy_filters=<factory>, sequence_filters=<factory>)
- Parameters:
name (str)
length (int | None)
possible_sequences (List[str] | SubstringSampler | None)
replace_with_close_sequences (bool)
hamming_probability (Dict[int, float])
numpy_filters (List[NumpyFilter])
sequence_filters (List[SequenceFilter])
- Return type:
None
- class constraints.Domain(name, pool=None, sequence=None, fixed=False, label=None, dependent=False, subdomains=None, weight=None)[source]
Represents a contiguous substring of the DNA sequence of a
Strand
, which is intended to be either single-stranded, or to bind fully to the Watson-Crick complement of theDomain
.If two domains are complementary, they are represented by the same
Domain
object. They are distinguished only by whether theStrand
object containing them has theDomain
in its setStrand.starred_domains
or not.A
Domain
uses only its name to compute hash and equality checks, not its sequence. This allows aDomain
to be used in sets and dicts while modifying the sequence assigned to it, and also modifying the pool (letting the pool be assigned after it is created).- Parameters:
name (str)
pool (DomainPool | None)
sequence (str | None)
fixed (bool)
label (str | None)
dependent (bool)
subdomains (List[Domain] | None)
weight (float | None)
- length: int | None = None
Length of this domain. If None, then the method
Domain.get_length()
asksDomain.pool
for the length. However, aDomain
withDomain.dependent
set to True has noDomain.pool
. For such domains, it is necessary to set aDomain.length
field directly.
- parent: Domain | None = None
Domain of which this is a subdomain. Note, this is not set manually, this is set by the library based on the
Domain.subdomains
of other domains in the same tree.
- __init__(name, pool=None, sequence=None, fixed=False, label=None, dependent=False, subdomains=None, weight=None)[source]
- Parameters:
name (str)
pool (DomainPool | None)
sequence (str | None)
fixed (bool)
label (str | None)
dependent (bool)
subdomains (List[Domain] | None)
weight (float | None)
- Return type:
None
- fixed: bool = False
Whether this
Domain
’s DNA sequence is fixed, i.e., cannot be changed by the search algorithmsearch.search_for_dna_sequences()
.Note: If a domain is fixed then all of its subdomains must also be fixed.
- label: str | None = None
Optional “label” string to associate to this
Domain
.Useful for associating extra information with the
Domain
that will be serialized, for example, for DNA sequence design.
- dependent: bool = False
Whether this
Domain
’s DNA sequence is dependent on others. Usually this is not the case. However, domains can be subdivided hierarchically into a tree of domains by settingDomain.subdomains
to describe the tree. In this case exactly one domain along every path from the root to any leaf must be independent, and the rest dependent: the dependent domains will have their sequences calculated from the indepenedent ones.A possible use case is that one strand represents a subsequence of M13 of length 300, of which there are 7249 possible DNA sequences to assign based on the different rotations of M13. If this strand is bound to several other strands, it will have several domains, but they cannot be set independently of each other. This can be done by creating a strand with a single long domain, which is subdivided into many dependent child domains. Only the entire strand, the root domain, can be assigned at once, changing every domain at once, so the domains are dependent on the root domain’s assigned sequence.
- weight: float = 1.0
Weight to apply before picking domain at random to change when re-assigning DNA sequences during search. Should only be changed for independent domains. (those with
Domain.dependent
set to False)Normally a domain’s probability of being changed is proportional to the total score of violations it causes, but that total score is first multiplied by
Domain.weight
. This is useful, for instance, to weight a domain lower when it has many subdomains that intersect many strands, for example if a domain represents an M13 strand. It may be more efficient to pick such a domain less often since changing it will change many strands in the design and, when the design gets close to optimized, this will likely cause the score to go up.
- to_json_serializable(suppress_indent=True)[source]
- Returns:
Dictionary
d
representing thisDomain
that is “naturally” JSON serializable, by callingjson.dumps(d)
.- Parameters:
suppress_indent (bool)
- Return type:
NoIndent | Dict[str, Any]
- static from_json_serializable(json_map, pool_with_name)[source]
- Parameters:
json_map (Dict[str, Any]) – JSON serializable object encoding this
Domain
, as returned byDomain.to_json_serializable()
.pool_with_name (Dict[str, DomainPool] | None) – dict mapping name to
DomainPool
with that name; required to rehydrateDomain
’s. If None, then a DomainPool with no constraints is created with the name and domain length found in the JSON.
- Returns:
Domain
represented by dict json_map, assuming it was created byDomain.to_json_serializable()
.- Return type:
- property pool: DomainPool
DomainPool
of thisDomain
- Type:
return
- property subdomains: List[Domain]
Subdomains of this
Domain
.Used in connection with
Domain.dependent
to declare that someDomain
’s are contained within other domains (forming a tree in general), and domains withDomain.dependent
set to True automatically take their sequences from independent domains.WARNING: this can be a bit tricky to determine the order when setting these. The subdomains should be listed in 5’ to 3’ order for UNSTARRED domains. If there is a starred domain with starred subdomains, they would be listed in REVERSE order.
For example, if there is a domain dom*
[--------->
of length 11 with two subdomains sub1*[----->
of length 7 and sub2*[-->
of length 4 (put together they look like[----->[-->
) that appear in that order left to right (5’ to 3’), then one would assign the domain dom to have subdomains[sub2, sub1]
, since the UNSTARRED domains appear<-----]<--]
, i.e., in 5’ to 3’ order for the unstarred domains, first the length 4 domain dom2 appears, then the length 7 domain dom1.
- has_length()[source]
- Returns:
True if this
Domain
has a length, which means either a sequence has been assigned to it, or it has aDomainPool
.- Return type:
bool
- get_length()[source]
- Returns:
Length of this domain (delegates to pool)
- Raises:
ValueError – if no
DomainPool
has been set for thisDomain
- Return type:
int
- sequence()[source]
- Returns:
DNA sequence of this domain (unstarred version)
- Raises:
ValueError – If no sequence has been assigned.
- Return type:
str
- set_sequence(new_sequence)[source]
- Parameters:
new_sequence (str) – new DNA sequence to set
- Return type:
None
- set_fixed_sequence(fixed_sequence)[source]
Set DNA sequence and fix it so it is not changed by the nuad sequence designer.
Since it is being fixed, there is no Domain pool, so we don’t check the pool or whether it has a length. We also bypass the check that it is not fixed.
- Parameters:
fixed_sequence (str) – new fixed DNA sequence to set
- Return type:
None
- property starred_name: str
The value
Domain.name
with * appended to it.- Type:
return
- property starred_sequence: str
Watson-Crick complement of DNA sequence assigned to this
Domain
.- Type:
return
- get_name(starred)[source]
- Parameters:
starred (bool) – whether to return the starred or unstarred version of the name
- Returns:
The value
Domain.name
orDomain.starred_name
, depending on the value of parameter starred.- Return type:
str
- concrete_sequence(starred)[source]
- Parameters:
starred (bool) – whether to return the starred or unstarred version of the sequence
- Returns:
The value
Domain.sequence
orDomain.starred_sequence
, depending on the value of parameter starred.- Raises:
ValueError – if this
Domain
does not have a sequence assigned- Return type:
str
- has_sequence()[source]
- Returns:
Whether a complete DNA sequence has been assigned to this
Domain
. If this domain has subdomains, False if any subdomain has not been assigned a sequence.- Return type:
bool
- static complementary_domain_name(domain_name)[source]
Returns the name of the domain complementary to domain_name. In other words, a
*
is either removed from the end of domain_name, or appended to it if not already there.- Parameters:
domain_name (str) – name of domain
- Returns:
name of complementary domain
- Return type:
str
- all_domains_in_tree()[source]
- Returns:
list of all domains in the same subdomain tree as this domain (including itself)
- Return type:
List[Domain]
- all_domains_intersecting()[source]
- Returns:
list of all domains intersecting this one, meaning those domains in the subtree rooted at this domain (including itself), plus any ancestors of this domain.
- Return type:
List[Domain]
- ancestors()[source]
- Returns:
list of all domains that are ancestors of this one, NOT including this domain
- Return type:
List[Domain]
- has_pool()[source]
- Returns:
whether a
DomainPool
has been assigned to thisDomain
- Return type:
bool
- independent_source()[source]
Like
independent_ancestor_or_descendent()
, but returns this Domain if it is already independent.
- constraints.domains_not_substrings_of_each_other_constraint(check_complements=True, short_description='dom neq', weight=1.0, min_length=0, pairs=None)[source]
Returns constraint ensuring no two domains are substrings of each other. Note that this ensures that no two
Domain
’s are equal if they are the same length.- Parameters:
check_complements (bool) – whether to also ensure the check for Watson-Crick complements of the sequences
short_description (str) – short description of constraint suitable for logging to stdout
weight (float) – weight to assign to constraint
min_length (int) – minimum length substring to check. For instance if min_length is 4, then having two domains with sequences AAAA and CAAAAC would violate this constraint, but domains with sequences AAA and CAAAC would not.
pairs (Iterable[Tuple[Domain, Domain]] | None) – pairs of domains to check. By default all pairs of unequal domains are compared unless both are fixed.
- Returns:
a
DomainPairConstraint
ensuring no two domain sequences contain each other as a substring (in particular, if they are equal length, then they are not the same domain)- Return type:
- class constraints.VendorFields(scale='25nm', purification='STD', plate=None, well=None)[source]
Data required when ordering DNA strands from a synthesis company such as IDT (Integrated DNA Technologies). This data is used when automatically generating files used to order DNA from IDT.
When exporting to IDT files via
Design.write_idt_plate_excel_file()
orDesign.write_idt_bulk_input_file()
, the fieldStrand.name
is used for the name if it exists, otherwise a reasonable default is chosen.- Parameters:
scale (str)
purification (str)
plate (str | None)
well (str | None)
- scale: str = '25nm'
Synthesis scale at which to synthesize the strand (third field in IDT bulk input: https://www.idtdna.com/site/order/oligoentry). Choices supplied by IDT at the time this was written:
"25nm"
,"100nm"
,"250nm"
,"1um"
,"5um"
,"10um"
,"4nmU"
,"20nmU"
,"PU"
,"25nmS"
.
- purification: str = 'STD'
Purification options (fourth field in IDT bulk input: https://www.idtdna.com/site/order/oligoentry). Choices supplied by IDT at the time this was written:
"STD"
,"PAGE"
,"HPLC"
,"IEHPLC"
,"RNASE"
,"DUALHPLC"
,"PAGEHPLC"
.
- plate: str | None = None
Name of plate in case this strand will be ordered on a 96-well or 384-well plate.
Optional field, but non-optional if
book = openpyxl.load_workbook(filename=filename).well
is notNone
.
- well: str | None = None
Well position on plate in case this strand will be ordered on a 96-well or 384-well plate.
Optional field, but non-optional if
VendorFields.plate
is notNone
.
- __init__(scale='25nm', purification='STD', plate=None, well=None)
- Parameters:
scale (str)
purification (str)
plate (str | None)
well (str | None)
- Return type:
None
- class constraints.Strand(domains=None, starred_domain_indices=(), group='default_strand_group', name=None, label=None, vendor_fields=None)[source]
Represents a DNA strand, made of several
Domain
’s.- Parameters:
domains (Iterable[Domain] | None)
starred_domain_indices (Iterable[int])
group (str)
name (str | None)
label (str | None)
vendor_fields (VendorFields | None)
- modification_5p: nm.Modification5Prime | None = None
5’ modification; None if there is no 5’ modification.
- modification_3p: nm.Modification3Prime | None = None
3’ modification; None if there is no 3’ modification.
- __init__(domains=None, starred_domain_indices=(), group='default_strand_group', name=None, label=None, vendor_fields=None)[source]
A
Strand
can be created only by listing explicitDomain
objects via parameter domains. To specify aStrand
by giving domain names, see the methodDesign.add_strand()
.- Parameters:
domains (Iterable[Domain] | None) – list of
Domain
’s on thisStrand
starred_domain_indices (Iterable[int]) – Indices of
Domain
’s in domains that are starred.group (str) – name of group of this
Strand
.name (str | None) – Name of this
Strand
.label (str | None) – Label to associate with this
Strand
.vendor_fields (VendorFields | None) –
VendorFields
object to associate with thisStrand
; needed to call methods for exporting to IDT formats (e.g.,Strand.write_idt_bulk_input_file()
)
- Return type:
None
- group: str
Optional “group” field to describe strands that share similar properties.
- starred_domain_indices: FrozenSet[int]
Set of positions of
Domain
’s inStrand.domains
on thisStrand
that are starred.
- label: str | None = None
Optional generic “label” string to associate to this
Strand
.Useful for associating extra information with the
Strand
that will be serialized, for example, for DNA sequence design.
- vendor_fields: VendorFields | None = None
Fields used when ordering strands from a synthesis company such as IDT (Integrated DNA Technologies, Coralville, IA). If present (i.e., not equal to
None
) then the methodDesign.write_idt_bulk_input_file()
can be called to automatically generate an text file for ordering strands in test tubes: https://www.idtdna.com/site/order/oligoentry, as can the methodDesign.write_idt_plate_excel_file()
for writing a Microsoft Excel file that can be uploaded to IDT’s website for describing DNA sequences to be ordered in 96-well or 384-well plates.
- modifications_int: Dict[int, nm.ModificationInternal]
modifications.Modification
’s to the DNA sequence (e.g., biotin, Cy3/Cy5 fluorphores).Maps index within DNA sequence to modification. If the internal modification is attached to a base (e.g., internal biotin, /iBiodT/ from IDT), then the index is that of the base. If it goes between two bases (e.g., internal Cy3, /iCy3/ from IDT), then the index is that of the previous base, e.g., to put a Cy3 between bases at indices 3 and 4, the index should be 3. So for an internal modified base on a sequence of length n, the allowed indices are 0,…,n-1, and for an internal modification that goes between bases, the allowed indices are 0,…,n-2.
- clone(name)[source]
Returns a copy of this
Strand
. The copy is “shallow” in that theDomain
’s are shared. This is useful for creating multiple versions of eachStrand
, e.g., for having a variant with an extension.WARNING: the
Strand.label
will be shared between them. If it should be copied, this must be done manually. A shallow copy of it can be made by setting
- compute_derived_fields()[source]
Re-computes derived fields of this
Strand
. Should be called after modifications to the Strand. (Done automatically at the start ofsearch.search_for_dna_sequences()
.)
- intersects_domain(domain)[source]
- Parameters:
domain (Domain) – domain to test for intersection
- Returns:
whether this strand intersects domain, which is true if either domain is in the list
Strand.domains
, or if any of those domains have domain in their hierarchical tree as a subdomain or an ancestor- Return type:
bool
- length()[source]
- Returns:
Sum of lengths of
Domain
’s in thisStrand
. EachDomain
must have aDomainPool
assigned so that the length is defined.- Return type:
int
- vendor_dna_sequence()[source]
- Returns:
DNA sequence as it needs to be typed to order from a synthesis company, with
Modification5Prime
’s,Modification3Prime
’s, andModificationInternal
’s represented with text codes, e.g., “/5Biosg/ACGT” for sequence ACGT with a 5’ biotin modification to order from IDT.- Return type:
str
- to_json_serializable(suppress_indent=True)[source]
- Returns:
Dictionary
d
representing thisStrand
that is “naturally” JSON serializable, by callingjson.dumps(d)
.- Parameters:
suppress_indent (bool)
- Return type:
NoIndent | Dict[str, Any]
- static from_json_serializable(json_map, domain_with_name)[source]
- Returns:
Strand
represented by dict json_map, assuming it was created byStrand.to_json_serializable()
.- Parameters:
json_map (Dict[str, Any])
domain_with_name (Dict[str, Domain])
- Return type:
- unstarred_domains()[source]
- Returns:
list of unstarred
Domain
’s in thisStrand
, in order they appear inStrand.domains
- Return type:
List[Domain]
- starred_domains()[source]
- Returns:
list of starred
Domain
’s in thisStrand
, in order they appear inStrand.domains
- Return type:
List[Domain]
- sequence(delimiter='')[source]
- Parameters:
delimiter (str) – Delimiter string to place between sequences of each
Domain
in thisStrand
. For instance, if delimiter ='--'
, then it will return a string such asACGTAGCTGA--CGCTAGCTGA--CGATCGATC--GCGATCGAT
- Returns:
DNA sequence assigned to this
Strand
, calculated by concatenating all sequences assigned to itsDomain
’s.- Raises:
ValueError – if any
Domain
of thisStrand
does not have a sequence assigned- Return type:
str
- assign_dna(sequence)[source]
- Parameters:
sequence (str) – DNA sequence to assign to this
Strand
. Must have length =Strand.length()
.- Return type:
None
- unfixed_domains()[source]
- Returns:
all
Domain
’s in thisStrand
whereDomain.fixed
is False- Return type:
Tuple[Domain, …]
- property name: str
name of this
Strand
if it was assigned one, otherwiseDomain
names are concatenated with ‘-’ joining them- Type:
return
- address_of_domain(domain_idx)[source]
Returns
StrandDomainAddress
of the domain located at domain_idx- Rparam domain_idx:
Index of domain
- Parameters:
domain_idx (int)
- Return type:
- address_of_nth_domain_occurence(domain_name, n, forward=True)[source]
Returns
StrandDomainAddress
of the n’th occurence of domain named domain_name.- Parameters:
- Returns:
StrandDomainAddress
of the n’th occurence of domain named domain_name.- Return type:
- address_of_first_domain_occurence(domain_name)[source]
Returns
StrandDomainAddress
of the first occurrence of domain named domain_name starting from the 5’ end.- Parameters:
domain_name (str)
- Return type:
- address_of_last_domain_occurence(domain_name)[source]
Returns
StrandDomainAddress
of the nth occurrence of domain named domain_name starting from the 3’ end.- Parameters:
domain_name (str)
- Return type:
- prepend_domain(domain, starred=False)[source]
Prepends domain to 5’ end of this
Strand
(i.e., the beginning of theStrand
).
- insert_domain(idx, domain, starred=False)[source]
Inserts domain at index idx of this
Strand
, with same semantics as Python’s List.insert. For example,strand.insert(0, domain)
is equivalent tostrand.prepend_domain(domain)
andstrand.insert(len(strand.domains), domain)
is equivalent tostrand.append_domain(domain)
.
- set_fixed_sequence(seq)[source]
Sets each domain of this
Strand
to have a substring of seq, such that the entire strand has the sequence seq. AllDomain
’s in this strand will be fixed after doing this. (And if any of them are already fixed it will raise an error.)- Parameters:
seq (str) – sequence to assign to this
Strand
- Return type:
None
- class constraints.DomainPair(domain1: 'Domain', domain2: 'Domain', starred1: 'bool' = False, starred2: 'bool' = False)[source]
-
- starred1: bool = False
Whether first domain is starred (not used in most constraints)
- starred2: bool = False
Whether second domain is starred (not used in most constraints)
- class constraints.Complex(*args: 'Strand')[source]
- Parameters:
args (Strand)
- constraints.remove_duplicates(lst)[source]
- Parameters:
lst (Iterable[T]) – an Iterable of objects
- Returns:
a List consisting of elements of lst with duplicates removed, while preserving iteration order of lst (naive approach using Python set would not preserve order, since iteration order of Python sets is not specified)
- Return type:
List[T]
- class constraints.PlateType(value)[source]
Represents two different types of plates in which DNA sequences can be ordered.
- wells96 = 96
96-well plate.
- wells384 = 384
384-well plate.
- class constraints.Design(strands=())[source]
Represents a complete design, i.e., a set of DNA
Strand
’s with domains, andConstraint
’s on the sequences to assign to them viasearch.search_for_dna_sequences()
.- Parameters:
strands (List[Strand])
- domains: List[Domain]
List of all
Domain
’s in thisDesign
. (without repetitions)Computed from
Design.strands
, so not specified in constructor.
- strands_by_group_name: Dict[str, List[Strand]]
Dict mapping each group name to a list of the
Strand
’s in thisDesign
in the group.Computed from
Design.strands
, so not specified in constructor.
- domain_pools_to_domain_map: Dict[DomainPool, List[Domain]]
Dict mapping each
DomainPool
to a list of theDomain
’s in thisDesign
in the pool.Computed from
Design.strands
, so not specified in constructor.
- domains_by_name: Dict[str, Domain]
Dict mapping each name of a
Domain
to theDomain
’s in thisDesign
.Computed from
Design.strands
, so not specified in constructor.
- compute_derived_fields()[source]
Computes derived fields of this
Design
. Used to ensure that all fields are valid in case theDesign
was manually modified after being created, before runningsearch.search_for_dna_sequences()
.- Return type:
None
- to_json_serializable(suppress_indent=True)[source]
- Parameters:
suppress_indent (bool) – Whether to suppress indentation of some objects using the NoIndent object.
- Returns:
Dictionary
d
representing thisDesign
that is “naturally” JSON serializable, by callingjson.dumps(d)
.- Return type:
Dict[str, Any]
- write_design_file(directory='.', filename=None, extension='json')[source]
Write JSON file representing this
Design
, which can be imported via the methodDesign.from_design_file()
, with the output file having the same name as the running script but with.py
changed to.json
, unless filename is explicitly specified. For instance, if the script is namedmy_design.py
, then the design will be written tomy_design.json
. If extension is specified (but filename is not), then the design will be written tomy_design.<extension>
The string written is that returned by
Design.to_json()
.- Parameters:
directory (str) – directory in which to put file (default: current working directory)
filename (str | None) – filename (default: name of script with
.py
replaced by.sc
). Mutually exclusive with extensionextension (str) – extension for filename (default:
.sc
) Mutually exclusive with filename
- Return type:
None
- static from_json_serializable(json_map)[source]
- Parameters:
json_map (Dict[str, Any]) – JSON serializable object encoding this
Design
, as returned byDesign.to_json_serializable()
.- Returns:
Design
represented by dict json_map, assuming it was created byDesign.to_json_serializable()
. No constraints are populated.- Return type:
- add_strand(domain_names=None, domains=None, starred_domain_indices=None, group='default_strand_group', name=None, label=None, vendor_fields=None)[source]
This is an alternative way to create strands instead of calling the
Strand
constructor explicitly. It behaves similarly to theStrand
constructor, but it has an option to specifyDomain
’s simply by giving a name.A
Strand
can be created either by listing explicitDomain
objects via parameter domains (as in theStrand
constructor), or by giving names via parameter domain_names. If domain_names is specified, then by convention those that end with a*
are assumed to be starred.In particular,
Domain
objects are created as needed, whenever theDesign
sees a new domain name that has not been encountered. Also,Domain
’s created in this way are “interned” as variables in a cache stored in theDesign
object; no twoDomain
’s with the same name in this design will be created, and subsequent uses of the same name will refer to the sameDomain
object.- Parameters:
domain_names (List[str] | None) – Names of the
Domain
’s on thisStrand
.Domain
objects are created by theDesign
as needed whenever a new domain name is specified; if the domain name has already been used (or its complement via the convention that names ending in a * are the complement of the domain whose name is equal but without ending in a *), then the sameDomain
object is reused. Mutually exclusive withStrand.domains
andStrand.starred_domain_indices
.domains (List[Domain] | None) – list of
Domain
’s on thisStrand
. Mutually exclusive withStrand.domain_names
, and must be specified jointly withStrand.starred_domain_indices
.starred_domain_indices (Iterable[int] | None) – Indices of
Domain
’s in domains that are starred. Mutually exclusive withStrand.domain_names
, and must be specified jointly withStrand.domains
.group (str) – name of group of this
Strand
.name (str | None) – Name of this
Strand
.label (str | None) – Label to associate with this
Strand
.vendor_fields (VendorFields | None) –
VendorFields
object to associate with thisStrand
; needed to call methods for exporting to IDT formats (e.g.,Strand.write_idt_bulk_input_file()
)
- Returns:
the
Strand
that is created- Return type:
- modifications(mod_type=None)[source]
Returns either set of all
modifications.Modification
’s in thisDesign
, or set of all modifications of a given type (5’, 3’, or internal).- Parameters:
mod_type (nm.ModificationType | None) – type of modifications (5’, 3’, or internal); if not specified, all three types are returned
- Returns:
Set of all modifications in this
Design
(possibly of a given type).- Return type:
Set[nm.Modification]
- to_idt_bulk_input_format(delimiter=',', domain_delimiter='', key=None, warn_duplicate_name=False, only_strands_with_vendor_fields=False, strands=None)[source]
Called by
Design.write_idt_bulk_input_file()
to determine what string to write to the file. This function can be used to get the string directly without creating a file.Parameters have the same meaning as in
Design.write_idt_bulk_input_file()
.- Returns:
string that is written to the file in the method
Design.write_idt_bulk_input_file()
.- Parameters:
- Return type:
str
- write_idt_bulk_input_file(*, filename=None, directory='.', key=None, extension=None, delimiter=',', domain_delimiter='', warn_duplicate_name=True, only_strands_with_vendor_fields=False, strands=None)[source]
Write
.idt
text file encoding the strands of thisDesign
with the fieldStrand.vendor_fields
, suitable for pasting into the “Bulk Input” field of IDT (Integrated DNA Technologies, Coralville, IA, https://www.idtdna.com/), with the output file having the same name as the running script but with.py
changed to.idt
, unless filename is explicitly specified. For instance, if the script is namedmy_origami.py
, then the sequences will be written tomy_origami.idt
. If filename is not specified but extension is, then that extension is used instead ofidt
. At least one of filename or extension must beNone
.The string written is that returned by
Design.to_idt_bulk_input_format()
.- Parameters:
filename (str) – optional custom filename to use (instead of currently running script)
directory (str) – specifies a directory in which to place the file, either absolute or relative to the current working directory. Default is the current working directory.
key (KeyFunction[Strand] | None) – key function used to determine order in which to output strand sequences. Some useful defaults are provided by
strand_order_key_function()
extension (str | None) – alternate filename extension to use (instead of idt)
delimiter (str) – is the symbol to delimit the four IDT fields name,sequence,scale,purification.
domain_delimiter (str) – symbol(s) to put in between DNA sequences of different domains in a strand.
warn_duplicate_name (bool) – if
True
prints a warning when two differentStrand
’s have the sameVendorFields.name
and the sameStrand.sequence()
. A ValueError is raised (regardless of the value of this parameter) if two differentStrand
’s have the same name but different sequences, IDT scales, or IDT purifications.only_strands_with_vendor_fields (bool) – If False (the default), all non-scaffold sequences are output, with reasonable default values chosen if the field
Strand.vendor_fields
is missing. If True, then strands lacking the fieldStrand.vendor_fields
will not be exported.strands (Iterable[Strand] | None) – strands to export; if not specified, all strands in design are exported. NOTE: it is not checked that each
Strand
in strands is actually contained in this any:Design
- Return type:
None
- write_idt_plate_excel_file(*, filename=None, directory='.', key=None, warn_duplicate_name=False, only_strands_with_vendor_fields=False, use_default_plates=True, warn_using_default_plates=True, plate_type=PlateType.wells96, strands=None)[source]
Write
.xls
(Microsoft Excel) file encoding the strands of thisDesign
with the fieldStrand.vendor_fields
, suitable for uploading to IDT (Integrated DNA Technologies, Coralville, IA, https://www.idtdna.com/) to describe a 96-well or 384-well plate (https://www.idtdna.com/site/order/plate/index/dna/), with the output file having the same name as the running script but with.py
changed to.xls
, unless filename is explicitly specified. For instance, if the script is namedmy_origami.py
, then the sequences will be written tomy_origami.xls
.If the last plate has fewer than 24 strands for a 96-well plate, or fewer than 96 strands for a 384-well plate, then the last two plates are rebalanced to ensure that each plate has at least that number of strands, because IDT charges extra for a plate with too few strands: https://www.idtdna.com/pages/products/custom-dna-rna/dna-oligos/custom-dna-oligos
- Parameters:
filename (str) – custom filename if default (explained above) is not desired
directory (str) – specifies a directory in which to place the file, either absolute or relative to the current working directory. Default is the current working directory.
key (KeyFunction[Strand] | None) –
key function used to determine order in which to output strand sequences. Some useful defaults are provided by
strand_order_key_function()
warn_duplicate_name (bool) – if
True
prints a warning when two differentStrand
’s have the sameVendorFields.name
and the sameStrand.sequence()
. A ValueError is raised (regardless of the value of this parameter) if two differentStrand
’s have the same name but different sequences, IDT scales, or IDT purifications.only_strands_with_vendor_fields (bool) – If False (the default), all non-scaffold sequences are output, with reasonable default values chosen if the field
Strand.vendor_fields
is missing. (though scaffold is included if export_scaffold is True). If True, then strands lacking the fieldStrand.vendor_fields
will not be exported. If False, then use_default_plates must be True.use_default_plates (bool) – Use default values for plate and well (ignoring those in
Strand.vendor_fields
, which may be None). If False, each Strand to export must have the fieldStrand.vendor_fields
, so in particular the parameter only_strands_with_idt must be True.warn_using_default_plates (bool) – specifies whether, if use_default_plates is True, to print a warning for strands whose
Strand.vendor_fields
has the fieldsVendorFields.plate
andVendorFields.well
, since use_default_plates directs these fields to be ignored.plate_type (PlateType) – a
PlateType
specifying whether to use a 96-well plate or a 384-well plate if the use_default_plates parameter isTrue
. Ignored if use_default_plates isFalse
, because in that case the wells are explicitly set by the user, who is free to use coordinates for either plate type.strands (Iterable[Strand] | None) – strands to export; if not specified, all strands in design are exported. NOTE: it is not checked that each
Strand
in strands is actually contained in this any:Design
- Return type:
None
- domain_pools()[source]
- Returns:
list of all
DomainPool
’s in thisDesign
- Return type:
List[DomainPool]
- domains_by_pool_name(domain_pool_name)[source]
- Parameters:
domain_pool_name (str) – name of a
DomainPool
- Returns:
the
Domain
’s in domain_pool- Return type:
List[Domain]
- static from_scadnano_file(sc_filename, fix_assigned_sequences=True, ignored_strands=None)[source]
Converts a scadnano Design stored in file named sc_filename to a a
Design
for doing DNA sequence design. Each Strand name and Domain name from the scadnano Design are assigned asStrand.name
andDomain.name
in the obvious way. Assumes each Strand label is a string describing the strand group.The scadnano package must be importable.
Also assigns sequences from domains in sc_design to those of the returned
Design
. If fix_assigned_sequences is true, then these DNA sequences are fixed; otherwise not.- Parameters:
sc_filename (str) – Name of file containing scadnano Design.
fix_assigned_sequences (bool) – Whether to fix the sequences that are assigned from those found in sc_design.
ignored_strands (Iterable[Strand] | None) – Strands to ignore
- Returns:
An equivalent
Design
, ready to be given constraints for DNA sequence design.- Raises:
TypeError – If any scadnano strand label is not a string.
- Return type:
- static from_scadnano_design(sc_design, fix_assigned_sequences=True, ignored_strands=None, warn_existing_domain_labels=True)[source]
Converts a scadnano Design sc_design to a a
Design
for doing DNA sequence design. Each Strand name and Domain name from the scadnano Design are assigned asStrand.name
andDomain.name
in the obvious way. Assumes each Strand label is a string describing the strand group.The scadnano package must be importable.
Also assigns sequences from domains in sc_design to those of the returned
Design
. If fix_assigned_sequences is true, then these DNA sequences are fixed; otherwise not.- Parameters:
sc_design (sc.Design) – Instance of scadnano.Design from the scadnano Python scripting library.
fix_assigned_sequences (bool) – Whether to fix the sequences that are assigned from those found in sc_design.
ignored_strands (Iterable[Strand] | None) – Strands to ignore; none are ignore if not specified.
warn_existing_domain_labels (bool) – If True, logs warning when dsd
Domain
already has a label and so does scadnano domain, since scadnano label will not be assigned to the dsdDomain
.
- Returns:
An equivalent
Design
, ready to be given constraints for DNA sequence design.- Raises:
TypeError – If any scadnano strand label is not a string.
- Return type:
- assign_fields_to_scadnano_design(sc_design, ignored_strands=(), overwrite=False)[source]
Assigns DNA sequence, VendorFields, and StrandGroups (as a key in a scadnano String.label dict under key “group”). TODO: document more
- Parameters:
sc_design (Design)
ignored_strands (Iterable[Strand])
overwrite (bool)
- assign_sequences_to_scadnano_design(sc_design, ignored_strands=(), overwrite=False)[source]
Assigns sequences from this
Design
into sc_design.Also writes a label to each scadnano strand. If the label is None a new one is created as a dict with a key group. The name of the StrandGroup of the nuad design is the value to assign to this key. If the scadnano strand label is already a dict, it adds this key. If the strand label is not None or a dict, an exception is raised.
Assumes that each domain name in domains in sc_design is a
Domain.name
of aDomain
in thisDesign
.If multiple strands in sc_design share the same name, then all of them are assigned the DNA sequence of the nuad
Strand
with that name.- Parameters:
sc_design (Design) – a scadnano design.
ignored_strands (Iterable[Strand]) – strands in the scadnano design that are to be ignored by the sequence designer.
overwrite (bool) – if True, overwrites existing sequences; otherwise gives an error if an existing sequence disagrees with the newly assigned sequence
- Return type:
None
Returns a list of pairs (nuad_strand, sc_strands), where nuad_strand has the same name as all scadnano Strands in sc_strands, but only scadnano strands are included in the list that do not appear in ignored_strands.
- assign_strand_groups_to_labels(sc_design, ignored_strands=(), overwrite=False)[source]
TODO: document this
- Parameters:
sc_design (Design)
ignored_strands (Iterable[Strand])
overwrite (bool)
- Return type:
None
- assign_idt_fields_to_scadnano_design(sc_design, ignored_strands=(), overwrite=False)[source]
Assigns
VendorFields
from thisDesign
into sc_design.If multiple strands in sc_design share the same name, then all of them are assigned the IDT fields of the dsd
Strand
with that name.- Parameters:
sc_design (Design) – a scadnano design.
ignored_strands (Iterable[Strand]) – strands in the scadnano design that are to be not assigned.
overwrite (bool) – whether to overwrite existing fields.
- Raises:
ValueError – if scadnano strand already has any modifications assigned
- Return type:
None
- assign_modifications_to_scadnano_design(sc_design, ignored_strands=(), overwrite=False)[source]
Assigns
modifications.Modification
’s from thisDesign
into sc_design.If multiple strands in sc_design share the same name, then all of them are assigned the modifications of the dsd
Strand
with that name.- Parameters:
sc_design (Design) – a scadnano design.
ignored_strands (Iterable[Strand]) – strands in the scadnano design that are to be not assigned.
overwrite (bool) – whether to overwrite existing fields in scadnano design
- Raises:
ValueError – if scadnano strand already has any modifications assigned
- Return type:
None
- copy_sequences_from(other)[source]
Assuming every
Domain
in thisDesign
is has a matching (same name)Domain
in other, copies sequences from other into thisDesign
.
- class constraints.Constraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>)[source]
Abstract base class of all “soft” constraints to apply when running
search.search_for_dna_sequences()
. Unlike aNumpyFilter
or aSequenceFilter
, which disallow certain DNA sequences from ever being assigned to aDomain
, aConstraint
can be violated during the search. The goal of the search is to reduce the number of violatedConstraint
’s. Seesearch.search_for_dna_sequences()
for a more detailed description of how the search algorithm interacts with the constraints.You will not use this class directly, but instead its concrete subclasses
DomainConstraint
,StrandConstraint
,DomainPairConstraint
,StrandPairConstraint
,ComplexConstraint
, which are subclasses ofSingularConstraint
,DomainsConstraint
,StrandsConstraint
,DomainPairsConstraint
,StrandPairsConstraint
, which are subclasses ofBulkConstraint
, orDesignConstraint
.- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
- description: str
Description of the constraint, e.g., ‘strand has secondary structure exceeding -2.0 kcal/mol’ suitable for printing in a long text report.
- short_description: str = ''
Very short description of the constraint suitable for compactly logging to the screen, where many of these short descriptions must fit onto one line, e.g., ‘strand ss’ or ‘dom pair nupack’
- weight: float = 1.0
Constant multiplier Weight of the problem; the higher the total weight of all the
Constraint
’s aDomain
has caused, the greater likelihood its sequence is changed when stochastically searching for sequences to satisfy all constraints.
- score_transfer_function()
Score transfer function to use. When a constraint is violated, the constraint returns a nonnegative float (the score) indicating the “severity” of the violation. For example, if a
Strand
has secondary structure energy exceeding a threshold, the score returned is the difference between the energy and the threshold.The score is then passed through the score_transfer_function. The default is the squared ReLU function: f(x) = max(0, x^2). This “punishes” more severe violations more, for example, it would bring down the total score of violations more to reduce a violation 3 kcal/mol in excess of its threshold to 2 kcal/mol excess, than to reduce a violation only 1 kcal/mol in excess of its threshold down to 0.
- Parameters:
x (float)
- Return type:
float
- abstract static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
- Return type:
None
- class constraints.Result(excess, summary=None, value=None, unit=None)[source]
A
Result
is returned from the functionSingularConstraint.evaluate
, and a list ofResult
’s is returned from the functionBulkConstraint.evaluate_bulk
, describing the result of evaluating the constraint on the design “part”.A
Result
must have an “excess” and “summary” specified.Optionally one may also specify a “value”, which helps in graphically displaying the results of evaluating constraints using the function
display_report()
.For example, if the constraint checks that the NUPACK complex free energy of a strand is at least -2.5 kcal/mol, and a strand has energy -3.4 kcal/mol, then the following are sensible values for these fields:
value
=-3.4
unit
="kcal/mol"
excess
=-0.9
summary
="-3.4 kcal/mol"
- Parameters:
excess (float)
summary (str | None)
value (float | None)
unit (str | None)
- __init__(excess, summary=None, value=None, unit=None)[source]
- Parameters:
excess (float)
summary (str | None)
value (float | None)
unit (str | None)
- Return type:
None
- excess: float
The excess is a nonnegative value that is turned into a score, and the search minimizes the total score of all constraint evaluations. Setting this to 0 (or a negative value) means the constraint is satisfied, and setting it to a positive value means the constraint is violated. The interpretation is that the larger excess is, the more the constraint is violated.
For example, a common value for excess is the amount by which the NUPACK complex free energy exceeds a threshold.
- value: float | None = None
If this is a “numeric” constraint, i.e., checking some number such as the complex free energy of a strand and comparing it to a threshold, this is the “raw” value. It is optional, but if specified, then the raw values can be plotted in a Jupyter notebook by the function
display_report()
.Optional units (e.g., ‘kcal/mol’) can be specified in the field
Result.units
.
- unit: str | None = None
Optional units for
Result.value
, e.g.,'kcal/mol'
.If specified, then the units are used in text reports and to label the y-axis in plots created by
search.display_report()
.
- score: float
Set by the search algorithm based on
Result.excess
as well as other data such as the constraint’s weight and theSearchParameters.score_transfer_function
.
- part: DesignPart
Set by the search algorithm based on the part that was evaluated.
- property summary: str
This string is displayed in the text report on constraints, after the name of the “part” (e.g., strand, pair of domains, pair of strands).
It can be set explicitly, or calculated from
Result.value
if not set explicitly.
- class constraints.SingularConstraint(description: 'str', short_description: 'str' = '', weight: 'float' = 1.0, score_transfer_function: 'Callable[[float], float]' = <function default_score_transfer_function at 0x7fef7b6f3c10>, evaluate: 'Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]]' = <function SingularConstraint.<lambda> at 0x7fef7b6abc10>, parallel: 'bool' = False)[source]
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate (Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]])
parallel (bool)
- evaluate()
Essentially a wrapper for a function that evaluates the
Constraint
. It takes as input a tuple of DNA sequences (Python strings) and an optionalPart
, wherePart
is one ofDomain
,Strand
,DomainPair
,StrandPair
, orComplex
(the latter being an alias for arbitrary-length tuple ofStrand
’s).The second argument will be None if
SingularConstraint.parallel
is True (since it’s more expensive to serialize theDomain
andStrand
objects than strings for passing data to processes executing in parallel).Thus, if the
Constraint
needs to use more data about thePart
than just its DNA sequence, by accessing the second argument,Constraint.parallel
should be set to False.It should return a
Result
object.
- parallel: bool = False
Whether or not to use parallelization across multiple processes to take advantage of multiple processors/cores, by calling
SingularConstraint.evaluate
on different DesignParts in separate processes.
- call_evaluate(seqs, part)[source]
Evaluates this
Constraint
using functionSingularConstraint.evaluate
supplied in constructor.- Parameters:
seqs (Tuple[str, ...]) – sequence(s) of relevant
Part
, e.g., if part is a pair ofStrand
’s, then seqs is a pair of stringspart (DesignPart | None) – the
Part
to be evaluated. Might be None if parallelization is being used, since it is cheaper to serialize only the sequence(s) than the entirePart
for passing to other processes to evaluate in parallel.
- Returns:
a
Result
object- Return type:
Result[DesignPart]
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate (Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]])
parallel (bool)
- Return type:
None
- class constraints.BulkConstraint(description: 'str', short_description: 'str' = '', weight: 'float' = 1.0, score_transfer_function: 'Callable[[float], float]' = <function default_score_transfer_function at 0x7fef7b6f3c10>, evaluate_bulk: 'Callable[[Sequence[DesignPart]], List[Result]]' = <function BulkConstraint.<lambda> at 0x7fef7b6abf70>)[source]
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate_bulk (Callable[[Sequence[DesignPart]], List[Result]])
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate_bulk (Callable[[Sequence[DesignPart]], List[Result]])
- Return type:
None
- class constraints.ConstraintWithDomains(domains: 'Tuple[Domain, ...] | None' = None)[source]
- Parameters:
domains (Tuple[Domain, ...] | None)
- class constraints.ConstraintWithStrands(strands: 'Tuple[Strand, ...] | None' = None)[source]
- Parameters:
strands (Tuple[Strand, ...] | None)
- class constraints.DomainConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, domains=None)[source]
Constraint that applies to a single
Domain
.- Parameters:
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, domains=None)
- class constraints.StrandConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, strands=None)[source]
Constraint that applies to a single
Strand
.- Parameters:
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, strands=None)
- class constraints.ConstraintWithDomainPairs(description: 'str', short_description: 'str' = '', weight: 'float' = 1.0, score_transfer_function: 'Callable[[float], float]' = <function default_score_transfer_function at 0x7fef7b6f3c10>, domain_pairs: 'Tuple[DomainPair, ...] | None' = None, pairs: 'InitVar[Iterable[Tuple[Domain, Domain], ...] | None]' = None, check_domain_against_itself: 'bool' = True)[source]
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
domain_pairs (Tuple[DomainPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Domain, Domain], ...] | None])
check_domain_against_itself (bool)
- domain_pairs: Tuple[DomainPair, ...] | None = None
List of
DomainPair
’s to check; if not specified, all pairs inDesign
are checked.This can be specified manmually, or alternately is set internally in the constructor based on the optional
__init__
parameter pairs.
- pairs: InitVar[Iterable[Tuple[Domain, Domain], ...] | None] = None
Init-only variable (specified in constructor, but is not a field in the class) for specifying pairs of domains to check; if not specified, all pairs in
Design
are checked, unlessConstraintWithDomainPairs.domain_pairs
is specified.
- check_domain_against_itself: bool = True
Whether to check a domain against itself when checking all pairs of
Domain
’s in theDesign
. Only used ifConstraintWithDomainPairs.pairs
is not specified, otherwise it is ignored.
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, domain_pairs=None, pairs=None, check_domain_against_itself=True)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
domain_pairs (Tuple[DomainPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Domain, Domain], ...] | None])
check_domain_against_itself (bool)
- Return type:
None
- class constraints.ConstraintWithStrandPairs(description: 'str', short_description: 'str' = '', weight: 'float' = 1.0, score_transfer_function: 'Callable[[float], float]' = <function default_score_transfer_function at 0x7fef7b6f3c10>, strand_pairs: 'Tuple[StrandPair, ...] | None' = None, pairs: 'InitVar[Iterable[Tuple[Strand, Strand], ...] | None]' = None, check_strand_against_itself: 'bool' = True)[source]
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
strand_pairs (Tuple[StrandPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Strand, Strand], ...] | None])
check_strand_against_itself (bool)
- strand_pairs: Tuple[StrandPair, ...] | None = None
List of
StrandPair
’s to check; if not specified, all pairs inDesign
are checked.This can be specified manmually, or alternately is set internally in the constructor based on the optional
__init__
parameter pairs.
- pairs: InitVar[Iterable[Tuple[Strand, Strand], ...] | None] = None
Init-only variable (specified in constructor, but is not a field in the class) for specifying pairs of strands; if not specified, all pairs in
Design
are checked, unlessConstraintWithStrandPairs.strand_pairs
is specified.
- check_strand_against_itself: bool = True
Whether to check a strand against itself when checking all pairs of
Strand
’s in theDesign
. Only used ifConstraintWithStrandPairs.pairs
is not specified, otherwise it is ignored.
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, strand_pairs=None, pairs=None, check_strand_against_itself=True)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
strand_pairs (Tuple[StrandPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Strand, Strand], ...] | None])
check_strand_against_itself (bool)
- Return type:
None
- class constraints.DomainPairConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, domain_pairs=None, pairs=None, check_domain_against_itself=True)[source]
Constraint that applies to a pair of
Domain
’s.These should be symmetric, meaning that the constraint will give the same evaluation whether its evaluate method is given the pair (domain1, domain2), or the pair (domain2, domain1).
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate (Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]])
parallel (bool)
domain_pairs (Tuple[DomainPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Domain, Domain], ...] | None])
check_domain_against_itself (bool)
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, domain_pairs=None, pairs=None, check_domain_against_itself=True)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate (Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]])
parallel (bool)
domain_pairs (Tuple[DomainPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Domain, Domain], ...] | None])
check_domain_against_itself (bool)
- Return type:
None
- class constraints.StrandPairConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, strand_pairs=None, pairs=None, check_strand_against_itself=True)[source]
Constraint that applies to a pair of
Strand
’s.These should be symmetric, meaning that the constraint will give the same evaluation whether its evaluate method is given the pair (strand1, strand2), or the pair (strand2, strand1).
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate (Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]])
parallel (bool)
strand_pairs (Tuple[StrandPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Strand, Strand], ...] | None])
check_strand_against_itself (bool)
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, strand_pairs=None, pairs=None, check_strand_against_itself=True)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate (Callable[[Tuple[str, ...], DesignPart | None], Result[DesignPart]])
parallel (bool)
strand_pairs (Tuple[StrandPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Strand, Strand], ...] | None])
check_strand_against_itself (bool)
- Return type:
None
- class constraints.DomainsConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, domains=None)[source]
Constraint that applies to a several
Domain
’s.The difference with
DomainConstraint
is that the caller may want to process allDomain
’s at once, e.g., by giving many of them to a third-party program such as ViennaRNA, which may be more efficient than repeatedly calling a Python function.It is assumed that the constraint works by checking one
Domain
at a time. After computing initial violations of constraints, subsequent calls to this constraint only give the domain that was mutated, not the entire ofDomain
’s in the wholeDesign
. UseDesignConstraint
for constraints that require everyDomain
in theDesign
.- Parameters:
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, domains=None)
- class constraints.StrandsConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, strands=None)[source]
Constraint that applies to a several
Strand
’s.The difference with
StrandConstraint
is that the caller may want to process allStrand
’s at once, e.g., by giving many of them to a third-party program such as ViennaRNA.It is assumed that the constraint works by checking one
Strand
at a time. After computing initial violations of constraints, subsequent calls to this constraint only give strands containing the domain that was mutated, not the entire ofStrand
’s in the wholeDesign
. UseDesignConstraint
for constraints that require everyStrand
in theDesign
.- Parameters:
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, strands=None)
- class constraints.DomainPairsConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, domain_pairs=None, pairs=None, check_domain_against_itself=True)[source]
Similar to
DomainsConstraint
but operates on a specified list of pairs ofDomain
’s.- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate_bulk (Callable[[Sequence[DesignPart]], List[Result]])
domain_pairs (Tuple[DomainPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Domain, Domain], ...] | None])
check_domain_against_itself (bool)
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, domain_pairs=None, pairs=None, check_domain_against_itself=True)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate_bulk (Callable[[Sequence[DesignPart]], List[Result]])
domain_pairs (Tuple[DomainPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Domain, Domain], ...] | None])
check_domain_against_itself (bool)
- Return type:
None
- class constraints.StrandPairsConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, strand_pairs=None, pairs=None, check_strand_against_itself=True)[source]
Similar to
StrandsConstraint
but operates on a specified list of pairs ofStrand
’s.- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate_bulk (Callable[[Sequence[DesignPart]], List[Result]])
strand_pairs (Tuple[StrandPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Strand, Strand], ...] | None])
check_strand_against_itself (bool)
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, strand_pairs=None, pairs=None, check_strand_against_itself=True)
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
evaluate_bulk (Callable[[Sequence[DesignPart]], List[Result]])
strand_pairs (Tuple[StrandPair, ...] | None)
pairs (InitVar[Iterable[Tuple[Strand, Strand], ...] | None])
check_strand_against_itself (bool)
- Return type:
None
- class constraints.DesignConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_design=<function DesignConstraint.<lambda>>)[source]
Constraint that applies to the entire
Design
. This is used for anyConstraint
that does not naturally fit the structure of the other types of constraints.Unlike other constraints, which specify either
Constraint._evaluate
orConstraint._evaluate_bulk
, aDesignConstraint
leaves both of these unspecified and specifiesDesignConstraint._evaluate_design
instead.- Parameters:
- evaluate_design()
Evaluates the
Design
(first argument), possibly taking into account whichDomain
’s have changed in the last iteration (second argument).Returns a list of tuples (part, score, summary), one tuple per violation of the
DesignConstraint
.part is the part of the
Design
that caused the violation. It must be one ofDomain
,Strand
, pair of Domain’s, or tuple ofStrand
’s.score is the score of the violation.
summary is a 1-line summary of the violation to put into the generated reports.
- static part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_design=<function DesignConstraint.<lambda>>)
- constraints.verify_designs_match(design1, design2, check_fixed=True)[source]
Verifies that two designs match, other than their constraints. This is useful when loading a design that has been saved in the middle of searching for DNA sequences, to verify that it matches a design created before the DNA sequence search started.
- Parameters:
- Raises:
ValueError – If the designs do not match. Here is what is checked: - strand names and group names appear in the same order - domain names and pool names appear in the same order in strands with the same name -
Domain.fixed
matches betweenDomain
’s- Return type:
None
- constraints.convert_threshold(threshold, key)[source]
- Parameters:
threshold (float | Dict[T, float]) – either a single float, or a dictionary mapping instances of T to floats
key (T) – instance of T
- Returns:
threshold for key
- Return type:
float
- constraints.nupack_domain_free_energy_constraint(threshold, temperature=37, sodium=0.05, magnesium=0.0125, weight=1.0, score_transfer_function=<function default_score_transfer_function>, parallel=False, description=None, short_description='strand_ss_nupack', domains=None)[source]
Returns constraint that checks individual
Domain
’s for excessive interaction using NUPACK’s pfunc.NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
threshold (float) – energy threshold in kcal/mol
temperature (float) – temperature in Celsius
sodium (float) – molarity of sodium (more generally, monovalent ions such as Na+, K+, NH4+) in moles per liter
magnesium (float) – molarity of magnesium (Mg++) in moles per liter
weight (float) – how much to weigh this
Constraint
score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.parallel (bool) – Whether to use parallelization by running constraint evaluation in separate processes to take advantage of multiple cores.
domains (Iterable[Domain] | None) –
Domain
’s to check; if not specified, all domains are checked.description (str | None) – detailed description of constraint suitable for putting in report; if not specified a reasonable default is chosen
short_description (str) – short description of constraint suitable for logging to stdout
- Returns:
the constraint
- Return type:
- constraints.nupack_strand_free_energy_constraint(threshold, temperature=37, sodium=0.05, magnesium=0.0125, weight=1.0, score_transfer_function=<function default_score_transfer_function>, parallel=False, description=None, short_description='strand_ss_nupack', strands=None)[source]
Returns constraint that checks individual
Strand
’s for excessive interaction using NUPACK’s pfunc. This is the so-called “complex free energy”: https://docs.nupack.org/definitions/#complex-free-energyNUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
threshold (float) – energy threshold in kcal/mol
temperature (float) – temperature in Celsius
sodium (float) – molarity of sodium (more generally, monovalent ions such as Na+, K+, NH4+) in moles per liter
magnesium (float) – molarity of magnesium (Mg++) in moles per liter
weight (float) – how much to weigh this
Constraint
score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.parallel (bool) – Whether to use parallelization by running constraint evaluation in separate processes to take advantage of multiple cores.
strands (Iterable[Strand] | None) – Strands to check; if not specified, all strands are checked.
description (str | None) – detailed description of constraint suitable for putting in report; if not specified a reasonable default is chosen
short_description (str) – short description of constraint suitable for logging to stdout
- Returns:
the constraint
- Return type:
- constraints.nupack_domain_pair_constraint(threshold, temperature=37, sodium=0.05, magnesium=0.0125, parallel=False, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='dom_pair_nupack', pairs=None)[source]
Returns constraint that checks given pairs of
Domain
’s for excessive interaction using NUPACK’s pfunc executable. Each of the four combinations of seq1, seq2 and their Watson-Crick complements are compared.- Parameters:
threshold (float) – Energy threshold in kcal/mol.
temperature (float) – Temperature in Celsius
sodium (float) – molarity of sodium (more generally, monovalent ions such as Na+, K+, NH4+) in moles per liter
magnesium (float) – molarity of magnesium (Mg++) in moles per liter
parallel (bool) – Whether to test each pair of
Domain
’s in parallel (i.e., sets fieldConstraint.parallel
)weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – Detailed description of constraint suitable for summary report.
short_description (str) – See
Constraint.short_description
pairs (Iterable[Tuple[Domain, Domain]] | None) – Pairs of
Domain
’s to compare; if not specified, checks all pairs (including aDomain
against itself).
- Returns:
The
DomainPairConstraint
.- Return type:
- constraints.nupack_strand_pair_constraints_by_number_matching_domains(thresholds, temperature=37, sodium=0.05, magnesium=0.0125, weight=1.0, score_transfer_function=<function default_score_transfer_function>, descriptions=None, short_descriptions=None, parallel=False, strands=None, pairs=None, ignore_missing_thresholds=False)[source]
Convenience function for creating many constraints as returned by
nupack_strand_pair_constraint()
, one for each threshold specified in parameter thresholds, based on number of matching (complementary) domains between pairs of strands.Optional parameters description and short_description are also dicts keyed by the same keys.
Exactly one of strands or pairs must be specified. If strands, then all pairs of strands (including a strand with itself) will be checked; otherwise only those pairs in pairs will be checked.
It is also common to set different thresholds according to the lengths of the strands. This can be done by calling
strand_pairs_by_lengths()
to separate first by lengths in a dict mapping length pairs to strand pairs, then calling this function once for each (key, value) in that dict, giving the value (which is a list of pairs of strands) as the pairs parameter to this function.- Parameters:
thresholds (Dict[int, float]) – Energy thresholds in kcal/mol. If k domains are complementary between the strands, then use threshold thresholds[k].
temperature (float) – Temperature in Celsius.
sodium (float) – concentration of Na+ in molar
magnesium (float) – concentration of Mg++ in molar
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.descriptions (Dict[int, str] | None) – Long descriptions of constraint suitable for putting into constraint report.
short_descriptions (Dict[int, str] | None) – Short descriptions of constraint suitable for logging to stdout.
parallel (bool) – Whether to test each pair of
Strand
’s in parallel.strands (Iterable[Strand] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs in pairs. Mutually exclusive with pairs.pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs in strands, including each strand with itself. Mutually exclusive with strands.ignore_missing_thresholds (bool) – If True, then a key num left out of thresholds dict will cause no constraint to be returned for pairs of strands with num complementary domains. If False, then a ValueError is raised.
- Returns:
list of constraints, one per threshold in thresholds
- Return type:
List[StrandPairConstraint]
- constraints.nupack_strand_pair_constraint(threshold, temperature=37, sodium=0.05, magnesium=0.0125, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='strand_pair_nupack', parallel=False, pairs=None)[source]
Returns constraint that checks given pairs of
Strand
’s for excessive interaction using NUPACK’s pfunc function.NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
threshold (float) – Energy threshold in kcal/mol
temperature (float) – Temperature in Celsius
sodium (float) – concentration of Na+ in molar
magnesium (float) – concentration of Mg++ in molar
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.parallel (bool) – Whether to use parallelization by running constraint evaluation in separate processes to take advantage of multiple cores.
description (str | None) – Detailed description of constraint suitable for report.
short_description (str) – See
Constraint.short_description
pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs (including aStrand
against itself).
- Returns:
The
StrandPairConstraint
.- Return type:
- constraints.chunker(sequence, chunk_length=None, num_chunks=None)[source]
Collect data into fixed-length chunks or blocks, e.g., chunker(‘ABCDEFG’, 3) –> ABC DEF G
- Parameters:
sequence (Sequence[T]) – Sequence (list or tuple) of items.
chunk_length (int | None) – Length of each chunk. Mutually exclusive with num_chunks.
num_chunks (int | None) – Number of chunks. Mutually exclusive with chunk_length.
- Returns:
List of num_chunks lists, each list of length chunk_length (one of num_chunks or chunk_length will be calculated from the other).
- Return type:
List[List[T]]
- constraints.cpu_count(logical=False)[source]
Counts the number of physical CPUs (cores). For greatest accuracy, requires the 3rd party psutil package to be installed.
- Parameters:
logical (bool) – Whether to count number of logical processors or physical CPU cores.
- Returns:
Number of physical CPU cores if logical is False and package psutils is installed; otherwise, the number of logical processors.
- Return type:
int
- constraints.rna_duplex_domain_pairs_constraint(threshold, temperature=37, weight=1.0, score_transfer_function=<function <lambda>>, description=None, short_description='rna_dup_dom_pairs', pairs=None, parameters_filename='dna_mathews1999.par')[source]
Returns constraint that checks given pairs of
Domain
’s for excessive interaction using Vienna RNA’s RNAduplex executable.- Parameters:
threshold (float) – energy threshold
temperature (float) – temperature in Celsius
weight (float) – how much to weigh this
Constraint
score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – long description of constraint suitable for printing in report file
short_description (str) – short description of constraint suitable for logging to stdout
pairs (Iterable[Tuple[Domain, Domain]] | None) – pairs of
Domain
’s to compare; if not specified, checks all pairsparameters_filename (str) – name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_duplex_multiple()
- Returns:
constraint
- Return type:
- constraints.rna_plex_domain_pairs_constraint(threshold, temperature=37, weight=1.0, score_transfer_function=<function <lambda>>, description=None, short_description='rna_plex_dom_pairs', pairs=None, parameters_filename='dna_mathews1999.par')[source]
Returns constraint that checks given pairs of
Domain
’s for excessive interaction using Vienna RNA’s RNAplex executable.- Parameters:
threshold (float) – energy threshold
temperature (float) – temperature in Celsius
weight (float) – how much to weigh this
Constraint
score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – long description of constraint suitable for printing in report file
short_description (str) – short description of constraint suitable for logging to stdout
pairs (Iterable[Tuple[Domain, Domain]] | None) – pairs of
Domain
’s to compare; if not specified, checks all pairsparameters_filename (str) – name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_duplex_multiple()
- Returns:
constraint
- Return type:
- constraints.nupack_domain_pairs_nonorthogonal_constraint(thresholds, temperature=37, sodium=0.05, magnesium=0.0125, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='dom_pair_nupack_nonorth', parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Similar to
rna_plex_domain_pairs_nonorthogonal_constraint()
, but uses NUPACK instead of RNAplex. Only two parameters sodium and magnesium are different; documented here.- Parameters:
sodium (float) – concentration of sodium (more generally, monovalent ions such as Na+, K+, NH4+) in moles per liter
magnesium (float) – concentration of magnesium (Mg++) in moles per liter
thresholds (Dict[Tuple[Domain, bool, Domain, bool] | Tuple[Domain, Domain], Tuple[float, float]])
temperature (float)
weight (float)
score_transfer_function (Callable[[float], float])
description (str | None)
short_description (str)
parameters_filename (str)
max_energy (float)
- Return type:
- constraints.rna_plex_domain_pairs_nonorthogonal_constraint(thresholds, temperature=37, weight=1.0, score_transfer_function=<function <lambda>>, description=None, short_description='rna_plex_dom_pairs_nonorth', parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Returns constraint that checks given pairs of
Domain
’s for interaction energy in a given interval, using ViennaRNA’s RNAplex executable.This can be used to implement “nonorthogonal” domains as described in these papers:
The “binding affinity table” as described in the first paper is implemented as the thresholds parameter of this function.
- Parameters:
thresholds (Dict[Tuple[Domain, bool, Domain, bool] | Tuple[Domain, Domain], Tuple[float, float]]) –
dict mapping pairs of
Domain
’s, along with Booleans to indicate whether eitherDomain
is starred, to pairs of energy thresholds. Alternately, the key can be a pair ofDomain
’s(a,b)
without any Booleans; in this case only the unstarred versions of the domains are checked; this is equivalent to the key(a, False, b, False)
.For example, if the dict is
{ (a, False, b, False): (-9.0, -8.0), (a, False, b, True): (-5.0, -3.0), (a, True, c, False): (-1.0, 0.0), (a, d): (-7.0, -6.0), }
then the constraint ensures that RNAplex energy for
(a,b)
is between -9.0 and -8.0 kcal/mol,(a,b*)
is between -5.0 and -3.0 kcal/mol,(a*,c)
is between -1.0 and 0.0 kcal/mol, and(a,d)
is between -7.0 and -6.0 kcal/mol.
For all other pairs of domains not listed as keys in the dict (such as
(b,c)
or(a*,b*)
above), the constraint is not checked.temperature (float) – temperature in Celsius
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – See
Constraint.description
.short_description (str) – See
Constraint.short_description
.parameters_filename (str) – name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_duplex_multiple()
max_energy (float) – maximum energy to return; if the RNAplex returns a value larger than this, then this value is used instead
- Returns:
constraint
- Return type:
- constraints.rna_duplex_domain_pairs_nonorthogonal_constraint(thresholds, temperature=37, weight=1.0, score_transfer_function=<function <lambda>>, description=None, short_description='rna_plex_dom_pairs_nonorth', parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Similar to
rna_plex_domain_pairs_nonorthogonal_constraint()
, but uses RNAduplex instead of RNAplex.- Parameters:
- Return type:
- constraints.rna_cofold_domain_pairs_nonorthogonal_constraint(thresholds, temperature=37, weight=1.0, score_transfer_function=<function <lambda>>, description=None, short_description='rna_plex_dom_pairs_nonorth', parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Similar to
rna_plex_domain_pairs_nonorthogonal_constraint()
, but uses RNAcofold instead of RNAplex.- Parameters:
- Return type:
- constraints.strand_pairs_by_lengths(strands)[source]
Separates pairs of strands in strands by lengths. If there are n different strand lengths in strands, then there are ((n+1) choose 2) keys in the returned dict, one for each pair of lengths
(len1, len2)
, including pairs wherelen1 == len2
. This key maps to a list of all pairs of strands in strands where the first strand has lengthlen1
and the second has lengthlen2
.
- constraints.strand_pairs_by_number_matching_domains(*, strands=None, pairs=None)[source]
Utility function for calculating number of complementary domains betweeen several pairs of strands.
Note that for the common use case that you want to create several constraints, each with their own threshold depending on the number of complementary domains, you should use functions such as
rna_duplex_strand_pairs_constraints_by_number_matching_domains()
ornupack_strand_pair_constraints_by_number_matching_domains()
, which in turn calls this function and then creates as many constraints as their are different numbers of complementary domains.- Parameters:
- Returns:
dict mapping integer (number of complementary
Domain
’s) to the list of pairs of strands in strands with that number of complementary domains- Return type:
- constraints.rna_cofold_strand_pairs_constraints_by_number_matching_domains(*, thresholds, temperature=37, weight=1.0, score_transfer_function=<function default_score_transfer_function>, descriptions=None, short_descriptions=None, parallel=False, strands=None, pairs=None, parameters_filename='dna_mathews1999.par', ignore_missing_thresholds=False)[source]
Similar to
rna_duplex_strand_pairs_constraints_by_number_matching_domains()
but creates constraints as returned byrna_cofold_strand_pairs_constraint()
.- Parameters:
thresholds (Dict[int, float])
temperature (float)
weight (float)
score_transfer_function (Callable[[float], float])
descriptions (Dict[int, str] | None)
short_descriptions (Dict[int, str] | None)
parallel (bool)
strands (Iterable[Strand] | None)
parameters_filename (str)
ignore_missing_thresholds (bool)
- Return type:
List[StrandPairsConstraint]
- constraints.rna_duplex_strand_pairs_constraints_by_number_matching_domains(*, thresholds, temperature=37, weight=1.0, score_transfer_function=<function default_score_transfer_function>, descriptions=None, short_descriptions=None, parallel=False, strands=None, pairs=None, parameters_filename='dna_mathews1999.par', ignore_missing_thresholds=False)[source]
Convenience function for creating many constraints as returned by
rna_duplex_strand_pairs_constraint()
, one for each threshold specified in parameter thresholds, based on number of matching (complementary) domains between pairs of strands.Optional parameters description and short_description are also dicts keyed by the same keys.
Exactly one of strands or pairs must be specified. If strands, then all pairs of strands (including a strand with itself) will be checked; otherwise only those pairs in pairs will be checked.
It is also common to set different thresholds according to the lengths of the strands. This can be done by calling
strand_pairs_by_lengths()
to separate first by lengths in a dict mapping length pairs to strand pairs, then calling this function once for each (key, value) in that dict, giving the value (which is a list of pairs of strands) as the pairs parameter to this function.- Parameters:
thresholds (Dict[int, float]) – Energy thresholds in kcal/mol. If k domains are complementary between the strands, then use threshold thresholds[k].
temperature (float) – Temperature in Celsius.
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.descriptions (Dict[int, str] | None) – Long descriptions of constraint suitable for putting into constraint report.
short_descriptions (Dict[int, str] | None) – Short descriptions of constraint suitable for logging to stdout.
parallel (bool) – Whether to test each pair of
Strand
’s in parallel.strands (Iterable[Strand] | None) –
Strand
’s to compare; if not specified, checks all in design. Mutually exclusive with pairs.pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs in strands, including each strand with itself. Mutually exclusive with strands.parameters_filename (str) – Name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_duplex_multiple()
ignore_missing_thresholds (bool) – If True, then a key num left out of thresholds dict will cause no constraint to be returned for pairs of strands with num complementary domains. If False, then a ValueError is raised.
- Returns:
list of constraints, one per threshold in thresholds
- Return type:
List[StrandPairsConstraint]
- constraints.rna_plex_strand_pairs_constraints_by_number_matching_domains(*, thresholds, temperature=37, weight=1.0, score_transfer_function=<function default_score_transfer_function>, descriptions=None, short_descriptions=None, parallel=False, strands=None, pairs=None, parameters_filename='dna_mathews1999.par', ignore_missing_thresholds=False)[source]
Convenience function for creating many constraints as returned by
rna_plex_strand_pairs_constraint()
, one for each threshold specified in parameter thresholds, based on number of matching (complementary) domains between pairs of strands.Optional parameters description and short_description are also dicts keyed by the same keys.
Exactly one of strands or pairs must be specified. If strands, then all pairs of strands (including a strand with itself) will be checked; otherwise only those pairs in pairs will be checked.
It is also common to set different thresholds according to the lengths of the strands. This can be done by calling
strand_pairs_by_lengths()
to separate first by lengths in a dict mapping length pairs to strand pairs, then calling this function once for each (key, value) in that dict, giving the value (which is a list of pairs of strands) as the pairs parameter to this function.- Parameters:
thresholds (Dict[int, float]) – Energy thresholds in kcal/mol. If k domains are complementary between the strands, then use threshold thresholds[k].
temperature (float) – Temperature in Celsius.
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.descriptions (Dict[int, str] | None) – Long descriptions of constraint suitable for putting into constraint report.
short_descriptions (Dict[int, str] | None) – Short descriptions of constraint suitable for logging to stdout.
parallel (bool) – Whether to test each pair of
Strand
’s in parallel.strands (Iterable[Strand] | None) –
Strand
’s to compare; if not specified, checks all in design. Mutually exclusive with pairs.pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs in strands, including each strand with itself. Mutually exclusive with strands.parameters_filename (str) – Name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_plex_multiple()
ignore_missing_thresholds (bool) – If True, then a key num left out of thresholds dict will cause no constraint to be returned for pairs of strands with num complementary domains. If False, then a ValueError is raised.
- Returns:
list of constraints, one per threshold in thresholds
- Return type:
List[StrandPairsConstraint]
- constraints.longest_complementary_subsequences_python_loop(arr1, arr2, gc_double)[source]
Like
longest_complementary_subsequences()
, but uses a Python loop instead of numpy operations. This is slower, but is easier to understand and useful for testing.- Parameters:
arr1 (ndarray)
arr2 (ndarray)
gc_double (bool)
- Return type:
List[int]
- constraints.longest_complementary_subsequences_two_loops(arr1, arr2, gc_double)[source]
Calculate length of longest common subsequences between arr1[i] and arr2[i] for each i, storing in returned list result[i].
This uses two nested Python loops to calculate the whole dynamic programming table.
longest_complementary_subsequences()
is slightly faster because it maintains only the diagonal of the DP table, and uses numpy vectorized operations to calculate the next diagonal of the table.When used for DNA sequences, this assumes arr2 has been reversed along axis 1, i.e., the sequences in arr1 are assumed to be oriented 5’ –> 3’, and the sequences in arr2 are assumed to be oriented 3’ –> 5’.
- Parameters:
arr1 (ndarray) – 2D array of DNA sequences, with each sequence represented as a 1D array of 0, 1, 2, 3 corresponding to A, C, G, T, respectively, with each row being a single DNA sequence oriented 5’ –> 3’.
arr2 (ndarray) – 2D array of DNA sequences, with each row being a single DNA sequence oriented 3’ –> 5’.
gc_double (bool) – Whether to double the score for G-C base pairs.
- Returns:
list ret of ints, where ret[i] is the length of the longest complementary subsequence between arr1[i] and arr2[i].
- Return type:
List[int]
- constraints.longest_complementary_subsequences(arr1, arr2, gc_double)[source]
Calculate length of longest common subsequences between arr1[i] and arr2[i] for each i, storing in returned list result[i].
Unlike
longest_complementary_subsequences_two_loops()
, this uses only one Python loop, by using the “anti-diagonal” method for evaluating the dynamic programming table, calculating a whole anti-diagonal from the previous two in O(1) numpy commands.When used for DNA sequences, this assumes arr2 has been reversed along axis 1, i.e., the sequences in arr1 are assumed to be oriented 5’ –> 3’, and the sequences in arr2 are assumed to be oriented 3’ –> 5’.
- Parameters:
arr1 (ndarray) – 2D array of DNA sequences, with each sequence represented as a 1D array of 0, 1, 2, 3 corresponding to A, C, G, T, respectively, with each row being a single DNA sequence oriented 5’ –> 3’.
arr2 (ndarray) – 2D array of DNA sequences, with each row being a single DNA sequence oriented 3’ –> 5’.
gc_double (bool) – Whether to double the score for G-C base pairs. (assumes that integers 1,2 represent C,G respectively)
- Returns:
list ret of ints, where ret[i] is the length of the longest complementary subsequence between arr1[i] and arr2[i].
- Return type:
List[int]
- constraints.lcs_domain_pairs_constraint(*, threshold, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='lcs domain pairs', domains=None, pairs=None, check_domain_against_itself=True, gc_double=True)[source]
Checks pairs of domain sequences for longest complementary subsequences. This can be thought of as a very rough heuristic for “binding energy” that is much less accurate than NUPACK or ViennaRNA, but much faster to evaluate.
- Args
threshold: Max length of complementary subsequence allowed.
weight: See
Constraint.weight
.score_transfer_function: See
Constraint.score_transfer_function
.description: See
Constraint.description
short_description: See
Constraint.short_description
pairs: Pairs of
Strand
’s to compare; if not specified, checks all pairs in design.- check_domain_against_itself: Whether to check domain x against x*. (Obviously x has a maximal
common subsequence with x, so we don’t check that.)
gc_double: Whether to weigh G-C base pairs as double (i.e., they count for 2 instead of 1).
- Returns
A
DomainPairsConstraint
that checks given pairs ofDomain
’s for excessive interaction due to having long complementary subsequences.
- Parameters:
- Return type:
- constraints.lcs_strand_pairs_constraint(*, threshold, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='lcs strand pairs', pairs=None, check_strand_against_itself=True, gc_double=True)[source]
Checks pairs of strand sequences for longest complementary subsequences. This can be thought of as a very rough heuristic for “binding energy” that is much less accurate than NUPACK or ViennaRNA, but much faster to evaluate.
- Args
threshold: Max length of complementary subsequence allowed.
weight: See
Constraint.weight
.score_transfer_function: See
Constraint.score_transfer_function
.description: See
Constraint.description
.short_description: See
Constraint.short_description
.pairs: Pairs of
Strand
’s to compare; if not specified, checks all pairs in design.check_strand_against_itself: Whether to check a strand against itself.
gc_double: Whether to weigh G-C base pairs as double (i.e., they count for 2 instead of 1).
- Returns
A
StrandPairsConstraint
that checks given pairs ofStrand
’s for excessive interaction due to having long complementary subsequences.
- Parameters:
- Return type:
- constraints.lcs_strand_pairs_constraints_by_number_matching_domains(*, thresholds, weight=1.0, score_transfer_function=<function default_score_transfer_function>, descriptions=None, short_descriptions=None, parallel=False, strands=None, pairs=None, gc_double=True, parameters_filename='', ignore_missing_thresholds=False)[source]
TODO
- Parameters:
thresholds (Dict[int, int])
weight (float)
score_transfer_function (Callable[[float], float])
descriptions (Dict[int, str] | None)
short_descriptions (Dict[int, str] | None)
parallel (bool)
strands (Iterable[Strand] | None)
gc_double (bool)
parameters_filename (str)
ignore_missing_thresholds (bool)
- Return type:
List[StrandPairsConstraint]
- constraints.rna_duplex_strand_pairs_constraint(*, threshold, temperature=37, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='rna_dup_strand_pairs', parallel=False, pairs=None, parameters_filename='dna_mathews1999.par')[source]
Returns constraint that checks given pairs of
Strand
’s for excessive interaction using Vienna RNA’s RNAduplex executable.Often one wishes to let the threshold depend on how many domains match between a pair of strands. The function
rna_duplex_strand_pairs_constraints_by_number_matching_domains()
is useful for this purpose, returning a list ofStrandPairsConstraint
’s such as those returned by this function, one for each possible number of matching domains.TODO: explain that this should be many pairs of strands to be fast
- Parameters:
threshold (float) – Energy threshold in kcal/mol. If a float, this is used for all pairs of strands. If a dict[int, float], interpreted to mean that
temperature (float) – Temperature in Celsius.
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – See
Constraint.description
short_description (str) – See
Constraint.short_description
parallel (bool) – Whether to test each pair of
Strand
’s in parallel.pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs in design.parameters_filename (str) – Name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_duplex_multiple()
- Returns:
- Return type:
- constraints.rna_plex_strand_pairs_constraint(*, threshold, temperature=37, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='rna_plex_strand_pairs', parallel=False, pairs=None, parameters_filename='dna_mathews1999.par')[source]
Returns constraint that checks given pairs of
Strand
’s for excessive interaction using Vienna RNA’s RNAplex executable.Often one wishes to let the threshold depend on how many domains match between a pair of strands. The function
rna_plex_strand_pairs_constraints_by_number_matching_domains()
is useful for this purpose, returning a list ofStrandPairsConstraint
’s such as those returned by this function, one for each possible number of matching domains.- Parameters:
threshold (float) – Energy threshold in kcal/mol. If a float, this is used for all pairs of strands. If a dict[int, float], interpreted to mean that
temperature (float) – Temperature in Celsius.
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – See
Constraint.description
short_description (str) – See
Constraint.short_description
parallel (bool) – Whether to test each pair of
Strand
’s in parallel.pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs in design.parameters_filename (str) – Name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_plex_multiple()
- Returns:
- Return type:
- constraints.rna_cofold_strand_pairs_constraint(*, threshold, temperature=37, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='rna_dup_strand_pairs', parallel=False, pairs=None, parameters_filename='dna_mathews1999.par')[source]
Returns constraint that checks given pairs of
Strand
’s for excessive interaction using Vienna RNA’s RNAduplex executable.- Parameters:
threshold (float) – Energy threshold in kcal/mol
temperature (float) – Temperature in Celsius.
weight (float) – See
Constraint.weight
.score_transfer_function (Callable[[float], float]) – See
Constraint.score_transfer_function
.description (str | None) – See
Constraint.description
short_description (str) – See
Constraint.short_description
parallel (bool) – Whether to test each pair of
Strand
’s in parallel.pairs (Iterable[Tuple[Strand, Strand]] | None) – Pairs of
Strand
’s to compare; if not specified, checks all pairs.parameters_filename (str) – Name of parameters file for ViennaRNA; default is same as
vienna_nupack.rna_duplex_multiple()
- Returns:
- Return type:
- class constraints.ConstraintWithComplexes(description: 'str', short_description: 'str' = '', weight: 'float' = 1.0, score_transfer_function: 'Callable[[float], float]' = <function default_score_transfer_function at 0x7fef7b6f3c10>, complexes: 'Tuple[Complex, ...]' = ())[source]
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
complexes (Tuple[Complex, ...])
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, complexes=())
- Parameters:
description (str)
short_description (str)
weight (float)
score_transfer_function (Callable[[float], float])
complexes (Tuple[Complex, ...])
- Return type:
None
- class constraints.ComplexConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, complexes=())[source]
Constraint that applies to a complex (tuple of
Strand
’s).Specify
Constraint._evaluate
in the constructor.Unlike other types of
Constraint
’s such asStrandConstraint
orStrandPairConstraint
, there is no default list ofComplex
’s that aComplexConstraint
is applied to. The list ofComplex
’s must be specified manually in the constructor.- Parameters:
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate=<function SingularConstraint.<lambda>>, parallel=False, complexes=())
- part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- class constraints.ComplexesConstraint(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, complexes=())[source]
Similar to
ComplexConstraint
but operates on a specified list of complexes (tuples ofStrand
’s).- Parameters:
- __init__(description, short_description='', weight=1.0, score_transfer_function=<function default_score_transfer_function>, evaluate_bulk=<function BulkConstraint.<lambda>>, complexes=())
- part_name()[source]
Returns name of the
Part
that thisConstraint
tests.- Returns:
name of the
Part
that thisConstraint
tests (e.g., “domain”, “strand pair”)- Return type:
str
- constraints.default_interior_to_strand_probability = 0.98
Default probability threshold for
BasePairType.INTERIOR_TO_STRAND
- constraints.default_adjacent_to_exterior_base_pair = 0.95
Default probability threshold for
BasePairType.ADJACENT_TO_EXTERIOR_BASE_PAIR
- constraints.default_blunt_end_probability = 0.33
Default probability threshold for
BasePairType.BLUNT_END
- constraints.default_nick_3p_probability = 0.79
Default probability threshold for
BasePairType.NICK_3P
- constraints.default_nick_5p_probability = 0.73
Default probability threshold for
BasePairType.NICK_5P
- constraints.default_dangle_3p_probability = 0.51
Default probability threshold for
BasePairType.DANGLE_3P
- constraints.default_dangle_5p_probability = 0.57
Default probability threshold for
BasePairType.DANGLE_5P
- constraints.default_dangle_5p_3p_probability = 0.73
Default probability threshold for
BasePairType.DANGLE_5P_3P
- constraints.default_overhang_on_this_strand_3p_probability = 0.82
Default probability threshold for
BasePairType.OVERHANG_ON_THIS_STRAND_3P
- constraints.default_overhang_on_this_strand_5p_probability = 0.79
Default probability threshold for
BasePairType.OVERHANG_ON_THIS_STRAND_5P
- constraints.default_overhang_on_adjacent_strand_3p_probability = 0.55
Default probability threshold for
BasePairType.OVERHANG_ON_ADJACENT_STRAND_3P
- constraints.default_overhang_on_adjacent_strand_5p_probability = 0.49
Default probability threshold for
BasePairType.OVERHANG_ON_ADJACENT_STRAND_5P
- constraints.default_overhang_on_both_strand_3p_probability = 0.61
Default probability threshold for
BasePairType.OVERHANG_ON_BOTH_STRANDS_3P
- constraints.default_overhang_on_both_strand_5p_probability = 0.55
Default probability threshold for
BasePairType.OVERHANG_ON_BOTH_STRANDS_5P
- constraints.default_three_arm_junction_probability = 0.69
Default probability threshold for
BasePairType.THREE_ARM_JUNCTION
- constraints.default_four_arm_junction_probability = 0.84
Default probability threshold for
BasePairType.FOUR_ARM_JUNCTION
- constraints.default_five_arm_junction_probability = 0.77
Default probability threshold for
BasePairType.FIVE_ARM_JUNCTION
- constraints.default_mismatch_probability = 0.76
Default probability threshold for
BasePairType.MISMATCH
- constraints.default_bulge_loop_3p_probability = 0.69
Default probability threshold for
BasePairType.BULGE_LOOP_3P
- constraints.default_bulge_loop_5p_probability = 0.65
Default probability threshold for
BasePairType.BULGE_LOOP_5P
- constraints.default_unpaired_probability = 0.95
Default probability threshold for
BasePairType.UNPAIRED
- constraints.default_other_probability = 0.7
Default probability threshold for
BasePairType.OTHER
- class constraints.BasePairType(value)[source]
Represents different configurations for a base pair, and its immediate neighboring base pairs (or lack thereof).
Notation:
“#” indicates denotes the ends of a domain. They can either be the end of a strand or they could be connected to another domain.
“]” and “[” indicates 5’ ends of strand
“>” and “<” indicates 3’ ends of a strand
“-” indicates a base (number of these are not important).
“|” indicates a bases are bound (forming a base pair). Any “-” not connected by “|” is unbound
Domain Example:
The following represents an unbound domain of length 5
#-----#
The following represents bound domains of length 5
#-----# ||||| #-----#
Ocassionally, domains will be vertical in the case of overhangs. In this case, “-” and “|” have opposite meanings
Vertical Domain Example:
# # |-| |-| |-| |-| |-| # #
Formatting:
Top strands have 5’ end on left side and 3’ end on right side
Bottom strand have 3’ end on left side and 5’ end on right side
Strand Example:
strand0: a-b-c-d strand1: d*-b*-c*-a* a b c d strand0 [-----##-----##-----##-----> ||||| ||||| ||||| ||||| strand1 <-----##-----##-----##-----] a* b* c* d*
Consecutive “#”:
In some cases, extra “#” are needed to make space for ascii art. We consider any consecutive “#”s to be equivalent “##”. The following is considered equivalent to the example above
a b c d strand0 [-----###-----####-----##-----> ||||| ||||| ||||| ||||| strand1 <-----###-----####-----##-----] a* b* c* d*
Note that only consecutive “#”s is considered equivalent to “##”. The following example is not equivalent to the strands above because the “# #” between b and c are seperated by spaces, so they are not equivalent to “##”, meaning that b and c neednot be adjacent. Note that while b and c need not be adjacent, b* and c* are still adjacent because they are seperated by consecutive “#”s with no spaces in between.
a b c d strand0 [-----###-----# #-----##-----> ||||| ||||| ||||| ||||| strand1 <-----###-----####-----##-----] a* b* c* d*
- INTERIOR_TO_STRAND = 1
Base pair is located inside of a strand but not next to a base pair that resides on the end of a strand.
Similar base-pairing probability compared to
ADJACENT_TO_EXTERIOR_BASE_PAIR
but usually breathes less.#-----##-----# ||||| ||||| #-----##-----# ^ | base pair
- ADJACENT_TO_EXTERIOR_BASE_PAIR = 2
Base pair is located inside of a strand and next to a base pair that resides on the end of a strand.
Similar base-pairing probability compared to
INTERIOR_TO_STRAND
but usually breathes more.#-----# ||||| #-----] ^ | base pair
or
#-----> ||||| #-----# ^ | base pair
- BLUNT_END = 3
Base pair is located at the end of both strands.
#-----> ||||| #-----] ^ | base pair
- NICK_3P = 4
Base pair is located at a nick involving the 3’ end of the strand.
#----->[-----# ||||| ||||| #-----##-----# ^ | base pair
- NICK_5P = 5
Base pair is located at a nick involving the 3’ end of the strand.
#-----##-----# ||||| ||||| #-----]<-----# ^ | base pair
- DANGLE_3P = 6
Base pair is located at the end of a strand with a dangle on the 3’ end.
#-----##----# ||||| #-----] ^ | base pair
- DANGLE_5P = 7
Base pair is located at the end of a strand with a dangle on the 5’ end.
#-----> ||||| #-----##----# ^ | base pair
- DANGLE_5P_3P = 8
Base pair is located with dangle at both the 3’ and 5’ end.
#-----##----# ||||| #-----##----# ^ | base pair
- OVERHANG_ON_THIS_STRAND_3P = 9
Base pair is next to a overhang on the 3’ end.
# | | | # #-----# #-----# ||||| ||||| #-----###-----# ^ | base pair
- OVERHANG_ON_THIS_STRAND_5P = 10
Base pair is next to a overhang on the 5’ end.
base pair | v #-----###-----# ||||| ||||| #-----# #-----# # | | | #
- OVERHANG_ON_ADJACENT_STRAND_3P = 11
Base pair 3’ end interfaces with an overhang.
The adjacent base pair type is
OVERHANG_ON_THIS_STRAND_5P
# | | | # #-----# #---# ||||| ||| #-----###---# ^ | base pair
- OVERHANG_ON_ADJACENT_STRAND_5P = 12
Base pair 5’ end interfaces with an overhang.
The adjacent base pair type is
OVERHANG_ON_THIS_STRAND_3P
base pair | v #-----###-----# ||||| ||||| #-----# #-----# # | | | #
- OVERHANG_ON_BOTH_STRANDS_3P = 13
Base pair’s 3’ end is an overhang and adjacent strand also has an overhang.
# # | | | | | | # # #-----# #---# ||||| |||| #-----###---# ^ | base pair
- OVERHANG_ON_BOTH_STRANDS_5P = 14
Base pair’s 5’ end is an overhang and adjacent strand also has an overhang.
base pair | v #-----###-----# ||||| ||||| #-----# #-----# # # | | | | | | # #
- THREE_ARM_JUNCTION = 15
Base pair is located next to a three-arm-junction.
# # |-| |-| |-| # # #-----# #---# ||||| |||| #-----###---# ^ | base pair
- FOUR_ARM_JUNCTION = 16
Currently, this case isn’t actually detected (considered as
OTHER
).Base pair is located next to a four-arm-junction (e.g. Holliday junction).
# # |-| |-| |-| # # #-----# #-----# ||||| ||||| #-----# #-----# # # |-| |-| |-| # #
- Type:
TODO
- FIVE_ARM_JUNCTION = 17
Currently, this case isn’t actually detected (considered as
OTHER
).Base pair is located next to a five-arm-junction.
- Type:
TODO
- MISMATCH = 18
Currently, this case isn’t actually detected (considered as
DANGLE_5P_3P
).Base pair is located next to a mismatch.
#-----##-##-----# ||||| ||||| #-----##-##-----# ^ | base pair
- Type:
TODO
- BULGE_LOOP_3P = 19
Currently, this case isn’t actually detected (considered as
OVERHANG_ON_BOTH_STRANDS_3P
).Base pair is located next to a mismatch.
#-----##-##-----# ||||| ||||| #-----#####-----# ^ | base pair
- Type:
TODO
- BULGE_LOOP_5P = 20
Currently, this case isn’t actually detected (considered as
OVERHANG_ON_BOTH_STRANDS_5P
).Base pair is located next to a mismatch.
#-----#####-----# ||||| ||||| #-----##-##-----# ^ | base pair
- Type:
TODO
- UNPAIRED = 21
Base is unpaired.
Probabilities specify how unlikely a base is to be paired with another base.
- OTHER = 22
Other base pair types.
- class constraints.StrandDomainAddress(strand, domain_idx)[source]
An addressing scheme for specifying a domain on a strand.
- Parameters:
strand (Strand)
domain_idx (int)
- domain_idx: int
order in which domain appears in
StrandDomainAddress.strand
- neighbor_5p()[source]
Returns 5’ domain neighbor. If domain is 5’ end of strand, returns None
- Returns:
StrandDomainAddress of 5’ neighbor or None if no 5’ neighbor
- Return type:
StrandDomainAddress | None
- neighbor_3p()[source]
Returns 3’ domain neighbor. If domain is 3’ end of strand, returns None
- Returns:
StrandDomainAddress of 3’ neighbor or None if no 3’ neighbor
- Return type:
StrandDomainAddress | None
- constraints.BaseAddress = typing.Union[int, typing.Tuple[constraints.StrandD...
Represents a reference to a base. Can be either specified as a NUPACK base index or an index of a nuad
StrandDomainAddress
:alias of
int
|Tuple
[StrandDomainAddress
,int
]
- constraints.BasePairAddress = typing.Tuple[typing.Union[int, typing.Tuple[constr...
Represents a reference to a base pair
alias of
Tuple
[int
|Tuple
[StrandDomainAddress
,int
],int
|Tuple
[StrandDomainAddress
,int
]]
- constraints.BoundDomains = typing.Tuple[constraints.StrandDomainAddress, cons...
Represents bound domains
alias of
Tuple
[StrandDomainAddress
,StrandDomainAddress
]
- constraints.nupack_complex_base_pair_probability_constraint(strand_complexes, nonimplicit_base_pairs=None, all_base_pairs=None, base_pair_prob_by_type=None, base_pair_prob_by_type_upper_bound=None, base_pair_prob=None, base_unpaired_prob=None, base_pair_prob_upper_bound=None, base_unpaired_prob_upper_bound=None, temperature=37, sodium=0.05, magnesium=0.0125, weight=1.0, score_transfer_function=<function default_score_transfer_function>, description=None, short_description='ComplexBPProbs', parallel=False)[source]
Returns constraint that checks given base pair probabilities in tuples of
Strand
’s- Parameters:
strand_complexes (List[Complex]) – Iterable of
Strand
tuplesnonimplicit_base_pairs (Optional[Iterable[BoundDomains]]) –
List of nonimplicit base pairs that cannot be inferred because multiple instances of the same
Domain
exist in complex.The
StrandDomainAddress.strand
field of each address should reference a strand in the first complex instrand_complexes
.For example, if one
Strand
has one TDomain
and another strand in the complex has two T*Domain
s, then the intended binding graph cannot be inferred and must be stated explicitly in this field.all_base_pairs (Optional[Iterable[BoundDomains]]) –
List of all base pairs in complex. If not provided, then base pairs are infered based on the name of
Domain
s in the complex as well as base pairs specified innonimplicit_base_pairs
.TODO: This has not been implemented yet, and the behavior is as if this parameter is always
None
(binding graph is always inferred).base_pair_prob_by_type (Optional[Dict[BasePairType, float]]) –
Probability lower bounds for each
BasePairType
. AllBasePairType
comes with a default such asdefault_interior_to_strand_probability
which will be used if a lower bound is not specified for a particular type.Note: Despite the name of this parameter, set thresholds for unpaired bases by specifying a threshold for
BasePairType.UNPAIRED
.base_pair_prob_by_type_upper_bound (Dict[BasePairType, float], optional) –
Probability upper bounds for each
BasePairType
. By default, no upper bound is set.Note: Despite the name of this parameter, set thresholds for unpaired bases by specifying a threshold for
BasePairType.UNPAIRED
.TODO: This has not been implemented yet.
base_pair_prob (Optional[Dict[BasePairAddress, float]]) –
Probability lower bounds for each
BasePairAddress
which takes precedence over probabilities specified bybase_pair_prob_by_type
.TODO: This has not been implemented yet.
base_unpaired_prob (Optional[Dict[BaseAddress, float]]) – Probability lower bounds for each
BaseAddress
representing unpaired bases. These lower bounds take precedence over the probability specified bybase_pair_prob_by_type[BasePairType.UNPAIRED]
.base_pair_prob_upper_bound (Optional[Dict[BasePairAddress, float]]) – Probability upper bounds for each :py:class`BasePairAddress` which takes precedence over probabilties specified by
base_pair_prob_by_type_upper_bound
.base_unpaired_prob_upper_bound (Optional[Dict[BaseAddress, float]]) – Probability upper bounds for each
BaseAddress
representing unpaired bases. These lower bounds take precedence over the probability specified bybase_pair_prob_by_type_upper_bound[BasePairType.UNPAIRED]
.temperature (float, optional) – Temperature specified in °C, defaults to
vienna_nupack.default_temperature
.sodium (float) – molarity of sodium (more generally, monovalent ions such as Na+, K+, NH4+) in moles per liter
magnesium (float) – molarity of magnesium (Mg++) in moles per liter
weight (float, optional) – See
Constraint.weight
, defaults to 1.0score_transfer_function (Callable[[float], float], optional) – Score transfer function to use. By default, f(x) = x**2 is used, where x is the sum of the squared errors of each base pair that violates the threshold.
description (str | None, optional) – See
Constraint.description
, defaults to Noneshort_description (str, optional) – See
Constraint.short_description
defaults to ‘complex_secondary_structure_nupack’parallel (bool, optional) – TODO: Implement this
- Raises:
ImportError – If NUPACK 4 is not installed.
ValueError –
If
strand_complexes
is not valid. In order forstrand_complexes
to be valid,strand_complexes
must:Consist of complexes (tuples of
Strand
objects)Each complex must be of the same motif
- Returns:
ComplexConstraint
- Return type:
nuad.modifications
- class modifications.ModificationType(value)[source]
Type of modification (5’, 3’, or internal).
- five_prime = "5'"
5’ modification type
- three_prime = "5'"
3’ modification type
- internal = 'internal'
internal modification type
- class modifications.Modification(vendor_code, id='WARNING: no id assigned to modification')[source]
Abstract case class of modifications (to DNA sequences, e.g., biotin or Cy3). Use concrete subclasses
Modification3Prime
,Modification5Prime
, orModificationInternal
to instantiate.If
Modification.id
is not specified, thenModification.vendor_code
is used as the unique ID. EachModification.id
must be unique. For example if you create a 5’ “modification” to represent 6 T bases:t6_5p = Modification5Prime(display_text='6T', idt_text='TTTTTT')
(this is a useful hack for putting single-stranded extensions on strands until loopouts on the end of a strand are supported; see https://github.com/UC-Davis-molecular-computing/scadnano-python-package/issues/2), then this would clash with a similar 3’ modification without specifying unique IDs for them:t6_3p = Modification3Prime(display_text='6T', vendor_code='TTTTTT') # ERROR
.In general it is recommended to create a single
Modification
object for each type of modification in the design. For example, if many strands have a 5’ biotin, then it is recommended to create a singleModification
object and re-use it on each strand with a 5’ biotin:biotin_5p = Modification5Prime(display_text='B', vendor_code='/5Biosg/') design.strand(0, 0).move(8).with_modification_5p(biotin_5p) design.strand(1, 0).move(8).with_modification_5p(biotin_5p)
- Parameters:
vendor_code (str)
id (str)
- vendor_code: str
Text string specifying this modification (e.g., ‘/5Biosg/’ for 5’ biotin). optional
- id: str = 'WARNING: no id assigned to modification'
Representation as a string; used to write in
Strand
json representation, while the full description of the modification is written under a global key in theDesign
. If not specified, butModification.idt_text
is specified, then it will be set equal to that.
- __init__(vendor_code, id='WARNING: no id assigned to modification')
- Parameters:
vendor_code (str)
id (str)
- Return type:
None
- class modifications.Modification5Prime(vendor_code, id='WARNING: no id assigned to modification')[source]
5’ modification of DNA sequence, e.g., biotin or Cy3.
- Parameters:
vendor_code (str)
id (str)
- __init__(vendor_code, id='WARNING: no id assigned to modification')
- Parameters:
vendor_code (str)
id (str)
- Return type:
None
- class modifications.Modification3Prime(vendor_code, id='WARNING: no id assigned to modification')[source]
3’ modification of DNA sequence, e.g., biotin or Cy3.
- Parameters:
vendor_code (str)
id (str)
- __init__(vendor_code, id='WARNING: no id assigned to modification')
- Parameters:
vendor_code (str)
id (str)
- Return type:
None
- class modifications.ModificationInternal(vendor_code, id='WARNING: no id assigned to modification', allowed_bases=None)[source]
Internal modification of DNA sequence, e.g., biotin or Cy3.
- Parameters:
vendor_code (str)
id (str)
allowed_bases (AbstractSet[str] | None)
- __init__(vendor_code, id='WARNING: no id assigned to modification', allowed_bases=None)
- Parameters:
vendor_code (str)
id (str)
allowed_bases (AbstractSet[str] | None)
- Return type:
None
- allowed_bases: AbstractSet[str] | None = None
If None, then this is an internal modification that goes between bases. If instead it is a list of bases, then this is an internal modification that attaches to a base, and this lists the allowed bases for this internal modification to be placed at. For example, internal biotins for IDT must be at a T. If any base is allowed, it should be
['A','C','G','T']
.
nuad.vienna_nupack
Contains utility functions for accessing NUPACK 4 and ViennaRNA energy calculation algorithms.
The main functions are
pfunc()
(for calculating complex free energy with NUPACK, along with its helper functions
secondary_structure_single_strand()
and binding()
),
nupack_complex_base_pair_probabilities()
(for calculating base pair probabilities with NUPACK),
rna_duplex_multiple()
(for calculating an approximation to two-strand complex free energy
that is much faster than calling pfunc()
on the same pair of strands),
and
rna_plex_multiple()
(which is even faster than rna_duplex_multiple()
).
- vienna_nupack.calculate_strand_association_penalty(temperature, num_seqs)[source]
Additive adjustment factor to convert NUPACK’s mole fraction units to molarity.
For details on why this is needed for multi-stranded complexes, see Section S1.1 of http://www.nupack.org/downloads/serve_public_file/fornace20_supp.pdf?type=pdf and Figure 2 of http://www.nupack.org/downloads/serve_public_file/nupack_user_guide_3.2.2.pdf?type=pdf
- Parameters:
temperature (float) – temperature in Celsius
num_seqs (int) – number of sequences
- Returns:
Additive adjustment factor to convert NUPACK’s mole fraction units to molar.
- Return type:
float
- vienna_nupack.pfunc(seqs, temperature=37, sodium=0.05, magnesium=0.0125, strand_association_penalty=True)[source]
Calls pfunc from NUPACK 4 (http://www.nupack.org/) on a complex consisting of the unique strands in seqs, returns energy (“delta G”), i.e., generally a negative number.
By default, a strand association penalty is applied that is not applied by NUPACK’s pfunc. See strand_association_penalty parameter documentation for details.
NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
seqs (S | Tuple[S, ...]) – DNA sequences (tuple, or a single DNA sequence), whose order indicates a cyclic permutation of the complex. For one or two sequences, there is only one cyclic permutation, so the order doesn’t matter in such cases.
temperature (float) – temperature in Celsius
sodium (float) – molarity of sodium in moles per liter
magnesium (float) – molarity of magnesium in moles per liter
strand_association_penalty (bool) – Add strand association penalty for a complex, related to converting NUPACK’s mole fraction units to molarity. The quantity added is that returned by
calculate_strand_association_penalty()
with parameters temperature and len(seqs). For most constraints, which involve only one size of complex, this factor won’t matter other than to adjust the energy threshold by the same factor. The factor depends only on the number of strands in seqs, but not on their sequences. However, this factor is needed for a meaningful comparison of energies between complexes of different sizes, e.g., to calculate equilibrium concentrations of complexes of various sizes. For details on why this is needed for multi-stranded complexes, see Section S1.1 of http://www.nupack.org/downloads/serve_public_file/fornace20_supp.pdf?type=pdf and Figure 2 of http://www.nupack.org/downloads/serve_public_file/nupack_user_guide_3.2.2.pdf?type=pdf
- Returns:
complex free energy (“delta G”) of ordered complex with strands in given cyclic permutation
- Return type:
float
- vienna_nupack.nupack_complex_base_pair_probabilities(strand_complex, temperature=37, sodium=0.05, magnesium=0.0125)[source]
Calculates base-pair probabilities according to NUPACK 4.
- Parameters:
strand_complex (Complex) – Ordered tuple of strands in complex (specifying a particular circular ordering, which is imposed on all considered secondary structures)
temperature (float) – temperature in Celsius
sodium (float) – molarity of sodium in moles per liter
magnesium (float) – molarity of magnesium in moles per liter
- Returns:
2D Numpy array of floats, with result[i1][i2] giving the base-pair probability of base at position i1 with base at position i2 (if i1 != i2), where i1 and i2 are the absolute positions of the bases in the entire ordered list of strands. For example, with strands AAAA and TTTTT, there are nine indices 0,1,2,3,4,5,6,7,8, with positions 0,1,2,3 on the first strand AAAA, and positions 4,5,6,7,8 on the second strand TTTTT. If i1 == i2, then result[i1][i1] is the probability that the base at position i1 is unpaired.
- Return type:
ndarray
- vienna_nupack.call_subprocess(command_strs, user_input)[source]
Calls system command through a subprocess. Assumes running on a POSIX operating system.
If running on Windows, automatically appends “wsl -e” to start of command to call command through Windows subsystem for Linux, so wsl must be installed for this to work: https://docs.microsoft.com/en-us/windows/wsl/install-win10
- Parameters:
command_strs (List[str]) – List of command and command line arguments, i.e., to call
ls -l -a
, command_strs should be the list [‘ls’, ‘-l’, ‘-a’].user_input (str) – Input to give once program is running (i.e., what would user type).
- Returns:
pair of strings (output, error), giving the strings written to stdout and stderr, respectively.
- Return type:
Tuple[str, str]
- vienna_nupack.rna_duplex_multiple(pairs, temperature=37, max_energy=0.0, gu_wobble=False)[source]
Calls RNA.duplexfold (from ViennaRNA Python package: https://www.tbi.univie.ac.at/RNA/ViennaRNA/refman/api_python.html) on a list of pairs, specifically: [ (seq1, seq2), (seq3, seq4), (seq5, seq6), … ] where seqi is a string over {A,C,T,G}. Temperature is in Celsius. Returns a list (in the same order as seqpairs) of free energies.
- Parameters:
pairs (Sequence[Tuple[S, S]]) – sequence (list or tuple) of pairs of DNA sequences
temperature (float) – temperature in Celsius
max_energy (float) – This is the maximum energy possible to assign. If RNAduplex reports any energies larger than this, they will be changed to max_energy. This is useful in case two sequences have no possible base pairs between them (e.g., CCCC and TTTT), in which case RNAduplex assigns a free energy of 100000 (perhaps its approximation of infinity). But for meaningful comparison and particularly for graphing energies, it’s nice if there’s not some value several orders of magnitude larger than all the rest.
gu_wobble (bool) – Whether to allow GU wobble pairs; I honestly don’t know what this means for DNA, but it’s an option in RNAduplex, and it makes a difference in calculating the energy.
- Returns:
list of free energies, in the same order as pairs
- Return type:
Tuple[float, …]
- vienna_nupack.rna_duplex_multiple_deprecated(pairs, logger=<RootLogger root (WARNING)>, temperature=37, parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Calls RNAduplex (from ViennaRNA package: https://www.tbi.univie.ac.at/RNA/) on a list of pairs, specifically: [ (seq1, seq2), (seq3, seq4), (seq5, seq6), … ] where seqi is a string over {A,C,T,G}. Temperature is in Celsius. Returns a list (in the same order as seqpairs) of free energies.
- Parameters:
pairs (Sequence[Tuple[S, S]]) – sequence (list or tuple) of pairs of DNA sequences
logger (Logger) – logger to use for printing error messages
temperature (float) – temperature in Celsius
parameters_filename (str) – name of parameters file for ViennaRNA
max_energy (float) – This is the maximum energy possible to assign. If RNAduplex reports any energies larger than this, they will be changed to max_energy. This is useful in case two sequences have no possible base pairs between them (e.g., CCCC and TTTT), in which case RNAduplex assigns a free energy of 100000 (perhaps its approximation of infinity). But for meaningful comparison and particularly for graphing energies, it’s nice if there’s not some value several orders of magnitude larger than all the rest.
- Returns:
list of free energies, in the same order as pairs
- Return type:
Tuple[float, …]
- vienna_nupack.rna_duplex_multiple_parallel(thread_pool, pairs, logger=<RootLogger root (WARNING)>, temperature=37, parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Parallel version of
rna_duplex_multiple()
. TODO document this- Parameters:
thread_pool (ThreadPool)
pairs (Sequence[Tuple[S, S]])
logger (Logger)
temperature (float)
parameters_filename (str)
max_energy (float)
- Return type:
Tuple[float, …]
- vienna_nupack.rna_plex_multiple(pairs, logger=<RootLogger root (WARNING)>, temperature=37, parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Calls RNAplex (from ViennaRNA package: https://www.tbi.univie.ac.at/RNA/) on a list of pairs, specifically: [ (seq1, seq2), (seq2, seq3), (seq4, seq5), … ] where seqi is a string over {A,C,T,G}. Temperature is in Celsius. Returns a list (in the same order as seqpairs) of free energies.
RNAplex is supposedly faster than RNAduplex, but less accurate since, compared to RNAduplex, “ loop energy is an affine function of the loop size instead of a logarithmic function” (https://doi.org/10.1093/bioinformatics/btn193). However, in testing with random sequences, RNAplex can actually be SLOWER, for instance with 1000 pairs of DNA sequences each of length 64, RNAduplex takes 1.93 s on average, compared to 2.28 s for RNAplex. (tested using %timeit in Jupyter lab) This is with default parameters; using “-f 1” speeds up the computing time by about factor 5; in this case RNAplex takes only 0.41 s on average instead of 2.28 s.
- Parameters:
pairs (Sequence[Tuple[S, S]]) – sequence (list or tuple) of pairs of DNA sequences
logger (Logger) – logger to use for printing error messages
temperature (float) – temperature in Celsius
parameters_filename (str) – name of parameters file for NUPACK
max_energy (float) – This is the maximum energy possible to assign. If RNAplex reports any energies larger than this, they will be changed to max_energy. This is useful in case two sequences have no possible base pairs between them (e.g., CCCC and TTTT), in which case RNAplex assigns a free energy of 100000 (perhaps its approximation of infinity). But for meaningful comparison and particularly for graphing energies, it’s nice if there’s not some value several orders of magnitude larger than all the rest.
- Returns:
list of free energies, in the same order as pairs
- Return type:
Tuple[float, …]
- vienna_nupack.nupack_multiple_with_sodium_magnesium(sodium=0.05, magnesium=0.0125)[source]
Used when we want a
BulkConstraint
using NUPACK (even though most of them areSingularConstraint
’s).Calls NUPACK (specifically, the function
binding()
) on a list of pairs: [ (seq1, seq2), (seq2, seq3), (seq4, seq5), … ] where seqi is a string over {A,C,T,G}.- Parameters:
sodium (float) – molarity of sodium in moles per liter
magnesium (float) – molarity of magnesium in moles per liter
- Returns:
list of free energies, in the same order as pairs
- Return type:
Callable[[Sequence[Tuple[S, S]], Logger, float, str, float], Tuple[float, …]]
- vienna_nupack.rna_plex_multiple_parallel(thread_pool, pairs, logger=<RootLogger root (WARNING)>, temperature=37, parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Parallel version of
rna_plex_multiple()
. TODO document this- Parameters:
thread_pool (ThreadPool)
pairs (Sequence[Tuple[S, S]])
logger (Logger)
temperature (float)
parameters_filename (str)
max_energy (float)
- Return type:
Tuple[float, …]
- vienna_nupack.rna_cofold_multiple(seq_pairs, logger=<RootLogger root (WARNING)>, temperature=37, parameters_filename='dna_mathews1999.par', max_energy=0.0)[source]
Calls RNAcofold (from ViennaRNA package: https://www.tbi.univie.ac.at/RNA/) on a list of pairs, specifically: [ (seq1, seq2), (seq2, seq3), (seq4, seq5), … ] where seqi is a string over {A,C,T,G}. Temperature is in Celsius. Returns a list (in the same order as seqpairs) of free energies.
- Parameters:
seq_pairs (Sequence[Tuple[S, S]]) – sequence (list or tuple) of pairs of DNA sequences
logger (Logger) – logger to use for printing error messages
temperature (float) – temperature in Celsius
parameters_filename (str) – name of NUPACK parameters file
max_energy (float) – This is the maximum energy possible to assign. If RNAcofold reports any energies larger than this, they will be changed to max_energy. This is useful in case two sequences have no possible base pairs between them (e.g., CCCC and TTTT), in which case RNAcofold assigns a free energy of 100000 (perhaps its approximation of infinity). But for meaningful comparison and particularly for graphing energies, it’s nice if there’s not some value several orders of magnitude larger than all the rest.
- Returns:
tuple of free energies, in the same order as seq_pairs
- Return type:
Tuple[float, …]
- vienna_nupack.wc(seq)[source]
Return reverse Watson-Crick complement of seq.
- Parameters:
seq (str)
- Return type:
str
- vienna_nupack.free_energy_single_strand(seq, temperature=37, sodium=0.05, magnesium=0.0125)[source]
Computes the “complex free energy” (https://docs.nupack.org/definitions/#complex-free-energy) of a single strand according to NUPACK.
NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
seq (str)
temperature (float)
sodium (float)
magnesium (float)
- Return type:
float
- vienna_nupack.binding_complement(seq, temperature=37, sodium=0.05, magnesium=0.0125, subtract_indv=True)[source]
Computes the complex free energy of a strand with its perfect Watson-Crick complement.
NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
seq (str)
temperature (float)
sodium (float)
magnesium (float)
subtract_indv (bool)
- Return type:
float
- vienna_nupack.binding(seq1, seq2, *, temperature=37, sodium=0.05, magnesium=0.0125)[source]
Computes the complex free energy of association between two strands.
NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
seq1 (str)
seq2 (str)
temperature (float)
sodium (float)
magnesium (float)
- Return type:
float
- vienna_nupack.random_dna_seq(length, bases='ACTG')[source]
Chooses a random DNA sequence.
- Parameters:
length (int)
bases (Sequence)
- Return type:
str
- vienna_nupack.domain_orthogonal(seq, seqs, temperature, sodium, magnesium, orthogonality, orthogonality_ave=-1, parallel=False)[source]
test orthogonality of domain with all others and their wc complements
NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
seq (str)
seqs (Sequence[str])
temperature (float)
sodium (float)
magnesium (float)
orthogonality (float)
orthogonality_ave (float)
parallel (bool)
- Return type:
bool
- vienna_nupack.domain_pairwise_concatenated_no_sec_struct(seq, seqs, temperature, sodium, magnesium, concat, concat_ave=-1, parallel=False)[source]
test lack of secondary structure in concatenated domains
NUPACK 4 must be installed. Installation instructions can be found at https://piercelab-caltech.github.io/nupack-docs/start/.
- Parameters:
seq (str)
seqs (Sequence[str])
temperature (float)
sodium (float)
magnesium (float)
concat (float)
concat_ave (float)
parallel (bool)
- Return type:
bool
nuad.np
Library for doing sequence design that can be expressed as linear algebra operations for rapid processing by numpy (e.g., generating all DNA sequences of a certain length and calculating all their full duplex binding energies in the nearest neighbor model and filtering those outside a given range).
Based on the DNA single-stranded tile (SST) sequence designer used in the following publication.
“Diverse and robust molecular algorithms using reprogrammable DNA self-assembly” Woods*, Doty*, Myhrvold, Hui, Zhou, Yin, Winfree. (*Joint first co-authors)
- np.idx2seq(idx, length)[source]
Return the lexicographic idx’th DNA sequence of given length.
- Parameters:
idx (int)
length (int)
- Return type:
str
- np.seq2arr(seq, base2bits_local=None)[source]
Convert seq (string with DNA alphabet) to numpy array with integers 0,1,2,3.
- Parameters:
seq (str)
base2bits_local (Dict[str, int] | None)
- Return type:
np.ndarray
- np.seqs2arr(seqs)[source]
Return numpy 2D array converting the given DNA sequences to integers.
- Parameters:
seqs (Sequence[str])
- Return type:
ndarray
- np.arr2seqs(arr)[source]
Return list of strings converting the given numpy array of integers to DNA sequences.
- Parameters:
arr (ndarray)
- Return type:
List[str]
- np.arr2seq(arr)[source]
Return string converting the given numpy array of integers to DNA sequence.
- Parameters:
arr (ndarray)
- Return type:
str
- np.make_array_with_all_dna_seqs(length, bases=('A', 'C', 'G', 'T'))[source]
Return 2D numpy array with all DNA sequences of given length in lexicographic order. Bases contains bases to be used: (‘A’,’C’,’G’,’T’) by default, but can be set to a subset of these.
Uses the encoding described in the documentation for DNASeqList. The result is a 2D array, where each row represents a DNA sequence, and that row has one byte per base.
- Parameters:
length (int)
bases (Collection[str])
- Return type:
ndarray
- np.make_array_with_random_subset_of_dna_seqs(length, num_random_seqs, rng=Generator(PCG64) at 0x7FEF7828E120, bases=('A', 'C', 'G', 'T'))[source]
Return 2D numpy array with random subset of size num_seqs of DNA sequences of given length. Bases contains bases to be used: (‘A’,’C’,’G’,’T’) by default, but can be set to a subset of these.
Uses the encoding described in the documentation for DNASeqList. The result is a 2D array, where each row represents a DNA sequence, and that row has one byte per base.
Sequences returned will be unique (i.e., sampled without replacement) and in a random order
- Parameters:
length (int) – length of each row
num_random_seqs (int) – number of rows
bases (Collection[str]) – DNA bases to use
rng (Generator) – numpy random number generator (type returned by numpy.random.default_rng())
- Returns:
2D numpy array with random subset of size num_seqs of DNA sequences of given length
- Return type:
ndarray
- np.make_array_with_all_dna_seqs_hamming_distance(dist, seq, bases=('A', 'C', 'G', 'T'))[source]
Return 2D numpy array with all DNA sequences of given length in lexicographic order. Bases contains bases to be used: (‘A’,’C’,’G’,’T’) by default, but can be set to a subset of these.
Uses the encoding described in the documentation for DNASeqList. The result is a 2D array, where each row represents a DNA sequence, and that row has one byte per base.
- Parameters:
dist (int)
seq (str)
bases (Collection[str])
- Return type:
ndarray
- np.make_array_with_random_subset_of_dna_seqs_hamming_distance(num_seqs, dist, seq, rng=Generator(PCG64) at 0x7FEF7828E120, bases=('A', 'C', 'G', 'T'))[source]
Return 2D numpy array with random subset of size num_seqs of DNA sequences of given length. Bases contains bases to be used: (‘A’,’C’,’G’,’T’) by default, but can be set to a subset of these.
Uses the encoding described in the documentation for DNASeqList. The result is a 2D array, where each row represents a DNA sequence, and that row has one byte per base.
Sampled with replacement, so the same row may appear twice in the returned array
- Parameters:
num_seqs (int) – number of sequences to generate
dist (int) – Hamming distance to be from seq
seq (str) – sequence to generate other sequences close to
bases (Collection[str]) – DNA bases to use
rng (Generator) – numpy random number generator (type returned by numpy.random.default_rng())
- Returns:
2D numpy array with random subset of size num_seqs of DNA sequences of given length
- Return type:
ndarray
- np.longest_common_substring(a1, a2, vectorized=True)[source]
Return start and end indices (a1start, a2start, length) of longest common substring (subarray) of 1D arrays a1 and a2.
- Parameters:
a1 (ndarray)
a2 (ndarray)
vectorized (bool)
- Return type:
Tuple[int, int, int]
- np.longest_common_substrings_singlea1(a1, a2s)[source]
Return start and end indices (a1starts, a2starts, lengths) of longest common substring (subarray) of 1D array a1 and rows of 2D array a2s.
If length[i]=0, then a1starts[i]=a2starts[i]=0 (not -1), so be sure to check length[i] to see if any substrings actually matched.
- Parameters:
a1 (ndarray)
a2s (ndarray)
- Return type:
Tuple[ndarray, ndarray, ndarray]
- np.longest_common_substrings_product(a1s, a2s)[source]
Return start and end indices (a1starts, a2starts, lengths) of longest common substring (subarray) of each pair in the cross product of rows of a1s and a2s.
If length[i]=0, then a1starts[i]=a2starts[i]=0 (not -1), so be sure to check length[i] to see if any substrings actually matched.
- Parameters:
a1s (ndarray)
a2s (ndarray)
- Return type:
Tuple[ndarray, ndarray, ndarray]
- np.longest_common_substrings_all_pairs_strings(seqs1, seqs2)[source]
For Python strings
- Parameters:
seqs1 (Sequence[str])
seqs2 (Sequence[str])
- Return type:
Tuple[ndarray, ndarray, ndarray]
- np.strongest_common_substrings_all_pairs_string(seqs1, seqs2, temperature)[source]
For Python strings representing DNA; checks for reverse complement matches rather than direct matches, and evaluates nearest neighbor energy, returning indices lengths, and energies of strongest complementary substrings.
- Parameters:
seqs1 (Sequence[str])
seqs2 (Sequence[str])
temperature (float)
- Return type:
Tuple[List[float], List[float], List[float], List[float]]
- class np.DNASeqList(length=None, num_random_seqs=None, shuffle=False, alphabet=('A', 'C', 'G', 'T'), seqs=None, seqarr=None, filename=None, rng=Generator(PCG64) at 0x7FEF7828E120, hamming_distance_from_sequence=None)[source]
Represents a list of DNA sequences of identical length. The sequences are stored as a 2D numpy array of bytes
DNASeqList.seqarr
. Each byte represents a single DNA base (so it is not a compact representation; the most significant 6 bits of the byte will always be 0).- Parameters:
length (int | None)
num_random_seqs (int | None)
shuffle (bool)
alphabet (Collection[str])
seqs (Sequence[str] | None)
seqarr (ndarray)
filename (str | None)
rng (Generator)
hamming_distance_from_sequence (Tuple[int, str] | None)
- __init__(length=None, num_random_seqs=None, shuffle=False, alphabet=('A', 'C', 'G', 'T'), seqs=None, seqarr=None, filename=None, rng=Generator(PCG64) at 0x7FEF7828E120, hamming_distance_from_sequence=None)[source]
Creates a set of DNA sequences, all of the same length.
Create either all sequences of a given length if seqs is not specified, or all sequences in seqs if seqs is specified. If neither is specified then all sequences of length 3 are created.
Exactly one of the following should be specified:
length (possibly along with alphabet and num_random_seqs)
seqs
seqarr
filename
hamming_distance_from_sequence (possibly along with alphabet and num_random_seqs)
- Parameters:
length (int | None) – length of sequences; num_seqs and alphabet can also be specified along with it
hamming_distance_from_sequence (Tuple[int, str] | None) – if specified and equal to (dist, seq) of type (int, str), then only sequences at Hamming distance dist from seq will be generated. Raises error if length, seqs, seqarr, or filename is specified.
num_random_seqs (int | None) – number of sequences to generate; if not specified, then all sequences of length length using bases from alphabet are generated. Sequences are sampled with replacement, so the same sequence may appear twice.
shuffle (bool) – whether to shuffle sequences
alphabet (Collection[str]) – a subset of {‘A’, ‘C’, ‘G’, ‘T’}
seqs (Sequence[str] | None) – sequence (e.g., list or tuple) of strings, all of the same length
seqarr (np.ndarray | None) – 2D NumPy array, with axis 0 moving between sequences, and axis 1 moving between consecutive DNA bases in a sequence
filename (str | None) – name of file containing a
DNASeqList
as written byDNASeqList.write_to_file()
rng (np.random.Generator) – numpy random number generator (type returned by numpy.random.default_rng())
- rng: Generator
Random number generator to use.
- seqarr: ndarray
Uses a (noncompact) internal representation using 8 bits (1 byte, dtype = np.ubyte) per base, stored in a numpy 2D array of bytes. Each row (axis 0) is a DNA sequence, and each column (axis 1) is a base in a sequence.
The code used is \(A \to 0, C \to 1, G \to 2, T \to 3\).
- seqlen: int
Length of each DNA sequence (number of columns, axis 1, in
DNASeqList.seqarr
)
- numseqs: int
Number of DNA sequences (number of rows, axis 0, in
DNASeqList.seqarr
)
- random_choice(num, rng=Generator(PCG64) at 0x7FEF7828E120, replace=False)[source]
Returns random choice of num DNA sequence(s) (represented as list of Python strings).
- Parameters:
num (int) – number of sequences to sample
rng (Generator) – random number generator to use
replace (bool) – whether to sample with replacement
- Returns:
sampled sequences
- Return type:
List[str]
- random_sequence(rng=Generator(PCG64) at 0x7FEF7828E120)[source]
Returns random DNA sequence (represented as Python string).
- Returns:
sampled sequence
- Parameters:
rng (Generator)
- Return type:
str
- write_to_file(filename)[source]
Writes text file describing DNA sequence list, in format
numseqs seqlen seq1 seq2 seq3 …
where numseqs, seqlen are integers, and seq1, … are strings from {A,C,G,T}
- Parameters:
filename (str)
- Return type:
None
- to_list()[source]
Return list of strings representing the sequences, e.g. [‘ACG’,’TAA’]
- Return type:
List[str]
- get_seq_str(idx)[source]
Return idx’th DNA sequence as a string.
- Parameters:
idx (int)
- Return type:
str
- get_seqs_str_list(slice_)[source]
Return a list of strings specified by slice.
- Parameters:
slice_ (slice)
- Return type:
List[str]
- keep_seqs_at_indices(indices)[source]
Keeps only sequences at the given indices.
- Parameters:
indices (Iterable[int])
- Return type:
None
- hamming_map(sequence)[source]
Return dict mapping each length d to a
DNASeqList
of sequences that are Hamming distance d from seq.- Parameters:
sequence (str)
- Return type:
Dict[int, DNASeqList]
- sublist(start, end=None)[source]
Return sublist of DNASeqList from start, inclusive, to end, exclusive.
If end is not specified, goes until the end of the list.
- Parameters:
start (int)
end (int | None)
- Return type:
- filter_energy(low, high, temperature)[source]
Return new DNASeqList with seqs whose wc complement energy is within the given range.
- Parameters:
low (float)
high (float)
temperature (float)
- Return type:
- energies(temperature)[source]
Calculates the nearest-neighbor binding energy of each sequence with its perfect complement (summing over all length-2 substrings of the domain’s sequence), using parameters from the 2004 Santa-Lucia and Hicks paper (https://www.annualreviews.org/doi/abs/10.1146/annurev.biophys.32.110601.141800, see Table 1, and example on page 419).
This is used by
NearestNeighborEnergyFilter
to calculate the energy of domains when filtering.- Parameters:
temperature (float) – temperature in Celsius
- Returns:
array of nearest-neighbor energies of each sequence with its perfect Watson-Crick complement
- Return type:
ndarray
- filter_end_gc()[source]
Remove any sequence with A or T on the end. Also remove domains that do not have an A or T either next to that base, or one away. Otherwise we could get a domain ending in {C,G}^3, which, placed next to any domain ending in C or G, will create a substring in {C,G}^4 and be rejected if we are filtering those.
- Return type:
- filter_end_at(gc_near_end=False)[source]
Remove any sequence with C or G on the end. Also, if gc_near_end is True, remove domains that do not have an C or G either next to that base, or one away, to prevent breathing.
- Parameters:
gc_near_end (bool)
- Return type:
- filter_base_nowhere(base)[source]
Remove any sequence that has given base anywhere.
- Parameters:
base (str)
- Return type:
- filter_base_count(base, low, high)[source]
Remove any sequence not satisfying low <= #base <= high.
- Parameters:
base (str)
low (int)
high (int)
- Return type:
- np.create_toeplitz(seqlen, sublen, indices=None)[source]
Creates a toeplitz matrix, useful for finding subsequences.
seqlen is length of larger sequence; sublen is length of substring we’re checking for. If indices is None, then all rows are created, otherwise only rows for checking those indices are created.
- Parameters:
seqlen (int)
sublen (int)
indices (Sequence[int] | None)
- Return type:
np.ndarray
- np.calculate_loop_energies(temperature, negate=False)[source]
Get SantaLucia and Hicks nearest-neighbor loop energies for given temperature, 1 M Na+.
- Parameters:
temperature (float)
negate (bool)
- Return type:
ndarray
- np.wcenergy(seq, temperature, negate=False)[source]
Return the wc energy of seq binding to its complement.
- Parameters:
seq (str)
temperature (float)
negate (bool)
- Return type:
float
- np.calculate_wc_energies(seqarr, temperature, negate=False)[source]
Calculate and store in an array all energies of all sequences in seqarr with their Watson-Crick complements.
- Parameters:
seqarr (ndarray)
temperature (float)
negate (bool)
- Return type:
ndarray
- np.wc_arr(seqarr)[source]
Return numpy array of reverse complements of sequences in seqarr.
- Parameters:
seqarr (ndarray)
- Return type:
ndarray
- np.energy_hist(length, temperature=37, combine_lengths=False, num_random_sequences=100000, figsize=(15, 6), **kwargs)[source]
Make a matplotlib histogram of the nearest-neighbor energies (as defined by
DNASeqList.energies()
) of all DNA sequences of the given length(s), or a randomly selected subset if length(s) is too large to enumerate all DNA sequences of that length.This is useful, for example, to choose low and high energy values to pass to
NearestNeighborEnergyFilter
.- Parameters:
length (int | Iterable[int]) – length of DNA sequences to consider, or an iterable (e.g., list) of lengths
temperature (float) – temperature in Celsius
combine_lengths (bool) – If True, then length should be an iterable, and the histogram will combine all calculated energies from all lengths into one histogram to plot. If False (the default), then different lengths are plotted in different colors in the histogram.
num_random_sequences (int) – If the length is too large to enumerate all DNA sequences of that length, then this many random sequences are used to estimate the histogram.
figsize (Tuple[int, int]) – Size of the figure in inches.
kwargs – Any keyword arguments given are passed along as keyword arguments to matplotlib.pyplot.hist: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
- Return type:
None