NOMAD V1.0
About
Nomad (Neighborhood Optimization for Multiple Alignment Discovery) is a
program dedicated to the ungapped local multiple alignment (ULMA)
problem, also known as "blocks". Because its entropy-based objective
function takes the nature of amino acids into account, Nomad is well
suited to protein sequences. This objective function, the shared
entropy, has been shown to be significantly more reliable than the
relative entropy when the protein sequences to be aligned are distantly
related.
Reference
Hernandez D., Gras R., Appel R. Neighborhood Function and Hill-Climbing
Strategies dedicated to the Generalized Ungapped Local Multiple
Alignment. Eur J Oper Res, 2006, in press
(doi: 10.1016/j.ejor.2005.10.076).
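Hernandez D. (2005) Stratégies d'optimisation combinatoire pour le
problème de l'alignement local multiple sans indels, et application aux
séquences protéiques [Combinatorial optimization strategies for the
ungapped local multiple alignment problem, and application to protein
sequences]. PhD thesis, Université de Genève, Switzerland.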
Overview
A ULMA is essentially a collection of n occurrences of width w, chosen
so as to be maximally conserved. Both n and w are fixed by the user.
Nomad is an optimization program that uses a hill-climber to search for
the n occurrences that maximize an objective function.
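As a rough illustration of the idea, the following Python sketch runs a
plain hill-climber over the start positions of the occurrences. It is
not Nomad's algorithm: the neighborhood (moving one occurrence at a
time), the restriction to exactly one occurrence per sequence, the toy
conservation score (minus the summed column entropies), the toy
sequences and all function names are assumptions made for the example.

```python
import math
import random
from collections import Counter


def conservation(block):
    """Toy objective (an assumption): minus the summed column entropies, in bits.

    The value is 0.0 for a perfectly conserved block and negative otherwise.
    """
    n = len(block)
    total = 0.0
    for column in zip(*block):
        for count in Counter(column).values():
            f = count / n
            total += f * math.log2(f)
    return total


def extract(sequences, starts, w):
    """Cut one occurrence of width w out of each sequence."""
    return [s[p:p + w] for s, p in zip(sequences, starts)]


def hill_climb(sequences, w, rng):
    """Move one occurrence at a time until no single move improves the score.

    Every sequence is assumed to be at least w symbols long.
    """
    starts = [rng.randrange(len(s) - w + 1) for s in sequences]
    best = conservation(extract(sequences, starts, w))
    improved = True
    while improved:
        improved = False
        for i, s in enumerate(sequences):
            for p in range(len(s) - w + 1):
                candidate = starts[:i] + [p] + starts[i + 1:]
                value = conservation(extract(sequences, candidate, w))
                if value > best + 1e-12:  # accept strict improvements only
                    starts, best, improved = candidate, value, True
    return starts, best


if __name__ == "__main__":
    # Hypothetical toy sequences sharing the made-up motif "MKISE".
    toy = ["AAMKISEGG", "MKISETTTT", "CCCCMKISE"]
    starts, score = hill_climb(toy, w=5, rng=random.Random(0))
    print(starts, round(score, 3))
```

A hill-climber of this kind stops at the first local optimum it
reaches, so the result depends on the random starting positions.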
The occurrence distribution in the sequence set can be constrained in
four ways:
(a) OOPS (One Occurrence Per Sequence)
This is the simplest and most constrained mode. Every sequence is
assumed to contribute exactly once to the ULMA. In this mode, n is
implicitly fixed by the number of sequences in the dataset.
(b) ALOOPS (At Least One Occurrence Per Sequence)
All sequences must contribute to the ULMA, but some may contribute more
than once. n must be greater than or equal to the number of sequences.
(c) AMOOPS (At Most One Occurrence Per Sequence)
Some sequences can be discarded from the ULMA. n must be lower than or
equal to the number of sequences.
(d) AOPS (Any number of Occurrences Per Sequence)
This is the least constrained mode. It allows occurrences to be
distributed anywhere in the sequence set, as long as they do not overlap
with each other. n must be set between 2 and a reasonable value.
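The constraints that the four modes place on n can be summarized in a
few lines of Python. This is only a sketch of the rules stated above:
the function name is hypothetical, and the upper bound for AOPS is left
unchecked because "a reasonable value" is not quantified here.

```python
def occurrence_count_is_valid(mode, n, num_sequences):
    """Check n against the rule stated for each occurrence mode."""
    if mode == "OOPS":
        return n == num_sequences   # n is implied by the dataset
    if mode == "ALOOPS":
        return n >= num_sequences   # every sequence contributes at least once
    if mode == "AMOOPS":
        return n <= num_sequences   # some sequences may be left out
    if mode == "AOPS":
        return n >= 2               # upper bound not specified, so not checked
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # With 15 sequences, 20 occurrences fit ALOOPS and AOPS only.
    for mode in ("OOPS", "ALOOPS", "AMOOPS", "AOPS"):
        print(mode, occurrence_count_is_valid(mode, 20, 15))
```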
The most widely used objective function for the ULMA problem is the
relative entropy, which is the information-theoretic view of a
log-likelihood ratio statistic. Its main drawback when aligning protein
sequences is that all amino acids are treated as independent symbols:
the fact that some substitutions occur more often than others is
ignored. Nomad implements the "shared entropy", an objective function
that takes into account an "equivalence" measure between amino acids.
The shared entropy has been shown to be significantly more effective
than the relative entropy, both in terms of signal/noise discrimination
and in terms of the optimization process.
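The drawback of the relative entropy can be seen on a single column. The
Python sketch below computes a textbook relative entropy for an ungapped
block; the uniform background distribution, the base-2 logarithm, the
absence of pseudocounts and the function name are assumptions, and the
exact normalization used by Nomad may differ. The shared entropy itself
is not reproduced, since its definition is not given on this page.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed background: every amino acid equally likely.
UNIFORM = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def relative_entropy(block, background=UNIFORM):
    """Sum over columns of sum_a f(a) * log2(f(a) / p(a)), in bits."""
    n = len(block)
    total = 0.0
    for column in zip(*block):
        for symbol, count in Counter(column).items():
            f = count / n  # observed frequency of the symbol in this column
            total += f * math.log2(f / background[symbol])
    return total


# One-column blocks: I, L and V are chemically similar aliphatic residues,
# whereas I, D and K are hydrophobic, acidic and basic respectively.
conservative = ["I", "L", "V"]
radical = ["I", "D", "K"]

if __name__ == "__main__":
    print(relative_entropy(conservative) == relative_entropy(radical))  # True
```

Both columns receive the same score because only symbol counts enter the
formula. An objective function that uses an equivalence measure between
amino acids is meant to tell these two cases apart.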
Input
Dataset:
Paste your sequences in the FASTA format.
Example:
>sequence label
MKALTARQQEVFD...
>sequence label
MEQNPQSQLKLLV...
>sequence label
MGMKISELAKACD...
Width:
Set the width of the ULMA to be searched.
Protein, shared entropy:
This is the default option. The ULMA is optimized with the shared
entropy.
Protein, relative entropy:
Optimize the ULMA with the "classical" relative entropy objective
function, the most widely used function for the ULMA problem.
DNA, relative entropy:
Choose this option if you align DNA sequences.
Occurrence repartition constraints:
Choose one of the OOPS, ALOOPS, AMOOPS or AOPS constraint models and set
the number of occurrences in the ULMA.
Sort occurrences:
Check this option to sort occurrences by
their own score. The score of an occurrence is a log-likelihood ratio,
which reflects how well the occurrence fits the rest of the ULMA.
E-mail address:
Enter your e-mail address to receive the result by e-mail. This option
is recommended, and is more reliable, when the CPU time is substantial.
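Before pasting a dataset, it can be useful to check that it parses as
FASTA and that every sequence can hold an occurrence of the chosen
width. The Python sketch below is one way to do this locally; it is not
part of Nomad, and the function names, labels and sequences are
hypothetical.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FASTA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is not None:
            chunks.append(line)  # a sequence may span several lines
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def too_short(records, width):
    """Labels of sequences that cannot hold an occurrence of this width."""
    return [label for label, seq in records if len(seq) < width]


if __name__ == "__main__":
    # Hypothetical dataset: labels and residues are made up for the example.
    dataset = """>seq1
MKISELAKAC
DVNKETVRYY
>seq2
MKAL
"""
    records = parse_fasta(dataset)
    print([(label, len(seq)) for label, seq in records])
    print(too_short(records, width=10))
```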
Explanation of the result
The example below shows a ULMA in OOPS mode, computed on 15
helix-turn-helix domain-containing proteins. The first column shows the
label of the sequence, and the second column gives the position of the
occurrence in that sequence. The third column shows the occurrence
itself, and the fourth column shows the score of the occurrence. This
score reflects how well the occurrence fits the rest of the alignment.
The alignment score is the objective value that has been optimized, and
corresponds to the average occurrence score. Note that these scores
cannot be interpreted as confidence values. They are only relative to
the ULMA that has been optimized and thus cannot be compared between
different ULMAs.
Symbols are shaded in blue according to their contribution to the
objective score: the darker the symbol, the stronger its contribution.
Since Nomad performs stochastic optimization, two independent runs with
the same parameters may produce different results. If this occurs,
simply keep the best-scoring alignment.
>LEXA_ECOLI_P03033;   26  PTRAEIAQRLGFRSPNAAEEHL  15.691
>RPSD_ECOLI_P00579;  571  YTLEEVGKQFDVTRERIRQIEA  19.645
>MERR_STAAU_P22874;    3  MKISELAKACDVNKETVRYYER  19.422
>ASNC_ECOLI_P03809;   23  TAYAELAKQFGVSPGTIHVRVE  22.185
>ICLR_ECOLI_P16528;   44  VALTELAQQAGLPNSTTHRLLT  18.815
>LACR_STAAW_P16644;   20  IRTNEIVEGLNVSDMTVRRDLI  16.389
>CRP_ECOLI_P03020;   168  ITRQEIGQIVGCSRETVGRILK  20.827
>GNTR_BACLI_P46833;   42  LSENKLAAEFSVSRSPIREALK  17.506
>PMX1_MOUSE_P43271;  122  FVREDLARRVNLTEARVQVWFQ  18.347
>LYSR_ECOLI_P03030;   19  GSLTEAAHLLHTSQPTVSRELA  18.060
>ARSR_STAAU_P30338;   30  LCACDLLEHFQFSQPTLSHHMK  20.778
>ARAC_ECOLI_P03021;  195  FDIASVAQHVCLSPSRLSHLFR  19.659
>NER_BPMU_P06020;     23  LSLSALSRQFGYAPTTLANALE  19.644
>RCRO_BPP22_P09964;   11  GTQRAVAKALGISDAAVSQWKE  18.936
>FIXJ_BRAJA_P23221;  158  LSNKLIAREYDISPRTIEVYRA  17.702
Objective score 18.907
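The relation between the occurrence scores and the objective score can
be checked directly on the example above. The Python sketch below parses
the four columns, averages the occurrence scores and ranks the
occurrences as the "Sort occurrences" option would; the parsing rules
and names are assumptions based on the layout shown here, not a
documented output format.

```python
RESULT = """\
>LEXA_ECOLI_P03033; 26 PTRAEIAQRLGFRSPNAAEEHL 15.691
>RPSD_ECOLI_P00579; 571 YTLEEVGKQFDVTRERIRQIEA 19.645
>MERR_STAAU_P22874; 3 MKISELAKACDVNKETVRYYER 19.422
>ASNC_ECOLI_P03809; 23 TAYAELAKQFGVSPGTIHVRVE 22.185
>ICLR_ECOLI_P16528; 44 VALTELAQQAGLPNSTTHRLLT 18.815
>LACR_STAAW_P16644; 20 IRTNEIVEGLNVSDMTVRRDLI 16.389
>CRP_ECOLI_P03020; 168 ITRQEIGQIVGCSRETVGRILK 20.827
>GNTR_BACLI_P46833; 42 LSENKLAAEFSVSRSPIREALK 17.506
>PMX1_MOUSE_P43271; 122 FVREDLARRVNLTEARVQVWFQ 18.347
>LYSR_ECOLI_P03030; 19 GSLTEAAHLLHTSQPTVSRELA 18.060
>ARSR_STAAU_P30338; 30 LCACDLLEHFQFSQPTLSHHMK 20.778
>ARAC_ECOLI_P03021; 195 FDIASVAQHVCLSPSRLSHLFR 19.659
>NER_BPMU_P06020; 23 LSLSALSRQFGYAPTTLANALE 19.644
>RCRO_BPP22_P09964; 11 GTQRAVAKALGISDAAVSQWKE 18.936
>FIXJ_BRAJA_P23221; 158 LSNKLIAREYDISPRTIEVYRA 17.702
Objective score 18.907
"""


def parse_result(text):
    """Return ([(label, position, occurrence, score), ...], objective)."""
    rows, objective = [], None
    for line in text.splitlines():
        fields = line.split()
        if line.startswith(">") and len(fields) == 4:
            label = fields[0][1:].rstrip(";")
            rows.append((label, int(fields[1]), fields[2], float(fields[3])))
        elif line.startswith("Objective score"):
            objective = float(fields[-1])
    return rows, objective


rows, objective = parse_result(RESULT)
mean_score = sum(row[3] for row in rows) / len(rows)
# Highest-scoring occurrence first.
ranked = sorted(rows, key=lambda row: row[3], reverse=True)

if __name__ == "__main__":
    print(round(mean_score, 3), objective)  # 18.907 18.907
    print(ranked[0][0], ranked[0][3])       # ASNC_ECOLI_P03809 22.185
```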
Contact
For questions, suggestions or comments, please contact us.