Extracting Algorithms From Code

This work explores the middle ground between strict up-front formal methods and just using model checking as a debugging tool. The former doesn’t cope well with real-world projects — changes in the implementation don’t end up getting propagated all the way back to the formal model — and the latter, while enabling round-tripping to some degree, only provides a way of reasoning about fairly weak properties of the program.

L. Lamport’s TLA is used here to analyse actual program code; previously it has really only been used to analyse algorithms (i.e. small, isolated program fragments) and hardware specifications.

TLA+ is a framework based on TLA that has been extended to be able to model real programs. The temporal logic is not really the main thrust of TLA+; the framework is arranged so that the temporal-logic machinery fades into the background, leaving you free to get real work done.

So this project attempts to allow pulling back all the way from real code to algorithms, rather than stopping at a few limited properties of the program; the idea is to combine the power of full up-front formal methods with the flexibility of the “debugging” approach.

The project models the C language. The model is then used as an interpreter for the program we’re ultimately interested in. The model-checker TLC tells you which transformations on your program are meaning preserving; the user then goes through the program transforming it to successively more and more abstract high-level code. Ultimately you end up with a high-level formal model of the low-level program, about which quite interesting questions can be asked and answered within the TLA+ framework.
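
As a toy analogy of that workflow (Python, with an invented example; the real thing is a TLA+ model of C semantics checked with TLC), here is the kind of question being asked: is a candidate, more abstract version of a piece of code meaning-preserving? A model checker answers it by exhaustively exploring a finite state space; the sketch below just exhaustively compares behaviour over a finite set of inputs.

```python
# Toy stand-in for a meaning-preservation check. Both functions and the
# input space are invented for illustration; TLC would do this over a
# TLA+ model of the C program instead.

def concrete(xs):
    # Low-level version: explicit loop with a mutable accumulator.
    total = 0
    for x in xs:
        total += x
    return total

def abstracted(xs):
    # Candidate higher-level rewrite of the same algorithm.
    return sum(xs)

def transformation_preserves_meaning(inputs):
    # Accept the transformation only if the two versions agree on
    # every input in the (finite) space being explored.
    return all(concrete(xs) == abstracted(xs) for xs in inputs)

small_inputs = [[a, b] for a in range(-3, 4) for b in range(-3, 4)]
print(transformation_preserves_meaning(small_inputs))  # True
```

Accepting such a step lets you replace the low-level fragment with the abstract one and repeat, which is the successive-abstraction loop described above.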

Protein Folding Simulation in CCP

I didn’t make many notes on this (since it’s not really my topic) except during the question time at the end, when I caught a couple of interesting questions someone else was asking:

Q: How does it compare with homology?
A: Not well. But homology is limited too, and a hybrid approach is promising.

Comment: It seems very naïve to predict tertiary structure from secondary structure with such a simplified model - there’s a long way to go.

Q: How long were the chains you tested? How big were the errors?
A: The chains were 30 to 40 amino acids long.

Multiple Biological Model Classification: From Systems Biology to Synthetic Biology

Notes:

  • The central dogma has information flowing in one direction only. This fits the Darwinian view of the world; if information were able to flow back from proteins to the genetic (hereditary) material, we would have a more Lamarckian system. It turns out, of course, that there is information flow in the reverse direction — promoters, operons, regulatory sequences, signalling molecules etc. — but it only seldom (never?) modifies the genetic material in a heritable way.
  • Michaelis-Menten equations are used to model cooperative factors in transcription (see the equations just after this list).
  • “Simulate Biologists! Not Biology!!” - simulate biological reasoning. Faithful simulation of the actual biological system not required.
  • “Simpathica is a modular system”
  • There’s a canonical form for each described reaction system.
  • An example given is glycolysis. It’s written as a composition of building blocks. The modular nature of the simulation system means you can build libraries of modules.
  • It does ODEs and (maybe also) PDEs
  • The experiments on artificial regulatory circuits didn’t match the simulation - lots of effects need to be taken into account, e.g. RNAP run-on, …
  • Very cool “story generation” technique - ask questions about a whole bunch of variations on a model - it tells you in natural language about the results! (Also can query in natural language)
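
For reference, the kinetics referred to in the Michaelis-Menten bullet above (my own addition, not from the talk): the standard Michaelis-Menten rate law, and the Hill-style generalisation that is usually used when cooperative binding matters.

$$ v = \frac{V_{\max}\,[S]}{K_m + [S]} \qquad\text{and}\qquad v = \frac{V_{\max}\,[S]^n}{K^n + [S]^n} $$

Here $[S]$ is the substrate (or transcription-factor) concentration, $K_m$ and $K$ are half-saturation constants, and a Hill coefficient $n > 1$ models cooperativity.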

SBML - Systems Biology Markup Language

SBML is an exchange language for biochemical network models. It has various levels, each an elaboration of the previous; the levels are intended to complement and extend one another rather than replace one another outright, although it seems level 2 is significantly superior to level 1.

Both levels 1 and 2 are nonspatial, although level 3, currently under development, does incorporate some spatial properties. Level 2 uses MathML! Level 2 seems like a pretty standard DSL-style evolution of level 1 — refinements in order to make the language fit the domain better, addition of more language-like properties etc etc.

The reaction networks described by SBML are made up of compartments, species and reactions. The compartments are “well-stirred volumes” (how exactly are they nonspatial?). Species have to be redefined in each compartment, even if they’re notionally the same species — this seems similar to local channel names in ambients.
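
A minimal sketch of that structure in Python (my own illustration, not real SBML or libSBML; all names and values are invented) to make the compartment/species/reaction split, and the per-compartment duplication of species, concrete:

```python
# Plain data classes mirroring the SBML-style structure described above.
from dataclasses import dataclass, field

@dataclass
class Compartment:
    id: str
    size: float          # a "well-stirred volume"; units left unspecified

@dataclass
class Species:
    id: str
    compartment: str     # species are per-compartment, so the "same"
                         # species in two compartments needs two entries

@dataclass
class Reaction:
    id: str
    reactants: list
    products: list
    rate_law: str        # SBML level 2 expresses kinetic laws in MathML

@dataclass
class Model:
    compartments: list = field(default_factory=list)
    species: list = field(default_factory=list)
    reactions: list = field(default_factory=list)

model = Model(
    compartments=[Compartment("cytosol", 1.0), Compartment("nucleus", 0.1)],
    species=[Species("A_cytosol", "cytosol"), Species("A_nucleus", "nucleus")],
    reactions=[Reaction("transport", ["A_cytosol"], ["A_nucleus"],
                        "k * A_cytosol")],
)
```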

At level 3, some of the planned features include model composition, spatial features, diagrams (not meaningful though — just metadata hooks for GUIs to store their hand-edited layout information), and stochastic reactions. Level 2 was limited by combinatorial explosions, since species are per-compartment unique and can’t be shared; also, no record can be made of the constitution of intermediate complexes: they’re just blobs like everything else. They aren’t even marked up to be recognisable as complexes, unless some non-standard extra-systemic convention is layered on top.

Another feature being considered for level 3 is generalised reactions, where for instance you might generalize over a species’ various phosphorylation states.

There are two planned ways of composing models at level 3: by providing intermodel reactions, or by identifying species within the interlinked models. Level 3 composed models can be flattened to level 2 monolithic models.

Notes:

  • SBML can express ODE models directly, without using the reaction-network subset of the language.

  • What units are used for measuring compartment size?

  • The language supports units — compound SI units! This made me think of the occasional LTU threads (1, 2) on the topic…

  • Does the SBML definition of “species” fit our “alternative” model?

  • He mentioned CellML as a related system.

  • The planned generalised reactions are kind-of tree transformers! They only move subunits around; wildcards and backreferences would be used to leave some parts of the species untouched.

  • It looks like the libSBML interop APIs might be generated by SWIG…

  • libSBML has test suites and an SBML model repository available, with over 90% of the CellML repository converted to SBML.

  • The next SBML forum will be 14-15 October, 2004, following ICSB 2004.

  • Apparently “continuous mathematical modelling” is an interesting related topic?

  • Are PDEs spatial (compared to ODEs)?

Gene Regulation in the Pi Calculus: Simulating Cooperativity at the Lambda Switch

Notes:

  • The lambda switch is a standard model of a bacterial gene regulation mechanism. Lambda is a virus infecting E. coli.
  • The lambda switch incorporates both positive and negative feedback, as well as many different kinds of feedback loop. Only transcriptional control was considered in this talk.
  • Lambda can operate in two phases: the lysogenic and the lytic phases. In lysogeny, the lambda genome is passivated and incorporated into the host cell, being replicated by the host across generations, to be reactivated into the lytic phase in some later generation. Unprompted reactivation is very rare, and is induced artificially (“induction signal”) for experimental purposes.
  • [Aside: with our “alternative model” strategy for encoding biochemical reactions, does bisimulation still compute something relevant?]
  • The paper covers various implementation tricks (handshakes, timers) for encoding the various interactions of proteins — there wasn’t time to cover the techniques during the talk.
  • Modularity within models is a major challenge - the implementation tricks required seemed to make the model monolithic (?)
  • Accessibility of π-based modelling tools to non-computer-scientists is also very important
  • A question from the end: the table in the slides listing the various states of each reagent (“states of molecular actors”) is a priori information about the control flow in the system (is it related to the lack of ambients?). The question was: how might that table be derived automatically rather than supplied a priori?

Modelling the influence of RKIP on the ERK signalling pathway using the stochastic process algebra PEPA

The focus of the talk was analysis, rather than simulation.

PEPA

PEPA is a stochastic process algebra. PEPA looks like CCS, but instead of parallel composition with handshaking on actions, you have a cooperation operator, causing two processes to run in parallel synchronising on a certain subset of their shared actions. This gives you n-ary interactions — compare π with its binary interactions. PEPA was originally used in performance modelling for telecommunications networks.

PEPA programs are translated (via an SOS definition (?)) into labelled multi-transition systems, and from there further transformed into Continuous Time Markov Chains (CTMCs). The experiments described in the talk were thus not simulations; the CTMC is a static object that is solved rather than run.
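
To make that concrete, here is a minimal sketch (my own, not from the talk; numpy, with invented toy rates rather than anything from the RKIP/ERK model) of analysing a CTMC rather than simulating it: build the generator matrix from the transition rates and solve directly for the steady-state distribution.

```python
import numpy as np

# Invented three-state CTMC: transition rates between numbered states.
rates = {(0, 1): 2.0, (1, 0): 1.0, (1, 2): 0.5, (2, 0): 1.5}

n = 3
Q = np.zeros((n, n))
for (i, j), r in rates.items():
    Q[i, j] = r
for i in range(n):
    Q[i, i] = -Q[i].sum()          # diagonal makes each row sum to zero

# The steady-state distribution pi satisfies pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                          # long-run probability of each state
```

No trajectories are generated; the answers come straight out of the linear algebra, which is what distinguishes this kind of analysis from stochastic simulation.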

Some of what was learnt from the use of PEPA in the telco industry:

  • compositionality was very important — no longer any need to rewrite the whole system for each small change

  • equivalences are important when manipulating models — without a formal equivalence, you cannot prove the model is still the same after your simplification. (Simplifications are necessary to reduce the state space and thus improve tractability.)

  • it turns out to be possible to syntactically identify certain classes of efficiently-implementable models! So just by looking at the PEPA process description it is possible to judge some aspects of the model’s complexity (e.g. the “product form” of CTMCs).

The ERK Signalling Pathway

The ERK pathway relays extracellular signals from the cell membrane inward to the cell nucleus. It is often studied since a breakdown in the pathway can lead to cancer. Only a small part of the full pathway was analysed: the part inbound from the cell membrane to the nuclear membrane.

The hypothesis being tested was that RKIP is a regulator of the ERK pathway. (The results appeared to support the hypothesis, according to the graphs from the paper…)

In modelling the pathway, since the final target is a CTMC, there is no way of representing concentrations of reagents directly: those values are continuous. If you approximate using a few discrete concentration levels, you must be careful to avoid a combinatorial explosion in the state space, which must be kept finite and small. The solution adopted was to have only two concentration levels for reagents: high and low. In the “high” state, various reactions were allowed to proceed; in the “low” state they were inhibited or modified appropriately. Essentially “high” meant “enough to proceed” and “low” meant “zeroish”.
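
A toy illustration of this two-level abstraction (illustrative Python only; the reagent and reaction names are placeholders, and this is not the paper’s actual PEPA model): each reagent is either high or low, a reaction is enabled only when all of its reactants are high, and the resulting state space is finite.

```python
from itertools import product

HIGH, LOW = "high", "low"
reagents = ["RKIP", "Raf1", "ERK"]            # placeholder names
reactions = [({"RKIP", "Raf1"}, "bind"),      # enabled iff all reactants HIGH
             ({"ERK"}, "phosphorylate")]

# The abstracted state space is finite: 2^n states for n reagents,
# instead of an unbounded space of real-valued concentrations.
states = list(product([HIGH, LOW], repeat=len(reagents)))
print(len(states))                            # 8

def enabled(state):
    level = dict(zip(reagents, state))
    return [name for reactants, name in reactions
            if all(level[r] == HIGH for r in reactants)]

print(enabled((HIGH, HIGH, LOW)))             # ['bind']
```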

Explicit representation of the two concentration-approximations was only required in one of the two models studied, the reagent-centric one. The pathway-centric one implicitly encoded concentration in the states of each sub-pathway.

There was some interesting commentary on the differences between the two models — the pathway-centric one was easier to assemble and much easier to get right; each model also emphasised different aspects of the reactions, the reagent-centric one apparently “feeling” fine-grained, the pathway-centric one more coarse-grained.

The lack of a proper concentration representation meant that indirect methods had to be used to adjust initial conditions when experimenting with the model. Instead of increasing or decreasing the concentration of RKIP, the rate of the first step in the RKIP subpathway was increased or decreased respectively, which effectively achieves the same thing for the purposes of these experiments.

A few miscellaneous notes on the talk:

  • Sum is not nondeterministic, because of the introduction of a race policy (see the sketch just after this list).

  • Bisimulation can be used to demonstrate equivalence of models intended to represent the same system; this was put to use in the RKIP/ERK project in deciding that the two alternative representations (reagent-centric and pathway-centric) were equivalent.

  • ODEs are non-structural — they do not contain any information on the structure of the pathways in the system, they just allow computation of the concentrations of the reagents. Compare to stochastic π, which is structural — detailing the reaction pathways — and also quantitative enough to be able to provide useful concentration data.

  • Complexes are explicitly represented in the graphs of reagents and pathways.
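
A sketch of the race policy mentioned in the first bullet above (illustrative Python, not PEPA tooling; the action names and rates are invented): each enabled action draws an exponentially distributed delay from its rate, the fastest draw wins, and so choice between competing actions is resolved probabilistically rather than nondeterministically.

```python
import random

def race(actions):
    """actions: dict mapping action name -> rate. Returns the winning action."""
    draws = {name: random.expovariate(rate) for name, rate in actions.items()}
    return min(draws, key=draws.get)

# Over many trials, action 'a' (rate 2.0) wins about twice as often as
# action 'b' (rate 1.0): P(a wins) = 2 / (2 + 1).
wins = sum(race({"a": 2.0, "b": 1.0}) == "a" for _ in range(10_000))
print(wins / 10_000)                 # roughly 0.67
```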

Concur 2004

I’m sitting in the foyer of the Royal Society in London. This is the venue for Concur 2004, a conference on Concurrency Theory. I’m just attending a couple of the workshops — specifically BioConcur ‘04 and, next Saturday, AVoCS ‘04.

BioConcur

So far there’s been a talk by Corrado Priami giving an overview of the application of concurrent languages (such as the π-calculus) to the field of Systems Biology.

One success story so far is a study of inflammatory processes in brain blood vessels affected by Multiple Sclerosis. A π-based model was constructed, and simulations of the model were observed. The graphs of the simulations led to the formation of a previously unstudied hypothesis relating the number of rolling lymphocytes to blood vessel diameter. Later investigation in mice demonstrated the validity of the hypothesis.

I made a few random notes:

  • How does π improve over CCS with regard to cost? Why are names interesting from a cost point-of-view in biological models?

  • ODEs (Ordinary Differential Equations) are noncompositional; π is linguistically compositional; it seems to me that these biological models are partly compositional, partly noncompositional, because of the need for encoding of biochemical behaviours (?).

  • The table Priami presented relating π to biological systems:

    Biological entity          π-calculus construct
    Molecule                   Process
    Interaction capability     Channel
    Interaction                Communication
    Modification               State and/or channel change

    Compartments are modelled with restriction, in the absence of ambients or ambient-like structures. This is an acknowledged problem with using π for these models.

  • The π-based model treats a molecule as a black box with a number of interaction capabilities, each of which has a defined syntactic and behavioural specification for its interactions.

  • The exponential distribution used for selecting the next action gives a memoryless property to the system (spelled out in the note at the end of this list)

  • In the stochastic π calculus, there is a forced zero collision property — the system excludes the possibility of a collision in competing for resources. This keeps the world nice and deterministic…

  • There are standard tools for performance analysis of Continuous Time Markov Chains (link).

  • RCCS is another way of modelling these systems - the R stands for “reversible”!

  • Model checking is complicated by the stochasticness of the system; model checkers usually like a deterministic universe…

  • What kinds of equivalence between processes are biologically interesting?

  • Inductive Logic Programming is another alternative approach.
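
Expanding on the memorylessness bullet above (my own note, not from the talk): an exponentially distributed delay $T$ with rate $\lambda$ satisfies

$$ \Pr(T > s + t \mid T > s) = \Pr(T > t) = e^{-\lambda t}, $$

so the time already spent waiting for an interaction tells you nothing about how much longer it will take. This is the property that lets the stochastic semantics be captured by a Continuous Time Markov Chain.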