The web and semantics


5/10/2011 08:36:00 PM

Friends familiar with the big computer science labs, spot the error...


Code management


1/17/2011 10:55:00 PM

From one web reading to the next, I recently got unstuck on a question about the organization of software development...

When writing code, we face contradictory constraints:
  • Commit often, so as not to lose working code, and to communicate and get work-in-progress code tested.
  • Keep the codebase clean, so that people can grab a recent but stable version of the code, and so that the history stays clean.
We could add one last constraint: actually write code instead of procrastinating over questions of code organization...
To rephrase the second point, the goal is on one hand to ensure the quality of the code at a given instant (which requires distinguishing experimental code from stable, validated code) and on the other hand to ensure traceability and tracking of the code over time, that is, a readable history of changes (mainly for the stable code). The aim is that what is morally a single change should not be scattered across several commits. In particular, we don't want fixes trailing too often behind a hasty commit.

Having a clean history makes things easier when trying to understand a piece of code (seeing where some part of it comes from), or when listing changes before releasing a new version. It also makes life easier for those who watch changes to the code -- that is the main reason for my obsession with this problem, since I make a point of following the commits made to liquidsoap, at least those touching critical modules or ones I'm involved in. In fact, with a clean history there is a better chance that more people follow the evolution of the code closely and are likely to spot problems.

Of course, part of the answer comes from the notion of branch. But that does not help at all with managing history. On that front, one could try to design a system with several repositories: an experimental repository whose code is untested and whose history is fragmented, and a stable repository where the code is clean and changes are grouped to keep the history clear. But this works very badly: to carry changes over to the stable repository, you have to build a new tool; to carry them in the other direction, you face a whole pile of problems. In short, this solution is a dud, because it deprives us of the help of today's code management tools.

Deep down, there is only one way to maintain a clean history: history editing. As the name says, it means rewriting the change history after the fact: make a few commits, wait for feedback, find a bug, and once everything looks clean, collapse it all into a single, clear commit.

The big problem is that history editing can interact rather badly with distributed version control: if there are two copies of a repository and the history is rewritten in one of them, it becomes hard to follow, in the other, changes that relate to the old history. I illustrate this with an example below, but the general idea is that we should only rewrite parts of the history that nobody will ever depend on again. That rules out working with an experimental branch whose history is rewritten before the changes are carried into the stable branch, since that would desynchronize the various repositories where work continues in that branch. Even with one experimental branch per developer, the developer could no longer keep working in their branch after the history editing. There are only two situations where history editing is usable: either the changes have never been published, or they have been published but will never have descendants again.

I had been stuck at this point for a while. I had discovered mercurial, history editing, easy branch management, local commits, etc. But I couldn't find a solution to my problem. Until I read A Git Workflow for Agile teams, which does exactly what I want. The key is to really create LOTS of branches: one per feature or bug. Once the change is ready, at the moment of integrating it into the main branch, the history can be edited, and nobody touches the branch anymore. To come up with this, you really have to forget SCMs like SVN, where managing a branch is painful, and be used to a distributed SCM, where it is supposed to be completely natural.

Let me restate the recipe, adapted to mercurial... You may still want to read the original article, which says a few more things, notably about the correspondence between branches and tickets/issues.

An example


We have Alice and Bob working on the same project. They have both contributed to the branch 123-new-feature, one commit each. The history looks like this:
@  changeset:   3:f3a6c5394605
|  tag:         tip
|  parent:      0:11575ff21a29
|  user:        Alice 
|  date:        Tue Jan 11 15:34:05 2011 +0100
|  summary:     Dummy
|
| o  changeset:   2:c6dc5c92d631
| |  branch:      123-new-feature
| |  user:        Bob 
| |  date:        Tue Jan 11 15:15:26 2011 +0100
| |  summary:     Plus expressif!
| |
| o  changeset:   1:fafdcd7dccf5
|/   branch:      123-new-feature
|    user:        Alice 
|    date:        Tue Jan 11 15:11:40 2011 +0100
|    summary:     Nicer.
|
o  changeset:   0:11575ff21a29
   user:        Alice 
   date:        Tue Jan 11 15:08:58 2011 +0100
   summary:     Initial commit.

Alice and Bob are happy with their code, so Alice is going to collapse the two commits and move them onto default. We use the --keep option so as not to destroy the branch right away. Roughly, we take node 1 and its descendants, collapse them, and graft the resulting change onto node 3.
$ hg rebase --collapse --keep --source 1 --dest 3

@  changeset:   4:2bf346072a24
|  tag:         tip
|  user:        Alice 
|  date:        Tue Jan 11 15:15:26 2011 +0100
|  summary:     Plus joli et plus expressif!
|
o  changeset:   3:f3a6c5394605
|  parent:      0:11575ff21a29
|  user:        Alice 
|  date:        Tue Jan 11 15:34:05 2011 +0100
|  summary:     Dummy
|
| o  changeset:   2:c6dc5c92d631
| |  branch:      123-new-feature
| |  user:        Bob 
| |  date:        Tue Jan 11 15:15:26 2011 +0100
| |  summary:     Plus expressif!
| |
| o  changeset:   1:fafdcd7dccf5
|/   branch:      123-new-feature
|    user:        Alice 
|    date:        Tue Jan 11 15:11:40 2011 +0100
|    summary:     Nicer.
|
o  changeset:   0:11575ff21a29
   user:        Alice 
   date:        Tue Jan 11 15:08:58 2011 +0100
   summary:     Initial commit.

At this point we can even strip the branch from the repository altogether. The corresponding data is then lost forever.
$ hg strip 1

@  changeset:   2:2bf346072a24
|  tag:         tip
|  user:        Alice 
|  date:        Tue Jan 11 15:15:26 2011 +0100
|  summary:     Plus joli et plus expressif!
|
o  changeset:   1:f3a6c5394605
|  user:        Alice 
|  date:        Tue Jan 11 15:34:05 2011 +0100
|  summary:     Dummy
|
o  changeset:   0:11575ff21a29
   user:        Alice 
   date:        Tue Jan 11 15:08:58 2011 +0100
   summary:     Initial commit.

All clean! But Alice's destruction of the branch will not propagate to Bob. On the contrary, if Alice synchronizes with Bob she gets the branch back; it even shows up as new to her repository, which has lost all trace of it.

If we push Alice's changes to Bob (possibly via a central repository), changeset number 2 gets transmitted. The branch 123-new-feature remains in the other repositories -- in other words, history is not itself versioned. Bob could therefore make further changes in the branch, which is of course not really desirable...
# Bob makes a change (#3) and commits it

o  changeset:   5:2bf346072a24
|  tag:         tip
|  user:        Alice 
|  date:        Tue Jan 11 15:15:26 2011 +0100
|  summary:     Plus joli et plus expressif!
|
o  changeset:   4:f3a6c5394605
|  parent:      0:11575ff21a29
|  user:        Alice 
|  date:        Tue Jan 11 15:34:05 2011 +0100
|  summary:     Dummy
|
| @  changeset:   3:b37a07ad9982
| |  branch:      123-new-feature
| |  user:        Alice 
| |  date:        Tue Jan 11 15:39:38 2011 +0100
| |  summary:     Never enough of that.
| |
| o  changeset:   2:c6dc5c92d631
| |  branch:      123-new-feature
| |  user:        Bob 
| |  date:        Tue Jan 11 15:15:26 2011 +0100
| |  summary:     Plus expressif!
| |
| o  changeset:   1:fafdcd7dccf5
|/   branch:      123-new-feature
|    user:        Alice 
|    date:        Tue Jan 11 15:11:40 2011 +0100
|    summary:     Nicer.
|
o  changeset:   0:11575ff21a29
   user:        Alice 
   date:        Tue Jan 11 15:08:58 2011 +0100
   summary:     Initial commit.

If we now push Bob's changes back to Alice, we get the following situation, where the individual changes of the "old" branch are followed by a new change...

o  changeset:   5:b37a07ad9982
|  branch:      123-new-feature
|  tag:         tip
|  user:        Alice 
|  date:        Tue Jan 11 15:39:38 2011 +0100
|  summary:     Never enough of that.
|
o  changeset:   4:c6dc5c92d631
|  branch:      123-new-feature
|  user:        Bob 
|  date:        Tue Jan 11 15:15:26 2011 +0100
|  summary:     Plus expressif!
|
o  changeset:   3:fafdcd7dccf5
|  branch:      123-new-feature
|  parent:      0:11575ff21a29
|  user:        Alice 
|  date:        Tue Jan 11 15:11:40 2011 +0100
|  summary:     Nicer.
|
| o  changeset:   2:2bf346072a24
| |  user:        Alice 
| |  date:        Tue Jan 11 15:15:26 2011 +0100
| |  summary:     Plus joli et plus expressif!
| |
| o  changeset:   1:f3a6c5394605
|/   user:        Alice 
|    date:        Tue Jan 11 15:34:05 2011 +0100
|    summary:     Dummy
|
@  changeset:   0:11575ff21a29
   user:        Alice 
   date:        Tue Jan 11 15:08:58 2011 +0100
   summary:     Initial commit.

To repair this, we would need to graft only change 5 onto 2. But the goal is for this never to happen, provided the developers are well organized. For instance, rather than destroying a branch, it is better to keep it and use Mercurial's ability to close branches (hg commit --close-branch). One can imagine stripping old branches from all repositories periodically, once there is no longer any possible risk.

Rebasing onto an ancestor


Somewhat inconsistently, Mercurial's rebase does not allow the destination to be an ancestor of the source. That is, if we start an alternative branch and make no changes in the main branch, we cannot rebase the new branch onto the main one. A plain merge would not let us collapse the individual changes either. So we need a slightly different routine.

First, switch to the main branch: hg update default. Then, without moving in history, apply to our files the changes made in the other branch: hg revert -r tip --all. Finally, commit these changes in one go: hg ci -m "Combined.". There, 1 and 2 have become 3:
@  changeset:   3:b29ddaf6c31c
|  tag:         tip
|  parent:      0:11575ff21a29
|  user:        Alice 
|  date:        Tue Jan 11 15:57:58 2011 +0100
|  summary:     Combined.
|
| o  changeset:   2:c0aa9c6feb0a
| |  branch:      123-new
| |  user:        Alice 
| |  date:        Tue Jan 11 15:47:36 2011 +0100
| |  summary:     Two.
| |
| o  changeset:   1:23e96c98b90e
|/   branch:      123-new
|    user:        Alice 
|    date:        Tue Jan 11 15:47:29 2011 +0100
|    summary:     One.
|
o  changeset:   0:11575ff21a29
   user:        Alice 
   date:        Tue Jan 11 15:08:58 2011 +0100
   summary:     Initial commit.

As before, we can get rid of the old branch with hg strip 1.

The moral


I don't know how all this works out over time and at scale. I imagine things are sometimes more complicated than in this small example, but it seems worth a try. And it does seem to work for some people.

So the main remaining question is: when and how do we take the plunge with savonet?


Design III: Liquidsoap tomorrow?


12/17/2010 03:16:00 PM

In the previous post we've seen the problems with the current notion of source at the core of liquidsoap. The point of doing so was to highlight the interest of a more principled design.

We could start with a mathematical notion of stream, but we haven't reached that point yet. As I've illustrated before, sources are inherently interactive objects, which is hard to account for.

Instead, I'll keep the same practical notion of source but make one simplification: we'll work sample by sample. We'll have an #is_ready method that tells whether the source can produce a sample for the current instant. A source is always ready when it is in the middle of a track. And we have the #get method, which returns at most one sample, none when the current track ends, and should never be called unless the source #is_ready.
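As a sketch, the protocol could be rendered as an OCaml class type; the concrete types here (samples as floats, the toy source) are my own illustration, not actual liquidsoap code:

```ocaml
(* Hypothetical rendering of the sample-level protocol. *)
type sample = float

class type source = object
  (* Can the source produce a sample at the current instant? *)
  method is_ready : bool
  (* At most one sample; [None] marks an end of track.
     Must not be called unless [is_ready] holds. *)
  method get : sample option
end

(* A toy source playing endless tracks of [len] constant samples. *)
class constant_track (len : int) (v : sample) : source = object
  val mutable pos = 0
  method is_ready = true
  method get =
    if pos < len then (pos <- pos + 1 ; Some v)
    else (pos <- 0 ; None)  (* end of track; a new track follows *)
end
```

Pumping such a source yields `len` samples, then a `None` marking the end of track, then the next track begins.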

As before, a source is passive by default (it doesn't produce data when not asked to) and caching mechanisms ensure that a source gives consistent answers to several observers. Depending on how/when it is used, a source will generate a stream, possibly undefined at some instants, but otherwise containing samples, metadata and end of track markers. Note that we need more precision than just time to denote a point in such a stream, in order to be able to tell the position relative to end of tracks -- it is possible to have several consecutive end of tracks exactly at the same point in time.

Let us detail caching, which isn't totally trivial. What do we cache: a number N of end of tracks, corresponding to partial #gets, possibly followed by a sample. How do we use that cache: for an operator which has already seen M end of tracks and performs a #get, we return an end of track if M<N, and the sample otherwise.
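The rule fits in a tiny function; N, M and the cached sample are as above, the parameter names are mine:

```ocaml
(* The cache holds [n] end of tracks, possibly followed by a sample.
   An observer that has already seen [m] end of tracks receives one
   more end of track ([None]) while m < n, and the sample otherwise. *)
let cached_get ~n ~m ~(sample : float option) =
  if m < n then None else sample
```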

In the current liquidsoap, this M is passed as part of the frame, but there is no way to track which source has produced which end of track. Hence it is possible that M>N, for example in the case of a switch which has seen many empty tracks before switching to that source. Even when M<N, it is unclear how many end of tracks we should add. I'm not sure whether it is crucial to keep precise track of who produces which end of track.

What matters is (1) if you listen continuously to a stream, you get exactly all its content and (2) we don't have the first problem with sharing. This is because the various operators of a source can only get data at one instant. If I pump it, the source will generate (or find in its cache) an end of track or a sample. It's as simple as this; there's no bad side-effect here. We don't even need to worry about sharing detection, because we can perform caching all the time without changing the stream content -- efficiency isn't an issue for now. (At some point I was worried that sub-instantaneous data in the cache, i.e. end of tracks, could lead to the same problem as before, but that doesn't make sense, because this data can't be referred to without first generating it: you can't pump a source "after" it has generated end of tracks; you can only pump it, ignore end of tracks, and pump it again.)

For our second problem, we would simply cache the result of #is_ready (or rather, rely on the same cache). As with #get, we need to know the number of end of tracks already obtained by the observer to tell it whether the source is ready at that precise point.

Some minor remarks:
  • We can't distinguish a source that is in the middle or just before the beginning of a track. If you don't listen to a source for a while, and the source is ready when you come back to it, there's no way to know if it is about to start something new or if you missed the beginning of the current track. It's never been a problem, I'll assume we can keep living with it.
  • The implementation of end-of-track notification is still a "partial" #get which may be too ad-hoc. The alternative is to attach an end-of-track tag to the last sample. Looks like a meaningless choice to me.
  • Note that metadata and end-of-tracks are treated very differently: we never issue a partial #get because of metadata. It corresponds to a view where end of tracks are "after" the last sample, while metadata is "before" or maybe "inside" samples. We should try to generalize this into a nice notion of event, some being interruptions, some not.
  • With Samuel, we have explored other design choices. What I present here isn't his favorite, but it's the most convincing to me. It also has the advantage of being fully defined (at least in my mind).

This new model is not revolutionary. But we gained more than it seems. For example, we can now write an operator that takes two sources and sums them, producing data only when both of them are ready, and not pumping either of them otherwise. This was impossible before: in the old frame-based design, we can only get data until either the end of a frame or the end of a track, so if one source ends a track, the other will still fill a whole frame.
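In the sample-level model the sum operator is a few lines. Here sources are rendered as structural object types and samples as floats, both my own notation for illustration:

```ocaml
type sample = float
type src = < is_ready : bool ; get : sample option >

(* Ready only when both children are; pumps neither otherwise. *)
let sum_is_ready (a : src) (b : src) = a#is_ready && b#is_ready

let sum_get (a : src) (b : src) : sample option =
  if sum_is_ready a b then
    match a#get, b#get with
    | Some x, Some y -> Some (x +. y)
    | _ -> None  (* one child ended its track: end ours too *)
  else None

(* Helper: build a source from a fixed script of answers. *)
let of_list l =
  let q = ref l in
  object
    method is_ready = !q <> []
    method get = match !q with [] -> None | x :: tl -> q := tl ; x
  end
```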

Lifting to frames


The real challenge now is to derive an efficient frame-based model that faithfully reflects the simple sample-based model.

I'll start with the biggest problem: sharing. We can't approximate it anymore, or we'll run into the same problem again. Hence I'm going to propose a two-phase pumping: in a first, abstract phase we do a dry run of the pumping methods to learn who pumps whom at what instant. Then we have a precise picture of the sharing and we can do a real run, computing the actual stream content.

This method forbids some source behaviors: for example if a source acts differently based on the actual values of its input samples, it cannot meaningfully take part in the first phase. That would be a problem with a blank detection source. I believe we can relax this enough to get what liquidsoap currently does. The key is to declare some values as "sparse" or "slow", allowing the change of the value (and hence the reaction of the source) to only take place at the end of the frame. For example blank detection would set a flag that only takes effect for the next frame.

For most sources I want to automatically generate the #simulated_get method from a simple #get. This is why compilation is needed now.

Now, I believe I can run that simulation thing. The nice thing is that once I bite the bullet and decide we need this, there's no question about being more modest in other places. In particular, the new #is_ready will be asked not only if a source is ready at a particular point but for how long. This enables a nice frame-based implementation of the precise sum operator described above. Of course, it goes with a #get that takes not only a start but also a stop position.

Plans


There are several things we can do now:
  1. Compile frame-level sources from sample-level sources written in OCaml. This is only about analysing the source, and it need not be user-friendly. There are a couple of tricky analyses to run. For example, a sine source will have a counter decremented at each sample and reset after an end of track, and we'll have to turn this into the natural code for the frame-level sine.
  2. Extend liquidsoap scripts with a notation for sources and compile it down to OCaml objects. We'll have to ensure that the user never sees a type error coming from OCaml, but only errors from us, which should be more readable. The main extra challenge is to run liquidsoap code fast. Some of it can be compiled directly to OCaml but function calls won't match exactly. We'll have to inline or leave some interpreter calls.
  3. Do cross-source optimizations on top of the previous work. If the user writes a sum of sine sources, inline it into a source that computes directly the sums.
  4. Compile to C or even lower level code.

I gave a rather precise proposal of what Step 1 could look like. Still, we can discuss some points and try to generalize a few things. My proposal doesn't look as clean as I would like in some places, but it has one great advantage: I'm sure we can program liquidsoap sources with that model, and even do some things that are currently impossible.

Step 2 presents even more design choices: should users write sources in an OO style like we do? should they work with an explicit notion of time? should time be discrete or continuous?


Design II: Liquidsoap today


12/10/2010 03:11:00 PM

In this post, I will give a high-level picture of the current internals of liquidsoap and show a few problems with them. The interested reader can find further details on the design of liquidsoap in our papers. In the next post, I'll finally describe a better future for liquidsoap.

The source protocol


In order to get a minimum understanding of liquidsoap's streaming model, let us consider the following example:
output(fallback([queue,playlist]))

What we want here is to produce a stream either from the playlist or the queue of requests, and output it somehow. The fallback between our two sources only switches from one to the other for new tracks. For example, if the queue becomes available while we are in the middle of a playlist track, we'll keep playing the playlist track, and only switch to the queue for the next track. In order to obtain the expected behavior this means that (when possible) unused sources are (almost) frozen in time: the queue doesn't start playing its request until the fallback asks it to do so. The stream of a source isn't defined in itself, but computed in interaction with the source's environment. Simply put, it's on-demand streaming.

Now, what happens concretely? Streaming is clocked, and at each cycle of the clock we produce a frame, which is simply a buffer of samples (audio, video or MIDI) of fixed size/duration. The filling of the frame starts from the outputs: our output passes the frame to the fallback, which passes it to the source that is currently playing a track; the frame is filled, passed back to fallback, which passes it back to output, which outputs the frame to the world.

Sometimes, it'll be a little different: When the child source of fallback ends its track, it won't completely fill the frame, which gives the opportunity for the fallback to select another source. The newly selected source isn't used right away. Instead, the partial frame is returned to the above operator, in our case the output, which passes it back because all it wants is a complete frame -- the story would be different with nested fallbacks. When the partial frame comes back, its filling is completed by the newly selected source.

We perform that kind of computation by having sources communicate with each other using just a few methods: we mainly have #get for filling a frame and #is_ready for knowing if a source has a ready track (ongoing or not).

All this works smoothly when sources are organized along a tree, i.e., when there is no sharing. But we really want sharing, as seen in the following example:
source = fallback([queue,playlist])
output_1(source)
output_2(source)
This means that at each cycle of the clock, the source will be asked twice to fill a frame with a segment of its stream. Of course, the source should give the same data twice: we don't want to split our stream among the two outputs, one frame each. To deal with this, we have some caching mechanisms in #get. In more detail, we create new sources by inheriting from a base class which defines #get as a public method wrapping the virtual private method #get_frame, which can be written without thinking about sharing at all.
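A caricature of that wrapping, with frames reduced to strings and a clock tick passed explicitly (both are my simplifications for illustration, not the real interface):

```ocaml
type frame = string

(* Base class: the public [get] caches what the virtual private
   [get_frame] produced for a given clock tick, so several outputs
   asking during the same cycle see the same data. *)
class virtual base = object (self)
  val mutable cache : (int * frame) option = None
  method private virtual get_frame : frame
  method get (tick : int) : frame =
    match cache with
    | Some (t, f) when t = tick -> f   (* already produced: reuse *)
    | _ ->
        let f = self#get_frame in
        cache <- Some (tick, f) ;
        f
end

(* A source written "without thinking about sharing at all". *)
class counter = object
  inherit base
  val mutable n = 0
  method private get_frame = n <- n + 1 ; string_of_int n
end
```

Two observers asking during the same tick get the same frame; the private #get_frame runs only once per cycle.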

Problem #1


Caching everything is really costly, so we have to detect when sharing is needed. The problem is that sometimes, it's not easy to tell statically:
output(
  switch([
    ({am},fallback([queue,am_playlist])),
    ({pm},fallback([queue,pm_playlist]))]))
Here it looks like the queue is shared, but in fact the first occurrence can only be used in the morning (am), the second only in the afternoon (pm). While we could try to detect some of those cases, it is hopeless to catch them all. So we're bound to over-approximate sharing. This has been done, and isn't too complicated, although it does introduce some activation/deactivation methods that must be used carefully, especially when using transitions. This is okay regarding performance, but it has one nasty side effect: the approximation can have an impact on the stream!

The scenario is as follows: Take a source S which looks like it could be used by operators A and B. Now A asks S to start filling a frame from the middle (it happens, cf. the fallback example above) but the source sees that it could also be used by B, perhaps to fill a frame from the beginning. So S fills in its cache from the beginning, and copies the part of it that starts from the middle in A's frame. But if B never uses S, we'll have modified our behavior for nothing.

Problem #2


An even worse problem is that we implemented a caching mechanism for #get but not for #is_ready. Now, it is possible that a source fills its cache, has no new track to produce, and hence declares itself not ready -- which might not be correct from the viewpoint of an operator that wants to access the data from the beginning of the cached frame. As a result of this weakness, I recently had to allow what should be illegal behavior in some operators that rely quite precisely on the readiness of their sources. Of course, this is a slippery slope.


Next time, I'll describe a simple way to construct a better source protocol from first principles, avoiding those problems, and even gaining more expressiveness along the way. On the other hand, it'll have to be compiled. Stay tuned.


Design


12/09/2010 08:42:00 AM

I recently found a good series of articles on LtU, on the Ghosts of UNIX past. The author reflects on some successes and failures of Unix, trying to find patterns. I enjoyed reading the articles, they teach you a few things, and are generally good food for thought.

The first article describes the "full exploitation" pattern, successfully achieved in UNIX with files. The next articles describe various failures of UNIX design, classifying them as "conflated", "unfixable" and "high-maintenance" designs. While I don't have the culture to do the proposed exercises, I tried to relate this to programming language design -- which surprisingly hasn't been done on LtU.
  • Full exploitation is when you manage to re-use a concept a lot. This is good because it avoids inventing new things: it's hard enough to get a few concepts right. In programming languages, a good first-class notion of functions can be fully exploited to model control, event handlers, lazy computation, etc.
  • Conflated designs mix up several concepts. This is bad only if the two concepts cannot be separated. Otherwise, it's just syntactic sugar. Object oriented programming comes to mind: Methods are a conflated design, mixing functions and record members. Also, inheritance mixes record extension and dynamic binding -- which makes it hard for people to think about the latter aspect, a very important source of bugs.
  • Unfixable designs are unfixable designs. This is a bad definition, but sometimes it is hard to say why something is unfixable. Often, this has to do with the wide adoption of the wrong design. Note that conflated designs may often be fixable, essentially by creating ways to access conflated notions in isolation. I'm not sure what's a good programming language example here. Perhaps dynamic binding, probably stuck forever in emacs lisp. (By the way, a fun fact according to Olivier Danvy: dynamic binding was born from exaggerated exploitation of a single stack in the implementation of Lisp -- in those old days, people were psyched about stacks and got a bit carried away.)
  • High-maintenance designs have nothing wrong in themselves but interfere with other things in such a way that requires a lot of care and work in those areas. With programming, various difficult things come to mind: state, concurrency, dynamic binding... Those aspects make it very hard to reason locally about a piece of code. In themselves, state and concurrency are cool, but their interaction leads to many problems.

More than programming languages, liquidsoap was in the back of my mind. But there do not seem to be interesting examples of those patterns in liquidsoap. In fact, it is not an exceptionally good design. It's just a first shot at making an end-user streaming application that is not just one object/tool but rather a system/language for building tools. Liquidsoap introduces an abstract notion of source, which is good, but does not fully exploit it. For instance, blank detection operators are black boxes that encapsulate the computation of the average audio volume, but a more flexible and efficient design would be achieved by isolating the computation of the volume stream as a source, and then writing operators that act depending on the characteristics of the volume source. Liquidsoap does not have unfixable designs, because we give ourselves the liberty to change things when needed. It does have high-maintenance designs: the source protocol is complex, and we've had many bugs because it wasn't properly followed.

Beyond those not-so-good design examples, liquidsoap is broken in several ways. But I'm comfortable admitting it, because it works well enough for many people and we know how to fix it. The code name for that exciting new line of work is, quite straightforwardly, Liquidsoap 2.0. Until now, we've been writing sources by hand in OCaml. Liquidsoap users write scripts in our own scripting language, and can only compose sources, never write a new one. In other words, Liquidsoap is an interpreter for a dedicated script language that manipulates an abstract/opaque notion of source. But liquidsoap 2.0 will be a compiler that lets you write sources, and not only compose them, and will produce optimized code. It will be as simple as it is today for beginners, but will enable much more for power users. Importantly, it will allow us developers to not write as much bug-prone code. There is a lot to say about what could be in liquidsoap 2.0, but I'll start with a simple down-to-earth story... next post!


Random.self_init


10/20/2010 03:15:00 PM

Now that I work somewhat in the field of security, I find myself thinking about random number generators, among other things. After some reading on the subject, I wondered how Caml's Random module works.

There is surely a lot to say, but my attention was caught by a simple thing: self_init seeds the generator in a very simple way, from the time and the process id -- not exactly the most random of data. So you really have to be wary of this if you want to use the Random module for cryptographic purposes. In practice, the little piece of code below guesses the generated seed fairly quickly, assuming the process id is known (it isn't very hard to guess otherwise) and the time of initialization is approximately known:

(** Random.self_init relies on caml_sys_random_seed
  * which returns an int.
  * That integer is sec ^ usec ^ ppid<<16 ^ pid.
  * We try to guess the seed to reproduce the output of
  * the random generator.
  * The PID and PPID are supposed to be known,
  * the time in seconds and micro-seconds is supposed
  * to be known resp. with precision 1s and 10ms.
  * With that knowledge it's really fast;
  * with less it'd take longer. *)

let gen sec usec pid ppid =
  sec lxor usec lxor pid lxor (ppid lsl 16)

let random _ = Random.int 100

(** Given a leaked sequence of numbers consecutively
  * generated by [random] _immediately after_
  * [Random.self_init], call [k] with seeds that allow
  * to reproduce that sequence. *)
let crack leak k =
  let t = Unix.gettimeofday () in
  let sec = int_of_float t in
  let usec = int_of_float (fst (modf t) *. 1_000_000.) in
  let pid = Unix.getpid () in
  let ppid = Unix.getppid () in
    (* Since sec and usec are xored, trying sec and
     * sec-1 would be useless: all we'd be doing is
     * flipping a bit that is flipped anyway when
     * enumerating possible values for usec. *)
    for usec = usec - 10000 to usec do
      let seed = gen sec usec pid ppid in
        Random.init seed ;
        if Array.init (Array.length leak) random = leak then
          k seed
    done

let () =
  let leak,valid =
    Random.self_init () ;
    let leak = Array.init 1000 random in
    let valid = Array.init 5000 random in
      leak, valid
  in
  let k seed =
    let guess =
      Random.init seed ;
      ignore (Array.init (Array.length leak) random) ;
      Array.init (Array.length valid) random
    in
      Printf.printf
        "Guessed seed: %d. Validated: %b.\n"
        seed (guess=valid)
  in
    crack leak k ;
    Printf.printf "Search space exhausted.\n"

Running it, you get something like this within a few seconds:
$ ocamlopt unix.cmxa crack.ml -o crack && ./crack
Guessed seed: 262818353. Validated: true.
Search space exhausted.

There are various possible variations. What interests me, and is still rather obscure to me, is the (repeated) generation of seeds from a private part and a public part, used to avoid generating identical sequences (which would open the door to attacks in some cases). To what extent does this weaken or compromise the secret, depending on the generator used, and so on?


Who is watching me?


8/30/2010 06:38:00 AM

Two slightly unsettling little stories...

(1) I went on a road trip in the American West, and since we passed through Las Vegas, I had searched for hotels there on my laptop. Back from vacation, I found myself several times on random web pages showing ads for hotels in Las Vegas.

(2) Following a mishap, I bought myself a new watch. Since Estelle wanted to see how pretty it was, I did a quick image search on the web. And yesterday I got an email from Amazon announcing a promotion on that particular brand of watch.

I'm not paranoid, but surely none of this is due to chance? The only explanation would be that google spilled the beans, but that also assumes those sites identify me, which is going a bit far...

Edit: Actually, I must have clicked on an Amazon link for the watch; that's the most likely explanation. You're always logged in to Amazon -- it's treacherous.
