Wednesday, January 8, 2014

Putting "hard" prior knowledge into learning

Regression is probably *the* machine learning task: estimate an output for given inputs. Decades of (very successful) research have shown that the problem is, and will always be, how to generalize such learned input/output relations to previously unseen inputs. This problem becomes severe when very little training data is available, or when the data itself is very noisy. When we wanted to learn the low-level control of the Bionic Handling Assistant (BHA), we had both problems: very little training data (because it is relatively expensive to record), and very noisy data at that. Challenging.
Even based on such bad training data, the to-be-learned controller had to be reliable. It should not go mad on unknown inputs, but generalize smoothly and consistently. I wanted guarantees, considering that the controller we needed was not meant to be just a stand-alone showcase, but to operate at the bottom of everything we wanted to do with the BHA.
We came up with the idea of using prior knowledge as a constraint on the learning. And in this case we did have some prior knowledge. Such as: physics dictates that certain variables have a strictly monotonic relation, and that variables have bounded ranges. These things we wanted to put into the learning, with a guarantee that the learner generalizes accordingly. Klaus (who just successfully defended his PhD, congratulations!) came up with a surprisingly flexible and general method to integrate such constraints into regression. The results just got published:

Neumann, K., M. Rolf, and J. J. Steil, "Reliable Integration of Continuous Constraints into Extreme Learning Machines", International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 21(supp02), 12/2013. (pdf)
Abstract — The application of machine learning methods in the engineering of intelligent technical systems often requires the integration of continuous constraints like positivity, monotonicity, or bounded curvature in the learned function to guarantee a reliable performance. We show that the extreme learning machine is particularly well suited for this task. Constraints involving arbitrary derivatives of the learned function are effectively implemented through quadratic optimization because the learned function is linear in its parameters, and derivatives can be derived analytically. We further provide a constructive approach to verify that discretely sampled constraints are generalized to continuous regions and show how local violations of the constraint can be rectified by iterative relearning. We demonstrate the approach on a practical and challenging control problem from robotics, illustrating also how the proposed method enables learning from few data samples if additional prior knowledge about the problem is available.
Let's dig into it...

The BHA is tough. Tough to control kinematically, tough to simulate, tough to deal with software-wise. This work focused on yet another challenge: controlling the pneumatic actuators' lengths. The lengths can be sensed by means of cable potentiometers. For their control, the pressure inside each of the actuators has to be adjusted automatically until the desired length is reached. Sounds not too difficult? In fact, it can be done with simple PID control... but only very badly, due to the long delays in the actuation and the strong noise in the sensors – this kind of control is limited to very slow movements.
[Figure: Training data (3D actuator lengths of the first segment)]

We wanted to improve over that situation by adding a feedforward controller that can react quickly to new commands without waiting for noisy and delayed feedback. Such a controller could represent the relation between pressure and length in an equilibrium state and recall the necessary pressure once a length is desired.

The analytic shape of this equilibrium relation is totally unknown, so let's record some training data and learn it! Turns out:
  • Training data takes a long time to record. A single data point requires applying some pressure to the robot and waiting until pressure and length have reached an equilibrium. "Wait" means 20 seconds to be sure... yes, it can actually take that long.
  • Training data is still messy. The BHA's soft material is not perfectly elastic, but viscoelastic: it has a certain memory of its recent shape. As a result, the lengths for the same pressure differ every time: partly depending on which direction you're just coming from (which we attempted to mitigate by recording several trials with different directions), partly due to non-stationary behavior on longer time scales.
After all this, the data is scarce and noisy.

But data is not all we have. We have some very basic prior knowledge about the relation of pressure and length. The most important piece is:
When you increase the pressure of one actuator, its length increases, too (or remains the same). This is a dimension-wise monotonicity relation: if you increase the i'th input, then the i'th output must not decrease.
No matter how little and bad the training data is – this relation always holds. Hence, the learner should satisfy it even where the training data does not comply with it, and in particular between data samples (where no information other than prior knowledge is available).

Now, how do we tell a learner, e.g. a regressor, about this knowledge? Let's consider some general regressor f with inputs x, outputs y, and to-be-learned parameters θ:
f(x,θ) = y
Our prior knowledge now defines a set V of legal parameters: if θ is in V, the function satisfies the monotonicity condition for any input x; if not, there are inputs for which the input/output relation is not monotonic. Now we can search, within that set of legal parameters, for the particular value of θ that describes the data best. And we would be happy with that.

The problem is that we do not know this set of legal parameters – it is extremely difficult to write it down in a general formalism. But! We can use a simple trick to make a step towards tractable optimization within V: make f(x,θ) linear with respect to the parameters. This can easily be done by choosing some non-linear features g(x) of the input (in our case artificial-neuron activations, as in the extreme learning machine), and then performing linear regression with f(x,θ) = g(x)·θ.
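As a minimal sketch of such a linear-in-parameters model (hypothetical sizes and toy data, plain NumPy; the actual BHA setup differs):

```python
import numpy as np

rng = np.random.default_rng(0)

# ELM-style feature map: a random hidden layer that stays fixed after initialization.
n_in, n_hidden = 3, 50
W = rng.normal(size=(n_hidden, n_in))  # random input weights (never trained)
b = rng.normal(size=n_hidden)          # random biases

def g(x):
    """Non-linear features g(x): artificial-neuron activations."""
    return np.tanh(W @ x + b)

def f(x, theta):
    """Model output f(x, theta) = g(x) . theta -- linear in theta."""
    return g(x) @ theta

# Ordinary (still unconstrained) least squares on toy data:
X = rng.uniform(-1.0, 1.0, size=(100, n_in))
y = X.sum(axis=1) + 0.1 * rng.normal(size=100)   # toy targets
G = np.array([g(x) for x in X])                  # feature matrix, one row per sample
theta, *_ = np.linalg.lstsq(G, y, rcond=None)
```

Only the output weights θ are learned; the random features make f flexible while keeping it linear in its parameters.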

So what? The benefit of this formulation is that we can now easily formulate constraints for a single input x of the learner. Since f is linear in θ, so are all of its derivatives with respect to x. Constraints like monotonicity (and any other inequality on the derivatives of f) at a single point are therefore just a linear inequality in the parameters θ, and each one defines a half-space of legal parameters.
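For random tanh features (a hypothetical stand-in for the model's actual features), the derivative with respect to one input dimension is available analytically, so a monotonicity constraint at one point is a single inequality row:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 50
W = rng.normal(size=(n_hidden, n_in))  # fixed random input weights
b = rng.normal(size=n_hidden)

def dg_dxi(x, i):
    """Analytic gradient of the tanh features w.r.t. input dimension i."""
    h = np.tanh(W @ x + b)
    return W[:, i] * (1.0 - h ** 2)    # chain rule through tanh

# df/dx_i at x equals dg_dxi(x, i) . theta, so monotonicity of f in
# dimension i at a sample point x is the half-space
#     dg_dxi(x, i) @ theta >= 0
# -- one linear inequality in theta.
x0 = np.zeros(n_in)
a = dg_dxi(x0, 0)                      # one constraint row for dimension 0 at x0
```

Stacking such rows for many sample points yields a constraint matrix A with A·θ ≥ 0.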
[Figure: Inequalities at single inputs are added until the constraint can be proven for the whole volume.]

Still, we want the constraints to hold for all inputs. In a continuous input space this gives infinitely many constraints – one per point – which we cannot compute with. This is where Klaus' work comes in. It turns out that you do not actually need infinitely many constraints. In order to test whether a specific parameter value satisfies the constraint, he came up with a method to choose a finite number of constraints (at certain inputs) such that it can be proven that the constraint is either definitely violated or definitely satisfied. The basic idea of how to choose these inputs is to start with a few input locations distributed across the input space and add more constraints where necessary. "Necessary" here means that the constraint cannot yet be proven between neighbouring points. For small enough distances between points, such a proof can be given by means of well-known bounds on the remainder of the function's Taylor approximation.
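The verification idea can be sketched in one dimension. This is a simplified stand-in for the paper's Taylor-remainder argument, using a plain Lipschitz bound and a hypothetical helper name: if the constraint function is comfortably positive at the samples relative to the grid spacing, it must be positive in between as well.

```python
import numpy as np

def verify_nonneg(cfun, lo, hi, lipschitz, max_refine=12):
    """Try to prove cfun(x) >= 0 for ALL x in [lo, hi] from finitely many samples.

    If cfun(x) >= lipschitz * h at every sample (h = half the grid spacing),
    the Lipschitz bound guarantees cfun >= 0 everywhere in between.
    Returns (proved, violated_points); violated points can be kept as
    single-input constraints for relearning."""
    n = 5
    for _ in range(max_refine):
        pts = np.linspace(lo, hi, n)
        vals = np.array([cfun(x) for x in pts])
        if np.any(vals < 0):
            return False, pts[vals < 0]        # definite violation found
        h = (pts[1] - pts[0]) / 2.0
        if np.all(vals >= lipschitz * h):
            return True, np.array([])          # proven on the whole interval
        n = 2 * n - 1                          # refine where not yet provable
    return False, np.array([])                 # inconclusive within budget
```

The paper's construction works on continuous regions of a multi-dimensional input space; this sketch only conveys the sample-then-refine principle.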

If the constraint is satisfied, we're happy. If it is not, we can keep the single-input constraints that have been violated and use them in the further numeric optimization. Methods for parameter optimization (like regression) with a finite number of linear constraints are fairly standard, so we can use them off the shelf... and we're done.
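Once a finite set of violated single-input constraints is collected into a matrix A with A·θ ≥ 0, the constrained fit is a standard quadratic program. A minimal sketch with SciPy's off-the-shelf SLSQP solver (toy random data standing in for the feature matrix and constraint rows; not the paper's actual solver):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Toy stand-ins: feature matrix G, targets y, and a few linear inequality
# rows A (e.g. feature-derivative rows from violated constraint points).
G = rng.normal(size=(30, 5))
y = rng.normal(size=30)
A = rng.normal(size=(4, 5))

def sse(theta):                    # sum of squared errors -- quadratic in theta
    r = G @ theta - y
    return r @ r

res = minimize(sse, x0=np.zeros(5), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda t: A @ t}])
theta_c = res.x                    # constrained least-squares solution
```

The objective is convex and θ = 0 is feasible here, so the solver reliably finds the best parameters that respect every collected inequality.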

On the BHA we recorded three data sets – one for each segment of the robot. Then, for every segment, we took the data and formulated constraints such that the learned model
  1. satisfies monotonicity for each actuator (linear constraint on the first derivative of f), and
  2. does not exceed the admissible range of the output, which is simply given by the minimum and maximum pressure possible in the actuation (linear constraint on the "zeroth" derivative of f).
All that, put into the learning algorithm, gives a very decent model in spite of the small amount and bad quality of training data. And compliance with the constraints comes with a guarantee right away, which gives a safe feeling when applying such a model in a closed feedback loop on the robot. Adding such a model that estimates the equilibrium points significantly improves the control performance: the model, in addition to the standard feedback controller, allows for very fast movements without the control getting unstable, which is fundamental for almost any application of the robot.

The interesting thing about this constraint strategy is its generality. It can express any constraint that can be formulated as an inequality on the regressor's derivatives of any degree. Monotonicity and output ranges are just the tip of the iceberg. Andre and Klaus recently applied the same method to a motion-generation problem, where it can be used to make stability guarantees for a to-be-learned dynamical system (best paper award at ESANN).

Probably, we're going to see even more applications in the future...
