February 16, 2010 by Richard Roberts

This article is an illustrated summary of a recent paper we presented at CVPR 2009. We leverage the linear properties of optical flow fields to develop a method that automatically learns the relationship between camera motion and optical flow from data. The method can handle arbitrary imaging systems, including very severe distortion, curved mirrors, and multiple cameras. Using this method, a robot can estimate its motion in real time from video while detecting "motion anomalies" such as nearby or moving objects.

This article is a summary of the paper "Learning General Optical Flow Subspaces for Egomotion Estimation and Detection of Motion Anomalies", by Richard Roberts, Christian Potthast, and Frank Dellaert (all from the Georgia Institute of Technology). It was an oral presentation at CVPR 2009.

For those who prefer quick video overviews, this one illustrates dense optical flow estimation from sparse flow and also illustrates ego-motion estimation on a mobile robot. This video shows sparse flow vectors labelled as inliers (green) and outliers (red). In the first half of the video, the right side shows estimated dense flow, while in the second half of the video the right side shows the recovered platform trajectory (blue path) and approximate ground truth (green path). Note that the method works even with very unusual optics!

Imagine yourself on a sunny day walking down a tree-lined sidewalk. As you pass by the trees, you perceive their relative motion - growing closer to you, offset from but parallel to your line of travel, then passing next to you, and finally behind you and out of your field of view. As you turn a corner, you can perceive that you are turning relative to the rest of the world because the world "rotates" in your eyes' image.

But let's look a little closer at how the image in your eye changes as you move, or as objects move relative to you. Without getting into geometric calculations, we can state what happens intuitively: as you move forward, things to your right move further and further right in your field of view before disappearing, and vice versa on the left. Things above you move closer to the top of your field of view, while things below you move towards the bottom. If you turn in place to the left, everything in the image shifts to the right. This phenomenon is called optical flow.

We can describe optical flow using a vector field. The vector field pictured on the right corresponds to forward motion of the fly. Our work revolves around a few properties of optical flow that are well-known to the computer vision community:

- Optical flow for a translating (not rotating) camera depends on the distance to the "piece" of the world generating that flow. Nearby objects move faster in the image than faraway objects. This is the "parallax effect" - you see it driving down the highway at night when the trees whiz past while the moon follows your car!
- Optical flow also depends on the optics of the camera. Optical flow is the motion of the image, so the flow depends on how the camera distorts and scales the image.
- For any given set of distances from you to everything you can see (known as the depth field), and also given the optics of the camera, the optical flow depends only on your motion! Not only that, the flow is quite predictable. If you turn left, everything in the image moves right. When you move forward, everything moves to the outside. In fact, if you move forward while turning left, the flow you observe will be the *vector sum* of the flow fields for each of those motions.
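This additivity is easy to see numerically. The snippet below is purely illustrative (toy flow fields on a tiny pixel grid, not real data): for fixed optics and depth, the flow for a combined motion is just the elementwise sum of the flows for each motion component.

```python
import numpy as np

# Toy flow fields on a small grid, flattened into (u, v, u, v, ...) vectors.
# The specific values are made up for illustration.
flow_forward = np.array([1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0])    # expansion-like
flow_turn_left = np.array([0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0])    # uniform shift right

# Moving forward while turning left produces the vector sum of the two fields.
flow_combined = flow_forward + flow_turn_left
```

This linearity in the motion parameters is exactly what makes a linear subspace model appropriate.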

Note: We did not create the "fly" image but use it for illustration purposes, original appears here.

So, given camera optics and scene depth, there is a one-to-one correspondence between camera motion and optical flow. That means we can use optical flow to determine the motion of a robot. This is quite useful! A robot navigating only by "dead-reckoning" from how many times its wheels have turned encounters the same problem as you or I trying to walk with our eyes closed - we go in circles!

Obtaining the camera optics and scene depth is not so easy, though. Knowing the optics requires careful calibration and makes assumptions about the mathematical models used to describe them. Determining the absolute depth without resorting to a calibrated stereo pair is an ill-posed problem due to the ambiguity between nearby slow objects and faraway fast objects (which is closer to the Earth: the sun or the moon? How do you know?). Performing these calibrations is time-consuming, and non-standard imaging systems (like the one in the video above involving a mirror), or cheap lenses with high distortion, would require entirely new and unknown distortion models!

This is where our method comes in - by simply recording and analyzing a bunch of video and optical flow taken by a moving robot, it turns out that we can learn the relationship between camera motion and optical flow from data under the assumption of *fixed* optics and depth, without explicitly determining the optics and depth. This is because under these assumptions, there is a *linear* relationship between camera motion and optical flow, which results from the third property above. Thus, we use a linear subspace discovery method related to principal components analysis (PCA) to find the relationship automatically from data.
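To make the subspace idea concrete, here is a minimal sketch using plain PCA via SVD (not the robust algorithm from the paper). We simulate flow that truly lives on a low-dimensional subspace, as the linearity property predicts; all variable names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_flow_dims, n_motion_dof = 200, 50, 3

# Simulate flow fields that live on a 3-D subspace: flow = motion @ W
# for a fixed (unknown) basis W determined by optics and depth.
W_true = rng.standard_normal((n_motion_dof, n_flow_dims))
motions = rng.standard_normal((n_frames, n_motion_dof))
F = motions @ W_true + 0.01 * rng.standard_normal((n_frames, n_flow_dims))

# PCA: the top right singular vectors of the centered data span the subspace.
F_centered = F - F.mean(axis=0)
_, s, Vt = np.linalg.svd(F_centered, full_matrices=False)
basis = Vt[:n_motion_dof]          # learned subspace basis
coords = F_centered @ basis.T      # per-frame subspace coordinates

# Fraction of variance captured by the learned basis.
explained = (s[:n_motion_dof] ** 2).sum() / (s ** 2).sum()
```

On real flow data the picture is messier, which is what motivates the robust extension described next.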

Unfortunately, traditional PCA will not find the correct motion-flow relationship on *real* optical flow data, even when this flow is computed by state-of-the-art algorithms. The reason is twofold. First, some regions of the image simply contain no texture that can be used to perceive optical flow, including saturated regions and smooth walls and floors. Second, we have made the assumption of constant depth, which only holds *approximately* in practice. In the image shown above, the ground plane and the building wall remain at a constant depth as the robot travels down the sidewalk, but the passing trees and other objects are *closer* than the usual depth for any pixel they occupy. For these reasons, we developed a *robust* extension to Probabilistic Principal Components Analysis (Tipping and Bishop, 1999), which both detects these nearby (or moving!) objects, and can ignore textureless regions of the image where flow cannot be computed.
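The following is a heavily simplified stand-in for the robust model, shown only to convey the intuition: flow measurements whose residual against the subspace reconstruction is large are flagged as likely outliers (nearby or moving objects). The function name, the per-frame residual statistic, and the threshold are all illustrative choices, not the paper's actual probabilistic formulation.

```python
import numpy as np

def flag_outliers(flow, basis, mean, threshold=2.0):
    """Flag flow measurements poorly explained by the learned subspace.

    flow:  one frame's flow field, flattened to a vector
    basis: learned subspace basis (rows orthonormal)
    mean:  mean flow field from training
    Returns a boolean mask; True marks a likely motion anomaly.
    """
    centered = flow - mean
    coords = basis @ centered            # project onto the learned subspace
    reconstruction = basis.T @ coords    # flow predicted by the model
    residual = np.abs(centered - reconstruction)
    sigma = residual.std() + 1e-9        # crude per-frame scale estimate
    return residual > threshold * sigma
```

In the paper this inlier/outlier decision is made probabilistically inside the model (the **z** variables below), rather than by a hard threshold.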

One other detail that I have left out until now is that PCA does not quite find the relationship between camera motion and optical flow. Instead it finds the linear subspace on which the optical flow lives. To find the true relationship between the camera motion and the discovered subspace, we "align" the subspace via a linear transformation, learned from a small amount of labeled training data collected by driving the robot around via remote control.
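A minimal sketch of this alignment step, under the assumption that it reduces to an ordinary least-squares fit (variable names and dimensions are illustrative): given subspace coordinates for a small set of frames with known motions, we fit a linear transformation mapping coordinates to motion.

```python
import numpy as np

rng = np.random.default_rng(1)
n_labeled, k = 30, 3

# Per-frame subspace coordinates for the labeled frames (simulated here),
# and the corresponding labeled motions from a remote-controlled drive.
coords = rng.standard_normal((n_labeled, k))
A_true = rng.standard_normal((k, k))                 # unknown alignment
motions = coords @ A_true + 0.01 * rng.standard_normal((n_labeled, k))

# Least-squares fit of the alignment transformation.
A_est, *_ = np.linalg.lstsq(coords, motions, rcond=None)

# Afterwards, new subspace coordinates map directly to camera motion.
predicted_motion = coords @ A_est
```

Only this small alignment step needs labeled data; the subspace itself is learned without supervision.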

The Bayes' net (above left) shows our generative model. The flow vectors **t** are a function of the subspace basis **W**, the camera motion at each frame **x**, and the probability that each flow vector in each frame is an inlier versus an outlier, **z**. The following video (from CVPR) explains the algorithm itself.

First, we can work with *completely arbitrary* imaging systems! You've already seen the example with the mirror, but the example (image) at the top of the page uses the images from two cameras, simply tiled side-by-side into a single image.

We can label *moving* or *nearby* objects, which allows the robot to notice objects it should avoid. These objects are *motion anomalies*, where the optical flow does not match the flow predicted through the model by the majority of the image. In the image on the right, the flow vectors on the moving car are red, indicating motion anomalies, because the car is moving. Also, in the image at the top of this page, the tree is covered in red flow vectors because it is nearby and thus moving faster in the image than usual.

Finally, we can recover the camera motion. The first image below graphs the forward speed and yawing rate of the robot. The second image below shows integrated trajectories overlaid on a map of the area the robot traversed. In the images, the green traces are the ground truth robot motion, and the blue traces are the motion recovered by our method. The motion estimates provided by our method are on par with, or better than, wheel odometry plus an inertial measurement unit (IMU).
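For readers curious how per-frame estimates become the trajectories shown above: integrating forward speed and yaw rate frame-by-frame is simple dead reckoning. The sketch below assumes a fixed timestep and planar motion; the function name and parameters are illustrative.

```python
import math

def integrate_trajectory(speeds, yaw_rates, dt=0.1):
    """Integrate per-frame (forward speed, yaw rate) into a 2-D path."""
    x, y, heading = 0.0, 0.0, 0.0
    path = [(x, y)]
    for v, w in zip(speeds, yaw_rates):
        heading += w * dt                 # turn first, then step forward
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        path.append((x, y))
    return path
```

Because integration accumulates error, small biases in the per-frame estimates show up as drift in the recovered path, which is why the blue and green traces diverge slowly rather than abruptly.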

The main benefit of this method is that the information about the optics and scene depth is wrapped into a model that is easily learned from video from the robot. No calibration grids or complicated lens distortion models are required! The primary limitation from a practical standpoint is the assumption of constant scene depth, which would break down if, for example, the robot turned to directly face a very close wall, departed from the ground plane, or moved from a narrow corridor into a large open area. We are working on extensions of this method to handle many of these cases by learning multiple models for multiple typical environments.

To find out more about this work, I recommend reading the paper. To learn more about me, you can check out my homepage.

■