My motivation behind writing about Visual Perception is primarily to understand its basis in humans and animals, and how Artificial Visual Perception (sight + interpretation) applies to machines, computers, and autonomous vehicles.
Artificial Visual Perception has enabled many use cases from robotics and autonomous vehicles to medicine and much more.
Recognition use cases include image and object recognition, vehicle model and license plate recognition, optical character and cursive handwriting recognition, fingerprint recognition, gesture recognition, etc.
Commerce and industry use cases include cart-free checkout in stores, image search, industrial inspection, photogrammetry, and more. Medical imaging is helping radiology, ophthalmology, cardiology, and other medical fields. Examples of other use cases are people tracking, object detection, computational imaging, robotic process automation, cognitive intelligence, object classification, visual question answering, etc.
Human Visual Perception is very well-developed
Even though machines are achieving super-human capabilities when recognizing cats in images, or playing Go, Chess, or Poker against humans, machine vision is either sub-human or par-human in a majority of use cases. Artificial visual perception is, for instance, sub-human when it comes to driving autonomously in a safe and ethical manner. It is par-human in the cases of image classification, and optical character and cursive handwriting recognition.
Perception is the ability to sense and interpret the environment
Humans perceive through the traditionally recognized five senses of sight, smell, hearing, taste, and touch. The brain processes and interprets what the eyes see, the ears hear, the skin feels, the tongue tastes, and the nose smells.
These five common senses are some of the exteroceptive or external senses, i.e., they help in perceiving the external environment. Other external senses in humans include the ability to sense temperature variations, a weakly developed ability to sense directions, and abilities to sense pain, balance, position, and movement.
Interoceptive or internal senses include an ability to sense hunger, thirst, vibration, carbon dioxide and oxygen levels, heart activity, blood sugar levels, gas distention, and a few more. Some animals can detect water pressure, magnetic and electrical fields.
Visual perception is the ability to see and interpret images and visual streams
Humans see by focusing and detecting images of visible light. Two types of photoreceptors (sensory neurons), rods and cones, recognize and react to light. Rods, the brightness receptors, are very sensitive to light and perceive movement, depth, and differences in brightness. Cones perceive fine details and colors and come in three types: short-, medium-, and long-wavelength cones are most sensitive to blue-violet, green, and greenish-yellow wavelengths respectively. Source 1, Source 2.
Visual sensory information from the rods and cones in the retina is carried over the optic nerves to the lateral geniculate body and then to the various interconnected visual cortices for visual interpretation, i.e., V1 (primary), V2 (secondary), V3, V4, V5 (middle temporal visual area) and V6 (dorsomedial area). The brain has two anatomical pathways. Source.
- The Ventral (underside) Pathway helps in consciously perceiving, recognizing, and identifying objects
- The Dorsal (upper side) Pathway helps in determining the size, position, and orientation of objects in space
Visual Perception in Humans and Animals
Humans have a well-developed sense of sight compared to most animals, but some animals have better vision than humans.
The human eye can detect or interpret images in as little as 13 milliseconds. Humans can see sharper and more detailed images than most animals and can see more colors than rabbits, cats, and dogs. Birds have sharper and more precise vision. They have an additional cone and can see color much better than humans, as their sensitivity is spread evenly across the ultraviolet and visible range.
The range of distance of a child is 14x that of an adult. A diving hawk’s vision is 50x that of an adult human or 50 dioptres. The distance humans can see deteriorates with age.
The eyes of humans and many animals are located at different lateral positions on the head, resulting in binocular vision, i.e., two slightly different images projected onto the retinas of the eyes give humans the perception of depth. A human being’s field of vision ranges from 120° to 190°. A rabbit can see almost 360°.
Artificial Visual Perception
Artificial Visual Perception is the ability of computers and machines to see and interpret images
Computer-Vision and Machine-Vision are about extracting information from images, video, and multi-dimensional data from scanners and sensors. Machine-Vision originated in industrial automation and is used in industrial inspection, analysis, industrial process automation, and robotic guidance. Robotic-Vision refers to visual sensors whose readings dictate the actions of robotic machines.
Autonomous vehicles and mobile robots are examples where computer vision is used to interpret the environment. Examples of Machine-Vision include counting objects on a conveyor belt, using X-ray-equipped robots to identify defects in materials, using infrared drones with photographic techniques to see through haze, and using UV light photography to detect power loss attributable to electrical discharge caused by ionization of air around overhead power lines.
The terms Computer-Vision and Machine-Vision are often used interchangeably. Both include:
- acquiring images using image sensors,
- processing the image digitally,
- extracting meaningful information,
- representing information as models, and
- converting the model representation into numerical or symbolic information interpretable by a computer
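The five steps above can be sketched on the conveyor-belt counting example. This is a minimal illustration, not a production pipeline: the hard-coded frame stands in for a real image sensor, and the function names, the threshold of 128, and the 4-connectivity rule are all assumptions made for the example.

```python
from collections import deque


def acquire_image():
    """Step 1 (assumed stand-in for an image sensor): a small grayscale frame."""
    return [
        [10, 10, 10, 10, 10, 10, 10, 10],
        [10, 200, 210, 10, 10, 10, 10, 10],
        [10, 205, 220, 10, 10, 190, 10, 10],
        [10, 10, 10, 10, 10, 195, 10, 10],
        [10, 10, 10, 10, 10, 10, 10, 10],
        [10, 10, 180, 185, 10, 10, 10, 10],
    ]


def threshold(image, level=128):
    """Step 2: digital processing, here a simple binary threshold."""
    return [[1 if pixel > level else 0 for pixel in row] for row in image]


def find_blobs(mask):
    """Step 3: extract connected bright regions (4-connectivity, breadth-first)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def describe(blobs):
    """Step 4: represent each blob as a small model (area and bounding box)."""
    models = []
    for pixels in blobs:
        ys = [p[0] for p in pixels]
        xs = [p[1] for p in pixels]
        models.append({"area": len(pixels),
                       "bbox": (min(ys), min(xs), max(ys), max(xs))})
    return models


def count_objects():
    """Step 5: turn the models into a number a computer can act on."""
    models = describe(find_blobs(threshold(acquire_image())))
    return len(models), models
```

Calling `count_objects()` returns the object count together with the per-object models, so a downstream system could act on either the number or the individual bounding boxes.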
Autonomous Vehicle Visual Perception
Artificial Visual Perception in autonomous vehicles is the ability to see and interpret the environment to safely and independently drive a self-driving vehicle.
Autonomous vehicles use data from multiple sensors to concurrently build a map, i.e., a 3D model representation of the environment, and localize the vehicle in the map, i.e., navigate the vehicle within the confines of the map while circumventing obstacles.
The Visual Perception system of the autonomous car enables map generation, obstacle detection, and localization
Sensors to see the environment
A number of specialized visible light cameras are used in vehicles. A camera can capture texture, color and contrast information effectively.
The cameras sacrifice resolution in favor of
- wide field of view via the use of wide lenses equivalent to 8mm-14mm full frame,
- faster response time,
- enhanced low-light sensitivity, achieved by using larger pixel sensors, and
- high computational efficiency, accomplished by using efficient on-vehicle and at times onboard image processing
Multiple camera setups are used in autonomous and semi-autonomous vehicles.
Camera sensors produce volumes of images and visual streams suitable for deep learning and are used to help classify cars, trucks, and motorcycles for keeping a safe distance in advanced adaptive cruise control. Cameras and image processing are used to automatically switch between high and low beams based on the presence of other vehicles. Cameras with onboard image processing are used for signage recognition and interpretation, and for classifying people and signage.
Advanced signage (e.g. a specific speed limit in effect during school hours), construction signage, and dynamic signs require fast image processing, character segmentation, and recognition, and can be interpreted using natural language understanding techniques. Cameras also detect lane markings and can be used to distinguish drivable from undrivable surfaces.
Cameras are also used inside the car for monitoring the driver, for gaze control, and for responding to commands based on gesture recognition. Stereo cameras, i.e., two cameras mounted a slight distance apart, provide binocular vision and help in the perception of depth.
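How two cameras yield depth can be illustrated with the standard pinhole stereo relation, depth = focal length × baseline / disparity. The sketch below assumes rectified cameras and a disparity already measured in pixels; the function name and the example numbers are illustrative, not taken from any specific vehicle.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Estimate the distance to a point seen by both cameras of a stereo pair.

    focal_length_px: focal length expressed in pixels (assumed known from calibration)
    baseline_m:      distance between the two camera centers, in meters
    disparity_px:    horizontal shift of the point between the two images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means the point is at infinity")
    return focal_length_px * baseline_m / disparity_px
```

With an assumed focal length of 700 pixels and a 0.5 m baseline, a point shifted by 35 pixels between the two images lies about 10 m away. Nearby objects produce large disparities and distant objects small ones, which is why depth precision falls off with distance.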
Cameras operate in visible and infrared ranges of the electromagnetic spectrum.
Radars (Radio Detection And Ranging sensors) are used in vehicles for detecting the range, angle, or speed of other vehicles and obstacles. Radars send and receive radio waves of specific frequencies and compute the change in frequency between the transmitted and the reflected signal.
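The frequency change a radar measures is the Doppler shift, and for a target moving directly toward or away from the sensor it relates to speed as shift = 2 × speed × carrier frequency / c. The sketch below assumes that simple two-way Doppler model; the function names are made up for the example, and 77 GHz is used because it is the adaptive cruise control frequency mentioned in this article.

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0


def doppler_shift_hz(radial_speed_mps, carrier_hz):
    """Two-way Doppler shift for a target closing at radial_speed_mps."""
    return 2.0 * radial_speed_mps * carrier_hz / SPEED_OF_LIGHT_MPS


def radial_speed_mps(shift_hz, carrier_hz):
    """Invert the relation: recover the radial speed from a measured shift."""
    return shift_hz * SPEED_OF_LIGHT_MPS / (2.0 * carrier_hz)
```

At 77 GHz, a vehicle closing at 10 m/s shifts the returned signal by roughly 5.1 kHz, which is small relative to the carrier but easy to measure electronically.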
Radars are used in assistance with self-parking, sensing crashes, detecting blindspots, measuring parking boundaries, emergency braking, adaptive cruise control, automatic stop & go, etc. Some implementations of adaptive cruise control use radars instead of cameras.
Compared to cameras, radars have a much narrower field of view, requiring the use of multiple short-, medium-, and long-range radar sensors. Short Range Radar sensors work in the 0.2-30m range, Medium Range Radar sensors in the 30-80m range, and Long Range Radar sensors in the 80-200m range. Radars in autonomous vehicles operate at frequencies in the neighborhood of 24-81 GHz. For instance, radar-based adaptive cruise control works at 77 GHz, and blindspot detection uses radars at 24 GHz or lower.
Radars lack the precision and small object detection capabilities of LiDARs but can be used in cloudy weather and are computationally light to operate.
It is possible to fool or jam Radars using equipment that can generate the equivalent of reflected radio waves. Countermeasures are available to mitigate risks.
LiDARs (Light Detection And Ranging sensors) fire beams of laser light and measure how long it takes for the light to return to the sensor. LiDARs enable high-precision detection of smaller objects in real time. They are useful as they can build near-perfect 3D monochromatic images of objects.
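The time-of-flight measurement converts to distance as distance = c × round-trip time / 2, since the light travels out and back. A minimal sketch, with a function name chosen for the example and ignoring real-world effects such as detector latency:

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0


def range_from_time_of_flight(round_trip_s):
    """Distance to the reflecting surface from a laser pulse's round-trip time."""
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # Divide by two because the pulse covers the distance twice.
    return SPEED_OF_LIGHT_MPS * round_trip_s / 2.0
```

A target 200 meters away returns the pulse after about 1.33 microseconds, which gives a sense of the timing precision these sensors need.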
LiDARs are either rotating devices mounted on top of cars or directional and mounted to scan in a specific direction, for example, the side of a car.
LiDARs operate in the Near Infrared wavelengths. A 905 nanometer LiDAR has a range of approximately 200 meters. The new 1550 nanometer LiDARs are 40 times more powerful, provide 50 times greater resolution and have 10 times longer range. In Waymo Pacifica vans, cameras are mounted alongside the main LiDAR atop the vehicle.
LiDARs have limited usability at night, in cloudy weather, and when reflecting off dark and dense surfaces like tires.
It is possible to fool LiDARs by replaying, relaying, jamming, or flooding signals. Countermeasures must be deployed for the safety of autonomous vehicles, occupants, and pedestrians.
Additional sensors like GPS, Dead-reckoning sensors, Inertial Measurement Sensors, infrared cameras, audio microphones, and ultrasound sensors are either used or being experimented with in autonomous vehicles.
- Dead-reckoning sensors use the wheel circumference and record wheel rotations and steering direction to calculate locations in areas where satellite-based navigation is not available.
- Inertial Measurement Sensors are multiple sensors that sense an autonomous vehicle’s force, angular rate, and surrounding magnetic fields utilizing a triad of accelerometers, gyroscopes, and magnetometers.
- Far-infrared cameras or microbolometers can detect pedestrians at night when the temperature of a person is differentiable from that of the environment.
- Passive Visual Infrared Sensors can detect presence at ranges of up to 300 meters.
- Driver Emotion Sensors check the emotion of drivers and passengers using cabin cameras.
- Steering Haptic Sensors detect whether the driver’s hands are on the steering wheel and how tight the grip is
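The dead-reckoning idea from the list above can be sketched in a few lines. This is a deliberately simplified model: it treats the recorded steering direction as the vehicle's heading for each segment and ignores wheel slip and drift, and the function name and input format are assumptions made for the example.

```python
import math


def dead_reckon(start_x_m, start_y_m, wheel_circumference_m, segments):
    """Estimate position from wheel rotations and heading, without satellite data.

    segments: iterable of (wheel_rotations, heading_degrees) pairs, where a
              heading of 0 points along +x and 90 points along +y (assumed
              convention for this sketch).
    """
    x, y = start_x_m, start_y_m
    for rotations, heading_deg in segments:
        distance = rotations * wheel_circumference_m
        heading = math.radians(heading_deg)
        x += distance * math.cos(heading)
        y += distance * math.sin(heading)
    return x, y
```

With a 2 m wheel circumference, five rotations heading 0° followed by five rotations heading 90° place the vehicle at roughly (10, 10). Because every segment adds its own error, such estimates drift over time and are normally corrected once satellite navigation becomes available again.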
What is the best sensor for autonomous driving?
The cliché answer is “all of them”.
Just as humans retain 68% of content when both visual and auditory senses are stimulated, versus 10% with auditory senses only, sensors of multiple modalities working together perform better than any single one. Sensors have specific pros and cons and are purpose-built to excel at specific sensing requirements like proximity detection, range, resolution, speed detection, color and contrast differentiability, low cost, small size, and applicability in dark, bright, or inclement weather.
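One common way to combine readings from sensors of different quality is inverse-variance weighting, in which each reading counts in proportion to how precise it is. The sketch below is an illustration of that general technique, not a description of what any particular vehicle does; the function name and the example variances are assumptions.

```python
def fuse_estimates(measurements):
    """Combine independent estimates of the same quantity.

    measurements: list of (value, variance) pairs, e.g. a range to the same
                  obstacle reported by a radar and by a LiDAR.
    Returns (fused_value, fused_variance).
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    weight_sum = 0.0
    weighted_values = 0.0
    for value, variance in measurements:
        if variance <= 0:
            raise ValueError("variance must be positive")
        weight = 1.0 / variance  # precise sensors get larger weights
        weight_sum += weight
        weighted_values += weight * value
    return weighted_values / weight_sum, 1.0 / weight_sum
```

If a radar reports 50.0 m with variance 4.0 and a LiDAR reports 49.0 m with variance 1.0, the fused estimate is 49.2 m with variance 0.8. The result is more certain than either sensor alone, which is the quantitative version of the "all of them" answer.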
Visible light cameras, infrared cameras, Radars, LiDARs, and dedicated short-range vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications systems will be required for achieving Society of Automotive Engineers (SAE) level-5 autonomy, which encourages redundancy in sensors.
In 2016, a LiDAR unit cost $75,000 on average, more than the Prius it was mounted on. Google Waymo brought the cost down to approximately $40,000 in 2017. Solid-state LiDARs (e.g. $250 from Quanergy and sub-$100 from Velodyne) have the potential to bring the price to commercial-deployment levels. The quality of the $75,000 Velodyne HDL-64E is, however, far superior to that of the cheaper ones. 1550 nanometer LiDARs have higher resolution and distance capability than 905 nanometer LiDARs.
Waymo switched to using three LiDARs on their Pacifica vans. The main long-range LiDAR is capable of zooming into objects on the road and can “see a football helmet two full football fields away”, i.e., approximately 200 meters. A short-range LiDAR mounted on the passenger side of the body provides an uninterrupted surround view and can detect objects and people.
Interpreting and driving in the environment
Autonomous vehicles and mobile robots concurrently operate in two stages: mapping and localization.
- Mapping: In the first stage, the autonomous system interprets the sensors and builds a representation of the environment as a map.
- Localization: In the second stage, the autonomous system plans a trajectory from one location to another, and then executes the planned trajectory confined or localized in the map.
The autonomous system builds a map of the environment and then localizes the vehicle in that map, using Simultaneous Localization and Mapping (SLAM) algorithms
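The mapping half of this process can be illustrated with a tiny occupancy grid. Note the loud simplification: a real SLAM algorithm estimates the vehicle pose and the map together, whereas this sketch assumes the pose is already known and only fills in the map. The grid representation, function name, and sampling scheme are assumptions made for the example.

```python
import math


def update_grid(grid, pose, scan, cell_size_m=1.0):
    """Mark grid cells as free or occupied from one range scan.

    grid: dict mapping (ix, iy) cell indices to "free" or "occupied"
    pose: (x_m, y_m, heading_rad) of the sensor, assumed known here
    scan: list of (bearing_rad, range_m) readings relative to the heading
    """
    px, py, heading = pose
    for bearing, range_m in scan:
        angle = heading + bearing
        # Sample the ray once per cell length; crude, and may skip cells
        # on diagonal rays, but enough to show the idea.
        for i in range(int(range_m / cell_size_m)):
            x = px + i * cell_size_m * math.cos(angle)
            y = py + i * cell_size_m * math.sin(angle)
            cell = (math.floor(x / cell_size_m), math.floor(y / cell_size_m))
            grid.setdefault(cell, "free")  # never downgrade an occupied cell
        hit_x = px + range_m * math.cos(angle)
        hit_y = py + range_m * math.sin(angle)
        hit = (math.floor(hit_x / cell_size_m), math.floor(hit_y / cell_size_m))
        grid[hit] = "occupied"  # the beam ended here, so something reflected it
    return grid
```

Starting from an empty `grid = {}` and calling `update_grid` once per scan gradually builds up the drivable and blocked areas around the vehicle; a planner can then treat "occupied" cells as obstacles.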
The autonomous vehicle can represent the environment as a 3-D map, estimate its location on this map, follow roads, stay within lanes, identify signs, interpret dynamic signs, distinguish between drivable and not-drivable surfaces, and identify pedestrians and other vehicles, or make use of Radar-, LiDAR- and camera-based advanced driver assistance systems.
Components of the Visual Perception System in autonomous vehicles
An autonomous vehicle needs to know its location, understand the environment, be able to get from one location to another and understand the intent of the driver or passenger.
The initial model of the environment is generated as a map from sensing data collected by sampling the sensors. Most vehicles use a laser-scanner LiDAR; however, certain vehicles rely on cameras and Radars, as described above. Convolutional Neural Networks (ConvNets) and deep-learning approaches are used to recognize and classify cars, people, and obstacles on the road from a visual stream.
Planning & Control
Path planning algorithms determine a trajectory based on sensor data for the autonomous vehicle.
The autonomous control system issues commands to the robotic motor interfaces to drive, accelerate, slow, and stop the autonomous vehicle.
The path planning and autonomous control systems constantly monitor both the location of the vehicle and the presence of obstacles missing from the map. The systems circumvent the obstacles using algorithms ranging from Monte Carlo Optimization to Deep Reinforcement Learning. An autonomous vehicle learns how to navigate when rewarded for staying on course and punished when it collides with something in the environment. This reward and punishment feedback reinforces which actions to perform and which to avoid.
The path planning system uses waypoint planning and navigation systems. To give the car enough time to react to obstacles and create an alternative trajectory, a virtual car bigger than the actual car is used in planning. See Dynamic Virtual Bumpers.
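The "bigger virtual car" idea can be sketched as an inflated rectangle around the real footprint. This is only an illustration of the general principle and not the Dynamic Virtual Bumpers method itself: the speed-dependent front margin, the 1-second reaction time, the 0.5 m side margin, and the function names are all assumptions.

```python
def virtual_footprint(length_m, width_m, speed_mps,
                      reaction_time_s=1.0, side_margin_m=0.5):
    """Return (x_min, x_max, y_min, y_max) of the inflated car in the vehicle frame.

    The origin is the car's center and +x points forward. The front margin
    grows with speed so that faster travel reserves more room to react.
    """
    front_margin_m = speed_mps * reaction_time_s
    return (-length_m / 2.0,
            length_m / 2.0 + front_margin_m,
            -width_m / 2.0 - side_margin_m,
            width_m / 2.0 + side_margin_m)


def intrudes(footprint, obstacle_xy):
    """True if the obstacle point falls inside the virtual car."""
    x_min, x_max, y_min, y_max = footprint
    x, y = obstacle_xy
    return x_min <= x <= x_max and y_min <= y <= y_max
```

For a 4.5 m by 1.8 m car travelling at 10 m/s, an obstacle 8 m straight ahead already falls inside the virtual footprint and would trigger replanning, while the same obstacle is outside the footprint when the car is stationary.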
The curvature of the curb, the texture of buildings, and the trees are sufficient as reduced features to create a boundary within which to localize the autonomous vehicle. By extracting and combining the perpendicular normals of curbs, trees, and walls, three-dimensional feature vectors can be created. Projecting these feature vectors into two dimensions streamlines the localization computation and map generation so that both can run in real time.
Our goal is to work towards an SAE level-5 autonomous vehicle with fully self-driving capabilities, but we have a number of technical and psychological roadblocks to solve related to autonomous vehicle visual perception. For example, autonomous vehicle visual perception on congested roads is an open issue, and autonomous vehicles cannot interpret hand signals from a police officer. A lot has been written about ethics and self-driving cars. Here is a good resource.
The autonomous vehicle visual perception system is unable to match the human visual perception system when driving in inclement weather, in poor visibility conditions, or in the snow. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications systems will be required for achieving SAE level-5 autonomy, but that comes with a large investment in infrastructure and it seems the initiative was dropped or shelved recently.
The autonomous vehicle visual perception cannot tolerate mistakes like incorrectly identifying the markings on the side of a truck as a fence or an advertisement sign with a picture of a car as an oncoming car. In other words, we still have work to do.
Meanwhile, we may see vast deployments of Parallel Autonomy, i.e. an autonomy system that can automatically correct the maneuvers of the human to ensure that driving is safe in dangerous situations. The semi-autonomous technology can be considered an advanced skid control or an ABS braking system on steroids and has the potential to bring technological advances sooner to vehicles.
Uber, Google, and others are perfecting Series Autonomy, where either the human or the autonomous system is in full control, but not at the same time. This fully autonomous vehicle, however, needs to address all ethical, technical and psychological barriers involved.
Lectures by Professor Doctor Daniela Rus, MIT CSAIL, in the MIT Sloan Artificial Intelligence classes were a helpful resource.