Machine learning has found applications beyond analyzing charts of data and predicting stock market trends. Computer vision use cases allow ML models to identify specific objects and features within visual media, and a technique known as semantic segmentation powers many of these applications.
Semantic segmentation assigns a label to each object in an image and determines which pixels that object takes up. In simple terms, it determines what’s in a picture and where it is. For example, semantic segmentation has found uses in self-driving cars, where algorithms take pictures of the road and identify the locations of other cars, road signs, trees, and other features.
Semantic segmentation treats multiple objects of the same class as a single entity. It is only one of three approaches to image segmentation, the task of partitioning an image into regions that share similar characteristics. The others are instance segmentation, which differentiates individual objects within the same class, and panoptic segmentation, which combines the strengths of the instance and semantic methods.
How Does Semantic Segmentation Labeling Work?
You can think of semantic segmentation as separating an individual item from the rest of an image. In a nutshell, the technique:
- Classifies a type of object found in an image
- Localizes it by finding the boundaries and drawing around it
- Segments it by grouping the object’s pixels into a labeled region of the image
Semantic segmentation works with class labels and builds these outlines for each class of object it finds in the picture.
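A minimal sketch of what that output looks like, assuming a made-up three-class street scene in Python/NumPy (the class IDs and the tiny mask are purely illustrative):

```python
import numpy as np

# Hypothetical class labels for a tiny 4x6 street scene.
CLASSES = {0: "road", 1: "car", 2: "tree"}

# A semantic segmentation mask: one class ID per pixel.
# Every "car" pixel shares the same label, no matter how
# many individual cars appear in the scene.
mask = np.array([
    [2, 2, 0, 0, 0, 2],
    [2, 0, 0, 1, 1, 2],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

for class_id, name in CLASSES.items():
    count = int((mask == class_id).sum())
    print(f"{name}: {count} pixels")
```

Note that all car pixels carry the same label 1; telling one car apart from another is instance segmentation’s job, not semantic segmentation’s.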
The Role of Convolutional Neural Networks in Semantic Segmentation
For applications in computer vision, businesses typically employ convolutional neural networks (CNNs) to perform semantic segmentation. CNNs consist of three components (a minimal sketch follows the list):
- A convolutional layer
- A pooling layer
- A fully connected layer
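Assuming PyTorch purely for illustration, those three components might stack up as follows; the layer sizes are arbitrary, not tuned for any real task:

```python
import torch
import torch.nn as nn

# A minimal CNN showing the three components named above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer (halves resolution)
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),                 # fully connected layer
)

# A 64x64 RGB image becomes a vector of 10 class scores.
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```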
But to extract objects, CNNs essentially have to compress the image in their pooling layers and consequently lose some spatial information in the process. The architecture that deals with this is the convolutional encoder-decoder. An encoder is a convolutional network that downsamples the image to pick up its features. A decoder, another convolutional network, upsamples the result through an interpolation technique to divide the image into separate segments.
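A minimal sketch of that encoder-decoder shape, again assuming PyTorch and illustrative layer sizes, with bilinear interpolation standing in for the decoder’s upsampling step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    """A minimal convolutional encoder-decoder for segmentation.

    The encoder downsamples with pooling (losing spatial detail);
    the decoder upsamples with interpolation to recover a per-pixel
    class map. Layer sizes here are illustrative, not tuned.
    """
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # 1/2 resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                # 1/4 resolution
        )
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = self.encoder(x)
        logits = self.classifier(feats)
        # Decoder step: bilinear interpolation back to input size.
        return F.interpolate(logits, size=(h, w), mode="bilinear",
                             align_corners=False)

# One class score per pixel: shape (1, num_classes, 64, 64).
out = TinyEncoderDecoder()(torch.randn(1, 3, 64, 64))
print(out.shape)
```

Real models use deeper encoders and often learned upsampling (such as transposed convolutions), but the downsample-then-upsample shape is the same.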
This information loss is the same mechanism at work in image compression. The artifacts you see on heavily compressed JPEGs on the internet are an example of how much information such downsampling discards.
Measuring Semantic Segmentation Accuracy Through Metrics
When we evaluate the performance of an image segmentation model, how do we take quantitative measurements that we can use for comparisons and analyses? Data scientists commonly rely on several metrics:
- Pixel accuracy is the most straightforward approach: the ratio of correctly labeled pixels to the total number of pixels. However, it doesn’t always paint a full picture, especially if a single object takes up almost the entire image.
- Intersection over Union (IoU) compares the actual object region in an image with the model’s predicted region: the overlap between the two divided by their union. IoU extends to multiple object classes by averaging the per-class scores (mean IoU), so an object that takes up a majority of the image cannot single-handedly skew the result.
- F1 score likewise aims to measure model performance accurately despite class imbalances. F1 combines two concepts: precision and recall.
Precision is how often a model is correct when it claims it has found an object class. A high-precision model might not find all instances of that object, but the ones it does flag are usually right.
Recall measures how comprehensively the model finds all instances of an object class. High recall may come with more false positives, but the model misses few genuine positives.
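In terms of per-pixel true positives (TP), false positives (FP), and false negatives (FN): IoU = TP / (TP + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall). A minimal NumPy sketch computing all of these from two class-ID masks (the example masks at the bottom are made up):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray, num_classes: int):
    """Pixel accuracy plus per-class IoU, precision, recall, and F1.

    `pred` and `target` are integer class-ID masks of the same shape.
    """
    results = {"pixel_accuracy": float((pred == target).mean()), "per_class": {}}
    for c in range(num_classes):
        tp = np.logical_and(pred == c, target == c).sum()
        fp = np.logical_and(pred == c, target != c).sum()
        fn = np.logical_and(pred != c, target == c).sum()
        union = tp + fp + fn
        iou = tp / union if union else float("nan")
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else float("nan"))
        results["per_class"][c] = {
            "iou": float(iou), "precision": float(precision),
            "recall": float(recall), "f1": float(f1),
        }
    return results

pred = np.array([[0, 0, 1], [0, 1, 1]])
target = np.array([[0, 1, 1], [0, 1, 1]])
print(segmentation_metrics(pred, target, num_classes=2))
```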
Examples of Semantic Segmentation Models Powered by Deep Learning
This information loss during downsampling and upsampling has prompted computer vision researchers to develop semantic segmentation models powered by deep learning. Some examples include:
- Fully Convolutional Networks, or FCNs. FCN-16 specifically incorporates the information from an earlier pooling layer into the final segmentation map to reduce data loss. FCN-8 goes a step further and includes one more earlier pooling layer (see the skip-connection sketch below).
- U-Net, while similar to FCN, uses skip connections between its encoder and decoder to recover detail during upsampling. U-Net originated in 2015 in the medical field, where it was used to locate tumors.
- Pyramid Scene Parsing Network (PSPNet) works better than FCNs when it comes to building a holistic understanding of an entire image. The data loss from FCNs makes it difficult to differentiate two spatially similar objects, but PSPNet combines local and global context information to identify all objects in an image.
- ParseNet likewise capitalizes on global context information to improve prediction accuracy, an improvement on FCN.
- Mask R-CNN combines an FCN with another network, Faster R-CNN, to generate both a bounding box and an accurate mask outline for each object.
- DeepLab by Google, which uses atrous (dilated) convolutions on top of a convolutional backbone, is well known for its low computational cost and strong performance. The most recent iterations are DeepLabv3 and DeepLabv3+.
Other options include the Spatio-Temporal FCN (STFCN), which works for video segmentation. It combines FCN’s ability to parse individual frames with long short-term memory (LSTM) networks, which can process sequential information over time. When such a model has access to multiple frames of the same scene, it can extract more useful insights for applications like self-driving cars.
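The skip-connection idea referenced above, sketched minimally in PyTorch; this is in the spirit of FCN-16 (fusing score maps from an earlier pooling stage), not a reproduction of any published model, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipConnectionSegmenter(nn.Module):
    """Illustrative FCN-16-style skip connection (not the published model).

    Score maps from an earlier, higher-resolution pooling stage are fused
    with the deeper, coarser ones so the decoder can recover detail lost
    to downsampling.
    """
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))   # 1/2
        self.stage2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))  # 1/4
        self.score_shallow = nn.Conv2d(16, num_classes, 1)  # from stage1
        self.score_deep = nn.Conv2d(32, num_classes, 1)     # from stage2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        s1 = self.stage1(x)   # higher resolution, finer detail
        s2 = self.stage2(s1)  # lower resolution, stronger semantics
        deep = F.interpolate(self.score_deep(s2), size=s1.shape[-2:],
                             mode="bilinear", align_corners=False)
        fused = deep + self.score_shallow(s1)  # the skip connection
        return F.interpolate(fused, size=(h, w), mode="bilinear",
                             align_corners=False)

print(SkipConnectionSegmenter()(torch.randn(1, 3, 64, 64)).shape)
```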
Real-World Applications of Semantic Segmentation Models
Computer vision empowered by semantic segmentation models has a number of potential applications across various tasks and industries.
- General image manipulation tasks such as picture editing, online image searching, and background removal all depend on separating objects from the rest of an image.
- Image processing in general is incredibly versatile. The agricultural sector can use drones to take aerial shots of fields and predict that season’s yield. During floods or earthquakes, those same images can help locate individuals in need of rescue.
- Self-driving cars process data from external sensors to distinguish important objects and navigate around other vehicles and obstacles on the road.
- The medical industry can use semantic segmentation to search CT scans for anomalies. There have been attempts to detect cancers early on through this technology.
- Quality control systems benefit greatly from computer vision capable of detecting imperfections in the final product. Manufacturing facilities can use it to prevent defective units from entering the market.
- Automatic captioning, now available on many websites thanks to computer vision, helps visually impaired users understand images on news sites and other online content.
And because semantic segmentation models require training on annotated images, computer vision can even help generate those annotations, creating large training datasets at scale.