We are currently working on deep learning models for bridge damage recognition, such as delimitation and rebar corrosion.

We are also working on several types of models for processing video from the aerial view.

  • Semantic segmentation (pixel-wise labeling): targeting larger objects
  • Object recognition and tracking: targeting smaller objects
  • Action recognition

We are also working on deep learning models for bridge damage recognition, such as delimitation and rebar corrosion.

Semantic Segmentation

First, we develop highly accurate methods for semantic segmentation (pixel-wise labeling).

In semantic segmentation, the RGB image is feed into the network frame by frame. The architecture consists of a sequence of layers.

  • Convolution layers perform feature extraction
  • Max-Pooling layer reduces dimensionality while keeping only strongest features. Other dimension reductions are done by convolution layers
  • Skip connections throughout the network ease the training process, allowing improved accuracy (principle of residual networks)
  • Two deconvolution layers perform upsampling from a ~15×15 pixel class probability heatmap to about 30×30, and then to the original input size
  • Another skip connection brings better spatial information from a ~30×30 pixel feature map, for a finer segmentation
  • Final softmax layer transforms outputs of previous layers into category index

The power of this architecture comes from the use of a newer ResNet architecture for semantic segmentation. Our 152-layer architecture provides better accuracy than the FCN-8 architecture. In applications with strict computation time requirements, our architecture can be compacted to a 50-layer version, which supports significantly faster computing than existing architectures while keeping a relatively good accuracy. Moreover, our models are trained with Deep Transfer Learning techniques that help to further improve the accuracy, especially when the amount of training data is small.

Semantic segmentation has to be performed in real time. Our current, non-optimized code runs as 3 frames per second.

Object Recognition and Tracking

We use an adapted Single- shot Multibox Detector, which is a state-of-the-art model for object detection in terms of speed/accuracy trade-off. The model can detect 6 different classes of objects from the Stanford Drone Dataset. Further, it can detect pedestrians in our newly created Okutama-action dataset, where it achieves validation accuracies of 80% mAP and 72% mAP respectively.

This models run at between 7-10 fps on a desktop grade GPU and between 1.5 and 3 fps on the GPU-powered Jetson TX1 SoC. They can serve as the basis for multiple object tracking algorithms taking the tracking-by-detection approach.

Action Recognition

We adapted the Single-shot Multibox Detector to predict Action categories.

  • Human to Object Interaction: reading, drinking, carrying, calling, pushing/pulling
  • None-Interaction: walking, running, lying, sitting, standing
  • Human to Human Interaction: handshaking, hugging