In our previous blog posts and webinars, we’ve written about the power of synthetic data and Neurolabs’ synthetic data engine, which transforms 3D assets into robust synthetic datasets for computer vision. At Neurolabs, we believe synthetic data holds the key to making computer vision and object detection more accurate, more affordable and more scalable. We provide an end-to-end platform for generating rich and varied annotated synthetic data and training computer vision models on it, taking away the costs of labelling and considerably shortening the time-to-value of object detection solutions.
This is the second in a series of blog posts in which we explore the power of synthetically generated datasets as a basis for object detection in real environments. The first post illustrates how randomising the parameters of a synthetic scene produces robust synthetic datasets, and validates this on an object recognition task in a simple environment.
In this blog post, we incrementally increase the complexity of our problem and explore how well our synthetically generated datasets scale to more difficult object detection environments.
We analyse the impact of synthetic data on a new scenario. We collect a real dataset similar to the one from the previous blog post, but more complex. On top of the previous dataset, we add features like:
- Occlusions between objects (each occlusion covers at most 25% of an object)
- Using 3 different instances of the same class (e.g. 3 different bananas) instead of using only one instance per class
The images have the same camera view and the same background as the first dataset. The classes remain the same: Orange, Banana, Red Apples, Green Apples, Bun Plain, Bun Cereal, Croissant, Broccoli, Snickers, Bounty.
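The 25% occlusion limit can be made concrete with a small check. The post does not say how occlusion is measured, so the sketch below assumes axis-aligned bounding boxes and measures the share of one object's box that another object's box covers. A real pipeline might use pixel masks instead. The `Box` type and both function names are hypothetical.

```python
from typing import NamedTuple

MAX_OCCLUSION = 0.25  # limit stated in the post: at most 25% of an object


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def occluded_fraction(target: Box, occluder: Box) -> float:
    """Return the fraction of the target's box area covered by the occluder's box."""
    overlap_w = min(target.x_max, occluder.x_max) - max(target.x_min, occluder.x_min)
    overlap_h = min(target.y_max, occluder.y_max) - max(target.y_min, occluder.y_min)
    if overlap_w <= 0 or overlap_h <= 0 or target.area == 0:
        return 0.0  # boxes do not intersect, or the target is degenerate
    return (overlap_w * overlap_h) / target.area


def within_occlusion_limit(target: Box, occluder: Box,
                           limit: float = MAX_OCCLUSION) -> bool:
    """Check a single pairwise occlusion against the limit (box approximation)."""
    return occluded_fraction(target, occluder) <= limit
```

Box overlap tends to overestimate occlusion for round or irregular objects, so this check is conservative compared with a mask-based one.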
For the synthetic data generation, the main changes we have made to the original dataset are:
- Using three 3D assets for each class
- Increasing the amount of allowed overlap between objects in an image (max 25% of an object occluded)
- Extending the number of objects per image from [2, 4] to [2, 6]
- Changing the scaling range from [1x, 2x] to [0.75x, 1.75x], to accommodate the larger number of objects in an image
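The generation parameters listed above can be written down as a small configuration plus a sampler. This is only a sketch of the idea, not our synthetic data engine. The names `SceneConfig`, `ObjectSpec` and `sample_scene` are hypothetical, and uniform sampling is an assumption, since the post gives the ranges but not the distributions.

```python
import random
from dataclasses import dataclass
from typing import List

# Class list taken from the post.
CLASSES = ["Orange", "Banana", "Red Apples", "Green Apples", "Bun Plain",
           "Bun Cereal", "Croissant", "Broccoli", "Snickers", "Bounty"]


@dataclass(frozen=True)
class SceneConfig:
    """Ranges from the post; field names are hypothetical."""
    assets_per_class: int = 3     # three 3D assets per class
    min_objects: int = 2          # objects per image in [2, 6]
    max_objects: int = 6
    min_scale: float = 0.75       # scale in [0.75x, 1.75x]
    max_scale: float = 1.75
    max_occlusion: float = 0.25   # recorded here; placement is not modelled below


@dataclass(frozen=True)
class ObjectSpec:
    class_name: str
    asset_index: int
    scale: float


def sample_scene(config: SceneConfig, rng: random.Random) -> List[ObjectSpec]:
    """Sample the objects for one image (uniform sampling is an assumption)."""
    count = rng.randint(config.min_objects, config.max_objects)  # both ends inclusive
    return [
        ObjectSpec(
            class_name=rng.choice(CLASSES),
            asset_index=rng.randrange(config.assets_per_class),
            scale=rng.uniform(config.min_scale, config.max_scale),
        )
        for _ in range(count)
    ]


if __name__ == "__main__":
    for spec in sample_scene(SceneConfig(), random.Random(0)):
        print(spec)
```

Passing an explicit `random.Random` instance keeps each generation run reproducible, which helps when results are averaged over several runs of data generation.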
Synthetic vs. Real
We show an example of a real image compared to a synthetic one. The overlap between the objects in our simulated environment mimics that of the real data, but we can see a clear difference in textures and lighting.
As in our previous experiments, we have used the EfficientDet D3 and Faster RCNN architectures for object detection. The Faster RCNN model uses a Resnet100 backbone. All models start from weights pre-trained on the COCO dataset, with the first two layers of the backbone frozen.
We use two architectures, a one-stage detector and a two-stage detector, to show the robustness of our results. Furthermore, we have not specifically designed our networks to generalise better for distributions of data similar to our synthetic data. The reason is that we aim to prove the general usability of synthetic datasets regardless of the detection technique used.
The experiments in Table 1 keep the same hyperparameter configurations for jobs trained with the same model, to ensure a fair comparison between them. For the synthetic data, we average over 3 runs of data generation.
We can see a considerable difference between the real only (green) and synthetic only (red) experiments. This can be attributed to the limited ability of our synthetic data to transfer to real environments: the variations of the real environment (natural light variation, object overlap, object instance characteristics etc.) add layers of difficulty that models trained only on our synthetic data struggle to represent. With similar amounts of data for real and synthetic experiments, the real experiments perform better.
The same cannot be said about our mixed dataset experiments. When we add a small fraction of real data (as little as 50 images) to the synthetically generated training data, the gap in performance narrows considerably.
Adding 50 real images to the synthetic dataset gives a boost in performance of 0.287 mAP on our EfficientDet experiments and 0.144 mAP on our Faster RCNN experiments. Returns diminish as we increase the number of real images: using 100 real images gives a further increase of only 0.047 mAP and 0.019 mAP respectively over the 50 image experiments.
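To make the diminishing returns explicit, the reported gains can be turned into a cumulative gain and a gain per added real image. The sketch below only does arithmetic on the four numbers quoted above; the function name and dictionary layout are hypothetical.

```python
# mAP gains quoted in the post, keyed by model.
GAIN_FIRST_50 = {"EfficientDet": 0.287, "Faster RCNN": 0.144}  # 0 -> 50 real images
GAIN_NEXT_50 = {"EfficientDet": 0.047, "Faster RCNN": 0.019}   # 50 -> 100 real images
STEP = 50  # real images added at each step


def summarise_gains(model: str) -> dict:
    """Derive cumulative and per-image gains from the reported mAP deltas."""
    first = GAIN_FIRST_50[model]
    second = GAIN_NEXT_50[model]
    return {
        "cumulative_gain_at_100": first + second,
        "gain_per_image_first_50": first / STEP,
        "gain_per_image_next_50": second / STEP,
        "first_to_next_ratio": first / second,
    }


if __name__ == "__main__":
    for model in GAIN_FIRST_50:
        print(model, summarise_gains(model))
```

With these numbers, 100 real images give a cumulative gain of 0.334 mAP for EfficientDet and 0.163 mAP for Faster RCNN over the synthetic only runs, and the first 50 images contribute roughly 6 and 7.6 times as much as the next 50.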
A mix of 200 real images and 900 synthetic images comes within 0.041 mAP of the real only experiment for EfficientDet, while using only 27.21% of the real data in the full real dataset (200 of 735 images). With Faster RCNN, a mix with 200 real images is only 0.014 mAP behind the real only experiment.
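A mixed training set of this kind can be sketched as a random subset of the real images concatenated with the synthetic ones. How the subset is actually chosen is not described in the post, so seeded random sampling is an assumption here, and the file names and function names are hypothetical.

```python
import random
from typing import List, Sequence


def build_mixed_dataset(synthetic: Sequence[str], real: Sequence[str],
                        num_real: int, seed: int = 0) -> List[str]:
    """Combine all synthetic items with a random subset of the real items."""
    if num_real > len(real):
        raise ValueError("num_real exceeds the number of available real images")
    rng = random.Random(seed)                    # seeded for reproducibility
    real_subset = rng.sample(list(real), num_real)
    mixed = list(synthetic) + real_subset
    rng.shuffle(mixed)                           # avoid ordering by source
    return mixed


def real_fraction_used(num_real: int, full_real_size: int) -> float:
    """Share of the full real dataset that the mixed dataset relies on."""
    return num_real / full_real_size


if __name__ == "__main__":
    # Placeholder file names; sizes match the experiment described in the post.
    synthetic = [f"synthetic_{i:04d}.png" for i in range(900)]
    real = [f"real_{i:04d}.png" for i in range(735)]
    mixed = build_mixed_dataset(synthetic, real, num_real=200)
    print(len(mixed))                                  # 1100 images in total
    print(f"{real_fraction_used(200, 735):.2%}")       # 27.21%
```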
The plots in Figure 2 present the evolution of the validation loss during training with different datasets. The same validation set has been used across all the runs, randomly sampled from the real data. We use the evolution of the loss to check on the correctness of the learning process.
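Reusing one validation set across all runs comes down to fixing the random seed of the split. The sketch below is one way to do that, not a description of our pipeline. The function name, the seed and the validation size are hypothetical, and keeping the validation images out of the real training pool is an assumption.

```python
import random
from typing import List, Sequence, Tuple


def split_real_data(real_items: Sequence[str], val_size: int,
                    seed: int = 42) -> Tuple[List[str], List[str]]:
    """Split real items into (train, val); the same seed gives the same split."""
    if val_size > len(real_items):
        raise ValueError("val_size exceeds the number of real items")
    rng = random.Random(seed)
    indices = list(range(len(real_items)))
    rng.shuffle(indices)
    val_indices = set(indices[:val_size])
    # Preserve the original ordering inside each split.
    train = [item for i, item in enumerate(real_items) if i not in val_indices]
    val = [item for i, item in enumerate(real_items) if i in val_indices]
    return train, val


if __name__ == "__main__":
    items = [f"real_{i:04d}.png" for i in range(100)]  # placeholder file names
    train, val = split_real_data(items, val_size=20)   # val_size is illustrative
    print(len(train), len(val))
```

Because every run evaluates on the same images, differences between the loss curves reflect the training data rather than the choice of validation samples.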
The validation loss for training only on synthetic data is the outlier in this plot. This is because of the difficulty of fitting the model to the particularities of real data when the model has only seen synthetic data: the two are not drawn from the same distribution.
As we have done in the last post, we investigate the images that have the highest loss during model training.
We apply the method to the synthetic dataset and the mixed-200 dataset. As shown in both images in Figure 3, the hardest images to learn are those with occlusions between objects. Lighting also appears to challenge the model during training, as does the texture of an object when it is similar to the background colour.
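Finding the hardest images amounts to ranking per-image losses and keeping the largest ones. The sketch below assumes the per-image losses have already been logged during training as a mapping from image name to loss. The function name and the example values are hypothetical placeholders, not results from our experiments.

```python
import heapq
from typing import Dict, List, Tuple


def hardest_images(per_image_loss: Dict[str, float],
                   k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (image, loss) pairs with the highest loss, largest first."""
    return heapq.nlargest(k, per_image_loss.items(), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Placeholder losses for illustration only.
    losses = {
        "img_001.png": 0.31,
        "img_002.png": 1.27,
        "img_003.png": 0.08,
        "img_004.png": 0.94,
    }
    for name, loss in hardest_images(losses, k=2):
        print(name, loss)
```

`heapq.nlargest` avoids sorting the whole mapping, which is convenient when losses are logged for every training image.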
Continuing our exploration of synthetic data, we have shown that synthetically generated datasets can scale well to more complex problems. We have tackled a more difficult domain adaptation scenario and have obtained high accuracy results with only a fraction of the real data conventionally needed.
We highlight the strong potential of synthetic data, slowly moving away from the lab-like environment and peeking into the real world. In our next blog posts, we plan to move the problem to a generic real world environment and further increase the complexity of our synthetic data.