40. Knowledge distilled pre-training model for vision-language-navigation
Published in Applied Intelligence, 2022
Vision-language navigation (VLN) is a challenging task that requires a robot to autonomously move to a destination based on visual observations, following a human's natural-language instructions. To improve performance and generalization ability, transformer-based pre-training models have been used in place of traditional methods. However, such pre-training models are ill-suited to sustainable computing and practical deployment because of their heavy computation and large memory footprint. We therefore propose a lightweight pre-training model obtained through knowledge distillation. Through distillation, the rich knowledge encoded in a large "teacher" model is transferred to a small "student" model, greatly reducing model parameters and inference time while largely preserving the original performance. In our experiments, the model size is reduced by 87% and the average inference time by approximately 86%, so the model trains and runs much faster. At the same time, 95% of the original model's performance is retained, which still surpasses traditional VLN models.
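The teacher-to-student transfer described above is commonly trained with the standard distillation objective of Hinton et al.: a temperature-softened KL term against the teacher's output distribution, combined with ordinary cross-entropy on the ground-truth labels. The sketch below illustrates that generic objective only; the temperature `T`, weight `alpha`, and function names are illustrative assumptions, not the values or implementation used in the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T produces softer distributions."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.7):
    """Weighted sum of the soft-target KL loss and hard-label cross-entropy.

    T and alpha are hypothetical hyperparameters for illustration.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) on the softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    soft_loss = (T ** 2) * kl.mean()
    # Standard cross-entropy of the student against the ground-truth labels.
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(hard_labels)), hard_labels] + 1e-12).mean()
    return alpha * soft_loss + (1 - alpha) * ce
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains, which is why a well-distilled student can track the teacher's behavior closely despite having far fewer parameters.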
Recommended citation:
B. Huang*, S. Zhang, J.-T. Huang, Y.-J. Yu, Z.-C. Shi and Y.-J. Xiong, "Knowledge distilled pre-training model for vision-language-navigation," Applied Intelligence, 53(1): 5607–5619, 2022.
Download Paper