Unified representation learning for vision-language understanding
Tuesday, October 20th, 4:30 PM – 6:00 PM (GMT-3)
Presentation Abstract: Large-scale pre-training methods that learn cross-modal representations from image-text pairs are becoming popular for vision-language tasks. In this talk, I will first introduce our research on VLP (unified Vision-Language Pre-training), which pre-trains a model with two objectives, bidirectional and sequence-to-sequence (seq2seq) prediction, to learn a unified representation for both understanding and generation tasks. To further encourage vision-language-aligned representations, we developed OSCAR (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to ease the learning of object-semantics alignment, achieving new state-of-the-art results on six well-established vision-language understanding and generation tasks. I will present extensive image captioning examples and analysis to provide insights into the effectiveness of the learned VL-aligned representation.
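The two pre-training objectives mentioned in the abstract are commonly realized as different self-attention mask patterns over a shared Transformer. The sketch below is only an illustration of that idea, not the VLP implementation: `seq2seq_mask` and `bidirectional_mask` are hypothetical helper names, and the source/target split (image regions vs. caption tokens) is an assumption for the example.

```python
import numpy as np

def bidirectional_mask(n):
    # Bidirectional objective: every token may attend to every token.
    return np.ones((n, n), dtype=bool)

def seq2seq_mask(n_src, n_tgt):
    # Seq2seq objective: source tokens (e.g. image regions) attend to
    # all source tokens; target tokens (e.g. caption words) attend to
    # all source tokens and only to earlier target tokens (causal).
    n = n_src + n_tgt
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_src] = True                           # all rows see the source
    mask[n_src:, n_src:] = np.tril(
        np.ones((n_tgt, n_tgt), dtype=bool))         # causal within the target
    mask[:n_src, n_src:] = False                     # source never peeks at target
    return mask
```

Sharing one backbone and switching only the attention mask is what lets a single pre-trained model serve both understanding (bidirectional) and generation (seq2seq) tasks.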
Lei Zhang is a Principal Researcher and Research Manager of the computer vision research group in Microsoft Cloud & AI, leading a team working on visual recognition and computer vision. The team has made a significant impact on Microsoft Cognitive Services, including image tagging, object detection, entity recognition, and image captioning. Prior to this, he worked at Microsoft Research Asia for 12 years as a Senior Researcher and then with Bing Multimedia Search for two years as a Principal Engineering Manager. He is an IEEE Fellow, has published 150+ papers, and holds 50+ U.S. patents for his innovations in related fields.