Deep Learning for automatically describing images in natural language

Romanian Journal of Information Technology and Automatic Control / Vol. 30, No. 1, 2020

Deep Learning for automatically describing images in natural language – Image Captioning

Anca Mihaela HOTĂRAN, Mihnea Horia VREJOIU

Abstract:

Image Captioning (IC), in the context of Computer Vision, refers to the automatic generation of textual descriptions for digital images. It involves not only recognizing the objects in an image, but also describing their properties and the relationships and interactions between them, all expressed in syntactically and semantically correct natural language. In summary, the main steps in automatically generating a textual description for an image are: a) extracting the visual information from the image, and b) “translating” it into an adequate and meaningful text. Developments in deep neural networks and Deep Learning in recent years have led to remarkable progress in IC as well, substantially improving the quality of the generated descriptions. Convolutional Neural Networks (CNN) have naturally been used to obtain compact vector representations of the image features, and Recurrent Neural Networks (RNN), in particular Long Short-Term Memory (LSTM) networks, have been used to decode these representations into natural-language sentences. In this paper we present an overview of the recent Deep Learning-based techniques and methods used in the IC field, and we detail and analyze, as a case study, one of the best performing ones, which uses an encoder-decoder architecture combined with a mechanism that focuses visual attention on the relevant regions of the image when generating each new word of the output sequence.
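
As an illustration of the attention idea mentioned in the abstract, the following is a minimal sketch of one "soft attention" step, written in plain Python. It is not the model analyzed in the paper. The function names (`softmax`, `soft_attention`), the toy feature values, and the use of a dot product to score each region against the decoder state are all assumptions made for brevity; attention-based captioning models typically learn the scoring function. The sketch only shows how per-region weights that sum to 1 are turned into a single context vector that a decoder could use when producing the next word.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(region_features, hidden_state):
    """Return (weights, context) for one decoding step.

    region_features: list of feature vectors, one per image region,
                     standing in for the output of a CNN encoder.
    hidden_state:    current decoder state, same length as each vector.
    Dot-product scoring is an assumption of this sketch.
    """
    # Score each region by its similarity to the decoder state.
    scores = [sum(f * h for f, h in zip(feat, hidden_state))
              for feat in region_features]
    # Normalize the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # The context vector is the weighted average of the region features.
    dim = len(region_features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, region_features))
               for d in range(dim)]
    return weights, context

# Toy example: three regions with 2-dimensional features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(regions, state)
```

In this toy input the first and third regions match the decoder state equally well, so they receive equal weights, and the second region receives a smaller one.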

Keywords:
image captioning, machine learning, deep learning, deep neural network, convolutional network, recurrent network, LSTM, encoder-decoder, attentional mechanism.

CITE THIS PAPER AS:
Anca Mihaela HOTĂRAN, Mihnea Horia VREJOIU, "Deep Learning for automatically describing images in natural language – Image Captioning", Romanian Journal of Information Technology and Automatic Control, ISSN 1220-1758, vol. 30(1), pp. 87-100, 2020. https://doi.org/10.33436/v30i1y202007