A Comprehensive Review and Open Challenges on Visual Question Answering Models

Authors

F. A. Shaik, D. Koshti, A. Gupta, M. Kalla, and A. Sharma

DOI:

https://doi.org/10.15649/2346030X.3370

Keywords:

VQA review, Image-question answering, Visual question answering

Abstract

Thanks to recent developments in artificial intelligence, users can now actively interact with images by posing questions about them and expect a response in natural language. This study discusses a variety of datasets available for evaluating visual question answering (VQA) applications, along with their advantages and disadvantages. Four forms of VQA models are examined in depth: simple joint embedding-based models, attention-based models, knowledge-incorporated models, and domain-specific VQA models. We also critically assess the drawbacks and future possibilities of current state-of-the-art (SOTA), end-to-end VQA models. Finally, we present directions and guidelines for the further development of VQA models.

Published

2023-09-01

How to Cite

[1]
F. A. Shaik, D. Koshti, A. Gupta, M. Kalla, and A. Sharma, “A Comprehensive Review and Open Challenges on Visual Question Answering Models”, AiBi Revista de Investigación, Administración e Ingeniería, vol. 11, no. 3, pp. 126–142, Sep. 2023.
