Abstract:
Image question answering is a multimodal learning task intersecting computer vision and natural language processing. With the breakthroughs in the deep neural networks, it has been the hotspot and focus of many researchers’ attention. To solve the task, researchers put forward numerous excellent models. Stacked attention networks (SANs) is one of the most typical models, and gets the state-of-the-art results in the test of four public visual question answering datasets. Although it has the excellent performance, because of the diversity of question and the sparsity of answer, it cannot fully learn the universal law of the corpus, and easily fall into the poor local optimal solution, which leads to the higher question answering error rate. By analyzing the causes of the error and observing the details of the model processing image question answering, we find that stochastic gradient descent based on momentum (baseline) has some defects in the optimization of SANs. To solve it, we propose static restart stochastic gradient descent based on image question answering. The experimental results show that its accuracy is 0.29% higher than baseline, but its convergence rate is slower than baseline. To verify the significance of the improved performance, we conduct statistical hypothesis test on the experimental results. The results of T test prove that its improved performance is extremely significant in the process of converging to the global optimal solution. To verify its effectiveness in the same kind of algorithm, we conduct effectiveness experiments with it and the state-of-the-art first-order optimization algorithms. The experimental results and analysis prove that it is more effective in solving image question answering. To verify its generalization performance and promotion value, we conduct the image recognition experiment on the classic Cifar-10 for the image recognition task. The experimental results and the results of T test prove that it has good generalization performance and promotion value in the process of converging to the global optimal solution.