Modified Step Size for Enhanced Stochastic Gradient Descent: Convergence and Experiments

Document Type: Original Scientific Paper

Authors

1 Department of Computer Science, Faculty of Mathematical Science, University of Kashan, Kashan, Iran

2 Department of Applied Mathematics, Shiraz University of Technology, Shiraz, I. R. Iran

DOI: 10.22052/mir.2023.253279.1426

Abstract

This paper introduces a novel approach to enhance the performance of the stochastic gradient descent (SGD) algorithm by incorporating a modified decay step size based on $\frac{1}{\sqrt{t}}$. The proposed step size integrates a logarithmic term, leading to the selection of smaller values in the final iterations. Our analysis establishes a convergence rate of $O(\frac{\ln T}{\sqrt{T}})$ for smooth non-convex functions without the Polyak-Łojasiewicz condition. To evaluate the effectiveness of our approach, we conducted numerical experiments on image classification tasks using the Fashion-MNIST and CIFAR10 datasets; the results demonstrate notable improvements in accuracy, with gains of $0.5\%$ and $1.4\%$, respectively, compared to the traditional $\frac{1}{\sqrt{t}}$ step size. The source code is available at https://github.com/Shamaeem/LNSQRTStepSize.

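Below is a minimal sketch of how a $\frac{1}{\sqrt{t}}$-type decay with a logarithmic correction can be plugged into a plain SGD loop. The decay rule lr_log_sqrt, the toy least-squares objective, and all hyperparameters are illustrative assumptions and not the paper's exact schedule or experimental setup; the authors' implementation is in the repository linked above.

import math
import numpy as np

def lr_sqrt(t, eta0=0.1):
    # Baseline decay from the abstract: eta_t = eta0 / sqrt(t).
    return eta0 / math.sqrt(t)

def lr_log_sqrt(t, eta0=0.1):
    # Assumed illustrative form of a log-modified decay (not the paper's exact rule):
    # eta_t = eta0 / (sqrt(t) * (1 + ln t)), giving smaller steps in late iterations.
    return eta0 / (math.sqrt(t) * (1.0 + math.log(t)))

def sgd(grad_fn, x0, lr_fn, T=2000, batch=32, seed=0):
    # Plain SGD: x_{t+1} = x_t - eta_t * g_t, with g_t a stochastic mini-batch gradient.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(1, T + 1):
        x = x - lr_fn(t) * grad_fn(x, rng, batch)
    return x

# Toy least-squares problem standing in for the image-classification loss.
data_rng = np.random.default_rng(42)
A = data_rng.normal(size=(500, 20))
b = data_rng.normal(size=500)

def grad_fn(x, rng, batch):
    # Mini-batch gradient of f(x) = mean_i (a_i^T x - b_i)^2.
    idx = rng.integers(0, A.shape[0], size=batch)
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ x - bi) / batch

x_final = sgd(grad_fn, np.zeros(20), lr_log_sqrt)
print("final training loss:", np.mean((A @ x_final - b) ** 2))

Because the assumed factor $1 + \ln t$ grows with $t$, this variant shrinks the step size faster than the plain $\frac{1}{\sqrt{t}}$ schedule in late iterations, which matches the qualitative behaviour described in the abstract.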