Fuzzy Radial Basis Function Least Square Policy Iteration: A Novel Critic-Only Reinforcement Learning Framework

Document Type: Research Paper

Authors

1 Department of Electrical Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

2 Department of Electrical Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran

3 Department of Electrical and Computer Engineering, Tarbiat Modares University (TMU), Tehran, Iran

Abstract

In this paper, a new critic-only reinforcement learning algorithm for control problems with continuous state spaces is proposed. Our approach, called Fuzzy-RBF Least Square Policy Iteration (FRLSPI), is obtained by combining Least-Squares Policy Iteration (LSPI) with a fuzzy-RBF network, a hybrid model that merges a Takagi-Sugeno fuzzy inference system with an RBF network, as the function approximator, and it tunes the weight parameters of this network online. In FRLSPI, the basis functions defined by the fuzzy-RBF network resolve the challenge of choosing state-action basis functions in LSPI. We also establish a theoretical error bound between the optimal and the approximated Action Value Function (AVF) for FRLSPI. The proposed method offers desirable properties such as a supporting mathematical analysis, independence from a learning-rate parameter, and comparatively good convergence behavior. Simulation studies on the mountain-car and acrobot control tasks demonstrate the applicability and performance of our learning framework. The overall results indicate that the proposed method can outperform previously reported reinforcement learning algorithms.
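To make the construction concrete, the following is a minimal Python sketch of the generic LSPI loop (an LSTD-Q evaluation step inside policy iteration, following Lagoudakis and Parr) using normalized Gaussian basis functions, which play the role of the firing strengths of a zero-order Takagi-Sugeno system. It illustrates the general recipe the abstract describes, not the paper's exact FRLSPI update; the function names, the block-structured state-action feature layout, and the ridge regularization term are illustrative assumptions.

import numpy as np

def rbf_features(state, centers, widths):
    """Normalized Gaussian activations; with the normalization they act like
    the firing strengths of a zero-order Takagi-Sugeno fuzzy system."""
    d2 = np.sum((centers - state) ** 2 / widths ** 2, axis=1)
    phi = np.exp(-0.5 * d2)
    return phi / (phi.sum() + 1e-12)

def state_action_features(state, action, n_actions, centers, widths):
    """Block-structured state-action features: the state features occupy
    the block of the chosen action; all other blocks are zero."""
    phi_s = rbf_features(state, centers, widths)
    phi = np.zeros(len(phi_s) * n_actions)
    phi[action * len(phi_s):(action + 1) * len(phi_s)] = phi_s
    return phi

def lstdq(samples, w_old, gamma, n_actions, centers, widths):
    """One LSTD-Q step: solve A w = b for the action-value weights of the
    greedy policy induced by the previous weight vector w_old."""
    k = len(centers) * n_actions
    A = 1e-6 * np.eye(k)          # small ridge term for numerical stability
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        phi = state_action_features(s, a, n_actions, centers, widths)
        # greedy successor action under the previous weights
        q_next = [state_action_features(s_next, ap, n_actions, centers, widths) @ w_old
                  for ap in range(n_actions)]
        a_next = int(np.argmax(q_next))
        phi_next = state_action_features(s_next, a_next, n_actions, centers, widths)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

def lspi(samples, gamma, n_actions, centers, widths, tol=1e-4, max_iter=50):
    """Policy iteration: repeat LSTD-Q until the weights stop changing."""
    w = np.zeros(len(centers) * n_actions)
    for _ in range(max_iter):
        w_new = lstdq(samples, w, gamma, n_actions, centers, widths)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

For a task like mountain-car, samples would be a batch of (state, action, reward, next-state) tuples gathered by exploration, centers a grid over position and velocity, and n_actions = 3. Replicating the state features once per discrete action is the usual way to obtain state-action basis functions in LSPI, and the normalization keeps the state features summing to one, a property that error-bound analyses of fuzzy approximators commonly exploit.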
