# Reinforcement learning with Takagi-Sugeno-Kang fuzzy systems

*Complex Eng Syst*2023;3:9.

## Abstract

We propose reinforcement learning (RL) architectures for producing performant Takagi-Sugeno-Kang (TSK) fuzzy systems. The first employs an actor-critic algorithm to optimize existing TSK systems. An evaluation of this approach with respect to the Explainable Fuzzy Challenge (XFC) 2022 is given. A second proposed system applies Deep Q-Learning Network (DQN) principles to the Adaptive Network-based Fuzzy Inference System (ANFIS). This approach is evaluated in the CartPole environment and demonstrates comparability to the performance of traditional DQN. In both applications, TSK systems optimized via RL performed well in testing. Moreover, the given discussion and experimental results highlight the value of exploring the intersection of RL and fuzzy logic in producing explainable systems.

## Keywords

*,*Fuzzy systems

*,*Takagi-Sugeno-Kang fuzzy systems

*,*Adaptive neuro-fuzzy inference systems

*,*Reinforcement learning

## 1. INTRODUCTION

Fuzzy sets have been introduced in^{[1]} as a mathematical framework for modeling under uncertainty. Fuzzy systems employ a fuzzy inference engine to solve control problems in various frameworks and applications^{[2,3]}. In a majority of fuzzy control applications, a dynamic model joins a fuzzy system to create a dynamic fuzzy control system^{[4]}. However, there are various applications where dynamic models are either not available or too complex for effective use in the fuzzy setting. Such applications tend to include video games and other interactive environments^{[5]}.

Reinforcement learning (RL) comprises a collection of learning algorithms that allow optimization of various control systems. RL trains agents to adapt to a given environment via a reward system^{[6]}. The agent takes action by assessing the state of the environment, and RL allows the agent to maximize the expected reward. Such algorithms are well suited for complex environments; RL has shown great success in games^{[7]} and industrial applications^{[8]}.

Deep RL^{[9]} has recently seen significant developments as agents have successfully outperformed humans in games such as Go^{[7]}, Poker^{[10]}, and video games^{[11]}. This approach also finds effective employment in applications such as autonomous vehicles^{[12]}, UAVs^{[13]}, and fine-tuning of large language models^{[14]}. However, RL models and related deep learning architectures, in particular, tend to have a significant drawback in their relative lack of explainability.

Explainable machine learning is a subject of extensive research^{[15-17]} concerned with rendering relevant algorithms more understandable, transparent, and trustworthy. Explainable fuzzy AI has afforded significant developments in this domain due to characteristics such as amenability to visualization and expression in natural language^{[18-20]}. Notably, literature discussing directed RL in the setting of explainable fuzzy AI proves scarce despite recent advancements in the former and a capacity for mutual benefit; exploration of genetic algorithms proves more common historically^{[21-23]}.

To aid in developing RL algorithms capable of producing explainable models, fuzzy RL-based architectures^{[24]} show promise. Fuzzy Q-learning was proposed and studied in a series of research papers^{[25-27]} primarily concerned with performance in control applications rather than explainability in general. Additionally, these explorations are disconnected from recent developments in deep RL. This presents a gap in the literature. However, in^{[28]}, the author developed Takagi-Sugeno-Kang (TSK) fuzzy systems with successful applications in various environments where Deep Q-Learning Network (DQN)^{[29]} has illustrated effectiveness. Similar to an Adaptive Network-Based Fuzzy Inference System (ANFIS)^{[30]}, this approach leverages RL to train an agent in these environments and shows the promise of such experimentation.

This paper offers further practical study of RL-based TSK fuzzy systems and ANFIS architectures as solutions for developing explainable AI. The latter is compared to traditional Deep Q-learning algorithms. After a preliminary section where we provide an overview of the conceptual building blocks of our systems, we discuss the successful and winning application of RL-based TSK systems to the Asteroid Smasher framework and corresponding 2023 Explainable AI competition^{[31]}. Additionally, we offer a comparison of ANFIS and DQN in the CartPole environment. The value of the relationship between RL and fuzzy systems to explainability is highlighted and discussed in both cases.

## 2. PRELIMINARIES

### 2.1. Fuzzy systems

Fuzzy sets are defined as functions ^{[1]}. They are used in modeling under uncertainty, especially in a rule-based environment.

Fuzzy rules describe an imprecise cause-effect relationship between certain variables controlling a given system. Fuzzy rules are of the form

where

with antecedents

Mamdani fuzzy systems were introduced in^{[2]}, with the goal of developing fuzzy rule-based control systems. Mamdani fuzzy systems defined by the above rule base evaluate a fuzzy output as

The output of the system is then calculated using a defuzzification as an example

To model more complex systems, one can define systems with multiple antecedents and connect them with an aggregation operator.

TSK Fuzzy Systems have been introduced in^{[3,32]} and are based on rules with a fuzzy antecedent and a singleton consequence

In this paper, we are using a rule with a singleton consequence; however, TSK systems can also be defined with linear or higher-order consequences. The output of a TSK fuzzy system can be calculated as

In the case of multiple antecedents, the rule base is modified as an example

with the evaluation of the output being

In the present paper, we use TSK fuzzy systems in the context of RL.

### 2.2. Reinforcement learning

#### 2.2.1. Bellman's Equation

Bellman equation, named after Richard E. Bellman for his work in^{[33]}, is an equation for controlling systems based on states, rewards, and actions, which altogether allow us to compute the value for the state, i.e., the total expected reward in a given state, given by

*γ* is a discount factor, typically specified between 0 and 1, with a common value of 0.99. *γ* is tuned to balance "near" and "far" rewards with lower values prioritizing immediate rewards and higher values valuing future rewards more.

#### 2.2.2. Q-learning

^{[34]} is a model-free RL algorithm, which will learn the

*Q*^{new} are the new *α*, a learning rate, multiplied by the temporal difference (TD). The TD is the current reward of the state-action pair added to the discount factor *γ* multiplied by the maximum reward that can be earned from the next state before the current value of the state-action pair is subtracted. The

#### 2.2.3. Deep Q-learning

Common techniques for ^{[29]} and is a particular case of Q-learning that relies on a Deep NN architecture. This allows the agent to learn continuous states, learn continuous values in discrete states, and generalize to states not yet seen. For example, a simple ^{[6]}. However, such an approach does not generalize to more complex environments such as StarCraft Ⅱ^{[35]}; the practically countless number of possible states renders their storage in an extremely inefficient table, and Deep Q-learning handles these tasks elegantly.

#### 2.2.4. Improvements to Q-learning and Deep Q-learning

One very common problem with

Experience replay^{[36]} allows ^{[37]} and hindsight experience replay (HER)^{[38]}.

Double Deep ^{[39]} is another technique to aid in the stability of training. A second model, which is a copy of the original network, is offline (frozen) for a set number of training iterations. The target ^{[40]}, is performed on the target network using the weights of the online model multiplied by a small value. This form of weight regularization allows the target model to slowly improve over time and prevents harsh updates to the learning.

An actor-critic RL architecture consists of an actor that acts out the current policy and estimates the current value policy and a critic that evaluates the current policy and estimates the policy gradient using a TD evaluation^{[41]}. The system also encompasses a supervisor that controls the balance between exploration (performed by the actor) and exploitation (performed by the critic).

## 3. REINFORCEMENT LEARNING WITH FUZZY SYSTEMS

RL in the context of fuzzy logic was introduced in^{[24]} and based on the Mamdani framework. It was successfully used in the CartPole problem. RL TSK fuzzy systems have been developed for various applications^{[28,42]}.

### 3.1. Reinforcement learning TSK fuzzy system

As learned behaviors in complex systems are often the result of intricate optimization processes, explainable RL constitutes a challenging problem. However, the interpretability of fuzzy systems indicates the potential value of a method of TSK fuzzy system optimization inspired by traditional ^{[28]} stands to offer some unique benefits. This approach was successfully employed in applications such as a simulation in a simple discrete grid-world environment or in continuous environments such as CartPole or Lunar Lander^{[28]}.

The idea in the setting of a TSK-based

TD

When updating the parameters of a system in RL we can use the general equation

with

Here,

To give a theoretical foundation to our proposed algorithm, we include here an initial discussion on convergence. First, we observe that the Q-learning algorithm is known to be convergent^{[43,44]} under standard assumptions on the given Markov Decision Process (MDP). Additionally, NNs with various membership functions are known to be universal approximators^{[45,46]}. Combining the above results and the conclusions in^{[47,48]}, we can approximate the Q function by the output of the NN, which leads to the convergence of Deep Q-learning models. It is known that TSK fuzzy systems are universal approximators^{[49,50]}, i.e., they have similar approximation properties to those of NNs. Together, the ideas above allow the conclusion that replacing the NN in a DQN architecture with a TSK fuzzy system will retain the same properties as DQNs. In summary, the TSK fuzzy system is an approximator of the Q function. Therefore, the Q-learning algorithm with TSK fuzzy systems is convergent. The above theoretical motivation warrants a deeper investigation of the approximation properties of RL TSK fuzzy systems as a topic for future research.

### 3.2. Case study: Reinforcement learning TSK system for Asteroid Smasher

#### 3.2.1. Problem description

To test the algorithm, we created optimized TSK fuzzy systems through RL to play a variant of the game Asteroids called Asteroid Smasher (see Figure 1). Developed by the University of Cincinnati for the Explainable Fuzzy AI Challenge (XFC 2022), the game environment incorporates additional complexities to increase both the difficulty and value of explainable automation. These include the addition of multiple agents, screen-wrap for ships and hazards, and unique scenarios to test edge cases.

While the primary metric for performance concerned the total number of asteroids destroyed, other metrics included accuracy, number of lives lost, and execution speed. More importantly, the explainability of the system proved crucial as the motivating factor for the test case. In pursuit of balancing AI explainability with performance, TSK systems lending themselves to visualization and descriptions through natural language were developed and optimized via the described learning algorithm.

#### 3.2.2. System description

We created a base model comprised of fuzzy systems to serve as a foundation for optimization.

To balance score and execution time, each ship tracks distances and angles to immediate threats and potential targets. These values take into account considerations such as screen wrap and object size. More importantly, they serve as inputs to fuzzy systems that determine shooting, turning, and thrust behavior. A general overview of this process is shown in Figure 2.

#### 3.2.3. Learning algorithm

The employed algorithm for optimization approximates gradient descent by iteratively running scenarios and altering TSK outputs to more closely resemble the best performer. This requires establishing a performant objective function and additional hyperparameter tuning; the range of possible TSK output values, number of iterations, and learning rate all vary based on configuration. However, the theoretical upside is a more systematic approach to optimal performance when compared to approaches such as genetic algorithms. The approach used for this project is described in Algorithm 1.

Algorithm 1 Asteroid Smasher Iterative Learning |

1: Initialize FIS 2: Set learning rate 3: for each epoch do4: Create/choose a scenario 5: Test scenario with the current FIS and save score in the interval [0, 1] 6: fori in range(k) do7: Create a alteration vector h with adjustments to each TSK output parameter8: Create a new fuzzy system by adding h to the baseline's parameters9: Test scenario with altered fuzzy system and save score in the interval [0, 1] 10: end for11: Calculate maximum score difference 12: Create the new current FIS with 13: end for |

In other words, we created a number of slightly modified fuzzy systems per scenario and borrowed slightly from the alterations made to the best performer. The degree to which the modification is added to the baseline is a product of the alteration, the given learning rate, and the difference between the maximum score and baseline. Some stability was gained by adding the last condition; scores that were only slightly better appeared to warrant proportionally smaller adjustments.

Scenario creation constitutes an important component of this process. For some problems, this may be trivial, but others may require consideration of the problem, goal, and available resources. To train for the competition, highly difficult and random scenarios with a very large number of asteroids were generated in lieu of a battery of scenarios for reasons primarily related to ease of implementation.

#### 3.2.4. Experimental results

Applying the discussed iterative learning method resulted in a story of minor but noticeable performance increase across key competition metrics such as score, win rate in difficult scenarios, and number of successful hits on asteroids.

Performance comparisons of the base model, the product of optimization, and a model where TSK outputs were randomly initialized are shown in Figure 3 and Figure 4. It is important to note that, in this application, a manually tuned agent served as the starting point of optimization rather than the randomly initialized one.

Figure 3. Comparisons of average game score and number of victories in challenging scenarios for the manually tuned model (base), the product of reinforcement learning (RL), and randomly initialized models (random).

Even a sizable rolling average across epochs still illustrates notable instability. Part of this is likely attributable to the randomness innate to the extreme training scenarios. However, testing methods similar to those described previously for increasing the stability of RL may constitute a key area of further research.

#### 3.2.5. Explainability

An advantage of the FIS is its ability to encapsulate complex behavior in an interface that deals in partial truths and natural language; functions and interactions that would prove difficult to interpret otherwise are meaningfully organized in comprehensible fuzzy sets and rules. More fundamentally, dealing with partial truths can prove more familiar and intuitive than crisp logic to a human expert.

In this case, the developed systems are comprised of a natural language rule set and associated fuzzy sets that lend themselves to visualization. Figure 5 illustrates this with plots of antecedent fuzzy sets and

Figure 5. Antecedent fuzzy sets, optimized TSK consequents, and associated rules for determining ship thrust.

As previously discussed, the explainability of the FIS in applications such as control systems is already well established. Rather than further clarifying this, the described algorithm and Asteroid Smasher example instead serve to delineate and test an alternative to more commonly employed methods for the optimization of existing FIS-based architectures. Successful application here offers experimental evidence to join the relatively scarce literature on developing fuzzy systems with RL as opposed to historically prevalent approaches such as genetic algorithms.

### 3.3. Case study: reinforcement learning ANFIS for classic control environments

#### 3.3.1. Introduction to ANFIS

The intersection of fuzzy logic and RL stands to offer more than an alternative method of post-hoc optimization for the former: notoriously opaque NN architectures also stand to benefit from integration with explainable fuzzy systems. In other words, neuro-fuzzy systems offer a means for taking advantage of the strengths of artificial NNs while peering a little further into the black-box models that find widespread use in RL applications and elsewhere.

To bolster experimental results in this domain, we tested an ANFIS^{[30]} (Figure 6) on OpenAI's classical control environment CartPole^{[51]}. An example of the environment is shown in Figure 7. The ANFIS extends the TSK system by allowing the parameters of the fuzzy rules to be learned via gradient descent optimization rather than using fixed constants for the parameters of the antecedents and the consequences. Towards this end, the ANFIS is combined with a NN to learn representations from the inputs, which are then fuzzified by the network. Outputs of the ANFIS layers are calculated by multiplying the fuzzified output of the NN by antecedents of rules. Then, the antecedents are multiplied by learnable parameters and summed to form the consequences. In the case of multiple-consequent models where the output dimension is more than a single value, we used a modified ANFIS structure: the Multioutput Adaptive Neuro-Fuzzy Inference System (MANFIS)^{[52]}. This method creates a separate rule base for each output dimension but follows the defuzzification process as a typical ANFIS. This process is shown in Figure 8.

Figure 8. Flowchart for ANFIS model. NN output undergoes fuzzy rule evaluation and normalization. Separately, network output is transformed according to learned

A NN of abstract shape is used and can be configured as desired. After passing the input through the NN, the output is expanded in dimensions to match the number of rules. This is then passed through the fuzzy rules to calculate their firing levels prior to normalization along each output dimension. Then, in the case of a

#### 3.3.2. CartPole-v1

The CartPole-v1 environment has four variables for the input: the position of the cart, the velocity of the cart, the angle of the pole, and the angular velocity of the pole. The expected output has two actions: move left and move right. The job of the models is to learn and maximize the

#### 3.3.3. ANFIS DQN vs DQN

We compare the ANFIS DQN architecture proposed above with a standard DQN architecture. Both models are equipped with a Double DQN structure with soft updates and have a similar number of trainable parameters; 17,410 for the DQN and 17,407 for the ANFIS DQN. Both models are also equipped with the same optimization method, learning rate, etc., and trained for 100,000 iterations. The appendix contains parameters and hyperparameters.

#### 3.3.4. Experimental results

For testing the results of the two models, we look at the mean reward of 10 testing environments every 10,000 iterations. The goal is to have the agent hit a mean reward of 500 for all 10 test environments.

From Figure 9, we can see that both models can learn the environment. In some cases, we can see that the ANFIS DQN model learns the environment quicker, such as in seed 9 and seed 42. While in other cases, it seems to perform about as well as the DQN, such as in seeds 109 and 131. The most interesting case is in seed 42, where the ANFIS DQN and DQN solve the environment in the same number of steps, but the ANFIS DQN learns it with more stability while not falling off after learning and continues to have a solution after 60,000 iterations. This trend also appears in seed 9, where after 70,000 iterations, the ANFIS DQN has a stable solution. From these tests, we can see that the ANFIS DQN is able to match, if not slightly outperform, the DQN agent.

#### 3.3.5. Explainability

As suggested, a primary advantage of this approach over non-fuzzy DQN concerns the capacity for explainability; one may visualize the fuzzy rule components before and after training to help in understanding the changes made during optimization and the resulting behavior.

To illustrate this, an example concerning the approximation of a mathematical function is shown in Figure 10. In this case, Gaussian functions corresponding to fuzzy rules expand and contract following gradient descent to afford a close estimate of the function. Notably, one Gaussian function depicted by the straight line in Figure 10b becomes insignificant after training. Not only does this indicate potential robustness to network overparameterization, but the ability to visualize the components of the system in this way also highlights a propensity for troubleshooting architectural design and training.

Figure 10. Comparison of rules for approximating a function before and after optimization. The straight line represents a rule that becomes insignificant after training.

The CartPole example represents a jump in complexity that corresponds with increased challenges to helpful visualization; the 32 rules for movement in either direction stand to benefit from a more directed form of presentation. However, the same principles of visual feedback for training and resulting behavior apply. Both before and after training, one may observe the rule base of the ANFIS DQN and extract important information about the decisions of the agent.

We can see how some of the behaviors are shaped by analyzing the distribution and coverage of antecedents in Figure 11. In the current visualization, it may be hard to explain all aspects of interaction, but we can investigate some possible behaviors. The blue curves indicate the rules for moving the cart to the left, and the red curves describe movement to the right. The domain is the domain for input space from CartPole. Each input that is passed into the rules comes from the output of a NN. The blue curves are related to moving the cart to the left, and the red curves for moving the cart to the right. While we cannot see exactly what inputs the rules are related to, we can see some trends that indicate what each rule may be impacting. In the beginning, both sets of colored curves are fairly uniform around the origin. After training, we can see that they have spread out to cover more domains, and some curves are wider. Some interesting things to note are there are two inputs in the observation space for cart velocity and pole angle. Negative values are associated with the pole and the cart moving to the left. We can see how some of the blue curves moved further to the left, and the red curves moved further to the right. This is because the intelligent action of moving the cart the opposite way the pole is moving can help correct the position of the pole. These small insights can give us some explanation as to what the network is doing when making decisions. With a better visualization, we could see what each rule is specifically looking at in the input related to the output.

Figure 11. Comparison of rule antecedent distributions for moving left (blue) and right (red) in the CartPole environment. Input consists of encoded cart position, cart velocity, pole angle, and pole angular velocity.

It should be noted that there is also much unique potential for experimentation with novel explainability mechanisms. Further research of ANFIS DQN in this vein could support efforts in quantifying changes to rules after training, identifying rules that become insignificant, highlighting substantial activations for any one state, and exploring aggregations that describe overall trends in behavior. While similar ideas have certainly been explored to aid the interpretability of traditional NNs, the capacity for visualization offered by FIS-based architectures represents a significant opportunity due to the reasons discussed and the relative lack of existing literature in this domain specifically.

## 4. CONCLUSION

In the present paper, we study fuzzy systems controlled via RL and multiple applications of this framework. The first environment considered is the Asteroid Smasher game in which the associated agent made use of fuzzy systems trained via an actor-critic RL algorithm. This agent won XFC 2022, an Explainable AI competition emphasizing fuzzy logic. We also compared ANFIS DQN architectures with classic DQN architectures and demonstrated that the fuzzy systems perform similarly or, in some cases, better than DQN systems. More significantly, we highlighted the amenability of TSK-based architectures to explainability and visualization when compared to more traditional NNs.

Grounds for further research include an exploration of possible improvements to the stability of the RL algorithm used to tackle the Asteroid Smasher environment. With respect to ANFIS architectures, additional exploration of applications, network architectures, membership functions, and unique explainability mechanisms could also prove valuable for both performance and explainability. In both cases, there are extensions such as improved experience replay, dueling DQN, and distributional learning that could afford improved results.

The most significant takeaway concerns the relatively underinvestigated and mutually beneficial relationship between RL and fuzzy logic. Despite trends in the former and the value that each stands to offer the other, alternative methods are more commonly employed to optimize fuzzy systems. Moreover, promising neuro-fuzzy frameworks face relative obscurity in light of the steadfast need for trust and interpretability in domains where NNs are pervasive. Consequently, additional experimentation serves to afford value when building performant and explainable applications.

## DECLARATIONS

### Authors' contributions

Wrote and reviewed the paper: Zander E, van Oostendorp B, Bede B

### Availability of data and materials

There are no applicable datasets, but the implementation of a PyTorch compatible ANFIS layer is available here if relevant: https://github.com/Squeemos/pytorch_anfis.

### Financial support and sponsorship

None.

### Conflicts of interest

All authors declared that there are no conflicts of interest.

### Ethical approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Copyright

© The Author(s) 2023.

## ANFIS AND DQN HYPERPARAMETERS

Most hyperparameters were taken from previous work/recommended defaults of libraries

1. DQN Structure

● 2 linear layers with 128 nodes and ReLU activation functions

2. ANFIS Structure

● 2 linear layers with 128,127 nodes, respectively, and ReLU activation functions

● 16 rules, with the mean of the Gaussian membership function sampled from the Normal distribution and scaled by 2 to increase rule-base coverage, and the standard deviations sampled from the uniform distribution

● Learnable parameters/biases for summation sampled from the Normal distribution and scaled by 2

3. Optimizer: ADAM

4. Learning rate: .001

5. Gamma discount factor: 0.99

6. Max replay memory size: 10,000

7. Batch Size: 128

8. Begin training after: 1,000 iterations

9. Epsilon start: 1.0

10. Epsilon end: .01

11. Epsilon decay: After 20,000 iterations in a linear fashion

12. Gradient clipping norm: 10

13. Target network update iterations: 100

14. Tau soft update parameter: .001

## REFERENCES

2. Mamdani EH, Assilian S. An experiment in linguistic synthesis with a fuzzy logic controller. *Int J Man-Mac Studies* 1975;7:1-13.

3. Takagi T, Sugeno M. Fuzzy identification of systems and its applications to modeling and control. In: *IEEE Trans Syst, Man, Cybern* 1985;8: 116–32.

4. Precup RE, Hellendoorn H. A survey on industrial applications of fuzzy control. *Comput Industry* 2011;62:213-26.

5. Pirovano M. The use of fuzzy logic for artificial intelligence in games. *University of Milano, Milano* 2012. Available from:

6. Russell SJ. Artificial intelligence a modern approach. Pearson Education, Inc.; 2010.

7. Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge. *Nature* 2017;550:354-59.

8. Lakhani AI, Chowdhury MA, Lu Q. Stability-preserving automatic tuning of PID control with reinforcement learning. *Complex Eng Syst* 2022;2:3.

10. Zhao E, Yan R, Li J, Li K, Xing J. AlphaHoldem: high-Performance artificial intelligence for heads-up no-limit poker via end-to-end reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36; 2022. pp. 4689–97.

11. Shao K, Tang Z, Zhu Y, Li N, Zhao D. A survey of deep reinforcement learning in video games. *arXiv preprint arXiv: 191210944* 2019.

12. Chen J, Yuan B, Tomizuka M. Model-free deep reinforcement learning for urban autonomous driving. In: 2019 IEEE intelligent transportation systems conference (ITSC). IEEE; 2019. pp. 2765–71.

13. Walker O, Vanegas F, Gonzalez F, Koenig S. A deep reinforcement learning framework for UAV navigation in indoor environments. In: 2019 IEEE Aerospace Conference. IEEE; 2019. pp. 1–14.

14. Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. *J Inf Process Syst* 2022;35:27730-44.

15. Mundhenk TN, Chen BY, Friedland G. Efficient saliency maps for explainable AI. *arXiv preprint arXiv: 191111293* 2019.

16. Borys K, Schmitt YA, Nauta M, et al. Explainable AI in medical imaging: an overview for clinical practitioners–Beyond saliency-based XAI approaches. *Eur J Radiol* 2023:110786.

17. Holzinger A, Saranti A, Molnar C, Biecek P, Samek W. Explainable AI methods-a brief overview. In: xxAI-Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers. Springer; 2022. pp. 13–38.

19. Mencar C, Alonso JM. Paving the way to explainable artificial intelligence with fuzzy modeling. In: Fuzzy Logic and Applications. Springer; 2019. pp. 215–27.

20. Viaña J, Ralescu S, Cohen K, Ralescu A, Kreinovich V. Extension to Multidimensional Problems of a Fuzzy-Based Explainable and Noise-Resilient Algorithm. In: Decision Making Under Uncertainty and Constraints: A Why-Book. Springer; 2023. pp. 289–96.

21. Pickering L, Cohen K. Genetic Fuzzy Controller for the Homicidal Chauffeur Differential Game. In: Applications of Fuzzy Techniques: Proceedings of the 2022 Annual Conference of the North American Fuzzy Information Processing Society NAFIPS 2022. Springer; 2022. pp. 196–204.

22. Pickering L, Cohen K. Toward explainable AI—genetic fuzzy systems—a use case. In: Explainable AI and Other Applications of Fuzzy Techniques: Proceedings of the 2021 Annual Conference of the North American Fuzzy Information Processing Society, NAFIPS 2021. Springer; 2022. pp. 343–54.

23. Fernandez A, Herrera F, Cordon O, del Jesus MJ, Marcelloni F. Evolutionary fuzzy systems for explainable artificial intelligence: Why, when, what for, and where to? *IEEE Comput Intell Mag* 2019;14:69-81.

24. Berenji HR. A reinforcement learning—based architecture for fuzzy logic control. *Int J Approx Reason* 1992;6:267-92.

25. Glorennec PY, Jouffe L. Fuzzy Q-learning. In: Proceedings of 6th international fuzzy systems conference. vol. 2. IEEE; 1997. pp. 659–62.

26. Er MJ, Deng C. Online tuning of fuzzy inference systems using dynamic fuzzy Q-learning. *IEEE Trans Syst Man Cybern B Cybern* 2004;34:1478-89.

27. Jamshidi P, Sharifloo AM, Pahl C, Metzger A, Estrada G. Self-learning cloud controllers: Fuzzy q-learning for knowledge evolution. In: 2015 International Conference on Cloud and Autonomic Computing. IEEE; 2015. pp. 208–11.

28. Kumar S. Learning of Takagi-Sugeno Fuzzy Systems using Temporal Difference methods. DigiPen Institute of Technology; 2020.

29. Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. *Nature* 2015;518:529-33.

30. Jang JSR. ANFIS: adaptive-network-based fuzzy inference system. *IEEE Trans Syst, Man, Cybern* 1993;23:665-85.

31. King SD. Explainable AI competition; 2023. Accessed on 2/15/2023. Available from:

33. Bellman R. A markovian decision process. *Indiana Univ Math J* 1957;6: 679–84. Available from:

35. Vinyals O, Babuschkin I, Czarnecki WM, et al. Grandmaster level in StarCraft Ⅱ using multi-agent reinforcement learning. *Nature* 2019;575:350-54.

36. Lin LJ. Self-improving reactive agents based on reinforcement learning, planning and teaching. *Mach Learn* 1992;8:293-321.

39. van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. arXiv; 2015.

40. Polyak BT, Juditsky AB. Acceleration of Stochastic Approximation by Averaging. *SIAM J Control Optim* 1992;30:838-55.

41. Rosenstein MT, Barto AG, Si J, et al. Supervised actor-critic reinforcement learning. *Learning and Approximate Dynamic Programming: Scaling Up to the Real World* 2004: 359–80. Available from:

42. Yan X, Deng Z, Sun Z. Competitive Takagi-Sugeno fuzzy reinforcement learning. In: Proceedings of the 2001 IEEE International Conference on Control Applications (CCA'01)(Cat. No. 01CH37204). IEEE; 2001. pp. 878–83.

43. Jaakkola T, Jordan M, Singh S. Convergence of stochastic iterative dynamic programming algorithms. *Adv Neural Inf Proces Syst* 1993;6. Available from:

44. Melo FS. Convergence of Q-learning: a simple proof. *Inst Syst Robot, Tech Rep* 2001;1–4. Available from:

45. Cybenko G. Approximation by superpositions of a sigmoidal function. *Math Control Signal Systems* 1989;2:303-14.

46. Yarotsky D. Error bounds for approximations with deep ReLU networks. *Neural Netw* 2017;94:103-14.

47. Melo FS, Ribeiro MI. Q-learning with linear function approximation. In: Learning Theory: 20th Annual Conference on Learning Theory, COLT 2007, San Diego, CA, USA; June 13-15, 2007. Proceedings 20. Springer; 2007. pp. 308–22.

48. Papavassiliou VA, Russell S. Convergence of reinforcement learning with general function approximators. In: IJCAI. vol. 99; 1999. pp. 748–55. Available from:

49. Ying H. General Takagi-Sugeno fuzzy systems are universal approximators. In: 1998 IEEE International Conference on Fuzzy Systems Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36228). vol. 1. IEEE; 1998. pp. 819–23.

50. Bede B. Mathematics of Fuzzy Sets and Fuzzy Logic. Springer; 2013.

52. Benmiloud T. Multioutput Adaptive Neuro-Fuzzy Inference System. In: Proceedings of the 11th WSEAS International Conference on Nural Networks and 11th WSEAS International Conference on Evolutionary Computing and 11th WSEAS International Conference on Fuzzy Systems. NN'10/EC'10/FS'10. Stevens Point, Wisconsin, USA: World Scientific and Engineering Academy and Society (WSEAS); 2010. p. 94–98.

## Cite This Article

Export citation file: **BibTeX** | **EndNote** | **RIS**

**OAE Style**

Zander E, van Oostendorp B, Bede B. Reinforcement learning with Takagi-Sugeno-Kang fuzzy systems. *Complex Eng Syst* 2023;3:9. http://dx.doi.org/10.20517/ces.2023.11

**AMA Style**

Zander E, van Oostendorp B, Bede B. Reinforcement learning with Takagi-Sugeno-Kang fuzzy systems. *Complex Engineering Systems*. 2023; 3(2): 9. http://dx.doi.org/10.20517/ces.2023.11

**Chicago/Turabian Style**

Eric Zander, Ben van Oostendorp, Barnabas Bede. 2023. "Reinforcement learning with Takagi-Sugeno-Kang fuzzy systems" *Complex Engineering Systems*. 3, no.2: 9. http://dx.doi.org/10.20517/ces.2023.11

**ACS Style**

Zander, E.; van Oostendorp B.; Bede B. Reinforcement learning with Takagi-Sugeno-Kang fuzzy systems. *Complex. Eng. Syst.* **2023**, *3*, 9. http://dx.doi.org/10.20517/ces.2023.11

## About This Article

### Special Issue

### Copyright

**Open Access**This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Data & Comments

### Data

### Comments

Comments must be written in English. Spam, offensive content, impersonation, and private information will not be permitted. If any comment is reported and identified as inappropriate content by OAE staff, the comment will be removed without notice. If you have any queries or need any help, please contact us at support@oaepublish.com.

^{0}