ChatGPT: It’s just the beginning …? Part-2
In the last article we discussed which methodologies we could use to improve ChatGPT further; from that discussion it emerged that Imitation Learning is something we can set our sights on.
Now it seems our very own trusted friend ChatGPT has certified the same 😉
Without further ado, let’s discuss Imitation Learning and its various types in detail.
As we discussed, early approaches to imitation learning use supervised learning to learn a policy, i.e., a machine learning model that maps environment observations to (optimal) expert actions. This method is known as Behavioural Cloning (BC), but it has a disadvantage: it can only repeat learned motions, not adapt to environmental changes. The critical issue is that when the agent lands in a situation that differs from every expert trajectory, BC is prone to failure.
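Since BC is just supervised learning on expert data, it fits in a few lines. Here is a minimal sketch on a made-up driving-style task: the two state features, the expert's weights, and the noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task: two state features (say, lane offset and heading
# error) and a scalar steering action produced by a noisy linear expert.
states = rng.normal(size=(500, 2))
true_weights = np.array([0.8, -0.5])          # unknown to the learner
expert_actions = states @ true_weights + 0.01 * rng.normal(size=500)

# Behavioural Cloning = plain supervised learning: regress expert actions
# on the observed states. Here, a least-squares fit of a linear policy.
weights, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)

def policy(state):
    # The cloned policy simply replays the learned mapping, nothing more:
    # it has no way to recover if it drifts off the expert's distribution.
    return state @ weights
```

Note that nothing in this fit ever interacts with the environment, which is exactly why errors compound once the agent leaves the expert's state distribution.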
For instance, a simulated car agent is unsure what to do if it deviates from the expert trajectory, and crashes. To avoid such mistakes, BC needs expert data covering all possible trajectories in the environment, which is impractical. It necessitates a human in the loop, and such interactive access to an expert is usually expensive. Instead, we want to mimic the trial-and-error process that humans use to correct errors. In the preceding example, if the car agent can interact with the environment to learn "if I drive this way, will I crash?", it can correct itself to avoid that behavior.
This realisation inspired formulating imitation learning as the problem of learning a reward function from expert data, such that a policy optimising that reward through environment interaction matches the expert. Since this inverts the usual reinforcement learning problem, the approach is known as Inverse Reinforcement Learning (IRL).
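The core idea, recovering a reward under which the expert looks good, can be sketched with a linear reward and feature-expectation matching, in the spirit of apprenticeship learning. The 2-D features and the two state distributions below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: states are 2-D feature vectors. The (hidden) expert prefers
# states with a large first feature; the learner only sees state samples.
expert_states = rng.normal(loc=[1.0, 0.0], scale=0.3, size=(200, 2))
random_states = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))

# Recover a linear reward r(s) = w @ s by feature-expectation matching:
# point w toward the expert's mean features and away from the non-expert's,
# so expert-like states score higher than random behaviour.
w = expert_states.mean(axis=0) - random_states.mean(axis=0)
w /= np.linalg.norm(w)

def reward(state):
    return state @ w
```

A policy that now maximises `reward` through environment interaction is pushed toward the states the expert visits, which is exactly the inversion of RL described above.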
A particularly influential formulation casts this as a minimax game between two models, with direct parallels to GANs, a class of generative models. In this formulation, the agent's policy model (the "generator") takes actions in the environment to obtain the highest rewards from a reward model using RL, while the reward model (the "discriminator") attempts to distinguish the policy's behavior from expert behavior, as in the figure below. The discriminator, like in GANs, functions as a reward model that indicates how expert-like an action is.
As a result, if the policy engages in non-expert-like behaviour, it receives a low reward from the discriminator and learns to correct this behaviour. The saddle point is the unique equilibrium of this minimax game: at equilibrium, the policy's behaviour is indistinguishable from the expert's, and the discriminator's learned reward reflects this. With adversarial learning of a policy and a discriminator, expert performance can be achieved from few demonstrations. Techniques inspired by this game are collectively called Adversarial Imitation. (An illustration of the method is shown in the figure below.)
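The whole minimax game can be sketched end to end on a toy 1-D action space. Everything here is illustrative and simplified: the discriminator is a two-parameter logistic model retrained each round, the "policy" is just a Gaussian mean `theta`, and the policy step follows the gradient of the discriminator-derived reward log D(a). A real adversarial-imitation method would use neural networks and an RL algorithm for the policy update.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.1
expert = rng.normal(2.0, sigma, size=256)  # expert actions cluster near 2.0
theta = 0.0                                # policy mean, the "generator"

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(300):
    actions = theta + sigma * rng.normal(size=256)  # sample policy actions

    # Discriminator step: fit a tiny logistic model D(a) = sigmoid(b0 + b1*a)
    # from scratch, labelling expert actions 1 and policy actions 0.
    b0, b1 = 0.0, 0.0
    for _ in range(20):
        for batch, label in ((expert, 1.0), (actions, 0.0)):
            p = sigmoid(b0 + b1 * batch)
            b0 += 0.5 * np.mean(label - p)
            b1 += 0.5 * np.mean((label - p) * batch)

    # Policy step: D acts as a reward saying how expert-like an action is.
    # Follow the pathwise gradient of E[log D(a)] with respect to theta.
    d = sigmoid(b0 + b1 * actions)
    theta += 0.05 * np.mean((1.0 - d) * b1)

# theta has been pulled from 0.0 toward the expert mean of 2.0; once they
# match, the discriminator can no longer separate the two and b1 collapses.
```

Note the equilibrium behaviour the text describes: as the policy approaches the expert, the discriminator's gradient (and hence the reward signal driving further change) vanishes.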
GAIL implements this by measuring the difference between the generated and expert policies; iterative adversarial training drives the agent's distribution as close as possible to the expert's. Generative Adversarial Networks (GANs) have thus been successfully applied to model-free policy imitation. A model-based variant, Model-based Generative Adversarial Imitation Learning (MGAIL), has also been proposed: it builds on a forward model, which allows exact discriminator gradients to be used to train the policy. Still, pure learning methods with simple reward functions frequently result in non-human-like and overly rigid movement behaviors.
Merel and his team extended GAIL to generate human-like motion patterns from limited demonstrations without access to actions. Their method builds low-level policies and shows how they can be reused to solve tasks when driven by a higher-level controller. Such policies are still vulnerable to cascading failures when the agent's trajectory deviates from the demonstrations. A variational auto-encoder (VAE) was also used to learn semantic policy embeddings, making the approach more robust than a supervised controller even with few demonstrations. With these embeddings, a version of GAIL can be built that avoids mode collapse and captures a wide range of behaviors.
Let’s talk about the pros and cons of these three types of imitation learning. Though BC is simple to implement and intuitive, it requires a large amount of data, and the learned policy cannot adapt to a new environment. IRL compensates for BC's shortcomings, but training remains time-consuming and costly. GAIL brings the generative-adversarial idea to imitation learning and outperforms the other two methods in high-dimensional settings, but mode collapse remains its major disadvantage.
So, if we don’t want to pay the high cost of data collection, or risk a locally optimal policy that learns poorly, we can also explore transfer learning, in which the model is trained in a simulation environment and the knowledge is then transferred to a real robot, acquiring manipulation skills more efficiently.
So… the question remains: from here to where???
To summarize, RLHF is a promising approach to training machine learning systems, particularly when defining a precise reward function for the agent to optimise is difficult or impractical. RLHF requires an instructor who can provide consistent and reliable feedback, which may demand extensive training and expertise. This can be a limitation where human resources are scarce or the feedback process is too complex or costly.
Moreover, RLHF is sensitive to biases and subjective judgment on the part of the human instructor, which can distort the agent's learning process; this is a problem where fairness or objectivity is essential. For some tasks, such as those that can be easily defined with a precise reward function, or that require a high degree of expertise or fine-tuning, RLHF may not be the most efficient or effective approach. Also, the feedback provided by the instructor may not be exhaustive or representative of all possible scenarios, so RLHF is not well-suited to tasks requiring a high degree of robustness. As a result, it is critical to weigh the strengths and limitations of RLHF carefully and to employ it in conjunction with other methods as needed.
This is only the beginning, and there is more to come (Picture abhi baaki hai mere dost..) Ciao!