By now, technologists around the world are well aware of what ChatGPT can and cannot do. The pros and cons, the issues in its code, the places where it hallucinates… a long list has been compiled, so I would like to believe a complete UAT has been performed by now 😊.
And amid the hullabaloo since its launch in November 2022, and especially now with the release of GPT-4, there has been no shortage of expert commentary claiming that it will take away most jobs in technology (and elsewhere) and replace human beings in every sphere of commerce, science, and the arts.
There is also a whole world out there of people trying to pass their tests and complete their assessments with ChatGPT, while a parallel world is already thinking beyond ChatGPT.
My short personal take: having heard the ‘this time it’s different’ refrain over a fairly long career in the field, I would like to believe that, as with any other ‘intelligent’ software or IT breakthrough, it will not take away IT jobs but will shake things up significantly, making some skills redundant while making others essential. As with most IT solutions, the opportunity (for scale) will lie in utilizing Large Language Models (like ChatGPT) for business advantage, by integrating them with social apps or simply with productivity apps (which we are already experiencing with Microsoft Copilot for the Office suite and GitHub Copilot for coding).
But I am neither a futurist nor a test taker, so I will not be discussing the jobs that ChatGPT could replace, nor how we might replicate it. Instead, let us take a step ahead or, as I would like to put it, a step back, and explore the underlying beauty of the algorithms on which ChatGPT and many other large language models are based, and the areas where they can potentially be improved. Can we improve Reinforcement Learning (RL) and use it in conjunction with other methods as appropriate?
I will divide this article into two parts: in the first we will explore the options, and in the second we will discuss those options in detail and how to use them alongside RL.
Reinforcement Learning from Human Feedback (RLHF) can be used to train an agent to perform a specific task, such as guiding someone to solve a Rubik's cube. The agent learns through trial and error, attempting various actions and receiving feedback on its performance from human interaction. Feedback can be numerical, such as a penalty or reward, or more abstract, such as spoken or textual appreciation or criticism. So is it only RL that can provide the versatility and agility we need, or can we look beyond Reinforcement Learning, given the challenge that RL with human feedback always requires guidance from humans?
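To make the idea concrete, here is a minimal, hypothetical sketch of that feedback loop in Python: the agent tries actions, a human (simulated here by a placeholder `rate_action` function) scores each attempt, and that score is used as the reward. The action names and the rating function are illustrative assumptions, not part of any real RLHF system.

```python
# A minimal sketch of the RLHF idea described above: the agent proposes an
# action, a human rates it, and the rating becomes the reward signal.
# `rate_action` is a hypothetical stand-in for real human feedback; in
# practice it would be replaced by collected preferences or a learned reward model.
import random

ACTIONS = ["hint_top_layer", "hint_middle_layer", "hint_bottom_layer"]

def rate_action(action: str) -> float:
    """Placeholder for human feedback: returns a score in [-1, 1]."""
    return random.uniform(-1.0, 1.0)  # a real system would ask a person

# Simple bandit-style value estimates, one per action.
values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
epsilon, episodes = 0.1, 200

for _ in range(episodes):
    # Trial and error: mostly exploit the best-rated action, sometimes explore.
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(values, key=values.get)

    reward = rate_action(action)  # human feedback used as a scalar reward
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # running mean

print("Learned preferences:", values)
```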
This is where we unveil Imitation Learning and Transfer Learning.
Current AI systems are capable of complex decision-making, playing complex strategy games like Minecraft, or manipulating a jigsaw puzzle, yet they frequently require over 100 million interactions with an environment (the equivalent of more than 100 years of human experience) to reach human-level performance. In contrast, by observing an expert, a human can learn new skills in a relatively short period of time. How can we teach our artificial agents to learn as quickly as humans do?
Imitation Learning: Imitation is an important aspect of human learning; it is the ability to mimic and learn behavior from an instructor or an expert, and it is how we pick up new skills such as riding a bike or speaking a foreign language. In machine learning terms, it is a type of supervised learning in which an agent learns a task by observing and mimicking the actions of a human expert. Imitation learning can be used to rapidly bootstrap the learning process and provide a solid foundation for the agent to build on. To use imitation learning with RLHF, the agent must learn to perform the task by observing and copying the human expert's actions; as needed, the expert can provide additional feedback and guidance to help the agent learn more effectively.
Combining imitation learning (Source) with reinforcement learning mechanisms can therefore improve the speed and accuracy of learning. At present, imitation learning methods fall into three categories: behavior cloning (BC), inverse reinforcement learning (IRL), and generative adversarial imitation learning (GAIL). We will discuss these in detail in the next part.
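As a taste of the first category, here is a minimal behavior-cloning sketch in PyTorch, assuming we already have a set of expert (state, action) demonstrations; the dimensions and the random tensors standing in for expert data are purely illustrative.

```python
# A minimal behavior-cloning (BC) sketch: imitation treated as supervised
# learning, predicting the expert's action from the state. The random tensors
# below are placeholders for real expert demonstrations.
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4
expert_states = torch.randn(1000, state_dim)           # placeholder demonstrations
expert_actions = torch.randint(0, n_actions, (1000,))  # placeholder expert labels

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    logits = policy(expert_states)          # predict action logits from states
    loss = loss_fn(logits, expert_actions)  # match the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The cloned policy can then bootstrap an RL(HF) agent before fine-tuning with feedback.
```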
Transfer Learning: This optimization method can greatly improve the generalization of the original model as well as the speed of modelling new tasks. Data sets from the assigned task are collected and used to train the skill model. The learned model's knowledge can then be transferred to the learning agent, yielding a new model capable of reproducing the manipulation in new environments. However, because there is a reality gap between simulation and reality, agents find it difficult to transfer what they have learned; if the policy is trained in a flawed simulation, it will not be able to adapt to changes in the external environment. These two learning techniques, along with other ensemble models of RL, work wonders.
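Here is a minimal transfer-learning sketch, under the assumption that a "skill model" has already been trained on the source task: we keep its feature layers frozen and fit only a new head on the target task. All layer sizes and the placeholder data are assumptions for illustration.

```python
# A minimal transfer-learning sketch: reuse the feature layers of a model
# trained on the source task and train only a new head for the target task.
import torch
import torch.nn as nn

# Pretend this was already trained on the original (source) task.
skill_model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),   # feature extractor (transferred)
    nn.Linear(64, 10),              # source-task head (discarded)
)

feature_extractor = skill_model[:2]
for p in feature_extractor.parameters():
    p.requires_grad = False         # keep the transferred knowledge fixed

new_head = nn.Linear(64, 3)         # assumed: the new task has 3 classes
transfer_model = nn.Sequential(feature_extractor, new_head)

optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder data for the new task.
x_new = torch.randn(256, 16)
y_new = torch.randint(0, 3, (256,))

for epoch in range(10):
    loss = loss_fn(transfer_model(x_new), y_new)  # only the new head is updated
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```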
If you’re a NeurIPS fan, you must surely have read about Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) and Trust Region-Guided Proximal Policy Optimization (TR-guided PPO), which describe ensemble models in Reinforcement Learning and can be used to enhance the ability to explore within the trust region; these ensemble models have a better performance bound than the original policy optimization models.
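For a flavour of the trust-region idea these methods build on, here is a minimal sketch of PPO's clipped surrogate objective (not the full ME-TRPO or TR-guided PPO algorithms); the tensors are random placeholders standing in for real policy log-probabilities and advantages.

```python
# A minimal sketch of the clipped surrogate objective PPO uses to keep policy
# updates inside a "trust region" around the old policy.
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss: discourages moving too far from the old policy."""
    ratio = torch.exp(log_probs_new - log_probs_old)                 # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # maximize the surrogate

# Example usage with placeholder values.
log_probs_old = torch.randn(64)
log_probs_new = log_probs_old + 0.05 * torch.randn(64)
advantages = torch.randn(64)
print(ppo_clip_loss(log_probs_new, log_probs_old, advantages))
```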
In closing, I would suggest that teachers, lecturers, and trainers not be afraid of ChatGPT, regardless of its new and upgraded versions. Rather, they should teach their students how and when to use it wisely. Learners should be aware that copying code snippets may assist them in assessments but will not benefit them during interviews or in-person interactions (based on my personal experience). Therefore, we should not ban this innovative invention but instead determine where to draw the line, so that we use it wisely and to our benefit while maintaining our originality. Ciao, until next time.