I think you meant LoRA (not to be confused with LoRa)
With reinforcement learning, specifically actor-critic, the actor doesn't train against a dataset; it trains against the critic. The critic is supposed to approximate the value function: the immediate cost of an action plus the predicted future cost, assuming you choose the optimal action at every subsequent step, including that action's impact on future actions. If you have a simple supervised cost function, the critic ends up averaging over the past loss signals; you could say the critic is a compressed copy of the training data. When you train the actor, you're effectively taking not only the new data but also the old data into account.
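To make that concrete, here is a minimal actor-critic sketch in PyTorch (my own toy, not a reference implementation; the network sizes, learning rate, and reward convention are placeholder assumptions):

    import torch
    import torch.nn as nn

    state_dim, n_actions, gamma = 4, 2, 0.99
    actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                          nn.Linear(32, n_actions))
    critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                           nn.Linear(32, 1))
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def update(state, action, reward, next_state):
        # Critic: regress V(s) toward a bootstrapped target. Over many
        # transitions this averages the raw training signal -- the
        # "compressed copy of the training data" effect described above.
        with torch.no_grad():
            target = reward + gamma * critic(next_state)
        critic_loss = (target - critic(state)).pow(2).mean()
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: trained against the critic's judgment, not the dataset.
        advantage = (target - critic(state)).detach()
        log_prob = torch.log_softmax(actor(state), dim=-1)[action]
        actor_loss = -(advantage * log_prob).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()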
So, in a way, catastrophic forgetting is sort of solved, but not really. If you add new data, your critic will slowly drift toward the new data distribution. So the problem wasn't solved, but you certainly managed to delay it. Delaying the problem is good, though. What if you could delay it even more? What if you could delay it forever?
Here is my stupid and simple unproven idea: nest the reinforcement learning algorithm. Each critic adds one more level of delay, thereby acting as a low-pass filter on the supervised reward function. With two critics, you can essentially implement a hybrid pre-training + continual-learning architecture. The most interesting aspect is that you can keep training the innermost critic without changing the outer critic, which now acts as a learned loss function.
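A hedged sketch of what I mean (entirely hypothetical, and the names are mine): pre-train the outer critic on the supervised signal, freeze it, and let it supply the reward that the inner critic keeps distilling. Continuing the setup from the sketch above:

    outer_critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                 nn.Linear(32, 1))  # pre-trained, then frozen
    inner_critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                                 nn.Linear(32, 1))  # continually trained
    inner_opt = torch.optim.Adam(inner_critic.parameters(), lr=1e-3)
    for p in outer_critic.parameters():
        p.requires_grad_(False)  # the outer critic is now a learned loss

    def inner_update(state, next_state):
        # The frozen outer critic replaces the raw supervised reward, so
        # the inner critic drifts toward new data one filter stage later.
        with torch.no_grad():
            target = outer_critic(state) + gamma * inner_critic(next_state)
        loss = (target - inner_critic(state)).pow(2).mean()
        inner_opt.zero_grad(); loss.backward(); inner_opt.step()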
*: it is possible to measure how much a part of a prompt helps with a task, e.g. by measuring the change in entropy
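For illustration, one way this could look (a sketch under assumptions: GPT-2 via Hugging Face transformers is a stand-in model, the prompts are made up, and retokenization effects at the prompt/target boundary are ignored):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def mean_entropy(prompt, target):
        # Average entropy of next-token distributions over the target span.
        prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
        ids = tok(prompt + target, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0]          # (seq_len, vocab)
        # Position i predicts token i+1, so the distributions over the
        # target tokens start at the last prompt position.
        probs = torch.softmax(logits[prompt_len - 1:-1], dim=-1)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        return ent.mean().item()

    base = mean_entropy("cat ->", " chat")
    helped = mean_entropy("Translate to French: cat ->", " chat")
    print(base - helped)  # positive: that prompt part reduced uncertainty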