I think the postings in my link are quite clear: Mark Ayres had to lift the narration from the combined single track of the original release because he hadn't retained it as a separate track. You create a mask from the known music and remove it from the sound image, leaving just the dialogue and effects.
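For anyone curious what "make a mask from the known music and remove it" could look like in practice, here is a rough toy sketch. It is not Mark Ayres's actual workflow, which the linked posts don't spell out at this level. It assumes the known music is already perfectly time-aligned and level-matched to the mix, uses pure tones as stand-ins for music and narration, and every name in it (`remove_known_music`, `frame_size`, and so on) is made up for illustration.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (fine for a tiny demo, slow for real audio)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def remove_known_music(mix, music, frame_size=64):
    """Attenuate each frequency bin of `mix` by the share the known `music` occupies.

    Hypothetical helper: assumes `music` is time-aligned and level-matched to `mix`.
    """
    out = []
    for start in range(0, len(mix) - frame_size + 1, frame_size):
        mix_spec = dft(mix[start:start + frame_size])
        music_spec = dft(music[start:start + frame_size])
        masked = []
        for m, k in zip(mix_spec, music_spec):
            # Soft mask: 0 where the music accounts for all the energy, 1 where it has none.
            mask = max(0.0, 1.0 - abs(k) / abs(m)) if abs(m) > 1e-9 else 0.0
            masked.append(m * mask)
        out.extend(idft(masked))
    return out


# Toy signals: "music" and "narration" are pure tones at different frequencies.
N = 256
music = [0.8 * math.sin(2 * math.pi * 4 * t / 64) for t in range(N)]
narration = [0.5 * math.sin(2 * math.pi * 10 * t / 64) for t in range(N)]
mix = [a + b for a, b in zip(music, narration)]

recovered = remove_known_music(mix, music)
worst_error = max(abs(r - n) for r, n in zip(recovered, narration))
```

In this toy case the two tones never share a frequency bin, so `recovered` comes out essentially identical to `narration`. Real narration and music overlap in frequency all the time, and wherever they do, a mask like this takes some of the voice out along with the music.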
I figured that is where you were going. I've heard of this being used before for various needs. An analogy might be how they can extract a subject from the background in a picture, or when they run a process to remove print damage from video.
While it sounds simple enough in concept, I have to wonder to what degree it has been perfected, and how much ongoing configuration and touch-up might be required on a scene-by-scene basis. Recall that in some movies where DNR (digital noise reduction) was applied, it mistakenly removed actual content. Also, I remember hearing that with Star Trek TOS they had to go through frame by frame manually to remove damage the automated process didn't find.
So I'm not discounting the approach, but given the amount of background music and the number of episodes, the effort may also be cost prohibitive.