Much of my work has been to explain how the harmonicity (or F0) cue is used in auditory scene analysis (ASA). Harmonicity is the most powerful among ASA cues. It is also the cue most often exploited in computational ASA systems and voice-separation systems.
Results have not been spectacular. I attribute this to the following factors: (a) Much effort has been invested in so-called "harmonic enhancement" (using the target's F0), which is intuitively appealing but fundamentally not very powerful, and not used by the auditory system. (b) The difficulty of F0 estimation in mixed speech, and of dealing with F0 errors. (c) Spectral distortion caused by the segregation process. (d) Lack of effectiveness at high SNRs, at which systems are usually tested (speech separation is most effective at low SNR). (e) The tendency to work in the spectral domain, which is less flexible than the time domain with respect to non-stationarity.
In my opinion, people have not yet discovered how to milk this cow. Harmonicity should give spectacular benefits if the following conditions are respected: