Advanced optimal cross-modal fusion mechanism for audio-video based artificial emotion recognition