It is exciting to push your imagination about where else you can apply AI, machine learning and, most certainly, deep learning, which is so popular these days. I came across a question on Quora that provoked me to think a bit about how one would go about training a neural network to lip read. I don't actually know what made me answer this question more: that I found myself in an unusual context, sitting at an AngularJS meetup at the Google offices in New York City (after work, the usual level of tired), or the question itself. Whatever the reason, here is my answer:
I would probably start by formalizing the lip reading process as an algorithm that a human can understand. Maybe it is worth talking to a professional, like a spy or something. Obviously you need training data. Understanding what lip reading is from an algorithmic perspective will affect what data you need.
- To read a word of several syllables you’d need a sequence of anchor lip positions that represent syllables. Or probably vowels / consonants. See, I don’t know which one is best. But you’d need to start with the lowest level possible, out of which you can compose larger sequences, like letters -> syllables -> words. Let’s call these states.
- A particular lip posture (is that the right word?) will most probably map to more than one state, so the mapping is ambiguous.
- Now the interesting part is how to resolve the ambiguities. The previous step produces several options for each posture. Out of these you can compose a multitude of words that we can call candidates.
- Then you need to score the candidates based on some local context information. Here it turns into a natural language understanding problem.
- I'd start with seq2seq.
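
To make the steps above a bit more concrete, here is a toy sketch in plain Python of the "ambiguous states -> candidates -> scoring" idea. It is not a neural network and not a real lip reader. The posture names, the posture-to-letter table, the vocabulary and the bigram counts are all made up by me for illustration; a real system would have to learn them from data.

```python
from itertools import product

# Hypothetical table: one lip posture maps to several possible letters.
# The groupings are illustrative assumptions, not measured data.
POSTURE_TO_LETTERS = {
    "lips_closed": ["b", "p", "m"],
    "mouth_open": ["a"],
    "tongue_behind_teeth": ["t", "d", "n"],
}

# Toy vocabulary used to filter letter combinations down to real words.
VOCABULARY = {"bat", "bad", "ban", "pat", "pad", "pan", "mat", "mad", "man"}

# Toy "local context": how often a word followed the previous word.
# These counts are invented for the example.
BIGRAM_COUNTS = {
    ("the", "man"): 5,
    ("the", "mat"): 3,
    ("the", "bat"): 2,
    ("a", "pan"): 4,
    ("a", "bat"): 1,
}


def candidates(postures):
    """Expand a posture sequence into every vocabulary word it could spell."""
    options = [POSTURE_TO_LETTERS[p] for p in postures]
    words = ("".join(letters) for letters in product(*options))
    return sorted(w for w in words if w in VOCABULARY)


def score(previous_word, word):
    """Score a candidate by how well it fits after the previous word."""
    return BIGRAM_COUNTS.get((previous_word, word), 0)


def best_candidate(previous_word, postures):
    """Pick the highest scoring candidate, or None if there are none."""
    return max(
        candidates(postures),
        key=lambda w: score(previous_word, w),
        default=None,
    )


if __name__ == "__main__":
    seen = ["lips_closed", "mouth_open", "tongue_behind_teeth"]
    print(candidates(seen))
    print(best_candidate("the", seen))
    print(best_candidate("a", seen))
```

The same three postures give nine candidate words here, and only the previous word decides between them. A seq2seq model would presumably learn both the posture mapping and the context scoring jointly instead of using hand-written tables like these.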