Researchers at the Massachusetts Institute of Technology (MIT) let their AI system watch a year's worth of videos from the web in order to learn about sounds. Computers can now routinely identify objects, faces, animals, and landmarks, but the sounds of our world are still foreign to them.
A group of computer scientists from MIT has built on existing computer vision technology to give the system the ability to listen to what it is seeing. Humans connect sight and sound naturally and simultaneously, so why shouldn't our computers?
The result, after processing a year's worth of online videos, is "SoundNet". It can identify sounds that people commonly recognize, such as rain, a baby's babble, human noises, and some animals. Using both vision and hearing, the computer compared what it saw with the noises it was hearing, and in doing so learned associations such as the fact that babies tend to babble a lot. Having learned with the help of video, the computer can now identify noises without visual cues.
In testing, SoundNet scored 92% in identifying noises, with a few mistakes such as confusing footsteps with knocking on a door, or human laughter with the sound of hens. These are nothing that cannot be fixed, though; this result comes from only one year of media.
This type of development could greatly improve the field of computer listening. Voice assistants and sound recognition could use a revamp. Siri is a good example of the struggles voice AI still faces: ambient noise, stuttered words, and inflections in voices. Better sound recognition may give users the natural communication that our devices lack. Home voice AI could be truly interactive if it knew that a dog is barking or a cat is meowing or, more importantly, recognized the sound of a window shattering or a smoke alarm going off.
If we want natural communication, we need to give our computers the ability to experience their surroundings the same way we do. A device that can do so will know to ignore the dog's bark when we are trying to tell it something, or it could become the next great baby monitor, distinguishing between a happy baby and a crying one.
Rutkin, Aviva. 3 Nov 2016. "Binge-watching videos teaches computers to recognize sounds". Daily News.