On the digital intelligence across modalities: Cheers and fears


Received: March 12, 2022; Accepted: March 21, 2022; Published: March 29, 2022

Humans observe and comprehend the world through their senses: sight, hearing, touch, smell, and taste. Most primate newborns, including human babies, are born with all five senses, which they use in a complementary fashion to understand the events and entities surrounding them. Modern Artificial Intelligence (AI), however, does not fully mirror human development, simply because AI models are developed in cyberspace, which contains different types of entities to recognize. As opposed to neonates born in hospitals surrounded by doctors and nurses, AI models are likely to be born on the internet, surrounded by millions of Instagram images and billions of Google words. Hence, AI scientists focus primarily on teaching digital models about their surrounding environment, which happens to center on two of the five senses that humans use: sight (images/videos) and hearing (audio/text). This work discusses developing AI models in digital environments and analyzes the mechanisms and implications of using those two senses to learn.

The main signals in cyberspace happen to be visual (images, videos) and acoustic (audio, text). These signals are all related and often co-exist to describe the same event or entity. Nonetheless, understanding each one is quite complex, and fusing all of them together is even harder.

Let us start with visual signals in the form of images and videos. The human eye is one of the most intricate organs of the human body for a reason: our environment presents a very wide range of electromagnetic waves that vary significantly in frequency and wavelength, which makes understanding the objects in a single image quite a hard task. Take, for instance, the problem of scene understanding in computer vision (Figure 1). An image (or a video frame) typically contains many foreground objects and a varying background. The objects usually differ in shape, type, texture, and color, and the background, lighting, and occlusion settings vary from one image to another. Taken together, these factors make developing a single system that can understand any given scene, despite the high degree of variance, an incredibly hard task.

