The growing need for AI systems that can understand India’s diverse environments, languages, and contexts is bringing multimodal AI into sharper focus. A scene from the movie ‘Humans in the Loop (2024)’ offers a simple but powerful insight into how AI systems work, often with inadvertent biases. In one sequence, an AI model struggles to recognize images of Indian tribal communities because it has not been exposed to that kind of data before. As the protagonist begins collecting and labelling these images, adding more context and variety, the system gradually improves.
It is a vivid reminder that AI does not inherently understand the complex world. It learns from what it is shown. When certain realities are missing from the data, they remain invisible to the system.
This reflects a broader challenge. Much of today’s AI is built on global datasets that often overlook the diversity, complexity and cultural richness of India. From languages and cultural practices to everyday environments, large parts of our context remain underrepresented.
Understandably so! This gap is not always visible at first, but it becomes evident when systems fail to interpret situations accurately in real-world settings, especially in India.

The shift to Multimodal AI
Unlike traditional systems that rely on a single type of input, Multimodal AI brings together multiple forms of data such as images, text, audio and indigenous environmental signals to build a more complete understanding of the world. It attempts to move closer to how humans perceive and interpret situations, where meaning comes from combining different cues rather than analysing them in isolation. In many ways, it shifts AI from simply processing information to interpreting it within context.
Recognizing this shift, the Technology Innovation Hub at the Indian Institute of Technology (IIT) Mandi is working on a first-of-its-kind Multimodal AI Lab in India. The focus is on building systems that are grounded in the country’s own data, diversity and real-world environments, and on ensuring that these AI systems reflect local data with all its inherent richness rather than relying solely on externally available datasets.
Why Multimodal AI Labs matter
AI systems are highly capable within specific tasks, but they often struggle with diverse contexts. An image may be correctly identified, or a sentence accurately processed, yet the connection between them, the meaning within a situation, is not always clear. This limitation becomes more visible in complex and diverse environments like India, where context and the right data play a critical role in interpretation.
Multimodal AI addresses this gap by bringing different data streams together. When visual, textual and environmental inputs are combined, systems are able to interpret real-world scenarios with more depth and sensitivity.
Building such systems depends not only on how data is collected but also on the sophisticated fusion and synchronization of these diverse data streams. This requires real-world inputs gathered across environments through strategic partnerships, which makes the process as much about understanding people and context as it is about technology and highlights the need for inclusive and representative data ecosystems.
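To make the idea of synchronization concrete, here is a minimal Python sketch of one common approach: aligning records from different streams by timestamp. It is an illustration under stated assumptions, not the method used by the lab. The `Sample` class, the `nearest` and `synchronize` functions, and the 0.5-second tolerance are all hypothetical names and values chosen for this example.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since the start of a recording session (assumed unit)
    payload: str      # stand-in for an image path, audio clip, sensor reading, etc.


def nearest(samples, t, tolerance):
    """Return the sample closest in time to t, or None if none lies within tolerance.

    Assumes `samples` is sorted by timestamp.
    """
    if not samples:
        return None
    times = [s.timestamp for s in samples]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = samples[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda s: abs(s.timestamp - t))
    return best if abs(best.timestamp - t) <= tolerance else None


def synchronize(images, audio, sensors, tolerance=0.5):
    """Fuse three streams, keeping only moments where every stream has a nearby sample."""
    fused = []
    for img in images:
        a = nearest(audio, img.timestamp, tolerance)
        s = nearest(sensors, img.timestamp, tolerance)
        if a is not None and s is not None:
            fused.append({
                "timestamp": img.timestamp,
                "image": img.payload,
                "audio": a.payload,
                "sensor": s.payload,
            })
    return fused
```

In this sketch an image with no audio or sensor reading close enough in time is dropped rather than padded, a design choice that favours clean fused records over volume; a real system might instead keep partial records and mark the missing modality.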
Understanding India through the 3Cs
For AI systems to be meaningful in India, they need to reflect the indigenous environment they operate in. A useful way to look at this is through three aspects:
- Culturally: India’s diversity in language, traditions and practices is vast. The same object, gesture or phrase can have different meanings across regions, and systems need to account for this variation.
- Contextually: The same input can have different meanings depending on location, setting or economic conditions. For example, how a service is used in a rural setting may differ significantly from its use in an urban environment.
- Continuously: Data in India is constantly evolving. From changing user behaviours to shifting economic conditions, the ground keeps moving, and systems must adapt over time rather than rely on static information.
These factors make India both complex and uniquely rich as a data environment. This richness has not yet been captured completely in current AI systems, which creates both a challenge and an opportunity for India to build more representative models.
Multimodal AI through four layers
A simple view of Multimodal AI can be understood through these four layers:
- Data ingestion: Collecting data from different real-world sources
- Processing: Cleaning and organizing that data
- Storage: Making it accessible for learning and retrieval
- Serving: Generating outputs or insights from the data
While each stage is important, the starting point often determines the outcome. If the data entering the system is limited or biased, then the results will reflect those limitations.
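The four layers above can be sketched as a tiny Python pipeline. This is a simplified illustration only: the function names (`ingest`, `process`, `store`, `serve`), the record fields (`modality`, `region`, `content`) and the in-memory index are assumptions made for this example, not a description of any actual system.

```python
from collections import defaultdict


def ingest(sources):
    """Data ingestion: gather raw records from every source into one list."""
    return [record for source in sources for record in source]


def process(records):
    """Processing: drop empty records and normalise the region label."""
    cleaned = []
    for r in records:
        content = (r.get("content") or "").strip()
        if content:
            cleaned.append({
                "modality": r["modality"],
                "region": (r.get("region") or "unknown").lower(),
                "content": content,
            })
    return cleaned


def store(records):
    """Storage: index records by region so they can be retrieved later."""
    index = defaultdict(list)
    for r in records:
        index[r["region"]].append(r)
    return index


def serve(index, region):
    """Serving: summarise what the system holds for a given region."""
    records = index.get(region, [])
    return {
        "region": region,
        "count": len(records),
        "modalities": sorted({r["modality"] for r in records}),
    }
```

Calling `serve(store(process(ingest(sources))), "himachal")` walks a request through all four stages. If ingestion never collected anything for a region, the serving stage can only report an empty result, which is the point made above about the starting point determining the outcome.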
Building the foundation: Data ingestion
At IIT Mandi, the current focus is on data ingestion and synchronization, which form the foundation of the entire system.
This stage is where the quality, diversity and representativeness of data are established. In India, this involves capturing a wider range of inputs such as languages and dialects, as well as visual environments and behavioral patterns across rural and urban settings. It also calls for structured methods of collecting and organizing this data so that it can be used effectively by AI systems.
If this foundation is robust in India, the rest of the system is far more likely to perform effectively and produce meaningful outcomes.
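One simple way to check representativeness at the ingestion stage is to measure how the collected records are distributed across a dimension such as language or setting. The Python sketch below is a hedged illustration: the `coverage_report` function, the `language` field and the 20 percent threshold in the usage example are hypothetical choices, not figures from the lab.

```python
from collections import Counter


def coverage_report(records, key, min_share=0.1):
    """Report the count and share of each value of `key`, flagging small groups.

    `min_share` is an assumed threshold: any group whose share of the data
    falls below it is marked as underrepresented.
    """
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {
        value: {
            "count": n,
            "share": n / total,
            "underrepresented": n / total < min_share,
        }
        for value, n in counts.items()
    }
```

For example, a collection of eight Hindi records, one Pahari record and one Tamil record, checked with `coverage_report(records, "language", min_share=0.2)`, would flag Pahari and Tamil as underrepresented. A check like this only counts records and says nothing about their quality, so it is a starting point rather than a full audit.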
Real-World Impact and Decisions
In classrooms, farms, and public systems across India, decisions are rarely based on a single input; they depend on a mix of visuals, language, and local context. Multimodal AI enables more accurate decision-making by bringing these disparate elements together.
Skilling and Economic Opportunities
Beyond technology, this initiative serves as a catalyst for human capital development:
- High-Volume Job Creation: New pathways are opening for high-volume roles across the entire multimodality lifecycle, including data collection, specialized annotation, and AI support functions.
- Grassroots Participation: Opportunities for local communities to participate in data ecosystems with relatively low infrastructure requirements.
- Advanced Skilling: New initiatives aligned with emerging AI applications, creating a bridge between technology and livelihoods.
- Institutional Collaboration: Scope for CSR and institutional participation in building representative data foundations.
Looking ahead
India now has a clear opportunity to shape AI systems that are inclusive, relevant and grounded in its own realities.
The Multimodal AI Lab at TIH, IIT Mandi, funded and supported by the Department of Science and Technology, represents an important step in this direction with its focus on building strong data foundations that reflect India’s diversity and complexity.
As these efforts grow, they can help build AI systems that actively learn from India’s realities. This can enable better decisions, stronger public systems and solutions that are designed for real-world use. At the same time, it can create pathways for wider participation in shaping India’s growing AI adoption.









