Projects

Deployed systems

A decade of voice-first services for under-served populations, and the language technologies and AI tools that grew out of them.

The talk below traces how our voice-based services for developing regions evolved from 2011 to 2021. For a short read, see "The Internet of the Orals" in Communications of the ACM; for the full story, see the book chapter "Voice Interfaces for Underserved Communities" (Springer).

A decade of voice services

Polly: viral voice for the offline

Polly is an interactive voice service that works over an ordinary phone call. A user records a short clip, applies a funny voice modification, and forwards it to friends, who can do the same. That single playful act removes the usual barriers of training, trust, and advertising, so the service spreads from person to person. Relaunched in Pakistan in 2012 from a seed of just five users, Polly reached 165,000 users within a year, at one point onboarding 1,000 new callers a day. An audio job portal built into Polly drew 34,000 job-seekers to its 728 listings, which were played more than 386,000 times. It connected low-literate workers to real opportunities and led to a one-million-euro grant from GIZ. Polly was later adapted as Polly-Santé to deliver Ebola information in eleven languages in Guinea, and reached further users through research partners in India.

Super Abbu: maternal health for expectant fathers

Launched in 2016, Super Abbu lets expectant parents anonymously record questions about pregnancy and childbirth, which volunteer doctors answer. Crucially, it targets fathers, the household decision-makers in much of Pakistan, with information usually directed only at women. It reached 21,000 users, 96 percent of them men, in two months, exposing an enormous unmet appetite for culturally sensitive, lifesaving guidance. Backed by UNICEF's Innovation Fund and featured by the BBC, its impact is now being measured through a randomized controlled trial supported by the NIH, and the accompanying study of how to advertise such services earned an Honorable Mention at ACM CHI.

Super Abbu

Rah-e-Maa

Baang: a voice-based social network

Baang is a social platform built for basic phones, a kind of voice Reddit where people record posts, listen, like, comment, report, and share. About 69 percent of its users were blind, for whom a voice-native social space is genuinely liberating. The single strongest predictor of continued use was leaving comments, evidence that social connection, not novelty, kept people coming back. Users even raised their voices collectively against harassment and hate speech. During COVID-19 we extended Baang into a voice social platform for credible health information. Over six months it took more than 500,000 calls from 12,000 mostly low-literate users, who posted more than 35,000 voice messages that were listened to millions of times. Verified doctor-recorded content was promoted and misinformation was removed (published at The Web Conference 2022).

Baang

Sawaal & Karamad: quizzes and crowd work by voice

Sawaal folds three jobs into one viral voice quiz: it finds out what people do not know, teaches them the right answer, and measures how much they retain over time. Users post and attempt multiple-choice questions and challenge their friends, and it grew to thousands of users with no advertising at all. Karamad brings digital crowd work to people who have only a basic phone: workers complete tasks such as translation, transcription, and audio surveys by voice and earn mobile airtime in return. In six months it organically engaged 725 workers, including women and unemployed, visually impaired, and non-literate participants. Together they completed nearly 4,000 tasks, opening the digital labor economy to people the internet had so far excluded.

Sawaal

Karamad

Audio security & speech forensics

As voice cloning becomes widely available, impersonation scams and audio disinformation threaten the very populations my other work serves. Through the Crime Investigation and Prevention Lab (CIPL) and a national R&D grant, I build audio deepfake detection and core Urdu speech forensics for law enforcement. Our approach is explainable and data-efficient. We found that synthetic speech fails to reproduce certain fine-grained phonemes, so we focused a self-attention model on just the 16 most informative phonemes (fricatives such as /s/, nasals such as /m/ and /n/). This improved detection to an equal error rate of about 11.98% while flagging exactly which sounds look suspicious. We also released a specialized Urdu deepfake-audio dataset and use intelligent dataset pruning to build lightweight detectors.
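For readers unfamiliar with the metric, the equal error rate is the operating point at which a detector wrongly flags genuine speech as often as it misses synthetic speech. The sketch below is a generic illustration of how such a figure can be computed from detector scores. It is not the lab's evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely synthetic", and the toy data are all assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple


def equal_error_rate(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, Optional[float]]:
    """Return (eer, threshold) for a binary detector.

    Assumed convention (illustrative only): label 1 = synthetic clip,
    label 0 = genuine clip, and a higher score means "more likely synthetic".
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_fake = sum(1 for y in labels if y == 1)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one genuine and one synthetic clip")

    best_gap, best_eer, best_thr = float("inf"), 1.0, None
    # Try every observed score as a threshold, plus one that flags nothing.
    for thr in sorted(set(scores)) + [float("inf")]:
        # A clip is flagged as synthetic when its score is >= thr.
        false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        misses = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        false_alarm_rate = false_alarms / n_real
        miss_rate = misses / n_fake
        gap = abs(false_alarm_rate - miss_rate)
        if gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap = gap
            best_eer = (false_alarm_rate + miss_rate) / 2
            best_thr = thr
    return best_eer, best_thr


if __name__ == "__main__":
    # Toy scores, invented for illustration: two genuine and two synthetic clips.
    toy_scores = [0.1, 0.6, 0.4, 0.9]
    toy_labels = [0, 0, 1, 1]
    eer, thr = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

On a finite test set the two error rates rarely cross exactly, so this sketch reports their midpoint at the threshold where they are closest. Evaluation toolkits often interpolate along the error curve instead, which can give slightly different values.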

Urdu speech & language technology

Underneath the deployments sits a body of foundational technology for Urdu and other low-resource languages: the first medium-vocabulary Urdu speech-recognition system, Urdu text-to-speech, spoken-term detection, speech-emotion recognition, voice biometrics, pronunciation-lexicon generation, word segmentation, and unsupervised text simplification. With the calligrapher Nasrullah Mehr, the lab released Mehr-e-Nastaliq, the first calligraphy-based OpenType Urdu web font, which renders the flowing Nastaliq script faithfully in under 110 KB so it can travel even over slow connections.

Urdu speech recognition

Urdu text to speech

LLMs, datasets & efficient ML

Since 2021 the lab's work has broadened from voice interfaces into the language-modeling and machine-learning research that inclusive AI depends on: open Urdu datasets and benchmarks such as UQA for question answering and PakBBQ for cultural and social bias; evaluations of generalist versus specialist models for Urdu; and efficient machine learning through data selection, pruning, active learning, and inclusive federated learning. Together these bring large-dataset training within reach of moderate computing resources.