Surveying a Multi-Lingual Society

22 official languages, 10 major scripts, native speakers increasingly geographically dispersed.

Hindi and English are the most used languages in government and businesses with state-specific languages being the primary language in each state. So, for example, in West Bengal the main language used is Bangla or Bengali with Hindi and English translations available on some roads signs, airports, trains (but not always buses) and so on. Still:

Businesses and other entities often survey in Hindi or English in India. A recent exercise validating surveys in these languages revealed:

  • Many “English-speaking” respondents don’t actually speak English and fail to understand basic question and response wording.
  • Sentence construction in English varies wildly, as people are literally translating based on the grammar of their native language, which also varies wildly as the 22 different languages generally tend of have different grammatical rules—this makes the concept of “Indian English” challenging at best.
  • A good proportion of “Hindi” respondents only understand a very basic, simple level of Hindi, and even many common Hindi words are often beyond their understanding.
  • Many “Hindi” respondents cannot read the script or can do so only very laboriously—so reading survey questions can be difficult and time-consuming. Many Indians will use transliterated Hindi in social media, chat boards and other such places, thus signaling that they are users of the language, when often they really aren’t.
  • Respondents in certain parts of the country will refuse to communicate in Hindi and often do not understand English—these become immediate refusals.

The obvious solution: Offer the survey in all 22 different languages allowing respondents to choose the language they are most comfortable in. This works for online surveys where respondents can choose their preferred language.

But translating and validating survey instruments in so many languages is a time-consuming and expensive affair. One option would be to sample among the languages or select the ones with the highest population and focus on those populations.

But, if the survey is interviewer-administered, then survey language may be matched by state, but matching language by state is not always sufficient, especially in India’s urban areas. Indian TV channels and streaming services show the same advertisement dubbed in the language of the state you happen to be in, but that does not solve the problem of native speakers of other languages living in that state (for example, I failed to comprehend the same ad that I had watched in Bengali in Kolkata when I saw it in Marathi in Mumbai). So, trying to do a phone survey in Marathi with a Bengali living in Mumbai could lead to language-related issues with the added complication of the person possessing only basic Hindi and a splattering of English.

As AI becomes more proficient in translating speech, these issues might soon become easier to mitigate, but in the meantime they continue to pose challenges and often introduces bias in the data collected.

What are some of the ways in which you have navigated these language-related complications, especially if you have been collecting survey data in a multi-lingual context like India?

Leave a comment