How Data Translation Builds Better AI
Good AI doesn't just happen (though it can seem like that when your smart home assistant jumps into a conversation). To train the natural language processing algorithms behind the AI we use daily, tech companies need large volumes of quality training data to support their multilingual AI.
At Argos, we work on data collection translation projects for tech companies around the world. These projects typically involve either text translation, voice translation and recording, or image translation and annotation. The datasets we work with in this space are very different from typical translation source material: they're larger and require specific translation rules, because they'll be fed back into the client's machine learning system.
To help you choose a translation vendor who can handle data translation projects of this type, we've collected some best practices below for using translation to build better AI.
Translate the data resources you have
If you're a tech company working in multilingual AI, translation can be useful for a range of data types.
Text datasets can be used to train natural language processing (NLP) AIs for machine translators that support social media and translation apps. Your voice data assets can be used to improve an AI system so it can recognize a variety of accents, genders, speech types and more. (This is just as important for an English-only system as it is for a multilingual one.) And images may be used to teach self-driving cars, tracking systems, hover-over translation apps, and more. High-quality translation is key to producing a system that can recognize words, image text, or commands in multiple languages.
Dataset translation is especially helpful if you want to create something in a less common language. Let's consider a simple case and imagine that you're building a basic chatbot that can recognize several different languages, including Dutch, and then respond in the user's language. You probably already have plenty of English samples of what you expect someone to write in the query box, but a much smaller pool of samples in Dutch. This is where you can take your English dataset and translate it into Dutch, giving you a Dutch dataset as large as your English one and doubling your total data volume.
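As a rough illustration of the Dutch chatbot case, the sketch below pairs each English sample with a translated copy that keeps the same intent label. Everything in it is an assumption made for illustration: the `Sample` record, the `augment_with_translation` helper, the intent names, and the `demo_translate` lookup, which simply stands in for whatever human translation workflow your vendor provides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One chatbot training example (hypothetical schema)."""
    text: str
    intent: str
    lang: str


def augment_with_translation(
    samples: List[Sample],
    translate: Callable[[str, str], str],
    target_lang: str,
) -> List[Sample]:
    """Return the original samples plus one translated copy of each.

    The intent label is carried over unchanged, so the translated
    samples can be used for training without re-annotation.
    """
    translated = [
        Sample(text=translate(s.text, target_lang), intent=s.intent, lang=target_lang)
        for s in samples
    ]
    return list(samples) + translated


# Placeholder lookup standing in for a real (human) translation step.
_DEMO_DUTCH = {
    "Where is my order?": "Waar is mijn bestelling?",
    "I want to cancel my subscription": "Ik wil mijn abonnement opzeggen",
}


def demo_translate(text: str, target_lang: str) -> str:
    # Illustrative only: a real project would send text to translators.
    return _DEMO_DUTCH[text]


english = [
    Sample("Where is my order?", "order_status", "en"),
    Sample("I want to cancel my subscription", "cancel_subscription", "en"),
]
combined = augment_with_translation(english, demo_translate, "nl")
```

Keeping the language code on every record makes it easy to train or evaluate on the Dutch subset alone later on.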
If you're working on more complex or specialized projects, however, you may also need to specify how your datasets are translated. For example, if you're a bank building a chatbot AI, you'll have a range of commands to translate. "What is my account balance?" or "Open a new account" can be translated directly. But specific industry terms, like U.S. 401(k) plans, don't always have universal equivalents. In those cases, you'll need to ask the translator to find or suggest equivalent terms, or to flag a term as untranslatable so you know where native data might need to be collected.
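One way to picture that flagging step is a small glossary check run over the source text before translation. This is only a sketch under stated assumptions: the `GLOSSARY` contents, the `ReviewItem` record, and the `triage` function are hypothetical names, and a term mapped to `None` is our own convention for "no agreed equivalent".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical glossary: source term -> agreed target term,
# or None when no equivalent exists in the target market.
GLOSSARY: Dict[str, Optional[str]] = {
    "401(k)": None,              # U.S.-specific; no universal equivalent
    "account balance": "saldo",  # illustrative Dutch equivalent
}


@dataclass
class ReviewItem:
    text: str
    flagged_terms: List[str]
    needs_native_data: bool


def triage(text: str, glossary: Dict[str, Optional[str]]) -> ReviewItem:
    """Flag glossary terms in `text` that have no agreed translation."""
    lowered = text.lower()
    flagged = [
        term
        for term, target in glossary.items()
        if target is None and term.lower() in lowered
    ]
    # Any flagged term signals that native data may need to be collected.
    return ReviewItem(text=text, flagged_terms=flagged, needs_native_data=bool(flagged))


flagged_item = triage("How much is in my 401(k) plan?", GLOSSARY)
clean_item = triage("What is my account balance?", GLOSSARY)
```

Here `flagged_item` would go back to you for a decision, while `clean_item` can proceed to direct translation.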
Set style guides early
When starting a translation project for your multilingual AI tech, it's a good idea to choose a translation partner who will work with you to set up style and rule guides that will produce the type of data you want. Doing this at the beginning of the process will result in cleaner datasets that can smoothly feed back into your AI.
A quality data translation vendor will establish this style guide with you as you begin to define the scope of the project. The optimal translation process for your data will look different depending on your AI training model, so it's a good idea to carefully talk the process through with your translation partner to make sure their processes will help you get the data and data quality you need.
Embrace the human element (human-in-the-loop)
For any kind of data translation project, the best process is one that keeps humans in the loop. Supervised data translation and quality assurance, rather than crowdsourcing, is the best way to get reliable, clean multilingual datasets that are suitable for test and training situations.
Check out our resources for planning large-scale translation projects or contact us to get started translating your data.