Good AI doesn't just happen (though it can seem that way when your smart home assistant jumps into a conversation). To train the natural language processing algorithms behind the AI we use daily, tech companies need large volumes of quality training data to support their multilingual AI.
At Argos, we work on data collection translation projects for tech companies around the world. These projects typically involve text translation, voice translation and recording, or image translation and annotation. The datasets we work with in this space are very different from typical translation source material: they're larger and require specific translation rules, as they'll be fed back into the client's machine learning system.
To help you choose a translation vendor who can handle data translation projects of this type, we've collected some best practices below for using translation to build better AI.
Translate the data resources you have
If you're a tech company working in multilingual AI, translation can be useful for a range of data types.
Text datasets can be used to train natural language processing (NLP) models for the machine translators that support social media and translation apps. Your voice data assets can be used to improve an AI system so it can recognize a variety of accents, genders, speech types and more. (This is just as important for an English-only system as it is for a multilingual one.) And images may be used to teach self-driving cars, tracking systems, hover-over translation apps, and more. High-quality translation is key to producing a system that can recognize words, image text, or commands in multiple languages.
Dataset translation is especially helpful if you want to create something in a less common language. Let's consider a simple case and imagine that you're building a basic chatbot that can recognize several different languages, including Dutch, and then respond in the user's language. You probably already have plenty of English samples of things you expect someone to write in the query box, but a much smaller pool of samples in Dutch. This is where you can take your English dataset and translate it into Dutch, expanding your Dutch data to match your English volume.
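As a rough illustration of the Dutch chatbot case above, here is a minimal Python sketch of how translated samples might be merged into a small pool of native ones. The Sample class, the origin labels, and the augment_with_translations function are hypothetical names chosen for this example, and the translations dictionary simply stands in for whatever your translation vendor delivers. Tagging each record as "native" or "translated" is an assumption on our part, but it tends to make it easier to compare the two kinds of data later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    text: str
    lang: str
    origin: str  # "native" or "translated" (labels assumed for this sketch)


def augment_with_translations(english, native_target, translations, target_lang):
    """Combine native target-language samples with translated English ones."""
    combined = list(native_target)
    seen = {s.text for s in native_target}
    for sample in english:
        translated = translations.get(sample.text)
        if translated is None or translated in seen:
            continue  # not translated yet, or duplicates an existing sample
        combined.append(Sample(translated, target_lang, "translated"))
        seen.add(translated)
    return combined


# Illustrative data only; real translations would come from the vendor.
english = [
    Sample("What are your opening hours?", "en", "native"),
    Sample("I want to talk to an agent", "en", "native"),
]
dutch_native = [Sample("Wat zijn jullie openingstijden?", "nl", "native")]
delivered = {
    "What are your opening hours?": "Wat zijn jullie openingstijden?",
    "I want to talk to an agent": "Ik wil een medewerker spreken",
}

dutch = augment_with_translations(english, dutch_native, delivered, "nl")
print(len(dutch_native), "native samples ->", len(dutch), "total samples")
```

In this toy run the first translation matches a sentence already in the native pool, so only the second one is added.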
If you're working on more complex or specific projects, however, you may also need to specify how your datasets are translated. For example, if you're a bank building a chatbot AI, you'll have a range of commands to translate. "What is my account balance?" or "Open a new account" can be directly translated. But specific industry terms, like U.S. 401(k) plans, don't always have universal equivalents. In those cases, you'll need to ask the translator to find or suggest equivalent terms, or to flag a term as untranslatable so you know where native data might need to be collected.
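One way to prepare for this is to pre-flag source strings that contain terms without a clear equivalent before they go to the translator. The Python sketch below is only an illustration under our own assumptions: the NO_DIRECT_EQUIVALENT glossary, the SourceString class, and the flag_terms function are hypothetical, and a real glossary would be agreed with your translation partner.

```python
from dataclasses import dataclass, field

# Hypothetical glossary of terms that may lack an equivalent in the target market.
NO_DIRECT_EQUIVALENT = ["401(k)"]


@dataclass
class SourceString:
    text: str
    flagged_terms: list = field(default_factory=list)

    @property
    def needs_translator_note(self):
        # True when at least one glossary term appears in the text.
        return bool(self.flagged_terms)


def flag_terms(text, glossary=NO_DIRECT_EQUIVALENT):
    """Return the string together with any glossary terms found in it."""
    lowered = text.lower()
    found = [term for term in glossary if term.lower() in lowered]
    return SourceString(text=text, flagged_terms=found)


commands = [
    "What is my account balance?",
    "Open a new account",
    "How much is in my 401(k)?",
]
flagged = [flag_terms(command) for command in commands]

for item in flagged:
    status = "FLAG FOR TRANSLATOR" if item.needs_translator_note else "translate directly"
    print(f"{status}: {item.text}")
```

Strings that come back flagged can be sent with a note asking the translator to suggest an equivalent or mark the term as untranslatable.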
Set style guides early
When starting a translation project for your multilingual AI tech, it's a good idea to choose a translation partner who will work with you to set up style and rule guides that will produce the type of data you want. Doing this at the beginning of the process will result in cleaner datasets that can smoothly feed back into your AI.
A quality data translation vendor will establish this style guide with you as you begin to define the scope of the project. The optimal translation process for your data will look different depending on your AI training model, so it's a good idea to carefully talk the process through with your translation partner to make sure their processes will help you get the data and data quality you need.
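Some rules in a style guide can be checked automatically once they are written down. The Python sketch below assumes two example rules that are not from any particular project: placeholders such as {user_name} must survive translation unchanged, and protected terms such as a brand name must not be translated. The function names and the placeholder pattern are hypothetical.

```python
import re

# Assumed placeholder convention for this sketch: lowercase names in curly braces.
PLACEHOLDER = re.compile(r"\{[a-z_]+\}")


def placeholders_preserved(source, target):
    """Check that the target keeps exactly the placeholders found in the source."""
    return sorted(PLACEHOLDER.findall(source)) == sorted(PLACEHOLDER.findall(target))


def protected_terms_kept(source, target, protected):
    """Check that every protected term present in the source also appears in the target."""
    return all(term in target for term in protected if term in source)


source = "Hello {user_name}, welcome to Argos"
good = "Hallo {user_name}, welkom bij Argos"
bad = "Hallo {gebruikersnaam}, welkom bij Argos"

for candidate in (good, bad):
    ok = placeholders_preserved(source, candidate) and protected_terms_kept(
        source, candidate, ["Argos"]
    )
    print("pass" if ok else "needs review", "|", candidate)
```

Checks like these do not replace a human reviewer; they only catch mechanical rule violations before the data is fed back into training.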
Embrace the human element (human-in-the-loop)
For any kind of data translation project, the best process is one that keeps humans in the loop. Supervised data translation and quality assurance, rather than crowdsourcing, is the best way to get reliable and clean multilingual datasets that are suitable for test and training situations.