
Here at Ä¢¹½¶ÌÊÓÆµ, we’ve recently done a number of transliteration matching tasks, helping people with Japanese, Russian Cyrillic and Arabic data sets.
Transliteration matching can seem challenging, especially when presented with text that you don’t understand, but with the right techniques a lot can be achieved – the key is to really understand the problem and have some proven techniques for dealing with it:
- Transliteration – Matching data within a single character set
We have a long-standing Chinese customer who routinely matches data sets of hundreds of millions of customer records, all in Chinese. Even with messy data, this is a relatively straightforward task, as long as your matching algorithms handle the character set properly. Fuzzy matching within a single character set (e.g. a Chinese customer database to a Chinese marketing database) is very similar to the same task in a Roman character set, albeit with some tweaks to fuzzy match tolerances.
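As a rough illustration of why the tolerances need tweaking, here is a minimal sketch that uses Python’s standard `difflib.SequenceMatcher` as a stand-in for a production fuzzy matcher. The function names (`normalize`, `similarity`, `is_match`) and the thresholds are assumptions made for this example, not the algorithm used in the projects described above.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Fold full-width/half-width variants (NFKC) and remove all whitespace."""
    return "".join(unicodedata.normalize("NFKC", text).split())


def similarity(a, b):
    """Return a 0..1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a, b, threshold=0.75):
    """Treat two strings as a match when their score reaches the threshold.

    The default threshold is an illustrative assumption and would need
    tuning against real data.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    # A one-character difference in a three-character name costs a third
    # of the score, so a tolerance tuned for long Latin strings is too strict.
    print(similarity("王小明", "王晓明"))
    print(is_match("王小明", "王晓明", threshold=0.75))
    print(is_match("王小明", "王晓明", threshold=0.6))
```

Because Chinese names and company names are often only a few characters long, a single differing character moves the score much further than it would in a typical Latin-script name, which is one reason the match tolerance usually has to be set per character set.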
- Frequency Analysis
Another very useful technique is to perform frequency analysis on the input text to help identify ‘noise text’, such as company legal forms within company names, that can either be eliminated from the match or matched with lower importance than the rest of a company name. For example, frequency analysis on a Japanese entity master database may reveal a large number of company names containing the kanji “株式会社” or “株” – the Japanese equivalent of ‘Limited’ (or ‘Ltd.’ in abbreviated form). The beauty of this technique is that it can be applied to any language or character set.
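The idea can be sketched in a few lines of Python. This example counts character n-grams across a list of names and reports those that occur in a large share of records. The function names, the sample company names and the `min_share` cut-off are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequent_ngrams(names, n_values=(1, 2, 3, 4), min_share=0.6):
    """Return {ngram: share of names containing it} for common n-grams."""
    counts = Counter()
    for name in names:
        seen = set()
        for n in n_values:
            for i in range(len(name) - n + 1):
                seen.add(name[i:i + n])
        counts.update(seen)  # count each n-gram at most once per name
    total = len(names)
    return {g: c / total for g, c in counts.items() if c / total >= min_share}


def strip_noise(name, noise_terms):
    """Remove noise terms from a name, longest term first."""
    for term in sorted(noise_terms, key=len, reverse=True):
        name = name.replace(term, "")
    return name


if __name__ == "__main__":
    # Made-up sample data, not taken from any real database.
    names = ["株式会社山田商事", "田中工業株式会社", "株式会社鈴木製作所", "佐藤電機"]
    common = frequent_ngrams(names)
    # Show the longest frequent n-grams first; these are the noise candidates.
    for gram in sorted(common, key=len, reverse=True):
        print(gram, common[gram])
    print(strip_noise("株式会社山田商事", ["株式会社"]))
```

In practice the candidate list should be reviewed by someone who reads the language before anything is stripped, since a frequent n-gram can also be a legitimate part of many names. Down-weighting the noise term in the match score, rather than deleting it, is the gentler option mentioned above.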
- Matching between character sets using Transliteration, fuzzy and phonetic matching
A common requirement in the AML/KYC space is matching account names in Chinese, Japanese, Cyrillic, etc. to sanctions and PEP lists, which are usually published in Latin script. In order to do this, a process called ‘transliteration’ is required. Transliteration converts text in one character set to another, but the results from raw transliteration are not always usable, since the transliterated text is often more of a ‘pronunciation guide’ than how a native speaker would write the text in Latin script. However, by using a combination of fuzzy and phonetic matching on the transliterated string, it is possible to obtain very accurate matching.
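The combination can be sketched as follows for Russian Cyrillic. Everything here is a simplified assumption for illustration: the transliteration table is a small hand-written mapping rather than a published standard, `phonetic_key` is a crude home-made key rather than an established phonetic algorithm, `difflib.SequenceMatcher` stands in for a production fuzzy matcher, and the thresholds are arbitrary. Chinese and Japanese would need dictionary-based transliteration, which this sketch does not attempt.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, simplified Russian-to-Latin table (not a formal standard).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Raw character-by-character transliteration to lowercase Latin."""
    text = unicodedata.normalize("NFC", text).lower()
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def phonetic_key(text):
    """Crude per-word key: unify x/ks and y/i, keep the first letter,
    then drop vowels and adjacent doubled letters."""
    keys = []
    for word in text.lower().split():
        word = word.replace("x", "ks").replace("y", "i")
        key, prev = word[0], word[0]
        for ch in word[1:]:
            if ch not in "aeiou" and ch != prev:
                key += ch
            prev = ch
        keys.append(key)
    return " ".join(keys)


def names_match(source_name, latin_name, strict=0.85, relaxed=0.7):
    """Match on a high fuzzy score alone, or on a lower fuzzy score
    when the phonetic keys also agree."""
    candidate = transliterate(source_name)
    target = latin_name.lower()
    ratio = SequenceMatcher(None, candidate, target).ratio()
    same_sound = phonetic_key(candidate) == phonetic_key(target)
    return ratio >= strict or (same_sound and ratio >= relaxed)


if __name__ == "__main__":
    print(transliterate("Алексей Иванов"))
    print(names_match("Алексей Иванов", "Alexei Ivanov"))
    print(names_match("Пётр Смирнов", "Sergei Ivanov"))
```

The raw transliteration of “Алексей” under this table differs from the common Latin spelling “Alexei” by enough to fall below the strict fuzzy threshold, but the phonetic keys agree, so the relaxed threshold applies. Requiring both signals for the relaxed path is one possible design choice to limit the false positives a phonetic key would produce on its own.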