
Detecting Foreign Words

I am writing a script to detect words from language B that appear in text written in language A. The two languages are very similar and may share some of the same words. The code is here if you are in

Solution 1:

If your method is returning words that are present in both languages, and you only want words that exist in one language, you could build a list of unigrams (single words) for language A and another for language B, and then remove the words that appear in both lists. What remains in each list is vocabulary exclusive to that language. Then, if you like, you can proceed with the bigram analysis.
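The unigram filtering described above can be sketched as follows. This is only a minimal illustration, not the asker's code: the function names (`tokenize`, `exclusive_vocab`, `find_b_words`) and the two tiny corpora are made up for the example, and a real tokenizer would likely need to be tailored to the languages involved.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def exclusive_vocab(corpus_a, corpus_b):
    """Return (words only in A, words only in B), dropping shared words."""
    vocab_a = set(tokenize(corpus_a))
    vocab_b = set(tokenize(corpus_b))
    shared = vocab_a & vocab_b
    return vocab_a - shared, vocab_b - shared

def find_b_words(text, only_b):
    """List the tokens of `text` that belong exclusively to language B."""
    return [word for word in tokenize(text) if word in only_b]

# Hypothetical toy corpora standing in for real language A / language B data.
corpus_a = "the cat sat on the mat"
corpus_b = "the kat sat op the mat"

only_a, only_b = exclusive_vocab(corpus_a, corpus_b)
print(find_b_words("the kat sat on the mat", only_b))
```

Words such as "the" and "mat" occur in both toy corpora, so they are never reported; only vocabulary unique to B is flagged.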

That said, there are some good tools in Python for language identification. I've found langid (langid.py) to be one of the best. It comes pre-trained with language classifiers for over 90 languages, and is fairly easy to train for additional languages if you like; see its documentation for details. There is also guess-language, but it doesn't perform as well in my estimation. Depending on how localized the bits of foreign language are, you could try chunking your texts at an appropriate level of granularity and running each chunk through (e.g.) langid's classifier.
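A rough sketch of the chunk-and-classify idea is below. To keep it runnable without extra packages, the classifier is passed in as a function, and the example uses a hypothetical stand-in (`toy_classify`) rather than a real model. The helper names and the chunk size are assumptions for illustration. With langid installed, `langid.classify(text)` returns a `(language, score)` pair, so something like `lambda s: langid.classify(s)[0]` could be passed instead.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def flag_foreign_chunks(text, classify, expected_lang, size=6):
    """Return (chunk, predicted_lang) for chunks not classified as expected_lang."""
    flagged = []
    for chunk in chunk_words(text, size):
        lang = classify(chunk)
        if lang != expected_lang:
            flagged.append((chunk, lang))
    return flagged

# Hypothetical stand-in classifier: labels a chunk "b" if it contains any
# word from a small marker set. A real classifier would replace this.
B_MARKERS = {"kat", "op"}

def toy_classify(chunk):
    words = chunk.lower().split()
    return "b" if any(word in B_MARKERS for word in words) else "a"

text = "the cat sat on the mat the kat sat op the mat"
for chunk, lang in flag_foreign_chunks(text, toy_classify, expected_lang="a"):
    print(lang, "|", chunk)
```

Smaller chunks localize foreign words more precisely but give the classifier less context to work with, which matters when the two languages are very similar.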
