A faster way to search for millions of strings in text: Python FlashText

In data science, data cleaning is the first and most important step in discovering knowledge from data. Removing special characters, collapsing extra spaces, and converting all words to lowercase are typical basic cleaning tasks. Another important task is searching the text for particular kinds of strings (e.g. email addresses or dates) and replacing them with a special symbol. Standard libraries in many programming languages (e.g. Python, R) provide APIs for searching for strings in text.
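As a rough illustration of these basic cleaning steps, here is a minimal sketch using only Python's standard `re` module. The function name `clean_text`, the `<EMAIL>` token, and the simplified email pattern are assumptions made for this example; real email matching is more involved.

```python
import re

# Simplified, hypothetical email pattern used only for illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean_text(text: str, email_token: str = "<EMAIL>") -> str:
    """Lowercase the text, mask emails, drop special characters, collapse spaces."""
    text = text.lower()
    # Replace every email-like string with a special symbol.
    text = EMAIL_RE.sub(email_token, text)
    # Replace special characters with spaces; angle brackets are kept so the
    # token survives (stray < or > in the input are therefore kept too).
    text = re.sub(r"[^A-Za-z0-9<>\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("Contact  ME at John.Doe@example.com!!")
```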

However, when we have to perform millions of string searches on a single text file, the usual methods are not fast enough to return a result within minutes. For example, Python's replace() method, regular expressions (regex), and finditer() all become slow at that scale, since each search term generally requires its own pass over the text.
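To make the cost concrete, here is a minimal sketch of the usual regex approach. The function name `replace_with_regex_loop` and the sample keywords are hypothetical. The point is that the whole text is scanned once per keyword, so the total work grows with the number of keywords multiplied by the length of the text.

```python
import re


def replace_with_regex_loop(text: str, replacements: dict) -> str:
    """Replace each keyword in turn; every keyword costs one full scan of the text."""
    for keyword, clean_name in replacements.items():
        # \b keeps the match on whole words; re.escape guards special characters.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        text = re.sub(pattern, clean_name, text)
    return text


result = replace_with_regex_loop(
    "I love Big Apple and NYC.",
    {"Big Apple": "New York", "NYC": "New York"},
)
```

With two keywords this is instant, but with millions of keywords the loop performs millions of scans over the same text.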

To solve this problem, a Google developer wrote a Python library called FlashText. It is reported to be almost 30 times faster than the regular Python searching methods when the number of search terms is large. The details are given in the following links.
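The FlashText README describes the library as being built on a trie dictionary, with an approach inspired by the Aho-Corasick algorithm. The sketch below is a simplified pure-Python illustration of that idea, not the library's actual code; the names `build_trie` and `replace_keywords` and the word-boundary rule are assumptions made for this example.

```python
_END = "_end_"  # marker key that stores the replacement for a complete keyword


def build_trie(replacements: dict) -> dict:
    """Store all keywords character by character in one nested dictionary."""
    trie = {}
    for keyword, clean_name in replacements.items():
        node = trie
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[_END] = clean_name
    return trie


def _is_word_char(ch: str) -> bool:
    return ch.isalnum() or ch == "_"


def replace_keywords(text: str, trie: dict) -> str:
    """Walk the text once, following the trie to find the longest keyword match."""
    out = []
    i, n = 0, len(text)
    while i < n:
        match_end, match_value = -1, None
        # Only try to match at the start of a word.
        if i == 0 or not _is_word_char(text[i - 1]):
            node, j = trie, i
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Accept a keyword only if it also ends on a word boundary.
                if _END in node and (j == n or not _is_word_char(text[j])):
                    match_end, match_value = j, node[_END]
        if match_end != -1:
            out.append(match_value)
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


trie = build_trie({"Big Apple": "New York", "NYC": "New York"})
replaced = replace_keywords("I love Big Apple and NYC.", trie)
```

Because all keywords share one trie, the work done at each position in the text depends on the length of the keywords, not on how many keywords are stored. In FlashText itself, the corresponding entry points are `KeywordProcessor.add_keyword()` and `KeywordProcessor.replace_keywords()`.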


GitHub: https://github.com/vi3k6i5/flashtext
Details: https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f



