In data science, data cleaning is the first and arguably the most important step in extracting knowledge from data. Basic cleaning tasks include removing special characters, collapsing extra spaces, and converting all words to lowercase. Another important task is searching the text for particular kinds of strings (e.g. email addresses or dates) and replacing each match with a special symbol. Many programming languages (e.g. Python, R) have standard libraries with APIs for searching for a string in a text file.
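As a rough illustration of these basic steps, here is a minimal sketch using only Python's standard `re` module. The function name `clean_text`, the placeholder tokens, and both patterns are assumptions made for this example; the patterns are deliberately simple, cover only ASCII English text, and will not match every real email or date format.

```python
import re

# Illustrative patterns only; real email and date formats vary far more widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO-style dates such as 2019-03-07


def clean_text(text: str) -> str:
    """Replace emails and dates with placeholders, then normalize the rest."""
    # Substitute placeholders first, while '@', '.' and '-' are still present.
    text = EMAIL_RE.sub(" EMAILTOKEN ", text)
    text = DATE_RE.sub(" DATETOKEN ", text)
    text = text.lower()                        # keep all words in lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop special characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra spaces
    return text
```

Calling `clean_text("Contact  Bob@Example.com on 2019-03-07!!")` gives `"contact emailtoken on datetoken"`. The placeholders end up in lowercase because they are inserted before the lowercasing step.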
However, when a single text file requires millions of string search operations, the usual methods are not fast enough to return results in minutes. For example, Python's replace() method, regular expressions (regex), and the find() method all become slow at that scale.
To solve this problem, a Google developer wrote FlashText, a Python library. It is reported to be almost 30 times faster than the regular Python search methods. Details are available at the following links.
Github: https://github.com/vi3k6i5/flashtext
Details: https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f
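For a feel of how the library is used, here is a short sketch based on FlashText's `KeywordProcessor` class. The helper `replace_terms` and the sample keywords are made up for this example, and the package is a third-party dependency that has to be installed first (`pip install flashtext`).

```python
def replace_terms(text, replacements):
    """Replace every key of `replacements` found in `text` with its value."""
    # Third-party dependency, imported here so the helper can be defined
    # even on a machine where flashtext is not installed yet.
    from flashtext import KeywordProcessor

    processor = KeywordProcessor()  # matching is case-insensitive by default
    for term, replacement in replacements.items():
        processor.add_keyword(term, replacement)
    # All registered keywords are replaced in one pass over the text.
    return processor.replace_keywords(text)
```

For example, `replace_terms("I love Big Apple and Bay Area.", {"Big Apple": "New York", "Bay Area": "California"})` should return `"I love New York and California."`. All keywords are registered up front and the text is scanned once, rather than running one search per keyword.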