TF-IDF (term frequency–inverse document frequency) is a statistical measure, popular in text mining, of how important a word is to a document within a corpus. You can use Python's scikit-learn (sklearn) package to easily get TF-IDF vector representations of the words in a text.
However, the default settings of TfidfVectorizer in sklearn only treat words of two or more characters as tokens. Since single-character words are rarely meaningful in English text (e.g. 'a', 'b', 'c', ..., 'z'), you usually don't need to worry about this and can use the package as-is.
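You can verify this default behavior with a quick sketch (assuming a recent scikit-learn, version 1.0 or later, where get_feature_names_out() is available):
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings: token_pattern is r"(?u)\b\w\w+\b"
default_vectorizer = TfidfVectorizer()
default_vectorizer.fit(['This is a document'])
print(default_vectorizer.get_feature_names_out())
# ['document' 'is' 'this'] -- the single-character token 'a' was dropped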
The interesting fact is that single-character tokens can be very important features in other languages or domains. In those cases, you need to change the default settings carefully so that you get vectors for all the tokens.
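Before touching the vectorizer, it helps to see what the two patterns actually match. Here is a small comparison using Python's built-in re module, contrasting the default sklearn pattern with the relaxed one used below:
import re

text = 'And this is the third one d e'

# Default pattern: requires at least two word characters per token
print(re.findall(r"(?u)\b\w\w+\b", text))
# ['And', 'this', 'is', 'the', 'third', 'one']

# Relaxed pattern: one or more word characters, so 'd' and 'e' survive
print(re.findall(r"(?u)\b\w+\b", text))
# ['And', 'this', 'is', 'the', 'third', 'one', 'd', 'e']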
The default token_pattern regexp in TfidfVectorizer is exactly that first pattern, r"(?u)\b\w\w+\b", so it only selects words with at least two characters. To keep single-character tokens as well, update the regular expression in token_pattern as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

def my_vectorizer():
    vectorizer = TfidfVectorizer(
        analyzer="word",
        tokenizer=None,      # defaults shown explicitly for clarity
        preprocessor=None,
        stop_words=None,
        max_features=5000,
        token_pattern=r"(?u)\b\w+\b",  # here is the change: \w+ instead of \w\w+
    )
    return vectorizer

corpus1 = [
    'This is the first document a',
    'This document is the second document b c',
    'And this is the third one d e',
    'Is this the first document f g',
]  # added some single-character tokens to test

vectorizer = my_vectorizer()
X = vectorizer.fit_transform(corpus1)

# use get_feature_names() in scikit-learn < 1.0; it was removed in 1.2
for token in vectorizer.get_feature_names_out():
    print(token)  # prints every token of corpus1, including the single characters
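Beyond the token list, you can also inspect the TF-IDF weights themselves; a short sketch continuing from the code above:
# X is a sparse matrix: one row per document, one column per token
print(X.shape)  # (4, 16): 4 documents, 16 distinct tokens

# Pair each token with its weight in the first document
first_doc_weights = X.toarray()[0]
for token, weight in zip(vectorizer.get_feature_names_out(), first_doc_weights):
    if weight > 0:
        print(token, round(weight, 3))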
Reference:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html