TfidfVectorizer - How Can I Check Out Processed Tokens?
How can I check the strings tokenized inside TfidfVectorizer()? If I don't pass anything in the arguments, TfidfVectorizer() will tokenize the strings with its own pre-defined methods.
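The answers below refer to a vectorizer and a corpus. For context, here is a minimal setup they appear to assume; the corpus is inferred from the output shown in Solution 1 and matches the four-document example from the scikit-learn docs:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)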
Solution 1:
build_tokenizer() serves exactly this purpose.
Try this!
tokenizer = lambda docs: [vectorizer.build_tokenizer()(doc) for doc in docs]
tokenizer(corpus)
Output:
[['This', 'is', 'the', 'first', 'document'],
['This', 'document', 'is', 'the', 'second', 'document'],
['And', 'this', 'is', 'the', 'third', 'one'],
['Is', 'this', 'the', 'first', 'document']]
A one-liner solution would be:
list(map(vectorizer.build_tokenizer(), corpus))
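Note that build_tokenizer() returns only the tokenization step, which is why the tokens above keep their original capitalization. To see tokens exactly as the vectorizer actually counts them (lowercased, with stop-word removal and n-grams applied per your settings), build_analyzer() runs the full preprocessing pipeline:
analyzer = vectorizer.build_analyzer()
analyzer(corpus[0])
# ['this', 'is', 'the', 'first', 'document']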
Solution 2:
I'm not sure there's a built-in sklearn function to get your output in that format, but I'm pretty sure a fitted TfidfVectorizer instance has a vocabulary_ attribute that returns a dictionary mapping terms to feature indices (see the scikit-learn documentation). A combination of that and the output of the get_feature_names method should be able to do this for you. Hope it helps.
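A sketch of that combination, assuming the setup above (note that get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0 and removed in 1.2):
# vocabulary_ maps term -> column index
vectorizer.vocabulary_['document']  # 1 with the corpus above
# get_feature_names() is the inverse lookup: column index -> term
feature_names = vectorizer.get_feature_names()
# the nonzero columns of each row of X are the terms present in that document
[[feature_names[i] for i in row.indices] for row in X]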
Solution 3:
This might not be syntactically correct (doing this from memory), but it's the general idea:
Y = X.toarray()
vocab = vectorizer.get_feature_names()
fake_corpus = []
for doc in Y:
    # keep only the terms whose tf-idf score is nonzero in this document
    words = [vocab[word_index] for word_index in doc.nonzero()[0]]
    fake_corpus.append(words)
With Y you have the tf-idf scores for each doc in the corpus, so the nonzero columns tell you which words are present; with vocab you have the word a given index corresponds to, so you basically just need to combine them.
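One caveat worth noting: this recovers the set of vocabulary terms present in each document, not the original token order or repetitions, since the tf-idf matrix only records which features are nonzero. With the corpus above, for example:
fake_corpus[1]
# ['document', 'is', 'second', 'the', 'this'] -- 'document' shows up once,
# even though it occurs twice in 'This document is the second document.'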