
[NLP] The TF-IDF In Text Mining

TF-IDF (Term Frequency - Inverse Document Frequency) is a famous word-weighting technique: it measures how important a word is to a text.

Like the well-known Word2Vec, it can convert words into vectors for computer calculation.

Below, I will introduce the principle and formula of TF-IDF step by step, and then show how to implement TF-IDF in a program.


TF-IDF Principle

As mentioned in the preface, TF-IDF is often used for weighting in information retrieval, and the so-called importance of a word to a text can be understood as the weight of that word for a specific document.

The question is, how do we define this so-called importance?

TF-IDF makes the following assumptions:

  • The more often a "word" appears in a "text", the more important that "word" is
  • The more often a "word" appears across many "texts", the less important that "word" is

For example, suppose we have many "U.S. Travel Introduction" articles, and both "the" and "corn" are high-frequency words in article A.

However, "the" also appears in almost every other article, while "corn" appears only in article A and a few others.

The weights given by TF-IDF will therefore tell us that "corn" is a very important word for article A, one that can distinguish article A from the other articles.

In contrast, the word "the" is not important at all for article A, because it is a word that appears in most articles.

The above is the basic principle of TF-IDF, and the formula of TF-IDF will be introduced in more detail below.


TF-IDF Formula

We discuss TF and IDF separately.

TF is the so-called Term Frequency, that is, how frequently a word appears in a text. It is calculated as "the number of times the word appears in the text" / "the total number of words in the text":

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

The numerator n_{i,j} is the number of occurrences of the word t_i in the document d_j.

The denominator Σ_k n_{k,j} is the total number of words in the document d_j.
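
For instance, here is a minimal sketch of the TF calculation in Python (the sample sentence and word are made up for illustration):

# TF sketch: occurrences of the word divided by the total number of words
doc = 'today is a nice day and today is sunny'.split()
word = 'today'

tf = doc.count(word) / len(doc)
print(tf)  # 2 / 9 = 0.2222...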


IDF is the so-called Inverse Document Frequency. It is calculated as "log(the total number of documents / the number of documents that contain the word)":

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )

log is a logarithmic function with base 10.
The numerator |D| is the total number of documents.
The denominator |{ j : t_i ∈ d_j }| is the number of documents containing the term t_i.
However, since the denominator may be 0 (a word may appear in no document at all), one is usually added to the denominator:

idf_i = log( |D| / (1 + |{ j : t_i ∈ d_j }|) )
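
Likewise, a minimal sketch of the (smoothed) IDF calculation, using a tiny made-up corpus:

# IDF sketch: log(total documents / (1 + documents containing the word))
import math

docs = ['today is a nice day'.split(),
        'today is a bad day'.split()]
word = 'nice'

df = sum(1 for doc in docs if word in doc)  # document frequency: 1
idf = math.log10(len(docs) / (1 + df))
print(idf)  # log10(2 / 2) = 0.0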


The so-called TF-IDF is then simply the TF value multiplied by the IDF value:

tfidf_{i,j} = tf_{i,j} × idf_i
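
As a quick sanity check with made-up numbers: if a word appears 2 times in a 7-word document, and 2 of the 4 documents in the corpus contain it, then tf = 2/7 ≈ 0.286, idf = log10(4 / (1 + 2)) ≈ 0.125, and tfidf ≈ 0.286 × 0.125 ≈ 0.036.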


Implement TF-IDF Program

Before starting the implementation, it should be noted in advance that Scikit-Learn already provides a ready-made tool for this. In most cases, unless you want to learn the details or make some special adjustments, it is recommended to call the tools others have already written rather than reinvent the wheel.

The following sample programs use Scikit-Learn and other packages. If they are not already installed in your environment, you can install them with the following commands:

pip3 install matplotlib
pip3 install pandas
pip3 install scikit-learn


Then let's use Scikit-Learn to calculate TF-IDF for us! The 4 test documents below were written by me.

# coding: utf-8
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


# Documents
doc_0 = 'Today is a nice day'
doc_1 = 'Today is a bad day'
doc_2 = 'Today I want to play all day'
doc_3 = 'I went to play all day yesterday'
doc_all = [doc_0, doc_1, doc_2, doc_3]


# TF-IDF
vectorizer = TfidfVectorizer(smooth_idf=True)
tfidf = vectorizer.fit_transform(doc_all)
result = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn versions
print('Scikit-Learn:')
print(result)
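
Note that fit_transform() returns a SciPy sparse matrix, which is why toarray() is called before building the DataFrame; each row of the result corresponds to a document and each column to a word in the learned vocabulary.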



Output: a 4×11 DataFrame of L2-normalized TF-IDF weights, one row per document and one column per vocabulary word ('a' and 'I' are dropped because the default tokenizer ignores single-character tokens).

Next, let's implement TF-IDF ourselves with the classic formulas and compare the results:

# coding: utf-8
import math
import pandas as pd
from sklearn.preprocessing import normalize


# Documents
doc_0 = 'Today is a nice day'
doc_1 = 'Today is a bad day'
doc_2 = 'Today I want to play all day'
doc_3 = 'I went to play all day yesterday'
doc_all = [doc_0, doc_1, doc_2, doc_3]
# Lowercase each word and drop single-character tokens ('a', 'I'),
# roughly mimicking Scikit-Learn's default tokenizer
doc_all = [[word.lower() for word in doc.split() if len(word) >= 2] for doc in doc_all]


# TF: occurrences of the word in the document / total words in the document
tf = dict()
for n, doc in enumerate(doc_all):
    for word in doc:
        if word not in tf:
            tf[word] = [0 for _ in doc_all]  # one TF slot per document
        tf[word][n] = doc.count(word) / len(doc)


# IDF: log(total documents / (1 + documents containing the word));
# math.log is the natural logarithm, which only rescales all IDF
# values by a constant factor compared to base 10
total_D = len(doc_all)
idf = dict()
for doc in doc_all:
    for word in doc:
        if word not in idf:
            df = sum(1 for d in doc_all if word in d)
            idf[word] = math.log(total_D / (1 + df))


# TF-IDF: multiply each per-document TF value by the word's IDF value
sorted_word = sorted(tf)
tfidf = list()
for word in sorted_word:
    tfidf.append([v * idf[word] for v in tf[word]])

# rows are words and columns are documents, so L2-normalize each column
# (each document vector), mirroring Scikit-Learn's per-document normalization
tfidf = normalize(tfidf, norm='l2', axis=0)
results = dict()
for n, word in enumerate(sorted_word):
    results[word] = tfidf[n]


print(pd.DataFrame(results).transpose())



Output: an 11×4 DataFrame with the same vocabulary words as rows and the four documents as columns, for comparison with the Scikit-Learn result above.

It is quite normal for the two results to differ: Scikit-Learn does not use the classic TF and IDF formulas but variants of them (with smooth_idf=True, for instance, it computes idf = ln((1 + n) / (1 + df)) + 1). In fact, there are many parameters that can be adjusted in Scikit-Learn, and which settings work best for a given task can only be known after actual testing.
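
Several of these knobs can be adjusted directly on TfidfVectorizer. A small sketch (the parameter values here are purely illustrative, not a recommendation):

# Some commonly adjusted TfidfVectorizer parameters
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    smooth_idf=False,      # idf = ln(n / df) + 1 instead of the smoothed variant
    sublinear_tf=True,     # replace tf with 1 + ln(tf)
    norm=None,             # skip the final L2 normalization
    stop_words='english',  # drop common English stop words such as 'is' and 'to'
)
tfidf = vectorizer.fit_transform(doc_all)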

