Hands-On Series: tf-idf (with sklearn)

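This article follows the official scikit-learn documentation and walks through tf-idf and its practical use. tf-idf is a way of representing text as numeric vectors. Compared with a simple one-hot encoding, it accounts not only for how often a term occurs in a document (tf) but also for how common that term is across all documents: the inverse document frequency (idf) lowers the weight of terms that appear almost everywhere, so the resulting vectors carry more information.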
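To make the weighting concrete, the sketch below computes smoothed idf values and L2-normalized tf-idf rows by hand, using only the standard library. It assumes the formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The term list and count matrix are made up for illustration.

```python
import math

# Toy term-count matrix: one row per document, one column per term.
# Terms and counts are illustrative, not taken from a real dataset.
terms = ["this", "document", "first", "second"]
counts = [
    [1, 1, 1, 0],
    [1, 2, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]

n_docs = len(counts)

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed idf (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in counts]

for term, d, w in zip(terms, df, idf):
    print(f"{term:10s} df={d} idf={w:.4f}")
```

A term that occurs in every document ("this") gets the minimum idf of 1, while a term found in a single document ("second") gets the largest weight.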

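There are two ways to build tf-idf vectors in practice:

Method 1: use TfidfTransformer. It converts an existing count matrix into a tf-idf matrix, so it is usually paired with CountVectorizer, which produces the count matrix from raw text.

Method 2: use TfidfVectorizer directly. It produces the tf-idf matrix straight from raw text.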

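Before looking at how TfidfTransformer is used, here are its key parameters and interface.

Parameters: norm (normalization applied to each output row, 'l2' by default), use_idf (default True, enables idf reweighting), smooth_idf (default True, smooths the idf computation by adding one to the document frequencies), and sublinear_tf (default False, applies sublinear tf scaling, replacing tf with 1 + log(tf)). The main methods and attributes are: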

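get_feature_names_out: returns the names of the output features. If the input is None, the feature names seen during fit are used; if none were seen, names are generated as x0, x1, and so on.

set_params: sets the parameters of the estimator.

idf_: an attribute rather than a method. When use_idf is True, it holds the idf value of each of the n terms.

fit: learns the idf vector (the global term weights). The input is a term-count matrix and the return value is the transformer itself.

transform: performs the conversion. The input is a (sparse) term-count matrix and the output is a sparse tf-idf matrix.

fit_transform: combines fit and transform, fitting on the data passed in and returning the transformed matrix.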

Here is an example that uses TfidfTransformer:

python

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

# Fix the vocabulary so the column order is known in advance.
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
              'and', 'one']

# Step 1 counts the terms; step 2 reweights the counts with tf-idf.
pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
                 ('tfid', TfidfTransformer())]).fit(corpus)

# Raw term counts, one row per document.
pipe['count'].transform(corpus).toarray()

# Learned idf weight of each vocabulary term.
pipe['tfid'].idf_
# array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,
#        1.        , 1.91629073, 1.91629073])

# Shape of the tf-idf matrix: 4 documents by 8 vocabulary terms.
pipe.transform(corpus).shape
# (4, 8)

This example shows how TfidfTransformer, chained after CountVectorizer in a Pipeline, converts text data into a tf-idf representation.

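The other approach is to use TfidfVectorizer directly. It combines the functionality of CountVectorizer and TfidfTransformer, which simplifies the path from text to tf-idf. Here is an example that uses TfidfVectorizer: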

python

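from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Tokenize, count, and apply tf-idf weighting in a single step.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Vocabulary learned from the corpus, in column order.
print(vectorizer.get_feature_names_out())

# Shape of the tf-idf matrix: (number of documents, number of terms).
print(X.shape)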

This example shows how TfidfVectorizer generates the tf-idf matrix directly from raw text.
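As a quick check that the two methods agree, the sketch below runs both on the same corpus and compares the results. It assumes default parameters throughout; changing the tokenization or weighting options on only one side would make the matrices differ.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Method 1: build the count matrix first, then apply tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Method 2: go from raw text to tf-idf in one step.
one_step = TfidfVectorizer().fit_transform(corpus)

# With default parameters, both routes should give the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Method 2 is the shorter route when only the tf-idf matrix is needed, while Method 1 keeps the intermediate count matrix available for inspection or reuse.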
