
📢 Reprint Information

Original article: https://machinelearningmastery.com/7-advanced-feature-engineering-tricks-for-text-data-using-llm-embeddings/

Original author: Iván Palomares Carrascosa


7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Image by Editor

Introduction

Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models, such as those used in scikit-learn, to improve downstream performance.

This article presents seven advanced Python feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings, improving the accuracy and robustness of text-dependent machine learning models in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.

General Setup for All Examples

Unless otherwise noted, the seven example tricks below rely on this common setup. We use Sentence Transformers to obtain embeddings and scikit-learn for model utilities.

!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight LLM embedding model; builds 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
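
As a quick sanity check (illustrative, not part of the original setup), you can encode a sentence and confirm the embedding dimensionality:

# Illustrative sanity check: encode one sentence and inspect its shape
sample_emb = model.encode(["A quick test sentence."])
print(sample_emb.shape)  # expected: (1, 384) for all-MiniLM-L6-v2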

1. Combining TF-IDF and Embedding Features

The first example shows how to jointly extract TF-IDF features and LLM-generated sentence embedding features from a source text dataset (here, fetch_20newsgroups). We then combine both feature types to train a logistic regression model that classifies news texts on the combined features, which often improves accuracy by capturing both lexical and semantic information.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Loading data
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
texts, y = data.data[:500], data.target[:500]

# Extracting features of two broad types
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)

# Combining features and training ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Accuracy:", clf.score(X, y))

2. Category-Aware Embedding Clustering

This trick takes a small set of example text sequences, generates embeddings with the preloaded language model, applies k-means clustering to those embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example's cluster identifier (its "topic category") to build a new feature representation. This is a useful strategy for creating compact topical meta-features.

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

texts = ["Tokyo Tower is a popular landmark.", 
         "Sushi is a traditional Japanese dish.", 
         "Mount Fuji is a famous volcano in Japan.", 
         "Cherry blossoms bloom in the spring in Japan."]

emb = model.encode(texts)
topics = KMeans(n_clusters=2, n_init='auto', random_state=42).fit_predict(emb)
topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))
X = np.hstack([emb, topic_ohe])
print(X.shape)
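
To see what the meta-feature encodes, you can print each text alongside its assigned cluster (an illustrative check, not in the original):

# Illustrative: inspect the cluster ("topic") assigned to each text
for topic, text in zip(topics, texts):
    print(topic, "-", text)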

3. Semantic Anchor Similarity Features

This simple strategy computes the similarity of each text to a small, fixed set of "anchor" (reference) sentences that act as compact semantic descriptors, essentially semantic landmarks. Each column of the resulting similarity feature matrix holds the texts' similarity to one anchor. The main value lies in letting a model learn relationships between a text and its similarity to key concepts, which is useful for text classification models.

from sklearn.metrics.pairwise import cosine_similarity

anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)
texts = ["The rocket launch was successful.", "The car handled well on the track."]
emb = model.encode(texts)
sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
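
These similarity columns can feed a classifier directly. A minimal sketch, assuming hypothetical binary labels (1 = space-related, 0 = not):

from sklearn.linear_model import LogisticRegression

y = [1, 0]  # hypothetical labels for the two example texts
clf = LogisticRegression().fit(sim_features, y)
print(clf.predict(sim_features))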

4. Meta-Feature Stacking via an Auxiliary Sentiment Classifier

For texts associated with sentiment labels, the following feature engineering technique adds extra value. A meta-feature is built from the predicted probabilities returned by an auxiliary classifier trained on the embeddings. Stacking this meta-feature with the original embeddings yields an augmented feature set, which can improve downstream performance by exposing more discriminative information than the raw embeddings alone.

This example requires some extra setup:

!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2") # 384-dim

# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])

# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)

# Train an auxiliary classifier on embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Leverage the auxiliary model's predicted probability as a meta-feature
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1) # Prob of positive class

# Augment original embeddings with the meta-feature
# Do not forget to scale again for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature]) # Stack features together

print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))

5. Embedding Compression and Non-Linear Expansion

This strategy applies PCA dimensionality reduction to the raw LLM-built embeddings and then performs a polynomial expansion on the compressed embeddings. It may seem odd at first glance, but it can be an effective way to capture non-linear structure while remaining efficient.

!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Loading a lightweight embedding language model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["The satellite was launched into orbit.", 
         "Cars require regular maintenance.", 
         "The telescope observed distant galaxies."]

# Obtaining embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Compressing with PCA and enriching with polynomial features
pca = PCA(n_components=2).fit_transform(emb) # n_components must not exceed min(n_samples, n_features)
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)

print("Original shape:", emb.shape)
print("After PCA:", pca.shape)
print("After polynomial expansion:", poly.shape)

6. Relational Learning with Pairwise Contrastive Features

The goal here is to build pairwise relational features from text embeddings. Features built contrastively, relating two texts to each other, can highlight both similar and dissimilar aspects. This is particularly effective for prediction tasks that inherently involve comparing texts.

!pip install sentence-transformers -q
from sentence_transformers import SentenceTransformer
import numpy as np

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Example text pairs
pairs = [ ("The car is fast.", "The vehicle moves quickly."), 
          ("The sky is blue.", "Bananas are yellow.") ]

# Generating embeddings for both sides
emb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)
emb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)

# Building contrastive features: absolute difference and element-wise product
X_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])
print("Pairwise feature shape:", X_pairs.shape)



