📢 Reprint Information
Original author: Iván Palomares Carrascosa
7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Introduction
Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings can be used to fold additional information into traditional predictive machine learning models, such as those used in scikit-learn, to improve downstream performance.
This article presents seven advanced Python feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings, improving the accuracy and robustness of downstream text-dependent machine learning models such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.
Common Setup for All Examples
Unless otherwise stated, the seven example tricks below use this common setup. We rely on Sentence Transformers to generate embeddings and on scikit-learn for the modeling tools.
!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
import numpy as np
# Load a lightweight LLM embedding model; builds 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
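As a quick sanity check, you can encode a sample sentence and confirm the embedding dimensionality matches what the model advertises:
# Minimal sanity check: one sentence in, one 384-dimensional vector out
sample_emb = model.encode(["Embeddings turn text into vectors."])
print(sample_emb.shape)  # Expected: (1, 384)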
1. Combining TF-IDF and Embedding Features
The first example shows how, given a source text dataset such as fetch_20newsgroups, to jointly extract TF-IDF features and LLM-generated sentence embedding features. We then combine the two feature types to train a logistic regression model that classifies news texts on the combined features, which typically improves accuracy by capturing both lexical and semantic information.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Loading data
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'])
texts, y = data.data[:500], data.target[:500]
# Extracting features of two broad types
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)
# Combining features and training ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Accuracy:", clf.score(X, y))
2. Topic-Oriented Embedding Clustering
This trick takes a few example text sequences, generates embeddings with the preloaded language model, applies K-Means clustering to the embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example's cluster identifier (its "topic category") to build a new feature representation. This is a useful strategy for creating compact topic meta-features.
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
texts = ["Tokyo Tower is a popular landmark.",
"Sushi is a traditional Japanese dish.",
"Mount Fuji is a famous volcano in Japan.",
"Cherry blossoms bloom in the spring in Japan."]
emb = model.encode(texts)
topics = KMeans(n_clusters=2, n_init='auto', random_state=42).fit_predict(emb)
topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))
X = np.hstack([emb, topic_ohe])
print(X.shape)
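To make the meta-feature interpretable, it can help to inspect which texts landed in which cluster:
# Inspect the cluster ("topic") assigned to each text
for topic, text in zip(topics, texts):
    print(topic, "-", text)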
3. Semantic Anchor Similarity Features
This simple strategy computes each text's similarity to a fixed set of "anchor" (or reference) sentences that act as compact semantic descriptors, essentially semantic landmarks. Each column of the similarity feature matrix holds the texts' similarity to one anchor. Its main value is letting a model learn the relationship between a text's resemblance to key concepts and the target variable, which is useful for text classification models.
from sklearn.metrics.pairwise import cosine_similarity
anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)
texts = ["The rocket launch was successful.", "The car handled well on the track."]
emb = model.encode(texts)
sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
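These similarity columns can be used on their own or stacked onto the raw embeddings as extra features; a minimal sketch:
# Append anchor similarities to the raw embeddings as extra feature columns
X = np.hstack([emb, sim_features])
print(X.shape)  # (2, 384 + 3)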
4. Meta-Feature Stacking via an Auxiliary Sentiment Classifier
For texts associated with labels such as sentiment, the following feature engineering technique adds extra value. A meta-feature is built from the predicted probabilities returned by an auxiliary classifier trained on the embeddings. Stacking this meta-feature with the original embeddings produces an augmented feature set that can improve downstream performance by exposing more discriminative information than the raw embeddings alone.
This example requires some additional setup:
!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np
embedder = SentenceTransformer("all-MiniLM-L6-v2") # 384-dim
# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])
# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)
# Train an auxiliary classifier on embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Leverage the auxiliary model's predicted probability as a meta-feature
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1) # Prob of positive class
# Augment original embeddings with the meta-feature
# Do not forget to scale again for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature]) # Stack features together
print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))
5. Embedding Compression and Non-Linear Expansion
This strategy applies PCA dimensionality reduction to the raw embeddings built by the LLM, then performs a polynomial expansion on the compressed embeddings. It may look strange at first glance, but it is an effective way to capture non-linear structure while remaining efficient.
!pip install sentence-transformers scikit-learn -q
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Loading a lightweight embedding language model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["The satellite was launched into orbit.",
"Cars require regular maintenance.",
"The telescope observed distant galaxies."]
# Obtaining embeddings
emb = embedder.encode(texts, show_progress_bar=False)
# Compressing with PCA and enriching with polynomial features
pca = PCA(n_components=2).fit_transform(emb)  # n_components must not exceed the number of samples (3 here)
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)
print("Original shape:", emb.shape)
print("After PCA:", pca.shape)
print("After polynomial expansion:", poly.shape)
6. Relational Learning with Pairwise Contrastive Features
The goal here is to build pairwise relational features from text embeddings. Features constructed in this contrastive fashion can highlight both similar and dissimilar aspects, which is especially effective for predictive tasks that inherently involve comparing texts.
!pip install sentence-transformers -q
from sentence_transformers import SentenceTransformer
import numpy as np
# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Example text pairs
pairs = [ ("The car is fast.", "The vehicle moves quickly."),
("The sky is blue.", "Bananas are yellow.") ]
# Generating embeddings for both sides
emb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)
emb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)
# Building contrastive features: absolute difference and element-wise product
X_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])
print("Pairwise feature shape:", X_pairs.shape)