Scikit-learn Practical Utilities Guide
Introduction
✓ Data preprocessing: standardization, encoding, missing-value handling
✓ Data splitting: train_test_split, K-fold cross-validation
✓ Evaluation metrics: accuracy, confusion matrix, classification report
✓ Utilities: dimensionality reduction, feature selection, etc.
Core idea: sklearn handles data processing and evaluation; PyTorch handles model training.
Installation
pip install scikit-learn
Most Commonly Used Tools (by frequency of use)
1. Data Splitting
train_test_split (the most commonly used)
from sklearn.model_selection import train_test_split
import torch
import numpy as np
# Suppose you have some data
X = np.random.randn(1000, 10)       # 1000 samples, 10 features
y = np.random.randint(0, 2, 1000)   # binary classification labels
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # use 20% as the test set
    random_state=42,    # fix the random seed for reproducibility
    stratify=y          # preserve the class ratio (recommended for classification)
)
# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.LongTensor(y_train)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.LongTensor(y_test)
print(f"训练集: {X_train.shape}") # (800, 10)
print(f"测试集: {X_test.shape}") # (200, 10)
Three-way split: training, validation, and test sets
# Method 1: split off a 30% temporary set, then halve it into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
# Result: 70% train, 15% validation, 15% test

# Method 2: split off the test set first, then compute the validation ratio by hand
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42   # 0.25 * 0.8 = 0.2 of the full dataset
)
# Result: 60% train, 20% validation, 20% test
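Following the core idea above (sklearn splits the data, PyTorch trains the model), here is a minimal sketch of feeding the split from Method 2 into PyTorch DataLoaders; the batch_size of 64 is just a placeholder choice, not something prescribed by either library:

from torch.utils.data import TensorDataset, DataLoader

# Wrap the numpy splits as tensor datasets (assumes the 60/20/20 split from Method 2 above)
train_ds = TensorDataset(torch.FloatTensor(X_train), torch.LongTensor(y_train))
val_ds = TensorDataset(torch.FloatTensor(X_val), torch.LongTensor(y_val))

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)  # shuffle only the training data
val_loader = DataLoader(val_ds, batch_size=64)                    # keep validation order fixed

for xb, yb in train_loader:  # each batch is ready for a standard PyTorch training loop
    pass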