Python: How to Perform Dimensionality Reduction with Scikit-learn
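Dimensionality reduction is an unsupervised machine learning technique that reduces the number of feature variables in each data sample by selecting or deriving a smaller set of principal features. Principal component analysis (PCA) is one of the popular dimensionality reduction algorithms available in Scikit-learn.
In this tutorial, we use Python Scikit-learn (Sklearn) to perform dimensionality reduction with principal component analysis and incremental principal component analysis.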
Using Principal Component Analysis (PCA)
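PCA is a statistical method that analyzes the features of the original dataset and linearly projects the data onto a new feature space. The core idea is to find the "principal" directions of variation in the data and build new features from them. The result is a new dataset with fewer dimensions that retains most, though not necessarily all, of the information (variance) in the original dataset.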
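The projection described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not Scikit-learn's implementation: the function name pca_project and the toy data X_toy are hypothetical and exist only for this example.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal directions using SVD."""
    # Center each feature; PCA works on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured along each kept direction
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    # Coordinates of every sample in the new, smaller feature space
    return X_centered @ components.T, components, explained_variance

rng = np.random.default_rng(0)
# Hypothetical toy data: 3 features, the third is nearly a copy of the first
base = rng.normal(size=(100, 2))
X_toy = np.column_stack(
    [base[:, 0], base[:, 1], base[:, 0] + 0.01 * rng.normal(size=100)]
)

X_reduced, components, explained_variance = pca_project(X_toy, n_components=2)
print(X_reduced.shape)
print(explained_variance)
```

Because the third toy feature is almost redundant, two components should capture nearly all of the variance, which is the situation in which dropping dimensions loses little information.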
Example
In the example below, we use the Iris dataset bundled with the Scikit-learn package and fit PCA to it (initialized with 2 components).
# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition
# Load iris plant dataset
iris = datasets.load_iris()
# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))
X_iris, Y_iris = iris.data, iris.target
# Initialize PCA and fit the data
pca_2 = decomposition.PCA(n_components=2)
pca_2.fit(X_iris)
# Transforming iris data to new dimensions(with 2 features)
X_iris_pca2 = pca_2.transform(X_iris)
# Printing new dataset
print('New Dataset size after transformations : ', X_iris_pca2.shape)
print('\n')
# Getting the direction of maximum variance in data
print("Components : ", pca_2.components_)
print('\n')
# Getting the amount of variance explained by each component
print("Explained Variance:",pca_2.explained_variance_)
print('\n')
# Getting the percentage of variance explained by each component
print("Explained Variance Ratio:",pca_2.explained_variance_ratio_)
print('\n')
# Getting the singular values for each component
print("Singular Values :",pca_2.singular_values_)
print('\n')
# Getting estimated noise covariance
print("Noise Variance :",pca_2.noise_variance_)
Output
It will produce the following output:
Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Features size : (150, 4)
Target names : ['setosa' 'versicolor' 'virginica']
Target size : (150,)
New Dataset size after transformations : (150, 2)
Components : [[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]]
Explained Variance: [4.22824171 0.24267075]
Explained Variance Ratio: [0.92461872 0.05306648]
Singular Values : [25.09996044 6.01314738]
Noise Variance : 0.051022296508184406
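The attributes printed above are related to one another, and a short check can make that concrete. The sketch below assumes the same Iris data and verifies that explained_variance_ equals singular_values_ squared divided by (n_samples - 1). It also shows one way to let PCA pick the number of components: passing a float between 0 and 1 as n_components. The 0.95 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets, decomposition

X = datasets.load_iris().data
pca = decomposition.PCA(n_components=2).fit(X)

# explained_variance_ equals singular_values_**2 / (n_samples - 1)
n_samples = X.shape[0]
derived_variance = pca.singular_values_ ** 2 / (n_samples - 1)
print(np.allclose(derived_variance, pca.explained_variance_))

# Share of the total variance kept by the two components together
total_kept = pca.explained_variance_ratio_.sum()
print(total_kept)

# A float n_components in (0, 1) asks PCA to keep just enough
# components to reach that share of the variance
pca_95 = decomposition.PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)
```

The two ratios in the output above sum to roughly 0.978, so two components already exceed a 95% target on this dataset.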
Using Incremental Principal Component Analysis (IPCA)
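Incremental principal component analysis (IPCA) addresses the main limitation of PCA: PCA supports only batch processing, which means all of the input data must fit in memory.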
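The Scikit-learn machine learning library provides the sklearn.decomposition.IncrementalPCA class, which enables out-of-core PCA either by calling its partial_fit method on chunks of data fetched sequentially or by using np.memmap (a memory-mapped file), so the whole file never has to be loaded into memory.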
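A minimal sketch of the partial_fit route follows. The Iris data easily fits in memory, so np.array_split is used here only as a stand-in for chunks that would otherwise be read from disk or a stream; the chunk count of 5 is an arbitrary assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import IncrementalPCA

X = datasets.load_iris().data

ipca = IncrementalPCA(n_components=2)
# Stand-in for sequentially fetched data: 5 chunks of 30 rows each
for chunk in np.array_split(X, 5):
    # Each call updates the running mean and components with one chunk
    ipca.partial_fit(chunk)

# Once fitted, the model can transform data chunk by chunk or all at once
X_reduced = ipca.transform(X)
print(X_reduced.shape)
print(ipca.n_samples_seen_)
```

Each chunk passed to partial_fit needs at least n_components rows, which is worth keeping in mind when choosing a chunk size for real data.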
As with PCA, IPCA centers the input data for each feature but does not scale it before applying SVD.
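The centering behaviour can be checked directly. The sketch below assumes the Iris data and shows that transform amounts to subtracting the fitted mean_ and projecting onto components_, with no division by the standard deviation. Standardizing first with StandardScaler is an optional extra step shown for comparison, not something IPCA does on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
ipca = IncrementalPCA(n_components=2, batch_size=20).fit(X)

# The fitted mean_ is the per-feature mean that gets subtracted
print(np.allclose(ipca.mean_, X.mean(axis=0)))

# transform() is centering followed by projection, with no rescaling
manual = (X - ipca.mean_) @ ipca.components_.T
print(np.allclose(manual, ipca.transform(X)))

# Optional: standardize features first when they are on different scales
X_scaled = StandardScaler().fit_transform(X)
ipca_scaled = IncrementalPCA(n_components=2, batch_size=20).fit(X_scaled)
print(ipca_scaled.explained_variance_ratio_)
```

Whether to standardize depends on the data: when features share a unit, as the centimetre measurements in Iris do, leaving them unscaled is a reasonable choice.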
Example
In the example below, we use the Iris dataset bundled with the scikit-learn package and fit IPCA to it (initialized with 2 components and a batch size of 20).
# Importing the necessary packages
from sklearn import datasets
from sklearn import decomposition
# Load iris plant dataset
iris = datasets.load_iris()
# Print details about the dataset
print('Features names : '+str(iris.feature_names))
print('\n')
print('Features size : '+str(iris.data.shape))
print('\n')
print('Target names : '+str(iris.target_names))
print('\n')
print('Target size : '+str(iris.target.shape))
X_iris, Y_iris = iris.data, iris.target
# Initialize IncrementalPCA and fit the data in batches of 20
ipca_2 = decomposition.IncrementalPCA(n_components=2, batch_size=20)
ipca_2.fit(X_iris)
# Transforming iris data to new dimensions(with 2 features)
X_iris_ipca2 = ipca_2.transform(X_iris)
# Printing new dataset
print('New Dataset size after transformations : ', X_iris_ipca2.shape)
print('\n')
# Getting the direction of maximum variance in data
print("Components : ", ipca_2.components_)
print('\n')
# Getting the amount of variance explained by each component
print("Explained Variance:",ipca_2.explained_variance_)
print('\n')
# Getting the percentage of variance explained by each component
print("Explained Variance Ratio:",ipca_2.explained_variance_ratio_)
print('\n')
# Getting the singular values for each component
print("Singular Values :",ipca_2.singular_values_)
print('\n')
# Getting estimated noise covariance
print("Noise Variance :",ipca_2.noise_variance_)
Output
It will produce the following output:
Features names : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Features size : (150, 4)
Target names : ['setosa' 'versicolor' 'virginica']
Target size : (150,)
New Dataset size after transformations : (150, 2)
Components : [[ 0.3622612 -0.0850586 0.85634557 0.35805603]
[ 0.64678214 0.73999163 -0.17069766 -0.07033882]]
Explained Variance: [4.22535552 0.24227125]
Explained Variance Ratio: [0.92398758 0.05297912]
Singular Values : [25.09139241 6.0081958 ]
Noise Variance : 0.00713274779746683
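The IPCA results are close to, but not identical with, the PCA results, because IPCA builds its estimate batch by batch. One way to quantify the gap is sketched below, assuming the same settings as the two examples; the names alignment and ratio_gap are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA, IncrementalPCA

X = datasets.load_iris().data
pca = PCA(n_components=2).fit(X)
ipca = IncrementalPCA(n_components=2, batch_size=20).fit(X)

# Components are unit vectors, so the row-wise dot product is the cosine
# of the angle between matching directions. The absolute value ignores
# sign flips, which do not change the subspace.
alignment = np.abs(np.sum(pca.components_ * ipca.components_, axis=1))
print(alignment)

# Gap in the share of variance each method attributes to its components
ratio_gap = np.abs(pca.explained_variance_ratio_ - ipca.explained_variance_ratio_)
print(ratio_gap)
```

Alignment values near 1 indicate that both methods found essentially the same directions on this dataset.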