Statistical learning for data processing (a scikit-learn tutorial)
Exercise:

from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

Full solution:

from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_samples = len(X_digits)
split = int(.9 * n_samples)    # slice indices must be integers
X_train = X_digits[:split]
y_train = y_digits[:split]
X_test = X_digits[split:]
y_test = y_digits[split:]

knn = neighbors.KNeighborsClassifier()
logistic = linear_model.LogisticRegression()

print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))

(3) Support vector machines (SVMs)

Linear SVMs:
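The manual 90/10 slicing above can also be done with scikit-learn's train_test_split helper. A minimal sketch, assuming a scikit-learn version where the helper lives in sklearn.model_selection (older releases kept it in sklearn.cross_validation):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out 10% of the samples as a test set, mirroring the
# manual int(.9 * n_samples) slicing in the exercise
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.1, random_state=0)

print(len(X_train), len(X_test))
```

Unlike plain slicing, train_test_split shuffles the data before splitting, which matters when the samples are ordered by class.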
Example: plot different SVM classifiers on the iris dataset.

SVMs can be used for regression (SVR, Support Vector Regression) as well as for classification (SVC, Support Vector Classification).

Using kernels:
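As a quick sketch of the API (not part of the exercise below), the kernel is fixed when the classifier is constructed, via the `kernel` argument:

```python
from sklearn import svm

# The three kernels compared in the exercise below
svc_linear = svm.SVC(kernel='linear')        # linear kernel
svc_poly = svm.SVC(kernel='poly', degree=3)  # polynomial kernel of degree 3
svc_rbf = svm.SVC(kernel='rbf')              # radial basis function kernel

print(svc_linear.kernel, svc_poly.kernel, svc_rbf.kernel)
```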
svc = svm.SVC(kernel='rbf')

Interactive example:

Exercise: try to classify two classes from the iris dataset with an SVM, using only the first two features. Leave out 10% of each class as a test set.

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]

Full solution:

"""
================================
SVM Exercise
================================

A tutorial exercise for using different SVM kernels.

This exercise is used in the :ref:`using_kernels_tut` part of the
:ref:`supervised_learning_tut` section of the :ref:`stat_learn_tut_index`.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 0, :2]
y = y[y != 0]

n_sample = len(X)

np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(float)

split = int(.9 * n_sample)    # slice indices must be integers
X_train = X[:split]
y_train = y[:split]
X_test = X[split:]
y_test = y[split:]

# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)

    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)

    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none',
                zorder=10)

    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
                linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
    plt.title(kernel)

plt.show()

3. Model selection: choosing estimators and their parameters

(1) Score, and cross-validated scores

As we have seen, every estimator exposes a score method that judges the quality of the fit (or the prediction) on new data. Bigger is better.

from sklearn import datasets, svm

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])

To get a better measure of prediction accuracy, we can successively split the data into folds that we alternately use for training and for testing:

import numpy as np

X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
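The manual fold loop can be written more compactly with cross_val_score. A sketch, assuming a scikit-learn version where it lives in sklearn.model_selection (older releases used sklearn.cross_validation):

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(C=1, kernel='linear')

# 3-fold cross-validation over the digits data, equivalent in spirit
# to the manual array_split/pop loop: one score per held-out fold
scores = cross_val_score(svc, digits.data, digits.target, cv=3)
print(scores)
```

Each entry of `scores` is the accuracy on one held-out fold; averaging them gives the cross-validated estimate of generalization accuracy.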