Version 0.16#

Version 0.16.1#

April 14, 2015

Changelog#

Bug fixes#

Allow input data larger than block_size in covariance.LedoitWolf. By Andreas Müller.
Fix a bug in isotonic.IsotonicRegression deduplication that caused unstable results in calibration.CalibratedClassifierCV. By Jan Hendrik Metzen.
Fix sorting of labels in preprocessing.label_binarize. By Michael Heilman.
Fix several stability and convergence issues in cross_decomposition.CCA and cross_decomposition.PLSCanonical. By Andreas Müller.
Fix a bug in cluster.KMeans when precompute_distances=False on Fortran-ordered data.
Fix a speed regression in the predict and predict_proba methods of ensemble.RandomForestClassifier. By Andreas Müller.
Fix a regression where utils.shuffle converted lists and dataframes to arrays. By Olivier Grisel.
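
The utils.shuffle entry above is about container types: a list passed in should come back as a list rather than an array. The following is a minimal plain-Python sketch of that behavior, not scikit-learn's implementation. The helper name shuffle_together and its seed argument are hypothetical and exist only for this illustration.

```python
import random


def shuffle_together(*sequences, seed=0):
    """Shuffle several equal-length sequences with one shared permutation.

    Hypothetical helper: each output keeps the container type of its input
    (a list stays a list, a tuple stays a tuple) instead of becoming an array.
    """
    if not sequences:
        return []
    n = len(sequences[0])
    if any(len(seq) != n for seq in sequences):
        raise ValueError("all sequences must have the same length")
    order = list(range(n))
    random.Random(seed).shuffle(order)  # one permutation reused for every input
    return [type(seq)(seq[i] for i in order) for seq in sequences]


X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = (0, 1, 2, 3)
X_shuffled, y_shuffled = shuffle_together(X, y, seed=42)
```

Because the same permutation is applied to every input, each row of X_shuffled still lines up with its label in y_shuffled.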

Version 0.16#

March 26, 2015

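Highlights#

Speed improvements (notably in cluster.DBSCAN), reduced memory requirements, bug fixes and better default settings.
Multinomial logistic regression and a path algorithm in linear_model.LogisticRegressionCV.
Out-of-core learning of PCA via decomposition.IncrementalPCA.
Probability calibration of classifiers using calibration.CalibratedClassifierCV.
cluster.Birch clustering method for large-scale datasets.
Scalable approximate nearest neighbors search with locality-sensitive hashing forests in neighbors.LSHForest.
Improved error messages and better validation when using malformed input data.
More robust integration with pandas dataframes.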

Changelog#

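New features#

The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors search. By Maheshakya Wijewardena.
Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression, which is faster for large sample sizes than svm.SVR with a linear kernel. By Fabian Pedregosa and Qiang Luo.
Incremental fit for GaussianNB.
Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By Arnaud Joly.
Added the metrics.label_ranking_average_precision_score metric. By Arnaud Joly.
Added the metrics.coverage_error metric. By Arnaud Joly.
Added linear_model.LogisticRegressionCV. By Manoj Kumar, Fabian Pedregosa, Gael Varoquaux and Alexandre Gramfort.
Added the warm_start constructor parameter to make it possible for any trained forest model to grow additional trees incrementally. By Laurent Direr.
Added sample_weight support to ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor. By Peter Prettenhofer.
Added decomposition.IncrementalPCA, an implementation of the PCA algorithm that supports out-of-core learning with a partial_fit method. By Kyle Kastner.
Averaged SGD for SGDClassifier and SGDRegressor. By Danny Sullivan.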
Added the cross_val_predict function, which computes cross-validated estimates. By Luis Pedro Coelho.
Added linear_model.TheilSenRegressor, a robust generalized-median-based estimator. By Florian Wilhelm.
Added metrics.median_absolute_error, a robust metric. By Gael Varoquaux and Florian Wilhelm.
Added cluster.Birch, an online clustering algorithm. By Manoj Kumar, Alexandre Gramfort and Joel Nothman.
Added shrinkage support to discriminant_analysis.LinearDiscriminantAnalysis using two new solvers. By Clemens Brunner and Martin Billinger.
Added kernel_ridge.KernelRidge, an implementation of kernelized ridge regression. By Mathieu Blondel and Jan Hendrik Metzen.
All solvers in linear_model.Ridge now support sample_weight. By Mathieu Blondel.
Added cross_validation.PredefinedSplit cross-validation for fixed user-provided cross-validation folds. By Thomas Unterthiner.
Added calibration.CalibratedClassifierCV, an approach for calibrating the predicted probabilities of a classifier. By Alexandre Gramfort, Jan Hendrik Metzen, Mathieu Blondel and Balazs Kegl.
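
Two of the entries above, user-provided folds (cross_validation.PredefinedSplit) and cross-validated estimates (cross_val_predict), can be illustrated together. The sketch below is plain Python under stated assumptions, not the library code. The function names are hypothetical, the "model" is a toy that predicts the training mean, and a fold label of -1 is assumed to keep a sample out of every test set.

```python
from statistics import mean


def predefined_split(test_fold):
    """Yield (train_idx, test_idx) pairs from one fold label per sample.

    Assumed convention: a label of -1 means the sample is never tested
    and therefore appears in every training set.
    """
    folds = sorted({f for f in test_fold if f != -1})
    for fold in folds:
        test_idx = [i for i, f in enumerate(test_fold) if f == fold]
        train_idx = [i for i, f in enumerate(test_fold) if f != fold]
        yield train_idx, test_idx


def cross_val_predict_mean(y, test_fold):
    """Out-of-fold predictions from a toy model that predicts the training mean."""
    preds = [None] * len(y)
    for train_idx, test_idx in predefined_split(test_fold):
        fitted_mean = mean(y[i] for i in train_idx)  # "fit" on the training part only
        for i in test_idx:
            preds[i] = fitted_mean  # each sample is predicted by a model that never saw it
    return preds


y = [1.0, 2.0, 3.0, 4.0]
test_fold = [0, 0, 1, 1]
splits = list(predefined_split(test_fold))
preds = cross_val_predict_mean(y, test_fold)
```

The key property is that every prediction comes from a model fitted without the sample being predicted, which is what makes the estimates cross-validated.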

Enhancements#

Add option return_distance in hierarchical.ward_tree to return distances between nodes for both structured and unstructured versions of the algorithm. By Matteo Visconti di Oleggio Castello. The same option was added in hierarchical.linkage_tree. By Manoj Kumar.
Add support for sample weights in scorer objects. Metrics with sample weight support will automatically benefit from it. By Noel Dawe and Vlad Niculae.
Added newton-cg and lbfgs solver support in linear_model.LogisticRegression. By Manoj Kumar.
Add selection="random" parameter to implement stochastic coordinate descent for linear_model.Lasso, linear_model.ElasticNet and related estimators. By Manoj Kumar.
Add sample_weight parameter to metrics.jaccard_similarity_score and metrics.log_loss. By Jatin Shah.
Support sparse multilabel indicator representation in preprocessing.LabelBinarizer and multiclass.OneVsRestClassifier (by Hamzeh Alsalhi with thanks to Rohit Sivaprasad), as well as in evaluation metrics (by Joel Nothman).
Add sample_weight parameter to metrics.jaccard_similarity_score. By Jatin Shah.
Add support for multiclass in metrics.hinge_loss. Added labels=None as an optional parameter. By Saurabh Jha.
Add sample_weight parameter to metrics.hinge_loss. By Saurabh Jha.
Add multi_class="multinomial" option in linear_model.LogisticRegression to implement a logistic regression solver that minimizes the cross-entropy or multinomial loss instead of the default One-vs-Rest setting. Supports the lbfgs and newton-cg solvers. By Lars Buitinck and Manoj Kumar. Solver option newton-cg by Simon Wu.
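DictVectorizer can now perform fit_transform on an iterable in a single pass when given the option sort=False. By Dan Blanchard.
model_selection.GridSearchCV and model_selection.RandomizedSearchCV can now be configured to work with estimators that may fail and raise errors on individual folds. This option is controlled by the error_score parameter. It does not affect errors raised on re-fit. By Michal Romaniuk.
Add digits parameter to metrics.classification_report to allow the report to show floating point numbers at a different precision. By Ian Gilmore.
Add a quantile prediction strategy to dummy.DummyRegressor. By Aaron Staple.
Add handle_unknown option to preprocessing.OneHotEncoder to handle unknown categorical features more gracefully during transform. By Manoj Kumar.
Added support for sparse input data to decision trees and their ensembles. By Fares Hedyati and Arnaud Joly.
Optimized cluster.AffinityPropagation by reducing the number of memory allocations of large temporary data structures. By Antony Lee.
Parallelization of the computation of feature importances in random forests. By Olivier Grisel and Arnaud Joly.
Add n_iter_ attribute to estimators that accept a max_iter attribute in their constructor. By Manoj Kumar.
Added a decision function for multiclass.OneVsOneClassifier. By Raghav RV and Kyle Beauchamp.
neighbors.kneighbors_graph and radius_neighbors_graph support non-Euclidean metrics. By Manoj Kumar.
The connectivity parameter in cluster.AgglomerativeClustering and family now accepts callables that return a connectivity matrix. By Manoj Kumar.
Sparse support for metrics.pairwise.paired_distances. By Joel Nothman.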
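cluster.DBSCAN now supports sparse input and sample weights and has been optimized: the inner loop has been rewritten in Cython and radius neighbors queries are now computed in batch. By Joel Nothman and Lars Buitinck.
Add class_weight parameter to automatically weight samples by class frequency for ensemble.RandomForestClassifier, tree.DecisionTreeClassifier, ensemble.ExtraTreesClassifier and tree.ExtraTreeClassifier. By Trevor Stephens.
grid_search.RandomizedSearchCV now samples without replacement if all parameters are given as lists. By Andreas Müller.
Parallelized calculation of metrics.pairwise_distances is now supported for scipy metrics and custom callables. By Joel Nothman.
Allow the fitting and scoring of all clustering algorithms in pipeline.Pipeline. By Andreas Müller.
More robust seeding and improved error messages in cluster.MeanShift. By Andreas Müller.
Make the stopping criterion for mixture.GMM, mixture.DPGMM and mixture.VBGMM less dependent on the number of samples by thresholding the average log-likelihood change instead of its sum over all samples. By Hervé Bredin.
The outcome of manifold.spectral_embedding was made deterministic by flipping the sign of eigenvectors. By Hasil Sharma.
Significant performance and memory usage improvements in preprocessing.PolynomialFeatures. By Eric Martin.
Numerical stability improvements for preprocessing.StandardScaler and preprocessing.scale. By Nicolas Goix.
svm.SVC fitted on sparse input now implements decision_function. By Rob Zinkov and Andreas Müller.
cross_validation.train_test_split now preserves the input type instead of converting to numpy arrays.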
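
The mixture-model entry above replaces a threshold on the summed log-likelihood change with a threshold on its average. The plain-Python sketch below shows why that matters. It is an illustration under assumptions, not the mixture module's code: the function names, the tolerance value and the per-sample log-likelihoods are all made up for the example.

```python
def total_change(prev_loglik, curr_loglik):
    """Change in the summed log-likelihood; grows with the number of samples."""
    return abs(sum(curr_loglik) - sum(prev_loglik))


def average_change(prev_loglik, curr_loglik):
    """Change in the mean log-likelihood; independent of the sample count."""
    n = len(curr_loglik)
    return abs(sum(curr_loglik) / n - sum(prev_loglik) / n)


tol = 0.05  # hypothetical tolerance
results = {}
for n_samples in (10, 10000):
    # Every sample improves by the same small amount between two iterations.
    prev = [-1.000] * n_samples
    curr = [-0.999] * n_samples
    results[n_samples] = (
        total_change(prev, curr) < tol,    # sum-based criterion
        average_change(prev, curr) < tol,  # average-based criterion
    )
```

With identical per-sample progress, the sum-based test reports convergence for 10 samples but not for 10000, while the average-based test gives the same answer for both sizes.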

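Documentation improvements#

Added an example of using pipeline.FeatureUnion for heterogeneous input. By Matt Terry.
Documentation on scorers was improved to highlight the handling of loss functions. By Matt Pico.
A discrepancy between liblinear output and scikit-learn's wrappers is now noted. By Manoj Kumar.
Improved documentation generation: examples referring to a class or function are now shown in a gallery on the class or function's API reference page. By Joel Nothman.
More explicit documentation of sample generators and of data transformation. By Joel Nothman.
sklearn.neighbors.BallTree and sklearn.neighbors.KDTree used to point to empty pages stating that they are aliases of BinaryTree. This has been fixed to show the correct class docs. By Manoj Kumar.
Added silhouette plots for analysis of KMeans clustering using metrics.silhouette_samples and metrics.silhouette_score. See "Selecting the number of clusters with silhouette analysis on KMeans clustering".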

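Bug fixes#

Meta-estimators now support duck typing for the presence of decision_function, predict_proba and other methods. This fixes the behavior of grid_search.GridSearchCV, grid_search.RandomizedSearchCV, pipeline.Pipeline, feature_selection.RFE and feature_selection.RFECV when nested. By Joel Nothman.
The scoring attribute of grid search and cross-validation methods is no longer ignored when a grid_search.GridSearchCV is given as a base estimator or the base estimator doesn't have predict.
The function hierarchical.ward_tree now returns the children in the same order for both the structured and unstructured versions. By Matteo Visconti di Oleggio Castello.
feature_selection.RFECV now correctly handles cases when step is not equal to 1. By Nikolay Mayorov.
decomposition.PCA now undoes whitening in its inverse_transform. Also, its components_ now always have unit length. By Michael Eickenberg.
Fix incomplete download of the dataset when datasets.download_20newsgroups is called. By Manoj Kumar.
Various fixes to the Gaussian processes subpackage by Vincent Dubourg and Jan Hendrik Metzen.
Calling partial_fit with class_weight=='auto' throws an appropriate error message and suggests a workaround. By Danny Sullivan.
RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2; the definition of gamma is now consistent, which may substantially change your results if you use a fixed value. (If you cross-validated over gamma, it probably doesn't matter too much.) By Dougal Sutherland.
Pipeline objects delegate the classes_ attribute to the underlying estimator. This allows, for instance, bagging of a pipeline object. By Arnaud Joly.
neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan. It was using the mean before. By Manoj Kumar.
Fix numerical stability issues in linear_model.SGDClassifier and linear_model.SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for large l2 regularization and large learning rate values). By Olivier Grisel.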
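When compute_full_tree was set to "auto", the full tree was built when n_clusters was high and stopped early when n_clusters was low, whereas the behavior should be the reverse in cluster.AgglomerativeClustering (and friends). This has been fixed. By Manoj Kumar.
Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was centered around one. It has been changed to be centered around the origin. By Manoj Kumar.
Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using connectivity constraints. By Cathy Deng.
Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the multi-label setting. By Andreas Müller.
Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family when the query data is not the same as the fit data. By Manoj Kumar.
Fix log-density calculation in mixture.GMM with tied covariance. By Will Dawson.
Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By Andrew Tulloch.
Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance weighting and having identical data points. By Garrett-R.
Fixed round-off errors with non-positive-definite covariance matrices in GMM. By Alexis Mignon.
Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna Wallach.
Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying on the boundary for algorithm='brute'. By Yan Yi.
Flip the sign of dual_coef_ of svm.SVC to make it consistent with the documentation and decision_function. By Artem Sobolev.
Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets (secondary method). By Andreas Müller and Michael Bommarito.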
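
The neighbors.NearestCentroid entry above switches from the mean to the median for the manhattan metric. The reason is that the per-feature median minimizes the total Manhattan (L1) distance to a set of points, while the mean minimizes squared Euclidean distance. A small plain-Python check, with made-up data and hypothetical helper names:

```python
from statistics import mean, median


def manhattan(a, b):
    """L1 distance between two points given as equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def centroid(points, reducer):
    """Apply `reducer` (mean or median) to each feature column."""
    return [reducer(column) for column in zip(*points)]


def total_distance(points, center):
    """Sum of Manhattan distances from every point to `center`."""
    return sum(manhattan(p, center) for p in points)


# One outlier pulls the mean far away from the bulk of the points.
points = [[1, 10], [2, 11], [3, 12], [4, 13], [100, 50]]
median_center = centroid(points, median)
mean_center = centroid(points, mean)
cost_median = total_distance(points, median_center)
cost_mean = total_distance(points, mean_center)
```

On this data the median centroid gives a lower total Manhattan distance than the mean centroid, so it is the better class representative under that metric.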

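API changes summary#

GridSearchCV, cross_val_score and other meta-estimators no longer convert pandas DataFrames into arrays, allowing DataFrame-specific operations in custom estimators.
multiclass.fit_ovr, multiclass.predict_ovr, predict_proba_ovr, multiclass.fit_ovo, multiclass.predict_ovo, multiclass.fit_ecoc and multiclass.predict_ecoc have been deprecated. Use the underlying estimators instead.
Nearest neighbors estimators used to take arbitrary keyword arguments and pass these to their distance metric. This will no longer be supported in scikit-learn 0.18; use the metric_params argument instead.
The n_jobs parameter of the fit method was moved to the constructor of the LinearRegression class.
The predict_proba method of multiclass.OneVsRestClassifier now returns two probabilities per sample in the multiclass case; this is consistent with other estimators and with the method's documentation, but previous versions accidentally returned only the positive probability. Fixed by Will Lamond and Lars Buitinck.
Change the default value of precompute in linear_model.ElasticNet and linear_model.Lasso to False. Setting precompute to "auto" was found to be slower when n_samples > n_features, since the computation of the Gram matrix is expensive and outweighs the benefit of fitting the Gram matrix for just one alpha. precompute="auto" is now deprecated and will be removed in 0.18. By Manoj Kumar.
Expose the positive option in linear_model.enet_path and linear_model.enet_path, which constrains coefficients to be positive. By Manoj Kumar.
Users should now supply an explicit average parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score, sklearn.metrics.recall_score and sklearn.metrics.precision_score when performing multiclass or multilabel (i.e. not binary) classification. By Joel Nothman.
The scoring parameter for cross-validation now accepts 'f1_micro', 'f1_macro' or 'f1_weighted'. 'f1' is now for binary classification only. Similar changes apply to 'precision' and 'recall'. By Joel Nothman.
The fit_intercept, normalize and return_models parameters in linear_model.enet_path and linear_model.lasso_path have been removed. They had been deprecated since 0.14.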
From now on, all estimators will uniformly raise NotFittedError when any of the predict-like methods are called before the model is fit. By Raghav RV.
Input data validation was refactored for more consistent input validation. The check_arrays function was replaced by check_array and check_X_y. By Andreas Müller.
Allow X=None in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family. If set to None, then for every sample this avoids setting the sample itself as the first nearest neighbor. By Manoj Kumar.
Add parameter include_self in neighbors.kneighbors_graph and neighbors.radius_neighbors_graph, which has to be explicitly set by the user. If set to True, then the sample itself is considered as the first nearest neighbor.
The thresh parameter is deprecated in favor of the new tol parameter in GMM, DPGMM and VBGMM. See the Enhancements section for details. By Hervé Bredin.
Estimators will treat input with dtype object as numeric when possible. By Andreas Müller.
Estimators now raise ValueError consistently when fitted on empty data (less than 1 sample or less than 1 feature for 2D input). By Olivier Grisel.
The shuffle option of linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.Perceptron, linear_model.PassiveAggressiveClassifier and linear_model.PassiveAggressiveRegressor now defaults to True.
cluster.DBSCAN now uses a deterministic initialization. The random_state parameter is deprecated. By Erich Schubert.
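
The entries above on the explicit average parameter and the 'f1_micro', 'f1_macro' and 'f1_weighted' scorers exist because the averaging choice changes the score for non-binary problems. The plain-Python sketch below computes the three averages by hand on a made-up three-class example. It is an illustration under stated assumptions, not the sklearn.metrics implementation, and the function names are hypothetical.

```python
def class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)}, treating each class as one-vs-rest."""
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        counts[label] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn), defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, average):
    counts = class_counts(y_true, y_pred)
    if average == "micro":  # pool the counts, then compute one F1
        tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
        return f1(tp, fp, fn)
    scores = {label: f1(*c) for label, c in counts.items()}
    if average == "macro":  # unweighted mean of per-class F1
        return sum(scores.values()) / len(scores)
    if average == "weighted":  # per-class F1 weighted by class support
        support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
        total = sum(support.values())
        return sum(scores[label] * support[label] for label in scores) / total
    raise ValueError("average must be 'micro', 'macro' or 'weighted'")


y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
micro = averaged_f1(y_true, y_pred, "micro")
macro = averaged_f1(y_true, y_pred, "macro")
weighted = averaged_f1(y_true, y_pred, "weighted")
```

The three values differ on the same predictions. In this single-label example the micro average equals plain accuracy (5 of 7 correct), while the macro average gives the small classes as much influence as the large one.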

Code contributors#

A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh, Andrew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu, Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng, Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens Brunner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J. Sutherland, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R, Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque, isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas Michelbacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti dOC, Matti Lyra, Max Linke, Mehdi Cherti, Michael Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas, Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas Unterthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta, Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan Meng, Yan Yi, Yu-Chin