博文

[转载]核密度估计( Density Estimation)

已有 4796 次阅读 2022-7-19 09:59 |个人分类:Python|系统分类:科研笔记|文章来源:转载

refer to: https://scikit-learn.org/stable/modules/density.html

https://zhuanlan.zhihu.com/p/360982296

https://blog.csdn.net/kryolith/article/details/39832573

https://blog.csdn.net/weixin_41297324/article/details/86577728

2.8. Density Estimation

Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (GaussianMixture), and neighbor-based approaches such as the kernel density estimate (KernelDensity). Gaussian Mixtures are discussed more fully in the context ofclustering, because the technique is also useful as an unsupervised clustering scheme.

Density estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.

2.8.1. Density Estimation: Histograms

A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is tallied. An example of a histogram can be seen in the upper-left panel of the following figure:

A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data, with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different interpretations of the data.

Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations mean that it is a much better representation of the underlying data.

This visualization is an example of a kernel density estimation, in this case with a top-hat kernel (i.e. a square block at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution of points.

2.8.2. Kernel Density Estimation

Kernel density estimation in scikit-learn is implemented in theKernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these). Though the above example uses a 1D data set for simplicity, kernel density estimation can be performed in any number of dimensions, though in practice the curse of dimensionality causes its performance to degrade in high dimensions.

In the following figure, 100 points are drawn from a bimodal distribution, and the kernel density estimates are shown for three choices of kernels:

It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density estimator can be used as follows:

>>>

>>> from sklearn.neighbors import KernelDensity>>> import numpy as np>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)>>> kde.score_samples(X)array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,       -0.41076071])

Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function

which is controlled by the bandwidth parameter . Given this kernel form, the density estimate at a point within a group of points

is given by:

The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result. A large bandwidth leads to a very smooth (i.e. high-bias) density distribution. A small bandwidth leads to an unsmooth (i.e. high-variance) density distribution.

KernelDensity implements several common kernel forms, which are shown in the following figure:

The form of these kernels is as follows:

Gaussian kernel (kernel = 'gaussian')

Tophat kernel (kernel = 'tophat')

Epanechnikov kernel (kernel = 'epanechnikov')

Exponential kernel (kernel = 'exponential')

Linear kernel (kernel = 'linear')

Cosine kernel (kernel = 'cosine')

The kernel density estimator can be used with any of the valid distance metrics (see DistanceMetric for a list of available metrics), though the results are properly normalized only for the Euclidean metric. One particularly useful metric is theHaversine distancewhich measures the angular distance between points on a sphere. Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case the distribution of observations of two different species on the South American continent:

One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. Here is an example of using this process to create a new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:

The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE model.

Examples:

Simple 1D Kernel Density Estimation: computation of simple kernel density estimates in one dimension.
Kernel Density Estimation: an example of using Kernel Density estimation to learn a generative model of the hand-written digits data, and drawing new samples from this model.
Kernel Density Estimate of Species Distributions: an example of Kernel Density estimation using the Haversine distance metric to visualize geospatial data

转载本文请联系原作者获取授权，同时请注明本文来自张书会科学网博客。
链接地址：https://blog.sciencenet.cn/blog-587102-1347908.html

上一篇：python # start-stop-number产生等间距的数 k_n=np.linspace(2,33,32)
下一篇：[转载]设置主副刻度对象属性

收藏 IP: 119.78.226.*| 热度|

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

数据加载中...

返回顶部

张书会

扫一扫，分享此博文

zsh2270365028的个人博客分享 http://blog.sciencenet.cn/u/zsh2270365028

博文

[转载]核密度估计( Density Estimation)

2.8. Density Estimation

2.8.1. Density Estimation: Histograms

2.8.2. Kernel Density Estimation

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

张书会

全部精选博文导读

zsh2270365028的个人博客分享 http://blog.sciencenet.cn/u/zsh2270365028

博文

[转载]核密度估计( Density Estimation)

2.8. Density Estimation

2.8.1. Density Estimation: Histograms

2.8.2. Kernel Density Estimation

当前推荐数：0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

张书会

全部精选博文导读

该博文允许注册用户评论请点击登录评论 (0 个评论)