- 浏览: 523552 次
- 性别:
- 来自: 山东济南
文章分类
最新评论
-
dragon_8844:
非常不错,nice
java.util.concurrent 多线程框架 -
wusendong:
很好的文章!受益匪浅,谢谢!
java.util.concurrent 多线程框架 -
SINCE1978:
你也关注并发啊
java.util.concurrent 多线程框架 -
lku1314:
这个不错 刚刚找到这个组建 以前孤陋寡闻了 像lz学习!标 ...
quartz 在WEB中应用小结 -
lliiqiang:
人们对于目标需要的需求明确的去做,对于目标以外的因素是随机的执 ...
flex和后端的数据交互(一)--XML和HTTPService
A Tutorial on Clustering Algorithms
Clustering: An Introduction
What is Clustering?
Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.
We can show this with a simple graphical example:
In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if this one defines a concept common to all that objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
The Goals of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describe their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection).
Possible Applications
Clustering algorithms can be applied in many fields, for instance:
-
Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
-
Biology: classification of plants and animals given their features;
-
Libraries: book ordering;
-
Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
-
City-planning: identifying groups of houses according to their house type, value and geographical location;
-
Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
-
WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Requirements
The main requirements that a clustering algorithm should satisfy are:
-
scalability;
-
dealing with different types of attributes;
-
discovering clusters with arbitrary shape;
-
minimal requirements for domain knowledge to determine input parameters;
-
ability to deal with noise and outliers;
-
insensitivity to order of input records;
-
high dimensionality;
-
interpretability and usability.
Problems
There are a number of problems with clustering. Among them:
-
current clustering techniques do not address all the requirements adequately (and concurrently);
-
dealing with large number of dimensions and large number of data items can be problematic because of time complexity;
-
the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
-
if an obvious distance measure doesn’t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;
-
the result of the clustering algorithm (that in many cases can be arbitrary itself) can be interpreted in different ways.
Clustering Algorithms
Classification
Clustering algorithms may be classified as listed below:
-
Exclusive Clustering
-
Overlapping Clustering
-
Hierarchical Clustering
-
Probabilistic Clustering
In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it could not be included in another cluster. A simple example of that is shown in the figure below, where the separation of points is achieved by a straight line on a bi-dimensional plane.
On the contrary the second type, the overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, data will be associated to an appropriate membership value.
Instead, a hierarchical clustering algorithm is based on the union between the two nearest clusters. The beginning condition is realized by setting every datum as a cluster. After a few iterations it reaches the final clusters wanted.
Finally, the last kind of clustering use a completely probabilistic approach.
In this tutorial we propose four of the most used clustering algorithms:
-
K-means
-
Fuzzy C-means
-
Hierarchical clustering
-
Mixture of Gaussians
Each of these algorithms belongs to one of the clustering types listed above. So that, K-means is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm, Hierarchical clustering is obvious and lastly Mixture of Gaussian is a probabilistic clustering algorithm. We will discuss about each clustering method in the following paragraphs.
Distance Measure
An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. Figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scalings can lead to different clusterings.
Notice however that this is not only a graphic issue: the problem arises from the mathematical formula used to combine the distances between the single components of the data feature vectors into a unique distance measure that can be used for clustering purposes: different formulas leads to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application.
Minkowski Metric
For higher dimensional data, a popular measure is the Minkowski metric,
where d is the dimensionality of the data. The Euclidean distance is a special case where p=2, while Manhattan metric has p=1. However, there are no general theoretical guidelines for selecting a measure for any given application.
It is often the case that the components of the data feature vectors are not immediately comparable. It can be that the components are not continuous variables, like length, but nominal categories, such as the days of the week. In these cases again, domain knowledge must be used to formulate an appropriate measure.
Bibliography
- Tariq Rashid: “Clustering”
http://www.cs.bris.ac.uk/home/tr1690/documentation/fuzzy_clustering_initial_report/node11.html - Osmar R. Zaïane: “Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering”
http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html - Pier Luca Lanzi: “Ingegneria della Conoscenza e Sistemi Esperti – Lezione 2: Apprendimento non supervisionato”
http://www.elet.polimi.it/upload/lanzi/corsi/icse/2002/Lezione%202%20-%20Apprendimento%20non%20supervisionato.pdf
来自于:http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
发表评论
-
java动态编程一例
2009-12-22 08:37 1564Test.java package test; im ... -
tomcat SSL基本配置
2009-11-30 09:25 1496切换到$TOMCAT_HOME下: 1.生成 server ... -
Oracle 10g自带性能监测工具
2009-11-04 09:08 3124安装Oracle 10g时可以选择安装自带的性能监测工具,对于 ... -
Liferay应用界面
2009-09-16 17:26 2200我们用Liferay Portal开发的项目,已有小成,sho ... -
Liferay Iframe Portlet
2009-09-01 08:30 5869The Iframe portlet makes it pos ... -
LifeRay 5.1.2 使用struts1.1时ClassNotFoundException
2009-08-25 10:37 1223今天修改一个portlet时出现 java.lang.Clas ... -
Tomcat 5 中文路径问题
2009-08-06 10:02 1373在tomcat 5下一个动态加载svg图形文件的页面; 页面中 ... -
dhtmlxtree使用中的CharConversionException: isHexDigit
2009-08-04 16:48 1863使用dhtmlxtree时,点击树节点异步加载子节点数据时,在 ... -
程序员五大层次,你属于哪一层?
2009-07-14 13:54 1186软件界一个无 ... -
第一次面别人
2009-05-27 09:40 1659“面”不是吃的,是看的。 从换了工作以后都很忙,周 ... -
liferay中使用struts时jar文件冲突
2009-05-26 13:52 1704异常: java.lang.NoSuchMethodE ... -
犹豫中,不知道该怎么办了!
2009-03-04 10:38 1187先说说我现在的公司, ... -
参数传递的浏览器差异
2008-10-19 22:32 1711情况大体是这样的:一个头页面header.jsp上有一个搜索框 ... -
该死的黑客
2008-09-22 15:06 1389公司一台对外网服务的linux系统数据库服务器,上周五被人破译 ... -
The Examples for Quartz Time Format
2008-05-16 15:02 1763The Examples for Quartz Time Fo ... -
让人头疼的新手
2008-05-13 11:25 4662刚进公司没多久时,领导让我带两个新人(07年7月份毕业的)。他 ... -
昨天参加了一次面试
2008-05-06 11:40 2845象去年这个时候一样, ... -
开发小记
2008-02-13 15:54 14321. 在oracle中字符串拼 ... -
javascript实现日期操作的工具包
2007-12-13 13:39 3244最近一个小项目中用到了dwr,其中使用到了日期型数据;查了一下 ... -
开发中莫名奇妙的事(八卦)
2007-09-28 08:52 20812007年9月22日 ant进行jar ...
相关推荐
经典的MM优化算法的通俗教程,英文原版,介绍了如何构造Majorization函数以及Minorization函数的常见方法,值得参考。
In recent years, spectral clustering has become one of the most popular modern clustering ...and very often outperforms traditional clustering algorithms such as the k-means algorithm. On the first gla
Alex Smola et al.A tutorial on support vector regression.Statistics and Computing,2004 In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function ...
A tutorial on Principal Components Analysis
A Tutorial on Probability Theory
A Tutorial on Support Vector Regression In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of ...
The emphasis of the paper is on motivation and intuition rather than technical completeness (although we could not avoid giving some technical details). Since it is not intended to be a historical ...
A Tutorial on Graph-Based SLAM 全英文,值得仔细研究
adaptive Markov chain Monte Carlo algorithms 的基本入门文献
A Tutorial on Principal Component Analysis by Jonathon Shlens. 2014年版。英文。
These representations can be used as features for a wide range of tasks on graphs such as classification, clustering, link prediction, and visualization. In this survey, we give an overview of ...
一份关于56G PAM4信号的指导说明。对理解PAM4信号的性能评估很有帮助。
经典SVM分类算法A Tutorial on Support Vector Machines for Pattern
A Tutorial on Deep Learning
A Tutorial on JasperReports, iReport and JFreeChart
A-tutorial-on-learning-with-Bayesian-networks,A-tutorial-on-learning-with-Bayesian-networksA-tutorial-on-learning-with-Bayesian-networks。
关于机器学习中支持向量机的综述。非常适合初学者以及写作参考。
A Tutorial on Using the ALSA Audio API(中英文版)
hidden Markov models
A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems