Public Outreach
Outreach is an effort in an organization or group to connect its ideas or practices to the efforts of other organizations, groups, specific audiences or the general public. This is exactly what DAME Group intends to offer through this page.
Data Mining & Knowledge Discovery
Research and development activities in the area of KDD (Knowledge Discovery in Databases) and DM (Data Mining) concerns the design and definition of methods, techniques and tools for the extraction of unknown information from data. Or in other words, the extraction of novel and not explicitely available patterns/features from large and noisy large data volumes.Knowledge Discovery in Databases
The term KDD was coined in 1989, as referred to the entire process of knowledge discovery in the data, to emphasize the application at high-level of peculiar methods for information extraction from data, known under the generic name of Data Mining. So far, KDD is considered as the entire process, tipically interactive and iterative, of finding and understanding the relations between data, that involves the iterated application of DM methods and algorithms.Of course one of the main aspects of the KDD process is that to obtain a better generalizing capability, a larger knowledge is needed. This can be obtained by collecting more information, by defining more complex and specialized models or through a deeper theoretical analysis.
Let's look inside KDD systems from a technical point of view. A KDD system is mainly composed by a set of components able to identify and extract novel, useful as well as attractive relations from data stored in a Database. An ideal KDD should include:
- Controller: it controls the functional flow through other components. Its decisions are based on the domain knowledge and on user information. In some systems, where the discovery task is static and well defined, the controller can simply execute a pre-defined sequence of actions. In more versatile systems, the controller can assume a larger decision responsibility. Anyway, in almost all controller systems, the interaction with user (at least a sort of supervision) is mandatory to take final decisions;
- Database Interface: it generates and processes database requests. Data, in fact, are extracted from the database by using queries. A typical query consists of a certain number of parametric fields used to find constrained entries from a set of tables. The most part of interfaces supports the query language SQL;
- Base of Knowledge (BoK): it is the repository of information related to a specific domain. It is different from a simple dictionary, containing information essentially focused on the fields, names, types and simple contraints on the values that data can assume. The BoK is made by more information on the data structure, constraints between different fields, specific knowledge on the applicable conceptual domain, generalization hierarchy etc. The main role of the domain knowledge is to drive the research of interesting relationships. It can be obtained by focusing the attention on portions of data, by guiding the extraction algorithms and by carefully evaluating the discovered relations between data;
- Focus: it determines portions of data to analyze. This component determines which data should be extracted from database. Hence it requires to specify which are the tables to be accessed, which fields to find and how many output features to be released. To do this the Focus component requires detailed information on the table structure and which fields appears as proper for the specific task to be carry out. For example, if a sampling is requested, it should have a proper technique to randomly select the data, together with the knowledge on the input required by the extraction algorithms to format properly the relevant attributes;
- Pattern Extraction: it consists of a series of pattern extraction algorithms. Here the term pattern is referred to a whatever relation existing between database elements. The algorithms used in this case are the kernel of a discovery system. Different algorithms based on Machine Learning techniques and Statistical Analysis are conceptually included in the KDD system definition;
- Evaluation: It evaluates the usefulness of extracted patterns. This component determines the related degree of interest of the founded relations and it decides which are the relations to be presented and their order;
Data Mining
Historically, DM techniques originated from the application of machine learning and statistical analysis methods to databases for the mining of patterns. Current available huge amounts of data in many social/scientific/economic disciplines, in which astronomy is only an example, caused the failure of traditional analysis methods. They in fact can create information reports on data, but they are not able to analyze them internally and to focus the attention on their important intrinsic features.The DM techniques play an important role in the KDD process to discover and extract the information, not immediately visible in a large volume of data, in an automatic way. As briefly mentioned above, the main contribution to the DM evolution is coming from:
- huge data archives;
- less expensive data storage;
- computing farm efficiency;
- new analysis methods and techniques;
So far, Data Mining can be considered as a methodology for problem solving, whose basic role consists in finding a description, either mathematical or logical, of relations and regularities in a data set. Hence it includes the definition of data model. This model plays the role of inferred knowledge on a database. So it involves the problem of an efficient information retrieval methodology. In other words, the information is not only the set of stored data, but also the knowledge that can be inferred from the database. The DM methods designed and developed for model definition essentially follow two main approaches: statistical-based or machine learning. The statistical approach is historically associated to the analysis of data with the goal to find correlations and dependencies between data. Let's look more deeply the Machine Learning (ML) approach.
Machine learning
(ML) is a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn based on data, such as from sensor data or databases. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.By following the definition of ML, the Data Mining based on Machine Learning approach has as one of the most important feature the technique of learning. there are basically two main Learning Techniques: supervised and unsupervised.
The supervised learning is based on the existence of an external tutor which defines the classes and provides examples of each class to the cognitive system. In a supervised algorithm the main task is to discover common properties in the examples of each class, i.e. a description of that class. This technique is also named learning from examples.
In the unsupervised learning the main actor is the system that must be able to discover the classes, not provided by a tutor, but directly from the intrinsic property exploration inside the parameter space. This technique is also known as learning from observations. The output of this process is a set of class descriptions, together representing a valid summary of problem solution. The process of building of a class description is an iterative mechanism in which the best description is searched. There should exist a quality evaluation function (merit function) able to drive the selection of the best choice at each iterative step. The process works on a subset of the whole problem parameter space, named training set. In the case of Data Mining, the training set is a subset of available data in the data set. The most part of DBMS are relationally structured, so far the examples of training set are essentially tuples of database. They are usually collected as a whole, unique table. This assumption is convenient from a technical point of view. But conceptually it presents the frequent problem of internal values that can be unknown or NaN (Not a Number).
The analysis process in Machine Learning often makes use of a simplified model to understand the problem, by recognizing main analogies between model and real problem features. Then it collects similar objects and builds rules to predict the class component behavior. This is the common process of DM algorithms based on three main components, the model, model evaluation and exploration method.
The model is an abstraction of the real problem, subject of the investigation. A proper definition of the model should be based on two features: the model function and its representation.
The model function can be associated with one of the following application methods:
- Classification: it classifies a set of data in one or more predefined classes. It is able to predict the belonging of a datum to a certain class.
- Regression: it associates a set of data to a variable and it predicts its value. Examples of such applications are the estimation of the probability of a datum to assume a value.
- Clustering: it tries to identify a finite set of categories or clusters to describe the data. The categories can be mutually exclusive, or can consist in a more complex representation, such as hierarchical or overlapping categories. it is different from classification in the sense that classes are not predefined, but discovered depending on the data feature extraction.
- statistical summarization: it furnishes a compact view of data, for example in terms of their mean or standard deviation. More complex operators involve summarising rules as discoveries of functional relations between variables in the parameter space.
- dependence modeling: it consists of finding a model able to describe significative dependencies between variables. They are at two levels: the structural level of the model predicts, often in a graphical way, which variables are locally dependent from each other; the quantitative level specifies the power of such dependencies in an arbitrary numerical scale.
- Dimensional reduction: it determines relations between fields in the parameter space of the data set, with the role to derive multiple correlations able to enclose in a simplified scenario all data features.
- Time Sequence Analysis: it is the research of temporal sequences of events in the data, coming from a continuous process. It enables the prediction of data events (trigger) in the forthcoming process.
The model representation can be defined as the language adopted to describe the patterns that can be discovered. It determines either the data representation flexibility inside the model and human interpretation degree of the model behavior. There is often the trend to design model representations with a high complexity degree, eventually by stressing their specialization field. But not always the more complex solution is the best one, especially in term of understanding of their output features. Coming from the direct experience, model representations derived from Nature behavior have best results and generalization (deduction) capabilities. Typical examples are: decision tree, neural networks, genetic algorithms, fuzzy inference rules and hybrid representations of them, grouped as Soft Computing. More in detail:
- Decision Tree: they are one of the most developed techniques to partition sets of objects into classes. A decision tree classifies samples into a finite number of classes. The tree nodes are labeled with the attribute names, the edges are labeled with the possibile attribute values, and finally the leaves with different classes. An object is classified by following a path along the tree, based on the edges corresponding to the values of object attributes. the final task of such a structure is to build a simple tree able to classify all training set samples in the correct way. One of worst features of decision trees is that they can grow too much on real problems, becoming very difficult to be understood. One of the common solutions is to simplify the tree representation through the so-called decision rules. They are structured as self-consistent production rules, easy to be formalized and interpreted. The tree path becomes an expert system, composed by rules if-then-else, optimized also in terms of final classification accuracy. The limit of such an approach is reached in case of not well defined attribute descriptions. In this case, a fuzzification of production rules is always possible and suggested, with a little lack of simplicity;
- Artificial Neural Networks (ANN): This family of ML model representations has a structure and a computational model very different from typical models derived from Artificial Intelligence. Basic idea behind them is that it is possible to obtain intelligent systems (in the ML sense) by using a simplified computational scheme close to the human brain. The most famous artificial neuron model was introduced by McCulloch & Pitts in 1943. To simulate it, they are composed by a high number of neurons (or artificial computational nodes based on the biological neuron behavior) communicating between them through connections (the artificial alter ego of synapses). But why ANN are different from typical Artificial Intelligence computational models? Well, because they do not have a canonical algorithm specifying the operations to be executed, but their computation process is defined through information exchange mechanisms depending on how their neurons are connected. An ANN learns through self acquired experience, not simply by means of a computer instruction sequence. So far, ANN main features are:
- high number of simple computing nodes (neurons);
- high number of connections between neurons (synapses);
- specialized and hierarchically neuron grouping (layers);
- distributed and parallel control scheme;
- robust learning algorithm;
- generalization capability;
The role of learning process in the ANN is basically to strengthen the synapse (represented by a numerical normalized weight) between neurons whose activation value corresponds to a state coherent with the current input. The learning can be performed by a specific training phase (use case of the model), in which a subset of data samples is submitted to the network through input layer (the external world interface of an ANN). A taxonomy of ANN models can be done depending on the specific learning rule adopted:
- Self-association ANN: it associates stored patterns to themselves;
- Pair-association ANN: it associates stored patterns to different others;
- Supervised ANN: it learns by using supervised learning rules;
- Unsupervised ANN: it learns by using unsupervised learning rules;
Moreover, ANNs, having the capability to recognize patterns from input data sets, are able to apply many of the Data Mining methods, such as classification, clustering, association and time sequence analysis, to huge and noisy data sets. Furthermore, due to their intrinsic parallelism, can be easily adapted to be executed in a parallel computing farm.
- Genetic Algorithms (GA): As in the case of Neural Networks, an artificial emulation of the biological model, the human brain, also Genetic Algorithms are a computing model representation originated from a natural model, the evolution of biological species as inspired by Darwin's law. GAs are stocastic procedures that combines survival of strongest elements from a population with the exchange of structured knowledge to form a research and optimization algorithm. Starting from a random population, at each generation a new set of artificial "creatures" (in the form of strings or chromosome) is obtained by re-combining substrings (by applying various genetic operators) of best elements of previous generation. Occasionally new chromosome genes are introduced in the population to ensure variety during evolution. GAs were coined by John Holland in the early 1970s. Despite of their intrinsic stocastic nature, GAs are not a simple random research process. They in fact efficiently explore historical information to deduce new research points with better features. They belong to the category of probabilistic algorithms, but are different because they combine stocastic and driven research.One of their main features is the robustness, e.g. the right equilibrium between efficiency and correctness, needed as known in the real life to survive. They are also as easy to be designed and implemented, as powerful in their optimization research, and they are not limited by mathematical constraints, as in all other ML algorithms. As mentioned before, the GAs perform a multi-directional search by maintaining a population of potential solutions and encouraging the formation of information blocks together with their exchange of knowledge. The population throws a simulated evolution, where at each generation "good" solutions are propagated, while "bad" ones are killed or at least changed by probabilistic mutation operators. In order to evaluate the goodness of a member of population, a special "fitness" function is applied. Common genetic operators are crossover, which re-combines portions of member parents or various types of genetic mutation, regulated by an associated probability for all candidates. More in general, the properties of a GA are:
- a genetic representation of solutions to a specific problem;
- a stocastic generation of first population;
- a fitness function to evaluate the goodness of population members at each generation;
- genetic operators to apply evolution;
- probabilistic rules associated to genetic operators;
The model evaluation determines how much a specified model and its parameters are able to satisfy KDD process criteria. The exploration method mainly consists of two components, the parameter search and model research. In the former the algorithm looks for the parameters able to optimize the model evaluation criteria, based on both observed data and a fixed representation of the model. The latter can be considered as an iterative process on the parameter space. Usually the model space is too large, causing an heuristic approach in the implementation of evaluation methods.
The model exploration method determines the rules to formalize the problem to find the best description of the model, given a defined evaluation method. The related task is to define properly the concept of research space. A research space is a three-element set, composed by a set of descriptions, a set of operations and a merit function. The set of operations can be split into generalization and specialization, one the opposite of the other. The merit function assigns a value at each description, i.e. its quality. In the case of application of decision trees, the generalization consists in a bottom-up pathway, while specialization corresponds to explore the tree in a top-down direction.
Inference techniques
From a logical point of view there are two main inference techniques, respectively, deduction and induction.
Deduction permits to infer information that is a logical consequence of information originally contained in the data. Many management systems (DBMS or DataBase Management Systems), such as those relationally structured, make available simple deduction operators on the information (for example the join operator). Deduction based databases are usually classified as KBMS (Knowledge Base Management Systems).
Induction permits to infer information as a generalization of the one contained in the data. Such type of information can be considered knowledge of data object properties. In this sense the Induction process looks for regularities in the data, i.e. combinations of attribute values which are shared all over the database.
In conclusion, the main difference between the two inference techniques can be summarized by the consideration that deduction always infers data well validated internally to the database, while induction infers data that cannot always be considered valid outside the database. In other words, deduction can support generalization feature, not always supported by induction mechanisms. So far, one of the main tasks of induction is to select more plausible rules able to reduce as much as possible evaluation errors in the data mining process (e.g. in case of unknown feature exploration inside the data sets).
to be continued...

Drawing of an empty pyramid
Leonardo da Vinci
De Divina Proportione, Luca Pacioli, Milan, 1497
