Data Mining - Terminologies and Notation

Author: Elissa Haddad • October 8, 2016 • Course Note

Terminologies and Notation

Algorithm: A specific procedure used to implement a particular data mining technique, such as a classification tree or discriminant analysis.

Attribute= Feature= Input variable= Predictor= X = Independent variable: A variable, usually denoted by X, used as an input into a predictive model. In database terms, it is called a “field”.

Case= Observation= Record: The unit of analysis on which the measurements are taken (a customer, a transaction, …). Also called an instance, sample, example, pattern, or row. In spreadsheets, each row represents a record and each column a variable. Note that the use of the term “sample” here differs from its usual meaning in statistics, where it refers to a collection of observations.

Confidence: (1) A performance measure for association rules of the type “If A and B are purchased, then C is also purchased”. Confidence is the conditional probability that C will be purchased given that A and B were purchased. (2) In statistics more broadly (as in “confidence interval”), the term concerns the degree of error in an estimate that results from selecting one sample as opposed to another.
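The association-rule meaning of confidence can be sketched in a few lines of Python. This is an illustrative sketch only: the transaction list and the function name `confidence` are assumptions made up for the example, not part of the course note. It estimates the conditional probability as the share of transactions containing A and B that also contain C.

```python
# Hypothetical transactions: each set is one customer's basket (made-up data).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
]


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from transaction counts."""
    antecedent = set(antecedent)
    consequent = set(consequent)
    # Keep only the transactions that contain every antecedent item.
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        raise ValueError("the antecedent never occurs in the data")
    # Of those, count the ones that also contain every consequent item.
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)


# Rule: "If A and B are purchased, then C is also purchased".
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
print(rule_confidence)
```

Here three baskets contain both A and B, and two of those also contain C, so the estimated confidence of the rule is two thirds.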

Dependent variable= Outcome variable= Output variable= Target= Response: A variable, usually denoted by Y, which is the variable being predicted in supervised learning.

Estimation= Prediction: The prediction of the numerical value of a continuous output variable.

Holdout Data= Holdout set= Validation set= Test set: A sample of data not used in fitting a model, but instead used to assess the performance of that model.

Model: An algorithm applied to a dataset, complete with its settings.

Profile: A set of measurements on an observation (e.g., the height, weight, and age of a person).

Score: A predicted value or class. Scoring new data means using a model developed with training data to predict output values in new data.

Success Class: The class of interest in a binary outcome.

Supervised Learning: The process of providing an algorithm with records in which an output variable of interest is known, so that the algorithm “learns” how to predict this value for new records where the output is unknown.
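As a rough illustration of supervised learning and scoring, the sketch below uses a one-nearest-neighbor rule. The technique, the data, and the names `training_X`, `training_y`, and `score` are all assumptions chosen for the example; the course note does not prescribe any particular algorithm. Each new record is assigned the known class of its closest training record.

```python
# Hypothetical training records: (height_cm, weight_kg) with a known class.
training_X = [(150.0, 50.0), (160.0, 55.0), (180.0, 85.0), (190.0, 95.0)]
training_y = ["small", "small", "large", "large"]


def squared_distance(a, b):
    """Squared Euclidean distance between two records."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def score(new_record, X, y):
    """Predict the class of new_record as the class of its nearest training record."""
    nearest_index = min(
        range(len(X)), key=lambda i: squared_distance(new_record, X[i])
    )
    return y[nearest_index]


# New records whose output is unknown; scoring them yields predicted classes.
new_records = [(155.0, 52.0), (185.0, 90.0)]
predictions = [score(r, training_X, training_y) for r in new_records]
print(predictions)
```

The training records play the role of the data with known output Y, and the list `predictions` holds the scores for the new records.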

Test Data= test set: The portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on new data.

Training data= training set: The portion of the data used to fit a model.

Unsupervised learning: An analysis in which one attempts to learn patterns in the data other than predicting an output value of interest.

Validation data= validation set: The portion of the data used to assess how well the model fits, to adjust models, and to select the best model from among those that have been tried.
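The split into training, validation, and test data can be sketched as a random partition of the records. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions made for illustration; the course note does not specify any of them.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle records and split them into training, validation, and test sets.

    The test set receives whatever remains after the first two portions.
    """
    shuffled = list(records)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


records = list(range(100))  # stand-in for 100 records
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Under these assumptions, models would be fitted on `train`, compared and tuned on `valid`, and the final choice assessed once on `test`.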

Variable: Any measurement on the records, including both the input (X) and the output (Y) variables.

...
