DATA MINING

Data Mining Mastery Quiz

Test your knowledge on the fascinating field of Data Mining! This quiz covers various topics including natural language processing, information retrieval, and anomaly detection. Perfect for both students and professionals in the field, it comprises 41 multiple-choice questions designed to challenge your understanding and application of data mining concepts.

41 Engaging Questions
Focus on Key Data Mining Topics
Assess Your Knowledge and Skills

41 Questions | 10 Minutes | Created by MiningData42

Natural language processing and information extraction

Text mining

Web Data

• “design” can be a noun or a verb (Ambiguous POS)
• “root” has multiple meanings (Ambiguous sense)

Word level ambiguity

Syntactic ambiguity

Anaphora resolution

Presupposition

• “natural language processing” (Modification)
• “A man saw a boy with a telescope.” (PP Attachment)

Word level ambiguity

Syntactic ambiguity

Anaphora resolution

Presupposition

• “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)

Word level ambiguity

Syntactic ambiguity

Anaphora resolution

Presupposition

• “He has quit smoking.” implies that he smoked before.

Word level ambiguity

Syntactic ambiguity

Anaphora resolution

Presupposition

An extensive lexical network for the English language

WordNet

Synsets

Relationship

Words

Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.

Text databases

Information

Text analysis

Database

A field developed in parallel with database systems

Information retrieval

Text databases

Structured data

Data stored

The percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

Precision

Recall

Data

Text

The percentage of documents that are relevant to the query and were, in fact, retrieved

Recall

Precision

Text

Data
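The two definitions above can be checked with a small calculation. The following is a minimal sketch, assuming the retrieved and relevant documents are given as sets of ids; the function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents relevant in the collection.
p, r = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d3", "d5"},
)
print(p, r)
```

Precision divides the hits by the size of the retrieved set, while recall divides the same hits by the size of the relevant set, so a system can score well on one and poorly on the other.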

A document can be described by a set of representative keywords called

Index terms

Assignment

Attributes

Predicts that each document is either relevant or non-relevant based on the match of a document to the query

Boolean model

Query
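As an illustration of the relevant/non-relevant decision described above, here is a minimal sketch of conjunctive (AND) Boolean matching. It assumes each document is already reduced to a set of index terms; the sample documents and the function name are hypothetical.

```python
# Hypothetical collection: document id -> set of index terms.
docs = {
    "d1": {"data", "mining", "text"},
    "d2": {"web", "mining"},
    "d3": {"data", "warehouse"},
}


def boolean_and(query_terms, docs):
    """Return the sorted ids of documents that contain every query term."""
    query = set(query_terms)
    # A document is "relevant" only if the query terms are a subset of its terms.
    return sorted(doc_id for doc_id, terms in docs.items() if query <= terms)


matches = boolean_and(["data", "mining"], docs)
print(matches)
```

There is no ranking here: a document either satisfies the query or it does not, which is the defining property of this model.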

A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining

Synonymy

Polysemy

Finds similar documents based on a set of common keywords

Similarity based retrieval

Text mining

Web mining

Set of words that are deemed “irrelevant”, even though they may appear frequently

Stop list

Stop word

Token
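Filtering against such a word set can be sketched in a few lines. This is an illustrative example only: the list below reuses the four example words this quiz gives elsewhere (“a”, “the”, “always”, “along”), and whitespace tokenization is a simplifying assumption.

```python
# Tiny illustrative stop list; real lists are much longer.
STOP_LIST = {"a", "the", "always", "along"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Lowercase the text, split on whitespace, and drop tokens in the stop list."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_list]


kept = remove_stop_words("The river always runs along a valley")
print(kept)
```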

Several words are small syntactic variants of each other since they share a common word stem

Word stem

Term frequency

Stop list

Each entry frequent_table(i, j) = number of occurrences of the word t_i in document d_j. Usually, the ratio instead of the absolute number of occurrences is used.

Term frequency table

Word stem

Stop word
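A table of this kind could be built as follows. This is a minimal sketch, assuming documents arrive as already tokenized word lists; the function name, the `relative` flag, and the sample data are hypothetical.

```python
from collections import Counter


def term_frequency_table(docs, relative=True):
    """Map each document id to a {term: frequency} dictionary.

    With relative=True, counts are divided by the document length,
    i.e. the ratio is stored instead of the absolute count.
    """
    table = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        if relative and total:
            table[doc_id] = {t: c / total for t, c in counts.items()}
        else:
            table[doc_id] = dict(counts)
    return table


docs = {"d1": ["data", "mining", "data"], "d2": ["web", "mining"]}
tf = term_frequency_table(docs)
print(tf)
```

Using ratios keeps long documents from dominating simply because they contain more words.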

Measure the closeness of a document to a query (a set of keywords)

Similarity metrics

Relative term

Similarity based
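The quiz does not name a specific metric. Cosine similarity is one widely used choice, and the sketch below assumes it, with the document and the query both represented as sparse term-weight dictionaries (hypothetical sample values).

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    terms = set(doc_vec) | set(query_vec)
    dot = sum(doc_vec.get(t, 0.0) * query_vec.get(t, 0.0) for t in terms)
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    norm_q = math.sqrt(sum(v * v for v in query_vec.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_d * norm_q)


doc = {"data": 2, "mining": 1}
query = {"data": 1, "mining": 1}
sim = cosine_similarity(doc, query)
print(sim)
```

Values close to 1 mean the document and query use terms in similar proportions; 0 means they share no terms.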

Associate a signature with each document

Signature file

Signature

Cluster documents by a common author

Similarity detection

Text mining

Web mining

Unusual correlation between entities

Link analysis

Anomaly detection

Sequence analysis

Predicting a recurring event

Sequence analysis

Link analysis

Anomaly detection

Find information that violates usual patterns

Anomaly detection

Sequence analysis

Link analysis

Anchor text correlations with linked objects

Patterns in anchors/links

Patterns in text

Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them

Motivation

Association

Preprocess the text data by parsing, stemming, removing stop words, etc.

Association

Analysis

Consider each document as a transaction

Evoke association mining algorithms

Term level association mining

No need for human effort in tagging documents

Term level association mining

Evoke association mining algorithm
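The idea of treating each document as a transaction can be sketched with a simple co-occurrence count. This is not a full association mining algorithm such as Apriori; it only illustrates the first step (finding frequent term pairs) under a hypothetical minimum support threshold, and it assumes the documents are already preprocessed into token lists.

```python
from collections import Counter
from itertools import combinations


def frequent_term_pairs(docs, min_support):
    """Count term pairs that co-occur in at least min_support documents."""
    pair_counts = Counter()
    for tokens in docs:
        # One document = one transaction; repeated terms count once.
        terms = sorted(set(tokens))
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


docs = [
    ["data", "mining", "text"],
    ["data", "mining"],
    ["web", "mining"],
]
pairs = frequent_term_pairs(docs, min_support=2)
print(pairs)
```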

Represent a doc by a term vector

Vector space model

Term vector model

E.g. “a”, “the”, “always”, “along”

Word stopping

Word stemming

E.g. “computer”, “computing”, “computerize” => “compute”

Word stemming

Word stopping

More frequent within a document => more relevant to semantics

TF (Term frequency)

IDF (Inverse document frequency)

Less frequent among documents => more discriminative

IDF
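The two intuitions above are often combined into a single TF-IDF weight. The sketch below assumes one common variant, relative term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term; other variants exist, and the sample data is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights


docs = {
    "d1": ["data", "mining", "data"],
    "d2": ["web", "mining"],
    "d3": ["data", "warehouse"],
}
w = tf_idf(docs)
print(w["d2"])
```

In document d2, “web” and “mining” have the same term frequency, but “web” appears in fewer documents, so it receives the higher weight.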

More frequent => more relevant to topic

Weighting

Normalization

Document length varies => relative frequency preferred

Normalization

Weighting

Is a collection of classification algorithms based on Bayes' theorem.

Naive Bayes

Machine learning

Decision tree
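To make the Bayes-based approach concrete, here is a minimal sketch of one member of that family, a multinomial classifier for tokenized text with add-one (Laplace) smoothing. The function names, the smoothing choice, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n / n_docs)  # log prior P(class)
        for t in tokens:
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(
    [["cheap", "pills"], ["meeting", "agenda"], ["cheap", "offer"]],
    ["spam", "ham", "spam"],
)
label = predict(model, ["cheap", "meeting", "cheap"])
print(label)
```

The "naive" part is the assumption that words are conditionally independent given the class, which is why the per-word log probabilities are simply added.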

A single independent variable is used to predict the value of a dependent variable.

Simple linear regression

Multiple linear regression
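A one-variable fit can be sketched with the ordinary least squares formulas for the slope and intercept. Least squares is assumed here as the fitting criterion, and the function name and sample points are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```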

Two or more independent variables.

Multiple regression

Regression

Single regression

Measures the level of impurity in a group of examples

Impurity/entropy (informal)

Purity

Tells us how important a given attribute of the feature vectors is.

Information gain

Attribute gain
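Entropy and the importance of an attribute can both be computed directly from class labels. The sketch below assumes Shannon entropy in bits and defines the gain as the entropy of the whole group minus the size-weighted entropy of the subgroups produced by splitting on the attribute; the row format and sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the examples on one attribute.

    rows is a list of dicts (feature vectors); labels holds the matching classes.
    """
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


rows = [
    {"outlook": "sunny"},
    {"outlook": "sunny"},
    {"outlook": "rain"},
    {"outlook": "rain"},
]
labels = ["no", "no", "yes", "yes"]
gain = information_gain(rows, labels, "outlook")
print(gain)
```

Here the split on "outlook" separates the two classes perfectly, so the gain equals the full entropy of the original group.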
