38 Cards in this Set

  • Front
  • Back

Idea behind FP-Tree?

- compress the database into a compact prefix-tree structure that keeps the support information

Problem of Apriori?

- candidate generation and repeated database scans are expensive

Why is frequent pattern growth fast?

- No candidate generation, no candidate test


-- Apriori algorithm has to proceed breadth-first


- Use compact data structure


- Eliminate repeated database scan


- Basic operation is counting and FP-tree building

Why summarize itemsets?


Name three representations!

- the search space of frequent itemsets is large and grows exponentially


- a low minimum support value results in an extremely high number of frequent itemsets


- summarizing reduces computational and storage demands


- closed, maximal, and non-derivable itemsets


M ⊆ C ⊆ F

Maximal frequent itemset?

- F: the set of frequent itemsets


- X ∈ F is maximal if there is no Y ∈ F such that X ⊂ Y


- The set of maximal itemsets does not contain the complete support information but is compact

Closed frequent itemset?

c(X) = i(t(X))


- An itemset X is closed in a data set D if there exists no proper super-itemset Y such that support(X) = support(Y) in D.


- The set of closed frequent itemsets contains complete information regarding its corresponding frequent itemsets.
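A minimal sketch of the two definitions above on made-up toy supports (the itemsets and counts are hypothetical, not from the source):

```python
# Toy sketch: classify frequent itemsets as closed / maximal.
# Itemsets and support counts are made up for illustration.
support = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("AB"): 4,
    frozenset("C"): 3, frozenset("AC"): 2, frozenset("ABC"): 2,
}

def is_closed(X):
    # closed: no proper superset has the same support
    return not any(X < Y and support[Y] == support[X] for Y in support)

def is_maximal(X):
    # maximal: no proper superset is frequent at all
    return not any(X < Y for Y in support)

closed = {X for X in support if is_closed(X)}
maximal = {X for X in support if is_maximal(X)}
print(maximal)  # only the itemset {A, B, C} is maximal here
```

Note that `maximal <= closed` always holds, matching M ⊆ C ⊆ F.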

GenMax?

- algorithm for mining maximal frequent itemsets

Charm?

- algorithm for mining closed frequent itemsets

Support and confidence: how are they computed?

sup(X): number of transactions containing X divided by the total number of transactions


conf(X->Y): sup(X ∪ Y) / sup(X)
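The two formulas above can be sketched on a hypothetical toy database (the transactions are made up):

```python
# Toy transaction database (hypothetical data).
transactions = [
    {"bread", "milk"}, {"bread", "butter"},
    {"bread", "milk", "butter"}, {"milk"},
]

def sup(itemset):
    # fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(X, Y):
    # conf(X -> Y) = sup(X ∪ Y) / sup(X)
    return sup(X | Y) / sup(X)

print(sup({"bread"}))             # 3/4 = 0.75
print(conf({"bread"}, {"milk"}))  # 2/3 ≈ 0.667
```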

- KDD Process

- Knowledge discovery in databases


1) data cleaning and integration


-- 60% of the effort


-- removing noise and inconsistencies


-- impute missing values


2) focusing of task relevant data


-- selection (e.g., sales data for the year 2001)


-- projection (Select the relevant attributes/columns )


-- Transformations (normalization, discretization)


3) Data Mining:


-- Clustering


-- Classification


-- Frequent Pattern Mining


-- Outlier Mining


4) data evaluation and knowledge presentation


-- visualisation

What is a data warehouse?

- A decision support database that is maintained separately from the organization’s operational database(s)


- source for data mining



-- subject-oriented


-- integrated (from many sources)


-- time-variant (over long time period)


-- nonvolatile (2 operation: load & analyze)

Properties of cube measures?

- distributive


-- f(A ∪ B) = f(A) + f(B), computable from partial aggregates (e.g., count, sum)


- algebraic


-- computed by an algebraic function of distributive measures (e.g., avg = sum / count)


- holistic


-- no constant-size summary of sub-aggregates suffices (e.g., median, mode, rank)

Models in a data warehouse?

- Star


- Snowflake


- Galaxy

OLAP Operations?

- Drill down


- roll up


- slice and dice


- pivot


curse of dimensionality

complexity rises exponentially with the number of dimensions

options for materialization?

- no materialization (compute on the fly)


- full materialization (compute all cuboids)


- partial materialization (choose a subset by some strategy)


Bitmap Index?

For a chosen column, create a table with all possible values and, for each row, record as a binary entry whether the value applies or not
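A minimal sketch of such an index on a hypothetical column (the data is made up):

```python
# Hypothetical column of a table.
column = ["red", "blue", "red", "green", "blue"]

# One bit vector per distinct value: bit i is 1 iff row i has that value.
bitmap = {v: [int(x == v) for x in column] for v in set(column)}

print(bitmap["red"])   # [1, 0, 1, 0, 0]

# Queries become cheap bitwise operations, e.g. "red OR blue":
red_or_blue = [r | b for r, b in zip(bitmap["red"], bitmap["blue"])]
print(red_or_blue)     # [1, 1, 1, 0, 1]
```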

discovery strategies?

- hypothesis-driven (by user)


- discovery driven


-- precompute measures


-- SelfExp, InExp, PathExp


-- limited because no distribution assumptions are made (e.g., trends)

categorization of data

- values


-- categorical


--- nominal (no order)


--- ordinal (order)


-- numerical


--- discrete


--- continuous


dimensions?

- 1D: age:12


- multidimensional: (x, y)


- high-dimensional


- non-dimensional (metric data: only distances are given)

chi-square?

chi² = sum_i ( (n_i - e_i)² / e_i ), with observed counts n_i and expected counts e_i
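The statistic can be computed directly from the formula; the counts below are made up for illustration:

```python
# Hypothetical observed (n_i) and expected (e_i) counts.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

# chi² = sum of (n_i - e_i)² / e_i over all cells
chi2 = sum((n - e) ** 2 / e for n, e in zip(observed, expected))
print(chi2)  # (49 + 9 + 25 + 225) / 25 = 12.32
```

A large value indicates that the observed counts deviate strongly from the expected distribution.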

data transformations?

- reduction strategies


-- numerosity


--- parametric (model)


--- non-parametric (sampling)


-- reduce dimensions


--- feature selection (stepwise feature selection / elimination)


--- statistically (PCA)


-- discretization


--- equi-width (sensitive to outliers)


--- equi-height


--- clustering


--- entropy based


--- Interval merge with chi square
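The difference between equi-width and equi-height binning can be sketched on made-up data (the values and bin count are hypothetical); note how a single outlier distorts the equi-width bins:

```python
# Toy data with one outlier (100); discretize into 4 bins.
values = [1, 2, 3, 4, 5, 6, 7, 100]

# equi-width: 4 bins of equal value range -> the outlier dominates
width = (max(values) - min(values)) / 4
equi_width = [min(int((v - min(values)) // width), 3) for v in values]
print(equi_width)   # [0, 0, 0, 0, 0, 0, 0, 3]

# equi-height: 4 bins with an equal number of points each
n = len(values) // 4
equi_height = [sorted(values).index(v) // n for v in values]
print(equi_height)  # [0, 0, 1, 1, 2, 2, 3, 3]
```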


Pros / cons of k-means?

+ fast and easy


- k has to be known


- problem with varying density


- non convex forms


- result and runtime depends on initialization


-- good methods exist for that (e.g., k-means++)

How does k-means work?

- k centers are randomly chosen


- points are assigned to the nearest center


- the new center of these points is calculated


- repeat until the assignments no longer change
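The steps above can be sketched in a few lines; the data, k, and iteration cap are made up for illustration:

```python
import random

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)  # 1) k centers chosen randomly
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                # 2) assign each point to the nearest center
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # 3) recompute each center as the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

random.seed(0)
result = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
print(result)  # two centers, near 1.0 and 9.0
```

A fixed iteration count stands in for the "repeat until stable" test to keep the sketch short.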

Silhouette coefficient calculation?

a(o) - average distance of o to the points in its own cluster


b(o) - average distance of o to the points in the nearest other cluster


s(o) = (b(o) - a(o)) / max(a(o), b(o))


s_C = average of s(o) over every point in every cluster
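The per-point formula can be sketched on made-up 1-D clusters (data and cluster assignment are hypothetical):

```python
def silhouette(o, own, other):
    # a(o): mean distance of o to the other points of its own cluster
    a = sum(abs(o - p) for p in own if p != o) / (len(own) - 1)
    # b(o): mean distance of o to the points of the nearest other cluster
    b = sum(abs(o - p) for p in other) / len(other)
    return (b - a) / max(a, b)

A = [1.0, 1.5, 2.0]   # o's own cluster
B = [8.0, 9.0]        # nearest other cluster
print(silhouette(1.5, A, B))  # close to 1 -> well clustered
```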

Silhouette coefficient interpretation

s(o) == -1


--> bad, on average nearer to the neighboring cluster B


s(o) == 0


--> equal distance to A and B


s(o) == 1


--> perfect clustering



s_C > 0.7 --> good clustering


s_C > 0.5 --> medium clustering


s_C > 0.25 --> weak clustering


below --> no structure

How to determine the parameters of DBSCAN?

- minPts: a good value is often 2d - 1


- eps: draw the k-distance plot and look for a kink

Where is DBSCAN bad?

Hierarchies! There is no kink...

How can overfitting be avoided?

- remove noisy training data


- good size of training data (not too big or too small)


- good samples which represent the problem domain

No Free Lunch Theorem?

- without knowledge of the problem there are no reasons to favor one classifier


- i.e., there are always situations where another classifier is better


--> use ensemble classifiers

Bagging vs. Boosting

- Bagging: randomly draw bootstrap samples and learn m different classifiers on them


- Boosting: go through the samples sequentially; badly classified samples get higher weight when learning the next classifier

limitations of statistical outlier detection

- most tests are for single attributes only


- In many cases, data distribution may not be known


- statistical measures might be sensitive to outliers themselves


- not all outliers are detected


--> distance based approaches

problem of distance-based outlier detection

the parameters are global; with varying densities this sometimes gives bad results


better to judge locally!


--> LOF

Primary vs. secondary index

- primary: determines the order of the data itself


- secondary: a separate index, stored apart from the data, to allow fast access

Index structures: organization

- space-organized (e.g., hashing)


- data-organized (e.g., search trees)

Index structures: structure

- sequential scan (bitmap)


- hierarchical (pruning in trees)


- scatter storage (hashing)

Requirements for index structures

basic


- efficient search


- dynamic (insert, delete, update)


additional


- concurrency


- order preservation


- adaptation to different data distributions

Examples of index structures

- inverted list


- composite index (first by last name, then by first name)


- R-tree