What is cross validation?
It’s a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
What is the goal of cross validation?
The goal of cross-validation is to define a data set to test the model on during the training phase (i.e., a validation data set) in order to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.
What is cross validation used for?
Cross-validation is used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. It is also used to select model hyperparameters by finding the set of hyperparameters that optimizes (minimizes or maximizes) the metric of interest in cross-validation tests.
What are some examples of cross validation?
Leave-one-out cross validation and K-fold cross validation.

How do you do cross validation correctly?

If we have a data rich environment, we can randomly divide the data set into a training set, a validation set and then a test set. The training set is used to fit the model, the validation set is used for estimating prediction error for model selection and the test set is used for assessment of the generalization error of the final chosen model.

Two goals in mind: model selection and model assessment
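A minimal sketch of this workflow, assuming scikit-learn is available; the synthetic data, model choice, and split sizes below are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# illustrative synthetic data standing in for a real feature matrix X and target y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# hold out a test set for the final assessment of the chosen model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross-validation on the training data for model selection
model = LinearRegression()
cv_scores = cross_val_score(model, X_train, y_train,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(cv_scores.mean())                                  # validation signal (out-of-fold R2)

# only after selecting a model, estimate generalization error on the untouched test set
print(model.fit(X_train, y_train).score(X_test, y_test))
```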

What is the end goal of designing algorithms?
The ultimate goal is to design systems with good generalization capacity, that is, systems that correctly identify patterns in data instances not seen before. The aim is not simply to build a predictive model, but to create and select a model that gives high accuracy on out-of-sample data. Hence, it is crucial to check the accuracy of the model before computing predicted values.

Relate generalization of a learning system to performance and complexity.

The generalization performance of a learning system strongly depends on the complexity of the model assumed. If the model is too simple, the system can only capture the actual data regularities in a rough manner. In this case, the system has poor generalization properties and is said to suffer from underfitting; this can be interpreted as a high-bias model. By contrast, when the model is too complex, the system can identify accidental patterns in the training data that need not be present in the test set. These spurious patterns can be the result of random fluctuations or of measurement errors during the data collection process. In this case, the generalization capacity is also poor, and the learning system is said to be affected by overfitting; this can be interpreted as a high-variance model.

Is it better to design robust or accurate algorithms?
Simpler models are preferred if more complex models do not significantly improve the quality of the description of the observations (Occam's Razor). It depends on the learning task; choose the right balance. Spurious patterns, which are present in the data only by accident, tend to have complex forms.
What is the benefit of using ensemble learning methods?
Ensemble learning can help balance the bias/variance trade-off (several weak learners combined form a strong learner).
How are model metrics defined and selected?
The selection of model metrics depends on whether it's a regression or classification problem. Evaluation metrics explain the performance of a model. An important aspect of an evaluation metric is its capability to discriminate among model results.
What are common metrics for regression problems?
Try to write out each one mathematically: RSS (residual sum of squares), RMSE (root mean squared error), MAE (mean absolute error), WMAE (weighted mean absolute error), RMSLE (root mean squared logarithmic error).
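Written out (a sketch in standard notation, with y_i the observed value, ŷ_i the prediction, w_i a per-observation weight, and n the number of observations):

```latex
\mathrm{RSS}   = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
\mathrm{RMSE}  = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
\mathrm{MAE}   = \tfrac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

\mathrm{WMAE}  = \frac{\sum_{i=1}^{n} w_i \lvert y_i - \hat{y}_i \rvert}{\sum_{i=1}^{n} w_i}
\qquad
\mathrm{RMSLE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} \bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2}
```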
What are common metrics for classification problems?
Accuracy: (TP + TN) / (TP + TN + FP + FN), Recall / Sensitivity / True Positive Rate: TP / (TP + FN), Precision / Positive Predictive Value: TP / (TP + FP), Specificity / True Negative Rate: TN / (TN + FP)
What is an ROC curve?
For a binary classification problem, the ROC plots the True Positive Rate vs the False Positive Rate (1- Specificity). Ideally, at a FPR of 0 the TPR will be 1. The ROC illustrates how the probability threshold at which the classification occurs affects the classification.
What is AUC?
AUC is the area under the ROC curve. An ideal classifier has an AUC of 1; a classifier that guesses at random has an AUC of 0.5.
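A minimal sketch of computing the ROC curve and AUC with scikit-learn; the labels and scores below are illustrative stand-ins for a real classifier's predicted probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]               # true binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probabilities of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)
print(auc)  # 1.0 is a perfect ranking, 0.5 is no better than random guessing
```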
Explain what regularization is and why it’s useful.
Regularization adds a term to the least squares loss function that penalizes the magnitude of the model coefficients. This term is called a penalty, or shrinkage term. It is used to prevent overfitting and thereby improve the generalization of a model (increase bias and decrease variance). We have covered L1 (Lasso) and L2 (Ridge) regularization techniques. In both cases the penalty term is a function of the model coefficients. This term includes a value, lambda, that affects how sensitive the total cost function is to the penalty term.
What are the regularization terms?
In the case of L1, the penalty term is lambda multiplied by the sum of the absolute value of the model coefficients. Lasso (L1) zeros out some coefficients entirely. A disadvantage of this method is that this selection can be arbitrary. In the case of L2, the shrinkage term is lambda multiplied by the sum of the square value of the coefficients. Ridge (L2) maintains all features in the data set. Ridge regression tends to perform better than Lasso when the coefficients are correlated. Both L1 and L2 help deal with collinearity issues. A combination of the two is termed an Elastic Net that usually combines the advantages of both.
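A minimal sketch of the three penalties with scikit-learn, where alpha plays the role of lambda; the synthetic data and penalty strengths are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: penalty based on sum(coef**2)
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: penalty based on sum(|coef|)
enet  = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # blend of the L1 and L2 penalties

print(ridge.coef_)  # all coefficients shrunk toward zero, none exactly zero
print(lasso.coef_)  # uninformative coefficients driven exactly to zero
```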
Explain what a local optimum is.
A local optimum is a solution that is optimal within a neighboring set of candidate solutions. This is in contrast with a global optimum that is the optimal solution among all solutions.
Why is a local optimum important and in what context?
The question of whether the cost function is at a local minimum or a global minimum arises in this context: In K-means clustering an objective cost function will always decrease until a local optimum is reached. However, cluster results will depend on the initial random cluster assignment.
What are specific ways of determining if you have a local optimum problem?
Differing initializations resulting in different clusters is evidence of a local minimum problem. This problem can be addressed for K-means by repeating the clustering for many different initializations and then taking the solution that has the lowest cost. The same concern applies more broadly: in machine learning, model coefficients are often fit by minimizing a cost function associated with the predictive error, and that minimization can also get stuck in a local optimum.
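A minimal sketch of this remedy with scikit-learn, whose KMeans exposes the number of random restarts through n_init; the blob data is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# three well-separated illustrative blobs in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# run K-means 25 times from different random initializations and keep the best run
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(km.inertia_)  # within-cluster sum of squares of the lowest-cost solution found
```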
How can the local optimum problem be addressed?
This problem can be addressed for K-means by repeating the clustering for many different initializations and then taking the solution that has the lowest cost.
Assume you need to generate a predictive model using multiple regression. What analysis can you use to validate the model?
R2, Analysis of residuals and Out-of-sample evaluation.
Explain R2.
This coefficient of determination quantifies the fraction of the variance of the dependent (predicted) variable that can be explained by the independent variables. However, having a large R2 is not enough. In fact, you can always increase R2 by adding more variables, but that doesn’t mean your model is better or “validated”.
Explain analysis of residuals.
Check homoskedasticity (is the variance from the regression line the same for all values of the predictor variable? It shouldn’t increase or decrease as the predictor variable changes.) The residuals should be normally distributed. The target variables, or anything from one row of data to the next, should not be dependent on each other (time series and spatial data can violate this.) No multicollinearity.
Explain out-of-sample evaluation.
Cross-validation (then checking with R2)
What is latent semantic indexing?
Latent semantic indexing and retrieval is a method that uses singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. It is based on the principle that words that are used in the same contexts tend to have similar meanings. For example: two synonyms may never occur in the same passage but should nonetheless have highly associated representations

What are some applications of latent semantic indexing?

Learning correct word meanings, Subject matter comprehension, Information retrieval, Sentiment analysis (social network analysis)

Explain what resampling methods are?
Resampling methods repeatedly draw samples from a given sample in order to provide alternative representations of that sample.
What are resampling methods used for?
Resampling is commonly used to generate alternative training sets so that a model of interest can be refit on each set in order to obtain additional information about the fitted model.
Give an example of resampling methods.
For example: repeatedly draw different samples from training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fit differs
How does cross-validation use resampling?
In cross-validation the data set is randomly split into training and test data sets (using random sampling without replacement), and then the model is trained on the training set and evaluated on the test set to assess model performance.
How does bootstrapping use resampling?
In bootstrapping data is drawn with replacement from the dataset. Bootstrapping is mostly used to quantify the uncertainty associated with a given estimator or statistical learning method, but it can also be used to provide alternative datasets for a model to train on (such as in random forests).
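A minimal sketch of bootstrapping the uncertainty of a sample mean with NumPy; the sample and the number of resamples are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=100)      # the observed sample

# draw 2000 resamples with replacement and recompute the statistic on each
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

print(boot_means.std())                             # bootstrap standard error of the mean
print(np.percentile(boot_means, [2.5, 97.5]))       # a 95% bootstrap confidence interval
```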
What is principal component analysis?
PCA is a statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components. It’s a form of dimensionality reduction.
Explain the sort of problems you would use PCA for.
In PCA data is reduced from m to k dimensions (k <= m), where the goal is to find the k vectors onto which to project the data so as to minimize the projection error. PCA is used in compression (reducing disk/memory needed to store data), predictive models, data visualization, PCR (principal component regression), and survey and polling analyses.

Explain steps for using the PCA algorithm.

1) Preprocessing (standardization)
2) Compute the covariance matrix Σ
3) Compute the eigenvectors and eigenvalues of Σ
4) Choose k principal components so as to retain x% of the variance (typically x = 99) by summing the eigenvalues.
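A minimal NumPy sketch of these four steps; the data matrix and the 99% threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # illustrative n-by-m data matrix

# 1) preprocessing: standardize each column
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) covariance matrix
sigma = np.cov(Xs, rowvar=False)

# 3) eigenvectors and eigenvalues (eigh handles the symmetric covariance matrix)
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]                   # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) keep the smallest k whose eigenvalues account for at least 99% of the variance
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.99)) + 1
Z = Xs @ eigvecs[:, :k]                             # data projected onto k components
```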

What are the limitations of the PCA algorithm?
1. PCA is not scale invariant. 2. The directions with largest variance are assumed to be of most interest. 3. Only considers orthogonal transformations (rotations) of the original variables. 4. If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances.
Explain what false positives are.
Improperly reporting the presence of a condition when it is not present in reality. Example: an HIV-positive test result when the patient is actually HIV negative.
Explain what false negatives are.
Improperly reporting the absence of a condition when it is in fact present. Example: not detecting a disease when the patient actually has the disease.
When are false positives more important than false negatives?
For a non-contagious disease, where a treatment delay has no long-term consequences but the treatment itself is grueling. Another example is an HIV test, where a false positive carries a heavy psychological impact.
When are false negatives more important than false positives?
If early treatment is important for good outcomes. In quality control: a defective item slips through the cracks! In software testing: a test designed to catch a virus fails to detect it.
What is the difference between supervised and unsupervised learning?
Supervised learning: predictors (or features) are associated with a response (target); we wish to fit a model that relates features to targets, either to better understand the relation between them (inference) or to accurately predict the target for future observations (prediction). Unsupervised learning: there isn’t a response (target) that can supervise the analysis. Unsupervised learning instead aims to find structure or organization in the data.
Give examples of supervised learning techniques.
Support vector machines, neural networks, linear regression, logistic regression, extreme gradient boosting.
Give examples of unsupervised learning techniques.
Clustering (hierarchical, k-means, density-based; e.g., to identify groups of customers), principal component analysis, singular value decomposition, non-negative matrix factorization (NMF), self-organizing maps.
Give an example of an application of supervised learning.
Predict the price of a house based on its area and size; churn prediction; predict the relevance of search engine results.
Give an example of an application of unsupervised learning.
Find customer segments; image segmentation; cluster senators by their voting records.
When would you use random forests instead of Support-Vector-Machines and why?
In the case of a multi-class classification problem: SVM requires a one-against-all approach, which is memory intensive. When feature importances are needed. When a model is needed quickly: SVM takes a long time to tune, since you need to choose the appropriate kernel and its parameters (for instance sigma and epsilon). In a semi-supervised learning context (random forest with a dissimilarity measure): SVM can work only in a supervised learning mode.
What is collaborative filtering?
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly.
How is collaborative filtering used in a machine learning context?
This is investigated mathematically by looking for correlations between users or between items. Then these correlations are used to weight the ratings of users on items that are most highly correlated. Some recommendation systems use collaborative filtering.
Provide two or more ways of determining how similar data points are.
Cosine similarity quantifies how much two multidimensional vectors point in the same direction on a scale from 1 (pointing in the same direction) to -1 (in the same line but pointing in opposite directions). Cosine similarity of 0 means the vectors are orthogonal. Euclidean distance quantifies the straight line distance between two vectors using the root sum of the squared differences. Euclidean distance depends on both vector magnitudes and the directions the vectors are pointing.
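A minimal NumPy sketch of both measures; the two vectors are illustrative:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 = same direction, 0 = orthogonal, -1 = opposite
euclid = np.linalg.norm(a - b)                            # root sum of squared differences
print(cosine, euclid)
```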

What’s the difference between the coefficient of determination R2 and an adjusted R2?

R2 measures how close data are to a fitted regression line. Another way to think of it is the fraction of variance in the target variable that can be explained by the model. Adjusted R2 tells you the fraction of the variance explained by only the independent variables that actually affect the dependent variable.

R2 is defined mathematically as 1 minus the sum of the squared residuals (SSR) divided by the total sum of the squares (SST), where SST is the sum, over all observations, of the squared differences of each observation from the overall mean.

Explain how to calculate R2 and adjusted R2.
R2 = 1 – SSR/SST

Adjusted R2 = 1 – (1 – R2)(n – 1) / (n – k – 1), which is algebraically equivalent to ((n – 1) R2 – k) / (n – k – 1)

where n is the number of data points and k is the number of independent variables (predictors).
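A minimal sketch of both quantities as a Python function (y is the observed target, y_hat the model's predictions, k the number of predictors; the names are illustrative):

```python
import numpy as np

def r2_and_adjusted_r2(y, y_hat, k):
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = y.size
    ssr = np.sum((y - y_hat) ** 2)                   # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)                # total sum of squares
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # same as ((n - 1) * r2 - k) / (n - k - 1)
    return r2, adj_r2
```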

When might you use the adjusted value?
Use adjusted R2 when you wish to penalize model complexity, since plain R2 never decreases when more variables are added.
Do you think the ensemble average of 50 small decision trees will outperform a single deep decision tree?
A single deep decision tree is likely to suffer from overfitting. It will exhibit very little bias on the training data, but likely show large variance on the test data.
Why might an ensemble decision tree model outperform a single deep tree model?
The idea of using many small trees is an example of combining many weak learners into a strong learner. Each of the 50 small trees will likely show a good degree of bias, but averaging the ensemble substantially reduces the variance (and some ensemble methods, such as boosting, also reduce bias). So the model using multiple small trees will likely be better in practice.
For a regression model, why might the mean square error (MSE) be a poor measure of model performance?
In MSE, the residuals are squared. Therefore large outliers, which produce large residuals (errors) and even larger squared residuals, may unduly dominate the fit: the regression concentrates on shrinking the outliers’ errors rather than fitting the rest of the data.
What performance measure for a continuous value output model would you suggest instead of mean squared error (MSE)?
An alternate approach would be to use the mean absolute error (MAE) which would prevent squaring an already possibly large value.
What is a major shortcoming of the Naive Bayes approach?
Naive Bayes assumes that all features in the dataset are uncorrelated and independent. In large datasets this is rarely the case.
How might you address shortcomings of the Naive Bayes approach?
A way to deal with correlation between features is to use Principal Component Analysis (PCA). PCA performs a linear mapping of the data to a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance (and sometimes the correlation) matrix of the data is constructed and the eigenvectors on this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. These eigenvectors are less correlated than the original higher dimension feature vectors.
What are a couple of reasons to use an intercept term in a linear regression?
The intercept guarantees that the residuals have zero mean, and it guarantees that the least squares slope estimates (the coefficients) are unbiased.
What assumptions are required for linear regression?
The data used in fitting the model is representative of the population. The true underlying relation between x and y is linear. Variance of the residuals is constant (homoscedastic, not heteroscedastic). The residuals are independent (time series and spatial data can complicate this). The residuals are normally distributed.
What does linear regression assume about the distributions of x and y?
Linear regression doesn’t assume anything about the distributions of x and y, it only makes assumptions about the distribution of the residuals, and this is all that’s needed for the statistical tests to be valid.
How might one find the local minimums or maximums of a function?
Local minima and maxima of a function are found where its first derivative (taken analytically or numerically) is equal to zero.
How do you know if an output from a function is a minimum or maximum value?
Where the first derivative is zero: if the second derivative is positive, the point is a local minimum; if the second derivative is negative, the point is a local maximum. The second derivative is the derivative of the first derivative.

What is a major difference between linear and logistic regression?

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes.

Provide a scenario in which both logistic regression and linear regression would be used.
For instance, if X contains the area in square feet of houses and y contains the corresponding sale price of those houses, you could use linear regression to predict selling price as a function of house size. However, if you wanted to predict whether or not a house would sell for more than $200k based on size, you would use logistic regression. The possible outputs are either Yes, the house will sell for more than $200k, or No, the house will not.
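A minimal sketch of that scenario with scikit-learn; the house sizes and prices are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

sqft  = np.array([[800], [1200], [1500], [2000], [2400], [3000]])   # illustrative house sizes
price = np.array([120_000, 160_000, 210_000, 260_000, 320_000, 400_000])

# linear regression: predict the sale price itself (a continuous value)
lin = LinearRegression().fit(sqft, price)
print(lin.predict([[1800]]))

# logistic regression: predict whether the price exceeds $200k (a binary class)
over_200k = (price > 200_000).astype(int)
logit = LogisticRegression().fit(sqft / 1000, over_200k)  # feature in thousands of sq ft to help the solver
print(logit.predict_proba([[1.8]]))                       # probabilities for [No, Yes] at 1800 sq ft
```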
How would you interpret coefficients obtained from Linear Regression?
The interpretation of each of the coefficients (besides the intercept) is the difference in the predicted value y for each one-unit difference in the feature associated with that coefficient, if all other features remain constant.
How would you interpret coefficients obtained from Logistic Regression?
The coefficients (besides the intercept) can be interpreted in terms of an increase or decrease in the odds ratio, which is the ratio of the probability of something happening to the probability of something not happening. A one unit increase in a feature will either increase or decrease the odds ratio by the coefficient for that feature.
What is multicollinearity in your data set and why is it a potential problem?
Collinearity occurs in a dataset when two or more features (a.k.a. X variables) are highly correlated. These features provide redundant information. The issue with collinear features is that the design matrix becomes singular (or nearly so) and can’t be reliably inverted, so the best-fit coefficients can’t be determined or become highly unstable.
How do you detect multicollinearity?
Opposing signs for the coefficients of the affected variables, where it’s expected that both would be positive or negative. The standard errors of the regression coefficients of the affected variables tend to be large. Large changes in the individual coefficients when a predictor variable is added or deleted. Rule of thumb: a variance inflation factor (VIF) > 5 indicates a multicollinearity problem, where:

tolerance = 1 – R2j

VIF = 1 / tolerance

where R2j is the coefficient of determination of a regression of feature j on all the other features.
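A minimal sketch of the VIF check using statsmodels' variance_inflation_factor; the nearly duplicated x1/x2 columns are constructed just to trigger the rule of thumb:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),   # nearly collinear with x1
    "x3": rng.normal(size=200),                    # independent feature
})

Xc = sm.add_constant(X)                            # include an intercept column
vifs = {col: variance_inflation_factor(Xc.values, i) for i, col in enumerate(Xc.columns)}
print(vifs)   # expect VIF >> 5 for x1 and x2, and a VIF near 1 for x3
```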

How might you remove multicollinearity?
Regularization (Ridge and Lasso). Principal component analysis (PCA). Engineering a feature that combines the affected features. Simply dropping one of the features (the worst, but still viable, option).

Describe the steps to building a decision tree model

1) Take the entire data set as input.
2) Feature by feature, search for a split that maximizes the “separation” of the classes. This split will divide the data in two.
3) Apply the best split (the one that decreases impurity the most) to the input data. There are different ways to measure this (see below).
4) Re-apply steps 1 – 3, where in each case the divided data is the new data set.
5) Stop at a stopping criterion. For example, this could be a maximum depth, a minimum number of samples per leaf, or all the data having been perfectly classified (in which case the model is probably overfit!).
6) Optional – to decrease overfitting, you could go back and “prune” the tree to some maximum depth.
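A minimal sketch of fitting such a tree with scikit-learn, where max_depth and min_samples_leaf serve as the stopping criteria from step 5; the iris data set is used purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# grow the tree by greedy impurity-reducing splits, stopping at the criteria below
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=5, random_state=0).fit(X, y)
print(tree.score(X, y))   # training accuracy; use cross-validation for an honest estimate
```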

Describe how a decision tree works.
Algorithms for constructing decision trees usually work top-down, by choosing a feature at each step that “best” splits the set of items. Different algorithms use different metrics for measuring "best". These generally measure the homogeneity of the target variable within the subsets after the split. These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.
What are some algorithms for determining the best split of a set of items?
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. Information gain (entropy) is based on the concept of entropy from information theory, where the goal is to minimize heterogeneity in the resulting subsets.
What is the Variance reduction method and when is it used?
Variance reduction is often employed in cases where the target is continuous (regression tree), meaning that use of the other metrics would first require discretization before being applied. The variance reduction is defined as the total reduction of the variance of the target variable due to the split.
What is the curse of dimensionality?
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high dimensional spaces.
How does the curse of dimensionality affect distance and similarity measures?
In high dimensions everything becomes far away and hard to organize, so all points end up nearly equally far apart and equally dissimilar, which undermines distance- and similarity-based methods. The common theme: when the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse. This is an issue for any method that requires statistical significance, because the amount of data needed to support a result grows exponentially with the dimensionality.
What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
It’s not a bad idea – it should help avoid overfitting, so you could use this to increase the generality of the model. Regularization and ensemble methods address this, too.

What are endogenous variables?

Similar to dependent variables, these are determined by other variables in the system.

What are exogenous variables?

These are variables that are not affected by any other variables in the system, although they may be affected by factors outside the model.

What is information gain for a decision tree?

Information gain measures how much a split on a given feature reduces entropy; we use it to determine which feature in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

What is information gain used for?

It tells us how important a given attribute of the feature vectors is, so that we can use it to decide the ordering of attributes in the nodes of a decision tree.
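A minimal sketch of entropy and information gain for one candidate binary split; the label lists are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 0, 1, 1, 1]
print(information_gain(parent, [0, 0, 0], [1, 1, 1]))  # a perfect split: gain of 1 bit
```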