Telegram channel datasciencefun - Data Science & Machine Learning

Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free. For collaborations: @Guideishere12. Buy ads: https://telega.io/c/datasciencefun

Data Science & Machine Learning

Which regularization techniques do you know?

There are two main types of regularization:

L1 Regularization (Lasso regularization) - Adds the sum of absolute values of the coefficients to the cost function.
L2 Regularization (Ridge regularization) - Adds the sum of squares of the coefficients to the cost function.

In both cases the added sum is multiplied by a coefficient lambda, which determines the amount of regularization.
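
As a rough illustration, the two penalties can be written out directly. This is a minimal sketch: the function names and the sample coefficients are made up for the example, and lam stands for lambda.

```python
def l1_penalty(coefs, lam):
    """Lasso penalty: lambda times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lambda times the sum of squared coefficients."""
    return lam * sum(c * c for c in coefs)


def regularized_cost(base_cost, coefs, lam, kind="l2"):
    """Add the chosen penalty to an already computed cost value."""
    penalty = l1_penalty(coefs, lam) if kind == "l1" else l2_penalty(coefs, lam)
    return base_cost + penalty


coefs = [0.5, -2.0, 0.0]  # hypothetical model coefficients
print(l1_penalty(coefs, lam=0.1))
print(l2_penalty(coefs, lam=0.1))
```

A larger lam makes the penalty weigh more against the fit, so the coefficients are pushed harder toward zero.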

Data Science & Machine Learning

Everything you need to know about TensorFlow 2.0
Keras-APIs, SavedModels, TensorBoard, Keras-Tuner and more.

https://hackernoon.com/everything-you-need-to-know-about-tensorflow-2-0-b0856960c074?

Data Science & Machine Learning

What happens to our linear regression model if we have three columns in our data: x, y, z  —  and z is a sum of x and y?

We would not be able to perform the regression with ordinary least squares if all three columns are used as features. Because z is linearly dependent on x and y, the matrix X^T X that has to be inverted when fitting the regression is singular (not invertible).
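
A small sketch of why this happens, assuming x, y and z are all used as features and using made-up integer data. The determinant of X^T X comes out as exactly zero, so the matrix cannot be inverted.

```python
def gram_matrix(rows):
    """Return X^T X for a list of equal-length feature rows."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


xy = [(1, 2), (2, 1), (3, 5), (4, 3)]       # hypothetical x and y values
rows = [(x, y, x + y) for x, y in xy]       # z is the sum of x and y
gram = gram_matrix(rows)
print(det3(gram))  # 0: X^T X is singular
```

Dropping one of the three columns, or using a regularized model, removes the exact dependence.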

Data Science & Machine Learning

What do we do with categorical variables?

Categorical variables must be encoded before they can be used as features to train a machine learning model. There are various encoding techniques, including:

One-hot encoding
Label encoding
Ordinal encoding
Target encoding
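
Two of the techniques above can be sketched in a few lines. This is only an illustration: the helper names and the sample categories are assumptions, and libraries such as pandas or scikit-learn provide their own encoders.

```python
def one_hot_encode(values):
    """Return one 0/1 vector per value, plus the category order used."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


def ordinal_encode(values, order):
    """Map each value to its position in an explicit, meaningful order."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


colors = ["red", "green", "blue", "green"]
encoded, categories = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

sizes = ordinal_encode(["small", "large", "medium"], order=["small", "medium", "large"])
print(sizes)       # [0, 2, 1]
```

One-hot encoding suits categories with no natural order, while ordinal encoding only makes sense when the order carries meaning.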

Data Science & Machine Learning

What is the PR (precision-recall) curve?

A precision-recall (PR) curve is a plot of precision (y-axis) against recall (x-axis) for different probability thresholds. PR curves are recommended for highly skewed domains, where ROC curves may provide an excessively optimistic view of the performance.
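
The points of such a curve can be computed by sweeping a threshold over the predicted scores. A minimal sketch with made-up labels and scores; treating precision as 1.0 when nothing is predicted positive is a convention chosen here, not a rule.

```python
def precision_recall_points(y_true, scores, thresholds):
    """Return (threshold, precision, recall) for each threshold."""
    points = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, precision, recall))
    return points


y_true = [0, 0, 1, 1]            # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
for t, p, r in precision_recall_points(y_true, scores, [0.3, 0.5]):
    print(f"threshold={t} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold here trades recall for precision, which is the trade-off the curve visualizes.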

Data Science & Machine Learning

What kind of problems can neural nets solve?

Neural nets are good at solving non-linear problems. Some good examples are problems that are relatively easy for humans (because of experience, intuition, understanding, etc), but difficult for traditional regression models: speech recognition, handwriting recognition, image identification, etc.

Data Science & Machine Learning

What’s the interpretation of the bias term in linear models?

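In a linear model, the bias term (intercept) is the constant added to the weighted sum of the features: it is the value the model predicts when all features are zero. This is different from bias in the statistical sense, which is the difference between the average prediction and the true value, i.e. mean(predictions) minus true value. Also, don't confuse bias with accuracy.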
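
Reading the bias term in the question as the intercept of a linear model, here is a minimal sketch with one feature and made-up data. The fitted intercept b0 is what the model predicts at x = 0.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b0 = mean_y - b1 * mean_x  # bias term (intercept)
    return b0, b1


xs = [0, 1, 2, 3]   # hypothetical feature values
ys = [3, 5, 7, 9]   # generated as y = 3 + 2 * x
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```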

Data Science & Machine Learning

What are the main assumptions of linear regression?

There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.

1) Linear relationship between features and target variable.

2) Additivity means that the effect of changes in one of the features on the target variable does not depend on values of other features. For example, a model for predicting the revenue of a company has two features: the number of items a sold and the number of items b sold. When the company sells more items a, the revenue increases, and this is independent of the number of items b sold. But if customers who buy a stop buying b, the additivity assumption is violated.

3) Features are not correlated (no collinearity) since it can be difficult to separate out the individual effects of collinear features on the target variable.

4) Errors are independently and identically normally distributed (y_i = B0 + B1*x1_i + ... + error_i):

i) No correlation between errors (consecutive errors in the case of time series data).

ii) Constant variance of errors - homoscedasticity. For example, in case of time series, seasonal patterns can increase errors in seasons with higher activity.

iii) Errors are normally distributed. If the error distribution is significantly non-normal, confidence intervals and significance tests for the coefficients become unreliable: the intervals may be too wide or too narrow.

Data Science & Machine Learning

What are gradient boosting trees?

Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. The trees are added one at a time, each one fitted to the errors of the ensemble built so far.
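
To make the idea concrete, here is a toy sketch of gradient boosting for regression with squared loss, using one-split trees (stumps) on a single feature. The data, the number of trees and the learning rate are made up, and the code assumes at least two distinct x values; real libraries are far more elaborate.

```python
def fit_stump(xs, residuals):
    """Pick the single threshold split that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]                  # hypothetical feature values
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]      # hypothetical targets
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)
```

Each stump on its own is a weak model; the ensemble improves because every new stump targets what the previous ones still get wrong.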

Data Science & Machine Learning

Udacity's AWS Machine Learning Scholarship to learn machine learning fundamentals for free
👇👇

bit.ly/3vXVYLG

Udacity has partnered with AWS to launch the new AWS Machine Learning Scholarship program to enable developers of all skill levels to learn the fundamentals of machine learning for free! Applicants will gain access to the AWS Machine Learning Foundations course, and top performers will be selected to receive a full scholarship to the AWS Machine Learning Engineer Nanodegree program. Additionally, the first 150 students to successfully complete the course will receive an AWS DeepLens device, and the first 2,500 who enroll in the course will receive $35 in AWS credits! Applications are open now and close on July 12.

Data Science & Machine Learning

What are the main parameters of the decision tree model?

• maximum tree depth
• minimum samples per leaf node
• impurity criterion

Data Science & Machine Learning

Three different learning styles in machine learning algorithms:

1. Supervised Learning

Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.

A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.

Example problems are classification and regression.

Example algorithms include: Logistic Regression and the Back Propagation Neural Network.

2. Unsupervised Learning

Input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and K-Means.

3. Semi-Supervised Learning

Input data is a mixture of labeled and unlabeled examples.

There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.

Example problems are classification and regression.

Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.

Data Science & Machine Learning

Learning and Practicing SQL: Resources and Platforms

1. https://sqlbolt.com/
2. https://sqlzoo.net/
3. https://www.codecademy.com/learn/learn-sql
4. https://www.w3schools.com/sql/
5. https://www.hackerrank.com/domains/sql
6. https://www.windowfunctions.com/
7. https://selectstarsql.com/
8. https://quip.com/2gwZArKuWk7W
9. https://leetcode.com/problemset/database/
10. http://thedatamonk.com/

Data Science & Machine Learning

How do we evaluate classification models?

Depending on the classification problem, we can use the following evaluation metrics:

Accuracy
Precision
Recall
F1 Score
Logistic loss (also known as Cross-entropy loss)
Jaccard similarity coefficient score

Data Science & Machine Learning

What is sigmoid? What does it do?

A sigmoid function is a type of activation function, and more specifically defined as a squashing function. Squashing functions limit the output to a range between 0 and 1, making these functions useful in the prediction of probabilities.

Sigmoid(x) = 1/(1+e^{-x})
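
A short sketch of the formula in code. The two-branch form is a common way to avoid overflow for large negative inputs; it is an implementation choice made here, not part of the definition.

```python
import math


def sigmoid(x):
    """Compute 1 / (1 + e^(-x)) without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) is at most 1
    return z / (1.0 + z)


print(sigmoid(0))    # 0.5
print(sigmoid(4))
print(sigmoid(-4))
```

Large positive inputs are squashed toward 1 and large negative inputs toward 0, which is why the output can be read as a probability.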

Data Science & Machine Learning

Machine Learning for Everyone in some words

https://vas3k.com/blog/machine_learning/

Data Science & Machine Learning

What’s the difference between random forest and gradient boosting?

Random Forest builds each tree independently, while Gradient Boosting builds trees sequentially, one at a time, with each new tree correcting the errors of the trees before it.
Random Forest combines results at the end of the process (by averaging or "majority rules"), while Gradient Boosting combines results along the way.

Data Science & Machine Learning

What is the area under the PR curve? Is it a useful metric?

The Precision-Recall AUC is just like the ROC AUC, in that it summarizes the curve with a range of threshold values as a single score.

A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

Data Science & Machine Learning

What is AUC (AU ROC)? When to use it?

AUC stands for Area Under the ROC Curve. ROC is a probability curve and AUC represents the degree or measure of separability. It's used when we need to evaluate how well a model can distinguish between classes. The value is between 0 and 1, the higher the better; a value of 0.5 corresponds to random guessing.
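
One way to see AUC as a measure of separability: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. A minimal sketch with made-up labels and scores; the pairwise loop is fine for small data only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]            # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical model scores
print(roc_auc(y_true, scores))   # 0.75
```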

Data Science & Machine Learning

Top 7 Nanodegree Certification Programs to Master Data Science and Machine Learning👇👇

1. Programming for Data Science with Python

bit.ly/324JSUh

2. DATA VISUALIZATION

bit.ly/3mzCSZ7

3. Become a Machine Learning Engineer

bit.ly/3mxRMiC

4. Learn Python from Intermediate to Advanced Level

https://bit.ly/3ju08s2

5. Intro to Machine Learning with TensorFlow

bit.ly/2PR0vjY

6. Intermediate Python

https://bit.ly/3z7YPnZ

7. Become a Data Scientist

https://bit.ly/2TRo9P7

Get special 75% discount on any of the above courses by using the coupon code JULY75

Enroll as soon as possible, because Udacity is giving a huge discount this time and the coupon code is valid for a limited time only.

Data Science & Machine Learning

What is the ROC curve? When to use it?

ROC stands for Receiver Operating Characteristic. It is a plot of the true positive rate against the false positive rate at different classification thresholds.

It is used to evaluate a binary classifier that predicts a probability or score for the outcome.
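
The points of a ROC curve can be sketched by sweeping a threshold over the scores and recording the false positive rate and true positive rate at each step. The labels, scores and thresholds below are made up for illustration.

```python
def roc_points(y_true, scores, thresholds):
    """Return (false positive rate, true positive rate) per threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


y_true = [0, 0, 1, 1]            # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
print(roc_points(y_true, scores, [0.9, 0.5, 0.3, 0.0]))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Lowering the threshold moves along the curve from the bottom-left corner (nothing predicted positive) to the top-right corner (everything predicted positive).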

Data Science & Machine Learning

Supervised learning requires a training set to teach models to yield the desired output. The training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm also measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.

Data Science & Machine Learning

What happens when we have correlated features in our data?

In a random forest, each tree is built on a random sample of features, so the information carried by two correlated features is twice as likely to be picked as the information contained in any other single feature.

In general, adding correlated features means adding features that linearly contain the same information, which reduces the robustness of your model. Each time you train your model, it might pick one feature or the other to "do the same job", i.e. explain some variance, reduce entropy, etc.

Data Science & Machine Learning

What is feature selection? Why do we need it?

Feature selection is a method used to select the relevant features for the model to train on. We need feature selection to remove irrelevant features, which lead the model to under-perform.

Data Science & Machine Learning

What are the decision trees?

This is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables.

In this algorithm, we split the population into two or more homogeneous sets. This is done based on most significant attributes/ independent variables to make as distinct groups as possible.

A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable.

Splits are chosen using criteria such as Gini impurity, information gain (entropy) and chi-square.
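
Two of these criteria are easy to write down. A minimal sketch, with made-up labels; information_gain here means the entropy of the parent node minus the size-weighted entropy of its two children.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


parent = ["a", "a", "b", "b"]   # hypothetical labels at a node
print(gini(parent))                                      # 0.5
print(entropy(parent))                                   # 1.0
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # 1.0
```

A split that produces pure groups, as in the last line, removes all the impurity of the parent node.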

Data Science & Machine Learning

Can you explain how cross-validation works?

Cross-validation splits your total training set into two subsets, a training set and a validation set, and evaluates your model on the validation set to choose the hyperparameters. The process is repeated iteratively with different training and validation splits, in order to reduce the bias you would get from relying on a single validation set.
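
One common variant is k-fold cross-validation, where every sample lands in the validation set exactly once. A minimal sketch of the index splitting only; the function name is made up, and no shuffling is done, which real code would usually add.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

In practice you would train the model on each train split, score it on the matching validation split, and average the k scores for each hyperparameter setting.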

Data Science & Machine Learning

Is accuracy always a good metric?

Accuracy is not a good performance metric when the dataset is imbalanced. For example, in binary classification with 95% of class A and 5% of class B, a constant prediction of class A would have an accuracy of 95%. For an imbalanced dataset, we need to choose precision, recall, or F1 score, depending on the problem we are trying to solve.

What are precision, recall, and F1-score?

Precision and recall are classification evaluation metrics:
P = TP / (TP + FP) and R = TP / (TP + FN).

Where TP is true positives, FP is false positives and FN is false negatives

In both cases the score of 1 is the best: we get no false positives or false negatives and only true positives.

F1 is a combination of both precision and recall in one score (harmonic mean):
F1 = 2 * PR / (P + R).
Max F score is 1 and min is 0, with 1 being the best.
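
The 95%/5% example can be checked with a few lines of code. A minimal sketch that treats class B as the positive class; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = ["A"] * 95 + ["B"] * 5   # imbalanced labels from the example
y_pred = ["A"] * 100              # constant prediction of class A
metrics = classification_metrics(y_true, y_pred, positive="B")
print(metrics)
```

Accuracy looks high, while recall and F1 for class B are zero, which is exactly the failure that accuracy alone hides.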

Data Science & Machine Learning

Where to get data for your next machine learning project?

An overview of 5 amazing resources to accelerate your next project with data!

📌 Google Datasets
Searching for datasets with the Google Dataset Search engine is as easy as searching for anything on Google Search! You just enter the topic for which you need a dataset.

📌 Kaggle Dataset
Explore, analyze, and share quality data.

📌 Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources

📌 Awesome Public Datasets
A topic-centric list of HQ open datasets.

📌 Azure public data sets
Public data sets for testing and prototyping.

Data Science & Machine Learning

What is overfitting?

Overfitting happens when your model performs very well on your training set but can't generalize to the test set, because it has adjusted too closely to the training set.
