Data science is booming across the industry and is one of the most popular fields these days. Many statistics students want to learn data science, because statistics is the building block of machine learning algorithms. But most students don’t know how much statistics they need before they can start. To solve this problem, we are going to share the most important statistics concepts for data science. In this blog, you will see which areas of statistics are crucial to get started with data science.
Introduction to Statistics
Statistics is one of the most crucial subjects for students. It offers methods that help solve complex real-life problems, and it shows up almost everywhere. Data scientists and data analysts use it to find meaningful trends in the world. Besides, statistics has the power to derive meaningful insights from data.
Statistics offers a variety of functions, principles, and algorithms. They help us analyze raw data, build a statistical model, and infer or predict a result.
Terminologies in Statistics
Before getting started with data science, we have to be well aware of the key statistical terminologies.
Population: The complete set of sources from which the data is to be collected. A population can be very large.
Sample: A subset of data drawn from the population.
Variable: Any characteristic, number, or quantity of the data that can be measured or counted. In other words, a variable is a data item.
Statistical parameter: Also known as a population parameter, it is a quantity that describes the population, such as its mean or median. A statistical model is built around such parameters.
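To make the difference between a population and a sample concrete, here is a minimal sketch in Python. The population of heights is made up for illustration (it is not real data), and the fixed seed is only there so the numbers are reproducible.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights in cm of 10,000 people (simulated, not real data).
population = [random.gauss(170, 10) for _ in range(10_000)]

# A sample is a subset drawn from the population.
sample = random.sample(population, k=100)

# The population mean is a population parameter;
# the sample mean is the statistic we use to estimate it.
population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should be close but not identical, which is exactly why we talk about estimating a population parameter from a sample.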
Types of Analysis
Statistics has two types of analysis.
Quantitative Analysis: Also known as statistical analysis. It is the science of collecting and interpreting data with numbers and graphs in order to identify patterns and trends.
Qualitative Analysis: Also known as non-statistical analysis. It gives general information and uses text, sound, and other forms of media.
Types of Data
Numerical: Numerical data is expressed with digits and is measurable. It has two major types: discrete and continuous.
Categorical: Categorical data is qualitative and is classified into categories. It has two major types: nominal (no order) and ordinal (ordered).
Measures of Central Tendency
Mean: The mean is the average of the given dataset.
Median: The median is the middle value of the given ordered dataset.
Mode: The mode is the most common value in a given dataset. It is mainly relevant for discrete data.
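The three measures above can be tried out with Python's built-in statistics module. This is a small sketch on a made-up dataset, chosen so that the three values differ.

```python
import statistics

# Made-up dataset, already sorted to make the median easy to see.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum 30 divided by 6 values -> 5
median = statistics.median(data)  # average of the two middle values 3 and 5 -> 4.0
mode = statistics.mode(data)      # 3 appears most often -> 3

print(mean, median, mode)
```

Notice that the single large value 10 pulls the mean above the median, which is why the median is often preferred for skewed data.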
Measures of Variability
Range: Range is the difference between the maximum and minimum value in a given dataset.
Variance (σ²): Variance measures how spread out a set of data is relative to the mean.
Standard Deviation (σ): Another measure of how spread out the numbers in a dataset are. It is the square root of the variance.
Z-score: Z score determines the number of standard deviations a data point is from the mean.
R-Squared: R-squared is a statistical measure of fit. It indicates how much of the variation of a dependent variable is explained by the independent variable(s). It is most useful for simple linear regression, because it never decreases when more predictors are added.
Adjusted R-squared: A modified version of R-squared that has been adjusted for the number of predictors in the model. It increases if a new term improves the model more than would be expected by chance, and decreases otherwise.
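Here is a rough sketch of the measures in this section using only the Python standard library. The datasets are made up, the variance is the population variance (dividing by n), and the helper names z_score, r_squared, and adjusted_r_squared are our own, not from any library.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)     # 9 - 2 = 7
mean = statistics.fmean(data)          # 5.0
variance = statistics.pvariance(data)  # population variance: 32 / 8 = 4
std_dev = statistics.pstdev(data)      # square root of the variance: 2.0


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def r_squared(y_true, y_pred):
    """Share of the variation in y_true explained by the predictions."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """R-squared adjusted for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observed values and model predictions.
y_true = [1, 2, 3, 4, 5]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n=len(y_true), k=1)

print(data_range, variance, std_dev, z_score(9, mean, std_dev))
print(round(r2, 4), round(adj_r2, 4))
```

The adjusted value comes out slightly lower than R-squared, because the adjustment charges a small penalty for every predictor in the model.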
Measurements of Relationships between Variables
Covariance: Covariance measures how two variables vary together. If it is positive, the variables tend to move in the same direction. If it is negative, they tend to move in opposite directions. If it is zero, there is no linear relationship between them.
Correlation: Correlation measures the strength of the linear relationship between two variables. It ranges from -1 to 1 and is the normalized version of covariance. As a rule of thumb, a correlation of +/- 0.7 or beyond represents a strong relationship between two variables. On the other hand, a correlation between -0.3 and 0.3 indicates little to no relationship.
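To see how correlation is just covariance on a normalized scale, here is a minimal sketch. The functions are written by hand for illustration and use the sample covariance (dividing by n - 1); the study-hours data is made up.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

cov = covariance(hours, score)    # 13.75: positive, so they move together
corr = correlation(hours, score)  # close to 1: a strong positive relationship

print(cov, round(corr, 3))
```

The covariance depends on the units of the data, while the correlation always lands between -1 and 1, which makes it easier to compare across datasets.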
Probability Distribution Functions
Probability Density Function (PDF): A function for continuous data. Its value at any point can be interpreted as the relative likelihood that the value of the random variable would be equal to that point.
Probability Mass Function (PMF): A function for discrete data. It gives the probability of a given value occurring.
Cumulative Distribution Function (CDF): It tells us the probability that the random variable is less than or equal to a certain value. For continuous data, it is the integral of the PDF.
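The three functions can be sketched with a fair die (discrete) and a standard normal variable (continuous). This example assumes Python 3.8 or later for statistics.NormalDist; the integrate_pdf helper is our own rough midpoint-rule approximation, included only to show that the CDF is the area under the PDF.

```python
from fractions import Fraction
from statistics import NormalDist

# PMF of a fair six-sided die: every face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
p_four = die_pmf[4]

# Discrete CDF: probability of rolling 4 or less is the sum of the PMF up to 4.
p_at_most_four = sum(p for face, p in die_pmf.items() if face <= 4)

# PDF and CDF of a standard normal variable.
standard_normal = NormalDist(mu=0, sigma=1)
density_at_zero = standard_normal.pdf(0)  # a relative likelihood, not a probability
prob_below_zero = standard_normal.cdf(0)  # half of the area lies below the mean


def integrate_pdf(dist, lower, upper, steps=10_000):
    """Approximate the area under dist.pdf between lower and upper (midpoint rule)."""
    width = (upper - lower) / steps
    return sum(dist.pdf(lower + (i + 0.5) * width) for i in range(steps)) * width


# The area under the PDF up to 1 should match the CDF at 1.
area_up_to_one = integrate_pdf(standard_normal, -8, 1)

print(p_four, p_at_most_four)
print(round(density_at_zero, 4), prob_below_zero)
print(round(area_up_to_one, 4), round(standard_normal.cdf(1), 4))
```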
Continuous Data Distributions
Uniform Distribution: A probability distribution in which all outcomes are equally likely.
Normal/Gaussian Distribution: Commonly referred to as the bell curve and closely related to the central limit theorem. It is defined by its mean and standard deviation. The standard normal distribution has a mean of 0 and a standard deviation of 1.
T-Distribution: A probability distribution used to estimate population parameters when the sample size is small or the population variance is unknown.
Uniform Distribution, viewed as a density: The probability density takes a single constant value within a certain range and is 0 outside that range. For this reason it is also known as an on-off distribution.
Poisson Distribution: Strictly speaking it is a discrete distribution, but its shape is quite similar to the normal distribution with one additional factor, the skewness. When the skewness is low, the distribution is spread almost symmetrically around the mean. When the skewness is high, the data is spread unevenly, with a longer tail on one side.
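Two of the continuous distributions above are easy to sketch with the standard library. The uniform_pdf helper is our own illustration of the constant-inside, zero-outside density, and the heights distribution is a hypothetical example. The t-distribution is left out because the Python standard library does not provide it.

```python
from statistics import NormalDist


def uniform_pdf(x, low, high):
    """Density of a continuous uniform distribution on [low, high]."""
    # Constant inside the range, zero outside it.
    return 1 / (high - low) if low <= x <= high else 0.0


# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of landing within one standard deviation of the mean (about 68%).
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# Any normal distribution can be mapped onto the standard normal with a z-score.
heights = NormalDist(mu=170, sigma=10)    # hypothetical heights in cm
z = (185 - heights.mean) / heights.stdev  # 1.5 standard deviations above the mean

print(uniform_pdf(3, 2, 6), uniform_pdf(7, 2, 6))
print(round(within_one_sd, 4))
print(round(heights.cdf(185), 4), round(standard_normal.cdf(z), 4))
```

The last line prints the same probability twice, once from the original distribution and once from the standard normal after converting 185 cm to a z-score.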
Discrete Data Distributions
Poisson Distribution: One of the most common probability distributions. It expresses the probability of a given number of events occurring within a fixed time period, when the events happen independently at a constant average rate.
Binomial Distribution: The probability distribution of the number of successes in a sequence of n independent experiments, each with its own Boolean-valued outcome (success with probability p, failure with probability 1-p).
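Both distributions have simple closed-form probability mass functions, which can be sketched directly with the math module (math.comb needs Python 3.8 or later). The function names and the example numbers are our own.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Hypothetical example: a shop gets 2 customers per hour on average.
p_no_customers = poisson_pmf(0, lam=2)  # exp(-2), about 0.135

# Probability of exactly 2 heads in 4 fair coin flips: 6 * (1/2)^4.
p_two_heads = binomial_pmf(2, n=4, p=0.5)  # 0.375

print(round(p_no_customers, 4), p_two_heads)
```

As a sanity check, summing binomial_pmf over every k from 0 to n gives 1, as any probability mass function must.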
Moments describe different aspects of the nature and shape of a distribution. They come in sequence: the mean is the first moment, the variance is the second, the skewness is the third, and the kurtosis is the fourth.
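Here is a small sketch that computes the four moments for a dataset. It follows the common convention where variance is the second central moment and skewness and kurtosis are the third and fourth central moments divided by the right power of the variance. The moments helper and both datasets are made up for illustration.

```python
def moments(data):
    """Return (mean, variance, skewness, kurtosis) using population formulas."""
    n = len(data)
    mean = sum(data) / n

    def central(k):
        # k-th central moment: average of (x - mean) ** k
        return sum((x - mean) ** k for x in data) / n

    variance = central(2)
    skewness = central(3) / variance ** 1.5  # standardized third moment
    kurtosis = central(4) / variance ** 2    # standardized fourth moment
    return mean, variance, skewness, kurtosis


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]

print(moments(symmetric))     # skewness is 0 for symmetric data
print(moments(right_skewed))  # positive skewness: long tail on the right
```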
Probability is the likelihood of an event occurring.
Conditional Probability: P(A|B) is the likelihood of event A occurring, given that event B has already occurred.
Bayes’ Theorem: A well-known mathematical formula for determining conditional probability. It states that the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B: P(A|B) = P(B|A) P(A) / P(B).
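Bayes’ theorem is easiest to understand with numbers. The sketch below uses a made-up medical test: the prevalence and test rates are hypothetical and chosen only to show how the formula is applied.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers for illustration only.
p_disease = 0.01            # P(A): 1% of people have the condition
p_pos_given_disease = 0.95  # P(B|A): the test is positive for 95% of sick people
p_pos_given_healthy = 0.05  # the test is wrongly positive for 5% of healthy people

# P(B): overall probability of a positive test (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(A|B): probability of having the condition given a positive test.
p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)

print(round(p_pos, 4), round(p_disease_given_pos, 4))
```

Even with a fairly accurate test, the result is only about 16%, because the condition is rare and most positive tests come from the much larger healthy group.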
True positive: The test detects the condition when the condition is present.
True negative: The test does not detect the condition when the condition is absent.
False positive: The test detects the condition when the condition is absent.
False negative: The test does not detect the condition when the condition is present.
Sensitivity: It measures the ability of a test to detect the condition when the condition is present. Sensitivity = TP/(TP+FN)
Specificity: It measures the ability of a test to correctly exclude the condition when the condition is absent. Specificity = TN/(TN+FP)
Predictive value positive: Also called precision. It is the proportion of positives that correspond to the presence of the condition. PVP = TP/(TP+FP)
Predictive value negative: The proportion of negatives that correspond to the absence of the condition. PVN = TN/(TN+FN)
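The four formulas above can be wrapped in one small function. This is a minimal sketch: the function name diagnostic_metrics and the counts are hypothetical, and it assumes none of the denominators is zero.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Compute the four accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of present conditions that are detected
        "specificity": tn / (tn + fp),  # share of absent conditions that are excluded
        "pvp": tp / (tp + fp),          # predictive value positive (precision)
        "pvn": tn / (tn + fn),          # predictive value negative
    }


# Hypothetical results of a test on 1,000 people.
metrics = diagnostic_metrics(tp=90, tn=850, fp=50, fn=10)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the test catches 90% of the real cases, but only about 64% of its positive results are correct, which shows why sensitivity and predictive value positive should be read together.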
Now we have gone through all the basic concepts of statistics for data science. If you are going to start with data science, you should try to get a good command of these statistical concepts. They will help you a lot when you start learning data science and make its concepts much easier to understand. So what are you waiting for? Grab the best statistics books and start learning these concepts.
Have a great one!