Understanding the world of Statistics
This is the first post in this category, where we will go deeper into the magical world of statistics. Without a knowledge of statistics, analytics is incomplete. In fact, anyone aspiring to be a data analyst should learn it before moving on to more advanced topics such as machine learning.
For this introductory post, I will explain the different branches of statistics along with the different types of data that one is bound to encounter while working on any analytics project. So let's start.
Branches of Statistics
Descriptive statistics: This branch deals with summarizing and deriving insight about the data you actually have, whether that is a particular data sample in its entirety or an entire population, without generalizing beyond it. It is the natural choice when you are studying small datasets and/or the cost of collecting and managing the data is negligible.
Inferential statistics: This branch deals with deriving insight about a population on the basis of some data points that have been sampled from it. It is favourable when you have to deal with a large dataset and analysing the entire thing is both costly and time consuming.
Let me explain this with the help of an example. Megastore ABC, which has been in operation for 20 years, would like to know the average footfall at its stores on any given day. One could gather the data for the entire 20 years and average it (descriptive statistics). Alternatively, one could collect data for only the past couple of years, average that, and treat the result as an estimate of the long-run figure (inferential statistics). The former is tedious and time consuming, while the latter is cheaper and quicker. Although such a sampling approach does have its shortfalls (which will be explained in a later post), it does give a fair idea about the problem at hand.
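To make the Megastore ABC example concrete, here is a minimal Python sketch. The footfall numbers are randomly generated purely for illustration (they are an assumption, not real store data), and the names `population` and `sample` are hypothetical. It compares the average over all 20 years with the average over the last two years only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical daily footfall for 20 years (made-up numbers, not real data).
DAYS = 20 * 365
population = [random.randint(800, 1200) for _ in range(DAYS)]

# Descriptive approach: average every recorded day.
population_mean = statistics.mean(population)

# Inferential approach: average only the most recent two years and
# treat that as an estimate of the 20-year figure.
sample = population[-2 * 365:]
sample_mean = statistics.mean(sample)

print(f"All 20 years : {population_mean:.1f}")
print(f"Last 2 years : {sample_mean:.1f}")
print(f"Difference   : {abs(population_mean - sample_mean):.1f}")
```

In this simulated data every day is drawn from the same range, so the two averages land close together. With real footfall that trends or varies by season, the last two years may not represent the full 20, which is one of the shortfalls of sampling mentioned above.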
Types of Data
No, we are not talking about character, integer, float, double and other terms from the programmer's dictionary. We are speaking strictly in a statistical context. Let us see what these are:
Nominal data: This kind of data (also referred to as categorical data) is the most basic kind of data. Names of people, employee IDs and names of the countries of the world are all examples of nominal data. Its use in analysis is limited, since values can only be classified, counted and checked for equality; they cannot be ordered or averaged. This is referred to as the lowest level of data measurement.
Ordinal data: This type of data is one step ahead of nominal data, because there is a certain order, priority or importance linked with the values. Leaderboard rankings at an F1 race and performance ratings are examples of ordinal data. In ordinal data the order gives some insight, but the size of the gap between positions is unknown, so a numerical comparison cannot be made on the basis of the order alone. For example, the ranking by itself cannot tell you that the person who won the race was twice as fast as the person who came in second.
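A small Python sketch can illustrate why ranks alone do not support numerical comparison. The driver names and finishing times below are invented for illustration, and `ranking` and `winning_margin` are hypothetical helper names. Two races produce exactly the same ordering even though the gaps between drivers are very different.

```python
# Hypothetical finishing times in seconds for two races (made-up numbers).
race_a = {"Driver X": 5400.0, "Driver Y": 5400.5, "Driver Z": 5460.0}
race_b = {"Driver X": 5400.0, "Driver Y": 5430.0, "Driver Z": 5431.0}


def ranking(times):
    """Return driver names ordered from fastest to slowest."""
    return sorted(times, key=times.get)


def winning_margin(times):
    """Return the gap in seconds between the winner and the runner-up."""
    fastest, second = sorted(times.values())[:2]
    return second - fastest


# The ordinal view (the leaderboard) is identical for both races...
print(ranking(race_a))
print(ranking(race_b))

# ...but the underlying margins, which the ranks discard, are not.
print(winning_margin(race_a))
print(winning_margin(race_b))
```

Once only the leaderboard is kept, there is no way to tell a photo finish from a comfortable win.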
Interval data: This is a level higher than ordinal data in the sense that the differences between observations are numerically meaningful. Temperature in Celsius or Fahrenheit, percentage change in a stock price and CGPA of students are all examples of interval data. One specific shortfall of interval data is that there is no absolute zero: the zero point is an arbitrary convention of the scale and does not mean the absence of the quantity, so ratios between values are not meaningful. For instance, zero degrees Fahrenheit and zero degrees Celsius are different temperatures, and neither means "no temperature", so 20 °C cannot be called twice as hot as 10 °C.
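The point about the arbitrary zero can be checked with a few lines of Python. This is only an illustrative sketch: the two temperatures are arbitrary example values, and `celsius_to_fahrenheit` is a helper written for this post. The ratio of two temperatures changes when the unit changes, while the difference simply rescales.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


cool_c, warm_c = 10.0, 20.0                # example temperatures in Celsius
cool_f = celsius_to_fahrenheit(cool_c)     # 50.0
warm_f = celsius_to_fahrenheit(warm_c)     # 68.0

# Ratios depend on the unit, so "twice as hot" has no real meaning.
ratio_c = warm_c / cool_c                  # 2.0 in Celsius
ratio_f = warm_f / cool_f                  # 1.36 in Fahrenheit

# Differences stay consistent: they only rescale by the factor 9/5.
diff_c = warm_c - cool_c                   # 10.0
diff_f = warm_f - cool_f                   # 18.0

print(ratio_c, ratio_f)
print(diff_c, diff_f)
```

The same two temperatures give a ratio of 2.0 in one unit and 1.36 in the other, which is why only differences are treated as meaningful for interval data.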
Ratio data: This is the highest level of data measurement and is the most preferred for analysis. Transaction amounts, goals scored in a football match and scores in a cricket match are all examples of ratio data. In ratio data, numerical comparisons including ratios are possible, and zero has its literal meaning: zero goals means no goals were scored, and four goals is twice as many as two.
Higher-level data can easily be converted to lower-level data, but the opposite is not true. You can categorize cricket matches as high scoring and low scoring from their scores, but you cannot recover the actual scores from the low scoring and high scoring labels.
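Here is a minimal Python sketch of that one-way conversion. The match totals and the cut-off of 250 are assumptions chosen for illustration, and `to_category` is a hypothetical helper name.

```python
# Hypothetical innings totals (made-up numbers for illustration).
scores = [142, 310, 275, 198, 356]

THRESHOLD = 250  # assumed cut-off between low and high scoring


def to_category(score):
    """Map a ratio-level score to an ordinal-level label."""
    return "high scoring" if score >= THRESHOLD else "low scoring"


labels = [to_category(s) for s in scores]
print(labels)

# Going back is not possible: several different scores share one label.
low_scores = [s for s in scores if to_category(s) == "low scoring"]
print(low_scores)
```

Both 142 and 198 collapse into the same "low scoring" label, so the label alone cannot tell you which score it came from.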
More to follow in this category!!!!