“Statistical analysis is a component of data analytics. In the context of business intelligence (BI), statistical analysis involves collecting and scrutinizing every data sample in a set of items from which samples can be drawn. A sample, in statistics, is a representative selection drawn from a total population.” (https://whatis.techtarget.com/definition/statistical-analysis)
Below I will describe how statistical analysis can be implemented and used in Power BI. Even this field is very wide, I will choose to describe only few measures that were most suitable in my case. I have a dataset with employee responses over a set of 144 questions (Preventing psychological risks at work) and I need to find how their answers depend on age, department, experience…
There are few Power BI visuals already available to get some answers in seconds:
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values.( Wikipedia)
This visual show standard deviation for our dataset: Question category on X axis and Response value on Y axis. Line value represent standard deviation of Values and Age and for the calculation I used STDEV.P function which return a number representing the standard deviation of the entire population.
There are some basic statistic calculation like mean, margin of error, confidence interval, correlation we can use to get insights, estimate computed of the observed data and also to run different scenarios that can help guide future actions.
In Power BI you can use CONFIDENCE.NORM function – Returns the confidence interval for a population mean, using a normal distribution.
CONFIDENCE.NORM(alpha,standard_dev,size)
Confidence interval = Mean -/+ Margin of Error
The best way to see all these measures is to build Normal distribution chart.
Normal distribution is a type of continuous probability distribution for a real-valued random variable. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.[1][2] Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable whose distribution converges to a normal distribution as the number of samples increases. (Wikipedia)
This is a clustered column chart having Normal Distribution as value, Mean and Confidence interval in details.
I used GENERATESERIES function to build Normal Distribution table.
For more complex data set you can calculate Pearson Correlation coefficient (R), coefficient of determination (R2), m slope and B intercept, which in a Regression Model can offer predicted values and measure the linear correlation between two variables.
Correlation plot visual is available in Power BI to investigate dependence between multiple variables and to highlight the most correlated variables in a data table.
I described a small part of statistical measures that can be used in Power BI but there are lots of other calculation that can be implemented. A large and complex data set will give you the opportunity:
What do you think?
” I have a dataset with employee responses over a set of 144 questions (Preventing psychological risks at work)”. And how to get that dataset to follow?
Hi Vladimir,
Thank you for your interest in our blog.
This dataset I worked on contains sensitive data and cannot be published because it contains employee responses. The answers are grouped into categories by age, sex, department, experience and seniority in the company.
This article contains analysis ideas that can be implemented on any dataset.
Of course the field of data analysis is very wide and there are many statistical models that can be implemented depending on the type or the size of your data or the answers you are looking for. I hope you’ll find here some ideas.
Regards,
Cornelia