Statistical Analysis in Power BI

BI

“Statistical analysis is a component of data analytics. In the context of business intelligence (BI), statistical analysis involves collecting and scrutinizing every data sample in a set of items from which samples can be drawn. A sample, in statistics, is a representative selection drawn from a total population.” (https://whatis.techtarget.com/definition/statistical-analysis)

Below I will describe how statistical analysis can be implemented and used in Power BI. Even this field is very wide, I will choose to describe only  few measures that were most suitable in my case. I have a dataset with employee responses over a set of 144 questions (Preventing psychological risks at work) and I need to find how their answers depend on age, department, experience…

There are few Power BI visuals already available to get some answers in seconds:

  • Pie chart, Bar chart – used to get percentages by category. These are easy to use and will give you a good overall view and are primary components – descriptive statistics easy to understand and visualize raw data.
pie

  • Line & clustered column chart with Standard Deviation

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values.( Wikipedia)

This visual show standard deviation for our dataset: Question category on X axis and Response value on Y axis. Line value represent standard deviation of Values and Age and for the calculation I used STDEV.P function which return a number representing the standard deviation of the entire population.

_StandardDeviationVarsta = CALCULATE(STDEV.P(query[_ValueNo]),ALLSELECTED(query[Varsta]))

There are some basic statistic calculation like mean, margin of error, confidence interval, correlation we can use to get insights, estimate computed of the observed data and also to run different scenarios that can help guide future actions.

_Mean ABCD = SUM(query[_VarstaNo])/COUNT(query[_Tema])

  • Margin of error – is a statistic expressing the amount of random sampling error in the results of a survey. The larger the margin of error, the less confidence one should have that a poll result would reflect the result of a survey of the entire population
  • Confidence interval – is a type of estimate computed from the statistics of the observed data. This proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level that the true parameter is in the proposed range.

In Power BI you can use CONFIDENCE.NORM function – Returns the confidence interval for a population mean, using a normal distribution.

CONFIDENCE.NORM(alpha,standard_dev,size)

_ConfidenceInterval = CONFIDENCE.NORM(0.05,[_StandardDeviation],144)

Confidence interval = Mean -/+ Margin of Error

The best way to see all these measures is to build Normal distribution chart.

  • Normal distribution

Normal distribution is a type of continuous probability distribution for a real-valued random variable. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.[1][2] Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable whose distribution converges to a normal distribution as the number of samples increases. (Wikipedia)

This is a clustered column chart having Normal Distribution as value, Mean and Confidence interval in details.

I used GENERATESERIES function to build Normal Distribution table.

NormalDistribution = GENERATESERIES( [_Mean ABCD]- [_Confidence_min], [_Mean ABCD]+[_Confidence_max], 0.01)

  • Pearson correlation coefficient

For more complex data set you can calculate Pearson Correlation coefficient (R), coefficient of determination (R2), m slope and B intercept, which in a Regression Model can offer predicted values and measure the linear correlation between two variables.

_Predicted responses Age = [b (intercept) age] + [m (slope) age]*’Age Group No'[Age Group No Value]

Correlation plot visual is available in Power BI to investigate dependence between multiple variables and to highlight the most correlated variables in a data table.

I described a small part of statistical measures that can be used in Power BI but there are lots of other calculation that can be implemented. A large and complex data set will give you the opportunity:

  • to describe the nature of the data to be analyzed,
  • to explore the correlation between variables,
  • to create a good model to understand how the data relates to the underlying population,
  • to predict and anticipate future trends.

What do you think?

  • ” I have a dataset with employee responses over a set of 144 questions (Preventing psychological risks at work)”. And how to get that dataset to follow?

  • Hi Vladimir,
    Thank you for your interest in our blog.
    This dataset I worked on contains sensitive data and cannot be published because it contains employee responses. The answers are grouped into categories by age, sex, department, experience and seniority in the company.

    This article contains analysis ideas that can be implemented on any dataset.
    Of course the field of data analysis is very wide and there are many statistical models that can be implemented depending on the type or the size of your data or the answers you are looking for. I hope you’ll find here some ideas.

    Regards,
    Cornelia