*“Statistical analysis** is a
component of data analytics. In the context of business intelligence
(BI), statistical analysis involves collecting and
scrutinizing every data sample in a set of items from which samples can be
drawn. A sample, in statistics, is a representative selection drawn
from a total population.” (*

Below I will describe how statistical analysis can be implemented and used in Power BI. Even this field is very wide, I will choose to describe only few measures that were most suitable in my case. I have a dataset with employee responses over a set of 144 questions (Preventing psychological risks at work) and I need to find how their answers depend on age, department, experience…

There are few Power BI visuals already available to get some answers in seconds:

**Pie chart, Bar chart**– used to get percentages by category. These are easy to use and will give you a good overall view and are primary components – descriptive statistics easy to understand and visualize raw data.

**Natural Language**– I have used Arria NLG (https://samples.studio.arria.com/v2-powerbi/) to generate narrative insights that support decision-making.

**Line & clustered column chart with Standard Deviation**

*In **statistics**, the standard
deviation is a measure of the amount of variation or *

This visual show standard deviation for our dataset: *Question
category* on X axis and *Response value *on Y axis. Line value
represent standard deviation of Values and Age and for the calculation I used STDEV.P function which
return a number representing the standard deviation of the entire population.

_StandardDeviationVarsta = CALCULATE(STDEV.P(query[_ValueNo]),ALLSELECTED(query[Varsta]))

There are some
basic statistic calculation like **mean, margin of error, confidence interval,
correlation** we can use to get insights, estimate computed of the observed
data and also to run different scenarios that can help guide future actions.

**Mean**–*For a**data set**, the**arithmetic mean**, also called the**mathematical expectation**or**average**, is the central value of a discrete set of numbers (Wikipedia)*

_Mean ABCD = SUM(query[_VarstaNo])/COUNT(query[_Tema])

**Margin of error**– is a statistic expressing the amount of random sampling error in the results of a survey. The larger the margin of error, the less confidence one should have that a poll result would reflect the result of a survey of the entire population.

**Confidence interval –**is a type of estimate computed from the statistics of the observed data. This proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level that the true parameter is in the proposed range.

In Power BI you can use CONFIDENCE.NORM function – Returns the confidence interval for a population mean, using a normal distribution.

CONFIDENCE.NORM(alpha,standard_dev,size)

_ConfidenceInterval = CONFIDENCE.NORM(0.05,[_StandardDeviation],144)

Confidence interval = Mean -/+ Margin of Error

The best way to see all these measures is to build Normal distribution chart.

**Normal distribution**

*Normal distribution **is a type of **continuous probability distribution** for a **real-valued** **random variable**.** **Normal distributions are important in **statistics** and are often used in the **natural** and **social sciences** to represent real-valued **random variables** whose distributions are not known.*^{[1]}^{[2]}* Their importance is partly due to
the **central
limit theorem**. It
states that, under some conditions, the average of many samples (observations)
of a **random variable** with finite **mean** and **variance** is itself a random variable whose
distribution **converges** to a normal distribution as the number of samples
increases*. (Wikipedia)

This is a clustered column chart having Normal Distribution as value, Mean and Confidence interval in details.

I used GENERATESERIES function to build Normal Distribution table.

NormalDistribution =
GENERATESERIES( [_Mean ABCD]- [_Confidence_min], [_Mean
ABCD]+[_Confidence_max], 0.01)

**Pearson correlation coefficient**

For more complex data set you can
calculate Pearson Correlation coefficient (R), coefficient of determination (*R*^{2}),
m
slope and B intercept,
which in a Regression Model can offer predicted values and measure the linear
correlation between two variables.

_Predicted responses Age =
[b (intercept) age] + [m (slope) age]*’Age Group No'[Age Group No Value]

Correlation plot visual is available in Power BI to investigate dependence between multiple variables and to highlight the most correlated variables in a data table.

I described a small part of statistical measures that can be used in Power BI but there are lots of other calculation that can be implemented. A large and complex data set will give you the opportunity:

- to describe the nature of the data to be analyzed,
- to explore the correlation between variables,
- to create a good model to understand how the data relates to the underlying population,
- to predict and anticipate future trends.

## What do you think?

” I have a dataset with employee responses over a set of 144 questions (Preventing psychological risks at work)”. And how to get that dataset to follow?

Hi Vladimir,

Thank you for your interest in our blog.

This dataset I worked on contains sensitive data and cannot be published because it contains employee responses. The answers are grouped into categories by age, sex, department, experience and seniority in the company.

This article contains analysis ideas that can be implemented on any dataset.

Of course the field of data analysis is very wide and there are many statistical models that can be implemented depending on the type or the size of your data or the answers you are looking for. I hope you’ll find here some ideas.

Regards,

Cornelia