Statistics-451: Applied Statistics for Engineers and Scientists was taught Summer term 2021 at PSU by Subhash Kochar. The problems on this page are from Probability & Statistics for Engineering & the Sciences, 9th Edition, by Jay L. Devore - Duxbury Publisher, and the work is mine.
beam <- c(5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7)
stem(beam)
##
## The decimal point is at the |
##
## 5 | 9
## 6 | 33588
## 7 | 00234677889
## 8 | 127
## 9 | 077
## 10 | 7
## 11 | 368
A representative strength value would be 7.7, as more observations are concentrated around this value than any other.
No, I would argue that the data has a slight positive skew (a longer tail to the right). I say this because the higher-frequency values (in blue) seem to be to the left of the representative value (indicated by the red line), which is to the left of the mean (indicated by the orange line).
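The plot with the colored lines is not reproduced on this page. A minimal sketch of how such a plot could be drawn, assuming the representative value is the sample median and that the colors follow the description above (the titles and plotting choices are illustrative assumptions, not necessarily the original code):

```r
beam <- c(5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
          8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7)

beam.median <- median(beam)  # representative value
beam.mean <- mean(beam)

# histogram with the median (red) and mean (orange) marked
hist(beam, col = "lightblue", main = "Beam strength", xlab = "Strength (MPa)")
abline(v = beam.median, col = "red", lwd = 2)
abline(v = beam.mean, col = "orange", lwd = 2)

# a mean above the median is consistent with a positive skew
beam.mean > beam.median
```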
Yes, there seems to be a small cluster of outlying values around 11.
## [1] "14.81 %"
About 14.81% of the values (4 of 27) exceed 10 MPa.
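The code behind this percentage is not shown. One way it could be computed, as a sketch (the variable name pct.over.10 is hypothetical):

```r
beam <- c(5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
          8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7)

# mean() of a logical vector gives the proportion of TRUE values
pct.over.10 <- 100 * mean(beam > 10)
paste(round(pct.over.10, 2), "%")
```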
wood.g <- c(.31, .35, .36, .36, .37, .38, .40, .40, .40,
.41, .41, .42, .42, .42, .42, .42, .43, .44,
.45, .46, .46, .47, .48, .48, .48, .51, .54,
.54, .55, .58, .62, .66, .66, .67, .68, .75)
##
## The decimal point is at the |
##
## 0 | 344444444444444444
## 0 | 555555555566677778
##
## The decimal point is 1 digit(s) to the left of the |
##
## 3 | 156678
## 4 | 0001122222345667888
## 5 | 14458
## 6 | 26678
## 7 | 5
##
## The decimal point is 1 digit(s) to the left of the |
##
## 3 | 1
## 3 | 56678
## 4 | 000112222234
## 4 | 5667888
## 5 | 144
## 5 | 58
## 6 | 2
## 6 | 6678
## 7 |
## 7 | 5
Looking at the three stem-and-leaf displays, a few interesting features of the gravity data for various wood types used in construction stand out. With the values rounded to one decimal place, the first stem-and-leaf display shows an equal number of values (18 each) on the lower row (below about 0.5) and the upper row (about 0.5 and above), with the values most dense around 0.4. The second and third stem-and-leaf displays show that the values are most dense about 0.42, with a small blip around 0.66 and an obvious outlier at 0.75.
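The calls that produced the three displays are not shown. Displays at different levels of detail can be requested through the scale argument of stem(); the sketch below assumes values of 0.5, 1 (the default), and 2, which may not reproduce the displays above exactly:

```r
wood.g <- c(.31, .35, .36, .36, .37, .38, .40, .40, .40,
            .41, .41, .42, .42, .42, .42, .42, .43, .44,
            .45, .46, .46, .47, .48, .48, .48, .51, .54,
            .54, .55, .58, .62, .66, .66, .67, .68, .75)

stem(wood.g, scale = 0.5)  # coarser display, fewer stems
stem(wood.g)               # default scale
stem(wood.g, scale = 2)    # finer display, stems split in two
```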
cylinders <- c(6.1, 5.8, 7.8, 7.1, 7.2, 9.2, 6.6, 8.3, 7.0, 8.3,
7.8, 8.1, 7.4, 8.5, 8.9, 9.8, 9.7, 14.1, 12.6, 11.2)
##
## The decimal point is at the |
##
## 5 | 8
## 6 | 16
## 7 | 012488
## 8 | 13359
## 9 | 278
## 10 |
## 11 | 2
## 12 | 6
## 13 |
## 14 | 1
## cylinders
## 5.8 6.1 6.6 7 7.1 7.2 7.4 8.1 8.5 8.9 9.2 9.7 9.8 11.2 12.6 14.1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 7.8 8.3
## 2 2
## [1] 8.575
## [1] 8.2
The cylinder data also appears to have a slight positive skew, shown by the mean (orange line) lying to the right of the median (red line) and mode (yellow line). However, the mean is a lot closer to the median and mode here than in the previous graph because there appear to be fewer extreme observations.
14.1 appears to be an outlying strength value, as well as 11.2 and 12.6. Without these three values the mean and median are almost identical, shown below in cylinders.clean.
## [1] 6.1 5.8 7.8 7.1 7.2 9.2 6.6 8.3 7.0 8.3 7.8 8.1 7.4 8.5 8.9 9.8 9.7
## [1] 7.858824
## [1] 7.8
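The output above is consistent with dropping the three largest observations by a simple filter. A sketch, assuming a cutoff of 11 (the cutoff is an assumption chosen to exclude 11.2, 12.6, and 14.1):

```r
cylinders <- c(6.1, 5.8, 7.8, 7.1, 7.2, 9.2, 6.6, 8.3, 7.0, 8.3,
               7.8, 8.1, 7.4, 8.5, 8.9, 9.8, 9.7, 14.1, 12.6, 11.2)

# keep only the observations below the assumed cutoff
cylinders.clean <- cylinders[cylinders < 11]
cylinders.clean
mean(cylinders.clean)
median(cylinders.clean)
```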
## [1] "15 %"
15% of the observations (3 of 20) exceed 10 MPa.
Both displays show roughly bell-shaped distributions with slight skews to the right. In both the beam and cylinder data, approximately 15% of the observations exceed 10 MPa and sit apart from the main body of the data. An obvious difference between the two is that the beam data is more skewed than the cylinder data. A reason for this may be that the range of values for the beams is smaller than the range of values for the cylinders, so the beam mean is more sensitive to outlying data.
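These comparisons can be checked numerically. A rough sketch, using the range of each sample and the gap between mean and median as a crude indicator of skew (the indicator is chosen here for illustration and is not a formal skewness statistic):

```r
beam <- c(5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
          8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7)
cylinders <- c(6.1, 5.8, 7.8, 7.1, 7.2, 9.2, 6.6, 8.3, 7.0, 8.3,
               7.8, 8.1, 7.4, 8.5, 8.9, 9.8, 9.7, 14.1, 12.6, 11.2)

# range of each sample (max minus min)
diff(range(beam))
diff(range(cylinders))

# mean minus median: a larger positive gap suggests a stronger right skew
beam.gap <- mean(beam) - median(beam)
cyl.gap <- mean(cylinders) - median(cylinders)
c(beam = beam.gap, cylinders = cyl.gap)
```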
urban
## [1] 6 5 11 33 4 5 80 18 35 17 23
farm
## [1] 4.0 14.0 11.0 9.0 9.0 8.0 4.0 20.0 5.0 8.9 21.0 9.2 3.0 2.0 0.3
Determine the value of the sample standard deviation for each sample, interpret these values, and then contrast variability in the two samples.
sd(urban)
## [1] 22.29961
sd(farm)
## [1] 6.087669
Another way to find the standard deviation uses the hint together with the following equation given in class:
\[s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\] \[=\frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n}\right]\]
# hint: the sum of x_i and the sum of x_i^2 for each sample (.u = urban, .f = farm)
sigma_x_i.u <- 237.0
sigma_x_i.f <- 128.4
sigma_x_i_2.u <- 10079
sigma_x_i_2.f <- 1617.94
n.u <- length(urban)
n.f <- length(farm)
constant.u <- 1/(n.u-1)
constant.f <- 1/(n.f-1)
sv.u <- constant.u*(sigma_x_i_2.u - (sigma_x_i.u)^2/n.u)
sv.f <- constant.f*(sigma_x_i_2.f - (sigma_x_i.f)^2/n.f)
sqrt(sv.u)
## [1] 22.29961
sqrt(sv.f)
## [1] 6.087669
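The same shortcut formula can be applied without the hint by computing the two sums from the raw data. A minimal sketch (the helper name shortcut_sd is hypothetical):

```r
# sample standard deviation via the computational (shortcut) formula
shortcut_sd <- function(x) {
  n <- length(x)
  sum.x <- sum(x)     # sum of x_i
  sum.x2 <- sum(x^2)  # sum of x_i^2
  sqrt((sum.x2 - sum.x^2 / n) / (n - 1))
}

urban <- c(6, 5, 11, 33, 4, 5, 80, 18, 35, 17, 23)
farm <- c(4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0,
          9.2, 3.0, 2.0, 0.3)

shortcut_sd(urban)
shortcut_sd(farm)
```

The shortcut form is convenient by hand, but it subtracts two large, nearly equal numbers, so for data with a large mean and small spread sd() is the safer choice numerically.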
The quickest way to do this in R is with quantile().
quantile(urban)
## 0% 25% 50% 75% 100%
## 4.0 5.5 17.0 28.0 80.0
quantile(farm)
## 0% 25% 50% 75% 100%
## 0.3 4.0 8.9 10.1 21.0
Explanation : coming soon
Another way to find the fourth spread is by first computing the upper fourth and lower fourth. To do that I first sorted the data and then split it in half. Notice that because n for both data sets is odd, I find the middle value, which is included in both halves, before I create the new data sets.
# sort
sorted.u <- sort(urban)
sorted.f <- sort(farm)
sorted.u
## [1] 4 5 5 6 11 17 18 23 33 35 80
sorted.f
## [1] 0.3 2.0 3.0 4.0 4.0 5.0 8.0 8.9 9.0 9.0 9.2 11.0 14.0 20.0 21.0
# find included value
include.u <- as.integer(n.u/2)+1
include.f <- as.integer(n.f/2)+1
sorted.u[include.u]
## [1] 17
sorted.f[include.f]
## [1] 8.9
# upper and lower fourth (the middle value is included in both halves)
urban.lower <- sorted.u[1:include.u]
urban.upper <- sorted.u[include.u:n.u]
farm.lower <- sorted.f[1:include.f]
farm.upper <- sorted.f[include.f:n.f]
Now that the data is sorted, the middle values of 17 and 8.9 have been found, and the data has been split into two halves, I can compute the minimum, the lower fourth, the median, the upper fourth, and the maximum.
min(urban)
## [1] 4
median(urban.lower)
## [1] 5.5
median(urban)
## [1] 17
median(urban.upper)
## [1] 28
max(urban)
## [1] 80
min(farm)
## [1] 0.3
median(farm.lower)
## [1] 4
median(farm)
## [1] 8.9
median(farm.upper)
## [1] 10.1
max(farm)
## [1] 21
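The manual steps above can be cross-checked with fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum. The hinges are computed as medians of the two halves with the middle value included in both when n is odd, so they should agree with the fourths found by hand. A sketch, with the fourth spread taken as the upper fourth minus the lower fourth (the names fs.u and fs.f are hypothetical):

```r
urban <- c(6, 5, 11, 33, 4, 5, 80, 18, 35, 17, 23)
farm <- c(4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0,
          9.2, 3.0, 2.0, 0.3)

fivenum(urban)
fivenum(farm)

# fourth spread: upper fourth minus lower fourth
fs.u <- diff(fivenum(urban)[c(2, 4)])
fs.f <- diff(fivenum(farm)[c(2, 4)])
c(urban = fs.u, farm = fs.f)
```

For these two samples quantile() happens to give the same values, but its default method interpolates differently and can disagree with the fourths for other sample sizes.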
urban.bag <- c(34.0, 49.0, 13.0, 33.0, 24.0, 24.0, 35.0, 104.0, 34.0, 40.0, 38.0, 1.0)
farm.bag <- c(2.0, 64.0, 6.0, 17.0, 35.0, 11.0, 17.0, 13.0, 5.0, 27.0, 23.0,
28.0, 10.0, 13.0, 0.2)
quantile(urban.bag)
## 0% 25% 50% 75% 100%
## 1.0 24.0 34.0 38.5 104.0
quantile(farm.bag)
## 0% 25% 50% 75% 100%
## 0.2 8.0 13.0 25.0 64.0
Construct a comparative boxplot (as did the cited paper) and compare and contrast the four samples.
par(mfrow = c(1,2))
boxplot(urban)
boxplot(urban.bag)
par(mfrow = c(1,2))
boxplot(farm)
boxplot(farm.bag)
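The panels above each get their own y-axis, which makes the four samples hard to compare directly. A sketch of a single comparative boxplot on a shared axis, passing the samples as a named list (the title and axis label are placeholders, since the units are not stated on this page):

```r
urban <- c(6, 5, 11, 33, 4, 5, 80, 18, 35, 17, 23)
farm <- c(4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0,
          9.2, 3.0, 2.0, 0.3)
urban.bag <- c(34.0, 49.0, 13.0, 33.0, 24.0, 24.0, 35.0, 104.0, 34.0, 40.0,
               38.0, 1.0)
farm.bag <- c(2.0, 64.0, 6.0, 17.0, 35.0, 11.0, 17.0, 13.0, 5.0, 27.0, 23.0,
              28.0, 10.0, 13.0, 0.2)

# a named list lets boxplot() draw all samples side by side on one axis
samples <- list(urban = urban, farm = farm,
                urban.bag = urban.bag, farm.bag = farm.bag)

par(mfrow = c(1, 1))
bp <- boxplot(samples, main = "Comparative boxplot of the four samples",
              ylab = "Value")

# bp$stats holds the five plotted summary values for each sample, one column each
bp$stats
```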
oxi.induct.time.min <- c(87, 103, 130, 160, 180, 195, 132, 145, 211, 105, 145,
153, 152, 138, 87, 99, 93, 119, 129)
oxi.var <- var(oxi.induct.time.min)
oxi.sd <- sd(oxi.induct.time.min)
oxi.var
## [1] 1264.766
oxi.sd
## [1] 35.56355
The standard deviation has the same units as the data values (minutes), so in hours the standard deviation would be 35.56/60 (a little over half an hour). The variance is the standard deviation squared, so its units are minutes squared and it is converted by dividing by 60^2, giving 1264.766/60^2.
oxi.var/60^2
## [1] 0.3513239
oxi.sd/60
## [1] 0.5927258
# verification
oxi.induct.time.hour <- oxi.induct.time.min/60
var(oxi.induct.time.hour)
## [1] 0.3513239
sd(oxi.induct.time.hour)
## [1] 0.5927258
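The agreement follows from how the sample variance responds to rescaling. A short derivation, writing \(y_i = c\,x_i\) with \(c = 1/60\) (the symbols \(y_i\) and \(c\) are introduced here for illustration only):

\[\bar{y}=\frac{1}{n}\sum_{i=1}^{n}c\,x_i=c\,\bar{x}\]
\[s_y^2=\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2=\frac{1}{n-1}\sum_{i=1}^{n}(c\,x_i-c\,\bar{x})^2=c^2 s_x^2\]
\[s_y=\sqrt{c^2 s_x^2}=|c|\,s_x\]

So the variance is divided by 60^2 and the standard deviation by 60, matching the verification above.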
Test <- c(7200, 6100, 7300, 7300, 8000, 7400,
7300, 7300, 8000, 6700, 8300)
Cannister <- c(5250, 5625, 5900, 5900, 5700, 6050,
5800, 6000, 5875, 6100, 5850, 6600)
Construct a comparative boxplot and comment on interesting features (the cited article did not include such a picture, but the authors commented that they had looked at one).
par(mfrow = c(1,2))
boxplot(Test, ylim = c(5000, 8500), main = "Test")
boxplot(Cannister, ylim = c(5000, 8500), main = "Cannister")
mean(Test)-mean(Cannister)
## [1] 1467.045
mean(Test)
## [1] 7354.545
mean(Cannister)
## [1] 5887.5
sd(Test)
## [1] 613.7811
sd(Cannister)
## [1] 317.9301