Introducton

Statistics-451 : Applied Statistics for Engineers and Scientist was taught Summer term 2021 at PSU by Subhash Kochar. The problems on this page are from Probability & Statistics for Engineering & Sciences 9th Edition by Jay L. Devore - Duxbury Publisher, and the work mine.

1.10 : Consider the strength data for beams given in Example 1.2.

Construct a stem-and leaf display of the data. What appears to be a representative strength value? Do the observations appear to be highly concentrated about the representative value or rather spread out?

beam <- c(5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0,
         8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7)
stem(beam)

## 
##   The decimal point is at the |
## 
##    5 | 9
##    6 | 33588
##    7 | 00234677889
##    8 | 127
##    9 | 077
##   10 | 7
##   11 | 368

A representative strength value would be 7.7, as more observations are concentrated around this value than any other.

Does the display appear to be reasonably symmetric about a representative value, or would you describe its shape in some other way?

No, I would argue that the data has a slight positive skewed to the right. I say this because the higher frequency values (in blues) seem to be to the left of the representative value (indicated by the red line), which is to the left of the mean (indicated by the orange line).

Do there appear to be any outlying strength values?

Yes there seems to be a small density of values outlying around 11.

What proportion of strength observations in this sample exceed 10 MPa?

## [1] "14.81 %"

An unlikely 14.81% of values exceed 10 MPa.

1.12 : The accompanying specific gravity values for various wood types used in construction appeared in the article “Bolted Connection Design Values Based on European Yield Model” (J. of Structural Engr., 1993: 2169-2186):

wood.g <- c(.31, .35, .36, .36, .37, .38, .40, .40, .40,
            .41, .41, .42, .42, .42, .42, .42, .43, .44,
            .45, .46, .46, .47, .48, .48, .48, .51, .54,
            .54, .55, .58, .62, .66, .66, .67, .68, .75)

Construct a stem-and-leaf display using repeated stems (see the previous exercise), and comment on any interesting features of the display.

## 
##   The decimal point is at the |
## 
##   0 | 344444444444444444
##   0 | 555555555566677778

## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   3 | 156678
##   4 | 0001122222345667888
##   5 | 14458
##   6 | 26678
##   7 | 5

## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   3 | 1
##   3 | 56678
##   4 | 000112222234
##   4 | 5667888
##   5 | 144
##   5 | 58
##   6 | 2
##   6 | 6678
##   7 | 
##   7 | 5

Looking at the three stem-and-leaf displays there are a few interesting features about the gravity for various wood types used in construction data that stands out. Dropping the last digit in the data, the first stem-and-leaf display shows us that there’s an even number of values that are >0.5 and 0.5that appear most dense around 0.4. The second and third stem-and-leaf displays show us that these values are most dense about 0.42, with a small blip around 6.6 and obvious outlier for value 0.75.

1.16 The article cited in Example 1.2 also gave the accompanying strengths observations for cylinders:

cylinders <- c(6.1, 5.8, 7.8, 7.1, 7.2, 9.2, 6.6, 8.3, 7.0, 8.3,
           7.8, 8.1, 7.4, 8.5, 8.9, 9.8, 9.7, 14.1, 12.6, 11.2)

Construct a comparative stem-and-leaf display (see the previous exercise) of the beam and cylinder data and then answer the questions in parts (b)-(d) of Exercise 10 for the observations on cylinders.

## 
##   The decimal point is at the |
## 
##    5 | 8
##    6 | 16
##    7 | 012488
##    8 | 13359
##    9 | 278
##   10 | 
##   11 | 2
##   12 | 6
##   13 | 
##   14 | 1

Does the display appear to be reasonably symmetric about a representative value, or would you describe its shape in some other way?

## cylinders
##  5.8  6.1  6.6    7  7.1  7.2  7.4  8.1  8.5  8.9  9.2  9.7  9.8 11.2 12.6 14.1 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  7.8  8.3 
##    2    2

## [1] 8.575

## [1] 8.2

This data of cylinders also appears to have a slight positive skew, shown by mean (orange line) to the right of the median (red line) and mode (yellow line). However the mean is a lot closer to the median and mode in this than in the previous graph because there appear to bee less extreme observations.

Do there appear to be any outlying strength values?

14.1 appears to be an outlying strength value, as well as 11.2 and 12.6. Without these three values the mean and median are almost identical, shown below in cylinders.clean.

##  [1] 6.1 5.8 7.8 7.1 7.2 9.2 6.6 8.3 7.0 8.3 7.8 8.1 7.4 8.5 8.9 9.8 9.7

## [1] 7.858824

## [1] 7.8

What proportion of strength observations in this sample exceed 10 MPa?

## [1] "15 %"

About 15% of the data observations exceed 10 Mpa.

In what ways are the two sides of the display similar? Are there any obvious differences between the beam observations and the cylinder observations?

Both displays show relatively normal distributions with slight skews to the right. Similarly both the beam and cylinders have approximately 15% of the data outside the normal distribution. An obvious difference between the two is that the beam data was more skewed than the cylinder. A reason for this may be that the range of values for the beam is smaller than the range of values for cylinders, so the beam mean is more sensitive to outlying data.

1.34 : Exposure to microbial products, especially endotoxin, may have an impact on vulnerability to allergic diseases. The article “Dust Sampling Methods for Endotoxin – An Essential, But Underestimated Issue” (Indoor Air, 2006: 20-27) considered various issues associated with determining endotoxin concentration. The following data on concentration (EU/mg) in settled dust for one sample of urban homes and another of farm homes was kindly supplied by the authors of the cited article.

urban <- c(6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0)
farm <- c(4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0, 0.3 )

Determine the sample mean for each sample. How do they compare?

mean(urban)

## [1] 21.54545

mean(farm)

## [1] 8.56

The mean endotoxin concentration is greater in urban homes than farm homes.

Determine the sample median for each sample. How do they compare? Why is the median for the urban sample so different from the mean for that sample?

median(urban)

## [1] 17

median(farm)

## [1] 8.9

The median endotoxin concentration is greater in urban homes than farm homes.

The urban sample’s median is so different from the mean, because the mean is more sensitive to outliers in the data (such as 80 EU/mg) than the median is.

Calculate the trimmed mean for each sample by deleting the smallest and largest observation. What are the corresponding trimming percentages? How do the values of these trimmed means compare to the corresponding means and medians?

To start I will sort the data to see what are the largest and smallest observations for both.

sort(urban)

##  [1]  4  5  5  6 11 17 18 23 33 35 80

sort(farm)

##  [1]  0.3  2.0  3.0  4.0  4.0  5.0  8.0  8.9  9.0  9.0  9.2 11.0 14.0 20.0 21.0

From the sorted data I can see the lowest and highest values for urban are c(4,80), and for farm they are c(0.3,21). These values are removed from the urban.trim and farm.trim shown below.

urban.trim <- c(6.0, 5.0, 11.0, 33.0, 5.0, 18.0, 35.0, 17.0, 23.0)
farm.trim <- c(4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 9.2, 3.0, 2.0)

To find the corresponding trimming percentages I subtracted the sum of the subtracted values from the sums of the urban and farm data sets respectively, and then divide each by the length of each data set respectively.

(sum(urban)-84)/length(urban)

## [1] 13.90909

(sum(farm)-21.3)/length(farm)

## [1] 7.14

The two values removed from the urban dataset is about 14% of the data, slightly more then the 7.14% variation of the farm data set.

Now to look at the trimmed means.

mean(urban.trim)

## [1] 17

mean(farm.trim)

## [1] 8.238462

The trimmed mean of the urban data is closer to the median of the urban data, whereas the trimmed mean for the farm data is farther away from the mean and median of the untrimmed data. After this analysis it would seem that trimming the first data set may be appropriate, whereas trimming the second may lead to misleading data. Looking at the stem-and-leaf displays are helpful in visualizing distributions and outliers.

stem(urban, scale = 2)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   0 | 4556
##   1 | 178
##   2 | 3
##   3 | 35
##   4 | 
##   5 | 
##   6 | 
##   7 | 
##   8 | 0

stem(farm, scale = .5)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   0 | 02344589999
##   1 | 14
##   2 | 01

1.48 : Exercise 34 presented the following data on endotoxin concentration in settled dust both for a sample of urban homes and for a sample of farm homes:

urban

##  [1]  6  5 11 33  4  5 80 18 35 17 23

farm

##  [1]  4.0 14.0 11.0  9.0  9.0  8.0  4.0 20.0  5.0  8.9 21.0  9.2  3.0  2.0  0.3

Determine the value of the sample standard deviation for each sample, interpret these values, and then contrast variability in the two samples.

sd(urban)

## [1] 22.29961

sd(farm)

## [1] 6.087669

Another way we could find the standard deviation given the hint and the following equation given in class :

\[s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i - x)^2\] \[=\frac{1}{n-1}[\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}]\]

# hint
sigma_x_i.u <- 237.0
sigma_x_i.f <- 128.4
sigma_x_i_2.u <- 10079
sigma_x_i_2.f <- 1617.94
n.u <- length(urban)
n.f <- length(farm)
constant.u <- 1/(n.u-1)
constant.f <- 1/(n.f-1)
sv.u <- constant.u*(sigma_x_i_2.u - (sigma_x_i.u)^2/n.u)
sv.f <- constant.f*(sigma_x_i_2.f - (sigma_x_i.f)^2/n.f)
sqrt(sv.u)

## [1] 22.29961

sqrt(sv.f)

## [1] 6.087669

Compute the fourth spread for each sample and compare. Do the fourth spreads convey the same message about variability that the standard deviations do? Explain.

The quickest way to do this in r is with quantile().

quantile(urban)

##   0%  25%  50%  75% 100% 
##  4.0  5.5 17.0 28.0 80.0

quantile(farm)

##   0%  25%  50%  75% 100% 
##  0.3  4.0  8.9 10.1 21.0

Explanation : coming soon

Another way to find the forth spread is by first computing the upper fourth and lower fourth . To do that I first sorted the data, and then split it in half. Notice that because n for both data sets is odd, I figure out the middle value that is included in both, before I create the new data sets.

# sort 
sorted.u <- sort(urban)
sorted.f <- sort(farm)
sorted.u

##  [1]  4  5  5  6 11 17 18 23 33 35 80

sorted.f

##  [1]  0.3  2.0  3.0  4.0  4.0  5.0  8.0  8.9  9.0  9.0  9.2 11.0 14.0 20.0 21.0

# find included value 
include.u <- as.integer(n.u/2)+1
include.f <- as.integer(n.f/2)+1
sorted.u[include.u]

## [1] 17

sorted.f[include.f]

## [1] 8.9

# upper and lower forth 
urban.lower <- c(4,  5,  5,  6, 11, 17)
urban.upper <- c(17, 18, 23, 33, 35, 80)
farm.lower <- c(0.3,  2.0,  3.0,  4.0,  4.0,  5.0,  8.0,  8.9)
farm.upper <- c(8.9,  9.0,  9.0,  9.2, 11.0, 14.0, 20.0, 21.0)

Now that the data is sorted, the middle values of 17 and 8.9 have been found, and the data has been split into two chunks I can compute the minimum,lower forth median, median, upper forth median, and the largest value.

min(urban)

## [1] 4

median(urban.lower)

## [1] 5.5

median(urban)

## [1] 17

median(urban.upper)

## [1] 28

max(urban)

## [1] 80

min(farm)

## [1] 0.3

median(farm.lower)

## [1] 4

median(farm)

## [1] 8.9

median(farm.upper)

## [1] 10.1

max(farm)

## [1] 21

The authors of the cited article also provided endotoxin concentrations in dust bag dust:

urban.bag <- c(34.0, 49.0, 13.0, 33.0, 24.0, 24.0, 35.0, 104.0, 34.0, 40.0, 38.0, 1.0)
farm.bag <- c(2.0, 64.0, 6.0, 17.0, 35.0, 11.0, 17.0, 13.0, 5.0, 27.0, 23.0,
              28.0, 10.0, 13.0, 0.2)
quantile(urban.bag)

##    0%   25%   50%   75%  100% 
##   1.0  24.0  34.0  38.5 104.0

quantile(farm.bag)

##   0%  25%  50%  75% 100% 
##  0.2  8.0 13.0 25.0 64.0

Construct a comparative boxplot (as did the cited paper) and compare and contrast the four samples.

par(mfrow = c(1,2))
boxplot(urban)
boxplot(urban.bag)

par(mfrow = c(1,2))
boxplot(farm)
boxplot(farm.bag)

1.51 : The article “A Thin-Film Oxygen Uptake Test for the Evaluation of Automotive Crankcase Lubricants” (Lubric. Engr., 1984: 75-83) reported the following data on oxidation-induction time (min) for various commercial oils:

oxi.induct.time.min <- c(87, 103, 130, 160, 180, 195, 132, 145, 211, 105, 145,
                         153, 152, 138, 87, 99, 93, 119, 129)

Calculate the sample variance and the standard deviation.

oxi.var <- var(oxi.induct.time.min)
oxi.sd <- sd(oxi.induct.time.min)
oxi.var

## [1] 1264.766

oxi.sd

## [1] 35.56355

If the observations were re-expressed in hours, what would be the resulting values of the sample variance and sample standard deviation? Answer without actually performing the re-expression.

The standard deviation has the same units as the data values (minutes) so in hours the standard deviation would be 35.56/60 (or a little over half an hour) whereas the variance is the standard deviation squared, so the values would be converted 1264.766/60^2.

oxi.var/60^2

## [1] 0.3513239

oxi.sd/60

## [1] 0.5927258

# verification 
oxi.induct.time.hour <- oxi.induct.time.min/60
var(oxi.induct.time.hour)

## [1] 0.3513239

sd(oxi.induct.time.hour)

## [1] 0.5927258

Use R (or any other software) to analyze data of Exercise 60, Chapter1. Also, compare the means of the standard deviations for the two groups.

1.60 : Observations on burst strength (lb/in2) were obtained both for test nozzle welds (“Proper Procedures Are the Key to Welding Radioactive Waste Canisters,” Welding J., Auud. 1997: 61-67)

Test <-c(7200, 6100, 7300, 7300, 8000, 7400,
         7300, 7300, 8000, 6700, 8300)         
Cannister <- c(5250, 5625, 5900, 5900, 5700, 6050,
               5800, 6000, 5875, 6100, 5850, 6600)

Construct a comparative boxplot and comment on interesting features (the cited article did not include such a picture, but the authors commented that they had looked at one).

par(mfrow = c(1,2))
boxplot(Test, ylim = c(5000, 8500), main = "Test")
boxplot(Cannister, ylim = c(5000, 8500), main = "Cannister")

mean(Test)-mean(Cannister)

## [1] 1467.045

mean(Test)

## [1] 7354.545

mean(Cannister)

## [1] 5887.5

sd(Test)

## [1] 613.7811

sd(Cannister)

## [1] 317.9301

STAT-451 : Homework 1