Chapter 1: Introduction - University of Kent


Chapter 1: Introduction

1.1 What is Statistics?

Statistics involves collecting, analysing, presenting and interpreting
data. We frequently see statistical tools (such as bar charts, tables, plots of
data, averages and percentages) on TV, in newspapers and in magazines. Such
methods used to organise and summarise data, so as to increase the
understanding of the data, are called descriptive statistics.

Statistics is also used in practice in many different walks of life, going
beyond simple data summarisation to answer a wide variety of questions such
as:
. Medicine: Does a certain new drug prolong life for AIDS sufferers?
. Science: Is global warming really happening?
. Education: Are GCSE and A level examination standards declining?
. Psychology: Is the national lottery making us a nation of compulsive
gamblers?
. Sociology: Is the gap between rich and poor widening in Britain?
. Business: Do Persil adverts really make us want to buy Persil?
. Finance: What will interest rates be in 6 months' time?

1.2 Populations and Samples

Suppose that we wanted to investigate whether smoking during pregnancy
leads to lower birth weight of babies. We use this example to illustrate
the following definitions.

Definitions:
. Experimental unit: the object on which measurements are made.
For the above example, we are measuring birth weights of newborn babies, so
a unit is a newborn baby.
. Variable: a measurable characteristic of a unit.
For the above example, the variable is birth weight.
. Population: the set of all units about which information is required.
For the above example, the population is all newborn babies.
. Sample: a subset of units of the population for which we can observe the
variable of interest.
For the above example, a sample would be the observed birth weights for a
set of newborn babies (which will be a subset of all newborn babies).
. Random sample: a sample such that each unit in the population has the
same chance of being chosen, independently of whether or not any other
unit is chosen.

To determine whether smoking during pregnancy leads to lower birth weight
of babies, we would compare a random sample of weights of newborn babies
whose mothers smoke, with a random sample of weights of newborn babies of
non-smoking mothers. By analysing the sample data, we would hope to be able
to draw conclusions about the effects on birth weight of smoking during
pregnancy for all babies (i.e. the population). The process of using a
random sample to draw conclusions about a population is called statistical
inference.

If we do not have a random sample, then sampling bias can invalidate our
statistical results. For example, birth weights of twins are generally
lower than the weights of babies born alone. So if all the non-smoking
mothers in the sample were giving birth to twins, whereas all the smoking
mothers were giving birth to single babies, then the conclusions we draw
about the effects of smoking in pregnancy will not necessarily be correct
as they are affected by sampling bias.

Different units of the same population will have different values of the
same variable; this is called natural variation. For example, the weights
of all newborn babies are obviously not the same. So different samples
will contain different data; this is called sampling variability. It is
therefore important to bear in mind that slightly different conclusions
could be reached from different samples.

1.3 Types of Data

Different types of data require different types of analysis. The type of
data set is determined by several factors:
. Type of variable:
> quantitative data - i.e. numerical (e.g., heights of students,
number of phone calls in an hour).
> qualitative data - i.e. non-numerical (for example, eye colour, M/F).
Quantitative data can be subdivided further:
> discrete - a discrete variable can take only particular values (e.g.,
number of phone calls received at an exchange).
> continuous - a continuous variable can take any value in a given range
(e.g., heights of students).
. Number of variables measured:
> 1 variable - univariate data.
> 2 variables - bivariate data. E.g., we may have both the heights and
weights of a set of individuals. The data set then consists of pairs
of observations on each unit such as (1.7m, 65kg).
> 3 or more variables - multivariate data. E.g., we have heights,
weights, eye colour, gender for a group of individuals. In this case
the data set consists of sets of 4 observations made on each unit such
as (1.7m, 65kg, blue, M).
. Number of samples: For example, when investigating the effects of smoking
during pregnancy, we would observe two samples:
> a sample of birth weights of babies born to smoking mothers
> a sample of birth weights of babies born to non-smoking mothers.
. Relationship of samples (if more than 1 sample):
> Are the samples independent? E.g., the two birth weight samples should
be independent.
> Are the samples dependent?
* Example:
Suppose that a doctor would like to assess the effectiveness of changing to
a low-fat diet in lowering cholesterol for a group of patients. To do this
the doctor might measure the cholesterol of the patients before starting on
the low-fat diet and then measure the cholesterol for the same patients
after they have been on the low-fat diet. We therefore have 2 samples of
measured cholesterol:
. a sample before the diet
. a sample after the diet.
However, the 2 samples are not independent, since the cholesterol
measurements for each sample were taken on the same patients. Samples of
this type are called matched pair data.
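
Because the two samples are linked patient by patient, matched pair data are naturally summarised through the within-patient differences rather than by treating the samples separately. The short Python sketch below illustrates this idea only. The cholesterol values are invented for illustration (they do not come from these notes), and the names `before`, `after` and `paired_differences` are hypothetical.

```python
# Hypothetical cholesterol readings (mmol/L) for the SAME five patients,
# listed in the same patient order. These numbers are made up for illustration.
before = [6.2, 5.8, 7.1, 6.5, 5.9]
after = [5.7, 5.9, 6.4, 6.0, 5.5]


def paired_differences(first, second):
    """Return (second - first) for each unit, keeping the pairing intact."""
    if len(first) != len(second):
        raise ValueError("matched pair samples must have the same length")
    # Round to 2 decimal places to hide floating-point noise in the display.
    return [round(b - a, 2) for a, b in zip(first, second)]


differences = paired_differences(before, after)
mean_difference = sum(differences) / len(differences)

print(differences)                # one difference per patient
print(round(mean_difference, 2))  # average change across patients
```

A negative difference means the measurement fell after the diet. Shuffling the order of one list would destroy the pairing, which is exactly why the two samples cannot be treated as independent.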

1.4 Recommended Books

You will need to use statistical tables for the course. The tables used in
the exams are:
. Lindley, D.V. and Scott, W.F., New Cambridge Elementary Statistical
Tables, C.U.P., 1984.
Statistical tables will be used throughout this course.

There are many books which cover the material in this course. Some good
books are:
. Ross, S.M., Introduction to Probability and Statistics for Engineers and
Scientists (with CD-ROM).
. Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K., Probability and
Statistics for Engineers and Scientists, Prentice Hall, 7th edition, 2002.
. Clarke, G.M., and Cooke, D. A Basic Course in Statistics, Edward Arnold,
4th edition, 1999.
. Daly, F., Hand, D.J., Jones, M.C., Lunn, A.D. and McConway, K.J. Elements
of Statistics, Open University, 1995.
Goes beyond what's required for this course, but is quite clearly written
with some real examples.
. Devore, J and Peck, R. Introductory Statistics, West, 1990.
Rather simplistic at times, but has lots of real examples. Especially good
if you have not done any statistics before.
. Spiegel, M.R., Probability and Statistics, Schaum Outline Series, 1988.

In addition, you could browse in the library around QA276 and find a book
which suits you. For starters you could try looking at some of the
following.
. Anderson, D.R., Sweeney, D.J. and Williams, T.A. Introduction to
Statistics: Concepts and Applications, West, 2nd edition, 1991.
. Bassett, E.E., Bremner, J.M., Jolliffe, I.T., Jones, B., Morgan, B.J.T.
and North, P.M., Statistics: Problems and Solutions, Edward Arnold, 1986.
. Moore, D.S., The Basic Practice of Statistics, Freeman, 1995.
. Moore, D.S., Think and Explain with Statistics, Addison-Wesley, 1986.
. Moore, D.S., Statistics: Concepts and Controversies, Freeman, 1991, 1985,
1979.

There are many online books which could be useful. See for example
http://www.statsoft.com/textbook/stathome.html

Chapter 2: Graphical and Numerical Statistics

2.1 Histograms

Histograms give a visual representation of continuous data. We consider
two separate cases corresponding to when (i) all the bars in the histogram
have the same width; (ii) the intervals are of variable widths.

1. Histograms with equal class widths

* Example:
Mercury contamination can be particularly high in certain types of fish.
The mercury content (ppm) in the hair of 40 fishermen in a region thought
to be particularly vulnerable is given below (from the paper "Mercury content
of commercially imported fish of the Seychelles, and hair mercury levels of
a selected part of the population", Environ. Research, (1983), 305-312).
|13.26 |32.43 |18.10 |58.23 |64.00 |68.20 |35.35 |33.92 |23.94 |18.28 |
|22.05 |39.14 |31.43 |18.51 |21.03 | 5.50 | 6.96 | 5.19 |28.66 |26.29 |
|13.89 |25.87 | 9.84 |26.88 |16.81 |38.65 |19.23 |21.82 |31.58 |30.13 |
|42.42 |16.51 |21.16 |32.97 | 9.84 |10.64 |29.56 |40.69 |12.86 |13.80 |

* The first step is to group the data. A reasonable choice of class
intervals is:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.
The frequency table that results from the use of these intervals is:

|Interval |Frequency |
|0-10 |5 |
|10-20 |11 |
|20-30 |10 |
|30-40 |9 |
|40-50 |2 |
|50-60 |1 |
|60-70 |2 |

To construct the histogram in this situation (i.e. all class widths equal):
. Mark boundaries of the class intervals on the horizontal axis.
. The height of the bars above each interval can be taken as the frequency
for that interval.
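
The grouping step and the bar heights described above can be sketched in a few lines of Python. This is a minimal illustration using the 40 mercury values from the example. The function name `class_frequencies` is hypothetical, and counting a value that lies exactly on a class boundary in the upper interval is an assumption rather than something stated in these notes (no observation here falls on a boundary, so the choice does not affect the counts).

```python
# Mercury content (ppm) in the hair of 40 fishermen, as listed in the example.
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 38.65, 19.23, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]


def class_frequencies(data, width, n_classes):
    """Count observations in the classes [0, width), [width, 2*width), ..."""
    counts = [0] * n_classes
    for x in data:
        index = int(x // width)  # which class interval x falls into
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} lies outside the class intervals")
        counts[index] += 1
    return counts


# Seven classes of width 10: 0-10, 10-20, ..., 60-70.
frequencies = class_frequencies(mercury, width=10, n_classes=7)

# With equal class widths, the bar height is simply the class frequency.
for i, f in enumerate(frequencies):
    print(f"{10 * i}-{10 * (i + 1)}: {f} " + "#" * f)
```

The printed rows of `#` characters form a rough sideways histogram, one bar per class interval, with bar length equal to the frequency.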
Instead of using frequencies to give the heights of the rectangles in a
histogram, relative frequencies may be used. The relative frequency for an
interval is that interval's frequency divided by the total frequency.

* So for the mercury example...

|Interval |Frequency |Relative frequency |
|0-10 |5 |0.125 |
|10-20 |11 |