Homogeneity is usually an important factor in a data analysis. For example,
suppose we wish to demonstrate that life spans have been increasing since 1500.
But suppose we are interested not so much in aggregate mortality as in how the same
individual might fare in different centuries.
One approach would be to select a certain sector of society to which the analysis
will be confined. This way, comparisons across time will be made between subjects
with similar life circumstances. I think a good choice would be music composers.
Until sometime in the 19th century, a composer would have been a middle- to upper-class professional employee (composers started to demand more respect around the year 1800). A composer probably would have lived comfortably, but not in luxury. Most importantly, even for early composers we know the year of birth and year of death (to within a few years, in some cases).
So, I assembled a data set of 60 composers from the 16th to the 19th century using Wikipedia. The only control over selection was an attempt to select "the most famous ones", while making sure the 16th century was well represented. So I started with Vivaldi, Bach, Mozart, Beethoven, etc., and continued until I had 60 observations.
I did not take the month or day of birth into account. I recorded the year of birth and the year of death, and estimated life span as the difference between them. I then plotted Y = [Life Span] against X = [Year of Birth] (see below).
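To see why a difference of calendar years is only accurate to within a year, here is a small Python sketch. It is not part of the original analysis (which was done in R); the function names are made up for illustration, and the second pair of dates is hypothetical, chosen only to show the worst case.

```python
from datetime import date


def span_from_years(born: int, died: int) -> int:
    """Life span estimated as in the post: year of death minus year of birth."""
    return died - born


def completed_years(birth: date, death: date) -> int:
    """Exact age at death in completed years, using full dates."""
    years = death.year - birth.year
    # Subtract one if death fell before the birthday in the final year.
    if (death.month, death.day) < (birth.month, birth.day):
        years -= 1
    return years


# Mozart: born 27 January 1756, died 5 December 1791.
mozart_estimate = span_from_years(1756, 1791)
mozart_exact = completed_years(date(1756, 1, 27), date(1791, 12, 5))

# Hypothetical worst case: the same two calendar years, but born on the
# last day of the birth year and died on the first day of the death year.
worst_exact = completed_years(date(1756, 12, 31), date(1791, 1, 1))

print(mozart_estimate, mozart_exact, worst_exact)
```

In completed years the estimate is either exact or one too high; measured as a continuous duration, the true life span lies within one year of it on either side.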
When I first did this a few years ago I did not expect what I saw. There seemed to be a distinct functional relationship between X and Y, with noise, but it was not monotone. Life spans started high at the beginning of the range, but dropped quite noticeably for composers born in the 18th and early 19th centuries. Life spans then increased again later in the 19th century.
I noticed the same thing with this new data set. But this data must be examined carefully. First of all, life spans are not usually normally distributed, and the form of the life span distribution in this example appears to change dramatically, especially around the year X = 1700. Also, we clearly cannot use simple linear regression, because we must consider the possibility that the relationship between X and Y is not monotone (let alone linear). In this case, splines are a good choice, so I used a natural cubic spline with 4 degrees of freedom (via the splines package in R with default settings).
Finally, we will attempt to predict not the mean of Y but the median, using quantile regression (the quantreg package in R). The resulting fit is superimposed on the scatter plot.
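For readers unfamiliar with quantile regression: it replaces the squared-error criterion of ordinary regression with the check (or pinball) loss, and for the 0.5 quantile a constant-only model is minimized by the sample median. The following Python sketch illustrates only that objective, not the spline fit itself, which was done with quantreg in R. The function name and the search grid are illustrative assumptions; the six life spans are the ones listed later in this post.

```python
from statistics import median


def pinball_loss(values, q, tau=0.5):
    """Check (pinball) loss of predicting the constant q for every value."""
    total = 0.0
    for v in values:
        r = v - q
        # Under-predictions are weighted by tau, over-predictions by 1 - tau.
        total += tau * r if r >= 0 else (tau - 1) * r
    return total


# Life spans of the six composers singled out later in this post.
spans = [35, 40, 31, 38, 39, 46]

med = median(spans)

# Candidate constant predictions from 30.0 to 50.0 in steps of 0.5.
grid = [30 + 0.5 * k for k in range(41)]
best_loss = min(pinball_loss(spans, q) for q in grid)

# True if no candidate on the grid has a smaller loss than the median.
median_is_optimal = pinball_loss(spans, med) <= best_loss

print(med, pinball_loss(spans, med), median_is_optimal)
```

Quantile regression applies the same loss with the constant replaced by a function of X, here the spline in year of birth, so that with tau = 0.5 the fitted curve estimates the conditional median of life span.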
So what we see is a more or less constant life span distribution for composers born in the 16th and 17th centuries, but then a rather sharp decrease in the median life span for composers born in the 18th century. The median life span then begins to rise again throughout the 19th century.
It's interesting to single out a few composers:
Wolfgang Amadeus Mozart, born 1756, lived 35 years;
Carl Maria von Weber, born 1786, lived 40 years;
Franz Schubert, born 1797, lived 31 years;
Felix Mendelssohn, born 1809, lived 38 years;
Frédéric Chopin, born 1810, lived 39 years;
Robert Schumann, born 1810, lived 46 years;
(remember, the lifespans are correct to within one year). These composers were born between 1756 and 1810, and none lived more than 46 years.
I think there may be a sensible hypothesis to explain what we see. The composers listed above all died from some illness. I also believe they lived in large cities (Vienna, Paris, etc.). So, if the musical profession became more urbanized, and if large cities were significantly more dangerous medically at that time, we have an explanation. But that's just a hypothesis for now.
You can download the data as a CSV file here.