Friday, 13 December 2013

Les Villes en Equation: The World According to Zipf



Read a short article this morning in the RER on the way to work which touched upon several of my own peculiar curiosities: mathematics, language, and geography.  (Aside: yesterday there was a rail strike here in France, affecting hundreds of thousands of travellers.  The timing could hardly be worse, as residents here in Île-de-France are being asked to take measures to curb emissions in an effort to end a string of days of pollution deemed "Très élevé" (very high); the strike forced many to take to their cars.)

There is a not terribly well-known model called Zipf's law, named for a mid-20th century linguist, in which the elements of a finite or countably infinite set are rank-ordered by frequency, with each item's frequency inversely proportional to its rank.

Put simply, the first item in the list will appear with a frequency that is twice that of the second; the third most common will appear approximately 1/3 as frequently as the first, etc.

In mathematical terms, this is written as:

f(k;s,N)=\frac{1/k^s}{\sum_{n=1}^N (1/n^s)}.

where N is the total number of elements in the set; k is the rank of the item in the set (1...N), and s is the power parameter for the series.
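
To make the formula concrete, here is a minimal Python sketch of it.  The function name zipf_pmf and the choice of N = 5, s = 1 are my own, picked purely for illustration.

```python
def zipf_pmf(k, s, N):
    """Normalized Zipf frequency f(k; s, N) for the item of rank k."""
    if not 1 <= k <= N:
        raise ValueError("rank k must satisfy 1 <= k <= N")
    # Denominator: sum of 1/n^s over every rank n = 1..N
    normalizer = sum(1.0 / n**s for n in range(1, N + 1))
    return (1.0 / k**s) / normalizer


# Illustrative values only: a five-item set with s = 1
N, s = 5, 1.0
probs = [zipf_pmf(k, s, N) for k in range(1, N + 1)]

# Express each frequency relative to the top-ranked item
ratios = [p / probs[0] for p in probs]
print([round(r, 2) for r in ratios])  # [1.0, 0.5, 0.33, 0.25, 0.2]
```

With s = 1 the relative frequencies are simply 1, 1/2, 1/3, and so on; a larger s makes the drop-off steeper.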

George Zipf, for whom the distribution (and law) is named, observed that, empirically in language, the most frequently used word will appear approximately twice as often as the second, three times as often as the third, etc.  The famous Brown corpus of English, a collection of American texts compiled at Brown University, bears this out: the most common word in the texts ('the') appears about 70,000 times.  The second most common ("of") appears about 36,000 times; the third ("and") about 28,000 times.  As ratios to the first: 1.00, 0.51, 0.40.  Not a perfect fit, but close.

In the article, it is pointed out that, ranking the most populous US cities (New York, Los Angeles, Chicago, Houston, Philadelphia) one sees an approximately Zipf-ian distribution.

The populations (1, 2, 3, 4, 5) are

8.4MM, 3.9MM, 2.7MM, 2.2MM, 1.5MM

The ratios:

1.00, 0.46, 0.32, 0.26, 0.18

Very close to 1, 1/2, 1/3, 1/4, and 1/5.
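
For anyone who wants to redo the arithmetic, a quick Python sketch follows.  The population figures are the rounded ones quoted above; the helper name rank_ratios is my own invention.

```python
# Populations in millions, as quoted in this post (rounded figures)
us_cities_mm = {
    "New York": 8.4,
    "Los Angeles": 3.9,
    "Chicago": 2.7,
    "Houston": 2.2,
    "Philadelphia": 1.5,
}


def rank_ratios(values):
    """Sort largest first and divide each value by the largest."""
    ordered = sorted(values, reverse=True)
    return [v / ordered[0] for v in ordered]


observed = rank_ratios(us_cities_mm.values())
# Zipf's prediction with s = 1: the city of rank k has 1/k of the top city's population
predicted = [1 / k for k in range(1, len(observed) + 1)]

for rank, (obs, pred) in enumerate(zip(observed, predicted), start=1):
    print(f"rank {rank}: observed {obs:.2f}, Zipf {pred:.2f}")
```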

Since Zipf also studied the patterns of use of Chinese, I thought it might be interesting to examine if Zipf's law applies in that giant country.

According to the CIA World Factbook, the most populous cities in China are Shanghai, Beijing, Tianjin, Guangzhou, and Shenzhen.

Populations:  22MM, 19MM, 11MM, 11MM, 10MM.
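
One rough way to put a number on the fit is to compute the same ratios and then estimate the exponent s as the slope of log(population) against log(rank).  The sketch below does both; the least-squares fit is my own choice of yardstick, not something from the article, and with only five data points it is crude at best.  The helper names are made up for the purpose.

```python
import math


def rank_ratios(populations):
    """Sort largest first and divide each value by the largest."""
    ordered = sorted(populations, reverse=True)
    return [p / ordered[0] for p in ordered]


def fit_exponent(populations):
    """Estimate s as minus the least-squares slope of log(pop) vs log(rank)."""
    ordered = sorted(populations, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(p) for p in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Populations in millions, as quoted in this post (rounded figures)
china_mm = [22, 19, 11, 11, 10]
us_mm = [8.4, 3.9, 2.7, 2.2, 1.5]

china_ratios = rank_ratios(china_mm)
china_s = fit_exponent(china_mm)
us_s = fit_exponent(us_mm)

print("China ratios:", [round(r, 2) for r in china_ratios])
print(f"fitted s, China: {china_s:.2f}; US: {us_s:.2f}")
```

A fitted s near 1 is what the simple 1, 1/2, 1/3 pattern implies.  On these figures the US list lands close to 1, while the Chinese list comes out at roughly half that: the big Chinese cities are far closer to one another in size than Zipf would predict.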

The distribution here seems to fail, which prompts three observations.  First, Chinese census data define 'cities' somewhat differently, tracking "built-up areas," "urban areas," etc., so comparisons may be a bit off.  Second, whilst I knew that China had some big cities (the country has more than 1.3 billion people), these are massive cities.  New York would be about eighth.  China has cities that would rank as the third largest state in the US.  Third, I am shocked to see Shenzhen as the fifth largest city in China.  I visited Shenzhen in 1989, and there was not much there besides a few glass apartments that catered to business travelers who came and went from Hong Kong, just across the Shumchun river.  It has added, I would suspect, nine million residents in 25 years.

All in all, not Moore's law, but still quite something.