Exploring Language Dynamics



Otakar Smrz
Institute of Formal and Applied Linguistics, Charles University, Prague

Introduction

This article offers a first insight into the behaviour of natural language regarded as a nonlinear dynamical system.

One may imagine that both written and spoken language are just outputs of some much more complex mental process. The succession of letters or the flow of speech can then be represented by a one-dimensional sequence of numbers to which time series analysis tools can be applied.

Most results were obtained with the TISEAN package []. Kind guidance through the less intelligible parts of our search was provided by Mgr. Kiril Ribarov.

Language data

In order to enable some comparison between languages and to test whether the results are independent of the input representation, we processed four data sets:

  1. English prose in ASCII encoding (96,000 characters)
  2. Arabic prose and poetry written with full or partial vowelization in ASCII 127 transliteration [] (131,000 characters)
  3. The same Arabic data in ASCII 255 encoding [] (123,000 characters)
  4. Czech newspaper articles read by 52 different speakers and recorded in an acoustic wave format (one sentence each, 398 seconds altogether)

The first three files were converted to the time series format simply by assigning each character its ordinal number and writing it on a new line of the output file. The speech record was split into time windows 10 milliseconds long; for each window, the average acoustic energy was computed. The output file contains logarithms of this quantity (the base of the logarithm is irrelevant, and we do not know it anyway).
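The following Python sketch illustrates both conversions. The file names, the assumption of a 16-bit mono recording, and its sampling rate are ours, not part of the original setup.

import wave
import numpy as np

# text to time series: one ordinal number per line
with open("english.txt", encoding="ascii") as inp, open("english.dat", "w") as out:
    for ch in inp.read():
        out.write(f"{ord(ch)}\n")

# speech to time series: log of the average energy in 10 ms windows
with wave.open("czech.wav") as w:                # assumed 16-bit mono
    rate = w.getframerate()
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
window = rate // 100                             # 10 ms worth of samples
n = len(samples) // window
energy = (samples[:n * window].astype(float) ** 2).reshape(n, window).mean(axis=1)
np.savetxt("czech.dat", np.log(energy + 1e-12))  # guard against log(0)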

We assume that the sampling rate is 1 Hz, i.e. there is a delay of 1 second between any two adjacent time series data points. Such rescaling should not affect the nature of the results.

Methods and results

Our use of the methods was inspired by [], but we do not follow it strictly. The graphical results are enclosed in separate files.

Histogram

For the English and Arabic data, we wanted to obtain histograms of individual characters. The behaviour of the binary required that the data points 0 and 256 be added to the input file and that the number of all possible characters be doubled to obtain the number of bins. We used this DOS batch:

rem prepend the fixed points 0 and 256 to unify the value range
type dat0256.dat > %1.inp
type %1.dat >> %1.inp
rem 512 bins, i.e. twice the number of all possible characters
c:\tisean\histogram.exe %1.inp -o %1.out -b 512

Processing of the Czech data was less straightforward. Since there is no natural number of energy bins, some experimentation was necessary. We found 256 quite satisfactory, though other choices are possible. We fixed the value range by adding the data points -1.1 and 11.0 to the input.

Zipf's law

We used the histograms for testing the validity of Zipf's law. The histogram steps were arranged in descending order and plotted along with the curve

f(x) = 1 / (x * log(1.78 * R))

x being the rank and f(x) the corresponding relative frequency. The coefficient R is equal to the size of the dictionary, which is in our case the number of occupied histogram bins.

For English (R = 61) and Arabic (R1 = 43, R2 = 53), the results are satisfactory (we must consider that Zipf's law applies to much larger dictionaries). For Czech, R = 245, but no choice of R produces such concordance. Some cure could rest in recomputing the histogram for a different number of bins.
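The comparison can be sketched in Python as follows; the histogram file name and its two-column (bin, count) format are our assumptions about the TISEAN output.

import numpy as np
import matplotlib.pyplot as plt

counts = np.loadtxt("english.out")[:, 1]        # occupancy of each bin
freq = np.sort(counts[counts > 0])[::-1]
freq = freq / freq.sum()                        # relative frequencies
R = len(freq)                                   # occupied bins = dictionary size
rank = np.arange(1, R + 1)

plt.loglog(rank, freq, "o", label="data")
plt.loglog(rank, 1.0 / (rank * np.log(1.78 * R)), label="Zipf curve")
plt.xlabel("rank"); plt.ylabel("relative frequency")
plt.legend(); plt.show()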

Predecessor versus successor, recurrence plot

In the former, we mark the occurrence of each pair of adjacent characters. We present this only for our ASCII 255 Arabic data. Notice that the symmetry of the map is only apparent.
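Such a map is trivial to produce; in this Python sketch the input file name is our assumption.

import numpy as np
import matplotlib.pyplot as plt

x = np.loadtxt("arabic_2.dat")
plt.plot(x[:-1], x[1:], ",")     # mark each adjacent pair (x_t, x_t+1)
plt.xlabel("predecessor"); plt.ylabel("successor")
plt.show()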

The latter routine tells us more about the dynamics of the system. The plots are presented for m = 1, 2, 3; see [] for their interpretation. In any case, the striking contrast in the behaviour of the two types of our data is remarkable.
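A bare-bones recurrence plot for m = 1 can be sketched as follows (TISEAN's recurr routine offers more options); the file name, the threshold eps and the subsampling are our assumptions.

import numpy as np
import matplotlib.pyplot as plt

x = np.loadtxt("english.dat")[:2000]        # keep the distance matrix small
eps = 5.0
dist = np.abs(x[:, None] - x[None, :])      # pairwise distances for m = 1
i, j = np.nonzero(dist < eps)               # recurrent pairs of times
plt.plot(i, j, ",")
plt.xlabel("i"); plt.ylabel("j")
plt.show()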

Power spectrum

Power spectra were computed using the spectrum.exe binary with default options, and the results are striking! For all written language data, no periodicity was discovered; the data show the features of white noise. The speech data, on the other hand, resemble brown noise, for the spectrum vanishes like f^(-3) or so. Can this difference indicate some possible violation of the representation independence presumption later on?
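As a cross-check, the spectrum can also be estimated directly with numpy; the file name is our assumption.

import numpy as np
import matplotlib.pyplot as plt

x = np.loadtxt("czech.dat")
x = x - x.mean()
power = np.abs(np.fft.rfft(x)) ** 2
f = np.fft.rfftfreq(len(x), d=1.0)    # sampling rate rescaled to 1 Hz
plt.loglog(f[1:], power[1:])          # skip the zero-frequency bin
plt.xlabel("f"); plt.ylabel("power")
plt.show()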

Correlation and mutual information

The correlation results (from both corr and autocor) are somewhat obscure. We do not know why there are negative values in the output when dealing with non-negative functions; presumably the routines subtract the mean of the series first. Still, all written data show similar behaviour, while the speech results have a different character. The mutual information function can be described likewise.

We would like to get some notion of the most suitable time delay from these plots. Let us look for zero crossings in the former and for marked minima in the latter. For English, we obtain the time lag t = 24. The Arabic data are more ambiguous, but we may decide to choose the same value. Thus, when we finally find the embedding dimensions, their distinct values will not be due to unequal time delays.

As for Czech, we prefer the time lag t = 318, read from the mutual information plot. Do not be scared by such a big number! Remember that the original sampling rates of characters and acoustic windows are of different orders.
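Both criteria can be sketched in a few lines of Python; the file name and the 16-bin partition used for the mutual information are our assumptions.

import numpy as np

x = np.loadtxt("english.dat")
x = x - x.mean()

def autocorr(x, lag):
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def mutual_info(x, lag, bins=16):
    joint, _, _ = np.histogram2d(x[:-lag], x[lag:], bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / np.outer(px, py)[nz]))

first_zero = next(t for t in range(1, 200) if autocorr(x, t) <= 0)
mi = [mutual_info(x, t) for t in range(1, 200)]
first_min = 1 + next(t for t in range(1, len(mi) - 1)
                     if mi[t] < mi[t - 1] and mi[t] < mi[t + 1])
print(first_zero, first_min)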

Searching for correlation dimension

We ran the d2 routine from the package, setting the time lags properly. It produced four output files, three of which can be plotted and examined. From the *.c2 files we obtain the correlation sum C(e), *.d2 yields the correlation dimension as a function of e (presumably these are the local slopes of log C(e) versus log e, which flatten only in the scaling region; we had expected a single invariant number), and *.h2 provides us with the correlation entropy.

In addition, the correlation sum in *.c2 can be processed by the c2d, c2t and c2g routines. We tried to find a plateau in their graphs for increasing embedding dimension m and to interpret its level as the correlation dimension d, since C(e) ~ e^d. The bottommost curves in the plots of *.d2, *.c2d, *.c2t and *.c2g correspond to the embedding dimension m = 1.
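The idea behind these routines is the Grassberger-Procaccia correlation sum; a naive Python sketch follows, with the file name, delay and embedding dimension as our assumptions (d2 itself is far more efficient).

import numpy as np

x = np.loadtxt("english.dat")[:1500]   # subsample: all pairs are O(N^2)
tau, m = 24, 4
N = len(x) - (m - 1) * tau
emb = np.column_stack([x[i * tau : i * tau + N] for i in range(m)])

dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
dist = dist[np.triu_indices(N, k=1)]           # distinct pairs only

eps = np.logspace(0, 2, 20)
C = np.array([(dist < e).mean() for e in eps]) # correlation sum C(e)
slope = np.gradient(np.log(C), np.log(eps))    # local slopes; plateau ~ d
print(np.column_stack([eps, C, slope]))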

Among all the resembling plots, that of Takens' estimator has the neatest appearance. Perhaps we should have gone beyond m = 10 in our computations to see the limit of the plateau. English seems to have d = 1.5. The plots for Arabic are the least persuasive, but we may estimate d = 5 or so. Czech features some peculiar convergence; if there is any, we hope to find d within the range from 3 to 4.

False nearest neighbours

Because of the long computation time needed, we have so far performed this analysis only on english.dat and arabic_1.dat. The plots could help us in assessing the most suitable embedding dimension. Unfortunately, we can see no steep slope or hill-side.
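The principle can be sketched as follows (TISEAN's false_nearest is the real tool); the file name, delay, distance ratio and subsampling are our assumptions.

import numpy as np

def embed(x, m, tau):
    N = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + N] for i in range(m)])

def fnn_fraction(x, m, tau, ratio=10.0):
    a = embed(x, m + 1, tau)          # one extra coordinate
    b = a[:, :m]                      # its m-dimensional projection
    false = 0
    for i in range(len(b)):
        d = np.linalg.norm(b - b[i], axis=1)
        d[i] = np.inf
        j = np.argmin(d)              # nearest neighbour in m dimensions
        # a neighbour is false if the extra coordinate blows the distance up
        if d[j] > 0 and abs(a[i, m] - a[j, m]) > ratio * d[j]:
            false += 1
    return false / len(b)

x = np.loadtxt("english.dat")[:2000]  # subsample: the loop is O(N^2)
for m in range(1, 8):
    print(m, fnn_fraction(x, m, tau=24))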

Embedding dimension estimates

We assume the relation m = 2d + 1. Thus English yields m = 4, Arabic maps into about m = 11, and Czech yields m = 8, say.

To find the principal directions (eigenvectors) in the embedding space and to render several vector projections, one can use the svd routine.
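What the svd routine computes can be emulated in numpy; the file name, delay and embedding dimension are our assumptions.

import numpy as np

x = np.loadtxt("english.dat")
tau, m = 24, 4
N = len(x) - (m - 1) * tau
emb = np.column_stack([x[i * tau : i * tau + N] for i in range(m)])
emb = emb - emb.mean(axis=0)

U, s, Vt = np.linalg.svd(emb, full_matrices=False)
print(s)                      # the singular spectrum
proj = emb @ Vt[:2].T         # projection onto the two principal directions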

Conclusion

There was a clear hypothesis: if language is a reflection of any inferable dynamical system, then this system might be inferred independently of the representation of the language. This does not say, of course, that every method must succeed in finding it; it says that the findings of the successful ones must intersect.

We have been surprised by the comparison of the power spectra of speech and script, and other methods also indicated that the representation matters. To be honest, nothing like that has been found for the correlation and embedding dimensions. The dependence of these parameters on the type of the language must be studied more thoroughly. Still, the effects of various representations, encodings, data structure etc. should be reviewed first!

References

[]
Rainer Hegger, Holger Kantz, Thomas Schreiber: Practical implementation of nonlinear time series methods: The TISEAN package
CHAOS 9, 413 (1999)
http://www.mpipks-dresden.mpg.de/~tisean/TISEAN_2.0

[]
Klaus Lagally: ArabTeX - a System for Typesetting Arabic, User Manual Version 3.09.
Institut für Informatik, Universität Stuttgart, 1999
ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex

[]
Otakar Smrz: ArabSpell - User's Guide to Version 1.00
http://come.to/arabspell