Exploring Language Dynamics
Otakar Smrz |
Institute of Formal and Applied Linguistics, Charles University, Prague |
This article offers a preliminary insight into the behaviour of natural language regarded as a nonlinear dynamical system.
One may imagine that both written and spoken language are just the output of some much more complex mental process. The succession of letters or the flow of speech can then be represented as a one-dimensional sequence of numbers, to which time series analysis tools can be applied.
Most results have been achieved using the TISEAN package []. Kind guidance through the less intelligible parts of our search was provided by Mgr. Kiril Ribarov.
In order to enable some comparison between languages and to test the desired independence of the results from the input representation, we processed four data sets:
The first three files were converted to the time series format simply by assigning each character its ordinal number and writing it on a new line of the output file. The speech record was split into time windows 10 milliseconds long. For each window, the average acoustic energy was computed. The output file contains logarithms of this quantity (the base of the logarithm is irrelevant, and we do not know it anyway).
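The two conversions above can be sketched as follows. This is a minimal illustration, not the tool actually used; the function names and the assumption of raw samples in a Python list are ours.

```python
import math

# Text: one character ordinal per data point, as described above.
def text_to_series(text: str) -> list[int]:
    return [ord(c) for c in text]

# Speech: split the samples into 10 ms windows and take the logarithm of
# the average energy in each window (the log base is irrelevant, as noted).
def energy_series(samples: list[float], rate_hz: int) -> list[float]:
    win = max(1, rate_hz // 100)          # 10 ms worth of samples
    out = []
    for i in range(0, len(samples) - win + 1, win):
        energy = sum(s * s for s in samples[i:i + win]) / win
        out.append(math.log(energy))
    return out

print(text_to_series("abc"))              # [97, 98, 99]
```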
We assume a sampling rate of 1 Hz, i.e. a delay of 1 second between any two adjacent time series data points. Such rescaling should not affect the nature of the results.
Our usage of the methods was inspired by [], but we do not follow it strictly. The graphical results are enclosed in separate files.
For the English and Arabic data, we wanted to obtain histograms of individual characters. The behaviour of the binary required that data points 0 and 256 be added to the input file and that the number of all possible characters be doubled to obtain the number of bins. We used this DOS batch:
type dat0256.dat > %1.inp
type %1.dat >> %1.inp
c:\tisean\histogram.exe %1.inp -o %1.out -b 512
Processing of the Czech data was less straightforward. Since we do not know the appropriate number of energy bins a priori, some experimentation was necessary. We found 256 to work quite well, though other choices are possible. We fixed the value range by adding data points -1.1 and 11.0 to the input.
We used the histograms to test the validity of Zipf's law. The histogram steps were arranged in descending order and plotted along with the fitted curve with parameters R and Q.
For English (R = 61) and Arabic (R1 = 43, R2 = 53), the results are satisfactory (we must bear in mind that Zipf's law applies to much larger dictionaries). For Czech, R = 245, but even if we set R = Q = 2, there is no such concordance. A remedy might be to recompute the histogram with a different number of bins.
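The rank-frequency check can be sketched as follows. The Zipf–Mandelbrot form f(r) = C / (r + Q) is our assumption about the fitted curve, and the constants passed below are purely illustrative, not the fitted R and Q values above.

```python
from collections import Counter

# Character counts arranged in descending order, as in the plots.
def rank_frequencies(text: str) -> list[int]:
    return sorted(Counter(text).values(), reverse=True)

# Assumed Zipf-Mandelbrot curve; c and q are illustrative parameters.
def zipf_mandelbrot(rank: int, c: float, q: float) -> float:
    return c / (rank + q)

freqs = rank_frequencies("exploring language dynamics")
print(freqs[0] >= freqs[-1])              # True: sorted descending
```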
In the former, we mark the occurrences of pairs of adjacent characters. We present this only for our ASCII 255 Arabic data. Notice that the symmetry of the map is only apparent.
The latter routine tells us more about the system dynamics. The plots are presented for m = 1, 2, 3; see [] for their interpretation. The contrast in the behaviour of the two types of our data is striking.
Power spectra were computed using the spectrum.exe binary with default options, and the results are striking. For all written language data, no periodicity was discovered; the data show the features of white noise. The speech data, on the other hand, resemble brown noise, as the spectrum falls off roughly like f^-3. Could this difference already indicate a violation of the representation independence presumption?
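The contrast can be reproduced outside TISEAN with a minimal periodogram, assuming numpy is available. White noise has a flat spectrum, while integrated ("brown"-type) noise concentrates its power at low frequencies:

```python
import numpy as np

# Plain periodogram: squared DFT magnitudes, zero-frequency bin dropped.
def periodogram(x: np.ndarray) -> np.ndarray:
    x = x - x.mean()
    spec = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    return spec[1:]

rng = np.random.default_rng(0)
white = rng.normal(size=4096)             # flat spectrum
brown = np.cumsum(white)                  # integrated noise, steep spectrum
# The brown spectrum dominates at low frequencies and decays rapidly,
# whereas the white one fluctuates around a constant level.
```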
The correlation results (both corr and autocor) are somewhat obscure. We do not know why there are negative values in the output when dealing with non-negative functions; presumably the routines subtract the mean, so anticorrelated fluctuations can yield negative values. Still, all written data show similar behaviour, while the speech results have a different character. The mutual information function can be described likewise.
We would like to obtain some notion of the most suitable time delay from these plots, looking for zero crossings in the former and marked minima in the latter. For English, we obtain the time lag t = 24. The Arabic data are more ambiguous, but we may decide to choose the same value; thus, when we finally find the embedding dimensions, their distinct values will not be due to unequal time delays.
As for Czech, we prefer the time lag t = 318 read from the mutual information plot. Do not be alarmed by such a large number: remember that the original sampling rates of characters and of acoustic windows are of different orders.
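Choosing the delay from the first marked minimum of the time-lagged mutual information can be sketched as below. This is a naive histogram estimator, not TISEAN's mutual routine; the bin count and function names are our assumptions.

```python
import numpy as np

# Mutual information between x(t) and x(t + lag), from a 2-D histogram.
def lagged_mi(x: np.ndarray, lag: int, bins: int = 16) -> float:
    a, b = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / np.outer(px, py)[nz])).sum())

# First interior minimum of a sequence of MI values computed for lags 1, 2, ...
def first_minimum(values: list[float]) -> int:
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] < values[i + 1]:
            return i + 1                  # lags are 1-based
    return len(values)                    # no interior minimum found
```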
We ran the d2 routine from the package, setting the time lags accordingly. It produced four output files, three of which can be plotted and examined. The *.c2 files contain the correlation sum C(e); *.d2 yields the correlation dimension as a function of e (a local slope estimate of log C(e) against log e, which should plateau at the invariant value rather than be a single number); and *.h2 provides the correlation entropy.
In addition, the correlation sum in *.c2 can be processed by the c2d, c2t and c2g routines. We tried to find a plateau in their graphs for increasing embedding dimension m and to interpret its level as the correlation dimension d, since C(e) ~ e^d. The bottommost curves in the plots of *.d2, *.c2d, *.c2t and *.c2g correspond to the embedding dimension m = 1.
Among all the similar plots, the Takens estimator plot has the neatest appearance. Perhaps we should have gone beyond m = 10 in our computations to see where the plateau ends. English seems to have d = 1.5. The plots for Arabic are the least persuasive, but we may estimate d = 5 or so. Czech features some peculiar convergence; if there is any, we hope to find d within the range from 3 to 4.
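The quantity d2 estimates can be sketched with a brute-force Grassberger–Procaccia correlation sum over a delay embedding (assuming numpy; real data would need the fast neighbour search TISEAN provides). The slope of log C(e) against log e in the scaling region gives the correlation dimension d.

```python
import numpy as np

# Delay embedding: rows are vectors (x[t], x[t+lag], ..., x[t+(m-1)*lag]).
def embed(x: np.ndarray, m: int, lag: int) -> np.ndarray:
    n = len(x) - (m - 1) * lag
    return np.column_stack([x[i * lag:i * lag + n] for i in range(m)])

# Fraction of distinct point pairs closer than eps: the correlation sum C(eps).
def correlation_sum(x: np.ndarray, m: int, lag: int, eps: float) -> float:
    v = embed(x, m, lag)
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(-1))
    n = len(v)
    return (d[np.triu_indices(n, 1)] < eps).mean()
```

This is O(n^2) in the number of points, which already hints at why the computation time was an issue for the longer series.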
Because of the long computation time required, we have so far performed this analysis on english.dat and arabic_1.dat only. The plots could help us assess the most suitable embedding dimension; unfortunately, we can see no steep slope or hillside.
We assume the relation m = 2d + 1. Thus English yields m = 4, Arabic maps into about m = 11, and Czech yields m = 8, say.
To find the principal directions (eigenvectors) in the embedding space and to render several vector projections, one can use the svd routine.
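A minimal sketch of that step, assuming numpy: embed the series with the chosen m and lag, then take the SVD of the centred trajectory matrix. The sine input and the English parameters (m = 4, lag 24) are used here only as placeholders for real data.

```python
import numpy as np

# Delay embedding, as in the correlation sum sketch.
def embed(x: np.ndarray, m: int, lag: int) -> np.ndarray:
    n = len(x) - (m - 1) * lag
    return np.column_stack([x[i * lag:i * lag + n] for i in range(m)])

x = np.sin(np.linspace(0, 40, 2000))      # placeholder signal
v = embed(x, 4, 24)                       # m = 4, t = 24 as for English
v = v - v.mean(axis=0)                    # centre before SVD
_, s, vt = np.linalg.svd(v, full_matrices=False)
# Rows of vt are the principal directions; v @ vt[:2].T projects the
# trajectory onto the two dominant ones for plotting.
```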
There was a clear hypothesis: if language is a reflection of some inferable dynamical system, then this system should be inferable independently of the representation of the language. This does not mean, of course, that every method must succeed in finding it; it means that the successful ones must agree.
We have been surprised by the comparison of the power spectra of speech and script, and other methods also indicated that representation matters. To be honest, nothing of the kind has been found for the correlation and embedding dimensions. The dependence of these parameters on the type of language must be studied more thoroughly. Still, the effects of various representations, encodings, data structure etc. should be reviewed first!