A Memory-Based Model of Syntactic Analysis

Department of Computational Linguistics Remko Scha Rens Bod Khalil Sima'an

Published in: W. Daelemans (ed.): Journal of Experimental & Theoretical Artificial Intelligence. (Special Issue on Memory-Based Language Processing) 11, 3 (July 1999), pp. 409-440. © 1999 Taylor & Francis Ltd.

This research was partially supported by the Dutch National Science Foundation NWO under the Priority Programme "Language and Speech Technology".

A Memory-Based Model of Syntactic Analysis: Data-Oriented Parsing

Remko Scha, Rens Bod and Khalil Sima'an

Institute for Logic, Language and Computation
University of Amsterdam
Spuistraat 134
1012 VB Amsterdam, The Netherlands

Abstract

This paper presents a memory-based model of human syntactic processing: Data-Oriented Parsing. After a brief introduction (section 1), it argues that any account of disambiguation and many other performance phenomena inevitably has an important memory-based component (section 2). It discusses the limitations of probabilistically enhanced competence-grammars, and argues for a more principled memory-based approach (section 3). In sections 4 and 5, one particular memory-based model is described in some detail: a simple instantiation of the "Data-Oriented Parsing" approach ("DOP1"). Section 6 reports on experimentally established properties of this model, and section 7 compares it with other memory-based techniques. Section 8 concludes and points to future work.

1. Introduction

Could it be the case that all of human language cognition takes place by means of similarity- and analogy-based processes which operate on a store of concrete past experiences? For those of us who are tempted to give a positive answer to this question, one of the most important challenges consists in describing the processes that deal with syntactic structure.

A person who knows a language can understand and produce a virtually endless variety of new and unforeseen utterances. To describe precisely how people actually do this, is clearly beyond the scope of linguistic theory; some kind of abstraction is necessary. Modern linguistics has therefore focussed its attention on the infinite repertoire of possible sentences (and their structures and interpretations) that a person's conception of a language in principle allows: the person's linguistic "competence".

In its effort to understand the nature of this "knowledge of language", linguistic theory uses the artificial languages of logic and mathematics as its paradigm sources of inspiration. Linguistic research proceeds on the assumption that a language is a well-defined formal code -- that to know a language is to know a non-redundant, consistent set of rules (a "competence grammar"), which establishes unequivocally which word sequences belong to the language, and what their pronunciations, syntactic analyses and semantic interpretations are.

Language-processing algorithms which are built for practical applications, or which are intended as cognitive models, must address some of the problems that linguistic competence grammars abstract away from. They cannot just produce the set of all possible analyses of an input utterance: in the case of ambiguity, they should pick the most plausible analysis; if the input is uncertain (as in the case of speech recognition) they should pick the most plausible candidate; if the input is corrupted (by typos or spelling-mistakes) they should make the most plausible correction.

A competence-grammar which gives a plausible characterization of the set of possible sentences of a language does no more (and no less) than provide a rather broad framework within which many different models of an individual's language processing capabilities ("performance models") may be specified. To investigate what a performance model of human language processing should look like, we do not have to start from scratch. We may, for instance, look at the ideas of previous generations of linguists and psychologists, even though these ideas did not yet get articulated as mathematical theories or computational models. If we do that, we find one very common idea: that language users produce and understand new utterances by constructing analogies with previously experienced ones. Noam Chomsky has noted, for instance, that this view was held by Bloomfield, Hockett, Jespersen, Paul, Saussure, and "many others" (Chomsky 1966, p. 12).

This intuitively appealing idea may be summarized as memory-based language-processing (if we want to emphasize that it involves accessing representations of concrete past language experiences), or as analogy-based language-processing (if we want to draw attention to the nature of the process that is applied to these representations). The project of embodying it in a formally articulated model seems a worthwhile challenge. In the next few sections of this paper we will discuss empirical reasons for actually undertaking such a project, and we will report on our first steps in that direction.

The next section therefore discusses in some detail one particular reason to be interested in memory-based models: the problem of ambiguity resolution. Section 3 will then start to address the technical challenge of designing a mathematical and computational system which complies with our intuitions about the memory-based nature of language-processing, while at the same time doing justice to some insights about syntactic structure which have emerged from the Chomskyan tradition.

2. Disambiguation and statistics

As soon as a formal grammar characterizes a non-trivial part of a natural language, it assigns an unmanageably large number of different analyses to almost every input string. This is problematic, because most of these analyses are not perceived as plausible by a human language user, although there is no conceivable clear-cut reason for a theory of syntax or semantics to label them as deviant (cf. Church & Patil, 1983; MacDonald et al., 1994; Martin et al., 1983.). Often, it is only a matter of relative implausibility. A certain interpretation may seem completely absurd to a human language user, just because another interpretation is much more plausible.

The disambiguation problem gets worse if the input is uncertain, and the system must explore alternative guesses about that. In spoken language understanding systems we encounter a very dramatic instance of this phenomenon. The speech recognition component of such a system usually generates many different guesses as to what the input word sequence might be, but it does not have enough information to choose between them. Just filtering out the 'ungrammatical' candidates is of relatively little help in this situation. The competence grammar of a language, being a characterization of the set of possible sentences and their structures, is clearly not intended to account for the disambiguation behavior of language users. Psycholinguistics and language technology, however, must account for such behavior; they must do this by embedding linguistic competence grammars in a proper theory of language performance.

There are many different criteria that play a role in human disambiguation behavior. First of all we should note that syntactic disambiguation is to some extent a side-effect of semantic disambiguation. People prefer plausible interpretations to implausible ones -- where the plausibility of an interpretation is assessed with respect to the specific semantic/pragmatic context at hand, taking into account conventional world knowledge (which determines what beliefs and desires we tend to attribute to others), social conventions (which determine what beliefs and desires tend to get verbally expressed), and linguistic conventions (which determine how they tend to get verbalized).

If we bracket out the influence of semantics and context, we notice another important factor that influences human disambiguation behavior: the frequency of occurrence of lexical items and syntactic structures. It has been established that (1) people register frequencies and frequency-differences (e.g. Hasher & Chromiak, 1977; Kausler & Puckett, 1980; Pearlmutter & MacDonald, 1992), (2) analyses that a person has experienced before are preferred to analyses that must be newly constructed (e.g. Hasher & Zacks, 1984; Jacoby & Brooks, 1984; Fenk-Oczlon, 1989), and (3) this preference is influenced by the frequency of occurrence of analyses: more frequent analyses are preferred to less frequent ones (e.g. Fenk-Oczlon, 1989; Mitchell et al., 1992; Juliano & Tanenhaus, 1993).

These findings are not surprising -- they are predicted by general information-theoretical considerations. A system confronted with an ambiguous signal may optimize its behavior by taking into account which interpretations are more likely to be correct than others -- and past occurrence frequencies may be the most reliable indicator for these likelihoods.

3. From probabilistic competence-grammars to data-oriented parsing

In the previous section we saw that the human language processing system seems to estimate the most probable analysis of a new input sentence, on the basis of successful analyses of previously encountered ones. But how is this done? What probabilistic information does the system derive from its past language experiences? The set of sentences that a language allows may best be viewed as infinitely large, and probabilistic information is used to compare alternative analyses of sentences never encountered before. A finite set of probabilities of units and combination operations must therefore be used to characterize an infinite set of probabilities of sentence-analyses.

This problem can only be solved if a more basic, non-probabilistic one is solved first: we need a characterization of the complete set of possible sentence-analyses of the language. As we saw before, that is exactly what the competence-grammars of theoretical syntax try to provide. Most probabilistic disambiguation models therefore build directly on that work: they characterize the probabilities of sentence-analyses by means of a "stochastic grammar", constructed out of a competence grammar by augmenting the rules with application probabilities derived from a corpus. Different syntactic frameworks have been extended in this way. Examples are Stochastic Context-Free Grammar (Suppes, 1970; Sampson, 1986; Black et al., 1992), Stochastic Lexicalized Tree-Adjoining Grammar (Resnik, 1992; Schabes, 1992), Stochastic Unification-Based Grammar (Briscoe & Carroll, 1993) and Stochastic Head-Driven Phrase Structure Grammar (Brew, 1995).

A statistically enhanced competence grammar of this sort defines all sentences of a language and all analyses of these sentences. It also assigns probabilities to each of these sentences and each of these analyses. It therefore makes definite predictions about an important class of performance phenomena: the preferences that people display when they must choose between different sentences (in language production and speech recognition), or between alternative analyses of sentences (in disambiguation).

The accuracy of these predictions, however, is necessarily limited. Stochastic grammars assume that the statistically significant language units coincide exactly with the lexical items and syntactic rules employed by the competence grammar. The most obvious case of frequency-based bias in human disambiguation behavior therefore falls outside their scope: the tendency to assign previously seen interpretations rather than innovative ones to platitudes and conventional phrases. Platitudes and conventional phrases demonstrate that syntactic constructions of arbitrary size and complexity may be statistically important, also if they are completely redundant from the point of view of a competence grammar.

Stochastic grammars which define their probabilities on minimal syntactic units thus have intrinsic limitations as to the kind of statistical distributions they can describe. In particular, they cannot account for the statistical biases which are created by frequently occurring complex structures. (For a more detailed discussion regarding some specific formalisms, see Bod (1995, Ch. 3).) The obvious way to remedy this, is to allow redundancy: to specify statistically significant complex structures as part of a "phrasal lexicon", even though the grammar could already generate these structures in a compositional way. To be able to do that, we need a grammar formalism which builds up a sentence structure out of explicitly specified component structures: a "Tree Grammar" (cf. Fu 1982). The simplest kind of Tree Grammar that might fit our needs is the formalism known as Tree Substitution Grammar (TSG).

A Tree Substitution Grammar describes a language by specifying a set of arbitrarily complex "elementary trees". The internal nodes of these trees are labelled by non-terminal symbols, the leaf nodes by terminals or non-terminals. Sentences are generated by a "tree rewrite process": if a tree has a leaf node with a non-terminal label, substitute on that node an elementary tree with that root label; repeat until all leaf nodes are terminals.

Tree Substitution Grammars can be arbitrarily redundant: there is no formal reason to disallow elementary trees which can also be generated by combining other elementary trees. Because of this property, a probabilistically enhanced TSG could model in a rather direct way how frequently occurring phrases and structures influence a language user's preferences and expectations: we could design a very redundant TSG, containing elementary trees for all statistically relevant phrases, and then assign the proper probabilities to all these elementary trees.

If we want to explore the possibilities of such Stochastic Tree Substitution Grammars (STSG's), the obvious next question is: what are the statistically relevant phrases? Suppose we have a corpus of utterances sampled from the population of expected inputs, and annotated with labelled constituent trees representing the contextually correct analyses of the utterances. Which subtrees should we now extract from this corpus to serve as elementary trees in our STSG?

There may very well be constraints on the form of cognitively relevant subtrees, but currently we do not know what they are. Note that if we only use subtrees of depth 1, the TSG is non-redundant: it would be equivalent to a CFG. If we introduce redundancy by adding larger subtrees, we can bias the analysis of previously experienced phrases and patterns in the direction of their previously experienced structures. We certainly want to include the structures of complete constituents and sentences for this purpose, but we may also want to include many partially lexicalized syntactic patterns.

Are there statistical constraints on the elementary trees that we want to consider? Should we only employ the most frequently occurring ones? That is not clear either. Psychological experiments have confirmed that the interpretation of ambiguous input is influenced by the frequency of occurrence of various interpretations in one's past experience . Apparently, the individual occurrences of these interpretations had a cumulative effect on the cognitive system. This implies that, at the time of a new occurrence, there is a memory of the previous occurrences. And in particular, that at the time of the second occurrence, there is a memory of the first. Frequency effects can only build up over time on the basis of memories of unique occurrences. The simplest way to allow this to happen is to store everything.

We thus arrive at a memory-based language processing model, which employs a corpus of annotated utterances as a representation of a person's past language experience, and analyses new input by means of an STSG which uses as its elementary trees all subtrees that can be extracted from the corpus, or a large subset of them. This approach has been called Data-Oriented Parsing (DOP). As we just summarized it, this model is crude and underspecified, of course. To build working sytems based on this idea, we must be more specific about subtree selection, probability calculations, parsing algorithms, and disambiguation criteria. These issues will be considered in the next few sections of this paper.

But before we do that, we should zoom out a little bit and emphasize that we do not expect that a simple STSG model as just sketched will be able to account for all linguistic and psycholinguistic phenomena that we may be interested in. We employ Stochastic Tree Substitution Grammar because it is a very simple kind of probabilistic grammar which allows us nevertheless to take into account the probabilities of arbitrarily complex subtrees. We do not believe that a corpus of contextless utterances with labelled phrase structure trees is an adequate model of someone's language experience, nor that syntactic processing is necessarily limited to subtree-composition. To build more adequate models, the corpus annotations will have to be enriched considerably, and more complex processes will have to be allowed in extracting data from the corpus as well as in analysing the input.

The general approach proposed here should thus be distinguished from the specific instantiations discussed in this paper. We can in fact articulate the overall idea fairly explicitly by indicating what is involved in specifying a particular technical instantiation (cf. Bod, 1995). To describe a specific "data-oriented processing" model, four components must be defined:

a formalism for representating utterance-analyses,

an extraction function which specifies which fragments or abstractions of the utterance- analyses may be used as units in constructing an analysis of a new utterance,

the combination operations that may be used in putting together new utterances out of fragments or abstractions,

a probability model which specifies how the probability of an analysis of a new utterance is computed on the basis of the occurrence-frequencies of the fragments or abstractions in the corpus.

Construed in this way, the data-oriented processing framework allows for a wide range of different instantiations. It then boils down to the hypothesis that human language processing is a probabilistic process that operates on a corpus of representations of past language experiences -- leaving open how the utterance-analyses in the corpus are represented, what sub-structures or other abstractions of these utterance-analyses play a role in processing new input, and what the details of the probabilistic calculations are.

Current DOP models are typically concerned with syntactic disambiguation, and employ readily available corpora which consist of contextless sentences with syntactic annotations. In such corpora, sentences are annotated with their surface phrase structures as perceived by a human annotator. Constituents are labeled with syntactic category symbols: a human annotator has designated each constituent as belonging to one of a finite number of mutually exclusive classes which are considered as potentially inter-substitutable.

Corpus-annotation necessarily occurs against the background of an annotation convention of some sort. Formally, this annotation convention constitutes a grammar, and in fact, it may be considered as a competence grammar in the Chomskyan sense: it defines the set of syntactic structures that is possible. We do not presuppose, however, that the set of possible sentences, as defined by the representational formalism employed, coincides with the set of sentences that a person will judge to be grammatical. The competence grammar as we construe it, must be allowed to overgenerate: as long as it generates a superset of the grammatical sentences and their structures, a properly designed probabilistic disambiguation mechanism may be able to distinguish grammatical sentences and grammatical structures from their ungrammatical or less grammatical alternatives. An annotated corpus can thus be viewed as a stochastic grammar which defines a subset of the sentences and structures allowed by the annotation scheme, and which assigns empirically motivated probabilities to each of these sentences and structures.

The current paper thus explores the properties of some varieties of a language-processing model which embodies this approach in a stark and simple way. The model demonstrates that memory-based language-processing is possible in principle. For certain applications it performs better already than some probabilistically enhanced competence-grammars, but its main goal is to serve as a starting point for the development of further refinements, modifications and generalizations.

4. A Simple Data-Oriented Parsing Model: DOP1

We will now look in some detail at one simple DOP model, which is known as DOP1 (Bod 1992, 1993a, 1995). Consider a corpus consisting of only two trees, labeled with conventional syntactic categories:

Figure 1. Imaginary corpus of two trees.

Various subtrees can be extracted from the trees in such a corpus. The subtrees we consider are: (1) the trees of complete constituents (including the corpus trees themselves, but excluding individual terminal nodes); and (2) all trees that can be constructed out of these constituent trees by deletiing proper subconstituent trees and replacing them by their root nodes.

The subtree-set extracted from a corpus defines a Stochastic Tree Substitution Grammar. The stochastic sentence generation process of DOP1 employs only one operation for combining subtrees, called "composition", indicated as o. The composition-operation identifies the leftmost nonterminal leaf node of one tree with the root node of a second tree, i.e., the second tree is substituted on the leftmost nonterminal leaf node of the first tree. Starting out with the "corpus" of Figure 1 above, for instance, the sentence "She saw the dress with the telescope", may be generated by repeated application of the composition operator to corpus subtrees in the following way:

Figure 2. Derivation and parse for "She saw the dress with the telescope"

Several other derivations, involving different subtrees, may of course yield the same parse tree; for instance:

or

Figures 3/4. Two other derivations of the same parse for
"She saw the dress with the telescope".

Note also that, given this example corpus, the sentence we considered is ambiguous; by combining other subtrees, a different parse may be derived, which is analogous to the first rather than the second corpus sentence.

DOP1 computes the probability of substituting a subtree t on a specific node as the probability of selecting t among all corpus-subtrees that could be substituted on that node. This probability is equal to the number of occurrences of t, |t|, divided by the total number of occurrences of subtrees t' with the same root label as t. Let r (t) return the root label of t. Then we may write:

Since each node substitution is independent of previous substitutions, the probability of a derivation D = t1 o . . . o tn is computed by the product of the probabilities of the subtrees ti involved in it:

The probability of a parse tree is the probability that it is generated by any of its derivations. The probability of a parse tree T is thus computed as the sum of the probabilities of its distinct derivations D:

This probability may be viewed as a measure for the average similarity between a sentence analysis and the analyses of the corpus utterances: it correlates with the number of corpus trees that share subtrees with the sentence analysis, and also with the size of these shared fragments. Whether this measure constitutes an optimal way of weighing frequency and size against each other, is a matter of empirical investigation.

5. Computational Aspects of DOP1

We now consider the problems of parsing and disambiguation with DOP1. The algorithms we discuss do not exploit the particular properties of Data-Oriented Parsing; they work with any Stochastic Tree-Substitution Grammar.

5.1 Parsing

The algorithm that creates a parse forest for an input sentence is derived from algorithms that exist for Context-Free Grammars, which parse an input sentence of n words in polynomial (usually cubic) time. These parsers use a chart or well-formed substring table. They take as input a set of context-free rewrite rules and a sentence and produce as output a chart of labeled phrases. A labeled phrase is a sequence of words labeled with a category symbol which denotes the syntactic category of that phrase. A chart-like parse forest can be obtained by including pointers from a category to the other categories which caused it to be placed in the chart. Algorithms that accomplish this can be found in e.g. Kay (1980), Winograd (1983), Jelinek et al. (1990).

The chart parsing approach can be applied to parsing with Stochastic Tree-Substitution Grammars if we note that every elementary tree t of the STSG can be viewed as a context-free rewrite rule: root(t) —> yield(t) (cf. Bod 1992). In order to obtain a chart-like forest for a sentence parsed with an STSG, we label the phrases not only with their syntactic categories but with their full elementary trees. Note that in a chart-like forest generated by an STSG, different derivations that generate identical trees do not collapse. We will therefore talk about a derivation forest generated by an STSG (cf. Sima'an et al. 1994).

We now show what such a derivation forest may look like. Assume an example STSG which has the trees in Figure 5 as its elementary trees. A chart parser analysing the input string abcd on the basis of this STSG, will then create the derivation forest illustrated in Figure 6. The visual representation is based on Kay (1980): every entry (i,j) in the chart is indicated by an edge and spans the words between the i-th and the j-th position of a sentence. Every edge is labeled with linked elementary trees that constitute subderivations of the underlying subsentence. (The probabilities of the elementary trees, needed in the disambiguation phase, have been left out.)

Figure 5. Elementary trees of an example STSG.

Figure 6. Derivation forest for the string abcd.

Note that some of the derivations in the forest generate the same tree. By exhaustively unpacking the forest, four different derivations generating two different trees are obtained. Both trees are generated twice, by different derivations (with possibly different probabilities).

5.2 Disambiguation

The derivation forest defines all derivations (and thereby all parses) of the input sentence. Disambiguation now consists in choosing the most likely parse within this set of possibilities. The stochastic model of DOP1 as described above specifies a definite disambiguation criterion that may be used for this purpose: it assigns a probability to every parse tree by accumulating the probabilities of all its different derivations; these probabilities define the most probable parse (MPP) of a sentence.

We may expect that the most probable derivation (MPD) of a sentence, which is simpler to compute, often yields a parse which is identical to the most probable parse. If this is indeed the case, the MPD may be used to estimate the MPP (cf. Bod 1993a, Sima'an 1995). We now discuss first the most probable derivation and then the most probable parse.

5.2.1 The most probable derivation

A cubic time algorithm for computing the most probable derivation of a sentence can be designed on the basis of the well-known Viterbi algorithm (Viterbi 1967; Jelinek et al. 1990; Sima'an 1995). The basic idea of Viterbi is the elimination of low probability subderivations in a bottom-up fashion. Two different subderivations of the same part of the sentence whose resulting subparses have the same root can both be developed (if at all) to derivations of the whole sentence in the same ways. Therefore, if one of these two subderivations has a lower probability, it can be eliminated.

The computation of the most probable derivation from a chart generated by an STSG can be accomplished by selecting at every chart-entry the most probable subderivation for each root-node, while the other subderivations for that root-node are eliminated. We will not give the full algorithm here, but its structure can be gleaned from the algorithm which computes the probability of the most derivation:

Algorithm 1: Computing the probability of the most probable derivation

Given an STSG, let S denote its start non-terminal, R denote its set of elementary trees, and P denote its probability function over the elementary trees. Let us assume for the moment that the elementary trees of the STSG are in Chomsky Normal Form (CNF), i.e. every elementary tree has either two frontier nodes both labeled by non-terminals, or one frontier node labeled by a terminal. An elementary tree t that has root label A and a sequence of ordered frontier labels H is represented by A—>t H. Let the triple <A, i, j> denote the fact that non-terminal A is in chart entry (i, j) after parsing the input string W1,...,Wn; this implies that the STSG can derive the substring Wi+1,...,Wj, starting with an elementary tree that has root label A. The probability of the MPD of string W1,...,Wn, represented as PPMPD, is computed recursively as follows:

where

Obviously, if we drop the CNF assumption, we may apply exactly the same strategy. And by introducing some bookkeeping to keep track of the subderivations which yield the highest probabilities at each step, we get an algorithm which actually computes the most probable derivation. (For some more detail, see Sima'an et al. 1994.)

5.2.2 The most probable parse

The most probable parse tree of a sentence cannot be computed in deterministic polynomial time: Sima'an (1996b) proved that for STSG's the problem of computing the most probable parse is NP-hard. This does not mean, however, that every disambiguation algorithm based on this notion is necessarily intractable. We will now investigate to what extent tractability may be achieved if we forsake analytical probability calculations, and are satisfied with estimations instead.

Because the derivation forest specifies a statistical ensemble of derivations, we may employ the Monte Carlo method (Hammersley & Handscomb 1964) for this purpose: we can estimate parse tree probabilities by sampling a suitable number of derivations from this ensemble, and observing which parse tree results most frequently from these derivations.

We have seen that a best-first search, as accomplished by Viterbi, can be used for computing the most probable derivation from the derivation forest. In an analogous way, we may conduct a random-first search, which selects a random derivation from the derivation forest by making, for each node at each chart-entry, a random choice between the different alternative subtrees on the basis of their respective substitution probabilities. By iteratively generating several random derivations we can estimate the most probable parse as the parse which results most often from these derivations. (The probability of a parse is the probability that any of its derivations occurs.) According to the Law of Large Numbers, the most frequently generated parse converges to the most probable parse as we increase the number of derivations that we sample.

This strategy is exemplified by the following algorithm (Bod 1993b, 95):

Algorithm 2: Sampling a random derivation

Given a derivation forest of a sentence of n words, consisting of labeled entries (i,j) that span the words between the ith and the jth position of the sentence. Every entry is labeled with elementary trees, each with its probability and, for every non-terminal leaf node, a pointer to the relevant sub-entry. (Cf. Figure 6 in Section 5.1 above.) Sampling a derivation from such a chart consists of choosing at random one of the elementary trees for every root-node at every labeled entry (e.g. bottom-up, breadth-first):

for length := 1 to n do
    for start := 0 to n - length do
         for each root node X in chart-entry (start, start + length) do
               select at random a tree from the distribution of elementary trees with root node X;
                eliminate the other elementary trees with root node X from this chart-entry

The resulting randomly pruned derivation forest trivially defines one "random derivation" for the whole sentence: take the elementary tree of chart-entry (0, n) and recursively substitute the elementary subtrees of the relevant sub-entries on non-terminal leaf nodes.

The parse tree that results from this derivation constitutes a first guess for the most probable parse. A more reliable guess can be computed by sampling a larger number of random derivations, and selecting the parse which results most often from these derivations. How large a sample set should be chosen?

Let us first consider the probability of error: the probability that the parse that is most frequently generated by the sampled derivations, is in fact not equal to the most probable parse. An upper bound for this probability is given by
where the different values of i are indices corresponding to the different parses, 0 is the index of the most probable parse, pi is the probability of parse i; and N is the number of derivations that was sampled (cf. Hammersley & Handscomb 1964).

This upper bound on the probability of error becomes small if we increase N, but if there is an i with pi close to p0 (i.e., if there are different parses in the top of the sampling distribution that are almost equally likely), we must make N very large to achieve this effect. If there is no unique most probable parse, the sampling process will of course not converge on one outcome. In that case, we are interested in all of the parses that outrank all the other ones. But also when the probabilities of the most likely parses are very close together without being exactly equal, we may be interested not in the most probable parse, but in the set of all these almost equally highly probable parses. This reflects the situation in which there is an ambiguity which cannot be resolved by probabilistic syntactic considerations.

We conclude, therefore, that the task of a syntactic disambiguation component is the calculation of the probability distribution of the various possible parses, and only in the case of a forced choice experiment we choose the parse with the highest probability from this distribution (cf. Bod & Scha 1997). When we estimate this probability distribution by statistical methods, we must establish the reliability of this estimate. This reliability is characterized by the probability of significant errors in the estimates of the probabilities of the various parses.

If a parse has probability pi, and we try to estimate the probability of this parse by its frequency in a sequence of N independent samples, the variance in the estimated probability is pi(1 - pi)/N. Since 0 < pi ² 1, the variance is always smaller than or equal to 1/(4N). Thus, the standard error s, which is the square root of the variance, is always smaller than or equal to 1/(2 sqrt(N)). This allows us to calculate a lower bound for N given an upper bound for s, by N ³ 1/(4 square(s)). For instance, we obtain a standard error s ² 0.05 if N ³ 100.

We thus arrive at the following algorithm:

Algorithm 3: Estimating the parse probabilities

Given a derivation forest of a sentence and a threshold sM for the standard error:

N := the smallest integer larger than 1/(4 square(sM))

repeat N times:

sample a random derivation from the derivation forest;
store the parse generated by this derivation;

for each parse i:

estimate the conditional probability given the sentence by pi := #(i) / N

In the case of a forced choice experiment we select the parse with the highest probability from this distribution. Rajman (1995) gives a correctness proof for Monte Carlo disambiguation; he shows that the probability of sampling a parse i from a derivation forest of a sentence w is equal to the conditional probability of i given w: P(i|w).

Note that this algorithm differs essentially from the disambiguation algorithm given in Bod (1995), which increases the sample size until the probability of error of the MPP estimation has become sufficiently small. That algorithm takes exponential time in the worst case, though this was overlooked in the complexity discussion in Bod (1995). (This was brought to our attention in personal conversation by Johnson (1995) and in writing by Goodman (1996, 1998).)

The present algorithm (from Bod & Scha 1996, 1997) therefore focuses on estimating the distribution of the parse probabilities; it assumes a value for the maximally allowed standard error (e.g. 0.05), and samples a number of derivations which is guaranteed to achieve this; this number is quadratic in the chosen standard error. Only in the case of a forced choice experiment, the most frequently occurring parse is selected from the sample distribution.

5.2.3 Optimizations

In the past few years, several optimizations have been proposed for disambiguating with STSG. Sima'an (1995, 1996a) gives an optimization for computing the most probable derivation which starts out by using only the CFG-backbone of an STSG; subsequently, the constraints imposed by the STSG are employed to further restrict the parse space and to select the most probable derivation. This optimization achieves linear time complexity in STSG size without risking an impractical increase of memory-use. Bod (1993b, 1995) proposes to use only a small random subset of the corpus subtrees (5%) so as to reduce the search space for computing the most probable parse. Sekine and Grishman (1995) use only subtrees rooted with S or NP categories, but their method suffers considerably from undergeneration. Goodman (1996) proposes a different polynomial time disambiguation strategy which computes the so-called "maximum constituents parse" of a sentence (i.e. the parse which maximizes the expected number of correct constituents) rather than the most probable parse or most probable derivation. However, Goodman also shows that the "maximum constituents parse" may return parse trees that cannot be produced by the subtrees of DOP1 (Goodman 1996: 147). Chappelier & Rajman (1998) and Goodman (1998) give some optimizations for selecting a random derivation from a derivation forest. For a more extensive discussion of these and some other optimization techniques see Bod (1998a) and Sima'an (1999).

6. Experimental Properties of DOP1

In this section, we establish some experimental properties of DOP1. We will do so by studying the impact of various fragment restrictions.

6.1 Experiments on the ATIS corpus

We first summarize a series of pilot experiments carried out by Bod (1998a) on a set of 750 sentence analyses from the Air Travel Information System (ATIS) corpus (Hemphill et al. 1990) that were annotated in the Penn Treebank (Marcus et al. 1993). [1] These experiments focussed on tests about the Most Probable Parse as defined by the original DOP1 probability model. [2] Their goal was not primarily to establish how well DOP1 would perform on this corpus, but to find out how the accuracies obtained by "undiluted" DOP1 compare with the results obtained by more restricted STSG-models which do not employ the complete set of corpus subtrees as their elementary trees.

We use the blind testing method, dividing the 750 ATIS trees into a 90% training set of 675 trees and a 10% test set of 75 trees. The division was random except for one constraint: that all words in the test set actually occurred in the training set. [3] The 675 training set trees were converted into fragments (i.e. subtrees) and were enriched with their corpus probabilities. The 75 sentences from the test set served as input sentences that were parsed with the subtrees from the training set and disambiguated by means of the algorithms described in the previous section. The most probable parses were estimated from probability distributions of 100 sampled derivations. We use the notion of parse accuracy as our accuracy metric, defined as the percentage of the most probable parses that are identical to the corresponding test set parses.

Because the MPP estimation is a fairly costly algorithm, we have not yet been able to repeat all our experiments for different training-set/test-set splits, to obtain average results with standard deviations. We made one exception, however. We will very often be comparing the results of an experiment with the results obtained when employing all corpus-subtrees as elementary trees; therefore, it was important to establish at least that the parse accuracy obtained in this fashion (which was 85%) was not due to some unlikely random split.

On 10 random training/test set splits of the ATIS corpus we achieved an average parse accuracy of 84.2% with a standard deviation of 2.9%. Our 85% base line accuracy lies thus solidly within the range of values predicted by the more extentensive experiment.

The impact of overlapping fragments: MPP vs. MPD

The stochastic model of DOP1 can generate the same parse in many different ways; the probability of a parse must therefore be computed as the sum of the probabilities of all its derivations. We have seen, however, that the computation of the Most Probable Parse according to this model has an unattractive complexity, whereas the Most Probable Derivation is much easier to compute. We may therefore wonder how often the parse generated by the Most Probable Derivation is in fact the correct one: perhaps this method constitutes a good approximation of the Most Probable Parse, and can achieve very similar parse accuracies. And we cannot exclude that it might yield even better accuracies, if it somehow compensates for bad properties of the stochastic model of DOP1. For instance, by summing up over probabilities of several derivations, the Most Probable Parse takes into account overlapping fragments, while the Most Probable Derivation does not. It is not a priori obvious whether we do or do not want this property.

We thus calculated the accuracies based on the analyses generated by the Most Probable Derivations of the test sentences. The parse accuracy obtained by the trees generated by the Most Probable Derivations was 69%, which is lower than the 85% base line parse accuracy obtained by the Most Probable Parse. We conclude that overlapping fragments play an important role in predicting the appropriate analysis of a sentence.

The impact of fragment size

Next, we tested the impact of the size of the fragments on the parse accuracy. Large fragments capture more lexical/syntactic dependencies than small ones. We investigated to what extent this actually leads to better predictions of the appropriate parse. We therefore performed experiments with versions of DOP1 where the fragment collection is restricted to subtrees with a certain maximum depth (where the depth of a tree is defined as the length of the longest path from the root to a leaf). For instance, restricting the maximum depth of the subtrees to 1 gives us fragments that cover exactly one level of constituent structure, which makes DOP1 equivalent to a stochastic context-free grammar (SCFG). For a maximal subtree depth of 2, we obtain fragments that also cover two levels of constituent structure, which capture some more lexical/syntactic dependencies, etc. The following table shows the results of these experiments, where the parse accuracy for each maximal depth is given for both the most probable parse and for the parse generated by the most probable derivation (the accuracies are rounded off to the nearest integer).

Table 1. Accuracy increases if larger corpus fragments are used

The table shows an increase in parse accuracy, for both the most probable parse and the most probable derivation, when enlarging the maximum depth of the subtrees. The table confirms that the most probable parse yields better accuracy than the most probable derivation, except for depth 1 where DOP1 is equivalent to an SCFG (and where every parse is generated by exactly one derivation). The highest parse accuracy reaches 85%.

The impact of fragment lexicalization

We now consider the impact of lexicalized fragments on the parse accuracy. By a lexicalized fragment we mean a fragment whose frontier contains one or more words. The more words a fragment contains, the more lexical (collocational) dependencies are taken into account. To test the impact of lexicalizationon the parse accuracy, we performed experiments with different versions of DOP1 where the fragment collection is restricted to subtrees whose frontiers contain a certain maximum number of words; the maximal subtree depth was kept constant at 6.

These experiments are particularly interesting since we can simulate lexicalized grammars in this way. Lexicalized grammars have become increasingly popular in computational linguistics (e.g. Schabes 1992; Srinivas & Joshi 1995; Collins 1996, 1997; Charniak 1997; Carroll & Weir 1997). However, all lexicalized grammars that we know of restrict the number of lexical items contained in a rule or elementary tree. It is a significant feature of the DOP approach that we can straightforwardly test the impact of the number of lexical items allowed.

The following table shows the results of our experiments, where the parse accuracy is given for both the most probable parse and the most probable derivation.

Table 2. Accuracy as a function of the maximum number of words in fragment frontiers.

The table shows an initial increase in parse accuracy, for both the most probable parse and the most probable derivation, when enlarging the amount of lexicalization that is allowed. For the most probable parse, the accuracy is stationary when the maximum is enlarged from 4 to 6 words, but it increases again if the maximum is enlarged to 8 words. For the most probable derivation, the parse accuracy reaches its maximum already at a lexicalization bound of 4 words. Note that the parse accuracy deteriorates if the lexicalization bound exceeds 8 words. Thus, there seems to be an optimal lexical maximum for the ATIS corpus. The table confirms that the most probable parse yields better accuracy than the most probable derivation, also for different lexicalization sizes.

The impact of fragment frequency

We may expect that highly frequent fragments contribute to a larger extent to the prediction of the appropriate parse than very infrequent fragments. But while small fragments can occur very often, most larger fragments typically occur once. Nevertheless, large fragments contain much lexical/structural context, and can parse a large piece of an input sentence at once. Thus, it is interesting to see what happens if we systematically remove low-frequency fragments. We performed an additional set of experiments by restricting the fragment collection to subtrees with a certain minimum number of occurrences, but without applying any other restrictions.

Table 3. Accuracy decreases if lower bound on fragment frequency increases
(for the most probable parse).

The results, presented in Table 3, indicate that low frequency fragments contribute significantly to the prediction of the appropriate analysis: the parse accuracy seriously deteriorates if low frequency fragments are discarded. This seems to contradict common wisdom that probabilities based on sparse data are not reliable. Since especially large fragments are once-occurring events, there seems to be a preference in DOP1 for an occurrence-based approach if enough context is provided: large fragments, even if they occur once, tend to contribute to the prediction of the appropriate parse, since they provide much contextual information. Although these fragments have very low probabilities, they tend to induce the most probable parse because fewer fragments are needed to construct a parse.

In Bod (1998a), the impact of some other fragment restrictions is studied. Among other things, it is shown there that the parse accuracy decreases if subtrees with only non-head words are eliminated.

6.2 Experiments on larger corpora: the SRI-ATIS corpus and the OVIS corpus

In the following experiments (summarized from Sima'an 1999) [4] we only employ the most probable derivation rather than the most probable parse. Since the most probable derivation can be computed more efficiently than the most probable parse (see section 5), it can be tested more extensively on larger corpora. The experiments were conducted on two domains: the Amsterdam OVIS tree-bank (Bonnema et al. 1997) and the SRI-ATIS tree-bank (Carter 1997). [5] It is worth noting that the SRI-ATIS tree-bank differs considerably from the Penn Treebank ATIS-corpus that was employed in the experiments reported in the preceding subsection.

In order to acquire workable and accurate DOP1 models from larger tree-banks, a set of heuristic criteria is used for selecting the fragments. Without these criteria, the number of subtrees would not be manageable. For example, in OVIS there are more than hunderd and fifty million subtrees. Even when the subtree depth is limited to e.g. depth five, the number of subtrees in the OVIS tree-bank remains more than five million. The subtree selection criteria are expressed as constraints on the form of the subtrees that are projected from a tree-bank into a DOP1 model. The constraints are expressed as upper-bounds on: the depth (d), the number of substitution-sites (n), the number of terminals (l) and the number of consecutive terminals (L) of the subtree. These constraints apply to all subtrees but the subtrees of depth 1, i.e. subtrees of depth 1 are not subject to these selection criteria. In the sequel we represent the four upper-bounds by the short notation ddnnllLL. For example, d4n2l7L3 denotes a DOP STSG obtained from a tree-bank such that every elementary tree has at most depth 4, and a frontier containing at most 2 substitution sites and 7 terminals; moreover, the length of any consecutive sequence of terminals on the frontier of that elementary tree is limited to 3 terminals. Since all projection parameters except for the upper-bound on the depth are usually a priori fixed, the DOP1 STSG obtained under a depth upper-bound that is equal to an integer i will be represented by the short notation DOP(i).

We used the following evaluation metrics: Recognized (percentage of recognized sentences), TLC (Tree-Language Coverage: the percentage of test set parses that is in the tree language of the DOP1 STSG), exact match (either syntactic/semantic or only syntactic), labeled bracketing recall and precision (as defined in Black et al. 1991, concerning either syntactic plus semantic or only syntactic annotation). Below we summarize the experimental results pertaining to some of the issues that are addressed in Sima'an (1999). Some of these issues are similar to those addressed by the experiments with the most probable parse on the small ATIS tree-bank in subsection 6.1, e.g. the impact of fragment size. Other issues are orthogonal and supplement the issues addressed in the experiments concerning the most probable parse.

Experiments on the SRI-ATIS corpus

In this section we report experiments on syntactically annotated utterances from the SRI International ATIS tree-bank. The utterances of the tree-bank originate from the ATIS domain (Hemphill et al. 1990). For the present experiments, we have access to 13335 utterances that are annotated syntactically (we refer to this tree-bank here as the SRI-ATIS corpus/tree-bank). The annotation scheme originates from the linguistic grammar that underlies the Core Language Engine (CLE) system in Alshawi (1992). The annotation process is described in Carter (1997). For the experiments summarized below, some of the training parameters were fixed: the DOP models were projected under the parameters n2l4L3, while the subtree depth bound was allowed to vary.

A training-set of 12335 trees and a test-set of 1000 trees were obtained by partitioning the SRI-ATIS tree-bank randomly. DOP1 models with various depth upper-bound values were trained on the training-set and tested on the test-set. It is noteworthy that the present experiments are extremely time-consuming: for upper-bound values larger than three, the models become huge and very slow, e.g. it takes more than 10 days for DOP(4) to parse and disambiguate the test-set (1000 sentences). This is despite of the subtree upper bounds n2l4L3, which limit the total size of the subtree-set to less than three hunderd thousand subtrees.

Table 4. The impact of subtree depth (SRI-ATIS)

Table 4 (left-hand side) shows the results for depth upper-bound values up to four. An interesting and suprising result is that the exact-match of DOP(1) on this larger and different ATIS tree-bank (46%) is very close to the result reported in the preceding subsection. This also holds for the DOP(4) model (here 82.70% exact-match vs. 84% on the Penn Treebank ATIS corpus). More striking is that the present experiments concern the most probable derivation while the experiments of the preceding section concern the most probable parse. In the preceding subsection, the exact-match of the most probable derivation did not exceed 69%, while in this case it is 82.70%. This might be explained by the fact that the availability of more data is more crucial for the accuracy of the most probable derivation than the most probable parse. This is certainly not due to a simpler tree-bank or domain since the annotations here are as deep as those in the Penn Treebank. In any case, it would be interesting to consider the exact match that the most probable pare achieves on this tree-bank. This, however, will remain an issue for future research because computing the most probable parse is still infeasible on such large tree-banks.

The issue of course here is still the impact of employing deeper subtrees. Clearly, as the results show, the difference between the DOP(1) (the SCFG) and any deeper DOP1 model is at least 23% (DOP(2)). This difference increases to 36.70% at DOP(4). To validate this difference, we ran experiments with a four-fold cross-validation experiment that confirms the magnitude of this difference. In the right-hand side of table 4 means and standard-deviations for two DOP1 models are reported. Four independent partitions into test (1000 trees each) and training sets (12335 trees each) were used here for training and testing these DOP1 models. These results show a mean difference of 24% exact-match between DOP(2) and DOP(1) (SCFG): a substantial accuracy improvement achieved by memory-based parsing using DOP1, above simply using the SCFG underlying the tree-bank (as for instance in Charniak 1996).

Experiments on the OVIS corpus

The Amsterdam OVIS ("Openbaar Vervoer Informatie Systeem") corpus contains 10000 syntactically and semantically annotated trees. For detailed information concerning the syntactic and semantic annotation scheme of the OVIS tree-bank we refer the reader to Bonnema et al. (1997). In acquiring DOP1 models the semantic and syntactic annotations are treated as one annotation in which the labels of the nodes in the trees are a juxtaposition of the syntactic and semantic labels. Although this results in many more non-terminal symbols (and thus also DOP model parameters), Bonnema (1996) shows that the resulting syntactic+semantic DOP models are better than the mere syntactic DOP1 models. Since the utterances in the OVIS tree-bank are answers to questions asked by a dialogue system, these utterances tend to be short. The average sentence length in OVIS is 3.43 words. However, the results reported in Sima'an (1999) concern only sentences that contain at least two words; the number of those sentences is 6797 and their average length is 4.57 words. All DOP1 models are projected under the subtree selection criterion n2l7L3, while the subtree depth upper bound was allowed to vary.

It is interesting here to observe the effect of varying subtree depth on the performance of the DOP1 models on a tree-bank from a different domain. To this end, in a set of experiments one random partition of the OVIS tree-bank into a test-set of 1000 trees and a training set of 9000 was used to test the effect of allowing the projection of deeper elementary trees in DOP STSGs. DOP STSGs were projected from the training set with upper-bounds on subtree depth equal to 1, 3, 4, and 5. Each of the four DOP models was run on the sentences of the test-set (1000 sentences). The resulting parse trees were then compared to the correct test set trees.

Table 5. The impact of subtree depth (OVIS)

The lefthand side of table 5 above shows the results of these DOP1 models. Note that the recognition power (Recognized) is not affected by the depth upper-bound in any of the DOP1 models. This is because all models allowed all subtrees of depth 1 to be elementary trees. As the results show, a slight accuracy degradation occurs when the subtree depth upper bound is increased from four to five. This has been confirmed separately by earlier experiments conducted on similar material (Bonnema et al. 1997). An explanation for this degradation might be that including larger subtrees implies many more subtrees and sparse-data effects. It is not clear, therefore, whether this finding contradicts the Memory-Based Learning doctrine that maintaining all cases in the case-base is more profitable. It might equally well be the case that this problem is specific for the DOP1 model and the way it assigns probabilities to subtrees.

Table 5 also shows the means and standard deviations (stds) of two DOP1 models on five independent partitions of the OVIS tree-bank into training set (9000 trees) and test set (1000 trees). For every partition, the DOP1 model was trained only on the training set and then tested on the test set. We observe here the means and standard deviation of the models DOP(1) (the SCFG underlying the tree-bank) and DOP(4). Clearly, the difference between DOP(4) and DOP(1) observed in the preceding set of experiments is supported here. However, the improvement of DOP(4) on the SCFG in exact match and the other accuracy measures, although significant, is disappointing: it is about 2.85% exact match and 3.35% syntactic exact match. The improvement itself is indeed in line with the observation that DOP1 improves on the SCFG underlying the tree-bank. This can be seen as an argument for MBL syntactic analysis as opposed to the traditional enrichment of "linguistic" grammars with probabilities.

We have thus seen that there is quite some evidence for our hypothesis that the structural units of language processing cannot limited to a minimal set of rules, but should be defined in terms of a large set of previously seen structures. It is interesting to note that similar results are obtained by other instantiations of memory-based language processing. For example, van den Bosch & Daelemans (1998) report that almost every criterion for removing instances from memory yields worse accuracy than keeping full memory of learning material (for the task of predicting English word pronunciation). Despite this interesting convergence of results, there is a significant difference between DOP and other memory-based approaches. We will go into this topic in the following section.

7. DOP: probabilistic recursive MBL

In this section we make explicit the relationship between the present Data Oriented Processing (DOP) framework and the Memory-Based Learning framework. We show how the DOP framework extends the general MBL framework with probabilistic reasoning in order to deal with complex performance phenomena such as syntactic disambiguation. In order to keep this discussion concrete we also analyze the model DOP1, the first instantiation of the DOP framework. Subsequently, we contrast the DOP model to other existing MBL approaches that employ so called "flat" or "intermediate" descriptions as opposed to the hierarchical descriptions used by the DOP model.

7.1 Case-Based Reasoning

In the Machine Learning (ML) literature, e.g. Aamodt & Plaza (1994), Mitchell (1997), various names, e.g. Instance-Based, Case-Based, Memory-Based or Lazy, are used for a paradigm of learning that can be characterized by two properties:

(1) it involves a lazy learning algorithm that does not generalize over the training examples but stores them, and
(2) it involves lazy generalization during the application phase: each new instance is classified (on its own) on basis of its relationship to the stored training examples; the relationship between two instances is examined by means of so called similarity functions.

We will refer to this general paradigm by the term MBL (although the term Lazy Learning, as Aha (1998) suggests, might be more suitable). There are various forms of MBL that differ in several respects. In this study we are specifically interested in the Case-Based Reasoning (CBR) variant of MBL (Kolodner 1993, Aamodt & Plaza 1994).

Case-Based Reasoning differs from other MBL approaches, e.g. k-nearest neighbor methods, in that it does not represent the instances of the task concept [6] as real-valued points in an n-dimensional Euclidean space; instead, CBR methods represent the instances by means of complex symbolic descriptions, e.g. graphs (Aamodt & Plaza 1994, Mitchell 1997). This implies that CBR methods require more complex similarity functions. It also implies that CBR methods view their learning task in a different way than other MBL methods: while the latter methods view their learning task as a classification problem, CBR methods view their learning task as the construction of classes for input instances by reusing parts of the stored classified training-instances.

According to overviews of the CBR literature, e.g. Mitchell (1997), Aha & Wettschereck (1997), there exist various CBR methods that address a wide variety of tasks, e.g. conceptual designs of mechanical devices (Sycara et al. 1992), reasoning about legal cases (Ashley 1990), scheduling and planning problems (Veloso & Carbonell 1993) and mathematical integration problems (Elliot & Scott 1991). Rather than pursuing the infeasible task of contrasting DOP to each of these methods, we will firstly highlight the specific aspects of DOP as an extension to CBR. Subsequently we compare DOP to recent approaches that extend CBR with probabilistic reasoning.

7.2 The DOP framework and CBR

We will show that the first three components of a DOP model as described in the DOP framework of section 3 define a CBR method, and that the fourth component extends CBR with probabilistic reasoning. To this end, we will express each component in CBR terminology and then show how this specifies a CBR system. The first component of DOP, i.e. a formal representation of utterance-analyses, specifies the representation language of the instances and classes in the parsing task, i.e. the so called case description language. The second component, i.e. fragments definition, defines the units that are retrieved from memory when a class (tree) is being constructed for a new instance; the retrieved units are exactly the sub-instances and sub-classes that can be combined into instances and classes. The third component, i.e. definition of combination operations, concerns the definition of the constraints on combining the retrieved units into trees when parsing a new utterance. Together, the latter two components define exactly the retrieval, reuse and revision aspects of the CBR problem solving cycle (Aamodt & Plaza 1994). The similarity measure, however, is not explicitly specified in DOP but is implicit in a retrieval strategy that relies on simple string equivalence. Thus, the first three components of a DOP model specify exactly a simple CBR system for natural language analysis, i.e., a natural language parser. This system is lazy: it does not generalize over the tree-bank until it starts parsing a new sentence, and it defines a space of analyses for a new input sentence simply by matching and combining fragments from the case-base (i.e. tree-bank).

The fourth component of the DOP framework, however, extends the CBR approach for dealing with ambiguities in the definition of the case description language. It specifies how the frequencies of the units retrieved from the case-base define a probability value for the utterance that is being parsed. We may conclude that the four components of the DOP framework define a CBR method that uses possibly recursive case descriptions, string-matching for retrieval of cases, and a probabilistic model for resolving ambiguity. The latter property of DOP is crucial for the task that the DOP framework addresses: disambiguation. Disambiguation differs from the task that mainstream CBR approaches address, i.e. constructing a class for the input instance. Linguistic disambiguation involves classification under an ambiguous definition of the "case description language", i.e., the formal representation of the utterance analyses, which is usually a grammar. Since the fragments (second component of a DOP model) are defined under this "grammar", combining them by means of the combination operations (third component) usually defines an ambiguous CBR system: for classifying (i.e. parsing) a new instance (i.e. sentence), this CBR system may construct more than one analysis (i.e., class). The ambiguity of this CBR system is inherent to the fact that it is often infeasible to construct an unambiguous natural language formal representation of analyses. A DOP model solves this ambiguity when classifying an instance by assigning a probability value to every constructed class in order to select the most probable one. Next we will show how these observations about the DOP framework apply to the DOP1 model for syntactic disambiguation.

7.3 DOP1 and CBR methods

In order to explore the differences between the DOP1 model and CBR methods, we will express DOP1 as an extension to a CBR parsing system. To this end, it is necessary to identify the case-base, the instance-space, the class-space, and the "similarity function" that DOP assumes. In DOP, the training tree-bank contains <string, tree> pairs that represent the classified instances, where string is an instance and tree is a class.

A DOP model memorizes in its case-base exactly the finite set of tree-bank trees. When parsing a new input sentence, the DOP model retrieves from the case-base all subtrees of the trees in its case-base and tries to use them for constructing trees for that sentence. Let us refer to the ordered sequence of symbols that decorate the leaf nodes of a subtree as the frontier of that subtree. Moreover, let us call the frontier of a subtree from the case-base a subsentential-form from the tree-bank. During the retrieval phase, a DOP1 model retrieves from its case-base all pairs <str, st>, where st is a subtree of the tree-bank and the string str is its frontier SSF. These subtrees are used for constructing classes, i.e. trees, for the input sentence using the substitution-operation. This operation enables constrained assemblage of sentences (instances) and trees (classes). The set of sentences and the set of trees that can be assembled from subtrees by means of substitution constitute respectively the instance-space and the class-space of a DOP1 model.

Thus, the instance-space and the class-space of a DOP1 model are defined by the Tree-Substitution Grammar (TSG) that employs the subtrees of the tree-bank trees as its elementary trees; this TSG is a recursive device that defines infinite instance- and class-spaces. However, this TSG, which represents the "runtime" expansion of DOP's case-base, does not generalize over the CFG that underlies the tree-bank (the case description language), since the two grammars have equal string-languages (instance-spaces) and equal tree-languages (class-spaces).

The probabilities that DOP1 attaches to the subtrees in the TSG are induced from the frequencies in the tree-bank and can be seen as subtree weights. Thus, the STSG that a DOP model projects from a tree-bank can be viewed as an infinite runtime case-base containing instance-class-weight triples that have the form <SSF, subtree, probability>.

Task and similarity function

The task implemented by a DOP1 model is disambiguation: the identification of the most plausible tree that the TSG assigns to the input sentence. Syntactic disambiguation is indeed a classification task in the presence of an infinite class-space. For an input sentence, this class-space is firstly limited to a finite set by the (usually) ambiguous TSG: only trees that the TSG constructs for that sentence by combining subtrees from the tree-bank are in the specific class-space of that sentence. Subsequently, the tree with the highest probability (according to the STSG) in this limited space is selected as the most plausible one for the input sentence. To understand the matching and retrieval processes of a DOP1 model, let us consider both steps of disambiguation separately.

In the first step, i.e. parsing, the similarity function that DOP1 employs is a simple recursive string-matching procedure. First, all substrings of the input sentence are matched against SSFs in the case-base (i.e. the TSG) and the subtrees corresponding to the matched SSFs are retrieved; every substring that matches an SSF is replaced by the label of the root node of the retrieved subtree (note that there can be many subtrees retrieved for the same SSF, their roots replace the substring as alternatives). This SSF-matching and subtree-retrieval procedure is recursively applied to the resulting set of strings until the last set of strings does not change.

Technically speaking, this "recursive matching process" is usually implemented as a parsing algorithm that constructs an efficient representation of all trees that the TSG assigns to the input sentence, called the parse-forest that the TSG constructs for that sentence (see section 5).

What is the role of probabilities in the similarity function?

Thus, rather than employing a Euclidean or any other metric to measure the distance between the input sentence and the sentences in the case-base, DOP1 resorts to a recursive matching process where similarity between SSFs is implemented as simple string-equivalence. Beyond parsing, which can be seen as the first step of disambiguation, DOP1 faces the ambiguity of the TSG, which follows from the ambiguity of the CFG underlying the tree-bank (i.e. the case description language definition). This is one important point where DOP1 deviates from mainstream CBR methods that usually employ unambiguous definitions of the case description language, or resort to (often ad hoc) heuristics that give a marginal role to disambiguation.

For natural language processing it is usually not feasible to construct unambiguous grammars. Therefore, the parse-forest that the parsing process constructs for an input sentence usually contains more than one tree. The task of selecting the most plausible tree from this parse-forest, i.e. syntactic disambiguation, constitutes the main task addressed by performance models of syntactic analysis such as DOP1. For disambiguation, DOP1 ranks the alternative trees in the parse-forest of the input sentence by computing a probability for every tree. Subsequently, it selects the tree with the highest probability from this space. It is interesting to consider how the probability of a tree in DOP1 relates to the matching process that takes place during parsing.

Given a parse-tree, we may view each derivation that generates it as a "successful" combination of subtrees from the case-base; to every such combination we assign a "matching-score" of 1. All sentence-derivations that generate a different parse-tree (including the ones that generate a different sentence) receive matching-score 0. The probability of a parse-tree as computed by DOP1 is in fact the expectation value (or mean) of the scores (with respect to this parse-tree) of all derivations allowed by the TSG; this expectation value weighs the score of every derivation by the probability of that derivation. The probability of a derivation is computed as the product of the probabilities of the subtrees that participate in it. Subtree probabilities, in their turn, are based on the frequency counts of the subtrees and are conditioned on the constraint embodied by the tree-substitution combination operation.

This brief description of DOP1 in CBR terminology shows that the following aspects of DOP1 are not present in other mainstream CBR methods: (i) both the instance- and class-spaces are infinite and are defined in terms of a recursive matching process embodied by a TSG-based parser that matches strings by equality, and then retrieves and combines the subtrees associated with the matching strings using the substitution-operation, (ii) the CBR task of constructing a tree (i.e. class) for a input sentence is further complicated in DOP1 by the ambiguity of this TSG, (iii) the "structural similarity function" that most other CBR methods employ is, therefore, implemented in DOP as a recursive process that is complemented by spanning a probability distribution over the instance- and class-spaces in order to facilitate disambiguation, and (iv) the probabilistic disambiguation process in DOP1 is conducted globally over the whole sentence rather than locally on parts of the sentence. Hence we may characterize DOP1 as a lazy probabilistic recursive CBR classifier that addresses the problem of global sentential-level syntactic disambiguation.

7.4 DOP and recent probabilistic extensions to CBR

Recent literature within the CBR approach advocates extending CBR with probabilistic reasoning. Waltz and Kasif (1996) and Kasif et al. (forthcoming) refer to the framework that combines CBR with probabilistic reasoning with the name Memory-Based Reasoning (MBR). Their arguments for this framework are based on the need for systems to be able to adapt to rapidly changing environments "where abstract truths are at best temporary or contingent". This work differs from DOP in at least two important ways: (i) it assumes a non-recursive finite class-space, and (ii) it employs probabilistic reasoning for inducing so called "adaptive distance metrics" (these are distance metrics that automatically change as new training material enters the system) rather than for disambiguation.

These differences imply that this approach does not take note of the specific aspects of the disambiguation task as found in natural language parsing, e.g., the need for recursive symbolic descriptions and that disambiguation lies at the hart of any performance task. The syntactic ambiguity problem, thus, has an additional dimension of complexity next to the dimensions that are addressed by the mainstream ML literature.

7.5 DOP vs. other MBL approaches in NLP

In this section we will concentrate on contrasting DOP to some MBL methods that are used for implementing natural language processing (NLP) tasks. Firstly we briefly address the relatively clear differences between methods based on variations of the k-Nearest Neighbor (k-NN) approach and DOP. Subsequently we discuss more thoroughly the recently introduced Memory-Based Sequence Learning (MBSL) method (Argamon et al. 1998) and how it relates to the DOP model.

7.5.1 k-NN vs. DOP

From the description of CBR methods earlier in this section, it became clear that the main difference between k-NN methods and CBR methods is that the latter employ complex data structures rather than feature vectors for representing cases. As mentioned before, DOP's case description language is further enhanced by recursion and complicated by ambiguity. Moreover, while k-NN methods classify their input in a partitioning of a real-valued multi-dimensional Euclidean space, the CBR methods (including DOP) must construct a class for an input instance. The similarity measures in k-NN methods are based on measuring the distance between the input instance and each of the instances in memory. In DOP, this measure is simplified during parsing to string-equivalence and complicated during disambiguation by a probabilistic ranking of the alternative trees of the input sentence. Of course, it is possible to imagine a DOP model that employs k-NN methods during the parsing phase, so that the string matching process becomes more complex than simple string-equivalence. In fact, a simple version of such an extension of DOP has been studied in Zavrel (1996) -- with inconclusive empirical results due to the lack of suitable training material.

Recently, extensions and enhancements of the basic k-NN methods (Daelemans et al. 1999a) have been applied to limited forms of syntactic analysis (Daelemans et al. 1999b). This work employs k-NN methods for very specific syntactic classification tasks (for instance for recognizing NP's or VP's, and for deciding on PP-attachment or subject/object relations), and then combines these classifiers into shallow parsers. The classifiers carry out their task on the basis of local information; some of them (e.g. subject/object) rely on preprocessing of the input string by other classifiers (e.g. NP- and VP-recognition). This approach has been tested successfully on the individual classification tasks and on shallow parsing, yielding state-of-the-art accuracy results.

This work differs from the DOP approach in important respects. First of all, it does not address the full parsing problem; it is not intended to deal with with arbitrary tree structures derived from an arbitrary corpus, but hardcodes a limited number of very specific syntactic notions. Secondly, the classification decisions (or disambiguation decisions) are based on a limited context which is fixed in advance. Thirdly, the approach employs similarity metrics rather than stochastic modeling techniques. Some of these features make the approach more efficient than DOP, and therefore easier to apply to large tree-banks. But at the same time this method shows clear limitations if we look at its applicability to general parsing tasks, or if we consider the disambiguation accuracy that can be achieved if only local information is taken into account .

7.5.2 Memory-Based Sequence Learning vs. DOP

Memory-Based Sequence Learning (MBSL), described in Aragamon et al. (1998) can be seen as analogous to a DOP model at the level of flat non-recursive linguistic descriptions. It is interesting to pursue this analogy by analysing MBSL in terms of the four components that constitute a DOP model (cf. the end of section 3 above). Since MBSL is a new method that seems closer to DOP1 than all other MBL methods discussed in this volume, we will first summarize how it works before we compare it with DOP1.

Like DOP1, an MBSL system works on the basis of a corpus of utterances annotated with labelled constituent structures. It assumes a different representation of these structures, however: an MBSL corpus consists of bracketed strings. Each pair of brackets delimits the borders of a substring of some syntactic category, e.g., Noun Phrase (NP) or Verb Phrase (VP). For every syntactic category, a separate MBSL system is constructed. Given a corpus containing bracketed strings of part-of-speech tags (pos-tags), an MBSL system stores all [7] the substrings that contain at least one bracket together with their occurrence counts; such substrings are called tiles. Moreover, the MBSL system stores all substrings stripped from the brackets together with their occurrence count. The positive evidence in the corpus for a given tile is computed as the ratio between the occurrence count of the tile to the total occurrence count of the substring obtained by stripping the tile from the brackets.

When an MBSL system is prompted to assign brackets to a new input sentence, it assigns all possible brackets to the sentence and then it computes the positive evidence for every pair of brackets on basis of its stored corpus. To this end, every subsequence of the input sentence surrounded by a (matching) pair of brackets is considered a candidate. A candidate together with the rest of the sentence is considered a situated candidate. To evaluate the positive evidence for a situated candidate, a tiling of that situated candidate is attempted on basis of the tiles that are stored. When tiling a situated candidate, tiles are retrieved from storage and placed such that they cover it entirely. Only tiles of sufficient positive evidence are retrieved for this purpose (a threshold on sufficient evidence is used). To specify how a cover is exactly obtained, it is necessary to define the notion of connecting tiles (with respect to a situated candidate). We say that a tile partially covers a situated candidate if the tile is equivalent to some substring of that situated candidate; the substring is then called a matching substring of the tile. Given a situated candidate and a tile T1 that partially covers it, tile T2 is called connecting to tile T1 iff T2 also partially-covers the situated candidate and the matching substrings of T1 and T2 overlap without being included in each other, or are adjacent to each other in the situated candidate. The shortest substring of a situated candidate that is partially covered by two connecting tiles is said to be covered by the two tiles; we also say that the two tiles constitute a cover of that substring. The notion of a cover is then generalized to a sequence of connecting tiles: a sequence of connecting tiles is an ordered sequence of tiles such that every tile is connecting to the tile that precedes it in that sequence. Hence, a cover of the situated candidate is a sequence of connecting tiles that covers it entirely. Crucially, there can be many different covers of a situated candidate.

The evidence score for a situated candidate is a function of the evidence for the various covers that can be found for it in the corpus. MBSL estimates this score by a heuristic function that combines various statistical measures concerning properties of the covers e.g., number of covers, number of tiles in a cover, size of overlap between tiles, etc. In order to select a bracketing for the input sentence, all possible situated candidates are ranked according to their scores. Starting with the situated candidate with the highest score C, the bracketing algorithm removes all other situated candidates that contain a pair of brackets that overlaps with the brackets in C. This procedure is repeated iteratively until it stabilizes. The remaining set of situated candidates defines a bracketing of the input sentence.

It is interesting to look at MBSL from the perspective of the DOP framework. Let us first consider its representation formalism, its fragment extraction function and its combination operation. The formalism that MBSL employs to represent utterance analyses is a part-of-speech--tagging of the sentences together with a bracketing that defines the target syntactic category. The fragments are the tiles that can be extracted from the training corpus. The combination operation is an operation that combines two tiles such that they connect. We cannot extend this analogy to include the covers, however, because MBSL considers the covers locally with respect to every situated candidate. If MBSL would have employed covers for a whole consistent bracketing of the input sentence, MBSL covers would have corresponded to DOP derivations, and consistent bracketings to parses. Another important difference between MBSL and DOP shows up if we look at the way in which "disambiguation" or pruning of the bracketing takes place in MBSL. Rather than employing a probabilistic model, MBSL resorts to local heuristic functions with various parameters that must be tuned in order to achieve optimal results.

In summary, MBSL and DOP are analogous but nevertheless rather different MBL methods for dealing with syntactic ambiguity. The differences can be summarized by (i) the locally-based (MBSL) vs. globally-based (DOP) ranking strategy of alternative analyses, and (ii) the ad hoc heuristics for computing scores (MBSL) vs. the stochastic model (DOP) for computing probabilities.

Note also that the combination operation employed by the MBSL system allows a kind of generalization over the tree-bank that is not possible in DOP1. Because MBSL allows tiling situated candidates using tiles that contain one bracket (either a left or a right bracket) the matching pairs of brackets that result from a series of connecting tiles may delimit a string of part-of-speech-tags that cannot be constructed by nested pairs of matching brackets from the tree-bank (which is a kind of substitution-operation of bracketed strings).

In contrast to MBSL, DOP1's tree substitution-operation generates only SSFs that can be obtained by nesting SSFs from the tree-bank under the restrictive constraints of the substitution-operation. This implies that each pair of matching brackets that corresponds to an SSF is a matching pair of brackets that can be found in the tree-bank. DOP1 generates exactly the same set of trees as the CFG underlying the tree-bank, while MBSL generalizes beyond the string-language and the tree-language of this CFG. It is not obvious, however, that MBSL operates in terms of the most felicitous representation of hierarchic surface structure, and one may wonder to what extent the generalizations produced in this way are actually useful.

Before we close off this section, we should emphasize that DOP-models do not necessarily lack all generalization power. Some extensions of DOP1 have been developed that learn new subtrees (and SSFs) by allowing mismatches between categories (in particular between SSFs and part-of-speech-tags). For instance, Bod (1995) discusses a model which assigns non-zero probabilities to unobserved lexical items on the basis of Good-Turing estimation. Another DOP-model (LFG-DOP) employs a corpus annotated with LFG-structures, and allows generalization over feature values (Bod & Kaplan 1998). We are currently investigating the design of DOP-models with more powerful generalization capabilities.

8. Conclusion

In this paper we focused on the Memory-Based aspects of the Data Oriented Parsing (DOP) model. We argued that disambiguation is a central element of linguistic processing that can be most suitably modelled by a memory-based learning model. Furthermore, we argued that this model must employ linguistic grammatical descriptions and that it should employ probabilities in order to account for the psycholinguistic observations concerning the central role that frequencies play in linguistic processing. Based on these motivations we presented the DOP model as a probabilistic recursive Memory-Based model for linguistic processing. We discussed some computational aspects of a simple instantiation of it, called DOP1, that aims specifically at syntactic disambiguation.

We also summarized some of our empirical and experimental observations concerning DOP1 that are specifically related to its Memory-Based nature. We conclude that these observations provide convincing evidence for the hypothesis that the structural units for language processing should not be limited to a minimal set of grammatical rules but that they must be defined in terms of a (redundant) space of linguistic relations that are encountered in an explicit "memory" or "case-base" that represents linguistic experience. Moreover, although many of our empirical observations support the argument for a purely memory-based approach, we encountered other empirical observations that exhibit the strong tension that exists between this argument and various factors that are encountered in practice e.g. data-sparseness. We may conclude, however, that our empirical observations do support the idea that "forgetting exceptions is harmful" (Van den Bosch & Daelemans 1998).

Furthermore, we analyzed the DOP model within the MBL paradigm and contrasted it to other MBL approaches within NLP and outside it. We concluded that despite the many similarities between DOP and other MBL models, there are major differences. For example, DOP1 distinguishes itself from other MBL models in that (i) it deals with a classification task involving infinite instance spaces and class spaces described by an ambiguous recursive grammar, and (ii) it employs a disambiguation facility based on a stochastic extension of the syntactic case-base.

Endnotes

We employed a cleaned-up version of this corpus in which mistagged words had been corrected by hand, and analyses with so-called pseudo-attachments had been removed.

Bod (1998a) also presents experiments with extensions of DOP1 that generalize over the corpus fragments. DOP1 cannot cope with unknown words.

Bod (1995, 1998a) shows how to extend the model to overcome this limitation.

The experiments in Sima'an (1999) concern a comparison between DOP1 and a new model, called Specialized DOP (SDOP). Since this comparison is not the main issue here, we will summarize the results concerning the DOP1 model only. However, it might be of interest here to mention the conclusions of the comparison. In short, the SDOP model extends the DOP1 model with automatically inferred subtree-selection criteria. These criteria are determined by a new learning algorithm that specializes the annotation of the training tree-bank to the domain of language use represented by that tree-bank. The SDOP models acquired from a tree-bank are substantially smaller than the DOP1 models. Nevertheless, Sima'an (1999) shows that the SDOP models are at least as accurate and have the same coverage as DOP1 models.

The SRI-ATIS tree-bank was generously made available for these experiments by SRI International, Cambridge (UK).

A concept is a function; members of its domain are called instances and members of its range are called classes. The task or target concept is the function that the learning process tries to estimate.

In fact only limited context is allowed around a bracket, which means that not all of the substrings in the corpus are stored.

References

A. Aamodt and E. Plaza, 1994. "Case-Based Reasoning: Foundational issues, methodological variations and system approaches", AI Communications, 7. 39-59.

D. Aha, 1998. "The Omnipresence of Case-Based Reasoning in Science and Application", Knowledge-Based Systems 11(5-6), 261-273.

D. Aha and D. Wettschereck, 1997. "Case-Based Learning: Beyond Classification of Feature Vectors", ECML-97 invited paper. Also in MLnet ECML'97 workshop. MLnet News, 5:1, 8-11.

S. Abney, 1996. "Statistical Methods and Linguistics." In: Judith L. Klavans and Philip Resnik (eds.): The Balancing Act. Combining Symbolic and Statistical Approaches to Language. Cambridge (Mass.): MIT Press, pp. 1-26.

H. Alshawi, editor, 1992. The Core Language Engine, Boston: MIT Press.

H. Alshawi and D. Carter, 1994. "Training and Scaling Preference Functions for Disambiguation", Computational Linguistics.

K. Ashley, 1990. Modeling legal argument: Reasoning with cases and hypotheticals, Cambridge, MA: MIT Press.

M. van den Berg, R. Bod and R. Scha, 1994. "A Corpus-Based Approach to Semantic Interpretation", Proceedings Ninth Amsterdam Colloquium, Amsterdam, The Netherlands.

E. Black, S. Abney, D. Flickenger, C. Gnadiec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini and T. Strzalkowski, 1991. "A Procedure for Quantitatively Comparing the Syntactic Coverage of English", Proceedings DARPA Speech and Natural Language Workshop, Pacific Grove, Morgan Kaufmann.

A. van den Bosch and W. Daelemans, 1998. "Do not forget: Full memory in memory-based learning of word pronunciation", Proceedings of NeMLaP3/CoNLL98.

E. Black, J. Lafferty and S. Roukos, 1992. "Development and Evaluation of a Broad-Coverage Probabilistic Grammar of English-Language Computer Manuals", Proceedings ACL'92, Newark, Delaware.

R. Bod, 1992. "A Computational Model of Language Performance: Data Oriented Parsing", Proceedings COLING'92, Nantes.

R. Bod, 1993a. "Using an Annotated Language Corpus as a Virtual Stochastic Grammar", Proceedings AAAI'93, Morgan Kauffman, Menlo Park, Ca.

R. Bod, 1993b. "Monte Carlo Parsing". Proceedings of the Third International Workshop on Parsing Technologies. Tilburg/Durbuy.

R. Bod, 1995. Enriching Linguistics with Statistics: Performance Models of Natural Language, Ph.D. thesis, ILLC Dissertation Series 1995-14, University of Amsterdam.

R. Bod, 1998a. Beyond Grammar, CSLI Publications / Cambridge University Press, Cambridge.

R. Bod, 1998b. "Spoken Dialogue Interpretation with the DOP Model", Proceedings COLING-ACL'98, Montreal, Canada.

R. Bod and R. Kaplan, 1998. "A Probabilistic Corpus-Driven Model for Lexical-Functional Analysis", Proceedings COLING-ACL'98, Montreal, Canada.

R. Bod and R. Scha, 1996. Data-Oriented Language Processing. An Overview. Technical Report LP-96-13. Institute for Logic, Language and Computation, University of Amsterdam.

R. Bod and R. Scha, 1997. "Data-Oriented Language Processing." In S. Young and G. Bloothooft (eds.) Corpus-Based Methods in Language and Speech Processing, Kluwer Academic Publishers, Boston. 137-173.

R. Bonnema, 1996 Data-Oriented Semantics. Master's Thesis, Department of Computational Linguistics (Institute for Logic, Language and Computation).

R. Bonnema, R. Bod and R. Scha, 1997. "A DOP Model for Semantic Interpretation", Proceedings 35th Annual Meeting of the ACL / 8th Conference of the EACL, Madrid, Spain.

C. Brew, 1995. "Stochastic HPSG", Proceedings European chapter of the ACL'95, Dublin, Ireland.

T. Briscoe and J. Carroll, 1993. "Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars", Computational Linguistics 19(1), 25-59.

J. Carroll and D. Weir, 1997. "Encoding Frequency Information in Lexicalized Grammars", Proceedings 5th International Workshop on Parsing Technologies, MIT, Cambridge (Mass.).

D. Carter, 1997. "The TreeBanker: a Tool for Supervised Training of Parsed Corpora", Proceedings of the workshop on Computational Environments for Grammar Development and Linguistic Engineering, ACL/EACL'97, Madrid, Spain.

T. Cartwright and M. Brent, 1997. "Syntactic categorization in early language acquisition: formalizing the role of distributional analysis." Cognition 63, pp. 121-170.

J. Chappelier and M. Rajman, 1998. "Extraction stochastique d'arbres d'analyse pour le modèle DOP", Proceedings TALN 1998, Paris, France.

E. Charniak, 1996. "Tree-bank Grammars", Proceedings AAAI'96, Portland, Oregon.

E. Charniak, 1997. "Statistical Techniques for Natural Language Parsing", AI Magazine.

N. Chomsky, 1966. Cartesian Linguistics. A Chapter in the History of Rationalist Thought. New York: Harper & Row.

K. Church and R. Patil, 1983. Coping with Syntactic Ambiguity or How to Put the Block in the Box on the Table, MIT/LCS/TM-216.

J. Coleman and J. Pierrehumbert, 1997. "Stochastic Phonological Grammars and Acceptability", Proceedings Computational Phonology, Third Meeting of the ACL Special Interest Group in Computational Phonology, Madrid, Spain.

M. Collins, 1996. "A new statistical parser based on bigram lexical dependencies", Proceedings ACL'96, Santa Cruz (Ca.).

M. Collins, 1997. "Three generative lexicalised models for statistical parsing", Proceedings EACL-ACL'97, Madrid, Spain.

W. Daelemans, A. van den Bosch and J. Zavrel, 1999a. "Forgetting exceptions is harmful in language learning." Machine Learning 34, 11-41.

W. Daelemans, S. Buchholz and J. Veenstra, 1999b. "Memory-Based Shallow Parsing." To appear in CoNLL '99.

G. DeJong, 1981. "Generalizations based on explanations", Proceedings of the Seventh International Joint Conference on Artificial Intelligence.

G. DeJong and R. Mooney, 1986. "Explanation-Based Generalization: An alternative view", Machine Learning 1:2, 145-176.

J. Eisner, 1997. "Bilexical Grammars and a Cubic-Time Probabilistic Parser", Proceedings Fifth International Workshop on Parsing Technologies, Boston, Mass.

T. Elliot and P. Scott, 1991. "Instance-based and generalization-based learning procedures applied to solving integration problems", Proceedings of the 8th Conference of the Society for the Study of AI, Leeds, UK: Springer.

G. Fenk-Oczlon, 1989. "Word frequency and word order in freezes", Linguistics 27, 517-556.

K. Fu, 1982. Syntactic Pattern Recognition and Applications, Prentice-Hall.

J. Goodman, 1996. "Efficient Algorithms for Parsing the DOP Model", Proceedings Empirical Methods in Natural Language Processing, Philadelphia, PA.

J. Goodman, 1998. Parsing Inside-Out, Ph.D. thesis, Harvard University, Mass.

J. Hammersley and D. Handscomb, 1964. Monte Carlo Methods, Chapman and Hall, London.

F. van Harmelen and A. Bundy, 1988. "Explanation-Based Generalization = Partial Evaluation", Artificial Intelligence 36, 401-412.

I. Hasher and W. Chromiak, 1977. "The processing of frequency information: an automatic mechanism?", Journal of Verbal Learning and Verbal Behavior 16, 173-184.

I. Hasher and R. Zacks, 1984. "Automatic Processing of Fundamental Information: the case of frequency of occurrence", American Psychologist 39, 1372-1388.

C. Hemphill, J. Godfrey and G. Doddington, 1990. "The ATIS spoken language systems pilot corpus". Proceedings DARPA Speech and Natural Language Workshop, Hidden Valley, Morgan Kaufmann.

L. Jacoby and L. Brooks, 1984. "Nonanalytic Cognition: Memory, Perception and Concept Learning", G. Bower (ed.), Psychology of Learning and Motivation (Vol. 18, 1-47), San Diego: Academic Press.

F. Jelinek, J. Lafferty and R. Mercer, 1990. Basic Methods of Probabilistic Context Free Grammars, Technical Report IBM RC 16374 (#72684), Yorktown Heights.

M. Johnson, 1995. Personal communication.

C. Juliano and M. Tanenhaus, 1993. "Contingent Frequency Effects in Syntactic Ambiguity Resolution", Fifteenth Annual Conference of the Cognitive Science Society, 593-598, Hillsdale, NJ.

S. Kasif, S. Salzberg, D. Waltz, J. Rachlin and D. Aha, forthcoming. "A Probabilistic Framework for Memory-Based Reasoning." To appear in Artificial Intelligence.

D. Kausler and J. Puckett, 1980. "Frequency Judgments and Correlated Cognitive Abilities in Young and Elderly Adults", Journal of Gerontology 35, 376-382.

M. Kay, 1980. Algorithmic Schemata and Data Structures in Syntactic Processing. Report CSL-80-12, Xerox PARC, Palo Alto, Ca.

J. Kolodner, 1993. Case-Based Reasoning, Morgan Kauffman, Menlo Park.

M. MacDonald, N. Pearlmutter and M. Seidenberg, 1994. "Lexical Nature of Syntactic Ambiguity Resolution", Psychological Review 101, 676-703.

M. Marcus, B. Santorini and M. Marcinkiewicz, 1993. "Building a Large Annotated Corpus of English: the Penn Treebank", Computational Linguistics 19(2).

W. Martin, K. Church and R. Patil, 1987. "Preliminary Analysis of a Breadth-first Parsing Algorithm: Theoretical and Experimental Results", in: L. Bolc (ed.), Natural Language Parsing Systems, Springer Verlag, Berlin.

T. Mitchell, 1997. Machine Learning, McGraw-Hill Series in Computer Science.

D. Mitchell, F. Cuetos and M. Corley, 1992. "Statistical versus Linguistic Determinants of Parsing Bias: Cross-linguistic Evidence", Fifth Annual CUNY Conference on Human Sentence Processing, New York, NY.

N. Pearlmutter and M. MacDonald, 1992. "Plausibility and Syntactic Ambiguity Resolution", Proceedings 14th Annual Conf. of the Cognitive Society.

M. Rajman, 1995. Apports d'une approche a base de corpus aux techniques de traitement automatique du langage naturel, Ph.D. Thesis, École Nationale Supérieure des Télécommunications, Paris.

P. Resnik, 1992. "Probabilistic Tree-Adjoining Grammar as a Framework for Statistical Natural Language Processing", Proceedings COLING'92, Nantes.

G. Sampson, 1986. "A Stochastic Approach to Parsing", Proceedings COLING'86, Bonn, Germany.

R. Scha, 1990. "Language Theory and Language Technology; Competence and Performance" (in Dutch), in Q.A.M. de Kort & G.L.J. Leerdam (eds.), Computertoepassingen in de Neerlandistiek, Almere: Landelijke Vereniging van Neerlandici (LVVN-jaarboek). [English translation]

R. Scha, 1992. "Virtual Grammars and Creative Algorithms" (in Dutch), Gramma/TTT 1(1). [English translation]

Y. Schabes, 1992. "Stochastic Lexicalized Tree-Adjoining Grammars", Proceedings COLING'92, Nantes.

S. Schütz, 1996. Part-of-Speech Tagging: Rule-Based, Markovian, Data-Oriented. Master's Thesis, University of Amsterdam, The Netherlands.

S. Sekine and R. Grishman, 1995. "A Corpus-based Probabilistic Grammar with Only Two Non-terminals", Proceedings Fourth International Workshop on Parsing Technologies, Prague, Czech Republic.

K. Sima'an, R. Bod, S. Krauwer and R. Scha, 1994. "Efficient Disambiguation by means of Stochastic Tree Substitution Grammars", Proceedings International Conference on New Methods in Language Processing, UMIST, Manchester, UK.

K. Sima'an, 1995. "An optimized algorithm for Data Oriented Parsing", Proceedings International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria.

K. Sima'an, 1996a. "An optimized algorithm for Data Oriented Parsing", in R. Mitkov and N. Nicolov (eds.), Recent Advances in Natural Language Processing 1995, volume 136 of Current Issues in Linguistic Theory. John Benjamins, Amsterdam.

K. Sima'an, 1996b. "Computational Complexity of Probabilistic Disambiguation by means of Tree Grammars", Proceedings COLING-96, Copenhagen, Denmark. (cmp-lg/9606019)

K. Sima'an, 1997. "Explanation-Based Learning of Data-Oriented Parsing", in T. Ellison (ed.) CoNLL97: Computational Natural Language Learning, ACL'97, Madrid, Spain.

K. Sima'an, 1999. Learning Efficient Disambiguation. Ph.D. Thesis, University of Utrecht. ILLC Dissertation Series nr. 1999-02. Utrecht / Amsterdam, March 1999.

B. Srinivas and A. Joshi, 1995. "Some novel applications of explanation-based learning to parsing lexicalized tree-adjoining grammars", Proceedings ACL'95, Cambridge (Mass.).

P. Suppes, 1970. "Probabilistic Grammars for Natural Languages", Synthese 22.

K. Sycara, R. Guttal, J. Koning, S. Narasimhan and D. Navinchandra, 1992. "CADET: A case-based synthesis tool for engineering design", International Journal of Expert Systems, 4(2), 151-188.

D. Tugwell, 1995. "A State-Transition Grammar for Data-Oriented Parsing", Proceedings EACL'95, Dublin, Ireland.

M. Veloso and J. Carbonell, 1993. "Derivational analogy in PRODIGY: Automating case acquisition, storage, and utilization", Machine Learning, 10, 249-279.

A. Viterbi, 1967. "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm", IEEE Trans. Information Theory, IT-13, 260-269.

D. Waltz and S. Kasif, 1996. "On Reasoning from Data," Computing Surveys.

T. Winograd, 1983. Language as a Cognitive Process. Volume I: syntax. Reading (Mass.).

J. Zavrel, 1996. Lexical Space: Learning and Using Continuous Linguistic Representations. Master's Thesis, Department of Philosophy, Utrecht University, The Netherlands.

.