Minimal & Maximal Reference Answers.

MINIMAL AND MAXIMAL REFERENCE ANSWERS

I. Overview

A minimal/maximal reference answer pair is produced for each utterance interpretation that is taken as a request to retrieve a set of tuples from the database. In order for a hypothesis answer to be considered correct, it must contain at least the set of fields in one of the minimal reference answers for the utterance and no more than the set of fields in the corresponding maximal reference answer. The minimal reference answers are produced by hand (using NLParse or the equivalent), according to the following definition:

A minimal reference answer includes only the fields that contain, as defined by the principles of interpretation, the information explicitly requested by the subject. Uncertainty as to exactly what information is explicitly requested should be considered to be an ambiguity, and treated as any other ambiguity would be.

The last clause means that (1) if there are more than six possibilities, the query will be classified as hopelessly vague, except that (2) the principles of interpretation may stipulate interpretations of utterances that would otherwise be treated as hopelessly vague.

For each minimal reference answer, a maximal reference answer is generated automatically. A maximal reference contains:

all the fields included in the corresponding minimal reference answer,
PLUS the fields mentioned in the SQL producing the minimal reference answer that can reasonably be included in an answer to the query (this notion is defined below),
PLUS the primary keys (that can reasonably be included in an answer to the query) associated with the fields included in the maximal reference answer by (1) or (2),
EXCEPT that where two fields are constrained by the query to have the same value, only one of the fields is included in the maximal reference answer.

The motivation for the fields included under (2) is that it seems reasonable to include in an answer the attributes of an entity that have to be examined in order to tell whether the entity belongs in the answer. For example, if a list of flights arriving before a certain time is requested, it is reasonable to include the arrival time of the flights in the answer.

The motivation for the fields included under (3) is that if a list of attributes of entities is requested, it is reasonable to include in the answer the entities they are attributes of. For example, if the seating capacity of all airplanes is requested, it is reasonable to include which airplane each seating capacity is associated with.

The motivation for the exception made under (4) is that many, if not most, queries involve equi-joins where pairs of fields are required to have the same value. For example, a query that requires retrieving fares for a set of flights must contain a set of restrictions equivalent to:

flight.flight_id = flight_fare.flight_id AND
flight_fare.fare_id = fare.fare_id

If all of these fields were included in the maximal reference answer, then every flight_id and every fare_id would be included twice. To reduce the storage requirements for maximal answers and the complexity of the search space faced by the comparator, these duplications have been eliminated.

II. What can be included?

The notion of "can reasonably be included in an answer to the query" is defined by exception. The general idea is to include any field that is used to restrict an answer as part of the maximal answer, but there are a few cases where this either cannot or should not be done. The types of fields we have identified that cannot or should not be included in maximal answers are:

Fields bound by subqueries that occur within a negative condition or a universal quantifier,
Fields whose values are free to vary within a set of tuples being operated on by a group function,
Fields bound by subqueries that occur within a disjunctive condition.

In SQL, fields bound by a subquery cannot be returned as part of an answer, but many queries containing subqueries can be re-expressed as joins in which the subquery is eliminated. For subqueries that occur within negative conditions or universal quantifiers, however, there is in general no equivalent form that replaces the subquery with a join. Subqueries within negative conditions or universal quantifiers arise >from queries such as "What airlines have no flights that go to Boston?" or "Which airlines have only nonstop flights?" In the first case, even though the flight destination field must be examined to answer the query, returning it make no sense, because we can't list the nonexistent flights to Boston for all the airlines that don't have any. The second case can be assimilated to the first, since "Which airlines have only nonstop flights?" is equivalent to "Which airlines have no flights that have more than 0 stops?" Even if we chose not to view the question that way, it is doubtful that answering "Which airlines have only nonstop flights?" by listing all the flights for each such airline would be a reasonable thing to do.

The group functions supported by SQL are AVG, COUNT, MAX, MIN, and SUM. In SQL, no field can be returned as part of an answer that includes a value computed by a group function, unless the field occurs in a GROUP BY clause that defines the groups that the group function is applied to. This restriction makes sense, because unless a field is constrained to have a constant value for all members of the group, no value of that field is particularly associated with the result of the group function. For example, if we ask "How many flights go to cities served by United?", the answer is a single number for the whole group of flights, and it makes no sense to include any particular city as part of the answer. On the other hand, if we ask "How many flights go to each city served by United?", the answer should contain one number for each city served by United, so cities can and should be included in the maximal answer. To apply this principle to maximal answer generation for group function queries, any field that is constrained by the form of the query to have a fixed value within the scope of the group function is added to the GROUP BY clause, and all the fields mentioned in the GROUP BY clause are returned in the maximal answer.

Fields bound by subqueries that occur within a disjunctive condition could, in principle, be added to the maximal answer, but in practice doing so blows up the maximal answer with irrelevant information. The problem is that with a condition of the form "WHERE A OR B", fields introduced by subqueries inside A will be relevant to the answer only for rows in the answer for which A is true. Suppose the query were, "Which flights are either nonstop, or have a stop in Denver?" To include information from the stop table in the maximal answer, the query would have to be transformed into a join between the flight table and the stop table, with a row for each combination of flight table entry and stop table entry, such that either the flight was nonstop or the stop was a stop in Denver for the flight. The result is that the answer would have an entry for every nonstop flight combined with every entry in the stop table, whether or not it was a stop for that flight in Denver. Note that this problem is confined to subqueries inside a disjunction. Other disjunctions, such as disjunctions over values of a single field, such as "List the flights that stop in Denver or Philadelphia," present no such problems.

III. Phrasing queries.

Automatically generating reasonable maximal reference answers requires that the SQL for minimal reference answers be created with some care--more so than if getting the right minimal reference answer were all that mattered. In general, this requires phrasing the SQL for the minimal reference answer as closely as possible to the literal form of the user's query. If shortcuts are taken, the maximal reference answer may be missing fields that it should contain. On the other hand, if the SQL for the minimal reference answer is phrased in an overly complicated way, the maximal reference answer may contain too many fields or unnecessary joins.

A specific case worth bringing up is requests for fares that also mention flights, using an expression such as "fares for flights from X to Y." This should be considered ambiguous between a request for fares only or a request for both fares and flights. It is, of course, a special case of a more general pattern--requests for "As for Bs," and for the most part, the fact that the Bs will have some restriction on them will mean that they will get into the maximal answer, so they needn't be included in the minimal answer. The particular case of "fares for flights from X to Y," however, presents a potential trap, because the structure of the database allows fares for flights from X to Y to be retrieved just by restricting from_airport and to_airport in the fare table, in which case flights would not get included in the maximal answer.

The problem is that retrieving fares for flights from X to Y just by restricting from_airport and to_airport in the fare table represents an impermissable short-cut in generating the database query for the minimal answer. The user's query literally asks for fares of flights from X to Y, but the shortcut database query retrieves fares from X to Y, based on special knowledge that these two queries always return the same answer. If no shortcut is taken, the flight table will be accessed and the proper maximal answer will be generated. If the user asked only for "fares from X to Y", on the other hand, it is an unnecessary complication to access the flight table, and flights would improperly be included in the maximal answer.

In some cases, NLParse has had to be modified to make it possible to phrase queries without introducing unnecessary complications. For example, originally, a request for the cheapest flights between two cities had to be phrased:

List flights from Boston and to San Francisco and having prices of cheapest one direction fares from Boston and to San Francisco

because "cheapest" was defined so that it applied only to fares and not to flights. This meant that the restrictions on the flights had to be restated as restrictions on the fares, with the result that they entered into the maximal answer twice. NLParse has now been modified so that "cheapest flights" can be asked for directly, eliminating this redundancy.