The .fs files encode the structure of natural-language sentences. Each such file contains a sequence of trees whose nodes correspond to the words of a sentence. Each node (word) is described by a set of attributes.
The names and data types of particular attributes are not part of the FS format. Rather, each FS file has a header which defines the attributes of its tree nodes locally. To understand the Prague Dependency Treebank FS files, you need to read this file (general FS syntax) as well as the definition of the FS attributes used in PDT.
The nonterminal symbols are surrounded by < > characters; terminal symbols or strings of terminal symbols are enclosed in double quotes. A C-like notation is used inside the quotes, thus "\t" means the character with the code 9, i.e. HTAB. The character "\n" represents the end of line regardless of the platform, i.e. it matches not only a real "\n" in its C sense, but also "\r\n" (DOS/Windows EOL), or even "\r".
Any end of line escaped by a backslash ("\\\n") has a special meaning. It is generated only for the sake of human legibility of the file. When the file is processed, such an escaped end of line is discarded immediately and its surroundings are parsed as if it were not present. It can appear almost anywhere, so the syntax description does not mention it. It can even appear within an identifier, but unlike the other backslash-escaped function characters it does not become a part of the identifier.
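As a minimal sketch of the rule above, the following Python function removes escaped ends of line before any further parsing. The function name `drop_escaped_eols` is hypothetical, and the sketch assumes that a backslash which is itself escaped by a preceding backslash does not escape the end of line that follows it.

```python
def drop_escaped_eols(text: str) -> str:
    """Remove every backslash-escaped end of line ("\\n", "\\r\\n" or "\\r")."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "\r":
                # Escaped "\r" or "\r\n": skip the backslash and the whole EOL.
                i += 3 if text[i + 2 : i + 3] == "\n" else 2
                continue
            if nxt == "\n":
                # Escaped "\n": skip the backslash and the EOL.
                i += 2
                continue
            # Assumption: any other escaped character (including an escaped
            # backslash) is kept verbatim for the later parsing stages.
            out.append(ch + nxt)
            i += 2
            continue
        out.append(ch)
        i += 1
    return "".join(out)
```

For instance, an identifier split across two physical lines, such as "lem", backslash, end of line, "ma", is read as "lemma" after this step.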
The unary postfix operators "*", "+" and "?" mean that the operand appears n times in a row, where n>=0 for *, n>0 for +, and n is 0 or 1 for ?.
In contexts where a nonterminal can be interpreted as a set, the binary operator "-" can be used. It denotes the difference of two sets.
The file contains a header with node attribute definitions, and a sequence of trees.
- <fs-file> ::=
- <definition-line>+ "\n"+ (<tree> "\n")+ <editor-configuration>?
- <editor-configuration> ::=
- "(" <number> ("," <number>)* ")"
Note: The numbers in the editor configuration are indexes of the attributes that ought to be displayed by default. (The editor allows the display of the remaining attributes to be turned on.) The attribute indexes must be in ascending order, otherwise the program crashes. It is thus impossible to enforce a different ordering of attributes when displaying the tree.
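The ordering requirement in the note can be checked before a file is handed to the editor. The sketch below is illustrative only: the function name `parse_editor_configuration` is hypothetical, <number> is assumed to be an unsigned decimal integer (the grammar does not define it), and "ascending" is assumed to mean strictly ascending.

```python
import re


def parse_editor_configuration(text: str) -> list[int]:
    """Parse "(" <number> ("," <number>)* ")" and check the index order."""
    match = re.fullmatch(r"\((\d+(?:,\d+)*)\)", text.strip())
    if match is None:
        raise ValueError(f"not an editor configuration: {text!r}")
    indexes = [int(number) for number in match.group(1).split(",")]
    # Assumption: repeated indexes are rejected together with descending ones.
    if any(a >= b for a, b in zip(indexes, indexes[1:])):
        raise ValueError("attribute indexes must be in ascending order")
    return indexes
```

With these assumptions, "(1,2,5)" yields the indexes 1, 2 and 5, while "(3,1)" is rejected.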
An identifier is one of the main elements of the FS file syntax. It is a string of arbitrary characters that extends up to, but does not include, the first function character. Function characters can be part of an identifier when they are escaped by a backslash; the backslash used for escaping is not itself a part of the identifier.
Note: The length of identifiers is limited, and the limit depends on the usage: 20 characters for an attribute name and 120 characters for an attribute value.
- <attribute-name> ::=
- <identifier>
- <attribute-value> ::=
- <identifier>
- <identifier> ::=
- <identifier-character>+
- <identifier-character> ::=
- <normal-character> | <escaped-character>
- <function-character> ::=
- "\\" | "=" | "," | "[" | "]" | "|"
- <normal-character> ::=
- <any-character>-<function-character>-"\n"
- <escaped-character> ::=
- "\\" (<any-character>-"\n")
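The identifier rules above can be sketched as a small scanner. This is only an illustration under stated assumptions: the names `FUNCTION_CHARS` and `read_identifier` are hypothetical, and escaped ends of line are assumed to have been removed already.

```python
# The function characters listed in the grammar: \ = , [ ] |
FUNCTION_CHARS = set("\\=,[]|")


def read_identifier(text: str, pos: int = 0) -> tuple[str, int]:
    """Read an identifier starting at pos; return it and the stop offset."""
    chars = []
    while pos < len(text):
        ch = text[pos]
        if ch == "\\":
            # A backslash escapes the next character; the backslash itself
            # does not become part of the identifier.
            if pos + 1 >= len(text) or text[pos + 1] == "\n":
                break
            chars.append(text[pos + 1])
            pos += 2
        elif ch in FUNCTION_CHARS or ch == "\n":
            # An unescaped function character or end of line ends the identifier.
            break
        else:
            chars.append(ch)
            pos += 1
    return "".join(chars), pos
```

The returned offset points at the character that ended the identifier, so a caller can inspect it (for example "=" or ",") and continue from there. The length limits from the note are not enforced here.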
The beginning of each file contains a header with definitions of the attributes which can appear in tree nodes. Each header line begins with the @ character, followed by a capital letter denoting a property of the attribute, then a space and the attribute name. For example, "@P lemma".
Note: In the list of allowed values in the @L definition (<values>), the values cannot be repeated.
- <definition-line> ::=
- ("@" <property> <view>? " " <attribute-name> "\n") |
- ("@L" <view>? " " <attribute-name> "|" <values> "\n")
- <property> ::=
- "K" | "P" | "O" | "N" | "V" | "W" | "H"
- <view> ::=
- "1" | "2" | "3"
- <values> ::=
- <attribute-value> ("|" <values>)?
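A single header line can be parsed along the lines of the grammar above. The sketch below is hedged: `Definition`, `split_unescaped` and `parse_definition` are hypothetical names, the attribute names in the usage note are made up for illustration, and only the forms given by the grammar are handled. Constraints that span several header lines are not checked.

```python
import re
from typing import NamedTuple, Optional


class Definition(NamedTuple):
    prop: str             # property letter, e.g. "P" or "L"
    view: Optional[int]   # optional view mode 1, 2 or 3
    name: str             # attribute name
    values: list          # allowed values (only for @L definitions)


_DEFINITION = re.compile(r"@([KPONVWHL])([123])? (.+)")


def split_unescaped(text: str, sep: str) -> list:
    """Split on unescaped sep, dropping the escaping backslashes."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            current.append(text[i + 1])
            i += 2
        elif text[i] == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_definition(line: str) -> Definition:
    match = _DEFINITION.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a definition line: {line!r}")
    prop, view, rest = match.groups()
    name, *values = split_unescaped(rest, "|")
    if prop == "L":
        if not values:
            raise ValueError("@L definition without a list of values")
        if len(set(values)) != len(values):
            raise ValueError("values of an @L definition cannot be repeated")
    elif values:
        raise ValueError("only @L definitions may list values")
    return Definition(prop, int(view) if view else None, name, values)
```

For example, `parse_definition("@P lemma")` gives the property "P" for the attribute "lemma", and `parse_definition("@L2 afun|Pred|Sb")` gives a list attribute with view mode 2 and two allowed values.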
- K
- P: ... ord=7, e.g.). Positional attributes don't. The name of a positional attribute is determined from the relative position of its value with respect to the preceding values (see details below in the paragraph "Node").
- O
- L
- H
- N: ... @W attribute is provided. If the @N attribute is not present, the tree is centered regardless of whether a @W attribute is present. At most one such attribute per FS file can be defined.
- W: ... @N and @W attributes are defined, the former specifies the ordering of nodes in the tree view while the latter specifies the ordering of words in the linear view on the status line. This allows the user to reorder a non-projective tree into a projective order while the sentence remains displayed in the original order on the status line.
- V: ... @VH (default) or @VA. The former is the default (i.e. @V is the same as @VH) and means that the values of hidden nodes (see the attribute @H) will not be displayed even on the status line. The latter means that even hidden nodes shall be shown on the status line.
More than one property can be defined for one attribute. The definition lines with all the properties need not follow each other in the file header. They must however fulfill the following constraints:
- At most one @V attribute per file can be defined.
- At most one @W attribute per file can be defined.
- At most one @N attribute per file can be defined.
- The @N property cannot be combined with other properties. Nevertheless, the @N attribute automatically has the properties @P and @O as well.
- ... @V and @L.
- @L must be the last property defined for an attribute, but it cannot be the only property of that attribute.
The view mode can be defined optionally. It can be required that the value of the attribute always be highlighted in the tree editor.
- 1
- ATTR_SHADOW
- 2
- ATTR_HILITE
- 3
- ATTR_XHILITE
The definition of node attributes in the Prague Dependency Treebank can serve as an example.
The trees are described in the usual parentheses notation, i.e. after the description of an inner node the parenthesized comma-separated list of its children (or their subtrees) follows. The children of each node must be ordered according to the values of their numeric attribute @N, if any. Breaking this rule can cause the tree editor to display the tree incorrectly (the projectivity is involved; it is assumed that the numeric attribute contains the index of the word according to the sentence word order).
- <tree> ::=
- <node> ("(" <children> ")")?
- <children> ::=
- <tree> ("," <children>)?
Besides pure syntax it is also necessary to check the relations between the element <attributes> and the definitions of the respective attributes in the header of the file. The constraints following from these relations are described below.
- <node> ::=
- <attribute-set> ("|" <node>)?
- <attribute-set> ::=
- "[" <attributes>? "]"
- <attributes> ::=
- <attribute> ("," <attributes>)?
- <attribute> ::=
- (<attribute-name> "=")? <values>
- <values> ::=
- <attribute-value> ("|" <values>)?
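The <tree> and <node> rules lend themselves to a small recursive-descent parser. The sketch below is one possible reading of the grammar, not a reference implementation: the names `Tree`, `parse_tree` and `_parse_node` are hypothetical, escaped ends of line are assumed to have been removed beforehand, and the contents of each attribute set are kept as raw text instead of being resolved against the header.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    attribute_sets: list                      # raw text of each "[...]" of the node
    children: list = field(default_factory=list)


def _parse_node(text: str, pos: int):
    """<node> ::= <attribute-set> ("|" <node>)?"""
    sets = []
    while True:
        if pos >= len(text) or text[pos] != "[":
            raise ValueError(f"expected '[' at offset {pos}")
        end = pos + 1
        while end < len(text) and text[end] != "]":
            # Skip an escaped character so that an escaped "]" is not a delimiter.
            end += 2 if text[end] == "\\" else 1
        if end >= len(text):
            raise ValueError("unterminated attribute set")
        sets.append(text[pos + 1 : end])
        pos = end + 1
        if pos < len(text) and text[pos] == "|":
            pos += 1
            continue
        return sets, pos


def parse_tree(text: str, pos: int = 0):
    """<tree> ::= <node> ("(" <children> ")")?; returns (Tree, stop offset)."""
    sets, pos = _parse_node(text, pos)
    children = []
    if pos < len(text) and text[pos] == "(":
        pos += 1
        while True:
            child, pos = parse_tree(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at offset {pos}")
        pos += 1
    return Tree(sets, children), pos
```

Calling `parse_tree("[a]([b],[c]([d]))")` returns a root with two children, the second of which has one child of its own. The ordering of children by their @N values and the header-based constraints are not checked in this sketch.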
The element <attributes> must fulfill the following constraints (based on the particular definition of attributes in the file header):
- The <attribute-name> element must equal the name of an attribute defined in the header.
- ... <attribute> element with the same <attribute-name> appears twice or if the attribute name is not mentioned but the last read attribute's definition immediately precedes the definition of an attribute whose value has already been read.
- The value of an @L attribute must be one of the predefined values from the definition of the attribute.
Here is an example of a whole FS file with some trees.