The .fs files encode the structure of natural-language sentences. Each file contains a sequence of trees whose nodes correspond to the words of a sentence. Each node (word) is described by a set of attributes.
This file describes a standard that is not strictly part of the FS format: node attributes can be defined independently for every FS file. Nevertheless, the PDT files usually share the same node attributes, and these are described here. For the general FS syntax, please refer to its own description.
Please note that the root of a tree is an exception to the rule that each node represents one word/token of the sentence. The root does not correspond to any word but carries information about the whole sentence. Some attributes therefore have a special interpretation in the root.
Not every attribute described below appears in every PDT FS file. Some may be defined in the header of a file but never appear in its data. Others may be missing from the headers of some files yet be defined in others. This holds especially for attributes bound to the tectogrammatical layer of annotation and for attributes used for technical reasons.
There are several ways to leave the value of an attribute empty or undefined:
The value (in this example, that of tag) is empty: [dělá,dělat,,1]. The attribute name cannot be present, so [form=dělá,lemma=dělat,tag=,ord=1] is incorrect!
The attribute is omitted altogether. Example: [dělá,dělat,ord=1].
A dash (-) is used, although it is not mentioned in the description of the tag system. Example: [dělat,dělá,-,1].
??? usually means that the value is unknown. It is not even known whether the attribute applies to the given word. This is the default value for attributes that have it in their value list.
NA usually means that the value has been set (is known) but is undefined because the attribute is not relevant for the given word.
form
@P form @O form
Corresponds to the CSTS elements <f>, <d>, and <fadd>.
In most cases the value of this attribute is identical to the word form as it appeared in the original text, including the upper/lowercase distinction. It differs only when a normalization step has been performed:
[...] form, the other bears the pronoun.
[...] form, the other's form is jsi (you are).
[...] form, the other's form is nebo». Joint forms of this type are rather archaic.
The root has form=#n where n is the number of the sentence in this file. Sometimes this value can be non-numeric (e.g. form=#22A) if a sentence has been split into two or more sentences.
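As a minimal sketch of how the root's form could be read, the following Python function splits a value such as #22A into its number and an optional suffix. The function name and the assumed shape of the suffix (anything non-numeric following the digits) are illustrative only; the description above says merely that the value can be non-numeric.

```python
import re
from typing import Optional, Tuple

# Assumed shape: "#", one or more digits, then an optional suffix
# that starts with a non-digit (e.g. "#22A").
_ROOT_FORM = re.compile(r"#(\d+)(\D.*)?")


def parse_root_form(form: str) -> Optional[Tuple[int, str]]:
    """Return (sentence_number, suffix) for a root form, or None for other values."""
    match = _ROOT_FORM.fullmatch(form)
    if match is None:
        return None
    return int(match.group(1)), match.group(2) or ""


print(parse_root_form("#22"))   # plain sentence number
print(parse_root_form("#22A"))  # sentence that resulted from a split
print(parse_root_form("dělá"))  # an ordinary word form, not a root
```

The same function would apply to origf, whose root value is described in the same way.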
origf
@V origf @P origf
Corresponds to the CSTS element <w>.
The original word form as it appeared in the sentence, before normalization if any. If the word was misspelled, it remains misspelled in this attribute but is corrected in form.
The root has origf=#n where n is the number of the sentence in this file. Sometimes this value can be non-numeric (e.g. origf=#22A) if a sentence has been split in two or more sentences.
lemma
@P lemma @O lemma
Corresponds to the CSTS element <l>.
The lemma uniquely identifies a word as a lexical unit. It is represented as a string of letters and other characters which in most cases corresponds to the base form of the word, the form also used as the dictionary entry. The following forms are considered base forms:
Part of speech | Base form |
noun | nominative singular (if singular does not exist, plural) |
adjective | nominative masculine singular, affirmative, positive |
pronoun | nominative masculine singular (if case, gender and number are relevant); e.g. there are only three personal pronouns: já (I), ty (you), on (he). |
numeral | nominative masculine singular (if case, gender and number are relevant) |
verb | infinitive |
adverb | positive affirmative (if relevant) |
preposition | without vocalization (e.g. v, not ve) |
other | original word form |
Orthographic variants are united if they really differ in orthography only and not in sense as well.
A lemma is case-sensitive, so proper names can be identified even if they are identical to common nouns. The case of the lemma does not, however, reflect the case of the word form in the text: if the word is capitalized only because it appeared at the beginning of a sentence or a heading, its lemma is all lowercase.
A sense identification in the form of a dash and one or more decimal digits (e.g. -2) can be added to the lemma string. Such identification distinguishes lexical units that would otherwise be indistinguishable (e.g. stát-1 = state, country; stát-2 = to become, to happen; stát-3 = to stand; stát-4 = to cost). The sense distinction is shallow, as it is motivated mostly by different morphological or syntactic properties of the distinguished lemmas.
The string described up to this point can be enriched by comments. Comments are connected to the lemma by the underscore character. A parenthesized comment preceded by a circumflex contains a short description of the meaning (in Czech). It often accompanies lemmas with distinguished senses (e.g. stát-1_^(státní_útvar)). A comment preceded by a semicolon encodes lexical and stylistic categories, e.g. G in Grónsko_;G means that Grónsko (Greenland) is a geographical name.
The lemma of the root is #.
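The lemma structure described above (base string, optional sense number, optional comments) can be taken apart mechanically. The following Python sketch is one possible way to do it; the names ParsedLemma and parse_lemma are hypothetical, and it handles only the two comment kinds mentioned in this description (the circumflex comment and the semicolon comment).

```python
import re
from typing import List, NamedTuple, Optional


class ParsedLemma(NamedTuple):
    base: str                   # lemma proper, without sense number or comments
    sense: Optional[int]        # number after the dash, if any
    explanation: Optional[str]  # text inside the ^(...) comment, if any
    categories: List[str]       # contents of the ;X comments


# A comment starts at an underscore that is followed by "^" or ";".
# Underscores inside a comment (as in "státní_útvar") are left alone.
_COMMENT_START = re.compile(r"_(?=[\^;])")
_SENSE = re.compile(r"(.+)-(\d+)")


def parse_lemma(lemma: str) -> ParsedLemma:
    core, *comments = _COMMENT_START.split(lemma)
    match = _SENSE.fullmatch(core)
    base, sense = (match.group(1), int(match.group(2))) if match else (core, None)
    explanation = None
    categories = []
    for comment in comments:
        if comment.startswith("^(") and comment.endswith(")"):
            explanation = comment[2:-1]
        elif comment.startswith(";"):
            categories.append(comment[1:])
    return ParsedLemma(base, sense, explanation, categories)


print(parse_lemma("stát-1_^(státní_útvar)"))
print(parse_lemma("Grónsko_;G"))
print(parse_lemma("#"))
```

Because the lemma is case-sensitive, the sketch never changes the case of the base string.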
lemmaMM_source
@P lemmaMM_source
Corresponds to the CSTS elements <MMl src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
lemmaMD_source
@P lemmaMD_source
Corresponds to the CSTS elements <MDl src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
tag
@P tag @O tag
Corresponds to the CSTS element <t>.
The attribute tag contains the part of speech and the morphological tag. The Czech tag system has roughly 3000 theoretically possible tags; one to two thousand of them actually appear in the PDT. There are two possible views of each tag: compact and positional. There is a one-to-one mapping between the two systems, so it is up to the user which one to prefer. A tag is positional if and only if it is a string of 15 characters (English letters, digits, dashes and other special characters such as dots and exclamation marks). A compact tag has variable length but is always shorter than 15 characters. It contains only uppercase English letters, digits, and sometimes a dash. The compact system is older. Its tags may be more legible for an experienced user because they encode only the properties relevant for the given part of speech. Nevertheless, they are difficult to parse automatically because there are many rules of the kind "if we have read such-and-such up to this point, the next character encodes the gender, otherwise it encodes the tense". In a positional tag, on the other hand, the index (position) of a character already says which morphological property it encodes. The price is that the tags are long and contain long sequences of dashes for categories not relevant for the given word.
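The length rule above is enough to tell the two views apart in code. The sketch below is only an illustration: the function names are hypothetical, and the 15-character string used in the example is an illustrative value, not taken from this description. Which property each position encodes is documented in the positional tag description, so the sketch returns only the raw character.

```python
POSITIONAL_LENGTH = 15  # a tag is positional iff it has exactly 15 characters


def tag_view(tag: str) -> str:
    """Classify a tag as 'positional' or 'compact' by its length."""
    return "positional" if len(tag) == POSITIONAL_LENGTH else "compact"


def char_at(tag: str, position: int) -> str:
    """Return the character at a 1-based position of a positional tag."""
    if len(tag) != POSITIONAL_LENGTH:
        raise ValueError("not a positional tag: %r" % tag)
    if not 1 <= position <= POSITIONAL_LENGTH:
        raise IndexError("position must be between 1 and 15")
    return tag[position - 1]


example = "NNIS1-----A----"   # illustrative 15-character tag
print(tag_view(example))      # positional
print(tag_view("ZSB"))        # compact (the root tag)
print(char_at(example, 5))    # character stored at position 5
```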
See the description of the compact tag system (available in: pdffile, psfile) and the description of the positional tag system (detailed description available in: pdffile, psfile; quick reference available in: htmlfile, pdffile).
As for any FS attribute, there can be a set of values (tags) separated by the vertical bar character (|). If the lemma of this node contains several lemma alternatives, the tag set must use special tags -- to separate the tag set for lemma i from the tag set for lemma i+1.
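The grouping of tag alternatives by lemma alternative can be sketched as follows. This is an illustration under simplifying assumptions: the function name is hypothetical, T1 to T3 and the lemmas a and b are placeholders rather than real PDT values, and any escaping of the vertical bar that the general FS syntax may define is ignored.

```python
from typing import List


def tags_per_lemma(tag_value: str) -> List[List[str]]:
    """Split a tag value on '|' and start a new group at every '--' separator."""
    groups: List[List[str]] = [[]]
    for tag in tag_value.split("|"):
        if tag == "--":
            groups.append([])       # tags that follow belong to the next lemma
        else:
            groups[-1].append(tag)
    return groups


lemmas = "a|b".split("|")               # placeholder lemma alternatives
groups = tags_per_lemma("T1|T2|--|T3")  # placeholder tags
print(dict(zip(lemmas, groups)))
```

With these placeholder values, the first lemma alternative is paired with T1 and T2 and the second with T3.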
The root has the tag ZSB.
wt
@P wt
Corresponds to the attribute w of the CSTS element <t>.
tagMM_source
@P tagMM_source
Corresponds to the CSTS element <MMt src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
tagMD_source
@P tagMD_source
Corresponds to the CSTS element <MDt src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
wMDl_source, wMDt_source
@P wMDl_source @P wMDt_source
Correspond to the attribute w of the CSTS elements <MDl src="source"> and <MDt src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
A comprehensive view of this part of annotation can be found in the manual for the analytical layer annotators (in Czech).
afun
@P afun @O afun @L2 afun|---|Pred|Pnom|AuxV|Sb|Obj|Atr|Adv|AtrAdv|AdvAtr\ |Coord|AtrObj|ObjAtr|AtrAtr|AuxT|AuxR|AuxP|Apos|ExD|AuxC|Atv|AtvV\ |AuxO|AuxZ|AuxY|AuxG|AuxK|AuxX|AuxS|Pred_Co|Pnom_Co|AuxV_Co|Sb_Co\ |Obj_Co|Atr_Co|Adv_Co|AtrAdv_Co|AdvAtr_Co|Coord_Co|AtrObj_Co\ |ObjAtr_Co|AtrAtr_Co|AuxT_Co|AuxR_Co|AuxP_Co|Apos_Co|ExD_Co|AuxC_Co\ |Atv_Co|AtvV_Co|AuxO_Co|AuxZ_Co|AuxY_Co|Pred_Ap|Pnom_Ap|AuxV_Ap|Sb_Ap\ |Obj_Ap|Atr_Ap|Adv_Ap|AtrAdv_Ap|AdvAtr_Ap|Coord_Ap|AtrObj_Ap\ |ObjAtr_Ap|AtrAtr_Ap|AuxT_Ap|AuxR_Ap|AuxP_Ap|Apos_Ap|ExD_Ap|AuxC_Ap\ |Atv_Ap|AtvV_Ap|AuxO_Ap|AuxZ_Ap|AuxY_Ap|Pred_Pa|Pnom_Pa|AuxV_Pa|Sb_Pa\ |Obj_Pa|Atr_Pa|Adv_Pa|AtrAdv_Pa|AdvAtr_Pa|Coord_Pa|AtrObj_Pa\ |ObjAtr_Pa|AtrAtr_Pa|AuxT_Pa|AuxR_Pa|AuxP_Pa|Apos_Pa|ExD_Pa|AuxC_Pa\ |Atv_Pa|AtvV_Pa|AuxO_Pa|AuxZ_Pa|AuxY_Pa|???
Corresponds to the CSTS element <A>.
Analytical function (surface-syntactic tag). It denotes the type of dependency between the governing and the dependent node. Besides typical syntactic categories such as subject, predicate, object, attribute or adverbial, it also contains many auxiliary relations and distinguishes coordinations and appositive modifiers from real dependencies.
See the description of the analytical function system (available in: pdffile, psfile).
The root has the afun AuxS.
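In the value list above, most functions also occur with one of the suffixes _Co, _Ap and _Pa. A small Python sketch (the function name is hypothetical) can separate the suffix from the base function; what each suffix means is left to the description of the analytical function system.

```python
from typing import Optional, Tuple

# Suffixes that occur in the afun value list of the header.
SUFFIXES = ("Co", "Ap", "Pa")


def split_afun(afun: str) -> Tuple[str, Optional[str]]:
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); values without a suffix get None."""
    base, separator, suffix = afun.rpartition("_")
    if separator and suffix in SUFFIXES:
        return base, suffix
    return afun, None


for value in ("Sb_Co", "Atr_Ap", "AuxS", "???"):
    print(value, split_afun(value))
```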
afunMD_source
@P afunMD_source
Corresponds to the CSTS element <MDA src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
ord
@N ord
Corresponds to the CSTS element <r>.
Index of the word in the sentence (original word order). The root has the index 0.
govMD_source
@P govMD_source
Corresponds to the CSTS element <MDg src="source">.
This set of attributes is automatically created during conversion from CSTS to FS.
A number of attributes have been added to the FS files on the tectogrammatical layer (see the header below). For their description, please refer to this postscript file or directly to the manual for the tectogrammatical annotators.
ID1
@P ID1
Corresponds to the attribute id of the CSTS element <s>.
This attribute is non-empty only for root nodes. Its value is then the sentence identification within the Czech National Corpus.
ID2
@P ID2
Corresponds to a part of the attribute id of the CSTS element <s>.
This attribute appears only in older files and is non-empty only for root nodes. Its value is then the name of the file the tree appears in.
nospace
@P nospace
Corresponds to the CSTS element <D>.
If the value of this attribute is 1, no space followed the original form in the original data.
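Together with ord and the word forms, nospace is enough to rebuild the surface text of a sentence. The sketch below assumes, purely for illustration, that each node has already been read into a dictionary mapping attribute names to string values; reading the FS file itself is not shown, and the function name is hypothetical. It prefers origf when present, since that attribute holds the form as it appeared in the text.

```python
from typing import Dict, List


def surface_text(nodes: List[Dict[str, str]]) -> str:
    """Join word forms in ord order, omitting the space after nospace=1 nodes."""
    # The root has ord 0 and does not correspond to any word, so it is skipped.
    words = sorted(
        (node for node in nodes if int(node["ord"]) > 0),
        key=lambda node: int(node["ord"]),
    )
    pieces = []
    for node in words:
        pieces.append(node.get("origf") or node["form"])
        if node.get("nospace") != "1":
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Nodes are deliberately listed out of order; tree order need not be word order.
nodes = [
    {"form": "#1", "ord": "0"},
    {"form": ".", "ord": "3"},
    {"form": "dělá", "ord": "2", "nospace": "1"},
    {"form": "To", "ord": "1"},
]
print(surface_text(nodes))
```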
origfkind
@P origfkind
Corresponds to the attribute kind of the CSTS element <w>.
formtype
@P formtype
Corresponds to the attribute case of the CSTS element <f> or to the attribute type of the CSTS element <d>.
cstslang
@P cstslang
Corresponds to the attribute lang of the CSTS top-level element <csts>.
cstssource
@P cstssource
Corresponds to the CSTS element <source>.
cstsmarkup
@P cstsmarkup
Corresponds to the CSTS element <markup> in case it is a subelement of the CSTS element <h>.
This attribute is non-empty only for the root node of the first tree in a file. It stores the original SGML form of all subelements of the CSTS element <markup>, as stored in the CSTS header <h>.
chap
@P chap
Corresponds to the CSTS element <c>.
This attribute is non-empty only for root nodes. If its value is 1, the sentence represented in the tree is the first sentence of a chapter or section.
doc
@P doc
Corresponds to the attribute file of the CSTS element <doc>.
This attribute is non-empty only for root nodes. If its value is non-empty, the sentence represented in the tree is the first sentence of the document and the value is the original file name of the document.
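Since a non-empty doc marks the first sentence of a document, the trees of one FS file can be grouped into documents by scanning the root nodes in order. The following sketch assumes the roots are available as dictionaries of attribute values; the function name and the file names a.txt and b.txt are hypothetical.

```python
from typing import Dict, List


def split_documents(roots: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group root nodes into documents; a non-empty 'doc' starts a new document."""
    documents: List[Dict[str, object]] = []
    for root in roots:
        # Also open a document for leading trees that carry no doc value.
        if root.get("doc") or not documents:
            documents.append({"file": root.get("doc", ""), "sentences": []})
        documents[-1]["sentences"].append(root["form"])
    return documents


roots = [
    {"form": "#1", "doc": "a.txt"},
    {"form": "#2"},
    {"form": "#3", "doc": "b.txt"},
]
for document in split_documents(roots):
    print(document["file"], document["sentences"])
```

The same scan could use chap in an analogous way to mark chapter or section starts.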
docid
@P docid
Corresponds to the attribute id of the CSTS element <doc>.
docmarkup
@P docmarkup
Corresponds to the CSTS element <markup> in case the element appears in the document header <a>.
This attribute is non-empty only for root nodes and only if the doc attribute is also non-empty. It stores the original SGML form of all subelements of the CSTS element <markup> of the document header <a>.
docprolog
@P docprolog
Corresponds to the CSTS element <a>.
This attribute is non-empty only for root nodes and only if the doc attribute is also non-empty. It stores the original SGML form of all subelements of the CSTS element <a> except <markup>.
gappre, gappost
@P gappre @P gappost
These attributes correspond to the CSTS elements <i>, <idioms>, <idiom> and <iref>.
These attributes store the original SGML form of the above-named CSTS elements appearing just before (in the case of gappre) or just after (in the case of gappost) all other elements which form the node.
@P lemma @O lemma @P tag @O tag @P form @O form @P afun @O afun @L1 afun|---|Pred|Pnom|AuxV|Sb|Obj|Atr|Adv|AtrAdv|AdvAtr|Coord|AtrObj|ObjAtr|AtrAtr|AuxT|AuxR|AuxP|Apos|ExD|AuxC|Atv|AtvV|AuxO|AuxZ|AuxY|AuxG|AuxK|AuxX|AuxS|Pred_Co|Pnom_Co|AuxV_Co|Sb_Co|Obj_Co|Atr_Co|Adv_Co|AtrAdv_Co|AdvAtr_Co|Coord_Co|AtrObj_Co|ObjAtr_Co|AtrAtr_Co|AuxT_Co|AuxR_Co|AuxP_Co|Apos_Co|ExD_Co|AuxC_Co|Atv_Co|AtvV_Co|AuxO_Co|AuxZ_Co|AuxY_Co|AuxG_Co|AuxK_Co|AuxX_Co|Pred_Ap|Pnom_Ap|AuxV_Ap|Sb_Ap|Obj_Ap|Atr_Ap|Adv_Ap|AtrAdv_Ap|AdvAtr_Ap|Coord_Ap|AtrObj_Ap|ObjAtr_Ap|AtrAtr_Ap|AuxT_Ap|AuxR_Ap|AuxP_Ap|Apos_Ap|ExD_Ap|AuxC_Ap|Atv_Ap|AtvV_Ap|AuxO_Ap|AuxZ_Ap|AuxY_Ap|AuxG_Ap|AuxK_Ap|AuxX_Ap|Pred_Pa|Pnom_Pa|AuxV_Pa|Sb_Pa|Obj_Pa|Atr_Pa|Adv_Pa|AtrAdv_Pa|AdvAtr_Pa|Coord_Pa|AtrObj_Pa|ObjAtr_Pa|AtrAtr_Pa|AuxT_Pa|AuxR_Pa|AuxP_Pa|Apos_Pa|ExD_Pa|AuxC_Pa|Atv_Pa|AtvV_Pa|AuxO_Pa|AuxZ_Pa|AuxY_Pa|AuxG_Pa|AuxK_Pa|AuxX_Pa|Generated|NA|??? @P ID1 @P ID2 @VA origf @P origf @P afunprev @P semPOS @P tagauto @P lemauto @N ord @P dord @W sentord @P govTR @P nospace @P root @P ending @P punct @P alltags @P wt @P origfkind @P formtype @P gappost @P gappre @P cstslang @P cstssource @P cstsmarkup @P chap @P doc @P docid @P docmarkup @P docprolog @P1 warning @P3 err1 @P3 err2 @P reserve1 @P reserve2 @P reserve3 @P reserve4 @P reserve5 @P wMDt_a @P wMDl_a @P wMDt_b @P wMDl_b @P tagMD_a @P lemmaMD_a @P tagMD_b @P lemmaMD_b
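A header such as the one above can be inventoried with a few lines of Python. This is only a sketch under stated assumptions: it treats the header as whitespace-separated pairs of a marker (@P, @O, @L1, ...) and an argument, as in the flattened form shown here, it does not handle backslash line continuations, and it does not interpret the markers, whose meaning is defined by the general FS syntax. The function name is hypothetical and the sample string is an abbreviated excerpt of the header above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_header(text: str) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (markers per attribute, value list per attribute)."""
    markers: Dict[str, List[str]] = defaultdict(list)
    value_lists: Dict[str, List[str]] = {}
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("expected marker/argument pairs")
    for marker, argument in zip(tokens[0::2], tokens[1::2]):
        if not marker.startswith("@"):
            raise ValueError("not a header marker: %r" % marker)
        # An argument is an attribute name, optionally followed by |-separated values.
        name, *values = argument.split("|")
        markers[name].append(marker)
        if values:
            value_lists[name] = values
    return dict(markers), value_lists


sample = "@P lemma @O lemma @P afun @L1 afun|---|Pred|Sb|??? @N ord"
declared, lists = parse_header(sample)
print(declared)
print(lists)
```

Such an inventory makes it easy to check which of the attributes described in this document a given file actually declares.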