Talbanken05
This is the home page for Talbanken05, a modernized version of Talbanken76,
a Swedish treebank of roughly 300,000 words, constructed at Lund University
in the 1970s. The treebank comes with no guarantee but is freely available
for research and educational purposes as long as proper credit is given for
the work done to produce the material (both in Lund and in Växjö).
The archive available for download contains the entire treebank (divided into
sections P, G, IB and SD) in four versions:
- MAMBA: Original syntactic and lexical annotation (original text encoding, with corrections)
- FPS: Flat phrase structure annotation (TIGER-XML encoding)
- DPS: Deepened phrase structure annotation (TIGER-XML encoding)
- Dep: Dependency structure annotation (Malt-XML encoding)
The two phrase structure versions are encoded in TIGER-XML, which means that
they can be searched and displayed using TIGERSearch. The Dep version is
encoded in Malt-XML but can be converted using MaltConverter to a TIGER-XML
encoding of dependency structure conforming to the guidelines proposed by the
Nordic Treebank Network.
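As a rough illustration of what the TIGER-XML encoding makes possible outside TIGERSearch, the following Python sketch lists the labeled edges of each sentence using only the standard library. The element names (s, t, nt, edge) and the idref and label attributes follow the general TIGER-XML format; the attribute names word and cat, the identifiers, and the two-word sample fragment are assumptions made for the example and are not taken from the Talbanken05 files.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; not copied from the treebank.
SAMPLE = """<corpus><body>
<s id="s1"><graph root="s1_500">
<terminals>
<t id="s1_1" word="INFÖRS" pos="VVPSSMPA"/>
<t id="s1_2" word="BESKATTNING" pos="VN"/>
</terminals>
<nonterminals>
<nt id="s1_500" cat="S">
<edge label="FV" idref="s1_1"/>
<edge label="SS" idref="s1_2"/>
</nt>
</nonterminals>
</graph></s>
</body></corpus>"""


def list_edges(xml_text):
    """Return (sentence id, parent category, edge label, child) tuples."""
    rows = []
    root = ET.fromstring(xml_text)
    for sentence in root.iter("s"):
        # Map terminal ids to word forms so that edges can be printed readably.
        words = {t.get("id"): t.get("word") for t in sentence.iter("t")}
        for nt in sentence.iter("nt"):
            for edge in nt.iter("edge"):
                target = edge.get("idref")
                # Edges to other nonterminals keep their id as the child value.
                rows.append((sentence.get("id"), nt.get("cat"),
                             edge.get("label"), words.get(target, target)))
    return rows


for row in list_edges(SAMPLE):
    print(row)
```

For the real files, the feature names declared in the corpus header should be checked before relying on word and cat.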
Below we give a brief description of the original treebank (Talbanken76), the
process of conversion, and the three different annotation standards (FPS, DPS, Dep).
The conversion is described more fully in:
- Nilsson, J., Hall, J. and Nivre, J. (2005) MAMBA Meets TIGER:
Reconstructing a Swedish Treebank from Antiquity. In Proceedings
of the NODALIDA Special Session on Treebanks.
Talbanken76
Talbanken76 was originally published as:
- Jan Einarsson: Talbankens skriftspråkskonkordans (1976)
- Jan Einarsson: Talbankens talspråkskonkordans (1976)
The data were collected in several projects at Lund University in the 1970s
and the material is described in several publications:
- Ulf Teleman: Manual för grammatisk beskrivning av talad och skriven svenska (MAMBA) (1974)
- Margareta Westman: Bruksprosa (1974)
- Nils Jörgensen: Meningsbyggnaden i talad svenska (1976)
- Tor G Hultman and Margareta Westman: Gymnasistsvenska (1977)
- Jan Einarsson: Talad och skriven svenska (1978)
Teleman (1974) describes the analysis principles, while the other books apply these
principles to different authentic materials.
Talbanken76 consists of a written language part and a spoken language part
of roughly equal size. The written language part in turn consists of two sections,
the so-called professional prose section (P), with
data from textbooks, brochures, newspapers, etc., and a collection
of high school students' essays (G). The spoken language part also
has two sections, interviews (IB) and conversations and
debates (SD). Altogether, the corpus contains close to 300,000
running tokens.
The MAMBA annotation scheme consists of two layers, the first
being a lexical analysis, consisting of part-of-speech
information including morphological features, and the second
being a syntactic analysis, in terms of grammatical functions.
Both layers are flat in the sense that they consist of tags
assigned to individual word tokens, but the syntactic layer also
gives information about constituent structure, as exemplified in
the annotation of the sentence Genom skattereformen införs
individuell beskattning av arbetsinkomster (Through the tax
reform, individual taxation of work income is introduced):
*GENOM PR AAPR
SKATTEREFORMEN NNDDSS AA
INFÖRS VVPSSMPA FV
INDIVIDUELL AJ SSAT
BESKATTNING VN SS
AV PR SSETPR
ARBETSINKOMSTER NN SS SSET
. IP IP
The first column of annotation is the lexical analysis, while
the second column is the syntactic analysis. The grammatical
subject of the sentence is the phrase individuell beskattning
av arbetsinkomster (individual taxation of work income),
where the head word beskattning (taxation) is assigned
the simple tag SS for subject, while the pre-modifying
adjective individuell (individual) is tagged SS and AT
for adjectival modifier; in the post-modifying prepositional
phrase, the noun arbetsinkomster (work income) is tagged
SS and ET for post-modifier, while the preposition av
(of) is tagged SS, ET and PR for preposition.
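The reading of the syntactic column described above can be sketched in a few lines of Python. This is only a sketch under two assumptions that are not stated in the MAMBA documentation quoted here: that the last whitespace-separated field of a line is the syntactic tag, and that a syntactic tag is a concatenation of two-character function codes (as in SSETPR = SS + ET + PR). The function name parse_mamba_line is made up for the example.

```python
def parse_mamba_line(line):
    """Split one annotation line into word, lexical tag and function codes.

    Assumptions: first field = word, last field = syntactic tag, anything
    in between = lexical analysis; function codes are two characters each.
    """
    fields = line.split()
    word, syntactic = fields[0], fields[-1]
    lexical = " ".join(fields[1:-1])
    functions = [syntactic[i:i + 2] for i in range(0, len(syntactic), 2)]
    return word, lexical, functions


for line in ["INDIVIDUELL AJ SSAT", "BESKATTNING VN SS", "AV PR SSETPR"]:
    print(parse_mamba_line(line))
```

Read from left to right, the function codes give the path from the sentence level down to the word, which is what makes the constituent structure recoverable from the flat annotation.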
Tables explaining the categories used can be found here:
Conversion
The syntactic analysis in Talbanken76 is described by its
creators as an eclectic combination of
dependency grammar, topological field analysis and
immediate constituent analysis. This makes it
very suitable for conversion to both
phrase structure and dependency annotation. The conversion
has proceeded in three steps:
- The original flat but multi-layered annotation is converted
  to a bare phrase structure annotation, i.e. a phrase
  structure with unlabeled nonterminal nodes, and edges labeled
  with grammatical functions. This conversion is rather
  straightforward given the partially hierarchical annotation
  exemplified above.
- The bare phrase structure annotation is extended to a full
  phrase structure representation by labeling nonterminal nodes
  with syntactic categories. These categories are not part of
  the original annotation and have to be inferred from
  other parts of the annotation.
- The full phrase structure annotation is converted to a
  dependency annotation using the standard technique based on
  head-finding rules, preserving grammatical functions as edge
  labels. Head-finding rules are not part of the original
  annotation scheme and have to be constructed manually.
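To give a feel for the first step, here is a simplified Python sketch that turns the function codes of the example sentence into a bare phrase structure: nested lists of (edge label, child) pairs with no category on the nonterminals. It is an illustration only, not the procedure of Nilsson, Hall and Nivre (2005). It assumes that each word carries a list of two-character function codes read from the outermost constituent inwards, it uses a hypothetical label HD for the word that heads a constituent, and it merges adjacent constituents with the same function, which a real conversion would have to keep apart.

```python
from itertools import groupby


def bare_structure(tokens, head_label="HD"):
    """Group consecutive tokens by their outermost function code.

    tokens: list of (word, codes) pairs. Returns a list of (label, child)
    pairs, where child is either a word or a nested list of such pairs.
    """
    children = []
    for label, run in groupby(
            tokens, key=lambda tok: tok[1][0] if tok[1] else head_label):
        run = list(run)
        if all(len(codes) <= 1 for _, codes in run):
            # No deeper structure: attach the words directly.
            children.extend((label, word) for word, _ in run)
        else:
            # Strip the shared outer code and recurse into the constituent.
            inner = [(word, codes[1:]) for word, codes in run]
            children.append((label, bare_structure(inner, head_label)))
    return children


sentence = [
    ("*GENOM", ["AA", "PR"]),
    ("SKATTEREFORMEN", ["AA"]),
    ("INFÖRS", ["FV"]),
    ("INDIVIDUELL", ["SS", "AT"]),
    ("BESKATTNING", ["SS"]),
    ("AV", ["SS", "ET", "PR"]),
    ("ARBETSINKOMSTER", ["SS", "ET"]),
    (".", ["IP"]),
]
tree = bare_structure(sentence)
print(tree)
```

In the result, the subject (SS) is a single constituent containing the modifier individuell, the head beskattning, and a nested post-modifier (ET) for av arbetsinkomster.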
Phrase Structure Annotation
The phrase structure annotation, which is the outcome of the second
conversion step, uses a conventional set of phrase types (S, NP, VP,
etc.) in combination with the grammatical functions of the original
MAMBA annotation. The representation allows discontinuous phrases,
although discontinuous constituents are relatively rare in the treebank.
The phrase structure annotation comes in two versions, one with the
flattest possible trees that can be extracted from the original
annotation, called Flat Phrase Structure (FPS),
and one where trees have been deepened by inserting,
e.g., NPs within PPs and VPs within (larger) VPs,
called Deepened Phrase Structure (DPS).
In both cases, the conversion has necessitated the introduction
of a small number of new syntactic functions.
Dependency Structure Annotation
The dependency annotation (Dep), which is the outcome of the third conversion
step, consists of terminal nodes connected by edges labeled with
the same syntactic functions as DPS, extended with the label ROOT
for words that are not governed by another word in the dependency
structure. The representation allows non-projective
dependency structures, which are needed to capture discontinuous
constituents.
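Whether a given dependency structure is non-projective can be checked with a short routine. The Python sketch below uses the usual definition: an arc from a head to a dependent is projective if every word strictly between them is dominated by that head. The input convention (a list of head positions, 1-based, with 0 for words labeled ROOT) is an assumption made for the example rather than the Malt-XML representation, and the input is assumed to be a well-formed tree without cycles.

```python
def is_projective(heads):
    """heads[i] is the position of the head of word i + 1; 0 marks ROOT."""

    def dominated_by(node, ancestor):
        # Follow head links upwards until the ancestor or ROOT is reached.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue  # words without a governor impose no constraint here
        low, high = sorted((dependent, head))
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True


print(is_projective([2, 0, 2]))     # all arcs nest properly
print(is_projective([3, 0, 2, 2]))  # arc 3 -> 1 crosses the root word 2
```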
The conversion from phrase structure to dependency structure
uses a priority list for finding the head of a phrase.
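The general idea of head finding with a priority list can be sketched as follows. The Python example works on a phrase structure given as nested (edge label, child) pairs. The priority list, the HD label for phrase heads, the leftmost-child fallback and the tree itself are hypothetical choices for illustration. The actual rules used for Talbanken05 are not reproduced here, so the resulting arcs are not claimed to match the Dep version.

```python
# Hypothetical priority list: the first label found among the children wins.
PRIORITY = ["HD", "FV"]


def head_child(children, priority):
    for wanted in priority:
        for child in children:
            if child[0] == wanted:
                return child
    return children[0]  # fallback (assumption): leftmost child


def lexical_head(node, priority):
    _, payload = node
    if isinstance(payload, str):
        return payload
    return lexical_head(head_child(payload, priority), priority)


def to_dependencies(children, priority, governor=None, label="ROOT"):
    """Return (word, label, governor) triples; words double as identifiers."""
    head = head_child(children, priority)
    head_word = lexical_head(head, priority)
    arcs = []
    for child in children:
        child_label, payload = child
        if child is head:
            # The head inherits the edge label and governor of its phrase.
            if isinstance(payload, str):
                arcs.append((payload, label, governor))
            else:
                arcs.extend(to_dependencies(payload, priority, governor, label))
        elif isinstance(payload, str):
            arcs.append((payload, child_label, head_word))
        else:
            arcs.extend(to_dependencies(payload, priority, head_word, child_label))
    return arcs


phrase = [
    ("FV", "INFÖRS"),
    ("SS", [("AT", "INDIVIDUELL"), ("HD", "BESKATTNING")]),
    ("IP", "."),
]
arcs = to_dependencies(phrase, PRIORITY)
for arc in arcs:
    print(arc)
```

Because the grammatical function of a phrase is passed down to its lexical head, the edge labels of the phrase structure survive as dependency labels, and the one word without a governor receives ROOT.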