The evaluation consists essentially of a comparison of the keywords assigned automatically by Lexware with those assigned manually by indexers at Riksdagsbiblioteket to 1403 Riksdag documents. The texts of these documents are provided in the file txt.zip. A thesaurus of 3969 nodes, designed by Riksdagsbiblioteket specifically for indexing, is provided in the file thesaurus.zip. Keywords are restricted to terms of this thesaurus.
Several lists presenting the results of manual and automatic indexing in different sort orders are available to assist the evaluation analysis. The file diff.txt lists the documents, each with two sets of keys: manually assigned and automatically assigned. Documents are identified by a title and an id; the id is needed to look up the document text in txt.zip.
Download file txt.zip
Download file thesaurus.zip
Download file diff.zip
· rixlextes97.txt - the thesaurus in the original form provided by Riksdagsbiblioteket,
· terms.txt - an alphabetical list of all terms in the thesaurus,
· termtree.txt - the thesaurus formatted as a tree of terms (see the comments at the beginning of this file).
· diff.txt – the main listing of the results of manual and automatic indexing: a list of documents, each with its manually and automatically assigned keys.
· stat.txt – a concise table showing, in percent, the number of matches of each distinguished type (see sec. 5 below) between manually and automatically assigned keywords.
· doc_avg.txt, doc_e.txt, doc_ebn.txt, doc_min.txt – differently sorted lists containing the document id; the measures E [%], EBN [%] (cf. sec. 5), M-A (manual-automatic), AVG (average weight keyword), MIN (minimum weight keyword) and R% (Reliability); the document title; and the document length in text word tokens.
· keydist.txt – the distribution of terms assigned as keywords, ordered from the most to the least frequently used. Of the 3969 thesaurus terms, 1842 were used as keywords for the 1403 indexed documents.
· docperkey.txt – lists the documents for each key of the distribution list. The number of occurrences of a term as a keyword is followed by the document id, title and length (in text words) of each document.
· mkey.txt, mkeydiv.txt, mtreeall.txt, mtreeusk.txt, man-auto.txt, termslen.txt, termsrel.txt, termsrelx.txt – listings showing the various measures used by Lexware to estimate the relevance of a term as a keyword for a document. These measures are listed below.
· EQUAL (E) means that the automatically assigned term exactly matches the manually assigned term.
· BROADER (B) means that the automatically assigned term is broader than the manually assigned term. It is less specific, i.e. higher up in the thesaurus tree.
· NARROWER (N) means that the automatically assigned term is narrower than the manually assigned term. It is more specific, i.e. lower in the thesaurus tree.
· SIBLING (S) means that the terms have a common parent.
· TREE (T) means that the terms have a common ancestor (a parent higher in the thesaurus tree), up to one of the 542 root nodes.
· DIFFERENT (D) means that the terms are in none of the relations listed above, i.e. they are truly different.
· NON_TERM (NT) is a keyword not chosen from the terms of the thesaurus, which Lexware also proposes as a possible term to add to the thesaurus. This happens whenever a salient concept in the document has no match in the thesaurus.
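The match types above can be illustrated with a minimal Python sketch. It is not Lexware's implementation. It assumes that the thesaurus is available as a mapping from each term to its single parent (root nodes map to None) and that BROADER and NARROWER hold across any number of levels; the function names and the small English thesaurus are hypothetical.

```python
from typing import Dict, List, Optional

def ancestors(term: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of ancestors of a term, nearest parent first."""
    chain: List[str] = []
    seen = {term}  # guards against accidental cycles in the data
    node = parent.get(term)
    while node is not None and node not in seen:
        chain.append(node)
        seen.add(node)
        node = parent.get(node)
    return chain

def classify(auto: str, manual: str, parent: Dict[str, Optional[str]]) -> str:
    """Classify an automatic key against a manual key (assumed semantics)."""
    if auto not in parent:
        return "NT"  # not a thesaurus term
    if auto == manual:
        return "E"
    auto_up = ancestors(auto, parent)
    manual_up = ancestors(manual, parent)
    if auto in manual_up:
        return "B"  # automatic term lies above the manual term
    if manual in auto_up:
        return "N"  # automatic term lies below the manual term
    if auto_up and manual_up and auto_up[0] == manual_up[0]:
        return "S"  # same direct parent
    if set(auto_up) & set(manual_up):
        return "T"  # common ancestor further up the tree
    return "D"

# Hypothetical toy thesaurus: term -> parent (None marks a root node).
toy = {
    "law": None,
    "civil law": "law",
    "criminal law": "law",
    "contract law": "civil law",
    "theft": "criminal law",
    "energy": None,
}

for auto, manual in [("law", "contract law"), ("contract law", "theft")]:
    print(auto, "|", manual, "|", classify(auto, manual, toy))
```

The checks are ordered from the most specific relation to the least specific, so each pair of terms receives exactly one label.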
There are 1403 text files with manually assigned keywords. These are used in the comparison with automatic indexing. Each document has an id by which it is identified in the lists comparing the results, such as diff.txt. This id is also used as the name of the file containing the source text of the document.
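Looking up a document text by its id can be sketched in Python as follows. The exact layout of txt.zip is not described here, so the sketch makes two assumptions: the archive member's base name, without any extension, equals the document id, and the text can be decoded as Latin-1. The function name read_document is hypothetical.

```python
import zipfile
from pathlib import PurePosixPath

def read_document(archive, doc_id: str, encoding: str = "latin-1") -> str:
    """Return the text of the archive member whose base name equals doc_id.

    `archive` may be a path to txt.zip or an open binary file object.
    The member naming and the encoding are assumptions, not documented facts.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if PurePosixPath(name).stem == doc_id:
                return zf.read(name).decode(encoding)
    raise KeyError(f"no document with id {doc_id!r} in archive")
```

With an id copied from diff.txt, a call such as read_document("txt.zip", doc_id) would return the corresponding source text, provided the assumptions above hold.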