Welcome to AAMA - the Afro-Asiatic Morphology Archive.
Getting Started
Overview
- 1. The AAMA Project
- 2. Install and configure required software
- 3. Download data, tools, and application code
- 4. The EDN format for morphological data
- 5. Paradigm Labels in the AAMA archive
- 6. Generate RDF data from morphological data files
- 7. Upload RDF data to SPARQL service
- 8. Query SPARQL service
- 9. Remote Data and Webapp Update
Details
-
1. Introduction: The AAMA Project
The purpose of the AAMA Project is to create a morphological archive whose data can be:
- curated (edited/created) -- and hopefully shared!
- inspected
- manipulated
- queried
In the first instance the archive should make available and comparable the major morphological paradigms of some forty Cushitic and Omotic languages, and in the longer term help situate the morphologies of these two language families within Afroasiatic. Ultimately we also hope that the archive and its accompanying software may serve as a tool for exploring the typology and structure of the form of linguistic organization known as the paradigm.
As presently configured the AAMA project consists of three interconnected modules:
-
1.1 Data Files
An extensible collection of data files containing morphological paradigms from Afroasiatic languages. The data itself is application-neutral: it could be cast into any plausible datastore format and used in conjunction with tools and query-and-display applications constructed with any appropriate programming tools.
Presently archived files cover principally the verbal and pronominal morphological paradigms of thirty-three Cushitic and six Omotic languages. In addition there are files with parallel data covering five Semitic languages and two varieties of Egyptian -- limited Berber and Chadic data is in the process of being entered. The intention behind the project is, with the help of collaborators, to extend the scope of the archive to include eventually as complete a representation as possible of all branches of the Afroasiatic language complex.
Nominal paradigms are systematically included in the archive whenever they are present in the underlying monographic source. However, we have found that Cushitic-Omotic nominal morphosyntax does not lend itself as exhaustively to straightforward word-level paradigmatic treatment as pronominal and verbal morphosyntax does. We are experimenting with consistent ways to treat at least case, number, and focus morphosyntax systematically across the archive.
Informally we can define "Paradigm" in its simplest and most obvious sense as:
- Any presentation of one or more linguistic forms ("tokens": words, affixes, clitics, stems, etc.), which share a set of morphological property/value pairs, and which vary systematically along the values of another set of properties.
For consistency within the archive, we use as the normative/persistent paradigm format the JSON-like EDN, a reasonably human-readable and human-editable approximation to traditional paradigm notation. To illustrate what is by far the most common data structure in the archive, the paradigm, consider what traditionally would be termed:
- the number, person, gender paradigm of the imperfect affirmative of the Burunge glide verb xaw- 'come'
In table form:
Number    Person    Gender   Token
Singular  Person1   Common   xaw
Singular  Person2   Common   xaydă
Singular  Person3   Masc     xay
Singular  Person3   Fem      xaydă
Plural    Person1   Common   xaynă
Plural    Person2   Common   xayday
Plural    Person3   Common   xayay

Paradigms are formally rendered in AAMA's EDN format by a nested data structure we call a ":termcluster", where entities are either labels/indices (prefixed by ":") or data strings (enclosed in quotes), square brackets ("[ ]") enclose arrays, and braces ("{ }") enclose indexed arrays. The paradigm just seen in table form would thus be rendered by the following data structure:
{:termcluster
 {:label "burunge-VBaseImperfGlideStemBaseForm-xaw"
  :note "Kiessling1994 ## 7.2.2,7.2.3"
  :common {:vmorphClass :Finite, :polarity :Affirmative, :lexeme :xaw,
           :pos :Verb, :stemClass :GlideStem, :tam :Imperfect}
  :terms [[:number :person :gender :token]
          [:Singular :Person1 :Common "xaw"]
          [:Singular :Person2 :Common "xaydă"]
          [:Singular :Person3 :Masc "xay"]
          [:Singular :Person3 :Fem "xaydă"]
          [:Plural :Person1 :Common "xaynă"]
          [:Plural :Person2 :Common "xayday"]
          [:Plural :Person3 :Common "xayay"]]}}
Here :termcluster is an indexed list with :label and :note properties; :common is an indexed list of the property=value pairs common to every member of the paradigm; and the array :terms has as its first member an array of the paradigm term properties (= paradigm column heads), while each subsequent member array lists, in order, the values of those properties.
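To make the termcluster structure concrete, here is a minimal Python sketch (not part of the AAMA toolchain, purely illustrative) showing how each paradigm term can be reconstructed by merging the :common pairs with one row of :terms:

```python
# Hypothetical sketch: expanding a :termcluster into one property->value
# map per paradigm term. Data taken from the Burunge example above;
# keywords are represented here as plain strings.
common = {"vmorphClass": "Finite", "polarity": "Affirmative", "lexeme": "xaw",
          "pos": "Verb", "stemClass": "GlideStem", "tam": "Imperfect"}
terms = [["number", "person", "gender", "token"],
         ["Singular", "Person1", "Common", "xaw"],
         ["Singular", "Person2", "Common", "xaydă"],
         ["Singular", "Person3", "Masc", "xay"],
         ["Singular", "Person3", "Fem", "xaydă"],
         ["Plural", "Person1", "Common", "xaynă"],
         ["Plural", "Person2", "Common", "xayday"],
         ["Plural", "Person3", "Common", "xayay"]]

header, rows = terms[0], terms[1:]
# Each term inherits every :common pair and adds its own row values.
expanded = [dict(common, **dict(zip(header, row))) for row in rows]

print(expanded[0]["token"])   # xaw
print(expanded[3]["gender"])  # Fem
```

Note how the shared property=value pairs are stated only once in :common, while :terms carries only what distinguishes the individual tokens.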
Any or all of the data files can be downloaded from the AAMA site, and corrections to the existing files and submission, for on-line sharing, of new language files are hereby solicited!
-
1.2 A Resource Description Framework (RDF) Datastore and Related Tools
The data archive will hopefully serve a number of research and reference purposes. One such purpose is the creation of a queryable datastore, and to this end we have elected to set up such a datastore using the W3C-sanctioned RDF format.
Very good introductions to RDF datastores and the associated SPARQL query language can be found on their respective W3C home sites. But very basically, RDF involves:
- Identifying units of information, and assigning them URL-like
unique Uniform Resource Identifiers (URI).
For example, in the paradigm cited above from the burunge-pdgms.edn file, one of the possible values of the property tam (TenseAspectMode) is Imperfect. In the corresponding full RDF/XML format file burunge-pdgms.rdf, the property tam has the full URI:
<http://id.oi.uchicago.edu/aama/2013/burunge/tam>
In the more readable TTL (Turtle) RDF notation, this URI would be notated brn:tam,
and the value Imperfect has the URI:
<http://id.oi.uchicago.edu/aama/2013/burunge/Imperfect>
(in ttl notation brn:Imperfect)
Formal URIs are valuable for distinguishing terminologies and building nomenclatures and ontologies, but in practice they are not visibly present in the user end of our query application.
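The URI scheme just described can be sketched as a one-line mapping (illustrative only; the helper function is not part of the AAMA tools) from a language's namespace segment and a property or value name to its full URI:

```python
# Illustrative sketch of the URI scheme described above: the brn: prefix
# abbreviates the Burunge namespace under the aama 2013 base URI.
AAMA_BASE = "http://id.oi.uchicago.edu/aama/2013/"

def full_uri(lang, name):
    """Build the angle-bracketed full URI for a property or value name."""
    return "<%s%s/%s>" % (AAMA_BASE, lang, name)

print(full_uri("burunge", "tam"))
# <http://id.oi.uchicago.edu/aama/2013/burunge/tam>
print(full_uri("burunge", "Imperfect"))
# <http://id.oi.uchicago.edu/aama/2013/burunge/Imperfect>
```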
- Representing the complex pieces of information involving these
concepts by organizing these conceptual units into tripartite
statements called 'triples'
Triples are conventionally noted:
s p o .
and usually, but without semantic prejudice, read:
subject predicate object .
For example, as one might expect, an extremely common triple in a datastore like AAMA is of the form:
paradigmTermID-s hasProperty-p withValue-o .
Thus if the first term of the EDN paradigm given above had the pdgmTermID aama:d3c483b1, one of the (many) triples describing it would be (in the ttl notation):
aama:d3c483b1 brn:tam brn:Imperfect .
Where aama: is the ttl abbreviation for
<http://id.oi.uchicago.edu/aama/2013/>
Another might be:
aama:d3c483b1 brn:person brn:Person1 .
stating that the :person property of the term has the value :Person1.
And so forth. A good way to see practically the relation between the EDN data file and its RDF transform is to take a look at the EDN and TTL versions of a language data file of interest (e.g. beja-arteiga-pdgms.edn and beja-arteiga-pdgms.ttl).
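The way a single paradigm term fans out into triples can be sketched in a few lines of Python (a hypothetical illustration; the real transformation is performed by AAMA's edn2ttl program, and the literal-string treatment of the token is an assumption):

```python
# Hypothetical sketch: emitting ttl-style triples for one paradigm term.
# The term ID aama:d3c483b1 and the brn: (Burunge) prefix follow the
# examples in the text; treating the token as a quoted literal rather
# than a brn: URI is an illustrative assumption.
term = {"tam": "Imperfect", "person": "Person1", "number": "Singular",
        "gender": "Common", "token": "xaw"}
term_id = "aama:d3c483b1"

triples = []
for prop, value in sorted(term.items()):
    # Tokens are data strings; property values are brn:-prefixed names.
    obj = '"%s"' % value if prop == "token" else "brn:" + value
    triples.append("%s brn:%s %s ." % (term_id, prop, obj))

print("\n".join(triples))
```

Running this prints, among others, the two triples shown above (brn:tam brn:Imperfect and brn:person brn:Person1).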
Not surprisingly it takes a very large number of triples to describe even a moderately large datastore (AAMA on a recent count had 987,911). But they are very rapidly produced and indexed (a few seconds per language using AAMA's edn2ttl program), efficiently stored, and permit extremely quick access to information for display, comparison, manipulation, and reasoning. Among the RDF tools in the on-line material, there is an executable file (with source code) for transforming the (edn) data files into appropriate RDF datastore (ttl/rdf) format, and a set of scripts to upload data files to a local RDF server.
Although RDF is an extremely interesting topic in itself, running the relevant scripts for adding to or correcting archive-data in the edn files (usually done via an application menu choice), requires no special knowledge about RDF datastores. Some knowledge of the structure of an RDF datastore and the SQL-like SPARQL query language however IS required if you want to revise or add a page to the webapp, submitting a new query to the datastore in order to extract new information.
Pending an on-line publicly accessible datastore, you can set one up on your own computer. Instructions are given below for setting up an RDF server on an individual machine, and loading the data into it.
1.3 Query/Display Interface
A rather basic menu-driven application which essentially:
- gathers requested language and morphological property and value information via the usual array of HTML form selection-list, checkbox, and text-input mechanisms;
- formulates them into a SPARQL query,
- which it submits to the datastore, returning the response, typically formatted into an HTML table.
Below we give instructions for downloading, launching, and initializing it. More details on the app are available in the aama/webapp README and the app's Help menu, as well as in a demo video.
-
2. Install and configure required software
-
2.1 Git client
The aama project uses GitHub to store data and tools; you'll need a git client in order to download the tools repository and the data repositories you're interested in. Follow the instructions at Set Up Git.
Note that you do not need to create a GitHub account unless you want to edit the data or code. Instructions for how to do that are below.
-
2.2 aama directory
Create and switch to an aama directory on your local drive, e.g.:

~/ $ mkdir aama
~/ $ cd aama
-
2.3 rdf2rdf.jar
We use this tool to convert RDF files to various formats. Download it and save it someplace convenient; ~/aama/jar is a good place.
-
2.4 Fuseki
Fuseki is the SPARQL server we are using to query the dataset. Download the apache-jena-fuseki-2.4.0 distribution (either the zip file or the tar file; NB, make sure your Java JDK is up-to-date with the download) and store it in a convenient location; ~/aama/fuseki is a good place. The following steps will install the aama dataset and verify that it runs. Further information about Fuseki, as well as information and links about RDF linked data and the SPARQL query language, can be found at the Apache Jena site.
-
-
3. Download data, tools, and application code
-
Take a look at the Aama repositories and decide which languages interest you. In general we use one repository per language, or in some cases, language variety, e.g. beja-arteiga, beja-bishari, etc.
Now you need to download the data to your local hard drive. Create a data directory inside the aama directory, e.g. ~/aama $ mkdir data. Then clone each language repository into the data directory:

~/ $ cd aama/data
~/aama/data $ git clone https://github.com/aama/afar.git
~/aama/data $ git clone https://github.com/aama/geez.git
~/aama/data $ git clone https://github.com/aama/yemsa.git
Alternatively, you can create a personal github account, fork the aama repositories (copy them to your account), and then clone your repositories to your local drive. See Fork a Repo for details.
-
In the same aama directory, clone the aama tools repository:

~/aama $ git clone https://github.com/aama/tools.git

and the web application:

~/aama $ git clone https://github.com/aama/webapp.git
When you're done, your directory structure should look like this (assuming you have cloned afar, geez, and yemsa):
aama
|-data
|---afar
|---geez
|---yemsa
|-fuseki
|-jar
|-tools
|-webapp
-
-
4. The EDN format for morphological data
The normative/persistent data format is the JSON-like EDN: Extensible Data Notation. This notation has the advantage of being a rigorously defined system of terms (:term), strings ("string"), vectors ([a b c d]), maps ({a b, c d}), and sets (#{a b c d}), and is thus reliably transformable into a consistent RDF notation, while at the same time providing a human-readable natural format for data entry and inspection. Our current EDN structure (cf. below), while open to extension and revision, seems to provide a natural notation for the verbal and pronominal inflectional paradigms encountered in Afroasiatic, and perhaps for inflectional paradigms generally.
Since the EDN file is the normative/persistent data format, any corrections or additions you want to make must be made in this file, from which you will then generate new TTL/RDF files to be uploaded to the SPARQL server. In fact, as long as you observe the above structure for EDN files, you can create any number of new language files of your own, transform them to RDF format, and upload them to the SPARQL server for querying.
5. Paradigm Labels in AAMA
In this application, paradigms are labeled, for the purposes of display, comparison, and modification in the various select-lists, checkbox-lists, and text-input fields, as a comma-separated string of shared property=value components, followed, after a '%' delimiter, by a comma-separated list of the properties whose values constitute the rows of the paradigm. In the frequently long paradigm lists automatically generated from the EDN file by the "Create Paradigm Lists" utility, for ease of processing the first two properties are always pos (part-of-speech) and morphClass, and for ease of reading the 'property=' part of the label is omitted for these two. Thus the label of the paradigm illustrated above would be:
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect%number,person,gender
and might occur in a list as:
. . .
Verb,Finite,lex=:qadid,polarity=Affirmative,stemClass=DentalStem,tam=Perfect%number,person,gender
Verb,Finite,lex=:qadid,polarity=Affirmative,stemClass=DentalStem,tam=Subjunctive%number,person,gender
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=GlideStem,tam=Imperfect%number,person,gender
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=GlideStem,tam=Perfect%number,person,gender
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=GlideStem,tam=Subjunctive%number,person,gender
. . .
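The label format just described can be sketched in Python (an illustrative reconstruction, not the actual "Create Paradigm Lists" code; the function name is hypothetical):

```python
# Illustrative reconstruction of the paradigm-label format. The pos and
# morphClass values lead the label bare (no 'property=' part); the
# remaining shared properties follow as property=value pairs; and the
# '%' delimiter introduces the row-defining properties.
def pdgm_label(common, term_props):
    head = [common["pos"], common["morphClass"]]
    rest = ["%s=%s" % (p, v) for p, v in sorted(common.items())
            if p not in ("pos", "morphClass")]
    return ",".join(head + rest) + "%" + ",".join(term_props)

common = {"pos": "Verb", "morphClass": "Finite", "lex": ":xaw",
          "polarity": "Affirmative", "stemClass": "Glide", "tam": "Imperfect"}
label = pdgm_label(common, ["number", "person", "gender"])
print(label)
# Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect%number,person,gender
```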
-
6. Generate RDF data from morphological data files
In order to convert EDN-format data files to TTL ("Turtle" -- a more easily human-readable RDF format), you will need the aama-edn2ttl.jar file. You will find this file in the ~/aama/tools/clj directory, which also contains its source code. You should move this file to wherever you saved the rdf2rdf.jar file (for example ~/aama/jar), which in turn will convert the .ttl file to the RDF/XML needed for uploading to the Fuseki SPARQL service. The EDN->TTL->RDF conversion can be effected by running:

~/aama $ tools/bin/aama-edn2rdf.sh "data/*"

which will make a .ttl and .rdf file for every .edn file in the data/ directory. This script presumes that the two jar files have been placed in ~/aama/jar; you should edit it if the jar files have been located elsewhere. (The conversion of a single language file can be effected by substituting, e.g., data/oromo for "data/*".)
-
7. Upload RDF data to SPARQL service
In order to upload the RDF files to Fuseki, you must first start the server by running:

~/aama $ tools/bin/fuseki.sh

This script, like the following ones, assumes that the current version of Fuseki, for the moment apache-jena-fuseki-2.4.0, has been placed in the aama/fuseki directory, and that the file tools/aamaconfig.ttl has been copied to the Fuseki version directory; the scripts should be edited for the correct locations if this is not the case. When run for the first time, you will notice that the script, which references the configuration file tools/aamaconfig.ttl, will have placed a (for the moment empty) data sub-directory aama in the fuseki/apache-jena-fuseki-2.4.0/ directory.

The following script:

~/aama $ tools/bin/aama-rdf2fuseki.sh "data/*"

will load all the rdf files in aama/data into the Fuseki server. [NOTE: If the remote repository already has data loaded into its Fuseki server, the script:

~/aama $ tools/bin/fudelete.sh "data/*"

must be run first.]

You can test the upload with the script:

~/aama $ tools/bin/fuqueries.sh

which runs the queries count-triples.rq ("How many triples are there in the datastore?") and list-graphs.rq ("What are the URIs of the language subgraphs?") from the directory tools/sparql/rq-ru. If the upload has been successful, you will see output such as the following (assuming again that afar, geez, and yemsa are the languages which have been cloned into aama/data/):

Query: tools/sparql/rq-ru/count-triples.rq
?sTotal
33871

Query: tools/sparql/rq-ru/list-graphs.rq
?g
<http://oi.uchicago.edu/aama/2013/graph/afar>
<http://oi.uchicago.edu/aama/2013/graph/geez>
<http://oi.uchicago.edu/aama/2013/graph/yemsa>
-
8. Query SPARQL service
A SPARQL service can be accessed to explore the morphological data via three interfaces:
-
8.1 An AAMA command-line interface
(Some sample queries can be found in tools/sparql/rq-ru, and scripts in tools/bin. An earlier version exists with an extensive command-line application written in Perl.)
-
8.2 The Apache Jena Fuseki interface
You can see this in your browser at localhost:3030 after you launch Fuseki. SPARQL queries, for example those contained in the tools/sparql/rq-ru/ directory, can be run directly against the datastore in the Fuseki Control Panel on the localhost:3030/dataset.html page (select the /aama dataset when prompted).
-
8.3 A web application specifically oriented to AAMA data
A preliminary menu-driven web application will have already been downloaded following the instructions outlined above in Download data, tools, and application code. This application demonstrates the use of SPARQL query templates for the display and comparison of paradigms and morphosyntactic properties and categories. It is written in Clojure, a LISP dialect with a very involved community of users who have created a formidable and constantly growing set of libraries. However, essentially the same functionality could be achieved by any software framework which can provide a web interface for handling SPARQL queries submitted to an RDF datastore.
The application presupposes that the Fuseki AAMA data server has been launched through the invocation of the shell-script bin/fuseki.sh. At present the application can be run in the webapp directory either:

- from the downloaded source code, with the command lein ring server, using Leiningen;
- or as a Java application from the jar file to be found in the webapp directory, with the command java -jar aama-webapp.jar.

In either case, the application will be seen in the browser at localhost:3000. Note that, in order to run, the application must in either case be initialized by generating/loading application-specific menu and index files, following the steps detailed in the Help > Initialize Application menu option.
-
-
9. Remote Data and Webapp Update
AAMA is an on-going project. Its data is constantly being updated, corrected, and added to; the accompanying web application is in a process of constant revision. To ensure that your data and web app are up to date, you should periodically run the following shell scripts, which assume that git has been installed and that the data and webapp have been cloned from the master version in the manner outlined above.
The following script:

~/aama $ tools/bin/aama-pulldata.sh data/[LANG]

will update the edn language data file in the data/[LANG] directory, while:

~/aama $ tools/bin/aama-pulldata.sh "data/*"

will update the edn language data files in all the data/[LANG] directories. Once revised (or new) edn files have been installed, remember to run the appropriate scripts to transform them to ttl and rdf format and to load them into the SPARQL server, as outlined above.
Finally, the script:
~/aama $ tools/bin/aama-pullwebapp.sh
will update the files of the web application.
Appendix 1: The Data Schema
Basic structure:
In outline each language EDN file has the following structure (see any of the language .edn files for a concrete example, and see below for explanation of terms):
{
|-:lang "language name"
|-:sgpref "string representing 3-character ns prefix used for the
|          URI of language-specific morphosyntactic properties
|          and values"
|-:datasource "bibliographic source(s) for the data in the file"
|-:geodemoURL "on-line geo-/demo-graphical information about the language"
|-:geodemoTXT "short textual summary of geo-/demographical information"
|-:schemata { "associative map of each morphosyntactic property used
|             in the inflectional paradigms with a vector of its values"
|           }
|-:lexemes { "associative map of paradigmatic 'lexemes' with summary map
|            of properties -- a rudimentary lexicon of paradigm lexemes"
|          }
|-:termclusters [ "label-ordered vector of term-clusters/paradigms, each
|                 of which has the structure:"
|   {
|----:label "descriptive label assigned to the term-cluster at data-entry"
|----:note
|----:common "map of property-value pairs which all members of the
|             termcluster have in common"
|----:terms "vector of vectors, the first of which enumerates the
|            properties which differentiate individual terms, while
|            the others list, in order, the value of the i-th
|            property -- in fact, a paradigm for the distinct property-
|            value pairs of the lexeme in question"
|   }
|   . . .
| ]
}
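Before transforming a new or edited language file, it can be useful to verify that it has the expected top-level shape. The helper below is hypothetical (not part of the AAMA tools); it simply checks a parsed file against the key list in the schema above:

```python
# Hypothetical helper (not in the AAMA tools): a quick sanity check that
# a parsed language file carries the top-level keys listed in the schema
# above. EDN keywords are represented here as plain strings.
REQUIRED_KEYS = {"lang", "sgpref", "datasource", "geodemoURL",
                 "geodemoTXT", "schemata", "lexemes", "termclusters"}

def check_language_file(data):
    """Return the set of missing top-level keys (empty set = OK)."""
    return REQUIRED_KEYS - set(data)

# A skeletal (illustrative) Burunge file passes the check:
sample = {"lang": "burunge", "sgpref": "brn",
          "datasource": "Kiessling1994", "geodemoURL": "", "geodemoTXT": "",
          "schemata": {}, "lexemes": {}, "termclusters": []}
print(check_language_file(sample))  # set()
```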
Appendix 2: The Data Files
At present the following data files are available:
- Aari
- Afar
- Akkadian-ob
- Alaaba
- Arabic
- Arbore
- Awngi
- Bayso
- Beja-arteiga
- Beja-atmaan
- Beja-beniamer
- Beja-bishari
- Beja-hadendowa
- Berber-ghadames
- Bilin
- Boni-jara
- Boni-kijee-bala
- Boni-kilii
- Burji
- Burunge
- Coptic-sahidic
- Dahalo
- Dhaasanac
- Dizi
- Egyptian-middle
- Elmolo
- Gawwada
- Gedeo
- Geez
- Hadiyya
- Hausa
- Hdi
- Hebrew
- Iraqw
- Kambaata
- Kemant
- Khamtanga
- Koorete
- Maale
- Mubi
- Oromo
- Rendille
- Saho
- Shinassha
- Sidaama
- Somali-standard
- Syriac
- Tsamakko
- Wolaytta
- Yaaku
- Yemsa