Welcome to AAMA - the Afro-Asiatic Morphology Archive.
Getting Started
Overview
- 1. The AAMA Project
- 2. Install and configure required software
- 3. Download data and tools
- 4. The JSON format for morphological data
- 5. Paradigm Labels in the AAMA archive
- 6. Generate RDF data from morphological data files
- 7. Upload RDF data to SPARQL service
- 8. Query SPARQL service
- 9. Remote Data and Webapp Update
Details
-
1. Introduction: The AAMA Project
The purpose of the AAMA Project is to create a morphological archive whose data can be:
- curated (edited/created) -- and hopefully shared!
- inspected
- manipulated
- queried
In the first instance the archive should make available and comparable the major morphological paradigms of some forty Cushitic and Omotic languages, and in the longer term help situate the morphologies of these two language families within Afroasiatic. Ultimately we hope also that the archive and its accompanying software may serve as a tool for exploration of typology and structure of the form of linguistic organization known as the paradigm.
As presently configured the AAMA project consists of three interconnected modules:
-
1.1 Data Files
An extensible collection of data files containing morphological paradigms from Afroasiatic languages. The data in itself is application-neutral, and could be cast into any plausible datastore format, and used in conjunction with tools and query-and-display applications constructed using any appropriate programming tools.
Presently archived files cover principally the verbal and pronominal morphological paradigms of thirty-three Cushitic and six Omotic languages. In addition there are files with parallel sample data covering five Semitic languages and two varieties of Egyptian -- limited Berber and Chadic data is in the process of being entered. The intention behind the project is, with the help of collaborators, to extend the scope of the archive to include eventually as complete a representation as possible of all branches of the Afroasiatic language complex.
Nominal paradigms are systematically included in the archive whenever they have been present in the underlying monographic source. However, we have found that Cushitic-Omotic nominal morphosyntax does not lend itself as readily to straightforward word-level paradigmatic treatment as pronominal and verbal morphosyntax does. We are experimenting with consistent ways to treat at least case, number, and focus morphosyntax across the archive.
Informally we can define "Paradigm" in its simplest and most obvious sense as:
- Any presentation of one or more linguistic forms ("tokens": words, affixes, clitics, stems, etc.), which share a set of morphological property/value pairs, and which vary systematically along the values of another set of properties.
For consistency within the archive, we are using JSON as the normative/persistent paradigm format, which allows a reasonable, human-readable/-editable approximation to traditional paradigm notation. To illustrate what is by far the most common data structure in the archive, the paradigm, take what traditionally would be termed:
- the number, person, gender paradigm of the imperfect affirmative of the Burunge glide verb xaw- 'come'
In table form:
Number    Person   Gender  Token
Singular  Person1  Common  xaw
Singular  Person2  Common  xaydă
Singular  Person3  Masc    xay
Singular  Person3  Fem     xaydă
Plural    Person1  Common  xaynă
Plural    Person2  Common  xayday
Plural    Person3  Common  xayay
Paradigms are formally rendered in AAMA's JSON format by a nested data structure we call a "termcluster", where entities are either labels/indices or data strings (enclosed in quotes), square brackets ("[ ]") enclose arrays, and braces ("{ }") enclose indexed arrays. So the paradigm just seen in table form would be rendered by the following data structure:
{"termcluster":
  {"label": "burunge-VBaseImperfGlideStemBaseForm-xaw",
   "note": "Kiessling1994 ## 7.2.2,7.2.3",
   "common": {"polarity": "Affirmative", "lexeme": "xaw", "pos": "Verb",
              "stemClass": "GlideStem", "tam": "Imperfect"},
   "terms": [["number", "person", "gender", "token"],
             ["Singular", "Person1", "Common", "xaw"],
             ["Singular", "Person2", "Common", "xaydă"],
             ["Singular", "Person3", "Masc", "xay"],
             ["Singular", "Person3", "Fem", "xaydă"],
             ["Plural", "Person1", "Common", "xaynă"],
             ["Plural", "Person2", "Common", "xayday"],
             ["Plural", "Person3", "Common", "xayay"]]}}
where "termcluster" is an indexed list with a unique "label" and a "note" property, which always indicates the paradigm's published source; "common" is an indexed list of the property=value pairs common to every member of the paradigm; and the array "terms" has as its first member an array of the paradigm term properties (= paradigm column heads), while each subsequent member array lists, in order, the values of those properties.
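To make the structure concrete, here is a small hypothetical Python sketch (not part of the AAMA tools) that merges the shared "common" pairs into each row of "terms", recovering the table shown above:

```python
# Hypothetical helper illustrating the termcluster structure; the data is the
# Burunge paradigm shown above (slightly abridged in the "note"-less form).
pdgm = {
    "termcluster": {
        "label": "burunge-VBaseImperfGlideStemBaseForm-xaw",
        "common": {"polarity": "Affirmative", "lexeme": "xaw",
                   "pos": "Verb", "stemClass": "GlideStem", "tam": "Imperfect"},
        "terms": [
            ["number", "person", "gender", "token"],
            ["Singular", "Person1", "Common", "xaw"],
            ["Singular", "Person2", "Common", "xaydă"],
            ["Singular", "Person3", "Masc", "xay"],
            ["Singular", "Person3", "Fem", "xaydă"],
            ["Plural", "Person1", "Common", "xaynă"],
            ["Plural", "Person2", "Common", "xayday"],
            ["Plural", "Person3", "Common", "xayay"],
        ],
    }
}

def term_dicts(termcluster):
    """Merge the shared 'common' pairs into each term row."""
    heads, *rows = termcluster["terms"]
    common = termcluster["common"]
    return [dict(common, **dict(zip(heads, row))) for row in rows]

for t in term_dicts(pdgm["termcluster"]):
    print(t["number"], t["person"], t["gender"], "->", t["token"])
```

Each resulting dictionary carries the full set of property=value pairs for one paradigm term, which is essentially what the RDF transformation described below works from.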
Any or all of the data files can be downloaded from the AAMA site, and corrections to the existing files and submission, for on-line sharing, of new language files are hereby solicited!
-
1.2 A Resource Description Framework (RDF) Datastore and Related Tools
The data archive will hopefully serve a number of research and reference purposes. One such purpose is the creation of a queriable datastore, which will enable easy manipulation, combination, and comparison of morphological information within and between different languages and language families. To this end we have elected to set up such a datastore using the W3C-sanctioned RDF format.
Very good introductions to RDF datastores and the associated SPARQL query language can be found at their respective W3C home sites. But, very basically, RDF involves:
- Identifying units of information, and assigning them URL-like
unique Uniform Resource Identifiers (URI).
For example, in the paradigm cited above from the burunge-pdgms.json file, one of the possible values of the property tam (TenseAspectMode) is Imperfect. In the corresponding full rdf/xml format file burunge-pdgms.rdf, the property tam has the full URI:
<http://id.oi.uchicago.edu/aama/2013/burunge/tam>
Since the first part of this URI is common to all Burunge morphological properties and values, in the more readable TTL (Turtle) RDF notation this URI would be notated brn:tam, and the Burunge TTL file would contain, in a brief abbreviation section (typically five to ten items), the entry:
@prefix brn: <http://id.oi.uchicago.edu/aama/2013/burunge/>
Similarly, the value Imperfect, which has the URI:
<http://id.oi.uchicago.edu/aama/2013/burunge/Imperfect>
would be notated in TTL as brn:Imperfect.
Formal URIs are valuable for distinguishing terminologies and building nomenclatures and ontologies. But in practice they are not visibly present in the user-end of our query application.
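As an illustration of how this prefix abbreviation works, the following sketch (the helper names are our own, not AAMA code) expands a TTL-style prefixed name to its full URI and compacts a URI back, using the @prefix mapping shown above:

```python
# Hypothetical sketch of TTL prefix abbreviation. The prefix table mirrors the
# @prefix entries cited in the text.
PREFIXES = {
    "brn": "http://id.oi.uchicago.edu/aama/2013/burunge/",
    "aama": "http://id.oi.uchicago.edu/aama/2013/",
}

def expand(qname):
    """brn:tam -> <http://id.oi.uchicago.edu/aama/2013/burunge/tam>"""
    prefix, local = qname.split(":", 1)
    return "<%s%s>" % (PREFIXES[prefix], local)

def compact(uri):
    """Full URI -> prefixed name, preferring the longest matching namespace."""
    uri = uri.strip("<>")
    for p, ns in sorted(PREFIXES.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(ns):
            return "%s:%s" % (p, uri[len(ns):])
    return "<%s>" % uri
```

Note the longest-prefix match in compact(): since the brn: namespace extends the aama: namespace, a Burunge URI should compact to brn:tam rather than aama:burunge/tam.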
- Representing the complex pieces of information involving these
concepts by organizing these conceptual units into tripartite
statements called 'triples'
Triples are conventionally noted:
s p o .
and usually, but without semantic prejudice, read:
subject predicate object .
For example, as one might expect, an extremely common triple in a datastore like AAMA is of the form:
paradigmTermID-s hasProperty-p withValue-o .
Thus if the first term of the JSON paradigm given above had the pdgmTermID aama:d3c483b1, one of the (many) triples describing it would be (in the ttl notation):
aama:d3c483b1 brn:tam brn:Imperfect .
where aama: is the TTL abbreviation for
<http://id.oi.uchicago.edu/aama/2013/>
Another might be:
aama:d3c483b1 brn:person brn:Person1 .
stating that 'the :person property of the term has the value :Person1'
And so forth. A good way to see in practice the relation between the JSON data file and its RDF transform is to take a look at a paradigm of interest in the JSON and TTL versions of a language data file: e.g. the termcluster labeled "brn-VerbGlideStem-xaw-ImperfectAffirmative" in burunge-pdgms.json and its corresponding RDF transformation in the `burunge-pdgms.ttl` file.
Not surprisingly it takes a very large number of triples to describe even a moderately large datastore (AAMA on a recent count had 987,911). But they are very rapidly produced and indexed (a few seconds per language using the AAMA pdgmDict-json2ttl.py program), efficiently stored, and permit extremely quick access to information for display, comparison, manipulation, and reasoning. As mentioned, among the RDF tools in the on-line material, there is a Python script for transforming the (JSON) data files into appropriate RDF datastore (ttl) format, and a set of scripts to upload data files to a local Fuseki RDF server.
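For illustration only, here is a toy sketch of the JSON-to-TTL step (the real transformation is done by pdgmDict-json2ttl.py; the function and the term ID here are hypothetical):

```python
# Toy sketch, NOT the actual pdgmDict-json2ttl.py logic: turn one paradigm
# term's property=value pairs into TTL triple lines. The term ID "d3c483b1"
# echoes the hypothetical pdgmTermID used in the text.
def term_to_ttl(term_id, term, lang_prefix="brn"):
    lines = []
    for prop, value in sorted(term.items()):
        if prop == "token":
            # Token strings are literals, not URIs.
            lines.append('aama:%s %s:token "%s" .' % (term_id, lang_prefix, value))
        else:
            lines.append("aama:%s %s:%s %s:%s ." %
                         (term_id, lang_prefix, prop, lang_prefix, value))
    return "\n".join(lines)

term = {"tam": "Imperfect", "person": "Person1", "number": "Singular",
        "gender": "Common", "token": "xaw"}
print(term_to_ttl("d3c483b1", term))
```

Among the emitted lines is exactly the triple `aama:d3c483b1 brn:tam brn:Imperfect .` discussed above; the token value is emitted as a quoted literal rather than a URI.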
Although RDF is an extremely interesting topic in itself, running the relevant scripts for adding to or correcting archive data in the JSON files (usually done via an application menu choice) requires no special knowledge about RDF datastores. Some knowledge of the structure of an RDF datastore and of the SQL-like SPARQL query language is, however, required if you want to revise or add a page to the webapp, submitting a new query to the datastore in order to extract new information.
Pending an on-line publicly accessible datastore, you can set one up on your own computer. Instructions are given below for setting up an RDF server on an individual machine, and loading the data into it.
1.3 Query/Display User Interface
The directory 'webappy' contains a set of Python scripts which constitute the elements of a rather basic 'proof-of-concept' application:
- A set of scripts which index the paradigm files, set up the material for the menu and select lists and input forms, and programmatically transform the JSON files into TTL. These are principally: pdgmDict-schemata.py, pdgmDict-lexemes.py, pdgmDict-pvlists.py, pdgmDict-json2ttl.py
- A set of scripts to choose, display, and manipulate morphological material within and between language families. For the moment we are using the native Python Tcl/Tk-derived tkinter graphics library, although we plan to return to a unified menu-based browser application, similar to our earlier Clojure-based application. The principal Python scripts in this version are: pdgmDisp-baseApp-PDGM.py, pdgmDisp-baseApp-GPDGM.py, pdgmDispUI-formsearch.py
These scripts generally work as follows:
- They gather requested language and morphological property and value information via an array of form selection-list, checkbox, and text-input mechanisms;
- formulate them into a SPARQL query,
- which is submitted to the datastore, returning a CSV response,
- which in turn is typically formatted into one or more tables using 'pandas' and other Python libraries.
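The last two steps of this cycle can be sketched with the standard library alone (the real application uses pandas; the CSV text below is invented for illustration):

```python
# Sketch of handling a SPARQL CSV response. A Fuseki SELECT query with
# Accept: text/csv returns a header row of variable names followed by one
# row per solution; the response text here is a made-up example.
import csv
import io

response_text = (
    "number,person,token\r\n"
    "Singular,Person1,xaw\r\n"
    "Plural,Person1,xaynă\r\n"
)

# DictReader maps each row onto the header variables, ready for tabulation.
rows = list(csv.DictReader(io.StringIO(response_text)))
for r in rows:
    print(r["number"], r["person"], r["token"])
```

From such a list of row dictionaries the display scripts can build whatever table layout the user requested.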
Below we give instructions for downloading, launching, and initializing the app. More details on the app are available in the aama/webappy README. Also, a brief demo video of the earlier HTML/Clojure version can be seen at AAMA DEMO.
2. Install and configure required software
-
2.1 Git client
The aama project uses GitHub to store data and tools; you will need a git client in order to download the tools repository and the data repositories you are interested in. Follow the instructions at Set Up Git.
Note that you do not need to create a github account unless you want to edit the data or code. Instructions for how to do that are below.
-
2.2 aama directory
We will assume that the data is placed in a directory called 'aama-data' and the application software in a directory called 'webappy'. So create this directory structure on your local drive, e.g.:
~/ $ mkdir aama-data
~/ $ mkdir webappy
~/ $ cd webappy
~/webappy $ mkdir bin
~/webappy $ cd ../aama-data
~/aama-data $ mkdir data
-
2.3 Fuseki
Fuseki is the SPARQL server we are using to query the dataset. Download the
apache-jena-fuseki-2.4.0 distribution (either the zip file or the tar file; NB, make sure your Java JDK is up-to-date with the download) and store it in a convenient location. ~/jena is a good place. The following steps will install the aama dataset and verify that it runs. Further information about Fuseki, as well as information and links about RDF linked data and the SPARQL query language, can be found at the Apache Jena site.
3. Download data, tools, and application code
-
Take a look at the Aama repositories and decide which languages interest you. In general we use one repository per language or, in some cases, language variety: e.g. beja-hud is the variety of Beja described by Richard Hudson in . . . , beja-van is the variety of Beja described by Vanhove in . . . , etc.
Now you need to download the data to your local hard drive. If you have not already done so, create a data directory inside the aama-data directory (~/aama-data $ mkdir data). Then clone each language repository into the data directory:
~/ $ cd aama-data/data
~/aama-data/data $ git clone https://github.com/aama/afar.git
~/aama-data/data $ git clone https://github.com/aama/geez.git
~/aama-data/data $ git clone https://github.com/aama/yemsa.git
Alternatively, you can create a personal github account, fork the aama repositories (copy them to your account), and then clone your repositories to your local drive. See Fork a Repo for details.
-
In the webappy directory, clone the aama Python web application repository, with the shell scripts, which should later be moved to the 'bin' subdirectory:
~/aama $ git clone https://github.com/aama/webappy.git
When you have finished, your directory structure should look like this (assuming you have cloned afar, geez, and yemsa):
~/
|-aama-data
|---afar
|---geez
|---yemsa
|-jena
|-webappy
|---bin
4. The JSON format for morphological data
For the normative/persistent data format we are using JSON. This notation has the advantage of being a rigorously defined system of strings ("string"), numbers, arrays (["a", "b", "c"]), and objects ({"a": 1, "b": 2}), and thus reliably transformable into a consistent RDF notation, while at the same time providing a human-readable natural format for data-entry and inspection.
Our current JSON structure (cf. below), while open to extension and revision, seems to provide a natural notation for the verbal and pronominal inflectional paradigms encountered in Afroasiatic, and perhaps for inflectional paradigms generally.
Since the JSON file is the normative/persistent data format, any corrections or additions you want to make must be made in this file, from which you will then generate new TTL/RDF files to be uploaded to the SPARQL server. And in fact, as long as you observe the above structure for JSON files, you can create any number of new language files of your own, transform them to RDF format, and upload them to the SPARQL server for querying.
5. Paradigm Labels in AAMA
In this application, for the purposes of display, comparison, and modification in the various select-lists, checkbox-lists, and text-input fields, paradigms are labeled as a comma-separated string of shared property=value components, followed, after a delimiter '%', by a comma-separated list of the properties whose values constitute the rows of the paradigm. In the frequently long paradigm lists automatically generated from the JSON file by the "Create Paradigm Lists" utility, for ease of processing the first two properties are always pos (part-of-speech) and morphClass, and, for ease of reading, the 'property=' part of the label is omitted. Thus the label of the paradigm illustrated above would be:
pos=Verb,lex=xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect%number,person,gender
and might occur in a list as:
. . .
pos=Verb,lex=qadid,polarity=Affirmative,stemClass=DentalStem,tam=Perfect%number,person,gender
pos=Verb,lex=qadid,polarity=Affirmative,stemClass=DentalStem,tam=Subjunctive%number,person,gender
pos=Verb,lex=xaw,polarity=Affirmative,stemClass=GlideStem,tam=Imperfect%number,person,gender
pos=Verb,lex=xaw,polarity=Affirmative,stemClass=GlideStem,tam=Perfect%number,person,gender
pos=Verb,lex=xaw,polarity=Affirmative,stemClass=GlideStem,tam=Subjunctive%number,person,gender
. . .
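A label in this format is easy to take apart mechanically; the following hypothetical parser (not part of the AAMA scripts) splits a label into its shared property=value pairs and its row properties:

```python
# Hypothetical parser for the paradigm-label format described above:
# shared property=value components before '%', row properties after it.
def parse_pdgm_label(label):
    common_part, row_part = label.split("%")
    common = dict(item.split("=") for item in common_part.split(","))
    return common, row_part.split(",")

common, row_props = parse_pdgm_label(
    "pos=Verb,lex=xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect"
    "%number,person,gender")
print(common["tam"], row_props)
```

The inverse direction (composing a label from a termcluster's "common" map and the head row of its "terms" array) is equally mechanical, which is what makes the label format convenient for select-lists.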
6. Generate RDF data from morphological data files
In order to convert JSON-format data files to TTL ("turtle"
-- a more easily human-readable RDF format), you will
use the pdgmDict-json2ttl.py file in the
webappy directory. The aama-datastore-update.sh
shell script will call aama-ttl2fuseki.sh which in turn
will convert the .ttl file to the rdf-xml which is needed for
uploading to the Fuseki SPARQL service.
7. Upload RDF data to SPARQL service
In order to upload the RDF files to Fuseki, you must first start the server by running:
~/aama $ webappy/bin/fuseki.sh
This script, like the following, assumes that the current version
of Fuseki, for the moment apache-jena-fuseki-3.16.0, has
been placed in the jena directory, and that
the file aamaconfig.ttl has been copied
to the Fuseki version directory; the
scripts should be edited for the correct locations if this
is not the case. When run for the first time, you will notice
that the script, which references the configuration file
aamaconfig.ttl, will have placed a,
for the moment empty, data
sub-directory aama in the
jena/apache-jena-fuseki-3.16.0/ directory.
The following script:
~/aama $ webappy/bin/aama-datastore-update.sh "../aama-data/data/[LANG]"
will load the relevant LANG-pdgms.ttl file in aama-data/data/[LANG]
into the Fuseki server.
It also automatically runs the queries count-triples.rq
("How many triples are there in the datastore?") and
list-graphs.rq ("What are the URIs of the
language subgraphs?"), from the directory
webappy/bin.
If the upload has been successful, you will see an output such as
the following (assuming again that afar, geez, and yemsa are the
languages which have been cloned into aama-data/data/).
Query: bin/fuquery-gen.sh bin/count-triples.rq
?sTotal
33871
Query: bin/fuquery-gen.sh bin/list-graphs.rq
?g
<http://oi.uchicago.edu/aama/2013/graph/afar>
<http://oi.uchicago.edu/aama/2013/graph/geez>
<http://oi.uchicago.edu/aama/2013/graph/yemsa>
8. Query SPARQL service
A SPARQL service can be accessed to explore the morphological data via three interfaces:
8.1 An AAMA command-line interface
(Some sample queries can be found in webappy/bin. An earlier version of the project included an extensive command-line application written in Perl.) -
8.2 The Apache Jena Fuseki interface
You can see this on your browser at
localhost:3030 after you launch Fuseki. SPARQL queries, for example, . . . , can be run directly against the datastore in the Fuseki Control Panel on the localhost:3030/dataset.html page (select the /aama dataset when prompted). Also, the pdgmDisp-... scripts automatically write to the terminal all SPARQL queries generated in the course of the computation. These queries can be copied and pasted into the Fuseki panel for inspection and debugging. -
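Queries can also be sent to the local Fuseki dataset programmatically over the standard SPARQL protocol. The sketch below only builds the request without sending it (the /aama/query endpoint path is assumed from the dataset name; start Fuseki and uncomment the last lines to run it for real):

```python
# Sketch of a SPARQL protocol request to a local Fuseki server. The endpoint
# path /aama/query is an assumption based on the /aama dataset name.
import urllib.parse
import urllib.request

query = "SELECT (COUNT(*) AS ?sTotal) WHERE { GRAPH ?g { ?s ?p ?o } }"

# Encode the query as the standard 'query' parameter and ask for CSV results.
url = "http://localhost:3030/aama/query?" + urllib.parse.urlencode({"query": query})
req = urllib.request.Request(url, headers={"Accept": "text/csv"})

# with urllib.request.urlopen(req) as resp:   # uncomment with Fuseki running
#     print(resp.read().decode("utf-8"))
```

With Fuseki running, the response would be CSV text of the same shape as the count-triples.rq output shown above.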
8.3 An application specifically oriented to AAMA data
A preliminary menu-driven GUI application will have already been downloaded following the instructions outlined above in Download data, tools, and application code. This application demonstrates the use of SPARQL query templates for display and comparison of paradigms and morphosyntactic properties and categories. It is written in Python, which has a very engaged community of users who have created a formidable and constantly growing set of libraries. However, essentially the same functionality could be achieved by any software framework which can provide a web interface for handling SPARQL queries submitted to an RDF datastore.
The application presupposes that the Fuseki AAMA data server has been launched through the invocation of the shell-script bin/fuseki.sh (or the corresponding application menu option).
9. Remote Data and Webapp Update
AAMA is an on-going project. Its data is constantly being updated, corrected, and added to; the accompanying web application is in a process of constant revision. To ensure that your data and web app are up-to-date, you should periodically run the following shell scripts, which assume that git has been installed and that the data and webapp have been cloned from the master version in the manner outlined above.
The following script:
~/aama $ tools/bin/aama-pulldata.sh data/[LANG]
will update the JSON language data file in the
data/[LANG] directory.
While:
~/aama $ tools/bin/aama-pulldata.sh "data/*"
will update the JSON language data files in all the
data/[LANG] directories.
Once revised (or new) JSON files have been installed, remember to run the appropriate scripts to transform them to ttl format and to load them into the SPARQL server, as outlined above.
Finally, the script:
~/aama $ tools/bin/aama-pullwebappy.sh
will update the files of the web application.
Appendix 1: The Data Schema
Basic structure:
In outline each language JSON file has the following structure (see any of the LANGUAGE-pdgms.json files for a concrete example, and see below for explanation of terms):
{
|-:lang "language name"
|-:sgpref "string representing 3-character ns prefix used for the
| URI of language-specific morphosyntactic properties
| and values"
|-:datasource "bibliographic source(s) for the data in the file"
|-:geodemoURL "on-line geo-/demo-graphical information about the language"
|-:geodemoTXT "short textual summary of geo-/demographical information"
|-:schemata { "associative map of each morphosyntactic property used
| in the inflectional paradigms with a list of its values"
| }
|-:lexemes { "associative map of paradigmatic 'lexemes' with summary map
| of properties -- a rudimentary lexicon of paradigm lexemes"
| }
|-:termclusters [ "label-ordered list of term-clusters/paradigms,
each of which has the structure:"
| {
|----:label "descriptive label assigned to the term-cluster at data-entry"
|----:note
|----:common "map of property-value pairs which all members of the
| termcluster have in common"
|----:terms "list of lists, the first of which enumerates the
| properties which differentiate individual terms, while
| the others list, in order, the value of the i-th
| property -- in fact, a paradigm for the distinct property-
| value pairs of the lexeme in question"
| }
|
| . . .
|
| ]
}
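As a modest aid to data entry, the outline above can be checked mechanically. The following hypothetical validator (not part of the AAMA tools) verifies only that a subset of the expected top-level keys is present with plausible container types:

```python
# Hypothetical top-level check for a LANGUAGE-pdgms.json file, based on the
# schema outline above. It deliberately checks only a core subset of keys
# (sgpref, geodemoURL, geodemoTXT may or may not be present in every file).
REQUIRED = {
    "lang": str,
    "datasource": str,
    "schemata": dict,
    "lexemes": dict,
    "termclusters": list,
}

def check_lang_file(data):
    """Return a list of problems; an empty list means the outline is satisfied."""
    problems = []
    for key, typ in REQUIRED.items():
        if key not in data:
            problems.append("missing key: %s" % key)
        elif not isinstance(data[key], typ):
            problems.append("%s should be a %s" % (key, typ.__name__))
    return problems

minimal = {"lang": "Burunge", "datasource": "Kiessling1994",
           "schemata": {}, "lexemes": {}, "termclusters": []}
print(check_lang_file(minimal))
```

A full validator would go on to check each termcluster for the label/note/common/terms structure described in the outline; this sketch only guards the file's top level.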
Appendix 2: The Data Files
At present the following data files are available:
- Aari
- Afar
- Akkadian-ob
- Alaaba
- Arabic
- Arbore
- Awngi
- Bayso
- Beja-alm
- Beja-hud
- Beja-rei
- Beja-rop
- Beja-van
- Beja-wed
- Berber-ghadames
- Bilin
- Boni-jara
- Boni-kijee-bala
- Boni-kilii
- Burji
- Burunge
- Coptic-sahidic
- Dahalo
- Dhaasanac
- Dizi
- Egyptian-middle
- Elmolo
- Gawwada
- Gedeo
- Geez
- Hadiyya
- Hausa
- Hdi
- Hebrew
- Iraqw
- Kambaata
- Kemant
- Khamtanga
- Koorete
- Maale
- Mubi
- Oromo
- Rendille
- Saho
- Shinassha
- Sidaama
- Somali
- Syriac
- Tsamakko
- Wolaytta
- Yaaku
- Yemsa