Welcome to AAMA - the Afro-Asiatic Morphology Archive.
Getting Started
Overview
- 1. The AAMA Project
- 2. Install and configure required software
- 3. Download data, tools, and application code
- 4. The EDN format for morphological data
- 5. Paradigm Labels in the AAMA archive
- 6. Generate RDF data from morphological data files
- 7. Upload RDF data to SPARQL service
- 8. Query SPARQL service
- 9. Remote Data and Webapp Update
Details
-
1. Introduction: The AAMA Project
The purpose of the AAMA Project is to create a morphological archive whose data can be:
- curated (edited/created) -- and hopefully shared!
- inspected
- manipulated
- queried
In the first instance the archive should make available and comparable the major morphological paradigms of some forty Cushitic and Omotic languages, and in the longer term help situate the morphologies of these two language families within Afroasiatic. Ultimately we also hope that the archive and its accompanying software may serve as a tool for exploring the typology and structure of the form of linguistic organization known as the paradigm.
As presently configured the AAMA project consists of three interconnected modules:
-
1.1 Data Files
An extensible collection of data files containing morphological paradigms from Afroasiatic languages. The data itself is application-neutral: it could be cast into any plausible datastore format and used in conjunction with tools and query-and-display applications constructed with any appropriate programming tools.
Presently archived files cover principally the verbal and pronominal morphological paradigms of thirty-three Cushitic and six Omotic languages. In addition there are files with parallel data covering five Semitic languages and two varieties of Egyptian -- limited Berber and Chadic data is in the process of being entered. The intention behind the project is, with the help of collaborators, to extend the scope of the archive to include eventually as complete a representation as possible of all branches of the Afroasiatic language complex.
Nominal paradigms are systematically included in the archive whenever they are present in the underlying monographic source. However, we have found that Cushitic-Omotic nominal morphosyntax does not lend itself as exhaustively to straightforward word-level paradigmatic treatment as pronominal and verbal morphosyntax does. We are experimenting with consistent ways to treat at least case, number, and focus morphosyntax systematically across the archive.
Informally we can define "Paradigm" in its simplest and most obvious sense as:
- Any presentation of one or more linguistic forms ("tokens": words, affixes, clitics, stems, etc.), which share a set of morphological property/value pairs, and which vary systematically along the values of another set of properties.
For consistency within the archive, we use as the normative/persistent paradigm format the JSON-like EDN, a reasonably human-readable and human-editable approximation to traditional paradigm notation. To illustrate what is by far the most common data structure in the archive, the paradigm, consider what traditionally would be termed:
- the number, person, gender paradigm of the imperfect affirmative of the Burunge glide verb xaw- 'come'
In table form:
Number    Person    Gender   Token
Singular  Person1   Common   xaw
Singular  Person2   Common   xaydă
Singular  Person3   Masc     xay
Singular  Person3   Fem      xaydă
Plural    Person1   Common   xaynă
Plural    Person2   Common   xayday
Plural    Person3   Common   xayay

Paradigms are formally rendered in AAMA's EDN format by a nested data structure we call a ":termcluster", where entities are either labels/indices (prefixed by ":") or data strings (enclosed in quotes), square brackets ("[ ]") enclose arrays, and braces ("{ }") enclose indexed arrays. The paradigm just seen in table form would thus be rendered by the following data structure:
{:termcluster
 {:label "burunge-VBaseImperfGlideStemBaseForm-xaw"
  :note "Kiessling1994 ## 7.2.2,7.2.3"
  :common {:vmorphClass :Finite, :polarity :Affirmative, :lexeme :xaw,
           :pos :Verb, :stemClass :GlideStem, :tam :Imperfect}
  :terms [[:number :person :gender :token]
          [:Singular :Person1 :Common "xaw"]
          [:Singular :Person2 :Common "xaydă"]
          [:Singular :Person3 :Masc "xay"]
          [:Singular :Person3 :Fem "xaydă"]
          [:Plural :Person1 :Common "xaynă"]
          [:Plural :Person2 :Common "xayday"]
          [:Plural :Person3 :Common "xayay"]]}}
Here :termcluster is an indexed list with :label and :note properties; :common is an indexed list of the property=value pairs common to every member of the paradigm; and the array :terms has as its first member an array of the paradigm term properties (= paradigm column heads), while each subsequent member array lists, in order, the values of those properties.
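To make the termcluster structure concrete, here is a minimal Python sketch (not part of the AAMA toolchain, purely illustrative) showing how each paradigm term can be reconstructed by merging the :common pairs with one row of :terms:

```python
# Hypothetical sketch: expanding a :termcluster into one property->value
# map per paradigm term. Data taken from the Burunge example above;
# keywords are represented here as plain strings.
common = {"vmorphClass": "Finite", "polarity": "Affirmative", "lexeme": "xaw",
          "pos": "Verb", "stemClass": "GlideStem", "tam": "Imperfect"}
terms = [["number", "person", "gender", "token"],
         ["Singular", "Person1", "Common", "xaw"],
         ["Singular", "Person2", "Common", "xaydă"],
         ["Singular", "Person3", "Masc", "xay"],
         ["Singular", "Person3", "Fem", "xaydă"],
         ["Plural", "Person1", "Common", "xaynă"],
         ["Plural", "Person2", "Common", "xayday"],
         ["Plural", "Person3", "Common", "xayay"]]

header, rows = terms[0], terms[1:]
# Each term inherits every :common pair and adds its own row values.
expanded = [dict(common, **dict(zip(header, row))) for row in rows]

print(expanded[0]["token"])   # xaw
print(expanded[3]["gender"])  # Fem
```

Note how the shared property=value pairs are stated only once in :common, while :terms carries only what distinguishes the individual tokens.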
Any or all of the data files can be downloaded from the AAMA site, and corrections to the existing files and submission, for on-line sharing, of new language files are hereby solicited!
-
1.2 A Resource Description Framework (RDF) Datastore and Related Tools
The data archive will hopefully serve a number of research and reference purposes. One such purpose is the creation of a queryable datastore, and to this end we have elected to set up such a datastore using the W3C-sanctioned RDF format.
Very good introductions to RDF datastores and the associated SPARQL query language can be found on their respective W3C home sites. But very basically, RDF involves:
- Identifying units of information, and assigning them URL-like
unique Uniform Resource Identifiers (URI).
For example, in the paradigm cited above from the burunge-pdgms.edn file, one of the possible values of the property tam (TenseAspectMode) is Imperfect. In the corresponding full RDF/XML format file burunge-pdgms.rdf, the property tam has the full URI:
<http://id.oi.uchicago.edu/aama/2013/burunge/tam>
In the more readable TTL (Turtle) RDF notation, this URI would be notated brn:tam,
and the value Imperfect has the URI:
<http://id.oi.uchicago.edu/aama/2013/burunge/Imperfect>
(in ttl notation brn:Imperfect)
Formal URIs are valuable for distinguishing terminologies and building nomenclatures and ontologies, but in practice they are not visibly present in the user end of our query application.
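The URI scheme just described can be sketched as a one-line mapping (illustrative only; the helper function is not part of the AAMA tools) from a language's namespace segment and a property or value name to its full URI:

```python
# Illustrative sketch of the URI scheme described above: the brn: prefix
# abbreviates the Burunge namespace under the aama 2013 base URI.
AAMA_BASE = "http://id.oi.uchicago.edu/aama/2013/"

def full_uri(lang, name):
    """Build the angle-bracketed full URI for a property or value name."""
    return "<%s%s/%s>" % (AAMA_BASE, lang, name)

print(full_uri("burunge", "tam"))
# <http://id.oi.uchicago.edu/aama/2013/burunge/tam>
print(full_uri("burunge", "Imperfect"))
# <http://id.oi.uchicago.edu/aama/2013/burunge/Imperfect>
```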
- Representing the complex pieces of information involving these
concepts by organizing these conceptual units into tripartite
statements called 'triples'
Triples are conventionally noted:
s p o .
and usually, but without semantic prejudice, read:
subject predicate object .
For example, as one might expect, an extremely common triple in a datastore like AAMA is of the form:
paradigmTermID-s hasProperty-p withValue-o .
Thus if the first term of the EDN paradigm given above had the pdgmTermID aama:d3c483b1, one of the (many) triples describing it would be (in the ttl notation):
aama:d3c483b1 brn:tam brn:Imperfect .
Where aama: is the ttl abbreviation for
<http://id.oi.uchicago.edu/aama/2013/>
Another might be:
aama:d3c483b1 brn:person brn:Person1 .
stating that the :person property of the term has the value :Person1.
And so forth. A good way to see practically the relation between the EDN data file and its RDF transform is to take a look at the EDN and TTL versions of a language data file of interest (e.g. beja-arteiga-pdgms.edn and beja-arteiga-pdgms.ttl).
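The way a single paradigm term fans out into triples can be sketched in a few lines of Python (a hypothetical illustration; the real transformation is performed by AAMA's edn2ttl program, and the literal-string treatment of the token is an assumption):

```python
# Hypothetical sketch: emitting ttl-style triples for one paradigm term.
# The term ID aama:d3c483b1 and the brn: (Burunge) prefix follow the
# examples in the text; treating the token as a quoted literal rather
# than a brn: URI is an illustrative assumption.
term = {"tam": "Imperfect", "person": "Person1", "number": "Singular",
        "gender": "Common", "token": "xaw"}
term_id = "aama:d3c483b1"

triples = []
for prop, value in sorted(term.items()):
    # Tokens are data strings; property values are brn:-prefixed names.
    obj = '"%s"' % value if prop == "token" else "brn:" + value
    triples.append("%s brn:%s %s ." % (term_id, prop, obj))

print("\n".join(triples))
```

Running this prints, among others, the two triples shown above (brn:tam brn:Imperfect and brn:person brn:Person1).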
Not surprisingly it takes a very large number of triples to describe even a moderately large datastore (AAMA on a recent count had 987,911). But they are very rapidly produced and indexed (a few seconds per language using AAMA's edn2ttl program), efficiently stored, and permit extremely quick access to information for display, comparison, manipulation, and reasoning. Among the RDF tools in the on-line material, there is an executable file (with source code) for transforming the (edn) data files into appropriate RDF datastore (ttl/rdf) format, and a set of scripts to upload data files to a local RDF server.
Although RDF is an extremely interesting topic in itself, running the relevant scripts for adding to or correcting archive-data in the edn files (usually done via an application menu choice), requires no special knowledge about RDF datastores. Some knowledge of the structure of an RDF datastore and the SQL-like SPARQL query language however IS required if you want to revise or add a page to the webapp, submitting a new query to the datastore in order to extract new information.
Pending an on-line publicly accessible datastore, you can set one up on your own computer. Instructions are given below for setting up an RDF server on an individual machine, and loading the data into it.
1.3 Query/Display Interface
A rather basic menu-driven application which essentially:
- gathers requested language and morphological property and value information via the usual array of HTML form selection-list, checkbox, and text-input mechanisms;
- formulates them into a SPARQL query,
- which it submits to the datastore, returning the response, typically formatted into an HTML table.
Below we give instructions for downloading, launching, and initializing it. More details on the app are available in the aama/webapp README and the app's Help menu, as well as in a demo video.
-
2. Install and configure required software
-
2.1 Git client
The aama project uses GitHub to store data and tools; you'll need a git client in order to download the tools repository and the data repositories you're interested in. Follow the instructions at Set Up Git.
Note that you do not need to create a GitHub account unless you want to edit the data or code. Instructions for how to do that are below.
-
2.2 aama directory
Create and switch to an aama directory on your local drive, e.g.:

~/ $ mkdir aama
~/ $ cd aama
-
2.3 rdf2rdf.jar
We use this tool to convert RDF files to various formats. Download it and save it someplace convenient; ~/aama/jar is a good place.
-
2.4 Fuseki
Fuseki is the SPARQL server we are using to query the dataset. Download the apache-jena-fuseki-2.4.0 distribution (either the zip file or the tar file; NB, make sure your Java JDK is up-to-date with the download) and store it in a convenient location; ~/aama/fuseki is a good place. The following steps will install the aama dataset and verify that it runs. Further information about Fuseki, as well as information and links about RDF linked data and the SPARQL query language, can be found at the Apache Jena site.
-
-
3. Download data, tools, and application code
-
Take a look at the Aama repositories and decide which languages interest you. In general we use one repository per language, or in some cases, language variety, e.g. beja-arteiga, beja-bishari, etc.
Now you need to download the data to your local hard drive. Create a data directory inside the aama directory, e.g. ~/aama $ mkdir data. Then clone each language repository into the data directory:

~/ $ cd aama/data
~/aama/data $ git clone https://github.com/aama/afar.git
~/aama/data $ git clone https://github.com/aama/geez.git
~/aama/data $ git clone https://github.com/aama/yemsa.git
Alternatively, you can create a personal github account, fork the aama repositories (copy them to your account), and then clone your repositories to your local drive. See Fork a Repo for details.
-
In the same aama directory, clone the aama tools repository:

~/aama $ git clone https://github.com/aama/tools.git

and the web application:

~/aama $ git clone https://github.com/aama/webapp.git
When you're done, your directory structure should look like this (assuming you have cloned afar, geez, and yemsa):
aama
|-data
|---afar
|---geez
|---yemsa
|-fuseki
|-jar
|-tools
|-webapp
-
-
4. The EDN format for morphological data
The normative/persistent data format is the JSON-like EDN: Extensible Data Notation. This notation has the advantage of being a rigorously defined system of terms (:term), strings ("string"), vectors ([a b c d]), maps ({a b, c d}), and sets (#{a b c d}), and is thus reliably transformable into a consistent RDF notation, while at the same time providing a human-readable natural format for data entry and inspection. Our current EDN structure (cf. below), while open to extension and revision, seems to provide a natural notation for the verbal and pronominal inflectional paradigms encountered in Afroasiatic, and perhaps for inflectional paradigms generally.
Since the EDN file is the normative/persistent data format, any corrections or additions you want to make must be made in this file, from which you will then generate new TTL/RDF files to be uploaded to the SPARQL server. In fact, as long as you observe the above structure for EDN files, you can create any number of new language files of your own, transform them to RDF format, and upload them to the SPARQL server for querying.
5. Paradigm Labels in AAMA
In this application, paradigms are labeled, for the purposes of display, comparison, and modification in the various select-lists, checkbox-lists, and text-input fields, as a comma-separated string of shared property=value components, followed, after a '%' delimiter, by a comma-separated list of the properties whose values constitute the rows of the paradigm. In the frequently long paradigm lists automatically generated from the EDN file by the "Create Paradigm Lists" utility, for ease of processing the first two properties are always pos (part-of-speech) and morphClass, and for ease of reading the 'property=' part of the label is omitted for these two. Thus the label of the paradigm illustrated above would be:
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect%number,person,gender
and might occur in a list as:
. . .
Verb,Finite,lex=:qadid,polarity=Affirmative,stemClass=DentalStem,tam=Perfect%number,person,gender
Verb,Finite,lex=:qadid,polarity=Affirmative,stemClass=DentalStem,tam=Subjunctive%number,person,gender
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=GlideStem,tam=Imperfect%number,person,gender
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=GlideStem,tam=Perfect%number,person,gender
Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=GlideStem,tam=Subjunctive%number,person,gender
. . .
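The label format just described can be sketched in Python (an illustrative reconstruction, not the actual "Create Paradigm Lists" code; the function name is hypothetical):

```python
# Illustrative reconstruction of the paradigm-label format. The pos and
# morphClass values lead the label bare (no 'property=' part); the
# remaining shared properties follow as property=value pairs; and the
# '%' delimiter introduces the row-defining properties.
def pdgm_label(common, term_props):
    head = [common["pos"], common["morphClass"]]
    rest = ["%s=%s" % (p, v) for p, v in sorted(common.items())
            if p not in ("pos", "morphClass")]
    return ",".join(head + rest) + "%" + ",".join(term_props)

common = {"pos": "Verb", "morphClass": "Finite", "lex": ":xaw",
          "polarity": "Affirmative", "stemClass": "Glide", "tam": "Imperfect"}
label = pdgm_label(common, ["number", "person", "gender"])
print(label)
# Verb,Finite,lex=:xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect%number,person,gender
```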
-
6. Generate RDF data from morphological data files
In order to convert EDN-format data files to TTL ("Turtle" -- a more easily human-readable RDF format), you will need the aama-edn2ttl.jar file. You will find this file in the ~/aama/tools/clj directory, which also contains its source code. You should move this file to wherever you saved the rdf2rdf.jar file (for example ~/aama/jar), which in turn will convert the .ttl file to the RDF/XML needed for uploading to the Fuseki SPARQL service. The EDN->TTL->RDF conversion can be effected by running:

~/aama $ tools/bin/aama-edn2rdf.sh "data/*"

which will make a .ttl and .rdf file for every .edn file in the data/ directory. This script presumes that the two jar files have been placed in ~/aama/jar; you should edit it if the jar files have been located elsewhere. (The conversion of a single language file can be effected by substituting, e.g., data/oromo for "data/*".)
-
7. Upload RDF data to SPARQL service
In order to upload the RDF files to Fuseki, you must first start the server by running:

~/aama $ tools/bin/fuseki.sh

This script, like the following ones, assumes that the current version of Fuseki, for the moment apache-jena-fuseki-2.4.0, has been placed in the aama/fuseki directory, and that the file tools/aamaconfig.ttl has been copied to the Fuseki version directory; the scripts should be edited for the correct locations if this is not the case. When run for the first time, you will notice that the script, which references the configuration file tools/aamaconfig.ttl, will have placed a (for the moment empty) data sub-directory aama in the fuseki/apache-jena-fuseki-2.4.0/ directory.

The following script:

~/aama $ tools/bin/aama-rdf2fuseki.sh "data/*"

will load all the rdf files in aama/data into the Fuseki server. [NOTE: If the remote repository already has data loaded into its Fuseki server, the script:

~/aama $ tools/bin/fudelete.sh "data/*"

must be run first.]

You can test the upload with the script:

~/aama $ tools/bin/fuqueries.sh

which runs the queries count-triples.rq ("How many triples are there in the datastore?") and list-graphs.rq ("What are the URIs of the language subgraphs?") from the directory tools/sparql/rq-ru. If the upload has been successful, you will see output such as the following (assuming again that afar, geez, and yemsa are the languages which have been cloned into aama/data/):

Query: tools/sparql/rq-ru/count-triples.rq
?sTotal
33871

Query: tools/sparql/rq-ru/list-graphs.rq
?g
<http://oi.uchicago.edu/aama/2013/graph/afar>
<http://oi.uchicago.edu/aama/2013/graph/geez>
<http://oi.uchicago.edu/aama/2013/graph/yemsa>
-
8. Query SPARQL service
A SPARQL service can be accessed to explore the morphological data via three interfaces:
-
8.1 An AAMA command-line interface
(Some sample queries can be found in tools/sparql/rq-ru, and scripts in tools/bin. An earlier version exists with an extensive command-line application written in Perl.)
-
8.2 The Apache Jena Fuseki interface
You can see this in your browser at localhost:3030 after you launch Fuseki. SPARQL queries, for example those contained in the tools/sparql/rq-ru/ directory, can be run directly against the datastore in the Fuseki Control Panel on the localhost:3030/dataset.html page (select the /aama dataset when prompted).
-
8.3 A web application specifically oriented to AAMA data
A preliminary menu-driven web application will have already been downloaded following the instructions outlined above in Download data, tools, and application code. This application demonstrates the use of SPARQL query templates for the display and comparison of paradigms and morphosyntactic properties and categories. It is written in Clojure, a LISP dialect with a very involved community of users who have created a formidable and constantly growing set of libraries. However, essentially the same functionality could be achieved by any software framework which can provide a web interface for handling SPARQL queries submitted to an RDF datastore.
The application presupposes that the Fuseki AAMA data server has been launched through the invocation of the shell-script bin/fuseki.sh. At present the application can be run in the webapp directory either:

- from the downloaded source code, with the command lein ring server, using Leiningen;
- or as a Java application from the jar file to be found in the webapp directory, with the command java -jar aama-webapp.jar.

In either case, the application will be seen in the browser at localhost:3000. Note that, in order to run, the application must in either case be initialized by generating/loading application-specific menu and index files, following the steps detailed in the Help > Initialize Application menu option.
-
-
9. Remote Data and Webapp Update
AAMA is an on-going project. Its data is constantly being updated, corrected, and added to; the accompanying web application is in a process of constant revision. To ensure that your data and web app are up to date, you should periodically run the following shell scripts, which assume that git has been installed and that the data and webapp have been cloned from the master version in the manner outlined above.
The following script:

~/aama $ tools/bin/aama-pulldata.sh data/[LANG]

will update the edn language data file in the data/[LANG] directory, while:

~/aama $ tools/bin/aama-pulldata.sh "data/*"

will update the edn language data files in all the data/[LANG] directories. Once revised (or new) edn files have been installed, remember to run the appropriate scripts to transform them to ttl and rdf format and to load them into the SPARQL server, as outlined above.
Finally, the script:
~/aama $ tools/bin/aama-pullwebapp.sh
will update the files of the web application.
Appendix 1: The Data Schema
Basic structure:
In outline each language EDN file has the following structure (see any of the language .edn files for a concrete example, and see below for explanation of terms):
{
|-:lang "language name"
|-:sgpref "string representing 3-character ns prefix used for the
|          URI of language-specific morphosyntactic properties
|          and values"
|-:datasource "bibliographic source(s) for the data in the file"
|-:geodemoURL "on-line geo-/demo-graphical information about the language"
|-:geodemoTXT "short textual summary of geo-/demographical information"
|-:schemata { "associative map of each morphosyntactic property used
|             in the inflectional paradigms with a vector of its values"
|           }
|-:lexemes { "associative map of paradigmatic 'lexemes' with summary map
|            of properties -- a rudimentary lexicon of paradigm lexemes"
|          }
|-:termclusters [ "label-ordered vector of term-clusters/paradigms, each
|                 of which has the structure:"
|   {
|----:label "descriptive label assigned to the term-cluster at data-entry"
|----:note
|----:common "map of property-value pairs which all members of the
|             termcluster have in common"
|----:terms "vector of vectors, the first of which enumerates the
|            properties which differentiate individual terms, while
|            the others list, in order, the value of the i-th
|            property -- in fact, a paradigm for the distinct property-
|            value pairs of the lexeme in question"
|   }
|   . . .
| ]
}
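Before transforming a new or edited language file, it can be useful to verify that it has the expected top-level shape. The helper below is hypothetical (not part of the AAMA tools); it simply checks a parsed file against the key list in the schema above:

```python
# Hypothetical helper (not in the AAMA tools): a quick sanity check that
# a parsed language file carries the top-level keys listed in the schema
# above. EDN keywords are represented here as plain strings.
REQUIRED_KEYS = {"lang", "sgpref", "datasource", "geodemoURL",
                 "geodemoTXT", "schemata", "lexemes", "termclusters"}

def check_language_file(data):
    """Return the set of missing top-level keys (empty set = OK)."""
    return REQUIRED_KEYS - set(data)

# A skeletal (illustrative) Burunge file passes the check:
sample = {"lang": "burunge", "sgpref": "brn",
          "datasource": "Kiessling1994", "geodemoURL": "", "geodemoTXT": "",
          "schemata": {}, "lexemes": {}, "termclusters": []}
print(check_language_file(sample))  # set()
```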
Appendix 2: The Data Files
At present the following data files are available:
- Aari
- Afar
- Akkadian-ob
- Alaaba
- Arabic
- Arbore
- Awngi
- Bayso
- Beja-arteiga
- Beja-atmaan
- Beja-beniamer
- Beja-bishari
- Beja-hadendowa
- Berber-ghadames
- Bilin
- Boni-jara
- Boni-kijee-bala
- Boni-kilii
- Burji
- Burunge
- Coptic-sahidic
- Dahalo
- Dhaasanac
- Dizi
- Egyptian-middle
- Elmolo
- Gawwada
- Gedeo
- Geez
- Hadiyya
- Hausa
- Hdi
- Hebrew
- Iraqw
- Kambaata
- Kemant
- Khamtanga
- Koorete
- Maale
- Mubi
- Oromo
- Rendille
- Saho
- Shinassha
- Sidaama
- Somali-standard
- Syriac
- Tsamakko
- Wolaytta
- Yaaku
- Yemsa