
aama

Afro-Asiatic Morphology Archive

Welcome to AAMA - the Afro-Asiatic Morphology Archive.

Getting Started

Overview

Appendix 1: The Data Schema

Appendix 2: The Data Files

Details
  • 2. How to install and configure required software

    • 2.1 Git client

      The aama project uses GitHub to store data and tools; you will need a git client in order to download the tools repository and the data repositories you are interested in. Follow the instructions at Set Up Git.

      Note that you do not need to create a GitHub account unless you want to edit the data or code. Instructions for doing that are below.

    • 2.2 aama directory

      We will assume that the data is placed in a directory called 'aama-data' and the application software in a directory called 'webappy'. So create and switch into an aama directory structure on your local drive, e.g.

      ~/ $ mkdir aama-data
      ~/ $ mkdir webappy
      ~/ $ cd webappy
      ~/webappy $ mkdir bin
      ~/webappy $ cd ~/aama-data
      ~/aama-data $ mkdir data

    • 2.4 Fuseki

      Fuseki is the SPARQL server we use to query the dataset. Download the apache-jena-fuseki-2.4.0 distribution (either the zip file or the tar file; NB, make sure your Java JDK is up to date for the download) and store it in a convenient location; ~/fuseki is a good place. The following steps will install the aama dataset and verify that it runs. Further information about Fuseki, as well as information and links about RDF linked data and the SPARQL query language, can be found at the Apache Jena site.

  • 3. Download data, tools, and application code

    • Take a look at the Aama repositories and decide which languages interest you. In general we use one repository per language, or in some cases, language variety, e.g. beja-arteiga, beja-bishari, etc.

      Now you need to download the data to your local hard drive. If you have not already created a data directory inside the aama-data directory, do so now (~/aama-data $ mkdir data). Then clone each language repository into the data directory:

      ~/ $ cd aama-data/data
      ~/aama-data/data $ git clone https://github.com/aama/afar.git
      ~/aama-data/data $ git clone https://github.com/aama/geez.git
      ~/aama-data/data $ git clone https://github.com/aama/yemsa.git

      Alternatively, you can create a personal github account, fork the aama repositories (copy them to your account), and then clone your repositories to your local drive. See Fork a Repo for details.

    • From your home directory, clone the aama Python web application repository, which contains the shell scripts (these should later be moved into the 'bin' subdirectory):

      ~/ $ git clone https://github.com/aama/webappy.git

    When you have finished, your directory structure should look like this (assuming you have cloned afar, geez, and yemsa):

       ~/
       |-aama-data
       |---data
       |-----afar
       |-----geez
       |-----yemsa
       |-fuseki
       |-webappy
       |---bin
  • 4. The JSON format for morphological data

    For the normative/persistent data format we use JSON. This notation has the advantage of being a rigorously defined system of terms (:term), strings ("string"), vectors ([a b c d]), maps ({a b, c d}), and sets (#{a b c d}), and is thus reliably transformable into a consistent RDF notation, while at the same time providing a human-readable natural format for data entry and inspection.

    Our current JSON structure (cf. below), while open to extension and revision, seems to provide a natural notation for the verbal and pronominal inflectional paradigms encountered in Afroasiatic, and perhaps for inflectional paradigms generally.

    Since the JSON file is the normative/persistent data format, any corrections or additions you want to make must be made in this file, from which you will then generate new TTL/RDF files to be uploaded to the SPARQL server. And in fact, as long as you observe the above structure for JSON files, you can create any number of new language files of your own, transform them to RDF format, and upload them to the SPARQL server for querying.
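    As an illustration, a minimal language file following this structure might be built and serialized as below. Every name and value here is invented for the example (keys are shown without the leading colon used in the schema outline, and the exact key spelling in real files may differ); see Appendix 1 and the existing language repositories for the authoritative layout.

    ```python
    import json

    # A minimal, invented language file sketch following the structure
    # described in Appendix 1. All values are hypothetical; real files live
    # in the per-language repositories (afar, geez, yemsa, ...).
    language_file = {
        "lang": "demo",
        "sgpref": "dem",
        "datasource": "Invented example; no real source",
        "schemata": {
            # each morphosyntactic property mapped to a vector of its values
            "number": ["Singular", "Plural"],
            "person": ["1", "2", "3"],
        },
        "lexemes": {
            "xaw": {"pos": "Verb", "gloss": "example gloss"},
        },
        "termclusters": [
            {
                "label": "Verb,xaw,Imperfect%number,person",
                "common": {"pos": "Verb", "lex": "xaw", "tam": "Imperfect"},
                # first inner vector names the differentiating properties;
                # each following vector lists a term's values in order
                "terms": [
                    ["number", "person", "form"],
                    ["Singular", "1", "xawa"],
                    ["Plural", "1", "xawna"],
                ],
            }
        ],
    }

    # Serialize and re-read to confirm the structure survives a JSON round trip.
    text = json.dumps(language_file, indent=2)
    reread = json.loads(text)
    ```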

  • 5. Paradigm Labels in AAMA

    In this application, for the purposes of display, comparison, and modification in the various select lists, checkbox lists, and text-input fields, paradigms are labeled by a comma-separated string of shared property=value components, followed, after a '%' delimiter, by a comma-separated list of the properties whose values constitute the rows of the paradigm. In the frequently long paradigm lists automatically generated from the JSON file by the "Create Paradigm Lists" utility, the first two properties are always pos (part of speech) and morphClass, for ease of processing, and the 'property=' part of the label is omitted, for ease of reading. Thus the label of the paradigm illustrated above would be:

    pos=Verb,lex=xaw,polarity=Affirmative,stemClass=Glide,tam=Imperfect%number,person,gender
    

    and might occur in a list as:

    .  .  .
    pos=Verb,lex=qadid,polarity=Affirmative,stemClass=DentalStem,tam=Perfect%number,person,gender
    pos=Verb,lex=qadid,polarity=Affirmative,stemClass=DentalStem,tam=Subjunctive%number,person,gender
    pos=Verb,lex=xaw,polarity=Affirmative,stemClass=GlideStem,tam=Imperfect%number,person,gender
    pos=Verb,lex=xaw,polarity=Affirmative,stemClass=GlideStem,tam=Perfect%number,person,gender
    pos=Verb,lex=xaw,polarity=Affirmative,stemClass=GlideStem,tam=Subjunctive%number,person,gender
    . . . 
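    The label format described above can be taken apart mechanically: split on '%' to separate the shared property=value components from the row properties. A small illustrative parser (not part of the AAMA tools) might look like this:

    ```python
    def parse_paradigm_label(label):
        """Split an AAMA-style paradigm label into a dict of shared
        property=value pairs and a list of row properties.

        Components without '=' (where the property name has been omitted
        for readability) are kept under a positional placeholder key.
        """
        shared_part, row_part = label.split("%")
        shared = {}
        for i, component in enumerate(shared_part.split(",")):
            if "=" in component:
                key, value = component.split("=", 1)
                shared[key] = value
            else:
                # bare value with the 'property=' part omitted
                shared["_pos%d" % i] = component
        row_properties = row_part.split(",")
        return shared, row_properties
    ```

    Applied to the first example label above, this yields the shared pairs {'pos': 'Verb', 'lex': 'xaw', ...} and the row properties ['number', 'person', 'gender'].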
  • 6. Generate RDF data from morphological data files

    In order to convert JSON-format data files to TTL ("turtle" -- a more easily human-readable RDF format), you will use the pdgmDict-pdgm2ttl.py script in the webappy directory. The aama-datastore-update.sh shell script will call aama-ttl2fuseki.sh, which in turn will convert the .ttl file to the RDF/XML needed for uploading to the Fuseki SPARQL service.

  • 7. Upload RDF data to SPARQL service

    In order to upload the RDF files to Fuseki, you must first start the server by running:

    ~/aama $ webappy/bin/fuseki.sh
                    
    This script, like those that follow, assumes that the current version of Fuseki (at the moment apache-jena-fuseki-3.16.0) has been placed in the fuseki directory and that the file tools/aamaconfig.ttl has been copied into the Fuseki version directory; edit the scripts for the correct locations if this is not the case. When run for the first time, the script, which references the configuration file tools/aamaconfig.ttl, will create an (initially empty) data sub-directory aama in the fuseki/apache-jena-fuseki-3.16.0/ directory.

    The following script:

     ~/aama $ webappy/bin/aama-datastore-update.sh "../aama-data/data/[LANG]"
                    
    will load the relevant ttl file in aama-data/data/[LANG] into the Fuseki server.

    It also automatically runs the queries count-triples.rq ("How many triples are there in the datastore?") and list-graphs.rq ("What are the URIs of the language subgraphs?") from the directory webappy/bin. If the upload has been successful, you will see output such as the following (assuming again that afar, geez, and yemsa are the languages that have been cloned into aama-data/data/).

    Query: bin/fuquery-gen.sh bin/count-triples.rq
    ?sTotal
    33871
    Query: bin/fuquery-gen.sh bin/list-graphs.rq
    ?g
    <http://oi.uchicago.edu/aama/2013/graph/afar>
    <http://oi.uchicago.edu/aama/2013/graph/geez>
    <http://oi.uchicago.edu/aama/2013/graph/yemsa>
    	  

  • 8. Query SPARQL service

    The SPARQL service can be accessed to explore the morphological data via three interfaces:
    • 8.1 An AAMA command-line interface

      (Some sample queries can be found in webappy/bin. An earlier version of the project included an extensive command-line application written in Perl.)

    • 8.2 The Apache Jena Fuseki interface

      You can view this interface in your browser at localhost:3030 after you launch Fuseki. SPARQL queries, for example those contained in the tools/sparql/rq-ru/ directory, can be run directly against the datastore in the Fuseki Control Panel on the localhost:3030/dataset.html page (select the /aama dataset when prompted). The pdgmDispUI-... scripts also write to the terminal all SPARQL queries generated in the course of the computation; these queries can be copied and pasted into the Fuseki panel for inspection and debugging.
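      Fuseki also exposes a standard SPARQL 1.1 protocol HTTP endpoint, so queries can be submitted programmatically as well as through the browser panel. A sketch using only the Python standard library, assuming the dataset is served at localhost:3030/aama as configured above (the query text here is a generic triple count, not necessarily the contents of count-triples.rq):

      ```python
      import json
      import urllib.parse
      import urllib.request

      def build_sparql_request(endpoint, query):
          """Build an HTTP GET request for a SPARQL SELECT query, asking for
          JSON results (standard SPARQL 1.1 protocol; the endpoint URL is an
          assumption based on the /aama dataset name used above)."""
          url = endpoint + "?" + urllib.parse.urlencode({"query": query})
          return urllib.request.Request(
              url, headers={"Accept": "application/sparql-results+json"})

      def count_triples(endpoint="http://localhost:3030/aama/query"):
          """Count triples across the named graphs of a running server."""
          query = "SELECT (COUNT(*) AS ?sTotal) WHERE { GRAPH ?g { ?s ?p ?o } }"
          with urllib.request.urlopen(build_sparql_request(endpoint, query)) as resp:
              results = json.load(resp)
          return results["results"]["bindings"][0]["sTotal"]["value"]
      ```

      Any language with an HTTP client can use the same protocol; the Fuseki server must be running (bin/fuseki.sh) before count_triples() will succeed.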

    • 8.3 A web application specifically oriented to AAMA data

      A preliminary menu-driven GUI application will already have been downloaded by following the instructions in Download data, tools, and application code above. This application demonstrates the use of SPARQL query templates for displaying and comparing paradigms and morphosyntactic properties and categories. It is written in Python, which has a very active community of users who have created a formidable and constantly growing set of libraries. However, essentially the same functionality could be achieved with any software framework that can provide a web interface for handling SPARQL queries submitted to an RDF datastore.

      The application presupposes that the Fuseki AAMA data server has been launched by invoking the shell script bin/fuseki.sh.

    • 9. Remote Data and Webapp Update

      AAMA is an ongoing project. Its data is constantly being updated, corrected, and added to, and the accompanying web application is under constant revision. To ensure that your data and web app are up to date, you should periodically run the following shell scripts, which assume that git has been installed and that the data and webapp have been cloned from the master version in the manner outlined above.

      The following script:

       ~/aama $ tools/bin/aama-pulldata.sh data/[LANG]
                      
      will update the JSON language data file in the data/[LANG] directory.

      While:

       ~/aama $ tools/bin/aama-pulldata.sh "data/*"
                      
      will update the JSON language data files in all the data/[LANG] directories.

      Once revised (or new) JSON files have been installed, remember to run the appropriate scripts to transform them to TTL format and to load them into the SPARQL server, as outlined above.

      Finally, the script:

       ~/aama $ tools/bin/aama-pullwebappy.sh 
                      
      will update the files of the web application.

    • Appendix 1: The Data Schema

      Basic structure:

      In outline, each language JSON file has the following structure (see any of the language files in the data repositories for a concrete example, and see below for an explanation of terms):

      
      {
      |-:lang         "language name"
      |-:sgpref       "string representing 3-character ns prefix used for the 
      |                  URI of language-specific morphosyntactic properties 
      |                  and values"
      |-:datasource   "bibliographic source(s) for the data in the file"
      |-:geodemoURL   "on-line geo-/demo-graphical information about the language"
      |-:geodemoTXT   "short textual summary of geo-/demographical information"
      |-:schemata {	"associative map of each morphosyntactic property used 
      |                  in the inflectional paradigms with a vector of its values"
      |            }
      |-:lexemes   {  "associative map of paradigmatic 'lexemes' with summary map 
      |                  of properties -- a rudimentary lexicon of paradigm lexemes"
      |            }
      |-:termclusters   [  "label-ordered vector of term-clusters/paradigms,
      |                       each of which has the structure:"
      |   {
      |----:label     "descriptive label assigned to the term-cluster at data-entry"
      |----:note
      |----:common    "map of property-value pairs which all members of the
      |                 termcluster have in common"
      |----:terms     "vector of vectors, the first of which enumerates  the
      |                  properties which differentiate individual terms, while
      |                  the others list, in order, the value of the i-th
      |                  property -- in fact, a paradigm for the distinct property-
      |                  value pairs of the lexeme in question"
      |   }
      |
      | . . .
      |
      | ]
      }
      
      
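      The :terms "vector of vectors" layout can be expanded mechanically into one property-value map per term. A small illustrative helper (invented for this example; not part of the AAMA tools):

      ```python
      def expand_terms(common, terms):
          """Turn a termcluster's :terms vector of vectors into a list of
          per-term dicts. The first inner vector names the differentiating
          properties; each following vector lists that term's values in
          order. The :common property-value pairs, shared by all members
          of the termcluster, are merged into every term."""
          header, *rows = terms
          expanded = []
          for row in rows:
              term = dict(common)          # start from the shared pairs
              term.update(zip(header, row))  # add this term's own values
              expanded.append(term)
          return expanded
      ```

      For example, a two-row :terms vector with header ["number", "person", "form"] and common pairs {"pos": "Verb", "lex": "xaw"} expands to two dicts, each carrying both the shared and the differentiating property-value pairs.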

      Appendix 2: The Data Files

      At present the following data files are available: