ASIt Linguistic Linked Dataset

This Version
2012-05-18, http://purl.org/asit/RDF/asit-data.rdf (rdf-schema)
Authors of this document
Emanuele Di Buccio, Giorgio Maria Di Nunzio, Gianmaria Silvello
ASIt Digital Library
http://purl.org/asit

Introduction

The Atlante Sintattico d'Italia (ASIt, Syntactic Atlas of Italy) enterprise builds on a long-standing tradition of collecting and analysing linguistic corpora, which has given rise to several efforts and projects over the years. ASIt accounts for minimally different variants within a sample of closely related languages. First, it does not need a thorough part-of-speech (POS) disambiguation, since the "trivial" identification of basic POS (e.g. nouns vs. verbs) is not enough to capture cross-linguistic differences between closely related languages. Second, the linguistic variants cannot be reduced to lexical distinctions only, i.e. syntactic differences are in general unpredictable on the basis of the properties of single lexical items. What is needed is a specific tag set designed to capture sentence-level phenomena without taking POS tags into consideration. As a consequence, while other tag sets are designed to carry out a gross linguistic analysis of a vast corpus, the ASIt tag set aims to capture fine-grained grammatical differences by comparing various dialectal translations of the same sentence. Moreover, in order to pin down these subtle asymmetries, the linguistic analysis must be carried out manually.

To explain why the requirements of ASIt are so specific, we have to consider two aspects: the nature of Italian dialects, and the kind of linguistic theory ASIt aims to interact with. The Italian dialectal area presents a kind of variation that involves parametric choices affecting many general aspects of syntax, morphology, and phonology. The kind of information we want to gather involves not only the presence of a certain element, but also its absence; an element can be omitted only in some constructions and in conjunction with specific characteristics of the language. For this reason, ASIt proposed the creation of a specific set of tags starting from a universal core shared by all languages (on the basis of the work done by DynaSAND), and subsequently developing a language-specific periphery which is compatible with other projects.

Dialectal data stored in the ASIt were gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. These data were collected by means of questionnaires consisting of sets of Italian sentences: dialectal speakers were asked to translate the sentences into their dialects and write their translations in the questionnaire; therefore, each Italian questionnaire is associated with many parallel dialectal translations. At present, there are eight different questionnaires written in Italian and almost 500 dialectal questionnaires, each corresponding to one of the eight Italian questionnaires, written in more than 240 different dialects, for a total of more than 54,000 sentences and more than 40,000 tags stored in the data resource managed by the ASIt digital library system.

RDF Schema of the Dataset

The RDF/S modeling the ASIt Linguistic Linked Dataset is organized in three main conceptual areas:
Geographical area: the place where a given dialect is spoken and where a speaker is born.
Derivation area: the background of the speaker, i.e. the level of knowledge of the dialect, the particular variety of the dialect, the birthplace, the ancestors, and the document that she/he translated.
Tagging area: how the document is structured and how it has been tagged (at the sentence level and at the word level).

A relevant aspect of the ASIt curated database is that it explicitly models sentence-level tagging, which is not modeled by any other of the presented linguistic projects. Furthermore, we have developed a language-specific set of POS tags which is suitable for ASIt dialectal data but, at the same time, allows these data to be linked to other databases of dialect syntax. We can therefore imagine the creation of a language-specific tag set as starting from a universal core, shared by all languages, and subsequently developing a language-specific periphery, which is compatible with other databases and able to classify language-specific structures.


The following figure reports the main classes and properties defining the RDF/S of the dataset.

[Figure: RDF/S graph]

As far as the vocabulary adopted in this specification is concerned, we use the namespaces and prefixes reported in the following table; asit is the only vocabulary which is not inherited from other domains.

Prefix   Namespace                                    Description
asit     http://purl.org/asit/terms/                  ASIt vocabulary terms
dcterms  http://purl.org/dc/terms/                    Dublin Core terms
foaf     http://xmlns.com/foaf/0.1/                   Friend of a friend
geo      http://www.w3.org/2003/01/geo/wgs84_pos#     WGS84 Geo Positioning
gn       http://www.geonames.org/ontology#            GeoNames Ontology
owl      http://www.w3.org/2002/07/owl#               OWL vocabulary terms
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#  RDF vocabulary terms
rdfs     http://www.w3.org/2000/01/rdf-schema#        RDF Schema
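
As a minimal sketch of how these prefixes are used, the following Python snippet expands a prefixed name such as asit:dialectName into a full URI. The names PREFIXES and expand are illustrative and not part of the dataset; the W3C and GeoNames namespaces are written here in their standard form ending in "#", which is an assumption about what the dataset itself declares.

```python
# Prefix-to-namespace map for the vocabularies used by the dataset.
# Assumption: W3C and GeoNames namespaces use their standard '#' form.
PREFIXES = {
    "asit": "http://purl.org/asit/terms/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "gn": "http://www.geonames.org/ontology#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(curie: str) -> str:
    """Expand a prefixed name (e.g. 'asit:dialectName') into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return PREFIXES[prefix] + local


print(expand("asit:dialectName"))  # http://purl.org/asit/terms/dialectName
```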

The ASIt curated database was the starting point for defining the RDF/S underlying the ASIt Linguistic Linked Dataset presented here. The table below reports the OWL datatype properties of the presented classes.

Class     OWL Datatype Properties
Region    gn:officialName, asit:geographicPartition, asit:note
Province  gn:officialName, gn:shortName, asit:note
Town      gn:officialName, geo:alt, geo:lat, geo:long, gn:population, asit:note, asit:provinceCapital, asit:provinceLittoral, asit:altimetricArea, asit:mountainTown, asit:surface
Dialect   asit:dialectName
Actor     foaf:firstName, foaf:lastName, foaf:name, foaf:birthday, foaf:gender, foaf:mailbox, asit:placeOfBirth, asit:education, asit:job, asit:country, asit:lang, asit:note, asit:affiliation
Document  dcterms:title, dcterms:date
Sentence  asit:sentence, asit:transcription, asit:note
Word      asit:wordText, asit:transcription
Tag       asit:tagDescription, asit:mandatory
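
The table above can be read as a simple schema. The sketch below, in Python, encodes it as a dictionary and checks which properties of a record are not listed for a given class. The names DATATYPE_PROPERTIES and unknown_properties are hypothetical, and the sample record uses illustrative values that are not taken from the dataset.

```python
# Datatype properties per class, transcribed from the table above.
DATATYPE_PROPERTIES = {
    "Region": {"gn:officialName", "asit:geographicPartition", "asit:note"},
    "Province": {"gn:officialName", "gn:shortName", "asit:note"},
    "Town": {
        "gn:officialName", "geo:alt", "geo:lat", "geo:long", "gn:population",
        "asit:note", "asit:provinceCapital", "asit:provinceLittoral",
        "asit:altimetricArea", "asit:mountainTown", "asit:surface",
    },
    "Dialect": {"asit:dialectName"},
    "Actor": {
        "foaf:firstName", "foaf:lastName", "foaf:name", "foaf:birthday",
        "foaf:gender", "foaf:mailbox", "asit:placeOfBirth", "asit:education",
        "asit:job", "asit:country", "asit:lang", "asit:note",
        "asit:affiliation",
    },
    "Document": {"dcterms:title", "dcterms:date"},
    "Sentence": {"asit:sentence", "asit:transcription", "asit:note"},
    "Word": {"asit:wordText", "asit:transcription"},
    "Tag": {"asit:tagDescription", "asit:mandatory"},
}


def unknown_properties(cls: str, record: dict) -> list:
    """Return, sorted, the property names in `record` not listed for `cls`."""
    return sorted(set(record) - DATATYPE_PROPERTIES[cls])


# Illustrative record (hypothetical values): asit:dialectName belongs to
# Dialect, not Town, so it is reported as unknown for a Town.
town = {"gn:officialName": "Example Town", "geo:lat": 45.0,
        "geo:long": 11.0, "asit:dialectName": "Example dialect"}
print(unknown_properties("Town", town))  # ['asit:dialectName']
```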

We exploited this RDF/S to expose the linguistic data in the ASIt curated database as a Linked Dataset whose details are reported in the table below.

Version Number 1.02
Version Date 2012-08-03
Availability Public
Serialization RDF/XML
Triples 421,948
Size 38.3 MB
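
To give an idea of what the RDF/XML serialization looks like, the following Python sketch builds a single description of a dialect with the standard library. It is not the procedure used to produce the dataset: the function name dialect_to_rdfxml, the subject URI under example.org, and the dialect name are hypothetical, and the sketch assumes that the Dialect class is identified in the asit namespace.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ASIT_NS = "http://purl.org/asit/terms/"

# Ask ElementTree to emit the conventional prefixes instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("asit", ASIT_NS)


def dialect_to_rdfxml(subject_uri: str, dialect_name: str) -> str:
    """Serialize one dialect resource as an RDF/XML string."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # Typed node element: assumes the class URI is asit:Dialect.
    node = ET.SubElement(
        root, f"{{{ASIT_NS}}}Dialect", {f"{{{RDF_NS}}}about": subject_uri}
    )
    # Datatype property listed for the Dialect class.
    name = ET.SubElement(node, f"{{{ASIT_NS}}}dialectName")
    name.text = dialect_name
    return ET.tostring(root, encoding="unicode")


# Hypothetical subject URI and value, for illustration only.
print(dialect_to_rdfxml("http://example.org/dialect/1", "Example dialect"))
```

Each such description contributes two triples: one rdf:type statement for the typed node and one for asit:dialectName.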

The ASIt curated database is synchronized with the ASIt Linguistic Linked Dataset, in which every entity in the database corresponds to a class in the linked dataset; therefore, the dataset is maintained following the same policies adopted for the database, ensuring the quality of the data exposed. As a consequence, the size of the ASIt Linguistic Linked Dataset grows proportionally to the size of the data in the curated database: the number of entries associated with a database entity is related to the number of instances of the RDF class we mapped from it. Since the research activities on the linguistic ASIt database are still ongoing, the number of documents and sentences is increasing over time, as is the number of tags that the linguistic researchers associate with them.

Vocabulary Specification

Acknowledgments

This work has been supported by the Project FIRB "Un'inchiesta grammaticale sui dialetti italiani: ricerca sul campo, gestione dei dati, analisi linguistica" (Bando FIRB - Futuro in ricerca 2008).

Website design and management: Information Management Systems Research Group