TreeBank Browser: A Specialized Search Engine for Syntactic Structures

05.31.2013
We would like to share a new application for searching syntactic structures: the TreeBank Browser, implemented by the Institut Universitari de Lingüística Aplicada (IULA) at the Universitat Pompeu Fabra (Barcelona).

The TreeBank browser is a tool aimed at linguists. It contains a Spanish treebank of more than 42,000 syntactically annotated sentences. The syntactic information is represented with dependency grammar, a formalism that lets a sentence be viewed as a graph. All the syntactic information in the corpus is therefore stored as a DEX directed graph, in which the nodes are the words of the corpus and the edges are the dependencies between them. A dependency is the annotation that links two related words; for instance, the fragment “Sr. Salvatori vendió” would be represented by the following subgraph:

“Sr. Salvatori” and “vendió” are two nodes in a sentence of the corpus, and the relationship between them is a dependency called “SUBJ” (subject). It means that the main verb of the sentence is “vendió” (a form of “vender”, to sell) and that “Sr. Salvatori” is the subject, the one who sold something.
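
To make the graph view concrete, here is a minimal Python sketch of the same fragment. It is not the DEX API or the browser's actual data model: the `DependencyGraph` class, its method names and the choice to point edges from the head word to its dependent are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DependencyGraph:
    """Hypothetical in-memory model: words are nodes, dependencies are labelled edges."""
    words: dict = field(default_factory=dict)  # node id -> word form
    edges: list = field(default_factory=list)  # (head id, label, dependent id)

    def add_word(self, node_id, form):
        self.words[node_id] = form

    def add_dependency(self, head, label, dependent):
        # Assumed direction: the edge goes from the head to the dependent.
        self.edges.append((head, label, dependent))

    def dependents(self, head, label=None):
        """Return the forms of the words that depend on `head`, optionally filtered by label."""
        return [
            self.words[dep]
            for h, lab, dep in self.edges
            if h == head and (label is None or lab == label)
        ]


graph = DependencyGraph()
graph.add_word(1, "Sr. Salvatori")
graph.add_word(2, "vendió")
graph.add_dependency(2, "SUBJ", 1)  # "vendió" has "Sr. Salvatori" as its subject

subjects = graph.dependents(2, "SUBJ")
print(subjects)
```

Asking for the `SUBJ` dependents of the verb node returns the subject, which is the question the single edge in the subgraph answers.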

All the sentences in the corpus have been semi-automatically analysed using a grammar with a predefined set of dependency relations, such as subject, direct object, specifier, modifier and punctuation. For a more complete example, consider the sentence “Además, la memoria es la base para el aprendizaje”, which the TreeBank browser analyses and displays as follows:

The TreeBank browser lets users search the corpus for sentences that satisfy user-defined patterns. These patterns take into account both dependencies and word information; the latter may include any combination of part of speech, word form and lemma. For example, we may query for all the sentences in the corpus whose main verb is “establecer”, has a modifier, and has a common noun as its subject.
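
The example query can be sketched in Python as a filter over token lists. This is only an illustration of the idea, not the browser's query language or its implementation: the token fields, the tag names (`ROOT`, `MOD`, `SUBJ`, `NOUN_COMMON` and so on) and the three sentences are invented for the sketch and are not taken from the IULA corpus.

```python
def tok(i, form, lemma, pos, head, dep):
    """Build one token; `head` is the id of the governing word (0 for the main verb)."""
    return {"id": i, "form": form, "lemma": lemma, "pos": pos, "head": head, "dep": dep}


# Invented sentences with hypothetical tags, for illustration only.
sentences = [
    [tok(1, "La", "el", "DET", 2, "SPEC"), tok(2, "ley", "ley", "NOUN_COMMON", 3, "SUBJ"),
     tok(3, "establece", "establecer", "VERB", 0, "ROOT"),
     tok(4, "claramente", "claramente", "ADV", 3, "MOD"),
     tok(5, "los", "el", "DET", 6, "SPEC"), tok(6, "límites", "límite", "NOUN_COMMON", 3, "DOBJ")],
    # Proper-noun subject and no modifier: should not match.
    [tok(1, "Ana", "Ana", "NOUN_PROPER", 2, "SUBJ"),
     tok(2, "establece", "establecer", "VERB", 0, "ROOT"),
     tok(3, "los", "el", "DET", 4, "SPEC"), tok(4, "límites", "límite", "NOUN_COMMON", 2, "DOBJ")],
    # Subject placed after the verb: only the edges are inspected, not word order.
    [tok(1, "Claramente", "claramente", "ADV", 2, "MOD"),
     tok(2, "establece", "establecer", "VERB", 0, "ROOT"),
     tok(3, "la", "el", "DET", 4, "SPEC"), tok(4, "ley", "ley", "NOUN_COMMON", 2, "SUBJ"),
     tok(5, "los", "el", "DET", 6, "SPEC"), tok(6, "límites", "límite", "NOUN_COMMON", 2, "DOBJ")],
]


def matches(sentence, lemma="establecer"):
    """True if the main verb has the given lemma, a modifier and a common-noun subject."""
    for verb in sentence:
        if verb["dep"] != "ROOT" or verb["lemma"] != lemma:
            continue
        children = [t for t in sentence if t["head"] == verb["id"]]
        has_modifier = any(c["dep"] == "MOD" for c in children)
        has_common_subject = any(
            c["dep"] == "SUBJ" and c["pos"] == "NOUN_COMMON" for c in children
        )
        if has_modifier and has_common_subject:
            return True
    return False


hits = [" ".join(t["form"] for t in s) for s in sentences if matches(s)]
print(hits)
```

The pattern combines a constraint on the word itself (the lemma of the main verb) with constraints on its outgoing dependencies (their labels and the part of speech of the dependent word).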

Because the data is stored as a graph, there are no restrictions on the position of the elements in a search. The previous query therefore finds matches regardless of the relative position of each item in the sentence (e.g. subjects or modifiers in preverbal or postverbal position). Each result can be downloaded in a standard tabular format or as a graph, by exporting it to the standard GraphML format, which makes the information more attractive and readable.
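
As a rough idea of what a GraphML export of one dependency might look like, the following Python sketch writes the “Sr. Salvatori vendió” fragment with the standard library. The exact layout the TreeBank browser produces is not documented here, so the `to_graphml` helper, the key names `form` and `label`, and the node id scheme are assumptions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(words, edges):
    """Serialize words (id -> form) and edges (head, label, dependent) as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare one string attribute for nodes and one for edges (hypothetical key names).
    ET.SubElement(root, "key", {"id": "form", "for": "node",
                                "attr.name": "form", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "label", "for": "edge",
                                "attr.name": "label", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id="sentence", edgedefault="directed")
    for node_id, form in words.items():
        node = ET.SubElement(graph, "node", id=f"n{node_id}")
        ET.SubElement(node, "data", key="form").text = form
    for head, label, dependent in edges:
        edge = ET.SubElement(graph, "edge", source=f"n{head}", target=f"n{dependent}")
        ET.SubElement(edge, "data", key="label").text = label
    return ET.tostring(root, encoding="unicode")


graphml_text = to_graphml({1: "Sr. Salvatori", 2: "vendió"}, [(2, "SUBJ", 1)])
print(graphml_text)
```

A file with this content can be opened by graph tools that read GraphML, which is what makes the exported results easy to visualise.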

Treebanks are a resource for developing a number of useful tools in the Natural Language Processing area, such as training parsers and taggers, and for work on machine translation and speech recognition, among many others.

For more information about the tool please take a look at the documentation available in the TreeBank browser website.

---

The TreeBank browser uses DEX, a high-performance NoSQL graph database developed by Sparsity Technologies. One of its main characteristics is its query performance for the retrieval and exploration of large networks. Its implementation, based on very light specialized structures, allows analysing and querying billions of objects at very low storage cost.

Published at DZone with permission of its author, Damaris Coll.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)