Welcome to XMLCorpus’s documentation!

class xmlc.Annotation(tag: Optional[str], item_tag: str, morphology: xmlc.Morphology, parts_of_speech: Optional[xmlc.Field] = None, gloss: Optional[xmlc.Field] = None)[source]

Master class containing all possible annotations that can exist in an XML file.

gloss = None

The annotation’s glossary - can be None.

Type:Field or None
morphology = None

The annotation’s morphology.

static parse(annotation: lxml.etree._Element, tag: str = 'annotation', **kwargs) → xmlc.Annotation[source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:
  • annotation (lxml.etree._Element) – the element to parse.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new tag.

Return type:

XMLItem

Raises:

ValueError – if the element’s tag differs from tag.

parts_of_speech = None

The annotation’s part of speech - can be None.

Type:Field or None
to_table(tabletype='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
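
As a rough illustration of how a tabletype value could be forwarded to tabulate, the sketch below renders a one-row table. The render_table helper, the headers and the row are hypothetical and not part of xmlc; the fallback branch exists only so that the example still runs when tabulate is not installed.

```python
from typing import List, Sequence


def render_table(rows: Sequence[Sequence[str]],
                 headers: List[str],
                 tabletype: str = "simple") -> str:
    """Render rows as text, passing tabletype to tabulate's tablefmt."""
    try:
        from tabulate import tabulate  # third-party: pip install tabulate
    except ImportError:
        # Fallback for environments without tabulate: tab-separated text.
        lines = ["\t".join(headers)]
        lines += ["\t".join(row) for row in rows]
        return "\n".join(lines)
    return tabulate(rows, headers=headers, tablefmt=tabletype)


table = render_table([["amat", "amo", "verb"]],
                     ["Form", "Lemma", "Part of speech"],
                     tabletype="github")
print(table)
```

Any of the format names listed above can be passed as tabletype in the same way.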
class xmlc.AnnotationElements[source]

Enumeration of the possible parts that make up an annotation. Can be:

  1. Morphology
  2. Parts of speech
  3. Gloss
class xmlc.AnnotationStatus[source]
Enumeration containing the three possible statuses for a sentence:
  1. Annotated
  2. Unannotated
  3. Reviewed
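
A minimal sketch of how such a status enumeration might look. Only the UNANNOTATED member and its value 'unannotated' are confirmed by the default shown in the Sentence signature; the other member names and values are assumptions.

```python
from enum import Enum


class AnnotationStatus(Enum):
    """Stand-in enum, not the xmlc class itself."""
    ANNOTATED = "annotated"      # assumed value
    UNANNOTATED = "unannotated"  # matches the default of Sentence.status
    REVIEWED = "reviewed"        # assumed value


# Recover a member from the string form, e.g. one read from an XML attribute.
status = AnnotationStatus("unannotated")
print(status)
```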
class xmlc.Field(tag: Optional[str], cls: T = <class 'xmlc.Value'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]

Class grouping a set of Value objects.

cls

alias of Value

to_table(tabletype='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
class xmlc.Morphology(tag: Optional[str], cls: T = <class 'xmlc.Field'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]

The morphology contains a group of fields, each containing values. It describes the form of the text’s tokens.

cls

alias of Field

get(item: Union[str, int], default_value: Any = None) → Union[xmlc.Field, Any][source]

Searches for an item, given its position or its tag. If not found, returns the default value.

Parameters:
  • item (str or int) – the item to look for. Can be the index or the identifier tag.
  • default_value (Any) – the value to return when not found.
Returns:

the found Field or the default value.

Return type:

Field or Any
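
The lookup described above can be sketched as follows. FieldStub and MorphologyStub are hypothetical stand-ins rather than the xmlc classes, and rejecting negative indexes is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class FieldStub:
    """Hypothetical stand-in for xmlc.Field."""
    tag: str


@dataclass
class MorphologyStub:
    """Hypothetical stand-in for xmlc.Morphology."""
    fields: List[FieldStub] = field(default_factory=list)
    dirs: Dict[str, int] = field(default_factory=dict)

    def add(self, item: FieldStub) -> None:
        self.dirs[item.tag] = len(self.fields)  # tag -> position in fields
        self.fields.append(item)

    def get(self, item: Union[str, int], default_value: Any = None) -> Any:
        # A tag is first translated to its index through dirs.
        index = self.dirs.get(item) if isinstance(item, str) else item
        if index is None or not 0 <= index < len(self.fields):
            return default_value
        return self.fields[index]


morph = MorphologyStub()
morph.add(FieldStub("person"))
morph.add(FieldStub("number"))
print(morph.get("number"), morph.get(0), morph.get("tense", "n/a"))
```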

to_table(ignored='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
class xmlc.Sentence(tag: Optional[str], cls: T, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>, id: str = '', status: xmlc.AnnotationStatus = <AnnotationStatus.UNANNOTATED: 'unannotated'>, alignment_id: Optional[str] = None)[source]

Structure containing a set of tokens, which make up a sentence.

alignment_id = None

Aligned sentence ID - represents a translation of this sentence.

cls

alias of Token

find_by(data: Dict[xmlc.AnnotationElements, Union[Set[str], str]]) → List[xmlc.Token][source]

Recursively looks for tokens that fulfill the specified data requirements.

Parameters:data (dict[AnnotationElements, set[str] or str]) – a dictionary containing the annotation elements to filter and the conditions of the filtering.
Returns:a list of tokens that fulfill the requirements.
Return type:list[Token]
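
One plausible reading of the data argument is sketched below: a token is kept when, for every annotation element in the dictionary, its value equals the given string or belongs to the given set. The enum values, the flattened TokenStub and this matching rule are assumptions of the sketch, not the documented behaviour of xmlc.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Set, Union


class AnnotationElements(Enum):
    """Stand-in enum; the member values are assumptions."""
    MORPHOLOGY = "morphology"
    PARTS_OF_SPEECH = "parts_of_speech"
    GLOSS = "gloss"


@dataclass
class TokenStub:
    """Hypothetical, flattened stand-in for xmlc.Token."""
    form: str
    annotations: Dict[AnnotationElements, str]


Query = Dict[AnnotationElements, Union[Set[str], str]]


def find_by(tokens: List[TokenStub], data: Query) -> List[TokenStub]:
    """Keep the tokens that satisfy every condition in data."""
    def matches(token: TokenStub) -> bool:
        for element, wanted in data.items():
            # A single string is treated as a one-element set.
            allowed = {wanted} if isinstance(wanted, str) else wanted
            if token.annotations.get(element) not in allowed:
                return False
        return True

    return [token for token in tokens if matches(token)]


tokens = [
    TokenStub("puella", {AnnotationElements.PARTS_OF_SPEECH: "noun"}),
    TokenStub("amat", {AnnotationElements.PARTS_OF_SPEECH: "verb"}),
    TokenStub("et", {AnnotationElements.PARTS_OF_SPEECH: "conjunction"}),
]
found = find_by(tokens, {AnnotationElements.PARTS_OF_SPEECH: {"noun", "verb"}})
print([token.form for token in found])
```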
id = ''

Sentence unique ID.

classmethod parse(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.Sentence][source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. In addition to XMLItem, finds and parses any subitem contained by the tag.

Parameters:
  • element (lxml.etree._Element) – the element to parse.
  • subcls (T) – the subclass type used when parsing found objects.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new group of tags.

Return type:

XMLItem

Raises:
side_by_side(another: xmlc.Sentence, tabletype='plain') → str[source]

With the given sentence, compares all tokens contained in both sentences (defined by their alignment ID) and generates a table with the comparison.

Parameters:
  • another (Sentence) – the other sentence to compare.
  • tabletype (str) – the output table format.
Returns:

table representation of the comparison.

Return type:

str
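
The pairing step behind such a comparison might look like the sketch below, which matches each token of one sentence with the tokens of the other through its alignment IDs. TokenStub and pair_tokens are hypothetical names; how xmlc builds the final table is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TokenStub:
    """Hypothetical stand-in for xmlc.Token."""
    id: str
    form: str
    alignment_id: Optional[List[str]] = None  # IDs of the translated word(s)


def pair_tokens(left: List[TokenStub],
                right: List[TokenStub]) -> List[Tuple[str, str]]:
    """Pair each left token with the forms of the right tokens it points at."""
    by_id: Dict[str, TokenStub] = {token.id: token for token in right}
    rows = []
    for token in left:
        aligned = [by_id[i].form for i in (token.alignment_id or []) if i in by_id]
        rows.append((token.form, " ".join(aligned)))
    return rows


latin = [TokenStub("l1", "amat", ["e1", "e2"]), TokenStub("l2", "puellam", ["e3"])]
english = [TokenStub("e1", "he"), TokenStub("e2", "loves"), TokenStub("e3", "girl")]
rows = pair_tokens(latin, english)
print(rows)
```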

status = 'unannotated'

Sentence annotation status - possible values defined at AnnotationStatus.

to_table(tabletype='plain') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
class xmlc.Source(tag: Optional[str], cls: T = <class 'xmlc.Sentence'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>, id: str = '', language: str = '', title: str = '', citation_part: str = '', alignment_id: Optional[str] = None, editorial_note: Optional[str] = None, annotator: Optional[str] = None, reviewer: Optional[str] = None, original_url: Optional[str] = None)[source]

A source comprises a set of organized sentences and can be aligned with another source that translates it.

alignment_id = None

Source’s translation’s ID.

annotator = None

Source’s annotator.

citation_part = ''

Source’s citation.

cls

alias of Sentence

compare(another: xmlc.Source, sentences: Tuple[str, ...] = (), status: Optional[xmlc.AnnotationStatus] = None, tabletype: str = 'simple') → str[source]

With the given source, compares each sentence listed in sentences and generates a table with the comparison.

Parameters:
  • another (Source) – the other source to compare with.
  • sentences (tuple[str, ...]) – the sentences to compare. Empty means all.
  • status (AnnotationStatus) – the sentence status to use when comparing. None means unused.
  • tabletype (str) – the output format for the table.
Returns:

sources comparison as a table.

Return type:

str

Raises:

ValueError – if the sources are not aligned.
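
The alignment check that precedes a comparison could be sketched as below. The rule used here (at least one source’s alignment_id must equal the other’s id) is an assumption, and SourceStub and ensure_aligned are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceStub:
    """Hypothetical stand-in for xmlc.Source."""
    id: str
    alignment_id: Optional[str] = None


def ensure_aligned(source: SourceStub, another: SourceStub) -> None:
    # Assumed rule: at least one source must point at the other's ID.
    if source.alignment_id != another.id and another.alignment_id != source.id:
        raise ValueError(f"sources {source.id!r} and {another.id!r} are not aligned")


text1 = SourceStub("text1", alignment_id="text2")
text2 = SourceStub("text2")
ensure_aligned(text1, text2)  # passes silently

error = None
try:
    ensure_aligned(SourceStub("text3"), text2)
except ValueError as exc:
    error = str(exc)
print(error)
```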

editorial_note = None

Source’s editorial note.

find_words_by(data: Dict[xmlc.AnnotationElements, Union[Set[str], str]]) → List[xmlc.Token][source]

With the given requirements, finds all tokens that fulfill them.

Parameters:data (dict[AnnotationElements, set[str] or str]) – a dictionary containing the annotation elements to filter and the conditions of the filtering.
Returns:a list of tokens that fulfill the requirements.
Return type:list[Token]
id = ''

Source unique ID.

language = ''

Source’s language.

original_url = None

Source’s original URL.

classmethod parse(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.Source][source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. In addition to XMLItem, finds and parses any subitem contained by the tag.

Parameters:
  • element (lxml.etree._Element) – the element to parse.
  • subcls (T) – the subclass type used when parsing found objects.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new group of tags.

Return type:

XMLItem

Raises:
reviewer = None

Source’s reviewer.

title = ''

Source’s title.

to_table(tabletype='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
xmlc.T = ~T

Generic type for designating groups of XML tags.

class xmlc.Token(id: str, form: str, alignment_id: Optional[List[str]] = None, lemma: Optional[str] = None, part_of_speech: Optional[xmlc.Value] = None, morphology: Optional[xmlc.Morphology] = None, gloss: Optional[xmlc.Value] = None)[source]
The token represents a word. A word has only two mandatory attributes:
  • The id.
  • The form, that is, the word itself.

All other values are optional and can be omitted.

alignment_id = None

Optional alignment ID, that is, the ID(s) of the translated word(s).

describe(tabletype='simple') → List[str][source]
Generates a list with the description of the word. It consists of:
  • Form.
  • Lemma.
  • Morphology fields.
  • Part of speech.
  • Glossary.
Parameters:tabletype (str) – the output format for the table - only used if LaTeX.
Returns:the token representation.
Return type:list[str]
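
A minimal sketch of such a description list, in the order given above. The standalone describe function, the "-" placeholder for missing values and the comma-joined morphology are assumptions of this sketch.

```python
from typing import List, Optional


def describe(form: str,
             lemma: Optional[str] = None,
             morphology: Optional[List[str]] = None,
             part_of_speech: Optional[str] = None,
             gloss: Optional[str] = None) -> List[str]:
    """Return [form, lemma, morphology, part of speech, gloss] as strings."""
    missing = "-"  # assumed placeholder for absent optional values
    return [
        form,
        lemma or missing,
        ", ".join(morphology) if morphology else missing,
        part_of_speech or missing,
        gloss or missing,
    ]


row = describe("amat", lemma="amo", morphology=["3rd person", "singular"])
print(row)
```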
form = None

The word itself.

gloss = None

Optional glossary defined by that word.

id = None

The word unique ID.

lemma = None

Word’s lemma.

morphology = None

Optional morphology items defining that word.

static parse(element: lxml.etree._Element, tag: str = 'token', **kwargs) → xmlc.XMLItem[source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:
  • element (lxml.etree._Element) – the element to parse.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new tag.

Return type:

XMLItem

Raises:

ValueError – if the element’s tag differs from tag.

part_of_speech = None

Optional part of speech corresponding to that word.

to_table(tabletype='simple', add_headers=True) → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
class xmlc.Value(tag: str, summary: str)[source]

The simplest XML item available, containing both a tag and a summary.

static parse(element: lxml.etree._Element, tag: str = 'value', **kwargs) → xmlc.Value[source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:
  • element (lxml.etree._Element) – the element to parse.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new tag.

Return type:

XMLItem

Raises:

ValueError – if the element’s tag differs from tag.

summary = None

Value summary.

tag = None

Value identifier tag.

to_table(tabletype='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
class xmlc.XMLGroup(tag: Optional[str], item_tag: str, cls: T, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]

Specialization of XMLItem for containing a variable set of fields of type T.

Those fields can be accessed in three ways:
  • By providing the index using fields.
  • By providing the field tag by using dirs and fields.
  • By direct access using either the index or the tag.
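
The three access paths listed above can be sketched with a small stand-in class. GroupStub is hypothetical, and implementing direct access through __getitem__ is an assumption about how xmlc does it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class GroupStub:
    """Hypothetical stand-in for xmlc.XMLGroup holding plain strings."""
    fields: List[str] = field(default_factory=list)
    dirs: Dict[str, int] = field(default_factory=dict)

    def __getitem__(self, key: Union[str, int]) -> str:
        # Direct access: a tag is resolved through dirs, an index is used as is.
        index = self.dirs[key] if isinstance(key, str) else key
        return self.fields[index]


group = GroupStub(fields=["first sentence", "second sentence"],
                  dirs={"s1": 0, "s2": 1})

by_index = group.fields[1]               # 1. index into fields
by_tag = group.fields[group.dirs["s2"]]  # 2. tag resolved through dirs
direct = group["s2"]                     # 3. direct access by tag (or index)
print(by_index, by_tag, direct)
```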
cls = None

The generic class used when parsing found subclasses.

dirs = None

Map from each T identifier to its position in fields.

fields = None

List of arbitrary length containing the T objects.

classmethod parse(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.XMLGroup][source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. In addition to XMLItem, finds and parses any subitem contained by the tag.

Parameters:
  • element (lxml.etree._Element) – the element to parse.
  • subcls (T) – the subclass type used when parsing found objects.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new group of tags.

Return type:

XMLItem

Raises:
subitem_tag = None

The containing T tag.

to_table(tabletype='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
class xmlc.XMLItem(tag: Optional[str], item_tag: str)[source]

Base XML wrapper class. This item consists of a dataclass with two main fields:

  • tag, containing the XML tag identifier.
  • item_tag, containing the XML tag itself.

This abstract class defines two abstract methods that must be overridden:

Its main function is to simplify and contain basic XML data types.

item_tag = None

The XML tag itself, overridden by subclasses.

static parse(element: lxml.etree._Element, tag: str, **kwargs) → Optional[xmlc.XMLItem][source]

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:
  • element (lxml.etree._Element) – the element to parse.
  • tag (str) – the XML tag itself.
  • kwargs – arbitrary arguments for custom parsing options.
Returns:

the new tag.

Return type:

XMLItem

Raises:

ValueError – if the element’s tag differs from tag.
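
The tag check described above can be sketched as follows. The example uses the standard library’s xml.etree.ElementTree in place of lxml so that it runs without extra dependencies; ItemStub, parse_item and the idea that the identifier lives in a 'tag' attribute are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml elements also expose .tag and .get()
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemStub:
    """Hypothetical stand-in for xmlc.XMLItem."""
    tag: Optional[str]
    item_tag: str


def parse_item(element: ET.Element, tag: str) -> ItemStub:
    if element.tag != tag:
        raise ValueError(f"expected <{tag}>, found <{element.tag}>")
    # Assumption: the identifier is stored in a 'tag' attribute of the element.
    return ItemStub(tag=element.get("tag"), item_tag=element.tag)


item = parse_item(ET.fromstring('<value tag="n" summary="noun"/>'), "value")
print(item)
```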

tag = None

The XML tag identifier, overridden by subclasses.

to_table(tabletype='simple') → str[source]

Represents the XMLItem by a table.

Parameters:tabletype (str) –

the table format to use. The following formats are available:

  • "plain"
  • "simple"
  • "github"
  • "grid"
  • "fancy_grid"
  • "pipe"
  • "orgtbl"
  • "jira"
  • "presto"
  • "pretty"
  • "psql"
  • "rst"
  • "mediawiki"
  • "moinmoin"
  • "youtrack"
  • "html"
  • "latex"
  • "latex_raw"
  • "latex_booktabs"
  • "textile"

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:the table representation of the XMLItem.
Return type:str
xmlc.create_column_headers(first_header: str, tabletype: str) → List[str][source]

With the given first header and the table type, creates a list of headers used when designing the table for showing XMLItem or XMLGroup values.

The output list consists of:

    return [[first header],
            [Lemma], [Part of speech], [Morphology], [Gloss]]

Parameters:
  • first_header (str) – the first header to put.
  • tabletype (str) – the table format - used only if LaTeX.
Returns:

a list containing the headers.

Return type:

list[str]
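
A rough sketch of a header builder with this shape. The function name column_headers_sketch is hypothetical, and wrapping headers in \textbf{...} for LaTeX formats is only an assumed example of LaTeX-specific treatment; the documentation does not say what xmlc actually does in that case.

```python
from typing import List


def column_headers_sketch(first_header: str, tabletype: str) -> List[str]:
    """Build the header row: the first header followed by the fixed columns."""
    headers = [first_header, "Lemma", "Part of speech", "Morphology", "Gloss"]
    if tabletype.startswith("latex"):
        # Assumed LaTeX-specific treatment: bold headers.
        headers = [f"\\textbf{{{header}}}" for header in headers]
    return headers


plain_headers = column_headers_sketch("Token", "simple")
latex_headers = column_headers_sketch("Token", "latex_booktabs")
print(plain_headers)
print(latex_headers)
```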

xmlc.main(args)[source]

Main function that demonstrates how XMLCorpus works. Must receive a file containing two sources with IDs ‘text1’ and ‘text2’, respectively.

Parameters:args – command line arguments provided when this script is called.
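
The input requirement can be checked before any parsing, as in the sketch below. The XML layout (source elements carrying an 'id' attribute inside a corpus root) and the attribute names are assumptions, and the standard library parser is used instead of lxml to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET

# Assumed layout: <source> elements carrying an 'id' attribute.
DOCUMENT = """
<corpus>
  <source id="text1" language="la" alignment-id="text2"/>
  <source id="text2" language="en"/>
</corpus>
"""

root = ET.fromstring(DOCUMENT)
sources = {element.get("id"): element for element in root.iter("source")}

# Stop early when one of the two required sources is absent.
missing = {"text1", "text2"} - sources.keys()
if missing:
    raise SystemExit(f"missing sources: {sorted(missing)}")
print(sorted(sources))
```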
