Welcome to XMLCorpus’s documentation!¶

class xmlc.Annotation(tag: Optional[str], item_tag: str, morphology: xmlc.Morphology, parts_of_speech: Optional[xmlc.Field] = None, gloss: Optional[xmlc.Field] = None)[source]¶

Master class containing all possible annotations that can exist in an XML file.

gloss = None¶

The annotation’s glossary - can be None.

Type:	Field or None

morphology = None¶: The annotation’s morphology.

static parse(annotation: lxml.etree._Element, tag: str = 'annotation', **kwargs) → xmlc.Annotation[source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:	annotation (lxml.etree._Element) – the element to parse. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new tag.
Return type:	XMLItem
Raises:	ValueError – if annotation.tag is different from tag.

parts_of_speech = None¶

The annotation’s part of speech - can be None.

Type:	Field or None

to_table(tabletype='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.AnnotationElements[source]¶

Enumeration containing the possible parts that make up an annotation. Can be:

class xmlc.AnnotationStatus[source]¶

Enumeration containing the three possible statuses for a sentence:

Annotated
Unannotated
Reviewed
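A minimal sketch of how such an enumeration could look. Only the `UNANNOTATED` member and its value `'unannotated'` appear in the `Sentence` signature on this page; the other two member names and values are assumptions that follow the same pattern.

```python
from enum import Enum


class AnnotationStatus(Enum):
    # UNANNOTATED = 'unannotated' is taken from the Sentence signature;
    # the other two members are assumed to follow the same naming.
    ANNOTATED = "annotated"
    UNANNOTATED = "unannotated"
    REVIEWED = "reviewed"


# Recover a member from the string stored in an XML attribute.
status = AnnotationStatus("unannotated")
```

Looking a member up by value, as in the last line, is a convenient way to turn an attribute string into a status.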

class xmlc.Field(tag: Optional[str], cls: T = <class 'xmlc.Value'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]¶

Class grouping a set of Value objects.

cls¶: alias of Value

to_table(tabletype='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.Morphology(tag: Optional[str], cls: T = <class 'xmlc.Field'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]¶

The morphology contains a group of fields, each of which contains values. It describes the morphological features of the text’s tokens.

cls¶: alias of Field

get(item: Union[str, int], default_value: Any = None) → Union[xmlc.Field, Any][source]¶

Searches for an item, given its position or its tag. If the item is not found, returns the default value.

Parameters:	item (str or int) – the item to look for. Can be the index or the identifier tag. default_value (Any) – the value to return when not found.
Returns:	the found `Field` or the default value.
Return type:	Field or Any
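The lookup described above can be sketched as a free function over the `fields` list and the `dirs` map documented for XMLGroup. This is an illustration under assumptions, not the library's code: the standalone `get` function and the treatment of out-of-range or negative indices as "not found" are choices made here.

```python
from typing import Any, Dict, List, Union


def get(fields: List[Any], dirs: Dict[str, int],
        item: Union[str, int], default_value: Any = None) -> Any:
    """Return the field at position `item` or with tag `item`, else the default."""
    if isinstance(item, str):
        # A tag is first translated into a position through `dirs`.
        index = dirs.get(item)
        if index is None:
            return default_value
    else:
        index = item
    # Assumption: positions outside the list count as "not found".
    if 0 <= index < len(fields):
        return fields[index]
    return default_value


fields = ["case-field", "number-field"]
dirs = {"case": 0, "number": 1}
found_by_tag = get(fields, dirs, "number")
fallback = get(fields, dirs, 5, "missing")
```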

to_table(ignored='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.Sentence(tag: Optional[str], cls: T, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>, id: str = '', status: xmlc.AnnotationStatus = <AnnotationStatus.UNANNOTATED: 'unannotated'>, alignment_id: Optional[str] = None)[source]¶

Structure containing a set of tokens that make up a sentence.

alignment_id = None¶: Aligned sentence ID - represents a translation of this sentence.

cls¶: alias of Token

find_by(data: Dict[xmlc.AnnotationElements, Union[Set[str], str]]) → List[xmlc.Token][source]¶

Recursively looks for tokens that fulfill the specified data requirements.

Parameters:	data (dict[AnnotationElements, set[str] or str]) – a dictionary containing the annotation elements to filter and the conditions of the filtering.
Returns:	a list of tokens that fulfill the requirements.
Return type:	list[Token]
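A hedged sketch of this kind of filtering, using stand-in types. The `AnnotationElements` members, the simplified `Token` with plain string attributes, and the rule that all conditions must hold (with a set meaning "any of these values") are assumptions; the real method is also described as recursive, which this flat version does not reproduce.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Set, Union


class AnnotationElements(Enum):
    # Hypothetical members; the real values are not listed on this page.
    LEMMA = "lemma"
    PART_OF_SPEECH = "part_of_speech"


@dataclass
class Token:
    # Simplified stand-in for xmlc.Token with string-only attributes.
    form: str
    lemma: Optional[str] = None
    part_of_speech: Optional[str] = None


def find_by(tokens: List[Token],
            data: Dict[AnnotationElements, Union[Set[str], str]]) -> List[Token]:
    """Keep the tokens that satisfy every condition in `data`."""
    found = []
    for token in tokens:
        for element, wanted in data.items():
            # A single string is treated as a one-element set.
            allowed = {wanted} if isinstance(wanted, str) else wanted
            if getattr(token, element.value) not in allowed:
                break
        else:
            found.append(token)
    return found


tokens = [
    Token("puellam", "puella", "noun"),
    Token("amat", "amo", "verb"),
    Token("rosas", "rosa", "noun"),
]
nouns = find_by(tokens, {
    AnnotationElements.PART_OF_SPEECH: "noun",
    AnnotationElements.LEMMA: {"rosa", "amo"},
})
```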

id = ''¶: Sentence unique ID.

classmethod parse(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.Sentence][source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. In addition to what XMLItem does, it finds and parses any subitem contained in the tag.

Parameters:	element (lxml.etree._Element) – the element to parse. subcls (T) – the subclass type used when parsing found objects. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new group of tags.
Return type:	XMLItem
Raises:	ValueError – if the element.tag is different than tag. AttributeError – if subcls is not a subclass of `XMLItem` or `XMLGroup`.

side_by_side(another: xmlc.Sentence, tabletype='plain') → str[source]¶

With the given sentence, compares all tokens contained in both sentences (defined by their alignment ID) and generates a table with the comparison.

Parameters:	another (Sentence) – the other sentence to compare. tabletype (str) – the output table format.
Returns:	table representation of the comparison.
Return type:	str
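The pairing step behind such a comparison could look like the sketch below. It assumes each token carries the IDs of its translated words in `alignment_id` (as documented for Token) and represents tokens as plain tuples; the `align` helper is hypothetical, and rendering the rows as a table is left out.

```python
from typing import Dict, List, Optional, Tuple

# Each token is reduced to (id, form, alignment_ids) for illustration.
TokenRow = Tuple[str, str, Optional[List[str]]]


def align(left: List[TokenRow], right: List[TokenRow]) -> List[Tuple[str, str]]:
    """Pair every form on the left with the forms it is aligned to on the right."""
    forms_by_id: Dict[str, str] = {tid: form for tid, form, _ in right}
    rows = []
    for _, form, alignment_ids in left:
        # Tokens without alignment IDs get an empty right-hand cell.
        matches = [forms_by_id[a] for a in (alignment_ids or []) if a in forms_by_id]
        rows.append((form, " ".join(matches)))
    return rows


left = [("l1", "puella", ["r1", "r2"]), ("l2", "amat", ["r3"]), ("l3", "et", None)]
right = [("r1", "the", None), ("r2", "girl", None), ("r3", "loves", None)]
rows = align(left, right)
```

The resulting rows could then be passed to a table renderer with the chosen `tabletype`.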

status = 'unannotated'¶: Sentence annotation status - possible values defined at AnnotationStatus.

to_table(tabletype='plain') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.Source(tag: Optional[str], cls: T = <class 'xmlc.Sentence'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>, id: str = '', language: str = '', title: str = '', citation_part: str = '', alignment_id: Optional[str] = None, editorial_note: Optional[str] = None, annotator: Optional[str] = None, reviewer: Optional[str] = None, original_url: Optional[str] = None)[source]¶

A source consists of a set of organized sentences, which can be aligned with their translation in another source.

alignment_id = None¶: ID of the source’s translation.

annotator = None¶: Source’s annotator.

citation_part = ''¶: Source’s citation.

cls¶: alias of Sentence

compare(another: xmlc.Source, sentences: Tuple[str, ...] = (), status: Optional[xmlc.AnnotationStatus] = None, tabletype: str = 'simple') → str[source]¶

With the given source, compares each sentence listed in sentences and generates a table with the comparison.

Parameters:	another (Source) – the other source to compare with. sentences (tuple[str, ...]) – the IDs of the sentences to compare. Empty means all. status (AnnotationStatus) – the sentence status to use when comparing. None means unused. tabletype (str) – the output format for the table.
Returns:	sources comparison as a table.
Return type:	str
Raises:	ValueError – if the sources are not aligned.

editorial_note = None¶: Source’s editorial note.

find_words_by(data: Dict[xmlc.AnnotationElements, Union[Set[str], str]]) → List[xmlc.Token][source]¶

With the given requirements, finds all tokens that fulfill them.

Parameters:	data (dict[AnnotationElements, set[str] or str]) – a dictionary containing the annotation elements to filter and the conditions of the filtering.
Returns:	a list of tokens that fulfill the requirements.
Return type:	list[Token]

id = ''¶: Source unique ID.

language = ''¶: Source’s language.

original_url = None¶: Source’s original URL.

classmethod parse(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.Source][source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. In addition to what XMLItem does, it finds and parses any subitem contained in the tag.

Parameters:	element (lxml.etree._Element) – the element to parse. subcls (T) – the subclass type used when parsing found objects. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new group of tags.
Return type:	XMLItem
Raises:	ValueError – if the element.tag is different than tag. AttributeError – if subcls is not a subclass of `XMLItem` or `XMLGroup`.

reviewer = None¶: Source’s reviewer.

title = ''¶: Source’s title.

to_table(tabletype='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

xmlc.T = ~T¶: Generic type for designating groups of XML tags.

class xmlc.Token(id: str, form: str, alignment_id: Optional[List[str]] = None, lemma: Optional[str] = None, part_of_speech: Optional[xmlc.Value] = None, morphology: Optional[xmlc.Morphology] = None, gloss: Optional[xmlc.Value] = None)[source]¶

The token represents a word. A word has only two mandatory attributes:

The id.
The form, that is, the word itself.

All other values are optional and can be omitted.

alignment_id = None¶: Optional alignment ID, that is, the ID(s) of the translated word(s).

describe(tabletype='simple') → List[str][source]¶

Generates a list with the description of the word. It consists of:

Form.
Lemma.
Morphology fields.
Part of speech.
Glossary.

Parameters:	tabletype (str) – the output format for the table; used only when the format is LaTeX.
Returns:	the token representation.
Return type:	list[str]
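As an illustration of the ordering listed above, the following sketch flattens a word's data into a list. The standalone `describe` function, the plain-string arguments, and the empty-string placeholder for missing values are assumptions; the LaTeX-specific handling mentioned for `tabletype` is not modelled.

```python
from typing import List, Optional


def describe(form: str,
             lemma: Optional[str] = None,
             morphology: Optional[List[str]] = None,
             part_of_speech: Optional[str] = None,
             gloss: Optional[str] = None) -> List[str]:
    """Flatten a word into: form, lemma, morphology fields, part of speech, gloss."""
    row = [form, lemma or ""]
    row.extend(morphology or [])       # one entry per morphology field
    row.append(part_of_speech or "")   # assumption: missing values become ""
    row.append(gloss or "")
    return row


description = describe("rosas", lemma="rosa", morphology=["acc", "pl"],
                       part_of_speech="noun")
```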

form = None¶: The word itself.

gloss = None¶: Optional glossary defined by that word.

id = None¶: The word unique ID.

lemma = None¶: Word’s lemma.

morphology = None¶: Optional morphology items defining that word.

static parse(element: lxml.etree._Element, tag: str = 'token', **kwargs) → xmlc.XMLItem[source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:	element (lxml.etree._Element) – the element to parse. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new tag.
Return type:	XMLItem
Raises:	ValueError – if the element.tag is different than tag.

part_of_speech = None¶: Optional part of speech corresponding to that word.

to_table(tabletype='simple', add_headers=True) → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.Value(tag: str, summary: str)[source]¶

The simplest XML item available, containing both a tag and a summary.

static parse(element: lxml.etree._Element, tag: str = 'value', **kwargs) → xmlc.Value[source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:	element (lxml.etree._Element) – the element to parse. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new tag.
Return type:	XMLItem
Raises:	ValueError – if the element.tag is different than tag.
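The tag check and construction described here can be sketched with the standard library's `xml.etree.ElementTree` standing in for lxml. The XML attribute names (`tag`, `summary`) and the error message are assumptions about the file layout, not taken from the library.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Value:
    tag: str
    summary: str

    @staticmethod
    def parse(element: ET.Element, tag: str = "value") -> "Value":
        # Reject elements whose XML tag is not the expected one.
        if element.tag != tag:
            raise ValueError(f"expected <{tag}>, got <{element.tag}>")
        # Assumption: identifier and summary are stored as XML attributes.
        return Value(tag=element.get("tag", ""), summary=element.get("summary", ""))


element = ET.fromstring('<value tag="n" summary="noun"/>')
value = Value.parse(element)
```

Passing an element with a different tag, for example `<token/>`, raises `ValueError` in this sketch, mirroring the documented behaviour.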

summary = None¶: Value summary.

tag = None¶: Value identifier tag.

to_table(tabletype='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.XMLGroup(tag: Optional[str], item_tag: str, cls: T, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]¶

Specialization of XMLItem for containing a variable set of fields of type T.

Those fields can be accessed in three ways:

By providing the index using fields.
By providing the field tag by using dirs and fields.
By direct access using either index or tag.
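The three access paths can be illustrated with a small stand-in class. The `Item` and `Group` classes, the `add` helper, and the use of `__getitem__` for direct access are assumptions made for this sketch; only the `fields` and `dirs` attribute names come from this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Item:
    tag: str  # identifier used as the key in `dirs`


@dataclass
class Group:
    fields: List[Item] = field(default_factory=list)
    dirs: Dict[str, int] = field(default_factory=dict)

    def add(self, item: Item) -> None:
        # Record the position of the new item under its tag.
        self.dirs[item.tag] = len(self.fields)
        self.fields.append(item)

    def __getitem__(self, key: Union[str, int]) -> Item:
        # Direct access: a string goes through `dirs`, an int is a position.
        index = self.dirs[key] if isinstance(key, str) else key
        return self.fields[index]


group = Group()
group.add(Item("case"))
group.add(Item("number"))

by_index = group.fields[1]                   # 1. index into fields
by_tag = group.fields[group.dirs["number"]]  # 2. tag via dirs, then fields
direct = group["number"]                     # 3. direct access by tag
```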

cls = None¶: The generic class used when parsing found subclasses.

dirs = None¶: Map containing the T identifiers and their positions in fields.

fields = None¶: List of arbitrary length containing the T objects.

classmethod parse(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.XMLGroup][source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. In addition to what XMLItem does, it finds and parses any subitem contained in the tag.

Parameters:	element (lxml.etree._Element) – the element to parse. subcls (T) – the subclass type used when parsing found objects. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new group of tags.
Return type:	XMLItem
Raises:	ValueError – if the element.tag is different than tag. AttributeError – if subcls is not a subclass of `XMLItem` or `XMLGroup`.

subitem_tag = None¶: The tag of the contained T items.

to_table(tabletype='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

class xmlc.XMLItem(tag: Optional[str], item_tag: str)[source]¶

Base XML wrapper class. This item consists of a dataclass with two fields:

tag
item_tag

This abstract class defines two abstract methods that must be overridden:

parse
to_table

Its main function is to simplify and contain basic XML data types.

item_tag = None¶: The XML tag itself, overridden by subclasses.

static parse(element: lxml.etree._Element, tag: str, **kwargs) → Optional[xmlc.XMLItem][source]¶

With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.

Parameters:	element (lxml.etree._Element) – the element to parse. tag (str) – the XML tag itself. kwargs – arbitrary arguments for custom parsing options.
Returns:	the new tag.
Return type:	XMLItem
Raises:	ValueError – if the element.tag is different than tag.

tag = None¶: The XML tag identifier, overridden by subclasses.

to_table(tabletype='simple') → str[source]¶

Represents the XMLItem by a table.

Parameters:

tabletype (str) – the table format to use.

See also

Table formats are defined by tabulate - more information about formatting at: https://pypi.org/project/tabulate/

Returns:	the table representation of the `XMLItem`.
Return type:	str

xmlc.create_column_headers(first_header: str, tabletype: str) → List[str][source]¶

With the given first header and the table type, creates a list of headers used when designing the table for showing XMLItem or XMLGroup values.

The output list consists of:

Parameters:	first_header (str) – the first header to put. tabletype (str) – the table format - used only if LaTeX.
Returns:	a list containing the headers.
Return type:	list[str]

xmlc.main(args)[source]¶

Main function that demonstrates how XMLCorpus works. Must receive a file containing two sources with IDs ‘text1’ and ‘text2’, respectively.

Parameters:	args – command line arguments provided when this script is called.
