Welcome to XMLCorpus’s documentation!
-
class
xmlc.
Annotation
(tag: Optional[str], item_tag: str, morphology: xmlc.Morphology, parts_of_speech: Optional[xmlc.Field] = None, gloss: Optional[xmlc.Field] = None)[source]¶ Master class containing all possible annotations that can exist in an XML file.
-
morphology
= None¶ The annotation’s morphology.
-
static
parse
(annotation: lxml.etree._Element, tag: str = 'annotation', **kwargs) → xmlc.Annotation[source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new tag.
Return type: Annotation
Raises: ValueError – if element.tag is different from tag.
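The tag check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: it uses the standard library's xml.etree.ElementTree in place of lxml, and parse_checked is a hypothetical helper that returns a plain dict rather than an Annotation.

```python
import xml.etree.ElementTree as ET


def parse_checked(element: ET.Element, tag: str = "annotation") -> dict:
    """Hypothetical stand-in for a ``parse`` method: validate the tag first."""
    if element.tag != tag:
        # Mirrors the documented behaviour: a mismatching tag is an error.
        raise ValueError(f"expected <{tag}>, got <{element.tag}>")
    # Return the attributes as a plain dict; the real class builds an
    # Annotation object instead.
    return dict(element.attrib)


good = ET.fromstring('<annotation lang="la"/>')
bad = ET.fromstring('<token id="1"/>')

parsed = parse_checked(good)

try:
    parse_checked(bad)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

Checking the tag before reading any data keeps a wrongly routed element from being silently turned into an object of the wrong kind.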
-
to_table
(tabletype='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
-
class
xmlc.
AnnotationElements
[source]¶ Enumeration containing the possible parts that make up an annotation. Can be:
- Morphology
- Parts of speech
- Gloss
-
class
xmlc.
AnnotationStatus
[source]¶ Enumeration containing the three possible statuses for a sentence:
- Annotated
- Unannotated
- Reviewed
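As a rough sketch, such an enumeration could look like the block below. Only the 'unannotated' value is confirmed by the Sentence signature in this reference; the other two string values are assumed by analogy.

```python
from enum import Enum


class AnnotationStatus(Enum):
    # 'unannotated' appears in the Sentence signature; the other two
    # string values are assumptions made for this sketch.
    ANNOTATED = "annotated"
    UNANNOTATED = "unannotated"
    REVIEWED = "reviewed"


# Looking a member up by value, e.g. when reading it from an XML attribute.
status = AnnotationStatus("unannotated")
is_default = status is AnnotationStatus.UNANNOTATED
```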
-
class
xmlc.
Field
(tag: Optional[str], cls: T = <class 'xmlc.Value'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]¶ Class grouping a set of Value objects.
-
to_table
(tabletype='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
-
class
xmlc.
Morphology
(tag: Optional[str], cls: T = <class 'xmlc.Field'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]¶ The morphology contains a group of fields, each containing values. It describes the morphological properties of the text’s tokens.
-
get
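(item: Union[str, int], default_value: Any = None) → Union[xmlc.Field, Any][source]¶ Searches for an item, given its position or its tag. If not found, returns the default value.
Parameters:
- item (str or int) – the position or the tag of the item to search for.
- default_value (Any) – the value to return if the item is not found.
Returns: the found Field or the default value.
Return type: Field or Any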
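The lookup by position or by tag can be illustrated with a small sketch. The Group class below is hypothetical, and it assumes that dirs maps each field's tag to its index in fields; the source does not state how dirs is used.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class Group:
    """Hypothetical container, not the library's Morphology class."""
    fields: List[str] = field(default_factory=list)
    # Assumption: maps each field's tag to its index in ``fields``.
    dirs: Dict[str, int] = field(default_factory=dict)

    def get(self, item: Union[str, int], default_value: Any = None) -> Any:
        if isinstance(item, str):
            # Resolve the tag to a position; unknown tags yield the default.
            index = self.dirs.get(item)
            if index is None:
                return default_value
        else:
            index = item
        if 0 <= index < len(self.fields):
            return self.fields[index]
        return default_value


group = Group(fields=["nominative", "singular"], dirs={"case": 0, "number": 1})
by_tag = group.get("number")
by_position = group.get(0)
missing = group.get("gender", default_value="n/a")
```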
-
to_table
(ignored='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
-
class
xmlc.
Sentence
(tag: Optional[str], cls: T, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>, id: str = '', status: xmlc.AnnotationStatus = <AnnotationStatus.UNANNOTATED: 'unannotated'>, alignment_id: Optional[str] = None)[source]¶ Structure containing the set of tokens that make up a sentence.
-
alignment_id
= None¶ Aligned sentence ID - represents a translation of this sentence.
-
find_by
(data: Dict[xmlc.AnnotationElements, Union[Set[str], str]]) → List[xmlc.Token][source]¶ Recursively looks for tokens that fulfill the specified data requirements.
Parameters: data (dict[AnnotationElements, set[str] or str]) – a dictionary containing the annotation elements to filter and the conditions of the filtering.
Returns: a list of tokens that fulfill the requirements.
Return type: list[Token]
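The filtering described above might be sketched like this. Everything here is illustrative: the enum member names, the simplified Token with one string per annotation element, and the assumption that a token must satisfy every entry in data are not confirmed by the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Set, Union


class AnnotationElements(Enum):
    # Member names and values are assumptions for this sketch.
    MORPHOLOGY = "morphology"
    PART_OF_SPEECH = "part_of_speech"
    GLOSS = "gloss"


@dataclass
class Token:
    """Hypothetical, simplified token: one string per annotation element."""
    form: str
    annotations: Dict[AnnotationElements, str]


def find_by(tokens: List[Token],
            data: Dict[AnnotationElements, Union[Set[str], str]]) -> List[Token]:
    """Keep tokens whose annotations satisfy every condition in ``data``."""
    matches = []
    for token in tokens:
        for element, wanted in data.items():
            # A single string is treated as a one-element set.
            accepted = {wanted} if isinstance(wanted, str) else wanted
            if token.annotations.get(element) not in accepted:
                break
        else:
            # No condition failed, so the token fulfills all requirements.
            matches.append(token)
    return matches


tokens = [
    Token("rosa", {AnnotationElements.PART_OF_SPEECH: "noun"}),
    Token("amat", {AnnotationElements.PART_OF_SPEECH: "verb"}),
    Token("et", {AnnotationElements.PART_OF_SPEECH: "conjunction"}),
]
found = find_by(tokens, {AnnotationElements.PART_OF_SPEECH: {"noun", "verb"}})
found_forms = [t.form for t in found]
```

Accepting either a string or a set for each condition lets a caller ask for one exact value or for any of several values with the same dictionary shape.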
-
id
= ''¶ Sentence unique ID.
-
classmethod
parse
(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.Sentence][source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. Beyond what XMLItem does, it also finds and parses any subitem contained in the tag.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- subcls (T) – the subclass type used when parsing found objects.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new group of tags.
Return type: Optional[Sentence]
Raises:
- ValueError – if element.tag is different from tag.
- AttributeError – if subcls is not a subclass of XMLItem or XMLGroup.
-
side_by_side
(another: xmlc.Sentence, tabletype='plain') → str[source]¶ Compares all tokens contained in this sentence and the given one (matched by their alignment ID) and generates a table with the comparison.
Parameters:
- another (Sentence) – the other sentence to compare with.
- tabletype (str) – the output format for the table.
Returns: table representation of the comparison.
Return type: str
-
status
= 'unannotated'¶ Sentence annotation status; possible values are defined in AnnotationStatus.
-
to_table
(tabletype='plain') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
-
class
xmlc.
Source
(tag: Optional[str], cls: T = <class 'xmlc.Sentence'>, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>, id: str = '', language: str = '', title: str = '', citation_part: str = '', alignment_id: Optional[str] = None, editorial_note: Optional[str] = None, annotator: Optional[str] = None, reviewer: Optional[str] = None, original_url: Optional[str] = None)[source]¶ A source is made up of a set of organized sentences, which can be translated into another source.
-
alignment_id
= None¶ Source’s translation’s ID.
-
annotator
= None¶ Source’s annotator.
-
citation_part
= ''¶ Source’s citation.
-
compare
(another: xmlc.Source, sentences: Tuple[str, ...] = (), status: Optional[xmlc.AnnotationStatus] = None, tabletype: str = 'simple') → str[source]¶ With the given source, compares each sentence listed in sentences and generates a table with the comparison.
Parameters:
- another (Source) – the other source to compare with.
- sentences (tuple[str, ...]) – the sentences to compare. Empty means all.
- status (AnnotationStatus) – the sentence status to use when comparing. None means unused.
- tabletype (str) – the output format for the table.
Returns: sources comparison as a table.
Return type: str
Raises: ValueError – if the sources are not aligned.
-
editorial_note
= None¶ Source’s editorial note.
-
find_words_by
(data: Dict[xmlc.AnnotationElements, Union[Set[str], str]]) → List[xmlc.Token][source]¶ With the given requirements, finds all tokens that fulfill them.
Parameters: data (dict[AnnotationElements, set[str] or str]) – a dictionary containing the annotation elements to filter and the conditions of the filtering.
Returns: a list of tokens that fulfill the requirements.
Return type: list[Token]
-
id
= ''¶ Source unique ID.
-
language
= ''¶ Source’s language.
-
original_url
= None¶ Source’s original URL.
-
classmethod
parse
(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.Source][source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. Beyond what XMLItem does, it also finds and parses any subitem contained in the tag.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- subcls (T) – the subclass type used when parsing found objects.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new group of tags.
Return type: Optional[Source]
Raises:
- ValueError – if element.tag is different from tag.
- AttributeError – if subcls is not a subclass of XMLItem or XMLGroup.
-
reviewer
= None¶ Source’s reviewer.
-
title
= ''¶ Source’s title.
-
to_table
(tabletype='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
-
xmlc.
T
= ~T¶ Generic type for designating groups of XML tags.
-
class
xmlc.
Token
(id: str, form: str, alignment_id: Optional[List[str]] = None, lemma: Optional[str] = None, part_of_speech: Optional[xmlc.Value] = None, morphology: Optional[xmlc.Morphology] = None, gloss: Optional[xmlc.Value] = None)[source]¶ The token represents a word. A word has only two mandatory attributes:
- The id.
- The form, that is, the word itself.
All other values are optional and can be omitted.
-
alignment_id
= None¶ Optional alignment ID, that is, the ID(s) of the translated word(s).
-
describe
(tabletype='simple') → List[str][source]¶ Generates a list with the description of the word. It consists of:
- Form.
- Lemma.
- Morphology fields.
- Part of speech.
- Glossary.
Parameters: tabletype (str) – the output format for the table; only used if the format is LaTeX.
Returns: the token representation.
Return type: list[str]
-
form
= None¶ The word itself.
-
gloss
= None¶ Optional glossary defined by that word.
-
id
= None¶ The word unique ID.
-
lemma
= None¶ Word’s lemma.
-
morphology
= None¶ Optional morphology items defining that word.
-
static
parse
(element: lxml.etree._Element, tag: str = 'token', **kwargs) → xmlc.XMLItem[source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new tag.
Return type: XMLItem
Raises: ValueError – if element.tag is different from tag.
-
part_of_speech
= None¶ Optional part of speech corresponding to that word.
-
to_table
(tabletype='simple', add_headers=True) → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
class
xmlc.
Value
(tag: str, summary: str)[source]¶ The simplest XML item available, containing both a tag and a summary.
-
static
parse
(element: lxml.etree._Element, tag: str = 'value', **kwargs) → xmlc.Value[source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new tag.
Return type: Value
Raises: ValueError – if element.tag is different from tag.
-
to_table
(tabletype='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
class
xmlc.
XMLGroup
(tag: Optional[str], item_tag: str, cls: T, fields: List[T] = <factory>, dirs: Dict[str, int] = <factory>)[source]¶ Specialization of XMLItem for containing a variable set of fields of type T.
Those fields can be accessed in three ways:
-
cls
= None¶ The generic class used when parsing found subclasses.
-
classmethod
parse
(element: lxml.etree._Element, subcls: T, tag: str = None, **kwargs) → Optional[xmlc.XMLGroup][source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLGroup with its data. Beyond what XMLItem does, it also finds and parses any subitem contained in the tag.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- subcls (T) – the subclass type used when parsing found objects.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new group of tags.
Return type: Optional[XMLGroup]
Raises:
- ValueError – if element.tag is different from tag.
- AttributeError – if subcls is not a subclass of XMLItem or XMLGroup.
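Group parsing with a subclass check can be sketched as below. This is an illustration under assumptions, not the library's code: it uses the standard library's xml.etree.ElementTree rather than lxml, and Item and parse_group are hypothetical stand-ins for XMLItem and XMLGroup.parse.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


class Item:
    """Hypothetical base class standing in for XMLItem."""

    def __init__(self, tag: str, text: Optional[str]):
        self.tag = tag
        self.text = text


def parse_group(element: ET.Element, subcls: type, tag: str) -> List[Item]:
    if element.tag != tag:
        raise ValueError(f"expected <{tag}>, got <{element.tag}>")
    if not (isinstance(subcls, type) and issubclass(subcls, Item)):
        # The documentation names AttributeError for an unsuitable subcls.
        raise AttributeError(f"{subcls!r} is not a subclass of Item")
    # Parse every child element into an instance of the requested class.
    return [subcls(child.tag, child.text) for child in element]


root = ET.fromstring("<sentence><token>rosa</token><token>amat</token></sentence>")
items = parse_group(root, Item, tag="sentence")
texts = [item.text for item in items]

try:
    parse_group(root, str, tag="sentence")
    subcls_error = None
except AttributeError as exc:
    subcls_error = type(exc).__name__
```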
-
to_table
(tabletype='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
class
xmlc.
XMLItem
(tag: Optional[str], item_tag: str)[source]¶ Base XML wrapper class. This item consists of a dataclass with basically two fields:
- tag, containing the XML tag identifier.
- item_tag, containing the XML tag itself.
This abstract class defines two abstract methods that must be overridden:
Its main function is to simplify and contain basic XML data types.
-
item_tag
= None¶ The XML tag itself, overridden by subclasses.
-
static
parse
(element: lxml.etree._Element, tag: str, **kwargs) → Optional[xmlc.XMLItem][source]¶ With the given lxml.etree.Element, parses the item_tag and creates a new XMLItem with its data.
Parameters:
- element (lxml.etree._Element) – the element to parse.
- tag (str) – the XML tag itself.
- kwargs – arbitrary arguments for custom parsing options.
Returns: the new tag.
Return type: Optional[XMLItem]
Raises: ValueError – if element.tag is different from tag.
-
tag
= None¶ The XML tag identifier, overridden by subclasses.
-
to_table
(tabletype='simple') → str[source]¶ Represents the XMLItem as a table.
Parameters: tabletype (str) – the table format to use. The following formats are available:
- "plain"
- "simple"
- "github"
- "grid"
- "fancy_grid"
- "pipe"
- "orgtbl"
- "jira"
- "presto"
- "pretty"
- "psql"
- "rst"
- "mediawiki"
- "moinmoin"
- "youtrack"
- "html"
- "latex"
- "latex_raw"
- "latex_booktabs"
- "textile"
See also: table formats are defined by tabulate. More information about formatting at: https://pypi.org/project/tabulate/
Returns: the table representation of the XMLItem.
Return type: str
-
xmlc.
create_column_headers
(first_header: str, tabletype: str) → List[str][source]¶ With the given first header and the table type, creates a list of headers used when designing the table for showing XMLItem or XMLGroup values.
The output list consists of:
    return [[first header], [Lemma], [Part of speech], [Morphology], [Gloss]]
Parameters:
- first_header (str) – the first header of the table.
- tabletype (str) – the table format to use.
Returns: a list containing the headers.
Return type: list[str]
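The header list described above could be built roughly as follows. The function sketch_column_headers is a hypothetical stand-in: it accepts tabletype only for signature parity and ignores it, because this reference does not say how the table type affects the headers.

```python
from typing import List


def sketch_column_headers(first_header: str, tabletype: str) -> List[str]:
    """Sketch only: the caller's first header followed by the fixed ones."""
    # ``tabletype`` is intentionally unused; its effect is not documented
    # in the source, so this sketch does not guess at it.
    return [first_header, "Lemma", "Part of speech", "Morphology", "Gloss"]


headers = sketch_column_headers("Form", "simple")
```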
-
xmlc.
main
(args)[source]¶ Main function that demonstrates how XMLCorpus works. Must receive a file containing two sources with IDs ‘text1’ and ‘text2’, respectively.
Parameters: args – command line arguments provided when this script is called.