API Reference

artifician

artifician.feature_definition

FeatureDefinition Objects

class FeatureDefinition()

Contains all the functionality for preparing a single feature.

Attributes:

  • value Any - The value of the feature.

  • cached dict - Cached observables for different events.

  • extractor Callable - Function to extract the feature value from a raw sample.

  • EVENT_PROCESSED Callable - Event that processes the feature.

  • MAP_VALUES Callable - Event that maps values of the feature.

  • extractor_parameters Tuple - Parameters for the extractor function.

__init__

def __init__(extractor: Callable = lambda sample: sample,
             subscribe_to: List = None,
             *extractor_parameters)

Initializes a FeatureDefinition instance.

Arguments:

  • extractor Callable, optional - Function to extract feature value.

  • subscribe_to List - List of publishers to subscribe to.

  • extractor_parameters - Additional parameters for the extractor.

Raises:

  • ValueError - If no publishers are provided to subscribe to.
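
Example (a minimal construction sketch; treating a Dataset as the publisher and the dict-shaped sample are assumptions for illustration):

from artifician.dataset import Dataset
from artifician.feature_definition import FeatureDefinition

# A publisher is required; subscribe_to=None raises ValueError.
dataset = Dataset()

# Hypothetical extractor: pull the "title" key out of each sample dict.
title_feature = FeatureDefinition(
    extractor=lambda sample: sample["title"],
    subscribe_to=[dataset],
)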

process

def process(publisher, sample: Any) -> None

Processes the sample to build the feature value.

Arguments:

  • publisher - The instance of the publisher.

  • sample Any - The sample data.

map

def map(feature_value: Any) -> None

Maps the feature value into an int or list of ints.

Arguments:

  • feature_value Any - The feature value to be mapped.

observe

def observe(event: Callable) -> Subject

Builds and returns an observable for a given event.

Arguments:

  • event Callable - The function to create an observable from.

Returns:

  • Subject - An observable (rx Subject) for the given event.

subscribe

def subscribe(publisher, pool_scheduler=None) -> None

Defines logic for subscribing to an event in a publisher.

Arguments:

  • publisher - The publisher instance.

  • pool_scheduler optional - The scheduler instance for concurrency.

artifician.dataset

Dataset Objects

class Dataset()

Dataset contains all the functionality for preparing Artifician data. It observes events and stores all processed data in a Pandas DataFrame.

Attributes:

  • cached dict - Cached observables for different events.

  • datastore pd.DataFrame - DataFrame to store all samples.

  • PREPARE_DATASET Callable - Event to prepare the dataset.

  • POST_PROCESS Callable - Event for post-processing actions on the dataset.

add_samples

def add_samples(samples: Any) -> pd.DataFrame

Adds samples to the datastore.

Arguments:

  • samples Any - Artifician data to be added.

Returns:

  • pd.DataFrame - The updated dataset.

Raises:

  • TypeError - If the input data is not a list.
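
Example (a short sketch; the dict-shaped samples are illustrative):

from artifician.dataset import Dataset

dataset = Dataset()

# add_samples expects a list; any other type raises TypeError.
df = dataset.add_samples([{"title": "alpha"}, {"title": "beta"}])
print(df)  # the updated pandas DataFrame datastore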

observe

def observe(event)

Builds and returns an observable for a given event.

Arguments:

  • event Callable - Function to create an observable from.

Returns:

  • rx.subject.Subject - Observable for the given event.

post_process

def post_process()

This event should be called after Artifician data is prepared. Listeners to the post_process event can perform collective actions on the dataset.

artifician.processors.chain

chain Objects

class chain()

Manages a chain of processors.

This class handles the sequential execution of a chain of processors and can subscribe to a publisher to trigger the processing.

Attributes:

  • processors list - A list of processors in the chain.

__init__

def __init__(processors=None) -> None

Initializes the chain with an optional list of processors.

Arguments:

  • processors list, optional - An initial list of processors to be managed.

then

def then(next_processor) -> 'chain'

Adds a processor to the end of the chain.

Arguments:

  • next_processor Processor - The processor to add to the end of the chain.

Returns:

  • chain - The chain instance, returned to allow fluent chaining.

process

def process(publisher, data: Any) -> Any

Processes data sequentially through the chain of processors.

Arguments:

  • publisher - The publisher instance.

  • data Any - The data to be processed by the chain.

Returns:

The final processed data after passing through all processors.

subscribe

def subscribe(publisher, pool_scheduler=None) -> None

Subscribes the processor chain to a feature definition.

The feature definition will trigger the processing of the chain.

Arguments:

  • publisher - The publisher (e.g., a FeatureDefinition) whose events trigger the chain.
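
Example (a fluent-chaining sketch using the text processors documented below; passing publisher=None for standalone use outside the pub/sub flow is an assumption):

from artifician.processors.chain import chain
from artifician.processors.text.text_cleaner import TextCleaningProcessor
from artifician.processors.text.tokenizer import TokenizationProcessor

# Build the chain fluently: clean first, then tokenize.
pipeline = chain().then(TextCleaningProcessor()).then(TokenizationProcessor())

# Run raw text through every processor in order.
tokens = pipeline.process(None, "Visit <b>HTTPS://example.com</b> NOW!!!")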

artifician.processors.mapper

Mapper Objects

class Mapper(processor.Processor)

Mapper is a processor responsible for mapping (converting) feature values to integers.

Attributes:

  • feature_map FeatureMap - Feature map containing a {value: id} dictionary.

  • map_key_values bool - If True, map both keys and values; if False, map only values.

__init__

def __init__(feature_map, subscribe_to=None, map_key_values=False)

Initialize the Mapper by setting up the feature map.

Arguments:

  • feature_map FeatureMap - Instance of FeatureMap.

  • subscribe_to list - Optional list of publishers to subscribe to.

  • map_key_values bool - If True, map both keys and values; if False, map only values.

process

def process(publisher, feature_value)

Updates the publisher's feature value by mapping the feature value to an int.

Arguments:

  • publisher object - Instance of the publisher.

  • feature_value str - The feature value to be mapped.

Returns:

  • value_id int - The integer ID of the mapped feature value.

subscribe

def subscribe(publisher, pool_scheduler=None)

Defines the logic for subscribing to an event in the publisher.

Arguments:

  • publisher object - Instance of the publisher.

  • pool_scheduler rx.scheduler.ThreadPoolScheduler - Scheduler instance for concurrency.

Returns:

None

FeatureMap Objects

class FeatureMap()

Converts a given value to an int.

Attributes:

  • values_map dict - Dictionary mapping each value to its ID, i.e. {value: id}.

get_value_id

def get_value_id(value)

Returns the ID of the given value in values_map. The value is first converted to str, since any datatype can be converted to a string and this keeps the dictionary keys uniform.

Arguments:

  • value Any - The value to look up.

Returns:

  • value_id int - ID of the value
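
Example (a minimal sketch of the mapping behaviour; the no-argument constructor and the starting ID are assumptions):

from artifician.processors.mapper import FeatureMap, Mapper

feature_map = FeatureMap()

# Repeated values receive the same ID; unseen values get a new one.
print(feature_map.get_value_id("red"))   # e.g. 0
print(feature_map.get_value_id("blue"))  # e.g. 1
print(feature_map.get_value_id("red"))   # same ID as the first call

mapper = Mapper(feature_map)  # ready to subscribe to a FeatureDefinition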

artifician.processors

artifician.processors.processor

Processor Objects

class Processor(ABC)

Interface for processors in the Artifician library, updated for processor chaining.

This abstract class defines the interface for processors, including methods for processing data and subscribing to publishers, along with the ability to chain processors.

process

@abstractmethod
def process(publisher, *data)

Process the data and update the publisher with the processed values.

Arguments:

  • publisher - The publisher to which the processed data will be updated.

  • data - The data to be processed.

subscribe

@abstractmethod
def subscribe(publisher, pool_scheduler=None)

Subscribe the processor to a publisher (e.g., FeatureDefinition).

Arguments:

  • publisher - The publisher to subscribe to.

  • pool_scheduler optional - The scheduler to be used for subscription.

then

def then(next_processor)

Link this processor to the next one in the chain.

Arguments:

  • next_processor - The next processor to add to the chain.

Returns:

  • chain - The resulting chain of processors.

Raises:

  • TypeError - If the next_processor is not a valid processor instance.
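
Example (a sketch of a custom processor implementing this interface; the event wiring in subscribe is an assumption based on the observe/EVENT_PROCESSED members documented above):

from artifician.processors import processor

class UppercaseProcessor(processor.Processor):
    """Hypothetical processor that upper-cases a string feature value."""

    def process(self, publisher, feature_value):
        # Update the publisher with the processed value.
        publisher.value = feature_value.upper()
        return publisher.value

    def subscribe(self, publisher, pool_scheduler=None):
        # Assumed wiring via the documented observe() method; the payload
        # shape delivered to on_next is also an assumption.
        publisher.observe(publisher.EVENT_PROCESSED).subscribe(
            on_next=lambda payload: self.process(publisher, payload))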

artifician.processors.text

artifician.processors.text.text_cleaner

TextCleaningProcessor Objects

class TextCleaningProcessor(Processor)

Processor for cleaning and preprocessing text data.

Configurable attributes for various cleaning operations.

__init__

def __init__(lowercase=True,
             remove_punctuation=True,
             remove_numbers=True,
             strip_whitespace=True,
             remove_html_tags=True,
             remove_urls=True,
             subscribe_to=None)

Initialize a TextCleaningProcessor object.

Arguments:

  • lowercase bool - Flag to convert text to lowercase.

  • remove_punctuation bool - Flag to remove punctuation.

  • remove_numbers bool - Flag to remove numbers.

  • strip_whitespace bool - Flag to strip extra whitespaces.

  • remove_html_tags bool - Flag to remove HTML tags.

  • remove_urls bool - Flag to remove URLs.

  • subscribe_to list - Optional list of publishers to subscribe to.

process

def process(publisher, text: Union[str, List[str]]) -> Union[str, List[str]]

Process the text or list of texts to clean and preprocess.

Arguments:

  • publisher - The publisher associated with the processor.

  • text Union[str, List[str]] - The text or list of texts to be processed.

Returns:

Union[str, List[str]]: Cleaned and preprocessed text.

subscribe

def subscribe(publisher, pool_scheduler=None)

Subscribe to a publisher for event-driven processing.

Arguments:

  • publisher object - The publisher to subscribe to.

  • pool_scheduler optional - Scheduler instance for concurrency.

Returns:

None
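
Example (a standalone sketch; publisher=None for direct use is an assumption and the output shown is illustrative):

from artifician.processors.text.text_cleaner import TextCleaningProcessor

cleaner = TextCleaningProcessor(remove_numbers=False)

print(cleaner.process(None, "Order #42 from <b>HTTPS://EXAMPLE.COM</b>!"))
# e.g. "order 42 from" -- lowercased, with tags, URL and punctuation removed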

artifician.processors.text.stop_word_remover

StopWordsRemoverProcessor Objects

class StopWordsRemoverProcessor(Processor)

Processor for removing stop words from text data.

Attributes:

  • stop_words set - A set of stop words to be removed.

__init__

def __init__(custom_stop_words: List[str] = None, subscribe_to=None)

Initialize a StopWordsRemoverProcessor object.

Arguments:

  • custom_stop_words List[str] - Optional list of custom stop words.

  • subscribe_to list - Optional list of publishers to subscribe to.

process

def process(publisher, text: Union[str, List[str]]) -> Union[str, List[str]]

Process the text or list of texts to remove stop words.

Arguments:

  • publisher - The publisher associated with the processor.

  • text Union[str, List[str]] - The text or list of texts to be processed.

Returns:

Union[str, List[str]]: Text after stop words removal.

Raises:

  • ValueError - If the input text is None or an empty list.

subscribe

def subscribe(publisher, pool_scheduler=None)

Subscribe to a publisher for event-driven processing.

Arguments:

  • publisher object - The publisher to subscribe to.

  • pool_scheduler optional - Scheduler instance for concurrency.

Returns:

None
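
Example (a sketch; whether custom stop words extend a default set is an assumption, and the output is illustrative):

from artifician.processors.text.stop_word_remover import StopWordsRemoverProcessor

remover = StopWordsRemoverProcessor(custom_stop_words=["foo"])

print(remover.process(None, "the foo jumped over the fence"))
# e.g. "jumped fence" -- default and custom stop words removed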

artifician.processors.text.tokenizer

TokenizationProcessor Objects

class TokenizationProcessor(Processor)

Tokenization Processor for splitting text into tokens.

Attributes:

  • method str - Method to use for tokenization ('word' or 'sentence').

  • nlp - spaCy language model for processing text.

__init__

def __init__(method: str = 'word', subscribe_to=None)

Initialize a TokenizationProcessor object.

Arguments:

  • method str - Method to use for tokenization ('word' or 'sentence').

  • subscribe_to list - Optional list of publishers to subscribe to.

process

def process(publisher, text: Union[str, List[str], None]) -> Union[List[str], List[List[str]]]

Process the text or list of texts and split it into tokens.

Arguments:

  • publisher - The publisher associated with the processor.

  • text Union[str, List[str], None] - The text or list of texts to be tokenized.

Returns:

Union[List[str], List[List[str]]]: A list of tokens or list of lists of tokens.

Raises:

  • ValueError - If the input text is None or an empty list.

subscribe

def subscribe(publisher, pool_scheduler=None)

Defines the logic for subscribing to an event in the publisher.

Arguments:

  • publisher object - Instance of the publisher.

  • pool_scheduler rx.scheduler.ThreadPoolScheduler - Scheduler instance for concurrency.

Returns:

None
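
Example (a sketch; requires the spaCy model used by the processor, and the token output is illustrative):

from artifician.processors.text.tokenizer import TokenizationProcessor

tokenizer = TokenizationProcessor(method='word')
print(tokenizer.process(None, "Artifician prepares data."))
# e.g. ['Artifician', 'prepares', 'data', '.']

sentence_tokenizer = TokenizationProcessor(method='sentence')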

artifician.processors.text.stemlemtizer

StemLemProcessor Objects

class StemLemProcessor(Processor)

Processor for applying stemming and lemmatization to text data.

Attributes:

  • mode str - Mode of operation ('stemming' or 'lemmatization').

  • nlp - spaCy language model for lemmatization.

  • stemmer - NLTK stemmer for stemming.

__init__

def __init__(mode: str = 'lemmatization', subscribe_to=None)

Initialize a StemLemProcessor object.

Arguments:

  • mode str - Operation mode ('stemming' or 'lemmatization').

  • subscribe_to list - Optional list of publishers to subscribe to.

process

def process(publisher, text: Union[str, List[str]]) -> Union[str, List[str]]

Process the text or list of tokens for stemming or lemmatization.

Arguments:

  • publisher - The publisher associated with the processor.

  • text Union[str, List[str]] - The text or list of tokens to be processed.

Returns:

Union[str, List[str]]: Processed text or list of processed tokens.

subscribe

def subscribe(publisher, pool_scheduler=None)

Subscribe to a publisher for event-driven processing.

Arguments:

  • publisher object - The publisher to subscribe to.

  • pool_scheduler optional - Scheduler instance for concurrency.

Returns:

None
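
Example (a sketch; outputs are illustrative and depend on the underlying spaCy/NLTK models):

from artifician.processors.text.stemlemtizer import StemLemProcessor

lemmatizer = StemLemProcessor(mode='lemmatization')
print(lemmatizer.process(None, "running faster"))  # e.g. "run fast"

stemmer = StemLemProcessor(mode='stemming')
print(stemmer.process(None, "running faster"))     # e.g. "run faster"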

artifician.processors.normalizer

Normalizer Objects

class Normalizer(processor.Processor)

Normalizes the given string value.

Attributes:

  • strategy NormalizerStrategy - Strategy for normalizing the string.

  • delimiter dict - Delimiter for splitting the string.

__init__

def __init__(strategy=None, subscribe_to=None, delimiter=None)

Initialize the Normalizer by setting up the normalizer strategy and the delimiter

Arguments:

  • strategy NormalizerStrategy - NormalizerStrategy instance that normalizes the string.

  • subscribe_to list - Optional list of publishers to subscribe to.

  • delimiter dict - Delimiter for splitting the string.

process

def process(publisher, feature_raw)

Normalizes the feature_raw value. Note: publisher.feature_value is updated in place rather than returned, since Normalizer is a processor.

Arguments:

  • publisher object - Instance of the publisher.

  • feature_raw str - The raw feature value.

Returns:

None

subscribe

def subscribe(publisher, pool_scheduler=None)

Defines the logic for subscribing to an event in the publisher.

Arguments:

  • publisher object - Instance of the publisher.

  • pool_scheduler rx.scheduler.ThreadPoolScheduler - Scheduler instance for concurrency.

Returns:

None

NormalizerStrategy Objects

class NormalizerStrategy(ABC)

Interface for normalizer strategies.

PropertiesNormalizer Objects

class PropertiesNormalizer(NormalizerStrategy)

Split by delimiter into a format that preserves the sequential position of each value found.

normalize

def normalize(feature_raw, delimiter)

Splits by delimiter into a format that preserves the sequential position of each value found in the feature text.

Arguments:

  • feature_raw str - The raw feature value.

  • delimiter - Delimiter used for breaking the string.

Returns:

  • feature_normalized list - List of tuples of normalized feature values.

PathsNormalizer Objects

class PathsNormalizer(NormalizerStrategy)

Splits by delimiter into a format that preserves the position within the tree of each value found.

get_path_values

@staticmethod
def get_path_values(feature_raw_values, delimiter)

Gets path values sequentially.

Arguments:

  • feature_raw_values list - List of strings.

  • delimiter str - Delimiter used for breaking the string.

Returns:

  • feature_normalized list - List of tuples of normalized feature text values.

normalize

def normalize(feature_raw, delimiter)

Splits by delimiter into a format that preserves the position within the tree of each value found.

Arguments:

  • feature_raw str - The feature text.

  • delimiter dict - Delimiter used for breaking the string.

Returns:

  • feature_normalized list - List of tuples of normalized feature text values.

KeyValuesNormalizer Objects

class KeyValuesNormalizer(NormalizerStrategy)

Splits by delimiter into a format that preserves the value-and-label association found.

normalize_key_values

@staticmethod
def normalize_key_values(key_values, assignment)

Breaks down text into key-value pairs using the assignment string.

Arguments:

  • key_values list - List of strings.

  • assignment str - String that separates keys and values.

Returns:

  • feature_normalized list - List of tuples of normalized feature text values.

normalize

def normalize(feature_raw, delimiter)

Splits by delimiter into a format that preserves the value-and-label association found.

Arguments:

  • feature_raw str - The raw feature value.

  • delimiter - Delimiter used for breaking the string.

Returns:

  • feature_normalized list - List of tuples of normalized feature text values.

StrategySelector Objects

class StrategySelector()

Selects the appropriate normalizer strategy for the given text input.

get_paths_delimiter

def get_paths_delimiter(texts)

Identifies whether the given texts are path strings; if so, returns the appropriate delimiter to normalize the text.

Arguments:

  • texts list - list of strings

Returns:

  • bool - True if the given texts are identified as path texts.

get_key_values_delimiter

def get_key_values_delimiter(texts)

Identifies whether the given texts are key-value strings; if so, returns the appropriate delimiter to normalize the text.

Arguments:

  • texts list - List of strings.

Returns:

  • bool - True if the given texts are identified as key:value texts, else False.

get_properties_delimiter

def get_properties_delimiter(texts)

Identifies whether the given texts are properties strings; if so, returns the appropriate delimiter to normalize the text.

Arguments:

  • texts list - List of strings.

Returns:

  • delimiter dict - Delimiter to normalize the string.

select

def select(texts)

Selects the strategy and properties appropriate for normalizing the given texts.

Arguments:

  • texts list - List of strings.

Returns:

  • strategy_properties list - The selected strategy and the properties needed to normalize the text.
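
Example (a sketch of selecting and applying a strategy; unpacking the returned list into (strategy, properties) is an assumption based on the description above):

from artifician.processors.normalizer import StrategySelector

selector = StrategySelector()

# Pick a strategy and its delimiter properties from sample texts.
strategy, properties = selector.select(["color:red size:large"])
normalized = strategy.normalize("color:red size:large", properties)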

artifician.extractors.text_extractors.keyword_extractor

KeywordExtractor Objects

class KeywordExtractor()

Keyword Extractor class for extracting specific keywords from a text.

Attributes:

  • method str - Method to use for keyword extraction ('manual', 'frequency', 'tfidf', etc.)

  • keywords List[str] - List of keywords to search within the text for 'manual' method.

__init__

def __init__(method: str = 'manual', keywords: List[str] = None)

Initialize a new KeywordExtractor object.

Arguments:

  • method str - Method to use for keyword extraction.

  • keywords List[str] - List of keywords to search within the text for 'manual' method.

Raises:

  • ValueError - If the keywords list is empty for 'manual' method.
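
Example (a minimal sketch based on the signature above):

from artifician.extractors.text_extractors.keyword_extractor import KeywordExtractor

# 'manual' requires a non-empty keywords list, otherwise ValueError.
extractor = KeywordExtractor(method='manual', keywords=['python', 'pandas'])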

artifician.extractors.html_extractors

get_node_text

def get_node_text(node: List[Union[str, Tag]]) -> str

Extracts text from a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to extract text from.

Returns:

  • str - The text content of the node.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

  • ValueError - If the node list is empty.

get_node_attribute

def get_node_attribute(node: List[Union[str, Tag]], attribute: str) -> str

Retrieves the value of a specified attribute from a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to get the attribute from.

  • attribute str - The name of the attribute to retrieve.

Returns:

  • str - The value of the attribute.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.
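
Example (a sketch with BeautifulSoup; the extractors take a list whose first element must be a bs4 Tag, and the printed outputs are illustrative):

from bs4 import BeautifulSoup
from artifician.extractors import html_extractors

soup = BeautifulSoup("<div id='box'><p>Hello</p></div>", "html.parser")
node = [soup.find("div")]

print(html_extractors.get_node_text(node))             # e.g. "Hello"
print(html_extractors.get_node_attribute(node, "id"))  # "box"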

get_parent_node_text

def get_parent_node_text(node: List[Union[str, Tag]]) -> str

Extracts text from the parent node of a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to extract parent text from.

Returns:

  • str - The text content of the parent node.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

get_child_node_text

def get_child_node_text(node: List[Union[str, Tag]]) -> str

Extracts text from the first child node of a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to extract child text from.

Returns:

  • str - The text content of the child node.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

count_child_nodes

def count_child_nodes(node: List[Union[str, Tag]]) -> int

Counts the number of child nodes for a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to count children for.

Returns:

  • int - The number of child nodes.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

get_sibling_node_text

def get_sibling_node_text(node: List[Union[str, Tag]]) -> str

Extracts text from the first sibling node of a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to extract sibling text from.

Returns:

  • str - The text content of the sibling node.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

get_parent_attribute

def get_parent_attribute(node: List[Union[str, Tag]], attribute: str) -> str

Retrieves the value of a specified attribute from the parent of a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to get the parent attribute from.

  • attribute str - The name of the attribute to retrieve.

Returns:

  • str - The value of the attribute from the parent node.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

get_child_attribute

def get_child_attribute(node: List[Union[str, Tag]], attribute: str) -> str

Retrieves the value of a specified attribute from the first child of a given node.

Arguments:

  • node List[Union[str, Tag]] - The node list to get the child attribute from.

  • attribute str - The name of the attribute to retrieve.

Returns:

  • str - The value of the attribute from the child node.

Raises:

  • TypeError - If the first element in the node list is not a bs4.element.Tag.

artifician.extractors
