API Reference
artifician
artifician.feature_definition
FeatureDefinition Objects
Contains all the functionality for preparing a single feature.
Attributes:
value
Any - The value of the feature.
cached
dict - Cached observables for different events.
extractor
Callable - Function to extract the feature value from the sample.
EVENT_PROCESSED
Callable - Event that processes the feature.
MAP_VALUES
Callable - Event that maps values of the feature.
extractor_parameters
Tuple - Parameters for the extractor function.
__init__
Initializes a FeatureDefinition instance.
Arguments:
extractor
Callable, optional - Function to extract feature value.
subscribe_to
List - List of publishers to subscribe to.
extractor_parameters
- Additional parameters for the extractor.
Raises:
ValueError
- If no publishers are provided to subscribe to.
process
Processes the sample to build the feature value.
Arguments:
sample
Any - The sample data.
publisher
- The instance of the publisher.
map
Maps the feature value into an int or list of ints.
Arguments:
feature_value
Any - The feature value to be mapped.
observe
Builds and returns an observable for a given event.
Arguments:
event
Callable - The function to create an observable from.
Returns:
Observable
- An observable for the given event.
subscribe
Defines logic for subscribing to an event in a publisher.
Arguments:
publisher
- The publisher instance.
pool_scheduler
optional - The scheduler instance for concurrency.
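Example: a minimal usage sketch based on the signatures above. The import paths mirror the module names in this reference; treating a Dataset instance as a valid publisher for subscribe_to, and the shape of the sample dictionaries, are illustrative assumptions.
```python
from artifician.dataset import Dataset
from artifician.feature_definition import FeatureDefinition

def title_extractor(sample):
    # Assumed extractor contract: receives a sample, returns the raw feature value.
    return sample["title"]

dataset = Dataset()

# subscribe_to wires the feature to one or more publishers;
# omitting it raises ValueError.
title_feature = FeatureDefinition(extractor=title_extractor, subscribe_to=[dataset])

# Adding samples to the dataset is assumed to trigger processing
# of the subscribed feature for each sample.
dataset.add_samples([{"title": "hello world"}, {"title": "feature engineering"}])
```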
artifician.dataset
Dataset Objects
Dataset contains all the functionality for preparing Artifician data. It observes events and stores all processed data in a Pandas DataFrame.
Attributes:
cached
dict - Cached observables for different events.
datastore
pd.DataFrame - DataFrame to store all samples.
PREPARE_DATASET
Callable - Event to prepare the dataset.
POST_PROCESS
Callable - Event for post-processing actions on the dataset.
add_samples
Adds samples to the datastore.
Arguments:
samples
Any - Artifician data to be added.
Returns:
pd.DataFrame
- The updated dataset.
Raises:
TypeError
- If the input data is not a list.
observe
Builds and returns an observable for a given event.
Arguments:
event
Callable - Function to create an observable from.
Returns:
rx.subject.Subject
- Observable for the given event.
post_process
This event should be called after Artifician data is prepared. Listeners to the post_process event can perform collective actions on the dataset.
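Example: a short sketch of the documented add_samples and post_process flow. Invoking post_process as a method call is an assumption; the docstrings only state that the event should be raised after the data is prepared.
```python
import pandas as pd
from artifician.dataset import Dataset

dataset = Dataset()

# add_samples requires a list; any other type raises TypeError.
df = dataset.add_samples(["sample one", "sample two"])
assert isinstance(df, pd.DataFrame)  # the updated datastore is returned

# Assumed call style: firing post_process lets listeners run collective,
# dataset-wide actions once all samples are prepared.
dataset.post_process()
```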
artifician.processors.chain
chain Objects
Manages a chain of processors.
This class handles the sequential execution of a chain of processors and can subscribe to a publisher to trigger the processing.
Attributes:
processors
list - A list of processors in the chain.
__init__
Initializes the chain with an optional list of processors.
Arguments:
processors
list, optional - An initial list of processors to be managed.
then
Adds a processor to the end of the chain.
Arguments:
processor
Processor - The processor to add to the chain.
Returns:
processor_chaining
chain - The chain instance.
process
Processes data sequentially through the chain of processors.
Arguments:
data
- The data to be processed by the chain.
Returns:
The final processed data after passing through all processors.
subscribe
Subscribes the processor chain to a feature definition.
The feature definition will trigger the processing of the chain.
Arguments:
feature_definition
publisher - The feature definition to subscribe to.
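Example: an illustrative chain built from the text processors documented later in this reference. The keyword arguments shown are taken from those sections; exactly how the chain forwards data between processors is not specified here, so this is a sketch rather than a confirmed contract.
```python
from artifician.processors.chain import chain
from artifician.processors.text.text_cleaner import TextCleaningProcessor
from artifician.processors.text.tokenizer import TokenizationProcessor

# Build a chain that cleans text first, then tokenizes it.
pipeline = chain([TextCleaningProcessor(lowercase=True, remove_punctuation=True)])
pipeline.then(TokenizationProcessor(method="word"))

# process() pushes the data through each processor in order and returns
# the output of the final processor.
tokens = pipeline.process("Some RAW text, with punctuation!")

# Alternatively, subscribe the whole chain to a feature definition so it runs
# whenever that feature is processed:
# pipeline.subscribe(feature_definition)
```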
artifician.processors.mapper
Mapper Objects
Mapper is a processor responsible for mapping (converting) feature values to integers.
Attributes:
feature_map
FeatureMap - Feature map containing a {value: id} dictionary.
map_key_values
bool - True to map both keys and values, False to map only keys.
__init__
Initializes the Mapper by setting up the feature map.
Arguments:
feature_map
FeatureMap - Instance of FeatureMap.
map_key_values
Boolean - True to map both keys and values, False to map only values.
process
Updates the publisher's feature value by mapping the feature value to an int.
Arguments:
publisher
object - Instance of the publisher.
feature_value
string - The feature value to be mapped.
Returns:
value_id
- ID of the mapped feature value.
subscribe
Defines logic for subscribing to an event in the publisher.
Arguments:
publisher
object - Instance of the publisher.
pool_scheduler
rx.scheduler.ThreadPoolScheduler - Scheduler instance for concurrency.
Returns:
None
FeatureMap Objects
Converts a given value to an int.
Attributes:
values_map
dictionary - {value : id}
get_value_id
Returns the ID of the value in the values map. The value is converted to str first, since dictionary keys here cannot be of types other than str or int, and any datatype can be converted to str.
Arguments:
value
any - The value whose ID should be returned.
Returns:
value_id
int - ID of the value
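Example: an illustrative sketch of FeatureMap and Mapper. Constructing FeatureMap with no arguments, and the assumption that get_value_id assigns and reuses an integer ID per value, are not confirmed by the docstrings above.
```python
from artifician.processors.mapper import FeatureMap, Mapper

# FeatureMap keeps a {value: id} dictionary; values are converted to str
# before lookup. (No-argument construction is an assumption.)
feature_map = FeatureMap()
red_id = feature_map.get_value_id("red")
red_again = feature_map.get_value_id("red")  # assumed to return the same ID

# Mapper wraps the FeatureMap and, when subscribed, updates the publisher's
# feature value with the mapped integer IDs.
mapper = Mapper(feature_map, map_key_values=False)
# mapper.subscribe(feature_definition)
```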
artifician.processors
artifician.processors.processor
Processor Objects
Interface for processors in the Artifician library, updated for processor chaining.
This abstract class defines the interface for processors, including methods for processing data and subscribing to publishers, along with the ability to chain processors.
process
Process the data and update the publisher with the processed values.
Arguments:
publisher
- The publisher that will be updated with the processed data.
data
- The data to be processed.
subscribe
Subscribe the processor to a publisher (e.g., FeatureDefinition).
Arguments:
publisher
- The publisher to subscribe to.
pool_scheduler
optional - The scheduler to be used for subscription.
then
Link this processor to the next one in the chain.
Arguments:
next_processor
- The next processor to add to the chain.
Returns:
chain
- chain of processors
Raises:
TypeError
- If the next_processor is not a valid processor instance.
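Example: a toy subclass sketching how the Processor interface might be implemented. The subscription wiring via observe and EVENT_PROCESSED is an assumption drawn from the FeatureDefinition section, not a documented requirement.
```python
from artifician.processors.processor import Processor

class UppercaseProcessor(Processor):
    """Toy processor that uppercases string data (illustrative only)."""

    def process(self, publisher, data):
        processed = data.upper() if isinstance(data, str) else data
        # Convention in this reference: processors update the publisher
        # rather than only returning the result.
        publisher.value = processed
        return processed

    def subscribe(self, publisher, pool_scheduler=None):
        # Assumed wiring: react whenever the publisher's EVENT_PROCESSED fires.
        publisher.observe(publisher.EVENT_PROCESSED).subscribe(
            lambda sample: self.process(publisher, sample)
        )

# then() links processors into a chain and raises TypeError for non-processors:
# pipeline = UppercaseProcessor().then(SomeOtherProcessor())
```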
artifician.processors.text
artifician.processors.text.text_cleaner
TextCleaningProcessor Objects
Processor for cleaning and preprocessing text data.
Configurable attributes for various cleaning operations.
__init__
Initialize a TextCleaningProcessor object.
Arguments:
lowercase
bool - Flag to convert text to lowercase.
remove_punctuation
bool - Flag to remove punctuation.
remove_numbers
bool - Flag to remove numbers.
strip_whitespace
bool - Flag to strip extra whitespace.
remove_html_tags
bool - Flag to remove HTML tags.
remove_urls
bool - Flag to remove URLs.
custom_stop_words
List[str] - Optional list of custom stop words.
subscribe_to
list - Optional list of publishers to subscribe to.
process
Process the text or list of texts to clean and preprocess.
Arguments:
publisher
- The publisher associated with the processor.
text
Union[str, List[str]] - The text or list of texts to be processed.
Returns:
Union[str, List[str]]: Cleaned and preprocessed text.
subscribe
Subscribe to a publisher for event-driven processing.
Arguments:
publisher
object - The publisher to subscribe to.
pool_scheduler
optional - Scheduler instance for concurrency.
Returns:
None
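Example: a standalone cleaning sketch using the flags documented above. Passing None as the publisher for standalone use is an assumption; in a pipeline the processor would instead be subscribed to a publisher.
```python
from artifician.processors.text.text_cleaner import TextCleaningProcessor

cleaner = TextCleaningProcessor(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=False,
    strip_whitespace=True,
    remove_html_tags=True,
    remove_urls=True,
)

# process() accepts a string or a list of strings and returns cleaned text
# in the same shape. Publisher=None is an assumption for standalone use.
cleaned = cleaner.process(None, "  <p>Visit https://example.com NOW!!</p>  ")
```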
artifician.processors.text.stop_word_remover
StopWordsRemoverProcessor Objects
Processor for removing stop words from text data.
Attributes:
stop_words
set - A set of stop words to be removed.
__init__
Initialize a StopWordsRemoverProcessor object.
Arguments:
custom_stop_words
List[str] - Optional list of custom stop words.
subscribe_to
list - Optional list of publishers to subscribe to.
process
Process the text or list of texts to remove stop words.
Arguments:
publisher
- The publisher associated with the processor.
text
Union[str, List[str]] - The text or list of texts to be processed.
Returns:
Union[str, List[str]]: Text after stop words removal.
Raises:
ValueError
- If the input text is None or an empty list.
subscribe
Subscribe to a publisher for event-driven processing.
Arguments:
publisher
object - The publisher to subscribe to.
pool_scheduler
optional - Scheduler instance for concurrency.
Returns:
None
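Example: a short stop-word removal sketch. As above, passing None as the publisher is an assumption for standalone use.
```python
from artifician.processors.text.stop_word_remover import StopWordsRemoverProcessor

remover = StopWordsRemoverProcessor(custom_stop_words=["please", "kindly"])

# Returns the text (or list of texts) with stop words removed;
# None or an empty list raises ValueError.
result = remover.process(None, "please remove the stop words kindly")
```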
artifician.processors.text.tokenizer
TokenizationProcessor Objects
Tokenization Processor for splitting text into tokens.
Attributes:
method
str - Method to use for tokenization ('word' or 'sentence').
nlp
- spaCy language model for processing text.
__init__
Initialize a TokenizationProcessor object.
Arguments:
method
str - Method to use for tokenization ('word' or 'sentence').
process
Process the text or list of texts and split it into tokens.
Arguments:
text
Union[str, List[str], None] - The text or list of texts to be tokenized.
Returns:
Union[List[str], List[List[str]]]: A list of tokens or list of lists of tokens.
Raises:
ValueError
- If the input text is None or an empty list.
subscribe
Defines logic for subscribing to an event in the publisher.
Arguments:
publisher
object - instance of the publisher
pool_scheduler
rx.scheduler.ThreadPoolScheduler - scheduler instance for concurrency
Returns:
None
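Example: tokenization in both documented modes. process() is documented with only a text argument, so it is called directly here.
```python
from artifician.processors.text.tokenizer import TokenizationProcessor

word_tokenizer = TokenizationProcessor(method="word")
tokens = word_tokenizer.process("Artifician splits text into tokens.")

sentence_tokenizer = TokenizationProcessor(method="sentence")
sentences = sentence_tokenizer.process("First sentence. Second sentence.")
```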
artifician.processors.text.stemlemtizer
StemLemProcessor Objects
Processor for applying stemming and lemmatization to text data.
Attributes:
mode
str - Mode of operation ('stemming' or 'lemmatization').
nlp
- spaCy language model for lemmatization.
stemmer
- NLTK stemmer for stemming.
__init__
Initialize a StemLemProcessor object.
Arguments:
mode
str - Operation mode ('stemming' or 'lemmatization').
subscribe_to
list - Optional list of publishers to subscribe to.
process
Process the text or list of tokens for stemming or lemmatization.
Arguments:
publisher
- The publisher associated with the processor.
text
Union[str, List[str]] - The text or list of tokens to be processed.
Returns:
Union[str, List[str]]: Processed text or list of processed tokens.
subscribe
Subscribe to a publisher for event-driven processing.
Arguments:
publisher
object - The publisher to subscribe to.
pool_scheduler
optional - Scheduler instance for concurrency.
Returns:
None
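Example: stemming and lemmatization sketches in the two documented modes. Passing None as the publisher is again an assumption for standalone use.
```python
from artifician.processors.text.stemlemtizer import StemLemProcessor

lemmatizer = StemLemProcessor(mode="lemmatization")
lemmas = lemmatizer.process(None, ["running", "mice", "better"])

stemmer = StemLemProcessor(mode="stemming")
stems = stemmer.process(None, "The runners were running quickly")
```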
artifician.processors.normalizer
Normalizer Objects
Normalize the given string value
Attributes:
strategy
NormalizerStrategy - Strategy for normalizing the string.
delimiter
dictionary - Delimiter for splitting the string.
__init__
Initialize the Normalizer by setting up the normalizer strategy and the delimiter
Arguments:
strategy
NormalizerStrategy - NormalizerStrategy instance that normalizes the string.
delimiter
dictionary - Delimiter for splitting the string.
process
Normalizes the feature_raw value. Note: publisher.feature_value is updated in place rather than returned, since the normalizer is a processor.
Arguments:
publisher
object - Instance of the publisher.
feature_raw
string - The raw feature value.
Returns:
None
subscribe
Defines logic for subscribing to an event in the publisher.
Arguments:
publisher
object - Instance of the publisher.
pool_scheduler
rx.scheduler.ThreadPoolScheduler - Scheduler instance for concurrency.
Returns:
None
NormalizerStrategy Objects
Interface for normalizer strategies.
PropertiesNormalizer Objects
Split by delimiter into a format that preserves the sequential position of each value found.
normalize
Splits by delimiter into a format that preserves the sequential position of each value found in the feature text.
Arguments:
delimiter
- Delimiter used for breaking the string.
feature_raw
string - The raw feature text.
Returns:
feature_normalized
list - list of tuple of normalized feature raw
PathsNormalizer Objects
Splits by delimiter into a format that preserves the position within the tree of each value found.
get_path_values
Gets path values sequentially.
Arguments:
feature_raw_values
list - List of strings.
delimiter
string - Delimiter used for breaking the string.
Returns:
feature_normalized
list - list of tuple of normalized feature text values
normalize
Splits by delimiter into a format that preserves the position within the tree of each value found.
Arguments:
feature_raw
string - Feature text.
delimiter
dict - Delimiter used for breaking the string.
Returns:
feature_normalized
list - list of tuple of normalized feature text values
KeyValuesNormalizer Objects
Splits by delimiter into a format that preserves the value and label association found.
normalize_key_values
Breaks down text into key-value pairs using the assignment string.
Arguments:
key_values
list - List of strings.
assignment
string - String that separates keys and values.
Returns:
feature_normalized
list - list of tuple of normalized feature text values
normalize
Splits by delimiter into a format that preserves the value and label association found.
Arguments:
feature_raw
string - The raw feature text.
delimiter
- Delimiter used for breaking the string.
Returns:
feature_normalized
list - list of tuple of normalized feature text values
StrategySelector Objects
Based on the text input, selects the appropriate normalizer strategy to normalize the text.
get_paths_delimiter
Identifies whether the given texts form a paths string; if so, returns the appropriate delimiter to normalize the text.
Arguments:
texts
list - list of strings
Returns:
Bool
True/False - True if the given texts are identified as path texts.
get_key_values_delimiter
Identifies whether the given texts form a key-values string; if so, returns the appropriate delimiter to normalize the text.
Arguments:
texts
str - list of strings
Returns:
Bool
True/False - True if the given texts are identified as key:values text, otherwise False.
get_properties_delimiter
Identifies whether the given texts form a properties string; if so, returns the appropriate delimiter to normalize the text.
Arguments:
texts
str - list of strings
Returns:
delimiter
dict - delimiter to normalize the string
select
Selects the strategy and properties to normalize the given texts.
Arguments:
texts
list - List of strings.
Returns:
strategy_properties
list - list of strategy and properties to normalize the text
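Example: an illustrative sketch of the normalizer classes above. The shape of the delimiter dictionary is hypothetical, since its keys are not documented, and no-argument construction of the strategy classes is an assumption.
```python
from artifician.processors.normalizer import (
    Normalizer,
    PathsNormalizer,
    StrategySelector,
)

# Explicitly pick a strategy; the delimiter dictionary shown here is a
# hypothetical shape, not a documented format.
normalizer = Normalizer(strategy=PathsNormalizer(), delimiter={"delimiter": "/"})

# Or let StrategySelector inspect the raw texts and return the strategy and
# properties to use for normalization.
selector = StrategySelector()
strategy_properties = selector.select(["color=red", "size=large"])
```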
artifician.extractors.text_extractors.keyword_extractor
KeywordExtractor Objects
Keyword Extractor class for extracting specific keywords from a text.
Attributes:
method
str - Method to use for keyword extraction ('manual', 'frequency', 'tfidf', etc.).
keywords
List[str] - List of keywords to search within the text for 'manual' method.
__init__
Initialize a new KeywordExtractor object.
Arguments:
method
str - Method to use for keyword extraction.
keywords
List[str] - List of keywords to search within the text for 'manual' method.
Raises:
ValueError
- If the keywords list is empty for 'manual' method.
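Example: construction only, since only __init__ is documented in this section; the extraction call itself is not shown.
```python
from artifician.extractors.text_extractors.keyword_extractor import KeywordExtractor

# 'manual' requires a non-empty keywords list, otherwise ValueError is raised.
manual_extractor = KeywordExtractor(method="manual", keywords=["price", "warranty"])

# Statistical methods such as 'frequency' or 'tfidf' need no keyword list.
frequency_extractor = KeywordExtractor(method="frequency")
```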
artifician.extractors.html_extractors
get_node_text
Extracts text from a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to extract text from.
Returns:
str
- The text content of the node.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
ValueError
- If the node list is empty.
get_node_attribute
Retrieves the value of a specified attribute from a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to get the attribute from.
attribute
str - The name of the attribute to retrieve.
Returns:
str
- The value of the attribute.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
get_parent_node_text
Extracts text from the parent node of a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to extract parent text from.
Returns:
str
- The text content of the parent node.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
get_child_node_text
Extracts text from the first child node of a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to extract child text from.
Returns:
str
- The text content of the child node.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
count_child_nodes
Counts the number of child nodes for a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to count children for.
Returns:
int
- The number of child nodes.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
get_sibling_node_text
Extracts text from the first sibling node of a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to extract sibling text from.
Returns:
str
- The text content of the sibling node.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
get_parent_attribute
Retrieves the value of a specified attribute from the parent of a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to get the parent attribute from.
attribute
str - The name of the attribute to retrieve.
Returns:
str
- The value of the attribute from the parent node.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
get_child_attribute
Retrieves the value of a specified attribute from the first child of a given node.
Arguments:
node
List[Union[str, Tag]] - The node list to get the child attribute from.
attribute
str - The name of the attribute to retrieve.
Returns:
str
- The value of the attribute from the child node.
Raises:
TypeError
- If the first element in the node list is not a bs4.element.Tag.
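Example: a sketch of the HTML extractor functions above on a small BeautifulSoup document. The expected values in the comments are indicative; each function receives a node list whose first element must be a bs4 Tag.
```python
from bs4 import BeautifulSoup
from artifician.extractors import html_extractors

html = "<div id='card'><span class='title'>Laptop</span><span>999</span></div>"
node = [BeautifulSoup(html, "html.parser").find("div")]  # first element is a bs4 Tag

text = html_extractors.get_node_text(node)                # text content of the div
node_id = html_extractors.get_node_attribute(node, "id")  # e.g. "card"
children = html_extractors.count_child_nodes(node)        # number of child nodes
child_text = html_extractors.get_child_node_text(node)    # text of the first child
```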
artifician.extractors