from nlp_preprocessing.seq_token_generator import *
SpacyTokenizer allows you to tokenize your text. Its __call__ method provides a single interface to the encode, encode_plus, and tokenize methods.
class SpacyTokenizer[source]
SpacyTokenizer(vocab_file=None,spacy_tokenizer=<spacy.tokenizer.Tokenizer>,special_token=['[PAD]'],pad_token_index=0)
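Only the defaults above are documented. As a minimal sketch (not from the original docs), constructing a tokenizer with an extra special token might look like this; the '[UNK]' token and the assumption that special tokens occupy the first vocab indices in list order are hypothetical:
# Hypothetical: add an extra special token; '[PAD]' is assumed to keep index 0
tokenizer = SpacyTokenizer(special_token=['[PAD]', '[UNK]'], pad_token_index=0)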
SpacyTokenizer.__call__[source]
SpacyTokenizer.__call__(inputs,call_type='tokenize',max_seq=None)
The __call__ method allows calling encode, encode_plus, and tokenize through a single interface.
Args:
inputs (string or List): a single string or a list of texts
call_type (str, optional): one of 'encode', 'encode_plus', or 'tokenize'. Defaults to 'tokenize'.
max_seq (int, optional): applies to the 'encode' and 'encode_plus' call types. Defaults to None (used for the 'tokenize' call type).
Returns:
tokens or ids: a List (for 'encode') or a List of Lists (for 'encode_plus' and 'tokenize')
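Outputs for the 'tokenize' call type are shown in the Examples below; as a minimal sketch, dispatching the other two call types through the same interface would look like this (id values are omitted because they depend on the vocab built at call time):
tokenizer = SpacyTokenizer()
# 'encode' takes a single string, like tokenizer.encode(text, max_seq=10)
ids = tokenizer('Hi, how are you', call_type='encode', max_seq=10)
# 'encode_plus' takes a list of texts, like tokenizer.encode_plus(texts, max_seq=10)
batch_ids = tokenizer(['Hi, how are you', 'I am good'], call_type='encode_plus', max_seq=10)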
SpacyTokenizer.encode[source]
SpacyTokenizer.encode(text,max_seq=128)
The encode method encodes a single text into token ids, padded to max_seq length.
Args:
text (string): input text
max_seq (int, optional): maximum sequence length. Defaults to 128.
Returns:
ids: List of token ids
SpacyTokenizer.encode_plus[source]
SpacyTokenizer.encode_plus(input_texts,max_seq=128)
The encode_plus method encodes a list of texts into lists of token ids, each padded to max_seq length.
Args:
input_texts (List): list of texts
max_seq (int, optional): maximum sequence length. Defaults to 128.
Returns:
ids: List of Lists of token ids
SpacyTokenizer.tokenize[source]
SpacyTokenizer.tokenize(input_texts)
The tokenize method tokenizes a list of texts.
Args:
input_texts (List): list of texts (strings)
Returns:
tokens: List of Lists of tokens
Examples
texts = ['Hi, how are you', "I am good"]
tokens = SpacyTokenizer()(texts, call_type='tokenize')
print('Output :',tokens)
2it [00:00, 4.64it/s]
Output : [['Hi', ',', 'how', 'are', 'you'], ['I', 'am', 'good']]
texts = ['Hi, how are you', "I am good"]
spacy_tokenizer = SpacyTokenizer()
tokens = spacy_tokenizer.tokenize(texts)
ids = [spacy_tokenizer.convert_tokens_to_ids(token) for token in tokens]
print('Tokens : ',tokens)
print('Token_ids : ', ids)
print('Vocab : ', spacy_tokenizer.vocab)
2it [00:00, 79.79it/s]
Tokens : [['Hi', ',', 'how', 'are', 'you'], ['I', 'am', 'good']]
Token_ids : [[1, 2, 3, 4, 5], [6, 7, 8]]
Vocab : {'[PAD]': 0, 'Hi': 1, ',': 2, 'how': 3, 'are': 4, 'you': 5, 'I': 6, 'am': 7, 'good': 8}
texts = ['Hi, how are you', "I am good"]
spacy_tokenizer = SpacyTokenizer()
ids = spacy_tokenizer.encode_plus(texts, max_seq=10)
print('Token_ids : ', ids)
print('Vocab : ', spacy_tokenizer.vocab)
2it [00:00, 80.31it/s]
Token_ids : [[1, 2, 3, 4, 5, 0, 0, 0, 0, 0], [6, 7, 8, 0, 0, 0, 0, 0, 0, 0]]
Vocab : {'[PAD]': 0, 'Hi': 1, ',': 2, 'how': 3, 'are': 4, 'you': 5, 'I': 6, 'am': 7, 'good': 8}
spacy_tokenizer = SpacyTokenizer()
ids = spacy_tokenizer.encode('Hi, how are you', max_seq=10)
print('Token Ids :', ids)
print('Vocab : ', spacy_tokenizer.vocab)
Token Ids : [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
Vocab : {'[PAD]': 0, 'Hi': 1, ',': 2, 'how': 3, 'are': 4, 'you': 5}
from nlp_preprocessing.seq_parser_token_generator import *
SpacyParseTokenizer allows you to tokenize text and obtain different parse tokens, i.e. dependency, tag, and POS parses, from a spaCy model.
class SpacyParseTokenizer[source]
SpacyParseTokenizer(parsers=['pos', 'tag', 'dep'])
SpacyParseTokenizer.__call__[source]
SpacyParseTokenizer.__call__(inputs,call_type='tokenize',max_seq=None)
The __call__ method provides a single interface to the encode, encode_plus, and tokenize methods.
Args:
inputs (string or List): a string (for the 'encode' call type) or a List (for 'encode_plus' and 'tokenize')
call_type (str, optional): one of 'encode', 'encode_plus', or 'tokenize'. Defaults to 'tokenize'.
max_seq (int, optional): applies to the 'encode' and 'encode_plus' call types. Defaults to None (used for the 'tokenize' call type).
Returns:
results: dict (with keys 'pos', 'tag', 'dep')
SpacyParseTokenizer.tokenize[source]
SpacyParseTokenizer.tokenize(input_texts)
The tokenize method tokenizes a list of texts.
Args:
input_texts (List): list of texts (strings)
Returns:
results: dict (with keys 'pos', 'tag', 'dep')
SpacyParseTokenizer.encode[source]
SpacyParseTokenizer.encode(text,max_seq=128)
The encode method encodes a text into ids with max_seq length.
Args:
text (string): input text
max_seq (int, optional): maximum sequence length. Defaults to 128.
Returns:
results: dict (with keys 'pos', 'tag', 'dep')
SpacyParseTokenizer.encode_plus[source]
SpacyParseTokenizer.encode_plus(input_texts,max_seq=128)
The encode_plus method encodes a list of texts into lists of ids with max_seq length.
Args:
input_texts (List): list of texts
max_seq (int, optional): maximum sequence length. Defaults to 128.
Returns:
results: dict (with keys 'pos', 'tag', 'dep')
Examples
texts = ['Hi, how are you', "I am good"]
tokens = SpacyParseTokenizer()(texts, call_type='tokenize')
print('Output :',tokens)
2it [00:00, 51.19it/s]
Output : {'pos': [['INTJ', 'PUNCT', 'ADV', 'AUX', 'PRON'], ['PRON', 'AUX', 'ADJ']], 'tag': [['UH', ',', 'WRB', 'VBP', 'PRP'], ['PRP', 'VBP', 'JJ']], 'dep': [['intj', 'punct', 'advmod', 'ROOT', 'nsubj'], ['nsubj', 'ROOT', 'acomp']]}
texts = ['Hi, how are you', "I am good"]
tokens = SpacyTokenizer()(texts, call_type='tokenize')
parse_tokens = SpacyParseTokenizer()(texts, call_type='tokenize')
print('Output : ',tokens)
print('Parse Dict : ', parse_tokens)
2it [00:00, 70.61it/s]
2it [00:00, 54.82it/s]
Output : [['Hi', ',', 'how', 'are', 'you'], ['I', 'am', 'good']]
Parse Dict : {'pos': [['INTJ', 'PUNCT', 'ADV', 'AUX', 'PRON'], ['PRON', 'AUX', 'ADJ']], 'tag': [['UH', ',', 'WRB', 'VBP', 'PRP'], ['PRP', 'VBP', 'JJ']], 'dep': [['intj', 'punct', 'advmod', 'ROOT', 'nsubj'], ['nsubj', 'ROOT', 'acomp']]}
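The examples above only exercise the 'tokenize' call type. As a minimal sketch based on the signatures documented above, encode and encode_plus would be called like this (exact id values depend on the internal pos/tag/dep vocabularies, so no output is shown):
parse_tokenizer = SpacyParseTokenizer()
# encode: single text -> dict with 'pos', 'tag' and 'dep' keys,
# each holding a list of ids of max_seq length
single = parse_tokenizer.encode('Hi, how are you', max_seq=10)
# encode_plus: list of texts -> dict with the same keys,
# each holding one list of ids per input text
batch = parse_tokenizer.encode_plus(['Hi, how are you', 'I am good'], max_seq=10)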