from nlp_preprocessing.seq_token_generator import *
SpacyTokenizer allows you to tokenize text. Its __call__ method provides a single interface to the encode, encode_plus and tokenize methods.
class SpacyTokenizer [source]
SpacyTokenizer(vocab_file=None, spacy_tokenizer=<spacy.tokenizer.Tokenizer object>, special_token=['[PAD]'], pad_token_index=0)
SpacyTokenizer.__call__ [source]
SpacyTokenizer.__call__(inputs, call_type='tokenize', max_seq=None)
The __call__ method allows calling encode, encode_plus and tokenize from a single interface.
Args:
    inputs (List or string): input can be a string or a list of texts
    call_type (str, optional): one of 'encode', 'encode_plus' or 'tokenize'. Defaults to 'tokenize'.
    max_seq (int, optional): applies to the 'encode' and 'encode_plus' call types. Defaults to None (for the 'tokenize' call type).
Returns:
    tokens or ids: List or List of List
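The examples below only exercise the 'tokenize' call type. As a minimal sketch (assuming __call__ simply dispatches to encode_plus, as documented above), the other call types can be reached the same way:
texts = ['Hi, how are you', "I am good"]
spacy_tokenizer = SpacyTokenizer()
# Assumed equivalent to spacy_tokenizer.encode_plus(texts, max_seq=8)
ids = spacy_tokenizer(texts, call_type='encode_plus', max_seq=8)
print('Token_ids :', ids)  # expect two id lists of length 8, padded with pad_token_index (0)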
SpacyTokenizer.encode [source]
SpacyTokenizer.encode(text, max_seq=128)
The encode method encodes a text into ids of length max_seq.
Args:
    text (string): input text
    max_seq (int, optional): Defaults to 128.
Returns:
    ids: List of token ids
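Judging from the examples further down, encode appears to compose tokenize, convert_tokens_to_ids and padding with pad_token_index for a single text; a hedged sketch of that assumed equivalence:
spacy_tokenizer = SpacyTokenizer()
ids = spacy_tokenizer.encode('Hi, how are you', max_seq=8)
# Manual composition (an assumption, not the library's documented internals)
tokens = spacy_tokenizer.tokenize(['Hi, how are you'])[0]
manual_ids = spacy_tokenizer.convert_tokens_to_ids(tokens)
manual_ids += [0] * (8 - len(manual_ids))  # pad with pad_token_index
print(ids == manual_ids)  # expect True if the composition above holds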
SpacyTokenizer.encode_plus [source]
SpacyTokenizer.encode_plus(input_texts, max_seq=128)
The encode_plus method encodes a list of texts into lists of ids of length max_seq.
Args:
    input_texts (List): list of texts
    max_seq (int, optional): Defaults to 128.
Returns:
    ids: List of List of token ids
SpacyTokenizer.tokenize [source]
SpacyTokenizer.tokenize(input_texts)
The tokenize method tokenizes a list of texts.
Args:
    input_texts (List): list of texts (strings)
Returns:
    tokens: List[List]
Examples
texts = ['Hi, how are you', "I am good"]
tokens = SpacyTokenizer()(texts, call_type='tokenize')
print('Output :',tokens)
2it [00:00, 4.64it/s]
Output : [['Hi', ',', 'how', 'are', 'you'], ['I', 'am', 'good']]
texts = ['Hi, how are you', "I am good"]
spacy_tokenizer = SpacyTokenizer()
tokens = spacy_tokenizer.tokenize(texts)
ids = [spacy_tokenizer.convert_tokens_to_ids(token) for token in tokens]
print('Tokens : ',tokens)
print('Token_ids : ', ids)
print('Vocab : ', spacy_tokenizer.vocab)
2it [00:00, 79.79it/s]
Tokens : [['Hi', ',', 'how', 'are', 'you'], ['I', 'am', 'good']]
Token_ids : [[1, 2, 3, 4, 5], [6, 7, 8]]
Vocab : {'[PAD]': 0, 'Hi': 1, ',': 2, 'how': 3, 'are': 4, 'you': 5, 'I': 6, 'am': 7, 'good': 8}
texts = ['Hi, how are you', "I am good"]
spacy_tokenizer = SpacyTokenizer()
ids = spacy_tokenizer.encode_plus(texts, max_seq=10)
print('Token_ids : ', ids)
print('Vocab : ', spacy_tokenizer.vocab)
2it [00:00, 80.31it/s]
Token_ids : [[1, 2, 3, 4, 5, 0, 0, 0, 0, 0], [6, 7, 8, 0, 0, 0, 0, 0, 0, 0]]
Vocab : {'[PAD]': 0, 'Hi': 1, ',': 2, 'how': 3, 'are': 4, 'you': 5, 'I': 6, 'am': 7, 'good': 8}
spacy_tokenizer = SpacyTokenizer()
ids = spacy_tokenizer.encode('Hi, how are you', max_seq=10)
print('Token Ids :', ids)
print('Vocab : ', spacy_tokenizer.vocab)
Token Ids : [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
Vocab : {'[PAD]': 0, 'Hi': 1, ',': 2, 'how': 3, 'are': 4, 'you': 5}
from nlp_preprocessing.seq_parser_token_generator import *
SpacyParseTokenizer tokenizes text and returns different parse tokens, i.e. the dependency, tag and POS parses, from a spaCy model.
class SpacyParseTokenizer [source]
SpacyParseTokenizer(parsers=['pos', 'tag', 'dep'])
SpacyParseTokenizer.__call__ [source]
SpacyParseTokenizer.__call__(inputs, call_type='tokenize', max_seq=None)
The __call__ method provides a single interface to the encode, encode_plus and tokenize methods.
Args:
    inputs (List or string): a string (for the 'encode' call type) or a List (for 'encode_plus' and 'tokenize')
    call_type (str, optional): one of 'encode', 'encode_plus' or 'tokenize'. Defaults to 'tokenize'.
    max_seq (int, optional): applies to the 'encode' and 'encode_plus' call types. Defaults to None (for the 'tokenize' call type).
Returns:
    results: dict (with keys 'tag', 'pos' and 'dep')
SpacyParseTokenizer.tokenize [source]
SpacyParseTokenizer.tokenize(input_texts)
The tokenize method tokenizes a list of texts.
Args:
    input_texts (List): list of texts (strings)
Returns:
    results: dict
SpacyParseTokenizer.encode [source]
SpacyParseTokenizer.encode(text, max_seq=128)
The encode method encodes a text into ids of length max_seq.
Args:
    text (string): input text
    max_seq (int, optional): Defaults to 128.
Returns:
    results: dict
SpacyParseTokenizer.encode_plus [source]
SpacyParseTokenizer.encode_plus(input_texts, max_seq=128)
The encode_plus method encodes a list of texts into lists of ids of length max_seq.
Args:
    input_texts (List): list of texts
    max_seq (int, optional): Defaults to 128.
Returns:
    results: dict
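The examples below only show the 'tokenize' call type for SpacyParseTokenizer. As a hedged sketch, assuming each parser ('pos', 'tag', 'dep') keeps its own vocabulary and pads ids to max_seq the way SpacyTokenizer does:
texts = ['Hi, how are you', "I am good"]
parse_tokenizer = SpacyParseTokenizer(parsers=['pos', 'tag', 'dep'])
parse_ids = parse_tokenizer.encode_plus(texts, max_seq=10)
for parser, ids in parse_ids.items():
    # expect one id list of length 10 per input text (assumed zero-padding)
    print(parser, ':', ids)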
Examples
texts = ['Hi, how are you', "I am good"]
tokens = SpacyParseTokenizer()(texts, call_type='tokenize')
print('Output :',tokens)
2it [00:00, 51.19it/s]
Output : {'pos': [['INTJ', 'PUNCT', 'ADV', 'AUX', 'PRON'], ['PRON', 'AUX', 'ADJ']], 'tag': [['UH', ',', 'WRB', 'VBP', 'PRP'], ['PRP', 'VBP', 'JJ']], 'dep': [['intj', 'punct', 'advmod', 'ROOT', 'nsubj'], ['nsubj', 'ROOT', 'acomp']]}
texts = ['Hi, how are you', "I am good"]
tokens = SpacyTokenizer()(texts, call_type='tokenize')
parse_tokens = SpacyParseTokenizer()(texts, call_type='tokenize')
print('Output : ',tokens)
print('Parse Dict : ', parse_tokens)
2it [00:00, 70.61it/s]
2it [00:00, 54.82it/s]
Output : [['Hi', ',', 'how', 'are', 'you'], ['I', 'am', 'good']]
Parse Dict : {'pos': [['INTJ', 'PUNCT', 'ADV', 'AUX', 'PRON'], ['PRON', 'AUX', 'ADJ']], 'tag': [['UH', ',', 'WRB', 'VBP', 'PRP'], ['PRP', 'VBP', 'JJ']], 'dep': [['intj', 'punct', 'advmod', 'ROOT', 'nsubj'], ['nsubj', 'ROOT', 'acomp']]}
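For completeness, a sketch combining both tokenizers to build padded model inputs; it assumes encode_plus pads both outputs to the same max_seq so token ids and parse ids stay aligned:
texts = ['Hi, how are you', "I am good"]
spacy_tokenizer = SpacyTokenizer()
parse_tokenizer = SpacyParseTokenizer(parsers=['pos'])
token_ids = spacy_tokenizer.encode_plus(texts, max_seq=10)
pos_ids = parse_tokenizer.encode_plus(texts, max_seq=10)['pos']
print('Token_ids :', token_ids)
print('POS_ids   :', pos_ids)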