from nlp_preprocessing.vocab_embedding_extractor import VocabEmbeddingExtractor
VocabEmbeddingExtractor
allows you to extract a vocabulary and its corresponding embeddings from pretrained word vectors such as word2vec, fastText, and GloVe
class VocabEmbeddingExtractor [source]
VocabEmbeddingExtractor(vector_file, input_file, column_name)
VocabEmbeddingExtractor
takes an input file and a vector file and extracts the vocabulary and corresponding embeddings from pretrained word vectors such as word2vec, fastText, and GloVe
Args:
vector_file (string): path to an external pretrained vector file, e.g. word2vec, GloVe, or fastText
input_file (string): path to the input CSV file
column_name (string): name of the text column in the input file; see the quick check below
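Because column_name must match a column header in the CSV exactly, it can help to inspect the file first. A minimal check using pandas (the path below is just the one used in the example further down, so adjust it to your own data):

import pandas as pd

# Peek at the CSV to confirm which column holds the text
df = pd.read_csv('../input/complete-tweet-sentiment-extraction-data/tweet_dataset.csv')
print(df.columns.tolist())   # confirm that 'text' appears here
print(df['text'].head())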
VocabEmbeddingExtractor.process [source]
VocabEmbeddingExtractor.process(output_dir, special_tokens=[])
process
method runs the extraction and saves the output to output_dir; a conceptual sketch of what it does follows the argument list below
Args:
output_dir (string): output directory
special_tokens (list of string, optional): list of special tokens to add to the vocabulary, e.g. [PAD], [SEP]. Defaults to [].
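Conceptually, process builds the set of tokens that occur in the text column, keeps only those tokens that also have a vector in the external embedding file, prepends the special_tokens, and writes the resulting vocabulary and embedding matrix to output_dir. The sketch below is not the library's implementation, only an illustration of that idea; the whitespace/lowercase tokenization, the header-line skip, and the random initialization of special-token vectors are all assumptions.

import numpy as np

def extract_vocab_and_embeddings(texts, vector_file, special_tokens=()):
    # 1. Collect the tokens that actually occur in the corpus
    corpus_tokens = {tok for text in texts for tok in str(text).lower().split()}

    # 2. Scan the pretrained vector file and keep vectors for in-corpus tokens
    vocab, vectors = [], []
    with open(vector_file, encoding='utf-8', errors='ignore') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) < 3:           # skip a fastText-style header line, if present
                continue
            word = parts[0]
            if word in corpus_tokens:
                vocab.append(word)
                vectors.append(np.asarray(parts[1:], dtype='float32'))

    # 3. Prepend special tokens with randomly initialized vectors (assumption)
    dim = vectors[0].shape[0] if vectors else 300
    special_vectors = [np.random.normal(scale=0.1, size=dim).astype('float32')
                       for _ in special_tokens]
    vocab = list(special_tokens) + vocab
    matrix = np.vstack(special_vectors + vectors) if vocab else np.zeros((0, dim), dtype='float32')
    return vocab, matrix

What the real process writes to output_dir (file names and formats) is up to the library; this sketch only returns the vocabulary and matrix in memory.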
Example
# Paths to the pretrained vector file and the input CSV (adjust for your setup)
vector_file = '../input/fasttext-crawl-300d-2m-with-subword/crawl-300d-2m-subword/crawl-300d-2M-subword.vec'
input_file = '../input/complete-tweet-sentiment-extraction-data/tweet_dataset.csv'
column_name = 'text'
extractor = VocabEmbeddingExtractor(vector_file, input_file, column_name)

# Extract the vocabulary and embeddings and save them to the current directory
output_dir = '.'
special_tokens = ['[UNK]', '[SEP]']
extractor.process(output_dir, special_tokens)