zoo.feature.text package

Submodules

zoo.feature.text.text_feature module

class zoo.feature.text.text_feature.TextFeature(text=None, label=None, uri=None, jvalue=None, bigdl_type='float')[source]

Bases: bigdl.util.common.JavaValue

Each TextFeature keeps the information of a single text record. It can hold the various states (if any) of a text, e.g. the original text content, uri, category label, tokens, index representation of tokens, BigDL Sample representation, prediction result and so on.

get_label()[source]

Get the label of the TextFeature. If no label is stored, -1 will be returned.

Returns:Int
get_sample()[source]

Get the Sample representation of the TextFeature. If the TextFeature hasn’t been transformed to Sample, None will be returned.

Returns:BigDL Sample
get_text()[source]

Get the text content of the TextFeature.

Returns:String
get_tokens()[source]

Get the tokens of the TextFeature. If text hasn’t been segmented, None will be returned.

Returns:List of String
get_uri()[source]

Get the identifier of the TextFeature. If no id is stored, None will be returned.

Returns:String
has_label()[source]

Whether the TextFeature contains a label.

Returns:Boolean
keys()[source]

Get the keys that the TextFeature contains.

Returns:List of String
set_label(label)[source]

Set the label for the TextFeature.

Parameters:label – Int
Returns:The TextFeature with label.

zoo.feature.text.text_set module

class zoo.feature.text.text_set.DistributedTextSet(texts=None, labels=None, jvalue=None, bigdl_type='float')[source]

Bases: zoo.feature.text.text_set.TextSet

DistributedTextSet is comprised of RDDs.

class zoo.feature.text.text_set.LocalTextSet(texts=None, labels=None, jvalue=None, bigdl_type='float')[source]

Bases: zoo.feature.text.text_set.TextSet

LocalTextSet is comprised of lists.

class zoo.feature.text.text_set.TextSet(jvalue, bigdl_type='float', *args)[source]

Bases: bigdl.util.common.JavaValue

TextSet wraps a set of texts with status.

classmethod from_relation_lists(relations, corpus1, corpus2, bigdl_type='float')[source]

Used to generate a TextSet for ranking.

This method does the following:

1. For each id1 in relations, find the list of id2 with the corresponding label that comes together with id1. In other words, group relations by id1.
2. Join with corpus to transform each id to indexedTokens. Note: Make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each list, generate a TextFeature having Sample with:
   - feature of shape (list_length, text1_length + text2_length).
   - label of shape (list_length, 1).

Parameters:
  • relations – List or RDD of Relation.
  • corpus1 – TextSet that contains all id1 in relations. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
  • corpus2 – TextSet that contains all id2 in relations. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.

Note that if relations is a list, then corpus1 and corpus2 must both be LocalTextSet. If relations is an RDD, then corpus1 and corpus2 must both be DistributedTextSet.

Returns:TextSet.
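
The grouping and the resulting shapes described for from_relation_lists can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: relations are modelled as hypothetical (id1, id2, label) tuples, each corpus as a dict from id to a fixed-length index list, and the two index lists are assumed to be concatenated to form each feature row.

```python
from collections import defaultdict

# Hypothetical stand-ins: relations as (id1, id2, label) tuples,
# corpora as dicts mapping an id to its fixed-length index list.
relations = [("q1", "a1", 1), ("q1", "a2", 0), ("q2", "a1", 0)]
corpus1 = {"q1": [1, 2, 3], "q2": [4, 5, 6]}   # text1_length = 3
corpus2 = {"a1": [7, 8], "a2": [9, 10]}        # text2_length = 2


def relation_lists(relations, corpus1, corpus2):
    # Step 1: group relations by id1.
    grouped = defaultdict(list)
    for id1, id2, label in relations:
        grouped[id1].append((id2, label))
    # Steps 2 and 3: join with the corpora and build one
    # (feature, label) pair per id1.
    result = {}
    for id1, candidates in grouped.items():
        # feature has shape (list_length, text1_length + text2_length)
        feature = [corpus1[id1] + corpus2[id2] for id2, _ in candidates]
        # label has shape (list_length, 1)
        label = [[lab] for _, lab in candidates]
        result[id1] = (feature, label)
    return result


lists = relation_lists(relations, corpus1, corpus2)
```

Here q1 has two candidates, so its feature has two rows of length 5 and its label has two rows of length 1.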
classmethod from_relation_pairs(relations, corpus1, corpus2, bigdl_type='float')[source]

Used to generate a TextSet for pairwise training.

This method does the following:

1. Generate all RelationPairs: (id1, id2Positive, id2Negative) from Relations.
2. Join RelationPairs with corpus to transform id to indexedTokens. Note: Make sure that the corpus has been transformed by SequenceShaper and WordIndexer.
3. For each pair, generate a TextFeature having Sample with:
   - feature of shape (2, text1Length + text2Length).
   - label of value [1 0] as the positive relation is placed before the negative one.

Parameters:
  • relations – List or RDD of Relation.
  • corpus1 – TextSet that contains all id1 in relations. For each TextFeature in corpus1, text must have been transformed to indexedTokens of the same length.
  • corpus2 – TextSet that contains all id2 in relations. For each TextFeature in corpus2, text must have been transformed to indexedTokens of the same length.

Note that if relations is a list, then corpus1 and corpus2 must both be LocalTextSet. If relations is an RDD, then corpus1 and corpus2 must both be DistributedTextSet.

Returns:TextSet.
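
The pair generation described for from_relation_pairs can be sketched in plain Python. This is a rough illustration, not the library's code: relations are hypothetical (id1, id2, label) tuples, a label greater than 0 is assumed to mark a positive relation, and the corpora are plain dicts of index lists.

```python
from collections import defaultdict

# Hypothetical stand-ins for Relation objects and indexed corpora.
relations = [("q1", "a1", 1), ("q1", "a2", 0), ("q1", "a3", 0)]
corpus1 = {"q1": [1, 2, 3]}
corpus2 = {"a1": [7, 8], "a2": [9, 10], "a3": [11, 12]}


def relation_pairs(relations):
    # Step 1: for every id1, pair each positive id2 with each negative id2.
    positives, negatives = defaultdict(list), defaultdict(list)
    for id1, id2, label in relations:
        (positives if label > 0 else negatives)[id1].append(id2)
    return [(id1, pos, neg)
            for id1 in positives
            for pos in positives[id1]
            for neg in negatives[id1]]


def pair_to_sample(pair, corpus1, corpus2):
    # Steps 2 and 3: the positive row comes first, hence label [1, 0].
    id1, pos, neg = pair
    feature = [corpus1[id1] + corpus2[pos], corpus1[id1] + corpus2[neg]]
    return feature, [1, 0]


pairs = relation_pairs(relations)
samples = [pair_to_sample(p, corpus1, corpus2) for p in pairs]
```

One positive and two negatives for q1 yield two pairs, each with a feature of two rows.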
generate_sample()[source]

Generate BigDL Sample. Need to word2idx first. See TextFeatureToSample for more details.

Returns:TextSet with Samples.
generate_word_index_map(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)[source]

Generate word_index map based on sorted word frequencies in descending order. Return the result dictionary, which can also be retrieved by ‘get_word_index()’. Make sure you call this after tokenize. Otherwise you will get an error. See word2idx for more details.

Returns:Dictionary {word: id}
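
As a rough illustration of how the arguments could interact, the following plain-Python sketch builds a word-to-index dictionary from tokenized texts. It is not the library's implementation; the function name build_word_index, the alphabetical tie-breaking, and the order in which the filters are applied are assumptions of this sketch.

```python
from collections import Counter


def build_word_index(token_lists, remove_topN=0, max_words_num=-1,
                     min_freq=1, existing_map=None):
    # Count how often each word occurs across all tokenized texts.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency; ties are broken alphabetically here,
    # which is an assumption of this sketch.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only words that occur at least min_freq times.
    ranked = [(w, c) for w, c in ranked if c >= min_freq]
    # Drop the most frequent words (treated as stopwords).
    ranked = ranked[remove_topN:]
    if max_words_num > 0:
        ranked = ranked[:max_words_num]
    # Preserve existing indices and continue numbering after them.
    # Index 0 is never assigned; it stays reserved for unknown words.
    word_index = dict(existing_map or {})
    next_id = max(word_index.values(), default=0) + 1
    for word, _ in ranked:
        if word not in word_index:
            word_index[word] = next_id
            next_id += 1
    return word_index


texts = [["the", "cat", "sat"], ["the", "cat"], ["the", "dog"]]
word_index = build_word_index(texts)
```

With the sample texts, "the" is the most frequent word and therefore receives index 1.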
get_labels()[source]

Get the labels of a TextSet (if any). If a text doesn’t have a label, its corresponding position will be -1.

Returns:List of int for LocalTextSet.

RDD of int for DistributedTextSet.

get_predicts()[source]

Get the prediction results (if any) combined with uris (if any) of a TextSet. If a text doesn’t have a uri, its corresponding uri will be None. If a text hasn’t been predicted by a model, its corresponding prediction will be None.

Returns:List of (uri, prediction as a list of numpy array) for LocalTextSet.

RDD of (uri, prediction as a list of numpy array) for DistributedTextSet.

get_samples()[source]

Get the BigDL Sample representations of a TextSet (if any). If a text hasn’t been transformed to Sample, its corresponding position will be None.

Returns:List of Sample for LocalTextSet.

RDD of Sample for DistributedTextSet.

get_texts()[source]

Get the text contents of a TextSet.

Returns:List of String for LocalTextSet.

RDD of String for DistributedTextSet.

get_uris()[source]

Get the identifiers of a TextSet. If a text doesn’t have a uri, its corresponding position will be None.

Returns:List of String for LocalTextSet.

RDD of String for DistributedTextSet.

get_word_index()[source]

Get the word_index dictionary of the TextSet. If the TextSet hasn’t been transformed from word to index, None will be returned.

Returns:Dictionary {word: id}
is_distributed()[source]

Whether it is a DistributedTextSet.

Returns:Boolean
is_local()[source]

Whether it is a LocalTextSet.

Returns:Boolean
load_word_index(path)[source]

Load the word_index map which was saved after the training, so that this TextSet can directly use this word_index during inference. Each separate line should be “word id”.

Note that after calling load_word_index, you do not need to specify any argument when calling word2idx in the preprocessing pipeline as now you are using exactly the loaded word_index for transformation.

For LocalTextSet, load txt from a local file system. For DistributedTextSet, load txt from a local or distributed file system (such as HDFS).

Returns:TextSet with the loaded word_index.
normalize()[source]

Do normalization on tokens. Need to tokenize first. See Normalizer for more details.

Returns:TextSet after normalization.
random_split(weights)[source]

Randomly split into list of TextSet with provided weights. Only available for DistributedTextSet for now.

Parameters:weights – List of float indicating the split portions.
classmethod read(path, sc=None, min_partitions=1, bigdl_type='float')[source]

Read text files with labels from a directory. The folder structure is expected to be the following:

path
  |dir1 - text1, text2, …
  |dir2 - text1, text2, …
  |dir3 - text1, text2, …

Under the target path, there ought to be N subdirectories (dir1 to dirN). Each subdirectory represents a category and contains all texts that belong to that category. Each category will be given a label according to its position among all subdirectories sorted in ascending order. Each text will be given the label of the subdirectory where it is located. Labels start from 0.

Parameters:
  • path – The folder path to texts. Local or distributed file systems (such as HDFS) are supported. If you want to read from a distributed file system, sc needs to be specified.
  • sc – An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.
  • min_partitions – Int. A suggested value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.

Returns:TextSet.
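
The labelling rule (subdirectories sorted in ascending order, labels starting from 0) can be sketched with the standard library alone. The helper read_labeled_texts below is hypothetical and only mimics the described behaviour on a local file system; it does not call the library.

```python
import os
import tempfile


def read_labeled_texts(path):
    # Subdirectories sorted in ascending order define the labels 0..N-1.
    categories = sorted(
        d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d)))
    records = []
    for label, category in enumerate(categories):
        category_dir = os.path.join(path, category)
        for name in sorted(os.listdir(category_dir)):
            with open(os.path.join(category_dir, name), encoding="utf-8") as f:
                records.append((f.read(), label))
    return records


# Build a tiny throwaway directory tree to demonstrate the labelling.
with tempfile.TemporaryDirectory() as root:
    for category, text in [("sports", "goal!"), ("politics", "vote")]:
        os.makedirs(os.path.join(root, category))
        file_path = os.path.join(root, category, "text1.txt")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(text)
    records = read_labeled_texts(root)
```

Because "politics" sorts before "sports", its text gets label 0 and the sports text gets label 1.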
classmethod read_csv(path, sc=None, min_partitions=1, bigdl_type='float')[source]

Read texts with id from csv file. Each record is supposed to contain the following two fields in order: id(string) and text(string). Note that the csv file should be without header.

Parameters:
  • path – The path to the csv file. Local or distributed file systems (such as HDFS) are supported. If you want to read from a distributed file system, sc needs to be specified.
  • sc – An instance of SparkContext. If specified, texts will be read as a DistributedTextSet. Default is None and in this case texts will be read as a LocalTextSet.
  • min_partitions – Int. A suggested value of the minimal partition number for input texts. Only need to specify this when sc is not None. Default is 1.

Returns:TextSet.
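
The expected csv layout (no header, two fields per record: id, then text) can be illustrated with Python's csv module. The helper read_id_text_csv and the sample content are made up for this sketch and are not part of the library.

```python
import csv
import io

# A header-less csv with two fields per record: id and text.
csv_content = 'doc1,"hello world"\ndoc2,"a second, longer text"\n'


def read_id_text_csv(stream):
    # Each row is expected to hold exactly two fields: id, then text.
    return [(row[0], row[1]) for row in csv.reader(stream)]


records = read_id_text_csv(io.StringIO(csv_content))
```

Quoting the text field keeps embedded commas from being read as extra fields.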
classmethod read_parquet(path, sc, bigdl_type='float')[source]

Read texts with id from parquet file. Schema should be the following: “id”(string) and “text”(string).

Parameters:
  • path – The path to the parquet file.
  • sc – An instance of SparkContext.
Returns:DistributedTextSet.

save_word_index(path)[source]

Save the word_index dictionary to text file, which can be used for future inference. Each separate line will be “word id”.

For LocalTextSet, save txt to a local file system. For DistributedTextSet, save txt to a local or distributed file system (such as HDFS).

Parameters:path – The path to the text file.
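
The “word id” line format can be sketched as a local round trip in plain Python. The helpers write_word_index and read_word_index are hypothetical stand-ins, and the single-space separator is an assumption based on the description above.

```python
import os
import tempfile


def write_word_index(word_index, path):
    # One "word id" pair per line, separated by a single space (assumed).
    with open(path, "w", encoding="utf-8") as f:
        for word, idx in word_index.items():
            f.write(f"{word} {idx}\n")


def read_word_index(path):
    word_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Split on the last space so the id is always the final field.
            word, idx = line.rsplit(" ", 1)
            word_index[word] = int(idx)
    return word_index


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_index.txt")
    write_word_index({"it": 1, "me": 2}, path)
    restored = read_word_index(path)
```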
set_word_index(vocab)[source]

Assign a word_index dictionary for this TextSet to use during word2idx. If you load the word_index from the saved file, you are recommended to use load_word_index directly.

Returns:TextSet with the word_index set.
shape_sequence(len, trunc_mode='pre', pad_element=0)[source]

Shape the sequence of indices to a fixed length. Need to word2idx first. See SequenceShaper for more details.

Returns:TextSet after sequence shaping.
to_distributed(sc=None, partition_num=4)[source]

Convert to a DistributedTextSet.

Need to specify SparkContext to convert a LocalTextSet to a DistributedTextSet. In this case, you may also want to specify partition_num, the default of which is 4.

Returns:DistributedTextSet
to_local()[source]

Convert to a LocalTextSet.

Returns:LocalTextSet
tokenize()[source]

Do tokenization on original text. See Tokenizer for more details.

Returns:TextSet after tokenization.
transform(transformer)[source]
word2idx(remove_topN=0, max_words_num=-1, min_freq=1, existing_map=None)[source]

Map word tokens to indices. Important: Take care that this method behaves a bit differently for training and inference.

Training

During training, you need to generate a new word_index dictionary according to the texts you are dealing with. Thus this method will first generate the dictionary and then convert words to indices based on the generated dictionary.

You can specify the following arguments, which pose some constraints when generating the dictionary. In the resulting dictionary, indices start from 1 and are assigned according to the occurrence frequency of each word, sorted in descending order. Here we adopt the convention that index 0 is reserved for unknown words. After word2idx, you can get the generated word_index dictionary by calling ‘get_word_index’. Also, you can call save_word_index to save this word_index dictionary to be used in future training.

Parameters:
  • remove_topN – Non-negative int. Remove the topN words with highest frequencies in the case where those are treated as stopwords. Default is 0, namely remove nothing.
  • max_words_num – Int. The maximum number of words to be taken into consideration. Default is -1, namely all words will be considered. Otherwise, it should be a positive int.
  • min_freq – Positive int. Only those words with frequency >= min_freq will be taken into consideration. Default is 1, namely all words that occur will be considered.
  • existing_map – Existing dictionary of word_index if any. Default is None and in this case a new dictionary with index starting from 1 will be generated. If not None, then the generated dictionary will preserve the word_index in existing_map and assign subsequent indices to new words.

Inference

During inference, you are supposed to use exactly the same word_index dictionary as in the training stage instead of generating a new one. Thus please be aware that you do not need to specify any of the above arguments. You need to call load_word_index or set_word_index beforehand to load the dictionary.

Need to tokenize first. See WordIndexer for more details.

Returns:TextSet after word2idx.

zoo.feature.text.transformer module

class zoo.feature.text.transformer.Normalizer(bigdl_type='float')[source]

Bases: zoo.feature.text.transformer.TextTransformer

Removes all dirty characters (those not in the English alphabet) from tokens and converts words to lower case. Need to tokenize first. Original tokens will be replaced by normalized tokens.

>>> normalizer = Normalizer()
creating: createNormalizer
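
As a minimal sketch of the described normalization, assuming that "dirty characters" means anything outside a-z and A-Z, the behaviour can be approximated with a regular expression. The function normalize_tokens is hypothetical, and whether the library keeps tokens that become empty is not stated here, so this sketch simply leaves them in place.

```python
import re


def normalize_tokens(tokens):
    # Drop every character outside a-z/A-Z, then lower-case the remainder.
    return [re.sub(r"[^a-zA-Z]", "", token).lower() for token in tokens]


normalized = normalize_tokens(["Hello,", "World!", "it's", "2019"])
```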
class zoo.feature.text.transformer.SequenceShaper(len, trunc_mode='pre', pad_element=0, bigdl_type='float')[source]

Bases: zoo.feature.text.transformer.TextTransformer

Shape the sequence of indices to a fixed length. If the original sequence is longer than the target length, it will be truncated from the beginning or the end. If the original sequence is shorter than the target length, it will be padded at the end. Need to word2idx first. The original indices sequence will be replaced by the shaped sequence.

# Arguments
len: Positive int. The target length.
trunc_mode: Truncation mode. String. Either ‘pre’ or ‘post’. Default is ‘pre’. If ‘pre’, the sequence will be truncated from the beginning. If ‘post’, the sequence will be truncated from the end.
pad_element: Int. The element to be padded to the sequence if the original length is smaller than the target length. Default is 0 with the convention that we reserve index 0 for unknown words.

>>> sequence_shaper = SequenceShaper(len=6, trunc_mode="post", pad_element=10000)
creating: createSequenceShaper
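
The truncation and padding rules can be sketched in plain Python. The function below is an illustration of the documented behaviour, not the library's code; its name and the parameter name length are choices made for this sketch.

```python
def shape_sequence(indices, length, trunc_mode="pre", pad_element=0):
    if trunc_mode not in ("pre", "post"):
        raise ValueError("trunc_mode must be 'pre' or 'post'")
    if len(indices) > length:
        # 'pre' drops elements from the beginning, 'post' from the end.
        return indices[-length:] if trunc_mode == "pre" else indices[:length]
    # Shorter sequences are padded at the end.
    return indices + [pad_element] * (length - len(indices))


truncated_pre = shape_sequence([1, 2, 3, 4, 5], 3)
truncated_post = shape_sequence([1, 2, 3, 4, 5], 3, trunc_mode="post")
padded = shape_sequence([1, 2], 4, pad_element=10000)
```

With ‘pre’ the last three indices survive, with ‘post’ the first three, and the short sequence gains two pad elements at its end.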

class zoo.feature.text.transformer.TextFeatureToSample(bigdl_type='float')[source]

Bases: zoo.feature.text.transformer.TextTransformer

Transform indexedTokens and label (if any) of a TextFeature to a BigDL Sample. Need to word2idx first.

>>> to_sample = TextFeatureToSample()
creating: createTextFeatureToSample
class zoo.feature.text.transformer.TextTransformer(bigdl_type='float', *args)[source]

Bases: zoo.feature.common.Preprocessing

Base class of Transformers that transform TextFeature.

transform(text_feature)[source]

Transform a TextFeature.

class zoo.feature.text.transformer.Tokenizer(bigdl_type='float')[source]

Bases: zoo.feature.text.transformer.TextTransformer

Transform text to array of string tokens.

>>> tokenizer = Tokenizer()
creating: createTokenizer
class zoo.feature.text.transformer.WordIndexer(map, bigdl_type='float')[source]

Bases: zoo.feature.text.transformer.TextTransformer

Given a wordIndex map, transform tokens to corresponding indices. Words that are not in the map will be discarded. Need to tokenize first.

# Arguments
map: Dict with word (string) as its key and index (int) as its value.

>>> word_indexer = WordIndexer(map={"it": 1, "me": 2})
creating: createWordIndexer
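
A minimal sketch of the token-to-index mapping, assuming that tokens absent from the map are simply dropped from the output. The helper index_tokens is hypothetical and does not call the library.

```python
def index_tokens(tokens, word_index):
    # Tokens missing from the map are dropped rather than mapped to 0.
    return [word_index[tok] for tok in tokens if tok in word_index]


indexed = index_tokens(["it", "is", "me"], {"it": 1, "me": 2})
```

Under this assumption "is" has no entry in the map, so only the indices for "it" and "me" remain.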

Module contents