zoo.common package
Submodules
zoo.common.nncontext module
-
class zoo.common.nncontext.ZooContextMeta
Bases: type
-
log_output
Whether to redirect the Spark driver JVM's stdout and stderr to the current Python process. This is useful when running Analytics Zoo in a Jupyter notebook. Defaults to False. Must be set before the SparkContext is initialized.
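Because ZooContextMeta derives from type, log_output behaves as a class-level setting rather than an instance attribute. The following is a minimal sketch of how a metaclass can expose such a setting through a property. ContextMeta and Context are hypothetical stand-ins, and the type check is an assumption for illustration, not the library's actual implementation.

```python
class ContextMeta(type):
    """Hypothetical stand-in for a metaclass such as ZooContextMeta."""

    _log_output = False  # default, mirroring the documented default

    @property
    def log_output(cls):
        # Read the flag from the class that uses this metaclass.
        return cls._log_output

    @log_output.setter
    def log_output(cls, value):
        # Illustrative validation only (an assumption, not library behavior).
        if not isinstance(value, bool):
            raise TypeError("log_output must be a bool")
        cls._log_output = value


class Context(metaclass=ContextMeta):
    """Hypothetical class whose settings live on the class itself."""


# The flag is set on the class, before any context would be created.
Context.log_output = True
```

With this pattern the assignment goes through the metaclass property, so no instance of the class is needed to change the setting.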
-
-
zoo.common.nncontext.getOrCreateSparkContext(conf=None, appName=None)
Get the current active SparkContext or create a new one.
Parameters: - conf – An instance of SparkConf. If not specified, a new SparkConf with Analytics Zoo and BigDL configurations is created and used.
- appName – The name of the application, if any.
Returns: An instance of SparkContext.
-
zoo.common.nncontext.get_optimizer_version(bigdl_type='float')
Get the DistriOptimizer version.
Returns: The optimizer version (optimizerVersion).
-
zoo.common.nncontext.init_nncontext(conf=None, spark_log_level='WARN', redirect_spark_log=True)
Create or get a SparkContext with configurations optimized for BigDL performance. This method also initializes the BigDL engine.
Note: If you use spark-shell or a Jupyter notebook, the SparkContext is created before your code runs, so you must set the Spark configurations through command-line options or the properties file before calling this method. In this case, we recommend the launch scripts provided at https://github.com/intel-analytics/analytics-zoo/tree/master/scripts.
Parameters: - conf – An instance of SparkConf. If not specified, a new SparkConf with Analytics Zoo and BigDL configurations is created and used. You can also pass a string here to set the name of the application.
- spark_log_level – The log level for Spark. Defaults to 'WARN'.
- redirect_spark_log – Whether to redirect the Spark log to a local file. Defaults to True.
Returns: An instance of SparkContext.
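As a minimal sketch of calling init_nncontext with an application name passed through conf, consider the wrapper below. The names start_zoo_context, normalize_log_level and VALID_LOG_LEVELS are hypothetical helpers introduced for illustration. The level check is an assumption on our side and not something the function is documented to require. The import is placed inside the function so the snippet loads even where analytics-zoo is not installed.

```python
# Levels accepted by SparkContext.setLogLevel in PySpark.
VALID_LOG_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def normalize_log_level(level="WARN"):
    """Return the upper-cased level, or raise ValueError if it is unknown."""
    upper = str(level).upper()
    if upper not in VALID_LOG_LEVELS:
        raise ValueError("unknown Spark log level: %r" % (level,))
    return upper


def start_zoo_context(app_name="zoo-example", log_level="WARN"):
    """Create or get a SparkContext via init_nncontext (needs analytics-zoo)."""
    # Imported lazily so this module can be loaded without analytics-zoo.
    from zoo.common.nncontext import init_nncontext

    # Per the parameter description, conf may be a string naming the app.
    return init_nncontext(conf=app_name,
                          spark_log_level=normalize_log_level(log_level),
                          redirect_spark_log=True)
```

Calling start_zoo_context() in an environment with analytics-zoo installed would return the SparkContext. Remember that in spark-shell or Jupyter the Spark configurations have to be in place before this call.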
-
zoo.common.nncontext.init_spark_on_k8s(master, container_image, num_executors, executor_cores, executor_memory='2g', driver_memory='1g', driver_cores=4, extra_executor_memory_for_ray=None, extra_python_lib=None, spark_log_level='WARN', redirect_spark_log=True, jars=None, conf=None, python_location=None)
Create a SparkContext with Analytics Zoo configurations on a Kubernetes cluster in k8s client mode. We recommend the Docker image intelanalytics/hyperzoo:latest. To build your own Docker image, refer to https://github.com/intel-analytics/analytics-zoo/tree/master/docker/hyperzoo.
Parameters: - master – The master address of your k8s cluster.
- container_image – The name of the Docker container image for Spark executors, for example intelanalytics/hyperzoo:latest.
- num_executors – The number of Spark executors.
- executor_cores – The number of cores for each executor.
- executor_memory – The memory for each executor. Defaults to '2g'.
- driver_cores – The number of cores for the Spark driver. Defaults to 4.
- driver_memory – The memory for the Spark driver. Defaults to '1g'.
- extra_executor_memory_for_ray – The extra memory for Ray services. Defaults to None.
- extra_python_lib – Extra Python files or packages needed for distribution. Defaults to None.
- spark_log_level – The log level for Spark. Defaults to 'WARN'.
- redirect_spark_log – Whether to redirect the Spark log to a local file. Defaults to True.
- jars – Comma-separated list of jars to include on the driver and executor classpaths. Defaults to None.
- conf – Extra Spark configurations in key-value format, e.g. conf={"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails"}. Defaults to None.
- python_location – The path to your running Python executable. If not specified, the default Python interpreter in effect is used.
Returns: An instance of SparkContext.
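The parameters above can be collected in one place before the call. This is a sketch only: k8s_context_kwargs and start_k8s_context are hypothetical helper names, the executor counts are arbitrary illustrative values, and the master address must be supplied by the caller. The image name and the conf entry are the ones quoted in the parameter descriptions.

```python
def k8s_context_kwargs(master, num_executors=2, executor_cores=4):
    """Build keyword arguments for init_spark_on_k8s (illustrative values)."""
    return {
        "master": master,
        # Image recommended in the description above.
        "container_image": "intelanalytics/hyperzoo:latest",
        "num_executors": num_executors,
        "executor_cores": executor_cores,
        "executor_memory": "2g",   # documented default
        "driver_memory": "1g",     # documented default
        # Extra Spark conf in key-value form, as in the documented example.
        "conf": {"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails"},
    }


def start_k8s_context(master):
    """Create the SparkContext on Kubernetes (needs analytics-zoo)."""
    # Imported lazily so this module can be loaded without analytics-zoo.
    from zoo.common.nncontext import init_spark_on_k8s

    return init_spark_on_k8s(**k8s_context_kwargs(master))
```

Keeping the arguments in a plain dictionary makes them easy to log or review before a context is actually created on the cluster.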
-
zoo.common.nncontext.init_spark_on_local(cores=2, conf=None, python_location=None, spark_log_level='WARN', redirect_spark_log=True)
Create a SparkContext with Analytics Zoo configurations on the local machine.
Parameters: - cores – The number of cores for Spark local. Defaults to 2. You can also set it to "*" to use all available cores, e.g. init_spark_on_local(cores="*").
- conf – Extra Spark configurations in key-value format, e.g. conf={"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails"}. Defaults to None.
- python_location – The path to your running Python executable. If not specified, the default Python interpreter in effect is used.
- spark_log_level – The log level for Spark. Defaults to 'WARN'.
- redirect_spark_log – Whether to redirect the Spark log to a local file. Defaults to True.
Returns: An instance of SparkContext.
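Since cores accepts either a number or the string "*", a small check before the call can catch mistakes early. The sketch below assumes that only positive integers and "*" are meaningful values, which is our reading of the description rather than a documented rule. check_local_cores and start_local_context are hypothetical names.

```python
def check_local_cores(cores=2):
    """Return cores unchanged if it is "*" or a positive int, else raise."""
    if cores == "*":
        return cores
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(cores, int) and not isinstance(cores, bool) and cores > 0:
        return cores
    raise ValueError('cores must be a positive int or "*", got %r' % (cores,))


def start_local_context(cores="*", log_level="WARN"):
    """Create a local SparkContext (needs analytics-zoo)."""
    # Imported lazily so this module can be loaded without analytics-zoo.
    from zoo.common.nncontext import init_spark_on_local

    return init_spark_on_local(cores=check_local_cores(cores),
                               spark_log_level=log_level)
```

For example, start_local_context() would request all available cores, while start_local_context(cores=4) would request four.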
-
zoo.common.nncontext.init_spark_on_yarn(hadoop_conf, conda_name, num_executors, executor_cores, executor_memory='2g', driver_cores=4, driver_memory='1g', extra_executor_memory_for_ray=None, extra_python_lib=None, penv_archive=None, additional_archive=None, hadoop_user_name='root', spark_yarn_archive=None, spark_log_level='WARN', redirect_spark_log=True, jars=None, conf=None)
Create a SparkContext with Analytics Zoo configurations on a YARN cluster in yarn-client mode. You only need to create a conda environment and install the Python dependencies in that environment beforehand on the driver machine. These dependencies are automatically packaged and distributed to the whole YARN cluster.
Parameters: - hadoop_conf – The path to the YARN configuration folder.
- conda_name – The name of the conda environment.
- num_executors – The number of Spark executors.
- executor_cores – The number of cores for each executor.
- executor_memory – The memory for each executor. Defaults to '2g'.
- driver_cores – The number of cores for the Spark driver. Defaults to 4.
- driver_memory – The memory for the Spark driver. Defaults to '1g'.
- extra_executor_memory_for_ray – The extra memory for Ray services. Defaults to None.
- extra_python_lib – Extra Python files or packages needed for distribution. Defaults to None.
- penv_archive – Ideally, the program packs the conda environment specified by conda_name automatically, but you can also pass the path to an already packed file in tar.gz format here. Defaults to None.
- additional_archive – Comma-separated list of additional archives to be uploaded and unpacked on executors. Defaults to None.
- hadoop_user_name – The user name for running the YARN cluster. Defaults to 'root'.
- spark_yarn_archive – The value to set for spark.yarn.archive. Defaults to None.
- spark_log_level – The log level for Spark. Defaults to 'WARN'.
- redirect_spark_log – Whether to redirect the Spark log to a local file. Defaults to True.
- jars – Comma-separated list of jars to include on the driver and executor classpaths. Defaults to None.
- conf – Extra Spark configurations in key-value format, e.g. conf={"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails"}. Defaults to None.
Returns: An instance of SparkContext.
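Several of these parameters take comma-separated strings, and penv_archive expects a tar.gz file. The sketch below shows one way to assemble the arguments from Python lists. yarn_context_kwargs and start_yarn_context are hypothetical helpers, the executor counts are arbitrary, and the file-extension check is an illustrative assumption rather than documented behavior.

```python
def yarn_context_kwargs(hadoop_conf, conda_name, jars=(), archives=(),
                        penv_archive=None):
    """Build keyword arguments for init_spark_on_yarn (illustrative values)."""
    if penv_archive is not None and not penv_archive.endswith(".tar.gz"):
        raise ValueError("penv_archive should be a tar.gz file")
    kwargs = {
        "hadoop_conf": hadoop_conf,
        "conda_name": conda_name,
        "num_executors": 2,
        "executor_cores": 4,
    }
    # Optional arguments are only passed when they are actually provided.
    if jars:
        kwargs["jars"] = ",".join(jars)
    if archives:
        kwargs["additional_archive"] = ",".join(archives)
    if penv_archive is not None:
        kwargs["penv_archive"] = penv_archive
    return kwargs


def start_yarn_context(hadoop_conf, conda_name, **options):
    """Create the SparkContext on YARN (needs analytics-zoo)."""
    # Imported lazily so this module can be loaded without analytics-zoo.
    from zoo.common.nncontext import init_spark_on_yarn

    return init_spark_on_yarn(**yarn_context_kwargs(hadoop_conf, conda_name,
                                                    **options))
```

Leaving penv_archive unset keeps the default behavior, in which the conda environment named by conda_name is packed automatically.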
-
zoo.common.nncontext.init_spark_standalone(num_executors, executor_cores, executor_memory='2g', driver_cores=4, driver_memory='1g', master=None, extra_executor_memory_for_ray=None, extra_python_lib=None, spark_log_level='WARN', redirect_spark_log=True, conf=None, jars=None, python_location=None, enable_numa_binding=False)
Create a SparkContext with Analytics Zoo configurations on a Spark standalone cluster.
You need to specify master if you already have a Spark standalone cluster. For a standalone cluster with multiple nodes, make sure that analytics-zoo is installed via pip in the Python environment on every node. If master is not specified, a new Spark standalone cluster is first started on the current single node, and the SparkContext uses its master address. You need to call stop_spark_standalone after your program finishes to shut down the cluster.
Parameters: - num_executors – The number of Spark executors.
- executor_cores – The number of cores for each executor.
- executor_memory – The memory for each executor. Defaults to '2g'.
- driver_cores – The number of cores for the Spark driver. Defaults to 4.
- driver_memory – The memory for the Spark driver. Defaults to '1g'.
- master – The master URL of an existing Spark standalone cluster: 'spark://master:port'. You only need to specify this if you have already started a standalone cluster. Defaults to None, in which case a new standalone cluster is started.
- extra_executor_memory_for_ray – The extra memory for Ray services. Defaults to None.
- extra_python_lib – Extra Python files or packages needed for distribution. Defaults to None.
- spark_log_level – The log level for Spark. Defaults to 'WARN'.
- redirect_spark_log – Whether to redirect the Spark log to a local file. Defaults to True.
- jars – Comma-separated list of jars to include on the driver and executor classpaths. Defaults to None.
- conf – Extra Spark configurations in key-value format, e.g. conf={"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails"}. Defaults to None.
- python_location – The path to your running Python executable. If not specified, the default Python interpreter in effect is used.
- enable_numa_binding – Whether to use numactl to start the Spark workers so that different worker processes are bound to different CPUs and memory areas. This may lead to better performance on a multi-socket machine. Defaults to False.
Returns: An instance of SparkContext.
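Because the cluster has to be shut down with stop_spark_standalone once the program finishes, it can help to pair the two calls so the shutdown also happens when an error occurs. The following sketch uses a context manager for that. standalone_session is a hypothetical helper, and it takes the start and stop functions as arguments so that it makes no assumption about where they are imported from.

```python
from contextlib import contextmanager


@contextmanager
def standalone_session(init_fn, stop_fn, **init_kwargs):
    """Start a context with init_fn and always call stop_fn afterwards.

    init_fn is expected to be init_spark_standalone and stop_fn to be
    stop_spark_standalone; both are passed in by the caller.
    """
    sc = init_fn(**init_kwargs)
    try:
        yield sc
    finally:
        # Runs on normal exit and when the body raises.
        stop_fn()
```

A caller with analytics-zoo installed would pass the two library functions, for example `with standalone_session(init_spark_standalone, stop_spark_standalone, num_executors=2, executor_cores=4) as sc:`, and run the Spark job inside the block.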
zoo.common.utils module
-
class zoo.common.utils.JTensor(storage, shape, bigdl_type='float', indices=None)
Bases: bigdl.util.common.JTensor
-
class zoo.common.utils.Sample(features, labels, bigdl_type='float')
Bases: bigdl.util.common.Sample
-
classmethod from_ndarray(features, labels, bigdl_type='float')
Convert an ndarray of features and labels to a Sample, which is used on the Java side.
Parameters: - features – An ndarray or a list of ndarrays.
- labels – An ndarray, a list of ndarrays, or a scalar.
- bigdl_type – "double" or "float".
>>> import numpy as np
>>> from bigdl.util.common import callBigDlFunc
>>> from numpy.testing import assert_allclose
>>> np.random.seed(123)
>>> sample = Sample.from_ndarray(np.random.random((2,3)), np.random.random((2,3)))
>>> sample_back = callBigDlFunc("float", "testSample", sample)
>>> assert_allclose(sample.features[0].to_ndarray(), sample_back.features[0].to_ndarray())
>>> assert_allclose(sample.label.to_ndarray(), sample_back.label.to_ndarray())
>>> expected_feature_storage = np.array(([[0.69646919, 0.28613934, 0.22685145],
...                                       [0.55131477, 0.71946895, 0.42310646]]))
>>> expected_feature_shape = np.array([2, 3])
>>> expected_label_storage = np.array(([[0.98076421, 0.68482971, 0.48093191],
...                                     [0.39211753, 0.343178, 0.72904968]]))
>>> expected_label_shape = np.array([2, 3])
>>> assert_allclose(sample.features[0].storage, expected_feature_storage, rtol=1e-6, atol=1e-6)
>>> assert_allclose(sample.features[0].shape, expected_feature_shape)
>>> assert_allclose(sample.labels[0].storage, expected_label_storage, rtol=1e-6, atol=1e-6)
>>> assert_allclose(sample.labels[0].shape, expected_label_shape)
-
classmethod