Beta Spark 2.3.1-2.2.1-2-beta

Welcome to the documentation for DC/OS Apache Spark. For more information about new and changed features, see the release notes.

Apache Spark is a fast and general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. For more information, see the Apache Spark documentation.

DC/OS Apache Spark consists of Apache Spark with a few custom commits along with DC/OS-specific packaging.

DC/OS Apache Spark includes:

  • The Mesos cluster dispatcher
  • The Spark history server
  • The DC/OS Spark CLI
  • An interactive Spark shell

Benefits

  • Utilization: DC/OS Apache Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
  • Improved efficiency
  • Simple Management
  • Multi-team support
  • Interactive analytics through notebooks
  • UI integration
  • Security, including file- and environment-based secrets

Features

  • Multiversion support
  • Run multiple Spark dispatchers
  • Run against multiple HDFS clusters
  • Backports of scheduling improvements
  • Simple installation of all Spark components, including the dispatcher and the history server
  • Integration of the dispatcher and history server
  • Zeppelin integration
  • Kerberos and SSL support

Related Services

  • DC/OS HDFS

Install and Customize

Spark is available in the Universe and can be installed by using either the GUI or the DC/OS CLI.
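
For example, a default CLI installation, and a customized one driven by an options file you provide:

```bash
# Install DC/OS Apache Spark with default options.
dcos package install spark

# Or install with custom options from a JSON file you have written.
dcos package install spark --options=options.json
```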

Spark Quickstart

This tutorial will get you up and running in minutes with Spark. You will install the DC/OS Apache Spark service.

Usage Example

Perform a default installation by following the instructions in the Install and Customize section of this topic.

Integration with HDFS

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath: hdfs-site.xml, which provides default behaviors for the HDFS client, and core-site.xml, which sets the default filesystem name. You can specify the location of these files at install time or for each job.
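
As a sketch of the install-time route, assuming the package exposes an hdfs.config-url option (check your package's configuration schema) that points at a location serving both files:

```bash
# Write an options file whose URL serves hdfs-site.xml and core-site.xml.
cat <<'EOF' > options.json
{
  "hdfs": {
    "config-url": "http://mydomain.example/hdfs-config"
  }
}
EOF
dcos package install spark --options=options.json
```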

History Server

DC/OS Apache Spark includes the Spark history server. Because the history server requires HDFS, you must explicitly enable it.
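
A minimal sketch of enabling it at install time, assuming a history-server.enabled option in the package's configuration schema:

```bash
# Enable the history server when installing the service.
cat <<'EOF' > options.json
{
  "history-server": {
    "enabled": true
  }
}
EOF
dcos package install spark --options=options.json
```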

Security

This topic describes how to configure DC/OS service accounts for Spark.
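
On DC/OS Enterprise, provisioning a service account typically uses the security CLI; the account name and secret path below are illustrative:

```bash
# Create a keypair and register it as a service account.
dcos security org service-accounts keypair private-key.pem public-key.pem
dcos security org service-accounts create -p public-key.pem \
  -d "Spark service account" spark-principal

# Store the private key as a secret for the service to use.
dcos security secrets create-sa-secret private-key.pem spark-principal spark/sa
```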

Upgrade

Go to the Universe > Installed page of the DC/OS GUI. Hover over your Spark service to see the Uninstall button, then select it. Alternatively, enter the following from the DC/OS CLI:
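
Presumably the command referenced above is the standard package uninstall, shown here for a single instance with the default name; reinstalling from the Universe then completes the upgrade:

```bash
# Remove the currently installed Spark service.
dcos package uninstall spark

# After the uninstall completes, install the new version.
dcos package install spark
```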

Uninstall

The Spark dispatcher persists state in ZooKeeper, so to fully uninstall the Spark DC/OS package, you must go to http://<dcos-url>/exhibitor, click on Explorer, and delete the znode corresponding to your instance of Spark. By default this is spark_mesos_Dispatcher.

Runtime Configuration Change

You can customize DC/OS Apache Spark in-place when it is up and running.
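
One way to do this is through Marathon, which runs the dispatcher as an app; a sketch, assuming the default app ID "spark", that bumps the dispatcher's CPU allocation:

```bash
# Update the dispatcher's Marathon app definition in place.
dcos marathon app update spark cpus=2
```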

Run a Spark Job

Before submitting your job, upload the artifact (e.g., jar file) to a location visible to the cluster (e.g., HTTP, S3, or HDFS).
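
For example, submitting the SparkPi example bundled with the Mesosphere Spark distribution (replace the URL with your own artifact's location):

```bash
# Submit a job through the dispatcher via the DC/OS Spark CLI.
dcos spark run --submit-args="--class org.apache.spark.examples.SparkPi \
  https://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 30"
```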

Interactive Spark Shell

You can run Spark commands interactively in the Spark shell. The Spark shell is available in Scala, Python, and R.
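
A sketch of one way to start the Scala shell, by SSHing into the cluster and running it from the Mesosphere Spark image (the image tag is an assumption; match it to your installed release):

```bash
# SSH to the leading Mesos master.
dcos node ssh --master-proxy --leader

# Start the Scala shell inside the Spark image against the Mesos master.
docker run -it --net=host mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6 \
  /opt/spark/dist/bin/spark-shell --master mesos://leader.mesos:5050
```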

Custom Docker Images

Note: Customizing the Spark image Mesosphere provides is supported. However, customizations have the potential to adversely affect the integration between Spark and DC/OS. In situations where Mesosphere support suspects a customization may be adversely impacting Spark with DC/OS, Mesosphere support may request that the customer reproduce the issue with an unmodified Spark image.
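
If you do run a custom image, Spark on Mesos lets a job select it through standard configuration; the image name below is hypothetical:

```bash
# Point executors at a custom image when submitting a job.
dcos spark run --submit-args="\
  --conf spark.mesos.executor.docker.image=myrepo/my-spark:custom \
  --class org.apache.spark.examples.SparkPi \
  https://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 30"
```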

Fault Tolerance

Host, network, JVM, and application failures can all affect the behavior of three types of Spark components.

Job Scheduling

This document is a brief overview of material described in greater detail in the Apache Spark documentation.

Kerberos

Kerberos is an authentication system that allows Spark to retrieve and write data securely to a Kerberos-enabled HDFS cluster. As of Mesosphere Spark 2.2.0-2, long-running jobs will renew their delegation tokens (authentication credentials). This section assumes you have previously set up a Kerberos-enabled HDFS cluster. Note: Depending on your OS, Spark may need to be run as root in order to authenticate with your Kerberos-enabled service. You can do this by setting --conf spark.mesos.driverEnv.SPARK_USER=root when submitting your job.
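
For example, submitting a job with that setting (the class name and artifact URL are placeholders):

```bash
# Run the driver as root so it can authenticate against Kerberos.
dcos spark run --submit-args="\
  --conf spark.mesos.driverEnv.SPARK_USER=root \
  --class com.example.MyJob http://example.com/my-job.jar"
```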

Troubleshooting

The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may arise if the dispatcher does not receive the resource offers you expect from Mesos, or if driver submission is failing. To debug this class of issue, visit the Mesos UI at http://<dcos-url>/mesos/ and navigate to the sandbox for the dispatcher.
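
You can also pull logs from the CLI; a sketch, assuming the dispatcher task is named "spark" (the default service name):

```bash
# Tail the most recent dispatcher log lines.
dcos task log --lines=50 spark
```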

Version Policy

For new releases, we select the latest version from the Apache Spark stable release train. We support HDFS version 2.6 by default, and version 2.7 as well.

Limitations

Mesosphere does not provide support for Spark app development, such as writing a Python app to process data from Kafka or writing Scala code to process data from HDFS.

Release Notes

This is a beta release of the DC/OS Spark framework. It contains multiple improvements as well as new features that should be considered beta quality. Do not operate this version in production.