Как запустить spark 3.2.0 в Google dataproc?

# #apache-spark #pyspark #google-cloud-dataproc

#apache-spark #pyspark #google-cloud-dataproc

Вопрос:

В настоящее время в Google dataproc нет spark 3.2.0 в качестве изображения. Последняя доступная версия-3.1.2. Я хочу использовать функциональность pandas на pyspark, которую spark выпустила с 3.2.0.

Я делаю следующие шаги, чтобы использовать spark 3.2.0

  1. Создал локальную среду «pyspark» с pyspark 3.2.0 в ней
  2. Экспортировал среду yaml с conda env export gt; environment.yaml
  3. Создал кластер dataproc с этой средой.yaml. Кластер создается правильно, и среда доступна для мастера и всех работников
  4. Затем я изменяю переменные среды. export SPARK_HOME=/opt/conda/miniconda3/envs/pyspark/lib/python3.9/site-packages/pyspark (указать на pyspark 3.2.0), export SPARK_CONF_DIR=/usr/lib/spark/conf (использовать конфигурационный файл dataproc) и export PYSPARK_PYTHON=/opt/conda/miniconda3/envs/pyspark/bin/python (сделать доступными пакеты среды)

Теперь, если я попытаюсь запустить оболочку pyspark, я получу:

 21/12/07 01:25:16 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener AppStatusListener threw an exception java.lang.NumberFormatException: For input string: "null"  at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)  at java.lang.Integer.parseInt(Integer.java:580)  at java.lang.Integer.parseInt(Integer.java:615)  at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)  at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)  at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)  at org.apache.spark.util.Utils$.parseHostPort(Utils.scala:1126)  at org.apache.spark.status.ProcessSummaryWrapper.lt;initgt;(storeTypes.scala:527)  at org.apache.spark.status.LiveMiscellaneousProcess.doUpdate(LiveEntity.scala:924)  at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:50)  at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1213)  at org.apache.spark.status.AppStatusListener.onMiscellaneousProcessAdded(AppStatusListener.scala:1427)  at org.apache.spark.status.AppStatusListener.onOtherEvent(AppStatusListener.scala:113)  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)  at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)  at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)  at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)  at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)  at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)  at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$dispatch(AsyncEventQueue.scala:100)  at org.apache.spark.scheduler.AsyncEventQueue$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1404)  at org.apache.spark.scheduler.AsyncEventQueue$anon$2.run(AsyncEventQueue.scala:96)  

Однако оболочка запускается даже после этого. Но он не выполняет код. Создает исключения: Я пытался бежать: set(sc.parallelize(range(10),10).map(lambda x: socket.gethostname()).collect()) но, я получаю:

 21/12/07 01:32:15 WARN org.apache.spark.deploy.yarn.YarnAllocator: Container from a bad node: container_1638782400702_0003_01_000001 on host: monsoon-test1-w-2.us-central1-c.c.monsoon-credittech.internal. Exit status: 1. Diagnostics: [2021-12-07  01:32:13.672]Exception from container-launch. Container id: container_1638782400702_0003_01_000001 Exit code: 1 [2021-12-07 01:32:13.717]Container exited with a non-zero exit code 1. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : Last 4096 bytes of stderr : ltChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)  at java.lang.Thread.run(Thread.java:748) 21/12/07 01:31:43 ERROR org.apache.spark.executor.YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver monsoon-test1-m.us-central1-c.c.monsoon-credittech.internal:44367 disassociated! Shutting down. 21/12/07 01:32:13 WARN org.apache.hadoop.util.ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException java.util.concurrent.TimeoutException  at java.util.concurrent.FutureTask.get(FutureTask.java:205)  at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)  at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) 21/12/07 01:32:13 ERROR org.apache.spark.util.Utils: Uncaught exception in thread shutdown-hook-0 java.lang.InterruptedException  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)  at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)  at java.util.concurrent.Executors$DelegatedExecutorService.awaitTermination(Executors.java:675)  at org.apache.spark.rpc.netty.MessageLoop.stop(MessageLoop.scala:60)  at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1(Dispatcher.scala:197)  at org.apache.spark.rpc.netty.Dispatcher.$anonfun$stop$1$adapted(Dispatcher.scala:194)  at scala.collection.Iterator.foreach(Iterator.scala:943)  at scala.collection.Iterator.foreach$(Iterator.scala:943)  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)  at scala.collection.IterableLike.foreach(IterableLike.scala:74)  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)  at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:194)  at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:331)  at org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:309)  at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)  at org.apache.spark.executor.Executor.stop(Executor.scala:335)  at org.apache.spark.executor.Executor.$anonfun$new$2(Executor.scala:76)  at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)  at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)  at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)  at scala.util.Try$.apply(Try.scala:213)  at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)  at org.apache.spark.util.SparkShutdownHookManager$anon$2.run(ShutdownHookManager.scala:178)  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)  at java.util.concurrent.FutureTask.run(FutureTask.java:266)  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)  at java.lang.Thread.run(Thread.java:748)  

and the same error repeats multiple times before coming to a stop.

What am I doing wrong and How can I use python 3.2.0 on google dataproc?