Getting org/apache/hadoop/fs/StreamCapabilities when writing to Amazon S3 with PySpark

#amazon-web-services #apache-spark #amazon-s3 #pyspark

Question:

Problem:

I am trying to use hadoop-aws with pyspark so that I can read and write files in Amazon S3.

Approaches

Installing packages

I install hadoop-aws and its dependencies by passing their Maven coordinates in spark.jars.packages . However, I get an org/apache/hadoop/fs/StreamCapabilities error.
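
For reference, passing the coordinates this way might look like the sketch below. The helper names `hadoop_aws_coordinate` and `build_session` and the application name are illustrative assumptions, not part of my actual setup. The sketch also assumes the hadoop-aws version should be the same as the Hadoop version of the jars Spark ships with.

```python
def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Return the Maven coordinate of hadoop-aws for a given Hadoop version."""
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


def build_session(hadoop_version: str):
    """Create a SparkSession that fetches hadoop-aws via spark.jars.packages.

    pyspark is imported lazily so the helper above works without it.
    hadoop_version is assumed to match the hadoop-common jar on the
    Spark classpath (hypothetical choice for this sketch).
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("s3-write-sketch")  # illustrative name
        .config("spark.jars.packages", hadoop_aws_coordinate(hadoop_version))
        .getOrCreate()
    )
```

Calling `build_session("2.7.5")` would request `org.apache.hadoop:hadoop-aws:2.7.5`, and the declared dependencies of that artifact are resolved transitively.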

Compiling Spark

 ./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2-bin-hadoop2.7.5 --pip --tgz -Phadoop-cloud -Dhadoop.version=2.7.5
 

When I use the compiled version, I get the same org/apache/hadoop/fs/StreamCapabilities error.

Here are the contents of spark-3.0.2/jars:

 JLargeArrays-1.5.jar                   commons-lang3-3.9.jar                        ivy-2.4.0.jar                           jsr305-3.0.0.jar                         shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar                    commons-logging-1.1.3.jar                    jackson-annotations-2.10.0.jar          jul-to-slf4j-1.7.30.jar                  shims-0.7.45.jar
RoaringBitmap-0.7.45.jar               commons-math3-3.4.1.jar                      jackson-core-2.10.0.jar                 kryo-shaded-4.0.2.jar                    slf4j-api-1.7.30.jar
activation-1.1.1.jar                   commons-net-3.1.jar                          jackson-core-asl-1.9.13.jar             leveldbjni-all-1.8.jar                   slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar                 commons-text-1.6.jar                         jackson-databind-2.10.0.jar             log4j-1.2.17.jar                         snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar              compress-lzf-1.0.3.jar                       jackson-dataformat-cbor-2.10.0.jar      lz4-java-1.7.1.jar                       spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar               core-1.1.2.jar                               jackson-jaxrs-1.9.13.jar                machinist_2.12-0.6.8.jar                 spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar       curator-client-2.7.1.jar                     jackson-mapper-asl-1.9.13.jar           macro-compat_2.12-1.1.1.jar              spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar            curator-framework-2.7.1.jar                  jackson-module-paranamer-2.10.0.jar     metrics-core-4.1.1.jar                   spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar  curator-recipes-2.7.1.jar                    jackson-module-scala_2.12-2.10.0.jar    metrics-graphite-4.1.1.jar               spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar             flatbuffers-java-1.9.0.jar                   jackson-xc-1.9.13.jar                   metrics-jmx-4.1.1.jar                    spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar                 gson-2.2.4.jar                               jakarta.annotation-api-1.3.5.jar        metrics-json-4.1.1.jar                   spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar            guava-14.0.1.jar                             jakarta.inject-2.6.1.jar                metrics-jvm-4.1.1.jar                    spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar                hadoop-annotations-2.7.5.jar                 jakarta.validation-api-2.0.2.jar        minlog-1.3.0.jar                         spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar                hadoop-auth-2.7.5.jar                        jakarta.ws.rs-api-2.1.6.jar             netty-all-4.1.47.Final.jar               spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar                hadoop-aws-2.7.5.jar                         jakarta.xml.bind-api-2.3.2.jar          objenesis-2.5.1.jar                      spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar         hadoop-azure-2.7.5.jar                       janino-3.0.16.jar                       opencsv-2.3.jar                          spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar                         hadoop-client-2.7.5.jar                      javassist-3.25.0-GA.jar                 orc-core-1.5.10.jar                      spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar                     hadoop-common-2.7.5.jar                      javax.servlet-api-3.1.0.jar             orc-mapreduce-1.5.10.jar                 spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar          hadoop-hdfs-2.7.5.jar                        jaxb-api-2.2.2.jar                      orc-shims-1.5.10.jar                     spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar                hadoop-mapreduce-client-app-2.7.5.jar        jaxb-runtime-2.3.2.jar                  oro-2.0.8.jar                            spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar             hadoop-mapreduce-client-common-2.7.5.jar     jcl-over-slf4j-1.7.30.jar               osgi-resource-locator-1.0.3.jar          spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar                    hadoop-mapreduce-client-core-2.7.5.jar       jersey-client-2.30.jar                  paranamer-2.8.jar                        spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar          hadoop-mapreduce-client-jobclient-2.7.5.jar  jersey-common-2.30.jar                  parquet-column-1.10.1.jar                spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar                   hadoop-mapreduce-client-shuffle-2.7.5.jar    jersey-container-servlet-2.30.jar       parquet-common-1.10.1.jar                spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar                   hadoop-openstack-2.7.5.jar                   jersey-container-servlet-core-2.30.jar  parquet-encoding-1.10.1.jar              stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar            hadoop-yarn-api-2.7.5.jar                    jersey-hk2-2.30.jar                     parquet-format-2.4.0.jar                 stream-2.9.6.jar
commons-cli-1.2.jar                    hadoop-yarn-client-2.7.5.jar                 jersey-media-jaxb-2.30.jar              parquet-hadoop-1.10.1.jar                threeten-extra-1.5.0.jar
commons-codec-1.10.jar                 hadoop-yarn-common-2.7.5.jar                 jersey-server-2.30.jar                  parquet-jackson-1.10.1.jar               univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar          hadoop-yarn-server-common-2.7.5.jar          jetty-sslengine-6.1.26.jar              protobuf-java-2.5.0.jar                  xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar            hive-storage-api-2.7.1.jar                   jetty-util-6.1.26.jar                   py4j-0.10.9.jar                          xercesImpl-2.12.0.jar
commons-compress-1.20.jar              hk2-api-2.6.1.jar                            jetty-util-9.4.34.v20201102.jar         pyrolite-4.30.jar                        xml-apis-1.4.01.jar
commons-configuration-1.6.jar          hk2-locator-2.6.1.jar                        joda-time-2.10.5.jar                    scala-collection-compat_2.12-2.1.1.jar   xmlenc-0.52.jar
commons-crypto-1.1.0.jar               hk2-utils-2.6.1.jar                          json4s-ast_2.12-3.6.6.jar               scala-compiler-2.12.10.jar               xz-1.5.jar
commons-digester-1.8.jar               htrace-core-3.1.0-incubating.jar             json4s-core_2.12-3.6.6.jar              scala-library-2.12.10.jar                zookeeper-3.4.14.jar
commons-httpclient-3.1.jar             httpclient-4.5.6.jar                         json4s-jackson_2.12-3.6.6.jar           scala-parser-combinators_2.12-1.1.2.jar  zstd-jni-1.4.4-3.jar
commons-io-2.4.jar                     httpcore-4.4.12.jar                          json4s-scalap_2.12-3.6.6.jar            scala-reflect-2.12.10.jar
commons-lang-2.6.jar                   istack-commons-runtime-3.0.8.jar             jsp-api-2.1.jar                         scala-xml_2.12-1.2.0.jar
 

Compiling Spark with hadoop-cloud only

 ./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2 --pip --tgz -Phadoop-cloud
 

When I try to save files on Amazon S3, I get the following error:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/store/EtagChecksum

Here are the jars for this build:

 JLargeArrays-1.5.jar                   commons-lang3-3.9.jar                        ivy-2.4.0.jar                           jsr305-3.0.0.jar                         shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar                    commons-logging-1.1.3.jar                    jackson-annotations-2.10.0.jar          jul-to-slf4j-1.7.30.jar                  shims-0.7.45.jar
RoaringBitmap-0.7.45.jar               commons-math3-3.4.1.jar                      jackson-core-2.10.0.jar                 kryo-shaded-4.0.2.jar                    slf4j-api-1.7.30.jar
activation-1.1.1.jar                   commons-net-3.1.jar                          jackson-core-asl-1.9.13.jar             leveldbjni-all-1.8.jar                   slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar                 commons-text-1.6.jar                         jackson-databind-2.10.0.jar             log4j-1.2.17.jar                         snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar              compress-lzf-1.0.3.jar                       jackson-dataformat-cbor-2.10.0.jar      lz4-java-1.7.1.jar                       spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar               core-1.1.2.jar                               jackson-jaxrs-1.9.13.jar                machinist_2.12-0.6.8.jar                 spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar       curator-client-2.7.1.jar                     jackson-mapper-asl-1.9.13.jar           macro-compat_2.12-1.1.1.jar              spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar            curator-framework-2.7.1.jar                  jackson-module-paranamer-2.10.0.jar     metrics-core-4.1.1.jar                   spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar  curator-recipes-2.7.1.jar                    jackson-module-scala_2.12-2.10.0.jar    metrics-graphite-4.1.1.jar               spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar             flatbuffers-java-1.9.0.jar                   jackson-xc-1.9.13.jar                   metrics-jmx-4.1.1.jar                    spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar                 gson-2.2.4.jar                               jakarta.annotation-api-1.3.5.jar        metrics-json-4.1.1.jar                   spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar            guava-14.0.1.jar                             jakarta.inject-2.6.1.jar                metrics-jvm-4.1.1.jar                    spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar                hadoop-annotations-2.7.4.jar                 jakarta.validation-api-2.0.2.jar        minlog-1.3.0.jar                         spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar                hadoop-auth-2.7.4.jar                        jakarta.ws.rs-api-2.1.6.jar             netty-all-4.1.47.Final.jar               spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar                hadoop-aws-2.7.4.jar                         jakarta.xml.bind-api-2.3.2.jar          objenesis-2.5.1.jar                      spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar         hadoop-azure-2.7.4.jar                       janino-3.0.16.jar                       opencsv-2.3.jar                          spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar                         hadoop-client-2.7.4.jar                      javassist-3.25.0-GA.jar                 orc-core-1.5.10.jar                      spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar                     hadoop-common-2.7.4.jar                      javax.servlet-api-3.1.0.jar             orc-mapreduce-1.5.10.jar                 spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar          hadoop-hdfs-2.7.4.jar                        jaxb-api-2.2.2.jar                      orc-shims-1.5.10.jar                     spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar                hadoop-mapreduce-client-app-2.7.4.jar        jaxb-runtime-2.3.2.jar                  oro-2.0.8.jar                            spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar             hadoop-mapreduce-client-common-2.7.4.jar     jcl-over-slf4j-1.7.30.jar               osgi-resource-locator-1.0.3.jar          spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar                    hadoop-mapreduce-client-core-2.7.4.jar       jersey-client-2.30.jar                  paranamer-2.8.jar                        spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar          hadoop-mapreduce-client-jobclient-2.7.4.jar  jersey-common-2.30.jar                  parquet-column-1.10.1.jar                spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar                   hadoop-mapreduce-client-shuffle-2.7.4.jar    jersey-container-servlet-2.30.jar       parquet-common-1.10.1.jar                spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar                   hadoop-openstack-2.7.4.jar                   jersey-container-servlet-core-2.30.jar  parquet-encoding-1.10.1.jar              stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar            hadoop-yarn-api-2.7.4.jar                    jersey-hk2-2.30.jar                     parquet-format-2.4.0.jar                 stream-2.9.6.jar
commons-cli-1.2.jar                    hadoop-yarn-client-2.7.4.jar                 jersey-media-jaxb-2.30.jar              parquet-hadoop-1.10.1.jar                threeten-extra-1.5.0.jar
commons-codec-1.10.jar                 hadoop-yarn-common-2.7.4.jar                 jersey-server-2.30.jar                  parquet-jackson-1.10.1.jar               univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar          hadoop-yarn-server-common-2.7.4.jar          jetty-sslengine-6.1.26.jar              protobuf-java-2.5.0.jar                  xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar            hive-storage-api-2.7.1.jar                   jetty-util-6.1.26.jar                   py4j-0.10.9.jar                          xercesImpl-2.12.0.jar
commons-compress-1.20.jar              hk2-api-2.6.1.jar                            jetty-util-9.4.34.v20201102.jar         pyrolite-4.30.jar                        xml-apis-1.4.01.jar
commons-configuration-1.6.jar          hk2-locator-2.6.1.jar                        joda-time-2.10.5.jar                    scala-collection-compat_2.12-2.1.1.jar   xmlenc-0.52.jar
commons-crypto-1.1.0.jar               hk2-utils-2.6.1.jar                          json4s-ast_2.12-3.6.6.jar               scala-compiler-2.12.10.jar               xz-1.5.jar
commons-digester-1.8.jar               htrace-core-3.1.0-incubating.jar             json4s-core_2.12-3.6.6.jar              scala-library-2.12.10.jar                zookeeper-3.4.14.jar
commons-httpclient-3.1.jar             httpclient-4.5.6.jar                         json4s-jackson_2.12-3.6.6.jar           scala-parser-combinators_2.12-1.1.2.jar  zstd-jni-1.4.4-3.jar
commons-io-2.4.jar                     httpcore-4.4.12.jar                          json4s-scalap_2.12-3.6.6.jar            scala-reflect-2.12.10.jar
commons-lang-2.6.jar                   istack-commons-runtime-3.0.8.jar             jsp-api-2.1.jar                         scala-xml_2.12-1.2.0.jar
 

Intuition

I think the error comes from some internal mismatch between the version of hadoop-aws and what is in hadoop-common . However, I don't understand how I could resolve it, either by passing configuration to the SparkSession from pyspark or by compiling Spark in a way that resolves it.
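
One way to test this suspicion is to group the Hadoop jars by version and see whether more than one version appears. The following is only a sketch: the function names and the filename pattern are my own assumptions. It inspects jar file names only, so it would not notice jars fetched at runtime through spark.jars.packages .

```python
import re
from collections import defaultdict

# Matches names such as "hadoop-aws-2.7.5.jar"; the pattern is an assumption
# about how Hadoop artifacts are named and ignores suffixes like -SNAPSHOT.
HADOOP_JAR = re.compile(r"^(hadoop-[a-z-]+?)-(\d+(?:\.\d+)+)\.jar$")


def hadoop_versions(jar_names):
    """Map each Hadoop version found to the sorted artifacts carrying it."""
    found = defaultdict(list)
    for name in jar_names:
        match = HADOOP_JAR.match(name)
        if match:
            found[match.group(2)].append(match.group(1))
    return {version: sorted(artifacts) for version, artifacts in found.items()}


def is_consistent(jar_names):
    """True when at most one Hadoop version is present among the jars."""
    return len(hadoop_versions(jar_names)) <= 1
```

For a real directory, the names could come from `os.listdir("spark-3.0.2/jars")`. A result with two or more keys would indicate mixed Hadoop versions on the classpath.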

Comments:

1. I solved this by building against Spark 3.1.2 and asking the Spark build to include the hadoop-cloud libraries: /make-distribution.sh --name spark-3.1.2-bin-hadoop3.2.0 --pip --tgz -Phadoop-3.2 -Phadoop-cloud -Dhadoop.version=3.2.0