#amazon-web-services #apache-spark #amazon-s3 #pyspark
Question:
Problem:
I am trying to use hadoop-aws with pyspark so that I can read and write files in Amazon S3.
Approaches
Installing packages
I installed hadoop-aws and its dependencies by passing their Maven coordinates to spark.jars.packages. However, I get an org/apache/hadoop/fs/StreamCapabilities error.
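A minimal sketch of the spark.jars.packages approach, assuming the hadoop-aws version has to equal the Hadoop version that Spark itself was built with. The helper names `hadoop_aws_coordinate` and `build_s3a_session` are hypothetical and only serve the illustration; `pyspark` is imported lazily so the coordinate helper can be used without it.

```python
import re


def hadoop_aws_coordinate(hadoop_version: str) -> str:
    """Return the Maven coordinate for hadoop-aws pinned to one Hadoop version."""
    # Reject anything that is not a plain x.y.z version, e.g. "latest".
    if not re.fullmatch(r"\d+\.\d+\.\d+", hadoop_version):
        raise ValueError(f"unexpected Hadoop version: {hadoop_version!r}")
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


def build_s3a_session(hadoop_version: str, app_name: str = "s3a-example"):
    """Create a SparkSession that resolves hadoop-aws when it starts."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    return (
        SparkSession.builder
        .appName(app_name)
        # Assumption: this version matches the hadoop-* jars Spark ships with.
        .config("spark.jars.packages", hadoop_aws_coordinate(hadoop_version))
        .getOrCreate()
    )
```

Calling `build_s3a_session("2.7.5")` would ask Spark to fetch hadoop-aws 2.7.5 together with its transitive dependencies. Whether that is sufficient depends on which Hadoop classes the rest of the classpath expects, so a matching coordinate alone may not remove the error.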
Compiling Spark
./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2-bin-hadoop2.7.5 --pip --tgz -Phadoop-cloud -Dhadoop.version=2.7.5
When I use the compiled version, I get the same org/apache/hadoop/fs/StreamCapabilities error.
Here are the contents of spark-3.0.2/jars:
JLargeArrays-1.5.jar commons-lang3-3.9.jar ivy-2.4.0.jar jsr305-3.0.0.jar shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar commons-logging-1.1.3.jar jackson-annotations-2.10.0.jar jul-to-slf4j-1.7.30.jar shims-0.7.45.jar
RoaringBitmap-0.7.45.jar commons-math3-3.4.1.jar jackson-core-2.10.0.jar kryo-shaded-4.0.2.jar slf4j-api-1.7.30.jar
activation-1.1.1.jar commons-net-3.1.jar jackson-core-asl-1.9.13.jar leveldbjni-all-1.8.jar slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar commons-text-1.6.jar jackson-databind-2.10.0.jar log4j-1.2.17.jar snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar compress-lzf-1.0.3.jar jackson-dataformat-cbor-2.10.0.jar lz4-java-1.7.1.jar spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar core-1.1.2.jar jackson-jaxrs-1.9.13.jar machinist_2.12-0.6.8.jar spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar curator-client-2.7.1.jar jackson-mapper-asl-1.9.13.jar macro-compat_2.12-1.1.1.jar spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar curator-framework-2.7.1.jar jackson-module-paranamer-2.10.0.jar metrics-core-4.1.1.jar spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar curator-recipes-2.7.1.jar jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar flatbuffers-java-1.9.0.jar jackson-xc-1.9.13.jar metrics-jmx-4.1.1.jar spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar gson-2.2.4.jar jakarta.annotation-api-1.3.5.jar metrics-json-4.1.1.jar spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar guava-14.0.1.jar jakarta.inject-2.6.1.jar metrics-jvm-4.1.1.jar spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar hadoop-annotations-2.7.5.jar jakarta.validation-api-2.0.2.jar minlog-1.3.0.jar spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar hadoop-auth-2.7.5.jar jakarta.ws.rs-api-2.1.6.jar netty-all-4.1.47.Final.jar spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar hadoop-aws-2.7.5.jar jakarta.xml.bind-api-2.3.2.jar objenesis-2.5.1.jar spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar hadoop-azure-2.7.5.jar janino-3.0.16.jar opencsv-2.3.jar spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar hadoop-client-2.7.5.jar javassist-3.25.0-GA.jar orc-core-1.5.10.jar spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar hadoop-common-2.7.5.jar javax.servlet-api-3.1.0.jar orc-mapreduce-1.5.10.jar spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar hadoop-hdfs-2.7.5.jar jaxb-api-2.2.2.jar orc-shims-1.5.10.jar spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar hadoop-mapreduce-client-app-2.7.5.jar jaxb-runtime-2.3.2.jar oro-2.0.8.jar spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar hadoop-mapreduce-client-common-2.7.5.jar jcl-over-slf4j-1.7.30.jar osgi-resource-locator-1.0.3.jar spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar hadoop-mapreduce-client-core-2.7.5.jar jersey-client-2.30.jar paranamer-2.8.jar spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar hadoop-mapreduce-client-jobclient-2.7.5.jar jersey-common-2.30.jar parquet-column-1.10.1.jar spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar hadoop-mapreduce-client-shuffle-2.7.5.jar jersey-container-servlet-2.30.jar parquet-common-1.10.1.jar spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar hadoop-openstack-2.7.5.jar jersey-container-servlet-core-2.30.jar parquet-encoding-1.10.1.jar stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar hadoop-yarn-api-2.7.5.jar jersey-hk2-2.30.jar parquet-format-2.4.0.jar stream-2.9.6.jar
commons-cli-1.2.jar hadoop-yarn-client-2.7.5.jar jersey-media-jaxb-2.30.jar parquet-hadoop-1.10.1.jar threeten-extra-1.5.0.jar
commons-codec-1.10.jar hadoop-yarn-common-2.7.5.jar jersey-server-2.30.jar parquet-jackson-1.10.1.jar univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar hadoop-yarn-server-common-2.7.5.jar jetty-sslengine-6.1.26.jar protobuf-java-2.5.0.jar xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar hive-storage-api-2.7.1.jar jetty-util-6.1.26.jar py4j-0.10.9.jar xercesImpl-2.12.0.jar
commons-compress-1.20.jar hk2-api-2.6.1.jar jetty-util-9.4.34.v20201102.jar pyrolite-4.30.jar xml-apis-1.4.01.jar
commons-configuration-1.6.jar hk2-locator-2.6.1.jar joda-time-2.10.5.jar scala-collection-compat_2.12-2.1.1.jar xmlenc-0.52.jar
commons-crypto-1.1.0.jar hk2-utils-2.6.1.jar json4s-ast_2.12-3.6.6.jar scala-compiler-2.12.10.jar xz-1.5.jar
commons-digester-1.8.jar htrace-core-3.1.0-incubating.jar json4s-core_2.12-3.6.6.jar scala-library-2.12.10.jar zookeeper-3.4.14.jar
commons-httpclient-3.1.jar httpclient-4.5.6.jar json4s-jackson_2.12-3.6.6.jar scala-parser-combinators_2.12-1.1.2.jar zstd-jni-1.4.4-3.jar
commons-io-2.4.jar httpcore-4.4.12.jar json4s-scalap_2.12-3.6.6.jar scala-reflect-2.12.10.jar
commons-lang-2.6.jar istack-commons-runtime-3.0.8.jar jsp-api-2.1.jar scala-xml_2.12-1.2.0.jar
Compiling Spark with only hadoop-cloud
./spark-3.0.2/dev/make-distribution.sh --name spark-3.0.2 --pip --tgz -Phadoop-cloud
When I try to save files on Amazon S3, I get the following error:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/store/EtagChecksum
Here are the jars for this build:
JLargeArrays-1.5.jar commons-lang3-3.9.jar ivy-2.4.0.jar jsr305-3.0.0.jar shapeless_2.12-2.3.3.jar
JTransforms-3.1.jar commons-logging-1.1.3.jar jackson-annotations-2.10.0.jar jul-to-slf4j-1.7.30.jar shims-0.7.45.jar
RoaringBitmap-0.7.45.jar commons-math3-3.4.1.jar jackson-core-2.10.0.jar kryo-shaded-4.0.2.jar slf4j-api-1.7.30.jar
activation-1.1.1.jar commons-net-3.1.jar jackson-core-asl-1.9.13.jar leveldbjni-all-1.8.jar slf4j-log4j12-1.7.30.jar
aircompressor-0.10.jar commons-text-1.6.jar jackson-databind-2.10.0.jar log4j-1.2.17.jar snappy-java-1.1.8.2.jar
algebra_2.12-2.0.0-M2.jar compress-lzf-1.0.3.jar jackson-dataformat-cbor-2.10.0.jar lz4-java-1.7.1.jar spark-catalyst_2.12-3.0.2.jar
antlr4-runtime-4.7.1.jar core-1.1.2.jar jackson-jaxrs-1.9.13.jar machinist_2.12-0.6.8.jar spark-core_2.12-3.0.2.jar
aopalliance-repackaged-2.6.1.jar curator-client-2.7.1.jar jackson-mapper-asl-1.9.13.jar macro-compat_2.12-1.1.1.jar spark-graphx_2.12-3.0.2.jar
apacheds-i18n-2.0.0-M15.jar curator-framework-2.7.1.jar jackson-module-paranamer-2.10.0.jar metrics-core-4.1.1.jar spark-hadoop-cloud_2.12-3.0.2.jar
apacheds-kerberos-codec-2.0.0-M15.jar curator-recipes-2.7.1.jar jackson-module-scala_2.12-2.10.0.jar metrics-graphite-4.1.1.jar spark-kvstore_2.12-3.0.2.jar
api-asn1-api-1.0.0-M20.jar flatbuffers-java-1.9.0.jar jackson-xc-1.9.13.jar metrics-jmx-4.1.1.jar spark-launcher_2.12-3.0.2.jar
api-util-1.0.0-M20.jar gson-2.2.4.jar jakarta.annotation-api-1.3.5.jar metrics-json-4.1.1.jar spark-mllib-local_2.12-3.0.2.jar
arpack_combined_all-0.1.jar guava-14.0.1.jar jakarta.inject-2.6.1.jar metrics-jvm-4.1.1.jar spark-mllib_2.12-3.0.2.jar
arrow-format-0.15.1.jar hadoop-annotations-2.7.4.jar jakarta.validation-api-2.0.2.jar minlog-1.3.0.jar spark-network-common_2.12-3.0.2.jar
arrow-memory-0.15.1.jar hadoop-auth-2.7.4.jar jakarta.ws.rs-api-2.1.6.jar netty-all-4.1.47.Final.jar spark-network-shuffle_2.12-3.0.2.jar
arrow-vector-0.15.1.jar hadoop-aws-2.7.4.jar jakarta.xml.bind-api-2.3.2.jar objenesis-2.5.1.jar spark-repl_2.12-3.0.2.jar
audience-annotations-0.5.0.jar hadoop-azure-2.7.4.jar janino-3.0.16.jar opencsv-2.3.jar spark-sketch_2.12-3.0.2.jar
avro-1.8.2.jar hadoop-client-2.7.4.jar javassist-3.25.0-GA.jar orc-core-1.5.10.jar spark-sql_2.12-3.0.2.jar
avro-ipc-1.8.2.jar hadoop-common-2.7.4.jar javax.servlet-api-3.1.0.jar orc-mapreduce-1.5.10.jar spark-streaming_2.12-3.0.2.jar
avro-mapred-1.8.2-hadoop2.jar hadoop-hdfs-2.7.4.jar jaxb-api-2.2.2.jar orc-shims-1.5.10.jar spark-tags_2.12-3.0.2.jar
azure-storage-2.0.0.jar hadoop-mapreduce-client-app-2.7.4.jar jaxb-runtime-2.3.2.jar oro-2.0.8.jar spark-unsafe_2.12-3.0.2.jar
breeze-macros_2.12-1.0.jar hadoop-mapreduce-client-common-2.7.4.jar jcl-over-slf4j-1.7.30.jar osgi-resource-locator-1.0.3.jar spire-macros_2.12-0.17.0-M1.jar
breeze_2.12-1.0.jar hadoop-mapreduce-client-core-2.7.4.jar jersey-client-2.30.jar paranamer-2.8.jar spire-platform_2.12-0.17.0-M1.jar
cats-kernel_2.12-2.0.0-M4.jar hadoop-mapreduce-client-jobclient-2.7.4.jar jersey-common-2.30.jar parquet-column-1.10.1.jar spire-util_2.12-0.17.0-M1.jar
chill-java-0.9.5.jar hadoop-mapreduce-client-shuffle-2.7.4.jar jersey-container-servlet-2.30.jar parquet-common-1.10.1.jar spire_2.12-0.17.0-M1.jar
chill_2.12-0.9.5.jar hadoop-openstack-2.7.4.jar jersey-container-servlet-core-2.30.jar parquet-encoding-1.10.1.jar stax-api-1.0-2.jar
commons-beanutils-1.9.4.jar hadoop-yarn-api-2.7.4.jar jersey-hk2-2.30.jar parquet-format-2.4.0.jar stream-2.9.6.jar
commons-cli-1.2.jar hadoop-yarn-client-2.7.4.jar jersey-media-jaxb-2.30.jar parquet-hadoop-1.10.1.jar threeten-extra-1.5.0.jar
commons-codec-1.10.jar hadoop-yarn-common-2.7.4.jar jersey-server-2.30.jar parquet-jackson-1.10.1.jar univocity-parsers-2.9.0.jar
commons-collections-3.2.2.jar hadoop-yarn-server-common-2.7.4.jar jetty-sslengine-6.1.26.jar protobuf-java-2.5.0.jar xbean-asm7-shaded-4.15.jar
commons-compiler-3.0.16.jar hive-storage-api-2.7.1.jar jetty-util-6.1.26.jar py4j-0.10.9.jar xercesImpl-2.12.0.jar
commons-compress-1.20.jar hk2-api-2.6.1.jar jetty-util-9.4.34.v20201102.jar pyrolite-4.30.jar xml-apis-1.4.01.jar
commons-configuration-1.6.jar hk2-locator-2.6.1.jar joda-time-2.10.5.jar scala-collection-compat_2.12-2.1.1.jar xmlenc-0.52.jar
commons-crypto-1.1.0.jar hk2-utils-2.6.1.jar json4s-ast_2.12-3.6.6.jar scala-compiler-2.12.10.jar xz-1.5.jar
commons-digester-1.8.jar htrace-core-3.1.0-incubating.jar json4s-core_2.12-3.6.6.jar scala-library-2.12.10.jar zookeeper-3.4.14.jar
commons-httpclient-3.1.jar httpclient-4.5.6.jar json4s-jackson_2.12-3.6.6.jar scala-parser-combinators_2.12-1.1.2.jar zstd-jni-1.4.4-3.jar
commons-io-2.4.jar httpcore-4.4.12.jar json4s-scalap_2.12-3.6.6.jar scala-reflect-2.12.10.jar
commons-lang-2.6.jar istack-commons-runtime-3.0.8.jar jsp-api-2.1.jar scala-xml_2.12-1.2.0.jar
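One way to narrow down a NoClassDefFoundError is to check whether any jar in the build actually contains the missing class. The sketch below does that with the standard library only; the function name `jars_providing` is hypothetical, and the directory passed to it is whatever jars folder the build produced.

```python
import zipfile
from pathlib import Path


def jars_providing(jars_dir, class_name):
    """Return the names of jars in jars_dir that contain class_name.

    class_name may use dots or slashes, e.g. the form shown in the error:
    org/apache/hadoop/fs/StreamCapabilities
    """
    entry = class_name.replace(".", "/") + ".class"
    providers = []
    for jar in sorted(Path(jars_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    providers.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives.
            continue
    return providers
```

For example, `jars_providing("spark-3.0.2/jars", "org/apache/hadoop/fs/store/EtagChecksum")` returning an empty list would indicate that no bundled jar ships that class, so something on the classpath was compiled against a Hadoop release that does.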
Intuition
I think the error comes from some internal mismatch between the hadoop-aws version and the hadoop-common version it depends on. However, I do not understand how I could resolve it, either by passing configuration to the SparkSession from pyspark or by compiling Spark in a way that makes the versions agree.
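The suspected version mismatch can be checked directly from the jar file names. The following is a rough sketch, assuming Hadoop artifacts are named `hadoop-<module>-<version>.jar` as in the listings above; the function name `hadoop_versions` is hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as hadoop-aws-2.7.5.jar or hadoop-yarn-api-2.7.5.jar.
HADOOP_JAR = re.compile(r"^(hadoop-[a-z-]+?)-(\d+(?:\.\d+)+)\.jar$")


def hadoop_versions(jar_names):
    """Map each Hadoop version to the hadoop-* artifacts carrying it."""
    by_version = defaultdict(list)
    for name in jar_names:
        match = HADOOP_JAR.match(name)
        if match:
            by_version[match.group(2)].append(match.group(1))
    return dict(by_version)
```

Feeding it `os.listdir("spark-3.0.2/jars")` gives one key per Hadoop version present. In both listings above every hadoop-* jar carries a single version (2.7.5 in the first build, 2.7.4 in the second), so the Hadoop jars agree with each other. The mismatch may therefore lie elsewhere, for instance in code compiled against a newer Hadoop API; that is an inference, not something the listings prove.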
Comments:
1. I solved this problem by building against Spark 3.1.2 and asking the Spark build to include the hadoop-cloud libraries:
./make-distribution.sh --name spark-3.1.2-bin-hadoop3.2.0 --pip --tgz -Phadoop-3.2 -Phadoop-cloud -Dhadoop.version=3.2.0