Python code to unzip a zipped file on S3 in Databricks

#python #amazon-s3 #databricks

Question:

The code is meant to unzip a zipped file stored on S3. It runs in Databricks with Python 3 and pandas == 0.19.0.

zip_ref = zipfile.ZipFile(path, mode='r')

The line above raises the following error:

FileNotFoundError: [Errno 2] No such file or directory: path

Please let me know why this line raises the error even though the path is correct. Alternatively, is there a way to read the contents of the zip archive without extracting it?

1. Check what "path" contains. It should look like 's3://bucketname/filename.zip'; don't forget the extension.

2. Hi, the path is correct. I tried saving a file to that path and it worked.
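
One possible explanation, assuming "path" really is an s3:// URL as the first comment suggests: zipfile.ZipFile accepts only a local filesystem path or a file-like object, so it treats the URL as a relative local path that does not exist. S3 clients such as boto3 take the bucket and key separately. A minimal sketch of splitting the URL (the helper name split_s3_url is hypothetical, not part of any library):

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3://bucket/key URL, got %r" % url)
    # The object key is the URL path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://bucketname/filename.zip")
print(bucket, key)
```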

Answer #1:

You can use:

import zipfile

with zipfile.ZipFile("/dbfs/folder/file.zip", "r") as zip_ref:
    zip_ref.extractall("targetdir")

Or use the same code as above, but avoid ':' in the path string.
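
The remark about ':' presumably refers to the dbfs:/... URI form, which Spark APIs understand but ordinary Python file APIs such as zipfile do not. On Databricks clusters DBFS is normally exposed to local file APIs under /dbfs. A small sketch of the conversion, assuming that mount point (the helper name dbfs_to_local_path is hypothetical):

```python
def dbfs_to_local_path(path, mount="/dbfs"):
    """Map 'dbfs:/folder/file.zip' to '/dbfs/folder/file.zip'.

    Hypothetical helper; assumes DBFS is mounted locally at `mount`.
    Paths without the 'dbfs:' prefix are returned unchanged.
    """
    prefix = "dbfs:"
    if not path.startswith(prefix):
        return path
    # Drop the scheme, normalise leading slashes, then prepend the mount point.
    rest = path[len(prefix):].lstrip("/")
    return mount + "/" + rest


print(dbfs_to_local_path("dbfs:/folder/file.zip"))
```

The result can then be passed to zipfile.ZipFile as in the answer above.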

Answer #2:

Below is the code:

import os
import re
import zipfile

import boto3
import geopandas

### Declare the variables
s3client = boto3.client('s3')  # s3 client (Boto3 is the AWS SDK for Python)
s3resources = boto3.resource('s3')  # s3 resource
filetype = '.zip'  # file type such as zip, csv, json
source_url = 's3://bucketname/'  # s3 url with bucket name
bucketname = 'bucketname'  # bucket name
zipfile_name = 'local_file' + filetype  # local file name (with extension) in Databricks
filename = 'zipfilename' + filetype  # object key (file name with extension) in s3
shapefile_name = 'shapafilename.shp'  # name of the file to read after extraction
shapefile_path = os.path.abspath(zipfile_name)  # + '/' + filename  # local path of the downloaded zip
os_CurDir_file = os.path.join(os.curdir, 'shapefiles')  # directory to extract into
### downloading the files from s3 to the local databricks
s3resources.Bucket(bucketname).download_file(filename, zipfile_name)   
### unzip the file in the local DB
with zipfile.ZipFile(shapefile_path, 'r') as zip_ref:
    zip_ref.extractall(os_CurDir_file)   
### import shapefile using geopandas
plot_locations_df = geopandas.read_file(
                          os.path.join(
                          os_CurDir_file, 
                          shapefile_name))
# convert the geometry to a WKT string; r'\1' is a raw string so the backreference is not read as the \x01 character
plot_locations_df['geometry'] = plot_locations_df.geometry.apply(lambda x: x.wkt).apply(lambda x: re.sub('"(.*)"', r'\1', x))
display(plot_locations_df.head(5))
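
For the second part of the question (reading the archive without extracting it), zipfile.ZipFile also accepts a file-like object, so the archive bytes can be wrapped in io.BytesIO and read in memory. A minimal sketch, not tested against a real bucket: the helper names are hypothetical, the bucket and key are the placeholders used above, and AWS credentials are assumed to be configured for boto3.

```python
import io
import zipfile


def fetch_s3_bytes(bucket, key):
    """Download an S3 object into memory (assumes boto3 and AWS credentials)."""
    import boto3  # imported here so the zip helpers below work without boto3

    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


def list_zip_members(zip_bytes):
    """Return the member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.namelist()


def read_zip_member(zip_bytes, member):
    """Return the raw bytes of one member without extracting it to disk."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(member)


# Usage (placeholder names, requires network access):
# data = fetch_s3_bytes("bucketname", "zipfilename.zip")
# print(list_zip_members(data))
```

This keeps the whole archive in memory, so it suits small to moderately sized files; for large archives, downloading to local disk as in the answer above is likely the safer choice.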