Создайте фрейм данных из глубоко вложенного json

#python #json #pandas #parsing

#питон #json #панды #разбор

Вопрос:

Я пытаюсь прочитать приведенную ниже схему json в фрейм данных, я могу преобразовать ее в свой предпочтительный тип, выполнив итерацию по всем узлам, но это может занять некоторое время, потому что исходные файлы json намного длиннее, чем в этом примере. (в десятках тысяч)

 "data": [  {  "node": {  "id": 2,  "title": "Berserk",  "main_picture": {  "medium": "https://api-cdn.myanimelist.net/images/manga/1/157897.jpg",  "large": "https://api-cdn.myanimelist.net/images/manga/1/157897l.jpg"  },  "alternative_titles": {  "synonyms": [  "Berserk: The Prototype"  ],  "en": "Berserk",  "ja": "u30d9u30ebu30bbu30ebu30af"  },  "start_date": "1989-08-25",  "end_date": "2021-09-10",  "synopsis": "Guts, a former mercenary now known as the "Black Swordsman," is out for revenge. After a tumultuous childhood, he finally finds someone he respects and believes he can trust, only to have everything fall apart when this person takes away everything important to Guts for the purpose of fulfilling his own desires. Now marked for death, Guts becomes condemned to a fate in which he is relentlessly pursued by demonic beings.nnSetting out on a dreadful quest riddled with misfortune, Guts, armed with a massive sword and monstrous strength, will let nothing stop him, not even death itself, until he is finally able to take the head of the one who stripped himu2014and his loved oneu2014of their humanity.nn[Written by MAL Rewrite]nnIncluded one-shot:nVolume 14: Berserk: The Prototype",  "mean": 9.42,  "rank": 1,  "popularity": 2,  "num_list_users": 468103,  "num_scoring_users": 223269,  "nsfw": "white",  "created_at": "1970-01-01T00:00:00 00:00",  "updated_at": "2021-12-03T19:02:32 00:00",  "media_type": "manga",  "status": "on_hiatus",  "genres": [  {  "id": 1,  "name": "Action"  },  {  "id": 2,  "name": "Adventure"  },  {  "id": 10,  "name": "Fantasy"  },  {  "id": 40,  "name": "Psychological"  },  {  "id": 37,  "name": "Supernatural"  },  {  "id": 38,  "name": "Military"  },  {  "id": 6,  "name": "Demons"  },  {  "id": 8,  "name": "Drama"  },  {  "id": 14,  "name": "Horror"  },  {  "id": 41,  "name": "Seinen"  }  ],  "my_list_status": {  "status": "completed",  "is_rereading": false,  "num_volumes_read": 0,  "num_chapters_read": 0,  "score": 9,  "updated_at": "2020-05-29T15:13:01 00:00"  },  "num_volumes": 41,  "num_chapters": 380,  "authors": [  {  "node": {  "id": 1868,  "first_name": "Kentarou",  "last_name": "Miura"  },  "role": "Story amp; Art"  }  ]  },  "ranking": {  "rank": 1  }  },  {  "node": {  "id": 1706,  "title": "JoJo no Kimyou na Bouken Part 7: Steel Ball Run",  "main_picture": {  "medium": "https://api-cdn.myanimelist.net/images/manga/3/179882.jpg",  "large": "https://api-cdn.myanimelist.net/images/manga/3/179882l.jpg"  },  "alternative_titles": {  "synonyms": [  "JoJo's Bizarre Adventure Part 7: Steel Ball Run",  "SBR"  ],  "en": "",  "ja": "u30b8u30e7u30b8u30e7u306eu5947u5999u306au5192u967a Part7 STEEL BALL RUN"  },  "start_date": "2004-01-19",  "end_date": "2011-04-19",  "synopsis": "In the American Old West, the world's greatest race is about to begin. Thousands line up in San Diego to travel over six thousand kilometers for a chance to win the grand prize of fifty million dollars. With the era of the horse reaching its end, contestants are allowed to use any kind of vehicle they wish. Competitors will have to endure grueling conditions, traveling up to a hundred kilometers a day through uncharted wastelands. The Steel Ball Run is truly a one-of-a-kind event.nnThe youthful Johnny Joestar, a crippled former horse racer, has come to San Diego to watch the start of the race. There he encounters Gyro Zeppeli, a racer with two steel balls at his waist instead of a gun. Johnny witnesses Gyro using one of his steel balls to unleash a fantastical power, compelling a man to fire his gun at himself during a duel. In the midst of the action, Johnny happens to touch the steel ball and feels a power surging through his legs, allowing him to stand up for the first time in two years. Vowing to find the secret of the steel balls, Johnny decides to compete in the race, and so begins his bizarre adventure across America on the Steel Ball Run.nn[Written by MAL Rewrite]",  "mean": 9.25,  "rank": 2,  "popularity": 27,  "num_list_users": 178230,  "num_scoring_users": 105411,  "nsfw": "white",  "created_at": "1970-01-01T00:00:00 00:00",  "updated_at": "2021-12-03T18:56:19 00:00",  "media_type": "manga",  "status": "finished",  "genres": [  {  "id": 1,  "name": "Action"  },  {  "id": 2,  "name": "Adventure"  },  {  "id": 37,  "name": "Supernatural"  },  {  "id": 13,  "name": "Historical"  },  {  "id": 7,  "name": "Mystery"  },  {  "id": 14,  "name": "Horror"  },  {  "id": 27,  "name": "Shounen"  },  {  "id": 41,  "name": "Seinen"  }  ],  "num_volumes": 24,  "num_chapters": 96,  "authors": [  {  "node": {  "id": 2619,  "first_name": "Hirohiko",  "last_name": "Araki"  },  "role": "Story amp; Art"  }  ]  },  "ranking": {  "rank": 2  }  }  ],  "paging": {  "next": "https://api.myanimelist.net/v2/manga/ranking?offset=2amp;ranking_type=mangaamp;limit=2amp;fields=id,title,main_picture,alternative_titles,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,nsfw,created_at,updated_at,media_type,status,genres,my_list_status,num_volumes,num_chapters,authors{first_name,last_name},pictures,background,related_anime,related_manga,recommendations"  } }  

Колонки, которые вызывают у меня борьбу, — это жанры и авторы. Мне нужно только поле » имя «в столбце» жанры «и столбцы «имя», «фамилия» в столбце «авторы».

Формат, который я хочу для обоих, — это

 genres first_name last_name  0 Action, Adventure, Fantasy, Kentarou Miura  Psychological, Supernatural ...  1 Action, Adventure, Supernatural, Hirohiko Araki  Historical, Mystery, Horror ...  

Код, который у меня сейчас есть, таков

 import pandas as pd import json with open('./response.json') as file:  data = json.load(file)  df = pd.json_normalize(data['data'])  

Этот код содержит столбцы жанров и авторов в этом формате

 node.genres node.authors 0 [{'id': 1, 'name': 'Action'}, {'id': 2, 'name'... [{'node': {'id': 1868, 'first_name': 'Kentarou... 1 [{'id': 1, 'name': 'Action'}, {'id': 2, 'name'... [{'node': {'id': 2619, 'first_name': 'Hirohiko... 2 [{'id': 1, 'name': 'Action'}, {'id': 2, 'name'... [{'node': {'id': 1881, 'first_name': 'Eiichiro... 3 [{'id': 1, 'name': 'Action'}, {'id': 2, 'name'... [{'node': {'id': 1911, 'first_name': 'Takehiko... 4 [{'id': 8, 'name': 'Drama'}, {'id': 7, 'name':... [{'node': {'id': 1867, 'first_name': 'Naoki', ...  

Я хочу знать, могу ли я использовать трюк, который может дать мне предпочтительный тип без перебора всех строк.

Ответ №1:

В качестве альтернативы вы можете выполнить одно из следующих действий. В ограниченном тестировании быстрее всего использовать методы apply() поверх str.

Вариант 1

 df['genres'] = df.apply(lambda x: ', '.join([d['name'] for d in x['node.genres']]), axis=1) df['first_name'] = df['node.authors'].str[0].str['node'].str['first_name'] df['last_name'] = df['node.authors'].str[0].str['node'].str['last_name']  

Вариант 2

 df['genres'] = df.apply(lambda x: ', '.join([d['name'] for d in x['node.genres']]), axis=1) df['first_name'] = df.apply(lambda x: x['node.authors'][0]['node']['first_name'], axis=1) df['last_name'] = df.apply(lambda x: x['node.authors'][0]['node']['last_name'], axis=1)  

Измененный вариант 2 (самый быстрый)

 df['genres'] = df.apply(lambda x: ', '.join([d['name'] for d in x['node.genres']]), axis=1) df [['first_name', 'last_name']] = df.apply(lambda x: (x['node.authors'][0]['node']['first_name'], x['node.authors'][0]['node']['last_name']), axis=1)  

Результаты Timeit:

 answer 1 from not_speshal:  1.94 ms ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  My options in order: 1.45 ms ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.35 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 925 µs ± 6.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  

Комментарии:

1. пожалуйста, добавьте результаты timeit в актив тестов скорости

2. лучшие результаты включены в ответ; все время после создания начального df. Было бы интересно посмотреть на полный набор данных, хотя

Ответ №2:

Создайте отдельные фреймы данных авторов и жанров и присоединитесь к оригиналу df :

 authors = pd.json_normalize(data["data"], record_path=["node",["authors"]])[["node.first_name", "node.last_name"]] genres = pd.Series([", ".join(g["name"] for g in d) for d in df["node.genres"].tolist()]).rename("genres") output = df.join(authors.join(genres))