Python pandas: как найти разницу во времени на основе максимального количества других столбцов?

#python #pandas

#python #pandas

Вопрос:

Я пытался найти время, затраченное на наиболее активную деятельность на человека в этом наборе данных:

               name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
5       Harry Kane     party 2017-08-04 07:00:00           68
4       Harry Kane     party 2017-08-04 11:00:00           90
3       Harry Kane  football 2017-08-04 13:00:00           80
11  Joey Tribbiani  football 2017-08-04 08:00:00           84
9   Joey Tribbiani     party 2017-08-04 09:00:00           54
10  Joey Tribbiani     party 2017-08-04 10:00:00           67
6         John Doe     beach 2017-08-04 07:00:00           63
7         John Doe     beach 2017-08-04 12:00:00           61
8         John Doe     beach 2017-08-04 14:00:00           65
12   Monica Geller    travel 2017-08-04 07:00:00           90
13   Monica Geller    travel 2017-08-04 08:00:00           96
14   Monica Geller    travel 2017-08-04 09:00:00           74
15   Phoebe Buffey    travel 2017-08-04 10:00:00           52
16   Phoebe Buffey    travel 2017-08-04 12:00:00           84
17   Phoebe Buffey  football 2017-08-04 15:00:00           58
18     Ross Geller     party 2017-08-04 09:00:00           96
19     Ross Geller     party 2017-08-04 11:00:00           81
20     Ross Geller    travel 2017-08-04 14:00:00           60

df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')

df # party day 2017-08-04 for some guys.
# find most involved activity and time spent on that activity per person.
  

Требуемый вывод:

                 activity_num activity time_diff
name                                           
Chandler Bing            1.0    party  07:00:00
Harry Kane               2.0    party  04:00:00
Joey Tribbiani           2.0    party  02:00:00
John Doe                 1.0    beach  07:00:00
Monica Geller            1.0   travel  02:00:00
Phoebe Buffey            2.0   travel  03:00:00
Ross Geller              2.0   travel  03:00:00
  

Примечание: Гарри Кейн веселился с 7 до 11 утра, поэтому ответ для него — 4 часа.

 df.head()
              name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
3       Harry Kane  football 2017-08-04 13:00:00           80
4       Harry Kane     party 2017-08-04 11:00:00           90
5       Harry Kane     party 2017-08-04 07:00:00           68
  

Моя попытка:

 df.groupby(['name','activity'])['timestamp'].max() # no idea
  

Комментарии:

1. Не должно 'Phoebe Buffey' быть 2 часов и 'Ross Geller' быть 2 часа в пути?

Ответ №1:

Проверьте ниже

 s=df.groupby(['name','activity']).timestamp.apply(pd.Series.ptp).reset_index()
#s=df.groupby(['name','activity']).timestamp.apply(np.ptp).reset_index()
uni=s.groupby('name').activity.nunique()
s=s.sort_values('timestamp').drop_duplicates('name',keep='last')
s['numberofact']=s.name.map(uni)
s
             name activity timestamp  numberofact
4  Joey Tribbiani    party  01:00:00            2
6   Monica Geller   travel  02:00:00            1
8   Phoebe Buffey   travel  02:00:00            2
9     Ross Geller    party  02:00:00            2
2      Harry Kane    party  04:00:00            2
0   Chandler Bing    party  07:00:00            1
5        John Doe    beach  07:00:00            1
  

Комментарии:

1. TypeError: unsupported operand type(s) for -: 'str' and 'str'

2. TypeError: DatetimeIndex cannot perform the operation ptp после метки времени создается datetime dtype.

3. @astro123 ты запускал df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')

4. да, если я запускаю pd.to_datetime, то я получаю ошибку operation ptp.

5. @astro123 какая у вас версия pandas?

Ответ №2:

Попробуйте это:

 gb = df.groupby(['name', 'activity'])['timestamp']

print((gb.max() - gb.min()).sort_values(ascending=False).reset_index().drop_duplicates(subset='name'))
  

Вывод:

              name activity timestamp
0        John Doe    beach  07:00:00
1   Chandler Bing    party  07:00:00
2      Harry Kane    party  04:00:00
3     Ross Geller    party  02:00:00
4   Phoebe Buffey   travel  02:00:00
5   Monica Geller   travel  02:00:00
6  Joey Tribbiani    party  01:00:00
  

Комментарии:

1. Согласен с @Wen-Ben Да, Фиби играла в футбол с 12:00 до 15:00, поэтому время для нее должно составлять 3 часа, а не 2.

2. @Wen-Ben ладно, я неправильно понял формат входных данных. Да, это неправильно.

Ответ №3:

Это абсолютно (вероятно) неподходящий способ сделать это, но мы начинаем:

 ### Get the max of the timestampe into separate dataframes
df_max = df.loc[df.groupby(['name','activity',])['timestamp'].idxmax()].reset_index(drop=True)
df_min = df.loc[df.groupby(['name','activity',])['timestamp'].idxmin()].reset_index(drop=True)

### Merge those puppies on the index values
df_tot = df_max.merge(df_min, how='outer', left_index=True, right_index=True, suffixes= ('_max', '_min'))

### Subtract the max timestamp from the minimum timestamp
df_tot['net time'] = df_tot['timestamp_max'] - df_tot['timestamp_min']

### Drop unnecessary columns
df_tot.drop(['name_min','activity_min','timestamp_min','money_spent_min', 'money_spent_max','timestamp_max'], axis=1, inplace=True)

### Rename our columns
df_tot = df_tot.rename(columns={i:i.replace('_max', '') for i in df_tot.columns.values.tolist()})

### Set activity_number as the cumulative count of name
df_tot['activity_number'] = df_tot.groupby('name').cumcount()   1

### Get the max of that result
df_tot = df_tot.loc[df_tot.groupby(['name',])['net time'].idxmax()].reset_index(drop=True)

### Rearrange our results
df_tot = df_tot.reindex(columns=['name','activity_number', 'net time']).copy()
  

Вывод:

              name  activity_number net time
0   Chandler Bing                1 07:00:00
1      Harry Kane                2 04:00:00
2  Joey Tribbiani                2 01:00:00
3        John Doe                1 07:00:00
4   Monica Geller                1 02:00:00
5   Phoebe Buffey                2 02:00:00
6     Ross Geller                1 02:00:00
  

Ответ №4:

Я провел обсуждение с @Wen-Ben, получил предложения и получил ответ.

 # find maximum time spent for a given activity
df = pd.DataFrame([['Chandler Bing','party','2017-08-04 08:00:00',51],
 ['Chandler Bing','party','2017-08-04 13:00:00',60],
 ['Chandler Bing','party','2017-08-04 15:00:00',59],
 ['Harry Kane','football','2017-08-04 13:00:00',80],
 ['Harry Kane','party','2017-08-04 11:00:00',90],
 ['Harry Kane','party','2017-08-04 07:00:00',68],
 ['John Doe','beach','2017-08-04 07:00:00',63],
 ['John Doe','beach','2017-08-04 12:00:00',61],
 ['John Doe','beach','2017-08-04 14:00:00',65],
 ['Joey Tribbiani','party','2017-08-04 09:00:00',54],
 ['Joey Tribbiani','party','2017-08-04 10:00:00',67],
 ['Joey Tribbiani','football','2017-08-04 08:00:00',84],
 ['Monica Geller','travel','2017-08-04 07:00:00',90],
 ['Monica Geller','travel','2017-08-04 08:00:00',96],
 ['Monica Geller','travel','2017-08-04 09:00:00',74],
 ['Phoebe Buffey','travel','2017-08-04 10:00:00',52],
 ['Phoebe Buffey','travel','2017-08-04 12:00:00',84],
 ['Phoebe Buffey','football','2017-08-04 15:00:00',58],
 ['Ross Geller','party','2017-08-04 09:00:00',96],
 ['Ross Geller','party','2017-08-04 11:00:00',81],
 ['Ross Geller','travel','2017-08-04 14:00:00',60]],
 columns=['name','activity','timestamp','money_spent'])
  

Фрейм данных после сортировки:

               name  activity           timestamp  money_spent
0    Chandler Bing     party 2017-08-04 08:00:00           51
1    Chandler Bing     party 2017-08-04 13:00:00           60
2    Chandler Bing     party 2017-08-04 15:00:00           59
5       Harry Kane     party 2017-08-04 07:00:00           68
4       Harry Kane     party 2017-08-04 11:00:00           90
3       Harry Kane  football 2017-08-04 13:00:00           80
11  Joey Tribbiani  football 2017-08-04 08:00:00           84
9   Joey Tribbiani     party 2017-08-04 09:00:00           54
10  Joey Tribbiani     party 2017-08-04 10:00:00           67
6         John Doe     beach 2017-08-04 07:00:00           63
7         John Doe     beach 2017-08-04 12:00:00           61
8         John Doe     beach 2017-08-04 14:00:00           65
12   Monica Geller    travel 2017-08-04 07:00:00           90
13   Monica Geller    travel 2017-08-04 08:00:00           96
14   Monica Geller    travel 2017-08-04 09:00:00           74
15   Phoebe Buffey    travel 2017-08-04 10:00:00           52
16   Phoebe Buffey    travel 2017-08-04 12:00:00           84
17   Phoebe Buffey  football 2017-08-04 15:00:00           58
18     Ross Geller     party 2017-08-04 09:00:00           96
19     Ross Geller     party 2017-08-04 11:00:00           81
20     Ross Geller    travel 2017-08-04 14:00:00           60
  

Ответ:

 df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['name','timestamp'])

df['activity_change'] = (df.activity!=df.activity.shift()) | (df.name!=df.name.shift())
df['activity_num'] = df.groupby('name')['activity_change'].cumsum()

df['time_diff'] = df['timestamp'].diff()
df.loc[df.name != df.name.shift(), 'time_diff'] = None

activity_duration = df.groupby(['name','activity_num','activity'])['time_diff'].sum()
df1 = activity_duration.reset_index().groupby('name').max().drop('activity_num',1)
print(df1)
  

Дает:

                activity time_diff
name                             
Chandler Bing     party  07:00:00
Harry Kane        party  04:00:00
Joey Tribbiani    party  02:00:00
John Doe          beach  07:00:00
Monica Geller    travel  02:00:00
Phoebe Buffey    travel  03:00:00
Ross Geller      travel  03:00:00
  

Объяснение:
Для Фиби первое занятие — это путешествие с 10 утра до 12 ночи, это 2 часа. Второе действие — 12:00, для 3mp — 3 часа. на футбол в 3 часа дня не затрачено время, поэтому оно не учитывается.
Итак, время для ответа Фиби составляет 3 часа.