#python #pandas
#python #pandas
Вопрос:
Я пытался найти время, затраченное на наиболее активную деятельность на человека в этом наборе данных:
name activity timestamp money_spent
0 Chandler Bing party 2017-08-04 08:00:00 51
1 Chandler Bing party 2017-08-04 13:00:00 60
2 Chandler Bing party 2017-08-04 15:00:00 59
5 Harry Kane party 2017-08-04 07:00:00 68
4 Harry Kane party 2017-08-04 11:00:00 90
3 Harry Kane football 2017-08-04 13:00:00 80
11 Joey Tribbiani football 2017-08-04 08:00:00 84
9 Joey Tribbiani party 2017-08-04 09:00:00 54
10 Joey Tribbiani party 2017-08-04 10:00:00 67
6 John Doe beach 2017-08-04 07:00:00 63
7 John Doe beach 2017-08-04 12:00:00 61
8 John Doe beach 2017-08-04 14:00:00 65
12 Monica Geller travel 2017-08-04 07:00:00 90
13 Monica Geller travel 2017-08-04 08:00:00 96
14 Monica Geller travel 2017-08-04 09:00:00 74
15 Phoebe Buffey travel 2017-08-04 10:00:00 52
16 Phoebe Buffey travel 2017-08-04 12:00:00 84
17 Phoebe Buffey football 2017-08-04 15:00:00 58
18 Ross Geller party 2017-08-04 09:00:00 96
19 Ross Geller party 2017-08-04 11:00:00 81
20 Ross Geller travel 2017-08-04 14:00:00 60
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')
df # party day 2017-08-04 for some guys.
# find most involved activity and time spent on that activity per person.
Требуемый вывод:
activity_num activity time_diff
name
Chandler Bing 1.0 party 07:00:00
Harry Kane 2.0 party 04:00:00
Joey Tribbiani 2.0 party 02:00:00
John Doe 1.0 beach 07:00:00
Monica Geller 1.0 travel 02:00:00
Phoebe Buffey 2.0 travel 03:00:00
Ross Geller 2.0 travel 03:00:00
Примечание: Гарри Кейн веселился с 7 до 11 утра, поэтому ответ для него — 4 часа.
df.head()
name activity timestamp money_spent
0 Chandler Bing party 2017-08-04 08:00:00 51
1 Chandler Bing party 2017-08-04 13:00:00 60
2 Chandler Bing party 2017-08-04 15:00:00 59
3 Harry Kane football 2017-08-04 13:00:00 80
4 Harry Kane party 2017-08-04 11:00:00 90
5 Harry Kane party 2017-08-04 07:00:00 68
Моя попытка:
df.groupby(['name','activity'])['timestamp'].max() # no idea
Комментарии:
1. Не должно
'Phoebe Buffey'
быть 2 часов и'Ross Geller'
быть 2 часа в пути?
Ответ №1:
Проверьте ниже
s=df.groupby(['name','activity']).timestamp.apply(pd.Series.ptp).reset_index()
#s=df.groupby(['name','activity']).timestamp.apply(np.ptp).reset_index()
uni=s.groupby('name').activity.nunique()
s=s.sort_values('timestamp').drop_duplicates('name',keep='last')
s['numberofact']=s.name.map(uni)
s
name activity timestamp numberofact
4 Joey Tribbiani party 01:00:00 2
6 Monica Geller travel 02:00:00 1
8 Phoebe Buffey travel 02:00:00 2
9 Ross Geller party 02:00:00 2
2 Harry Kane party 04:00:00 2
0 Chandler Bing party 07:00:00 1
5 John Doe beach 07:00:00 1
Комментарии:
1.
TypeError: unsupported operand type(s) for -: 'str' and 'str'
2.
TypeError: DatetimeIndex cannot perform the operation ptp
после метки времени создается datetime dtype.3. @astro123 ты запускал
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')
4. да, если я запускаю pd.to_datetime, то я получаю ошибку operation ptp.
5. @astro123 какая у вас версия pandas?
Ответ №2:
Попробуйте это:
gb = df.groupby(['name', 'activity'])['timestamp']
print((gb.max() - gb.min()).sort_values(ascending=False).reset_index().drop_duplicates(subset='name'))
Вывод:
name activity timestamp
0 John Doe beach 07:00:00
1 Chandler Bing party 07:00:00
2 Harry Kane party 04:00:00
3 Ross Geller party 02:00:00
4 Phoebe Buffey travel 02:00:00
5 Monica Geller travel 02:00:00
6 Joey Tribbiani party 01:00:00
Комментарии:
1. Согласен с @Wen-Ben Да, Фиби играла в футбол с 12:00 до 15:00, поэтому время для нее должно составлять 3 часа, а не 2.
2. @Wen-Ben ладно, я неправильно понял формат входных данных. Да, это неправильно.
Ответ №3:
Это абсолютно (вероятно) неподходящий способ сделать это, но мы начинаем:
### Get the max of the timestampe into separate dataframes
df_max = df.loc[df.groupby(['name','activity',])['timestamp'].idxmax()].reset_index(drop=True)
df_min = df.loc[df.groupby(['name','activity',])['timestamp'].idxmin()].reset_index(drop=True)
### Merge those puppies on the index values
df_tot = df_max.merge(df_min, how='outer', left_index=True, right_index=True, suffixes= ('_max', '_min'))
### Subtract the max timestamp from the minimum timestamp
df_tot['net time'] = df_tot['timestamp_max'] - df_tot['timestamp_min']
### Drop unnecessary columns
df_tot.drop(['name_min','activity_min','timestamp_min','money_spent_min', 'money_spent_max','timestamp_max'], axis=1, inplace=True)
### Rename our columns
df_tot = df_tot.rename(columns={i:i.replace('_max', '') for i in df_tot.columns.values.tolist()})
### Set activity_number as the cumulative count of name
df_tot['activity_number'] = df_tot.groupby('name').cumcount() 1
### Get the max of that result
df_tot = df_tot.loc[df_tot.groupby(['name',])['net time'].idxmax()].reset_index(drop=True)
### Rearrange our results
df_tot = df_tot.reindex(columns=['name','activity_number', 'net time']).copy()
Вывод:
name activity_number net time
0 Chandler Bing 1 07:00:00
1 Harry Kane 2 04:00:00
2 Joey Tribbiani 2 01:00:00
3 John Doe 1 07:00:00
4 Monica Geller 1 02:00:00
5 Phoebe Buffey 2 02:00:00
6 Ross Geller 1 02:00:00
Ответ №4:
Я провел обсуждение с @Wen-Ben, получил предложения и получил ответ.
# find maximum time spent for a given activity
df = pd.DataFrame([['Chandler Bing','party','2017-08-04 08:00:00',51],
['Chandler Bing','party','2017-08-04 13:00:00',60],
['Chandler Bing','party','2017-08-04 15:00:00',59],
['Harry Kane','football','2017-08-04 13:00:00',80],
['Harry Kane','party','2017-08-04 11:00:00',90],
['Harry Kane','party','2017-08-04 07:00:00',68],
['John Doe','beach','2017-08-04 07:00:00',63],
['John Doe','beach','2017-08-04 12:00:00',61],
['John Doe','beach','2017-08-04 14:00:00',65],
['Joey Tribbiani','party','2017-08-04 09:00:00',54],
['Joey Tribbiani','party','2017-08-04 10:00:00',67],
['Joey Tribbiani','football','2017-08-04 08:00:00',84],
['Monica Geller','travel','2017-08-04 07:00:00',90],
['Monica Geller','travel','2017-08-04 08:00:00',96],
['Monica Geller','travel','2017-08-04 09:00:00',74],
['Phoebe Buffey','travel','2017-08-04 10:00:00',52],
['Phoebe Buffey','travel','2017-08-04 12:00:00',84],
['Phoebe Buffey','football','2017-08-04 15:00:00',58],
['Ross Geller','party','2017-08-04 09:00:00',96],
['Ross Geller','party','2017-08-04 11:00:00',81],
['Ross Geller','travel','2017-08-04 14:00:00',60]],
columns=['name','activity','timestamp','money_spent'])
Фрейм данных после сортировки:
name activity timestamp money_spent
0 Chandler Bing party 2017-08-04 08:00:00 51
1 Chandler Bing party 2017-08-04 13:00:00 60
2 Chandler Bing party 2017-08-04 15:00:00 59
5 Harry Kane party 2017-08-04 07:00:00 68
4 Harry Kane party 2017-08-04 11:00:00 90
3 Harry Kane football 2017-08-04 13:00:00 80
11 Joey Tribbiani football 2017-08-04 08:00:00 84
9 Joey Tribbiani party 2017-08-04 09:00:00 54
10 Joey Tribbiani party 2017-08-04 10:00:00 67
6 John Doe beach 2017-08-04 07:00:00 63
7 John Doe beach 2017-08-04 12:00:00 61
8 John Doe beach 2017-08-04 14:00:00 65
12 Monica Geller travel 2017-08-04 07:00:00 90
13 Monica Geller travel 2017-08-04 08:00:00 96
14 Monica Geller travel 2017-08-04 09:00:00 74
15 Phoebe Buffey travel 2017-08-04 10:00:00 52
16 Phoebe Buffey travel 2017-08-04 12:00:00 84
17 Phoebe Buffey football 2017-08-04 15:00:00 58
18 Ross Geller party 2017-08-04 09:00:00 96
19 Ross Geller party 2017-08-04 11:00:00 81
20 Ross Geller travel 2017-08-04 14:00:00 60
Ответ:
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['name','timestamp'])
df['activity_change'] = (df.activity!=df.activity.shift()) | (df.name!=df.name.shift())
df['activity_num'] = df.groupby('name')['activity_change'].cumsum()
df['time_diff'] = df['timestamp'].diff()
df.loc[df.name != df.name.shift(), 'time_diff'] = None
activity_duration = df.groupby(['name','activity_num','activity'])['time_diff'].sum()
df1 = activity_duration.reset_index().groupby('name').max().drop('activity_num',1)
print(df1)
Дает:
activity time_diff
name
Chandler Bing party 07:00:00
Harry Kane party 04:00:00
Joey Tribbiani party 02:00:00
John Doe beach 07:00:00
Monica Geller travel 02:00:00
Phoebe Buffey travel 03:00:00
Ross Geller travel 03:00:00
Объяснение:
Для Фиби первое занятие — это путешествие с 10 утра до 12 ночи, это 2 часа. Второе действие — 12:00, для 3mp — 3 часа. на футбол в 3 часа дня не затрачено время, поэтому оно не учитывается.
Итак, время для ответа Фиби составляет 3 часа.