#python #pandas #dataframe
Вопрос:
У меня есть следующие данные о событиях, происходящих на станциях.
stationId start end
0 0 2021-03-01 02:00:00 2021-03-01 05:00:00
1 0 2021-03-01 07:00:00 2021-03-01 08:30:00
2 0 2021-03-01 04:00:00 2021-03-01 09:30:00
3 1 ... ...
Одновременно может происходить более одного события.
Я хочу знать количество часов, в течение которых произошло n событий между двумя датами (например, с 2021-03-01 00:00:00 по 2021-03-01 10:00:00).
Желаемый результат должен выглядеть примерно так (где count-это количество параллельных событий, а часы-количество часов, в течение которых было активировано столько параллельных событий).
stationId count hours
0 0 0 2.5
1 0 1 5.0
2 0 2 2.5
3 1 ... ...
Поскольку это довольно трудно понять в обычном тексте, это изображение может помочь понять, чего я хочу достичь.
Визуальное представление алгоритма:-
дополнительная информация
Входной файл содержит более 500 000 строк с примерно 800 станциями.
Ответ №1:
Я могу сказать вам, как бы я подошел к этому. По сути, вы получите «симуляцию» действия.
Я бы немного реорганизовал данные, чтобы разделить каждую запись на две строки с одной меткой времени, вот так:
data1 = []
for row in data:
data1.append( (data['stationid'], "start", data['start'] )
data1.append( (data['stationid'], "end", data['end'] )
Это дает вам:
stationid activity timestamp
0 start 2021-03-01 02:00:00
0 end 2021-03-01 05:00:00
0 start 2021-03-01 07:00:00
0 end 2021-03-01 08:30:00
0 start 2021-03-01 04:00:00
0 end 2021-03-01 09:30:00
Как только вы это сделаете, вы сможете отсортировать весь список по метке времени. Это дает вам хронологическую запись того, что именно произошло и в каком порядке:
stationid activity timestamp
0 start 2021-03-01 02:00:00
0 start 2021-03-01 04:00:00
0 end 2021-03-01 05:00:00
0 start 2021-03-01 07:00:00
0 end 2021-03-01 08:30:00
0 end 2021-03-01 09:30:00
С этими данными задача становится довольно простой.
class State(object):
def __init__(self, stn, chg):
self.station = stn
self.count = 0
self.lastchange = chg
state = {}
newdata = []
for row in data:
# We have a change in the status of "stationid", so go save an entry.
station = row['stationid']
if stationid in state:
delta = row['timestamp'] - state[station].lastchange;
newdata.append( station, state[station.count], delta )
state[station].lastchange = row['timestamp']
else:
state[station] = State(station, row['timestamp'])
if activity == "start":
state[station].count = 1
else:
state[station].count -= 1
Ответ №2:
Хорошо, вот полная реализация, от случайно сгенерированных данных до окончательных результатов.
import sys
import pandas as pd
import random
import datetime
from pprint import pprint
base = datetime.datetime(2021,3,21,0,0,0)
def generateRandomData(n):
for i in range(n):
station = random.randint(0,9)
start = random.randint(0,40) * 15
end = start random.randint(0,12) * 15
start = base datetime.timedelta(minutes=start)
end = base datetime.timedelta(minutes=end)
yield( station, start, end )
df = pd.DataFrame( generateRandomData(30), columns=("stationId", "start", "end"))
print( df )
df1 = []
for row in df.iterrows():
row=row[1]
df1.append({'stn': row['stationId'], 'event': "start", 'time':row['start']})
df1.append({'stn': row['stationId'], 'event': "end", 'time':row['end']})
df1 = sorted( df1, key=lambda row: row['time'] )
pprint( df1 )
class State(object):
def __init__(self, stn, chg):
self.station = stn
self.count = 0
self.lastchange = chg
state = {}
newdata = []
for row in df1:
# We have a change in the status of "stationid", so go save an entry.
station = row['stn']
if station in state:
delta = row['time'] - state[station].lastchange;
newdata.append( (station, state[station].count, delta) )
state[station].lastchange = row['time']
else:
state[station] = State(station, row['time'])
if row['event'] == "start":
state[station].count = 1
else:
state[station].count -= 1
pprint(newdata)
df = pd.DataFrame(newdata, columns=("stationId", "count", "hours" ) )
print(df)
Вот результат. Последний столбец времени-это поле timedelta, но его достаточно легко преобразовать в часы.
stationId start end
0 1 2021-03-21 07:00:00 2021-03-21 08:45:00
1 1 2021-03-21 05:45:00 2021-03-21 07:00:00
2 8 2021-03-21 06:45:00 2021-03-21 07:30:00
3 7 2021-03-21 07:45:00 2021-03-21 10:30:00
4 2 2021-03-21 02:00:00 2021-03-21 02:15:00
5 7 2021-03-21 00:15:00 2021-03-21 03:00:00
6 5 2021-03-21 02:45:00 2021-03-21 05:15:00
7 2 2021-03-21 05:00:00 2021-03-21 05:45:00
8 9 2021-03-21 09:00:00 2021-03-21 12:00:00
9 5 2021-03-21 07:00:00 2021-03-21 09:45:00
10 1 2021-03-21 05:45:00 2021-03-21 08:15:00
11 3 2021-03-21 02:00:00 2021-03-21 05:00:00
12 0 2021-03-21 07:45:00 2021-03-21 08:30:00
13 3 2021-03-21 00:00:00 2021-03-21 00:15:00
14 8 2021-03-21 03:00:00 2021-03-21 06:00:00
15 4 2021-03-21 03:45:00 2021-03-21 05:45:00
16 9 2021-03-21 06:15:00 2021-03-21 09:00:00
17 7 2021-03-21 01:45:00 2021-03-21 01:45:00
18 0 2021-03-21 00:15:00 2021-03-21 03:15:00
19 2 2021-03-21 08:15:00 2021-03-21 08:45:00
20 4 2021-03-21 00:30:00 2021-03-21 00:30:00
21 4 2021-03-21 08:30:00 2021-03-21 10:15:00
22 6 2021-03-21 04:45:00 2021-03-21 06:00:00
23 7 2021-03-21 00:45:00 2021-03-21 03:15:00
24 8 2021-03-21 01:00:00 2021-03-21 04:00:00
25 8 2021-03-21 07:15:00 2021-03-21 08:45:00
26 7 2021-03-21 01:30:00 2021-03-21 04:15:00
27 6 2021-03-21 02:45:00 2021-03-21 04:00:00
28 5 2021-03-21 05:45:00 2021-03-21 06:45:00
29 5 2021-03-21 00:30:00 2021-03-21 01:45:00
[{'event': 'start', 'stn': 3, 'time': Timestamp('2021-03-21 00:00:00')},
{'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 00:15:00')},
{'event': 'end', 'stn': 3, 'time': Timestamp('2021-03-21 00:15:00')},
{'event': 'start', 'stn': 0, 'time': Timestamp('2021-03-21 00:15:00')},
{'event': 'start', 'stn': 4, 'time': Timestamp('2021-03-21 00:30:00')},
{'event': 'end', 'stn': 4, 'time': Timestamp('2021-03-21 00:30:00')},
{'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 00:30:00')},
{'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 00:45:00')},
{'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 01:00:00')},
{'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 01:30:00')},
{'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 01:45:00')},
{'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 01:45:00')},
{'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 01:45:00')},
{'event': 'start', 'stn': 2, 'time': Timestamp('2021-03-21 02:00:00')},
{'event': 'start', 'stn': 3, 'time': Timestamp('2021-03-21 02:00:00')},
{'event': 'end', 'stn': 2, 'time': Timestamp('2021-03-21 02:15:00')},
{'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 02:45:00')},
{'event': 'start', 'stn': 6, 'time': Timestamp('2021-03-21 02:45:00')},
{'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 03:00:00')},
{'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 03:00:00')},
{'event': 'end', 'stn': 0, 'time': Timestamp('2021-03-21 03:15:00')},
{'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 03:15:00')},
{'event': 'start', 'stn': 4, 'time': Timestamp('2021-03-21 03:45:00')},
{'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 04:00:00')},
{'event': 'end', 'stn': 6, 'time': Timestamp('2021-03-21 04:00:00')},
{'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 04:15:00')},
{'event': 'start', 'stn': 6, 'time': Timestamp('2021-03-21 04:45:00')},
{'event': 'start', 'stn': 2, 'time': Timestamp('2021-03-21 05:00:00')},
{'event': 'end', 'stn': 3, 'time': Timestamp('2021-03-21 05:00:00')},
{'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 05:15:00')},
{'event': 'start', 'stn': 1, 'time': Timestamp('2021-03-21 05:45:00')},
{'event': 'end', 'stn': 2, 'time': Timestamp('2021-03-21 05:45:00')},
{'event': 'start', 'stn': 1, 'time': Timestamp('2021-03-21 05:45:00')},
{'event': 'end', 'stn': 4, 'time': Timestamp('2021-03-21 05:45:00')},
{'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 05:45:00')},
{'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 06:00:00')},
{'event': 'end', 'stn': 6, 'time': Timestamp('2021-03-21 06:00:00')},
{'event': 'start', 'stn': 9, 'time': Timestamp('2021-03-21 06:15:00')},
{'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 06:45:00')},
{'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 06:45:00')},
{'event': 'start', 'stn': 1, 'time': Timestamp('2021-03-21 07:00:00')},
{'event': 'end', 'stn': 1, 'time': Timestamp('2021-03-21 07:00:00')},
{'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 07:00:00')},
{'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 07:15:00')},
{'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 07:30:00')},
{'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 07:45:00')},
{'event': 'start', 'stn': 0, 'time': Timestamp('2021-03-21 07:45:00')},
{'event': 'end', 'stn': 1, 'time': Timestamp('2021-03-21 08:15:00')},
{'event': 'start', 'stn': 2, 'time': Timestamp('2021-03-21 08:15:00')},
{'event': 'end', 'stn': 0, 'time': Timestamp('2021-03-21 08:30:00')},
{'event': 'start', 'stn': 4, 'time': Timestamp('2021-03-21 08:30:00')},
{'event': 'end', 'stn': 1, 'time': Timestamp('2021-03-21 08:45:00')},
{'event': 'end', 'stn': 2, 'time': Timestamp('2021-03-21 08:45:00')},
{'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 08:45:00')},
{'event': 'start', 'stn': 9, 'time': Timestamp('2021-03-21 09:00:00')},
{'event': 'end', 'stn': 9, 'time': Timestamp('2021-03-21 09:00:00')},
{'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 09:45:00')},
{'event': 'end', 'stn': 4, 'time': Timestamp('2021-03-21 10:15:00')},
{'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 10:30:00')},
{'event': 'end', 'stn': 9, 'time': Timestamp('2021-03-21 12:00:00')}]
[(3, 1, Timedelta('0 days 00:15:00')),
(4, 1, Timedelta('0 days 00:00:00')),
(7, 1, Timedelta('0 days 00:30:00')),
(7, 2, Timedelta('0 days 00:45:00')),
(7, 3, Timedelta('0 days 00:15:00')),
(7, 4, Timedelta('0 days 00:00:00')),
(5, 1, Timedelta('0 days 01:15:00')),
(3, 0, Timedelta('0 days 01:45:00')),
(2, 1, Timedelta('0 days 00:15:00')),
(5, 0, Timedelta('0 days 01:00:00')),
(7, 3, Timedelta('0 days 01:15:00')),
(8, 1, Timedelta('0 days 02:00:00')),
(0, 1, Timedelta('0 days 03:00:00')),
(7, 2, Timedelta('0 days 00:15:00')),
(4, 0, Timedelta('0 days 03:15:00')),
(8, 2, Timedelta('0 days 01:00:00')),
(6, 1, Timedelta('0 days 01:15:00')),
(7, 1, Timedelta('0 days 01:00:00')),
(6, 0, Timedelta('0 days 00:45:00')),
(2, 0, Timedelta('0 days 02:45:00')),
(3, 1, Timedelta('0 days 03:00:00')),
(5, 1, Timedelta('0 days 02:30:00')),
(2, 1, Timedelta('0 days 00:45:00')),
(1, 1, Timedelta('0 days 00:00:00')),
(4, 1, Timedelta('0 days 02:00:00')),
(5, 0, Timedelta('0 days 00:30:00')),
(8, 1, Timedelta('0 days 02:00:00')),
(6, 1, Timedelta('0 days 01:15:00')),
(8, 0, Timedelta('0 days 00:45:00')),
(5, 1, Timedelta('0 days 01:00:00')),
(1, 2, Timedelta('0 days 01:15:00')),
(1, 3, Timedelta('0 days 00:00:00')),
(5, 0, Timedelta('0 days 00:15:00')),
(8, 1, Timedelta('0 days 00:30:00')),
(8, 2, Timedelta('0 days 00:15:00')),
(7, 0, Timedelta('0 days 03:30:00')),
(0, 0, Timedelta('0 days 04:30:00')),
(1, 2, Timedelta('0 days 01:15:00')),
(2, 0, Timedelta('0 days 02:30:00')),
(0, 1, Timedelta('0 days 00:45:00')),
(4, 0, Timedelta('0 days 02:45:00')),
(1, 1, Timedelta('0 days 00:30:00')),
(2, 1, Timedelta('0 days 00:30:00')),
(8, 1, Timedelta('0 days 01:15:00')),
(9, 1, Timedelta('0 days 02:45:00')),
(9, 2, Timedelta('0 days 00:00:00')),
(5, 1, Timedelta('0 days 02:45:00')),
(4, 1, Timedelta('0 days 01:45:00')),
(7, 1, Timedelta('0 days 02:45:00')),
(9, 1, Timedelta('0 days 03:00:00'))]
stationId count hours
0 3 1 0 days 00:15:00
1 4 1 0 days 00:00:00
2 7 1 0 days 00:30:00
3 7 2 0 days 00:45:00
4 7 3 0 days 00:15:00
5 7 4 0 days 00:00:00
6 5 1 0 days 01:15:00
7 3 0 0 days 01:45:00
8 2 1 0 days 00:15:00
9 5 0 0 days 01:00:00
10 7 3 0 days 01:15:00
11 8 1 0 days 02:00:00
12 0 1 0 days 03:00:00
13 7 2 0 days 00:15:00
14 4 0 0 days 03:15:00
15 8 2 0 days 01:00:00
16 6 1 0 days 01:15:00
17 7 1 0 days 01:00:00
18 6 0 0 days 00:45:00
19 2 0 0 days 02:45:00
20 3 1 0 days 03:00:00
21 5 1 0 days 02:30:00
22 2 1 0 days 00:45:00
23 1 1 0 days 00:00:00
24 4 1 0 days 02:00:00
25 5 0 0 days 00:30:00
26 8 1 0 days 02:00:00
27 6 1 0 days 01:15:00
28 8 0 0 days 00:45:00
29 5 1 0 days 01:00:00
30 1 2 0 days 01:15:00
31 1 3 0 days 00:00:00
32 5 0 0 days 00:15:00
33 8 1 0 days 00:30:00
34 8 2 0 days 00:15:00
35 7 0 0 days 03:30:00
36 0 0 0 days 04:30:00
37 1 2 0 days 01:15:00
38 2 0 0 days 02:30:00
39 0 1 0 days 00:45:00
40 4 0 0 days 02:45:00
41 1 1 0 days 00:30:00
42 2 1 0 days 00:30:00
43 8 1 0 days 01:15:00
44 9 1 0 days 02:45:00
45 9 2 0 days 00:00:00
46 5 1 0 days 02:45:00
47 4 1 0 days 01:45:00
48 7 1 0 days 02:45:00
49 9 1 0 days 03:00:00