Как получить количество времени с подсчетом параллельных событий между двумя датами

#python #pandas #dataframe

Вопрос:

У меня есть следующие данные о событиях, происходящих на станциях.

    stationId               start                 end
0          0 2021-03-01 02:00:00 2021-03-01 05:00:00
1          0 2021-03-01 07:00:00 2021-03-01 08:30:00
2          0 2021-03-01 04:00:00 2021-03-01 09:30:00
3          1          ...                  ...
 

Одновременно может происходить более одного события.

Я хочу знать количество часов, в течение которых произошло n событий между двумя датами (например, с 2021-03-01 00:00:00 по 2021-03-01 10:00:00).

Желаемый результат должен выглядеть примерно так (где count-это количество параллельных событий, а часы-количество часов, в течение которых было активировано столько параллельных событий).

    stationId  count  hours
0          0      0    2.5
1          0      1    5.0
2          0      2    2.5
3          1    ...    ...
 

Поскольку это довольно трудно понять в обычном тексте, это изображение может помочь понять, чего я хочу достичь.

Визуальное представление алгоритма:-

введите описание изображения здесь

дополнительная информация

Входной файл содержит более 500 000 строк с примерно 800 станциями.

Ответ №1:

Я могу сказать вам, как бы я подошел к этому. По сути, вы получите «симуляцию» действия.

Я бы немного реорганизовал данные, чтобы разделить каждую запись на две строки с одной меткой времени, вот так:

 data1 = []
for row in data:
    data1.append( (data['stationid'], "start", data['start'] )
    data1.append( (data['stationid'], "end", data['end'] )
 

Это дает вам:

 stationid  activity  timestamp
      0     start    2021-03-01 02:00:00
      0     end      2021-03-01 05:00:00
      0     start    2021-03-01 07:00:00
      0     end      2021-03-01 08:30:00
      0     start    2021-03-01 04:00:00
      0     end      2021-03-01 09:30:00
 

Как только вы это сделаете, вы сможете отсортировать весь список по метке времени. Это дает вам хронологическую запись того, что именно произошло и в каком порядке:

 stationid  activity  timestamp
      0     start    2021-03-01 02:00:00
      0     start    2021-03-01 04:00:00
      0     end      2021-03-01 05:00:00
      0     start    2021-03-01 07:00:00
      0     end      2021-03-01 08:30:00
      0     end      2021-03-01 09:30:00
 

С этими данными задача становится довольно простой.

 class State(object):
    def __init__(self, stn, chg):
        self.station = stn
        self.count = 0
        self.lastchange = chg

state = {}
newdata = []
for row in data:
    # We have a change in the status of "stationid", so go save an entry.
    station = row['stationid']
    if stationid in state:
        delta = row['timestamp'] - state[station].lastchange;
        newdata.append( station, state[station.count], delta )
        state[station].lastchange = row['timestamp']
    else:
        state[station] = State(station, row['timestamp'])
    if activity == "start":
        state[station].count  = 1
    else:
        state[station].count -= 1
 

Ответ №2:

Хорошо, вот полная реализация, от случайно сгенерированных данных до окончательных результатов.

 import sys
import pandas as pd
import random
import datetime
from pprint import pprint

base = datetime.datetime(2021,3,21,0,0,0)

def generateRandomData(n):
    for i in range(n):
        station = random.randint(0,9)
        start = random.randint(0,40) * 15
        end = start   random.randint(0,12) * 15
        start = base   datetime.timedelta(minutes=start)
        end = base   datetime.timedelta(minutes=end)
        yield( station, start, end )


df = pd.DataFrame( generateRandomData(30), columns=("stationId", "start", "end"))

print( df )

df1 = []
for row in df.iterrows():
    row=row[1]
    df1.append({'stn': row['stationId'], 'event': "start", 'time':row['start']})
    df1.append({'stn': row['stationId'], 'event': "end", 'time':row['end']})

df1 = sorted( df1, key=lambda row: row['time'] )

pprint( df1 )

class State(object):
    def __init__(self, stn, chg):
        self.station = stn
        self.count = 0
        self.lastchange = chg

state = {}
newdata = []
for row in df1:
    # We have a change in the status of "stationid", so go save an entry.
    station = row['stn']
    if station in state:
        delta = row['time'] - state[station].lastchange;
        newdata.append( (station, state[station].count, delta) )
        state[station].lastchange = row['time']
    else:
        state[station] = State(station, row['time'])
    if row['event'] == "start":
        state[station].count  = 1
    else:
        state[station].count -= 1

pprint(newdata)
df = pd.DataFrame(newdata, columns=("stationId", "count", "hours" ) )
print(df)
 

Вот результат. Последний столбец времени-это поле timedelta, но его достаточно легко преобразовать в часы.

     stationId               start                 end
0           1 2021-03-21 07:00:00 2021-03-21 08:45:00
1           1 2021-03-21 05:45:00 2021-03-21 07:00:00
2           8 2021-03-21 06:45:00 2021-03-21 07:30:00
3           7 2021-03-21 07:45:00 2021-03-21 10:30:00
4           2 2021-03-21 02:00:00 2021-03-21 02:15:00
5           7 2021-03-21 00:15:00 2021-03-21 03:00:00
6           5 2021-03-21 02:45:00 2021-03-21 05:15:00
7           2 2021-03-21 05:00:00 2021-03-21 05:45:00
8           9 2021-03-21 09:00:00 2021-03-21 12:00:00
9           5 2021-03-21 07:00:00 2021-03-21 09:45:00
10          1 2021-03-21 05:45:00 2021-03-21 08:15:00
11          3 2021-03-21 02:00:00 2021-03-21 05:00:00
12          0 2021-03-21 07:45:00 2021-03-21 08:30:00
13          3 2021-03-21 00:00:00 2021-03-21 00:15:00
14          8 2021-03-21 03:00:00 2021-03-21 06:00:00
15          4 2021-03-21 03:45:00 2021-03-21 05:45:00
16          9 2021-03-21 06:15:00 2021-03-21 09:00:00
17          7 2021-03-21 01:45:00 2021-03-21 01:45:00
18          0 2021-03-21 00:15:00 2021-03-21 03:15:00
19          2 2021-03-21 08:15:00 2021-03-21 08:45:00
20          4 2021-03-21 00:30:00 2021-03-21 00:30:00
21          4 2021-03-21 08:30:00 2021-03-21 10:15:00
22          6 2021-03-21 04:45:00 2021-03-21 06:00:00
23          7 2021-03-21 00:45:00 2021-03-21 03:15:00
24          8 2021-03-21 01:00:00 2021-03-21 04:00:00
25          8 2021-03-21 07:15:00 2021-03-21 08:45:00
26          7 2021-03-21 01:30:00 2021-03-21 04:15:00
27          6 2021-03-21 02:45:00 2021-03-21 04:00:00
28          5 2021-03-21 05:45:00 2021-03-21 06:45:00
29          5 2021-03-21 00:30:00 2021-03-21 01:45:00
[{'event': 'start', 'stn': 3, 'time': Timestamp('2021-03-21 00:00:00')},
 {'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 00:15:00')},
 {'event': 'end', 'stn': 3, 'time': Timestamp('2021-03-21 00:15:00')},
 {'event': 'start', 'stn': 0, 'time': Timestamp('2021-03-21 00:15:00')},
 {'event': 'start', 'stn': 4, 'time': Timestamp('2021-03-21 00:30:00')},
 {'event': 'end', 'stn': 4, 'time': Timestamp('2021-03-21 00:30:00')},
 {'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 00:30:00')},
 {'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 00:45:00')},
 {'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 01:00:00')},
 {'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 01:30:00')},
 {'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 01:45:00')},
 {'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 01:45:00')},
 {'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 01:45:00')},
 {'event': 'start', 'stn': 2, 'time': Timestamp('2021-03-21 02:00:00')},
 {'event': 'start', 'stn': 3, 'time': Timestamp('2021-03-21 02:00:00')},
 {'event': 'end', 'stn': 2, 'time': Timestamp('2021-03-21 02:15:00')},
 {'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 02:45:00')},
 {'event': 'start', 'stn': 6, 'time': Timestamp('2021-03-21 02:45:00')},
 {'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 03:00:00')},
 {'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 03:00:00')},
 {'event': 'end', 'stn': 0, 'time': Timestamp('2021-03-21 03:15:00')},
 {'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 03:15:00')},
 {'event': 'start', 'stn': 4, 'time': Timestamp('2021-03-21 03:45:00')},
 {'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 04:00:00')},
 {'event': 'end', 'stn': 6, 'time': Timestamp('2021-03-21 04:00:00')},
 {'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 04:15:00')},
 {'event': 'start', 'stn': 6, 'time': Timestamp('2021-03-21 04:45:00')},
 {'event': 'start', 'stn': 2, 'time': Timestamp('2021-03-21 05:00:00')},
 {'event': 'end', 'stn': 3, 'time': Timestamp('2021-03-21 05:00:00')},
 {'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 05:15:00')},
 {'event': 'start', 'stn': 1, 'time': Timestamp('2021-03-21 05:45:00')},
 {'event': 'end', 'stn': 2, 'time': Timestamp('2021-03-21 05:45:00')},
 {'event': 'start', 'stn': 1, 'time': Timestamp('2021-03-21 05:45:00')},
 {'event': 'end', 'stn': 4, 'time': Timestamp('2021-03-21 05:45:00')},
 {'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 05:45:00')},
 {'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 06:00:00')},
 {'event': 'end', 'stn': 6, 'time': Timestamp('2021-03-21 06:00:00')},
 {'event': 'start', 'stn': 9, 'time': Timestamp('2021-03-21 06:15:00')},
 {'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 06:45:00')},
 {'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 06:45:00')},
 {'event': 'start', 'stn': 1, 'time': Timestamp('2021-03-21 07:00:00')},
 {'event': 'end', 'stn': 1, 'time': Timestamp('2021-03-21 07:00:00')},
 {'event': 'start', 'stn': 5, 'time': Timestamp('2021-03-21 07:00:00')},
 {'event': 'start', 'stn': 8, 'time': Timestamp('2021-03-21 07:15:00')},
 {'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 07:30:00')},
 {'event': 'start', 'stn': 7, 'time': Timestamp('2021-03-21 07:45:00')},
 {'event': 'start', 'stn': 0, 'time': Timestamp('2021-03-21 07:45:00')},
 {'event': 'end', 'stn': 1, 'time': Timestamp('2021-03-21 08:15:00')},
 {'event': 'start', 'stn': 2, 'time': Timestamp('2021-03-21 08:15:00')},
 {'event': 'end', 'stn': 0, 'time': Timestamp('2021-03-21 08:30:00')},
 {'event': 'start', 'stn': 4, 'time': Timestamp('2021-03-21 08:30:00')},
 {'event': 'end', 'stn': 1, 'time': Timestamp('2021-03-21 08:45:00')},
 {'event': 'end', 'stn': 2, 'time': Timestamp('2021-03-21 08:45:00')},
 {'event': 'end', 'stn': 8, 'time': Timestamp('2021-03-21 08:45:00')},
 {'event': 'start', 'stn': 9, 'time': Timestamp('2021-03-21 09:00:00')},
 {'event': 'end', 'stn': 9, 'time': Timestamp('2021-03-21 09:00:00')},
 {'event': 'end', 'stn': 5, 'time': Timestamp('2021-03-21 09:45:00')},
 {'event': 'end', 'stn': 4, 'time': Timestamp('2021-03-21 10:15:00')},
 {'event': 'end', 'stn': 7, 'time': Timestamp('2021-03-21 10:30:00')},
 {'event': 'end', 'stn': 9, 'time': Timestamp('2021-03-21 12:00:00')}]
[(3, 1, Timedelta('0 days 00:15:00')),
 (4, 1, Timedelta('0 days 00:00:00')),
 (7, 1, Timedelta('0 days 00:30:00')),
 (7, 2, Timedelta('0 days 00:45:00')),
 (7, 3, Timedelta('0 days 00:15:00')),
 (7, 4, Timedelta('0 days 00:00:00')),
 (5, 1, Timedelta('0 days 01:15:00')),
 (3, 0, Timedelta('0 days 01:45:00')),
 (2, 1, Timedelta('0 days 00:15:00')),
 (5, 0, Timedelta('0 days 01:00:00')),
 (7, 3, Timedelta('0 days 01:15:00')),
 (8, 1, Timedelta('0 days 02:00:00')),
 (0, 1, Timedelta('0 days 03:00:00')),
 (7, 2, Timedelta('0 days 00:15:00')),
 (4, 0, Timedelta('0 days 03:15:00')),
 (8, 2, Timedelta('0 days 01:00:00')),
 (6, 1, Timedelta('0 days 01:15:00')),
 (7, 1, Timedelta('0 days 01:00:00')),
 (6, 0, Timedelta('0 days 00:45:00')),
 (2, 0, Timedelta('0 days 02:45:00')),
 (3, 1, Timedelta('0 days 03:00:00')),
 (5, 1, Timedelta('0 days 02:30:00')),
 (2, 1, Timedelta('0 days 00:45:00')),
 (1, 1, Timedelta('0 days 00:00:00')),
 (4, 1, Timedelta('0 days 02:00:00')),
 (5, 0, Timedelta('0 days 00:30:00')),
 (8, 1, Timedelta('0 days 02:00:00')),
 (6, 1, Timedelta('0 days 01:15:00')),
 (8, 0, Timedelta('0 days 00:45:00')),
 (5, 1, Timedelta('0 days 01:00:00')),
 (1, 2, Timedelta('0 days 01:15:00')),
 (1, 3, Timedelta('0 days 00:00:00')),
 (5, 0, Timedelta('0 days 00:15:00')),
 (8, 1, Timedelta('0 days 00:30:00')),
 (8, 2, Timedelta('0 days 00:15:00')),
 (7, 0, Timedelta('0 days 03:30:00')),
 (0, 0, Timedelta('0 days 04:30:00')),
 (1, 2, Timedelta('0 days 01:15:00')),
 (2, 0, Timedelta('0 days 02:30:00')),
 (0, 1, Timedelta('0 days 00:45:00')),
 (4, 0, Timedelta('0 days 02:45:00')),
 (1, 1, Timedelta('0 days 00:30:00')),
 (2, 1, Timedelta('0 days 00:30:00')),
 (8, 1, Timedelta('0 days 01:15:00')),
 (9, 1, Timedelta('0 days 02:45:00')),
 (9, 2, Timedelta('0 days 00:00:00')),
 (5, 1, Timedelta('0 days 02:45:00')),
 (4, 1, Timedelta('0 days 01:45:00')),
 (7, 1, Timedelta('0 days 02:45:00')),
 (9, 1, Timedelta('0 days 03:00:00'))]
    stationId  count           hours
0           3      1 0 days 00:15:00
1           4      1 0 days 00:00:00
2           7      1 0 days 00:30:00
3           7      2 0 days 00:45:00
4           7      3 0 days 00:15:00
5           7      4 0 days 00:00:00
6           5      1 0 days 01:15:00
7           3      0 0 days 01:45:00
8           2      1 0 days 00:15:00
9           5      0 0 days 01:00:00
10          7      3 0 days 01:15:00
11          8      1 0 days 02:00:00
12          0      1 0 days 03:00:00
13          7      2 0 days 00:15:00
14          4      0 0 days 03:15:00
15          8      2 0 days 01:00:00
16          6      1 0 days 01:15:00
17          7      1 0 days 01:00:00
18          6      0 0 days 00:45:00
19          2      0 0 days 02:45:00
20          3      1 0 days 03:00:00
21          5      1 0 days 02:30:00
22          2      1 0 days 00:45:00
23          1      1 0 days 00:00:00
24          4      1 0 days 02:00:00
25          5      0 0 days 00:30:00
26          8      1 0 days 02:00:00
27          6      1 0 days 01:15:00
28          8      0 0 days 00:45:00
29          5      1 0 days 01:00:00
30          1      2 0 days 01:15:00
31          1      3 0 days 00:00:00
32          5      0 0 days 00:15:00
33          8      1 0 days 00:30:00
34          8      2 0 days 00:15:00
35          7      0 0 days 03:30:00
36          0      0 0 days 04:30:00
37          1      2 0 days 01:15:00
38          2      0 0 days 02:30:00
39          0      1 0 days 00:45:00
40          4      0 0 days 02:45:00
41          1      1 0 days 00:30:00
42          2      1 0 days 00:30:00
43          8      1 0 days 01:15:00
44          9      1 0 days 02:45:00
45          9      2 0 days 00:00:00
46          5      1 0 days 02:45:00
47          4      1 0 days 01:45:00
48          7      1 0 days 02:45:00
49          9      1 0 days 03:00:00