Функции ранжирования SQL для сортировки и агрегирования данных для данных беседы

#sql #google-bigquery #window-functions #ranking-functions

# #sql #google-bigquery #окно-функции #ранжирование-функции

Вопрос:

Я ищу ранжирование / агрегирование данных беседы в SQL (в частности, BigQuery). Данные — это данные беседы, где каждая строка представляет одно предложение. На изображении ниже я добавил пример данных для speaker, sentence и sequence_start. Желаемый рейтинг — это целевой результат (или что-то подобное).

Я считаю, что должна быть функция окна, такая как rank / lag / first, которая должна получить желаемый ранг программно.

Пример данных беседы

Самое близкое, что я получил, изначально было следующим:

 WITH DATA AS (
SELECT 'Speaker A' as speaker, 'Sentence 1' as sentence, 1 as sentence_start, 1 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
UNION ALL SELECT 'Speaker B' as speaker, 'Sentence 3' as sentence, 27 as sentence_start, 2 as desired_rank
UNION ALL SELECT 'Speaker C' as speaker, 'Sentence 4' as sentence, 46 as sentence_start, 3 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 5' as sentence, 78 as sentence_start, 4 as desired_rank
)
SELECT speaker, sentence, sentence_start, desired_rank,

FIRST_VALUE(sentence_start)
  OVER (
    PARTITION BY speaker
    ORDER BY sentence_start
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

FROM DATA
ORDER BY sentence_start
 

Проблема с результатом заключается в том, что динамик A всегда оценивается как 1, где он должен быть 4 (или что-то подобное).

Ваша помощь приветствуется. Спасибо!

Ответ №1:

Понял это. Необходимо перейти к следующей строке, чтобы определить изменение. Добавлено усложнение, когда говорящий А говорит за предложения 5 и 6.

 WITH data AS (
SELECT          'Speaker A' as speaker, 'Sentence 1' as sentence, 1 as sentence_start, 1 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
UNION ALL SELECT 'Speaker B' as speaker, 'Sentence 3' as sentence, 27 as sentence_start, 2 as desired_rank
UNION ALL SELECT 'Speaker C' as speaker, 'Sentence 4' as sentence, 46 as sentence_start, 3 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 5' as sentence, 78 as sentence_start, 4 as desired_rank
UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 6' as sentence, 90 as sentence_start, 4 as desired_rank
),
data_ranked AS (
SELECT speaker, sentence, sentence_start, desired_rank,
COALESCE(LEAD(sentence_start) OVER (ORDER BY sentence_start asc),9999999999999) AS next_sentence_start
FROM DATA
ORDER BY sentence_start
),
sentence_information AS (
SELECT sentence_information.speaker, sentence_information.sentence, sentence_information.sentence_start, sentence_information.next_sentence_start
  , CASE WHEN sentence_information.speaker <> next_sentence_information.speaker THEN TRUE ELSE FALSE END as next_sentence_speaker_change_indicator
FROM DATA_RANKED as sentence_information
  LEFT OUTER JOIN DATA AS next_sentence_information ON sentence_information.next_sentence_start = next_sentence_information.sentence_start
),
compiled_sentence_information AS (SELECT sentence_information.speaker, sentence_information.sentence, sentence_information.sentence_start, sentence_information.next_sentence_start
, COALESCE(next_sentence_information.next_sentence_speaker_change_indicator, FALSE) as speaker_change_indicator
, CASE WHEN COALESCE(next_sentence_information.next_sentence_speaker_change_indicator, FALSE) THEN 1 ELSE 0 END as speaker_change_number
, SUM(CASE WHEN COALESCE(next_sentence_information.next_sentence_speaker_change_indicator, FALSE) THEN 1 ELSE 0 END) OVER (ORDER BY sentence_information.sentence_start ASC) AS speaker_sentence_rank
, CASE WHEN sentence_information.next_sentence_start = 9999999999999 THEN TRUE ELSE sentence_information.next_sentence_speaker_change_indicator END as final_sentence_in_paragraph
FROM sentence_information 
  LEFT OUTER JOIN sentence_information as next_sentence_information on sentence_information.sentence_start = next_sentence_information.next_sentence_start
),
paragraphs as (
SELECT *, STRING_AGG(sentence, " ") OVER (PARTITION BY speaker_sentence_rank ORDER BY sentence_start) as paragraph
FROM compiled_sentence_information
)
SELECT speaker, paragraph
FROM paragraphs
WHERE final_sentence_in_paragraph = TRUE
ORDER BY sentence_start
 

Ответ №2:

Возможно, поможет одна из функций RANK, DENSE_RANK и ROW_NUMBER . Добавлена еще одна строка в DATA с sentence_start = 9 , чтобы указать на разницу:

 WITH DATA AS (
  SELECT 'Speaker A' as speaker, 'Sentence 1' as sentence, 1 as sentence_start, 1 as desired_rank
  UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
  UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 2' as sentence, 9 as sentence_start, 1 as desired_rank
  UNION ALL SELECT 'Speaker B' as speaker, 'Sentence 3' as sentence, 27 as sentence_start, 2 as desired_rank
  UNION ALL SELECT 'Speaker C' as speaker, 'Sentence 4' as sentence, 46 as sentence_start, 3 as desired_rank
  UNION ALL SELECT 'Speaker A' as speaker, 'Sentence 5' as sentence, 78 as sentence_start, 4 as desired_rank
)
SELECT 
  speaker,
  sentence,
  sentence_start,
  desired_rank,
  RANK() OVER (ORDER BY sentence_start) AS rank,
  DENSE_RANK() OVER (ORDER BY sentence_start) AS dense_rank,
  ROW_NUMBER() OVER (ORDER BY sentence_start) AS row_number,
FROM DATA
 

введите описание изображения здесь