#apache-spark #pyspark #apache-spark-sql
Question:
date_range = mydata[mydata.headline_category=='india'].sort('publish_date')
date_range.show()
+-------------------+-----------------+--------------------+
| publish_date|headline_category| headline_text|
+-------------------+-----------------+--------------------+
|2001-01-04 00:00:00| india|Dudhwa tiger died...|
|2001-01-05 00:00:00| india|MP best in forest...|
|2001-05-28 00:00:00| india|India-Bangladesh ...|
|2001-05-28 00:00:00| india|Govt to modernise...|
|2001-05-28 00:00:00| india|Priyanka is the C...|
|2001-05-28 00:00:00| india|MPs riling Relian...|
|2001-05-28 00:00:00| india|CBI probing A-I's...|
|2001-05-28 00:00:00| india|Gujarat braces as...|
|2001-05-28 00:00:00| india|Ayodhya may force...|
|2001-05-28 00:00:00| india|3 new frigates to...|
|2001-05-28 00:00:00| india|Plea in SC challe...|
|2001-05-28 00:00:00| india|Kashmiri Sikhs pr...|
|2001-05-28 00:00:00| india|Bengal to revamp ...|
|2001-05-29 00:00:00| india|Rs 280 cr sanctio...|
|2001-05-29 00:00:00| india|DD Metro is up fo...|
|2001-05-29 00:00:00| india|Govt employees' n...|
|2001-05-29 00:00:00| india|BMS; Left to oppo...|
|2001-05-29 00:00:00| india|CBI vetting paper...|
|2001-05-29 00:00:00| india|Indo-Pak ties: Fr...|
|2001-05-29 00:00:00| india|BJP; Samata to st...|
+-------------------+-----------------+--------------------+
How do I find the top 10 words in the headline_text column for the headline_category 'india'?
Comments:
1. Either kind of answer will do, SQL or the Spark DataFrame API.
Answer #1:
You can split the headline into words, explode the resulting array into one row per word, group by word, and count.
import pyspark.sql.functions as F

result = (
    date_range
    # split each headline on spaces and explode into one row per word
    .withColumn('words', F.explode(F.split('headline_text', ' ')))
    .groupBy('words')
    .count()
    .orderBy(F.desc('count'))
    .limit(10)
)
result.show()