#python #python-3.x #regex
Question:
I have a tricky task: removing repeated consecutive words or sentences. An example input is shown below.
The
The Up
The Up next
The Up next we
The Up next we bring
The Up next we bring you
The Up next we bring you a
The Up next we bring you a rebroadcast
The Up next we bring you a rebroadcast of
The Up next we bring you a rebroadcast of.
of. The
of. The Diane
of. The Diane Rehm
of. The Diane Rehm radio
of. The Diane Rehm radio talk
of. The Diane Rehm radio talk show
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The
The Diane Rehm radio talk show. The program
The Diane Rehm radio talk show. The program is
The Diane Rehm radio talk show. The program is heard
The Diane Rehm radio talk show. The program is heard over
The Diane Rehm radio talk show. The program is heard over W.A.M.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M.
The program is heard over W.A.M. you F.M. on
The program is heard over W.A.M. you F.M. on the
The program is heard over W.A.M. you F.M. on the campus
The program is heard over W.A.M. you F.M. on the campus of
The program is heard over W.A.M. you F.M. on the campus of the
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University
F.M. on the campus of the American University in
F.M. on the campus of the American University in the
F.M. on the campus of the American University in the nation's
F.M. on the campus of the American University in the nation's capital
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The
University in the nation's capital. The special
University in the nation's capital. The special Martin
University in the nation's capital. The special Martin Luther
University in the nation's capital. The special Martin Luther King
University in the nation's capital. The special Martin Luther King Day
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded
The special Martin Luther King Day show recorded Monday
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused
recorded Monday. Focused on
recorded Monday. Focused on race
recorded Monday. Focused on race relations
recorded Monday. Focused on race relations.
Focused on race relations. Ms
Focused on race relations. Ms Rames
Focused on race relations. Ms Rames guests
Focused on race relations. Ms Rames guests were
Focused on race relations. Ms Rames guests were Eleanor
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton
Ms Rames guests were Eleanor Holmes Norton.
The current output is shown below
The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
F.M. on the campus of the American University in the nation's capital.
University in the nation's capital. The special Martin Luther King Day show
The special Martin Luther King Day show recorded Monday.
recorded Monday. Focused on race relations.
Focused on race relations. Ms Rames guests were Eleanor Holmes
Ms Rames guests were Eleanor Holmes Norton.
As you can see, even after this processing there are still repetitions, such as
The Up next we bring you a rebroadcast of.
of. The Diane Rehm radio talk show.
The Diane Rehm radio talk show. The program is heard over W.A.M. you
The program is heard over W.A.M. you F.M. on the campus of the American
I just want something like
The Up next we bring you a rebroadcast of.
The Diane Rehm radio talk show.
The program is heard over W.A.M. you F.M. on the campus of the American
University in the nation's capital. The special Martin Luther King Day show
recorded Monday. Focused on race relations.
...etc
How can I accomplish this?
Current code
import os

def load_and_discard(file_path):
    """
    Load and discard previous substrings.
    Args:
        file_path (PathLike): path to data file
    Returns:
        list[str]
    """
    data = []
    with open("./input/" + file_path) as f:
        for line in f:
            st = line.strip()
            # data may still be empty if the file starts with blank lines
            if data and st.startswith(data[-1]):
                data[-1] = st
            elif len(st) > 0:  # guard against empty string
                data.append(st)
    return data

def find_lebms(s1, s2):
    """
    Linear scan for an end-begin-matching-substring (LEBMS):
    a suffix of s1 that equals a prefix of s2.
    Note: returns the first (i.e. shortest) match found.
    Args:
        s1 (str): 1st stripped str (match the end)
        s2 (str): 2nd stripped str (match the begin)
    Returns:
        int: length of the match, or 0 if there is none
    """
    # search up to this length
    n1 = min(len(s1), len(s2))
    for i in range(1, n1 + 1):
        if s1[-i:] == s2[:i]:
            return i
    return 0

def remove_repeated_substr(data):
    """
    Generate strings (in-place) ready for concatenation by
    removing the repeated substring in the first string.
    Args:
        data (list[str]): list of strings
    Returns:
        None
    """
    n0 = len(data)
    for i, st in enumerate(data):
        # guard: no chopping for the last line
        if i == n0 - 1:
            break
        # chop the current row
        n = find_lebms(st, data[i + 1])
        if n > 0:  # guard against n = 0
            data[i] = st[:-n]

directory = './input'
for filename in os.listdir(directory):
    data = load_and_discard(filename)
    remove_repeated_substr(data)
    # (optional) prevent un-spaced ending periods
    for i, st in enumerate(data):
        if st.endswith("."):  # safe even if a row was chopped to ""
            data[i] += " "
    ans = "\n".join(data)
    with open("./output/" + filename, "w") as text_file:
        text_file.write(ans)
If you prefer, you can use the output as your input if that is easier. That way you don't have to handle the repeated lines. It is entirely up to you whether you use the input or my output as your input, but please let me know which one you used when you post.
Alternative input
You can watch a representative.
Twenty three zero seven of the Rayburn Office Building.
Washington D.C. each week. C.-SPAN
Washington D.C. each week. C.-SPAN breaks
Washington D.C. each week. C.-SPAN breaks from
Washington D.C. each week. C.-SPAN breaks from its
Washington D.C. each week. C.-SPAN breaks from its public
Washington D.C. each week. C.-SPAN breaks from its public affairs
C.-SPAN breaks from its public affairs programming
C.-SPAN breaks from its public affairs programming to
C.-SPAN breaks from its public affairs programming to give
C.-SPAN breaks from its public affairs programming to give the
C.-SPAN breaks from its public affairs programming to give the viewer
C.-SPAN breaks from its public affairs programming to give the viewer updated schedule information.
Join us at eight o'clock A.M. Eastern five o'clock A.M. Pacific Time.
Six thirty P.M. Eastern three thirty P.M. Pacific Time.
Eight o'clock P.M. Eastern five o'clock P.M. Pacific Time.
One o'clock A.M. Eastern ten o'clock P.M. Pacific Time. As always C.-SPAN
P.M. Pacific Time. As always C.-SPAN scheduled
P.M. Pacific Time. As always C.-SPAN scheduled programming
As always C.-SPAN scheduled programming is preempted by live coverage of the U.S. House of Representatives.
Going on this election year.
Covering every issue in the campaign calendar.
The calendar list the network's plans for campaign.
From now through election day.
In addition to election coverage.
Other major events are cameras record.
Call toll free one eight hundred three four six. Her it to order the C.-SPAN
four six. Her it to order the C.-SPAN update for
Her it to order the C.-SPAN update for twenty four dollars.
You can use your credit card or will be glad to send you a bill.
Call one eight hundred three four six eight hundred.
And you'll receive fifty issues of the C.-SPAN update.
If you order an update subscription now.
The receive a free gift. The C.-SPAN road to the White House
The C.-SPAN road to the White House poster is twenty two by twenty eight inch pen and ink drawing.
Attractively depicts the spans grassroots approach to the campaign called.
Comments:
1. If I understand correctly, you want to remove repeated lines (for example, remove "I am" and keep "I am a fish"). Perhaps you could use a tree data structure and store each word as a node in the tree. That is an efficient way to find repeated sequences, in linear time. But it assumes that the repeated sequences begin the same way.
2. @ILike Yes, but slightly different. You can see that the input contains "The Up next we bring you a rebroadcast of. of. The Diane Rehm radio talk show.", where "of" is repeated twice, at the end and at the beginning of two different lines. Is it somehow possible to get "The Up next we bring you a rebroadcast of. The Diane Rehm radio talk show."?
3. Well, if each sentence is a continuation of the previous one, I think the simplest approach would be to check whether it contains a substring of the previous sentence.
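One possible non-regex reading of the suggestion in comment 3 is sketched below, assuming that every line continues the previous one and that overlaps fall on whole words. The function name merge_overlapping and the sample lines are hypothetical; this is not code from the thread.

```python
def merge_overlapping(lines):
    """Join lines, dropping words that repeat the tail of what was already merged."""
    words = []
    for line in lines:
        cur = line.split()
        # Longest k such that the last k merged words equal the first k words
        # of the current line (0 if there is no overlap at all).
        k = next(
            (k for k in range(min(len(words), len(cur)), 0, -1)
             if words[-k:] == cur[:k]),
            0,
        )
        words.extend(cur[k:])
    return " ".join(words)


sample = [
    "The",
    "The Up",
    "The Up next we bring you a rebroadcast of.",
    "of. The Diane Rehm radio talk show.",
    "The Diane Rehm radio talk show. The program",
]
print(merge_overlapping(sample))
```

Comparing whole words avoids accidental one-character overlaps, but a genuinely repeated word at a line boundary would also be merged, and the result comes back as a single line rather than one sentence per line.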
Answer #1:
You may be able to use this regex, which uses a lookahead and a backreference to match overlapping duplicates and remove them.
(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+
Use an empty string as the replacement.
RegEx Demo
Code:
import re
s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '\n', s)
Regex details:
( : Start capture group #1
\b : Word boundary
[-\w\s.']+? : Match 1+ word, whitespace, hyphen, dot or ' characters (lazily)
) : End capture group #1
(?=[\s.]+\1) : Positive lookahead asserting that the value captured in group #1 appears again ahead, after 1+ whitespace characters or dots
[\s.]+ : Match 1+ whitespace characters or dots
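As a rough illustration of the breakdown above, the sketch below applies the pattern with an empty replacement to two tiny inputs. The helper name drop_overlaps and the sample strings are made up for this example; the pattern is written as a double-quoted raw string so the ' needs no escaping.

```python
import re

# Lazily capture a run of text, require the same run to appear again after
# whitespace/dots (lookahead + backreference), then consume that separator.
PATTERN = re.compile(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+")


def drop_overlaps(text: str) -> str:
    """Remove the earlier copy of every overlapping duplicate."""
    return PATTERN.sub("", text)


# A line that is a prefix of the next line disappears.
print(drop_overlaps("The\nThe Up\nThe Up next"))

# A word repeated at the end of one line and the start of the next is kept once.
print(drop_overlaps("a rebroadcast of.\nof. The Diane"))
```

Because the separator is consumed together with the first copy, the line breaks are lost and the result is a single line.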
To preserve multiple lines, you can use 2 replacements:
s = re.sub(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+", '\n', s)
s = re.sub(r'\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z', '', s)
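A minimal, self-contained sketch of the two-replacement variant follows. The function name dedupe_keep_lines and the sample text are assumptions made for illustration; the two patterns are the ones given in the answer.

```python
import re

DUPLICATE = re.compile(r"(\b[-\w\s.']+?)(?=[\s.]+\1)[\s.]+")
# Newlines to drop afterwards: at the very start, after "<non-dot><space>"
# (i.e. mid-sentence), before another newline, and at the very end.
STRAY_NEWLINES = re.compile(r"\A\n+|(?<=[^.] )\n+|\n+(?=\n)|\n+\Z")


def dedupe_keep_lines(text: str) -> str:
    """Replace each duplicate with a newline, then remove the stray newlines."""
    text = DUPLICATE.sub("\n", text)
    return STRAY_NEWLINES.sub("", text)


sample = "Hello world.\nHello world. Bye\nBye now."
for line in dedupe_keep_lines(sample).splitlines():
    print(line.strip())
```

On this sample, only the newline that follows a sentence-ending dot and space survives the second pass, so each sentence ends up on its own line. The lines may keep a trailing space, hence the strip() when printing.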
Comments:
1. Thanks for your solution! What if I wanted to add a new line after each sentence? How would I do that?
2. Thank you very much!
3. Thank you very much for your answer. Could you take a look at the updated input? It seems the code does not work in all cases.
4. Can you check this demo with your new input: regex101.com/r/qVIybq/3