Problem extracting compound nouns, including hyphenated ones, in NLP

#python #python-3.x #string #nlp #spacy


Question:

Background and goal

I would like to extract nouns and compound nouns, including hyphenated ones, from each sentence, as shown below. If a word contains a hyphen, I need to extract it with the hyphen intact.

 {The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'], 
I bought the computer and the new web camera.: ['computer', 'web camera']}

Problem

The current output is shown below. The first word of each compound noun is labeled "compound", but so far I cannot extract what I expect.

 T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'], 
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'], 
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}

Current code

I am using the NLP library spaCy to distinguish nouns from compound nouns. I would appreciate advice on how to fix the current code.

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN'or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                word.shape_, word.is_alpha, word.is_stop)

            #compound words
            for j in range(len(text)):
                    token = text[j]
                    if token.dep_ == 'compound':
                        if j < len(text)-1:
                            nexttoken = text[j + 1]
                            words.append(str(token.text + ' ' + nexttoken.text))


            else:
                words.append(word.text)
    dic[text] = words       
print(dic)

Development environment

Python 3.7.4

spaCy version 2.3.2

Answer #1:

Please try:

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ("The T-shirt is old",
          "I bought the computer and the new web-cam",
          "I bought the computer and the new web camera",
         )
docs = nlp.pipe(texts)  

compounds = []
for doc in docs:
    compounds.append({doc.text:[doc[tok.i:tok.head.i + 1] for tok in doc if tok.dep_=="compound"]})
print(compounds)

Output:

[{'The T-shirt is old.': [T-shirt]}, 
{'I bought the computer and the new web-cam.': [web-cam]}, 
{'I bought the computer and the new web camera.': [web camera]}]

"computer" is missing from this list, but I don't think it qualifies as a compound.
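
If the plain nouns should be listed as well, as in the expected output in the question, one possible approach is to emit every noun that is not itself a compound modifier and extend it leftwards to the start of its compound chain. The following is only a minimal sketch under stated assumptions: extract_nouns and make_token are hypothetical names, compound modifiers are assumed to precede their head noun, and the demo parse is written by hand rather than produced by a model.

```python
from types import SimpleNamespace

NOUN_TAGS = {"NOUN", "PROPN"}


def extract_nouns(doc):
    """List the nouns in `doc`, merging compound modifiers (and hyphens) into their head."""
    # Map each head noun's index to the leftmost token of its compound chain.
    starts = {}
    for tok in doc:
        if tok.dep_ == "compound":
            head = tok.head
            while head.dep_ == "compound":  # follow nested compounds to the head noun
                head = head.head
            starts[head.i] = min(starts.get(head.i, head.i), tok.i)

    phrases = []
    for tok in doc:
        # Compound modifiers are emitted as part of their head, not on their own.
        if tok.pos_ in NOUN_TAGS and tok.dep_ != "compound":
            start = starts.get(tok.i, tok.i)
            # Joining with the original whitespace turns "T", "-", "shirt" into "T-shirt".
            span = doc[start:tok.i + 1]
            phrases.append("".join(t.text_with_ws for t in span).strip())
    return phrases


def make_token(i, text, ws, pos, dep):
    """Hypothetical stand-in for a spaCy Token, used only for the demo below."""
    return SimpleNamespace(i=i, text_with_ws=text + ws, pos_=pos, dep_=dep, head=None)


# Hand-written parse of "The T-shirt is old." (an assumption, not real model output).
demo = [
    make_token(0, "The", " ", "DET", "det"),
    make_token(1, "T", "", "PROPN", "compound"),
    make_token(2, "-", "", "PUNCT", "punct"),
    make_token(3, "shirt", " ", "NOUN", "nsubj"),
    make_token(4, "is", " ", "AUX", "ROOT"),
    make_token(5, "old", "", "ADJ", "acomp"),
    make_token(6, ".", "", "PUNCT", "punct"),
]
for child, head in [(0, 3), (1, 3), (2, 3), (3, 4), (4, 4), (5, 4), (6, 4)]:
    demo[child].head = demo[head]

print(extract_nouns(demo))  # ['T-shirt']
```

The function only reads i, pos_, dep_, head and text_with_ws, which are all real spaCy Token attributes, so it should also accept the docs yielded by nlp.pipe(texts). What it returns for real text still depends on how the loaded model parses each sentence.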
