Natural Language Processing Toolkit nltk (1)

Python · publisher01

First, we should be clear that nltk is a toolkit for *processing* natural language rather than analyzing it: it takes raw text and turns it into data suitable for machine-learning frameworks.

example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat carboard."

First we need rules for splitting sentences. If our rule is "split wherever a period is followed by a capitalized word", then "Hello Mr. Smith" would match the rule and be split incorrectly.
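To make the problem concrete, here is a naive period-based splitter (a hypothetical regex sketch, not part of nltk) tripping over the abbreviation "Mr.":

```python
import re

text = "Hello Mr. Smith, how are you doing today? The weather is great."

# Naive rule: split wherever a period is followed by whitespace and a capital letter.
naive = re.split(r"\.\s+(?=[A-Z])", text)
print(naive)
# The abbreviation "Mr." triggers a false split:
# ['Hello Mr', 'Smith, how are you doing today? The weather is great.']
```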

Don't worry: nltk can split a paragraph into sentences or words for us. To use these tools, we first import the tokenizers.

from nltk.tokenize import sent_tokenize, word_tokenize

# These tokenizers rely on the Punkt models; run nltk.download('punkt') once if needed.
print(sent_tokenize(example_text))

The output units are sentences: the paragraph is divided into sentences according to the tokenizer's rules.

['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue.', 'You should not eat carboard.']

In the same way, we can split the paragraph into words:

print(word_tokenize(example_text))
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'carboard', '.']
for i in word_tokenize(example_text):
    print(i)

Stop Words

First, what is a stop word? The term comes from the English "stop word": in English there are many very frequent words such as a, the, and or that carry little content of their own.

Chinese websites also contain plenty of stop words. In the Chinese version of the previous sentence, words like 在, 里面, 也, 的, 它, and 为 are all stop words. Because such words occur so frequently that they appear on nearly every page, search-engine developers simply ignore them.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_text = "This is an example showing off stop word filration."
stop_words = set(stopwords.words("english"))
print(stop_words)
{"it's", 'being', 'him', 'own', 'above', "you'll", 'yourself', 'again', 'because', 'a', 'i', 'yours', "didn't", 've', 'his', 'only', 'hasn', 'all', 'out', 'this', 'just', 'below', 'of', 'will', 'who', 'shan', 'or', 'should', 'here', 'be', 'against', 't', 'than', 'have', 'is', 'does', "wouldn't", 'hers', 'while', 'ours', 'there', 'when', 'himself', 'hadn', 'theirs', 'your', 'doing', 'before', "shouldn't", 'more', 'over', 'both', 'if', 'so', 'themselves', 'll', 'their', 'ma', 'now', 're', 'we', "won't", 'these', 'why', "she's", 'can', 'its', 'up', 'me', 'the', 'most', 'doesn', 'd', 'herself', "needn't", 'an', 'about', 'as', 'further', 'few', "haven't", 'other', 'aren', 'between', "couldn't", 'are', 'where', 'o', "doesn't", 'at', "you've", "wasn't", 'isn', 'each', "you'd", 'yourselves', 'has', 'did', 'off', 'couldn', 'y', "hasn't", 'very', 'not', "mustn't", 'my', 'then', 'myself', "don't", 'those', 'from','any', 'too', 'to', 'weren', 'am', "you're", 'them', 'down', "shan't", 'into', 'nor', 'ain', 'but', 'didn', 'mightn', 'on', 'and',"aren't", 'it', 'how', "that'll", 'wouldn', 'by', 'was', 'during', 'our', 'same', 'until', 'had', 'some', 'been', 'such', 'shouldn', 'do', 'having', "hadn't", 'that', 'mustn', 'don', 'were', 'what', 'ourselves', "mightn't", 'through', 'no', 'wasn', 'needn', 'he', "weren't", 'once', 'they', 'in', "isn't", 'won', 'after', 'you', 'itself', 'which', 'she', 'm', 'her', "should've", 'with', 'haven', 'under', 'for', 's', 'whom'}
words = word_tokenize(example_text)
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)
['This', 'example', 'showing', 'stop', 'word', 'filration', '.']
The same filter can be written more compactly as a list comprehension:

filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)

Stemming

When learning English we all studied verb tenses. Sometimes we need to strip away those inflections to get at the underlying word, and that is what a stemmer is for.

# I was taking a ride in the car.
# I was riding in the car.

In these two sentences, ride appears in two different forms, but the meaning is the same: ride.

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))
python
python
python
python
pythonli

From the output we can see that the inflected forms of python are stripped down to the stem (note that the last one, pythonly, becomes pythonli rather than python).

new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))

Run this yourself and inspect the output; there are a few oddities in it worth spotting.
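Two quirks you are likely to notice (my observations, not stated in the original): PorterStemmer lowercases its input by default, and suffix stripping can produce tokens that are not real words:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# The stemmer lowercases by default, so "It" comes back as "it".
print(ps.stem("It"))
# Suffix stripping can yield non-words; here "poorly" comes back as "poorli",
# just as "pythonly" became "pythonli" above.
print(ps.stem("poorly"))
```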

Part-of-Speech Tagging

Part-of-speech tagging labels each word with its grammatical role (noun, verb, adjective, and so on).

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Train an unsupervised Punkt sentence tokenizer on the 2006 State of the Union
# address, then use it to split that same text into sentences.
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(sample_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
process_content()
[(u'And', 'CC'), (u'so', 'RB'), (u'we', 'PRP'), (u'move', 'VBP'), (u'forward', 'RB'), (u'--', ':'), (u'optimistic', 'JJ'), (u'about', 'IN'), (u'our', 'PRP$'), (u'country', 'NN'), (u',', ','), (u'faithful', 'JJ'), (u'to', 'TO'), (u'its', 'PRP$'), (u'cause', 'NN'), (u',', ','), (u'and', 'CC'), (u'confident', 'NN'), (u'of', 'IN'), (u'the', 'DT'), (u'victories', 'NNS'), (u'to', 'TO'), (u'come', 'VB'), (u'.', '.')]
CC   coordinating conjunction      NNS  noun, plural             UH   interjection
CD   cardinal number               NNP  proper noun, singular    VB   verb, base form
DT   determiner                    NNPS proper noun, plural      VBD  verb, past tense
EX   existential "there"           PDT  predeterminer            VBG  gerund or present participle
FW   foreign word                  POS  possessive ending        VBN  verb, past participle
IN   preposition or subord. conj.  PRP  personal pronoun         VBP  verb, non-3rd sg. present
JJ   adjective                     PRP$ possessive pronoun       VBZ  verb, 3rd sg. present
JJR  adjective, comparative        RB   adverb                   WDT  wh-determiner
JJS  adjective, superlative        RBR  adverb, comparative      WP   wh-pronoun
LS   list item marker              RBS  adverb, superlative      WP$  possessive wh-pronoun
MD   modal                         RP   particle                 WRB  wh-adverb
NN   noun, singular                SYM  symbol                   TO   "to"


When reprinting, please credit: Python量化投资 » Natural Language Processing Toolkit nltk (1)
