Comprehensive exercise: word frequency statistics
Published: 2019-06-26


1. English word frequency statistics

Download the lyrics of an English song or an English article

news = '''It was summer. A duck was sitting in her nest. Her little ducklings were about to hatch.
One egg after another began to crack, but the biggest one was still there. At last, it cracked.
The baby was big and ugly. The next day, the mother duck with her family went down to the moat.
One duckling jumped in after another. The big ugly one swam about with them. But the poor
duckling was chased and harassed by all the ducklings because he was very ugly. The ducks
bit him, the hens pecked him. And the girl who fed them kicked him aside. Then he ran off
and soon came to a great marsh where the wild ducks lived. He stayed there for two whole
days. A big dog appeared close beside him, but he did not touch the duckling. "Oh, I am so
ugly that even the dog won't bite me," sighed the duckling. In the evening, he reached a
little cottage. Because he could not lay eggs, he was driven away by the hen. One
evening, he saw some swans. He flew into the water and swam towards them. What did he
see in the clear water? He was no longer a dark grey ugly bird. He was himself a swan.
He said to himself, "I never dreamed that I could be so happy when I was the ugly duckling."'''

  

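Replace all separators such as , . ? ! ' : with spaces

sep = ".,?!:\"'’"
for c in sep:
    # Replace with a space, not an empty string, so that "them.But" does not fuse into "themBut"
    news = news.replace(c, ' ')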
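The loop above makes one pass over the text per separator. As an alternative, here is a minimal sketch that does the replacement in a single pass with `str.translate`. The helper name `strip_punctuation` and the exact separator set (ASCII punctuation plus curly quotes) are assumptions made for illustration, not part of the original exercise.

```python
import string

def strip_punctuation(text: str) -> str:
    """Replace ASCII punctuation and curly quotes with spaces in one pass."""
    # Assumed separator set: all ASCII punctuation plus the four curly quote characters.
    separators = string.punctuation + "\u2018\u2019\u201c\u201d"
    table = str.maketrans({ch: " " for ch in separators})
    return text.translate(table)

sample = 'One duckling swam with them.But the dog said,"Oh,I am ugly."'
print(strip_punctuation(sample).lower().split())
```

Because every separator becomes a space, a later `split()` still separates words that were only divided by punctuation.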

 

Convert all uppercase letters to lowercase

Generate the word list

wordList = news.lower().split()
for w in wordList:
    print(w)

 

Generate the word frequency statistics

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    # Count how many times each distinct word occurs in the list
    wordDict[w] = wordList.count(w)
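Calling `list.count` once per distinct word rescans the whole list each time. A minimal sketch of the same counting step using `collections.Counter` from the standard library, which tallies everything in one pass, might look like this. The function name `count_words` and the sample sentence are illustrative only.

```python
from collections import Counter

def count_words(words):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(words)

words = "the duck saw the swan and the dog".split()
freq = count_words(words)
print(freq["the"])   # 3
print(freq["swan"])  # 1
```

A `Counter` behaves like a dictionary, so code that reads `wordDict[w]` can read `freq[w]` in the same way.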

 

Sort

dictList = list(wordDict.items())
# Sort by count, highest first
dictList.sort(key=lambda x: x[1], reverse=True)

 

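Exclude function words: pronouns, articles, conjunctions

exclude = {'the', 'who', 'a', 'it', 'that', 'he', 'by', 'when'}
wordSet = set(wordList) - exclude
wordDict = {}  # rebuild, otherwise excluded words counted earlier remain in the dictionary
for w in wordSet:
    wordDict[w] = wordList.count(w)
# Re-sort so the excluded words also disappear from the ranking
dictList = sorted(wordDict.items(), key=lambda x: x[1], reverse=True)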

 

Print the 20 most frequent words

# Slicing avoids an IndexError when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)
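The counting, exclusion, sorting, and top-N steps can also be folded into one small function. The sketch below uses `Counter.most_common`. The name `top_words`, its parameters, and the sample data are assumptions for illustration rather than the original code.

```python
from collections import Counter

def top_words(words, stop_words=frozenset(), n=20):
    """Count the words that are not stop words and return the n most frequent."""
    counts = Counter(w for w in words if w not in stop_words)
    # most_common returns (word, count) pairs sorted by count, highest first
    return counts.most_common(n)

words = "the duck saw the swan and the duck swam".split()
print(top_words(words, stop_words={"the", "and"}, n=1))  # [('duck', 2)]
```

If `n` is larger than the number of distinct words, `most_common` simply returns all of them, so no bounds check is needed.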

 

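Save the text to analyze as a UTF-8 encoded file, and load it from that file for the word frequency analysis.

with open('news.txt', 'r', encoding='utf-8') as f:
    news = f.read()
print(news)

with open('newscount.txt', 'a', encoding='utf-8') as f:
    for word, count in dictList[:20]:
        # Put a space between the word and its count so the output stays readable
        f.write(word + ' ' + str(count) + '\n')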
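For reference, here is a minimal sketch of the read, count, and write cycle as a single function built on `pathlib` and `Counter`. The function name `word_frequency_report` is hypothetical, the punctuation step is skipped for brevity, and the demo writes to a temporary directory so that no existing file is touched.

```python
import tempfile
from collections import Counter
from pathlib import Path

def word_frequency_report(src, dst, n=20):
    """Read UTF-8 text from src, write the top n 'word count' lines to dst."""
    text = Path(src).read_text(encoding="utf-8")
    counts = Counter(text.lower().split())
    lines = [f"{word} {count}" for word, count in counts.most_common(n)]
    # write_text overwrites dst; the original code appends instead
    Path(dst).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines

# Demo in a temporary directory with a tiny stand-in for news.txt
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "news.txt"
    dst = Path(tmp) / "newscount.txt"
    src.write_text("The duck and the swan", encoding="utf-8")
    print(word_frequency_report(src, dst, n=1))  # ['the 2']
```

Overwriting rather than appending is a deliberate choice in this sketch: running it twice leaves one clean result instead of two stacked copies.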

 

 

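2. Chinese word frequency statistics

Download a long Chinese article.

Read the text to analyze from a file.

with open('gzccnews.txt', 'r', encoding='utf-8') as f:
    news = f.read()  # read the contents; jieba needs a string, not a file object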

Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(news)  # lcut already returns a list, so no list() wrapper is needed

import jieba

with open('price.txt', 'r', encoding='utf-8') as f:
    price = f.read()

 

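Generate the word frequency statistics

Sort

Exclude function words: pronouns, articles, conjunctions

Print the 20 most frequent words (or save the result to a file)

 

Publish the code and screenshots of the results on your blog.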

# Punctuation to strip, in both ASCII and full-width forms, plus the space
can1 = '''“”,。?:()!,?:()! '''
exclude = {'你', '的', '我', '他', '她', '是'}
jieba.add_word('白雪公主')  # keep this name as a single token
for i in can1:
    price = price.replace(i, '')
result = list(jieba.cut(price))
wordDict = {}
words = list(set(result) - exclude)
for i in words:
    wordDict[i] = result.count(i)
wordList = list(wordDict.items())
wordList.sort(key=lambda x: x[1], reverse=True)
print(wordList)
with open('bxgz.txt', 'a', encoding='utf-8') as f:
    for word, count in wordList[:20]:
        # A space separates the word from its count
        f.write(word + ' ' + str(count) + '\n')
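The Chinese pipeline can be sketched more compactly by filtering punctuation tokens after segmentation instead of deleting characters beforehand. In the sketch below, `chinese_top_words`, `is_punctuation`, and the `cut` parameter are hypothetical names. `cut` defaults to `jieba.lcut`, and the demo passes a hard-coded stand-in tokenizer so that the example runs without jieba installed.

```python
import unicodedata
from collections import Counter

def is_punctuation(token):
    """True if the token is empty or made only of Unicode punctuation characters."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def chinese_top_words(text, stop_words=frozenset(), n=20, cut=None):
    """Segment text, drop punctuation and stop words, return the n most frequent tokens."""
    if cut is None:
        import jieba  # third-party; install with `pip install jieba`
        cut = jieba.lcut
    tokens = (t.strip() for t in cut(text))
    counts = Counter(
        t for t in tokens if not is_punctuation(t) and t not in stop_words
    )
    return counts.most_common(n)

# Stand-in tokenizer (an assumption for the demo): returns a fixed token list
fake_cut = lambda text: ["白雪公主", "的", "故事", ",", "白雪公主", "。"]
print(chinese_top_words("", stop_words={"的"}, n=1, cut=fake_cut))  # [('白雪公主', 2)]
```

Filtering by Unicode category covers both ASCII and full-width punctuation without maintaining a character list by hand. Whitespace tokens become empty after `strip()` and are dropped by the same check.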

Reposted from: https://www.cnblogs.com/candyxue/p/8666059.html
