函数程序设计实验十一:词频统计 wu-kan

{-
给定一个文本文件,构建文件中所有单词及其出现次数的词频表,并将词频表按照格式要求写到一个文件中。
词频表按照单词出现次数从高到低次序排列,出现次数相同的按照字典序排列。第十四周上课提交。
输入文件样例(见text0.txt ):
Mind-controlled Mouse
The three study participants are part of a clinical trial to test a brain-computer interface (BCI) called BrainGate. BrainGate translates participant’s brain activity into commands that a computer can understand. In the new study, researchers first implanted microelectrode arrays into the area of the brain that governs hand movement. The participants trained the system by thinking about moving their hands, something the BCI learned to translate into actions on the screen.
输出文件样例(见text0Fre.txt ):
the:8
a:3
into:3
bci:2
brain:2
braingate:2
of:2
participants:2
study:2
that:2
to:2
about:1
actions:1
activity:1
are:1
area:1
arrays:1
brain-computer:1
by:1
called:1
can:1
clinical:1
commands:1
computer:1
first:1
governs:1
hand:1
hands:1
implanted:1
in:1
interface:1
learned:1
microelectrode:1
mind-controlled:1
mouse:1
movement:1
moving:1
new:1
on:1
part:1
participant:1
researchers:1
screen:1
something:1
system:1
test:1
their:1
thinking:1
three:1
trained:1
translate:1
translates:1
trial:1
understand:1
建议:
main ::IO
main = do
   ls <- readFile inputfile
   let result = string2listofpairs ls
   let formatedResult = formatting result
   writeFile formatedResult
其中
string2listofpairs :: String ->[(String, Int)]  -- 计算输入串中所有单词及其频率,并排序
formatting :: [(String, Int)] ->String  -- 按照要求将词频列表转换为输出格式串
几点注意事项:
1. 大小写不敏感,所有结果用小写字母;
2. It's 视为单词 it;
3. brain-computer 视为一个单词。
-}
import System.IO
import Text.Printf
import Data.Char

inputfile="text0.txt"
outputfile="text0Fre.txt"

main::IO()
main=do
   ls<-readFile inputfile
   let result = string2listofpairs ls
   let formatedResult=formatting result
   writeFile outputfile formatedResult


string2listofpairs::String->[(String,Int)]
string2listofpairs s=sortG(cal(group(sortS(words(tolow s)))))

sortS::[String]->[String]
sortS []=[]
sortS (x:xs)=(sortS[y|y<-xs,y<x])++[x]++(sortS[y|y<-xs,x<=y])

cmp::(String,Int)->(String,Int)->Bool
cmp (a,b) (c,d)=
    if b/=d
        then b>d
        else a<c

sortG::[(String,Int)]->[(String,Int)]
sortG []=[]
sortG (x:xs)=(sortG[y|y<-xs,cmp y x])++[x]++(sortG[y|y<-xs,cmp y x==False])

group::[String]->[[String]]
group []=[]
group (x:xs)=[[x]++[y|y<-xs,y==x]]++(group[y|y<-xs,y/=x])

tolow::String->String
tolow []=[]
tolow (x:xs)=
    if (isSpace x)||(isAlphaNum x)||(x=='-')
        then ((toLower x):(tolow xs))
        else tolow xs

cal::[[String]]->[(String,Int)]
cal []=[]
cal (x:xs)=((x!!0,length x):(cal xs))

formatting :: [(String, Int)] ->String  -- 按照要求将词频列表转换为输出格式串
formatting []=[]
formatting ((a,b):xs)=(printf "%s:%d\n" a b)++(formatting xs)