r - How to store the sparsity and maximal term length of a term-document matrix from tm


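How can I store the sparsity and maximal term length of a term-document matrix in R in separate variables, instead of only seeing them in the printed output?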

library(tm)
library(RWeka)
#stdout <- vector('character')
#con <- textConnection('stdout','wr',local = TRUE)
#reading the csv file
worklog <- read.csv("To_Kamal_WorkLogs.csv");
#removing the unwanted columns
cols <- c("A","B","C","D","E","F");
colnames(worklog)<-cols;
worklog2 <- worklog[c("F")]
#removing non-ASCII characters
worklog2$F <- iconv(worklog2$F, "latin1", "ASCII", sub = "")
#cleaning the data Removing Date and Time
worklog2$F <- gsub("[0-9]+/[0-9]+/[0-9]+ [0-9]+:[0-9]+:[0-9]+ [AP]M", "", worklog2$F)
#loading the vector Data to corpus
a <- Corpus(VectorSource(worklog2$F))
#cleaning the data
a <- tm_map(a,removeNumbers)
a <- tm_map(a,removePunctuation)
a <- tm_map(a,stripWhitespace)
#content_transformer keeps the corpus structure intact when applying a base function
a <- tm_map(a, content_transformer(tolower))
a <- tm_map(a,removeWords,stopwords("english")) 
a <- tm_map(a,stemDocument,language ="english")
#removing custom stopwords
stopwords="open";
if(!is.null(stopwords)) a <- tm_map(a, removeWords, words=as.character(stopwords))
#finding 2,3,4 grams
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm2 <- TermDocumentMatrix(a, control = list(tokenize = bigramTokenizer))
tdm2 <- removeSparseTerms(tdm2, 0.75)
#output
> tdm2
<<TermDocumentMatrix (terms: 27, documents: 8747)>>
Non-/sparse entries: 87804/148365
Sparsity : 63%
Maximal term length: 20
Weighting : term frequency (tf)

How can I store the sparsity, maximal term length, weighting, and non-/sparse entries shown above in separate variables?


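This returns the data you need. Your question does not specify the desired format, so I have used a named list here. (It could just as easily be returned as a data.frame.)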

I took this from the tm package source code (file Matrix.R), which contains the print method for TermDocumentMatrix objects.

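library(tm)

getTDMstats <- function(x) {
  # x is a TermDocumentMatrix
  n_cells <- prod(dim(x))      # total cells: terms x documents
  n_nonsparse <- length(x$v)   # x$v holds the non-zero entries
  list(
    # share of empty cells, rounded to a whole percent as in print(), then scaled to 0-1
    sparsity = ifelse(!n_cells, 100, round((1 - n_nonsparse / n_cells) * 100)) / 100,
    maxtermlength = max(nchar(Terms(x), type = "chars"), 0),
    weightingLong = attr(x, "weighting")[1],
    weightingShort = attr(x, "weighting")[2],
    nonsparse = n_nonsparse,
    sparse = n_cells - n_nonsparse
  )
}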
data(crude)
tdm2 <- TermDocumentMatrix(crude)
tdm2
## <<TermDocumentMatrix (terms: 1266, documents: 20)>>
## Non-/sparse entries: 2255/23065
## Sparsity : 91%
## Maximal term length: 17
## Weighting : term frequency (tf)
getTDMstats(tdm2)
## $sparsity
## [1] 0.91
## 
## $maxtermlength
## [1] 17
## 
## $weightingLong
## [1] "term frequency"
## 
## $weightingShort
## [1] "tf"
## 
## $nonsparse
## [1] 2255
## 
## $sparse
## [1] 23065
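
As a follow-up to the remark that the result could be returned as a data.frame, here is a minimal base R sketch. It assumes a list with the same shape as the getTDMstats() output above; the values are copied from the crude example so the snippet runs without tm, and the names `stats`, `stats_df`, `max_term_length` and `recomputed` are only illustrative.

```r
# Values copied from the getTDMstats(tdm2) output for the crude corpus;
# in practice this would be: stats <- getTDMstats(tdm2)
stats <- list(
  sparsity       = 0.91,
  maxtermlength  = 17,
  weightingLong  = "term frequency",
  weightingShort = "tf",
  nonsparse      = 2255,
  sparse         = 23065
)

# Every element has length 1, so the list converts to a one-row data.frame
stats_df <- as.data.frame(stats, stringsAsFactors = FALSE)

# Individual values can then be stored in separate variables
sparsity <- stats_df$sparsity
max_term_length <- stats_df$maxtermlength

# Cross-check: sparsity is sparse / (sparse + nonsparse), rounded to 2 places
recomputed <- round(stats$sparse / (stats$sparse + stats$nonsparse), 2)
```

The cross-check at the end recomputes the sparsity from the two entry counts, which should agree with the stored value.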