Using sklearn's CountVectorizer in Python, with notes | 南北小站

Reference:

https://blog.csdn.net/Clannad_niu/article/details/95216996

from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd


# Reference: https://blog.csdn.net/Clannad_niu/article/details/95216996

if __name__ == '__main__':

    data = ['I love you', 'you are my angle are']
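    # ngram_range: (1, 1) means 1-grams only, (2, 2) means 2-grams only,
    #   and (1, 2) means 1-grams plus 2-grams.
    # binary: False keeps the actual counts; True would record only presence (0 or 1).
    # token_pattern: a regular expression that defines what counts as a token.
    #   By default, single-character tokens (such as "I") are dropped;
    #   \w{1,} keeps them.
    vecl = CountVectorizer(ngram_range=(1, 2), binary=False, token_pattern=r'\w{1,}')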
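    # Learn the vocabulary and turn the data into an n-gram count matrix
    xl = vecl.fit_transform(data)
    # Feature names (get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() returns a NumPy array, hence tolist())
    feature_names = vecl.get_feature_names_out().tolist()
    print(feature_names)
    # Column index of each feature in the matrix
    print(vecl.vocabulary_)
    # The sparse matrix is hard to read when printed directly; pandas makes it much clearer
    print(xl)

    df = pd.DataFrame(xl.toarray(), columns=feature_names)  # to DataFrame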

    print(df.head())


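    '''
    Looking only at the 1-gram columns:
       angle  are  i  love  my  you
    0      0    0  1     1   0    1
    1      1    2  0     0   1    1
    The result shows that the first sentence contains one "i", one "love" and one "you",
    and the second contains two "are" plus one each of "angle", "my" and "you".
    So CountVectorizer only counts how often each term occurs in each sentence,
    and every dimension stands for one term.
    With 10,000 sentences containing 60,000 distinct words, the result is a
    10,000 x 60,000 matrix (very large), and it grows further as the vocabulary grows.
    Because many words occur only once, most entries are zero and the representation
    is highly redundant, so use this approach with care.
    '''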

Output

['angle', 'angle are', 'are', 'are my', 'i', 'i love', 'love', 'love you', 'my', 'my angle', 'you', 'you are']
{'i': 4, 'love': 6, 'you': 10, 'i love': 5, 'love you': 7, 'are': 2, 'my': 8, 'angle': 0, 'you are': 11, 'are my': 3, 'my angle': 9, 'angle are': 1}
  (0, 4)    1
  (0, 6)    1
  (0, 10)   1
  (0, 5)    1
  (0, 7)    1
  (1, 10)   1
  (1, 2)    2
  (1, 8)    1
  (1, 0)    1
  (1, 11)   1
  (1, 3)    1
  (1, 9)    1
  (1, 1)    1
   angle  angle are  are  are my  i  ...  love you  my  my angle  you  you are
0      0          0    0       0  1  ...         1   0         0    1        0
1      1          1    2       1  0  ...         0   1         1    1        1
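To see what each of the three parameters changes, here is a small sketch on the same two sentences. The helper `vocab` is a hypothetical name introduced only for this illustration; everything else is the standard CountVectorizer API.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = ['I love you', 'you are my angle are']


def vocab(**kwargs):
    """Hypothetical helper: fit on `data` and return the sorted feature names."""
    vec = CountVectorizer(**kwargs)
    vec.fit(data)
    return sorted(vec.vocabulary_)


# Default token_pattern only keeps tokens of 2+ characters, so "i" is dropped.
default_terms = vocab()
# \w{1,} also keeps single-character tokens such as "i".
all_terms = vocab(token_pattern=r'\w{1,}')
# ngram_range=(2, 2) keeps only the 2-grams.
bigrams = vocab(ngram_range=(2, 2), token_pattern=r'\w{1,}')

# binary=False stores counts; binary=True stores only presence (0 or 1).
counts = CountVectorizer(token_pattern=r'\w{1,}').fit_transform(data).toarray()
flags = CountVectorizer(token_pattern=r'\w{1,}', binary=True).fit_transform(data).toarray()

print(default_terms)  # no "i"
print(all_terms)      # includes "i"
print(bigrams)
print(counts.max(), flags.max())  # "are" is counted twice vs. capped at 1
```

With `binary=True` the second sentence's two occurrences of "are" collapse to a single 1, which is useful when only presence matters.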

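The script's closing note warns that the count matrix grows with the vocabulary and is mostly zeros. One possible way to keep it smaller is the `min_df` parameter, which drops terms that appear in too few documents. The sketch below uses a made-up three-sentence corpus (an assumption for illustration, not data from the original post) and is not meant as a recommended setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    'I love you',
    'you are my angle are',
    'you love my angle',
]

# Full vocabulary: one column per distinct term.
full = CountVectorizer(token_pattern=r'\w{1,}')
x_full = full.fit_transform(corpus)

# Share of cells in the matrix that are actually non-zero.
density = x_full.nnz / (x_full.shape[0] * x_full.shape[1])

# min_df=2 drops terms that occur in fewer than 2 documents.
pruned = CountVectorizer(token_pattern=r'\w{1,}', min_df=2)
x_pruned = pruned.fit_transform(corpus)

print(x_full.shape, x_full.nnz, round(density, 2))
print(x_pruned.shape, sorted(pruned.vocabulary_))
```

fit_transform already returns a SciPy sparse matrix, so only the non-zero counts are stored; calling `toarray()` is what materializes the full dense table and should be avoided on large corpora.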