TFIDF

虽然主要的检索功能实现了，但是我们还需要一个“推荐阅读”的功能。当用户浏览某条具体新闻时，我们在页面底端给出5条和该新闻相关的新闻，也就是一个最简单的推荐系统。搜狐新闻“相关新闻”模块推荐模块的思路是度量两两新闻之间的相似度，取相似度最高的前5篇新闻作为推荐阅读的新闻。我们前面讲过，一篇文档可以用一个向量表示，向量中的每个值是不同词项t在该文档d中的词频tf。但是一篇较短的文档（如新闻）的关键词并不多，所以我们可以提取每篇新闻的关键词，用这些关键词的tfidf值构成文档的向量表示，这样能够大大减少相似度计算量，同时保持较好的推荐效果。 jieba分词组件自带关键词提取功能，并能返回关键词的tfidf值。所以对每篇新闻，我们先提取tfidf得分最高的前25个关键词，用这25个关键词的tfidf值作为文档的向量表示。由此能够得到一个1000*m的文档词项矩阵M，矩阵每行表示一个文档，每列表示一个词项，m为1000个文档的所有互异的关键词（大概10000个）。矩阵M当然也是稀疏矩阵。得到文档词项矩阵M之后，我们利用sklearn的pairwise_distances函数计算M中行向量之间的cosine相似度，对每个文档，得到与其最相似的前5篇新闻id，并把结果写入数据库。推荐阅读模块的代码如下： 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 # -*- coding: utf-8 -*- """ Created on Wed Dec 23 14:06:10 2015 @author: bitjoy.net """ from os import listdir import xml.etree.ElementTree as ET import jieba import jieba.analyse import sqlite3 import configparser from datetime import * import math import pandas as pd import numpy as np from sklearn.metrics import pairwise_distances class RecommendationModule: stop_words = set() k_nearest = [] config_path = '' config_encoding = '' doc_dir_path = '' doc_encoding = '' stop_words_path = '' stop_words_encoding = '' idf_path = '' db_path = '' def __init__(self, config_path, config_encoding): self.config_path = config_path self.config_encoding = config_encoding config = configparser.ConfigParser() config.read(config_path, config_encoding) self.doc_dir_path = config['DEFAULT']['doc_dir_path'] self.doc_encoding = config['DEFAULT']['doc_encoding'] self.stop_words_path = config['DEFAULT']['stop_words_path'] self.stop_words_encoding = config['DEFAULT']['stop_words_encoding'] self.idf_path = config['DEFAULT']['idf_path'] self.db_path = config['DEFAULT']['db_path'] f = open(self.stop_words_path, encoding = self.stop_words_encoding) words = f.read() self.stop_words = set(words.split('\n')) def write_k_nearest_matrix_to_db(self): conn = sqlite3.connect(self.db_path) c = conn.cursor() c.execute("'DROP TABLE IF EXISTS knearest'") c.execute("'CREATE TABLE knearest(id INTEGER PRIMARY KEY, first INTEGER, second INTEGER, third INTEGER, fourth INTEGER, fifth INTEGER)'") for docid, doclist in self.k_nearest: c.execute("INSERT INTO knearest VALUES (?, ?, ?, ?, ?, ?)", tuple([docid] + doclist)) conn.commit() conn.close() def is_number(self, s): try: float(s) return True except ValueError: return False def construct_dt_matrix(self, files, topK = 200): jieba.analyse.set_stop_words(self.stop_words_path) jieba.analyse.set_idf_path(self.idf_path) M = len(files) N = 1 terms = {} dt = [] for i in files: root = ET.parse(self.doc_dir_path + i).getroot() title = root.find('title').text body = root.find('body').text docid = int(root.find('id').text) tags = jieba.analyse.extract_tags(title + '。' + body, topK=topK, withWeight=True) #tags = jieba.analyse.extract_tags(title, topK=topK, withWeight=True) cleaned_dict = {} for word, tfidf in tags: word = word.strip().lower() if word == '' or self.is_number(word): continue cleaned_dict[word] = tfidf if word not in terms: terms[word] = N N += 1 dt.append([docid, cleaned_dict]) dt_matrix = [[0 for i in range(N)] for j in range(M)] i =0 for docid, t_tfidf in dt: dt_matrix[i][0] = docid for term, tfidf in t_tfidf.items(): dt_matrix[i][terms[term]] = tfidf i += 1 dt_matrix = pd.DataFrame(dt_matrix) dt_matrix.index = dt_matrix[0] print('dt_matrix shape:(%d %d)'%(dt_matrix.shape)) return dt_matrix def construct_k_nearest_matrix(self, dt_matrix, k): tmp = np.array(1 – pairwise_distances(dt_matrix[dt_matrix.columns[1:]], metric = "cosine")) similarity_matrix = pd.DataFrame(tmp, index = dt_matrix.index.tolist(), columns = dt_matrix.index.tolist()) for i in similarity_matrix.index: tmp = [int(i),[]] j = 0 while j <= k: max_col = similarity_matrix.loc[i].idxmax(axis = 1) similarity_matrix.loc[i][max_col] = -1 if max_col != i: tmp[1].append(int(max_col)) #max column name j += 1 self.k_nearest.append(tmp) def gen_idf_file(self): files = listdir(self.doc_dir_path) n = float(len(files)) idf = {} for i in files: root = ET.parse(self.doc_dir_path + i).getroot() title = root.find('title').text body = root.find('body').text seg_list = jieba.lcut(title + '。' + body, cut_all=False) seg_list = set(seg_list) – self.stop_words for word in seg_list: word = word.strip().lower() if word == '' or self.is_number(word): continue if word not in idf: idf[word] = 1 else: idf[word] = idf[word] + 1 idf_file = open(self.idf_path, 'w', encoding = 'utf-8') for word, df in idf.items(): idf_file.write('%s %.9f\n'%(word, math.log(n / df))) idf_file.close() def find_k_nearest(self, k, topK): self.gen_idf_file() files = listdir(self.doc_dir_path) dt_matrix = self.construct_dt_matrix(files, topK) self.construct_k_nearest_matrix(dt_matrix, k) self.write_k_nearest_matrix_to_db() if __name__ == "__main__": print('—–start time: %s—–'%(datetime.today())) rm = RecommendationModule('../config.ini', 'utf-8') rm.find_k_nearest(5, 25) print('—–finish time: %s—–'%(datetime.today())) 这个模块的代码量最多，主要原因是需要构建文档词项矩阵，并且计算k邻居矩阵。矩阵数据结构的设计需要特别注意，否则会严重影响系统的效率。我刚开始把任务都扔给了pandas.DataFrame，后来发现当两个文档向量合并时，需要join连接操作，当数据量很大时，非常耗时，所以改成了先用python原始的list存储，最后一次性构造一个完整的pandas.DataFrame，速度比之前快了不少。 ...

构建好倒排索引之后，就可以开始检索了。检索模型有很多，比如向量空间模型、概率模型、语言模型等。其中最有名的、检索效果最好的是基于概率的BM25模型。给定一个查询Q和一篇文档d，d对Q的BM25得分公式为 $$BM25_{score}(Q,d)=\sum_{t\in Q}w(t,d)$$$$w(t,d)=\frac{qtf}{k_3+qtf}\times \frac{k_1\times tf}{tf+k_1(1-b+b\times l_d/avg\_l)}\times log_2\frac{N-df+0.5}{df+0.5}$$公式中变量含义如下： $qtf$：查询中的词频 $tf$：文档中的词频 $l_d$：文档长度 $avg\_l$：平均文档长度 $N$：文档数量 $df$：文档频率 $b,k_1,k_3$：可调参数这个公式看起来很复杂，我们把它分解一下，其实很容易理解。第一个公式是外部公式，一个查询Q可能包含多个词项，比如“苹果手机”就包含“苹果”和“手机”两个词项，我们需要分别计算“苹果”和“手机”对某个文档d的贡献分数w(t,d)，然后将他们加起来就是整个文档d相对于查询Q的得分。第二个公式就是计算某个词项t在文档d中的得分，它包括三个部分。第一个部分是词项t在查询Q中的得分，比如查询“中国人说中国话”中“中国”出现了两次，此时qtf=2，说明这个查询希望找到的文档和“中国”更相关，“中国”的权重应该更大，但是通常情况下，查询Q都很短，而且不太可能包含相同的词项，所以这个因子是一个常数，我们在实现的时候可以忽略。第二部分类似于TFIDF模型中的TF项。也就是说某个词项t在文档d中出现次数越多，则t越重要，但是文档长度越长，tf也倾向于变大，所以使用文档长度除以平均长度$l_d/avg\_l$起到某种归一化的效果，$k_1$和$b$是可调参数。第三部分类似于TFIDF模型中的IDF项。也就是说虽然“的”、“地”、“得”等停用词在某文档d中出现的次数很多，但是他们在很多文档中都出现过，所以这些词对d的贡献分并不高，接近于0；反而那些很稀有的词如”糖尿病“能够很好的区分不同文档，这些词对文档的贡献分应该较高。所以根据BM25公式，我们可以很快计算出不同文档t对查询Q的得分情况，然后按得分高低排序给出结果。下面是给定一个查询句子sentence，根据BM25公式给出文档排名的函数 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 def result_by_BM25(self, sentence): seg_list = jieba.lcut(sentence, cut_all=False) n, cleaned_dict = self.clean_list(seg_list) BM25_scores = {} for term in cleaned_dict.keys(): r = self.fetch_from_db(term) if r is None: continue df = r[1] w = math.log2((self.N - df + 0.5) / (df + 0.5)) docs = r[2].split('\n') for doc in docs: docid, date_time, tf, ld = doc.split('\t') docid = int(docid) tf = int(tf) ld = int(ld) s = (self.K1 * tf * w) / (tf + self.K1 * (1 - self.B + self.B * ld / self.AVG_L)) if docid in BM25_scores: BM25_scores[docid] = BM25_scores[docid] + s else: BM25_scores[docid] = s BM25_scores = sorted(BM25_scores.items(), key = operator.itemgetter(1)) BM25_scores.reverse() if len(BM25_scores) == 0: return 0, [] else: return 1, BM25_scores 首先将句子分词得到所有查询词项，然后从数据库中取出词项对应的倒排记录表，对记录表中的所有文档，计算其BM25得分，最后按得分高低排序作为查询结果。 ...

TFIDF

和我一起构建搜索引擎（五）推荐阅读

和我一起构建搜索引擎（四）检索模型