前言:
很多站长,当网站数据量非常大的时候,对于整站搜索 ,难免觉得亚历山大,为什么呢?
简单来说你网站有几十条文章,或者几千条文章,还好说一些,如果网站有几千万篇文章呢?搭建自己的搜索引擎, 会不会很头疼?
再加上全文搜索,分词搜索,联想搜索,语义搜索, 这些功能实现起来,真是要费一番周折。还有选取数据库也是一种挑战,创建索引,优化查询语句,整个工序下来,人力物力,工作量确实挺大的!
正文:
说了这么多,就是为了说到 阿里云开放搜索,
先来看一下官方的介绍
开放搜索(OpenSearch)是解决用户结构化数据搜索需求的托管服务,支持数据结构、搜索排序、数据处理自由定制。 开放搜索为您的网站或应用程序提供简单、低成本、稳定、高效的搜索解决方案。
阿里云开放搜索OpenSearch是一款阿里巴巴自主研发的大规模分布式搜索引擎平台,该平台承载了淘宝、天猫、1688、神马搜索、口碑、菜鸟等搜索业务,通过OpenSearch云服务的方式,将阿里巴巴成熟的搜索技术共享给广大开发者。
搜索应用自由创建,动态修改将应用结构简单化、定制化,用户可以通过可视化界面,自由配置文档的字段及属性
多种接入方式,数据自动同步支持RDS、ODPS数据源无缝接入、API/SDK数据上传、界面上传等多种接入方式,数据自动同步和定时索引重建,省时省力!
支持多表,插件式数据处理通过简单操作即可完成多表join和数据处理,数据复杂应用再也不用担心享受不到开放的便捷了!
搜索结果可定制支持两轮相关性排序定制,简单、灵活,加速产品效果迭代。
丰富的搜索结果调优功能设置,提升用户搜索体验拥有查询智能识别,自动提示,超强纠错、模糊搜索、拼音搜索等丰富产品功能
O2O应用已经涉及多个方面,如外卖、电影、旅行等,这类产品对搜索依赖比较重,除了显式用户搜索关键词外,还会根据业务场景根据搜索形成推荐页,有效做到千人千面的展现效果。主要搜索功能有关键词搜索、附近人、配送范围、营业时间、按距离排序、商家打散等。
配合阿里云RDS数据库或者ODPS数据源可以一键同步,步骤简直简单到爆!一个字 那就是爽
那么今天就来跟大家分享一下接入方法:
首先先去开通开放搜索:
如果前期只是测试, 建议你先使用标准版的入门型,价格比较便宜。一天也就2毛钱左右!
然后创建数据表:
创建索引:
然后就是导入测试数据了, 这里可以整理成json文件,本地上传.不过文件需要分隔,一次最大只能上传2M,每次最多上传1000条数据。那么100万数据 需要分1000次上传.上传python脚本如下:
#! /usr/bin/env python #coding=utf-8 #json生成 #by http://tools.bugscaner.com/ import json import md5,time,random,hmac,base64, copy import urllib from hashlib import sha1 import httplib import hashlib class V3Api: URI_PREFIX = '/v3/openapi/apps/' OS_PREFIX = 'OPENSEARCH' VERB = 'POST' #定义需推送到的应用表名,替换下面内容为数据需推送到应用中某个表名 TABLE_NAME = '换成自己创建的表' #定义上传数据,将下面待上传数据替换为自己的数据 def __init__(self,body_json): self.accesskey_id = '换成自己的' self.accesskey_secret = '换成自己的' # 下面host地址,替换为访问对应应用api地址,例如华东1区 self.address = 'opensearch-cn-qingdao.aliyuncs.com' self.appname = 'ceshiyixia' self.port = 80 self.body_json = body_json def runPost(self): query, header = self.buildQuery(app_name = self.appname, access_key = self.accesskey_id, secret = self.accesskey_secret, http_header = {}, http_params = {}) print query print header conn = httplib.HTTPConnection(self.address, self.port) conn.request(self.VERB, url = query, body = self.body_json, headers = header) response = conn.getresponse() return response.status, response.getheaders(), response.read() def buildQuery(self, app_name = None, access_key = None, secret = None, http_header = {}, http_params = {}): uri = self.URI_PREFIX if app_name is not None: uri += app_name uri += '/{TABLE_NAME}/actions/bulk'.format(TABLE_NAME=self.TABLE_NAME) request_header = self.buildRequestHeader(uri = uri, access_key = access_key, secret = secret, http_params = http_params, http_header = http_header) return uri , request_header def buildAuthorization(self, uri, access_key, secret, http_params, request_header): canonicalized = self.VERB + '\n'\ + self.__getHeader(request_header, 'Content-MD5', hashlib.md5(self.body_json).hexdigest()) + '\n' \ + self.__getHeader(request_header, 'Content-Type', '') + '\n' \ + self.__getHeader(request_header, 'Date', '') + '\n' \ + self.__canonicalizedHeaders(request_header) \ + self.__canonicalizedResource(uri, http_params) h = hmac.new(secret, canonicalized, sha1) signature = base64.encodestring(h.digest()).strip() return '%s %s%s%s' %(self.OS_PREFIX, access_key, ':', signature) def __getHeader(self, header, key, default_value = None): if key in header and header[key] is not None: return header[key] return default_value def __canonicalizedHeaders(self, request_header): header = {} for key, value in request_header.iteritems(): if key is None or value is None: continue k = key.strip(' \t') v = value.strip(' \t') if k.startswith('X-Opensearch-') and len(v) > 0: header[k] = v if len(header) == 0: return '' sorted_header = sorted(header.items(), key=lambda header: header[0]) canonicalized = '' for (key, value) in sorted_header: canonicalized += (key.lower() + ':' + value + '\n') return canonicalized def __canonicalizedResource(self, uri, http_params): canonicalized = urllib.quote(uri).replace('%2F', '/') sorted_params = sorted(http_params.items(), key = lambda http_params : http_params[0]) params = [] for (key, value) in sorted_params: if value is None or len(value) == 0: continue params.append(urllib.quote(key) + '=' + urllib.quote(value)) return canonicalized + '&'.join(params) def generateDate(self, format = "%Y-%m-%dT%H:%M:%SZ", timestamp = None): if timestamp is None: return time.strftime(format, time.gmtime()) else: return time.strftime(format, timestamp) def generateNonce(self): return str(int(time.time()*100)) + str(random.randint(1000, 9999)) def buildRequestHeader(self, uri, access_key, secret, http_params, http_header): request_header = copy.deepcopy(http_header) if 'Content-MD5' not in request_header: request_header['Content-MD5'] = hashlib.md5(self.body_json).hexdigest() if 'Content-Type' not in request_header: request_header['Content-Type'] = 'application/json' if 'Date' not in request_header: request_header['Date'] = self.generateDate() if 'X-Opensearch-Nonce' not in request_header: request_header['X-Opensearch-Nonce'] = self.generateNonce() if 'Authorization' not in request_header: request_header['Authorization'] = self.buildAuthorization(uri, access_key, secret, http_params, request_header) key_del = [] for key, value in request_header.iteritems(): if value is None: key_del.append(key) for key in key_del: del request_header[key] return request_header goods = [] nb = 0 for x in open("updata.txt","r"): if nb>999: #这里执行上传然后清空nb insertdata = json.dumps(goods) api = V3Api(insertdata) print api.runPost() nb = 0 goods = [] else: infos = {} fields = {} oks = x.strip().split(" ") fields["id"]=oks[0] fields["title"] = oks[1] fields["user"] = oks[2] fields["time"] = oks[4] fields["size"] = oks[3] fields["read"] = oks[5] fields["path"] = "a" infos["fields"] = fields infos["cmd"] = "ADD" goods.append(infos) nb+=1
updata.txt文件里放入你要导入的文本,一排一行格式如下:
由于本人比较擅长用django开发网站, 所以今天自然要跟大家分享一段接入网站程序的python sdk
官方很早之前写过一个相当简陋的sdk,找了好久才找到。注意这个sdk只是搜索数据库sdk,上面那个python脚本是上传sdk 。
#! /usr/bin/env python #coding=utf-8 import urllib import requests import collections from hashlib import sha1 import md5,time,random,hmac,base64, copy class V3Api: #定义变量 URI_PREFIX = '/v3/openapi/apps/' OS_PREFIX = 'OPENSEARCH' def __init__(self, address = '', port = ''): self.address = address self.port = port def runQuery(self, app_name = None, access_key = None, secret = None, http_header = {}, http_params = {}): query, header = self.buildQuery(app_name = app_name, access_key = access_key, secret = secret, http_header = http_header, http_params = http_params) makeurl = "http://"+self.address+":"+self.port+query returnjson = requests.get(makeurl,headers=header).json() return returnjson #return response.status, response.getheaders(), response.read() def buildQuery(self, app_name = None, access_key = None, secret = None, http_header = {}, http_params = {}): uri = self.URI_PREFIX if app_name is not None: uri += app_name uri += '/search' param = [] for key, value in http_params.iteritems(): param.append(urllib.quote(key) + '=' + urllib.quote(value)) query = ('&'.join(param)) request_header = self.buildRequestHeader(uri = uri, access_key = access_key, secret = secret, http_params = http_params, http_header = http_header) return uri + '?' + query, request_header # 签名实现 def buildAuthorization(self, uri, access_key, secret, http_params, request_header): canonicalized = 'GET\n'\ + self.__getHeader(request_header, 'Content-MD5', '') + '\n' \ + self.__getHeader(request_header, 'Content-Type', '') + '\n' \ + self.__getHeader(request_header, 'Date', '') + '\n' \ + self.__canonicalizedHeaders(request_header) \ + self.__canonicalizedResource(uri, http_params) h = hmac.new(secret, canonicalized, sha1) signature = base64.encodestring(h.digest()).strip() return '%s %s%s%s' %(self.OS_PREFIX, access_key, ':', signature) def __getHeader(self, header, key, default_value = None): if key in header and header[key] is not None: return header[key] return default_value def __canonicalizedResource(self, uri, http_params): canonicalized = urllib.quote(uri).replace('%2F', '/') sorted_params = sorted(http_params.items(), key = lambda http_params : http_params[0]) params = [] for (key, value) in sorted_params: if value is None or len(value) == 0: continue params.append(urllib.quote(key) + '=' + urllib.quote(value)) return canonicalized + '?' + '&'.join(params) def generateDate(self, format = "%Y-%m-%dT%H:%M:%SZ", timestamp = None): if timestamp is None: return time.strftime(format, time.gmtime()) else: return time.strftime(format, timestamp) def generateNonce(self): return str(int(time.time()*100)) + str(random.randint(1000, 9999)) def __canonicalizedHeaders(self, request_header): header = {} for key, value in request_header.iteritems(): if key is None or value is None: continue k = key.strip(' \t') v = value.strip(' \t') if k.startswith('X-Opensearch-') and len(v) > 0: header[k] = v if len(header) == 0: return '' sorted_header = sorted(header.items(), key=lambda header: header[0]) canonicalized = '' for (key, value) in sorted_header: canonicalized += (key.lower() + ':' + value + '\n') return canonicalized # 构建请求 Header 参数 def buildRequestHeader(self, uri, access_key, secret, http_params, http_header): request_header = copy.deepcopy(http_header) if 'Content-Type' not in request_header: request_header['Content-Type'] = 'application/json' if 'Date' not in request_header: request_header['Date'] = self.generateDate() if 'X-Opensearch-Nonce' not in request_header: request_header['X-Opensearch-Nonce'] = self.generateNonce() if 'Authorization' not in request_header: request_header['Authorization'] = self.buildAuthorization(uri, access_key, secret, http_params, request_header) key_del = [] for key, value in request_header.iteritems(): if value is None: key_del.append(key) for key in key_del: del request_header[key] return request_header def sqlquery(query_subsentences_params): accesskey_id = '改成自己的' accesskey_secret = '改成自己的' # 下面的值替换为应用访问api地址,例如 opensearch-cn-hangzhou.console.aliyun.com internet_host = 'opensearch-cn-qingdao.aliyuncs.com' appname = 'ceshiyixia' api = V3Api(address = internet_host, port = '80') return api.runQuery(app_name = appname, access_key=accesskey_id, secret=accesskey_secret, http_params=query_subsentences_params, http_header={}) if __name__ == '__main__': accesskey_id = '改成自己的' accesskey_secret = '改成自己的' # 下面的值替换为应用访问api地址,例如 opensearch-cn-hangzhou.console.aliyun.com internet_host = 'opensearch-cn-qingdao.aliyuncs.com' appname = 'ceshiyixia' api = V3Api(address = internet_host, port = '80') # 下面为设置查询信息,query参数中可设置对应的查询子句,添加查询参数,参考fetch_fields用法 #'query': "query=name:'搜索'&&config=start:0,hit:1,format:json&&sort=+id", 这句是官方语句 #start:0,hit:1 这两个好像是分页 query_subsentences_params = { 'query': "query=title:'搜索'&&config=start:0,hit:1,format:json", 'fetch_fields':'id;title' } print api.runQuery(app_name = appname, access_key=accesskey_id, secret=accesskey_secret, http_params=query_subsentences_params, http_header={})
导入模块接入到自己的django视图函数里即可:
大概关键代码如下:
searchdict = { 'query': "query=title:'"+keyword+"'&&config=start:"+str((page-1)*10)+",hit:10,format:fulljson", 'fetch_fields':'id;title;user;size;read;time;path', 'summary':'''summary_field:title,summary_element:em,summary_snipped:2''', } jieguo = sqlquery(searchdict)
好了文章导入完成后测试一下搜索:
导入了测试数据大概5万条, 查询速度相当快 ,l来看一下接入sdk在django查询中的效果:
完全正常!bye
#文中提到的产品前往连接地址: