mbinary / Dbworld Search

Licence: mit

🔍 A simple search engine built with the Django framework

Labels

html django crawler search-engine

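Search engine implementation

A very, very naive search engine implemented with Django 2.1.3 and Python 3.6.

I am new to Django and not yet fluent with it, so treat this code as a reference only.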

Requirements

Programming language: Python 3
Runtime environment: Linux, shell
Tools used:
- Django-2.1.3
- python3.6
  - summa (text-rank)
  - dj-pagination
  - BeautifulSoup

Results

Home page
Pagination

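Design

Designing the data structures

We need to store an inverted index, plus each message's send time, sender, subject, subject link, and so on. So I designed the following database structure.

Doc: a file, that is, one web page, holding its main metadata.
File: linked to Doc by a foreign key; holds the text content of the web page and a flag (isIndexed) marking whether it has been indexed.
Wordindex: one entry of the inverted index. It holds a term and its posting list. The posting list is designed as a hash table whose keys are Doc ids and whose values are the number of times the term occurs in that Doc. For simplicity, the hash table (a dict in Python) is stored in the database as a JSON-formatted text string.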

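Note that a key-value pair cannot be added with the following code, because word.index is a property that decodes the stored JSON into a new dict on every access, so the assignment changes a temporary object and is never saved:

word.index[doc.id] = num
word.save()

Instead, read the dict, update it, and assign it back so that the setter serializes it again:

dic = word.index
dic[doc.id] = num
word.index = dic
word.save()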
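
The pitfall can be reproduced without Django. The following is a minimal sketch in which WordEntry is a hypothetical plain-Python stand-in for the Wordindex model (no database, no save()); it only mirrors the JSON-backed property.

```python
import json


class WordEntry:
    """Hypothetical stand-in for the Wordindex model; no database involved."""

    def __init__(self):
        self._index = json.dumps({})

    @property
    def index(self):
        # A new dict is decoded on every access.
        return json.loads(self._index)

    @index.setter
    def index(self, value):
        self._index = json.dumps(value)


word = WordEntry()

# Wrong: this mutates a temporary dict that is thrown away immediately.
word.index[7] = 3
lost = word.index

# Right: update a local copy, then assign it back through the setter.
postings = word.index
postings[7] = 3
word.index = postings
kept = word.index
```

Here lost is still an empty dict, while kept holds the new entry. The key comes back as the string "7", because JSON object keys are always strings; this is why the ranking code converts ids back with int().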

The Django model code is given below.

import json

from django.db import models


class Doc(models.Model):
    """One document, that is, one web page and its main metadata."""
    sendTime = models.DateField()  # date only, e.g. 2018-12-12 (DateTimeField would also store a time)
    sender = models.CharField(max_length=20)
    messageType = models.CharField(max_length=20)  # journal, conf., etc.
    subject = models.CharField(max_length=100)
    begin = models.DateField()
    deadline = models.DateField()
    subjectUrl = models.CharField(max_length=100)
    webpageUrl = models.CharField(max_length=100)
    desc = models.CharField(max_length=250, default='')
    loc = models.CharField(max_length=40, default='')
    keywords = models.CharField(max_length=200, default='')

    def __str__(self):
        return self.subjectUrl


class Wordindex(models.Model):
    """One inverted-index entry: a term and its posting list."""
    word = models.CharField(max_length=45)

    # The posting list {doc id: count} is stored as JSON text;
    # another way would be to create a custom field.
    _index = models.TextField(null=True)

    @property
    def index(self):
        # Returns a fresh dict on every access (empty if nothing is stored yet).
        # JSON turns the doc ids into string keys.
        return json.loads(self._index) if self._index else {}

    @index.setter
    def index(self, dic):
        self._index = json.dumps(dic)

    def __str__(self):
        return self.word


class File(models.Model):
    """Text content of a Doc's web page, plus whether it has been indexed."""
    doc = models.OneToOneField(Doc, on_delete=models.CASCADE)
    content = models.TextField(null=True)
    isIndexed = models.BooleanField(default=False)

    def __str__(self):
        return 'file: {} -> doc: {}'.format(self.id, self.doc.id)

Web page extraction

First comes the home page, whose structure looks like this:

<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019 </TD>
<TD>conf. ann. </TD>
<TD>marta cimitile </TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call forFUZZ IEEE Special Session</A> </TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>

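The rows are regular, so the fields can be extracted directly. In my implementation I used Python's BeautifulSoup package.

The key point when using it is which parser you pass in: I tried html and lxml and both had problems, so in the end I used html5lib.

Next, the fourth column of each table row above (the fourth td tag) contains an <a> tag that links to the web page of the subject. That link needs to be extracted as well.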
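
The row-by-row extraction can be sketched as follows. To keep the example free of extra packages it uses the standard library's html.parser on the well-formed sample row above; this is a stand-in, not the project's code, which uses BeautifulSoup with html5lib. The class RowParser, the function parse_rows, and the mapping of the six columns to field names are assumptions inferred from the sample row and the model.

```python
from html.parser import HTMLParser

SAMPLE = """
<TBODY>
<TR VALIGN=TOP>
<TD>03-Jan-2019 </TD>
<TD>conf. ann. </TD>
<TD>marta cimitile </TD>
<TD><A HREF="http://www.cs.wisc.edu/dbworld/messages/2019-01/1546520301.html" rel="nofollow">Call forFUZZ IEEE Special Session</A> </TD>
<TD>13-Jan-2019</TD>
<TD><A rel="nofollow" HREF="http://sites.ieee.org/fuzzieee-2019/special-sessions/">web page</A></TD>
</TR></TBODY>
"""


class RowParser(HTMLParser):
    """Collect the text and first link of every <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cells
        self._row = None   # cells of the row currently being read
        self._cell = None  # cell currently being read

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = {"text": "", "href": None}
        elif tag == "a" and self._cell is not None and self._cell["href"] is None:
            self._cell["href"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Turn each six-column table row into a dict (column order is assumed)."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    records = []
    for cells in parser.rows:
        if len(cells) < 6:
            continue  # skip rows that do not match the expected layout
        records.append({
            "sendTime": cells[0]["text"],
            "messageType": cells[1]["text"],
            "sender": cells[2]["text"],
            "subject": cells[3]["text"],
            "subjectUrl": cells[3]["href"],   # link in the fourth column
            "deadline": cells[4]["text"],
            "webpageUrl": cells[5]["href"],
        })
    return records


records = parse_rows(SAMPLE)
```

On real pages with irregular markup a more forgiving parser is the safer choice, which matches the experience described above.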

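Extracting time and location

Times and locations follow common patterns, so the usual patterns can be enumerated and matched with regular expressions.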
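
As an illustration of the pattern-list approach, here is a minimal sketch. The README does not show the project's actual patterns, so every pattern, cue phrase, and function name below is a hypothetical example.

```python
import re
from datetime import datetime

# Hypothetical date patterns, each paired with the strptime format that parses it.
DATE_PATTERNS = [
    (re.compile(r"\b\d{1,2}-[A-Za-z]{3}-\d{4}\b"), "%d-%b-%Y"),    # 13-Jan-2019
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # June 23, 2019
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),            # 2019-06-23
]

# Hypothetical location pattern: a cue phrase followed by capitalized names,
# optionally separated by commas ("New Orleans, USA").
CAP = r"[A-Z][A-Za-z]+"
PLACE = rf"{CAP}(?: {CAP})*"
LOCATION = re.compile(rf"(?:held in|take place in|Location:)\s+({PLACE}(?:, {PLACE})*)")


def find_dates(text):
    """Return every date that matches one of the known patterns."""
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(0), fmt).date())
            except ValueError:
                continue  # looked like a date but did not parse as one
    return found


def find_location(text):
    """Return the first location-like phrase, or None if nothing matches."""
    match = LOCATION.search(text)
    return match.group(1) if match else None


text = ("FUZZ-IEEE 2019 will be held in New Orleans, USA, from June 23, 2019. "
        "Submission deadline: 13-Jan-2019.")
dates = find_dates(text)
location = find_location(text)
```

Returning None when no cue phrase is present makes a failed location lookup explicit instead of storing a wrong guess.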

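Extracting the summary and keywords

This uses the TextRank algorithm.
At first I implemented a very basic TextRank myself, but the results were poor, so I switched to the official text-rank implementation.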
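
For readers unfamiliar with TextRank, the idea for keywords can be sketched with the standard library alone: build a graph of words that occur close together, then run a PageRank-style iteration over it. This is neither summa's implementation nor the project's own first attempt; the function name, window size, damping factor, and iteration count are illustrative assumptions.

```python
import re
from collections import defaultdict


def textrank_keywords(text, top_n=5, window=3, damping=0.85,
                      iterations=50, stop_words=frozenset()):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in stop_words]

    # Undirected graph: two words are linked if they appear within
    # `window` consecutive positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the score update; every node starts with the same score.
    scores = dict.fromkeys(neighbors, 1.0)
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[v] / len(neighbors[v]) for v in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


sample = "search engine index search engine crawler search ranking"
top = textrank_keywords(sample, top_n=1)
```

In the sample, "search" co-occurs with every other word, so it ends up with the highest score. A production library adds steps such as stemming and part-of-speech filtering, which may explain the quality gap mentioned above.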

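Building the index

Following the principle of an inverted index, this part tokenizes the page text, strips punctuation and the like, and then stores the inverted index using the database models described above.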
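
The indexing step can be sketched in memory as follows. The tokenizer (lowercase alphanumeric runs) and the function names are assumptions; the project stores each term's posting list in a Wordindex row rather than in a Python dict.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs, dropping punctuation."""
    return TOKEN.findall(text.lower())


def build_index(docs):
    """Map each term to its posting list {doc id: occurrences in that doc}."""
    index = {}
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index.setdefault(term, {})[doc_id] = count
    return index


docs = {
    1: "Call for papers: fuzzy systems.",
    2: "Fuzzy logic, fuzzy sets!",
}
index = build_index(docs)
```

Each value of index has the same shape as the hash table described for Wordindex: document ids as keys and term counts as values.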

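Designing the web page

At the top is the title. Below it is a row of options for sorting the results by those fields. The next row has an update button and a search form.

Below that are the search results, laid out in divs.

Each result contains a title, keywords, time, location, and a summary.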

Search and ranking

Here I implemented the tf-idf algorithm myself to rank the results. The code is as follows:

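from math import log10

# process() (appears to map each query term to its count) and stopWords
# are defined elsewhere in the project.
def tfidf(words):
    if not words:
        return []
    ct = process(words)
    weight = {}
    tf = {}
    for term in ct:
        try:
            tf[term] = Wordindex.objects.get(word=term).index
        except Wordindex.DoesNotExist:
            tf[term] = {}  # the term is not in the index
            continue
        for docid in tf[term]:
            weight.setdefault(docid, 0)
    # N is the corpus size; counting only the matching documents would make
    # the idf factor zero for every single-term query.
    N = Doc.objects.count()
    for term in ct:
        dic = tf[term]
        for docid, freq in dic.items():
            w = (1 + log10(freq)) * log10(N / len(dic)) * ct[term]
            if term in stopWords:
                w *= 0.3  # down-weight stop words instead of dropping them
            weight[docid] += w
    ids = sorted(weight, key=lambda k: weight[k], reverse=True)
    # Posting-list keys are strings after the JSON round trip, hence int(i).
    return [Doc.objects.get(id=int(i)).__dict__ for i in ids]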
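
To see the weighting at work without a database, here is a minimal in-memory sketch of the same formula, w = (1 + log10(tf)) * log10(N / df) * query count. It assumes N is the total number of indexed documents; the function name rank, the n_docs parameter, and the sample index are hypothetical.

```python
from collections import Counter
from math import log10


def rank(query_terms, index, n_docs, stop_words=frozenset()):
    """Return (doc id, weight) pairs, best match first."""
    weight = {}
    for term, query_count in Counter(query_terms).items():
        postings = index.get(term, {})
        if not postings:
            continue  # the term does not occur in any document
        idf = log10(n_docs / len(postings))
        for doc_id, freq in postings.items():
            w = (1 + log10(freq)) * idf * query_count
            if term in stop_words:
                w *= 0.3  # same stop-word damping as in the code above
            weight[doc_id] = weight.get(doc_id, 0.0) + w
    return sorted(weight.items(), key=lambda item: item[1], reverse=True)


# "fuzzy" occurs once in doc 1 and ten times in doc 2; "logic" once in doc 2.
index = {"fuzzy": {"1": 1, "2": 10}, "logic": {"2": 1}}
ranked = rank(["fuzzy", "logic"], index, n_docs=4)
```

With four documents in total, doc 2 scores 2 * log10(2) + log10(4), about 1.204, and doc 1 scores log10(2), about 0.301, so doc 2 is ranked first.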

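Shortcomings

Subject extraction from web pages still needs improvement, and location extraction sometimes fails to find anything.
The page design could be more attractive.
The search engine's performance has not been evaluated yet.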
