
jroakes / tech-seo-crawler

Licence: MIT license
Build a small, three-domain internet using GitHub Pages and Wikipedia, then construct a crawler to crawl, render, and index it.

Programming Languages

python
139335 projects - #7 most used programming language
Jupyter Notebook
11667 projects

Projects that are alternatives of or similar to tech-seo-crawler

Rendertron
A Headless Chrome rendering solution
Stars: ✭ 5,593 (+9712.28%)
Mutual labels:  rendering, seo
Git Wiki Theme
A revolutionary full-featured wiki for github pages and jekyll. You don't need to compile it!
Stars: ✭ 139 (+143.86%)
Mutual labels:  github-pages, wikipedia
11r
America's favorite Eleventy blog template.
Stars: ✭ 135 (+136.84%)
Mutual labels:  github-pages, seo
flyyer-ruby
Ruby helpers to create https://cdn.flyyer.io URLs | Og:Image as a Service
Stars: ✭ 13 (-77.19%)
Mutual labels:  seo
podcastcrawler
PHP library to find podcasts
Stars: ✭ 40 (-29.82%)
Mutual labels:  crawling
Technique-iOS
A simple implementation of SCNTechnique
Stars: ✭ 65 (+14.04%)
Mutual labels:  rendering
mal-analysis
github repo for MyAnimeList analysis. Also links to the MAL dataset.
Stars: ✭ 31 (-45.61%)
Mutual labels:  crawling
CPU-Rasterizer
A tile based cpu rasterizer
Stars: ✭ 30 (-47.37%)
Mutual labels:  rendering
yii2-render-many
Trait for Yii Framework 2
Stars: ✭ 14 (-75.44%)
Mutual labels:  rendering
Real-Time-Rendering-4th-Bibliography-Collection
Real-Time Rendering 4th (RTR4) 参考文献合集典藏 | Collection of <Real-Time Rendering 4th (RTR4)> Bibliography / Reference
Stars: ✭ 2,806 (+4822.81%)
Mutual labels:  rendering
keywordsextract
keywords-extract - Command line tool extract keywords from any web page.
Stars: ✭ 50 (-12.28%)
Mutual labels:  seo
bside
Github Content Management System
Stars: ✭ 22 (-61.4%)
Mutual labels:  github-pages
gulp-sitemap
Generate a search engine friendly sitemap.xml using a Gulp stream
Stars: ✭ 60 (+5.26%)
Mutual labels:  seo
RadeonProRenderMayaPlugin
This hardware-agnostic rendering plug-in for Maya uses accurate ray-tracing technology to produce images and animations of your scenes, and provides real-time interactive rendering and continuous adjustment of effects.
Stars: ✭ 32 (-43.86%)
Mutual labels:  rendering
on-this-day
App that serves and displays events, births and deaths that occurred during the queried day of history, scraped from Wikipedia
Stars: ✭ 12 (-78.95%)
Mutual labels:  wikipedia
C-Raytracer
A CPU raytracer from scratch in C
Stars: ✭ 49 (-14.04%)
Mutual labels:  rendering
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+115.79%)
Mutual labels:  crawling
emacs-easy-jekyll
Emacs major mode for managing jekyll
Stars: ✭ 53 (-7.02%)
Mutual labels:  github-pages
core
The complete web scraping toolkit for PHP.
Stars: ✭ 1,110 (+1847.37%)
Mutual labels:  crawling
Awesome-meta-tags
📙 Awesome collection of meta tags
Stars: ✭ 18 (-68.42%)
Mutual labels:  seo

TechSEO Crawler

Build a small, three-domain internet using GitHub Pages and Wikipedia, then construct a crawler to crawl, render, and index it.


Play with the results here: Simple Search Engine

Please Note: The link above is hosted on a small AWS box, so if you have issues loading, try again later.

Slideshare is here: Building a Simple Crawler on a Toy Internet

Description

Web Folder

To crawl a small internet of sites, we first have to create one. This tool creates three small sites from Wikipedia data and hosts them on GitHub Pages. The sites are linked to each other but not to any other site on the internet.
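Because the three sites only link among themselves, their link graph is small and closed, which makes it easy to reason about link-based scoring. The following is a minimal sketch of power-iteration PageRank on such a graph. It is not the repo's implementation: the site names, the link structure in `toy_web`, and the `pagerank` function are hypothetical and chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outlinked pages."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link structure for a three-site toy internet.
toy_web = {
    "site-a": ["site-b", "site-c"],
    "site-b": ["site-c"],
    "site-c": ["site-a"],
}

ranks = pagerank(toy_web)
for site, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{site}: {score:.3f}")
```

In this toy graph, site-c receives links from both other sites, so it ends up with the highest score, and the scores sum to 1.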

Main function

This tool attempts to implement a small ecosystem of three websites, along with a simple crawler, renderer, and indexer. Although the author did research to construct the repo, it deliberately favors simplicity over complexity. Items that are part of large crawling infrastructures, most notably disparate systems and highly efficient code and data storage, are not part of this repo. It focuses on simple representations of each item so that it is more accessible to newer developers.

Parts:

  • PageRank
  • Chrome Headless Rendering
  • Text NLP Normalization
  • BERT Embeddings
  • Robots
  • Duplicate Content Shingling
  • URL Hashing
  • Document Frequency Functions (BM25 and TF-IDF)
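Two of the parts above, URL hashing and duplicate content shingling, are compact enough to sketch with the standard library alone. The example below is an illustration under stated assumptions, not the code used in this repo: the function names, the choice of SHA-256, the trailing-slash normalization, and the 3-word shingle size are all hypothetical.

```python
import hashlib
import re


def url_hash(url):
    """Return a fixed-length hex digest that can serve as a key for a URL."""
    # Assumption: only surrounding whitespace and trailing slashes are normalized.
    normalized = url.strip().rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(doc_a), shingles(doc_b))

print(url_hash("https://example.github.io/"))
print(f"similarity: {similarity:.2f}")
```

A crawler could treat two pages as near-duplicates when their similarity exceeds a chosen threshold, and use the URL hash to avoid fetching the same address twice. The threshold itself is a tuning decision that this sketch leaves open.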

Made for a presentation at Tech SEO Boost

Getting Started

Get the repo

git clone https://github.com/jroakes/tech-seo-crawler.git

Dependencies

  • Please see the requirements.txt file for a list of dependencies.

It is strongly suggested that you first do the following in a new, clean environment:

  • If you are on Windows, you may need to install Microsoft Build Tools (http://go.microsoft.com/fwlink/?LinkId=691126&fixForIE=.exe.) and upgrade setuptools: pip install --upgrade setuptools
  • Install PyTorch: pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
  • See the requirements-libraries.txt file for the remaining library requirements. To install the frozen requirements this was developed with, use pip install -r requirements.txt

Install with:

pip install -r requirements.txt

Executing program

  1. Make sure you've created your three sites first. See the README file in the web folder. Alternatively, if you just want to use the crawler/renderer, you can run with the premade sites and skip to step 3.
  2. After creating your three sites, go to the config file and add the crawler_seed URL. This will be the GitHub Pages address of the organization you created on github.io. For example: myorganization.github.io/
  3. Run streamlit run main.py in the terminal or command prompt. A new browser window should open.
  4. The tool can also be run interactively with the Run.ipynb notebook in Jupyter.

Sharing

If you want to share your search engine for others to see, you can use Streamlit and Localtunnel.

  1. Install Localtunnel: npm install -g localtunnel
  2. Start the tunnel: lt --port 80 --subdomain <create a unique sub-domain name>
  3. Start the Streamlit server: streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress <the unique subdomain from step 2>.localtunnel.me
  4. Navigate to https://<the unique subdomain from step 2>.localtunnel.me in your browser, or share the link with a friend.

Complete example:

In a new terminal:

npm install -g localtunnel
lt --port 80 --subdomain tech-seo-crawler

In another terminal:

cd /tech-seo-crawler/
activate techseo
streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress tech-seo-crawler.localtunnel.me

Troubleshooting

  • When running in Streamlit, we experienced a few connection closed errors during the rendering process. If you hit this error, rerun the script by opening the top-right menu in Streamlit and clicking Rerun.

Contributors

Contributors names and contact info

Version History

  • 0.1 - Alpha
    • Initial Release

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgments

Libraries

Topics
