viasite / site-audit-seo

Licence: other
Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx, Google Drive.

Programming Languages

javascript

Projects that are alternatives of or similar to site-audit-seo

Serp
Google Search SERP Scraper
Stars: ✭ 40 (-56.04%)
Mutual labels:  scraper, seo
Jvppeteer
Headless Chrome For Java (Java crawler)
Stars: ✭ 193 (+112.09%)
Mutual labels:  scraper, puppeteer
Public Instagram
Tool to fetch Instagram's public content.
Stars: ✭ 43 (-52.75%)
Mutual labels:  scraper, puppeteer
SearchScraperAPI
Aiohttp web server API that scrapes Google and returns the scrape results as a response. Supports proxies, multiple geos and a configurable number of results.
Stars: ✭ 31 (-65.93%)
Mutual labels:  scraper, seo
barclayscrape
A small app to programmatically manipulate Barclays online banking
Stars: ✭ 57 (-37.36%)
Mutual labels:  scraper, puppeteer
Socialmanagertools Gui
🤖 👻 Desktop application for Instagram Bot, Twitter Bot and Facebook Bot
Stars: ✭ 293 (+221.98%)
Mutual labels:  scraper, puppeteer
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+87.91%)
Mutual labels:  scraper, puppeteer
Lighthouse Check Action
GitHub Action for running @GoogleChromeLabs Lighthouse audits with all the bells and whistles 🔔 Multiple audits, Slack notifications, and more!
Stars: ✭ 175 (+92.31%)
Mutual labels:  seo, lighthouse
playwright-lighthouse
🎭: Playwright Lighthouse Audit
Stars: ✭ 120 (+31.87%)
Mutual labels:  seo, lighthouse
lopez
Crawling and scraping the Web for fun and profit
Stars: ✭ 20 (-78.02%)
Mutual labels:  scraper, seo
bots-zoo
No description or website provided.
Stars: ✭ 59 (-35.16%)
Mutual labels:  scraper, puppeteer
instagram-get-images
Instagram get images 🌄 (hashtags, account, locations) with puppeteer
Stars: ✭ 69 (-24.18%)
Mutual labels:  scraper, puppeteer
whatsapp-tracking
Scraping the status of WhatsApp contacts
Stars: ✭ 49 (-46.15%)
Mutual labels:  scraper, puppeteer
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+5536.26%)
Mutual labels:  scraper, puppeteer
opensea-scraper
Scrapes nft floor prices and additional information from opensea. Used for https://nftfloorprice.info
Stars: ✭ 129 (+41.76%)
Mutual labels:  scraper, puppeteer
Serpscrap
SEO Python scraper to extract data from major search engine result pages. Extracts data like URL, title, snippet, rich snippet and type from search results for given keywords. Detects ads or makes automated screenshots. You can also fetch the text content of URLs found in the search results or provided by you. Useful for SEO and business-related research tasks.
Stars: ✭ 153 (+68.13%)
Mutual labels:  scraper, seo
Gatsby V2 Tutorial Starter
Gatsby V2 Starter - product of step by step tutorial
Stars: ✭ 139 (+52.75%)
Mutual labels:  seo, lighthouse
Rendora
dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites
Stars: ✭ 1,853 (+1936.26%)
Mutual labels:  seo, puppeteer
seo-audits-toolkit
SEO & Security Audit for Websites. Lighthouse & Security Headers crawler, Sitemap/Keywords/Images Extractor, Summarizer, etc ...
Stars: ✭ 311 (+241.76%)
Mutual labels:  seo, lighthouse
vue-seo-friendly-spa-template
Vue.js PWA/SPA template initially scaffolded with vue-cli and configured for SEO. Makes use of prerendering and other techniques/packages in order to achieve a perfect "Lighthouse Score".
Stars: ✭ 41 (-54.95%)
Mutual labels:  seo, lighthouse


Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx, Google Drive.

Web view report - site-audit-seo-viewer.

Demo: site-audit-demo (demo animation)

Russian description below.

Use without installing

Open https://viasite.github.io/site-audit-seo-viewer/.

Features:

  • Crawls the entire site, collecting links to pages and documents
  • Does not follow links outside the scanned domain (configurable)
  • Analyses each page with Lighthouse (see below)
  • Analyses the main page text with Mozilla Readability and Yake
  • Finds pages with SSL mixed content
  • Scan a list of URLs with --url-list (see the example after this list)
  • Set default report fields and filters
  • Scan presets
  • Documents with the extensions doc, docx, xls, xlsx, ppt, pptx, pdf, rar, zip are added to the list with depth 0
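
For example, a scan of a pre-made URL list (the list URL here is a placeholder):

site-audit-seo -u https://example.com/urls.txt --url-list --upload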

Technical details:

  • Does not load images, css, js (configurable)
  • Each site is saved to a file with a domain name in ~/site-audit-seo/
  • Some URLs are ignored (preRequest in src/scrap-site.js)

XLSX features:

  • The first row and the first column are fixed
  • Column width and auto cell height are configured for easy viewing
  • URL, title, description and some other fields are limited in width
  • Title is right-aligned to reveal the common part
  • Validation of some columns (status, request time, description length)
  • Export xlsx to Google Drive and print the URL (see the example command after this list)
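
For example, to save the report as XLSX and publish it to Google Drive (both flags appear in the command line options below):

site-audit-seo -u https://example.com --xlsx --gdrive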

Web viewer features:

  • Fixed table header and url column
  • Add/remove columns
  • Column presets
  • Field groups by categories
  • Filter presets (e.g. h1_count != 1)
  • Color validation
  • Verbose page details (+ button)
  • Direct URL to the same report with selected fields, filters and sorting
  • Stats for all scanned pages, validation summary
  • Persistent report URL when using --upload
  • Switch between last uploaded reports
  • Rescan current report

Fields list (18.08.2020):

  • url
  • mixed_content_url
  • canonical
  • is_canonical
  • previousUrl
  • depth
  • status
  • request_time
  • title
  • h1
  • page_date
  • description
  • keywords
  • og_title
  • og_image
  • schema_types
  • h1_count
  • h2_count
  • h3_count
  • h4_count
  • canonical_count
  • google_amp
  • images
  • images_without_alt
  • images_alt_empty
  • images_outer
  • links
  • links_inner
  • links_outer
  • text_ratio_percent
  • dom_size
  • html_size
  • lighthouse_scores_performance
  • lighthouse_scores_pwa
  • lighthouse_scores_accessibility
  • lighthouse_scores_best-practices
  • lighthouse_scores_seo
  • lighthouse_first-contentful-paint
  • lighthouse_speed-index
  • lighthouse_largest-contentful-paint
  • lighthouse_interactive
  • lighthouse_total-blocking-time
  • lighthouse_cumulative-layout-shift
  • and 150 more lighthouse tests!

Install

Install with docker-compose

git clone https://github.com/viasite/site-audit-seo
cd site-audit-seo
git clone https://github.com/viasite/site-audit-seo-viewer data/front
docker-compose pull # to skip the build step
docker-compose up -d

The service will be available at http://localhost:5302.
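
If the page does not open, the usual Docker Compose commands can help check the containers (nothing here is specific to this project):

docker-compose ps        # container status
docker-compose logs -f   # follow service logs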

Default ports:
  • Backend: 5301
  • Frontend: 5302
  • Yake: 5303

You can change them in the .env file or in docker-compose.yml.

Install with NPM:

npm install -g site-audit-seo

For Linux users

npm install -g site-audit-seo --unsafe-perm=true

After installing on Ubuntu, you may need to change the owner of the Chrome directory from root to your user.

Run this (replace $USER with your username, or run it as your user rather than as root):

sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"

Error details: Invalid file descriptor to ICU data received.

Command line usage:

$ site-audit-seo --help
Usage: site-audit-seo -u https://example.com --upload

Options:
  -u --urls <urls>             Comma separated url list for scan
  -p, --preset <preset>        Table preset (minimal, seo, headers, parse, lighthouse, lighthouse-all) (default: "seo")
  -e, --exclude <fields>       Comma separated fields to exclude from results
  -d, --max-depth <depth>      Max scan depth (default: 10)
  -c, --concurrency <threads>  Threads number (default: by cpu cores)
  --lighthouse                 Appends base Lighthouse fields to preset
  --delay <ms>                 Delay between requests (default: 0)
  -f, --fields <json>          Field in format --field 'title=$("title").text()' (default: [])
  --no-skip-static             Scan static files
  --no-limit-domain            Scan not only current domain
  --docs-extensions            Comma-separated extensions that will be add to table (default: doc,docx,xls,xlsx,ppt,pptx,pdf,rar,zip)
  --follow-xml-sitemap         Follow sitemap.xml (default: false)
  --ignore-robots-txt          Ignore disallowed in robots.txt (default: false)
  -m, --max-requests <num>     Limit max pages scan (default: 0)
  --no-headless                Show browser GUI while scan
  --no-remove-csv              No delete csv after xlsx generate
  --out-dir <dir>              Output directory (default: ".")
  --csv <path>                 Skip scan, only convert csv to xlsx
  --xlsx                       Save as XLSX (default: false)
  --gdrive                     Publish sheet to google docs (default: false)
  --json                       Output results in JSON (default: false)
  --upload                     Upload JSON to public web (default: false)
  --no-color                   No console colors
  --lang <lang>                Language (en, ru, default: system language)
  --open-file                  Open file after scan (default: yes on Windows and MacOS)
  --no-open-file               Don't open file after scan
  --no-console-validate        Don't output validate messages in console
  -V, --version                output the version number
  -h, --help                   display help for command

Custom fields

Linux/Mac:

site-audit-seo -d 1 -u https://example -f 'title=$("title").text()' -f 'h1=$("h1").text()'

Windows:

site-audit-seo -d 1 -u https://example -f title=$('title').text() -f h1=$('h1').text()
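
As a further illustration, a hypothetical extra field that reads the robots meta tag, assuming the same jQuery-style $ context as in the examples above:

site-audit-seo -d 1 -u https://example.com -f 'robots=$("meta[name=robots]").attr("content")'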

Remove fields from results

This outputs the fields from the seo preset, excluding the canonical fields:

site-audit-seo -u https://example.com --exclude canonical,is_canonical

Lighthouse

Analyse each page with Lighthouse

site-audit-seo -u https://example.com --preset lighthouse

Analyse with the seo preset plus Lighthouse

site-audit-seo -u https://example.com --lighthouse
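
The lighthouse-all preset from the options list presumably collects the full set of Lighthouse fields rather than just the base ones:

site-audit-seo -u https://example.com --preset lighthouse-all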

Config file

You can copy .site-audit-seo.conf.js to your home directory and tune the options there.

Send to InfluxDB

This is a beta feature. To configure it:

  1. Add this to ~/.site-audit-seo.conf:
module.exports = {
  influxdb: {
    host: 'influxdb.host',
    port: 8086,
    database: 'telegraf',
    measurement: 'site_audit_seo', // optional
    username: 'user',
    password: 'password',
    maxSendCount: 5, // optional, default send part of pages
  }
};
  2. Use --influxdb-max-send in the terminal.

  3. Create a command to scan your URLs:

site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log
  4. Add the command to cron.
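
A sketch of a weekly crontab entry built from the command above (the schedule and log path are examples):

0 4 * * 1 site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log 2>&1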

Plugins

  • Readability - main page text length, reading time
  • Yake - keywords extraction from main page text

See CONTRIBUTING.md for details about plugin development.

Install plugins:

cd data
npm install site-audit-seo-readability
npm install site-audit-seo-yake

Disable plugins:

You can pass an argument such as --disable-plugins readability,yake. Scanning is faster, but less data is extracted.
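
For example, combining the flag with a normal scan:

site-audit-seo -u https://example.com --disable-plugins readability,yake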

Credits

Based on headless-chrome-crawler (Puppeteer). The forked version @popstas/headless-chrome-crawler is used.

Bugs

  1. Sometimes identical pages are written to the csv. This happens in two cases: 1.1. a redirect from another page to this one (solved by setting skipRequestedRedirect: true, hardcoded); 1.2. a simultaneous request of the same page in parallel threads.
  2. Sometimes a number appears instead of the URL; it happens at the csv-to-xlsx conversion stage, reason unknown.

Free audit tool alternatives

Free data scrapers

  • Web Scraper - browser extension, free for local use
  • Portia - self-hosted visual scraper builder, scrapy based
  • Crawlab - distributed web crawler admin platform, self-hosted with Docker
  • OutWit Hub - free edition, pro edition for $99
  • Octoparse - 10 000 records free
  • Parsers.me - 1 000 pages per run free
  • website-scraper - opensource, CLI, download site to local directory
  • website-scraper-puppeteer - same but puppeteer based
  • Gerapy - distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Russian

Scan one or several sites into csv and xlsx files.

Features:

  • Crawls the entire site, collecting links to pages and documents
  • Summary of results after the scan
  • Documents with the extensions doc, docx, xls, xlsx, pdf, rar, zip are added to the list with depth 0
  • Finds pages with SSL mixed content
  • Each site is saved to a file named after its domain
  • Does not follow links outside the scanned domain (configurable)
  • Does not load images, css, js (configurable)
  • Some URLs are ignored (preRequest in src/scrap-site.js)
  • Each page can be run through Lighthouse (see below)
  • Scan an arbitrary list of URLs with --url-list

XLSX features:

  • The first row and the first column are fixed
  • Column widths and automatic cell heights are configured for easy viewing
  • URL, title, description and some other fields are limited in width
  • Title is right-aligned to reveal the common part
  • Validation of some columns (status, request time, description length)
  • Upload of the xlsx to Google Drive with the link printed

Install:

npm install -g site-audit-seo

If you are on Ubuntu

npm install -g site-audit-seo --unsafe-perm=true
npm run postinstall-puppeteer-fix

Or run this (replace $USER with your user, or run it as your user rather than as root):

sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"

Error details: Invalid file descriptor to ICU data received.

Usage

site-audit-seo -u https://example.com --upload

Custom fields

Additional fields can be passed like this:

site-audit-seo -d 1 -u https://example -f "title=$('title').text()" -f "h1=$('h1').text()"

Lighthouse

Run Lighthouse on each page

site-audit-seo -u https://example.com --preset lighthouse

Regular SEO audit + Lighthouse

site-audit-seo -u https://example.com --lighthouse

How to count content pages from the csv

  1. Open the csv in a text editor
  2. Count documents by searching for ,0
  3. Exclude pagination pages by searching for ?
  4. Subtract 1 (the header row)
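
A rough shell equivalent of these steps (the csv file name is an example; the counts are approximate because the patterns can also match inside other fields):

wc -l < example.com.csv        # total rows, including the header
grep -c ',0' example.com.csv   # rows that look like documents (depth 0)
grep -c '?' example.com.csv    # rows that look like pagination pages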

Bugs

  1. Sometimes identical pages are written to the csv. This happens in two cases: 1.1. a redirect from another page to this one (solved by setting skipRequestedRedirect: true, hardcoded); 1.2. a simultaneous request of the same page in parallel threads.
  2. Sometimes a number appears instead of the URL; it happens at the csv-to-xlsx conversion stage, reason unknown.

TODO:
