
DwarfThief / Raspagem-de-dados-para-iniciantes

License: GPL-3.0
Data scraping for beginners using Scrapy and other basic libraries

Programming Languages

  • Python: 139,335 projects (#7 most used programming language)
  • Jupyter Notebook: 11,667 projects

Projects that are alternatives of or similar to Raspagem-de-dados-para-iniciantes

OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recently posted ads for the requested product and dumps them to a NoSQL MongoDB database.
Stars: ✭ 15 (-86.73%)
Mutual labels:  web-crawler, scrapy
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-80.53%)
Mutual labels:  scrapy, spyder
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-39.82%)
Mutual labels:  scrapy, webcrawling
proxi
Proxy pool. Finds and checks proxies, with a REST API for querying results. Can find over 25k proxies in under 5 minutes.
Stars: ✭ 32 (-71.68%)
Mutual labels:  web-crawler, scrapy
Crawlab
Distributed web crawler admin platform for spider management, regardless of language or framework.
Stars: ✭ 8,392 (+7326.55%)
Mutual labels:  web-crawler, scrapy
Crawlab Lite
Lite version of Crawlab, a lightweight crawler management platform.
Stars: ✭ 122 (+7.96%)
Mutual labels:  web-crawler, scrapy
Terpene Profile Parser For Cannabis Strains
Parser and database to index the terpene profile of different strains of Cannabis from online databases
Stars: ✭ 63 (-44.25%)
Mutual labels:  web-crawler, scrapy
Awesome Web Scraper
A collection of awesome web scrapers and crawlers.
Stars: ✭ 147 (+30.09%)
Mutual labels:  web-crawler, scrapy
vietnam-ecommerce-crawler
Crawling the data from lazada, websosanh, compare.vn, cdiscount and cungmua with flexible configs
Stars: ✭ 28 (-75.22%)
Mutual labels:  scrapy
doc_crawler.py
Explore a website recursively and download all the wanted documents (PDF, ODT…)
Stars: ✭ 22 (-80.53%)
Mutual labels:  web-crawler
asyncpy
A lightweight asynchronous coroutine web crawler framework built with asyncio and aiohttp.
Stars: ✭ 86 (-23.89%)
Mutual labels:  scrapy
scrapy-LBC
A LeBonCoin spider built with Scrapy and ElasticSearch.
Stars: ✭ 14 (-87.61%)
Mutual labels:  scrapy
fernando-pessoa
A classifier of Fernando Pessoa's poems according to his heteronyms.
Stars: ✭ 31 (-72.57%)
Mutual labels:  scrapy
crawler
A collection of Python crawler projects.
Stars: ✭ 29 (-74.34%)
Mutual labels:  scrapy
ant
A web crawler for Go
Stars: ✭ 264 (+133.63%)
Mutual labels:  web-crawler
Web-Iota
Iota is a web scraper that finds all of the images and links/sub-URLs on a webpage.
Stars: ✭ 60 (-46.9%)
Mutual labels:  scrapy
scrapy helper
Dynamically configurable crawler.
Stars: ✭ 84 (-25.66%)
Mutual labels:  scrapy
scrapy-kafka-redis
Distributed crawling/scraping: Kafka- and Redis-based components for Scrapy.
Stars: ✭ 45 (-60.18%)
Mutual labels:  scrapy
Inventus
Inventus is a spider designed to find subdomains of a specific domain by crawling it and any subdomains it discovers.
Stars: ✭ 80 (-29.2%)
Mutual labels:  scrapy
scrapy-wayback-machine
A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.
Stars: ✭ 92 (-18.58%)
Mutual labels:  scrapy


Data scraping for beginners 📄

This repository was built to help anyone interested in the field of data scraping. The entire repository is in PT-BR (Brazilian Portuguese), but the links/documentation may be in English (please share if you have something translated).

Installation 💾

We use Python version 3.7.

The main libraries we will use here are:

  • requests
  • bs4 (BeautifulSoup)
  • Scrapy

To do this, you just need to install a few libraries; in your terminal, type:

pip install -r requirements.txt
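
To give a concrete idea of how requests and bs4 fit together, here is a minimal sketch of extracting text from a page. The URL (quotes.toscrape.com, a public practice site) and the CSS selector are illustrative assumptions, not necessarily what the tutorial notebooks use:

# Minimal sketch: fetch a page with requests and extract text with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com")
response.raise_for_status()  # stop here if the request failed

soup = BeautifulSoup(response.text, "html.parser")
# On this practice site, each quote's text lives in a <span class="text"> element.
for quote in soup.select("span.text"):
    print(quote.get_text())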

Recommendations

Use a Python virtual environment so your setup works the same regardless of platform.

  • Creation:
python3 -m venv venv
  • Activation (varies by OS; see the Windows note after this list):
source venv/bin/activate
  • Dependencies:
pip install -r requirements.txt
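
The activation command above applies to Linux/macOS shells. On Windows (assuming the same venv directory name), activation is typically:

venv\Scripts\activate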

Jupyter notebooks

We will use Jupyter notebooks here, so if you are not familiar with the tool, visit the documentation.
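
If you don't have Jupyter installed yet, a common way to install it and start the notebook server (assuming your virtual environment is active) is:

pip install notebook
jupyter notebook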

Tutorial track: 🎓

  1. Learning to extract text from a website
  2. First spider (see the sketch after this list)
  3. Scraping multiple items
  4. Navigating between pages
  5. Collecting more details
  6. Scraping a site with infinite scroll
  7. Running a spider in the cloud
  8. Extracting images
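
As a preview of steps 2 and 4, here is a minimal sketch of a first Scrapy spider that collects quotes and follows pagination. The site (quotes.toscrape.com) and the CSS selectors are illustrative assumptions, not necessarily the ones used in the tutorial notebooks:

# Minimal sketch of a first Scrapy spider (illustrative site and selectors).
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one item per quote on the current page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link, if there is one (pagination).
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Assuming the code above is saved as quotes_spider.py, it can be run without a full Scrapy project with: scrapy runspider quotes_spider.py -o quotes.json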

Study materials:

Blogs: 💻

Books: 📚

Documentation: 📜

Podcasts: 🎧 🎵

Videos: 📺
