
yogesh-desai / WebCrawlerTokopedia

Licence: other
A web crawler and scraper for https://www.Tokopedia.com. The project scrapes the product ID, the product URL, and the product videos shown under the product images at the bottom right of the page.

WebCrawlerTokopedia

It is a web crawler and scraper for https://www.Tokopedia.com. It is fully automated: you only need to supply a seed URL to get started.

The program extracts the following:

  • product-ID,
  • product-URL,
  • product-videos-URLs

It has fetcher and extractor functions. The extractor is written specifically for the structure of the Tokopedia product page. To get different results, you need to change the extractor function, DoCDP().
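As a rough illustration of what an extractor does, the sketch below pulls YouTube video URLs out of a string of rendered HTML with a regular expression. It is not the project's DoCDP() function, which works through the browser. The helper name extractYouTubeURLs, the sample markup, and the assumption that videos appear as youtube.com/watch?v= or youtube.com/embed/ links are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// youtubeRe matches watch and embed links and captures the 11-character video ID.
// Assumption: videos show up in the page HTML in one of these two forms.
var youtubeRe = regexp.MustCompile(`youtube\.com/(?:watch\?v=|embed/)([A-Za-z0-9_-]{11})`)

// extractYouTubeURLs (hypothetical helper) returns canonical watch URLs
// in order of first appearance, skipping duplicate video IDs.
func extractYouTubeURLs(html string) []string {
	seen := make(map[string]bool)
	var urls []string
	for _, m := range youtubeRe.FindAllStringSubmatch(html, -1) {
		id := m[1]
		if seen[id] {
			continue
		}
		seen[id] = true
		urls = append(urls, "https://www.youtube.com/watch?v="+id)
	}
	return urls
}

func main() {
	// Made-up markup for illustration only.
	page := `<iframe src="https://www.youtube.com/embed/oKR2fh09Nic"></iframe>
<a href="https://www.youtube.com/watch?v=12JBG20n3jI">video</a>
<a href="https://www.youtube.com/watch?v=oKR2fh09Nic">same video again</a>`
	for _, u := range extractYouTubeURLs(page) {
		fmt.Println(u)
	}
}
```

Adapting the crawler to a different page layout would mean replacing the matching logic, which is the role the text above assigns to DoCDP().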

Dependencies

It uses the chromedp package, which drives Chrome through the Chrome DevTools Protocol (CDP).

Installation

Install it in the usual way.

$ go get -u github.com/yogesh-desai/WebCrawlerTokopedia

Usage

$ go run main.go

Usage of command-line-arguments:
  -cancelafter duration
    	automatically cancel the fetchbot after a given time
  -cancelat string
    	automatically cancel the fetchbot at a given URL
  -headless
    	Run the CDP in headless mode. (default true)
  -memstats duration
    	display memory statistics at a given interval (default 5m0s)
  -seed string
    	seed URL (default "https://www.tokopedia.com/")
  -stopafter duration
    	automatically stop the fetchbot after a given time
  -stopat string
    	automatically stop the fetchbot at a given URL
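
The listing above is the kind of usage text that Go's standard flag package prints. A minimal sketch of how such flags could be declared follows. The flag names, types, defaults, and descriptions are taken from the listing; the config struct and the parseFlags helper are hypothetical and not taken from the project's source.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// config (hypothetical) collects the options shown in the usage listing.
type config struct {
	seed        string
	cancelAfter time.Duration
	cancelAt    string
	stopAfter   time.Duration
	stopAt      string
	memStats    time.Duration
	headless    bool
}

// parseFlags declares the flags on a private FlagSet so that it can be
// called with any argument slice, not only os.Args.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	fs.StringVar(&c.seed, "seed", "https://www.tokopedia.com/", "seed URL")
	fs.DurationVar(&c.cancelAfter, "cancelafter", 0, "automatically cancel the fetchbot after a given time")
	fs.StringVar(&c.cancelAt, "cancelat", "", "automatically cancel the fetchbot at a given URL")
	fs.DurationVar(&c.stopAfter, "stopafter", 0, "automatically stop the fetchbot after a given time")
	fs.StringVar(&c.stopAt, "stopat", "", "automatically stop the fetchbot at a given URL")
	fs.DurationVar(&c.memStats, "memstats", 5*time.Minute, "display memory statistics at a given interval")
	fs.BoolVar(&c.headless, "headless", true, "Run the CDP in headless mode.")
	err := fs.Parse(args)
	return c, err
}

func main() {
	cfg, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2) // the flag package has already reported the problem
	}
	fmt.Printf("seed=%s headless=%t memstats=%s\n", cfg.seed, cfg.headless, cfg.memStats)
}
```

With the flag package, durations are written as values such as 30s or 10m, and a boolean flag is switched off with the -headless=false form.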

Output

The program writes the product details to a file.

The following is an example of the output from a run on a single web page.


Product_ID	Product_URL	Youtube_Video_URLs
146347138	https://www.tokopedia.com/chocoapple/ready-stock-bnib-iphone-128gb-7-plus-jet-black-garansi-apple-1-tahun-10	https://www.youtube.com/watch?v=oKR2fh09Nic,https://www.youtube.com/watch?v=12JBG20n3jI,https://www.youtube.com/watch?v=mWEG1nu2rVY,https://www.youtube.com/watch?v=wgZ7Q4ywOl8
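
The sample shows a tab-separated layout: one row per product, with the video URLs joined by commas in the last column. A minimal sketch of producing rows in that shape follows. The product struct and the formatRow and writeProducts helpers are assumptions for illustration, not the project's actual code, and the sample values are made up.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// product (hypothetical) holds the three fields shown in the sample output.
type product struct {
	ID     string
	URL    string
	Videos []string
}

// formatRow joins the fields with tabs and the video URLs with commas.
func formatRow(p product) string {
	return strings.Join([]string{p.ID, p.URL, strings.Join(p.Videos, ",")}, "\t")
}

// writeProducts writes a header line followed by one line per product.
func writeProducts(w io.Writer, products []product) error {
	if _, err := fmt.Fprintln(w, "Product_ID\tProduct_URL\tYoutube_Video_URLs"); err != nil {
		return err
	}
	for _, p := range products {
		if _, err := fmt.Fprintln(w, formatRow(p)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	products := []product{{
		ID:     "12345",
		URL:    "https://example.com/shop/item",
		Videos: []string{"https://www.youtube.com/watch?v=aaaaaaaaaaa", "https://www.youtube.com/watch?v=bbbbbbbbbbb"},
	}}
	// Writes to standard output here; an *os.File opened with os.Create works the same way.
	if err := writeProducts(os.Stdout, products); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```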

Features

  • It has fetcher and extractor functions.
  • The fetcher is designed around a Filter function.
  • It uses goroutines and channels to run tasks in parallel and speed up crawling.
  • It has command-line flags with default values. You can supply your own values at run time.
  • It reports memory statistics to keep track of the memory used by the program.
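
The filter and goroutine features above can be sketched together: a filter decides which URLs are worth visiting, and a pool of goroutines processes the accepted URLs from a channel. This is a minimal sketch under stated assumptions, not the project's code. The names sameHost and crawl, the worker count, and the stand-in process function are hypothetical, and no network requests are made.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
	"sync"
)

// sameHost (hypothetical filter) reports whether raw is an http(s) URL on host.
func sameHost(raw, host string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && strings.EqualFold(u.Hostname(), host)
}

// crawl filters urls, then fans the accepted ones out to `workers` goroutines
// (workers must be at least 1). process stands in for the fetch-and-extract step.
func crawl(urls []string, host string, workers int, process func(string) string) []string {
	jobs := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- process(u)
			}
		}()
	}

	// Producer: send only URLs that pass the filter, then close the channel.
	go func() {
		for _, u := range urls {
			if sameHost(u, host) {
				jobs <- u
			}
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // worker order is nondeterministic, so sort for stable output
	return out
}

func main() {
	urls := []string{
		"https://www.tokopedia.com/a",
		"https://www.youtube.com/watch?v=x",
		"https://www.tokopedia.com/b",
		"mailto:someone@example.com",
	}
	visit := func(u string) string { return "visited " + u }
	for _, line := range crawl(urls, "www.tokopedia.com", 3, visit) {
		fmt.Println(line)
	}
}
```

Closing the jobs channel is what lets the workers' range loops end, and the WaitGroup ensures the results channel is closed only after the last worker has sent its final result.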

TODOs

  • Currently, it uses the GUI mode of Google Chrome. The --headless functionality still needs to be implemented.
  • Make the code faster and more stable.
  • Do more testing and profiling to understand memory-related issues.
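
For the memory-related item above, Go's runtime package exposes the numbers that a periodic memory report needs. A minimal sketch follows, assuming a simple ticker loop. The memSummary and logMemStats helpers and the choice of reported fields are hypothetical, not the project's implementation.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// memSummary reads the runtime's memory statistics and formats a few fields.
func memSummary() string {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return fmt.Sprintf("heap_alloc=%d KiB sys=%d KiB num_gc=%d goroutines=%d",
		m.HeapAlloc/1024, m.Sys/1024, m.NumGC, runtime.NumGoroutine())
}

// logMemStats prints a summary every interval until done is closed.
func logMemStats(interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fmt.Println(memSummary())
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		logMemStats(100*time.Millisecond, done)
		close(finished)
	}()
	time.Sleep(350 * time.Millisecond) // stand-in for the crawl itself
	close(done)
	<-finished
}
```

For deeper profiling, the standard runtime/pprof and net/http/pprof packages are the usual next step.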

Known Issues

  • Currently, no issues. :)

Please feel free to open pull requests or issues. :)
