nikhilkumarsingh / content-downloader

Licence: MIT license
Python package to download files on any topic in bulk.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to content-downloader

Bilibili member crawler
Bilibili (B站) user crawler. Yay, it's a crawler!
Stars: ✭ 115 (+12.75%)
Mutual labels:  multithreading, requests
diskspace
macOS command line tool to return the available disk space on APFS volumes
Stars: ✭ 123 (+20.59%)
Mutual labels:  command-line-tool
bdfr-html
Converts the output of the bulk downloader for reddit to a set of HTML pages.
Stars: ✭ 23 (-77.45%)
Mutual labels:  bulk-downloader
vim-profiler
A vim plugin profiler and data plotter
Stars: ✭ 31 (-69.61%)
Mutual labels:  command-line-tool
PastaBean
Python Script to Scrape Pastebin with Regex
Stars: ✭ 0 (-100%)
Mutual labels:  requests
dr scaffold
scaffold django rest apis like a champion 🚀
Stars: ✭ 116 (+13.73%)
Mutual labels:  command-line-tool
quake-cli-tools
Command line tools for creating Quake content.
Stars: ✭ 41 (-59.8%)
Mutual labels:  command-line-tool
PowerColorLS
PowerShell script to display a colorized directory and file listing with icons
Stars: ✭ 35 (-65.69%)
Mutual labels:  command-line-tool
wifiqr
Create a QR code with your Wi-Fi login details
Stars: ✭ 207 (+102.94%)
Mutual labels:  command-line-tool
cybr-cli
A "Swiss Army Knife" command-line interface (CLI) for easy human and non-human interaction with @cyberark suite of products.
Stars: ✭ 45 (-55.88%)
Mutual labels:  command-line-tool
workerpoolxt
Concurrency limiting goroutine pool without upper limit on queue length. Extends github.com/gammazero/workerpool
Stars: ✭ 15 (-85.29%)
Mutual labels:  multithreading
cati
Cati Unix Package Manager
Stars: ✭ 19 (-81.37%)
Mutual labels:  command-line-tool
dotfiles
dotfiles symbolic links management CLI
Stars: ✭ 156 (+52.94%)
Mutual labels:  command-line-tool
node-banner
Easily integrate ASCII flavored banners to your CLI tool
Stars: ✭ 18 (-82.35%)
Mutual labels:  command-line-tool
grift
swift dependency graph visualizer tool
Stars: ✭ 26 (-74.51%)
Mutual labels:  command-line-tool
geeup
Simple CLI for Google Earth Engine Uploads
Stars: ✭ 67 (-34.31%)
Mutual labels:  command-line-tool
cappy
☕🗄CAching Proxy in Python – Simple file based python http proxy
Stars: ✭ 15 (-85.29%)
Mutual labels:  requests
audio-playback
Ruby/Command Line Audio File Player
Stars: ✭ 20 (-80.39%)
Mutual labels:  command-line-tool
bdk
Streamlined blockchain deployment kit for Hyperledger Fabric.
Stars: ✭ 43 (-57.84%)
Mutual labels:  command-line-tool
ThreadPinning.jl
Pinning Julia threads to cores
Stars: ✭ 23 (-77.45%)
Mutual labels:  multithreading

content-downloader

content-downloader, a.k.a. ctdl, is a Python package with a command-line utility and a desktop GUI for downloading files on any topic in bulk.

Features

  • ctdl can be used as a command line utility as well as a desktop GUI.

  • ctdl fetches file links related to a search query from Google Search.

  • Files can be downloaded in parallel using multithreading.

  • ctdl is compatible with both Python 2 and Python 3.
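
The parallel-download feature can be pictured with a small thread-pool sketch. This is not ctdl's own implementation: `download_all`, `fetch`, and `max_workers` are hypothetical names chosen for illustration, and `fetch` stands in for whatever function actually downloads one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, parallel=False, max_workers=4):
    """Apply fetch(url) to every URL, either one by one or on a thread pool.

    `fetch` is a hypothetical callable that downloads a single URL and
    returns some result. Results come back in the same order as `urls`.
    """
    if not parallel:
        # Serial mode: each download starts only after the previous one ends.
        return [fetch(url) for url in urls]
    # Downloads are I/O-bound, so threads can overlap the network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if downloads finish out of order.
        return list(pool.map(fetch, urls))
```

Passing the download function in as an argument keeps the sketch independent of any particular HTTP library; in real use `fetch` would wrap the actual request and file write.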

Installation

  • To install content-downloader, run:

    $ pip install ctdl
    
  • There seem to be some issues with parallel progress bars in tqdm, which have been resolved in a pull request that is not yet merged. Until it is merged, please use my patch by running this command:

    $ pip install -U git+https://github.com/nikhilkumarsingh/tqdm
    

Desktop GUI usage

To use the ctdl desktop GUI, open a terminal and run this command:

$ ctdl-gui

Command line usage

$ ctdl [-h] [-f FILE_TYPE] [-l LIMIT] [-d DIRECTORY] [-p] [-a] [-t]
       [-minfs MIN_FILE_SIZE] [-maxfs MAX_FILE_SIZE] [-nr]
       [query]

Optional arguments are:

  • -f FILE_TYPE : set the file type. (can take values like ppt, pdf, xml, etc.)

               Default value: pdf
    
  • -l LIMIT : specify the number of files to download.

           Default value: 10
    
  • -d DIRECTORY : specify the directory where files will be stored.

               Default: A directory with the same name as the search query in the current directory.
    
  • -p : download files in parallel.

  • -a : list the available file types.

  • -t : list the potentially high-threat file types.

  • -minfs MIN_FILE_SIZE : specify minimum file size to download in Kilobytes (KB).

               Default: 0
    
  • -maxfs MAX_FILE_SIZE : specify maximum file size to download in Kilobytes (KB).

               Default: -1 (represents no maximum file size)
    
  • -nr : prevent download redirects.

               Default: False
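
To make the options and their defaults concrete, here is a minimal argparse sketch that mirrors the usage line and the documented defaults. It is a reconstruction for illustration, not ctdl's actual parser: the function name `build_parser` and every `dest` name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented ctdl usage (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ctdl")
    # [query] is optional in the usage line, hence nargs="?".
    parser.add_argument("query", nargs="?", help="search query")
    parser.add_argument("-f", dest="file_type", default="pdf")
    parser.add_argument("-l", dest="limit", type=int, default=10)
    # None here stands for "use a directory named after the query".
    parser.add_argument("-d", dest="directory", default=None)
    parser.add_argument("-p", dest="parallel", action="store_true")
    parser.add_argument("-a", dest="available", action="store_true")
    parser.add_argument("-t", dest="threats", action="store_true")
    # Sizes are in kilobytes; -1 means no maximum, as documented above.
    parser.add_argument("-minfs", dest="min_file_size", type=int, default=0)
    parser.add_argument("-maxfs", dest="max_file_size", type=int, default=-1)
    parser.add_argument("-nr", dest="no_redirects", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["-f", "ppt", "-l", "3", "health"])` yields a namespace with `file_type` set to `ppt`, `limit` set to 3, and the remaining options at their defaults.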
    

Examples

  • To get the list of available file types:

    $ ctdl -a
    
  • To get the list of potentially high-threat file types:

    $ ctdl -t
    
  • To download PDF files on the topic 'python':

    $ ctdl python
    

    This is the default behaviour: it downloads 10 PDF files into a folder named 'python' in the current directory.

  • To download 3 ppt files on 'health':

    $ ctdl -f ppt -l 3 health
    
  • To explicitly specify the download folder:

    $ ctdl -d /home/nikhil/Desktop/ml-pdfs machine-learning
    
  • To download files in parallel:

    $ ctdl -f pdf -p python
    
  • To search for and download, in parallel, 10 PDF files containing the text "python" and "algorithm", without allowing any URL redirects, with file sizes between 10,000 KB (10 MB) and 100,000 KB (100 MB):

    $ ctdl -f pdf -l 10 -minfs 10000 -maxfs 100000 -nr -p "python algorithm"
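
The way the -minfs and -maxfs limits combine, including the -1 sentinel for "no maximum", can be sketched as a small predicate. This is an illustration rather than ctdl's code: the name `size_allowed` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
def size_allowed(size_kb, min_kb=0, max_kb=-1):
    """Return True if a file of size_kb kilobytes passes the size limits.

    min_kb and max_kb correspond to -minfs and -maxfs. A max_kb of -1
    means there is no upper limit, matching the documented default.
    Both bounds are treated as inclusive (an assumption).
    """
    if size_kb < min_kb:
        return False
    return max_kb == -1 or size_kb <= max_kb
```

With the limits from the last example, `size_allowed(50000, 10000, 100000)` is true, while a 5,000 KB file is rejected for being too small.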
    

Usage in Python files

from ctdl import ctdl

# Download up to 5 PPT files for the query into the given directory.
ctdl.download_content(
    file_type='ppt',
    limit=5,
    directory='/home/nikhil/Desktop/ml-pdfs',
    query='machine learning using python')
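
Building on the call above, a script that needs several topics could loop over them and give each its own folder. The helper below is a sketch under stated assumptions: `download_topics` and the one-folder-per-topic layout are hypothetical, and the download function is passed in as a parameter so that the only thing assumed about ctdl is the four keyword arguments shown above.

```python
import os


def download_topics(topics, download_fn, base_dir, file_type="pdf", limit=10):
    """Call download_fn once per topic, each with its own sub-directory.

    download_fn must accept the keyword arguments file_type, limit,
    directory and query; in real use, pass ctdl.download_content.
    """
    for topic in topics:
        download_fn(
            file_type=file_type,
            limit=limit,
            # Hypothetical layout: one folder per topic, spaces replaced by hyphens.
            directory=os.path.join(base_dir, topic.replace(" ", "-")),
            query=topic,
        )
```

In a real script this would be called as `download_topics(['machine learning', 'health'], ctdl.download_content, '/home/nikhil/Desktop', file_type='ppt', limit=5)` after `from ctdl import ctdl`.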

TODO

  • Prompt the user before downloading potentially harmful files

  • Create ctdl GUI

  • Implement unit testing

  • Use the DuckDuckGo API as an option

Want to contribute?

  • Clone the repository

    $ git clone http://github.com/nikhilkumarsingh/content-downloader
    
  • Install dependencies

    $ pip install -r requirements.txt
    

    Note: There seem to be some issues with the current version of tqdm. If you do not get the expected progress bar behaviour, try this patch:

    $ pip uninstall tqdm
    $ pip install git+https://github.com/nikhilkumarsingh/tqdm
    
  • In ctdl/ctdl.py, remove the . prefix from .downloader and .utils for the following imports, so it changes from:

    from .downloader import download_series, download_parallel
    from .utils import FILE_EXTENSIONS, THREAT_EXTENSIONS

    to:

    from downloader import download_series, download_parallel
    from utils import FILE_EXTENSIONS, THREAT_EXTENSIONS

  • Run the Python file directly with python ctdl/ctdl.py ___ (instead of ctdl ___)
