A collection of ML-related material, including notebooks, code, and a curated list of useful resources such as books and software. Almost everything mentioned here is free (as in speech, not as in free food) or open source.

Stars: ✭ 60 (-1.64%)

Mutual labels: data-mining

TurboDataMiner

The objective of this Burp Suite extension is the flexible and dynamic extraction, correlation, and structured presentation of information from the Burp Suite project as well as the flexible and dynamic on-the-fly modification of outgoing or incoming HTTP requests using Python scripts. Thus, Turbo Data Miner shall aid in gaining a better and fas…

Stars: ✭ 46 (-24.59%)

Mutual labels: data-mining

heidi

heidi : tidy data in Haskell

Stars: ✭ 24 (-60.66%)

Mutual labels: data-mining

cl-torrents

Searching torrents on popular trackers - CLI, readline, GUI, web client. Tutorial and binaries (issue tracker on https://gitlab.com/vindarel/cl-torrents/)

Stars: ✭ 83 (+36.07%)

Mutual labels: web-scraping

alter-nlu

Natural language understanding library for chatbots with intent recognition and entity extraction.

Stars: ✭ 45 (-26.23%)

Mutual labels: information-extraction

rymscraper

Python API to extract data from rateyourmusic.com.

Stars: ✭ 63 (+3.28%)

Mutual labels: web-scraping

gosquito

gosquito ("go" + "mosquito") is a pluggable tool for data gathering, data processing and data transmitting to various destinations.

Stars: ✭ 25 (-59.02%)

Mutual labels: data-mining

grailer

Web scraping tool for grailed.com

Stars: ✭ 30 (-50.82%)

Mutual labels: web-scraping

python-notebooks

A collection of Jupyter Notebooks used in conferences or just to have some snippets.

Stars: ✭ 14 (-77.05%)

Mutual labels: data-mining


IWW-IntelliWebWrapper

An AI-based web-mining library for web content extraction using machine learning algorithms.

Currently, the library offers the following functionality and algorithms:

DOM extractor, mapper, reducer, and flattening functionality.
DoC (degree of coherence), a Euclidean-distance-based similarity measure.
LD, the lists detector algorithm.
MCD, the main content detector algorithm.
MCD results integrator, a method that integrates other algorithms' results with MCD.
CETD algorithm.
DOM tags detector script (highlights the chosen nodes).
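DoC is described above only as a Euclidean-distance-based similarity; its exact formula is not given here. As a minimal sketch of that general idea, assuming each node is summarized by a numeric feature vector and that the distance d is mapped to a score in (0, 1] with 1 / (1 + d), a similarity of this kind could look like the following. Both assumptions and the helper name `euclidean_similarity` are hypothetical and are not taken from iww:

```python
import math


def euclidean_similarity(a, b):
    """Hypothetical helper (not part of iww): map Euclidean distance to (0, 1].

    Identical vectors give 1.0; the score shrinks toward 0 as they diverge.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Plain Euclidean distance between the two feature vectors.
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Assumed mapping from distance to similarity; iww may use another one.
    return 1.0 / (1.0 + distance)


# Made-up feature vectors, e.g. counts of child tags per node.
print(euclidean_similarity([3, 1, 2], [3, 1, 2]))
print(euclidean_similarity([3, 1, 2], [3, 1, 3]))
```

A score bounded by 1 is the kind of value a range such as `coherence_threshold = (0.75, 1)` could be applied to, though how iww actually computes and thresholds DoC is not documented here.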

P.S.:

The documentation isn't available yet.
The LD and MCD algorithms are to be released as a research article in the near future.
The pip package of iww will be available online as soon as possible.

USE CASE EXAMPLE :

1- extraction :

from iww.extractor import extractor
from iww.detector import detector
from iww.features_extraction.lists_detector import Lists_Detector as LD
from iww.features_extraction.main_content_detector import MCD

# page to extract and the JSON file that receives the result
url = "https://www.theiconic.com.au/catalog/?q=kids%20sunglasses"
json_file = "./iconic.json"

# extract the page into the JSON file used by every later step
extractor.extract(
    url = url,
    destination = json_file
)

2- exploratory data analysis :

from iww.utils.dom_mapper import DOM_Mapper as DM

dm = DM()
dm.retrieve_DOM_tree("./iconic.json")
print("total number of nodes : {}".format(dm.DOM['CETD']['tagsCount']))

total number of nodes : 2098
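The count above is read from a value stored in the tree. For intuition about what a node count over a nested DOM structure means, here is a minimal sketch that counts nodes in a nested dictionary. The `children` and `tagName` keys, the `count_nodes` helper, and the sample tree are all hypothetical and do not reflect iww's actual JSON layout:

```python
def count_nodes(node, children_key="children"):
    """Count a node plus all of its descendants.

    Hypothetical helper: assumes each node is a dict whose child nodes,
    if any, are listed under `children_key`.
    """
    total = 1  # the node itself
    for child in node.get(children_key, []):
        total += count_nodes(child, children_key)
    return total


# Made-up tree: body -> (div -> p, p), footer
sample_tree = {
    "tagName": "body",
    "children": [
        {"tagName": "div", "children": [{"tagName": "p"}, {"tagName": "p"}]},
        {"tagName": "footer"},
    ],
}

print("total number of nodes : {}".format(count_nodes(sample_tree)))
```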

3- LD algorithm :

ld = LD()
ld.retrieve_DOM_tree(file_path = "./iconic.json")
ld.apply(
    node = ld.DOM, 
    coherence_threshold= (0.75,1), 
    sub_tags_threshold = 2
)
ld.update_DOM_tree()

detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_ld.png",
    mark_path = "LISTS.mark", 
    mark_value = "1"
)

4- MCD algorithm :

mcd = MCD()
mcd.retrieve_DOM_tree("./iconic.json")
mcd.apply(
    node = mcd.DOM, 
    min_ratio_threshold = 0.0, 
    nbr_nodes_threshold = 1
)
mcd.update_DOM_tree()

detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_mcd.png",
    mark_path = "MCD.mark", 
    mark_value = "1"
)

5- LD/MCD integration (main list detection) :

mcd.integrate_other_algorithms_results(
    node = mcd.DOM, 
    nbr_nodes = 1,
    mode = "ancestry", 
    condition_features = [("LISTS.mark","1")])

mcd.update_DOM_tree()

detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_main_list.png",
    mark_path = "MCD.main_node", 
    mark_value = "1"
)

License

MIT

MOHAMED-HMINI 2019



MohamedHmini / iww
