
MohamedHmini / iww

Licence: MIT license
AI based web-wrapper for web-content-extraction



IWW-IntelliWebWrapper



An AI-based web-mining library for web content extraction using machine learning algorithms.

Currently, the library offers the following functionality and algorithms:

  • DOM extractor, mapper, reducer, and flattening functionality.
  • DoC (degree of coherence), a Euclidean-distance-based similarity measure.
  • LD, the lists detector algorithm.
  • MCD, the main content detector algorithm.
  • An integrator method that combines MCD results with those of other algorithms.
  • CETD algorithm.
  • DOM tags detector script (highlights the chosen nodes).
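
Since the documentation is not available yet, the following is only a minimal sketch of what a Euclidean-distance-based similarity such as DoC could look like. The function names `euclidean_similarity` and `degree_of_coherence`, the `1 / (1 + distance)` mapping, and the use of a mean over pairs are all assumptions made for illustration; they are not taken from the iww source.

```python
import math
from itertools import combinations


def euclidean_similarity(a, b):
    """Map the Euclidean distance between two vectors into (0, 1].

    Identical vectors score 1.0; the score decays towards 0 as the
    distance grows. (Hypothetical mapping, not the library's formula.)
    """
    return 1.0 / (1.0 + math.dist(a, b))


def degree_of_coherence(vectors):
    """Mean pairwise similarity of a group of feature vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # zero or one vector: nothing to disagree with
    return sum(euclidean_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature vectors for three sibling nodes.
siblings = [[3, 1, 0], [3, 1, 0], [3, 2, 0]]
print(degree_of_coherence(siblings))
```

Under these assumptions, a group of structurally similar siblings scores close to 1, which is the kind of value a range such as `(0.75, 1)` would accept.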

P.S.:

  • The documentation isn't available yet.
  • The LD and MCD algorithms are to be released as a research article in the near future.
  • The pip package for iww will be made available online as soon as possible.

USE CASE EXAMPLE :

1- extraction :

from iww.extractor import extractor
from iww.detector import detector
from iww.features_extraction.lists_detector import Lists_Detector as LD
from iww.features_extraction.main_content_detector import MCD

url = "https://www.theiconic.com.au/catalog/?q=kids%20sunglasses"
json_file = "./iconic.json"

# extract the page at `url` and write the result to the JSON file
extractor.extract(
    url = url, 
    destination = json_file
)

2- exploratory data analysis :

from iww.utils.dom_mapper import DOM_Mapper as DM

dm = DM()
dm.retrieve_DOM_tree("./iconic.json")
print("total number of nodes : {}".format(dm.DOM['CETD']['tagsCount']))

total number of nodes : 2098
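
The node count printed above is read from a value stored in the extracted JSON. As a rough cross-check, the same kind of count can be obtained by walking a nested tree directly. The sketch below is an illustration only: the `children` key and the `tagName` field are assumptions about the JSON layout, which is not documented, and `count_nodes` is a hypothetical helper rather than part of iww.

```python
def count_nodes(node, children_key="children"):
    """Count a node and all of its descendants without recursion."""
    total = 0
    stack = [node]
    while stack:
        current = stack.pop()
        total += 1
        # `children_key` is an assumed field name; nodes without it are leaves.
        stack.extend(current.get(children_key, []))
    return total


# Toy tree standing in for the extracted DOM (hypothetical layout).
toy_dom = {
    "tagName": "body",
    "children": [
        {"tagName": "div", "children": [{"tagName": "p"}, {"tagName": "p"}]},
        {"tagName": "footer"},
    ],
}
print("total number of nodes : {}".format(count_nodes(toy_dom)))
```

An explicit stack is used instead of recursion so that deeply nested pages do not hit Python's recursion limit.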

3- LD algorithm :

ld = LD()
ld.retrieve_DOM_tree(file_path = "./iconic.json")
ld.apply(
    node = ld.DOM, 
    coherence_threshold = (0.75, 1), 
    sub_tags_threshold = 2
)
ld.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_ld.png",
    mark_path = "LISTS.mark", 
    mark_value = "1"
)

4- MCD algorithm :

mcd = MCD()
mcd.retrieve_DOM_tree("./iconic.json")
mcd.apply(
    node = mcd.DOM, 
    min_ratio_threshold = 0.0, 
    nbr_nodes_threshold = 1
)
mcd.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_mcd.png",
    mark_path = "MCD.mark", 
    mark_value = "1"
)

5- LD/MCD integration (main list detection) :

mcd.integrate_other_algorithms_results(
    node = mcd.DOM, 
    nbr_nodes = 1,
    mode = "ancestry", 
    condition_features = [("LISTS.mark", "1")]
)
mcd.update_DOM_tree()
detector.detect(
    input_file = "./iconic.json", 
    output_file = "./iconic_main_list.png",
    mark_path = "MCD.main_node", 
    mark_value = "1"
)
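
The `mark_path` values used above ("LISTS.mark", "MCD.mark", "MCD.main_node") look like dotted paths into each node's dictionary. Assuming that reading is right, the marked nodes could also be collected in plain Python for inspection, as sketched below. `get_by_path` and `find_marked` are hypothetical helpers, and the `children` key is an assumed field name; none of them come from the iww API.

```python
def get_by_path(node, path, default=None):
    """Resolve a dotted path such as "LISTS.mark" inside nested dicts."""
    current = node
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find_marked(node, mark_path, mark_value, children_key="children"):
    """Return every node in the tree whose value at `mark_path` equals `mark_value`."""
    matches = []
    stack = [node]
    while stack:
        current = stack.pop()
        if get_by_path(current, mark_path) == mark_value:
            matches.append(current)
        stack.extend(current.get(children_key, []))
    return matches


# Toy tree with hypothetical marks, standing in for the updated JSON file.
toy_dom = {
    "tagName": "body",
    "children": [
        {"tagName": "ul", "LISTS": {"mark": "1"}, "MCD": {"main_node": "1"}},
        {"tagName": "nav", "LISTS": {"mark": "1"}},
        {"tagName": "footer", "LISTS": {"mark": "0"}},
    ],
}
main_lists = find_marked(toy_dom, "MCD.main_node", "1")
print([n["tagName"] for n in main_lists])
```

Returning the matching dictionaries rather than only their tag names leaves the caller free to read any other feature stored on the node.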

License

MIT

MOHAMED-HMINI 2019
