yuanxu-li / html-table-extractor

Licence: MIT license
Extract data from HTML tables

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to html-table-extractor

Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+527.03%)
Mutual labels:  scraping, beautifulsoup
Languagepod101 Scraper
Python scraper for Language Pods such as Japanesepod101.com 👹 🗾 🍣 Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨
Stars: ✭ 104 (+40.54%)
Mutual labels:  scraping, beautifulsoup
TorScrapper
A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)
Stars: ✭ 24 (-67.57%)
Mutual labels:  scraping, beautifulsoup
Scraper-Projects
🕸 List of mini projects that involve web scraping 🕸
Stars: ✭ 25 (-66.22%)
Mutual labels:  scraping, beautifulsoup
chopper
Chopper is a tool to extract elements from HTML by preserving ancestors and CSS rules
Stars: ✭ 22 (-70.27%)
Mutual labels:  scraping, beautifulsoup
Souqscraper
Simple scripts to level up your scraping skills, and source code for the Level UP playlist on YouTube
Stars: ✭ 118 (+59.46%)
Mutual labels:  scraping, beautifulsoup
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+687.84%)
Mutual labels:  scraping, beautifulsoup
html-table-to-json
Generate JSON representations of HTML tables
Stars: ✭ 39 (-47.3%)
Mutual labels:  scraping, html-table
Euro2016 TerminalApp
⚽ Instantly find 🏆EURO 2016 live-streams & highlights, now a Web App!
Stars: ✭ 54 (-27.03%)
Mutual labels:  scraping, beautifulsoup
Requests Html
Pythonic HTML Parsing for Humans™
Stars: ✭ 12,268 (+16478.38%)
Mutual labels:  scraping, beautifulsoup
linkedin-scraper
Tool to scrape linkedin
Stars: ✭ 74 (+0%)
Mutual labels:  scraping, beautifulsoup
react-native-simple-table
A simple table for react native.
Stars: ✭ 32 (-56.76%)
Mutual labels:  table, html-table
Tieba-Birthday-Spider
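Baidu Tieba birthday spider: scrapes the birthdays of a Tieba forum's members and automatically sends greetings on the corresponding dates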
Stars: ✭ 28 (-62.16%)
Mutual labels:  beautifulsoup
linkedinBot
Automate the process of sending referral request and cold mailing on LinkedIn
Stars: ✭ 25 (-66.22%)
Mutual labels:  beautifulsoup
jQuery-Freeze-Table-Column-and-Rows
This is a jQuery plugin that can make table rows and columns not scroll. It can take a given HTML table object and set it so it can freeze a given number of columns or rows or both, so the fixed columns or rows do not scroll. The rows to be frozen should be placed in the table head section. It can also freeze rows and columns combined with using…
Stars: ✭ 20 (-72.97%)
Mutual labels:  html-table
docker-selenium-lambda
The simplest demo of chrome automation by python and selenium in AWS Lambda
Stars: ✭ 172 (+132.43%)
Mutual labels:  scraping
vue-table-for
Easily build a table for your records
Stars: ✭ 33 (-55.41%)
Mutual labels:  table
obj-to-table
Create a table from an array of objects
Stars: ✭ 15 (-79.73%)
Mutual labels:  table
covid19br-pub
Project for monitoring official publications related to COVID-19 in Brazil.
Stars: ✭ 12 (-83.78%)
Mutual labels:  scraping
non-api-fb-scraper
Scrape public FaceBook posts from any group or user into a .csv file without needing to register for any API access
Stars: ✭ 40 (-45.95%)
Mutual labels:  beautifulsoup

HTML Table Extractor

HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables, including tables whose cells use rowspan and colspan.

Important links

Installation

pip install 'beautifulsoup4==4.5.3'
pip install html-table-extractor

Usage

Example 1 - Simple

Rendered input table:

  1  2
  3  4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

It returns (shown as a Python 2 repr; under Python 3 the same strings are displayed without the u prefix):

[[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

Rendered input table:

  1  2
  3  4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc, transformer=int)
extractor.parse()
extractor.return_list()

Because every cell is passed through int, it returns:

[[1, 2], [3, 4]]
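
Example 2 passes the built-in int as the transformer, which suggests the transformer is any callable that takes a cell's text and returns the value to store. The sketch below is an illustration under that assumption, not part of the library: to_number is a hypothetical helper that tolerates whitespace, thousands separators, and non-numeric cells, and it is applied by hand here so the effect can be seen without the library installed.

```python
def to_number(text):
    """Convert cell text to int or float when possible, else return it stripped.

    Hypothetical helper for illustration; not part of html-table-extractor.
    """
    cleaned = text.strip().replace(",", "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue  # try the next type, or fall through to the raw text
    return text.strip()


# Applied directly to raw cell strings to show the conversion in isolation.
raw_rows = [[" 1 ", "2.5"], ["1,000", "n/a"]]
converted = [[to_number(cell) for cell in row] for row in raw_rows]
print(converted)  # [[1, 2.5], [1000, 'n/a']]
```

If the library does call the transformer once per cell, as Example 2 implies, the helper could presumably be passed as Extractor(table_doc, transformer=to_number).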

Example 3 - Pass BS4 Tag

Rendered input table (the one with id 'wanted'):

  1  2
  3  4

from html_table_extractor.extractor import Extractor
from bs4 import BeautifulSoup
table_doc = """
<html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
"""
soup = BeautifulSoup(table_doc, 'html.parser')
extractor = Extractor(soup, id_='wanted')
extractor.parse()
extractor.return_list()

Only the table whose id is 'wanted' is parsed, so it returns:

[[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

Rendered input table:

  +---+---+---+
  | 1 | 2 | 3 |
  |   +---+---+
  |   |   4   |
  +---+---+---+
  |     5     |
  +---+---+---+

from html_table_extractor.extractor import Extractor
table_doc = """
<table>
  <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
  </tr>
  <tr>
    <td colspan=2>4</td>
  </tr>
  <tr>
    <td colspan=3>5</td>
  </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

Each spanned cell is repeated in every position it covers, so it returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

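Rendered input table (the spans conflict: cell 4 is declared with colspan=2, but its second column is already occupied by cell 3, which has rowspan=3):

  1  2  3
     4
  5

from html_table_extractor.extractor import Extractor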
table_doc = """
<table>
    <tr>
        <td rowspan=2>1</td>
        <td>2</td>
        <td rowspan=3>3</td>
    </tr>
    <tr>
        <td colspan=2>4</td>
    </tr>
    <tr>
        <td colspan=2>5</td>
    </tr>
</table>
"""
extractor = Extractor(table_doc)
extractor.parse()
extractor.return_list()

The cell that was placed first keeps the contested position, so it returns:

[[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
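
The outputs in Examples 4 and 5 can be reproduced with a small grid-filling routine. The sketch below is an independent illustration, not the library's actual code: expand_spans is a hypothetical function that takes each cell as a (text, rowspan, colspan) tuple instead of parsing HTML, and it assumes a "first cell wins" rule for conflicting spans because that rule matches the output shown for Example 5.

```python
def expand_spans(rows):
    """Expand rowspan/colspan cells into a rectangular list of lists.

    `rows` is a list of table rows; each cell is (text, rowspan, colspan).
    A grid position that is already filled is never overwritten, so the
    earlier cell wins when spans collide (an assumed policy).
    """
    grid = {}  # maps (row, col) -> cell text
    for r, row in enumerate(rows):
        col = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, col) in grid:
                col += 1
            # Fill every position the cell covers, without overwriting.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid.setdefault((r + dr, col + dc), text)
            col += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions no cell covers are left as None.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# The cells of Example 4 (no conflict) and Example 5 (4 collides with 3).
example_4 = [[("1", 2, 1), ("2", 1, 1), ("3", 1, 1)], [("4", 1, 2)], [("5", 1, 3)]]
example_5 = [[("1", 2, 1), ("2", 1, 1), ("3", 3, 1)], [("4", 1, 2)], [("5", 1, 2)]]
print(expand_spans(example_4))  # [['1', '2', '3'], ['1', '4', '4'], ['5', '5', '5']]
print(expand_spans(example_5))  # [['1', '2', '3'], ['1', '4', '3'], ['5', '5', '3']]
```

Using setdefault rather than plain assignment is what encodes the no-overwrite rule: in Example 5, cell 4 keeps only the one position that cell 3 has not already taken.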

Example 6 - Write to file

Rendered input table:

  1  2
  3  4

from html_table_extractor.extractor import Extractor
table_doc = """
<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
"""
extractor = Extractor(table_doc).parse()
extractor.write_to_csv(path='.')

This creates a new CSV file named output.csv in the given path, with the following contents:

1,2
3,4
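
For readers who want the CSV step without the library, or who want to control the file name and dialect themselves, the list returned by return_list() can be written with the standard csv module. This is a minimal sketch of that idea, not a description of how write_to_csv is implemented; rows_to_csv is a hypothetical helper name.

```python
import csv
import io


def rows_to_csv(rows, stream):
    """Write a list of row lists to an open text stream as CSV."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# An in-memory buffer stands in for a file so the result can be inspected.
buffer = io.StringIO()
rows_to_csv([["1", "2"], ["3", "4"]], buffer)
print(buffer.getvalue())
```

To write to disk instead, the same helper can be given a file opened with open(filename, "w", newline=""), which is the mode the csv module documentation recommends.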

Team

Errors/ Bugs

If something is not working correctly, or if you have any suggestions for improvement, report it here.

Copyright

Copyright (c) 2017 Justin Li. Released under the MIT License

Third-party copyright in this distribution is noted where applicable.

Misc

How to upload the package to PyPI (for the owner's reference):

  • python setup.py bdist_wheel --universal
  • twine upload dist/* --verbose