TechnikEmpire / GQ

Licence: other
CSS Selector Engine for Gumbo Parser

GQ

GQ is a CSS Selector Engine for Gumbo Parser written in C++11. Using Gumbo Parser as a backend, GQ can parse input HTML and let users select and modify elements in the parsed document with CSS Selectors and a simple but powerful mutation API.

This project is a fork of gumbo-query. I opted to make it an unofficial fork because I intended to perform a nearly complete rewrite, which I did, and as such this source is completely irreconcilable with the original gumbo-query source.

Usage

You can either construct a Document around an existing GumboOutput pointer, in which case the Document takes over managing the lifetime of the GumboOutput, or you can supply a raw string of HTML for the Document to parse and manage.

std::string someHtmlString = "...";
std::string someSelectorString = "...";
auto testDocument = gq::Document::Create();
testDocument->Parse(someHtmlString);

try
{
    auto results = testDocument->Find(someSelectorString);
    auto numResults = results.GetNodeCount();
}
catch(const std::runtime_error& e)
{
    // Necessary because naturally, the parser can throw.
}

As you can see, you can pass raw selector strings to the ::Find(...) method, but on each call the selector string is "compiled" into a SharedSelector and then destroyed. Alternatively, you can "precompile" and keep the built selectors, which avoids both the repeated compilation and the need to wrap every ::Find(...) call in a try/catch.

GumboOutput* output = SOMETHING_NOT_NULL;
auto testDocument = gq::Document::Create(output);

gq::Parser parser;

std::vector<std::string> collectionOfRawSelectorStrings {...};
std::vector<gq::SharedSelector> compiledSelectors; // no "()" here: that would declare a function, not a vector
compiledSelectors.reserve(collectionOfRawSelectorStrings.size());

for(auto& s : collectionOfRawSelectorStrings)
{
    try
    {
        auto result = parser.CreateSelector(s);
        compiledSelectors.push_back(result);
    }
    catch(const std::runtime_error& e)
    {
        // Necessary because naturally, the parser can throw.
    }
}

size_t numResults = 0;
for(auto& ss : compiledSelectors)
{
    auto results = testDocument->Find(ss);
    numResults += results.GetNodeCount();
}
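
The precompile loop above can be wrapped in a small reusable helper. The following is only a sketch: CompileAll and the rejected list are hypothetical names that are not part of GQ, and the helper assumes nothing beyond what the snippet above shows, namely a parser object whose CreateSelector(...) takes a string and can throw std::runtime_error. Both types are template parameters so the sketch does not depend on the GQ headers.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of GQ. With GQ, Selector would be
// gq::SharedSelector and Parser would be gq::Parser.
template <typename Selector, typename Parser>
std::vector<Selector> CompileAll(Parser& parser,
                                 const std::vector<std::string>& rawSelectors,
                                 std::vector<std::string>& rejected)
{
    std::vector<Selector> compiled;
    compiled.reserve(rawSelectors.size());

    for (const auto& raw : rawSelectors)
    {
        try
        {
            compiled.push_back(parser.CreateSelector(raw));
        }
        catch (const std::runtime_error&)
        {
            // Keep the offending string so the caller can log or inspect it.
            rejected.push_back(raw);
        }
    }

    return compiled;
}
```

Under those assumptions, a call would look like CompileAll<gq::SharedSelector>(parser, collectionOfRawSelectorStrings, rejected), leaving the try/catch in exactly one place.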

These snippets are only meant to demonstrate the most basic usage. Thanks to the mutation API, it's possible to have fine-grained control over elements matched by selectors. Look at the mutation sample for a complete example of using this feature.

The contract placed on the end user is very light. Keep Document alive for as long as you're storing or accessing any Node object, directly or indirectly. That's basically it.
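
One way to make this contract hard to violate is to store the owning pointer right next to the results. The sketch below uses toy stand-ins: ToyDocument, ToyNode, QueryResult and FindByTag are all hypothetical and are not GQ types. It also assumes the document is held in a std::shared_ptr, which may not match how gq::Document::Create() actually hands out ownership.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a document: it owns the storage that nodes point into.
struct ToyDocument
{
    std::vector<std::string> tags;
};

// Toy stand-in for a node: a non-owning view into the document's storage.
struct ToyNode
{
    const std::string* tag;
};

// Bundles the node views with a shared owner so they cannot dangle.
struct QueryResult
{
    std::shared_ptr<ToyDocument> owner;
    std::vector<ToyNode> nodes;
};

QueryResult FindByTag(const std::shared_ptr<ToyDocument>& doc,
                      const std::string& tag)
{
    QueryResult result;
    result.owner = doc; // keeps the document alive as long as the result lives

    for (const auto& t : doc->tags)
    {
        if (t == tag)
        {
            result.nodes.push_back(ToyNode{&t});
        }
    }

    return result;
}
```

With this shape, the caller can drop its own pointer to the document and the stored nodes stay valid until the QueryResult itself goes away.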

Speed

One of the primary goals with this engine was to maximize speed. For my purposes, I wanted to ensure I could run a very large number of selectors without any visible delay to the user. The TestParser test benchmarks parsing and running every single selector in EasyList (save for a handful which were removed because they're improperly formatted) against the landing page HTML of a standard high-profile website. For example, if I download the source for the landing page of yahoo.com and use it in the parser test at the time of this writing, the current results on my dev laptop are:

Processed 27646 selectors. Had handled errors? false
Benchmarking parsing speed.
Time taken to parse 2764600 selectors: 2443.11 ms.
Processed at a rate of 0.000883713 milliseconds per selector or 1131.59 selectors per millisecond.
Benchmarking document parsing.
Time taken to parse 100 documents: 8054.37 ms.
Processed at a rate of 80.5437 milliseconds per document.
Benchmarking selection speed.
Time taken to run 2764600 selectors against the document: 5709.75 ms producing 23300 total matches.
Processed at a rate of 0.00206531 milliseconds per selector or 484.189 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 100 times: 6110.32 ms.
Time per cycle 61.1032 ms.
Processed at a rate of 0.0022102 milliseconds per selector or 452.448 selectors per millisecond.

So from these results, a document could be loaded, parsed, and have 27646 precompiled selectors run on it in about 137.6412 milliseconds (80.5437 ms to parse plus 57.0975 ms for one pass of selection). If you include reserializing the input to an HTML string with modifications, it's about 141.6469 ms (80.5437 ms to parse plus 61.1032 ms per mutation cycle) to load and parse the document, run 27646 selectors, and serialize the output, with modifications based on those selectors, back to an HTML string.
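
For anyone who wants to redo this arithmetic with their own benchmark output, here is a minimal sketch. The struct and function names are hypothetical and not part of the test program; the only assumption is that each total covers the same number of repetitions (100 in the output above, since 2764600 = 27646 x 100).

```cpp
// Totals as printed by the benchmark, all in milliseconds.
struct BenchmarkTotals
{
    double parseDocsMs; // time taken to parse `runs` documents
    double selectMs;    // time taken to run every selector `runs` times
    double mutateMs;    // time taken for `runs` select-and-serialize cycles
    int runs;           // number of repetitions behind each total
};

// Parse one document and run every precompiled selector against it once.
double LoadAndSelectMs(const BenchmarkTotals& b)
{
    return (b.parseDocsMs + b.selectMs) / b.runs;
}

// Parse one document, run every selector, and serialize with mutations.
double LoadSelectAndSerializeMs(const BenchmarkTotals& b)
{
    return (b.parseDocsMs + b.mutateMs) / b.runs;
}
```

Feeding in the yahoo.com totals (8054.37, 5709.75, 6110.32 and 100 runs) reproduces the 137.6412 ms and 141.6469 ms figures quoted above.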

Naturally, the speed can vary greatly depending on the size and complexity of the input HTML. For example, running the same test program against the cnn.com landing page yields the following results:

Processed 27646 selectors. Had handled errors? false
Benchmarking parsing speed.
Time taken to parse 2764600 selectors: 2396.14 ms.
Processed at a rate of 0.000866723 milliseconds per selector or 1153.77 selectors per millisecond.
Benchmarking document parsing.
Time taken to parse 100 documents: 2081.5 ms.
Processed at a rate of 20.815 milliseconds per document.
Benchmarking selection speed.
Time taken to run 2764600 selectors against the document: 3321.3 ms producing 9900 total matches.
Processed at a rate of 0.00120137 milliseconds per selector or 832.386 selectors per millisecond.
Benchmarking mutation.
Time taken to run 27646 selectors against the document while serializing with mutations 100 times: 3478.62 ms.
Time per cycle 34.7862 ms.
Processed at a rate of 0.00125827 milliseconds per selector or 794.741 selectors per millisecond.

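As you can see, selection and mutation run at roughly 1.7 times the speed measured for the yahoo.com landing page (832.386 versus 484.189 selectors per millisecond for selection), approaching double.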

Speed doesn't mean much if the matching code is broken. As such, over 40 tests currently exist that ensure correct functionality of various types of selectors. I have yet to write tests for nested and combined selectors.

Configuration

Originally, there were only scripts/project files for building GQ under Windows with Visual Studio 2015. There is no reason why GQ cannot be used under Linux or OSX; I simply had not gone there yet.
A recent contribution to the repository has added CMake support, along with bug fixes that enable correct functionality under Linux. There is a minimal amount of setup required for building under Windows with VS, and it's detailed in the Wiki.

TODO

  • Mutation API.
  • Tests for combined and nested selectors.
  • Reduce candidate collections BEFORE attempting to match in the event that the selector is a BinarySelector with the intersection operator. Sets can be reduced by keeping only candidates that match the traits from both the left and right hand sides of the BinarySelector, which would drastically reduce candidates and thus drastically increase matching speed. This was tried and abandoned: it's actually faster to just let it chew through all candidates.
  • Modify Selector::Match() and related methods to return the final matched node. Required for child selectors and such.
  • Workaround for including root node without having to switch to the abysmal weak_ptr in TreeMap.
  • Scripts/Project Files for building/using under Linux, OSX.
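
To illustrate the candidate-reduction idea from the list above (the one that was tried and abandoned), here is a rough sketch of what pre-intersecting two candidate sets could look like. This is not GQ's code: NodeId and IntersectCandidates are hypothetical, and the sketch assumes candidates are kept as sorted lists of ids.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

using NodeId = int; // hypothetical stand-in for whatever identifies a node

// Keeps only the candidates that appear on both sides of an intersection.
// Both inputs must be sorted ascending, as std::set_intersection requires.
std::vector<NodeId> IntersectCandidates(const std::vector<NodeId>& left,
                                        const std::vector<NodeId>& right)
{
    assert(std::is_sorted(left.begin(), left.end()));
    assert(std::is_sorted(right.begin(), right.end()));

    std::vector<NodeId> both;
    std::set_intersection(left.begin(), left.end(),
                          right.begin(), right.end(),
                          std::back_inserter(both));
    return both;
}
```

Building and intersecting these sets is itself extra work on every match, which is consistent with the note above that simply chewing through all candidates turned out faster.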

Original Goals

  • Wrapping things up in proper namespaces.
  • Remove custom rolled automatic reference counting, remove any sort of shared_ptr and make lifetime management simple.
  • Fix broken parsing that was ported from cascadia, but is invalid for use with Gumbo Parser.
  • Make parsing/matching produce the same behavior as jQuery does on the exact same test data.
  • Replace std::string with boost::string_ref wherever string copies don't truly need to be generated.
  • Implement a mapping system to dramatically increase matching speed by filtering potential matches by traits.
  • Remove local state tracking from the selector parser.
  • Expose compiled selectors to the public so that they can be retained and recycled against existing and new documents.
  • "Comments. Lots of comments."
  • "Speed. Lots of Speed."
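
As a rough illustration of the "mapping system" goal in the list above, filtering by traits can be pictured as an index from a trait string to the nodes that carry it, so that a selector only has to inspect nodes sharing its trait. This is a sketch under assumptions, not GQ's actual data structure: TraitIndex, the integer node ids and the "tag:div" style keys are all hypothetical.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical trait index: maps a trait (for example a tag name or a class)
// to the ids of the nodes that have it.
class TraitIndex
{
public:
    void Add(const std::string& trait, int nodeId)
    {
        m_byTrait[trait].push_back(nodeId);
    }

    // Returns the candidate nodes for a trait, or an empty list if none exist.
    const std::vector<int>& Candidates(const std::string& trait) const
    {
        static const std::vector<int> none;
        auto it = m_byTrait.find(trait);
        return it == m_byTrait.end() ? none : it->second;
    }

private:
    std::unordered_map<std::string, std::vector<int>> m_byTrait;
};
```

A selector for div elements would then only visit Candidates("tag:div") instead of walking the whole document.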