All Projects → vseloved → crawlik

vseloved / crawlik

Licence: other
Lisp web crawler and scrapper

Programming Languages

common lisp
692 projects
Clean
5 projects

CRAWLIK is a Lisp web crawler and scrapper

To use crawlik, you, basically, need to define methods:

  • crawl specifies how to process the whole website. A default version is provided, although, it will probably serve just for illustration purposes
  • scrape specifies how to process a particular page, and this is where you define the data extraction logic and, possibly, how to determine the next page URL (if present, should be returned as a second value)

CRAWLIK DSL

Scraing data from the web page may be performed in arbitrary manner, yet crawlik helps by providing a function to parse HTML (even not very well-formed) - parse-dirty-xml and defining a DSL to match DOM trees. The DSL expression is similar to a regex and it may be fed to match-html taht will find all the matching instances in the tree and extract the matching parts into a hash-table.

The DSL syntax includes:

  • regular matching at point: * matches any tag or a specific tag may be provided: body, th, etc.
  • (tag attrs) matches a tag with a certain set of attributes
  • >> does a DFS search in the current DOM subtree
  • $ specifies a part that should be saved in the resulting hash-table
  • !!! signals a complete match and allows to continue matching to find more than 1 instance

A simple example: (>>> table (tr (th) ((td :class "cell") ($ data))) will match a table with a a row in which there will be a TH and TD of the class "cell" and all of the TD's contents will be saved to the key "data" in the result hash-table.

Organizational notes

(c) 2017, Vsevolod Dyomkin [email protected]

See LICENSE for usage permissions.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].