ianb / Personal History Archive

Licence: mpl-2.0

An experiment in creating a dump of your personal browser history for analysis

Labels

jupyter-notebook

personal-history-archive

Creating a dump of your personal browser history for analysis. This is a tool for people who want to research browsing behavior and content, starting with the only dataset you'll really be able to create: data about yourself.

Motivation

This is for creating a browsing corpus for later analysis. It's not a feasible end-user tool, and it collects information that can't normally be shared. But if you are interested in browsing behavior and web content analysis, then this is the package for you!

The data collected here is specifically what you see and do via the browser. Unlike spidering or fetching documents via the command-line, you get fully rendered and personalized pages. This will help you include information in your corpus that specifically isn't available on the open web.

Features

Using this tool you can:

- Extract your history from multiple browsers into a database
- Fetch high-quality versions of your history items:
  - Get frozen pages from the browser (no worries about JavaScript)
  - Fetch pages using your cookies and authentication (get personal and personalized versions of pages)
  - All HTML is well-formed, and links are made absolute
  - HTML can be re-rendered easily
- The frozen HTML has additional annotations to make it easier to interpret:
  - Hidden elements are marked as such
  - Elements whose display style is changed are marked as such (useful if you want to look for any block-like element)
  - The Readability library is used to extract a "readable" form
  - Elements in the original document that form the readable view are marked as such
  - The natural/rendered sizes of images are included
  - A first-page screenshot is taken, along with a full-length thumbnail
- Track ongoing browsing, collecting additional information not in normal browsing history:
  - Reliably track what page leads to the next page
  - Track what link click led to the next page
  - Track how often and for how long the page was the active tab
  - And more!
- A Python library is included to help interpret your results:
  - Load and query history items and pages
  - Parse pages (using lxml)
  - A growing list of miscellany...
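
As a rough illustration of why tracking "what page leads to the next page" is useful, the sketch below rebuilds the navigation path that ended at a given visit. It does not use this project's database or the pha library: the record layout (`id`, `url`, `source_id`) and the function name `navigation_path` are hypothetical, chosen only for the example.

```python
from typing import Dict, List, Optional

# Hypothetical visit records (not this project's actual schema): each visit
# has an id, a URL, and the id of the visit that led to it, or None when the
# page was opened directly.
visits: List[Dict[str, Optional[str]]] = [
    {"id": "v1", "url": "https://example.com/", "source_id": None},
    {"id": "v2", "url": "https://example.com/news", "source_id": "v1"},
    {"id": "v3", "url": "https://example.com/news/story", "source_id": "v2"},
    {"id": "v4", "url": "https://example.org/", "source_id": None},
]


def navigation_path(records, visit_id):
    """Return the URLs that led to visit_id, oldest first."""
    by_id = {record["id"]: record for record in records}
    path = []
    seen = set()  # guards against accidental cycles in the data
    current = visit_id
    while current is not None and current not in seen:
        seen.add(current)
        record = by_id.get(current)
        if record is None:  # the source visit was never recorded
            break
        path.append(record["url"])
        current = record["source_id"]
    path.reverse()
    return path


print(navigation_path(visits, "v3"))
# ['https://example.com/', 'https://example.com/news', 'https://example.com/news/story']
```

Following the chain backwards from the last visit keeps the lookup simple, since each visit has at most one source.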

Examples

Overview

This consists of two parts:

- A browser extension (for Firefox and Chrome) to save your history and activity
- A Python library to use and analyze the history

Installation

You must check out this repository to use the package.

Run npm install to install the necessary packages and to set up the Python 3 environment. (A virtualenv environment is created in .venv/.)

After installation you must restart your Firefox browser (Chrome support is iffy right now), go to about:debugging, and manually install the extension from build/extension/.

Data will begin to be collected in data/

Fetching history

Once you have history uploaded, you may want to fetch static versions of your old history (from before you installed the extension).

Note: these instructions are incorrect, and need updating after #57 is fixed.

Use ./bin/launch-fetcher to launch a Firefox instance dedicated to fetching. You will probably want to use ./bin/launch-fetcher --use-profile "Profile Name" to work from a copy of an existing profile (after doing that once, the profile copy is kept for later launches). Choose a profile that is logged into your services, so that you get personalized versions of your pages.

The page http://localhost:11180/ will be loaded automatically in the fetcher browser instance, and that lets you start fetching pages.

You may want to review http://localhost:11180/viewer/redirected to see pages that were redirected. These are often pages that required authentication that was missing. You can log in to those pages, then delete the fetched page so it can be re-fetched.

Python library

There's a Python 3 library in the python/ subdirectory. It gets automatically installed into the .venv/ virtualenv, but you could install it elsewhere too.

You can install it like:

$ cd python
$ pip install -e .
# Optional packages:
$ pip install -r requirements.txt

This adds a package called pha. There is some information in the subdirectory, and the notebooks (*.ipynb) show many examples (though as of March 2018, they are out of date due to refactorings).
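
Since the notebooks are out of date, here is a minimal sketch of the kind of page analysis the library is meant to support. It deliberately does not call pha, whose API is not described in this README, and it uses the standard library's `html.parser` instead of lxml so that it runs without extra packages. The sample HTML, the `LinkCollector` class, and the `extract_links` function are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for the HTML of one saved page.
sample = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(sample, "https://example.com/blog/post"))
# ['https://example.com/about', 'https://example.org/x']
```

With lxml installed, `lxml.html.fromstring()` followed by `iterlinks()` would be a more robust choice for real pages.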

Random walk

There's a script that will do random activity in the browser, saving data to test/walk-data/. Run:

$ npm run walk
# Or if you want to try a configuration in test/walk-configs/news.json that goes to news sites:
$ CONFIG=news npm run walk

Testing

The tests are in test/. To run the tests:

$ npm test

You can use NO_CLOSE=1 to leave the browser open after the test completes (this can be helpful to understand failures). Use TEST_ARGS="..." to pass Mocha command-line arguments. For example, TEST_ARGS='-g 404s' npm test runs only the tests with "404s" in the test description.

The temporary data will be in test/test-data/ and you may find test/test-data/addon.log particularly interesting, as the Browser Console isn't very accessible from the test environment.
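
One convenient way to look at the end of that log after a failing run is a small helper like the one below. This is only a sketch: the `tail` function is not part of this repository, and nothing is assumed about the log's format beyond it being a text file.

```python
from collections import deque
from pathlib import Path


def tail(path, count=20):
    """Return the last `count` lines of a text file, without newlines."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    log_path = Path("test/test-data/addon.log")
    if log_path.exists():
        for line in tail(log_path):
            print(line)
    else:
        print(f"No log found at {log_path}; run the tests first.")
```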

Development

If you want to run it interactively in a fresh profile, use:

$ npm start

This will run a new browser profile, with data going into dev-data/ (and logs in dev-data/addon.log). Changes are not automatically picked up, so you have to restart the browser after changes. There is no migration, so you may have to wipe out dev-data/ after changes to the schema.

Collaborating

If you have a question, probably the best thing is to open a ticket. If you are interested in implementing something, it would also be great to open a ticket so we can discuss.

If you'd like to chat, I've created a channel #pha on irc.mozilla.org. I (ianbicking) am usually only online during business hours, Central Time/UTC-6.

Credits

The icon comes from Open Iconic
