All Projects → Networks-Learning → Stackexchange Dump To Postgres

Networks-Learning / Stackexchange Dump To Postgres

Licence: mit
Python scripts to import StackExchange data dump into Postgres DB.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Stackexchange Dump To Postgres

Efcore.pg
Entity Framework Core provider for PostgreSQL
Stars: ✭ 838 (+1344.83%)
Mutual labels:  database, postgres
Squid
🦑 Provides SQL tagged template strings and schema definition functions.
Stars: ✭ 57 (-1.72%)
Mutual labels:  database, postgres
Docker Postgres
A docker container running PostgreSQL
Stars: ✭ 22 (-62.07%)
Mutual labels:  database, postgres
Blog
Everything about database,business.(Most for PostgreSQL).
Stars: ✭ 6,330 (+10813.79%)
Mutual labels:  database, postgres
Niklick
Rails Versioned API solution template for hipsters! (Ruby, Ruby on Rails, REST API, GraphQL, Docker, RSpec, Devise, Postgress DB)
Stars: ✭ 39 (-32.76%)
Mutual labels:  database, postgres
Metabase
The simplest, fastest way to get business intelligence and analytics to everyone in your company 😋
Stars: ✭ 26,803 (+46112.07%)
Mutual labels:  database, postgres
Go Kallax
Kallax is a PostgreSQL typesafe ORM for the Go language.
Stars: ✭ 853 (+1370.69%)
Mutual labels:  database, postgres
Citus
Distributed PostgreSQL as an extension
Stars: ✭ 5,580 (+9520.69%)
Mutual labels:  database, postgres
Goqu
SQL builder and query library for golang
Stars: ✭ 984 (+1596.55%)
Mutual labels:  database, postgres
Entityframeworkcore.bootkit
EntityFrameworkCore Start Kit
Stars: ✭ 29 (-50%)
Mutual labels:  database, postgres
Rein
Database constraints made easy for ActiveRecord.
Stars: ✭ 657 (+1032.76%)
Mutual labels:  database, postgres
East
node.js database migration tool
Stars: ✭ 53 (-8.62%)
Mutual labels:  database, postgres
Pgmetrics
Collect and display information and stats from a running PostgreSQL server
Stars: ✭ 612 (+955.17%)
Mutual labels:  database, postgres
Migrate
Database migrations. CLI and Golang library.
Stars: ✭ 7,712 (+13196.55%)
Mutual labels:  database, postgres
Otj Pg Embedded
Java embedded PostgreSQL component for testing
Stars: ✭ 559 (+863.79%)
Mutual labels:  database, postgres
Node Pg Migrate
Node.js database migration management for Postgresql
Stars: ✭ 838 (+1344.83%)
Mutual labels:  database, postgres
Gnorm
A database-first code generator for any language
Stars: ✭ 415 (+615.52%)
Mutual labels:  database, postgres
Zero downtime migrations
Zero downtime migrations with ActiveRecord 3+ and PostgreSQL
Stars: ✭ 513 (+784.48%)
Mutual labels:  database, postgres
Awesome Postgres
A curated list of awesome PostgreSQL software, libraries, tools and resources, inspired by awesome-mysql
Stars: ✭ 7,468 (+12775.86%)
Mutual labels:  database, postgres
Postgresclientkit
A PostgreSQL client library for Swift. Does not require libpq.
Stars: ✭ 49 (-15.52%)
Mutual labels:  database, postgres

StackOverflow data to postgres

This is a quick script to move the Stackoverflow data from the StackExchange data dump (Sept '14) to a Postgres SQL database.

Schema hints are taken from a post on Meta.StackExchange and from StackExchange Data Explorer.

Quickstart

Install requirements, create a stackoverflow database, and use load_into_pg.py script:

$ pip install -r requirements.txt
...
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
$ createdb beerSO
$ python load_into_pg.py -s beer -d beerSO

This will download compressed files from archive.org and load all the tables at once.

Advanced Usage

You can use a custom database name as well. Make sure to explicitly give it while executing the script later.

Each table data is archived in an XML file. Available tables varies accross history. load_into_pg.py knows how to handle the following tables:

  • Badges.
  • Posts.
  • Tags (not present in earliest dumps).
  • Users.
  • Votes.
  • PostLinks.
  • PostHistory.
  • Comments.

You can download manually the files to the folder from where the program is executed: Badges.xml, Votes.xml, Posts.xml, Users.xml, Tags.xml. In some old dumps, the cases in the filenames are different.

Then load each file with e.g. python load_into_pg.py -t Badges.

After all the initial tables have been created:

$ psql beerSO < ./sql/final_post.sql

For some additional indexes and tables, you can also execute the the following;

$ psql beerSO < ./sql/optional_post.sql

If you give a schema name using the -n switch, all the tables will be moved to the given schema. This schema will be created in the script.

The paths are not changed in the final scripts sql/final_post.sql and sql/optional_post.sql. To run them, first set the search_path to your schema name: SET search_path TO <myschema>;

Caveats and TODOs

  • It prepares some indexes and views which may not be necessary for your analysis.
  • The Body field in Posts table is NOT populated by default. You have to use --with-post-body argument to include it.
  • The EmailHash field in Users table is NOT populated.

Sept 2011 data dump

  • The tags.xml is missing from the data dump. Hence, the PostTag and UserTagQA tables will be empty after final_post.sql.
  • The ViewCount in Posts is sometimes equal to an empty value. It is replaced by NULL in those cases.

Acknowledgement

@madtibo made significant contributions by adding jsonb and Foreign Key support. @bersace brought the dependencies and the README.md instructions into 2020.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].