database - How to efficiently utilize 10+ computers to import data


We have flat files (CSV) with more than 200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long, we want to utilize 40 computers to do the importing.

My question is:

How can we efficiently utilize the 40 computers to do the importing? My main worry is that a lot of time will be spent replicating the dimension tables across all the nodes, as they need to be identical on every node. This could mean that if we utilized 1000 servers for importing in the future, it might actually be slower than using a single one, due to the extensive network communication and coordination between the servers.

Does anybody have a suggestion?

Edit:

The following is a simplification of the CSV files:

"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"

After importing, the tables look like this:

dimension_table1

id  name
1   "avalue"
2   "bvalue"

dimension_table2

id  name
1   "anothervalue"
2   "evenanothervalue"

fact table

dimension_table1_id    dimension_table2_id
1                      1
2                      2
1                      2
1                      2
2                      2
1                      1

Loading the CSV data into a database is slow because it needs to read, split and validate the data.

So what you should try is this:

  1. Set up a local database on each computer. This will get rid of the network latency.

  2. Load a different part of the data on each computer. Try to give each computer the same size chunk. If that isn't easy for some reason, give each computer, say, 10,000 rows. When they are done, give them the next chunk.

  3. Dump the data with the DB tools.

  4. Load all dumps into a single DB.
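The chunking in step 2 can be sketched with standard tools. This is only a sketch; the file names and the chunk size of 10 lines are illustrative stand-ins for the real 200M-row file:

```shell
# Create a small sample file (stand-in for the real huge CSV).
seq 1 25 > huge.csv

# Split it into fixed-size chunks; each chunk can then be handed
# out to one of the import machines in turn.
split -l 10 -d huge.csv chunk_
# → chunk_00 and chunk_01 (10 lines each), chunk_02 (5 lines)
```

With GNU split, `-l` sets the lines per chunk and `-d` produces numeric suffixes, which makes it easy to hand out chunks in order.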

Make sure that your loader tool can import data into a table which already contains data. If it can't do this, check your DB documentation for "remote table". A lot of databases allow you to make a table from another DB server visible locally.

That allows you to run commands like:

     insert into table (....) select .... from remote_server.table

If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local DBs. I suggest adding the PKs to the CSV file beforehand.
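One way to pre-assign the keys, sketched with awk (the file names are illustrative, not from the original setup):

```shell
# Sample rows (stand-in for the real CSV).
printf '"avalue";"anothervalue"\n"bvalue";"evenanothervalue"\n' > data.csv

# Prepend a running row number as a primary key column, so every
# machine imports rows whose keys are already assigned and unique.
awk '{ print NR ";" $0 }' data.csv > data_with_pk.csv
# → 1;"avalue";"anothervalue"
#   2;"bvalue";"evenanothervalue"
```

If the file is split across machines before this step, each machine would need a disjoint key range (e.g. an offset per chunk) to keep the keys globally unique.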

[Edit] After checking your edits, here is what you should try:

  1. Write a small program which extracts the unique values in the first and second column of the CSV files. A simple script like this could do it:

     cut -d";" -f1 | sort -u | nawk '{ print FNR ";" $0 }'

    This is a pretty cheap process (a couple of minutes even for huge files). It gives you the id-value files.

  2. Write a program which reads the new id-value files, caches them in memory, then reads the huge CSV files and replaces the values with their ids.

    If the id-value files are too big, do this step only for the small files and load the huge ones into the 40 per-machine DBs.

  3. Split the huge file into 40 chunks and load each of them on one of the machines.

    If you had huge id-value files, you can use the tables created on each machine to replace all the values that remained.

  4. Use backup/restore or remote tables to merge the results.

    Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.

See here for an introduction.
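Steps 1 and 2 above can be sketched end-to-end with standard tools, using the simplified CSV from the question (plain awk instead of nawk; the file names are illustrative):

```shell
# The simplified CSV from the question.
cat > data.csv <<'EOF'
"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue"
"bvalue";"evenanothervalue"
"avalue";"anothervalue"
EOF

# Step 1: extract the unique values of each column and number them.
cut -d';' -f1 data.csv | sort -u | awk '{ print FNR ";" $0 }' > dim1.csv
cut -d';' -f2 data.csv | sort -u | awk '{ print FNR ";" $0 }' > dim2.csv

# Step 2: cache both id-value files in memory, then stream the big
# CSV and emit the fact rows with each value replaced by its id.
awk -F';' '
    FILENAME == "dim1.csv" { id1[$2] = $1; next }
    FILENAME == "dim2.csv" { id2[$2] = $1; next }
    { print id1[$1] ";" id2[$2] }
' dim1.csv dim2.csv data.csv > facts.csv
# → facts.csv holds 1;1  2;2  1;2  1;2  2;2  1;1 (one pair per line)
```

dim1.csv and dim2.csv become the dimension tables, facts.csv the fact table, matching the tables shown in the question; the replacement pass is the part that parallelizes naturally across the 40 machines, since the cached id-value maps are read-only.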

