date - joining data based on a moving time window in R

- February 15, 2012

I have weather data recorded every hour, and location data (x,y) recorded every 4 hours. I want to know the temperature at location x,y. The weather data isn't recorded at the same times. So I have written a loop that, for every location, scans through the weather data looking for the "closest" date/time and extracts the data from that time.

The problem is that, the way I've written it, when it gets to location #2 it scans through the weather data but does not allow the closest time that was already assigned to location #1 to be assigned again. Say locations #1 and #2 were taken within 10 minutes of each other, at 6pm and 6:10pm, and the closest weather time is 6pm. I can't seem to allow the weather data at 6pm as an option for both.

I set it up this way because, 200 locations into my location data set (say 3 months into it), I do not want it to start at time 0 of the weather data when I know the closest weather data was just calculated for the last location and happens to be 3 months into that data set too. Below are some sample data and my code. I don't know if this makes sense.

####location data

    x   y   datetime
    1   2   4/2/2003    18:01:01
    3   2   4/4/2003    17:01:33
    2   3   4/6/2003    16:03:07
    5   6   4/8/2003    15:03:08
    3   7   4/10/2003   14:03:06
    4   5   4/2/2003    13:02:00
    4   5   4/4/2003    12:14:43
    4   3   4/6/2003    11:00:56
    3   5   4/8/2003    10:02:06
    2   4   4/10/2003   9:02:19

weather data

    datetime            wndsp   wnddir  hgt
    4/2/2003 17:41:00   8.17    102.86  3462.43
    4/2/2003 20:00:00   6.70    106.00  17661.00
    4/2/2003 10:41:00   6.18    106.00  22000.00
    4/2/2003 11:41:00   5.78    106.00  22000.00
    4/2/2003 12:41:00   5.48    104.00  22000.00
    4/4/2003 17:53:00   7.96    104.29  6541.00
    4/4/2003 20:53:00   6.60    106.00  22000.00
    4/4/2003 19:41:00   7.82    105.00  7555.00
    4/4/2003 7:41:00    6.62    105.00  14767.50
    4/4/2003 8:41:00    6.70    106.00  17661.00
    4/4/2003 9:41:00    6.60    106.00  22000.00
    4/5/2003 20:41:00   7.38    106.67  11156.67
    4/6/2003 18:07:00   7.82    105.00  7555.00
    4/6/2003 21:53:00   6.18    106.00  22000.00
    4/6/2003 21:41:00   6.62    105.00  14767.50
    4/6/2003 4:41:00    7.96    104.29  6541.00
    4/6/2003 5:41:00    7.82    105.00  7555.00
    4/6/2003 6:41:00    7.38    106.67  11156.67
    4/8/2003 18:53:00   7.38    106.67  11156.67
    4/8/2003 22:53:00   5.78    106.00  22000.00
    4/8/2003 1:41:00    5.78    106.00  22000.00
    4/8/2003 2:41:00    5.48    104.00  22000.00
    4/8/2003 3:41:00    8.17    102.86  3462.43
    4/10/2003 19:53:00  6.62    105.00  14767.50
    4/10/2003 23:53:00  5.48    104.00  22000.00
    4/10/2003 22:41:00  6.70    106.00  17661.00
    4/10/2003 23:41:00  6.60    106.00  22000.00
    4/10/2003 0:41:00   6.18    106.00  22000.00
    4/11/2003 17:41:00  8.17    102.86  3462.43
    4/12/2003 18:41:00  7.96    104.29  6541.0

    weathrow = 1
    for (i in 1:nrow(sortloc)) {
        t = 0
        while (t < 1) {
            timedif1 = difftime(sortloc$datetime[i], sortweath$datetime[weathrow], units="auto")
            timedif2 = difftime(sortloc$datetime[i], sortweath$datetime[weathrow+1], units="auto")

            if (timedif2 < 0) {
                if (abs(timedif1) < abs(timedif2)) {
                    sortloc$wndsp[i]=sortweath$wndsp[weathrow]
                    sortloc$wnddir[i]=sortweath$wnddir[weathrow]
                    sortloc$hgt[i]=sortweath$hgt[weathrow]
                } else {
                    sortloc$wndsp[i]=sortweath$wndsp[weathrow+1]
                    sortloc$wnddir[i]=sortweath$wnddir[weathrow+1]
                    sortloc$hgt[i]=sortweath$hgt[weathrow+1]
                }
                t = 1
            }
            if (abs(sortloc$datetime[i] - sortloc$datetime[i+1] < 50)) {
                weathrow=weathrow
            } else {
                weathrow = weathrow+1
                #if(weathrow = nrow(sortweath)){t=1}
            }
        } #end while
    }

You could use the findInterval function to find the nearest value:

    # example data:
    x <- rnorm(120000)
    y <- rnorm(71000)
    y <- sort(y) # second vector must be sorted
    id <- findInterval(x, y, all.inside=TRUE) # finds position of last y smaller than x
    id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # to find nearest
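To make the two steps easier to check by hand, here is a small sketch with made-up numbers in place of the random vectors. The values of x and y are assumptions chosen only for illustration.

```r
# Small, hand-checkable illustration (all values are made up for this sketch)
y <- c(10, 20, 30, 40)      # sorted reference points
x <- c(12, 19, 31, 5, 45)   # query points, in any order

# all.inside = TRUE keeps every index in 1..(length(y) - 1),
# so y[id + 1] exists even when x lies outside the range of y
id <- findInterval(x, y, all.inside = TRUE)

# choose whichever neighbour, y[id] or y[id + 1], is closer to x
id_min <- ifelse(abs(x - y[id]) < abs(x - y[id + 1]), id, id + 1)

nearest <- y[id_min]
print(data.frame(x = x, id = id, id_min = id_min, nearest = nearest))
```

Note that the points 5 and 45 fall outside the range of y and are still matched to the nearest end point, which is the reason for all.inside=TRUE.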

In your case as.numeric might be needed.

    # assumed that sortweath is sorted, if not then:
    # sortweath <- sortweath[order(sortweath$datetime),]
    x <- as.numeric(sortloc$datetime)
    y <- as.numeric(sortweath$datetime)
    id <- findInterval(x, y, all.inside=TRUE)
    id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
    sortloc$wndsp  <- sortweath$wndsp[id_min]
    sortloc$wnddir <- sortweath$wnddir[id_min]
    sortloc$hgt    <- sortweath$hgt[id_min]
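As a rough, self-contained check of this approach, the sketch below runs it on a few rows taken from the sample data in the question. Three things are assumptions and not part of the original post: the date format string, the choice of tz = "UTC", and the extra location at 18:11:01, which is made up to show two locations sharing one weather record. The helper column weath_time is also hypothetical.

```r
fmt <- "%m/%d/%Y %H:%M:%S"  # assumed format of the datetime strings

# Three locations from the question plus one made-up fix at 18:11:01
sortloc <- data.frame(
  x = c(1, 1, 3, 2),
  y = c(2, 2, 2, 3),
  datetime = as.POSIXct(
    c("4/2/2003 18:01:01", "4/2/2003 18:11:01",
      "4/4/2003 17:01:33", "4/6/2003 16:03:07"),
    format = fmt, tz = "UTC")
)

# Four weather records from the question
sortweath <- data.frame(
  datetime = as.POSIXct(
    c("4/2/2003 17:41:00", "4/2/2003 20:00:00",
      "4/4/2003 17:53:00", "4/6/2003 18:07:00"),
    format = fmt, tz = "UTC"),
  wndsp = c(8.17, 6.70, 7.96, 7.82)
)

# findInterval needs the second vector sorted
sortweath <- sortweath[order(sortweath$datetime), ]

tl <- as.numeric(sortloc$datetime)    # seconds since the epoch
tw <- as.numeric(sortweath$datetime)

id     <- findInterval(tl, tw, all.inside = TRUE)
id_min <- ifelse(abs(tl - tw[id]) < abs(tl - tw[id + 1]), id, id + 1)

sortloc$wndsp      <- sortweath$wndsp[id_min]
sortloc$weath_time <- sortweath$datetime[id_min]  # matched weather time, for inspection
print(sortloc)
```

Because every location is matched independently, the first two locations both pick up the 17:41:00 weather record, which is the behaviour the loop in the question could not produce.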

One addition: you should never, absolutely never, add values to a data.frame in a for-loop. Check this comparison:

    N=1000
    x <- numeric(N)
    X <- data.frame(x=x)
    require(rbenchmark)
    benchmark(
        vector = {for (i in 1:N) x[i]<-1},
        data.frame = {for (i in 1:N) X$x[i]<-1}
    )
    #         test replications elapsed relative
    # 2 data.frame          100    4.32    22.74
    # 1     vector          100    0.19     1.00

The data.frame version is over 20 times slower, and the more rows it contains, the bigger the difference.

So if you change your script and first initialize the result vectors:

    tmp_wndsp <- tmp_wnddir <- tmp_hgt <- rep(NA, nrow(sortloc))

then update the values in the loop:

    tmp_wndsp[i] <- sortweath$wndsp[weathrow+1] # and so on...

and at the end (outside the loop) update the proper columns:

    sortloc$wndsp <- tmp_wndsp
    sortloc$wnddir <- tmp_wnddir
    sortloc$hgt <- tmp_hgt

It should run faster.
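Put together, the three steps above might look like the following sketch. The small data frames and the match_row vector are hypothetical stand-ins so that the example runs on its own; in the real script the row index would come from the time-matching logic.

```r
# Hypothetical stand-in data (not the full data from the question)
sortweath <- data.frame(
  wndsp  = c(8.17, 6.70, 7.96),
  wnddir = c(102.86, 106.00, 104.29),
  hgt    = c(3462.43, 17661.00, 6541.00)
)
sortloc <- data.frame(x = c(1, 3, 2, 5), y = c(2, 2, 3, 6))

# Hypothetical: the weather row already chosen for each location
match_row <- c(1, 1, 3, 2)

# 1. initialize plain result vectors once, outside the loop
n <- nrow(sortloc)
tmp_wndsp <- tmp_wnddir <- tmp_hgt <- rep(NA_real_, n)

# 2. fill the vectors inside the loop (cheap element assignment)
for (i in seq_len(n)) {
  w <- match_row[i]
  tmp_wndsp[i]  <- sortweath$wndsp[w]
  tmp_wnddir[i] <- sortweath$wnddir[w]
  tmp_hgt[i]    <- sortweath$hgt[w]
}

# 3. write each column into the data.frame once, after the loop
sortloc$wndsp  <- tmp_wndsp
sortloc$wnddir <- tmp_wnddir
sortloc$hgt    <- tmp_hgt
print(sortloc)
```

The data.frame is modified only three times in total, instead of three times per iteration.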
