date - Joining data based on a moving time window in R
I have weather data recorded every hour, and location data (x, y) recorded every 4 hours. I want to know the temperature at location x, y. The weather data isn't recorded at the same times as the locations. So I have written a loop that, for every location, scans through the weather data looking for the "closest" date/time and extracts the data for that time.

The problem is that, the way I've written it, when it reaches location #2 it scans through the weather data but does not allow the closest time that was already assigned to location #1 to be assigned again. Say locations #1 and #2 are taken within 10 minutes of each other, at 6pm and 6:10pm, and the closest weather time is 6pm. I can't get it to allow the weather data at 6pm as an option for both.

I set it up this way because 200 locations into the location data set (say 3 months into it), I do not want it starting at time 0 in the weather data when I know the closest weather data was just calculated for the last location and happens to be 3 months into that data set too. Below is some sample data and my code. I don't know if this makes sense.
<h6>Location data</h6>
<pre>
x y datetime
1 2 4/2/2003 18:01:01
3 2 4/4/2003 17:01:33
2 3 4/6/2003 16:03:07
5 6 4/8/2003 15:03:08
3 7 4/10/2003 14:03:06
4 5 4/2/2003 13:02:00
4 5 4/4/2003 12:14:43
4 3 4/6/2003 11:00:56
3 5 4/8/2003 10:02:06
2 4 4/10/2003 9:02:19
</pre>
<h6>Weather data</h6>
<pre>
datetime           wndsp wnddir hgt
4/2/2003 17:41:00  8.17  102.86 3462.43
4/2/2003 20:00:00  6.70  106.00 17661.00
4/2/2003 10:41:00  6.18  106.00 22000.00
4/2/2003 11:41:00  5.78  106.00 22000.00
4/2/2003 12:41:00  5.48  104.00 22000.00
4/4/2003 17:53:00  7.96  104.29 6541.00
4/4/2003 20:53:00  6.60  106.00 22000.00
4/4/2003 19:41:00  7.82  105.00 7555.00
4/4/2003 7:41:00   6.62  105.00 14767.50
4/4/2003 8:41:00   6.70  106.00 17661.00
4/4/2003 9:41:00   6.60  106.00 22000.00
4/5/2003 20:41:00  7.38  106.67 11156.67
4/6/2003 18:07:00  7.82  105.00 7555.00
4/6/2003 21:53:00  6.18  106.00 22000.00
4/6/2003 21:41:00  6.62  105.00 14767.50
4/6/2003 4:41:00   7.96  104.29 6541.00
4/6/2003 5:41:00   7.82  105.00 7555.00
4/6/2003 6:41:00   7.38  106.67 11156.67
4/8/2003 18:53:00  7.38  106.67 11156.67
4/8/2003 22:53:00  5.78  106.00 22000.00
4/8/2003 1:41:00   5.78  106.00 22000.00
4/8/2003 2:41:00   5.48  104.00 22000.00
4/8/2003 3:41:00   8.17  102.86 3462.43
4/10/2003 19:53:00 6.62  105.00 14767.50
4/10/2003 23:53:00 5.48  104.00 22000.00
4/10/2003 22:41:00 6.70  106.00 17661.00
4/10/2003 23:41:00 6.60  106.00 22000.00
4/10/2003 0:41:00  6.18  106.00 22000.00
4/11/2003 17:41:00 8.17  102.86 3462.43
4/12/2003 18:41:00 7.96  104.29 6541.0
</pre>
    weathrow = 1
    for (i in 1:nrow(sortloc)) {
      t = 0
      while (t < 1) {
        timedif1 = difftime(sortloc$datetime[i], sortweath$datetime[weathrow], units="auto")
        timedif2 = difftime(sortloc$datetime[i], sortweath$datetime[weathrow+1], units="auto")
        if (timedif2 < 0) {
          if (abs(timedif1) < abs(timedif2)) {
            sortloc$wndsp[i]=sortweath$wndsp[weathrow]
            sortloc$wnddir[i]=sortweath$wnddir[weathrow]
            sortloc$hgt[i]=sortweath$hgt[weathrow]
          } else {
            sortloc$wndsp[i]=sortweath$wndsp[weathrow+1]
            sortloc$wnddir[i]=sortweath$wnddir[weathrow+1]
            sortloc$hgt[i]=sortweath$hgt[weathrow+1]
          }
          t = 1
        }
        if (abs(sortloc$datetime[i] - sortloc$datetime[i+1] < 50)) {
          weathrow=weathrow
        } else {
          weathrow = weathrow+1
          #if(weathrow = nrow(sortweath)){t=1}
        }
      } #end while
    }
You could use the findInterval function to find the nearest value:
    # example data:
    x <- rnorm(120000)
    y <- rnorm(71000)
    y <- sort(y) # second vector must be sorted
    id <- findInterval(x, y, all.inside=TRUE) # finds position of last y smaller than x
    id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1) # find nearest

In your case as.numeric might be needed.
    # assumed that sortweath is sorted, if not:
    sortweath <- sortweath[order(sortweath$datetime),]
    x <- as.numeric(sortloc$datetime)
    y <- as.numeric(sortweath$datetime)
    id <- findInterval(x, y, all.inside=TRUE)
    id_min <- ifelse(abs(x-y[id])<abs(x-y[id+1]), id, id+1)
    sortloc$wndsp <- sortweath$wndsp[id_min]
    sortloc$wnddir <- sortweath$wnddir[id_min]
    sortloc$hgt <- sortweath$hgt[id_min]

One addition: you should never, absolutely never, add values to a data.frame in a for-loop. Check this comparison:
    N <- 1000
    x <- numeric(N)          # plain numeric vector
    X <- data.frame(x = x)   # data.frame holding the same column
    require(rbenchmark)
    benchmark(
      vector = {for (i in 1:N) x[i] <- 1},
      data.frame = {for (i in 1:N) X$x[i] <- 1}
    )
    #         test replications elapsed relative
    # 2 data.frame          100    4.32    22.74
    # 1     vector          100    0.19     1.00

The data.frame version is over 20 times slower, and the difference gets bigger as the number of rows grows.
So if you change your script and first initialize the result vectors:
    tmp_wndsp <- tmp_wnddir <- tmp_hgt <- rep(NA, nrow(sortloc))

then update the values in the loop:
    tmp_wndsp[i] <- sortweath$wndsp[weathrow+1]
    # and so on...

and at the end (outside the loop) update the proper columns:
    sortloc$wndsp <- tmp_wndsp
    sortloc$wnddir <- tmp_wnddir
    sortloc$hgt <- tmp_hgt

It should run faster.
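To see how the findInterval approach behaves on the case from the question (two locations a few minutes apart that should both pick up the same weather record), here is a minimal sketch on a small subset of the sample data. The helper name `nearest_row`, the `loc`/`weath` data frames, the UTC time zone, and the extra 18:11:01 location are assumptions made for illustration; they are not part of the original data or code.

```r
# Subset of the sample location data, plus one hypothetical row (18:11:01)
# taken ten minutes after the first, to mimic the 6pm / 6:10pm case.
loc <- data.frame(
  x = c(1, 1, 3, 2, 4),
  y = c(2, 2, 2, 3, 5),
  datetime = c("4/2/2003 18:01:01", "4/2/2003 18:11:01", "4/4/2003 17:01:33",
               "4/6/2003 16:03:07", "4/2/2003 13:02:00"),
  stringsAsFactors = FALSE
)

# Subset of the sample weather data (deliberately unsorted, as in the question).
weath <- data.frame(
  datetime = c("4/2/2003 17:41:00", "4/2/2003 20:00:00", "4/2/2003 12:41:00",
               "4/4/2003 17:53:00", "4/6/2003 18:07:00"),
  wndsp = c(8.17, 6.70, 5.48, 7.96, 7.82),
  stringsAsFactors = FALSE
)

# Parse month/day/year timestamps; the time zone is an assumption.
fmt <- "%m/%d/%Y %H:%M:%S"
loc$datetime   <- as.POSIXct(loc$datetime,   format = fmt, tz = "UTC")
weath$datetime <- as.POSIXct(weath$datetime, format = fmt, tz = "UTC")

# findInterval needs the reference vector sorted in increasing order.
weath <- weath[order(weath$datetime), ]

# Hypothetical helper: index of the element of sorted y nearest to each x.
# y must have at least two elements so that y[id + 1] always exists.
nearest_row <- function(x, y) {
  id <- findInterval(x, y, all.inside = TRUE)  # 1 <= id <= length(y) - 1
  ifelse(abs(x - y[id]) < abs(x - y[id + 1]), id, id + 1)
}

idx <- nearest_row(as.numeric(loc$datetime), as.numeric(weath$datetime))
loc$wndsp <- weath$wndsp[idx]
print(loc)
# The first two locations both map to the 17:41:00 weather record.
```

Because every location is matched independently, nothing prevents a weather row from being reused, and the location data does not need to be sorted. Only the weather times have to be in increasing order.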