I have a data frame that looks like this:
<pre class="prettyprint"><code>Name   Start_Date   End_Date
A      2015-01-01   2019-12-29
A      2017-03-25   NA
A      2019-10-17   NA
A      2012-04-16   2015-01-09
A      2002-06-01   2006-02-01
A      2005-12-24   NA
B      2018-01-23   NA
</code></pre>
I want to create a column such that, if two observations have the same <code>Name</code>, and one's <code>Start_Date</code> is within ±1 year of the other observation's <code>End_Date</code>, they are classified as being in the same group.

Desired output:
<pre class="prettyprint"><code>Name   Start_Date   End_Date    Wanted
A      2015-01-01   2019-12-29  1
A      2017-03-25   NA          NA
A      2019-10-17   NA          1
A      2012-04-16   2015-01-09  1
A      2002-06-01   2006-02-01  2
A      2005-12-24   NA          2
B      2018-01-23   NA          NA
</code></pre>
I am looking for a solution with data.table, but anything that solves my problem would be enough.

Added: Row-by-row explanation

Row:
<ol>
<li>Start date is 8 days (< 1 year) before end date for row 4. It is in the same group as row 4.</li>
<li>Start date is 2+ years before row 1's end date, so it is not in the same group as row 1. It is also 2+ years after the end dates of rows 4 and 5, so it is not in the same group as those two either.</li>
<li>Start date is about 2 months (< 1 year) before end date for row 1. It is in the same group as row 1.</li>
<li>See row 1.</li>
<li>See below.</li>
<li>Start date is 39 days (< 1 year) before end date for row 5. It is in the same group as row 5.</li>
<li>No other row with name B to compare to. It has no group.</li>
</ol>
Therefore, rows <code>1</code>, <code>3</code> and <code>4</code> are in the same group. Rows <code>5</code> and <code>6</code> are in the same group. Rows <code>2</code> and <code>7</code> do not have groups.

EDIT: I have updated my code to have a consistent <code>Wanted</code> category when an observation does not get matched with another.

<h1 id="approach-ri52">Approach</h1>
Here's a solution with <code>data.table</code>, as preferred:
<blockquote> I would prefer a solution with data.table but any solutions at all are much appreciated! </blockquote>
While <code>dplyr</code> and <code>fuzzyjoin</code> might appear more elegant, they might also prove less efficient with sufficiently large datasets.

Credit goes to ThomasIsCoding for beating me to the punch on this other question, with an answer that harnesses <code>igraph</code> to index networks in graphs. Here, the networks are the separate "chains" (<code>Wanted</code> groups) comprised of "links" (<code>data.frame</code> rows), which are joined by their "closeness" (between their <code>Start_Date</code>s and <code>End_Date</code>s). Such an approach seemed necessary to model the transitive relationship ℛ requested here
<blockquote> I am trying to create the chain of "close" links so that I can map A's movements over time. </blockquote>
with care to also preserve the symmetry of ℛ (see Further Reading).

Per that same request
<blockquote> So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa. </blockquote>
and your further clarification
<blockquote> ...I would want another column that indicates that [flag]. </blockquote>
I have also included a <code>Flag</code> column, to flag each row whose <code>Start_Date</code> is matched by the <code>End_Date</code>s of at least <code>flag_at</code> other rows; or vice versa.
<hr>
<h1 id="solution-zd2s">Solution</h1>
Using your sample <code>data.frame</code>, reproduced here as <code>my_data_frame</code>
<pre class="prettyprint"><code># Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")
</code></pre>
we apply <code>data.table</code> and <code>igraph</code> (among other packages) as follows:
<pre class="prettyprint"><code>library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)

# ...
# Code to generate your data.frame 'my_data_frame'.
# ...

# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()

# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)

# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2


#####################################
# BEGIN: Start Indexing the Groups. #
#####################################

# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%

  ########################################################
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  ########################################################

  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%

  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%

  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    #   NOTE: This keeps the graph clean by trimming extraneous loops, and it
    #   prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]


#############################
# PAUSE: Count the Matches. #
#############################

# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%

  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%

  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%

  # Extract the count column.
  .$Matches

# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%

  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%

  # Extract the count column.
  .$Matches


#########################################
# RESUME: Continue Indexing the Groups. #
#########################################

# Resume indexing:
output <- output %>%

  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%

  ###############################################################
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  ###############################################################

  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%

  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%

  # Pair each ID with its group index.
  igraph::membership() %>%

  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%

  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%

  #####################################################
  # STEP 3: Match the Original Rows to their "Chains" #
  #####################################################

  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%

  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]


# View results.
output
</code></pre>
<h1 id="result-2wtu">Result</h1>
The resulting <code>output</code> is the following <code>data.table</code>:
<pre class="prettyprint"><code>   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
</code></pre>
Keep in mind that the <code>Flag</code>s are all <code>FALSE</code> simply because your data lacks any <code>Start_Date</code> matched by (at least) two <code>End_Date</code>s, and likewise lacks any <code>End_Date</code> matched by (at least) two <code>Start_Date</code>s.
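As a rough cross-check of those match counts, here is a minimal brute-force sketch that is independent of the <code>data.table</code> workflow. The names <code>df</code>, <code>close_to</code>, <code>start_matched</code> and <code>end_matched</code> are hypothetical and exist only for this illustration; the sketch assumes the same 1-year <code>lubridate</code> tolerance and simply compares every pair of rows.

```r
library(lubridate)

# Sample data, copied from the question (hypothetical name 'df').
df <- data.frame(
  Name       = c("A", "A", "A", "A", "A", "A", "B"),
  Start_Date = as.Date(c("2015-01-01", "2017-03-25", "2019-10-17", "2012-04-16",
                         "2002-06-01", "2005-12-24", "2018-01-23")),
  End_Date   = as.Date(c("2019-12-29", NA, NA, "2015-01-09", "2006-02-01", NA, NA))
)

tolerance <- years(1)
n <- nrow(df)

# close_to[x, y] is TRUE when the 'Start_Date' of row x lies within the
# tolerance of the 'End_Date' of row y (same 'Name', and x is not y).
close_to <- matrix(FALSE, nrow = n, ncol = n)
for (x in seq_len(n)) {
  for (y in seq_len(n)) {
    # Skip self-pairs, different names, and rows y without an 'End_Date'.
    if (x == y || df$Name[x] != df$Name[y] || is.na(df$End_Date[y])) next
    close_to[x, y] <- df$Start_Date[x] >= df$End_Date[y] - tolerance &&
                      df$Start_Date[x] <= df$End_Date[y] + tolerance
  }
}

# Per row: how many 'End_Date's match its 'Start_Date', and vice versa.
start_matched <- rowSums(close_to)
end_matched   <- colSums(close_to)

# No row reaches 2 matches in either direction.
max(start_matched, end_matched)
```

Because no count reaches 2, a threshold of <code>flag_at = 2</code> flags nothing in this sample, which agrees with the table above. The double loop is quadratic in the number of rows, so it is only meant for verifying small samples, not as a replacement for the non-equi join.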
Hypothetically, if we lowered <code>flag_at</code> to <code>1</code>, then the <code>output</code> would <code>Flag</code> every row with even a single match (in either direction):
<pre class="prettyprint"><code>   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1  TRUE
4:    A 2012-04-16 2015-01-09      1  TRUE
5:    A 2002-06-01 2006-02-01      2  TRUE
6:    A 2005-12-24       <NA>      2  TRUE
7:    B 2018-01-23       <NA>     NA FALSE
</code></pre>
<hr>
<h1 id="warning-3z91">Warning</h1>
Because some <code>data.table</code> operations modify by reference (or "in-place"), the value of <code>my_data_table</code> changes throughout the workflow. After Step 1, <code>my_data_table</code> becomes
<pre class="prettyprint"><code>   Name Start_Date   End_Date ID    End_Low   End_High
1:    A 2015-01-01 2019-12-29  1 2018-12-29 2020-12-29
2:    A 2017-03-25       <NA>  2       <NA>       <NA>
3:    A 2019-10-17       <NA>  3       <NA>       <NA>
4:    A 2012-04-16 2015-01-09  4 2014-01-09 2016-01-09
5:    A 2002-06-01 2006-02-01  5 2005-02-01 2007-02-01
6:    A 2005-12-24       <NA>  6       <NA>       <NA>
7:    B 2018-01-23       <NA>  7       <NA>       <NA>
</code></pre>
a structural departure from the <code>my_data_frame</code> it initially copied.

Since <code>dplyr</code> (among other packages) assigns by value rather than by reference, a <code>dplyr</code> solution would sidestep this issue entirely.

As it is, however, you must take care when modifying the workflow, because the version of <code>my_data_table</code> available before Step 1 cannot be recovered afterwards.
<h1 id="further-reading-ubkf">Further Reading</h1>
Although the <code>JOIN</code>ing of <code>data.table</code>s is explicitly directional, with a "right" side and a "left" side, this model manages to preserve the relational symmetry you described here
<blockquote> if...[either] one's 'Start_Date' is +- 1 year within the other observation's 'End_Date', they are classified as being in the same group. </blockquote>
via the use of an undirected graph.
When the <code>JOIN</code> relates the 1st row 𝑥 (having a <code>Start_Date</code> of <code>2015-01-01</code>) to the 4th row 𝑦 (having an <code>End_Date</code> of <code>2015-01-09</code>), we gather that the <code>Start_Date</code> of 𝑥 is "sufficiently close" to (within 1 year of) the <code>End_Date</code> of 𝑦. So we say mathematically that 𝑥 ℛ 𝑦, or
<blockquote> 𝑥 "is in the same group as" 𝑦. </blockquote>
However, the converse 𝑦 ℛ 𝑥 will not necessarily appear in the <code>JOIN</code>ed data, because the <code>Start_Date</code> of 𝑦 might not land so conveniently near the <code>End_Date</code> of 𝑥. That is, the <code>JOIN</code>ed data will not necessarily indicate that
<blockquote> 𝑦 "is in the same group as" 𝑥. </blockquote>
In the latter case, a strictly directed graph ("digraph") would not capture the common membership of 𝑥 and 𝑦 in the same group. You can observe this jarring difference by setting <code>directed = TRUE</code> in the first line of Step 2
<pre class="prettyprint"><code>  igraph::graph_from_data_frame(directed = TRUE) %>%
</code></pre>
and also setting <code>mode = "strong"</code> in the very next line
<pre class="prettyprint"><code>  igraph::components(mode = "strong") %>%
</code></pre>
to yield these disassociated results:
<pre class="prettyprint"><code>   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      4 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      3 FALSE
4:    A 2012-04-16 2015-01-09      5 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      1 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
</code></pre>
By contrast, the rows can be properly grouped via the use of an undirected graph (<code>directed = FALSE</code>); or via more lenient criteria (<code>mode = "weak"</code>). Either of these approaches will effectively simulate the presence of 𝑦 ℛ 𝑥 whenever 𝑥 ℛ 𝑦 is present in the <code>JOIN</code>ed data.
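The contrast between strong and weak components can be sketched in isolation. The snippet below is only an illustration under stated assumptions: the hypothetical <code>edges</code> table hard-codes the three pairings that the sample data produces (row 1 to row 4, row 3 to row 1, row 6 to row 5) instead of deriving them from the <code>JOIN</code>.

```r
library(igraph)

# Hypothetical edge list: the three "x R y" pairings from the sample data,
# written as from = x (the row whose 'Start_Date' matched) and to = y.
edges <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))

# Build a directed graph; vertex names are the row IDs as strings.
g <- graph_from_data_frame(edges, directed = TRUE)

# Strong components require a path in BOTH directions between members.
strong <- components(g, mode = "strong")

# Weak components ignore edge direction, like an undirected graph.
weak <- components(g, mode = "weak")

strong$no          # number of strongly connected components
weak$no            # number of weakly connected components
weak$membership    # named vector: row ID -> group index
```

With no pairing reciprocated, every vertex ends up alone under <code>mode = "strong"</code> (five components), whereas <code>mode = "weak"</code> recovers the two chains: rows 1, 3 and 4 together, and rows 5 and 6 together. Rows 2 and 7 never appear because they have no pairings at all.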
This symmetric property is particularly important when modeling the behavior you describe here:
<blockquote> ...one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31)... </blockquote>
In this situation, you want the model to recognize that any two rows 𝑦 and 𝑧 must be in the same group (𝑦 ℛ 𝑧), whenever their <code>End_Date</code>s match the same <code>Start_Date</code> of some other row 𝑥: 𝑥 ℛ 𝑦 and 𝑥 ℛ 𝑧.

So suppose we know that 𝑥 ℛ 𝑦 and 𝑥 ℛ 𝑧. Because our model has preserved symmetry, we can say from 𝑥 ℛ 𝑦 that 𝑦 ℛ 𝑥 too. Since we now know that 𝑦 ℛ 𝑥 and 𝑥 ℛ 𝑧, transitivity implies that 𝑦 ℛ 𝑧. Thus, our model recognizes that 𝑦 ℛ 𝑧 whenever 𝑥 ℛ 𝑦 and 𝑥 ℛ 𝑧! Similar logic will suffice for "vice versa".

We can verify this outcome by using
<pre class="prettyprint"><code>my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))
</code></pre>
to append an 8th row to <code>my_data_frame</code>, prior to the workflow:
<pre class="prettyprint"><code>  Name Start_Date   End_Date
1    A 2015-01-01 2019-12-29
# ⋮    ⋮      ⋮          ⋮
4    A 2012-04-16 2015-01-09
# ⋮    ⋮      ⋮          ⋮
8    A 2010-01-01 2015-01-05
</code></pre>
This 8th row serves as our 𝑧, where 𝑥 is the 1st row and 𝑦 is the 4th row, as before. Indeed, the <code>output</code> properly classifies 𝑦 and 𝑧 as belonging to the same group <code>1</code>: 𝑦 ℛ 𝑧.
<pre class="prettyprint"><code>   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
8:    A 2010-01-01 2015-01-05      1 FALSE
</code></pre>
Likewise, the <code>output</code> properly <code>Flag</code>s the 1st row, whose <code>Start_Date</code> is now matched by two <code>End_Date</code>s: in the 4th and 8th rows.
<h2 id="cheers-j1h4">Cheers!</h2>

Create group based on fuzzy criteria


Tags:

r

data.table

igraph



asked Jul 12 '21 21:07

EconNoobie




Greg
