The \( \Phi^{-1}(1 - p_i/2) \) term in Equations (10), (13), and (14) can be computed using R.
This Python script is designed to transform and amalgamate the contents of an ICD-10 Excel spreadsheet with the ICD-10 information from a text file, outputting the combined data in a JSON format.
Per the International Classification of Diseases, Tenth Revision (ICD-10), "Other Hyperlipidemia" is classified under the code E784. The code E1329 is assigned to "Other Specified Diabetes Mellitus with Other Diabetic Kidney Complication." Additionally, J9699 refers to "Respiratory Failure, Unspecified, Type Unspecified."
In addition, the above Python script is designed to extract data from the file icd10cm_tabular_2024.xml, obtained from the official website of the Centers for Medicare & Medicaid Services (CMS). The primary function of this script is to efficiently process the extracted data for the creation of the file CMS_ICD10.js.
Topic | Details |
---|---|
Purpose | Interactive builder for JavaScript‑compatible regular expressions. • Live preview with match highlighting. • Token & flag palettes for one‑click insertion. • ChatGPT‑assisted pattern generation. Pure vanilla JS—no frameworks. |
File location | Place re.html anywhere and open in a modern browser (Chrome, Edge, Firefox, Safari). |
AI description → ChatGPT | “Describe the pattern” textarea + Generate Pattern with ChatGPT button. Sends prompt to OpenAI Chat Completion API, receives a raw pattern, inserts it into the Pattern field, and refreshes the preview. |
API key storage | First AI call prompts for an OpenAI key and caches it in localStorage . Key is never transmitted to any server except api.openai.com . |
Pattern field | Free‑text input; accepts any JavaScript regex source (no delimiters). Typing triggers live validation and preview. |
Flags field | Accepts g i m s u y . Palette buttons toggle single‑character flags. |
Token & Flag palettes | One‑click insertion of common tokens (e.g., \d , [ ] ) and flags. Caret position preserved. |
Sample text + preview | Paste or type any text; matches are wrapped in <span class="match"> with a yellow highlight. Works in real‑time. |
Copy utilities | • Copy Pattern — raw pattern. • Copy Escaped Pattern — doubles back‑slashes for embedding in string literals. |
Accordion references | Five collapsible sections inside the page: 1️⃣ Token reference • 2️⃣ Flag reference • 3️⃣ Pattern cheatsheet • 4️⃣ How to use • 5️⃣ Get an OpenAI API key. Each may be expanded independently. |
Full‑source reveal | Sixth accordion shows the entire HTML of the page, syntax‑highlighted via Highlight.js. |
Security note | OpenAI key lives only in the user’s browser storage; no external backend required. For production, proxy API calls to keep the key server‑side. |
Dependencies | Highlight.js (CDN) for code rendering. Everything else is native JavaScript and CSS. |
Namespace | All logic wrapped in a single IIFE; CSS scoped to local class names—safe to embed in any page. |
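As a rough sketch of the escaping behaviour described in the Copy utilities row (the helper names and clipboard call are illustrative, not the page's actual implementation):

// Hypothetical copy helpers (illustrative only).
const copyRaw = pattern => navigator.clipboard.writeText(pattern);
const copyEscaped = pattern =>
  navigator.clipboard.writeText(pattern.replace(/\\/g, '\\\\')); // double every backslash
// Example: \d{4}-\d{2}  ->  \\d{4}-\\d{2}  (safe inside a JavaScript string literal)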
Definition: brackets that match any one character from a specified set. Common examples include [a-z] for lowercase letters, \d for digits, and \w for “word” characters (letters + digits + underscore).
Symbol | Meaning | Typical example |
---|---|---|
[abc] | Any of a, b, c | gr[ae]y → “gray”, “grey” |
[^abc] | Any character except a, b, c | [^0-9] |
\d | Digit (0‑9) | \d{4} → four‑digit year |
\w | Word character | \w+ → identifier |
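A quick way to confirm the examples above is to run them in a browser console; the snippet below is illustrative only:

// Character classes in action (illustrative).
console.log('gray grey groy'.match(/gr[ae]y/g)); // ["gray", "grey"]
console.log('Room 42, floor 7'.match(/\d+/g));   // ["42", "7"]
console.log('foo_bar-baz'.match(/\w+/g));        // ["foo_bar", "baz"]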
Purpose: dictate how many consecutive times the preceding token may occur. The most frequent forms are * (0 or more), + (1 or more), ? (0 or 1), and {m,n} (explicit range). Greedy quantifiers match as much as possible; append ? to switch them to lazy mode (e.g. .*?).
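The difference between greedy and lazy matching is easiest to see on markup-like input; a short illustrative snippet:

// Greedy vs. lazy quantifiers (illustrative).
const html = '<b>bold</b> and <i>italic</i>';
console.log(html.match(/<.+>/)[0]);     // "<b>bold</b> and <i>italic</i>" (greedy: as much as possible)
console.log(html.match(/<.+?>/)[0]);    // "<b>" (lazy: as little as possible)
console.log('aaab'.match(/a{1,2}/)[0]); // "aa" (explicit range, still greedy)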
Anchors pin patterns to positions rather than characters. ^ aligns with the start of a string (or line in multiline mode), while $ attaches to the end. Word boundaries (\b) are invaluable in token extraction, ensuring partial words remain untouched.
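For example (illustrative snippet):

// Anchors and word boundaries (illustrative).
console.log(/^Err/.test('Error: disk full'));        // true  (start of string)
console.log(/full$/.test('Error: disk full'));       // true  (end of string)
console.log('cat catalog concat'.match(/\bcat\b/g)); // ["cat"] (whole word only)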
Parentheses gather sub‑patterns into logical units and capture their content. The vertical bar (|) creates alternatives. Non‑capturing groups (?:…) improve performance when captures are unnecessary.
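A short illustrative snippet:

// Groups and alternation (illustrative).
const m = '2025-05-12'.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(m[1], m[2], m[3]);             // "2025" "05" "12" (captured groups)
console.log(/(?:cat|dog)s?/.test('dogs')); // true (non-capturing alternation)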
Provide an in‑browser utility that assists users in composing and testing regular expressions in real time, accompanied by contextual guidance.
Supported flags: g, i, m, s, u, and y.

<div class="regex-builder">
<header>Compose a regular expression</header>
<section class="controls">
<input id="pattern" placeholder="Enter pattern…" />
<input id="flags" placeholder="Flags (e.g. gim)" />
<div id="palette"><!-- buttons injected by JS --></div>
</section>
<section class="sample">
<textarea id="sampleText">Paste or type sample text here…</textarea>
<pre id="preview"></pre>
</section>
<footer>
<button id="copyRaw">Copy /<button>
<button id="copyEscaped">Copy escaped</button>
</footer>
</div>
.regex-builder{
font-family:system-ui, sans-serif;
line-height:1.5;
max-width:48rem;
margin:auto;
border:1px solid #e5e7eb;
padding:1.5rem;
border-radius:0.75rem;
}
.controls input{
width:100%;
padding:0.5rem 0.75rem;
margin-bottom:0.5rem;
border:1px solid #d1d5db;
border-radius:0.5rem;
font-size:1rem;
}
#palette button{
margin:0.25rem;
padding:0.4rem 0.75rem;
border:1px solid #cbd5e1;
border-radius:0.5rem;
background:#f8fafc;
cursor:pointer;
}
.sample{
margin-top:1rem;
}
#sampleText{
width:100%;
min-height:8rem;
padding:0.75rem;
border:1px solid #d1d5db;
border-radius:0.5rem;
}
#preview .match{
background:#fffbcc;
border-bottom:2px solid #facc15;
}
// Helper: escape for display
const escapeHtml = str => str.replace(/[&<>"']/g, m => ({
'&':'&amp;', '<':'&lt;', '>':'&gt;', '"':'&quot;', "'":'&#39;'
})[m]);
const patternEl = document.getElementById('pattern');
const flagsEl = document.getElementById('flags');
const sampleEl = document.getElementById('sampleText');
const previewEl = document.getElementById('preview');
function render(){
let regex;
try{
regex = new RegExp(patternEl.value, flagsEl.value);
}catch(e){
previewEl.innerHTML = '<em>Invalid pattern</em>';
return;
}
const raw = sampleEl.value;
const highlighted = raw.replace(regex, m =>
'<span class="match">' + escapeHtml(m) + '</span>');
// Inject the highlighted markup only when the sample contains no HTML-special
// characters; otherwise fall back to the escaped plain text without highlighting.
previewEl.innerHTML = escapeHtml(raw) === raw
? highlighted
: escapeHtml(raw);
}
['input','keyup','change'].forEach(evt => {
patternEl.addEventListener(evt, render);
flagsEl.addEventListener(evt, render);
sampleEl.addEventListener(evt, render);
});
render();
The OpenAI API key entered on first use is cached in localStorage to streamline repetitive testing.

Written on May 12, 2025
Pattern | Matches… |
---|---|
Dates & Time | |
\b\d{4}-\d{2}-\d{2}\b | ISO date (YYYY‑MM‑DD) |
\b\d{4}년\s?\d{1,2}월\s?\d{1,2}일\b | Korean date (YYYY년 MM월 DD일) |
\b\d{1,2}/\d{1,2}/\d{4}\b | US date (MM/DD/YYYY) |
\b\d{4}/\d{2}/\d{2}\b | Slash date (YYYY/MM/DD) |
\b\d{2}:\d{2}\b | 24‑h time (HH:MM) |
\b\d{2}:\d{2}:\d{2}\b | 24‑h time with seconds |
\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?\b | ISO datetime (Z optional) |
\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b | English month abbreviations |
\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s\d{2}\s[A-Z][a-z]{2}\s\d{4}\b | RFC 2822 date header |
Numbers & Units | |
\b\d+\b | Positive integer |
\b[+-]?\d+\b | Signed integer |
\b\d+\.\d+\b | Decimal number |
\b\d{1,3}(?:,\d{3})+\b | Thousands‑separated integer (1,234,567) |
\$\d+(?:\.\d{2})?\b | US currency (optional cents) |
\b100(?:\.0+)?%|\b\d{1,2}(?:\.\d+)?% | Percentage 0–100 % |
Text & Strings | |
"[^"\\]*(?:\\.[^"\\]*)*" | Double‑quoted string (escapes OK) |
'[^'\\]*(?:\\.[^'\\]*)*' | Single‑quoted string (escapes OK) |
`[^`\\]*(?:\\.[^`\\]*)*` | Template literal (no ${…}) |
\b[A-Z][a-z]+\b | Capitalised word |
\b[가-힣]+\b | Korean Hangul word |
Networking & Web | |
\bhttps?:\/\/\S+\b | HTTP/HTTPS URL |
\bftp:\/\/\S+\b | FTP URL |
\b(?:\d{1,3}\.){3}\d{1,3}\b | IPv4 address |
\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b | IPv6 address (full form) |
\b[a-z0-9-]+\.[a-z]{2,}\b | Domain name |
[\w.+-]+@[\w.-]+\.[A-Za-z]{2,} | E‑mail address |
\b010-\d{4}-\d{4}\b | Korean mobile phone (010-1234-5678) |
\/(?:[^\/\0]+\/)*[^\/\0]+\.[\w\-]+ | POSIX file path with extension |
Identifiers | |
\b[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b | UUID v1–v5 |
\b(?:[0-9A-F]{2}:){5}[0-9A-F]{2}\b | MAC address |
\b(?:\d{4}[\s-]?){4}\b | 16‑digit credit‑card number |
\b97[89]-\d-\d{2,5}-\d{2,7}-\d\b | ISBN‑13 (hyphenated) |
Markup & Code | |
<\/?([A-Za-z][A-Za-z0-9]*)\b[^>]*> | HTML tag |
^#{1,6}\s.+$ | Markdown header (# Title) |
"[A-Za-z0-9_]+"(?=\s*:) | JSON key (double‑quoted) |
Whitespace & Misc | |
^\s+|\s+$ | Trim leading/trailing whitespace (use g) |
\s{2,} | Consecutive spaces (≥2) |
\t+ | Tabs |
^\s*$ | Blank line |
^\S.*$ | Non‑blank line |
A deterministic finite automaton (DFA) is a simple computational model used to recognize patterns in input strings. A DFA consists of a finite set of states, an input alphabet, a transition function mapping each state and input symbol to a next state, a designated start state, and one or more accepting (final) states. As the automaton reads an input string symbol by symbol, it moves between states according to the transition function. If the automaton ends in an accepting state after processing the entire input, the string is considered accepted (i.e., part of the language recognized by the DFA). DFAs are a foundational model in formal language theory for defining the class of regular languages.
A regular expression is a symbolic notation for describing sets of strings (languages). Regular expressions use operators such as union (often represented by '|'), concatenation, and Kleene star (denoted '*') to build complex patterns. For example, the expression (a|b)* denotes the set of all strings composed of zero or more occurrences of 'a' or 'b'. Regular expressions correspond to regular languages because every language described by a regular expression can be recognized by some DFA. Regular expressions are commonly used in programming and text-processing tools for pattern matching. In formal language theory, they describe exactly the same class of languages as DFAs.
A fundamental theorem in formal language theory states that DFAs and regular expressions are equivalent in power: they define the same class of languages, known as the regular languages. This result is sometimes referred to as Kleene's Theorem. It implies that for every regular expression there is an equivalent DFA that accepts the same language, and vice versa. Converting a regular expression into a DFA often involves first creating a nondeterministic finite automaton (NFA) using standard constructions, and then applying the subset construction to obtain a DFA. Conversely, one can derive a regular expression that represents the language of a given DFA (though the resulting expression may be complex). These concepts demonstrate that DFAs and regular expressions are two equivalent ways to describe regular languages, each offering a different perspective on pattern recognition in computation theory.
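To make the definition concrete, here is a minimal DFA sketch in JavaScript (the representation is illustrative, not taken from any library): it accepts binary strings containing an even number of 1s, the same language described by the regular expression (0*10*1)*0*.

// Minimal DFA: accepts binary strings with an even number of 1s (illustrative).
const dfa = {
  start: 'even',
  accept: new Set(['even']),
  delta: {                       // transition function δ(state, symbol)
    even: { '0': 'even', '1': 'odd' },
    odd:  { '0': 'odd',  '1': 'even' }
  }
};

function accepts(dfa, input) {
  let state = dfa.start;
  for (const ch of input) {
    state = dfa.delta[state][ch];          // follow one transition per symbol
    if (state === undefined) return false; // symbol outside the alphabet
  }
  return dfa.accept.has(state);            // accept iff we end in a final state
}

console.log(accepts(dfa, '1101')); // false (three 1s)
console.log(accepts(dfa, '1001')); // true  (two 1s)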
The table below highlights key aspects of DFAs and regular expressions:
Aspect | DFA | Regular Expression |
---|---|---|
Definition | A state-based automaton defined by states and transitions. | A symbolic formula using union, concatenation, and Kleene star. |
Representation | Graphical or tabular transition structure. | Algebraic pattern syntax. |
Recognized Language | Regular languages (exactly). | Regular languages (exactly). |
Conversion | Can be constructed from a regular expression (via NFA). | Can be converted to a DFA (via NFA). |
Typical Use | Formal language modeling, compilers (lexical analysis). | Text searching and validation in software tools. |
At the graduate level, DFAs and regular expressions are examined with greater mathematical formality. A DFA is formally defined as a 5-tuple (Q, Σ, δ, q₀, F), where Q is a finite set of states, Σ is the input alphabet, δ: Q × Σ → Q is the transition function, q₀ ∈ Q is the initial state, and F ⊆ Q is the set of accepting states. A regular expression can be defined by a recursive grammar: the empty string ϵ and each symbol in Σ are regular expressions, and if R and S are regular expressions, then (R|S), (RS), and (R*) are also regular expressions. Graduate students study important properties of regular languages, including closure under union, concatenation, and Kleene star, as well as intersection and complement. The Myhill-Nerode theorem provides a characterization of regular languages in terms of distinguishable strings, and the pumping lemma is a fundamental result for proving that certain languages are not regular.
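As a standard textbook illustration of the pumping lemma (not specific to this article): the language L = {aⁿbⁿ : n ≥ 0} is not regular. If it were, with pumping length p, the string aᵖbᵖ could be written as xyz with |xy| ≤ p and |y| ≥ 1, so y = aᵏ for some k ≥ 1; pumping once gives xy²z = aᵖ⁺ᵏbᵖ, which is not in L, a contradiction.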
Graduate research often involves explicit constructions to demonstrate the equivalence of DFAs and regular expressions. A standard method is Thompson's construction, which builds an NFA from a given regular expression, followed by the subset construction to convert that NFA into an equivalent DFA. Conversely, one can eliminate states from a DFA or apply Arden's lemma to derive a regular expression that represents its language. These transformations illustrate the fundamental equivalence between the two formalisms. In terms of complexity, converting an NFA (or regex) to a DFA can cause an exponential increase in the number of states. DFA minimization can be applied to reduce states and yields a unique minimal automaton for each language. In contrast, simplification of regular expressions is computationally harder and no canonical minimal form exists. These considerations motivate research on the state complexity and descriptive complexity of regular languages.
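A minimal sketch of the subset construction in JavaScript, assuming the NFA is supplied as a plain transition table with '' denoting ε-moves (the representation and example NFA are illustrative):

// Subset construction: NFA -> DFA (illustrative sketch).
// NFA format: { alphabet, start, accept: [...], delta: { state: { symbol: [states], '': [states] } } }

function epsilonClosure(delta, states) {
  const closure = new Set(states);
  const stack = [...states];
  while (stack.length) {
    const s = stack.pop();
    for (const t of (delta[s] && delta[s]['']) || []) {
      if (!closure.has(t)) { closure.add(t); stack.push(t); }
    }
  }
  return closure;
}

function subsetConstruction(nfa) {
  const key = set => [...set].sort().join(',') || 'dead';   // label a set of NFA states
  const startSet = epsilonClosure(nfa.delta, [nfa.start]);
  const dfa = { start: key(startSet), accept: new Set(), delta: {} };
  const queue = [startSet];
  const seen = new Set([key(startSet)]);
  while (queue.length) {
    const current = queue.shift();
    const k = key(current);
    dfa.delta[k] = {};
    if ([...current].some(s => nfa.accept.includes(s))) dfa.accept.add(k);
    for (const a of nfa.alphabet) {
      const moved = [];
      for (const s of current) {
        for (const t of (nfa.delta[s] && nfa.delta[s][a]) || []) moved.push(t);
      }
      const next = epsilonClosure(nfa.delta, moved);
      const nk = key(next);
      dfa.delta[k][a] = nk;                                 // one total transition per symbol
      if (!seen.has(nk)) { seen.add(nk); queue.push(next); }
    }
  }
  return dfa;
}

// Example: NFA for strings over {a, b} ending in "ab".
const nfa = {
  alphabet: ['a', 'b'],
  start: 'q0',
  accept: ['q2'],
  delta: { q0: { a: ['q0', 'q1'], b: ['q0'] }, q1: { b: ['q2'] }, q2: {} }
};
console.log(subsetConstruction(nfa)); // DFA states "q0", "q0,q1", "q0,q2"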
Aspect | DFA | Regular Expression |
---|---|---|
Formalism | Graph-based model (states and transitions). | Algebraic syntax with operators (|, concatenation, *). |
Minimization | A unique minimal DFA exists for each regular language. | No unique minimal regex; minimization is PSPACE-hard. |
Closure Properties | Closed under union, intersection, complement, etc., via automaton constructions. | Closed under union, concatenation, star; intersection/complement require additional constructs. |
Conversion | From regex: via Thompson's NFA + subset construction. | From DFA: via state elimination or algebraic methods. |
Decision Problems | Equivalence, emptiness, membership are decidable (often efficiently). | Equivalence (of regex) is PSPACE-complete; membership is linear-time. |
Technical professionals often use regular expressions in practical applications such as text processing, data validation, and log analysis. Regular expressions are implemented in many programming languages and tools (for example, grep, sed, Perl, Python, and Java) to perform pattern matching. These implementations frequently include extended features beyond the basic formalism, such as capturing groups, backreferences (allowing the engine to match the same substring again), and lookahead and lookbehind assertions. While these features provide additional expressive power, they can also make the matching process more complex and, in some cases, allow matching of non-regular patterns. Professionals should understand that the core theory applies to the subset of regex features that remain within regular language constraints.
Understanding that core regular expressions correspond to finite automata provides insight into writing efficient patterns. For instance, any pure regular expression can, in principle, be transformed into a DFA, which means that matching can be performed in linear time relative to the input size. Some modern tools actually convert regex patterns into DFAs or similar state machines under the hood for fast matching. Other regex engines use backtracking algorithms that may exhibit exponential-time behavior on certain patterns. Patterns involving heavy backtracking (such as nested quantifiers) can lead to performance issues. It is advisable to design regexes to be as unambiguous as possible and to anchor patterns (using ^ and $) when appropriate to limit the search space.
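The following illustrative snippet shows the effect (exact timings depend on the engine; the nested pattern is deliberately pathological):

// Nested quantifiers force exponential backtracking on failing input (illustrative).
const risky = /^(a+)+b$/;        // (a+)+ can partition the a's in exponentially many ways
const safer = /^a+b$/;           // describes the same strings here, fails in linear time

const input = 'a'.repeat(22);    // no trailing 'b', so both patterns must fail

console.time('risky');
risky.test(input);               // noticeably slow; roughly doubles with every extra 'a'
console.timeEnd('risky');

console.time('safer');
safer.test(input);               // effectively instant
console.timeEnd('safer');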
The table below outlines differences between theoretical regular expressions and practical regex engines:
Feature | Theoretical Regex | Practical Regex Engine |
---|---|---|
Operators | Union (|), concatenation, Kleene star (*) only. | Includes these plus quantifiers (e.g. +, ?, {m,n}), wildcards (.), character classes, anchors (^, $), etc. |
Language | Exactly the class of regular languages. | Extended by backreferences and lookaround (can match some non-regular patterns). |
Matching Algorithm | Can be implemented via NFA/DFA construction (guaranteed linear time). | Often implemented with backtracking or hybrid algorithms (potential exponential time). |
Performance | Matching is guaranteed to be linear time in the input size. | Performance may degrade to exponential time for complex patterns if not carefully designed. |
Typical Use | Theoretical analysis and language processing. | Practical searching, text processing, and validation tasks. |
By grounding regular expression practice in automata theory, professionals gain a deeper understanding of why certain patterns are efficient and others are not. The equivalence of DFAs and regular expressions assures that any properly constructed pattern corresponds to some finite automaton. This theoretical foundation can guide the composition of complex regexes in a principled way, helping to ensure they remain efficient and correct when applied in software systems.
Written on May 12, 2025
Table of Contents
- Mathematical Foundation
- Code and Step-by-Step Explanation
- 0. Install and Load Required Packages
- 1. Define Output Directory
- 2. Data Import and Preparation
- 3. Exploratory Data Analysis (EDA)
- 4. Kaplan-Meier Plots Stratified by Site
- 5. Cox Proportional Hazards Model Including Site
- 6. Checking for Multicollinearity
- 7. Penalized Cox Regression (Lasso)
- 8. Fine-Gray Model for Competing Risks
- 9. Visualizing Survival Differences Across Sites
- 10. Additional Recommendations and Checks
- 11. End of Script
The Kaplan-Meier (KM) estimator is a non-parametric statistic used to estimate the survival function \( S(t) \). If there are \( n \) subjects, let \( t_{(1)}, t_{(2)}, \dots, t_{(k)} \) be the distinct event times in ascending order. Let \( d_j \) be the number of events (clearances) at time \( t_{(j)} \) and \( r_j \) be the number of subjects at risk just before \( t_{(j)} \). The KM estimator is:
\[ \hat{S}(t) = \prod_{t_{(j)} \le t} \left( 1 - \frac{d_j}{r_j} \right). \]
It allows comparing the time to clearance across different sites (e.g., stool vs. urine).
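As a small worked example with hypothetical numbers (not from the study data): suppose five subjects are followed, one clears at day 2, one is censored at day 3, and one clears at day 4, so three subjects are at risk just before day 4. Then
\[ \hat{S}(4) = \left(1 - \frac{1}{5}\right)\left(1 - \frac{1}{3}\right) = \frac{4}{5} \cdot \frac{2}{3} \approx 0.53, \]
i.e., the estimated probability that clearance has not yet occurred by day 4 is about 53%.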
The Cox Proportional Hazards model expresses the hazard function for individual \( i \) as:
\[ \lambda_i(t) = \lambda_0(t) \exp\left( \boldsymbol{\beta}^\top \mathbf{x}_i \right), \]
where \( \lambda_0(t) \) is the baseline hazard function, \( \boldsymbol{\beta} \) is the vector of regression coefficients, and \( \mathbf{x}_i \) is the vector of covariates for individual \( i \).
A hazard ratio between two covariate levels reflects the ratio of their hazards at any given time \( t \). If a coefficient \( \beta_j \) is positive, it suggests an increased rate of clearance for that factor level.
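For example, with a hypothetical coefficient \( \hat{\beta}_j = 0.41 \) for one site relative to the reference site, the hazard ratio is \( \exp(0.41) \approx 1.5 \): at any given time, clearance at that site occurs at roughly 1.5 times the rate of the reference site.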
When multiple competing events can occur (e.g., clearance is the event of interest, but death or loss to follow-up might preclude clearance), a competing risk model such as Fine and Gray is used:
\[ \text{Fine-Gray: } \quad \text{subdistribution hazard } = h_j(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ \epsilon = j \mid T \ge t \text{ or } (T < t \text{ and } \epsilon \neq j))}{\Delta t}, \]
where \( \epsilon \) indicates the type of event (\( j = \) clearance, other = competing events). This method accounts for the fact that once a competing event (e.g., death) happens, one can no longer clear the pathogen.
Step | Section | Purpose |
---|---|---|
0 | Install Packages | Ensures necessary packages are installed |
1 | Define Directory | Output directory for storing analysis results |
2 | Data Import/Prep | Loads and cleans the data |
3 | Exploratory Analysis | Summaries and missing data checks |
4 | Kaplan-Meier Analysis | Non-parametric survival estimates by site |
5 | Cox Model (Simple) | Estimates hazard ratios for clearance with Site only |
6 | Multicollinearity Check | Detects correlated predictors using VIF |
7 | Penalized Cox (Lasso) | Regularization to handle many or correlated predictors |
8 | Fine-Gray Model | Competing risks approach for clearance vs. death |
9 | Visualization | Visually compares survival curves by site |
10 | Recommendations/Checks | Influential points, outliers, and missing data |
11 | End of Script | Wrap-up, optional housekeeping |
Note: The code presented here is R-based and uses packages such as survival, survminer, cmprsk, dplyr, ggplot2, and glmnet. Adjust package calls as needed for your local environment.
# -------------------------------------------
# Survival Analysis on Quarantine Clearance
# Comparing Site-Specific Clearance: Stool, Urine, Sputum, Blood
# © 2024 Hyunsuk Frank Roh, MD. All rights reserved.
# -------------------------------------------
# 0. Setup: Install and Load Required Packages
install_and_load <- function(package) {
if (!require(package, character.only = TRUE)) {
install.packages(package, dependencies = TRUE)
library(package, character.only = TRUE)
}
}
required_packages <- c("survival", "survminer", "cmprsk", "dplyr", "readr",
"ggplot2", "patchwork", "forcats", "reshape2", "glmnet", "car")
# Load dplyr early to avoid conflicts
suppressMessages(library(dplyr))
invisible(lapply(required_packages, install_and_load))
Interpretation:
- This section ensures that all required packages for reading data, plotting, modeling, and statistical testing are present and loaded.
- Automated checks help avoid missing dependencies during runtime.
# 1. Define Output Directory
desktop_path <- file.path(Sys.getenv("USERPROFILE"), "Desktop")
output_dir <- file.path(desktop_path, "Survival_Analysis_Images")
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
message(paste("Created directory:", output_dir))
} else {
message(paste("Directory already exists:", output_dir))
}
Interpretation:
- Specifies where plots and model summaries will be stored.
- Recursive creation of directories ensures sub-directories exist when writing outputs.
# 2. Data Import and Preparation
data_path <- file.path(desktop_path, "dataset05.csv")
my_data <- read_csv(data_path, locale = locale(encoding = "UTF-8"))
cat("First few rows of the dataset:\n")
print(head(my_data))
cat("Structure of the dataset:\n")
str(my_data)
# Convert variables to appropriate data types
my_data <- my_data %>%
mutate(
PatientID = as.factor(PatientID),
Pathogen = as.factor(Pathogen),
Site = as.factor(Site),
Start_Date = as.Date(Start_Date, format = "%m/%d/%Y"),
Event_Date = as.Date(Event_Date, format = "%m/%d/%Y"),
Event_Type = as.factor(Event_Type),
Event_Code = as.numeric(Event_Code),
Age = as.numeric(Age),
Gender = as.factor(Gender),
FactorA = as.factor(FactorA),
FactorB = as.factor(FactorB),
FactorC = factor(FactorC, levels = c("Low", "Medium", "High"), ordered = TRUE)
)
cat("Structure after type conversions:\n")
str(my_data)
Interpretation:
- Reads the CSV file containing survival (clearance) data and sets correct data types for key variables.
- Event_Code should indicate whether an event (clearance) occurred (e.g., 1 = clearance, 0 = censored, 2 = competing event if used).
- This step ensures subsequent survival analysis is consistent with R’s expectations for times, events, and factors.
# 3. Exploratory Data Analysis (EDA)
cat("Summary statistics:\n")
print(summary(my_data))
cat("Missing values per column:\n")
print(sapply(my_data, function(x) sum(is.na(x))))
cat("Event distribution across sites:\n")
print(table(my_data$Site, my_data$Event_Code))
Interpretation:
- Summary statistics and missing value checks identify potential data issues.
- A two-way table of Site vs. Event_Code reveals how many clearance events occurred at each site.
# 4. Kaplan-Meier Plots Stratified by Site
if (!"Time_to_Event" %in% colnames(my_data)) {
my_data <- my_data %>%
mutate(Time_to_Event = as.numeric(difftime(Event_Date, Start_Date, units = "days")))
}
if (any(my_data$Time_to_Event <= 0, na.rm = TRUE)) {
cat("There are observations with non-positive Time_to_Event. These will be removed.\n")
my_data <- my_data %>%
filter(Time_to_Event > 0)
}
# Create the survival object
surv_object_site <- Surv(time = my_data$Time_to_Event, event = my_data$Event_Code == 1)
# Fit Kaplan-Meier survival curves stratified by Site
km_fit_site <- survfit(surv_object_site ~ Site, data = my_data)
# Plot the Kaplan-Meier curves
km_plot <- ggsurvplot(
km_fit_site,
data = my_data,
risk.table = TRUE,
pval = TRUE, # Adds p-value from log-rank test
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Probability of Clearance",
title = "Kaplan-Meier Curves Stratified by Site",
legend.title = "Site",
palette = scales::hue_pal()(length(levels(my_data$Site)))
)
print(km_plot)
ggsave(
filename = file.path(output_dir, "Kaplan_Meier_Curves_Stratified_by_Site.png"),
plot = km_plot$plot, width = 8, height = 6
)
ggsave(
filename = file.path(output_dir, "Kaplan_Meier_Risk_Table_Stratified_by_Site.png"),
plot = km_plot$table, width = 8, height = 3
)
Mathematical Note:
- The Kaplan-Meier approach estimates \( \hat{S}(t) \) (the probability that clearance has not yet occurred by time \( t \)).
- Stratification by Site compares different clearance curves. A log-rank test checks if the curves are significantly different.

Interpretation:
- A steeper curve indicates faster clearance.
- The p-value clarifies whether differences in clearance times across sites are statistically significant.
# 5. Cox Proportional Hazards Model Including Site
cox_model_simple <- coxph(surv_object_site ~ Site, data = my_data)
cat("Summary of Simplified Cox Model (Only Site):\n")
summary_cox_simple <- summary(cox_model_simple)
print(summary_cox_simple)
sink(file = file.path(output_dir, "Summary_Simplified_Cox_Model.txt"))
print(summary_cox_simple)
sink()
# Test proportional hazards assumption
ph_test_simple <- tryCatch(
cox.zph(cox_model_simple),
error = function(e) {
cat("Error in proportional hazards test:\n", e$message, "\n")
return(NULL)
}
)
if (!is.null(ph_test_simple)) {
cat("Proportional Hazards Test for Simplified Model:\n")
print(ph_test_simple)
sink(file = file.path(output_dir, "PH_Test_Simplified_Cox_Model.txt"))
print(ph_test_simple)
sink()
ph_plot_simple <- ggcoxzph(ph_test_simple)
print(ph_plot_simple)
ggsave(
filename = file.path(output_dir, "Schoenfeld_Residuals_Simplified_Cox_Model.png"),
plot = ph_plot_simple, width = 8, height = 6
)
} else {
cat("Proportional hazards assumption test could not be performed.\n")
}
Mathematical Note:
- The Cox model: \( \lambda_i(t) = \lambda_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip}) \).
- Here, the predictor is only Site (categorical). The exponent of a coefficient, \( \exp(\hat{\beta}) \), is the hazard ratio.
- If \( \exp(\hat{\beta}) > 1 \), that site experiences a faster clearance (on average).
- If \( \exp(\hat{\beta}) < 1 \), that site has a slower clearance rate compared to the reference site.

Interpretation:
- The cox.zph test checks whether the proportional hazards assumption is valid.
- If non-significant, it suggests no major violation of the assumption for that predictor.
# 6. Checking for Multicollinearity
cox_model_full <- coxph(surv_object_site ~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data)
print("Summary of Full Cox Model:")
summary_cox_full <- summary(cox_model_full)
print(summary_cox_full)
sink(file = file.path(output_dir, "Summary_Full_Cox_Model.txt"))
print(summary_cox_full)
sink()
# Compute VIF
if(!is.null(cox_model_full$coefficients)) {
design_matrix <- model.matrix(~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data)[, -1]
lm_fit <- lm(Time_to_Event ~ ., data = as.data.frame(design_matrix))
vif_values <- vif(lm_fit)
print("Variance Inflation Factor (VIF) for Predictors:")
print(vif_values)
write.csv(as.data.frame(vif_values), file = file.path(output_dir, "VIF_Full_Cox_Model.csv"), row.names = TRUE)
} else {
warning("Full Cox model did not converge. Skipping VIF computation.")
}
Mathematical Note:
- The VIF (Variance Inflation Factor) for each predictor \( X_j \) is: \[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \] where \( R_j^2 \) is the coefficient of determination when \( X_j \) is regressed on all other predictors.
- A high VIF (commonly > 5 or 10) indicates multicollinearity.

Interpretation:
- Identifying collinear factors ensures stable model estimates.
- High collinearity in a Cox model can lead to large standard errors for coefficient estimates.
# 7. Penalized Cox Regression (Lasso)
my_data_pen <- my_data %>%
select(Time_to_Event, Event_Code, Site, FactorA, FactorB, FactorC, Age, Gender) %>%
drop_na()
surv_object_pen <- Surv(time = my_data_pen$Time_to_Event, event = my_data_pen$Event_Code == 1)
x_pen <- model.matrix(~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data_pen)[, -1]
y_pen <- surv_object_pen
set.seed(123)
cv_fit <- cv.glmnet(x_pen, y_pen, family = "cox", alpha = 1, standardize = TRUE)
# Save the cross-validation curve; base-graphics output must go through a
# graphics device (plot() returns NULL, so ggsave() cannot be used here)
png(file.path(output_dir, "Penalized_Cox_CV_Plot.png"), width = 800, height = 600)
plot(cv_fit)
title("Cross-Validation for Penalized Cox Regression", line = 2.5)
dev.off()
best_lambda <- cv_fit$lambda.min
print(paste("Best lambda selected by cross-validation:", best_lambda))
penalized_cox <- glmnet(x_pen, y_pen, family = "cox", alpha = 1, lambda = best_lambda, standardize = TRUE)
print("Coefficients from Penalized Cox Model:")
penalized_cox_coef <- coef(penalized_cox)
print(penalized_cox_coef)
write.csv(as.data.frame(as.matrix(penalized_cox_coef)),
file = file.path(output_dir, "Penalized_Cox_Model_Coefficients.csv"),
row.names = TRUE)
Mathematical Note:
- Penalized Cox adds an \( L_1 \) penalty term \( \alpha \lambda \|\beta\|_1 \) (when \( \alpha=1 \)) to the partial likelihood objective. This is known as Lasso, encouraging sparse solutions (some coefficients shrink to 0).

Interpretation:
- Lasso helps in variable selection and reducing overfitting, especially when many predictors might be collinear or the sample size is small.
# 8. Fine-Gray Model for Competing Risks
# Event_Code == 1: Clearance (event of interest)
# Event_Code == 2: Some competing event (e.g., Death)
my_data_fg <- my_data %>%
mutate(
Competing_Event = case_when(
Event_Code == 1 ~ 0, # Event of interest
Event_Code == 2 ~ 1, # Competing event
TRUE ~ NA_real_
)
)
na_count_fg <- sum(is.na(my_data_fg$Competing_Event))
print(paste("Number of NA values in Competing_Event:", na_count_fg))
write.csv(data.frame(Column = "Competing_Event", NA_Count = na_count_fg),
file = file.path(output_dir, "NA_Counts_Competing_Event.csv"),
row.names = FALSE)
my_data_clean_fg <- my_data_fg %>%
filter(!is.na(Competing_Event)) %>%
drop_na(Site, FactorA, FactorB, FactorC, Age, Gender, Time_to_Event, Competing_Event)
print("Dimensions after cleaning for Fine-Gray model:")
print(dim(my_data_clean_fg))
write.csv(data.frame(Dimensions = paste(dim(my_data_clean_fg), collapse = "x")),
file.path(output_dir, "Fine_Gray_Clean_Data_Dimensions.csv"),
row.names = FALSE)
covariates_fg <- model.matrix(~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data_clean_fg)[, -1]
print(paste("Number of covariates:", ncol(covariates_fg)))
print(paste("Number of observations:", nrow(my_data_clean_fg)))
write.csv(data.frame(Covariates = ncol(covariates_fg), Observations = nrow(my_data_clean_fg)),
file.path(output_dir, "Covariate_Matrix_Dimensions_Fine_Gray.csv"),
row.names = FALSE)
if(nrow(covariates_fg) == nrow(my_data_clean_fg)) {
fg_model_site <- try(crr(
ftime = my_data_clean_fg$Time_to_Event,
fstatus = my_data_clean_fg$Competing_Event,
cov1 = covariates_fg
), silent = TRUE)
if(class(fg_model_site) != "try-error") {
print("Summary of Fine-Gray Model:")
summary_fg <- summary(fg_model_site)
print(summary_fg)
sink(file = file.path(output_dir, "Summary_Fine_Gray_Model.txt"))
print(summary_fg)
sink()
} else {
warning("Fine-Gray model failed to converge. Check data for issues.")
error_message <- "Fine-Gray model failed to converge. Check data for issues."
write.csv(data.frame(Error = error_message),
file.path(output_dir, "Fine_Gray_Model_Error.csv"),
row.names = FALSE)
}
} else {
stop("Mismatch in the number of rows between covariates and survival data for Fine-Gray model.")
}
Mathematical Note:
- Fine and Gray (1999) proposed modeling the subdistribution hazard of a particular event type in the presence of competing events.
- The subdistribution hazard function is different from the cause-specific hazard: it explicitly accounts for individuals who have experienced the competing event but remain “at risk” in the subdistribution sense.

Interpretation:
- Use this approach if competing events (like death) might preclude observing clearance.
- The model yields subdistribution hazard ratios, interpreted as the effect of covariates on the cumulative incidence of clearance.
# 9. Visualizing Survival Differences Across Sites
events_per_site <- my_data %>%
group_by(Site) %>%
summarize(Events = sum(Event_Code == 1, na.rm = TRUE))
print("Number of events per site:")
print(events_per_site)
write.csv(events_per_site, file.path(output_dir, "Events_Per_Site.csv"), row.names = FALSE)
valid_sites <- events_per_site %>%
filter(Events > 0) %>%
pull(Site)
my_data_valid <- my_data %>%
filter(Site %in% valid_sites)
surv_object_site_valid <- Surv(time = my_data_valid$Time_to_Event, event = my_data_valid$Event_Code == 1)
km_fit_site_valid <- survfit(surv_object_site_valid ~ Site, data = my_data_valid)
km_plot_valid <- ggsurvplot(
km_fit_site_valid,
data = my_data_valid,
risk.table = TRUE,
pval = TRUE,
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Survival Probability",
title = "Survival Curves by Site",
legend.title = "Site",
legend.labs = levels(my_data_valid$Site),
palette = c("#E7B800", "#2E9FDF", "#FC4E07", "#00BA38"),
ggtheme = theme_minimal()
)
print(km_plot_valid)
ggsave(filename = file.path(output_dir, "Kaplan_Meier_Curves_By_Site_Valid_Sites.png"),
plot = km_plot_valid$plot, width = 8, height = 6)
ggsave(filename = file.path(output_dir, "Kaplan_Meier_Risk_Table_By_Site_Valid_Sites.png"),
plot = km_plot_valid$table, width = 8, height = 3)
obs_per_site <- my_data_valid %>%
group_by(Site) %>%
summarize(Count = n())
print("Number of observations per valid site:")
print(obs_per_site)
write.csv(obs_per_site, file.path(output_dir, "Observations_Per_Valid_Site.csv"), row.names = FALSE)
if(all(obs_per_site$Count >= 2)) {
km_facet_plot <- ggsurvplot_facet(
km_fit_site_valid,
data = my_data_valid,
facet.by = "Site",
nrow = 2,
ncol = 2,
risk.table = TRUE,
pval = FALSE,
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Survival Probability",
title = "Faceted Kaplan-Meier Curves by Site",
ggtheme = theme_minimal()
)
print(km_facet_plot)
ggsave(filename = file.path(output_dir, "Faceted_Kaplan_Meier_Curves_By_Site.png"),
plot = km_facet_plot$plot, width = 12, height = 8)
ggsave(filename = file.path(output_dir, "Faceted_Kaplan_Meier_Risk_Table_By_Site.png"),
plot = km_facet_plot$table, width = 12, height = 6)
} else {
warning("Not all sites have enough observations for faceted plots. Creating separate plots for each site.")
library(patchwork)
plots <- list()
risk_tables <- list()
for (site in levels(my_data_valid$Site)) {
fit_site <- survfit(Surv(Time_to_Event, Event_Code == 1) ~ 1, data = my_data_valid %>% filter(Site == site))
p <- ggsurvplot(
fit_site,
data = my_data_valid,
risk.table = TRUE,
pval = FALSE,
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Survival Probability",
title = paste("Survival Curve for", site),
ggtheme = theme_minimal()
)
plots[[site]] <- p$plot
risk_tables[[site]] <- p$table
}
combined_plot <- wrap_plots(plots, ncol = 2)
print(combined_plot)
ggsave(filename = file.path(output_dir, "Combined_Kaplan_Meier_Curves_By_Site.png"),
plot = combined_plot, width = 12, height = 8)
for (site in levels(my_data_valid$Site)) {
ggsave(filename = file.path(output_dir, paste0("Risk_Table_", site, ".png")),
plot = risk_tables[[site]], width = 8, height = 3)
}
}
Interpretation:
- Faceted or combined survival plots highlight clearance patterns for each site.
- This step is crucial for visual diagnosis of clearance timelines across multiple categories.
# 10. Additional Recommendations and Checks
# 10.1. Check for Influential Observations
influence_cox <- residuals(cox_model_simple, type = "dfbeta")
if(!is.null(influence_cox)) {
dfbeta_df <- as.data.frame(influence_cox)
dfbeta_df$PatientID <- rownames(dfbeta_df)
library(reshape2)
dfbeta_melt <- melt(dfbeta_df, id.vars = "PatientID")
dfbeta_plot <- ggplot(dfbeta_melt, aes(x = PatientID, y = value, color = variable)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "DFBeta for Simplified Cox Model", x = "Patient ID", y = "DFBeta") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(dfbeta_plot)
ggsave(filename = file.path(output_dir, "DFBeta_Simplified_Cox_Model.png"),
plot = dfbeta_plot, width = 12, height = 6)
} else {
warning("No dfbeta available for the simplified Cox model.")
}
# 10.2. Assess Outliers in Continuous Variables
age_boxplot <- ggplot(my_data, aes(x = "", y = Age)) +
geom_boxplot(fill = "#2E9FDF") +
labs(title = "Boxplot of Age", y = "Age") +
theme_minimal()
print(age_boxplot)
ggsave(filename = file.path(output_dir, "Boxplot_Age.png"),
plot = age_boxplot, width = 6, height = 6)
# 10.3. Handling Missing Data
# Consider advanced imputation if missingness is not negligible.
Interpretation:
- DFBeta plots identify influential points that might unduly affect estimates.
- The boxplot of Age quickly spots outliers or unusual distributions.
# 11. End of Script
# Most plots and model summaries have already been saved in previous steps.
# Additional saves can be added if needed.
Written on December 24th, 2024