The \( \Phi^{-1}(1 - p_i/2) \) term in Equations (10), (13), and (14) can be computed using R.
This Python script is designed to transform and amalgamate the contents of an ICD-10 Excel spreadsheet with the ICD-10 information from a text file, outputting the combined data in a JSON format.
Per the International Classification of Diseases, Tenth Revision (ICD-10), "Other Hyperlipidemia" is classified under the code E784. The code E1329 is assigned to "Other Specified Diabetes Mellitus with Other Diabetic Kidney Complication." Additionally, J9699 refers to "Respiratory Failure, Unspecified, Type Unspecified."
In addition, the above Python script is designed to extract data from the file icd10cm_tabular_2024.xml, obtained from the official website of the Centers for Medicare & Medicaid Services (CMS). The primary function of this script is to efficiently process the extracted data for the creation of the file CMS_ICD10.js.
Topic | Details |
---|---|
Purpose | Interactive builder for JavaScript‑compatible regular expressions. • Live preview with match highlighting. • Token & flag palettes for one‑click insertion. • ChatGPT‑assisted pattern generation. Pure vanilla JS—no frameworks. |
File location | Place re.html anywhere and open in a modern browser (Chrome, Edge, Firefox, Safari). |
AI description → ChatGPT | “Describe the pattern” textarea + Generate Pattern with ChatGPT button. Sends prompt to OpenAI Chat Completion API, receives a raw pattern, inserts it into the Pattern field, and refreshes the preview. |
API key storage | First AI call prompts for an OpenAI key and caches it in localStorage . Key is never transmitted to any server except api.openai.com . |
Pattern field | Free‑text input; accepts any JavaScript regex source (no delimiters). Typing triggers live validation and preview. |
Flags field | Accepts g i m s u y . Palette buttons toggle single‑character flags. |
Token & Flag palettes | One‑click insertion of common tokens (e.g., \d , [ ] ) and flags. Caret position preserved. |
Sample text + preview | Paste or type any text; matches are wrapped in <span class="match"> with a yellow highlight. Works in real‑time. |
Copy utilities | • Copy Pattern — raw pattern. • Copy Escaped Pattern — doubles back‑slashes for embedding in string literals. |
Accordion references | Five collapsible sections inside the page: 1️⃣ Token reference • 2️⃣ Flag reference • 3️⃣ Pattern cheatsheet • 4️⃣ How to use • 5️⃣ Get an OpenAI API key. Each may be expanded independently. |
Full‑source reveal | Sixth accordion shows the entire HTML of the page, syntax‑highlighted via Highlight.js. |
Security note | OpenAI key lives only in the user’s browser storage; no external backend required. For production, proxy API calls to keep the key server‑side. |
Dependencies | Highlight.js (CDN) for code rendering. Everything else is native JavaScript and CSS. |
Namespace | All logic wrapped in a single IIFE; CSS scoped to local class names—safe to embed in any page. |
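As a rough sketch of the escaping behaviour described in the Copy utilities row (the helper names and clipboard call are illustrative, not the page's actual implementation):

// Hypothetical copy helpers (illustrative only).
const copyRaw = pattern => navigator.clipboard.writeText(pattern);
const copyEscaped = pattern =>
  navigator.clipboard.writeText(pattern.replace(/\\/g, '\\\\')); // double every backslash
// Example: \d{4}-\d{2}  ->  \\d{4}-\\d{2}  (safe inside a JavaScript string literal)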
Definition: brackets that match any one character from a specified set. Common examples include [a-z] for lowercase letters, \d for digits, and \w for “word” characters (letters + digits + underscore).
Symbol | Meaning | Typical example |
---|---|---|
[abc] | Any of a, b, c | gr[ae]y → “gray”, “grey” |
[^abc] | Any character except a, b, c | [^0-9] |
\d | Digit (0‑9) | \d{4} → four‑digit year |
\w | Word character | \w+ → identifier |
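A quick way to confirm the examples above is to run them in a browser console; the snippet below is illustrative only:

// Character classes in action (illustrative).
console.log('gray grey groy'.match(/gr[ae]y/g)); // ["gray", "grey"]
console.log('Room 42, floor 7'.match(/\d+/g));   // ["42", "7"]
console.log('foo_bar-baz'.match(/\w+/g));        // ["foo_bar", "baz"]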
Purpose: dictate how many consecutive times the preceding token may occur. The most frequent forms are * (0 or more), + (1 or more), ? (0 or 1), and {m,n} (explicit range). Greedy quantifiers match as much as possible; append ? to switch them to lazy mode (e.g. .*?).
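The difference between greedy and lazy matching is easiest to see on markup-like input; a short illustrative snippet:

// Greedy vs. lazy quantifiers (illustrative).
const html = '<b>bold</b> and <i>italic</i>';
console.log(html.match(/<.+>/)[0]);     // "<b>bold</b> and <i>italic</i>" (greedy: as much as possible)
console.log(html.match(/<.+?>/)[0]);    // "<b>" (lazy: as little as possible)
console.log('aaab'.match(/a{1,2}/)[0]); // "aa" (explicit range, still greedy)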
Anchors pin patterns to positions rather than characters. ^ aligns with the start of a string (or line in multiline mode), while $ attaches to the end. Word boundaries (\b) are invaluable in token extraction, ensuring partial words remain untouched.
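For example (illustrative snippet):

// Anchors and word boundaries (illustrative).
console.log(/^Err/.test('Error: disk full'));        // true  (start of string)
console.log(/full$/.test('Error: disk full'));       // true  (end of string)
console.log('cat catalog concat'.match(/\bcat\b/g)); // ["cat"] (whole word only)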
Parentheses gather sub‑patterns into logical units and capture their content. The vertical bar (|) creates alternatives. Non‑capturing groups (?:…) improve performance when captures are unnecessary.
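A short illustrative snippet:

// Groups and alternation (illustrative).
const m = '2025-05-12'.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(m[1], m[2], m[3]);             // "2025" "05" "12" (captured groups)
console.log(/(?:cat|dog)s?/.test('dogs')); // true (non-capturing alternation)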
Provide an in‑browser utility that assists users in composing and testing regular expressions in real time, accompanied by contextual guidance.
Supported flags: g, i, m, s, u, and y.

<div class="regex-builder">
<header>Compose a regular expression</header>
<section class="controls">
<input id="pattern" placeholder="Enter pattern…" />
<input id="flags" placeholder="Flags (e.g. gim)" />
<div id="palette"><!-- buttons injected by JS --></div>
</section>
<section class="sample">
<textarea id="sampleText">Paste or type sample text here…</textarea>
<pre id="preview"></pre>
</section>
<footer>
<button id="copyRaw">Copy /<button>
<button id="copyEscaped">Copy escaped</button>
</footer>
</div>
.regex-builder{
font-family:system-ui, sans-serif;
line-height:1.5;
max-width:48rem;
margin:auto;
border:1px solid #e5e7eb;
padding:1.5rem;
border-radius:0.75rem;
}
.controls input{
width:100%;
padding:0.5rem 0.75rem;
margin-bottom:0.5rem;
border:1px solid #d1d5db;
border-radius:0.5rem;
font-size:1rem;
}
#palette button{
margin:0.25rem;
padding:0.4rem 0.75rem;
border:1px solid #cbd5e1;
border-radius:0.5rem;
background:#f8fafc;
cursor:pointer;
}
.sample{
margin-top:1rem;
}
#sampleText{
width:100%;
min-height:8rem;
padding:0.75rem;
border:1px solid #d1d5db;
border-radius:0.5rem;
}
#preview .match{
background:#fffbcc;
border-bottom:2px solid #facc15;
}
// Helper: escape for display
const escapeHtml = str => str.replace(/[&<>"']/g, m => ({
'&':'&amp;', '<':'&lt;', '>':'&gt;', '"':'&quot;', "'":'&#39;'
})[m]);
const patternEl = document.getElementById('pattern');
const flagsEl = document.getElementById('flags');
const sampleEl = document.getElementById('sampleText');
const previewEl = document.getElementById('preview');
function render(){
let regex;
try{
regex = new RegExp(patternEl.value, flagsEl.value);
}catch(e){
previewEl.innerHTML = '<em>Invalid pattern</em>';
return;
}
const raw = sampleEl.value;
const highlighted = raw.replace(regex, m =>
'<span class="match">' + escapeHtml(m) + '</span>');
// Inject the highlighted markup only when the sample contains no HTML-special
// characters; otherwise fall back to the escaped plain text without highlighting.
previewEl.innerHTML = escapeHtml(raw) === raw
? highlighted
: escapeHtml(raw);
}
['input','keyup','change'].forEach(evt => {
patternEl.addEventListener(evt, render);
flagsEl.addEventListener(evt, render);
sampleEl.addEventListener(evt, render);
});
render();
The OpenAI API key entered on first use is cached in localStorage to streamline repetitive testing.

Written on May 12, 2025
Pattern | Matches… |
---|---|
Dates & Time | |
\b\d{4}-\d{2}-\d{2}\b | ISO date (YYYY‑MM‑DD) |
\b\d{4}년\s?\d{1,2}월\s?\d{1,2}일\b | Korean date (YYYY년 MM월 DD일) |
\b\d{1,2}/\d{1,2}/\d{4}\b | US date (MM/DD/YYYY) |
\b\d{4}/\d{2}/\d{2}\b | Slash date (YYYY/MM/DD) |
\b\d{2}:\d{2}\b | 24‑h time (HH:MM) |
\b\d{2}:\d{2}:\d{2}\b | 24‑h time with seconds |
\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?\b | ISO datetime (Z optional) |
\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b | English month abbreviations |
\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun),\s\d{2}\s[A-Z][a-z]{2}\s\d{4}\b | RFC 2822 date header |
Numbers & Units | |
\b\d+\b | Positive integer |
\b[+-]?\d+\b | Signed integer |
\b\d+\.\d+\b | Decimal number |
\b\d{1,3}(?:,\d{3})+\b | Thousands‑separated integer (1,234,567) |
\$\d+(?:\.\d{2})?\b | US currency (optional cents) |
\b100(?:\.0+)?%|\b\d{1,2}(?:\.\d+)?% | Percentage 0–100 % |
Text & Strings | |
"[^"\\]*(?:\\.[^"\\]*)*" | Double‑quoted string (escapes OK) |
'[^'\\]*(?:\\.[^'\\]*)*' | Single‑quoted string (escapes OK) |
`[^`\\]*(?:\\.[^`\\]*)*` | Template literal (no ${…}) |
\b[A-Z][a-z]+\b | Capitalised word |
\b[가-힣]+\b | Korean Hangul word |
Networking & Web | |
\bhttps?:\/\/\S+\b | HTTP/HTTPS URL |
\bftp:\/\/\S+\b | FTP URL |
\b(?:\d{1,3}\.){3}\d{1,3}\b | IPv4 address |
\b(?:[A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}\b | IPv6 address (full form) |
\b[a-z0-9-]+\.[a-z]{2,}\b | Domain name |
[\w.+-]+@[\w.-]+\.[A-Za-z]{2,} | E‑mail address |
\b010-\d{4}-\d{4}\b | Korean mobile phone (010-1234-5678) |
\/(?:[^\/\0]+\/)*[^\/\0]+\.[\w\-]+ | POSIX file path with extension |
Identifiers | |
\b[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b | UUID v1–v5 |
\b(?:[0-9A-F]{2}:){5}[0-9A-F]{2}\b | MAC address |
\b(?:\d{4}[\s-]?){4}\b | 16‑digit credit‑card number |
\b97[89]-\d-\d{2,5}-\d{2,7}-\d\b | ISBN‑13 (hyphenated) |
Markup & Code | |
<\/?([A-Za-z][A-Za-z0-9]*)\b[^>]*> | HTML tag |
^#{1,6}\s.+$ | Markdown header (# Title) |
"[A-Za-z0-9_]+"(?=\s*:) | JSON key (double‑quoted) |
Whitespace & Misc | |
^\s+|\s+$ | Trim leading/trailing whitespace (use g) |
\s{2,} | Consecutive spaces (≥2) |
\t+ | Tabs |
^\s*$ | Blank line |
^\S.*$ | Non‑blank line |
A deterministic finite automaton (DFA) is a simple computational model used to recognize patterns in input strings. A DFA consists of a finite set of states, an input alphabet, a transition function mapping each state and input symbol to a next state, a designated start state, and one or more accepting (final) states. As the automaton reads an input string symbol by symbol, it moves between states according to the transition function. If the automaton ends in an accepting state after processing the entire input, the string is considered accepted (i.e., part of the language recognized by the DFA). DFAs are a foundational model in formal language theory for defining the class of regular languages.
A regular expression is a symbolic notation for describing sets of strings (languages). Regular expressions use operators such as union (often represented by '|'), concatenation, and Kleene star (denoted '*') to build complex patterns. For example, the expression (a|b)* denotes the set of all strings composed of zero or more occurrences of 'a' or 'b'. Regular expressions correspond to regular languages because every language described by a regular expression can be recognized by some DFA. Regular expressions are commonly used in programming and text-processing tools for pattern matching. In formal language theory, they describe exactly the same class of languages as DFAs.
A fundamental theorem in formal language theory states that DFAs and regular expressions are equivalent in power: they define the same class of languages, known as the regular languages. This result is sometimes referred to as Kleene's Theorem. It implies that for every regular expression there is an equivalent DFA that accepts the same language, and vice versa. Converting a regular expression into a DFA often involves first creating a nondeterministic finite automaton (NFA) using standard constructions, and then applying the subset construction to obtain a DFA. Conversely, one can derive a regular expression that represents the language of a given DFA (though the resulting expression may be complex). These concepts demonstrate that DFAs and regular expressions are two equivalent ways to describe regular languages, each offering a different perspective on pattern recognition in computation theory.
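To make the definition concrete, here is a minimal DFA sketch in JavaScript (the representation is illustrative, not taken from any library): it accepts binary strings containing an even number of 1s, the same language described by the regular expression (0*10*1)*0*.

// Minimal DFA: accepts binary strings with an even number of 1s (illustrative).
const dfa = {
  start: 'even',
  accept: new Set(['even']),
  delta: {                       // transition function δ(state, symbol)
    even: { '0': 'even', '1': 'odd' },
    odd:  { '0': 'odd',  '1': 'even' }
  }
};

function accepts(dfa, input) {
  let state = dfa.start;
  for (const ch of input) {
    state = dfa.delta[state][ch];          // follow one transition per symbol
    if (state === undefined) return false; // symbol outside the alphabet
  }
  return dfa.accept.has(state);            // accept iff we end in a final state
}

console.log(accepts(dfa, '1101')); // false (three 1s)
console.log(accepts(dfa, '1001')); // true  (two 1s)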
The table below highlights key aspects of DFAs and regular expressions:
Aspect | DFA | Regular Expression |
---|---|---|
Definition | A state-based automaton defined by states and transitions. | A symbolic formula using union, concatenation, and Kleene star. |
Representation | Graphical or tabular transition structure. | Algebraic pattern syntax. |
Recognized Language | Regular languages (exactly). | Regular languages (exactly). |
Conversion | Can be constructed from a regular expression (via NFA). | Can be converted to a DFA (via NFA). |
Typical Use | Formal language modeling, compilers (lexical analysis). | Text searching and validation in software tools. |
At the graduate level, DFAs and regular expressions are examined with greater mathematical formality. A DFA is formally defined as a 5-tuple (Q, Σ, δ, q₀, F), where Q is a finite set of states, Σ is the input alphabet, δ: Q × Σ → Q is the transition function, q₀ ∈ Q is the initial state, and F ⊆ Q is the set of accepting states. A regular expression can be defined by a recursive grammar: the empty string ϵ and each symbol in Σ are regular expressions, and if R and S are regular expressions, then (R|S), (RS), and (R*) are also regular expressions. Graduate students study important properties of regular languages, including closure under union, concatenation, and Kleene star, as well as intersection and complement. The Myhill-Nerode theorem provides a characterization of regular languages in terms of distinguishable strings, and the pumping lemma is a fundamental result for proving that certain languages are not regular.
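As a standard textbook illustration of the pumping lemma (not specific to this article): the language L = {aⁿbⁿ : n ≥ 0} is not regular. If it were, with pumping length p, the string aᵖbᵖ could be written as xyz with |xy| ≤ p and |y| ≥ 1, so y = aᵏ for some k ≥ 1; pumping once gives xy²z = aᵖ⁺ᵏbᵖ, which is not in L, a contradiction.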
Graduate research often involves explicit constructions to demonstrate the equivalence of DFAs and regular expressions. A standard method is Thompson's construction, which builds an NFA from a given regular expression, followed by the subset construction to convert that NFA into an equivalent DFA. Conversely, one can eliminate states from a DFA or apply Arden's lemma to derive a regular expression that represents its language. These transformations illustrate the fundamental equivalence between the two formalisms. In terms of complexity, converting an NFA (or regex) to a DFA can cause an exponential increase in the number of states. DFA minimization can be applied to reduce states and yields a unique minimal automaton for each language. In contrast, simplification of regular expressions is computationally harder and no canonical minimal form exists. These considerations motivate research on the state complexity and descriptive complexity of regular languages.
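A minimal sketch of the subset construction in JavaScript, assuming the NFA is supplied as a plain transition table with '' denoting ε-moves (the representation and example NFA are illustrative):

// Subset construction: NFA -> DFA (illustrative sketch).
// NFA format: { alphabet, start, accept: [...], delta: { state: { symbol: [states], '': [states] } } }

function epsilonClosure(delta, states) {
  const closure = new Set(states);
  const stack = [...states];
  while (stack.length) {
    const s = stack.pop();
    for (const t of (delta[s] && delta[s]['']) || []) {
      if (!closure.has(t)) { closure.add(t); stack.push(t); }
    }
  }
  return closure;
}

function subsetConstruction(nfa) {
  const key = set => [...set].sort().join(',') || 'dead';   // label a set of NFA states
  const startSet = epsilonClosure(nfa.delta, [nfa.start]);
  const dfa = { start: key(startSet), accept: new Set(), delta: {} };
  const queue = [startSet];
  const seen = new Set([key(startSet)]);
  while (queue.length) {
    const current = queue.shift();
    const k = key(current);
    dfa.delta[k] = {};
    if ([...current].some(s => nfa.accept.includes(s))) dfa.accept.add(k);
    for (const a of nfa.alphabet) {
      const moved = [];
      for (const s of current) {
        for (const t of (nfa.delta[s] && nfa.delta[s][a]) || []) moved.push(t);
      }
      const next = epsilonClosure(nfa.delta, moved);
      const nk = key(next);
      dfa.delta[k][a] = nk;                                 // one total transition per symbol
      if (!seen.has(nk)) { seen.add(nk); queue.push(next); }
    }
  }
  return dfa;
}

// Example: NFA for strings over {a, b} ending in "ab".
const nfa = {
  alphabet: ['a', 'b'],
  start: 'q0',
  accept: ['q2'],
  delta: { q0: { a: ['q0', 'q1'], b: ['q0'] }, q1: { b: ['q2'] }, q2: {} }
};
console.log(subsetConstruction(nfa)); // DFA states "q0", "q0,q1", "q0,q2"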
Aspect | DFA | Regular Expression |
---|---|---|
Formalism | Graph-based model (states and transitions). | Algebraic syntax with operators (|, concatenation, *). |
Minimization | A unique minimal DFA exists for each regular language. | No unique minimal regex; minimization is PSPACE-hard. |
Closure Properties | Closed under union, intersection, complement, etc., via automaton constructions. | Closed under union, concatenation, star; intersection/complement require additional constructs. |
Conversion | From regex: via Thompson's NFA + subset construction. | From DFA: via state elimination or algebraic methods. |
Decision Problems | Equivalence, emptiness, membership are decidable (often efficiently). | Equivalence (of regex) is PSPACE-complete; membership is linear-time. |
Technical professionals often use regular expressions in practical applications such as text processing, data validation, and log analysis. Regular expressions are implemented in many programming languages and tools (for example, grep, sed, Perl, Python, and Java) to perform pattern matching. These implementations frequently include extended features beyond the basic formalism, such as capturing groups, backreferences (allowing the engine to match the same substring again), and lookahead and lookbehind assertions. While these features provide additional expressive power, they can also make the matching process more complex and, in some cases, allow matching of non-regular patterns. Professionals should understand that the core theory applies to the subset of regex features that remain within regular language constraints.
Understanding that core regular expressions correspond to finite automata provides insight into writing efficient patterns. For instance, any pure regular expression can, in principle, be transformed into a DFA, which means that matching can be performed in linear time relative to the input size. Some modern tools actually convert regex patterns into DFAs or similar state machines under the hood for fast matching. Other regex engines use backtracking algorithms that may exhibit exponential-time behavior on certain patterns. Patterns involving heavy backtracking (such as nested quantifiers) can lead to performance issues. It is advisable to design regexes to be as unambiguous as possible and to anchor patterns (using ^ and $) when appropriate to limit the search space.
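The following illustrative snippet shows the effect (exact timings depend on the engine; the nested pattern is deliberately pathological):

// Nested quantifiers force exponential backtracking on failing input (illustrative).
const risky = /^(a+)+b$/;        // (a+)+ can partition the a's in exponentially many ways
const safer = /^a+b$/;           // describes the same strings here, fails in linear time

const input = 'a'.repeat(22);    // no trailing 'b', so both patterns must fail

console.time('risky');
risky.test(input);               // noticeably slow; roughly doubles with every extra 'a'
console.timeEnd('risky');

console.time('safer');
safer.test(input);               // effectively instant
console.timeEnd('safer');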
The table below outlines differences between theoretical regular expressions and practical regex engines:
Feature | Theoretical Regex | Practical Regex Engine |
---|---|---|
Operators | Union (|), concatenation, Kleene star (*) only. | Includes these plus quantifiers (e.g. +, ?, {m,n}), wildcards (.), character classes, anchors (^, $), etc. |
Language | Exactly the class of regular languages. | Extended by backreferences and lookaround (can match some non-regular patterns). |
Matching Algorithm | Can be implemented via NFA/DFA construction (guaranteed linear time). | Often implemented with backtracking or hybrid algorithms (potential exponential time). |
Performance | Matching is guaranteed to be linear time in the input size. | Performance may degrade to exponential time for complex patterns if not carefully designed. |
Typical Use | Theoretical analysis and language processing. | Practical searching, text processing, and validation tasks. |
By grounding regular expression practice in automata theory, professionals gain a deeper understanding of why certain patterns are efficient and others are not. The equivalence of DFAs and regular expressions assures that any properly constructed pattern corresponds to some finite automaton. This theoretical foundation can guide the composition of complex regexes in a principled way, helping to ensure they remain efficient and correct when applied in software systems.
Written on May 12, 2025
Table of Contents
- Mathematical Foundation
- Code and Step-by-Step Explanation
- 0. Install and Load Required Packages
- 1. Define Output Directory
- 2. Data Import and Preparation
- 3. Exploratory Data Analysis (EDA)
- 4. Kaplan-Meier Plots Stratified by Site
- 5. Cox Proportional Hazards Model Including Site
- 6. Checking for Multicollinearity
- 7. Penalized Cox Regression (Lasso)
- 8. Fine-Gray Model for Competing Risks
- 9. Visualizing Survival Differences Across Sites
- 10. Additional Recommendations and Checks
- 11. End of Script
The Kaplan-Meier (KM) estimator is a non-parametric statistic used to estimate the survival function \( S(t) \). If there are \( n \) subjects, let \( t_{(1)}, t_{(2)}, \dots, t_{(k)} \) be the distinct event times in ascending order. Let \( d_j \) be the number of events (clearances) at time \( t_{(j)} \) and \( r_j \) be the number of subjects at risk just before \( t_{(j)} \). The KM estimator is:
\[ \hat{S}(t) = \prod_{t_{(j)} \le t} \left( 1 - \frac{d_j}{r_j} \right). \]
It allows comparing the time to clearance across different sites (e.g., stool vs. urine).
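As a small worked example with hypothetical numbers (not from the study data): suppose five subjects are followed, one clears at day 2, one is censored at day 3, and one clears at day 4, so three subjects are at risk just before day 4. Then
\[ \hat{S}(4) = \left(1 - \frac{1}{5}\right)\left(1 - \frac{1}{3}\right) = \frac{4}{5} \cdot \frac{2}{3} \approx 0.53, \]
i.e., the estimated probability that clearance has not yet occurred by day 4 is about 53%.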
The Cox Proportional Hazards model expresses the hazard function for individual \( i \) as:
\[ \lambda_i(t) = \lambda_0(t) \exp\left( \boldsymbol{\beta}^\top \mathbf{x}_i \right), \]
where \( \lambda_0(t) \) is the baseline hazard function, \( \boldsymbol{\beta} \) is the vector of regression coefficients, and \( \mathbf{x}_i \) is the vector of covariates for individual \( i \).
A hazard ratio between two covariate levels reflects the ratio of their hazards at any given time \( t \). If a coefficient \( \beta_j \) is positive, it suggests an increased rate of clearance for that factor level.
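For example, with a hypothetical coefficient \( \hat{\beta}_j = 0.41 \) for one site relative to the reference site, the hazard ratio is \( \exp(0.41) \approx 1.5 \): at any given time, clearance at that site occurs at roughly 1.5 times the rate of the reference site.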
When multiple competing events can occur (e.g., clearance is the event of interest, but death or loss to follow-up might preclude clearance), a competing risk model such as Fine and Gray is used:
\[ \text{Fine-Gray: } \quad \text{subdistribution hazard } = h_j(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t,\ \epsilon = j \mid T \ge t \text{ or } (T < t \text{ and } \epsilon \neq j))}{\Delta t}, \]
where \( \epsilon \) indicates the type of event (\( j = \) clearance, other = competing events). This method accounts for the fact that once a competing event (e.g., death) happens, one can no longer clear the pathogen.
Step | Section | Purpose |
---|---|---|
0 | Install Packages | Ensures necessary packages are installed |
1 | Define Directory | Output directory for storing analysis results |
2 | Data Import/Prep | Loads and cleans the data |
3 | Exploratory Analysis | Summaries and missing data checks |
4 | Kaplan-Meier Analysis | Non-parametric survival estimates by site |
5 | Cox Model (Simple) | Estimates hazard ratios for clearance with Site only |
6 | Multicollinearity Check | Detects correlated predictors using VIF |
7 | Penalized Cox (Lasso) | Regularization to handle many or correlated predictors |
8 | Fine-Gray Model | Competing risks approach for clearance vs. death |
9 | Visualization | Visually compares survival curves by site |
10 | Recommendations/Checks | Influential points, outliers, and missing data |
11 | End of Script | Wrap-up, optional housekeeping |
Note: The code presented here is R-based and uses packages such as survival, survminer, cmprsk, dplyr, ggplot2, and glmnet. Adjust package calls as needed for your local environment.
# -------------------------------------------
# Survival Analysis on Quarantine Clearance
# Comparing Site-Specific Clearance: Stool, Urine, Sputum, Blood
# © 2024 Hyunsuk Frank Roh, MD. All rights reserved.
# -------------------------------------------
# 0. Setup: Install and Load Required Packages
install_and_load <- function(package) {
if (!require(package, character.only = TRUE)) {
install.packages(package, dependencies = TRUE)
library(package, character.only = TRUE)
}
}
required_packages <- c("survival", "survminer", "cmprsk", "dplyr", "readr",
"ggplot2", "patchwork", "forcats", "reshape2", "glmnet", "car")
# Load dplyr early to avoid conflicts
suppressMessages(library(dplyr))
invisible(lapply(required_packages, install_and_load))
Interpretation:
- This section ensures that all required packages for reading data, plotting, modeling, and statistical testing are present and loaded.
- Automated checks help avoid missing dependencies during runtime.
# 1. Define Output Directory
desktop_path <- file.path(Sys.getenv("USERPROFILE"), "Desktop")
output_dir <- file.path(desktop_path, "Survival_Analysis_Images")
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
message(paste("Created directory:", output_dir))
} else {
message(paste("Directory already exists:", output_dir))
}
Interpretation:
- Specifies where plots and model summaries will be stored.
- Recursive creation of directories ensures sub-directories exist when writing outputs.
# 2. Data Import and Preparation
data_path <- file.path(desktop_path, "dataset05.csv")
my_data <- read_csv(data_path, locale = locale(encoding = "UTF-8"))
cat("First few rows of the dataset:\n")
print(head(my_data))
cat("Structure of the dataset:\n")
str(my_data)
# Convert variables to appropriate data types
my_data <- my_data %>%
mutate(
PatientID = as.factor(PatientID),
Pathogen = as.factor(Pathogen),
Site = as.factor(Site),
Start_Date = as.Date(Start_Date, format = "%m/%d/%Y"),
Event_Date = as.Date(Event_Date, format = "%m/%d/%Y"),
Event_Type = as.factor(Event_Type),
Event_Code = as.numeric(Event_Code),
Age = as.numeric(Age),
Gender = as.factor(Gender),
FactorA = as.factor(FactorA),
FactorB = as.factor(FactorB),
FactorC = factor(FactorC, levels = c("Low", "Medium", "High"), ordered = TRUE)
)
cat("Structure after type conversions:\n")
str(my_data)
Interpretation:
- Reads the CSV file containing survival (clearance) data and sets correct data types for key variables.
- Event_Code should indicate whether an event (clearance) occurred (e.g., 1 = clearance, 0 = censored, 2 = competing event if used).
- This step ensures subsequent survival analysis is consistent with R’s expectations for times, events, and factors.
# 3. Exploratory Data Analysis (EDA)
cat("Summary statistics:\n")
print(summary(my_data))
cat("Missing values per column:\n")
print(sapply(my_data, function(x) sum(is.na(x))))
cat("Event distribution across sites:\n")
print(table(my_data$Site, my_data$Event_Code))
Interpretation:
- Summary statistics and missing value checks identify potential data issues.
- A two-way table of Site vs. Event_Code reveals how many clearance events occurred at each site.
# 4. Kaplan-Meier Plots Stratified by Site
if (!"Time_to_Event" %in% colnames(my_data)) {
my_data <- my_data %>%
mutate(Time_to_Event = as.numeric(difftime(Event_Date, Start_Date, units = "days")))
}
if (any(my_data$Time_to_Event <= 0, na.rm = TRUE)) {
cat("There are observations with non-positive Time_to_Event. These will be removed.\n")
my_data <- my_data %>%
filter(Time_to_Event > 0)
}
# Create the survival object
surv_object_site <- Surv(time = my_data$Time_to_Event, event = my_data$Event_Code == 1)
# Fit Kaplan-Meier survival curves stratified by Site
km_fit_site <- survfit(surv_object_site ~ Site, data = my_data)
# Plot the Kaplan-Meier curves
km_plot <- ggsurvplot(
km_fit_site,
data = my_data,
risk.table = TRUE,
pval = TRUE, # Adds p-value from log-rank test
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Probability of Clearance",
title = "Kaplan-Meier Curves Stratified by Site",
legend.title = "Site",
palette = scales::hue_pal()(length(levels(my_data$Site)))
)
print(km_plot)
ggsave(
filename = file.path(output_dir, "Kaplan_Meier_Curves_Stratified_by_Site.png"),
plot = km_plot$plot, width = 8, height = 6
)
ggsave(
filename = file.path(output_dir, "Kaplan_Meier_Risk_Table_Stratified_by_Site.png"),
plot = km_plot$table, width = 8, height = 3
)
Mathematical Note:
- The Kaplan-Meier approach estimates \( \hat{S}(t) \) (the probability that clearance has not yet occurred by time \( t \)).
- Stratification by Site compares different clearance curves. A log-rank test checks if the curves are significantly different.

Interpretation:
- A steeper curve indicates faster clearance.
- The p-value clarifies whether differences in clearance times across sites are statistically significant.
# 5. Cox Proportional Hazards Model Including Site
cox_model_simple <- coxph(surv_object_site ~ Site, data = my_data)
cat("Summary of Simplified Cox Model (Only Site):\n")
summary_cox_simple <- summary(cox_model_simple)
print(summary_cox_simple)
sink(file = file.path(output_dir, "Summary_Simplified_Cox_Model.txt"))
print(summary_cox_simple)
sink()
# Test proportional hazards assumption
ph_test_simple <- tryCatch(
cox.zph(cox_model_simple),
error = function(e) {
cat("Error in proportional hazards test:\n", e$message, "\n")
return(NULL)
}
)
if (!is.null(ph_test_simple)) {
cat("Proportional Hazards Test for Simplified Model:\n")
print(ph_test_simple)
sink(file = file.path(output_dir, "PH_Test_Simplified_Cox_Model.txt"))
print(ph_test_simple)
sink()
ph_plot_simple <- ggcoxzph(ph_test_simple)
print(ph_plot_simple)
ggsave(
filename = file.path(output_dir, "Schoenfeld_Residuals_Simplified_Cox_Model.png"),
plot = ph_plot_simple, width = 8, height = 6
)
} else {
cat("Proportional hazards assumption test could not be performed.\n")
}
Mathematical Note:
- The Cox model: \( \lambda_i(t) = \lambda_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip}) \).
- Here, the predictor is only Site (categorical). The exponent of a coefficient, \( \exp(\hat{\beta}) \), is the hazard ratio.
- If \( \exp(\hat{\beta}) > 1 \), that site experiences a faster clearance (on average).
- If \( \exp(\hat{\beta}) < 1 \), that site has a slower clearance rate compared to the reference site.

Interpretation:
- The cox.zph test checks whether the proportional hazards assumption is valid.
- If non-significant, it suggests no major violation of the assumption for that predictor.
# 6. Checking for Multicollinearity
cox_model_full <- coxph(surv_object_site ~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data)
print("Summary of Full Cox Model:")
summary_cox_full <- summary(cox_model_full)
print(summary_cox_full)
sink(file = file.path(output_dir, "Summary_Full_Cox_Model.txt"))
print(summary_cox_full)
sink()
# Compute VIF
if(!is.null(cox_model_full$coefficients)) {
design_matrix <- model.matrix(~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data)[, -1]
lm_fit <- lm(Time_to_Event ~ ., data = as.data.frame(design_matrix))
vif_values <- vif(lm_fit)
print("Variance Inflation Factor (VIF) for Predictors:")
print(vif_values)
write.csv(as.data.frame(vif_values), file = file.path(output_dir, "VIF_Full_Cox_Model.csv"), row.names = TRUE)
} else {
warning("Full Cox model did not converge. Skipping VIF computation.")
}
Mathematical Note:
- The VIF (Variance Inflation Factor) for each predictor \( X_j \) is: \[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \] where \( R_j^2 \) is the coefficient of determination when \( X_j \) is regressed on all other predictors.
- A high VIF (commonly > 5 or 10) indicates multicollinearity.

Interpretation:
- Identifying collinear factors ensures stable model estimates.
- High collinearity in a Cox model can lead to large standard errors for coefficient estimates.
# 7. Penalized Cox Regression (Lasso)
my_data_pen <- my_data %>%
select(Time_to_Event, Event_Code, Site, FactorA, FactorB, FactorC, Age, Gender) %>%
drop_na()
surv_object_pen <- Surv(time = my_data_pen$Time_to_Event, event = my_data_pen$Event_Code == 1)
x_pen <- model.matrix(~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data_pen)[, -1]
y_pen <- surv_object_pen
set.seed(123)
cv_fit <- cv.glmnet(x_pen, y_pen, family = "cox", alpha = 1, standardize = TRUE)
# Save the cross-validation curve; base-graphics output must go through a
# graphics device (plot() returns NULL, so ggsave() cannot be used here)
png(file.path(output_dir, "Penalized_Cox_CV_Plot.png"), width = 800, height = 600)
plot(cv_fit)
title("Cross-Validation for Penalized Cox Regression", line = 2.5)
dev.off()
best_lambda <- cv_fit$lambda.min
print(paste("Best lambda selected by cross-validation:", best_lambda))
penalized_cox <- glmnet(x_pen, y_pen, family = "cox", alpha = 1, lambda = best_lambda, standardize = TRUE)
print("Coefficients from Penalized Cox Model:")
penalized_cox_coef <- coef(penalized_cox)
print(penalized_cox_coef)
write.csv(as.data.frame(as.matrix(penalized_cox_coef)),
file = file.path(output_dir, "Penalized_Cox_Model_Coefficients.csv"),
row.names = TRUE)
Mathematical Note:
- Penalized Cox adds an \( L_1 \) penalty term \( \alpha \lambda \|\beta\|_1 \) (when \( \alpha=1 \)) to the partial likelihood objective. This is known as Lasso, encouraging sparse solutions (some coefficients shrink to 0).

Interpretation:
- Lasso helps in variable selection and reducing overfitting, especially when many predictors might be collinear or the sample size is small.
# 8. Fine-Gray Model for Competing Risks
# Event_Code == 1: Clearance (event of interest)
# Event_Code == 2: Some competing event (e.g., Death)
my_data_fg <- my_data %>%
mutate(
Competing_Event = case_when(
Event_Code == 1 ~ 0, # Event of interest
Event_Code == 2 ~ 1, # Competing event
TRUE ~ NA_real_
)
)
na_count_fg <- sum(is.na(my_data_fg$Competing_Event))
print(paste("Number of NA values in Competing_Event:", na_count_fg))
write.csv(data.frame(Column = "Competing_Event", NA_Count = na_count_fg),
file = file.path(output_dir, "NA_Counts_Competing_Event.csv"),
row.names = FALSE)
my_data_clean_fg <- my_data_fg %>%
filter(!is.na(Competing_Event)) %>%
drop_na(Site, FactorA, FactorB, FactorC, Age, Gender, Time_to_Event, Competing_Event)
print("Dimensions after cleaning for Fine-Gray model:")
print(dim(my_data_clean_fg))
write.csv(data.frame(Dimensions = paste(dim(my_data_clean_fg), collapse = "x")),
file.path(output_dir, "Fine_Gray_Clean_Data_Dimensions.csv"),
row.names = FALSE)
covariates_fg <- model.matrix(~ Site + FactorA + FactorB + FactorC + Age + Gender, data = my_data_clean_fg)[, -1]
print(paste("Number of covariates:", ncol(covariates_fg)))
print(paste("Number of observations:", nrow(my_data_clean_fg)))
write.csv(data.frame(Covariates = ncol(covariates_fg), Observations = nrow(my_data_clean_fg)),
file.path(output_dir, "Covariate_Matrix_Dimensions_Fine_Gray.csv"),
row.names = FALSE)
if(nrow(covariates_fg) == nrow(my_data_clean_fg)) {
fg_model_site <- try(crr(
ftime = my_data_clean_fg$Time_to_Event,
fstatus = my_data_clean_fg$Competing_Event,
cov1 = covariates_fg
), silent = TRUE)
if(class(fg_model_site) != "try-error") {
print("Summary of Fine-Gray Model:")
summary_fg <- summary(fg_model_site)
print(summary_fg)
sink(file = file.path(output_dir, "Summary_Fine_Gray_Model.txt"))
print(summary_fg)
sink()
} else {
warning("Fine-Gray model failed to converge. Check data for issues.")
error_message <- "Fine-Gray model failed to converge. Check data for issues."
write.csv(data.frame(Error = error_message),
file.path(output_dir, "Fine_Gray_Model_Error.csv"),
row.names = FALSE)
}
} else {
stop("Mismatch in the number of rows between covariates and survival data for Fine-Gray model.")
}
Mathematical Note:
- Fine and Gray (1999) proposed modeling the subdistribution hazard of a particular event type in the presence of competing events.
- The subdistribution hazard function is different from the cause-specific hazard: it explicitly accounts for individuals who have experienced the competing event but remain “at risk” in the subdistribution sense.

Interpretation:
- Use this approach if competing events (like death) might preclude observing clearance.
- The model yields subdistribution hazard ratios, interpreted as the effect of covariates on the cumulative incidence of clearance.
# 9. Visualizing Survival Differences Across Sites
events_per_site <- my_data %>%
group_by(Site) %>%
summarize(Events = sum(Event_Code == 1, na.rm = TRUE))
print("Number of events per site:")
print(events_per_site)
write.csv(events_per_site, file.path(output_dir, "Events_Per_Site.csv"), row.names = FALSE)
valid_sites <- events_per_site %>%
filter(Events > 0) %>%
pull(Site)
my_data_valid <- my_data %>%
filter(Site %in% valid_sites)
surv_object_site_valid <- Surv(time = my_data_valid$Time_to_Event, event = my_data_valid$Event_Code == 1)
km_fit_site_valid <- survfit(surv_object_site_valid ~ Site, data = my_data_valid)
km_plot_valid <- ggsurvplot(
km_fit_site_valid,
data = my_data_valid,
risk.table = TRUE,
pval = TRUE,
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Survival Probability",
title = "Survival Curves by Site",
legend.title = "Site",
legend.labs = levels(my_data_valid$Site),
palette = c("#E7B800", "#2E9FDF", "#FC4E07", "#00BA38"),
ggtheme = theme_minimal()
)
print(km_plot_valid)
ggsave(filename = file.path(output_dir, "Kaplan_Meier_Curves_By_Site_Valid_Sites.png"),
plot = km_plot_valid$plot, width = 8, height = 6)
ggsave(filename = file.path(output_dir, "Kaplan_Meier_Risk_Table_By_Site_Valid_Sites.png"),
plot = km_plot_valid$table, width = 8, height = 3)
obs_per_site <- my_data_valid %>%
group_by(Site) %>%
summarize(Count = n())
print("Number of observations per valid site:")
print(obs_per_site)
write.csv(obs_per_site, file.path(output_dir, "Observations_Per_Valid_Site.csv"), row.names = FALSE)
if(all(obs_per_site$Count >= 2)) {
km_facet_plot <- ggsurvplot_facet(
km_fit_site_valid,
data = my_data_valid,
facet.by = "Site",
nrow = 2,
ncol = 2,
risk.table = TRUE,
pval = FALSE,
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Survival Probability",
title = "Faceted Kaplan-Meier Curves by Site",
ggtheme = theme_minimal()
)
print(km_facet_plot)
ggsave(filename = file.path(output_dir, "Faceted_Kaplan_Meier_Curves_By_Site.png"),
plot = km_facet_plot$plot, width = 12, height = 8)
ggsave(filename = file.path(output_dir, "Faceted_Kaplan_Meier_Risk_Table_By_Site.png"),
plot = km_facet_plot$table, width = 12, height = 6)
} else {
warning("Not all sites have enough observations for faceted plots. Creating separate plots for each site.")
library(patchwork)
plots <- list()
risk_tables <- list()
for (site in levels(my_data_valid$Site)) {
fit_site <- survfit(Surv(Time_to_Event, Event_Code == 1) ~ 1, data = my_data_valid %>% filter(Site == site))
p <- ggsurvplot(
fit_site,
data = my_data_valid,
risk.table = TRUE,
pval = FALSE,
conf.int = TRUE,
xlab = "Time in Days",
ylab = "Survival Probability",
title = paste("Survival Curve for", site),
ggtheme = theme_minimal()
)
plots[[site]] <- p$plot
risk_tables[[site]] <- p$table
}
combined_plot <- wrap_plots(plots, ncol = 2)
print(combined_plot)
ggsave(filename = file.path(output_dir, "Combined_Kaplan_Meier_Curves_By_Site.png"),
plot = combined_plot, width = 12, height = 8)
for (site in levels(my_data_valid$Site)) {
ggsave(filename = file.path(output_dir, paste0("Risk_Table_", site, ".png")),
plot = risk_tables[[site]], width = 8, height = 3)
}
}
Interpretation:
- Faceted or combined survival plots highlight clearance patterns for each site.
- This step is crucial for visual diagnosis of clearance timelines across multiple categories.
# 10. Additional Recommendations and Checks
# 10.1. Check for Influential Observations
influence_cox <- residuals(cox_model_simple, type = "dfbeta")
if(!is.null(influence_cox)) {
dfbeta_df <- as.data.frame(influence_cox)
dfbeta_df$PatientID <- rownames(dfbeta_df)
library(reshape2)
dfbeta_melt <- melt(dfbeta_df, id.vars = "PatientID")
dfbeta_plot <- ggplot(dfbeta_melt, aes(x = PatientID, y = value, color = variable)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "DFBeta for Simplified Cox Model", x = "Patient ID", y = "DFBeta") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(dfbeta_plot)
ggsave(filename = file.path(output_dir, "DFBeta_Simplified_Cox_Model.png"),
plot = dfbeta_plot, width = 12, height = 6)
} else {
warning("No dfbeta available for the simplified Cox model.")
}
# 10.2. Assess Outliers in Continuous Variables
age_boxplot <- ggplot(my_data, aes(x = "", y = Age)) +
geom_boxplot(fill = "#2E9FDF") +
labs(title = "Boxplot of Age", y = "Age") +
theme_minimal()
print(age_boxplot)
ggsave(filename = file.path(output_dir, "Boxplot_Age.png"),
plot = age_boxplot, width = 6, height = 6)
# 10.3. Handling Missing Data
# Consider advanced imputation if missingness is not negligible.
Interpretation:
- DFBeta plots identify influential points that might unduly affect estimates.
- The boxplot of Age quickly spots outliers or unusual distributions.
# 11. End of Script
# Most plots and model summaries have already been saved in previous steps.
# Additional saves can be added if needed.
Written on December 24th, 2024