StoryCraft AI - Infinite Possibilities in Every Story

Select a model, enter your seed text, and generate a creative story that inspires your imagination.

About

Welcome to StoryCraft AI – Infinite Possibilities in Every Story – a cutting-edge platform where boundless creativity meets advanced deep learning. Our website is the result of countless hours of research and development, training powerful deep learning models from scratch to generate compelling, creative stories. We leverage the power of GRU, LSTM, Bidirectional-GRU, and Bidirectional-LSTM architectures to bring you an interactive experience that transforms your seed text into imaginative narratives.

At StoryCraft AI, you can:

  • Choose Your Model: Select from a range of meticulously trained models, each offering its unique style of story generation.
  • Visualize the Journey: Dive into detailed training history graphs that showcase the evolution of our models.
  • Access the Source: Download model files and view the underlying code to understand the technology behind our creative engine.
  • Experience Innovation: Enjoy a premium, fully interactive interface designed to spark your creativity and inspire new ideas.

Our commitment to excellence and transparency means we not only deliver exceptional stories but also share the full depth of our development process. Join us on this journey where technology and imagination come together to unlock infinite possibilities in storytelling.

Story Generation
Model Training History Graphs: Loss and Accuracy

Available Models

Frequently Asked Questions

We form training sequences by sliding a window over the tokenized text. Each sequence contains 51 tokens: the first 50 tokens are used as input (X) and the 51st token is the target (y). Mathematically, if T = [t₁, t₂, ..., tₙ] is the tokenized text, then for each valid index i (i ≥ 51), the sequence Sᵢ = [tᵢ₋₅₀, ..., tᵢ₋₁, tᵢ]. This fixed window size is a trade-off: a longer context may capture more dependencies but increases model complexity and computational cost.
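A minimal sketch of this windowing step in Python (the helper name and toy token stream below are illustrative, not our exact code):

```python
import numpy as np

SEQ_LEN = 51  # 50 input tokens plus 1 target token

def build_sequences(token_ids, seq_len=SEQ_LEN):
    """Slide a window of length seq_len over the tokenized text."""
    windows = np.array([token_ids[i - seq_len:i]
                        for i in range(seq_len, len(token_ids) + 1)])
    X, y = windows[:, :-1], windows[:, -1]  # first 50 tokens -> input, 51st -> target
    return X, y

# Toy token stream of 100 integer IDs
X, y = build_sequences(list(range(1, 101)))
print(X.shape, y.shape)  # (50, 50) (50,)
```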
The vocabulary size is computed as the number of unique words in the training set plus one (i.e., vocabulary_size = len(tokenizer.word_index) + 1). This addition accounts for the fact that word indices typically start at 1 (with 0 reserved for padding), ensuring that every word has a unique index. This size defines the dimensions of both the embedding layer (input_dim) and the output layer of the network.
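As an illustration, assuming the Keras Tokenizer is used for tokenization (the toy corpus below is made up):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["once upon a time there was a model that told stories"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# Word indices start at 1 and 0 is reserved for padding,
# so the effective vocabulary size is len(word_index) + 1.
vocabulary_size = len(tokenizer.word_index) + 1
print(vocabulary_size)  # 11 for this toy corpus
```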
One-hot encoding transforms each target word into a binary vector with length equal to the vocabulary size. For example, if a word has index i, its one-hot encoded vector is all zeros except for a 1 in the i-th position. Mathematically, this allows us to use the categorical crossentropy loss function, defined as L = -Σ (y_true * log(y_pred)), to measure the divergence between the predicted probability distribution (from the softmax layer) and the true distribution.
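A quick sketch using Keras' to_categorical, with toy target indices and a made-up vocabulary size:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

vocabulary_size = 8           # illustrative value
y = np.array([3, 1, 5])       # integer indices of the target words

# Each target becomes a binary vector of length vocabulary_size
# with a single 1 at the word's index.
y_one_hot = to_categorical(y, num_classes=vocabulary_size)
print(y_one_hot)
# [[0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 0. 0.]]
```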
Categorical crossentropy is ideal for multi-class classification problems like next-word prediction. It computes the loss as L = -Σᵢ (y_trueᵢ * log(y_predᵢ)), where the sum runs over every word i in the vocabulary. This formulation penalizes predictions that diverge from the one-hot true label, effectively training the model to assign a higher probability to the correct next word.
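A sketch of how this loss is typically wired into such a model in Keras; the layer sizes and vocabulary size here are illustrative, not our exact training configuration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocabulary_size = 5000   # illustrative
seq_length = 50

model = Sequential([
    Input(shape=(seq_length,), dtype="int32"),
    Embedding(input_dim=vocabulary_size, output_dim=100),
    LSTM(128),
    Dense(vocabulary_size, activation="softmax"),  # one probability per word
])

# Categorical crossentropy compares the softmax output with the one-hot target.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```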
Dropout works by randomly setting a fraction of input units to zero during each training iteration. Mathematically, if p is the dropout rate, each neuron is kept with probability (1 − p). This randomization forces the network to learn redundant representations and prevents the co-adaptation of neurons. It can be seen as training an ensemble of sub-models, which collectively reduce overfitting.
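A NumPy sketch of the mechanism; the 1/(1 − p) rescaling is the common "inverted dropout" convention and is an extra detail beyond the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.2, training=True):
    """Zero each unit with probability p; scale survivors by 1/(1 - p)
    so the expected activation is unchanged (inverted dropout)."""
    if not training:
        return activations
    keep_mask = rng.random(activations.shape) >= p   # kept with probability 1 - p
    return activations * keep_mask / (1.0 - p)

h = np.ones((1, 10))
print(dropout(h, p=0.3))  # roughly 70% of units survive, scaled by 1/0.7
```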
GRU (Gated Recurrent Unit) uses update and reset gates to control the flow of information. Its core equations include:

zₜ = σ(Wzxₜ + Uzhₜ₋₁) (update gate) and rₜ = σ(Wrxₜ + Urhₜ₋₁) (reset gate). Then, the candidate activation is computed as h̃ₜ = tanh(Wxₜ + U(rₜ ⊙ hₜ₋₁)), and the new hidden state is hₜ = (1 - zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ.
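A single GRU time step written directly from these equations; bias terms are omitted to match the formulas, and the toy shapes below are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: update gate, reset gate, candidate, new hidden state."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))    # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_tilde        # new hidden state

# Toy dimensions: input size 4, hidden size 3
rng = np.random.default_rng(0)
x_t, h_prev = rng.standard_normal(4), np.zeros(3)
Wz, Wr, W = (rng.standard_normal((3, 4)) for _ in range(3))
Uz, Ur, U = (rng.standard_normal((3, 3)) for _ in range(3))
print(gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U))
```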

In contrast, LSTM (Long Short-Term Memory) introduces three gates (input, forget, output) along with a cell state to manage information flow:

iₜ = σ(Wixₜ + Uihₜ₋₁), fₜ = σ(Wfxₜ + Ufhₜ₋₁), and oₜ = σ(Woxₜ + Uohₜ₋₁) control the input, forgetting, and output of information respectively. The cell state is updated as cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ tanh(Wcxₜ + Uchₜ₋₁), and the new hidden state is hₜ = oₜ ⊙ tanh(cₜ). These mathematical formulations allow the LSTM to better capture long-range dependencies.
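The corresponding LSTM step, again written directly from the equations with biases omitted and illustrative toy shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    """One LSTM step: input, forget, and output gates plus the cell state."""
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)                        # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)                        # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ x_t + Uc @ h_prev)   # cell state update
    h_t = o_t * np.tanh(c_t)                                     # new hidden state
    return h_t, c_t

# Toy dimensions: input size 4, hidden size 3
rng = np.random.default_rng(1)
x_t = rng.standard_normal(4)
h_prev, c_prev = np.zeros(3), np.zeros(3)
Wi, Wf, Wo, Wc = (rng.standard_normal((3, 4)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.standard_normal((3, 3)) for _ in range(4))
print(lstm_step(x_t, h_prev, c_prev, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc))
```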
Bidirectional models process input sequences in both forward and backward directions. Mathematically, this means that for each time step, the hidden state is computed twice—once from past to future and once from future to past. The outputs are then concatenated (or summed), providing a richer representation that captures context from both directions. This dual perspective can improve prediction accuracy by incorporating more comprehensive contextual information.
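In Keras this amounts to wrapping a recurrent layer in Bidirectional; the snippet below only illustrates how the merged output doubles in size (batch size, sequence length, and layer widths are toy values):

```python
import tensorflow as tf
from tensorflow.keras.layers import GRU, Bidirectional

# A batch of 2 sequences, 50 time steps, embedding dimension 100 (illustrative).
x = tf.random.normal((2, 50, 100))

forward_only = GRU(128)(x)
both_ways = Bidirectional(GRU(128), merge_mode="concat")(x)

print(forward_only.shape)  # (2, 128)
print(both_ways.shape)     # (2, 256): forward and backward states concatenated
```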
Categorical crossentropy is used because it quantifies the difference between the predicted probability distribution and the true one-hot encoded vector. It is defined as:

L = -∑ (y_true * log(y_pred))

This loss function penalizes the model when the predicted probability for the correct word is low. By minimizing this loss during training, the model learns to assign higher probabilities to the correct next word.
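A toy NumPy calculation that makes the penalty concrete; the probability vectors are made up:

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """L = -sum(y_true * log(y_pred)) for a single one-hot target."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0, 0, 1, 0, 0])                      # correct next word is index 2

confident = np.array([0.02, 0.03, 0.90, 0.03, 0.02])    # high probability on the right word
uncertain = np.array([0.30, 0.30, 0.10, 0.20, 0.10])    # low probability on the right word

print(categorical_crossentropy(y_true, confident))  # ~0.105
print(categorical_crossentropy(y_true, uncertain))  # ~2.303
```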
During training, the model uses backpropagation through time (BPTT) to compute gradients of the loss function with respect to its weights. Gradient descent (or a variant like Adam) then updates each weight by stepping in the direction of the negative gradient, which reduces the loss. Mathematically, for each weight w, the update rule is:

w ← w - η * ∂L/∂w

where η is the learning rate and ∂L/∂w is the gradient of the loss L with respect to w.
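A one-variable sketch of this update rule on a toy quadratic loss (the loss function and learning rate are illustrative):

```python
# Toy gradient descent for L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
learning_rate = 0.1   # eta
w = 0.0

for step in range(25):
    grad = 2.0 * (w - 3.0)         # dL/dw
    w = w - learning_rate * grad   # w <- w - eta * dL/dw

print(round(w, 4))  # approaches 3.0, the minimizer of the loss
```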