In the spirit of public code reviews, then, this is a place to put procedural code and have people rip it to shreds. This is not the place for a ParadigmPissingMatch: it's a given that the code will be procedural/relational, and "make it OO" is not an acceptable answer. It is a place to learn better coding techniques, many of which are as applicable to OO code as to procedural code.
This is a script to read in the LINKS v2.0 database and assorted HTML pages, and migrate the data into the RDBMS schema used in our new system. It's created for http://www.fictionalley.org/, a HarryPotter fanfiction site organized around the 4 Hogwarts houses. Each "house" stores a different type of story.
The database files are pipe-delimited ASCII text files. Schema for the links.db file is:
linkID|title|URL|Date Created|Category Name|Author Name|Author E-mail|Hit count|5 fields that are unimportant|keywords
Schema for categories.db is:
categoryID|Category Name|Description|2 unimportant fields|Keywords|2 unimportant fields|Category Display Name
Both authors and stories are represented as categories, with stories as subcategories of authors. Subcategories are represented in LINKS by a / in the category name; this is the best way to tell them apart. The fields of importance also differ between the two types of category. Authors have only ID/Name/Description (list of stories by them)/Display Name. Stories have the summary as the Description, additionally have Keywords, and their Display Name is in the format "Title by Author".
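For illustration, the two category types can be told apart mechanically. A minimal sketch (the helper names are mine; the sample values come from the example URLs below):

 <?php
 // Authors are top-level categories; stories are "Author/Story" subcategories.
 function isAuthorCategory($categoryName) {
     return strpos($categoryName, '/') === false;
 }
 // Story display names look like "Some Title by Some Author".
 function titleFromDisplayName($displayName) {
     $parts = explode(' by ', $displayName);
     return $parts[0];
 }
 var_dump(isAuthorCategory('The_Gentleman'));                  // bool(true)  - an author
 var_dump(isAuthorCategory('The_Gentleman/Strange_Rights'));   // bool(false) - a story
 echo titleFromDisplayName('Strange Rights by The_Gentleman'); // "Strange Rights"
 ?>

The real script does the same thing in isAuthor() and extractTitle() in migrate_helpers.inc.php below.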
An example fic file can be found at http://www.schnoogle.com/authors/nostrademons/TAWGA01.html. The filenames match up with the URLs in links.db. Basically every field in the header must be parsed.
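To give a feel for the header parsing (the real code is extractHeader() in fic_parse.inc.php below): each header field sits on its own line, with the keyword wrapped in a bold tag. A rough, runnable sketch with a made-up header line:

 <?php
 // One header line, roughly as it appears in a fic file (illustrative only).
 $headerLine = '<b>Rating:</b> PG-13' . "\n";
 // Lazy-match everything between the bold keyword and the end of the line.
 if (preg_match('/Rating:<\/b>\s(.+?)\n/is', $headerLine, $m)) {
     echo $m[1];   // "PG-13"
 }
 ?>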
The database schema is included in its current form (likely not the final form when the entire system goes live). This is working code and has not been too difficult to maintain (I've had to change the schema several times since it was first developed, to accommodate other areas of the system).
Also, this script has to process approximately 1.5 GB of data. Thus, it's not practical to store everything in RAM and write it out when we have the full data. I designed it so the most memory-intensive bits - the fic text itself and the parsed hash - can be disposed of on each iteration through the main loop. Last time I ran it on the full data set, it ran in about an hour on my P3 850 MHz. If someone can find a way to significantly cut the running time, that'd be helpful too.
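The shape of that main loop, boiled down (a sketch of the pattern only; the real loop is handleLinks() in migrate.php below, and $houseDir stands in for the current house's directory):

 <?php
 foreach ($links as $link) {          // links.db entries: small, kept in RAM
     // The expensive data: the fic text and the parsed hash built from it.
     $fic = parseFic(followLink($link['url'], $houseDir));
     if ($fic) {
         insertChapter($link, $fic);  // written straight to the database
     }
     unset($fic);                     // nothing big survives into the next iteration
 }
 ?>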
My main concerns with it:
I was tempted to zip up the files and post them on my webspace, but I figure this'll be more helpful if people can comment inline. If this gets TooBigToEdit, I can move them. I've cut out a lot of things like constant-valued inserts, because they're large but not very interesting.
-- JonathanTang
(I put the actual text file examples back if you don't mind. I still find it helpful to see an actual example. I put in a few spaces to keep the wiki engine from highlighting everything. And, thanks for the schemas above.)
The links.db file behind http://www.thedarkarts.org/authorLinks/The_Gentleman/Strange_Rights/ looks like this:
The categories.db file behind http://www.thedarkarts.org/authorLinks/The_Gentleman/ looks like this:
Questions:
[File: schema.sql. Improvements to the database schema are welcome too]
-- 'Myrtle' schema, v0.1
-- Created 7/17/03, Jonathan Tang
-- Recommended user grant statement:
-- GRANT ALTER,CREATE,DELETE,DROP,INDEX,INSERT,SELECT,UPDATE on dbname.* TO dbuser@localhost IDENTIFIED BY 'password';
-- Users table. Holds data for a single user of the system, whether author or reader.
DROP TABLE IF EXISTS users;
CREATE TABLE users (
user_id mediumint NOT NULL auto_increment,
is_enabled int(1) NOT NULL default '1',
username varchar(50), -- Username for favorites display; defaults to the vBulletin username
forum_user_id int(10) NOT NULL,
PRIMARY KEY (user_id),
UNIQUE KEY (username)
);
-- Pen names. These are displayed on the site. Max 3/user.
-- The base pen_names table contains only meta-information, because pen names are versioned.
-- This gives us a history of pen names, both so we can track them and in case we get asked
-- where author so-and-so has gone.
DROP TABLE IF EXISTS pen_names;
CREATE TABLE pen_names (
pen_name_id mediumint NOT NULL auto_increment,
user_id mediumint NOT NULL references users(user_id),
email varchar(60) NOT NULL,
display_favorites int(1) NOT NULL default '1',
display_email int(1) NOT NULL default '1',
workflow_code char(3) NOT NULL,
PRIMARY KEY (pen_name_id)
);
DROP TABLE IF EXISTS pen_name_versions;
CREATE TABLE pen_name_versions (
pen_name_version_id mediumint NOT NULL auto_increment,
pen_name_id mediumint NOT NULL references pen_names(pen_name_id),
pen_name varchar(50) NOT NULL,
time_submitted timestamp NOT NULL,
author_comments varchar(255) NOT NULL,
PRIMARY KEY (pen_name_version_id),
UNIQUE KEY (pen_name)
);
-- Workflow is 1:1 with versions, but is created at a different time (only when an approver
-- checks out a fic). The table needs to be split so that we can maintain integrity constraints
-- (not null/references) on fields.
DROP TABLE IF EXISTS pen_name_workflow;
CREATE TABLE pen_name_workflow (
pen_name_version_id mediumint NOT NULL references pen_name_versions(pen_name_version_id),
pen_name_id mediumint NOT NULL references pen_names(pen_name_id),
approver_id mediumint NOT NULL references users(user_id),
time_checked_out timestamp NOT NULL,
time_approved datetime,
approver_comments varchar(255) NOT NULL,
resolution_code char(5),
PRIMARY KEY (pen_name_version_id)
);
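-- To make the split concrete: a version and its (later-created) workflow row join 1:1.
-- A sketch only, not used by the migration script:
--   SELECT v.pen_name, w.resolution_code
--     FROM pen_name_versions v
--     LEFT JOIN pen_name_workflow w ON w.pen_name_version_id = v.pen_name_version_id;
-- The LEFT JOIN is what lets a version exist before an approver has checked it out.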
-- Stories. story_id is not auto_increment because the new story id must be available
-- *before* the submission is complete, so that it's possible to add ships/chars to it.
-- A sequence table (provided automatically by some DB APIs) is an easier way to do this.
DROP TABLE IF EXISTS stories;
CREATE TABLE stories (
story_id mediumint NOT NULL,
pen_name_id mediumint NOT NULL references pen_names(pen_name_id),
house_code char(3) NOT NULL references house_values(house_code),
rating_code char(5) NOT NULL references rating_values(rating_code),
language_code char(3) NOT NULL references language_values(language_code),
era_code char(20) NOT NULL references era_values(era_code),
fandom_era_id tinyint NOT NULL references fandom_era_values(fandom_era_id),
other_spoilers varchar(255) NOT NULL, -- but empty strings "" allowed
is_published tinyint(1) NOT NULL default '0',
is_completed tinyint(1) NOT NULL default '0',
allow_pdf tinyint(1) NOT NULL default '0',
review_forum_id smallint NOT NULL,
workflow_code char(3) NOT NULL,
PRIMARY KEY (story_id)
);
-- The story sequence number. This is how we keep track of the next story ID.
-- Imperfect solution, but it's the best of several imperfect solutions.
DROP TABLE IF EXISTS stories_seq;
CREATE TABLE stories_seq (
next_id mediumint NOT NULL
);
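-- Sketch of one way to claim the next ID atomically under MySQL (the migration script
-- instead simulates the sequence in PHP; this also assumes the single row has been
-- seeded, e.g. with next_id = 1):
--   UPDATE stories_seq SET next_id = LAST_INSERT_ID(next_id) + 1;
--   SELECT LAST_INSERT_ID();  -- the ID just claimed; also visible via mysql_insert_id()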
-- Story submissions. This exists solely so we can track when a submission was begun and
-- know which ones are "in progress". It gets cleaned out periodically: any old
-- submissions, and the ships/chars entered for them, are deleted (see the sketch below).
DROP TABLE IF EXISTS story_submissions;
CREATE TABLE story_submissions (
story_id mediumint NOT NULL,
time_started timestamp NOT NULL,
PRIMARY KEY (story_id, time_started)
);
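-- A sketch of what that periodic cleanup might look like (illustrative only; "older
-- than a week" is an assumption, and completed submissions are assumed to have been
-- removed from story_submissions already):
--   DELETE FROM story_characters WHERE story_id IN
--     (SELECT story_id FROM story_submissions WHERE time_started < NOW() - INTERVAL 7 DAY);
--   -- ...repeat for story_ships and story_spoilers, then:
--   DELETE FROM story_submissions WHERE time_started < NOW() - INTERVAL 7 DAY;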
-- Story versions.
DROP TABLE IF EXISTS story_versions;
CREATE TABLE story_versions (
story_version_id int NOT NULL auto_increment,
story_id mediumint NOT NULL references stories(story_id),
title varchar(80) NOT NULL,
summary text NOT NULL,
time_submitted timestamp NOT NULL,
author_comments varchar(255) NOT NULL,
PRIMARY KEY (story_version_id)
);
-- Story workflow. Same dichotomy as pen_name_versions/workflow
DROP TABLE IF EXISTS story_workflow;
CREATE TABLE story_workflow (
story_version_id int NOT NULL references story_versions(story_version_id),
story_id mediumint NOT NULL references stories(story_id),
approver_id mediumint NOT NULL references users(user_id),
time_checked_out timestamp NOT NULL,
time_approved datetime,
approver_comments varchar(255) NOT NULL,
resolution_code char(5) NOT NULL,
PRIMARY KEY (story_version_id)
);
-- Chapters
DROP TABLE IF EXISTS chapters;
CREATE TABLE chapters (
chapter_id int NOT NULL auto_increment,
story_id mediumint NOT NULL references stories(story_id),
hits int NOT NULL default '0',
published_date date,
modified_date date,
sort_order tinyint unsigned NOT NULL,
autonumber tinyint(1) NOT NULL default '1',
review_thread_id int,
workflow_code char(3),
PRIMARY KEY (chapter_id)
);
-- Chapter content. Holds all the actual data. 1:1 with chapter_versions (many:1 with chapters)
DROP TABLE IF EXISTS chapter_content;
CREATE TABLE chapter_content (
chapter_id int NOT NULL references chapters(chapter_id),
chapter_version_id int NOT NULL references chapter_versions(chapter_version_id),
chapter_title varchar(80) NOT NULL,
chapter_summary text NOT NULL,
chapter_disclaimer text NOT NULL,
chapter_pre_notes text NOT NULL,
chapter_post_notes text NOT NULL,
chapter_body mediumtext NOT NULL,
is_published tinyint(1) NOT NULL default '0',
PRIMARY KEY (chapter_id, chapter_version_id)
);
-- Chapter versions. Holds metadata for submissions, notes, rejections, etc.
-- Many:1 with chapters.
DROP TABLE IF EXISTS chapter_versions;
CREATE TABLE chapter_versions (
chapter_id int NOT NULL references chapters(chapter_id),
chapter_version_id int NOT NULL auto_increment,
time_submitted timestamp NOT NULL,
author_comments varchar(255) NOT NULL,
word_count mediumint NOT NULL,
character_count mediumint NOT NULL,
PRIMARY KEY (chapter_id, chapter_version_id)
);
-- Chapter workflow. See pen_name_versions/workflow
-- 1:1 with chapter_versions, but created at a different time
DROP TABLE IF EXISTS chapter_workflow;
CREATE TABLE chapter_workflow (
chapter_id int NOT NULL references chapters(chapter_id),
chapter_version_id int NOT NULL references chapter_versions(chapter_version_id),
time_checked_out timestamp NOT NULL,
time_approved datetime NOT NULL,
approver_comments mediumtext NOT NULL,
resolution_code char(5) NOT NULL,
PRIMARY KEY (chapter_id, chapter_version_id)
);
-- Story characters. Allows us to have multiple main characters per story
DROP TABLE IF EXISTS story_characters;
CREATE TABLE story_characters (
story_id mediumint NOT NULL references stories(story_id),
character_code char(5) NOT NULL references character_values(character_code),
PRIMARY KEY (story_id, character_code)
);
-- Story ships. Multiple ships per story
DROP TABLE IF EXISTS story_ships;
CREATE TABLE story_ships (
story_id mediumint NOT NULL references stories(story_id),
ship_char1_code char(5) NOT NULL references character_values(character_code),
ship_char2_code char(5) NOT NULL references character_values(character_code),
PRIMARY KEY (story_id, ship_char1_code, ship_char2_code)
);
-- Story spoilers.
DROP TABLE IF EXISTS story_spoilers;
CREATE TABLE story_spoilers (
story_id mediumint NOT NULL references stories(story_id),
spoiler_code char(5) NOT NULL references spoiler_values(spoiler_code),
PRIMARY KEY (story_id, spoiler_code)
);
-- Story genres
DROP TABLE IF EXISTS story_genres;
CREATE TABLE story_genres (
story_id mediumint NOT NULL references stories(story_id),
genre_name char(10) NOT NULL references genre_values(genre_name),
rank_order tinyint NOT NULL,
PRIMARY KEY (story_id, genre_name)
);
-- Character values
DROP TABLE IF EXISTS character_values;
CREATE TABLE character_values (
character_code char(5) NOT NULL,
character_name varchar(25) NOT NULL,
sort_order smallint NOT NULL,
PRIMARY KEY (character_code)
);
-- Spoiler values
DROP TABLE IF EXISTS spoiler_values;
CREATE TABLE spoiler_values (
spoiler_code char(5) NOT NULL,
spoiler_name varchar(60) NOT NULL,
spoiler_sort_order tinyint NOT NULL,
PRIMARY KEY (spoiler_code)
);
-- Genre values
DROP TABLE IF EXISTS genre_values;
CREATE TABLE genre_values (
genre_name char(10) NOT NULL,
PRIMARY KEY (genre_name)
);
-- House values
DROP TABLE IF EXISTS house_values;
CREATE TABLE house_values (
house_code char(3) NOT NULL,
house_name varchar(20) NOT NULL,
house_description varchar(255) NOT NULL,
house_color char(6) NOT NULL,
house_rb_url varchar(255) NOT NULL,
house_domain varchar(50) NOT NULL,
house_directory varchar(50) NOT NULL,
house_sort_order tinyint NOT NULL,
PRIMARY KEY (house_code)
);
-- Rating values
DROP TABLE IF EXISTS rating_values;
CREATE TABLE rating_values (
rating_code char(5) NOT NULL,
rating_description text NOT NULL,
rating_sort_order tinyint NOT NULL,
PRIMARY KEY (rating_code)
);
INSERT INTO rating_values (rating_code, rating_sort_order) VALUES
("G", 5),
("PG", 10),
("PG-13", 15),
("R", 20);
-- Language values
DROP TABLE IF EXISTS language_values;
CREATE TABLE language_values (
language_code char(3) NOT NULL,
language_name varchar(30) NOT NULL,
PRIMARY KEY (language_code)
);
-- Era values
DROP TABLE IF EXISTS era_values;
CREATE TABLE era_values (
era_code char(22) NOT NULL,
era_sort_order tinyint NOT NULL,
PRIMARY KEY (era_code)
);
-- Fandom era values
DROP TABLE IF EXISTS fandom_era_values;
CREATE TABLE fandom_era_values (
fandom_era_id tinyint NOT NULL AUTO_INCREMENT,
fandom_era_name varchar(20) NOT NULL,
fandom_era_sort_order tinyint NOT NULL,
PRIMARY KEY (fandom_era_id)
);
[File: migrate.php]
<?php
require_once('./local_config.inc.php');
require_once('migrate_helpers.inc.php');
require_once('links_parse.inc.php');
require_once('fic_parse.inc.php');
require_once('data_structures.inc.php');
require_once('field_listings.inc.php');
require_once('db_utils.inc.php');
$migrate_stats = array( 'total_files' => 0, 'parsable_files' => 0, 'unparsable_files' => 0,
'total_stories' => 0);
/** Overall migration entry point */
function migrate() {
global $currentHouse;
global $HOUSE_DIRS;
$LINKS_PATH = '/cgi-bin/links/admin/data/';
for($currentHouse = 1; $currentHouse < 5; $currentHouse++)
{
$linksDir = $HOUSE_DIRS[$currentHouse] . $LINKS_PATH;
handleCategories($linksDir . 'categories.db');
handleLinks($linksDir . 'links.db');
// Bulk of the insertion takes place within handleLinks
}
updateStories();
}
/** Processes every entry in a links.db file: locate the fic file, parse it, and insert the chapter. */
function handleLinks($filename)
{
global $currentHouse;
global $HOUSE_DIRS;
global $migrate_stats;
$links =& parseLinksDb($filename, 'parseLink');
foreach ($links as $link)
{
// We have all the data available from the LINKs entry
// Now we should follow that, find the corresponding fic
// file, and parse that to get the rest of the data.
$filename = followLink($link['url'], $HOUSE_DIRS[$currentHouse]);
echo EOL . "Working on fic $filename...";
$migrate_stats['total_files']++;
$fic = parseFic($filename);
if($fic)
{ // Only if we could parse it
echo 'Parsed!';
$migrate_stats['parsable_files']++;
insertChapter($link, $fic);
}
else
{
$migrate_stats['unparsable_files']++;
}
}
}
/** Inserts a chapter, creating its parent story first if this is the first chapter we've seen for it. */
function insertChapter($link, $fic)
{
global $storyIds;
if(array_key_exists($link['category'], $storyIds))
{
$storyId = $storyIds[$link['category']];
}
else
{
$storyId = insertStory($link, $fic);
if(!isAuthor($link['category']))
{ // Only save chaptereds
$storyIds[$link['category']] = $storyId;
}
}
// Chaptering/autonumbering
if(isAuthor($link['category']))
{ // One-shot
$chapter = 1;
$autoNumber = 0; // Irrelevant; no chapters
}
else
{
$chapter = extractChapterNumber($link['url']);
$autoNumber = isAutonumbered($link['title'], $fic['title'], $chapter);
updateStoryInfo($link['category'], $fic['genre'], $fic['subgenre'], $fic['rating']);
}
// Non-autonumbered chapters need chapter titles
if(!$autoNumber)
{
$chapterTitle = $link['title'];
}
else
{
$chapterTitle = ''; // Let them fix it themselves...
}
$record = array(
'story_id' => $storyId,
'hits' => $link['hits'],
'published_date' => convertDate($link['date']),
'modified_date' => convertDate($link['date']),
'sort_order' => $chapter,
'autonumber' => $autoNumber,
'review_thread_id' => $fic['review_thread']
);
$id = insertRecord($record, 'chapters');
$versionId = insertChapterVersion($id, $fic['text']);
insertChapterContent($id, $versionId, $fic, $chapterTitle);
insertChapterWorkflow($id, $versionId, convertDate($link['date']));
}
function insertChapterContent($id, $versionId, $fic, $chapterTitle)
{
$record = array(
'chapter_id' => $id,
'chapter_version_id' => $versionId,
'chapter_title' => $chapterTitle,
'chapter_summary' => $fic['summary'],
'chapter_disclaimer' => $fic['disclaimer'],
'chapter_pre_notes' => $fic['pre_an'],
'chapter_post_notes' => $fic['post_an'],
'chapter_body' => $fic['text']
);
insertRecord($record, 'chapter_content', false);
}
function insertChapterVersion($chapterId, $text)
{
$record = array(
'chapter_id' => $chapterId,
'author_comments' => '',
'word_count' => wordcount($text),
'character_count' => strlen($text)
);
return insertRecord($record, 'chapter_versions');
}
function insertChapterWorkflow($chapterId, $versionId, $datePublished)
{
$record = array(
'chapter_id' => $chapterId,
'chapter_version_id' => $versionId,
'time_approved' => $datePublished,
'approver_comments' => '',
'resolution_code' => 'OS'
);
insertRecord($record, 'chapter_workflow', false);
}
/** Accumulates genre and rating info for a chaptered story; the story row itself is updated at the end of the run. */
function updateStoryInfo($key, $genre, $subgenre, $rating)
{
global $storyUpdates;
global $RATINGS;
$update =& $storyUpdates[$key];
$update->mainGenre = $genre;
$update->secondGenre = $subgenre;
if($RATINGS[$rating] > $update->rating)
{
$update->rating = $RATINGS[$rating];
}
}
function insertStory($link, $fic)
{
global $storyUpdates;
global $penNameIds;
global $currentHouse;
global $RATINGS;
global $HOUSES;
global $migrate_stats;
$authorCat = authorPart($link['category']);
if(array_key_exists($authorCat, $penNameIds))
{
$penNameId = $penNameIds[$authorCat];
}
else
{
$penNameId = insertPenName($link, $fic);
$versionId = insertPenNameVersion($penNameId, $link, $fic);
insertPenNameWorkflow($versionId, $penNameId, $link, $fic);
$penNameIds[$authorCat] = $penNameId;
}
$spoilers = array();
$otherSpoilers = parseSpoilers($fic['spoilers'], $spoilers);
$id = storiesAutoIncrement();
$record = array(
'pen_name_id' => $penNameId,
'story_id' => $id,
'house_code' => $HOUSES[$currentHouse],
'rating_code' => $fic['rating'], // Updated later for chaptereds
'language_code' => 'US', // Default to U.S. English
'era_code' => 'Unspecified Era', // None will be specified at first
'fandom_era_id' => 2, // Shall we do this by date instead?
'other_spoilers' => $otherSpoilers,
'is_published' => 1, // True
'is_completed' => isAuthor($link['category']),
'review_forum_id' => $fic['review_thread'],
'workflow_code' => WORKFLOW_ON_SITE
);
insertRecord($record, 'stories', false);
insertSpoilers($id, $spoilers);
insertCharacters($id, $link['keywords']);
if(isAuthor($link['category']))
{
updateOneShot($link, $fic, $id);
}
else
{
$update =& $storyUpdates[$link['category']];
$update->id = $id;
}
$migrate_stats['total_stories']++;
return $id;
}
/** One-shots have all their data available immediately, so their version, workflow, and genres are written now. */
function updateOneShot($link, $fic, $id) {
$versionId = insertStoryVersion($id, $link['title'], $link['desc']);
insertStoryWorkflow($id, $versionId);
insertGenres($id, $fic['genre'], $fic['subgenre']);
}
/** Maps recognized spoiler names onto codes (appended to &$spoilers); returns the unrecognized ones as a string. */
function parseSpoilers($spoilerString, &$spoilers)
{
global $SPOILERS;
$spoilerArray = explode(', ', $spoilerString);
$otherSpoilers = '';
foreach ($spoilerArray as $spoiler)
{
if(array_key_exists($spoiler, $SPOILERS))
{
array_push($spoilers, $SPOILERS[$spoiler]);
}
else
{
$otherSpoilers .= $spoiler;
}
}
return $otherSpoilers;
}
/** Inserts one story_spoilers row per recognized spoiler. */
function insertSpoilers($id, $spoilers) {
foreach($spoilers as $spoiler)
{
$record = array(
'story_id' => $id,
'spoiler_code' => $spoiler
);
insertRecord($record, 'story_spoilers', false);
}
}
/** Scans the LINKS keywords for known character names and inserts story_characters rows. */
function insertCharacters($storyId, $keywords)
{
global $CHARS;
$keys = explode(' ', strtolower($keywords));
foreach($keys as $candidate)
{
if(array_key_exists($candidate, $CHARS))
{
$record = array(
'story_id' => $storyId,
'character_code' => $CHARS[$candidate]
);
insertRecord($record, 'story_characters', false);
}
}
}
/** Inserts a pen name, creating the user record first if this vBulletin ID hasn't been seen before. */
function insertPenName($link, $fic)
{
global $userids;
global $penNames;
$vb_id = $fic['vb_id'];
if(array_key_exists($vb_id, $userids))
{
$userId = $userids[$vb_id];
}
else
{
$penName = $penNames[authorPart($link['category'])];
$userId = insertUser($vb_id, $penName);
$userids[$vb_id] = $userId;
}
$record = array(
'user_id' => $userId,
'email' => $link['email'],
'workflow_code' => WORKFLOW_ON_SITE
);
return insertRecord($record, 'pen_names');
}
function insertPenNameVersion($penNameId, $link, $fic)
{
global $penNames;
$penName = $penNames[authorPart($link['category'])];
$record = array(
'pen_name_id' => $penNameId,
'pen_name' => $penName,
'author_comments' => ''
);
return insertRecord($record, 'pen_name_versions', false);
}
function insertPenNameWorkflow($versionId, $penNameId, $link, $fic)
{
// All entries are inserted as having already passed through workflow:
// on site, with null comments and FictionAlleyMods as the approver
$record = array(
'pen_name_version_id' => $versionId,
'pen_name_id' => $penNameId,
'approver_id' => '1',
'time_approved' => 'NULL',
'approver_comments' => '',
'resolution_code' => 'APPR'
);
return insertRecord($record, 'pen_name_workflow', false);
}
/** Inserts a user record tied to the given vBulletin forum account. */
function insertUser($vb_id, $penName) {
$record = array(
'username' => $penName,
'forum_user_id' => $vb_id
);
return insertRecord($record, 'users');
}
/** Final pass: writes versions, workflow, ratings, and genres for every chaptered story. */
function updateStories()
{
global $storyUpdates;
global $migrate_stats;
foreach ($storyUpdates as $update)
{
if(isset($update->id) && $update->id != '')
{
$id = insertStoryVersion($update->id, $update->title, $update->summary);
insertStoryWorkflow($update->id, $id);
updateStory($update);
insertGenres($update->id, $update->mainGenre, $update->secondGenre);
}
}
}
/** Inserts a story_versions row and returns its ID. */
function insertStoryVersion($id, $title, $summary)
{
$record = array(
'story_id' => $id,
'title' => $title,
'summary' => $summary,
'author_comments' => ''
);
return insertRecord($record, 'story_versions');
}
/** Inserts a story_workflow row marking the story as already approved. */
function insertStoryWorkflow($storyId, $storyVersionId)
{
$record = array(
'story_version_id' => $storyVersionId,
'story_id' => $storyId,
'approver_id' => '1',
'time_approved' => 'NULL',
'approver_comments' => '',
'resolution_code' => 'APPR'
);
insertRecord($record, 'story_workflow', false);
}
/** Updates a chaptered story's rating and completion status once all its chapters are in. */
function updateStory($update)
{
global $db;
global $RATINGS;
$record = array(
'rating_code' => array_search($update->rating, $RATINGS),
'is_completed' => $update->completed
);
dbUpdate('stories', $record, "story_id = $update->id");
}
/** Inserts the main genre and, if present, the secondary genre. */
function insertGenres($id, $main, $secondary) {
insertGenre($main, $id, 1);
if(isset($secondary) && $secondary != '') {
insertGenre($secondary, $id, 2);
}
}
/** Inserts a single story_genres row. */
function insertGenre($genreName, $storyId, $rankOrder)
{
global $GENRES;
$record = array(
'story_id' => $storyId,
'genre_name' => $GENRES[$genreName],
'rank_order' => $rankOrder
);
insertRecord($record, 'story_genres', false);
}
/** Inserts a record and returns the generated ID; pass $printDebug = false for quiet tables. */
function insertRecord($record, $tableName, $printDebug = true)
{
$result = dbInsert($tableName, $record);
$id = mysql_insert_id();
if($printDebug) {
// printRecord($record, "Inserted into $tableName");
print EOL . "Inserted record $id into $tableName.";
}
return $id;
}
// Start script
migrate();
printRecord($migrate_stats, EOL . 'Migration complete');
?>
[File: data_structures.inc.php]
require_once("db_utils.inc.php"); // Our Eclipse wrapper
/* Global data structures shared by the migration scripts. */
/** House currently being processed (1-4), and the document root of each house's LINKS install. */
$currentHouse = 1;
$HOUSE_DIRS = array(
);
/** Map from vBulletin user ID to our user_id. */
$userids = array();
/** Map from author category name to the pen name (category display name). */
$penNames = array();
/** Map from author category name to pen_name_id. */
$penNameIds = array();
/** Data for a chaptered story that can only be finalized after all its chapters have been seen. */
class StoryUpdate {
var $title;
var $summary;
var $id;
var $mainGenre;
var $secondGenre;
var $rating = 0; // Numeric rating; updated on each chapter
var $completed;
}
$storyUpdates = array();
/** Map from story category name to story_id. */
$storyIds = array();
?>
[File: fic_parse.inc.php]
require_once("utils.inc.php");
/** File currently being parsed; used by the logging functions below. */
$FicParse_filename = "";
/** Parses a fic file into its component fields; returns false if the file is missing or unparsable. */
function parseFic($filename)
{
global $FicParse_filename;
$FicParse_filename = $filename;
if(!file_exists($filename))
{
logUnfoundFile();
return false;
}
$data =& file_get_contents($filename);
$sections = splitFic($data); // Split & trim into header & text
if($sections == false) {
return false;
}
$retval = array(
"text" => $sections["text"],
"review_thread" => $sections["threadid"],
"post_an" => $sections["footer"]
);
$retval += parseHeader($sections["header"]);
return $retval;
}
/** Splits the fic HTML into header, body text, review thread ID, and footer author notes. */
function splitFic($data)
{
// Set up regexp constants. I use % to delimit the pattern so we don't need
// to escape all the slashes.
$HEADER_START = "(?:|)";
$HEADER_END = "";
$FOOTER_START = "";
// First try; covers fics submitted since the post-AN field went in
$READ_REVIEW = "Read\? \s*\s*"
. "http://www\.fictionalley\.org/\w+/reviews/newreply\.php\?"
. "(?:s=&)?(?:action=newreply&)?threadid=(\d+)\"(?: target=_blank)?>\s*"
. "Review!(?:|)
";
$regexp = "%$HEADER_START(.+)$HEADER_END(.+)$FOOTER_START.*$READ_REVIEW(.*)
%sUi";
if(preg_match($regexp, $data, $matches))
{
return array(
"header" => $matches[1],
"text" => trim($matches[2]),
"threadid" => $matches[3],
"footer" => trim($matches[4])
);
}
// Second try; should cover most fics from introduction of the submit form up to
// the introduction of post-ANs on the form. Since we moved post-A/Ns manually
// then, they might appear either before or after the FOOTER_START delimiter
$FOOTER = "(?:$FOOTER_START\s*$READ_REVIEW|$READ_REVIEW(.+)$FOOTER_START)";
$regexp = "%$HEADER_START(.+)$HEADER_END(.+)$FOOTER%sUi";
if(preg_match($regexp, $data, $matches))
{
return array(
"header" => $matches[1],
"text" => trim($matches[2]),
"threadid" => (count($matches) == 6) ? $matches[4] : $matches[3],
"footer" => (count($matches) == 6) ? trim($matches[5]) : ""
);
}
// Third try; covers a period where the templates were really messed up for some
// reason
$READ_REVIEW2 = "Read\?
"
. "http://www\.fictionalley\.org/\w+/reviews/newreply.php\?"
. "threadid=(\d+)\">Review!";
$FOOTER2 = "$FOOTER_START.*$READ_REVIEW2";
$regexp3 = "%$HEADER_START(.+)$HEADER_END(?:|
)(.+)$FOOTER2%sUi";
if(preg_match($regexp3, $data, $matches))
{
// Replace the end of the header with a break tag, so we can extract the A/N
$header = preg_replace("#\s*~~~~~~~~~~~~~
#", "
", $matches[1]);
return array(
"header" => $header,
"text" => trim($matches[2]),
"threadid" => $matches[3],
"footer" => ""
);
}
// No match found; log and bail out
logUnparsableFile();
return false;
}
/** Parses the individual fields (title, author, rating, summary, etc.) out of the header block. */
function parseHeader($header)
{
$titleLine = extractHeader("Title", $header);
$authorLine = extractHeader("Author name", $header);
// $emailLine = extractHeader("Author email", $header);
$disclaimerLine = extractHeader("DISCLAIMER", $header);
$retval = splitTitleLine($titleLine); // title & chapter
$retval = array_merge($retval, splitAuthorLine($authorLine)); // Author & VB ID
// E-mail comes from LINKs, so this is redundant and error-prone
// $retval["email"] = pruneEmail($emailLine); // Remove HTML
$retval["genre"] = extractHeader("Category", $header);
$retval["subgenre"] = extractHeader("Sub Category", $header);
$retval["keywords"] = extractHeader("Keywords", $header);
$retval["rating"] = extractHeader("Rating", $header);
$retval["spoilers"] = extractHeader("Spoilers", $header);
$retval["summary"] = extractHeader("Summary", $header);
$retval["disclaimer"] = pruneDisclaimer($disclaimerLine); // Remove standard disclaimer
$retval["pre_an"] = extractHeader("Author notes", $header);
return $retval;
}
/** Extracts the value of a single "Keyword:" header line; returns "" (and logs) if it's missing. */
function extractHeader($keyword, $data)
{
if(!preg_match("/$keyword:<\/b>\s(.+?)
/is", $data, $matches)) {
logUnfoundHeader($keyword);
return "";
}
return $matches[1];
}
/** Splits the title line into story title and chapter number. */
function splitTitleLine($line)
{
$regexp = "/(.+) (?:\((?i:ch.+)?(\d+).*\)|pro\w*)?/";
if(!preg_match($regexp, $line, $matches)) {
logUnsplittableTitle(); // Shouldn't happen
$matches[1] = "";
$matches[2] = 0;
}
// Couldn't match chapter...either it's a prologue or combined chapter or nonstandard
if(count($matches) == 2) {
// logMissingChapter();
$matches[2] = 0;
}
return array(
"title" => trim($matches[1]),
"chapter" => $matches[2]
);
}
/** Splits the author line into vBulletin user ID and author name. */
function splitAuthorLine($line)
{
$regexp = "/.+userid=(\d+)\"(?: target=\w+)?>(.+)<\/a>/is";
if(!preg_match($regexp, $line, $matches)) {
logUnsplittableVbId(); // Shouldn't happen, unless we've changed the format
$matches[1] = "";
$matches[2] = "";
}
return array(
"vb_id" => $matches[1],
"author" => preg_replace("/\s+/", " ", $matches[2])
);
}
/** Extracts the bare e-mail address from a mailto: link. (Unused; e-mail now comes from LINKS.) */
function pruneEmail($line)
{
$regexp = "/.+mailto:(.+)\">/";
if(!preg_match($regexp, $line, $matches)) {
return "";
}
return $matches[1];
}
/** Strips the standard disclaimer text, keeping only whatever the author added after it. */
function pruneDisclaimer($line)
{
$regexp = "/infringement is intended\.(.+)/";
if(!preg_match($regexp, $line, $matches)) {
return "";
}
return trim($matches[1]);
}
/** Logs a file whose overall structure couldn't be parsed. */
function logUnparsableFile()
{
saveUnparsable("Unparsable file");
}
/** Logs a title line that couldn't be split into title and chapter. */
function logUnsplittableTitle()
{
saveUnparsable("Unsplittable title line");
}
/** Logs an author line that didn't yield a vBulletin ID. */
function logUnsplittableVbId()
{
saveUnparsable("Couldn't extract VB ID");
}
/** Logs a missing header field, ignoring the ones that are routinely absent. */
function logUnfoundHeader($header)
{
// Author notes, sub-categories, and spoilers are routinely absent, so ignore them
if($header == "Author notes" or $header == "Sub Category" or $header == "Spoilers")
{
return;
}
saveUnparsable("Couldn't find header $header");
}
/** Logs a title line with no chapter number. */
function logMissingChapter()
{
saveUnparsable("Couldn't find chapter number");
}
/** Logs a fic file that doesn't exist on disk. */
function logUnfoundFile()
{
saveUnparsable("File doesn't exist");
}
/** Converts the local filename back into the corresponding site URL. */
function filenameToHref()
{
global $FicParse_filename;
$url = preg_replace("#D:/FA/#", "http://www.", $FicParse_filename);
return "$url";
}
/** Records an unparsable file in unparsables.html and echoes the reason. */
function saveUnparsable($message)
{
$link = filenameToHref();
$output = "";
echo "$message!" . EOL;
file_put_contents("unparsables.html", "$message: $link
\r\n", "a");
}
?>
[File: links_parse.inc.php]
require_once("utils.inc.php");
/** Reads a LINKS .db file and returns one parsed record per line, using the given parser function. */
function &parseLinksDb($file_name, $parser) {
$records = array(); // Store parsed DB here
$data = file($file_name);
if($data === false) {
error("Couldn't load database file $file_name");
return;
}
foreach ($data as $line) {
$record = $parser($line);
array_push($records, $record);
}
return $records;
}
/** Parses a single record from a links database. */
function parseLink($line) {
$fields = splitRecord($line);
$record = array(
"id" => $fields[0],
"title" => $fields[1],
"url" => $fields[2],
"date" => $fields[3],
"category" => $fields[4],
"desc" => $fields[5],
"contact_name" => $fields[6],
"email" => $fields[7],
"hits" => $fields[8],
"is_new" => $fields[9],
"is_popular" => $fields[10],
"rating" => $fields[11],
"votes" => $fields[12],
"receivemail" => $fields[13],
"keywords" => $fields[14]
);
return $record;
}
/** Parses a single record from a category database. */
function parseCat($line) {
$fields = splitRecord($line);
$record = array(
"id" => $fields[0],
"name" => $fields[1],
"desc" => $fields[2],
"related" => $fields[3],
"metadesc" => $fields[4],
"metakey" => $fields[5],
"header" => $fields[6],
"footer" => $fields[7],
"nonenglish" => $fields[8]
);
return $record;
}
/** Splits a record into fields, using the LINKS separator | */
function splitRecord($line) {
return explode("|", trim($line));
}
?>
[File: migrate_helpers.inc.php]
require_once("links_parse.inc.php");
require_once("data_structures.inc.php");
/** Reads categories.db, recording pen names for author categories and StoryUpdate entries for story categories. */
function handleCategories($filename)
{
GLOBAL $penNames;
GLOBAL $storyUpdates;
$categories =& parseLinksDb($filename, "parseCat");
foreach ($categories as $category) {
if(isAuthor($category["name"])) {
$penNames[$category["name"]] =& $category["nonenglish"];
} else {
$storyUpdate =& new StoryUpdate();
$storyUpdate->title =& extractTitle($category["nonenglish"]);
$storyUpdate->completed =& isCompleted($category["desc"]);
$storyUpdate->summary =& stripCompleted($category["desc"]);
$storyUpdate->rating = 1;
$storyUpdates[$category["name"]] =& $storyUpdate;
}
}
}
/** Authors are top-level categories; story categories contain a "/". */
function isAuthor($categoryName)
{
return strrpos($categoryName, "/") === FALSE;
}
function extractTitle($categoryName)
{
$parts = explode(" by ", $categoryName);
return $parts[0];
}
/** Maps a fic URL onto its path under the house's local document root. */
function followLink($url, $baseDir)
{
$parsed = parse_url($url);
return $baseDir . $parsed["path"];
}
/** Returns the author directory component of a fic URL. */
function authorDir($url)
{
$parsed = parse_url($url);
$path = explode("/", $parsed["path"]);
return $path[2];
}
/** Returns the author portion (before the "/") of a category name. */
function authorPart($categoryName)
{
$parts = explode("/", $categoryName);
return $parts[0];
}
/** Removes the "[completed]" marker from a summary. */
function stripCompleted($summary)
{
return trim(preg_replace("/\[completed?\]/i", "", $summary));
}
/** True if the summary carries a "[completed]" marker. */
function isCompleted($summary)
{
return preg_match("/\[completed?\]/i", $summary);
}
/** Converts a LINKS date (e.g. "17-Jul-2003") into MySQL format ("2003-07-17"). */
function convertDate($linksFormat)
{
$MONTHS = array(
"Jan" => "01",
"Feb" => "02",
"Mar" => "03",
"Apr" => "04",
"May" => "05",
"Jun" => "06",
"Jul" => "07",
"Aug" => "08",
"Sep" => "09",
"Oct" => "10",
"Nov" => "11",
"Dec" => "12"
);
$parts = explode("-", $linksFormat);
$day = $parts[0];
$month = $MONTHS[$parts[1]];
$year = $parts[2];
return "$year-$month-$day";
}
/** Extracts the two-digit chapter number from a fic URL; 0 means a prologue. */
function extractChapterNumber($url)
{
if(!preg_match("/.+(\d\d)\.html/", $url, $matches)) {
return "0"; // Prologue
}
return $matches[1];
}
/** A chapter is autonumbered when its LINKS title is just "<story title> <chapter number>". */
function isAutonumbered($linkTitle, $title, $chapter)
{
return $linkTitle == "$title $chapter";
}
function wordcount($text)
{
// Use same criteria as wc
preg_match_all("/\S+/", $text, $result);
return count($result[0]);
}
function printRecord($record, $message)
{
print "$message:" . EOL;
print_r($record);
print EOL . EOL;
}
/** Story auto_increment simulation */
$currentStoryId = 0;
function storiesAutoIncrement()
{
global $currentStoryId;
$currentStoryId++;
return $currentStoryId;
}
/** TODO: save stories_seq to database */
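/* One possible implementation of the TODO above. A sketch only: it assumes the dbUpdate()
   helper from db_utils.inc.php (not shown here) and that stories_seq has been seeded with
   a single row. It could be called once at the end of migrate(). */
function saveStoriesSeq()
{
    global $currentStoryId;
    // next_id is the next free story ID, i.e. one past the last ID handed out.
    dbUpdate('stories_seq', array('next_id' => $currentStoryId + 1), 'next_id >= 0');
}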
?>
Due to the size of the listing, comments have been moved to FictionPublishingExampleDiscussion.