
Borked line breaks

Development and Programming

LHDonline
18 posts
16 years ago

Ever have that sinking feeling, like when you discover that the 15K entries imported via the MT import utility have no br or p tags anywhere? There are line breaks in the HTML, but that’s it.

I found Matthew Mullenweg’s autop script that’s used in WordPress, and I have the following:

<?php
/**
 * Replaces double line-breaks with paragraph elements.
 *
 * A group of regex replaces used to identify text formatted with newlines and
 * replace double line-breaks with HTML paragraph tags. The remaining
 * line-breaks after conversion become <br /> tags, unless $br is set to '0'
 * or 'false'.
 *
 * @since 0.71
 *
 * @param string $pee The text which has to be formatted.
 * @param int|bool $br Optional. If set, this will convert all remaining line-breaks after paragraphing. Default true.
 * @return string Text which has been converted into correct paragraph tags.
 */
function wpautop($pee, $br = 1) {
    $pee = $pee . "\n"; // just to make things a little easier, pad the end
    $pee = preg_replace('|<br />\s*<br />|', "\n\n", $pee);
    // Space things out a little
    $allblocks = '(?:table|thead|tfoot|caption|colgroup|tbody|tr|td|th|div|dl|dd|dt|ul|ol|li|pre|select|form|map|area|blockquote|address|math|style|input|p|h[1-6]|hr)';
    $pee = preg_replace('!(<' . $allblocks . '[^>]*>)!', "\n$1", $pee);
    $pee = preg_replace('!(</' . $allblocks . '>)!', "$1\n\n", $pee);
    $pee = str_replace(array("\r\n", "\r"), "\n", $pee); // cross-platform newlines
    if ( strpos($pee, '<object') !== false ) {
        $pee = preg_replace('|\s*<param([^>]*)>\s*|', "<param$1>", $pee); // no pee inside object/embed
        $pee = preg_replace('|\s*</embed>\s*|', '</embed>', $pee);
    }
    $pee = preg_replace("/\n\n+/", "\n\n", $pee); // take care of duplicates
    // make paragraphs, including one at the end
    $pees = preg_split('/\n\s*\n/', $pee, -1, PREG_SPLIT_NO_EMPTY);
    $pee = '';
    foreach ( $pees as $tinkle )
        $pee .= '<p>' . trim($tinkle, "\n") . "</p>\n";
    $pee = preg_replace('|<p>\s*?</p>|', '', $pee); // under certain strange conditions it could create a P of entirely whitespace
    $pee = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $pee);
    $pee = preg_replace( '|<p>|', "$1<p>", $pee );
    $pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee); // don't pee all over a tag
    $pee = preg_replace("|<p>(<li.+?)</p>|", "$1", $pee); // problem with nested lists
    $pee = preg_replace('|<p><blockquote([^>]*)>|i', "<blockquote$1><p>", $pee);
    $pee = str_replace('</blockquote></p>', '</p></blockquote>', $pee);
    $pee = preg_replace('!<p>\s*(</?' . $allblocks . '[^>]*>)!', "$1", $pee);
    $pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*</p>!', "$1", $pee);
    if ($br) {
        $pee = preg_replace_callback('/<(script|style).*?<\/\\1>/s', create_function('$matches', 'return str_replace("\n", "<WPPreserveNewline />", $matches[0]);'), $pee);
        $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); // optionally make line breaks
        $pee = str_replace('<WPPreserveNewline />', "\n", $pee);
    }
    $pee = preg_replace('!(</?' . $allblocks . '[^>]*>)\s*<br />!', "$1", $pee);
    $pee = preg_replace('!<br />(\s*</?(?:p|li|div|dl|dd|dt|th|pre|td|ul|ol)[^>]*>)!', '$1', $pee);
    if (strpos($pee, '<pre') !== false)
        $pee = preg_replace_callback('!(<pre.*?>)(.*?)</pre>!is', 'clean_pre', $pee ); // clean_pre() is a WordPress function
    $pee = preg_replace( "|\n</p>$|", '</p>', $pee );
    // get_shortcode_regex() is also a WordPress function, so this line fails outside WordPress
    $pee = preg_replace('/<p>\s*?(' . get_shortcode_regex() . ')\s*<\/p>/s', '$1', $pee); // don't auto-p wrap shortcodes that stand alone

    return $pee;
}

?>

…but I have no clue what to do with it–this is way out of my wheelhouse. Is there an existing EE plugin that would do something like this? Everything I saw seemed to go in the opposite direction (stripping/replacing the br or p tags).

       
silenz
1,648 posts
16 years ago

Setting the text formatting to XHTML for the fields in question is not an option?

       
LHDonline
18 posts
16 years ago

We’re currently using LG TinyMCE for that field, but I don’t think that would help. What I have is this plain text HTML:

(UNDATED) Sarah Vowell loves the Puritans, who left England in search of religious freedom and then condemned and expelled those who didn't believe exactly as they did.
``What can I say?'' Vowell said. ``I love a contradiction. Massachusetts was supposed to be this community of like-minded individuals, but basically it was a totalitarian community like the Soviet Union.''
The man who embodied those contradictions was John Winthrop (1588-1649), the governor of the Massachusetts Bay Company. Winthrop is best known today for his sermon ``A Model of Christian Charity,'' which includes the phrase ``city upon a hill'' and has been referenced in speeches by John F. Kennedy, Ronald Reagan and other politicians. Winthrop had a lot more to say about community and charity that appealed to Vowell, who made Winthrop the focus of her new book, ``The Wordy Shipmates.''

Which renders in a blob like this on the page:

(UNDATED) Sarah Vowell loves the Puritans, who left England in search of religious freedom and then condemned and expelled those who didn’t believe exactly as they did. “What can I say?” Vowell said. “I love a contradiction. Massachusetts was supposed to be this community of like-minded individuals, but basically it was a totalitarian community like the Soviet Union.” The man who embodied those contradictions was John Winthrop (1588-1649), the governor of the Massachusetts Bay Company. Winthrop is best known today for his sermon “A Model of Christian Charity,” which includes the phrase “city upon a hill” and has been referenced in speeches by John F. Kennedy, Ronald Reagan and other politicians. Winthrop had a lot more to say about community and charity that appealed to Vowell, who made Winthrop the focus of her new book, “The Wordy Shipmates.”

The other problem is that some of the entries are OK, and some are like this. The thing about this autop script is that it’s supposed to ignore the ones that are OK, and fix the bad ones (if I read the docs right). I just don’t have a clue how to go about either making this some kind of plugin or just using it as a straight up PHP script.
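
For entries like the sample above, where every line is its own paragraph, a much smaller function than the full autop script may be enough. The following is only a minimal sketch under stated assumptions: `simple_autop` is a hypothetical name (it is not part of WordPress or ExpressionEngine), it assumes each non-empty line in a broken entry is one paragraph, and it treats any entry that already contains a p or br tag as "OK" and returns it untouched.

```php
<?php
// Hypothetical helper, not part of WordPress or ExpressionEngine.
// Assumption: in the broken entries, every non-empty line is one paragraph.
function simple_autop($text)
{
    // Entries that already contain <p> or <br> tags are left alone.
    if (preg_match('/<(p|br)\b/i', $text)) {
        return $text;
    }

    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(array("\r\n", "\r"), "\n", $text);

    $paragraphs = array();
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $paragraphs[] = '<p>' . $line . '</p>';
        }
    }

    return implode("\n", $paragraphs);
}

echo simple_autop("First paragraph.\nSecond paragraph.\n");
```

Unlike the full autop script, this does not try to handle block-level tags, lists, or blockquotes, so it would only suit entries that are plain text with line breaks.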

       
Robin Sowell
13,160 posts
16 years ago

xhtml really should do the same thing- it doesn’t alter the data, but it alters the display. I’d give that a try real quick and see what happens. Just change the field formatting for the relevant fields- in ‘Admin- Weblog Admin- Custom Fields’. When you change to xhtml you can choose whether to apply it retroactively to all entries. As long as the only entries are the imported ones, won’t hurt to do it. Can always retroactively change them to something else.

       
LHDonline
18 posts
16 years ago

I don’t have the option to permanently switch the field to XHTML, though. This weblog will continue to have entries added/edited, and the client requires WYSIWYG for that.

       
Robin Sowell
13,160 posts
16 years ago

Hrm- are you horribly opposed to re-importing the data? Seems to me, that would be a fairly easy way to do it- IF it was simple the first go round.

We could call a function where it defines the fields before importing - a la

// BODY
                preg_match("/BODY:(.*)/", $sections[$i], $meta_info);
                if (isset($meta_info['1']))
                {
                     $body[$id] = trim($meta_info['1']);
                     continue;
                }

There are other ways to go at it; that just seems like it might be easy. In addition to the trim, we’d run it through a transformation (could likely use EE’s typography class very easily).

       
LHDonline
18 posts
16 years ago

No-can-do on the reimport. It was a bear to start with (20K items), plus we had to have some custom database work done to create FoxyCart downloadable codes for the entries.

       
Robin Sowell
13,160 posts
16 years ago

Hrm- you’ll need someone a bit familiar with php/mysql to polish this off, but it worked for me in a quick test. The function above threw an error for me, and since I’m more familiar w/EE’s typography class, I used that. You might change the settings I used, though. And you’ll need to do this in batches- so like, a limit of 100, then a limit 100 200- in other words, I wouldn’t do more than a couple hundred at a go.

You can just create a blank template, turn php parsing on. What it does- starts by querying the exp_weblog_data table. My simple test is just grabbing all the entries there. That could be refined. From that table, I grabbed 2 custom fields- field_id_1 and field_id_2- you’ll want to change to reflect the fields you need.

So- I get the data from the db- run it through EE’s typography class, formatting as xhtml, then update the db. It worked on my really simple test.

But before you go manipulating the db, make sure you have backups and are comfortable rolling back. And you’ll need a bit of familiarity w/both php and mysql. But it’s pretty simple overall.

<?php

global $DB;

// Load EE's Typography class if it isn't loaded yet; adjust the path to your own system folder.
if ( ! class_exists('Typography'))
{
    require '/Applications/MAMP/htdocs/system/core/core.typography.php';
}

$TYPE = new Typography;

// Formatting preferences passed to parse_type() for every field.
$prefs = array(
    'text_format'   => 'xhtml',
    'html_format'   => 'all',
    'auto_links'    => 'n',
    'allow_img_url' => 'y'
);

$i = 0;

// Test query: one entry only. Change the field ids and the LIMIT to suit.
$query = $DB->query("SELECT entry_id, field_id_1, field_id_2 FROM exp_weblog_data WHERE field_id_1 != '' LIMIT 1");

if ($query->num_rows > 0)
{
    foreach ($query->result as $row)
    {
        // Run each field through the typography class, formatting as xhtml.
        $data = array(
            'field_id_1' => $TYPE->parse_type($row['field_id_1'], $prefs),
            'field_id_2' => $TYPE->parse_type($row['field_id_2'], $prefs)
        );

        // Write the formatted text back to the same entry.
        $sql = $DB->update_string('exp_weblog_data', $data, 'entry_id = "'.$DB->escape_str($row['entry_id']).'"');
        $DB->query($sql);
        $i++;
    }
}

echo 'Done: '.$i;
?>

This is to get you started- so not final code. But should be fairly easy to tweak. You don’t have to use the typography class- could use whatever function you want. This was just easier for me.

Make sense?
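
The batching mentioned above (a couple hundred entries at a go) could be sketched roughly like this. It is only an illustration: `batch_limit_clauses` is a hypothetical helper name, the total of 250 rows is made up for the demo, and the `ORDER BY entry_id` is an assumption added so that each batch covers a stable, non-overlapping slice of the table. The snippet only prints the queries; it does not touch the database.

```php
<?php
// Hypothetical helper: builds "LIMIT offset, count" clauses for batched runs.
function batch_limit_clauses($total_rows, $batch_size = 100)
{
    $clauses = array();
    for ($offset = 0; $offset < $total_rows; $offset += $batch_size) {
        $clauses[] = 'LIMIT ' . (int) $offset . ', ' . (int) $batch_size;
    }
    return $clauses;
}

// Demo with a made-up total of 250 rows: prints three SELECT statements,
// one per batch, which could each be run in a separate page load.
foreach (batch_limit_clauses(250, 100) as $clause) {
    echo "SELECT entry_id, field_id_1, field_id_2 FROM exp_weblog_data "
        . "WHERE field_id_1 != '' ORDER BY entry_id " . $clause . "\n";
}
```

Keeping a note of which offsets have already been run helps avoid processing the same batch twice.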

       


Packet Tide owns and develops ExpressionEngine. © Packet Tide, All Rights Reserved.