Parsing With PHP [Part 2]

This tutorial continues from Part 1, where I covered two methods of parsing with PHP. This time I tried out a script from “PHP Programming Solutions” that counts the words in a string. The idea is to use a pattern to break the string into individual words, and then count how many words there are:

<?php
// define string
$text = "The film begins with the Joker robbing a mob-owned bank with several other accomplices, whom he tricks into killing each other. That night, Batman impersonators attempt to interrupt a meeting of various mobsters and the Scarecrow. The real Batman appears, but suffers injuries which lead him to re-design the batsuit. Batman and Lieutenant James Gordon contemplate including new district attorney Harvey Dent in their plan to eradicate the mob, as he could be the public hero Batman cannot be. Bruce discovers that Harvey Dent is dating Rachel Dawes. Mob bosses meet to discuss Batman, Gordon, and Dent, while a Chinese mobster accountant, Lau, informs the gang leaders he has hidden their money to preempt a plan that Gordon has hatched to seize the mobsters' loot. The Joker arrives unexpectedly, offering to kill Batman for half their money.";
// decompose the string into an array of "words"
$words = preg_split('/[^0-9A-Za-z\']+/', $text, -1, PREG_SPLIT_NO_EMPTY);
// count number of words (elements) in array
// result: "139 words"
echo count($words) . " words";
?>

The preg_split() function is probably one of PHP’s most underappreciated functions. It accepts a Perl-compatible regular expression and a subject string, splits the subject wherever the pattern matches, and returns an array of the pieces between the matches. It’s a great way of breaking a string apart and placing the pieces in a separate array for further processing. Read more about the function and its arguments at http://www.php.net/preg_split. In this listing, the regular expression [^0-9A-Za-z\']+ matches any run of characters that are not digits, letters, or apostrophes, that is, the spaces and punctuation between words. The limit of -1 means “no limit”, and PREG_SPLIT_NO_EMPTY drops empty pieces, such as the one that would otherwise follow the final period. The pieces that remain, the words, are stored in the $words array. Counting the number of words in the string is then simply a matter of obtaining the size of the $words array.
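To make the splitting behaviour concrete, here is a minimal sketch on a short string. The $sample text and the variable names are my own illustration and are not taken from the book. It runs the same pattern with and without PREG_SPLIT_NO_EMPTY:

```php
<?php
// Short sample string (illustrative only, not from the original listing)
$sample = "Hello, world! It's a mob-owned bank.";

// Same pattern as above: split on runs of characters that are
// NOT digits, letters, or apostrophes
$withFlag = preg_split('/[^0-9A-Za-z\']+/', $sample, -1, PREG_SPLIT_NO_EMPTY);

// Without the flag, the final "." leaves an empty string at the end
$withoutFlag = preg_split('/[^0-9A-Za-z\']+/', $sample);

// Pieces: Hello, world, It's, a, mob, owned, bank
print_r($withFlag);

// prints "7 vs 8"
echo count($withFlag) . " vs " . count($withoutFlag) . "\n";
```

Note that “It's” stays in one piece because the apostrophe is excluded from the separator class, while “mob-owned” is cut at the hyphen.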

An alternative is to use PHP’s built-in str_word_count() function to perform this task. Here’s an example:

<?php
// define string
$text = "The film begins with the Joker robbing a mob-owned bank with several other accomplices, whom he tricks into killing each other. That night, Batman impersonators attempt to interrupt a meeting of various mobsters and the Scarecrow. The real Batman appears, but suffers injuries which lead him to re-design the batsuit. Batman and Lieutenant James Gordon contemplate including new district attorney Harvey Dent in their plan to eradicate the mob, as he could be the public hero Batman cannot be. Bruce discovers that Harvey Dent is dating Rachel Dawes. Mob bosses meet to discuss Batman, Gordon, and Dent, while a Chinese mobster accountant, Lau, informs the gang leaders he has hidden their money to preempt a plan that Gordon has hatched to seize the mobsters' loot. The Joker arrives unexpectedly, offering to kill Batman for half their money.";
 
// count number of words
// result: "137 words"
$numWords = str_word_count($text);
echo $numWords . " words";
?>

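Why do the two give different results? Good question. The pattern passed to preg_split() does not include “-” among the word characters, so a hyphen acts as a separator and a hyphenated word is split in two. In this text, “mob-owned” and “re-design” are each counted as two words. str_word_count(), on the other hand, treats a hyphen inside a word as part of that word, so “mob-owned” and “re-design” are each counted as one word. That accounts for the difference of two (139 versus 137).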
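The difference can be checked directly. The following is a rough sketch on a shorter sentence of my own (the $sample text and variable names are illustrative assumptions). It asks str_word_count() for the list of words by passing 1 as the second argument, compares that list with the preg_split() result, and then tries a variant of the pattern with “-” added to the character class:

```php
<?php
// Short sample sentence (illustrative only)
$sample = "The Joker robs a mob-owned bank and Batman must re-design the batsuit.";

// Format 1 makes str_word_count() return the words instead of the count
$builtin = str_word_count($sample, 1);

// Original pattern: the hyphen is a separator
$split = preg_split('/[^0-9A-Za-z\']+/', $sample, -1, PREG_SPLIT_NO_EMPTY);

// Adjusted pattern: hyphen added to the class, so it stays inside words
$splitHyphen = preg_split('/[^0-9A-Za-z\'-]+/', $sample, -1, PREG_SPLIT_NO_EMPTY);

echo count($builtin) . "\n";     // prints 12
echo count($split) . "\n";       // prints 14
echo count($splitHyphen) . "\n"; // prints 12

// Words that str_word_count() found but preg_split() broke apart:
// mob-owned, re-design
print_r(array_values(array_diff($builtin, $split)));
```

With the hyphen added to the class, both approaches agree on this sample. They can still disagree on other input: the pattern treats digits as word characters, while str_word_count() by default does not.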

Reference:
- Vaswani, Vikram, 2007, “PHP Programming Solutions”, McGraw-Hill


